AI with Python – Data Preparation

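We have already studied supervised as well as unsupervised machine learning algorithms. These algorithms require formatted data to start the training process. We must prepare or format the data in a certain way so that it can be supplied as input to ML algorithms.

This chapter focuses on data preparation for machine learning algorithms.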

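Preprocessing the Data

In our daily life we deal with lots of data, but this data is in raw form. To provide it as input to machine learning algorithms, we need to convert it into meaningful data. That is where data preprocessing comes in. In other words, the data must be preprocessed before it is given to a machine learning algorithm.

Data Preprocessing Steps

Follow these steps to preprocess the data in Python −

Step 1 − Importing the useful packages: If we are using Python, this is the first step in converting the data into a certain format, i.e., preprocessing. It can be done as follows −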

import numpy as np
from sklearn import preprocessing

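Here we have used the following two packages −

NumPy − NumPy is a general-purpose array-processing package designed to efficiently manipulate large multi-dimensional arrays of arbitrary records without sacrificing too much speed for small multi-dimensional arrays.
Sklearn.preprocessing − This package provides many common utility functions and transformer classes that change raw feature vectors into a representation better suited to machine learning algorithms.

Step 2 − Defining sample data: After importing the packages, we need to define some sample data so that we can apply preprocessing techniques to it. We will now define the following sample data −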

# The rows must be wrapped in an outer list to form a 2-D array
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

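Step 3 − Applying a preprocessing technique: In this step, we need to apply one of the preprocessing techniques.

The following section describes the data preprocessing techniques.

Techniques for Data Preprocessing

The techniques for data preprocessing are described below −

Binarization

This is the preprocessing technique used when we need to convert numerical values into Boolean values. We can use a built-in method to binarize the input data, using 0.5 as the threshold value, as follows −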

data_binarized = preprocessing.Binarizer(threshold = 0.5).transform(input_data)
print("\nBinarized data:\n", data_binarized)

Running the above code produces the following output. All values above 0.5 (the threshold) are converted to 1, and all values at or below 0.5 are converted to 0.

Binarized data:
[[1. 0. 1.]
 [0. 1. 1.]
 [0. 0. 1.]
 [1. 1. 0.]]
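As a cross-check, the same thresholding can be written directly in NumPy. This is a minimal sketch that assumes the sample `input_data` array from Step 2; the names `threshold` and `manual_binarized` are introduced here for illustration and are not part of the original example. Values strictly greater than the threshold become 1, and all other values, including those equal to the threshold, become 0.

```python
import numpy as np

# Sample data from Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

threshold = 0.5

# Strictly greater than the threshold -> 1.0, everything else -> 0.0
manual_binarized = (input_data > threshold).astype(float)
print(manual_binarized)
```

Note that the value 0.5 in the third row maps to 0, because it does not exceed the threshold.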

==== Mean Removal

Mean removal is another very common preprocessing technique in machine learning. It subtracts the mean from each feature so that every feature is centered on zero, which removes bias from the features in the feature vector. To apply mean removal to the sample data, we can write the Python code shown below. The code first displays the mean and standard deviation of the input data −

[source,result,notranslate]

print("Mean =", input_data.mean(axis = 0))
print("Std deviation =", input_data.std(axis = 0))

We will get the following output after running the above lines of code −

[source,result,notranslate]

Mean = [ 1.75 -1.275 2.2 ]
Std deviation = [ 2.71431391 4.20022321 4.69414529]

Now, the code below standardizes the input data: it subtracts each feature's mean and divides by that feature's standard deviation −

[source,prettyprint,notranslate]

data_scaled = preprocessing.scale(input_data)
print("Mean =", data_scaled.mean(axis = 0))
print("Std deviation =", data_scaled.std(axis = 0))

We will get the following output after running the above lines of code −

[source,result,notranslate]

Mean = [ 1.11022302e-16 0.00000000e+00 0.00000000e+00]
Std deviation = [ 1. 1. 1.]
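What `preprocessing.scale` computes can also be sketched by hand: subtract each column's mean, then divide by that column's standard deviation. The snippet below is an illustrative check, assuming the sample `input_data` from Step 2 and the function's default settings (column-wise, population standard deviation); `manual_scaled` is a name introduced here.

```python
import numpy as np
from sklearn import preprocessing

# Sample data from Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Column-wise standardization: (x - mean) / std for each feature (axis = 0)
manual_scaled = (input_data - input_data.mean(axis=0)) / input_data.std(axis=0)

# Compare with scikit-learn's result, allowing for floating-point rounding
print(np.allclose(manual_scaled, preprocessing.scale(input_data)))
```

Tiny nonzero means such as 1.11022302e-16 are floating-point rounding error, not a sign that centering failed.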

==== Scaling

Scaling is another data preprocessing technique, applied to the feature vectors. It is needed because the values of different features can span very different ranges. In other words, we do not want any feature to be artificially large or small. With the help of the following Python code, we can scale our input data, i.e., the feature vectors −

[source,prettyprint,notranslate]

# Min max scaling
data_scaler_minmax = preprocessing.MinMaxScaler(feature_range = (0, 1))
data_scaled_minmax = data_scaler_minmax.fit_transform(input_data)
print("\nMin max scaled data:\n", data_scaled_minmax)

We will get the following output after running the above lines of code −

*Min max scaled data*

[source,prettyprint,notranslate]

[[ 0.48648649 0.58252427 0.99122807]
 [ 0. 1. 0.81578947]
 [ 0.27027027 0. 1. ]
 [ 1. 0.99029126 0. ]]
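The min-max formula behind this output can be written out by hand as `(x - min) / (max - min)` per column. The following is a minimal sketch for the `(0, 1)` range only, assuming the sample `input_data` from Step 2; `col_min`, `col_max`, and `manual_minmax` are names introduced here for illustration. It also assumes no column is constant, since that would make the denominator zero.

```python
import numpy as np

# Sample data from Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# Per-feature (column-wise) minimum and maximum
col_min = input_data.min(axis=0)
col_max = input_data.max(axis=0)

# Rescale every column so that its minimum maps to 0 and its maximum to 1
manual_minmax = (input_data - col_min) / (col_max - col_min)
print(manual_minmax)
```

For example, the first value is (2.1 - (-1.5)) / (5.9 - (-1.5)) = 3.6 / 7.4, which matches the 0.48648649 shown above.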

==== Normalization

Normalization is another data preprocessing technique used to modify the feature vectors. This kind of modification is necessary to measure the feature vectors on a common scale. The following are two types of normalization that can be used in machine learning −

*L1 Normalization*

It is also referred to as *Least Absolute Deviations*. This kind of normalization modifies the values so that the absolute values in each row sum to 1. It can be implemented on the input data with the help of the following Python code −

[source,result,notranslate]

# Normalize data
data_normalized_l1 = preprocessing.normalize(input_data, norm = 'l1')
print("\nL1 normalized data:\n", data_normalized_l1)

The above code generates the following output −

[source,result,notranslate]

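L1 normalized data:
[[ 0.22105263 -0.2 0.57894737]
 [-0.2027027 0.32432432 0.47297297]
 [ 0.03571429 -0.56428571 0.4 ]
 [ 0.42142857 0.16428571 -0.41428571]]

*L2 Normalization*

It is also referred to as *least squares*. This kind of normalization modifies the values so that the squares of the values in each row sum to 1. It can be implemented on the input data with the help of the following Python code −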

# Normalize data
data_normalized_l2 = preprocessing.normalize(input_data, norm = 'l2')
print("\nL2 normalized data:\n", data_normalized_l2)

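The above code prints the L2-normalized array under the header "L2 normalized data:". In that array, the squares of the values in each row sum to 1.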
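Both normalizations can be reproduced by hand, which makes the "sums to 1" property easy to verify. This is a sketch under stated assumptions: it uses the sample `input_data` from Step 2, the names `manual_l1` and `manual_l2` are introduced here, and it assumes no row is all zeros (a zero row would cause a division by zero in this simplified version).

```python
import numpy as np

# Sample data from Step 2
input_data = np.array([[2.1, -1.9, 5.5],
                       [-1.5, 2.4, 3.5],
                       [0.5, -7.9, 5.6],
                       [5.9, 2.3, -5.8]])

# L1: divide each row by the sum of its absolute values
l1_norms = np.abs(input_data).sum(axis=1, keepdims=True)
manual_l1 = input_data / l1_norms

# L2: divide each row by its Euclidean length
l2_norms = np.sqrt((input_data ** 2).sum(axis=1, keepdims=True))
manual_l2 = input_data / l2_norms

# Row-wise checks: both should be 1 for every row, up to rounding
print(np.abs(manual_l1).sum(axis=1))
print((manual_l2 ** 2).sum(axis=1))
```

Unlike scaling and mean removal, which work column by column, normalization here works row by row, so each sample is rescaled independently of the others.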
=== Labeling the Data

We already know that machine learning algorithms need data in a certain format. Another important requirement is that the data must be labeled properly before it is passed to a machine learning algorithm. For example, in classification the data carries many labels, and those labels may be words, numbers, etc. The machine learning functions in *sklearn* expect numeric labels. Hence, if the labels are in any other form, they must be converted to numbers. This process of transforming word labels into numerical form is called label encoding.
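The idea can be sketched in plain Python, without any library. The function name `encode_labels` and the sample labels below are hypothetical and serve only as an illustration: each distinct word label is assigned an integer according to its position in the sorted list of distinct labels.

```python
def encode_labels(labels):
    """Map each distinct label to an integer code, using sorted order."""
    classes = sorted(set(labels))
    code_of = {label: code for code, label in enumerate(classes)}
    return [code_of[label] for label in labels], classes

# Hypothetical word labels, not taken from the tutorial's data
sample_labels = ['dog', 'cat', 'dog', 'bird']
codes, classes = encode_labels(sample_labels)

print(codes)    # [2, 1, 2, 0]
print(classes)  # ['bird', 'cat', 'dog']
```

Keeping the `classes` list makes the mapping reversible: the label for a code is simply `classes[code]`.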

==== Label encoding steps

Follow these steps for encoding the data labels in Python −

*Step 1 − Importing the useful packages*

If we are using Python, this is the first step in converting the data into a certain format, i.e., preprocessing. It can be done as follows −

[source,prettyprint,notranslate]

import numpy as np
from sklearn import preprocessing

*Step 2 − Defining sample labels*

After importing the packages, we need to define some sample labels so that we can create and train the label encoder. We will now define the following sample labels −

[source,prettyprint,notranslate]

# Sample input labels
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

*Step 3 − Creating & training of label encoder object*

In this step, we need to create the label encoder and train it. The following Python code will help in doing this −

[source,result,notranslate]

# Creating the label encoder
encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The following would be the output after running the above Python code −

[source,result,notranslate]

LabelEncoder()

*Step 4 − Checking the performance by encoding a randomly ordered list*

This step checks the encoder by encoding a randomly ordered list of labels. The following Python code can be written to do this −

[source,result,notranslate]

# Encoding a set of labels
test_labels = ['green', 'red', 'black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)

The labels would get printed as follows −

[source,prettyprint,notranslate]

Labels = ['green', 'red', 'black']

Now, we can get the list of encoded values i.e. word labels converted to numbers as follows −

[source,prettyprint,notranslate]

print("Encoded values =", list(encoded_values))

The encoded values would get printed as follows −

[source,prettyprint,notranslate]

Encoded values = [1, 2, 0]

*Step 5 − Checking the performance by decoding a random set of numbers*

This step checks the encoder by decoding a random set of numbers. The following Python code can be written to do this −

[source,result,notranslate]

# Decoding a set of values
encoded_values = [3, 0, 4, 1]
decoded_list = encoder.inverse_transform(encoded_values)
print("\nEncoded values =", encoded_values)

Now, Encoded values would get printed as follows −

[source,prettyprint,notranslate]

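Encoded values = [3, 0, 4, 1]

The decoded labels can then be printed with the following line −

[source,prettyprint,notranslate]

print("\nDecoded labels =", list(decoded_list))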

Now, decoded values would get printed as follows −

[source,prettyprint,notranslate]

Decoded labels = ['white', 'black', 'yellow', 'green']
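To see why 3 decodes to 'white' and 0 to 'black', the fitted encoder's `classes_` attribute can be inspected. The sketch below assumes the same `input_labels` as in Step 2; the `mapping` dictionary is a name introduced here for illustration. `classes_` holds the distinct labels in sorted order, and a label's position in that array is its integer code.

```python
from sklearn import preprocessing

# Sample labels from Step 2
input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

# Build a label -> code dictionary from the sorted classes_ array
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
for label, code in mapping.items():
    print(label, "-->", code)
```

Labels that were not present during `fit` have no entry in this mapping, so the encoder cannot transform them.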

==== Labeled vs. Unlabeled Data

Unlabeled data mainly consists of samples of natural or human-created objects that can easily be obtained from the world. Examples include audio, video, photos, news articles, etc.

On the other hand, labeled data takes a set of unlabeled data and augments each piece of it with some meaningful tag, label, or class. For example, if we have a photo, a label can be assigned based on its content, i.e., whether it is a photo of a boy, a girl, an animal, or anything else. Labeling the data requires human expertise or judgment about a given piece of unlabeled data.

There are many scenarios where unlabeled data is plentiful and easily obtained but labeled data often requires a human/expert to annotate. Semi-supervised learning attempts to combine labeled and unlabeled data to build better models.
