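Machine Learning - Preparing Data

Introduction

Machine learning algorithms depend entirely on data, because data is what makes model training possible. At the same time, if we cannot make sense of that data before feeding it to an ML algorithm, the machine is useless. Put simply, we must always feed the right data: data at the correct scale, in the correct format, and containing meaningful features for the problem we want the machine to solve.

This makes data preparation the most important step in the ML process. Data preparation may be defined as the procedure that makes a dataset more suitable for the ML process.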

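Why Data Pre-processing?

After selecting the raw data for ML training, the most important task is data pre-processing. In a broad sense, data pre-processing converts the selected data into a form that we can work with or feed to ML algorithms. We always need to pre-process our data so that it meets the expectations of the machine learning algorithm.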

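Data Pre-processing Techniques

The following data pre-processing techniques can be applied to a dataset to produce data for ML algorithms:

Scaling

Most datasets contain attributes with varying scales, and many ML algorithms do not cope well with such data, so rescaling is required. Rescaling puts all attributes on the same scale; generally, attributes are rescaled into the range 0 to 1. Algorithms trained with gradient descent, and distance-based methods such as k-Nearest Neighbors, need scaled data. We can rescale the data with the help of the _MinMaxScaler_ class of the _scikit-learn_ Python library.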
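To make the rescaling concrete, here is a minimal sketch of min-max scaling on a small made-up array (the values are illustrative assumptions, not taken from the Pima dataset). It applies the usual formula `(x - min) / (max - min)` per column and checks that `MinMaxScaler` gives the same result.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data, made up for illustration: two attributes on very different scales.
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 1000.0]])

# Min-max rescaling by hand, column by column: (x - min) / (max - min)
col_min = X.min(axis=0)
col_max = X.max(axis=0)
manual = (X - col_min) / (col_max - col_min)

# The same transformation through scikit-learn
scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)

print(manual)
print(np.allclose(manual, scaled))  # True if both approaches agree
```

After rescaling, every column has a minimum of 0 and a maximum of 1, whatever its original units.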

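Example

In this example, we rescale the data of the Pima Indians Diabetes dataset that we used earlier. First, the CSV data is loaded (as in the previous chapters), and then it is rescaled into the range 0 to 1 with the help of the _MinMaxScaler_ class.

The first few lines of the following script are the same as those we wrote in previous chapters while loading CSV data.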

from pandas import read_csv
from numpy import set_printoptions
from sklearn import preprocessing
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the _MinMaxScaler_ class to rescale the data into the range 0 to 1.

data_scaler = preprocessing.MinMaxScaler(feature_range=(0,1))
data_rescaled = data_scaler.fit_transform(array)

We can also summarize the data for output as we choose. Here, we set the precision to 1 and show the first 10 rows of the output.

set_printoptions(precision=1)
print ("\nScaled data:\n", data_rescaled[0:10])

Output

Scaled data:
[[

The output shows that all the data has been rescaled into the range 0 to 1.

=== Normalization

Another useful data preprocessing technique is normalization. It rescales each row of data to have a length of 1. It is mainly useful for sparse datasets that contain many zeros. We can normalize the data with the help of the _Normalizer_ class of the _scikit-learn_ Python library.

=== Types of Normalization

In machine learning, there are two types of normalization preprocessing techniques as follows −

* link:/machine_learning_with_python/machine_learning_with_python_lone_normalization[L1 Normalization]
* link:/machine_learning_with_python/machine_learning_with_python_ltwo_normalization[L2 Normalization]
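As a rough illustration of the two types, the sketch below normalizes a small made-up array with the `norm='l1'` and `norm='l2'` options of `Normalizer`. The array values are assumptions chosen so the arithmetic is easy to follow. With L1, the absolute values in each row sum to 1; with L2, each row has a Euclidean length of 1.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Toy data, made up for illustration; each row is one sample.
X = np.array([[3.0, 4.0, 0.0],
              [1.0, 0.0, 1.0]])

# L1: divide each row by the sum of its absolute values
l1 = Normalizer(norm='l1').fit_transform(X)

# L2: divide each row by its Euclidean length
l2 = Normalizer(norm='l2').fit_transform(X)

print(l1)
print(l2)
print(np.abs(l1).sum(axis=1))          # row-wise sums of absolute values
print(np.sqrt((l2 ** 2).sum(axis=1)))  # row-wise Euclidean lengths
```

For the first row, L2 normalization divides by 5 (the length of the vector 3, 4, 0), giving 0.6, 0.8 and 0.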

=== Binarization

As the name suggests, this technique makes our data binary. We choose a binary threshold; values above that threshold are converted to 1 and values below it are converted to 0.

For example, if we choose a threshold value of 0.5, dataset values above it become 1 and values below it become 0. That is why we can call this *binarizing* the data or *thresholding* the data. This technique is useful when we have probabilities in our dataset and want to convert them into crisp values.

We can binarize the data with the help of the _Binarizer_ class of the _scikit-learn_ Python library.
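The probability use case can be sketched on a small made-up array (the values are illustrative assumptions). Note that scikit-learn's `Binarizer` maps only values strictly greater than the threshold to 1, so a value exactly equal to 0.5 becomes 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Toy "probabilities", made up for illustration.
probs = np.array([[0.1, 0.5, 0.9],
                  [0.7, 0.49, 0.51]])

# Values strictly above the threshold become 1, everything else becomes 0.
binarizer = Binarizer(threshold=0.5)
crisp = binarizer.fit_transform(probs)

print(crisp)
# [[0. 0. 1.]
#  [1. 0. 1.]]
```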

*Example*

In this example, we will binarize the data of the Pima Indians Diabetes dataset that we used earlier. First, the CSV data is loaded, and then the _Binarizer_ class converts it into binary values, i.e. 0 and 1, depending on the threshold value. We use 0.5 as the threshold value.

The first few lines of the following script are the same as those we wrote in previous chapters while loading CSV data.

[source,prettyprint,notranslate]

from pandas import read_csv
from sklearn.preprocessing import Binarizer
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the _Binarizer_ class to convert the data into binary values.

[source,result,notranslate]

binarizer = Binarizer(threshold=0.5).fit(array)
Data_binarized = binarizer.transform(array)

Here, we are showing the first 5 rows in the output.

[source,result,notranslate]

print("\nBinary data:\n", Data_binarized[0:5])

*Output*

[source,result,notranslate]

Binary data:
[[

=== Standardization

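Standardization is another useful data preprocessing technique. It transforms attributes that have a Gaussian distribution with differing means and standard deviations (SD) into a standard Gaussian distribution with a mean of 0 and an SD of 1. This technique is useful for ML algorithms such as linear regression and logistic regression, which assume a Gaussian distribution in the input dataset and produce better results with rescaled data. We can standardize the data (mean = 0 and SD = 1) with the help of the _StandardScaler_ class of the _scikit-learn_ Python library.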
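The transformation itself is the z-score, `(x - mean) / SD`, applied to each column. The sketch below, on a small made-up array (the values are illustrative assumptions), computes it by hand and compares the result with `StandardScaler`, which uses the population SD (the NumPy default, `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two attributes with different means and spreads.
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])

# z-score by hand, column by column: (x - mean) / SD
manual = (X - X.mean(axis=0)) / X.std(axis=0)

# The same transformation through scikit-learn
scaled = StandardScaler().fit_transform(X)

print(scaled.mean(axis=0))          # approximately 0 for every column
print(scaled.std(axis=0))           # approximately 1 for every column
print(np.allclose(manual, scaled))  # True if both approaches agree
```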

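Example

In this example, we rescale the data of the Pima Indians Diabetes dataset that we used earlier. First, the CSV data is loaded, and then the _StandardScaler_ class converts it to a Gaussian distribution with mean = 0 and SD = 1.

The first few lines of the following script are the same as those we wrote in previous chapters while loading CSV data.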

from sklearn.preprocessing import StandardScaler
from pandas import read_csv
from numpy import set_printoptions
path = r'C:\pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(path, names=names)
array = dataframe.values

Now, we can use the _StandardScaler_ class to rescale the data.

data_scaler = StandardScaler().fit(array)
data_rescaled = data_scaler.transform(array)

We can also summarize the data for output as we choose. Here, we set the precision to 2 and show the first 5 rows of the output.

set_printoptions(precision=2)
print ("\nRescaled data:\n", data_rescaled [0:5])

Output

Rescaled data:
[[

== Data Labeling

We discussed the importance of good data for ML algorithms, as well as some techniques to pre-process the data before sending it to ML algorithms. One more aspect in this regard is data labeling. It is also very important that the data sent to ML algorithms is properly labeled. For example, in classification problems the data carries many labels in the form of words, numbers, etc.

=== What is Label Encoding?

Most sklearn functions expect data with number labels rather than word labels. Hence, we need to convert word labels into number labels. This process is called label encoding. We can perform label encoding with the help of the _LabelEncoder_ class of the _scikit-learn_ Python library.

*Example*

The following Python script performs label encoding.

First, import the required Python libraries as follows −

[source,result,notranslate]

import numpy as np
from sklearn import preprocessing

Now, we need to provide the input labels as follows −

[source,result,notranslate]

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']

The next lines of code create the label encoder and fit it to the input labels.

[source,result,notranslate]

encoder = preprocessing.LabelEncoder()
encoder.fit(input_labels)

The next lines of the script check the encoder by encoding a list of labels in a different order, and then decode a list of integer codes back into labels:

[source,prettyprint,notranslate]

test_labels = ['green', 'red', 'black']
encoded_values = encoder.transform(test_labels)
print("\nLabels =", test_labels)
print("Encoded values =", list(encoded_values))
encoded_values = [3, 0, 4, 1]
decoded_list = encoder.inverse_transform(encoded_values)

We can print the encoded values and the decoded labels with the help of the following Python script:

[source,result,notranslate]

print("\nEncoded values =", encoded_values)
print("\nDecoded labels =", list(decoded_list))

*Output*

[source,result,notranslate]

Labels = ['green', 'red', 'black']
Encoded values = [1, 2, 0]
Encoded values = [3, 0, 4, 1]
Decoded labels = ['white', 'black', 'yellow', 'green']
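The codes in this output follow from how `LabelEncoder` assigns integers: it sorts the unique labels and uses each label's position as its code. A minimal sketch, reusing the same input labels, reads the mapping from the fitted encoder's `classes_` attribute (the `mapping` dictionary is a helper introduced here for illustration only).

```python
from sklearn.preprocessing import LabelEncoder

input_labels = ['red', 'black', 'red', 'green', 'black', 'yellow', 'white']
encoder = LabelEncoder().fit(input_labels)

# classes_ holds the unique labels in sorted order; a label's index is its code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}

print(mapping)
# {'black': 0, 'green': 1, 'red': 2, 'white': 3, 'yellow': 4}
```

This is why 'green', 'red', 'black' encode to 1, 2, 0, and why the codes 3, 0, 4, 1 decode to 'white', 'black', 'yellow', 'green'.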
