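Biopython - Cluster Analysis

In general, cluster analysis groups a set of objects so that objects in the same group (a cluster) are more similar to each other than to objects in other groups. The concept is used mainly in data mining, statistical data analysis, machine learning, pattern recognition, image analysis, bioinformatics, and so on. It can be carried out with a variety of algorithms, which is part of why clustering is so widely used across different kinds of analysis.

In bioinformatics, cluster analysis is mainly used in gene expression data analysis to find groups of genes with similar expression.

In this chapter, we will look at the important algorithms in Biopython to understand the fundamentals of clustering on a real dataset.

Biopython implements all of these algorithms in the Bio.Cluster module. It supports the following algorithms:

Hierarchical Clustering
K-Clustering
Self-Organizing Maps
Principal Component Analysis

Let us have a brief introduction to the above algorithms.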

=== Hierarchical Clustering

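Hierarchical clustering links each node to its nearest neighbor by a distance measure to create a cluster. A Bio.Cluster node has three attributes: left, right, and distance. Let us create a simple cluster as shown below: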

>>> from Bio.Cluster import Node
>>> n = Node(1,10)
>>> n.left = 11
>>> n.right = 0
>>> n.distance = 1
>>> print(n)
(11, 0): 1

If you want to construct a tree-based clustering, use the commands below:

>>> from Bio.Cluster import Node, Tree
>>> n1 = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> n1_tree = Tree(n1)
>>> print(n1_tree)
(1, 2): 0.2
(0, -1): 0.5
>>> print(n1_tree[0])
(1, 2): 0.2
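To make the node numbering easier to read, here is a minimal sketch in plain Python (it does not use Bio.Cluster). It assumes the usual convention that non-negative numbers refer to the original items and negative numbers refer to clusters created by earlier nodes (-1 for the first node, -2 for the second, and so on). The `members` helper is a hypothetical name introduced only for this illustration.

```python
# Nodes written as (left, right, distance) tuples, mirroring the tree above.
nodes = [(1, 2, 0.2), (0, -1, 0.5)]


def members(index, nodes):
    """Return the sorted item numbers contained in item or cluster `index`."""
    if index >= 0:
        return [index]  # a non-negative index is an original item
    # Cluster -k is described by nodes[k - 1].
    left, right, _ = nodes[-index - 1]
    return sorted(members(left, nodes) + members(right, nodes))


for k, (left, right, distance) in enumerate(nodes):
    cluster_number = -(k + 1)
    print(f"node {k} (cluster {cluster_number}): "
          f"items {members(cluster_number, nodes)} joined at {distance}")
# node 0 (cluster -1): items [1, 2] joined at 0.2
# node 1 (cluster -2): items [0, 1, 2] joined at 0.5
```

Read this way, the second node `(0, -1): 0.5` joins item 0 with the cluster formed by the first node.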

Let us perform hierarchical clustering using the Bio.Cluster module.

Consider that the distance is defined in an array.

>>> import numpy as np
>>> distance = np.array([[ ... ]])

Now add the distance array in tree cluster.

>>> from Bio.Cluster import treecluster
>>> cluster = treecluster(distance)
>>> print(cluster)
(2, 1): 0.666667
(-1, 0): 9.66667

The above function returns a Tree object. This object contains (number of items - 1) nodes, where the number of items is the number of rows or columns that were clustered.
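The two distances printed above can be reproduced by hand. The sketch below is plain Python, not Bio.Cluster itself, and rests on three assumptions: the 3 x 3 array is an illustrative choice (the original values are not shown here), the distance between two rows is the mean of the squared differences, and clusters are merged by maximum linkage. As far as we know the last two correspond to the treecluster defaults, but treat that as an assumption.

```python
from itertools import combinations

# Assumed illustrative data: three items (rows), three measurements each.
rows = [[1, 2, 3], [4, 5, 6], [3, 5, 7]]


def mean_squared_distance(a, b):
    """Mean of the squared differences between two equal-length rows."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Distance between every pair of items.
pairwise = {
    (i, j): mean_squared_distance(rows[i], rows[j])
    for i, j in combinations(range(len(rows)), 2)
}

# Step 1: join the closest pair of items into cluster -1.
(i, j), first_distance = min(pairwise.items(), key=lambda kv: kv[1])

# Step 2: maximum linkage. The distance from the remaining item to the new
# cluster is the largest distance to any member of that cluster.
remaining = next(k for k in range(len(rows)) if k not in (i, j))
second_distance = max(
    mean_squared_distance(rows[remaining], rows[m]) for m in (i, j)
)

print(f"({j}, {i}): {first_distance:.6g}")            # (2, 1): 0.666667
print(f"(-1, {remaining}): {second_distance:.6g}")    # (-1, 0): 9.66667
```

With these assumptions, items 1 and 2 are joined first, and item 0 is then attached to that cluster at the larger distance.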

=== K-Clustering

It is a type of partitioning algorithm, classified into k-means, k-medians, and k-medoids clustering. Let us look at each of them briefly.

==== K-means Clustering

This approach is popular in data mining. The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.

The algorithm works iteratively to assign each data point to one of the K groups based on the features that are provided. Data points are clustered based on feature similarity.

>>> from Bio.Cluster import kcluster
>>> from numpy import array
>>> data = array([[ ... ]])
>>> clusterid, error, found = kcluster(data)
>>> print(clusterid)
[0 0 1]
>>> print(found)
1
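The iterative assign-and-update loop described above can be sketched in plain Python. This is a simplified illustration and not how kcluster is implemented: the data points are made up, and the starting centroids are simply the first K points so that the result is reproducible. The `kmeans` and `squared_distance` names are hypothetical helpers for this example.

```python
def squared_distance(a, b):
    """Sum of squared differences between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=10):
    # Assumption: start from the first k points (deterministic, for clarity).
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            cluster = [p for p, lab in zip(points, labels) if lab == c]
            if cluster:
                centroids[c] = [sum(col) / len(cluster) for col in zip(*cluster)]
    return labels, centroids


# Made-up data: two points near (1, 1.5) and two near (8.5, 8.75).
points = [[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.5]]
labels, centroids = kmeans(points, k=2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[1.25, 1.5], [8.5, 8.75]]
```

In this run the first pass assigns three points to the second centroid; the update step then moves that centroid, and the second pass settles on the two natural groups.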

==== K-medians Clustering

It is another type of clustering algorithm, which calculates the median (rather than the mean) of each cluster to determine its centroid.
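The difference between the two kinds of centroid is easiest to see on a cluster that contains an outlier. The following sketch uses only the Python standard library and made-up numbers; it illustrates the idea and is not the Bio.Cluster implementation.

```python
from statistics import mean, median

# Made-up cluster of three items; the third has an outlying first value.
cluster = [[1.0, 2.0], [2.0, 3.0], [30.0, 4.0]]

# Compute the centroid column by column.
mean_centroid = [mean(column) for column in zip(*cluster)]
median_centroid = [median(column) for column in zip(*cluster)]

print(mean_centroid)    # [11.0, 3.0]
print(median_centroid)  # [2.0, 3.0]
```

The mean is pulled toward the outlier, while the median stays close to the bulk of the items, which is the usual reason for preferring k-medians on noisy data.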

==== K-medoids Clustering

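This approach clusters a given set of items using their distance matrix and the number of clusters passed by the user. Each cluster is represented by a medoid, that is, the item with the smallest total distance to the other items in its cluster.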

Consider the distance matrix as defined below −

>>> distance = array([[ ... ]])

You can compute k-medoids clustering using the commands below:

>>> from Bio.Cluster import kmedoids
>>> clusterid, error, found = kmedoids(distance)
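To show what a medoid-based method does with a distance matrix, here is a small plain-Python sketch. It is an illustration under stated assumptions, not the kmedoids implementation: the 5 x 5 matrix is made up, the starting medoids (items 0 and 3) are chosen by hand, and `assign` and `update` are hypothetical helper names. Each item is labeled with the item number of its medoid.

```python
# Made-up symmetric distance matrix for five items.
distance = [
    [0, 1, 2, 8, 9],
    [1, 0, 1, 7, 8],
    [2, 1, 0, 6, 7],
    [8, 7, 6, 0, 1],
    [9, 8, 7, 1, 0],
]


def assign(distance, medoids):
    """Label every item with the medoid it is closest to."""
    return [
        min(medoids, key=lambda m: distance[item][m])
        for item in range(len(distance))
    ]


def update(distance, labels):
    """Within each cluster, pick the item with the smallest total distance."""
    new_medoids = []
    for m in sorted(set(labels)):
        members = [i for i, lab in enumerate(labels) if lab == m]
        new_medoids.append(
            min(members, key=lambda c: sum(distance[c][o] for o in members))
        )
    return new_medoids


medoids = [0, 3]  # assumed starting medoids
for _ in range(100):  # safety cap on the number of passes
    labels = assign(distance, medoids)
    new_medoids = update(distance, labels)
    if new_medoids == medoids:
        break
    medoids = new_medoids

print(medoids)  # [1, 3]
print(labels)   # [1, 1, 1, 3, 3]
```

Unlike k-means, the cluster centers here are always actual items, so only the distance matrix is needed and the original data never has to be averaged.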

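Let us consider an example.

The kcluster function takes a data matrix as input, not Seq instances. You need to convert your sequences to a matrix and pass that to the kcluster function.

One way of converting the data to a matrix containing only numerical elements is to use the numpy.frombuffer function on the ASCII-encoded sequence (older code used numpy.fromstring for this, which is deprecated for binary input). It translates each letter in a sequence to its ASCII code.

This creates a 2D array of encoded sequences that the kcluster function recognizes and uses to cluster your sequences. Because kcluster starts from a random assignment, the cluster numbers may differ from run to run.

>>> from Bio.Cluster import kcluster
>>> import numpy as np
>>> sequence = ['AGCT', 'CGTA', 'AAGT', 'TCCG']
>>> matrix = np.asarray([np.frombuffer(s.encode("ascii"), dtype=np.uint8) for s in sequence])
>>> clusterid, error, found = kcluster(matrix)
>>> print(clusterid)
[1 0 0 1]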
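The conversion above can also be written with the built-in ord function, which makes the encoding explicit. This is a minimal sketch in plain Python using the same four sequences; it only builds the numeric matrix and does not call kcluster.

```python
sequences = ["AGCT", "CGTA", "AAGT", "TCCG"]

# All sequences must have the same length to form a rectangular matrix.
assert len({len(seq) for seq in sequences}) == 1

# Replace every base by its ASCII code: A=65, C=67, G=71, T=84.
matrix = [[ord(base) for base in seq] for seq in sequences]

for seq, row in zip(sequences, matrix):
    print(seq, row)
# AGCT [65, 71, 67, 84]
# CGTA [67, 71, 84, 65]
# AAGT [65, 65, 71, 84]
# TCCG [84, 67, 67, 71]
```

Note that ASCII codes put an arbitrary numeric scale on the bases (A is numerically closer to C than to T), so distances computed on this matrix reflect the encoding as much as the biology. It is a convenient way to get sequences into kcluster, but the resulting clusters should be interpreted with that in mind.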

=== Self-Organizing Maps

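This approach is a type of artificial neural network. It was developed by Kohonen and is often called a Kohonen map. It organizes items into clusters based on a rectangular topology.

Let us create a simple cluster using the same array distance as shown below:

>>> from Bio.Cluster import somcluster
>>> from numpy import array
>>> data = array([[ ... ]])
>>> clusterid, map = somcluster(data)

>>> print(map)
[[ ... ]]
>>> print(clusterid)
[[ ... ]]

Here, clusterid is an array with two columns, where the number of rows is equal to the number of items that were clustered; each row holds the x and y coordinates of the grid cell to which the item was assigned. data is the input array, whose rows or columns are the items being clustered, and map holds the data vector for each cell of the grid.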
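To see why clusterid has two columns, consider the final assignment step of a self-organizing map: each item is placed in the grid cell whose vector is closest to it. The sketch below is plain Python with a made-up 2 x 2 grid and made-up items. It deliberately skips the training phase that a real SOM uses to learn the cell vectors, and `nearest_cell` is a hypothetical helper name.

```python
# Hypothetical 2 x 2 grid: cell_vectors[ix][iy] is the vector of cell (ix, iy).
cell_vectors = [
    [[0.0, 0.0], [0.0, 5.0]],
    [[5.0, 0.0], [5.0, 5.0]],
]

# Made-up items to place on the grid.
items = [[0.5, 0.2], [0.3, 4.6], [4.8, 5.1]]


def nearest_cell(item, cell_vectors):
    """Return [ix, iy] of the grid cell whose vector is closest to `item`."""
    best = None
    for ix, column in enumerate(cell_vectors):
        for iy, vector in enumerate(column):
            d = sum((a - b) ** 2 for a, b in zip(item, vector))
            if best is None or d < best[0]:
                best = (d, [ix, iy])
    return best[1]


# One row per item, two columns: the x and y grid coordinates.
clusterid = [nearest_cell(item, cell_vectors) for item in items]
print(clusterid)  # [[0, 0], [0, 1], [1, 1]]
```

Items that land in the same or in neighboring cells are the ones the map considers similar.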

=== Principal Component Analysis

Principal Component Analysis is useful to visualize high-dimensional data. It is a method that uses simple matrix operations from linear algebra and statistics to calculate a projection of the original data into the same number or fewer dimensions.

In Bio.Cluster, the pca function returns a tuple of columnmean, coordinates, components, and eigenvalues. Let us first look into the basics of this concept.

>>> from numpy import array
>>> from numpy import mean
>>> from numpy import cov
>>> from numpy.linalg import eig

# define a matrix
>>> A = array([[ ... ]])
>>> print(A)
[[ ... ]]

# calculate the mean of each column
>>> M = mean(A.T, axis=1)
>>> print(M)
[ 3. 4.]

# center columns by subtracting column means
>>> C = A - M
>>> print(C)
[[ ... ]]

# calculate covariance matrix of centered matrix
>>> V = cov(C.T)
>>> print(V)
[[ ... ]]

# eigendecomposition of covariance matrix
>>> values, vectors = eig(V)
>>> print(vectors)
[[ ... ]]
>>> print(values)
[ 8. 0.]
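The same steps can be followed by hand on a small matrix. The sketch below uses only the Python standard library. The 3 x 2 matrix is an assumption chosen for illustration (the original values are not shown here); it happens to give column means of 3 and 4 and eigenvalues of 8 and 0, matching the printed results above. The eigenvalues come from the closed-form formula for a symmetric 2 x 2 matrix, so this approach does not generalize beyond two columns.

```python
from math import sqrt

# Assumed illustrative matrix: three rows (samples), two columns (features).
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n = len(A)

# Mean of each column.
M = [sum(column) / n for column in zip(*A)]

# Center the columns by subtracting the column means.
C = [[x - m for x, m in zip(row, M)] for row in A]


def covariance(i, j):
    """Sample covariance (divisor n - 1) between centered columns i and j."""
    return sum(row[i] * row[j] for row in C) / (n - 1)


V = [[covariance(i, j) for j in range(2)] for i in range(2)]

# Eigenvalues of the symmetric 2 x 2 matrix [[a, b], [b, d]].
a, b, d = V[0][0], V[0][1], V[1][1]
half_trace = (a + d) / 2
radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
values = [half_trace + radius, half_trace - radius]

print(M)       # [3.0, 4.0]
print(C)       # [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
print(V)       # [[4.0, 4.0], [4.0, 4.0]]
print(values)  # [8.0, 0.0]
```

The second eigenvalue is zero because the two columns move together exactly, so all of the variance lies along a single direction.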

Let us apply the same rectangular matrix data to Bio.Cluster module as defined below −

>>> from Bio.Cluster import pca
>>> from numpy import array
>>> data = array([[ ... ]])
>>> columnmean, coordinates, components, eigenvalues = pca(data)
>>> print(columnmean)
[ 3. 4.]
>>> print(coordinates)
[[ ... ]]
>>> print(components)
[[ ... ]]
>>> print(eigenvalues)
[ 4. 0.]
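You may notice that the eigenvalues printed here are 4 and 0, while the NumPy walk-through gave 8 and 0. One reading that fits these numbers is that the first is a singular value of the centered data and the second is an eigenvalue of the covariance matrix; the two are related by s = sqrt((n - 1) * lambda). The sketch below checks this arithmetic in plain Python on an assumed centered matrix. It is an observation about the numbers, not a description of how Bio.Cluster computes its result.

```python
from math import sqrt

# Assumed centered data (three rows, two columns), chosen for illustration.
C = [[-2.0, -2.0], [0.0, 0.0], [2.0, 2.0]]
n = len(C)

# Scatter matrix S = C^T C (no division by n - 1).
S = [[sum(row[i] * row[j] for row in C) for j in range(2)] for i in range(2)]

# Eigenvalues of the symmetric 2 x 2 matrix [[a, b], [b, d]].
a, b, d = S[0][0], S[0][1], S[1][1]
half_trace = (a + d) / 2
radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
scatter_eigenvalues = [half_trace + radius, max(0.0, half_trace - radius)]

# Singular values of C are the square roots of the eigenvalues of S.
singular_values = [sqrt(v) for v in scatter_eigenvalues]

# Covariance eigenvalues are the eigenvalues of S divided by n - 1.
covariance_eigenvalues = [v / (n - 1) for v in scatter_eigenvalues]

print(singular_values)         # [4.0, 0.0]
print(covariance_eigenvalues)  # [8.0, 0.0]
```

Both sets of numbers describe the same principal directions; they differ only in scale.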
