Gensim - Transformations

This chapter will help you learn about the various transformations in Gensim. Let us begin by understanding what it means to transform documents.

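Transforming Documents

Transforming documents means representing a document in a way that allows it to be manipulated mathematically. Apart from deducing the latent structure of the corpus, transforming documents also serves the following goals: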

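It discovers the relationships between words.
It brings out the hidden structure in the corpus.
It describes the documents in a new and more semantic way.
It makes the representation of the documents more compact.
It improves efficiency, because the new representation consumes fewer resources.
It improves efficacy, because marginal data trends are ignored in the new representation.
It reduces noise in the new document representation.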

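Let us look at the implementation steps for transforming documents from one vector space representation to another.

Implementation Steps

To transform documents, we need to follow these steps:

Step 1: Creating the Corpus

The first and most basic step is to create a corpus from the documents. We have already created a corpus in previous examples. Let us create another one with some enhancements (removing common words and words that appear only once):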

import gensim
import pprint
from collections import defaultdict
from gensim import corpora

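Now provide the documents for creating the corpus:

t_corpus = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain.",
   "You can find its free tutorial on finddevguides.com",
   "finddevguides.com also provide best technical tutorials on "
   "technologies like AI deep learning machine learning for free"
]

Next, we need to tokenize the documents and, along with that, remove the common words: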

stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [
      word for word in document.lower().split() if word not in stoplist
   ]
    for document in t_corpus
]

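The following script removes the words that appear only once:

# Count how often each token occurs across the whole corpus.
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1

# Filter only after counting has finished for every document.
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)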

Output

[
   ['toolkit'],
   ['free', 'toolkit'],
   ['deep', 'learning', 'like'],
   ['free', 'on', 'finddevguides.com'],
   ['finddevguides.com', 'on', 'like', 'deep', 'learning', 'learning', 'free']
]
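The same frequency filter can be written more compactly with collections.Counter. The sketch below is only an illustration: the helper name drop_rare_tokens, its min_count parameter and the three sample documents are assumptions made for this example, not part of Gensim or of the corpus above.

```python
from collections import Counter

def drop_rare_tokens(tokenized_docs, min_count=2):
    """Keep only tokens that occur at least `min_count` times in the corpus."""
    # Count every token across all documents first.
    frequency = Counter(token for doc in tokenized_docs for token in doc)
    # Then filter each document against the completed counts.
    return [
        [token for token in doc if frequency[token] >= min_count]
        for doc in tokenized_docs
    ]

# Hypothetical sample documents, already tokenized.
docs = [["free", "toolkit", "cntk"], ["free", "tutorial"], ["toolkit"]]
filtered = drop_rare_tokens(docs)
print(filtered)
```

Counting first and filtering afterwards matters: if the filter ran while the counts were still being collected, tokens from later documents would be dropped before they had been counted.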

Now pass it to the corpora.Dictionary() object to get the unique tokens in our corpus:

dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)

Output

Dictionary(7 unique tokens: ['toolkit', 'free', 'deep', 'learning', 'like']...)

Next, the following line of code creates the Bag of Words model for our corpus:

BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

Output

[
   [(0, 1)],
   [(0, 1), (1, 1)],
   [(2, 1), (3, 1), (4, 1)],
   [(1, 1), (5, 1), (6, 1)],
   [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)]
]
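To make the (id, count) pairs above easier to read, here is a minimal pure-Python sketch of the bag-of-words idea. It is not Gensim's implementation. The helper names build_token_ids and to_bow are hypothetical, and numbering new tokens in sorted order within each document is an assumption chosen so that the ids match the output shown above.

```python
from collections import Counter

def build_token_ids(tokenized_docs):
    """Assign an integer id to each new token, document by document."""
    token2id = {}
    for doc in tokenized_docs:
        # Assumption: new tokens in a document are numbered in sorted order.
        for token in sorted(set(doc)):
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def to_bow(doc, token2id):
    """Convert a token list to a sorted list of (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc if token in token2id)
    return sorted(counts.items())

processed = [
    ["toolkit"],
    ["free", "toolkit"],
    ["deep", "learning", "like"],
    ["free", "on", "finddevguides.com"],
    ["finddevguides.com", "on", "like", "deep", "learning", "learning", "free"],
]
token2id = build_token_ids(processed)
bow_corpus = [to_bow(doc, token2id) for doc in processed]
print(bow_corpus[4])
```

In the last document the token "learning" (id 3) occurs twice, which is why its pair is (3, 2) while every other pair has a count of 1.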

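Step 2: Creating a Transformation

Transformations are standard Python objects. We can initialize them using a trained corpus. Here we are going to use the tf-idf model to create a transformation of our trained corpus, BoW_corpus.

First, we need to import the models package from gensim.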

from gensim import models

Now, we need to initialize the model as follows:

tfidf = models.TfidfModel(BoW_corpus)

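Step 3: Transforming Vectors

In this last step, the vectors are converted from the old representation to the new representation. Since we initialized the tfidf model in the step above, tfidf is now treated as a read-only object. Here, we use this tfidf object to convert a vector from the bag-of-words representation (old representation) to Tfidf real-valued weights (new representation).

doc_BoW = [(1,1),(3,1)]
print(tfidf[doc_BoW])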

Output

[(1, 0.4869354917707381), (3, 0.8734379353188121)]
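The two weights above can be reproduced by hand. The sketch below assumes Gensim's default tf-idf settings (raw term count as the local weight, log base 2 of N/df as the global weight, followed by L2 normalization). The function tfidf_vector is a hypothetical helper written for illustration, not Gensim's code.

```python
import math

# The bag-of-words corpus from Step 1, copied here so the block is self-contained.
bow_corpus = [
    [(0, 1)],
    [(0, 1), (1, 1)],
    [(2, 1), (3, 1), (4, 1)],
    [(1, 1), (5, 1), (6, 1)],
    [(1, 1), (2, 1), (3, 2), (4, 1), (5, 1), (6, 1)],
]

def tfidf_vector(doc_bow, corpus):
    """Weight a bag-of-words vector by tf * log2(N / df), then L2-normalize."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each token id appears.
    df = {}
    for doc in corpus:
        for token_id, _ in doc:
            df[token_id] = df.get(token_id, 0) + 1
    weights = [
        (token_id, count * math.log2(n_docs / df[token_id]))
        for token_id, count in doc_bow
        if token_id in df
    ]
    norm = math.sqrt(sum(w * w for _, w in weights))
    if norm == 0:
        return weights
    return [(token_id, w / norm) for token_id, w in weights]

result = tfidf_vector([(1, 1), (3, 1)], bow_corpus)
print(result)
```

Token 1 ("free") appears in three of the five documents and token 3 ("learning") in only two, so the rarer token receives the larger weight, roughly 0.873 against 0.487.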

We applied the transformation to a single vector with two terms, but we can also apply it to the whole corpus as follows:

corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

Output

[(0, 1.0)]
[(0, 0.8734379353188121), (1, 0.4869354917707381)]
[(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)]
[(1, 0.3667400603126873), (5, 0.657838022678017), (6, 0.657838022678017)]
[
   (1, 0.19338287240886842), (2, 0.34687949360312714), (3, 0.6937589872062543),
   (4, 0.34687949360312714), (5, 0.34687949360312714), (6, 0.34687949360312714)
]

Complete Implementation Example

import pprint
from collections import defaultdict
from gensim import corpora, models

t_corpus = [
   "CNTK formerly known as Computational Network Toolkit",
   "is a free easy-to-use open-source commercial-grade toolkit",
   "that enable us to train deep learning algorithms to learn like the human brain.",
   "You can find its free tutorial on finddevguides.com",
   "finddevguides.com also provide best technical tutorials on "
   "technologies like AI deep learning machine learning for free"
]

# Tokenize on whitespace and drop common words.
stoplist = set('for a of the and to in'.split(' '))
processed_corpus = [
   [word for word in document.lower().split() if word not in stoplist]
   for document in t_corpus
]

# Count token frequencies, then drop tokens that appear only once.
frequency = defaultdict(int)
for text in processed_corpus:
   for token in text:
      frequency[token] += 1
processed_corpus = [
   [token for token in text if frequency[token] > 1]
   for text in processed_corpus
]
pprint.pprint(processed_corpus)

# Step 1: build the dictionary and the bag-of-words corpus.
dictionary = corpora.Dictionary(processed_corpus)
print(dictionary)
BoW_corpus = [dictionary.doc2bow(text) for text in processed_corpus]
pprint.pprint(BoW_corpus)

# Step 2: initialize the tf-idf transformation from the corpus.
tfidf = models.TfidfModel(BoW_corpus)

# Step 3: transform a single vector, then the whole corpus.
doc_BoW = [(1,1),(3,1)]
print(tfidf[doc_BoW])
corpus_tfidf = tfidf[BoW_corpus]
for doc in corpus_tfidf:
   print(doc)

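Various Transformations in Gensim

Using Gensim, we can implement various popular transformations, i.e. Vector Space Model algorithms. Some of them are as follows:

Tf-Idf (Term Frequency-Inverse Document Frequency)

During initialization, the tf-idf model algorithm expects a training corpus with integer values (such as the Bag-of-Words model). Afterwards, at transformation time, it takes a vector representation and returns another vector representation.

The output vector has the same dimensionality, but the values of features that were rare at training time are increased. It basically converts integer-valued vectors into real-valued vectors. The following is the syntax of the Tf-idf transformation: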

Model=models.TfidfModel(corpus, normalize=True)
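As a rough guide to what this transformation computes, the weighting can be written out as below. This assumes Gensim's default settings (raw term count as the local weight, a base-2 logarithm for the inverse document frequency, and L2 normalization when normalize=True). The symbols are introduced here for illustration: tf_{i,j} is the count of term i in document j, df_i is the number of training documents containing term i, and D is the total number of training documents.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \log_2 \frac{D}{\mathrm{df}_i},
\qquad
\hat{w}_{i,j} = \frac{w_{i,j}}{\sqrt{\sum_{k} w_{k,j}^{2}}}
```

A term that occurs in every training document has D / df_i = 1, so its weight is zero. This is the sense in which rare features are boosted relative to common ones.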

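LSI (Latent Semantic Indexing)

The LSI model algorithm can transform documents from either an integer-valued vector model (such as the Bag-of-Words model) or a Tf-Idf weighted space into a latent space. The output vector has lower dimensionality. The following is the syntax of the LSI transformation: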

Model=models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)

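LDA (Latent Dirichlet Allocation)

The LDA model algorithm is another algorithm that transforms documents from the Bag-of-Words model space into a topic space. The output vector has lower dimensionality. The following is the syntax of the LDA transformation: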

Model=models.LdaModel(corpus, id2word=dictionary, num_topics=100)

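Random Projections (RP)

RP, a very efficient approach, aims to reduce the dimensionality of the vector space. It approximates the Tf-Idf distances between documents by throwing in a little randomness. The following is the syntax of the RP transformation: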

Model=models.RpModel(tfidf_corpus, num_topics=500)
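The general idea behind random projection can be sketched in a few lines of plain Python. This is a generic illustration of the technique, not a reproduction of how RpModel is implemented. The function random_projection, its parameters and the choice of +/-1/sqrt(k) matrix entries are assumptions made for this example.

```python
import math
import random

def random_projection(sparse_vectors, num_terms, num_topics, seed=42):
    """Project sparse (id, weight) vectors into `num_topics` dimensions."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(num_topics)
    # One random row of +/-scale entries per output dimension.
    projection = [
        [rng.choice((-scale, scale)) for _ in range(num_terms)]
        for _ in range(num_topics)
    ]
    return [
        [sum(row[term_id] * weight for term_id, weight in vector)
         for row in projection]
        for vector in sparse_vectors
    ]

# Three tf-idf vectors over a 7-term vocabulary, taken from the output above.
tfidf_docs = [
    [(0, 1.0)],
    [(0, 0.8734379353188121), (1, 0.4869354917707381)],
    [(2, 0.5773502691896257), (3, 0.5773502691896257), (4, 0.5773502691896257)],
]
projected = random_projection(tfidf_docs, num_terms=7, num_topics=3)
for vector in projected:
    print(vector)
```

With only 3 output dimensions the distances between documents are preserved poorly. The approximation becomes useful when a vocabulary of many thousands of terms is reduced to a few hundred dimensions, as in the num_topics=500 call above.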

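Hierarchical Dirichlet Process (HDP)

HDP is a non-parametric Bayesian method and a new addition to Gensim. Take care when using it. The following is the syntax of the HDP transformation:

Model=models.HdpModel(corpus, id2word=dictionary)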
