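Gensim - Creating an LDA Topic Model

This chapter explains how to create a Latent Dirichlet Allocation (LDA) topic model in Gensim.

One of the primary applications of NLP (natural language processing) is to automatically extract the topics people are discussing from large volumes of text. Examples of such text are hotel reviews, tweets, Facebook posts, feeds from other social media channels, movie reviews, news stories, user feedback, and emails.

In this digital era, knowing what people or customers are talking about, and understanding their opinions and problems, is highly valuable to businesses, political campaigns, and administrators. But is it feasible to read through such large volumes of text by hand and extract the topics?

No, it is not. The task calls for an automated algorithm that can read these large volumes of text documents and extract the required information or topics from them.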

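Role of LDA

LDA's approach to topic modeling is to assign the text in each document to particular topics. Using Dirichlet distributions, LDA builds:

a topics-per-document model, and
a words-per-topic model.

When the LDA topic model algorithm is applied, it rearranges the following to obtain a good composition of the topic-keyword distribution:

the topic distributions within the documents, and
the keyword distributions within the topics.

While processing, LDA makes the following assumptions:

Every document is modeled as a multinomial distribution over topics.
Every topic is modeled as a multinomial distribution over words.
LDA assumes that every chunk of text contains related words, so we must choose the right corpus of data.
LDA also assumes that documents are produced from a mixture of topics.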
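
The assumptions above can be read as a recipe for generating a document. The following is a minimal sketch of that generative story using only the Python standard library. The vocabulary, the Dirichlet parameter 0.5, and all function names are illustrative assumptions, and the sketch shows the generative direction only, not the inference that Gensim performs when it fits a model.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet sample by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw an index from a discrete distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(vocab, topic_word, alpha, n_words, rng):
    """Generate one toy document as a mixture of topics."""
    doc_topics = sample_dirichlet(alpha, rng)  # topic mixture of this document
    words = []
    for _ in range(n_words):
        topic = sample_index(doc_topics, rng)           # choose a topic for this position
        word_id = sample_index(topic_word[topic], rng)  # choose a word from that topic
        words.append(vocab[word_id])
    return doc_topics, words

rng = random.Random(0)
vocab = ["car", "engine", "disk", "cpu", "display"]  # hypothetical vocabulary
num_topics = 2

# One word distribution per topic, each drawn from a symmetric Dirichlet prior.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]

doc_topics, words = generate_document(vocab, topic_word, [0.5] * num_topics, 8, rng)
print(doc_topics)
print(words)
```

Training an LDA model works in the opposite direction: given only the words, it estimates the per-document topic mixtures and the per-topic word distributions.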

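Implementation with Gensim

Here, we use LDA (Latent Dirichlet Allocation) to extract the naturally discussed topics from a dataset.

Loading the Dataset

The dataset we use is the '20 Newsgroups' dataset, which contains thousands of newsgroup posts spread across 20 topic categories. It is available through the Sklearn datasets module, and we can download it with the following Python script: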

from sklearn.datasets import fetch_20newsgroups
newsgroups_train = fetch_20newsgroups(subset='train')

Let's look at a few of the sample posts with the following script:

newsgroups_train.data[:4]

["From: [email protected] (where's my thing)\nSubject:
WHAT car is this!?\nNntp-Posting-Host: rac3.wam.umd.edu\nOrganization:
University of Maryland, College Park\nLines:
15\n\n I was wondering if anyone out there could enlighten me on this car
I saw\nthe other day. It was a 2-door sports car, looked to be from the
late 60s/\nearly 70s. It was called a Bricklin. The doors were really small.
In addition,\nthe front bumper was separate from the rest of the body.
This is \nall I know. If anyone can tellme a model name,
engine specs, years\nof production, where this car is made, history, or
whatever info you\nhave on this funky looking car, please e-mail.\n\nThanks,
\n- IL\n ---- brought to you by your neighborhood Lerxst ----\n\n\n\n\n",

"From: [email protected] (Guy Kuo)\nSubject: SI Clock Poll - Final
Call\nSummary: Final call for SI clock reports\nKeywords:
SI,acceleration,clock,upgrade\nArticle-I.D.: shelley.1qvfo9INNc3s\nOrganization:
University of Washington\nLines: 11\nNNTP-Posting-Host: carson.u.washington.edu\n\nA
fair number of brave souls who upgraded their SI clock oscillator have\nshared their
experiences for this poll. Please send a brief message detailing\nyour experiences with
the procedure. Top speed attained, CPU rated speed,\nadd on cards and adapters, heat
sinks, hour of usage per day, floppy disk\nfunctionality with 800 and 1.4 m floppies
are especially requested.\n\nI will be summarizing in the next two days, so please add
to the network\nknowledge base if you have done the clock upgrade and haven't answered
this\npoll. Thanks.\n\nGuy Kuo <[email protected]>\n",

'From: [email protected] (Thomas E Willis)\nSubject:
PB questions...\nOrganization: Purdue University Engineering
Computer Network\nDistribution: usa\nLines: 36\n\nwell folks,
my mac plus finally gave up the ghost this weekend after\nstarting
life as a 512k way back in 1985. sooo, i\'m in the market for
a\nnew machine a bit sooner than i intended to be...\n\ni\'m looking
into picking up a powerbook 160 or maybe 180 and have a bunch\nof
questions that (hopefully) somebody can answer:\n\n* does anybody
know any dirt on when the next round of powerbook\nintroductions
are expected? i\'d heard the 185c was supposed to make an\nappearence
"this summer" but haven\'t heard anymore on it - and since i\ndon\'t
have access to macleak, i was wondering if anybody out there had\nmore
info...\n\n* has anybody heard rumors about price drops to the powerbook
line like the\nones the duo\'s just went through recently?\n\n* what\'s
the impression of the display on the 180? i could probably swing\na 180
if i got the 80Mb disk rather than the 120, but i don\'t really have\na
feel for how much "better" the display is (yea, it looks great in the\nstore,
but is that all "wow" or is it really that good?). could i solicit\nsome
opinions of people who use the 160 and 180 day-to-day on if its
worth\ntaking the disk size and money hit to get the active display?
(i realize\nthis is a real subjective question, but i\'ve only played around
with the\nmachines in a computer store breifly and figured the opinions
of somebody\nwho actually uses the machine daily might prove helpful).\n\n*
how well does hellcats perform? ;)\n\nthanks a bunch in advance for any info -
if you could email, i\'ll post a\nsummary (news reading time is at a premium
with finals just around the\ncorner... :
( )\n--\nTom Willis \\ [email protected] \\ Purdue Electrical
Engineering\n---------------------------------------------------------------------------\
n"Convictions are more dangerous enemies of truth than lies." - F. W.\nNietzsche\n',

'From: jgreen@amber (Joe Green)\nSubject: Re: Weitek P9000 ?\nOrganization:
Harris Computer Systems Division\nLines: 14\nDistribution: world\nNNTP-Posting-Host:
amber.ssd.csd.harris.com\nX-Newsreader: TIN [version 1.1 PL9]\n\nRobert
J.C. Kyanko ([email protected]) wrote:\n >[email protected] writes in article
<[email protected] >:\n> > Anyone know about the
Weitek P9000 graphics chip?\n > As far as the low-level stuff goes, it looks
pretty nice. It\'s got this\n> quadrilateral fill command that requires just
the four points.\n\nDo you have Weitek\'s address/phone number? I\'d like to get
some information\nabout this chip.\n\n--\nJoe Green\t\t\t\tHarris
Corporation\[email protected]\t\t\tComputer Systems Division\n"The only
thing that really scares me is a person with no sense of humor.
"\n\t\t\t\t\t\t-- Jonathan Winters\n']

Prerequisites

We need the NLTK stopwords and spaCy's English model. Both can be set up as follows:

import nltk
import spacy
nltk.download('stopwords')
# The spaCy model must be installed first: python -m spacy download en_core_web_md
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])

Importing the Required Packages

To build the LDA model, we need to import the following packages:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
import matplotlib.pyplot as plt

Preparing Stopwords

Next, we import the stopwords and extend the list with a few corpus-specific words:

from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])
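
To make the effect of the extended list concrete, here is a small sketch that does not need NLTK. The shortened stopword list and the token list are made up for illustration, and converting the list to a set is an optional optimization we suggest for faster membership checks, not something the steps in this chapter require.

```python
# Hypothetical, shortened stand-in for the NLTK English stopword list.
base_stop_words = ["the", "is", "a", "to", "this"]
stop_words = base_stop_words + ["from", "subject", "re", "edu", "use"]

# A set gives constant-time membership checks (optional optimization).
stop_set = set(stop_words)

tokens = ["from", "subject", "what", "car", "is", "this", "edu"]
filtered = [token for token in tokens if token not in stop_set]
print(filtered)
```

Words such as 'from', 'subject', and 'edu' appear in almost every newsgroup header, so they carry no topical signal and are worth removing along with the ordinary English stopwords.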

Cleaning Up the Text

Next, we tokenize each document into a list of words with Gensim's simple_preprocess(), which also removes punctuation and unnecessary characters. To do this, we create a function named sent_to_words():

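def sent_to_words(sentences):
    # Yield one list of lowercase tokens per document; deacc=True strips accent marks
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

# data is the list of cleaned post strings (defined in the complete example)
data_words = list(sent_to_words(data))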
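
To see what cleaning and tokenizing do to a single post, the sketch below applies the three regular expressions from the complete example and then a rough standard-library approximation of simple_preprocess(). The function names, the sample text, and the example.com address are made up for illustration, and the approximation is an assumption about the general behavior (lowercasing, keeping alphabetic tokens of 2 to 15 characters, stripping accents), not Gensim's actual code.

```python
import re
import unicodedata

def clean_text(text):
    """Apply the three regex clean-ups used in the complete example."""
    text = re.sub(r"\S*@\S*\s?", "", text)  # drop e-mail addresses
    text = re.sub(r"\s+", " ", text)         # collapse whitespace and newlines
    return re.sub(r"'", "", text)            # drop single quotes

def rough_preprocess(text, min_len=2, max_len=15):
    """Rough stdlib approximation of simple_preprocess(text, deacc=True)."""
    # Strip accent marks by decomposing characters and dropping combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Keep lowercase alphabetic runs; digits and punctuation act as separators.
    tokens = re.findall(r"[^\W\d_]+", no_accents.lower())
    return [t for t in tokens if min_len <= len(t) <= max_len]

sample = "From: joe@example.com (Joe)\nSubject: WHAT car is this!?\nI can't say, it's a café."
tokens = rough_preprocess(clean_text(sample))
print(tokens)
```

Note that header words such as 'from' and 'subject' survive this step, which is why they were added to the stopword list earlier.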

Building Bigram and Trigram Models

As we know, bigrams are two words that frequently occur together in a document, and trigrams are three words that frequently occur together. We can build both with Gensim's Phrases model:

bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
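
The min_count and threshold parameters control which word pairs are promoted to phrases. The sketch below illustrates the scoring idea from Mikolov et al. on which, as far as we know, Gensim's default scorer is based: a pair scores highly when it occurs together far more often than its individual word counts would suggest. The function name, the toy sentences, and the use of the unigram vocabulary size as the scaling factor are simplifying assumptions; Gensim's internal bookkeeping differs in detail.

```python
from collections import Counter

def bigram_scores(sentences, min_count=1):
    """Score adjacent word pairs: (count(a, b) - min_count) * vocab_size / (count(a) * count(b))."""
    unigrams = Counter(word for sent in sentences for word in sent)
    bigrams = Counter(pair for sent in sentences for pair in zip(sent, sent[1:]))
    vocab_size = len(unigrams)
    return {
        (a, b): (n_ab - min_count) * vocab_size / (unigrams[a] * unigrams[b])
        for (a, b), n_ab in bigrams.items()
    }

# Hypothetical tokenized sentences.
sentences = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
    ["new", "york", "has", "a", "big", "park"],
    ["the", "park", "is", "new"],
]

scores = bigram_scores(sentences, min_count=1)
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

A pair is kept as a phrase only when its score exceeds the threshold, so a higher threshold (such as the 100 used above) yields fewer phrases.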

Filtering Out Stopwords

Next, we need to filter out the stopwords. Along with that, we also create functions to make bigrams and trigrams and to lemmatize the text:

def remove_stopwords(texts):
    # Re-tokenize each document and drop the stopwords
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    # Keep only the lemmas of tokens whose part of speech is in allowed_postags
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

Building the Dictionary and Corpus for the Topic Model

Next, we need to build the dictionary and the corpus. We did this in the earlier examples as well:

id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
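
The dictionary maps each unique token to an integer id, and the corpus stores each document as a bag of words, that is, a list of (token id, count) pairs. The following sketch mimics this with plain Python so the data shapes are easy to inspect. The helper names and toy documents are made up for illustration, and the ids Gensim assigns may differ from the ones produced here.

```python
from collections import Counter

def build_token2id(texts):
    """Assign an integer id to each unique token, in order of first appearance."""
    token2id = {}
    for text in texts:
        for token in text:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bow(text, token2id):
    """Convert a token list to sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[token] for token in text if token in token2id)
    return sorted(counts.items())

# Hypothetical lemmatized documents.
toy_texts = [
    ["car", "engine", "car"],
    ["disk", "cpu", "disk", "disk"],
]

token2id = build_token2id(toy_texts)
toy_corpus = [to_bow(text, token2id) for text in toy_texts]
print(token2id)
print(toy_corpus)
```

Word order is discarded in this representation; LDA only sees how often each word occurs in each document.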

Building the LDA Topic Model

Everything needed to train the LDA model is now in place, so we can build the LDA topic model. In our example, this is done with the following lines of code:

lda_model = gensim.models.ldamodel.LdaModel(
   corpus=corpus, id2word=id2word, num_topics=20, random_state=100,
   update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

Implementation Example

Let's look at the complete implementation example for building the LDA topic model:

import re
import numpy as np
import pandas as pd
from pprint import pprint
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel
import spacy
import pyLDAvis
import pyLDAvis.gensim  # in pyLDAvis 3.x this module is named pyLDAvis.gensim_models
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Load the training split and clean the raw text
newsgroups_train = fetch_20newsgroups(subset='train')
data = newsgroups_train.data
data = [re.sub(r'\S*@\S*\s?', '', sent) for sent in data]  # remove e-mail addresses
data = [re.sub(r'\s+', ' ', sent) for sent in data]  # collapse whitespace and newlines
data = [re.sub(r"'", "", sent) for sent in data]  # remove single quotes

# Tokenize each document into a list of words
def sent_to_words(sentences):
    for sentence in sentences:
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

data_words = list(sent_to_words(data))
print(data_words[:4])  # prints the first four tokenized documents

# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)

def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# Remove stopwords, form bigrams, then lemmatize
data_words_nostops = remove_stopwords(data_words)
data_words_bigrams = make_bigrams(data_words_nostops)
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=[
    'NOUN', 'ADJ', 'VERB', 'ADV'
])
print(data_lemmatized[:4])  # prints the first four lemmatized documents

# Build the dictionary and the bag-of-words corpus
id2word = corpora.Dictionary(data_lemmatized)
texts = data_lemmatized
corpus = [id2word.doc2bow(text) for text in texts]
print(corpus[:4])  # prints (token id, frequency) pairs for the first four documents
# Print the same documents with readable words instead of ids
pprint([[(id2word[word_id], freq) for word_id, freq in cp] for cp in corpus[:4]])

# Train the LDA topic model
lda_model = gensim.models.ldamodel.LdaModel(
    corpus=corpus, id2word=id2word, num_topics=20, random_state=100,
    update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True
)

Now we can use the LDA model created above to get the topics and to compute the model perplexity.
