Topic modeling is a subcategory of unsupervised machine learning, and a clustering task in particular. The main purpose of a topic model is to assign topics to unlabeled text documents. A typical application, for example, is categorizing social media posts into categories such as sports, finance, world news, politics, and local news.
The specific technique applied here is called Latent Dirichlet Allocation (LDA). LDA is a Bayesian statistical approach that tries to find groups of key words that frequently appear together across the texts. These key words represent the aspects of each topic.
In essence, LDA takes the bag-of-words matrix of the preprocessed texts as input and decomposes it into two new matrices: a document-to-topic matrix and a topic-to-word matrix. Because the product of these two matrices should reproduce the input bag-of-words matrix, LDA tries to find topics that can reconstruct the bag-of-words matrix with the lowest possible error.
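This decomposition can be sketched with scikit-learn on a tiny made-up corpus (the documents and parameter values here are illustrative assumptions, not part of the IMDB example below):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Four toy documents: two about animals, two about finance
docs = ['cats chase mice', 'dogs chase cats',
        'stocks rose today', 'markets and stocks fell']

count = CountVectorizer()
X = count.fit_transform(docs)      # bag-of-words matrix: 4 docs x vocabulary

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(X)   # document-to-topic matrix, shape (4, 2)
word_topic = lda.components_       # topic-to-word matrix, shape (2, vocab size)

print(doc_topic.shape, word_topic.shape)
# Each row of doc_topic is a probability distribution over the 2 topics,
# so every row sums to (approximately) 1:
print(doc_topic.sum(axis=1))
```

Multiplying `doc_topic` by `word_topic` yields an approximation of the original bag-of-words matrix, which is exactly the reconstruction LDA optimizes.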
The following code snippets show an application of topic modeling to IMDB movie review texts.
# Step 1: read the IMDB review data into a DataFrame.
# The archive from 'http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz'
# has been downloaded and extracted.
import os
import sys
import numpy as np
import pandas as pd
import pyprind

basepath = 'aclImdb'
labels = {'pos': 1, 'neg': 0}
pbar = pyprind.ProgBar(50000, stream=sys.stdout)
df = pd.DataFrame()
for s in ('test', 'train'):
    for l in ('pos', 'neg'):
        path = os.path.join(basepath, s, l)
        for file in sorted(os.listdir(path)):
            with open(os.path.join(path, file),
                      'r', encoding='utf-8') as infile:
                txt = infile.read()
            df = pd.concat([df, pd.DataFrame([txt, labels[l]]).transpose()],
                           ignore_index=True)
            pbar.update()
df.columns = ['review', 'sentiment']

# Shuffle the DataFrame:
np.random.seed(0)
df = df.reindex(np.random.permutation(df.index))

# Save the assembled data as a CSV file:
df.to_csv('movie_data.csv', index=False, encoding='utf-8')
df = pd.read_csv('movie_data.csv', encoding='utf-8')
# the following is necessary on some computers:
df = df.rename(columns={"0": "review", "1": "sentiment"})
df.head(3)
# output: the first three review texts and their labels
# (note: the topic model does not use the label information)
   review                                             sentiment
0  In 1974, the teenager Martha Moxley (Maggie Gr...        1
1  OK... so... I really like Kris Kristofferson a...        0
2  ***SPOILER*** Do not read this, if you think a...        0
# Step 2: create the bag-of-words matrix, a 50,000 x 5,000 matrix
# (50,000 texts, 5,000-word vocabulary)
from sklearn.feature_extraction.text import CountVectorizer

count = CountVectorizer(stop_words='english',
                        max_df=.1,
                        max_features=5000)
X = count.fit_transform(df['review'].values)
# Step 3: create the LDA model and train it with the bag-of-words matrix as input.
# We set 10 topics, and each iteration uses all information in the matrix (batch).
from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=10,
                                random_state=123,
                                learning_method='batch')
X_topics = lda.fit_transform(X)
# Step 4: print the 5 most important words for each topic
n_top_words = 5
feature_names = count.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f'Topic {topic_idx + 1}:')
    print(' '.join([feature_names[i]
                    for i in topic.argsort()[:-n_top_words - 1:-1]]))
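The slice `[:-n_top_words - 1:-1]` walks the `argsort` result backwards to pick the indices of the largest weights, largest first. A minimal illustration with made-up weights:

```python
import numpy as np

topic = np.array([0.1, 0.7, 0.3, 0.9, 0.2])  # hypothetical word weights
n_top_words = 2

order = topic.argsort()            # indices from smallest to largest weight
print(order)                       # [0 4 2 1 3]

top = order[:-n_top_words - 1:-1]  # take the last two indices, reversed
print(top)                         # [3 1]: the two highest-weight words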
#output
Topic 1:
worst minutes awful script stupid
Topic 2:
family mother father children girl
Topic 3:
american war dvd music tv
Topic 4:
human audience cinema art sense
Topic 5:
police guy car dead murder
Topic 6:
horror house sex girl woman
Topic 7:
role performance comedy actor performances
Topic 8:
series episode war episodes tv
Topic 9:
book version original read novel
Topic 10:
action fight guy guys cool
Based on the 5 most important words for each topic, we may assign the following candidate topics to the IMDB movie reviews:

1. Generally bad movies (not really a topic category)
2. Movies about families
3. War movies
4. Art movies
5. Crime movies
6. Horror movies
7. Comedies
8. Movies somehow related to TV shows
9. Movies based on books
10. Action movies
# Step 5: to confirm our guesses about the topics, print the 3 review texts with
# the highest probability of belonging to 'Horror movies' (topic 6, at index 5).
horror = X_topics[:, 5].argsort()[::-1]
for iter_idx, movie_idx in enumerate(horror[:3]):
    print(f'\nHorror movie #{iter_idx + 1}:')
    print(df['review'][movie_idx][:300], '...')
#output
Horror movie #1:
House of Dracula works from the same basic premise as House of Frankenstein from the year before; namely that Universal's three most famous monsters; Dracula, Frankenstein's Monster and The Wolf Man are appearing in the movie together. Naturally, the film is rather messy therefore, but the fact that ...
Horror movie #2:
Okay, what the hell kind of TRASH have I been watching now? "The Witches' Mountain" has got to be one of the most incoherent and insane Spanish exploitation flicks ever and yet, at the same time, it's also strangely compelling. There's absolutely nothing that makes sense here and I even doubt there ...
Horror movie #3:
<br /><br />Horror movie time, Japanese style. Uzumaki/Spiral was a total freakfest from start to finish. A fun freakfest at that, but at times it was a tad too reliant on kitsch rather than the horror. The story is difficult to summarize succinctly: a carefree, normal teenage girl starts coming fac ...
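Beyond ranking reviews within one topic, each document can also be assigned to its single most probable topic by taking the argmax over the rows of the document-to-topic matrix. A self-contained sketch with made-up probabilities (the variable name `topic_probs` stands in for `X_topics` above):

```python
import numpy as np

# Hypothetical topic distributions for 3 documents over 4 topics;
# each row sums to 1, as in the output of lda.fit_transform
topic_probs = np.array([[0.1, 0.2, 0.6, 0.1],
                        [0.7, 0.1, 0.1, 0.1],
                        [0.2, 0.2, 0.2, 0.4]])

dominant = topic_probs.argmax(axis=1)
print(dominant)   # [2 0 3]: index of the most probable topic per document
```

This is a common way to turn the soft topic distributions into hard cluster labels, for instance to count how many reviews fall under each candidate topic.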
You can also watch the video on our YouTube channel for more details of this topic model application.