Abstract
A technical introduction, along with working demo code, to two popular topic modeling approaches: Latent Dirichlet Allocation (LDA) and Biterm Topic Modeling (BTM). We use a very large collection of arXiv scholarly paper abstracts (in English) for the former model and a very small collection of short texts (in Traditional Chinese) about environmental violation fines for the latter.

From Wikipedia:
In machine learning and natural language processing, a topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents.
The objective is to cluster documents into different topic groups. The task is usually unsupervised, which means that topics are only implicit and not pre-labeled.
In this notebook we will walk through two popular approaches to topic modeling. Both are Bayesian and can be applied efficiently to large datasets.
LDA (Latent Dirichlet Allocation) is probably the most popular approach to date for topic modeling. It formulates the problem as a generative process, where a document is characterized by a distribution over topics (a topic mixture), and a topic in turn is characterized by a distribution over words. In such a hierarchical framework we can then represent (and indeed generate) a document based on its bag-of-words.
As long as we can fully specify both distributions mentioned above, we are able to build a probabilistic model that can generate the document-term matrix in the following steps:

1. For each document, draw a topic mixture from the document-topic distribution. (For a 3-topic setup a topic mixture may look like: 70% topic-1, 20% topic-2, 10% topic-3.)
2. For each word position in the document, draw a topic from the topic mixture, then draw a word from that topic's word distribution.

The priors for both the document-topic distribution and the topic-word distribution are postulated as Dirichlet distributions, where the parameters are set such that a sparse distribution is preferred.
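To make the generative story concrete, here is a toy simulation of a single document; the vocabulary, number of topics, and all parameter values below are made up purely for illustration:

# A toy simulation of the LDA generative process (all values are made up).
set.seed(42)
# Helper: draw one sample from a Dirichlet distribution via normalized Gammas.
rdirichlet <- function(a) { x <- rgamma(length(a), shape=a); x / sum(x) }
vocab <- c("model", "data", "graph", "neural", "proof")
n_topic <- 3
# Each topic is a distribution over the vocabulary.
topic_word <- t(replicate(n_topic, rdirichlet(rep(.1, length(vocab)))))
# Each document is a distribution over topics (its topic mixture).
doc_topic <- rdirichlet(rep(.1, n_topic))
# Generate a 20-word document: draw a topic, then a word from that topic.
doc <- replicate(20, {
  z <- sample(n_topic, size=1, prob=doc_topic)
  sample(vocab, size=1, prob=topic_word[z,])
})
table(doc)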
For a detailed discussion on Bayesian modeling, readers can refer to the notebook: Bayesian Modeling Explained.
Not a Typical NLU Task
In our technical blog we didn't put this notebook under the Natural Language Understanding category; instead we put it under the Statistics category. This is to emphasize that the model is a pure probabilistic model based on bag-of-words, without any mechanism of semantic embeddings. It is powerful partly thanks to this simplicity.
The Dirichlet distribution is the multivariate generalization of the Beta distribution. It is commonly used as the prior of a multivariate Bayesian model, probably because it is the conjugate prior to the categorical and multinomial distributions. That is, the prior and the posterior belong to the same distribution family.
Dirichlet can be used to describe a distribution over probability vectors:
\[ q = (q_1, q_2, ..., q_k), \]
where each \(q_i\) is a non-negative real number and
\[ \sum_{i=1}^kq_i = 1. \]
A realization of \(q\) is a multinomial event. For example if we roll a die then \(k = 6\) and \(q_i\) is the probability of the outcome having number \(i\) facing up. A fair die hence has \(q_i = \frac{1}{6}\) for all \(i\).
A Dirichlet distribution is parameterized by a vector \(a = (a_1, a_2, ..., a_k)\). When \(k = 2\), Dirichlet \(\text{Dir}(a)\) reduces to a Beta distribution where \(q_1 \sim \text{Beta}(a_1, a_2)\). Indeed, for \(k > 2\) the marginal distribution of each \(q_i\) is still a Beta:
\[ q_i \sim \text{Beta}(a_i, \sum_{j \neq i}a_j). \] The probability density function of Dirichlet is:
\[ P(q\vert a) = \frac{\Gamma(\sum_ia_i)}{\prod_i\Gamma(a_i)}\prod_iq_i^{a_i - 1}, \]
where \(\Gamma\) is the gamma function, which for a positive integer \(x\) reduces to
\[ \Gamma(x) = (x - 1)!. \]
The parameter vector \(a\) controls the sparsity of the distribution.
One thing to note is that although a Dirichlet distribution seems to have \(k\) dimensions, it is actually a distribution over \(k - 1\) dimensions. This is because the constraint \(\sum_{i=1}^kq_i = 1\) means the last element \(q_k\) is already determined by the previous \(k - 1\) elements.
Let’s play around with this distribution using its pdf:
# PDF of the Dirichlet distribution.
ddir <- function(q, a) {
  # Allow for floating point error when checking that q sums to 1.
  stopifnot(isTRUE(all.equal(sum(q), 1)))
  (gamma(sum(a)) / prod(gamma(a))) * prod(q^(a - 1))
}

# Compute the density for each probability vector (row of q) and sort by
# density so that the color gradient in the plot follows the density.
apply_ddir <- function(q, a) {
  d <- apply(q, 1, ddir, a=a)
  q <- cbind(q, d)
  q <- q[order(q[, "d"]),]
  q
}

# 3D scatter plot of (q1, q2, density) using rgl.
plot_dir <- function(q, outfile=NULL, ...) {
  rgl::plot3d(x=q[, "q1"], y=q[, "q2"], z=q[, "d"],
              xlab="q1", ylab="q2", zlab="density",
              col=heat.colors(nrow(q), rev=TRUE))
  if ( is.null(outfile) ) {
    rgl::rglwidget()
  } else {
    rgl::writeWebGL(filename=outfile, ...)
  }
}
# Create some random probability vectors given k = 3.
set.seed(777)
nsample <- 1000
q1 <- runif(nsample)
# Draw q2 uniformly from [0, 1 - q1] so that q1 + q2 <= 1.
q2 <- mapply(runif, n=rep(1, nsample), max=1 - q1, MoreArgs=list(min=0))
q <- cbind(q1, q2, q3=1 - q1 - q2)
# Allow for floating point error when checking that each row sums to 1.
stopifnot(isTRUE(all.equal(rowSums(q), rep(1, nsample))))
For \(k = 3\) the distribution is effectively over only 2 dimensions, so we can visualize the first 2 probability dimensions with the Dirichlet density as the 3rd dimension. Let's fix \(a_3 = 1\) and assume both \(a_1\) and \(a_2\) are smaller than 1:
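A representative call using the helpers defined above, with illustrative values \(a = (0.5, 0.5, 1)\):

# Visualize the Dirichlet density with a_1, a_2 < 1 and a_3 = 1.
plot_dir(apply_ddir(q, a=c(.5, .5, 1)))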
The distributional difference is mainly driven by the sign of \(a_i - 1\). That is, for \(a_i > 1\) the distribution is dense and for \(a_i < 1\) it is sparse. It is sparse in the sense that only a few dimensions will account for the majority of the proportion.
It is common for a model to have a Dirichlet prior with equal values in the \(a\) parameter vector. For a sparse prior (usually used in LDA):
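For example, with illustrative values \(a = (0.1, 0.1, 0.1)\):

# A sparse symmetric prior: all a_i equal and smaller than 1.
plot_dir(apply_ddir(q, a=c(.1, .1, .1)))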
For a dense prior instead:
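Again with illustrative values, say \(a = (5, 5, 5)\):

# A dense symmetric prior: all a_i equal and larger than 1.
plot_dir(apply_ddir(q, a=c(5, 5, 5)))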
Let's use a HUGE dataset to demonstrate the model: the arXiv dataset. We will use only the abstracts of scholarly papers in the field of computer science. There are millions of records stored in JSON Lines format. We parse them using jsonlite in a streaming fashion:
library(data.table)
library(jsonlite)

# The original file is unzipped and then re-compressed with gzip for ease of operation.
infile <- "data/arxiv-metadata-oai-snapshot.json.gz"
outfile <- "data/arxiv_abstract.json.gz"
outfile_s <- "data/arxiv_abstract_cs.json.gz"

if ( !file.exists(outfile_s) ) {
  # This could take quite some time to finish.
  outf <- gzfile(outfile, "wb")
  f <- gzfile(infile, "rb")
  stream_in(f,
    handler=function(df) {
      # We only need a subset of the fields.
      setDT(df)
      df <- df[, .(id, title, categories, abstract)]
      df <- unique(df, by="id")  # There are duplicates.
      # Remove newline chars and leading and trailing white spaces.
      df[, abstract:=tolower(trimws(gsub("\n", " ", abstract)))]
      stream_out(df, outf, pagesize=50000)
    },
    pagesize=50000
  )
  close(f)
  close(outf)

  # Read back and de-dup again.
  f <- gzfile(outfile, "rb")
  arxiv <- stream_in(f, pagesize=50000)
  close(f)
  setDT(arxiv)
  arxiv <- unique(arxiv, by="id")
  setkey(arxiv, id)

  # Filter for computer science papers only.
  arxiv_cs <- arxiv[categories %like% "(\\s|^)cs\\."]
  outf <- gzfile(outfile_s, "wb")
  stream_out(arxiv_cs, outf, pagesize=50000)
  close(outf)
}

arxiv_cs <- stream_in(gzfile(outfile_s, "rb"), pagesize=50000)
setDT(arxiv_cs)
arxiv_cs[, categories:=as.character(categories)]  # Fix type.
dim(arxiv_cs)  # Check the size of the resulting dataset.
[1] 268671 4
We use an efficient implementation, text2vec (Selivanov, Bickel, and Wang 2020), to train the LDA model. For tokenization we use a simple tokenizer that handles English without any problem and also removes all punctuation.
library(text2vec)
packageVersion("text2vec")

[1] '0.6'
dir.create("model", showWarnings=FALSE)
# Simple tokenization.
it <- itoken(arxiv_cs$abstract, ids=arxiv_cs$id, tokenizer=word_tokenizer)
# Filter vocab.
v <- create_vocabulary(it)
v <- prune_vocabulary(v, doc_count_min=50, doc_proportion_max=.2)
# Create doc-term sparse matrix
vectorizer <- vocab_vectorizer(v)
dtm <- create_dtm(it, vectorizer, type="dgTMatrix")
# Model fit.
# Beware that the API requires the prior parameter to be a single value,
# though theoretically it can be a vector of length = number of topics.
# The same for the topic-word distribution prior.
model_file <- "model/lda_model.RData"
if ( !file.exists(model_file) ) {
  lda_model <- LDA$new(n_topics=20, doc_topic_prior=.1, topic_word_prior=.01)
  doc_topic_distr <- lda_model$fit_transform(x=dtm, n_iter=1000, progressbar=FALSE)
  save(lda_model, file=model_file)
} else {
  # Since only topic-word distributions are saved with the model object,
  # we will need to transform again to get the doc-topic distribution of our corpus.
  load(model_file)
  doc_topic_distr <- lda_model$transform(x=dtm)
}
INFO [22:16:21.870] early stopping at 20 iteration
After an LDA model is fitted, each document is assigned a topic mixing proportion. For example, we can take a look at the topic proportions for one of my favorite papers about neural ranking models:
library(ggplot2)
doc_id <- "1709.03856" # StarSpace.
df <- data.table(topic_index=factor(1:20), topic_prop=doc_topic_distr[doc_id,])
ggplot(df, aes(x=topic_index, y=topic_prop)) +
  labs(title=arxiv_cs[id == doc_id, title]) +
  geom_bar(stat="identity")
Remember that each topic is just a distribution over words (with a Dirichlet prior). For example, we can see what the most popular words are in the 3rd and the 10th topics:
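Such a table can be produced with text2vec's get_top_words() method; a minimal sketch (here with lambda=1, i.e., ranking purely by within-topic word probability):

# Top 20 words for topic 3 and topic 10, ranked by within-topic probability.
lda_model$get_top_words(n=20, topic_number=c(3, 10), lambda=1)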
[,1] [,2]
[1,] "language" "algorithms"
[2,] "text" "learning"
[3,] "word" "search"
[4,] "task" "methods"
[5,] "information" "problems"
[6,] "models" "set"
[7,] "semantic" "decision"
[8,] "knowledge" "models"
[9,] "words" "optimization"
[10,] "speech" "selection"
[11,] "domain" "real"
[12,] "translation" "more"
[13,] "natural" "machine"
[14,] "embedding" "many"
[15,] "system" "different"
[16,] "state" "been"
[17,] "embeddings" "approaches"
[18,] "different" "techniques"
[19,] "tasks" "framework"
[20,] "dataset" "test"
One of the big challenges in the LDA approach is not the model training itself, but the interpretation of the results.
Sievert and Shirley (2014) propose a nice framework for this, with a visual tool for interactive inspection of LDA results.
In their work the relevance of a term \(w\) to a specific topic \(k\) is measured by
\[ r(w, k \vert \lambda) = \lambda \log(\phi_{kw}) + (1 - \lambda)\log(\frac{\phi_{kw}}{p_w}), \]
where \(\phi_{kw}\) is the probability of term \(w\) appearing in topic \(k\), \(p_w\) is the marginal probability of term \(w\) in the corpus (marginalized over topics), and \(\lambda\) is a parameter controlling the weight of the within-topic term probability against its lift (as measured by \(\frac{\phi_{kw}}{p_w}\)).
In this setup, setting \(\lambda = 1\) effectively ranks the relevance only by the word probability within the topic. So the most popular word within the topic will also be the most relevant word. If a word is popular over the entire corpus (a common word), it will likely appear at a high rank as well, which does not help in interpreting the given topic.
In their study on determining the optimal \(\lambda\) they use the Newsgroup dataset with human testing and find that \(\lambda = 0.6\) is the best choice, at least for their particular model.
The lesson is that we should not simply choose a trivial \(\lambda\) value on the boundary (neither 1 nor 0), but experiment with a value in between.
Back to our previous example, we can inspect the top relevant words in the 3rd topic given different \(\lambda\) values:
rel <- list()
lambdas <- c(.2, .6, 1)
for ( lambda in lambdas ) {
  rel[[as.character(lambda)]] <-
    lda_model$get_top_words(n=10, topic_number=c(3), lambda=lambda)
}
rel <- do.call(cbind, rel)
colnames(rel) <- sprintf("lambda_%s", lambdas)
print(rel)
lambda_0.2 lambda_0.6 lambda_1
[1,] "text" "text" "language"
[2,] "word" "language" "text"
[3,] "sentence" "word" "word"
[4,] "corpus" "words" "task"
[5,] "embeddings" "semantic" "information"
[6,] "english" "speech" "models"
[7,] "words" "embeddings" "semantic"
[8,] "sentences" "translation" "knowledge"
[9,] "language" "corpus" "words"
[10,] "document" "sentence" "speech"
For this particular example, some words are robustly relevant regardless of the \(\lambda\) values. That is, of course, not always the case.
In LDAvis, the left panel visualization (presented at the end of this section) by default shows the topics projected onto the top-2 principal components of their pairwise distances, measured by Jensen-Shannon divergence, a measure of the distance between two distributions.
Remember that in LDA a topic is modeled as a word distribution. Given any two topics we'd like to compute their distributional distance to understand how well we cluster the documents into topic groups. This can help us answer questions such as: are \(k\) topics enough to separate our corpus, or do we need more (or fewer) topics?
The Jensen-Shannon divergence is a symmetric version of the Kullback–Leibler divergence, or relative entropy. The KL divergence of a distribution P from a distribution Q is defined as (assuming a discrete setup):
\[ D_{KL}(P \vert\vert Q) = \sum_xP(x)\log\frac{P(x)}{Q(x)}. \]
It is the expectation, under \(P\), of the logarithmic difference between \(P\) and \(Q\).
The Jensen-Shannon divergence is defined as:
\[ JSD(P, Q) = \frac{1}{2}D_{KL}(P\vert\vert M) + \frac{1}{2}D_{KL}(Q\vert\vert M), \]
where
\[ M = \frac{1}{2}(P + Q), \]
is the mixture of the two distributions.
If we expand the above equation:
\[ \begin{aligned} JSD(P, Q) &= \frac{1}{2}D_{KL}(P\vert\vert M) + \frac{1}{2}D_{KL}(Q\vert\vert M) \\ &= \frac{1}{2}\sum_xP(x)\log\frac{P(x)}{M(x)} + \frac{1}{2}\sum_xQ(x)\log\frac{Q(x)}{M(x)} \\ &= \frac{1}{2}\bigg[\sum_xP(x)\log P(x) + \sum_xQ(x)\log Q(x)\bigg] - \sum_x\bigg(\frac{P(x) + Q(x)}{2}\bigg)\log\bigg(\frac{P(x) + Q(x)}{2}\bigg) \\ &= H_M - \frac{H_P + H_Q}{2}, \end{aligned} \]
where \(H_D\) is the entropy of distribution \(D\):
\[ H_D = - \sum_xD(x)\log D(x). \]
Hence the JSD is indeed the entropy of the mixture minus the mixture of the entropies.
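To make the computation concrete, here is a minimal JSD sketch in R; applying it to the fitted model assumes text2vec's topic_word_distribution field, which holds the topics-by-words probability matrix:

# Jensen-Shannon divergence between two discrete distributions p and q
# defined over the same support (both sum to 1).
jsd <- function(p, q) {
  m <- (p + q) / 2
  kl <- function(a, b) sum(a[a > 0] * log(a[a > 0] / b[a > 0]))
  (kl(p, m) + kl(q, m)) / 2
}
# For example, the distance between the word distributions of topic 3 and topic 10.
phi <- lda_model$topic_word_distribution  # n_topics x vocab matrix.
jsd(phi[3,], phi[10,])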
Once the JSD between all pairs of topics is computed, multidimensional scaling is used to convert the distance matrix into 2-dimensional coordinates for visualization purposes.
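The interactive LDAvis page itself can be generated from the fitted model; a minimal sketch assuming text2vec's $plot() method, which delegates to the LDAvis package (hence the messages below):

# Write the interactive LDAvis page to disk.
lda_model$plot(out.dir="model/ldavis")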
Warning in dir.create(out.dir): 'model/ldavis' already exists
Loading required namespace: servr
Instead of utilizing document-word co-occurrence, Yan et al. (2013) use a word-word co-occurrence approach for topic modeling. Their approach is particularly designed for short texts, where a document-level framework such as LDA may fail due to data sparsity.
In BTM, the whole corpus is considered a mixture of topics. This is different from LDA where each document is considered as a mixture of topics. In this framework, the probability of a biterm \(b = (w_i, w_j)\) can be expressed as:
\[ P(b) = \sum_zP(z)P(w_i\vert z)P(w_j\vert z), \] where \(z\) is a topic drawn from the corpus-level topic distribution, which itself has a Dirichlet prior.
Now how do we recover the document-level topic distribution? This is still necessary since we are doing topic modeling: we need to characterize each document \(d\), a relatively short text in the biterm use case, with topic proportions. This is done by assuming the following equation:
\[ P(z\vert d) = \sum_bP(z\vert b)P(b\vert d). \] That is, the topic proportions of a document are determined by its biterms.
In the above equation, \(P(z\vert b)\) can be solved by Bayes’ theorem:
\[ P(z\vert b) = \frac{P(b\vert z)P(z)}{P(b)}, \]
and \(P(b\vert d)\) is estimated directly from the empirical distribution of biterms found in the document \(d\).
The model can then be estimated by MCMC (collapsed Gibbs sampling in the original paper).
Let's use the R package BTM (Wijffels 2020) to demonstrate the use of biterm topic modeling.
library(BTM)
packageVersion("BTM")

[1] '0.3.1'
For the data, we use the violation case reports from the Environmental Protection Administration in Taiwan. This is a public dataset available on the government open data platform: https://data.gov.tw/dataset/10165. It is a very small dataset of short texts describing firms' violations of environmental protection rules.
This time we use CkipTagger (Li, Fu, and Ma 2020), a neural language model specializing in the Traditional Chinese used in Taiwan, to tokenize the text. Since the CkipTagger API is available only in Python as of now, we switch to Python for the data processing:
import csv
import requests

url = "https://data.epa.gov.tw/api/v1/doc_p_17?limit=1000&api_key=9be7b239-557b-4c10-9775-78cadfc555e9&format=csv"
response = requests.get(url)

sents = []
csvreader = csv.reader(response.content.decode("utf-8").splitlines(), delimiter=",")
for row in csvreader:
    sents.append(row[4])  # The 5th column is the `fact` column containing the description.
sents = sents[1:]  # Drop the header row.

# Print the first 3 cases.
for s in sents[:3]:
    print(s)
執行地下水採樣出具之檢測報告,未符檢驗室品質管理手冊23-1-1及23-2-1之規定
105年12月1日前之稽查案件,於環保稽查處分管制系統(EEMS系統)無填報違反事實。
未依檢測方法執行空氣污染物監測作業
Now let’s parse the text:
import os
import ckiptagger

# Load the pre-trained model, downloading it first if not present locally.
model_dir = "/media/kylechung/D/Data/ckip_data"
if not os.path.exists(model_dir):
    # The archive is extracted into a folder named "data" under the parent
    # directory; rename it to match our model_dir.
    ckiptagger.data_utils.download_data_gdown(os.path.dirname(model_dir))
    os.rename(os.path.join(os.path.dirname(model_dir), "data"), model_dir)

ckip_ws = ckiptagger.WS(model_dir)
toks = ckip_ws(sents)
Show the first case in tokens:
執行地下水採樣出具之檢測報告,未符檢驗室品質管理手冊23-1-1及23-2-1之規定
['執行', '地下水', '採樣', '出具', '之', '檢測', '報告', ',', '未', '符', '檢驗室', '品質', '管理', '手冊', '23', '-', '1', '-', '1', '及', '23', '-', '2', '-1', '之', '規定']
The quality of tokenization is very high indeed.
Before we feed those tokens into BTM, we'd like to do one more thing: filter for nouns only. This can potentially remove lots of noise from our already short texts. We can do POS tagging easily with CkipTagger:
# Do POS tagging.
ckip_pos = ckiptagger.POS(model_dir)
pos = ckip_pos(toks)

# Write out only nouns (common, proper, and place nouns).
outfile = "data/tokens.tsv"
if not os.path.exists(outfile):
    with open(outfile, "w") as f:
        for i, (l_tok, l_pos) in enumerate(zip(toks, pos)):
            for t, p in zip(l_tok, l_pos):
                if p in ["Na", "Nb", "Nc"]:
                    _ = f.write(f"{i}\t{t}\n")
Now let's load the processed data back into R, where we can run the BTM model API. The processed data is in long format, with 1 row per token:
library(data.table)
fine <- fread(py$outfile, header=FALSE)
setnames(fine, c("doc_id", "token"))
head(fine)
Run BTM with 3 topics and show the estimated topic distribution (characterizing the whole corpus):
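The fitting call itself is not shown above; here is a minimal sketch using the BTM package API (the seed and the number of Gibbs iterations are arbitrary choices of ours):

# Fit a biterm topic model with 3 topics on the long-format token data.
set.seed(321)
btm_model <- BTM(fine, k=3, iter=1000)
# Corpus-level topic proportions.
btm_model$theta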
[1] 0.3988618 0.2900227 0.3111155
Show top words in each topic:
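A minimal sketch of the corresponding call, via the package's terms() method:

# Top 10 tokens per topic along with their within-topic probabilities.
terms(btm_model, top_n=10)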
[[1]]
token probability
1 地下水 0.07503462
2 標準 0.06992433
3 土壤 0.04586858
4 方法 0.04025973
5 濃度 0.02885506
6 水 0.02318388
7 規定 0.02075338
8 鎳 0.02050410
9 鋅 0.01701415
10 場址 0.01688950
[[2]]
token probability
1 作業 0.03018938
2 方法 0.02753073
3 限值 0.02650158
4 空氣 0.02641582
5 規定 0.02633005
6 需氧量 0.02367140
7 標準 0.02195615
8 署 0.01886869
9 環境 0.01835411
10 化學 0.01732495
[[3]]
token probability
1 環境 0.05955287
2 計畫 0.05171918
3 內容 0.02917732
4 影響 0.02582002
5 報告 0.01830607
6 說明書 0.01750671
7 環評 0.01662742
8 差異 0.01614780
9 工程 0.01486883
10 本案 0.01430928
The result suggests there are 3 interpretable topics: about water pollution (水), air pollution (空氣), and environmental impact assessment (環評). But we also see lots of common words in each topic.
To infer the document-level topic distributions we use the predict API. For example, we can infer the topics for the first 5 cases:
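A minimal sketch of the corresponding call (ids 0 to 4 are the first 5 cases, since the ids were assigned by Python's enumerate):

# Infer topic proportions for the first 5 documents.
predict(btm_model, newdata=fine[doc_id < 5])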
[,1] [,2] [,3]
0 0.19204327 0.7997814998 0.0081752334
1 0.99957927 0.0003017658 0.0001189621
2 0.03850882 0.9530984162 0.0083927607
3 0.82751386 0.1715276100 0.0009585308
4 0.12923788 0.8528584675 0.0179036524
Let’s also print the original fact descriptions:
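One way to do this from R is to pull the raw sentences back from the Python session via reticulate's py object (already used above to read the token file); a minimal sketch:

# Print the first 5 fact descriptions, prefixed with their doc ids.
writeLines(paste(0:4, py$sents[1:5]))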
0 執行地下水採樣出具之檢測報告,未符檢驗室品質管理手冊23-1-1及23-2-1之規定
1 105年12月1日前之稽查案件,於環保稽查處分管制系統(EEMS系統)無填報違反事實。
2 未依檢測方法執行空氣污染物監測作業
3 執行地下水採監測現場採樣時,未依檢測方法執行樣品檢測,違反本署公告檢測標準方法「監測井地下水採樣方法 NIEA W103.54B」及「水執檢測方法總則 NIEA W102.51C」之規定。
4 執行空氣採樣時,未依檢測方法執行樣品檢測,違反本署公告「空氣中粒狀污染物-高量採樣法NIEA A102.12A」之規定。
The first case is about groundwater quality, while the topic proportions suggest a larger weight on the air topic. This may be considered a failed identification.
The second case indeed belongs to none of the 3 topics. It suggests that we may want to increase k to identify additional topics (or topics that are currently ambiguous).
The remaining 3 cases are all reasonably identified.
As a toy example we will stop here, but the general idea and a possible workflow should now be clear.
Li, Peng-Hsuan, Tsu-Jui Fu, and Wei-Yun Ma. 2020. “Why Attention? Analyze BiLSTM Deficiency and Its Remedies in the Case of NER.” In AAAI, 8236–44.
Selivanov, Dmitriy, Manuel Bickel, and Qing Wang. 2020. Text2vec: Modern Text Mining Framework for R. https://CRAN.R-project.org/package=text2vec.
Sievert, Carson, and Kenneth Shirley. 2014. “LDAvis: A Method for Visualizing and Interpreting Topics.” In Proceedings of the Workshop on Interactive Language Learning, Visualization, and Interfaces, 63–70.
Wijffels, Jan. 2020. BTM: Biterm Topic Models for Short Text. https://CRAN.R-project.org/package=BTM.
Yan, Xiaohui, Jiafeng Guo, Yanyan Lan, and Xueqi Cheng. 2013. “A Biterm Topic Model for Short Texts.” In Proceedings of the 22nd International Conference on World Wide Web, 1445–56.