Topic modeling with transformers: farewell LDA


Topic modeling is an unsupervised learning procedure, similar to clustering, used in exploratory analysis of text. It classifies text into different sets known as topics. In this post I give a brief introduction to topic modeling, introduce the classic LDA method and explain why transformer-based methods (such as my library transformertopic) are much more usable in practice. To see a concrete example, check out this post where I apply the method to COVID news, and in particular the last section, where I show with a concrete example why LDA topics are less usable.

Old-school NLP: LDA

Topic modeling has been around since the late ’90s. Probably the most impactful paper in the field is from Blei et al. in 2003 [1], in which the authors introduced the so-called Latent Dirichlet Allocation (LDA). The method, of which countless variants exist nowadays, is a sophisticated Bayesian framework in which topics are defined as probability distributions over a dictionary of words and each sentence (or document) is modeled as a probability distribution over topics (i.e. an ℓ1-normalized linear combination of topics). These distributions are learned via Bayesian inference. See also the Wikipedia page for some more details.

From a very practical standpoint (which is the perspective of this whole post), topics are de facto identified with their word cloud representation, where the larger words (graphically speaking) should be the more representative of the topic (semantically speaking). In LDA-world the natural way to do this is to make a word's graphical size in the word cloud proportional to its probability in the topic.
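To make this concrete, here is a minimal sketch (my own illustration, not code from the post) using scikit-learn's LatentDirichletAllocation: the rows of components_, once normalized, are the topic-word probabilities one would use as word sizes.

```python
# Minimal LDA sketch with scikit-learn (illustrative only; corpus is a toy example).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets reacted to inflation",
    "the central bank raised interest rates",
]

# Bag-of-words counts: LDA only ever sees word frequencies.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# Each row of components_ is an (unnormalized) word distribution for one topic;
# normalizing it gives the probabilities that would size the words in a word cloud.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    probs = topic / topic.sum()
    top = probs.argsort()[::-1][:5]
    print(f"topic {k}:", [(words[i], round(float(probs[i]), 3)) for i in top])
```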

Evaluation of models

Though attempts have been made to find quantitative measures of the quality of the generated topics (see [2] for a review and comparison), in practice the only measure that matters is: “How good are the word clouds?”. Ideally we would like each word cloud to be semantically very different from the others and clearly representable by one or two words (that is, after all, how we commonly think of topics), but it often happens that different topics have a large overlap in their word clouds, and that these consist mostly of very common words or of words with no clear semantic relationship.

Example of a bad word cloud. What type of texts would you expect to be classified as the corresponding topic?

Why LDA performs poorly

I think the underlying reason why LDA often performs poorly in this sense is that it is based on the bag-of-words (BOW) model. In BOW an ordered list is created with all the words in a language, and each word is then encoded as a one-hot vector: 0 everywhere except at the index the word has in the list, where it is 1. A sentence (or a paragraph, a document) is then simply represented as the sum of the one-hot vectors of the words in it. These vectors are the data points we feed into LDA, and thus LDA knows only about word frequency and nothing about semantics: the sentences “Who are you” and “You are who” have identical representations in BOW.
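To see this concretely, here is a small illustration of my own (using scikit-learn's CountVectorizer) of how word order disappears in BOW:

```python
# Bag-of-words demo: word order is discarded, so these two
# sentences map to exactly the same vector.
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["Who are you", "You are who"]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(sentences).toarray()

print(vectorizer.get_feature_names_out())  # ['are' 'who' 'you']
print(bow[0], bow[1])                      # [1 1 1] [1 1 1]
print((bow[0] == bow[1]).all())            # True
```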

Modern NLP: transformers and sentence embeddings

Since BERT [3] came out in 2018, however, NLP has changed forever, and BOW will soon have no other practical purpose than being used in introductory NLP texts as a simple way to vectorize text. In fact BERT, which is based on the Transformer architecture and achieved SOTA performance on 11 different NLP problems (at the time of publication), employs a clever scheme to learn vector representations of sentences known as sentence embeddings.

Sentence embeddings are much better vector representations of sentences than those obtained with BOW, because they approximate semantic relationships between sentences: sentences with similar meaning are close together in this space. The authors achieve this by taking a large corpus of English text and training BERT on two tasks: in the first, the input is a sentence where a random word has been substituted with a special MASK symbol and the model tries to predict the original word; in the second, given a pair of sentences, the model tries to predict whether the second one follows the first in the original text. This is a self-supervised procedure, called like this because the labelled data can be created automatically from any corpus. If you know nothing about transformers but are interested in getting an idea of how they work, I suggest reading The Illustrated Transformer post by Jay Alammar.
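To get a feel for what these embeddings buy us, here is a minimal sketch using the sentence-transformers library; the model name is just a common lightweight choice and an assumption on my part, not necessarily what is used elsewhere in this post:

```python
# Minimal sentence-embedding sketch with the sentence-transformers library.
# The model name is an assumption: a common lightweight choice for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The central bank raised interest rates.",
    "Borrowing costs went up after the monetary policy decision.",
    "My cat sleeps all day.",
]
embeddings = model.encode(sentences)

# Sentences with similar meaning end up close together (high cosine similarity).
print(util.cos_sim(embeddings[0], embeddings[1]))  # relatively high
print(util.cos_sim(embeddings[0], embeddings[2]))  # much lower
```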

transformertopic: how the method works

Since sentence embeddings have this nice property of approximating some sort of semantic space, where different regions of the space correspond to different areas of meaning, they can be clustered and the resulting clusters called topics. In this post, Maarten Grootendorst outlines a simple procedure based on exactly this idea; the steps are:

  1. compute sentence embeddings
  2. reduce dimensionality using UMAP
  3. cluster the vectors with HDBSCAN [4]
  4. use Tfidf to generate a word cloud from a cluster.

Step 2 is not strictly necessary, but it speeds up the computation in step 3. I briefly experimented with PCA and PaCMAP instead of UMAP, but the results were either too slow or of worse quality. Regarding HDBSCAN, it makes a lot of sense here because it has a “bug” (in the sense that, because of it, HDBSCAN is not really a clustering method by the classic definition) that is actually a feature: it does not force a point into a cluster if the point is not close enough to the other points in it, i.e. outliers are left unclassified. This is very useful because, in my experience, these outliers often correspond to low-content sentences, e.g. sentences that are very short or that use mostly very common words.
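Putting steps 1 to 3 together, here is a rough sketch with the umap-learn and hdbscan packages; it is my own illustration of the procedure (the corpus and the embedding model are placeholders), not the actual transformertopic code:

```python
# Sketch of steps 1-3: embed, reduce with UMAP, cluster with HDBSCAN.
# Illustrative only; corpus and model choice are assumptions.
import hdbscan
import umap
from sentence_transformers import SentenceTransformer
from sklearn.datasets import fetch_20newsgroups

# A public corpus used here only for illustration.
docs = fetch_20newsgroups(remove=("headers", "footers", "quotes")).data[:2000]

# Step 1: sentence embeddings.
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(docs)

# Step 2: reduce dimensionality to make the clustering step cheaper.
reduced = umap.UMAP(n_components=5, metric="cosine").fit_transform(embeddings)

# Step 3: density-based clustering; points that fit in no cluster get label -1.
labels = hdbscan.HDBSCAN(min_cluster_size=15).fit_predict(reduced)

print(f"{labels.max() + 1} topics, {(labels == -1).sum()} documents left unclassified")
```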

Unlike in LDA, we are not tied to the topic-as-a-probability-distribution paradigm, so in step 4 we can generate word clouds in a variety of ways: we just need a way to represent a cluster as a list of (word, rank) pairs, and use the rank for the graphical size of the word in the word cloud. The default Tfidf, obtained by dividing the frequency of a word in a cluster by the frequency of that word in the whole corpus, works quite well. I had some fun playing around with fancier ways to rank words, like extracting keywords with Textrank [5] or Kmaxoids [6] (which wasn't meant for this purpose), but in the end Tfidf gives the most consistent results.
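As a rough sketch of step 4 (my own illustration, not the library's implementation): rank each word by its count in the cluster divided by its count in the whole corpus, then render the ranks with the wordcloud package.

```python
# Sketch of step 4: rank = (word count in cluster) / (word count in corpus),
# then use the rank as the graphical size in a word cloud. Illustrative only.
from collections import Counter
from wordcloud import WordCloud

def rank_words(cluster_sentences, all_sentences):
    cluster_counts = Counter(w for s in cluster_sentences for w in s.lower().split())
    corpus_counts = Counter(w for s in all_sentences for w in s.lower().split())
    return {w: c / corpus_counts[w] for w, c in cluster_counts.items()}

corpus = ["rates rise as inflation bites", "cats and dogs are pets",
          "the bank hiked rates again", "my cat sleeps on the mat"]
cluster = [corpus[0], corpus[2]]  # pretend this is one HDBSCAN cluster

ranks = rank_words(cluster, corpus)
print(sorted(ranks.items(), key=lambda kv: kv[1], reverse=True))

# The ranks can be rendered directly with the wordcloud package:
WordCloud().generate_from_frequencies(ranks).to_file("topic.png")
```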

While the author provides his own package, I wrote my own implementation of the method out of curiosity and to adopt a modular design that makes it easy to swap different methods for steps 2 and 4.

Farewell LDA

After having used LDA-based methods and this transformer-based procedure extensively at my previous job in Future Engineering at Fraunhofer IIS, I can confidently say this makes LDA obsolete. The quality of the topics is simply much higher and the word clouds almost always

  • are easy to interpret
  • represent a clear semantic cluster
  • show topics you didn’t know were present.

To be fair, this last point is true of LDA as well (that is the whole point of exploratory analysis after all), but in my experience LDA produces a few gems (clear and insightful topics) amidst lots of noise (topics that are very ambiguous and don’t feel cohesive). With this transformer-based procedure the signal-to-noise ratio is much better.

Still not convinced? Check out the last section of my COVID news post, where I show with a concrete example why LDA topics are less usable.

Conclusions and takeaways

Topic modeling is “clustering for text” and can be used to discover what contents are present in a large corpus and to track them over time. While in the pre-BERT era many LDA variants were designed specifically for this task, with the high quality of the sentence embeddings now available we can simply cluster these like any other type of data and obtain better, or at least more usable, results.

About me

I’m a Math PhD currently working in Blockchain Technical Intelligence. I also have skills in software and data engineering, and I love sports and music! Get in contact and let’s have a chat; I’m always curious about meeting new people.

References

[1] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). “Latent Dirichlet Allocation”. Journal of Machine Learning Research. 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.

[2] Stevens, Keith, et al. (2012). “Exploring Topic Coherence over Many Models and Many Topics”. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning.

[3] Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”. arXiv:1810.04805v2.

[4] Campello, Ricardo J. G. B.; Moulavi, Davoud; Zimek, Arthur; Sander, Jörg (2015). “Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection”. ACM Transactions on Knowledge Discovery from Data. 10 (1): 1–51.

[5] Mihalcea, Rada; Tarau, Paul (2004). “TextRank: Bringing Order into Texts”. Department of Computer Science, University of North Texas.

[6] Bauckhage, Christian; Sifa, Rafet (2015). “k-Maxoids Clustering”. LWA 2015.