COVID news topics

In my pre­vi­ous post on trans­form­ers and top­ic mod­el­ing I intro­duced my library trans­former­topic. In this post I apply it to scraped news about COVID and dis­cov­er a few fun facts: 

  • currently (November 2021) about one third of the sentences in COVID news are about vaccines
  • sentences related to Trump have declined sharply in frequency since December 2020
  • con­cerns about trav­el reg­u­la­tions peaked in May 2021 and have been in sharp decline ever since

Data and methods

I scraped just shy of 12k articles that the UK edition of the Huffington Post categorizes as related to COVID. I then split each article into sentences using spaCy’s Sentencizer and used the resulting list of approximately 395k sentences as the corpus. The following is a plot of the number of sentences per month in the corpus:
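The counts behind a plot like this could be computed roughly as follows. This is only a sketch: the regular expression is a crude stand-in for spaCy’s Sentencizer, and the `articles` list with its `published` and `text` fields is a hypothetical data layout chosen for the illustration.

```python
import re
from collections import Counter
from datetime import date

# Crude stand-in for spaCy's Sentencizer (an assumption for this sketch):
# split after '.', '!' or '?' when followed by whitespace.
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Return the non-empty sentences of a text."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def sentences_per_month(articles):
    """Count sentences per (year, month) over a list of article dicts."""
    counts = Counter()
    for article in articles:
        month = (article["published"].year, article["published"].month)
        counts[month] += len(split_sentences(article["text"]))
    return counts


# Hypothetical toy data, only to show the shape of the input.
articles = [
    {"published": date(2020, 3, 20), "text": "Shops are empty. Lockdown starts Monday!"},
    {"published": date(2020, 3, 25), "text": "Schools close. Will exams go ahead? Nobody knows."},
    {"published": date(2021, 1, 5), "text": "The vaccine rollout begins."},
]
monthly_counts = sentences_per_month(articles)
```

A rule-based splitter like this one will occasionally cut quotations in odd places, which is also visible in some of the example sentences further down.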

I used my pack­age (which you can install with pip install transformertopic) like this:

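# UMAP reducer with the neighbors parameter set to 13
reducer = UmapEmbeddings(umapNNeighbors=13)
# HDBSCAN minimum cluster size of 250
tt = TransformerTopic(dimensionReducer=reducer, hdbscanMinClusterSize=250)
# df: the pandas DataFrame with the data (dates in 'datetime', text in 'text')
tt.train(documentsDataFrame=df, dateColumn='datetime', textColumn='text', copyOtherColumns=True)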

where df is the pandas DataFrame with the data. In other words, I used the default paraphrase-MiniLM-L6-v2 embeddings for SentenceTransformer (you can find more embeddings here), the UMAP reducer with the neighbors parameter set to 13, and an HDBSCAN minimum cluster size of 250.

In this way I obtained 128 topics. This was actually my second attempt: at first I had set the minimum cluster size to 20, which resulted in some 4000 topics, far too many. This is a common occurrence: getting a reasonable number of topics usually requires some trial and error with the minimum cluster size and the neighbors parameter for UMAP (though I found 13 to be the magic number more often than not).

A his­togram of the sizes of the obtained topics

Empirical study of the obtained COVID topics

OK, so what do the topics look like? I will first present 8 of the largest topics, i.e. those to which the greatest number of sentences are assigned, and then a selection of hand-picked topics that I found interesting.

For each of these 8 topics I show the word cloud and the trend: the trend of a topic is simply a plot of the percentage of all sentences in a given month that belong to the topic. I also show some example sentences (mostly to show that the method indeed works: each word cloud gives a clear idea of what content to expect in the sentences).
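To make the definition of a trend concrete, here is a minimal sketch of the computation. The `topic_trend` function and the `records` layout (one `(month, topic id)` pair per sentence) are assumptions made for this example and not part of transformertopic.

```python
from collections import Counter


def topic_trend(records, topic):
    """Return {month: percentage of that month's sentences assigned to `topic`}."""
    totals = Counter()  # all sentences per month
    hits = Counter()    # sentences of the requested topic per month
    for month, sentence_topic in records:
        totals[month] += 1
        if sentence_topic == topic:
            hits[month] += 1
    return {month: 100.0 * hits[month] / totals[month] for month in sorted(totals)}


# Hypothetical (month, topic id) pairs, one per sentence.
records = [
    ("2020-03", 74), ("2020-03", 74), ("2020-03", 62), ("2020-03", 44),
    ("2021-10", 44), ("2021-10", 44), ("2021-10", 62),
]
vaccine_trend = topic_trend(records, topic=44)
```

Because each value is normalized by the number of sentences in that month, months with many articles do not dominate the curve.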

Top­ic 62: cases

“The wor­ry is that, as the epi­dem­ic now is increas­ing expo­nen­tial­ly, we’ll see in a few months more deaths, but also a greater prob­a­bil­i­ty of infec­tion jump­ing into the care homes,” he says, “and there­fore high­er death rates there as well.”

Fol­low­ing that there will be an “evi­dence-led” move down the tiers of restric­tions once the UK has “bro­ken the link” between cas­es and hos­pi­tal­i­sa­tions and deaths.

Top­ic 44: vaccine

“We have dif­fer­ent ways of mea­sur­ing the con­cen­tra­tion of the vac­cine and when it was appar­ent that a low­er dose was used, we dis­cussed this with the reg­u­la­tor, and agreed a plan to test both the low­er dose/higher dose and high­er dose/higher dose, allow­ing us to include both approach­es in the phase III trial.

The health sec­re­tary repeat­ed­ly dodged ques­tions about the rea­sons for the delay at the brief­ing, say­ing only that “vac­cine sup­ply is always lumpy and we reg­u­lar­ly send out tech­ni­cal let­ters to the NHS”.

Top­ic 58: masks

“Both of these can be done by the use of face coverings.”

To get past this pan­dem­ic, we need to plug the holes by hand wash­ing, social dis­tanc­ing, mask-wear­ing and dis­in­fect­ing, Mushatt said.

Top­ic 127: children

As fam­i­lies prac­tise social dis­tanc­ing and face a new real­i­ty togeth­er, kids are still light­en­ing the mood with fun­ny one-lin­ers and sweet observations.

“Senior clin­i­cians still advise that school is the best place for chil­dren to be, and so they should con­tin­ue to go to school.

Top­ic 32: quarantine

Be mind­ful of door slam­ming as well, espe­cial­ly if you’re depart­ing for an ear­ly flight.

Freight Chaos ‘Will Trig­ger Dis­rup­tion Like UK Has Nev­er Experienced’.

Top­ic 64: tests

I think we need a big increase in test­ing,” he told the committee. “

This arti­cle explains, in a nut­shell, how tests can­not be 100% accu­rate and there­fore there is a cer­tain mar­gin of error in the results.

Top­ic 40: president

Oba­ma addressed such con­cerns in his inter­view this week, point­ing specif­i­cal­ly to fears among com­mu­ni­ties of colour that have been dis­pro­por­tion­ate­ly harmed by the coronavirus.

Trump raised $5 mil­lion at the event, accord­ing to CBS.

Top­ic 74: lockdown

So when the enforced slow­ing and soli­tude of lock­down start­ed, I was kind of glad of the break.

As some­one who lives alone, I have to show up for myself and be my own best friend, and dur­ing a lock­down where we can’t real­ly see our pals, this becomes all the more pertinent.

The following are the trends for the top 8 topics plotted together. The most obvious trend is the vaccination topic going from 1–2% in early 2020 to more than 30% in October 2021. This means that today about a third of the sentences written about COVID, at least in this corpus, are somehow related to vaccination.

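[Figure: trends of the 8 largest topics. Legend entries recoverable from the plot include topics 127 (schools, school, children, said), 32 (travel, quarantine, uk, said), 64 (test, tests, testing, people) and 40 (trump, president, said, coronavirus).]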

Speaking of trends, I wanted to see if I could detect the rush for toilet paper during the first lockdown in 2020: I used the searchForWordInTopics method to search for “toilet” and sure enough found it in topic 96, albeit only in position 72. This means the word does not appear in the word cloud, which for readability shows only the 25 words with the highest TF-IDF score in the topic. Though the topic is much smaller than the ones shown previously, we can clearly see a spike in March 2020 followed by a decreasing trend.

I con­sid­er head­ing to the gro­cery store to stock­pile canned goods with the rest of the coun­try, but social media tells me that lines are out the door and shelves are empty.

Besides the advan­tages of being able to touch, feel and try before you buy, shop­ping in-per­son can also dou­ble up as a social occa­sion and an oppor­tu­ni­ty to catch up with friends.
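The remark about word positions can be illustrated with a small sketch that ranks the words of each topic by TF-IDF and looks up where a given word lands. This is not the implementation of searchForWordInTopics: the function names, the whitespace tokenization and the choice of treating each topic as one big document are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_words_by_tfidf(topic_sentences):
    """Map each topic id to its words sorted by decreasing TF-IDF score.

    Each topic is treated as one document made of all its sentences.
    """
    term_counts = {
        topic: Counter(word for s in sentences for word in s.lower().split())
        for topic, sentences in topic_sentences.items()
    }
    n_topics = len(term_counts)
    # Number of topics in which each word appears at least once.
    doc_freq = Counter(word for counts in term_counts.values() for word in counts)
    ranking = {}
    for topic, counts in term_counts.items():
        total = sum(counts.values())
        scores = {
            word: (count / total) * math.log(n_topics / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranking[topic] = sorted(scores, key=lambda w: (-scores[w], w))
    return ranking


def search_word(ranking, word):
    """Return {topic: 1-based position of `word`} for the topics containing it."""
    return {t: words.index(word) + 1 for t, words in ranking.items() if word in words}


# Hypothetical toy topics.
topic_sentences = {
    96: ["shelves are empty", "shoppers stockpile toilet paper", "empty shelves again"],
    44: ["the vaccine is safe", "vaccine supply is lumpy"],
}
ranking = rank_words_by_tfidf(topic_sentences)
positions = search_word(ranking, "toilet")
cloud_words = ranking[96][:25]  # a word cloud would keep only the top-ranked words
```

A word can therefore belong to a topic and still be missing from its word cloud whenever its position is beyond the cutoff.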

The following are several more hand-picked topics and trends. Many of these trends behave as you would expect, though there is also a considerable amount of noise, which could be due to the limited data sample.

And what about LDA?

To connect to my previous post comparing transformer-based methods and LDA, I also ran Gensim’s LdaModel on the same corpus. I then hard-thresholded the topic distribution of each sentence, i.e. I classified each sentence as belonging exclusively to the topic with the highest coefficient.
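Hard thresholding amounts to taking an argmax over each sentence’s topic distribution. In the sketch below the distributions are lists of (topic id, weight) pairs, the format Gensim uses for a document’s topics; the numbers are made up and `hard_assign` is a hypothetical helper name.

```python
def hard_assign(topic_distribution):
    """Return the topic id with the highest weight among (topic_id, weight) pairs."""
    return max(topic_distribution, key=lambda pair: pair[1])[0]


# Hypothetical per-sentence topic distributions.
sentence_distributions = [
    [(0, 0.10), (1, 0.75), (2, 0.15)],
    [(0, 0.55), (2, 0.45)],
]
labels = [hard_assign(d) for d in sentence_distributions]
```

Note that the second sentence is assigned to topic 0 even though topic 2 is almost as likely: hard thresholding discards that information.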

Here are two of the largest topics:

[Word cloud of the first LDA topic]

So, should oth­er pet own­ers be wor­ried about their fur­ry friends catch­ing coronavirus?

The pan­dem­ic helped me realise that I do not thrive in an office full time.

[Word cloud of the second LDA topic]

“I am wor­ry­ing, because the care job is what I want to do and I’ve been doing this for many years,” says Faye.

“As much as we want to say that indi­vid­u­al­ly we can set these bound­aries or make these changes, it’s real­ly dif­fi­cult to do that if lead­er­ship is not on the same page,” she said.

The subjective quality of the word clouds is worse. This is clear from the following topic, which has many sentences in common (shown below) with the “vaccination” topic found with our procedure. However, the only words in the word cloud related to vaccines (“jab” and “booster”) are small, while very common words such as “get” and “come” are prominent.

[Word cloud of the LDA topic that overlaps with the “vaccination” topic]
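One way to quantify “many sentences in common” is the fraction of the LDA topic’s sentences that also fall in the transformer-based topic. The sketch below assumes each sentence has an integer id shared by both models; the ids and the `shared_fraction` helper are hypothetical.

```python
def shared_fraction(lda_sentence_ids, transformer_sentence_ids):
    """Fraction of the LDA topic's sentences that are also in the transformer topic."""
    lda_ids = set(lda_sentence_ids)
    if not lda_ids:
        return 0.0
    return len(lda_ids & set(transformer_sentence_ids)) / len(lda_ids)


# Hypothetical sentence ids assigned to each topic.
lda_topic = [3, 8, 15, 21, 42]
vaccine_topic = [8, 15, 21, 42, 57, 63]
overlap = shared_fraction(lda_topic, vaccine_topic)
```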

About me

I’m a Math PhD currently working in Blockchain Technical Intelligence. I also have skills in software and data engineering, and I love sports and music! Get in contact and let’s have a chat: I’m always curious about meeting new people.

