Reimagining LLMs: one industry they will change forever (and no one will complain)


What ChatGPT is good and bad at

I ❤️ Large Language Models (LLMs): I'm in the camp of "ChatGPT is the first AGI". This doesn't mean there aren't still strong limitations and pitfalls to avoid. Knowing the strengths and limitations of LLMs is crucial to understanding how this powerful tool can be used to simplify our work.

LLMs operate solely and exclusively on statistical principles, without relying on any internal symbolic representation of knowledge (they have no built-in Knowledge Graph). This makes them great at mimicking humans, but not so good at accurate referencing. Recently a lawyer got in trouble for citing six non-existent cases hallucinated by ChatGPT.

In which industries is "mimicking humans" the bottleneck? I can't answer this question fully (and we'll find out soon enough anyway), but I can talk about one field I'm familiar with.

Data Engineering is hard

Data Science is rarely the problem. This is the first thing everyone learns when working with data teams. Most of the effort is spent on Data Engineering (DE): acquiring, cleaning, storing, and later querying data. To avoid headaches down the road you want to spend some time designing this process so that it is resilient and scalable. When the amount of data is large, the way you organize the storage (model design) becomes very important. All this takes time and effort.

To someone who strives to write clean code, DE can sometimes feel icky. Processing data with code means taking care of the numerous special cases and idiosyncrasies of the specific dataset. These are often beyond your control; you just have to deal with them. Think adapting a Selenium scraping script to small variations in the HTML, or standardizing the text in a VARCHAR column to a common format. Code must account for these variations, often through hard-coded rules that make you feel dirty.

It is no wonder, then, that the DE industry is thriving: where there is a problem, there is a business opportunity. Lots of low-code solutions (Databricks' Prophecy, Meltano, and Dataiku's DSS, to name a few off the top of my head) are emerging, trying to simplify and automate DE tasks.

How LLMs will help with DE

I can see LLMs being used in a variety of ways for the boring parts of DE. Here are some ideas:

Data catalog. Given a bunch of disparate data sources and a business context, LLMs could be used to generate structured documentation on what is found where, and in which format. The LLM could use sensible column names and comments to make reasonable assumptions, just like a human would.
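
A minimal sketch of how this could look, using the OpenAI Python client; the table, column names, and model are made up for illustration:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Made-up schema metadata, e.g. dumped from INFORMATION_SCHEMA.
schema_dump = """table: cust_ord
columns: id INT, cust_id INT, ord_ts TIMESTAMP, amt_eur NUMERIC, chan VARCHAR"""

prompt = (
    "You are documenting the data warehouse of an e-commerce company.\n"
    "For the table below, write a one-line description of the table and of "
    "each column, inferring meaning from the names like a human analyst "
    "would, and flag any column you are unsure about.\n\n" + schema_dump
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; any capable chat model works
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # paste into the catalog for review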

Business Intelligence metrics. Analysts could describe the metrics and KPIs they need to monitor in plain English and have the LLM generate the SQL code for them. Of the ideas discussed here, this is probably the lowest-hanging fruit; I would be surprised if someone is not already offering this service. LlamaIndex's Text-to-SQL could be used, as in the sketch below.
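
A minimal sketch of that idea, assuming a recent llama-index release (the import paths have moved between versions), a SQLite database with a hypothetical `orders` table, and an OpenAI key in the environment for the default LLM:

```python
from sqlalchemy import create_engine
from llama_index.core import SQLDatabase
from llama_index.core.query_engine import NLSQLTableQueryEngine

# Hypothetical analytics database containing an `orders` table.
engine = create_engine("sqlite:///analytics.db")
sql_database = SQLDatabase(engine, include_tables=["orders"])

# The engine prompts the LLM with the table schema and the question,
# runs the generated SQL, and synthesizes a natural-language answer.
query_engine = NLSQLTableQueryEngine(sql_database=sql_database, tables=["orders"])
response = query_engine.query("What was the average order value per month in 2023?")

print(response.metadata.get("sql_query"))  # the SQL the LLM wrote
print(response)                            # the answer in plain English
```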

Data cleaning and standardization. When migrating data, or consolidating data from multiple sources, some simple transformations often need to be applied to conform all the data to the same format. By simply showing an example of the desired format, you could have an LLM generate the appropriate SQL code (see the sketch after the note below).

Data modelling. Given the business requirements and the source data formats, LLMs could be used to generate SQL code for an appropriate data warehouse schema.

These last two are harder because, in their trivial implementation, they would require the LLM to process all the incoming data. This would get very expensive fast.
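
One way around that cost is to have the LLM write the transformation rather than transform the rows itself: show it a handful of sample values plus one example of the target format, and only the generated SQL ever touches the full dataset. The same pattern (generate the code, not the data) applies to the data modelling idea above. A hedged sketch, with made-up column and model names:

```python
from openai import OpenAI

client = OpenAI()

# A few messy sample values from a hypothetical `phone` VARCHAR column.
samples = ["(555) 123-4567", "555.123.4567", "+1 555 123 4567"]

prompt = (
    "A VARCHAR column `phone` contains values like:\n"
    + "\n".join(samples)
    + "\n\nDesired format: '5551234567'. Write one standard-SQL SELECT "
    "expression that converts `phone` to this format. Return only SQL."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # review, then run on the database
```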

Data deduplication. Often one has to deal with duplicated data in a database. These duplicates are not always trivial to remove: there can be rows that are "the same" for a human but not for SQL (e.g. text that uses spaces instead of tabs). LLMs could easily catch these instances and mark them as duplicates.
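
A toy sketch of the idea (rows and model name invented for illustration); in practice you would only send the LLM candidate pairs pre-filtered by cheap fuzzy matching, not every pair in the table:

```python
import json

from openai import OpenAI

client = OpenAI()

# Two rows a human would call duplicates; SQL equality would not.
row_a = {"name": "ACME  Corp.", "city": "New\tYork"}
row_b = {"name": "Acme Corp", "city": "New York"}

prompt = (
    "Do these two database rows describe the same real-world entity? "
    'Answer with JSON: {"duplicate": true|false, "reason": "..."}\n'
    f"Row A: {json.dumps(row_a)}\nRow B: {json.dumps(row_b)}"
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # expected: duplicate = true
```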

Anomaly detection. ML is already being used in DE by companies like Anomalo to check for anomalous changes in data volume, freshness, null values or key metrics. This can be an invaluable tool for monitoring real-time pipelines, like the ones used in the fintech industry. LLMs could add another layer to these tools by looking at the actual meaning of the data. Of course, this would be particularly effective on text data.
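
Purely as an illustration, such a semantic check might look like this (the column description and model are assumptions):

```python
from openai import OpenAI

client = OpenAI()

# A batch of freshly ingested values for a hypothetical text column.
new_values = ["refund requested", "item arrived damaged", "SELECT * FROM users"]

prompt = (
    "The column `ticket_subject` should contain short customer-support "
    "subjects. List any of the following values that look semantically out "
    "of place, and say why:\n"
    + "\n".join(f"- {v}" for v in new_values)
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)  # should flag the stray SQL snippet
```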

Conclusion

DE is hard and often requires ad-hoc code changes to deal with the idiosyncrasies of the specific dataset. LLMs could offer a big help in assisting with, and eventually automating, these tasks. I think it's very reasonable to expect that LLMs will soon be used in areas such as data catalogs, developing BI metrics, and data cleaning, as well as in more advanced use cases such as anomaly detection. The big players to watch are those with established low-code DE solutions, especially those already intimate with ML, such as Anomalo.