The History of Large Language Models
What is a Language Model?
“Writing comes from reading, and reading is the finest teacher of how to write.” Annie Proulx
Just like a human author, a Language Model can construct output based on texts that it has read. Separate from the writing “rules” learned in a classroom, a human can also unconsciously transfer vocabulary, spelling, grammar and stylistic voice learned from a lifetime’s reading to their writing. Likewise, a language model has a pre-read set of written data, from which it can create its own rules.
The Cat Sat On The…
In the AI world, a language model takes examples of writing (a set of data) and assigns a probability to a word sequence being “valid.” Don’t be confused by the word “valid”: it doesn’t refer to grammatical or factual validity. Instead, valid refers to how closely the sequence matches patterns in the original dataset. This validity score can then be used to generate new content.
A simple and familiar example of a language model is predictive text. Consider a language model trained on written data and the incomplete sentence: “The cat sat on the…” The model might consider a series of alternative endings and match them against patterns in its dataset, where some endings appear far more often than others. The language model has no concept of whether a cat might prefer the sidewalk to a mat, or even what a cat is, but, depending on its data, it can determine which sentence ending is the most probable (or “valid”). Thus, “mat” is predicted before “sidewalk”.
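To make this concrete, here is a minimal sketch of the idea using a handful of made-up sentences and simple next-word counts. Real LLMs use vastly larger datasets and far more sophisticated models, but the principle of scoring endings by how often they appear in the data is the same.

```python
from collections import Counter, defaultdict

# A tiny, made-up "dataset" of sentences standing in for the training corpus.
corpus = [
    "the cat sat on the mat",
    "the cat sat on the mat again",
    "the dog sat on the sidewalk",
]

# Count how often each word follows each other word (a simple bigram model).
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current_word, next_word in zip(words, words[1:]):
        next_word_counts[current_word][next_word] += 1

# "Predict" the most probable word after "the": the counts make "cat" and "mat"
# more "valid" than "sidewalk", purely because of how often they appear in the data.
counts = next_word_counts["the"]
total = sum(counts.values())
for word, count in counts.most_common():
    print(f"P({word} | the) = {count / total:.2f}")
```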
Large Language Models
The first language models trace their roots to the early days of AI. ELIZA, a tool built to converse, debuted in 1966 at MIT. While ELIZA didn’t learn from data and didn’t predict the probability of a word or sentence, it is widely credited with being the first AI able to hold a human-like conversation.
Many of today’s “chatbots” are built using a similar technique, with scripted responses served to the user according to identifiable words and phrases in the input text.
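A rough sketch of that scripted approach is below; the keyword patterns and responses are invented for illustration and are not ELIZA’s actual script.

```python
import re

# Invented keyword -> response rules, in the spirit of ELIZA-style scripted chatbots.
rules = [
    (r"\b(mother|father|family)\b", "Tell me more about your family."),
    (r"\b(sad|unhappy)\b", "Why do you think you feel that way?"),
    (r"\b(hello|hi)\b", "Hello. What would you like to talk about?"),
]

def respond(user_input: str) -> str:
    """Return the first scripted response whose keyword pattern matches the input."""
    for pattern, response in rules:
        if re.search(pattern, user_input.lower()):
            return response
    # Deliberately open-ended fallback, so the conversation keeps moving.
    return "Please, go on."

print(respond("I had an argument with my mother"))  # -> "Tell me more about your family."
```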
Brown Corpus
While ELIZA relied on canned, deliberately open-ended responses, other AI language models began to use libraries of written texts to generate output. One of the earliest datasets used by language models was the Brown Corpus, created in the early 1960s. It consists of 500 text samples, each just over 2,000 words long.
Since then, language models have grown not only in the sophistication of the models themselves but also in the size of the datasets used to train them. This growth is primarily due to the availability of large amounts of text data on the internet, coupled with advances in computing power and storage.
As an example of the amount of data available, the Common Crawl project has been crawling the web since 2008, and currently contains over 70 petabytes of data, much of it in the form of text.
To put the magnitude of the Common Crawl project in perspective, some projections suggest that a petabyte is comparable to 500 billion pages of standard printed text.
At the same time, advances in computing power and data storage have made it feasible to train larger and more complex language models. In particular, the development of specialized hardware such as Google's Tensor Processing Units (TPUs) has made it possible to train models with billions or even trillions of parameters.
While the term “large” is vague, a Large Language Model typically has at least one billion parameters (the variables the model learns from its training data).
The Birth of GPT
The first large language model was GPT (Generative Pre-trained Transformer), developed by OpenAI in 2018. It had 117 million parameters and was trained on a large corpus of text. The model was ground-breaking, generating human-like text and performing a range of natural language processing tasks with high accuracy. Since then, larger and more powerful language models have been developed, including GPT-2, GPT-3 and GPT-4; GPT-4 is rumored to have 1 trillion parameters. Other big players have joined the scene, with Google, Meta and Amazon all working on their own Large Language Models.
Large Language Models Timeline
| Date | Event | Parameters |
| --- | --- | --- |
| 2014 | Google buys London-based AI company DeepMind | |
| 2018 | BERT by Google. An influential language model, not built to generate text | 340 million |
| 2018 | GPT by OpenAI. Ground-breaking LLM that generated human-like text | 117 million |
| 2019 | Microsoft invests $1 billion in OpenAI | |
| 2019 | GPT-2 by OpenAI | 1.5 billion |
| 2020 | GPT-3 by OpenAI | 175 billion |
| Dec 2021 | Gopher by DeepMind | 280 billion |
| July 2022 | BLOOM by Hugging Face and the BigScience project. Essentially GPT-3, but trained on a multilingual corpus | 176 billion |
| Nov 2022 | Galactica by Meta. Trained on scientific text | 120 billion |
| Nov 2022 | AlexaTM by Amazon | 20 billion |
| Nov 2022 | ChatGPT released by OpenAI. Built on a fine-tuned variant of GPT-3.5, ChatGPT brings LLMs to the masses | |
| Jan 2023 | Microsoft invests a further $10 billion in OpenAI | |
| Mar 2023 | GPT-4 by OpenAI, also available through ChatGPT | Rumoured to have 1 trillion |
| Mar 2023 | BloombergGPT. LLM trained on financial data from proprietary sources | 50 billion |
Things Suddenly Get Real
The size of the dataset correlates strongly with the capability of the language model, and the emergence of Large Language Models (LLMs) trained on huge datasets has driven a significant increase in performance. LLMs have shown improvements in their ability to generate human-like text, as well as in their performance on a variety of natural language processing tasks. But the introduction of large datasets has also produced “discontinuous phase shifts”: points at which a model suddenly acquires abilities disproportionate to the increase in dataset size.
Adam, Travtus’ AI teammate, uses LLMs to create conversational intelligence for real estate operations. By combining the embeddings of large models with Travtus’ own property and industry-specific data, Adam is able to understand and respond accurately to resident queries in natural language.
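As an illustration of that general pattern (and not of Travtus’ actual system), the sketch below uses the open-source sentence-transformers library to embed a few made-up property FAQs and retrieve the entry most similar to a resident’s question.

```python
from sentence_transformers import SentenceTransformer, util

# Made-up, domain-specific knowledge a property operator might curate.
property_faqs = [
    "Rent is due on the first of each month and can be paid through the resident portal.",
    "Maintenance requests can be submitted 24/7 through the resident portal.",
    "The pool is open from 8am to 10pm, May through September.",
]

# A small general-purpose embedding model; a production system would likely use
# larger models plus industry-specific data and fine-tuning.
model = SentenceTransformer("all-MiniLM-L6-v2")
faq_embeddings = model.encode(property_faqs, convert_to_tensor=True)

# Embed the resident's question and retrieve the most similar domain entry.
query = "When do I need to pay my rent?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, faq_embeddings)[0]
best_match = property_faqs[int(scores.argmax())]
print(best_match)  # -> the rent-payment FAQ
```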
But progress has not all been plain sailing. The advances in LLMs have been intercut with embarrassing PR failures for some major companies. Amongst others, Meta’s Galactica, an AI claiming to “summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more,” was released to a chorus of disapproval from its intended userbase, which was quick to point out its factual failings.
Learn From the Masters: The Emergence of Domain Specific Large Language Models
“Find an author you admire and copy their plots and characters in order to tell your own story…” Michael Moorcock
As humans master writing by focusing on the works they aspire to emulate, so the recent development of Domain Specific LLMs has sought to improve real-world performance with more specialized datasets. The complexity and unique terminology of domains such as real estate and finance warrant a domain-specific model. While the GPT family of LLMs was pre-trained on a huge amount of text from many different sources, such as Common Crawl, web texts, books and Wikipedia, domain-specific Large Language Models generally combine a separate proprietary dataset with a pre-existing LLM. BloombergGPT is one such example.
BloombergGPT
Purpose-built for financial natural language tasks, BloombergGPT draws on the vast amount of data available through the Bloomberg Terminal, allowing it to outperform similarly sized open models on financial tasks while matching their performance on general-purpose natural language tasks.
Where Bloomberg have led, other industries are following, curating their own collections of data to create LLMs that generate text that is not only human-like but also grounded in their own domain-specific knowledge.
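In practice, that curation often means continuing to train (fine-tuning) an existing open model on a domain corpus. The sketch below uses the Hugging Face transformers library with a hypothetical file of domain text; it illustrates the general recipe rather than how any particular model, BloombergGPT included, was actually built.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Start from a small, pre-existing open model; "domain_corpus.txt" is a hypothetical
# file of proprietary domain text (leases, filings, clinical notes, and so on).
base_model = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="domain-llm", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # the resulting checkpoint now "speaks" the domain's vocabulary
```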
Other Domain Specific Large Language Models
| Model | Description |
| --- | --- |
| BioMedLM by MosaicML and Stanford | A purpose-built AI model trained to interpret biomedical language |
| BloombergGPT by Bloomberg | Purpose-built for finance-based natural language tasks |
| Galactica by Meta | Built for the scientific community |
| Codex by OpenAI | Fine-tuned on publicly available code from GitHub |
The Future of Large Language Models
To the general public, it may seem like the technology behind Large Language Models is only a few years old. In fact, language models have been around since the 1960s, and the idea of creating new writing from a body of existing text has existed for as long as human authors have had access to the written word, first created some 5,500 years ago in Mesopotamia.
“There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope. We give them a turn and they make new and curious combinations. We keep on turning and making new combinations indefinitely; but they are the same old pieces of colored glass that have been in use through all the ages”. Mark Twain
However, the creation of Large Language Models and the freely available release of ChatGPT mean that Large Language Models have taken center stage in the public consciousness. No surprise, as larger datasets filtered for bias, domain-specific LLMs, and advances in computing power and data storage promise to further improve human efficiency and productivity.
The future of LLMs is bright.