Original article was published by Julian Sitkevich on Artificial Intelligence on Medium
GPT-3 A Powerful New Beginning
A text-generating neural network with the largest trained model to date
OpenAI’s GPT-3 is a powerful text-generating neural network pre-trained on the largest corpus of text to date. It is capable of uncanny predictive text responses based on its input and is currently, by far, the most powerful language model built.
GPT is an acronym for Generative Pre-Trained Transformer. GPT-2, announced in February 2019 by OpenAI, was trained on the WebText dataset, which contained over 8 million documents, or 38 GB of text, extracted from Reddit submissions. In November 2019, the final version of GPT-2 was released, with 1.5 billion parameters.
To put this in perspective, GPT-3 has 175 billion parameters and was trained on nearly a trillion words, making its predecessor GPT-2, with 1.5 billion parameters, look miniature at roughly one-hundredth the size. And note that GPT-2’s final version was released less than a year earlier, in November 2019. The next-biggest model is Google’s T5, which comes in at only 11 billion parameters.
What is it good at?
GPT-3’s capabilities at predicting text and language are uncanny. It is able to write functioning code, respond with human-sounding dialogue, generate images, write articles, fictional stories, and books, or even handle the mundane task of writing an email.
The predictions are not always perfect; for one, GPT-3 doesn’t actually understand what the words mean. Read “Any Limitations?” below.
In Simple Terms
At its simplest, GPT-3 takes a phrase of input text and predicts what the next text output should be. This type of machine learning doesn’t “think”; it processes the textual input based on its previously trained data and runtime translators.
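As a rough analogy only (this is a toy bigram counter, not GPT-3’s actual transformer architecture), the idea of “predict the next word from previously seen text” can be sketched like this:

```python
from collections import defaultdict, Counter

def train_bigram(corpus):
    """Count, for each word, which words tend to follow it in the corpus."""
    model = defaultdict(Counter)
    words = corpus.split()
    for current, nxt in zip(words, words[1:]):
        model[current][nxt] += 1
    return model

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # "cat" follows "the" twice, "mat" once -> "cat"
```

GPT-3 does something conceptually similar at vastly larger scale: instead of raw word counts, it learns statistical patterns over tokens from hundreds of billions of examples.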
Pre-training happens on a massive dataset, including the public internet, a books corpus, and Wikipedia. Vastly widening the examples to train on improves the quality and performance of its responses. Being so large, GPT-3 was estimated to cost approximately $5 million USD to train, which brings into question the cost scalability of future versions of GPT-3.
One of the primary training datasets used to train GPT-3 came from CommonCrawl, a freely available public dataset consisting of crawls of the public web containing nearly a trillion words. CommonCrawl represented 60% of the training weight and contributed more than 400 billion tokens.
Dataset breakdown and training distribution
Dataset       Tokens        Weight in training
-----------   -----------   ------------------
CommonCrawl   410 billion   60%
WebText2      19 billion    22%
Books1        12 billion    8%
Books2        55 billion    8%
Wikipedia     3 billion     3%
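A quick back-of-the-envelope check on the table above shows that the training mixture was not proportional to raw size: CommonCrawl holds roughly 82% of the total tokens but received only 60% of the training weight, meaning the smaller sets were sampled relatively more often. The figures below come straight from the table:

```python
# Token counts (billions) and training weights, taken from the table above
datasets = {
    "CommonCrawl": (410, 0.60),
    "WebText2":    (19, 0.22),
    "Books1":      (12, 0.08),
    "Books2":      (55, 0.08),
    "Wikipedia":   (3, 0.03),
}

total_tokens = sum(tokens for tokens, _ in datasets.values())  # 499 billion
for name, (tokens, weight) in datasets.items():
    raw_share = tokens / total_tokens
    print(f"{name:12s} raw share {raw_share:6.1%}  training weight {weight:6.1%}")
```

Running this shows WebText2 at under 4% of the raw tokens but 22% of the training weight, a sign that the mixture was tuned rather than drawn uniformly from the combined corpus.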
Why is a larger dataset better?
Recent work has demonstrated substantial gains on many NLP tasks and benchmarks by pre-training on a large corpus of text followed by fine-tuning on a specific task. While typically task-agnostic in architecture, this method still requires task-specific fine-tuning datasets of thousands or tens of thousands of examples. By contrast, humans can generally perform a new language task from only a few examples or from simple instructions — something which current NLP systems still largely struggle to do. Here we show that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting.
Source: arXiv 2005.14165v4.
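The “few examples or simple instructions” the abstract describes are typically delivered as a single prompt containing a handful of demonstrations followed by the new query. A hypothetical few-shot prompt (the task and examples here are illustrative, not from the paper) might be assembled like this:

```python
def build_few_shot_prompt(examples, query):
    """Join demonstration Q/A pairs and the new query into one prompt string."""
    lines = [f"Q: {q}\nA: {a}" for q, a in examples]
    lines.append(f"Q: {query}\nA:")  # model is expected to continue after "A:"
    return "\n\n".join(lines)

examples = [
    ("Translate 'chat' to English.", "cat"),
    ("Translate 'chien' to English.", "dog"),
]
prompt = build_few_shot_prompt(examples, "Translate 'oiseau' to English.")
print(prompt)
```

No weights are updated in this setting; the model simply conditions on the demonstrations in its input and continues the pattern.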
Any Limitations?
Yes, the creators of GPT-3 recognize limitations. In the area of text synthesis:
On text synthesis, although the overall quality is high, GPT-3 samples still sometimes repeat themselves semantically at the document level, start to lose coherence over sufficiently long passages, contradict themselves, and occasionally contain non-sequitur sentences or paragraphs.
In comparison, human beings are capable of holding a persistent mental point of view, whereas GPT-3 can lose focus and “forget” over the course of longer passages.
Within discrete language tasks, such as “common sense physics”:
Within the domain of discrete language tasks, we have noticed informally that GPT-3 seems to have special difficulty with “common sense physics”, despite doing well on some datasets (such as PIQA [BZB+19]) that test this domain. Specifically, GPT-3 has difficulty with questions of the type “If I put cheese into the fridge, will it melt?”.
It is not clear from this passage whether the “common sense physics” weakness could be mitigated in the future by training on physics dataset(s).
And there is this common and important concern with bias, shared by most deep learning systems:
Finally, GPT-3 shares some limitations common to most deep learning systems — its decisions are not easily interpretable, it is not necessarily well-calibrated in its predictions on novel inputs as observed by the much higher variance in performance than humans on standard benchmarks, and it retains the biases of the data it has been trained on.