How does ChatGPT work? As explained by the ChatGPT team.

See a longer version of this article here: Scaling ChatGPT: Five Real-World Engineering Challenges.

Sometimes the best explanations of how a technology solution works come from the software engineers who built it. To explain how ChatGPT (and other large language models) operate, I turned to the ChatGPT engineering team.

"How does ChatGPT work, under the hood?"

I put this question to Evan Morikawa at OpenAI. Evan joined OpenAI in 2020 – two years before ChatGPT launched – and has led the Applied engineering team as ChatGPT launched and scaled. His team created ChatGPT, and Evan has been there from the very beginning.

With this, it’s over to Evan. My questions are in italics.

A refresher on OpenAI, and on Evan

Evan: how did you join OpenAI, and end up heading the Applied engineering group – which also builds ChatGPT?

OpenAI is the creator of ChatGPT, which is just one of the company’s products. Other shipped products include DALL·E 3 (image generation), GPT-4 (an advanced model), and the OpenAI API, which developers and companies use to integrate AI into their processes. ChatGPT and the API each expose several classes of model: GPT-3, GPT-3.5, and GPT-4.

The engineering, product, and design organization that makes and scales these products is called "Applied," and was founded in 2020 when GPT-3 was released. It’s broadly chartered with safely bringing OpenAI's research to the world. OpenAI itself was founded in 2015, and at its core the company is still a research lab with the goal of creating a safe and aligned artificial general intelligence (AGI).

The Applied group and ChatGPT within OpenAI

I joined OpenAI in October 2020 when Applied was brand new. I do not have a PhD in Machine Learning, and was excited by the idea of building APIs and engineering teams. I managed our entire Applied Engineering org from its earliest days through the launch and scaling of ChatGPT. We cover more on Evan’s story in Inside OpenAI: how does ChatGPT ship so quickly? 

How does ChatGPT work?

For those of us who have not spent the past few years building ChatGPT from the ground up, how does it work?

When you ask ChatGPT a question, several steps happen:

  1. Input. We take your text from the text input.
  2. Tokenization. We chunk it into tokens. A token roughly maps to a couple of Unicode characters. You can think of it as a word. 
  3. Create embeddings. We turn each token into a vector of numbers. These are called embeddings. 
  4. Multiply embeddings by model weights. We then multiply these embeddings by hundreds of billions of model weights.
  5. Sample a prediction. At the end of this multiplication, the vector of numbers represents the probability of the next most likely token. That next most likely token becomes the next few characters that ChatGPT outputs.

Let’s visualize these steps. The first two are straightforward:

Steps 1 and 2 of what happens when you ask ChatGPT a question

Note that tokenization doesn’t necessarily mean splitting text into words; tokens can be subsets of words as well. 
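To make this concrete, here is a toy greedy longest-match subword tokenizer. This is an illustration only: OpenAI’s actual tokenizer uses byte-pair encoding with a vocabulary learned from data, and the tiny hand-picked vocabulary below is invented for this sketch.

```python
# Toy greedy longest-match subword tokenizer -- an illustration only.
# OpenAI's real tokenizer uses byte-pair encoding with a learned vocabulary.
VOCAB = {"token", "iz", "ation", "un", "believ", "able"}

def tokenize(text):
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest possible substring first, shrinking until a vocab hit.
        for j in range(len(text), i, -1):
            if text[i:j] in VOCAB:
                match = text[i:j]
                break
        if match is None:
            match = text[i]  # unknown character: fall back to a single char
        tokens.append(match)
        i += len(match)
    return tokens

print(tokenize("tokenization"))   # ['token', 'iz', 'ation']
print(tokenize("unbelievable"))   # ['un', 'believ', 'able']
```

Note how "tokenization" splits into three subword tokens rather than staying one word, which is exactly the behavior described above.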

Embeddings are at the heart of large language models (LLMs), and we create them from tokens in the next step:

Step 3 of what happens when you ask ChatGPT a question. Embeddings represent tokens as vectors. The values in the above embedding are examples

An embedding is a multi-dimensional representation of a token. We explicitly train some of our models to capture semantic meanings and relationships between words or phrases. For example, the embeddings for “dog” and “puppy” are closer together in several dimensions than those for “dog” and “computer.” These multi-dimensional embeddings help machines understand human language more efficiently.
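A toy sketch of this “closer together” idea, using cosine similarity over invented three-dimensional vectors (real models learn embeddings with thousands of dimensions):

```python
import math

# Toy 3-dimensional embeddings; the values are invented for illustration.
# Real models learn embeddings with thousands of dimensions during training.
embeddings = {
    "dog":      [0.9, 0.8, 0.1],
    "puppy":    [0.85, 0.75, 0.2],
    "computer": [0.1, 0.2, 0.9],
}

def cosine_similarity(a, b):
    # 1.0 means identical direction; values near 0 mean unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))     # high
print(cosine_similarity(embeddings["dog"], embeddings["computer"]))  # low
```

With these values, “dog” and “puppy” score much closer to 1.0 than “dog” and “computer” do, mirroring the semantic relationship described above.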

Model weights are used to calculate a weighted embedding matrix, which predicts the next likely token. For this step, we take OpenAI’s weight matrix, which consists of hundreds of billions of weights, and multiply it by a matrix we construct from the embeddings. This is a compute-intensive multiplication.

Step 4 of what happens when you ask ChatGPT a question. The weight matrix contains hundreds of billions of model weights
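A minimal sketch of the matrix multiplication at the heart of step 4, with tiny invented matrices; a real model multiplies through many layers holding hundreds of billions of weights in total:

```python
# Minimal sketch of multiplying token embeddings by a weight matrix.
# These tiny matrices are invented; a real model multiplies through many
# layers holding hundreds of billions of weights in total.
def matmul(A, B):
    """Multiply an (m x n) matrix A by an (n x p) matrix B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]

token_embeddings = [[1.0, 2.0],    # one row per token, toy 2-d embeddings
                    [0.5, -1.0]]
weights = [[0.1, 0.3, -0.2],       # toy 2 x 3 weight matrix
           [0.4, 0.0, 0.5]]

logits = matmul(token_embeddings, weights)
print(logits)  # one row of scores per token
```

Even at this toy scale, the shape of the computation is the same: every output score is a sum of embedding values weighted by learned parameters, which is why the full-scale version is so compute-intensive.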

Sampling a prediction is done after the billions of multiplications. The final vector represents the probability of the next most likely token. Sampling is when we choose that next most likely token and send it back to the user. Each word that ChatGPT outputs is this same process, repeated many times per second.

Step 5. We end up with the probability of the next most likely token (roughly a word). We sample the next most probable word, based on pre-trained data, the prompt, and the text generated so far. Image source: What is ChatGPT doing and why does it work? by Stephen Wolfram
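A minimal sketch of step 5, assuming a tiny invented vocabulary and invented scores: softmax turns the model’s final scores into probabilities, and sampling then picks the next token:

```python
import math
import random

# Sketch of step 5: softmax turns the model's final scores (logits) into
# probabilities, then we sample the next token. The vocabulary and logit
# values here are invented for illustration.
vocab = ["cat", "dog", "car"]
logits = [2.0, 1.0, 0.1]

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, probabilities, rng=random):
    # Pick a token with probability proportional to its softmax score.
    r = rng.random()
    cumulative = 0.0
    for token, p in zip(tokens, probabilities):
        cumulative += p
        if r < cumulative:
            return token
    return tokens[-1]

probs = softmax(logits)
print(probs)                 # sums to 1; "cat" gets the highest probability
print(sample(vocab, probs))  # usually "cat", sometimes "dog" or "car"
```

Because sampling is probabilistic rather than always taking the top token, the same prompt can yield different completions on different runs.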

Pretraining and inference

How do we generate this complex set of model weights, whose values encode most of human knowledge? We do it through a process called pretraining. The goal is to build a model that can predict the next token (which you can think of as a word), for all words on the internet. 

During pretraining, the weights are gradually updated via gradient descent, a mathematical optimization method. An analogy for gradient descent is a hiker stuck up a mountain, trying to get down. However, they don’t have a full view of the mountain, due to heavy fog which limits their view to a small area around them. Gradient descent means looking at the steepness of the incline from the hiker’s current position, and proceeding in the direction of the steepest descent. Steepness is not obvious from simple observation, but luckily this hiker has an instrument to measure it. However, a single measurement takes time, and they want to get down before sunset. So, the hiker needs to decide how frequently to stop and measure the steepness, so they can still get down before sunset.
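The hiker analogy maps directly onto a few lines of code. Here is gradient descent on a one-dimensional “mountain”, f(x) = x², where the learning rate plays the role of how far the hiker walks between measurements:

```python
# Gradient descent on a one-dimensional "mountain", f(x) = x**2.
# The hiker repeatedly measures the slope and steps downhill; the
# learning rate is how far they walk between measurements.
def gradient(x):
    return 2 * x  # derivative of f(x) = x**2

x = 10.0             # starting position on the slope
learning_rate = 0.1  # step size per measurement
for _ in range(100):
    x -= learning_rate * gradient(x)

print(round(x, 6))  # effectively 0.0: the bottom of the valley
```

Pretraining does the same thing across hundreds of billions of weights at once, which is why the step size (and how often to “measure”) matters so much for finishing in a reasonable time.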

Once we have our model we can run inference on it, which is when we prompt the model with text. For example, the prompt could be: “write a guest post for the Pragmatic Engineer.” This prompt then asks the model to predict the next most likely token (word). It makes this prediction based on past input, and it happens repeatedly, token by token, word by word, until it spits out your post! 
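The token-by-token loop can be sketched as follows, with a hard-coded toy lookup table standing in for the real model (which would run the full tokenize/embed/multiply/sample pipeline at each step):

```python
# Sketch of autoregressive inference: generate one token at a time, feeding
# everything generated so far back in as input. The "model" is a hard-coded
# toy lookup table invented for illustration.
toy_model = {
    ("write",): "a",
    ("write", "a"): "guest",
    ("write", "a", "guest"): "post",
}

def predict_next(tokens):
    # A real model would run the full tokenize/embed/multiply/sample pipeline.
    return toy_model.get(tuple(tokens), "<end>")

output = ["write"]  # the prompt, already tokenized
while True:
    next_token = predict_next(output)
    if next_token == "<end>":
        break
    output.append(next_token)

print(" ".join(output))  # "write a guest post"
```

Each iteration sees the whole sequence so far, which is why generation is inherently sequential: the model cannot predict word five before it has produced word four.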


This is Gergely again.

How ChatGPT works isn’t magic, and is worth understanding. Like most people, my first reaction to trying out ChatGPT was that it felt magical. I typed in questions and got answers that felt like they could have come from a human! ChatGPT works impressively well with human language, and has access to more information than any one human could handle. It’s also good at programming-related questions, and there was a point when I wondered whether ChatGPT could be more capable than humans, even in areas like programming, where humans have done better until now.

For a sense of ChatGPT’s limitations, you need to understand how it works. ChatGPT and other LLMs do not “think” and “understand” like humans. ChatGPT does, however, generate words based on what the next most likely word should be, given the input and everything generated so far.

In an excellent deep dive into how ChatGPT works, Stephen Wolfram – creator of the expert search engine WolframAlpha – summarizes ChatGPT:

“The basic concept of ChatGPT is at some level rather simple. Start from a huge sample of human-created text from the web, books, etc. Then train a neural net to generate text that’s “like this”. And in particular, make it able to start from a “prompt” and then continue with text that’s “like what it’s been trained with”.

As we’ve seen, the actual neural net in ChatGPT is made up of very simple elements—though billions of them. And the basic operation of the neural net is also very simple, consisting essentially of passing input derived from the text it’s generated so far “once through its elements” (without any loops, etc.) for every new word (or part of a word) that it generates.

But the remarkable—and unexpected—thing is that this process can produce text that’s successfully “like” what’s out there on the web, in books, etc. (...)

The specific engineering of ChatGPT has made it quite compelling. But ultimately (at least until it can use outside tools) ChatGPT is “merely” pulling out some “coherent thread of text” from the “statistics of conventional wisdom” that it’s accumulated. But it’s amazing how human-like the results are.”

This was an excerpt from the article Scaling ChatGPT: Five Real-World Engineering Challenges. Read that article for more details on the engineering challenges behind building (and scaling) ChatGPT, as explained by Evan.

Other explanations on how ChatGPT and LLMs work:

Subscribe to my weekly newsletter to get articles like this in your inbox. It's a pretty good read - and the #1 tech newsletter on Substack.