A Gentle Guide to Large Language Models

Separating Fact from Hype

Benjamin Gilstrap
Apr 25, 2023

Introduction

I aim to provide an easy-to-understand explanation of how AI systems like ChatGPT, GPT-3, GPT-4, Bing Chat, and Bard work. ChatGPT is a chatbot built on a Large Language Model, a term I will explain in simple language throughout this article. We will cover the core concepts behind these technologies, using metaphors to illustrate them.

You do not need any technical or mathematical background to understand this article. I will explore why these concepts work the way they do and what we can realistically expect from Large Language Models like ChatGPT.

To achieve this, we will begin with a basic explanation of what Artificial Intelligence is and gradually progress to more complex topics, avoiding technical jargon wherever possible and leaning on a few recurring metaphors. Along the way, I will discuss the potential implications of these technologies.

Let’s dive in!

What is AI?

Let me start by explaining some basic terms you might have heard before. What exactly is artificial intelligence (AI)? In simple terms, AI refers to a system or entity that performs actions we would consider intelligent if a human performed them. Of course, defining “intelligence” is tricky, but this definition generally works well enough. If we observe something artificial that is engaging, useful, and performs somewhat complex actions, we might consider it intelligent. For example, we might call the computer-controlled characters in video games “AI,” even if they are just simple pieces of if-then-else code (e.g. “if the player is within range, then shoot, else move to the nearest boulder for cover”). As long as they keep us engaged and entertained without doing anything obviously stupid, we might perceive them as more sophisticated than they actually are.

However, once we understand how these systems work, we may no longer be impressed and may expect something more sophisticated behind the scenes. It all depends on our level of knowledge about what is happening “under the hood.”

The key point is that artificial intelligence is not magic. It can be explained, and we can understand how it works. So, let’s dive into it and explore the inner workings of AI.

What is ML?

When people talk about artificial intelligence, they often mention machine learning. So what exactly is machine learning? Essentially, it’s a way to create behavior by taking in data, forming a model, and then executing the model. This can be incredibly useful when trying to capture complicated phenomena, such as language, that would be too difficult to manually code with if-then-else statements.

But what is a model? Think of it as a simplification of a complex phenomenon. For example, a model car is a smaller, simpler version of a real car that has many of the same attributes but isn’t meant to replace the original. Similarly, we can create smaller, simpler versions of human language using large language models. They are called “large” because they have an enormous number of parameters and require a lot of memory to run. The biggest ones, like GPT-3, GPT-4, and the models behind ChatGPT, are so massive that they must be created and run on supercomputers in data centers.

What is a neural network?

Neural networks are a way to learn models from data, roughly inspired by how the human brain works. The brain consists of interconnected neurons that transmit electrical signals back and forth. Artificial neural networks are crude and inefficient by comparison; the basic idea dates back to the 1940s, but the hardware and techniques needed to train them at a very large scale only came together around 2017.

When I think of neural networks, I like to think of electrical circuitry. To illustrate this point, let’s imagine that I want to make a self-driving car that can drive on the highway. The car is equipped with proximity sensors on the front, back, and sides that report a value of 1.0 when there is something very close and 0.0 when nothing is detectable nearby.

The car also has robotic mechanisms that can turn the steering wheel, push the brakes, and push the accelerator. The accelerator receives a value of 1.0 when it should use maximum acceleration, and 0.0 means no acceleration. Similarly, a value of 1.0 sent to the braking mechanism means slam on the brakes and 0.0 means no braking. The steering mechanism takes a value of -1.0 to +1.0 with a negative value meaning steer left and a positive value meaning steer right, and 0.0 meaning keep straight.

I’ve recorded data about how I drive, and it’s a complex process involving different combinations of actions (steer left, steer right, accelerate more or less, brake) based on different combinations of sensor information. But how do I wire up the sensors to the robotic mechanisms? It isn’t clear, so I wire up every sensor to every robotic actuator.

When I take the car out on the road, electrical current flows from all the sensors to all the robotic actuators, and the car simultaneously steers left, steers right, accelerates, and brakes. The result is a mess.

To solve this problem, I put resistors on different parts of the circuit, allowing electricity to flow more freely between certain sensors and certain robotic actuators. I also use gates that stop the flow of electricity until enough electricity accumulates to flip a switch, or that send electrical energy forward only when the incoming electrical strength is low. Now, I could place these resistors and gates randomly all over the place and keep shuffling them until I stumble upon a combination that works well enough, but that could take forever. Fortunately, there is a smarter procedure called back propagation.

Back propagation is an algorithm that is reasonably good at guessing how to change the configuration of the circuit. It makes tiny changes to the circuit to bring its behavior closer to what the data suggests, and over thousands or millions of tweaks it can eventually produce something that comes close to agreeing with the data. We call the resistors and gates parameters, because what the back propagation algorithm is really doing is declaring that each one should be a little stronger or weaker. The entire circuit can then be reproduced in other cars if we know the layout of the circuit and the parameter values.
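To make this concrete, here is a minimal sketch of the idea in Python. Everything in it is invented for illustration: three imaginary proximity sensors wired to a single brake actuator, and a few made-up driving records. The “resistors” are just numbers (weights), and each pass nudges them so the circuit’s output moves a little closer to what the data says.

```python
# A toy "circuit": three imaginary proximity sensors wired to one brake
# actuator. The weights play the role of the resistors.

# Recorded driving data (made up): sensor readings -> how hard I braked.
examples = [
    ([1.0, 0.0, 0.0], 0.9),  # something very close in front: brake hard
    ([0.0, 0.0, 0.0], 0.0),  # nothing nearby: no braking
    ([0.2, 0.1, 0.0], 0.2),  # something a bit ahead: brake gently
]

weights = [0.5, 0.5, 0.5]  # arbitrary starting "resistor" strengths
learning_rate = 0.1        # how big each tiny tweak is

for step in range(1000):   # thousands of tiny tweaks
    for sensors, target in examples:
        # Run the circuit: electricity flows from sensors through weights.
        output = sum(w * s for w, s in zip(weights, sensors))
        error = output - target
        # Nudge each weight in the direction that shrinks the error.
        for i in range(len(weights)):
            weights[i] -= learning_rate * error * sensors[i]

print(weights)  # the circuit now roughly reproduces the braking data
```

Real back propagation does this over circuits with millions or billions of such numbers, but the nudging is the same idea.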

What is deep learning?

Let me explain what Deep Learning is. It is a label for circuits that incorporate more than just resistors and gates: we can place little mathematical calculations throughout the circuit that add and multiply values together before passing the electricity along. What’s interesting is that, despite these additions, Deep Learning still relies on the same fundamental technique of guessing the parameters in tiny increments.
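As an illustration only, here is one such building block sketched in plain Python, with made-up numbers: multiply the inputs by weights, add them up, then pass the total through a gate-like squashing function. Stacking several of these blocks is exactly what makes a network “deep”.

```python
import math

# A made-up "deep learning" building block: multiply-and-add, then a
# gate-like function (a sigmoid) that only lets strong signals through.
def unit(inputs, weights, bias):
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))  # squashes the result to 0..1

sensors = [0.8, 0.1]  # two imaginary sensor readings

# First layer: two units, each with its own (invented) weights.
hidden = [unit(sensors, [1.5, -2.0], 0.0),
          unit(sensors, [-0.5, 3.0], 0.1)]

# Second layer: one unit reading the first layer's outputs. Stacking
# layers like this is what makes the network "deep".
output = unit(hidden, [2.0, -1.0], -0.5)
print(output)
```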

What is an LM?

We have now seen how a neural network can manipulate the mechanisms of a car in a way that mimics a human driver. We can apply the same idea to language: create a circuit that generates a sequence of words similar to how humans produce them.

The goal of a language model is to guess the next word given an input sequence of words. We can express this mathematically as the probability of a particular word given the previous words in the sequence. For example, given the words “once”, “upon”, and “a”, a good language model would assign a much higher probability to “time” than to “armadillo”: P(time | once, upon, a) > P(armadillo | once, upon, a).
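Here is a toy sketch of that contract: words in, a probability for each possible next word out. The probability table below is entirely invented; a real language model computes these numbers with the enormous circuit described next, but the input-output behavior is the same shape.

```python
# An invented probability table standing in for a real language model.
next_word_probs = {
    ("once", "upon", "a"): {"time": 0.90, "midnight": 0.05, "armadillo": 0.0001},
}

def guess_next(context):
    probs = next_word_probs[context]
    return max(probs, key=probs.get)  # pick the most probable next word

print(guess_next(("once", "upon", "a")))  # -> "time"
```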

To build a language model circuit, we would need a sensor for every word in the language and a striker arm for every word it could type out. For a language with 50,000 words, that means 50,000 sensors and 50,000 striker arms. And if we want to look at more than one input word at a time, we need a full set of sensors for each position in the input sequence.

The scale of this circuit quickly becomes overwhelming: just wiring 50,000 sensors directly to 50,000 striker arms takes 50,000 × 50,000 = 2.5 billion wires, and that is for a single input word. To tackle this challenge, we need some tricks, and we need to take things in stages.

As of 2023, the largest language models can handle roughly 32,000 words of input at a time, which is a lot, but still leaves a long way to go toward mimicking human language production at full scale.

Why are LLMs so powerful?

As someone who has worked with large language models like ChatGPT and GPT-4, I often get asked why these models are so powerful. The answer is actually quite simple: they are really good at guessing what word should come next. It may seem like a very specialized form of reasoning, but it’s proven to be incredibly effective.

One reason for this is that the transformer architecture used in these models mixes word contexts in a way that enables accurate predictions. Another is that these models are trained on massive amounts of text scraped from the internet, including books, blogs, news sites, Wikipedia articles, Reddit discussions, and social media conversations. During training, we feed the model a snippet of text and ask it to predict the next word. If it guesses wrong, we nudge its parameters so it is a little more likely to guess right next time. The end result is a model that can produce text that looks like it could reasonably have appeared on the internet.
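For the curious, here is that training loop in miniature, sketched with PyTorch. The model is a deliberately tiny stand-in (a one-word-of-context lookup table, not a transformer) and the “dataset” is a single sentence, but the loop itself is the same idea: guess the next word, measure how wrong the guess was, and back-propagate a tiny adjustment.

```python
import torch
import torch.nn as nn

# A toy vocabulary and a one-sentence "training set", invented for illustration.
vocab = ["once", "upon", "a", "time", "."]
word_to_id = {w: i for i, w in enumerate(vocab)}

text = ["once", "upon", "a", "time", "."]
inputs = torch.tensor([word_to_id[w] for w in text[:-1]])   # each word...
targets = torch.tensor([word_to_id[w] for w in text[1:]])   # ...and its next word

# One table of scores: row = previous word, column = guessed next word.
model = nn.Embedding(len(vocab), len(vocab))
optimizer = torch.optim.SGD(model.parameters(), lr=0.5)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):
    scores = model(inputs)           # the model's guesses for each next word
    loss = loss_fn(scores, targets)  # how wrong were the guesses?
    optimizer.zero_grad()
    loss.backward()                  # back propagation
    optimizer.step()                 # tiny adjustment to the parameters

# After training, "once" should strongly predict "upon".
print(vocab[model(torch.tensor([word_to_id["once"]])).argmax().item()])
```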

The diversity of text available on the internet is a major factor in why large language models are so effective. They have seen billions of conversations on every topic imaginable, which means they can produce text that sounds like a real conversation. They have seen billions of poems, music lyrics, homework assignments, standardized test questions, vacation plans, code snippets, and more. When you ask a large language model to do something, there’s a good chance it has seen billions of similar examples before.

It’s important to note that large language models don’t actually “think” or “reason” the way we do. Instead, they use their vast training data to produce something like the median response: roughly what a lot of people writing on the internet would come up with if they had to compromise. It might not be the most creative or sophisticated response, but it is usually a reasonable one.

So, when you interact with a large language model like ChatGPT, don’t be too impressed by its intelligence or creativity. Instead, recognize that it’s drawing from a massive amount of training data to provide you with a response that’s likely been seen before.

What should you look out for?

As a user of Large Language Models, there are certain things one should be aware of. First and foremost, because these models are trained on the internet, they have also been trained on all the negative aspects of humanity: racist rants, sexist screeds, insults, conspiracy theories, and political misinformation. A language model may therefore regurgitate such language, not because it holds or promotes those beliefs, but because such words appear in its training data. It is important to watch out for the subtle implications of how these models work and how they are trained.

Large Language Models are word guessers and do not have “core beliefs”. They try to predict what the next words would be if the same text appeared on the internet. Hence, one can ask a Large Language Model to write a sentence in favor of something and then against that very same thing, and it will comply both ways. These are not indications that it believes one thing or the other, or that it changes its beliefs. It will simply respond more consistently with whatever shows up more often in its training data, because it is striving to emulate the most common response.

Large Language Models do not have any sense of truth or right or wrong. While they may tend to guess words that we agree are true, there is no guarantee that the model will provide the truth. An LLM will tend to say that the Earth is round, but if the context is right, it will also say the opposite because the internet does have text about the Earth being flat. Thus, it is essential to verify the outputs of a large language model, especially for high-stakes tasks like deciding which stocks to invest in.

Large Language Models can make mistakes. The training data may contain a lot of inconsistent material, self-attention may not attend to everything we want it to when we ask a question, and, as a word guesser, the model may make unfortunate guesses. This leads to a phenomenon called “hallucination”, where a word is generated that is neither derived from the input nor “correct”. It is important to understand that Transformers have no way to “change their minds”, try again, or self-correct. If even one error is made, everything that comes after may be built on top of that error, and the model can then make additional errors on top of it.

As a user of Large Language Models, always remember that better prompts produce better results. Because of self-attention, the more information provided in the input prompt, the more specialized the response will be, because the model mixes more of the user’s words into its guesses. However, it is also important to understand that one isn’t really “having a conversation” with a large language model; it doesn’t “remember” what has happened in the exchange. Each turn, the user’s initial input, the model’s response, and the user’s reply to that response are all packed together and sent in as one fresh new input. So if it looks like the model is remembering, it is because the entire log of the conversation is resubmitted each time. This trick usually keeps it on topic, but there is no guarantee it won’t contradict its earlier responses.
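Here is a minimal sketch of that trick. The function names and the stand-in “model” are invented for illustration, but real chat interfaces do something similar behind the scenes.

```python
# The model has no memory. Every turn, the WHOLE conversation log is
# re-sent as one big prompt, and the model guesses the words that follow it.
history = []

def chat(user_message, generate):
    history.append("User: " + user_message)
    prompt = "\n".join(history) + "\nAssistant:"
    reply = generate(prompt)              # the model's word-guessing step
    history.append("Assistant: " + reply)
    return reply

# A toy stand-in for a real language model, so the sketch runs on its own:
fake_model = lambda prompt: "(a reply conditioned on %d characters)" % len(prompt)

print(chat("Hello!", fake_model))
print(chat("What did I just say?", fake_model))  # "remembers" only via the log
```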

In conclusion, Large Language Models have revolutionized the way we interact with computers. However, as a user, it is essential to understand their limitations and be aware of their training data. Always verify the outputs and understand that Large Language Models are not problem-solving or planning tools. While they can be asked to create plans and solve problems, they are only word guessers and cannot self-correct or change their minds.

