TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Unleashing the ChatGPT Tokenizer

Hands-On! How Does ChatGPT Manage Tokens?

9 min read · Jul 6, 2023

Self-made gif.

Have you ever wondered what the key components behind ChatGPT are?

We have all been told the same thing: ChatGPT predicts the next word. But that statement is not quite true. ChatGPT does not predict the next word; it predicts the next token.

Token? Yes, a token is the basic unit of text that Large Language Models (LLMs) read and generate.

Indeed, one of the first things ChatGPT does when processing any prompt is split the user input into tokens. That is the job of the so-called tokenizer.
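To make the idea of splitting input into tokens concrete, here is a minimal toy sketch. It is not how ChatGPT's tokenizer actually works: the vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `to_ids` are hypothetical, and the greedy longest-match rule is a simplification chosen only for illustration.

```python
# Hypothetical toy vocabulary: NOT the real ChatGPT vocabulary.
TOY_VOCAB = ["Chat", "G", "PT", " predict", "s", " the", " next", " token", "."]


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    # Try longer pieces first so " predict" wins over shorter candidates.
    pieces = sorted(vocab, key=len, reverse=True)
    tokens = []
    pos = 0
    while pos < len(text):
        for piece in pieces:
            if text.startswith(piece, pos):
                tokens.append(piece)
                pos += len(piece)
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character.
            tokens.append(text[pos])
            pos += 1
    return tokens


def to_ids(tokens, vocab=TOY_VOCAB):
    """Map each token to its index in the vocabulary (-1 if unknown)."""
    index = {piece: i for i, piece in enumerate(vocab)}
    return [index.get(tok, -1) for tok in tokens]


text = "ChatGPT predicts the next token."
tokens = toy_tokenize(text)
ids = to_ids(tokens)

print(tokens)  # the text pieces the model would "see"
print(ids)     # the integer IDs that stand in for those pieces
```

Note that a single word such as "predicts" can end up as more than one token, and that the leading space belongs to the token that follows it. The model only ever works with the integer IDs, never with the raw characters.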

In this article, we will uncover how the ChatGPT tokenizer works through hands-on practice with tiktoken, the original library used by OpenAI.

TikTok-en… Funny enough :)

Let’s dive in and understand the actual steps the tokenizer performs, and how its behavior affects the quality of ChatGPT’s output.

How the Tokenizer Works

In the article “Mastering ChatGPT: Effective Summarization with LLMs” we already saw some of the mysteries behind the ChatGPT tokenizer, but let’s start from scratch.

The tokenizer appears at the first step in the process of text generation. It is

Published in TDS Archive

Written by Andrea Valenzuela

AI Researcher 🚀 | Writing about Artificial Intelligence & Large Language Models 👩🏻‍💻 | Sharing tricks and experiences✨