How to Summarize Large Documents with LangChain and OpenAI

There are still some limitations when summarizing very large documents. Here are some ways to mitigate these effects.

Apr 22nd, 2024 9:41am by Usama Jamil

Image from Azston Designs on Shutterstock.

Large language models (LLMs) have made many tasks easier, such as building chatbots, translating languages and summarizing text. We used to train dedicated models for summarization, and performance was always an issue. Now LLMs make this easy. For example, state-of-the-art (SOTA) LLMs can already fit a whole book in their context window. But there are still some limitations when summarizing very large documents.

Limitations of Large Document Summarization by LLMs

The context limit, or context length, of an LLM is the number of tokens the model can process at once. Each model has its own context length, also known as max tokens or token limit. For instance, GPT-4 Turbo has a context length of 128,000 tokens, and the model cannot take in any tokens beyond that limit. Some SOTA LLMs have a context limit of up to 1 million tokens. However, as the context limit grows, LLMs suffer from limitations such as the recency and primacy effects. We will also look at ways to mitigate these effects.

Primacy effect in LLMs refers to the model giving more importance to information presented at the beginning of a sequence.
Recency effect pertains to the model emphasizing the most recent information it processes.

Both effects bias the model toward specific parts of the input data. The model may skip important information in the middle of the sequence.

The second issue is cost. We can work around the context limit by splitting the text, but we still can’t simply pass the whole book to the model, because it would cost a lot. For example, if we have a book of 1 million tokens and pass it directly to the GPT-4 model, our total cost would be around $90 (prompt and completion tokens). We have to find a middle way to summarize our text that balances the price, the context limit and the complete context of the book.
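The arithmetic behind an estimate like this can be sketched as follows. The per-million-token rates are illustrative assumptions chosen to reproduce the $90 figure, and the sketch assumes 1 million tokens on both the prompt and the completion side. Check current pricing before relying on any number here.

```python
def estimate_cost(prompt_tokens, completion_tokens,
                  prompt_rate_per_million, completion_rate_per_million):
    """Return the estimated API cost in dollars for one request."""
    prompt_cost = prompt_tokens / 1_000_000 * prompt_rate_per_million
    completion_cost = completion_tokens / 1_000_000 * completion_rate_per_million
    return prompt_cost + completion_cost

# Assumed, illustrative rates in dollars per 1 million tokens.
PROMPT_RATE = 30.0
COMPLETION_RATE = 60.0

whole_book = estimate_cost(1_000_000, 1_000_000, PROMPT_RATE, COMPLETION_RATE)
print(f"Estimated cost: ${whole_book:.2f}")
```

Because the cost scales linearly with the token count, sending only a subset of the book to the model reduces the bill proportionally.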

In this tutorial, you’ll learn to summarize a complete book considering the price and the contextual limit of the model. Let’s start.

Summarize Large Documents with LangChain and OpenAI

Setting up the Environment

To follow along with the tutorial, you need to have:

Python installed
An IDE (VS Code would work)

To install the dependencies, open your terminal and enter the command:

This command will install all the required dependencies.

Load the Book

For this project, you will use the book “David Copperfield” by Charles Dickens, which is publicly available. Let’s load the book using the `PyPDFLoader` utility provided by LangChain.

The loader reads the complete book, but we are only interested in the main content, so we can skip pages such as the Preface and Introduction.

Now, we have the content. Let’s print the first 200 characters.
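A minimal sketch of these two steps is shown below. The helper names `load_pages` and `extract_content`, and the page indices, are hypothetical. The right page range depends on your copy of the PDF. The stand-in pages only mimic the `page_content` attribute of LangChain documents so the sketch runs without a file.

```python
from types import SimpleNamespace

def load_pages(pdf_path):
    """Load a PDF into one LangChain Document per page."""
    # Imported lazily so the helper below works without LangChain installed.
    from langchain_community.document_loaders import PyPDFLoader
    return PyPDFLoader(pdf_path).load()

def extract_content(pages, first_page, last_page):
    """Join the text of pages[first_page:last_page] into one string."""
    return "".join(page.page_content for page in pages[first_page:last_page])

# Stand-in pages; with a real file, use pages = load_pages("your_book.pdf").
pages = [SimpleNamespace(page_content=text)
         for text in ["Preface. ", "Chapter 1. I am born. ", "Index."]]

# Keep only the main content (hypothetical indices for this toy example).
text = extract_content(pages, first_page=1, last_page=2)
print(text[:200])  # first 200 characters
```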

Pre-processing

Let’s remove unnecessary content from the text, such as non-printable characters and extra spaces.
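One way to do this cleanup is with a few regular expressions. This is a minimal sketch. The exact rules are assumptions and should be adapted to the artifacts in your own PDF.

```python
import re

def clean_text(text):
    """Normalize whitespace and drop non-printable control characters."""
    # Turn newlines, carriage returns and tabs into spaces.
    text = re.sub(r"[\n\r\t]", " ", text)
    # Drop the remaining ASCII control characters (for example, form feeds).
    text = re.sub(r"[\x00-\x1f\x7f]", "", text)
    # Collapse runs of spaces into a single space.
    text = re.sub(r" {2,}", " ", text)
    return text.strip()

sample = "I  am\nborn.\x0c  Whether\tI shall"
print(clean_text(sample))
```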

After cleaning the data, we are ready to dive into the summarizing problem.

Load the OpenAI API

Before using the OpenAI API, we need to configure it and provide our credentials.

Enter your API key there and it’ll set up the environment variable.
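A small sketch of this step, assuming the key is passed through the `OPENAI_API_KEY` environment variable, which the OpenAI SDK reads by default. The helper name `configure_openai_key` is hypothetical.

```python
import os
from getpass import getpass

def configure_openai_key(api_key=None):
    """Store the key in OPENAI_API_KEY, prompting for it if none is given."""
    if api_key is None:
        # getpass hides the key while it is typed.
        api_key = getpass("OpenAI API key: ")
    os.environ["OPENAI_API_KEY"] = api_key
```

Calling `configure_openai_key()` with no argument prompts for the key, which keeps it out of the source file.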

Let’s see how many tokens we have in the book:

We have over 466,000 tokens in this book, and passing them all directly to the LLM would cost a lot. So, to reduce the cost, we will use K-means clustering to extract the important chunks from the book.

Note: The decision to use K-means clustering was inspired by data guru Greg Kamradt’s tutorial.

To get important parts of the book, let’s first split the book into different chunks.

Split the Content into Documents

We will split the book content into documents by using the `SemanticChunker` utility of LangChain.

The `SemanticChunker` receives two arguments. The first is the embeddings model, whose embeddings are used to split the text based on semantics. The second is `breakpoint_threshold_type`, which determines the points at which the text should be split into different chunks based on semantic similarity.

Note: By processing these smaller, semantically similar chunks, we aim to minimize the recency and primacy effects in our LLM. This strategy allows our model to handle each small context more effectively, ensuring a more balanced interpretation and response generation.
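To make the idea of a percentile breakpoint concrete, here is a simplified pure-Python sketch. It is not the `SemanticChunker` implementation. It assumes one embedding per sentence, measures the cosine distance between neighboring sentences, and starts a new chunk wherever that distance exceeds a chosen percentile of all distances. The function names and the toy two-dimensional embeddings are hypothetical.

```python
import math

def cosine_distance(a, b):
    """1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def split_by_semantic_gaps(sentences, embeddings, pct=95):
    """Group consecutive sentences, breaking where the semantic gap is large."""
    if len(sentences) < 2:
        return [" ".join(sentences)] if sentences else []
    gaps = [cosine_distance(embeddings[i], embeddings[i + 1])
            for i in range(len(sentences) - 1)]
    threshold = percentile(gaps, pct)
    chunks, current = [], [sentences[0]]
    for sentence, gap in zip(sentences[1:], gaps):
        if gap > threshold:  # a breakpoint: close the current chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks

sentences = ["I was born.", "I grew up.", "The ship sailed.", "The sea was calm."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
chunks = split_by_semantic_gaps(sentences, toy_embeddings, pct=50)
print(chunks)
```

With only three gaps, a low percentile such as 50 is needed to produce a split. On a full book, a high percentile keeps only the sharpest topic changes as breakpoints.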

Find the Embeddings of Each Document

Now, let’s get the embeddings of each generated document. You will get the embeddings using the OpenAI embeddings API.

The `get_embeddings` method gives us the embeddings of all the documents.

Note: `text-embedding-3-small` is an embedding model released by OpenAI that is considered cheaper and faster.
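A sketch of what such a `get_embeddings` helper could look like with the OpenAI Python SDK. Passing the client in as a parameter is an assumption made here so that the function can be exercised offline. The stub below only mimics the shape of the SDK response and is not part of any library.

```python
from types import SimpleNamespace

def get_embeddings(texts, client, model="text-embedding-3-small"):
    """Return one embedding vector per input text."""
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

# With the real SDK:  from openai import OpenAI; client = OpenAI()
# Offline stand-in that returns a fake two-dimensional vector per text.
class StubEmbeddings:
    def create(self, model, input):
        return SimpleNamespace(
            data=[SimpleNamespace(embedding=[float(len(t)), 0.0]) for t in input]
        )

stub_client = SimpleNamespace(embeddings=StubEmbeddings())
vectors = get_embeddings(["I am born", "I observe."], stub_client)
print(vectors)
```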

Rearrange the Data

Next, we will convert lists of document contents and their embeddings into a pandas DataFrame for easier data handling and analysis.

Apply Faiss for Efficient Clustering

Now, we’ll transform the document vectors into a format compatible with Faiss, cluster them into 50 groups using K-means, and then create a Faiss index for efficient similarity searches among documents.

This K-means clustering will group the documents into 50 groups.

Note: We chose K-means clustering because the documents within a cluster have related embeddings, so each cluster holds similar content or context. From each cluster, we will select the document that is nearest to the centroid.
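To show what the clustering step computes, here is a small pure-Python K-means sketch. It is a stand-in for illustration, not the Faiss API, and it uses a deterministic initialization (an assumption made for reproducibility) and toy two-dimensional vectors instead of real embeddings.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(vectors, k, iterations=20):
    """Cluster vectors into k groups; returns (centroids, assignments)."""
    # Deterministic start: pick k evenly spaced vectors as initial centroids.
    centroids = [list(vectors[i * len(vectors) // k]) for i in range(k)]
    assignments = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments

toy_vectors = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
               [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
centroids, assignments = kmeans(toy_vectors, k=2)
print(assignments)
```

In the tutorial the same idea is applied to the document embeddings with 50 clusters, and Faiss does the work far more efficiently than this loop.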

Select the Important Documents

Now, we will select the most important document from each cluster. For this, we will take only the single vector nearest to each centroid.

Calling the `search` method on the index with the list of centroids finds the closest document to each centroid. It returns two arrays: `D`, which contains the distances of the closest documents to their respective centroids, and `I`, which contains the indices of these closest documents. The second parameter, `1`, specifies that only the single closest document is to be found for each centroid.

Now we need to sort the selected document indices so that the documents follow the order in which they appear in the book.
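The selection and sorting steps can be sketched in plain Python as follows. This is a brute-force stand-in for the nearest-neighbor search on the Faiss index, with toy vectors and centroids. The function name is hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_document_per_centroid(vectors, centroids):
    """For each centroid, return the index of the closest document vector."""
    return [
        min(range(len(vectors)), key=lambda i: squared_distance(vectors[i], centroid))
        for centroid in centroids
    ]

# Document vectors are listed in book order; centroids come in arbitrary order.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids = [[10.0, 10.4], [0.0, 0.2]]

nearest = nearest_document_per_centroid(vectors, centroids)
selected = sorted(nearest)  # restore the order of the book
print(nearest, selected)
```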

Get the Summary of Each Document

The next step is to summarize each selected document with the GPT-4 model. Summarizing only these documents, rather than the whole book, is what saves money. To use GPT-4, let’s define the model.

Define the prompt and make a prompt template using LangChain to pass it to the model.

This prompt template will help the model summarize the documents more effectively and efficiently.

The next step is to define a chain using LangChain Expression Language (LCEL).

The summarizing chain uses the `StrOutputParser` to parse the output. There are other output parsers to explore as well.

You can finally apply the defined chain on each document to get a summary.
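A sketch of the chain and the summarizing loop is shown below. The prompt wording and the helper names `build_summary_chain` and `summarize_documents` are hypothetical. The `EchoChain` stand-in only mimics the `invoke` method so that the loop can be run without an API key.

```python
from types import SimpleNamespace

def build_summary_chain(model_name="gpt-4"):
    """Compose prompt | model | parser with LCEL (needs langchain-openai and an API key)."""
    from langchain_core.output_parsers import StrOutputParser
    from langchain_core.prompts import ChatPromptTemplate
    from langchain_openai import ChatOpenAI

    # Hypothetical prompt wording; the only requirement is the {text} placeholder.
    prompt = ChatPromptTemplate.from_template(
        "Write a concise summary of the following passage:\n\n{text}\n\nSummary:"
    )
    llm = ChatOpenAI(model=model_name, temperature=0)
    return prompt | llm | StrOutputParser()

def summarize_documents(documents, chain):
    """Summarize each document in order and concatenate the results."""
    final_summary = ""
    for doc in documents:
        final_summary += chain.invoke({"text": doc.page_content}) + "\n"
    return final_summary

# Offline stand-in; with credentials, use chain = build_summary_chain().
class EchoChain:
    def invoke(self, inputs):
        return "Summary of: " + inputs["text"]

docs = [SimpleNamespace(page_content="chunk one"),
        SimpleNamespace(page_content="chunk two")]
final_summary = summarize_documents(docs, EchoChain())
print(final_summary)
```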

The code above applies the chain on each document one by one and concatenates each summary to the `final_summary`.

Save the Summary as a PDF

The next step is to format the summary and save it in PDF format.

So, here we have the complete summary of the book in PDF format.

Conclusion

In this tutorial, we’ve navigated the complexities of summarizing large texts such as entire books using LLMs while addressing challenges related to contextual limits and cost. We have learned the steps to preprocess the text and implement a strategy combining semantic chunking and K-means clustering to manage the model’s contextual limitations effectively.

By clustering the chunks, we efficiently extracted key passages, reducing the overhead of processing massive texts directly. This approach not only reduces costs significantly by minimizing the number of tokens processed but also mitigates the recency and primacy effects inherent in LLMs, ensuring a balanced consideration of all text segments.

There has been significant excitement about developing AI applications through the APIs of LLMs, where vector databases play a significant role by offering efficient storage and retrieval of contextual embeddings. MyScaleDB is a vector database that has been designed specifically for AI applications, keeping all the factors in mind such as cost, accuracy and speed. Its SQL-friendly interface allows developers to start developing their AI applications without learning something new.

If you want to discuss this further with us, you are welcome to join the MyScale Discord to share your thoughts and feedback.

Usama Jamil, a developer advocate at MyScale, brings with him a wealth of experience and a profound interest in data science. With a passion for exploring new trends in the AI/ML domain, Usama strives to make complex concepts accessible to...