Building GPT Applications on Open Source LangChain, Part 2

We’ll use the fast-rising LLM application framework for a practical example of how to use a GPT to help answer a question from a PDF document.

Jun 16th, 2023 8:15am by Akmal Chaudhri


This is the second of two articles. In the previous article, we discussed three considerations for developers when building GPT applications with an open source stack, such as LangChain.

Let’s now use LangChain for a practical example where we want to store and analyze PDF documents. We’ll obtain a PDF document, divide it into smaller parts, save the document text and its vector representations (embeddings*) in a database system and then query it. We’ll also use a GPT to help answer a question.

*In a GPT, an embedding is simply a numerical representation of a word or phrase. Vectors represent the semantic meaning of words and phrases in a way that a machine-learning model can understand.
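To make the idea of an embedding concrete, here is a minimal sketch that compares a few vectors. The three-dimensional vectors and their values are invented for illustration; real embeddings have many more dimensions (1,536 for OpenAI’s text-embedding-ada-002 model), and `cosine_similarity` is a helper written for this sketch, not a LangChain function.

```python
import math

# Toy "embeddings": the values are made up for illustration only.
toy_embeddings = {
    "cat":    [0.90, 0.80, 0.10],
    "kitten": [0.85, 0.90, 0.15],
    "car":    [0.10, 0.20, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

cat_kitten = cosine_similarity(toy_embeddings["cat"], toy_embeddings["kitten"])
cat_car = cosine_similarity(toy_embeddings["cat"], toy_embeddings["car"])

# Related words should score higher than unrelated ones.
print(f"cat vs kitten: {cat_kitten:.3f}")
print(f"cat vs car:    {cat_car:.3f}")
```

With these toy values, the related pair scores much closer to 1.0 than the unrelated pair, which is the property a vector search relies on.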

Create a SingleStoreDB Cloud Account

First, sign up for a free SingleStoreDB Cloud account. Once logged in, select CLOUD > Create new workspace group from the left-hand navigation pane. Next, choose Create Workspace and just work through the wizard. Here are the recommended settings for this example:

Create Workspace Group

Workspace Group Name: LangChain Demo Group
Cloud Provider: AWS
Region: US East 1 (N. Virginia)

Click Next.

Create Workspace

Workspace Name: langchain-demo
Size: S-00

Click Create Workspace.

Once the workspace is created and available, from the left-hand navigation pane, select DEVELOP > SQL Editor to create a new database, as follows:

CREATE DATABASE IF NOT EXISTS pdf_db;

Create a Notebook

From the left-hand navigation pane, select DEVELOP > Notebooks. In the top right of the web page, select New Notebook > New Notebook, as shown in Figure 1 below.

We’ll call the notebook langchain_demo. Select a Blank notebook template from the available options. We’ll also select the Connection and Database using the drop-down menus above the notebook, as shown in Figure 2.

Figure 2. Connection and Database

Fill out the Notebook

First, we’ll install some libraries:

!pip install langchain --quiet
!pip install openai --quiet
!pip install pdf2image --quiet
!pip install tabulate --quiet
!pip install tiktoken --quiet
!pip install unstructured --quiet

Next, we’ll read in a PDF document. This is an article by Neal Leavitt titled “Whatever Happened to Object-Oriented Databases?” OODBs were an emerging technology during the late 1980s and early 1990s. We’ll add `leavcom.com` to the firewall by selecting the Edit Firewall option in the top right. Once the address has been added to the firewall, we’ll read the PDF file:

from langchain.document_loaders import OnlinePDFLoader
loader = OnlinePDFLoader("http://leavcom.com/pdf/DBpdf.pdf")
data = loader.load()

We can use LangChain’s OnlinePDFLoader, which makes reading a PDF file easier. Next, we’ll get some data on the document:

from langchain.text_splitter import RecursiveCharacterTextSplitter

print(f"You have {len(data)} document(s) in your data")
print(f"There are {len(data[0].page_content)} characters in your document")

The output should be:

You have 1 document(s) in your data
There are 13040 characters in your document

We’ll now split the document into chunks of at most 2,000 characters each, with no overlap between them. We’ll call these chunks pages, and this document gives us seven of them:

text_splitter = RecursiveCharacterTextSplitter(chunk_size = 2000, chunk_overlap = 0)
texts = text_splitter.split_documents(data)

print(f"You have {len(texts)} pages")
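As a rough check on the page count, here is a minimal sketch of plain fixed-size splitting. It is not how `RecursiveCharacterTextSplitter` works internally: that splitter prefers to break on separators such as paragraph breaks and newlines, so its chunks are often shorter than the limit. The `split_fixed` helper and the placeholder text are assumptions made for this sketch.

```python
import math

def split_fixed(text, chunk_size):
    """Cut text into consecutive pieces of at most chunk_size characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

# Stand-in for the 13,040-character document reported earlier.
sample_text = "x" * 13040

chunks = split_fixed(sample_text, 2000)

# Number of chunks, and the minimum count any 2,000-character limit allows.
print(len(chunks))
print(math.ceil(len(sample_text) / 2000))

# Only the last chunk is shorter than the limit with fixed-size splitting.
print(len(chunks[-1]))
```

Fixed-size splitting gives the smallest possible number of chunks for a given limit, so a separator-aware splitter can only match or exceed that count.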

Next, we’ll create a table to store the text and embeddings. We can do this directly using the `%%sql` magic command:

%%sql

USE pdf_db;
DROP TABLE IF EXISTS pdf_docs;
CREATE TABLE IF NOT EXISTS pdf_docs (
    id INT PRIMARY KEY,
    text TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_general_ci,
    embedding BLOB
);

To use Python code to connect to our database, we can use the built-in `connection_url`, as follows:

from sqlalchemy import create_engine
db_connection = create_engine(connection_url)

We’ll set our OpenAI API Key:

import openai
openai.api_key = "OpenAI API Key"

and use LangChain’s `OpenAIEmbeddings`:

from langchain.embeddings import OpenAIEmbeddings
embedder = OpenAIEmbeddings(openai_api_key = openai.api_key)

Now we are ready to obtain the vector embeddings and store them in the database system:

db_connection.execute("TRUNCATE TABLE pdf_docs")

for i, document in enumerate(texts):
    text_content = document.page_content

    embedding = embedder.embed_documents([text_content])[0]

    stmt = """
        INSERT INTO pdf_docs (
            id,
            text,
            embedding
        )
        VALUES (
            %s,
            %s,
            JSON_ARRAY_PACK_F32(%s)
        )
    """

    db_connection.execute(stmt, (i+1, text_content, str(embedding)))
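The INSERT statement passes the embedding as JSON array text, and `JSON_ARRAY_PACK_F32` converts it into a compact binary value for the BLOB column. The sketch below mimics that conversion in plain Python, assuming the packed format is a sequence of little-endian 32-bit floats; that layout is our assumption here, so check the SingleStoreDB documentation before relying on it. The `pack_f32` and `unpack_f32` helpers and the shortened embedding are hypothetical.

```python
import json
import struct

def pack_f32(values):
    """Pack a list of floats as little-endian 32-bit floats (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_f32(blob):
    """Inverse of pack_f32: recover the list of floats from the bytes."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

# Hypothetical, shortened embedding; real ones have many more values.
embedding = [0.5, -0.25, 0.125]

# json.dumps yields valid JSON array text for finite floats.
as_json = json.dumps(embedding)

blob = pack_f32(json.loads(as_json))
print(as_json)
print(len(blob))          # 4 bytes per value
print(unpack_f32(blob))   # round trip back to a list of floats
```

Under this assumption, a 1,536-value embedding occupies 6,144 bytes. Using `json.dumps(embedding)` in place of `str(embedding)` is a slightly safer way to build the parameter, because it is guaranteed to produce JSON rather than Python list syntax that merely happens to look like JSON.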

We truncate the table to ensure that we start with an empty table. Then we iterate through the pages of text, obtain the embeddings from OpenAI, and store the text and embeddings in the database table.

We can now ask a question, as follows:

query_text = "Will object-oriented databases be commercially successful?"

query_embedding = embedder.embed_documents([query_text])[0]

stmt = """
    SELECT
        text,
        DOT_PRODUCT_F32(JSON_ARRAY_PACK_F32(%s), embedding) AS score
    FROM pdf_docs
    ORDER BY score DESC
    LIMIT 1
"""

results = db_connection.execute(stmt, str(query_embedding))

for row in results:
    print(row[0])
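The ranking that the SQL query performs can be sketched in plain Python. The rows, texts and three-dimensional vectors below are invented stand-ins for the `pdf_docs` table, and `dot_product` is a helper written for this sketch. The toy vectors are unit length, as OpenAI embeddings are, which is why a plain dot product works as a similarity score.

```python
def dot_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

# Hypothetical rows standing in for the pdf_docs table: (text, embedding).
rows = [
    ("page about relational databases", [0.6, 0.8, 0.0]),
    ("page about object databases",     [0.0, 0.6, 0.8]),
    ("page about XML",                  [0.8, 0.0, 0.6]),
]

# Hypothetical query embedding, also unit length.
query_embedding = [0.0, 0.8, 0.6]

# Equivalent of: SELECT text, score ... ORDER BY score DESC LIMIT 1
scored = [(text, dot_product(query_embedding, emb)) for text, emb in rows]
scored.sort(key=lambda pair: pair[1], reverse=True)
best_text, best_score = scored[0]

print(best_text, round(best_score, 2))
```

Changing the slice from `scored[0]` to `scored[:3]` would correspond to `LIMIT 3` in the SQL, which is one way to give the GPT more context than a single page.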

Here we convert the question into a vector embedding, compute its `DOT_PRODUCT` with each stored embedding and return only the highest-scoring page.

Finally, we can use a GPT to provide an answer, based on the earlier question:

prompt = f"The user asked: {query_text}. The most similar text from the document is: {row[0]}"

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt}
    ]
)

print(response['choices'][0]['message']['content'])

Here is some example output:

Based on the information provided in the document, it seems that object-oriented databases are not expected to be commercially successful in the near future. While they are gaining some popularity in niche markets such as CAD and telecommunications, relational databases continue to dominate the market and are expected to do so for the foreseeable future. IDC predicts that the growth rate for relational databases will be significantly higher than that of OO databases through 2004. However, OO databases still have their place in certain niche markets.
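The prompt in the notebook reads `row[0]` from the loop variable left over by the earlier query, which fails if the query returned no rows. A slightly more defensive variant, sketched below, collects the retrieved text into a list first and builds the chat messages from it. The `build_messages` helper and the sample text are hypothetical, and no API call is made here.

```python
def build_messages(question, context_rows):
    """Build chat messages from a question and the retrieved text rows."""
    if not context_rows:
        # No match found: say so instead of sending an empty context.
        context = "(no matching text was found in the document)"
    else:
        context = "\n\n".join(context_rows)

    prompt = (
        f"The user asked: {question}\n"
        f"The most similar text from the document is:\n{context}"
    )
    return [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ]

# Hypothetical retrieved text, standing in for rows returned by the query.
messages = build_messages(
    "Will object-oriented databases be commercially successful?",
    ["Relational databases continue to dominate the market."],
)
print(messages[1]["content"])
```

The returned list can be passed as the `messages` argument of the `openai.ChatCompletion.create` call shown earlier.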

Summary

In this example, we saw the benefits of LangChain in the application development process. We also saw how easily we can convert documents from one format to another, store the content in a database system, generate vector embeddings and ask questions about the data stored in the database system. We also have the full power of SQL available if we are interested in performing additional query operations on the data.

I will host a workshop on June 22 and will go through building a ChatGPT application using LangChain. I hope you can join. Sign up here.

Akmal Chaudhri helps build global developer communities and raise awareness of technology through presentations and technical writing. He has held roles as a developer, consultant, product strategist, evangelist, technical writer and technical trainer with several Blue Chip companies and big...