SQL Schema Generation With Large Language Models
I’ve looked at both regex and JSON persistence generation with LLMs, but it is Structured Query Language (SQL) that many believe is handled well by AI. To help celebrate SQL’s 50th birthday, let’s talk tables, introducing technical terminology as we need it. However, I don’t want to simply test queries against existing tables. The world of relational databases starts with the schema. A schema describes a group of tables that interact to allow SQL queries to answer questions about real world system models. We use various constraints to control how the tables relate to each other. In this example, I’ll develop a schema about books, authors and publishers. Then we’ll see if an LLM can reproduce the work. We start with the relationships between our things. A book is written by an author and published by a publisher. Indeed, the publication of a book defines the relationship between author and publisher. So in concrete terms, we want to produce a result like this:
| Book | Author | Publisher | Release Date |
|---|---|---|---|
| The Wasp Factory | Iain Banks | Abacus | 16 February 1984 |
| Consider Phlebas | Iain M. Banks | Orbit | 14 April 1988 |
CREATE TABLE Authors (
ID int NOT NULL AUTO_INCREMENT,
Name varchar(255) not null,
Birthday date not null,
PRIMARY KEY (ID)
);
CREATE TABLE Publishers (
ID int NOT NULL AUTO_INCREMENT,
Name varchar(255) not null,
Address varchar(255) not null,
PRIMARY KEY (ID)
);
CREATE TABLE Books (
Name varchar(255) NOT NULL,
AuthorID int, PublisherID int,
PublishedDate date NOT NULL,
PRIMARY KEY (Name, PublishedDate),
FOREIGN KEY (AuthorID) REFERENCES Authors(ID),
FOREIGN KEY (PublisherID) REFERENCES Publishers(ID)
);
CREATE VIEW ViewableBooks AS
SELECT Books.Name 'Book', Authors.Name 'Author', Publishers.Name 'Publisher', Books.PublishedDate 'Date'
FROM Books, Publishers, Authors
WHERE Books.AuthorID = Authors.ID
AND Books.PublisherID = Publishers.ID;
INSERT INTO Authors (Name, Birthday)
VALUES ('Iain Banks', '1954-02-16');
INSERT INTO Authors (Name, Birthday)
VALUES ('Iain M Banks', '1954-02-16');
INSERT INTO Publishers (Name, Address)
VALUES ('Abacus', 'London');
INSERT INTO Publishers (Name, Address)
VALUES ('Orbit', 'New York');
Can an LLM Also Create Schemas?
OK, so now we want to ask an LLM about creating schemas. To summarize how we want to guide the LLM:- When asked in English for a schema, we want it to generate the DDL for three tables, with indexes and constraints.
- We can also hint at the need for constraints (primary keys, foreign keys, etc.) if we need to.
- We can ask for a view.
- We can nudge it toward MySQL syntax, if needed.
So far so good. It hasn’t created the DDL, but we can ask that separately. It has in some ways done better by describing the schema in English. Let’s look at the rest of the reply:
It has described foreign key constraints and added the ISBN, which I didn’t think of. Also “PublicationDate” is better English than my “PublishedDate.” And it made one more table:
This solves the problem for multiple authors for one book — not a problem I was thinking about. The term bridge table indicates the joining of the two tables (books and authors) via foreign keys.
Let’s ask for the DDL: “Show me the data definition language for this schema.”
These came back correctly, including the NOT NULLs to ensure no empty entries. It also noted that the DDL was in some ways “generic” due to differences between vendor SQLs in the real world.
Finally, let’s ask for a view:
This is a much more complicated version than mine; however, when adapted for my schema naming, it works just fine in DB Fiddle. The table alias naming seen here doesn’t really help with comprehension.