Data Dignity: Developers Must Solve the AI Attribution Problem
AI is ripping us off. This post looks at the problems of attribution and how the dev community can help fix this with metadata solutions.
May 27th, 2023 4:00am by
- The belief that our personal data and created media have been abused by large corporations;
- The new GPT-style AIs that can trawl vast amounts of user data from the web; and
- The fact that we currently don’t have a consistent way of digitally memorializing who wrote what.
What Is Data Dignity?
“Data Dignity” is a movement that predates generative AI and is closely associated with noted tech commentator Jaron Lanier. The theory is that the economy should be altered to compensate people when data about them is used, or when data they created is remixed. These ideas are driven by the view that the “free” online economy has been a disaster in terms of recognition and remuneration. It is quite clear that generative AI will make this position much worse.
Our concern in this article is not with the freedom of information to travel; it is with the information about that information (the metadata), and with making sure it does not become lost baggage. So what I’m advocating is, to retain the travel theme, a baggage tag for information. And some trusty baggage handlers.
Take any document on the web, and a quote in that document. There is no simple automated way of getting the author of that document, or of ensuring the quote is attributed correctly to the author if it later appears elsewhere. Hence when a GPT AI mashes two disparate paragraphs together, the provenance is completely lost.
By comparison, Twitter is structured to memorialize the author of a tweet, and even the author of the tweet a tweeter retweets. The metadata associated with a tweet (everything associated with its creation, other than the words) is more than just the author; it includes time, location, language and a unique ID. So given that Twitter works, why did we let the web go wrong?
During the 2000s, the Semantic Web was proposed as an improved version of the World Wide Web. The goal was to create a web where intelligent agents would be able to understand the content of web pages, using injected metadata to provide useful services to humans or to interact with other intelligent agents. Unfortunately, the Semantic Web project was always academic in nature. The proposed languages for adding metadata to web pages were difficult to use. The inference engines in the early 2000s were slow. As we’ll see below, metadata is all too often a weapon of Search Engine Optimization (SEO), not a sword of truth. And metadata itself is not static; it can age and needs to be maintained. Another problem came with the birth and vertiginous rise of JSON. It came a little too late for the Semantic Web, so JSON’s older (and much uglier) step-sister XML was used. But we should accept that the aim of the project was good.
How to Fix Things
There are three basic paths to inject meaning back into the web, and help fix the attribution problem:
- Let AI create and maintain metadata.
- Use better tools and agreements to re-inject useful and consistent metadata back into the web.
- Stick the metadata transparently into the cloud.
It seems that Google can already do the first of these, even though it doesn’t trumpet the solution:
Note that the author is clearly selected in bold, like a search result.
If we just let a few large companies form tons of metadata about everything so that their LLMs can train properly, we could then task the same companies with using AI to track attribution. This may seem like a reasonable thing to do, but without oversight we will never be sure exactly how much metadata is stored while achieving this.
Metadata in the Document
Web pages have a built-in ability to store metadata in tag form for free, without distorting the information they present. In fact, the point of HTML is to use metadata to enhance information. Inside the document you are reading now, you might find the following tag:
<meta name="author" content="David Eastman" class="yoast-seo-meta-tag">
<meta name="author" content="Condé Nast">
<meta property="article:author" content="Jaron Lanier">
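Tags like these can be read back mechanically. The following is a minimal sketch, assuming only Python's standard library; the class name `AuthorMetaParser`, the helper `extract_authors` and the choice of which attribute values count as an author marker are illustrative assumptions, not part of any standard.

```python
from html.parser import HTMLParser


class AuthorMetaParser(HTMLParser):
    """Collect author names declared in <meta> tags."""

    # Attribute values treated as author markers. This set is an assumption
    # based on the example tags above; real pages vary widely.
    AUTHOR_KEYS = {"author", "article:author"}

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # The key may live in either the "name" or the "property" attribute.
        key = (attr.get("name") or attr.get("property") or "").lower()
        content = attr.get("content")
        if key in self.AUTHOR_KEYS and content:
            self.authors.append(content)


def extract_authors(html: str) -> list[str]:
    """Return every author declared in the page's meta tags, in order."""
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors


sample = """
<head>
<meta name="author" content="David Eastman" class="yoast-seo-meta-tag">
<meta property="article:author" content="Jaron Lanier">
<meta name="description" content="not an author">
</head>
"""
print(extract_authors(sample))  # ['David Eastman', 'Jaron Lanier']
```

Note that the sketch trusts whatever the page declares. It has no way to tell an honest author tag from one planted for SEO purposes.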
An API Solution
A simple REST solution could be implemented in most content platforms. You can already query this site over HTML for authors; for example, “https://thenewstack.io/author/david-eastman/” will yield my articles, although you would need to know that formulation of my name and accept that a few early articles won’t appear. What would be more useful is to extract the author (along with other metadata) for any given article in a RESTful fashion:
https://thenewstack.io/how-to-software/query?metadata=author
The above just uses the natural REST query interface, although it could be achieved in a neater formulation; all we want is “return the author of the post called how-to-software”. So if you are designing a REST API for any content platform, make sure the user can get metadata back for any page they have access to.
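To make the idea concrete, here is a rough sketch of how a platform might answer such a query. It is not how this site works: the `POSTS` store, the `metadata_response` function, the placeholder author and the error shapes are all hypothetical, and only the URL pattern comes from the example above.

```python
import json
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory store mapping a post slug to its metadata.
# A real platform would read this from its content database.
POSTS = {
    "how-to-software": {
        "author": "Example Author",  # placeholder value, not real data
        "language": "en",
    },
}


def metadata_response(url: str) -> tuple[int, str]:
    """Return (HTTP status, JSON body) for /<slug>/query?metadata=<field>."""
    parsed = urlparse(url)
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[1] != "query":
        return 404, json.dumps({"error": "unknown route"})

    post = POSTS.get(parts[0])
    if post is None:
        return 404, json.dumps({"error": "unknown post"})

    # parse_qs gives a list per key, so ?metadata=author&metadata=language works.
    fields = parse_qs(parsed.query).get("metadata", [])
    if not fields:
        return 400, json.dumps({"error": "missing metadata parameter"})

    unknown = [f for f in fields if f not in post]
    if unknown:
        return 404, json.dumps({"error": "unknown metadata field", "fields": unknown})

    return 200, json.dumps({f: post[f] for f in fields})


status, body = metadata_response(
    "https://thenewstack.io/how-to-software/query?metadata=author"
)
print(status, body)  # 200 {"author": "Example Author"}
```

Keeping the handler as a pure function from URL to status and body means the same logic could sit behind whatever web framework a content platform already uses.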
Store the Information Elsewhere
Metadata could also be collected and placed in a neutral repository. This would allow third parties to work on metadata while giving the appropriate public access to it.
For example, there are a lot of companies that will use AI to do content moderation, which will probably try to add metadata context to existing dodgy media. One example startup proposing AI video moderation is unitary.ai. Ironically, this is the converse problem: instead of making sure media retains attribution, it adds metadata to media that might otherwise want to stay in the shadows. If the metadata were in a neutral location, users wouldn’t have to accept the platform’s last word on all moderation issues.
Similarly, regulated industries trying to use generative AI will probably interact with compliance middleware to avoid compromising recognized compliance standards in user responses. It seems reasonable that the rules and generated metadata would be kept in an open and accessible fashion.
Clearly, the dev challenges here are to design architecture and standards to solve these problems. So the future for fair generative AI does depend on the development community’s willingness to provide plenty of ways for attribution to be kept; otherwise AI will spend most of its time chatting with the legal system.