Revolutionizing document processing with multimodal GPT

A set of experiments with the new GPT-4V demonstrating its potential to make a whole industry obsolete

Oct 30, 2023


Welcome to Deep Dives - an AI Tidbits section providing editorial takes and insights to make sense of the latest in AI. Let’s go!

In June 2020, OpenAI unveiled GPT-3. As a veteran in the document processing domain, I had long recognized the limitations of prevailing document extraction technologies, which largely relied on rigid, rule-based logic. I wondered if language models could be the answer to intelligent data extraction. And indeed, they were.

GPT-powered document intelligence with AirPaper.

What started as a side project turned into a venture called AirPaper. Back then, GPT-3 was the cutting-edge language model, and it was only one API call away. The main challenges were that GPT-3 was expensive, 55x the price of today’s GPT-3.5 Turbo, and had a tiny context window of 2,048 tokens, compared to today’s 32k.

Another challenge was that language models, even if performant, only work with text. This necessitated an extensive preprocessing phase to prepare documents for GPT: extracting text via OCR, structuring it in a way that would fit GPT’s limited context window, and intelligently mapping GPT’s output to the relevant fields, such as invoice numbers or sales tax amounts on an invoice.
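To make that pipeline concrete, here is a minimal sketch of the last two steps: packing OCR'd lines into chunks that fit a small context window, and mapping a model's reply onto named fields. It is not AirPaper's actual code. The function names, the 4-characters-per-token estimate, the reserved-token budget, and the "name: value" reply format are all assumptions made for illustration.

```python
import re

# Assumption: GPT-3's 2,048-token window, with part of it set aside
# for the instructions and the model's reply (hypothetical split).
CONTEXT_WINDOW = 2048
RESERVED_TOKENS = 548


def estimate_tokens(text):
    """Rough estimate of ~4 characters per token (a heuristic, not a real tokenizer)."""
    return max(1, (len(text) + 3) // 4)


def chunk_lines(ocr_lines, budget=CONTEXT_WINDOW - RESERVED_TOKENS):
    """Greedily pack OCR lines into chunks whose estimated size fits the budget.

    A single line larger than the budget still becomes its own chunk.
    """
    chunks, current, used = [], [], 0
    for line in ocr_lines:
        cost = estimate_tokens(line)
        if current and used + cost > budget:
            chunks.append("\n".join(current))
            current, used = [], 0
        current.append(line)
        used += cost
    if current:
        chunks.append("\n".join(current))
    return chunks


# Assumed reply format: one "Field name: value" pair per line.
FIELD_PATTERN = re.compile(r"^\s*([A-Za-z_ ]+?)\s*:\s*(.+?)\s*$")


def map_fields(model_output, expected_fields):
    """Keep only the expected fields from the reply; missing ones map to None."""
    found = {}
    for line in model_output.splitlines():
        match = FIELD_PATTERN.match(line)
        if not match:
            continue
        key = match.group(1).strip().lower().replace(" ", "_")
        if key in expected_fields and key not in found:
            found[key] = match.group(2)
    return {name: found.get(name) for name in expected_fields}


if __name__ == "__main__":
    reply = "Invoice number: INV-1042\nSales tax: 18.50\nNote: thanks"
    print(map_fields(reply, ["invoice_number", "sales_tax", "total"]))
```

In a real pipeline a proper tokenizer would replace the character heuristic, and the field mapping would need to tolerate messier replies than this tidy format.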

That was in 2020.

Enter Multimodal AI

The space of document intelligence has undergone massive shifts in recent years, with the underlying technology gradually becoming commoditized. A growing number of state-of-the-art libraries have been released, most of them under a commercially permissive license:

Donut 🍩
PaddleOCR
layoutlm-document-qa
Deepdoctection
And the impressive LayoutLMv3, the only one of these that does not allow commercial use

But then powerful multimodal AI arrived in the form of LLaVA and GPT-4V, the latter easily accessible to anyone with ChatGPT access.

Once again I wondered: can GPT-4V turn mere images of documents into structured data? The results were mind-blowing.

Let's dive deeper into some of the use cases I've explored and their results.
