The future of Internet Search in the era of LLMs
How OpenAI's GPTBot and Google's new generative-AI-infused Search will redefine how we seek information and shop online
Welcome to Deep Dives - an AI Tidbits section providing editorial takes and insights to make sense of the latest in AI. Let’s go!
Three months ago, after years without any substantial updates, Google refactored its money-making machine - Google Search - infusing it with generative AI and launching the Search Generative Experience. Implicitly, it promised a new experience: no more scrolling through results, following links, and skimming whole web pages to find the right information. Instead, a Generative Overview box, ready to answer follow-up questions and turn into a chat session with Bard.
This week, Google expanded its Search Generative Experience to show related images and videos for a given search query, capitalizing on its last few years of multimodal research - a capability first unveiled as part of LaMDA at Google I/O ’21.
Yet another announcement this week had a major impact on internet search as we know it. OpenAI announced GPTBot - its web crawler that collects internet data to improve future models. It filters out sources that require paywall access, are known to collect personally identifiable information (PII), or contain text that violates OpenAI’s policies. Its documentation page states that “allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety” - an altruistic goal that doesn’t align with content owners’ interests. In fact, it remains unclear why everyone wouldn’t block GPTBot, given the current lack of incentives. Maybe there are lessons from SEO that LLM companies should consider.
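Blocking the crawler is a one-stanza change to a site’s robots.txt. Per OpenAI’s documentation, GPTBot identifies itself with the `GPTBot` user-agent token, so a webmaster who sees no upside can opt out entirely:

```
User-agent: GPTBot
Disallow: /
```

A partial opt-out works the same way, swapping `/` for the specific paths to exclude.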
Misalignment of interests and a 1994 protocol
The internet had fewer than 10 million websites when Google turned its PageRank paper into the commercial Search product. Google wanted to index the world’s information, and the robots.txt file, introduced in 1994 by Martijn Koster, was the way to do it. It made crawling more efficient and reliable, as only the content the webmaster deemed relevant was indexed. And above all, it provided legal consent for Google to use this content.
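Compliance with robots.txt is purely voluntary - the protocol works only because crawlers choose to honor it. A minimal sketch of what a well-behaved crawler does before fetching a page, using Python’s standard-library robots.txt parser (the rules and URLs here are illustrative, not any site’s actual policy):

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: block GPTBot everywhere,
# keep only /private/ off-limits to Googlebot.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A compliant crawler calls can_fetch() before requesting any URL.
print(parser.can_fetch("GPTBot", "https://example.com/article"))       # False
print(parser.can_fetch("Googlebot", "https://example.com/article"))    # True
print(parser.can_fetch("Googlebot", "https://example.com/private/x"))  # False
```

Nothing enforces these answers, of course - the webmaster’s directives only bind crawlers that ask.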