LinkedIn Highlights, Dec 2024
Claude’s new PDF API, a playground for building with the new Gemini real-time Multimodal API, open video-understanding models from Meta, an open-source Perplexity alternative, and easy LLM fine-tuning
Welcome to LinkedIn Highlights!
Each month, I'll share my five top-performing LinkedIn posts, bringing you the best of AI straight from the frontlines of academia and industry.
As a frequent LinkedIn contributor, I regularly share insights on groundbreaking papers, promising open-source packages, and significant AI product launches. These posts offer more depth and detail than our weekly snippets, providing a comprehensive look at the latest AI developments.
Whether you're not on LinkedIn or simply missed a post, this monthly roundup ensures you stay informed about the most impactful AI news and innovations.
1. MindSearch
An open-source search engine is rivaling top-tier AI products like Perplexity.ai Pro and ChatGPT web search.
MindSearch is an innovative AI search engine framework that combines LLMs and a multi-agent system to tackle three critical issues that often limit LLM-powered search engines:
LLMs struggle to decompose complex queries into simpler, actionable requests
Search results often contain too much noise, making it hard to filter and extract relevant information
Iterative searches can quickly exceed the LLM’s context window
MindSearch utilizes two main components:
WebPlanner - decomposes complex queries into sub-tasks and creates a dynamic graph structure for problem-solving
WebSearcher - conducts fine-grained searches and delivers summarized information back to WebPlanner for further refinement
This approach allows MindSearch to handle massive amounts of web content (e.g., more than 300 pages for a single query), surpassing limitations faced by traditional LLM-based search systems.
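To make the planner/searcher split concrete, here is a minimal Python sketch of the idea (not MindSearch's actual code); call_llm and web_search are hypothetical stand-ins for an LLM client and a search API.

```python
# Conceptual sketch of MindSearch's planner/searcher split (illustrative only).
# call_llm and web_search are hypothetical stand-ins, not real APIs.

def call_llm(prompt: str) -> str:
    """Stand-in for an LLM call."""
    return f"[LLM output for: {prompt[:60]}...]"

def web_search(query: str, max_results: int = 5) -> list[str]:
    """Stand-in for a search engine API returning page snippets."""
    return [f"[snippet {i} for '{query}']" for i in range(max_results)]

def web_planner(question: str) -> list[str]:
    """Decompose a complex question into simpler sub-queries (the graph nodes)."""
    plan = call_llm(f"Break this question into independent search queries:\n{question}")
    return [q.strip() for q in plan.splitlines() if q.strip()]

def web_searcher(sub_query: str) -> str:
    """Search one sub-query and compress the noisy results into a short summary."""
    snippets = web_search(sub_query)
    return call_llm(f"Summarize only the facts relevant to '{sub_query}':\n" + "\n".join(snippets))

def mind_search(question: str) -> str:
    # The planner builds the sub-query graph; each searcher node returns a short
    # summary, so the planner never has to ingest hundreds of raw pages at once.
    summaries = [web_searcher(q) for q in web_planner(question)]
    return call_llm(f"Answer '{question}' using these findings:\n" + "\n".join(summaries))

print(mind_search("How did EV market share change in Europe vs. the US in 2024?"))
```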
Code https://github.com/InternLM/MindSearch
2. Gemini Multimodal Playground
Holiday coding project: Build voice agents that can see with Google's new Gemini 2.0 model and my new real-time Multimodal Playground repo.
The playground implements real-time voice- and video-based interactions with the new Gemini model, enabling natural conversations while tackling the background-noise challenge with Voice Activity Detection (VAD).
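For the curious, here is a rough sketch of how VAD-based microphone gating can work, assuming the webrtcvad package; the playground's actual implementation may differ, so treat this as illustrative.

```python
# Minimal sketch of microphone gating with Voice Activity Detection (illustrative,
# not the playground's exact code). Only frames classified as speech would be
# streamed to the real-time Gemini session; background noise is dropped.
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad expects 8/16/32/48 kHz, 16-bit mono PCM
FRAME_MS = 30                # frames must be 10, 20, or 30 ms long
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2  # 2 bytes per 16-bit sample

vad = webrtcvad.Vad(2)       # aggressiveness from 0 (least) to 3 (most)

def speech_frames(pcm_stream):
    """Yield only the audio frames that contain speech."""
    while True:
        frame = pcm_stream.read(FRAME_BYTES)
        if len(frame) < FRAME_BYTES:
            break
        if vad.is_speech(frame, SAMPLE_RATE):
            yield frame      # these are the frames you would forward to Gemini

if __name__ == "__main__":
    import io
    silence = io.BytesIO(b"\x00" * FRAME_BYTES * 10)   # 300 ms of digital silence
    print(sum(1 for _ in speech_frames(silence)), "speech frames detected")
```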
In the last few days, I added a full-stack web app to interact with Gemini (see video below) along with a standalone script for those eager to quickly dive into building real-time voice agents.
Google’s real-time Gemini model is a game-changer, enabling you to independently create production-ready voice agents for industries like customer service, education, and healthcare in a matter of days.
Happy holidays. Go build! https://github.com/saharmor/gemini-multimodal-playground
3. Meta Apollo
Video understanding has been lagging behind text, image, and audio modalities—until now.
Meta and Stanford researchers unveiled Apollo, a new family of state-of-the-art video-centric large multimodal models (video-LMMs) designed to close this gap. Unlike prior efforts, Apollo sets a new standard by efficiently analyzing hour-long videos and achieving breakthrough results on multiple benchmarks.
Paper highlights:
Scaling Consistency - design decisions made with smaller models transfer reliably to larger ones, drastically cutting computational costs
Advanced video sampling techniques - Apollo uses fps sampling (a constant number of frames per second of video), which outperforms traditional uniform sampling (a fixed frame count regardless of video length); a quick sketch of the difference follows this list
Streamlined evaluation - the new ApolloBench benchmark evaluates video-LMMs efficiently, reducing evaluation time by 41x while maintaining accuracy
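To illustrate the two sampling strategies, here is a toy sketch (not Apollo's code): uniform sampling always returns the same number of frames, while fps sampling keeps temporal density constant, so longer videos yield more frames.

```python
# Illustration of the two frame-sampling strategies compared in the paper
# (not Apollo's actual code).

def uniform_sample(num_frames_in_video: int, k: int = 32) -> list[int]:
    """Pick k frame indices spread evenly across the whole video."""
    step = num_frames_in_video / k
    return [int(i * step) for i in range(k)]

def fps_sample(num_frames_in_video: int, video_fps: float, target_fps: float = 1.0) -> list[int]:
    """Pick frames at a constant rate (target_fps), so longer videos yield more frames."""
    stride = max(1, round(video_fps / target_fps))
    return list(range(0, num_frames_in_video, stride))

# A 1-hour clip at 30 fps: uniform sampling still returns 32 frames (minutes apart),
# while 1-fps sampling returns ~3600 frames with even temporal coverage.
print(len(uniform_sample(30 * 3600)), len(fps_sample(30 * 3600, video_fps=30)))
```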
Apollo’s superior video comprehension capabilities pave the way for breakthroughs like real-time video summarization for content creators, better temporal reasoning for medical diagnostics, and enhanced video analytics for autonomous driving.
With Apollo, video understanding might finally catch up to its multimodal counterparts.
Project page https://apollo-lmms.github.io
4. Claude PDF API
Anthropic has introduced a powerful PDF-processing feature in its Claude API that goes well beyond basic text extraction, and it has largely flown under the radar.
Historically, many LLMs have stumbled when documents include complex elements like images, charts, and LaTeX formulas. Anthropic's latest upgrade parses both the textual and visual content of a PDF, with no extra preprocessing code needed.
Key capabilities include:
Automatically parsing PDF text, images, and tables for further analysis, from answering questions about the attached PDF to turning unstructured data into structured JSON
Providing insight into charts and diagrams by evaluating visual context, not just their text labels
Extracting and interpreting LaTeX for scientific or technical documentation
It works by processing each page in two ways: the text is extracted as usual, and the page is also rendered as an image. Claude then merges the textual and visual context for a more holistic understanding, essentially combining the model's language skills with its vision capabilities.
The API supports up to 32 MB or 100 pages of PDF content per request. Pricing follows Claude's standard token pricing, so there's no premium for PDF analysis.
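Here is a rough sketch of what a call looks like with the Anthropic Python SDK, based on the document content-block format; the model name and exact fields are illustrative (PDF support launched as a beta and may still require a beta header), so check the docs or the notebook below for the current format.

```python
# Hedged sketch of sending a PDF to Claude via the Messages API. Field names and
# the model name are illustrative; consult Anthropic's docs for the current format.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("report.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-3-5-sonnet-latest",   # illustrative model name
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "document",     # the PDF goes in as a document block
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_b64,
                },
            },
            {"type": "text", "text": "Summarize the key charts and tables as JSON."},
        ],
    }],
)
print(message.content[0].text)
```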
This API could dramatically streamline how we handle financial reports, legal docs, or any PDF requiring detailed interpretation.
Ready-to-run notebook analyzing Anthropic's Constitutional AI paper: https://github.com/anthropics/anthropic-cookbook/blob/main/misc/pdf_upload_summarization.ipynb
5. LLaMA-Factory
When is it better to fine-tune a language model than to use prompt engineering or RAG? Here's a clear framework you can apply, along with the open-source library I use for fine-tuning.
Good reasons to fine-tune:
Emphasizing knowledge that already exists in the model - for instance, in a text-to-SQL task, fine-tuning can be used to emphasize specific SQL dialects or to avoid error-prone edge cases, utilizing the comprehensive understanding of SQL syntax, dialects, and database functionality that the model already possesses.
Customizing the structure or tone of responses - fine-tuning can modify the structure or tone of a model's output, such as making the model output valid JSON, which is beneficial for programmatic interactions where handling invalid JSON could lead to many downstream error cases. This includes fine-tuning a model to your company’s writing style.
Teaching a model very complex instructions - fine-tuning lets you show the model many more examples than could ever fit in its context window, which helps with complex instructions. And because those examples no longer need to be packed into every prompt, inference also becomes cheaper and faster.
Wrong reasons to fine-tune:
Adding new knowledge to the base model - the knowledge in a large language model is established during the pre-training runs. New knowledge can't effectively be introduced during the limited scope of fine-tuning. RAG is better suited in such cases.
Quickly iterating on a new use-case - fine-tuning involves a slower feedback loop and requires substantial investment in creating the dataset and other aspects of the fine-tuning process. Therefore, it's not suitable for rapid iteration of new use cases.
My preferred tool for fine-tuning open language models is LLaMA-Factory. It supports 100+ large language models, including Meta's Llama-2, Google's Gemma, and Mistral's Mixtral, along with efficient training algorithms like LoRA, QLoRA, and GaLore.
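To give a feel for the workflow, here is a hedged sketch of a LoRA fine-tuning run driven from Python; the YAML keys mirror the repo's example configs at the time of writing, so check the examples/ folder for the current names, datasets, and templates.

```python
# Hedged sketch of a LoRA supervised fine-tuning run with LLaMA-Factory.
# Config keys follow the repo's example configs; verify against examples/ before use.
import subprocess
import yaml

config = {
    "model_name_or_path": "meta-llama/Llama-2-7b-hf",  # any supported base model
    "stage": "sft",                  # supervised fine-tuning
    "do_train": True,
    "finetuning_type": "lora",       # parameter-efficient LoRA tuning
    "lora_target": "all",
    "dataset": "alpaca_en_demo",     # replace with your own dataset registered in data/
    "template": "llama2",
    "cutoff_len": 2048,
    "output_dir": "saves/llama2-7b-lora-sft",
    "per_device_train_batch_size": 2,
    "learning_rate": 1e-4,
    "num_train_epochs": 3.0,
    "fp16": True,
}

with open("sft_lora.yaml", "w") as f:
    yaml.safe_dump(config, f)

# Equivalent to running: llamafactory-cli train sft_lora.yaml
subprocess.run(["llamafactory-cli", "train", "sft_lora.yaml"], check=True)
```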
GitHub repo https://github.com/hiyouga/LLaMA-Factory
Last month’s LinkedIn Highlights