Welcome to the weekly edition of AI Tidbits, where I curate the firehose of AI research papers and tools every week so you won’t have to.
📩 Published a new breakthrough paper? Just released an open-source package? Submit it here to ensure we don’t miss it and that it gets featured in next week’s post.
Overview
✨ Highlights (7 entries)
Language Models (8 entries)
Multimodal (4 entries)
Autonomous Agents (3 entries)
Vision (6 entries)
Open-source Packages (3 entries)
✨ Highlights
Anthropic releases new versions of its LLMs, Claude 3.5 Sonnet and Haiku, boasting top-tier performance in coding and problem-solving that surpasses OpenAI o1-preview. The release also adds human-like software interaction capabilities, enabling the model to click, type, and automate tasks directly through GUIs (Anthropic)
Genmo openly releases Mochi 1 - a text-to-video model delivering smooth 30fps videos with precise motion and accurate prompt adherence, with downloadable weights on Hugging Face and a commercially permissive license (Company blog)
ElevenLabs unveils Voice Design, allowing users to create unique voices from a text prompt alone (X)
Runway releases Act-One - a cutting-edge tool for transforming simple video and voice inputs into expressive character performances (Company blog)
Ideogram releases Canvas - an AI-powered image editor offering inpainting and outpainting capabilities, outperforming competitors like Midjourney (Company blog)
Google elevates its podcast-generating NotebookLM, introducing customizable Audio Overviews and giving users the ability to fine-tune AI summaries with specific instructions (Google Blog)
Perplexity introduces Internal Knowledge Search, letting users search both public and internal data sources in one platform (Perplexity)
⭐️ Exciting news - Deepgram and Writer.com join AI Tidbits credits program!
AI Tidbits premium members receive $100 in Writer credits to build LLM-powered apps with RAG tools, AI guardrails, and more, along with $200 in Deepgram credits to build real-time voice agents.
Premium members also get full access to AI Tidbits content and $800+ for other leading AI tools and APIs, including Claude and Hugging Face.
We will be announcing more partners soon. Stay tuned.
Language Models
Microsoft open-sources bitnet.cpp - a high-performance framework that enables 1-bit LLMs to run smoothly on CPUs, generating text at speeds close to human reading speed
Meta open-sources Meta Lingua - a modular library for LLM training and inference, offering quick experiments, multi-GPU support, and real-time profiling tools, simplifying LLM experimentation
IBM releases Granite 3.0 LLMs, combining dense and Mixture-of-Experts models optimized for enterprise tasks, with open-source licensing and fine-tuning tools for tailored enterprise use
Anthropic evaluates sabotage risks in AI models, showing current safeguards suffice for Claude 3 Opus and Claude 3.5 Sonnet, but cautioning that more robust mitigations will be necessary with advancing AI capabilities
Researchers introduce PCDefense - an efficient defense mechanism to mitigate the jailbreak vulnerabilities caused by politically correct biases in LLM safety alignment, ensuring more reliable behavior across varied prompts
OpenAI adds audio support to its Chat Completions API, which now handles both text and audio, enabling asynchronous audio experiences as well as real-time interactions
Researchers present MedEmbed - a specialized family of embedding models fine-tuned for medical data, significantly improving information retrieval and NLP tasks across healthcare applications
Researchers analyze scaling laws for time series foundation models (TSFMs), revealing that encoder-only Transformers scale better than decoder-only models on both in-distribution (ID) and out-of-distribution (OOD) data
Multimodal
Meta develops Spirit LM - a multimodal model that integrates text and speech, combining the semantic capabilities of text models with the expressive abilities of speech models
Cohere releases Embed 3 - a multimodal search model that unifies text and image embeddings for seamless search across different asset types
DeepSeek proposes Janus - an autoregressive framework that enhances multimodal models by decoupling visual encoding for image understanding and generation
Researchers present PyramidDrop - a redundancy reduction strategy for large vision language models that cuts training time by 40% and inference FLOPs by 55% with minimal performance loss
Autonomous Agents
CMU and CUHK propose MultiUI - a dataset enabling multimodal models to achieve up to a 48% boost in web UI performance and a 19% gain in action accuracy, while also generalizing to non-web tasks like OCR and document understanding
Researchers propose MobA - a mobile agent leveraging multimodal large language models to enhance comprehension, task planning, and execution efficiency
Researchers introduce World-model-augmented (WMA) web agents, which enhance LLM-based decision-making with simulated outcomes and efficient policy selection, outperforming existing agents in cost and time efficiency on benchmarks like WebArena and Mind2Web
Vision
Rhymes AI releases Allegro - a 2.8B open-source model capable of generating cinematic 6-second videos from text prompts at 15 FPS and 720p resolution
Stability AI unveils Stable Diffusion 3.5, offering powerful, customizable text-to-image models optimized for consumer hardware, available for free use
Google and MIT develop Fluid - a new autoregressive model that excels in text-to-image generation with continuous tokens and random-order generation, setting state-of-the-art performance on MS-COCO and GenEval benchmarks
The University of Washington and DeepMind present VidPanos - a method for synthesizing panoramic videos from panning footage by framing the task as a space-time outpainting problem
CUHK presents SAM2Long - a novel segmentation strategy that leverages constrained tree search to enhance the robustness of SAM 2 in long-term video object segmentation, achieving 5% gains on complex benchmarks
Researchers introduce MagicTailor - a framework enabling fine-grained customization in text-to-image models through component-controllable personalization
Open-source Packages
Writing Tools - an open system-wide grammar assistant for Windows
BabyAGI-2o - an agent that dynamically creates tools, manages dependencies, and iterates on errors to complete tasks autonomously
Sage - chat with any codebase in less than two minutes, locally or via third-party APIs
Plus >70 more open-source packages for AI engineers
Reach AI builders, researchers, and entrepreneurs by partnering with AI Tidbits
If you find AI Tidbits valuable, share it with a friend and consider showing your support.