September 2024 - AI Tidbits Monthly Roundup
OpenAI’s new reasoning model and realtime API to power AI assistants, Qwen's and AI2’s new fully open-sourced state-of-the-art multimodal models, and speech2text with Whisper v3 Turbo
Welcome to the monthly round-up, where we curate the firehose of AI research papers and tools so you won’t have to. If you're pressed for time and can only catch one AI Tidbits edition, this is the one to read: it features the absolute must-knows.
September has been a busy month for everyone in the AI space. It has been packed with groundbreaking developments across various AI domains, from industry giants to open-source breakthroughs.
In the realm of large language models, we've seen significant strides from both industry leaders and open-source initiatives. OpenAI introduced its advanced o1-preview and o1-mini models, which excel at high-level reasoning for coding and math. Meanwhile, Alibaba released the impressive Qwen 2.5 family of open multilingual models, supporting context windows of up to 128K tokens. Meta also made waves with Llama 3.2, featuring edge-optimized text models and its first large multimodal models.
Multimodal AI saw remarkable progress, with AI2's Molmo models rivaling and surpassing proprietary models from industry giants, such as GPT-4V and Gemini 1.5. Nvidia's NVLM 1.0 and Apple's MM1.5 further pushed the boundaries of vision-language reasoning and diverse task performance.
In the audio domain, OpenAI released Whisper Large v3 Turbo, a faster and more capable speech-to-text model, while Google developed a promising zero-shot Voice Transfer module for cross-lingual applications.
The image and video generation landscape continued to evolve, with Meta's Imagine Yourself technology enabling personalized image generation and advancements in text-to-video models like CogVideoX-5B pushing the boundaries of visual content creation.
This month's roundup features these breakthroughs and many more exciting updates across AI tools, research methodologies, and vision AI.
Let's dive in!
Overview
Industry announcements (11 entries)
Large Language Models
Open-source (11 entries)
Research (9 entries)
Multimodal (7 entries)
Autonomous Agents (3 entries)
Image and Video (8 entries)
Audio (4 entries)
AI Tools (5 entries)
Open-source Packages (5 entries)
Recent Deep Dives
Industry announcements
👆 OpenAI’s Realtime API powering Speak’s language learning app
Become a premium member to get full access to my content and $1k in free credits for leading AI tools and APIs like Claude, Replicate, and Hugging Face. Many readers expense the paid membership through their company’s learning and development stipend.
Large Language Models
Open-source