AI Tidbits

September 2024 - AI Tidbits Monthly Roundup
OpenAI’s new reasoning model and realtime API to power AI assistants, Qwen's and AI2’s new fully open-sourced state-of-the-art multimodal models, and speech2text with Whisper v3 Turbo

Arthur Mor
Oct 06, 2024

Welcome to the monthly round-up, where we curate the firehose of AI research papers and tools so you don't have to. If you're pressed for time and can only catch one AI Tidbits edition, this is the one to read—featuring the absolute must-knows.


September was a busy month for everyone in the AI space, packed with groundbreaking developments across AI domains, from industry giants to open-source projects.

In the realm of large language models, we've seen significant strides from both industry leaders and open-source initiatives. OpenAI introduced its advanced o1-preview and o1-mini models, excelling in high-level reasoning for coding and math. Meanwhile, Alibaba released the impressive Qwen 2.5 family of open multilingual models, handling an expansive 128K tokens. Meta also made waves with Llama 3.2, featuring edge-optimized text models and their first large multimodal models.

Multimodal AI saw remarkable progress, with AI2's Molmo models rivaling, and on some benchmarks surpassing, industry giants like GPT-4V and Gemini 1.5. Nvidia's NVLM 1.0 and Apple's MM1.5 further pushed the boundaries of vision-language reasoning and diverse task performance.

In the audio domain, OpenAI released Whisper Large v3 Turbo, a faster and more capable speech-to-text model, while Google developed a promising zero-shot Voice Transfer module for cross-lingual applications.

The image and video generation landscape continued to evolve, with Meta's Imagine Yourself technology enabling personalized image generation and advancements in text-to-video models like CogVideoX-5B pushing the boundaries of visual content creation.

This month's roundup features these breakthroughs and many more exciting updates across AI tools, research methodologies, and vision AI.

Let's dive in!


Overview

  • Industry announcements (11 entries)

  • Large Language Models

    • Open-source (11 entries)

    • Research (9 entries)

  • Multimodal (7 entries)

  • Autonomous Agents (3 entries)

  • Image and Video (8 entries)

  • Audio (4 entries)

  • AI Tools (5 entries)

  • Open-source Packages (5 entries)

Recent Deep Dives

  • The Great AI Consolidation (Sahar Mor, September 29, 2024)

  • Harnessing research-backed prompting techniques for enhanced LLM performance (Sahar Mor, December 10, 2023)

  • [cross-post] 7 methods to secure LLM apps from prompt injections and jailbreaks (Sahar Mor, February 9, 2024)

Industry announcements

  1. OpenAI introduces o1-preview and o1-mini, advanced models that excel in high-level reasoning for coding and math, with o1-mini being a faster, cost-efficient option 

  2. As part of its DevDay event, OpenAI releases the Realtime API, allowing developers to create low-latency, voice-to-voice interactions with continuous audio streaming, along with new tools like vision fine-tuning, prompt caching, and model distillation

  3. OpenAI releases Advanced Voice Mode for ChatGPT Plus and Team users, adding Custom Instructions, Memory, and five new voices

  4. Anthropic releases the Quickstarts repo, providing ready-to-deploy app projects powered by its API, starting with a Claude-based customer support agent

  5. In its annual Connect conference, Meta announced multiple AI-related releases, including Ray-Ban Meta smart glasses with real-time AI video processing, AI-powered visual search for Instagram, translation and dubbing tools for creator content with lip sync, and Meta AI vocal responses across platforms with customizable celebrity voices

  6. OpenAI's CTO, Mira Murati, along with other OpenAI execs, announces she will step down and leave the company

  7. Pika Labs launches its new video-generation model, featuring new "Pikaffects" that transform video subjects with surreal, physics-defying effects like melting and cake-ifying objects

  8. Google introduces new Gemini-1.5 Pro and Flash models, with faster response speeds, reduced prices (>50%) and improved task performance

  9. Deepgram unveils its Voice Agent API, enabling natural, real-time human-machine conversations powered by high-performance speech recognition and synthesis models

  10. Hume releases Empathic Voice Interface 2 (EVI 2) - a GPT-4o-like voice model, allowing users to converse with its AI chatbot with sub-second response times

  11. Replit announces Replit Agent - an AI tool that automates software development tasks like environment setup and deployment

👆 OpenAI’s Realtime API powering Speak’s language learning app
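As a rough sketch of how a client typically drives the Realtime API mentioned above: it opens a WebSocket and exchanges JSON events with the server. The endpoint URL, model name, and event shapes below are assumptions based on OpenAI's DevDay announcement, not verified against the official reference, so treat this as illustrative only.

```python
import json

# Assumed endpoint and model name (hypothetical until checked against
# OpenAI's Realtime API docs).
REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"


def session_update_event(voice: str = "alloy") -> str:
    """Build a session.update event selecting output modalities and a voice."""
    return json.dumps({
        "type": "session.update",
        "session": {"modalities": ["text", "audio"], "voice": voice},
    })


def response_create_event(instructions: str) -> str:
    """Build a response.create event asking the server to generate a reply."""
    return json.dumps({
        "type": "response.create",
        "response": {"instructions": instructions},
    })


if __name__ == "__main__":
    # In a real client these strings would be sent over the open WebSocket,
    # while incoming audio/text delta events stream back continuously.
    print(session_update_event())
    print(response_create_event("Greet the user in French."))
```

The point of the event-based design is that audio flows both ways continuously: instead of a request/response round trip per utterance, the client keeps one socket open and reacts to server events as they arrive, which is what enables the low-latency voice-to-voice behavior highlighted in the announcement.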

Become a premium member to get full access to my content and $1k in free credits for leading AI tools and APIs like Claude, Replicate, and Hugging Face. It’s common to expense the paid membership from your company’s learning and development education stipend.


Large Language Models

Open-source

  1. Alibaba releases Qwen 2.5 - a family of open multilingual models handling 128K tokens and outperforming competitors like Mistral Large 2 (123B) on major benchmarks, offering support for 29 languages and specialized variants for math and coding

  2. Meta releases Llama 3.2 - a new family of open models featuring edge-optimized text models (1B and 3B) and Meta's first large multimodal models (11B and 90B) supporting 128K tokens

  3. Kyutai Labs openly releases Moshi - a 7.6B open speech-to-speech model with cutting-edge performance and low latency, alongside Mimi, a SoTA streaming audio codec that compresses 24 kHz audio to 1.1 kbps for optimized real-time speech communication

  4. The DeepSeek team releases DeepSeek-V2.5 - a SOTA versatile open model integrating DeepSeek-Coder with advanced features like Function Calling and JSON output

  5. 01.AI unveils Yi-Coder - a high-performing series of code LLMs with up to 9B parameters, excelling in long-context modeling and outperforming larger models

  6. Alibaba proposes DocOwl2 - a state-of-the-art model for multi-page document understanding that reduces GPU usage and inference time by compressing high-resolution document images into 324 tokens
