The Open-Source Toolkit for Building AI Agents
Curated frameworks, tools, and libraries every developer needs to build functional and efficient AI agents
Welcome to a new post in the AI Agents Series - helping AI developers and researchers deploy and make sense of the next step in AI.
My last post explored how the internet will transform for an agent-first future - from websites optimizing for AI interaction through "agent-responsive design" to the emergence of Agent Engine Optimization (AEO) as the next SEO. We saw how tech giants like Google, Apple, OpenAI, and Anthropic are racing to define this next evolution of digital interaction, with Gartner projecting that by 2028, 33% of enterprise software applications will include agentic AI.
In this post, I'll outline a curated, though non-exhaustive, overview of the open-source ecosystem for developers creating these AI agents. While numerous market maps exist for AI agents, they often cater more to venture capitalists than builders. Developers need actionable tools and frameworks to launch functional AI agents today.
Which tools do other builders rely on to develop voice agents? What’s the leading open model for document understanding? With new packages emerging almost daily, I’ll focus solely on the libraries I’ve personally found most effective. This list is, therefore, intentionally selective rather than exhaustive.
Every package included here supports commercial use and has a permissive open-source license.
With the holiday season coming, there's no better time to dive into these tools and start building.
Categories covered in this piece:
→ Frameworks for Building and Orchestrating Agents
→ Computer and Browser Use
→ Voice
→ Document Understanding
→ Memory
→ Testing and Evaluation
→ Monitoring and Observability
→ Simulation
→ Vertical Agents
Frameworks for Building and Orchestrating Agents
Building AI agents requires robust frameworks that can handle complex workflows, memory management, and tool integration. These foundational frameworks serve as the backbone for creating agents that can understand, plan, and execute tasks autonomously.
CrewAI - a framework for orchestrating role-playing, autonomous AI agents
Phidata - build AI assistants with memory, knowledge, and tools
Camel - build customized multi-agent systems to generate data, complete tasks, or simulate real-world interactions
AutoGPT - create, deploy, and manage continuous AI agents that automate complex workflows
AutoGen - develop LLM applications using multiple agents that can converse with each other
SuperAGI - build, manage, and run autonomous AI agents quickly and reliably
Superagent - an open framework for building AI assistants
LangChain & LlamaIndex - the usual suspects, facilitating AI Agents through composability
Computer and Browser Use
For AI agents to be truly useful, they need to interact with computers and browsers just like humans do. These tools enable agents to navigate websites, control applications, and execute commands programmatically, bridging the gap between AI reasoning and real-world actions.
Open Interpreter - turn natural language commands into code that runs on your local machine
Self-Operating Computer - enables multimodal models to operate a computer
Agent-S - an open agentic framework that uses computers like a human
LaVague - create web agents that take actions on websites using LLMs as their reasoning engines
Playwright - a framework for web testing and automation
Puppeteer - a JavaScript library that provides a high-level API to control Chrome or Firefox
Voice
Voice interfaces represent the most natural way for humans to interact with AI agents. These tools enable the creation of agents that can understand spoken language, maintain context in conversations, and respond with natural-sounding speech, making AI interaction more accessible and intuitive.
Speech2speech
Ultravox - a speech2speech model for real-time voice interaction, superior to Moshi for now
Moshi - a speech2speech model for real-time voice interaction
Pipecat - a framework for voice and multimodal conversational AI, supporting speech2text, text2speech, video, etc.
Speech2text
Whisper - OpenAI's speech2text model
Stable-ts - a lightweight Whisper wrapper with timestamps and more
Speaker diarization 3.1 - pyannote’s flagship model for speaker detection
Text2speech
The only decent open model I came across was ChatTTS, which is satisfactory for production. I, therefore, default to ElevenLabs or Cartesia.
Misc
Vocode - a toolkit for building voice-based LLM agents
Voice Lab - a comprehensive testing and evaluation framework for voice agents across language models, prompts, and agent personas
Become a premium member to access the LLM Builders series, $1k in free credits for leading AI tools and APIs, and editorial deep dives into key topics like OpenAI's DevDay and autonomous agents.
Many readers expense the paid membership from their learning and development education stipend.
Document Understanding
Modern AI agents need to process and understand documents in various formats, from PDFs to images with text. These tools provide the crucial ability to extract, comprehend, and act on information from unstructured documents, enabling agents to handle real-world business processes.
Qwen2-VL - vision language model from Alibaba outperforming GPT-4o and Claude 3.5 Sonnet
DocOwl2 - an efficient multimodal LLM for OCR-free document understanding
Memory
Without memory, AI agents are limited to single-turn interactions. These memory tools enable agents to maintain context over long conversations, remember user preferences, and learn from past interactions, making them truly personal assistants rather than just query responders.
Mem0 - provides an efficient, self-improving memory layer for LLMs, enabling personalized AI experiences
Letta (fka MemGPT) - create LLM agents with long-term memory and custom tools
LangChain - offers memory components to manage conversation history and context
Testing and Evaluation
As AI agents become more complex, robust testing becomes critical. These tools help developers evaluate agent performance, identify failure modes, and ensure reliability across different scenarios and environments.
Voice Lab - a comprehensive testing and evaluation framework for voice agents
AgentOps - tools for monitoring and benchmarking agent performance
AgentBench - a benchmark to evaluate LLMs as agents across various environments (Web, Minecraft, Visual Design, etc.)
Monitoring and Observability
Understanding how AI agents perform in production is crucial for maintaining reliability and optimizing costs. These tools provide insights into agent behavior, resource usage, and performance metrics essential for running agents at scale.
openllmetry - an open-source, OpenTelemetry-based end-to-end observability tool for LLM applications
AgentOps - agent monitoring, LLM cost tracking, benchmarking, and more
Simulation
Before deploying agents to real-world scenarios, testing them in controlled environments is crucial. These simulation tools allow developers to validate agent behavior, test edge cases, and refine decision-making capabilities in safe, reproducible environments.
AgentVerse - facilitates the deployment of multiple LLM-based agents in various applications, including simulations
Tau-Bench - a benchmark and testing code for agent-user interactions in real-world domains like retail and airline
ChatArena - multi-agent language game environments for research on autonomous LLM agents
AI Town - A virtual town where AI characters live, chat, and socialize
Generative Agents - Stanford’s Interactive simulacra of human behavior
Vertical Agents
There are dozens of open vertical agents out there, so here are just a few select ones I’ve tinkered with and found the most useful:
OpenHands (Coding) - a platform for software development agents powered by AI
aider (Coding) - pair programming in your terminal
GPT Engineer (Low code) - build applications using natural language. Specify what you want to build, and the AI will ask for clarification before building it.
screenshot-to-code - convert screenshots into a functioning website using HTML/Tailwind/React/Vue
GPT Researcher (Research) - an autonomous agent that performs comprehensive research on any given topic
Vanna (SQL) - chat with your SQL database
Looking Ahead
While this post focused on open-source packages with permissive licenses, I plan to publish another comprehensive list specifically for engineers building voice agents. This upcoming guide will include both open-source and commercial tools, covering solutions like OpenAI's Realtime API (speech2speech) and ElevenLabs (text2speech), along with detailed comparisons of their capabilities, pricing models, and ideal use cases.
Stay tuned for more deep dives in the AI Agents Series.
Comprehensive list of open-source packages for AI engineers (last update: Aug ‘23)
What about interacting with a website or app through it's API? And how do you define, constrain and focus on the function of the system?
When are we adding payments as a category? :)
(stripe SDK, Coinbase SDK, Circle SDK, OpenCommerce SDK)