The Vesuvius Challenge demonstrated how a fast-moving community can achieve breakthroughs, inspiring the application of this approach to AI agents.
Kilo Code aims to create the most user-friendly AI coding agent quickly, leveraging community feedback and rapid development.
The team, assembled in a week, includes experienced professionals like JP Posma, Justin Halsall, and Johan Otten, focusing on speed and innovation.
Recent improvements include no need for OpenRouter accounts, free tier with Claude 3.7 Sonnet, good defaults, and multiple onboarding enhancements.
Future plans include instant app creation, up-to-date docs, a browser IDE, local models, live collaboration, parallel agents, and more advanced AI agent capabilities.
Kilo Code is currently available in VS Code with a free tier offering $15 in tokens monthly, encouraging user feedback via GitHub and Discord.
Mosaic is an agentic video editing paradigm that allows users to create and run their own multimodal video editing agents in a node-based canvas.
The role involves accelerating the development of the core agentic video editing paradigm, building scalable pipelines for video processing and inference, creating evaluations, and making high-level design decisions.
Mosaic's initial prototype won the $25,000 grand prize in the Google Gemini Kaggle competition and best demo in the Y Combinator W25 batch.
The team consists of ex-Tesla engineers and is looking for a Founding Engineer to help accelerate video editing from hours to seconds.
High-agency tasks require agents to act competently, reliably, and consistently, especially in high-value use cases like customer support.
Customer support is challenging due to knowledge gaps, impatient users, and time constraints, contrasting with ideal environments where agents have complete knowledge and forgiving conditions.
Agentic systems like Anthropic's 'computer use' and OpenAI's Deep Research show advances in high-agency tasks, but real-world applications like Fin still face reliability issues.
Customers expect high reliability and control from agents, especially for sensitive tasks like subscription management, refunds, and context gathering.
Measuring agent performance involves simulating tasks with predefined outcomes, user prompts, and stopping conditions to assess reliability and consistency.
The 'pass^k' metric is stricter than 'pass@k': pass@k counts a task as solved if at least one of k attempts succeeds, while pass^k requires all k attempts to succeed, which is crucial for customer support reliability.
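The difference between the two metrics can be made concrete with a small sketch (the trial results below are invented for illustration):

```python
# Sketch of pass@k vs pass^k over k repeated trials of one task.
# Each trial result is True (success) or False (failure).

def pass_at_k(trials: list[bool]) -> bool:
    """pass@k: the agent succeeds in at least one of the k attempts."""
    return any(trials)

def pass_hat_k(trials: list[bool]) -> bool:
    """pass^k: the agent succeeds on every one of the k attempts."""
    return all(trials)

# Hypothetical results for one task repeated k=5 times.
trials = [True, True, False, True, True]

print(pass_at_k(trials))   # True  — at least one success
print(pass_hat_k(trials))  # False — a single failure breaks consistency
```

A single lucky success inflates pass@k; pass^k only rewards agents that behave the same way every time, which is what a support customer actually experiences.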
Agency and reliability are inversely related; high-agency agents often perform inconsistently, especially in complex tasks.
The 'Give Fin a Task' (GFAT) agent balances agency and control by using step-based instructions, improving reliability for simple and moderate tasks.
GFAT's composability allows breaking complex tasks into simpler, more reliable steps, enhancing overall performance and customer satisfaction.
Early benchmarks show GFAT significantly improves reliability, especially for simple and moderate tasks, by constraining agency and emphasizing structured execution.
LLMs are initially trained to predict the next token in a sequence, a process known as the next-token objective.
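The next-token objective is the standard autoregressive cross-entropy loss (notation assumed here):

```latex
% Next-token objective: maximize the log-likelihood of each token
% given its prefix, summed over a sequence x_1, ..., x_T.
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right)
```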
Instruction finetuning adapts LLMs to follow natural-language instructions by training them on datasets of instruction-response pairs, improving zero-shot performance on unseen tasks.
Reinforcement Learning from Human Feedback (RLHF) is a key training step where LLMs are optimized to produce outputs that humans prefer, moving beyond simple next-token prediction.
RLHF involves two main steps: reward modeling, where a model learns to predict human preferences, and proximal policy optimization (PPO), which adjusts the LLM to maximize these rewards while staying close to its original behavior.
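The two steps above have a standard formulation; the notation below is the commonly used one, not taken verbatim from the article:

```latex
% Step 1 — reward modeling: Bradley-Terry loss on human preference
% pairs, where y_w is the preferred and y_l the rejected completion.
\mathcal{L}_{\mathrm{RM}}(\phi) =
  -\mathbb{E}_{(x,\, y_w,\, y_l)}
  \left[ \log \sigma\!\left( r_\phi(x, y_w) - r_\phi(x, y_l) \right) \right]

% Step 2 — policy optimization (PPO): maximize the learned reward
% while a KL penalty keeps the policy close to the reference model.
\max_\theta \;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta}
  \left[ r_\phi(x, y) \right]
  - \beta \, \mathbb{E}_{x \sim \mathcal{D}}
  \left[ \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right) \right]
```

The KL term is what "staying close to its original behavior" means formally: without it, the policy drifts toward outputs that exploit the reward model.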
LLMs can be viewed as agents that take actions (producing tokens) to maximize rewards, similar to how chess-playing models choose moves to win games.
The concept of AI agents extends LLMs by mapping their token outputs to real-world actions, enhancing their utility beyond text generation.
Despite their capabilities, LLMs trained with RLHF can sometimes produce outputs that seem good to humans but are actually flawed, a phenomenon known as reward hacking.
The training and capabilities of LLMs suggest they are more than just next-token predictors; they are complex systems optimized for various objectives, including human appeal and task performance.
MCP (Model Context Protocol) is a standard API for exposing sets of Tools that can be integrated with LLMs.
An Agent can be implemented as a simple while loop on top of an MCP client, making Agentic AI simpler.
The article demonstrates a Tiny Agent implementation in TypeScript, connecting to local MCP servers for tools like file system access and web browsing.
Recent LLMs support function calling (tool use) natively, simplifying the integration of tools without manual prompt engineering.
The MCP client connects to servers, formats their tools for LLM use, and handles tool calls and responses.
The Agent's control flow includes tools for task completion and user questions, breaking the loop when needed.
Future steps include experimenting with different models and inference providers, and contributions are encouraged.
Firecrawl, a Y-Combinator backed startup, is hiring AI Agents for content generation, coding, and customer support roles, offering a monthly salary of $5000.
The Agent2Agent (A2A) protocol, proposed by Google, is an open communication standard enabling interoperability between independent AI agents, treating them as discoverable 'black boxes'.
A2A is designed around a Client-Server model, allowing client agents to access remote agent functionalities without knowing their implementation details.
Core components of A2A include AgentCards (agent business cards), Tasks (work instructions), Artifacts (response content), Messages (conversation tracking), and Push Notifications (async processing).
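An AgentCard is just a JSON document, typically served at a well-known URL. The sketch below follows the field names in the published A2A draft, but the agent itself is invented for illustration:

```python
import json

# Sketch of an A2A AgentCard — the agent's "business card" that makes
# it discoverable by client agents. Field names follow the A2A draft;
# the values are illustrative.
agent_card = {
    "name": "Video Summarizer",
    "description": "Summarizes uploaded videos into text.",
    "url": "https://agents.example.com/video-summarizer",
    "version": "1.0.0",
    "capabilities": {"streaming": True, "pushNotifications": False},
    "skills": [
        {
            "id": "summarize",
            "name": "Summarize video",
            "description": "Produce a short text summary of a video.",
        }
    ],
}

print(json.dumps(agent_card, indent=2))
```

A client agent only ever sees this card and the task API; the remote agent's internals stay a black box.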
A2A RPC methods include tasks/send (synchronous processing), tasks/sendSubscribe (streaming), and tasks/get (retrieving task status).
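These methods ride on JSON-RPC 2.0; a `tasks/send` request might look like the sketch below (shape per the A2A draft, values illustrative):

```python
import json
import uuid

# Sketch of a JSON-RPC 2.0 request for A2A's tasks/send method.
request = {
    "jsonrpc": "2.0",
    "id": 1,  # JSON-RPC request id (distinct from the task id)
    "method": "tasks/send",
    "params": {
        "id": str(uuid.uuid4()),  # task id chosen by the client agent
        "message": {
            "role": "user",
            "parts": [{"type": "text", "text": "Summarize this video."}],
        },
    },
}

print(json.dumps(request, indent=2))
```

`tasks/sendSubscribe` uses the same params but streams status updates back, and `tasks/get` later retrieves the task by its id.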
The A2A Memory Layer manages queued messages, status tracking, and result communication, with TaskStore/TaskManager at its core.
A2A implementation involves handlers for different request types and a TaskManager to invoke agent logic and update task statuses.
A comparison between A2A and MCP (another protocol) is teased for a future article.