The last six months in LLMs, illustrated by pelicans on bicycles
- #AI
- #LLMs
- #Benchmarking
- The speaker presented a keynote on the last six months in LLMs at the AI Engineer World’s Fair in San Francisco.
- Over 30 significant LLM models were released in the past six months, making it challenging to evaluate and compare them.
- The speaker introduced a unique benchmark involving generating an SVG of a pelican riding a bicycle to evaluate LLMs.
- Notable model releases include Amazon's Nova models, Meta's Llama 3.3 70B, and DeepSeek's open-weight models.
- DeepSeek's R1 reasoning model caused a significant stock market drop, wiping $600 billion from NVIDIA's valuation.
- Mistral Small 3, a 24B-parameter model, was highlighted for its efficiency and capability, running comfortably on a laptop with limited RAM.
- Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5 were discussed, with Claude remaining a favorite, while GPT-4.5 was noted for its high cost and underwhelming performance.
- OpenAI's 'GPT-4o native multimodal image generation' feature was a massive success, attracting 100 million new users in a week.
- The speaker criticized ChatGPT's new memory feature for compromising user control over context.
- Recent trends in LLMs include the integration of tools and reasoning, enhancing their capabilities and applications.
- The speaker highlighted risks associated with LLMs, such as prompt injection and the 'lethal trifecta': access to private data, exposure to malicious instructions, and a mechanism for exfiltrating data.
- The pelican benchmark was humorously acknowledged by Google during their I/O keynote, prompting the speaker to consider a new benchmark.
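The pelican benchmark above is easy to reproduce in spirit: send a model the prompt and check its output. As a minimal sketch (the prompt wording is from the talk, but the validation step is an assumption, not the speaker's published harness), one sanity check is simply whether the response parses as well-formed SVG:

```python
import xml.etree.ElementTree as ET

PROMPT = "Generate an SVG of a pelican riding a bicycle"

def is_wellformed_svg(text: str) -> bool:
    """Minimal sanity check: the output parses as XML with an <svg> root."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Strip any XML namespace prefix (e.g. "{http://www.w3.org/2000/svg}svg")
    return root.tag.rsplit("}", 1)[-1] == "svg"

# A stand-in model response; a real run would call an LLM API here.
example_response = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="40" r="20"/></svg>'
)
print(is_wellformed_svg(example_response))  # True
print(is_wellformed_svg("a pelican, but not SVG"))  # False
```

Of course, the real benchmark is judged on how the drawing looks when rendered, not on well-formedness, which is what makes it both informal and revealing.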