The last six months in LLMs, illustrated by pelicans on bicycles
- #AI
- #LLMs
- #Benchmarking
- The speaker presented a keynote on the last six months in LLMs at the AI Engineer World’s Fair in San Francisco.
- Over 30 significant LLM models were released in the past six months, making it challenging to evaluate and compare them.
- The speaker introduced a unique benchmark involving generating an SVG of a pelican riding a bicycle to evaluate LLMs.
- Notable model releases include Amazon's Nova models, Meta's Llama 3.3 70B, and DeepSeek's open-weight models.
- DeepSeek's R1 reasoning model caused a significant stock market drop, wiping $600 billion from NVIDIA's valuation.
- Mistral Small 3, a 24B-parameter model, was highlighted for its efficiency and capability, running comfortably on a laptop with limited RAM.
- Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5 were discussed, with Claude remaining a favorite, while GPT-4.5 was noted for its high cost and underwhelming performance.
- OpenAI's 'GPT-4o native multimodal image generation' feature was a massive success, attracting 100 million new users in a week.
- The speaker criticized ChatGPT's new memory feature for compromising user control over context.
- Recent trends in LLMs include the integration of tools and reasoning, enhancing their capabilities and applications.
- The speaker highlighted risks associated with LLMs, such as prompt injection and the 'lethal trifecta': access to private data, exposure to malicious instructions, and a mechanism for exfiltrating data.
- The pelican benchmark was humorously acknowledged by Google during their I/O keynote, prompting the speaker to consider a new benchmark.
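The pelican benchmark above is easy to reproduce in spirit: send a model the prompt and check its output. As a minimal sketch (the prompt wording is from the talk, but the validation step is an assumption, not the speaker's published harness), one sanity check is simply whether the response parses as well-formed SVG:

```python
import xml.etree.ElementTree as ET

PROMPT = "Generate an SVG of a pelican riding a bicycle"

def is_wellformed_svg(text: str) -> bool:
    """Minimal sanity check: the output parses as XML with an <svg> root."""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # Strip any XML namespace prefix (e.g. "{http://www.w3.org/2000/svg}svg")
    return root.tag.rsplit("}", 1)[-1] == "svg"

# A stand-in model response; a real run would call an LLM API here.
example_response = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="100" height="100">'
    '<circle cx="50" cy="40" r="20"/></svg>'
)
print(is_wellformed_svg(example_response))  # True
print(is_wellformed_svg("a pelican, but not SVG"))  # False
```

Of course, the real benchmark is judged on how the drawing looks when rendered, not on well-formedness, which is what makes it both informal and revealing.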