- Cached input tokens are much cheaper than regular input tokens: Anthropic bills cache reads at roughly 10% of the base input price (cache writes carry a ~25% surcharge), while OpenAI discounts cached input tokens by about 50%.
- Prompt caching can cut latency for long prompts by up to 85% (Anthropic's figure; OpenAI cites up to 80%).
- Cached tokens are not saved responses: what gets stored is the KV cache, the Key and Value matrices the attention mechanism computes for the prompt prefix.
- LLMs convert text into tokens, then embeddings, which are processed through attention mechanisms.
- Attention mechanisms determine the importance of each token in context using weights.
- KV caching avoids recalculating attention weights for repeated prompt prefixes, saving computation.
- OpenAI and Anthropic handle caching differently: OpenAI caches automatically once a prompt exceeds 1,024 tokens and its prefix matches a recent request, while Anthropic gives explicit control through cache_control breakpoints.
- Parameters like temperature, top_p, and top_k affect output randomness but not prompt caching.
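To make the KV-caching bullets concrete, here is a toy Python sketch of a single-query attention step where the prefix's Key/Value vectors are computed once and reused on the next request. The kv_for projection and the embeddings are made-up illustrative numbers, not a real model's weights.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(q, ks, vs):
    # Scaled dot-product attention for a single query vector:
    # weights say how important each cached token is to the query.
    d = len(q)
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in ks]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, vs)) for i in range(len(vs[0]))]

# Toy "projections": real models derive K and V with learned weight matrices.
def kv_for(embedding):
    return [x * 0.5 for x in embedding], [x * 2.0 for x in embedding]

prefix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # shared prompt prefix
suffix = [[0.5, 0.5]]                          # new tokens this request

# Without caching: recompute K/V for every token on every request.
full = prefix + suffix
ks_full = [kv_for(e)[0] for e in full]
vs_full = [kv_for(e)[1] for e in full]
out_uncached = attend(suffix[0], ks_full, vs_full)

# With KV caching: the prefix's K/V were stored by a prior request;
# only the new token's K/V are computed now.
kv_cache = [kv_for(e) for e in prefix]
new_kv = [kv_for(e) for e in suffix]
ks = [k for k, _ in kv_cache + new_kv]
vs = [v for _, v in kv_cache + new_kv]
out_cached = attend(suffix[0], ks, vs)

assert out_cached == out_uncached  # same output, prefix compute skipped
```

The saving is exactly the skipped K/V computation for the repeated prefix, which is why only identical prompt prefixes can hit the cache.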
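The provider difference above can be illustrated with a request-payload sketch. The cache_control marker is Anthropic's actual mechanism for marking a cache breakpoint; the model name and prompt strings here are placeholders, not real values.

```python
import json

# Sketch of a Messages API body using Anthropic's explicit cache breakpoint.
# Everything up to and including the block carrying cache_control is cached.
payload = {
    "model": "claude-model-placeholder",  # placeholder, not a real model id
    "max_tokens": 256,
    "system": [
        {
            "type": "text",
            "text": "LONG_REFERENCE_DOCUMENT_HERE",  # placeholder prompt
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Question about the document"}],
}

# OpenAI needs no such marker: prompts past the 1,024-token threshold are
# cached automatically when their prefix matches a recent request.
print(json.dumps(payload, indent=2))
```

The trade-off: OpenAI's automatic caching is zero-effort, while Anthropic's breakpoints let you decide exactly which stable portion of the prompt is worth the cache-write surcharge.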
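As a sketch of why sampling parameters are orthogonal to caching: temperature and top_k only reshape the distribution over the next token, after the (cacheable) prefix computation has already produced the logits. This is an illustrative implementation, not any provider's actual sampler.

```python
import math
import random

def sample_next_token(logits, temperature=1.0, top_k=None):
    # Temperature rescales logits before softmax; top_k truncates candidates.
    # Neither touches the prompt's KV cache: cache hits depend only on the
    # prefix tokens, not on how the next token is sampled.
    items = sorted(enumerate(logits), key=lambda p: p[1], reverse=True)
    if top_k is not None:
        items = items[:top_k]
    m = max(l for _, l in items)
    weights = [math.exp((l - m) / temperature) for _, l in items]
    r = random.random() * sum(weights)
    acc = 0.0
    for (idx, _), w in zip(items, weights):
        acc += w
        if r <= acc:
            return idx
    return items[-1][0]

# With top_k=1 the sampler is greedy: it always returns the argmax index.
assert sample_next_token([0.1, 3.0, 0.2], top_k=1) == 1
```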