May 23 · Declarative APIs cut 1,000 lines to 50
Good morning. Teams are aggressively constraining agent action spaces to get more deterministic outcomes.

Constraining coding agents to high-level declarative APIs reduces error-prone generation targets from 1,000 lines of raw PyTorch down to 50 lines of simple calls. Jure Leskovec reports this approach entirely eliminates subtle data science bugs like temporal information leakage during complex enterprise workflows.
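As a rough illustration of why a declarative target is harder to get wrong, here is a minimal Python sketch. `PredictionTask` and `build_training_rows` are hypothetical names, not an API described in the episode. The assumption is that the agent only fills in a small spec, while fixed library code enforces the time cutoff that guards against temporal leakage.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PredictionTask:
    """Hypothetical declarative spec: the only thing the agent writes."""
    entity_key: str
    timestamp_key: str
    label_key: str
    cutoff: datetime  # features may only use events strictly before this


def build_training_rows(events: list[dict], task: PredictionTask) -> dict:
    """Split events at the cutoff so features never see the future."""
    features: dict[str, list[dict]] = {}
    labels: dict[str, list] = {}
    for event in events:
        entity = event[task.entity_key]
        if event[task.timestamp_key] < task.cutoff:
            # Past events are safe to use as model inputs.
            features.setdefault(entity, []).append(event)
        else:
            # Events at or after the cutoff only ever contribute labels.
            labels.setdefault(entity, []).append(event[task.label_key])
    return {"features": features, "labels": labels}


events = [
    {"user": "a", "ts": datetime(2026, 1, 5), "churned": False},
    {"user": "a", "ts": datetime(2026, 3, 1), "churned": True},
]
task = PredictionTask("user", "ts", "churned", cutoff=datetime(2026, 2, 1))
rows = build_training_rows(events, task)
```

The agent never writes the split logic itself, so it has no opportunity to join future rows into the feature set by accident.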
Ship This Week
Exposing a custom Marimo linter to Claude 4 via a `uv` tool call let the agent auto-heal, resolving about 60% of its environment-specific errors.
agents
Environment-specific agent linters
Build custom linters for your specific environment to let agents auto-heal instead of relying on complex prompt engineering.
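A minimal sketch of what an environment-specific linter could look like, in Python. The episode does not describe the actual rules, so this example assumes a single one: marimo's constraint that a global variable is defined by only one cell, with underscore-prefixed names treated as cell-local. `lint_cells` and the diagnostic fields are hypothetical.

```python
import ast
import json


def defined_names(cell_source: str) -> set[str]:
    """Collect names bound at the top level of one cell."""
    names: set[str] = set()
    for node in ast.parse(cell_source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            names.add(node.name)
        elif isinstance(node, (ast.Assign, ast.AnnAssign)):
            targets = node.targets if isinstance(node, ast.Assign) else [node.target]
            for target in targets:
                for sub in ast.walk(target):
                    # Only names being stored to count as definitions.
                    if isinstance(sub, ast.Name) and isinstance(sub.ctx, ast.Store):
                        names.add(sub.id)
    return names


def lint_cells(cells: list[str]) -> list[dict]:
    """Flag names defined in more than one cell, with a fix hint for the agent."""
    first_seen: dict[str, int] = {}
    diagnostics: list[dict] = []
    for index, source in enumerate(cells):
        for name in sorted(defined_names(source)):
            if name.startswith("_"):
                continue  # assumed cell-local, so redefinition is allowed
            if name in first_seen:
                diagnostics.append({
                    "cell": index,
                    "rule": "multiple-definitions",
                    "message": f"'{name}' is already defined in cell "
                               f"{first_seen[name]}; rename it or merge the cells.",
                })
            else:
                first_seen[name] = index
    return diagnostics


cells = ["x = 1", "y = x + 1", "x = 2"]
report = json.dumps(lint_cells(cells), indent=2)
```

Emitting structured diagnostics with a concrete fix hint gives the agent something it can act on directly in its next attempt, which is the auto-heal loop the pattern describes.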
Watch the Frontier
OpenAI's Yann Dubois notes SFT actively trains models to hallucinate unknown facts; RL fixes this by sampling only from the model's existing pre-trained knowledge.
fine_tuning
RL over SFT for hallucination reduction
Replacing supervised fine-tuning with reinforcement learning prevents models from learning to confidently fabricate unknown facts.
Today's AI Patterns
Capability shifts and emerging build patterns from this week's shows.
agents
Declarative APIs for coding agents
Constrain coding agents to high-level declarative APIs to prevent subtle data science bugs like temporal information leakage.
agents
Two-pronged admin and user agent architecture
Separate tool creation from tool execution by using an admin agent to define strict tools that a user-facing agent is constrained to use.
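One way to picture the split, as a minimal Python sketch: an admin phase registers tools with strict parameter types, then seals the registry, and the user-facing agent can only call what was registered. `Tool`, `ToolRegistry`, and the `refund` tool are hypothetical names for illustration, not the architecture from the episode.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Tool:
    name: str
    params: dict[str, type]  # strict parameter names and types
    run: Callable[..., object]


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, Tool] = {}
        self._sealed = False

    def define(self, tool: Tool) -> None:
        """Admin phase: register a tool. Fails once the registry is sealed."""
        if self._sealed:
            raise PermissionError("registry is sealed; tools can no longer be defined")
        self._tools[tool.name] = tool

    def seal(self) -> None:
        self._sealed = True

    def call(self, name: str, /, **kwargs: object) -> object:
        """User phase: run a registered tool after validating its arguments."""
        tool = self._tools.get(name)
        if tool is None:
            raise PermissionError(f"unknown tool: {name}")
        if set(kwargs) != set(tool.params):
            raise ValueError(f"{name} expects exactly {sorted(tool.params)}")
        for key, expected in tool.params.items():
            if not isinstance(kwargs[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        return tool.run(**kwargs)


registry = ToolRegistry()
registry.define(Tool(
    name="refund",
    params={"order_id": str, "amount_cents": int},
    run=lambda order_id, amount_cents: f"refunded {amount_cents} on {order_id}",
))
registry.seal()
result = registry.call("refund", order_id="A-17", amount_cents=500)
```

Sealing the registry means the user-facing agent cannot widen its own action space at runtime, even if prompted to.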
agents
State-aware meta-agent for workflow deduplication
Use a meta-agent with full contextual awareness of historical workflows to prevent users from generating duplicate automations.
fine_tuning
Classification-only fine-tuning constraint
Restrict fine-tuning to massive-scale, static classification tasks to avoid being obsoleted by frontier model releases.
evals
Multi-language formal verification via Strata
Use the open-source Strata intermediate representation to translate Python, Java, and Rust into Lean for formal verification.
fine_tuning
Training on unpublished negative data
Eli Lilly trains its molecular models on decades of failed experiments to map the non-viable chemical space.
evals
Game manual in-context evals
Evaluate an LLM's meta-learning and adaptation by putting a complex game manual in the prompt and testing its gameplay.
product
Template-constrained generative UI
Instead of generating apps from scratch, Google's Canvas classifies intent to load pre-built UI templates and uses the LLM solely to populate data.
agents
Emergent agent skill synthesis
Hermes Agent uses self-reflection prompts to automatically synthesize and save successful multi-step workflows as reusable skills.
evals
E-values for anytime eval inference
Use e-values instead of p-values to allow continuous peeking and dynamic stopping in evaluation pipelines without breaking statistical guarantees.
agents
Restricting LLMs to workflow orchestration
Eli Lilly restricts LLMs to generic workflow orchestration, routing actual molecular design to dedicated diffusion and generative flow models.
evals
Semantic evals for design agents
Build rigorous evals around component semantics and auto-layout rules to prevent design agents from generating unusable slop.
fine_tuning
GRPO for long-rollout agentic RL
Scale SFT to establish strong priors, then use GRPO to apply relative rewards to full agentic rollouts.
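The "relative rewards" part of GRPO can be sketched in a few lines of Python: each rollout in a group sampled from the same prompt is scored against the group's mean and spread, so no separate value model is needed. The function name is hypothetical, and using the population standard deviation is an assumption; implementations differ on that detail.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some code uses sample std
    # eps keeps the division defined when every rollout earns the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# One scalar reward per full agentic rollout of the same task.
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)
```

Rollouts that beat the group average get a positive advantage and the rest get a negative one; a group where every rollout scores the same contributes no learning signal.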
fine_tuning
High-density mid-training before post-training
Overweight high-quality reasoning data in a mid-training phase to shift the base model's distribution before RL.
fine_tuning
Domain-specific RL for coding models
Cursor achieved Pareto dominance by applying 3-4 weeks of reinforcement learning on its proprietary coding dataset over the Kimi K2.5 base model.
product
Fast models for exploratory flow
Use faster, less capable models for exploratory coding to maintain mental context and prevent over-trusting the AI.
prompting
In-context learning for relational subgraphs
Use a pre-trained relational transformer to perform in-context learning on database subgraphs without gradient updates.
infra
Accelerator-routed decode phase
Extend the useful life of older GPUs by routing the decode phase to domain-specific accelerators.
agents
Massive multi-agent security harness
Microsoft's Emdash beats single-model approaches by coordinating over 100 specialized agents across a mix of frontier and smaller models.
Capability Watch
New model behaviors and tool patterns showing up across multiple shows.
agents
Agents shift from monolithic models to specialized multi-agent harnesses
Single-model approaches are plateauing on complex tasks like cybersecurity and coding. Operators are deploying coordinated teams of specialized agents—mixing frontier models for reasoning with smaller models for narrow jobs—to achieve higher accuracy and lower false-positive rates.
evals
Continuous evaluation pipelines adopt anytime inference
Classical p-values break when teams continuously monitor evidence and stop early. Evaluation pipelines are shifting to e-values, which mathematically permit dynamic stopping and continuous peeking without losing statistical control over the results.
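To make the peeking guarantee concrete, here is a minimal Python sketch of one common construction, a fixed-alternative likelihood-ratio e-process for pass/fail evals. It tests the null "true pass rate is at most `p0`" against a guessed alternative `q`. The names `e_process`, `first_rejection`, `p0`, and `q` are illustrative assumptions, not taken from the episode.

```python
from typing import Iterable, Iterator, Optional


def e_process(outcomes: Iterable[bool], p0: float, q: float) -> Iterator[float]:
    """Yield the running e-value after each pass/fail outcome.

    Each step multiplies by the likelihood ratio of the alternative rate q
    against the null rate p0. With q > p0 the running product has expected
    value at most 1 whenever the true pass rate is at most p0.
    """
    e_value = 1.0
    for passed in outcomes:
        e_value *= (q / p0) if passed else ((1 - q) / (1 - p0))
        yield e_value


def first_rejection(outcomes: Iterable[bool], p0: float, q: float,
                    alpha: float = 0.05) -> Optional[int]:
    """Return the first sample count at which the null is rejected, if any."""
    for n, e_value in enumerate(e_process(outcomes, p0, q), start=1):
        # Ville's inequality: under the null, the chance the e-value ever
        # reaches 1/alpha is at most alpha, so checking every step is safe.
        if e_value >= 1 / alpha:
            return n
    return None


# A model that passes every case clears a 70% bar after 12 samples.
stop_at = first_rejection([True] * 20, p0=0.7, q=0.9)
```

Because the threshold check is valid at every step, the pipeline can stop as soon as the evidence is strong enough instead of committing to a sample size up front.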
fine_tuning
High-density mid-training phases emerge before RL
Pre-training on the entire internet dilutes high-signal reasoning data. Teams are inserting a mid-training phase to overweight verified knowledge and code, shifting the base model's priors toward reasoning before applying reinforcement learning.
Operator Bets
What practitioners are actually shipping with — frameworks, stack picks.
Fine-tuning is for massive-scale classification only
Jeremy Mumford bets that fine-tuning should be strictly reserved for massive-scale classification tasks. Defaulting to off-the-shelf models and prompt engineering prevents custom models from being immediately obsoleted by the next frontier release.
LLMs cannot perform direct scientific reasoning
Thomas Fuchs bets that LLMs are fundamentally unsuited for direct scientific reasoning due to language constraints. Eli Lilly restricts them strictly to generic workflow orchestration, routing actual molecular design to domain-specific physical models.
Fast models preserve exploratory developer flow
Vincent Warmerdam bets that faster, technically inferior models are better for exploratory coding. Near-instant generation preserves developer flow while known model limitations force the human to remain actively skeptical.
Stack Drops
Tools, libraries, and infra dropping into operator workflows now.
Strata
Open-source intermediate representation that translates Python, Java, and Rust into Lean for formal verification.
Canvas
Intent-classification system that loads pre-built UI templates and uses LLMs solely to populate data.
Composer 2.5
Pareto-dominant coding assistant built via targeted reinforcement learning on the Kimi K2.5 base model.
From the Conversations

The MAD Podcast with Matt Turck
RL over SFT for hallucination reduction
OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real
May 21, 2026 · 1h 13m · 3 quotes pulled
“SFT is going to force like Hallucination, while in reinforcement learning given that... you kind of sample from the model in the first place, extremely unlikely that it's sample something that it doesn't know and it's correct.”

Environment-specific agent linters
Agent-Harness.ipynb*
May 20, 2026 · 1h 19m · 2 quotes pulled
“what if we make a linter? And the whole point of the linter is it's going to be super Marimo-specific... that solved about 60% of all the problems.”

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)
In-context learning for relational subgraphs
Relational Foundation Models for Enterprise Data with Jure Leskovec - #768
May 21, 2026 · 1h 5m · 2 quotes pulled
“The system now goes into the database. It extracts a set of labeled in-context examples that then get passed through a pre-trained neural network to make a prediction... in a single forward pass.”