The AI Practice Weekly

Our team listens to thousands of podcasts about AI so you don't have to. One weekly briefing, in your inbox.

SATURDAY, MAY 23, 2026 · MINED FROM 129 PODCASTS

May 23 · Declarative APIs cut 1,000 lines to 50

Good morning. Teams are aggressively constraining agent action spaces to make outcomes more deterministic.

Constraining coding agents to high-level declarative APIs shrinks the error-prone generation target from roughly 1,000 lines of raw PyTorch to about 50 lines of API calls. Jure Leskovec reports that this removes subtle data science bugs, such as temporal information leakage, from complex enterprise workflows.

Ship This Week

Exposing a custom Marimo linter to Claude 4 via a `uv` tool call let the agent auto-heal, resolving about 60% of its environment-specific errors.

agents

Environment-specific agent linters

Build custom linters for your specific environment to let agents auto-heal instead of relying on complex prompt engineering.

Problem: Agents lack awareness of environment-specific constraints (like Marimo's cell structure), leading to broken code that prompts alone can't reliably fix.

On: custom linters for agent auto-healing

“what if we make a linter? And the whole point of the linter is it's going to be super Marimo-specific... that solved about 60% of all the problems.”

Recipe: Create a highly specific linter for your environment. Expose it to the agent as a standard tool call (runnable via `uv` without installing a full virtual environment). When the agent makes an environment-specific error, the linter returns exact feedback, letting the LLM auto-heal its own mistakes.

Evidence (measured): about 60% of agent errors solved

Counterpoint: Prompt engineering is brittle for environment-specific syntax; deterministic linters provide reliable auto-healing loops.

Agent-Harness.ipynb* · May 20, 2026

Claude 4 · Marimo · uv
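The linter recipe above can be sketched in a few lines. This is an illustration only, not Marimo's actual linter: it assumes a notebook is available as a list of Python source strings (one per cell) and checks a single rule, that a top-level name is assigned in only one cell. The names `assigned_names` and `lint_cells` and the message format are hypothetical, and imports are ignored for brevity.

```python
import ast


def assigned_names(source: str) -> set[str]:
    """Return the top-level names that a cell's source defines."""
    names: set[str] = set()
    for node in ast.parse(source).body:
        if isinstance(node, ast.Assign):
            targets = node.targets
        elif isinstance(node, (ast.AnnAssign, ast.AugAssign)):
            targets = [node.target]
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            names.add(node.name)
            continue
        else:
            continue
        for target in targets:
            for child in ast.walk(target):
                # Only names being written count (skips `d` in `d["k"] = 1`).
                if isinstance(child, ast.Name) and isinstance(child.ctx, ast.Store):
                    names.add(child.id)
    return names


def lint_cells(cells: list[str]) -> list[str]:
    """Report syntax errors and names defined in more than one cell."""
    messages: list[str] = []
    first_seen: dict[str, int] = {}
    for index, source in enumerate(cells):
        try:
            names = assigned_names(source)
        except SyntaxError as exc:
            messages.append(f"cell {index}: syntax error: {exc.msg}")
            continue
        for name in sorted(names):
            if name in first_seen:
                messages.append(
                    f"cell {index}: '{name}' is already defined in cell {first_seen[name]}"
                )
            else:
                first_seen[name] = index
    return messages


# Hypothetical agent output: the third cell redefines `x`.
for message in lint_cells(["x = 1", "y = x + 1", "x = 2"]):
    print(message)  # cell 2: 'x' is already defined in cell 0
```

Returned as the result of a tool call, messages like these give the model a precise, deterministic error to repair instead of a vaguer instruction buried in the prompt.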

Watch the Frontier

OpenAI's Yann Dubois notes SFT actively trains models to hallucinate unknown facts; RL fixes this by sampling only from the model's existing pre-trained knowledge.

fine_tuning

RL over SFT for hallucination reduction

Replacing supervised fine-tuning with reinforcement learning prevents models from learning to confidently fabricate unknown facts.

Problem: Supervised fine-tuning (SFT) actively trains models to hallucinate. If a human labeler cites a paper the model hasn't seen in pre-training, SFT forces the model to mimic citing unknown papers.

On: SFT as a driver of hallucinations

“SFT is going to force like Hallucination, while in reinforcement learning given that... you kind of sample from the model in the first place, extremely unlikely that it's sample something that it doesn't know and it's correct.”

Recipe: Transition from SFT to RL for factual alignment. SFT forces the model to mimic human answers, including facts absent from its pre-training data. RL samples from what the model already knows: if it does not know a fact, it is very unlikely to sample it correctly, so fabrications go unrewarded and "I don't know" behavior can be rewarded instead.

Evidence: Speaker's word, not measured.

Counterpoint: SFT is the default for teaching models how to answer, but it inadvertently teaches them to confidently fabricate facts they don't actually know.

OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real · May 21, 2026

Today's AI Patterns

Capability shifts and emerging build patterns from this week's shows.

agents

Declarative APIs for coding agents

Constrain coding agents to high-level declarative APIs to prevent subtle data science bugs like temporal information leakage.

Problem: Coding agents writing raw PyTorch or XGBoost pipelines introduce subtle data science bugs, like temporal information leakage (aggregating data to midnight instead of the exact transaction time).

On: high-level APIs for coding agents

“These models write thousands of lines of code but there are this like super subtle data sciency mistakes... If you give it this more like higher level Kumo-like API, then it's able to do the same work in about 50 lines of code. No mistakes.”

Recipe: Do not ask agents to write low-level ML framework code. Instead, provide high-level, declarative APIs that abstract away feature engineering and temporal joins. This reduces the generation target from roughly 1,000 lines of error-prone code to about 50 lines of API calls.

Evidence (measured): 1,000 lines → 50 lines, 0 mistakes

Counterpoint: Agents fail at long-horizon data science tasks due to subtle logic bugs; constraining their action space with high-level APIs eliminates these failure modes.

Relational Foundation Models for Enterprise Data with Jure Leskovec - #768 · May 21, 2026

PyTorch · XGBoost · Kumo · Claude Code
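To make the temporal-leakage bug named in this card concrete, here is a small sketch. It is not Kumo's API: the event layout and the function names `spend_before` and `spend_same_day_leaky` are assumptions for illustration. The leaky variant aggregates a whole calendar day, so it counts a transaction that happens after the moment the prediction is made.

```python
from datetime import datetime

# Hypothetical event layout: (customer_id, timestamp, amount)
Event = tuple[str, datetime, float]


def spend_before(events: list[Event], customer: str, as_of: datetime) -> float:
    """Point-in-time feature: total spend strictly before `as_of`."""
    return sum(a for c, t, a in events if c == customer and t < as_of)


def spend_same_day_leaky(events: list[Event], customer: str, as_of: datetime) -> float:
    """Leaky variant: rounds to the calendar day, so later events slip in."""
    return sum(a for c, t, a in events if c == customer and t.date() <= as_of.date())


events = [
    ("c1", datetime(2026, 5, 20, 9, 0), 40.0),
    ("c1", datetime(2026, 5, 20, 15, 0), 60.0),  # occurs after the prediction time
]
prediction_time = datetime(2026, 5, 20, 12, 0)

print(spend_before(events, "c1", prediction_time))          # 40.0
print(spend_same_day_leaky(events, "c1", prediction_time))  # 100.0
```

A declarative API can own the `t < as_of` filter once, so the agent never has the chance to write the leaky version.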

agents

Two-pronged admin and user agent architecture

Separate tool creation from tool execution by using an admin agent to define strict tools that a user-facing agent is constrained to use.

Problem: Enterprises want the reasoning capabilities of LLMs for user support but cannot risk giving a single agent unlimited autonomy and system access.

On: two-pronged architecture for enterprise agent security

“The help desk agent can only use the tools and skills that have been expressly built, published with approvals and permissions and all of that by the admins... you get the full ability and intelligence of the help desk agent to use those tools appropriately.”

Recipe: Divide the system into two agents. The 'Admin agent' is used by IT to build, approve, and publish specific tools with strict permissions. The 'Help desk agent' handles end-user chats with full reasoning capabilities, but its execution environment is strictly limited to the whitelisted tools published by the Admin agent.

Evidence: Speaker's word, not measured.

Counterpoint: Instead of relying on system prompts to restrict a single omnipotent agent, separate tool creation from tool execution at the architecture level.

Rebuilding IT From the Ground Up for the AI Age: Serval's Jake Stauch · May 19, 2026
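One way to picture the split described in this card is a registry that only the admin side can write to and the help desk side can only call. This is a sketch under stated assumptions, not Serval's implementation: `ToolRegistry`, `publish`, `call`, and the `approved_by` field are hypothetical names.

```python
from typing import Callable


class ToolRegistry:
    """Tools exist for the help desk agent only after an admin publishes them."""

    def __init__(self) -> None:
        self._published: dict[str, Callable[..., str]] = {}

    def publish(self, name: str, func: Callable[..., str], approved_by: str) -> None:
        # Admin-side path: refuse to publish without a named approver.
        if not approved_by:
            raise PermissionError("publishing a tool requires an approver")
        self._published[name] = func

    def call(self, name: str, **kwargs) -> str:
        # Help-desk-side path: anything unpublished is rejected outright.
        if name not in self._published:
            raise PermissionError(f"tool '{name}' is not published")
        return self._published[name](**kwargs)


def reset_password(user: str) -> str:
    return f"password reset link sent to {user}"


registry = ToolRegistry()
registry.publish("reset_password", reset_password, approved_by="it-admin")

print(registry.call("reset_password", user="ana"))
```

The restriction lives in code rather than in the system prompt, so a cleverly worded user message cannot talk the help desk agent into a tool that was never published.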

agents

State-aware meta-agent for workflow deduplication

Use a meta-agent with full contextual awareness of historical workflows to prevent users from generating duplicate automations.

Problem: When natural language makes automation too easy, users build dozens of duplicate workflows (e.g., 20 password reset flows), confusing the routing AI on which to execute.

On: meta-agent to prevent duplicate workflow generation

“When you say, hey, I want this workflow that does X, Y and Z, it says, hey, actually you've got 19 that already do that. I could modify one of these, but here's what I think you should do.”

Recipe: Deploy a meta-agent that intercepts workflow creation requests. Give it full contextual awareness of all previously built workflows. When a user requests a new workflow, the agent checks for duplicates, suggests modifying an existing one, or recommends deleting redundant workflows before building a new one.

Evidence: Speaker's word, not measured.

Counterpoint: Instead of just executing the user's prompt to build a workflow, the agent acts as a state-aware gatekeeper to prevent database bloat.

Rebuilding IT From the Ground Up for the AI Age: Serval's Jake Stauch · May 19, 2026
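The duplicate check in this recipe can be approximated with a simple similarity test. The sketch below uses token overlap (Jaccard similarity) as a stand-in; the episode does not say how the meta-agent compares workflows, and a production system would more likely use embeddings or the LLM itself. `find_duplicates` and the 0.5 threshold are assumptions.

```python
def tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens of a workflow description."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Share of tokens the two descriptions have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def find_duplicates(request: str, existing: dict[str, str], threshold: float = 0.5) -> list[str]:
    """Return names of existing workflows whose description overlaps the request."""
    return [name for name, desc in existing.items() if jaccard(request, desc) >= threshold]


existing = {
    "reset-pw": "reset a user password",
    "onboard": "create accounts for a new hire",
}
print(find_duplicates("reset password for a user", existing))  # ['reset-pw']
```

A non-empty result is the cue for the gatekeeper to propose modifying an existing workflow rather than building another one.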

fine_tuning

Classification-only fine-tuning constraint

Restrict fine-tuning to massive-scale, static classification tasks to avoid being obsoleted by frontier model releases.

Problem: Teams spend massive resources fine-tuning models only to have their custom model immediately outperformed by the next general frontier model release.

On: limiting fine-tuning to classification

“I have an extremely straightforward classification job that I need to do at scale. I'm talking millions or billions of data points. If you have a problem like this, it may make sense to do fine tuning.”

Recipe: Default to off-the-shelf models and prompt engineering. Only approve fine-tuning for extremely narrow, static domains, specifically classification jobs operating at the scale of millions or billions of data points.

Evidence: Speaker's word, not measured.

Counterpoint: Teams often fine-tune to inject knowledge, but fine-tuning is best reserved for teaching narrow formats or massive-scale classification.

993: How to Build AI-First Organizations, with Jacob Miller and Jeremy Mumford · May 19, 2026

Bloomberg GPT · GPT-3 · GPT-4

evals

Multi-language formal verification via Strata

Use the open-source Strata intermediate representation to translate Python, Java, and Rust into Lean for formal verification.

Problem: Applying mathematical theorem provers to production codebases is difficult because provers don't natively understand modern languages like Rust or Python.

On: translating production code to Lean

“translate the programs from Rust into Lean via Strata, which we've open sourced, and then to reason in Lean.”

Recipe: Translate production code (Python, Java, Rust) into Strata, an open-source intermediate representation. This unifies the code into a logical representation that maps directly onto the semantics of the Lean theorem prover, allowing LLMs to reason about it formally.

Evidence: Speaker's word, not measured.

Formal Methods as Agent Guardrails · May 19, 2026

Strata · Lean

fine_tuning

Training on unpublished negative data

Eli Lilly trains its molecular models on decades of failed experiments to map the non-viable chemical space.

Problem: Training AI solely on published scientific literature limits models to positive outcomes, ignoring the vast majority of experiments that failed due to toxicity or poor binding.

On: training on negative experimental data

“For every molecule that worked, we had millions that failed. It failed because it didn't bind or it was toxic... And all these never get published... And that helps us then to build better molecules.”

Recipe: Incorporate internal negative experimental results (unpublished failures, toxic molecules, non-binding candidates) into the training sets of diffusion and generative flow models. This teaches the model which areas of the combinatorial space to avoid, rather than just mimicking successful published literature.

Evidence: Speaker's word, not measured.

Counterpoint: Most teams train on successful public literature; Lilly trains on decades of unpublished failed experiments to map the non-viable space.

Scaling Scientific R&D with AI Supercomputing Infrastructure — with Thomas Fuchs of Eli Lilly · May 19, 2026

evals

Game manual in-context evals

Evaluate an LLM's meta-learning and adaptation by putting a complex game manual in the prompt and testing its gameplay.

Problem: Standard evals test static knowledge that is often already in the pre-training data, failing to measure a model's ability to learn, adapt, and follow complex new instructions.

On: evaluating meta-learning via game manuals

“You give the instruction manual for, I think it was Civilization, the game. And then you're meant to be able to play... as you play the game, you learn to play better.”

Recipe: Create an eval based on a complex, out-of-distribution game (like Civilization or a custom invented game). Place the full instruction manual in the context window. Evaluate the model on two axes: initial instruction following (playing valid moves) and in-context learning (improving gameplay over the course of the context window).

Evidence: Speaker's word, not measured.

Counterpoint: Standard evals measure memorization; this measures in-context adaptation by forcing the model to learn a complex ruleset entirely from the prompt.

Ep 87: Gemini Co-Lead on World Models, RL's Next Domains & Continual Learning · May 22, 2026

product

Template-constrained generative UI

Instead of generating apps from scratch, Google's Canvas classifies intent to load pre-built UI templates and uses the LLM solely to populate data.

Nilay Patel · Host / The Vergecast · The Verge · The Vergecast

Problem: Generating full applications from scratch via LLM for every user query is inefficient, error-prone, and often causes the model to spiral into broken states.

On: generative UI app templates

“We're not actually in the end going to custom write everybody different apps. We know that people asking for vacations are going to want a vacation app and we will have one and we will tailor it to you on the front end.”

Recipe: Classify the user's intent into a known category (e.g., trip planner). Instead of generating the application from scratch, load a pre-built HTML/CSS template with standard UI components. Use the LLM strictly to populate the specific data and tailor the front-end logic, preventing the model from hallucinating application structure.

Evidence: Speaker's word, not measured.

Counterpoint: Most generative UI approaches attempt to write raw code from scratch; this approach treats the LLM as a data-population engine for static templates.

The post-search Google era begins · May 22, 2026

Canvas
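The classify-then-populate flow in this recipe can be sketched with the standard library. Nothing here reflects how Canvas is actually built: the templates, the keyword-based `classify_intent` (a stand-in for a real intent classifier), and the field names are all assumptions. In a real system the `fields` dictionary is the only part the LLM would produce.

```python
import html
from string import Template

# Hypothetical pre-built templates; structure is fixed, only data varies.
TEMPLATES = {
    "trip_planner": Template("<h1>Trip to $destination</h1><p>$nights nights</p>"),
    "recipe": Template("<h1>$dish</h1><p>Serves $servings</p>"),
}


def classify_intent(query: str) -> str:
    """Keyword stand-in for an intent classifier."""
    q = query.lower()
    if "trip" in q or "vacation" in q:
        return "trip_planner"
    if "recipe" in q or "cook" in q:
        return "recipe"
    raise ValueError("no template for this query")


def render(query: str, fields: dict[str, str]) -> str:
    """Fill the template chosen by intent; a missing field raises KeyError."""
    safe = {key: html.escape(value) for key, value in fields.items()}
    return TEMPLATES[classify_intent(query)].substitute(safe)


print(render("plan a vacation", {"destination": "Lisbon", "nights": "4"}))
# <h1>Trip to Lisbon</h1><p>4 nights</p>
```

Because the model only supplies values, a bad generation can produce wrong data but not a broken page structure.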

agents

Emergent agent skill synthesis

Hermes Agent uses self-reflection prompts to automatically synthesize and save successful multi-step workflows as reusable skills.

Jeffrey Connell · Co-founder and CTO · Nous Research · Practical AI

Problem: Hard-coding agent tools is brittle; web environments change, and anticipating every user need requires endless manual tool creation.

On: emergent skill creation via self-reflection

“It notices without you telling it that this is... I've learned there's something important here. And it will create a skill that says how to do a specific thing.”

Recipe: Provide minimal hard-coded primitives (code execution, web browsing). Add a self-reflection directive to the system prompt: when the agent achieves a complex goal (e.g., bypassing captchas to find a booking API), it must extract the successful steps and write a reusable 'skill' to its registry for future identical tasks.

Evidence (measured): 30-45 mins runtime → instant reuse

Counterpoint: Instead of developers writing explicit tools for every API, the agent uses basic web/code primitives to discover the API and writes the tool itself.

Hermes Agent: Agents that grow with you · May 21, 2026

Hermes Agent

evals

E-values for anytime eval inference

Use e-values instead of p-values to allow continuous peeking and dynamic stopping in evaluation pipelines without breaking statistical guarantees.

Problem: Classical p-values fail for continuous evaluation; repeatedly peeking at data or stopping early invalidates the statistical guarantees of the evaluation pipeline.

On: anytime inference for continuous evaluation

“e-values are different. It's an expectation of some non-negative random variable or super martingale... By the optional stopping theorem you can stop it whenever you want. So that has opened up a lot of connections... It's called anytime inference.”

Recipe: Replace p-values with e-values: non-negative statistics whose expected value under the null hypothesis is at most 1, and whose running product forms a non-negative supermartingale. Because that bound survives the optional stopping theorem, you can continuously monitor evidence, peek at results, and dynamically gather new data without losing statistical control.

Evidence: Speaker's word, not measured.

Counterpoint: Standard A/B testing and eval pipelines break if you peek and stop early; e-values mathematically permit continuous monitoring and dynamic stopping.

Intelligence is collective, not artificial — Prof. Michael I. Jordan (UC Berkeley / Inria) · May 21, 2026

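For the e-value recipe above, a minimal sketch of anytime-valid stopping in an eval pipeline. The setup is an assumption, not something described in the episode: each outcome is 1 if the candidate model beats the baseline on an eval item and 0 otherwise, the null hypothesis is a 50% win rate, and the alternative win rate of 0.7 is an arbitrary illustrative choice. The running wealth is a likelihood ratio whose expected per-step factor under the null is exactly 1, and by Ville's inequality the chance it ever reaches 1/alpha under the null is at most alpha.

```python
def e_process(outcomes, alt_win_rate=0.7, null_win_rate=0.5):
    """Yield the running likelihood-ratio wealth after each 0/1 outcome."""
    wealth = 1.0
    for won in outcomes:
        if won:
            wealth *= alt_win_rate / null_win_rate
        else:
            wealth *= (1 - alt_win_rate) / (1 - null_win_rate)
        yield wealth


def stop_when_significant(outcomes, alpha=0.05, **kwargs):
    """Return the 1-based index where wealth first reaches 1/alpha, else None."""
    for n, wealth in enumerate(e_process(outcomes, **kwargs), start=1):
        if wealth >= 1 / alpha:
            return n
    return None


# Peeking after every item is allowed; stop as soon as the threshold is hit.
print(stop_when_significant([1] * 20))    # 9
print(stop_when_significant([1, 0] * 50)) # None
```

Checking the wealth after every single item is the "peeking" that would invalidate a fixed-sample p-value but is permitted here.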

agents

Restricting LLMs to workflow orchestration

Eli Lilly restricts LLMs to generic workflow orchestration, routing actual molecular design to dedicated diffusion and generative flow models.

Problem: Human language lacks the dimensional complexity to accurately describe cellular biology, making LLMs fundamentally unsuited for direct molecular design or scientific reasoning.

On: routing scientific tasks to physical models

“In Discovery, we are using large language models mostly just to orchestrate work... The complexity of a single cell goes far beyond human language can even describe. So they go beyond these molecular models, diffusion models, generative flow models.”

Recipe: Constrain LLMs strictly to generic workflow orchestration and task routing. For actual scientific generation, route requests to domain-specific physical models (diffusion models, generative flow models) that operate beyond language constraints.

Evidence: Speaker's word, not measured.

Counterpoint: Instead of forcing LLMs to perform scientific reasoning, restrict them to workflow routing and use physical models for the actual science.

Scaling Scientific R&D with AI Supercomputing Infrastructure — with Thomas Fuchs of Eli Lilly · May 19, 2026

evals

Semantic evals for design agents

Build rigorous evals around component semantics and auto-layout rules to prevent design agents from generating unusable slop.

Dylan Field · CEO · Figma · TBPN

Problem: Design agents tend to overcomplicate outputs or generate "one-shot slop" that looks visually correct but breaks underlying component structures and layout rules.

On: evaluating design agent semantics

“In terms of semantics, I mean, the way that you represent a component or the way you apply auto layout even, these are all examples of things that we've had to get more rigorous around how we build evals”

Recipe: Shift the agent's focus from full-page generation to rote tasks (design system maintenance, text translation). Implement strict evals targeting specific design semantics, such as how a component is represented and how auto-layout is applied, so the generated output remains usable and moldable.

Evidence: Speaker's word, not measured.

Counterpoint: Visual similarity is insufficient for design generation; evals must measure structural correctness like auto-layout and component semantics.

Google I/O Reactions, Large IPOs Incoming, Figma's AI Assistant | Dylan Field, Brian Chesky, Feross Aboukhadijeh, Tae Kim, Immad Akhund, Marcus Milione · May 20, 2026

fine_tuning

GRPO for long-rollout agentic RL

Scale SFT to establish strong priors, then use GRPO to apply relative rewards to full agentic rollouts.

Problem: In agentic workflows, you only know if the model succeeded at the end of a long rollout. Token-level reward attribution is nearly impossible for multi-step reasoning.

On: simplifying RL for agentic rollouts

“in the open source as well, GRPO seems to be working very well... you sample as many answers as possible and you say which one is correct. So in some way, GRPO is a very simplistic method.”

Recipe: For agentic workflows, abandon token-level attribution. Instead, scale SFT to roughly 1 million examples to build strong priors. Then apply GRPO: sample multiple full rollouts from the model for the same task and score them relative to one another, optimizing the policy at the sequence level.

Evidence: Speaker's word, not measured.

Counterpoint: Token-level reward attribution fails on multi-step agent tasks; simple rollout-level sampling (GRPO) scales better once SFT is saturated.

OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real · May 21, 2026

Kimi · DeepSeek · GRPO · PPO · DPO
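The "score rollouts relative to each other" step in the GRPO recipe comes down to normalizing rewards within a sampling group. The sketch below shows only that step, under stated assumptions: rewards are one scalar per full rollout, the population standard deviation is used (implementations differ on this), and `group_relative_advantages` is a hypothetical name. The policy update itself (clipped ratio, KL penalty) is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each rollout's reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task, scored only at the end (1 = success).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 3) for a in advantages])  # [1.0, -1.0, -1.0, 1.0]
```

Every token in a rollout then shares that rollout's advantage, which is what lets the method sidestep token-level credit assignment.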

fine_tuning

High-density mid-training before post-training

Overweight high-quality reasoning data in a mid-training phase to shift the base model's distribution before RL.

Problem: Pre-training on the entire internet dilutes high-signal reasoning data (GitHub, Wikipedia) with low-quality tokens (ads, forums), making post-training less effective.

On: overweighting high-signal data pre-RL

“In mid-training, we basically overweight this type of high-quality data that we think is more useful for training the final model... Wikipedia or GitHub... there's way more information in there than some random forums.”

Recipe: Insert a mid-training phase between standard pre-training and post-training. Take the pre-trained base model and continue training it on a heavily overweighted mix of high-signal data (like code and verified knowledge) while filtering out low-quality internet text. This shifts the model's priors toward reasoning before applying SFT or RL.

Evidence: Speaker's word, not measured.

Counterpoint: Most teams jump straight from pre-trained base models to SFT; inserting a continued pre-training step on high-density data improves the base for RL.

OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real · May 21, 2026

fine_tuning

Domain-specific RL for coding models

Cursor achieved Pareto dominance by applying 3-4 weeks of reinforcement learning on its proprietary coding dataset on top of the Kimi K2.5 base model.

Problem: Base models plateau on coding tasks and require massive compute to pre-train from scratch to reach the Pareto frontier.

On: reinforcement learning for coding models

“This is three or four weeks of doing reinforcement learning on Colossus 2 with Cursor's data... Composer 2.5 is the same base model as Composer 2, which is Kimi K2.5... this is Pareto dominant.”

Recipe: Instead of pre-training a new model, apply 3-4 weeks of reinforcement learning using a massive proprietary dataset (Cursor's coding data) on top of an existing base model (Kimi K2.5) to create a Pareto-dominant coding assistant.

Evidence: Speaker's word, not measured.

Counterpoint: Most teams focus on pre-training or standard SFT; Cursor achieved state-of-the-art purely through targeted RL on an existing base model.

SpaceX's $2T Case, Nvidia's Shock Selloff, America Turns on AI, Trump Pulls AI Order, Bond Crisis? · May 22, 2026

Composer 2.5 · Composer 2 · Kimi K2.5 · Cursor

product

Fast models for exploratory flow

Use faster, less capable models for exploratory coding to maintain mental context and prevent over-trusting the AI.

Problem: Waiting minutes for frontier models breaks developer flow during exploratory work, and highly capable models induce a "slot machine" mentality where developers stop actively thinking.

On: using worse models for exploration

“Maybe using Kimi K2 in OpenCode is better than using the most recent version of Claude because again, it can generate nearly to the speed of thought... you trust it less.”

Recipe: Swap frontier models for faster, technically "worse" models (like Kimi K2) during exploratory notebook work. The near-instant generation preserves context, while the model's known limitations force the developer to remain actively skeptical and engaged in the loop.

Evidence: Speaker's word, not measured.

Counterpoint: Frontier models are assumed best for all tasks, but their latency breaks flow and their capability breeds dangerous developer complacency.

Agent-Harness.ipynb* · May 20, 2026

Kimi K2 · Claude

prompting

In-context learning for relational subgraphs

Use a pre-trained relational transformer to perform in-context learning on database subgraphs without gradient updates.

Problem: Training custom predictive models for thousands of enterprise clients with unique database schemas requires impossible scaling of data science teams.

On: in-context learning for tabular data

“The system now goes into the database. It extracts a set of labeled in-context examples that then get passed through a pre-trained neural network to make a prediction... in a single forward pass.”

Recipe: Define the predictive task (e.g., 'predict transaction.isfraud = true'). Extract historical subgraphs of labeled entities from the database to serve as in-context examples. Pass these labeled subgraphs alongside the new unlabeled target subgraph through a frozen pre-trained relational transformer in a single forward pass.

Evidence (measured): 5% relative accuracy improvement

Counterpoint: Instead of fine-tuning a model per schema, the system uses historical database subgraphs as few-shot prompts for a frozen relational transformer.

Relational Foundation Models for Enterprise Data with Jure Leskovec - #768 · May 21, 2026

infra

Accelerator-routed decode phase

Extend the useful life of older GPUs by routing the decode phase to domain-specific accelerators.

Problem: Older GPUs quickly lose viability for modern LLM inference, breaking the 4-6 year amortization schedules required by cloud providers.

On: routing decode to extend GPU life

“You can put whether it's a Groq accelerator, whether it's a Cerebras accelerator in front of old GPUs, use Groq or Cerebras for decode. And then those older GPUs, they have a useful life for 10 or 15 years.”

Recipe: Deploy domain-specific accelerators (like Cerebras or Groq chips) at the front of the inference pipeline specifically to handle the decode phase. Route the remaining workload to older GPUs, extending their useful lifespan to 10-15 years.

Evidence (measured): 10-15 year useful GPU life

Counterpoint: Instead of retiring older GPUs entirely, delegating only the bottleneck (decode) to specialized ASICs keeps the older hardware economically viable.

SpaceX's $2T Case, Nvidia's Shock Selloff, America Turns on AI, Trump Pulls AI Order, Bond Crisis? · May 22, 2026

Cerebras

agents

Massive multi-agent security harness

Microsoft's MDash beats single-model approaches by coordinating over 100 specialized agents across a mix of frontier and smaller models.

Problem: Single-model approaches to complex cybersecurity tasks struggle to achieve high accuracy and low false-positive rates across diverse vulnerability detection workloads.

On: multi-agent architecture for cybersecurity

“MDash uses more than 100 specialized agents across a mix of frontier and smaller models. In other words, the winning pattern here isn't one giant model doing everything. It's a coordinated team of models doing different jobs well.”

Recipe: Deploy a multi-agent harness rather than a single monolithic model. Coordinate over 100 specialized agents, mixing frontier models for complex reasoning tasks with smaller, task-specific models for narrow jobs to optimize overall system performance.

Evidence (measured): 88.45% on CyberGym (+5 pts)

Counterpoint: Single-model purity is less effective than a coordinated team of specialized agents mixing frontier and smaller models.

AI Weekly Briefing: Is NVIDIA Finally Getting Real Competition? · May 21, 2026

Mythos · MDash

Capability Watch

New model behaviors and tool patterns showing up across multiple shows.

agents

Agents shift from monolithic models to specialized multi-agent harnesses

Single-model approaches are plateauing on complex tasks like cybersecurity and coding. Operators are deploying coordinated teams of specialized agents—mixing frontier models for reasoning with smaller models for narrow jobs—to achieve higher accuracy and lower false-positive rates.

evals

Continuous evaluation pipelines adopt anytime inference

Classical p-values break when teams continuously monitor evidence and stop early. Evaluation pipelines are shifting to e-values, which mathematically permit dynamic stopping and continuous peeking without losing statistical control over the results.

fine_tuning

High-density mid-training phases emerge before RL

Pre-training on the entire internet dilutes high-signal reasoning data. Teams are inserting a mid-training phase to overweight verified knowledge and code, shifting the base model's priors toward reasoning before applying reinforcement learning.

Operator Bets

What practitioners are actually shipping with — frameworks, stack picks.

Fine-tuning is for massive-scale classification only

Jeremy Mumford bets that fine-tuning should be strictly reserved for massive-scale classification tasks. Defaulting to off-the-shelf models and prompt engineering prevents custom models from being immediately obsoleted by the next frontier release.

LLMs cannot perform direct scientific reasoning

Thomas Fuchs bets that LLMs are fundamentally unsuited for direct scientific reasoning due to language constraints. Eli Lilly restricts them strictly to generic workflow orchestration, routing actual molecular design to domain-specific physical models.

Fast models preserve exploratory developer flow

Vincent Warmerdam bets that faster, technically inferior models are better for exploratory coding. Near-instant generation preserves developer flow while known model limitations force the human to remain actively skeptical.

Stack Drops

Tools, libraries, and infra dropping into operator workflows now.

Strata

Open-source intermediate representation that translates Python, Java, and Rust into Lean for formal verification.

Canvas

Intent-classification system that loads pre-built UI templates and uses LLMs solely to populate data.

From the Conversations

The MAD Podcast with Matt Turck

RL over SFT for hallucination reduction

OpenAI's Yann Dubois: Why AI Progress Suddenly Feels Real

May 21, 2026 · 1h 13m · 3 quotes pulled

“SFT is going to force like Hallucination, while in reinforcement learning given that... you kind of sample from the model in the first place, extremely unlikely that it's sample something that it doesn't know and it's correct.”

Yann Dubois · Co-leads Post-Training Frontiers · OpenAI

Vanishing Gradients

Environment-specific agent linters

Agent-Harness.ipynb*

May 20, 2026 · 1h 19m · 2 quotes pulled

“what if we make a linter? And the whole point of the linter is it's going to be super Marimo-specific... that solved about 60% of all the problems.”

Vincent Warmerdam · Engineer · Marimo

The TWIML AI Podcast (formerly This Week in Machine Learning & Artificial Intelligence)

In-context learning for relational subgraphs

Relational Foundation Models for Enterprise Data with Jure Leskovec - #768

May 21, 2026 · 1h 5m · 2 quotes pulled

“The system now goes into the database. It extracts a set of labeled in-context examples that then get passed through a pre-trained neural network to make a prediction... in a single forward pass.”

Jure Leskovec · Co-founder & Chief Scientist · Kumo
