Logic,in Dialogue

A Research Vision · Originals & Taylor's Versions

Four questions about how humans and formal systems exchange answers, and what changes when a language model joins the loop

Overture

Program analysis and verification turn programs into checkable facts. A type checker proves an invariant. A model checker rules out a state. An interpolation engine isolates the reason a trace fails. Each fact is small, grounded in logic, and stands on its own.

The harder problem is turning that capability into something useful. Four questions show up across two decades of my work. How do you represent a program in logic so a verifier can reason about it effectively. How do you distill the result into a small fact a developer can act on. When the verifier gets stuck, how do you enlist a human, and how do you make that exchange bearable for the human and useful to the machine. How do you know any of it is working on real codebases.

I have worked on these questions since 2006, in academia, at SRI, and at AWS, with software as the running case. The first two are design problems. The third is a protocol problem. The fourth is empirical. Each has its own literature. Each has been pursued under one shared assumption: a human is somewhere in the loop, slow and scarce, and the work is built around getting the most out of that person.

Since 2024 that assumption has changed. The human was always the limiting reagent. Crowds were slow. Annotations were expensive. Reports went unread. Studies of analyzer adoption depended on logs and inference because direct measurement was too costly. Large language models lift the constraint at every step of the loop. They can read the program. They can propose the abstraction. They can guess the invariant. They can read the report and act on it. They can also be confidently wrong in ways verifiers cannot. The constraint that shaped each line of work no longer holds.

I am now in my neurosymbolic era. The catalog below pairs published work with a Taylor's Version that asks how it changes when a language model joins the loop.

The four questions reopen, each one in a different way. An abstraction that took years to design by hand may now be tractable as an agentic refinement under verifier supervision. An interpolant that used to be presented to a developer becomes an assertion an agent can call without recompiling. An invariant a crowd would have been asked to guess is now proposed by a model and rejected or accepted by the same verifier. A study that depended on sparse human signal becomes a controlled experiment with thousands of agent runs. The symbolic backbone is what holds the new loop together. The interesting research question is which parts of that backbone bear weight when the rest is rebuilt around a fluent participant.

The Catalog

Four sides, one question each. The published work appears on top. Below it, a Taylor's Version asks how the same question reads when a language model is in the room. The catalog leans on software because that is where most of the published work lives. The questions do not.

Side A · Why?

Why Did the System Do That?

Finding abstractions that let a verifier reason effectively about a program.

A1 · 2015

Bixie: Finding and Understanding Inconsistent Code

T. McCarthy, P. Rümmer, M. Schäf — ICSE Demo

A weakest-precondition abstraction for Java that flags paths whose preconditions contradict the program's own assumptions. The abstraction is cheap enough to run on every commit and precise enough to find real bugs.

A2 · 2015

Non-Monotonic Program Analysis

D. Schwartz-Narbonne, C. Oh, M. Schäf, P. Rümmer, A. Tiwari, T. Wies — HCVS

An abstraction that adds and drops facts as the analysis proceeds. The classical chain of monotonically growing approximations becomes a search over abstractions. Useful when the right invariant is not the union of cheaper ones.

A3 · 2017

Quantified Heap Invariants for Object-Oriented Programs

T. Kahsai, R. Kersten, P. Rümmer, M. Schäf — LPAR

A heap abstraction that admits quantified invariants over object fields. The right shape of quantifier turns proofs that were out of reach into proofs the verifier can find on its own.

A4 · 2020

Verifying Object Construction

M. Kellogg, M. Ran, M. Sridharan, M. Schäf, M. D. Ernst — ICSE

A type-state abstraction for object initialization. The fact that an object's invariants must hold before any method can be called becomes a type-system property the compiler can check.

A5 · 2021

RAPID: API Checking for the Cloud in the Cloud

M. Emmi, L. Hadarean, R. Jhala, L. Pike, N. Rosner, M. Schäf, A. Sengupta, W. Visser — ESEC/FSE

An abstraction of cloud APIs as protocols, with a checker that verifies client code against the protocol. The model is what the API documentation should have been. The checker turns it into a property the build can enforce.

A6 · 2024

Inference for ever-changing policy of taint analysis

W.-H. Chiang, P. Li, Q. Zhou, S. Banerjee, M. Schäf, Y. Lyu, H. Nguyen, O. Tripp — ICSE-SEIP

A taint analysis whose source/sink/sanitizer policy is inferred from the codebase rather than written by hand. The abstraction is the same. The cost of keeping the policy current drops to where industrial use becomes tractable.

Side A · Taylor's Version

Why, Re-recorded

The expensive part of this work was the abstraction itself. Translating a program into a heap model, into a type-state, into an API protocol, into a taint policy, took years of careful design and engineering for each abstraction. The agent changes that calculus. An agentic loop can do most of the translation in a semi-automatic pass and use the verifier as the oracle that confirms whether the abstraction captures the behavior. Wrong translations get rejected. Right ones are kept. Trying a new abstraction becomes a matter of running the loop instead of writing a paper.

The output side changes too. These tools were designed to give a human developer actionable information. The first reader is now usually an agent. What the agent needs differs from what the developer needs. The information density can be higher. The form can be more structured. The audience for a program-analysis report is no longer fixed, and the design space for the report opens up with it.

From the Vault Which abstractions are too expensive to design by hand but cheap enough for an agent to discover under verifier supervision? And what does a program-analysis output look like when its first reader is another program?

Side B · Show?

How Do You Say It to a Human?

Distilling a failing run into a small fact that explains the bug.

B1 · 2012

Error Invariants

E. Ermis, M. Schäf, T. Wies — FM

A sequence of Craig interpolants along a failing trace produces the smallest formula at each program point that still explains the failure. The output replaces a long trace with a short generalization. Less to read. Fewer variables to track.

B2 · 2013

Flow-sensitive Fault Localization

J. Christ, E. Ermis, M. Schäf, T. Wies — VMCAI

Fault localization that respects control flow. The explanation tracks how information moves through branches and joins, so the localized cause is faithful to the program's structure rather than a flat slice.

B3 · 2015

VERMEER: A Tool for Tracing and Explaining Faulty C Programs

D. Schwartz-Narbonne, C. Oh, M. Schäf, T. Wies — ICSE Demo

The interactive tool. A developer steps through the trace; the explanation engine highlights the variables and statements that the interpolants say matter, and abstracts away the rest.

B4 · 2016

Classifying Bugs with Interpolants

A. Podelski, M. Schäf, T. Wies — TAP

The shape of an interpolant is a fingerprint of the bug class. Quantifiers, variables, and arithmetic operators carry information about what kind of error produced the trace. Off-by-one looks like off-by-one in logic.

Side B · Taylor's Version

Show, Re-recorded

The original motivation was cognitive load. An error invariant generalizes the reason a specific execution failed. Showing a developer a predicate at a program point requires less cognitive effort than reading a trace. The motivation transfers when the reader is an agent, but the gain is different. The agent saves tokens. The agent does not have to re-execute or re-compile to check whether a candidate fix satisfies the invariant. The invariant is a check the agent can call. Error Invariants and Flow-sensitive Fault Localization become assertions the agent's repair has to satisfy. Classifying Bugs with Interpolants becomes a fingerprint the agent can use to retrieve fixes from prior runs. The interpolation engine can also synthesize new tests from the same data, which the agent can use to confirm a fix on cases the developer never wrote.

Beyond the agentic repair loop, the line of work points at a larger question. Our recent paper on delta debugging for LLM-integrated systems is one step into this. When natural language is part of the program, a failure is no longer just a state contradiction. We need to rethink what "the reason for the failure" means.

From the Vault What does fault localization mean when half the program is a prompt? Which of the symbolic explanation primitives transfer, and which need a new definition before they can be applied?

Side C · Listen?

What Does the Other Voice Add?

Enlisting a human when the verifier gets stuck. Where to draw the boundary, and how to structure the exchange.

C1 · 2014

Chekofv: Crowd-Sourced Formal Verification

H. Logas, F. Kirchner, J. Murray, M. Schäf, J. Whitehead — FWFM

A game in which crowds propose candidate loop invariants. The verifier filters them. The protocol works because the search space is large for a machine and not for a human, and the verifier can cheaply confirm a useful guess.

C2 · 2015

Gamifying Program Analysis

D. Fava, J. Signoles, M. Lemerre, M. Schäf, A. Tiwari — LPAR

A study of which program-analysis subproblems can be cast as games for non-experts. The contribution is the boundary itself: which questions are cheap to verify, hard to synthesize, and presentable in a form a layperson can answer.

C3 · 2016

Crowd-Sourcing Program Preconditions

D. Fava, D. Shapiro, J. Osborn, M. Schäf, J. Whitehead — ICSE

Players label test inputs. Preconditions are inferred from their consensus. A precondition, in this view, is the boundary a crowd draws around valid.

C4 · 2017

Abduction by Non-Experts

N. Bjørner, D. Jovanović, T. Lepoint, P. Rümmer, M. Schäf — LPAR

Empirical study showing that laypeople can perform abductive reasoning, given an interface that translates the symbolic problem into something familiar.

Side C · Taylor's Version

Listen, Re-recorded

This body of work was about a particular kind of failure. Some verification problems, like finding loop invariants or doing abductive inference, are hard for machines because the search space is large and a wrong guess provides no useful signal for correcting course. The work asked three questions: where is the boundary at which a machine needs help, how should the interaction be structured so it is bearable for the human and useful for the machine, and what kind of feedback can the machine give to keep the human productive.

The boundary moves. A language model can guess invariants quickly, pull context from the source code and documentation, and draw on its pre-training. Many of the iterations a crowd was needed for can now run with a model in the loop. Not all of them. There remain cases where neither the model nor the verifier can make progress and a human is the best signal.

What also changes is the language of the exchange. The 2014 work could only ask a layperson a question expressed as a puzzle, because the underlying formula was unreadable. Now we can ask the question in prose. The human can answer in prose. The machine can read the answer and turn it into a candidate the verifier can check. The interface stops being a translation problem and starts being a conversation.

From the Vault Which verification subproblems still need a human in the loop after the model has tried, and what does an interface look like that lets the human contribute in natural language without losing the verifier's guarantees?

Side D · Measure?

Was Any of This Productive?

Production studies that ask whether the loop converges on anything useful.

D1 · 2023

Long-term Rule Quality via True Negatives

L. Luo, R. Mukherjee, O. Tripp, M. Schäf, Q. Zhou, D. Sanchez — ICSE-SEIP

Methodology for monitoring static-analysis rule quality over time using true-negative measurements. In industrial settings, precision and recall on a fixed benchmark are a comforting lie.

D2 · 2024

Developer–Analyzer Interactions in Code Reviews

B. Cirisci, L. Luo, M. N. Mansur, D. Sanchez, M. Schäf, O. Tripp, M. B. Zafar, Q. Zhou — ASE

An empirical study of how developers respond to analyzer findings on pull requests. What they accept, what they ignore, what they argue with.

Side D · Taylor's Version

Measure, Re-recorded

The bottleneck of the original work was data. Human feedback was slow, expensive, and sparse. Whether a developer accepted or ignored a finding was sometimes the only signal we had, and even that had to be inferred from messy logs. We built methodologies that wrung as much as possible out of small samples.

The constraint relaxes when machines write most of the code and react to most of the tool signal. Every agent run is an experiment. Every accepted or rejected finding is recorded. Controlled experiments become tractable. We can change the feedback the analyzer gives back to the agent and measure whether the next batch of agent output improves. Long-term Rule Quality via True Negatives becomes a per-rule controlled study, run nightly, with a sample size measured in thousands of agent invocations. Developer-Analyzer Interactions in Code Reviews becomes agent-analyzer interactions, with no recall bias, no Hawthorne effect, and a feedback channel that can be tuned and re-measured.

From the Vault What does an analyzer look like when the only readers are agents, and the evaluation loop is fast enough to do real A/B tests on the feedback that the analyzer gives?

Coda

The Bridge: Tracks From the Current Era

One released, two in the studio. The neurosymbolic era, in progress.

Released · ICSE-SEIP 2026

Delta Debugging for LLM-integrated Systems

Zhu, Mansur, Schäf, Chen, Lepoint, Visser

Delta debugging is a 1999 algorithm. It minimizes a failing input until the smallest one that still fails is found. The paper applies it to systems whose failing component is a language model. The four questions still apply. Why did it fail: the minimal witness. How to show it: a printable counter-example. What does the system add back: each minimization step. Was it productive: yes, measurably, on real LLM-integrated systems. Old machinery, new artifact, same conversation.

In the Studio · Vault Track 1

A neurosymbolic primitive for question answering over rule engines

The artifact is not a program. It is a rule engine, of the kind that underlies contracts, tax software, and eligibility checks. The engine answers point queries cleanly. It answers human questions badly, because the human leaves most of the input form blank and the engine cannot say what depends on the missing parts. The paper introduces a neurosymbolic primitive that walks the rule-trigger graph, constructs a small covering set of inputs, and renders a conditional answer of the form "yes if X, no if Y" that re-executes through the engine for validation. The four questions show up together. Side A distills the relevant dependency. Side C accepts the model's hypotheses through a verifier. Side B presents the result in a form a human can read. Side D measures whether the result was correct. The same loop, in a different discipline.

In the Studio · Vault Track 2

Learned eligibility conditions, validated on production billing

Pricing-term oracles produce thousands of false discrepancies on real billing data. The reason is that the eligibility conditions live in source code and side-channel docs, not in the terms themselves. An LLM coding agent learns the missing conditions from the data, validates each one against production billing with zero false positives, and traces it back to its documentation source. If no source exists, the agent flags it as a documentation gap. The methodology is Side D in full bloom. The analyzer is now the agent. The true-negative monitoring carries forward. The productivity question is answered nightly.