Teaching a Parser to Learn from Its Mistakes
Parsers are deterministic. Given the same input, they always make the same decisions — and when a grammar is ambiguous, they always pick the same alternative, right or wrong. What if we could teach a parser to prefer better alternatives, or at least recover from bad ones?
That is the question behind antlr-adaptive: a small library that wraps any ANTLR4-generated parser with learned quality scoring and guided backtracking.
The Problem
ANTLR4 is one of the most widely used parser generators. It takes a grammar file and produces a lexer and parser in your target language. At the core of every generated parser is a method called adaptivePredict — the single point where all grammar ambiguities are resolved. ANTLR computes the best alternative using its ATN (Augmented Transition Network) and returns it. The parser follows it. End of story.
The trouble is that “best” means syntactically consistent with the grammar, not closest to what the user meant. For broken or near-broken inputs, ANTLR’s choice can cascade into a chain of errors that makes recovery harder than it needs to be.
Classic error recovery strategies (panic mode, phrase-level recovery) are built into ANTLR’s runtime. They work well for many cases. But they are grammar-specific, manually tuned, and blind to the history of what has worked before.
Can we do better by learning from examples?
The Idea
ANTLR resolves ambiguities at decision points. Each decision has a small number of alternatives (typically 2–22 for real grammars). At every decision, the parser is in a specific state: it knows which tokens are coming next and which rule invoked this decision.
That state is learnable.
We represent it as a fixed-length feature vector:
- One-hot decision index — which of the grammar’s decisions are we at?
- One-hot lookahead tokens — what are the next 1, 2, 3 tokens?
- Invoking state mod 64 — a coarse encoding of calling context.
For the SQLite grammar (263 decisions), this gives a 694-dimensional vector. Then we apply a simple perceptron:
weight[decision][alt] · features → quality score
After each parse:
- If it succeeded: reward +1.0 — reinforce every decision made.
- If it failed: reward -n_errors / total_tokens — penalise every decision made.
Over many examples, weights for decisions that lead to clean parses grow positive; weights for decisions that lead to errors shrink.
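The update rule above can be sketched in a few lines of plain Python. This is a toy illustration, not the library's actual code: make_features, WeightStore, and the tiny feature width are all invented here for clarity.

```python
N_FEATURES = 16  # toy width; the post uses 694 dimensions for SQLite

def make_features(decision, lookahead_ids, invoking_state):
    # Toy encoding of the three signals described above: decision index,
    # up to 3 lookahead token ids, and invoking state mod 64, folded
    # into one small vector.
    f = [0.0] * N_FEATURES
    f[decision % N_FEATURES] = 1.0
    for tok in lookahead_ids[:3]:
        f[tok % N_FEATURES] += 0.5
    f[(invoking_state % 64) % N_FEATURES] += 0.25
    return f

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class WeightStore:
    def __init__(self, lr=0.1):
        self.w = {}          # (decision, alt) -> weight vector
        self.lr = lr

    def score(self, decision, alt, features):
        return dot(self.w.get((decision, alt), [0.0] * N_FEATURES), features)

    def update(self, trace, reward):
        # trace: list of (decision, alt, features) from one observed parse
        for decision, alt, features in trace:
            w = self.w.setdefault((decision, alt), [0.0] * N_FEATURES)
            for i, x in enumerate(features):
                w[i] += self.lr * reward * x

store = WeightStore()
trace = [(0, 1, make_features(0, [12, 7], 5)),
         (3, 0, make_features(3, [7, 19], 9))]
store.update(trace, reward=+1.0)       # clean parse: reinforce
store.update(trace, reward=-2 / 10)    # failed parse: 2 errors / 10 tokens
quality = sum(store.score(d, a, f) for d, a, f in trace) / len(trace)
```

Because the net reward here is positive (+1.0 followed by -0.2), the decisions in the trace end up with a positive quality score, which is exactly the separation effect described below.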
The key insight: we never override the parser during a parse. Overriding adaptivePredict mid-parse breaks the ATN state machine — one wrong turn and every subsequent decision is operating on a corrupted state. Instead, we observe only, then use the learned weights to evaluate or reorder alternatives in separate re-parse attempts.
Three Parsers
QualityParser — learn by watching
input text
│
▼
┌─────────────────────────────────────────┐
│ ANTLR parse (ObservingATNSimulator) │
│ │
│ adaptivePredict called N times │
│ → always returns ANTLR's choice │
│ → records (decision, alt, features) │
└───────────────────┬─────────────────────┘
│ trace
▼
┌───────────────┐
│ any errors? │
└───┬───────────┘
yes │ │ no
▼ ▼
reward = reward = +1.0
-errors/tokens
│ │
└────┬────┘
▼
┌──────────────────┐
│ WeightStore │
│ update weights │ w += lr * reward * features
│ (perceptron) │
└──────────────────┘
│
▼
quality score
(mean w·f over trace)
QualityParser swaps in an ObservingATNSimulator, a subclass of ANTLR's simulator that records every (decision, alt, features) tuple without changing the result. After each parse it updates the weight store with the perceptron rule.
After training, it can assign a quality score to any parse: the mean weight[decision][alt] · features over all decisions in the trace. Valid inputs consistently score higher than broken ones.
qp = QualityParser(SQLiteLexer, SQLiteParser, store=store)
qp.train(corpus, epochs=10)
result = qp.parse("SELECT * FROM users;")
print(result.score) # e.g. +10.3
result = qp.parse("SELECT FROM;")
print(result.score) # e.g. +8.1
This is useful for ranking multiple parse candidates, detecting suspicious inputs, or as a signal in a larger pipeline.
BeamParser — model-guided re-parse
input text
│
▼
┌──────────────────┐
│ baseline parse │──── trace ────► scan decisions:
└──────────────────┘ model prefers alt X
│ errors? over ANTLR's alt Y?
│ no → return │
│ yes ▼
│ ┌──────────────────────────┐
│ │ top-K suspect decisions │
│ │ sorted by margin score │
│ └────────────┬─────────────┘
│ │
│ ┌──────────────────┼──────────────────┐
│ ▼ ▼ ▼
│ beam 1 beam 2 ... beam K
│ force dec=D alt=X force dec=E alt=X
│ │ │
└─────────┴────────────────────┘
│ all results
▼
┌───────────────┐
│ pick best │ fewest errors → highest score
└───────────────┘
Once the weight store is trained, BeamParser uses it to actively improve parse results on broken inputs.
Algorithm:
- Run baseline (ANTLR’s normal parse), record trace.
- For each decision in the trace, check if the model prefers a different alternative (margin > 0).
- Collect the top-K such decisions, sorted by confidence margin.
- Re-parse with each override forced.
- Return the result with fewest errors (tie-broken by score).
bp = BeamParser(SQLiteLexer, SQLiteParser, store, beam_width=4)
result = bp.parse("SELECT id name FROM users;")
BeamParser is fast (at most beam_width + 1 parses) and targets the decisions the model is most confident about. It works best when the training corpus is representative of the input distribution.
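The suspect-selection step (2 and 3 of the algorithm) can be sketched as follows. This is a hypothetical illustration, not the library's API: top_k_suspects and the trace tuple shape are invented here, and score stands in for the trained weight store.

```python
def top_k_suspects(trace, score, k=4):
    """Find decisions where the model disagrees with ANTLR's choice.

    trace: list of (decision, chosen_alt, n_alts, features) tuples.
    score: callable (decision, alt, features) -> float from the weight store.
    Returns up to k (decision, preferred_alt) pairs, highest margin first.
    """
    suspects = []
    for decision, chosen, n_alts, features in trace:
        best_alt = max(range(n_alts), key=lambda a: score(decision, a, features))
        margin = score(decision, best_alt, features) - score(decision, chosen, features)
        if best_alt != chosen and margin > 0:
            suspects.append((margin, decision, best_alt))
    suspects.sort(reverse=True)
    return [(d, alt) for _, d, alt in suspects[:k]]

# Toy scorer: prefers alt 1 at decision 2, is indifferent elsewhere.
def toy_score(decision, alt, features):
    return 1.0 if (decision, alt) == (2, 1) else 0.0

trace = [(0, 0, 2, None), (2, 0, 3, None), (5, 1, 2, None)]
suspects = top_k_suspects(trace, toy_score)   # only decision 2 is flagged
```

Each suspect then becomes one beam: a re-parse with that single (decision, alt) override forced.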
RetryParser — exhaustive backtracking without training
input text
│
▼
┌──────────────────┐
│ baseline parse │──── trace ──► frontier = baseline
└──────────────────┘
│ errors?
│ no → return
│ yes
▼
┌─────────────────────────────────────────────────┐
│ scan frontier trace from the END │
│ │
│ dec[N] dec[N-1] ... dec[1] dec[0] │
│ ▲ │
│ └── find first decision with untried alt │
└───────────────────┬─────────────────────────────┘
│ override = {dec: next_alt}
▼
┌───────────────┐
│ re-parse │
└───────┬───────┘
│
┌──────────┴──────────┐
▼ ▼
is_better_than(best)? tokens >= frontier?
→ update best → advance frontier
│ │
└──────────┬──────────┘
│
clean? → return
max_retries? → return best
else → repeat
RetryParser does not need a weight store at all. It treats the parse trace as a search space and backtracks through it:
- Baseline parse → record trace.
- Scan the trace from the end (deepest decisions first) for a decision with an untried alternative.
- Re-parse with that decision forced.
- Track the “frontier” — the attempt that consumed the most tokens.
- Repeat from step 2 using the frontier's trace. Stop when clean or max_retries is reached.
Scanning from the end is the key detail. The early decisions in a parse (e.g., which statement type) have many downstream dependencies. Trying alternatives there first cascades into many failures. Starting from the end targets local decisions that are more likely to be independently fixable.
rp = RetryParser(SQLiteLexer, SQLiteParser, max_retries=20)
result = rp.parse("SELECT * FROM t WHERE id = AND name = 'x';")
If weights are available, they are used to order the candidate alternatives — reducing the number of retries needed.
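The scan-from-the-end step at the heart of RetryParser can be sketched like this. The names here (next_override, the trace tuple shape) are illustrative, not the library's actual interface.

```python
def next_override(trace, tried):
    """Find the deepest decision with an untried alternative.

    trace: list of (decision, chosen_alt, n_alts) tuples, in parse order.
    tried: dict mapping trace position -> set of alternatives attempted there
           (mutated across calls so each alternative is tried once).
    Returns (position, decision, alt) to force on the next re-parse,
    or None when the search space is exhausted.
    """
    for i in range(len(trace) - 1, -1, -1):        # scan from the END
        decision, chosen, n_alts = trace[i]
        attempted = tried.setdefault(i, {chosen})
        for alt in range(1, n_alts + 1):           # ANTLR alts are 1-based
            if alt not in attempted:
                attempted.add(alt)
                return i, decision, alt
    return None

trace = [(0, 1, 2), (4, 1, 3), (7, 2, 2)]
tried = {}
first = next_override(trace, tried)   # deepest decision's untried alt first
```

When a weight store is available, the inner loop over alternatives would instead be ordered by model score rather than numerically, which is the weighted variant mentioned above.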
Performance Evaluation
We tested all three parsers on the SQLite grammar using 23 valid queries and 8 deliberately broken ones. The broken queries cover common mistakes: missing table names, missing column lists, truncated WHERE clauses, and missing VALUES targets.
QualityParser — score separation
After 10 training epochs on the mixed corpus:
avg quality score
valid : +10.34
broken : +8.12
──────────────
sep : +2.22
Separation grows steadily with more epochs and more diverse training data. It is not a pass/fail signal on its own — it is a ranking signal. Given two parse candidates, the one with the higher score is more likely to be grammatically well-formed.
BeamParser — guided repair
Tested with beam_width=4 on the 8 broken queries after 10 training epochs:
| Metric | Result |
|---|---|
| Queries where beam found a lower-error parse | 0 / 8 |
| Queries where beam found a higher-score result | 2 / 8 |
| Avg beams tried per broken query | 3.1 |
The 0/8 error-reduction result is expected: all 8 queries are genuinely broken — there is no alternative decision sequence in the grammar that makes SELECT FROM; valid. What beam search does find in 2 cases is a higher-scoring partial parse, which is useful for ranking.
BeamParser shines on near-valid inputs where a single wrong alternative cascades into multiple errors — for example SELECT id name FROM users; (missing comma between columns), where forcing a different alt at the right decision can recover a clean parse.
RetryParser — backtracking depth
Tested with max_retries=20, no weights:
| Metric | Result |
|---|---|
| Valid queries solved first try | 23 / 23 |
| Broken queries reaching 0 errors | 0 / 8 |
| Avg retries on broken queries | 18.4 / 20 |
| Tokens consumed (broken, baseline) | ~2.1 avg |
| Tokens consumed (broken, best retry) | ~3.4 avg |
Retry exhausts its budget on genuinely broken input but still improves how far the parse gets — consuming more tokens before failing, which means better partial structure. With weights supplied, the same budget produces deeper exploration because candidate alternatives are ordered by model confidence rather than numerically.
Side-by-side summary
| | QualityParser | BeamParser | RetryParser |
|---|---|---|---|
| Training required | yes (self-trains) | yes (pre-trained) | no |
| Re-parses per query | 1 | ≤ beam_width + 1 | ≤ max_retries + 1 |
| Fixes broken input | never | sometimes | sometimes |
| Score separation | yes | yes (uses weights) | only with weights |
| Best for | ranking / detection | fast guided repair | no-training repair |
| Limitation | scores only | needs representative corpus | exhaustive; slow on complex grammars |
The most effective single choice is RetryParser with weights: it uses the learned alt ordering from QualityParser to reduce retries needed, but does not depend on the weights for correctness — if the weights are wrong, it still finds the right parse, just slower.
Grammar-Agnostic by Design
Nothing in the library is SQLite-specific. The three parsers accept any ANTLR4-generated lexer and parser class:
from antlr_adaptive import RetryParser
from MyLexer import MyLexer
from MyParser import MyParser
rp = RetryParser(MyLexer, MyParser, start_rule="compilationUnit")
result = rp.parse(source_code)
The only grammar-specific constant is N_DECISIONS in _features.py, which should be set to slightly above len(parser.atn.decisionToState) for your grammar. The SQLite value (270, covering 263 actual decisions) is the default; it works as-is for any grammar with fewer decisions.
This means you can apply the same approach to SQL dialects, programming languages, configuration formats, or domain-specific languages — any grammar where ANTLR4 can generate a Python parser.
Testing
The test suite covers all three parsers against the SQLite grammar:
uv run pytest tests/ -v
Key checks:
- Valid SQL parses clean (result.ok == True) across all three parsers.
- Broken SQL returns a ParseResult (no crash; graceful degradation).
- QualityParser score separation: valid average > broken average after 3 training epochs.
- WeightStore save/load roundtrip preserves exact weight values.
- ParseResult.is_better_than ranking: fewer errors > more tokens consumed > higher score.
19 tests, all passing in under 0.2 seconds.
More
Source code, examples, and grammar files: github.com/locchh/antlr-adaptive
The SQLite demo runs all three parsers end-to-end:
uv run python examples/sqlite_demo.py all