Teaching a Parser to Learn from Its Mistakes
Parsers are deterministic. Given the same input, they always make the same decisions — and when a grammar is ambiguous, they always pick the same alternative, right or wrong. What if we could teach a parser to prefer better alternatives, or at least recover from bad ones?
That is the question behind antlr-adaptive: a small library that wraps any ANTLR4-generated parser with learned quality scoring and guided backtracking.
The Problem
ANTLR4 is one of the most widely used parser generators. It takes a grammar file and produces a lexer and parser in your target language. At the core of every generated parser is a method called adaptivePredict — the single point where all grammar ambiguities are resolved. ANTLR computes the best alternative using its ATN (Augmented Transition Network) and returns it. The parser follows it. End of story.
The trouble is that “best” means syntactically consistent with the grammar, not closest to what the user meant. For broken or near-broken inputs, ANTLR’s choice can cascade into a chain of errors that makes recovery harder than it needs to be.
Classic error recovery strategies (panic mode, phrase-level recovery) are built into ANTLR’s runtime. They work well for many cases. But they are grammar-specific, manually tuned, and blind to the history of what has worked before.
Can we do better by learning from examples?
The Idea
ANTLR resolves ambiguities at decision points. Each decision has a small number of alternatives (typically 2–22 for real grammars). At every decision, the parser is in a specific state: it knows which tokens are coming next and which rule invoked this decision.
That state is learnable.
We represent it as a fixed-length feature vector:
- One-hot decision index — which of the grammar’s decisions are we at?
- One-hot lookahead tokens — what are the next 1, 2, 3 tokens?
- Invoking state mod 64 — a coarse encoding of calling context.
For the SQLite grammar (263 decisions), this gives a 694-dimensional vector. Then we apply a simple perceptron:
weight[decision][alt] · features → quality score
After each parse:
- If it succeeded: reward +1.0 — reinforce every decision made.
- If it failed: reward -n_errors / total_tokens — penalise every decision made.
Over many examples, weights for decisions that lead to clean parses grow positive; weights for decisions that lead to errors shrink.
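The update rule above can be sketched in a few lines of plain Python. This is a toy illustration, not the library's actual code: make_features, WeightStore, and the tiny feature width are all invented here for clarity.

```python
N_FEATURES = 16  # toy width; the post uses 694 dimensions for SQLite

def make_features(decision, lookahead_ids, invoking_state):
    # Toy encoding of the three signals described above: decision index,
    # up to 3 lookahead token ids, and invoking state mod 64, folded
    # into one small vector.
    f = [0.0] * N_FEATURES
    f[decision % N_FEATURES] = 1.0
    for tok in lookahead_ids[:3]:
        f[tok % N_FEATURES] += 0.5
    f[(invoking_state % 64) % N_FEATURES] += 0.25
    return f

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class WeightStore:
    def __init__(self, lr=0.1):
        self.w = {}          # (decision, alt) -> weight vector
        self.lr = lr

    def score(self, decision, alt, features):
        return dot(self.w.get((decision, alt), [0.0] * N_FEATURES), features)

    def update(self, trace, reward):
        # trace: list of (decision, alt, features) from one observed parse
        for decision, alt, features in trace:
            w = self.w.setdefault((decision, alt), [0.0] * N_FEATURES)
            for i, x in enumerate(features):
                w[i] += self.lr * reward * x

store = WeightStore()
trace = [(0, 1, make_features(0, [12, 7], 5)),
         (3, 0, make_features(3, [7, 19], 9))]
store.update(trace, reward=+1.0)       # clean parse: reinforce
store.update(trace, reward=-2 / 10)    # failed parse: 2 errors / 10 tokens
quality = sum(store.score(d, a, f) for d, a, f in trace) / len(trace)
```

Because the net reward here is positive (+1.0 followed by -0.2), the decisions in the trace end up with a positive quality score, which is exactly the separation effect described below.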
The key insight: we never override the parser during a parse. Overriding adaptivePredict mid-parse breaks the ATN state machine — one wrong turn and every subsequent decision is operating on a corrupted state. Instead, we observe only, then use the learned weights to evaluate or reorder alternatives in separate re-parse attempts.
Three Parsers
QualityParser — learn by watching
input text
│
▼
┌─────────────────────────────────────────┐
│ ANTLR parse (ObservingATNSimulator) │
│ │
│ adaptivePredict called N times │
│ → always returns ANTLR's choice │
│ → records (decision, alt, features) │
└───────────────────┬─────────────────────┘
│ trace
▼
┌───────────────┐
│ any errors? │
└───┬───────────┘
yes │ │ no
▼ ▼
reward = reward = +1.0
-errors/tokens
│ │
└────┬────┘
▼
┌──────────────────┐
│ WeightStore │
│ update weights │ w += lr * reward * features
│ (perceptron) │
└──────────────────┘
│
▼
quality score
(mean w·f over trace)
QualityParser swaps in an ObservingATNSimulator, a subclass of ANTLR's simulator that records every (decision, alt, features) tuple without changing the result. After each parse it updates the weight store with the perceptron rule.
After training, it can assign a quality score to any parse: the mean weight[decision][alt] · features over all decisions in the trace. Valid inputs consistently score higher than broken ones.
qp = QualityParser(SQLiteLexer, SQLiteParser, store=store)
qp.train(corpus, epochs=10)
result = qp.parse("SELECT * FROM users;")
print(result.score) # e.g. +10.3
result = qp.parse("SELECT FROM;")
print(result.score) # e.g. +8.1
This is useful for ranking multiple parse candidates, detecting suspicious inputs, or as a signal in a larger pipeline.
BeamParser — model-guided re-parse
input text
│
▼
┌──────────────────┐
│ baseline parse │──── trace ────► scan decisions:
└──────────────────┘ model prefers alt X
│ errors? over ANTLR's alt Y?
│ no → return │
│ yes ▼
│ ┌──────────────────────────┐
│ │ top-K suspect decisions │
│ │ sorted by margin score │
│ └────────────┬─────────────┘
│ │
│ ┌──────────────────┼──────────────────┐
│ ▼ ▼ ▼
│ beam 1 beam 2 ... beam K
│ force dec=D alt=X force dec=E alt=X
│ │ │
└─────────┴────────────────────┘
│ all results
▼
┌───────────────┐
│ pick best │ fewest errors → highest score
└───────────────┘
Once the weight store is trained, BeamParser uses it to actively improve parse results on broken inputs.
Algorithm:
- Run baseline (ANTLR’s normal parse), record trace.
- For each decision in the trace, check if the model prefers a different alternative (margin > 0).
- Collect the top-K such decisions, sorted by confidence margin.
- Re-parse with each override forced.
- Return the result with fewest errors (tie-broken by score).
bp = BeamParser(SQLiteLexer, SQLiteParser, store, beam_width=4)
result = bp.parse("SELECT id name FROM users;")
BeamParser is fast (at most beam_width + 1 parses) and targets the decisions the model is most confident about. It works best when the training corpus is representative of the input distribution.
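The suspect-selection step (2 and 3 of the algorithm) can be sketched as follows. This is a hypothetical illustration, not the library's API: top_k_suspects and the trace tuple shape are invented here, and score stands in for the trained weight store.

```python
def top_k_suspects(trace, score, k=4):
    """Find decisions where the model disagrees with ANTLR's choice.

    trace: list of (decision, chosen_alt, n_alts, features) tuples.
    score: callable (decision, alt, features) -> float from the weight store.
    Returns up to k (decision, preferred_alt) pairs, highest margin first.
    """
    suspects = []
    for decision, chosen, n_alts, features in trace:
        best_alt = max(range(n_alts), key=lambda a: score(decision, a, features))
        margin = score(decision, best_alt, features) - score(decision, chosen, features)
        if best_alt != chosen and margin > 0:
            suspects.append((margin, decision, best_alt))
    suspects.sort(reverse=True)
    return [(d, alt) for _, d, alt in suspects[:k]]

# Toy scorer: prefers alt 1 at decision 2, is indifferent elsewhere.
def toy_score(decision, alt, features):
    return 1.0 if (decision, alt) == (2, 1) else 0.0

trace = [(0, 0, 2, None), (2, 0, 3, None), (5, 1, 2, None)]
suspects = top_k_suspects(trace, toy_score)   # only decision 2 is flagged
```

Each suspect then becomes one beam: a re-parse with that single (decision, alt) override forced.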
RetryParser — exhaustive backtracking without training
input text
│
▼
┌──────────────────┐
│ baseline parse │──── trace ──► frontier = baseline
└──────────────────┘
│ errors?
│ no → return
│ yes
▼
┌─────────────────────────────────────────────────┐
│ scan frontier trace from the END │
│ │
│ dec[N] dec[N-1] ... dec[1] dec[0] │
│ ▲ │
│ └── find first decision with untried alt │
└───────────────────┬─────────────────────────────┘
│ override = {dec: next_alt}
▼
┌───────────────┐
│ re-parse │
└───────┬───────┘
│
┌──────────┴──────────┐
▼ ▼
is_better_than(best)? tokens >= frontier?
→ update best → advance frontier
│ │
└──────────┬──────────┘
│
clean? → return
max_retries? → return best
else → repeat
RetryParser does not need a weight store at all. It treats the parse trace as a search space and backtracks through it:
- Baseline parse → record trace.
- Scan the trace from the end (deepest decisions first) for a decision with an untried alternative.
- Re-parse with that decision forced.
- Track the “frontier” — the attempt that consumed the most tokens.
- Repeat from step 2 using the frontier's trace. Stop when clean or max_retries is reached.
Scanning from the end is the key detail. The early decisions in a parse (e.g., which statement type) have many downstream dependencies. Trying alternatives there first cascades into many failures. Starting from the end targets local decisions that are more likely to be independently fixable.
rp = RetryParser(SQLiteLexer, SQLiteParser, max_retries=20)
result = rp.parse("SELECT * FROM t WHERE id = AND name = 'x';")
If weights are available, they are used to order the candidate alternatives — reducing the number of retries needed.
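The scan-from-the-end step at the heart of RetryParser can be sketched like this. The names here (next_override, the trace tuple shape) are illustrative, not the library's actual interface.

```python
def next_override(trace, tried):
    """Find the deepest decision with an untried alternative.

    trace: list of (decision, chosen_alt, n_alts) tuples, in parse order.
    tried: dict mapping trace position -> set of alternatives attempted there
           (mutated across calls so each alternative is tried once).
    Returns (position, decision, alt) to force on the next re-parse,
    or None when the search space is exhausted.
    """
    for i in range(len(trace) - 1, -1, -1):        # scan from the END
        decision, chosen, n_alts = trace[i]
        attempted = tried.setdefault(i, {chosen})
        for alt in range(1, n_alts + 1):           # ANTLR alts are 1-based
            if alt not in attempted:
                attempted.add(alt)
                return i, decision, alt
    return None

trace = [(0, 1, 2), (4, 1, 3), (7, 2, 2)]
tried = {}
first = next_override(trace, tried)   # deepest decision's untried alt first
```

When a weight store is available, the inner loop over alternatives would instead be ordered by model score rather than numerically, which is the weighted variant mentioned above.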
Performance Evaluation
We tested all three parsers on the SQLite grammar using 23 valid queries and 8 deliberately broken ones. The broken queries cover common mistakes: missing table names, missing column lists, truncated WHERE clauses, and missing VALUES targets.
QualityParser — score separation
After 10 training epochs on the mixed corpus:
avg quality score
valid : +10.34
broken : +8.12
──────────────
sep : +2.22
Separation grows steadily with more epochs and more diverse training data. It is not a pass/fail signal on its own — it is a ranking signal. Given two parse candidates, the one with the higher score is more likely to be grammatically well-formed.
BeamParser — guided repair
Tested with beam_width=4 on the 8 broken queries after 10 training epochs:
| Metric | Result |
|---|---|
| Queries where beam found a lower-error parse | 0 / 8 |
| Queries where beam found a higher-score result | 2 / 8 |
| Avg beams tried per broken query | 3.1 |
The 0/8 error-reduction result is expected: all 8 queries are genuinely broken — there is no alternative decision sequence in the grammar that makes SELECT FROM; valid. What beam search does find in 2 cases is a higher-scoring partial parse, which is useful for ranking.
BeamParser shines on near-valid inputs where a single wrong alternative cascades into multiple errors — for example SELECT id name FROM users; (missing comma between columns), where forcing a different alt at the right decision can recover a clean parse.
RetryParser — backtracking depth
Tested with max_retries=20, no weights:
| Metric | Result |
|---|---|
| Valid queries solved first try | 23 / 23 |
| Broken queries reaching 0 errors | 0 / 8 |
| Avg retries on broken queries | 18.4 / 20 |
| Tokens consumed (broken, baseline) | ~2.1 avg |
| Tokens consumed (broken, best retry) | ~3.4 avg |
Retry exhausts its budget on genuinely broken input but still improves how far the parse gets — consuming more tokens before failing, which means better partial structure. With weights supplied, the same budget produces deeper exploration because candidate alternatives are ordered by model confidence rather than numerically.
Side-by-side summary
| | QualityParser | BeamParser | RetryParser |
|---|---|---|---|
| Training required | yes (self-trains) | yes (pre-trained) | no |
| Re-parses per query | 1 | ≤ beam_width + 1 | ≤ max_retries + 1 |
| Fixes broken input | never | sometimes | sometimes |
| Score separation | yes | yes (uses weights) | only with weights |
| Best for | ranking / detection | fast guided repair | no-training repair |
| Limitation | scores only | needs representative corpus | exhaustive; slow on complex grammars |
The most effective single choice is RetryParser with weights: it uses the learned alt ordering from QualityParser to reduce retries needed, but does not depend on the weights for correctness — if the weights are wrong, it still finds the right parse, just slower.
Grammar-Agnostic by Design
Nothing in the library is SQLite-specific. The three parsers accept any ANTLR4-generated lexer and parser class:
from antlr_adaptive import RetryParser
from MyLexer import MyLexer
from MyParser import MyParser
rp = RetryParser(MyLexer, MyParser, start_rule="compilationUnit")
result = rp.parse(source_code)
The only grammar-specific constant is N_DECISIONS in _features.py, which should be set to slightly above len(parser.atn.decisionToState) for your grammar. The SQLite value (270, covering 263 actual decisions) is the default; it works as-is for any grammar with fewer decisions.
This means you can apply the same approach to SQL dialects, programming languages, configuration formats, or domain-specific languages — any grammar where ANTLR4 can generate a Python parser.
Testing
The test suite covers all three parsers against the SQLite grammar:
uv run pytest tests/ -v
Key checks:
- Valid SQL parses clean (result.ok == True) across all three parsers.
- Broken SQL returns a ParseResult (no crash; graceful degradation).
- QualityParser score separation: valid average > broken average after 3 training epochs.
- WeightStore save/load roundtrip preserves exact weight values.
- ParseResult.is_better_than ranking: fewer errors > more tokens consumed > higher score.
19 tests, all passing in under 0.2 seconds.
More
Source code, examples, and grammar files: github.com/locchh/antlr-adaptive
The SQLite demo runs all three parsers end-to-end:
uv run python examples/sqlite_demo.py all