From Prototype to Precision: A Decision Logging Threshold (Decision Provenance, Part 2)

Author: Yauheni Kurbayeu
Published: Apr 21, 2026

In Part 1, I validated a simple but important claim: agents can self-identify meaningful decisions, log them with context, and later reuse those logs as soft priors.

That prototype became convincing by the end of the split-agent scenario, but it still left one unsolved question:

How does the system decide what is meaningful enough to persist, without turning the decision log into noise?

This article is the continuation of that exploration: the Decision Logging Threshold Policy.

It is a deterministic gate that sits between “an agent made a choice” and “we persist it as provenance.” It turns the fuzzy phrase “meaningful-ish” into an explicit, testable contract.

In the W3C PROV Data Model (PROV-DM), provenance is described as information about what produced something (entities, activities, and responsible parties), and it is explicitly framed as a way to assess trustworthiness.

The work described here is implemented and evidenced in this repository via:

Canonical runtime contract: .github/skills/decision-logging-threshold/SKILL.md
Narrative rationale: threshold/decision-logging-threshold-policy.md
Validation scenario: threshold/DECISION-LOGGING-THRESHOLD-TEST-SCENARIO.md
Test results report: threshold/DECISION-LOGGING-THRESHOLD-TEST-RESULT.md
Saved run artifacts (Mode A/B/C): decisions-logs-mode-a/, decisions-logs-mode-b/, decisions-logs-mode-c/

Why a threshold gate was necessary

Decision provenance is only valuable if it preserves signal.

If everything becomes provenance, nothing is provenance. A log full of “I chose this wording” and “I ran that command” is not memory; it is noise.

But the opposite failure is just as real:

decisions get bundled (“two real choices” becomes “one vague entry”)
decisions get skipped (“the output looks right, but the log is missing the why”)
reruns produce inconsistent behavior (“sometimes we re-log, sometimes we don’t”)

In Part 1, I treated meaningful decisions as a concept defined by the decision-log skill. That was sufficient to prove the idea works.

However, the moment you want to make provenance auditable, you need a narrower and more operational question:

Is this candidate decision major enough that it should be persisted as durable memory?

That is exactly what the threshold policy answers.

Where the threshold sits in the agentic flow

The threshold is not a new logger format. It is a gate before persistence.

At a high level, the updated provenance flow is:

main-provenance reviews context and prior decisions (soft priors).
user-task-executor produces the task artifact and an execution summary.
main-provenance normalizes explicit candidate decisions (split first).
main-provenance scores each candidate with decision-logging-threshold.
Only threshold-qualified candidates are forwarded to provenance-log.
provenance-log appends strict entries using the decision-log schema (and defensively re-checks the threshold if needed).

In other words:

the logger is strict about how we persist decisions
the threshold gate is strict about whether we persist them
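
The separation between the gate and the logger can be sketched in a few lines of Python. This is a minimal illustration, not the repository's implementation: `persist_qualified`, `toy_split`, and `toy_gate` are hypothetical names, and the toy rules only stand in for the real normalization and scoring steps.

```python
from typing import Callable


def persist_qualified(
    raw_candidates: list[str],
    split: Callable[[str], list[str]],
    passes_threshold: Callable[[str], bool],
) -> tuple[list[str], list[str]]:
    """Split first, then gate each candidate. Returns (forwarded, suppressed)."""
    forwarded: list[str] = []
    suppressed: list[str] = []
    for raw in raw_candidates:
        for candidate in split(raw):
            if passes_threshold(candidate):
                forwarded.append(candidate)  # would go on to provenance-log
            else:
                suppressed.append(candidate)  # never reaches the logger
    return forwarded, suppressed


# Toy stand-ins (assumptions for illustration only).
def toy_split(raw: str) -> list[str]:
    return [part.strip() for part in raw.split(";") if part.strip()]


def toy_gate(candidate: str) -> bool:
    return candidate.startswith("chose")


forwarded, suppressed = persist_qualified(
    ["chose a staged rollout over a single release; renamed a local variable"],
    toy_split,
    toy_gate,
)
print(forwarded)   # ['chose a staged rollout over a single release']
print(suppressed)  # ['renamed a local variable']
```

The logger never sees the suppressed candidate, which is the point of placing the gate before persistence.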

The policy itself (as a deterministic contract)

The threshold policy is intentionally simple and deterministic.

It does not try to compute “importance” from scratch. It uses a lightweight scoring model to approximate a more practical concept:

How valuable will it be to remember this decision later?

The contract in .github/skills/decision-logging-threshold/SKILL.md applies the same steps to every candidate:

1) Normalize the candidate (split first)

The threshold is applied to one candidate decision at a time.

If a task contains multiple explicit choices, they must be split before scoring. This inherits one of the clearest lessons from Part 1: multi-decision provenance fails when boundaries collapse.

2) Mandatory overrides (always log)

Some decisions must be logged regardless of score:

policy / compliance boundaries (security, privacy, legal, financial)
human interaction boundaries (override, escalation, explicit approval/rejection)

This is the “accountability first” rule: these are not optional memory.

3) Explicit non-decision suppression (never log)

If no override applies, the policy suppresses clear non-decisions:

deterministic transforms (formatting, parsing, mapping)
retrieval-only steps with no reasoning
mechanical execution steps
repeated micro-decisions inside an already-chosen strategy

This is the core anti-noise defense.
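
Steps 2 and 3 imply a precedence order: overrides are checked first, and suppression only applies when no override is present. A minimal sketch of that ordering follows; the `Verdict` labels and the `pre_score_verdict` function are hypothetical names, and how a candidate is recognized as a non-decision is left as a boolean input here.

```python
from enum import Enum


class Verdict(Enum):
    LOG = "log"            # mandatory override: always log
    SUPPRESS = "suppress"  # explicit non-decision: never log
    SCORE = "score"        # continue to the usefulness gate and scoring


def pre_score_verdict(mandatory_overrides: list[str], is_non_decision: bool) -> Verdict:
    """Apply steps 2 and 3 in order: overrides win, then suppression."""
    if mandatory_overrides:
        return Verdict.LOG
    if is_non_decision:
        return Verdict.SUPPRESS
    return Verdict.SCORE


print(pre_score_verdict(["privacy-boundary"], is_non_decision=True))  # Verdict.LOG
print(pre_score_verdict([], is_non_decision=True))                    # Verdict.SUPPRESS
print(pre_score_verdict([], is_non_decision=False))                   # Verdict.SCORE
```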

4) Usefulness gate (memory value filter)

Even if a candidate was discussed at length, it should not be logged if it is not useful later.

The usefulness_gate asks whether at least one of these is true:

it influences future reasoning or later workflow steps
it encodes a non-trivial trade-off worth auditing later
it cannot be reconstructed from the final output alone
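
Because the gate is an “at least one of” test, it reduces to a logical OR over three signals. The sketch below assumes the signals have already been judged as booleans; the field names are hypothetical paraphrases of the three conditions, not identifiers from the skill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UsefulnessSignals:
    influences_future_reasoning: bool
    encodes_nontrivial_tradeoff: bool
    not_reconstructable_from_output: bool


def usefulness_gate(signals: UsefulnessSignals) -> bool:
    """True when at least one usefulness condition holds."""
    return any(
        (
            signals.influences_future_reasoning,
            signals.encodes_nontrivial_tradeoff,
            signals.not_reconstructable_from_output,
        )
    )


print(usefulness_gate(UsefulnessSignals(False, True, False)))   # True
print(usefulness_gate(UsefulnessSignals(False, False, False)))  # False
```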

5) Score five dimensions (0–2 each)

If a candidate is not suppressed and is useful, it is scored across five dimensions:

impact_radius
reversibility
uncertainty
tradeoff_intensity
longevity

The final score is the sum (0–10), and the persistence threshold is >= 5.

The scoring rubric is intentionally coarse:

| Dimension | `0` | `1` | `2` |
| --- | --- | --- | --- |
| `impact_radius` | Local to one step or narrow fragment | Affects multiple steps/files/modules/sections | Affects architecture, workflow/release posture, safety/compliance posture, or broad downstream outcomes |
| `reversibility` | Easy to undo with negligible cost | Reversible with noticeable rework/coordination | Hard to undo or externally consequential |
| `uncertainty` | Deterministic or clearly prescribed | Bounded judgment required | High ambiguity, incomplete info, or conflicting evidence |
| `tradeoff_intensity` | No real viable alternative | Alternatives exist, limited consequences | Competing options with materially different consequences |
| `longevity` | Ephemeral | Matters for the rest of the task / near-term | Persistent, reusable, or policy-shaping |
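
One way to keep scores inside the rubric is to validate them at construction time. The sketch below is an assumption about how such a record could look in Python; `MetricScores` is a hypothetical class, while the five field names and the 0–2 range come from the rubric above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class MetricScores:
    impact_radius: int
    reversibility: int
    uncertainty: int
    tradeoff_intensity: int
    longevity: int

    def __post_init__(self) -> None:
        # Reject anything outside the coarse 0/1/2 rubric (including booleans).
        for f in fields(self):
            value = getattr(self, f.name)
            if isinstance(value, bool) or value not in (0, 1, 2):
                raise ValueError(f"{f.name} must be 0, 1, or 2, got {value!r}")

    @property
    def decision_score(self) -> int:
        """Sum of the five dimensions, always in the range 0-10."""
        return sum(getattr(self, f.name) for f in fields(self))


example = MetricScores(
    impact_radius=2, reversibility=1, uncertainty=1, tradeoff_intensity=2, longevity=2
)
print(example.decision_score)  # 8
```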

Final gate (the only rule that decides persistence)

In the skill, the final decision is explicitly defined as:

override_applied = len(mandatory_overrides) > 0

should_log =
  override_applied
  OR (
    usefulness_gate
    AND decision_score >= 5
  )

This is the whole point of the policy: when the log/no-log rule is explicit like this, it becomes testable.
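
As a sketch of that testability, the rule translates almost line for line into Python. The function signature and the `threshold` parameter are assumptions for illustration; the logic mirrors the rule quoted above.

```python
def should_log(
    mandatory_overrides: list[str],
    usefulness_gate: bool,
    decision_score: int,
    threshold: int = 5,
) -> bool:
    """Final gate: an override always logs; otherwise require usefulness and score."""
    override_applied = len(mandatory_overrides) > 0
    return override_applied or (usefulness_gate and decision_score >= threshold)


print(should_log([], True, 8))                    # True  (score-based positive)
print(should_log(["privacy-boundary"], True, 1))  # True  (override)
print(should_log([], True, 4))                    # False (just below threshold)
print(should_log([], False, 9))                   # False (fails usefulness gate)
```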

What the score ranges mean (in practice)

The threshold is a pragmatic significance heuristic, not an industry standard.

In practice, the score bands help you build intuition for what the gate is doing:

0–2: almost certainly execution noise; do not log unless an override applies
3–4: near-threshold; usually still do not log, and re-check candidate splitting and scoring honesty
5–10: likely durable; log if usefulness_gate is true

If something feels important but lands at 3–4, that is usually a signal to:

check whether multiple decisions were bundled into one candidate
verify that each metric score is still honest
use an explicit mandatory_overrides reason (rather than inflating metric scores) when the value is governance/accountability traceability
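
The bands can be expressed as a small lookup, which is handy when reviewing saved runs. This is a sketch only: the band labels are hypothetical shorthand for the three ranges above, and the bands are an aid to intuition, not part of the final gate.

```python
def score_band(decision_score: int) -> str:
    """Map a 0-10 decision score to its intuition band."""
    if not 0 <= decision_score <= 10:
        raise ValueError(f"decision_score must be in 0..10, got {decision_score}")
    if decision_score <= 2:
        return "execution-noise"
    if decision_score <= 4:
        return "near-threshold"
    return "likely-durable"


print([score_band(s) for s in (1, 4, 5)])
# ['execution-noise', 'near-threshold', 'likely-durable']
```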

A consistent scoring workflow

To keep scoring stable across runs (and easier to audit):

Score each metric independently.
Use the smallest number that is still honest.
Write one short sentence explaining each non-zero score.
Apply overrides as a separate gate (do not treat them as “extra points”).
Compute decision_score, then apply the final should_log rule.
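
The “one short sentence per non-zero score” rule is easy to check mechanically. The helper below is a hypothetical sketch: it assumes scores and explanations are kept in two dictionaries keyed by metric name, which is not a format the skill prescribes.

```python
METRICS = (
    "impact_radius",
    "reversibility",
    "uncertainty",
    "tradeoff_intensity",
    "longevity",
)


def missing_rationales(scores: dict[str, int], rationales: dict[str, str]) -> list[str]:
    """Return metrics that have a non-zero score but no written explanation."""
    return [
        metric
        for metric in METRICS
        if scores.get(metric, 0) > 0 and not rationales.get(metric, "").strip()
    ]


scores = {"impact_radius": 2, "reversibility": 1, "uncertainty": 0,
          "tradeoff_intensity": 0, "longevity": 0}
rationales = {"impact_radius": "Changes the release workflow for every module."}
print(missing_rationales(scores, rationales))  # ['reversibility']
```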

Example threshold_check outputs

Example 1: high-score trade-off (log)

impact_radius: 2
reversibility: 1
uncertainty: 1
tradeoff_intensity: 2
longevity: 2
decision_score: 8
mandatory_overrides: []
override_applied: false
usefulness_gate: true
should_log: true
rationale: Long-lived trade-off with broad downstream impact.

Example 2: low numeric score, but governance boundary (log via override)

impact_radius: 0
reversibility: 1
uncertainty: 0
tradeoff_intensity: 0
longevity: 0
decision_score: 1
mandatory_overrides:
  - privacy-boundary
override_applied: true
usefulness_gate: true
should_log: true
rationale: A privacy-governed exception was applied and must remain auditable.

Common mistakes this section is meant to prevent:

inflating metric scores instead of using mandatory_overrides
setting override_applied: true without naming the override reason
treating routine human-facing chatter as an override when no accountability boundary actually changed
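
A saved threshold_check can be audited for internal consistency with a few comparisons. The sketch below assumes the record has been loaded into a Python dictionary with the field names used in the examples above; `check_consistency` is a hypothetical helper, not part of the repository.

```python
METRICS = ("impact_radius", "reversibility", "uncertainty", "tradeoff_intensity", "longevity")
THRESHOLD = 5


def check_consistency(tc: dict) -> list[str]:
    """Return a list of problems found in one threshold_check record."""
    problems = []
    expected_score = sum(tc[m] for m in METRICS)
    if tc["decision_score"] != expected_score:
        problems.append(f"decision_score should be {expected_score}")
    expected_override = len(tc["mandatory_overrides"]) > 0
    if tc["override_applied"] != expected_override:
        problems.append("override_applied does not match mandatory_overrides")
    expected_log = expected_override or (
        tc["usefulness_gate"] and expected_score >= THRESHOLD
    )
    if tc["should_log"] != expected_log:
        problems.append(f"should_log should be {expected_log}")
    return problems


example_1 = dict(impact_radius=2, reversibility=1, uncertainty=1, tradeoff_intensity=2,
                 longevity=2, decision_score=8, mandatory_overrides=[],
                 override_applied=False, usefulness_gate=True, should_log=True)
example_2 = dict(impact_radius=0, reversibility=1, uncertainty=0, tradeoff_intensity=0,
                 longevity=0, decision_score=1, mandatory_overrides=["privacy-boundary"],
                 override_applied=True, usefulness_gate=True, should_log=True)
# Second common mistake: an override flag with no named reason.
unnamed_override = {**example_2, "mandatory_overrides": []}

print(check_consistency(example_1))         # []
print(check_consistency(example_2))         # []
print(check_consistency(unnamed_override))
# ['override_applied does not match mandatory_overrides', 'should_log should be False']
```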

Why these five metrics?

The threshold model is not meant to be a perfect theory of decision value. It is meant to be a good-enough proxy for the kinds of decisions that become valuable provenance:

decisions with broad consequences
decisions that are hard to undo
decisions made under uncertainty
decisions that encode real trade-offs
decisions with long-lived effects

The five dimensions were chosen because they approximate those properties in a way that is:

simple enough to apply in real-time
expressive enough to separate “inflection point” from “execution chatter”
stable enough to be validated by a deterministic test suite

Below is the theoretical intuition behind each dimension, and why it matters for provenance memory.

1) `impact_radius`: how far the consequences propagate

Impact radius is the simplest “scope of consequence” signal.

It answers questions like:

Is this local to one step, or does it shape the whole deliverable?
Does it affect one file, multiple files, or the system’s architecture / posture?

This dimension is closely related to the engineering concept of blast radius: how far a change or failure spreads.

In the AWS Well-Architected Framework, one design principle recommends “frequent, small, reversible changes,” explicitly linking smaller changes to reduced blast radius and faster reversal. That is the same intuition: broader impact deserves more care, and should be easier to audit later.
Reference: AWS Well-Architected Framework – Operational Excellence design principles (see “Make frequent, small, reversible changes”).

In provenance terms:

low impact radius decisions rarely justify durable memory
high impact radius decisions are exactly the ones teams revisit and relitigate

2) `reversibility`: whether the decision is a one-way door

Reversibility is one of the strongest predictors of “decision importance” in practice.

It is also one of the most widely shared organizational heuristics:

reversible decisions should be made quickly and iterated
irreversible (or nearly irreversible) decisions should be made slowly and carefully

Jeff Bezos popularized this as the “one-way door / two-way door” framing in an Amazon shareholder letter, distinguishing consequential irreversible decisions from reversible ones.
Reference: Jeff Bezos, Letter to Shareholders (Amazon), Exhibit 99.1 (SEC).

In provenance terms, reversibility matters because irreversible decisions carry higher future audit value:

if you can’t easily undo it, you’ll want to know why you did it
if you can undo it cheaply, you can often treat it as an experiment rather than history

3) `uncertainty`: how much ambiguity the decision must absorb

Uncertainty is where “memory” becomes particularly valuable.

If a decision is deterministic (there is a clear correct answer), provenance adds less value. You can usually reconstruct the choice from the output and the rules.

But if the decision was made under:

incomplete information
conflicting constraints
ambiguous signals

then a future reader needs the stored context to understand why the chosen trade-off was reasonable at the time.

One way to frame this theoretically is via value of information: information has value because it can reduce uncertainty before a decision is made. When uncertainty is high, the “price” of missing context tends to be higher later.
Background reference: Value of information (overview).

In threshold terms, uncertainty is not “how hard the task is.” It is “how much judgment was required because the evidence was incomplete or in tension.”

4) `tradeoff_intensity`: whether alternatives are genuinely competing

Tradeoff intensity is meant to separate:

decisions with real competing alternatives (each with consequences)
from pseudo-choices where there is effectively only one viable path

This is aligned with the logic behind multi-criteria decision analysis (MCDA): complex decisions often involve multiple conflicting criteria, where no option dominates on every axis, and the point of the decision is the trade-off itself.
Reference: TÜV Rheinland Risktec – “Multi-Criteria Decision Analysis (MCDA)”.

In provenance terms, high tradeoff intensity is valuable to remember because:

it captures what was sacrificed
it provides reusable precedent when the same trade-off reappears

5) `longevity`: how long the decision’s effects persist

Longevity is the “future reuse” proxy.

Some decisions matter only for a moment (presentation order, a local rewrite). Others shape a system for months or years.

This is why engineering teams have created practices like Architecture Decision Records (ADRs): small, durable records for decisions with lasting consequences.

Thoughtworks explicitly recommends lightweight ADRs to capture important decisions with context and consequences, and to store them in source control so they stay in sync with the system.
Reference: Thoughtworks Technology Radar – “Lightweight Architecture Decision Records”.

The threshold’s longevity metric is intentionally broader than architecture. Any decision that is likely to shape future reasoning (scope, policy posture, workflow shape, or risk posture) should score higher on longevity.

Why the threshold is `>= 5`

The five dimensions are each scored 0–2, so the maximum is 10.

Setting the gate at 5 forces a candidate to have multiple meaningful signals, not just one. Because each dimension is capped at 2, at least three dimensions must be non-zero for a candidate to reach 5.

Examples:

A single “big” factor (like high uncertainty) contributes at most 2 points, so it is never enough by itself.
A decision with moderate signals across several dimensions is often exactly the kind of decision that becomes valuable provenance.

The threshold is therefore a pragmatic calibration: high enough to block noise, low enough that real inflection points still pass.
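
The calibration can be checked by brute force, since five dimensions with three levels each give only 243 possible score vectors. The enumeration below is an illustrative sketch, not part of the test suite; it ignores overrides and the usefulness gate and looks only at the numeric threshold.

```python
from itertools import product

THRESHOLD = 5

# Every possible assignment of 0/1/2 to the five dimensions.
all_vectors = list(product(range(3), repeat=5))
passing = [v for v in all_vectors if sum(v) >= THRESHOLD]

# Fewest non-zero dimensions among vectors that reach the threshold.
min_nonzero = min(sum(1 for score in v if score > 0) for v in passing)

print(len(all_vectors))  # 243
print(len(passing))      # 147
print(min_nonzero)       # 3
```

On score alone, no candidate passes with fewer than three non-zero dimensions, which is the “multiple signals” property in concrete terms.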

The policy is also explicit about tie-break behavior:

if unsure between adjacent scores, choose the lower number unless the higher one is explicitly supported
do not inflate one dimension to compensate for another
do not retune thresholds dynamically during a live run

This matters because the threshold gate is trying to be stable enough to audit.

How I tested the threshold gate

The threshold suite is documented in threshold/DECISION-LOGGING-THRESHOLD-TEST-SCENARIO.md.

It is focused on a narrow question:

Not “can the system reason well?”, but “does it correctly decide when to log?”

The suite adds ten prompts (7–16) and validates:

score-based positives (including boundary behavior)
mandatory overrides (policy and human)
explicit non-decision suppression
split-before-score behavior
robustness when a prior log exists (controls must stay suppressed)

It defines three modes:

Mode A: clean-room run (evidence: decisions-logs-mode-a/)
Mode B: rerun only the control prompts in a prior-rich environment (evidence: decisions-logs-mode-b/)
Mode C: rerun the full suite in a mixed order as a resilience test (evidence: decisions-logs-mode-c/)

Prompt set: threshold suite prompts 7–16 in test-prompts/ (for example: test-prompts/DECISION_LOGGING_TEST_PROMPT7-threshold-high-score.md, test-prompts/DECISION_LOGGING_TEST_PROMPT16-micro-decisions-control.md).

What the results show (and what they don’t)

The results report is captured in threshold/DECISION-LOGGING-THRESHOLD-TEST-RESULT.md.

Top-line:

Mode A: pass
Mode B: pass
Mode C: partial pass

The most important proven behavior is that the threshold gate works as a precision filter:

controls stayed suppressed (no fake provenance)
override prompts logged correctly
split prompt logged only the substantive decision

The biggest weakness exposed is not the scoring rule itself, but rerun semantics:

Mode C re-logged two positives (14 and 15) as reconfirmations
similar positives did not re-log
that makes the “duplicate suppression / reconfirmation policy” look inconsistent rather than deterministic

This is an important distinction:

the threshold gate can make “log/no-log” deterministic per candidate
but the workflow still needs deterministic rules for “do we log again on rerun?”
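
To make the gap concrete, here is one hypothetical shape a deterministic rerun rule could take: fingerprint each decision statement and append it at most once. This is not what the prototype does and not a recommendation from the test report; `decision_fingerprint` and `should_append` are invented names, and “log once, never reconfirm” is only one of several policies that would be deterministic.

```python
import hashlib


def decision_fingerprint(statement: str) -> str:
    """Stable key for a decision: lowercase, collapse whitespace, then hash."""
    normalized = " ".join(statement.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def should_append(statement: str, seen: set[str]) -> bool:
    """Hypothetical rule: append a decision only the first time it is seen."""
    fingerprint = decision_fingerprint(statement)
    if fingerprint in seen:
        return False
    seen.add(fingerprint)
    return True


seen: set[str] = set()
first = should_append("Adopt staged rollout for the release", seen)
rerun = should_append("adopt  staged rollout for the release", seen)
print(first, rerun)  # True False
```

Whatever rule is chosen, the key property is that the same candidate in the same log state always produces the same answer.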

What this adds to the overall decision provenance story

The first phase of the prototype proved that agents can log decisions and reuse those logs.

The threshold phase strengthens the story in a more operational way:

decision provenance becomes selective by explicit contract
the persistence gate becomes testable
the workflow becomes closer to “auditable memory,” not just “helpful notes”

It does not solve everything (actor attribution and rerun semantics are still the big gaps), but it makes the system’s behavior more governable.

Next refinements suggested by the suite

The artifacts in the threshold folder converge on four practical next steps:

Make duplicate / reconfirmation policy explicit and deterministic across reruns.
Enforce clean-room boundaries so previously saved results cannot act as priors in Mode A-style validation.
Persist threshold_check uniformly (or explicitly decide not to persist it) so audits can validate scoring behavior from saved artifacts.
Normalize decision-log formatting and schema drift across runs.

Those are not “nice to have.” They are the gap between a working prototype and audit-grade decision memory.
