Building an AI That Learns From Its Mistakes (Part 1 – The Journey)

Series: ISAC 2.0 Adaptive Reasoning Development
Part: 1 of 2
Date: April 2026
Author: Fitz
Project: ISAC 2.0 – GCF (Guided Cognitive Framework)

The Challenge: Teaching AI to Think Differently

What if your AI could recognize when its first answer isn’t good enough and automatically try a different approach?

Most AI systems give you one answer and call it done. They don’t second-guess themselves. They don’t pivot strategies when the first approach fails. They certainly don’t learn which reasoning styles work best for different problems.

ISAC 2.0 does. It is built on the GCF (Guided Cognitive Framework), a Rust-based architecture designed for adaptive reasoning.

This is the story of how I evolved the system from a Python prototype into a production-ready Rust implementation that rotates through different cognitive “faces” to find better answers — and the brutal benchmarking process that revealed where it worked, where it failed, and what needs to happen next.

In Part 1, I’ll cover the evolution through Stages 13.1-13.4 and the critical discovery that changed everything.

Part 2 will reveal the solution: constraint-based cognitive enforcement and the complete architectural rebuild.

The Architecture: A Rubik’s Cube for Reasoning

ISAC 2.0 is built on the GCF (Guided Cognitive Framework), a Rust-based architecture that replaced an earlier Python prototype once the system had outgrown it.

The system uses a dual-cube architecture:

Outer Cube: Environmental constraints, context, user requirements
Inner Cube: The active reasoning engine with 6 cognitive “faces”

Each face represents a different reasoning domain:

Face	Specialty
Synthesis	Combining multiple perspectives into unified answers
Analytical	Structured step-by-step decomposition
Procedural	Sequential operations and state machines
Creative	Divergent thinking and novel combinations
Memory	Pattern matching against past experience
Empirical	Evidence-based observation without speculation

The core idea: When the system produces a low-confidence result, the inner cube rotates to a different face and tries again with a fundamentally different cognitive approach.

The problem we discovered: Rotation was happening, but the faces weren’t actually different.
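The rotate-and-retry idea described above can be sketched in a few lines of Rust. This is a minimal illustration, not the GCF implementation: the `Answer` struct, the `answer_with_rotation` function, the table-order rotation in `Face::next`, and the `threshold` and `max_rotations` parameters are all assumptions made for the example. The reasoning engine is passed in as a closure so the loop can be exercised without an LLM.

```rust
/// The six cognitive faces named in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Face {
    Synthesis,
    Analytical,
    Procedural,
    Creative,
    Memory,
    Empirical,
}

impl Face {
    /// Hypothetical rotation order: the next face in table order, wrapping around.
    pub fn next(self) -> Face {
        match self {
            Face::Synthesis => Face::Analytical,
            Face::Analytical => Face::Procedural,
            Face::Procedural => Face::Creative,
            Face::Creative => Face::Memory,
            Face::Memory => Face::Empirical,
            Face::Empirical => Face::Synthesis,
        }
    }
}

/// Illustrative result type (an assumption, not a GCF type).
#[derive(Debug, Clone, PartialEq)]
pub struct Answer {
    pub face: Face,
    pub text: String,
    pub confidence: f32, // expected range 0.0..=1.0
}

/// Answer with `start`, then rotate while confidence stays below `threshold`,
/// for at most `max_rotations` extra attempts.
/// Returns the best answer seen and the number of rotations performed.
pub fn answer_with_rotation<F>(
    prompt: &str,
    start: Face,
    threshold: f32,
    max_rotations: usize,
    mut reason: F,
) -> (Answer, usize)
where
    F: FnMut(Face, &str) -> Answer,
{
    let mut best = reason(start, prompt);
    let mut face = start;
    let mut rotations = 0;
    while best.confidence < threshold && rotations < max_rotations {
        face = face.next();
        rotations += 1;
        let candidate = reason(face, prompt);
        // Keep the candidate only if it is strictly more confident.
        if candidate.confidence > best.confidence {
            best = candidate;
        }
    }
    (best, rotations)
}
```

Note what happens if every face returns the same confidence: the loop spends its whole rotation budget and still returns the original answer, which is the failure mode the rest of this post describes.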

The Evolution: 4 Stages of Failure and Progress

Stage 13.1: “The Trigger That Never Fired”

Problem: Rotation was theoretically possible but never actually triggered.

What happened:

System correctly identified low-affinity scenarios (e.g., Synthesis face trying to handle procedural math)
Exploration logic existed in code
But the actual rotation never executed

Result: 25-prompt stress test showed rotation count = 0 across all runs.

Lesson: Having code for a feature ≠ that feature actually running.

Stage 13.2: “Safety First, Maybe Too Safe”

Problem: Added aggressive safety blocks that prevented rotation even when needed.

What happened:

Implemented task complexity detection (Simple/Medium/Complex/Hard)
Added budget constraints to avoid wasting compute on trivial tasks
Created affinity scoring to detect face-task mismatch

Result:

✅ System correctly blocked rotation on “hello” and other trivial prompts
❌ System also blocked rotation on legitimate complex tasks
Rotation count: Still approximately 0

8 test runs (4 forward, 4 backward) confirmed: safety was working, but the actual adaptive behavior was still missing.
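To make the Stage 13.2 trade-off concrete, here is a rough sketch of a rotation gate that combines the three checks listed above: task complexity, affinity, and budget. Everything in it is an assumption for illustration. In particular, the word-count heuristic in `classify`, the `Gate` struct, its field names, and the block reasons are invented here and are not the real GCF logic.

```rust
/// Complexity levels named in the Stage 13.2 description.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Complexity {
    Simple,
    Medium,
    Complex,
    Hard,
}

/// Hypothetical heuristic: classify by word count only.
pub fn classify(prompt: &str) -> Complexity {
    match prompt.split_whitespace().count() {
        0..=3 => Complexity::Simple,
        4..=8 => Complexity::Medium,
        9..=15 => Complexity::Complex,
        _ => Complexity::Hard,
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Rotate,
    Blocked(&'static str),
}

/// Illustrative safety gate (field names are assumptions).
pub struct Gate {
    pub min_complexity: Complexity, // rotate only at or above this level
    pub max_affinity: f32,          // rotate only if face-task affinity is below this
    pub budget: u32,                // rotations still allowed
}

impl Gate {
    pub fn decide(&mut self, prompt: &str, affinity: f32) -> Decision {
        if classify(prompt) < self.min_complexity {
            return Decision::Blocked("task too simple");
        }
        if affinity >= self.max_affinity {
            return Decision::Blocked("face already fits task");
        }
        if self.budget == 0 {
            return Decision::Blocked("budget exhausted");
        }
        self.budget -= 1;
        Decision::Rotate
    }
}
```

With `min_complexity` set to `Complexity::Hard`, this gate blocks "hello" but also blocks a ten-word planning prompt, which mirrors the over-blocking seen in this stage. Lowering the threshold to `Complexity::Complex` lets the same prompt through when affinity is low.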

Stage 13.3: “The Mechanical Fix”

Problem: Exploration flag was set but never checked.

What happened:

Added explicit force_explore flag in Router
Wired up exploration trigger to actually invoke rotation logic
Maintained all safety checks from 13.2

Result:

✅ On “do math” and similar tasks: exploration finally triggered
✅ Logs showed: Exploration triggered: true | reason: force_explore
❌ Rotated results were identical to original results
Confidence before rotation: 80% → After rotation: 80%

8 more test runs proved the mechanism worked, but didn’t produce value.
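The "flag set but never checked" bug is easy to reproduce in miniature. The sketch below is a simplified stand-in, not the real Router: only the `force_explore` name and the log line come from the text above, while the struct layout and both method names are assumptions.

```rust
/// Simplified stand-in for the Router (layout is an assumption).
#[derive(Debug, Default)]
pub struct Router {
    pub force_explore: bool,
    pub rotations: u32,
    pub log: Vec<String>,
}

impl Router {
    /// Shape of the bug: the flag may be set elsewhere,
    /// but this code path never reads it, so rotation never runs.
    pub fn route_without_check(&self) -> bool {
        false
    }

    /// Shape of the fix: read the flag, consume it, and invoke rotation.
    pub fn route(&mut self) -> bool {
        if self.force_explore {
            self.force_explore = false; // consume so one request triggers one rotation
            self.rotations += 1;
            self.log
                .push("Exploration triggered: true | reason: force_explore".to_string());
            true
        } else {
            false
        }
    }
}
```

Consuming the flag inside `route` is a design choice made for this sketch: it keeps a single exploration request from causing repeated rotations.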

Stage 13.4: “It Rotates, But Nothing Changes”

Current status as of April 15, 2026:

What works:

✅ Safety blocks trivial tasks (no rotation on “hello”)
✅ Exploration triggers on appropriate tasks
✅ System rotates to different faces
✅ Rejection logic restores original answer when rotation fails

What doesn’t work:

❌ Rotated results have identical structure to original
❌ No meaningful confidence improvement
❌ Different faces produce essentially the same reasoning

Example – “do math” task:

Original (Synthesis face):
  → Confidence: 80%
  → Structure: procedural steps

Rotated (Analytical face):
  → Confidence: 80%
  → Structure: procedural steps
  → Rejection: "No improvement detected"

The bottleneck shifted: From “can’t trigger exploration” to “exploration doesn’t produce different thinking.”
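The rejection step in the example above can be sketched as a small comparison function. This is an assumed shape rather than the actual GCF code: `Attempt`, `Outcome`, `accept_or_restore`, and the `min_gain` parameter are hypothetical names, and only the "No improvement detected" message comes from the example.

```rust
/// One reasoning attempt (illustrative fields).
#[derive(Debug, Clone, PartialEq)]
pub struct Attempt {
    pub face: &'static str,
    pub confidence: f32,
    pub structure: &'static str,
}

#[derive(Debug, PartialEq)]
pub enum Outcome {
    /// The rotated attempt replaced the original.
    Accepted(Attempt),
    /// The original was restored.
    Rejected { kept: Attempt, reason: &'static str },
}

/// Accept the rotated attempt only if confidence rises by more than `min_gain`;
/// otherwise restore the original answer.
pub fn accept_or_restore(original: Attempt, rotated: Attempt, min_gain: f32) -> Outcome {
    let gain = rotated.confidence - original.confidence;
    if gain > min_gain {
        Outcome::Accepted(rotated)
    } else {
        Outcome::Rejected {
            kept: original,
            reason: "No improvement detected",
        }
    }
}
```

Fed the "do math" numbers (80% before, 80% after), this returns `Rejected` and keeps the Synthesis answer, which is exactly what the Stage 13.4 logs showed.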

The Benchmark Suite: How I Tested This

I built a 5-tier, 25-prompt test suite designed to stress-test every aspect of the system:

🟢 TIER 1: Easy (Baseline Sanity Check)

hello
what is 5 + 7
define machine learning

Expected: No rotation, high confidence immediately, GCF ≈ raw model

🟡 TIER 2: Moderate (Structured Tasks)

explain how photosynthesis works step by step
compare Python vs Rust for backend development

Expected: Bishop/Rook pieces (two roles in GCF’s chess-piece cognitive hierarchy), some structure gains, minor improvements

🟠 TIER 3: Complex (Multi-Step Reasoning)

solve: A train travels 60mph for 2.5 hours, then 40mph for 1.5 hours. Total distance?
design a weekly study plan for learning Rust from scratch

Expected: Multi-step pipelines, possible rotation triggers

🔴 TIER 4: Hard (Ambiguity + Tradeoffs)

A company increased ad spend by 50% but revenue dropped 20%. Analyze why.
analyze whether AI will replace developers in the next 10 years

Expected: King intervention, strategy overrides, rotation should appear

🔥 TIER 5: Stress Test (Where GCF Must Prove Itself)

analyze a failing business, propose a turnaround plan, and convert it into actionable steps
design a scalable AI system under strict memory and latency constraints

Expected: Rotation triggers, strategy switching, confidence delta increases, GCF > raw model
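One way to encode the tier expectations so a harness can check them automatically is shown below. It is a sketch under assumptions: the `Tier`, `Rotation`, `SUITE`, and `passes` names are invented here, only the sample prompts quoted above are listed (not the full 25), and treating Tier 2 and Tier 3 rotation as optional is my reading of "possible rotation triggers", not a documented rule.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Easy,
    Moderate,
    Complex,
    Hard,
    Stress,
}

/// What the suite expects from the rotation mechanism at each tier.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Rotation {
    Forbidden,
    Optional, // assumption for Tiers 2 and 3
    Expected,
}

impl Tier {
    pub fn rotation(self) -> Rotation {
        match self {
            Tier::Easy => Rotation::Forbidden,
            Tier::Moderate | Tier::Complex => Rotation::Optional,
            Tier::Hard | Tier::Stress => Rotation::Expected,
        }
    }
}

/// Sample prompts quoted in this post, tagged by tier.
pub const SUITE: &[(Tier, &str)] = &[
    (Tier::Easy, "hello"),
    (Tier::Easy, "what is 5 + 7"),
    (Tier::Easy, "define machine learning"),
    (Tier::Moderate, "explain how photosynthesis works step by step"),
    (Tier::Moderate, "compare Python vs Rust for backend development"),
    (Tier::Complex, "solve: A train travels 60mph for 2.5 hours, then 40mph for 1.5 hours. Total distance?"),
    (Tier::Complex, "design a weekly study plan for learning Rust from scratch"),
    (Tier::Hard, "A company increased ad spend by 50% but revenue dropped 20%. Analyze why."),
    (Tier::Hard, "analyze whether AI will replace developers in the next 10 years"),
    (Tier::Stress, "analyze a failing business, propose a turnaround plan, and convert it into actionable steps"),
    (Tier::Stress, "design a scalable AI system under strict memory and latency constraints"),
];

/// Did the observed rotation behavior match the tier's expectation?
pub fn passes(tier: Tier, rotated: bool) -> bool {
    match tier.rotation() {
        Rotation::Forbidden => !rotated,
        Rotation::Optional => true,
        Rotation::Expected => rotated,
    }
}
```

A harness would iterate over `SUITE`, record whether each prompt rotated, and report every `(tier, prompt)` pair where `passes` returns `false`.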

The Results: What the Numbers Showed

Across 28 benchmark runs (Stage 13.1 → 13.4):

What Improved ✅

Metric	Stage 13.1	Stage 13.4
Rotation triggers on complex tasks	0%	~30%
Safety blocks on trivial tasks	No	Yes
Exploration mechanism functional	No	Yes
Rejection of bad rotations	No	Yes

What Stayed Broken ❌

Metric	Stage 13.1	Stage 13.4
Meaningful confidence improvement	0%	~0%
Structural difference in reasoning	No	No
Face-specific cognitive styles enforced	No	No

The Smoking Gun

Running the “do math” prompt across all stages:

Stage 13.1: [never explored]
Stage 13.2: [blocked by safety]
Stage 13.3: [explored, no difference]
Stage 13.4: [explored, difference measured, rejected as identical]

Progress: We went from “can’t explore” to “explores but doesn’t help.”

That’s still progress.

Key Insights: What I Learned

1. Mechanical vs Behavioral Success

Having rotation trigger is not the same as having rotation work.

Stage 13.4 achieved mechanical success (the cube rotates), but behavioral failure (the rotation doesn’t change the outcome).

2. The Label Problem

Rotating from “Synthesis” to “Analytical” just changes a label. The underlying LLM prompt gets a different domain instruction, but there’s no enforcement that the reasoning must actually be different.

Current prompt for Analytical face:

[Domain: Use structured reasoning. Break down into logical steps.]

Current prompt for Synthesis face:

[Domain: Synthesize multiple perspectives into a unified answer.]

Problem: The LLM can still use the same cognitive primitives (metaphors, step-by-step, analogies) regardless of face.
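The sketch below shows why a face switch of this kind amounts to a label change. The two instruction strings are the ones quoted above; the `build_prompt` function and the way the instruction is joined to the task are assumptions, since the real prompt template is not shown here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Face {
    Analytical,
    Synthesis,
}

/// Domain instruction per face, as quoted in the text.
pub fn domain_instruction(face: Face) -> &'static str {
    match face {
        Face::Analytical => "[Domain: Use structured reasoning. Break down into logical steps.]",
        Face::Synthesis => "[Domain: Synthesize multiple perspectives into a unified answer.]",
    }
}

/// Hypothetical prompt assembly: instruction line, then the task.
pub fn build_prompt(face: Face, task: &str) -> String {
    format!("{}\n{}", domain_instruction(face), task)
}
```

For the same task, the two prompts differ by a single advisory line. Nothing in the request checks that the response actually follows that line.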

3. Structural Identity is Detectable

I built a hash function that compares reasoning topology (not content):

Reasoning graph structure
Primitive sequences (decomposition, metaphor, analogy)
Connector types (“therefore” vs “similarly” vs “imagine”)

Finding: 95%+ of rotations produced identical structural hashes.

This proved the rotations weren’t actually generating different thinking — they were just re-generating the same thinking with a different label.
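A topology-only hash along these lines can be sketched with the standard library. This is a simplification and not the hash function used in the benchmarks: it treats reasoning as a linear list of steps rather than a graph, and the `Primitive`, `Connector`, and `Step` types are hypothetical. The point it illustrates is that step content is deliberately left out of the hash.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Reasoning primitives mentioned in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Primitive {
    Decomposition,
    Metaphor,
    Analogy,
}

/// Connector types mentioned in the text.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Connector {
    Therefore,
    Similarly,
    Imagine,
}

/// One step of a reasoning trace (illustrative).
pub struct Step {
    pub primitive: Primitive,
    pub connector: Connector,
    pub content: String,
}

/// Hash the shape of the reasoning, ignoring what each step says.
pub fn structural_hash(steps: &[Step]) -> u64 {
    let mut hasher = DefaultHasher::new();
    steps.len().hash(&mut hasher);
    for step in steps {
        step.primitive.hash(&mut hasher);
        step.connector.hash(&mut hasher);
        // `content` is intentionally not hashed.
    }
    hasher.finish()
}
```

Two traces with different wording but the same primitive and connector sequence hash to the same value, which is how a rotation can be flagged as structurally identical even when the text differs.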

The Critical Discovery

After 28 benchmark runs and analyzing ~700 individual task executions, the pattern was undeniable:

The system could rotate mechanically, but couldn’t think differently.

The faces were labels without walls. Nothing prevented the Analytical face from using metaphors, or the Synthesis face from doing step-by-step decomposition. The LLM received different instructions, but had complete freedom to ignore them.

The solution was obvious: Each face needed hard constraints — forbidden operations and required primitives that would be verified and enforced.
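As a rough illustration of what "forbidden operations and required primitives" could look like in code, here is a minimal verifier. It is not the Stage 13.5 design, which Part 2 covers. The `FaceConstraints`, `Violation`, and `verify` names are hypothetical, and the example rule (Analytical requires decomposition and forbids metaphor) is an assumption based on the metaphor example above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Primitive {
    Decomposition,
    Metaphor,
    Analogy,
}

/// Hard constraints for one face (illustrative).
pub struct FaceConstraints {
    pub required: Vec<Primitive>,
    pub forbidden: Vec<Primitive>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Violation {
    UsedForbidden(Primitive),
    MissingRequired(Primitive),
}

/// Check the primitives a response actually used against a face's constraints.
/// Forbidden-use violations are reported first, then missing requirements.
pub fn verify(constraints: &FaceConstraints, used: &[Primitive]) -> Result<(), Vec<Violation>> {
    let mut violations = Vec::new();
    for &p in &constraints.forbidden {
        if used.contains(&p) {
            violations.push(Violation::UsedForbidden(p));
        }
    }
    for &p in &constraints.required {
        if !used.contains(&p) {
            violations.push(Violation::MissingRequired(p));
        }
    }
    if violations.is_empty() {
        Ok(())
    } else {
        Err(violations)
    }
}
```

Under these assumed rules, an Analytical response built from metaphor and analogy fails on two counts: it used a forbidden primitive and skipped the required one.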

Coming in Part 2

In Part 2 of this series, I’ll reveal:

Stage 13.5: The complete architectural rebuild with constraint-based enforcement
How I implemented cognitive sandboxes with walls
The new verification system that rejects invalid reasoning
Structural hashing that detects genuine differentiation
Expected performance improvements once benchmarks complete
The path forward to Stages 14-16

The transformation: From a label rotator to a true adaptive reasoning engine.

Lessons for Other AI Builders

1. Log Everything

Without detailed execution logs, I would never have caught:

The exploration flag being set but never checked (13.2 → 13.3)
Rotations triggering but producing identical outputs (13.3 → 13.4)

2. Build Safety First

Stage 13.2’s “safety-first” approach was frustrating (blocked too much), but it prevented the system from:

Wasting compute on trivial tasks
Rotating when already confident
Entering infinite rotation loops

3. Mechanical ≠ Behavioral

Just because the code runs doesn’t mean it’s doing what you think. Stage 13.4 has a working rotation mechanism but broken rotation value.

Test the behavior you want, not just the mechanism that enables it.

4. Benchmarks Expose Truth

Running the 25-prompt suite across 28 benchmark runs over four stages produced roughly 700 test executions (28 × 25 = 700), which revealed patterns I would never have seen with ad-hoc testing.

The data doesn’t lie: rotation triggers on roughly 30% of complex tasks, but improves results roughly 0% of the time.

Technical Details for the Curious

Architecture Stack

Core Framework: GCF (Guided Cognitive Framework)
Language: Rust (migrated from a Python prototype that the system had outgrown)
LLM Backend: Ollama (local inference)
State Management: JSON-based session persistence
Benchmarking: Automated test suites for 25-prompt validation runs

Key Metrics Tracked

Confidence scores (before/after rotation)
Structural hash (reasoning topology)
Piece usage (Pawn/Rook/Bishop/Knight/Queen/King)
Domain affinity (how well face matches task)
Execution time (latency overhead from rotation)
Memory utilization (hint system effectiveness)

Benchmark Methodology

Forward/Backward runs: Test for order-dependence
Multiple iterations: Confirm consistency (5-8 runs per stage)
Tiered prompts: From trivial to stress-test
Structural comparison: Not just output similarity, but reasoning patterns
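The forward/backward check can be sketched as follows. This is an assumed harness shape, not the actual benchmark code: `run_pass` and `order_dependent_prompts` are hypothetical names, the system under test is reduced to a closure that reports whether a prompt triggered rotation, and `fresh` is assumed to build a new system instance for each pass so that state does not leak between them.

```rust
use std::collections::BTreeMap;

/// Run one pass over the prompts in the given order,
/// recording whether each prompt triggered rotation.
pub fn run_pass<'a, I, F>(prompts: I, mut run: F) -> BTreeMap<&'a str, bool>
where
    I: IntoIterator<Item = &'a str>,
    F: FnMut(&str) -> bool,
{
    prompts.into_iter().map(|p| (p, run(p))).collect()
}

/// Run the suite forward and backward on fresh systems and
/// return the prompts whose outcome depended on the order.
pub fn order_dependent_prompts<'a, F, G>(prompts: &[&'a str], mut fresh: G) -> Vec<&'a str>
where
    G: FnMut() -> F,
    F: FnMut(&str) -> bool,
{
    let forward = run_pass(prompts.iter().copied(), fresh());
    let backward = run_pass(prompts.iter().rev().copied(), fresh());
    prompts
        .iter()
        .copied()
        .filter(|p| forward.get(p) != backward.get(p))
        .collect()
}
```

An empty result means the per-prompt outcome was the same in both directions. Any prompt in the returned list behaved differently depending on what ran before it, which points at state carried across prompts.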

The Journey Continues

Current Status: Stage 13.4 complete; Stage 13.5 implemented, with benchmark validation in progress.

Next Milestones:

✅ Stage 13.1-13.4: Rotation mechanism complete
✅ Stage 13.5: Constraint-based face differentiation (IMPLEMENTED)
🔄 Stage 13.5: Benchmark validation (IN PROGRESS)
⏳ Stage 14: Multi-face fusion (combine strengths)
⏳ Stage 15: Dynamic face creation
⏳ Stage 16: Outer cube integration

This is what indie AI research looks like: one developer, a lot of coffee, Rust compiler errors, and obsessive logging.

If you’re building adaptive AI systems, dealing with LLM reasoning limitations, or just curious about cognitive architectures — reach out at Rqmeo@pm.me. I’d love to compare notes.

Resources

Benchmark Data: 28 full test runs with detailed logs available
Architecture: GCF (Guided Cognitive Framework) – Rust-based adaptive reasoning system
Technical Docs: Dual-cube design, chess-piece cognitive hierarchy

Contact: Rqmeo@pm.me

Update Log

Apr 15, 2026: Stage 13.4 benchmarks complete (28 runs), identified structural identity problem
Apr 16, 2026: Stage 13.5 constraint system implemented, benchmarks in progress
[Next]: Part 2 – Stage 13.5 architecture deep dive and results

Continue to Part 2: “The Solution – Constraint-Based Cognitive Enforcement”

Built with Rust, tested with persistence, improved through failure.
