Thought Leadership · 10 min · March 24, 2026

Deliberative AI Research

MyCorum.ai architecture — four verified properties, known limitations, and empirical research agenda. For researchers and technical reviewers.

1. Four structural properties — verified in code

MyCorum.ai's deliberative engine, Le Corum, rests on four architectural properties that distinguish it from multi-agent systems, prompt chaining, and model ensembling. Each property is verifiable in the production codebase, not asserted through prompting.

Property 1 — R1 Isolation

Implementation — pipeline.py

```python
results = await asyncio.gather(*[analyze(persona) for persona in personas])
# All 5 calls complete before any result is passed to another model
# SHA-256 hash of all 5 independent outputs logged as R1_ISOLATION_PROOF
```

Round 1 analysis is fully parallel. No model sees another's response until all five have completed. This is enforced at the Python coroutine level — not through prompt instructions, which could be violated. The SHA-256 proof in the telemetry provides an auditable record that all five outputs existed before any cross-model interaction occurred.
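For illustration, the isolation guarantee above can be sketched in a few lines. This is a minimal reconstruction, not the production pipeline.py — `analyze`, the persona names, and the exact hashing format are stand-ins:

```python
import asyncio
import hashlib
import json

async def analyze(persona: str) -> str:
    # Stand-in for a real model call; each persona runs independently.
    await asyncio.sleep(0)
    return f"analysis from {persona}"

async def round_one(personas: list[str]) -> tuple[list[str], str]:
    # All calls are scheduled together; no result is visible to any other
    # coroutine until gather() returns with the complete set.
    results = await asyncio.gather(*[analyze(p) for p in personas])
    # Hash the full set of independent outputs as an auditable record
    # that they all existed before any cross-model interaction.
    proof = hashlib.sha256(json.dumps(results).encode()).hexdigest()
    return results, proof

results, proof = asyncio.run(round_one(["A", "B", "C", "D", "E"]))
```

The key point is structural: the hash is computed over the gathered list, so it can only exist after all five coroutines have resolved.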

Property 2 — Algorithmic divergence measurement

Implementation — semantic_entropy.py

```python
kle = kernel_language_entropy(embeddings, bandwidth=0.5)
# Based on: Farquhar et al. (2024) "Detecting Hallucinations in LLMs"
# Nature, doi:10.1038/s41586-024-07421-0
biodiversity_index = entropy(0.5) + inverted_agreement(0.3) + cluster_count(0.2)
```

Semantic divergence is measured using Kernel Language Entropy on text embeddings — not self-evaluated by the models. The confidence score reflects actual geometric distance between model outputs in embedding space, not how confident each model reports being. The KLE method is adapted from Farquhar et al., 2024 — the first published application of this measure to deliberation stopping criteria.
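One plausible reading of a kernel-entropy divergence measure, sketched for illustration: build an RBF kernel over the output embeddings, normalize it to unit trace, and take its von Neumann entropy. The production semantic_entropy.py may differ in kernel choice and normalization — everything inside this function is an assumption, including the toy two-dimensional embeddings:

```python
import numpy as np

def kernel_language_entropy(embeddings, bandwidth=0.5):
    # RBF kernel over pairwise squared distances between output embeddings.
    X = np.asarray(embeddings, dtype=float)
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq / (2 * bandwidth ** 2))
    # Normalize to unit trace and take the von Neumann entropy:
    # identical outputs -> entropy near 0; mutually distant outputs -> near log(n).
    rho = K / np.trace(K)
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > 1e-12]
    return float(-(evals * np.log(evals)).sum())

same = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]      # three identical outputs
diverse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # three distant outputs
```

Because the score is computed from the geometry of the embeddings, a model cannot inflate it by merely asserting confidence in its text.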

Property 3 — Anti-convergence enforcement

Implementation — adaptive_orchestrator.py

```python
if biodiversity_index < 0.25 or agreement_score > 0.90:
    trigger_devil_advocate_round()
# Triggered by condition, not by prompt instruction
# The Contrarian persona cannot opt out
```

When consensus forms too rapidly, the orchestrator triggers an adversarial round. This is an algorithmic condition — the threshold values (0.25 / 0.90) are hyperparameters, not prompts. Premature agreement is treated as a system failure, not a success. The Contrarian persona is activated whether or not the deliberation appears to be converging naturally.
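The trigger condition is simple enough to state as a pure function. A minimal sketch, using the threshold values quoted above (0.25 / 0.90) as named hyperparameters — the function and constant names here are illustrative, not the production identifiers:

```python
# Hyperparameters, tunable offline; quoted values from the text above.
BIODIVERSITY_FLOOR = 0.25
AGREEMENT_CEILING = 0.90

def needs_devil_advocate(biodiversity_index: float, agreement_score: float) -> bool:
    # Purely algorithmic trigger: fires on measured values,
    # never on anything a model says in its output.
    return (biodiversity_index < BIODIVERSITY_FLOOR
            or agreement_score > AGREEMENT_CEILING)
```

Keeping the trigger as a function of two scalars means it can be unit-tested and re-tuned without touching any prompt.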

Property 4 — Mandatory dissent (Minority Report)

Implementation — output_schemas.py

```python
SYNTHESIS_SCHEMA = {
    "minority_positions": {"type": "array", "minItems": 0},
    "required": ["recommendation", "confidence", "minority_positions", "..."],
    "strict": True,
}
# synthesis_verifier.py: check_minority_quality()
# Rule: minority content > 50 words, overlap with consensus < 70%
```

The Minority Report is a required field in the JSON output schema — not optional, not generated only when divergence is high. A synthesis that omits minority positions fails schema validation and is retried. The synthesis verifier additionally checks that the minority content is substantive (minimum 50 words) and genuinely distinct from the consensus position (overlap threshold < 70%).
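The two quality rules (minimum 50 words, overlap below 70%) can be sketched as a single check. This is an illustrative reconstruction of what a `check_minority_quality` might look like, assuming a simple token-set overlap — the production verifier may use a different overlap measure:

```python
def check_minority_quality(minority: str, consensus: str,
                           min_words: int = 50, max_overlap: float = 0.70) -> bool:
    # Substance check: the dissent must be long enough to carry an argument.
    words = minority.split()
    if len(words) <= min_words:
        return False
    # Distinctness check: token overlap with the consensus must stay below 70%.
    m = {w.lower() for w in words}
    c = {w.lower() for w in consensus.split()}
    overlap = len(m & c) / len(m) if m else 1.0
    return overlap < max_overlap
```

A synthesis whose minority section fails either check is rejected and regenerated, so hollow boilerplate dissent cannot satisfy the schema.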

2. The Decision-Maker — the 1 in 1 + 5

The Dream Team is not 5 models.
It is 5 models + 1 human.

The distinction between The A-Team and The Dream Team is not the number of rounds. It is the presence of the Decision-Maker — the human — inside the deliberation loop, holding final authority at each inflection point.

In The A-Team, Le Corum deliberates for you. You receive the Corum Synthesis when deliberation is complete. The five minds have disagreed, challenged each other, and converged — or preserved their dissent — without your intervention.

In The Dream Team, the architecture is fundamentally different. You — the Decision-Maker — are present between rounds. You can pause the deliberation, redirect a line of inquiry, introduce new information, or signal that a particular position requires deeper scrutiny before the next round begins. The five minds respond to you. You are not a prompt. You are the authority the deliberation is structured around.

The research question this raises: does Decision-Maker presence improve the calibration of the final recommendation, or does it introduce anchoring bias — the human steering toward a preferred conclusion? This is one of the open empirical questions we intend to measure with VIP cohort data.

| Service | Architecture | Human role | Optimal use |
| --- | --- | --- | --- |
| The Expert | 1 model selected by MyPilot via benchmark scoring. Smart routing, not deliberation. | Passive — receives output | Single-domain questions requiring the best available expert |
| The A-Team | 5 models, adversarial deliberation, up to 4 adaptive rounds. Fully automated. | Passive — receives Corum Synthesis | Complex decisions where breadth of perspective matters, no time for interaction |
| The Dream Team | 5 models + 1 human Decision-Maker in the loop. Up to 6 rounds. Human pauses between rounds. | Active — the Decision-Maker. Steers, redirects, holds final authority. | Highest-stakes decisions where the human brings irreplaceable context, judgment, or authority |

3. What we claim — and what we do not

Verified
Round 1 independence is enforced in code

No model in Round 1 has access to another model's output. This is a structural guarantee of the asyncio.gather architecture, auditable via the SHA-256 proof in telemetry logs.

Verified
Divergence is measured algorithmically, not self-reported

KLE on embeddings produces a divergence score independent of model self-assessment. The method is academically grounded (Farquhar et al., Nature 2024).

Reference: Farquhar et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630. doi:10.1038/s41586-024-07421-0
Verified
The Minority Report cannot be omitted

JSON schema with strict: true and synthesis_verifier.py quality checks (minimum 50 words, <70% overlap with consensus) ensure substantive dissent is always present in the output.

Partial
The five models are epistemically independent

The five models are architecturally distinct, trained by different teams on partially different corpora. However, a significant portion of their training data overlaps (Common Crawl, Wikipedia, code repositories). Full Condorcet-style independence is not achieved. What is achieved is architectural diversity combined with adversarial structural pressure.

Partial
The confidence score reflects deliberation quality

The score is a composite of KLE divergence (45%), structured agreement assessment (35%), and verbalized confidence (20%). It measures the quality and depth of the deliberation process. It does not yet have empirical validation against real-world decision outcomes — that calibration is part of the research agenda below.
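The weighting above implies a straightforward linear composite. A sketch under stated assumptions: all three inputs are taken as already normalized to [0, 1], and KLE divergence is assumed to contribute inversely (low divergence at synthesis time raising confidence) — the production mapping from raw KLE to the [0, 1] scale is not public, so that inversion is an assumption:

```python
def composite_confidence(kle_divergence: float,
                         agreement: float,
                         verbalized: float) -> float:
    # Weights from the composite described above:
    # 45% KLE-based term, 35% structured agreement, 20% verbalized confidence.
    # Inputs assumed normalized to [0, 1]; inverse KLE term is an assumption.
    return (0.45 * (1.0 - kle_divergence)
            + 0.35 * agreement
            + 0.20 * verbalized)
```

Note that only 20% of the weight comes from what the models say about their own confidence; the rest is measured.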

Open
Deliberative AI produces better decisions than single-model AI

This is the core claim. It is architecturally motivated and theoretically grounded. It is not yet empirically validated at scale. The first calibration data will come from the VIP cohort (March–April 2026). We will publish the results regardless of what they show.

Open
Decision-Maker presence in The Dream Team improves recommendation quality

Theoretical case: the human brings irreplaceable contextual knowledge that the models cannot access. Counter-case: human presence introduces anchoring and confirmation bias. We do not yet have data to distinguish these effects. This is a primary research question.

4. Known limitations — stated honestly

Training data overlap. The models in Le Corum share significant training data. Their "independence" is architectural and instructional — not statistical in the Condorcet sense. A deliberation between five models all trained heavily on Wikipedia does not guarantee five genuinely independent beliefs about a question rooted in Wikipedia facts.

KLE measures linguistic divergence, not epistemic divergence. Two models can produce linguistically diverse outputs while holding the same underlying belief. The epistemic_extractor.py module (GO/PIVOT/STOP extraction) partially addresses this, but the gap between linguistic and epistemic divergence is real and not fully closed.

The confidence score is not calibrated against outcomes. A score of 8.2/10 does not mean the decision is 82% likely to succeed. It means the deliberation was deep and the synthesis was well-supported by the panel. Until we have outcome data, interpreting the score as a probability is incorrect.

The KS stopping criterion is statistically weak at n=5. Kolmogorov-Smirnov tests on five samples have low statistical power. The novelty tracker (novelty_tracker.py) partially compensates, but the stopping criterion remains heuristic at this panel size. A bootstrap/permutation test approach is planned for J+45.
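For intuition, here is what a permutation test on per-round divergence scores could look like at this panel size — a hypothetical sketch of the planned direction, not the J+45 implementation. At n=5 per round, exact enumeration is even feasible (C(10,5) = 252 splits); this version samples permutations instead:

```python
import random

def permutation_pvalue(round_a, round_b, n_perm=2000, seed=0):
    # Permutation test on the absolute difference of mean divergence
    # between two consecutive rounds. Valid at small n, unlike KS.
    rng = random.Random(seed)
    k = len(round_a)
    observed = abs(sum(round_a) / k - sum(round_b) / len(round_b))
    pooled = list(round_a) + list(round_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:k]) / k - sum(pooled[k:]) / (len(pooled) - k))
        if diff >= observed:
            hits += 1
    return hits / n_perm

p_same = permutation_pvalue([0.5] * 5, [0.5] * 5)        # no shift between rounds
p_diff = permutation_pvalue([0.1] * 5, [0.9] * 5)        # large shift between rounds
```

A high p-value means the rounds are statistically indistinguishable — a defensible stopping signal even with only five samples per round.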

The Dream Team introduces human-in-the-loop bias risk. A Decision-Maker who steers the deliberation toward a preferred conclusion can produce a Corum Synthesis that confirms rather than challenges their prior. This is a known risk of any advisory system. MyPilot's neutrality principle (never recommend a service that exceeds the actual complexity of the question) is a partial mitigation — not a solution.

5. Open questions

These are the questions MyCorum.ai cannot answer today — and is committed to trying to answer with data:

Q1
Does structural disagreement produce better decisions than best-model selection?

The core hypothesis. Controlled comparison: same question, same user, The Expert (best single model) vs. The A-Team (5 models, adversarial). Outcome tracked over 30–90 days. Requires user consent for outcome reporting.

Q2
What is the KLE threshold below which deliberation quality degrades?

The Condorcet boundary for this architecture. At what divergence level does the confidence score lose predictive validity? This is empirically measurable once outcome data exists.

Q3
Does Decision-Maker presence improve or degrade recommendation quality?

A-Team vs. Dream Team on comparable decisions with comparable stakes. Controlling for question complexity. Primary variable: does human steering increase or decrease calibration of the final confidence score?

Q4
Do regional LLM teams produce genuinely different epistemic outputs?

EU vs. MENA vs. CN presets on identical strategic questions with regional context. Does Mercury-2 (G42, UAE) produce structurally different positions than the US-centric default panel on questions involving MENA regulatory or cultural context? KLE measurement across preset conditions.

Q5
Can the Minority Report be used as an early warning signal?

In cases where the consensus recommendation was later judged incorrect by the Decision-Maker, was the correct position present in the Minority Report? Retrospective analysis on outcomes data. If yes, this is a novel finding about the informational value of structured dissent.

6. Empirical research agenda

J+30
First calibration curve — VIP cohort

50 VIP users, first 30 days of deliberations. Confidence score vs. self-reported outcome quality at 30 days. First empirical dataset for confidence calibration in multi-model deliberation. Results published regardless of outcome — positive or negative.

J+45
KS stopping → bootstrap/permutation test

Replace the current KS stopping criterion with a bootstrap/permutation test approach, which is statistically valid at n=5. Quantify improvement in stopping precision on the calibration dataset.

J+90
arXiv technical note — Biodiversity Index for multi-model deliberation

First publication of the BI = entropy(50%) + inverted_agreement(30%) + cluster_count(20%) composite measure and its behavior on deliberation data. Open dataset accompanying the note for independent replication.

J+90
arXiv technical note — KS stopping criterion for deliberative AI

First published stopping criterion for multi-model deliberation. Comparison of KS, bootstrap, and heuristic approaches on the VIP dataset. Includes the known limitations section from this page as part of the methodology.

2027
Decision-Maker presence study — Dream Team vs. A-Team controlled comparison

Controlled study on Q3 above. Requires sufficient scale (n ≥ 200 comparable decision pairs) and outcome tracking. First empirical evidence on whether human-in-the-loop deliberation outperforms automated deliberation on high-stakes decisions.

7. For researchers — contact and collaboration

We are interested in collaboration with researchers working on multi-agent AI, epistemic calibration, decision quality under uncertainty, and human-AI teaming. If you are studying any of the open questions above — or if you find errors in our claims — we want to hear from you.

We will share anonymized deliberation data with researchers under appropriate data agreements once the VIP cohort has produced a statistically meaningful dataset. The first dataset will be available no earlier than May 2026.

Research inquiries

For technical questions, collaboration proposals, or to challenge any claim on this page — reach the founders directly.

contact@mycorum.ai

Ready to deliberate?

Five independent minds. One structured verdict. You decide.
