Abstract
Compact task-specialized language models demonstrate superior performance in multi-hop reasoning and faithfulness compared to larger general-purpose models through a novel training pipeline and structured reasoning traces.
Recent progress in the development of language models has been defined by scale, with each generation absorbing more of the world's knowledge into its weights. However, many practical applications benefit more from robust reasoning than from extensive parametric knowledge. In this setting, task-specialized small language models (SLMs) offer a principled design choice. We introduce Optimal Cognitive Core (OCC), a family of SLMs built around this premise. As a variant of OCC, we present OCC-RAG, optimized for faithful question answering (QA) grounded in the provided context. This task directly aligns with the OCC design approach, requiring multi-hop reasoning over supplied passages while ignoring memorized knowledge. To train OCC-RAG, we implement a novel pipeline for synthesizing multi-context, multi-hop QA data at scale, producing a corpus of over three million examples targeting multi-hop reasoning, strict context faithfulness, and calibrated abstention. We release OCC-RAG-0.6B and OCC-RAG-1.7B, both mid-trained on this corpus. The models produce structured reasoning traces with source citations grounded in literal quotes from the context. Through OCC-RAG, we demonstrate that compact, task-specialized SLMs can match or exceed general-purpose models 2-6x their size across multi-hop reasoning (HotpotQA, MuSiQue, TAT-QA), faithfulness (ConFiQA), and refusal (MuSiQue-Un) benchmarks.
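As a rough illustration of what "source citations grounded in literal quotes" could mean in practice, the sketch below checks that each cited quote appears verbatim (up to whitespace) in the source it points to. The `Citation` class, the `ungrounded_citations` helper, and the source IDs are hypothetical names introduced here for illustration; the actual OCC-RAG trace format may differ.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    # Hypothetical citation record: a source ID plus the quoted span.
    source_id: str
    quote: str


def normalize(text):
    """Collapse runs of whitespace so line breaks do not defeat matching."""
    return " ".join(text.split())


def ungrounded_citations(citations, sources):
    """Return citations whose quote is not found verbatim in the cited source."""
    bad = []
    for citation in citations:
        source_text = sources.get(citation.source_id)
        quote = normalize(citation.quote)
        # A citation fails if the source is unknown, the quote is empty,
        # or the quote is not a literal substring of the source text.
        if source_text is None or not quote or quote not in normalize(source_text):
            bad.append(citation)
    return bad


# Toy example (invented data, not taken from the paper).
sources = {
    "S1": "The Eiffel Tower was completed in 1889.",
    "S2": "Gustave Eiffel's company built the tower.",
}
citations = [
    Citation("S1", "completed in 1889"),   # literal quote, grounded
    Citation("S2", "designed the tower"),  # paraphrase, not a literal quote
    Citation("S3", "built the tower"),     # cites a source that was not provided
]
bad = ungrounded_citations(citations, sources)
```

A check of this kind only verifies that the quote exists in the source; it says nothing about whether the quote actually supports the inference it is attached to.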
Community
Congratulations on this release. The focus on faithful, context-grounded QA with calibrated abstention and citation-anchored reasoning traces is really nice work, and getting 0.6B/1.7B models to match or beat much larger ones on ConFiQA faithfulness is an impressive result. We especially liked the explicit "query analysis → source analysis → reasoning → status → answer" structure and the "not enough information" abstention behavior.
We've been thinking along very similar lines and wanted to share our work, in case it's of interest:
CANOE: Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning (AAAI 2026, Oral)
Code: https://github.com/S1s-Z/CANOE
Paper: https://arxiv.org/abs/2505.16483
Like OCC-RAG, CANOE improves contextual faithfulness without human annotation by synthesizing easily-verifiable short-form QA data across diverse tasks. The main difference is the post-training recipe: we propose Dual-GRPO, a rule-based RL method with tailored rewards that jointly optimizes both short-form and long-form generation (avoiding the over-optimization you get from short-form data alone, and removing the need to train reward models on labeled preference data). It improves faithfulness across 11 downstream tasks.
There seems to be a lot of shared ground between the two approaches (synthetic data for faithfulness, abstention/grounding), and possibly complementary ideas (your citation-anchored traces + small-model specialization vs. our RL-based optimization). We'd be glad if you took a look, and we'll be following your work going forward. Happy to exchange notes anytime.
The finding that CANOE-LLaMA-8B outperforms GPT-4o on faithfulness benchmarks is outstanding. I particularly appreciate your insight that simply scaling model parameters does not reliably improve faithfulness. Our work indicates that mid-training on structured reasoning traces establishes a robust foundation for evidence-based reasoning. Your work demonstrates that Dual-GRPO is highly effective at fine-tuning models for faithfulness. A hybrid approach, in which mid-trained models are further enhanced with Dual-GRPO, appears promising. I will drop you an email to discuss this further.
the most interesting piece for me is the combo of a kg-guided, topology-controlled synthetic data pipeline with a judge-based verification and citation-grounded traces. i buy the mid-training on 3+ million examples to force true multi-hop reasoning without memorization, but i’m curious how the grounding holds when the context has overlapping quotes or ambiguous paraphrases. the structured reasoning traces with literal quotes are neat, yet i wonder how they align quotes to exact inference steps, especially if a single quote supports multiple inferences. btw arxivLens had a solid breakdown that covers the trace format and the judge verification, which helped me parse the setup without getting lost. one question: how does the abstention calibration hold up under noisy or contradictory passages in the provided context?
We handle overlapping quotes and paraphrases by using source-ID citations and structured reasoning steps rather than literal quote matching, with LLM-as-a-judge filtering removing traces that misalign evidence to inferences. Along with the reference contexts, we also include distractor contexts that are semantically very close to the gold passage (topically adjacent, high TF-IDF similarity) yet deliberately lack the correct answer. This forces the model to learn that semantic similarity alone is insufficient for answering, and that true answerability requires the presence of verifiable evidence, not just topical relevance.
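The distractor idea described above can be sketched roughly as follows: rank candidate passages by TF-IDF cosine similarity to the gold passage and keep the closest ones that do not contain the answer. This is a minimal sketch under stated assumptions, not the authors' pipeline. The function names, the smoothed IDF formula, and the plain substring test for "lacks the answer" are all choices made here for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_distractors(gold_passage, answer, candidates, k=2):
    """Pick the k candidates most similar to the gold passage that lack the answer."""
    # Assumption: "lacks the answer" is approximated by a substring check.
    answer_lc = answer.lower()
    eligible = [c for c in candidates if answer_lc not in c.lower()]
    vectors = tfidf_vectors([gold_passage] + eligible)
    gold_vec, cand_vecs = vectors[0], vectors[1:]
    ranked = sorted(
        zip(eligible, cand_vecs),
        key=lambda pair: cosine(gold_vec, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy example (invented passages, not from the paper's corpus).
gold = "Marie Curie won the Nobel Prize in Physics in 1903."
answer = "1903"
candidates = [
    "Marie Curie won the Nobel Prize in Chemistry for discovering polonium.",
    "Pierre Curie shared the Nobel Prize in Physics in 1903.",  # contains the answer
    "The Amazon river flows through Brazil.",
]
distractors = select_distractors(gold, answer, candidates, k=1)
```

In this toy case the Chemistry passage is selected: it is topically adjacent to the gold passage but does not state the year, which is the property the reply describes. A real pipeline would likely need a stronger answer-absence check than substring matching, for example to catch paraphrased answers.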
We are currently preparing a detailed technical description of the full synthetic data generation pipeline, including KG extraction, path sampling templates, prompt designs, and filtering criteria, and will release it shortly. Stay tuned.
Made an audio walkthrough of this paper for anyone who wants to skim it on the go:
https://researchpod.app/episode/6f845117-1407-4e88-b19c-dd2f3dc23c43
Generated automatically by ResearchPod — happy to take feedback from the authors.
Models citing this paper 5
occ-ai/OCC-RAG-0.6B