HF-Link

https://huggingface.co/datasets/MasterControlAIML/R1-Reasoning-Unstructured-To-Structured

MasterControl AIML Team 🚀

Overview

The MasterControl AIML team supports the Hugging Face initiative of re-creating DeepSeek R1 training, recognizing it as one of the most impactful open-source projects today.

We aim to contribute to reasoning datasets, specifically those where:

A real-world problem involves generating complex structured output
It is accompanied by step-by-step reasoning and unstructured input

Challenges in Integrating Generative AI into Systems of Record (SoR)

Integrating Generative AI into Systems of Record (SoR) for health, life sciences, and manufacturing quality is challenging for two reasons:

These systems rely on strictly structured data formats (e.g., JSON, XML, templates).
Raw LLM outputs are free-form text and are not guaranteed to conform to a regular expression or context-free grammar.

Techniques for Structured Output Generation

To enforce structured output generation, we explore:

Strict schema prompting
Post-processing and output validation
Reformulating text generation into transitions between finite-state machine states
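
The third technique above can be illustrated with a toy sketch. It assumes a hand-written finite-state machine for a flat JSON object and a stand-in scoring function in place of real model logits; the states, tokens, and `toy_scorer` are hypothetical and are not taken from the dataset or from any specific library.

```python
import json

# Hypothetical FSM: each state maps the tokens allowed there to the next state.
TRANSITIONS = {
    "start": {"{": "key"},
    "key": {'"lot"': "colon", '"status"': "colon"},
    "colon": {":": "value"},
    "value": {'"A12"': "after_value", '"released"': "after_value"},
    "after_value": {",": "key", "}": "done"},
}


def constrained_generate(score_token, max_steps=20):
    """Greedy decoding restricted to the tokens the FSM allows in each state."""
    state, tokens = "start", []
    for _ in range(max_steps):
        if state == "done":
            break
        allowed = TRANSITIONS[state]
        prefix = "".join(tokens)
        # Only allowed tokens are scored, so illegal continuations are never chosen.
        token = max(allowed, key=lambda t: score_token(prefix, t))
        tokens.append(token)
        state = allowed[token]
    return "".join(tokens), state


def toy_scorer(prefix, token):
    """Stand-in for model logits (assumption): fixed preferences per token."""
    preferences = {
        "{": 1.0, '"lot"': 0.9, '"status"': 0.4, ":": 1.0,
        '"A12"': 0.8, '"released"': 0.3, "}": 0.7, ",": 0.2,
    }
    return preferences.get(token, 0.0)


text, final_state = constrained_generate(toy_scorer)
record = json.loads(text)  # parses because the FSM reached its final state
```

Because every step chooses only among tokens the current state permits, any sequence that reaches the `done` state is well-formed by construction; a real system would derive the states from a schema or grammar rather than writing them by hand.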

DeepSeek R1 Approach

A key challenge is fitting hybrid structured and unstructured historical manufacturing production records to master templates.
We aim to leverage the DeepSeek R1 model, which uses:

Pure reinforcement learning to train a base language model
Learning to reason without human supervision

Model Used

We used DeepSeek's distilled 7B model to create reasoning responses that go from unstructured text to structured output.

Purpose of Reasoning Responses

The reasoning responses are built around one task: given unstructured text and a schema of rules, the model must convert the text into structured output that follows the schema. These responses can be used for any unstructured-to-structured conversion.
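
The conversion task described above can be sketched as a prompt builder plus a check on the model's answer. This is a minimal sketch under stated assumptions: the schema format (field name mapped to a type name), the field names, and the sample response are hypothetical and do not reflect the dataset's actual columns.

```python
import json
import re

# Hypothetical mapping from schema type names to Python types.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float)}


def build_prompt(text, schema):
    """Combine the unstructured text and the schema into a single prompt."""
    return (
        "Convert the text into JSON that follows the schema.\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n"
        f"Text:\n{text}\n"
    )


def extract_json(response):
    """Return the first parseable JSON object in the response, or None."""
    decoder = json.JSONDecoder()
    for match in re.finditer(r"\{", response):
        try:
            obj, _ = decoder.raw_decode(response, match.start())
        except json.JSONDecodeError:
            continue  # this brace did not start valid JSON; try the next one
        if isinstance(obj, dict):
            return obj
    return None


def validate(record, schema):
    """List every schema field that is missing or has the wrong type."""
    errors = []
    for field, type_name in schema.items():
        if field not in record:
            errors.append(f"missing field: {field}")
        elif not isinstance(record[field], TYPE_MAP[type_name]):
            errors.append(f"wrong type for {field}: expected {type_name}")
    return errors


# Illustrative schema and model response (both assumptions).
schema = {"batch_id": "string", "quantity": "integer"}
prompt = build_prompt("Batch B-104 produced 250 units.", schema)
response = (
    "Reasoning: the text names batch B-104 with 250 units.\n"
    '{"batch_id": "B-104", "quantity": 250}'
)
record = extract_json(response)
errors = validate(record, schema)
```

Keeping extraction and validation separate lets the reasoning text precede the JSON without affecting the check, and the error list can be fed back to the model or used to filter generated examples.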

Next Step: Reasoning Data

Our first step is curating and contributing to reasoning datasets that facilitate structured output generation.

Create R1 Responses to go from unstructured to structured (36c417d4)

bhaviktheslider changed pull request title from "Create R1 Responses to go from unstructured to structured" to "Distilled R1 Responses Unstructured to Structured" Jan 31

bhaviktheslider

Jan 31

@eliebak / Admins, can we add this to the reasoning dataset collection? I have seen a lot of reasoning datasets on coding and math (deterministic) problems, but none on reasoning from unstructured text to a structured JSON schema. The full dataset has been synthetically generated by us.

bhaviktheslider changed pull request status to closed Feb 1