Distilled R1 Responses Unstructured to Structured
HF-Link
https://huggingface.co/datasets/MasterControlAIML/R1-Reasoning-Unstructured-To-Structured
MasterControl AIML Team
Overview
The MasterControl AIML team supports the Hugging Face initiative of re-creating DeepSeek R1 training, recognizing it as one of the most impactful open-source projects today.
We aim to contribute to reasoning datasets, specifically those where:
- A real-world problem involves generating complex structured output
- It is accompanied by step-by-step reasoning and unstructured input
Challenges in Integrating Generative AI into Systems of Record (SoR)
Integrating Generative AI into Systems of Record (SoR) for health, life sciences, and manufacturing quality is challenging for two reasons:
- These systems rely on strictly structured data formats (e.g., JSON, XML, templates).
- LLM outputs are free-form text and are not guaranteed to conform to a regular expression or context-free grammar.
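To make the mismatch concrete, here is a minimal sketch of the kind of check a system of record would apply before accepting model output. The field names (`batch_id`, `step`, `passed`) and the `validate_record` helper are hypothetical and only illustrate the idea; a real system would more likely validate against its own JSON or XML schema.

```python
import json
from typing import Tuple

# Hypothetical minimal schema: field name -> required Python type.
REQUIRED_FIELDS = {"batch_id": str, "step": int, "passed": bool}


def validate_record(raw_output: str) -> Tuple[bool, str]:
    """Return (ok, message) for one model output string."""
    try:
        record = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"not valid JSON: {exc.msg}"
    if not isinstance(record, dict):
        return False, "top-level value is not an object"
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            return False, f"missing field: {name}"
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        if expected is int and isinstance(value, bool):
            return False, f"wrong type for field: {name}"
        if not isinstance(value, expected):
            return False, f"wrong type for field: {name}"
    return True, "ok"
```

A free-form sentence such as "Batch B-17 passed step 3." fails at the first check, even though it carries the same information as a valid record.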
Techniques for Structured Output Generation
To enforce structured output generation, we explore:
- Strict schema prompting
- Post-processing and output validation
- Reformulating text generation into transitions between finite-state machine states
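The finite-state machine idea can be illustrated with a toy example. The sketch below is not our pipeline and not any library's API: the `FSM` table, the token vocabulary, and the per-step score dictionaries are all made up for illustration. At each step, tokens that are not legal transitions from the current state are masked out, and the best remaining token is chosen greedily.

```python
from typing import Dict, List

# Hypothetical FSM for outputs of the form {"passed":true} or {"passed":false}.
# Each state maps an allowed token to the next state.
FSM: Dict[str, Dict[str, str]] = {
    "start": {"{": "key"},
    "key": {'"passed"': "colon"},
    "colon": {":": "value"},
    "value": {"true": "close", "false": "close"},
    "close": {"}": "done"},
}


def constrained_decode(scores_per_step: List[Dict[str, float]]) -> str:
    """Greedy decoding where tokens the FSM forbids are masked out."""
    state = "start"
    output: List[str] = []
    for scores in scores_per_step:
        if state == "done":
            break
        allowed = FSM[state]
        # Keep only tokens that are legal transitions from the current state.
        candidates = {tok: s for tok, s in scores.items() if tok in allowed}
        if not candidates:
            raise ValueError(f"no legal token available in state {state!r}")
        token = max(candidates, key=candidates.get)
        output.append(token)
        state = allowed[token]
    if state != "done":
        raise ValueError("ran out of steps before reaching the final state")
    return "".join(output)
```

For instance, if the stand-in model scores "Sure" above "{" at the first step and "yes" above "true" at the value step, the mask still forces the output `{"passed":true}`, which parses as JSON by construction.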
DeepSeek R1 Approach
A key challenge is fitting hybrid structured and unstructured historical manufacturing production records to master templates.
We aim to leverage the DeepSeek R1 model, which uses:
- Pure reinforcement learning to train a base language model
- Learning to reason without human supervision
Model Used
- We used DeepSeek's distilled 7B model to create reasoning responses that go from unstructured input to structured output.
Purpose of Reasoning Responses
- The reasoning responses are constructed so that, given unstructured text and a schema of rules, the model must convert the text into output that conforms to the schema. These responses can be used for any unstructured-to-structured conversion task.
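As a rough sketch of how such a response might be consumed, the snippet below builds a prompt from text plus a schema and then separates the reasoning from the final JSON. It assumes the response wraps its reasoning in `<think>...</think>` tags, as R1-style models typically do; the prompt wording and the `build_prompt` and `split_response` helpers are hypothetical and do not describe the dataset's actual columns or format.

```python
import json
import re
from typing import Any, Dict, Tuple


def build_prompt(text: str, schema: Dict[str, Any]) -> str:
    """Combine unstructured text and a schema into one instruction (hypothetical wording)."""
    return (
        "Convert the text below into JSON that follows the schema.\n\n"
        f"Schema:\n{json.dumps(schema, indent=2)}\n\n"
        f"Text:\n{text}\n"
    )


def split_response(response: str) -> Tuple[str, Dict[str, Any]]:
    """Separate the <think> reasoning from the final JSON answer."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Everything after the closing tag is assumed to be the JSON answer.
    answer = response[match.end():] if match else response
    return reasoning, json.loads(answer)
```

Keeping the reasoning and the structured answer separate makes it possible to validate the JSON on its own while still retaining the step-by-step trace for training.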
Next Step: Reasoning Data
Our first step is curating and contributing to reasoning datasets that facilitate structured output generation.
HF-Link
https://huggingface.co/datasets/MasterControlAIML/R1-Reasoning-Unstructured-To-Structured
@eliebak / Admins, can we add this dataset to the reasoning dataset collection? I have seen a lot of reasoning datasets on coding and math (deterministic) problems, but none on reasoning from unstructured text to a structured JSON schema. The full dataset was synthetically generated by us.