Structured Data Extraction from PDFs – Our New Model Release!
Today, Philipp Schmid shared an exciting post about extracting structured outputs from PDFs using Google DeepMind's Gemini 2.0. Inspired by this direction, we’ve been working on a similar problem and are thrilled to share our progress!
🔗 Philipp’s post: LinkedIn
🛠️ Our Approach
Recognizing the importance of structured data extraction early on, we initially replicated DeepSeek R1’s strategy on Qwen2.5-1.5B and released our own R1 model, which gained 1,400+ downloads in just 3 days!
We then fine-tuned it further on DeepSeek-32B responses, incorporating strategic enhancements in reward modeling. Today, we're excited to release an even better model for the community!
📌 Key Highlights:
✅ Replication of DeepSeek’s strategy with improved reward mechanisms
✅ Trained on Qwen2.5-1.5B with additional fine-tuning on DeepSeek Distilled responses
✅ Generates high-quality JSON outputs from unstructured text
✅ Improved handling of schema-based extraction
🔗 Try It Out!
Check out our model here:
👉 Hugging Face Model
✨ Example Output
Our model can seamlessly convert unstructured text into structured JSON formats. Here’s an example of its Chain of Thought (CoT) reasoning and corresponding structured output:
🔍 CoT Explanation:
Chain of Thought Explanation:
To fill out the blank schema based on the Text and make it look exactly like the Filled Schema, I would start by understanding what each part represents.
First, looking at the Blank Schema, it has placeholders where actual content should go. The goal here is to replace those placeholders with appropriate data extracted from the Text so that when completed, it matches the Filled Schema perfectly.
I notice that both schemas have a similar structure, just with different types of components. My first step would be to identify how each component type maps back to elements in the Text.
For example, in the Introduction section under "Key Metrics," there's a table listing two variables: Defect Rate at 0.5% and Yield Rate at 98%. In the Filled Schema, these become entries in the 'variables' array of a Table component. So I'd extract these values from the corresponding parts in the Text and plug them into their respective places in the Blank Schema.
Next, moving down to the Subsection titled "QA Checklists," there are checkboxes for Confirmations about defects being identified and corrected versus audits conducted regularly. Looking at the Text again, I see bullet points describing whether particular checks were done—like confirming if defects had been found—and then noting what wasn't confirmed. From this, I can infer that each checkbox corresponds to one of these confirmation statements.
Then comes the Detail Section called "Standards Overview." Here, there's another table comparing ISO 9001 and ASQ-Certified QM. In the Text, under QA Standards Comparison, they mention checking both options with clear indicators of true/false states. Therefore, I'd replicate this structure in the Blank Schema by adding these same rows of comparison details.
Throughout this entire process, my focus remains on matching each piece of textual information to its correct location within the schema without altering any other parts beyond what's specified. By methodically identifying and populating each placeholder based on the Text's content, the end result should mirror the Filled Schema precisely because every element in the filled version was accounted for during extraction.
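Since the model returns its reasoning and the JSON together, the two usually need to be separated before the JSON can be used downstream. Here is a minimal sketch of one way to do that with Python's standard `json` module. It assumes the response is a single string in which the explanation comes first and the structured output is the first span that parses as a JSON object; the exact response format is not specified in this post, and `split_reasoning_and_json` is a hypothetical helper name, not part of the model's tooling.

```python
import json


def split_reasoning_and_json(response: str):
    """Return (reasoning_text, parsed_object) from a combined response.

    Assumption: the reasoning precedes the JSON, and the first '{' that
    begins a valid JSON object marks the start of the structured output.
    """
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at index `start`
            # and ignores whatever text follows it.
            parsed, _end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            # This brace was part of the prose; try the next one.
            start = response.find("{", start + 1)
            continue
        return response[:start].strip(), parsed
    raise ValueError("no JSON object found in response")


example = (
    "Chain of Thought Explanation:\n"
    "Map {placeholders} to values.\n"
    '{"id": "0.0", "level": 0, "children": []}'
)
reasoning, data = split_reasoning_and_json(example)
print(data["id"])
```

Scanning for the first parseable object, rather than cutting at the first brace, keeps the helper from tripping over braces that appear inside the explanation itself.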
📜 Generated JSON Output:
{
  "id": "0.0",
  "title": "Quality Assurance Manual for Manufacturing Process",
  "level": 0,
  "level_type": "ROOT",
  "component": [
    {
      "idc": 0,
      "component_type": "PARAGRAPH",
      "metadata": "<overview>This manual covers the QA procedures and guidelines for various manufacturing processes.</overview>",
      "properties": {
        "variables": [],
        "content": []
      }
    }
  ],
  "children": [
    {
      "id": "1.0",
      "title": "Introduction to Quality Assurance",
      "level": 1,
      "level_type": "SECTION",
      "component": [
        {
          "idc": 0,
          "component_type": "PARAGRAPH",
          "metadata": "",
          "properties": {
            "variables": [],
            "content": [
              "Quality Assurance (QA) involves systematic activities designed to ensure consistent production of reliable and safe products.",
              "It aims to reduce variability in the final product and improve overall efficiency."
            ]
          }
        },
        {
          "idc": 1,
          "component_type": "TABLE",
          "metadata": "",
          "properties": {
            "variables": [
              {
                "idx": "0.0",
                "name": "Metric",
                "value": "Defect Rate",
                "unit": "%",
                "metrics": true,
                "formula": false
              },
              {
                "idx": "0.1",
                "name": "Target",
                "value": 0.5,
                "unit": null,
                "metrics": true,
                "formula": false
              }
            ],
            "content": []
          }
        }
      ],
      "children": [
        {
          "id": "1.1",
          "title": "QA in Manufacturing Processes",
          "level": 2,
          "level_type": "SUBSECTION",
          "component": [
            {
              "idc": 0,
              "component_type": "PARAGRAPH",
              "metadata": "",
              "properties": {
                "variables": [],
                "content": [
                  "Manufacturing processes require strict adherence to QA procedures to ensure product reliability and safety."
                ]
              }
            },
            {
              "idc": 1,
              "component_type": "CHECKBOX",
              "metadata": "",
              "properties": {
                "variables": [
                  {
                    "idx": "0.0",
                    "name": "Process Control",
                    "value": true,
                    "unit": null,
                    "metrics": false,
                    "formula": false
                  },
                  {
                    "idx": "0.1",
                    "name": "Raw Material Inspection",
                    "value": false,
                    "unit": null,
                    "metrics": false,
                    "formula": false
                  }
                ],
                "content": [
                  "Check if the following QA procedures are followed:",
                  "✔ Process Control ✔ Raw Material Inspection"
                ]
              }
            }
          ],
          "children": [
            {
              "id": "1.1.1",
              "title": "Standard QA Practices",
              "level": 3,
              "level_type": "DETAIL_N",
              "component": [
                {
                  "idc": 0,
                  "component_type": "PARAGRAPH",
                  "metadata": "",
                  "properties": {
                    "variables": [],
                    "content": [
                      "Adherence to ISO 9001 standards helps maintain quality control."
                    ]
                  }
                },
                {
                  "idc": 1,
                  "component_type": "TABLE",
                  "metadata": "",
                  "properties": {
                    "variables": [
                      {
                        "idx": "0.0",
                        "name": "Comparison",
                        "value": "ISO 9001 vs ASQ-Certified QM",
                        "unit": null,
                        "metrics": false,
                        "formula": false
                      }
                    ],
                    "content": []
                  }
                }
              ],
              "children": []
            }
          ]
        }
      ]
    }
  ]
}
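The output above is a tree: every node carries `component` entries and nested `children`, and each child's `level` is one deeper than its parent's. Here is a minimal sketch of how such a result could be walked to flatten its variables and sanity-check the nesting. The field names are taken from the example output; the helper names (`iter_nodes`, `collect_variables`, `level_mismatches`) and the trimmed `sample` are illustrative assumptions, not part of the released model or its tooling.

```python
def iter_nodes(node):
    """Yield a node and all of its descendants, depth first."""
    yield node
    for child in node.get("children", []):
        yield from iter_nodes(child)


def collect_variables(root):
    """Flatten every variable into (node id, component type, name, value)."""
    rows = []
    for node in iter_nodes(root):
        for comp in node.get("component", []):
            for var in comp.get("properties", {}).get("variables", []):
                rows.append(
                    (node["id"], comp["component_type"], var["name"], var["value"])
                )
    return rows


def level_mismatches(root):
    """Return ids of children whose level is not parent level + 1."""
    bad = []
    for node in iter_nodes(root):
        for child in node.get("children", []):
            if child["level"] != node["level"] + 1:
                bad.append(child["id"])
    return bad


# Trimmed-down stand-in for the generated output (illustrative only).
sample = {
    "id": "0.0", "level": 0, "component": [],
    "children": [{
        "id": "1.0", "level": 1,
        "component": [{
            "idc": 1, "component_type": "TABLE",
            "properties": {"variables": [
                {"idx": "0.0", "name": "Metric", "value": "Defect Rate"},
                {"idx": "0.1", "name": "Target", "value": 0.5},
            ], "content": []},
        }],
        "children": [],
    }],
}

for row in collect_variables(sample):
    print(row)
print(level_mismatches(sample))
```

Checks like these are cheap to run on every generation and can help surface outputs where the nesting or the extracted values drifted from the source text.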
🔥 We Need Your Feedback!
We invite the community to test our model and share feedback. Your insights will help us improve its capabilities even further.
Let’s push the boundaries of structured data extraction together! 💡