Automatic Evaluation Models for Textual Data Quality (NL & CL)
Automatically assess the quality of textual data using a clear and intuitive scale, adapted for both natural language (NL) and code language (CL).
We compare two distinct approaches:
- A unified model that handles both NL and CL jointly: EuroBERT-210m-Quality
- A dual-model approach with separate, specialized models for NL and CL
Classification Categories (see the inference sketch below):
- Harmful: potentially incorrect or dangerous content.
- Low: low-quality data with major issues.
- Medium: acceptable quality, but with room for improvement.
- High: good to very good quality, ready for use without reservation.
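A minimal inference sketch in Python, assuming the model follows the standard transformers sequence-classification API; the exact Hub repo id, the trust_remote_code requirement, and the label names are assumptions taken from this card, not a verified interface:

```python
# Minimal inference sketch. The repo id ("EuroBERT-210m-Quality"), the
# trust_remote_code flag, and the label names are assumptions from this card.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "EuroBERT-210m-Quality"  # replace with the actual Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, trust_remote_code=True
)
model.eval()

text = "def add(a, b):\n    return a + b"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

# id2label is expected to map class ids to Harmful / Low / Medium / High.
label = model.config.id2label[logits.argmax(dim=-1).item()]
print(label)
```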
Supported Languages:
- Natural Language: French 🇫🇷, English 🇬🇧, Spanish 🇪🇸
- Code Language: Python 🐍, Java ☕, JavaScript 📜, C/C++ ⚙️
Performance
- F1-score: Unified Model (NL + CL)

| Category | Global (NL + CL) | NL   | CL   |
|----------|------------------|------|------|
| Harmful  | 0.81             | 0.87 | 0.75 |
| Low      | 0.60             | 0.72 | 0.44 |
| Medium   | 0.60             | 0.74 | 0.49 |
| High     | 0.74             | 0.77 | 0.72 |
| Accuracy | 0.70             | 0.78 | 0.62 |
- F1-score: Separate Models

| Category | Global (NL + CL) | NL   | CL   |
|----------|------------------|------|------|
| Harmful  | 0.83             | 0.89 | 0.78 |
| Low      | 0.59             | 0.71 | 0.46 |
| Medium   | 0.63             | 0.77 | 0.49 |
| High     | 0.76             | 0.79 | 0.73 |
| Accuracy | 0.71             | 0.80 | 0.63 |
Key Performance Metrics:
Unified Model (NL + CL):
- Overall accuracy: ~70%
- High reliability on harmful data (F1-score: 0.81)
Separate Models:
- Natural Language (NL): ~80% accuracy
  - Excellent performance on harmful data (F1-score: 0.89)
- Code Language (CL): ~63% accuracy
  - Good detection of harmful data (F1-score: 0.78)
Training Dataset:
Common Use Cases:
- Automatic validation of text corpora before integration into NLP or code generation pipelines (see the filtering sketch after this list).
- Quality assessment of community contributions (forums, Stack Overflow, GitHub).
- Automated pre-processing to enhance NLP or code generation system performance.
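For the corpus-validation use case, a minimal filtering sketch, assuming the same repo id as above and that the pipeline returns the label strings listed under Classification Categories:

```python
# Corpus-filtering sketch: keep only documents predicted Medium or High
# before they enter a training pipeline. The repo id and the exact label
# strings are assumptions taken from this card.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="EuroBERT-210m-Quality",  # replace with the actual Hub repo id
    trust_remote_code=True,
)

corpus = [
    "A well-written paragraph explaining gradient descent.",
    "asdkjh 4444 click here!!!",
]
ACCEPTED = {"Medium", "High"}

predictions = classifier(corpus, truncation=True)
filtered = [doc for doc, pred in zip(corpus, predictions) if pred["label"] in ACCEPTED]
print(f"kept {len(filtered)}/{len(corpus)} documents")
```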
Recommendations:
- For specialized contexts, use the separate NL and CL models for optimal results (a routing sketch follows this list).
- The unified model is suitable for quick assessments when the data context is unknown or mixed.
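A sketch of that routing logic; the separate NL and CL repo ids are hypothetical placeholders, since this card names only the unified model:

```python
# Routing sketch: use a specialized model when the data type is known and
# fall back to the unified model otherwise. The NL/CL repo ids below are
# hypothetical placeholders; this card only names the unified model.
from transformers import pipeline

UNIFIED_ID = "EuroBERT-210m-Quality"
NL_ID = "<separate-nl-quality-model>"  # hypothetical placeholder
CL_ID = "<separate-cl-quality-model>"  # hypothetical placeholder


def get_classifier(data_type=None):
    """Return a quality classifier for "nl", "cl", or unknown/mixed data."""
    model_id = {"nl": NL_ID, "cl": CL_ID}.get(data_type, UNIFIED_ID)
    return pipeline("text-classification", model=model_id, trust_remote_code=True)


clf = get_classifier("cl")
print(clf("while True: pass", truncation=True))
```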
Citation
Please cite or link back to this model on the Hugging Face Hub if you use it in your projects.