CCI3-HQ-Classifier
Model summary
This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 145k annotations generated by Qwen2-72B-Instruct for web samples from the CCI3 dataset.
We used this classifier to build the CCI3-HQ dataset.
How to use in transformers
To load the classifier, use the following code:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("BAAI/cci3-hq-classifier")
model = AutoModelForSequenceClassification.from_pretrained("BAAI/cci3-hq-classifier")

# Sample Chinese passage: a short biography of the Northern Song writer Zeng Gong.
text = "曾巩:为人廉洁奉公,才华横溢,关心民间疾苦曾巩,字子固,是我国北宋时期著名的文学家,政治家和教育家。他的一生政绩颇丰,为百姓们做出了许多的好事,在文学创作上他又是北宋诗文革新的主要人物。他文章写得耐人寻味,表露了自己的真情实感。被后人称之为 唐宋八大家之一 。"
inputs = tokenizer(text, return_tensors="pt", padding="longest", truncation=True)

# Inference only, so skip gradient tracking.
with torch.no_grad():
    outputs = model(**inputs)

# The head has a single regression output, so the logits hold one score per input.
logits = outputs.logits.squeeze(-1).float().numpy()
score = logits.item()
result = {
    "text": text,
    "score": score,
}
print(result)
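The Limitations section recommends int_score >= 3 for data curation, but this card does not define int_score. The sketch below assumes it follows the convention used by the FineWeb-Edu classifier: clamp the raw regression score to the 0 to 5 range, then round to the nearest integer. The helper names and the example scores are hypothetical and only illustrate the filtering step.

```python
def to_int_score(score: float) -> int:
    """Clamp a raw regression score to [0, 5] and round it to an integer.

    Assumption: this mirrors the FineWeb-Edu convention; the card itself
    does not spell out how int_score is derived.
    """
    return int(round(max(0.0, min(score, 5.0))))


def keep_for_curation(score: float, threshold: int = 3) -> bool:
    """Return True when the rounded score meets the recommended threshold."""
    return to_int_score(score) >= threshold


# Hypothetical raw scores, standing in for real classifier outputs.
examples = {"doc_a": 3.62, "doc_b": 1.08, "doc_c": 5.4, "doc_d": -0.3}
kept = [name for name, raw in examples.items() if keep_for_curation(raw)]
print(kept)
```

Note that Python's round() rounds exact halves to the nearest even integer, so a raw score of exactly 2.5 maps to 2 under this sketch.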
Training
The classifier was trained on 145,000 pairs of web samples and their scores from 0 to 5, generated by Qwen2-72B-Instruct. The samples were annotated for educational quality, with 0 meaning not educational and 5 meaning highly educational.
The annotation prompt mostly reuses the FineWeb-Edu prompt.
We added a classification head with a single regression output to BGE-M3 and trained the model for 20 epochs with a learning rate of 3e-4. During training, the embedding and encoder layers were frozen to focus on the classification head and dropout was not used. The model achieved an F1 score of 73% when converted to a binary classifier using a score threshold of 3.
Training Details:
- Model: BGE-M3 with a classification head
- Dataset: 145,000 samples from Qwen2 annotations
- Epochs: 20
- Learning Rate: 3e-4
- Evaluation Metric: F1 score
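The setup described above (frozen embedding and encoder layers, a single regression output, learning rate 3e-4) can be sketched roughly as follows. This is not the authors' training code: TinyRegressor is a hypothetical stand-in for BGE-M3 so the example runs without downloading weights, and the AdamW optimizer and MSE loss are assumptions, since the card does not name either.

```python
import torch
from torch import nn


class TinyRegressor(nn.Module):
    """Hypothetical stand-in for an encoder with a single-output regression head."""

    def __init__(self, vocab_size: int = 100, hidden: int = 16):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.Linear(hidden, hidden)
        self.classifier = nn.Linear(hidden, 1)  # single regression output

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = torch.tanh(self.encoder(self.embeddings(input_ids)))
        pooled = hidden.mean(dim=1)  # mean-pool over the sequence
        return self.classifier(pooled).squeeze(-1)


def freeze_all_but_head(model: nn.Module, head_name: str = "classifier"):
    """Disable gradients everywhere except the head; return the trainable params."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(head_name)
    return [p for p in model.parameters() if p.requires_grad]


torch.manual_seed(0)
model = TinyRegressor()
trainable = freeze_all_but_head(model)

# Assumption: AdamW and MSE loss; only the learning rate comes from the card.
optimizer = torch.optim.AdamW(trainable, lr=3e-4)

input_ids = torch.randint(0, 100, (8, 12))   # fake batch of token ids
targets = torch.randint(0, 6, (8,)).float()  # fake scores in 0..5

loss = nn.functional.mse_loss(model(input_ids), targets)
loss.backward()
optimizer.step()
```

Because only the head's parameters require gradients, the backward pass leaves the embedding and encoder weights untouched, which is the effect the frozen-layer setup is meant to achieve.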
Classification report
We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of Qwen2-annotated samples (14,080 in total, according to the support column below).
              precision    recall  f1-score   support

           0       0.76      0.58      0.66      3890
           1       0.55      0.62      0.58      4896
           2       0.40      0.51      0.45      2703
           3       0.38      0.42      0.40      1536
           4       0.59      0.27      0.37       972
           5       0.33      0.06      0.10        83

    accuracy                           0.54     14080
   macro avg       0.50      0.41      0.43     14080
weighted avg       0.56      0.54      0.54     14080
Confusion matrix
We verify that the predicted educational scores are indeed close to their ground truth: most misclassifications land on an adjacent score, and the errors are mostly attributable to noisy annotations.
                          y_pred
               0      1      2      3      4      5
y_true  0   2244   1514    126      6      0      0
        1    690   3035   1049    117      5      0
        2     24    878   1383    398     20      0
        3      0    118    651    643    124      0
        4      1     13    202    482    264     10
        5      0      0      6     39     33      5
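As a consistency check, the per-class figures in the classification report can be recomputed from this confusion matrix. The sketch below assumes rows are true scores and columns are predicted scores, both ordered 0 to 5; the row sums match the support column of the report, which supports that reading.

```python
# Confusion matrix from this card: rows = y_true (0..5), columns = y_pred (0..5).
confusion = [
    [2244, 1514,  126,   6,   0,  0],
    [ 690, 3035, 1049, 117,   5,  0],
    [  24,  878, 1383, 398,  20,  0],
    [   0,  118,  651, 643, 124,  0],
    [   1,   13,  202, 482, 264, 10],
    [   0,    0,    6,  39,  33,  5],
]

n = len(confusion)
support = [sum(row) for row in confusion]                       # true count per class
predicted = [sum(confusion[i][j] for i in range(n)) for j in range(n)]  # predicted count
total = sum(support)
accuracy = sum(confusion[k][k] for k in range(n)) / total

report = {}
for k in range(n):
    tp = confusion[k][k]
    precision = tp / predicted[k] if predicted[k] else 0.0
    recall = tp / support[k] if support[k] else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    report[k] = {"precision": precision, "recall": recall, "f1": f1}
    print(f"{k}  {precision:.2f}  {recall:.2f}  {f1:.2f}  {support[k]}")

print(f"accuracy {accuracy:.2f} on {total} samples")
```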
Limitations
While the CCI3-HQ classifier performs well in distinguishing high-quality educational content for the CCI3 dataset, there are some limitations:
- Scope: The model's performance may vary across different datasets, particularly when applied to out-of-distribution samples. It is specifically designed to handle educational content related to primary and grade school levels and may exhibit lower performance on content intended for higher education or specialized domains.
- Bias: The model's performance relies on the quality and representativeness of both the training data and the LLM used for annotation. Biases in either can influence the classifier's decisions. There is a risk of overfitting to content that appears more academic, leading to higher scores. We recommend using int_score >= 3 as a threshold for data curation.
- Context: The classifier operates by evaluating individual web pages or extracts without considering the broader context, which may limit its effectiveness in certain scenarios.
The training and inference code is available on GitHub: https://github.com/FlagAI-Open/FlagAI/tree/master/examples/CCI3-HQ