---
license: mit
language:
- en
metrics:
- accuracy
- bertscore
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## Overview

This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:

- **AP (Associated Press)**
- **UPI (United Press International)**
- **NEA (Newspaper Enterprise Association)**

The models are designed for historical news classification of public-domain newswire articles (1960–1975).

## Model Architecture

- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (`1` if from the specific newswire, `0` otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab
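
The training code itself is not part of this repository, but a minimal sketch of how these settings fit together might look as follows. The inline two-example dataset and the one-hot BCE targets are illustrative assumptions, not the original pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tiny illustrative stand-in for the real training set
train_texts = ["(AP) WASHINGTON - President speaks ...", "Local council meets tonight ..."]
train_labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(train_texts, truncation=True, max_length=128, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size 16, as above

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # AdamW, lr 2e-5
loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy with logits

model.train()
for epoch in range(4):  # 4 epochs
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # One-hot targets so BCE-with-logits matches the two-logit head
        targets = torch.nn.functional.one_hot(labels, num_classes=2).float()
        loss = loss_fn(logits, targets)
        loss.backward()
        optimizer.step()
```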

## Training Data

- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4,000 per training round (1,000 from the target newswire, 3,000 from other sources)
- **Features Used:** Headline, author, and the first 100 characters of the article
- **Labeling:** `1` for articles from the target newswire, `0` for all others
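
As an illustration of this feature construction and labeling scheme (the records and field names here are hypothetical, not the actual dataset):

```python
# Hypothetical article records; field names are illustrative only
articles = [
    {"headline": "President speaks at conference", "author": "AP Staff",
     "body": "(AP) WASHINGTON - The President told reporters ...", "source": "AP"},
    {"headline": "Local council meets", "author": "J. Smith",
     "body": "The city council convened Tuesday ...", "source": "local"},
]

def make_example(article, target_wire="AP"):
    # Input text: headline + author + first 100 characters of the article,
    # where the newswire credit usually appears
    text = f"{article['headline']} {article['author']} {article['body'][:100]}"
    # Label 1 for the target newswire, 0 for everything else
    label = 1 if article["source"] == target_wire else 0
    return text, label

examples = [make_example(a) for a in articles]
```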

## Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| **AP** | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
| **UPI** | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| **NEA** | 0.9875 | 0.9880 | 0.9875 | 0.9876 |

## Usage

### Installation

```bash
pip install transformers torch
```

### Example Inference (AP Classifier)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A Hub repo ID cannot include a subpath, so the AP model is loaded
# from the "AP" subfolder of the repository
model = AutoModelForSequenceClassification.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)
tokenizer = AutoTokenizer.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)
model.eval()

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

prediction = outputs.logits.argmax(dim=-1).item()
print("AP Article" if prediction == 1 else "Not AP Article")
```
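
Since the three classifiers are independent, an article can also be scored against all of them in one loop; a softmax over the two logits gives a rough per-wire confidence. This sketch assumes the same `subfolder` layout as above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

text = "(UPI) Officials announced the agreement today ..."

for wire in ["AP", "UPI", "NEA"]:
    model = AutoModelForSequenceClassification.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability assigned to class 1 ("from this newswire")
    prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"{wire}: {prob:.3f}")
```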

## Recommended Usage Notes

- The models were trained on inputs combining the headline, the author, and the first 100 characters of the article, since the newswire credit often appears in these fields. Formatting inference inputs the same way may improve accuracy; see the sketch below.
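
For example, an inference input mirroring that format could be assembled like this (the field values are purely illustrative):

```python
headline = "President speaks at conference"
author = "AP Staff Writer"
article = "(AP) WASHINGTON - The President addressed reporters on Tuesday ..."

# Mirror the training-time features: headline + author + first 100
# characters of the article
text = f"{headline} {author} {article[:100]}"
```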

## Licensing & Data Source

- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
- **License:** Public domain (for data) and MIT License (for model and code).

## Citation

If you use these models, please cite:

```
@misc{newswire_classifier,
  author    = {McRae, Michael},
  title     = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mike-mcrae/newswire_classifier}
}
```