---
license: mit
language:
- en
metrics:
- accuracy
- bertscore
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---
# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## πŸ“˜ Overview
This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:
- **AP (Associated Press)**
- **UPI (United Press International)**
- **NEA (Newspaper Enterprise Association)**

The models are designed for historical news classification from public-domain newswire articles (1960–1975).

## 🧠 Model Architecture
- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (`1` if from the specific newswire, `0` otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab
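
A minimal sketch of this training setup is shown below. The example texts and labels are placeholders, the single-logit head (`num_labels=1`) is an assumption chosen to pair naturally with BCE-with-logits, and the sketch runs on CPU/GPU for simplicity rather than the TPU used originally; the hyperparameters follow the card.

```python
# Sketch of the training setup described above (assumptions noted in comments).
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# Assumption: a single-logit head, so BCEWithLogitsLoss applies directly.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1
)

# Placeholder training examples; 1.0 = target newswire, 0.0 = other source.
texts = ["(AP) President speaks at conference...", "Local council meets today..."]
labels = [1.0, 0.0]

enc = tokenizer(texts, truncation=True, max_length=128, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size from the card

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # lr from the card
loss_fn = torch.nn.BCEWithLogitsLoss()

model.train()
for epoch in range(4):  # 4 epochs, per the card
    for input_ids, attention_mask, batch_labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
        loss = loss_fn(logits, batch_labels)
        loss.backward()
        optimizer.step()
```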

## πŸ“Š Training Data
- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4000 per training round (1000 from target newswire, 3000 from other sources)
- **Features Used:** Headline, author, and first 100 characters of the article.
- **Labeling:** `1` for articles from the target newswire, `0` for all others.
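
For illustration, the feature and labeling scheme could be implemented as below; the DataFrame and its column names (`headline`, `author`, `article`, `source`) are hypothetical, not the card's actual preprocessing code.

```python
# Hypothetical feature/label construction for one training round (AP model).
import pandas as pd

df = pd.DataFrame({
    "headline": ["President speaks at conference"],
    "author":   ["John Smith"],
    "article":  ["(AP) WASHINGTON - The president addressed reporters ..."],
    "source":   ["AP"],  # ground-truth newswire of each article
})

TARGET = "AP"
# Combine headline, author, and the first 100 characters of the article body.
df["text"] = df["headline"] + " " + df["author"] + " " + df["article"].str[:100]
# Label 1 for the target newswire, 0 for all others.
df["label"] = (df["source"] == TARGET).astype(int)
```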

## πŸš€ Model Performance
| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|----------|-------|----------|
| **AP** | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
| **UPI** | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| **NEA** | 0.9875 | 0.9880 | 0.9875 | 0.9876 |

## πŸ› οΈ Usage
### Installation
```bash
pip install transformers torch
```
### Example Inference (AP Classifier)
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Hub repo IDs take only a namespace and a name; assuming each classifier sits
# in its own subfolder (AP/, UPI/, NEA/), it is loaded via `subfolder`.
model = AutoModelForSequenceClassification.from_pretrained("mike-mcrae/newswire_classifier", subfolder="AP")
tokenizer = AutoTokenizer.from_pretrained("mike-mcrae/newswire_classifier", subfolder="AP")

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
with torch.no_grad():
    logits = model(**inputs).logits

# Works whether the head emits a single BCE-style logit or two class logits.
if logits.shape[-1] == 1:
    prediction = int(torch.sigmoid(logits).item() > 0.5)
else:
    prediction = logits.argmax(dim=-1).item()
print("AP Article" if prediction == 1 else "Not AP Article")
```
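
The `subfolder` argument above assumes the three classifiers live in `AP/`, `UPI/`, and `NEA/` subfolders of the repository; adjust the path if the repository is laid out differently, and swap `AP` for `UPI` or `NEA` to load the other two classifiers.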

## βš™οΈ Recommended Usage Notes
- The models were trained on the headline and author followed by the first 100 characters of the article body, since the newswire attribution (e.g. `(AP)`) often appears in these fields. Formatting inference inputs the same way may improve accuracy; a minimal sketch follows.
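
A minimal formatter matching that description, with hypothetical field names:

```python
def build_input(headline: str, author: str, body: str) -> str:
    # Hypothetical helper: headline + author + first 100 chars of the body,
    # mirroring the training-feature description above.
    return f"{headline} {author} {body[:100]}"

text = build_input(
    "President speaks at conference",
    "John Smith",
    "(AP) WASHINGTON - The president addressed reporters on Tuesday ...",
)
```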

## πŸ“œ Licensing & Data Source
- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
- **License:** Public domain (for data) and MIT License (for model and code).

## πŸ’¬ Citation
If you use these models, please cite:
```
@misc{newswire_classifier,
  author = {McRae, Michael},
  title = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/mike-mcrae/newswire_classifier}
}
```