Update README.md
README.md CHANGED
---
license: mit
language:
- en
metrics:
- accuracy
- bertscore
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## Overview

This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:

- **AP (Associated Press)**
- **UPI (United Press International)**
- **NEA (Newspaper Enterprise Association)**

The models are designed for classifying historical, public-domain newswire articles (1960–1975).

## Model Architecture

- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (`1` if the article is from the target newswire, `0` otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab

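The hyperparameters above map onto a standard PyTorch fine-tuning loop. The sketch below is illustrative only: the single-logit head, the dataloader format, and the GPU/CPU device handling are assumptions, not the exact Colab TPU script used to train these models.

```python
# Minimal fine-tuning sketch matching the hyperparameters above (AdamW,
# BCE-with-logits, lr 2e-5, 4 epochs). The single-logit head and the dataloader
# format are assumptions; the original training ran on a Colab TPU with batch size 16.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1).to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
loss_fn = torch.nn.BCEWithLogitsLoss()

def train(dataloader, epochs=4):
    """dataloader yields (texts, labels): a list of strings and a tensor of 0/1 labels."""
    model.train()
    for _ in range(epochs):
        for texts, labels in dataloader:
            enc = tokenizer(list(texts), return_tensors="pt", truncation=True,
                            padding=True, max_length=128).to(device)
            logits = model(**enc).logits.squeeze(-1)
            loss = loss_fn(logits, labels.float().to(device))
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```
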
## Training Data

- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4000 per training round (1000 from the target newswire, 3000 from other sources)
- **Features Used:** Headline, author, and the first 100 characters of the article
- **Labeling:** `1` for articles from the target newswire, `0` for all others

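As a concrete illustration of this labeling scheme, the snippet below assembles one training example from the fields described above. The field names (`headline`, `author`, `body`, `wire`) are hypothetical placeholders, not the original dataset schema.

```python
# Hypothetical example-construction helper; field names are illustrative only.
def make_example(article: dict, target_wire: str = "AP"):
    # Concatenate headline, author, and the first 100 characters of the body,
    # mirroring the feature description above.
    text = f"{article['headline']} {article['author']} {article['body'][:100]}"
    # Label is 1 for articles from the target newswire, 0 for all others.
    label = 1 if article["wire"] == target_wire else 0
    return text, label

text, label = make_example({
    "headline": "President speaks at conference",
    "author": "By JANE SMITH",
    "body": "(AP) WASHINGTON - The president addressed reporters on Tuesday ...",
    "wire": "AP",
})  # -> text for the tokenizer, label = 1
```
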
## Model Performance

| Model   | Accuracy | Precision | Recall | F1 Score |
|---------|----------|-----------|--------|----------|
| **AP**  | 0.9925   | 0.9926    | 0.9925 | 0.9925   |
| **UPI** | 0.9999   | 0.9999    | 0.9999 | 0.9999   |
| **NEA** | 0.9875   | 0.9880    | 0.9875 | 0.9876   |

## Usage

### Installation

```bash
pip install transformers torch
```

### Example Inference (AP Classifier)

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the AP classifier and its tokenizer ("username/newswire_classifier/AP" is a placeholder path)
model = AutoModelForSequenceClassification.from_pretrained("username/newswire_classifier/AP")
tokenizer = AutoTokenizer.from_pretrained("username/newswire_classifier/AP")

# A snippet formatted like the training data, with the wire credit near the start
text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
outputs = model(**inputs)

# Index of the highest-scoring class: 1 = AP, 0 = not AP
prediction = outputs.logits.argmax().item()
print("AP Article" if prediction == 1 else "Not AP Article")
```

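If a confidence score is more useful than a hard label, the logits from the example above can be converted to probabilities. This is a small assumed extension of the card's example, assuming the head produces two class logits as the `argmax` call implies.

```python
import torch

# Softmax over the two class logits gives P(not AP) and P(AP).
probs = torch.softmax(outputs.logits, dim=-1).squeeze()
print(f"P(AP Article) = {probs[1].item():.3f}")
```
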
## Recommended Usage Notes

- The models were trained on headlines, authors, and the first 100 characters of each article, since the newswire credit usually appears in these sections. Formatting inference inputs the same way may improve accuracy (see the sketch below).

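One way to follow this recommendation is to assemble the inference input with the same headline + author + first-100-characters layout used in training. The sketch below reuses the model and tokenizer from the inference example; the field values are hypothetical.

```python
# Build an inference input in the same layout as the training features.
headline = "President speaks at conference"
author = "By JANE SMITH"
body = "(AP) WASHINGTON - The president addressed reporters on Tuesday ..."

text = f"{headline} {author} {body[:100]}"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
prediction = model(**inputs).logits.argmax().item()
print("AP Article" if prediction == 1 else "Not AP Article")
```
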
## Licensing & Data Source

- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
- **License:** Public domain (for the data) and MIT License (for the models and code).

## Citation

If you use these models, please cite:

```bibtex
@misc{newswire_classifier,
  author    = {Your Name},
  title     = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/username/newswire_classifier}
}
```