mikemcrae25 committed on
Commit e4144f4 · verified · 1 Parent(s): c6715fc

Update README.md

Files changed (1)
  1. README.md +78 -8
README.md CHANGED
@@ -1,12 +1,82 @@
- # Newswire Classifier
- This repository contains three separate BERT models for classifying news articles as from AP, UPI, or NEA.
- - **AP Classifier**: Trained on articles from AP.
- - **UPI Classifier**: Trained on articles from UPI.
- - **NEA Classifier**: Trained on articles from NEA.

- ### Usage
  ```python
  from transformers import AutoModelForSequenceClassification, AutoTokenizer
- model = AutoModelForSequenceClassification.from_pretrained('username/newswire_classifier/AP')
- tokenizer = AutoTokenizer.from_pretrained('username/newswire_classifier/AP')
  ```
+ ---
+ license: mit
+ language:
+ - en
+ metrics:
+ - accuracy
+ - bertscore
+ - f1
+ base_model:
+ - google-bert/bert-base-uncased
+ pipeline_tag: text-classification
+ ---
+ # Newswire Classifier (AP, UPI, NEA) - BERT Transformers
+
+ ## 📘 Overview
+ This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:
+ - **AP (Associated Press)**
+ - **UPI (United Press International)**
+ - **NEA (Newspaper Enterprise Association)**
+
+ The models are designed for historical news classification on public-domain newswire articles (1960–1975).
+
+ ## 🧠 Model Architecture
+ - **Base Model:** `bert-base-uncased`
+ - **Task:** Binary classification (`1` if from the target newswire, `0` otherwise)
+ - **Optimizer:** AdamW
+ - **Loss Function:** Binary cross-entropy with logits
+ - **Batch Size:** 16
+ - **Epochs:** 4
+ - **Learning Rate:** 2e-5
+ - **Device:** TPU (v2-8) in Google Colab
+
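+ The training script itself is not part of this repository; the following is a minimal sketch of the setup described above, assuming a single-logit head trained with `BCEWithLogitsLoss` (the example texts, labels, and `max_length=128` are placeholders, and the TPU-specific setup is omitted):
+
+ ```python
+ import torch
+ from torch.utils.data import DataLoader, TensorDataset
+ from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
+ # num_labels=1 yields one logit per article, matching BCE-with-logits training
+ model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)
+
+ # Hypothetical training data: raw text plus 0/1 newswire labels
+ train_texts = ["(AP) Example headline ...", "(UPI) Another headline ..."]
+ train_labels = torch.tensor([1.0, 0.0])
+
+ enc = tokenizer(train_texts, padding=True, truncation=True, max_length=128, return_tensors="pt")
+ loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"], train_labels),
+                     batch_size=16, shuffle=True)
+
+ optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
+ loss_fn = torch.nn.BCEWithLogitsLoss()
+
+ model.train()
+ for epoch in range(4):
+     for input_ids, attention_mask, labels in loader:
+         logits = model(input_ids=input_ids, attention_mask=attention_mask).logits.squeeze(-1)
+         loss = loss_fn(logits, labels)
+         loss.backward()
+         optimizer.step()
+         optimizer.zero_grad()
+ ```
+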
+ ## 📊 Training Data
+ - **Source:** Historical newspapers (1960–1975, public domain)
+ - **Articles:** 4,000 per training round (1,000 from the target newswire, 3,000 from other sources)
+ - **Features Used:** Headline, author, and the first 100 characters of the article.
+ - **Labeling:** `1` for articles from the target newswire, `0` for all others.
+
+ ## 🚀 Model Performance
+ | Model | Accuracy | Precision | Recall | F1 Score |
+ |-------|----------|-----------|--------|----------|
+ | **AP** | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
+ | **UPI** | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
+ | **NEA** | 0.9875 | 0.9880 | 0.9875 | 0.9876 |
+
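+ As a point of reference, metrics of this kind can be computed on a held-out split with scikit-learn (not a dependency of this repo; the `y_true`/`y_pred` arrays below are hypothetical):
+
+ ```python
+ from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
+
+ y_true = [1, 0, 1, 1, 0]  # hypothetical gold labels
+ y_pred = [1, 0, 1, 0, 0]  # hypothetical model predictions
+
+ print("Accuracy: ", accuracy_score(y_true, y_pred))
+ print("Precision:", precision_score(y_true, y_pred))
+ print("Recall:   ", recall_score(y_true, y_pred))
+ print("F1:       ", f1_score(y_true, y_pred))
+ ```
+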
+ ## 🛠️ Usage
+ ### Installation
+ ```bash
+ pip install transformers torch
+ ```
+ ### Example Inference (AP Classifier)
  ```python
  from transformers import AutoModelForSequenceClassification, AutoTokenizer
+
+ model = AutoModelForSequenceClassification.from_pretrained("username/newswire_classifier/AP")
+ tokenizer = AutoTokenizer.from_pretrained("username/newswire_classifier/AP")
+
+ text = "(AP) President speaks at conference..."
+ inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
+ outputs = model(**inputs)
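+ # NOTE: argmax assumes the head exposes two logits; if the checkpoint instead
+ # has a single logit (consistent with BCE-with-logits training), use a sigmoid
+ # threshold: prediction = int(outputs.logits.sigmoid().item() > 0.5)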
+ prediction = outputs.logits.argmax().item()
+ print("AP Article" if prediction == 1 else "Not AP Article")
+ ```
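+
+ Note that `username/newswire_classifier/AP` has the form `namespace/repo/folder`, which is not itself a valid Hub repo id. If the three classifiers are stored as subfolders of a single repo, `from_pretrained` takes the folder through its `subfolder` argument (sketch, assuming that layout):
+
+ ```python
+ model = AutoModelForSequenceClassification.from_pretrained("username/newswire_classifier", subfolder="AP")
+ tokenizer = AutoTokenizer.from_pretrained("username/newswire_classifier", subfolder="AP")
+ ```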
+
+ ## ⚙️ Recommended Usage Notes
+ - The models were trained on headlines, authors, and the first 100 characters of each article, since the newswire credit often appears in those sections. Formatting inference inputs the same way may improve accuracy; see the sketch below.
+
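+ The exact concatenation used in training is not documented here; the following shows one plausible way to assemble an inference input in that spirit (the helper and field values are hypothetical):
+
+ ```python
+ def build_input(headline: str, author: str, body: str) -> str:
+     # Mirror the training features: headline, author, first 100 characters.
+     return f"{headline} {author} {body[:100]}"
+
+ text = build_input(
+     headline="President speaks at conference",
+     author="By JOHN SMITH",
+     body="WASHINGTON (AP) - The president addressed reporters on Tuesday...",
+ )
+ ```
+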
+ ## 📜 Licensing & Data Source
+ - **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
+ - **License:** Public domain (for data) and MIT License (for model and code).
+
+ ## 💬 Citation
+ If you use these models, please cite:
  ```
+ @misc{newswire_classifier,
+   author = {Your Name},
+   title = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
+   year = {2025},
+   publisher = {Hugging Face},
+   url = {https://huggingface.co/username/newswire_classifier}
+ }
+ ```