---
license: mit
language:
- en
metrics:
- accuracy
- bertscore
- f1
base_model:
- google-bert/bert-base-uncased
pipeline_tag: text-classification
---

# Newswire Classifier (AP, UPI, NEA) - BERT Transformers

## Overview

This repository contains three separately trained BERT models for identifying whether a newspaper article was produced by one of three major newswire services:

- **AP (Associated Press)**
- **UPI (United Press International)**
- **NEA (Newspaper Enterprise Association)**

The models are designed for historical news classification of public-domain newswire articles (1960–1975).

## Model Architecture

- **Base Model:** `bert-base-uncased`
- **Task:** Binary classification (`1` if from the specific newswire, `0` otherwise)
- **Optimizer:** AdamW
- **Loss Function:** Binary Cross-Entropy with Logits
- **Batch Size:** 16
- **Epochs:** 4
- **Learning Rate:** 2e-5
- **Device:** TPU (v2-8) in Google Colab
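
The training code itself is not part of this repository, but a minimal sketch of how these settings fit together might look as follows. The inline two-example dataset and the one-hot BCE targets are illustrative assumptions, not the original pipeline:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Tiny illustrative stand-in for the real training set
train_texts = ["(AP) WASHINGTON - President speaks ...", "Local council meets tonight ..."]
train_labels = [1, 0]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

enc = tokenizer(train_texts, truncation=True, max_length=128, padding=True, return_tensors="pt")
dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
loader = DataLoader(dataset, batch_size=16, shuffle=True)  # batch size 16, as above

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # AdamW, lr 2e-5
loss_fn = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy with logits

model.train()
for epoch in range(4):  # 4 epochs
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask).logits
        # One-hot targets so BCE-with-logits matches the two-logit head
        targets = torch.nn.functional.one_hot(labels, num_classes=2).float()
        loss = loss_fn(logits, targets)
        loss.backward()
        optimizer.step()
```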

## Training Data

- **Source:** Historical newspapers (1960–1975, public domain)
- **Articles:** 4,000 per training round (1,000 from the target newswire, 3,000 from other sources)
- **Features Used:** Headline, author, and the first 100 characters of the article
- **Labeling:** `1` for articles from the target newswire, `0` for all others
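
As an illustration of this feature construction and labeling scheme (the records and field names here are hypothetical, not the actual dataset):

```python
# Hypothetical article records; field names are illustrative only
articles = [
    {"headline": "President speaks at conference", "author": "AP Staff",
     "body": "(AP) WASHINGTON - The President told reporters ...", "source": "AP"},
    {"headline": "Local council meets", "author": "J. Smith",
     "body": "The city council convened Tuesday ...", "source": "local"},
]

def make_example(article, target_wire="AP"):
    # Input text: headline + author + first 100 characters of the article,
    # where the newswire credit usually appears
    text = f"{article['headline']} {article['author']} {article['body'][:100]}"
    # Label 1 for the target newswire, 0 for everything else
    label = 1 if article["source"] == target_wire else 0
    return text, label

examples = [make_example(a) for a in articles]
```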

## Model Performance

| Model | Accuracy | Precision | Recall | F1 Score |
|-------|----------|-----------|--------|----------|
| **AP** | 0.9925 | 0.9926 | 0.9925 | 0.9925 |
| **UPI** | 0.9999 | 0.9999 | 0.9999 | 0.9999 |
| **NEA** | 0.9875 | 0.9880 | 0.9875 | 0.9876 |

## Usage

### Installation

```bash
pip install transformers torch
```

### Example Inference (AP Classifier)

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A Hub repo ID cannot include a subpath, so the AP model is loaded
# from the "AP" subfolder of the repository
model = AutoModelForSequenceClassification.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)
tokenizer = AutoTokenizer.from_pretrained(
    "mike-mcrae/newswire_classifier", subfolder="AP"
)
model.eval()

text = "(AP) President speaks at conference..."
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

with torch.no_grad():
    outputs = model(**inputs)

prediction = outputs.logits.argmax(dim=-1).item()
print("AP Article" if prediction == 1 else "Not AP Article")
```
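
Since the three classifiers are independent, an article can also be scored against all of them in one loop; a softmax over the two logits gives a rough per-wire confidence. This sketch assumes the same `subfolder` layout as above:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

text = "(UPI) Officials announced the agreement today ..."

for wire in ["AP", "UPI", "NEA"]:
    model = AutoModelForSequenceClassification.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    tokenizer = AutoTokenizer.from_pretrained(
        "mike-mcrae/newswire_classifier", subfolder=wire
    )
    model.eval()
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    # Probability assigned to class 1 ("from this newswire")
    prob = torch.softmax(logits, dim=-1)[0, 1].item()
    print(f"{wire}: {prob:.3f}")
```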

## Recommended Usage Notes

- The models were trained on inputs combining the headline, the author, and the first 100 characters of the article, since the newswire credit often appears in these fields. Formatting inference inputs the same way may improve accuracy; see the sketch below.
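
For example, an inference input mirroring that format could be assembled like this (the field values are purely illustrative):

```python
headline = "President speaks at conference"
author = "AP Staff Writer"
article = "(AP) WASHINGTON - The President addressed reporters on Tuesday ..."

# Mirror the training-time features: headline + author + first 100
# characters of the article
text = f"{headline} {author} {article[:100]}"
```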

## Licensing & Data Source

- **Training Data:** Historical newspaper articles (1960–1975) from public-domain sources.
- **License:** Public domain (for data) and MIT License (for model and code).

## Citation

If you use these models, please cite:

```
@misc{newswire_classifier,
  author    = {McRae, Michael},
  title     = {Newswire Classifier (AP, UPI, NEA) - BERT Transformers},
  year      = {2025},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/mike-mcrae/newswire_classifier}
}
```