Disclaimer: This model is still under testing and may change in the future. We will try to keep backwards compatibility. For any questions, reach us at [email protected]

MediaWatch News Topics (Greek)

Fine-tuned model for multi-label text classification (SequenceClassification), based on roberta-el-news, using Hugging Face's Transformers library. The model classifies news in real time into up to 33 topics: AFFAIRS, AGRICULTURE, ARTS_AND_CULTURE, BREAKING_NEWS, BUSINESS, COVID, ECONOMY, EDUCATION, ELECTIONS, ENTERTAINMENT, ENVIRONMENT, FOOD, HEALTH, INTERNATIONAL, LAW_AND_ORDER, MILITARY, NON_PAPER, OPINION, POLITICS, REFUGEE, REGIONAL, RELIGION, SCIENCE, SOCIAL_MEDIA, SOCIETY, SPORTS, TECH, TOURISM, TRANSPORT, TRAVEL, WEATHER, CRIME, JUSTICE.

How to use

You can use this model directly with a pipeline for text-classification:

from transformers import pipeline

pipe = pipeline(
    task="text-classification", 
    model="cvcio/mediawatch-el-topics", 
    tokenizer="cvcio/roberta-el-news" # or cvcio/mediawatch-el-topics
)

topics = pipe(
    "Η βιασύνη αρκετών χωρών να άρουν τους περιορισμούς κατά του κορονοϊού, "+
    "αν όχι να κηρύξουν το τέλος της πανδημίας, με το σκεπτικό ότι έφτασε "+
    "πλέον η ώρα να συμβιώσουμε με την Covid-19, έχει κάνει μερικούς πιο "+
    "επιφυλακτικούς επιστήμονες να προειδοποιούν ότι πρόκειται μάλλον "+
    "για «ενδημική αυταπάτη» και ότι είναι πρόωρη τέτοια υπερβολική "+
    "χαλάρωση. Καθώς τα κρούσματα της Covid-19, μετά το αιφνιδιαστικό "+
    "μαζικό κύμα της παραλλαγής Όμικρον, εμφανίζουν τάση υποχώρησης σε "+
    "Ευρώπη και Βόρεια Αμερική, όπου περισσεύει η κόπωση μεταξύ των "+
    "πολιτών μετά από δύο χρόνια πανδημίας, ειδικοί και μη αδημονούν να "+
    "«ξεμπερδέψουν» με τον κορονοϊό.",
    padding=True,
    truncation=True,
    max_length=512,
    return_all_scores=True
)

print(topics)

# outputs 
[
  [
    {'label': 'AFFAIRS', 'score': 0.0018806682201102376}, 
    {'label': 'AGRICULTURE', 'score': 0.00014653144171461463}, 
    {'label': 'ARTS_AND_CULTURE', 'score': 0.0012948638759553432}, 
    {'label': 'BREAKING_NEWS', 'score': 0.0001729220530251041}, 
    {'label': 'BUSINESS', 'score': 0.0028276608791202307}, 
    {'label': 'COVID', 'score': 0.4407998025417328}, 
    {'label': 'ECONOMY', 'score': 0.039826102554798126}, 
    {'label': 'EDUCATION', 'score': 0.0019098613411188126}, 
    {'label': 'ELECTIONS', 'score': 0.0003333651984576136}, 
    {'label': 'ENTERTAINMENT', 'score': 0.004249618388712406}, 
    {'label': 'ENVIRONMENT', 'score': 0.0015828514005988836}, 
    {'label': 'FOOD', 'score': 0.0018390495097264647}, 
    {'label': 'HEALTH', 'score': 0.1204477995634079}, 
    {'label': 'INTERNATIONAL', 'score': 0.25892165303230286}, 
    {'label': 'LAW_AND_ORDER', 'score': 0.07646272331476212}, 
    {'label': 'MILITARY', 'score': 0.00033025629818439484}, 
    {'label': 'NON_PAPER', 'score': 0.011991199105978012}, 
    {'label': 'OPINION', 'score': 0.16166265308856964}, 
    {'label': 'POLITICS', 'score': 0.0008890336030162871}, 
    {'label': 'REFUGEE', 'score': 0.0011504743015393615}, 
    {'label': 'REGIONAL', 'score': 0.0008734092116355896}, 
    {'label': 'RELIGION', 'score': 0.0009001944563351572}, 
    {'label': 'SCIENCE', 'score': 0.05075162276625633}, 
    {'label': 'SOCIAL_MEDIA', 'score': 0.00039615994319319725}, 
    {'label': 'SOCIETY', 'score': 0.0043518817983567715}, 
    {'label': 'SPORTS', 'score': 0.002416545059531927}, 
    {'label': 'TECH', 'score': 0.0007818648009561002}, 
    {'label': 'TOURISM', 'score': 0.011870541609823704}, 
    {'label': 'TRANSPORT', 'score': 0.0009422845905646682}, 
    {'label': 'TRAVEL', 'score': 0.03004464879631996}, 
    {'label': 'WEATHER', 'score': 0.00040286066359840333}, 
    {'label': 'CRIME', 'score': 0.0005416403291746974}, 
    {'label': 'JUSTICE', 'score': 0.000990519649349153}
  ]
]
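Because the model is multi-label, each score is independent and several topics can apply to one article. The sketch below shows one possible way to turn the scores into a list of topics. The helper name `select_topics` and the 0.1 threshold are assumptions for illustration, not values recommended by the authors; the scores are a subset copied from the output above.

```python
# A subset of the scores from the example output above.
scores = [
    {"label": "COVID", "score": 0.4407998025417328},
    {"label": "HEALTH", "score": 0.1204477995634079},
    {"label": "INTERNATIONAL", "score": 0.25892165303230286},
    {"label": "LAW_AND_ORDER", "score": 0.07646272331476212},
    {"label": "OPINION", "score": 0.16166265308856964},
    {"label": "SCIENCE", "score": 0.05075162276625633},
]


def select_topics(label_scores, threshold=0.1):
    """Return labels scoring at least `threshold`, highest score first.

    The default threshold is a hypothetical value; tune it on your own data.
    """
    kept = [item for item in label_scores if item["score"] >= threshold]
    kept.sort(key=lambda item: item["score"], reverse=True)
    return [item["label"] for item in kept]


selected = select_topics(scores)
print(selected)
# ['COVID', 'INTERNATIONAL', 'OPINION', 'HEALTH']
```

With the pipeline call above, the list to pass would be `topics[0]`, since the pipeline returns one list of label scores per input text.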

Labels

All labels, except NON_PAPER, were retrieved from the source articles during the data collection step, without any preprocessing, on the assumption that journalists and newsrooms assign correct tags to their articles. We disregarded all articles with more than 6 tags to reduce bias and tag manipulation.
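The tag-count filter described above could look roughly like the following. This is only a sketch: the article structure (a dict with a `tags` list) and the names `MAX_TAGS` and `filter_articles` are hypothetical, and the actual data collection code is not part of this card.

```python
MAX_TAGS = 6  # articles with more than 6 tags are disregarded, as stated above


def filter_articles(articles, max_tags=MAX_TAGS):
    """Keep only the articles that carry at most `max_tags` tags."""
    return [article for article in articles if len(article["tags"]) <= max_tags]


# Hypothetical sample records, for illustration only.
articles = [
    {"id": 1, "tags": ["COVID", "HEALTH"]},
    {"id": 2, "tags": ["POLITICS", "ECONOMY", "SOCIETY", "SPORTS",
                       "TECH", "TRAVEL", "WEATHER"]},  # 7 tags, dropped
    {"id": 3, "tags": ["SPORTS"]},
]

kept = filter_articles(articles)
print([article["id"] for article in kept])
# [1, 3]
```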

label	roc_auc	samples
AFFAIRS	0.9872	6,314
AGRICULTURE	0.9799	1,254
ARTS_AND_CULTURE	0.9838	15,968
BREAKING_NEWS	0.9675	827
BUSINESS	0.9811	6,507
COVID	0.9620	50,000
CRIME	0.9885	34,421
ECONOMY	0.9765	45,474
EDUCATION	0.9865	10,111
ELECTIONS	0.9940	7,571
ENTERTAINMENT	0.9925	23,323
ENVIRONMENT	0.9847	23,060
FOOD	0.9934	3,712
HEALTH	0.9723	16,852
INTERNATIONAL	0.9624	50,000
JUSTICE	0.9862	4,860
LAW_AND_ORDER	0.9177	50,000
MILITARY	0.9838	6,536
NON_PAPER	0.9595	4,589
OPINION	0.9624	6,296
POLITICS	0.9773	50,000
REFUGEE	0.9949	4,536
REGIONAL	0.9520	50,000
RELIGION	0.9922	11,533
SCIENCE	0.9837	1,998
SOCIAL_MEDIA	0.991	6,212
SOCIETY	0.9439	50,000
SPORTS	0.9939	31,396
TECH	0.9923	8,225
TOURISM	0.9900	8,081
TRANSPORT	0.9879	3,211
TRAVEL	0.9832	4,638
WEATHER	0.9950	19,931
loss	0.0533	-
roc_auc	0.9855	-

Pretraining

The model was pretrained on an NVIDIA A10 GPU for 15 epochs (approximately 59K steps, 8 hours of training) with a batch size of 128. The optimizer was Adam with a learning rate of 1e-5 and a weight decay of 0.01. We used roc_auc_micro to evaluate the results.
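For readers unfamiliar with the metric, micro-averaged ROC AUC pools every (sample, label) pair into one binary problem and computes a single AUC over the pooled pairs. The pure-Python sketch below illustrates that definition using the rank-sum formulation; it is not the authors' evaluation code, and the function names and toy data are assumptions for illustration.

```python
def binary_roc_auc(y_true, y_score):
    """ROC AUC from the rank-sum (Mann-Whitney U) statistic; ties get average ranks."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    pairs = sorted(zip(y_score, y_true))  # ascending by score
    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores pairs[i:j].
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    return (rank_sum_pos - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


def roc_auc_micro(y_true_rows, y_score_rows):
    """Micro average: flatten the label matrix, then compute one binary AUC."""
    flat_true = [t for row in y_true_rows for t in row]
    flat_score = [s for row in y_score_rows for s in row]
    return binary_roc_auc(flat_true, flat_score)


# Toy data: 2 samples, 3 labels (hypothetical values).
y_true = [[1, 0, 1], [0, 1, 0]]
y_score = [[0.9, 0.2, 0.3], [0.1, 0.7, 0.4]]

auc = roc_auc_micro(y_true, y_score)
print(round(auc, 4))
# 0.8889
```

In the toy data, 8 of the 9 positive/negative pairs are ranked correctly (only the positive scored 0.3 falls below the negative scored 0.4), which gives 8/9.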

Framework versions

Transformers 4.13.0
Pytorch 1.9.0+cu111
Datasets 1.16.1
Tokenizers 0.10.3

Authors

Dimitris Papaevagelou - @andefined

About Us

Civic Information Office is a non-profit organization based in Athens, Greece, focusing on creating technology and research products for the public interest.
