emanuelaboros/lang-detect

@@ -1,42 +1,93 @@
 ---
 library_name: transformers
 language:
-- en
 - fr
 - de
 tags:
-- v1.0.0
 ---
-#### How to use
-You can use this model with Transformers *pipeline* for NER.
-<!-- Provide a longer summary of what this model is. -->
 ```python
 from transformers import pipeline
-MODEL_NAME = "emanuelaboros/lang-detect"
-lang_pipeline = pipeline("lang-detect", model=MODEL_NAME,
-                        trust_remote_code=True,
-                        device='cpu')
-sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
-          le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""
-langs = lang_pipeline(sentence)
-langs
 ```
 ```
-{'label': 'fr', 'confidence': 99.87}
-```
-Works with lists of sentences also.
-### BibTeX entry and citation info
 ```
-```

 ---
 library_name: transformers
 language:
 - fr
 - de
+- en
+- it
+- lb
+license: agpl-3.0
 tags:
+- language-identification
+- multilingual
+- historical
+- impresso
 ---
+# Model Card for impresso-project/language-identifier
+## Overview
+`impresso-project/language-identifier` is a multilingual language identification model fine-tuned for historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)**, the core languages of the [Impresso Project](https://impresso-project.ch), which analyzes historical media across national and linguistic borders.
+The model is adapted to the short, fragmentary, OCR-noisy inputs typical of digitized historical texts.
+## Model Details
+- **Model type:** Language identification
+- **Interface:** Hugging Face `transformers` pipeline
+- **Languages supported:** fr, de, en, it, lb
+- **License:** AGPL-3.0
+- **Developed by:** University of Zurich (UZH), Switzerland
+- **Training data:** Historical newspapers from the impresso corpus and related sources
+## How to Use
 ```python
 from transformers import pipeline
+MODEL_NAME = "impresso-project/language-identifier"
+lang_pipeline = pipeline(
+    "langident",
+    model=MODEL_NAME,
+    trust_remote_code=True,
+    device="cpu",
+)
+text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
+l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
+face à une opportunité."""
+langs = lang_pipeline(text)
+print(langs)
 ```
+## Output Format
+The pipeline returns a single dictionary containing the predicted language code and a confidence score:
+```python
+{
+  "language": "fr",
+  "score": 1.0
+}
 ```
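A minimal sketch of how a caller might consume this dictionary, assuming it has exactly the `language` and `score` keys shown above. The helper name `accept_prediction`, the `SUPPORTED_LANGUAGES` set, and the `0.9` threshold are illustrative choices, not part of the model's API.

```python
from typing import Optional

# Language codes listed in this model card.
SUPPORTED_LANGUAGES = {"de", "fr", "it", "en", "lb"}


def accept_prediction(result: dict, min_score: float = 0.9) -> Optional[str]:
    """Return the predicted language code, or None if the prediction looks unreliable.

    `result` is assumed to follow the output format above:
    {"language": <code>, "score": <float>}. The threshold is a hypothetical default.
    """
    language = result.get("language")
    score = result.get("score", 0.0)
    if language in SUPPORTED_LANGUAGES and score >= min_score:
        return language
    return None


# Hand-written dictionaries standing in for real pipeline output.
print(accept_prediction({"language": "fr", "score": 1.0}))
print(accept_prediction({"language": "lb", "score": 0.42}))
```

In practice the threshold would need tuning on a sample of the target collection, since the score distribution on OCR-noisy text is not documented here.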
+## Use Cases
+- Preprocessing for OCR and NLP tasks on historical corpora
+- Document and segment-level language tagging
+- Filtering and sorting multilingual newspaper archives
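The filtering and sorting use case could be sketched as follows. This assumes each call returns a dictionary with a `language` key, as in the output format above. `group_by_language` and `fake_identify` are hypothetical names: the stub replaces the real pipeline so the example runs without downloading the model, and in real use the `lang_pipeline` object from the usage example would be passed instead.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List


def group_by_language(
    texts: Iterable[str], identify: Callable[[str], dict]
) -> Dict[str, List[str]]:
    """Bucket texts by the language code that `identify` predicts for each one."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        result = identify(text)  # one call per text, one label per text
        groups[result["language"]].append(text)
    return dict(groups)


def fake_identify(text: str) -> dict:
    """Hypothetical stand-in for the pipeline: a crude keyword rule, not a real model."""
    language = "fr" if " le " in f" {text.lower()} " else "de"
    return {"language": language, "score": 1.0}


segments = [
    "Le Royaume de France",
    "Die Zeitung von gestern",
    "Le journal du soir",
]
groups = group_by_language(segments, fake_identify)
print(groups)
```

Calling the identifier once per segment keeps the sketch independent of whether the pipeline accepts batched input.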
+## Limitations
+- Works best on **sentence- or paragraph-length** texts
+- May misclassify code-switched passages and heavily OCR-degraded text, since it returns a single label per input
+- Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
+## Installation
+```bash
+pip install transformers floret
 ```
+## Contact
+- Website: [https://impresso-project.ch](https://impresso-project.ch)
+<p align="center">
+  <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+</p>