emanuelaboros committed on
Commit 740ed29 · 1 Parent(s): f4c750f

modified readme

Files changed (1)
  1. README.md +70 -19
README.md CHANGED
@@ -1,42 +1,93 @@
  ---
  library_name: transformers
  language:
- - en
  - fr
  - de
  tags:
- - v1.0.0
  ---


- #### How to use

- You can use this model with Transformers *pipeline* for NER.

- <!-- Provide a longer summary of what this model is. -->
  ```python
  from transformers import pipeline

- MODEL_NAME = "emanuelaboros/lang-detect"
-
- lang_pipeline = pipeline("lang-detect", model=MODEL_NAME,
-     trust_remote_code=True,
-     device='cpu')

- sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
- le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

- langs = lang_pipeline(sentence)
- langs

  ```

  ```
- {'label': 'fr', 'confidence': 99.87}
- ```
- Works with lists of sentences also.

- ### BibTeX entry and citation info

  ```
- ```
  ---
  library_name: transformers
  language:
  - fr
  - de
+ - en
+ - it
+ - lb
+ license: agpl-3.0
  tags:
+ - language-identification
+ - multilingual
+ - historical
+ - impresso
  ---

+ # Model Card for impresso-project/language-identifier
+
+ ## Overview
+
+ `impresso-project/language-identifier` is a multilingual language identification model fine-tuned for use on historical newspaper content. It supports **German (de), French (fr), Italian (it), English (en), and Luxembourgish (lb)** — the core languages of the [Impresso Project](https://impresso-project.ch), which focuses on analyzing historical media across national and linguistic borders.
+
+ This model has been adapted for short, OCR-noisy and fragmentary inputs typical of historical digitized texts.

+ ## Model Details

+ - **Model type:** Language identification
+ - **Interface:** Hugging Face `transformers` pipeline
+ - **Languages supported:** fr, de, en, it, lb
+ - **License:** AGPL-3.0
+ - **Developed by:** UZH, Switzerland
+ - **Training data:** Historical newspapers from the impresso corpus and related sources
+
+ ## How to Use

  ```python
  from transformers import pipeline

+ MODEL_NAME = "impresso-project/language-identifier"

+ lang_pipeline = pipeline(
+     "langident",
+     model=MODEL_NAME,
+     trust_remote_code=True,
+     device="cpu",
+ )

+ text = """En l'an 1348, au plus fort des ravages de la peste noire à travers
+ l'Europe, le Royaume de France se trouvait à la fois au bord du désespoir et
+ face à une opportunité."""

+ langs = lang_pipeline(text)
+ print(langs)
  ```

+ ## Output Format
+
+ The output is a single dictionary with the predicted language and confidence score:
+
+ ```python
+ {
+     "language": "fr",
+     "score": 1.0
+ }
  ```
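A minimal sketch (not part of the committed README) of how a caller might consume a result of this shape. It assumes the pipeline returns a dict with the `language` and `score` keys shown above. The helper name `accept_prediction`, the `SUPPORTED` set, and the 0.8 threshold are illustrative assumptions, not values taken from the model.

```python
from typing import Optional

# Language codes listed in the model card; kept here as an assumption so the
# helper can reject anything outside the documented set.
SUPPORTED = {"fr", "de", "en", "it", "lb"}


def accept_prediction(pred: dict, threshold: float = 0.8) -> Optional[str]:
    """Return the predicted language code, or None if the result is unusable.

    `pred` is assumed to look like {"language": "fr", "score": 1.0}.
    The default threshold is an arbitrary example value.
    """
    language = pred.get("language")
    score = pred.get("score", 0.0)
    if language not in SUPPORTED or score < threshold:
        return None
    return language


print(accept_prediction({"language": "fr", "score": 1.0}))
print(accept_prediction({"language": "fr", "score": 0.42}))
```

Returning `None` for low-confidence or unexpected labels lets downstream code route such segments to manual review instead of trusting the label silently.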
 
 
 
+ ## Use Cases
+
+ - Preprocessing for OCR and NLP tasks on historical corpora
+ - Document and segment-level language tagging
+ - Filtering and sorting multilingual newspaper archives
+
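For the document- and segment-level tagging use case, one possible approach (an assumption, not something the README prescribes) is to classify each segment separately and then pick the language with the highest summed score. The sketch below works on already-computed predictions in the `{"language", "score"}` shape; `document_language` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable, Optional


def document_language(segment_preds: Iterable[dict]) -> Optional[str]:
    """Pick the language with the highest summed score across segments.

    Each item is assumed to look like {"language": "fr", "score": 0.98}.
    Returns None when there are no predictions.
    """
    totals = defaultdict(float)
    for pred in segment_preds:
        totals[pred["language"]] += pred["score"]
    if not totals:
        return None
    return max(totals, key=totals.get)


# Hand-written example predictions, not real model output.
segments = [
    {"language": "fr", "score": 0.98},
    {"language": "de", "score": 0.55},
    {"language": "fr", "score": 0.91},
]
print(document_language(segments))  # fr
```

Summing scores rather than counting votes means one uncertain segment weighs less than a confident one. Whether that suits a given archive is a design choice to validate on real data.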
+ ## Limitations
+
+ - Works best on **sentence- or paragraph-length** texts
+ - May struggle with code-switching or OCR-degraded text that mixes languages
+ - Primarily optimized for **Impresso-like sources** (19th–20th century newspapers)
+
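Because the model is described as working best on sentence- or paragraph-length input, very short fragments could be merged before classification. The following is a rough sketch under that assumption. `merge_short_segments` is a hypothetical helper, and the `min_chars` default is an arbitrary example, since the README does not state a minimum length.

```python
from typing import List


def merge_short_segments(segments: List[str], min_chars: int = 40) -> List[str]:
    """Join consecutive segments until each chunk has at least `min_chars` characters."""
    chunks: List[str] = []
    buffer = ""
    for segment in segments:
        segment = segment.strip()
        if not segment:
            continue  # skip empty lines, e.g. OCR layout residue
        buffer = f"{buffer} {segment}".strip()
        if len(buffer) >= min_chars:
            chunks.append(buffer)
            buffer = ""
    if buffer:
        # Trailing leftover: attach it to the previous chunk when there is one.
        if chunks:
            chunks[-1] = f"{chunks[-1]} {buffer}"
        else:
            chunks.append(buffer)
    return chunks


lines = ["Le roi", "est mort,", "vive le roi !", "Fin."]
print(merge_short_segments(lines, min_chars=20))
```

Each resulting chunk could then be passed to the pipeline in place of the raw fragments. Merging across a language boundary would blur code-switched passages, so this trades granularity for stability.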
+ ## Installation
+
+ ```bash
+ pip install transformers floret
  ```
+
+ ## Contact
+
+ - Website: [https://impresso-project.ch](https://impresso-project.ch)
+
+ <p align="center">
+ <img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
+ </p>
+