Update README.md
README.md CHANGED
@@ -3,8 +3,9 @@ library_name: transformers
 tags: []
 ---
 
-# Model
+# Model D3
 
+Multilingual model
 
 This model uses a causal language modeling approach during training.
 This approach modifies the way the model accesses and processes words that
@@ -22,13 +23,9 @@ as the architecture.
 
 ### Model Description
 
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
 - **Developed by:** Ronny Paul
 - **Model type:** BLOOM
-- **Language(s) (NLP):** Northern Sami
+- **Language(s) (NLP):** Northern Sami, Norwegian and Finnish
 - **Finetuned from model:** TurkuNLP/gpt3-finnish-xl
 
 
@@ -42,11 +39,21 @@ This model was used in an experiment to determine which architecture is favourab
 
 The model is trained with the rpa020/SALT dataset. The formatted dataset is named the SAmi LLM Token (SALT) dataset and contains around 22 million tokens and approximately 2 million sentences. On average, each sentence consists of around ten tokens. The dataset has been designed to support the pretraining phase for foundational model development.
 
-
+The Finnish dataset is a filtered and post-processed corpus of comments from
+Reddit. The comments were published from 2006 to 2022 and consist of 4,524,360
+unique messages. The dataset was released by Finnish-NLP. The Norwegian
+dataset NoReC was created as part of the SANT project (Sentiment Analysis
+for Norwegian Text), a collaboration between the Language Technology Group
+(LTG) at the Department of Informatics at the University of Oslo, the Norwegian
+Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media.
+This first release of the corpus comprises 35,194 reviews extracted from eight
+different news sources. In terms of publishing date, the reviews mainly cover
+the time span from 2003 to 2017, although it also includes a handful of reviews
+dating back as far as 1998.
 
 ## How to Get Started with the Model
 
-model = BloomForCausalLM.from_pretrained("rpa020/
+model = BloomForCausalLM.from_pretrained("rpa020/D3")
 
 ## Performance
 
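The card states that the model is trained with a causal language modeling objective. A minimal sketch of what that objective looks like with the 🤗 transformers API, assuming the TurkuNLP/gpt3-finnish-xl base checkpoint named in the card; the prompt string is an arbitrary Finnish placeholder:

```python
from transformers import AutoTokenizer, BloomForCausalLM

# Base checkpoint named in the card; any BLOOM-style causal LM would do.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/gpt3-finnish-xl")
model = BloomForCausalLM.from_pretrained("TurkuNLP/gpt3-finnish-xl")

batch = tokenizer("Hyvää päivää", return_tensors="pt")

# Passing labels=input_ids makes transformers shift the labels internally and
# score each position on predicting the *next* token; the causal attention
# mask ensures each position only sees earlier tokens, which is the restricted
# access to context the card describes.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss.item())
```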
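The getting-started one-liner that the commit completes loads only the model weights. A slightly fuller usage sketch, assuming the rpa020/D3 repo on the Hub also ships tokenizer files (the card does not say); the prompt and generation settings are illustrative only:

```python
from transformers import AutoTokenizer, BloomForCausalLM

# Assumes rpa020/D3 includes tokenizer files alongside the model weights.
tokenizer = AutoTokenizer.from_pretrained("rpa020/D3")
model = BloomForCausalLM.from_pretrained("rpa020/D3")

# An arbitrary Northern Sami prompt fragment.
inputs = tokenizer("Sápmi lea", return_tensors="pt")

# Sample a short continuation from the model.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```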