rpa020 committed
Commit f29f7da · verified · 1 parent: 310b9b9

Update README.md

Files changed (1):
  1. README.md (+15 −8)
README.md CHANGED

@@ -3,8 +3,9 @@ library_name: transformers
 tags: []
 ---
 
-# Model D1
+# Model D3
 
-
+Multilingual model
+
 This model uses a causal language modeling approach during training.
 This approach modifies the way the model accesses and processes words that
@@ -22,13 +23,9 @@ as the architecture.
 
 ### Model Description
 
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
 - **Developed by:** Ronny Paul
 - **Model type:** BLOOM
-- **Language(s) (NLP):** Northern Sami
+- **Language(s) (NLP):** Northern Sami, Norwegian and Finnish
 - **Finetuned from model:** TurkuNLP/gpt3-finnish-xl
 
 
@@ -42,11 +39,21 @@ This model was used in an experiment to determine which architecture is favourable
 
 The model is trained with the rpa020/SALT dataset. The formatted dataset is named the SAmi LLM Token (SALT) dataset and contains around 22 million tokens and approximately 2 million sentences. On average, each sentence consists of around ten tokens. The dataset has been designed to support the pretraining phase for foundational model development.
 
-
+The Finnish dataset is a filtered and post-processed corpus of comments from
+Reddit. The comments were published from 2006 to 2022 and consist of 4 524 360
+unique messages. The dataset was released by Finnish-NLP. The Norwegian
+dataset NoReC was created as part of the SANT project (Sentiment Analysis
+for Norwegian Text), a collaboration between the Language Technology Group
+(LTG) at the Department of Informatics at the University of Oslo, the Norwegian
+Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media.
+This first release of the corpus comprises 35,194 reviews extracted from eight
+different news sources. In terms of publishing date, the reviews mainly cover
+the time span from 2003 to 2017, although it also includes a handful of reviews
+dating back as far as 1998.
 
 ## How to Get Started with the Model
 
-model = BloomForCausalLM.from_pretrained("rpa020/D1")
+model = BloomForCausalLM.from_pretrained("rpa020/D3")
 
 ## Performance
 
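The one-line getting-started snippet in the diff can be expanded into a minimal, self-contained sketch using the standard 🤗 transformers API. Only the repo id `rpa020/D3` comes from the card; the use of `AutoTokenizer` for the same repo, the prompt, and the generation settings are illustrative assumptions.

```python
def generate_sample(prompt: str, repo_id: str = "rpa020/D3", max_new_tokens: int = 30) -> str:
    """Load the D3 checkpoint from the Hub and continue `prompt`.

    Assumes the tokenizer is published in the same repo as the model
    (common for Hub checkpoints, but not stated in the card).
    """
    # Heavy import kept local: calling this function downloads several GB of weights.
    from transformers import AutoTokenizer, BloomForCausalLM

    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = BloomForCausalLM.from_pretrained(repo_id)

    inputs = tokenizer(prompt, return_tensors="pt")
    # Greedy decoding for reproducibility; sampling settings are a matter of taste.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example (downloads the full checkpoint; run only with network access and disk space):
# print(generate_sample("Mun lean"))
```

The function wrapper keeps the expensive download out of module import time; swapping `BloomForCausalLM` for `AutoModelForCausalLM` would also work, since the card states the model type is BLOOM.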