Update README.md
README.md CHANGED
@@ -3,8 +3,9 @@ library_name: transformers
 tags: []
 ---
 
-# Model
+# Model D3
 
+Multilingual model
 
 This model uses a causal language modeling approach during training.
 This approach modifies the way the model accesses and processes words that
@@ -22,13 +23,9 @@ as the architecture.
 
 ### Model Description
 
-<!-- Provide a longer summary of what this model is. -->
-
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
 - **Developed by:** Ronny Paul
 - **Model type:** BLOOM
-- **Language(s) (NLP):** Northern Sami
+- **Language(s) (NLP):** Northern Sami, Norwegian and Finnish
 - **Finetuned from model:** TurkuNLP/gpt3-finnish-xl
 
 
@@ -42,11 +39,21 @@ This model was used in an experiment to determine which architecture is favourab
 
 The model is trained with the rpa020/SALT dataset. The formatted dataset is named the SAmi LLM Token (SALT) dataset and contains around 22 million tokens and approximately 2 million sentences. On average, each sentence consists of around ten tokens. The dataset has been designed to support the pretraining phase for foundational model development.
 
-
+The Finnish dataset is a filtered and post-processed corpus of comments from
+Reddit. The comments were published from 2006 to 2022 and consist of 4,524,360
+unique messages. The dataset was released by Finnish-NLP. The Norwegian
+dataset NoReC was created as part of the SANT project (Sentiment Analysis
+for Norwegian Text), a collaboration between the Language Technology Group
+(LTG) at the Department of Informatics at the University of Oslo, the Norwegian
+Broadcasting Corporation (NRK), Schibsted Media Group and Aller Media.
+This first release of the corpus comprises 35,194 reviews extracted from eight
+different news sources. In terms of publishing date, the reviews mainly cover
+the time span from 2003 to 2017, although it also includes a handful of reviews
+dating back as far as 1998.
 
 ## How to Get Started with the Model
 
-model = BloomForCausalLM.from_pretrained("rpa020/
+model = BloomForCausalLM.from_pretrained("rpa020/D3")
 
 ## Performance
 
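The card states that the model is trained with a causal language modeling objective. A minimal sketch of what that objective looks like with the 🤗 transformers API, assuming the TurkuNLP/gpt3-finnish-xl base checkpoint named in the card; the prompt string is an arbitrary Finnish placeholder:

```python
from transformers import AutoTokenizer, BloomForCausalLM

# Base checkpoint named in the card; any BLOOM-style causal LM would do.
tokenizer = AutoTokenizer.from_pretrained("TurkuNLP/gpt3-finnish-xl")
model = BloomForCausalLM.from_pretrained("TurkuNLP/gpt3-finnish-xl")

batch = tokenizer("Hyvää päivää", return_tensors="pt")

# Passing labels=input_ids makes transformers shift the labels internally and
# score each position on predicting the *next* token; the causal attention
# mask ensures each position only sees earlier tokens, which is the restricted
# access to context the card describes.
outputs = model(**batch, labels=batch["input_ids"])
print(outputs.loss.item())
```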
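The getting-started one-liner that the commit completes loads only the model weights. A slightly fuller usage sketch, assuming the rpa020/D3 repo on the Hub also ships tokenizer files (the card does not say); the prompt and generation settings are illustrative only:

```python
from transformers import AutoTokenizer, BloomForCausalLM

# Assumes rpa020/D3 includes tokenizer files alongside the model weights.
tokenizer = AutoTokenizer.from_pretrained("rpa020/D3")
model = BloomForCausalLM.from_pretrained("rpa020/D3")

# An arbitrary Northern Sami prompt fragment.
inputs = tokenizer("Sápmi lea", return_tensors="pt")

# Sample a short continuation from the model.
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.95)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```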