Update README.md
README.md
CHANGED
@@ -84,13 +84,13 @@ widget:
    example_title: O neno
---

-#
+# Carballo-bloom-1.3B

## Table of Contents
<details>
<summary>Click to expand</summary>

-- [
+- [Carballo-bloom-1.3B](#carballo-bloom-13b)
- [Table of Contents](#table-of-contents)
- [Model description](#model-description)
- [Intended uses and limitations](#intended-uses-and-limitations)
@@ -113,12 +113,12 @@ widget:

## Model description

-**
+**Carballo-bloom-1.3B** is a 1.3B-parameter transformer-based causal language model for Galician.
It is the result of continual pretraining of [FLOR-1.3B](https://huggingface.co/projecte-aina/FLOR-1.3B) (developed by the [AINA Project](https://projecteaina.cat/) and based on [BLOOM-1.7B](https://huggingface.co/bigscience/bloom-1b7)) with the Galician corpus [CorpusNos](https://zenodo.org/records/10687642).

## Intended uses and limitations

-The **
+The **Carballo-bloom-1.3B** model is ready to use only for causal language modeling.
It can perform text-generation tasks and be fine-tuned for specific scenarios.

## How to use
@@ -128,7 +128,7 @@ from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Hoxe fai un bo día. O sol "

-model_id = "proxectonos/
+model_id = "proxectonos/Carballo-bloom-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
generator = pipeline(
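
The `How to use` snippet in the hunk above is cut off right after `generator = pipeline(`, so the README's own pipeline arguments are not visible in this diff. The sketch below shows one conventional way to complete it with the `transformers` text-generation pipeline; the generation parameters (`max_new_tokens`, `do_sample`, `top_k`) are illustrative assumptions, not values taken from the model card.

```python
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM

input_text = "Hoxe fai un bo día. O sol "

model_id = "proxectonos/Carballo-bloom-1.3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# A plausible completion of the truncated `pipeline(` call above:
# a standard text-generation pipeline built from the loaded objects.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

# Generation settings here are examples only, not the card's official ones.
outputs = generator(
    input_text,
    max_new_tokens=50,
    do_sample=True,
    top_k=50,
)
print(outputs[0]["generated_text"])
```
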
@@ -157,10 +157,10 @@ It was trained using HuggingFace Transformers and Pytorch, using the [Causal Mod

### Language adaptation and training

-The language adaptation technique used to train 
+The language adaptation technique used to train Carballo-bloom-1.3B is based on the one used to train FLOR-1.3B, which its authors explain in this [Medium post](https://medium.com/@mpamies247/flor-6-3b-a-chinchilla-compliant-model-for-catalan-spanish-and-english-7cdb389a9aac). In summary, we proceeded as follows:
1) We trained our own BPE tokenizer for Galician and replaced the original FLOR-1.3B tokenizer and vocabulary with it.
2) The embeddings corresponding to tokens that are present in both the original and the target vocabulary (matching tokens) were used for initialization.
-3) The embeddings from tokens not present in 
+3) The embeddings from tokens not present in Carballo-bloom-1.3B's original vocabulary were initialized as the average of all embeddings.
4) The model was initialized with the weights from FLOR-1.3B and with our adapted tokenizer (step 1) and embeddings (steps 2-3).
5) The model was then trained on a Galician corpus.
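
Step 1 in the hunk above (training a new Galician BPE tokenizer to replace the FLOR-1.3B one) can be illustrated with `train_new_from_iterator`, which retrains a fast tokenizer of the same type on new text. This is a minimal sketch under assumptions: the corpus file name, the batching helper, and the vocabulary size are placeholders, not the project's actual setup.

```python
from transformers import AutoTokenizer

# Start from the source model's BPE tokenizer so the new one keeps the
# same tokenizer type and special-token conventions.
src_tokenizer = AutoTokenizer.from_pretrained("projecte-aina/FLOR-1.3B")

def corpus_iterator(path="corpusnos_gl.txt", batch_size=1000):
    """Yield batches of lines from a hypothetical plain-text corpus dump."""
    with open(path, encoding="utf-8") as f:
        batch = []
        for line in f:
            batch.append(line)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch

# Retrain the BPE vocabulary on Galician text; the vocabulary size is an assumption.
gl_tokenizer = src_tokenizer.train_new_from_iterator(corpus_iterator(), vocab_size=50_257)
gl_tokenizer.save_pretrained("galician-bpe-tokenizer")
```
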
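Steps 2 and 3 above describe how the embedding matrix was re-initialized: rows for tokens shared by the old and new vocabularies are copied over, and every other row starts from the average of all old embeddings. A minimal sketch of that initialization follows, assuming the Galician tokenizer saved by the previous sketch; the variable names and the use of the input-embedding matrix alone are illustrative, not the project's actual code.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Source checkpoint and tokenizer (FLOR-1.3B), plus the new Galician
# tokenizer saved by the previous sketch (hypothetical local path).
model = AutoModelForCausalLM.from_pretrained("projecte-aina/FLOR-1.3B")
src_tokenizer = AutoTokenizer.from_pretrained("projecte-aina/FLOR-1.3B")
gl_tokenizer = AutoTokenizer.from_pretrained("galician-bpe-tokenizer")

src_embeddings = model.get_input_embeddings().weight.data.clone()
src_vocab = src_tokenizer.get_vocab()        # token -> id in the old vocabulary
mean_embedding = src_embeddings.mean(dim=0)  # average of all old embeddings

# Step 3: every row of the new matrix starts as the average embedding ...
new_embeddings = mean_embedding.repeat(len(gl_tokenizer), 1)

# Step 2: ... and matching tokens keep their original embedding.
for token, gl_id in gl_tokenizer.get_vocab().items():
    src_id = src_vocab.get(token)
    if src_id is not None:
        new_embeddings[gl_id] = src_embeddings[src_id]

# Resize to the new vocabulary and load the initialized matrix
# (BLOOM-style models tie input and output embeddings, so one copy suffices).
model.resize_token_embeddings(len(gl_tokenizer))
model.get_input_embeddings().weight.data.copy_(new_embeddings)
```

The model initialized this way (step 4) can then be passed to the usual causal language modeling training loop on the Galician corpus (step 5).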