Update readme
README.md
CHANGED
@@ -1,44 +1,30 @@
-
+---
+language: fa
+tags:
+- text-generation
+widget:
+- text: "در یک اتفاق شگفت انگیز، پژوهشگران"
+- text: "گرفتگی بینی در کودکان و بهخصوص نوزادان باعث میشود"
+- text: "امیدواریم نوروز امسال سالی"
+---
 
+# GPT2 Medium 4 Persian
+> This is part of the
+[Flax/Jax Community Week](https://discuss.huggingface.co/t/pretrain-gpt2-from-scratch-in-persian/7560), organized by [HuggingFace](https://huggingface.co/) and TPU usage sponsored by Google.
 
-## Scripts
 
-
+## Team Members
+- FirstName LastName ([hf_user](https://huggingface.co/hf_user))
+... SOON
 
-
-from src.normalizer import normalize
+## Dataset
 
-
-print(normalize(input_text))
-```
+... SOON
 
-
-```text
-azbab اینجا ایران خانهشما است ؟ ! 1231231312 الحروف لعربیه
-```
+## How To Use
 
-
+... SOON
 
-
-python train_tokenizer.py --dataset_name oscar --dataset_config_name unshuffled_deduplicated_als --vocab_size 42000
-```
+## Evaluation
 
-
-
-```bash
-python create_config.py --name_or_path gpt2-medium --params '{"vocab_size": 42000}'
-```
-
-### Normalization steps
-
-Steps:
-
-- [x] Remove stretched words such as ســــــــــلام
-
-- [x] Remove links, user-mentioning (such as @jane_doe)
-
-- [ ] Remove Telegram, Instagram advertisements, or posts (a whole record)
-
-- [ ] Remove advertisement records
-
-- [ ] Remove separated words (or the whole record) which are showing up as an individual record, while they are just the tags at the end of the post (such as بلاب ... بلاب ... ورزشی، خبری، سیاسی، اجتماعی، خانوده)
+... SOON
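
The two items ticked off in the removed "Normalization steps" list (stretched words, links and user mentions) could look roughly like the sketch below. This is not the repository's `src.normalizer` code: the function name `basic_normalize` and the regular expressions are assumptions made for illustration, and "remove stretched words" is read here as stripping the stretching character (tatweel, U+0640) rather than dropping the whole word.

```python
import re

# Assumption: stretching is done with the Arabic tatweel (kashida) character.
TATWEEL = "\u0640"

# Hypothetical patterns; the real normalizer may use different rules.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")  # lookbehind avoids matching inside e-mail addresses
SPACES_RE = re.compile(r"\s+")


def basic_normalize(text: str) -> str:
    """Strip links and @mentions, un-stretch words, and collapse whitespace."""
    text = URL_RE.sub(" ", text)        # remove links
    text = MENTION_RE.sub(" ", text)    # remove user mentions such as @jane_doe
    text = text.replace(TATWEEL, "")    # un-stretch words by dropping tatweel
    return SPACES_RE.sub(" ", text).strip()


if __name__ == "__main__":
    stretched = "س" + TATWEEL * 10 + "لام"
    print(basic_normalize(f"{stretched} @jane_doe https://example.com"))
```

The unchecked items in the list (advertisement records, trailing tag-only records) need record-level filtering and are deliberately left out of this sketch.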