julien-c HF staff committed on
Commit
761acc2
1 Parent(s): c84b903

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/cahya/roberta-base-indonesian-522M/README.md

Files changed (1)
  1. README.md +58 -0
README.md ADDED
@@ -0,0 +1,58 @@
+ ---
+ language: "id"
+ license: "mit"
+ datasets:
+ - Indonesian Wikipedia
+ widget:
+ - text: "Ibu ku sedang bekerja <mask> supermarket."
+ ---
+
+ # Indonesian RoBERTa base model (uncased)
+
+ ## Model description
+ This is a RoBERTa-base model pre-trained on Indonesian Wikipedia using a masked language modeling (MLM) objective. This
+ model is uncased: it does not distinguish between indonesia and Indonesia.
+
+ This is one of several language models that have been pre-trained on Indonesian datasets. More detail about
+ its usage on downstream tasks (text classification, text generation, etc.) is available at [Transformer based Indonesian Language Models](https://github.com/cahya-wirawan/indonesian-language-models/tree/master/Transformers).
+
+ ## Intended uses & limitations
+
+ ### How to use
+ You can use this model directly with a pipeline for masked language modeling:
+ ```python
+ >>> from transformers import pipeline
+ >>> unmasker = pipeline('fill-mask', model='cahya/roberta-base-indonesian-522M')
+ >>> unmasker("Ibu ku sedang bekerja <mask> supermarket")
+
+ ```
+ Here is how to use this model to get the features of a given text in PyTorch:
+ ```python
+ from transformers import RobertaTokenizer, RobertaModel
+
+ model_name = 'cahya/roberta-base-indonesian-522M'
+ tokenizer = RobertaTokenizer.from_pretrained(model_name)
+ model = RobertaModel.from_pretrained(model_name)
+ text = "Silakan diganti dengan text apa saja."  # "Please replace this with any text."
+ encoded_input = tokenizer(text, return_tensors='pt')  # return PyTorch tensors
+ output = model(**encoded_input)
+ ```
+ and in TensorFlow:
+ ```python
+ from transformers import RobertaTokenizer, TFRobertaModel
+
+ model_name = 'cahya/roberta-base-indonesian-522M'
+ tokenizer = RobertaTokenizer.from_pretrained(model_name)
+ model = TFRobertaModel.from_pretrained(model_name)
+ text = "Silakan diganti dengan text apa saja."  # "Please replace this with any text."
+ encoded_input = tokenizer(text, return_tensors='tf')  # return TensorFlow tensors
+ output = model(encoded_input)
+ ```
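Both snippets leave the per-token features in `output`. As a rough sketch of one common way to reduce them to a single vector per text, the helper below averages only the token vectors that the attention mask marks as real tokens. `mean_pool` is a hypothetical helper, not part of `transformers`, and the numbers are toy values rather than model output. It assumes the hidden states (in recent `transformers` versions, `output.last_hidden_state[0]`) and `encoded_input['attention_mask'][0]` have already been converted to plain Python lists, so it runs without downloading the model.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose attention-mask entry is 1.

    Hypothetical helper for illustration; not part of transformers.
    """
    # Keep only positions that correspond to real (non-padding) tokens.
    kept = [vec for vec, flag in zip(token_vectors, attention_mask) if flag == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    # Average each feature dimension over the kept tokens.
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


# Toy example: three tokens with two features each; the last token is padding.
toy_vectors = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
toy_mask = [1, 1, 0]
sentence_vector = mean_pool(toy_vectors, toy_mask)
print(sentence_vector)
```

Whether mean pooling is a good sentence representation for this model is not stated in the card, so treat it as a starting point only.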
+
+ ## Training data
+
+ This model was pre-trained with 522 MB of Indonesian Wikipedia text.
+ The texts are lowercased and tokenized using byte-level BPE (the scheme implemented by `RobertaTokenizer`) with a vocabulary size of 32,000. The inputs of the model are
+ then of the form:
+
+ ```<s> Sentence A </s> Sentence B </s>```
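To make the input layout concrete, here is a minimal string-level sketch that lowercases two sentences and wraps them in the form shown above. `format_pair` is a hypothetical helper and the second sentence is an invented example. The real tokenizer works on token ids rather than strings, and `RobertaTokenizer` places a doubled `</s></s>` between the two segments of a pair, so this is an illustration of the stated layout only, not what the library emits.

```python
def format_pair(sentence_a: str, sentence_b: str) -> str:
    """Lowercase both sentences and wrap them as <s> A </s> B </s>.

    Hypothetical helper mirroring the layout in the model card.
    """
    a = sentence_a.lower().strip()
    b = sentence_b.lower().strip()
    return f"<s> {a} </s> {b} </s>"


# Illustrative sentences only; any two strings work.
formatted = format_pair("Ibu ku sedang bekerja", "Di supermarket")
print(formatted)  # <s> ibu ku sedang bekerja </s> di supermarket </s>
```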