---
language:
- sv
- 'no'
- da
- en
license: mit
tags:
- bert
- roberta
pipeline_tag: fill-mask
widget:
- text: Huvudstaden i Sverige är <mask>.
  example_title: Swedish
- text: Hovedstaden i Norge er <mask>.
  example_title: Norwegian
- text: Danmarks hovedstad er <mask>.
  example_title: Danish
---

# roberta-large-1160k

## Intended uses

You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k')
>>> unmasker("Huvudstaden i Sverige är <mask>.")
[{'score': 0.5841221213340759, 'token': 1945, 'token_str': ' Stockholm', 'sequence': 'Huvudstaden i Sverige är Stockholm.'},
 {'score': 0.06775698810815811, 'token': 5007, 'token_str': ' Göteborg', 'sequence': 'Huvudstaden i Sverige är Göteborg.'},
 {'score': 0.05057400465011597, 'token': 5761, 'token_str': ' Malmö', 'sequence': 'Huvudstaden i Sverige är Malmö.'},
 {'score': 0.021936343982815742, 'token': 21449, 'token_str': ' Norrköping', 'sequence': 'Huvudstaden i Sverige är Norrköping.'},
 {'score': 0.017798304557800293, 'token': 5658, 'token_str': ' Uppsala', 'sequence': 'Huvudstaden i Sverige är Uppsala.'}]
```

```python
>>> unmasker("Hovedstaden i Norge er <mask>.")
[{'score': 0.6792309284210205, 'token': 5158, 'token_str': ' Oslo', 'sequence': 'Hovedstaden i Norge er Oslo.'},
 {'score': 0.09379775077104568, 'token': 15456, 'token_str': ' Trondheim', 'sequence': 'Hovedstaden i Norge er Trondheim.'},
 {'score': 0.052535850554704666, 'token': 11370, 'token_str': ' Bergen', 'sequence': 'Hovedstaden i Norge er Bergen.'},
 {'score': 0.03465486690402031, 'token': 29407, 'token_str': ' hovedstaden', 'sequence': 'Hovedstaden i Norge er hovedstaden.'},
 {'score': 0.03017985075712204, 'token': 33311, 'token_str': ' Kristiansand', 'sequence': 'Hovedstaden i Norge er Kristiansand.'}]
```

```python
>>> unmasker("Danmarks hovedstad er <mask>.")
[{'score': 0.11624140292406082, 'token': 4794, 'token_str': ' København', 'sequence': 'Danmarks hovedstad er København.'},
 {'score': 0.045051511377096176, 'token': 7680, 'token_str': ' død', 'sequence': 'Danmarks hovedstad er død.'},
 {'score': 0.02936543896794319, 'token': 10795, 'token_str': ' lukket', 'sequence': 'Danmarks hovedstad er lukket.'},
 {'score': 0.026030730456113815, 'token': 13580, 'token_str': ' Odense', 'sequence': 'Danmarks hovedstad er Odense.'},
 {'score': 0.02130937948822975, 'token': 16347, 'token_str': ' Roskilde', 'sequence': 'Danmarks hovedstad er Roskilde.'}]
```

Here is how to use this model to get the features of a given text in PyTorch:

```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```

## Training data

The Scandinavian subset of the Nordic Pile (Swedish, Norwegian, Danish), consisting of 414 962 688 text samples.

## Training procedure

The model was trained with the [optimum-habana](https://github.com/huggingface/optimum-habana) framework, utilizing 8x Intel® Gaudi® 2 AI accelerators managed by Intel Sweden AB.

The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, and the tokenizer was trained from scratch.
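As a rough illustration of this setup, the sketch below shows how masked-language-model training with optimum-habana can be wired together. The dataset file, the Gaudi configuration name (`Habana/roberta-large`), the sequence length, and all hyperparameters other than the global batch size of 1536 reported below are illustrative assumptions, not the exact values used for this run.

```python
# Sketch of RoBERTa MLM training on Gaudi accelerators with optimum-habana.
# Dataset path, Gaudi config name, and most hyperparameters are placeholders.
from datasets import load_dataset
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling
from optimum.habana import GaudiTrainer, GaudiTrainingArguments

# Initialize from the English roberta-large weights, as described above.
model = AutoModelForMaskedLM.from_pretrained("FacebookAI/roberta-large")
# In the actual run the tokenizer was trained from scratch on the Scandinavian
# corpus; here we simply load the released one.
tokenizer = AutoTokenizer.from_pretrained("AI-Sweden-Models/roberta-large-1160k")

# Placeholder corpus: any plain-text dataset with a "text" column works here.
dataset = load_dataset("text", data_files={"train": "nordic_pile_scandinavian.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

# Standard RoBERTa-style dynamic masking (15% of tokens).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

training_args = GaudiTrainingArguments(
    output_dir="./roberta-large-scandi",
    use_habana=True,                            # run on Gaudi devices
    use_lazy_mode=True,
    gaudi_config_name="Habana/roberta-large",   # assumed Gaudi config
    per_device_train_batch_size=192,            # 192 x 8 devices = 1536 global batch
    num_train_epochs=5,
    save_steps=10_000,
)

trainer = GaudiTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```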
This model is an intermediate checkpoint, saved at step 1 160 000 of 1 350 790. The full run corresponds to 5 epochs, so this checkpoint sits at epoch ≈ 4.29 (1 160 000 / 1 350 790 × 5). A global batch size of 1536 was used.

## Evaluation results

When fine-tuned on downstream tasks, this model achieves the following results (a minimal fine-tuning sketch follows after the table):

| rank | da_rank | no_rank | sv_rank | dansk | angry_tweets | scala_da | scandiqa_da | norne_nb | norne_nn | norec | scala_nb | scala_nn | norquad | suc3 | swerec | scala_sv | scandiqa_sv |
|------|---------|---------|---------|-------|--------------|----------|-------------|----------|----------|-------|----------|----------|---------|------|--------|----------|-------------|
| 1.3  | 1.33    | 1.34    | 1.23    | 74.16 | 51.2         | 73.87    | 49.34       | 92.01    | 87.17    | 60.11 | 72.85    | 65.56    | 60.38   | 82.65 | 77.25 | 77.9     | 49.64       |

As of 2024-03-26, it is ranked #2 on the Swedish NLU leaderboard at [ScandEval](https://scandeval.com/swedish-nlu/), behind *gpt-4-0613*.
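For reference, here is a minimal sketch of fine-tuning this checkpoint on a sequence-classification task. The dataset files, label count, and hyperparameters are placeholder assumptions; the scores in the table above were produced with the ScandEval benchmark harness, not with this script.

```python
# Sketch of fine-tuning roberta-large-1160k for sequence classification.
# Dataset files and hyperparameters are illustrative placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_id = "AI-Sweden-Models/roberta-large-1160k"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# num_labels=3 assumes a three-class task (e.g. negative/neutral/positive sentiment).
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=3)

# Hypothetical CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="./roberta-large-1160k-classification",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
trainer.evaluate()
```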