---
license: apache-2.0
datasets:
- rahular/varta
language:
- as
- bh
- bn
- en
- gu
- hi
- kn
- ml
- mr
- ne
- or
- pa
- ta
- te
- ur
---

# Varta-BERT


## Model Description

Varta-BERT is a model pre-trained on the `full` training set of [Varta](https://huggingface.co/datasets/rahular/varta) in 14 Indic languages (Assamese, Bhojpuri, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Nepali, Oriya, Punjabi, Tamil, Telugu, and Urdu) and English, using a masked language modeling (MLM) objective. 

[Varta](https://huggingface.co/datasets/rahular/varta) is a large-scale news corpus comprising 41.8 million articles in 14 Indic languages and English, drawn from a variety of high-quality sources.
The dataset and the model are introduced in [this paper](https://arxiv.org/abs/2305.05858). The code is released in [this repository](https://github.com/rahular/varta).

## Uses
You can use the raw model for masked language modeling, but it is mostly intended to be fine-tuned on a downstream task.

The model is best suited to tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering. For tasks such as text generation, see our [Varta-T5](https://huggingface.co/rahular/varta-t5) model.

## Bias, Risks, and Limitations
This work is mainly dedicated to the curation of a new multilingual dataset for Indic languages, many of which are low-resource languages. During data collection, we faced several limitations that can potentially result in ethical concerns. Some of the important ones are listed below:

- Our dataset contains only those articles written by DailyHunt's partner publishers. This has the potential to result in a bias towards a particular narrative or ideology that can affect the representativeness and diversity of the dataset.
- Another limitation is the languages represented in Varta. Out of 22 languages with official status in India, our dataset has only 13. There are 122 major languages spoken by at least 10,000 people and 159 other languages which are extremely low-resourced. None of these languages are represented in our dataset.
- We do not perform any kind of debiasing on Varta. This means that societal and cultural biases may exist in the dataset, which can adversely affect the fairness and inclusivity of the models trained on it.


## How to Get Started with the Model

You can use this model directly for masked language modeling.

```python
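from transformers import AutoTokenizer, AutoModelForMaskedLM

# Download (or load from the local cache) the tokenizer for the checkpoint.
tokenizer = AutoTokenizer.from_pretrained("rahular/varta-bert")

# Load the encoder together with its masked-language-modeling head.
model = AutoModelForMaskedLM.from_pretrained("rahular/varta-bert")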
```
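
To go from loading the checkpoint to actual predictions, one option is the `fill-mask` pipeline from `transformers`. The following is a minimal sketch, not the authors' evaluation code: the helper names `fill_mask_slot` and `predict_masked_token` and the `{mask}` placeholder are illustrative assumptions, and the mask token is read from the tokenizer rather than hard-coded.

```python
from typing import List, Tuple


def fill_mask_slot(template: str, mask_token: str) -> str:
    """Replace the (hypothetical) `{mask}` placeholder with the real mask token."""
    return template.replace("{mask}", mask_token)


def predict_masked_token(
    template: str,
    model_id: str = "rahular/varta-bert",
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the top-k (token, score) candidates for a single masked position."""
    # Imported lazily so the helper above can be used without loading the library.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token instead of assuming a particular string.
    text = fill_mask_slot(template, fill_mask.tokenizer.mask_token)
    return [(r["token_str"], r["score"]) for r in fill_mask(text, top_k=top_k)]
```

Calling `predict_masked_token("The capital of India is {mask}.")` downloads the checkpoint on first use and returns the candidates in descending order of score. The template must contain exactly one `{mask}` slot for the result to have this shape.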


## Training Details

### Training Data
Varta contains 41.8 million high-quality news articles in 14 Indic languages and English. 
With 34.5 million non-English article-headline pairs, it is the largest document-level dataset of its kind. 

### Pretraining
- We pretrain the Varta-BERT model using the standard BERT-Base architecture with 12 encoder layers. 
- We train with a maximum sequence length of 512 tokens with an embedding dimension of 768. 
- We use 12 attention heads with feed-forward width of 3072. 
- To support all 15 languages in the dataset, we use a WordPiece vocabulary of size 128K.
- In total, the model has 184M parameters. The model is trained with the AdamW optimizer with alpha=0.9 and beta=0.98.
- We use an initial learning rate of 1e-4 with a warm-up of 10K steps and linearly decay the learning rate until the end of training.
- We train the model for a total of 1M steps, which takes 10 days to finish.
- We use an effective batch size of 4096 and train the model on TPU v3-128 chips.
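
The hyperparameters above can be sanity-checked with a little arithmetic. The sketch below recomputes the parameter count of a BERT-Base encoder and writes out the learning-rate schedule as a function of the step. It rests on several assumptions that the list does not state: a vocabulary of exactly 128,000 entries, the usual two segment embeddings and pooler layer, no separate MLM-head parameters, a warm-up that rises linearly from zero, and a decay that reaches zero at the final step. The function names are illustrative.

```python
def bert_param_count(
    vocab_size: int = 128_000,  # assumption: "128K" taken as 128,000
    hidden: int = 768,
    layers: int = 12,
    ffn: int = 3072,
    max_positions: int = 512,
    type_vocab: int = 2,        # assumption: standard BERT segment embeddings
) -> int:
    """Count weights and biases of a BERT-Base style encoder with pooler."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Query, key, value and output projections, each with a bias.
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


def learning_rate(
    step: int,
    peak_lr: float = 1e-4,
    warmup_steps: int = 10_000,
    total_steps: int = 1_000_000,
) -> float:
    """Linear warm-up to peak_lr, then linear decay to zero (assumed end value)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / (total_steps - warmup_steps)


print(bert_param_count())      # 184345344, consistent with the stated 184M
print(learning_rate(5_000))    # halfway through warm-up
print(learning_rate(505_000))  # halfway through the decay phase
```

Under these assumptions, the embedding matrix alone accounts for roughly 98M of the parameters, which is why the total is well above the roughly 110M of the original English BERT-Base.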

Since data sizes across languages in Varta vary from 1.5K (Bhojpuri) to 14.4M articles (Hindi), we use standard temperature-based sampling to upsample data when necessary.
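
Temperature-based sampling is commonly implemented by raising each language's share of the data to the power 1/T and renormalizing. The sketch below illustrates that common formulation; it is not taken from the Varta code base, and the temperature used in the usage note is a hypothetical value, since it is not stated here. Only the two article counts given above are used.

```python
from typing import Dict


def temperature_sampling_probs(
    counts: Dict[str, int], temperature: float
) -> Dict[str, float]:
    """Return per-language sampling probabilities p_i proportional to (n_i / N) ** (1 / T)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(counts.values())
    # T = 1 keeps the natural proportions; larger T flattens the distribution.
    scaled = {lang: (n / total) ** (1.0 / temperature) for lang, n in counts.items()}
    norm = sum(scaled.values())
    return {lang: s / norm for lang, s in scaled.items()}


# Approximate article counts quoted above: 1.5K Bhojpuri, 14.4M Hindi.
article_counts = {"bh": 1_500, "hi": 14_400_000}
```

Comparing `temperature_sampling_probs(article_counts, 1.0)` with `temperature_sampling_probs(article_counts, 5.0)` (the value 5.0 is only an illustration) shows Bhojpuri's sampling probability rising as the temperature increases, which is the upsampling effect described above.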

### Evaluation Results
Please see [the paper](https://arxiv.org/pdf/2305.05858.pdf).

## Citation
```
@misc{aralikatte2023varta,
      title={V\=arta: A Large-Scale Headline-Generation Dataset for Indic Languages}, 
      author={Rahul Aralikatte and Ziling Cheng and Sumanth Doddapaneni and Jackie Chi Kit Cheung},
      year={2023},
      eprint={2305.05858},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```