---
license: cc-by-nc-4.0
language:
- gsw
- multilingual
widget:
- text: "Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."
---

This model is [**google/canine-s**](https://huggingface.co/google/canine-s) ([Clark et al., TACL 2022](https://aclanthology.org/2022.tacl-1.5/)), adapted to Swiss German text via continued pre-training.
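
A minimal usage sketch, assuming the `transformers` and `torch` packages are installed. The identifier below is that of the base model linked above; it is only a placeholder and should be replaced with this model's own repository id, which this card does not state. The helper name `embed_sentences` is hypothetical.

```python
# Placeholder: id of the base model. Substitute this model's repository id.
MODEL_ID = "google/canine-s"


def embed_sentences(texts, model_id=MODEL_ID):
    """Return one pooled vector per input string (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require the heavy dependencies.
    import torch
    from transformers import CanineModel, CanineTokenizer

    tokenizer = CanineTokenizer.from_pretrained(model_id)
    model = CanineModel.from_pretrained(model_id)
    model.eval()

    # CANINE reads characters directly, so no subword segmentation happens here.
    inputs = tokenizer(texts, padding="longest", truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output
```

Calling `embed_sentences(["Hinder s'Hans-Heiris Huus hani hundert Hase ghöre hueschte."])` downloads the weights on first use and returns a tensor with one row per input sentence.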

## Training Objective

We used the CANINE-S objective combined with the subword vocabulary of [SwissBERT](https://huggingface.co/ZurichNLP/swissbert).
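
To illustrate what character-level input means in practice, the sketch below converts a string into the sequence of Unicode code points that a CANINE-style encoder consumes. In CANINE-S, the subword vocabulary is used for the pre-training loss rather than for the model input. The function name `to_codepoint_ids`, the constant names, and the default maximum length of 2048 are assumptions chosen for illustration. The special code points follow our understanding of the Hugging Face `CanineTokenizer` and should be checked against it.

```python
# Assumed special code points (Unicode private use area), mirroring CanineTokenizer.
CLS = 0xE000
SEP = 0xE001


def to_codepoint_ids(text, max_length=2048):
    """Map each character to its Unicode code point, wrapped in CLS and SEP."""
    # Reserve two positions for the special symbols when truncating.
    body = [ord(ch) for ch in text][: max_length - 2]
    return [CLS] + body + [SEP]


ids = to_codepoint_ids("ghöre")
print(len(ids))  # 5 characters plus 2 special symbols
```

Dialectal spellings never fall out of vocabulary under this scheme, since every character maps to an id.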

## Training Data

For continued pre-training, we used the following two datasets of written Swiss German:

1. [SwissCrawl](https://icosys.ch/swisscrawl) ([Linder et al., LREC 2020](https://aclanthology.org/2020.lrec-1.329)), a collection of Swiss German web text (forum discussions, social media).
2. A custom dataset of Swiss German tweets.

In addition, we trained the model on an equal amount of Standard German data, using news articles retrieved from [Swissdox@LiRI](https://t.uzh.ch/1hI).

## License

Attribution-NonCommercial 4.0 International (CC BY-NC 4.0).

## Citation

```bibtex
@inproceedings{vamvas-etal-2024-modular,
  title={Modular Adaptation of Multilingual Encoders to Written Swiss German Dialect},
  author={Jannis Vamvas and No{\"e}mi Aepli and Rico Sennrich},
  booktitle={First Workshop on Modular and Open Multilingual NLP},
  year={2024},
}
```