---
language: sw
license: cc-by-4.0
datasets:
- kenyacorpus_v2
model-index:
- name: innocent-charles/Swahili-question-answer-latest-cased
  results:
  - task:
      type: question-answering
      name: Question Answering
    dataset:
      name: kenyacorpus
      type: kenyacorpus
      config: kenyacorpus
      split: validation
    metrics:
    - type: exact_match
      value: 51.9309
      name: Exact Match
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiOTIyN2VhODRhMTQzOGYwNGU0NjM4NmMyOWQ1ZmM4ODliNGRlNjdjMTY3MWU5YzVkYWJmODhiNTMyZDE4NGQ5ZSIsInZlcnNpb24iOjF9.oVd4HFhao0K7AwV0sZTCy2Sa4mG2LP-BX0ImCynZQJ-zReQtgoK1x0LRn31chEKF_CHOQ4ZZ5SBrOuCwK5KNCQ
    - type: f1
      value: 63.9501
      name: F1
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiM2E3YWU0YTljNjI4YmEyNjRkZWFlZTZlZmMzNjc2NzhiMmEzNmNlZDQ1YjEwZGY1MTEzYTUyZWNjMWJiMzBlMiIsInZlcnNpb24iOjF9.x_DxEhpVLb_JRhk0z12lJhVV_ugvUdK_axOe7Cb6oyH7ir7Ky0TJpIDfmk6w7IgNKiYAZ_yObNbjyov6QNoeCw
    - type: total
      value: 445
      name: total
      verified: true
      verifyToken: eyJhbGciOiJFZERTQSIsInR5cCI6IkpXVCJ9.eyJoYXNoIjoiNTFkYzExMDZiZmUwOTA3ZDYyZjhhZjZmZmFhNWU1NDI4NjY4ZTY1NjQxMjhkNjNiMzBmMGY0YTlhNzVjY2NjNyIsInZlcnNpb24iOjF9.RexL6OXVW3eQRdd7tk9RQPNACCFSwXi3DHz0cd77vZ2Jai7ESLTf8vFIM6j7V2nBGcON4-bJ7MQeRrRg16qyCg
---

# SWAHILI QUESTION - ANSWER MODEL

This is the [bert-base-multilingual-cased](https://huggingface.co/bert-base-multilingual-cased) model, fine-tuned on the [KenyaCorpus](https://github.com/Neurotech-HQ/Swahili-QA-dataset) dataset. It was trained on question-answer pairs, including unanswerable questions, for the task of question answering in Swahili.

Question answering (QA) is a discipline within information retrieval and NLP concerned with building systems that, given a question in natural language, extract the relevant information from the provided data and present it as a natural-language answer.

## Overview

**Language model used:** bert-base-multilingual-cased

**Language:** Kiswahili

**Downstream task:** Extractive Swahili QA

**Training data:** KenyaCorpus

**Eval data:** KenyaCorpus

**Code:** See [an example QA pipeline on Haystack](https://blog.neurotech.africa/building-swahili-question-and-answering-with-haystack/)

**Infrastructure:** AWS NVIDIA A100 Tensor Core GPU

## Hyperparameters

```
batch_size = 16
n_epochs = 10
base_LM_model = "bert-base-multilingual-cased"
max_seq_len = 386
learning_rate = 3e-5
lr_schedule = LinearWarmup
warmup_proportion = 0.2
doc_stride = 128
max_query_length = 64
```
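
To make the interaction between `max_seq_len`, `doc_stride`, and `max_query_length` concrete, here is a minimal sketch of how a long context could be cut into overlapping windows. It rests on two assumptions that this card does not state: `doc_stride` is the step between window starts (the convention of the original SQuAD preprocessing scripts), and each input spends three positions on special tokens (`[CLS]` and two `[SEP]`). The function name `window_spans` is hypothetical.

```python
from typing import List, Tuple


def window_spans(
    n_context_tokens: int,
    n_query_tokens: int,
    max_seq_len: int = 386,
    doc_stride: int = 128,
    max_query_length: int = 64,
    n_special_tokens: int = 3,  # assumption: [CLS] question [SEP] context [SEP]
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets of each context window."""
    # The question is truncated to max_query_length tokens.
    n_query = min(n_query_tokens, max_query_length)
    # Whatever is left of the sequence budget is available for context.
    max_context = max_seq_len - n_query - n_special_tokens
    if max_context <= 0:
        raise ValueError("max_seq_len is too small for the question")

    spans = []
    start = 0
    while True:
        length = min(max_context, n_context_tokens - start)
        spans.append((start, start + length))
        if start + length >= n_context_tokens:
            break
        # Assumption: doc_stride is the distance between window starts.
        start += min(length, doc_stride)
    return spans


if __name__ == "__main__":
    # A 1000-token context with a 10-token question.
    for span in window_spans(1000, 10):
        print(span)
```

Under these assumptions a context shorter than the remaining budget yields a single window, while longer contexts are covered by windows that overlap heavily, so an answer near a window boundary still appears whole in at least one window.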

## Usage

### In Haystack

Haystack is an NLP framework by deepset. You can use this model in a Haystack pipeline to do question answering at scale (over many documents). To load the model in [Haystack](https://github.com/deepset-ai/haystack/):

```python
# Import path assumes Haystack 1.x
from haystack.nodes import FARMReader, TransformersReader

reader = FARMReader(model_name_or_path="innocent-charles/Swahili-question-answer-latest-cased")
# or
reader = TransformersReader(
    model_name_or_path="innocent-charles/Swahili-question-answer-latest-cased",
    tokenizer="innocent-charles/Swahili-question-answer-latest-cased",
)
```

For a complete example of `Swahili-question-answer-latest-cased` being used for Swahili question answering, check out the [tutorials in the Haystack documentation](https://haystack.deepset.ai).

### In Transformers

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "innocent-charles/Swahili-question-answer-latest-cased"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Asubuhi ilitupata pambajioni pa hospitali gani?',
    'context': 'Asubuhi hiyo ilitupata pambajioni pa hospitali ya Uguzwa.'
}
# res is a dict with the keys 'score', 'start', 'end' and 'answer'
res = nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
```
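
Because the model was also trained on unanswerable questions, it can be useful to let the pipeline return an empty answer and treat that case explicitly. The following is a minimal sketch, not part of the original card: the helper name `answer_or_none` and the `min_score` cut-off are hypothetical, while `handle_impossible_answer` is an existing call-time option of the Transformers question-answering pipeline.

```python
from typing import Callable, Optional


def answer_or_none(
    qa: Callable[..., dict],
    question: str,
    context: str,
    min_score: float = 0.1,  # hypothetical cut-off, not a value from this card
) -> Optional[str]:
    """Return the extracted answer, or None if the question looks unanswerable.

    `qa` is any callable that follows the calling convention of the
    question-answering pipeline, for example
    pipeline('question-answering', model=model_name, tokenizer=model_name).
    """
    # handle_impossible_answer=True allows the pipeline to return an empty answer.
    res = qa(question=question, context=context, handle_impossible_answer=True)
    answer = res["answer"].strip()
    if not answer or res["score"] < min_score:
        return None
    return answer
```

With the `nlp` pipeline and `QA_input` defined above, `answer_or_none(nlp, QA_input['question'], QA_input['context'])` returns either the answer string or `None`. A suitable `min_score` depends on the application and would need to be tuned on held-out data.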

## Performance

```
"exact": 51.87029394424324,
"f1": 63.91251169582613,

"total": 445,
"HasAns_exact": 50.93522267206478,
"HasAns_f1": 62.02838248389763,
"HasAns_total": 386,
"NoAns_exact": 49.79983179142137,
"NoAns_f1": 60.79983179142137,
"NoAns_total": 59
```
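
For readers unfamiliar with these metrics, here is a minimal sketch of SQuAD-style exact match and token-level F1 for a single prediction. It is an illustration under assumptions, not the evaluation code used to produce the numbers above: the normalization only lowercases, strips ASCII punctuation, and collapses whitespace, and it deliberately omits the English article removal found in the official SQuAD script.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # For unanswerable questions both sides are empty: score 1.0 only if they agree.
    if not pred_tokens or not gold_tokens:
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Uguzwa.", "uguzwa"))
    print(f1_score("hospitali ya Uguzwa", "Uguzwa"))
```

Dataset-level scores are the mean of these per-example values expressed as percentages. The `HasAns_*` and `NoAns_*` rows report the same metrics restricted to answerable and unanswerable questions respectively.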

## Special consideration

The project is ongoing, and the model will continue to be updated as it is trained on more data. Pull requests that improve the model's performance are welcome.

## Author

**Innocent Charles:** [email protected]

## About Me

<P>
I build good things using artificial intelligence, data, and analytics. I have over 3 years of experience as an Applied AI Engineer and Data Scientist, with a strong background in software engineering and a passion for, and extensive experience in, data and business.
</P>

[LinkedIn](https://www.linkedin.com/in/innocent-charles/) | [GitHub](https://github.com/innocent-charles) | [Website](https://innocentcharles.com)