metadata

language: sw
datasets:
  - kenyacorpus_v2
license: cc-by-4.0
model-index:
  - name: innocent-charles/Swahili-question-answer-latest-cased
    results:
      - task:
          type: question-answering
          name: Question Answering
        dataset:
          name: kenyacorpus
          type: kenyacorpus
          config: kenyacorpus
          split: validation
        metrics:
          - name: Exact Match
            type: exact_match
            value: 79.9309
            verified: true
          - name: F1
            type: f1
            value: 82.9501
            verified: true
          - name: total
            type: total
            value: 11869
            verified: true

SWAHILI QUESTION - ANSWER MODEL

This model is bert-base-multilingual-cased, fine-tuned on the KenyaCorpus dataset. It was trained on question-answer pairs, including unanswerable questions, for question answering in Swahili.

Question answering (QA) is a computer science discipline within information retrieval and NLP. It is concerned with building systems that, given a question in natural language, extract relevant information from the provided data and present it as a natural-language answer.
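
In the extractive setting this model targets, the answer is a span copied from the context. The sketch below illustrates one common way such a span can be chosen from per-token start and end scores. The function name best_span, the max_answer_len limit, and the logit values are made up for illustration; they are not taken from this model or its training code.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 15,
) -> Tuple[int, int, float]:
    """Return (start, end, score) for the highest-scoring span (end inclusive)."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider spans that start at s and are at most max_answer_len tokens long.
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


# Whitespace tokens of the example context; the logits are invented for this sketch.
tokens = ["Asubuhi", "hiyo", "ilitupata", "pambajioni", "pa", "hospitali", "ya", "Uguzwa", "."]
start_logits = [0.1, 0.0, 0.2, 0.3, 0.1, 1.5, 0.4, 2.9, 0.0]
end_logits = [0.0, 0.1, 0.1, 0.2, 0.0, 0.6, 0.3, 3.1, 0.2]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer)
```

A real model scores subword tokens rather than whitespace tokens, and models trained with unanswerable questions typically also compare the best span against a "no answer" score.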

Overview

Language model used: bert-base-multilingual-cased
Language: Kiswahili
Downstream task: Extractive Swahili QA
Training data: KenyaCorpus
Eval data: KenyaCorpus
Code: See an example QA pipeline on Haystack
Infrastructure: AWS NVIDIA A100 Tensor Core GPU

Hyperparameters

batch_size = 16
n_epochs = 10
base_LM_model = "bert-base-multilingual-cased"
max_seq_len = 386
learning_rate = 3e-5
lr_schedule = LinearWarmup
warmup_proportion = 0.2
doc_stride = 128
max_query_length = 64
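
To make the interaction between max_seq_len, doc_stride, and max_query_length more concrete, here is a minimal sketch of how a long context could be split into overlapping windows. It assumes doc_stride is the step between window starts, as in the original BERT SQuAD script, and that three special tokens are added per sequence; the card does not state which convention the training code used. The function passage_windows is hypothetical.

```python
from typing import List, Tuple

MAX_SEQ_LEN = 386
DOC_STRIDE = 128
MAX_QUERY_LENGTH = 64


def passage_windows(
    n_context_tokens: int,
    n_query_tokens: int,
    max_seq_len: int = MAX_SEQ_LEN,
    doc_stride: int = DOC_STRIDE,
    max_query_length: int = MAX_QUERY_LENGTH,
    n_special_tokens: int = 3,  # assumption: [CLS] query [SEP] context [SEP]
) -> List[Tuple[int, int]]:
    """Return (start, end) context token ranges, end exclusive, one per window."""
    query_len = min(n_query_tokens, max_query_length)  # long queries are truncated
    room = max_seq_len - query_len - n_special_tokens  # context tokens per window
    if room <= 0:
        raise ValueError("max_seq_len leaves no room for context tokens")
    windows = []
    start = 0
    while start < n_context_tokens:
        length = min(n_context_tokens - start, room)
        windows.append((start, start + length))
        if start + length == n_context_tokens:
            break  # the last window reaches the end of the context
        start += min(length, doc_stride)
    return windows


# A 500-token context with a 12-token question needs three overlapping windows.
windows = passage_windows(n_context_tokens=500, n_query_tokens=12)
print(windows)
```

Note that the Hugging Face tokenizers use the name stride for the overlap between consecutive windows rather than the step, so the same number can mean different things depending on the toolkit.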

Usage

In Haystack

Haystack is an NLP framework by deepset. You can use this model in a Haystack pipeline to do question answering at scale (over many documents). To load the model in Haystack:

# Import path for Haystack 1.x (farm-haystack); other versions may differ
from haystack.nodes import FARMReader, TransformersReader

reader = FARMReader(model_name_or_path="innocent-charles/Swahili-question-answer-latest-cased")
# or
reader = TransformersReader(model_name_or_path="innocent-charles/Swahili-question-answer-latest-cased", tokenizer="innocent-charles/Swahili-question-answer-latest-cased")

For a complete example of Swahili-question-answer-latest-cased used for Swahili question answering, see the tutorials in the Haystack documentation.

In Transformers

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

model_name = "innocent-charles/Swahili-question-answer-latest-cased"

# a) Get predictions
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)
QA_input = {
    'question': 'Asubuhi ilitupata pambajioni pa hospitali gani?',
    'context': 'Asubuhi hiyo ilitupata pambajioni pa hospitali ya Uguzwa.'
}
res = nlp(QA_input)

# b) Load model & tokenizer
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
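
The question-answering pipeline returns a dictionary with the keys score, start, end, and answer. Because the model was trained with unanswerable questions, it can be useful to let the pipeline return an empty answer and to handle that case explicitly. The sketch below shows one possible way to do so; the helper names extract_answer and ask and the min_score threshold are hypothetical, and a sensible threshold would have to be tuned on held-out data.

```python
from typing import Optional

MODEL_NAME = "innocent-charles/Swahili-question-answer-latest-cased"


def extract_answer(result: dict, min_score: float = 0.0) -> Optional[str]:
    """Return the answer text, or None if it is empty or scores below min_score."""
    answer = result.get("answer", "").strip()
    if not answer or result.get("score", 0.0) < min_score:
        return None
    return answer


def ask(question: str, context: str, min_score: float = 0.0) -> Optional[str]:
    """Run the model on one question/context pair (downloads the model on first use)."""
    # Imported lazily so extract_answer can be used without transformers installed.
    from transformers import pipeline

    qa = pipeline("question-answering", model=MODEL_NAME, tokenizer=MODEL_NAME)
    # handle_impossible_answer=True allows an empty string as the predicted answer.
    result = qa(question=question, context=context, handle_impossible_answer=True)
    return extract_answer(result, min_score=min_score)
```

For example, ask('Asubuhi ilitupata pambajioni pa hospitali gani?', 'Asubuhi hiyo ilitupata pambajioni pa hospitali ya Uguzwa.') returns either the predicted span or None. Building the pipeline once and reusing it is preferable when asking many questions.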

Performance

"exact": 79.87029394424324,
"f1": 82.91251169582613,
"total": 11873,
"HasAns_exact": 77.93522267206478,
"HasAns_f1": 84.02838248389763,
"HasAns_total": 5928,
"NoAns_exact": 81.79983179142137,
"NoAns_f1": 81.79983179142137,
"NoAns_total": 5945

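The overall exact and f1 values in the Performance figures are consistent with a weighted average of the HasAns and NoAns subsets, weighted by the number of questions in each. The short check below reproduces that arithmetic from the reported numbers; it assumes SQuAD 2.0-style aggregation, which the metric names suggest but the card does not state.

```python
# Reported per-subset scores, copied from the Performance figures.
has_ans = {"exact": 77.93522267206478, "f1": 84.02838248389763, "total": 5928}
no_ans = {"exact": 81.79983179142137, "f1": 81.79983179142137, "total": 5945}


def combine(metric: str) -> float:
    """Average a metric over both subsets, weighted by subset size."""
    total = has_ans["total"] + no_ans["total"]
    weighted = has_ans[metric] * has_ans["total"] + no_ans[metric] * no_ans["total"]
    return weighted / total


total_questions = has_ans["total"] + no_ans["total"]  # 11873
overall_exact = combine("exact")  # about 79.87
overall_f1 = combine("f1")        # about 82.91
print(total_questions, round(overall_exact, 2), round(overall_f1, 2))
```

Under that aggregation, exact match and F1 coincide on the NoAns subset because an unanswerable question is scored as either fully right (the model abstains) or fully wrong.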
Contributing

The project is ongoing, and the model will be updated as it is trained on more data. Pull requests that help improve the model's performance are welcome.

Author

Innocent Charles: [email protected]

About Me

I build good things using artificial intelligence, data, and analytics. I have over 3 years of experience as an Applied AI Engineer and Data Scientist, with a strong background in software engineering and a passion for, and extensive experience in, data and business.

LinkedIn | GitHub | Website