language:
- nl
tags:
- punctuation prediction
- punctuation
datasets: sonar
license: mit
widget:
- text: >-
hervatting van de zitting ik verklaar de zitting van het europees
parlement die op vrijdag 17 december werd onderbroken te zijn hervat
example_title: Dutch Sample
metrics:
- f1
This model predicts the punctuation of Dutch texts. We developed it to restore the punctuation of transcribed spoken language.
The model was trained on the SoNaR dataset.
The model restores the following punctuation markers: "." "," "?" "-" ":"
Sample Code
We provide a simple Python package that allows you to process text of any length.
Install
To get started, install the package from PyPI:
pip install deepmultilingualpunctuation
Restore Punctuation
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
result = model.restore_punctuation(text)
print(result)
Output
hervatting van de zitting. ik verklaar de zitting van het europees parlement, die op vrijdag 17 december werd onderbroken, te zijn hervat.
Predict Labels
from deepmultilingualpunctuation import PunctuationModel
model = PunctuationModel(model="oliverguhr/fullstop-dutch-sonar-punctuation-prediction")
text = "hervatting van de zitting ik verklaar de zitting van het europees parlement die op vrijdag 17 december werd onderbroken te zijn hervat"
clean_text = model.preprocess(text)
labeled_words = model.predict(clean_text)
print(labeled_words)
Output
[['hervatting', '0', 0.99998724], ['van', '0', 0.9999784], ['de', '0', 0.99991274], ['zitting', '.', 0.6771242], ['ik', '0', 0.9999466], ['verklaar', '0', 0.9998566], ['de', '0', 0.9999783], ['zitting', '0', 0.9999809], ['van', '0', 0.99996245], ['het', '0', 0.99997795], ['europees', '0', 0.9999783], ['parlement', ',', 0.9908242], ['die', '0', 0.999985], ['op', '0', 0.99998224], ['vrijdag', '0', 0.9999831], ['17', '0', 0.99997985], ['december', '0', 0.9999827], ['werd', '0', 0.999982], ['onderbroken', ',', 0.9951485], ['te', '0', 0.9999677], ['zijn', '0', 0.99997723], ['hervat', '.', 0.9957053]]
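Each entry in the output is a `[word, label, score]` triple, where the label `0` means that no punctuation follows the word. As a minimal sketch of how such triples could be turned back into text, the helper below appends each predicted marker to its word and optionally skips low-confidence predictions. The function `apply_labels` and its `min_confidence` parameter are hypothetical names used for illustration; they are not part of the package, and this is not necessarily how `restore_punctuation` works internally.

```python
def apply_labels(labeled_words, min_confidence=0.0):
    """Join [word, label, score] triples into a punctuated string.

    Hypothetical helper for illustration; not part of the package.
    """
    tokens = []
    for word, label, score in labeled_words:
        # The label "0" means "no punctuation after this word".
        if label != "0" and score >= min_confidence:
            tokens.append(word + label)
        else:
            tokens.append(word)
    return " ".join(tokens)


# The first six triples from the output above.
sample = [
    ["hervatting", "0", 0.99998724],
    ["van", "0", 0.9999784],
    ["de", "0", 0.99991274],
    ["zitting", ".", 0.6771242],
    ["ik", "0", 0.9999466],
    ["verklaar", "0", 0.9998566],
]

print(apply_labels(sample))
# hervatting van de zitting. ik verklaar

print(apply_labels(sample, min_confidence=0.9))
# hervatting van de zitting ik verklaar
```

With a threshold of 0.9, the full stop after "zitting" (score 0.677) is dropped, which shows how the score can be used to trade recall for precision.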
Results
Performance differs between punctuation markers because hyphens and colons are, in many cases, optional and can be replaced by either a comma or a full stop. The model achieves the following F1 scores:
Label | F1 Score |
---|---|
0 | 0.985816 |
. | 0.854380 |
? | 0.684060 |
, | 0.719308 |
: | 0.696088 |
- | 0.722000 |
macro average | 0.776942 |
micro average | 0.963427 |
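The macro average in the table is the unweighted mean of the six per-label F1 scores, which the short check below reproduces. The dictionary name `f1_scores` is chosen here for illustration. The micro average cannot be recomputed from the table alone, because it depends on how often each label occurs in the test data.

```python
# Per-label F1 scores copied from the table above.
f1_scores = {
    "0": 0.985816,
    ".": 0.854380,
    "?": 0.684060,
    ",": 0.719308,
    ":": 0.696088,
    "-": 0.722000,
}

# Macro average: every label counts equally, regardless of frequency.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
print(round(macro_f1, 6))
# 0.776942
```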