---
license: cc-by-nc-4.0
widget:
  - text: 觀象監啓京都坤方低卑又水口寬闊故於崇仁興禮二門之外皆鑿池貯水近者不曾修築或塡塞水淺或堙沒無址願深鑿貯水植木堤岸以畜氣脈上不納
  - text: 初一日上於崇文堂特召臣希春及下番金睟臣等與承旨崔應龍史官等入見上曰此非言談之時但予欲聞前古帝王之善居喪者耳臣希春進座前
  - text: 兵曹判書趙疏曰伏以臣卽再生之人耳以以其跡則至孤子也以其地則至齟齬也以其罪名則至危悕怖也而猶且得全其軀命復見臣母於三朔相訣之餘者
---

# HanmunRoBERTa (March 2024 Release)

The Big Data Studies Lab at the University of Hong Kong is delighted to introduce this early release of HanmunRoBERTa, a transformer-based model trained exclusively on texts in literary Sinitic authored by Koreans before the 20th century. This version is an early prototype, optimised with data from the Veritable Records (Sillok 實錄) and the Diary of the Royal Secretariat (Sŭngjŏngwŏn ilgi 承政院日記).

HanmunRoBERTa was pretrained from scratch on 443.5 million characters drawn from the Veritable Records, the Diary of the Royal Secretariat, A Compendium of Korean Collected Works (Han’guk munjip ch’onggan 韓國文集叢刊), and various Korean hanmun miscellanies. The century prediction task, however, was fine-tuned on a sample from the Veritable Records and the Diary of the Royal Secretariat only. It therefore performs exceptionally well (~98% accuracy) on court entries but may give mixed results on Korean munjip or non-Korean texts.

The Inference API widget provides the example inputs listed above.

At this stage, HanmunRoBERTa is prone to overfitting and requires further adjustment and refinement. Because the model was pretrained and fine-tuned on unpunctuated texts, test samples must be unpunctuated as well: before inference, remove all non-Sinitic characters and special symbols, including punctuation. The Hugging Face Inference API does not perform this preprocessing for you. If you are interested in testing HanmunRoBERTa, we recommend our HanmunRoBERTa Century Prediction web app.
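The required cleanup can be sketched with a short Python helper. The character ranges below (CJK Unified Ideographs plus Extension A) are our assumption about what counts as "Sinitic" here, not part of the model release; extend them if your corpus uses rarer extensions.

```python
import re

# Keep only characters in the CJK Unified Ideographs blocks
# (U+4E00–U+9FFF and Extension A, U+3400–U+4DBF); everything else —
# punctuation, whitespace, Latin letters, digits — is removed.
NON_SINITIC = re.compile(r"[^\u3400-\u4DBF\u4E00-\u9FFF]")

def clean_hanmun(text: str) -> str:
    """Strip punctuation and all non-Sinitic characters, as the model expects."""
    return NON_SINITIC.sub("", text)

# Example: a punctuated phrase from the sample entries above.
print(clean_hanmun("上曰：此非言談之時。但予欲聞前古帝王之善居喪者耳！"))
# → 上曰此非言談之時但予欲聞前古帝王之善居喪者耳
```

Running input through a helper like this before submitting it to the model should bring it in line with the unpunctuated training data.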