kisti
/

korscideberta

@@ -9,9 +9,9 @@ metrics:
 <!-- Provide a quick summary of what the model is/does. -->
-KorSciDeBERTa는 Microsoft DeBERTa 모델의 아키텍쳐를 기반으로, 논문, NTIS 연구과제, 특허, 뉴스, 한국어 위키 코퍼스 총 146GB를 사전학습한 모델입니다.
-마스킹된 언어 모델링 또는 다음 문장 예측에 사전학습 모델을 사용할 수 있고, 또한 문장 분류, 단어 토큰 분류 또는 질의응답과 같은 다운스트림 작업에서 미세 조정을 통해 사용될 수 있습니다.
 ## Model Details
@@ -79,31 +79,19 @@ trainer.push_to_hub()
 특정 관점에 대한 과대/과소 표현, 고정 관념, 개인 정보, 증오/모욕 또는 폭력적인 언어, 차별적이거나 편견적인 언어, 관련이 없거나 반복적인 출력 생성 등.
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
 ## Training Details
 ### Training Data
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
 ### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
 - 과학기술분야 토크나이저 (KorSci Tokenizer)
 - 본 사전학습 모델에서 사용된 코퍼스를 기반으로 명사 및 복합명사 약 600만개의 사용자사전이 추가된 [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/)와 기존 SentencePiece-BPE가 병합되어진 토크나이저를 사용하여 말뭉치를 전처리하였습니다.
@@ -121,11 +109,6 @@ Use the code below to get started with the model.
 - **vocab_size:** 128,100
 - **Training regime:** fp16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
 ## Evaluation
@@ -175,23 +158,6 @@ Python 3.9, Cuda 11.8, PyTorch 1.10
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 한국과학기술정보연구원 (2023) : 한국어 과학기술분야 DeBERTa 사전학습 모델 (KorSciDeBERTa). Version 1.0. 한국과학기술정보연구원.
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
 ## Model Card Authors

 <!-- Provide a quick summary of what the model is/does. -->
+KorSciDeBERTa는 Microsoft DeBERTa 모델의 아키텍쳐를 기반으로, 논문, NTIS 연구과제, 특허, 뉴스, 한국어 위키 말뭉치 총 146GB를 사전학습한 모델입니다.
+마스킹된 언어 모델링 또는 다음 문장 예측에 사전학습 모델을 사용할 수 있고, 추가로 문장 분류, 단어 토큰 분류 또는 질의응답과 같은 다운스트림 작업에서 미세 조정을 통해 사용될 수 있습니다.
 ## Model Details
 특정 관점에 대한 과대/과소 표현, 고정 관념, 개인 정보, 증오/모욕 또는 폭력적인 언어, 차별적이거나 편견적인 언어, 관련이 없거나 반복적인 출력 생성 등.
 ## Training Details
 ### Training Data
 <!-- This should link to a Data Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+논문, NTIS 연구과제, 특허, 뉴스, 한국어 위키 말뭉치 총 146GB
 ### Training Procedure
+KISTI HPC NVIDIA A100 80G GPU 24EA에서 2.5개월동안 1,600,000 스텝 학습
+#### Preprocessing
 - 과학기술분야 토크나이저 (KorSci Tokenizer)
 - 본 사전학습 모델에서 사용된 코퍼스를 기반으로 명사 및 복합명사 약 600만개의 사용자사전이 추가된 [Mecab-ko Tokenizer](https://bitbucket.org/eunjeon/mecab-ko/src/master/)와 기존 SentencePiece-BPE가 병합되어진 토크나이저를 사용하여 말뭉치를 전처리하였습니다.
 - **vocab_size:** 128,100
 - **Training regime:** fp16 mixed precision <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 ## Evaluation
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 한국과학기술정보연구원 (2023) : 한국어 과학기술분야 DeBERTa 사전학습 모델 (KorSciDeBERTa). Version 1.0. 한국과학기술정보연구원.
 ## Model Card Authors