Update README.md
README.md CHANGED
@@ -1,5 +1,7 @@
 ---
 license: apache-2.0
+language:
+- pl
 ---
 <p align="center">
 <img src="https://pllum.org.pl/_nuxt/PLLuM_logo_RGB_color.DXNEc-VR.png">
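For context on the hunk above: the block between the `---` markers is YAML front matter that the Hugging Face Hub parses into model card metadata (the license badge, the language filter). Below is a minimal sketch of reading the newly added `language` field; the PyYAML part only re-parses the front matter shown in this diff, while the Hub round-trip uses a placeholder repo id that is an assumption, not something named in this commit:

```python
import yaml  # pip install pyyaml
from huggingface_hub import ModelCard  # pip install huggingface_hub

# Front matter exactly as it stands after this commit (copied from the diff).
front_matter = """\
license: apache-2.0
language:
- pl
"""

meta = yaml.safe_load(front_matter)
print(meta["license"])   # 'apache-2.0'
print(meta["language"])  # ['pl'] -- the field added by this commit

# Equivalent check against the published card. The repo id below is a
# hypothetical placeholder, not taken from this diff.
card = ModelCard.load("CYFRAGOVPL/PLLuM-12B-instruct")
print(card.data.language)
```

Declaring `pl` here is what makes the model discoverable under the Hub's Polish language filter.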
@@ -50,7 +52,7 @@ Below is a summary of the main PLLuM models, including their licenses, bases, an
 
 ### Model Development
 - **Pretraining**: All models were pretrained or continued-pretrained on large-scale Polish corpora (up to 150B tokens) plus a range of additional Slavic/Baltic and English texts.
-- **Instruction Fine-Tuning**: We refined the models on manually curated Polish “organic instructions
+- **Instruction Fine-Tuning**: We refined the models on manually curated Polish “organic instructions” (approx. 40k), converted instructions from premium Polish corpora (approx. 50k), and synthetic instructions generated by strong LLMs (approx. 10k).
 - **Alignment and Preference Learning**: Manually annotated preference data taught the models to produce safer, balanced, and contextually appropriate responses, even in adversarial or sensitive cases.
 - **Domain-Specific Adaptations**: Specialized RAG-based (Retrieval Augmented Generation) models were developed for tasks like public administration, demonstrating strong performance in complex information retrieval and question answering.
 
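The bullets in the hunk above describe the training pipeline rather than an inference API, but a short usage sketch may help readers who want to try one of the resulting instruction-tuned checkpoints. This is assumption-laden: the repo id is a hypothetical placeholder (no checkpoint is named in this diff), and `apply_chat_template` presumes the tokenizer ships a chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical placeholder id; substitute the actual PLLuM checkpoint you use.
model_id = "CYFRAGOVPL/PLLuM-12B-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# A Polish instruction, matching the models' primary training language.
messages = [{"role": "user", "content": "Czym jest PLLuM?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```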
@@ -229,4 +231,4 @@ We welcome feedback, collaboration, and further exploration of PLLuM models!
 Project financed by the Minister of Digital Affairs under the targeted subsidy No. 1/WI/DBiI/2023: *“Responsible development of the open large language model PLLuM (Polish Large Language Model) to support breakthrough technologies in the public and economic sector, including an open, Polish-language intelligent assistant for petitioners.”*
 
 **Funding Amount:** 14,504,392.00 PLN
-**Contract Signing Date:** 2024-01-22
+**Contract Signing Date:** 2024-01-22