Pierre-Carl Langlais

Pclanglais

Dorialexander

AI & ML interests

Open data & open LLMs

Recent Activity

updated a dataset about 11 hours ago

DistressedModel/synthetic_transport_tweet

published a dataset about 11 hours ago

DistressedModel/synthetic_transport_tweet

updated a dataset about 13 hours ago

JZSG/other2

Organizations

Posts 6

Post

3654

Today we are releasing our first foundation model and experimenting with a new category: specialized pre-training.

OCRonos-Vintage is a 124M-parameter model trained end-to-end by Pleias with llm.c on 18 billion tokens from cultural heritage archives. Despite its small size, it achieves nearly state-of-the-art results for OCR correction of historical English sources. OCRonos-Vintage is also a historical model with an unusual cut-off date: December 29th, 1955…

We look forward to replicating this approach very soon on other "hard" tasks commonly associated with generalist LLMs/SLMs: RAG, function calling, summarization, document segmentation…

OCRonos-Vintage: PleIAs/OCRonos-Vintage
CPU Demo: PleIAs/OCRonos-Vintage-CPU
GPU Demo: PleIAs/OCRonos-Vintage-GPU
Our announcement and call for specialized pre-training: https://huggingface.co/blog/Pclanglais/specialized-pre-training

Articles 8

Article

7

SYNTH: the new data frontier

Papers 2

arxiv:2504.18225

arxiv:2501.08365

Spaces 9

Reversed Zotero

Editorialization

Correction-OCR

Tchap

Motta

tag_theme

Models 49

Pclanglais/Gemma-Annotation-2

Updated Sep 26, 2025 • 10

Pclanglais/Gemma-Discourse-2

Updated Sep 1, 2025 • 6

Pclanglais/Gemma-Discourse

Updated Aug 31, 2025 • 4

Pclanglais/Qwen-Discourse

8B • Updated Aug 30, 2025 • 4

Pclanglais/Qwen-Debate

8B • Updated Aug 5, 2025 • 7

Pclanglais/deepseek-prover-drafter

7B • Updated Jul 30, 2025 • 4

Pclanglais/deepseek-prover-solver

7B • Updated Jul 30, 2025 • 7

Pclanglais/Brahe

Text Generation • 13B • Updated Jun 10, 2025 • 18 • 16

Pclanglais/Inheritance-Intensity

Text Classification • 0.4B • Updated Apr 23, 2025 • 14

Pclanglais/Inheritance-Meaning

Text Classification • 0.4B • Updated Apr 23, 2025 • 7

Datasets 19

Pclanglais/Nanochat

Viewer • Updated Nov 20, 2025 • 97.2M • 18k • 7

Pclanglais/other

Viewer • Updated Oct 13, 2025 • 451k • 45

Pclanglais/Youtube-Commons-Audio

Viewer • Updated Oct 11, 2025 • 985 • 32

Pclanglais/assembly-17-datasets

Updated Sep 22, 2025 • 3

Pclanglais/course-material

Viewer • Updated Jun 22, 2025 • 11 • 1.22k

Pclanglais/heritage

Preview • Updated Apr 26, 2025 • 455

Pclanglais/Onegin

Viewer • Updated Apr 21, 2025 • 977 • 11

Pclanglais/gutenberg_set

Viewer • Updated Mar 21, 2025 • 7.53M • 285

Pclanglais/tokenized_sample

Viewer • Updated Feb 10, 2025 • 1.54M • 423

Pclanglais/pdf_sample_10k

Viewer • Updated Nov 30, 2024 • 415k • 30 • 1
