Boundless Byte Pair Encoding: Breaking the Pre-tokenization Barrier Paper • 2504.00178 • Published 6 days ago • 1
Multimodal LLMs for OCR, OCR Post-Correction, and Named Entity Recognition in Historical Documents Paper • 2504.00414 • Published 5 days ago • 1
Overcoming Vocabulary Constraints with Pixel-level Fallback Paper • 2504.02122 • Published 4 days ago • 1
E3C-Projected Collection • This collection contains the projected datasets of English layer one of E3C into Greek, Italian, Polish, Slovak, and Slovenian • 11 items • Updated Jan 8 • 1
State Fourier Diffusion Language Model (SFDLM): A Scalable, Novel Iterative Approach to Language Modeling Paper • 2503.17382 • Published 21 days ago • 1
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model Paper • 2211.05100 • Published Nov 9, 2022 • 31
xLSTM 7B: A Recurrent LLM for Fast and Efficient Inference Paper • 2503.13427 • Published 20 days ago • 3
UniBERTs: Adversarial Training for Language-Universal Representations Paper • 2503.12608 • Published 21 days ago • 1
Do Construction Distributions Shape Formal Language Learning In German BabyLMs? Paper • 2503.11593 • Published 23 days ago • 1
HyperZ·Z·W Operator Connects Slow-Fast Networks for Full Context Interaction Paper • 2401.17948 • Published Jan 31, 2024 • 4
Modern Models, Medieval Texts: A POS Tagging Study of Old Occitan Paper • 2503.07827 • Published 27 days ago • 1
Reading the unreadable: Creating a dataset of 19th century English newspapers using image-to-text language models Paper • 2502.14901 • Published Feb 18 • 2
NER4all or Context is All You Need: Using LLMs for low-effort, high-performance NER on historical texts. A humanities informed approach Paper • 2502.04351 • Published Feb 4 • 1
AI-assisted German Employment Contract Review: A Benchmark Dataset Paper • 2501.17194 • Published Jan 27 • 1
Familiarity: Better Evaluation of Zero-Shot Named Entity Recognition by Quantifying Label Shifts in Synthetic Training Data Paper • 2412.10121 • Published Dec 13, 2024 • 2
MorphBPE: A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies Paper • 2502.00894 • Published Feb 2 • 2