matlok's Collections
Papers - Training
SELF: Language-Driven Self-Evolution for Large Language Model
Paper • 2310.00533 • Published • 2
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
Paper • 2310.00576 • Published • 2
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3
Transformers Can Achieve Length Generalization But Not Robustly
Paper • 2402.09371 • Published • 13
Triple-Encoders: Representations That Fire Together, Wire Together
Paper • 2402.12332 • Published • 2
Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 7
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 10
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2
Chain-of-Verification Reduces Hallucination in Large Language Models
Paper • 2309.11495 • Published • 38
Contrastive Decoding Improves Reasoning in Large Language Models
Paper • 2309.09117 • Published • 37
Qwen2 Technical Report
Paper • 2407.10671 • Published • 160
Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws
Paper • 2404.05405 • Published • 9
Scaling Laws for Precision
Paper • 2411.04330 • Published • 6
Neural Tangent Kernel: Convergence and Generalization in Neural Networks
Paper • 1806.07572 • Published • 1
Procedural Knowledge in Pretraining Drives Reasoning in Large Language Models
Paper • 2411.12580 • Published • 2
Studying Large Language Model Generalization with Influence Functions
Paper • 2308.03296 • Published • 12
Scaling and evaluating sparse autoencoders
Paper • 2406.04093 • Published • 3
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
Paper • 2308.11466 • Published • 1
ByT5: Towards a token-free future with pre-trained byte-to-byte models
Paper • 2105.13626 • Published • 3
CANINE: Pre-training an Efficient Tokenization-Free Encoder for Language Representation
Paper • 2103.06874 • Published • 1
Phi-4 Technical Report
Paper • 2412.08905 • Published • 92
An Evolved Universal Transformer Memory
Paper • 2410.13166 • Published • 3
No More Adam: Learning Rate Scaling at Initialization is All You Need
Paper • 2412.11768 • Published • 41
Byte Latent Transformer: Patches Scale Better Than Tokens
Paper • 2412.09871 • Published • 74