matlok's Collections
Papers - Training Research
Measuring the Effects of Data Parallelism on Neural Network Training
Paper • 1811.03600 • Published • 2
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost
Paper • 1804.04235 • Published • 2
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
Paper • 1905.11946 • Published • 3
Yi: Open Foundation Models by 01.AI
Paper • 2403.04652 • Published • 62
Extending Context Window of Large Language Models via Positional Interpolation
Paper • 2306.15595 • Published • 53
ELLA: Equip Diffusion Models with LLM for Enhanced Semantic Alignment
Paper • 2403.05135 • Published • 42
Algorithmic progress in language models
Paper • 2403.05812 • Published • 18
Adding NVMe SSDs to Enable and Accelerate 100B Model Fine-tuning on a Single GPU
Paper • 2403.06504 • Published • 53
QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models
Paper • 2310.16795 • Published • 26
CoCa: Contrastive Captioners are Image-Text Foundation Models
Paper • 2205.01917 • Published • 3
Wide Residual Networks
Paper • 1605.07146 • Published • 2
Slimmable Encoders for Flexible Split DNNs in Bandwidth and Resource Constrained IoT Systems
Paper • 2306.12691 • Published • 2
Learning to Reason and Memorize with Self-Notes
Paper • 2305.00833 • Published • 4
Bootstrap Your Own Skills: Learning to Solve New Tasks with Large Language Model Guidance
Paper • 2310.10021 • Published • 2
Self-Supervised Vision Transformers Learn Visual Concepts in Histopathology
Paper • 2203.00585 • Published • 2
DeepNet: Scaling Transformers to 1,000 Layers
Paper • 2203.00555 • Published • 2
Gemma: Open Models Based on Gemini Research and Technology
Paper • 2403.08295 • Published • 47
Scan and Snap: Understanding Training Dynamics and Token Composition in 1-layer Transformer
Paper • 2305.16380 • Published • 4
SELF: Language-Driven Self-Evolution for Large Language Model
Paper • 2310.00533 • Published • 2
GrowLength: Accelerating LLMs Pretraining by Progressively Growing Training Length
Paper • 2310.00576 • Published • 2
JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Paper • 2310.00535 • Published • 2
Does Circuit Analysis Interpretability Scale? Evidence from Multiple Choice Capabilities in Chinchilla
Paper • 2307.09458 • Published • 10
The Impact of Depth and Width on Transformer Language Model Generalization
Paper • 2310.19956 • Published • 9
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity
Paper • 2305.13169 • Published • 3
MicroNAS: Memory and Latency Constrained Hardware-Aware Neural Architecture Search for Time Series Classification on Microcontrollers
Paper • 2310.18384 • Published • 2
PreNAS: Preferred One-Shot Learning Towards Efficient Neural Architecture Search
Paper • 2304.14636 • Published • 2
Can GPT-4 Perform Neural Architecture Search?
Paper • 2304.10970 • Published • 2
Neural Architecture Search: Insights from 1000 Papers
Paper • 2301.08727 • Published • 2
Unified Functional Hashing in Automatic Machine Learning
Paper • 2302.05433 • Published • 2
Self-Discover: Large Language Models Self-Compose Reasoning Structures
Paper • 2402.03620 • Published • 112
Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models
Paper • 2310.06117 • Published • 3
Transformers Can Achieve Length Generalization But Not Robustly
Paper • 2402.09371 • Published • 13
Triple-Encoders: Representations That Fire Together, Wire Together
Paper • 2402.12332 • Published • 2
MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 125
Veagle: Advancements in Multimodal Representation Learning
Paper • 2403.08773 • Published • 7
Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking
Paper • 2403.09629 • Published • 75
3D-VLA: A 3D Vision-Language-Action Generative World Model
Paper • 2403.09631 • Published • 7
StreamMultiDiffusion: Real-Time Interactive Generation with Region-Based Semantic Control
Paper • 2403.09055 • Published • 24
Vision Transformer with Quadrangle Attention
Paper • 2303.15105 • Published • 2
Synthetic Shifts to Initial Seed Vector Exposes the Brittle Nature of Latent-Based Diffusion Models
Paper • 2312.11473 • Published • 2
Semi-Supervised Semantic Segmentation using Redesigned Self-Training for White Blood Cells
Paper • 2401.07278 • Published • 2
Beyond ChatBots: ExploreLLM for Structured Thoughts and Personalized Model Responses
Paper • 2312.00763 • Published • 19
Training Compute-Optimal Large Language Models
Paper • 2203.15556 • Published • 10
Unified Scaling Laws for Routed Language Models
Paper • 2202.01169 • Published • 2
Hash Layers For Large Sparse Models
Paper • 2106.04426 • Published • 2
Chain-of-Verification Reduces Hallucination in Large Language Models
Paper • 2309.11495 • Published • 38
Adapting Large Language Models via Reading Comprehension
Paper • 2309.09530 • Published • 77
Exploring Large Language Models' Cognitive Moral Development through Defining Issues Test
Paper • 2309.13356 • Published • 36
Large Language Models Cannot Self-Correct Reasoning Yet
Paper • 2310.01798 • Published • 33
Table-GPT: Table-tuned GPT for Diverse Table Tasks
Paper • 2310.09263 • Published • 39
TabLib: A Dataset of 627M Tables with Context
Paper • 2310.07875 • Published • 8
Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers
Paper • 2311.10642 • Published • 23
Text Generation with Diffusion Language Models: A Pre-training Approach with Continuous Paragraph Denoise
Paper • 2212.11685 • Published • 2
Neural networks behave as hash encoders: An empirical study
Paper • 2101.05490 • Published • 2
Large Language Models as Optimizers
Paper • 2309.03409 • Published • 75
Simple synthetic data reduces sycophancy in large language models
Paper • 2308.03958 • Published • 21
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Paper • 2305.10429 • Published • 3
Recurrent Drafter for Fast Speculative Decoding in Large Language Models
Paper • 2403.09919 • Published • 20
FouriScale: A Frequency Perspective on Training-Free High-Resolution Image Synthesis
Paper • 2403.12963 • Published • 7
Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs
Paper • 2403.12596 • Published • 9
PERL: Parameter Efficient Reinforcement Learning from Human Feedback
Paper • 2403.10704 • Published • 57
End-to-End Object Detection with Transformers
Paper • 2005.12872 • Published • 5
RewardBench: Evaluating Reward Models for Language Modeling
Paper • 2403.13787 • Published • 21
VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis
Paper • 2403.13501 • Published • 9
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Paper • 2305.13245 • Published • 5
ReNoise: Real Image Inversion Through Iterative Noising
Paper • 2403.14602 • Published • 19
DreamReward: Text-to-3D Generation with Human Preference
Paper • 2403.14613 • Published • 35
Chain of Thought Empowers Transformers to Solve Inherently Serial Problems
Paper • 2402.12875 • Published • 13
CRITIC: Large Language Models Can Self-Correct with Tool-Interactive Critiquing
Paper • 2305.11738 • Published • 8
Shepherd: A Critic for Language Model Generation
Paper • 2308.04592 • Published • 31
DRLC: Reinforcement Learning with Dense Rewards from LLM Critic
Paper • 2401.07382 • Published • 2
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Paper • 2402.02622 • Published • 3
TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models
Paper • 2403.17005 • Published • 13
SDXS: Real-Time One-Step Latent Diffusion Models with Image Conditions
Paper • 2403.16627 • Published • 20
LLM Agent Operating System
Paper • 2403.16971 • Published • 65
Prompt me a Dataset: An investigation of text-image prompting for historical image dataset creation using foundation models
Paper • 2309.01674 • Published • 2
Data Distributional Properties Drive Emergent In-Context Learning in Transformers
Paper • 2205.05055 • Published • 2
InternLM2 Technical Report
Paper • 2403.17297 • Published • 30
LIMA: Less Is More for Alignment
Paper • 2305.11206 • Published • 21
Masked Audio Generation using a Single Non-Autoregressive Transformer
Paper • 2401.04577 • Published • 42
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models
Paper • 2403.20331 • Published • 14
DiJiang: Efficient Large Language Models through Compact Kernelization
Paper • 2403.19928 • Published • 10
Why Transformers Need Adam: A Hessian Perspective
Paper • 2402.16788 • Published • 1
Bigger is not Always Better: Scaling Properties of Latent Diffusion Models
Paper • 2404.01367 • Published • 21
Training LLMs over Neurally Compressed Text
Paper • 2404.03626 • Published • 21
Locating and Editing Factual Associations in GPT
Paper • 2202.05262 • Published • 1
Prompt-to-Prompt Image Editing with Cross Attention Control
Paper • 2208.01626 • Published • 2
Align before Fuse: Vision and Language Representation Learning with Momentum Distillation
Paper • 2107.07651 • Published • 1
Superposition Prompting: Improving and Accelerating Retrieval-Augmented Generation
Paper • 2404.06910 • Published • 2
RecurrentGemma: Moving Past Transformers for Efficient Open Language Models
Paper • 2404.07839 • Published • 43
Instruction Tuning with Human Curriculum
Paper • 2310.09518 • Published • 3
OOVs in the Spotlight: How to Inflect them?
Paper • 2404.08974 • Published • 1
All you need is a good init
Paper • 1511.06422 • Published • 1
Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Paper • 2404.18796 • Published • 68
KAN: Kolmogorov-Arnold Networks
Paper • 2404.19756 • Published • 108
Greedy Growing Enables High-Resolution Pixel-Based Diffusion Models
Paper • 2405.16759 • Published • 7
Beyond Euclid: An Illustrated Guide to Modern Machine Learning with Geometric, Topological, and Algebraic Structures
Paper • 2407.09468 • Published • 1
Differential Transformer
Paper • 2410.05258 • Published • 168