---
title: "Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale"
format:
  revealjs:
    theme: moon
    fig-format: png
    auto-play-media: true
---
## Who am I?
- Zachary Mueller
- Technical Lead for the πŸ€— Accelerate project
- Maintain the `transformers` Trainer
- API design geek
## What is πŸ€— Accelerate?
* A training framework
* An inference framework
* A command-line interface
## A Training Framework
* Powered by PyTorch
* Change a few lines of code, gain device- *and* hardware-agnostic capabilities
* Low-code, with minimal magic, aimed at easy hackability without high-level abstractions
* We handle the intricacies so you don't have to
## A Training Framework
::: {style="font-size: 70%;"}
* Support for any hardware accelerator on the market:
  * CPU, GPU, TPU, XPU, NPU, MLU
* Automatic mixed-precision training, *safely*, in whatever fashion you choose:
  * FP16, BF16, FP8 (through either `TransformerEngine` or `MS-AMP`)
* Automatic and efficient gradient accumulation (see the sketch after this list)
* Support for quantization through `bitsandbytes`
* Support for your favorite experiment trackers (`aim`, `clearml`, `comet_ml`, `dvclive`, `mlflow`, `tensorboard`, `wandb`)
* An easy-to-configure plugin- or YAML-level API for setting up advanced frameworks like `FSDP`, `DeepSpeed`, and `Megatron-LM`
:::
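For instance, mixed precision and gradient accumulation are each a one-argument change. A minimal sketch, assuming `model`, `optimizer`, and `dataloader` are already defined as in the migration example that follows:

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch.nn.functional as F
from accelerate import Accelerator

# bf16 autocasting and 4-step gradient accumulation,
# each enabled by a single constructor argument
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for source, targets in dataloader:
    # skips gradient synchronization (and the optimizer step)
    # until 4 batches have been accumulated
    with accelerator.accumulate(model):
        output = model(source)
        loss = F.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```
:::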
## Low-Code
::: {style="font-size: 70%;"}
* The biggest friction with "wrapper" libraries is losing control of your code
* By being minimally intrusive, your code just "works" while you keep complete control
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device
  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
-         source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  1. Create an `Accelerator`
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator
+ accelerator = Accelerator()
  device = 'cpu'
  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  2. Wrap your PyTorch objects with `accelerator.prepare` and remove device placements
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator
  accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device
  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
-         source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  3. Use `accelerator.backward` for the backward pass
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator
  accelerator = Accelerator()
  device = accelerator.device
  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
  model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::
## But what about inference?
* πŸ€— Accelerate is not just for training; it has helped the GPU-poor take control of the narrative
* Using tools like Big Model Inference, users with *tiny* compute can run large models locally
* It started with the boom of Stable Diffusion and has since scaled to running huge LLMs locally on a single graphics card
## How does it work?
* PyTorch introduced `device="meta"`
* πŸ€— Accelerate introduced `device_map="auto"`
::: {style="padding-left:15%;padding-right:20%"}
{{< video big_model_visualization.mp4 width="800" height="400" >}}
:::
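A rough sketch of the two halves, using a small model purely for illustration:

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModelForCausalLM

# 1. `device="meta"` under the hood: the model skeleton is
#    created with no memory allocated for its weights
config = AutoConfig.from_pretrained("gpt2")
with init_empty_weights():
    empty_model = AutoModelForCausalLM.from_config(config)

# 2. `device_map="auto"`: real weights are loaded and dispatched
#    across GPU(s), CPU RAM, and disk as capacity allows
model = AutoModelForCausalLM.from_pretrained(
    "gpt2", device_map="auto", torch_dtype=torch.float16
)
```
:::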
## A CLI Interface
* `accelerate config`
  * Configure the environment
* `accelerate launch`
  * How to run your script
## Launching distributed training is hard
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
python script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
torchrun --nnodes=1 --nproc_per_node=2 script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
deepspeed --num_gpus=2 script.py
```
<br>
:::
How can we make this better?
## `accelerate launch`
::: {style="padding-top:0%;padding-left:5%;padding-right:10%;padding-bottom:0%"}
```bash
accelerate launch script.py
```
<br>
```bash
accelerate launch --multi_gpu --num_processes 2 script.py
```
<br>
```bash
accelerate launch \
--use_deepspeed \
--num_processes 2 \
script.py
```
:::
## `accelerate config`
* Relies on `config.yaml` files
* Either run `accelerate config` or write your own:
:::: {.columns style="font-size: 60%;padding-left:5%;padding-right:5%"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::
# Now that you're up to speed, what's new?
# We've had a busy last year, and so has the ML Community!
## New training techniques
- Quantization has taken the field by storm
- New ideas such as FSDP + QLoRA to train huge models on tiny compute!
- New precision backends as we train natively on smaller precision
- Optimizing futher how much we can push on a single machine through efficient RAM and timing techniques
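A minimal sketch of the quantization side of the FSDP + QLoRA recipe (the model name and dtypes here are illustrative; `bnb_4bit_quant_storage` is the piece that lets FSDP shard the 4-bit weights):

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization; storing the quantized weights as bf16
# makes them shardable by FSDP alongside the LoRA adapters
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_storage=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", quantization_config=bnb_config
)
```
:::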
## Larger compute landscape
- As we search for alternatives to NVIDIA, new contenders rise:
  - XPU (Intel)
  - NPU (Huawei Ascend)
  - MLU (Cambricon)

All of which are supported by πŸ€— Accelerate
## Lower abstractions
* While the `Accelerator` is great, we needed lower-level abstractions focused on controlling specific behaviors
* Introduced the `PartialState`
::: {style="padding-left:10%;padding-top:0%;padding-right:15%"}
```python
from accelerate import PartialState

# Run a block on only the main process
if PartialState().is_main_process:
    ...

# Let the main process go first (useful for dataset processing)
with PartialState().main_process_first():
    ...

# Device-agnostic placement without the bulk of the `Accelerator`
device = PartialState().device
```
:::
## Faster and better inference alternatives
::: {style="font-size:70%"}
- `PiPPy` gives us efficient pipeline parallelism in distributed environments, increasing throughput while keeping a simple torch-bound API
- Rather than each GPU waiting its turn, every GPU can stay busy in parallel
:::
::: {style="font-size:60%;padding-left:19%;padding-top:0%;padding-right:24%;"}
```python
import torch
from transformers import AutoModelForSequenceClassification
from accelerate import prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

# Example input used to trace the split points
input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)
# Split the model into pipeline stages, one per device
model = prepare_pippy(model, split_points="auto", example_args=(input,))
with torch.no_grad():
    output = model(input)
```
:::
# Adoption: Accelerate in the ecosystem
## Accelerate in the Ecosystem
* Many of the frameworks you use daily already rely on πŸ€— Accelerate!
  * Nearly all of πŸ€—
  * `axolotl`
  * `fastai`
  * `FastChat`
  * `lucidrains`
  * `kornia`
## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Started as a way to isolate the distributed code for TPUs and `DistributedDataParallel`
:::
::: {style="padding-left: 30%"}
![](sylvain_tweet.JPG){width="70%"}
:::
## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- It is now the backbone of some of the largest PyTorch training frameworks in the ecosystem
:::
::: {style="padding-left: 30%;"}
![](hf_trainer.JPG){width="70%"}
:::
# What's next?
# Elevating the community
* Now that more advanced training techniques (FSDP, DeepSpeed, etc.) are within reach, we need to focus on educating the community on how best to use them
* It goes beyond how to use the `Trainer` or `Accelerator`: it's knowing *what* to use *where*
* Keep Accelerate a tool the community can pick up and experiment with as new techniques come out, pushing new ideas to scale quickly
# 1.0.0: Soon!
* Tried and battle-tested by over 7M users/month
* As we've been stable for over a year now, we're nearly ready to release 1.0.0
# Thanks for joining!
::: {style="font-size: 70%;"}
- [πŸ€— Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to πŸ€— Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and πŸ€— Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and πŸ€— Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)
:::