CUDA out of memory for flux1-dev-fp8.safetensors
I have an NVIDIA Tesla T4 GPU. I have downloaded the fp8 safetensors and the FLUX.1-dev model locally. This is the code that loads the model:
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

bfl_repo = "app/utilities/flux_model"
dtype = torch.bfloat16

transformer = FluxTransformer2DModel.from_single_file(
    f"{bfl_repo}/flux1-dev-fp8.safetensors",
    torch_dtype=dtype,
).to("cuda")
pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=transformer,
    torch_dtype=dtype,
)
pipe.enable_sequential_cpu_offload()
When I run this code, I get the following error:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 90.00 MiB. GPU 0 has a total capacity of 15.57 GiB of which 57.38 MiB is free. Including non-PyTorch memory, this process has 15.51 GiB memory in use. Of the allocated memory 15.40 GiB is allocated by PyTorch, and 16.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
That code is casting the fp8 weights back to bf16; the dtype should be torch.float8_e4m3fn, not torch.bfloat16.
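A minimal sketch of that change, assuming the same local checkpoint paths as in the question: load the transformer with torch_dtype=torch.float8_e4m3fn so its weights stay in fp8, keep the rest of the pipeline in bf16, and skip the manual .to("cuda") so enable_sequential_cpu_offload() can manage device placement itself. Note that fp8 here is a storage dtype; depending on the diffusers version, the weights may still be upcast at inference time.

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

bfl_repo = "app/utilities/flux_model"  # same local paths as in the question

# Keep the transformer weights in fp8 instead of upcasting them to bf16.
transformer = FluxTransformer2DModel.from_single_file(
    f"{bfl_repo}/flux1-dev-fp8.safetensors",
    torch_dtype=torch.float8_e4m3fn,
)

# The text encoders and VAE stay in bf16.
pipe = FluxPipeline.from_pretrained(
    bfl_repo,
    transformer=transformer,
    torch_dtype=torch.bfloat16,
)

# Streams weights to the GPU piece by piece; no manual .to("cuda") needed.
pipe.enable_sequential_cpu_offload()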
I have CUDA 12.6, and my PyTorch packages are:
torch = "^2.4.1"
torchvision = "^0.19.1"
torchaudio = "^2.4.1"
When I switch the dtype to torch.float8_e4m3fn, I get this error:
TypeError: couldn't find storage object Float8_e4m3fnStorage
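As far as I know, torch.float8_e4m3fn and its storage type have shipped with PyTorch since 2.1, so on torch 2.4.1 that TypeError usually points at a stale environment rather than at the version pin itself. A minimal probe (a hypothetical diagnostic, not from the question) to check what the running interpreter actually supports:

import torch

print(torch.__version__)  # confirm which torch the environment really resolves to
t = torch.zeros(4, dtype=torch.float8_e4m3fn)  # fails on builds without fp8 support
print(t.dtype, t.element_size())  # fp8 tensors use 1 byte per element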
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

dtype = torch.float8_e4m3fn

transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors",
    torch_dtype=dtype,
).to("cuda")
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    token="hf_...",  # token redacted
    torch_dtype=dtype,
).to("cuda")
This crashes with:
OutOfMemoryError: CUDA out of memory. Tried to allocate 144.00 MiB. GPU 0 has a total capacity of 39.56 GiB of which 50.81 MiB is free. Process 591371 has 39.51 GiB memory in use. Of the allocated memory 39.10 GiB is allocated by PyTorch, and 5.77 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
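The likely culprit is the combination of a pipeline-wide torch_dtype=torch.float8_e4m3fn, which also touches the text encoders and VAE (fp8 is not a supported compute dtype for them), and the two .to("cuda") calls, which force every component onto the GPU at once. A hedged sketch of the usual mitigation, keeping bf16 as the pipeline dtype and letting diffusers handle placement (enable_model_cpu_offload() moves whole sub-models to the GPU only while they are needed; the prompt and step settings below are just example values):

import torch
from diffusers import FluxPipeline, FluxTransformer2DModel

transformer = FluxTransformer2DModel.from_single_file(
    "https://huggingface.co/Kijai/flux-fp8/blob/main/flux1-dev-fp8.safetensors",
    torch_dtype=torch.float8_e4m3fn,  # fp8 storage for the ~12 GB transformer
)

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,  # text encoders and VAE stay in bf16
)

# No .to("cuda"): offloading owns device placement and keeps peak VRAM low.
pipe.enable_model_cpu_offload()

image = pipe(
    "a cat holding a sign that says hello world",
    num_inference_steps=50,
    guidance_scale=3.5,
).images[0]
image.save("flux-dev.png")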