Failed to serve the model with llama-box v0.0.126 (89097c5)
#1 opened by jerryrt
As mentioned in the model card:
!!! Experimental supported by gpustack/llama-box v0.0.77+ only !!!
However, even the smallest variant, FLUX.1-dev-Q4_1.gguf, fails to start with llama-box v0.0.126 (89097c5): all model tensors load successfully, but the warmup run then hits a CUDA out-of-memory error while allocating the flux compute buffer on device 2, followed by a segmentation fault (full log below).
Any suggestions appreciated.
~/Live.Projects/flux-mulgpu$ ./llama-box.startup
0.00.001.578 I
0.00.001.589 I arguments : ./llama-box.bin -v --verbosity 3 --log-colors -np 1 --host 0.0.0.0 -m ./FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf --images --image-max-batch 1 --image-vae-tiling
0.00.001.590 I version : v0.0.126 (89097c5)
0.00.001.590 I compiler : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.00.001.590 I target : x86_64-redhat-linux
0.00.001.591 I vendor : llama.cpp 84d54755 (4883), stable-diffusion.cpp 3eb18db (204)
0.00.008.308 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
0.00.008.310 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.008.311 I ggml_cuda_init: found 3 CUDA devices:
0.00.008.485 I Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1, VMM: yes
0.00.008.618 I Device 1: NVIDIA P104-100, compute capability 6.1, VMM: yes
0.00.008.750 I Device 2: NVIDIA P104-100, compute capability 6.1, VMM: yes
0.00.389.227 I system_info: n_threads = 28 (n_threads_batch = 28) / 56 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | FA_ALL_QUANTS = 1 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 |
0.00.389.230 I
0.00.389.366 I srv main: listening, hostname = 0.0.0.0, port = 8080, n_threads_http = 3 + 2
0.00.390.602 I srv main: loading model
0.00.390.625 I srv load_model: loading model './FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf'
0.00.619.633 I load_from_file: using device CUDA0 (NVIDIA GeForce GTX 1070) - 7740 MiB free
0.00.619.653 I load_from_file: using device CUDA1 (NVIDIA P104-100) - 8025 MiB free
0.00.619.666 I load_from_file: using device CUDA2 (NVIDIA P104-100) - 8025 MiB free
0.00.619.688 I load_from_file: loading model from './FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf'
0.00.620.021 I init_from_file: load ./FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf using gguf format
0.00.620.024 D init_from_gguf_file: init from './FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf'
0.00.638.032 I load_from_file: Version: Flux
0.00.641.471 I load_from_file: Weight type: q4_1
0.00.641.472 I load_from_file: CLIP_L weight type: f16
0.00.641.472 I load_from_file: CLIP_G weight type: ??
0.00.641.473 I load_from_file: T5XXL weight type: q4_1
0.00.641.473 I load_from_file: Diffusion model weight type: q4_1
0.00.641.473 I load_from_file: VAE weight type: f16
0.00.641.474 D load_from_file: ggml tensor size = 400 bytes
0.00.665.399 D load_from_merges: vocab size: 49408
0.00.704.842 D load_from_merges: trigger word img already in vocab
0.00.831.403 I FluxRunner: Flux blocks: 19 double, 38 single
0.00.832.531 D alloc_params_buffer: clip params backend buffer size = 235.06 MB(VRAM) (196 tensors)
0.00.832.881 D alloc_params_buffer: t5 params backend buffer size = 2839.20 MB(VRAM) (219 tensors)
0.00.833.801 D alloc_params_buffer: flux params backend buffer size = 7310.02 MB(VRAM) (780 tensors)
0.00.834.575 D alloc_params_buffer: vae params backend buffer size = 160.00 MB(VRAM) (244 tensors)
0.00.835.154 D load_from_file: loading weights
0.00.840.317 D load_tensors: loading tensors from ./FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf
0.00.939.303 I load_tensors: loaded 1% model tensors into buffer
0.00.947.717 I load_tensors: loaded 2% model tensors into buffer
0.00.956.752 I load_tensors: loaded 3% model tensors into buffer
0.00.964.786 I load_tensors: loaded 4% model tensors into buffer
0.00.970.897 I load_tensors: loaded 5% model tensors into buffer
0.00.977.210 I load_tensors: loaded 6% model tensors into buffer
0.00.986.181 I load_tensors: loaded 7% model tensors into buffer
0.00.995.242 I load_tensors: loaded 8% model tensors into buffer
0.01.004.250 I load_tensors: loaded 9% model tensors into buffer
0.01.012.400 I load_tensors: loaded 10% model tensors into buffer
0.01.020.682 I load_tensors: loaded 11% model tensors into buffer
0.01.029.022 I load_tensors: loaded 12% model tensors into buffer
0.01.037.813 I load_tensors: loaded 13% model tensors into buffer
0.01.064.665 I load_tensors: loaded 14% model tensors into buffer
0.01.203.312 I load_tensors: loaded 15% model tensors into buffer
0.01.332.681 I load_tensors: loaded 16% model tensors into buffer
0.01.443.979 I load_tensors: loaded 17% model tensors into buffer
0.01.581.847 I load_tensors: loaded 18% model tensors into buffer
0.01.682.289 I load_tensors: loaded 19% model tensors into buffer
0.01.824.454 I load_tensors: loaded 20% model tensors into buffer
0.01.960.600 I load_tensors: loaded 21% model tensors into buffer
0.02.083.903 I load_tensors: loaded 22% model tensors into buffer
0.02.219.682 I load_tensors: loaded 23% model tensors into buffer
0.02.319.922 I load_tensors: loaded 24% model tensors into buffer
0.02.456.143 I load_tensors: loaded 25% model tensors into buffer
0.02.587.432 I load_tensors: loaded 26% model tensors into buffer
0.02.701.209 I load_tensors: loaded 27% model tensors into buffer
0.02.840.167 I load_tensors: loaded 28% model tensors into buffer
0.03.025.092 I load_tensors: loaded 29% model tensors into buffer
0.03.052.442 I load_tensors: loaded 30% model tensors into buffer
0.03.112.641 I load_tensors: loaded 31% model tensors into buffer
0.03.118.611 I load_tensors: loaded 32% model tensors into buffer
0.03.130.198 I load_tensors: loaded 33% model tensors into buffer
0.03.144.427 I load_tensors: loaded 34% model tensors into buffer
0.03.176.063 I load_tensors: loaded 35% model tensors into buffer
0.03.245.385 I load_tensors: loaded 36% model tensors into buffer
0.03.312.286 I load_tensors: loaded 37% model tensors into buffer
0.03.380.881 I load_tensors: loaded 38% model tensors into buffer
0.03.401.411 I load_tensors: loaded 39% model tensors into buffer
0.03.404.852 I load_tensors: loaded 40% model tensors into buffer
0.03.419.969 I load_tensors: loaded 41% model tensors into buffer
0.03.450.917 I load_tensors: loaded 42% model tensors into buffer
0.03.533.106 I load_tensors: loaded 43% model tensors into buffer
0.03.570.471 I load_tensors: loaded 44% model tensors into buffer
0.03.610.718 I load_tensors: loaded 45% model tensors into buffer
0.03.643.918 I load_tensors: loaded 46% model tensors into buffer
0.03.873.193 I load_tensors: loaded 47% model tensors into buffer
0.04.103.480 I load_tensors: loaded 48% model tensors into buffer
0.04.333.863 I load_tensors: loaded 49% model tensors into buffer
0.04.583.364 I load_tensors: loaded 50% model tensors into buffer
0.04.773.089 I load_tensors: loaded 51% model tensors into buffer
0.05.001.523 I load_tensors: loaded 52% model tensors into buffer
0.05.239.624 I load_tensors: loaded 53% model tensors into buffer
0.05.467.018 I load_tensors: loaded 54% model tensors into buffer
0.05.717.895 I load_tensors: loaded 55% model tensors into buffer
0.05.905.510 I load_tensors: loaded 56% model tensors into buffer
0.06.135.919 I load_tensors: loaded 57% model tensors into buffer
0.06.370.063 I load_tensors: loaded 58% model tensors into buffer
0.06.598.964 I load_tensors: loaded 59% model tensors into buffer
0.06.848.716 I load_tensors: loaded 60% model tensors into buffer
0.07.035.545 I load_tensors: loaded 61% model tensors into buffer
0.07.268.714 I load_tensors: loaded 62% model tensors into buffer
0.07.498.306 I load_tensors: loaded 63% model tensors into buffer
0.07.729.181 I load_tensors: loaded 64% model tensors into buffer
0.07.978.934 I load_tensors: loaded 65% model tensors into buffer
0.08.165.511 I load_tensors: loaded 66% model tensors into buffer
0.08.403.247 I load_tensors: loaded 67% model tensors into buffer
0.08.633.752 I load_tensors: loaded 68% model tensors into buffer
0.08.864.210 I load_tensors: loaded 69% model tensors into buffer
0.09.114.514 I load_tensors: loaded 70% model tensors into buffer
0.09.302.966 I load_tensors: loaded 71% model tensors into buffer
0.09.538.275 I load_tensors: loaded 72% model tensors into buffer
0.09.770.293 I load_tensors: loaded 73% model tensors into buffer
0.10.003.845 I load_tensors: loaded 74% model tensors into buffer
0.10.256.192 I load_tensors: loaded 75% model tensors into buffer
0.10.446.746 I load_tensors: loaded 76% model tensors into buffer
0.10.681.213 I load_tensors: loaded 77% model tensors into buffer
0.11.184.729 I load_tensors: loaded 78% model tensors into buffer
0.11.478.673 I load_tensors: loaded 79% model tensors into buffer
0.11.748.644 I load_tensors: loaded 80% model tensors into buffer
0.11.991.950 I load_tensors: loaded 81% model tensors into buffer
0.12.309.358 I load_tensors: loaded 82% model tensors into buffer
0.12.631.356 I load_tensors: loaded 83% model tensors into buffer
0.12.917.723 I load_tensors: loaded 84% model tensors into buffer
0.13.180.093 I load_tensors: loaded 85% model tensors into buffer
0.13.422.019 I load_tensors: loaded 86% model tensors into buffer
0.13.745.092 I load_tensors: loaded 87% model tensors into buffer
0.14.062.713 I load_tensors: loaded 88% model tensors into buffer
0.14.346.395 I load_tensors: loaded 89% model tensors into buffer
0.14.610.313 I load_tensors: loaded 90% model tensors into buffer
0.14.860.150 I load_tensors: loaded 91% model tensors into buffer
0.15.178.192 I load_tensors: loaded 92% model tensors into buffer
0.15.497.229 I load_tensors: loaded 93% model tensors into buffer
0.15.785.968 I load_tensors: loaded 94% model tensors into buffer
0.16.052.353 I load_tensors: loaded 95% model tensors into buffer
0.16.293.495 I load_tensors: loaded 96% model tensors into buffer
0.16.608.743 I load_tensors: loaded 97% model tensors into buffer
0.16.931.307 I load_tensors: loaded 98% model tensors into buffer
0.17.215.848 I load_tensors: loaded 99% model tensors into buffer
0.17.664.697 I load_tensors: loaded 100% model tensors into buffer
0.17.669.117 I load_from_file: total params memory size = 10544.29MB (VRAM 10544.29MB, RAM 0.00MB): clip 3074.27MB(VRAM), unet 7310.02MB(VRAM), vae 160.00MB(VRAM), controlnet 0.00MB(VRAM), pmid 0.00MB(VRAM)
0.17.669.120 I load_from_file: loading model from './FLUX.1-gguf-gpustack/FLUX.1-dev-Q4_1.gguf' completed, taking 16.83s
0.17.669.121 I load_from_file: running in Flux FLOW mode
0.17.669.280 I load_from_file: running with discrete schedule
0.17.669.283 D load_from_file: finished loaded file
0.17.669.439 W common_sd_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.17.670.267 D tokenize: parse 'a lovely cat' to [['a lovely cat', 1], ]
0.17.670.755 D pad_tokens: token length: 77
0.17.670.764 D pad_tokens: token length: 255
0.17.671.197 D forward: Missing text_projection matrix, assuming identity...
0.17.672.025 D alloc_compute_buffer: clip compute buffer size: 1.40 MB(VRAM)
0.17.672.114 D forward: Missing text_projection matrix, assuming identity...
0.17.717.026 D alloc_compute_buffer: t5 compute buffer size: 67.80 MB(VRAM)
0.19.405.063 D get_learned_condition_common: computing condition graph completed, taking 1734 ms
0.19.409.128 D tokenize: parse '' to [['', 1], ]
0.19.409.394 D pad_tokens: token length: 77
0.19.409.397 D pad_tokens: token length: 255
0.19.409.669 D forward: Missing text_projection matrix, assuming identity...
0.19.409.904 D alloc_compute_buffer: clip compute buffer size: 1.40 MB(VRAM)
0.19.409.988 D forward: Missing text_projection matrix, assuming identity...
0.19.424.641 D alloc_compute_buffer: t5 compute buffer size: 67.80 MB(VRAM)
0.21.037.241 D get_learned_condition_common: computing condition graph completed, taking 1628 ms
0.21.041.177 I get_sampling_stream: get_learned_condition completed, taking 3372 ms
0.21.157.158 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 2576.18 MiB on device 2: cudaMalloc failed: out of memory
0.21.157.163 E ggml_gallocr_reserve_n: failed to allocate CUDA2 buffer of size 2701322624
0.21.157.164 E alloc_compute_buffer: flux: failed to allocate the compute buffer
Segmentation fault
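A rough accounting of the numbers in the log above suggests the weights are not split evenly across the three GPUs. This is only a sketch: all figures are copied from the log, but the assumption that no other process is using CUDA2's VRAM, and the even-split baseline, are mine.

```python
# Memory accounting from the llama-box log above.
# All figures are copied from the log; the "even split" baseline is an
# assumption for comparison, not something the log states.

MIB = 1024 ** 2

# Failed allocation reported by ggml_gallocr_reserve_n on CUDA2.
buf_bytes = 2_701_322_624
buf_mib = buf_bytes / MIB
print(f"flux compute buffer: {buf_mib:.2f} MiB")  # matches the 2576.18 MiB in the log

# CUDA2 reported 8025 MiB free before loading. For cudaMalloc of the
# compute buffer to fail (ignoring fragmentation and other processes),
# the buffers already resident on CUDA2 must exceed roughly:
free_before_load = 8025
min_resident_on_cuda2 = free_before_load - buf_mib
print(f"CUDA2 must already hold > ~{min_resident_on_cuda2:.0f} MiB")

# Total params memory from load_from_file; an even 3-way split would be:
total_weights = 10544.29
even_split = total_weights / 3
print(f"even 3-way split would be ~{even_split:.0f} MiB per device")
```

If that reading is right, well over 5.4 GiB of the 10.5 GiB of weights landed on device 2, far above the ~3.5 GiB an even split would give, leaving no room for the ~2.6 GiB flux compute buffer there.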