Question About VRAM Requirements for Full 256K Context Length
Dear CohereForAI Team,
First of all, thank you for your incredible work on the c4ai-command-a-03-2025 model. The advancements in context length and efficiency are truly impressive!
I am currently experimenting with the model using vLLM and have reached a context length of approximately 110K tokens on 8× RTX A6000 GPUs (48 GB each) with the following settings:
#!/bin/bash
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
export CUDA_LAUNCH_BLOCKING=0

token=100000

python -m vllm.entrypoints.openai.api_server \
    --model=CohereForAI/c4ai-command-a-03-2025 \
    --host 192.xxx.x.xx \
    --port 9000 \
    --trust-remote-code \
    --device cuda \
    --tensor-parallel-size 8 \
    --gpu-memory-utilization 1 \
    --swap-space 10 \
    --disable-custom-all-reduce \
    --max-num-seqs 3 \
    --max-num-batched-tokens $token \
    --max-model-len $token
According to the model card, the 256K token limit is theoretically achievable. However, I couldn’t find an explicit reference to the VRAM requirements needed to reach this full context length.
Could you provide any insights into how much GPU memory (VRAM) would be required to fully utilize 256K tokens? Would increasing the number of GPUs beyond 8x RTX A6000 significantly help, or is there another approach (e.g., CPU offloading, swap space tuning) that you would recommend?
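For reference, here is the rough back-of-envelope estimate I have been working from. The architecture values below are placeholders that I would normally read from the model's config.json (num_hidden_layers, num_key_value_heads, head_dim), so please correct me if they are off for this model:

# Rough KV-cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes_per_element per token.
# All architecture values are placeholders, not official numbers for c4ai-command-a-03-2025.
LAYERS=64       # num_hidden_layers (placeholder)
KV_HEADS=8      # num_key_value_heads (placeholder)
HEAD_DIM=128    # head_dim (placeholder)
BYTES=2         # bf16/fp16 KV cache
TOKENS=262144   # 256K context

PER_TOKEN=$((2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES))    # bytes of KV cache per token
TOTAL_GIB=$((PER_TOKEN * TOKENS / 1024 / 1024 / 1024))     # GiB for one full-length sequence
echo "KV cache: ${PER_TOKEN} bytes/token, ~${TOTAL_GIB} GiB for ${TOKENS} tokens"

With these placeholder values this comes out to roughly 64 GiB of KV cache for a single 256K sequence, on top of the model weights; if some layers use sliding-window attention, the real figure would be lower.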
Again, thank you for your fantastic work—this model is truly pushing boundaries! I appreciate any guidance you can share.
Best regards
Try reducing max_num_seqs to 1 and setting the token size to a multiple of 16.
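For example, something like this, keeping the rest of your launch command unchanged (262144 = 256 * 1024 is a multiple of 16):

token=262144

python -m vllm.entrypoints.openai.api_server \
    --model=CohereForAI/c4ai-command-a-03-2025 \
    --tensor-parallel-size 8 \
    --max-num-seqs 1 \
    --max-num-batched-tokens $token \
    --max-model-len $token
    # ...plus the other flags from your script (--host, --port, --gpu-memory-utilization, ...)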
Thank you very much for your reply!
Even with max_num_seqs=1 I still cannot load more tokens. 100,000 is already a multiple of 16, and 131,072 tokens do not work either.
From my understanding, 8 × 48 GB (384 GB of VRAM in total) should be enough to handle the full 256K context with this model, correct?
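My rough budget, assuming the weights are about 111B parameters stored in bf16 (2 bytes each) and reusing the placeholder KV-cache figure from my estimate above:

# Back-of-envelope memory budget across 8 x 48 GB GPUs (all values approximate)
TOTAL_VRAM_GB=$((8 * 48))    # 384 GB in total
WEIGHTS_GB=$((111 * 2))      # ~111B params * 2 bytes (bf16) ≈ 222 GB, sharded across the GPUs
KV_CACHE_GB=64               # placeholder estimate for one 256K sequence
echo "Left for activations and overhead: $((TOTAL_VRAM_GB - WEIGHTS_GB - KV_CACHE_GB)) GB"

If those assumptions are roughly right, there should be close to 100 GB left for activations and runtime overhead, which is why I am unsure what is limiting me.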
Thank you again for your assistance!