Spaces: damienbenveniste/deploy_vLLM

Damien Benveniste committed on Aug 12, 2024

Commit a959d74

1 Parent(s): 96c4e4e

modified

Files changed (1)

app.py

@@ -18,7 +18,7 @@ engine = AsyncLLMEngine.from_engine_args(
         gpu_memory_utilization=0.85,   # Slightly increased, adjust if needed
         max_model_len=4096,            # Phi-3-mini-4k context length
         quantization='awq',            # Enable quantization if supported by the model
-        enforce_eager=True,            # Disable CUDA graphs
+        enforce_eager=True,            # Disable CUDA graph
         dtype='half',                  # Use half precision
     )
 )
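
Review note: the only difference between the two versions of the changed line is the trailing comment ("graphs" vs. "graph"), so the commit should not alter the engine configuration. A minimal sketch of one way to confirm that, assuming it is enough to compare the parsed form of the two lines; the wrapper call `f(...)` and the helper `parse_kwargs` are hypothetical names introduced only for this check and are not part of app.py:

```python
import ast

# The two versions of the changed line, copied from the diff.
old_line = "enforce_eager=True,            # Disable CUDA graphs"
new_line = "enforce_eager=True,            # Disable CUDA graph"


def parse_kwargs(line: str) -> str:
    """Wrap a keyword-argument line in a dummy call `f(...)` (hypothetical)
    and return a dump of the resulting syntax tree.

    Comments are discarded by the parser, and ast.dump() omits source
    positions by default, so two lines that differ only in their comment
    produce the same dump.
    """
    tree = ast.parse(f"f(\n    {line}\n)")
    return ast.dump(tree)


# True when both versions pass exactly the same keyword argument.
same_ast = parse_kwargs(old_line) == parse_kwargs(new_line)
print(same_ast)
```

If the comparison holds, `enforce_eager=True` is passed in both revisions and only the comment text changed.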