Damien Benveniste committed on
Commit
96c4e4e
·
1 Parent(s): 0306c33
Files changed (1) hide show
  1. app.py +0 -1
app.py CHANGED
@@ -19,7 +19,6 @@ engine = AsyncLLMEngine.from_engine_args(
19
  max_model_len=4096, # Phi-3-mini-4k context length
20
  quantization='awq', # Enable quantization if supported by the model
21
  enforce_eager=True, # Disable CUDA graphs
22
- max_num_layers=None, # This allows vLLM to determine the optimal number of layers
23
  dtype='half', # Use half precision
24
  )
25
  )
 
19
  max_model_len=4096, # Phi-3-mini-4k context length
20
  quantization='awq', # Enable quantization if supported by the model
21
  enforce_eager=True, # Disable CUDA graphs
 
22
  dtype='half', # Use half precision
23
  )
24
  )