neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1 in Hugging Face API inference

#1 by dev95

Hi,

I want to run neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1 using Hugging Face API inference. When I tried to deploy this model, I received a warning stating that handler.py is missing.

I tried using my access token for Google Gemma and deploying the model with the following hardware configuration:

  • Nvidia T4 (4 GPUs, 64 GB total VRAM)
  • 46 vCPUs with 192 GB RAM

I encountered the following error:

[Previous line repeated 2 more times]
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 804, in _apply
    param_applied = fn(param)
                    ^^^^^^^^^
  File "/usr/local/lib/python3.11/dist-packages/torch/nn/modules/module.py", line 1159, in convert
    return t.to(
           ^^^^^
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 56.00 MiB. GPU
Application startup failed. Exiting.

How can I successfully use this model in API inference?
Thanks in advance!

Hello,
It looks like you're hitting an out-of-memory error, meaning the GPUs themselves ran out of memory. Plenty of system RAM doesn't help here: a 9B-parameter model needs roughly 36 GB for the weights alone in full precision, more than a single 16 GB T4 can hold, so the weights need to be loaded in half precision and/or sharded across the GPUs. It's also possible that other processes are holding GPU memory.
Were you able to run the example code provided in the model card? That would help confirm whether the issue is specific to your setup.
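
To rule out other processes, a quick generic PyTorch check (not from the model card) can print how much memory is actually free on each GPU before anything is loaded:

```python
# Print free vs. total memory on each visible GPU; if "free" is well below
# "total" before the model loads, something else is holding GPU memory.
import torch

for i in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(i)  # returns (free, total) in bytes
    print(f"GPU {i}: {free / 1e9:.1f} GB free of {total / 1e9:.1f} GB")
```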

Here are a few options that may help (a combined sketch follows the list):

  • Call torch.cuda.empty_cache() before loading the model to free cached, unused memory.
  • Load the model in torch.float16, or torch.bfloat16 on Ampere or newer GPUs, to halve the weight memory.
  • Set device_map="auto" so the weights are sharded automatically across your GPUs.
  • If the error occurs during inference rather than loading, reduce the batch size.
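
Putting these together, here is a minimal sketch of one way to load the model with the memory-saving options. It assumes you are already authenticated with a token that has access to the gated Gemma weights (e.g. via huggingface-cli login), and the prompt is only illustrative; check the model card for the exact prompt format the model expects.

```python
# A minimal sketch, not the official model-card example: half-precision
# weights sharded across the available GPUs, with a single-prompt batch.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

torch.cuda.empty_cache()  # release cached, unused GPU memory before loading

model_id = "neo4j/text2cypher-gemma-2-9b-it-finetuned-2024v1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # ~18 GB of weights instead of ~36 GB in fp32
    device_map="auto",          # shard the layers across all four T4s
)

# Illustrative prompt; batch size 1 keeps activation memory small.
prompt = "Generate a Cypher statement to answer: Which movies did Tom Hanks act in?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```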

If you're still experiencing issues, could you share the code you’re using to load and run inference? That will help us troubleshoot further.
Looking forward to your update!
