Transformers.js error: can't find model.onnx_data file
I'm having trouble initializing this model with Transformers.js in Node, using the example provided by Xenova in the readme.
When loading the model with `dtype: 'q4'`, the q4 ONNX file downloads to the cache, but then Node crashes with no error message.
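For context, here's roughly how I'm loading it (a sketch following the readme example; the exact calls may not match my actual code):

```js
import { AutoModel, AutoProcessor } from '@huggingface/transformers';

const model_id = 'jinaai/jina-clip-v2';
const processor = await AutoProcessor.from_pretrained(model_id);

// With dtype: 'q4' the quantized weights download and then Node exits silently;
// with no dtype at all, ONNX Runtime fails to find model.onnx_data (error below).
const model = await AutoModel.from_pretrained(model_id, { dtype: 'q4' });
```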
When I remove `dtype` altogether, I get an error from ONNX Runtime that it cannot locate the `model.onnx_data` file. I checked the cache, and Transformers.js only downloads the `model.onnx` file. As a potential fix, I downloaded the repo with all the ONNX models, placed the `model.onnx_data` file into the cache manually, and even tried pointing the cache at the repo folder, but the error persists:
Error: Exception during initialization: filesystem error: in file_size: No such file or directory ["model.onnx_data"]
at new OnnxruntimeSessionHandler (/Users/timsekiguchi/recall/node_modules/onnxruntime-node/dist/backend.js:27:92)
at Immediate.<anonymous> (/Users/timsekiguchi/recall/node_modules/onnxruntime-node/dist/backend.js:64:29)
at process.processImmediate (node:internal/timers:483:21)
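For reference, this is roughly how I tried pointing Transformers.js at the local copy (a sketch; the paths are placeholders):

```js
import { env, AutoModel } from '@huggingface/transformers';

// Option A: redirect the cache to where I manually placed model.onnx_data.
env.cacheDir = '/path/to/local/cache';

// Option B: load from a local clone of the repo instead of the Hub.
env.localModelPath = '/path/to/local/models';
env.allowRemoteModels = false;

const model = await AutoModel.from_pretrained('jinaai/jina-clip-v2');
```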
Any suggestions on where to go from here, or what might be throwing? Has the implementation included by Xenova in the readme been tested?
Thank you!
Hey @timsek, sorry for the late reply. Adding @Xenova to the loop.
All good! Thanks for replying. There were no error messages being generated by Transformers.js, but I believe it was an OOM error when trying to load the fp32 model; at this point I can't remember exactly. I ended up getting the Jina model to work by using ONNX Runtime's JS API directly. I ran into a couple of issues, though, that prevented me from using it further (rough sketch of my setup after the list):
- My use case is to generate embeddings for images and text separately, but the current ONNX models require both text and image inputs. This forced me to pass dummy text / images when generating one or the other.
- I'm generating embeddings on a local machine with Electron. I did some performance comparisons, and on my Mac M1 Max it was taking ~1700 ms per image to generate embeddings, which unfortunately isn't feasible for a library of 50K+ images. I don't know if it's in the cards for Jina, but I would love to use a more optimized model that trades off accuracy for speed. For comparison, the standard ViT-B-32 model takes 8-12 ms per image.
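Here's roughly what my direct ONNX Runtime setup looks like (a sketch from memory; the input names, shapes, and image resolution are assumptions, not verified against the exported model):

```js
import * as ort from 'onnxruntime-node';

const session = await ort.InferenceSession.create('/path/to/onnx/model.onnx');

// Dummy text so I can embed just an image: the exported graph expects both modalities.
const dummyText = new ort.Tensor('int64', new BigInt64Array(8).fill(0n), [1, 8]);

// Placeholder image tensor; in practice this comes from the preprocessor
// (the resolution here is an assumption, not necessarily the model's input size).
const imageSize = 512;
const pixelValues = new ort.Tensor(
  'float32',
  new Float32Array(3 * imageSize * imageSize),
  [1, 3, imageSize, imageSize],
);

const outputs = await session.run({
  input_ids: dummyText,      // assumed input name
  pixel_values: pixelValues, // assumed input name
});
console.log(Object.keys(outputs));
```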
For 1, you can use zero-sized tensors: https://huggingface.co/jinaai/jina-clip-v2/discussions/12#67445e1ae8ad555f8d307322
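Something along these lines (untested sketch assuming onnxruntime-node tensors; the exact dims to pass are described in the linked thread):

```js
import * as ort from 'onnxruntime-node';

// A text input with batch size 0, so only the image branch does any work.
// (Dims beyond the zero batch dimension are placeholders.)
const emptyText = new ort.Tensor('int64', new BigInt64Array(0), [0, 1]);
```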
For 2, although jina-clip-v2 is a ViT-L/14-scale model, that doesn't justify such a huge difference in runtimes. Have you tried the fp16 model?
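For example, via Transformers.js (a sketch; the equivalent would be loading the fp16 ONNX export directly in ONNX Runtime):

```js
import { AutoModel } from '@huggingface/transformers';

// Same loading call as before, but selecting the fp16 export.
const model = await AutoModel.from_pretrained('jinaai/jina-clip-v2', { dtype: 'fp16' });
```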