Your dataset is huge. I'd recommend scaling down at first: try working with a sample (about 1 GB or less) to get the hang of the process. This way you won't be burning through GPU time on things that might not work right away. Once you've got that working, you can always scale back up.
I'm going to assume you have some knowledge of working with PDF data: parsing -> cleaning -> structuring.
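As a rough sketch of that parse -> clean -> structure pass (the pypdf library, folder names, and output format here are just my assumptions, swap in whatever you already use):

```python
# Minimal sketch: parse PDFs, clean the text, write JSONL records.
# Assumes the pypdf package; any PDF parser works the same way.
import json
import re
from pathlib import Path

from pypdf import PdfReader

def pdf_to_records(pdf_path: Path) -> list[dict]:
    """Extract per-page text and return simple JSON-ready records."""
    reader = PdfReader(str(pdf_path))
    records = []
    for page_num, page in enumerate(reader.pages):
        text = page.extract_text() or ""
        # Basic cleaning: collapse whitespace, drop empty pages.
        text = re.sub(r"\s+", " ", text).strip()
        if text:
            records.append({"source": pdf_path.name, "page": page_num, "text": text})
    return records

if __name__ == "__main__":
    all_records = []
    for pdf in Path("pdfs").glob("*.pdf"):  # hypothetical input folder
        all_records.extend(pdf_to_records(pdf))
    Path("dataset.jsonl").write_text(
        "\n".join(json.dumps(r) for r in all_records), encoding="utf-8"
    )
```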
Next, decide whether you want to do fine-tuning or a RAG system. There are some pretty decent layman articles on Medium that should give you an idea.
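If you go the RAG route, the core idea is just embedding your chunks and retrieving the closest ones at query time. A minimal sketch, assuming the sentence-transformers package (the model name, chunks, and query are placeholders):

```python
# Sketch of the retrieval half of a RAG setup over the chunked PDF text.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # small, CPU-friendly embedder

chunks = ["text of chunk 1", "text of chunk 2"]  # placeholder chunks from your PDFs
chunk_emb = model.encode(chunks, convert_to_tensor=True)

query = "What does the report say about X?"  # hypothetical user question
query_emb = model.encode(query, convert_to_tensor=True)

# Pick the most similar chunks and feed them to the LLM as context.
scores = util.cos_sim(query_emb, chunk_emb)[0]
top = scores.argsort(descending=True)[:3]
for idx in top:
    print(scores[idx].item(), chunks[int(idx)])
```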
Look into different GPU-as-a-service providers like RunPod or vast.ai, which (IIRC) offer hourly rates.
And last but not least: take it easy and enjoy what you're doing.
Good luck.
I usually use Colab with an A100. It works for me, even with large datasets. When you train, you can save checkpoints and resume each time, so you can divide the work across several days.
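A rough sketch of that checkpoint-and-resume pattern with the Hugging Face Trainer (the model, data file, and hyperparameters are placeholders, not a recommendation):

```python
# Sketch: save checkpoints so a Colab run can be resumed in a later session.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

model_name = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files="dataset.txt")["train"]  # placeholder data

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="checkpoints",   # point this at Drive so checkpoints survive the session
    save_steps=500,             # write a checkpoint every 500 steps
    save_total_limit=2,         # keep only the two most recent checkpoints
    per_device_train_batch_size=4,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=tokenized,
                  data_collator=collator)

# On later sessions, pass resume_from_checkpoint=True to pick up where you left off.
trainer.train(resume_from_checkpoint=False)
```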
But you have to use something else for the training, even if you want to use RAG.
When you have large datasets, use quantization (https://huggingface.co/docs/diffusers/quantization/bitsandbytes), LoRA, or models that are already optimized.
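For example, a sketch of loading a base model in 4-bit with bitsandbytes and attaching LoRA adapters via PEFT (the model name, target modules, and LoRA hyperparameters are placeholders and depend on your model):

```python
# Sketch: 4-bit quantized base model (bitsandbytes) + LoRA adapters (PEFT).
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder base model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4-bit
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # depends on the model architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small LoRA matrices are trainable
```

This keeps the memory footprint low enough that fine-tuning larger models on a single Colab GPU becomes feasible.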