Znilsson posted an update Sep 30, 2025
Hi there, I'm an amateur AI enthusiast. I've managed to gather a fairly large dataset of PDFs of varying quality, about 10 GB total. I want to use it to fine-tune a model and/or build a RAG system. I've been having a terrible time preparing the data for training. I've tried Google Colab and Kaggle, and paid for GPUs on both, but I'm struggling because I'm not a data scientist and have zero background in programming; I'm just using AI to teach me along the way. I'm willing to pay some money for GPU power, but because I fail so frequently I fear I'm just burning through it. My laptop doesn't have enough horsepower either: I've tried to train on it, but to no avail. I've had minor successes, but haven't been able to process the full dataset into anything comprehensible. Any and all suggestions are welcome. I need some serious help. Thanks!!

Your dataset is huge. I'd recommend scaling down at first: work with a sample (about 1 GB or less) to get the hang of the process. That way you won't burn through GPU time on things that might not work right away. Once you've got that working, you can always scale it back up.
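One way to carve out that sample, sketched in Python. The directory paths are placeholders, and the 1 GB cap is just the suggestion above:

```python
import random
import shutil
from pathlib import Path

def sample_pdfs(src_dir, dst_dir, max_bytes=1_000_000_000, seed=0):
    """Copy a random subset of the PDFs in src_dir into dst_dir,
    stopping before the copied total would exceed max_bytes."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(src.rglob("*.pdf"))
    random.Random(seed).shuffle(pdfs)  # fixed seed -> reproducible sample
    copied = 0
    for pdf in pdfs:
        size = pdf.stat().st_size
        if copied + size > max_bytes:
            continue  # skip any file that would push past the cap
        shutil.copy2(pdf, dst / pdf.name)
        copied += size
    return copied
```

Run it once (e.g. `sample_pdfs("all_pdfs", "sample_pdfs")`) and do all your experimenting on the small folder.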

I'm going to assume you have some working knowledge of the usual PDF-data pipeline: parsing -> cleaning -> structuring.
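A minimal sketch of the first two steps. The cleaning rules here are just common heuristics, and `pypdf` is only one parser option (pdfplumber and PyMuPDF are alternatives, and scanned PDFs would need OCR instead):

```python
import re

def clean_text(raw: str) -> str:
    """Normalize whitespace and undo common PDF-extraction artifacts."""
    text = raw.replace("\x0c", "\n")        # form feeds (page breaks) -> newlines
    text = re.sub(r"-\n(\w)", r"\1", text)  # rejoin words hyphenated across lines
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse runs of blank lines
    return text.strip()

def parse_pdf(path: str) -> str:
    """Extract raw text from one PDF. Requires `pip install pypdf`."""
    from pypdf import PdfReader  # lazy import so clean_text works without pypdf
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

Then `clean_text(parse_pdf("some.pdf"))` gives you one cleaned string per document. Eyeball the output on a handful of files first; low-quality PDFs often extract badly and are worth filtering out.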

Next, decide whether you want fine-tuning or a RAG system. There are some pretty decent layman-friendly articles on Medium that should give you some idea of the trade-offs.
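If you go the fine-tuning route, the "structuring" step usually means writing your cleaned text as JSONL, one example per line. A minimal sketch (the `text` field name is a common convention, not a requirement; check what your training tool expects):

```python
import json
from pathlib import Path

def chunk_text(text, max_chars=2000):
    """Split cleaned text on paragraph boundaries into chunks
    of at most max_chars characters each."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if len(current) + len(para) + 2 > max_chars and current:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def write_jsonl(chunks, out_path):
    """Write one JSON object per line, the format most
    fine-tuning tools accept as a dataset."""
    with open(out_path, "w", encoding="utf-8") as f:
        for chunk in chunks:
            f.write(json.dumps({"text": chunk}, ensure_ascii=False) + "\n")
```

For RAG you'd keep the same chunks but feed them to an embedding model and a vector store instead of a trainer.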

Look into GPU-as-a-service providers like RunPod and Vast.ai, which (IIRC) charge hourly rates.

And last but not least: take it easy and enjoy what you're doing.


I usually use Colab with an A100. It works for me, even with large datasets. When you train, you can save checkpoints and resume each time, so you can divide the work over several days.
But you have to use something else for the training, and also if you want to use RAG.
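The checkpoint-and-resume pattern in miniature (a toy sketch in plain Python; with the Hugging Face `Trainer` you get the same behavior from `trainer.train(resume_from_checkpoint=True)`):

```python
import json
from pathlib import Path

def train_with_checkpoints(total_steps, ckpt_path, step_fn, save_every=100):
    """Run step_fn for total_steps steps, periodically saving progress
    to ckpt_path so an interrupted run resumes where it left off."""
    ckpt = Path(ckpt_path)
    start = 0
    if ckpt.exists():  # a previous run left a checkpoint: resume from it
        start = json.loads(ckpt.read_text())["step"]
    for step in range(start, total_steps):
        step_fn(step)  # one unit of training work
        if (step + 1) % save_every == 0 or step + 1 == total_steps:
            ckpt.write_text(json.dumps({"step": step + 1}))
```

The point is that a Colab disconnect only costs you the work since the last save, not the whole run.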
When you have large datasets, use quantization (https://huggingface.co/docs/diffusers/quantization/bitsandbytes), LoRA, or models that are already optimized.
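A config sketch of what "quantization + LoRA" looks like in practice, assuming `transformers`, `peft`, and `bitsandbytes` are installed and a CUDA GPU is available; the model id and LoRA hyperparameters are just examples:

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                    # 4-bit weights: roughly 4x less VRAM
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype="bfloat16",
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",            # example id; any causal LM you can access
    quantization_config=bnb_config,
    device_map="auto",
)
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # adapt only the attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)  # only the small adapters are trained
```

With this setup only the LoRA adapters get gradients, which is what makes fine-tuning feasible on a single rented GPU.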