Talmudic discussion model

#660
by david-sh - opened

Hi, I'm thinking about creating a model for Talmudic discussions ('pilpul'), or one only for halachic questions.

Great idea. To do so, you need to decide on a base model and find or create a dataset. If you choose a small base model, you should be able to train it locally. If you want to create a large one, I recommend renting some GPUs on RunPod or asking me to train it for you. For finetuning I recommend axolotl. The dataset should be well formatted, contain a system prompt, a prompt, and a response, and have at least 1000 rows. If you need any help or advice, feel free to ask me anytime.
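
For illustration, here is a minimal sketch of what one row could look like, written out as JSON Lines from Python. The field names (`system`, `prompt`, `response`) are just placeholders for this example, not a schema axolotl requires:

```python
import json

# Hypothetical rows; the field names are illustrative, not an axolotl requirement.
rows = [
    {
        "system": "You are a scholar of Talmud answering halachic questions.",
        "prompt": "Question text goes here.",
        "response": "Answer text goes here.",
    },
]

# axolotl and most finetuning tools read JSON Lines: one JSON object per line.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```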

I haven't settled on a model; that's one of the open questions. If possible, I'd take Llama 3.2 3B, do a full finetune on it, and replace the tokenizer with a Hebrew one. There is enough data: about 150,000 question→answer pairs exist, or, if possible, I'd train on raw text.
I think it may be cheaper to rent a spot instance on Amazon AWS, e.g. Amazon EC2 G4 instances.

What do you mean, you'd train the model for me?

> I haven't settled on a model; that's one of the open questions. If possible, I'd take Llama 3.2 3B, do a full finetune on it, and replace the tokenizer with a Hebrew one. There is enough data: about 150,000 question→answer pairs exist, or, if possible, I'd train on raw text.

I recommend against training on raw text, since you can only do that with the base model, and creating your own instruction finetune is quite resource-intensive.

> I think it may be cheaper to rent a spot instance on Amazon AWS, e.g. Amazon EC2 G4 instances.

RunPod is almost certainly cheaper, as it is heavily subsidized by investors while Amazon needs to make a profit, but do your own price comparison and use whatever you like.

> What do you mean, you'd train the model for me?

If you give me a model and a dataset, I can finetune the model on that dataset for you for free, as long as it doesn't require too many resources. 150K rows is quite a lot, but 3B is relatively small, so it might be feasible for me to finetune it on 2x RTX 4090 in a reasonable time. I can test feasibility if you link me the dataset you want to use. For a 3B model you might even be able to use whatever GPU you already own and finetune it locally.
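
As a rough back-of-envelope sketch of the memory side (assuming bf16 weights and gradients plus fp32 Adam moments, and ignoring activations), a full finetune of a 3B model needs on the order of:

```python
# Back-of-envelope VRAM estimate for a full finetune of a 3B-parameter model.
# Assumes bf16 weights/gradients and fp32 Adam moments; activations,
# gradient checkpointing, and the sharding strategy shift the real number a lot.
params = 3e9
weights = params * 2          # bf16: 2 bytes per parameter
grads = params * 2            # bf16 gradients
adam_states = params * 8      # two fp32 moments: 2 * 4 bytes per parameter
total_gib = (weights + grads + adam_states) / 1024**3
print(f"~{total_gib:.0f} GiB before activations")  # ~34 GiB
```

That ~34 GiB of static state would have to be sharded across the two 24 GB cards (e.g. with DeepSpeed ZeRO or FSDP); parameter-efficient methods like LoRA cut it drastically, which is what would make a local finetune realistic.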

I have an NVIDIA GeForce GTX 960M with only 4 GB. Can you tell me the format of the dataset you need? And can you please suggest the most suitable model for such a task?

Do you want to replace the tokenizer for this task?

As a base model you could use https://huggingface.co/yam-peleg/Hebrew-Mistral-7B-200K, since Llama 3.2 only officially supports English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Keep in mind that this is a base model and not an instruction-tuned model, so your dataset should be in a question/response or, even better, an instruction/question/response format. Please choose one of the formats from https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/ and post the dataset to your HuggingFace account. Keep in mind that I do not understand Hebrew, so while I can train the model, I will not be able to judge the quality of your dataset or of the result.
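
For example, one of the formats axolotl supports is alpaca (instruction/input/output). A minimal sketch of writing one such row; the file name and placeholder text are just illustrative, and your real content would be in Hebrew:

```python
import json

# One row in the "alpaca" format: instruction / optional input / output.
row = {
    "instruction": "Answer the following halachic question with sources.",
    "input": "Question text goes here.",
    "output": "Answer text goes here.",
}

# Append the row to a JSON Lines file (hypothetical file name).
with open("halacha_alpaca.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")
```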
