Talmudic discussion model

#660
by david-sh - opened

Hi, I'm thinking about creating a model for Talmudic discussions ('pilpul'), or one only for halachic questions.

Great idea. To do so, you need to decide on a base model and find or create a dataset. If you choose a small base model, you should be able to train it locally. If you want to create a large one, I recommend renting some GPUs on RunPod or asking me to train it for you. For finetuning I recommend axolotl. The dataset should be well formatted, contain a system prompt, a prompt, and a response, and have at least 1000 rows. If you need any help or advice, feel free to ask me anytime.
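
For illustration, here is a minimal sketch of what one row could look like, written out as JSON Lines from Python. The field names (`system`, `prompt`, `response`) are just placeholders for this example, not a schema axolotl requires:

```python
import json

# Hypothetical rows; the field names are illustrative, not an axolotl requirement.
rows = [
    {
        "system": "You are a scholar of Talmud answering halachic questions.",
        "prompt": "Question text goes here.",
        "response": "Answer text goes here.",
    },
]

# axolotl and most finetuning tools read JSON Lines: one JSON object per line.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row, ensure_ascii=False) + "\n")
```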

I haven't settled on a model; that's one of the open questions. If possible, I'd take Llama 3.2 3B, do a full finetune on it, and replace the tokenizer with a Hebrew one. There is enough data: about 150,000 question→answer pairs exist, or, if possible, I'd train on raw text.
I think it may be cheaper to rent a spot instance on Amazon AWS, e.g. Amazon EC2 G4 instances.

What do you mean, you'd train the model for me?

> I haven't settled on a model; that's one of the open questions. If possible, I'd take Llama 3.2 3B, do a full finetune on it, and replace the tokenizer with a Hebrew one. There is enough data: about 150,000 question→answer pairs exist, or, if possible, I'd train on raw text.

I recommend against training on raw text, since you can only do that with the base model, and creating your own instruction finetune is quite resource-intensive.

> I think it may be cheaper to rent a spot instance on Amazon AWS, e.g. Amazon EC2 G4 instances.

RunPod is almost certainly cheaper, as it is heavily subsidized by investors while Amazon needs to make a profit, but do your own price comparison and use whatever you like.

> What do you mean, you'd train the model for me?

If you give me a model and a dataset, I can finetune the model on that dataset for you for free, as long as it doesn't require too many resources. 150K rows is quite a lot, but 3B is relatively small, so it might be feasible for me to finetune it on 2x RTX 4090 in a reasonable time. I can test feasibility if you link me the dataset you want to use. For a 3B model you might even be able to use whatever GPU you already own and finetune it locally.
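
As a rough back-of-envelope sketch of the memory side (assuming bf16 weights and gradients plus fp32 Adam moments, and ignoring activations), a full finetune of a 3B model needs on the order of:

```python
# Back-of-envelope VRAM estimate for a full finetune of a 3B-parameter model.
# Assumes bf16 weights/gradients and fp32 Adam moments; activations,
# gradient checkpointing, and the sharding strategy shift the real number a lot.
params = 3e9
weights = params * 2          # bf16: 2 bytes per parameter
grads = params * 2            # bf16 gradients
adam_states = params * 8      # two fp32 moments: 2 * 4 bytes per parameter
total_gib = (weights + grads + adam_states) / 1024**3
print(f"~{total_gib:.0f} GiB before activations")  # ~34 GiB
```

That ~34 GiB of static state would have to be sharded across the two 24 GB cards (e.g. with DeepSpeed ZeRO or FSDP); parameter-efficient methods like LoRA cut it drastically, which is what would make a local finetune realistic.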

I have an NVIDIA GeForce GTX 960M with only 4 GB. Can you tell me the format of the dataset you need? And can you please suggest the most suitable model for such a task?

Do you want to replace the tokenizer for this task?

As a base model you could use https://huggingface.co/yam-peleg/Hebrew-Mistral-7B-200K, since Llama 3.2 only officially supports English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. Keep in mind that this is a base model and not an instruction-tuned model, so your dataset should be in a question/response or, even better, an instruction/question/response format. Please choose one of the formats from https://axolotl-ai-cloud.github.io/axolotl/docs/dataset-formats/ and post the dataset to your HuggingFace account. Keep in mind that I do not understand Hebrew, so while I can train the model, I will not be able to judge the quality of your dataset or of the result.
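
For example, one of the formats axolotl supports is alpaca (instruction/input/output). A minimal sketch of writing one such row; the file name and placeholder text are just illustrative, and your real content would be in Hebrew:

```python
import json

# One row in the "alpaca" format: instruction / optional input / output.
row = {
    "instruction": "Answer the following halachic question with sources.",
    "input": "Question text goes here.",
    "output": "Answer text goes here.",
}

# Append the row to a JSON Lines file (hypothetical file name).
with open("halacha_alpaca.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(row, ensure_ascii=False) + "\n")
```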
