Update README.md
README.md
CHANGED
@@ -23,7 +23,7 @@ should probably proofread and complete it, then remove this comment. -->
 
 This model is a fine-tuned version of [vidore/colpali-v1.3-hf](https://huggingface.co/vidore/colpali-v1.3-hf) on these datasets:
 - [selimc/tr-textbook-ColPali](https://huggingface.co/datasets/selimc/tr-textbook-ColPali)
-- [muhammetfatihaktug/bilim_teknik_mini_base_colpali](https://huggingface.co/datasets/muhammetfatihaktug/
+- [muhammetfatihaktug/bilim_teknik_mini_base_colpali](https://huggingface.co/datasets/muhammetfatihaktug/bilim_teknik_mini_colpali)
 
 ![image/png](https://cdn-uploads.huggingface.co/production/uploads/65281302cad797fc4abeffd7/bs8zGLYCYPrjCs8JdsmjA.png)
 
@@ -40,7 +40,7 @@ This model is primarily designed for efficient indexing and retrieval of Turkish
 The training data was created via the following steps:
 - Downloading PDF files of Turkish textbooks and science magazines that are publicly available on the internet.
 - Using the [pdf-to-page-images-dataset](https://huggingface.co/spaces/Dataset-Creation-Tools/pdf-to-page-images-dataset) Space to convert the PDF documents into a single page image dataset
-- Use `gemini-2.0-flash-exp` to generate synthetic queries for these documents using the approach outlined [here](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html) with additional modifications. This results in [selimc/tr-textbook-ColPali](https://huggingface.co/datasets/selimc/tr-textbook-ColPali) and [muhammetfatihaktug/bilim_teknik_mini_base_colpali](https://huggingface.co/datasets/muhammetfatihaktug/
+- Use `gemini-2.0-flash-exp` to generate synthetic queries for these documents using the approach outlined [here](https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html) with additional modifications. This results in [selimc/tr-textbook-ColPali](https://huggingface.co/datasets/selimc/tr-textbook-ColPali) and [muhammetfatihaktug/bilim_teknik_mini_base_colpali](https://huggingface.co/datasets/muhammetfatihaktug/bilim_teknik_mini_colpali).
 - Train the model using the fine tuning [notebook](https://github.com/merveenoyan/smol-vision/blob/main/Finetune_ColPali.ipynb?s=35) from [Merve Noyan](https://huggingface.co/merve). Data processing step was modified to include all 3 types of queries. This approach not only adds variety to the training data but also effectively triples the dataset size, helping the model learn to handle diverse query types.
 
 ## Usage
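
Reviewer note on the training step in the second hunk: the README says the data processing step was modified to include all 3 types of queries, which triples the dataset size. A minimal sketch of what such an expansion might look like follows. It is not the authors' actual notebook code. The column names (`image`, `broad_topical_query`, `specific_detail_query`, `visual_element_query`) and the helper `expand_queries` are assumptions made for illustration, and file-name strings stand in for page images.

```python
from typing import Any, Dict, Iterable, List, Sequence

# Assumed query column names; the real dataset schema may differ.
QUERY_COLUMNS = (
    "broad_topical_query",
    "specific_detail_query",
    "visual_element_query",
)


def expand_queries(
    rows: Iterable[Dict[str, Any]],
    query_columns: Sequence[str] = QUERY_COLUMNS,
) -> List[Dict[str, Any]]:
    """Turn each page row into one (image, query) pair per non-empty query column."""
    pairs: List[Dict[str, Any]] = []
    for row in rows:
        for column in query_columns:
            query = row.get(column)
            if not query:
                # Skip query types that are missing or empty for this page.
                continue
            pairs.append(
                {"image": row["image"], "query": query, "query_type": column}
            )
    return pairs


# Hypothetical rows; strings stand in for the page images.
rows = [
    {
        "image": "page_001.png",
        "broad_topical_query": "photosynthesis overview",
        "specific_detail_query": "role of chlorophyll in light absorption",
        "visual_element_query": "diagram of a chloroplast",
    },
    {
        "image": "page_002.png",
        "broad_topical_query": "structure of the solar system",
        "specific_detail_query": "orbital period of Mars",
        "visual_element_query": "",  # empty, so this pair is dropped
    },
]

pairs = expand_queries(rows)
```

When every page has all three queries, the output has exactly three times as many rows as the input, which matches the "effectively triples the dataset size" claim. Pages with a missing query type contribute fewer pairs.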