Fancy-MLLM committed · Commit c8004e6 · verified · 1 Parent(s): a73e6d0

Update README.md

Files changed (1)
  1. README.md (+71, -3)
README.md CHANGED
@@ -1,3 +1,71 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ datasets:
+ - Fancy-MLLM/R1-OneVision
+ base_model:
+ - Qwen/Qwen2.5-VL-7B-Instruct
+ pipeline_tag: image-text-to-text
+ ---
+
+ ## Model Overview
+
+ This is a multimodal large language model fine-tuned from Qwen2.5-VL-7B-Instruct on the **R1-OneVision** dataset. Fine-tuning strengthens vision-language understanding and reasoning, making the model suitable for tasks such as visual reasoning and image understanding.
+
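+ As a quick sanity check, a recent `transformers` release with Qwen2.5-VL support can also serve the model through the `image-text-to-text` pipeline declared above. A minimal sketch, assuming the Hub repo id `Fancy-MLLM/R1-OneVision-7B` (replace it with this repository's actual path):
+
+ ```python
+ from transformers import pipeline
+
+ # Repo id below is an assumption -- point it at this repository's actual Hub path.
+ pipe = pipeline("image-text-to-text", model="Fancy-MLLM/R1-OneVision-7B", device_map="auto")
+
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "url": "<your image path or URL>"},
+             {"type": "text", "text": "Describe this image."},
+         ],
+     }
+ ]
+ print(pipe(text=messages, max_new_tokens=128))
+ ```
+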
+ ## Performance
+
+ | Task                | Metric   | Score |
+ |---------------------|----------|-------|
+ | Image Captioning    | BLEU-4   | XX.XX |
+ | VQA                 | Accuracy | XX.XX |
+ | Scene Understanding | mAP      | XX.XX |
+
+ ## Usage
+
+ You can load the model with the Hugging Face `transformers` library and the `qwen_vl_utils` helper package:
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the model and its processor
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "Fancy-MLLM/R1-OneVision-7B", torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("Fancy-MLLM/R1-OneVision-7B")
+
+ # Build a chat-style request with one image and one text prompt
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "<your image path>"},
+             {"type": "text", "text": "Hint: Please answer the question and provide the final answer at the end. Question: Which number do you have to write in the last daisy?"},
+         ],
+     }
+ ]
+
+ # Preparation for inference: apply the chat template and pack the image/video inputs
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)
+
+ # Generate, then decode only the newly generated tokens
+ generated_ids = model.generate(**inputs, max_new_tokens=4096)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
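+
+ If GPU memory is a constraint, the Qwen2.5-VL processor can cap the number of visual tokens per image via `min_pixels` / `max_pixels`. A small optional sketch (the pixel budgets below are illustrative values, and the repo id is assumed as above):
+
+ ```python
+ # Optional: bound the per-image visual token budget (illustrative values; repo id assumed).
+ min_pixels = 256 * 28 * 28
+ max_pixels = 1280 * 28 * 28
+ processor = AutoProcessor.from_pretrained(
+     "Fancy-MLLM/R1-OneVision-7B", min_pixels=min_pixels, max_pixels=max_pixels
+ )
+ ```
+
+ Passing `attn_implementation="flash_attention_2"` to `from_pretrained` is another common option when flash-attn is installed.
+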
+ ## Model Contact
+
+ - Xiaoxuan He: [email protected]
+ - Hongkun Pan: [email protected]