Commit 8d45bb0 (parent: 8c4e1dd) by nintwentydo: Update README.md
6.0bpw quant of [Pixtral-Large-Instruct](https://huggingface.co/nintwentydo/Pixtral-Large-Instruct-2411).

Vision inputs are working on the dev branch of [ExLlamaV2](https://github.com/turboderp/exllamav2/tree/dev).

***21 Dec 2024:** This model has been a LOT of fun to experiment and learn with. The model card below has been updated with the changes made to this repo over the last week.*

## Architecture Differences to Pixtral 12B
Pixtral 12B has bias keys for the multi_modal_projector layers, whereas Pixtral Large does not. Rather than including them with low/zero values, this conversion omits those bias keys, matching the keys present in Mistral's original Pixtral Large upload. The model's config.json includes `"multimodal_projector_bias": false` to flag this. *n.b. If anyone in the community confirms that zero-initializing these keys is the better approach, I'm happy to reupload with them included.*

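As a minimal illustration of how a loader can key off this flag (the config excerpt below is hand-written for the example, not the full file):

```python
import json

# Hand-written excerpt of this repo's config.json, for illustration only.
config = json.loads('{"multimodal_projector_bias": false}')

def expects_projector_bias(cfg):
    # When the flag is false, a loader should not look for
    # multi_modal_projector.*.bias keys in the checkpoint.
    return cfg.get("multimodal_projector_bias", True)

print(expects_projector_bias(config))  # False
```
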
## Tokenizer
This model uses a conversion of the Mistral v7m1 tokenizer. Pixtral 12B and Pixtral Large use different tokenizers with different vocab sizes, so make sure you use the right one.

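A minimal guard against loading the wrong tokenizer could look like this (the vocab sizes passed in are placeholders; read the real values from each repo's config and tokenizer files):

```python
def check_tokenizer(model_vocab_size: int, tokenizer_vocab_size: int) -> None:
    # Pixtral 12B and Pixtral Large tokenizers differ in vocab size, so a
    # mismatch here usually means the wrong tokenizer was loaded.
    if model_vocab_size != tokenizer_vocab_size:
        raise ValueError(
            f"tokenizer vocab {tokenizer_vocab_size} != model vocab {model_vocab_size}"
        )

check_tokenizer(32768, 32768)  # placeholder values: matching sizes pass silently
```
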
## Prompting / Chat Template
The included chat_template.json supports all of Mistral's defined features, with some of my own additions.

I believe this implementation gives quite a lot of flexibility for using the model, and in my testing it has worked quite well.

Example *(line breaks added for readability)*:
```
<s>[SYSTEM_PROMPT] <system prompt>[/SYSTEM_PROMPT]
[INST] [IMG]<user message>
[AVAILABLE_TOOLS] [<tool definitions>][/AVAILABLE_TOOLS][/INST]
[IMG]<assistant response>
[TOOL_CALLS] [<tool calls>][/TOOL_CALLS]
[TOOL_RESULTS] <tool results including images>[/TOOL_RESULTS]
</s>[INST] <user message>[/INST]
```

**System Prompts**:
Messages with role "system" will be parsed as `[SYSTEM_PROMPT] <content>[/SYSTEM_PROMPT]` anywhere they appear in the chat history.

This appears to work pretty well for passing extra instructions at various depths, and keeps instructions separate from the conversation.

**Allowing Non-Alternating Roles**:
Multiple user messages in a row can be provided, each separated with `[INST][/INST]`. This could work well in group conversation settings, or in environments where multiple user messages arrive before the model is invoked. Having a `[/INST]` break between messages appeared to stop the model from trying to respond to every previous message and kept it focused on the last one, while still retaining knowledge of the messages before it.

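The system-prompt and non-alternating-role rules above can be sketched in plain Python. This is an illustration of the described behaviour only, simplified to text-only messages; the real logic lives in the repo's chat_template.json (Jinja), not in this snippet:

```python
# Simplified sketch of the formatting rules described above (no images/tools).
def render(messages):
    out = ["<s>"]
    for m in messages:
        role, content = m["role"], m["content"]
        if role == "system":
            # System prompts are wrapped wherever they appear in history.
            out.append(f"[SYSTEM_PROMPT] {content}[/SYSTEM_PROMPT]")
        elif role == "user":
            # Consecutive user messages each get their own [INST] block.
            out.append(f"[INST] {content}[/INST]")
        elif role == "assistant":
            out.append(f"{content}</s>")
    return "".join(out)

msgs = [
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "First message"},
    {"role": "user", "content": "Second message"},
]
print(render(msgs))
# <s>[SYSTEM_PROMPT] Be concise.[/SYSTEM_PROMPT][INST] First message[/INST][INST] Second message[/INST]
```
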
**Image Inputs Everywhere**:
Images can now be sent in user, assistant, and tool result messages, and this actually seems to work. In one test I included an image in an assistant reply 10-15 messages back in the conversation, asked the assistant to recall what image it had previously sent, and it described the image accurately.

Having this flexibility could allow for interesting applications, for example if you were to define a tool definition for image generation:
- the tool is invoked and calls an image generation API/model
- the image is returned inside a tool result message
- the model responds with a message informed by the image it generated
- you can have further conversation about the generated image, or make revisions, with the model actually knowing what was created

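The image-generation flow above, written out as a hypothetical message list. The role and field names follow the common OpenAI-style chat schema and the `generate_image` tool is invented for illustration; this is not a confirmed TabbyAPI payload:

```python
# Hypothetical conversation for the tool-based image generation flow above.
conversation = [
    {"role": "user", "content": "Generate an image of a red fox."},
    {"role": "assistant", "tool_calls": [
        {"id": "call_1", "function": {
            "name": "generate_image",                # hypothetical tool name
            "arguments": '{"prompt": "a red fox"}'}},
    ]},
    # The generated image comes back inside the tool result, so the model
    # can see what was created and discuss or revise it later.
    {"role": "tool", "tool_call_id": "call_1", "content": [
        {"type": "image_url", "image_url": {"url": "data:image/png;base64,<image>"}},
    ]},
    {"role": "assistant", "content": "Here is your red fox. Want any changes?"},
]
```
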
## Usage
Working in TabbyAPI with the dev branch of ExLlamaV2.

<img src="https://huggingface.co/nintwentydo/Pixtral-Large-Instruct-2411/resolve/main/image-input-example.jpg">

## Available Sizes
| Repo | Bits | Head Bits | Size |
| ----------- | ------ | ------ | ------ |