duyntnet committed
Commit 2e4e42f
1 Parent(s): a955989

Upload README.md

Files changed (1)
  1. README.md +143 -0
README.md ADDED
---
license: other
language:
- en
pipeline_tag: text-generation
inference: false
tags:
- transformers
- gguf
- imatrix
- Llama-3.2-3B
---
Quantizations of https://huggingface.co/meta-llama/Llama-3.2-3B

### Inference Clients/UIs
* [llama.cpp](https://github.com/ggerganov/llama.cpp)
* [KoboldCPP](https://github.com/LostRuins/koboldcpp)
* [text-generation-webui](https://github.com/oobabooga/text-generation-webui)
* [ollama](https://github.com/ollama/ollama)
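
These are GGUF files, so any llama.cpp-based client can load them. As a quick illustration, a quantization can also be loaded from Python via the `llama-cpp-python` bindings for llama.cpp; this is a minimal sketch, and the GGUF file name and settings below are placeholders — substitute whichever quantization you actually download.

```python
# Minimal sketch using the llama-cpp-python bindings for llama.cpp.
# The GGUF file name is a placeholder -- use the quantization you downloaded.
from llama_cpp import Llama

llm = Llama(
    model_path="Llama-3.2-3B-Q4_K_M.gguf",  # placeholder file name
    n_ctx=4096,        # context window
    n_gpu_layers=-1,   # offload all layers to GPU if available; use 0 for CPU only
)

# Llama-3.2-3B is a base model, so plain text completion is the natural usage.
output = llm(
    "The capital of France is",
    max_tokens=32,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```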

---

# From original readme

Last week, the release of and buzz around DeepSeek-V2 ignited widespread interest in MLA (Multi-head Latent Attention)! Many in the community suggested open-sourcing a smaller MoE model for in-depth research. And now DeepSeek-V2-Lite comes out:

- 16B total params, 2.4B active params, trained from scratch on 5.7T tokens
- Outperforms 7B dense and 16B MoE models on many English & Chinese benchmarks
- Deployable on a single 40G GPU, fine-tunable on 8x80G GPUs

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model characterized by economical training and efficient inference. It adopts innovative architectures, including Multi-head Latent Attention (MLA) and DeepSeekMoE. MLA enables efficient inference by compressing the Key-Value (KV) cache into a compact latent vector, while DeepSeekMoE enables training strong models at an economical cost through sparse computation.
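
To make the latent-KV idea concrete, here is a schematic sketch of that compression step. It is an illustration only, not DeepSeek-V2's actual implementation, and all dimensions are made up: the hidden state is down-projected to a small latent that gets cached, while keys and values are re-expanded from that latent at attention time.

```python
# Schematic illustration of the latent-KV idea behind MLA only -- not DeepSeek-V2 code.
# All dimensions are made up for illustration.
import torch
import torch.nn as nn

d_model, d_latent, n_heads, d_head = 512, 64, 8, 64

down_kv = nn.Linear(d_model, d_latent, bias=False)        # compress hidden states
up_k = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> keys
up_v = nn.Linear(d_latent, n_heads * d_head, bias=False)  # expand latent -> values

hidden = torch.randn(1, 16, d_model)   # (batch, seq_len, d_model)
latent_kv = down_kv(hidden)            # (1, 16, 64) -- this small tensor is what gets cached
k = up_k(latent_kv)                    # keys reconstructed on the fly at attention time
v = up_v(latent_kv)                    # values reconstructed on the fly at attention time

full_kv_cache = k.numel() + v.numel()  # what a conventional per-layer KV cache would hold
mla_cache = latent_kv.numel()          # what latent-KV caching holds instead
print(f"latent cache is {full_kv_cache / mla_cache:.0f}x smaller")
```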

## How to run locally

**To run DeepSeek-V2-Lite in BF16 for inference, a single 40GB GPU is required.**
### Inference with Hugging Face's Transformers
You can use [Hugging Face's Transformers](https://github.com/huggingface/transformers) directly for model inference.

#### Text Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

text = "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
```

#### Chat Completion
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.bfloat16).cuda()
model.generation_config = GenerationConfig.from_pretrained(model_name)
model.generation_config.pad_token_id = model.generation_config.eos_token_id

messages = [
    {"role": "user", "content": "Write a piece of quicksort code in C++"}
]
input_tensor = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
outputs = model.generate(input_tensor.to(model.device), max_new_tokens=100)

result = tokenizer.decode(outputs[0][input_tensor.shape[1]:], skip_special_tokens=True)
print(result)
```

The complete chat template can be found in `tokenizer_config.json` in the Hugging Face model repository.

An example of the chat template is shown below:

```bash
<|begin▁of▁sentence|>User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:
```

You can also add an optional system message:

```bash
<|begin▁of▁sentence|>{system_message}

User: {user_message_1}

Assistant: {assistant_message_1}<|end▁of▁sentence|>User: {user_message_2}

Assistant:
```
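
To inspect the exact prompt string this template produces, you can render it without tokenizing. A small sketch, assuming the same chat model name as in the examples above:

```python
# Render the chat template to a plain string to inspect the exact prompt format.
from transformers import AutoTokenizer

model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a piece of quicksort code in C++"},
]
# tokenize=False returns the rendered prompt string instead of token ids
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```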

### Inference with vLLM (recommended)
To use [vLLM](https://github.com/vllm-project/vllm) for model inference, please merge this Pull Request into your vLLM codebase: https://github.com/vllm-project/vllm/pull/4650.

```python
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

max_model_len, tp_size = 8192, 1
model_name = "deepseek-ai/DeepSeek-V2-Lite-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name)
llm = LLM(model=model_name, tensor_parallel_size=tp_size, max_model_len=max_model_len, trust_remote_code=True, enforce_eager=True)
sampling_params = SamplingParams(temperature=0.3, max_tokens=256, stop_token_ids=[tokenizer.eos_token_id])

messages_list = [
    [{"role": "user", "content": "Who are you?"}],
    [{"role": "user", "content": "Translate the following content into Chinese directly: DeepSeek-V2 adopts innovative architectures to guarantee economical training and efficient inference."}],
    [{"role": "user", "content": "Write a piece of quicksort code in C++."}],
]

prompt_token_ids = [tokenizer.apply_chat_template(messages, add_generation_prompt=True) for messages in messages_list]

outputs = llm.generate(prompt_token_ids=prompt_token_ids, sampling_params=sampling_params)

generated_text = [output.outputs[0].text for output in outputs]
print(generated_text)
```

### LangChain Support
Since our API is compatible with OpenAI's, you can easily use it with [LangChain](https://www.langchain.com/).
Here is an example:

```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
    model='deepseek-chat',
    openai_api_key='<your-deepseek-api-key>',
    openai_api_base='https://api.deepseek.com/v1',
    temperature=0.85,
    max_tokens=8000)
```
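
With a valid API key filled in, a quick call through LangChain could then look like this; a minimal usage sketch that reuses the `llm` object defined above.

```python
# Assumes `llm` from the snippet above and a valid DeepSeek API key.
response = llm.invoke("Write a piece of quicksort code in C++")
print(response.content)
```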