More readme updates
README.md (changed):

@@ -32,17 +32,17 @@ from transformers import AutoModelForCausalLM, AutoTokenizer

```python
model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125", torch_dtype=torch.bfloat16, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
```

### Modifying the Model's Depth at Test Time

If you provide the argument `num_steps`, the model will execute a forward pass with that amount of compute:

```python
input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
model.eval()
model.to(device)

model(input_ids, num_steps=32)
```

The model has about 1.5B parameters in non-recurrent code, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline, the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting, and different from fixed-depth transformers! The model is trained to accept an arbitrary number of steps; however, using fewer than 4 steps will result in very coarse answers. Given enough context to reason about, the model improves on benchmarks up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.
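To make the guideline concrete, here is a small sketch (the helper function is only illustrative and uses the approximate per-component counts quoted above):

```python
# Rough estimate of materialized parameters:
# num_steps * 1.5B (recurrent block) + 1.5B (non-recurrent) + 0.5B (embedding).
def approx_materialized_params(num_steps: int) -> float:
    recurrent, non_recurrent, embedding = 1.5e9, 1.5e9, 0.5e9
    return num_steps * recurrent + non_recurrent + embedding

for n in (4, 16, 32, 64):
    print(f"num_steps={n:3d}: ~{approx_materialized_params(n) / 1e9:.0f}B materialized parameters")
```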
@@ -60,7 +60,7 @@ config = GenerationConfig(max_length=256, stop_strings=["<|end_text|>", "<|end_t

```python
# (tail of the GenerationConfig(...) call shown in the hunk context above)
                         eos_token_id=65505, bos_token_id=65504, pad_token_id=65509)

input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
outputs = model.generate(input_ids, config, tokenizer=tokenizer, num_steps=16)
```
@@ -84,7 +84,7 @@ model.generate(input_ids, config, num_steps=64, tokenizer=tokenizer)

### KV-cache Details
The model requires its own KV-cache implementation, `HuginnDynamicCache`; otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones. The current implementation will always try to inject this cache implementation, but that may break with Hugging Face updates. If you do not use `generate`, but implement your own generation loop, use a pattern like this:

```python
# first step:
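# The remaining lines are a sketch of the pattern (assumed, not verbatim): start
# without a cache, then pass the returned HuginnDynamicCache back in on later steps.
past_key_values = None
outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)
past_key_values = outputs.past_key_values  # an instance of HuginnDynamicCache

# next step, reusing the same cache object:
outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_values)
```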
@@ -98,25 +98,34 @@ outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_va

## Advanced Features

### Per-Token Adaptive Compute
When generating, you can also use a variable amount of compute per token. The model is not trained for this, so this is a proof of concept that the model can do this task zero-shot.
You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl`, and `argmax-stability`, via `criterion="kl"`. The exit threshold can be modified via `exit_threshold=5e-4`.
We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that using these variables overrides the default generation function; not all arguments that are valid for the normal `generate` call are valid here. To make this more explicit, you can also directly call `generate_with_adaptive_compute`:

```python
from transformers import TextStreamer
streamer = TextStreamer(tokenizer)

model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
                                     continuous_compute=False, criterion="kl", exit_threshold=5e-4, cache_kwargs={"lookup_strategy": "latest-m4"})
```
Your cache strategy should be set to `"latest-m4"` if using adaptive compute.
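Because these arguments override the default generation function, the same exits should also be reachable through the regular `generate` entry point; the call below is only a sketch of that implicit route, reusing the argument names from the explicit call above rather than a documented signature:

```python
# Sketch (assumed, not a documented signature): passing the adaptive-compute
# arguments to the regular generate call, which then swaps in the adaptive path.
outputs = model.generate(input_ids, config, tokenizer=tokenizer, streamer=streamer,
                         criterion="kl", exit_threshold=5e-4, cache_kwargs={"lookup_strategy": "latest-m4"})
```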

### KV-cache Sharing
To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache).

```python
model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
                                     continuous_compute=False, cache_kwargs={"lookup_strategy": "compress-s16"})
```
You can combine this with per-token adaptive compute. In that case your lookup strategy should be `latest-m4-compress-s16`.
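For example, a combined call might look like the following sketch, which simply merges the two calls shown above and uses the `latest-m4-compress-s16` strategy named in the previous sentence:

```python
# Per-token adaptive compute plus KV-cache sharing: the lookup strategy merges
# both behaviors into "latest-m4-compress-s16".
model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
                                     continuous_compute=False, criterion="kl", exit_threshold=5e-4,
                                     cache_kwargs={"lookup_strategy": "latest-m4-compress-s16"})
```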

### Warmstart / Continuous CoT
At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:

```python
model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer, continuous_compute=True)
```