JonasGeiping committed on
Commit 4a208d0 · verified · 1 Parent(s): e1dcd8a

More readme updates

Files changed (1)
  1. README.md +27 -18
README.md CHANGED
@@ -32,17 +32,17 @@ from transformers import AutoModelForCausalLM, AutoTokenizer
 model = AutoModelForCausalLM.from_pretrained("tomg-group-umd/huginn-0125", torch_dtype=torch.bfloat16, trust_remote_code=True)
 tokenizer = AutoTokenizer.from_pretrained("tomg-group-umd/huginn-0125")
 ```
-### Fixed depth Usage
-By providing the argument `num_steps`, the model will execute a pass with that amount of compute:
+### Modifying the Model's Depth at Test Time:
+By providing the argument `num_steps`, the model will execute a forward pass with that amount of compute:
 ```python
-input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
+input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 model.eval()
 model.to(device)

 model(input_ids, num_steps=32)
 ```
 The model has about 1.5B parameters in non-recurrent code, 0.5B parameters in the embedding, and 1.5B recurrent parameters, so, as a guideline,
-the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting (and different from fixed-depth) transformers!
+the number of materialized parameters is `num_steps * 1.5B + 2B`. Playing with this parameter is what makes this model interesting, and different from fixed-depth transformers!
 The model is trained to accept an arbitrary number of steps. However, using fewer than 4 steps will result in very coarse answers. If given enough context to reason about, benchmarks show the model improving up to around `num_steps=64`. Beyond that, more steps generally do not hurt, but we see no further improvements.


@@ -60,7 +60,7 @@ config = GenerationConfig(max_length=256, stop_strings=["<|end_text|>", "<|end_t
                           eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)


-input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
+input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)
 outputs = model.generate(input_ids, config, tokenizer=tokenizer, num_steps=16)
 ```

@@ -84,7 +84,7 @@ model.generate(input_ids, config, num_steps=64, tokenizer=tokenizer)

 ### KV-cache Details
 The model requires its own KV-cache implementation `HuginnDynamicCache`, otherwise the KV-caches of later calls to the recurrent block will overwrite the earlier ones.
-This should be handled automatically by this implementation, but may break with huggingface updates. If you do not use generate, but implement your own generation, use a pattern like this:
+The current implementation will always try to inject this cache implementation, but that may break with huggingface updates. If you do not use generate, but implement your own generation, use a pattern like this:

 ```python
 # first step:
@@ -98,25 +98,34 @@ outputs = model(input_ids=input_ids, use_cache=True, past_key_values=past_key_va
 ## Advanced Features

 ### Per-Token Adaptive Compute
+When generating, you can also use a variable amount of compute per token. The model is not trained for this, so this is a proof of concept showing that it can do this task zero-shot.
+You can pick between a few sane stopping rules, `entropy-diff`, `latent-diff`, `kl`, and `argmax-stability`, via `criterion=kl`. The exit threshold can be modified via `exit_threshold=5e-4`.
+We suggest using `kl` for interesting exits and `argmax-stability` for conservative exits. Note that using these variables overrides the default generation function. Not all arguments that are valid for the normal `generate` call are valid here. To make this more explicit, you can also directly call `generate_with_adaptive_compute`:
+
 ```python
-model.to(device=device, dtype=torch.bfloat16)
-model.eval()
+from transformers import TextStreamer
+streamer = TextStreamer(tokenizer)

-past_key_values = DynamicCache()
-config = GenerationConfig(max_length=64, stop_strings=["<|end_text|>", "<|end_turn|>"],
-                          use_cache=True, past_key_values=past_key_values,
-                          do_sample=False, temperature=None, top_k=None, top_p=None, min_p=None,
-                          return_dict_in_generate=True,
-                          eos_token_id=65505,bos_token_id=65504,pad_token_id=65509)
-# Note: num_steps and other model arguments CANNOT be included here, they will shadow model args at runtime
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, criterion="kl", exit_threshold=5e-4, cache_kwargs={"lookup_strategy": "latest-m4"})

-input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt", add_special_tokens=True).to(device)[:, :-1]
-outputs = model.generate(input_ids, config, tokenizer=tokenizer)
 ```
+Your cache strategy should be set to `"latest-m4"` if using adaptive compute.

 ### KV-cache Sharing
+To reduce KV-cache memory requirements, the model can be run with fewer KV-caches, with later iterations in the recurrence overwriting earlier caches. To use this feature, set
+the cache argument `lookup_strategy` to include `compress-s16` (where the last number determines the size of the cache).
+```
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
+                                     continuous_compute=False, cache_kwargs={"lookup_strategy": "compress-s16"})
+```
+You can combine this with per-token adaptive compute. In that case, your lookup strategy should be `latest-m4-compress-s16`.

-
+### Warmstart / Continuous CoT
+At each generation step, the recurrence can be warmstarted with the final state from the previous token by setting `continuous_compute=True`, like so:
+```
+model.generate_with_adaptive_compute(input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer, continuous_compute=True)
+```


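As a rough back-of-the-envelope check on the `num_steps * 1.5B + 2B` guideline quoted in the diff above, the sketch below works out the materialized parameter count for a few step budgets. It is illustrative only; the 1.5B, 0.5B, and 1.5B figures are the approximate counts from the README, not exact numbers.

```python
# Sketch: approximate materialized parameters for a given num_steps,
# following the README's guideline. Not an exact parameter count.
def approx_materialized_params(num_steps: int) -> float:
    recurrent = 1.5e9                # recurrent block, re-applied num_steps times
    non_recurrent = 1.5e9 + 0.5e9    # non-recurrent layers plus embedding, applied once
    return num_steps * recurrent + non_recurrent

for steps in (4, 32, 64):
    print(f"num_steps={steps}: ~{approx_materialized_params(steps) / 1e9:.0f}B parameters materialized")
# prints roughly 8B, 50B, and 98B for 4, 32, and 64 steps respectively
```

For the `num_steps=32` example above, that is roughly 50B parameters' worth of weights applied in a single forward pass.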
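The KV-cache Details hunk only shows the start of the manual-generation pattern (`# first step:` plus the `model(input_ids=..., use_cache=True, past_key_values=...)` call visible in the hunk header). Below is a minimal greedy-decoding sketch of that pattern, assuming the returned `past_key_values` (the `HuginnDynamicCache` mentioned above) is simply passed back in on later steps, as in the usual `transformers` convention. Treat it as an illustration, not the README's exact code.

```python
import torch

input_ids = tokenizer.encode("The capital of Westphalia is", return_tensors="pt",
                             add_special_tokens=True).to(device)

with torch.no_grad():
    # first step: no cache yet; the model sets up its own HuginnDynamicCache
    outputs = model(input_ids=input_ids, use_cache=True, past_key_values=None, num_steps=32)
    past_key_values = outputs.past_key_values
    next_token = outputs.logits[:, -1:].argmax(dim=-1)

    # subsequent steps: feed only the newly generated token plus the cache
    generated = [next_token]
    for _ in range(20):
        outputs = model(input_ids=next_token, use_cache=True,
                        past_key_values=past_key_values, num_steps=32)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1:].argmax(dim=-1)
        generated.append(next_token)

print(tokenizer.decode(torch.cat(generated, dim=1)[0]))
```

A real loop would also stop on the `<|end_text|>` / `<|end_turn|>` stop strings rather than after a fixed number of tokens.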
 
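The last hunk states that per-token adaptive compute and KV-cache sharing can be combined via the `latest-m4-compress-s16` lookup strategy, but does not spell out the call. A plausible sketch, reusing only arguments that already appear elsewhere in the diff:

```python
from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

# Sketch: adaptive per-token compute (kl criterion + exit threshold) combined with
# the compressed, shared KV-cache via the combined lookup strategy named in the README.
outputs = model.generate_with_adaptive_compute(
    input_ids, config, num_steps=64, tokenizer=tokenizer, streamer=streamer,
    continuous_compute=False, criterion="kl", exit_threshold=5e-4,
    cache_kwargs={"lookup_strategy": "latest-m4-compress-s16"},
)
```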