An experimental model, fine-tuned using the "multiplicative-LoRA" method on c4ai-command-r-v01.

Other experimental models which attempt to encourage more diverse/creative text generation:

Some (brief) tests on the effect of these changes:

Using command-r-3-2024 with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-alfa:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-bravo:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-charlie:35b with temperature = 1 and min-p = 0.01:

image.png


Using command-r-3-2024 with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-alfa:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-bravo:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-charlie:35b with temperature = 1 and min-p = 0.01:

image.png


Using command-r-3-2024 with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-alfa:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-alfa:35b with temperature = 1.1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-bravo:35b with temperature = 1 and min-p = 0.01:

image.png

Using creative-writer-v0.1-bravo:35b with temperature = 0.9 and min-p = 0.01:

image.png

Using creative-writer-v0.1-charlie:35b with temperature = 1 and min-p = 0.01:

image.png


Observations:

  • Up-scaling of the pre-softmax logits during training used by creative-writer-v0.1-bravo:35b looks the most promising.
  • Down-scaling of the pre-softmax logits during training used by creative-writer-v0.1-charlie:35b looks to be very similar to inference-time temperature adjustment.
  • It may be better to just leave the pre-softmax logits up-scaled after training and then let the user perform inference-time temperature adjustment.
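The link between logit scaling and temperature in the last two observations can be checked numerically. This is only a minimal sketch of the inference-time identity (multiplying the logits by a constant is the same as dividing the temperature by it); it does not model what scaling during training does to the weights. The logit values and the `scale` factor are made up for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits at a given sampling temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up values, for illustration only
scale = 1.25                    # hypothetical up-scaling factor

# Up-scaled logits sampled at temperature = 1 ...
p_scaled = softmax([scale * z for z in logits], temperature=1.0)
# ... give the same distribution as the original logits at temperature = 1 / scale.
p_temp = softmax(logits, temperature=1.0 / scale)

print(all(math.isclose(a, b) for a, b in zip(p_scaled, p_temp)))
```

So if the logits are left up-scaled after training, a user can (in principle) undo or adjust this with the usual temperature setting.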

Usage

  • Use the normal command-r chat template: '<|START_OF_TURN_TOKEN|><|USER_TOKEN|>prompt<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>reply...'.
  • I suggest using no system prompt with this (and all other Cohere models!), as it writes much better without it IMO...
  • You MUST use some (small) value of min-p, such as 0.01, with this model (and with the original c4ai-command-r-v01 model), or else the model will output gibberish!
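A rough sketch of the two points above: wrapping a single user message in the chat template with no system prompt, and what a min-p filter does to a next-token distribution. The helper names `format_prompt` and `min_p_filter` are hypothetical, the probabilities are made up, and real inference backends implement min-p internally, so this is only meant to show the idea.

```python
import numpy as np

def format_prompt(user_message: str) -> str:
    """Wrap one user message in the command-r chat template (no system prompt)."""
    return (
        "<|START_OF_TURN_TOKEN|><|USER_TOKEN|>" + user_message
        + "<|END_OF_TURN_TOKEN|><|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    )

def min_p_filter(probs: np.ndarray, min_p: float = 0.01) -> np.ndarray:
    """Drop tokens whose probability is below min_p * max(probs), then renormalize."""
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

probs = np.array([0.60, 0.30, 0.096, 0.004])  # made-up next-token distribution
filtered = min_p_filter(probs, min_p=0.01)    # threshold is 0.01 * 0.60 = 0.006

print(format_prompt("Write me a short story."))
print(filtered)  # the last (very unlikely) token is removed from the candidates
```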

The "multiplicative-LoRA" method

Uses:

h = (I + lora_B @ lora_A) @ tensor @ x = tensor @ x + lora_B @ lora_A @ tensor @ x

or equivalently:

h = tensor @ x

h' = h + lora_B @ lora_A @ h

instead of the normal "additive-LoRA" method of:

h = (tensor + lora_B @ lora_A) @ x = tensor @ x + lora_B @ lora_A @ x
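The two forms above can be compared with a small NumPy sketch. The sizes and random values are toy assumptions (not the real down_proj shapes); the point is only that the multiplicative lora_A has shape (r, out_features) because it reads the layer's output, whereas the additive lora_A has shape (r, in_features).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # toy sizes, chosen for illustration only

tensor = rng.normal(size=(d_out, d_in))  # stands in for a down_proj weight
x = rng.normal(size=d_in)
h = tensor @ x

# Multiplicative: lora_A reads the OUTPUT of the layer, so it is (r, d_out).
lora_A_mult = rng.normal(size=(r, d_out))
lora_B_mult = rng.normal(size=(d_out, r))
h_mult = h + lora_B_mult @ lora_A_mult @ h
h_mult_alt = (np.eye(d_out) + lora_B_mult @ lora_A_mult) @ tensor @ x

# Additive: lora_A reads the INPUT of the layer, so it is (r, d_in).
lora_A_add = rng.normal(size=(r, d_in))
lora_B_add = rng.normal(size=(d_out, r))
h_add = h + lora_B_add @ lora_A_add @ x
h_add_alt = (tensor + lora_B_add @ lora_A_add) @ x

print(np.allclose(h_mult, h_mult_alt), np.allclose(h_add, h_add_alt))
```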

I only applied this to the down_proj matrices, and skipped the last layer's down_proj matrix in the same way as creative-writing-control-vectors-v3.0.

This currently requires hacking PEFT's layer.py like so:

#self.lora_A[adapter_name] = nn.Linear(self.in_features, r, bias=False)
self.lora_A[adapter_name] = nn.Linear(self.out_features, r, bias=False)
self.lora_B[adapter_name] = nn.Linear(r, self.out_features, bias=False)

and:

#x = x.to(lora_A.weight.dtype)
temp = result.to(lora_A.weight.dtype)

if not self.use_dora[active_adapter]:
    #result = result + lora_B(lora_A(dropout(x))) * scaling
    result = result + lora_B(lora_A(dropout(temp))) * scaling

Then to merge you need to hack qlora-pipe's merge_lora.py to use:

old_type = tensor.dtype
tensor = tensor.to(torch.float32)
tensor += scale * lora_B.to(torch.float32) @ lora_A.to(torch.float32) @ tensor
tensor = tensor.to(old_type)
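As a sanity check, the merged weight should reproduce what the hacked forward pass computes at runtime. Below is a minimal NumPy sketch of that check; it uses float32 throughout in place of the bfloat16 round-trip, toy sizes, and a hypothetical scale of 0.5 (with the config used here, lora_alpha / lora_rank would give 1.0, assuming the usual scaling convention).

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 2  # toy sizes, for illustration only
scale = 0.5               # hypothetical; assumed to be lora_alpha / lora_rank

tensor = rng.normal(size=(d_out, d_in)).astype(np.float32)
lora_A = rng.normal(size=(r, d_out)).astype(np.float32)
lora_B = rng.normal(size=(d_out, r)).astype(np.float32)
x = rng.normal(size=d_in).astype(np.float32)

# Runtime path, mirroring: result = result + lora_B(lora_A(result)) * scaling
result = tensor @ x
result = result + (lora_B @ (lora_A @ result)) * scale

# Merge path, mirroring: tensor += scale * lora_B @ lora_A @ tensor
merged = tensor + scale * lora_B @ lora_A @ tensor

print(np.allclose(merged @ x, result, rtol=1e-4, atol=1e-4))
```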

The "multiplicative-LoRA" method's link to control-vectors (and "abliteration")

There are actually 3 existing "multiplicative-LoRA" methods in PEFT/tuners:

but as explained in this conceptual guide:

image/png

all 3 methods deliberately maintain orthogonality (as a form of regularization; likely more suited to image generation models than LLMs), and thus are more restrictive in the types of transformations they can perform (i.e. rotations and/or improper rotations only, with no scaling or shear transformations possible...).

For example, these can't perform the orthogonal projection needed for "abliteration" (for a unit-length direction v):

h' = h - v @ v^T @ h

whereas the general (non-orthogonal) "multiplicative-LoRA" method can (in theory) do this by choosing to set u = -v like so:

h' = h + u @ v^T @ h

This general (non-orthogonal) "multiplicative-LoRA" method can also (in theory) perform Householder Transformation(s):

h' = h - 2 * v @ v^T @ h

by choosing to set u = -2v like so:

h' = h + u @ v^T @ h
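Both special cases can be checked with a rank-1 sketch in NumPy. The vectors are random toy values and `rank1_multiplicative` is a hypothetical helper name; note that both identities assume v has unit length.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # toy hidden size
h = rng.normal(size=d)
v = rng.normal(size=d)
v /= np.linalg.norm(v)  # both identities below require a unit-length v

def rank1_multiplicative(h, u, v):
    """h' = h + u @ v^T @ h, i.e. a rank-1 multiplicative-LoRA update."""
    return h + u * (v @ h)

h_ablated = rank1_multiplicative(h, -v, v)        # u = -v : projects out v
h_reflected = rank1_multiplicative(h, -2 * v, v)  # u = -2v: Householder reflection

print(v @ h_ablated)                                      # ~0: nothing left along v
print(np.linalg.norm(h), np.linalg.norm(h_reflected))     # equal: reflection keeps the norm
print(np.linalg.norm(h_ablated) < np.linalg.norm(h))      # projection shrinks the norm
```

The projection changes the norm of h, which is why a strictly orthogonal (norm-preserving) transformation cannot represent it, while the Householder case is itself orthogonal.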

In general, the way to think about these (non-orthogonal) "multiplicative-LoRAs" is as a kind of "conditional control-vector":

  • Each vector in lora_A looks for a certain direction, and via the dot-product it generates a (signed) weighting factor that measures the similarity between the output of the down_proj transformation and the specific vector in lora_A.
  • Each corresponding vector in lora_B then gets added to the hidden state / residual stream, scaled by the corresponding (signed) weighting factor.

So instead of having just a single vector that we add (and in essence adding a '.bias' weight to create an affine transformation), we now have many different control vectors that can be added (stored in lora_B), based on how well they match another set of "direction detection vectors" (stored in lora_A).
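The "conditional control-vector" view can be written out explicitly. In this sketch (toy sizes and random values, purely illustrative) each row of lora_A produces one signed weight, and the matching column of lora_B is added to the hidden state scaled by that weight; the loop gives the same result as the matrix form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 6, 3                       # toy hidden size and rank
h = rng.normal(size=d)            # output of down_proj for one token
lora_A = rng.normal(size=(r, d))  # rows: "direction detection vectors"
lora_B = rng.normal(size=(d, r))  # columns: control vectors

weights = lora_A @ h              # one signed weighting factor per detector

h_new = h.copy()
for i in range(r):
    # Add the i-th control vector, scaled by how strongly its detector fired.
    h_new += weights[i] * lora_B[:, i]

print(np.allclose(h_new, h + lora_B @ lora_A @ h))
```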

NOTE: The LoRA+ paper uses a similar way of viewing the purpose of lora_A and lora_B:

image/png

but whereas lora_A looks at the input to the transformation for "additive-LoRAs", these new (non-orthogonal) "multiplicative-LoRAs" instead use lora_A to look at the output of the (down_proj) transformation...


Training

  • Took just over 4 days using dual-A6000 GPUs connected via NVLink, using qlora-pipe.
  • The dataset consisted of approximately 1000 pre-2012 books converted to Markdown (~180M tokens) using the same dataset_combination_mode = 'concatenate' and dataset_type = 'textfile' as tdrussell's Llama-3-70B-Instruct-Storywriter used.
  • I used the same sequence_len = 8192 and batch_size_tokens = 8192 as Llama-3-70B-Instruct-Storywriter, but since I only target down_proj in a very specific way, I doubt this will affect the usable context length of the model, and 8k tokens should be around 2-3 user-AI rounds' worth of interaction in real terms.
  • I used pipeline_stages = 2 and "gradient_accumulation_steps": 16 to roughly match the "tokens-per-step" that Llama-3-70B-Instruct-Storywriter used.
  • I used a much lower learning-rate of 5e-6, as the 5e-5 value used by Llama-3-70B-Instruct-Storywriter dropped the evaluation loss far too quickly (likely due to adapting down_proj only being "almost convex").
  • I set lora_dropout = 0.0 as it doesn't really make sense to use with epochs = 1.
  • I left weight_decay = 0.01, but I'm not convinced this is really doing anything useful, and it may actually even be harming the adaptation of the early down_proj matrices, where the gradient signal is likely to be much weaker.
  • I found via experimentation that setting lora_rank and lora_alpha to a very low value (as a form of Spectral Regularization) can cause the training to get stuck at saddle-points, as explained in this paper, particularly if using stock SGD instead of Adam.
  • In general, I relied mainly on early stopping for Regularization and deliberately set out to undertrain the model (we can always increase the size of the dataset at a later time...).
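For reference, the "tokens-per-step" figure implied by these settings can be worked out as below. This is back-of-the-envelope arithmetic, not something read from the training logs, and it assumes that with 2 GPUs and pipeline_stages = 2 there is a single data-parallel replica and that each micro-batch holds batch_size_tokens tokens.

```python
batch_size_tokens = 8192          # tokens per micro-batch (from the config)
gradient_accumulation_steps = 16  # from ds_creative_writer.json
num_gpus = 2                      # dual-A6000
pipeline_stages = 2               # from the config

# ASSUMPTION: data-parallel replicas = GPUs / pipeline stages.
data_parallel_replicas = num_gpus // pipeline_stages

tokens_per_step = (
    batch_size_tokens * gradient_accumulation_steps * data_parallel_replicas
)
print(tokens_per_step)  # 131072
```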

config_creative_writer.toml

# Paths
model = '/mnt/data/c4ai-command-r-v01'
output_dir = '/mnt/data/creative-writer-v0.1-35b'

# Lora configuration
lora_rank = 64
lora_alpha = 64
lora_dropout = 0.0
target_modules = ['down_proj']
layers_to_transform = '0:38'  # skip last layer

# Optimization configuration
epochs = 1
lr_scheduler = 'constant'
warmup_steps = 100
batch_size_tokens = 8192

# Performance settings
pipeline_stages = 2
logging_steps = 1
eval_steps = 100
save_steps = 100
checkpoint_every_n_minutes = 60
eval_before_first_step = true
model_weight_dtype = 'bfloat16'
lora_weight_dtype = 'bfloat16'
keep_states = 3
group_by_length = true
activation_checkpointing = 'unsloth'

# Resume a prior run
resume_from_checkpoint = false

# Dataset configuration
dataset_combination_mode = 'concatenate'
eval_gradient_accumulation_steps = 1

[optimizer]
type = 'adamw_kahan'
lr = 5e-6
beta1 = 0.9
beta2 = 0.99
weight_decay = 0.01

[[datasets]]
name = 'books'
dataset_type = 'textfile'
dataset_path = '/mnt/data/books/*.txt'
sequence_len = 8192
eval_size = 0.01

ds_creative_writer.json

{
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 16,
    "gradient_clipping": 1.0,
    "steps_per_print": 1
}

Graphs

image/png

image/png

image/png

image/png

image/png
