ecker/vall-e


mrq committed on Jun 13, 2024

Commit

0c4f028

1 Parent(s): 2de4670

cleanup


Files changed (6)

README.md +31 -0
{old → model}/ckpt/ar+nar-retnet-8/fp32.pth +0 -0
model/{config.split.yaml → config.llama-split.yaml} +0 -0
model/{config.yaml → config.llama.yaml} +0 -0
old/config.ar_nar.yaml → model/config.retnet.yaml +94 -77
old/config.yaml +0 -0

README.md CHANGED

@@ -7,3 +7,34 @@ This repo catalogs my weights for use with my [VALL-E](https://github.com/e-c-k-
 The model currently is in a *semi-usable* state, and I'm releasing them now in hopes that it also helps jumpstart anyone else that wants to use them.
 To reiterate, this is ***by no means*** complete. I am not passing this off as competitive.

+## Models
+* `config.retnet.yaml` / `ar+nar-retnet-8`: The previously released weights.
+	+ This configuration uses a RetNet (retention-based transformer) as the underlying architecture, chosen, for better or for worse, because of a number of misleading interpretations of comparisons.
+		+ Prompt and response embeddings are summed (each further RVQ level gets the previous RVQ levels' embeddings factored in).
+		+ Tokenizer is a homebrewed "naive" implementation.
+	+ This model received the most training time, spread between my 4070Ti, my 7900XTX, and a few rental rigs to further training progress, entirely at `bfloat16` with `prodigyopt` (and a few optimizer restarts).
+	+ The later part of training shuffled between speakers rather than sampling from the global pool of utterances, to better focus on zero-shot performance. Because of this, I feel it achieved *decent* zero-shot performance.
+	+ However, because the dataset was aggressively trimmed to under 12 seconds for memory savings during training, it struggles to inference non-short utterances. Additional training may fix this, as the following models seemed to adapt well to longer utterances.
+	+ Prior testing showed that longer prompt durations result in better utterances.
+* `config.llama.yaml` / `ar+nar-llama-8`: The most recent-ishly trained weights after learning from my mistakes.
+	+ This configuration uses Llama's attention-based transformer as the underlying architecture, making use of creature comforts like RoPE, GQA, and memory-efficient attention (trained under `xformers`, which shouldn't really affect things).
+		+ Prompt and response embeddings are NOT summed (each RVQ level only attends to the current RVQ level).
+		+ Utilizes a HF tokenizer for "optimal" vocab.
+		+ The current RVQ level is included as a token as well to help guide NAR tasks better.
+	+ This model received a few days of training on my 4xV100s, stepping up the duration window to *try* to make the model better at inferencing longer utterances.
+		+ Some sessions ended up training the current duration window for a few epochs, but I don't know how much it affected things.
+	+ However, it seems to *only* do well with long utterances; short utterances fumble. I believe further training with a variety of durations should allow the AR to handle a variety of durations.
+		- I believe "slowly stepping up the context length" only works for text, and not audio.
+	+ Zero-shot performance leaves a bit to be desired, as it did not receive the special training prioritizing shuffling between speakers rather than the global pool of utterances.
+	+ Testing showed that, despite also stepping up the prompt duration, it *really* likes three second prompts.
+	+ Definitely needs additional training.
+* `config.llama-split.yaml` / `ar-llama-1` + `nar-llama-8`: The above model, but split and trained a little bit more.
+	+ This experiment is to see whether the AR and NAR benefit from being split up after enough pretraining, to un-"lobotomize" any penalties from attending to two different tasks (as the AR predicts the next token, and the NAR predicts the same token but at a different level).
+	+ I believe I trained each separate model for an extra day, on one additional audio-duration window, for similar training lengths.
+	+ I don't think audio quality differs enough to warrant splitting the model.
+There are a bunch of additional configurations (between the underlying arch, embedding modes, interleaving, and even a NAR-"only" model) that have yet to be explored further, but current experiments showed they are either not worth the additional performance penalties (interleaving) or fall flat (NAR-"only", chunked interleaving).
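
Reviewer note: the "summed" versus "NOT summed" embedding modes described in the README additions above can be pictured with a toy example. This is only a sketch of one plausible reading, not the repository's implementation: `embed_frame`, `tables`, and `codes` are hypothetical names, and the tiny lookup tables stand in for learned embeddings.

```python
def embed_frame(tables, codes, level, sums):
    """Build the input embedding for one audio frame at a given RVQ level.

    tables[lvl][code] is a (hypothetical) embedding vector for that level and code.
    sums=True  : add up the embeddings of levels 0..level (the RetNet-style mode).
    sums=False : use only the embedding of the current level (the Llama-style mode).
    """
    dim = len(tables[0][0])
    levels = range(level + 1) if sums else [level]
    out = [0.0] * dim
    for lvl in levels:
        vec = tables[lvl][codes[lvl]]
        out = [a + b for a, b in zip(out, vec)]
    return out


# Toy tables: 3 RVQ levels, 4 codes per level, 2-dimensional embeddings.
tables = [[[float(10 * lvl + code), float(lvl)] for code in range(4)] for lvl in range(3)]
codes = [1, 3, 2]  # one code per RVQ level for a single frame

summed = embed_frame(tables, codes, level=2, sums=True)
current_only = embed_frame(tables, codes, level=2, sums=False)
```

At level 0 the two modes coincide, since there are no earlier levels to factor in; they only diverge from level 1 onward.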

{old → model}/ckpt/ar+nar-retnet-8/fp32.pth RENAMED

File without changes

model/{config.split.yaml → config.llama-split.yaml} RENAMED

File without changes

model/{config.yaml → config.llama.yaml} RENAMED

File without changes

old/config.ar_nar.yaml → model/config.retnet.yaml RENAMED

@@ -1,97 +1,73 @@
-dataset:
-  training: []
-  validation: []
-  noise: []
-  speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
-  use_hdf5: True
-  use_metadata: True
-  hdf5_flag: r
-  validate: True
-  workers: 2
-  cache: True
-  phones_range: [4, 256]
-  duration_range: [1.0, 16.0]
-  random_utterance: 1.0
-  max_prompts: 3
-  prompt_duration: 6.0
-  sample_type: speaker
-  tasks_list: [ "tts" ] # , [ "tts", "tts-c", "ns", "sr", "tse", "cse", "nse", "tts"]
 models:
-  _prom_levels: 8
-  _max_levels: 8
-  _models:
-  - name: "ar+nar"
-    size: "full"
-    resp_levels: 8
-    prom_levels: 8
-    tasks: 8
-    arch_type: "retnet"
-    training: True
-    version: 2
 hyperparameters:
-  batch_size: 8
-  gradient_accumulation_steps: 32
-  gradient_clipping: 100
   optimizer: Prodigy
-  torch_optimizer: True
   learning_rate: 1.0
-  scheduler_type: ""
-  #scheduler_type: OneCycle
-  #scheduler_params:
-  #  cycle_first_step_size: 10_000
-  #  cycle_first_stair_count: 10_000
-  #  cycle_second_step_size: 15_000
-  #  cycle_second_stair_count: 15_000
-  #  decay_step_size: 5_000
-  #  cycle_min_lr: 2.5e-4 # 1.0e-5
-  #  cycle_max_lr: 2.5e-4 # 1.0e-4
-  #  decay_lr_rate: 0.0
-  #  cycle_min_mom: 0.90
-  #  cycle_max_mom: 0.99
-  #  decay_mom_rate: 0.0
 evaluation:
   batch_size: 16
-  frequency: 250
   size: 16
-  steps: 450
   ar_temperature: 0.95
   nar_temperature: 0.25
   load_disabled_engines: True
 trainer:
   iterations: 1_000_000
   save_tag: step
   save_on_oom: True
   save_on_quit: True
-  save_frequency: 100
   export_on_save: True
-  keep_last_checkpoints: 4
   aggressive_optimizations: False
   load_disabled_engines: False
   #load_state_dict: True
-  #strict_loading: False
   #load_tag: "9500"
   #load_states: False
   #restart_step_count: True
@@ -99,25 +75,66 @@ trainer:
   gc_mode: None # "global_step"
   weight_dtype: bfloat16
-  amp: False
   backend: deepspeed
   deepspeed:
     zero_optimization_level: 0
-    use_compression_training: True
-  activation_checkpointing: True
 inference:
-  use_vocos: True
   normalize: False
   weight_dtype: bfloat16
-  amp: False
-bitsandbytes:
-  enabled: False
-  injects: True
-  linear: True
-  embedding: True

+sample_rate: 24_000
+audio_backend: vocos
+experimental: True
 models:
+- name: "ar+nar"
+  size: "full"
+  resp_levels: 8
+  prom_levels: 8
+  tasks: 8
+  langs: 2
+  tones: 1
+  arch_type: retnet
+  training: False
+  version: 2
+  dropout: 0.1
+  audio_embedding_sums: True
+  interleave: False
+  experimental: False
+  capabilities: ["ar", "nar"]
 hyperparameters:
+  autotune: False
+  autotune_params:
+    start_profile_step: 1
+    end_profile_step: 50
+    num_tuning_micro_batch_sizes: 8
+  batch_size: 16
+  gradient_accumulation_steps: 8
+  gradient_clipping: 1.0
+  warmup_steps: 250
   optimizer: Prodigy
   learning_rate: 1.0
+  torch_optimizer: True
+  scheduler: "" # ScheduleFree
+  torch_scheduler: True
 evaluation:
   batch_size: 16
+  frequency: 1000
   size: 16
+  steps: 500
   ar_temperature: 0.95
   nar_temperature: 0.25
   load_disabled_engines: True
 trainer:
+  #no_logger: True
+  ddp: False
+  check_for_oom: False
   iterations: 1_000_000
   save_tag: step
   save_on_oom: True
   save_on_quit: True
+  save_frequency: 500
   export_on_save: True
+  keep_last_checkpoints: 8
   aggressive_optimizations: False
   load_disabled_engines: False
+  gradient_checkpointing: True
   #load_state_dict: True
+  strict_loading: False
   #load_tag: "9500"
   #load_states: False
   #restart_step_count: True
   gc_mode: None # "global_step"
   weight_dtype: bfloat16
+  amp: True
   backend: deepspeed
   deepspeed:
+    inferencing: True
     zero_optimization_level: 0
+    use_compression_training: False
+    amp: False
+  load_webui: False
 inference:
+  backend: deepspeed
+  audio_backend: "vocos"
   normalize: False
   weight_dtype: bfloat16
+  amp: True
+optimizations:
+  injects: False
+  replace: True
+  linear: False
+  embedding: False
+  optimizers: True
+  bitsandbytes: False
+  dadaptation: False
+  bitnet: False
+  fp8: False
+dataset:
+  speaker_name_getter: "lambda p: f'{p.parts[-3]}_{p.parts[-2]}'"
+  speaker_group_getter: "lambda p: f'{p.parts[-3]}'"
+  speaker_languages:
+    ja: []
+  use_hdf5: True
+  use_metadata: True
+  hdf5_flag: r
+  validate: True
+  workers: 6
+  cache: True
+  duration_range: [3.0, 16.0]
+  random_utterance: 1.0
+  max_prompts: 1
+  prompt_duration_range: [3.0, 9.0]
+  max_resps: 1
+  p_resp_append: 0.25
+  sample_type: path # path # speaker
+  tasks_list: [ "tts" ] # , [ "tts", "tts-c", "ns", "sr", "tse", "cse", "nse", "tts"]
+  training: []
+  validation: []
+  noise: []

old/config.yaml DELETED

The diff for this file is too large to render. See raw diff