license: cc-by-nc-sa-4.0
extra_gated_prompt: >-
  The Models are available for download for non-commercial purposes. Terms of
  Access: The researcher has requested permission to use the models.  In
  exchange for such permission, the researcher hereby agrees to the following
  terms and conditions:

  1. Researcher shall use the models only for non-commercial research and
  educational purposes. 

  2. The authors make no representations or warranties regarding the models,
  including but not limited to warranties of non-infringement or fitness for a
  particular purpose. 

  3. Researcher accepts full responsibility for his or her use of the models and
  shall defend and indemnify the authors of the models, including their
  employees, Trustees, officers and agents, against any and all claims arising
  from Researcher's use of the models, including but not limited to Researcher's
  use of any copies of copyrighted models files that he or she may create from
  the models. 

  4. Researcher may provide research associates and colleagues with access to the
  models provided that they first agree to be bound by these terms and
  conditions.

  5. The authors reserve the right to terminate Researcher's access to the
  models at any time.

  6. If Researcher is employed by a for-profit, commercial entity, Researcher's
  employer shall also be bound by these terms and conditions, and Researcher
  hereby represents that he or she is fully authorized to enter into this
  agreement on behalf of such employer.
extra_gated_fields:
  Name: text
  Email: text
  Organization: text
  Address: text
  I accept the terms of access: checkbox
datasets:
  - Wenetspeech4TTS/WenetSpeech4TTS
language:
  - zh

ISCSLP2024 Conversational Voice Clone Challenge (CoVoC) baseline models.

This challenge provides two baseline models.

VALL-E:

VALL-E is trained with Amphion.

It is first trained on the WenetSpeech4TTS dataset; the resulting weights are valle_base_model.bin.

It is then fine-tuned on the HQ-Conversations dataset; the resulting weights are valle_HQ-sft_model.bin.

For the inference code, please refer to the ISCSLP2024_CoVoC_baseline GitHub repository.
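Both checkpoints are released as single .bin weight files. A minimal loading sketch, assuming they are ordinary PyTorch state dicts (an assumption; the helper name and local paths below are hypothetical, not part of the baseline code):

```python
import os

def load_checkpoint(path):
    """Return the checkpoint's state dict, or None if the file is absent.

    Assumes the released .bin file is a standard PyTorch state dict
    (an assumption, not stated in this card).
    """
    if not os.path.exists(path):
        return None
    import torch  # imported lazily so the helper still runs without the file
    return torch.load(path, map_location="cpu")

# Hypothetical usage: prefer the fine-tuned weights if they were downloaded.
state = load_checkpoint("valle_HQ-sft_model.bin")
```

Building the VALL-E model and running decoding are handled by the ISCSLP2024_CoVoC_baseline repository; this sketch only covers reading a downloaded weight file.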

fish-speech:

fish-speech is an open-source speech model; its LLAMA and vits_decoder components are fine-tuned on the HQ-Conversations dataset.

Training follows the default fish-speech configuration.

For the training code, please refer to the Fish Speech GitHub repository.