# Hyperion-3.0-Mixtral-3x7B
## Model Details
This is an experimental first attempt at creating a Mixture of Experts (MoE) language model by combining several Mistral expert models. The model uses `hyperion-3.0-beta` as the base model, with a `bfloat16` output dtype. The gating mechanism is set to `hidden`, and two experts are consulted per token (`experts_per_token: 2`).
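For reference, below is a minimal usage sketch with the Hugging Face `transformers` library. The repository id is a placeholder, and loading in `bfloat16` simply mirrors the output dtype described above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder repository id; substitute the actual location of the merged model.
model_id = "Hyperion-3.0-Mixtral-3x7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the bfloat16 output dtype of the merge
    device_map="auto",
)

prompt = "Explain mixture-of-experts routing in one sentence."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```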
The model incorporates three expert models:
- `hyperion-3.0-beta`: Focused on science, math, and coding tasks.
- `dibt-mistral-7b`: Handles open-ended questions, summarization, and stream-of-consciousness writing.
- `rp-mistral-7b`: Specializes in roleplaying and character-based conversations.
Each expert is configured with a set of positive and negative prompts that steer the gating mechanism toward its specialization.
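As an illustration of how two-of-three routing works, the sketch below scores a token's hidden state against one gate vector per expert and weights the two highest-scoring experts with a softmax. The gate vectors, dimensions, and values are invented for illustration; in the actual model they would be derived from the positive and negative prompt hidden states rather than initialized randomly.

```python
import torch

hidden_dim, num_experts, top_k = 4096, 3, 2

# Hypothetical gate vectors, one per expert. In the real model these come from
# the hidden states of each expert's positive/negative prompts, not random init.
gate_vectors = torch.randn(num_experts, hidden_dim)

def route(token_hidden: torch.Tensor) -> torch.Tensor:
    """Return a weight per expert for one token, non-zero only for the top 2."""
    scores = gate_vectors @ token_hidden          # (num_experts,)
    top_scores, top_idx = scores.topk(top_k)      # consult 2 of the 3 experts
    weights = torch.softmax(top_scores, dim=-1)   # normalize over the chosen experts
    out = torch.zeros(num_experts)
    out[top_idx] = weights
    return out

token_hidden = torch.randn(hidden_dim)            # stand-in for a token's hidden state
print(route(token_hidden))                        # e.g. tensor([0.0000, 0.6210, 0.3790])
```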
## Intended Use and Limitations
This MoE model is an early prototype and may not exhibit optimal performance. It is intended for research and experimentation purposes only, and should not be used in production environments or for critical applications.
Please note that the expert models mentioned in the configuration have not been publicly released yet. They are expected to be made available in the near future, at which point this MoE model can be fully instantiated and evaluated.
## Training Details
The base model and experts were trained using QLoRA and SFT. However, the specific details of the training data, hyperparameters, and optimization techniques used for this MoE model are not available at this time.
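Since the exact recipe is unpublished, the following is only a generic sketch of the QLoRA setup such a fine-tune typically uses (4-bit quantized base weights plus a trainable LoRA adapter). The base checkpoint, rank, and target modules shown are assumptions, not the values used here.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

base_id = "mistralai/Mistral-7B-v0.1"  # placeholder base checkpoint

# QLoRA: load the frozen base model in 4-bit NF4 with bfloat16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    base_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach a small trainable LoRA adapter; the values here are illustrative only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The adapter would then be fine-tuned with a supervised fine-tuning (SFT) loop,
# e.g. trl's SFTTrainer, on instruction/response pairs.
```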
## Feedback and Future Updates
As this is an experimental model, feedback and suggestions are welcome. Future updates may include improvements to the gating mechanism, fine-tuning of the expert models, and the incorporation of additional experts to enhance the model's performance and breadth of knowledge.