metadata
language:
  - en
  - zh
  - ja
  - de
model-index:
  - name: miniG
    results:
      - task:
          type: text-generation
        metrics:
          - name: MMLU
            type: MMLU
            value: 85.45
          - name: IFEval
            type: IFEval
            value: 74.22
          - name: GSM8K (5-shot)
            type: GSM8K (5-shot)
            value: 75.89
          - name: HumanEval
            type: HumanEval
            value: 79.88
          - name: GPQA
            type: GPQA
            value: 37.37
license: agpl-3.0
pipeline_tag: text-generation
co2_eq_emissions:
  emissions: 700
  training_type: fine-tuning
library_name: transformers

miniG

Text-Only Weight

GGML with ChatGLM.cpp (recommended): https://github.com/li-plus/chatglm.cpp

GGUF (Text-Only, not recommended): Quality degrades significantly, even at F16.

Update: A new "alt" version of the model has been uploaded. The weights in the main branch of the repository are trained directly on the SFT data, while the alt branch is trained with the masked context of the raw text that was used to synthesize that data. This is intended to reduce overfitting and give a more objective picture of performance. The alt version is more stable in some cases and overfits less. However, it may be weaker in knowledge retention and more prone to hallucination due to the lack of external context.

Hint: How can I check if my inference parameters and quantized inference are performing well? You can try having the model recite "The Gift of the Magi" by O. Henry (which is a public domain text). You should expect it to recite the entire text accurately, including the formatting.
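One way to turn the recitation check above into a number is to compare the model's output against a reference copy of the text. The sketch below is only an illustration: `recitation_score` and `first_divergence` are hypothetical helpers, and Python's standard `difflib` is used as a stand-in similarity measure; neither is part of the model's own tooling. Whitespace is deliberately left untouched, since the formatting is expected to match as well.

```python
import difflib


def recitation_score(reference: str, output: str) -> float:
    """Return a similarity ratio in [0, 1]; 1.0 means an exact match.

    No whitespace normalization is applied, so differences in line breaks
    or indentation lower the score.
    """
    matcher = difflib.SequenceMatcher(None, reference, output, autojunk=False)
    return matcher.ratio()


def first_divergence(reference: str, output: str) -> int:
    """Return the index of the first differing character, or -1 if identical."""
    for i, (a, b) in enumerate(zip(reference, output)):
        if a != b:
            return i
    if len(reference) == len(output):
        return -1
    # One string is a strict prefix of the other.
    return min(len(reference), len(output))
```

A score that drops after quantization or after switching inference backends, with the prompt and sampling settings held fixed, would suggest the setup is degrading the model.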

The model is trained on a synthetic dataset of over 120 million entries. The dataset was generated with state-of-the-art language models that have large context windows, using methods akin to retrieval-augmented generation and knowledge graph integration. Data synthesis is conducted within clusters derived from a curated pretraining corpus of 20 billion tokens, and the results are subsequently validated by the model itself.

The model has not been thoroughly aligned with human preferences, and it is under no obligation to cater to poorly constructed prompts or the clichés often found in conventional benchmarks. Bonus: An implementation of a Vision Language Model that has undergone Locked-Image Tuning is included.

Supported Input Modalities: text, image. For the text-only weights, use the branch revision=text-only at https://huggingface.co/CausalLM/miniG/tree/text-only . GGUF for the text-only weights should work after the merge of PR #9194.
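A minimal sketch of selecting the text-only branch with Hugging Face Transformers is shown below. It assumes the repository loads through `AutoTokenizer` and `AutoModelForCausalLM` with `trust_remote_code=True`, as is usual for GLM-4 based checkpoints; that is an assumption, not something this card states. `text_only_kwargs` and `load_text_only` are hypothetical helper names.

```python
REPO_ID = "CausalLM/miniG"


def text_only_kwargs(revision: str = "text-only") -> dict:
    """Keyword arguments that pin the download to the text-only branch."""
    return {
        "revision": revision,
        # Assumption: the checkpoint ships custom modeling code.
        "trust_remote_code": True,
    }


def load_text_only():
    """Load tokenizer and model from the text-only branch (downloads weights)."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = text_only_kwargs()
    tokenizer = AutoTokenizer.from_pretrained(REPO_ID, **kwargs)
    model = AutoModelForCausalLM.from_pretrained(
        REPO_ID, torch_dtype="auto", **kwargs
    )
    return tokenizer, model
```

Calling `load_text_only()` fetches the full weights, so it is best run once on a machine with enough disk space and memory for a 9B model.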

Context Window: 1M tokens

Model Parameters: LLM - 9B (initialized from THUDM/glm-4-9b-chat-1m); Optional ViT - 5B

Cautionary Notes: It is strongly recommended to use a standard implementation for inference, such as Hugging Face Transformers, to avoid the significant performance degradation that may occur with accelerated inference implementations like vllm or lmdeploy, not to mention the potentially catastrophic effects of model quantization. As of now, these accelerated implementations are known to severely compromise vision inference, though their impact on pure text performance is less pronounced.

Inference Parameters: Our observations suggest that, for fewer hallucinations, it is advisable to sample with top_p=0.8 followed by a temperature of 0.3, or alternatively to use pure temperature sampling at 0.2. In general, a lower temperature is required than for similar models, which we tentatively attribute to overfitting on the vast dataset. For inference code, refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b. We only guarantee best performance when using transformers for inference. In our testing we also used lmdeploy, which resulted in significant performance degradation for multimodal input.
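To make the two recommended settings concrete, here is a small pure-Python sketch of the sampling step. It takes "top_p=0.8 followed by a temperature of 0.3" literally, filtering the nucleus at temperature 1 and then sharpening the surviving tokens; that ordering is an interpretation of the sentence above, and real inference libraries may apply the two steps in a different order. All function names are hypothetical.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities at the given temperature."""
    peak = max(logits)
    exps = [math.exp((x - peak) / temperature) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_then_temperature(logits, top_p=0.8, temperature=0.3):
    """Nucleus-filter at temperature 1, then re-sharpen the survivors.

    Returns a dict mapping token index -> sampling probability.
    """
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:  # smallest prefix whose mass reaches top_p
            break
    kept_probs = softmax([logits[i] for i in kept], temperature)
    return dict(zip(kept, kept_probs))


def sample(dist, rng=random):
    """Draw one token index from a {token: probability} dict."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]
```

Pure temperature sampling at 0.2 corresponds to `top_p_then_temperature(logits, top_p=1.0, temperature=0.2)`, which keeps every token and only sharpens the distribution.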

Regarding Formatting: We strongly recommend you double-check your input to ensure:

1. The system prompt is not empty. Even something as simple as "You are a helpful assistant." is expected.
2. There is always a newline character after the <|role|> tag.

This will help ensure proper parsing and processing of your input.
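The two formatting checks above can be automated before a prompt is sent to the model. The sketch below is illustrative only: the role names in `ROLE_TAG` (`system`, `user`, `assistant`, `observation`) are assumed from common GLM-4 usage, and both helper functions are hypothetical. The tokenizer's own chat template remains the authoritative source for the prompt format.

```python
import re

# Assumed role names; adjust to match the tokenizer's chat template.
ROLE_TAG = re.compile(r"<\|(system|user|assistant|observation)\|>")


def check_messages(messages):
    """Check that the conversation starts with a non-empty system prompt."""
    problems = []
    first = messages[0] if messages else {}
    content = (first.get("content") or "").strip()
    if first.get("role") != "system" or not content:
        problems.append("system prompt is missing or empty")
    return problems


def check_rendered_prompt(prompt):
    """Check that every role tag in the rendered prompt is followed by a newline."""
    problems = []
    for match in ROLE_TAG.finditer(prompt):
        if prompt[match.end():match.end() + 1] != "\n":
            problems.append(
                f"no newline after {match.group(0)} at offset {match.start()}"
            )
    return problems
```

Both functions return a list of problem descriptions, so an empty list means the input passed the corresponding check.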

Regarding Benchmark Scores: Generally, you shouldn't worry too much about them, as people can always train specifically to achieve good results. We mainly use them as a smoke test, a quick check to ensure no major regressions have occurred. In fact, if you actually read through the benchmark questions themselves, you'll often find yourself chuckling at how inane, low-quality, or even downright silly they are.

Regarding Training: The final released version was produced by merging multiple candidate models in an attempt to improve performance. However, we were unable to conclusively determine whether this was effective. Excluding candidate versions, an efficient naïve fine-tuning run should be achievable within one day on 16 nodes of 8*A100-80G. Based on this, we estimate the carbon emissions to be 700 kg CO2 eq.
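As a rough sanity check on the training figures above, the arithmetic can be spelled out. The sketch takes "within one day" as a full 24 hours, which is an assumption and an upper bound. It derives only the GPU-hours and the emission rate implied by the 700 kg estimate, and does not attempt to reproduce how that estimate was actually calculated.

```python
# Figures taken from the paragraph above.
NODES = 16
GPUS_PER_NODE = 8
HOURS = 24  # "within one day", taken here as a full 24 hours (assumption)
ESTIMATED_KG_CO2EQ = 700

# Total accelerator time for one fine-tuning run.
gpu_hours = NODES * GPUS_PER_NODE * HOURS

# Emission rate implied by the stated estimate, per GPU-hour.
implied_kg_per_gpu_hour = ESTIMATED_KG_CO2EQ / gpu_hours

print(f"GPU-hours: {gpu_hours}")
print(f"Implied kg CO2 eq per GPU-hour: {implied_kg_per_gpu_hour:.3f}")
```

Under these assumptions the run amounts to 3072 GPU-hours, so the estimate implies roughly 0.23 kg CO2 eq per GPU-hour.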

Disclaimer: Please note that the model was trained on unfiltered internet data. Since we do not have the capacity to vet all of it, it may contain a substantial amount of objectionable content, including pornography, violence, and offensive language, that we are unable to remove. You will therefore still need to run your own safety checks on the model and filter keywords in its output. Due to computational resource constraints, we are presently unable to apply RLHF for the model's ethics and safety, or to perform restrictive fine-tuning on SFT samples that refuse to answer certain questions.

For English Users: This model was not trained on meaningless logical riddles like those "strawberry questions" (these are case-by-case data optimizations, unseen during the pre-training phase). Training on them has no value beyond creating a spectacle. The model focuses more on utilizing the content within the pre-training corpus, rather than solely on artificial optimizations introduced during the SFT stage for specific tasks.

Seeking Unconditional Sponsorship: Training and synthesizing datasets can be expensive. While we cannot disclose more details about the cost budget, we can theoretically analyze the example of synthesizing and self-verifying the dataset used to train this model, which involved 120M entries synthesized from 20B tokens. The nominal cost of data synthesis and self-verification using a commercial model API could be as high as $3M, while the nominal cost using local model inference, measured in GPU time, could still reach up to $0.1M. We are actively training larger parameter models and scaling up data synthesis, and are seeking substantial compute resources and generous unconditional grants. While this is for the purpose of commercial exploration and technology selection, we are currently under no immediate pressure to generate profit and remain committed to sharing more with the open-source community.

miniG (迷你G)

Text-Only Weight

GGML for ChatGLM.cpp (recommended): https://github.com/li-plus/chatglm.cpp

GGUF (text-only, not recommended): Performance degrades significantly, even with F16.

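Update: We have uploaded a new "alt" version of the model, which is trained with masked context. This version is intended to reduce overfitting and provide a more objective performance. The model weights in the main branch of the repository are trained directly on SFT data, while the alt branch is trained with the masked context of the raw text used to synthesize the provided data. The alt version exhibits better stability in some cases, with less overfitting. However, due to the lack of external context, it may have limitations in knowledge retention and hallucination.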

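Hint: How can I check whether my inference parameters and quantized inference are performing well? You can try having the model recite 《背影》 by Zhu Ziqing (which is a public domain text). You should expect it to recite the entire text accurately, including the formatting and line breaks.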

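A model trained on a synthetic dataset of over 120 million entries. The dataset was generated by applying state-of-the-art language models with large context windows, combined with methods similar to retrieval-augmented generation and knowledge graph integration. Data synthesis is conducted within clusters extracted from a pretraining corpus of 20 billion tokens, and is subsequently validated by the model itself.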

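The model is not fully aligned with human preferences, and it is under no obligation to cater to poorly constructed prompts or the clichés found in common benchmarks. Bonus: An implementation of a Vision Language Model that has undergone Locked-Image Tuning is included.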

Supported Input Modalities: text, image. For the text-only weights, use the branch revision=text-only at https://huggingface.co/CausalLM/miniG/tree/text-only . GGUF for the text-only weights should work after the merge of PR #9194.

Context Window: 1M tokens

Model Parameters: LLM - 9B (initialized from THUDM/glm-4-9b-chat-1m); Optional ViT - 5B.

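Cautionary Notes: It is strongly recommended to use a standard inference implementation, such as Hugging Face Transformers, to avoid the significant performance degradation that may occur with accelerated implementations like vllm or lmdeploy, not to mention the potentially catastrophic effects of model quantization. At present, these accelerated inference implementations are known to severely compromise vision inference, although their impact on pure text performance is smaller.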

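Inference Parameters: Our observations suggest that, to reduce hallucinations, it is advisable to sample with top_p=0.8 and then set the temperature to 0.3, or to use pure temperature sampling at 0.2. In general, this model requires a lower temperature than similar models, which we tentatively attribute to overfitting on the vast dataset. For inference code, refer to THUDM/glm-4-9b-chat-1m and THUDM/glm-4v-9b. We only guarantee best performance when using transformers for inference. In our testing we also used lmdeploy, which led to a significant performance drop for multimodal input.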

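Regarding Formatting: We strongly recommend you double-check your input to ensure:

1. The system prompt is not empty. Even something as simple as "You are a helpful assistant." is expected.
2. There is always a newline character after the <|role|> tag.

This will help ensure proper parsing and processing of your input.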

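Regarding Benchmark Scores: Generally, you should not care too much about these scores, as people can always train specifically to achieve good results. We mainly use them as a smoke test, a quick check to ensure no major regressions have occurred. In fact, if you actually read the benchmark questions themselves, you will often find yourself laughing at how dull, low-quality, or even absurd they are.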

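Regarding Training: The final released version was produced by merging multiple candidate models in an attempt to improve performance. However, we could not determine whether this approach was actually effective. Excluding candidate versions and merge experiments, an efficient naïve fine-tuning run should be achievable within one day on 16 nodes, each equipped with 8 A100-80G GPUs. Based on this, we estimate the carbon emissions to be 700 kg CO2 eq.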

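Disclaimer: Please note that the model was trained on unfiltered internet data. Since we are unable to screen all of the data, it may still contain a substantial amount of inappropriate content, ranging from explicit material to violence and offensive language, that we cannot remove. You must therefore run your own safety checks on the model and apply keyword filtering to its output. Due to computational resource constraints, we are currently unable to apply reinforcement learning from human feedback (RLHF) for ethics and safety, or to perform restrictive fine-tuning on SFT samples to limit the model's ability to answer certain questions.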

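For Chinese Users: This model was not trained on meaningless logical riddles like those from "弱智吧" (these are case-by-case data optimizations, never seen during the pre-training phase). Training on them has no value beyond creating a spectacle. The model focuses more on utilizing the content within the pre-training corpus, rather than relying solely on artificial optimizations introduced during the SFT stage for specific tasks.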

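Seeking Unconditional Sponsorship: Training and synthesizing datasets can be very expensive. While we cannot disclose more details about the cost budget, we can theoretically analyze the example of synthesizing and self-verifying the dataset used to train this model, which contains 120 million entries synthesized from 20 billion tokens. The nominal cost of data synthesis and self-verification using a commercial model API could be as high as $3 million, while the nominal cost using local model inference, measured in GPU time, could still reach $100,000. We are actively training models with more parameters and scaling up data synthesis, and are seeking substantial compute resources and generous unconditional grants. Although this is for the purpose of commercial exploration and technology selection, we are currently under no immediate pressure to generate profit and remain committed to sharing more with the open-source community.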