---
license: other
license_name: cogvlm2
license_link: https://huggingface.co/THUDM/cogvlm2-llama3-chat-19B/blob/main/LICENSE
language:
- en
pipeline_tag: text-generation
tags:
- chat
- cogvlm2
inference: false
---

# VisionReward-Image

## Introduction

We present VisionReward, a general strategy for aligning visual generation models (both image and video generation) with human preferences through a fine-grained, multi-dimensional framework. We decompose human preferences in images and videos into multiple dimensions, each represented by a series of judgment questions whose answers are linearly weighted and summed into an interpretable and accurate score. To address the challenges of video quality assessment, we systematically analyze various dynamic features of videos, which helps VisionReward surpass VideoScore by 17.2% and achieve top performance in video preference prediction.

This repository hosts the VisionReward-Image model.

## Merging and Extracting Checkpoint Files

Use the following commands to merge the split files into a single `.tar` file and then extract it. As written, `tar` extracts into the current working directory; add `-C <dir>` to extract elsewhere.

```sh
cat ckpts/split_part_* > ckpts/visionreward_image.tar
tar -xvf ckpts/visionreward_image.tar
```

## Using this model

To install the Python package dependencies and run model inference, follow the instructions in our [GitHub repository](https://github.com/THUDM/VisionReward).

> This model uses fp32 parameters and must be loaded through the sat (SwissArmyTransformer) library. For the bf16 (bfloat16) version of the model, see [https://huggingface.co/THUDM/VisionReward-Image-bf16](https://huggingface.co/THUDM/VisionReward-Image-bf16).
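
To make the scoring scheme from the Introduction concrete, the following is a minimal sketch of how yes/no judgment answers could be linearly weighted and summed into one score. It is not the released implementation: the function name `vision_reward_score`, the question names, the weights, and the choice to encode "yes" as 1 and "no" as 0 are all assumptions made for illustration. The actual questions and learned weights are provided in the GitHub repository.

```python
from typing import Mapping

# Hypothetical question names and weights, for illustration only.
# The real checklist and learned weights live in the VisionReward repository.
EXAMPLE_WEIGHTS = {
    "is_sharp": 0.5,
    "is_well_composed": 0.25,
    "has_artifacts": -1.0,
}


def vision_reward_score(
    answers: Mapping[str, bool],
    weights: Mapping[str, float],
    bias: float = 0.0,
) -> float:
    """Linearly weight yes/no judgments and sum them into a single score.

    Assumption: a "yes" answer is encoded as 1.0 and a "no" answer as 0.0.
    """
    missing = set(weights) - set(answers)
    if missing:
        raise KeyError(f"no answer for questions: {sorted(missing)}")
    return bias + sum(
        weight * (1.0 if answers[question] else 0.0)
        for question, weight in weights.items()
    )


if __name__ == "__main__":
    example_answers = {
        "is_sharp": True,
        "is_well_composed": False,
        "has_artifacts": False,
    }
    print(vision_reward_score(example_answers, EXAMPLE_WEIGHTS))
```

Because the score is a plain weighted sum, each question's contribution can be read directly from its weight, which is what makes the result interpretable.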