import streamlit as st
from streamlit_extras.switch_page_button import switch_page
st.title("Llava-NeXT-Interleave")
st.success("""[Original tweet](https://twitter.com/mervenoyann/status/1813560292397203630) (July 17, 2024)""", icon="ℹ️")
st.markdown(""" """)
st.markdown("""The vision language model in this video has 0.5B parameters and can take in images, video and 3D data! 🤯
Llava-NeXT-Interleave is a new vision language model trained on interleaved image, video and 3D data. Keep reading ⥥⥥
""")
st.markdown(""" """)
st.video("pages/Llava-NeXT-Interleave/video_1.mp4", format="video/mp4")
st.markdown(""" """)
st.markdown("""This model comes in 0.5B, 7B and 7B-DPO variants, all of which can be used with Transformers.
[Collection of models](https://t.co/sZsaglSXa3) | [Demo](https://t.co/FbpaMWJY8k)
See how to use it below.
""")
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_1.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The authors of this paper explored training <a href='LLaVA-NeXT' target='_self'>LLaVA-NeXT</a> on interleaved data that spans multiple modalities, including images, video and 3D.
They found that interleaved data improves results across all benchmarks!
""", unsafe_allow_html=True)
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_2.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The model can transfer tasks from single-image to multi-image settings 🤯
The authors trained the model on single images and on code, yet the model can solve coding tasks that involve multiple images.
""")
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_3.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The same applies to other modalities; see below for video:
""")
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_4.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
The model also has document understanding capabilities and many real-world application areas.
""")
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_5.jpg", use_column_width=True)
st.markdown(""" """)
st.markdown("""
This release also comes with the dataset this model was fine-tuned on: [M4-Instruct-Data](https://t.co/rutXMtNC0I)
""")
st.markdown(""" """)
st.image("pages/Llava-NeXT-Interleave/image_6.jpg", use_column_width=True)
st.markdown(""" """)
st.info("""
Resources:
- [LLaVA-NeXT: Tackling Multi-image, Video, and 3D in Large Multimodal Models](https://llava-vl.github.io/blog/2024-06-16-llava-next-interleave/)
by Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li (2024)
[GitHub](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/inference/docs/LLaVA-NeXT-Interleave.md)
- [Transformers Documentation](https://huggingface.co/docs/transformers/en/model_doc/llava)
- [Demo](https://huggingface.co/spaces/merve/llava-next-interleave)
""")
st.markdown(""" """)
st.markdown(""" """)
st.markdown(""" """)
col1, col2, col3 = st.columns(3)
with col1:
    if st.button('Previous paper', use_container_width=True):
        switch_page("RT-DETR")
with col2:
    if st.button('Home', use_container_width=True):
        switch_page("Home")
with col3:
    if st.button('Next paper', use_container_width=True):
        switch_page("Chameleon")