vision_papers / pages /21_4M-21.py

merve HF Staff

Fix streamlit warning (#3)

595df7a verified 7 months ago


	import streamlit as st
	from streamlit_extras.switch_page_button import switch_page


	translations = {
	'en': {'title': '4M-21',
	'original_tweet':
	"""
	[Original tweet](https://twitter.com/mervenoyann/status/1804138208814309626) (June 21, 2024)
	""",
	'tweet_1':
	"""
	EPFL and Apple just released 4M-21: a single any-to-any model that can do anything from text-to-image generation to generating depth masks! 🙀
	Let's unpack 🧶
	""",
	'tweet_2':
	"""
	4M is a multimodal training [framework](https://t.co/jztLublfSF) introduced by Apple and EPFL.
	The resulting model takes image and text as input and outputs image and text 🤩
	[Models](https://t.co/1LC0rAohEl) | [Demo](https://t.co/Ra9qbKcWeY)
	""",
	'tweet_3':
	"""
	This model consists of a transformer encoder and decoder, where the key to multimodality lies in the input and output data:
	input and output tokens are decoded to generate bounding boxes, the generated image's pixels, captions and more!
	""",
	'tweet_4':
	"""
	This model also learnt to generate Canny edge maps, SAM edges and other things for steerable text-to-image generation 🖼️
	The authors only added image-to-all capabilities for the demo, but you can try to use this model for text-to-image generation as well ☺️
	""",
	'tweet_5':
	"""
	On the project page you can also see the model's text-to-image and steered generation capabilities, with the model's own outputs used as control masks!
	""",
	'ressources':
	"""
	Resources
	[4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities](https://arxiv.org/abs/2406.09406) by Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir (2024)
	[GitHub](https://github.com/apple/ml-4m/)
	"""
	},
	'fr': {
	'title': '4M-21',
	'original_tweet':
	"""
	[Tweet de base](https://twitter.com/mervenoyann/status/1804138208814309626) (en anglais) (21 juin 2024)
	""",
	'tweet_1':
	"""
	L'EPFL et Apple viennent de publier 4M-21 : un modèle unique qui peut tout faire, de la génération texte-à-image à la génération de masques de profondeur ! 🙀
	Détaillons tout ça 🧶
	""",
	'tweet_2':
	"""
	4M est un [framework](https://t.co/jztLublfSF) d'entraînement multimodal introduit par Apple et l'EPFL.
	Le modèle résultant prend une image et un texte et produit une image et un texte 🤩
	[Modèles](https://t.co/1LC0rAohEl) | [Demo](https://t.co/Ra9qbKcWeY)
	""",
	'tweet_3':
	"""
	Ce modèle se compose d'un transformer encodeur-décodeur, où la clé de la multimodalité réside dans les données d'entrée et de sortie :
	les tokens d'entrée et de sortie sont décodés pour générer des boîtes de délimitation, les pixels de l'image, les légendes, etc. !
	""",
	'tweet_4':
	"""
	Ce modèle a aussi appris à générer des filtres de Canny, des bordures SAM et plein d'autres choses pour tout ce qui est pilotage de la génération d'images à partir de textes 🖼️
	Les auteurs n'ont ajouté que des capacités image-vers-tout pour la démo, mais vous pouvez essayer d'utiliser ce modèle pour la génération texte-image également ☺️ """,
	'tweet_5':
	"""
	Dans la page du projet, vous pouvez également voir les capacités du modèle en matière de texte vers image et de génération dirigée avec les propres sorties du modèle en tant que masques de contrôle ! """,
	'ressources':
	"""
	Ressources :
	[4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities](https://arxiv.org/abs/2406.09406) de Roman Bachmann, Oğuzhan Fatih Kar, David Mizrahi, Ali Garjani, Mingfei Gao, David Griffiths, Jiaming Hu, Afshin Dehghan, Amir Zamir (2024)
	[GitHub](https://github.com/apple/ml-4m/)
	"""
	}
	}


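	def language_selector():
	    # Map the selectable language codes to the flag shown in the dropdown
	    languages = {'EN': '🇬🇧', 'FR': '🇫🇷'}
	    # A non-empty label, hidden via label_visibility, avoids Streamlit's empty-label warning
	    selected_lang = st.selectbox('Language', options=list(languages.keys()), format_func=lambda x: languages[x], key='lang_selector', label_visibility='collapsed')
	    return 'en' if selected_lang == 'EN' else 'fr'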

	left_column, right_column = st.columns([5, 1])

	# Add a selector to the right column
	with right_column:
	    lang = language_selector()

	# Add a title to the left column
	with left_column:
	    st.title(translations[lang]["title"])

	st.success(translations[lang]["original_tweet"], icon="ℹ️")
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_1"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/4M-21/image_1.jpg", use_container_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_2"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.video("pages/4M-21/video_1.mp4", format="video/mp4")
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_3"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/4M-21/image_2.jpg", use_container_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_4"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.image("pages/4M-21/image_3.jpg", use_container_width=True)
	st.markdown(""" """)

	st.markdown(translations[lang]["tweet_5"], unsafe_allow_html=True)
	st.markdown(""" """)

	st.video("pages/4M-21/video_2.mp4", format="video/mp4")
	st.markdown(""" """)

	st.info(translations[lang]["ressources"], icon="📚")

	st.markdown(""" """)
	st.markdown(""" """)
	st.markdown(""" """)
	# Navigation buttons: previous paper, home, next paper
	col1, col2, col3 = st.columns(3)
	with col1:
	    if lang == "en":
	        if st.button('Previous paper', use_container_width=True):
	            switch_page("Florence-2")
	    else:
	        if st.button('Papier précédent', use_container_width=True):
	            switch_page("Florence-2")
	with col2:
	    if lang == "en":
	        if st.button("Home", use_container_width=True):
	            switch_page("Home")
	    else:
	        if st.button("Accueil", use_container_width=True):
	            switch_page("Home")
	with col3:
	    if lang == "en":
	        if st.button("Next paper", use_container_width=True):
	            switch_page("RT-DETR")
	    else:
	        if st.button("Papier suivant", use_container_width=True):
	            switch_page("RT-DETR")
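
The three navigation columns above repeat the same branch for each language. As a minimal sketch (not part of the original page), the labels and target pages could live in a small table. `NAV_LABELS`, `NAV_TARGETS` and `nav_buttons` are hypothetical names introduced only for illustration; the label strings and page names are the ones used in the file.

```python
# Sketch only: NAV_LABELS, NAV_TARGETS and nav_buttons are hypothetical
# helpers, not part of the original page.
NAV_LABELS = {
    "en": ("Previous paper", "Home", "Next paper"),
    "fr": ("Papier précédent", "Accueil", "Papier suivant"),
}

# Target pages for the previous / home / next buttons, in column order.
NAV_TARGETS = ("Florence-2", "Home", "RT-DETR")


def nav_buttons(lang):
    """Return (label, target_page) pairs for the three navigation columns.

    Any language other than "en" falls back to French, mirroring the
    if/else branches in the page itself.
    """
    labels = NAV_LABELS["en" if lang == "en" else "fr"]
    return list(zip(labels, NAV_TARGETS))


if __name__ == "__main__":
    for label, target in nav_buttons("fr"):
        print(f"{label} -> {target}")
```

In the page, each pair could then feed one column: call `st.button(label, use_container_width=True)` and, when it returns true, `switch_page(target)`. The explicit branches behave the same way; the table just keeps the labels in one place.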