The training scripts make it very clear how to train on interleaved images and text by adding the <image> token. However, it's not clear how to do the same thing at inference time.
<image>