jw2yang phanerozoic committed on
Commit ab5f2c3 · verified · 1 Parent(s): 422780d

Update README.md (#4)


- Update README.md (9895fa5fbcdfb3aee161b3c33b75d4213edf7ab7)


Co-authored-by: Charles Norton <[email protected]>

Files changed (1)
  1. README.md +6 -6
README.md CHANGED
@@ -209,7 +209,7 @@ In addition to the text-related preprocessing, we mainly undertake the following

  * UI Grounding and Navigation Data: For each UI screenshot, we extract the bounding boxes for the UI elements, and apply [Set-of-Mark Prompting](https://arxiv.org/abs/2310.11441) to overlay numeric marks on the raw image. The model is trained to generate the UI grounding text based on the image and the Set-of-Mark prompts.

- * Instruction Video Data: For each video clip, we apply [Co-Tracker](https://co-tracker.github.io/) to extract the grid traces and then apply filtering algorithm to remove the noisy or staic points. For videos that bear camera motion, we further apply homography transformation to stabilize the video clips. In tne end, we assign a numeric mark for each trace which gives us a set of trace-of-mark. The model is trained to generate the trace-of-mark given the video clips and instructional text.
+ * Instruction Video Data: For each video clip, we apply [Co-Tracker](https://co-tracker.github.io/) to extract the grid traces and then apply filtering algorithm to remove the noisy or static points. For videos that bear camera motion, we further apply homography transformation to stabilize the video clips. In the end, we assign a numeric mark for each trace which gives us a set of trace-of-mark. The model is trained to generate the trace-of-mark given the video clips and instructional text.

  * Robotics Manipulation Data: For robotics data in Open-X Embodiment, we extract the 7 DoF robot gripper state and also extract the trace-of-mark from the video clips. Similar filtering and stabilization steps are applied to the video clips. The model is trained to generate the robot manipulation action as well as the trace-of-mark given the video clips and instructional text.

@@ -309,7 +309,7 @@ Zero-shot evaluation on agentic intelligence. We report the results for pretrain
  * Language Model: We use [Meta LLama-3](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the backbone LLM.
  * Vision Encoder: We use [CLIP-ConvneXt-XXLarge](https://huggingface.co/laion/CLIP-convnext_xxlarge-laion2B-s34B-b82K-augreg) trained by LAION team as the vision encoder to tokenize the images and videos.

- The whole pipeline follows the common practice in the multimodal LLMs, where the vision encoder is used to tokenize the images and videos, and then the visual tokens are fed into the LLM along with the textal tokens to generate the text outputs.
+ The whole pipeline follows the common practice in the multimodal LLMs, where the vision encoder is used to tokenize the images and videos, and then the visual tokens are fed into the LLM along with the textual tokens to generate the text outputs.


  ### Compute Infrastructure
@@ -337,14 +337,14 @@ Our model is built based on:
  * [Transformers](https://huggingface.co/transformers/)
  * [TorchVision](https://pytorch.org/vision/stable/index.html)
  * [DeepSpeed](https://www.deepspeed.ai/)
- * [FlashAttenton](https://github.com/HazyResearch/flash-attention)
+ * [FlashAttention](https://github.com/HazyResearch/flash-attention)


  ## Intended Uses

  <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

- This model is intended for broad research use in English. It is designed only for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, particularly in mutimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.
+ This model is intended for broad research use in English. It is designed only for research purposes and aimed at knowledge-sharing and accelerating research in multimodal AI, particularly in multimodal agentic AI. It is intended to be used by domain experts who are independently capable of evaluating the quality of outputs before acting on them.

  ### Direct Use

@@ -352,7 +352,7 @@ This model is intended for broad research use in English. It is designed only fo

  The model takes images and text as inputs, and produces the textual outputs for the following uses:

- * **Image/Video-Conditoned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.
+ * **Image/Video-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.

  * **Visual Planning Capabilities:** The model can also produce the visual trace as the future planning to accomplish a task (e.g., move object from one place to another).

@@ -380,7 +380,7 @@ The model can be further finetuned for different downstream tasks, such as:

  * **UI Navigation:** We can finetune this model for specific UI navigation tasks, such as web navigation or mobile navigation. The model can achieve superior performance on these tasks.

- * **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperms the state-of-the-art models such as OpenVLA on robotics manipulation tasks.
+ * **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperforms the state-of-the-art models such as OpenVLA on robotics manipulation tasks.


  ## Bias, Risks, and Limitations
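
The Instruction Video Data bullet in the first hunk describes extracting grid traces with Co-Tracker, dropping noisy or static points, and stabilizing traces with a homography when the camera moves. A minimal sketch of that kind of post-processing, assuming point tracks are already available as a (frames, points, 2) array; the RANSAC homography, the threshold, and the function name are illustrative assumptions, not the released Magma preprocessing code:

```python
# Illustrative only: drop static grid traces and undo camera motion with a
# RANSAC homography, in the spirit of the "Instruction Video Data" bullet above.
# Thresholds and names are assumptions, not the Magma preprocessing pipeline.
import numpy as np
import cv2


def stabilize_and_filter_traces(tracks: np.ndarray, min_motion: float = 5.0) -> np.ndarray:
    """tracks: (T, N, 2) float array of N grid points tracked over T frames,
    e.g. from a point tracker such as Co-Tracker. Returns traces expressed in
    frame-0 coordinates, keeping only points that still move after stabilization."""
    T, N, _ = tracks.shape
    stabilized = [tracks[0].astype(np.float32)]
    for t in range(1, T):
        # Fit a frame-t -> frame-0 homography; RANSAC treats moving foreground
        # points as outliers, so the fit is dominated by the static background.
        H, _ = cv2.findHomography(tracks[t].astype(np.float32),
                                  tracks[0].astype(np.float32), cv2.RANSAC, 3.0)
        warped = cv2.perspectiveTransform(
            tracks[t].reshape(-1, 1, 2).astype(np.float32), H).reshape(N, 2)
        stabilized.append(warped)
    stabilized = np.stack(stabilized)                              # (T, N, 2)
    displacement = np.linalg.norm(stabilized[-1] - stabilized[0], axis=-1)
    return stabilized[:, displacement > min_motion]                # trace-of-mark candidates


# Toy check: an 8x8 grid tracked over 16 frames where only 10 points actually move.
grid = np.stack(np.meshgrid(np.arange(8), np.arange(8)), -1).reshape(-1, 2) * 32.0
tracks = np.repeat(grid[None], 16, axis=0)
tracks[:, :10] += np.linspace(0, 20, 16)[:, None, None]
print(stabilize_and_filter_traces(tracks).shape)                   # (16, 10, 2)
```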
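
The pipeline sentence in the second hunk (the vision encoder tokenizes images and videos, and the visual tokens are fed into the LLM together with the text tokens) is the common LLaVA-style wiring. A self-contained sketch with toy stand-ins for the CLIP-ConvNeXt-XXLarge encoder and the Llama-3 backbone; every module, dimension, and name below is an illustrative assumption rather than Magma's implementation:

```python
# Illustrative sketch of the LLaVA-style multimodal pipeline described above.
import torch
import torch.nn as nn


class ToyVisionEncoder(nn.Module):
    """Stand-in for a CLIP-style encoder: image -> sequence of visual tokens."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=14, stride=14)

    def forward(self, images):                      # (B, 3, 224, 224)
        feats = self.patchify(images)               # (B, dim, 16, 16)
        return feats.flatten(2).transpose(1, 2)     # (B, 256, dim) visual tokens


class ToyMultimodalLM(nn.Module):
    """Visual tokens are projected into the LLM embedding space and prepended
    to the text token embeddings before the (stand-in) language model."""
    def __init__(self, vocab: int = 32000, lm_dim: int = 512, vis_dim: int = 1024):
        super().__init__()
        self.vision = ToyVisionEncoder(dim=vis_dim)
        self.projector = nn.Linear(vis_dim, lm_dim)          # vision -> LLM space
        self.embed = nn.Embedding(vocab, lm_dim)
        layer = nn.TransformerEncoderLayer(lm_dim, nhead=8, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for Llama-3
        self.head = nn.Linear(lm_dim, vocab)

    def forward(self, images, input_ids):
        vis_tokens = self.projector(self.vision(images))      # (B, 256, lm_dim)
        txt_tokens = self.embed(input_ids)                     # (B, T, lm_dim)
        seq = torch.cat([vis_tokens, txt_tokens], dim=1)       # joint token sequence
        return self.head(self.lm(seq))                         # next-token logits


logits = ToyMultimodalLM()(torch.randn(1, 3, 224, 224), torch.randint(0, 32000, (1, 16)))
print(logits.shape)  # torch.Size([1, 272, 32000])
```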