LAPA: Latent Action Pretraining from Videos

---
license: mit
language:
- en
pipeline_tag: image-text-to-text
tags:
- jax
- robotics
widget:
- messages:
  - role: user
    content: >-
      Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many
      sisters does Sally have?
library_name: transformers
base_model:
- LargeWorldModel/LWM-Chat-1M-Jax
---

<h1 align="center">LAPA: Latent Action Pretraining from Videos</h1>
<p align="center">
<a href="">Hugging Face</a>&nbsp; | &nbsp;<a href="">Paper</a>&nbsp; | &nbsp;<a href="">GitHub</a>
</p>

- LAPA is the **first unsupervised approach** for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels.

- LAPA outperforms the current state-of-the-art VLA model trained with ground-truth actions, setting a new **state of the art for VLA models**.

- LAPA achieves over **30x** greater pretraining efficiency compared to conventional VLA pretraining.

## Model Summary

- **Developed by:** The LAPA team consisting of researchers from KAIST, UW, Microsoft, NVIDIA, and AI2.
- **Model type:** Vision-language-action (language, image => robot actions)
- **Language(s) (NLP):** en
- **License:** MIT
- **Finetuned from:** [`LWM-Chat-1M-Jax`](https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax), a VLM trained from:
  + **Vision Backbone**: VQGAN
  + **Language Model**: Llama-2
- **Pretraining Dataset:** [Open X-Embodiment](https://robotics-transformer-x.github.io/)
- **Repository:** 
- **Paper:** 
- **Project Page & Videos:** 

### Primary Use Cases

Our model is designed to accelerate research on unsupervised methods for building vision-language-action models, and to serve as a building block for features powered by generative AI.

### Use Case Considerations

Our models are not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of multimodal language models as they select use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly in high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.

***Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.*** 

## Usage


### Latent Inference

To analyze the output of the model, which is a sequence of latent actions (drawn from a latent action space of size 8^4), run the following commands:
```bash
conda create -n lapa python=3.10 -y
conda activate lapa
git clone https://github.com/LatentActionPretraining/LAPA.git
cd LAPA  # run the remaining steps from the root of the cloned repository
pip install -r requirements.txt
mkdir lapa_checkpoints && cd lapa_checkpoints
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/tokenizer.model
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/vqgan
wget https://huggingface.co/latent-action-pretraining/LAPA-7B-openx/resolve/main/params
cd ..
python -m latent_pretraining.inference
```
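As a rough illustration of the `8^4` figure above, the sketch below assumes it means four latent tokens, each taking one of eight values. That reading, and the names `CODEBOOK_SIZE`, `SEQ_LEN`, `encode`, and `decode`, are assumptions made for illustration; they are not taken from the LAPA code base.

```python
# Hypothetical illustration, not LAPA code.
# Assumption: one latent action = 4 tokens, each an integer in [0, 8).
CODEBOOK_SIZE = 8  # assumed number of values per token
SEQ_LEN = 4        # assumed number of tokens per latent action


def encode(tokens):
    """Pack SEQ_LEN tokens into a single index in [0, CODEBOOK_SIZE**SEQ_LEN)."""
    if len(tokens) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} tokens, got {len(tokens)}")
    index = 0
    for token in tokens:
        if not 0 <= token < CODEBOOK_SIZE:
            raise ValueError(f"token {token} is outside [0, {CODEBOOK_SIZE})")
        # Treat the tokens as digits of a base-8 number, most significant first.
        index = index * CODEBOOK_SIZE + token
    return index


def decode(index):
    """Inverse of encode: unpack an index back into SEQ_LEN tokens."""
    if not 0 <= index < CODEBOOK_SIZE ** SEQ_LEN:
        raise ValueError(f"index {index} is out of range")
    tokens = []
    for _ in range(SEQ_LEN):
        index, token = divmod(index, CODEBOOK_SIZE)
        tokens.append(token)
    return tokens[::-1]  # digits were collected least significant first


# Number of distinct latent actions under this assumption: 8**4.
num_latent_actions = CODEBOOK_SIZE ** SEQ_LEN

example_tokens = [3, 0, 7, 2]
packed = encode(example_tokens)
unpacked = decode(packed)
```

Under this assumption the latent action space has 4096 distinct entries, which is small enough to tabulate or histogram when inspecting what the model predicts.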

### Fine-tuning

Since the released checkpoint is trained with a latent pretraining objective, **the outputs are not real actions that are executable in the real world**. To make the model output executable actions, fine-tune it on a small set of trajectories that contain ground-truth actions (~150 trajectories), which maps the latent action space to the actual action space.

To fine-tune the model on SIMPLER, run the following command:
```bash
./scripts/finetune_simpler.sh
```

To fine-tune the model on a custom dataset, run the following commands:
```bash
python data/finetune_preprocess.py --input_path "/path_to_json_file" --output_filename "data/real_finetune.jsonl" --csv_filename "data/real_finetune.csv"
./scripts/finetune_real.sh
```

## Benchmarks

To assess its capabilities, we compare LAPA with a set of baseline models on a variety of benchmarks. The table below gives a high-level overview of model quality on representative tasks:

### Real-World Experiments


| Task           | Scratch | OpenVLA (Bridge) | ActionVLA (Bridge) | LAPA (Bridge) | OpenVLA (OpenX) | LAPA (OpenX) | LAPA (Sthv2) |
|----------------|---------|------------------|--------------------|---------------|-----------------|--------------|--------------|
| Knock          | 13.9    | 33.3             | 25.0               | 25.0          | 38.9            | 52.8         | 30.6         |
| Cover          | 38.7    | 42.3             | 47.8               | 42.4          | 38.6            | 51.7         | 47.9         |
| Pick and Place | 11.1    | 22.2             | 19.4               | 43.4          | 54.2            | 45.8         | 23.6         |
| Average        | 21.2    | 32.6             | 30.8               | 36.8          | 43.9            | 50.1         | 34.0         |
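The Average row can be checked against the per-task rows with a few lines of Python. The sketch below copies the numbers from the table above and assumes the average is an unweighted mean over the three tasks; the variable names are illustrative only.

```python
from statistics import mean

# Per-task scores copied from the table above, in the order
# [Knock, Cover, Pick and Place].
scores = {
    "Scratch": [13.9, 38.7, 11.1],
    "OpenVLA (Bridge)": [33.3, 42.3, 22.2],
    "ActionVLA (Bridge)": [25.0, 47.8, 19.4],
    "LAPA (Bridge)": [25.0, 42.4, 43.4],
    "OpenVLA (OpenX)": [38.9, 38.6, 54.2],
    "LAPA (OpenX)": [52.8, 51.7, 45.8],
    "LAPA (Sthv2)": [30.6, 47.9, 23.6],
}

# Average row as reported in the table.
reported = {
    "Scratch": 21.2,
    "OpenVLA (Bridge)": 32.6,
    "ActionVLA (Bridge)": 30.8,
    "LAPA (Bridge)": 36.8,
    "OpenVLA (OpenX)": 43.9,
    "LAPA (OpenX)": 50.1,
    "LAPA (Sthv2)": 34.0,
}

# Assumption: the Average row is an unweighted mean of the three task scores.
recomputed = {name: mean(values) for name, values in scores.items()}

# Model with the highest recomputed average.
best = max(recomputed, key=recomputed.get)

for name in scores:
    print(f"{name:20s} recomputed={recomputed[name]:.1f} reported={reported[name]:.1f}")
print("best average:", best)
```

A recomputed value can differ from the reported one in the last digit, most likely because the reported averages were computed from unrounded per-task scores.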


## Training

### Model
|                     |     |
|---------------------|-----| 
| Developer           | LAPA Team |
| Architecture        | LAPA has 7B parameters where the base architecture is from [Large-World-Model](https://huggingface.co/LargeWorldModel/LWM-Chat-1M-Jax). The model consists of a pretrained LLaMA-2 language model and a VQGAN vision encoder. |
| Inputs              | Text and Image |
| Context length      | 4K tokens |
| GPUs                | 8 H100-80G |
| Training time       | 34 hours |
| Training data       | 7.0B tokens |
| Outputs             | Generated latent actions in response to the input |
| Dates               | Trained in Sep 2024 |
| Status              | This is a static model trained on an offline dataset (Open X-Embodiment) of publicly available data. Future versions of the tuned models may be released as we improve them. |
| Supported languages | English |
| Release date        | Oct 2024 |
| License             | MIT |
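
The compute figures in the table can be combined into a total GPU-hour budget and an approximate throughput. The sketch below assumes the 34 hours are wall-clock time with all 8 GPUs in use, and that 7.0B is the total number of tokens processed during that run; neither assumption is stated in the table, and the variable names are illustrative.

```python
# Figures copied from the table above.
NUM_GPUS = 8             # 8 H100-80G
TRAINING_HOURS = 34      # assumed to be wall-clock hours on all GPUs
TRAINING_TOKENS = 7.0e9  # assumed to be total tokens processed in the run

# Total compute budget in GPU-hours.
gpu_hours = NUM_GPUS * TRAINING_HOURS

# Approximate throughput under the assumptions above.
tokens_per_gpu_hour = TRAINING_TOKENS / gpu_hours
tokens_per_gpu_second = tokens_per_gpu_hour / 3600

print(f"GPU-hours: {gpu_hours}")
print(f"tokens per GPU-hour: {tokens_per_gpu_hour:,.0f}")
print(f"tokens per GPU-second: {tokens_per_gpu_second:,.0f}")
```

Under these assumptions the pretraining run amounts to 272 H100 GPU-hours.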

### Training Datasets
Our training data comes from the [Open X-Embodiment dataset](https://arxiv.org/abs/2310.08864). From the full dataset, we use a mixture of subsets similar to the one used by [OpenVLA](https://arxiv.org/abs/2406.09246).

## Responsible AI Considerations
The LAPA model can potentially behave in ways that are unfair, unreliable, or offensive. Some of the limiting behaviors to be aware of include:  
* Quality of Service: LAPA is trained without using any ground-truth action labels during pretraining. Therefore, it might fall short on complex tasks that require fine-grained motion planning.    
* Inappropriate or Offensive Content: Since the model is based on a vision-language model, it inherits the limitations of the backbone model. This model may produce other types of inappropriate or offensive content, which may make it inappropriate to deploy for sensitive contexts without additional mitigations that are specific to the use case. 
* Information Reliability: The latent actions generated by the model may not be accurate, since the model is trained on a limited amount of pretraining data.

Developers should apply responsible AI best practices and are responsible for ensuring that a specific use-case complies with relevant laws and regulations (e.g. privacy, trade, etc.). Important areas for consideration include: 
* High-Risk Scenarios: Developers should assess suitability of using models in high-risk scenarios where unfair, unreliable or offensive outputs might be extremely costly or lead to harm. This includes the model showing unintended behavior after fine-tuning. Additional safeguards should be implemented at the application level according to the deployment context. 
* Generation of Harmful Content: Developers should assess outputs for their context and use available safety classifiers or custom solutions appropriate for their use case. 
* Misuse: Other forms of misuse such as fraud, spam, or malware production may be possible, and developers should ensure that their applications do not violate applicable laws and regulations.
* Copyrighted content: The model might generate content that infringes on copyright protections. Developers should implement measures to detect and filter copyrighted material, and end-users should be informed about the potential for unintended copyright violations and the importance of verifying original sources to avoid legal complications.
  
## License
The model is licensed under the [MIT license](./LICENSE).