happyme531 committed
Commit 50704de
1 Parent(s): fe8d810

Upload 12 files

.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ sam2.1_hiera_large_encoder.rknn filter=lfs diff=lfs merge=lfs -text
+ sam2.1_hiera_small_encoder.rknn filter=lfs diff=lfs merge=lfs -text
+ sam2.1_hiera_tiny_encoder.rknn filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,158 @@
- ---
- license: agpl-3.0
- ---
+ # Segment Anything 2.1 RKNN2
+
+ ## (English README below)
+
+ 在RK3588上运行强大的Segment Anything 2.1图像分割模型!
+
+ - 推理速度(RK3588):
+   - Encoder(Tiny)(单NPU核): 3s
+   - Encoder(Small)(单NPU核): 3.5s
+   - Encoder(Large)(单NPU核): 12s
+   - Decoder(CPU): 0.1s
+
+ - 内存占用(RK3588):
+   - Encoder(Tiny): 0.95GB
+   - Encoder(Small): 1.1GB
+   - Encoder(Large): 4.1GB
+   - Decoder: 非常小, 可以忽略不计
+
+ ## 使用方法
+
+ 1. 克隆或者下载此仓库到本地. 模型较大, 请确保有足够的磁盘空间.
+
+ 2. 安装依赖
+
+ ```bash
+ pip install "numpy<2" pillow matplotlib opencv-python onnxruntime rknn-toolkit-lite2
+ ```
+
+ 3. 运行
+
+ ```bash
+ python test_rknn.py
+ ```
+
+ 你可以修改`test_rknn.py`中这一部分
+ ```python
+ def main():
+     # 1. 加载原始图片
+     path = "dog.jpg"
+     orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
+     decoder_path = "sam2.1_hiera_small_decoder.onnx"
+     encoder_path = "sam2.1_hiera_small_encoder.rknn"
+     ...
+ ```
+
+ 来测试不同的模型和图片. 注意, 和SAM1不同, 这里的encoder和decoder必须使用同一个版本的模型.
+
+ ## 模型转换
+
+ 1. 安装依赖
+
+ ```bash
+ pip install "numpy<2" onnxslim onnxruntime rknn-toolkit2 sam2
+ ```
+
+ 2. 下载SAM2.1的pt模型文件. 可以从[这里](https://github.com/facebookresearch/sam2?tab=readme-ov-file#model-description)下载.
+
+ 3. 转换pt模型到onnx模型. 以Tiny模型为例:
+
+ ```bash
+ python ./export_onnx.py --model_type sam2.1_hiera_tiny --checkpoint ./sam2.1_hiera_tiny.pt --output_encoder ./sam2.1_hiera_tiny_encoder.onnx --output_decoder sam2.1_hiera_tiny_decoder.onnx
+ ```
+
+ 4. 将onnx模型转换为rknn模型. 以Tiny模型为例:
+
+ ```bash
+ python ./convert_rknn.py sam2.1_hiera_tiny
+ ```
+
+ 如果在常量折叠时报错, 请尝试更新onnxruntime到最新版本.
+
+ ## 已知问题
+
+ - 只实现了图片分割, 没有实现视频分割.
+ - 由于RKNN-Toolkit2的问题, decoder模型在转换时会报错, 暂时需要使用CPU onnxruntime运行, 会略微增加CPU占用.
+
+ ## 参考
+
+ - [samexporter/export_sam21_cvat.py](https://github.com/hashJoe/samexporter/blob/cvat/samexporter/export_sam21_cvat.py)
+ - [SAM 2](https://github.com/facebookresearch/sam2)
+
+ ## English README
+
+ Run the powerful Segment Anything 2.1 image segmentation model on RK3588!
+
+ - Inference speed (RK3588):
+   - Encoder (Tiny, single NPU core): 3 s
+   - Encoder (Small, single NPU core): 3.5 s
+   - Encoder (Large, single NPU core): 12 s
+   - Decoder (CPU): 0.1 s
+
+ - Memory usage (RK3588):
+   - Encoder (Tiny): 0.95 GB
+   - Encoder (Small): 1.1 GB
+   - Encoder (Large): 4.1 GB
+   - Decoder: negligible
+
+ ## Usage
+
+ 1. Clone or download this repository. The models are large, so make sure you have enough disk space.
+
+ 2. Install the dependencies
+
+ ```bash
+ pip install "numpy<2" pillow matplotlib opencv-python onnxruntime rknn-toolkit-lite2
+ ```
+
+ 3. Run
+
+ ```bash
+ python test_rknn.py
+ ```
+
+ You can modify this part of `test_rknn.py`
+ ```python
+ def main():
+     # 1. Load the original image
+     path = "dog.jpg"
+     orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
+     decoder_path = "sam2.1_hiera_small_decoder.onnx"
+     encoder_path = "sam2.1_hiera_small_encoder.rknn"
+     ...
+ ```
+
+ to test different models and images. Note that unlike SAM1, the encoder and decoder must come from the same model variant.
+
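+ For example, to try the Tiny variant shipped in this repository (a minimal sketch; only these two assignments change, the rest of `main()` stays as-is):
+
+ ```python
+     # Both files must come from the same SAM 2.1 variant (tiny/small/large).
+     decoder_path = "sam2.1_hiera_tiny_decoder.onnx"  # decoder, runs on the CPU via onnxruntime
+     encoder_path = "sam2.1_hiera_tiny_encoder.rknn"  # encoder, runs on the NPU via RKNNLite
+ ```
+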
+ ## Model Conversion
+
+ 1. Install the dependencies
+
+ ```bash
+ pip install "numpy<2" onnxslim onnxruntime rknn-toolkit2 sam2
+ ```
+
+ 2. Download the SAM2.1 .pt checkpoint files. They can be downloaded from [here](https://github.com/facebookresearch/sam2?tab=readme-ov-file#model-description).
+
+ 3. Convert the .pt checkpoint to ONNX models. Taking the Tiny model as an example:
+
+ ```bash
+ python ./export_onnx.py --model_type sam2.1_hiera_tiny --checkpoint ./sam2.1_hiera_tiny.pt --output_encoder ./sam2.1_hiera_tiny_encoder.onnx --output_decoder sam2.1_hiera_tiny_decoder.onnx
+ ```
+
+ 4. Convert the ONNX models to RKNN. Taking the Tiny model as an example:
+
+ ```bash
+ python ./convert_rknn.py sam2.1_hiera_tiny
+ ```
+
+ If you encounter errors during constant folding, try updating onnxruntime to the latest version.
+
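+ For example, upgrading in place with pip (assuming a pip-managed environment):
+
+ ```bash
+ pip install -U onnxruntime
+ ```
+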
+ ## Known Issues
+
+ - Only image segmentation is implemented; video segmentation is not supported.
+ - Due to an issue in RKNN-Toolkit2, converting the decoder model currently fails, so the decoder has to run on the CPU through onnxruntime, which slightly increases CPU usage.
+
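+ As a rough sketch of the current workaround (this mirrors what `test_rknn.py` in this repository already does; preparing `image_embed`, the high-res features and the point inputs is omitted here), the encoder outputs from the NPU are simply fed to the ONNX decoder on the CPU:
+
+ ```python
+ import onnxruntime
+
+ # The decoder stays in ONNX and runs on the CPU execution provider.
+ decoder = onnxruntime.InferenceSession(
+     "sam2.1_hiera_small_decoder.onnx", providers=["CPUExecutionProvider"]
+ )
+ low_res_masks, iou_predictions = decoder.run(None, {
+     "image_embed": image_embed,            # from the RKNN encoder
+     "high_res_feats_0": high_res_feats_0,  # from the RKNN encoder
+     "high_res_feats_1": high_res_feats_1,  # from the RKNN encoder
+     "point_coords": point_coords,
+     "point_labels": point_labels,
+     "mask_input": mask_input,
+     "has_mask_input": has_mask_input,
+ })
+ ```
+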
+ ## References
+
+ - [samexporter/export_sam21_cvat.py](https://github.com/hashJoe/samexporter/blob/cvat/samexporter/export_sam21_cvat.py)
+ - [SAM 2](https://github.com/facebookresearch/sam2)
convert_rknn.py ADDED
@@ -0,0 +1,98 @@
+ #!/usr/bin/env python
+ # coding: utf-8
+
+ import datetime
+ import argparse
+ from rknn.api import RKNN
+ from sys import exit
+ import os
+ import onnxslim
+
+ # Dynamic-shape candidates for the decoder inputs (one point, one label)
+ num_pointss = [1]
+ num_labelss = [1]
+
+ def convert_to_rknn(onnx_model, model_part, dataset="/home/zt/rk3588-nn/rknn_model_zoo/datasets/COCO/coco_subset_20.txt", quantize=False):
+     """Convert a single ONNX model to RKNN format."""
+     rknn_model = onnx_model.replace(".onnx", ".rknn")
+     timedate_iso = datetime.datetime.now().isoformat()
+
+     print(f"\nConverting {onnx_model} to {rknn_model}")
+
+     input_shapes = None
+
+     if model_part == "encoder":
+         input_shapes = None
+     elif model_part == "decoder":
+         input_shapes = [
+             [
+                 [1, 256, 64, 64],             # image_embedding
+                 [1, 32, 256, 256],            # high_res_feats_0
+                 [1, 64, 128, 128],            # high_res_feats_1
+                 [num_labels, num_points, 2],  # point_coords
+                 [num_labels, num_points],     # point_labels
+                 [num_labels, 1, 256, 256],    # mask_input
+                 [num_labels],                 # has_mask_input
+             ]
+             for num_labels in num_labelss
+             for num_points in num_pointss
+         ]
+
+     rknn = RKNN(verbose=True)
+     rknn.config(
+         dynamic_input=input_shapes,
+         std_values=[[255, 255, 255]] if model_part == "encoder" else None,
+         quantized_dtype='w8a8',
+         quantized_algorithm='normal',
+         quantized_method='channel',
+         quantized_hybrid_level=0,
+         target_platform='rk3588',
+         quant_img_RGB2BGR=False,
+         float_dtype='float16',
+         optimization_level=3,
+         custom_string=f"converted at {timedate_iso}",
+         remove_weight=False,
+         compress_weight=False,
+         inputs_yuv_fmt=None,
+         single_core_mode=False,
+         model_pruning=False,
+         op_target=None,
+         quantize_weight=False,
+         remove_reshape=False,
+         sparse_infer=False,
+         enable_flash_attention=False,
+     )
+
+     ret = rknn.load_onnx(model=onnx_model)
+     ret = rknn.build(do_quantization=quantize, dataset=dataset, rknn_batch_size=None)
+     ret = rknn.export_rknn(rknn_model)
+     print(f"Finished converting {rknn_model}\n")
+
+ def main():
+     parser = argparse.ArgumentParser(description='Convert SAM models from ONNX to RKNN format')
+     parser.add_argument('model_name', type=str, help='Model name, e.g. sam2.1_hiera_tiny')
+     args = parser.parse_args()
+
+     # Build the encoder and decoder file names
+     encoder_onnx = f"{args.model_name}_encoder.onnx"
+     decoder_onnx = f"{args.model_name}_decoder.onnx"
+
+     # Check that both files exist
+     for model in [encoder_onnx, decoder_onnx]:
+         if not os.path.exists(model):
+             print(f"Error: file {model} not found")
+             exit(1)
+
+     # Convert the encoder and decoder
+     # The encoder must be simplified with onnxslim first
+     print("Converting encoder...")
+     onnxslim.slim(encoder_onnx, output_model="encoder_slim.onnx", skip_fusion_patterns=["EliminationSlice"])
+     convert_to_rknn("encoder_slim.onnx", model_part="encoder")
+     os.rename("encoder_slim.rknn", encoder_onnx.replace(".onnx", ".rknn"))
+     os.remove("encoder_slim.onnx")
+
+     # convert_to_rknn(decoder_onnx, model_part="decoder")  # broken, see Known Issues in the README
+
+     print("All model conversions finished!")
+
+ if __name__ == "__main__":
+     main()
dog.jpg ADDED
export_onnx.py ADDED
@@ -0,0 +1,278 @@
+ from typing import Any
+ import argparse
+ import pathlib
+
+ import torch
+ from torch import nn
+ from sam2.build_sam import build_sam2
+ from sam2.modeling.sam2_base import SAM2Base
+
+
+ class SAM2ImageEncoder(nn.Module):
+     def __init__(self, sam_model: SAM2Base) -> None:
+         super().__init__()
+         self.model = sam_model
+         self.image_encoder = sam_model.image_encoder
+         self.no_mem_embed = sam_model.no_mem_embed
+
+     def forward(self, x: torch.Tensor) -> tuple[Any, Any, Any]:
+         backbone_out = self.image_encoder(x)
+         backbone_out["backbone_fpn"][0] = self.model.sam_mask_decoder.conv_s0(
+             backbone_out["backbone_fpn"][0]
+         )
+         backbone_out["backbone_fpn"][1] = self.model.sam_mask_decoder.conv_s1(
+             backbone_out["backbone_fpn"][1]
+         )
+
+         feature_maps = backbone_out["backbone_fpn"][
+             -self.model.num_feature_levels :
+         ]
+         vision_pos_embeds = backbone_out["vision_pos_enc"][
+             -self.model.num_feature_levels :
+         ]
+
+         feat_sizes = [(x.shape[-2], x.shape[-1]) for x in vision_pos_embeds]
+
+         # flatten NxCxHxW to HWxNxC
+         vision_feats = [x.flatten(2).permute(2, 0, 1) for x in feature_maps]
+         vision_feats[-1] = vision_feats[-1] + self.no_mem_embed
+
+         feats = [
+             feat.permute(1, 2, 0).reshape(1, -1, *feat_size)
+             for feat, feat_size in zip(vision_feats[::-1], feat_sizes[::-1])
+         ][::-1]
+
+         return feats[0], feats[1], feats[2]
+
+
+ class SAM2ImageDecoder(nn.Module):
+     def __init__(self, sam_model: SAM2Base, multimask_output: bool) -> None:
+         super().__init__()
+         self.mask_decoder = sam_model.sam_mask_decoder
+         self.prompt_encoder = sam_model.sam_prompt_encoder
+         self.model = sam_model
+         self.img_size = sam_model.image_size
+         self.multimask_output = multimask_output
+
+     @torch.no_grad()
+     def forward(
+         self,
+         image_embed: torch.Tensor,
+         high_res_feats_0: torch.Tensor,
+         high_res_feats_1: torch.Tensor,
+         point_coords: torch.Tensor,
+         point_labels: torch.Tensor,
+         orig_im_size: torch.Tensor,
+         mask_input: torch.Tensor,
+         has_mask_input: torch.Tensor,
+     ):
+         sparse_embedding = self._embed_points(point_coords, point_labels)
+         self.sparse_embedding = sparse_embedding
+         dense_embedding = self._embed_masks(mask_input, has_mask_input)
+
+         high_res_feats = [high_res_feats_0, high_res_feats_1]
+         image_embed = image_embed
+
+         masks, iou_predictions, _, _ = self.mask_decoder.predict_masks(
+             image_embeddings=image_embed,
+             image_pe=self.prompt_encoder.get_dense_pe(),
+             sparse_prompt_embeddings=sparse_embedding,
+             dense_prompt_embeddings=dense_embedding,
+             repeat_image=False,
+             high_res_features=high_res_feats,
+         )
+
+         if self.multimask_output:
+             masks = masks[:, 1:, :, :]
+             iou_predictions = iou_predictions[:, 1:]
+         else:
+             masks, iou_predictions = (
+                 self.mask_decoder._dynamic_multimask_via_stability(
+                     masks, iou_predictions
+                 )
+             )
+
+         masks = torch.clamp(masks, -32.0, 32.0)
+
+         return masks, iou_predictions
+
+     def _embed_points(
+         self, point_coords: torch.Tensor, point_labels: torch.Tensor
+     ) -> torch.Tensor:
+
+         point_coords = point_coords + 0.5
+
+         padding_point = torch.zeros(
+             (point_coords.shape[0], 1, 2), device=point_coords.device
+         )
+         padding_label = -torch.ones(
+             (point_labels.shape[0], 1), device=point_labels.device
+         )
+         point_coords = torch.cat([point_coords, padding_point], dim=1)
+         point_labels = torch.cat([point_labels, padding_label], dim=1)
+
+         point_coords[:, :, 0] = point_coords[:, :, 0] / self.model.image_size
+         point_coords[:, :, 1] = point_coords[:, :, 1] / self.model.image_size
+
+         point_embedding = self.prompt_encoder.pe_layer._pe_encoding(
+             point_coords
+         )
+         point_labels = point_labels.unsqueeze(-1).expand_as(point_embedding)
+
+         point_embedding = point_embedding * (point_labels != -1)
+         point_embedding = (
+             point_embedding
+             + self.prompt_encoder.not_a_point_embed.weight
+             * (point_labels == -1)
+         )
+
+         for i in range(self.prompt_encoder.num_point_embeddings):
+             point_embedding = (
+                 point_embedding
+                 + self.prompt_encoder.point_embeddings[i].weight
+                 * (point_labels == i)
+             )
+
+         return point_embedding
+
+     def _embed_masks(
+         self, input_mask: torch.Tensor, has_mask_input: torch.Tensor
+     ) -> torch.Tensor:
+         mask_embedding = has_mask_input * self.prompt_encoder.mask_downscaling(
+             input_mask
+         )
+         mask_embedding = mask_embedding + (
+             1 - has_mask_input
+         ) * self.prompt_encoder.no_mask_embed.weight.reshape(1, -1, 1, 1)
+         return mask_embedding
+
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(
+         description="Export the SAM2 prompt encoder and mask decoder to an ONNX model."
+     )
+     parser.add_argument(
+         "--checkpoint",
+         type=str,
+         required=True,
+         help="The path to the SAM model checkpoint.",
+     )
+
+     parser.add_argument(
+         "--output_encoder",
+         type=str,
+         required=True,
+         help="The filename to save the encoder ONNX model to.",
+     )
+
+     parser.add_argument(
+         "--output_decoder",
+         type=str,
+         required=True,
+         help="The filename to save the decoder ONNX model to.",
+     )
+
+     parser.add_argument(
+         "--model_type",
+         type=str,
+         required=True,
+         help="In the form of sam2.1_hiera_{tiny, small, base_plus, large}.",
+     )
+
+     parser.add_argument(
+         "--opset",
+         type=int,
+         default=17,
+         help="The ONNX opset version to use. Must be >=11",
+     )
+
+     args = parser.parse_args()
+
+     input_size = (1024, 1024)
+     multimask_output = False
+     model_type = args.model_type
+     if model_type == "sam2.1_hiera_tiny":
+         model_cfg = "configs/sam2.1/sam2.1_hiera_t.yaml"
+     elif model_type == "sam2.1_hiera_small":
+         model_cfg = "configs/sam2.1/sam2.1_hiera_s.yaml"
+     elif model_type == "sam2.1_hiera_base_plus":
+         model_cfg = "configs/sam2.1/sam2.1_hiera_b+.yaml"
+     elif model_type == "sam2.1_hiera_large":
+         model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
+     else:
+         # Fall back to the large config for unrecognized model types
+         model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
+
+     sam2_model = build_sam2(model_cfg, args.checkpoint, device="cpu")
+     img = torch.randn(1, 3, input_size[0], input_size[1]).cpu()
+     sam2_encoder = SAM2ImageEncoder(sam2_model).cpu()
+     high_res_feats_0, high_res_feats_1, image_embed = sam2_encoder(img)
+
+     pathlib.Path(args.output_encoder).parent.mkdir(parents=True, exist_ok=True)
+     torch.onnx.export(
+         sam2_encoder,
+         img,
+         args.output_encoder,
+         export_params=True,
+         opset_version=args.opset,
+         do_constant_folding=True,
+         input_names=["image"],
+         output_names=["high_res_feats_0", "high_res_feats_1", "image_embed"],
+     )
+     print("Saved encoder to", args.output_encoder)
+
+     sam2_decoder = SAM2ImageDecoder(
+         sam2_model, multimask_output=multimask_output
+     ).cpu()
+
+     embed_dim = sam2_model.sam_prompt_encoder.embed_dim
+     embed_size = (
+         sam2_model.image_size // sam2_model.backbone_stride,
+         sam2_model.image_size // sam2_model.backbone_stride,
+     )
+     mask_input_size = [4 * x for x in embed_size]
+     print(embed_dim, embed_size, mask_input_size)
+
+     point_coords = torch.randint(
+         low=0, high=input_size[1], size=(1, 5, 2), dtype=torch.float
+     )
+     point_labels = torch.randint(low=0, high=1, size=(1, 5), dtype=torch.float)
+     mask_input = torch.randn(1, 1, *mask_input_size, dtype=torch.float)
+     has_mask_input = torch.tensor([1], dtype=torch.float)
+     orig_im_size = torch.tensor([input_size[0], input_size[1]], dtype=torch.int)
+
+     pathlib.Path(args.output_decoder).parent.mkdir(parents=True, exist_ok=True)
+     torch.onnx.export(
+         sam2_decoder,
+         (
+             image_embed,
+             high_res_feats_0,
+             high_res_feats_1,
+             point_coords,
+             point_labels,
+             orig_im_size,
+             mask_input,
+             has_mask_input,
+         ),
+         args.output_decoder,
+         export_params=True,
+         opset_version=args.opset,
+         do_constant_folding=True,
+         input_names=[
+             "image_embed",
+             "high_res_feats_0",
+             "high_res_feats_1",
+             "point_coords",
+             "point_labels",
+             "orig_im_size",
+             "mask_input",
+             "has_mask_input",
+         ],
+         output_names=["masks", "iou_predictions"],
+         dynamic_axes={
+             "point_coords": {0: "num_labels", 1: "num_points"},
+             "point_labels": {0: "num_labels", 1: "num_points"},
+             "mask_input": {0: "num_labels"},
+             "has_mask_input": {0: "num_labels"},
+         },
+     )
+     print("Saved decoder to", args.output_decoder)
sam2.1_hiera_large_decoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c039b2455b4e92dfeb8cb8e4d10a98a92a79ec1550a7119c997bad4352811554
+ size 16526061
sam2.1_hiera_large_encoder.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:0ce5ae036eb273f4e017481c8cb744e50c84a93e81e2f6a84ff4b89a118e756a
+ size 1419024037
sam2.1_hiera_small_decoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4e7ba7a80bfae89c1a660d3b64291fa4f5a2de15022a4e8eab933218d4f34582
+ size 16526003
sam2.1_hiera_small_encoder.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d8b9efce9e5d12900a508dc1b79dfbd389057136a6d2ab4cb66654961f3106ef
+ size 374531749
sam2.1_hiera_tiny_decoder.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f594db10b3c7b4d9de7f8854693ea6f7a880e5e228ad08d7823393233e65f4fa
+ size 16525993
sam2.1_hiera_tiny_encoder.rknn ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:c3750eef90b87ab63cfefbf4f89858072a4891818c315d96dddeea172119cba1
+ size 339018597
test_onnx.py ADDED
@@ -0,0 +1,195 @@
+ import os
+ os.chdir(os.path.dirname(os.path.abspath(__file__)))
+
+ import numpy as np
+ import torch
+ import onnxruntime
+ from PIL import Image
+ import requests
+ from io import BytesIO
+ import matplotlib.pyplot as plt
+ from sam2.build_sam import build_sam2
+ from sam2.sam2_image_predictor import SAM2ImagePredictor
+
+
+ def load_image(url):
+     """Load and preprocess the image."""
+     response = requests.get(url)
+     image = Image.open(BytesIO(response.content)).convert("RGB")
+     print(f"Original image size: {image.size}")
+
+     # Compute the resized dimensions while keeping the aspect ratio
+     target_size = (1024, 1024)
+     w, h = image.size
+     scale = min(target_size[0] / w, target_size[1] / h)
+     new_w = int(w * scale)
+     new_h = int(h * scale)
+     print(f"Scale factor: {scale}")
+     print(f"Resized dimensions: {new_w}x{new_h}")
+
+     # Resize the image
+     resized_image = image.resize((new_w, new_h), Image.Resampling.LANCZOS)
+
+     # Create a 1024x1024 black canvas
+     processed_image = Image.new("RGB", target_size, (0, 0, 0))
+     # Paste the resized image at the center
+     paste_x = (target_size[0] - new_w) // 2
+     paste_y = (target_size[1] - new_h) // 2
+     print(f"Paste position: ({paste_x}, {paste_y})")
+     processed_image.paste(resized_image, (paste_x, paste_y))
+
+     # Save the processed image for inspection
+     processed_image.save("debug_processed_image.png")
+
+     # Convert to a numpy array and normalize to [0, 1]
+     img_np = np.array(processed_image).astype(np.float32) / 255.0
+     # Reorder dimensions from HWC to CHW
+     img_np = img_np.transpose(2, 0, 1)
+     # Add the batch dimension
+     img_np = np.expand_dims(img_np, axis=0)
+
+     print(f"Final input tensor shape: {img_np.shape}")
+
+     return image, img_np, (scale, paste_x, paste_y)
+
+ def prepare_point_input(point_coords, point_labels, image_size=(1024, 1024)):
+     """Prepare the point-prompt inputs."""
+     point_coords = np.array(point_coords, dtype=np.float32)
+     point_labels = np.array(point_labels, dtype=np.float32)
+
+     # Add the batch dimension
+     point_coords = np.expand_dims(point_coords, axis=0)
+     point_labels = np.expand_dims(point_labels, axis=0)
+
+     # Prepare the mask inputs
+     mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)
+     has_mask_input = np.zeros(1, dtype=np.float32)
+     orig_im_size = np.array(image_size, dtype=np.int32)
+
+     return point_coords, point_labels, mask_input, has_mask_input, orig_im_size
+
+ def main():
+     # 1. Load the original image
+     url = "https://raw.githubusercontent.com/facebookresearch/segment-anything/main/notebooks/images/dog.jpg"
+     orig_image, input_image, (scale, offset_x, offset_y) = load_image(url)
+
+     # 2. Prepare the input point - the click coordinates must be adjusted by scale and offset
+     input_point_orig = [[750, 400]]
+     input_point = [[
+         int(x * scale + offset_x),
+         int(y * scale + offset_y)
+     ] for x, y in input_point_orig]
+     print(f"Original point: {input_point_orig}")
+     print(f"Transformed point: {input_point}")
+     input_label = [1]
+
+     # 3. Run the PyTorch model
+     print("Running PyTorch model...")
+     checkpoint = "sam2.1_hiera_large.pt"
+     model_cfg = "configs/sam2.1/sam2.1_hiera_l.yaml"
+     predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))
+
+     with torch.inference_mode():
+         predictor.set_image(orig_image)
+         masks_pt, iou_scores_pt, low_res_masks_pt = predictor.predict(
+             point_coords=np.array(input_point),
+             point_labels=np.array(input_label),
+             multimask_output=True
+         )
+
+     # 4. Run the ONNX model
+     print("Running ONNX model...")
+     encoder_path = "sam2.1_hiera_tiny_encoder.s.onnx"
+     decoder_path = "sam2.1_hiera_tiny_decoder.onnx"
+
+     # Create the ONNX Runtime sessions
+     encoder_session = onnxruntime.InferenceSession(encoder_path)
+     decoder_session = onnxruntime.InferenceSession(decoder_path)
+
+     # Run the encoder
+     encoder_inputs = {'image': input_image}
+     high_res_feats_0, high_res_feats_1, image_embed = encoder_session.run(None, encoder_inputs)
+
+     # Prepare the decoder inputs
+     point_coords, point_labels, mask_input, has_mask_input, orig_im_size = prepare_point_input(
+         input_point, input_label, orig_image.size[::-1]
+     )
+
+     # Run the decoder
+     decoder_inputs = {
+         'image_embed': image_embed,
+         'high_res_feats_0': high_res_feats_0,
+         'high_res_feats_1': high_res_feats_1,
+         'point_coords': point_coords,
+         'point_labels': point_labels,
+         # 'orig_im_size': orig_im_size,
+         'mask_input': mask_input,
+         'has_mask_input': has_mask_input,
+     }
+
+     low_res_masks, iou_predictions = decoder_session.run(None, decoder_inputs)
+
+     # Post-processing: upscale low_res_masks to the original image size
+     w, h = orig_image.size
+
+     # 1. First upscale the masks to 1024x1024
+     masks_1024 = torch.nn.functional.interpolate(
+         torch.from_numpy(low_res_masks),
+         size=(1024, 1024),
+         mode="bilinear",
+         align_corners=False
+     )
+
+     # 2. Remove the padding
+     new_h = int(h * scale)
+     new_w = int(w * scale)
+     start_h = (1024 - new_h) // 2
+     start_w = (1024 - new_w) // 2
+     masks_no_pad = masks_1024[..., start_h:start_h+new_h, start_w:start_w+new_w]
+
+     # 3. Resize to the original image size
+     masks_onnx = torch.nn.functional.interpolate(
+         masks_no_pad,
+         size=(h, w),
+         mode="bilinear",
+         align_corners=False
+     )
+
+     # 4. Binarize
+     masks_onnx = masks_onnx > 0.0
+     masks_onnx = masks_onnx.numpy()
+
+     # Print the output shapes after running the ONNX model
+     print(f"\nOutput shapes:")
+     print(f"PyTorch masks shape: {masks_pt.shape}")
+     print(f"ONNX masks shape: {masks_onnx.shape}")
+
+     # Visualization (the difference plot is left out for now)
+     plt.figure(figsize=(10, 5))
+
+     # PyTorch result
+     plt.subplot(121)
+     plt.imshow(orig_image)
+     plt.imshow(masks_pt[0], alpha=0.5)
+     plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
+     plt.title('PyTorch Output')
+     plt.axis('off')
+
+     # ONNX result
+     plt.subplot(122)
+     plt.imshow(orig_image)
+     plt.imshow(masks_onnx[0, 0], alpha=0.5)
+     plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
+     plt.title('ONNX Output')
+     plt.axis('off')
+
+     plt.tight_layout()
+     plt.show()
+
+     # 6. Print some statistics
+     print("\nStatistics:")
+     print(f"PyTorch IoU scores: {iou_scores_pt}")
+     print(f"ONNX IoU predictions: {iou_predictions}")
+
+ if __name__ == "__main__":
+     main()
test_rknn.py ADDED
@@ -0,0 +1,178 @@
+ import os
+ import time
+ os.chdir(os.path.dirname(os.path.abspath(__file__)))
+
+ import numpy as np
+ import onnxruntime
+ from rknnlite.api import RKNNLite
+ from PIL import Image
+ import matplotlib.pyplot as plt
+ import cv2
+
+
+ def load_image(path):
+     """Load and preprocess the image."""
+     image = Image.open(path).convert("RGB")
+     print(f"Original image size: {image.size}")
+
+     # Compute the resized dimensions while keeping the aspect ratio
+     target_size = (1024, 1024)
+     w, h = image.size
+     scale = min(target_size[0] / w, target_size[1] / h)
+     new_w = int(w * scale)
+     new_h = int(h * scale)
+     print(f"Scale factor: {scale}")
+     print(f"Resized dimensions: {new_w}x{new_h}")
+
+     # Resize the image
+     resized_image = image.resize((new_w, new_h), Image.Resampling.LANCZOS)
+
+     # Create a 1024x1024 black canvas
+     processed_image = Image.new("RGB", target_size, (0, 0, 0))
+     # Paste the resized image at the center
+     paste_x = (target_size[0] - new_w) // 2
+     paste_y = (target_size[1] - new_h) // 2
+     print(f"Paste position: ({paste_x}, {paste_y})")
+     processed_image.paste(resized_image, (paste_x, paste_y))
+
+     # Save the processed image for inspection
+     processed_image.save("debug_processed_image.png")
+
+     # Convert to a numpy array; normalization to [0, 1] is folded into the RKNN model
+     img_np = np.array(processed_image).astype(np.float32)  # / 255.0
+     # Reorder dimensions from HWC to CHW
+     img_np = img_np.transpose(2, 0, 1)
+     # Add the batch dimension
+     img_np = np.expand_dims(img_np, axis=0)
+
+     print(f"Final input tensor shape: {img_np.shape}")
+
+     return image, img_np, (scale, paste_x, paste_y)
+
+ def prepare_point_input(point_coords, point_labels, image_size=(1024, 1024)):
+     """Prepare the point-prompt inputs."""
+     point_coords = np.array(point_coords, dtype=np.float32)
+     point_labels = np.array(point_labels, dtype=np.float32)
+
+     # Add the batch dimension
+     point_coords = np.expand_dims(point_coords, axis=0)
+     point_labels = np.expand_dims(point_labels, axis=0)
+
+     # Prepare the mask inputs
+     mask_input = np.zeros((1, 1, 256, 256), dtype=np.float32)
+     has_mask_input = np.zeros(1, dtype=np.float32)
+     orig_im_size = np.array(image_size, dtype=np.int32)
+
+     return point_coords, point_labels, mask_input, has_mask_input, orig_im_size
+
+ def main():
+     # 1. Load the original image
+     path = "dog.jpg"
+     orig_image, input_image, (scale, offset_x, offset_y) = load_image(path)
+     decoder_path = "sam2.1_hiera_small_decoder.onnx"
+     encoder_path = "sam2.1_hiera_small_encoder.rknn"
+
+     # 2. Prepare the input point
+     # input_point_orig = [[750, 400]]
+     input_point_orig = [[189, 394]]
+     input_point = [[
+         int(x * scale + offset_x),
+         int(y * scale + offset_y)
+     ] for x, y in input_point_orig]
+     input_label = [1]
+
+     # 3. Run the RKNN encoder
+     print("Running RKNN encoder...")
+     rknn_lite = RKNNLite(verbose=False)
+
+     ret = rknn_lite.load_rknn(encoder_path)
+     if ret != 0:
+         print('Load RKNN model failed')
+         exit(ret)
+
+     ret = rknn_lite.init_runtime()
+     if ret != 0:
+         print('Init runtime environment failed')
+         exit(ret)
+     start_time = time.time()
+     encoder_outputs = rknn_lite.inference(inputs=[input_image], data_format="nchw")
+     end_time = time.time()
+     print(f"RKNN encoder time: {end_time - start_time} seconds")
+     high_res_feats_0, high_res_feats_1, image_embed = encoder_outputs
+     rknn_lite.release()
+
+     # 4. Run the ONNX decoder
+     print("Running ONNX decoder...")
+     decoder_session = onnxruntime.InferenceSession(decoder_path)
+
+     point_coords, point_labels, mask_input, has_mask_input, orig_im_size = prepare_point_input(
+         input_point, input_label, orig_image.size[::-1]
+     )
+
+     decoder_inputs = {
+         'image_embed': image_embed,
+         'high_res_feats_0': high_res_feats_0,
+         'high_res_feats_1': high_res_feats_1,
+         'point_coords': point_coords,
+         'point_labels': point_labels,
+         'mask_input': mask_input,
+         'has_mask_input': has_mask_input,
+     }
+     start_time = time.time()
+     low_res_masks, iou_predictions = decoder_session.run(None, decoder_inputs)
+     end_time = time.time()
+     print(f"ONNX decoder time: {end_time - start_time} seconds")
+     print(low_res_masks.shape)
+
+     # 5. Post-processing
+     w, h = orig_image.size
+     masks_rknn = []
+
+     # Process all three masks
+     for i in range(low_res_masks.shape[1]):
+         # Upscale the mask to 1024x1024
+         masks_1024 = cv2.resize(
+             low_res_masks[0, i],
+             (1024, 1024),
+             interpolation=cv2.INTER_LINEAR
+         )
+
+         # Remove the padding
+         new_h = int(h * scale)
+         new_w = int(w * scale)
+         start_h = (1024 - new_h) // 2
+         start_w = (1024 - new_w) // 2
+         masks_no_pad = masks_1024[start_h:start_h+new_h, start_w:start_w+new_w]
+
+         # Resize to the original image size
+         mask = cv2.resize(
+             masks_no_pad,
+             (w, h),
+             interpolation=cv2.INTER_LINEAR
+         )
+
+         # Binarize
+         mask = mask > 0.0
+         masks_rknn.append(mask)
+
+     # 6. Visualize the results
+     plt.figure(figsize=(15, 5))
+
+     # Indices sorted by IoU score, descending
+     sorted_indices = np.argsort(iou_predictions[0])[::-1]
+
+     for idx, mask_idx in enumerate(sorted_indices):
+         plt.subplot(1, 3, idx + 1)
+         plt.imshow(orig_image)
+         plt.imshow(masks_rknn[mask_idx], alpha=0.5)
+         plt.plot(input_point_orig[0][0], input_point_orig[0][1], 'rx')
+         plt.title(f'Mask {mask_idx+1}\nIoU: {iou_predictions[0][mask_idx]:.3f}')
+         plt.axis('off')
+
+     plt.tight_layout()
+     # plt.show()
+     plt.savefig("result.png")
+
+     print(f"\nIoU predictions: {iou_predictions}")
+
+ if __name__ == "__main__":
+     main()