Having the coordinates be returned

#2
by TotoB12 - opened

Would it be possible to additionally have the box coordinates returned with the Text Output? Thanks.

I apologize, I cannot figure out how to push to the branch I made, since this Space is in Dev-mode.
Here is what I wanted to add:

Modified line 81

    return image, str(parsed_content_list), str(label_coordinates)

Added line 108

        with gr.Column():
            image_output_component = gr.Image(type='pil', label='Image Output')
            text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
            coordinates_output_component = gr.Textbox(label='Coordinates', placeholder='Coordinates Output')  # <-- this one

Modified line 125 (previously 124)

        outputs=[image_output_component, text_output_component, coordinates_output_component]
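If the coordinates are meant to be read by other programs, the textbox string could be JSON instead of the `str(...)` representation. This is only a minimal sketch: it assumes `label_coordinates` is a dict mapping a box ID to four numbers (the real structure may differ, for example NumPy arrays), and `coordinates_to_json` is a hypothetical helper that is not part of the Space.

```python
import json


def coordinates_to_json(label_coordinates):
    """Serialize a {box_id: [four numbers]} mapping to a JSON string.

    Assumption: each value is a sequence of numbers (list, tuple, or
    NumPy array); every element is converted to a plain Python float.
    """
    plain = {
        str(box_id): [float(v) for v in coords]
        for box_id, coords in label_coordinates.items()
    }
    return json.dumps(plain, indent=2)


# Made-up example data, not real model output
example = {"0": (0.1, 0.2, 0.3, 0.4), "1": [0.5, 0.5, 0.25, 0.125]}
print(coordinates_to_json(example))
```

With something like this, the modified return line could pass `coordinates_to_json(label_coordinates)` in place of `str(label_coordinates)`, so a client can parse the box with `json.loads`.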

Many thanks

hello @TotoB12 just read this issue - thanks for taking time investigating!

Will this output the coordinates as well?

Hey!
Yes, this will display the usual text output (with the text and icon box numbers), with the coordinates in a separate output box.
I only got to test this on a modified CPU-only Space, but I am pretty sure this is all that is needed.
Here is what it would look like:

image.png

awesome! is that something the community wants?

The coordinates are one of the core features of this model. As per the current app.py at line 77:

    dino_labeled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
        image_save_path,
        yolo_model,
        BOX_TRESHOLD=box_threshold,
        output_coord_in_ratio=True,
        ocr_bbox=ocr_bbox,
        draw_bbox_config=draw_bbox_config,
        caption_model_processor=caption_model_processor,
        ocr_text=text,
        iou_threshold=iou_threshold
    )

The coordinates are already being generated when the model is prompted; they are just not being shown.
On this Space, seeing the Text Output and labeled image is nice, but it is of little use in actual projects without the full data.
In the issues tab of the microsoft/OmniParser GitHub repository, we can see that the coordinates are definitely an indispensable asset when using the model.

image.png

I would really appreciate it if you could return the coordinates like this (as seen in the screenshot): the center point of each bounding box's x1, y1, x2, y2 coordinates. Combining this with the actual screen width and height, we can get the actual screen coordinates (the x, y center point of the UI element). This could save us a lot of time locating the actual UI element.
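The conversion described here can be sketched roughly as follows. It assumes each box is given as `(x1, y1, x2, y2)` ratios between 0 and 1, which is how the post describes the output but has not been checked against the model; `bbox_center_in_pixels` is a hypothetical helper name.

```python
def bbox_center_in_pixels(bbox_ratio, screen_width, screen_height):
    """Return the (x, y) pixel center of a box given as ratios.

    Assumption: bbox_ratio is (x1, y1, x2, y2), each value in [0, 1]
    relative to the screenshot width and height.
    """
    x1, y1, x2, y2 = bbox_ratio
    center_x = (x1 + x2) / 2 * screen_width
    center_y = (y1 + y2) / 2 * screen_height
    # Round to whole pixels so the result can be used as a cursor position
    return round(center_x), round(center_y)


# Made-up box covering the lower-middle part of a 1920x1080 screen
print(bbox_center_in_pixels((0.25, 0.5, 0.75, 1.0), 1920, 1080))  # (960, 810)
```

If the boxes turn out to be in another layout (for example x, y, width, height), only the two center lines would need to change.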

My use case is to build a Chrome extension that can control the browser, execute tasks, etc. (like Anthropic's computer use API). Right now I have to use CSS selectors to locate the right element; this process is error-prone, and some additional steps are necessary too.

If you could return these screen coordinates in the API response, I could just move the cursor to that position. It would save us a lot of time and let me cache the coordinates too, so things could be cheaper and faster than Anthropic's computer use (again, I'm just ranting and could be wrong, so please feel free to correct me).
If anyone is building something similar, please ping, I'll be happy to get involved.

Screenshot 2024-10-29 at 2.29.46 PM.png

@zorba111 awesome! Would it be best to add @TotoB12's modified lines?

@zorba111 This is a great idea and would reduce the amount of work needed in our projects. I am actually building an app very similar to yours, at the computer level. @jadechoghari I think this would be a valuable addition to this Space. I have already gotten Microsoft's demo to output the coordinates.
