Having the coordinates be returned
Would it be possible to additionally have the box coordinates returned with the Text Output? Thanks.
I apologize, I cannot figure out how to push to the branch I made, since this Space is in Dev mode.
Here is what I wanted to add:
Modified line 81
return image, str(parsed_content_list), str(label_coordinates)
Added line 108
with gr.Column():
image_output_component = gr.Image(type='pil', label='Image Output')
text_output_component = gr.Textbox(label='Parsed screen elements', placeholder='Text Output')
coordinates_output_component = gr.Textbox(label='Coordinates', placeholder='Coordinates Output') <-- this one
Modified line 125 (previously 124)
outputs=[image_output_component, text_output_component, coordinates_output_component]
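A side note on the `str(label_coordinates)` call above: here is a rough sketch of returning the coordinates as JSON instead, which may be easier to parse for anyone calling the Space through the API. It assumes `label_coordinates` is a dict mapping an element ID to four numbers; the helper name `coordinates_to_json` and the sample values are hypothetical and not part of app.py.

```python
import json


def coordinates_to_json(label_coordinates):
    """Serialize {element_id: box} to a JSON string.

    Assumption: each box is an iterable of four numbers (plain floats,
    or NumPy/torch scalars that float() can convert).
    """
    plain = {
        str(element_id): [float(value) for value in box]
        for element_id, box in label_coordinates.items()
    }
    return json.dumps(plain)


# Hypothetical sample values, for illustration only.
sample = {"0": [0.1, 0.2, 0.3, 0.4]}
print(coordinates_to_json(sample))
```

The return statement would then use `coordinates_to_json(label_coordinates)` in place of `str(label_coordinates)`, if the actual data matches the assumed shape.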
Many thanks
hello @TotoB12 just read this issue - thanks for taking time investigating!
Will this output the coordinates as well?
awesome! is that something the community wants?
The coordinates are one of the core features of this model. As per the current app.py at line 77:
dino_labeled_img, label_coordinates, parsed_content_list = get_som_labeled_img(
image_save_path,
yolo_model,
BOX_TRESHOLD=box_threshold,
output_coord_in_ratio=True,
ocr_bbox=ocr_bbox,
draw_bbox_config=draw_bbox_config,
caption_model_processor=caption_model_processor,
ocr_text=text,
iou_threshold=iou_threshold
)
The coordinates are already generated when the model is prompted; they are just not shown.
On this Space, seeing the Text Output and the labeled image is nice, but without the full data it is of little use in actual projects.
In the microsoft/OmniParser GitHub repository's issues tab, we can see that the coordinates are definitely an indispensable asset when using the model.
I would really appreciate it if you could return the coordinates like this (as seen in the screenshot: the center point of the x1,y1,x2,y2 coordinates of the bounding boxes). Combining this with the actual screen width and height, we can get the actual screen coordinates (the x,y center point of the UI element). This could save us a lot of time locating the actual UI element.
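To make the conversion concrete, here is a minimal sketch of going from a ratio bounding box to a screen position. It assumes each box is x1,y1,x2,y2 expressed as ratios between 0 and 1 (which is what `output_coord_in_ratio=True` suggests, though I have not verified the exact format); the function name `box_center_in_pixels` and the sample numbers are made up for illustration.

```python
def box_center_in_pixels(box, screen_width, screen_height):
    """Return the (x, y) pixel center of a ratio bounding box.

    Assumption: box is (x1, y1, x2, y2), each value a ratio in [0, 1]
    relative to the screenshot's width and height.
    """
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2 * screen_width
    center_y = (y1 + y2) / 2 * screen_height
    # Round to whole pixels so the result can be passed to a cursor API.
    return round(center_x), round(center_y)


# Hypothetical box covering the lower middle of a 1920x1080 screen.
print(box_center_in_pixels((0.25, 0.5, 0.75, 1.0), 1920, 1080))
```

If the boxes turn out to be x, y, width, height instead, the center would be x + width/2 and y + height/2 before scaling.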
My use case is to build a Chrome extension that can sort of control the browser, execute tasks, etc. (like Anthropic's computer use API). Right now I have to use CSS selectors to locate the right element; this process is kind of error-prone, and it needs some additional steps too.
If you could give these screen coordinates in the API response, then I could just move the cursor to that position. It would save us lots of time and let me cache the coordinates too, so things could be cheaper and faster than Anthropic's computer use (again, I'm just ranting and could be wrong, please feel free to correct me).
If anyone is building something similar, please ping, I'll be happy to get involved.
@zorba111 This is a great idea and would reduce the amount of work we need to do in our projects. I am actually building an app very similar to yours, at the computer level. @jadechoghari I think this would be a valuable addition to this Space. I already got Microsoft's demo to output the coordinates.