awacke1 committed (verified)
Commit d1b1e68 · Parent(s): 4876cb6

Update app.py

Files changed (1):
  1. app.py +2 -477
app.py CHANGED
@@ -39,7 +39,7 @@ st.markdown("---")
 
 st.markdown("## **Interaction Protocol** 🤝 :bulb:**")
 st.markdown("### **Key Elements** :guards:")
- st.markdown(f"""
+ st.markdown("""
 1. **Communication** 🗣 \n
 - Agents exchange information \n
 2. **Cooperation** 🤝 \n
@@ -116,7 +116,7 @@ https://aka.ms/kosmos-2.
 ---------------
 
 ### 19 Feb 2024 | [ScreenAI: A Vision-Language Model for UI and Infographics Understanding](https://arxiv.org/abs/2402.04615) | [⬇️](https://arxiv.org/pdf/2402.04615)
- *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor C\u{a}rbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
+ *Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Crbune, Jason Lin, Jindong Chen, Abhanshu Sharma*
 
 Screen user interfaces (UIs) and infographics, sharing similar visual
 language and design principles, play important roles in human communication and
@@ -266,479 +266,4 @@ Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
266
  model training) using existing libraries and autonomously self-debug.
267
 
268
  ---------------
269
-
270
- ### 24 Jan 2024 | [VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks](https://arxiv.org/abs/2401.13649) | [⬇️](https://arxiv.org/pdf/2401.13649)
271
- *Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, Daniel Fried*
272
-
273
- Autonomous agents capable of planning, reasoning, and executing actions on
274
- the web offer a promising avenue for automating computer tasks. However, the
275
- majority of existing benchmarks primarily focus on text-based agents,
276
- neglecting many natural tasks that require visual information to effectively
277
- solve. Given that most computer interfaces cater to human perception, visual
278
- information often augments textual data in ways that text-only models struggle
279
- to harness effectively. To bridge this gap, we introduce VisualWebArena, a
280
- benchmark designed to assess the performance of multimodal web agents on
281
- realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
282
- of diverse and complex web-based tasks that evaluate various capabilities of
283
- autonomous multimodal agents. To perform on this benchmark, agents need to
284
- accurately process image-text inputs, interpret natural language instructions,
285
- and execute actions on websites to accomplish user-defined objectives. We
286
- conduct an extensive evaluation of state-of-the-art LLM-based autonomous
287
- agents, including several multimodal models. Through extensive quantitative and
288
- qualitative analysis, we identify several limitations of text-only LLM agents,
289
- and reveal gaps in the capabilities of state-of-the-art multimodal language
290
- agents. VisualWebArena provides a framework for evaluating multimodal
291
- autonomous language agents, and offers insights towards building stronger
292
- autonomous agents for the web. Our code, baseline models, and data is publicly
293
- available at https://jykoh.com/vwa.
294
-
295
- ---------------
296
-
297
- ### 22 Feb 2018 | [Multimodal Named Entity Recognition for Short Social Media Posts](https://arxiv.org/abs/1802.07862) | [⬇️](https://arxiv.org/pdf/1802.07862)
298
- *Seungwhan Moon, Leonardo Neves, Vitor Carvalho*
299
-
300
- We introduce a new task called Multimodal Named Entity Recognition (MNER) for
301
- noisy user-generated data such as tweets or Snapchat captions, which comprise
302
- short text with accompanying images. These social media posts often come in
303
- inconsistent or incomplete syntax and lexical notations with very limited
304
- surrounding textual contexts, bringing significant challenges for NER. To this
305
- end, we create a new dataset for MNER called SnapCaptions (Snapchat
306
- image-caption pairs submitted to public and crowd-sourced stories with fully
307
- annotated named entities). We then build upon the state-of-the-art Bi-LSTM
308
- word/character based NER models with 1) a deep image network which incorporates
309
- relevant visual context to augment textual information, and 2) a generic
310
- modality-attention module which learns to attenuate irrelevant modalities while
311
- amplifying the most informative ones to extract contexts from, adaptive to each
312
- sample and token. The proposed MNER model with modality attention significantly
313
- outperforms the state-of-the-art text-only NER models by successfully
314
- leveraging provided visual contexts, opening up potential applications of MNER
315
- on myriads of social media platforms.
316
-
317
- ---------------
318
-
319
- ### 21 Sep 2023 | [You Only Look at Screens: Multimodal Chain-of-Action Agents](https://arxiv.org/abs/2309.11436) | [⬇️](https://arxiv.org/pdf/2309.11436)
320
- *Zhuosheng Zhang, Aston Zhang*
321
-
322
- Autonomous user interface (UI) agents aim to facilitate task automation by
323
- interacting with the user interface without manual intervention. Recent studies
324
- have investigated eliciting the capabilities of large language models (LLMs)
325
- for effective engagement in diverse environments. To align with the
326
- input-output requirement of LLMs, existing approaches are developed under a
327
- sandbox setting where they rely on external tools and application-specific APIs
328
- to parse the environment into textual elements and interpret the predicted
329
- actions. Consequently, those approaches often grapple with inference
330
- inefficiency and error propagation risks. To mitigate the challenges, we
331
- introduce Auto-UI, a multimodal solution that directly interacts with the
332
- interface, bypassing the need for environment parsing or reliance on
333
- application-dependent APIs. Moreover, we propose a chain-of-action technique --
334
- leveraging a series of intermediate previous action histories and future action
335
- plans -- to help the agent decide what action to execute. We evaluate our
336
- approach on a new device-control benchmark AITW with 30K unique instructions,
337
- spanning multi-step tasks such as application operation, web searching, and web
338
- shopping. Experimental results show that Auto-UI achieves state-of-the-art
339
- performance with an action type prediction accuracy of 90% and an overall
340
- action success rate of 74%. Code is publicly available at
341
- https://github.com/cooelf/Auto-UI.
342
-
343
- ---------------
344
-
345
- ### 06 Jun 2023 | [LIDA: A Tool for Automatic Generation of Grammar-Agnostic Visualizations and Infographics using Large Language Models](https://arxiv.org/abs/2303.02927) | [⬇️](https://arxiv.org/pdf/2303.02927)
346
- *Victor Dibia*
347
-
348
- Systems that support users in the automatic creation of visualizations must
349
- address several subtasks - understand the semantics of data, enumerate relevant
350
- visualization goals and generate visualization specifications. In this work, we
351
- pose visualization generation as a multi-stage generation problem and argue
352
- that well-orchestrated pipelines based on large language models (LLMs) such as
353
- ChatGPT/GPT-4 and image generation models (IGMs) are suitable to addressing
354
- these tasks. We present LIDA, a novel tool for generating grammar-agnostic
355
- visualizations and infographics. LIDA comprises of 4 modules - A SUMMARIZER
356
- that converts data into a rich but compact natural language summary, a GOAL
357
- EXPLORER that enumerates visualization goals given the data, a VISGENERATOR
358
- that generates, refines, executes and filters visualization code and an
359
- INFOGRAPHER module that yields data-faithful stylized graphics using IGMs. LIDA
360
- provides a python api, and a hybrid user interface (direct manipulation and
361
- multilingual natural language) for interactive chart, infographics and data
362
- story generation. Learn more about the project here -
363
- https://microsoft.github.io/lida/
364
-
365
- ---------------
366
-
367
- ### 16 Feb 2023 | [VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning](https://arxiv.org/abs/2211.15103) | [⬇️](https://arxiv.org/pdf/2211.15103)
368
- *Kashu Yamazaki, Khoa Vo, Sang Truong, Bhiksha Raj, Ngan Le*
369
-
370
- Video paragraph captioning aims to generate a multi-sentence description of
371
- an untrimmed video with several temporal event locations in coherent
372
- storytelling. Following the human perception process, where the scene is
373
- effectively understood by decomposing it into visual (e.g. human, animal) and
374
- non-visual components (e.g. action, relations) under the mutual influence of
375
- vision and language, we first propose a visual-linguistic (VL) feature. In the
376
- proposed VL feature, the scene is modeled by three modalities including (i) a
377
- global visual environment; (ii) local visual main agents; (iii) linguistic
378
- scene elements. We then introduce an autoregressive Transformer-in-Transformer
379
- (TinT) to simultaneously capture the semantic coherence of intra- and
380
- inter-event contents within a video. Finally, we present a new VL contrastive
381
- loss function to guarantee learnt embedding features are matched with the
382
- captions semantics. Comprehensive experiments and extensive ablation studies on
383
- ActivityNet Captions and YouCookII datasets show that the proposed
384
- Visual-Linguistic Transformer-in-Transform (VLTinT) outperforms prior
385
- state-of-the-art methods on accuracy and diversity. Source code is made
386
- publicly available at: https://github.com/UARK-AICV/VLTinT.
387
-
388
- ---------------
389
-
390
- ### 04 Mar 2021 | [FAtiMA Toolkit -- Toward an effective and accessible tool for the development of intelligent virtual agents and social robots](https://arxiv.org/abs/2103.03020) | [⬇️](https://arxiv.org/pdf/2103.03020)
391
- *Samuel Mascarenhas, Manuel Guimar\~aes, Pedro A. Santos, Jo\~ao Dias, Rui Prada, Ana Paiva*
392
-
393
- More than a decade has passed since the development of FearNot!, an
394
- application designed to help children deal with bullying through role-playing
395
- with virtual characters. It was also the application that led to the creation
396
- of FAtiMA, an affective agent architecture for creating autonomous characters
397
- that can evoke empathic responses. In this paper, we describe FAtiMA Toolkit, a
398
- collection of open-source tools that is designed to help researchers, game
399
- developers and roboticists incorporate a computational model of emotion and
400
- decision-making in their work. The toolkit was developed with the goal of
401
- making FAtiMA more accessible, easier to incorporate into different projects
402
- and more flexible in its capabilities for human-agent interaction, based upon
403
- the experience gathered over the years across different virtual environments
404
- and human-robot interaction scenarios. As a result, this work makes several
405
- different contributions to the field of Agent-Based Architectures. More
406
- precisely, FAtiMA Toolkit's library based design allows developers to easily
407
- integrate it with other frameworks, its meta-cognitive model affords different
408
- internal reasoners and affective components and its explicit dialogue structure
409
- gives control to the author even within highly complex scenarios. To
410
- demonstrate the use of FAtiMA Toolkit, several different use cases where the
411
- toolkit was successfully applied are described and discussed.
412
-
413
- ---------------
414
-
415
- ### 12 Sep 2022 | [emojiSpace: Spatial Representation of Emojis](https://arxiv.org/abs/2209.09871) | [⬇️](https://arxiv.org/pdf/2209.09871)
416
- *Moeen Mostafavi, Mahsa Pahlavikhah Varnosfaderani, Fateme Nikseresht, Seyed Ahmad Mansouri*
417
-
418
- In the absence of nonverbal cues during messaging communication, users
419
- express part of their emotions using emojis. Thus, having emojis in the
420
- vocabulary of text messaging language models can significantly improve many
421
- natural language processing (NLP) applications such as online communication
422
- analysis. On the other hand, word embedding models are usually trained on a
423
- very large corpus of text such as Wikipedia or Google News datasets that
424
- include very few samples with emojis. In this study, we create emojiSpace,
425
- which is a combined word-emoji embedding using the word2vec model from the
426
- Genism library in Python. We trained emojiSpace on a corpus of more than 4
427
- billion tweets and evaluated it by implementing sentiment analysis on a Twitter
428
- dataset containing more than 67 million tweets as an extrinsic task. For this
429
- task, we compared the performance of two different classifiers of random forest
430
- (RF) and linear support vector machine (SVM). For evaluation, we compared
431
- emojiSpace performance with two other pre-trained embeddings and demonstrated
432
- that emojiSpace outperforms both.
433
-
434
- ---------------
435
-
436
- ### 27 Jan 2020 | [CodeReef: an open platform for portable MLOps, reusable automation actions and reproducible benchmarking](https://arxiv.org/abs/2001.07935) | [⬇️](https://arxiv.org/pdf/2001.07935)
437
- *Grigori Fursin, Herve Guillou and Nicolas Essayan*
438
-
439
- We present CodeReef - an open platform to share all the components necessary
440
- to enable cross-platform MLOps (MLSysOps), i.e. automating the deployment of ML
441
- models across diverse systems in the most efficient way. We also introduce the
442
- CodeReef solution - a way to package and share models as non-virtualized,
443
- portable, customizable and reproducible archive files. Such ML packages include
444
- JSON meta description of models with all dependencies, Python APIs, CLI actions
445
- and portable workflows necessary to automatically build, benchmark, test and
446
- customize models across diverse platforms, AI frameworks, libraries, compilers
447
- and datasets. We demonstrate several CodeReef solutions to automatically build,
448
- run and measure object detection based on SSD-Mobilenets, TensorFlow and COCO
449
- dataset from the latest MLPerf inference benchmark across a wide range of
450
- platforms from Raspberry Pi, Android phones and IoT devices to data centers.
451
- Our long-term goal is to help researchers share their new techniques as
452
- production-ready packages along with research papers to participate in
453
- collaborative and reproducible benchmarking, compare the different
454
- ML/software/hardware stacks and select the most efficient ones on a Pareto
455
- frontier using online CodeReef dashboards.
456
-
457
- ---------------
458
-
459
- ### 28 Feb 2024 | [OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web](https://arxiv.org/abs/2402.17553) | [⬇️](https://arxiv.org/pdf/2402.17553)
460
- *Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, Ruslan Salakhutdinov*
461
-
462
- For decades, human-computer interaction has fundamentally been manual. Even
463
- today, almost all productive work done on the computer necessitates human input
464
- at every step. Autonomous virtual agents represent an exciting step in
465
- automating many of these menial tasks. Virtual agents would empower users with
466
- limited technical proficiency to harness the full possibilities of computer
467
- systems. They could also enable the efficient streamlining of numerous computer
468
- tasks, ranging from calendar management to complex travel bookings, with
469
- minimal human intervention. In this paper, we introduce OmniACT, the
470
- first-of-a-kind dataset and benchmark for assessing an agent's capability to
471
- generate executable programs to accomplish computer tasks. Our scope extends
472
- beyond traditional web automation, covering a diverse range of desktop
473
- applications. The dataset consists of fundamental tasks such as "Play the next
474
- song", as well as longer horizon tasks such as "Send an email to John Doe
475
- mentioning the time and place to meet". Specifically, given a pair of screen
476
- image and a visually-grounded natural language task, the goal is to generate a
477
- script capable of fully executing the task. We run several strong baseline
478
- language model agents on our benchmark. The strongest baseline, GPT-4, performs
479
- the best on our benchmark However, its performance level still reaches only 15%
480
- of the human proficiency in generating executable scripts capable of completing
481
- the task, demonstrating the challenge of our task for conventional web agents.
482
- Our benchmark provides a platform to measure and evaluate the progress of
483
- language model agents in automating computer tasks and motivates future work
484
- towards building multimodal models that bridge large language models and the
485
- visual grounding of computer screens.
486
-
487
- ---------------
488
-
489
- ### 24 Mar 2021 | [Proactive Interaction Framework for Intelligent Social Receptionist Robots](https://arxiv.org/abs/2012.04832) | [⬇️](https://arxiv.org/pdf/2012.04832)
490
- *Yang Xue, Fan Wang, Hao Tian, Min Zhao, Jiangyong Li, Haiqing Pan and Yueqiang Dong*
491
-
492
- Proactive human-robot interaction (HRI) allows the receptionist robots to
493
- actively greet people and offer services based on vision, which has been found
494
- to improve acceptability and customer satisfaction. Existing approaches are
495
- either based on multi-stage decision processes or based on end-to-end decision
496
- models. However, the rule-based approaches require sedulous expert efforts and
497
- only handle minimal pre-defined scenarios. On the other hand, existing works
498
- with end-to-end models are limited to very general greetings or few behavior
499
- patterns (typically less than 10). To address those challenges, we propose a
500
- new end-to-end framework, the TransFormer with Visual Tokens for Human-Robot
501
- Interaction (TFVT-HRI). The proposed framework extracts visual tokens of
502
- relative objects from an RGB camera first. To ensure the correct interpretation
503
- of the scenario, a transformer decision model is then employed to process the
504
- visual tokens, which is augmented with the temporal and spatial information. It
505
- predicts the appropriate action to take in each scenario and identifies the
506
- right target. Our data is collected from an in-service receptionist robot in an
507
- office building, which is then annotated by experts for appropriate proactive
508
- behavior. The action set includes 1000+ diverse patterns by combining language,
509
- emoji expression, and body motions. We compare our model with other SOTA
510
- end-to-end models on both offline test sets and online user experiments in
511
- realistic office building environments to validate this framework. It is
512
- demonstrated that the decision model achieves SOTA performance in action
513
- triggering and selection, resulting in more humanness and intelligence when
514
- compared with the previous reactive reception policies.
515
-
516
- ---------------
517
-
518
- ### 15 Mar 2023 | [Sustainable Cloud Services for Verbal Interaction with Embodied Agents](https://arxiv.org/abs/2203.02606) | [⬇️](https://arxiv.org/pdf/2203.02606)
519
- *Lucrezia Grassi, Carmine Tommaso Recchiuto, Antonio Sgorbissa*
520
-
521
- This article presents the design and the implementation of a cloud system for
522
- knowledge-based autonomous interaction devised for Social Robots and other
523
- conversational agents. The system is particularly convenient for low-cost
524
- robots and devices: it can be used as a stand-alone dialogue system or as an
525
- integration to provide "background" dialogue capabilities to any preexisting
526
- Natural Language Processing ability that the robot may already have as part of
527
- its basic skills. By connecting to the cloud, developers are provided with a
528
- sustainable solution to manage verbal interaction through a network connection,
529
- with about 3,000 topics of conversation ready for "chit-chatting" and a library
530
- of pre-cooked plans that only needs to be grounded into the robot's physical
531
- capabilities. The system is structured as a set of REST API endpoints so that
532
- it can be easily expanded by adding new APIs to improve the capabilities of the
533
- clients connected to the cloud. Another key feature of the system is that it
534
- has been designed to make the development of its clients straightforward: in
535
- this way, multiple robots and devices can be easily endowed with the capability
536
- of autonomously interacting with the user, understanding when to perform
537
- specific actions, and exploiting all the information provided by cloud
538
- services. The article outlines and discusses the results of the experiments
539
- performed to assess the system's performance in terms of response time, paving
540
- the way for its use both for research and market solutions. Links to
541
- repositories with clients for ROS and popular robots such as Pepper and NAO are
542
- available on request.
543
-
544
- ---------------<s>[INST] Context:
545
- 1. <b> AgentAvatar: Disentangling Planning, Driving and Rendering for Photorealistic Avatar Agents </b>
546
- Abstract: In this study, our goal is to create interactive avatar agents that can
547
- autonomously plan and animate nuanced facial movements realistically, from both
548
- visual and behavioral perspectives. Given high-level inputs about the
549
- environment and agent profile, our framework harnesses LLMs to produce a series
550
- of detailed text descriptions of the avatar agents' facial motions. These
551
- descriptions are then processed by our task-agnostic driving engine into motion
552
- token sequences, which are subsequently converted into continuous motion
553
- embeddings that are further consumed by our standalone neural-based renderer to
554
- generate the final photorealistic avatar animations. These streamlined
555
- processes allow our framework to adapt to a variety of non-verbal avatar
556
- interactions, both monadic and dyadic. Our extensive study, which includes
557
- experiments on both newly compiled and existing datasets featuring two types of
558
- agents -- one capable of monadic interaction with the environment, and the
559
- other designed for dyadic conversation -- validates the effectiveness and
560
- versatility of our approach. To our knowledge, we advanced a leap step by
561
- combining LLMs and neural rendering for generalized non-verbal prediction and
562
- photo-realistic rendering of avatar agents.
563
- 2. <b> Caption Anything: Interactive Image Description with Diverse Multimodal Controls </b>
564
- Abstract: Controllable image captioning is an emerging multimodal topic that aims to
565
- describe the image with natural language following human purpose,
566
- $\textit{e.g.}$, looking at the specified regions or telling in a particular
567
- text style. State-of-the-art methods are trained on annotated pairs of input
568
- controls and output captions. However, the scarcity of such well-annotated
569
- multimodal data largely limits their usability and scalability for interactive
570
- AI systems. Leveraging unimodal instruction-following foundation models is a
571
- promising alternative that benefits from broader sources of data. In this
572
- paper, we present Caption AnyThing (CAT), a foundation model augmented image
573
- captioning framework supporting a wide range of multimodel controls: 1) visual
574
- controls, including points, boxes, and trajectories; 2) language controls, such
575
- as sentiment, length, language, and factuality. Powered by Segment Anything
576
- Model (SAM) and ChatGPT, we unify the visual and language prompts into a
577
- modularized framework, enabling the flexible combination between different
578
- controls. Extensive case studies demonstrate the user intention alignment
579
- capabilities of our framework, shedding light on effective user interaction
580
- modeling in vision-language applications. Our code is publicly available at
581
- https://github.com/ttengwang/Caption-Anything.
582
- 3. <b> Kosmos-2: Grounding Multimodal Large Language Models to the World </b>
583
- Abstract: We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
584
- capabilities of perceiving object descriptions (e.g., bounding boxes) and
585
- grounding text to the visual world. Specifically, we represent refer
586
- expressions as links in Markdown, i.e., ``[text span](bounding boxes)'', where
587
- object descriptions are sequences of location tokens. Together with multimodal
588
- corpora, we construct large-scale data of grounded image-text pairs (called
589
- GrIT) to train the model. In addition to the existing capabilities of MLLMs
590
- (e.g., perceiving general modalities, following instructions, and performing
591
- in-context learning), Kosmos-2 integrates the grounding capability into
592
- downstream applications. We evaluate Kosmos-2 on a wide range of tasks,
593
- including (i) multimodal grounding, such as referring expression comprehension,
594
- and phrase grounding, (ii) multimodal referring, such as referring expression
595
- generation, (iii) perception-language tasks, and (iv) language understanding
596
- and generation. This work lays out the foundation for the development of
597
- Embodiment AI and sheds light on the big convergence of language, multimodal
598
- perception, action, and world modeling, which is a key step toward artificial
599
- general intelligence. Code and pretrained models are available at
600
- https://aka.ms/kosmos-2.
601
- 4. <b> ScreenAI: A Vision-Language Model for UI and Infographics Understanding </b>
602
- Abstract: Screen user interfaces (UIs) and infographics, sharing similar visual
603
- language and design principles, play important roles in human communication and
604
- human-machine interaction. We introduce ScreenAI, a vision-language model that
605
- specializes in UI and infographics understanding. Our model improves upon the
606
- PaLI architecture with the flexible patching strategy of pix2struct and is
607
- trained on a unique mixture of datasets. At the heart of this mixture is a
608
- novel screen annotation task in which the model has to identify the type and
609
- location of UI elements. We use these text annotations to describe screens to
610
- Large Language Models and automatically generate question-answering (QA), UI
611
- navigation, and summarization training datasets at scale. We run ablation
612
- studies to demonstrate the impact of these design choices. At only 5B
613
- parameters, ScreenAI achieves new state-of-the-artresults on UI- and
614
- infographics-based tasks (Multi-page DocVQA, WebSRC, MoTIF and Widget
615
- Captioning), and new best-in-class performance on others (Chart QA, DocVQA, and
616
- InfographicVQA) compared to models of similar size. Finally, we release three
617
- new datasets: one focused on the screen annotation task and two others focused
618
- on question answering.
619
- 5. <b> ThingTalk: An Extensible, Executable Representation Language for Task-Oriented Dialogues </b>
620
- Abstract: Task-oriented conversational agents rely on semantic parsers to translate
621
- natural language to formal representations. In this paper, we propose the
622
- design and rationale of the ThingTalk formal representation, and how the design
623
- improves the development of transactional task-oriented agents.
624
- ThingTalk is built on four core principles: (1) representing user requests
625
- directly as executable statements, covering all the functionality of the agent,
626
- (2) representing dialogues formally and succinctly to support accurate
627
- contextual semantic parsing, (3) standardizing types and interfaces to maximize
628
- reuse between agents, and (4) allowing multiple, independently-developed agents
629
- to be composed in a single virtual assistant. ThingTalk is developed as part of
630
- the Genie Framework that allows developers to quickly build transactional
631
- agents given a database and APIs.
632
- We compare ThingTalk to existing representations: SMCalFlow, SGD, TreeDST.
633
- Compared to the others, the ThingTalk design is both more general and more
634
- cost-effective. Evaluated on the MultiWOZ benchmark, using ThingTalk and
635
- associated tools yields a new state of the art accuracy of 79% turn-by-turn.
636
- 6. <b> 3D-GPT: Procedural 3D Modeling with Large Language Models </b>
637
- Abstract: In the pursuit of efficient automated content creation, procedural
638
- generation, leveraging modifiable parameters and rule-based systems, emerges as
639
- a promising approach. Nonetheless, it could be a demanding endeavor, given its
640
- intricate nature necessitating a deep understanding of rules, algorithms, and
641
- parameters. To reduce workload, we introduce 3D-GPT, a framework utilizing
642
- large language models~(LLMs) for instruction-driven 3D modeling. 3D-GPT
643
- positions LLMs as proficient problem solvers, dissecting the procedural 3D
644
- modeling tasks into accessible segments and appointing the apt agent for each
645
- task. 3D-GPT integrates three core agents: the task dispatch agent, the
646
- conceptualization agent, and the modeling agent. They collaboratively achieve
647
- two objectives. First, it enhances concise initial scene descriptions, evolving
648
- them into detailed forms while dynamically adapting the text based on
649
- subsequent instructions. Second, it integrates procedural generation,
650
- extracting parameter values from enriched text to effortlessly interface with
651
- 3D software for asset creation. Our empirical investigations confirm that
652
- 3D-GPT not only interprets and executes instructions, delivering reliable
653
- results but also collaborates effectively with human designers. Furthermore, it
654
- seamlessly integrates with Blender, unlocking expanded manipulation
655
- possibilities. Our work highlights the potential of LLMs in 3D modeling,
656
- offering a basic framework for future advancements in scene generation and
657
- animation.
658
- 7. <b> Embodied Task Planning with Large Language Models </b>
659
- Abstract: Equipping embodied agents with commonsense is important for robots to
660
- successfully complete complex human instructions in general environments.
661
- Recent large language models (LLM) can embed rich semantic knowledge for agents
662
- in plan generation of complex tasks, while they lack the information about the
663
- realistic world and usually yield infeasible action sequences. In this paper,
664
- we propose a TAsk Planing Agent (TaPA) in embodied tasks for grounded planning
665
- with physical scene constraint, where the agent generates executable plans
666
- according to the existed objects in the scene by aligning LLMs with the visual
667
- perception models. Specifically, we first construct a multimodal dataset
668
- containing triplets of indoor scenes, instructions and action plans, where we
669
- provide the designed prompts and the list of existing objects in the scene for
670
- GPT-3.5 to generate a large number of instructions and corresponding planned
671
- actions. The generated data is leveraged for grounded plan tuning of
672
- pre-trained LLMs. During inference, we discover the objects in the scene by
673
- extending open-vocabulary object detectors to multi-view RGB images collected
674
- in different achievable locations. Experimental results show that the generated
675
- plan from our TaPA framework can achieve higher success rate than LLaVA and
676
- GPT-3.5 by a sizable margin, which indicates the practicality of embodied task
677
- planning in general and complex environments.
678
- 8. <b> Joint Representation Learning for Text and 3D Point Cloud </b>
679
- Abstract: Recent advancements in vision-language pre-training (e.g. CLIP) have shown
680
- that vision models can benefit from language supervision. While many models
681
- using language modality have achieved great success on 2D vision tasks, the
682
- joint representation learning of 3D point cloud with text remains
683
- under-explored due to the difficulty of 3D-Text data pair acquisition and the
684
- irregularity of 3D data structure. In this paper, we propose a novel Text4Point
685
- framework to construct language-guided 3D point cloud models. The key idea is
686
- utilizing 2D images as a bridge to connect the point cloud and the language
687
- modalities. The proposed Text4Point follows the pre-training and fine-tuning
688
- paradigm. During the pre-training stage, we establish the correspondence of
689
- images and point clouds based on the readily available RGB-D data and use
690
- contrastive learning to align the image and point cloud representations.
691
- Together with the well-aligned image and text features achieved by CLIP, the
692
- point cloud features are implicitly aligned with the text embeddings. Further,
693
- we propose a Text Querying Module to integrate language information into 3D
694
- representation learning by querying text embeddings with point cloud features.
695
- For fine-tuning, the model learns task-specific 3D representations under
696
- informative language guidance from the label set without 2D images. Extensive
697
- experiments demonstrate that our model shows consistent improvement on various
698
- downstream tasks, such as point cloud semantic segmentation, instance
699
- segmentation, and object detection. The code will be available here:
700
- https://github.com/LeapLabTHU/Text4Point
701
- 9. <b> Executable Code Actions Elicit Better LLM Agents </b>
702
- Abstract: Large Language Model (LLM) agents, capable of performing a broad range of
703
- actions, such as invoking tools and controlling robots, show great potential in
704
- tackling real-world challenges. LLM agents are typically prompted to produce
705
- actions by generating JSON or text in a pre-defined format, which is usually
706
- limited by constrained action space (e.g., the scope of pre-defined tools) and
707
- restricted flexibility (e.g., inability to compose multiple tools). This work
708
- proposes to use executable Python code to consolidate LLM agents' actions into
709
- a unified action space (CodeAct). Integrated with a Python interpreter, CodeAct
710
- can execute code actions and dynamically revise prior actions or emit new
711
- actions upon new observations through multi-turn interactions. Our extensive
712
- analysis of 17 LLMs on API-Bank and a newly curated benchmark shows that
713
- CodeAct outperforms widely used alternatives (up to 20% higher success rate).
714
- The encouraging performance of CodeAct motivates us to build an open-source LLM
715
- agent that interacts with environments by executing interpretable code and
716
- collaborates with users using natural language. To this end, we collect an
717
- instruction-tuning dataset CodeActInstruct that consists of 7k multi-turn
718
- interactions using CodeAct. We show that it can be used with existing data to
719
- improve models in agent-oriented tasks without compromising their general
720
- capability. CodeActAgent, finetuned from Llama2 and Mistral, is integrated with
721
- Python interpreter and uniquely tailored to perform sophisticated tasks (e.g.,
722
- model training) using existing libraries and autonomously self-debug.
723
- 10. <b> VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks </b>
724
- Abstract: Autonomous agents capable of planning, reasoning, and executing actions on
725
- the web offer a promising avenue for automating computer tasks. However, the
726
- majority of existing benchmarks primarily focus on text-based agents,
727
- neglecting many natural tasks that require visual information to effectively
728
- solve. Given that most computer interfaces cater to human perception, visual
729
- information often augments textual data in ways that text-only models struggle
730
- to harness effectively. To bridge this gap, we introduce VisualWebArena, a
731
- benchmark designed to assess the performance of multimodal web agents on
732
- realistic \textit{visually grounded tasks}. VisualWebArena comprises of a set
733
- of diverse and complex web-based tasks that evaluate various capabilities of
734
- autonomous multimodal agents. To perform on this benchmark, agents need to
735
- accurately process image-text inputs, interpret natural language instructions,
736
- and execute actions on websites to accomplish user-defined objectives. We
737
- conduct an extensive evaluation of state-of-the-art LLM-based autonomous
738
- agents, including several multimodal models. Through extensive quantitative and
739
- qualitative analysis, we identify several limitations of text-only LLM agents,
740
- and reveal gaps in the capabilities of state-of-the-art multimodal language
741
- agents. VisualWebArena provides a framework for evaluating multimodal
742
- autonomous language agents, and offers insights towards building stronger
743
- autonomous agents for the web.
744
  """)
 