What is Visual ChatGPT and how does it work?

Visual ChatGPT is a new system from Microsoft that combines ChatGPT with visual foundation models (VFMs) such as Visual Transformers, ControlNet, and Stable Diffusion. The system lets interaction with ChatGPT go beyond language, taking in and producing images as well as text.

How does it work?

ChatGPT attracts interest from a variety of disciplines because it provides a language interface with exceptional conversational competence and reasoning ability across many fields. However, because it is trained purely on language, ChatGPT currently cannot process or generate images. Visual foundation models such as Visual Transformers and Stable Diffusion, by contrast, demonstrate exceptional visual comprehension and generation skills, but each handles only specific tasks with a single round of fixed inputs and outputs.

To bridge this gap, Microsoft researchers created a system called Visual ChatGPT. It incorporates a variety of visual foundation models and lets users interact with ChatGPT through images as well as text. It can:

1) send and receive messages as well as images;

2) handle complex visual questions or visual editing instructions that require multiple AI models to cooperate over multiple steps;

3) accept feedback and requests for corrections. The researchers designed a series of prompts to inject visual-model information into ChatGPT, covering models with multiple inputs and outputs as well as models that require visual feedback. Experiments show that Visual ChatGPT makes it possible to explore the visual capabilities of ChatGPT with the help of visual foundation models.
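The "prompt injection" idea above can be sketched in a few lines: each VFM is described to ChatGPT as plain text, so the language model can decide, from the description alone, when a visual tool is needed. The tool names and descriptions below are illustrative stand-ins, not the actual Visual ChatGPT tool list.

```python
# Hypothetical sketch: describe each visual foundation model (VFM) to
# ChatGPT as text so it can choose tools. Names here are illustrative.
TOOLS = {
    "image_captioning": "Describes the content of an input image in text.",
    "depth_to_image": "Generates a new image from a depth map and a text prompt.",
    "style_transfer": "Re-renders an input image in a named artistic style.",
}

def build_system_prompt(tools: dict) -> str:
    """Assemble a system prompt listing every available VFM."""
    lines = ["You can use the following visual tools:"]
    for name, description in tools.items():
        lines.append(f"- {name}: {description}")
    lines.append("Reply with a tool name and its arguments when an image task is needed.")
    return "\n".join(lines)

print(build_system_prompt(TOOLS))
```

Because the tool catalog is just text, adding a new VFM only requires adding one more description line rather than retraining anything.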

What changed?

Large language models (LLMs) like T5, BLOOM, and GPT-3 have made significant progress in recent years. The training for ChatGPT, which is based on InstructGPT, teaches it to respond appropriately to follow-up questions, maintain conversational context, and produce accurate responses. However, despite its impressive capabilities, ChatGPT has been limited in its ability to process visual data because it has only been trained with a single language modality.

Due to their capacity to interpret and produce intricate images, VFMs have demonstrated tremendous potential in computer vision. In human-machine interaction, however, VFMs are less flexible than conversational language models because of the limits imposed by fixed task definitions and predetermined input-output formats.

A system with the ability to perceive and generate visual information that is comparable to ChatGPT can be built by training a multimodal conversational model. However, constructing such a system would necessitate a significant amount of computing power and data.

A potential solution?

A recent Microsoft study suggests that Visual ChatGPT, which uses text and prompt chaining to interact with vision models, could address this issue. Rather than training a brand-new multimodal ChatGPT, the researchers built Visual ChatGPT on top of ChatGPT and attached several VFMs. A Prompt Manager mediates between ChatGPT and these VFMs. It has the following features:

– Sets the formats for the input and output and informs ChatGPT of the capabilities of each VFM.

– Handles conflicts, priorities, and histories among the various visual foundation models.

– Converts a variety of visual data, including mask matrices, depth images, and PNG images, into a language format that ChatGPT can understand.

By integrating the Prompt Manager, ChatGPT can invoke these VFMs repeatedly and learn from their responses until it either meets the user's needs or reaches the end state.
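That iterative loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation: `plan_next_step` stands in for ChatGPT choosing a tool, each tool is a stub for a real VFM, and the image is represented by a text handle.

```python
# Minimal sketch (not the real implementation) of the Prompt Manager loop:
# ChatGPT proposes a tool, the VFM runs, and its result is fed back as
# text until no further tool is needed or a step budget is exhausted.
from typing import Callable, Dict, Optional

def prompt_manager_loop(
    plan_next_step: Callable[[str], Optional[str]],  # stand-in for ChatGPT
    tools: Dict[str, Callable[[str], str]],          # VFM name -> callable stub
    state: str,                                      # text handle for the current image
    max_steps: int = 10,
) -> str:
    """Invoke VFMs as requested until an end state is reached."""
    for _ in range(max_steps):
        tool_name = plan_next_step(state)
        if tool_name is None:            # end state: user request satisfied
            break
        state = tools[tool_name](state)  # run the VFM, feed its output back as text
    return state
```

The `max_steps` cap is a defensive choice in this sketch: since the planner and the tools feed each other, a bound keeps a confused plan from looping forever.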

How does it work?

Take, for instance, the scenario in which a user uploads a picture of a black elephant and includes a difficult-to-understand instruction such as “Please make a white African elephant in the picture and then build it step by step like a cartoon.”

Visual ChatGPT initiates the execution of linked visual foundation models with the assistance of the Prompt Manager. Specifically, it uses a depth-to-image model to transform the depth information into a picture of a white elephant, and a style-transfer VFM based on a Stable Diffusion model to give the image the appearance of a cartoon.

The Prompt Manager serves as ChatGPT’s dispatcher in this processing chain, supplying visual representations and tracking how the information changes. For instance, Visual ChatGPT stops the pipeline and displays the final result once it receives the “cartoon” hint from the Prompt Manager.
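One way such a chain can stay inside a text-only conversation is to pass images between steps as filenames: each VFM reads the previous step's output file and writes a new one, so the chat history only ever carries text. The sketch below illustrates that bookkeeping for the elephant example; the function and step names are hypothetical stand-ins, and no real image generation happens.

```python
# Illustrative trace of the elephant example: each step derives a new
# output filename from its input, so images flow through the chat as text.
import os

def fake_vfm(step_name: str, in_path: str) -> str:
    """Stand-in for a real VFM call: derive the next output filename."""
    root, ext = os.path.splitext(in_path)
    out_path = f"{root}_{step_name}{ext}"
    # a real VFM would write the generated image to out_path here
    return out_path

path = "black_elephant.png"
path = fake_vfm("white_depth2image", path)  # depth map -> white elephant
path = fake_vfm("cartoon_style", path)      # style transfer -> cartoon
print(path)  # -> black_elephant_white_depth2image_cartoon_style.png
```

A side effect of this naming scheme is that the final filename records the whole editing history, which is handy when a user later asks to revise an earlier step.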


The researchers note in their work that VFM failures and prompt instability are areas of concern, since both lead to unsatisfactory generation results. A self-correcting module is therefore needed to check that execution outputs match human intentions and to make the necessary corrections. At the same time, the model’s propensity for constant course correction could increase inference time. The team intends to investigate this issue in a subsequent study.

Basically, a single image contains a lot of information, most notably form, color, and shape. The system needs to understand what the user wants and how to render the image correctly. While visual foundation models have made considerable progress, it is still early days to ask generative AI to create and modify pictures from a simple instruction. Having said that, Visual ChatGPT could be a fascinating experiment toward that goal.
