MiniGPT-4 is a vision-language model that uses a large language model (LLM) to enhance vision-language understanding. It exhibits many of the multi-modal generation capabilities demonstrated by GPT-4, such as generating detailed image descriptions and creating websites from handwritten drafts. MiniGPT-4 can also write stories and poems inspired by images, propose solutions to problems shown in pictures, and give cooking instructions based on food photos. Its architecture consists of a pretrained visual encoder, a single linear projection layer, and the advanced Vicuna LLM. Only the linear layer needs to be trained to align the visual features with Vicuna, which requires approximately 5 million aligned image-text pairs.
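
The data flow described above can be illustrated with a minimal sketch. It is not the MiniGPT-4 implementation: the encoder is a hypothetical stub, the dimensions are toy values chosen for readability, and prepending the visual tokens to the text embeddings is a simplifying assumption about how the projected features reach the LLM. The point is only that a single linear map converts each visual feature vector into a vector of the width the LLM expects, and that this map is the sole trainable part.

```python
import random

# Toy sizes for illustration only; the real feature widths are not given in the text.
NUM_VISUAL_TOKENS = 4
VISUAL_DIM = 6
LLM_DIM = 8


def frozen_visual_encoder(image_id):
    """Hypothetical stand-in for the pretrained visual encoder (never updated)."""
    rng = random.Random(image_id)  # deterministic fake features per image
    return [
        [rng.uniform(-1.0, 1.0) for _ in range(VISUAL_DIM)]
        for _ in range(NUM_VISUAL_TOKENS)
    ]


class LinearProjection:
    """The only trainable component: y = W x + b, applied to each visual token."""

    def __init__(self, in_dim, out_dim, seed=0):
        rng = random.Random(seed)
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)
        ]
        self.bias = [0.0] * out_dim

    def __call__(self, tokens):
        return [
            [
                sum(w * x for w, x in zip(row, token)) + b
                for row, b in zip(self.weight, self.bias)
            ]
            for token in tokens
        ]


def build_llm_input(image_id, text_embeddings, projection):
    """Project visual tokens to the LLM width and place them before the text."""
    visual_tokens = frozen_visual_encoder(image_id)
    return projection(visual_tokens) + text_embeddings


projection = LinearProjection(VISUAL_DIM, LLM_DIM)
text_embeddings = [[0.0] * LLM_DIM for _ in range(3)]  # placeholder prompt embeddings
llm_input = build_llm_input(1, text_embeddings, projection)
print(len(llm_input), len(llm_input[0]))
```

In this sketch, training would adjust only `projection.weight` and `projection.bias`, while the encoder stub and the LLM (not modeled here) stay fixed, which is why so few parameters need to be learned from the image-text pairs.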

