MiniGPT-4
MiniGPT-4 is a vision-language model that enhances vision-language understanding by aligning a frozen visual encoder with a frozen large language model, Vicuna, using just one projection layer.
MiniGPT-4 possesses many capabilities similar to those exhibited by GPT-4, such as generating detailed image descriptions and creating websites from hand-written drafts. Moreover, it exhibits emergent capabilities, such as writing stories and poems inspired by given images, providing solutions to problems shown in images, and teaching users how to cook based on food photos.
MiniGPT-4 only requires training the linear projection layer to align the visual features with Vicuna; the visual encoder and the language model remain frozen. This makes training highly computationally efficient, with the alignment stage using approximately 5 million image-text pairs.
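The training setup above can be sketched as follows. This is a minimal illustration, not MiniGPT-4's actual code: the component names and parameter counts are hypothetical, and the point is simply that only the projection layer's weights receive updates.

```python
# Hypothetical sketch of which components are trainable in MiniGPT-4-style
# alignment. Names and parameter counts are illustrative placeholders.
components = {
    "vision_encoder (ViT + Q-Former)": {"params": 1_000_000_000, "frozen": True},
    "projection_layer":                {"params": 768 * 4096,    "frozen": False},
    "llm (Vicuna)":                    {"params": 7_000_000_000, "frozen": True},
}

# Collect only the parameters that would be passed to an optimizer.
trainable = {name: c["params"] for name, c in components.items() if not c["frozen"]}
print(trainable)  # only the projection layer is updated
```

Because the optimizer only ever sees the projection layer's weights, the memory and compute cost of training is a tiny fraction of fine-tuning the full model.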
Pretraining on raw image-text pairs alone can produce unnatural language outputs that lack coherence, including repetition and fragmented sentences. To address this problem, MiniGPT-4 is fine-tuned on a curated, high-quality, well-aligned dataset using a conversational template. This second stage proves crucial for improving the model's generation reliability and overall usability.
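A conversational template of this kind can be sketched as below. The exact wording is illustrative rather than MiniGPT-4's verbatim template; the key idea is that projected image embeddings are spliced into a human/assistant dialogue turn at a designated placeholder.

```python
# Hedged sketch of a conversational fine-tuning template; the phrasing and
# the <ImageFeature> placeholder name are assumptions for illustration.
TEMPLATE = "###Human: <Img>{image_tokens}</Img> {instruction} ###Assistant: "

prompt = TEMPLATE.format(
    image_tokens="<ImageFeature>",  # replaced by projected visual embeddings at runtime
    instruction="Describe this image in detail.",
)
print(prompt)
```

During fine-tuning, the model is trained to complete the text following `###Assistant:` with a coherent, detailed response, which is what repairs the fragmented outputs seen after the first stage.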
MiniGPT-4’s design is based on a vision encoder consisting of a pre-trained ViT and Q-Former, a single linear projection layer, and the Vicuna large language model.
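The data flow through these three components can be sketched with toy stand-ins. This is an assumption-laden illustration, not the real implementation: the frozen encoder is mocked with random features, and the dimensions (32 query tokens of size 768 from the Q-Former, a 4096-dimensional LLM embedding space) are representative rather than guaranteed for every model size.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_vision_encoder(image):
    # Stand-in for the frozen ViT + Q-Former: maps an image to a fixed
    # number of query tokens (here 32 tokens of dimension 768).
    return rng.standard_normal((32, 768))

# The single trainable linear projection into the LLM's embedding space.
W = rng.standard_normal((768, 4096)) * 0.01

def project(visual_tokens):
    # Projected tokens are consumed by the frozen LLM like word embeddings.
    return visual_tokens @ W

visual_tokens = frozen_vision_encoder(None)  # placeholder "image"
llm_inputs = project(visual_tokens)
print(llm_inputs.shape)  # (32, 4096)
```

The projected tokens are then prepended to the text embeddings of the prompt, so the frozen Vicuna model treats the image as a short sequence of soft tokens.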