Unlocking Vision-Language Unity with Florence-2

Microsoft has open-sourced a lightweight vision-language model named Florence-2 under the MIT license. Despite its small size, Florence-2 demonstrates strong zero-shot and fine-tuned performance across tasks such as captioning, object detection, grounding, and segmentation. Thanks to its training on the extensive FLD-5B dataset, it achieves results comparable to those of much larger models like Kosmos-2. 

Unified Representation for Diverse Vision Tasks 

Figure 1. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. 

The authors of Florence-2 take an innovative approach by unifying the representation of diverse vision tasks within a single model. This strategy eliminates the need to train separate models for individual tasks, a significant step forward for both computer vision and natural language processing. 

 

Building the Comprehensive FLD-5B Dataset 


Figure 2. Florence-2 consists of an image encoder and standard multi-modality encoder-decoder. We train Florence-2 on our FLD-5B data in a unified multitask learning paradigm, resulting in a generalist vision foundation model, which can perform various vision tasks. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. 

Creating a unified dataset for training Florence-2 was a significant challenge: existing datasets such as SA-1B and COCO either covered few tasks or were relatively small. To overcome this, the authors automated the annotation process using specialist models, producing FLD-5B. This comprehensive dataset contains 126 million images and 5 billion annotations, including boxes, masks, and various captions, providing a broad foundation for training a unified model. The release of FLD-5B, announced at CVPR 2024, should further accelerate progress in computer vision and natural language processing. 

Model Architecture 


Figure 3. Florence-2 data engine consists of three essential phases: (1) initial annotation employing specialist models, (2) data filtering to correct errors and remove irrelevant annotations, and (3) an iterative process for data refinement. Our final dataset (FLD-5B) of over 5B annotations contains 126M images, 500M text annotations, 1.3B region-text annotations, and 3.6B text-phrase-region annotations. Source: Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. 

Florence-2's architecture is designed to handle a variety of vision tasks efficiently. The model accepts images and task prompts as input, generating the desired outputs in text format. It utilizes a DaViT vision encoder to convert images into visual token embeddings, which are then concatenated with BERT-generated text embeddings. A transformer-based multi-modal encoder-decoder processes these combined embeddings to generate the final response. For tasks requiring region-specific details, such as object detection or instance segmentation, the tokenizer's vocabulary includes location tokens to allow precise localization within the images. 
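For region-level tasks, the decoder emits these location tokens directly in its text output. Below is a minimal sketch of how pixel coordinates can be mapped to discrete location tokens; the choice of 1,000 bins per axis and the `<loc_…>` token naming are illustrative assumptions, not taken from the released model code.

```python
def box_to_location_tokens(box, image_width, image_height, num_bins=1000):
    """Convert an (x1, y1, x2, y2) pixel box into discrete location tokens."""
    x1, y1, x2, y2 = box

    def quantize(value, size):
        # Normalize the coordinate to [0, 1), then map it to an integer bin.
        bin_index = int(value / size * num_bins)
        return min(num_bins - 1, max(0, bin_index))

    return [
        f"<loc_{quantize(x1, image_width)}>",
        f"<loc_{quantize(y1, image_height)}>",
        f"<loc_{quantize(x2, image_width)}>",
        f"<loc_{quantize(y2, image_height)}>",
    ]

# Example: a box covering the left half of a 640x480 image.
tokens = box_to_location_tokens((0, 0, 320, 480), 640, 480)
# tokens -> ["<loc_0>", "<loc_0>", "<loc_500>", "<loc_999>"]
```

Because boxes and masks become ordinary token sequences, detection and segmentation reduce to the same text-generation problem as captioning, which is what lets one decoder serve all tasks.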

Capabilities and Performance 

Florence-2 is available in two versions: Florence-2-base with 0.23 billion parameters and Florence-2-large with 0.77 billion parameters. This compact size makes deployment feasible even on mobile devices. Remarkably, Florence-2 achieves better zero-shot performance than the much larger Kosmos-2 across all benchmarks. It excels at tasks ranging from visual grounding to OCR with region-level accuracy, and it supports open-vocabulary object detection, identifying objects from arbitrary categories. 

Florence-2's unified representation significantly improves efficiency and performance, making it a versatile tool for a broad spectrum of vision tasks. Its lightweight architecture, combined with robust capabilities, sets a new standard for efficiency and versatility in vision-language modeling. 

Explore Florence-2 and experience its capabilities firsthand via the HF Space, a collaborative platform for machine learning projects, or Google Colab, a cloud-based development environment for Python. Both platforms provide a user-friendly interface for testing Florence-2 on various tasks and datasets. 
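Beyond the hosted demos, the model can also be loaded locally. Below is a minimal sketch using the Hugging Face transformers library; the checkpoint name `microsoft/Florence-2-base` and task tokens such as `<CAPTION>` follow the public model card, but verify them against the current repository before relying on this.

```python
def build_prompt(task_token: str, text: str = "") -> str:
    """Florence-2 is steered by a task token, optionally followed by free
    text (e.g. a phrase to ground for <CAPTION_TO_PHRASE_GROUNDING>)."""
    return task_token + text

def main():
    # Heavy imports and the model download happen only when main() is called.
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    model_id = "microsoft/Florence-2-base"  # assumed checkpoint name
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open("example.jpg")  # any RGB image
    prompt = build_prompt("<CAPTION>")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
    )
    caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    print(caption)

# main()  # uncomment to run; this downloads the checkpoint
```

Swapping the task token (for example to `<OD>` for object detection) changes what the same model produces, with no architectural changes required.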

About the author

James Tucker