AutoMM Detection - Quick Start with Foundation Model on Open Vocabulary Detection (OVD)#

In this section, our goal is to use a foundation model in object detection to detect novel classes defined by an unbounded (open) vocabulary.

Setting up the imports#

To start, let’s import MultiModalPredictor, and also make sure groundingdino is installed:

from autogluon.multimodal import MultiModalPredictor
try:
    import groundingdino
except ImportError:
    try:
        from pip._internal import main as pipmain
    except ImportError:
        from pip import main as pipmain  # for old pip version
    pipmain(['install', '--user', 'git+https://github.com/IDEA-Research/GroundingDINO.git'])  # equals to "!pip install git+https://github.com/IDEA-Research/GroundingDINO.git"

Prepare sample image#

Let’s use an image of Seattle’s street view to demo:

from IPython.display import Image, display
from autogluon.multimodal import download

sample_image_url = "https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg"
sample_image_path = download(sample_image_url)

display(Image(filename=sample_image_path))

Downloading 49004630088_d15a9be500_6k.jpg from https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg...

../../../../_images/95611e9613c6886f3a45d3840184971041642be0a90f761ddb2dece775eb6aad.jpg

Creating the MultiModalPredictor#

We create the MultiModalPredictor and specify the problem_type to "open_vocabulary_object_detection". We set the preset as "best_quality", which uses a SwinB as backbone. This preset gives us higher accuracy for detection. We also provide presets "high_quality" and "medium_quality" with SwinT as backbone, faster but also with lower performance.

# Init predictor
predictor = MultiModalPredictor(problem_type="open_vocabulary_object_detection", presets = "best_quality")

Inference#

To run inference on the image, perform:

pred = predictor.predict(
    {
        "image": [sample_image_path],
        "prompt": ["Pink notice. Green sign. One Way sign. People group. Tower crane in construction. Lamp post. Glass skyscraper."],
    },
    as_pandas=True,
)

print(pred)

Downloading groundingdino_swinb_cogcoor.pth from https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth...
final text_encoder_type: bert-base-uncased

/home/ci/opt/venv/lib/python3.10/site-packages/torch/functional.py:504: UserWarning: torch.meshgrid: in an upcoming release, it will be required to pass the indexing argument. (Triggered internally at ../aten/src/ATen/native/TensorShape.cpp:3483.)
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
/home/ci/opt/venv/lib/python3.10/site-packages/transformers/modeling_utils.py:881: FutureWarning: The `device` argument is deprecated and will be removed in v5 of Transformers.
  warnings.warn(
/home/ci/opt/venv/lib/python3.10/site-packages/torch/utils/checkpoint.py:31: UserWarning: None of the inputs have requires_grad=True. Gradients will be None
  warnings.warn("None of the inputs have requires_grad=True. Gradients will be None")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 pred = predictor.predict(
   {
       "image": [sample_image_path],
       "prompt": ["Pink notice. Green sign. One Way sign. People group. Tower crane in construction. Lamp post. Glass skyscraper."],
   },
   as_pandas=True,
)
print(pred)

File ~/autogluon/multimodal/src/autogluon/multimodal/predictor.py:586, in MultiModalPredictor.predict(self, data, candidate_data, id_mappings, as_pandas, realtime, save_results)
def predict(
   self,
   data: Union[pd.DataFrame, dict, list, str],
   (...)
   save_results: Optional[bool] = None,
):
   """
   Predict values for the label column of new data.

   (...)
   Array of predictions, one corresponding to each row in given dataset.
   """
--> 586     return self._learner.predict(
       data=data,
       candidate_data=candidate_data,
       as_pandas=as_pandas,
       realtime=realtime,
       save_results=save_results,
       id_mappings=id_mappings,
   )

File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:1922, in BaseLearner.predict(self, data, candidate_data, as_pandas, realtime, **kwargs)
else:
   outputs = self.predict_per_run(
       data=data,
       requires_label=False,
       realtime=realtime,
   )
-> 1922     logits = extract_from_output(outputs=outputs, ret_type=ret_type)
   if self._df_preprocessor:
       pred = self._df_preprocessor.transform_prediction(
           y_pred=logits,
       )

File ~/autogluon/multimodal/src/autogluon/multimodal/utils/inference.py:61, in extract_from_output(outputs, ret_type, as_ndarray)
if ret_type == LOGITS:
   logits = [ele[LOGITS] for ele in outputs]
---> 61     ret = torch.cat(logits).nan_to_num(nan=-1e4)
elif ret_type == PROBABILITY:
   probability = [ele[PROBABILITY] for ele in outputs]

TypeError: expected Tensor as element 0 in argument 0, but got list

The output pred is a pandas DataFrame that has two columns, image and bboxes.

In image, each row contains the image path

In bboxes, each row is a list of dictionaries, each one representing the prediction for an object in the image: {"class": <predicted_class_name>, "bbox": [x1, y1, x2, y2], "score": <confidence_score>}, for example:

print(pred["bboxes"][0][0])

!pip install opencv-python

To visualize results, run the following:

from autogluon.multimodal.utils import Visualizer

conf_threshold = 0.2  # Specify a confidence threshold to filter out unwanted boxes
image_result = pred.iloc[0]

img_path = image_result.image  # Select an image to visualize

visualizer = Visualizer(img_path)  # Initialize the Visualizer
out = visualizer.draw_instance_predictions(image_result, conf_threshold=conf_threshold)  # Draw detections
visualized = out.get_image()  # Get the visualized image

from PIL import Image
from IPython.display import display
img = Image.fromarray(visualized, 'RGB')
display(img)

Other Examples#

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization#

To learn how to customize AutoMM, please refer to Customize AutoMM.

Citation#

@misc{liu2023grounding,
      title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection}, 
      author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
      year={2023},
      eprint={2303.05499},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}