AutoMM Detection - Quick Start with Foundation Model on Open Vocabulary Detection (OVD)#
In this section, we use a foundation model for object detection to detect novel classes defined by an unbounded (open) vocabulary.
Setting up the imports#
To start, let’s import MultiModalPredictor, and also make sure groundingdino is installed:
from autogluon.multimodal import MultiModalPredictor
try:
    import groundingdino
except ImportError:
    try:
        from pip._internal import main as pipmain
    except ImportError:
        from pip import main as pipmain  # for old pip versions
    # Equivalent to "!pip install git+https://github.com/IDEA-Research/GroundingDINO.git"
    pipmain(['install', '--user', 'git+https://github.com/IDEA-Research/GroundingDINO.git'])
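The availability check above can also be done without importing the package. The following is a minimal sketch using only the standard library, not part of AutoGluon: `is_installed` and `pip_install_command` are hypothetical helper names, and the install command is only built and printed here, not executed.

```python
import importlib.util
import sys
from typing import List


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(requirement: str) -> List[str]:
    """Build a pip command bound to the running interpreter (hypothetical helper)."""
    return [sys.executable, "-m", "pip", "install", "--user", requirement]


GROUNDINGDINO_REPO = "git+https://github.com/IDEA-Research/GroundingDINO.git"

if not is_installed("groundingdino"):
    # Pass this list to subprocess.check_call to actually run the install.
    print(" ".join(pip_install_command(GROUNDINGDINO_REPO)))
```

Running pip through `sys.executable -m pip` installs into the same interpreter that runs the notebook, which is the usual reason to prefer it over calling pip's internal `main` function.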
Prepare sample image#
Let’s use a street-view image of Seattle for the demo:
from IPython.display import Image, display
from autogluon.multimodal import download
sample_image_url = "https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg"
sample_image_path = download(sample_image_url)
display(Image(filename=sample_image_path))
Downloading 49004630088_d15a9be500_6k.jpg from https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg...
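The log line suggests that the local file is named after the last path component of the URL. Below is a minimal sketch of that naming rule; it makes no claim about how `download` is implemented, and `filename_from_url` is a hypothetical helper.

```python
import posixpath
from urllib.parse import urlparse


def filename_from_url(url: str) -> str:
    """Return the last path component of a URL, ignoring query and fragment."""
    return posixpath.basename(urlparse(url).path)


sample_image_url = "https://live.staticflickr.com/65535/49004630088_d15a9be500_6k.jpg"
print(filename_from_url(sample_image_url))  # 49004630088_d15a9be500_6k.jpg
```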
Creating the MultiModalPredictor#
We create the MultiModalPredictor and set problem_type to "open_vocabulary_object_detection".
We set the preset to "best_quality", which uses a SwinB backbone and gives higher detection accuracy.
The "high_quality" and "medium_quality" presets use a SwinT backbone instead; they are faster but less accurate.
# Init predictor
predictor = MultiModalPredictor(problem_type="open_vocabulary_object_detection", presets="best_quality")
Inference#
To run inference on the image, perform:
pred = predictor.predict(
    {
        "image": [sample_image_path],
        "prompt": ["Pink notice. Green sign. One Way sign. People group. Tower crane in construction. Lamp post. Glass skyscraper."],
    },
    as_pandas=True,
)
print(pred)
Downloading groundingdino_swinb_cogcoor.pth from https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha2/groundingdino_swinb_cogcoor.pth...
final text_encoder_type: bert-base-uncased
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[4], line 1
----> 1 pred = predictor.predict(
(...)
File ~/autogluon/multimodal/src/autogluon/multimodal/predictor.py:589, in MultiModalPredictor.predict(self, data, candidate_data, id_mappings, as_pandas, realtime, save_results)
--> 589 return self._learner.predict(
(...)
File ~/autogluon/multimodal/src/autogluon/multimodal/learners/base.py:1929, in BaseLearner.predict(self, data, candidate_data, as_pandas, realtime, **kwargs)
-> 1929 logits = extract_from_output(outputs=outputs, ret_type=ret_type)
(...)
File ~/autogluon/multimodal/src/autogluon/multimodal/utils/inference.py:61, in extract_from_output(outputs, ret_type, as_ndarray)
---> 61 ret = torch.cat(logits).nan_to_num(nan=-1e4)
(...)
TypeError: expected Tensor as element 0 in argument 0, but got list
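The prompt passed to predict is a single string in which every category phrase ends with a period. When the categories live in a list, such a string can be assembled programmatically. This is a minimal sketch; `build_prompt` is a hypothetical helper, not an AutoGluon function.

```python
from typing import Iterable, List


def build_prompt(phrases: Iterable[str]) -> str:
    """Join category phrases into one period-terminated prompt string."""
    cleaned: List[str] = [p.strip().rstrip(".").strip() for p in phrases]
    # Drop entries that are empty after cleaning, then end each phrase with a period.
    return " ".join(f"{p}." for p in cleaned if p)


categories = [
    "Pink notice",
    "Green sign",
    "One Way sign",
    "People group",
    "Tower crane in construction",
    "Lamp post",
    "Glass skyscraper",
]
prompt = build_prompt(categories)
print(prompt)
```

The resulting string can be placed in the "prompt" list in the same way as the hand-written prompt.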
On a successful run, the output pred is a pandas DataFrame with two columns, image and bboxes.
In image, each row contains the image path.
In bboxes, each row is a list of dictionaries, each representing the prediction for one object in the image: {"class": <predicted_class_name>, "bbox": [x1, y1, x2, y2], "score": <confidence_score>}. For example:
print(pred["bboxes"][0][0])
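Each prediction dictionary can be processed with plain Python. The sketch below derives the width, height, area, and center of a box. It assumes that (x1, y1) is the top-left corner and (x2, y2) the bottom-right corner in pixel coordinates; the sample detection is made up for illustration and is not real model output, and `box_geometry` is a hypothetical helper.

```python
from typing import Dict, List


def box_geometry(bbox: List[float]) -> Dict[str, float]:
    """Derive width, height, area, and center from an [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    width = x2 - x1
    height = y2 - y1
    return {
        "width": width,
        "height": height,
        "area": width * height,
        "center_x": x1 + width / 2,
        "center_y": y1 + height / 2,
    }


# Made-up detection in the documented format; not real model output.
detection = {"class": "Green sign", "bbox": [10.0, 20.0, 110.0, 80.0], "score": 0.62}
geometry = box_geometry(detection["bbox"])
print(detection["class"], geometry["width"], geometry["height"], geometry["area"])
```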
The visualization step needs OpenCV, so install opencv-python first if it is missing:
!pip install opencv-python
Then, to visualize the results, run the following:
from autogluon.multimodal.utils import Visualizer
conf_threshold = 0.2 # Specify a confidence threshold to filter out unwanted boxes
image_result = pred.iloc[0]
img_path = image_result.image # Select an image to visualize
visualizer = Visualizer(img_path) # Initialize the Visualizer
out = visualizer.draw_instance_predictions(image_result, conf_threshold=conf_threshold) # Draw detections
visualized = out.get_image() # Get the visualized image
from PIL import Image
from IPython.display import display
img = Image.fromarray(visualized, 'RGB')
display(img)
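The same confidence threshold can be applied without drawing anything, for example to count the confident detections per class. This is a minimal sketch over plain dictionaries in the bboxes format; the sample values are made up, `count_confident` is a hypothetical helper, and treating the threshold as inclusive (`>=`) is an assumption, since this tutorial does not state how Visualizer compares scores.

```python
from collections import Counter
from typing import Dict, List

# Made-up detections in the documented format; not real model output.
sample_bboxes: List[Dict] = [
    {"class": "Green sign", "bbox": [10.0, 20.0, 110.0, 80.0], "score": 0.62},
    {"class": "Lamp post", "bbox": [300.0, 50.0, 330.0, 400.0], "score": 0.15},
    {"class": "Green sign", "bbox": [500.0, 40.0, 560.0, 90.0], "score": 0.35},
]


def count_confident(bboxes: List[Dict], conf_threshold: float) -> Counter:
    """Count detections per class whose score is at least conf_threshold."""
    return Counter(b["class"] for b in bboxes if b["score"] >= conf_threshold)


counts = count_confident(sample_bboxes, conf_threshold=0.2)
for class_name, n in counts.items():
    print(class_name, n)
```

With a real prediction, the list stored in one row of the bboxes column would take the place of `sample_bboxes`.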
Other Examples#
See AutoMM Examples for more examples of AutoMM.
Customization#
To learn how to customize AutoMM, please refer to Customize AutoMM.
Citation#
@misc{liu2023grounding,
      title={Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection},
      author={Shilong Liu and Zhaoyang Zeng and Tianhe Ren and Feng Li and Hao Zhang and Jie Yang and Chunyuan Li and Jianwei Yang and Hang Su and Jun Zhu and Lei Zhang},
      year={2023},
      eprint={2303.05499},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}