Zero-Shot Object Detection with YOLO-World¶
Click the Open in Colab button to run the cookbook on Google Colab.
YOLO-World was designed to solve a limitation of existing zero-shot object detection models: speed. Whereas other state-of-the-art models use Transformers, a powerful but typically slower architecture, YOLO-World uses the faster CNN-based YOLO architecture.
According to the paper, YOLO-World reaches 35.4 AP at 52.0 FPS for the large version and 26.2 AP at 74.1 FPS for the small version on an NVIDIA V100. While the V100 is a powerful GPU, achieving such high FPS on any device is impressive.
Before you start¶
Let's make sure that we have access to a GPU. We can use the nvidia-smi command to do that. In case of any problems, navigate to Edit -> Notebook settings -> Hardware accelerator, set it to GPU, and then click Save.
!nvidia-smi
Fri Feb 16 12:46:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              13W / 70W  |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
NOTE: To make it easier for us to manage datasets, images and models, we create a HOME constant.
import os
HOME = os.getcwd()
print(HOME)
/content
Install required packages¶
In this guide, we utilize two Python packages: inference, for executing zero-shot object detection using YOLO-World, and supervision, for post-processing and visualizing the detected objects.
!pip install -q inference-gpu[yolo-world]==0.9.12rc1
!pip install -q supervision==0.19.0rc3
Imports¶
import cv2
import supervision as sv
from tqdm import tqdm
from inference.models.yolo_world.yolo_world import YOLOWorld
Download example data¶
!wget -P {HOME} -q https://media.roboflow.com/notebooks/examples/dog.jpeg
!wget -P {HOME} -q https://media.roboflow.com/supervision/cookbooks/yellow-filling.mp4
SOURCE_IMAGE_PATH = f"{HOME}/dog.jpeg"
SOURCE_VIDEO_PATH = f"{HOME}/yellow-filling.mp4"
NOTE: If you want to run the cookbook using your own files as input, simply upload your image or video to Google Colab and replace SOURCE_IMAGE_PATH and SOURCE_VIDEO_PATH with the paths to your files.
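NOTE: Before moving on, you can optionally confirm that the video downloaded correctly and inspect its resolution, FPS, and frame count. This is a minimal sketch using supervision's VideoInfo utility; it is not required for the rest of the cookbook.
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)  # reads resolution, fps and total frame count
print(video_info)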
Run Object Detection¶
The Inference package provides the YOLO-World model in three versions: S, M, and L. You can load them by defining model_id as yolo_world/s, yolo_world/m, and yolo_world/l, respectively. The ROBOFLOW_API_KEY is not required to utilize this model.
model = YOLOWorld(model_id="yolo_world/l")
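NOTE: We load the large version here. If inference speed matters more than accuracy in your setting, you can swap in one of the smaller variants listed above; only the model_id changes. For example:
model = YOLOWorld(model_id="yolo_world/s")  # smaller, faster variant; trades some accuracy for speed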
YOLO-World is a zero-shot model, enabling object detection without any training. You only need to define a prompt as a list of classes (things) you are searching for.
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
model.set_classes(classes)
100%|████████████████████████████████████████| 338M/338M [00:03<00:00, 106MiB/s]
We perform detection on our sample image. Then, we convert the result into a sv.Detections object, which will be useful in the later parts of the cookbook.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image)
detections = sv.Detections.from_inference(results)
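NOTE: Zero-shot models can return overlapping or low-confidence boxes. If you find the raw results noisy, you can thin them out with supervision before visualizing. This is an optional sketch; the confidence cutoff and NMS threshold below are arbitrary placeholder values you should tune for your data.
detections = detections[detections.confidence > 0.05]  # drop low-confidence boxes (placeholder threshold)
detections = detections.with_nms(threshold=0.5)  # merge overlapping boxes with non-max suppression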
The results we've obtained can be easily visualized with sv.BoundingBoxAnnotator and sv.LabelAnnotator. We can adjust parameters such as line thickness, text scale, and line and text color, allowing for a highly tailored visualization.
BOUNDING_BOX_ANNOTATOR = sv.BoundingBoxAnnotator(thickness=2)
LABEL_ANNOTATOR = sv.LabelAnnotator(text_thickness=2, text_scale=1, text_color=sv.Color.BLACK)
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
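NOTE: By default, the label annotator displays only the class names. If you also want to show confidence scores, you can build the labels yourself and pass them to annotate. This is a small optional sketch; the two-decimal formatting is just one possible choice.
labels = [
    f"{classes[class_id]} {confidence:.2f}"  # class name plus confidence score
    for class_id, confidence
    in zip(detections.class_id, detections.confidence)
]
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections, labels=labels)
sv.plot_image(annotated_image, (10, 10))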