Zero-Shot Object Detection with YOLO-World¶
Click the Open in Colab button to run the cookbook on Google Colab.
YOLO-World was designed to solve a limitation of existing zero-shot object detection models: speed. Whereas other state-of-the-art models use Transformers, a powerful but typically slower architecture, YOLO-World uses the faster CNN-based YOLO architecture.
According to the paper, YOLO-World achieves 35.4 AP at 52.0 FPS for the large version and 26.2 AP at 74.1 FPS for the small version, benchmarked on an NVIDIA V100. While the V100 is a powerful GPU, achieving such high FPS on any device is impressive.

Before you start¶
Let's make sure that we have access to a GPU. We can use the nvidia-smi command to do that. In case of any problems, navigate to Edit -> Notebook settings -> Hardware accelerator, set it to GPU, and then click Save.
!nvidia-smi
Fri Feb 16 12:46:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05 Driver Version: 535.104.05 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 65C P8 13W / 70W | 0MiB / 15360MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| No running processes found |
+---------------------------------------------------------------------------------------+
NOTE: To make it easier for us to manage datasets, images, and models, we create a HOME constant.
import os
HOME = os.getcwd()
print(HOME)
/content
Install required packages¶
In this guide, we utilize two Python packages: inference, for executing zero-shot object detection using YOLO-World, and supervision, for post-processing and visualizing the detected objects.
!pip install -q inference-gpu[yolo-world]==0.9.12rc1
!pip install -q supervision==0.19.0rc3
Imports¶
import cv2
import supervision as sv
from tqdm import tqdm
from inference.models.yolo_world.yolo_world import YOLOWorld
Download example data¶
!wget -P {HOME} -q https://media.roboflow.com/notebooks/examples/dog.jpeg
!wget -P {HOME} -q https://media.roboflow.com/supervision/cookbooks/yellow-filling.mp4
SOURCE_IMAGE_PATH = f"{HOME}/dog.jpeg"
SOURCE_VIDEO_PATH = f"{HOME}/yellow-filling.mp4"
NOTE: If you want to run the cookbook using your own file as input, simply upload the image or video to Google Colab and replace SOURCE_IMAGE_PATH and SOURCE_VIDEO_PATH with the path to your file.
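For example, assuming you uploaded hypothetical files named my-image.jpeg and my-video.mp4 to the Colab working directory, the swap could look like this:
# hypothetical file names - replace with your own uploads
SOURCE_IMAGE_PATH = f"{HOME}/my-image.jpeg"
SOURCE_VIDEO_PATH = f"{HOME}/my-video.mp4"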
Run Object Detection¶
The Inference package provides the YOLO-World model in three versions: S, M, and L. You can load them by defining model_id as yolo_world/s, yolo_world/m, and yolo_world/l, respectively. The ROBOFLOW_API_KEY is not required to utilize this model.
model = YOLOWorld(model_id="yolo_world/l")
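If inference speed matters more than accuracy, you could load the small or medium variant instead, using the model_id values listed above; the rest of the cookbook assumes the large model.
# optional: trade accuracy for speed with a smaller variant
# model = YOLOWorld(model_id="yolo_world/s")
# model = YOLOWorld(model_id="yolo_world/m")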
YOLO-World is a zero-shot model, enabling object detection without any training. You only need to define a prompt as a list of classes (things) you are searching for.
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
model.set_classes(classes)
100%|████████████████████████████████████████| 338M/338M [00:03<00:00, 106MiB/s]
We perform detection on our sample image. Then, we convert the result into a sv.Detections object, which will be useful in the later parts of the cookbook.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image)
detections = sv.Detections.from_inference(results)
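Before visualizing anything, you can quickly inspect the sv.Detections object to see what was found; a minimal sketch:
print(len(detections))        # number of boxes that passed the confidence threshold
print(detections.class_id)    # indices into our classes list
print(detections.confidence)  # confidence score for each box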
The results we've obtained can be easily visualized with sv.BoundingBoxAnnotator and sv.LabelAnnotator. We can adjust parameters such as line thickness, text scale, and line and text color, allowing for a highly tailored visualization.
BOUNDING_BOX_ANNOTATOR = sv.BoundingBoxAnnotator(thickness=2)
LABEL_ANNOTATOR = sv.LabelAnnotator(text_thickness=2, text_scale=1, text_color=sv.Color.BLACK)
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Adjusting Confidence Level¶
Note that many classes from our prompt were not detected. This is because the default confidence threshold in Inference is set to 0.5. Let's try lowering this value substantially; we've observed that the confidences returned by YOLO-World tend to be much lower when querying for classes outside the COCO dataset.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image, confidence=0.003)
detections = sv.Detections.from_inference(results)
By default, sv.LabelAnnotator displays only the names of objects. To also view the confidence levels associated with each detection, we must define custom labels and pass them to sv.LabelAnnotator.
labels = [
    f"{classes[class_id]} {confidence:0.3f}"
    for class_id, confidence
    in zip(detections.class_id, detections.confidence)
]
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections, labels=labels)
sv.plot_image(annotated_image, (10, 10))
Using Non-Max Suppression (NMS) to Eliminate Double Detections¶
To eliminate duplicates, we will use Non-Max Suppression (NMS). NMS measures how much detections overlap using the Intersection over Union (IoU) metric and, once a defined threshold is exceeded, treats them as duplicates. Duplicates are then discarded, starting with those of the lowest confidence. The threshold should be within the range [0, 1]; the smaller the value, the more aggressively NMS suppresses overlapping boxes.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image, confidence=0.003)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
labels = [
    f"{classes[class_id]} {confidence:0.3f}"
    for class_id, confidence
    in zip(detections.class_id, detections.confidence)
]
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections, labels=labels)
sv.plot_image(annotated_image, (10, 10))
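Recent supervision releases also expose a class_agnostic flag on with_nms; when boxes of different classes pile up on the same object, class-agnostic NMS can suppress those overlaps as well. A sketch under that assumption:
# optional: suppress heavily overlapping boxes even when their classes differ
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1, class_agnostic=True)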
Video Processing¶
The get_video_frames_generator function enables us to easily iterate over video frames. Let's create a video generator for our sample input file and display its first frame on the screen.
generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
frame = next(generator)
sv.plot_image(frame, (10, 10))
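get_video_frames_generator also accepts optional start and stride arguments, which can be handy when you only want to sample a long video. A sketch, assuming these parameters are available in the installed supervision release:
# optional: skip the first 25 frames, then take every 25th frame
generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH, start=25, stride=25)
frame = next(generator)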
Let's update our list of classes. This time we are looking for yellow filling. The rest of the code performing detection, filtering and visualization remains unchanged.
classes = ["yellow filling"]
model.set_classes(classes)
results = model.infer(frame, confidence=0.002)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
annotated_image = frame.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Filtering Detections by Area¶
Our prompt allowed us to locate all the filled holes, but we also accidentally marked the entire high-level element. To address this, we'll filter detections based on their area relative to the total area of the video frame. If a detection occupies more than 10% of the frame's total area, it will be discarded.
We can use VideoInfo.from_video_path to learn basic information about our video, such as duration, resolution, or FPS.
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)
video_info
VideoInfo(width=1280, height=720, fps=25, total_frames=442)
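These two fields together also give us the clip's duration; a quick sanity check:
duration_seconds = video_info.total_frames / video_info.fps  # 442 / 25 = 17.68 seconds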
Knowing the frame's resolution allows us to easily calculate its total area, expressed in pixels.
width, height = video_info.resolution_wh
frame_area = width * height
frame_area
921600
On the other hand, by using the sv.Detections.area property, we can read the area of each individual bounding box.
results = model.infer(frame, confidence=0.002)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
detections.area
array([ 7.5408e+05, 92844, 11255, 12969, 9875.9, 8007.7, 5433.5])
Now, we can combine these two pieces of information to construct a filtering condition for detections with an area greater than 10% of the entire frame.
(detections.area / frame_area) < 0.10
array([False, False, True, True, True, True, True])
detections = detections[(detections.area / frame_area) < 0.10]
annotated_image = frame.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Final Result¶
Finally, we are ready to process the entire video. Now we can truly appreciate the speed of YOLO-World.
TARGET_VIDEO_PATH = f"{HOME}/yellow-filling-output.mp4"
frame_generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)
width, height = video_info.resolution_wh
frame_area = width * height
frame_area
with sv.VideoSink(target_path=TARGET_VIDEO_PATH, video_info=video_info) as sink:
    for frame in tqdm(frame_generator, total=video_info.total_frames):
        results = model.infer(frame, confidence=0.002)
        detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
        detections = detections[(detections.area / frame_area) < 0.10]
        annotated_frame = frame.copy()
        annotated_frame = BOUNDING_BOX_ANNOTATOR.annotate(annotated_frame, detections)
        annotated_frame = LABEL_ANNOTATOR.annotate(annotated_frame, detections)
        sink.write_frame(annotated_frame)
100%|██████████| 442/442 [00:31<00:00, 13.90it/s]
Keep in mind that the video preview below works only in the web version of the cookbooks and not in Google Colab.
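If you are running in Colab and want to inspect the result right away, one simple option is to download the rendered file and play it locally; a sketch assuming the standard google.colab helper:
# run this in Colab to download the rendered video
from google.colab import files
files.download(TARGET_VIDEO_PATH)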