Zero-Shot Object Detection with YOLO-World¶
Click the Open in Colab button to run the cookbook on Google Colab.
YOLO-World was designed to solve a limitation of existing zero-shot object detection models: speed. Whereas other state-of-the-art models use Transformers, a powerful but typically slower architecture, YOLO-World uses the faster CNN-based YOLO architecture.
According to the paper, YOLO-World reaches 35.4 AP at 52.0 FPS for the large version and 26.2 AP at 74.1 FPS for the small version, measured on an NVIDIA V100. While the V100 is a powerful GPU, achieving such high FPS on any device is impressive.
Before you start¶
Let's make sure that we have access to a GPU. We can use the nvidia-smi command to check. In case of any problems, navigate to Edit -> Notebook settings -> Hardware accelerator, set it to GPU, and then click Save.
!nvidia-smi
Fri Feb 16 12:46:14 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   65C    P8              13W /  70W |      0MiB / 15360MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                             |
|  GPU   GI   CI        PID   Type   Process name                             GPU Memory |
|        ID   ID                                                              Usage      |
|=======================================================================================|
|  No running processes found                                                            |
+---------------------------------------------------------------------------------------+
NOTE: To make it easier for us to manage datasets, images, and models, we create a HOME constant.
import os
HOME = os.getcwd()
print(HOME)
/content
Install required packages¶
In this guide, we utilize two Python packages: inference, for executing zero-shot object detection using YOLO-World, and supervision, for post-processing and visualizing the detected objects.
!pip install -q inference-gpu[yolo-world]==0.9.12rc1
!pip install -q supervision==0.19.0rc3
Imports¶
import cv2
import supervision as sv
from tqdm import tqdm
from inference.models.yolo_world.yolo_world import YOLOWorld
Download example data¶
!wget -P {HOME} -q https://media.roboflow.com/notebooks/examples/dog.jpeg
!wget -P {HOME} -q https://media.roboflow.com/supervision/cookbooks/yellow-filling.mp4
SOURCE_IMAGE_PATH = f"{HOME}/dog.jpeg"
SOURCE_VIDEO_PATH = f"{HOME}/yellow-filling.mp4"
NOTE: If you want to run the cookbook using your own file as input, simply upload the image or video to Google Colab and replace SOURCE_IMAGE_PATH or SOURCE_VIDEO_PATH with the path to your file.
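If you prefer to upload the file programmatically rather than through the Colab file browser, the sketch below uses google.colab.files. It only works when running in Colab, and which variable you point at the uploaded file depends on whether it is an image or a video.

from google.colab import files

uploaded = files.upload()                  # opens a browser file picker; files are saved to the current directory
file_name = next(iter(uploaded))           # name of the uploaded file
SOURCE_IMAGE_PATH = f"{HOME}/{file_name}"  # or SOURCE_VIDEO_PATH, if you uploaded a video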
Run Object Detection¶
The Inference package provides the YOLO-World model in three versions: S, M, and L. You can load them by defining model_id as yolo_world/s, yolo_world/m, or yolo_world/l, respectively. The ROBOFLOW_API_KEY is not required to use this model.
model = YOLOWorld(model_id="yolo_world/l")
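The rest of the cookbook uses the large model loaded above. If inference speed matters more than accuracy on your hardware, you could load the small variant instead; a quick sketch:

# Optional: the small variant trades some accuracy for speed.
fast_model = YOLOWorld(model_id="yolo_world/s")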
YOLO-World is a zero-shot model, enabling object detection without any training. You only need to define a prompt as a list of classes (things) you are searching for.
classes = ["person", "backpack", "dog", "eye", "nose", "ear", "tongue"]
model.set_classes(classes)
100%|████████████████████████████████████████| 338M/338M [00:03<00:00, 106MiB/s]
We perform detection on our sample image. Then, we convert the result into an sv.Detections object, which will be useful in the later parts of the cookbook.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image)
detections = sv.Detections.from_inference(results)
The results we've obtained can be easily visualized with sv.BoundingBoxAnnotator and sv.LabelAnnotator. We can adjust parameters such as line thickness, text scale, and line and text color, allowing for a highly tailored visualization.
BOUNDING_BOX_ANNOTATOR = sv.BoundingBoxAnnotator(thickness=2)
LABEL_ANNOTATOR = sv.LabelAnnotator(text_thickness=2, text_scale=1, text_color=sv.Color.BLACK)
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Adjusting Confidence Level¶
Note that many classes from our prompt were not detected. This is because the default confidence threshold in Inference is set to 0.5. Let's try significantly lowering this value. We've observed that the confidence returned by YOLO-World is significantly lower when querying for classes outside the COCO dataset.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image, confidence=0.003)
detections = sv.Detections.from_inference(results)
By default, sv.LabelAnnotator displays only the names of objects. To also view the confidence level associated with each detection, we must define custom labels and pass them to sv.LabelAnnotator.
labels = [
f"{classes[class_id]} {confidence:0.3f}"
for class_id, confidence
in zip(detections.class_id, detections.confidence)
]
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections, labels=labels)
sv.plot_image(annotated_image, (10, 10))
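Since sv.Detections supports boolean-mask indexing, you can also tighten the results after the fact without re-running inference. A minimal sketch, with 0.1 as a purely illustrative cut-off:

# Keep only detections scoring above an illustrative threshold of 0.1.
confident_detections = detections[detections.confidence > 0.1]

annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, confident_detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, confident_detections)
sv.plot_image(annotated_image, (10, 10))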
Using Non-Max Suppression (NMS) to Eliminate Double Detections¶
To eliminate duplicates, we will use Non-Max Suppression (NMS). NMS evaluates the extent to which detections overlap using the Intersection over Union (IoU) metric and, upon exceeding a defined threshold, treats them as duplicates. Duplicates are then discarded, starting with those of the lowest confidence. The threshold should be within the range [0, 1]; the smaller the value, the more restrictive the NMS.
image = cv2.imread(SOURCE_IMAGE_PATH)
results = model.infer(image, confidence=0.003)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
labels = [
f"{classes[class_id]} {confidence:0.3f}"
for class_id, confidence
in zip(detections.class_id, detections.confidence)
]
annotated_image = image.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections, labels=labels)
sv.plot_image(annotated_image, (10, 10))
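To see how the threshold value changes the outcome in isolation, here is a small, self-contained sketch with two hand-crafted, heavily overlapping boxes (the coordinates and confidences are made up for illustration; their IoU is roughly 0.82):

import numpy as np

# Two heavily overlapping boxes of the same class with different confidences.
toy_detections = sv.Detections(
    xyxy=np.array([[100, 100, 200, 200], [105, 105, 205, 205]], dtype=float),
    confidence=np.array([0.9, 0.4]),
    class_id=np.array([0, 0]),
)

# With a permissive threshold of 0.9 both boxes survive (IoU of ~0.82 does not exceed it);
# with a strict threshold of 0.1 the lower-confidence box is discarded.
print(len(toy_detections.with_nms(threshold=0.9)))  # 2
print(len(toy_detections.with_nms(threshold=0.1)))  # 1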
Video Processing¶
The sv.get_video_frames_generator utility enables us to easily iterate over video frames. Let's create a video generator for our sample input file and display its first frame on the screen.
generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
frame = next(generator)
sv.plot_image(frame, (10, 10))
Let's update our list of classes. This time we are looking for yellow filling. The rest of the code performing detection, filtering, and visualization remains unchanged.
classes = ["yellow filling"]
model.set_classes(classes)
results = model.infer(frame, confidence=0.002)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
annotated_image = frame.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Filtering Detections by Area¶
Our prompt allowed us to locate all filled holes, but we also accidentally marked the entire high-level element. To address this issue, we'll filter detections based on their relative area in relation to the entire video frame. If a detection occupies more than 10% of the frame's total area, it will be discarded.
We can use sv.VideoInfo.from_video_path to learn basic information about our video, such as its resolution, FPS, and total number of frames.
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)
video_info
VideoInfo(width=1280, height=720, fps=25, total_frames=442)
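sv.VideoInfo does not report the duration directly, but it is easy to derive from the frame count and FPS; a quick sketch for this sample video:

# Approximate duration in seconds: total number of frames divided by FPS.
duration_seconds = video_info.total_frames / video_info.fps
print(f"{duration_seconds:.1f} s")  # 442 / 25 = 17.7 s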
Knowing the frame's resolution allows us to easily calculate its total area, expressed in pixels.
width, height = video_info.resolution_wh
frame_area = width * height
frame_area
921600
On the other hand, using the sv.Detections.area property, we can learn the area of each individual bounding box.
results = model.infer(frame, confidence=0.002)
detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
detections.area
array([ 7.5408e+05, 92844, 11255, 12969, 9875.9, 8007.7, 5433.5])
Now, we can combine these two pieces of information to construct a condition that filters out detections occupying more than 10% of the entire frame.
(detections.area / frame_area) < 0.10
array([False, False, True, True, True, True, True])
detections = detections[(detections.area / frame_area) < 0.10]
annotated_image = frame.copy()
annotated_image = BOUNDING_BOX_ANNOTATOR.annotate(annotated_image, detections)
annotated_image = LABEL_ANNOTATOR.annotate(annotated_image, detections)
sv.plot_image(annotated_image, (10, 10))
Final Result¶
Finally, we are ready to process our entire video. This is where we can truly appreciate the speed of YOLO-World.
TARGET_VIDEO_PATH = f"{HOME}/yellow-filling-output.mp4"
frame_generator = sv.get_video_frames_generator(SOURCE_VIDEO_PATH)
video_info = sv.VideoInfo.from_video_path(SOURCE_VIDEO_PATH)
width, height = video_info.resolution_wh
frame_area = width * height
frame_area
with sv.VideoSink(target_path=TARGET_VIDEO_PATH, video_info=video_info) as sink:
    for frame in tqdm(frame_generator, total=video_info.total_frames):
        results = model.infer(frame, confidence=0.002)
        detections = sv.Detections.from_inference(results).with_nms(threshold=0.1)
        detections = detections[(detections.area / frame_area) < 0.10]

        annotated_frame = frame.copy()
        annotated_frame = BOUNDING_BOX_ANNOTATOR.annotate(annotated_frame, detections)
        annotated_frame = LABEL_ANNOTATOR.annotate(annotated_frame, detections)
        sink.write_frame(annotated_frame)
100%|██████████| 442/442 [00:31<00:00, 13.90it/s]
Keep in mind that the video preview below works only in the web version of the cookbooks and not in Google Colab.
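If you are running in Google Colab and cannot use the inline player, one simple way to sanity-check the result is to pull a frame out of the rendered file and plot it, reusing the same supervision utilities; a quick sketch:

# Display the first frame of the rendered output video as a static image.
output_generator = sv.get_video_frames_generator(TARGET_VIDEO_PATH)
output_frame = next(output_generator)
sv.plot_image(output_frame, (10, 10))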