Have you ever trained multiple detection models and wondered which one performs best on your specific use case? Or maybe you've downloaded a pre-trained model and want to verify its performance on your dataset? Model benchmarking is essential for making informed decisions about which model to deploy in production.
This guide will show an easy way to benchmark your results using supervision. It covers choosing an evaluation dataset, running a model to obtain predictions, visualizing the results, and computing metrics such as mAP and the F1 score.
This guide uses an instance segmentation model, but the same approach applies to object detection and oriented bounding box (OBB) models as well.
A condensed version of this guide is available as a Colab Notebook.
Suppose you start with a dataset. Perhaps you found it on Universe; perhaps you labeled your own. In either case, this guide assumes you have a labeled dataset at hand.
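If your dataset is hosted on Roboflow Universe, the download might look like the sketch below; the API key, workspace, project, and version identifiers are placeholders, and the yolov8 export format is an assumption:

```python
from roboflow import Roboflow

# Placeholder credentials and project identifiers - substitute your own.
rf = Roboflow(api_key="YOUR_API_KEY")
project = rf.workspace("your-workspace").project("corgi-v2")
dataset = project.version(4).download("yolov8")

print(dataset.location)  # path to the downloaded dataset folder
```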
This will create a folder called Corgi-v2-4 in the current working directory, containing train, test, and valid folders and a data.yaml file.
Roboflow supports a range of state-of-the-art pre-trained models for object detection, instance segmentation, and pose tracking. You don't even need an API key!
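As a rough sketch, loading one of these pre-trained models with the inference package could look like this; the yolov8m-seg-640 model alias is an assumption, so substitute whichever model ID you want to benchmark:

```python
from inference import get_model

# Load a pre-trained segmentation model by its alias (assumed here).
model = get_model(model_id="yolov8m-seg-640")
```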
Evaluating your model requires careful selection of the dataset. Which images should you use? Let's go over the different scenarios.
Unrelated Dataset: If you have a dataset that was not used to train the model, this is the best choice.
Training Set: This is the set of images used to train the model. Benchmarking on it is fine if the model was not trained on this dataset. Otherwise, never use it for benchmarking - the results will look unrealistically good.
Validation Set: This is the set of images used to validate the model during training. Every Nth training epoch, the model is evaluated on the validation set. Often the training is stopped once the validation loss stops improving. Therefore, even though these images aren't used to train the model directly, they still indirectly influence the training outcome.
Test Set: This is the set of images kept aside for model testing. It is exactly the set you should use for benchmarking. If the dataset was split correctly, none of these images were shown to the model during training.
Therefore, an unrelated dataset or the test set is the best choice for benchmarking.
Several other problems may arise:
Extra Classes: An unrelated dataset may contain additional classes which you may need to filter out before computing metrics.
Class Mismatch: In an unrelated dataset, the class names or IDs may differ from what your model produces; you'll need to remap them, as shown later in this guide.
Data Contamination: The test set may not be split correctly, with images from the test set also present in the training or validation set and used during training. In this case, the results will be overly optimistic. This also applies when very similar images are used for training and testing - e.g. those taken in the same environment, under the same lighting conditions, from a similar angle, etc.
Missing Test Set: Some datasets do not come with a test set. In this case, you should collect and label your own data. Alternatively, a validation set could be used, but the results could be overly optimistic. Make sure to test in the real world as soon as possible.
At this point, you should have a dataset of labeled images to evaluate the model, and a model prepared for benchmarking. With these ready, we can now run the model and obtain predictions.
We'll use supervision to create a dataset iterator, and then run the model on each image.
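Here is a minimal sketch, assuming the YOLO-format Corgi-v2-4 download from earlier and a segmentation model loaded via the inference package:

```python
import supervision as sv

# Load the test split of the dataset downloaded earlier (paths are assumptions).
dataset = sv.DetectionDataset.from_yolo(
    images_directory_path="Corgi-v2-4/test/images",
    annotations_directory_path="Corgi-v2-4/test/labels",
    data_yaml_path="Corgi-v2-4/data.yaml",
)

predictions = []
targets = []

# Iterate over (image_path, image, annotations) and run the model on each image.
for _, image, annotations in dataset:
    result = model.infer(image)[0]
    detections = sv.Detections.from_inference(result)
    predictions.append(detections)
    targets.append(annotations)
```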
Did you notice an issue in the above logic?
Since we're using an unrelated dataset, the class names and IDs may be different from what the model was trained on.
We need to remap them to match the dataset classes. Here's how to do it:
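One way to do this is sketched below. It assumes the model outputs COCO class IDs and that the dataset's only class is the Corgi at ID 0 - both mappings are assumptions you should verify against your own model and data.yaml:

```python
import numpy as np
import supervision as sv

# Hypothetical mapping from model (COCO) class IDs to dataset class IDs:
# COCO class 16 ("dog") -> dataset class 0 ("Corgi"). Verify for your setup.
CLASS_ID_MAP = {16: 0}

def remap_classes(detections: sv.Detections) -> sv.Detections:
    # Keep only the classes we can map, then rewrite their IDs.
    keep = np.isin(detections.class_id, list(CLASS_ID_MAP))
    detections = detections[keep]
    detections.class_id = np.array(
        [CLASS_ID_MAP[class_id] for class_id in detections.class_id], dtype=int
    )
    return detections

predictions = [remap_classes(detections) for detections in predictions]
```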
Dataset class names and IDs can be found in the data.yaml file, or by printing dataset.classes.
Each model will have a different class mapping, so make sure to check the model's documentation. In this case, the model was trained on the COCO dataset, with the class configuration found here.
The first step in evaluating your model’s performance is to visualize its predictions.
This gives an intuitive sense of how well your model is detecting objects and where it might be failing.
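A minimal visualization sketch using supervision's annotators, reusing the dataset and remapped predictions from above:

```python
import supervision as sv

mask_annotator = sv.MaskAnnotator()
label_annotator = sv.LabelAnnotator()

# Overlay predicted masks and class labels on the first dataset image.
for (_, image, _), detections in zip(dataset, predictions):
    annotated = mask_annotator.annotate(scene=image.copy(), detections=detections)
    annotated = label_annotator.annotate(scene=annotated, detections=detections)
    sv.plot_image(annotated)
    break
```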
With multiple models, fine details matter, and visual inspection may not be enough. supervision provides a collection of metrics that give precise numerical measures of model performance.
We'll start with MeanAveragePrecision (mAP), which is the most commonly used metric for object detection. It measures the average precision across all classes and IoU thresholds.
Here, the most popular value is mAP 50:95. It represents the average precision across all classes and IoU thresholds (0.5 to 0.95), whereas other values such as mAP 50 or mAP 75 only consider a single IoU threshold (0.5 and 0.75 respectively).
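Computing it with supervision's metrics module might look like the sketch below, reusing the predictions and targets collected earlier; printing the result yields output along the lines of the block that follows:

```python
from supervision.metrics import MeanAveragePrecision, MetricTarget

# Evaluate masks (use MetricTarget.BOXES for object detection models).
map_metric = MeanAveragePrecision(metric_target=MetricTarget.MASKS)
map_result = map_metric.update(predictions, targets).compute()

print(map_result)
```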
MeanAveragePrecisionResult:
Metric target: MetricTarget.MASKS
Class agnostic: False
mAP @ 50:95: 0.2409
mAP @ 50: 0.3591
mAP @ 75: 0.2915
mAP scores: [0.35909 0.3468 0.34556 ...]
IoU thresh: [0.5 0.55 0.6 ...]
AP per class:
0: [0.35909 0.3468 0.34556 ...]
...
Small objects: ...
Medium objects: ...
Large objects: ...
You can also plot the results:
map_result.plot()
The metric also breaks down the results by detected object area. Small, medium and large are simply those with area less than 32², between 32² and 96², and greater than 96² pixels respectively.
The F1 Score is another useful metric, especially when dealing with an imbalance between false positives and false negatives. It’s the harmonic mean of precision (how many predictions are correct) and recall (how many actual instances were detected).
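The corresponding sketch with supervision's F1Score metric, again reusing the predictions and targets from earlier, is shown below; printing the result gives output like the following block:

```python
from supervision.metrics import F1Score, MetricTarget

# Evaluate masks (use MetricTarget.BOXES for object detection models).
f1_metric = F1Score(metric_target=MetricTarget.MASKS)
f1_result = f1_metric.update(predictions, targets).compute()

print(f1_result)
```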
F1ScoreResult:
Metric target: MetricTarget.MASKS
Averaging method: AveragingMethod.WEIGHTED
F1 @ 50: 0.5341
F1 @ 75: 0.4636
F1 @ thresh: [0.53406 0.5278 0.52153 ...]
IoU thresh: [0.5 0.55 0.6 ...]
F1 per class:
0: [0.53406 0.5278 0.52153 ...]
...
Small objects: ...
Medium objects: ...
Large objects: ...
Similarly, you can plot the results:
f1_result.plot()
As with mAP, the metric also breaks down the results by detected object area, using the same small, medium, and large thresholds.
Want to compare base models? We've got you covered. Check out our Model Leaderboard to see how different models perform and to get a sense of state-of-the-art results. It's a great place to understand what the leading models can achieve and to compare your own results.
Even better, the repository is open source! You can see how the models were benchmarked, run the evaluation yourself, and even add your own models to the leaderboard. Check it out on GitHub!
In this guide, you've learned how to set up your environment, train or use pre-trained models, visualize predictions, and evaluate model performance with metrics such as mAP and the F1 score, and you've gotten to know our Model Leaderboard.
A condensed version of this guide is also available as a Colab Notebook.
For more details, be sure to check out our documentation and join our community discussions. If you find any issues, please let us know on GitHub.