Efficient Graph-Friendly COCO Metric Computation For Train-Time Model Evaluation
arXiv:2207.12120v1 [cs.CV] 21 Jul 2022

Abstract—Evaluating the COCO mean average precision (MaP) and COCO recall metrics as part of the static computation graph of modern deep learning frameworks poses a unique set of [...] evaluation tools, an open source Keras-native implementation of both mean average precision and recall, and an open-source [...]
To compute recall, the number of true positives computed for the given IoU threshold is divided by the number of ground truth boxes. This process is repeated for all θ ∈ Θ, and the average of all of these values is considered the final COCO recall score.

C. Mean Average Precision

Precision is defined as the number of true positives divided by the sum of true and false positives. While precision can be computed using the standard precision formula, this is unfortunately not particularly useful in the object detection domain, as models may make many predictions, some of which have low confidence scores.

MaP accounts for this fact and serves as a strong general-purpose object detection model evaluation metric. MaP is parameterized by a set of recall thresholds R and a set of IoU thresholds Θ. While many metrics can be computed on a per-image basis, MaP can only be computed on a global basis.

First, we construct the product of the sets R and Θ, and consider each pair of these parameters, γj ∈ R and θi ∈ Θ. Next, the aforementioned process of determining whether or not each bounding box is a true positive or false positive is followed with the selected value θi. The global set of predicted bounding boxes from the entire evaluation dataset is sorted by confidence. The set of all ordered subsets of boxes is constructed, starting from the empty set, moving to the set of only the highest confidence box, then adding an additional box by confidence level in descending order until the set of [...]

A. Variable Size State

The primary issue with computing MaP in the context of a computation graph is that MaP requires a variable-size state representation. This is because, in order to compute MaP, bounding box predictions must be sorted on a global-batch basis according to the detection confidence score. Whether or not a detection was a true positive or a false positive must be stored internally in the metric, alongside the corresponding confidence score. The number of predictions for any given dataset is unknown until the full dataset is evaluated; therefore the internal state size of MaP is variable. Modern deep learning frameworks lack strong support for variable-size state in their computation graphs.

MaP Approximation Algorithm: In order to solve this issue we propose an algorithm to closely approximate MaP. Instead of storing a list of detection instances, whether or not they were true positives, and their corresponding confidence scores, we initialize two vectors, βp and βn, both of shape (buckets,). These vectors are initialized to contain all zeros.

Evaluation is performed on a mini-batch basis. During each mini-batch, we first determine whether the detection instances are true positives or false positives. Following this, an index is computed for each bounding box using i = floor(c ∗ (buckets − δ)), where c is the detection confidence and δ is an infinitesimal. This yields a number in the range [0, buckets). If the bounding box is a true positive, βp[i] = βp[i] + 1; otherwise βn[i] = βn[i] + 1.

Additionally, the number of ground truth boxes is tracked in a Tensor, Γ, in order to compute the recall scores required during the AuC interpolation step.
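The per-mini-batch update described above can be sketched in NumPy as follows. The function name, the boolean true-positive mask, and the in-place scatter-add are illustrative assumptions for exposition; this is not the KerasCV implementation itself:

```python
import numpy as np

def update_state(beta_p, beta_n, confidences, is_true_positive, buckets=10000):
    """Accumulate one mini-batch into the fixed-size bucket vectors.

    beta_p, beta_n: (buckets,) integer vectors of TP / FP counts per bucket.
    confidences: (n,) detection confidence scores in [0, 1].
    is_true_positive: (n,) boolean mask produced by the IoU-matching step.
    """
    delta = 1e-9  # the "infinitesimal" that keeps indices in [0, buckets)
    idx = np.floor(confidences * (buckets - delta)).astype(np.int64)
    # Unbuffered scatter-add: true positives into beta_p, false positives
    # into beta_n (np.add.at handles repeated indices correctly).
    np.add.at(beta_p, idx[is_true_positive], 1)
    np.add.at(beta_n, idx[~is_true_positive], 1)
    return beta_p, beta_n
```

Because the state is two fixed-shape vectors, the update is expressible inside a static computation graph regardless of how many boxes each mini-batch contains.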
At the end of evaluation of all mini-batches, the final result is computed using the following process. First, a cumulative sum operation is performed over βp and βn: βps = cumsum(βp), βns = cumsum(βn). βps[i] can be interpreted as the number of true positives the classifier has scored when only considering bounding boxes with confidence score i/buckets.

Following this, recall scores for every bucket are computed: R = βps/Γ. Precision scores for every bucket are also computed: P = βps/(βns + βps). These recall and precision values are then used in the precision-recall pair matching process described in the original. The area-under-curve computation process is unchanged from the original, as is the averaging and repetition of the process over each IoU threshold.

It should be noted that an additional axis may be added to βp, βn, and Γ to support multi-class MaP computation.
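The finalization step above can be sketched in NumPy. One assumption is made explicit here: since a precision-recall curve is swept from the highest confidence threshold downward, the cumulative sums below run over the reversed bucket vectors (the text writes cumsum(βp) without specifying direction):

```python
import numpy as np

def finalize(beta_p, beta_n, num_ground_truths):
    """Turn bucket counts into per-bucket precision and recall curves.

    beta_p, beta_n: TP / FP bucket vectors accumulated over all mini-batches.
    num_ground_truths: the scalar Gamma tracked during evaluation.
    """
    # beta_p^s and beta_n^s: cumulative counts, highest-confidence bucket first
    # (reversal is our reading of the sweep direction, see lead-in).
    tp_cum = np.cumsum(beta_p[::-1])
    fp_cum = np.cumsum(beta_n[::-1])
    recall = tp_cum / num_ground_truths          # R = beta_p^s / Gamma
    # P = beta_p^s / (beta_n^s + beta_p^s), guarding against empty buckets.
    precision = tp_cum / np.maximum(tp_cum + fp_cum, 1)
    return precision, recall
```

The resulting precision and recall arrays feed into the unchanged precision-recall pair matching and area-under-curve interpolation of the original metric.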
Our algorithm exposes a parameter, buckets, to allow end users to trade off between accuracy and runtime performance. We recommend a default setting of buckets = 10000. In our benchmarks, buckets = 10000 achieves strong accuracy while maintaining tolerable runtime performance.
B. Ragged Bounding Box Tensors

The number of bounding boxes in an image may differ between training samples. This makes the bounding box Tensor ragged. While an easily solvable problem, this is an important design consideration when implementing COCO metrics in a computation graph.

-1 Masking: In order to solve this, our implementation simply ignores samples with a class ID of −1. This allows us to keep Tensors dense, while still supporting a variable number of bounding boxes in each image. One major caveat of this approach is the additional memory required to store the −1 values representing the masked-out boxes. While this is far from ideal, all major deep learning frameworks are rapidly improving support for RaggedTensors. As such, our implementation will be updated to support them in the future.
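A minimal sketch of this −1 masking convention, using a hypothetical dense box layout of [x, y, width, height, class_id] (the layout and helper are illustrative, not KerasCV's actual API):

```python
import numpy as np

# Dense (num_images, max_boxes, 5) tensor: [x, y, width, height, class_id].
# Images with fewer boxes are padded with sentinel rows of class_id = -1.
boxes = np.array([
    [[10, 10, 5, 5, 0], [20, 20, 4, 4, 1]],      # image 0: two real boxes
    [[30, 30, 8, 8, 2], [-1, -1, -1, -1, -1]],   # image 1: one real box + pad
], dtype=np.float32)

def unpad(image_boxes):
    """Drop sentinel rows so the metric only sees real boxes."""
    return image_boxes[image_boxes[:, 4] != -1]
```

The padding keeps every image's box tensor the same shape, at the cost of storing the sentinel rows in memory.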
V. NUMERICAL ACCURACY BENCHMARKING

In order for our metrics to be useful to researchers and practitioners, it is important that the metrics be accurate to their predecessor: pycocotools.

It should be noted that deviation from pycocotools should not be thought of as a deviation from the numerical value of the underlying metric itself. pycocotools lacks precise testing and was primarily developed by one individual contributor. But while pycocotools results should not be considered the ground-truth value for MaP or recall, it is still the industry standard for COCO metric evaluation. As such, we consider it prudent to benchmark the error margins between our implementation and the pycocotools implementation.

A. Mirroring Standard COCO Metric Evaluation

The standard pycocotools COCO evaluation suite contains several additional parameterizations of the standard mean average precision and recall metrics. These additional parameterizations include area range filtering, where only bounding boxes inside of a specified area range are considered, and a max-detection value, a limit on the number of detections a model may make per image. In order to evaluate the numerical accuracy of our implementation, we have implemented support for these parameters.

While implementing area range filtering and max detections on top of the algorithm we have proposed in IV is trivial, there is no one-to-one perfectly matching method. In the mini-batch update step, all examples are simply filtered based on their computed area before computing false positives, true positives, and ground truths.

B. Experimental Setup

Our benchmarking process covers each COCO metric configuration used in the original Microsoft COCO challenge. The accuracy of each configuration is evaluated for varying numbers of images and bounding boxes. Each test set is generated based on the ground truth annotations provided in the COCO dataset evaluation. Our process for producing a test dataset is as follows: First, we sample N random images from the COCO dataset. We select all of the bounding boxes associated with those images and treat these as our ground truths. Next, we modify the ground truth bounding boxes by shifting the x coordinate by up to 10% of the width of the box, and shifting the y coordinate by up to 10% of the height of the box. Following this, we randomly shift the width and height of the predicted boxes by up to 10%. These mutated boxes are used as predictions ypred. More formally:

left = left + random(−0.2, 0.2) ∗ width
right = right + random(−0.2, 0.2) ∗ height
width = width ∗ random(0.8, 1.2)
height = height ∗ random(0.8, 1.2)

Test datasets are generated 10 times each, and each metric configuration is evaluated independently. Every metric configuration from the COCO standard set of metrics is evaluated, with both the pycocotools implementation and the corresponding keras_cv implementation. The differences are compared, and the error margin between each is disclosed in section VI.
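The box-mutation procedure can be sketched as follows. Note that the sketch follows the update rules as printed (uniform shifts in (−0.2, 0.2) and scales in (0.8, 1.2)); the function name and the [x, y, width, height] layout are illustrative assumptions:

```python
import numpy as np

def mutate_boxes(ground_truth, rng=None, shift=0.2, scale=0.2):
    """Produce synthetic 'predictions' by jittering ground-truth boxes.

    ground_truth: (n, 4) array of [x, y, width, height] boxes.
    Coordinates are shifted by random(-shift, shift) times the matching
    box dimension; width and height are scaled by random(1-scale, 1+scale).
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(ground_truth)
    x, y, w, h = (ground_truth[:, i] for i in range(4))
    x = x + rng.uniform(-shift, shift, n) * w
    y = y + rng.uniform(-shift, shift, n) * h
    w = w * rng.uniform(1 - scale, 1 + scale, n)
    h = h * rng.uniform(1 - scale, 1 + scale, n)
    return np.stack([x, y, w, h], axis=1)
```

Because the predictions are controlled perturbations of the ground truths, both implementations can be evaluated on identical inputs with non-trivial metric values.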
VI. EXPERIMENTAL RESULTS

Figure VI-A provides an overview of the error distributions of the runs of each metric. Figure 2 and Figure 3 show the resulting error margin plots generated from this experiment. The X axis shows the number of images used in the 10 runs making up that datapoint, and the Y axis represents the absolute error from the pycocotools metric.

These figures show that when both implementations are used to evaluate the same corpus of images, the results are near identical. However, there is clearly a difference in the implementation of area range evaluation.

The error margins for both MaP and recall are <3% for cases that do not involve area range filtering. Table I shows the minimum, maximum, and mean error of each evaluation case. In cases that do not involve area range filtering, the error margins are likely caused by rounding differences between NumPy and TensorFlow, rounding errors induced by differing orders of operations between our implementation and the pycocotools implementation, and the approximation algorithm used to compute MaP.

TABLE I
NUMERICAL RESULTS ACROSS ALL RUNS

Metric                 Min Error   Max Error   Mean Error
Standard MaP           0.000       0.046       0.015 ± 0.011
MaP IoU=0.5            0.000       0.075       0.024 ± 0.019
MaP IoU=0.75           0.001       0.079       0.025 ± 0.017
MaP Small Objects      0.000       0.202       0.054 ± 0.040
MaP Medium Objects     0.001       0.156       0.055 ± 0.036
MaP Large Objects      0.000       0.150       0.028 ± 0.030
Recall 1 Detection     0.000       0.029       0.009 ± 0.007
Recall 10 Detections   0.000       0.035       0.011 ± 0.007
Standard Recall        0.000       0.035       0.011 ± 0.007
Recall Small Objects   0.001       0.143       0.043 ± 0.031
Recall Medium Objects  0.000       0.172       0.047 ± 0.038
Recall Large Objects   0.000       0.150       0.013 ± 0.032

A. Area Range Differences

The error margin for all cases involving area filtering is significantly greater. This can be attributed to an implementation difference. pycocotools filters bounding boxes according to the area annotation of the ground truth annotations included in the COCO dataset. It should be noted that in the COCO dataset the area of each bounding box is actually computed using the segmentation map included in the dataset. Additionally, no filtering is done on the predicted bounding boxes.

On the other hand, KerasCV filters bounding boxes according to the specified area range by computing the area of the bounding box itself. This is done for both the ground truth bounding boxes and the detected bounding boxes.

While neither approach is perfect, we consider our approach to be significantly more idiomatic than that of pycocotools. Segmentation maps are significantly more expensive to produce. As such, most users performing bounding box detection will likely not have segmentation maps for the data they are evaluating their model on. This would prevent these users from mirroring the approach taken in pycocotools. Additionally, it is a confusing delegation of responsibilities to use a value computed from a segmentation map to evaluate a metric for bounding box detection. Finally, tracking the area of a segmentation map is simply a pain for the end user. There is no general-case reason for this behavior, and as such we have chosen to deviate from it.

VII. CONCLUSIONS

We proposed and implemented an in-graph, mini-batch friendly approach to compute the COCO recall metric, as well as an algorithm to closely approximate the COCO mean average precision (MaP) metric. We released an open-source, easy-to-use implementation of both algorithms as part of the KerasCV package. Our numerical benchmarks confirm that our implementation produces results that closely match the most commonly-used prior implementation of both metrics, available in pycocotools. With our in-graph approach, it becomes easy and efficient to implement train-time evaluation of object detection models. This enables researchers and practitioners to leverage full training curves for these metrics to more effectively compare two models or prototype new models.

Fig. 1. Error distribution for each metric configuration