14 March 2024

The Truth Behind the Advertised 95%+ AI Performance Metric

By Bruno Oliveira

Have you ever wondered what it means when an AI solution is advertised as performing with a certain accuracy, precision or recall percentage? You've seen the impressive 95%+ figures, but when you try the solution in the real world, it doesn't live up to expectations for your business case. The answer usually lies in what kind of metrics are being advertised (training or business metrics), and with what data the solutions are tested to obtain them. In this post, we will pull back the curtain on these metrics, focusing on understanding the different types of metrics advertised.

Training Metrics: A Development Step in AI

During the development of an AI solution, all AI models go through a key training phase. This training phase is an iterative process in which the AI model is fed tons of labeled data over and over while it learns the image/video features relevant to the business case. It usually takes a few dozen to several hundred iterations to complete the training step. How do engineers determine when to stop this training process? How many iterations are needed?

Enter training metrics. 

Engineers define a set of labeled images or videos for training (the Train dataset) and another set for assessing performance (the Test dataset, optionally plus a Validation dataset). They also establish a set of performance indicators, known as training metrics, to assess how well the AI model is performing on the task it is meant to solve. After each training iteration, the training metrics are calculated, providing insight into the AI model's progress. Once the metrics fulfill a set of predefined conditions (for example, no improvement for a set number of iterations), we know performance has plateaued and the training process can stop.
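To make that stopping rule concrete, here is a minimal sketch of patience-based early stopping. The train_one_iteration and evaluate functions, and the patience value, are illustrative placeholders rather than any particular framework's API.

```python
# Minimal sketch of a patience-based stopping rule. train_one_iteration()
# and evaluate() are hypothetical placeholders for the training framework.

def train_with_early_stopping(model, train_data, val_data, patience=20):
    best_score = float("-inf")
    stale_iterations = 0   # iterations since the last improvement
    total_iterations = 0
    while stale_iterations < patience:
        train_one_iteration(model, train_data)  # one pass over the Train dataset
        score = evaluate(model, val_data)       # training metric on held-out data
        total_iterations += 1
        if score > best_score:
            best_score = score
            stale_iterations = 0   # improvement: reset the counter
        else:
            stale_iterations += 1  # no improvement this iteration
    return model, best_score, total_iterations
```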

The choice of metrics and stop conditions depends on the task at hand. For instance, metrics such as Intersection over Union (IoU), the Dice coefficient, precision or recall may be used in segmentation tasks. However, it is important to recognize that while training metrics are valuable development tools for the AI model, they often do not accurately reflect how well the AI model will perform in the real business case. Let's take a look at business metrics.
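Before we do, here is a minimal sketch of two of those segmentation metrics, assuming the predicted and ground-truth masks are same-shaped NumPy arrays with 1 marking the class of interest:

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union: overlap divided by combined area."""
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return float(intersection / union) if union else 1.0

def dice(pred: np.ndarray, target: np.ndarray) -> float:
    """Dice coefficient: twice the overlap divided by the total area."""
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    return float(2.0 * intersection / total) if total else 1.0
```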

Business Metrics: Bridging AI to Real-World Performance

If training metrics are not the best tool for aligning an AI model's performance with a real-world business case, then what is? Business metrics. These metrics give engineers a better understanding of how the app will behave in the real world, and they provide performance data that is relevant and easily explained to customers and stakeholders. To illustrate the difference, we will explore a real-world example of training metrics versus business metrics in Noema's Fire & Smoke detection application.

In the development phase of the Noema Fire & Smoke AI model, engineers designed this as a segmentation task. This means the AI model takes a camera frame and assesses which individual pixels are fire, smoke or background. As a result, the training metrics reflected the accuracy of the pixel quantification, that is, how many pixels are correctly and incorrectly categorized. The training metrics also reflected how well the presence of fire and smoke was detected on a single camera frame. In other words, in order to properly train the AI model, the training metrics answered the questions (a short sketch of both follows the list):

  • How accurately does my model sort each pixel into the right bucket (smoke, fire, background)?
  • How accurate is my model at saying whether a single image contains fire or smoke?
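As a rough illustration of those two questions, here is a sketch assuming per-pixel class maps with 0 = background, 1 = fire and 2 = smoke; the min_pixels threshold is an assumption for illustration, not Noema's actual rule:

```python
import numpy as np

def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Question 1: the fraction of pixels sorted into the correct bucket."""
    return float((pred == target).mean())

def frame_has_class(pred: np.ndarray, class_id: int, min_pixels: int = 50) -> bool:
    """Question 2: does a single frame contain this class? The min_pixels
    threshold is an assumed noise filter, not Noema's actual rule."""
    return int((pred == class_id).sum()) >= min_pixels
```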

However, this does not reflect the real use case and business context for the application, which is to trigger alarms when fire or smoke is detected within a reasonable time frame. Clearly, training metrics are not the best option for understanding how well the AI model will perform for the purpose of the application. Instead, business metrics are essential to address the real-life use cases and answer questions such as (a sketch of how these might be scored follows the list):

  • How often does my app correctly issue a fire alarm within a 5 second time frame after the fire has become visible to my camera?
  • What about smoke, in that same 5 second time frame?
  • How often does my app create a false fire alarm, if any?
  • How often does my app create a false smoke alarm, if any?
  • How accurate is my fire and smoke area quantification?
  • How good is my fire and smoke average color estimation?
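As a hedged sketch of how the first four of these alarm-level questions might be scored, assume timestamped ground-truth events and app alarms; the Event format and the 5 second window below are illustrative, not Noema's actual evaluation harness:

```python
from dataclasses import dataclass

@dataclass
class Event:
    kind: str      # "fire" or "smoke"
    time_s: float  # seconds from the start of the video

def timely_alarm_rate(truths, alarms, kind, window_s=5.0):
    """Fraction of true events answered by a same-kind alarm within the window."""
    relevant = [t for t in truths if t.kind == kind]
    hits = sum(
        1 for t in relevant
        if any(a.kind == kind and 0 <= a.time_s - t.time_s <= window_s
               for a in alarms)
    )
    return hits / len(relevant) if relevant else 1.0

def false_alarm_count(truths, alarms, kind, window_s=5.0):
    """Alarms with no true event of the same kind in the preceding window."""
    return sum(
        1 for a in alarms
        if a.kind == kind and not any(
            t.kind == kind and 0 <= a.time_s - t.time_s <= window_s
            for t in truths)
    )
```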

Consider this: if a customer seeks a solution specifically to notify them about fire and smoke in a timely fashion, then the metrics they care about are the ones tied to that desired result. Promoting a 95%+ performance metric can be misleading when that figure is a training metric that merely counts how accurately the AI model classifies individual fire and smoke pixels.

Considerations in Test and Validation Sets

Another reason for the misalignment between advertised metrics and actual real-world performance lies in how the Test/Validation datasets are designed, i.e. on what data the AI models are tested and validated. If the Test/Validation sets do not encompass all the real use case scenarios, and edge cases, relevant to a specific situation, then there is no way to assess how well the application will perform. In some extreme cases, applications could be built using Test/Validation sets that encompass what is called "Blue Sky" data. Blue Sky use cases are situations that are easy for an AI algorithm to solve, but optimistic compared to the real world: images with perfect outdoor weather, great visibility and lighting, no occlusion, and so on. We know this is not how the real world looks. For that reason, if an application is tested and validated using only this data, it will report 95%+ metrics, but those will not represent its real-world performance.
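One simple way to expose this kind of inflation is to report metrics per condition slice rather than as a single aggregate. The condition tags and sample data below are hypothetical:

```python
from collections import defaultdict

def stratified_accuracy(samples):
    """samples: iterable of (condition, correct) pairs,
    e.g. ("clear_day", True) or ("fog", False)."""
    totals, correct = defaultdict(int), defaultdict(int)
    for condition, is_correct in samples:
        totals[condition] += 1
        correct[condition] += int(is_correct)
    return {c: correct[c] / totals[c] for c in totals}

# A "Blue Sky" validation set reports one flattering number;
# slicing by condition shows where performance actually drops.
print(stratified_accuracy([
    ("clear_day", True), ("clear_day", True), ("clear_day", True),
    ("night", True), ("night", False),
    ("fog", False), ("fog", False), ("fog", True),
]))  # {'clear_day': 1.0, 'night': 0.5, 'fog': 0.333...}
```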

At Noema, we are extremely careful when building our Test/Validation sets. We make sure all foreseeable edge cases are considered and that all challenging conditions are accounted for. This includes, but is not limited to, bad weather, poor visibility, nighttime operation and more.

About the Author

Bruno Oliveira serves as the Vice President of Engineering at Noema, leading the company's Computer Vision application development team. With over a decade in the CV/AI industry, Bruno's knowledge of the industry and his passion for building solutions that deliver real-world impact make him an invaluable member of the Noema team.

Connect with Us

1536 Cole Blvd
Suite 325
Golden, CO 80401
USA