Is more training data always better?

 

Here, the improvement per step is more consistent across all quality levels and dataset sizes, although it also diminishes as the data approaches a high quality level.
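The per-step improvements discussed in this article can be computed from a grid of mAP values, with one value per combination of label error rate and dataset size. The following is a minimal sketch, not the evaluation code used for the experiment. The function names and all numbers in `map_grid` are hypothetical placeholders chosen for illustration; they are not the measured results.

```python
# Minimal sketch: average mAP gain per step along each axis of a results grid.
# ASSUMPTION: all values below are made-up placeholders, NOT measured results.

# Rows: label error rate, from worst to best quality (hypothetical levels).
error_rates = [0.35, 0.20, 0.10, 0.0]
# Columns: fraction of the full dataset used for training (hypothetical steps).
dataset_fractions = [0.25, 0.50, 0.75, 1.00]

# map_grid[i][j] = mAP for error_rates[i] and dataset_fractions[j] (placeholders).
map_grid = [
    [0.30, 0.36, 0.39, 0.40],
    [0.40, 0.47, 0.51, 0.53],
    [0.48, 0.55, 0.59, 0.61],
    [0.52, 0.59, 0.62, 0.63],
]


def step_gains(values):
    """Return the difference between each pair of consecutive values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def avg_gain_per_size_step(grid):
    """Average mAP gain for each dataset-size increase, over all quality levels."""
    gains_per_row = [step_gains(row) for row in grid]
    # zip(*...) groups the gains of the same size step across all rows.
    return [mean(step) for step in zip(*gains_per_row)]


def avg_gain_per_quality_step(grid):
    """Average mAP gain for each error-rate decrease, over all dataset sizes."""
    columns = list(zip(*grid))  # transpose: one column per dataset size
    gains_per_column = [step_gains(column) for column in columns]
    return [mean(step) for step in zip(*gains_per_column)]


print("Avg gain per dataset-size step:", avg_gain_per_size_step(map_grid))
print("Avg gain per error-rate step:  ", avg_gain_per_quality_step(map_grid))
```

Comparing the two lists side by side shows whether an additional portion of data or a reduction in label errors yields the larger average gain for a given set of results.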

 

What to take away?

The results here indicate, at the very least, that more data is not the only way to reach better model performance.

For safety-critical applications, such as those in the ADAS/AD area, quality assurance is essential. It is the required data quality, not the quantity, that must be ensured in order to analyze and solve emerging problems in ADAS function development.

Increasing data quality by fixing label issues can have an equal or even greater positive impact than adding more data. This is especially useful when data is rare or hard to acquire. Furthermore, the best levels of performance can't be reached with bad or false labels.

Of course, this represents only one use case, and results will probably vary depending on the domain, the amount of data and the model. Nonetheless, the results are consistent with our experience of working with customers across a range of industries. If you are interested in the topic or want to know more about the quality of your data, just get in touch with us.

It is clearly visible that the best performance can only be reached if the data quality is sufficiently high. Furthermore, the positive effect of additional data seems to diminish, especially for data of high quality (0% error rate) or very low quality (35% error rate). This becomes easier to see by looking at the mAP improvement for each increase in dataset size:

 

With each step, the performance gains from more data shrink. A different behavior is observed for increasing data quality. Below, the average improvement for every decrease in error rate is given:
