Data Science

Outlier Detection - Sorting The Sheep From The Goats

October 3, 2023
5 min read
George Hill
Sagitto Ltd

Outlier detection is an important step in preparing spectroscopy data for machine learning models. Outliers can occur for a variety of reasons such as errors in instrument use, mistakes in sample preparation, or accidental sample swaps. Here we use a real world example to illustrate the approach that Sagitto takes to detect outliers in training data.

Sagitto recently developed calibration models for a PerkinElmer DA7250 at-line NIR instrument, as part of a pre-purchase evaluation exercise conducted by a large hops producer. We use this as a case study to describe how we detect outliers in our training data, and to compare this to using Hotelling's T2 and Q-Residuals.

Sanity Check - Look At The NIR Spectra

As a first step, we plot the NIR spectra in the training set to see if any look unusual. In this case, the spectra all have the general shape we expected.

NIR Spectra for hops from PerkinElmer DA7250
NIR absorbance spectra of ground hops from a PerkinElmer DA7250 instrument
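The visual check can be backed up with a quick numeric screen. The sketch below is an illustration on synthetic spectra, not the hops data: it flags any spectrum whose shape correlates poorly with the median spectrum, using an arbitrary 0.95 cutoff.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic stand-in for absorbance spectra: a broad Gaussian band plus
# noise, with one deliberately garbled scan.
wavelengths = np.linspace(950, 1650, 200)
base = np.exp(-((wavelengths - 1450) / 120) ** 2)
spectra = base + rng.normal(scale=0.01, size=(40, 200))
spectra[5] = rng.normal(scale=0.3, size=200)  # garbage scan

# Numeric companion to the visual check: flag any spectrum whose shape
# correlates poorly with the median spectrum.
median_spectrum = np.median(spectra, axis=0)
corr = np.array([np.corrcoef(s, median_spectrum)[0, 1] for s in spectra])
bad = np.flatnonzero(corr < 0.95)
print(bad)
```

A screen like this catches only gross failures (empty cups, saturated detectors); subtler problems still need the model-based checks described below.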

Examine The Initial Cross-Validation Plot

Our next step was to build an initial multivariate calibration model using Sagitto's proprietary techniques, and examine its cross-validation plot.

Alpha Acids in Ground Hops - PerkinElmer DA7250
Cross-validation plot of initial model built using uncleaned data

We immediately noticed two unusual results.

Outlier in alpha acids in hops measured using PerkinElmer DA7250
Sample 22BP02003 had predicted alpha acids of 13.6 compared to the reference value of 5.1
Another outlier in alpha acids in hops measured using PerkinElmer DA7250
Sample 22BP02016 had predicted alpha acids of 10.7 compared to a reference value of 16.6.

Talk To The Customer

Having noticed these two anomalous results in our initial model, we checked with our customer. Sure enough, there were easy explanations: sample 22BP02003 should have had a reference value of 14.7, not 5.1, and the spectrum that we had been supplied for sample 22BP02016 was mislabelled. After correcting these two outliers, we rebuilt the model and got much better results.

Sagitto model of alpha acids in ground hops using PerkinElmer DA7250
Cross-validation plot of Sagitto model built after correction of two outliers, with potential outlier 22BP01909 circled in blue
Yet another Outlier in alpha acids in hops measured using PerkinElmer DA7250
Sample 22BP01909

Now another sample (22BP01909) seemed to be a potential outlier. However, our customer confirmed that the reference value for this sample was correct, so we chose not to remove it. It could simply be an unusual sample, and removing it could be a mistake.

We need to balance the desire to remove outliers in order to increase model accuracy, against the risk that we over-fit a model to data that isn't representative of what it will see when deployed.

The Same Process Using PLS Regression

Just for comparison, we repeated our outlier detection process using the widely used Partial Least Squares (PLS) technique, building the kind of model that might be created with Aspen Unscrambler X. This initial PLS model also highlighted our two outliers.

Initial PLS regression model for alpha acids in hops using PerkinElmer 7250
Cross-validation plot of initial PLS Regression model built using uncleaned data

After correcting these two outliers, we rebuilt the PLS Regression model. As expected, this resulted in an improvement - although the revised PLS model is not as good as the model built using Sagitto’s proprietary machine learning techniques. (Incidentally, this illustrates why Sagitto rarely uses PLS Regression.)

Revised PLS regression model for alpha acids in hops using PerkinElmer 7250
Cross-validation plot of PLS Regression model, rebuilt after correcting two outliers and with sample 22BP01909 circled in blue.

Once again the new PLS Regression model suggests that sample 22BP01909 might be a potential outlier.

We want to be sure that any outlier that we remove is definitely an error, and not just an unusual sample that doesn't fit the model's notion of a 'good' sample.

The outlier detection method described above - eyeballing the NIR spectra, then reviewing cross-validation plots of initial calibration models - works fine for small training sets where we have high confidence in the source of the data. But it may struggle to scale to large data sets in which the provenance of each individual data point is less certain. For that reason, it's worth reviewing some of the more automated outlier detection methods used in spectroscopy applications.

Outlier Detection Using Hotelling's T2 and Q-Residuals from PLS Regression Models

A common technique for identifying outliers in PLS models is to calculate two statistics for each sample - Hotelling's T2 and Q-Residuals. Usually these two statistics are visualised in a scatter plot, with a 95% confidence interval also plotted to give the values a sense of scale. Here's what we found when we used our original PLS model to calculate Hotelling's T2 and Q-Residuals for the hops data, prior to correcting samples 22BP02003 and 22BP02016.

Outlier identification in hops data for PerkinElmer DA7250 using Hotellings T2 and Q Residuals
Sample 1944 has a high Q-Residual value, and sample 2016 has a high Hotelling's T2 value

Looking at the highest values in tabular form, we see that while sample 2016 stands out with a T2 value of 57.6, sample 2003 (our other known error) doesn't make it into the top tier of potential outliers using this method. However, a new candidate emerges: sample 1944, with a very high Q-Residual value.

T Squared and Q Residuals
T2 and Q-Residuals in original data

This surprisingly high Q-Residual value prompted us to take a closer look at sample 1944. We concluded that it was not an error and should stay. Having cleaned the training set of our two genuine outliers - 2003 and 2016 - we calculated the Hotelling's T2 and Q-Residuals values for the revised PLS model. This generates more candidates for consideration as outliers.

More outlier identification in hops data for PerkinElmer DA7250 using Hotellings T2 and Q Residuals
Samples 1912 and 2013 now come into contention.

When To Stop?

At some point in the hunt for outliers, a decision needs to be made about when to stop. Sagitto tends to err on the side of caution, and only remove data from a training set when we're sure that it's an error and not just an unusual sample. However, Hotelling's T2 and Q-Residuals have a place for large, noisy datasets where a more automated approach is required.

To illustrate how this can work, the video below shows 329 NIR spectra being removed from a dataset of 10,243 scans of mango fruit being measured for dry matter, with the 95% confidence intervals marked in blue (T2) and yellow (Q-Residuals). As each sample is removed, a new PLS model is created and the T2 and Q-Residual values are recalculated on the remaining data. These T2 and Q-Residual values change with each iteration. Just as you might think that sufficient samples have been excluded, new ones become candidates for removal! The decision on when to stop ultimately becomes a subjective one. In the paper which Sagitto used as the basis for this example, the authors (using a different process to the one shown here) chose to exclude 329 spectra (about 3% of the initial dataset).


Outlier detection is an important step in preparing spectroscopy data for machine learning models. Hotelling's T2 and Q-Residuals are two outlier detection methods commonly used in chemometrics. However, Sagitto has found that they need to be used with caution to avoid discarding unusual but valid data.


Special thanks to the following:

Daniel Pelliccia of NIRPY Research for his blog post 'Outliers Detection with PLS Regression for NIR Spectroscopy in Python'

Anderson, N., Walsh, K., Flynn, J., & Walsh, J. (2020). Achieving robustness across season, location and cultivar for a NIRS model for intact mango fruit dry matter content. II. Local PLS and nonlinear models. Postharvest Biology and Technology, 171, 111358. doi:10.1016/j.postharvbio.2020.111358

Mishra, P., & Passos, D. (2021). A synergistic use of chemometrics and deep learning improved the predictive performance of near-infrared spectroscopy models for dry matter prediction in mango fruit. Chemometrics and Intelligent Laboratory Systems, 212. doi:10.1016/j.chemolab.2021.104287

My father Rowland Blackith Hill, who taught me many things including how to draft sheep - and the occasional goat.

Sagitto's founder, George Hill, first started working with artificial intelligence during the 1980s, while developing 'expert systems' within Bank of America in London. On returning to New Zealand, he undertook part-time study with the University of Waikato's Machine Learning Group while working for Hill Laboratories, a well-known New Zealand commercial testing laboratory. This led to the formation of Sagitto Limited, dedicated to combining the power of artificial intelligence and machine learning with spectroscopy.
