During any basic or statistical analysis, it is necessary to identify the outliers in a dataset and understand their possible impact on the results. Hence, handling outliers in analytics is an important topic to learn.
What is an Outlier?
In a group of smokers, you can always notice a guy who is a non-smoker. Well, that guy is the outlier of the group! So, when data objects are very different from the rest of the data, they are called outliers. Analysts often face situations in which patterns in the data do not conform to the expected normal behavior.
It is very important to identify the outliers present in the dataset and understand why those are outliers as they might impact the results. The motive of outlier analysis is not always the exclusion of outliers. The exclusion or inclusion of outliers depends upon the reason why the case is an outlier and also on the purpose of the analysis. Let us now explore how to detect outliers and also the methods available for handling outliers in analytics.
How to Detect an Outlier?
To identify univariate outliers, i.e. cases that have an unusual value for a single variable, analysts convert the values/scores of the variable to standard ‘z’ scores.
Mathematically, this can be stated as
z = (x − µ) / σ
where x is the value/score for the variable, µ is the mean score and σ is the standard deviation.
Now, for a small sample (80 or fewer cases), a case is considered an outlier if its standard score is ±2.5 or beyond, that is, |z| ≥ 2.5. Similarly, for a large sample (more than 80 cases), a case is considered an outlier if its standard score is ±3 or beyond.
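The rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `zscore_outliers`, the sample data, and the choice of the sample standard deviation (n − 1 denominator) are assumptions made for the example, since the text does not say which standard deviation to use.

```python
from statistics import mean, stdev

def zscore_outliers(values, small_sample_cutoff=80):
    """Return (index, value, z) for every case whose |z| meets the threshold.

    Threshold is 2.5 for samples of 80 or fewer cases and 3.0 otherwise.
    """
    n = len(values)
    mu = mean(values)
    sigma = stdev(values)  # assumption: sample standard deviation (n - 1)
    if sigma == 0:
        return []  # all values identical, so nothing can be an outlier
    threshold = 2.5 if n <= small_sample_cutoff else 3.0
    flagged = []
    for i, x in enumerate(values):
        z = (x - mu) / sigma
        if abs(z) >= threshold:
            flagged.append((i, x, z))
    return flagged

# Hypothetical scores: thirteen values near 50 and one extreme value.
scores = [52, 48, 50, 51, 49, 47, 53, 50, 49, 51, 50, 48, 52, 95]
outliers = zscore_outliers(scores)
for index, value, z in outliers:
    print(f"case {index}: value={value}, z={z:.2f}")
```

Note that the extreme value itself inflates both the mean and the standard deviation, so in very small samples even a gross outlier may fail to reach the threshold.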
Multivariate outliers are cases that have an unusual combination of values for a number of variables. To detect multivariate outliers, the Mahalanobis distance is measured. Mahalanobis distance (D²) measures the distance of a case from the centroid (multidimensional mean) of a distribution, taking into account the covariance (multidimensional variance) of the distribution.
In the multivariate setting, a case is considered to be an outlier if the probability associated with its D² is 0.001 or less. Under the usual assumption of multivariate normality, D² approximately follows a chi-square distribution with degrees of freedom equal to the number of variables included in the calculation.
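As a rough sketch of this test, the example below handles the special case of exactly two variables, which keeps it free of external libraries: the 2×2 covariance matrix can be inverted by hand, and for 2 degrees of freedom the chi-square upper-tail probability is simply exp(−D²/2). The function name `mahalanobis_outliers_2d` and the data are hypothetical. For more variables one would normally use a linear algebra library and a general chi-square function instead.

```python
import math
from statistics import mean

def mahalanobis_outliers_2d(points, alpha=0.001):
    """Flag (x, y) cases whose D^2 has chi-square probability <= alpha.

    Assumption: exactly two variables, so the chi-square distribution has
    2 degrees of freedom and its upper-tail probability is exp(-D^2 / 2).
    """
    n = len(points)
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = mean(xs), mean(ys)  # centroid of the data

    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")

    flagged = []
    for i, (x, y) in enumerate(points):
        dx, dy = x - mx, y - my
        # D^2 = d^T S^-1 d, written out with the closed-form 2x2 inverse.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        p = math.exp(-d2 / 2)  # chi-square upper tail, 2 degrees of freedom
        if p <= alpha:
            flagged.append((i, d2, p))
    return flagged

# Hypothetical data: y is roughly 2 * x, plus one case (15, 5) whose
# individual values are ordinary but whose combination is unusual.
points = [(i, 2 * i + (i % 3) - 1) for i in range(30)]
points.append((15, 5))
flagged = mahalanobis_outliers_2d(points)
for index, d2, p in flagged:
    print(f"case {index}: D2={d2:.2f}, p={p:.2e}")
```

The added case would not stand out on either variable alone (both 15 and 5 lie inside the observed ranges), which is exactly the situation the univariate z-score cannot catch.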
After an outlier has been detected, the question remains: what should be done with it? One can exclude it, replace its value with the next highest or lowest value in the data, or transform the variable, depending on the nature and objective of the analysis. Outlier detection can be extremely important for the cases mentioned below.
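The three handling options described above (exclusion, replacement, transformation) might look like this in Python. This is only an illustrative sketch: the helper names, the sample data, and the choice of a natural log as the transformation are assumptions, and the indices of the outliers are taken as already known from a detection step.

```python
import math

def exclude(values, outlier_idx):
    """Drop the outlying cases entirely."""
    skip = set(outlier_idx)
    return [v for i, v in enumerate(values) if i not in skip]

def replace_with_nearest(values, outlier_idx):
    """Replace each outlier with the next highest or lowest remaining value."""
    skip = set(outlier_idx)
    kept = [v for i, v in enumerate(values) if i not in skip]
    lo, hi = min(kept), max(kept)
    # Clamp only the outlying cases into the range of the remaining data.
    return [min(max(v, lo), hi) if i in skip else v
            for i, v in enumerate(values)]

def log_transform(values):
    """Compress large values; assumes every value is positive."""
    return [math.log(v) for v in values]

# Hypothetical scores where case 13 (value 95) was flagged as an outlier.
scores = [52, 48, 50, 51, 49, 47, 53, 50, 49, 51, 50, 48, 52, 95]
trimmed = exclude(scores, [13])
capped = replace_with_nearest(scores, [13])
logged = log_transform(scores)
```

Which option is appropriate depends on why the case is an outlier: a data entry error may justify exclusion, while a genuine but extreme observation is often better kept and capped or transformed.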
Applications of Outlier Analysis
Recognition of unusual characteristics can generate useful insights.
- Fraud Detection: In many cases, unauthorized use of a credit card produces spending patterns that differ from the cardholder's usual behavior, which can help to identify credit card fraud.
- Medical Diagnosis: Data can be collected from a variety of devices and procedures such as ECG, MRI scans and PET scans. Unusual patterns in this data can be used to identify disease.
- Law Enforcement: Some patterns can only be discovered over time, through multiple actions of an entity. Fraud in financial transactions, trading or insurance claims may be caused by the actions of criminal entities and can stand out as an unusual pattern.
- Intrusion Detection Systems: Different kinds of data may be collected from a host-based or networked computer system. Malicious activity can be traced from unusual behavior in this data.