Even when everything else in a dataset is accurate, outliers alone can distort the results of your analysis. So in this article, let's discuss how to handle outliers in a dataset.
What are Outliers?
By definition, outliers are numerical values in a dataset that differ markedly from the majority of the other values. It is tempting to remove them, but removal is not always a good idea: there has to be a specific reason for removing an outlier from the dataset. Moreover, identifying outliers can be extremely difficult, and it isn't always possible for a data scientist to find these anomalous instances.
Types Of Outliers in a Dataset
Outliers are generally of two types, i.e. univariate outliers and multivariate outliers. In addition, there are some environment-based outliers.
When a single variable in a dataset contains unusual values, those values are known as univariate outliers.
In contrast to univariate outliers, multivariate outliers are unusual combinations of values across two or more variables, even when each value looks ordinary on its own. It is very tricky for a human to spot these by eye, so specific models usually have to be trained to find multivariate outliers.
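One common rule of thumb for flagging univariate outliers is Tukey's IQR fences. The sketch below is only an illustration of that rule, not a method prescribed here: the function name `iqr_outliers`, the multiplier `k=1.5`, and the `ages` values are all assumptions made up for the example.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside the fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles with n=4 returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical single variable: ages with one suspicious entry.
ages = [21, 22, 22, 23, 24, 24, 25, 26, 27, 95]
flagged = iqr_outliers(ages)
print(flagged)
```

A flagged value is only a candidate: whether it is an error or a genuine observation still has to be checked by hand.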
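To make the idea concrete, here is a minimal sketch that scores two-variable data points with the Mahalanobis distance. This is a simple distance measure swapped in for illustration rather than a trained model, and the height/weight pairs are hypothetical. Notice that the last point has an ordinary height and an ordinary weight, but the combination is unusual.

```python
import math

def mahalanobis_distances(points):
    """Mahalanobis distance of each (x, y) point from the sample mean."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix (divide by n - 1).
    sxx = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mean_x, y - mean_y
        # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, written out for the 2x2 case.
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(d2))
    return distances

# Hypothetical (height in cm, weight in kg) pairs; the last pair is the odd one.
points = [(150, 50), (155, 54), (160, 59), (165, 63), (170, 68),
          (175, 72), (180, 77), (185, 82), (190, 86), (158, 84)]
distances = mahalanobis_distances(points)
most_unusual = distances.index(max(distances))
print(points[most_unusual])
```

Checking height or weight alone would not single out this point, because both values sit inside the range of the other observations.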
Environment Based Outliers
Apart from this, depending on the environment, there are also different flavours of outliers: point outliers, contextual outliers, and collective outliers.
- Point outliers are single data points that lie far away from the rest of the dataset.
- Contextual outliers are values that are unusual only in a particular context. Noise in data is often treated this way, e.g. special characters during text analysis or background audio during speech recognition.
- Collective outliers are groups of data points that are unusual together, even if each point looks normal on its own. They can indicate novelty in a dataset.
What Causes Outliers In a Dataset?
Multiple factors can introduce outliers into your dataset, but the primary causes are data entry errors, sampling problems, and natural variation.
Data entry errors mostly happen when data is typed into an Excel sheet: a wrong input is entered by mistake, and the outlier eventually shows up during the analysis phase. The key thing to note here is that even a minute typo can ruin the process. These mistakes are usually easy to find, though, and the value can be corrected or the data point deleted from the dataset.
Sampling problems arise when samples that don't belong to the target population end up in the dataset. Let's say we have collected the body fat percentage of males under the age of 20, but there's an unusual body fat percentage among the collected samples. On further analysis, it is found that the subject goes to the gym every day, whereas the goal was to measure the body fat percentage of males under 20 who don't exercise every day. As this data point doesn't meet the required criteria, we won't consider it.
Outliers from natural variation aren't anyone's fault; they simply occur in the process being measured. In a very large dataset, some extreme values are to be expected: viewed separately they seem like outliers, but viewed together with the rest of the distribution they look normal, so you don't need to worry about them. A disruption in the process can also produce an outlier, for example a power failure that alters a machine's settings and standards while it is running.
Negative Impacts of Outliers
- The results of data analysis and statistical modelling can be greatly affected by outliers.
- Estimates can become biased if there's an outlier.
- The mean and standard deviation are affected by the presence of outliers.
Consider the following dataset without an outlier.
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4
Finding the mean, median, mode and (sample) standard deviation, we get:
- Mean = 2.45
- Median = 2
- Mode = 2
- Standard Deviation = 1.04
If we add an outlier to the same dataset, we get:
1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400
And on finding the mean, median, mode and (sample) standard deviation, we get:
- Mean = 35.58
- Median = 2.5
- Mode = 2
- Standard Deviation = 114.77
As you can see, the mean and standard deviation have been drastically affected, while the median and mode have barely changed.
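If you'd like to reproduce this comparison yourself, here is a minimal sketch using Python's standard `statistics` module on the two datasets shown above. The helper name `summarize` is just an assumption for this example, and `stdev` here is the sample standard deviation (it divides by n - 1).

```python
import statistics

clean = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4]
with_outlier = clean + [400]  # same data plus the single outlier

def summarize(values):
    """Return the four summary statistics discussed above."""
    return {
        "mean": round(statistics.mean(values), 2),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stdev": round(statistics.stdev(values), 2),  # sample standard deviation
    }

before = summarize(clean)
after = summarize(with_outlier)
for name in before:
    print(f"{name}: {before[name]} -> {after[name]}")
```

The median and mode depend on the order and frequency of values rather than on their magnitude, which is why a single extreme value hardly moves them.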
How to Handle Outliers In a Dataset
Before altering outliers, make sure that you've rechecked the possible causes of each outlier.
If Possible, Correct; Otherwise, Remove the Data Point
If the outlier results from data entry or sampling errors, it's best to correct it if possible; if you can't correct it, it's best to remove the data point from the dataset. If you choose to remove the outlier, make sure to document the removal and the reason for it.
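As a rough sketch of what "remove and document" could look like in code, the snippet below drops one value and records why. The function `remove_outlier`, the log format, and the body fat numbers are all hypothetical, chosen only to illustrate the habit of keeping a removal record next to the cleaned data.

```python
from datetime import date

def remove_outlier(values, outlier, reason, log):
    """Return a copy of values without the first occurrence of outlier, and log why."""
    if outlier not in values:
        raise ValueError(f"{outlier!r} is not in the dataset")
    cleaned = list(values)   # copy, so the original data stays untouched
    cleaned.remove(outlier)
    log.append({
        "removed": outlier,
        "reason": reason,
        "date": date.today().isoformat(),
    })
    return cleaned

removal_log = []
body_fat = [12.5, 14.0, 15.2, 13.8, 6.1, 14.9]  # hypothetical percentages
cleaned = remove_outlier(
    body_fat, 6.1, "subject exercises daily; outside the sampling criteria", removal_log
)
print(cleaned)
print(removal_log)
```

Keeping the original list intact means the removal can always be revisited later if the reason turns out to be wrong.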
What to Do When You Cannot Remove the Outliers From the Dataset?
When you can't exclude outliers from the dataset but they keep violating the assumptions of your statistical analysis, you can switch to statistical methods whose results are less distorted by them.
- Non-parametric hypothesis tests are very useful because they do not assume that the data follow a particular distribution.
- Regression analysis is a statistical technique that models how closely the independent variables are related to the dependent variable. Common types include linear regression, multiple linear regression, polynomial regression, and multiple polynomial regression. Ordinary least squares is itself sensitive to outliers, so robust variants of regression are the better fit here.
- Bootstrapping techniques resample a single dataset, with replacement, to create many simulated samples from it.
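To illustrate the last option, here is a minimal percentile-bootstrap sketch applied to the dataset with the outlier from earlier. The function name `bootstrap_ci`, the 2000 resamples, the fixed seed, and the choice of the median as the statistic are all assumptions made for this example.

```python
import random
import statistics

def bootstrap_ci(values, stat=statistics.median, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the given statistic."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # Each resample draws len(values) items from the data with replacement.
    estimates = sorted(
        stat(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

data = [1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 400]
low, high = bootstrap_ci(data)
print(f"95% bootstrap interval for the median: {low} to {high}")
```

Because the median is used as the statistic, the interval stays close to the bulk of the data even though the outlier is still in the dataset. Passing `stat=statistics.mean` instead shows how much wider the interval becomes when the statistic itself is sensitive to the outlier.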