How to remove outliers in Python?

By Shivang Yadav Last updated : November 21, 2023

Outliers in Python

Outliers are observations that lie far away from the rest of the values in a dataset. They are often caused by an error in the program or in the data source. They need to be handled before any operation is performed on the data, because they can reduce the accuracy of the analysis.

Methods to identify outliers

Outliers must be identified before they can be processed. This means finding the values that fall outside the expected range and then eliminating them, so that the analysis done on the data is more accurate.

Interquartile Range method

The IQR (interquartile range) method is an outlier identification method. The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. A value is treated as an outlier if it lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
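
The IQR rule can be sketched on a single column of numbers. The sample values and the names lower_fence and upper_fence below are made up for illustration; only np.percentile comes from NumPy.

```python
import numpy as np

# Hypothetical sample: five ordinary readings and one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0, 100.0])

# Q1 and Q3 are the 25th and 75th percentiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Anything beyond 1.5 * IQR from the quartiles is flagged.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

is_outlier = (values < lower_fence) | (values > upper_fence)
outliers = values[is_outlier]
print(f"Fences: [{lower_fence}, {upper_fence}]")
print(f"Outliers: {outliers}")
```

Here Q1 is 11.25 and Q3 is 12.75, so the fences are 9.0 and 15.0, and only the value 100 falls outside them.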

Z-score method

The Z-score is another statistical measure that can be used to identify outliers. The Z-score of a value is its distance from the mean of the dataset, expressed in standard deviations. A value whose Z-score is far from zero is treated as an outlier.
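
As a minimal sketch of the definition, the Z-score can be computed directly from the mean and standard deviation. The sample values are hypothetical, and the cut-off of 3 is the conventional choice rather than a fixed rule.

```python
import numpy as np

# Hypothetical sample: twenty ordinary readings and one extreme value.
values = np.array([10.0, 12.0, 11.0, 13.0, 12.0] * 4 + [100.0])

# Z-score = (value - mean) / standard deviation.
# values.std() uses the population formula (ddof=0), which is also
# the default of scipy.stats.zscore.
z_scores = (values - values.mean()) / values.std()

# Flag values more than 3 standard deviations away from the mean.
outliers = values[np.abs(z_scores) > 3]
print(f"Outliers: {outliers}")
```

Only the value 100 is flagged here; its Z-score is roughly 4.5, while the other values stay well below 1 in absolute terms.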

How to Remove Outliers in Python?

Once identified, outliers need to be removed so that the data to be processed is cleaner and the results are more reliable.

Z-score Method

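The Z-score of each value in the dataset can be used as a measure to remove outliers. With this method, a row is kept only if the Z-scores of all its values fall within the -3 to +3 range.

Outliers: observations with a Z-score outside the -3 to 3 range.

The Z-score method is the less sensitive of the two methods, which means only extreme outliers are deleted. On small datasets it can miss outliers altogether, because the outliers themselves inflate the mean and the standard deviation. With the population standard deviation that scipy.stats.zscore uses by default, no Z-score in a sample of n values can exceed √(n - 1), which is about 2.65 for n = 8.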

Program to illustrate the removal of outliers in Python using Z-score

import numpy as np
import pandas as pd
import scipy.stats as stats

array = np.array(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ]
)

index_values = [1, 2, 3, 4, 5, 6, 7, 8]
column_values = ["A", "B", "C"]
dataValues = pd.DataFrame(data=array, index=index_values, columns=column_values)

print(f"The dataset is \n{dataValues}")

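# Absolute Z-score of every value, computed column by column
zScore = np.abs(stats.zscore(dataValues))
# Keep only the rows whose Z-scores are below 3 in every column
data_clean = dataValues[(zScore < 3).all(axis=1)]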

print(f"Value count in dataSet after removing outliers is \n{data_clean.shape}")

The output of the above program is:

The dataset is 
              A             B             C
1  3.158650e-01  1.527900e-01 -4.540030e-01
2 -8.383800e-02  2.133600e-01 -2.008560e-01
3  6.551160e-01  8.548500e-02  4.291400e-02
4  1.484537e+07 -1.079805e+07 -1.977728e+07
5  2.431210e-01  3.212300e-01 -4.549930e-01
6 -8.833800e-02  2.133640e-01 -2.118560e-01
7  1.655110e-01  8.548500e-02  4.291400e-02
8  1.484537e+07 -1.079806e+07 -1.977718e+07
Value count in dataSet after removing outliers is 
(8, 3)
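
The shape is still (8, 3), so no row was removed even though rows 4 and 8 are clearly extreme. The sketch below is an illustration, not part of the original program: it rebuilds the same array and prints the largest absolute Z-score in each column next to the theoretical upper bound for eight rows. The names max_z and bound are introduced here.

```python
import numpy as np
import scipy.stats as stats

# Same values as in the program above.
array = np.array(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ]
)

# Largest absolute Z-score per column (axis=0 works column by column).
max_z = np.abs(stats.zscore(array, axis=0)).max(axis=0)

# With the population standard deviation (ddof=0, the scipy default),
# no Z-score in n values can exceed sqrt(n - 1).
bound = np.sqrt(array.shape[0] - 1)

print(f"Largest |Z| per column: {max_z}")
print(f"Upper bound for {array.shape[0]} rows: {bound:.3f}")
```

Every column peaks at roughly 1.73, far below the threshold of 3, because two of the eight rows are extreme and pull the mean and standard deviation towards themselves. For a dataset this small, a lower threshold or the IQR method is a better fit.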

Interquartile Range Method

The IQR is the difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset. A row is removed if any of its values lies below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR for its column.

Program to illustrate the removal of outliers in Python using the Interquartile Range method

import numpy as np
import pandas as pd
import scipy.stats as stats

array = np.array(
    [
        [0.315865, 0.152790, -0.454003],
        [-0.083838, 0.213360, -0.200856],
        [0.655116, 0.085485, 0.042914],
        [14845370, -10798049, -19777283],
        [0.243121, 0.32123, -0.454993],
        [-0.088338, 0.213364, -0.211856],
        [0.165511, 0.085485, 0.042914],
        [14845370, -10798055, -19777183],
    ]
)

index_values = [1, 2, 3, 4, 5, 6, 7, 8]
column_values = ["A", "B", "C"]
dataValues = pd.DataFrame(data=array, index=index_values, columns=column_values)

print(f"The dataset is \n{dataValues}")

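# First and third quartiles of every column
Q1 = dataValues.quantile(q=0.25)
Q3 = dataValues.quantile(q=0.75)
# Interquartile range of every column (equal to Q3 - Q1)
IQR = dataValues.apply(stats.iqr)

# Drop every row that has a value outside the fences in any column
data_clean = dataValues[
    ~((dataValues < (Q1 - 1.5 * IQR)) | (dataValues > (Q3 + 1.5 * IQR))).any(axis=1)
]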

print(f"Value count in dataSet after removing outliers is \n{data_clean.shape}")

The output of the above program is:

The dataset is 
              A             B             C
1  3.158650e-01  1.527900e-01 -4.540030e-01
2 -8.383800e-02  2.133600e-01 -2.008560e-01
3  6.551160e-01  8.548500e-02  4.291400e-02
4  1.484537e+07 -1.079805e+07 -1.977728e+07
5  2.431210e-01  3.212300e-01 -4.549930e-01
6 -8.833800e-02  2.133640e-01 -2.118560e-01
7  1.655110e-01  8.548500e-02  4.291400e-02
8  1.484537e+07 -1.079806e+07 -1.977718e+07
Value count in dataSet after removing outliers is 
(6, 3)
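
For repeated use, the same filter could be wrapped in a small function. This is only a sketch: the name remove_outliers_iqr, its parameter k, and the sample DataFrame are hypothetical and do not come from pandas or from the program above.

```python
import pandas as pd


def remove_outliers_iqr(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Drop rows with a value outside [Q1 - k*IQR, Q3 + k*IQR] in any column."""
    q1 = frame.quantile(0.25)
    q3 = frame.quantile(0.75)
    iqr = q3 - q1
    # The Series q1, q3 and iqr are aligned with the DataFrame columns.
    outside = (frame < (q1 - k * iqr)) | (frame > (q3 + k * iqr))
    return frame[~outside.any(axis=1)]


# Hypothetical data: the last row holds an extreme value in column A.
sample = pd.DataFrame(
    {
        "A": [1.0, 2.0, 3.0, 2.0, 1000.0],
        "B": [5.0, 6.0, 5.0, 6.0, 5.0],
    }
)

clean = remove_outliers_iqr(sample)
print(clean.shape)
```

The multiplier k is exposed as a parameter because 1.5 is a convention; a larger value such as 3 keeps more rows and removes only the most extreme ones.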

Copyright © 2024 www.includehelp.com. All rights reserved.