Home » Python

Conditional selection in the DataFrame | Pandas DataFrame

Here, we are going to learn about the conditional selection in the Pandas DataFrame in Python, Selection Using multiple conditions, etc.
Submitted by Sapna Deraje Radhakrishna, on January 06, 2020

Conditional selection in the DataFrame

Consider the following example,

import numpy as np
import pandas as pd
from numpy.random import randn

np.random.seed(102)
df = pd.DataFrame(randn(5,4),['P','Q','R','S','T'],['A','B','C','D'])
print(df)

Output

          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
S -0.484928 -1.109264 -0.558975  1.042387
T -1.712263  0.136120 -0.464444  0.050980

If we use < symbol on a DataFrame, like >0, the values in the dataFrame is compared against 0 and returned with True/False.

print(df > 0)

'''
Output:
       A      B      C      D
P   True   True   True  False
Q   True   True  False  False
R  False  False   True  False
S  False  False  False   True
T  False   True  False   True
'''

Now, assign the df>0 to a Boolean value called bool_df

bool_df = df > 0
print(bool_df)

'''
Output:
       A      B      C      D
P   True   True   True  False
Q   True   True  False  False
R  False  False   True  False
S  False  False  False   True
T  False   True  False   True
'''

Pass bool_df to df, in the below we can see that the values which were True have their original value and where it is False, we have a NAN. Using this approach, we can use the conditional selection in dataFrame.

print(df[bool_df])

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997       NaN
Q  1.299748  0.331183       NaN       NaN
R       NaN       NaN  0.973165       NaN
S       NaN       NaN       NaN  1.042387
T       NaN  0.136120       NaN  0.050980
'''

The above can be achieved in single line,

print(df[df>0])

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997       NaN
Q  1.299748  0.331183       NaN       NaN
R       NaN       NaN  0.973165       NaN
S       NaN       NaN       NaN  1.042387
T       NaN  0.136120       NaN  0.050980
'''

Instead of passing an entire dataFrame, pass only the row/column and instead of returning nulls what that's going to do is return only the rows/columns of a subset of the data frame where the conditions are True.

Take a look at the 'A' column, here the value against 'R', 'S', 'T' are less than 0 hence you get False for those rows,

print(df['A'])

'''
Output:
P    1.668068
Q    1.299748
R   -0.130016
S   -0.484928
T   -1.712263
Name: A, dtype: float64
'''

print(df['A']>0)

'''
Output:
P     True
Q     True
R    False
S    False
T    False
Name: A, dtype: bool
'''

Use this series of Boolean values corresponding to rows to filter out rows based off of a column values and that means if the series is passed into a dataFrame using bracket notation, the rows of A which have True value will only be returned (no null values returned).

print(df[df['A']>0])

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
'''

Consider some more examples of conditional selection, grab all rows from df where D<0

print(df)

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
S -0.484928 -1.109264 -0.558975  1.042387
T -1.712263  0.136120 -0.464444  0.050980
'''

Retrieving the subset dataFrame, step wise

print(df[df['D']<0]) #subset dataFrame

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
'''

result_df = df[df['D']<0]
print(result_df)
'''
Output
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
'''

print(result_df['B'])

'''
Output:
P    0.925862
Q    0.331183
R   -2.238203
Name: B, dtype: float64
'''

Retrieving the subset dataFrame, in single step

print(df[df['D']<0]['B'])

'''
Output:
P    0.925862
Q    0.331183
R   -2.238203
Name: B, dtype: float64
'''

Retrieving multiple columns from dataFrame

print(df[df['D']<0][['B','C']])

'''
Output:
          B         C
P  0.925862  1.057997
Q  0.331183 -0.509845
R -2.238203  0.973165
'''

Though it seems to be a little confusing to use one-liners, it is a preferred way, since using multiple steps the code takes more memory with each variable defined. However, until one is comfortable it is good to break it down to multiple steps.

Selection Using multiple conditions

The normal approach in python to implement multiple conditions is by using 'and' operator. However, if we use the 'and' operator in the pandas function we get an 'ValueError: The truth value of a Series is ambiguous.' Consider the below example

print(df[(df['D']<0) and (df['A']>0)])

'''
Output:
Traceback (most recent call last):
  File "main.py", line 31, in <module>
    print(df[(df['D']<0) and (df['A']>0)])
  File "/home/runner/.local/share/virtualenvs/python3/lib/python3.7/site-packages/pandas/core/generic.py", line 1555, in __nonzero__
    self.__class__.__name__
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
'''

The reason for the above error is, in python the 'and' operator can deal with the single instance of Boolean values and not multiple instances. The df['D']<0 results in multiple instances of Boolean value, as shown below,

print(df['D']<0)

'''
Output:
P     True
Q     True
R     True
S    False
T    False
Name: D, dtype: bool
'''

In Pandas, in order to use and logical operation we have to use &

print(df[(df['D']<0) & (df['A']>0)])

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
'''

'or' operation, use '|'

print(df[(df['D']<0) | (df['A']>0)])

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
'''

Resetting the index

In order to reset the index, use the method reset_index() as explained in below example,

print(df.reset_index())

'''
Output:
  index         A         B         C         D
0     P  1.668068  0.925862  1.057997 -0.920339
1     Q  1.299748  0.331183 -0.509845 -0.903099
2     R -0.130016 -2.238203  0.973165 -0.024185
3     S -0.484928 -1.109264 -0.558975  1.042387
4     T -1.712263  0.136120 -0.464444  0.050980
'''

In the above example, the index is reset to numerical values and the existing indexes are reset to a column 'index'.

The method reset_index() doesn't occur in place, unless we pass an argument (inplace=True), as explained in below example,

print(df)

'''
Output:
          A         B         C         D
P  1.668068  0.925862  1.057997 -0.920339
Q  1.299748  0.331183 -0.509845 -0.903099
R -0.130016 -2.238203  0.973165 -0.024185
S -0.484928 -1.109264 -0.558975  1.042387
T -1.712263  0.136120 -0.464444  0.050980
'''

print(df.reset_index(inplace=True))

'''
Output:
  index         A         B         C         D
0     P  1.668068  0.925862  1.057997 -0.920339
1     Q  1.299748  0.331183 -0.509845 -0.903099
2     R -0.130016 -2.238203  0.973165 -0.024185
3     S -0.484928 -1.109264 -0.558975  1.042387
4     T -1.712263  0.136120 -0.464444  0.050980
'''

Setting the index

Create a new column as mentioned below,

new_index = 'KA KL AP TS MH'.split() #creates a list
print(new_index)

'''
Output:
['KA', 'KL', 'AP', 'TS', 'MH']
'''

df['states'] = new_index
print(df)

'''
Output:
  index         A         B         C         D states
0     P  1.668068  0.925862  1.057997 -0.920339     KA
1     Q  1.299748  0.331183 -0.509845 -0.903099     KL
2     R -0.130016 -2.238203  0.973165 -0.024185     AP
3     S -0.484928 -1.109264 -0.558975  1.042387     TS
4     T -1.712263  0.136120 -0.464444  0.050980     MH
'''

Set the index, sets in-place (cannot be reverted)

print(df.set_index('states'))

'''
Output:
       index         A         B         C         D
states
KA         P  1.668068  0.925862  1.057997 -0.920339
KL         Q  1.299748  0.331183 -0.509845 -0.903099
AP         R -0.130016 -2.238203  0.973165 -0.024185
TS         S -0.484928 -1.109264 -0.558975  1.042387
MH         T -1.712263  0.136120 -0.464444  0.050980
'''

Comments and Discussions!

Load comments ↻