Python Pandas Data Operations (With Examples)

Various Data Operations in Python Pandas: In this tutorial, we will learn about some of the very useful operations with explanations and examples. By Sapna Deraje Radhakrishna, on February 03, 2020

Pandas support very useful operations which are illustrated below,

Consider the below dataFrame,

import numpy as np
import pandas as pd

df = pd.DataFrame({
  'col1': [1, 2, 3, 4],
  'col2': [444, 555, 666, 444],
  'col3': ['abc', 'def', 'ghi', 'xyz']
})

print(df.head())

'''
Output:
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''

1. Finding unique values in a DataFrame

In order to find unique values from columns,

# returns numpy array of all unique  values
print(df['col2'].unique() )
# Output: array([444, 555, 666])

# returns length / number of unique values 
# in a numpy array
print(df['col2'].nunique())
# Output: 3

# if we want the table of the unique values
# and how many times they show up
print(df['col2'].value_counts() )
'''
Output:
444    2
555    1
666    1
Name: col2, dtype: int64
'''

2. Selecting data from a DataFrame

Consider the dataFrame,

Using the conditional selection, we could select data as follows,

print(df['col1']>2)

'''
Output:
0    False
1    False
2     True
3     True
Name: col1, dtype: bool
'''

print(df[(df['col1']>2)])

'''
Output:
   col1  col2 col3
2     3   666  ghi
3     4   444  xyz
'''

print(df[df['col1']>2 & (df['col2']==44)])

'''
Output:
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''

3. Applied Methods

Consider a simple method,

def times2(x):
  return x*2

We already are aware that we can grab a column and call a built-in function off of it. Such as below,

print(df['col1'].sum())
# Output: 10

Now, in order to apply the custom function, such as one defined above (times2), pandas provide an option to do that as well, as explained below,

print(df['col2'].apply(times2))

'''
Output:
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
'''

Apply built-in functions,

print(df['col3'].apply(len))

'''
Output:
0    3
1    3
2    3
3    3
Name: col3, dtype: int64
'''

Apply method will be more powerful, when combined with lambda expressions. For instance,

print(df['col2'].apply(lambda x: x*2))

'''
Output:
0     888
1    1110
2    1332
3     888
Name: col2, dtype: int64
'''

4. Getting column names, RangeIndex, and sort values

# returns the columns names
print(df.columns) 
# Output: Index(['col1', 'col2', 'col3'], dtype='object')

#since this is a rangeindex, it actually reports 
# start, stop and step values too
print(df.index)
# Output: RangeIndex(start=0, stop=4, step=1)

# sort by column
print(df.sort_values('col2'))

'''
Output:
   col1  col2 col3
0     1   444  abc
3     4   444  xyz
1     2   555  def
2     3   666  ghi
'''

In the above result, note that the index values doesn't change, this is to ensure that the values is retained.

5. Checking null values in a DataFrame using isnul()

# isnull
print(df.isnull())

'''
Output
    col1   col2   col3
0  False  False  False
1  False  False  False
2  False  False  False
3  False  False  False
'''

The isnull() will return a dataframe of booleans indicating whether or not the value was null or not. In the above, we get a boolean of all false because we have nulls in our dataframe.

6. Dropping NaN values in a DataFrame using dropna()

print(df.dropna())

'''
Output:
   col1  col2 col3
0     1   444  abc
1     2   555  def
2     3   666  ghi
3     4   444  xyz
'''

7. Replacing NaN values with custom values in a DataFrame

df = pd.DataFrame({
  'col1': [1, 2, 3, np.nan],
  'col2': [np.nan, 555, 666, 444],
  'col3': ['abc', 'def', 'ghi', 'xyz']
})

print(df)

'''
Output:
   col1   col2 col3
0   1.0    NaN  abc
1   2.0  555.0  def
2   3.0  666.0  ghi
3   NaN  444.0  xyz
'''

print(df.fillna('FILL'))

'''
Output:
   col1  col2 col3
0     1  FILL  abc
1     2   555  def
2     3   666  ghi
3  FILL   444  xyz
'''

8. Using pivot table in a DataFrame

This methodology will be familiar for the Advanced Excel users. Consider a new dataFrame,

data = {
  'A': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'],
  'B': ['one', 'one', 'two', 'two', 'one', 'one'],
  'C': ['x', 'y', 'x', 'y', 'x', 'y'],
  'D': [1, 3, 2, 5, 4, 1]
}

df = pd.DataFrame(data)

print(df)

'''
Output:
     A    B  C  D
0  foo  one  x  1
1  foo  one  y  3
2  foo  two  x  2
3  bar  two  y  5
4  bar  one  x  4
5  bar  one  y  1
'''

The pivot table, creates a multi index dataFrame. The pivot table takes three main arguments, the values, the index and the columns.

print(df.pivot_table(values='D',index=['A', 'B'],columns=['C']))

'''
Output:
      C        x    y
A    B            
bar one  4.0  1.0
      two  NaN  5.0
foo one  1.0  3.0
      two  2.0  NaN
'''


Comments and Discussions!

Load comments ↻





Copyright © 2024 www.includehelp.com. All rights reserved.