How to split data into 3 sets (train, validation and test)?

NumPy | Split data into 3 sets (train, validation, and test): In this tutorial, we will learn how to split a given dataset into three sets (training, validation, and testing) using Python with NumPy and pandas.

By Pranit Sharma. Last updated: June 04, 2023

Problem Statement

Given a dataset stored in a pandas DataFrame, we have to split it into 3 sets (training, validation, and testing).

Solution approach

When building a machine learning model, we usually split the data into three sets: the training set (used to fit the model), the validation set (used to tune it), and the testing set (used for the final evaluation).

The proportions of the sets are chosen by the user. A common choice is 60% of the data for the training set and 20% each for the validation and testing sets.
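As a quick illustration of the arithmetic, the sketch below turns the 60/20/20 proportions into the two row positions at which a dataset would be cut. The function name `split_points` and its parameters are hypothetical and chosen only for this example.

```python
def split_points(n_rows, train_frac=0.6, validate_frac=0.2):
    """Return the two row positions at which the data is cut.

    Hypothetical helper for illustration; int() truncates, so any
    leftover rows end up in the last set.
    """
    first_cut = int(train_frac * n_rows)
    second_cut = int((train_frac + validate_frac) * n_rows)
    return first_cut, second_cut


first_cut, second_cut = split_points(10)

# Rows [0, first_cut) form the first set, [first_cut, second_cut)
# the second set, and [second_cut, n_rows) the third set
print(first_cut, second_cut)
```

For 10 rows the cuts fall at positions 6 and 8, which gives sets of 6, 2, and 2 rows.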

How to split data into 3 sets (train, validation and test)?

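To split the data into three sets, create a DataFrame holding the full dataset, shuffle its rows with DataFrame.sample(frac=1), and pass the result to numpy.split() together with the two row positions at which to cut. Note that numpy.split() takes cut positions, not the sizes of the parts: for a 60/20/20 split, the positions are 60% and 80% of the number of rows.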
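To see how the cut positions work in isolation, here is a minimal sketch that applies `np.split()` to a plain NumPy array of 10 elements. The array and the variable names are made up for illustration.

```python
import numpy as np

# Ten elements, 0 through 9
data = np.arange(10)

# Cut before position 6 and before position 8
first, second, third = np.split(data, [6, 8])

print(first)   # [0 1 2 3 4 5]
print(second)  # [6 7]
print(third)   # [8 9]
```

The list `[6, 8]` marks where each new piece starts, so the three pieces hold 6, 2, and 2 elements.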

Let us understand with the help of an example.

Python program to split data into 3 sets (train, validation, and test)

# Import numpy
import numpy as np

# Import pandas
import pandas as pd

# Creating a dataframe of random values
# (not seeded, so the values differ on each run)
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))

# Setting maximum rows and columns
# to display/print all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Display original dataframe
print("Original DataFrame:\n", df, "\n")

# Shuffling the rows (frac=1 returns all rows in random order),
# then cutting at 60% and 80% of the row count for a 60/20/20 split
train, test, validate = np.split(
    df.sample(frac=1, random_state=42), [int(0.6 * len(df)), int(0.8 * len(df))]
)

# Display different sets
print("Training set:\n", train, "\n")
print("Testing set:\n", test, "\n")
print("Validation set:\n", validate)

Output

Original DataFrame:
           A         B         C         D         E
0  0.062043  0.305778  0.040534  0.344276  0.060514
1  0.705843  0.609687  0.070329  0.021927  0.714339
2  0.703366  0.613181  0.384509  0.005025  0.030347
3  0.627445  0.716861  0.802043  0.330570  0.479814
4  0.415682  0.620594  0.704717  0.606593  0.071703
5  0.508037  0.361807  0.904131  0.643761  0.824738
6  0.628795  0.163949  0.072226  0.984469  0.174503
7  0.338267  0.510505  0.608846  0.166929  0.657149
8  0.346381  0.082333  0.947476  0.812816  0.962484
9  0.979881  0.538592  0.433578  0.886863  0.468531 

Training set:
           A         B         C         D         E
8  0.346381  0.082333  0.947476  0.812816  0.962484
1  0.705843  0.609687  0.070329  0.021927  0.714339
5  0.508037  0.361807  0.904131  0.643761  0.824738
0  0.062043  0.305778  0.040534  0.344276  0.060514
7  0.338267  0.510505  0.608846  0.166929  0.657149
2  0.703366  0.613181  0.384509  0.005025  0.030347 

Testing set:
           A         B         C         D         E
9  0.979881  0.538592  0.433578  0.886863  0.468531
4  0.415682  0.620594  0.704717  0.606593  0.071703 

Validation set:
           A         B         C         D         E
3  0.627445  0.716861  0.802043  0.330570  0.479814
6  0.628795  0.163949  0.072226  0.984469  0.174503
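As an alternative sketch, the same 60/20/20 split can be written with positional slicing through `DataFrame.iloc`, which avoids passing a DataFrame to `np.split()` (some recent pandas versions may emit a FutureWarning in that case). The function name `train_validate_test_split` and its parameters are hypothetical and not part of NumPy or pandas.

```python
import numpy as np
import pandas as pd


def train_validate_test_split(df, train_frac=0.6, validate_frac=0.2, seed=42):
    """Shuffle the rows of df and slice them into three DataFrames.

    Hypothetical helper for illustration; whatever remains after the
    training and validation rows goes to the testing set.
    """
    # frac=1 returns every row in random order
    shuffled = df.sample(frac=1, random_state=seed)

    train_end = int(train_frac * len(shuffled))
    validate_end = train_end + int(validate_frac * len(shuffled))

    train = shuffled.iloc[:train_end]
    validate = shuffled.iloc[train_end:validate_end]
    test = shuffled.iloc[validate_end:]
    return train, validate, test


df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))
train, validate, test = train_validate_test_split(df)

# Number of rows in each set
print(len(train), len(validate), len(test))
```

With 10 rows, this yields 6 training rows, 2 validation rows, and 2 testing rows, and every original row lands in exactly one set.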
