How to split data into 3 sets (train, validation and test)?

NumPy | Split data into 3 sets (train, validation, and test): In this tutorial, we will learn how to split a dataset into three sets (training, validation, and testing) using NumPy and pandas in Python.

By Pranit Sharma. Last updated: June 04, 2023

Problem Statement

Given a dataset stored in a pandas DataFrame, we have to split it into three sets: training, validation, and testing.

Solution approach

When building a machine learning model, we usually split the data into three sets: the training set, which is used to fit the model; the validation set, which is used to tune and compare models; and the testing set, which is held back for the final evaluation.

The proportions of the sets are chosen by the user. A common choice is 60% of the data for the training set and 20% each for the validation and testing sets.
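To make the 60/20/20 proportions concrete, here is a minimal sketch that computes how many rows each set receives for a dataset of a given length. The helper name `split_sizes` and its percentage parameters are illustrative assumptions, not part of NumPy or pandas. It uses integer arithmetic so that the three sizes always add up to the number of rows.

```python
def split_sizes(n_rows, train_pct=60, val_pct=20):
    """Return (train, validation, test) row counts for n_rows rows.

    Hypothetical helper: percentages are whole numbers, and the test
    set receives whatever is left after train and validation.
    """
    if train_pct < 0 or val_pct < 0 or train_pct + val_pct > 100:
        raise ValueError("percentages must be non-negative and sum to at most 100")
    n_train = n_rows * train_pct // 100
    n_val = n_rows * val_pct // 100
    n_test = n_rows - n_train - n_val
    return n_train, n_val, n_test


print(split_sizes(10))   # (6, 2, 2)
print(split_sizes(101))  # (60, 20, 21)
```

When the number of rows is not evenly divisible, the leftover rows go to the testing set in this sketch; assigning them elsewhere is an equally valid choice.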

How to split data into 3 sets (train, validation and test)?

To split the data into three sets, create a DataFrame holding all the data, shuffle its rows with DataFrame.sample(frac=1), and pass the result to the numpy.split() function together with the row positions at which to cut. For a 60/20/20 split, these positions are 60% and 80% of the number of rows.
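Note that the second argument of numpy.split() is a list of cut positions, not a list of sizes: the array is cut before each listed index. Here is a minimal sketch on a plain, unshuffled NumPy array of ten elements, where the cut positions 6 and 8 correspond to 60% and 80% of the length:

```python
import numpy as np

data = np.arange(10)  # ten items: 0, 1, ..., 9
n = len(data)

# Cut positions at 60% and 80% of the length -> [6, 8]
cuts = [int(0.6 * n), int(0.8 * n)]

# Three pieces: data[:6], data[6:8], data[8:]
first, second, third = np.split(data, cuts)

print(first)   # [0 1 2 3 4 5]
print(second)  # [6 7]
print(third)   # [8 9]
```

The pieces keep the order of the input, which is why the rows should be shuffled before splitting real data.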

Let us understand with the help of an example.

Python program to split data into 3 sets (train, validation, and test)

# Import numpy
import numpy as np

# Import pandas
import pandas as pd

# Creating a dataframe
df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))

# Set the maximum rows and columns to None
# so that all rows and columns are displayed/printed
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# Display original dataframe
print("Original DataFrame:\n", df, "\n")

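# Splitting the data into 3 parts:
# sample(frac=1) returns all rows in shuffled order (reproducible
# via random_state), and the cut positions at 60% and 80% of the
# length give a 60/20/20 split.
# Note: here the second part is the testing set
# and the third part is the validation set.
train, test, validate = np.split(
    df.sample(frac=1, random_state=42), [int(0.6 * len(df)), int(0.8 * len(df))]
)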

# Display different sets
print("Training set:\n", train, "\n")
print("Testing set:\n", test, "\n")
print("Validation set:\n", validate)

Output (the values differ from run to run because np.random.rand() is not seeded)

Original DataFrame:
           A         B         C         D         E
0  0.062043  0.305778  0.040534  0.344276  0.060514
1  0.705843  0.609687  0.070329  0.021927  0.714339
2  0.703366  0.613181  0.384509  0.005025  0.030347
3  0.627445  0.716861  0.802043  0.330570  0.479814
4  0.415682  0.620594  0.704717  0.606593  0.071703
5  0.508037  0.361807  0.904131  0.643761  0.824738
6  0.628795  0.163949  0.072226  0.984469  0.174503
7  0.338267  0.510505  0.608846  0.166929  0.657149
8  0.346381  0.082333  0.947476  0.812816  0.962484
9  0.979881  0.538592  0.433578  0.886863  0.468531 

Training set:
           A         B         C         D         E
8  0.346381  0.082333  0.947476  0.812816  0.962484
1  0.705843  0.609687  0.070329  0.021927  0.714339
5  0.508037  0.361807  0.904131  0.643761  0.824738
0  0.062043  0.305778  0.040534  0.344276  0.060514
7  0.338267  0.510505  0.608846  0.166929  0.657149
2  0.703366  0.613181  0.384509  0.005025  0.030347 

Testing set:
           A         B         C         D         E
9  0.979881  0.538592  0.433578  0.886863  0.468531
4  0.415682  0.620594  0.704717  0.606593  0.071703 

Validation set:
           A         B         C         D         E
3  0.627445  0.716861  0.802043  0.330570  0.479814
6  0.628795  0.163949  0.072226  0.984469  0.174503
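
Depending on the installed pandas version, calling numpy.split() directly on a DataFrame may emit a deprecation warning. As a hedged alternative, the same 60/20/20 split can be sketched with positional slicing via DataFrame.iloc. The function name `split_three` and its parameters are hypothetical, chosen only for illustration. Unlike the program above, it returns the sets in the order train, validation, test.

```python
import numpy as np
import pandas as pd


def split_three(df, train_frac=0.6, val_frac=0.2, seed=42):
    """Shuffle df and return (train, validate, test) DataFrames.

    Hypothetical helper, not a pandas or NumPy API.
    """
    # Shuffle all rows reproducibly
    shuffled = df.sample(frac=1, random_state=seed)
    n = len(shuffled)

    # Row positions where the training and validation sets end
    train_end = int(train_frac * n)
    val_end = train_end + int(val_frac * n)

    train = shuffled.iloc[:train_end]
    validate = shuffled.iloc[train_end:val_end]
    test = shuffled.iloc[val_end:]
    return train, validate, test


df = pd.DataFrame(np.random.rand(10, 5), columns=list("ABCDE"))
train, validate, test = split_three(df)

print(len(train), len(validate), len(test))  # 6 2 2

# Every original row lands in exactly one of the three sets
all_labels = list(train.index) + list(validate.index) + list(test.index)
print(sorted(all_labels) == list(range(len(df))))  # True
```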


Copyright © 2025 www.includehelp.com. All rights reserved.