Data splitting | Machine Learning

In this article, we will learn one of the methods to split the given data into test data and training data in Python.
Submitted by Raunak Goswami, on August 01, 2018

Before going to the coding part, we must know why there is a need to split a single data set into two subsets, i.e. training data and test data.

First, let us discuss the training data. We use training data to train our model. Training data contains both the feature variables (the independent variables) and the target variable (the dependent variable), so that our model can learn how the value of the target variable depends on a particular set of feature values. When we have a large data set, we use the major portion of it as the training set.

After supplying the training data, it is time to test how much our model has learned from it, just as students in college take a test after studying a subject. We test our model by supplying only the feature variables of the test data and comparing the target values the model predicts with the known ones. We generally take a minor portion of the whole data as the test set, typically 25% or 33% of the complete data set.
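To make the idea concrete, here is a minimal sketch of a hold-out split written by hand with only the standard library. The function name `holdout_split`, the seed, and the sample rows are all hypothetical and exist only for illustration; this is not the method used later in this article.

```python
import random

def holdout_split(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices, then reserve a fraction of them for testing."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)      # reproducible shuffle
    n_test = round(len(rows) * test_fraction)  # size of the minor portion
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    train = [rows[i] for i in train_idx]
    test = [rows[i] for i in test_idx]
    return train, test

# Hypothetical (feature, target) pairs standing in for a real data set.
rows = [(x, 2 * x + 1) for x in range(20)]
train_rows, test_rows = holdout_split(rows, test_fraction=0.25)
print(len(train_rows), len(test_rows))  # 15 5
```

Shuffling before slicing matters: if the file happens to be sorted, taking the last quarter as it stands would give a test set that does not resemble the training set.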

The figure below shows the splitting of data into test and training sets:

splitting of data into test and training sets

Image source: http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png

To perform the data splitting, I will be using this data set: headbrain1.CSV

Python code: (The code along with its explanation is as follows)

# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 22:21:12 2018

@author: RaunakGoswami
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

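#reading the data
"""here the directory of my code and the headbrain1.csv 
file is the same; make sure both files are stored in 
the same folder or directory"""
data = pd.read_csv('headbrain1.csv')

#this will show the first five records of the whole data
print(data.head())

#this will create a variable x which has the feature values 
#i.e. head size (assumed here to be the third column of the file)
x = data.iloc[:, 2:3].values
#this will create a variable y which has the target value 
#i.e. brain weight (assumed here to be the fourth column of the file)
y = data.iloc[:, 3:4].values

#splitting the data into training and test
"""the statement below will split 
x and y into 2 parts:
	1.training variables named x_train and y_train
	2.test variables named x_test and y_test
The test set gets 1/4 of the rows and the training set the 
remaining 3/4, as we have set test_size to 1/4"""
#sklearn.cross_validation was removed in scikit-learn 0.20;
#train_test_split now lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=1/4)

#this will plot the scatter graph of the training set
plt.scatter(x_train, y_train, c='red')
plt.xlabel('head size(train)')
plt.ylabel('brain weight(train)')
plt.show()

#this will plot the scatter graph of the test set
plt.scatter(x_test, y_test, c='red')
plt.xlabel('head size(test)')
plt.ylabel('brain weight(test)')
plt.show()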

After you run this code, just look into the variable explorer and you will see something like this:

variable explorer

As is clearly visible, out of 237 rows, 177 rows are allotted to the training variables and the remaining 60 rows to the test variables, which is roughly ¼ of the total data set.
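The counts above can be reproduced with simple arithmetic. The sketch below assumes that a fractional test size is rounded up to a whole number of rows, which is my understanding of how scikit-learn's train_test_split behaves; the helper name `split_sizes` is hypothetical and is not part of any library.

```python
import math

def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    n_test = math.ceil(n_rows * test_size)  # test count rounded up (assumption)
    n_train = n_rows - n_test               # remaining rows go to training
    return n_train, n_test

n_train, n_test = split_sizes(237, 1 / 4)
print(n_train, n_test)          # 177 60
print(round(n_test / 237, 3))   # 0.253
```

Since 237 is not divisible by 4, the test share comes out at about 25.3% rather than exactly 25%, which is why the text says "roughly".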

The graph below is a scatter plot of the training set variables:

scattered graph of the training set variables 1

The graph below is a scatter plot of the test set values. Notice that there are fewer red dots than in the training set:

scattered graph of the test set variables

That is it, guys. I hope you enjoyed today’s article.

© https://www.includehelp.com some rights reserved.