Quick links
Latest articles
Internship
Members
New...
Algorithms
Discrete Mathematics
Big data
Languages
C
C++
C++ STL
Java
Data Structure
C#.Net
Android
Kotlin
SQL
Web
PHP
Python
JavaScript
CSS
Ajax
Node.js
Web prog.
Programs
C
C++
DS
Java
C#
Python
Aptitude
C
C++
Java
DBMS
Interview
C
Embedded C
Java
SEO
HR
CS Subjects
CS Basics
O.S.
Networks
DBMS
Embedded Systems
Cloud Computing
Machine learning
CS Organizations
Linux
DOS
More...
Articles
Puzzles
News/Updates

Home » Machine Learning/Artificial Intelligence

Data splitting | Machine Learning



In this article, we will learn one of the methods to split the given data into test data and training data in python.
Submitted by Raunak Goswami, on August 01, 2018

Before going to the coding part, we must be knowing that why is there a need to split a single data into 2 subsets i.e. training data and test data.

So, at first, we would be discussing the training data. We use training data to basically train our model. Training data is a complete set of feature variables or the independent variable and target variable or the dependent variable .so that our model is able to learn the value of target variable on a particular set of feature variables. When encountered with a large set of data we use the major portion of data as a training set.

After supplying training data now it is the time to test that how much our model has learned from that data just like as humans in college after we learn our subjects we are required to give the test to clear the subject. We test our model by supplying the feature variables to our model and in turn, we see the value of the target variable predicted by our model. We generally take a minor portion of the whole data as the test set which is generally 25% or 33% of the complete data set.

This figure below shows the splitting of data into test and training sets:

splitting of data into test and training sets

Image source: http://scott.fortmann-roe.com/docs/docs/MeasuringError/holdout.png

For performing the data splitting. I would be using this data set: headbrain1.CSV

Python code: (The code along with its explanation is as follows)

# -*- coding: utf-8 -*-
"""
Created on Sun Jul 29 22:21:12 2018

@author: RaunakGoswami
"""

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#reading the data
"""here the directory of my code and the headbrain1.csv 
file is same make sure both the files are stored in 
same folder or directory""" 
data=pd.read_csv('headbrain1.csv')

#this will show the first five records of the whole data
data.head()

#this will create a variable x which has the feature values 
#i.e brain weight
x=data.iloc[:,2:3].values 
#this will create a variable y which has the target value 
#i.e brain weight
y=data.iloc[:,3:4].values 


#splitting the data into training and test
"""the following statement written below will split 
x and y into 2 parts:
	1.training variables named x_train and y_train
	2.test variables named x_test and y_test
The splitting will be done in the ratio of 1:4 as we have 
mentioned the test_size as 1/4 of the total size"""
from sklearn.cross_validation import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=1/4,random_state=0)

#this will plot the scattered graph of the training set
plt.scatter(x_train,y_train,c='red') 

plt.xlabel('headsize(train)')
plt.ylabel('brain weight(train)')
plt.show()

#this will plot the scattered graph of test set
plt.scatter(x_test,y_test,c='red')
plt.xlabel('headsize(test)')
plt.ylabel('brain weight(test)')
plt.show()


After you run this code, just look into the variable explorer and you will see something like this:

variable explorer

As it is clearly visible that out of 237 rows ,177 rows are allotted to training variables and the remaining 60 rows are allotted to test variables which is roughly ¼ of the total dataset.

The graph below is a scattered graph of the training set variables:

scattered graph of the training set variables 1

The graph below is a scattered graph of test set values notice that the number of scattered red dots are lesser than those in training set:

scattered graph of the training set variables 2

That is it guys hope you enjoyed today’s article.






Quick links:
C FAQ(s) C Advance programs C/C++ Tips & Tricks Puzzles JavaScript CSS Python Linux Commands PHP Android Articles More...

Featured post:
Introduction to Linux (Its modes, Safety, Most popular Applications)
Linux Best Distribution Software (Distros) of 2018

Was this page helpful? Please share with your friends...

Are you a blogger? Join our Blogging forum.

Comments and Discussions



Languages: » C » C++ » C++ STL » Java » Data Structure » C#.Net » Android » Kotlin » SQL
Web Technologies: » PHP » Python » JavaScript » CSS » Ajax » Node.js » Web programming/HTML
Solved programs: » C » C++ » DS » Java » C#
Aptitude que. & ans.: » C » C++ » Java » DBMS
Interview que. & ans.: » C » Embedded C » Java » SEO » HR
CS Subjects: » CS Basics » O.S. » Networks » DBMS » Embedded Systems » Cloud Computing » Machine learning » CS Organizations » Linux » DOS
More: » Articles » Puzzles » News/Updates


© https://www.includehelp.com (2015-2018), Some rights reserved.