Often while working with datasets, we encounter scenarios where the available data is very scarce. With so little data, splitting it into separate test and training sets means losing information the model could have learned from.
The desire to use the full data for learning while still keeping test and train data separate led to the concept of cross-validation.
Most machine learning enthusiasts use the following types of cross-validation:
In leave-p-out cross-validation, we have a dataset with n sample points. In each iteration, p points are set aside, the model is trained on the remaining n-p points, and it is tested on the p held-out points. This is tried exhaustively, choosing a different set of p points each time, until every possible combination has been used.
p = 1 gives leave-one-out cross-validation
p = 2 gives leave-two-out cross-validation
These techniques have low bias, but they suffer from a major flaw: the computational cost is very high, because the number of train-test combinations grows combinatorially with p.
For example, if we have n = 10 and p = 1, we have to run the model 10 times, with 1 data point set aside for testing in each run. With p = 2, we have 10C2 runs, that is, 45 runs, and the count keeps climbing from there.
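As a minimal sketch of how this looks in code (assuming scikit-learn; the random dataset and logistic-regression model are placeholders chosen only for illustration), the LeavePOut splitter enumerates every combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut

# Placeholder dataset: n = 10 samples with 2 features each.
X = np.random.rand(10, 2)
y = np.array([0, 1] * 5)

lpo = LeavePOut(p=2)           # leave-2-out
print(lpo.get_n_splits(X))     # 10C2 = 45 train/test splits

model = LogisticRegression(max_iter=1000)
for train_idx, test_idx in lpo.split(X):
    model.fit(X[train_idx], y[train_idx])
    model.score(X[test_idx], y[test_idx])  # accuracy on the 2 held-out points
```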
Hold-out validation divides the data into two chunks, typically in a 70:30 or 60:40 ratio. The training is done on the larger chunk and testing on the smaller chunk. One major disadvantage is that it does not work well for imbalanced datasets, since a single random split may not represent every class fairly. Second, a sizeable chunk of the data is never used for learning.
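A minimal hold-out sketch, assuming scikit-learn's train_test_split and a placeholder dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# 70:30 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```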
In K-fold cross-validation, the data is divided into K groups (folds), each containing many data points. In each run, K-1 folds go into training and the remaining fold goes into testing, so K runs in total are needed to exhaust the entire dataset. This is far cheaper than leave-p-out at the cost of slightly higher bias, but it still fails miserably when the dataset is imbalanced, because a fold may carry a distorted mix of classes.
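A short K-fold sketch, again assuming scikit-learn and placeholder data; cross_val_score runs all K train/test rounds and returns one accuracy per fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

kf = KFold(n_splits=5, shuffle=True, random_state=42)      # K = 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean())  # average accuracy over the 5 runs
```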
Repeated random sub-sampling splits the dataset into training and testing data at random on every iteration. Due to this intrinsic randomness, it is also referred to as Monte Carlo cross-validation. Here neither the split nor the number of iterations is fixed, and in the end the accuracy is given by the average over all the runs. Due to the randomness, some points may never be used for training the model, which is one of the major disadvantages of this method. With imbalanced datasets, again, this validation technique fails.
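In scikit-learn this corresponds to the ShuffleSplit splitter; a minimal sketch on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# 10 independent random 70:30 splits; a given point may appear in several
# test sets or, by chance, in none of the training sets.
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print(scores.mean())  # accuracy averaged over the 10 random runs
```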
In stratified K-fold cross-validation, K folds are made by splitting the data into K groups. However, while making the folds, it is ensured that every class is represented in each fold in the same proportion as in the original population. This makes sure that the technique works well with imbalanced datasets too.
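A minimal sketch with scikit-learn's StratifiedKFold on a deliberately imbalanced placeholder dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder imbalanced data: roughly a 90:10 class ratio.
X, y = make_classification(n_samples=100, weights=[0.9], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.mean())  # every fold preserves the ~90:10 class proportions
```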
Data that changes with time needs special treatment: we cannot use future data to predict the past. To ensure this, time-series cross-validation always trains on the data up to time t-1 and tests on the data at time t; in the next round it trains on the data up to t and tests on t+1, and so on.
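scikit-learn's TimeSeriesSplit enforces exactly this ordering; a minimal sketch on a placeholder series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # placeholder series, ordered in time

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices: no future leakage.
    print(train_idx, test_idx)
```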
In nested cross-validation, two loops run together: an inner loop varies the model hyperparameters (and possibly the 'K' of the K-fold or stratified K-fold split), while an outer loop measures the accuracy of each tuned model. The configuration that gives the best accuracy is picked, and the corresponding K value and hyperparameters are used for further prediction.
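One common way to set this up (a sketch, not the only scheme; the SVC estimator and the grid of C values are arbitrary placeholder choices) is a GridSearchCV inner loop wrapped in an outer cross_val_score loop:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# Inner loop: 3-fold search over the hyperparameter C (placeholder grid).
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=KFold(n_splits=3))

# Outer loop: 5-fold estimate of the tuned model's generalization accuracy.
scores = cross_val_score(inner, X, y, cv=KFold(n_splits=5))
print(scores.mean())
```

Because the inner search is refitted on each outer training fold, the outer score estimates how the whole tuning procedure generalizes, not just how one fixed model performs.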
Author: Navin Baskar, Skill-Lync