Often while working with datasets, we encounter scenarios where the available data is very scarce. With so little data, splitting it into separate test and training sets means losing information the model could have learned from.
The desire to use the full data for learning while still keeping test and train data separate led to the concept of cross-validation.
Most machine learning enthusiasts use the following types of cross-validation:
In leave-p-out cross-validation, we have a dataset with n sample points. In each iteration, p points are set aside, the model is trained on the remaining n-p points, and it is tested on the p held-out points. This is tried exhaustively, choosing a different set of p points each time, until every possible combination has been used.
p = 1 gives leave-one-out cross-validation
p = 2 gives leave-two-out cross-validation
These techniques have low bias, but they suffer from a major flaw: the computational cost is very high, because the number of train-test combinations grows combinatorially with p.
For example, if we have n = 10 and p = 1, we have to run the model 10 times, with 1 data point set aside for testing in each run. With p = 2, we have 10C2 runs, that is, 45 runs, and the count keeps climbing from there.
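As a minimal sketch of how this looks in code (assuming scikit-learn; the random dataset and logistic-regression model are placeholders chosen only for illustration), the LeavePOut splitter enumerates every combination:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeavePOut

# Placeholder dataset: n = 10 samples with 2 features each.
X = np.random.rand(10, 2)
y = np.array([0, 1] * 5)

lpo = LeavePOut(p=2)           # leave-2-out
print(lpo.get_n_splits(X))     # 10C2 = 45 train/test splits

model = LogisticRegression(max_iter=1000)
for train_idx, test_idx in lpo.split(X):
    model.fit(X[train_idx], y[train_idx])
    model.score(X[test_idx], y[test_idx])  # accuracy on the 2 held-out points
```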
Hold-out validation divides the data into two chunks, typically in a 70:30 or 60:40 ratio. The training is done on the larger chunk and testing on the smaller chunk. One major disadvantage is that it does not work well for imbalanced datasets, since a single random split may not represent every class fairly. Second, a sizeable chunk of the data is never used for learning.
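A minimal hold-out sketch, assuming scikit-learn's train_test_split and a placeholder dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# 70:30 split; random_state fixes the shuffle so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
```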
In K-fold cross-validation, the data is divided into K groups (folds), each containing many data points. In each run, K-1 folds go into training and the remaining fold goes into testing, so K runs in total are needed to exhaust the entire dataset. This is far cheaper than leave-p-out at the cost of slightly higher bias, but it still fails miserably when the dataset is imbalanced, because a fold may carry a distorted mix of classes.
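A short K-fold sketch, again assuming scikit-learn and placeholder data; cross_val_score runs all K train/test rounds and returns one accuracy per fold:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

kf = KFold(n_splits=5, shuffle=True, random_state=42)      # K = 5 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=kf)
print(scores.mean())  # average accuracy over the 5 runs
```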
Repeated random sub-sampling splits the dataset into training and testing data at random on every iteration. Due to this intrinsic randomness, it is also referred to as Monte Carlo cross-validation. Here neither the split nor the number of iterations is fixed, and in the end the accuracy is given by the average over all the runs. Due to the randomness, some points may never be used for training the model, which is one of the major disadvantages of this method. With imbalanced datasets, again, this validation technique fails.
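In scikit-learn this corresponds to the ShuffleSplit splitter; a minimal sketch on placeholder data:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# 10 independent random 70:30 splits; a given point may appear in several
# test sets or, by chance, in none of the training sets.
ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=ss)
print(scores.mean())  # accuracy averaged over the 10 random runs
```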
In stratified K-fold cross-validation, K folds are made by splitting the data into K groups. However, while making the folds, it is ensured that every class is represented in each fold in the same proportion as in the original population. This makes sure that the technique works well with imbalanced datasets too.
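A minimal sketch with scikit-learn's StratifiedKFold on a deliberately imbalanced placeholder dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Placeholder imbalanced data: roughly a 90:10 class ratio.
X, y = make_classification(n_samples=100, weights=[0.9], random_state=0)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
print(scores.mean())  # every fold preserves the ~90:10 class proportions
```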
Data that changes with time needs special treatment: we cannot use future data to predict the past. To ensure this, time-series cross-validation always trains on the data up to time t-1 and tests on the data at time t; in the next round it trains on the data up to t and tests on t+1, and so on.
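scikit-learn's TimeSeriesSplit enforces exactly this ordering; a minimal sketch on a placeholder series:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)  # placeholder series, ordered in time

tscv = TimeSeriesSplit(n_splits=4)
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede test indices: no future leakage.
    print(train_idx, test_idx)
```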
In nested cross-validation, two loops run together: an inner loop varies the model hyperparameters (and possibly the 'K' of the K-fold or stratified K-fold split), while an outer loop measures the accuracy of each tuned model. The configuration that gives the best accuracy is picked, and the corresponding K value and hyperparameters are used for further prediction.
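One common way to set this up (a sketch, not the only scheme; the SVC estimator and the grid of C values are arbitrary placeholder choices) is a GridSearchCV inner loop wrapped in an outer cross_val_score loop:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, random_state=0)  # placeholder data

# Inner loop: 3-fold search over the hyperparameter C (placeholder grid).
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=KFold(n_splits=3))

# Outer loop: 5-fold estimate of the tuned model's generalization accuracy.
scores = cross_val_score(inner, X, y, cv=KFold(n_splits=5))
print(scores.mean())
```

Because the inner search is refitted on each outer training fold, the outer score estimates how the whole tuning procedure generalizes, not just how one fixed model performs.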
Author: Navin Baskar, Skill-Lync