Questions tagged [cross-validation]
Repeatedly withholding subsets of the data during model fitting in order to quantify the model performance on the withheld data subsets.
3,520 questions
3 votes · 0 answers · 53 views
Do k-folds risk sampling bias and, if so, how do we avoid it?
In cross-validation, $k$-folds are a common way to train, compare and validate models. Often we want to find an optimal set of hyperparameters for our models. There are many ways to probe the ...
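For context, a minimal sketch of the usual tuning pattern, assuming scikit-learn (the dataset and parameter grid are placeholders): stratified, shuffled folds keep class proportions similar across folds, which limits the sampling-bias risk the question raises.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

X, y = make_classification(n_samples=500, random_state=0)

# Stratified, shuffled folds keep class proportions comparable per fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid={"max_depth": [3, 5, None]},
                      cv=cv, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```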
2 votes · 1 answer · 55 views
Should differential expression analysis be incorporated in cross validation for training machine learning models?
I'm conducting some experiments using TCGA-LUAD clinical and RNA-Seq count data. I'm building machine learning models for survival prediction (Random Survival Forests, Survival Support Vector Machines,...
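The usual rule here is that supervised feature screening (which differential expression analysis is) must be refit inside each training fold. A minimal sketch, assuming scikit-learn and using SelectKBest as a stand-in for a DE filter on placeholder expression-like data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder for high-dimensional expression-like data.
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

# The supervised filter is a pipeline step, so "genes" are re-selected
# from each training fold, never from the held-out fold.
model = make_pipeline(SelectKBest(f_classif, k=20),
                      LogisticRegression(max_iter=1000))
print(cross_val_score(model, X, y, cv=5).mean())
```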
2 votes · 0 answers · 58 views
Cross-validating multi-output models: importance + SHAP
I am currently developing a project that deals with multiple targets that can have different cardinalities. The idea is to use different ML models (e.g. Random Forest, SVM, AdaBoost) and ...
0 votes · 0 answers · 24 views
What is the best way to determine if cross validated R-squared scores are significantly different? [duplicate]
I'm comparing, pairwise, the results of Linear Regression models with transformations applied to one numerical feature and the target. I'm using k-fold cross-validation scoring with R-squared. The ...
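One common approach is to score both models on identical folds and run a paired test on the matched fold scores. A minimal sketch, assuming scikit-learn and SciPy (the second model's transformation is a placeholder); fold scores share training data, so the plain paired t-test is only a rough guide:

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_diabetes(return_X_y=True)

# Same folds for both models: a fixed random_state makes splits identical.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
a = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")
b = cross_val_score(
    make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
    X, y, cv=cv, scoring="r2",
)

# Paired test on matched fold scores; folds overlap in training data, so
# corrected resampled t-tests (e.g. Nadeau-Bengio) are often preferred.
print(ttest_rel(a, b))
```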
1 vote · 0 answers · 45 views
How to choose between ARIMA and ARFIMA?
I am in the position of having a time series data set that I can model well using either an Autoregressive Fractionally Integrated Moving Average (ARFIMA) model or an ARIMA model. I'm asking for ways to ...
4 votes · 1 answer · 515 views
Should I normalize both train and validation sets or only the train set?
I have a question about normalization when merging training and validation sets for cross-validation.
Normally, I normalize using re-scaling (Min-Max Normalization) calculated from the training set ...
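A minimal sketch of the standard answer, assuming scikit-learn: put the scaler inside a Pipeline so the min-max statistics are recomputed from the training folds alone.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

X, y = load_diabetes(return_X_y=True)

# The scaler is refit on the training folds of each split, so the
# validation fold never contributes to the min/max statistics.
model = make_pipeline(MinMaxScaler(), Ridge())
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```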
1 vote · 2 answers · 243 views
A proper approach to K-fold cross validation on imbalanced data
What is the proper algorithm for k-fold CV in the case of class balancing (under-/over-sampling)?
Variant 1:
split data into train and test set
balance classes in the train set
run k-fold CV
Variant 2:
...
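For reference, a minimal sketch of balancing inside each fold rather than before splitting, assuming the imbalanced-learn package: the sampler is a pipeline step, so oversampling is redone on the training portion of every fold while held-out folds keep their original class distribution.

```python
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# imblearn's Pipeline applies the sampler only when fitting on the
# training portion of each fold; validation folds stay untouched.
model = Pipeline([
    ("balance", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print(cross_val_score(model, X, y, cv=cv, scoring="f1").mean())
```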
4 votes · 1 answer · 128 views
When and how can unsupervised preprocessing before splitting data lead to overoptimistic model performance?
Conceptually, I understand that models should be built totally blind to the test set in order to most faithfully estimate performance on future data. However, I'm struggling to understand the extent ...
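A minimal sketch of how one might measure that extent, assuming scikit-learn: compare PCA fit on the full data against PCA refit inside each training fold. For PCA the gap is often small, but the pattern shows exactly where leakage enters.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=300, n_features=50, random_state=0)

# Leaky: PCA sees every row, including rows that later become
# validation folds.
X_leaky = PCA(n_components=10, random_state=0).fit_transform(X)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Clean: PCA is refit on the training portion of each fold.
clean = cross_val_score(
    make_pipeline(PCA(n_components=10, random_state=0),
                  LogisticRegression(max_iter=1000)),
    X, y, cv=5,
)
print(leaky.mean(), clean.mean())
```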
0 votes · 0 answers · 52 views
LASSO and cross validation when dealing with missing data
I want to simulate data with missing values and use them to compare the predictive performance of several machine learning algorithms, including LASSO. All analyses will be performed in R, using the ...
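A minimal sketch of the same design in Python rather than R, assuming scikit-learn: the imputer is part of the pipeline, so imputed values are learned from the training folds alone.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.impute import SimpleImputer
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = load_diabetes(return_X_y=True)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # simulate 10% missingness at random

# Imputation lives inside the pipeline, so median values are estimated
# from the training folds only.
model = make_pipeline(SimpleImputer(strategy="median"), Lasso(alpha=0.1))
print(cross_val_score(model, X, y, cv=5, scoring="r2").mean())
```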
4 votes · 1 answer · 88 views
Confused about the utility of nested cross-validation vs k-fold cross-validation
I am using nested cross-validation in mlr3 to tune my model's hyperparameters and gauge its out-of-sample performance. Previously, when I was performing regular k-fold CV, my understanding was that ...
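For comparison, a minimal sketch of nested CV in scikit-learn rather than mlr3: the inner loop tunes hyperparameters, while the outer loop scores the entire tuning procedure on folds it never saw.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=400, random_state=0)

# Inner loop: hyperparameter tuning.
inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]},
                     cv=KFold(5, shuffle=True, random_state=0))

# Outer loop: scores the *whole tuning procedure* on unseen folds.
outer = cross_val_score(inner, X, y, cv=KFold(5, shuffle=True, random_state=1))
print(outer.mean())
```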
1 vote · 1 answer · 101 views
How to choose and structure a GLM for species richness with non-normal distribution? [closed]
I know my next steps involve using a GLM and selecting the type of GLM based on my response variables (possibly gamma or Poisson regression?).
I also need to standardise explanatory variables to be ...
0 votes · 1 answer · 134 views
Comparing AUROCs of binary classifiers across cross-validation folds: alternatives to DeLong
I have two binary classifiers and would like to check whether there is a statistically significant difference between their areas under the ROC curve (AUROC). I have reason to opt for AUROC as my ...
2 votes · 0 answers · 30 views
How can one statistically compare machine learning models based on the results of cross-validation? [duplicate]
It is often recommended that one use k-fold cross-validation to estimate the generalisation ability of a machine learning model. Most resources I've found, however, do not address what one should do after ...
0 votes · 0 answers · 63 views
Time series LASSO k-fold cross-validation
This topic has been discussed before, but I couldn't find a specific answer.
Here's my approach to forecasting QoQ values:
Run the usual LASSO k-fold CV on time series data and generate a one-step-ahead ...
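A minimal sketch of a leakage-free alternative, assuming scikit-learn (the lagged predictors and QoQ target below are placeholders): TimeSeriesSplit keeps every validation block strictly after its training block, which matches one-step-ahead forecasting better than shuffled K folds.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))      # placeholder lagged predictors
y = X[:, 0] + rng.normal(size=200)  # placeholder QoQ target

# Each validation block lies strictly after its training block, so no
# future information leaks into the fitted coefficients.
model = LassoCV(cv=TimeSeriesSplit(n_splits=5)).fit(X, y)
print(model.alpha_)
```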
0 votes · 1 answer · 59 views
Cross-validation to predict labels from cluster analysis [closed]
My project has the following steps:
Use the elbow method to determine the features and number of clusters for k-means.
Run k-means on the data (with the determined features and n clusters), and give the ...
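A minimal sketch of the described pipeline, assuming scikit-learn: k-means produces pseudo-labels and a classifier is then cross-validated on recovering them. Note that the pseudo-labels are derived from the full data set, so the CV score can be optimistic about genuinely new data.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Pseudo-labels from k-means on the full data set.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cross-validate a classifier on recovering those labels; since the
# labels came from all rows, this estimate leans optimistic.
clf = RandomForestClassifier(random_state=0)
print(cross_val_score(clf, X, labels, cv=5).mean())
```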