sas - Excel - How to split data into train and test sets that are equally distributed
I've got a data set (in Excel) that I'm going to import into SAS to undertake some modelling.
I've got a method for randomly splitting the Excel dataset (using the =RAND() function), but is there a way (at the splitting stage) to ensure an even distribution of the samples (other than to keep randomly splitting and testing the distribution until it becomes acceptable)?
Otherwise, if this is best performed in SAS, what is the most efficient approach for testing the sample randomness?
The dataset contains 35 variables, a mixture of binary, continuous, and categorical variables.
In SAS, you can use proc surveyselect to do this.
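    /* OUTALL keeps every row and adds a Selected flag (1 = sampled, 0 = not). */
    proc surveyselect data=sashelp.cars out=cars_out outall samprate=0.7;
    run;

    /* Split on the flag. As written, the 70% sample goes to test; */
    /* swap the two data set names if the 70% should be the training set. */
    data train test;
        set cars_out;
        if selected then output test;
        else output train;
    run;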
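On the question of testing how evenly a split came out, one possible check is to compare the two groups on the Selected flag. This is a minimal sketch rather than a definitive test: the seed value and the choice of variables (msrp, horsepower, mpg_city, origin, type) are assumptions made for illustration.

```sas
/* Fixing the seed makes the split reproducible (12345 is arbitrary). */
proc surveyselect data=sashelp.cars out=cars_out outall
                  samprate=0.7 seed=12345;
run;

/* Continuous variables: compare means and spreads across the two groups. */
proc means data=cars_out n mean std min max;
    class selected;
    var msrp horsepower mpg_city;
run;

/* Categorical variables: row percentages plus a chi-square test of */
/* independence between group membership and each variable.         */
proc freq data=cars_out;
    tables selected*(origin type) / chisq nocol nopercent;
run;
```

Similar means and a large chi-square p-value suggest the split is not obviously unbalanced on those variables; they do not prove balance on the variables you did not check.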
If there are particular variable[s] you want to make sure the train and test sets are balanced on, you can use either strata or control, depending on what sort of thing you're talking about. control makes an approximate attempt to balance things by the control variables: it sorts by the control variable, then pulls every 3rd record (or whatever the rate implies), so you get a sort of approximate balance. If you have 2+ control variables it snake-sorts (ascending, then descending, and so on, inside each level), but this reduces the randomness.
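A control-based version might look like the sketch below. The choice of method=sys (systematic selection), the two control variables, and the seed are assumptions for illustration. Because the procedure sorts its input by the control variables, the sketch works on a WORK copy instead of the read-only sashelp.cars.

```sas
/* Work on a copy, since the CONTROL statement sorts the input data set. */
data cars;
    set sashelp.cars;
run;

/* Systematic selection after sorting by the control variables.   */
/* With more than one control variable the default ordering is a  */
/* serpentine (snake) sort.                                       */
proc surveyselect data=cars out=cars_out outall
                  method=sys samprate=0.7 seed=12345;
    control origin type;
run;
```

The same final data step splits cars_out into train and test on the Selected flag.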
If you use strata, it guarantees the sample rate inside each stratum. For example, if you did:
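    /* The input must be sorted by the strata variable first. */
    proc sort data=sashelp.cars out=cars;
        by origin;
    run;

    /* The 70% rate is applied within each value of origin. */
    proc surveyselect data=cars out=cars_out outall samprate=0.7;
        strata origin;
    run;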
(with the same final splitting data step as before) you'd get 70% of each separate origin pulled (which would end up being 70% of the total, of course).
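To confirm that the rate really holds within each origin, a quick cross-tabulation of the stratum against the Selected flag is one option. This is a sketch; the seed is an arbitrary assumption.

```sas
proc sort data=sashelp.cars out=cars;
    by origin;
run;

proc surveyselect data=cars out=cars_out outall
                  samprate=0.7 seed=12345;
    strata origin;
run;

/* Row percentages give the share of each origin that was selected. */
proc freq data=cars_out;
    tables origin*selected / nocol nopercent;
run;
```

Each row should show roughly 70% selected; small deviations come from rounding the sample size within each stratum.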
Which one you use depends on what you care about being balanced by. The more things you balance on, the less balanced everything else will be, so be cautious; a simple random sample may well be best, if you have enough N.
If you don't have enough N, you can use bootstrapping techniques: take a with-replacement sample of 70%, and draw maybe 100 of those samples (each can even have a higher N than the original, since sampling is with replacement). Then you test, or whatever, on each sample selected, and the variation in the results tells you how you're doing even if your N is not enough to do it in one pass.
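The bootstrap idea can be sketched with proc surveyselect as well. This is an illustration under stated assumptions, not a complete modelling workflow: the statistic (the mean of mpg_city), the output data set names boot and boot_stats, and the seed are all hypothetical choices.

```sas
/* Draw 100 with-replacement samples, each 70% of the original size. */
proc surveyselect data=sashelp.cars out=boot
                  method=urs     /* unrestricted random sampling = with replacement */
                  samprate=0.7
                  reps=100       /* adds a Replicate variable numbered 1 to 100 */
                  outhits        /* one output row per hit, so repeats appear as rows */
                  seed=12345;
run;

/* Compute the statistic of interest once per replicate. */
proc means data=boot noprint;
    by replicate;
    var mpg_city;
    output out=boot_stats mean=mean_mpg_city;
run;

/* The spread across replicates shows how stable the result is. */
proc means data=boot_stats n mean std min max;
    var mean_mpg_city;
run;
```

In practice the per-replicate step would be your own model fit or test, run with a by replicate statement in place of the simple mean used here.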