1. Common practice
a) Training data - used to select model parameters.
i) E.g., finding the intercept and slope parameters of an ordinary linear
regression model.
ii) To some extent, the noise in the training data set is absorbed during
fitting, so the model parameters over-fit that noise.
b) Validation data - used to select hyper-parameters.
i) E.g., we may want to test three different models at step 1.a, say a
linear model with one, two, or three variables.
ii) The validation data set is independent of the training data, and thus
provides an 'unbiased' evaluation of the models, which helps decide
which hyper-parameter to use.
iii) Note that a model trained in 1.a, say y = b_0 + b_1*x_1, does not
learn anything from this data set. So the noise in this data set is
not used to over-fit the parameters (b_0, b_1); over-fitting still
occurs, however, in choosing which linear model to use (in terms of
the number of variables).
c) Test data - used to gauge confidence in the output of the two steps above.
i) Used once, after the model is completely trained.
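The three-way split in section 1 can be sketched as follows. This is a minimal illustration with synthetic data (the data-generating process, split ratios, and variable names are my assumptions, not part of the original answer): the training data fits each candidate's coefficients (1.a), the validation data picks the number of variables (1.b), and the test data is touched exactly once at the end (1.c).

```python
# Minimal sketch of the train/validation/test workflow in section 1.
# Synthetic data: y depends on the first two of three variables.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Split off a test set first, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_tmp, y_tmp, test_size=0.25, random_state=0)

# 1.a: training data chooses the parameters (intercept and slopes)
#      of each candidate model (1, 2, or 3 variables).
# 1.b: validation data chooses the hyper-parameter (number of variables).
best_k, best_score = None, -np.inf
for k in (1, 2, 3):
    model = LinearRegression().fit(X_train[:, :k], y_train)
    score = model.score(X_val[:, :k], y_val)  # R^2 on validation data
    if score > best_score:
        best_k, best_score = k, score

# 1.c: the test data is used once, to report confidence in the final model.
final = LinearRegression().fit(X_train[:, :best_k], y_train)
print("chosen number of variables:", best_k)
print("test R^2:", final.score(X_test[:, :best_k], y_test))
```

Note that the test score is computed only after `best_k` is fixed; looping back and re-choosing based on the test score would re-introduce the leakage discussed in 2.d below.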
2. Another way to look at Part 1
a) Our pool of candidate models is a 5-dimensional set, i.e.,
i) Dimension 1: number of variables to keep in the regression model,
e.g., [1, 2, 3].
ii) Dimension 2-5: (b_0, b_1, b_2, b_3).
b) Step 1.a reduces the candidate models from 5 dimensions to 1 dimension.
c) Step 1.b reduces the model candidates from 1 dimension to 0 dimensions, i.e., a single model.
d) However, the OP may feel that the 'final' output above does not perform well enough on the test data set, and therefore redoes the whole process, say using ridge regression instead of ordinary linear regression. The test data set is then used multiple times, so the noise in this data may introduce some over-fitting in deciding whether to use linear regression or ridge regression.
e) To deal with a high-dimensional model pool spanning parameters, hyper-parameters, model types, and pre-processing methods, any split of the data available to us essentially defines a decision process:
i) Sequentially reducing the model pool to zero-dimension.
ii) Allocating the data-noise over-fitting to different steps of the
dimension reduction (over-fitting the noise in the data is
unavoidable, but it can be allocated smartly).
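The larger pool in 2.d-2.e, which also spans model type (ordinary vs. ridge regression) and a regularization hyper-parameter, can be sketched the same way. The candidate list, alpha grid, and split are illustrative assumptions; the point is that each `fit()` call uses the training data to collapse the coefficient dimensions, and the validation data then collapses the remaining model-type/hyper-parameter dimensions to a single model, absorbing the over-fitting risk of that choice.

```python
# Sketch of section 2's decision process over a pool spanning
# model type (OLS vs. ridge) x hyper-parameter x coefficients.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = 1.0 + X @ np.array([1.5, 0.5, 0.0]) + rng.normal(scale=0.5, size=200)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=1)

# Model-type and hyper-parameter dimensions of the pool.
candidates = [("ols", LinearRegression())] + [
    ("ridge alpha=%g" % a, Ridge(alpha=a)) for a in (0.1, 1.0, 10.0)
]

# Training data collapses the coefficient dimensions of each candidate;
# validation data collapses the rest of the pool to a single model.
fitted = [(name, m.fit(X_tr, y_tr)) for name, m in candidates]
best_name, best_model = max(fitted, key=lambda nm: nm[1].score(X_val, y_val))
print("selected:", best_name)
```

As 2.d warns, if this whole loop were re-run with the test set in the selection criterion, the test data would stop being an unbiased yardstick.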
3. Conclusions and answers to the OP's question
a) A two-way split (training and test), a three-way split (training, validation, and test), or even more splits are essentially about reducing the dimensionality and allocating the data (in particular, the noise and the risk of over-fitting).
b) At some stage you may arrive at a 'final' pool of model candidates, and then you can think about how to design the process of sequentially reducing the dimensionality, so that
i) At each step of reducing the dimensions, the output is satisfactory,
e.g., not using just 10 data points with large noise to estimate a
six-parameter linear model.
ii) There are enough data for you to finally reduce the dimension to
zero.
c) If b) cannot be achieved:
i) Use model and data insight to reduce the overall dimensionality of
your model pool. E.g., linear regression is sensitive to outliers and
is therefore a poor choice for data with many large outliers.
ii) Choose robust non-parametric models, or models with fewer
parameters, if possible.
iii) Smartly allocate the data available at each step of reducing the
dimensionality. There are goodness-of-fit tests that can help us
decide whether the data we use to train the model are sufficient.
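One practical way to check whether the data at a given step are sufficient (a learning-curve check, which I use here in place of a formal goodness-of-fit test; the data, sizes, and thresholds are illustrative assumptions) is to see whether the validation score still improves as the training size grows:

```python
# Learning-curve check: has performance flattened out, or would
# more training data at this step still help?
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import learning_curve

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 3))
y = 2.0 + X @ np.array([1.5, 0.5, -1.0]) + rng.normal(scale=0.5, size=400)

# Fit on 20%, 50%, and 100% of the available training data,
# scoring each fit on held-out folds (5-fold cross-validation).
sizes, train_scores, val_scores = learning_curve(
    LinearRegression(), X, y, train_sizes=[0.2, 0.5, 1.0], cv=5
)
mean_val = val_scores.mean(axis=1)
print(dict(zip(sizes.tolist(), mean_val.round(3).tolist())))
# If the validation score has flattened out by the largest size, the
# data for this step are probably enough; if it is still climbing,
# this split may be too small for this model.
```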