Some Methodological Concerns in the Development of Predictive Models: Examples From Discriminant Analysis
Multivariate parametric modeling is frequently used to develop predictive relationships between exposures (inputs) and the risk or occurrence of injury and illness (outcomes). Accurate predictive models can be used to suggest interventions that minimize illness and injury. Perhaps because of the high cost of measuring particular exposures, or because many adverse outcomes are sparse (most adverse outcomes, including certain traumatic injuries, are relatively rare events for the individual employment establishment), modeling studies in the health and safety literature frequently lack an independent test sample for evaluating predictive model performance. Instead, these studies report only the resubstitution accuracy, that is, the accuracy realized when the model is evaluated on the same sample that was used to generate the model coefficients. Unfortunately, it is well established that the resubstitution accuracy is optimistically biased: it typically provides an inflated estimate of the predictive accuracy of the developed model. For small data sets (small relative to the number of recorded exposures and/or the number of exposures included in the model), this optimistic bias can be substantial. Furthermore, small data sets may lead to selection of an entirely spurious set of exposures.
To elucidate this issue, a Monte Carlo simulation study was conducted using the classification modeling technique of discriminant analysis. Random data containing no true classification power (denoted the “Nil Model”) were generated, then analyzed using discriminant analysis. For the case of two outcome groups, the true accuracy of the Nil Model is 50% (i.e., no better than flipping a fair coin). For conditions similar to those in the literature, the random data “reported” highly accurate classification performance, with results as high as 100%. These “reports” are an artifact of the bias in the resubstitution accuracy. Factors influencing the extent of the bias were studied. It was found that the resubstitution bias is reduced if the sample size is increased, the number of candidate exposures is decreased, the number of selected exposures is decreased, and the proportion of samples from each outcome group is equalized. These simulation studies indicate that reporting the resubstitution accuracy alone can be problematic: the resubstitution accuracy can be made arbitrarily large, regardless of the true predictive accuracy of the model.
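The Nil Model experiment can be illustrated with a minimal sketch. It is not the simulation code used in the study. It assumes standard normal exposures, two equal-sized outcome groups, a two-group linear discriminant with pooled covariance and a midpoint cutoff, and no exposure selection step (all `p` exposures enter the model). The function names (`fit_lda`, `mean_resub_accuracy`, and so on) and the sample sizes are hypothetical choices made for illustration.

```python
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_lda(X, y):
    """Two-group linear discriminant; returns weights w and cutoff c."""
    p = len(X[0])
    groups = [[x for x, lab in zip(X, y) if lab == g] for g in (0, 1)]
    means = [[sum(x[j] for x in grp) / len(grp) for j in range(p)]
             for grp in groups]
    # Pooled within-group covariance matrix (n - 2 degrees of freedom).
    S = [[0.0] * p for _ in range(p)]
    for grp, m in zip(groups, means):
        for x in grp:
            d = [x[j] - m[j] for j in range(p)]
            for i in range(p):
                for j in range(p):
                    S[i][j] += d[i] * d[j]
    dof = len(X) - 2
    S = [[S[i][j] / dof for j in range(p)] for i in range(p)]
    diff = [means[1][j] - means[0][j] for j in range(p)]
    w = solve(S, diff)  # w = S^{-1} (mean_1 - mean_0)
    mid = [(means[0][j] + means[1][j]) / 2 for j in range(p)]
    c = sum(wj * mj for wj, mj in zip(w, mid))  # midpoint cutoff
    return w, c

def predict(model, x):
    w, c = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > c else 0

def accuracy(model, X, y):
    return sum(predict(model, x) == lab for x, lab in zip(X, y)) / len(y)

def nil_data(n, p, rng):
    """Random exposures that carry no information about the outcome."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # two equal-sized groups
    return X, y

def mean_resub_accuracy(n, p, trials, seed=0):
    """Average resubstitution accuracy over repeated Nil Model data sets."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        X, y = nil_data(n, p, rng)
        total += accuracy(fit_lda(X, y), X, y)  # same data for fit and test
    return total / trials

# The true accuracy is 0.5 in every case; only the sample size changes.
for n in (20, 50, 200):
    print(n, round(mean_resub_accuracy(n, p=10, trials=50), 3))
```

Under these assumptions the printed averages sit well above 0.5 for the smallest sample and move toward 0.5 as the sample grows, which is the direction of the sample-size effect described above.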
The most common approach to rectifying this situation is the use of a train-test methodology in which the collected subject data are separated into non-overlapping, independent training and testing sets. The model is trained (i.e., the model coefficients are computed) on the training set, then tested on the test set. The performance achieved on an adequately sized test set is considered a good estimate of the true predictive model performance. It is suggested that all research reports that develop parametric models should either (1) train the model on one data set, but report as the performance metric the accuracy achieved on an independent, adequately sized test data set, or (2) demonstrate that the magnitude of the resubstitution bias is minimal.
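A train-test evaluation of this kind might look like the following sketch. To keep it short, it swaps in a nearest-centroid rule (a linear discriminant that assumes an identity covariance matrix) for a full discriminant analysis, and it splits each outcome group separately so that both groups appear in the training set. The data are again random with no true classification power. All function names, the 50% test fraction, and the sample sizes are assumptions made for illustration.

```python
import random

def centroid(rows):
    p = len(rows[0])
    return [sum(r[j] for r in rows) / len(rows) for j in range(p)]

def fit_nearest_centroid(X, y):
    """Model coefficients are the group means of the training data."""
    return {g: centroid([x for x, lab in zip(X, y) if lab == g])
            for g in (0, 1)}

def predict(model, x):
    def sqdist(m):
        return sum((a - b) ** 2 for a, b in zip(x, m))
    return min(model, key=lambda g: sqdist(model[g]))

def accuracy(model, X, y):
    return sum(predict(model, x) == lab for x, lab in zip(X, y)) / len(y)

def stratified_split(X, y, test_fraction, rng):
    """Split into non-overlapping training and test sets, group by group."""
    train, test = [], []
    for g in (0, 1):
        ids = [i for i, lab in enumerate(y) if lab == g]
        rng.shuffle(ids)
        k = int(round(test_fraction * len(ids)))
        test += ids[:k]
        train += ids[k:]
    pick = lambda ids: ([X[i] for i in ids], [y[i] for i in ids])
    return pick(train), pick(test)

def compare(n=40, p=10, trials=200, seed=0):
    """Average resubstitution and held-out accuracy on random data."""
    rng = random.Random(seed)
    resub = held_out = 0.0
    for _ in range(trials):
        X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
        y = [i % 2 for i in range(n)]  # labels unrelated to X
        (Xtr, ytr), (Xte, yte) = stratified_split(X, y, 0.5, rng)
        model = fit_nearest_centroid(Xtr, ytr)   # train on training set only
        resub += accuracy(model, Xtr, ytr)       # optimistic estimate
        held_out += accuracy(model, Xte, yte)    # independent test estimate
    return resub / trials, held_out / trials

resub, held_out = compare()
print("resubstitution:", round(resub, 3), "held-out:", round(held_out, 3))
```

Because the test subjects play no part in computing the group means, the held-out figure stays near the true 50% while the resubstitution figure on the same model is inflated.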