Monday, April 19, 2010

In LOO of a Crapshoot

Earlier we showed you how our toy model would have passed the leave-one-out (LOO) test and in fact we could get away with leaving out more than one (Figure 1). For example, you might do your cross-validation by leaving out groups of three or five; these procedures might be called leave-three-out (L3O) or leave-five-out (L5O). Leaving out more gives you more confidence in your validation, and our toy model will validate so long as you retain at least one data point for each of the groups to ‘anchor’ the model (Figure 1).

Now let’s see what happens when we try to do the prediction shown in Figure 2 (see purple line). This should be a safe prediction because the model has passed the LOO test and we’re predicting from smack in the middle of the training set, where we would normally have the most confidence in the model. However, there is a slight problem. The linear combination of descriptors on the horizontal axis is a latent indicator variable (LIV), which for many QSAR models is Nemesis, although creators of these models are seldom aware of this. If you have two groups of structurally related compounds for which the average activities differ, and enough descriptors, you’ve got a good chance of finding a LIV that gives you some separation of the two structural groups. If you can do this, you’ll have generated a model that will cross-validate successfully.
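To make this concrete, here is a minimal sketch in Python. The data are made up for illustration (two tight groups whose average activities differ, with no trend inside either group), and the names `fit_line` and `loo_q2` are our own rather than anything from a QSAR package.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def loo_q2(xs, ys):
    """Leave-one-out q2 = 1 - PRESS / (total sum of squares)."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without point i, then predict point i
        a, b = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    my = sum(ys) / len(ys)
    ss = sum((y - my) ** 2 for y in ys)
    return 1.0 - press / ss

rng = random.Random(0)
# Hypothetical toy data: x plays the role of the latent variable.
# Within each group, x and y are generated independently (no real trend).
x_low = [rng.gauss(1.0, 0.1) for _ in range(10)]
x_high = [rng.gauss(5.0, 0.1) for _ in range(10)]
y_low = [rng.gauss(5.0, 0.3) for _ in range(10)]
y_high = [rng.gauss(8.0, 0.3) for _ in range(10)]

q2_all = loo_q2(x_low + x_high, y_low + y_high)
q2_within = loo_q2(x_low, y_low)
print("LOO q2, both groups pooled:", round(q2_all, 2))
print("LOO q2, low group only:    ", round(q2_within, 2))
```

The pooled q² should come out high while the within-group q² is much lower, which is the sense in which the model is anchored by group membership rather than by any trend in the descriptors.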

The trouble comes when you try to do a prediction like the one in Figure 2. Because we’re dealing with a LIV, this prediction actually represents an extrapolation even though the model might ‘think’ that the prediction is an interpolation. The model may use a large number of descriptors, but so long as it keeps cross-validating we keep adding more and more, per absurdum ad nauseam. If we could only recognise the structural groups in the data, we could distinguish them with proper indicator variables, but instead we use LIVs, which don’t tell you where the gaps are unless you look very carefully. But enough of our views, let’s see what PoO has to say, pausing only to give thanks that it’s 'overfitting' and not 'over-fitting':
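One way of looking carefully is to check whether a query falls into a hole between training points along the latent variable. The sketch below is only a heuristic of our own devising: the function name `in_gap` and the threshold `factor` are assumptions, and the scores are made-up numbers.

```python
from bisect import bisect_left
from statistics import median

def in_gap(query, scores, factor=5.0):
    """True if `query` lies inside the training range but in an interval
    between neighbouring training scores that is more than `factor`
    times the median spacing. `factor` is an arbitrary choice."""
    s = sorted(scores)
    if query < s[0] or query > s[-1]:
        return False  # ordinary extrapolation, outside the range
    spacings = [b - a for a, b in zip(s, s[1:])]
    typical = median(spacings)
    i = bisect_left(s, query)
    if i == 0 or s[i] == query:
        return False  # query coincides with a training score
    # s[i - 1] < query < s[i]: width of the bracketing interval
    return (s[i] - s[i - 1]) > factor * typical

# Hypothetical latent-variable scores for two structural groups
train_scores = [0.9, 1.0, 1.1, 1.2, 4.8, 4.9, 5.0, 5.1]
print(in_gap(3.0, train_scores))   # middle of the range, between the groups
print(in_gap(1.05, train_scores))  # inside the first group
```

A query at 3.0 sits in the middle of the training range, yet its nearest neighbours are far away on either side, so it is flagged. The check breaks down if more than half the spacings are zero (many duplicate scores), so treat it as a starting point only.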

“LOO does however have two blind spots. If the compound collection is made up of a few core chemical compositions, each of which is represented by several compounds of nearly identical composition x, then the operation of removing any single compounds will not be sufficient to get its influence out of the data set, because of the fraternal twin(s) still in the calibration. Under these circumstances, LOO will over-state the quality of the fit.”
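The blind spot in the quote can be illustrated by comparing LOO with leaving out a whole core at a time. What follows is a sketch under our own assumptions: the core activities are invented, there are four near-identical compounds per core, and we use a nearest-neighbour predictor (our choice, not something PoO prescribes) because it makes the twin effect easy to see.

```python
import random

def nearest_neighbour_predict(x, train):
    """Predict y as the y of the training compound closest in x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

def cv_q2(data, folds):
    """Cross-validated q2; `folds` is a list of lists of held-out indices."""
    press = 0.0
    for held in folds:
        held_set = set(held)
        train = [p for i, p in enumerate(data) if i not in held_set]
        for i in held:
            x, y, _ = data[i]
            press += (y - nearest_neighbour_predict(x, train)) ** 2
    mean_y = sum(p[1] for p in data) / len(data)
    ss = sum((p[1] - mean_y) ** 2 for p in data)
    return 1.0 - press / ss

rng = random.Random(1)
# Hypothetical core activities, unrelated to the core's position in x
core_activity = [5.1, 8.3, 6.0, 7.4, 4.6, 8.8, 5.5, 6.9]
data = []  # tuples of (descriptor x, activity y, core id)
for core, activity in enumerate(core_activity):
    for _ in range(4):  # four 'fraternal twins' per core
        data.append((core + rng.gauss(0, 0.01),
                     activity + rng.gauss(0, 0.1),
                     core))

loo_folds = [[i] for i in range(len(data))]
core_folds = [[i for i, p in enumerate(data) if p[2] == core]
              for core in range(len(core_activity))]

q2_loo = cv_q2(data, loo_folds)
q2_core = cv_q2(data, core_folds)
print("LOO q2:            ", round(q2_loo, 2))
print("leave-core-out q2: ", round(q2_core, 2))
```

With LOO, a twin of the held-out compound is always still in the calibration set, so the prediction looks excellent. Once the whole core is held out, the predictor has nothing to go on and the q² collapses.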

We’ll leave LOO’s other blind spot for another day because we’d like to conclude by sharing some thoughts on what we think QSAR modellers should be doing if they want to claim that their models are truly global. First, selection of training sets should aim for a maximally even coverage of the space defined by the descriptors, even if that means discarding data. Second, molecular similarity measures should be used to ensure that no two molecular structures in the training set are too similar, even if this means discarding data.
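The second suggestion can be sketched as a simple greedy filter. Everything here is an assumption made for illustration: compounds are represented as sets of integer feature ids (in practice these would be fingerprints from a cheminformatics toolkit), similarity is Tanimoto, and the 0.7 cut-off and the name `select_dissimilar` are arbitrary.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of feature ids."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def select_dissimilar(compounds, max_similarity=0.7):
    """Greedy pass: keep a compound only if its similarity to every
    compound already kept is at most `max_similarity`."""
    kept = []
    for name, fp in compounds:
        if all(tanimoto(fp, kept_fp) <= max_similarity
               for _, kept_fp in kept):
            kept.append((name, fp))
    return kept

# Hypothetical compounds as (name, set of feature ids)
compounds = [
    ("A1", {1, 2, 3, 4, 5}),
    ("A2", {1, 2, 3, 4, 5, 6}),      # near-twin of A1
    ("B1", {10, 11, 12, 13}),
    ("B2", {10, 11, 12, 13, 14}),    # near-twin of B1
    ("C1", {1, 2, 20, 21}),
]

selected = [name for name, _ in select_dissimilar(compounds)]
print(selected)  # ['A1', 'B1', 'C1']
```

The result depends on the order in which compounds are presented, so which twin survives is arbitrary. The filter only removes near-duplicates; on its own it does nothing to even out coverage of descriptor space.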

We think this is quite a good place to leave things for now and hope that you’re all now on intimate terms with LIV.
