Sunday, April 11, 2010

Who guards the guardians?

A bit over a week ago, we greatly enjoyed the reunion with our old friend the Latent Indicator Variable (LIV) and were sorely tempted to indulge in some Categorical Sin (CS) to celebrate the occasion. LIV and CS both identify aberrant behaviour and these aberrations can be seen as mirror images of each other, inasmuch as it is possible to get Sin to look at itself in the mirror. Perhaps somebody much smarter and more visionary than us will demonstrate that LIV and CS are actually Fourier Transforms of each other and we will realise that Sin is simply a manifestation of the Wave Particle Duality like everything else.

However, our intention is not to move from one sinful domain to another. We have been bashing QSAR and its cousin QSPR for longer than we care to remember and are beginning to tire of this sport. In order to maintain our wakefulness and sanity at the turkey shoot, we thought that it might be a good idea to take a closer look at Validation. Who is Validation, we hear you cry, and as Huxley may have put it, is she ‘pneumatic’? Validation is both QSAR’s shield and QSAR’s Achilles Heel. Duality indeed!

As you’ll have read previously in DoA, predictive modelling in Drug Discovery typically involves lots of correlated descriptors, so Overfitting is an ever-present danger, especially when user-friendly (i.e. easy to generate some output) model building software is put in the hands of grinning halfwits who have only the most rudimentary understanding of the models that they are building. Validating your model is one way that you can convince others (and yourself) that it has not been overfit. Model validation typically involves using only some of the data to build a model and then using the model to predict the observations that you’ve left out. We’ll start by taking a look at the Leave-One-Out (LOO) method for cross-validation and would like to state categorically that bringing PoO, LOO and LIV (Boers go to the livit'ry to krepp after a grit trik) together in a single Crapshoot should not be taken as a scatological comment on any specific predictive modelling methodology.

LOO was described in PoO and we’ll illustrate it using a couple of graphics, one of which we’ll simply recycle from the previous Crapshoot. The LOO procedure involves discarding each data point in turn and re-fitting the model. Let’s take a look at this for our enzyme inhibition model from the previous Crapshoot which is illustrated in Figure 1.



Now let’s see what happens when we leave out one of the data points, an operation that we’ll show by drawing the discarded point as an unfilled circle (Figure 2). You’ll see that the new line of fit (dashed black line) has moved away from the point that was discarded since that point can no longer influence the fit. You can calculate something called a q-squared (q**2) which is similar to the R-squared (R**2) that many of you have already encountered. We’ll talk a bit more about these quantities in the next Crapshoot and we’ll also tell you a bit more about why LOO might give you an optimistic view of model quality for a dataset like this. Until then please try to keep yourselves busy, motivated and within regions of chemical space acceptable to Senior Pharma Fellow.
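For those who would rather see the LOO procedure as code than as circles and dashed lines, here is a minimal sketch in plain Python. It is not the enzyme inhibition model from the figures: the data points are made up purely for illustration, the function names (fit_line, r_squared, loo_q_squared) are our own, and we assume one common convention for q**2, namely 1 - PRESS/SS_tot with SS_tot taken about the mean of all the observations. Other conventions exist.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def total_sum_of_squares(ys):
    """Sum of squared deviations of the observations from their mean."""
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def r_squared(xs, ys):
    """R**2 for a line fitted to ALL of the data."""
    slope, intercept = fit_line(xs, ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / total_sum_of_squares(ys)


def loo_q_squared(xs, ys):
    """q**2 from leave-one-out: discard each point in turn, re-fit, predict it."""
    press = 0.0  # predictive residual sum of squares
    for i in range(len(xs)):
        # Build the training set without point i (the 'unfilled circle').
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        # The discarded point had no influence on this fit.
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    # Assumed convention: SS_tot about the mean of all observations.
    return 1.0 - press / total_sum_of_squares(ys)


# Hypothetical data, invented for this sketch only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.4, 3.8, 5.3, 5.9]

print("R**2 =", r_squared(xs, ys))
print("q**2 =", loo_q_squared(xs, ys))
```

With a straight-line least-squares fit and this convention, q**2 can never exceed R**2, because each left-out point is predicted by a line that it had no hand in fitting.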




