We cannot put it off any longer. We must now return to our series of posts on predictive modelling and QSAR, depressing though this is. Although the weather back at home in Ontario has been delightful, we are in Peru, very close to the Brazilian frontier, and it has rained continuously for the last four days, reminding us exactly what it is usually like to read an article on the prediction of aqueous solubility or hERG blockage. On those occasions we are usually prompted to summon an appropriately trained individual, equipped with captive bolt pistol, to put the offending article out of its (and our) misery. On a brighter note, we think it is time to share an atypically good article on QSAR and predictive modelling. It is entitled QSAR: dead or alive? (DoA) and is a lively and entertaining read. That article and The Problem of Overfitting (PoO), which has already featured in this column, should both be read by anyone planning to build, use or be influenced by QSAR models.
DoA starts by taking a look at the difference between correlation and causation. You’ll remember us questioning in an earlier Crapshoot whether it is really possible to say that poor oral bioavailability is any more a consequence of too many rotatable bonds than of too high a molecular weight, and we can see others heading for that particular tar pit. Correlation does not mean causation and in fact, as we’ll demonstrate in a future Crapshoot, correlation may not even mean correlation, although we don’t think that is a particularly helpful place to go right now. We liked the examples presented in DoA, such as the eminently sensible proposal to increase the birth rate in Germany by inducing more storks to nest there. We propose using the incentive of placing (concentrating?) them in cages. However, what we really want you to take a look at is Figure 3.
Figure 3 in DoA shows an excellent correlation between length and width for a large collection of skulls gathered from the Paris Catacombs, which seemed a lot more dead than alive even when allowing for a Schrödingerian ambiguity on the latter point. Actually there are two groups of skulls: male and female. Most of the strength in this correlation comes from the absolute differences in skull sizes between men and women, and if you’re wondering where you’ve seen this before, it’s just our old friend the Latent Indicator Variable that was introduced in an excruciating sequence of posts last year. But let’s just move on because we’ve flogged that very dead horse to a greater extent than is usually considered tasteful.
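For anyone who missed those posts, here is a minimal Python sketch of the effect. The numbers are invented for illustration (they are not measurements from DoA's Figure 3) and the helper names `pearson_r` and `simulate_group` are ours. Within each group, length and width are generated independently, yet pooling the two groups produces a strong correlation.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def simulate_group(rng, n, mean_length, mean_width, sd):
    """Draw length and width independently, so the true within-group r is zero."""
    lengths = [rng.gauss(mean_length, sd) for _ in range(n)]
    widths = [rng.gauss(mean_width, sd) for _ in range(n)]
    return lengths, widths


rng = random.Random(0)

# Two hypothetical groups that differ only in their mean size (arbitrary units).
group_a = simulate_group(rng, 500, mean_length=180.0, mean_width=140.0, sd=3.0)
group_b = simulate_group(rng, 500, mean_length=200.0, mean_width=160.0, sd=3.0)

r_a = pearson_r(*group_a)
r_b = pearson_r(*group_b)
r_pooled = pearson_r(group_a[0] + group_b[0], group_a[1] + group_b[1])

print(f"within group A: r = {r_a:.2f}")
print(f"within group B: r = {r_b:.2f}")
print(f"pooled:         r = {r_pooled:.2f}")
```

The pooled correlation here is carried entirely by the offset between the group means, which is the Latent Indicator Variable at work.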
DoA also highlights the problem of chance correlation. These days you have lots of descriptors with which to craft your predictive model. Literally buckets of them! Everything from the kurtosis of quadrupole-scaled atom charges to an entire family of spherical harmonics derived from the trace of the hyperpolarizability tensor. However, the QSAR modeller’s wet dream is a Bosch-sculpted nightmare for anybody trying to use the models to gain insight or even that slight edge over the opposition. If you’ve lots of descriptors with which to play, you’re more likely to find a significant correlation that:
“...is a tale
Told by an idiot, full of sound and fury,
Signifying nothing”.
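To get a feel for how quickly this happens, the sketch below correlates a random "activity" with a pile of equally random "descriptors" and reports the largest absolute correlation found. Everything here is an assumption made for illustration: 20 compounds, 1000 descriptors, Gaussian noise throughout, and function names of our own invention. Nothing in it comes from DoA or PoO.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def chance_correlations(rng, n_compounds, n_descriptors):
    """Return |r| between one random activity and each random descriptor."""
    activity = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
    abs_rs = []
    for _ in range(n_descriptors):
        descriptor = [rng.gauss(0.0, 1.0) for _ in range(n_compounds)]
        abs_rs.append(abs(pearson_r(descriptor, activity)))
    return abs_rs


rng = random.Random(42)
abs_rs = chance_correlations(rng, n_compounds=20, n_descriptors=1000)

# Best descriptor when only a handful are tried versus when all are tried.
best_of_5 = max(abs_rs[:5])
best_of_1000 = max(abs_rs)

print(f"best |r| among the first 5 descriptors: {best_of_5:.2f}")
print(f"best |r| among all 1000 descriptors:    {best_of_1000:.2f}")
```

None of these descriptors has anything to do with the activity, so whatever the winning correlation turns out to be, it signifies nothing.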
However, the bad news doesn’t end there because chemical space is not uniformly occupied. We’ve already discussed some of the consequences of this in connection with the Latent Indicator Variable. We believe that an uneven distribution of molecules in chemical space further increases the likelihood of finding a chance correlation. We’ll talk a bit more about that in the next Crapshoot and will leave you with this most sensible of suggestions from PoO:
“If the collection of compounds consists of, or includes, families of close analogues of some smaller number of ‘lead’ compounds, then a sample reuse cross-validation will need to omit families and not individual compounds.”
Now why doesn’t everyone do that?
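For what it's worth, omitting families is not hard to do once you have them. Here is a minimal sketch, assuming each compound has already been assigned a family label (deciding on those labels is the hard part and is not addressed here). The function name `leave_family_out_splits` and the toy labels are hypothetical.

```python
from collections import defaultdict


def leave_family_out_splits(families):
    """Yield (held_out_family, train_indices, test_indices) for each family.

    Every compound from the held-out family goes into the test set, so no
    close analogue of a test compound is left behind in the training set.
    """
    members = defaultdict(list)
    for index, family in enumerate(families):
        members[family].append(index)

    for held_out, test_indices in members.items():
        train_indices = [
            index
            for family, indices in members.items()
            if family != held_out
            for index in indices
        ]
        yield held_out, train_indices, test_indices


# Toy example: nine compounds belonging to three analogue families.
families = ["A", "A", "A", "B", "B", "C", "C", "C", "C"]
splits = list(leave_family_out_splits(families))

for held_out, train_indices, test_indices in splits:
    print(f"omit family {held_out}: train on {train_indices}, test on {test_indices}")
```

If you use scikit-learn, `LeaveOneGroupOut` and `GroupKFold` in `sklearn.model_selection` provide this kind of grouped splitting.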
Friday, March 19, 2010