Wednesday, April 1, 2009

The latent indicator variable 2

The toy example in the previous post is clearly a bit of an over-simplification although it is useful for illustration of some ideas. With only two substituents, it should be pretty obvious to all but the most witless when compounds with one substituent are more active than the corresponding compounds with the other substituent.

Things get a bit more complicated when you have a number of substituents. Time for another of The Crapshoot’s annoying toy examples, for which we make no apology. If you find reading this garbage to be a painful experience then please spare a thought for those of us who have to write it.

Suppose you can now have one of 5 substituents at a particular position instead of just chlorine and the ‘un-substituent’ hydrogen. Let’s also assume classic Free-Wilson linearity-additivity in the SAR such that each substituent makes a constant (and different) contribution to activity. Although this is a rather contrived system, it is not too different from the situation in MedChem projects where a well-defined ranking of substituents is observed that is independent of what may be present at other positions of diversity in the molecule. If we’ve got 5 compounds, each with a different one of these 5 substituents, you should be able to fit whatever biological activity you observe using 5 different substituent parameters, provided that each parameter has a different value for each substituent. For example, you might use sigma meta, sigma para, sigma resonance, sigma inductive, volume, cube root of the trace of the substituent polarizability tensor, ad nauseam. The key point is that it just doesn’t matter which parameters you pick, as long as each one has a different value for each substituent. This is the curse of the Latent Indicator Variable.
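To make the point concrete, here is a minimal sketch in Python with NumPy. Everything in it is made up for illustration: the five activities are hypothetical, and the two sets of ‘substituent parameters’ are just random numbers standing in for whatever descriptors you might have picked. Both sets reproduce the five activities essentially exactly, which says nothing at all about whether either set has any physical relevance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activities (say pIC50) for 5 compounds, one per substituent.
activity = np.array([5.2, 6.1, 6.8, 7.4, 5.9])

def max_fit_error(descriptors, y):
    """Least-squares fit of y on the descriptor columns (no intercept);
    returns the largest absolute residual."""
    coef, *_ = np.linalg.lstsq(descriptors, y, rcond=None)
    return float(np.max(np.abs(descriptors @ coef - y)))

# Two unrelated, made-up sets of 5 parameters (rows: substituents,
# columns: parameters). Random values stand in for sigma meta, volume, etc.
set_one = rng.normal(size=(5, 5))
set_two = rng.normal(size=(5, 5))

error_one = max_fit_error(set_one, activity)
error_two = max_fit_error(set_two, activity)
print(error_one, error_two)  # both residuals are at rounding-error level
```

Strictly speaking, the five parameter columns also need to be linearly independent across the five substituents, which arbitrary real-valued descriptors almost always are.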

Now 5 adjustable parameters and 5 compounds would really look rather like over-fitting. But suppose we’ve done this combinatorially and have another position of diversity (let’s call it B) at which we can have one of 10 substituents. Now there are 10 compounds with each one of the 5 original substituents (let’s call these the position A substituents). Now here’s the fun bit, and don’t worry because we’ll hold your hand so we can do it together. We’re going to take the average pIC50 for the compounds with each of the 5 position A substituents. Provided that these averages are all sufficiently different, you’ll get some sort of model when you use all the data points. And when you use all 50 data points, using 5 adjustable parameters doesn’t look quite so naughty.
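The 50-compound version can be sketched the same way. The additive contributions below are assumed values chosen only for illustration, and the 5 ‘descriptors’ are again random numbers that depend on nothing but the position A substituent. The sketch fits the data once with explicit indicator variables for position A and once with the made-up descriptors, and the two models give the same predictions: the descriptors are indicator variables in disguise, and the fitted values are just the per-substituent averages.

```python
import numpy as np

rng = np.random.default_rng(1)
n_a, n_b = 5, 10  # substituents at positions A and B

# Hypothetical Free-Wilson contributions in pIC50 units (assumed values).
a_contrib = np.array([0.0, 0.4, 0.9, 1.5, 2.0])
b_contrib = rng.normal(scale=0.5, size=n_b)

# Full 5 x 10 combinatorial library: one compound per (A, B) pair.
a_idx = np.repeat(np.arange(n_a), n_b)
b_idx = np.tile(np.arange(n_b), n_a)
pic50 = 5.0 + a_contrib[a_idx] + b_contrib[b_idx]

# Model 1: explicit indicator (one-hot) variables for position A.
indicators = np.eye(n_a)[a_idx]                      # shape (50, 5)
coef_ind, *_ = np.linalg.lstsq(indicators, pic50, rcond=None)
pred_ind = indicators @ coef_ind

# Model 2: 5 made-up descriptors whose values depend only on the
# position A substituent (rows: substituents, columns: descriptors).
descriptor_table = rng.normal(size=(n_a, 5))
descriptors = descriptor_table[a_idx]                # shape (50, 5)
coef_desc, *_ = np.linalg.lstsq(descriptors, pic50, rcond=None)
pred_desc = descriptors @ coef_desc

# Average pIC50 for each position A substituent.
group_means = np.array([pic50[a_idx == i].mean() for i in range(n_a)])
```

Here `coef_ind` matches `group_means`, and `pred_desc` matches `pred_ind` to within rounding error, so 50 data points have bought no more than the 5 averages.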

The problem is that we’ve used Latent Indicator Variables and, even with 50 data points, this model only works if a compound contains one of the 5 position A substituents that we used to train the model. Unfortunately, the situation is a little less easy to spot than when we’ve only got two substituents to worry about. A compound might sit right at the centroid of the model space and the unwary would call this interpolation. Yes, if it has one of the 5 position A substituents used to train the model, but otherwise No.
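A rough sketch of what goes wrong, using the same kind of made-up data (assumed contributions, random numbers for descriptors). A sixth, unseen position A substituent is placed exactly at the centroid of the training descriptors in one descriptor set. The model then simply returns the average of the five training averages, whatever the new substituent actually does. A second, equally arbitrary descriptor set that fits the training data just as well gives a different answer for the same substituent.

```python
import numpy as np

rng = np.random.default_rng(2)
n_a, n_b = 5, 10

# Hypothetical additive SAR in pIC50 units (assumed values).
a_contrib = np.array([0.0, 0.4, 0.9, 1.5, 2.0])
b_contrib = rng.normal(scale=0.5, size=n_b)
a_idx = np.repeat(np.arange(n_a), n_b)
b_idx = np.tile(np.arange(n_b), n_a)
pic50 = 5.0 + a_contrib[a_idx] + b_contrib[b_idx]
group_means = np.array([pic50[a_idx == i].mean() for i in range(n_a)])

def train(table):
    """Fit pIC50 on descriptors that depend only on the A substituent."""
    coef, *_ = np.linalg.lstsq(table[a_idx], pic50, rcond=None)
    return coef

# Two arbitrary descriptor sets for the 5 training substituents.
table_one = rng.normal(size=(n_a, 5))
table_two = rng.normal(size=(n_a, 5))
coef_one, coef_two = train(table_one), train(table_two)

# A sixth substituent, never seen in training. In set one it sits at
# the centroid of the training substituents; in set two its (made-up)
# descriptor values are unrelated.
new_in_one = table_one.mean(axis=0)
new_in_two = rng.normal(size=5)

pred_one = float(new_in_one @ coef_one)
pred_two = float(new_in_two @ coef_two)
```

`pred_one` equals `group_means.mean()`: the ‘interpolated’ prediction carries no information about the new substituent, and `pred_two` disagrees with it even though both models are indistinguishable on the training set.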

This is probably a good point at which to sign off. There were so many things we wanted to talk about, like correlations between descriptors, why it doesn’t really make sense to use Hammett constants to model biomolecular recognition, and the dangers to Civilisation posed by structural clusters in training sets. However, enough is enough and we’ll leave you with a problem that anyone who has done some ten pin bowling will be familiar with. Your first ball has knocked down all the pins except two. Anyone care to guess which two? In case you’ve not figured it out, the two pins are numbers 7 and 10. That’s why they call it a 7-10 split! They sit at opposite ends of the back row, and the centroid of the model space is not going to be a whole lot of help now.
