Monday, April 20, 2009

Another year, another Senior Pharma Fellow

It is now 2 years since we started The Crapshoot and 58 posts and 20k pageloads later it has not been put out of your misery. The second year has been less eventful than the first in that we received no death threats, something that greatly disappoints us. This year marked the debut of one of our favorite characters who we have named Senior Pharma Fellow although you will know him by any of a number of names that tact prevents us from mentioning.

Wednesday, April 1, 2009

The latent indicator variable 2

The toy example in the previous post is clearly a bit of an over-simplification although it is useful for illustration of some ideas. With only two substituents, it should be pretty obvious to all but the most witless when compounds with one substituent are more active than the corresponding compounds with the other substituent.

Things get a bit more complicated when you have a number of substituents. Time for another of The Crapshoot’s annoying toy examples, for which we make no apology. If you find reading this garbage to be a painful experience then please spare a thought for those of us who have to write it.

Suppose you can now have one of 5 substituents at a particular position instead of just chlorine and the ‘un-substituent’ hydrogen. Let’s also assume classic Free-Wilson linearity-additivity in the SAR such that each substituent makes a constant (and different) contribution to activity. Although this is a rather contrived system it is not too different from the situation that exists in MedChem projects where a well-defined ranking of substituents is observed that is independent of what may be present at other positions of diversity in the molecule. If we’ve got 5 compounds each with a different one of these 5 substituents you should be able to fit whatever biological activity you observe using 5 different substituent parameters, provided that each has different values for each substituent. For example you might use sigma meta, sigma para, sigma resonance, sigma inductive, volume, cube root of the trace of the substituent polarizability tensor, ad nauseam. The key point is that it just doesn’t matter as long as each parameter has different values for each substituent. This is the curse of the Latent Indicator Variable.

Now 5 adjustable parameters and 5 compounds would really look rather like over-fitting. But suppose we’ve done this combinatorially and have another position (let’s call it B) of diversity at which we can have one of 10 substituents. Now there are 10 compounds with each one of the 5 original substituents (let’s call these the position A substituents). Now here’s the fun bit and don’t worry because we’ll hold your hand so we can do it together. We’re going to take the average pIC50 for compounds with each of the 5 position A substituents. Provided that these averages are all sufficiently different, you’ll get some sort of model when you use all the data points. And when you use all 50 data points, using 5 adjustable parameters doesn’t look quite so naughty.
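For those who prefer their hand-holding in executable form, here is a minimal Python sketch of the averaging step. The substituent labels (A1 to A5, B1 to B10), the contributions and the baseline pIC50 are all made up for illustration, and strict additivity is assumed.

```python
from statistics import mean

# Hypothetical Free-Wilson contributions (log units); not real data.
A_CONTRIBUTION = {"A1": 0.0, "A2": 0.4, "A3": 0.9, "A4": 1.3, "A5": 2.0}
B_CONTRIBUTION = {f"B{i}": 0.1 * (i - 1) for i in range(1, 11)}
BASELINE = 5.0  # assumed pIC50 of the core with zero contributions

# The full 5 x 10 combinatorial library under strict additivity.
library = {
    (a, b): BASELINE + A_CONTRIBUTION[a] + B_CONTRIBUTION[b]
    for a in A_CONTRIBUTION
    for b in B_CONTRIBUTION
}

# The "model": one average pIC50 per position A substituent (5 parameters).
average = {
    a: mean(p for (a_label, _), p in library.items() if a_label == a)
    for a in A_CONTRIBUTION
}

for a, value in average.items():
    print(a, round(value, 2), "relative to A1:", round(value - average["A1"], 2))
```

Because every position A substituent is paired with the same ten position B substituents, the differences between the averages simply reproduce the assumed contributions, which is all that the five parameters are doing.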

The problem is that we’ve used Latent Indicator Variables and, even with 50 data points, this model only works if a compound contains one of the 5 position A substituents that we’ve used to train the model. Unfortunately the situation is less easy to spot than when we’ve only got two substituents to worry about. A compound might sit right at the centroid of the model space and the unwary would say this was interpolation. Yes, if you’re using one of the 5 position A substituents used to train this model but otherwise No.
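The difference between the two notions of interpolation can be sketched in Python as follows. The descriptor values and the substituent labels (including the newcomer A6) are hypothetical, and the two check functions are our own illustration rather than anything from a real modelling package.

```python
# Hypothetical descriptor values for the five position A substituents
# used in training (labels and numbers are made up).
TRAINING_DESCRIPTOR = {"A1": -0.20, "A2": 0.00, "A3": 0.10, "A4": 0.30, "A5": 0.50}


def within_descriptor_range(value):
    """Naive check: does the descriptor value lie inside the training range?"""
    values = TRAINING_DESCRIPTOR.values()
    return min(values) <= value <= max(values)


def substituent_was_trained(label):
    """Stricter check: was this exact substituent in the training set?"""
    return label in TRAINING_DESCRIPTOR


# A new substituent whose descriptor sits in the middle of the range.
new_label, new_value = "A6", 0.20
print("looks like interpolation:", within_descriptor_range(new_value))
print("substituent seen in training:", substituent_was_trained(new_label))
```

The first check is the one that the continuous descriptor invites you to make; the second is the one that matters when the descriptor is really acting as an indicator variable.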

This is probably a good point at which to sign off. There were so many things we wanted to talk about like correlations between descriptors, why it doesn’t really make sense to use Hammett constants to model biomolecular recognition and the dangers to Civilisation posed by structural clusters in training sets. However, enough is enough and we’ll leave you with a problem that anyone who has done some ten pin bowling will be familiar with. Your first ball has knocked down all the pins except two. Anyone care to guess which two? In case you’ve not figured it out, the two pins are numbers 7 and 10. That’s why they call it a 7-10 split! They sit at opposite ends of the back row and the centroid of the model space is not going to be a whole lot of help now.

next

Saturday, February 28, 2009

The latent indicator variable 1

Well it does seem a while since we last posted and there is still much work to do as we continue from the previous post. The situation in which you either have chlorine or hydrogen at C4 of the phenyl should be easy to spot using any of a number of substituent parameters and comparing average pIC50 values for the two groups of compounds will give you a good idea of whether or not substitution with chloro is good for activity. If substitution with chloro at C4 leads to a consistent increase in potency, you’ll get a model that is both predictive and that can be validated. So exactly what is your point, we hear you cry.

OK let’s be a bit more specific. We’ll use the Wikipedia as our source of Hammett sigma constants. The Hammett sigma constant for meta-chloro is +0.37 and (by definition) that for hydrogen is zero. If chloro substitution leads to a significant increase in potency you should get a reasonable model by fitting pIC50 to sigma. It will satisfy validation criteria and Senior Pharma Fellow (SPF) will be able to rattle off an impressive array of quality control metrics in his next presentation. Aren’t we clever! Surely it’s time to use the model to do some predicting.

Our chemists want to know what happens if we introduce methoxy or fluoro at C4. Actually they don’t like Senior Pharma Fellow (SPF) any more than we do but there is a directive from the Project Management Politburo that these models are to be used even if they are not believed. Furthermore you need to run the model so that you can tick the relevant boxes on the Authorisation For Synthesis form that the tiresome Black-Belted Half-Wits have set up for the gathering of Base-line Productivity Indicators. At least we know that we won’t be extrapolating because the Hammett sigma values for meta-methoxy and meta-fluoro are +0.11 and +0.34 respectively so both lie within the space spanned by the training set. We’d predict that replacing chloro at C4 with fluoro would lead to a small drop in potency because the relevant Hammett sigma values are so similar. We’d be particularly confident in our predictions for the methoxy-substituted analogs because this represents interpolation to a greater extent than if we were doing predictions for the compounds with which the model was built.

Now for the sake of argument, let’s suppose we’d decided to use the Hammett constants for these substituents at the para position. The value for chlorine is now +0.23 and that for hydrogen is still zero (by definition) as before so the quality of the model is unchanged. However fluoro (sigma-para = +0.06) looks much more like hydrogen than chloro while methoxy (sigma-para = -0.27) now lies well outside the space spanned by the training set. Needless to say this is a very different picture to what we saw using sigma-meta values.
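To put some numbers on this, here is a small Python sketch that fits pIC50 against sigma using only the hydrogen and chloro groups and then predicts fluoro and methoxy. The sigma values are the ones quoted above; the group-average pIC50 values (6.0 for hydrogen, 7.0 for chloro) are invented, as are the function names.

```python
# Hammett constants as quoted in this post (sourced there from Wikipedia).
SIGMA_META = {"H": 0.00, "Cl": 0.37, "OMe": 0.11, "F": 0.34}
SIGMA_PARA = {"H": 0.00, "Cl": 0.23, "OMe": -0.27, "F": 0.06}

# Hypothetical group-average pIC50 values: chloro is one log unit better.
MEAN_PIC50 = {"H": 6.0, "Cl": 7.0}


def two_point_fit(sigma):
    """Fit pIC50 = slope * sigma + intercept through the H and Cl averages."""
    slope = (MEAN_PIC50["Cl"] - MEAN_PIC50["H"]) / (sigma["Cl"] - sigma["H"])
    intercept = MEAN_PIC50["H"] - slope * sigma["H"]
    return slope, intercept


def predict(sigma, substituent):
    """Predict pIC50 for a substituent from the chosen sigma scale."""
    slope, intercept = two_point_fit(sigma)
    return slope * sigma[substituent] + intercept


for name, sigma in (("sigma-meta", SIGMA_META), ("sigma-para", SIGMA_PARA)):
    predictions = {s: round(predict(sigma, s), 2) for s in ("H", "Cl", "F", "OMe")}
    print(name, predictions)
```

Both scales reproduce the hydrogen and chloro averages equally well, yet they disagree about fluoro and they disagree about methoxy in direction as well as size. Nothing in the training data can tell you which scale to believe.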

What does this all mean? This is obviously a toy example that we’ve created to illustrate a point. However it is clear that if we’re building models using pIC50s for compounds that are either unsubstituted or have chloro at C4 then sigma-para will work just as well as sigma-meta. The sigma values function as indicator variables and any parameter which has different values for chloro and hydrogen substituents will do the job just as well. The problem is that for these models having anything other than hydrogen or chloro at C4 represents an extrapolation while the continuous nature of sigma constants suggests that we might be interpolating. Real models are typically a lot more complex than this toy example and it is often not clear when linear combinations of continuous variables are actually functioning as indicator variables. We’ll pick up in the next post since it is getting late and there is cider to be drunk. It should be fun and hopefully we will not encounter a latent indicator variable (LIV).

next

Sunday, January 25, 2009

Islands in the chemical ocean

We left you rather abruptly in the previous post, having been stung by your suggestion that we might be uncouth. However, we have decided to forgive you and continue with our tale.

We'll start with a scenario with which many of our loyal and patient readers will be familiar. You're optimising a series and have found that adding a chloro substituent at C4 of one of the phenyl rings increases the pIC50 (-log IC50 in concentration units of mol/litre) by a unit regardless of what substituents are present at C3 and C5. Those of you who've worked in drug discovery will have seen this sort of thing. Everybody in the project knows that the 4-chloro substituent is good for potency and if it goes the potency has to be clawed back from somewhere else. Just like tax.

This sort of thinking is the basis of Free-Wilson analysis. The C4 chlorine and the hydrogen of the unsubstituted C4 can each be thought of as contributing to potency. The contribution of the chlorine is a log unit greater than that of hydrogen. So you've recognised this pattern in your project data but this isn't good enough. What do you mean, "not good enough"? You have quite some nerve, M. le Crapshoot. Nothing to do with us. The Chemistry Discipline Review Committee have decided that they'd really prefer that you did this sort of thing with some equations rather than this uncultured chemical structure stuff. Also Senior Pharma Fellow (SPF) needs some equations for the presentation slides that his secretary is preparing for him. Can't you just generate some predictive models instead of being so difficult?

Well you didn't handle that very well, did you? Anyway stop complaining because you've got work to do. You do some modelling and you find out the Hammett sigmas (both meta and para) for the C4 substituent are both useful predictors of pIC50 as are the substituent hydrophobicity parameter and the molar mass of the substituent. Then you make a startling discovery.

The molecules with which you're building the models either have chlorine at C4 or are unsubstituted at this position.
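A few lines of Python show why every descriptor looked useful. This is only a sketch: the six compounds and their pIC50 values are invented, the sigma values are the Wikipedia figures for chloro, and the hydrophobicity (pi) and molar mass values are textbook numbers quoted from memory, so treat them as assumptions.

```python
from math import sqrt

# Descriptor values for the C4 substituent as (hydrogen, chloro).
DESCRIPTORS = {
    "sigma-meta": (0.00, 0.37),
    "sigma-para": (0.00, 0.23),
    "pi (hydrophobicity)": (0.00, 0.71),  # assumed textbook value
    "molar mass": (1.008, 35.45),         # assumed textbook value
}

# Invented data set: C4 substituent and pIC50 for six compounds.
COMPOUNDS = [("H", 6.1), ("Cl", 7.0), ("H", 5.8), ("Cl", 7.2), ("Cl", 6.9), ("H", 6.4)]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    syy = sum((y - y_mean) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


pic50 = [p for _, p in COMPOUNDS]
correlation = {}
for name, (h_value, cl_value) in DESCRIPTORS.items():
    xs = [cl_value if substituent == "Cl" else h_value for substituent, _ in COMPOUNDS]
    correlation[name] = pearson(xs, pic50)
    print(name, round(correlation[name], 4))
```

With only two substituents in the data, each descriptor is just a relabelled chloro/hydrogen flag, so all four correlations come out the same.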

next

Friday, January 2, 2009

The perils of overfitting

We do seem to have let things slip over the holiday period but rest assured, Esteemed Readers, that we have not abandoned you or, for that matter, William of Ockham who is familiarising himself with the health and safety implications of bringing razors to work. Following on from the last post, we take a look at an article on overfitting. Unlike much of the literature we review in this column, we rather like this article and think it's a real shame that many of the folk who publish predictive models are palpably unaware of its existence.

What is overfitting, one might ask? The author notes that, "Occam's razor, or the principle of parsimony, calls for using models and procedures that contain all that is necessary for the modeling but nothing more". How aptly put! Blofeld and Random Forest had better mind their step. You might ask what all this means and we will respond that the predictive models that we're talking about all use adjustable parameters to fit the data. To some extent, the more parameters that you use, the better the fit that you'll achieve. One definition of overfitting is using a model that has too many parameters. Or at least more than you needed.
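The point about parameters and fit can be illustrated with a deliberately contrived Python sketch. The data are made up: the true trend is y = x and a fixed pattern of +0.3 and -0.3 stands in for experimental noise. One model is a straight line (2 parameters) and the other is a polynomial forced through every training point (6 parameters for 6 points). The function names are ours.

```python
from math import sqrt


def rmse(predicted, observed):
    """Root mean square error between two equal-length lists."""
    return sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))


def fit_line(xs, ys):
    """Least-squares straight line: 2 adjustable parameters."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return lambda x: slope * x + intercept


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point: as many parameters as data."""
    def model(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total
    return model


# Made-up training data: y = x plus an alternating +/-0.3 "noise" pattern.
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [x + (0.3 if i % 2 == 0 else -0.3) for i, x in enumerate(train_x)]
# Test points between the training points, with noise-free "true" values.
test_x = [0.5, 1.5, 2.5, 3.5, 4.5]
test_y = list(test_x)

models = {"2 parameters": fit_line(train_x, train_y),
          "6 parameters": fit_interpolant(train_x, train_y)}
results = {}
for name, model in models.items():
    train_error = rmse([model(x) for x in train_x], train_y)
    test_error = rmse([model(x) for x in test_x], test_y)
    results[name] = (train_error, test_error)
    print(name, "train RMSE:", round(train_error, 3), "test RMSE:", round(test_error, 3))
```

The 6-parameter model fits the training data perfectly because it has fitted the noise, and it pays for this when asked about points it has not seen.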

By now you'll have figured out why we are keen to see the actual models rather than just reading about their wonderful r-squares, q-squares and root mean squares. When the actual model is presented you can see exactly how many parameters it uses. "But isn't this just the number of descriptors?", you might ask. Basically yes, if it's a linear regression model and you don't count the intercept as a parameter. Once you enter the non-linear world of neural nets this is not the case any more. We don't believe that any journal that wishes to be considered respectable should be publishing new predictive models unless these models are fully specified in the article.
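As a rough illustration of how quickly the parameter count parts company with the descriptor count, here is a short Python sketch. It assumes a plain fully connected feed-forward network with one bias per neuron, and the layer sizes are made up; real published networks may differ.

```python
def linear_model_parameters(n_descriptors, include_intercept=True):
    """One coefficient per descriptor, plus the intercept if you count it."""
    return n_descriptors + (1 if include_intercept else 0)


def dense_network_parameters(layer_sizes):
    """Weights plus biases for a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, inputs first.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Ten descriptors in both cases (hypothetical).
print(linear_model_parameters(10, include_intercept=False))
print(linear_model_parameters(10))
# Ten inputs, one hidden layer of five units, one output.
print(dense_network_parameters([10, 5, 1]))
```

Even a small hidden layer multiplies the number of adjustable parameters several times over for the same ten descriptors.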

So we hope we've now got your attention. We're predicting solubility and have two models at our disposal, both of which have satisfied validation criteria. "What are these validation criteria?", we hear you cry. Fair point! The models satisfactorily predicted the solubilities of compounds that were not used to train them. We'll discuss validation in a future post because to do so here would get us bogged down in the data-analytic equivalent of Passchendaele. Anyway back to the models. We'll use root mean square error (RMSE) as our measure of model quality and there are problems with this. However, it's another point that we're trying to address and RMSE will work well for that. One model predicts log(S/M) with RMSE = 0.3 and uses 100 parameters and the other uses 3 parameters and predicts log(S/M) with RMSE of 0.4. The observant amongst you will have noticed that we're actually using the logarithm of the molar solubility rather than solubility itself and there are really good reasons for doing this which we'll not go into right now. Anyway, which model are you going to use to make your predictions?

Being regular readers of The Crapshoot has of course made you cynical. You've seen some of the underhand tricks that folk can use to persuade you that the trends that they have uncovered are stronger than they actually are (see examples 1 and 2 to get an idea of what we mean). The models have both been validated but you can't see the details so you're quite right to be suspicious. You're also thinking that the model with RMSE of 0.4 using 3 parameters is less likely to be overfit than the model with RMSE of 0.3 with 100 parameters. Also you might expect the 3-parameter model to work better for compounds that are not chemically similar to the compounds used to train it. However, in the predictive modelling world validation is assumed to be valid and we can only ask who guards the guardians. Once models have been validated, the numbers of descriptors used become irrelevant.

Let's go back to the article on solubility prediction that we mentioned in the previous post. The cross-validation results for partial least squares (PLS), artificial neural net (ANN), support vector machine (SVM) and random forest (RF) models are given in Table 2. The cross-validated RMSE is lowest for RF as is the RMSE for the external test set. Random Forest is the best model! Long live Random Forest! It was validated so who are you, the uncouth authors of a blog that nobody reads, to question this finding?

It is true that the readership of The Crapshoot could comfortably assemble in the ensuite portion of a budget London hotel room. However, we really do object to being called uncouth and so we're going home (and taking our ball with us). Our parting shot is that we've not quite used up all the ammo from that nice paper on overfitting...

next

Sunday, December 14, 2008

Only Connect

As a service to our loyal readers, we have installed some forward links so that some of the themes that we have explored can be re-visited in the sequences in which they appeared. Here are our first posts on Rule of 5, hydrogen bonding and predictive modelling so you can see how it all works.

Friday, December 12, 2008

Where are the models?

So we left you hanging a bit in the previous post for which we apologise. William of Ockham was about to do battle with Random Forest, armed only with what would appear to be a singularly inadequate razor. We’ll have to apologise again because you’re going to have to wait a while longer for the final showdown. We realise that many of our patient and loyal readers may not have encountered the sorts of predictive models that William of Ockham is licensed to invalidate and as a public service we’ll take a quick look at 3 publications. Our objective in this post is not to review these models but merely to use them to show you why studies like these might be of interest to Mr Ockham.

In the first article, industrial researchers present methods for predicting hERG liability in compound libraries using their own data which was not made available to readers or, presumably, the reviewers of this paper. We extend special sympathy to the reviewers of this article because we just can’t tell whether the models described within are useful and highly predictive or of a value that is largely calorific. This is a general theme which we will re-visit in future posts.

In the second article, industrial researchers present methods for prediction of volume of distribution. Volumes of distribution and calculated properties, although not the structures, for the training set compounds were shared as supplemental material.

In the third article, academic researchers present methods for predicting aqueous solubility. Structures and measured solubilities for training and test sets were shared as supplemental material.

The authors of these articles share their data sets to varying degrees; however, none appear to be particularly forthcoming with the predictive models themselves. The second article presents 31 parameter values for a multi-linear regression model in the supplemental material but the random forest remains an almost complete mystery. Is it fair that a medicinal chemist needs to provide spectral data for new compounds while a predictive modeller can get away with root mean square error and r-square? Don’t ask us for we are simple folk and we just write The Crapshoot.

So if you think that reading an article on predictive modelling of clearance, volume, CYP inhibition, hERG blockade, solubility or plasma protein binding is going to provide you with a practical means to predict any of these quantities, you may wish to prepare yourselves for disappointment.

In the next post, we’ll be taking a look at the ubiquitous problem of over-fitting. William of Ockham is already sharpening his razor.

next