Saturday, April 19, 2008

Substituents, Potencies and Pinschers

So hopefully the suspense has built from our previous post on the effects of common chemical substituents on ligand potency. Some of our Loyal Readers will have been annoyed to have been dropped just as it was getting interesting and we can only offer our most abject and grovelling apologies. The suspense had to be built but, much more importantly, we had to go to the pub. They serve a particularly tasty cider there. It’s cloudy, quite strong and, on a bad night, really makes your tongue tingle. If you die from drinking it, it is improbable that your remains will require embalming.

We are sorry that so much time has passed since the previous post. Here’s the link to the target article. We will continue to focus on Table 1.

We have already discussed the categorical sins committed in slicing the tails off distributions to create the F(-1) and F(1) descriptors and you don’t need to be a Doberman Pinscher to appreciate the fundamental immorality of these actions. Nevertheless, slicing distributions is a commonly encountered data-analytic technique in drug discovery research although it tends to be less commonly encountered in statistical textbooks. One recurring concern we have with this data-analytic genre is that the slice points can be chosen to strengthen the conclusion that the investigators would like to draw. Our challenge to the distribution slicers is to demonstrate that the results of their analyses are relatively insensitive to how the slicing is done. Or perhaps consider methods to compare the distributions that adequately account for their continuous nature.
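For the distribution slicers who fancy taking up the challenge, here is a minimal sketch of what a sensitivity check might look like. Everything in it is an assumption on our part: the two samples are synthetic (drawn from made-up normal distributions, not from the featured article), the names `fraction_at_least` and `ks_statistic` are our own, and the sign convention for the potency change is arbitrary. It sweeps the slice point to show how the sliced fractions move, and also computes a two-sample Kolmogorov-Smirnov statistic as one example of a comparison that treats the distributions as continuous.

```python
import random
from bisect import bisect_right


def fraction_at_least(deltas, cutoff):
    """Fraction of potency changes (in log units) that are >= cutoff."""
    return sum(d >= cutoff for d in deltas) / len(deltas)


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the two empirical cumulative distribution functions."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(0)
# SYNTHETIC data for two hypothetical substituents A and B.
# The means and spreads are invented purely for illustration.
sub_a = [rng.gauss(0.0, 0.6) for _ in range(2000)]
sub_b = [rng.gauss(0.1, 0.7) for _ in range(2000)]

# Sweep the slice point and watch how the sliced fractions respond.
for cutoff in (0.5, 0.75, 1.0, 1.25, 1.5):
    fa = fraction_at_least(sub_a, cutoff)
    fb = fraction_at_least(sub_b, cutoff)
    print(f"cutoff {cutoff:4.2f}: A {fa:.3f}  B {fb:.3f}")

# A single number that uses the whole of both distributions.
print(f"KS statistic: {ks_statistic(sub_a, sub_b):.3f}")
```

If the conclusion drawn at a cutoff of one log unit evaporates at 0.75 or 1.25, that is worth knowing before the table goes to press.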

However it is not just the distribution slicing that disturbs our digestion. In order to make the next point we now ask that you ignore the previous paragraph and assume that the binary categorisation by distribution slicing is actually correct. ‘Why do you play these games with our minds?’ we hear you cry and we simply ask that you make the assumption regardless of how absurd it may seem to you (and us) right now. We ask because there are still a number of outstanding issues with the analysis presented in Table 1 of the featured article and it’s just easier to deal with these if you’re not distracted by whether the categorisation is indeed sinful.

In the interests of time we’ll skirt over the arbitrariness of the choice of methyl as the substituent with which distributions for other substituents are compared. We will also skirt over why one would want to compare the effects of substituents with any particular substituent given that these effects have already been defined with respect to hydrogen. If people choose to compare their substituent effects with methyl (or any of the other 52 substituents in Table 1) then it is really not for us to say. The Crapshoot is a liberal, pro-choice sort of column and we believe that Our Loyal Readers are sufficiently mature to take responsibility for those choices that they make.

More serious is the manner in which the casual reader might think that all the distributions flagged as significantly different from methyl are indeed significantly different from this substituent. A contingency table analysis provides a probability that the observed effect could have been observed by chance alone. The lower the probability, the greater the significance. This is the way of The Statistician. Take another look at Table 1 in the featured article. The entry for F in the eighth column (*) tells us that the fraction of chlorine substitutions (0.064) that lead to an at least 10-fold potency increase and the corresponding figure for methyl (0.053) are significantly different with an associated probability of less than 0.05. This means we are at least 95% sure that the distributions for methyl and chloro are different although it doesn’t mean that we necessarily care. Now let’s suppose we perform two contingency table analyses to compare the effects of substituents X and Y with methyl and get 95% in each case. Does that mean that we are 95% sure that X is significantly different from methyl and that Y is significantly different from methyl? Well, not exactly. If you want to consider both the substituents, you need to multiply the probabilities (95% x 95% ≈ 90%). If you consider more substituents the problem only gets worse. We hope you’re still with us and apologise profusely for letting things get so tediously technical. We are forced to admit that we’re still no closer to figuring out how, why or whether we should be using the results in Table 1 of the featured article. Please let us know if you are.
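For those who like to see the multiplication done, here is a minimal sketch of the arithmetic. It assumes the comparisons are independent, which is unlikely to be strictly true here because every comparison shares the same methyl distribution, so treat the numbers as a rough guide. The function names are our own, and the Bonferroni adjustment is shown only as one common remedy, not as anything the featured article did.

```python
def prob_all_hold(confidence, n_tests):
    """Chance that n independent comparisons, each made at the given
    confidence level, are all free of a false positive."""
    return confidence ** n_tests


def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests,
    each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level that keeps the familywise error rate
    at or below alpha (Bonferroni correction)."""
    return alpha / n_tests


# 52 is the number of substituents other than methyl in Table 1.
for n in (1, 2, 10, 52):
    print(
        f"{n:2d} comparisons: "
        f"all hold {prob_all_hold(0.95, n):.3f}, "
        f"familywise error {familywise_error(0.05, n):.3f}, "
        f"Bonferroni per-test alpha {bonferroni_alpha(0.05, n):.5f}"
    )
```

The first function is the 95% x 95% calculation from the paragraph above, the second is the same thing seen from the other side, and the third shows how much stricter each individual test would need to be to keep the overall error rate at 5%.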

Sorry that it all turned into a bit of a slog but we really must move on. You will recall that the data for the analysis has been aggregated across up to 30 assays. The nature of molecular recognition is that sometimes a substituent will increase potency, sometimes it will decrease it and sometimes it will have no effect at all. Medicinal chemists are most interested in the first case where the substituent increases potency. The second situation is still relevant if you can think of a suitable ‘anti-substituent’. For example if you find that putting methyl on an aromatic carbon costs a lot of potency you might try replacing that carbon with nitrogen in case there is a hydrogen bond donor in the binding site whose solvation has been compromised by having a methyl group thrust at it. Probably a bit of a long shot (we’re assuming no protein structure is available) but we think it’d still be a better bet than ethyl, butyl or futyl. However when you average over both chemotype and assay, you are unnecessarily adding noise to your signal. In this case, do you really expect to end up with anything other than the unremarkable and underwhelming Table 1?

There is yet another problem. The dynamic range of assays is limited. If you have a substituent that tends to have a dramatic effect on potency then it will be less likely that you’ll be able to measure potency for both parent and substituted analog. Let’s take a look at the F(-1) and F(1) values for carboxylate that are given in Table 1. The value of F(1) is 0.247 which tells us that a quarter of the time adding a carboxylate to an aromatic ring leads to at least a log unit drop in potency. This is not surprising given that a carboxylate is not a gift that you really want to offer to a protein unless you’re sure that it will be properly appreciated. The value of F(-1) is 0.056 which is very similar to the corresponding figure (0.053) for methyl. Now let’s assume that we have a situation in which the carboxylate is an essential part of the pharmacophore. The question you really need to ask yourselves is how confident are you that you can measure potencies for both the parent compound and the analog when your substituent is carboxylate. The next question is, knowing what you do about molecular recognition, would you be more or less optimistic about being able to measure the effect of methyl substitution on potency?
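A toy simulation makes the dynamic range point concrete. This is a sketch under loudly stated assumptions and not a model of any real assay: the measurable window (pIC50 of 4 to 9), the uniform spread of parent potencies and the normally distributed substituent effect are all invented for illustration, as is the function name `simulate`. A matched pair only counts if both parent and analog fall inside the window.

```python
import random


def simulate(n_pairs, effect_sd, window=(4.0, 9.0), seed=0):
    """Return (true, observed) fractions of pairs whose potency changes
    by at least one log unit in either direction.

    'true' counts every simulated pair; 'observed' counts only pairs
    whose analog still falls inside the HYPOTHETICAL assay window."""
    rng = random.Random(seed)
    lo, hi = window
    true_big = measured = measured_big = 0
    for _ in range(n_pairs):
        parent = rng.uniform(lo, hi)        # parent is always measurable
        delta = rng.gauss(0.0, effect_sd)   # substituent effect, log units
        big = abs(delta) >= 1.0
        true_big += big
        if lo <= parent + delta <= hi:      # analog measurable too
            measured += 1
            measured_big += big
    return true_big / n_pairs, measured_big / measured


# A 'quiet' substituent versus a 'dramatic' one (spreads are made up).
for label, sd in (("quiet", 0.5), ("dramatic", 1.5)):
    true_frac, observed_frac = simulate(50_000, sd)
    print(f"{label:9s} true {true_frac:.3f}  observed {observed_frac:.3f}")
```

Under these assumptions the bigger the effect, the more likely the analog is to fall off the end of the assay, so the large changes are the ones that go missing from the matched pairs.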

Now it’s time to get back to the tails. These are probably the most interesting regions within the distributions because they provide information about the best (and worst) we can expect to achieve by making a substitution. Unfortunately the authors decided to trim the distributions prior to data analysis by removing potency changes that exceeded 4 standard deviations. Think of all the poorly understood molecular recognition phenomena (topographically-focussed hydrophobic enclosure, electrostatically-enhanced conformational locking, hyperpolarised charge-octupole interactions, hyperconjugation-relayed field gradients) that might be lurking in those discarded tails. What if some of these discarded results could have been interpreted in terms of the structures of the target proteins?

So there you have it. The effects on potency of a number of common substituents and we never even got beyond Table 1. Essential information for drug discovery or philatelic use of the pages of a high impact journal? We are simple folk and we leave it to Our Loyal Readers to decide.

Thankfully, we have only rarely encountered Doberman Pinschers. Our limited experience of this unsavoury breed suggests that they typically throw the best bits away when tail and Pinscher are separated.

3 comments:

baoilleach said...

Let me get up on my hobbyhorse...The way of the statistician is not that one is "95% sure" of anything, but that there is only a 5% chance that the alternative hypothesis could have arisen by chance. :-)

GMC2007 said...

The 'way of the statistician' to which I referred was the way that significance is quantified by improbability. In dealing with contingency tables, my preference is indeed to state things in terms of the probability P of the observed chi-square statistic arising from chance alone.

The main point I was trying to make is that observing significant differences individually for a number of contingency tables does not mean that all of the differences are significant. This point is made more easily by looking at (1 - P) because you can multiply these. For two contingency tables, one should multiply the two values of (1 - P) and then subtract from 1 to get an idea of how likely it was that you would observe the statistics by chance alone. I think it's actually a bit more complicated (Bonferroni corrections) and this is an area I wouldn't claim to have at my fingertips. However, I'm not in the habit of performing multiple contingency tables.

All this said, your comment highlighted a lack of clarity on my part. Hopefully my response will make where I was heading a bit clearer.

Ashutosh said...

null hypothesis...null hypothesis...