# Scientist versus Complex Mathematical Model: Ion-channel story continues…

If you are a follower of these blog-posts then you may have seen a theme around complexity versus simplicity for a specific prediction problem relating to drug-induced ion-channel cardiac toxicity. See here for an article on the matter. In this blog-post we will discuss a short pilot project aimed at gaining an initial understanding of whether the scientists being sold complex models of the heart really need them.

The project itself was motivated by the observation that the input-output behaviour of this particular problem is linear and involves a small number of variables. However, modellers who enjoy complexity, and who are aware of the linear solution, continue to publish on complex models. A recent example, from researchers at Oxford University, involves the use of a population of models, see here. If you were to take that data and simply use the linear model we’ve described before, you would find that it produces just as good a result, see code and data here. Back to the main topic of this blog-post: if the mechanism is linear, do you need a model?

In order to answer this question, it would have been ideal to have had a prospective data-set on a set of new drugs and to compare the predictions of model versus scientist. However, this is not possible for one key reason: the community advocating the complex model is also responsible for classifying compounds into one of three categories, and without details of how this classification is done it’s not possible to apply it to any other compounds. Therefore, the only option was to assess whether there is a correlation between scientist and model.

The pilot study involved generating a set of drugs that covered the pharmacological space of interest. Scientists were asked to classify each drug into one of three categories, and a simple correlation assessment was then made between each scientist and the model. We found that all but one scientist correlated strongly with the complex model. This suggests the complex model is not adding information above and beyond what the scientist already knows. If you are interested in the details then take a look at the article here.

Hopefully the next time the cardiac modelling community perform a model evaluation exercise they will consider assessing the scientists’ predictions too. Wouldn’t it be interesting to know whether a model, regardless of its complexity, is really needed here, given the huge amount of public investment made so far?

# Ignoring Uncertainty in Collaborative Conceptual Models

This blog-post relates to a meeting entitled “Modelling Challenges in Cancer and Immunology: one-day meeting of LMS MiLS”. The meeting was held at the beautiful location of King’s College London. The talks were a mix of mathematical modelling and experimental work, with most combining the two disciplines. The purpose of the meeting was to foster new collaborations between the people at the meeting.

Collaboration is a key aspect of this intersection between cancer and immunology, since no one person can truly have a complete understanding of both fields, nor possess all the skill-sets needed. When collaborating, though, each of us trusts the experts in their respective fields to bring conceptual models to the table for discussion. It’s therefore very important to understand how these conceptual models have developed over time.

Every scientist has developed their knowledge via their own interpretation of the data/evidence over their career. However, the uncertainty in the data/evidence used to make statements such as “A interacts with B” is rarely mentioned.

For many scientists, null hypothesis testing has been the tool used to develop “knowledge” within a field. This “knowledge”, throughout a scientist’s career, has typically been gained by using a p-value threshold of 0.05, with very little consideration of the size of the effect or of what the test actually means.

For example, at the meeting there was a stream of statements, presented as facts, about correlations which were tenuous at best, simply because the p-value was below 0.05. An example is the figure below, where the data have a correlation coefficient of 0.22 (“p<0.05”). From this point onwards the scientist will say that A correlates with B, consigning the noise/variability to history.
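To see how a correlation as weak as 0.22 can still clear the 5 percent bar, here is a small simulation (entirely synthetic data; the numbers are illustrative only):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1000  # a respectable sample size

# Construct x and y with a population correlation of ~0.22: mostly noise
x = rng.normal(size=n)
y = 0.22 * x + np.sqrt(1 - 0.22 ** 2) * rng.normal(size=n)

r, p = stats.pearsonr(x, y)
print(round(r, 2))  # a weak correlation, roughly 0.2
print(p < 0.05)     # True: comfortably "significant" nonetheless
```

With r around 0.2, x explains only about 5% of the variance in y; “significant” here says nothing about the strength of the relationship.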

Could it be that the conceptual models we discuss are based on decades of analyses like the one described above? I would argue this is often the case, and it was certainly present at the meeting. One response is to form very large collaboration groups and look for consensus; however, being precisely biased is in no-one’s best interests!

Perhaps the better alternative is to teach uncertainty concepts at a far earlier stage in a scientist’s career; that is, to introduce Bayesian statistics (see the blog-post on the Bayesian Opinionator) rather than entraining scientists into null-hypothesis testing. This would generally improve the scientific process – and would probably reduce my blood pressure when attending meetings like this one.

# Radiomics meet Action-Potential-Omics and Recidivism-Omics

In previous blog-posts we have discussed how simple models can perform just as well as, if not better than, more complex ones when attempting to predict the cardiac liability of a new drug; see here, and for the latest article on the matter here. One of the examples we discussed involved taking a signal, the action potential, deriving hundreds of features from it and feeding them into a machine learning algorithm to predict drug toxicity. This approach gave very impressive performance. However, we found that we could get the same results by simply adding/subtracting three numbers! It seems there are other examples of this nature…
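As a sketch of what such a three-number model looks like in practice (the channel names, signs and threshold below are illustrative assumptions for exposition, not the published calibration):

```python
# Hypothetical sketch: a net-block score built from three ion-channel block
# fractions at a reference concentration. hERG block is taken to raise
# torsade risk, while sodium/calcium block offsets it.
def net_block(herg_block, na_block, ca_block):
    return herg_block - (na_block + ca_block)

def classify(herg_block, na_block, ca_block, threshold=0.2):
    # The threshold is illustrative; a real one would be calibrated on data
    score = net_block(herg_block, na_block, ca_block)
    return "toxic" if score > threshold else "non-toxic"

print(classify(0.6, 0.1, 0.1))  # strong hERG block, little offset -> toxic
print(classify(0.3, 0.2, 0.2))  # multichannel block cancels out -> non-toxic
```

Pre-school arithmetic, but as the posts above argue, it holds its own against far larger models.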

A recent paper sent to me was on the topic of recidivism, see here. The paper explored how well a machine learning algorithm using more than 100 features performed, compared to the general public, at predicting re-offending risk. What they found is that the general public was just as good. They also found that the performance of the machine learning algorithm could be easily matched by a two-variable model!

Let’s move back to the life-sciences and take a look at an emerging field called radiomics. This field is in its infancy compared to the two already discussed. In simple terms, radiomics involves extracting information from an image of a tumour. The obvious parameter to extract is the size of the tumour, measured by its volume, a parameter termed Gross Tumour Volume; see here for a more detailed description. In addition, as in the cardiac story, you can derive many more parameters from the imaging signal. Again as in the cardiac story, you can apply machine learning techniques to the large data-set created, to predict events of interest such as patient survival.

The obvious question to ask is: what do you gain over using the original parameter, Gross Tumour Volume? It appears the answer is very little; see supplementary table 1 from this article here for a first example. Within the table the authors calculate the concordance index for each model. (A concordance index of 0.5 is random chance, whereas a value of 1 implies perfect association; the closer to 1 the better.) The table includes p-values as well as the concordance index; let’s ignore the p-values and focus on the size of the effect. What the table shows is that tumour volume is as good as the radiomics model in two of the three data-sets, Lung2 and H&N1, and in the third, H&N2, TNM is as good as radiomics:

| Data-set | TNM (Tumour Staging) | Volume | Radiomics | TNM + Radiomics | Volume + Radiomics |
|----------|----------------------|--------|-----------|-----------------|--------------------|
| Lung2    | 0.60                 | 0.63   | 0.65      | 0.64            | 0.65               |
| H&N1     | 0.69                 | 0.68   | 0.69      | 0.70            | 0.69               |
| H&N2     | 0.66                 | 0.65   | 0.69      | 0.69            | 0.68               |
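For readers unfamiliar with the concordance index, a minimal version for fully observed survival times can be sketched as follows (real implementations, such as those used in these papers, must also handle censoring):

```python
from itertools import combinations

def concordance_index(times, scores):
    """Fraction of usable pairs where the higher risk score goes with the
    shorter survival time; tied scores count as half-concordant."""
    concordant = tied = usable = 0
    for (t1, s1), (t2, s2) in combinations(zip(times, scores), 2):
        if t1 == t2:
            continue  # pairs with tied times are not usable
        usable += 1
        if s1 == s2:
            tied += 1
        elif (s1 > s2) == (t1 < t2):
            concordant += 1
    return (concordant + 0.5 * tied) / usable

print(concordance_index([1, 2, 3, 4], [4, 3, 2, 1]))  # 1.0: perfect ranking
print(concordance_index([1, 2, 3, 4], [2, 2, 2, 2]))  # 0.5: no information
```

Seen this way, the table says the ranking of patients produced by tumour volume alone is essentially as informative as the radiomics signature.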

They then went on to combine the radiomics model with the other two, but did not compare all the options against a combination of TNM and tumour volume – a two-variable model. The question we should ask is: why didn’t they? And is there more evidence on this topic?

A more recent paper from within the field, see here, assessed the difference in prognostic capability between radiomics, genomics and the “clinical model”. This time tumour volume was not explored – why not, especially given that it looked so promising in the earlier study? The “clinical model” in this case consisted of two variables, TNM and histology; given that we collect so much more than this, is that really a fair representation of a “clinical model”? The key result was that radiomics only improved on the “clinical model” once a genomic model was also included, see Figure 5 in the paper. Even then the size of the improvement was very small. I wonder what the performance of a simple model involving TNM and tumour volume would have looked like, don’t you?

Radiomics, meet Recidivism-Omics and Action-Potential-Omics: you have more in common with them than you realise, i.e. simplicity may beat complexity yet again!

# Is this the golden age for open patient level oncology data?

Over the last few years there has been a growth in databases that house individual patient data from clinical trials in Oncology.  In this blog post we will take a look at two of these databases, ProjectDataSphere and ClinicalStudyDataRequest, and discuss our own experiences of using them for research.

ProjectDataSphere houses the control arms of many phase III clinical trials. It has been used to run prediction competitions which we have discussed in a previous blog-post, see here. Gaining access to this database is rather straightforward.  A user simply fills in a form and within 24-48 hours access is granted. You can then download the data sets together with a data dictionary to help you decipher the variable codes and start your research project.  This all sounds too easy, so what’s the catch?

The main issue is understanding the coding of the variables; once you’ve deciphered what they mean, it’s pretty straightforward to begin a project. It does help if you have experience of working with such data-sets. An example of a project that can be conducted with such data can be found here. In brief, the paper explores both a biased and an un-biased correlation of tumour growth rate with survival, a topic we have blogged about before, see here.

If you want to access all arms of a clinical trial then ClinicalStudyDataRequest is for you. This is a very large database that spans many disease areas. However, access to the data is not as straightforward as with ProjectDataSphere. A user must submit an analysis plan stating the research objectives, methods, relevant data-sets etc. Once the plan has been approved, which in our experience can take one to two months, access is granted to the data-sets. This access, though, is far more restrictive than ProjectDataSphere’s: the user connects to a server where the data are stored and has to perform all analysis using the software provided, which is R with a limited number of libraries. Furthermore, there is a time restriction on how long you can have access to the data. Therefore it really is a good idea to have every aspect of your analysis planned, and to ensure you have the time to complete it.

An example of a project we have undertaken using this database can be found here. In brief the paper describes how a model of tumour growth can be used to analyse the decay and growth rates of tumours under the action of three drugs that have a similar mechanism of action. A blog-post discussing the motivation behind the tumour growth model can be found here.

There are of course many databases other than the two discussed here with an oncology focus, e.g. SEER, TCGA, YODA etc. The growth in such databases clearly suggests that this may well be the golden age for patient level oncology data. Hopefully this growth in open data will also lead to a growth in knowledge.

# Oxford Quantitative Systems Pharmacology (QSP): Is there a case for model reduction?

The title of this blog-post refers to a meeting I attended recently, sponsored by the UK QSP network. Some of you may not be familiar with the term QSP: put simply, it describes the application of mathematical and computational tools to pharmacology. As the title of the blog-post suggests, the meeting was on the subject of model reduction.

The meeting was split into 4 sessions entitled:

1. How to test if your model is too big?
2. Model reduction techniques
3. What are the benefits of building a non-identifiable model?
4. Techniques for over-parameterised models

I was asked to present a talk in the first session, see here for the slide-set. The talk was based on a topic that has been mentioned on this blog a few times before: ion-channel cardiac toxicity prediction. The latest talk goes through how three models were assessed on their ability to discriminate between non-cardio-toxic and cardio-toxic drugs across the three data-sets currently available. (A report providing more details is being put together and will be released shortly.) The three models used were a linear combination of block (the simple model – blog-post here) and two large-scale biophysical models of a cardiac cell, one termed the “gold standard” (endorsed by the FDA and other regulatory agencies/CiPA [1]) and the other forming a key component of the “cardiac safety simulator” [2].

The results showed that the simple model does just as well as, and on certain data-sets outperforms, the leading biophysical models of the heart, see slide 25. Towards the end of the talk I discussed what drivers exist for producing such large models, and whether we should invest further in them given the current evidence, see slides 24, 28 and 33. How does this example fit into the sessions of the meeting?

In answer to the first session’s question – how to test if your model is too big? – the answer is straightforward: if the simpler/smaller model outperforms the larger model, then the larger model is too big. Regarding the second session, on model reduction techniques: in this case there is no need for them. You could argue, from the results discussed here, that instead of pursuing model reduction techniques we may want to consider building smaller/simpler models to begin with. On to the third session, on the benefits of building a non-identifiable model: it’s clear that there was no benefit, in this situation, in developing a non-identifiable model. Finally, regarding techniques for over-parameterised models: the lesson from the cardiac toxicity field is simply not to build these sorts of models for this question.

Some people at the meeting argued that the type of model depends on the question. This is true, but does the scale of the model depend on the question?

If we now go back to the title of the meeting, is there a case for model reduction? Within the field of ion-channel cardiac toxicity the response would be: Why bother reducing a large model when you can build a smaller and simpler model which shows equal/better performance?

Of course, as Green and Armstrong [3] (sceptically) point out (see slide 28), one reason for model reduction is that it gives researchers the best of both worlds: they can build a highly complex model, suitable for publication in a highly-cited journal, and then spend more time extracting a simpler version to support client plans. Maybe those journals should look more closely at their selection criteria.

References

[1] Colatsky et al. The Comprehensive in Vitro Proarrhythmia Assay (CiPA) initiative — Update on progress.  Journal of Pharmacological and Toxicological Methods. (2016) Volume 81 pages 15-20.

[2] Glinka A. and Polak S. QTc modification after risperidone administration – insight into the mechanism of action with use of the modeling and simulation at the population level approach.  Toxicol. Mech. Methods (2015) 25 (4) pages 279-286.

[3] Green K. C. and Armstrong J. S. Simple versus complex forecasting: The evidence. Journal of Business Research. (2015) 68 (8) pages 1678-1685.

# Hierarchical oncology combination therapy: which level leads to success?

The traditional preclinical combination experiment in Oncology for two drugs, A and B, is as follows: a cancer cell-line is exposed to increasing concentrations of drug A alone, drug B alone, and various concentrations of the combination, for a fixed amount of time. That is, we determine what effect drugs A and B have as monotherapies, which subsequently helps us to understand the combination effect. There are many articles describing how mathematical/computational models can be used to analyse such data, and possibly to predict the combination effect using information on the monotherapy agents alone. Those models can be based on mechanism, at the pathway or phenotype level (see CellCycler for an example of the latter), or they can be machine learning approaches. We shall call combinations at this scale cellular, as they are mainly focussed on analysing combination effects at that scale. What other scales are there?

We know that human cancers contain more than one type of cell population, so the next scale up from the cellular level is the tissue level. At this level we may have populations of cells with distinct genetic backgrounds, either within one tumour or across multiple tumours within one patient. Here we may find, for example, that drug A kills cell-type X and drug B doesn’t, but drug B kills cell-type Y and drug A doesn’t. So the combination can be viewed as a cell-population enrichment strategy: it is still effective even though the two drugs do not interact in any way.

Traditional drug combination screens, as described above, are not designed to explore these types of combinations. There is another scale which is probably even less well known: the human population scale…

A typical human clinical combination trial in Oncology can involve combining a new drug B with an existing treatment A and comparing that to A alone. A third arm looking at drug B alone is unlikely to be included, because if an existing treatment is known to have an effect then it’s unethical to withhold it. But unless one knows what effect the new drug B has on its own, it is difficult to assess what the effect of the combination is. Indeed, the combination may simply enrich the patient population. That is, if drug A shrinks tumours in patient population X and drug B doesn’t, but drug B shrinks tumours in patient population Y and drug A doesn’t, then if the trial contains both X and Y there is still a combination effect which is greater than drug A alone.
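The enrichment argument is just arithmetic. A minimal sketch under purely independent action (the response rates below are hypothetical, and the two drugs do not interact at all):

```python
# Trial population: half subtype X (responds only to A), half subtype Y
# (responds only to B). All rates are made up for illustration.
p_x, p_y = 0.5, 0.5
resp_a = {"X": 0.6, "Y": 0.0}
resp_b = {"X": 0.0, "Y": 0.6}

def arm_rate(resp):
    # Expected response rate of a monotherapy arm over the mixed population
    return p_x * resp["X"] + p_y * resp["Y"]

def combo_rate():
    # Independent action: a patient responds if either drug works alone
    total = 0.0
    for subtype, frac in (("X", p_x), ("Y", p_y)):
        pa, pb = resp_a[subtype], resp_b[subtype]
        total += frac * (1 - (1 - pa) * (1 - pb))
    return total

print(round(arm_rate(resp_a), 3))  # 0.3: drug A alone
print(round(combo_rate(), 3))      # 0.6: combination "wins" with zero interaction
```

The combination arm doubles the response rate of A alone, yet no drug helps any patient the other could not have helped as a monotherapy.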

Many people reading this blog are probably aware that when we see positive combination effects in the clinic, this could be due to this type of patient enrichment. At a meeting in Boston in April of this year, a presentation from Adam Palmer suggested that two-thirds of marketed combinations in Oncology can be explained in this way, see the second half (slide 27 onwards) of the presentation here. This includes current immunotherapy combinations.

We can now see why combinations in Oncology can be viewed as hierarchical. How aware the research community is of this is unknown. Indeed, one of the latest challenges from CRUK (Cancer Research UK), see here, suggests that even they may not be fully aware of it: the challenge focusses merely on the well-trodden path of the first level described here. Which level is the best to target? Is it easier to target the tissue and human-population levels than the cellular one? Only time will tell.

# Misapplication of statistical tests to simulated data: Mathematical Oncologists join Cardiac Modellers

In a previous blog-post we highlighted the pitfalls of applying null hypothesis testing to simulated data, see here. We showed that modellers applying null hypothesis tests to simulated data can control the p-value because they control the sample size. It is therefore not a good idea to analyse simulations using null hypothesis tests; instead, modellers should focus on the size of the effect. This problem has been highlighted before by White et al., which is well worth a read, see here.

Why are we blogging about this subject again? Since that last post, co-authors of the original article we discussed there have repeated the same misdemeanour (Liberos et al., 2016), and a group of mathematical oncologists based at Moffitt Cancer Center has joined them (Kim et al., 2016).

The article by Kim et al., preprint available here, describes a combined experimental and modelling approach that “predicts” new dosing schedules for combination therapies that can delay onset of resistance and thus increase patient survival.  They also show how their approach can be used to identify key stratification factors that can determine which patients are likely to do better than others. All of the results in the paper are based on applying statistical tests to simulated data.

The first part of the approach taken by Kim et al. involves calibrating a mathematical model to certain in-vitro experiments. These experiments measure the number of cells over a fixed observation time under four conditions: control (no drug), AKT inhibitor, chemotherapy, and the combination (AKT inhibitor plus chemotherapy). This was done for two cell lines. The authors found a range of parameter values when fitting their model to the data. From this range they took forward one particular set, with no real justification for that choice, to test the model’s ability to predict different in-vitro dosing schedules. Unsurprisingly, the model predictions came true.

After “validating” their model against a set of in-vitro experiments, the authors proceed to use the model to analyse retrospective clinical data from a study involving 24 patients. The authors acknowledge that the in-vitro system is clearly not the same as a human system, so to account for this difference they perform an optimisation to generate a humanised model. The optimisation is based on a genetic algorithm which searched the parameter space for parameter sets that replicate the clinical results observed. Again, as in the in-vitro situation, they found multiple parameter sets able to replicate the observed clinical results – 3391 in total.

Having now generated a distribution of parameters describing the patients within the clinical study, the authors next set about generating stratification factors. For each parameter set, the virtual patient exhibits one of four possible response categories, so for each category a distribution of parameter values exists across the population. To assess the difference in parameter distributions across the categories, they perform a Student’s t-test to ascertain whether the differences are statistically significant. Since they can control the sample size, the authors can control the standard error and the p-value; this is exactly the issue raised by White et al. An alternative approach would be to state the size of the effect, i.e. the difference in the means of the distributions. If the claim is that a given parameter can discriminate between two types of response, then a ROC AUC (Receiver Operating Characteristic Area Under Curve) value could be reported; this would allow readers to ascertain the strength of a given parameter in discriminating between two response types.
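The sample-size issue is easy to demonstrate. In the sketch below (synthetic numbers, nothing to do with the paper’s actual parameter values), a fixed, negligible difference of 0.05 standard deviations between two simulated parameter distributions is held constant while the number of virtual patients grows; only the p-value changes:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_shift = 0.05  # a tiny, fixed difference between the two groups

p_values = {}
for n in (200, 100_000):
    group_a = rng.normal(0.0, 1.0, size=n)
    group_b = rng.normal(true_shift, 1.0, size=n)
    t_stat, p = stats.ttest_ind(group_a, group_b)
    p_values[n] = p
    print(n, p)  # same effect size, very different p-values
```

Since the modeller chooses n, “significance” is a free parameter here; reporting the effect size (0.05 standard deviations, i.e. negligible) is the honest summary.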

The application of hypothesis testing to simulated data continues throughout the rest of the paper, culminating in a log-rank test applied to simulated survival data, where again they control the sample size. Furthermore, the authors choose an arbitrary cancer cell number to dictate when a patient dies, so they have two ways of controlling the p-value. In this final act the authors again abuse null hypothesis testing, to show that the schedule found by their modelling approach is better than the one used in the actual clinical study. Since the major results in the paper all involve this type of manipulation, we believe they should be treated with extreme caution until better verified.

References

Liberos, A., Bueno-Orovio, A., Rodrigo, M., Ravens, U., Hernandez-Romero, I., Fernandez-Aviles, F., Guillem, M.S., Rodriguez, B., Climent, A.M., 2016. Balance between sodium and calcium currents underlying chronic atrial fibrillation termination: An in silico intersubject variability study. Heart Rhythm 0. doi:10.1016/j.hrthm.2016.08.028

White, J.W., Rassweiler, A., Samhouri, J.F., Stier, A.C., White, C., 2014. Ecologists should not use statistical significance tests to interpret simulation model results. Oikos 123, 385–388. doi:10.1111/j.1600-0706.2013.01073.x

Kim, E., Rebecca, V.W., Smalley, K.S.M., Anderson, A.R.A., 2016. Phase I trials in melanoma: A framework to translate preclinical findings to the clinic. Eur. J. Cancer 67, 213–222. doi:10.1016/j.ejca.2016.07.024

# The changing skyline

Back in the early 2000s, I worked a couple of years as a senior scientist at the Institute for Systems Biology in Seattle. So it was nice to revisit the area for the recent Seventh American Conference on Pharmacometrics (ACoP7).

A lot has changed in Seattle in the last 15 years. The area around South Lake Union, near where I lived, has been turned into a major hub for biotechnology and the life sciences. Amazon is constructing a new campus featuring giant ‘biospheres’ which look like nothing I have ever seen.

Attending the conference, though, was like a blast from the past – because unlike the models used by architects to design their space-age buildings, the models used in pharmacology have barely moved on.

While there were many interesting and informative presentations and posters, most of these involved relatively simple models based on ordinary differential equations, very similar to the ones we were developing at the ISB years ago. The emphasis at the conference was on using models to graphically present relationships, such as the interaction between drugs when used in combination, and compute optimal doses. There was very little about more modern techniques such as machine learning or data analysis.

There was also little interest in producing models that are truly predictive. Many models were said to be predictive, but this just meant that they could reproduce some kind of known behaviour once the parameters were tweaked. A session on model complexity did not discuss the fact, for example, that complex models are often less predictive than simple models (a recurrent theme in this blog, see for example Complexity v Simplicity, the winner is?). Problems such as overfitting were also not discussed. The focus seemed to be on models that are descriptive of a system, rather than on forecasting techniques.

The reason for this appears to come down to institutional effects. For example, models that look familiar are more acceptable. Also, not everyone has the skills or incentives to question claims of predictability or accuracy, and there is a general acceptance that complex models are the way forward. This was shown by a presentation from an FDA regulator, which concentrated on models being seen as gold-standard rather than accurate (see our post on model misuse in cardiac models).

Pharmacometrics is clearly a very conservative area. However, this conservatism means only that change is delayed, not that it won’t happen; and when it does happen it will probably be quick. The area of personalized medicine, for example, will only work if models can actually make reliable predictions.

As with Seattle, the skyline may change dramatically in a very short time.

# Complexity v Simplicity, the winner is?

I recently published a letter with the above title in the journal Clinical Pharmacology and Therapeutics; unfortunately it’s behind a paywall, so I will briefly take you through the key point raised. The letter describes a specific prediction problem around drug-induced cardiac toxicity mentioned in a previous blog entry (Mathematical models for ion-channel cardiac toxicity: David v Goliath). In short, what we show in the letter is that a simple model using subtraction and addition (pre-school mathematics) performs just as well on a given prediction problem as a multi-model approach using three large-scale models, consisting of hundreds of differential equations, combined with machine learning (university-level mathematics and computation)! The addition/subtraction model gave a ROC AUC of 0.97, very similar to the multi-model/machine-learning approach, which gave a ROC AUC of 0.96. More detail on the analysis can be found on slides 17 and 18 of this presentation, A simple model for ion-channel related cardiac toxicity, which was given at an NC3Rs meeting.
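As a reminder of what those numbers mean, a ROC AUC is just the probability that a randomly chosen positive (here, toxic) drug scores higher than a randomly chosen negative one. A minimal rank-based computation, on made-up scores, looks like this:

```python
def roc_auc(scores, labels):
    """P(random positive outranks random negative); tied scores count 0.5."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores from a simple additive model (1 = toxic, 0 = non-toxic)
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(round(roc_auc(scores, labels), 3))  # 0.889: 8 of 9 pairs ranked correctly
```

A ROC AUC of 0.97, as reported for the addition/subtraction model, means the simple score ranks a toxic drug above a non-toxic one 97% of the time.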

The result described in the letter and presentation adds further weight to the evidence within the field that simple models perform just as well as complex approaches for this prediction task.

# When is a result significant?

The standard way of answering this question is to ask whether the effect could reasonably have happened by chance (the null hypothesis). If not, then the result is announced to be ‘significant’. The usual threshold for significance is that there is only a 5 percent chance of the results happening due to purely random effects.

This sounds sensible, and has the advantage of being easy to compute – which is perhaps why statistical significance has been adopted as the default test in most fields of science. However, there is something a little confusing about the approach: it asks how unlikely the data would be if the opposite of the theory – the null hypothesis – were true. But what we want to know is whether the theory is true, given the data. And that isn’t the same thing.

As just one example, suppose we have lots of data and after extensive testing of various theories we discover one that passes the 5 percent significance test. Is it really 95 percent likely to be true? Not necessarily – because if we are trying out lots of ideas, then it is likely that we will find one that matches purely by chance.
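This is easy to reproduce: correlate one random outcome against 200 equally random “theories” and something will almost always clear the 5 percent bar (pure noise throughout, seed fixed):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
outcome = rng.normal(size=50)  # the quantity our "theories" try to explain

# 200 candidate predictors with no real connection to the outcome
p_values = [stats.pearsonr(rng.normal(size=50), outcome)[1]
            for _ in range(200)]

print(min(p_values))                    # almost surely below 0.05: a spurious "discovery"
print(sum(p < 0.05 for p in p_values))  # typically around 10 "significant" by chance
```

Publish only the winner and the record of the 199 failed attempts disappears.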

While there are ways of working around this within the framework of standard statistics, the problem usually gets glossed over in the vast majority of textbooks and articles. So for example it is typical to say that a result is ‘significant’ without any discussion of whether it is plausible in a more general sense (see our post on model misuse in cardiac modeling).

The effect is magnified by publication bias – try out multiple theories, find one that works, and publish. Which might explain why, according to a number of studies (see for example here and here), much scientific work proves impossible to replicate – a situation which scientist Robert Matthews calls a ‘scandal of stunning proportions’ (see his book Chancing It: The laws of chance – and what they mean for you).

The way of Bayes

An alternative approach is provided by Bayesian statistics. Instead of starting with the assumption that data is random and making weird significance tests on null hypotheses, it just tries to estimate the probability that a model is right (i.e. the thing we want to know) given the complete context. But it is harder to calculate for two reasons.

One is that, because it treats new data as updating our confidence in a theory, it requires some prior estimate of that confidence, which may be hard to quantify – though the problem goes away as more data become available. (To see how the prior can affect the results, see the Bayesian Opinionator web app.) Another is that the approach does not treat the theory as fixed, which means we may have to evaluate probabilities over whole families of theories, or at least a range of parameter values. However, this is less of an issue today, since the computations can be performed automatically using fast computers and specialised software.
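As a minimal illustration of the prior washing out, here is a Beta-Binomial update (conjugate Bayesian inference for a success probability; the numbers are illustrative):

```python
# Beta(a, b) prior over a success probability; binomial data (s successes,
# f failures) gives a Beta(a + s, b + f) posterior.
def update(a, b, successes, failures):
    return a + successes, b + failures

def posterior_mean(a, b):
    return a / (a + b)

optimist = (8, 2)  # prior mean 0.8
sceptic = (2, 8)   # prior mean 0.2

# Both observe the same large data set: 600 successes in 1000 trials
post_o = update(*optimist, 600, 400)
post_s = update(*sceptic, 600, 400)

print(round(posterior_mean(*post_o), 3))  # 0.602
print(round(posterior_mean(*post_s), 3))  # 0.596
```

With only ten trials the two posteriors would still disagree badly; the prior matters precisely when data are scarce, which is exactly when it should.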

Perhaps the biggest impediment, though, is that when results are passed through the Bayesian filter, they often just don’t seem all that significant. But while that may be bad for publications, and media stories, it is surely good for science.