# Ignoring Uncertainty in Collaborative Conceptual Models

This blog post relates to a meeting entitled “Modelling Challenges in Cancer and Immunology: one-day meeting of LMS MiLS”. The meeting was held at the beautiful location of King’s College London. The talks were a mix of mathematical modelling and experimental work, with most combining the two disciplines. The purpose of the meeting was to foster new collaborations among the attendees.

Collaboration is key at this intersection between cancer and immunology, since no one person can have a complete understanding of both fields, nor possess all the skill sets needed. When collaborating, though, each of us trusts the experts in their respective fields to bring conceptual models to the table for discussion. It’s therefore very important to understand how these conceptual models have developed over time.

Every scientist has developed their knowledge via their own interpretation of the data/evidence over their careers. However, the uncertainty in the data/evidence used to make statements such as A interacts with B is rarely mentioned.

For many scientists, null hypothesis testing has been used to help them develop “knowledge” within a field. Throughout a scientist’s career, this “knowledge” has typically been gained by using a p-value threshold of 0.05, with very little consideration of the size of the effect or of what the test actually means.

For example, at the meeting there was a stream of statements on correlations, made to sound like facts, that were tenuous at best and were asserted simply because the p-value was below 0.05. An example is the figure below, where the data have a correlation coefficient of 0.22 (“p<0.05”). From this point onwards the scientist will say “A correlates with B”, consigning the noise/variability to history.
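To put a number on how weak such a correlation is, here is a minimal Python sketch. The sample sizes are hypothetical (the sample size behind the figure is not stated here), and the p-value uses the Fisher z-transformation with a normal approximation, which is an assumption on my part and not necessarily the test applied to the original data.

```python
import math
from statistics import NormalDist


def variance_explained(r):
    """Fraction of the variance in B accounted for by a linear fit on A."""
    return r * r


def approx_p_value(r, n):
    """Two-sided p-value for 'no correlation' via the Fisher z approximation."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def smallest_significant_n(r, alpha=0.05):
    """Smallest sample size at which a correlation r gives p < alpha."""
    n = 4
    while approx_p_value(r, n) >= alpha:
        n += 1
    return n


r = 0.22  # correlation coefficient quoted in the text
print(f"variance explained: {variance_explained(r):.1%}")
for n in (30, 100, 1000):  # hypothetical sample sizes, not from the meeting
    print(f"n={n:5d}  p={approx_p_value(r, n):.4f}")
print("smallest n with p < 0.05:", smallest_significant_n(r))
```

A correlation of 0.22 leaves more than 95% of the variance unexplained, yet under this approximation it clears the 0.05 threshold once the sample reaches roughly 80 points. The p-value says the correlation is probably not zero; it says nothing about whether it is large enough to build a conceptual model on.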

Could it be that the conceptual models we discuss are based on decades of analyses like the one described above? I would argue this is often the case, and it was certainly present at the meeting. This may argue for having very large collaboration groups and looking for consensus; however, being precisely biased is in no one’s best interests!

Perhaps the better alternative is to teach uncertainty concepts at a far earlier stage in a scientist’s career, that is, to introduce Bayesian statistics (see blog-post on Bayesian Opinionater) earlier rather than entraining scientists into null hypothesis testing. This would generally improve the scientific process, and will probably reduce my blood pressure when attending meetings like this one.

# Misapplication of statistical tests to simulated data: Mathematical Oncologists join Cardiac Modellers

In a previous blog post we highlighted the pitfalls of applying null hypothesis testing to simulated data, see here. We showed that modellers applying null hypothesis testing to simulated data can control the p-value because they can control the sample size. Thus it’s not a great idea to analyse simulations using null hypothesis tests; instead, modellers should focus on the size of the effect. This problem has been highlighted before by White et al. (2014), whose article is well worth a read, see here.
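The point about sample size can be illustrated with a small Python sketch. Everything in it is an assumption made for illustration: two simulated groups drawn from unit-variance normal distributions whose means differ by a fixed, tiny amount (`true_diff`), compared with a large-sample z-test instead of any particular test from the papers discussed.

```python
import math
import random
from statistics import NormalDist


def sample_variance(values, mean):
    """Unbiased sample variance of values around a precomputed mean."""
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def simulate_comparison(n_per_group, true_diff=0.05, seed=0):
    """Simulate two groups and return (estimated difference, two-sided p)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    b = [rng.gauss(true_diff, 1.0) for _ in range(n_per_group)]
    mean_a = sum(a) / n_per_group
    mean_b = sum(b) / n_per_group
    diff = mean_b - mean_a
    # Standard error of the difference in means (normal approximation).
    se = math.sqrt(
        sample_variance(a, mean_a) / n_per_group
        + sample_variance(b, mean_b) / n_per_group
    )
    p = 2.0 * (1.0 - NormalDist().cdf(abs(diff / se)))
    return diff, p


# The modeller chooses how many simulations to run, so they choose n.
for n in (20, 200, 2000, 20000, 200000):
    diff, p = simulate_comparison(n)
    print(f"n={n:7d}  estimated difference={diff:+.3f}  p={p:.3g}")
```

The true difference never changes, but with enough simulated samples the p-value can be pushed below any threshold. The estimated difference, by contrast, simply settles towards the same small value, which is why the effect size is the quantity worth reporting.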

Why are we blogging about this subject again? Since that last post, co-authors of the original article we discussed there have repeated the same misdemeanour (Liberos et al., 2016), and a group of mathematical oncologists based at Moffitt Cancer Center has joined them (Kim et al., 2016).

The article by Kim et al., preprint available here, describes a combined experimental and modelling approach that “predicts” new dosing schedules for combination therapies that can delay onset of resistance and thus increase patient survival.  They also show how their approach can be used to identify key stratification factors that can determine which patients are likely to do better than others. All of the results in the paper are based on applying statistical tests to simulated data.

The first part of the approach taken by Kim et al. involves calibrating a mathematical model to certain in-vitro experiments. These experiments basically measure the number of cells over a fixed observation time under 4 different conditions: control (no drug), AKT inhibitor, chemotherapy and combination (AKT/chemotherapy). This was done for two different cell lines. The authors found a range of parameter values when trying to fit their model to the data. From this range they took forward one particular set, with no real justification for that choice, to test the model’s ability to predict different in-vitro dosing schedules. Unsurprisingly, the model predictions came true.

After “validating” their model against a set of in-vitro experiments, the authors proceed to use the model to analyse retrospective clinical data from a study involving 24 patients. The authors acknowledge that the in-vitro system is clearly not the same as a human system. To account for this difference, they apply an optimisation method to generate a humanised model. The optimisation is based on a genetic algorithm which searches the parameter space to find parameter sets that replicate the observed clinical results. Again, as in the in-vitro situation, they found that multiple parameter sets were able to replicate the observed clinical results. In fact, they found a total of 3391 parameter sets.

Having now generated a distribution of parameters that describe patients within the clinical study they are interested in, the authors next set about generating stratification factors. For each parameter set the virtual patient exhibits one of four possible response categories. Therefore, for each category, a distribution of parameter values exists for the entire population. To assess the difference in the distribution of parameter values across the categories, they perform a Student’s t-test to ascertain whether the differences are statistically significant. Since they can control the sample size, the authors can control the standard error and hence the p-value; this is exactly the issue raised by White et al. An alternative approach would be to state the size of the effect, that is, the difference in means of the distributions. If the claim is that a given parameter can discriminate between two types of response, then a ROC AUC (Receiver Operating Characteristic Area Under Curve) value could be reported. Indeed, a ROC AUC value would allow readers to ascertain the strength of a given parameter in discriminating between two response types.
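As a rough illustration of the alternative reporting suggested above, the following Python sketch computes a difference in means, a standardised effect size (Cohen’s d, which is my addition and not something the text prescribes) and a ROC AUC for one parameter across two response categories. The parameter values are randomly generated stand-ins, not data from Kim et al., and the AUC is computed with the pairwise (Mann-Whitney) definition to avoid any library dependency.

```python
import math
import random
from statistics import fmean, variance


def mean_difference(group_a, group_b):
    """Difference in means, group_b minus group_a."""
    return fmean(group_b) - fmean(group_a)


def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return mean_difference(group_a, group_b) / math.sqrt(pooled_var)


def roc_auc(group_a, group_b):
    """Probability that a random group_b value exceeds a random group_a value.

    Ties count as half. 0.5 means no discrimination, 1.0 means perfect.
    """
    wins = 0.0
    for b in group_b:
        for a in group_a:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


# Hypothetical parameter values for two response categories.
rng = random.Random(1)
for n in (50, 500):
    category_1 = [rng.gauss(1.00, 0.2) for _ in range(n)]
    category_2 = [rng.gauss(1.05, 0.2) for _ in range(n)]
    print(
        f"n={n:4d}  mean diff={mean_difference(category_1, category_2):+.3f}"
        f"  d={cohens_d(category_1, category_2):+.2f}"
        f"  AUC={roc_auc(category_1, category_2):.2f}"
    )
```

Unlike a p-value, none of these quantities can be driven towards an arbitrary target by generating more virtual patients: increasing n only makes the estimates more precise. An AUC close to 0.5 tells the reader directly that the parameter is a poor stratification factor, however small the accompanying p-value might be.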

The application of hypothesis testing to simulated data continues throughout the rest of the paper, culminating in a log-rank test applied to simulated survival data, where again they control the sample size. Furthermore, the authors choose an arbitrary cancer cell number which dictates when a patient dies. They therefore have two ways of controlling the p-value. In this final act the authors again misuse null hypothesis testing to show that the schedule found by their modelling approach is better than that used in the actual clinical study. Since the major results in the paper have all involved this type of manipulation, we believe they should be treated with extreme caution until better verified.

References

Liberos, A., Bueno-Orovio, A., Rodrigo, M., Ravens, U., Hernandez-Romero, I., Fernandez-Aviles, F., Guillem, M.S., Rodriguez, B., Climent, A.M., 2016. Balance between sodium and calcium currents underlying chronic atrial fibrillation termination: An in silico intersubject variability study. Heart Rhythm 0. doi:10.1016/j.hrthm.2016.08.028

White, J.W., Rassweiler, A., Samhouri, J.F., Stier, A.C., White, C., 2014. Ecologists should not use statistical significance tests to interpret simulation model results. Oikos 123, 385–388. doi:10.1111/j.1600-0706.2013.01073.x

Kim, E., Rebecca, V.W., Smalley, K.S.M., Anderson, A.R.A., 2016. Phase i trials in melanoma: A framework to translate preclinical findings to the clinic. Eur. J. Cancer 67, 213–222. doi:10.1016/j.ejca.2016.07.024