Registration is open for a one-day data analysis workshop which will be a mix of talks and discussions. The talks will go through analysis methods and examples that will show how you can get more from your own data as well as how to leverage external data (open genomic and clinical data). Participants can also bring an old poster and discuss data analysis options with speakers for some on-the-day consulting. To register and find out more about the day follow this link.

# All posts by Hitesh Mistry

# Oxford Quantitative Systems Pharmacology (QSP): Is there a case for model reduction?

The title of this blog-post refers to a meeting which I attended recently which was sponsored by the UK QSP network. Some of you may not be familiar with the term QSP, put it simply it describes the application of mathematical and computational tools to pharmacology. As the title of the blog-post suggests the meeting was on the subject of model reduction.

The meeting was split into 4 sessions entitled:

- How to test if your model is too big?
- Model reduction techniques
- What are the benefits of building a non-identifiable model?
- Techniques for over-parameterised models

I was asked to present a talk in the first session, see here for the slide-set. The talk was based on a topic that has been mentioned on this blog a few times before, ion-channel cardiac toxicity prediction. The latest talk goes through how 3 models were assessed in their ability to discriminate between non-cardio-toxic and cardio-toxic drugs across 3 data-sets which are currently available. (A report providing more details is currently being put together and will be released shortly.) The 3 models used were a linear combination of block (simple model – blog-post here) and 2 large scale biophysical models of a cardiac cell, one termed the “gold-standard” (endorsed by FDA and other regulatory agencies/CiPA – [1])and the other forming a key component of the “cardiac safety simulator” [2].

The results showed that the simple model does just as well and in certain data-sets out-performs the leading biophysical models of the heart, see slide 25. Towards the end of the talk I discussed what drivers exist for producing such large models and should we invest further in them given the current evidence, see slides 24, 28 and 33. How does this example fit into all the sessions of the meeting?

In answer to the first session title: How to test if your model is too big? The answer is straightforward: if the simpler/smaller model outperforms the larger model then the larger model is too big. Regarding the second session on model reduction techniques – in this case there is no need for these. You could argue, from the results discussed here, that instead of pursuing model reduction techniques we may want to consider building smaller/simpler models to begin with. Onto the 3^{rd} session, on the benefits of building a non-identifiable model, it’s clear that there was no benefit in this situation in developing a non-identifiable model. Finally regarding techniques for over-parameterised models – the lesson learned from the cardiac toxicity field is just don’t build these sorts of models for this question.

Some people at the meeting argued that the type of model depends on the question, this is true, but does the scale of the model depend on the question?

If we now go back to the title of the meeting, is there a case for model reduction? Within the field of ion-channel cardiac toxicity the response would be: Why bother reducing a large model when you can build a smaller and simpler model which shows equal/better performance?

Of course, within the (skeptical) framework of Green and Armstrong [3] point out (see slide 28), one reason for model reduction is that for researchers it is the best of both worlds: they can build a highly complex model, suitable for publication in a highly-cited journal, and then spend more time extracting a simpler version to support client plans. Maybe those journals should look more closely at their selection criteria.

References

[1] Colatsky *et al.* The Comprehensive in Vitro Proarrhythmia Assay (CiPA) initiative — Update on progress. *Journal of Pharmacological and Toxicological Methods.* (2016) Volume 81 pages 15-20.

[2] Glinka A. and Polak S. QTc modification after risperidone administration – insight into the mechanism of action with use of the modeling and simulation at the population level approach. Toxicol. Mech. Methods (2015) 25 (4) pages 279-286.

[3] Green K. C. and Armstrong J. S. Simple versus complex forecasting: The evidence. *Journal Business Research.* (2015) 68 (8) pages 1678-1685.

# Hierarchical oncology combination therapy: which level leads to success?

The traditional preclinical combination experiment in Oncology for two drugs A and B is as follows. A cancer cell-line is exposed to increasing concentrations of drug A alone, drug B alone and also various concentrations of the combination for a fixed amount of time. That is we determine what effect drug A and B have as monotherapies which subsequently helps us to understand what the combination effect is. There are many articles which describe how mathematical/computational models can be used to analyse such data and possibly predict the combination effect using information on monotherapy agents alone. Those models can be either based on mechanism at the pathway or phenotype level (see CellCycler for an example of the latter) or they could be machine learning approaches. We shall call combinations at this scale cellular as they are mainly focussed on analysing combination effects at that scale. What other scales are their?

We know that human cancers contain more than one type of cell-population so the next scale from the cellular level is the tissue level. At this level we may have populations of cells with distinct genetic backgrounds either within one tumour or across multiple tumours within one patient. Here we may find for example that drug A kills cell type X and drug B doesn’t, but drug B kills cell-type Y and drug A doesn’t. So the combination can be viewed as a cell population enrichment strategy as it is still effective even though the two drugs do not interact in any way.

Traditional drug combination screening, as described above, are not designed to explore these types of combinations. There is another scale which is probably even less well known, the human population scale …

A typical human clinical combination trial in Oncology can involve combining new drug B with existing treatment A and comparing that to A only. It is unlikely that a 3^{rd} arm in this trial looking at drug B alone is likely to occur. The reason for this is that if an existing treatment is known to have an effect then it’s unethical to not use it. Unless one knows what effect the new drug B has on its own, it is difficult to assess what the effect of the combination is. Indeed the combination may simply enrich the patient population. That is, if drug A shrinks tumours in patient population X and drug B doesn’t, but drug B shrinks tumours in patient population Y and drug A doesn’t, then if the trial contains both X and Y there is still a combination effect which is greater than drug A alone.

Many people reading this blog are probably aware that when we see positive combination affects in the clinic that it could be due to this type of patient enrichment. At a meeting in Boston in April of this year a presentation from Adam Palmer suggests that two thirds of marketed combinations in Oncology can be explained in this way, see second half (slide 27 onwards) of this presentation here. This includes current immunotherapy combinations.

We can now see why combinations in Oncology can be viewed as hierarchical. How appreciative the research community is of this is unknown. Indeed one of the latest challenges from CRUK (Cancer Research UK), see here, suggests that even they may not be fully aware of it. That challenge merely focusses on the well-trodden path of the first level described here. Which level is the best to target? Is it easier to target the tissue and human population level than the cellular one? Only time will tell.

# Cancer Reproducibility Project

At a recent meeting at a medical health faculty, researchers were asked to nominate their favourite papers. One person instead of nominating a paper nominated a whole project website, The Reproducibility Project in Cancer Biology, see here. This person was someone who had left the field of systems biology to re-train as a biostatistician. In case you might be wondering it wasn’t me! In this blog-post we will take a look at the project, the motivation behind it and some of the emerging results.

The original paper which sets out the aims of the project can be found here. The initiative was a joint collaboration between the Center of Open Science and Science Exchange. The motivation behind it is likely to be quite obvious to many readers, but for those who are unfamiliar it relates to the fact that there are many incentives given to exciting new results, much less for verifying old discoveries.

The main paper goes into some detail about the reasons why it is difficult to reproduce results. One of the key factors is openness, which is why this is the first reproducibility attempt that has extensive documentation. The project’s main reason for choosing cancer research was due to previous findings published by Bayer and Amgen, see here and here. In those previous reports the exact details regarding which replication studies were attempted were not published, hence the need for an open project.

The first part of a reproducibility project is to decide which articles to pick. The obvious choices are the ones that are cited the most and have had the most publicity. Indeed this is what the project did. They chose 50 of the most impactful articles in cancer biology published between 2010 and 2012. The experimental group used to conduct the replication studies was not actually a single group. The project utilised the Science Exchange, see here, which is a network that consists of over 900 contract research organisations (CROs). Thus they did not have to worry about finding the people with the right skills.

One clear advantage of using a CRO over an academic lab is that there is no reason for them to be biased either for or against a particular experiment, which may not be true of academic labs. The other main advantage is time and cost – scale up is more efficient. All the details of the experiments and power calculations of the original studies were placed on the Open Science Framework, see here. So how successful has the project been?

The first sets of results are out and as expected they are variable. If you would like to read the results in detail, go to this link here. The five projects were:

- BET bromodomain inhibition as a therapeutic strategy to target c-Myc.
- The CD47-signal regulatory protein alpha (SIRPa) interaction is a therapeutic target for human solid tumours.
- Melanoma genome sequencing reveals frequent PREX2 mutations.
- Discovery and preclinical validation of drug indications using compendia of public gene expression data.
- Co-administration of a tumour-penetrating peptide enhances the efficacy of cancer drugs.

Two of the studies (1) and (4) were largely successful , and one (5) was not. The other two replication studies were found to be un-interpretable as the animal cancer models showed odd behaviour: they either grew too fast or exhibited spontaneous tumour regressions!

One of the studies which was deemed un-interpretable has led to a clinical trial: development of an anti-CD47 antibody. These early results highlight that there is an issue around reproducing preclinical oncology experiments, but many already knew this. (Just to add, this is not about reproducing p-values but size and direction of effects.) The big question is how to improve the reproducibility of research; there are many opinions on this matter. Clearly one step is to reward replication studies, which is easier said than done in an environment where novel findings are the ones that lead to riches!

# Survival prediction (P2P loan profitability) competitions

In a previous blog entry, see here, we discussed how survival analysis methods could be used to determine the profitability of P2P loans. The “trick” highlighted in that previous post was to focus on the profit/loss of a loan – which in fact is what you actually care about – rather than when and if a loan defaults. In doing so we showed that even loans that default are profitable if interest rates are high enough and the period of loan short enough.

Given that basic survival analysis methods shed light on betting strategies that could be profitable, are there more aggressive approaches that exist in the healthcare community that the financial world could take advantage of? The answer to that question is yes and it lies in using crowdsourcing as we shall now discuss.

Over recent years there has been an increase in prediction competitions in the healthcare sector. One set of organisers have aptly named these competitions as DREAM challenges, follow this link to their website. Compared to other prediction competition websites such as Kaggle here, the winning algorithms are made publicly available through the website and also published.

A recurring theme of these competitions, that simply moves from one disease area to the next, is survival. The most recent of these involved predicting the survival of prostate cancer patients who were given a certain therapy, results were published here. Unfortunately the paper is behind a paywall but the algorithm is downloadable from the DREAM challenge website.

The winning algorithm was basically an ensemble of Cox proportional hazards regression models, we briefly explained what these are in our previous blog entry. Those of you reading this blog who have a technical background will be thinking that doesn’t sound like an overly complicated modelling approach. In fact it isn’t – what was sophisticated was how the winning entry partitioned the data for explorative analyses and model building. The strategy appeared to be more important than the development of a new method. This observation resonates with the last blog entry on Big data versus big theory.

So what does all this have to do with the financial sector? Well competitions like the one described above can quite easily be applied to financial problems, as we blogged about previously, where survival analyses are currently being applied for example to P2P loan profitability. So the healthcare prediction arena is in fact a great place to search for the latest approaches for financial betting strategies.

# Misapplication of statistical tests to simulated data: Mathematical Oncologists join Cardiac Modellers

In a previous blog post we highlighted the pitfalls of applying null hypothesis testing to simulated data, see here. We showed that modellers applying null hypothesis testing to simulated data can control the p-value because they can control the sample size. Thus it’s not a great idea to analyse simulations using null hypothesis tests, instead modellers should focus on the size of the effect. This problem has been highlighted before by White et al. which is well worth a read, see here.

Why are we blogging about this subject again? Since that last post, co-authors of the original article we discussed there have repeated the same misdemeanour (Liberos et al., 2016), and a group of mathematical oncologists based at Moffitt Cancer Center has joined them (Kim et al., 2016).

The article by Kim et al., preprint available here, describes a combined experimental and modelling approach that “predicts” new dosing schedules for combination therapies that can delay onset of resistance and thus increase patient survival. They also show how their approach can be used to identify key stratification factors that can determine which patients are likely to do better than others. All of the results in the paper are based on applying statistical tests to simulated data.

The first part of the approach taken by Kim et al. involves calibrating a mathematical model to certain in-vitro experiments. These experiments basically measure the number of cells over a fixed observation time under 4 different conditions: control (no drug), AKT inhibitor, Chemotherapy and Combination (AKT/Chemotherapy). This was done for two different cell lines. The authors found a range of parameter values when trying to fit their model to the data. From this range they took forward a particular set, no real justification as to why that certain set, to test the model’s ability to predict different in-vitro dosing schedules. Unsurprisingly the model predictions came true.

After “validating” their model against a set of in-vitro experiments the authors proceed to using the model to analyse retrospective clinical data; a study involving 24 patients. The authors acknowledge that the in-vitro system is clearly not the same as a human system. So to account for this difference they perform an optimisation method to generate a humanised model. The optimisation is based on a genetic algorithm which searched the parameter space to find parameter sets that replicate the clinical results observed. Again, similar to the in-vitro situation, they found that there were multiple parameter sets that were able to replicate the observed clinical results. In fact they found a total of 3391 parameter sets.

Having now generated a distribution of parameters that describe patients within the clinical study they are interested in, the authors next set about generating stratification factors. For each parameter set the virtual patient exhibits one of four possible response categories. Therefore for each category a distribution of parameter values exists for the entire population. To assess the difference in the distribution of parameter values across the categories they perform a students t-test to ascertain whether the differences are statistically significant. Since they can control the sample size the authors can control the standard error and p-value, this is exactly the issue raised by White et al. An alternative approach would be to state the difference in the size of the effect, so the difference in means of the distributions. If the claim is that a given parameter can discriminate between two types of responses then a ROC AUC (Receiver Operating Characteristic Area Under Curve) value could be reported. Indeed a ROC AUC value would allow readers to ascertain the strength of a given parameter in discriminating between two response types.

The application of hypothesis testing to simulated data continues throughout the rest of the paper, culminating in applying a log-rank test to simulated survival data, where again they control the sample size. Furthermore, the authors choose an arbitrary cancer cell number which dictates when a patient dies. Therefore they have two ways of controlling the p-value. In this final act the authors again abuse the use of null hypothesis testing to show that the schedule found by their modelling approach is better than that used in the actual clinical study. Since the major results in the paper have all involved this type of manipulation, we believe they should be treated with extreme caution until better verified.

**References**

Liberos, A., Bueno-Orovio, A., Rodrigo, M., Ravens, U., Hernandez-Romero, I., Fernandez-Aviles, F., Guillem, M.S., Rodriguez, B., Climent, A.M., 2016. Balance between sodium and calcium currents underlying chronic atrial fibrillation termination: An in silico intersubject variability study. Heart Rhythm 0. doi:10.1016/j.hrthm.2016.08.028

White, J.W., Rassweiler, A., Samhouri, J.F., Stier, A.C., White, C., 2014. Ecologists should not use statistical significance tests to interpret simulation model results. Oikos 123, 385–388. doi:10.1111/j.1600-0706.2013.01073.x

Kim, E., Rebecca, V.W., Smalley, K.S.M., Anderson, A.R.A., 2016. Phase i trials in melanoma: A framework to translate preclinical findings to the clinic. Eur. J. Cancer 67, 213–222. doi:10.1016/j.ejca.2016.07.024

# Time-dependent bias of tumour growth rate and time to tumour re-growth

The title of this blog entry refers to a letter published in the journal entitled, CPT: Pharmacometrics & Systems Pharmacology. The letter is open-access so those of you interested can read it online here. In this blog entry we will go through it.

The letter discusses a rather strange modelling practice which is becoming the norm within certain modelling and simulation groups in the pharmaceutical industry. There has been a spate of publications citing that tumour re-growth rate (GR) and time to tumour re-growth (TTG), derived using models to describe imaging time-series data, correlates to survival [1-6]. In those publications the authors show survival curves (Kaplan-Meiers) highlighting a very strong relationship between GR/ TTG and survival. They either split on the median value of GR/TTG or into quartiles and show very impressive differences in survival times between the groups created; see Figure 2 in [4] for an example (open access).

Do these relationships seem too good to be true? In fact they may well be. In order to derive GR/TTG you need time-series data. The value of these covariates are not known at the beginning of the study, and only become available after a certain amount of time has passed. Therefore this type of covariate is typically referred to as a time-dependent covariate. None of the authors in [1-6] describe GR/TTG as a time-dependent covariate nor treat it as such.

When the correlations to survival were performed in those articles the authors assumed that they knew GR/TTG before any time-series data was collected, which is clearly not true. Therefore survival curves, such as Figure 2 in [4], are biased as they are based on survival times calculated from study start time to time of death, rather than time from when GR/TTG becomes available to time of death. Therefore, the results in [1-6] should be questioned and GR/TTG should not be used for decision making, as the question around whether tumour growth rate correlates to survival is still rather open.

Could it be the case that the GR/TTG correlation to survival is just an illusion of a flawed modelling practice? This is what we shall answer in a future blog-post.

[1] W.D. Stein et al., Other Paradigms: Growth Rate Constants and Tumor Burden Determined Using Computed Tomography Data Correlate Strongly With the Overall Survival of Patients With Renal Cell Carcinoma, Cancer J (2009)

[2] W.D. Stein, J.L. Gulley, J. Schlom, R.A. Madan, W. Dahut, W.D. Figg, Y. Ning, P.M. Arlen, D. Price, S.E. Bates, T. Fojo, Tumor Regression and Growth Rates Determined in Five Intramural NCI Prostate Cancer Trials: The Growth Rate Constant as an Indicator of Therapeutic Efficacy, Clin. Cancer Res. (2011)

[3] W.D. Stein et al., Tumor Growth Rates Derived from Data for Patients in a Clinical Trial Correlate Strongly with Patient Survival: A Novel Strategy for Evaluation of Clinical Trial Data, The Oncologist. (2008)

[4] K. Han, L. Claret, Y. Piao, P. Hegde, A. Joshi, J. Powell, J. Jin, R. Bruno, Simulations to Predict Clinical Trial Outcome of Bevacizumab Plus Chemotherapy vs. Chemotherapy Alone in Patients With First-Line Gastric Cancer and Elevated Plasma VEGF-A, CPT Pharmacomet. Syst. Pharmacol. (2016)

[5] J. van Hasselt et al., Disease Progression/Clinical Outcome Model for Castration-Resistant Prostate Cancer in Patients Treated With Eribulin, CPT Pharmacomet. Syst. Pharmacol. (2015)

[6] L. Claret et al., Evaluation of Tumor-Size Response Metrics to Predict Overall Survival in Western and Chinese Patients With First-Line Metastatic Colorectal Cancer, J. Clin. Oncol. (2013)

# Complexity v Simplicity, the winner is?

I recently published a letter with the above title in the journal of Clinical Pharmacology and Therapeutics; unfortunately it’s behind a paywall so I will briefly take you through the key point raised. The letter describes a specific prediction problem around drug induced cardiac toxicity mentioned in a previous blog entry (Mathematical models for ion-channel cardiac toxicity: David v Goliath). In short what we show in the letter is that a simple model using subtraction and addition (pre-school Mathematics) performs just as well for a given prediction problem as a multi-model approach using three large-scale models consisting of 100s of differential equations combined with machine learning approach (University level Mathematics and Computation)! The addition/subtraction model gave a ROC AUC of 0.97 which is very similar to multi-model/machine learning approach which gave a ROC AUC of 0.96. More detail on the analysis can be found on slides 17 and 18 within this presentation, A simple model for ion-channel related cardiac toxicity, which was given at an NC3Rs meeting.

The result described in the letter and presentation continues to add weight within that field that simple models are performing just as well as complex approaches for a given prediction task.

# Mathematical models for ion-channel cardiac toxicity: David v Goliath

This blog entry will focus on a rather long standing debate around model complexity and predictivity for a specific prediction problem from drug development. A typical drug project starts off with 1000’s of drugs for a certain idea. All but one of these drugs is eventually weened out through a series of experiments, which explore safety and efficacy, with the final drug being the one that enters human trials. The question we will explore is around a toxicity experiment performed rather early in the development (weening out) process, which determines the drug’s effect on the cardiac system.

Many years of research has identified certain proteins, ion-channels, which if a drug were to affect could lead to dire consequences for a patient. In simple terms, ion-channels allow ions, such as calcium, to flow in and out of a cell. Drugs can bind to ion-channels and disrupt their ability to function, thus affecting the flow of ions. The early experiment we are interested in basically measures how many ions flow across an ion-channel with increasing amount of drug. The cells used in these experiments are engineered to over-express the human protein we are interested in and so do not reflect a real cardiac cell. The experiment is pretty much automated and so allows one to screen 1000s of drugs a year against certain ion-channels. The output of the system is an IC50 value, the amount of drug needed to reduce the flow of ions across the ion-channel by 50 percent.

A series of IC50 values are generated for each drug against a number of ion-channels. (We are actually only interested in three.) The reason why a large screening effort is made is because we cannot test all the compounds in an animal model nor can we take all of them into man! So we can’t measure the effect of these drugs in real cardiac systems but we can measure their effect on certain ion-channel proteins which are expressed in the cardiac system we are interested in. The question is then: given a set of IC50 values against certain ion-channels for a particular drug can we predict how this drug will affect a cardiac system?

As mentioned earlier, drug development involves performing a series of experiments over time. The screening experiment described above is one of many used to look at cardiac toxicity. The next experiment in the pipeline, which could occur one or maybe two years later, is exploring the remaining drugs in an intact cardiac system. This could be a single cardiac cell taken from a dog, a portion of the ventricular wall, or something else entirely. After which, even less compounds are taken into dog studies before entering human trials. So the prediction question could be related to any one of these cardiac systems. The inputs into the prediction problem are the set of IC50 values, three in the cases we will look at, whereas the output, which we want to predict, are certain measures from the cardiac systems described.

At this point some of you may be thinking, well if we want to predict what will happen in a real cardiac system then why don’t we build a virtual version of the system using a large mathematical model (biophysical model)? Indeed people have done this. However, others (especially those who follow this blog) might also be thinking, I have three inputs and one output and given we screen lots of these compounds surely the dynamics are not that difficult to figure out, such that I can do something simpler and more cost effective! Again people have done this too. If I were to refer to the virtual system (consists of >100 parameters) as Goliath and the simple model (3 parameters) as David some of you can guess what the outcome is! A paper documenting the story in detail can be found here and the model used is available online here. I will just give a brief summary of the findings in the main paper.

The data-sets explored in the article involve making predictions in both animal studies and human. Something noticeable about the biophysical models used in the original articles was that a different structural model was needed for each study. This was not the case for the simple model which uses the same structure across all data sets. Given that the simple model gave the same if not better performance than the biophysical models it raises a question: why do the biophysical modelling community need a different model for different studies? In fact for two human studies, A and B, different human models were used, why? The reason may be that the degree of confidence in those models by people using them is actually quite low, hence the lack of consistency in the models used across the studies. Another issue not discussed by any of the biophysical modeling literature is the reproducibility of the data used to build such models. Given the growing skepticism of the reproducibility of preclinical data in science this adds further doubt to the suitability of such models for industrial use.

Given the points raised here (as well as a previous blog entry highlighting the misuse of these models by their own developers) can the biophysical modelling community be trusted to deliver a modelling solution that is both trustworthy and reliable? This is an important question as regulatory agencies are now also considering using these biophysical models together with some quite exciting new experimental techniques to change the way people assess the cardiac liability of a new drug.

# Application of survival analysis to P2P Lending Club loans data

Peer to peer lending is an option people are increasingly turning to, both for obtaining loans and for investment. The principle idea is that investors can decide who they give loans to, based on information provided by the loaner, and the loaner can decide what interest rate they are willing to pay. This new lending environment can give investors higher returns than traditional savings accounts, and loaners better interest rates than those available from commercial lenders.

Given the open nature of peer to peer lending, information is becoming readily available on who loans are given to and what the outcome of that loan was in terms of profitability for the investor. Available information includes the loaner’s credit rating, loan amount, interest rate, annual income, amount received etc. The open-source nature of this data has clearly led to an increased interest in analysing and modelling the data to come up with strategies for the investor which maximises their return. In this blog entry we will look at developing a model of this kind using an approach routinely used in healthcare, survival analysis. We will provide motivation as to why this approach is useful and demonstrate how a simple strategy can lead to significant returns when applied to data from the Lending Club.

In healthcare survival analysis is routinely used to predict the probability of survival of a patient for a given length of time based on information about that patient e.g. what diseases they have, what treatment is given etc. It is routinely used within the healthcare sector to make decisions both at the patient level, for example what treatment to give, and at the institutional level (e.g. health care providers), for example what new healthcare policies will decrease death associated with lung cancer. In most survival analysis studies the data-sets usually contain a significant proportion of patients who have yet to experience the event of interest by the time the study has finished. These patients clearly do not have an event time and so are described as being right-censored. An analysis can be conducted without these patients but this is clearly ignoring vital information and can lead to misleading and biased inferences. This could have rather large consequences were the resultant model applied prospectively. A key part of all survival analysis tools that have been developed is therefore that they do not ignore patients who are right censored. So what does this have to do with peer to peer lending?

The data on the loans available through sites such as the Lending Club contain loans that are current and most modelling methods described in other blogs have simply ignored these loans when building models to maximise investor’s returns. These loans described as being current are the same as our patients in survival analysis who have yet to experience an event at the time the data was collected. Applying a survival analysis approach will allow us to keep people whose loans are described as being current in our model development and thus utilise all information available. How can we apply survival analysis methods to loan data though, as we are interested in maximising profit and not how quickly a loan is paid back?

We need to select relevant dependent and independent variables first before starting the analysis. The dependent variable in this case is whether a loan has finished (fully repaid, defaulted etc.) or not (current). The independent variable chosen here is the relative return (RR) on that loan, this is basically the amount repaid divided by the amount loaned. Therefore if a loan has a RR value less than 1 it is loss making and greater than 1 it is profit making. Clearly loans that have yet to have finished are quite likely to have an RR value less than 1 however they have not finished and so within the survival analysis approach this is accounted for by treating that loan as being right-censored. A plot showing the survival curve of the lending club data can be seen in the below figure.

The black line shows the fraction of loans as a function of RR. We’ve marked the break-even line in red. Crosses represent loans that are right censored. We can already see from this plot that there are approximately 17-18% loans that are loss making, to the left of the red line. The remaining loans to the right of the red line are profit making. How do we model this data?

Having established what the independent and dependent variables are, we can now perform a survival analysis exercise on the data. There are numerous modelling options in survival analysis. We have chosen one of the easiest options, Cox-regression/proportional hazards, to highlight the approach which may not be the optimal one. So now we have decided on the modelling approach we need to think about what covariates we will consider.

A previous blog entry at yhat.com already highlighted certain covariates that could be useful, all of which are actually quite intuitive. We found that one of the covariates FICO range high (essentially is a credit score) had an interesting relationship to RR, see below.

Each circle represents a loan. It’s strikingly obvious that once the last FICO Range High score exceeds ~ 700 the number of loss making loans, ones below the red line decreases quite dramatically. So a simple risk adverse strategy would be just to invest in loans whose FICO Range High score exceeds 700, however there are still profitable loans which have a FICO Range High value less than 700. In our survival analysis we can stratify for loans below and above this 700 FICO Range High score value.

We then performed a rather routine survival analysis. Using FICO Range High as a stratification marker we looked at a series of covariates previously identified in a univariate analysis. We ranked each of the covariates based on the concordance probability. The concordance probability gives us information on how good a covariate is at ranking loans, a value of 0.5 suggests that covariate is no better than tossing a coin whereas a value of 1 is perfect, which never happens! We are using concordance probability rather than p-values, which is often done, because the data-set is very large and so many covariates come out as being “statistically significant” even though they have little effect on the concordance probability. This is a classic problem of Big Data and one option, of many, is to focus model building on another metric to counter this issue. If we use a step-wise building approach and use a simple criterion that to include a covariate the concordance probability must increase by at least 0.01 units, then we end up with a rather simple model: interest rate + term of loan. This model gave a concordance probability value of 0.81 in FICO Range High >700 and 0.63 for a FICO Range High value <700. Therefore, it does a really good job once we have screened out the bad loans and not so great when we have a lot of bad loans but we have a strategy that removes those.

This final model is available online here and can be found on the web-apps section of the website. When playing with the model you will find that if interest rates are high and the term of loan is low then regardless of the FICO Range High value all loans are profitable, however those with FICO Range High values >700 provide a higher return, see figure below.

The above plot was created by using an interest rate of 20% for a 36 month loan. The plot shows two curves, the one in red represents a loan whose FICO Range High value <700 and the black one a loan with FICO Range High value >700. The curves describe your probability of attaining a certain amount of profit or loss. You can see that for the input values used here, the probability of making a loss is similar regardless of the FICO Range High Value; however the amount of return is better for loans with FICO Range High value >700.

Using survival analysis techniques we have shown that you can create a relatively simple model that lends itself well for interpretation, i.e. probability curves. Performance of the model could be improved using random survival forests – the gain may not be as large as you might expect but every percentage point counts. In a future blog we will provide an example of applying survival analysis to actual survival data.