# When is a result significant?

The standard way of answering this question is to ask whether the effect could reasonably have happened by chance (the null hypothesis). If not, then the result is announced to be ‘significant’. The usual threshold for significance is that there is only a 5 percent chance of the results happening due to purely random effects.

This sounds sensible, and has the advantage of being easy to compute. Which is perhaps why statistical significance has been adopted as the default test in most fields of science. However, there is something a little confusing about the approach; because it asks whether adopting the opposite of the theory – the null hypothesis – would mean that the data is unlikely to be true. But what we want to know is whether a theory is true. And that isn’t the same thing.

As just one example, suppose we have lots of data and after extensive testing of various theories we discover one that passes the 5 percent significance test. Is it really 95 percent likely to be true? Not necessarily – because if we are trying out lots of ideas, then it is likely that we will find one that matches purely by chance.

While there are ways of working around this within the framework of standard statistics, the problem usually gets glossed over in the vast majority of textbooks and articles. So for example it is typical to say that a result is ‘significant’ without any discussion of whether it is plausible in a more general sense (see our post on model misuse in cardiac modeling).

The effect is magnified by publication bias – try out multiple theories, find one that works, and publish. Which might explain why, according to a number of studies (see for example here and here), much scientific work proves impossible to replicate – a situation which scientist Robert Matthews calls a ‘scandal of stunning proportions’ (see his book Chancing It: The laws of chance – and what they mean for you).

The way of Bayes

An alternative approach is provided by Bayesian statistics. Instead of starting with the assumption that data is random and making weird significance tests on null hypotheses, it just tries to estimate the probability that a model is right (i.e. the thing we want to know) given the complete context. But it is harder to calculate for two reasons.

One is that, because it treats new data as updating our confidence in a theory, it also requires we have some prior estimate of that confidence, which of course may be hard to quantify – though the problem goes away as more data becomes available. (To see how the prior can affect the results, see the BayesianOpionionator web app.) Another problem is that the approach does not treat the theory as fixed, which means that we may have to evaluate probabilities over whole families of theories, or at least a range of parameter values. However this is less of an issue today since the simulations can be performed automatically using fast computers and specialised software.

Perhaps the biggest impediment, though, is that when results are passed through the Bayesian filter, they often just don’t seem all that significant. But while that may be bad for publications, and media stories, it is surely good for science.

# The exponential growth effect

A common critique of biologists, and scientists in general, concerns their occasionally overenthusiastic tendency to find patterns in nature – especially when the pattern is a straight line. It is certainly notable how, confronted with a cloud of noisy data, scientists often manage to draw a straight line through it and announce that the result is “statistically significant”.

Straight lines have many pleasing properties, both in architecture and in science. If a time series follows a straight line, for example, it is pretty easy to forecast how it should evolve in the near future – just assume that the line continues (note: doesn’t always work).

However this fondness for straightness doesn’t always hold; indeed there are cases where scientists prefer to opt for a more complicated solution. An example is the modelling of tumour growth in cancer biology.

Tumour growth is caused by the proliferation of dividing cells. For example if cells have a cell cycle length td, then the total number of cells will double every td hours, which according to theory should result in exponential growth. In the 1950s (see Collins et al., 1956) it was therefore decided that the growth rate could be measured using the cell doubling time.

In practice, however, it is found that tumours grow more slowly as time goes on, so this exponential curve needed to be modified. One variant is the Gompertz curve, which was originally derived as a model for human lifespans by the British actuary Benjamin Gompertz in 1825, but was adapted for modelling tumour growth in the 1960s (Laird, 1964). This curve gives a tapered growth rate, at the expense of extra parameters, and has remained highly popular as a means of modelling a variety of tumour types.

However, it has often been observed empirically that tumour diameters, as opposed to volumes, appear to grow in a roughly linear fashion. Indeed, this has been known since at least the 1930s. As Mayneord wrote in 1932: “The rather surprising fact emerges that the increase in long diameter of the implanted tumour follows a linear law.” Furthermore, he noted, there was “a simple explanation of the approximate linearity in terms of the structure of the sarcoma. On cutting open the tumour it is often apparent that not the whole of the mass is in a state of active growth, but only a thin capsule (sometimes not more than 1 cm thick) enclosing the necrotic centre of the tumour.”

Because only this outer layer contains dividing cells, the rate of increase for the volume depends on the doubling time multiplied by the volume of the outer layer. If the thickness of the growing layer is small compared to the total tumour radius, then it is easily seen that the radius grows at a constant rate which is equal to the doubling time multiplied by the thickness of the growing layer. The result is a linear growth in radius. This  translates to cubic growth in volume, which of course grows more slowly than an exponential curve at longer times – just as the data suggests.

In other words, rather than use a modified exponential curve to fit volume growth, it may be better to use a linear equation to model diameter. This idea that tumour growth is driven by an outer layer of proliferating cells, surrounding a quiescent or necrotic core, has been featured in a number of mathematical models (see e.g. Checkley et al., 2015, and our own CellCycler model).  The linear growth law can also be used to analyse tumour data, as in the draft paper: “Analysing within and between patient patient tumour heterogenity via imaging: Vemurafenib, Dabrafenib and Trametinib.” The linear growth equation will of course not be a perfect fit for the growth of all tumours (no simple model is), but it is based on a consistent and empirically verified model of tumour growth, and can be easily parameterised and fit to data.

So why hasn’t this linear growth law caught on more widely? The reason is that what scientists see in data often depends on their mental model of what is going on.

I first encountered this phenomenon in the late 1990s when doing my D.Phil. in the prediction of nonlinear systems, with applications to weather forecasting. The dominant theory at the time said that forecast error was due to sensitivity to initial condition, aka the butterfly effect. As I described in The Future of Everything, researchers insisted that forecast errors showed the exponential growth characteristic of chaos, even though plots showed they clearly grew with slightly negative curvature, which was characteristic of model error.

A similar effect in cancer biology has again changed the way scientists interpret data. Sometimes, a straight line really is the best solution.

### References

Collins, V. P., Loeffler, R. K. & Tivey, H. Observations on growth rates of human tumors. The American journal of roentgenology, radium therapy, and nuclear medicine 76, 988-1000 (1956).

Laird A. K. Dynamics of tumor growth. Br J of Cancer 18 (3): 490–502 (1964).

W. V. Mayneord. On a Law of Growth of Jensen’s Rat Sarcoma. Am J Cancer 16, 841-846 (1932).

Stephen Checkley, Linda MacCallum, James Yates, Paul Jasper, Haobin Luo, John Tolsma, Claus Bendtsen. Bridging the gap between in vitro and in vivo: Dose and schedule predictions for the ATR inhibitor AZD6738. Scientific Reports, 5(3)13545 (2015).

Yorke, E. D., Fuks, Z., Norton, L., Whitmore, W. & Ling, C. C. Modeling the Development of Metastases from Primary and Locally Recurrent Tumors: Comparison with a Clinical Data Base for Prostatic Cancer. Cancer Research 53, 2987-2993 (1993).

Hitesh Mistry, David Orrell, and Raluca Eftimie. Analysing within and between patient patient tumour heterogenity via imaging: Vemurafenib, Dabrafenib and Trametinib. (Working paper)

# The CellCycler

Tumour modelling has been an active field of research for some decades, and a number of approaches have been taken, ranging from simple models of an idealised spherical tumour, to highly complex models which attempt to account for everything from cellular chemistry to mechanical stresses. Some models use ordinary differential equations, while others use an agent-based approach to track individual cells.

A disadvantage of the more complex models is that they involve a large number of parameters, which can only be roughly estimated from available data. If the aim is to predict, rather than to describe, then this leads to the problem of overfitting: the model is very flexible and can be tuned to fit available data, but is less useful for predicting for example the effect of a new drug.

Indeed, there is a rarely acknowledged tension in mathematical modelling between realism, in the sense of including lots of apparently relevant features, and predictive accuracy. When it comes to the latter, simple models often out-perform complex models. Yet in most areas there is a strong tendency for researchers to develop increasingly intricate models. The reason appears to have less to do with science, than with institutional effects. As one survey of business models notes (and these points would apply equally to cancer modelling) complex models are preferred in large part because: “(1) researchers are rewarded for publishing in highly ranked journals, which favor complexity; (2) forecasters can use complex methods to provide forecasts that support decision-makers’ plans; and (3) forecasters’ clients may be reassured by incomprehensibility.”

Being immune to all such pressures (this is just a blog post after all!) we decided to develop the CellCycler – a parsimonius “toy” model of a cancer tumour that attempts to capture the basic growth and drug-response dynamics using only a minimal number of parameters and assumptions.  The model uses circa 100 ordinary differential equations (ODEs) to simulate cells as they pass through the phases of the cell cycle; however the equations are simple and the model only uses parameters that can be observed or reasonably well approximated. It is available online as a Shiny app.

The CellCycler model divides the cell cycle into a number of discrete compartments, and is therefore similar in spirit to other models that for example treat each phase G1, S, G2, and mitosis as a separate compartment, with damaged cells being shunted to their own compartment (see for example the model by Checkley et al. here). Each compartment has its own set of ordinary differential equations which govern how its volume changes with time due to growth, apoptosis, or damage from drugs. There are additional compartments for damaged cells, which may be repaired or lost to apoptosis. Drugs are simulated using standard PK models, along with a simple description of phase-dependent drug action on cells. For the tumour growth, we use a linear model, based like the Checkley et al. paper on the assumption of a thin growing layer (see also our post on The exponential growth effect).

### The advantages of compartmentalising

Dividing the cell cycle into separate compartments has an interesting and useful side effect, which is that it introduces a degree of uncertainty into the calculation. For example, if a drug causes damage and delays progress in a particular phase, then that drug will tend to synchronize the cell population in that state. However there is an obvious difference between cells that are affected when they are at the start of the phase, and those that are already near the end of the phase. If the compartments are too large, that precise information about the state of cells is lost.

The only way to restore precision would be to use a very large number of compartments. But in reality, individual cells will not all have exactly the same doubling time. We therefore want to have a degree of uncertainty. And this can be controlled by adjusting the number of compartments.

This effect is illustrated by the figure below, which shows how a perturbation at time zero in one compartment tends to blur out over time, for models with 25, 50, and 100 compartments, and a doubling time of 24 hours. In each case a perturbation is made to compartment 1 at the beginning of the cell cycle (the magnitude is scaled to the number of compartments so the total size of the perturbation is the same in terms of total volume). For the case with 50 compartments, the curve after one 24 hours is closely approximated by a normal distribution with standard deviation of 3.4 hours or about 14 percent. In general, the standard deviation can be shown to be approximately equal to the doubling time divided by the square root of N.

A unique feature of the CellCycler is that it exploits this property as a way of adjusting the variability of doubling time in the cell population. The model can therefore provide a first-order approximation to the more complex heterogeneity that can be simulated using agent-based models. While we don’t usually have exact data on the spread of doubling times in the growing layer, the default level of 50 compartments gives what appears to be a reasonable degree of spread (about 14 percent). Using 25 compartments gives 20 percent, while using 100 compartments decreases this to 10 percent.

### Using the CellCycler

The starting point for the Shiny web application is the Cells page, which is used to model the dynamics of a growing cell population. The key parameters are the average cell doubling time, and the fraction spent in each phase. The number of model compartments can be adjusted in the Advanced page: note that, along with doubling time spread, the choice also affects both the simulation time (more compartments is slower), and the discretisation of the cell cycle. For example with 50 compartments the proportional phase times will be rounded off to the nearest 1/50=0.02.

The next pages, PK1 and PK2, are used to parameterise the PK models and drug effects. The program has a choice of standard PK models, with adjustable parameters such as Dose/Volume.  In addition the phase of action (choices are G1, S, G2, M, or all), and rates for death, damage, and repair can be adjusted. Finally, the Tumor page (shown below) uses the model simulation to generate a plot of tumor radius, given an initial radius and growing layer. Plots can be overlaid with experimental data.

We hope the CellCycler can be a useful tool for research or for exploring the dynamics of tumour growth. As mentioned above it is only a “toy” model of a tumour. However, all our models of complex organic systems – be they of a tumor, the economy, or the global climate system – are toys compared to the real things. And of course there is nothing to stop users from extending the model to incorporate additional effects. Though whether this will lead to improved predictive accuracy is another question.

Try the CellCycler web app here.

### References:

Stephen Checkley, Linda MacCallum, James Yates, Paul Jasper, Haobin Luo, John Tolsma, Claus Bendtsen. “Bridging the gap between in vitro and in vivo: Dose and schedule predictions for the ATR inhibitor AZD6738,” Scientific Reports.2015;5(3)13545.

Green, Kesten C. & Armstrong, J. Scott, 2015. “Simple versus complex forecasting: The evidence,” Journal of Business Research, Elsevier, vol. 68(8), pages 1678-1685.

# Mathematical models for ion-channel cardiac toxicity: David v Goliath

This blog entry will focus on a rather long standing debate around model complexity and predictivity for a specific prediction problem from drug development. A typical drug project starts off with 1000’s of drugs for a certain idea. All but one of these drugs is eventually weened out through a series of experiments, which explore safety and efficacy, with the final drug being the one that enters human trials.  The question we will explore is around a toxicity experiment performed rather early in the development (weening out) process, which determines the drug’s effect on the cardiac system.

Many years of research has identified certain proteins, ion-channels, which if a drug were to affect could lead to dire consequences for a patient.  In simple terms, ion-channels allow ions, such as calcium, to flow in and out of a cell. Drugs can bind to ion-channels and disrupt their ability to function, thus affecting the flow of ions. The early experiment we are interested in basically measures how many ions flow across an ion-channel with increasing amount of drug.  The cells used in these experiments are engineered to over-express the human protein we are interested in and so do not reflect a real cardiac cell. The experiment is pretty much automated and so allows one to screen 1000s of drugs a year against certain ion-channels.  The output of the system is an IC50 value, the amount of drug needed to reduce the flow of ions across the ion-channel by 50 percent.

A series of IC50 values are generated for each drug against a number of ion-channels. (We are actually only interested in three.) The reason why a large screening effort is made is because we cannot test all the compounds in an animal model nor can we take all of them into man! So we can’t measure the effect of these drugs in real cardiac systems but we can measure their effect on certain ion-channel proteins which are expressed in the cardiac system we are interested in.  The question is then: given a set of IC50 values against certain ion-channels for a particular drug can we predict how this drug will affect a cardiac system?

As mentioned earlier, drug development involves performing a series of experiments over time. The screening experiment described above is one of many used to look at cardiac toxicity. The next experiment in the pipeline, which could occur one or maybe two years later, is exploring the remaining drugs in an intact cardiac system.  This could be a single cardiac cell taken from a dog, a portion of the ventricular wall, or something else entirely. After which, even less compounds are taken into dog studies before entering human trials. So the prediction question could be related to any one of these cardiac systems.  The inputs into the prediction problem are the set of IC50 values, three in the cases we will look at, whereas the output, which we want to predict, are certain measures from the cardiac systems described.

At this point some of you may be thinking, well if we want to predict what will happen in a real cardiac system then why don’t we build a virtual version of the system using a large mathematical model (biophysical model)? Indeed people have done this. However, others (especially those who follow this blog) might also be thinking, I have three inputs and one output and given we screen lots of these compounds surely the dynamics are not that difficult to figure out, such that I can do something simpler and more cost effective! Again people have done this too. If I were to refer to the virtual system (consists of >100 parameters) as Goliath and the simple model (3 parameters) as David some of you can guess what the outcome is! A paper documenting the story in detail can be found here and the model used is available online here.  I will just give a brief summary of the findings in the main paper.

The data-sets explored in the article involve making predictions in both animal studies and human.  Something noticeable about the biophysical models used in the original articles was that a different structural model was needed for each study.  This was not the case for the simple model which uses the same structure across all data sets.  Given that the simple model gave the same if not better performance than the biophysical models it raises a question: why do the biophysical modelling community need a different model for different studies? In fact for two human studies, A and B, different human models were used, why?  The reason may be that the degree of confidence in those models by people using them is actually quite low, hence the lack of consistency in the models used across the studies. Another issue not discussed by any of the biophysical modeling literature is the reproducibility of the data used to build such models. Given the growing skepticism of the reproducibility of preclinical data in science this adds further doubt to the suitability of such models for industrial use.

Given the points raised here (as well as a previous blog entry highlighting the misuse of these models by their own developers) can the biophysical modelling community be trusted to deliver a modelling solution that is both trustworthy and reliable? This is an important question as regulatory agencies are now also considering using these biophysical models together with some quite exciting new experimental techniques to change the way people assess the cardiac liability of a new drug.

# Application of survival analysis to P2P Lending Club loans data

Peer to peer lending is an option people are increasingly turning to, both for obtaining loans and for investment.  The principle idea is that investors can decide who they give loans to, based on information provided by the loaner, and the loaner can decide what interest rate they are willing to pay. This new lending environment can give investors higher returns than traditional savings accounts, and loaners better interest rates than those available from commercial lenders.

Given the open nature of peer to peer lending, information is becoming readily available on who loans are given to and what the outcome of that loan was in terms of profitability for the investor. Available information includes the loaner’s credit rating, loan amount, interest rate, annual income, amount received  etc.  The open-source nature of this data has clearly led to an increased interest in analysing and modelling the data to come up with strategies for the investor which maximises their return. In this blog entry we will look at developing a model of this kind using an approach routinely used in healthcare, survival analysis.  We will provide motivation as to why this approach is useful and demonstrate how a simple strategy can lead to significant returns when applied to data from the Lending Club.

In healthcare survival analysis is routinely used to predict the probability of survival of a patient for a given length of time based on information about that patient e.g. what diseases they have, what treatment is given etc.  It is routinely used within the healthcare sector to make decisions both at the patient level, for example what treatment to give, and at the institutional level (e.g. health care providers), for example what new healthcare policies will decrease death associated with lung cancer.  In most survival analysis studies the data-sets usually contain a significant proportion of patients who have yet to experience the event of interest by the time the study has finished.  These patients clearly do not have an event time and so are described as being right-censored. An analysis can be conducted without these patients but this is clearly ignoring vital information and can lead to misleading and biased inferences.  This could have rather large consequences were the resultant model applied prospectively.  A key part of all survival analysis tools that have been developed is therefore that they do not ignore patients who are right censored.  So what does this have to do with peer to peer lending?

The data on the loans available through sites such as the Lending Club contain loans that are current and most modelling methods described in other blogs have simply ignored these loans when building models to maximise investor’s returns.  These loans described as being current are the same as our patients in survival analysis who have yet to experience an event at the time the data was collected. Applying a survival analysis approach will allow us to keep people whose loans are described as being current in our model development and thus utilise all information available. How can we apply survival analysis methods to loan data though, as we are interested in maximising profit and not how quickly a loan is paid back?

We need to select relevant dependent and independent variables first before starting the analysis.  The dependent variable in this case is whether a loan has finished (fully repaid, defaulted etc.) or not (current). The independent variable chosen here is the relative return (RR) on that loan, this is basically the amount repaid divided by the amount loaned.  Therefore if a loan has a RR value less than 1 it is loss making and greater than 1 it is profit making.  Clearly loans that have yet to have finished are quite likely to have an RR value less than 1 however they have not finished and so within the survival analysis approach this is accounted for by treating that loan as being right-censored. A plot showing the survival curve of the lending club data can be seen in the below figure.

The black line shows the fraction of loans as a function of RR. We’ve marked the break-even line in red.  Crosses represent loans that are right censored. We can already see from this plot that there are approximately 17-18% loans that are loss making, to the left of the red line. The remaining loans to the right of the red line are profit making.  How do we model this data?

Having established what the independent and dependent variables are, we can now perform a survival analysis exercise on the data.  There are numerous modelling options in survival analysis.  We have chosen one of the easiest options, Cox-regression/proportional hazards, to highlight the approach which may not be the optimal one. So now we have decided on the modelling approach we need to think about what covariates we will consider.

A previous blog entry at yhat.com already highlighted certain covariates that could be useful, all of which are actually quite intuitive. We found that one of the covariates FICO range high (essentially is a credit score) had an interesting relationship to RR, see below.

Each circle represents a loan. It’s strikingly obvious that once the last FICO Range High score exceeds ~ 700 the number of loss making loans, ones below the red line decreases quite dramatically. So a simple risk adverse strategy would be just to invest in loans whose FICO Range High score exceeds 700, however there are still profitable loans which have a FICO Range High value less than 700. In our survival analysis we can stratify for loans below and above this 700 FICO Range High score value.

We then performed a rather routine survival analysis. Using FICO Range High as a stratification marker we looked at a series of covariates previously identified in a univariate analysis.  We ranked each of the covariates based on the concordance probability. The concordance probability gives us information on how good a covariate is at ranking loans, a value of 0.5 suggests that covariate is no better than tossing a coin whereas a value of 1 is perfect, which never happens! We are using concordance probability rather than p-values, which is often done, because the data-set is very large and so many covariates come out as being “statistically significant” even though they have little effect on the concordance probability. This is a classic problem of Big Data and one option, of many, is to focus model building on another metric to counter this issue.  If we use a step-wise building approach and use a simple criterion that to include a covariate the concordance probability must increase by at least 0.01 units, then we end up with a rather simple model: interest rate + term of loan.  This model gave a concordance probability value of 0.81 in FICO Range High >700 and 0.63 for a FICO Range High value <700. Therefore, it does a really good job once we have screened out the bad loans and not so great when we have a lot of bad loans but we have a strategy that removes those.

This final model  is available online here and can be found on the web-apps section of the website.   When playing with the model you will find that if interest rates are high and the term of loan is low then regardless of the FICO Range High value all loans are profitable, however those with FICO Range High values >700 provide a higher return, see figure below.

The above plot was created by using an interest rate of 20% for a 36 month loan.  The plot shows two curves, the one in red represents a loan whose FICO Range High value <700 and the black one a loan with FICO Range High value >700. The curves describe your probability of attaining a certain amount of profit or loss.  You can see that for the input values used here, the probability of making a loss is similar regardless of the FICO Range High Value; however the amount of return is better for loans with FICO Range High value >700.

Using survival analysis techniques we have shown that you can create a relatively simple model that lends itself well for interpretation, i.e. probability curves. Performance of the model could be improved using random survival forests – the gain may not be as large as you might expect but every percentage point counts.  In a future blog we will provide an example of applying survival analysis to actual survival data.

# The Evolution of Money

Latest book The Evolution of Money with Roman Chlupatý is published this week by Columbia University Press. Everything you need to know about money (except how to make it!).

# Rent or buy

Suppose you were offered the choice between investing in one of two assets. The first, asset A, has a long term real price history (i.e. with inflation stripped out) which looks like this:

It seems that the real price of the asset hasn’t gone anywhere in the last 125 years, with an average compounded growth rate of about half a percent. The asset also appears to be needlessly volatile for such a poor performance.

Asset B is shown in the next plot by the red line, with asset A shown again for comparison (but with a different vertical scale):

Note again this is a log scale, so  asset B has increased in price by more than a factor of a thousand, after inflation, since 1890. The average compounded growth rate, after inflation, is 6.6 percent – an improvement of over 6 percent compared to asset A.

On the face of it, it would appear that asset B – the steeply climbing red line – would be the better bet. But suppose that everyone around you believed that asset A was the correct way to build wealth. Not only were people investing their life savings in asset A, but they were taking out highly leveraged positions in order to buy as much of it as possible. Parents were lending their offspring the money to make a down payment on a loan so that they wouldn’t be deprived. Other buyers (without rich parents) were borrowing the down payment from secondary lenders at high interest rates. Foreigners were using asset A as a safe store of wealth, one which seemed to be mysteriously exempt from anti-money laundering regulations. In fact, asset A had become so systemically important that a major fraction of the country’s economy was involved in either building it, selling it, or financing it.

You may have already guessed that the blue line is the US housing market (based on the Case-Shiller index), and the red line is the S&P 500 stock market index, with dividends reinvested. The housing index ignores factors such as the improvement in housing stock, so really measures the value of residential land. The stock market index (again based on Case-Shiller data) is what you might get from a hypothetical index fund. In either case, things like management and transaction fees have been ignored.

So why does everyone think housing is a better investment than the stock market?

Of course, the comparison isn’t quite fair. For one thing, you can live in a house – an important dividend in itself – while a stock market portfolio is just numbers in an account. But the vast discrepancy between the  two means that we have to ask, is housing a good place to park your money, or is it better in financial terms to rent and invest your savings?

As an example, I was recently offered the opportunity to buy a house in the Toronto area before it went on the market. The price was \$999,000, which is about average for Toronto. It was being rented out at \$2600 per month. Was it a good deal?

Usually real estate decisions are based on two factors – what similar properties are selling for, and what the rate of appreciation appears to be. In this case I was told that the houses on the street were selling for about that amount, and furthermore were going up by about a \$100K per year (the Toronto market is very hot right now). But both of these factors depend on what other people are doing and thinking about the market – and group dynamics are not always the best measure of value (think the Dutch tulip bulb crisis).

A potentially more useful piece of information is the current rent earned by the property. This gives a sense of how much the house is worth as a provider of housing services, rather than as a speculative investment, and therefore plays a similar role as the earnings of a company. And it offers a benchmark to which we can compare the price of the house.

Consider two different scenarios, Buy and Rent. In the Buy scenario, the costs include the initial downpayment, mortgage payments, and monthly maintenance fees (including regular repairs, utilities, property taxes, and accrued expenses for e.g. major renovations). Once the mortgage period is complete the person ends up with  a fully-paid house.

For the Rent scenario, we assume identical initial and monthly outflows. However the housing costs in this case only involve rent and utilities. The initial downpayment is therefore invested, as are any monthly savings compared to the Buy scenario. The Rent scenario therefore has the same costs as the Buy scenario, but the person ends up with an investment portfolio instead of a house. By showing which of these is worth more, we can see whether in financial terms it is better to buy or rent.

This is the idea behind our latest web app: the RentOrBuyer. By supplying values for price, mortgage rates, expected investment returns, etc., the user can compare the total cost of buying or renting a property and decide whether that house is worth buying. (See also this Globe and Mail article, which also suggests useful estimates for things like maintenance costs.)

The RentOrBuyer app allows the user to compare the overall cumulative cost of buying or renting a home.

The Rent page gives details about the rent scenario including a plot of cumulative costs.

For the \$999,000 house, and some perfectly reasonable assumptions for the parameters, I estimate savings by renting of about … a million dollars. Which is certainly enough to give one pause. Give it a try yourself before you buy that beat up shack!

Of course, there are many uncertainties involved in the calculation. Numbers like interest rates and returns on investment are liable to change. We also don’t take into account factors such as taxation, which may have an effect, depending on where you live. However, it is still possible to make reasonable assumptions. For example, an investment portfolio can be expected to earn more over a long time period than a house (a house might be nice, but it’s not going to be the next Apple). The stock market is prone to crashes, but then so is the property market as shown by the first figure. Mortgage rates are at historic lows and are likely to rise.

While the RentOrBuyer can only provide an estimate of the likely outcome, the answers it produces tend to be reasonably robust to changes in the assumptions, with a fair ratio of house price to rent typically working out in the region of 200-220. Perhaps unsurprisingly, this is not far off the historical average. Institutions such as the IMF and central banks use this ratio along with other metrics such as the ratio of average prices to earnings to detect housing bubbles. As an example, according to Moody’s Analytics, the average ratio for metro areas in the US was near its long-term average of about 180 in 2000, reached nearly 300 in 2006 with the housing bubble, and was back to 180 in 2010.

House prices in many urban areas – in Canada, Toronto and especially Vancouver come to mind – have seen a remarkable run-up in price in recent years (see my World Finance article). However this is probably due to a number of factors such as ultra-low interest rates following the financial crash, inflows of (possibly laundered) foreign cash, not to mention a general enthusiasm for housing which borders on mania. The RentOrBuyer app should help give some perspective, and a reminder that the purpose of a house is to provide a place to live, not a vehicle for gambling.

Try the RentOrBuyer app here.

### References

Housing in Crisis: When Will Metro Markets Recover? Mark Zandi, Celia Chen, Cristian deRitis, Andres Carbacho-Burgos, Moody’s Economy.com, February 2009.

# How do you calculate confidence intervals for a prediction?

This is a question that comes up frequently in forecasting. But it is surprisingly hard to answer, because it boils down to predicting how accurate a forecast will be – a prediction about a prediction. Prediction squared.

One approach is to base the estimate on past errors in similar situations. This method is used for example by the National  Institute of Statistics and Economic Studies (INSEE) in France, who wrote  that “the distribution of forecasting errors calculated from past exercises is a reliable indicator of the distribution of future errors and hence of the uncertainty  surrounding a given forecast” (see this research paper).

But this assumes that the new data will follow a familiar pattern – which may not be the case if for example you are trying to predict the effect of a novel economic policy, or a new drug, or climate change.

Another approach is to randomly perturb model parameters. But this has problems of its own.

To illustrate this, consider a simple linear model x(t) = k*t + x0, and suppose we want to predict the state at time t=1 based on an observation at t=0. Without loss of generality we can set the expected slope of the model to k=0, so the prediction is a persistence forecast: x1 = x0 (here x1 = x(1) and x0 = x(0)). Treating errors as random variables, then (in terms of variance) the error in the prediction, relative to the observation, will be the sum of the variance of the initial and final observational errors (see note below for details).

This makes sense since we are assuming the model is perfect, so all error comes from the observations. But in the real world, model error is not usually zero! Observational error is only part of the puzzle. So how do we estimate the contribution of model error?

As mentioned above, a typical approach is to perturb the parameters of the model by some reasonable amount, or do a Monte Carlo over a range of parameter values. (See paper on ensemble forecasting with model error.) For our simple linear model, a Monte Carlo simulation using a normal distribution around 0 with variance w for the parameter k would then give an ensemble of model predictions, with the same variance of w. Again this error will add to the error due to the observations.

This all sounds very logical and scientific, and versions of this approach are used by everyone from central bankers to weather forecasters. But again there is a catch, because the answer will depend on the parameter range that we selected. In other words, we can get whatever answer we want by choosing the range.

Of course, one can argue for a particular range – but if we are forecasting a new situation, we can’t base the estimate reliably on past data.

And there is an even more intractable issue – which is that the prediction error may be due not to parameter error, but to model structure. What if the actual system is not linear? (It probably isn’t.)

The ultimate problem is that the frequentist approach to statistics breaks down completely in forecasting – it relies on analyzing data, but the whole point of forecasting is that there is no data to measure (otherwise you could just measure it and not bother with the forecast).

Fortunately there is a solution, or at least an intellectually consistent method, which is to take a Bayesian approach. Unlike the commonly-taught frequentist approach, which treats probabilities as a measure of the frequency of observed events, the Bayesian approach interprets probabilities as a measure of degrees of belief. And in forecasting, confidence intervals ultimately are a measure of one’s confidence in the model.

In the case of our simple model, the idea is to come up with an initial confidence interval, based for example on previous experience, but see it as an estimate only, and refine it as more data comes in.

Of course this requires admitting that the confidence interval relies on subjective estimates. However doing so can help to avoid another problem in mathematical modelling, which is the tendency of frequentist error estimates to ignore the effect of context and prior information. Read our article on the BayesianOpinionator.

Notes:

For the simple linear model case, prediction error is the sum of the initial and final errors. We’ll use x to denote predictions, y for the true state, z for observations, and e for observational errors.

Suppose that the observed initial condition z0 is observed with an error e0, and the observed final point z1 has an error e1. So the true initial condition is y0 = z0 + e0, and the true final state is y1 = z1+ e1.

If we assume there is no model error, then y1 = y0. It follows that the difference between the forecast x1 = z0 and the observed final state z1 is:

error = z1- z0 = y1 – e1 – y0 + e0 = e0 – e1.

If the errors are assumed to be normal with variance v0 and v1, then the forecast error has variance v0 + v1 (variance is additive), which allows us to determine confidence intervals. So for example a 95% confidence interval would be +/-1.96 times the standard deviation.

If we assume that model error contributes an error at time 1 with variance w1, then again the variances are additive, and the total will increase to v0 + v1 + w1.

# The BayesianOpionionator

I recently finished Robert Matthew’s excellent book Chancing it: The laws of chance – and what they mean for you. One of the themes of the book is that reliance on conventional statistical methods, such as the p-value for measuring statistical significance, can lead to misleading results.

An example provided by Matthews is a UK study (known as The Grampian Region Early Anistreplase Trial, aka GREAT) from the early 1990s of clot-fighting drugs for heart attack patients, which appeared to show that administering the drugs before they reached hospital reduced the risk of death by as much as 77 percent. The range of the effect was large, but was still deemed statistically significant according to the usual definition. However subsequent studies showed that the effect of the drug was much smaller.

Pocock and Spiegelhalter (1992) had already argued that prior studies suggested a smaller effect. They used a Bayesian approach in which a prior belief is combined with the new data to arrive at a posterior result. The impact of a particular study depends not just on its apparent size, but also on factors such as the spread. Their calculations showed that the posterior distribution for the GREAT study was much closer to the (less exciting) prior than to the experimental results. The reason was that the experimental spread was large, which reduced its impact in the calculation.

Given the much-remarked low degree of reproducibility of clinical studies (in the US alone it has been estimated that approximately US\$28,000,000,000 is spent on preclinical research that is not reproducible) it seems that a Bayesian approach could prove useful in many cases. To that end, we introduce the BayesianOpinionator, a web app for incorporating the effect of prior beliefs when determining the impact of a statistical study.

The data for the BayesianOpinionator app is assumed to be in the form of a comparison between two cases, denoted null and treated. For example in a clinical trial the treated case could correspond to a patient population who are treated with a particular drug, and the null case would be a comparison group that are untreated. As mentioned already, a common problem with such studies is that they produce results which appear to be statistically significant, but later turn out to be caused by a fluke. In this case the BayesianOpinionator will help to determine how seriously the results should be taken, by taking prior beiefs and data into account. The method works by representing data in terms of binomial distributions, which as seen below lead to a simple and intuitive way of applying weights to different effects in order to gauge their impact.

The New Data page is used to input the trial results, which can be in a number of different forms. The first is a binary table, with the two options denoted Pos and Neg – for example these could represent fatalities versus non-fatalities. The next is a probability distribution, where the user specifies the mean and the standard deviation of the probability p of the event taking place for each case. Finally, studies are sometimes reported as a range of the odds ratio (OR). The odds for a probability p is defined as p/1-p, so is the ratio of the chance of an event happening to the chance of it not happening. The OR is the odds of the treated case, divided by the odds of the null case. An OR of 1 represents no change, and an OR range of 0.6 to 1 would imply up to 40 percent improvement. Once the odds range is specified, the program searches for a virtual trial which gives the correct range. (The user is also asked to specify a null mean, otherwise the result is underdetermined.)

In all cases, the result is a binomial distribution for the treated and null cases, with a probability p that matches the average chance of a positive event taking place. Note that the problems studied need not be limited to binary events. For example, the data could correspond to diameter growth of a tumor with or without treatment, from a scale of 0 to 1. Alternatively, when data is input using the probability range option, a range can be chosen to scale p between any two end points, which could represent the minimum and maximum of a particular variable. In other words, while the binomial distribution is based on a sequence of binary outcomes, it generalises to continuous cases while retaining its convenient features.

In the next page Prior, the user inputs the same type of information to represent their prior beliefs about a trial. Again, this information is used to generate binomial distributions for the prior case. Finally, the two sets are pooled together in order to give the posterior result on the next page. The posterior is therefore literally the sum of the prior and the new data.

The next page, Odds, shows how the new results compare with the prior in terms of impact on the posterior. The main plot shows the log-OR distribution, which is approximately normal. A feature of the odds ratio is that it allows for a simplified representation within the Bayesian framework. The posterior distribution can be calculated as the weighted sum of the prior and the new data. The weights are given in table form, and are represented graphically by the bubble plot in the sidebar. The size of the bubbles represents spread of log-OR, while vertical position represents weight of the data, with heavy at the bottom.

As shown by Matthews (see this paper), the log-OR plot allows one to determine a critical prior interval (CPI) which can be viewed as the minimum necessary in order for the new result to be deemed statistically significant (i.e. has a 95 percent chance of excluding the possibility of no effect). If the CPI is more extreme than the result, this implies that the posterior result will not be significant unless one already considers the CPI to be realistic. For clinical trials, which for ethical reasons assume no clear advantage between the null and treated cases, the CPI acts as a reality check on new results, because if the results are very striking it shows how flexible the prior needs to be in order to see them as meaningful.

The BayesianOpinionator Shiny app can be accessed here.

# Article on forecasting in Significance magazine

Nice article in Significance magazine that mentions Systems Forecasting:
Great expectations: The past, present and future of prediction
by Graham Southorn

Gives an overview of the history of forecasting, and discusses more recent topics such as prediction markets and the impact of big data.