Friday, 24 June 2016

Thesis Defence Notes - Approximate Bayesian Computation

These are notes from the slides that were given during my defence. I've clarified some points since then.

The other half of the notes, on the cricket simulator, are found here.

------------------

Approximate Bayesian Computation (ABC) is a method of estimating the distribution of parameters when the likelihood is very difficult to find. This feature makes ABC applicable to a great variety of problems.

The classical ABC method can be summarized as:

Step 1: Select a prior joint distribution F(theta) for the parameters of interest.

Step 2: Observe a sample s from the population you wish to characterize.

Step 3: Compute T(s), a statistic, or set of statistics, from the sample.

Step 4: Repeat the following many times

4A: Generate random parameter values p from F(theta)
4B: Use the values p to simulate a sample s'
4C: Compute the distance between T(s') and T(s), the statistics taken from the simulated and observed samples, respectively.
4D: If the distance is less than some tuning value, epsilon, then accept the values p.

Step 5: The collection of values p from all accepted values is taken as the posterior distribution of the parameters of interest.

Step 6: If the parameters of interest are continuous, a smoothing method is applied to this collection to produce a continuous posterior distribution.
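The steps above can be sketched in Python. This is a minimal illustration, not the thesis code: the simulator (a normal model with unknown mean), the summary statistics, the prior, and epsilon are all stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_sample(theta, n=200):
    # Stand-in simulator: n observations from Normal(theta, 1).
    return rng.normal(theta, 1.0, size=n)

def T(sample):
    # Summary statistics: the sample mean and standard deviation.
    return np.array([sample.mean(), sample.std()])

# Steps 2-3: an 'observed' sample (true theta = 3) and its statistics.
T_obs = T(simulate_sample(3.0))

# Steps 1 and 4: draw from the prior, simulate, compare, accept.
epsilon = 0.2
accepted = []
for _ in range(20_000):
    theta = rng.uniform(0.0, 10.0)                 # 4A: prior draw
    s_prime = simulate_sample(theta)               # 4B: simulated sample
    distance = np.linalg.norm(T(s_prime) - T_obs)  # 4C: Euclidean distance
    if distance < epsilon:                         # 4D: accept-reject
        accepted.append(theta)

# Step 5: the accepted thetas approximate the posterior distribution.
posterior_mean = np.mean(accepted)
```

With a small epsilon, most draws are rejected; the surviving thetas cluster around the true value of 3.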

The distance between two statistics can be determined by Manhattan or Euclidean distance, or a weighted distance if deviations between some statistic values are more important than others.
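For concreteness, here are those three distance choices on a pair of hypothetical two-statistic vectors (the values are made up for illustration):

```python
import numpy as np

t_obs = np.array([3.1, 0.52])   # hypothetical observed statistics T(s)
t_sim = np.array([2.8, 0.61])   # hypothetical simulated statistics T(s')

manhattan = np.sum(np.abs(t_sim - t_obs))   # sum of absolute deviations
euclidean = np.linalg.norm(t_sim - t_obs)   # straight-line distance

# Weighted Euclidean distance: deviations in the first statistic
# count twice as much as deviations in the second.
weights = np.array([2.0, 1.0])
weighted = np.sqrt(np.sum(weights * (t_sim - t_obs) ** 2))
```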

The tuning parameter epsilon determines the balance between accuracy and computational efficiency. A small epsilon will ensure that only generated samples that are very similar to the observed sample will be used. That is, only samples that produce statistics very close to the real sample's statistics. However, many samples will be rejected.

A large epsilon will ensure that many samples will be produced, but some of those samples will have statistics that deviate from the observed sample. Since the process of producing new samples could be time-consuming, depending on the application, this may be a worthwhile trade-off.







Compared to the classical ABC method, I made two modifications:

- Instead of having an accept-reject rule, I opted to use every sample but assign weights to the parameters behind each sample based on the distance between T(s') and T(s).

- The weights and smoothing mechanism are determined through a kernel density estimator (KDE).


Specifically, every generated sample gets treated as a point mass in a space spanning the statistic values and parameter values.
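A minimal sketch of this weighting idea, with a Gaussian kernel standing in for the thesis's KDE machinery (the simulator, prior, and bandwidth here are illustrative, not the ones I actually used):

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_sample(theta, n=200):
    # Stand-in simulator: n observations from Normal(theta, 1).
    return rng.normal(theta, 1.0, size=n)

def T(sample):
    return sample.mean()

T_obs = T(simulate_sample(3.0))   # 'observed' statistic, true theta = 3

h = 0.2   # kernel bandwidth, playing the role of epsilon
thetas, weights = [], []
for _ in range(5_000):
    theta = rng.uniform(0.0, 10.0)
    d = abs(T(simulate_sample(theta)) - T_obs)
    thetas.append(theta)
    # Gaussian kernel: no sample is rejected, but the weight decays
    # smoothly with the distance between T(s') and T(s).
    weights.append(np.exp(-0.5 * (d / h) ** 2))

thetas, weights = np.array(thetas), np.array(weights)
# Weighted mean as a point estimate; a weighted KDE over the thetas
# would give the smoothed posterior density itself.
post_mean = np.sum(weights * thetas) / np.sum(weights)
```

Unlike accept-reject ABC, every simulated sample contributes something; samples far from the observed one simply contribute very little.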


Consider the example in the following MS-Paint diagrams. 

In Figures 1-3, we have a sample with a single statistic value, T(s), on the vertical axis, and a single parameter of interest, theta, on the horizontal axis. We don't know the parameter value of the observed sample, so we represent the observed sample as a blue horizontal line.

Points A and B represent the statistic and parameter values from each of two generated samples (also A and B, respectively). These are points instead of lines because these samples are simulated using known parameter values. 

The gray area at the bottom of each figure is the conditional density of the parameter of the observed sample. The thin lines around points A and B are the contour lines of the probability density, with peaks at the points and decreasing outwards.




Figure 1: A generated sample that's far from the observed sample.

In Figure 1, we see the effect of generated sample A on the parameter estimate. The statistic from sample A is far from the statistic of the observed sample, so point A is far from the line. As a result, the probability density does not increase or decrease much along the line, and therefore the conditional density of the parameter estimate is diffuse, as shown by the shallow gray hill.

Figure 2: A generated sample that's close to the observed sample.
 In Figure 2, we see sample B's effect. This sample's statistic is close to that of the observed sample. Therefore, point B is close to the line. The conditional density is much steeper around sample B's parameter value. This is because the probability density increases and decreases a lot along the blue line.
Figure 3: The combined estimate from both samples.
In Figure 3, we see the cumulative effect of both samples. The conditional density from both resembles that from generated sample B. This is because sample B's effect was given more weight. Generated samples that more closely resemble the observed sample contribute more to the parameter estimate.




I used this method on two applications:
1. A theoretical network based on a network of sexual partners in Africa.
2. A real network of precedence citations between decisions of the Supreme Court of Canada.

The theoretical dataset was based on a respondent-driven sample. An initial seed of people was selected, presumably by a simple random process. Members of this seed were polled about the number of sexual partners they had in the previous year, their HIV status, and their sexual orientation. For each reported partner, contact information was taken in order to recruit these partners into the sample as well. If the recruitment pool was exhausted before the desired number of respondents was reached, new seed members were found.

As with the dataset this was based on, we assumed that the sample we were given contained the mentioned covariates of those polled, including the number of partnerships, and that only the partnerships that led to successful recruitment were available. This implies that large parts of the network could be hidden from us, such as partnerships that 'lead back' into the sample already taken.

Consider these two figures.

Figure 4 is from a sample of a network in which identifying information about every reported partnership is given. Figure 5 is from the same sample, where only the partnerships that led to recruitment are known. Notice that Figure 4 includes lines going out to points that we did not sample directly. Notice also that Figure 4 includes cycles, paths in which you could go from one point back to itself by following the lines without backtracking.

We don't see these cycles in Figure 5, the recruitment-only network, because a person cannot be recruited if they are already in the sample. A cycle might be a triangle between persons J, K, and L, where all three are connected. Person J can recruit both K and L through those connections, but K cannot recruit L, so we never see the K-L connection. What we see is one person, J, with two distinct partners instead of the whole picture.  
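The J-K-L triangle can be worked through in code. This sketch (hypothetical names, simple breadth-first recruitment) shows the K-L edge disappearing from the observed network:

```python
# The J-K-L triangle from the text: all three pairs are partners.
partnerships = {("J", "K"), ("J", "L"), ("K", "L")}

def recruit(seed, partnerships):
    """Breadth-first recruitment: a person already in the sample
    cannot be recruited again, so edges 'leading back' into the
    sample are never observed."""
    def neighbours(p):
        return [b if a == p else a
                for a, b in partnerships if p in (a, b)]

    sampled = {seed}
    observed_edges = set()
    frontier = [seed]
    while frontier:
        person = frontier.pop(0)
        for partner in neighbours(person):
            if partner not in sampled:
                sampled.add(partner)
                observed_edges.add((person, partner))
                frontier.append(partner)
    return observed_edges

edges = recruit("J", partnerships)
# J recruits both K and L, but the K-L partnership is never observed:
# the recruitment network is a tree with no cycle.
```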

Socially, the K-L connection is key information, and mathematically this lack of cycles makes estimation of anything about this network a lot harder.

Figure 4: A network with identifying information from the sample.
Figure 5: A network with only recruitment information from the sample.

 Nevertheless, I used ABC to estimate some features (parameters) of the population that this sample came from.
 
The parameters of interest were:

- The number of members in the population, (prior: geometric, mean 2000, truncated minimum 400)
- The probability of an HIV infection passing along a partnership, (prior: uniform 0 to 0.3)
- The proportion of this network that was already infected before any of these partnerships (prior: uniform 0 to 0.5)
- The 'closeness preference', a parameter governing the propensity to select partners that are similar to oneself. Zero is no preference, 10 is a strong preference, and a negative number represents a general dislike for those similar to you. (prior: uniform -2 to 10)
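Drawing one parameter set from these four priors might look like the following sketch. The redraw loop is one plausible way to implement the truncated geometric; the thesis may have handled the truncation differently.

```python
import numpy as np

rng = np.random.default_rng(2)

def draw_parameters():
    # Population size: geometric with mean 2000 (success prob 1/2000),
    # redrawn until it clears the truncation minimum of 400.
    pop = 0
    while pop < 400:
        pop = rng.geometric(1 / 2000)
    transmission = rng.uniform(0.0, 0.3)   # per-partnership infection chance
    initial_rate = rng.uniform(0.0, 0.5)   # proportion infected beforehand
    closeness = rng.uniform(-2.0, 10.0)    # closeness preference
    return pop, transmission, initial_rate, closeness

pop, transmission, initial_rate, closeness = draw_parameters()
```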

I conducted two rounds of Approximate Bayesian Computation by producing 500 and 2500 samples, respectively. The results are shown in Figure 6. The yellow and red represent regions of high conditional probability. ABC allows us to narrow down the likely values of the real parameters to these yellow and red regions.

Parameter sets at points all across these graphs were tried, but only those in the non-blue regions produced samples that were anything like the observed sample.

Figure 6: Coloured contour maps of conditional densities, social network data.

From these, we can calculate marginal means to get our parameter estimates.


Parameter              | True Value | Mean (Initial Round) | Mean (Final Round)
-----------------------|------------|----------------------|-------------------
Population Size        | 1412       | 2049                 | 2094
Initial Infection Rate | 0.467      | 0.399                | 0.374
Transmission Chance    | 0.093      | 0.175                | 0.201
Closeness Preference   | 7.298      | 6.753                | 9.352


The second application was on the precedence citations of decisions in the Supreme Court of Canada (SCC). Because this is real data, I am not able to find the ground truth with which to compare my results, but I can showcase how the method behaves with real data.

When this data was collected, there were 10623 documents produced by the SCC, 9847 of which are cases (e.g. disputes between two parties). These decisions are among the more than 3,000,000 cases and other documents recorded in the Canadian Legal Information Institute (CanLII). Figure 7 shows part of the network of SCC cases.

Figure 7: Induced subgraph of cases and citations in Supreme Court of Canada


Cases span from 1876 to 2016, with more recent cases represented as points near the top. The larger points are those that have been cited often in Canadian law. Lines between points represent citations between SCC cases. Only 1000 cases are shown here, so the true network between SCC cases has roughly 10 times as many nodes and 100 times as many links.


My general interest in this dataset is to identify the features that make a law prone to being obscure over a long time.

I modelled each SCC case as a 'vendor' that was trying to 'sell' citations to future Canadian legal cases. Each SCC case had an attractiveness a, determined by the following parameters:

β_irrel : Parameter determining the decay in attractiveness over time (per 5 years without a citation).
β_pa : Parameter determining a preferential attachment factor, assuming that a case becomes more well-known whenever it is cited.
β_corp, β_crown, β_dis : Parameters determining increased or decreased attractiveness based on whether a case involved a corporation, the Crown, or dissent between Supreme Court judges, respectively.

What I found was less clear than it was in the synthetic data case, as shown in Figure 8.

Figure 8: Coloured contour maps of conditional densities, Supreme Court of Canada data.

It's impossible to read the variable names in this figure, I'm afraid. The gist, however, is that...

- a case that has not been cited in five or more years becomes less and less likely to receive citations over time.
- a case that is cited often is more likely to receive citations in the next five years.
- a case that involves a corporation is less likely to be cited.
- a case that involves the Crown is more likely to be cited.
- a case that had dissenting opinions by some of the judges is more likely to be cited.


I read this: Spatial and Spatio-Temporal Bayesian Models with R-INLA


Spatial and Spatio-temporal Bayesian Models with R-INLA, by Marta Blangiardo and Michela Cameletti, is a short textbook that introduces INLA (integrated nested Laplace approximations) and some preliminaries, and works through the math and R code of a variety of problems that the R-INLA package can tackle. Several non-spatial regression models are also covered in enough detail to get you started on R-INLA in general.

Why I read it: 

I’m working on a research problem in which INLA was recommended as a possible solution. I was sold on the idea, but also warned that the original paper describing the INLA method is a difficult read. (Link) When Dr. Dave Campbell says something is hard, you can safely translate this as ‘herculean’. The paper (Rue, et al. 2009) was 74 pages of material too intimidating for me to examine. So, I looked for other summary work on INLA, and came across this book, which describes INLA and the use of an R package, developed by a community, called R-INLA. R-INLA is not available on CRAN at the time of writing, but it is available at www.r-inla.org

What is INLA: 

INLA, or integrated nested Laplace approximations, is a method used to estimate the likelihood in high-dimensional cases. It integrates out unimportant variables, and then approximates the likelihood using the product of marginals for all these assumed-unimportant variables. Compared to MCMC, or Markov chain Monte Carlo methods, INLA scales much better with the number of dimensions, but it relies on the quality of that extra layer of approximations. Also, INLA is deterministic, rather than simulation-based like most popular Bayesian methods, including ABC (Approximate Bayesian Computation). This has the advantage of replicability, but the disadvantage that you can’t improve the results simply by throwing more computers at it.


Who is this book for: 

This book is for people with some statistical and mathematical fluency already. If you feel ready for Julian J. Faraway’s “Extending the Linear Model with R”, a popular book for teaching GLMs, then you are probably ready for this R-INLA book. That is, a background in calculus and statistical inference is required, and backgrounds in R and in Bayesian statistics are helpful.

From what I’ve read (Chapters 4 and 5), R-INLA is a beast that can provide quick solutions to a vast set of popular analyses, and solutions of any sort for high-dimensional problems. It could be worth your reading time, especially if the INLA method becomes more popular.

There are 300 pages in this text, but 100 pages are dedicated to bringing the reader up to speed on using R and some basic analyses, on Bayesian theory, and on Bayesian computational methods like MCMC, Metropolis-Hastings, and BUGS.

The rest is a guide on applying INLA to Bayesian regression, hierarchical models (my focus for this project), spatial, and Spatio-Temporal models.

What I learned about using R-INLA:

At the basic level, the inla() function in R-INLA is used a lot like the glm() function:
1. Specify a formula, y ~ x. (You can also specify custom functions; see below.)
2. Specify a family of distributions for the likelihood.
3. Specify a dataset.

However, under the hood, it’s performing a Bayesian regression (or other estimation, if you wish), with very diffuse priors by default. For a regression, for example, you can use the control.fixed=list() setting in inla() to...
- change the priors on the intercept and all the regression parameters.
- change the prior on the precision (the inverse of the variance sigma^2) of response estimates, which defaults to Gamma(1, 10^-5) and can be replaced with another gamma or even a distribution from another family.
The book provides a way to specify the standard deviation sigma directly, but it’s a bit of an awkward workaround.

To get the fitted values that you would normally get from $fitted in an lm() or glm() model, you have to specify control.predictor = list(compute=TRUE) in inla() to produce any fitted values in your model object. If you wish to predict for new data where the response is missing, you can use link= in the control.predictor list to specify which predictors to compute, and make sure to include the rows in which the response is missing.

1. The custom functions that you can specify in the formula fed into inla() have many options that allow you to specify that you are doing something more complex than a linear regression. The list of currently available ones is here: http://www.r-inla.org/models/latent-models

For example, you can set up an autocorrelational model by adding f(varname, model="rw1", hyper=list(…), prior=…, param=…) where "rw1" stands for Random Walk of Order 1, hyper stands for the hyper-prior like the one set for precision of estimates, and prior and param define the prior for the parameter(s) of the effect of varname. You can also specify multiple such functions by adding them as f1(…) + f2(…) + ….

If you want to treat levels of a particular variable as random effects, use f(varname, model="iid"), and inla() will treat the effects of varname as independent and identically distributed values from some common distribution. This is particularly useful in hierarchical models. These random effects can be used under different family specifications.

2. When specifying a family, the default is Gaussian. If you want to reduce the effect of outliers, use a family with fatter tails, like Gosset’s Student’s t. Remember, this is maximizing likelihood, not minimizing squared error, so a fatter-tailed family means that the likelihood doesn’t drop off as quickly as deviations get larger. As with other parts of inla(), you can override the defaults (like the degrees of freedom for the t, which it would otherwise get from n-p) with a list in control.family.

Other options, listed at http://www.r-inla.org/models/likelihoods , include Weibull, skew normal, binomial (for logistic regression), Poisson (for regression on counts), and Cox (as in Cox Proportional Hazard for survival data). 



What else I learned:

A Markov random field, or MRF, is a network of variables that shows explicitly which variables depend on which others. A connection between nodes represents a dependency, and a lack of a connection between nodes represents conditional independence. If the variables form a multivariate normal distribution, you have a GMRF, or Gaussian Markov random field. I can see how a concept like this would be useful in identifying which variables could be safely integrated out. The ‘Markov’ part specifies that only the immediate connections imply dependence.

The approxfun() function. It’s in base R, and it takes two numeric vectors of equal length, x and y, and makes a function that you can use to get an interpolated y value for any new x. By default, it does nothing more sophisticated than linear interpolation, but base R’s splinefun() is the spline-based extension of the same idea. Functions output by approxfun() can be used just like other continuous functions of x, like sin() and cos(), which is very handy when integrating with integrate().
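For Python users, numpy's interp() is the closest analogue to approxfun()'s default behaviour: linear interpolation between known points, wrapped here (with made-up data) so it can be called like any other function.

```python
import numpy as np

# Hypothetical data: a curve observed only at four points.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = x ** 2

def f(new_x):
    # Plays the role of the function returned by approxfun():
    # linear interpolation between the known (x, y) pairs.
    return np.interp(new_x, x, y)

f(1.5)  # halfway between (1, 1) and (2, 4), so 2.5
```

Like approxfun() output in R, f can then be handed to a quadrature routine just like sin() or cos().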



Original paper: 

Rue, H., Martino, S., & Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2), 319-392.

Tuesday, 21 June 2016

Thesis Defence Notes - Cricket Simulator

These are notes from the 'lost slides' on cricket that I didn't have time to cover in my PhD defense, which I completed two weeks ago.

The other half of the notes, on approximate Bayesian computation (ABC) for networks, is found here.


------------------

Cricket is a sport popular in the emerging economies of the Indian subcontinent. Most of my cricket work has focused on the Twenty20 format, in which matches are designed to last only three hours, roughly the same time as matches in other popular team sports. This format has only been played professionally for about 10 years, and it is growing quickly in terms of spectator and commercial interest. The intricacies of this format are also poorly understood, which provides many good research opportunities.

In the Twenty20 format, each team is allowed one innings to score runs. (Yes, 'innings', not 'inning'.) That innings is over if either...

1) 10 of the 11 players are dismissed (a dismissal is also called an 'out' or a 'wicket'), or
2) 120 'fair' balls are thrown. A group of 6 throws is called an 'over', making 20 overs per innings. Hence the name Twenty20.

A small number (typically fewer than 10) of throws per innings do not count against the limit of 120 because they were not thrown in a prescribed manner. A small number of throws also incur a one-run penalty that is added to the batting team's score. These cases are similar to 'balls' and 'hits-by-pitch' in baseball, respectively.

Each of these 120+ throws has a discrete result that we simplified into six categories: Wicket, 0 Runs, 1 Run, 2-3 Runs, 4-5 Runs (Ground Rule Double), and 6 Runs (Home Run).  Cases where 3 runs and 5 runs occurred were rare, so we treated them as a fixed proportion of 2-3 runs (~6.4%) and 4-5 (~1.5%) runs respectively. Likewise, throws not counting against the 120 limit, and those resulting in a penalty run, were treated as a fixed proportion of each of the six categories.
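As a sketch (with made-up outcome probabilities), a single ball in this six-category simplification can be drawn like so, with the fixed 3-run and 5-run shares applied inside their categories:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical probabilities for the six simplified outcomes.
outcomes = ["wicket", "0", "1", "2-3", "4-5", "6"]
probs    = [0.05, 0.35, 0.35, 0.10, 0.10, 0.05]

def runs_for_ball():
    """Return (runs scored, whether a wicket fell) for one fair ball."""
    o = rng.choice(outcomes, p=probs)
    if o == "wicket":
        return 0, True
    if o == "2-3":
        # 3s were treated as a fixed ~6.4% share of the 2-3 category.
        return (3 if rng.random() < 0.064 else 2), False
    if o == "4-5":
        # 5s were treated as a fixed ~1.5% share of the 4-5 category.
        return (5 if rng.random() < 0.015 else 4), False
    return int(o), False
```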

Here is a table of a team's batting and bowling summaries from a recent T20I match, from ESPN CricInfo.


All of these metrics, such as 'Strike Rate', are based solely on the number of events (e.g. balls faced, times dismissed, runs scored), without respect to when they happened. The inherent weakness of all these metrics is that they treat each throw as an identical event, independent of the number of wickets and throws remaining. For the baseball crowd, consider the difference between overall batting average and batting average with runners in scoring position.

Reality is more complex. Consider this:


 This is a heat map of the empirical chance of any throw/ball/pitch ending as a six (a home run in baseball terms) as a function of overs and wickets taken. All games start in the upper-left corner, with 20 overs and 10 wickets remaining. From there, plausible game situations 'fan out', where games with situations following the top of the fan are those where relatively few batters are dismissed by the time most of the overs are used.

In these games, many sixes occur in the late game. Scoring a six involves hitting the ball with great power, which involves taking a greater risk of a dismissal. Since in these low-dismissal games, remaining throws are a more precious resource than remaining wickets, this high-risk, high-reward strategy is rational.

Here's the raw probability of scoring a four, which is similar to a ground rule double in baseball: a ball that touches the ground in bounds, but bounces or rolls out of bounds. As such, the fielders have an effect on the outcome that they cannot usually have on sixes. Notice that the general pattern of fours is similar to that of the sixes, except for the major drop between the 6th and 7th overs. From the 7th over onwards, the players in the field are more restricted, and so more balls are able to roll out of bounds to score 4 runs.


Likewise, the batting team gets really desperate in the last few overs, so they try reckless moves that frequently end in a wicket (a dismissal of a batting player). We also see wickets getting more likely when there are very few batters left. This seems counter to the intuitive strategy - that running out of batters would cause the remaining batters to be more cautious. Instead, you're seeing the effect of putting the worst batters on a team last in the batting order - they just get dismissed more often.


Here's the major confound: A batting player continues to bat until dismissed or until the innings ends. So a starting batter will rarely, if ever, see a situation where there are few wickets remaining. Likewise, a player batting late in the lineup will never enjoy the game situations with many wickets remaining. Therefore, a late-batting player has less opportunity to aim for those sweet, sweet, sixes.

We developed a system for estimating 'situational effects', which are the relative propensities towards a given outcome (i.e. wicket, 0, 1, 2-3, 4-5, 6) holding batter and bowler constant. These effects were estimated by first estimating odds ratios of events between 'adjacent' situations (those differing by one over or one wicket), within individual player records.

The data in these heatmaps are used to find these situational effects, which control for individual player effects.

Having these situational effects in place, we were able to estimate many additional parameters. First, we estimated the probability of any given player producing a given outcome at some 'baseline' situation. To do this, we gave each ball faced (or ball thrown, for bowlers) a weight. If a six happened in a situation when sixes were more likely (again, controlling for player differences), a small weight was given. If a six happened at a time when sixes were rare, a large weight was given. This way, we could compare batters and bowlers to each other, even when they played under very different situations.

To knit this all together into estimates of a player's batting or bowling outcomes at a 'baseline' situation, we used a Metropolis-Hastings within Gibbs sampling method. More specifically, we drew a set of values from a Dirichlet distribution (an extension of the Beta distribution to multinomial cases) and proposed those as a player's outcome probabilities at baseline. Then we computed the likelihood of these probabilities given the player's history and the estimated situational effects. Proposals that fit the data better have a higher likelihood, and these are given more weight in the final estimate. The final estimate for a player is an aggregate of thousands of these proposals.
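A toy version of that propose-and-weight loop, using simple importance weighting rather than the full Metropolis-Hastings-within-Gibbs scheme, and with a made-up batting record:

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical counts of the six outcomes (wicket, 0, 1, 2-3, 4-5, 6)
# in one player's record at the baseline situation.
counts = np.array([2, 8, 10, 3, 2, 1])

proposals, logliks = [], []
for _ in range(10_000):
    p = rng.dirichlet(np.ones(6))        # propose outcome probabilities
    loglik = np.sum(counts * np.log(p))  # multinomial log-likelihood
    proposals.append(p)
    logliks.append(loglik)

proposals = np.array(proposals)
logliks = np.array(logliks)
w = np.exp(logliks - logliks.max())      # stabilized likelihood weights

# Aggregate the proposals, weighting better-fitting ones more heavily.
estimate = (w[:, None] * proposals).sum(axis=0) / w.sum()
```

Proposals that explain the record poorly get weights near zero, so the aggregate lands near the outcome rates the data support.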

I treated situational effects as normalized multiplicative effects on these baseline probabilities. In the following formula,

pi_owj = (tau_owj * pi_70j) / sum over k of (tau_owk * pi_70k),

the situational effects are the tau_owj's, and pi_70j is a player's probability of outcome j at the baseline situation (7 overs, 1st wicket). The left-hand side, pi_owj, represents the probability of that player getting outcome j in situation (o overs, w wickets). From here, making a simulation was just a matter of making a Markov chain that followed the rules of the game.
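A bare-bones version of such a chain, using constant, made-up outcome probabilities in place of the full situational pi_owj table:

```python
import numpy as np

rng = np.random.default_rng(5)

# Runs credited for each outcome category: wicket, 0, 1, 2-3, 4-5, 6
# (3s and 5s are folded into 2 and 4 here for brevity).
runs_per_outcome = [0, 0, 1, 2, 4, 6]
# Hypothetical constant probabilities; in the real simulator these
# vary with the situation (overs and wickets) and with the players.
probs = [0.05, 0.35, 0.34, 0.11, 0.10, 0.05]

def simulate_innings():
    """Markov chain over (balls bowled, wickets lost): the innings
    ends after 120 fair balls or when the 10th wicket falls."""
    runs, wickets_left = 0, 10
    for _ in range(120):
        j = rng.choice(6, p=probs)
        if j == 0:                  # a wicket falls
            wickets_left -= 1
            if wickets_left == 0:
                break
        else:
            runs += runs_per_outcome[j]
    return runs

score = simulate_innings()
```

Running this many times gives a distribution of innings totals, which is the raw material for the papers below.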

This simulator has been the basis of five published or submitted papers, including the two in the first half of my thesis:


Davis, J., Perera, H., & Swartz, T. B. (2015). A simulator for Twenty20 cricket. Australian & New Zealand Journal of Statistics, 57(1), 55-71.
 
Davis, J., Perera, H., & Swartz, T. B. (2015). Player evaluation in Twenty20 cricket. Journal of Sports Analytics (preprint).

 Perera, H., Davis, J., & Swartz, T. B. (2016). Optimal lineups in Twenty20 cricket. Journal of Statistical Computation and Simulation, 1-13.

Perera, H., Davis, J., & Swartz, T. B. (2016). Assessing the impact of fielding in Twenty20 cricket. In submission.

Silva, R., Davis, J., & Swartz, T. B. (2016). Mismatches in Twenty20 cricket. In submission.