Wednesday, 25 February 2015

New powerplay metrics applied to 2014-15 NHL teams

In this previous post I made a case for a new powerplay metric called GDA-PK (Goal-Difference Average -  Penalty Kill) that used the amount of time in a powerplay state rather than the number of powerplay instances as its basis. My argument was essentially that not every powerplay is the same: some are majors, some are interrupted by goals or additional penalties, and others are confounded by pulling the goalie and by shorthanded goals.

My aim was to present a new metric that would measure the effectiveness of special lineups in a way that removed a lot of the noise and was easier to interpret in the context of the rest of the game.

GDA-PK represents the average goals gained or lost per hour when killing a penalty. In Table 1 below, for example, the Vancouver Canucks have a GDA-PK this season of -3.76, so they fall behind by 3.76 goals for every 60 minutes in a 4-on-5 situation. Put another way, the 'nucks lose a goal for every 60/3.76 = 16.0 minutes of penalty killing.

Likewise, GDA-PP represents the average goals gained or lost per hour of powerplay. Table 1 shows that the Detroit Red Wings manage to gain 8.00 goals for every 60 minutes of powerplay, or get ahead by 1 goal for every 7.5 minutes spent on the powerplay. This isn't the same as gaining 1 goal for every 7.5 minutes of penalty in favour of the Red Wings, because a lot of those penalties will be cut short by conversions.
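The conversion between goals per hour and minutes per goal is a simple reciprocal. Here is a minimal sketch in R; minutes_per_goal() is a helper name made up for this illustration, and the two inputs are taken from Table 1.

```r
## Hypothetical helper (not part of any package): convert a GDA value
## in goals per hour into minutes of special-teams time per goal.
minutes_per_goal = function(gda) 60 / abs(gda)

minutes_per_goal(8.00)   # DET GDA-PP from Table 1: 7.5 minutes per goal gained
minutes_per_goal(-3.56)  # PIT GDA-PK from Table 1: about 16.9 minutes per goal lost
```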

Also note that it's 'Goal Difference', not goals scored/allowed. This way, both measures account for shorthanded goals by treating them as negative powerplay goals.
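As a rough sketch of the definition, the calculation looks something like the following. The function name gda() and the season totals are invented for illustration; they are not the script or the data behind the tables.

```r
## Hypothetical helper: goal difference per 60 minutes spent in a state.
## goals_for and goals_against are counted only while the state is active;
## seconds is the total time spent in that state.
gda = function(goals_for, goals_against, seconds) {
  (goals_for - goals_against) / (seconds / 3600)
}

## Made-up totals for one team on the powerplay: 30 powerplay goals scored,
## 3 shorthanded goals allowed, and 5.5 hours of powerplay time.
## The shorthanded goals count as negative powerplay goals.
gda(goals_for = 30, goals_against = 3, seconds = 5.5 * 3600)
```

With these made-up inputs the result is 27 goals over 5.5 hours, or about 4.9 goals/hr.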

I wanted to see if there would be any major divergences between GDA measures and the traditional PP% and PK% (PowerPlay percent and Penalty Kill %) measures in terms of ranking the teams. Ideally, both measures would agree in rankings because they are intended to measure the same thing.

Results


Table 1: Powerplay and Penalty Kill Goal Differential per Hour (goals for minus goals against) and related statistics for teams for the 901 of 1230 regular season games up to 2015-02-24, inclusive.


Team GDA-PK PK-Pct GDA-PP PP-Pct
ANA -3.44 0.814 4.35 0.177
ARI -7.77 0.779 5.97 0.212
BOS -5.62 0.824 5.00 0.174
BUF -6.88 0.754 1.59 0.118
CAR -3.26 0.880 5.66 0.185
CBJ -5.44 0.808 6.32 0.207
CGY -4.64 0.803 5.96 0.175
CHI -3.96 0.862 6.14 0.187
COL -4.49 0.832 3.56 0.133
DAL -6.18 0.793 5.43 0.180
DET -5.33 0.832 8.00 0.252
EDM -5.67 0.780 4.41 0.152
FLA -6.66 0.790 4.04 0.148
L.A -5.35 0.798 5.17 0.186
MIN -3.63 0.860 3.44 0.159
MTL -3.88 0.856 4.90 0.170
N.J -6.35 0.798 5.20 0.195
NSH -5.24 0.820 5.80 0.173
NYI -6.69 0.746 5.88 0.185
NYR -3.33 0.830 4.89 0.182
OTT -4.31 0.825 4.65 0.172
PHI -8.49 0.760 6.76 0.233
PIT -3.56 0.856 4.81 0.205
S.J -5.26 0.804 5.90 0.205
STL -6.67 0.804 6.67 0.233
T.B -4.83 0.833 4.11 0.171
TOR -4.52 0.821 4.52 0.186
VAN -3.76 0.859 4.83 0.184
WPG -4.75 0.807 5.05 0.188
WSH -6.17 0.808 7.76 0.237

League-wide GDA-PK: -5.163
League-wide GDA-PP: +5.163

Correlations between proposed and classic measures:
PP Pearson r = 0.892, Spearman's rho = 0.810
PK Pearson r = 0.833, Spearman's rho = 0.841

The league-wide GDA-PP and the GDA-PK balance out by definition.

Notable performances are highlighted in Table 1. The Carolina Hurricanes seem to get a lot more mileage out of their powerplays than their opponents (5.66 goals/hr gained vs. 3.26 goals/hr lost). Philadelphia and Washington are very exciting to watch during penalties in both directions.

After 60 games, each team has had 5-6 hours of powerplay time, and 5-6 hours of penalty kill time. As such, there's still a lot of uncertainty in the GDA-PK and PP of individual teams. Each team value should be considered to have a 'plus-or-minus 2 goals/hr' next to it. For example, Boston's GDA-PP of 5.00 could really mean they've been gaining 3 goals/hr on the powerplay and getting some lucky bounces, or that they should be gaining 7 goals/hr, but can't catch a break this season.
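One way to sanity-check a margin of that size is a back-of-envelope Poisson calculation. This is my own simplifying assumption for illustration (goals for and against arriving as independent Poisson processes), and the inputs are round numbers rather than any team's actual totals.

```r
## Rough standard error of a goals-per-hour difference, ASSUMING goals for
## and goals against are independent Poisson processes.
## If N ~ Poisson(rate * hours), then Var(N / hours) = rate / hours.
gda_se = function(rate_for, rate_against, hours) {
  sqrt((rate_for + rate_against) / hours)
}

## Illustrative inputs: about 5 goals/hr scored, none allowed,
## and 5.5 hours of powerplay time.
se = gda_se(rate_for = 5, rate_against = 0, hours = 5.5)
2 * se   # two standard errors, roughly 1.9 goals/hr
```

Two standard errors under these assumptions come out close to the 'plus-or-minus 2 goals/hr' rule of thumb.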

Further inference from these values is possible. We could make reasonable GDA-PK and PP estimates of specific matchups. For example, Pittsburgh's excellent penalty killing and Washington's excellent powerplay skills should partly cancel each other out. We would expect the Washington Capitals to gain (WSH PP + (- PIT PK) - LeagueAvg) = (7.76 + 3.56 - 5.163) = 6.16 goals/hr on the Penguins with a 1-man advantage.
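The matchup arithmetic can be wrapped in a small function. The name matchup_gda() is invented here; the inputs are the Table 1 values for Washington and Pittsburgh and the league-wide figure.

```r
## Hypothetical helper: expected powerplay goal difference per hour for an
## attacking team against a specific penalty-killing team.
## pk_defender is the defender's GDA-PK (a negative number).
matchup_gda = function(pp_attacker, pk_defender, league_avg) {
  pp_attacker + (-pk_defender) - league_avg
}

## WSH powerplay against PIT penalty kill
matchup_gda(pp_attacker = 7.76, pk_defender = -3.56, league_avg = 5.163)
```

This gives 6.157, which rounds to the 6.16 goals/hr quoted in the text.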

We could also find the average length of a minor or major penalty, so we can use this to estimate how many goals a given penalty is worth, either league wide or for a given team. We can also find the success rate of penalty shots, assuming we can use the shootout to increase our sample size, so we could also find out how many penalty minutes a penalty shot is worth.
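To show the shape of that calculation, here is a sketch that converts a goals-per-hour rate into goals per penalty. The average time served on a minor has not been computed in this post, so the 1.75 minutes below is purely a placeholder.

```r
## ASSUMPTION: 1.75 minutes is a placeholder for the average time actually
## spent on the powerplay per minor penalty, not a measured value.
avg_minor_minutes = 1.75

league_gda_pp = 5.163   # league-wide GDA-PP from Table 1

## Expected goal difference produced by one minor penalty
goals_per_minor = league_gda_pp * avg_minor_minutes / 60
goals_per_minor
```

With the placeholder length, a minor would be worth about 0.15 goals; the real figure depends on the measured average.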

The correlation measures Pearson's r and Spearman's rho are positive and far from zero (they are on a scale from -1 to 1). These correlations indicate a strong agreement between the GDA measures and the traditional measures; a team that puts up good PP% / PK% numbers will put up comparably good GDA-PP/PK numbers.
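For anyone who wants to try this themselves, the powerplay correlations can be approximated from the GDA-PP and powerplay percentage columns of Table 1 using cor() from base R. The table values are rounded, so the results may differ slightly from the figures reported above.

```r
## GDA-PP and powerplay percentage, transcribed from Table 1 (ANA to WSH)
gda_pp = c(4.35, 5.97, 5.00, 1.59, 5.66, 6.32, 5.96, 6.14, 3.56, 5.43,
           8.00, 4.41, 4.04, 5.17, 3.44, 4.90, 5.20, 5.80, 5.88, 4.89,
           4.65, 6.76, 4.81, 5.90, 6.67, 4.11, 4.52, 4.83, 5.05, 7.76)
pp_pct = c(0.177, 0.212, 0.174, 0.118, 0.185, 0.207, 0.175, 0.187, 0.133, 0.180,
           0.252, 0.152, 0.148, 0.186, 0.159, 0.170, 0.195, 0.173, 0.185, 0.182,
           0.172, 0.233, 0.205, 0.205, 0.233, 0.171, 0.186, 0.184, 0.188, 0.237)

r   = cor(gda_pp, pp_pct, method = "pearson")   # linear agreement
rho = cor(gda_pp, pp_pct, method = "spearman")  # agreement in rankings
c(pearson = r, spearman = rho)
```

Spearman's rho is the more relevant of the two when the question is whether the two measures rank the teams the same way.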

Furthermore, we can use a similar metric to isolate even-strength situations and see how well a team fares when there are no penalties involved. In Table 2, GDA-EV (Goal Difference Average - EVen Strength) refers to the number of goals a team gains or loses on average per 60 minutes of even strength play. A positive number represents a team that outscores their opponents when there are no active penalties (or only offsetting ones), and a negative number represents a team that falls behind at even strength.

Table 2: Even-Strength Goal differential per hour (goals for minus goals against) and related statistics for teams for the 901 of 1230 regular season games up to 2015-02-24, inclusive.


Team GP GoalDiff GD/G GDA-EV
ANA 61 8 0.13 0.06
ARI 61 -71 -1.16 -1.19
BOS 60 4 0.07 0.08
BUF 61 -94 -1.54 -1.09
CAR 59 -24 -0.41 -0.67
CBJ 59 -31 -0.53 -0.66
CGY 60 12 0.20 -0.08
CHI 61 29 0.48 0.28
COL 61 -17 -0.28 -0.08
DAL 59 -11 -0.19 0.00
DET 61 25 0.41 0.17
EDM 59 -65 -1.10 -1.05
FLA 62 -21 -0.34 0.00
L.A 60 16 0.27 0.48
MIN 59 11 0.19 0.16
MTL 60 25 0.42 0.47
N.J 60 -20 -0.33 -0.04
NSH 60 42 0.70 0.86
NYI 61 22 0.36 0.37
NYR 62 43 0.69 0.58
OTT 59 4 0.07 -0.11
PHI 61 -12 -0.20 -0.10
PIT 60 25 0.42 0.37
S.J 61 0 0.00 -0.22
STL 60 32 0.53 0.54
T.B 62 38 0.61 0.61
TOR 60 -16 -0.27 -0.39
VAN 60 13 0.22 0.10
WPG 62 3 0.05 0.23
WSH 61 30 0.49 0.34

League-wide GDA-EV 0.000

Correlation between goal difference per game, and GDA-EV:
Pearson's r = 0.941 , Spearman's rho = 0.916

The league-wide GDA-EV is exactly 0, as expected. There is a very strong linear relationship between GDA-EV and simple goal differential, which may indicate that powerplay performance isn't as important as it's made out to be. Looking at all goals, the Buffalo Sabres seem to be uniquely awful, but when you remove the effect of penalties, they're merely as bad as Edmonton and Arizona.

Another note is that even though Montreal is currently at the top of the Eastern Conference standings, they're only putting up about half a goal per hour more than their opponents at 5-on-5 or 4-on-4. They're on par with the L.A. Kings, who are currently on the playoff bubble.

Since teams spend the large majority of their time at even strength, GDA-EV has a lot of data to draw from. You should consider each value to have a 'plus-or-minus 0.5 goals/hr' next to it.

This is still very much a work in progress, and these measurements should be taken as preliminary only. There are likely still some bugs to work out that have gone unseen. 

In this previous post, I gave a short demo of the nhlscrapr package, which allows anybody to download a play-by-play table of every hit, shot, faceoff, line change, penalty, and goal recorded. The data used to make Tables 1 and 2 come from nhlscrapr. Select details are found in the methodology below.

Methodology


I've written an R script to count the time spent in each powerplay state and the number of events (in this case, events are goals) that occur in each state and for/against each team. A cautionary note to those trying to do similar things: the numbers in the home.skaters and away.skaters variables are unreliable and you're better off tracking them yourself if it's critical. The other variables appeared to be pretty reliable, such as the home and away goalie IDs, which made identifying empty-net times possible.
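The general idea of counting time and goals by state can be sketched on a toy example. This is not the script used for the tables, and the data frames below are made up rather than taken from nhlscrapr output; they only show one way the bookkeeping could work.

```r
## Toy record of one game (made-up numbers). Each row marks the second at
## which the strength state changes, from one team's point of view.
changes = data.frame(
  seconds = c(0, 500, 620, 1800, 1875, 3600),
  state   = c("EV", "PP", "EV", "PK", "EV", "END"),
  stringsAsFactors = FALSE
)

## Time spent in each state = gap until the next state change
durations     = diff(changes$seconds)
states        = head(changes$state, -1)
time_in_state = tapply(durations, states, sum)

## Goals, with the second they were scored and +1 (for) or -1 (against)
goals = data.frame(seconds = c(560, 1200, 1850), value = c(1, 1, -1))

## findInterval() looks up which state was active when each goal was scored
goals$state = changes$state[findInterval(goals$seconds, changes$seconds)]
goal_diff   = tapply(goals$value, goals$state, sum)

## Goal difference per hour in each state
goal_diff / (time_in_state[names(goal_diff)] / 3600)
```

A single game gives wild per-hour rates, of course; the point is only that time and goals are both accumulated per state before dividing.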

I apologize, but I won't be sharing the script to get this data until I've considered (and possibly taken advantage of) the academic publication potential of the results in the tables above. Also, the script is a modification of one used for a research paper Dr. Paramjit Gill and I currently have in the submission stage, and I don't want to complicate that process.

Overtime and two-man advantage situations were ignored. Neither the goals nor the time spent are included here. However, the league-wide goal differential in two-man advantage situations seems to be in the 12-16 goals-per-hour range. There is nowhere near enough data to say anything about individual teams in these situations.

The league-wide measures are not the raw average of the values from each of the 30 teams because some teams spend more time short handed or on the power play. Instead, they are computed by treating the whole league as a single team playing against itself. If a team draws more penalties and spends more time on the powerplay, that team's performance will count for more towards the league-wide average than other teams. However, in reality, every team is contributing between 3.0% and 3.6% towards each league measure.
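The difference between the pooled league value and a raw average of team values can be seen with three imaginary teams. All numbers below are made up for illustration.

```r
## Made-up totals for three teams (NOT real data)
pp_goal_diff = c(A = 30, B = 20, C = 40)     # powerplay goal difference
pp_hours     = c(A = 5.0, B = 5.5, C = 6.0)  # hours spent on the powerplay

team_gda   = pp_goal_diff / pp_hours
raw_mean   = mean(team_gda)                     # simple average of team values
pooled_gda = sum(pp_goal_diff) / sum(pp_hours)  # league treated as one team

## Each team's weight in the pooled value is its share of powerplay time
weights = pp_hours / sum(pp_hours)
c(raw_mean = raw_mean, pooled = pooled_gda, weighted = sum(weights * team_gda))
```

The pooled value equals the time-weighted average of the team values, which is why teams with more powerplay time count for slightly more.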

Any goals scored when the goalie was on the bench for an extra skater (e.g. during a delayed penalty call or when a team is losing at the very end of the game) were ignored. This includes goals on empty nets, goals by teams with empty nets, and that time the Flames did both.  The time where one (or both?) goalies were off the ice was deducted from the appropriate situation. There was a mean average of about 50 seconds of empty net per game, with a lot of games with no empty net time.

Finally, only regulation play (the first 60 minutes of each game) is considered. This is to filter out any confounding issues such as teams scoring less often in overtime than they do in regulation.

Saturday, 14 February 2015

More examples of the Play to Donate model.

Tree Planet 3

This is the first in the Korean mobile game series that is available in English. Tree Planet 3 is a tower defense game where you defend a baby tree from waves of monsters and possessed livestock. You do this by recruiting stationary  minions to beat down the monsters as they parade along the side of the path or paths to the tree. You also have a tree hero that has some movement and special abilities to slow or damage monsters.

Stages are a mobile-friendly 3-5 minutes long, and finishing a set of three stages results in a real tree being planted (with proof of planting provided).

Part of the revenue for the game and the planting comes from the game's freemium pay-to-win model: beating a stage provides rewards to slowly unlock and upgrade new heroes, but rewards can be purchased for $USD right away. The rest comes from sponsorship from NGOs and from corporations that benefit from trees being planted, such as a paper company in Thailand that will buy back the trees later, and a solar power company in China that uses trees to catch dust that would otherwise blow onto their panels.

I greatly enjoy Tree Planet 3. For some stages I've managed to set up minions to win almost without looking, but for others I'm still working out a viable strategy. The monsters and maps are varied in function as well as look, which is better than I expect from most mobile games, let alone a charity one.

Aside from purchasing upgrades, play doesn't seem to have any effect on the revenue generated for the planting system, even though playing supposedly causes planting. Is the sponsorship paid on a per-tree basis? I have questions, but the FAQ is in Korean only.

Aside from the sponsorship there were no advertisements such as a rolling banner; I would love to know how that affects a game like this.

---------------------

Play to Cure: Genes in Space

This game was commissioned as part of a 48-hour programming event by a body called Cancer Research UK, if I'm recalling the details correctly.

In a stage of Genes in Space, the player plots a course through a map based on some genetic data that needs analysis.  Then the player pilots their spacecraft in an over-the-view through the map with the genetic data showing, as well as the path they charted.

The goal of the player is to fly through as much of the genetic data as possible. The goal of the designer is to use multiple players charting and flying through the same map to find where the abrupt changes in genetic information occur, which will match with the points where an optimal player changes her flight path.

Unlike the previously mentioned charity games, this one has the appeal that gameplay is actually creating value, rather than using your efforts to move money from somewhere to somewhere nicer. By playing through a level, you are in fact contributing to science, and the rewards to upgrade your ship and climb a leaderboard are related to the density of the genetic data that you manage to fly through (which in turn relates to the quality of the contribution to analysis).

I enjoy the flavour of the game, and being able to paint my vessel to look like a bumblebee gummyship gathering genetic pollen.

However, the game was rushed and even after patches this is painfully obvious. It is unclear how ship upgrades help you, aside from the occasional meteor storm event that occurs after gathering, and which seems unnecessary and tedious.

The flight portion of the levels is extremely short: about 30 seconds to play, which follows 60 seconds to chart and 60-90 seconds of loading screens and menus. By prolonging the flights by a factor of 2, I would spend more time on the best part of the game and probably make better flight decisions, at the cost of processing maps 15-20% slower. Remove the meteor storm event and make ship upgrades aesthetic only, and even that loss is gone.
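The slowdown estimate is simple arithmetic on the timings above, sketched here in R with the rough stopwatch figures from the paragraph.

```r
## Approximate seconds per map, using the timings quoted above
flight  = 30
chart   = 60
loading = c(60, 90)   # low and high estimates for loading screens and menus

current_cycle = flight + chart + loading       # 150 to 180 seconds per map
longer_cycle  = 2 * flight + chart + loading   # 180 to 210 seconds per map

## Percent increase in time per map if flights were twice as long
round(100 * (longer_cycle / current_cycle - 1), 1)   # 20.0 16.7
```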

By far the worst problem is a bug where sometimes a blank map is presented, on which no route can be planned, and flight never ends. This could be because all the genetic data has been analyzed, but I suspect it's an issue with corrupt genome data.

What elevates this from an annoyance to a deal breaker is the inability to do anything about it. There is no option to scrap the map and try a new one, and sometimes even uninstalling, then resetting the phone, then reinstalling fails to fix the issue.

If I'm right about the source, the problem could be cleared up by giving players more autonomy regarding their contributions. An option to manually  download a new map, or several maps at once for offline play should be simple. It would also fix some of the loading issues and improve gameplay, especially if players could go through multiple genomes in one long flight.

One final note: I know finding these breakpoints in a genome is ineffective for a machine because I looked it up, but this player data should make for a fantastic training set.

Monday, 9 February 2015

Package spotlight - quantreg

Linear regression and its extensions are great for cases when you wish to model the mean of values Y in response to some predictors X.

Sometimes, however, you may be interested in the median response, or something more general like the cutoff for the top 20 percent of the responses for a given set of predictors. To model such cutoffs, quantile regression excels, and the quantreg package makes it feel a lot like linear regression.

This spotlight includes a simple bivariate normal example, then moves on to a more complex application that could be used to track player progression in a video game (e.g. a future digitization of Kudzu). First, we generate the bivariate normals, where y is the response to x with normal variation, and x is the predictor.

set.seed(12345)
x = rnorm(1000)
y = x + rnorm(1000)


Since the normal distribution is symmetric, the linear regression and quantile regression about the median should both approximate a slope of 1 and intercept of 0.

lm(y ~ x)$coef
(Intercept)           x 
-0.03220106  1.03914825 

library(quantreg)
rq(y ~ x)$coef
(Intercept)           x 
-0.02357192  0.98935700 

Furthermore, for other quantile regressions of this dataset, we would expect the slope to remain near 1 and the intercept to approximate the univariate normal quantiles. We expect this because the random variation in y is known to be Norm(0,1).

### 25th and 75th percentiles
rq(y ~ x, tau=c(.25,.75))
            tau= 0.25 tau= 0.75
(Intercept) -0.729756 0.6756721
x            1.042261 1.0169105

qnorm(c(.25,.75))
 -0.6744898  0.6744898

 
### 70th, 90th, 95th, 97.5th, 99.5th percentiles
rq(y ~ x, tau=c(.7,.9,.95,.975,.995))
            tau= 0.700 tau= 0.900 tau= 0.950 tau= 0.975 tau= 0.995
(Intercept)  0.5099463   1.298006   1.545429   1.833121  2.5811008
x            1.0036638   1.043121   1.023059   1.010482  0.8197345

qnorm( c(.7,.9,.95,.975,.995))
0.5244005 1.2815516 1.6448536 1.9599640 2.5758293


As you might expect, the model behaves unpredictably for extreme quantiles because useful data grows sparse. Finally, we can draw all these quantile regression lines.

 mod = rq(y ~ x, tau=c(.25,.5,.7,.9,.95,.975,.995))
 coefs = mod$coef
 plot(y ~ x)
 abline(h=0)
 abline(v=0)
 for(k in 1:ncol(coefs))  {   abline(coefs[,k], col="Red", lwd=2) }
 


For this next example, we are interested in estimating how many players can reach an arbitrary score in a game after playing a certain amount of time. This sort of information is valuable when tuning the difficulty of a "reach a score of Y" challenge in a game so that few players can reach the goal right away but most players can after some effort. Consider the best score a player reaches in a game as the response Y and time spent on a challenge as a predictor X.

One problem: scores in games do not always increase linearly or in a simple transformation of linear. To handle this, the quantreg package also includes nlrq(), a non-linear quantile regression function. As before, we generate some player data.

#### Make a database of simulated users for our game
set.seed(12345)
Nusers = 1000

## An average of 20 minutes played, capped at 100 minutes
minutes = rexp(Nusers, rate=1/20) 
minutes = pmin(minutes,100) #pmin = parallel min

## performance is a function of luck (normal noise) and time played
score = rnorm(Nusers, mean=100,sd=100) + 
  10*minutes + 0.25*minutes^2 

gamedata = data.frame(score,minutes)  

Like the non-linear least squares function nls() in base R, nlrq() needs a formula with parameters on which to find an optimum fit. Starting values for the parameters are also needed. The formula can be a user-defined function, such as the three-parameter quadratic polynomial shown here:

### Quadratic polynomial with 3 parameters (a, b, c)
polyd3 = function(x,a,b,c)
{
 output = a + b*x + c*x^2
 return(output)
}


Now, using a flat line at zero as our initial curve, we can build a model.

init = list(a=0,b=0,c=0)
mod = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.5)
coefs = summary(mod)$coef
coefs
        Value  Std. Error  t value Pr(>|t|)
a 103.7766248 7.381757743 14.05852        0
b   9.8370123 0.559544019 17.58041        0
c   0.2525046 0.007178738 35.17396        0

coefs[,1]
          a           b           c 
103.7766248   9.8370123   0.2525046 

If you look at the score formula we used in the simulation, the parameters of 100, 10, and 0.25 were all estimated fairly well. Also, the standard error estimates are reasonable, if not a bit conservative in this instance. The nlrq() function depends on summary() for most of the information, which is why the extra steps involving summary are shown here. We can use the formula function defined earlier to make predictions, and the delta method to get standard error of those predictions (delta method not shown).

 ### Median score after 15 minutes
 polyd3(x=15,coefs['a',1],coefs['b',1],coefs['c',1])
 308.1454
 
 
 ### Median score after 100 minutes
 polyd3(x=100,coefs['a',1],coefs['b',1],coefs['c',1])
 3612.524
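The fitted curve can also be read in the other direction: how long does the median player need to reach a given score? A sketch using uniroot() from base R and the coefficient estimates printed above; the target of 1000 points is an arbitrary example.

```r
## Same three-parameter quadratic as defined earlier
polyd3 = function(x, a, b, c) a + b*x + c*x^2

## Median-curve estimates copied from the coefs output above
est = c(a = 103.7766248, b = 9.8370123, c = 0.2525046)

target = 1000   # arbitrary example score

## Find the playing time where the fitted median score equals the target
gap = function(x) polyd3(x, est['a'], est['b'], est['c']) - target
t_needed = uniroot(gap, interval = c(0, 100))$root
t_needed   # a little over 43 minutes
```

Repeating this with the coefficients of a 10th or 90th percentile fit gives the time at which 90% or 10% of players would be expected to have reached the target.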
 


Finally, we can also use polyd3() and curve() to draw quantile curves through the data.

 ## Pass data=gamedata explicitly rather than relying on attach()
 plot(score ~ minutes, data=gamedata, xlab="Time Played (minutes)", 
  ylab="Best Score", cex=0.6) #cex = Character EXpansion
 
 
 ## 10th percentile curve
 mod.10 = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.1)
 coefs = summary(mod.10)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=TRUE,lwd=1)
 
 
 ## Median curve
 mod.50 = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.5)
 coefs = summary(mod.50)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=TRUE,lwd=1)
 
 
 ## 90th percentile curve
 mod.90 = nlrq(score ~ polyd3(minutes,a,b,c),data=gamedata,start=init,tau=0.9)
 coefs = summary(mod.90)$coef
 
 curve(polyd3(x,coefs['a',1],coefs['b',1],coefs['c',1]),
        from=0,to=100,col="Red",add=TRUE,lwd=1)



 

Tuesday, 3 February 2015

Academic Salvage

One of my jobs is to facilitate research grants for educational development through the ISTLD (Institute for the Study of Teaching and Learning in the Disciplines) at Simon Fraser University. The Institute has given more than 130 awards to faculty-led projects to improve the educational experiences of their classrooms. I've read the grant proposals and final reports of many of these awards, and these are among the patterns that have emerged:


- Almost all the granted projects reach completion close to their proposed timeline and submit a final report.

- Many of them mention plans to publish research papers in their proposals.

- Many of them have made measurable beneficial impacts on the experience of students, and these effects are publishable in education journals.

- Many of the final reports mention sharing the findings at on-campus talks and posters.

- Not as many project results actually get submitted to journals, even in response to a follow-up a year after the final reports are submitted.

Papers are getting submitted, but not as many as there could be. Sure, there are some barriers to publishing at the end of the projects. Part of the barrier is that the primary goal of the projects is to improve education, not to write about it. Still, it feels like a waste to finish research and write a report and a poster, but never get published credit for it.

I've been told by some colleagues that statistical analysis of the data at the end is often an issue, as well as the paper writing process. It makes me want to find projects that ended in this ABP (all-but-publication) state and offer to write and analyze in exchange for a name on the paper. From my perspective as a statistician and a writer, it seems like one of the most efficient ways to boost my own paper count. From the perspective of a faculty member who has completed such a project, I hope they would consider such an offer as a way to be first author of a real paper rather than sole author of none.

Is there a name for someone who makes these sorts of arrangements? If not, I'd like to suggest 'academic salvager'. Specifically, I mean someone who takes the raw materials from the unpublished research of others and adds enough value to turn it into a paper.

Is there a lot of research in this all-but-publication state in other fields? This is just from one granting program; how much 'academic salvage' is out there waiting to be gathered, refined, and shipped out?

Sunday, 1 February 2015

Package Spotlight - nhlscrapr

NOTE: AN UPDATED INTRODUCTION TO NHLSCRAPR HAS BEEN POSTED HERE.

nhlscrapr is a package to acquire and use play-by-play information of National Hockey League games. It's similar in function to pitchRx for Major League Baseball. Unlike pitchRx, it doesn't allow for SQL Queries to be sent to an open database, but is instead a system to collect raw data from www.nhl.com and format it into an R dataframe.

The focus of this package spotlight will be on acquiring the game data rather than using it.  The first thing we need is a list of the games available for scraping. A function in nhlscrapr returns a table of gameIDs.

fgd = full.game.database()

dim(fgd)
[1] 15510    13


names(fgd)
 "season"     "session"    "gamenumber" "gcode"      "status"
 "awayteam"   "hometeam"   "awayscore"  "homescore"  "date"
 "game.start" "game.end"   "periods"  

The full games data frame, which we'll store as 'fgd', has about 15000 games by default.
 
table(fgd$season, fgd$session)

           Playoffs Regular

  20022003      105    1230
  20032004      105    1230
  20052006      105    1230

  ...

  20122013      105     720
  20132014      105    1230
  20142015      105    1230


The table covers every regular season and playoff game of the 12 seasons from 2002-3 to 2014-5. The crosstab shows how the 2012-3 season was shortened and the 2004-5 season was missed entirely. Also visible is that the number of playoff games appears fixed; all potential games have an assigned code.

fgd[c(1,2,1000,1001,15509,15510),1:5]
        season  session gamenumber gcode status
1     20022003  Regular       0001 20001      0
2     20022003  Regular       0002 20002      0
1000  20022003  Regular       1000 21000      1
1001  20022003  Regular       1001 21001      1
15509 20142015 Playoffs       0416 30416      1
15510 20142015 Playoffs       0417 30417      1

The other three variables of consequence are gamenumber, gcode, and status. The variable 'status' is marked 0 for games that are confirmed to be unscrapable. Most unscrapable games are from the early part of the 2002-3 season when this database was being established, and from playoff games that never happened. There are 30 other regular season games that are lost for reasons I don't know.
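Dropping the unscrapable games before downloading is a one-line subset. The sketch below uses a toy stand-in built from the six rows printed above, so that it runs without nhlscrapr; the column types are my assumption and may not match the real fgd exactly.

```r
## Toy stand-in for fgd, copied from the six rows printed above.
## ASSUMPTION: character IDs and a numeric status column.
toy = data.frame(
  season     = c("20022003","20022003","20022003","20022003","20142015","20142015"),
  session    = c("Regular","Regular","Regular","Regular","Playoffs","Playoffs"),
  gamenumber = c("0001","0002","1000","1001","0416","0417"),
  gcode      = c("20001","20002","21000","21001","30416","30417"),
  status     = c(0, 0, 1, 1, 1, 1),
  stringsAsFactors = FALSE
)

## Keep only the games not flagged as unscrapable
scrapable = subset(toy, status != 0)
nrow(scrapable)

## Crosstab of session by status
table(toy$session, toy$status)
```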

gamenumber is the unique-within-session identifier for a game, from 0001 to 1230 for regular season games. The gamenumber for playoff games is encoded as 0[round][series][game], so a gamenumber of 0315 represents Game 5 of the 3rd round of the playoffs for one conference, and 0325 would be Game 5 for the other conference.
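The playoff encoding can be unpacked with a few lines of base R. decode_playoff() is a name invented for this sketch, and it assumes gamenumber is stored as a four-character string.

```r
## Hypothetical helper: split a playoff gamenumber of the form
## 0[round][series][game] into its three parts.
## ASSUMPTION: gamenumber is a 4-character string such as "0315".
decode_playoff = function(gamenumber) {
  digits = as.integer(strsplit(gamenumber, "")[[1]])
  list(round = digits[2], series = digits[3], game = digits[4])
}

decode_playoff("0315")   # round 3, series 1, game 5
decode_playoff("0417")   # round 4, series 1, game 7
```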

gcode is [session][gamenumber]. A 2xxxx gcode is a regular season game, and a 3xxxx gcode a playoff game. Preseason games would have a gcode of 1xxxx, but they aren't included in the data from this package. The remaining eight variables, from awayteam to periods, appear to have little or no use at the moment. Finally, a function is used to import rather than a data() command because there will be more data after the 2014-15 season, which is not included when you leave the parameter extra.seasons at its default of 0.

## Doesn't work yet, but would also include 2015-6 and 2016-7
test = full.game.database(extra.seasons = 2) 

The dataframe fgd is just the start - it is a list of IDs used for scraping. The following script uses fgd to download and compile a much larger data frame of every recorded play, including shots, hits, goals, and penalties. Explanation below.

setwd("C:\\Set\\This\\First")
yearlist = unique(fgd$season) 


for(thisyear in yearlist) ## For each season...

{
    ## Get the game-by-game data for that season
    game_ids = subset(fgd, season == thisyear) 

    ## Download those games, waiting 2 sec between games
    dummy = download.games(games = game_ids, wait = 2) 

    ## Processing, unpacking and formatting
    process.games(games=game_ids,override.download=FALSE) 

    gc() ## Clear up the RAM using (g)arbage (c)ollection.
}


## Put all the processed games into a single file
compile.all.games(output.file="NHL play-by-play.RData") 

Any games you download will be saved in the subdirectories 'nhlr-data' and 'source-data' of the working directory, which you can set from a dropdown menu or with setwd(). The names of these subdirectories can be changed with parameters in download.games(), process.games(), and compile.all.games().

The download.games() function will download from www.nhl.com any games listed in a data frame of the same format as fgd. Raw game data is placed in the nhlr data folder.

Rather than use fgd directly, we subset it by season because the downloading process has a memory leak that needs to be addressed by calling gc() occasionally. If you try to download all the games without stopping to perform garbage collection, R will eventually run out of memory and crash. 2 GB of RAM should be more than enough to handle one season at a time.

The 'wait' parameter defines the number of seconds to wait between single game downloads. The default is 20 seconds, I presume as a courtesy to the NHL or to avoid scraping detection, but it could also have something to do with slow download speeds.

Exactly what the process.games() function does is conjecture on my part, but calling it is necessary. Processed game files are also saved in the nhlr data folder. Compilations from the compile.all.games() function are put in the source data folder. If you interrupt the download process, whatever games you have managed to download and process will be compiled.

Once you have some games compiled, you can load them into R and see the recorded play-by-plays. The output below is from a single season (2006-7). There are about 375 events per game in this season, more than one every ten seconds. This should be everything you need to explore the data yourself. Later, I'll be using this dataset to measure the GDA-PK and GDA-PP metrics proposed in a previous post.

temp = load("source-data\\NHL play-by-play.RData")
nhl_all = get(temp)

length(unique(nhl_all$gcode))
1310

dim(nhl_all)
[1] 494619     44
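The events-per-game figure follows directly from the two counts printed above. A quick check, assuming a 60-minute game for the seconds-per-event figure:

```r
n_events = 494619   # rows in nhl_all for this season
n_games  = 1310     # unique gcode values

events_per_game = n_events / n_games
events_per_game          # about 377.6 events per game

## Average gap between recorded events, assuming 3600 seconds per game
3600 / events_per_game   # about 9.5 seconds
```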

 names(nhl_all)
 [1] "season"              "gcode"         "refdate"    
 [4] "event"               "period"        "seconds"     
 [7] "etype"               "a1"            "a2"          
[10] "a3"                  "a4"            "a5"          
[13] "a6"                  "h1"            "h2"        
[16] "h3"                  "h4"            "h5"      
[19] "h6"                  "ev.team"       "ev.player.1"  
[22] "ev.player.2"         "ev.player.3"   "distance"   
[25] "type"                "homezone"      "xcoord"  
[28] "ycoord"              "awayteam"      "hometeam"   
[31] "home.score"          "away.score"    "event.length"  
[34] "away.G"              "home.G"        "home.skaters"    
[37] "away.skaters"        "adjusted.dist" "shot.prob.dist 
[40] "prob.goal.if.ongoal" "loc.section"   "new.loc.sectio  
[43] "newxc"               "newyc"