Saturday, 28 May 2016

The NHL is half as good as it should be.

The soul of the National Hockey League feels like it's gone.

1. Every 8 years or so, at least half a season, and sometimes a whole one, is simply lost to a labor dispute. The backbone of sport is ritual and tradition, and a ritual adhered to only sometimes is just a habit.

2. Teams are playing to tie instead of to win, as demonstrated by fivethirtyeight, and later in greater detail in a paper I'm writing with Paramjit Gill. Half a win is awarded to any team that loses in overtime, and this incentivizes some very risk-averse behaviour late in games. Despite other changes to overtime including
- the introduction of a shootout,
- 4-on-4 overtime play, and later
- 3-on-3 overtime play,

the overtime bonus point has persisted.

3. Some teams are hopeless year-after-year failures. In the past, expansion teams have had one or two terrible seasons while they get established, but no team has that excuse now.  There are several 'rubber band' mechanics in play to pull the level of competitiveness of teams closer together over time.
- There is an upper limit on the total salaries that can be paid to players, and many teams are at this limit.
- Top-earning teams subsidize other teams.
- Teams that rank at the bottom in one season are given first pick of players in next year's draft.

The overtime bonus is itself a form of rubber-banding because a team can still earn season points by losing, and if the game goes to a shoot-out, then the game is effectively a coin toss regardless of team strength.

In spite of all these mechanics to prop up teams after a bad season, there are some that have been hopeless for a decade. Looking at you, Oilers.

My guess is that there isn't enough talent coming from second-tier 'feeder' leagues (WHL, OHL, QMJHL, AHL, and the European leagues) to properly fill 30 NHL teams anymore. In turn, these feeder leagues are working from a diminished talent pool because of demographic changes; compared to 15 years ago, there are fewer children being born in hockey-playing countries, and their parents are poorer and less able to enroll these children in organized hockey.

Here are my suggestions to improve the state of professional hockey.

1. Remove (at least) two teams from the league.

As a Canadian, my inclination would be to remove the Arizona Coyotes and the Columbus Blue Jackets because of their tiny home fanbases and financial troubles. Realistically, the Edmonton Oilers should go, simply for being the worst.

With 28 teams, all four divisions could have 7 teams, and the talent would be spread about 7% less thinly (30/28 is roughly 1.07). Cutting out a couple of teams might be enough of a deterrent to 'tanking', an alleged practice where bottom-tier teams intentionally lose games to improve their draft prospects for the next season.

2. Optimize the talent pool.

The first chapter of the book Outliers points out the phenomenon where boys born in the first 3 months of the year are overwhelmingly more likely to make a career out of hockey. The explanation given for this was that, as a fluke of the age cutoff system for children's leagues, these children of January are the oldest of their 4-6 year old peers on the ice. Since they are the best players in their leagues, they get filtered into more competitive leagues with better support. To my knowledge, Hockey Canada still uses this cutoff system.

Players who were born late in the year and may have been NHL material instead never reach their potential because they were outclassed and overlooked when they were very young. Fairness aside, if early age cutoffs were done at 6-month intervals instead of 1-year intervals, we could see more players reach their potential.

We wouldn't notice the difference for a generation, but it would offset future demographic changes.

3. Increase support for women's hockey.

At the moment there are only a handful of competitive teams in women's hockey in Canada, so the developed talent pool can't be exceedingly large. Even then, the Canadian and US teams were so much better than teams from other countries that the IOC was considering removing women's hockey from the Olympics because it wasn't competitive enough. Things are improving, as Finland has been showing strength in recent competitions. If we want more high quality hockey, maybe it's time to look elsewhere.

Friday, 20 May 2016

Biochar farming idea

Here's an idea I hope gets stolen if it's good, and shot down early if it's bad. Also, I claim no expertise in biology, apiary science, or soil ecology, so this could be drastically off the mark.

1. Pyrolyze massive amounts of organic waste to make biochar.
2. Use the biochar to make a soil that mimics land after a forest fire.
3. Grow high-value crops that grow best after a forest fire.
4. Incorporate secondary processes like beehives.


First, pyrolysis is a technique for heating lignin-rich plant matter (i.e. wood, bamboo, corn stalks) in a low oxygen environment. Pyrolysis can produce a stable, porous form of charcoal called biochar. Biochar is commonly applied to soil to improve its capacity to store water and nutrients. Soil with large amounts of biochar can approximate a very high quality topsoil called terra preta.

Also, the process of creating biochar is carbon-negative. The carbon in the biochar is effectively locked out of the atmosphere more permanently than it would be if the biological matter is left to rot. It certainly keeps carbon out of the atmosphere more effectively than simple burning.

What happens, however, when the soil is mostly biochar, with just enough other parts (or a layer on top) to keep it from blowing away? I suspect that you would have a soil similar to what would be found on the ground after a forest fire.

Pioneer species are the first organisms to thrive after a forest fire. These include valuable plants like morel mushrooms, and, particularly, fireweed. Specifically, I've heard that honey derived from fireweed is valuable. However, the cultivation of fireweed honey is difficult because it has to be done in areas of recent forest fires, and therefore can't be done in the same place for very long.

My idea is to make soil that mimics the ground after a recent fire, and maintain that state with frequent renewal of  biochar. Then, I want to use that soil to cultivate pioneer species.

With enough biochar, my hope is to have some farmland that can grow fireweed every year, and have permanently installed beehives amongst the fireweed. Rather than move the hives to where the land is suitable, I will work to maintain the land in a suitable state. With luck, the practice will pay for itself in harvests of fireweed honey, morels, and excess biochar.

If this works, then it can become a business that is both long-term profitable and carbon negative. There is an established demand for morels, and with more consumer awareness there could be a large demand for fireweed honey, so there is room for lots of people to try this.


One potential issue is that bees prefer some flowers to others, and growing fireweed near the beehives doesn't guarantee that those are the flowers that will be used. Having a honey with a precursor that was part fireweed and part wildflowers may be acceptable, but even guaranteeing that would mean influencing an area much larger than the farmland.

Another issue is a sustainable supply of lignin-rich organic matter. Piles of slash and scrap wood are unreliable, one-time sources. Corn husks work nicely, but aren't available everywhere. Low-grade wood chips, called hog, would also work, but those are already used by pulp mills for energy. The city of Vancouver is already collecting organic waste for industrial composting, so it may be possible to tap into that pipeline. It will take some research to find how to use whatever is regularly available from local food processing and agriculture.


ADDENDUM: user Osageandrot on Reddit, who does research on biochar for soil restoration, was kind enough to critique this idea and give some input. It was enough to show that this idea needs a major rework before anything more goes forward.

First, any fresh biochar would need to be pre-aged by mixing with existing soil. This is because the concentration of Polycyclic Aromatic Hydrocarbons (PAHs) is too high originally, and these will suppress microbial life.

Second, fresh application of biochar may not be necessary anyway. Most pioneer species are simply fast growers that need anything that grows slower, but to a higher maximum, to be gone. Forest fires are not a necessary ingredient for a lot of these; they just happen to provide the necessary conditions, and clearcutting would do the same thing.

Tuesday, 17 May 2016

Course Notes on the PitchRx package

These are the notes I recently delivered as a guest lecturer for Simon Fraser University's course on Sports Analytics. It's a course for people with some experience with R, but not necessarily experts. As such, I made these notes with beginners in mind.

If you want to see what PitchRx can really do, I recommend the below links. However, if you want to get started and you don't have much familiarity with R, and possibly none with SQL, the following notes are for you.

The PitchRx package is an R package designed to use a Major League Baseball dataset called pitchFx. As with nhlscrapr, there are other means of accessing this dataset, but my preference is usually towards R integration.

It gives you detailed pitch-by-pitch information about every MLB game, including the speed and location of the ball as it crossed home plate.

Getting started with pitchRx is pretty quick:
The scrape() function in pitchRx will allow you to scrape data from every game that happened during the range of days given. Days are in the YYYY-MM-DD format, which is used because...

1) It's the same format that SQL uses.
2) It's the ISO standard (ISO 8601).

dat = scrape(start = "2013-06-01", end = "2013-06-01")

The dataset 'dat' that comes out of this is a collection of five tables.

 "atbat"  "action" "pitch"  "po"     "runner"

'atbat': Describes the outcome of each at-bat. One row = one batter.

'action': Other events not related to at-bats, such as pitching changes, coaching visits to the mound, and managers getting ejected from the game.

'pitch' : Pitch-by-pitch description. One row = one pitch. Has lots of physics variables relating to each pitch, but lacks the text descriptions that accompany at-bats.

'po' : Pickoff attempt descriptions.

'runner': Description of where each runner ended up. Most of the rows correspond to at-bats. The other rows represent running events like advancing on someone else's hit, or being forced out.

We will focus on the pitch-by-pitch table.
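Before settling on one table, it can help to check how big each table is and which columns it has. The following is only a sketch: 'dat_toy' is a tiny made-up stand-in for the list that scrape() returns (the real tables have far more rows and columns), so the numbers mean nothing beyond showing the commands.

```r
# Toy stand-in for the list returned by scrape(); all values are made up.
dat_toy = list(
  atbat  = data.frame(num = 1:2, pitcher = c(111, 111)),
  action = data.frame(event_num = 1),
  pitch  = data.frame(num = c(1, 1, 2), des = c("Ball", "Foul", "Ball")),
  po     = data.frame(event_num = 9),
  runner = data.frame(num = 2)
)

# Names of the tables in the list
names(dat_toy)

# Number of rows and columns in each table
table_dims = sapply(dat_toy, dim)
rownames(table_dims) = c("rows", "cols")
table_dims

# Column names of the pitch table
names(dat_toy$pitch)
```

With the real data, the same three commands would be run on 'dat' instead of 'dat_toy'.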

The list of variables is... intimidating.

Thankfully, a lot of these are the same across all the tables.

des, des_es: The text description of the pitch in English or Spanish, respectively. Examples include "Ball", "Foul", "Strike", and "In play, run(s)".

num: The number of the at-bat for this game. Also used in the at-bat data frame.

count: The ball-strike count before the pitch occurred.

start_speed, end_speed: The speed of the ball, in miles per hour, when it leaves the pitcher's hand, and when it reaches home plate, respectively.

px: The horizontal position that the ball crosses the home-plate plane. Measured in feet left or right of the center of home plate, from the perspective of the catcher.

pz: The vertical position of the ball crossing the home-plate plane. Measured in feet above the ground.

nasty: The 'nasty factor', which is a function of physical variables that is supposed to describe how difficult a pitch is to hit.

spin_rate: The (mean?) rate at which the baseball was spinning, in revolutions per minute (RPM).  Yes, some pitchers really do spin the ball at 2700 RPM!

zone (unconfirmed): The portion of the strike zone (or the area outside it) through which a pitch crossed the plate.
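Since the meaning of 'zone' is unconfirmed, one alternative is to build a rough in-or-out flag directly from px and pz. This is a sketch under stated assumptions: the pitch values are made up, the horizontal limit uses the 17-inch width of home plate, and the vertical limits of 1.5 and 3.5 feet are a guess for an average batter rather than anything taken from pitchRx.

```r
# Toy pitches: px and pz are in feet, as described above. Values are made up.
pitches = data.frame(
  px = c(0.10, -1.20, 0.65, 0.00),
  pz = c(2.50,  2.00, 3.40, 0.90)
)

# Assumed zone limits (not from pitchRx): home plate is 17 inches wide,
# and the vertical limits are a rough guess for an average batter.
half_plate  = (17 / 2) / 12   # half the plate width, in feet
zone_bottom = 1.5
zone_top    = 3.5

# TRUE when the pitch crosses the plate inside the assumed zone
pitches$in_zone = abs(pitches$px) <= half_plate &
  pitches$pz >= zone_bottom & pitches$pz <= zone_top

pitches
```

A real strike zone depends on the batter's height and stance, so a fixed box like this is only good for a first look.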

Example analysis: Pitch count

One big issue in baseball is pitch count. As a pitcher, especially a starter, throws many pitches, they tire and their performance supposedly gets worse.

Is this true? Let's plot some variables against pitch count.

First, let's isolate one team of one game.

atbat_1game = subset(dat$atbat, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")

pitch_1game = subset(dat$pitch, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")

Next, we have to identify the pitcher that throws each pitch. We have to get this information from the at-bat table.

pitcher = rep(NA,nrow(pitch_1game))
for(k in 1:nrow(pitch_1game)) {
    # Look up the at-bat that this pitch belongs to
    thisnum = pitch_1game$num[k]
    pitcher[k] = atbat_1game$pitcher[which(atbat_1game$num == thisnum)]
}
pitch_1game$pitcher = pitcher
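The same lookup can be done without a loop by using match(), which is base R. Here is a minimal sketch on made-up toy tables (the names 'atbat_toy' and 'pitch_toy' and the pitcher ids are invented for illustration); it assumes each value of num appears only once in the at-bat table.

```r
# Toy versions of the two tables; ids and values are made up.
atbat_toy = data.frame(num = c(1, 2, 3), pitcher = c(501, 501, 642))
pitch_toy = data.frame(num = c(1, 1, 2, 3, 3, 3))

# match() gives, for each pitch, the row of atbat_toy with the same num
row_in_atbat = match(pitch_toy$num, atbat_toy$num)
pitch_toy$pitcher = atbat_toy$pitcher[row_in_atbat]

pitch_toy
```

The loop is easier to read when you are starting out; match() mostly pays off when there are many games' worth of pitches.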

Now that we know the pitcher that threw each pitch, we can find the pitch count. This R script first ensures that event_num is treated like a number and not a string. This is important because we will use event_num to put the game's pitches in chronological order.

pitch_1game$event_num = as.numeric(pitch_1game$event_num)
pitch_1game = pitch_1game[order(pitch_1game$event_num),]
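To see why the conversion to a number matters, compare how the same values sort as text and as numbers. The values below are made up for illustration.

```r
# event_num values as they might look if stored as text
event_chr = c("3", "10", "21", "4")

# Sorting as text compares the values character by character
sort(event_chr)

# Sorting as numbers gives the chronological order we want
event_sorted = sort(as.numeric(event_chr))
event_sorted
```

When sorted as text, "10" ends up ahead of "3" because the comparison starts with the first character.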

This R script makes a new variable for pitch count. For a given pitcher, it marks the pitches as 1, 2, ... up to the number of pitches thrown. It does this separately for each pitcher, and when it's done, it puts that new variable into the 1-game data frame.

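pitchcount = rep(NA,nrow(pitch_1game))

# Use the pitcher column of the sorted data frame so the indices match its rows
for(thispitcher in unique(pitch_1game$pitcher)) {
    idx = which(pitch_1game$pitcher == thispitcher)
    pitchcount[idx] = 1:length(idx)
}
pitch_1game$pitchcount = pitchcount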
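For comparison, base R's ave() can do the same within-pitcher numbering in one line. This is a sketch on a made-up vector of pitcher ids, assumed to already be in chronological order.

```r
# Toy pitcher column, already in chronological order; ids are made up.
pitcher_toy = c(501, 501, 501, 642, 642, 501)

# Within each pitcher, number the pitches 1, 2, 3, ...
pitchcount_toy = ave(seq_along(pitcher_toy), pitcher_toy, FUN = seq_along)

pitchcount_toy
```

Note that the last pitch is numbered 4 because it is the fourth pitch by pitcher 501, even though another pitcher threw in between.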

plot(pitch_1game$end_speed ~ pitch_1game$pitchcount)

plot(pitch_1game$nasty ~ pitch_1game$pitchcount)

plot(pitch_1game$spin_rate ~ pitch_1game$pitchcount)
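Scatterplots like these can be hard to judge by eye, so one possible next step is to fit a straight line and look at its slope. The sketch below runs on simulated data, not real pitches: the data frame 'sim', the starting speed of 84, and the built-in decline of 0.02 mph per pitch are all invented so that the example can run on its own.

```r
# Simulated pitches for one pitcher; these are NOT real measurements.
set.seed(1)
sim = data.frame(pitchcount = 1:90)
sim$end_speed = 84 - 0.02 * sim$pitchcount + rnorm(90, sd = 1)

# Straight-line fit of speed against pitch count
fit = lm(end_speed ~ pitchcount, data = sim)
coef(fit)

# Scatterplot with the fitted line drawn on top
plot(end_speed ~ pitchcount, data = sim)
abline(fit)
```

With real data, the same fit would be run separately for each pitcher, since different pitchers and pitch types have very different typical speeds.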

These tables are linked by some identifying variables.

gameday_link, example:  gid_2013_06_01_wasmlb_atlmlb_1

This is found in all five tables. It identifies the game as...

...happening on 2013-06-01,
...with Washington as the visiting team,
...and Atlanta as the home team,
...and was the first game between these teams that day

(In the case of two games in a day, the gameday link will end in _2 instead of _1 )
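If you want these pieces as separate variables, the link can be split on its underscores. This is a small sketch using the example link above; it assumes the first three letters of each team chunk are the team code, as they appear to be in that example.

```r
link = "gid_2013_06_01_wasmlb_atlmlb_1"

# Split on underscores: "gid", year, month, day, away, home, game number
parts = strsplit(link, "_")[[1]]

game_date   = as.Date(paste(parts[2:4], collapse = "-"))
away_team   = substr(parts[5], 1, 3)   # assumed: first three letters are the team code
home_team   = substr(parts[6], 1, 3)
game_number = as.integer(parts[7])
```

Here game_date is a proper Date object, so it can be compared or subtracted like any other date in R.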


Every event in a game has a number reflecting its chronological order. The first recorded pitch has an event_num of 3.

After that, every pitch, pickoff attempt, running event, and entry in the 'action' table is given its own event_num.

Since pitchRx is based in SQL, the order of the rows that get scraped isn't guaranteed. The event_num variable is very useful if row-order matters to you.

Remember to save your work!

The data you scrape from pitchFx is NOT automatically saved to a file the way it is with nhlscrapr.

It's probably worth the extra effort to save the tables as separate .csv files.

write.csv(dat$pitch, "Pitch Data 2013-06-01.csv")
write.csv(dat$atbat, "At Bat Data 2013-06-01.csv")
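To save every table rather than typing one line per table, a short loop over the list works. This is a sketch on a made-up stand-in list ('dat_toy'); it writes to a temporary folder only so that it is harmless to run, and the file naming is just one possibility.

```r
# Toy stand-in for the scraped list; in practice this would be 'dat'.
dat_toy = list(
  pitch = data.frame(num = c(1, 1, 2), des = c("Ball", "Foul", "Ball")),
  atbat = data.frame(num = c(1, 2), pitcher = c(501, 501))
)

# Folder to write into; tempdir() is used here only for the example.
out_dir = tempdir()

# One .csv file per table, named after the table
for (tab in names(dat_toy)) {
  out_file = file.path(out_dir, paste0(tab, " 2013-06-01.csv"))
  write.csv(dat_toy[[tab]], out_file, row.names = FALSE)
}

# Read one table back to check that it survived the round trip
pitch_back = read.csv(file.path(out_dir, "pitch 2013-06-01.csv"))
```

Setting row.names = FALSE keeps R from adding an extra column of row numbers to each file.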

Tuesday, 10 May 2016

Kepler - The Biggest of Deals

Astrobiology, the study of life pertaining to outer space, is the most important and among the least useful fields. It involves the beginning and probable end of life as we know it, but what we find is too large to be used by anyone, or even everyone.

On May 10, 2016, NASA released this image:

The blue circles represent planets that have been previously found (confirmed), mostly by the Kepler satellite in the last few years. The orange shaded circles are those found since NASA's last announcement on the matter.

The size of each circle is proportional to the (estimated) size of the planet, and the height is essentially the brightness of the star* that planet orbits. Near the top is our sun, a white** star, and further down are cooler, smaller, redder stars. The further to the right a planet is, the less light it receives from its star. Notice that Mars is to the right of Earth.

That green band down the middle of the chart is the habitable zone. Planets in that range are possibly the right temperature to support carbon-based life. That doesn't mean these planets can support life, just that the first two criteria, heat and radiation, are in the right zones. Without those, terraforming for long-term carbon-based life is impossible.

Now that we understand the graph, some remarks.

This is amazing! When I graduated from high school, finding other planets meant speculating about a single planet beyond Pluto. Now, NASA confirms the existence of nearly 1300 newly found planets in the last year! Of those, nine new ones are in or near the habitable zone. These are all pointlessly far away, but it's a leap from nothing to something in our lifetimes.

We still don't know how relatively abundant these small, rocky, habitable zone planets are because larger, more massive gas giants are easier to find. It's worth considering that the sun is a bit unusual in the high amount of metals (i.e. anything but hydrogen and helium) it has compared to other stars like it (i.e. its population). That even this many rocky planets have been found is pretty marvelous.

Space is exciting!

Next, do you see how Earth is close to the too-hot edge of the habitable zone? It wasn't always that close. This has nothing to do with global warming on a human history scale; main sequence stars get hotter over time.

A few billion years ago when life forms were much simpler, Earth would have shown up more to the right and a little bit down from where it is now. Earth would have been a lot cooler than now were it not for the fact its atmosphere was mostly carbon dioxide and its core had more radioactivity.  Not only does Earth support life now, but its conditions have changed to offset the changes in the star it orbits in such a way that life was continually sustainable long enough to develop its current complexity.

The theory that life developed from self-replicating proteins and lipids on Earth is called abiogenesis, as in 'creation from non-life'. However, a growing body of evidence suggests that Terran life is currently too complex to have developed in time if it started on Earth. A competing theory, called panspermia, as in 'life everywhere', suggests that some very simple life arrived from inside a meteor, after being kicked into space by something Michael Bay dreamed up.

This early life could have come from anywhere, but Mars is a likely source. NASA has also recently found flowing water on Mars; however, it tends to boil away quickly without any air pressure. There's substantial evidence to suggest that ancient Mars was much warmer and had a thicker atmosphere, and that the atmosphere slowly boiled away because there wasn't enough gravity and magnetic field to keep it on the planet.

So to get to our current level of life complexity, we needed not one, but two habitable planets in order to buy enough time to develop. We didn't do it with much room for error either.

Remember how Earth is near the too-hot edge of the habitable zone? Well, the sun is still getting hotter, and there's nothing within humanity's power to prevent that. In roughly half a billion years, Earth, assuming its orbit is the same, will have an average temperature of 55 C, and all remaining carbon will be locked away in rocks and out of the atmosphere. Without that carbon, no plant life can exist, and neither can we. Nothing smaller than moving the Earth itself to a wider orbit can prevent that in the long term.

To put that in perspective, the span of time in which life can exist on Earth is nearly 90% over, assuming the best-case scenario.


1. We are lucky, insanely lucky, to exist.
2. Regarding the lack of contact from alien life, we could very well be past whatever stops most life from reaching any technological level - otherwise known as the Great Filter.
3. We can't stay home forever.


*Assuming main sequence stars, like our sun.
**Yes, white. It only looks yellow through our atmosphere.