Thursday, 29 January 2015

Abstract Thoughts About Concrete

The production of cement is a major source of anthropogenic (i.e. human-made) carbon dioxide. In fact, its impact is comparable to that of using fossil fuels to transport goods and heat buildings.

It uses limestone, which contains carbon dioxide that has been locked away for geologic time, and a portion of this is released in the process of cooking it into a material called clinker. Eventually, the cement will re-absorb some, but not all, of the CO2 released this way. Cooking the limestone to 1500 C takes a lot of energy too, at an intensity that makes it difficult to produce cleanly. There are also the costs and impact of the limestone quarrying and transportation to consider.

Another problem is disposal: Construction waste makes up a large part of what goes into landfills, and cement products like concrete and mortar potentially make up a large part of that. This is from buildings being demolished or renovated, from road construction, and from the occasional truckload of concrete that is mixed but can't be poured at the right time.

So I wonder, in my limited understanding, if it's possible to take cement products and reclaim some of the cement. Currently, concrete is recycled to reduce its landfill impact and the need to mine gravel and other fill. However, that seems to be all it's used for - simple rocks rather than the magic gluestone stuff that holds skyscrapers up.

Curing is a one-way chemical process (I think), and there's a lot of fill that's added to cement to make concrete, so maybe it's just too hard to be profitable. Fresh cement powder is so fine that it would be a stretch to call it dust, so a tremendous amount of mechanical grinding would be necessary to get concrete down to the point where the fill could be separated from what was cement powder before curing.

Has anyone given serious thought to a chemical or biological means of doing this, however? If lichen can break down solid rock, could concrete gravel be broken down or separated into something finer with a plant, enzyme, or type of bacterium? Can the curing process be undone by similar means that leave behind clinker as a waste product?

Similarly, could wet concrete mix be saved for another time in the cases where it spoils from water contamination or when it can't be poured when intended? Could aging or damaged infrastructure be reinforced or renewed by drilling into it and injecting something to force it to re-cure?

Just some thoughts from someone ignorant of the limitations of infrastructure. Comments, corrections and discussion about this would be very welcome.

Here's a general statement to close -- climate change won't simply be fixed by driving hybrids and using recycle bins. It's a complex problem and solving it will be the great work of this and the next generation.

Thursday, 22 January 2015

Package Spotlight - stringr

One of my ongoing projects is to maintain a database of international cricket games, which I do by scraping the text from ESPN Cricinfo. For this kind of work, the R package 'stringr' is essential.

One task involved in this project is to find the names of all the players and their batting order in each Twenty20 International and One-Day International match. Here is a step-by-step of how to use stringr to extract the names. If you want to try it yourself, the file Game_Summary.txt is literally a Select All + Copy/Paste from the following link.

First we load the package and the text data. Make sure to set the working directory with setwd() first.

library(stringr)
intext = readLines("Game_Summary.txt", warn=FALSE)
intext = str_replace_all(intext,"\t","   ") ## Removes the tabs and replaces them with triple spaces for better reading

readLines is a base input function like read.csv. It makes 'intext' a character vector with one element for each line of the text file. Checking length(intext) tells us how long that is...

[1] 497

... biggish.

There are nearly 500 lines in the game summary, and we are only looking for 22 names, so a lot of this is going to be garbage.

Ideally, we want to inspect only the lines that have the names, and as few of the other lines as possible.
Having looked at the text in Notepad++, we see the batting summary after the first few lines.

 [1] "T20I no. 43 | 2007/08 season"                                                     
 [2] "Played at Kingsmead, Durban"                                                     
 [3] "20 September 2007 - night match (20-over match)"                                  
 [4] "    India innings (20 overs maximum)   R   M   B   4s   6s   SR"                 
 [5] "View dismissal   G Gambhir   c Smith b Pollock   19   19   19   3   0   100.00"   
 [6] "View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00"   
 [7] "View dismissal   KD Karthik   c JA Morkel b Pollock   0   1   1   0   0   0.00"  
 [8] "View dismissal   RV Uthappa   c Smith b M Morkel   15   22   16   1   1   93.75"  
 [9] "RG Sharma   not out   50   52   40   7   2   125.00"                              
[10] "View dismissal   MS Dhoni*†   run out (Philander)   45   39   33   4   1   136.36"
[11] "IK Pathan   not out   0   1   1   0   0   0.00"     

The names start after the line "India innings (20 overs maximum)...", and inspection of several such game summaries shows a similar pattern. We can use this as a marker for when to start looking for names with the following code.

scanstate = 0
for(i in seq_along(intext)){
   #### The rest of the code will go here!
   if(str_detect(intext[i], " innings [(][0-9]+")){
       scanstate = 1
   }
}

Since we need to retain information from one line to the next, we cannot use the more scalable method of apply() for this problem (in any simple way, at least).
'scanstate' is just a variable name we're using where 0 indicates we're not looking for names, and 1 indicates that we are looking for names.

str_detect( X ,Y ) is where the action happens. It returns TRUE if the pattern Y appears in the string X, and FALSE otherwise.

The pattern Y is a regular expression. This one literally means exactly " innings (", followed by one or more digits from 0 to 9.

The [ and ] are collection markers, which allow you to define 0-9 as 'any digit'. The + is used to say 'at least one character from the collection in [], in a row'.

Parentheses like ( and ) are special characters that regular expressions use for other things, so to match a literal left parenthesis here, we treat it as part of a collection, even if in this case '(' is the only character in that collection.

We want to be as specific as possible here to avoid catching things that are not markers of the start of the batting summary. That's why we don't simply search for ' innings'.
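To see the difference that being specific makes, here is a small sketch that runs both patterns over three lines. Two of the lines come from the game summary above; the third is a made-up line that I'm assuming could plausibly appear in commentary. The escaped form with \\( is another common way to write a literal parenthesis, shown only for comparison.

```r
library(stringr)

## Two lines from the game summary, plus one hypothetical line for contrast
samplelines = c("    India innings (20 overs maximum)   R   M   B   4s   6s   SR",
                "View dismissal   G Gambhir   c Smith b Pollock   19   19   19   3   0   100.00",
                "A fine innings from Sharma")   ## made up, not from the file

## The specific pattern: " innings (" followed by at least one digit
specific = str_detect(samplelines, " innings [(][0-9]+")

## Same pattern with the parenthesis escaped instead of put in a collection
escaped = str_detect(samplelines, " innings \\([0-9]+")

## The loose pattern: just " innings"
loose = str_detect(samplelines, " innings")

print(specific)
print(escaped)
print(loose)
```

Only the header line passes the specific pattern, while the loose pattern also lets the made-up commentary line through.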

Now that we have a means of determining what lines are from the batting summary, we can proceed to extract the names.

if(scanstate == 1){
    thistext = intext[i]

    playlist[playeridx,scanstate] = thistext
    playeridx = playeridx + 1

    if(playeridx > 11){scanstate = 0}
}

thistext, playlist (an 11x2 array, one column per team), and playeridx (starting at 1) are variables that we have specified.

This code will take the first 11 lines of the batting summary and save them as if they were the names of the first team. Unfortunately, this will grab the entire line instead of just the name, such as

"View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00" 

where all we really want is "V Sehwag".

So we need a way to...
1) Remove 'View dismissal' from the start of each string (assuming no player is named View dismissal), which can be done with a replacement.

thistext = str_replace(thistext,"View dismissal", "")

2) Extract only the text that happens BEFORE the summary of the player. Inspection will show that all such summaries start with ' c ', ' lbw ', ' b ', ' not out ', or some other short tag that explains the fate of the Batsman/Batter. This code...

str_split_fixed(thistext," c | lbw | b | not out | run out | st ",2)[1]

...splits the single element 'thistext' (first parameter) into an array of 2 elements (third parameter; technically a matrix with one row and two columns), where anything fitting the pattern in the second parameter is used as the marker of the break between the elements.

In regular expressions, | means 'or', as in the pattern ' c ', or the pattern ' lbw ', or ... the pattern ' st '. We only want to keep the first element of this array, which should be the name, so we finish this line with [1] outside the function.
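Here is a quick sketch of that split on two of the lines from the summary, assuming 'View dismissal' has already been removed. The variable names parts1 and parts2 are just for this illustration.

```r
library(stringr)

splitpattern = " c | lbw | b | not out | run out | st "

## A dismissed batter: the first tag found is " c "
parts1 = str_split_fixed("   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00",
                         splitpattern, 2)

## A batter who was not out: the first tag found is " not out "
parts2 = str_split_fixed("RG Sharma   not out   50   52   40   7   2   125.00",
                         splitpattern, 2)

## Each result has one row and two columns; only the first element is the name
print(dim(parts1))
print(str_trim(parts1[1]))
print(str_trim(parts2[1]))
```

Because the third parameter is 2, only the first tag on the line causes a break, so the later ' b ' in Sehwag's dismissal stays inside the second element.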

With some additional similar housekeeping, we add this code chunk to the beginning of the for loop.

if(scanstate == 1){

 thistext = intext[i]
 thistext = str_replace(thistext,"View dismissal", "") ### Fixes problem 1. Only one replacement needed.
 thistext = str_split_fixed(thistext," c | lbw | b | not out | run out | st ",2)[1] ### Fixes problem 2
 thistext = str_trim(thistext) ### Removes whitespace on the ends
 thistext = str_replace_all(thistext,"[*]|†","") ### Removes all the special characters * and cross

 playlist[playeridx,scanstate] = thistext
 playeridx = playeridx + 1

 if(playeridx > 11){scanstate = 0}
}

We are not quite done. Sometimes teams run out of overs before they run out of wickets, especially in Twenty20. This means we will miss players labelled under 'did not bat',

"Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh"

and catch lines that are past the list of player names, like this:

"Fall of wickets 1-32 (Gambhir, 4.4 ov), 2-33 (Karthik, 4.6 ov), 3-33 (Sehwag, 5.1 ov), 4-61 (Uthappa, 10.3 ov), 5-146 (Dhoni, 19.4 ov)"

We can make another block of code to search for the list of non-batters.
Here, we only look at cases that start with 'Did not bat', then we remove that first part and split along ', '. We use str_split instead of str_split_fixed, because we do not know in advance how many people will be on the list.

Finally, we use the unlist() function, which is part of base R, to turn the list from str_split into an array like the one we get from str_split_fixed.

if(scanstate == 1 && str_detect(intext[i],"Did not bat")){
   thistext = intext[i]
   thistext = str_split_fixed(thistext,"Did not bat",2)[2]
   thistext = unlist(str_split(thistext,", "))

   playlist[playeridx:11,scanstate] = thistext

   scanstate = 0
   playeridx = 1
}
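To show what each step returns, here is the same split applied to the 'Did not bat' line on its own. The names dnbline, rest, and nonbatters are only for this sketch.

```r
library(stringr)

dnbline = "Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh"

## Everything after the 'Did not bat' label
rest = str_split_fixed(dnbline, "Did not bat", 2)[2]

## str_split returns a list with one element per input string...
aslist = str_split(rest, ", ")

## ...so unlist() flattens it to a plain character vector of names
nonbatters = unlist(aslist)

print(is.list(aslist))
print(nonbatters)
```

In this game, seven batters came before this line, so playeridx would be 8 and the four names fill slots 8 to 11 exactly. The assignment into playlist[playeridx:11, ] relies on those two counts agreeing.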

Then, to fix the 'extra lines' problem, we add some conditions to the first if statement. Specifically, we ignore lines with the terms 'Extras', 'Total', and 'Did not bat'.

if(scanstate == 1 
         & !str_detect(intext[i],"Extras") 
         & !str_detect(intext[i],"Total   ") 
         & !str_detect(intext[i],"Did not bat") )

Now when we run this code, we get the list of players in their (intended) batting order, and nothing more!

 [1,] "G Gambhir"      
 [2,] "V Sehwag"       
 [3,] "KD Karthik"     
 [4,] "RV Uthappa"     
 [5,] "RG Sharma"      
 [6,] "MS Dhoni"       
 [7,] "IK Pathan"      
 [8,] "Harbhajan Singh"
 [9,] "Joginder Sharma"
[10,] "S Sreesanth"    
[11,] "RP Singh"   

Here is the complete code for this task, which I used to find the names of both teams.
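For a quick way to try the pieces together without the text file, here is a minimal sketch that assembles the snippets from this post for a single innings. It is not the complete code mentioned above: it handles one team only, uses a one-column playlist, and feeds in a handful of lines typed in by hand. The 'Extras' and 'Total' lines are hypothetical placeholders, since the real ones are not shown in this post.

```r
library(stringr)

## Stand-in for readLines(): lines copied from the excerpts in this post,
## plus placeholder 'Extras' and 'Total' lines (hypothetical content)
intext = c(
  "Played at Kingsmead, Durban",
  "    India innings (20 overs maximum)   R   M   B   4s   6s   SR",
  "View dismissal   G Gambhir   c Smith b Pollock   19   19   19   3   0   100.00",
  "View dismissal   V Sehwag   c †Boucher b Ntini   11   23   11   1   0   100.00",
  "View dismissal   KD Karthik   c JA Morkel b Pollock   0   1   1   0   0   0.00",
  "View dismissal   RV Uthappa   c Smith b M Morkel   15   22   16   1   1   93.75",
  "RG Sharma   not out   50   52   40   7   2   125.00",
  "View dismissal   MS Dhoni*†   run out (Philander)   45   39   33   4   1   136.36",
  "IK Pathan   not out   0   1   1   0   0   0.00",
  "Extras   (placeholder)",
  "Total   (placeholder)",
  "Did not batHarbhajan Singh, Joginder Sharma, S Sreesanth, RP Singh",
  "Fall of wickets 1-32 (Gambhir, 4.4 ov), 2-33 (Karthik, 4.6 ov)"
)

playlist = matrix("", nrow=11, ncol=1)   ## one column: this sketch covers one team
scanstate = 0                            ## 0 = not looking for names, 1 = looking
playeridx = 1

for(i in seq_along(intext)){
  if(scanstate == 1 && str_detect(intext[i], "Did not bat")){
    ## Non-batters fill the remaining slots, then we stop scanning
    thistext = str_split_fixed(intext[i], "Did not bat", 2)[2]
    thistext = unlist(str_split(thistext, ", "))
    playlist[playeridx:11, 1] = thistext
    scanstate = 0
    playeridx = 1
  } else if(scanstate == 1
            && !str_detect(intext[i], "Extras")
            && !str_detect(intext[i], "Total   ")){
    ## Batters: strip the label, cut at the dismissal tag, tidy up
    thistext = str_replace(intext[i], "View dismissal", "")
    thistext = str_split_fixed(thistext, " c | lbw | b | not out | run out | st ", 2)[1]
    thistext = str_trim(thistext)
    thistext = str_replace_all(thistext, "[*]|†", "")
    playlist[playeridx, 1] = thistext
    playeridx = playeridx + 1
    if(playeridx > 11){scanstate = 0}
  }

  ## Checked last, so the header line itself is never saved as a name
  if(str_detect(intext[i], " innings [(][0-9]+")){
    scanstate = 1
  }
}

print(playlist)
```

The header check sits at the end of the loop body on purpose: scanstate only flips to 1 after the header line has been passed over, so the next line read is the first batter.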

Saturday, 17 January 2015

Package Spotlight - xtable

xtable is an R package on CRAN that converts output into LaTeX code for a table of that output.

For writing papers, it has cut the time it takes to produce and manage a table by a factor of 3 or 4.

Here's some example code:

## Install and load
install.packages("xtable")
library(xtable)

## Get 10 random rows from the Iris dataset
theseRows = sample(1:nrow(iris), 10)
theseRows = sort(theseRows)
dat = iris[theseRows,]

### Make a LaTeX table of the data and print to screen
xtable(dat)

...and you have a basic table. From there you can make edits to the table in LaTeX, but I recommend doing more formatting with xtable first. You specify some parameters in the call to xtable().

### Print the LaTeX table code with more precision and a caption
xtable(dat, digits=4, caption="Ten lines from Iris")

If you store the xtable object as a variable rather than printing it out right away, you can change or retrieve xtable parameters.

### Make a LaTeX table of the data and save as "xt"
xt = xtable(dat, digits=4, caption="Ten lines from Iris")

## Check the default column alignments
align(xt)

## Change the alignments and add a vertical break after the 3rd column
align(xt) = "clr|clr"

## Change the caption
caption(xt) = "Dinner menu for goats"

With a stored xtable object, you can specify a wider variety of parameters with print(). Note that the row names are considered to be the first column, so when they are dropped with include.rownames=FALSE, the LaTeX code that comes out of print(xt, ... ) will have lr|clr in its alignment specification, not clr|clr.

print(xt, include.rownames=FALSE)
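As a further sketch of what print() can take, the example below uses two more arguments that I believe are part of print.xtable: caption.placement to put the caption above the table, and file to write the LaTeX to disk instead of the screen. The file name is made up for the example, and head(iris, 10) replaces the random sample so the result is the same every run.

```r
library(xtable)

dat = head(iris, 10)   ## first ten rows, so the example is reproducible
xt = xtable(dat, digits=4, caption="Ten lines from Iris")
align(xt) = "clr|clr"

## Capture what print() would send to the screen as a character vector
texlines = capture.output(
  print(xt, include.rownames=FALSE, caption.placement="top")
)

## Write the same table to a file instead (hypothetical file name)
outfile = file.path(tempdir(), "iris_table.tex")
print(xt, include.rownames=FALSE, caption.placement="top", file=outfile)

## Show the first few lines of the generated LaTeX
print(head(texlines))
```

Writing to a file is handy when the paper pulls the table in with \input{}, since re-running the R script then refreshes the table without any copying and pasting.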

For more details, see the xtable reference manual on CRAN.

Sunday, 4 January 2015

SPSS Guide for Basic Practice of Statistics

I've made some major updates to this introductory guide to SPSS. (Google docs link, but you can download from there)

It's a guide to using SPSS to answer homework problems from Basic Practice of Statistics (BPS), by Moore, Notz, and Fligner. Although it was originally made for a specific stats for social sciences course, I hope it's applicable beyond that.

The biggest change from the 2012 and 2013 versions is that the guide is no longer tooled around Elementary Statistics for Social Sciences by Levin, Fox, and Forde. Those versions will not be posted here because this 'elementary' textbook is better at teaching fear than statistics. Consequently the GLM procedure and logistic regression chapters have been removed from the SPSS guide.

The update also includes more information about data manipulation specific to 'Basic Practice', such as random selection, sorting, and weighting data.

There is already a larger, official SPSS companion for this textbook, but my guide is intended to provide just what a novice student needs to get through their course, rather than be a comprehensive reference.

This work is freely available to share and was originally commissioned with a research grant from Simon Fraser University in 2012.