Saturday, 31 October 2015

A Web Crawler using R

This R-based web crawler, available here...
1. Reads the HTML of a webpage from a given address,
2. Extracts the hyperlinks from that page to other pages and files,
3. Filters out links that fail to meet given criteria (e.g. links to other websites, already-explored pages, and non-HTML files),
4. Stores the origin and destinations of the remaining links,
5. Selects a link from those discovered so far, and returns to 1.

The scraper can be used to gather raw data from thousands of pages of a website and to reveal information about the network of links between them. For example, starting just now at the front page of the National Post website, the crawler visited a news article, the main page for horoscopes, the day's horoscopes, and an article from the financial pages of the paper.

url_seed= "http://www.nationalpost.com/index.html"
hasall = "http://[a-zA-Z]+.(national|financial)post.com/"
result = web_crawl(url_seed=url_seed, hasall=hasall, Npages=5)

112 queued, 2 ://news...elections-canada-urges...
122 queued, 3 ://news...life/horoscopes/
136 queued, 4 ://news...daily-horoscope-for-tuesday-october-27-2015
136 queued, 5 ://business.financialpost.com.../where-to-get...

The core function of the scraper is getHTMLLinks() in the XML package. As long as there is a webpage at the given address (including a custom 404 error page), this function downloads the HTML code and returns all the hyperlinks found in that code as a vector of strings.
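
For example, the following minimal call (a sketch using the seed address from above) fetches the page and lists the links it contains:

library(XML)

# Download the page and extract every hyperlink found in its HTML
links = getHTMLLinks("http://www.nationalpost.com/index.html")
head(links)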

The scraper itself is two function definitions. The inner function, get_clean_links(), calls getHTMLLinks() and applies some basic and user-defined criteria to filter out undesired links. In the above example, the regular expression named 'hasall' is one such criterion. Any link returned by get_clean_links() must match all of the patterns in the parameter 'hasall', none of the patterns in 'hasnone', and at least one of the patterns in 'hassome'.
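
As a rough illustration of that filtering logic (a sketch, not the crawler's actual code), the checks amount to something like this, where 'links' is a character vector of addresses:

# Keep only the links that satisfy all three kinds of pattern criteria
filter_links = function(links, hasall=NULL, hasnone=NULL, hassome=NULL)
{
    keep = rep(TRUE, length(links))
    for(p in hasall)  keep = keep & grepl(p, links)    # must match every 'hasall' pattern
    for(p in hasnone) keep = keep & !grepl(p, links)   # must match no 'hasnone' pattern
    if(length(hassome) > 0)
    {
        some = rep(FALSE, length(links))
        for(p in hassome) some = some | grepl(p, links)
        keep = keep & some                             # must match at least one 'hassome' pattern
    }
    links[keep]
}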

On some websites, links are recorded relative to their origin. In these cases, the first part of each link will be missing, and you may need to put that first part of the address back into each link before it can be used. You can do this by setting the parameter 'prepend' to the missing part of the address.
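
For example (an illustrative sketch with made-up relative links):

relative_links = c("/news/politics/", "/life/horoscopes/")
prepend = "http://news.nationalpost.com"
full_links = paste0(prepend, relative_links)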

The outer function, web_crawl(), calls get_clean_links() and passes all relevant parameters, such as filtering criteria, to it. The function web_crawl() keeps all the acceptable links that have been discovered so far and selects from those links the next page to explore. This function also handles the formatting of the network information into two data frames, called edges and nodes, which it will return after Npages web pages have been explored.

str(result$edges)
'data.frame': 243 obs. of 4 variables:
$ From: chr "http://www.nationalpo...
$ To: chr "http://www.nationalpo...
$ From_ID: int 1 1 1 1 1 1 1 1 1 1 ...
$ To_ID: int 6 7 8 9 10 11 12 13 14 15 ...


str(result$nodes)
'data.frame': 5 obs. of  4 variables:
$ address : chr "http://www.nationalp...
$ degree_total: num  114 39 37 53 0
$ degree_new  : num  114 11 13 14 0
$ samporder  : int  1 2 3 4 5

In web_crawl(), the parameter url_seed determines the first address to explore. url_seed can be a vector of web addresses, in which case the crawler will begin exploration with either the first address in the vector or a random one, depending on how queue_method is set.

The parameter queue_method determines how the next link is selected: when queue_method is 'BFS', all of the direct links from the page at url_seed are explored before any second-order links; when queue_method is 'Random', every link that has been discovered but not yet explored is equally likely to be selected next.
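
In sketch form (hypothetical names and example addresses, not the function's actual internals), the selection amounts to:

# 'queue' stands in for the vector of links discovered but not yet explored
queue = c("http://www.example.com/page1", "http://www.example.com/page2")
queue_method = "Random"
next_link = switch(queue_method,
                   BFS    = queue[1],          # earliest-discovered link first
                   Random = sample(queue, 1))  # uniform over the queue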

Finally, web_crawl() throttles the speed at which requests are made to web servers. This is partly for the stability of the crawler, and partly to avoid undue stress on, or attention from, the server being scraped. The parameter 'sleeptime', which defaults to 3, sets the number of seconds that R will idle between pages.
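
The throttle amounts to a pause between page requests, presumably along the lines of:

sleeptime = 3          # default: three seconds between pages
Sys.sleep(sleeptime)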

The crawler only retains the addresses of the links, but if you want more in-depth data, you can use the getURL("address") function in the RCurl package to get the HTML code at "address". Only the link structure was necessary for my purposes, but a lot more is possible by making the crawler your own with the functions in the XML and RCurl packages.
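
For example, to pull down the raw HTML of one of the discovered addresses for further parsing:

library(RCurl)

# Retrieve the full HTML source of a page as a single character string
page_html = getURL("http://www.nationalpost.com/index.html")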

The code for this crawler was written with the help of diffuseprior ( http://www.r-bloggers.com/web-scraping-in-r/ ) and Brock Tibert ( http://www.r-bloggers.com/create-a-web-crawler-in-r/ ).

Wednesday, 28 October 2015

Take-home lessons and code from a factor-cluster analysis with imputation


Recently I was tapped to examine data from a survey of ~200 children to find whether their learning preferences fell into well-defined profiles (i.e. clusters). The relevant part of the survey had more than 50 Likert-scale questions. The client and I decided that a factor analysis, followed by a cluster analysis, would be most appropriate.


I learned some things in doing this analysis, and wanted to share that and some relevant code.

1. Dimension reduction pairs well with clustering.


Had we simply done a k-means clustering, we would have gotten poor results. There are meaningful correlations between questions, but k-means clustering treats each question as an independent variable. Differences on a correlated set of questions would be counted once per question, overstating the differences in responses as a whole.

Also, factor analysis, a standard dimension reduction method, provides combinations of variables which may make clustering results easier to explain.


2. Carry multiple imputations through the entire analysis. Don’t combine results prematurely.


The purpose of doing multiple imputations, rather than a single imputation, is to have the final results reflect the additional uncertainty that comes from using unknown values. Combining the results in the middle of an analysis, such as after dimension reduction but before k-means clustering, masks that uncertainty. For example, by aggregating the imputed results before clustering, the centre of each cluster will be interpreted as a single point, perhaps with some variance estimated by the delta method. By doing a k-means cluster analysis for each imputed dataset instead, we produced m points for each cluster centre, for m=7 imputations.


Here is the function I made to take one of the m imputed datasets and perform the factor-cluster analysis. It made working with multiple datasets a lot less tedious.


factor_cluster = function(dat, Nfactors=2, Nprofiles=4)
{
    # Do a factor analysis and keep the matrix of factor loadings
    loading_values = factanal(dat, factors=Nfactors)$loadings
    loading_values = loading_values[1:ncol(dat), 1:Nfactors]

    # factor_values is the matrix of factor scores, one row per case
    factor_values = as.matrix(dat) %*% loading_values

    ###### K-means clustering on the factor scores
    cluster_output = kmeans(factor_values, centers=Nprofiles)

    ### Output the loadings, factor scores, and clustering results
    return(list(loadings=loading_values, factor=factor_values, cluster=cluster_output))
}
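
Applying the function across imputations is then straightforward. The sketch below assumes the m=7 imputed datasets are already collected in a list called imputed_list (a hypothetical name; the imputation step itself is not shown), and it stacks the k-means centres into an array for the realignment step described next:

# Run the factor-cluster analysis on each imputed dataset
# (imputed_list is assumed to already exist)
results = lapply(imputed_list, factor_cluster, Nfactors=2, Nprofiles=3)

# Stack the k-means centres into an Nimpute x Nclust x Nfactor array
centers = sapply(results, function(r) r$cluster$centers, simplify="array")
centers = aperm(centers, c(3, 1, 2))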


3. Results from multiple imputations are found independently, so they sometimes need alignment.

Figure 1 below shows where the k=3 cluster means were computed along two of the factor scores for each of the m=7 imputed datasets.


There is strong agreement between analyses on the location of the means, but not on their labels. The plotted numbers are the cluster labels assigned in each analysis. What one cluster analysis refers to as “centre number 2” is essentially the same as what another analysis calls “centre number 3”.

This is a problem because the ~200 cases are assigned these labels, which gives the appearance of far more disagreement between imputations about cluster assignment than there really is. However, we can realign the labels so that nearby centres from different imputed datasets share the same label. The following code does that alignment and relabeling:


align_clusters = function(input_centers, input_cluster)
{
    # input_centers: Nimpute x Nclust x Nfactor array of k-means centres
    # input_cluster: Ncase x Nimpute matrix of cluster assignments
    require(plyr)
    output_centers = input_centers
    output_cluster = input_cluster

    Nimpute = dim(input_centers)[1]
    Nclust = dim(input_centers)[2]
    trans = array(NA, dim=c(Nimpute, Nclust))

    # For each reference cluster i (taken from imputation 1) and each
    # imputation j, find the label k in imputation j whose centre is closest
    for(i in 1:Nclust)
    {
        for(j in 1:Nimpute)
        {
            dist_vec = rep(NA, Nclust)
            for(k in 1:Nclust)
            {
                dist_vec[k] = dist(rbind(input_centers[1,i,], input_centers[j,k,]))
            }
            trans[j,i] = which.min(dist_vec)
        }
    }

    # Reorder the centres and relabel the assignments so that, in every
    # imputation, label c refers to the centre nearest reference cluster c
    for(i in 1:Nimpute)
    {
        output_centers[i,,] = input_centers[i,trans[i,],]
        output_cluster[,i] = mapvalues(input_cluster[,i], from=trans[i,], to=1:Nclust)
    }

    return(list(centers = output_centers, cluster = output_cluster))
}
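
Continuing the sketch from above (using the hypothetical 'results' and 'centers' objects built after factor_cluster), the realignment call would look like:

# Matrix of cluster assignments, one column per imputation
cluster_labels = sapply(results, function(r) r$cluster$cluster)

aligned = align_clusters(input_centers=centers, input_cluster=cluster_labels)
str(aligned$centers)   # centres now share one labelling across imputations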

There is a potential uniqueness issue in the relabeling, but it wasn’t a problem for me, so I opted for fast over perfect. Figure 2 shows the means with the realigned labels.

This realignment was only necessary for the cluster analysis. Thankfully, R and other software arrange factors by the amount of variance they explain, so what the factor analysis of one imputed dataset labels ‘Factor 1’ is essentially the same as what all of the other analyses label ‘Factor 1’. If there wasn’t a clear ordering of factors, we could have repurposed the realignment code to fix that.

Main code file for this analysis
Functions code file for this analysis

This post written with the permission of the client.

Sunday, 18 October 2015

Teaching Philosophy Statement on Intro Math/Stats - Cartographers of Knowledge

     In the digital age, a teacher's role is not simply to present knowledge, but to navigate it. The overwhelming majority of the information that an undergraduate gains in her degree is available for free on the internet or in libraries.

     A digital age instructor's job is to guide and motivate students' paths through this information – to provide the vision and context necessary to develop expertise in a field. Introductory courses make up the bulk of students' collective experience with mathematics and statistics, so any expertise to be gained in those one or two courses needs to be a self-contained package.

     For example, rigorous proofs serve introductory students very little; the practice of rigor and constructing proofs has little value until upper-division courses. Introductory students learn by doing the tasks that are actually relevant to the course: examples. As such, I prefer to relegate much of the proof work to optional out-of-class readings. The extra instructional time for guided, step-by-step examples makes the material more accessible. It also provides more opportunities to fill the fundamental gaps from high-school mathematics that would otherwise prevent understanding. For the few who do continue in a mathematics or statistics major, I feel that what they may lack in experience with proofs is more than compensated for by a stronger foundation in the introductory material.

    This focus on accessibility extends to my policies on assignments and office hours. Assignments should be vehicles for students to struggle through a set of practice problems and receive formative feedback. However, logistics of providing quality feedback aside, that doesn't work for everyone. Assignments need to have grades attached so that students have extrinsic motivation to complete them, but those same grades penalize mistakes on something that should be practice.

    I want assignments to be important and challenging enough to take seriously, but not so much as to tempt plagiarism. In the past, I have addressed this by booking extra office hours on the days before assignments are due, and telling my students that I will give them entire solutions to assignment questions. I've found that on these office days, a group of 5-12 students would come to my office with their assignment hang-ups, but that they could answer each other's questions with only moderate guidance from me. Some of these students likely sat in to get their solutions from the rest of the office group, but that's still better than copying written assignments verbatim.


      Finally, I try to explicitly declare the 'take-home messages' in my lessons: the few ideas that I hope students will remember long after the final exam is over. These messages include general strategies about the mathematical sciences, such as “every hard problem is merely a collection of easy problems”, and George Box's quote, “all models are wrong, some are useful.” If my students are to retain anything from their time spent on coursework, I hope it's something of value and general applicability rather than memories of referring to tables of integrals and probability.

Monday, 12 October 2015

Now you're thinking with gates!

What do Nintendo and Bitcoin enthusiasts have in common? They weren't content with solving their problems through software advancements alone. The statistical computing field shouldn't be either.

-------------------

The Super Nintendo Entertainment System is a cartridge-based system, meaning that its games were stored on circuit boards encased in plastic cartridges. Unlike disc-based media of most later generations of game consoles, the contents of cartridges were not restricted to read-only data. The most common addition to game cartridges was a small cache of re-writable memory used to store progress data in the cartridge.

Originally, active objects in games, called sprites, could only be displayed as one of a set of pre-drawn frames.  That's why sprite animations are usually simple loops of a few frames, and why characters are rarely seen changing size as they move towards or away from the player's point of view.

However, later games also included special-purpose microchips that expanded the graphical capabilities of the Super Nintendo console itself. One of these chips allowed the SNES to change the way sprites look as the game was happening, which made sprites look much more alive. This chip also allowed for rudimentary three-dimensional rendering.

Any software workaround to get these effects using only the hardware built into the Super Nintendo would have taken much longer and produced much worse results, if any at all. The video on the Super Nintendo (SNES) by video game trivia group Did You Know Gaming covers these effects and the chips in more detail, and shows some great demonstrations.

------------------

Bitcoin is a cryptocurrency. Part of what gives it value is the premise that it is computationally hard to create, or 'mine', new coins. In fact, there is a self-adjustment mechanism that increases the mining difficulty in proportion to the total computing power of all miners.

I've annotated this historical chart of the log of total computing power (and the log of the mining difficulty) over time with the two hardware advancements that defined the trend in bitcoin mining power.


The first event represents the first time that mining with a more specialized graphics processing unit (GPU), rather than a more general central processing unit (CPU), was made publicly possible. Since many miners already had compatible graphics cards, we see a tenfold jump in power almost instantly.

The second event represents the first time that a single-purpose mining processor, called an ASIC*, was introduced to the market. This sparked another rapid increase in processing power, but without the initial leap.

An ASIC is orders of magnitude faster at the simple, repetitive task of mining bitcoins than a GPU is, and a GPU mines orders of magnitude faster than a comparably priced CPU. In both cases, the new hardware quickly rendered any previous mining methods obsolete.

* Application Specific Integrated Circuit.

---------------

When developing new methods to solve computational problems, a software approach usually works best. The results of purely software-based development are often portable as packaged programs, and the dissemination of improvements can be as quick and as cheap as a software update. The feedback loop of testing and improvement is very fast, and there are many languages, such as R, SAS, and Julia, that make software-based solutions a routine task.

Making hardware to solve a problem may sound insane by comparison - why would anyone willingly give up all of those advantages? This is where Field Programmable Gate Arrays (FPGAs) come in. An FPGA is essentially a circuit that can be programmed down to the gate level. That is, a user can write a program in terms of the fundamental particles of computation: OR, NOT, XOR, AND, and NAND gates.

The FPGA takes a set of gate instructions and physically wires itself into the programmed configuration. Once set, the FPGA is essentially an ASIC: a processor that can only do one task, but potentially much faster than a general-purpose computer. However, if needed, an FPGA can be re-programmed, so the advantage of a quick trial-and-error turnaround is there. Also, the program can be disseminated like any other software. The most popular FPGAs cost between $200 and $500 USD.

The bitcoin ASIC started off as an FPGA. Once the FPGA program was made, it took about a year for the first ASICs to be sold. This is encouraging for anyone looking towards FPGAs for the next great leap in statistical computing, as it means the endeavor has commercial viability. Just think how much faster some common methods could become even if only large matrix inversion were made faster.

It's time to start thinking with gates.