Sunday, 29 November 2015

I read this: Validity and Validation

Motivation for reading / Who should read this:
I read Validity and Validation – Understanding Statistics, by Catherine S. Taylor, in part as a follow-up to the meta-analysis book from two weeks ago, and in part because I wanted to know more about the process of validating a scale made from a series of questions (e.g. the Beck Depression Inventory, or a questionnaire about learning preferences). For scale validation, there's a field called Item Response Theory, which I understand better because of this book, though still not in any depth.

This book, a quick read at 200 small pages, complements the handbook of educational research design that Greg Hum and I made. I plan to recommend it to anyone new to conducting social science research because it provides a solid first look at the sorts of issues that can prevent justifiable inferences (called “threats to internal validity”) and those that can limit the scope of the results (called “threats to external validity”).

A good pairing with “Validity and Validation” is “How to Ask Survey Questions” by Arlene Fink. My findings from that book are in this post from last year. If I were to give reading homework to consulting clients, I would frequently assign both of these books.

What I learned:
Some new vocabulary and keywords.
 

For investigating causality, there is a qualitative analogue to directed acyclic graphs (DAGs) called ‘nomological networks’. Nomological networks are graphs describing the factors that contribute, directly or indirectly, to a construct. A construct is roughly a qualitative analogue of a response variable, but with a more inclusive definition.

To paraphrase Chapter 1 of [1]: beyond statistical checks that scores from a questionnaire or test accurately measure a construct, it's still necessary to ensure the relevance and utility of that construct.

Hierarchical linear models (HLMs) resemble random-effects models, or models that use Bayesian hyperpriors. An HLM is a linear model whose regression coefficients are themselves response values in their own linear models, possibly with random effects. More than two layers are possible, in which case the coefficients in each of those models could also be responses in models of their own, hence the term ‘hierarchical’.
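As a concrete illustration (mine, not the book's), here is a minimal two-level model in R using the lme4 package and its built-in sleepstudy data. Each subject gets its own intercept and slope, treated as draws from a higher-level distribution, which is the random-effects flavour of an HLM.

# A minimal HLM-style sketch: reaction time regressed on days of sleep
# deprivation, with subject-level intercepts and slopes.
library(lme4)

fit = lmer(Reaction ~ Days + (1 + Days | Subject), data=sleepstudy)
summary(fit)   # fixed effects plus the variance of the subject-level coefficients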

What is Item Response Theory?
Item response theory (IRT) is a set of methods that places both questions/items and respondents/examinees in the same or related parameter spaces.

The simplest model is the 1-parameter IRT model, also called a Rasch model. A Rasch model assigns a ‘difficulty’ or ‘location’ to an item based on how many respondents give a correct/high answer versus an incorrect/low answer. At the same time, respondents get a location value based on the items to which they give a correct/high response. An item that few people get correct will have a high location value, and a person who gets many items correct will also have a high location value.

A 2-parameter model adds a dimension for ‘discrimination’. Items with higher discrimination elicit a greater difference in responses between respondents whose locations are below the item's and those whose locations are above it. Models with more parameters, and models for non-binary questions, also exist.
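Here is a minimal sketch (mine, not the book's) of fitting a Rasch model in R with the ‘mirt’ package, one of the IRT packages listed further down, on simulated binary responses:

library(mirt)

set.seed(1)
n_people = 500
n_items = 10
theta = rnorm(n_people)                # person locations
b = seq(-2, 2, length.out=n_items)     # item locations ('difficulties')

# Under the Rasch model, P(correct) = 1 / (1 + exp(-(theta - b)))
prob = plogis(outer(theta, b, "-"))
resp = 1 * (matrix(runif(n_people * n_items), n_people, n_items) < prob)
colnames(resp) = paste0("item", 1:n_items)

fit = mirt(data.frame(resp), model=1, itemtype="Rasch")
coef(fit, IRTpars=TRUE, simplify=TRUE)$items   # estimated item parameters (b = location)
head(fscores(fit))                             # estimated person locations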

The WINSTEPS software package for item response theory (IRT):
 
WINSTEPS is a program that, when used on a data set of n cases each giving numerical responses to p items, gives an assessment of how well each item fits in with the rest of the questionnaire. It reports two statistics: INFIT and OUTFIT. OUTFIT is a goodness-of-fit measure for extreme respondents at the tail ends of the dataset; INFIT is a goodness-of-fit measure for typical respondents.

In the language of IRT, this means INFIT is sensitive to odd patterns from respondents whose locations are near that of the item, and OUTFIT is sensitive to odd patterns from respondents with locations far from the item. Here is a page with the computations behind each statistic.

On CRAN there is a package called Rwinsteps, which allows you to call functions in the WINSTEPS program inside R. There are many item response theory packages in R, but the more general ones appear to be “cacIRT”, “classify”, “emIRT”, “irtoys”, “irtProb”, “lordif”, “mcIRT”, “mirt”, “MLCIRTwithin”, “pcIRT”, and “sirt”.

For future reference.
Page 11 has a list of common threats to internal validity.

Pages 91-95 have a table of possible validity claims (e.g. “Scores can be used to make inferences”), which are broken down into specific arguments (e.g. “Scoring rules are applied consistently”), which in turn are broken down into tests of that argument (e.g. “Check conversion of ratings to score”).

Pages 158-160 have a table of guidelines for making surveys translatable between cultures. These are taken from the International Test Commission's guidelines for translating and adapting tests between languages and cultures. https://www.intestcom.org/files/guideline_test_adaptation.pdf

The last chapter is entirely suggestions for future reading. The following references stood out:

[1] (Book) Educational Measurement, 4th edition, edited by Brennan, 2006. Especially the first chapter, by Messick.

[2] (Book) Hierarchical Linear Models: Applications and Data Analysis by Raudenbush and Bryk, 2002.

[3] (Book) Structural Equation Modelling with EQS by Byrne 2006. (EQS is a software package)

[4] (Book) Fundamentals of Item Response Theory, by Hambleton, Swaminathan, and Rogers, 1991.

[5] (Book) Experimental and Quasi-Experimental Designs for Generalized Causal Inference by Shadish, Cook, and Campbell, 2002. (This is probably different from the ANOVA/factorial-heavy design/analysis of experiments taught in undergrad stats.)

[6] (Journal) New Directions in Evaluation

Tuesday, 10 November 2015

I read this: Meta-Analysis, A Comparison of Approaches

My motivation for reading Meta-Analysis: A Comparison of Approaches by Ralph Schulze was to further explore the idea of a journal of replication and verification. Meta-analyses seemed like a close analogy, except that researchers are evaluating many studies together, rather than one in detail. I’m not working on any meta-analyses right now, but I may later. If you are reading this to decide if this book is right for you, consider that your motivations will differ.

Summary: ‘Meta-Analysis’, at about 200 pages, was easy to read for a graduate-level textbook and managed to be rigorous without overwhelming the reader with formulae.

Most of the book is dedicated to the mathematical methods of finding collective estimates of a value, or set of values, from related independent studies. The latter half of the book describes these methods in detail and reports a large Monte Carlo comparison of them under a range of conditions, including different sample sizes per study, different numbers of studies, and different underlying correlation coefficients.

The first half was much more useful on a first reading, but the detailed descriptions and comparisons would make an excellent reference if I were preparing or performing a meta-analysis.
The ‘soft’ aspects of meta-analysis are only briefly touched upon, but several promising references are given. References on the retrieval of studies (e.g. sampling, scraping, and coverage) and on assessing studies for quality and relevance include several chapters of [1] and [2].

Take-home lessons (i.e. what I learned):
The most common method to get a collective estimate of a parameter is to take a weighted sum of the estimates from the independent studies, with weights inversely proportional to the variance of each estimate. This method makes a very questionable assumption: that all of the papers studied are estimating the same parameter.
The author calls this assumption the fixed effect model. Some of the methods described explicitly use this model, but all of the methods include a test (usually based on the chi-squared distribution) to detect whether a fixed effect model is inappropriate.
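To make the weighting concrete, here is a minimal sketch (not from the book) of the fixed effect pooled estimate, using made-up study estimates and variances:

# Hypothetical per-study estimates and their variances
est = c(0.30, 0.25, 0.41)
v   = c(0.010, 0.020, 0.015)

w = 1 / v                         # inverse-variance weights
pooled = sum(w * est) / sum(w)    # fixed effect pooled estimate
pooled_se = sqrt(1 / sum(w))      # its standard error
c(pooled = pooled, se = pooled_se)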

Other methods, such as Olkin and Pratt's and the DerSimonian-Laird estimator, use the more complex but more realistic random effects model. Under this model, the parameters that the individual studies estimate are related, but slightly different. The collective estimate that comes out of the meta-analysis is then an estimate of a parameter one layer of abstraction above the parameters described in each individual study.

There are other, still more complex models that are viable, such as mixture models or hierarchical linear models in which each study's parameter estimate is an estimate of some combination of abstract parameters, but these are only briefly covered in ‘Meta-Analysis’.

Many of the methods described used Fisher’s z transformation in some way, where

z = 1/2 * ln( (1 + r) / (1 - r) ),

which is a pretty simple transformation for Pearson correlation coefficients r that maps from [-1, 1] to (-infty, +infty), converges to normality much faster than r does, and has an approximate variance, roughly 1/(n - 3), that depends only on the sample size n. (Found on pages 22-23.)
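In R the transformation is just atanh(); a quick sketch (mine, not from the book) of a confidence interval built on the z scale and back-transformed:

r = 0.6; n = 50
z = atanh(r)                               # identical to 0.5 * log((1 + r) / (1 - r))
ci_z = z + c(-1, 1) * 1.96 / sqrt(n - 3)   # var(z) is approximately 1/(n - 3)
tanh(ci_z)                                 # back on the correlation scale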

Also, apparently transforming effect sizes into correlations by treating the treatment group as a continuous variable coded 0 or 1 isn't overly problematic (pages 30-32), and it can be very useful for bringing a wider range of studies into a meta-analysis when a collective correlation coefficient is desired.

I didn't find any clear beacon that said "this is where replication work is published", but I found the following promising leads:

[1] The Handbook of Research Synthesis (1994)
[2] Chalmers et al. (1981) “A method for assessing the quality of a randomized control trial.” Controlled Clinical Trials, volume 2, pages 31-49.
[3] Quality & Quantity: International Journal of Methodology
[4] Educational and Psychological Measurement (Journal)
[5] International Journal of Selection and Assessment
[6] Validity Generalization (Book)
[7] Combining Information: Statistical Issues and Opportunities for Research (Book)

Saturday, 31 October 2015

A Web Crawler using R

This R-based web crawler, available here...
1. Reads the HTML of a webpage from a given address,
2. Extracts the hyperlinks from that page to other pages and files,
3. Filters out links that fail to meet given criteria (e.g. links to other websites, already-explored pages, and non-HTML files),
4. Stores the origin and destinations of the remaining links,
5. Selects a link from those discovered so far, and returns to 1.

The scraper can be used to gather raw data from thousands of pages of a website and to reveal information about the network of links between them. For example, starting just now at the front page of the National Post website, the crawler visited a news article, the main page for horoscopes, the day's horoscopes, and an article from the financial pages of the paper.

url_seed= "http://www.nationalpost.com/index.html"
hasall = "http://[a-zA-Z]+.(national|financial)post.com/"
result = web_crawl(url_seed=url_seed, hasall=hasall, Npages=5)

112 queued, 2 ://news...elections-canada-urges...
122 queued, 3 ://news...life/horoscopes/
136 queued, 4 ://news...daily-horoscope-for-tuesday-october-27-2015
136 queued, 5 ://business.financialpost.com.../where-to-get...

The core function of the scraper is getHTMLLinks() in the XML package. As long as there is a webpage at the given address (even a custom 404 error page), this function downloads the HTML code and returns all the hyperlinks found in that code as a vector of strings.

The scraper itself is two function definitions. The inner function, get_clean_links(), calls getHTMLLinks() and applies some basic and user-defined criteria to filter out undesired links. In the above example, the regular expression named 'hasall' is such a criterion. Any link returned by get_clean_links() must match all of the patterns in the parameter 'hasall', none of the patterns in 'hasnone', and at least one of the patterns in 'hassome'.

On some websites, links are recorded relative to their origin page. In these cases the first part of each address will be missing, and you may need to put it back into each link before it can be used; you can do this by setting the parameter “prepend” to the missing part of the address, as in the sketch below.
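For example (a hypothetical illustration, not the crawler's own code), a relative link is completed like this:

# Relative links found on a page, missing the site's base address
relative_links = c("/news/article123.html", "/life/horoscopes/")
prepend = "http://www.nationalpost.com"

paste0(prepend, relative_links)   # what 'prepend' does inside get_clean_links()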

The outer function, web_crawl(), calls get_clean_links() and passes all relevant parameters, such as filtering criteria, to it. The function web_crawl() keeps all the acceptable links that have been discovered so far and selects from those links the next page to explore. This function also handles the formatting of the network information into two data frames, called edges and nodes, which it will return after Npages web pages have been explored.

str(result$edges)
'data.frame': 243 obs. of 4 variables:
$ From: chr "http://www.nationalpo...
$ To: chr "http://www.nationalpo...
$ From_ID: int 1 1 1 1 1 1 1 1 1 1 ...
$ To_ID: int 6 7 8 9 10 11 12 13 14 15 ...


str(result$nodes)
'data.frame': 5 obs. of  4 variables:
$ address : chr "http://www.nationalp...
$ degree_total: num  114 39 37 53 0
$ degree_new  : num  114 11 13 14 0
$ samporder  : int  1 2 3 4 5

In web_crawl(), the parameter url_seed determines the first address to explore. url_seed can be a vector of web addresses, in which case the crawler will begin exploration with either the first address in the vector, or a random one depending on how queue_method is set.

The parameter queue_method determines how that link is selected; when queue_method is 'BFS' all the direct links from the page at url_seed are explored before any second-order links are explored. When queue_method is 'Random', any link discovered but not explored could be selected next with equal probability.

Finally, web_crawl() throttles the speed at which requests are made to web servers. This is partly for the stability of the crawler, and partly to avoid undue stress on, or attention from, the server being scraped. The parameter 'sleeptime', which defaults to 3, sets the number of seconds that R will idle between pages.

The crawler only retains the addresses of the links, but if you want more in-depth data, you can use the getURL(“address”) function in the RCurl package to get the HTML code at “address”. Only the link structure was necessary for my purposes, but a lot more is possible by making the crawler your own with the functions in the XML and RCurl packages.

The code for this crawler was written with the help of diffuseprior ( http://www.r-bloggers.com/web-scraping-in-r/ ) and Brock Tibert ( http://www.r-bloggers.com/create-a-web-crawler-in-r/ ).

Wednesday, 28 October 2015

Take-home lessons and code from a factor-cluster analysis with imputation


Recently I was tapped to examine data from a survey of ~200 children to find whether their learning preferences fell into well-defined profiles (i.e. clusters). The relevant part of the survey had more than 50 Likert-scale questions. The client and I had decided that a factor analysis, followed by a cluster analysis, would be most appropriate.


I learned some things in doing this analysis, and wanted to share that and some relevant code.

1. Dimension reduction pairs well with clustering.


Had we simply done a k-means clustering, we would have gotten poor results. There are meaningful correlations between questions, but k-means clustering treats each question as an independent variable. Differences on correlated sets of questions would overstate the differences in responses as a whole.

Also, factor analysis, a standard dimension reduction method, provides combinations of variables which may make clustering results easier to explain.


2. Carry multiple imputations through the entire analysis. Don’t combine results prematurely.


The purpose of doing multiple imputations, rather than a single imputation, is to have the final results reflect the additional uncertainty that comes from using unknown values. By combining the results in the middle of an analysis, such as after dimension reduction but before k-means clustering, you mask that uncertainty. For example, by aggregating the imputed results before clustering, the centre of each cluster is interpreted as a single point, perhaps with some variance estimated by the delta method. By doing a k-means cluster analysis for each imputed data set instead, we produced m points for each cluster centre, for m = 7 imputations.


Here is the function I made to take one of the m imputed datasets and perform the factor-cluster analysis. It made working with multiple datasets a lot less tedious.


factor_cluster = function(dat, Nfactors=2, Nprofiles=4)
{
# Do a factor analysis and get the matrix of factor loadings
loading_values = factanal(dat, factors=Nfactors)$loadings
loading_values = loading_values[1:ncol(dat), 1:Nfactors]

# factor_values is the matrix of values
factor_values = as.matrix(dat) %*% loading_values

###### K-means clustering
cluster_output = kmeans(factor_values, Nprofiles)

### Output the loadings, factor scores, and clustering results
return(list(loadings = loading_values, factor=factor_values, cluster=cluster_output))
}
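For example (a sketch with hypothetical names: 'imputed_list' is assumed to be a list holding the m = 7 completed datasets), the whole analysis can be run with one lapply() call, and the centres and labels stacked up for the alignment step described below:

results = lapply(imputed_list, factor_cluster, Nfactors=2, Nprofiles=4)

# Stack the cluster centres into an (imputation x cluster x factor) array,
# and the cluster labels into a (case x imputation) matrix
centers = aperm(sapply(results, function(r) r$cluster$centers, simplify="array"), c(3, 1, 2))
clusters = sapply(results, function(r) r$cluster$cluster)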


3. Results from multiple imputations are found independently, so they sometimes need alignment.

Figure 1 below shows where the k=3 cluster means were computed along two of the factor scores for each of the m=7 imputed datasets.


There is strong agreement between analyses on the location of the means, but not their labels. The numbers represent the number of the cluster given in that analysis. What one cluster analysis refers to as “centre number 2” is essentially the same as what another analysis calls “centre number 3”.

This is a big problem because the ~200 cases are assigned these labels, which gives the appearance of much more disagreement between imputations about cluster assignment than there really is. However, we can realign the labels so that nearby centres from different imputed datasets share the same label. The following code does that alignment and relabelling:


align_clusters = function(input_centers, input_cluster)
{
require(plyr)
output_centers = input_centers
output_cluster = input_cluster

Nimpute = dim(input_centers)[1]
Nclust = dim(input_centers)[2]
#Nfactor = dim(input_centers)[3]
trans = array(NA, dim=dim(input_centers)[1:2])


for(i in 1:Nclust)
{
dist_vec = rep(NA,Nclust)
for(j in 1:Nimpute)
{
for(k in 1:Nclust)
{
dist_vec[k] = dist( rbind(input_centers[1,i,],input_centers[j,k,]))
}
trans[j,i] = which(dist_vec == min(dist_vec))[1]
}
}


for(i in 1:Nimpute)
{
output_centers[i,,] = input_centers[i,trans[i,],]
# Relabel: the cluster that imputation i calls trans[i,m] becomes label m
output_cluster[,i] = mapvalues(input_cluster[,i], from=trans[i,], to=1:Nclust)
}

return(list(centers = output_centers, cluster = output_cluster))
}

There is a potential uniqueness issue in the relabeling, but it wasn’t a problem for me, so I opted for fast over perfect. Figure 2 shows the means with the realigned labels.

This realignment was only necessary for the cluster analysis. Thankfully, R and other software arrange factors by the amount of variance they explain, so what the factor analysis of one imputed dataset labels ‘Factor 1’ is essentially the same as what all of the analyses will label ‘Factor 1’. If there wasn't a clear ordering of factors, we could have repurposed the realignment code to fix that.

Main code file for this analysis
Functions code file for this analysis

This post written with the permission of the client.

Sunday, 18 October 2015

Teaching Philosophy Statement on Intro Math/Stats - Cartographers of Knowledge

     In the digital age, a teacher's role is not simply to present knowledge, but to navigate it. The overwhelming majority of the information that an undergraduate gains in her degree is available for free on the internet or in libraries.

     A digital age instructor's job is to guide and motivate students' paths through this information – to provide the vision and context necessary to develop expertise in a field. Introductory courses make up the bulk of students' collective experience with mathematics and statistics, so any expertise to be gained in those one or two courses needs to be a self-contained package.

     For example, rigorous proofs serve introductory students very little; the practice of rigor and constructing proofs has little value until upper-division courses. Introductory students learn by doing the tasks that are actually relevant to the course: examples. As such, I prefer to relegate much of the proof work to optional out-of-class readings. The extra instructional time for guided, step-by-step examples makes the material more accessible. It also provides more opportunities to fill the fundamental gaps from high school mathematics that will otherwise prevent understanding. For the few that do continue in a mathematics or statistics major, I feel that what they may lack in experience with proofs is more than compensated by a stronger foundation in the introductory material.

    This focus on accessibility extends to my policies on assignments and office hours. Assignments should be vehicles for students to struggle through a set of practice problems and receive formative feedback. However, the logistics of providing quality feedback aside, that doesn't work for everyone. Assignments need to have grades attached so students have extrinsic motivation to complete them, but those same grades penalize mistakes on something that should be practice.

    I want assignments to be important and challenging enough to take seriously, but not so much as to tempt plagiarism.  In the past, I have solved this by booking extra office hours on the days before assignments are due, and telling my students that I will give them entire solutions to assignment questions. I've found that on these office days, a group of 5-12 students would come to my office with their assignment hang-ups, but that they could answer each others' questions with only moderate guidance from me. Some of these students likely sat in to get their solutions from the rest of the office group, but that's still better than copying written assignments verbatim.


      Finally, I try to explicitly declare the 'take-home messages' by including them in my lessons: the few ideas that I hope students will remember long after the final exam is over. These messages include general strategies about the mathematical sciences, such as “every hard problem is merely a collection of easy problems”, and George Box's quote, “all models are wrong, some are useful”. If my students are to retain anything from their time spent on coursework, I hope it's something of value and general applicability rather than memories of referring to tables of integrals and probability.

Monday, 12 October 2015

Now you're thinking with gates!

What do Nintendo and Bitcoin enthusiasts have in common? They weren't content with solving their problems through software advancements alone. The statistical computing field shouldn't be either.

-------------------

The Super Nintendo Entertainment System is a cartridge-based system, meaning that its games were stored on circuit boards encased in plastic cartridges. Unlike disc-based media of most later generations of game consoles, the contents of cartridges were not restricted to read-only data. The most common addition to game cartridges was a small cache of re-writable memory used to store progress data in the cartridge.

Originally, active objects in games, called sprites, could only be displayed as one of a set of pre-drawn frames.  That's why sprite animations are usually simple loops of a few frames, and why characters are rarely seen changing size as they move towards or away from the player's point of view.

However, later games also included special-purpose microchips that expanded the graphical capabilities of the Super Nintendo console itself. One of these chips allowed the SNES to change the way sprites look as the game was happening, which made sprites look much more alive. This chip also allowed for rudimentary three-dimensional rendering.

Any software workaround to get these effects using only the hardware in the Super Nintendo would have taken much longer and produced much worse results, if any at all. The video on the Super Nintendo (SNES) by video game trivia group Did You Know Gaming covers these effects and the chips in more detail, and shows some great demonstrations.

------------------

Bitcoin is a cryptocurrency. Part of what gives it value is the premise that it is computationally hard to create, or 'mine', new coins. In fact, there is a self-adjustment mechanism that increases the mining difficulty in proportion to the total computing power of all miners.

I've appended a historical chart of the log of the total computing power (and the log difficulty) over time, marked with the two hardware advancements that defined the trend in bitcoin mining power.


 The first event represents the first time mining using a more specialized graphical processing unit (GPU) rather than a more general central processing unit (CPU) was made publicly possible. Since many miners had compatible graphics cards already, we see a tenfold jump in power almost instantly. 

The second event represents the first time mining using a single-purpose processor, called an ASIC*, was introduced to the market. This time another rapid increase in processing power was sparked, but without the initial leap.

An ASIC is orders of magnitude faster at the simple, repetitive task of mining bitcoins than a GPU is, and a GPU mines orders of magnitude faster than a comparably priced CPU. In both cases, the new hardware quickly rendered any previous mining methods obsolete.

* Application Specific Integrated Circuit.

---------------

When developing new methods to solve computational problems, a software approach usually works best. The results of purely software-based work are often portable as packaged programs, and the dissemination of improvements can be as quick and as cheap as a software update. The feedback loop of testing and improvement is very fast, and there are many languages, such as R, SAS, and Julia, that make software-based solutions a routine task.

Making hardware to solve a problem may sound insane by comparison - why would anyone willingly give up all of those advantages? This is where Field Programmable Gate Arrays come in. An FPGA is essentially a circuit board that can be programmed down to the gate level. That is, a user can write a program in terms of the fundamental particles of computation, OR, NOT, XOR, AND, and NAND gates. 

The FPGA takes a set of gate instructions and physically wires itself into the programmed configuration. When set, the FPGA is essentially an ASIC: a processor that can only do one task, but potentially much faster than a general-purpose computer. However, if needed, an FPGA can be re-programmed, so the advantage of a quick trial-and-error turnaround is there. Also, the program can be disseminated like any other software. The most popular FPGAs cost between $200 and $500 USD.

The bitcoin ASIC started off as an FPGA. Once the FPGA program was made, it took about a year for the first ASICs to be sold. This is encouraging for anyone looking towards FPGAs for the next great leap in statistical computing, as it means the endeavor has commercial viability. Just think how much faster some common methods could become even if only large matrix inversion was made faster.

It's time to start thinking with gates.

Tuesday, 8 September 2015

Arguments for Journals of Replication

In today's xkcd, a collection of hypothetical headlines is given, all of which would spell trouble for science as a whole, and all of which have to do with replicating experiments. The only fictional part of (most of) these headlines is that they're showing up in the news, because this is an old problem. Given that new journals are frequently created as fields evolve and grow, it's surprising not to find any journals of replication and verification, because having such journals could help make science more honest.

Creating and contributing to journals dedicated to replicating, verifying, and assessing work in a field is a worthwhile endeavour. Here's why:

Primary Motivation - Having a journal of replication in a field would implicitly improve the quality of research across that field. It would do this by effectively putting a bounty on bad research by offering a publication opportunity for addressing flaws. It also provides an incentive for a team that 'gets there second' to publish, which could remove some of the unscientific competitiveness of people working towards the same goal. Finally, it provides relatively easy publication opportunities for people wishing to do the work that makes a field's body of work more cohesive by repeating key experiments or by conducting meta-studies.

Assertion 1 - Replication has high scientific value: Consider an experiment done twice, in two independent labs. It may not produce as much new information as two different experiments, but the information it does provide would be of a higher quality. That's what scientific journals are supposed to do - focus on quality. Large volumes of low quality information can be found anywhere.

Assertion 2 - Replication is easy and safe: Assuming that the methods have been outlined well, someone trying to replicate an experiment can skip past a lot of the planning and false starts of the original researcher. The second research team even has the first one to ask questions of. It's safe in that the chance of a new discovery, and hence of publication delays, is low. This makes it suitable for the sort of mass-produced work that can be outsourced to graduate students with minimal degree hiccups.

Assertion 3 - A replication journal has reward potential for contributors: Logically, most replication papers fall into two categories: refutation or verification. If research is verified, the original researchers gain prestige and a citation, and the replicators gain a publication. In future work, many groups mentioning the original work will want to cite both papers because together they make a stronger argument. If the research is refuted, at best it could spark interest in 'tiebreaker' work by a third research party, which would cite both (positively, if everything was done honestly), and at worst the original work dies early where it would or should have anyway, and the replicators establish their reputation for rigor.

Assertion 4 - A replication journal has reader appeal: If someone is reading a research paper, they may be interested in how credible the work is after it has been subject to public scrutiny. A replication journal appropriate to the paper's field would be a good first place to look, because it would save the reader the trouble of filtering through work that cited the paper in question or looking for otherwise related work that may lend credence to, or take it from, the paper. In short, a replication journal would offer the same service to readers that review sites offer to consumers of more commercial goods.

Assertion 5 - A replication journal would be easy to administer: Aside from papers on related verification methods, all the relevant submissions to such a journal would adhere to specific formulae: they would either be direct replications or meta-studies. Hopefully, this would make the editing and review work easier because most viable papers would look the same: introduction of the work to be replicated, comparison of methods, comparison of results, short discussion. Criteria for publication would have few ambiguities that require editorial decision-making.

Tuesday, 18 August 2015

Lesson Notes - Edit Distance (Text Processing 2)

These are the notes for the second lecture in the unit on text processing that I'm building. Some useful ideas like exact string matching and the definitions of characters and strings are covered in the notes of Text Processing 1 - Regular Expressions.

Edit distance, also called Levenshtein distance, is a measure of the number of primary edits that would need to be made to transform one string into another. The R function adist() is used to find the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14

A primary edit is one of three things:
a) An insertion of a character.
b) A deletion of a character.
c) A substitution of a character.

For example:

## Insertions: All of these will return an edit distance of 1.
adist("5678","05678") # inserting an '0' character before the '1'
adist("weasel", "w easel")  # inserting a space between w and e
adist("weasel", "weasel!")  # inserting a '!' character after the 'l'

## Inserting 5 characters gives an edit distance of 5
adist("Genome","Lord Genome")

## Deletions
adist("Genome","Gnome") # deletion of the first 'e': 1 edit
adist("Genome","nome")  # deleting the first two characters: 2 edits
adist("Canada","Cnd")   # deleting all three ehs: 3 edits

## substitutions
adist("iiiii","iixii")  # replacing 1 'i'
# 1
adist("iiiii","xiyiz")  # replacing 3
# 3

If more than one type of edit is needed to turn one string into another, the edit distance across different types is additive.

adist("weasel","Vin Deasel")  # 4 insertions, then replacing 'w' with 'D'
# 5

The edit distance is the 'shortest path' distance between two strings. This means that two strings might be closer than you expect. For example, you can get from "Moos" to "oose" with three substitutions, but adist() will return a smaller distance because a simpler transformation exists (delete 'M', then insert 'e' at the end).

adist("Moos", "oose")

The edit distance from one string to another is not always the sum of the edit distances between them and an intermediate string, but it is never more than the sum. In other words, the triangle inequality holds.

### These:
adist("triangle inequality?","what's that?")
adist("what's that?","some math thing.")
### 15 and 12

### Do not add to this:
adist("triangle inequality?","some math thing.")
### 17

By default, each type of primary edit (insertion, deletion, and substitution) contributes an edit distance of 1. This can be changed with the costs option in adist(). As above, the smallest total distance is used, so watch your weights.

adist("insert3","in---sert3", costs=list(insertion=1)) # 3
adist("insert3","in---sert3", costs=list(insertion=50)) # 150
adist("Mooo!","Moo", costs=list(insertion=0, deletion=50)) # 100

Note the case sensitivity. Since to a computer an upper case 'R' and a lower case 'r' are two completely different letters, differences in cases count towards the edit distance.

str1 = "youch that hurt!"
str2 = "YOUCH THAT HURT!"
adist(str1,str2) ## edit distance 13


If you don't want edit distance to count cases, you can either use functions to change the case of the strings, or use an option in adist()

adist( toupper(str1), str2)
adist( str1, tolower(str2))
adist( str1, str2, ignore.case=TRUE) # all zero distance


Leading and trailing spaces can also be an issue because they contribute to distance, but you can remove them with the str_trim() function in the stringr package.

str1 = "Mars"
str2 = "                 Mars              "
adist(str1,str2) # distance is vast
adist(str1, str_trim(str2)) # distance is nothing

Transpositions, the swapping of characters or words, are more problematic. These are usually interpreted as large numbers of substitutions or insertions/deletions.

adist("Animal", "Aminal") # 2
adist("Down Under", "Under Down") # 10
adist("Life isn't fair.","fair Life isn't.") # 10

Edit distance is used for fuzzy matching.


Fuzzy matching, in contrast to the exact matching that we dealt with in the last text processing lesson, is designed to detect strings that are close to a target string (or string pattern) as well as ones that fit the string (pattern) completely. Common systems that use fuzzy matching (and other things) include spell checkers and search engines when suggestions are made that are close to what you wrote. Some of the suggestions that come up will be selected because they have a minimal edit distance subject to some weighting based on common misspellings or typos.


We can build a workable fuzzy matching function in R in less than ten lines.

Suppose we had the transcript of the events of a Major League Soccer match from ESPN, like this one between the Houston Dynamo and the New England Revolution ( ESPN_MLS_Example.txt , thanks to Rajitha Silva for letting me give away part of his research like this ), and we wanted to identify the team associated with certain lines, but there was a lot of extra and messy information in the way. This includes lines from the example text like:

  • Houston Dynamo Dominic Oduro 58'
  • New England Revolution • Shalrie Joseph 43' • Kheli Dube 73'
  • Joseph Ngwenya (Houston Dynamo) receives a red card.
  • Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver
  • Goal !!! Kheli Dube scores for New England Revolution
In each case, you can identify the team of interest by looking at the edit distance between the team name and each of the lines. Most of the time (alas, there are always edge cases), the team with the shorter edit distance to the line of text is the team of interest.

adist("Houston Dynamo Dominic Oduro 58'","Houston Dynamo")
18

adist("Houston Dynamo Dominic Oduro 58'","New England Revolution")
27

Using this principle, we can make a simple 'choose the least' system.

teamlist = c("Houston Dynamo", "New England Revolution", 
 "Chicago Fire", "Montreal Impact", "Vancouver Whitecaps")

clean_teamname = function(raw_name, teamlist)
{
    #optional
    require(stringr)
    raw_name = str_trim(raw_name)
    
    #necessary
    ed_dist = adist(raw_name, teamlist)
    best = which(ed_dist == min(ed_dist))[1] # [1] breaks ties
    return(teamlist[best]) # Return the best fitting name
}

Then we test it out:

clean_teamname("Houston Dynamo Dominic Oduro 58' ",teamlist)
[1] "Houston Dynamo"

clean_teamname(
"Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver",
teamlist)
[1] "Houston Dynamo"

clean_teamname(
"Joseph Ngwenya (Houston Dynamo) receives a red card. ",teamlist)
[1] "Houston Dynamo"

clean_teamname("Goal !!! Kheli Dube scores for New England Revolution",teamlist)
[1] "New England Revolution"



Tuesday, 4 August 2015

Lesson Prototype - First Lecture on Multiple Imputation

Continuing the work on my data management course, here's the start of the Imputation unit. I'm intending three units in total - Text processing, principles of data manipulation (cleaning, merging, formatting, and database access), and imputation.

The companion reading for this lesson is Chapter 1 of Flexible Imputation of Missing Data, by Stef van Buuren, which is referenced in some places and drawn from heavily to make these notes. As this is just a prototype, I have not yet filled in all the details or included the iris_nasty dataset, which is just the iris dataset with random values deleted.
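The file isn't included yet, but a dataset like it can be simulated. A rough sketch (the missingness rate and the affected columns here are my assumptions, not the actual file):

data(iris)
iris_nasty = iris
set.seed(42)
for(j in 1:ncol(iris_nasty))
{
   holes = sample(nrow(iris_nasty), size = round(0.3 * nrow(iris_nasty)))
   iris_nasty[holes, j] = NA     # punch random holes in each column
}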

Structure:
1) Motivation for learning imputation
2) Types of missingness
3) A few simple methods that build up towards multiple imputation

--------------------
Hook: The frictionless vacuum of data


Consider the iris dataset in base R:


data(iris)
iris[c(1:3,148:150),]

Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

any(is.na(iris))
[1] FALSE

dim(iris)
[1] 150   5


A dataset like this is the equivalent of a frictionless vacuum in a physics problem: ideal for a variety of methods and for teaching principles but not commonly found in 'real' industrial settings. More likely is something like this:

iris_nasty = read.csv("iris_nasty.csv", as.is=TRUE)
 iris_nasty[c(1:3,148:150),]
 
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1             NA         3.5          1.4          NA    setosa
2            4.9          NA           NA         0.2    setosa
3            4.7         3.2          1.3          NA    setosa
148          6.5         3.0          5.2          NA virginica
149          6.2          NA          5.4         2.3 virginica
150          5.9         3.0           NA          NA virginica


Missing values, shown as "NA" (as in Not Available), like the ones in the nasty version of the iris dataset above, could come from any manner of problems. In a biological setting, it could be a failure of the measuring device. If it was a live specimen, the measurement may not have been possible without harming the specimen, or without trampling a lot of ground in order to reach it.

If this data were from a social survey, missingness could come from questions that were skipped over or refused by the respondent. The data points could also be missing because they were not relevant to the respondent, such as questions about prostate cancer being asked of a woman. If the survey was done over the internet, questions could be missed due to a connection problem.

------------
Types of missingness


Missingness patterns are classified into three types: MCAR, MAR, and MNAR.

MCAR: Missing Completely At Random - The chance of any given value being missing has nothing to do with any meaningful variable. This is the least problematic type to deal with, because methods of imputation (filling in missing values) can work by imputing one variable at a time, rather than imputing on the whole dataset at once. Missingness related to internet connection problems is reasonably assumed to be MCAR.

MAR: Missing At Random - The chance of any given value being missing depends ONLY on other variables that are in the dataset. This is more problematic to impute because there is the possibility of cases that are missing two variables, in which the missingness of one variable depends on the other, which is itself missing. An example could be the results of someone's most recent prostate exam being missing in a dataset that also includes age, sex, and a few other factors that explain most or all of the reasons why such data wouldn't be available.

MNAR: Missing Not At Random - The chance of any given value being missing depends in some way on variables that are NOT included in the dataset. This is by far the most problematic of the missingness patterns because there is no way to know why something is missing, and therefore any assumptions about the missing values are conjecture and potentially biased. Van Buuren, the author of "Flexible Imputation of Missing Data", uses an example of a weight scale that is wearing out over time, especially after heavy objects are weighed, and occasionally failing to measure something. If we don't know what's been weighed first or last, then we can't accurately estimate the chances of any given value being missing.

My favourite example is the 2011 Canadian Long-Form Census, which was voluntary. About a third of the forms that were sent out were not returned, and there is no reasonable way to account for them because certain demographics are less likely to return voluntary surveys. Unfortunately, the information about which demographics these are is exactly the kind of information that is missing.


There are methods for telling MCAR data from MAR data, which we will cover later in this unit. However, there is by definition no way to tell MAR data from MNAR data because that would involve conditioning on information that is not available. Unless there is cause to believe that some data of interest is MNAR, the default assumption is that data is MAR instead.



--------------
Dealing with missing data - building up to multiple imputation


There are many ways to impute the missing data. That is, to replace the NA spaces with values that can be used for analysis. Starting with the most naive and working up:


1) Remove all cases that have a missing value.

length(which(is.na(iris_nasty)))

[1] 243

dim(iris_nasty)
[1] 150   5

dim(na.omit(iris_nasty))
[1] 16  5

Although about 32% of the data values (243 of 750) are missing, 89% of the cases (134 of 150) contain at least one missing value. To ignore all of those cases means throwing out a lot of data that was collected and recorded. This is not a preferable solution.


2) Remove cases with missing values for the variables used.


mod1 = lm(Petal.Length ~ Petal.Width, data=iris_nasty)
length(mod1$residuals)
[1] 94


mod2 = lm(Petal.Length ~ ., data=iris_nasty)
length(mod2$residuals)
[1] 16

This method of dealing with missing data is the default of the linear model function lm() in R. Depending on the model used, this means losing between 56/150 and 134/150 of the sample. That's at least as good as the 'listwise deletion' option suggested before this. However, a comparability problem has been introduced:

round(mod1$coef,2)

(Intercept) Petal.Width
1.12 2.23


round(mod2$coef,2)

(Intercept) Sepal.Length Sepal.Width Petal.Width Speciesversicolor
-0.32 0.48 -0.21 0.53 1.67

Speciesvirginica
2.12 


Between these two models, the coefficients are different (try for yourself to see whether these differences are significant). From the simple linear model to the full model, the intercept changes from 1.12 to -0.32, and the slope along Petal.Width changes from 2.23 to 0.53.

If these two models were using the same dataset, we could interpret the differences between these coefficients as the effect of holding the Sepal variables constant and fixing the Species variable in the full model. But the full model was developed using a sample of 16 units, whereas the simple model uses 94 units. Are the differences in the model coefficients attributable to the sample units that were omitted or to the variables that were included, or to both, or to an interaction? There's no way to know.
This confusion also highlights why we need to impute data in the first place, rather than ignore it when it is missing.


3) Mean-value imputation: Take the mean of all known values for a variable, and use that value for all missing values for that variable.

This solves the problems from the row deletion solutions above, in that it would allow the use of all 150 cases in the iris_nasty dataset. However, it fails to reflect any sort of variation in the values being imputed. For example, what if Sepal.Length is different for each species? The mean-value imputed values for Sepal.Length will be the same for all the species, which poorly reflects the trends in the data.
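A minimal sketch of mean-value imputation for one variable (shown only to make that limitation concrete):

imp_mean = iris_nasty
m = mean(imp_mean$Sepal.Length, na.rm=TRUE)
imp_mean$Sepal.Length[is.na(imp_mean$Sepal.Length)] = m
# Every imputed Sepal.Length is identical, regardless of Species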

(Figure from van Buuren showing how limiting this is for a simple regression)

4) Model-based imputation.
Van Buuren refers to this as regression-based imputation in Chapter 1, but the bigger picture is that imputed values can be those predicted from a model, such as a regression model.

Like mean-value imputation, this solution allows all the cases to be used. It also reflects the average trends of the data. However, it does NOT reflect the variance unaccounted for by the model. That is, the uncertainty inherent in the imputed values isn't shown; all the imputed values will be the average of what those values are likely to be.

For each value on its own, taking these model-predicted values as the imputations is ideal. After all, a model prediction reflects our best guess as to what a value really is. However, when you take all the imputed values together, the trends in the data are unrealistically reinforced. If we do this with the missing Petal.Width and Petal.Length values, then the estimate of the correlation between these variables becomes ______. Compare that to the correlation found using the 94 rows of data that don't have missing values: ______ , or to the 150 rows of data in the original iris data set without imputations: ________.


(Figure from van Buuren showing how limiting this is too for a simple regression)
(oorrrrrr... figure of scatterplot of model from iris_nasty)

One set of methodology questions comes up with model-based imputation that we will address in a later lesson: if there are missing values for multiple variables from a case in a dataset, which of the variables do you impute first? Do you use the imputed value from one variable to inform predictions for imputations on other variables, or do you only use complete cases as a basis for imputation?


5) Stochastic regression-based imputation.

This is the same as regression or model-based imputation, except that some random noise is added to each predicted value to produce an imputed value.

The distribution of the random noise is determined by the amount of uncertainty in the model. For example, a prediction from a very strong regression would have only a small amount of noise added as compared to a prediction from a weak regression. Also, if a linear regression were the model used for prediction, the noise would come from a Gaussian distribution as per the standard regression assumptions.
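A sketch of the idea for one variable, assuming a simple regression of Petal.Length on Petal.Width fit to the complete pairs only (an assumption to keep it short):

cc  = complete.cases(iris_nasty[, c("Petal.Length", "Petal.Width")])
fit = lm(Petal.Length ~ Petal.Width, data=iris_nasty[cc, ])

imp_stoch = iris_nasty
need = which(is.na(imp_stoch$Petal.Length) & !is.na(imp_stoch$Petal.Width))
pred = predict(fit, newdata=imp_stoch[need, ])
imp_stoch$Petal.Length[need] = pred + rnorm(length(need), sd=summary(fit)$sigma)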

Adding this noise makes the distribution of imputed values resemble that of the known values, instead of being underdispersed like it is in simple model-based imputation. However, the results from the imputed dataset are now sensitive to the random seed used to generate the noise, and the strength of correlations is now understated. Using stochastic regression-based imputation, 95% of correlations for model 1 in the iris_nasty dataset were between ____ and ______.

Finally, after the noise is added, the imputed missing values are taken with as much certainty as the directly observed values. There is no distinction between values that were originally missing and those that were not.


6) Multiple imputation.
This method is the one we will be using for the rest of this unit. To do multiple imputation, take predictions of values and add random noise, as we did in stochastic regression-based imputation, but do this independently for several (e.g. 3-10) copies of the dataset.

Each copy of the dataset will have the same values for everything that was observed directly, but different values for anything that was imputed. The average of the imputed values will still be close to the predicted value, but now the differences in the imputed values between datasets reflects the uncertainty we have in these imputed values.

(Below are three copies of those six lines from iris_nasty, but with imputations of the missing values show in bold and red)

(figure of what's going on... one dataset to many, many datasets to many analyses, combined to one)

To analyse such a meta-dataset, we perform the same analysis on each copy of the dataset, and then average the expected values from each result using Rubin's Rules. This is shown as a black-box process for the three imputations of the iris_nasty dataset below.

(example of lm from each imputation)

(rubin.rules() black box for now which spits out parameter estimates and uncertainty)
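Until those figures are filled in, here is the whole pipeline as a black box using the 'mice' package (van Buuren's own software, which we'll introduce properly later):

library(mice)

imp  = mice(iris_nasty, m=3, seed=1)              # three imputed copies of the data
fits = with(imp, lm(Petal.Length ~ Petal.Width))  # the same model fit to each copy
summary(pool(fits))                               # estimates combined by Rubin's Rules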

Rubin's Rules deserve an entire lesson, so that's where we'll pick up next time we talk about imputation.


Previous Lesson Prototype on Text - Regular Expression

Friday, 31 July 2015

Possible Application of Approximate Bayesian Computation for Networks

I've been thinking about possible applications for the network simulation/analysis program that Steve Thompson and I developed as part of my PhD thesis.*

I propose to investigate the effect of response order/timing in respondent-driven sampling on estimates of network parameters.

Here I'm assuming that samples are taken in waves. That is, a collection of seed members of the population are identified and given the initial coupons. If the standard deviation of the response times is small compared to the mean, the network of the population is sampled in a manner similar to a breadth-first search (BFS). If the standard deviation of response times is large relative to the mean, the network ends up being sampled in a manner closer to that of a depth-first search (DFS). Each of these sampling methods has the potential to provide vastly different information about the sample.

Such an investigation would include four parts:


1) Motivating questions: Why would we care about the time it takes members of the population to be entered into the sample? Because perhaps these times could be influenced by different incentive structures. If they can, what is best? If they cannot, we can do a what-if analysis to explore counter-factuals. Does timing matter? What sampling and incentive setups are robust to the effects of response times? Are there statistical methods that can adjust for the effects of response times?


2) Find some real respondent-driven samples, preferably by looking in the literature of PLoS-One and using the data that is included with publications, but possibly by asking other researchers individually.


Look at the time stamp of each observation in each data set, if available, and fit a distribution such as gamma(r, beta) to the time delay between giving a recruitment coupon and that recruitment coupon being used. Compare the parameter estimates that each data set produces to see if there is a significant difference between them and see if there's any obvious reason or pattern behind the changes.
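A hypothetical sketch of that fitting step, assuming a vector of observed coupon-to-response delays has been extracted from one dataset (the values below are made up):

library(MASS)

delay_days = c(2, 5, 1, 9, 3, 7, 4, 12, 6, 2)   # made-up delays, in days
fitdistr(delay_days, densfun="gamma")           # estimates the shape and rate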

3) Generate a few networked populations and sample from each one many times using the different response-delay time distributions found in Part 2. Are there any significant changes to the network statistics that we can compute from the samples we find? That is, how does the variation of the statistics between resamples under one delay time distribution compare to the variation between delay time distributions?

4) Employ the full ABCN system to get estimates of whatever network parameters we can get for case ij, 1 <= i,j <= M, where M is the number of datasets we find. Case ij would be using the observed sample from the ith dataset, with simulations in the ABCN system using the delay distribution estimated from the jth dataset.

This way, we could compare the variation in the network parameters attributable to the sample that was actually found with how much was attributable to the difference in time it took for recruitments to be entered into the survey. Also, we will effectively have performed a what-if analysis on the datasets we use, seeing whether the conclusions from the datasets would have been different if the recruited respondents had responded with different delay structures.

---------------------------------------
*This network simulation/analysis system takes an observed sample of a network and computes a battery of summarizing statistics of the sample. Then it simulates and samples from many networks and computes the same battery from each sample. It estimates the parameters of the original sample by looking at the distribution of parameters from the simulated samples that were found to be similar to the observed sample.

This will all be explained again, and in greater detail when I post my thesis after the defense in... gosh.. a month. Basically it's statistical inference turned upside down, where you generate the parameters and see if they make the sample you want, instead of starting with the sample and estimating a parameter value or distribution. The base method is called Approximate Bayesian Computation.

Sunday, 26 July 2015

Prediction Assisted Streaming

The online game Path of Exile has a trick for reconciling a game that requires quick reaction times with the limitations of servers: it predicts the actions of the players. It doesn't have to predict far ahead - a couple hundred milliseconds - to keep the action running smoothly most of the time.

Prediction on this time frame isn't hard in principle; play often involves performing the same short 
action repeatedly, such as firing an arrow, so the default prediction is just more of that. When the server predicts incorrectly, it usually has enough time to 'rewind' to the present and handle things as they come. The prediction is just an extra buffering layer that's in place when there is a lot of server lag.

Does video or music streaming do this? Could it?

In a song with a repetitive bassline, could the information sent to the client computer include "repeat the sound from time x with the following deviations included in the buffer", rather than "play the following sound"? The "sound at time x" in this case is a note from the bassline, and the deviations are the result of a human playing an instrument and not hitting the note exactly the same every time. In a case like that, potentially less data would need to be sent to reproduce the song, allowing for longer buffers or higher sound quality.
 
Likewise for video. Consider a live video feed of a soccer match, in which a player is running across the field. Video prediction may determine there is an object moving at some speed in some direction and predict a few frames ahead where that object will be, and thus what pixels to draw in that spot. Then the streaming service, making the same prediction, could just send the video information that deviates from this prediction.

The same goes for repetitive patterns like an animation of a spinning wheel or the sparkling logo of a sports team. If the wheel spins in a predictable fashion, internet bandwidth could be saved by describing the movement of the wheel as "spinning as expected", where matching prediction software on the server and client sides both recognize that that part of the screen is taken up by an object in a 10-frame loop.
This is different from encoding only the pixels that change from frame to frame. This prediction would incorporate likely changes in a picture based on simple movement patterns or on repetitive animations.

Consider a streamed game of Hearthstone, like the last time I pretended to know about video encoding. There are certain animations that regularly hurt video quality, such as sand sweeping across the entire screen, that involve many pixels changing for a non-trivial amount of time. The video encoder does fine when the picture is mostly the same from frame to frame, but introduce one of these effects that causes a lot of pixels to change at once, and the quality of the live stream drops.

However, the sand effect is one of sand moving slowly across the screen; its movement is predictable in that any one pixel of the effect is likely to follow the same trajectory as it did in the last few frames. Predictive video encoding is more scalable than the application-specific encoding I mentioned before, but with time it could achieve the same effect if it were able to recognize frequently used pre-rendered effects such as lightning all over the screen. A predictive video encoder could recognize the first few frames of the 'lightning storm' effect and predict the rest without having to send any information about that part of the screen.

I'm no expert on video encoding, so this may all be gibberish.


Previous post on video encoding in Twitch and the possibility of application-specific codecs.

Sunday, 19 July 2015

Using optim() to get 'quick and dirty' solutions, a case study on network graphs

The optim() function in base R is a generalized optimizer for continuous functions. It looks for the minimum of some objective function that you define, across some continuous parameter space that you define.

If you didn't know about the lm() function, you could use optim() to find the least-squares line through a set of points.
- The solution space is R^2, the possible values of (intercept, slope).
- The objective function is the sum of squared errors.


get_ss_error = function(beta, x0, y0)
{
 yhat = beta[1] + x0*beta[2]
 error = sum( (y0 - yhat)^2 )
 return(error)
}
set.seed(1234)
x0 = rnorm(1000)
y0 = -3 + 5*x0 + rnorm(1000)

## linear model solution
lm(y0 ~ x0)$coef
(Intercept)          x0 
  -2.984010    5.055713 
  
## optimizer solution
optim(par=c(0,0), get_ss_error, x0=x0, y0=y0)$par
[1] -2.984321  5.055940

In this case optim() isn't using (X'X)^-1 X'y to find the best solution; it just uses one of its general optimization methods to try different values for slope and intercept until it comes up with a pair that is close to the minimum value of get_ss_error. Note that the first parameter passed to get_ss_error is the candidate solution; optim() assumes that the first argument of the objective function is the parameter vector being optimized.

Now let's try something more interesting than simple regression. In April, Jeremy Diachuk was playing with Thaumcraft, an expansion to Minecraft which uses a transmutation mechanic. That is, elements can be turned into other elements, but any given element can only be directly turned into a couple of other elements. For example, "Earth" can be changed into "Life" (and "Life" back into "Earth"), but "Earth" can't be changed into "Heal". However, "Life" can be changed into "Heal".

We can describe the transmutation relationships as an undirected network graph, and the shortest path length between each element is the number of transmutations needed to turn one element into another. "Earth" transmutes to "Life" transmutes to "Heal", which is a shortest path, so "Earth" is two transmutations away from "Heal". We call this distance in a network the geodesic distance.

Jeremy wanted a visual arrangement of some key elements that showed their relative geodesic distance from each other. There are programs out there to do this, but I thought it would be faster/better to build something in R using optim() than try to use an out of the box system. I was right... I think.

First, we needed some means to find the geodesic distance between each pair of elements. Jeremy gave me a formatted list of the possible transmutations (available here), and it was a simple call to the shortest.paths() function in the igraph package in R to get a matrix of the geodesic distances.

library(igraph)
library(permute)

g = read.graph("tc4-22.ncol",format="ncol")
sp = shortest.paths(g)

For the sake of a simpler picture and faster optimization, we weren't interested in all 49 of the elements, so we eliminated unimportant ones from the matrix. Note that we couldn't do this before finding the shortest paths, because the eliminated elements may still be needed as intermediate steps in those paths.

## Full list in downloadable code 
ignore_list = c("CRAFT","SLIME","WRATH",...,"COLD","TAINT")

ignore_idx = which(row.names(sp) %in% ignore_list)
sp = sp[-ignore_idx, -ignore_idx]

Next, we needed...
- An objective function (the mismatch between geodesic distance and the usual meaning of distance, Euclidean distance),
- A parameter space (the location of each of the key elements in a plot), and
- A function that takes in a point in the parameter space and gives out the value of the objective function.


# Full comments in downloadable version 
euclid_vs_geodesic = function( x, sp, importance)
{
  Badness = 0
  Nnodes = length(x) / 2   # first half of x holds y-coordinates, second half x-coordinates
  y = x[ 1:Nnodes ]  
  x = x[ 1:Nnodes + Nnodes ] 
 
  for(i in 1:Nnodes)
  {
    for(j in (1:Nnodes)[-i]) 
    {
      ## Euclidean distance between the plotted locations of elements i and j
      eu_dist =  sqrt((x[i] - x[j])^2 + (y[i] - y[j])^2 )
      ## geodesic (transmutation) distance between elements i and j
      geo_dist = sp[i,j]
  
      ## penalize mismatch on a log scale, capped to avoid runaway values
      thisbad = abs( log( eu_dist / geo_dist))^3  
      thisbad = min(10^5, thisbad)  
      ## weight by how much we care about each element
      thisbad = thisbad * importance[i] * importance[j] 
      Badness = Badness + thisbad 
    }
  }
  return(Badness) 
}

The function euclid_vs_geodesic takes a vector of values 'x', breaks it into x and y coordinates for each element, compares the Euclidean distance between each coordinate pair with the number of transmutations that separate the corresponding elements, and sums up the mismatches to get a 'badness' score. We can feed all this into optim() to find the optimal placement of the elements in a 2-D plot.

# Full comments in downloadable version 
Nnodes = nrow(sp)
importance = rep(1,Nnodes)
init = rnorm(2*Nnodes)

res = optim(par=init,
  fn=euclid_vs_geodesic, sp=sp, importance=importance,
  control=list(trace=1, maxit=200), 
  method="L-BFGS-B", lower = rep(-4,2*Nnodes), upper = rep(4,2*Nnodes))

Using the results from res, we can plot:

x = res$par[1:Nnodes]
y = res$par[Nnodes + 1:Nnodes]
plot(x,y, cex=0.1)
text( labels=row.names(sp), x=x, y=y)


Here, words that are close to each other represent elements that require few transmutations between each other. 'Light' (lower left) can be transmuted to 'Fire' and 'Energy' simply, but converting 'Light' to 'Motion' is hard, and turning 'Light' into 'Undead' is even harder.

Also, even though the edges between the nodes aren't shown, shortest paths between nodes are visually implied. "Life" (just right of center) is close to the line between "Earth" and "Heal", implying the 'Earth to Life to Heal' shortest path.

Download link for list of transmutations
Download link for optimization and plotting code

-------------

The optim() function is also good for getting the inverse of a solution. In January I used quantile regression to find curves for the scores of the best and worst 10% of players in a video game after some amount of practice time. The original solution took a time and a score, and gave back the estimated proportion of players that would reach that score by that time.

However, if you were instead interested in tuning a game to find the score that, say, 90% of players could reach within half an hour, optim() could be used to find it by setting the parameter space to 'score', treating 'time' as a fixed function argument, and using 'distance from 90%' as the objective function, as in the sketch below.
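
A minimal sketch of that inversion, where prop_reaching() is a toy stand-in for the fitted quantile regression model (not the actual model from the January post):

## toy stand-in: estimated proportion of players reaching 'score' by 'time'
prop_reaching = function(score, time)
{
  plogis(5*time - 0.01*score)
}

## objective: squared distance between the estimated proportion and the target
distance_from_target = function(score, time, target)
{
  (prop_reaching(score, time) - target)^2
}

## the score that an estimated 90% of players reach within half an hour
optim(par=0, fn=distance_from_target, time=0.5, target=0.90, method="BFGS")$par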

One major drawback is that, in order to do this, each iteration inside the optim() function had to run the quantile regression code all over again. This made finding the optimal score or play time about ten times slower than finding the quantiles.

I call these 'quick and dirty' solutions because, although they work, they aren't guaranteed to be elegant, revealing, or computationally efficient. However, if you want to get a sense of a solution fast, it's hard to beat optim() for its simplicity of use.

Finally, optim() can also be used for combinatorial optimization, but I haven't covered that here.

Tuesday, 14 July 2015

Lesson Prototype - Regular Expressions in R

My thesis has been handed in, so I've been restarting and catching up on other projects. Blog posting will continue at its normal pace of about 4 posts/month.

For the data management course I'm building, I want to include a module on text analysis. This is a prototype of a lesson and accompanying exercise on regular expressions in R. It runs through how strings are handled in R including vectors of strings. Then it covers some of the regular expression basics and demonstrates them with the stringr function str_detect().
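
To give a flavour of the regular expression material, here's a small sketch of my own (not taken from the lesson itself): detecting Canadian postal codes with str_detect().

library(stringr)

strings = c("V5A 1S6", "90210", "T2X 0E3", "not a postal code")

## letter-digit-letter, an optional space, then digit-letter-digit
postal_pattern = "^[A-Za-z][0-9][A-Za-z] ?[0-9][A-Za-z][0-9]$"

str_detect(strings, postal_pattern)
## TRUE FALSE TRUE FALSE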

There are some partially finished exercises at the end of the lesson. The exercises are outlined, and a list of anticipated errors are included for each question. However, the text files needed to actually do the exercises and check your work are not yet done.

Since all the exercises are programming based, I can write a solution checker that looks at the submitted answers and gives feedback based on anticipated mistakes. I talk a bit about such a program at the end of the document.

As always, feedback is appreciated.
 
LibreOffice Version of Lesson

PDF Version of Lesson

Previous blog post on the stringr package


Lesson contents:
  • Strings in R
    • Strings can be stored and manipulated in a vector
    • Strings are not factors
    • Escape sequences
    • The str_match() function
  • Regular expressions in R
    • Period . means 'any'
    • Vertical line | means 'or'
    • +, *, and {} define repeats
    • ^ and $ mean 'beginning with' and 'ending with'
    • [] is a shortcut for 'or'
    • hyphens in []
    • example: building a regular expression for phone numbers 
  • Exercises
    • Detect e-mail addresses
    • Detect a/an errors
    • Detect Canadian postal codes

Saturday, 11 July 2015

Danni and Jeremy's Wedding Speech

 As requested, here is the speech that was given for Danni-Lynn Laidlaw and Jeremy Strang's wedding last month.

[Party 1] was Danni-Lynn
[Party 2] was Jeremy
[Member of Audience] was Calen, Danni's brother.

Regular text represents my own additions
Italic text is taken verbatim from the Government of British Columbia's standard ceremony,
of which the bold italic parts are immutable and cannot be changed.
Anything in underline is spoken by one of the two parties being married.

The standard ceremony can be found at http://www2.gov.bc.ca/assets/gov/residents/vital-statistics/marriages/vsa718.pdf

-----------------------------------------------

Majestic Ladies, Handsome Gentlemen, and [Member of Audience].

We have assembled here to acknowledge a force that has ruthlessly devoured billions before and will no doubt continue to consume lives in their prime until the sky crumbles.

This remorseless and unceasing force is called love. These two, though their forms appear before you, are hopelessly lost -- beyond mourning, really.

Today we witness the passing of [Party 1] and [Party 2] into the penultimate stage of their falling to this force. The true word for this stage makes all who hear it cry blood and vomit leeches. Thankfully I lack the four tongues required to pronounce it. However, even the English word has terrified lesser men before. That word is "marriage".

The state of matrimony, as understood by us, is a state ennobled and enriched by a long and honourable tradition of devotion, set in the basis of the law of the land, assuring each participant an equality before the law, and supporting the common right of each party to the marriage.

There is assumed to be a desire for life-long companionship, and a generous sharing of the help and comfort that a couple ought to have from each other, through whatever circumstances of sickness or health, joy or sorrow, prosperity or adversity, the lives of these parties may experience.

Marriage is therefore not to be entered upon thoughtlessly or irresponsibly, but with a due and serious understanding and appreciation of the ends for which it is undertaken, and of the material, intellectual, and emotional factors which will govern its fulfillment.

It is by its nature a state of giving rather than taking, of offering rather than receiving, for marriage requires the giving of one's self to support the marriage and the home in which it may flourish.

It is into this high and serious state that these two persons desire to unite.

Therefore: 

I charge and require of you both in the presence of these witnesses, that if either of you know of any legal impediment to this marriage, you do now reveal the same.

Let [Party 1] repeat after me:

"I solemnly declare that I do not know of any lawful impediment why I, [Person 1] may not be joined in matrimony to [Person 2]."

Let [Party 2] repeat after me:

"I solemnly declare that I do not know of any lawful impediment why I, [Person 2] may not be joined in matrimony to [Person 1]."
  There having been no reason given why this couple may not be married, nor any reasonless gibbering that could be interpreted as such, I ask you to give answer to these questions.

Do you [Party 1] undertake to afford to [Party 2] the love of your person, the comfort of your companionship, and the patience of your understanding, to respect the dignity of their person, their own inalienable personal rights, and to recognize the right of counsel and the consultation upon all matters relating to the present, future, and alternate realities of the household established by this marriage?

(A prompt of  'do you, or do you not' may help here)

[Party 1]: I do. 


Do you [Party 2] undertake to afford to [Party 1] the love of your person, the comfort of your companionship, and the patience of your understanding, to respect the dignity of their person, their own inalienable personal rights, and to recognize the right of counsel and the consultation upon all matters relating to the present, future, and alternate realities of the household established by this marriage?

(Again, a prompt of  'do you, or do you not' may help here. Especially because doing it twice makes it sound planned)

[Party 2]: I do. 

 Let the couple join their right hands, claws, tentacles, feelers, probosci, or pseudopods, and let [Party 1] repeat after me.

I call on those present to witness that I, [Party 1], take [Party 2] to be my lawful wedded (wife/husband/spouse), to have and hold, from this day forward, in madness and in health, in whatever circumstances life may hold for us.

and let [Party 2] repeat after me.

I call on those present to witness that I, [Party 2], take [Party 1] to be my lawful wedded (wife/husband/spouse), to have and hold, from this day forward, in madness and in health, in whatever circumstances life may hold for us.
  Inasmuch as you have made this declaration of your vows concerning one another, and have set these rings before me, I ask that now these rings be used and regarded as a seal and a confirmation and acceptance of the vows you have made.

Let [Party 1] place the ring on the third noodly appendage of [Party 2]'s left hand, repeat after me:

With this ring, as the token and pledge of the vow and covenant of my word, I call upon those persons present, and those unpersons lurking among us beyond mortal sight, that I, [Party 1], do take thee [Party 2], to be my lawful wedded (wife/husband/spouse)

 Let [Party 2] say after me:

In receiving this ring, being the token and pledge of the covenant of your word, I call upon those persons present to witness that I [Party 2] do take thee [Party 1] to be my lawful wedded (wife/husband/spouse).


Let [Party 2] place the ring on the third noodly appendage of [Party 1]'s left hand, repeat after me:

With this ring, as the token and pledge of the vow and covenant of my word, I call upon those persons present, and those unpersons lurking among us beyond mortal sight, that I, [Party 2], do take thee [Party 1], to be my lawful wedded (wife/husband/spouse)

 Let [Party 1] say after me:

In receiving this ring, being the token and pledge of the covenant of your word, I call upon those persons present to witness that I [Party 1] do take thee [Party 2] to be my lawful wedded (wife/husband/spouse).


And now, forasmuch as you [Party 1] and [Party 2] have consented to legal wedlock, and have declared your solemn intention in this company, before these witnesses, and in my presence, and have exchanged these rings as the pledge of your vows to each other, now upon the authority vested in me by the province of British Columbia, I pronounce you as duly married.

You may kiss.

Sunday, 31 May 2015

First thoughts on narrative reporting

I just finished reading The Mayor of Aihara, a biography of a man named Aizawa from rural Japan, derived from the daily journal he kept from 1885 to 1925. It was the first history book I've read in six years.

I read it because I wanted to get a sense of how non-fiction is written outside of the sciences. The content was good, but it was the style I was looking for, which turned out to be 160 pages of narrative sandwiched between two 20-page blocks of analysis and commentary.

The introductory chapter discusses the limitations of the biography as a view into the life of the Japanese as a whole. It also gives some general context of the world that Aizawa lived in.

The next five chapters cover blocks of Aizawa's life. Events within each chapter are told mostly in chronological order. There is some jumping around to help organize the material into something like story arcs, except that it's almost all about one person.

In other places, details that the biography's author couldn't possibly have known are included, such as the circumstances of Aizawa's birth, and the reasoning behind the local police officer having a bicycle.

Sometimes the author's interpretations are injected, such as that Aizawa was mostly unaware of the plight of the tenant farmers in his village, and that he cared more about his family than was typical. (In the conclusion chapter, some of these assumptions are justified.)

These aspects gave the biography more cohesion as a story, with clearer connections between cause and effect, than Aizawa's life likely had in reality. I didn't mind much because it made the material easier to read, track, and remember.

Still, the reporting is subjective, not just where necessary, but also where the author could improve the work's narrative quality without sacrificing much accuracy. Contrast that with scientific reporting: when doing a chemistry experiment properly, every scientist should produce identical lab notes, and the resulting reports should provide the same information regardless of who produced them. If someone else were to examine Aizawa's journal, even with the same background in Japanese history as the biography's author, they would produce a biography containing different information.

This focus on narrative when presenting facts is perplexing to me, but the rationale behind it is visible.

Friday, 8 May 2015

The end of jobs

A friend recently asked me if I foresee any chance of "jobs", or in his words "the economic trade of labour in return for payment" becoming so unsustainable that we as society would abandon it. My response is below.

-----------------

No.

The term "chance" implies that I'm not certain about the unsustainability.

There will NEVER be enough meaningful jobs for everyone. Unemployment is only in a 'healthy' range around 6-7% right now because of a system stretched to its utter limit to create jobs, often at the cost of getting meaningful work done.

First, self-employment is counted as a job in this statistic, as is part-time work. So the proportion of people that trade their labour for payment is likely a lot smaller than the official figures suggest.

There is also a large portion of jobs that simply shouldn't be.

- Literally pointless jobs, like full-service gas pumps. Really, gas stations could be fully automated and could behave like large vending machines.

- Parasitic jobs such as car salespeople, day traders and arbitrageurs. I separate these from pointless jobs because they do perform a service, but only because of legacy systems that mandate that these services are necessary.

- Fields where the output of the field is only loosely related to the number of people working in it, such as marketing. From the perspective of companies, if half of all ads disappeared, the only real effect would be for each remaining ad to be twice as effective. Likewise, the benefit to the consumer, knowledge of a product, would be just as large with perhaps only 10% of ads retained. In that sense, 90% of the work in advertising, from ad creation to posting billboards, is parasitic.

Then there are the jobs that won't be for much longer.

- Physical jobs that are due for automation, such as semi-truck driving.

- Small manufacturing jobs that can be simply replaced by on-demand 3D printing.

- Technical jobs that routinely get automated, in the way that search engines have largely supplanted librarians.

- Many service jobs, such as teaching, haircutting, and child care, employ a roughly fixed proportion of the population. However, the populations of countries in the developed world are either flat, declining, or dependent upon immigration to maintain the population increase that modern economics relies upon so dearly.

- Many resource-based jobs are at risk from better energy efficiency, better labour efficiency, and automated landfill harvesting and reclamation. Even agriculture is being turned upside down by cultured meat. And with it goes shipping.

Finally, the work that NEEDS to be done such as environmental restoration, medical services, and the development of space technology, simply doesn't work well under an exchange-for-payment system because economically 'rational' people and corporations either won't or can't pay for it.

I would refer you to the 20-minute video "Humans Need Not Apply" for a compelling argument about how this is inevitable. My best reading suggestions on universal basic income and post-scarcity are the novels Accelerando and Red Mars, which touch on these topics.

Wednesday, 6 May 2015

Prelude to a FUSS

I apologize in advance if this one is incoherent, as most of it has come in fever dreams over the last couple days.

I want to make a FUSS. That is, a Formula-Unspecified System Solver. 

The FUSS would input an n * p matrix of X values, and a 1 * p vector of y values.  

The FUSS would output a formula, such as y = log(a + b*x1 + c*sqrt(x2)), where the parameters a, b, and c are chosen to best fit y under that formula by some criterion such as the sum of squares*. The formula itself would be chosen by a simulated annealing method that starts with a basic formula and mutates it to find one that balances accuracy with simplicity.

Here's the mechanics as I've worked them out so far:

The R function nlm(f, p, ...) takes in a function f and initial parameter values 'p', which get fed into the function. nlm then estimates the values of those parameters that produce the lowest output of 'f'. Usually, the output of 'f' is some value representing distance from an ideal, so nlm effectively estimates the best values of things for you.

The '...' represents additional arguments that are fed into nlm(f, p, ...) but are not changed by nlm when it's optimizing. So what would happen if those extra arguments determined the formula that 'p' is optimized over?

For example

FUSS_INNER = function(p, x, y, form)  ## Assume p has 3 entries, x is 2*n
{
   y_hat = rep(0,length(y))
   
   for(k in 1:length(y))
   {
      ## Start with a polynomial of exponents
      y_hat[k] = p[1] + p[2]*x[1,k]^form[1] + p[3]*x[2,k]^form[2]
   }
   
   ## Apply transformations (once, to the whole fitted vector)
   if(form[3] == 1){ y_hat = log(y_hat)}
   if(form[4] == 1){ y_hat = sqrt(y_hat)}
 
   ## compute lack-of-fit
   lack_of_fit = sum( (y - y_hat)^2)
 
   ## compute a complexity score based on the formula used
   complexity = 1
 
   ## 1 point for each non-trivial exponent (anything other than 0 or 1)
   complexity = complexity + length(which( !(form[1:2] %in% c(0,1))))
 
   ## 1 point for each transformation applied
   complexity = complexity + length(which(  form[3:4] == 1))

   return(c(lack_of_fit,complexity))
}


For the above example formula of y = log(a + b*x1 + c*sqrt(x2)), this would be specified by the form vector (1, 1/2, 1, 0), and would have a complexity of 3.
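
A rough usage sketch (my own scaffolding, not part of the eventual FUSS): wrap FUSS_INNER so that nlm() sees a single number, here lack-of-fit plus a small penalty on complexity, and fit the example formula to simulated data.

## simulated data from y = log(2 + 3*x1 + 4*sqrt(x2))
set.seed(1)
x = rbind(runif(200, 1, 5), runif(200, 1, 5))
y = log(2 + 3*x[1,] + 4*sqrt(x[2,]))

form = c(1, 1/2, 1, 0)   # y = log(a + b*x1 + c*sqrt(x2))

## scalar objective for nlm: lack-of-fit plus an arbitrary complexity penalty
penalized = function(p, x, y, form, lambda=0.1)
{
  out = FUSS_INNER(p, x, y, form)
  score = out[1] + lambda*out[2]
  if(!is.finite(score)){ score = 10^10 }   # guard against log() of a negative value
  return(score)
}

fit = nlm(penalized, p=c(1,1,1), x=x, y=y, form=form)
fit$estimate   # ideally somewhere near the true (2, 3, 4)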

With a bigger FUSS_INNER function, more possibilities for mutation are available. Also, there's a lot of housekeeping that I'm ignoring in this function because it's all very high level at the moment. Likewise, I would use apply() instead of for() to improve speed, but this is for demonstration purposes.
It's 2:30am in Montreal, and that's what the FUSS is about.

* maximum likelihood would be better, but let's start small.

Saturday, 18 April 2015

Using Sampling Order in Network Inference

In this previous post, I talked about how the order in which values are sampled carries information, but I gave a pretty esoteric argument involving compression and loss of information in a sample of 8 values.

For my current thesis project, I'm developing an inference methodology for snowball samples. Snowball samples, or more generally respondent-driven samples, are used to sample hard-to-reach populations or to explore network structure. The sampling protocol I'm working with is:


  1. Select someone to interview from the population by simple random sampling.
  2. In the interview, ask that person the number of connections they have (where 'connection' is defined by the study; it could be business partners, sexual partners, close friends, or fellow injection drug users).
  3. For each connection the interviewed person has, give that person a recruitment coupon to invite that connected person to come in for an interview, often with an incentive.
    • If no connected person responds, go to Step 1 and repeat.
    • If one connected person responds, go to Step 2 and repeat.
    • If multiple people respond, go to Step 2 and interview each of them.
In reality, people are interviewed in the order they come in for an interview. In my simulation, people are interviewed in breadth-first search order, in which I assume connections to the first interviewed person are followed before any connections to those connections are followed. This may make a difference (which I will explore later), because of two limitations of the sample this protocol is modeled after:

  1. When the desired sample size has been attained, no more people are interviewed, even if additional people respond to the invitation. (Budget limitation)
  2. Connections that lead back to people already in the sample are ignored. Even the information about there being a connection is missing. (Confidentiality limitation)
The budget limitation implies that the interviewing order impacts the subjects that are included in the sample, which in turn affects any statistics measured, even those for which sampling order is ancillary.

The limitation on reported connections implies that the observed network structure is dependent upon the sampling order, even when controlling for the actual members of a sample. For example, if someone receives invitation coupons from multiple connections, they will only show up as directly connected to the first person whose coupon they use. The other connections may not even be observed to be in the same network component, even though they must be by definition.

Furthermore, every sampled network will be a forest. That is, a collection of one or more trees: networks that have no cycles, no connections leading 'back into themselves'. Only being able to observe these tree sub-networks of the actual network limits what inferences can be made about the network as a whole. Observed link density isn't very useful for well-connected graphs because the observed link density maxes out just shy of one link/node, in cases where the observed network is one huge component.

Alternate measures of density, like component diversity (the Shannon species diversity index applied to the components that nodes are in), run into this problem as well. In Figure 1 below, component diversity is useless at estimating the average (responding) degree of the network for networks with more than 1.5 links/node.


Figure 1, a network measure based only on observed network structure.
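
For concreteness, here is a minimal sketch of a component-diversity measure like the one in Figure 1, using igraph; this is my own reading of 'Shannon diversity applied to the components that nodes are in', not the thesis code:

library(igraph)

component_diversity = function(g)
{
  memb = clusters(g)$membership
  prop = as.numeric(table(memb)) / length(memb)  # proportion of nodes in each component
  return( -sum(prop * log(prop)) )               # Shannon diversity index
}

## e.g. on a sparse random network with 200 nodes and 150 links
set.seed(1)
g = erdos.renyi.game(200, 150, type="gnm")
component_diversity(g)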

However, we can also estimate things like dPr(used)/dt, the rate of change in the proportion of a sample's connections that are also included in the sample over time. Barring variation due to response rate, we would consider such a measure to be non-positive. 

For networks that are sparse in large populations, the rate of change should be near zero, because very few connections will lead back into nodes that were already explored. What few recruitments there are will be just as successful up until near the end of the sample when a few may be turned away. 

However, for well-connected networks, sampling is done much more often by recruitment than by simple random sampling. As more of a network component is explored, the chance that each successive connection followed leads to someone already in the sample increases. Therefore, for well-connected graphs, we would expect dPr(used)/dt to be distinctly negative.

As Figure 2 shows below, we can use the rate of change in successes to differentiate between populations up to 2.5 links/node, even though most of the networks that we observe from 1.5-2.5 links/node all show up as single large components. Also note how poorly dPr(used)/dt differentiates between networks in the region where the structure-based diversity index excels.

Figure 2, a network measure based on sampling order.


Measures like these, in combination, can be used to make estimates of the features of the population as a whole, but that's for a poster I'm working on in the coming weeks.