Tuesday, 18 August 2015

Lesson Notes - Edit Distance (Text Processing 2)

These are the notes for the second lecture in the unit on text processing that I'm building. Some useful ideas like exact string matching and the definitions of characters and strings are covered in the notes of Text Processing 1 - Regular Expressions.

Edit distance, also called Levenshtein distance, is the minimum number of primary edits that would need to be made to transform one string into another. The R function adist() computes the edit distance.

adist("exactly the same","exactly the same") # edit distance 0 
adist("exactly the same","totally different") # edit distance 14

A primary edit is one of three things:
a) An insertion of a character.
b) A deletion of a character.
c) A substitution of a character.

For example:

## Insertions: All of these will return an edit distance of 1.
adist("5678","05678") # inserting an '0' character before the '1'
adist("weasel", "w easel")  # inserting a space between w and e
adist("weasel", "weasel!")  # inserting a '!' character after the 'l'

## Inserting 5 characters gives an edit distance of 5
adist("Genome","Lord Genome")

## Deletions
adist("Genome","Gnome") # deletion of the first 'e': 1 edit
adist("Genome","nome")  # deleting the first two characters: 2 edits
adist("Canada","Cnd")   # deleting all three ehs: 3 edits

## substitutions
adist("iiiii","iixii")  # replacing 1 'i'
# 1
adist("iiiii","xiyiz")  # replacing 3
# 3

If more than one type of edit is needed to turn one string into another, the edit distance across different types is additive.

adist("weasel","Vin Deasel")  # 4 insertions, then replacing 'w' with 'D'
# 5

The edit distance is the 'shortest path' distance between two strings. This means that two strings might be closer than you expect. For example, you can get from "Moos" to "oose" with three substitutions, but adist() will return a smaller distance because a simpler transformation exists (delete 'M', then insert 'e' at the end).

adist("Moos", "oose")

The edit distance from one string to another is not always the sum of the edit distances between them and an intermediate string, but it is never more than that sum. In other words, the triangle inequality holds.

### These:
adist("triangle inequality?","what's that?")
adist("what's that?","some math thing.")
### 15 and 12

### Do not add to this:
adist("triangle inequality?","some math thing.")
### 17

By default, each type of primary edit (insertion, deletion, and substitution) contributes an edit distance of 1. This can be changed with the costs option in adist(). As above, the smallest total cost is used, so watch your weights.

adist("insert3","in---sert3", costs=list(insertion=1)) # 3
adist("insert3","in---sert3", costs=list(insertion=50)) # 150
adist("Mooo!","Moo", costs=list(insertion=0, deletion=50)) # 100

Note the case sensitivity. Since, to a computer, an upper case 'R' and a lower case 'r' are two completely different characters, differences in case count towards the edit distance.

str1 = "youch that hurt!"
str2 = "YOUCH THAT HURT!"
adist(str1,str2) ## edit distance 13


If you don't want the edit distance to count case differences, you can either use functions to change the case of the strings, or use an option in adist().

adist( toupper(str1), str2)
adist( str1, tolower(str2))
adist( str1, str2, ignore.case=TRUE) # all zero distance


Leading and trailing spaces can also be an issue because they contribute to distance, but you can remove them with the str_trim() function in the stringr package.

str1 = "Mars"
str2 = "                 Mars              "
adist(str1,str2) # distance is vast
adist(str1, str_trim(str2)) # distance is nothing

Transpositions, the swapping of adjacent characters or words, are more problematic. adist() does not treat a transposition as a single edit; it is counted as some combination of substitutions or insertions/deletions.

adist("Animal", "Aminal") # 2
adist("Down Under", "Under Down") # 10
adist("Life isn't fair.","fair Life isn't.") # 10

Edit distance is used for fuzzy matching.


Fuzzy matching, in contrast to the exact matching that we dealt with in the last text processing lesson, is designed to detect strings that are close to a target string (or string pattern) as well as ones that fit the string (pattern) exactly. Common systems that use fuzzy matching (among other things) include spell checkers and search engines that suggest alternatives close to what you typed. Some of the suggestions that come up will be selected because they have a minimal edit distance, subject to some weighting based on common misspellings or typos.
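As a toy illustration of the idea, here is a small sketch of a spell-suggester that ranks a dictionary by edit distance to a typo. The dictionary and the typo are invented for the example:

dictionary = c("weasel", "easel", "diesel", "measles", "season")
typo = "waesel"
distances = adist(typo, dictionary)  # one distance per dictionary word
dictionary[order(distances)][1:3]    # the three closest suggestions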


We can build a workable fuzzy matching function in R in less than ten lines.

Suppose we had the transcript of the events of a Major League Soccer match from ESPN, like this one between the Houston Dynamo and the New England Revolution (ESPN_MLS_Example.txt; thanks to Rajitha Silva for letting me give away part of his research like this), and we wanted to identify the team associated with certain lines, but there was a lot of extra, messy information in the way. Consider these lines from the example text:

  • Houston Dynamo Dominic Oduro 58'
  • New England Revolution • Shalrie Joseph 43' • Kheli Dube 73'
  • Joseph Ngwenya (Houston Dynamo) receives a red card.
  • Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver
  • Goal !!! Kheli Dube scores for New England Revolution
In each case, you can identify the team of interest by looking at the edit distance between the team name and each of the lines. Most of the time (alas, there are always edge cases), the team with the shorter edit distance to the line of text is the team of interest.

adist("Houston Dynamo Dominic Oduro 58'","Houston Dynamo")
18

adist("Houston Dynamo Dominic Oduro 58'","New England Revolution")
27

Using this principle, we can make a simple 'choose the least' system.

teamlist = c("Houston Dynamo", "New England Revolution", 
 "Chicago Fire", "Montreal Impact", "Vancouver Whitecaps")

clean_teamname = function(raw_name, teamlist)
{
    #optional
    require(stringr)
    raw_name = str_trim(raw_name)
    
    #necessary
    ed_dist = adist(raw_name, teamlist)
    best = which(ed_dist == min(ed_dist))[1] # [1] breaks ties
    return(teamlist[best]) # Return the best fitting name
}

Then we test it out:

clean_teamname("Houston Dynamo Dominic Oduro 58' ",teamlist)
[1] "Houston Dynamo"

clean_teamname(
"Houston Dynamo makes a substitution: Joseph Ngwenya enters for Cameron Weaver",
teamlist)
[1] "Houston Dynamo"

clean_teamname(
"Joseph Ngwenya (Houston Dynamo) receives a red card. ",teamlist)
[1] "Houston Dynamo"

clean_teamname("Goal !!! Kheli Dube scores for New England Revolution",teamlist)
[1] "New England Revolution"



Tuesday, 4 August 2015

Lesson Prototype - First Lecture on Multiple Imputation

Continuing the work on my data management course, here's the start of the Imputation unit. I'm intending three units in total - Text processing, principles of data manipulation (cleaning, merging, formatting, and database access), and imputation.

The companion reading for this lesson is Chapter 1 of Flexible Imputation of Missing Data, by Stef van Buuren, which is referenced in some places and is drawn from heavily to make these notes. As this is just a prototype, I have not yet filled in all the details or included the iris_nasty dataset, which is just the iris dataset with random values deleted.

Structure:
1) Motivation for learning imputation
2) Types of missingness
3) A few simple methods that build up towards multiple imputation

--------------------
Hook: The frictionless vacuum of data


Consider the iris dataset in base R:


data(iris)
iris[c(1:3,148:150),]

Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1            5.1         3.5          1.4         0.2    setosa
2            4.9         3.0          1.4         0.2    setosa
3            4.7         3.2          1.3         0.2    setosa
148          6.5         3.0          5.2         2.0 virginica
149          6.2         3.4          5.4         2.3 virginica
150          5.9         3.0          5.1         1.8 virginica

any(is.na(iris))
[1] FALSE

dim(iris)
[1] 150   5


A dataset like this is the equivalent of a frictionless vacuum in a physics problem: ideal for a variety of methods and for teaching principles but not commonly found in 'real' industrial settings. More likely is something like this:

iris_nasty = read.csv("iris_nasty.csv", as.is=TRUE)
iris_nasty[c(1:3,148:150),]
 
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
1             NA         3.5          1.4          NA    setosa
2            4.9          NA           NA         0.2    setosa
3            4.7         3.2          1.3          NA    setosa
148          6.5         3.0          5.2          NA virginica
149          6.2          NA          5.4         2.3 virginica
150          5.9         3.0           NA          NA virginica


Missing values, shown as "NA" (as in Not Applicable) like the ones in the nasty version of the iris dataset above, could come from any manner of problems. In a biological setting, it could be a failure of the measuring device. If it was a live specimen, the measurement may not have been possible without harming the specimen, or trampling a lot of ground in order to reach it.

If this data were from a social survey, missingness can come from questions that the respondent skipped or refused. Data points could also be missing because they were not relevant to the respondent, such as questions about prostate cancer asked of a woman. If the survey was done over the internet, questions could be missed due to a connection problem.

------------
Types of missingness


Missingness patterns are classified into three types: MCAR, MAR, and MNAR.

MCAR: Missing Completely At Random - The chance of any given value being missing has nothing to do with any meaningful variable. This is the least problematic type to deal with, because methods of imputation (filling in missing values) can work on one variable at a time, rather than on the whole dataset at once. Missingness related to internet connection problems is reasonably assumed to be MCAR.

MAR: Missing At Random - The chance of any given value being missing depends ONLY on other variables that are in the dataset. This is more problematic to impute because there is the possibility of cases that are missing two variables, in which the missingness of one variable depends on the other, which is itself missing. A possibility could be the results of someone's most recent prostate exam being missing in a dataset that also includes age, sex, and a few other factors that explain most or all of the reasons why such data wouldn't be available.

MNAR: Missing Not At Random - The chance of any given value being missing depends in some way on variables that are NOT included in the dataset. This is by far the most problematic of the missingness patterns because there is no way to know why something is missing, and therefore any assumptions about the missing values are conjecture and potentially biased. Van Buuren, author of "Flexible Imputation of Missing Data", uses an example of a weight scale that is wearing out over time, especially after heavy objects are weighed, and occasionally failing to measure something. If we don't know what was weighed first or last, then we can't accurately estimate the chances of any given value being missing.

My favourite example is the 2011 Canadian Long-Form Census, which was voluntary. About a third of the forms that were sent out were not returned, and there is no reasonable way to account for them because certain demographics are less likely to return voluntary surveys. Unfortunately, the information about which demographics these are is exactly the kind of information that is missing.


There are methods for telling MCAR data from MAR data, which we will cover later in this unit. However, there is by definition no way to tell MAR data from MNAR data because that would involve conditioning on information that is not available. Unless there is cause to believe that some data of interest is MNAR, the default assumption is that data is MAR instead.
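To make the difference between the first two types concrete, here is a small simulation sketch (my own illustration, not from van Buuren) using the clean iris data; the missingness probabilities are arbitrary:

set.seed(1)
mcar = iris
mcar$Sepal.Width[runif(150) < 0.2] = NA  # every value has the same 20% chance: MCAR

mar = iris
p = ifelse(iris$Species == "setosa", 0.4, 0.05)  # chance depends only on Species,
mar$Sepal.Width[runif(150) < p] = NA             # which is fully observed: MAR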



--------------
Dealing with missing data - building up to multiple imputation


There are many ways to impute the missing data, that is, to replace the NA values with values that can be used for analysis. Starting with the most naive and working up:


1) Remove all cases that have a missing value.

length(which(is.na(iris_nasty)))
[1] 243

dim(iris_nasty)
[1] 150   5

dim(na.omit(iris_nasty))
[1] 16  5

Although 32% of the data values (243 of 750) are missing, 89% of the cases (134 of 150) contain at least one missing value. To ignore all of those cases means throwing out a lot of data that was collected and recorded. This is not a preferable solution.


2) Remove cases with missing values for the variables used.


mod1 = lm(Petal.Length ~ Petal.Width, data=iris_nasty)
length(mod1$residuals)
[1] 94


mod2 = lm(Petal.Length ~ ., data=iris_nasty)
length(mod2$residuals)
[1] 16

This method of dealing with missing data is the default behaviour of the linear model function lm() in R. Depending on the model used, it means losing between 56/150 and 134/150 of the sample, which is at least as good as the listwise deletion option suggested before this. However, it introduces a comparability problem:

round(mod1$coef,2)

(Intercept) Petal.Width 
       1.12        2.23 


round(mod2$coef,2)

      (Intercept)      Sepal.Length       Sepal.Width       Petal.Width 
            -0.32              0.48             -0.21              0.53 
Speciesversicolor  Speciesvirginica 
             1.67              2.12 


Between these two models, the coefficients are different (try for yourself to see if these differences are significant). From the simple linear model to the full model, the intercept changes from 1.123 to -0.319, and the slope along Petal.Width changes from 2.23 to 0.53.

If these two models were using the same dataset, we could interpret the differences between these coefficients as the effect of holding the Sepal variables constant and fixing the Species variable in the full model. But the full model was developed using a sample of 16 units, whereas the simple model uses 94 units. Are the differences in the model coefficients attributable to the sample units that were omitted or to the variables that were included, or to both, or to an interaction? There's no way to know.
This confusion also highlights why we need to impute data in the first place, rather than ignore it when it is missing.


3) Mean-value imputation: Take the mean of all known values for a variable, and use that value for all missing values for that variable.

This solves the problems from the row deletion solutions above, in that it would allow the use of all 150 cases in the iris_nasty dataset. However, it fails to reflect any sort of variation in the values being imputed. For example, what if Sepal.Length is different for each species? The mean-value imputed values for Sepal.Length will be the same for all the species, which poorly reflects the trends in the data.
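A minimal sketch of mean-value imputation in base R, assuming the iris_nasty dataset from above (numeric columns only):

iris_mean = iris_nasty
for(col in c("Sepal.Length","Sepal.Width","Petal.Length","Petal.Width"))
{
    missing = is.na(iris_mean[[col]])
    iris_mean[[col]][missing] = mean(iris_mean[[col]], na.rm=TRUE)
}
any(is.na(iris_mean[,1:4]))  # FALSE: every numeric NA is now its column mean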

(Figure from van Buuren showing how limiting this is for a simple regression)

4) Model-based imputation.
Van Buuren refers to this as regression-based imputation in Chapter 1, but the bigger picture is that imputed values can be those predicted from a model, such as a regression model.

Like mean-value imputation, this solution allows all the cases to be used. It also reflects the average trends of the data. However, it does NOT reflect the variance unaccounted for by the model. That is, the uncertainty inherent in the imputed values isn't shown; every imputed value will be the average of what that value is likely to be.

For each value on its own, taking these model-predicted values as the imputations is ideal. After all, a model prediction reflects our best guess as to what a value really is. However, when you take all the imputed values together, the trends in the data are unrealistically reinforced. If we do this with the missing Petal.Width and Petal.Length values, then the estimate of the correlation between these variables becomes ______. Compare that to the correlation found using the 94 rows of data that don't have missing values: ______ , or the 150 rows of data in the original iris data set without imputations: ________.
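As a sketch of the idea (not the source of the exact figures above), the simple regression from model 1 can be used to fill in missing Petal.Length values wherever Petal.Width is observed:

fit = lm(Petal.Length ~ Petal.Width, data=iris_nasty)  # lm drops incomplete rows itself
needs = is.na(iris_nasty$Petal.Length) & !is.na(iris_nasty$Petal.Width)
iris_reg = iris_nasty
iris_reg$Petal.Length[needs] = predict(fit, newdata=iris_nasty[needs,])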


(Figure from van Buuren showing how limiting this is too for a simple regression)
(oorrrrrr... figure of scatterplot of model from iris_nasty)

One set of methodology questions comes up with model-based imputation that we will address in a later lesson: if there are missing values for multiple variables in a case, which of the variables do you impute first? Do you use the imputed value from one variable to inform predictions for imputations on other variables, or do you only use complete cases as a basis for imputation?


5) Stochastic regression-based imputation.

This is the same as regression or model-based imputation, except that some random noise is added to each predicted value to produce an imputed value.

The distribution of the random noise is determined by the amount of uncertainty in the model. For example, a prediction from a very strong regression would have only a small amount of noise added as compared to a prediction from a weak regression. Also, if a linear regression were the model used for prediction, the noise would come from a Gaussian distribution as per the standard regression assumptions.
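Continuing the regression sketch from the previous section, a minimal version of the stochastic step might look like this, using the model's residual standard error as the standard deviation of the noise:

sigma = summary(fit)$sigma  # residual standard error of the regression
iris_sto = iris_nasty
iris_sto$Petal.Length[needs] = predict(fit, newdata=iris_nasty[needs,]) +
    rnorm(sum(needs), mean=0, sd=sigma)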

Adding this noise makes the distribution of imputed values resemble that of the known values, instead of being underdispersed like it is in simple model-based imputation. However, the results from the imputed dataset are now sensitive to the random seed used to generate the noise, and the strength of correlations is now understated. Using stochastic regression-based imputation, 95% of correlations for model 1 in the iris_nasty dataset were between ____ and ______.

Finally, after the noise is added, the imputed missing values are treated with as much certainty as the directly observed values. There is no distinction between values that were originally missing and those that were not.


6) Multiple imputation.
This method is the one we will be using for the rest of this unit. To do multiple imputation, take predictions of values and add random noise, as we did in stochastic regression-based imputation, but do this independently for several (e.g. 3-10) copies of the dataset.

Each copy of the dataset will have the same values for everything that was observed directly, but different values for anything that was imputed. The average of the imputed values will still be close to the predicted value, but now the differences in the imputed values between datasets reflect the uncertainty we have in these imputed values.
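As a preview, here is roughly what this looks like with the mice package (written by van Buuren himself), which we'll cover properly later in the unit:

library(mice)
iris_nasty$Species = factor(iris_nasty$Species)  # mice wants factors, not raw characters
imp = mice(iris_nasty, m=3, seed=1)  # three independently imputed copies
complete(imp, 1)  # the first completed dataset; complete(imp, 2) differs in its imputed cells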

(Below are three copies of those six lines from iris_nasty, but with imputations of the missing values shown in bold and red)

(figure of what's going on... one dataset to many, many datasets to many analyses, combined to one)

To analyse such a meta-dataset, we perform the same analysis on each copy of the dataset, and then combine the parameter estimates from each result using Rubin's Rules. This is shown as a black-box process for the three imputations of the iris_nasty dataset below.

(example of lm from each imputation)

(rubin.rules() black box for now which spits out parameter estimates and uncertainty)
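As a sketch of that black box, mice's pool() function implements Rubin's Rules, continuing from the imp object above:

fits = with(imp, lm(Petal.Length ~ Petal.Width))  # the same model fit to each imputed copy
summary(pool(fits))  # pooled parameter estimates and their uncertainty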

Rubin's Rules deserve an entire lesson, so that's where we'll pick up next time we talk about imputation.


Previous Lesson Prototype on Text - Regular Expressions