Friday, 22 April 2016

Academic Proofreading / Copy Editing Samples

With the PhD wrapping up this summer, I can't default to 'do a higher degree' and have to go find a real job.

One option I've been considering is work in scientific, academic publishing. As a job, or just as a source of supplemental income, it seems ideal to me. It's the kind of job where I could actually add value to research by making it more ready to disseminate. Also, I have research experience in statistics, health science, molecular biology, and education. I write habitually. I'm a native English speaker who can also check the mathematical, and especially the statistical, assertions in an academic paper for correctness before it goes to an editor, or to the public.

Copy editing work can be done without leaving Vancouver; in fact, it can be done from a houseboat, or a houseboat city. Reading technical reports and academic papers would keep me actively learning and discovering. The work can be done at any time of day, and the amount of work can be adjusted to fit other, more time-specific activities.

There are companies like ManuscriptEdit and Scribendi that dispatch editing work to their own academics on contract. Many of their editors are PhDs with established careers and long publication records. These companies ask for, understandably, a proven record of copy editing ability and writing experience. Blog posts probably don't count.

There are a couple of certification programs, like the one from the Board of Editors in the Life Sciences (BELS) internationally, and the Editors' Association of Canada (EAC), whose scope is editing and proofreading in general. A certification would be great because it's a shorthand for proof of ability. For the BELS exam, the most convenient exam is this November in Florida, and I doubt I could be ready for that even if I could go. The EAC exams are probably doable locally, especially since their annual meeting is in Vancouver this summer, but it's a multi-year process.

So, I tried something with a smaller commitment. I selected short articles from open access journals, specifically ones with grammatical mistakes in their abstracts. Then, I printed out these articles, copy edited them as if they had been given to me before publication, and sent the results to the journals' editors, each with a request to be considered for future contract work.

I copy edited four articles that I can share here. The first three are recently published open-access articles, to which I received two 'no' responses. The last one is one of mine that was recently submitted, but I have permission from the other authors to share it here. Even though I didn't get a positive response, I wasn't expecting one, and getting 2/3 responses to a cold request at all feels pretty encouraging; it means I'm getting attention. I also got an invitation to be a volunteer peer reviewer for future papers, so there's that for connections too.

I'm still reading about the copy editing process, so I think the last two are better than the first two.

Paper 1: Open Journal of Statistics - Predictive Modeling of Gas Production, Utilization, and Flaring in Nigeria...

Paper 2: American Journal of Computational Mathematics - Self Similarity Analysis of Web Users Arrival Pattern at Selected Web Centers

Paper 3: Journal of Data Analysis and Information Processing - Role of Feature Selection on Leaf Image Classification

Paper 4: Submitted - Tactics for Twenty20 Cricket.

nhlscrapr revisited

VanHAC, the Vancouver Hockey Analytics Conference, was April 9th, and I was presenting a tutorial on the nhlscrapr package for R.

This post is excerpts from the code I presented and gave out at the tutorial. The full tutorial expands on my previous 'package spotlight' post on nhlscrapr. This post only includes the bare bones of downloading the raw games, examining the rate of goals scored and shots fired throughout the game, and making a basic player summary.

Also included is a patch to nhlscrapr I wrote that fixes a couple of functions ( , player.summary() ) that were throwing some errors, and adds a function ( ) that aids in matching player summaries to the proper names.

You can load the nhlscrapr package and use the patch with the following code:

source("Patch for nhlscrapr.r")

After which you can do things like define seasons after 2014-15 to extract. The following line builds a data frame of game IDs with one season beyond 2014-15.

fgd = = 1)

Which lets you download, process, and compile the games in the 2015-16 season

thisyear = "20152016"
game_ids = subset(fgd, season == thisyear)

dummy = = game_ids, wait = 5),
gc()"NHL play-by-play 2014-5.RData")

After extraction and compiling, we have two files

# .. the events log
temp = load("source-data\\nhlscrapr-20142015.RData")
ev_all = get(temp)

#.. and the player roster
temp = load("source-data\\nhlscrapr-core.RData")
roster = get(temp)

Analysis Snippet 1: Goals and Shots per minute of play

# First define a minutes variable based on the existing variable 'seconds'
ev_all$minutes = floor((ev_all$seconds - 0.5) / 60)
ev_all$minutes = pmax(ev_all$minutes, 0)

# Isolate the database to situations that are 
# in regulation time and during the regular season
ev_reg =  subset(ev_all, period <= 3 & gcode < 30000 & gcode >= 20001)
ev_goals = subset(ev_reg, etype == "GOAL")
ev_shots = subset(ev_reg, etype == "SHOT")
goals_per_hr = as.numeric(table(ev_goals$minutes)) / 1230 * 60
shots_per_hr = as.numeric(table(ev_shots$minutes)) / 1230 * 60

Which produces plots like these:

Note the big jump at the end of the game. Removing the empty net goals removes the jump entirely.

Few shots at the beginning of each period, and a downward trend. Could this be warm-up and fatigue?
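
The per-minute rates above can be sketched on synthetic data. This is only a minimal sketch, not the tutorial's code: the toy_events data frame, the number of games, and the event counts are all made up for illustration. One choice differs from the snippet above: turning the minutes into a factor with levels 0 to 59 makes table() report a zero for any minute with no events, so the result always has 60 entries.

```r
set.seed(1)
n_games = 50   # hypothetical; the snippet above divides by 1230 games

# Synthetic events: the second of regulation play (1 to 3600) and the event type
toy_events = data.frame(
  seconds = sample(1:3600, 3000, replace = TRUE),
  etype   = sample(c("GOAL", "SHOT"), 3000, replace = TRUE, prob = c(0.1, 0.9))
)

# Same minute definition as above: seconds 1-60 map to minute 0, and so on
toy_events$minutes = pmax(floor((toy_events$seconds - 0.5) / 60), 0)

# A factor with fixed levels keeps empty minutes in the table as zeros
minute_levels = 0:59
count_by_minute = function(ev) {
  as.numeric(table(factor(ev$minutes, levels = minute_levels)))
}

# Events in each minute, averaged over games, scaled to a per-hour rate
toy_goals_per_hr = count_by_minute(subset(toy_events, etype == "GOAL")) / n_games * 60
toy_shots_per_hr = count_by_minute(subset(toy_events, etype == "SHOT")) / n_games * 60

plot(minute_levels, toy_shots_per_hr, type = "l",
     xlab = "Minute of regulation play", ylab = "Shots per hour")
```

Because the synthetic events are spread uniformly over the game, this plot should look flat; any pattern by minute has to come from real data.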

Analysis Snippet 2: Sedin Summary

If we just want the basic event counts, we can use the roster to find the player IDs for the Sedin twins and see the number of goals, shots, hits, etc. they had in the 2015-16 season.

sedins = subset(roster,last=="SEDIN")$ 
ev_sedin = subset(ev_all, ev.player.1 %in% sedins) 

table(ev_sedin$ev.player.1, ev_sedin$etype)

We can also get player summaries for more complex events using the player.summary() function

roster_name =
ps = player.summary(ev_all, roster_name) 

The output of player.summary() is an array of 5 tables
The first table is the person that did the event (e.g. scored the goal, got the penalty, made the shot, miss, or blocked the shot)
player_summary =[,,1])

The second table is the second person in the event, if relevant. (e.g. 1st assist, victim in penalty (?), 2nd block (?))
The third table is the third person in the event, if relevant. (only 2nd assists)
player_summary$ASSIST = ps[,3,2] + ps[,3,3] # Third column is the GOAL event

The fourth table is anyone who was on ice when the event happened and it was their team that was
The fifth table is anyone who was on ice when the event happened and it was the opposing team that was refers to the team that scored, took the shot, won the faceoff, or received the penalty
player_summary$PLUSMINUS = ps[,3,4] - ps[,3,5]
player_summary$PLUSMINUS_SHOTS = ps[,2,4] - ps[,2,5]
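
To make the array indexing concrete, here is a minimal sketch with a made-up array in the shape described above (players by event types by 5 tables). The player names, event labels, and counts are invented, and putting GOAL in the third event column simply follows the comment above. This is not actual player.summary() output.

```r
# Hypothetical array: 2 players x 3 event types x 5 tables
toy_ps = array(0, dim = c(2, 3, 5),
               dimnames = list(c("PLAYER A", "PLAYER B"),
                               c("HIT", "SHOT", "GOAL"),
                               paste0("table", 1:5)))

toy_ps["PLAYER A", "GOAL", 1] = 10   # goals scored by the player
toy_ps["PLAYER A", "GOAL", 2] = 12   # first assists
toy_ps["PLAYER A", "GOAL", 3] = 7    # second assists
toy_ps["PLAYER A", "GOAL", 4] = 40   # on ice for a goal by own team
toy_ps["PLAYER A", "GOAL", 5] = 35   # on ice for a goal by the opposing team

# First table as a data frame, one row per player
toy_summary = as.data.frame(toy_ps[, , 1])

# Assists: tables 2 and 3 of the GOAL column
toy_summary$ASSIST = toy_ps[, 3, 2] + toy_ps[, 3, 3]

# Plus-minus: goals for minus goals against while on ice
toy_summary$PLUSMINUS = toy_ps[, 3, 4] - toy_ps[, 3, 5]

toy_summary
```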

Finally, we can use information from the roster to fill in the name information
roster_name = subset(roster_name, firstlast %in% rownames(player_summary))
name_idx = match(row.names(player_summary), roster_name$firstlast)
player_summary$firstlast = roster_name$firstlast[name_idx]
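
The match() step above can be sketched with toy data. The names and columns below are hypothetical stand-ins, not the real roster; the point is only that match() returns, for each row of the summary, the position of the corresponding roster row, so roster columns can be copied over in the right order.

```r
# Hypothetical summary table, with player names as row names
toy_summary = data.frame(GOAL = c(5, 2, 9),
                         row.names = c("JANE DOE", "JOHN ROE", "ALEX POE"))

# Hypothetical roster, in a different order and with an extra player
toy_roster = data.frame(
  firstlast = c("ALEX POE", "SAM MOE", "JANE DOE", "JOHN ROE"),
  last      = c("POE", "MOE", "DOE", "ROE"),
  stringsAsFactors = FALSE
)

# Keep only roster rows that appear in the summary
toy_roster = subset(toy_roster, firstlast %in% rownames(toy_summary))

# For each summary row, find the position of its matching roster row
name_idx = match(rownames(toy_summary), toy_roster$firstlast)

# Copy roster columns over in summary order
toy_summary$firstlast = toy_roster$firstlast[name_idx]
toy_summary$last      = toy_roster$last[name_idx]

subset(toy_summary, last == "DOE")
```

Copying a last-name column in the same way is what makes a filter on last name possible afterwards.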

And we can look at the Sedins again for comparison
subset(player_summary, last == "SEDIN")

Apologies to A.C. Thomas, the author of nhlscrapr, if I'm stepping on your toes with this patch.

Monday, 11 April 2016

Reflections / Postmortem on teaching Stat 302 1601

This was the second course I have been the lecturer for, although I’ve had the bulk of the responsibility for several online courses as well. Every other course I’ve been responsible for had between 30 and 140 students. This one had 300.

Stat 302 is a course aimed towards senior-undergrad life and health science majors who have completed a couple of quantitative courses before, including a similarly directed 200-level statistics course. It involved 3 hours of lectures per week for 14 weeks, a drop-in tutoring center in lieu of a directed lab section, 4 assignments, 2 midterms and a final exam. The topics to be covered were largely surrounding ANOVA, regression, modeling in general, and an introduction to some practical concerns like causality.

The standard textbook for this course was Applied Regression Analysis and Other Multivariate Methods (5th ed), but I opted not to use it to allow for more focus on practical aspects (at the cost of mathematical depth), as well as to save my students a collective $60,000.

I delivered about 75% of the lectures as fill-in-the-blank notes, where I had written pdf slides and sent them out to the students, but removed key words in the pre-lecture version of the slides. After each lecture the filled slides were made available. The rest of the lectures were in a practice problem / case studies format, where I sent out problems to solve before class, and solved them on paper under a video camera, with written and verbal commentary, during class. Most of these were made available too.

Everything can be found at for now.

What worked:

1. Focusing on the practical aspects of the material. This was a big gamble because it was a break from previous offerings of the course, and meant I had a lot less external material to work from. It was worth the risk, and I’m proud of the course that was delivered.

I was able to introduce the theory of an ambitious range of topics, including logistic regression, with time to spare. The extra time was used for in-depth examples. This example time added a lot more value than an equal amount of time on formulae would have. It more closely reflected how these students will encounter data in future courses and projects, and the skills they will need to analyze that data.

The teaching assistants who talked to me about it had good things to say about the shift. The more keen students asked questions of a startling amount of depth and insight. I feel that there were only a few cases where understanding was less than what it would have been if I had given a more solid, proof-based explanation of some method or phenomenon, rather than the data-based demonstrations I relied upon.

Although making the notes for the class was doubly hard because it was my first time and because I was breaking from the textbook, those notes are going to stand on their own in future offerings of Stat 302 and of similar courses. As a long-term investment, it will probably pay off. For this class, it probably hurt the attendance rate because students knew the filled notes would be available to them without attending. My assumption about these non-attendees is that they would gain little more from showing up than they would from reading the notes and doing the assignments.

2. Using R. At the beginning of the semester, I polled the students about their experience with different statistical software, and the answers were diverse. A handful of students had done statistics with SPSS, JMP, SAS, Excel, and R, and without much overlap. That meant that any software I chose would be new to most of the students. As such, I fell back to my personal default of R.

Using R meant that I could essentially do the computation for the students by providing them the necessary code with their assignments. It saved me some lecture time that would have otherwise been spent providing a step-by-step of how to manage in a point-and-click environment. It also saved me the lecture time and personal time spent dealing with inevitable license and compatibility issues that would have arisen from using anything not open source.

Also, now the students have experience with an analysis tool that they can access after the class is over. Even though many students had no programming experience, I feel like they got over the barrier of programming easily enough. There were some major hiccups which can hopefully be avoided in the future.

3. Announcing and keeping a hard line on late assignments. In my class, hard copies of assignments were to be handed in to a drop box for a small army of teaching assistants to grade and return. Any late assignments would have added a new layer of logistics to this distribution, so I announced on the first day that late assignments would not be graded at all. This also saved me a lot of grief with regards to judging which late excuses were ‘worthy’ of mercy, or trying to verify them.

4. Using a photo-to-PDF app on my phone. It’s faster and more convenient than using a scanner. Once I started using one, posting keys and those case study style lecture notes became a lot easier.

5. Including additional readings in the assignments.  The readings provided the secondary voice to the material that would have otherwise been provided by the textbook. Since I've posted answers to the questions I wrote, I will need to make new questions in order to reuse the articles, but the discovery part is already done.

6. The Teaching Assistants. By breaking from the typical material, I was putting an extra burden on the teaching assistants to also have knowledge beyond the typical Stat 302 offering. They kept this whole thing together, and they deserve some credit.

What I learned by making mistakes:

1. USE THE MICROPHONE. I have good lungs and a very strong voice, so even when a microphone is available, my preference has been to deliver lectures unaided. This approach worked up until one morning in Week 3 when I woke up mute and choking on my own uvula. Two hours of lectures had to be cancelled.

2. Use an online message board. For a large class, having a message board goes a long way. It allows you to answer a student question once or twice, rather than several times over e-mail. I had underestimated the number of times I would get the same question, and answering the question in class didn’t seem to help because of the 45-60% attendance rate. Other than the classroom, my only other option was to send out a mass email, which, aside from sending out lecture notes, was done sparingly.

A message board also serves the same purpose as a webpage: a repository of materials like course notes, datasets, and answer keys.

3. Do whatever you can in advance. Had I simply spent more time writing more rough drafts of lectures, or made or found some datasets to use, before the start of class in January, that time spent would have paid off more than one-to-one. How? Because I still had to do that work AND deal with the effects of lost sleep afterwards. There were a few weeks where my life was a cycle of making material for the class at the last minute, and recovering from working until dawn. Thank goodness I was only responsible for one course.

4. Distrust your own code. I have a lot of experience with writing R code on the fly, so I thought I could get away with minimal testing of the example code I wrote and gave out with assignments. Never again.

One of my assignments was a logistical disaster. First, a package dependency had recently changed, so even though on my system I could get away with a single library() call to load every function needed for the assignment, many students needed two. For others, the package couldn't even be installed.

Also, when testing the code for a question, I had removed all the cases with missing data before running an analysis. I didn’t think it would make any difference because the regression function, lm(), removes these cases automatically anyway. It turns out that missing data can seriously wreck the stepAIC() function, even if the individual lm() calls within the function handle it fine.
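
The missing-data issue can be sketched on synthetic data. This is a minimal illustration under my own assumptions, not the assignment code: the toy data frame and its variable names are invented, and stepAIC() comes from MASS, a recommended package included with most R installations. The underlying problem is that lm() drops incomplete rows separately for each model, so a model without the variable that has missing values is fit to more rows than the full model, and the two are no longer comparable. The raw attempt is wrapped in try() because it may stop with an error, depending on which terms get dropped.

```r
library(MASS)  # provides stepAIC()

set.seed(302)
n = 100
toy = data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y = 1 + 2 * toy$x1 + rnorm(n)

# Missing values in one predictor only
toy$x3[1:10] = NA

# lm() silently drops the 10 incomplete rows here
full_fit = lm(y ~ x1 + x2 + x3, data = toy)
nobs(full_fit)   # 90

# Stepwise selection on the raw data: a model without x3 would be refit
# on all 100 rows, which is not comparable with the 90-row full model
raw_attempt = try(stepAIC(full_fit, trace = FALSE), silent = TRUE)

# Safer: drop incomplete cases once, so every candidate model sees the same rows
toy_complete = na.omit(toy)
complete_fit = lm(y ~ x1 + x2 + x3, data = toy_complete)
selected = stepAIC(complete_fit, trace = FALSE)
formula(selected)
```

Removing the incomplete cases up front, as in the testing described above, is exactly what hides the problem; students working from the raw data would not have had that step.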

In the future, I will either try to take any necessary functions from packages and put them into a code file that can be called with source(), or I will provide the desired output with the example code. This also ties back into working until dawn: quality suffers.

5. Give zero weight on assignments. The average score on my assignments was about 90%, and with little variation. As a means of separating students by ability, the assignments failed completely. As a means of providing a low-stakes venue for learning without my supervision, I can’t really tell. The low variation and other factors in the class suggest to me a lot of copying or collusion.  Identifying which students are copying each other, or are merely writing things verbatim from my notes is infeasible – even with teaching assistants. The number of comparisons grows with the SQUARE of the number of students, and comparisons are hard to judge fairly in statistics already.
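
To put a number on that growth, here is a quick calculation using the class size and assignment count mentioned earlier in this post. It assumes every pair of students would be compared once per assignment, which is the naive worst case rather than how anyone would actually check.

```r
n_students    = 300   # class size mentioned earlier in this post
n_assignments = 4     # assignments in the course

# Number of distinct pairs of students: n * (n - 1) / 2
pairs_per_assignment = choose(n_students, 2)
pairs_per_assignment                    # 44850

# Across every assignment in the course
pairs_per_assignment * n_assignments    # 179400

# Doubling the class size roughly quadruples the number of pairs
choose(2 * n_students, 2) / pairs_per_assignment
```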

One professor here with a similar focus on practicality, Carl Schwarz, gives 0% grade weight to assignments in some classes. The assignments are still marked for feedback, I assume, but only exams are used for summative evaluation. This would be ideal for the next time I teach this course.

I would expect the honest and interested students to hand in work for practice and feedback and they would not be penalized grade-wise for not handing in a better, but copied, answer. I would expect the rest of the students to simply not hand anything in, which isn’t much worse for them than copying, and would save my teaching assistants time and effort.