Featured post

Textbook: Writing for Statistics and Data Science

If you are looking for my textbook, Writing for Statistics and Data Science, here it is for free in the Open Educational Resource Commons.

Saturday, 29 December 2018

Degrees of Freedom, Explained


You can interpret degrees of freedom, or DF, as the number of (new) pieces of information that go into a statistic. This post uses examples from this video [https://www.youtube.com/watch?v=rATNoxKg1yA , James Gilbert, “What are degrees of freedom”].

I personally prefer to think of DF as a kind of statistical currency. You earn it by taking independent sample units, and you spend it on estimating population parameters or on the information required to compute test statistics.

In this article, degrees of freedom are explained through these lenses using some common hypothesis tests, with selected topics like saturation, fractional DF, and mixed-effect models at the end.
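
To make the currency framing concrete, here is a minimal sketch (my own illustration, not from the original post) of the bookkeeping for a pooled two-sample t-test in Python: each observation earns one DF, each of the two estimated group means spends one, and the t distribution runs on what is left over. The group sizes and simulated values are arbitrary.

import numpy as np
from scipy import stats

# Toy two-sample t-test: DF is earned by observations, spent on estimated means.
rng = np.random.default_rng(1)
x = rng.normal(loc=10, scale=2, size=12)   # group 1 earns 12 DF
y = rng.normal(loc=11, scale=2, size=15)   # group 2 earns 15 DF

n1, n2 = len(x), len(y)
df = n1 + n2 - 2                           # spend 2 DF on the two sample means

# Pooled variance and the usual t statistic, referred to a t distribution
# with the leftover df.
sp2 = ((n1 - 1) * x.var(ddof=1) + (n2 - 1) * y.var(ddof=1)) / df
t_stat = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
p_val = 2 * stats.t.sf(abs(t_stat), df)
print(f"df = {df}, t = {t_stat:.2f}, p = {p_val:.3f}")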

Saturday, 15 December 2018

We read this: "Your bones".

This is a review of the book “Your Bones”, a medical book aimed at the general public, by Lara Pizzorno with Jonathan V. Wright. The review was co-authored with Gabriela Cardillo; she provided the body, medical expertise, and main content, and I edited and added the criticisms and asides at the end.

Monday, 12 November 2018

Four OJS manuscript reviews, 2015-2018

Here is a dump of the remaining reviews I made for Scirp's Open Journal of Statistics from 2015 to 2018. For reasons explained in The Last Review I'll Ever Do For OJS, I won't provide additional linking information. These reviews are here as how-to examples.


Tuesday, 6 November 2018

Statistics in Politics and Demographics

In introductory statistics courses, we present these polls as if they were draws from a binomial distribution: that is, as if every member of the relevant population were equally likely to be a respondent in the sample, and as if each respondent reported their true voting intention or approval. Poll aggregating websites like Fivethirtyeight and Politifact have shown how far from the truth a real political poll can be.

This post is a draft of a proposal for an undergraduate interdisciplinary course on polling, shared between statistics and political science.
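
To make the idealized model above concrete, here is a minimal sketch (my own illustration, not from the post) of a poll treated as a single binomial draw, along with the textbook margin of error. The sample size and 'true' support level are made-up numbers.

import numpy as np

# Idealized textbook poll: n independent respondents, each supporting the
# candidate with the same probability p and answering truthfully.
# (Hypothetical numbers; real polls violate these assumptions.)
rng = np.random.default_rng(538)
p_true = 0.52                    # hypothetical true support
n = 1000                         # hypothetical sample size

supporters = rng.binomial(n, p_true)            # one binomial draw = one poll
p_hat = supporters / n
moe = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)   # approximate 95% margin of error
print(f"poll estimate: {p_hat:.3f} +/- {moe:.3f}")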


Wednesday, 31 October 2018

Parameter Estimation of Binned Data

Section 1: Introduction – The Problem of Binned Data


Hypothetically, say you’re given data like this in Table 1 below, and you’re asked to find the mean:

Group           Frequency
0 to 25         114
25 to 50        76
50 to 75        58
75 to 100       51
100 to 250      140
250 to 500      107
500 to 1000     77
1000 to 5000    124
5000 or more    42

Table 1: Example binned data. Border cases go to the lower bin.

The immediate problem is that the mean (and the variance, and many other statistics) is an average of exact values, but here we only have ranges of values. There are a few things similar to taking the mean that could be done:
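
For instance, one of the simplest options (a sketch of a common workaround, not necessarily the approach taken in the full post) is to pretend every observation sits at the midpoint of its bin. The open-ended "5000 or more" bin forces an arbitrary choice; a hypothetical cap of 10,000 is assumed here purely for illustration.

import numpy as np

# Midpoint approximation using the counts from Table 1.
# The upper limit of 10000 for the last bin is an assumption, not given data.
edges = [(0, 25), (25, 50), (50, 75), (75, 100), (100, 250),
         (250, 500), (500, 1000), (1000, 5000), (5000, 10000)]
freq = np.array([114, 76, 58, 51, 140, 107, 77, 124, 42])

midpoints = np.array([(lo + hi) / 2 for lo, hi in edges])
mean_est = np.sum(midpoints * freq) / freq.sum()
var_est = np.sum(freq * (midpoints - mean_est) ** 2) / (freq.sum() - 1)
print(f"midpoint-based mean ~ {mean_est:.1f}, sd ~ {np.sqrt(var_est):.1f}")

Fancier options, such as fitting a parametric distribution to the bin counts, trade the midpoint assumption for other assumptions.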


Monday, 22 October 2018

The Last Review I'll Do For the Open Journal of Statistics

The following review is for a paper that is currently published in Scientific Research Publishing's Open Journal of Statistics (link omitted intentionally). I accepted the responsibility to review it, found it unfit for publication, and returned a review less than a month later, only to find that the paper had already been published as is.

I won't be reviewing for this journal or this publishing house again. To ask for my review, get my assent, and then not wait a reasonable time (a month, really?) for that review before going ahead and publishing is disrespectful of my time. It also smacks of predatory journal behaviour.
Was there another peer reviewer, and were they qualified to perform the review? Read my review, and the paper, and judge for yourself.

Friday, 12 October 2018

How to Give a Career Talk, with Question Prompts!


A few of the people I've asked have been interested in talking in front of a seminar class on careers in statistics, but didn't think they could fill a half-hour with their career talk. However, with prompts adapted from this list, external speakers have had no trouble giving such a talk with at most a couple of hours of preparation.


Wednesday, 10 October 2018

Applying for a Master’s Degree in Stats or Data Science – Why and How

Why apply for an MSc, instead of finishing with a BSc in Stats or Data Science?


Statistics is a 'discovered major'. Traditionally, not many people have gone into university planning to major in statistics. The field also relies, as an unofficial prerequisite, on a wide range of skills that are typically learned during an undergrad degree.

For example, a statistician is expected to have a background in mathematics, in writing and communication, and in programming. They’re also expected to know a little about their respective service or collaboration fields.


Friday, 28 September 2018

Open Reviews 3 - Open Journal of Statistics 2015

This was the third paper I reviewed for the Open Journal of Statistics, from the predatory publisher Scientific Research Publishing.

The manuscript was a simulation study of a new computational method. It was the first paper I had reviewed in three years, as I had been otherwise swamped with coursework. You can see from the extensive writing feedback I gave that I still wasn't clear on the difference between the roles of copy-editor and reviewer. By word count, the review was a third as long as the manuscript itself.


Wednesday, 26 September 2018

Two career failure stories


This semester, I am teaching a course in career planning in statistics. No such course existed when I was an undergrad, so a lot of the planned course material was learned from experience. I opened the semester with some 'failure stories' of trying to start a career with a BSc in Math. Here are some of my failures:

Friday, 21 September 2018

Open Reviews 2 - Open Journal of Statistics 2012

These were the first two papers I reviewed for Scientific Research Publishing's Open Journal of Statistics, back in 2012. The first was 'An Exceptional Generalization of the Poisson Distribution', and the second was 'A Proposed Statistical Method to Explore Quality of Quantitative Data'.


Tuesday, 18 September 2018

Open Reviews 1 - Two Meta-Psychology Papers

This is the first post of several in which I publish the peer reviews I have previously given to journals. 

There are a few reasons for doing this, but the main one is purely mercenary: I want to get more return for the effort I put into carefully reading and critiquing these articles.


Wednesday, 22 August 2018

Draft Pairing Tournament Format


Worst-vs-first pairing structures for playoffs are designed to reward teams for doing well in the regular season. Sometimes this backfires. A better system would be a 'pairing draft' in which teams choose their first-round playoff opponents in order of regular season ranking.

Give me a few minutes to convince you.


Monday, 20 August 2018

Baseball-to-Cricket translation guide

If sports were species of animals, baseball and cricket would be considered closely related, at least as closely as English rules rugby is to American football. Having said that, it's been hard to find fellow fans of both sports. So, in the hopes of increasing the number of crossover fans, I've prepared the following 'translation guide' to explain one sport in terms of the other as closely as possible.


Thursday, 9 August 2018

I read this: Visualizing Baseball

When Tim Swartz teaches statistics in sports, he does so through case studies. He will take a paper, such as one about the optimal times to substitute players in soccer, and use it as a platform to demonstrate analysis of soccer data. Visualizing Baseball by Jim Albert is a collection of nine case studies, most of which would be excellent for a course like Tim Swartz's.

This is a book about baseball, written by a statistician for the stats enthusiast. I say stats enthusiast and not statistician because little to no stats experience is required to understand the book. Likewise, little expertise about baseball is assumed, and key concepts are explained clearly but dryly.


Friday, 27 July 2018

Stat Writing Exercise - Pre-Baked Regression Analysis

In this Statistical Communication exercise, the learners take an already completed regression analysis and write a report of 250-400 words describing the analysis. This exercise consists of a 40-50 minute example that the teacher goes through to demonstrate and establish expectations, followed by a 50-70 minute period for the learners to emulate that writing process on a new analysis.

Friday, 20 July 2018

Chess Variant - Laser Chess / Khet / Deflexion

Khet, Deflexion, and Laser Chess are different names for the same game. The currently commercially available version is called Laser Chess and has a space theme; Deflexion, Khet, and Khet 2.0 have an ancient Egyptian theme. The themes and some of the suggested starting boards differ between versions, but the pieces and board are functionally identical. The newer versions have also introduced a couple of new starting positions and improved the accuracy of the laser, but otherwise kept the fundamentals the same.


Thursday, 19 July 2018

Annual Report to Stakeholders 2017-18

Summary: 

The last year has been a lot like the year before, which mostly works out well. There are a few things I wanted to do that I still haven't, which makes it feel a bit like stagnation, but let's focus on the victories instead.

Having more professional experience has allowed me to do many of the same things professionally as last year, but on a larger scale and faster. For example, I taught 6 classes this year, up from 5 last year. I wrote a lot more material, edited a lot more papers, and read a lot more books.

Stat Writing Exercise - Improving Graphs

This is an in-class exercise that I gave to the 3rd year undergrads in a Statistical Communication class. It was designed to take 20 minutes to explain and 40-50 minutes to execute, including instant feedback. It went well enough that I felt it was worth sharing.

Tuesday, 26 June 2018

Survey Notes: Sensitive Information, Heaping, and Psychometrics


Below are some additional notes on four survey question topics that warranted more specific information than my 20 survey question tips could offer: 
1. How even the most innocent of differences can produce a statistically significant effect.
2. One way to ask for sensitive information without respondents admitting anything.
3. The heaping phenomenon.
4. Where to find previously made and tested psychometric scales.

Friday, 25 May 2018

Analyzing Jeopardy in R - Part 2

My previous Jeopardy analyzer was built using a base of about 30 daily Coryat scores. This one has more than 1600 scores that were either recorded directly, e-mailed to me, or scraped from the forum at jboard.tv. Here we look at the consistency of tournament effects for different at-home players, and some long-term trends.

Thursday, 17 May 2018

Reading Assignment - Grant Applications

The purpose of this reading assignment is to give you a sense of the sort of things that someone consulting a statistician will want to know.

Read Chapter 22, 'Writing the Data Analysis Plan', by A.T. Panter, in How to Write a Successful Research Grant Application: A Guide for Social and Behavioral Scientists (Pequegnat et al., Eds., 2nd ed.), and answer the questions that appear after the preliminary notes.
 

Tuesday, 15 May 2018

I read this: Chess Variants and Games for Intellectual Development and Amusement


Chess Variants and Games for Intellectual Development and Amusement by AV Murali is not a book about chess. If it's about anything, it's about geometry, puzzles, and education. It's like Hoyle's Book of Games, but instead of well-established games, it has speculative and creative modifications to chess.

Monday, 23 April 2018

Scenario Based Exam Questions for Intro Statistics

A scenario-based question is one that includes a brief description of a dataset or model, along with some information like a plot, a set of summary statistics, or computer output. From this information, the student needs to answer several questions about the data, such as 'what do these parameters represent?' or 'would the correlation be stronger or weaker without the outlier?'. It's my preferred way of asking questions on exams and assignments, as it more closely mimics the sort of problems that someone would actually encounter outside the classroom.

Below are four examples with post-mortem commentary, as well as download links to all 94 such questions I've made so far that are worth keeping and drawing from.


Saturday, 14 April 2018

Simultaneous Strategy 2: Kung Fu Chess

Kung Fu Chess, as found at https://www.kfchess.com/ , is a remake of a much older PC game by Shizmoo Games, and has recently been adapted for online play.


Kung Fu Chess has the initial setup and most of the same rules as the standard game of queen's chess that everyone is familiar with. The primary difference is that instead of one player moving one piece at a time, a player may move any of their pieces at any time, provided that the piece in question has not been moved in the last 10 seconds.

Simultaneous Strategy 1: Cosmic Blocks

In the chess-like game Cosmic Blocks, by Narcissa Wright (available through the Discord server at https://discordapp.com/invite/szpznUj ), players each start with a 1-square base on an 11-by-21 grid. That base spreads influence, represented by coloured shading of squares, to the 3-by-3 area surrounding the base. The goal is to spread this influence into the opposing base.


Saturday, 31 March 2018

Assignments for statistical literacy: Big Data in Healthcare, Data and the Law, Manual Writing

This semester, I've been trying a lot of new assignments to encourage reading and writing of statistical literature as part of a new class and in preparation for a course pack I am publishing soon. 

Here are two of the reading assignments and one of the writing exercises that I tried this semester: "Data and the Law", "Big Data in Healthcare", and an exercise on writing good statistical instructions.

All of the required reading is open access.

Wednesday, 14 February 2018

Freakazoid, a Repetition-Robust Symmetric Cipher.

This post describes the mathematical principles behind the Freakazoid cipher such that someone could recreate it with software if they wished. For a targeted guide to some of the terminology used here, please see my previous post discussing some fundamentals of encryption.

It is not meant to be a serious attempt at a cipher, and is presented as food for thought only. It was originally developed more than ten years ago, during a semester of undergraduate research in math.

Freakazoid is a cipher based on DES, an old cipher that was used as an industry standard for a time, but with the added step that a new key is generated for each block of data to be encrypted. The generation of these keys is governed by a master key. For clarity, we call the key applied to a particular block of data the block key.
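
To illustrate just the block-key idea in the abstract, here is a hypothetical sketch; it is NOT the actual Freakazoid construction and does not use DES. A toy XOR step stands in for the real block cipher, and each block key is derived from the master key and the block index with a hash.

import hashlib

BLOCK_SIZE = 8  # bytes per block, chosen arbitrarily for this sketch

def block_key(master_key: bytes, index: int) -> bytes:
    # Derive a fresh key for each block from the master key and block position.
    return hashlib.sha256(master_key + index.to_bytes(8, "big")).digest()[:BLOCK_SIZE]

def toy_encrypt(plaintext: bytes, master_key: bytes) -> bytes:
    out = bytearray()
    for i in range(0, len(plaintext), BLOCK_SIZE):
        block = plaintext[i:i + BLOCK_SIZE]
        key = block_key(master_key, i // BLOCK_SIZE)
        out.extend(b ^ k for b, k in zip(block, key))  # XOR stands in for DES
    return bytes(out)

def toy_decrypt(ciphertext: bytes, master_key: bytes) -> bytes:
    return toy_encrypt(ciphertext, master_key)  # XOR is its own inverse

message = b"presented as food for thought only"
ciphertext = toy_encrypt(message, b"hypothetical master key")
assert toy_decrypt(ciphertext, b"hypothetical master key") == message

The point of the per-block keys is that identical plaintext blocks no longer encrypt to identical ciphertext blocks at different positions, which is presumably the 'repetition-robust' property the title refers to.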

Tuesday, 13 February 2018

Some Timely Fundamentals of Encryption


This post is in response to some questions about the mechanics of encryption that have been getting more frequent as cryptocurrency and related technologies become more well-known.

Wednesday, 7 February 2018

Template for a technical report, with example rubric


A completed technical report might look like this:
1. Executive Summary
2. Introduction / Problem statement
3. Methods
4. Results
5. Conclusion / Discussion

 Executive Summary: The TL;DR

This is the LAST thing that you should write. This is the tl;dr of the technical report; “tl;dr” stands for “too long; didn’t read”. More formally, this is called the executive summary, which means: if this report were given to a major decision maker, who already has tons of things they need to know, what would you like them to take away from the report, reduced to 100 words or fewer?

Here you should state the research question as briefly as you can, give one main result, and name the main method used. Nothing from the discussion / conclusion section is needed here.


Introduction:

The introduction typically follows a set formula.

1. Describe the research problem or state the research questions that were posed. If you can, explain why this research problem is important. The explanation of importance doesn’t have to be too specific to the research problem. If you are working with data about a medical problem, mention that many people suffer from this medical problem; in a research paper, this is a good opportunity to cite a well-known related paper that has established the scope of the problem for you.

If you don’t know why a problem is important and a quick literature search won’t tell you, leave the problem’s importance to a co-author whose expertise is more suited to this part. It’s much better to admit you don’t know something than to say something wrong.

2. Describe each section of the paper or report very briefly. (e.g. “In the methods section, we describe the data cleaning and the regression tree method that we used. In the results section, we describe the goal scoring rate of different hockey players. In the discussion section, we follow up with a comparison of this method to an older, more traditional one.”)

Methods

The methods: What did you do to get these results?

If this were a field science, you would list the days and describe the conditions under which you went out into the field and gathered information (e.g. ‘we collected our samples on sunny days in the North Okanagan valley between June 10th and September 20th, 2015’). In data science, you would instead describe the dataset that you used, its format and size, and its key variables and features (e.g. ‘We gathered the data from NHL.com’s event-tracking database using the nhlscrapr package along with our own patch. The data we collected included each goal, shot, hit, penalty, and faceoff recorded in each regular season game from October 2012 to April 2017.’)

This is where the bulk of your writing should be. About 50% of your report will be the methods section. You don’t need to explain the entire data cleaning process, but you should mention where the data came from, and the tools / software that were used. It’s also good practice to mention when the data was taken (especially in the case of news reports which may be updated, altered, or archived such that scraping may produce different results later).

If there were any judgement calls in your data cleaning process, such as…
- what was done about extreme and influential cases,
- how problematic variables were used,
- how tuning parameters for complex methods were selected, and
- how missing values were either filled in or explained away,
…these should be included as well.

In short, you don’t have to give everything away, but an expert with the same software and data access should be able to recreate what you did.

A methods section serves two purposes: it makes your work reproducible, and it gives legitimacy to your results. If you show results without explaining how you got them, a reader might assume that the results were invented. With a methods section, the reader should be able to see a logical path between the data and the results.

After the data preparation is explained, describe the model you selected or the process you used to select the model. If you just did linear regression, say that. If you used a random forest, or the LASSO, or stepwise regression, say that instead.

Normally, you only need to include the final method that you decided upon. However, there is a good chance that the method you used wasn’t the only one you tried. In a research paper, you wouldn’t necessarily mention these ‘dead ends’ because paper length is limited by the journal. In a technical report (or a thesis), these other approaches are useful for justifying your choice and showing that alternatives were considered. You can explain why the rejected methods didn’t work, or what was unsatisfactory about the results they produced. Don’t overdo these dead-end explanations; typically, the reader is much more interested in what you did and what worked than in what didn’t.

Example: “After an exploratory analysis, we tried to classify events using random forests, dimension reduction, and neural nets. We decided to further pursue neural nets because they produce models with much lower out-of-sample errors than the other approaches.”

Results

It’s easiest to write the results first, even though they don’t appear first. Any tables or figures you want to show, make these as soon as the analysis work is done. Talk about your results a little. Explain the importance of any tables and figures; why are they there?

Mention the general trend (e.g. ‘there is a negative, non-linear trend between playing time per game and shots against goal’), and any notable observations (‘however, the New Jersey Devils break this trend’)

You don’t need to write much here. The charts should explain themselves.


Discussion  / Conclusion:

In a technical report, this is where you take the results and give them meaning in the context of the research questions that were in the introduction. You can also quickly summarize what you did.
In a journal paper or a thesis, this section might also include future research questions that could be answered with more data or by a different analysis. A technical report should be more self-contained, and allusions to further work are not required.

In every case, no new information about the project should be introduced in the conclusion. If you have an interesting finding, it should be in the results. If that interesting finding doesn’t fit with the rest of the results, a new subsection can always be made for it, but keep it out of the discussion section.

Remember, when giving context to the results, don’t reach beyond your expertise. If the data is genetic, and you are not a geneticist or biologist, do not draw conclusions about the importance of a gene. Statistical publications are often co-authored with subject experts; let those experts write about their topics, and stick to the data analysis.


Example Rubric: Total out of 100

Length (possible 10)
- 3-6 full pages: 10
- 7 pages: 8
- 8 pages: 6
- More than 8 pages: 0
- 2.5 pages: 8
- 2 pages: 4
- Less than 2 full pages: 0

Grammar (possible 10)
- Start with 10; for any obvious grammar or spelling mistake, reduce by 2 (minimum 0/10).

Executive Summary (possible 10)
- Name of main method included: 3
- Primary finding described: 7

Introduction (possible 10)
- Describes the research problem: 6
- Makes a case for its importance: 4

Methods (possible 30)
- Describes the data used: 5
- Describes the data preparation and/or data cleaning: 5
- Describes any decisions / judgement calls: 5
- Describes the method used: 10
- Justifies the choice of method: 5

Results (possible 15)
- Summarizes the data through text, a table, or a figure: 10
- Describes a general trend: 5

Conclusion (possible 15)
- Ties the results back to the research question: 10
- Does NOT include new information that would be better placed in the results or methods sections: 5

TOTAL: /100