Sunday, 1 February 2015

R Package Spotlight - nhlscrapr

NOTE: AN UPDATED INTRODUCTION TO NHLSCRAPR HAS BEEN POSTED HERE.

nhlscrapr is a package to acquire and use play-by-play information from National Hockey League games. It's similar in function to pitchRx for Major League Baseball. Unlike pitchRx, it doesn't allow SQL queries to be sent to an open database; instead, it is a system to collect raw data from www.nhl.com and format it into an R data frame.

The focus of this package spotlight will be on acquiring the game data rather than using it. 



The first thing we need is a list of the games available for scraping. A function in nhlscrapr returns a table of gameIDs.

fgd = full.game.database()

dim(fgd)
[1] 15510    13


names(fgd)
 "season"     "session"    "gamenumber" "gcode"      "status"
 "awayteam"   "hometeam"   "awayscore"  "homescore"  "date"
 "game.start" "game.end"   "periods"  

The full games data frame, which we'll store as 'fgd', has 15,510 rows by default, one per game.
 
table(fgd$season, fgd$session)

           Playoffs Regular

  20022003      105    1230
  20032004      105    1230
  20052006      105    1230

  ...

  20122013      105     720
  20132014      105    1230
  20142015      105    1230


The table covers every regular season and playoff game of the 12 seasons from 2002-3 to 2014-5. The crosstab shows that the 2012-3 season was shortened and that the 2004-5 season was missed entirely. It also shows that the number of playoff games appears fixed: every potential playoff game has an assigned code, whether or not it was played.
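As a quick sanity check, the crosstab can be reconciled with the row count from dim(fgd). The sketch below assumes that the seasons elided by the '...' in the table all have 105 playoff slots and 1230 regular season games, which is my assumption and not something the output shows. The 105 itself is consistent with 15 playoff series of up to 7 games each.

```r
## Reconcile the crosstab with dim(fgd).
## Assumption: the elided seasons match the full seasons that are shown.
n_seasons     = 12      ## 2002-3 to 2014-5, without 2004-5
playoff_slots = 15 * 7  ## 8 + 4 + 2 + 1 series, up to 7 games each
full_regular  = 1230    ## regular season games in a full season
short_regular = 720     ## regular season games in 2012-3

total_games = n_seasons * playoff_slots +
              (n_seasons - 1) * full_regular +
              short_regular
total_games
```

Under that assumption the total comes to 15510, the same as the number of rows in fgd.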

fgd[c(1,2,1000,1001,15509,15510),1:5]
        season  session gamenumber gcode status
1     20022003  Regular       0001 20001      0
2     20022003  Regular       0002 20002      0
1000  20022003  Regular       1000 21000      1
1001  20022003  Regular       1001 21001      1
15509 20142015 Playoffs       0416 30416      1
15510 20142015 Playoffs       0417 30417      1

The other three variables of consequence are gamenumber, gcode, and status. The variable 'status' is marked 0 for games that are confirmed to be unscrapable. Most unscrapable games are from the early part of the 2002-3 season, when this database was being established, and from playoff games that never happened. There are 30 other regular season games that are lost for reasons I don't know.
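Here is a minimal sketch of how the status variable might be used to drop unscrapable games before downloading. The data frame fgd_toy is a hypothetical stand-in built from the six rows printed above, so the example runs without nhlscrapr. I only know that 0 means unscrapable, so the filter keeps everything that isn't 0 rather than assuming 1 is the only other value.

```r
## Toy stand-in for fgd, copied from the six rows printed above.
fgd_toy = data.frame(
    season  = c("20022003", "20022003", "20022003", "20022003",
                "20142015", "20142015"),
    session = c("Regular", "Regular", "Regular", "Regular",
                "Playoffs", "Playoffs"),
    gcode   = c("20001", "20002", "21000", "21001", "30416", "30417"),
    status  = c(0, 0, 1, 1, 1, 1),
    stringsAsFactors = FALSE
)

## Keep only the games that are not flagged as unscrapable
scrapable = subset(fgd_toy, status != 0)
nrow(scrapable)

## Count the unscrapable games by season and session
unscrapable_counts = with(subset(fgd_toy, status == 0),
                          table(season, session))
unscrapable_counts
```

The same two subset() calls should work on the real fgd, since they only use the season, session, and status columns.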

gamenumber is the unique-within-session identifier for a game, from 0001 to 1230 for regular season games. The gamenumber for playoff games is encoded as 0[round][series][game], so a gamenumber of 0315 represents Game 5 of the 3rd round of the playoffs for one conference, and 0325 would be Game 5 for the other conference.

gcode is [session][gamenumber]. A 2xxxx gcode is a regular season game, and a 3xxxx gcode is a playoff game. Preseason games would have a gcode of 1xxxx, but they aren't included in the data from this package. The remaining eight variables, from awayteam to periods, appear to have little or no use at the moment. Finally, the game table comes from a function rather than a data() command because there will be more data after the 2014-15 season, which is not included when the parameter extra.seasons is left at its default of 0.
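To make the encoding concrete, here is a small sketch that splits a gcode into its parts. The function decode_gcode() is a hypothetical helper of my own, not part of nhlscrapr, and it simply follows the [session][gamenumber] and 0[round][series][game] layouts described above.

```r
## Hypothetical helper (not part of nhlscrapr): split a gcode into its parts.
decode_gcode = function(gcode)
{
    gcode = as.character(gcode)
    session_digit = substr(gcode, 1, 1)   ## first digit is the session
    gamenumber    = substr(gcode, 2, 5)   ## remaining four digits

    session_names = c("1" = "Preseason", "2" = "Regular", "3" = "Playoffs")
    is_playoff    = session_digit == "3"

    ## round, series and game only have meaning for playoff games
    data.frame(
        gcode      = gcode,
        session    = unname(session_names[session_digit]),
        gamenumber = gamenumber,
        round      = ifelse(is_playoff, as.integer(substr(gamenumber, 2, 2)), NA),
        series     = ifelse(is_playoff, as.integer(substr(gamenumber, 3, 3)), NA),
        game       = ifelse(is_playoff, as.integer(substr(gamenumber, 4, 4)), NA),
        stringsAsFactors = FALSE
    )
}

decoded = decode_gcode(c("20001", "21000", "30315", "30325", "30417"))
decoded
```

For example, 30315 and 30325 both decode to round 3, game 5, and differ only in the series digit.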

## Doesn't work yet, but would also include 2015-6 and 2016-7
test = full.game.database(extra.seasons = 2) 

The dataframe fgd is just the start - it is a list of IDs used for scraping. The following script uses fgd to download and compile a much larger data frame of every recorded play, including shots, hits, goals, and penalties. Explanation below.

setwd("C:\\Set\\This\\First")
yearlist = unique(fgd$season) 


for(thisyear in yearlist) ## For each season...
{
    ## Get the game-by-game data for that season
    game_ids = subset(fgd, season == thisyear)

    ## Download those games, waiting 2 sec between games
    dummy = download.games(games = game_ids, wait = 2)

    ## Processing, unpacking and formatting this season's games
    process.games(games = game_ids, override.download = FALSE)

    gc() ## Clear up the RAM using (g)arbage (c)ollection.
}


## Put all the processed games into a single file
compile.all.games(output.file="NHL play-by-play.RData") 

Any games you download will be saved in the subdirectories 'nhlr-data' and 'source-data' of the working directory, which you can set from a dropdown menu or with setwd(). The names of these subdirectories can be changed with parameters in download.games(), process.games(), and compile.all.games().

The download.games() function will download any games from www.nhl.com listed in the database of the same format as fgd. Raw game data is placed in the nhlr data folder.

Rather than use fgd directly, we subset it by season because the downloading process has a memory leak that needs to be addressed by calling gc() occasionally. If you try to download all the games without stopping to perform garbage collection, R will eventually run out of memory and crash. 2 GB of RAM should be more than enough to handle one season at a time.

The 'wait' parameter defines the number of seconds to wait between single game downloads. The default is 20 seconds. I presume this is a courtesy to the NHL or a way to avoid scraping detection, but it could also have something to do with slow download speeds.
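To get a feel for what the wait setting costs, here is a rough back-of-envelope calculation for one full regular season of 1230 games. It counts only the pauses between games and ignores the download time itself, so treat it as a lower bound.

```r
## Time spent waiting between downloads for one full regular season.
## This ignores the time taken by the downloads themselves.
games_per_season = 1230
wait_seconds = c(default = 20, this_script = 2)

wait_hours = games_per_season * wait_seconds / 3600
round(wait_hours, 2)
```

That works out to roughly 6.8 hours of waiting per season at the default, against about 41 minutes with wait = 2.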

Exactly what the process.games() function does is conjecture on my part, but it is a necessary step. Processed game files are also saved in the nhlr data folder. Compilations from the compile.all.games() function are put in the source data folder. If you interrupt the download process, whatever games you have managed to download and process will be compiled.

Once you have some games compiled, you can load them into R and see the recorded play-by-plays. The output below is from a single season (2006-7). There are about 375 events per game in this season, more than one every ten seconds. This should be everything you need to explore the data yourself. Later, I'll be using this dataset to measure the GDA-PK and GDA-PP metrics proposed in a previous post.

temp = load("source-data\\NHL play-by-play.RData")
nhl_all = get(temp)

length(unique(nhl_all$gcode))
1310

dim(nhl_all)
[1] 494619     44

 names(nhl_all)
 [1] "season"              "gcode"         "refdate"    
 [4] "event"               "period"        "seconds"     
 [7] "etype"               "a1"            "a2"          
[10] "a3"                  "a4"            "a5"          
[13] "a6"                  "h1"            "h2"        
[16] "h3"                  "h4"            "h5"      
[19] "h6"                  "ev.team"       "ev.player.1"  
[22] "ev.player.2"         "ev.player.3"   "distance"   
[25] "type"                "homezone"      "xcoord"  
[28] "ycoord"              "awayteam"      "hometeam"   
[31] "home.score"          "away.score"    "event.length"  
[34] "away.G"              "home.G"        "home.skaters"    
[37] "away.skaters"        "adjusted.dist" "shot.prob.dist 
[40] "prob.goal.if.ongoal" "loc.section"   "new.loc.sectio  
[43] "newxc"               "newyc"              
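The events-per-game figure can be checked directly from the output above. This sketch uses only the printed row count and number of unique gcodes, and assumes 60 minutes of play per game, which understates the length of any game that went to overtime.

```r
## Events per game and seconds per event, from the output above.
n_events = 494619   ## rows in nhl_all, from dim(nhl_all)
n_games  = 1310     ## length(unique(nhl_all$gcode))

events_per_game   = n_events / n_games
seconds_per_event = 60 * 60 / events_per_game   ## assuming 60 minutes per game

round(events_per_game)
round(seconds_per_event, 1)
```

With nhl_all loaded, table(nhl_all$gcode) would give the event count for each individual game instead of the average.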

 
