The pitchRx package is an R package designed to work with
a Major League Baseball dataset called PITCHf/x. As with nhlscrapr, there are
other means of accessing this dataset, but my preference is usually towards R
integration.
It gives you detailed pitch-by-pitch information
about every MLB game, including the speed and location of the ball as it
crossed home plate.
Getting started with pitchRx is pretty quick:
The scrape() function in pitchRx allows you to
scrape data from every game played during the given range of days. Days
are in the YYYY-MM-DD format, which is used because...
1) It's the same format that SQL uses.
2) It's the ISO standard (ISO 8601).
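A practical side effect of this format can be seen in a short base-R sketch: the strings parse with as.Date() without a format argument, and sorting them as plain text already gives chronological order. The dates and variable names below are made up for illustration.

```r
# ISO-style date strings parse with as.Date() without a format argument...
days <- c("2013-06-10", "2013-06-01", "2013-06-02")
parsed <- as.Date(days)

# ...and sorting them as plain text gives chronological order.
sorted_text <- sort(days)
print(sorted_text)

# Date arithmetic makes it easy to build the end of a range, e.g. one week.
first_day <- as.Date("2013-06-01")
end_str <- format(first_day + 6, "%Y-%m-%d")
print(end_str)
```

Strings built this way could be handed to the start and end arguments of scrape().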
library(pitchRx)
dat = scrape(start = "2013-06-01", end = "2013-06-01")
The dataset 'dat' that comes out of this is a
collection of five tables.
names(dat)
"atbat" "action" "pitch" "po" "runner"
'atbat': Describes the outcome of each at-bat. One
row = one at-bat.
'action': Other events not related to at-bats, such
as pitching changes, coaching visits to the mound, and managers getting ejected
from the game.
'pitch' : Pitch-by-pitch description. One row = one
pitch. Has lots of physics variables relating to each pitch, but lacks the text
descriptions that accompany at-bats.
'po' : Pickoff attempt descriptions.
'runner': Description of where each runner ended
up. Most of the rows correspond to at-bats. The other rows represent running
events like advancing on someone else's hit, or being forced out.
We will focus on the pitch-by-pitch table.
The list of variables is... intimidating.
Thankfully, a lot of these are the same across all
the tables.
des, des_es: The text description of the pitch in English or Spanish, respectively. Examples include "Ball", "Foul", "Strike", and "In play, run(s)".
num: The number of the at-bat for this game. Also used in the at-bat data frame.
count: The ball-strike count before the pitch occurred.
start_speed, end_speed: The speed of the ball, in miles per hour, when it leaves the pitcher's hand, and when it reaches home plate, respectively.
px: The horizontal position at which the ball crosses the home-plate plane. Measured in feet left or right of the center of home plate, from the perspective of the catcher.
pz: The vertical position of the ball crossing the home-plate plane. Measured in feet above the ground.
nasty: The 'nasty factor', which is a function of physical variables that is supposed to describe how difficult a pitch is to hit.
spin_rate: The (mean?) rate at which the baseball was spinning, in revolutions per minute (RPM). Yes, some pitchers really do spin the ball at 2700 RPM!
zone (unconfirmed): The portion of the strike zone (or outside it) through which a pitch crossed the plate.
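To make px and pz a little more concrete, here is a rough sketch that flags whether a pitch crossed inside a rectangular zone. The toy values are invented, and the zone boundaries (0.83 ft either side of center, 1.5 to 3.5 ft high) are assumptions picked for illustration, not the official zone, which depends on the batter.

```r
# Toy pitch data; the values are invented for illustration only.
toy_pitch <- data.frame(
  px = c(0.10, -1.20, 0.60, 0.00),
  pz = c(2.50,  2.00, 4.10, 1.00)
)

# Assumed rectangular zone (NOT the official per-batter zone).
half_width  <- 0.83   # feet left or right of the center of the plate
zone_bottom <- 1.5    # feet above the ground
zone_top    <- 3.5

# A pitch is "in the zone" if it is inside both the horizontal and vertical limits.
toy_pitch$in_zone <- abs(toy_pitch$px) <= half_width &
  toy_pitch$pz >= zone_bottom & toy_pitch$pz <= zone_top
print(toy_pitch)
```

With real data, the same comparison could be run on dat$pitch$px and dat$pitch$pz.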
Example analysis: Pitch count
One big issue in baseball is pitch count. As a
pitcher, especially a starter, throws many pitches, they tire and their
performance supposedly gets worse.
Is this true? Let's plot some variables against
pitch count.
First, let's isolate one team of one game.
atbat_1game = subset(dat$atbat, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")
pitch_1game = subset(dat$pitch, inning_side == "top" & gameday_link == "gid_2013_06_01_arimlb_chnmlb_1")
Next, we have to identify the pitcher that throws
each pitch. We have to get this information from the at-bat table.
pitcher = rep(NA,nrow(pitch_1game))
for(k in 1:nrow(pitch_1game))
{
  # Find the at-bat this pitch belongs to, then copy that at-bat's pitcher
  thisnum = pitch_1game$num[k]
  pitcher[k] = atbat_1game$pitcher[which(atbat_1game$num == thisnum)]
}
pitch_1game$pitcher = pitcher
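The same lookup can be sketched without a loop using base R's match(). The two toy data frames below stand in for atbat_1game and pitch_1game, and their values are invented.

```r
# Toy stand-ins for atbat_1game and pitch_1game; values are invented.
toy_atbat <- data.frame(num = c(1, 2, 3),
                        pitcher = c(501, 501, 777))
toy_pitch <- data.frame(num = c(1, 1, 2, 3, 3, 3))

# match() gives, for each pitch, the row of the at-bat with the same num.
rows <- match(toy_pitch$num, toy_atbat$num)
toy_pitch$pitcher <- toy_atbat$pitcher[rows]
print(toy_pitch)
```

One difference from the loop: a pitch whose num has no matching at-bat gets NA here instead of stopping with an error.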
Now that we know the pitcher that threw each pitch,
we can find the pitch count. This R script first ensures that event_num is treated like a number and not a string. This is important because we will use event_num to put the game's pitches in chronological order.
pitch_1game$event_num = as.numeric(pitch_1game$event_num)
pitch_1game = pitch_1game[order(pitch_1game$event_num),]
This R script makes a new variable for pitch count. For a given pitcher, it marks the pitches as 1, 2, ... up to the number of pitches thrown. It does this separately for each pitcher, and when it's done, it puts that new variable into the 1-game data frame.
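pitchcount = rep(NA,nrow(pitch_1game))
for(thispitcher in unique(pitch_1game$pitcher))
{
  # Use the pitcher column of the sorted data frame; the stand-alone
  # 'pitcher' vector is still in the row order from before the sort
  idx = which(pitch_1game$pitcher == thispitcher)
  pitchcount[idx] = 1:length(idx)
}
pitch_1game$pitchcount = pitchcount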
plot(pitch_1game$end_speed ~ pitch_1game$pitchcount)
plot(pitch_1game$nasty ~ pitch_1game$pitchcount)
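As an alternative to the pitch-count loop, base R's ave() can number the pitches within each pitcher in one call. This is only a sketch on an invented toy table; the important part is that the rows are sorted by event_num before counting.

```r
# Toy stand-in for pitch_1game; values are invented and deliberately out of order.
toy_pitch <- data.frame(
  event_num = c(9, 3, 5, 14, 12),
  pitcher   = c("A", "A", "A", "B", "B")
)

# Put the pitches in chronological order first.
toy_pitch <- toy_pitch[order(toy_pitch$event_num), ]

# ave() applies seq_along() separately within each pitcher's rows.
toy_pitch$pitchcount <- ave(seq_len(nrow(toy_pitch)),
                            toy_pitch$pitcher,
                            FUN = seq_along)
print(toy_pitch)
```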
These tables are linked by some identifying
variables.
gameday_link,
example: gid_2013_06_01_wasmlb_atlmlb_1
This is found in all five tables. It identifies the
game as...
...happening on 2013-06-01,
...with Washington as the visiting team,
...and Atlanta as the home team,
...and was the first game between these teams that
day
(In the case of two games in a day, the gameday
link will end in _2 instead of _1 )
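Since the link packs several facts into one string, it can be split back into pieces. A minimal sketch, assuming every link follows the same gid_YYYY_MM_DD_awaymlb_homemlb_N pattern as the example above; the variable names are my own.

```r
link <- "gid_2013_06_01_wasmlb_atlmlb_1"

# Split on underscores: "gid" "2013" "06" "01" "wasmlb" "atlmlb" "1"
parts <- strsplit(link, "_")[[1]]

game_date   <- as.Date(paste(parts[2:4], collapse = "-"))
away_team   <- sub("mlb$", "", parts[5])   # drop the trailing "mlb"
home_team   <- sub("mlb$", "", parts[6])
game_number <- as.integer(parts[7])

print(list(date = game_date, away = away_team,
           home = home_team, game = game_number))
```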
event_num,
Every event in a game has a number reflecting its chronological order. The first
recorded pitch has an event_num of 3.
After that, every pitch, pickoff attempt, running event, and entry in the 'action'
table is given its own event_num.
Since pitchRx is based in SQL, the order of the rows that get scraped isn't
guaranteed. The event_num variable is very useful if row order matters to you.
Remember to save your work!
The data you scrape from PITCHf/x is NOT
automatically saved to a file, as it is with nhlscrapr.
It's probably worth the extra effort to save the
tables as separate .csv files.
write.csv(dat$atbat, "At Bat Data 2013-06-01.csv")
...
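Rather than writing one write.csv() line per table, a loop over the table names can do it. This is a sketch on a toy list standing in for dat; the file-name pattern and the use of tempdir() are placeholders to swap for your own.

```r
# Toy list standing in for 'dat'; the real one has five tables.
toy_dat <- list(
  atbat = data.frame(num = 1:2),
  pitch = data.frame(num = c(1, 1, 2))
)

out_dir <- tempdir()   # placeholder; use your own folder

# Write each table to its own .csv file, named after the table.
for (tab in names(toy_dat)) {
  out_file <- file.path(out_dir, paste0(tab, "_2013-06-01.csv"))
  write.csv(toy_dat[[tab]], out_file, row.names = FALSE)
}

print(list.files(out_dir, pattern = "_2013-06-01\\.csv$"))
```

Each file can be read back later with read.csv().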