A sort of introduction
Prior to the 2019 season, there was speculation that the Kansas City Royals, a year after losing over 100 games, would do something only 30 teams have done since 1920 and just once in the new millennium: steal 200 bases. While the linked article gave a range of 200-250, for the purposes of this article 200 is a sufficiently extreme aspiration in a historical context. Adding up the pre-2019 162-game stolen-base averages for the Royals’ Opening Day lineup last year gives 192 stolen bases, so 200 was certainly not out of the question. Had they accomplished this feat, which appeared within reach, they would have become one of the all-time great base-stealing teams in the process.
Anticlimactically, the 2019 Royals ended with just 117 stolen bases, still good for second in the majors, but not worth much of a look in a historical sense. Still, supposing they had reached 200 is instructive for seeing how the stolen base environment has changed over time. As of August 24, the Royals this year are on a 162-game pace of 168 stolen bases, much closer to 200 than they got last year but still not sufficiently extreme.
Even at a 100% success rate, a 200-stolen-base season would have required nearly double the average team’s number of stolen base attempts in 2019, which was the fewest attempts per team since 1962 and the (current) low point in a near-linear downward trend that began in the mid-1980s. Unless otherwise noted in this series, I am using the 2019 CSV version of the Lahman database as my initial dataset.
I have added ggplot2: Elegant Graphics for Data Analysis (2016) by Hadley Wickham to my list of books used as a resource for this weekly project. This will become apparent in the results section below, where for the R section my graphs are clearly made with ggplot2. I am hoping to find a similar resource for Python graphics, but I have not yet done so.
Methodology and code
A note on methodology
For the purposes of this article, I am assuming the Royals stole 200 bases instead of 117 at a 100% success rate, rather than extending their stolen bases attempted with last season’s success rate to get to 200 stolen bases. This is because I am not interested in stolen base attempts in this article; additionally, I will note where I substituted 200 for 117 in the dataset.
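As a minimal sketch of what that substitution might look like in Python: the DataFrame below is a toy stand-in for the Teams table, the "AAA" rows are made up, and "KCA" is assumed to be the Lahman teamID for the Royals. Only the 117 and 200 figures come from this article.

```python
import pandas as pd

# Toy stand-in for the Lahman Teams table (hypothetical rows except the 117).
teams = pd.DataFrame({
    "yearID": [2019, 2019, 2018],
    "teamID": ["KCA", "AAA", "AAA"],
    "SB": [117, 80, 90],
})

# Assumption: "KCA" identifies the Kansas City Royals in the teamID column.
is_royals_2019 = (teams["yearID"] == 2019) & (teams["teamID"] == "KCA")

# Work on a copy so the actual totals stay available for comparison.
teams_200 = teams.copy()
teams_200.loc[is_royals_2019, "SB"] = 200
```

Keeping both versions makes it easy to draw the "117 SB" and "200 SB" graphs from otherwise identical data.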
Getting R and Python on the same page
The usual ways of omitting missing values in R and Python produce different results:
teams_not_missing <- na.omit(teams)
in R gives a dataset of teams beginning in 1970, while
teams_not_missing = teams.dropna()
in Python gives a dataset of teams beginning in 1995. Looking at the Teams dataset in Excel shows that 1995 is the first year without a blank cell in any column, while 1970 is the first year with data for HBP and SF; the column for winning a Wild Card game still has blank cells until 1995. As I am only concerned with the team, year, stolen bases, and caught stealing, this problem is merely interesting and does not affect the analysis. I first checked in the Excel file where SB and CS stop having missing values, and then took a subset in R and Python, respectively, as follows:
teams_subset_1951 <- subset(teams, yearID >= 1951)[, c('yearID', 'teamID', 'SB', 'CS')]
teams_subset = teams[['yearID', 'teamID', 'SB', 'CS']]
teams_subset_1951 = teams_subset.loc[teams_subset.yearID >= 1951]
I imagine there is a way to subset the dataset in Python in one step, but I have yet to find out how.
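For reference, here is a minimal sketch of one possible one-step version, using .loc with a row condition and a column list together. The DataFrame is a toy stand-in for the Teams table with made-up values; the column names are the ones used above.

```python
import pandas as pd

# Toy stand-in for the Teams table; all values are made up for illustration.
teams = pd.DataFrame({
    "yearID": [1949, 1950, 1951, 1952],
    "teamID": ["AAA", "AAA", "AAA", "BBB"],
    "SB": [40, 45, 50, 55],
    "CS": [None, 20, 25, 30],
    "HBP": [None, None, None, None],
})

columns = ["yearID", "teamID", "SB", "CS"]

# Rows (the year filter) and columns are selected in a single .loc call.
# .copy() makes the result independent of `teams`, so adding columns later
# does not trigger a SettingWithCopyWarning.
teams_subset_1951 = teams.loc[teams["yearID"] >= 1951, columns].copy()

# Check that the two columns of interest have no missing values left.
has_missing = teams_subset_1951[["SB", "CS"]].isna().any().any()
```

The final check looks only at SB and CS, so blanks in unrelated columns such as HBP do not matter here.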
Now that they’re on the same page
With the same dataset in both R and Python, we can edit it further to add a column for stolen base attempts. In R:
teams_subset_1951$SB_Attempts <- with(teams_subset_1951, SB + CS)
and in Python:
teams_subset_1951 = teams_subset_1951.assign(SB_Attempts = teams_subset_1951['SB'] + teams_subset_1951['CS'])
While it is not relevant for the rest of the analysis, a column for stolen base attempts is helpful for viewing the overall stolen base trend in the dataset.
Next, let’s explore the trend in stolen bases by team from 1951-2019. First, in R, using ggplot2, the graph looks like this:
Graph 1: Stolen Bases by Team 1951-2019 in R
The code needed to obtain that graph in R is (after loading the ggplot2 package):
ggplot(teams_subset_1951, aes(yearID, SB)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Stolen Bases") + ggtitle("Stolen Bases by Team 1951-2019") + theme(plot.title = element_text(hjust = 0.5))
The theme portion of the code centers the title.
I was unable to figure out how to add the smooth curve to the Python graph, but other than that I was able to replicate the graph from R.
Graph 2: Stolen Bases by Team 1951-2019
The Python code needed to make this graph is (after loading matplotlib.pyplot):
mpl.scatter(teams_subset_1951.yearID, teams_subset_1951.SB)
mpl.title("Stolen Bases by Team 1951-2019")
mpl.xlabel("Year")
mpl.ylabel("Stolen Bases")
mpl.show()
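A trend line can be approximated in Python as well. The sketch below does not reproduce the smoother that geom_smooth() fits. As a simpler stand-in, it averages stolen bases by year and then takes a centered rolling mean across neighboring years. The data are made up, and the 3-year window is an arbitrary choice for illustration.

```python
import matplotlib.pyplot as mpl  # same alias as in the code above
import pandas as pd

# Toy stand-in for teams_subset_1951; values are made up for illustration.
teams_subset_1951 = pd.DataFrame({
    "yearID": [1951, 1951, 1952, 1952, 1953, 1953, 1954, 1954],
    "SB": [40, 60, 50, 70, 60, 80, 70, 90],
})

# Average stolen bases per team for each year, then a centered rolling mean
# over neighboring years to smooth out single-season noise.
yearly_mean = teams_subset_1951.groupby("yearID")["SB"].mean()
trend = yearly_mean.rolling(window=3, center=True, min_periods=1).mean()

fig, ax = mpl.subplots()
ax.scatter(teams_subset_1951["yearID"], teams_subset_1951["SB"])
ax.plot(trend.index, trend.values, color="red", label="3-year rolling mean")
ax.set_title("Stolen Bases by Team 1951-2019")
ax.set_xlabel("Year")
ax.set_ylabel("Stolen Bases")
ax.legend()

if __name__ == "__main__":
    mpl.show()
```

With the full dataset, a wider window (for example 5 or 7 years) would give a smoother line at the cost of flattening short-lived swings.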
Next, to further appreciate how extreme the Royals’ achievement would have been in 2019, we develop a stolen bases plus (SB+) statistic that expresses a team’s stolen bases relative to the league average for that year, with 100 being average. For the first graph, the Royals’ actual number of stolen bases, 117, is used. The substitution only affects the 2019 results, and the projected number is not needed until the second graph. Additionally, teams are identified by color in the graphs to better track how the Royals change between scenarios. As there are 45 separate team IDs in this dataset, picking out a specific team across the graph may be difficult, but the Royals’ change will be apparent.
Graph 3: SB+ for 1951-2019 (Kansas City Royals Having 117 SB)
Graph 4: SB+ for 1951-2019 (Kansas City Royals Having 200 SB)
As can be seen in the right-most column of the second graph, an entry corresponding to the Kansas City Royals appears above the 250 SB+ mark. Between the two graphs, the Royals went from simply an above-average base-stealing team to one of the four greatest base-stealing teams in the sample, with an SB+ of nearly 254, or 154% better than league average. By this metric, the best base-stealing team in the sample was the 1962 Los Angeles Dodgers, who stole only (“only”) 198 bases, but in a year in which the average team stole just over 67. Unlike the Royals last year, the Dodgers were one of the best teams in baseball, finishing one game behind the San Francisco Giants for the best record in the National League. The Dodgers were led by Maury Wills, who stole 104 bases (Wills as an individual would have had a 154 SB+) and won the NL MVP award.
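The figures above are consistent with SB+ being 100 times a team's stolen bases divided by the league-average team total for that year. That formula is an assumption inferred from the numbers quoted here, not something spelled out elsewhere in the article. A small sketch of the arithmetic, using the rounded 1962 average of 67:

```python
def sb_plus(stolen_bases, league_average):
    """Stolen bases as a percentage of the league-average team total (100 = average)."""
    return 100 * stolen_bases / league_average

# The 1962 average is given only as "just over 67", so 67 is a rounded input
# and these results come out slightly higher than the exact values.
dodgers_1962 = sb_plus(198, 67)   # roughly 295.5
wills_1962 = sb_plus(104, 67)     # roughly 155.2, versus the 154 quoted above

# Working backwards: a 254 SB+ on 200 stolen bases implies a 2019 league
# average of a little under 79 stolen bases per team.
implied_avg_2019 = 100 * 200 / 254
```

Note that the league average in the 200 SB scenario includes the Royals' own 200, so it is slightly higher than the average computed from the actual 2019 totals.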
The R code used for the previous two graphs is:
ggplot(teams_subset_1951, aes(yearID, SB_Plus, color = factor(teamID))) + geom_point() + xlab("Year") + ylab("SB+") + ggtitle("SB+ for 1951-2019 (Kansas City Royals Having 117 SB)") + theme(plot.title = element_text(hjust = 0.5))
with "200 SB" substituted for "117 SB" in the title for the second graph.
I was unable to replicate these results in Python because I could not merge the original data with the average stolen bases by year, which I computed using:
teams_subset_1951_SB['Avg_SB'] = teams_subset_1951_SB.groupby(by='yearID').mean()
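One possible way around the merge problem is groupby(...).transform("mean"), which returns one value per original row instead of one row per year, so the result can be assigned straight to a new column. The sketch below uses made-up data and hypothetical team IDs, and assumes SB+ is 100 times SB divided by the yearly average. An explicit merge is shown as well for comparison.

```python
import pandas as pd

# Toy stand-in for teams_subset_1951; values are made up for illustration.
teams_subset_1951 = pd.DataFrame({
    "yearID": [2018, 2018, 2019, 2019],
    "teamID": ["AAA", "BBB", "AAA", "BBB"],
    "SB": [60, 100, 50, 150],
})

# transform("mean") broadcasts each year's average back onto every team-row
# of that year, so no separate merge step is needed.
teams_subset_1951["Avg_SB"] = (
    teams_subset_1951.groupby("yearID")["SB"].transform("mean")
)
teams_subset_1951["SB_Plus"] = (
    100 * teams_subset_1951["SB"] / teams_subset_1951["Avg_SB"]
)

# Equivalent two-step route: one row per year, then a left merge on yearID.
yearly_avg = (
    teams_subset_1951.groupby("yearID", as_index=False)["SB"]
    .mean()
    .rename(columns={"SB": "Avg_SB_merged"})
)
merged = teams_subset_1951.merge(yearly_avg, on="yearID", how="left")
```

Both routes give the same per-row averages. The transform version is shorter, while the merge version keeps the yearly table around for inspection.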
From there, I am not yet sure how to create a graph similar to the ones I made using ggplot2 in R. This was still an informative week from the Python perspective, as I learned a lot and now know what to learn for the upcoming articles. From an R perspective, the ggplot2 book has been very helpful.
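For reference, a graph colored by team can be approximated in matplotlib by drawing one scatter series per teamID. This is a minimal sketch with made-up data and hypothetical team IDs. It assumes an SB_Plus column already exists, and it relies on matplotlib's default color cycle, which repeats after ten series, so with 45 team IDs an explicit colormap would be needed.

```python
import matplotlib.pyplot as mpl  # same alias as in the code above
import pandas as pd

# Toy stand-in with SB_Plus already computed; values are made up.
teams_subset_1951 = pd.DataFrame({
    "yearID": [2018, 2018, 2019, 2019],
    "teamID": ["AAA", "BBB", "AAA", "BBB"],
    "SB_Plus": [75.0, 125.0, 50.0, 150.0],
})

fig, ax = mpl.subplots()

# One scatter call per team gives each team its own color and legend entry,
# similar in spirit to color = factor(teamID) in the ggplot2 code.
for team_id, team_rows in teams_subset_1951.groupby("teamID"):
    ax.scatter(team_rows["yearID"], team_rows["SB_Plus"], label=team_id)

ax.set_title("SB+ for 1951-2019")
ax.set_xlabel("Year")
ax.set_ylabel("SB+")
ax.legend(title="teamID")

if __name__ == "__main__":
    mpl.show()
```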