A sort of introduction
As of now, the goal is to publish a weekly (or possibly more frequent) post exploring an interesting statistic or trend and to provide the methodology behind it along with the Python and/or R code used. The hope is that this will be more “and” than “or,” as this is as much about my learning Python by example as it is about refining and expanding my R skills.
My two primary resources at the moment for learning Python are Python for SAS Users: A SAS-Oriented Introduction to Python (2019) by Randy Betancourt and Sarah Chen, and Data Science Using Python and R (2019) by Chantal D. Larose and Daniel T. Larose. Obviously, any question yet to be answered by where I am in those books at the moment has likely been addressed in some form elsewhere online.
That a 60-game season will lead to unexpected and outlier-ish performances is clear; Rob Mains at Baseball Prospectus (where I also write) has been exploring some of these trends on a weekly basis.
A trend that I’ve noticed that hasn’t been discussed, likely due in some part to it happening on a yearly basis, is the presence of players whose on-base percentage is greater than their slugging percentage. This alone, however, is not necessarily interesting, as intuition would suggest that players fitting this criterion would post a below-average season. With Carlos Santana and Rhys Hoskins having above-average seasons by Baseball-Reference’s OPS+, Baseball Prospectus’ DRC+, and FanGraphs’ wRC+ while having OBPs greater than their slugging percentages, I decided to see how many of these seasons have occurred since 1955 (why this year was chosen will be discussed in the methodology section below) by my own OPS+ measure. Unlike Baseball-Reference’s, my measure is not park-adjusted, which is a flaw, and it leads to both Santana and Hoskins not qualifying for their 2020 seasons. Call these hyper-Moneyball seasons: seasons in which a player’s value comes not from both OBP and SLG but mostly from OBP, the one of the two that seems to attract the most attention in discussions of Moneyball’s impact.
This is an extremely rare occurrence: in my dataset from 1955-2020 of qualifying players, which has 7975 players, just 28 met the criteria of having their OBP/SLG greater than 1 and their OPS+ greater than 100 (with 100 indicating league-average and 110, for example, indicating 10 percent better than league-average).
The two graphs below show the plots of each qualifying player’s (OPS+, OBP/SLG) coordinate from 1955-2020 (Table 1) and of the 28 players mentioned above (Table 2). The highest OPS+, OBP, SLG, and OPS, along with the lowest OBP/SLG, of the aforementioned 28 belong to Hall of Fame outfielder Richie Ashburn’s 1955 season, which can be found in the bottom right corner of Table 2. That year, Ashburn won the batting and OBP titles, good for a 142 OPS+ from Baseball-Reference (evidence that my methodology deflates OPS+ by considering only players qualifying for the batting title, which is also why the above-average players in 2020 do not appear in my dataset). Ashburn, using the terminology of this post, can be said to have had a hyper-Moneyball career: a .396 OBP, .382 SLG, and 111 Baseball-Reference OPS+.
Table 1: OPS Plus and OBP/SLG Ratio from 1955-2020
Table 2: Players with an OPS+ > 100 and OBP/SLG Ratio > 1 from 1955-2020
Methodology and code
For this post, all code will be from my analysis in R. This is to begin simply, to understand the process behind the coding as well as the coding itself. Subsequent posts will hopefully begin to incorporate Python.
Not much statistical analysis went into this post relative to the data cleaning that took place. Most importantly, the OPS+ statistic was constructed by:
1) Finding the OPS of each player by adding their OBP and SLG
2) Finding the average OPS by season through the following code:
batting_1955to2020$AvgOPS <- ave(batting_1955to2020$OPS, batting_1955to2020$yearID, FUN=mean)
3) Finally, finding the OPS+ for each player through the following code:
batting_1955to2020$OPSPlus <- with(batting_1955to2020, OPS / AvgOPS * 100)
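To see the three steps on a small scale, here is a minimal sketch on a made-up data frame. The name `toy`, the player IDs, and every OBP and SLG value are invented for illustration and do not come from the dataset. The last step also flags the kind of season this post is about: an OBP/SLG above 1 together with an OPS+ above 100.

```r
# Toy data: two seasons with three hypothetical player-seasons each (values invented)
toy <- data.frame(
  playerID = c("a", "b", "c", "d", "e", "f"),
  yearID   = c(1955, 1955, 1955, 2020, 2020, 2020),
  OBP      = c(0.440, 0.300, 0.340, 0.350, 0.380, 0.320),
  SLG      = c(0.420, 0.380, 0.460, 0.500, 0.370, 0.450),
  stringsAsFactors = FALSE
)

# Step 1: OPS is OBP plus SLG
toy$OPS <- toy$OBP + toy$SLG

# Step 2: average OPS within each season, repeated on every row of that season
toy$AvgOPS <- ave(toy$OPS, toy$yearID, FUN = mean)

# Step 3: OPS+ rescales each OPS so that the season average is 100
toy$OPSPlus <- with(toy, OPS / AvgOPS * 100)

# Flag seasons with OBP/SLG above 1 and an above-average OPS+
toy$hyper <- with(toy, OBP / SLG > 1 & OPSPlus > 100)

print(toy[, c("playerID", "yearID", "OPS", "OPSPlus", "hyper")])
```

By construction, the OPS+ values within a season average 100. Player "e" in the toy data shows why both conditions matter: the OBP exceeds the SLG, but the OPS+ falls below 100, so the season is not flagged.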
Likely more useful from a macro standpoint is how to arrive at a list of players qualifying for the batting title by season using the Lahman Database as the initial dataset.
Batting title qualifiers must have more plate appearances than 3.1 times the average number of games played in their league. The dataset begins in 1955, as it is the first year without only missing values after the construction of OPS+. Major League Baseball has played a 162-game schedule in both leagues since 1962, and played a 154-game schedule in both leagues from 1955-1960. In 1961, the American League played 162 games while the National League played 154 games. However, two shortened seasons were played between 1962 and 2019: 1981, where both leagues averaged 107 games, and 1994, where the American League averaged 114 games and the National League averaged 115 games. The 2020 data from Baseball-Reference was already sorted by qualifiers.
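The plate appearance cutoffs follow directly from this rule. A short sketch of the arithmetic, where `schedule` and its column names are hypothetical labels chosen for illustration and the game counts are the ones given above:

```r
# Games played by era and league, as described in the text
schedule <- data.frame(
  era   = c("1955-60", "1961 NL", "1961 AL", "1962-2019 (full seasons)",
            "1981", "1994 AL", "1994 NL"),
  games = c(154, 154, 162, 162, 107, 114, 115),
  stringsAsFactors = FALSE
)

# 3.1 plate appearances per game; floor() gives the integer that PA must exceed
schedule$cutoff <- floor(3.1 * schedule$games)

print(schedule)
```

Because none of these products is a whole number, requiring PA to be greater than the rounded-down value is the same as requiring PA to be greater than 3.1 times the games played.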
Hence, to compile the list of batting title qualifiers between 1955 and 2020, the following R code can be used:
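# Cutoffs are 3.1 times the games played, rounded down: 154 games -> 477, 162 games -> 502
batting55to60_qual <- subset(batting55to60, PA > 477)
batting61NL_qual <- subset(batting61NL, PA > 477)
batting61AL_qual <- subset(batting61AL, PA > 502)
# Full 1962-2019 subset; the combined list below uses the narrower ranges instead,
# so that 1981 and 1994 get their own cutoffs
batting62to19_qual <- subset(batting62to19, PA > 502)
batting62to80_qual <- subset(batting62to80, PA > 502)
# 1981: 107 games -> 331
batting81_qual <- subset(batting81, PA > 331)
batting82to93_qual <- subset(batting82to93, PA > 502)
# 1994: 115 games in the NL -> 356, 114 games in the AL -> 353
batting94NL_qual <- subset(batting94NL, PA > 356)
batting94AL_qual <- subset(batting94AL, PA > 353)
batting95to19_qual <- subset(batting95to19, PA > 502)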
batting_1955to2019_qual <- do.call("rbind", list(batting55to60_qual,
batting61AL_qual, batting61NL_qual, batting62to80_qual,
batting81_qual, batting82to93_qual, batting94AL_qual,
batting94NL_qual, batting95to19_qual))
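# Standard OBP and SLG definitions; the column names (HBP, SF, X2B, X3B) assume
# the Lahman R package's Batting table
batting_1955to2019_qual$OBP <- with(batting_1955to2019_qual,
(H + BB + HBP) / (AB + BB + HBP + SF))
batting_1955to2019_qual$SLG <- with(batting_1955to2019_qual, ((H -
X2B - X3B - HR) + 2 * X2B + 3 * X3B + 4 * HR) / AB)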
library(readxl) # provides read_excel()
batting2020_qual <- read_excel("2020batting820.xlsx")
batting_1955to2020_qual <- do.call("rbind", list(batting_1955to2019_qual,
batting2020_qual))
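As a possible alternative to splitting the data by era, the same cutoffs could be attached to each row through a lookup table and applied in one pass. This is only a sketch of a different technique, not the code used for this post: the `toy` and `games` data frames, the player IDs, and the PA values are invented, and the `yearID` and `lgID` column names assume the Lahman naming.

```r
# Toy batting data with hypothetical players; lgID uses "AL"/"NL" as in Lahman
toy <- data.frame(
  playerID = c("a", "b", "c", "d", "e"),
  yearID   = c(1958L, 1961L, 1961L, 1981L, 1994L),
  lgID     = c("NL", "AL", "NL", "AL", "NL"),
  PA       = c(480, 490, 490, 340, 350),
  stringsAsFactors = FALSE
)

# One row per season and league with the games played, as described in the text
games <- expand.grid(yearID = 1955:2019, lgID = c("AL", "NL"),
                     stringsAsFactors = FALSE)
games$G <- ifelse(games$yearID >= 1962, 162, 154)
games$G[games$yearID == 1961 & games$lgID == "AL"] <- 162
games$G[games$yearID == 1981] <- 107
games$G[games$yearID == 1994 & games$lgID == "AL"] <- 114
games$G[games$yearID == 1994 & games$lgID == "NL"] <- 115

# Same cutoffs as the subset() calls: PA must exceed floor(3.1 * games)
games$cutoff <- floor(3.1 * games$G)

# Attach each row's cutoff, then keep only the qualifiers
toy_cut  <- merge(toy, games, by = c("yearID", "lgID"))
toy_qual <- subset(toy_cut, PA > cutoff)

print(toy_qual[, c("playerID", "yearID", "lgID", "PA", "cutoff")])
```

In the toy data, the two 1961 players have the same 490 plate appearances, but only the National League player qualifies, since the American League cutoff that year was 502.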
This is the first in what I hope to be a weekly series in which the questions improve in their being interesting and the coding and methodology improve in their efficiency. As I do more coding, I imagine less will be written directly in the article and will instead be linked to externally.
All stats accurate through the games played on August 19, 2020.