Throughout 2020, I have helped answer baseball research questions for the FanGraphs podcast “Effectively Wild.” Every couple of weeks, one of the hosts will reach out to be asking if I can find the answer to a baseball history question for a segment known as a “Stat Blast.” To answer the questions, I have largely used two databases: a Retrosheet database I have on my own laptop (set up using a tutorial from Bill Petti at https://billpetti.github.io/2019-08-10-build-retrosheet-event-roster-database-rstats-baseballr/) and the Lahman database (which I access with the Lahman package in R).
This series shows how I use these databases along with R to answer these questions. Today, I show how I answered a question on the longest streaks for having a different starter at a position on opening day.
The question was framed as the following:
Since the Giants participated in the blackballing of Barry Bonds, they have not had the same opening day starting LF in consecutive years:
2020, whenever it starts, (or 2021) will be the 13th year without the same opening day starter at a position. Is this a record? If not, what is the record?
To answer this question, I first read in the data from my retrosheet database. The file “dbcon.R” is a script that attaches to a PostgreSQL database and the dbGetQuery() function uses a SQL query to read the entire retro_games table into R. The retro_games table contains one row for each game in MLB history, with game information, including the lineup and positions for the starters, which I will use the answer the question.
library(tidyverse)
source("dbcon.R")
games <- dbGetQuery(con, 'select * from retro_games')
There is some data manipulation necessary to answer the question. First, I would like to have one row for each starting position for each game for each team, or 18 rows per game, rather than the one row per game that the data are currently structured as. The below commands do just that.
away_players <- games %>%
mutate(date_use = as.numeric(substr(game_id, 4,12))) %>%
select(date_use, year_id, team = visiting_team, visitor_batting1name,
visitor_batting2name,
visitor_batting3name,
visitor_batting4name,
visitor_batting5name,
visitor_batting6name,
visitor_batting7name,
visitor_batting8name,
visitor_batting9name) %>%
pivot_longer(-c(date_use, year_id, team)) %>%
select(date_use, year_id, team, player = value)
away_pos <- games %>%
mutate(date_use = as.numeric(substr(game_id, 4,12))) %>%
select(date_use, year_id, team = visiting_team, visitor_batting1position,
visitor_batting2position,
visitor_batting3position,
visitor_batting4position,
visitor_batting5position,
visitor_batting6position,
visitor_batting7position,
visitor_batting8position,
visitor_batting9position) %>%
pivot_longer(-c(date_use, year_id, team)) %>%
select(position = value)
away_combined <- cbind(away_players, away_pos)
head(away_combined, 20)
## date_use year_id team player position
## 1 201903200 2019 SEA Dee Gordon 4
## 2 201903200 2019 SEA Mitch Haniger 8
## 3 201903200 2019 SEA Jay Bruce 3
## 4 201903200 2019 SEA Edwin Encarnacion 10
## 5 201903200 2019 SEA Domingo Santana 7
## 6 201903200 2019 SEA Omar Narvaez 2
## 7 201903200 2019 SEA Ryon Healy 5
## 8 201903200 2019 SEA Tim Beckham 6
## 9 201903200 2019 SEA Ichiro Suzuki 9
## 10 201903210 2019 SEA Dee Gordon 4
## 11 201903210 2019 SEA Mitch Haniger 8
## 12 201903210 2019 SEA Jay Bruce 3
## 13 201903210 2019 SEA Edwin Encarnacion 10
## 14 201903210 2019 SEA Domingo Santana 7
## 15 201903210 2019 SEA Omar Narvaez 2
## 16 201903210 2019 SEA Tim Beckham 6
## 17 201903210 2019 SEA Ryon Healy 5
## 18 201903210 2019 SEA Ichiro Suzuki 9
## 19 201903280 2019 PIT Adam Frazier 4
## 20 201903280 2019 PIT Melky Cabrera 9
In the data frame, for each game, there is a row for each position, the name of the player playing that position, the team for that player, the year, and the date of the game. This dataframe only contains the away teams, so I repeat the code for the home teams and rbind the two data frames together.
home_players <- games %>% mutate(date_use = as.numeric(substr(game_id, 4,12))) %>%
select(date_use, year_id, team = home_team, home_batting1name,
home_batting2name,
home_batting3name,
home_batting4name,
home_batting5name,
home_batting6name,
home_batting7name,
home_batting8name,
home_batting9name) %>%
pivot_longer(-c(date_use, year_id, team)) %>%
select(date_use, year_id, team, player = value)
home_pos <- games %>% mutate(date_use = as.numeric(substr(game_id, 4,12))) %>%
select(date_use, year_id, team = home_team, home_batting1position,
home_batting2position,
home_batting3position,
home_batting4position,
home_batting5position,
home_batting6position,
home_batting7position,
home_batting8position,
home_batting9position) %>%
pivot_longer(-c(date_use, year_id, team)) %>%
select(position = value)
home_combined <- cbind(home_players, home_pos)
lineups <- rbind(away_combined, home_combined)
The data set contains information for every game, while I am only interested in the first game of the season for each team. To filter out non-opening games, I group by each year and each team, then filter the date to be equal to the lowest date for each team in each year. The table below shows the opening day lineups for two teams in 1871.
opening_day <- lineups %>%
group_by(year_id, team) %>%
filter(date_use == min(date_use)) %>%
ungroup() %>%
mutate(year_id = as.numeric(year_id)) %>%
arrange(year_id)
head(opening_day, 18)
## # A tibble: 18 x 5
## date_use year_id team player position
## <dbl> <dbl> <chr> <chr> <dbl>
## 1 187105040 1871 CL1 Deacon White 2
## 2 187105040 1871 CL1 Gene Kimball 4
## 3 187105040 1871 CL1 Charlie Pabor 7
## 4 187105040 1871 CL1 Art Allison 8
## 5 187105040 1871 CL1 Elmer White 9
## 6 187105040 1871 CL1 Al Pratt 1
## 7 187105040 1871 CL1 Ezra Sutton 5
## 8 187105040 1871 CL1 Jim Carleton 3
## 9 187105040 1871 CL1 John Bass 6
## 10 187105050 1871 BS1 George Wright 6
## 11 187105050 1871 BS1 Ross Barnes 4
## 12 187105050 1871 BS1 Dave Birdsall 9
## 13 187105050 1871 BS1 Cal McVey 2
## 14 187105050 1871 BS1 Harry Wright 8
## 15 187105050 1871 BS1 Charlie Gould 3
## 16 187105050 1871 BS1 Harry Schafer 5
## 17 187105050 1871 BS1 Fred Cone 7
## 18 187105050 1871 BS1 Al Spalding 1
Next, I need to identify years where there is a new starter at a position for a team. The is_new() function evaluate whether one element of a vector is different from the previous element, while the streak_finder function finds how many consective elements something is present. In this case, the is_new will create a logical vector indicating whether there is a new starter at a position for a team and the streak finder function will use that logical vector to find the number of consectutive years there has been a new starter.
Additionally, since teams can change names over time, I use the “CurrentNames.csv” file to add the franchise to each team.
is_new <- function(vec) {
test <- NULL
test[2:length(vec)] <- vec[1:(length(vec)-1)] != vec[2:length(vec)]
test[1] <- TRUE
test
}
streak_finder <- function(vec){
test <- NULL
count <- 0
for(i in 1:length(vec)){
if(vec[i]){
count <- count +1
} else(count <- 1)
test[i] <- count
}
test
}
team_names <- read.csv("CurrentNames.csv", header = FALSE)[,1:2]
colnames(team_names) <- c("franchise", "team")
team_names <- distinct(team_names)
opening_clean <- opening_day %>%
left_join(team_names, by = "team") %>%
na.omit()
opening_use <- opening_clean %>%
arrange(franchise, position) %>%
mutate(new_player = is_new(player)) %>%
group_by(franchise, position) %>%
mutate(streak = streak_finder(new_player)) %>%
ungroup()
With each streak identified, I can identify the beginning and end of each streak and the players the first and last players in the streak. I have listed those below, only showing the streaks that had at least 10 different players.
streaks_hit <- opening_use %>% filter(position != 1) %>%
mutate(streak = streak-1) %>%
filter(streak >= 10) %>%
select(year_id,team, player, position, franchise, streak) %>%
mutate(begin_year = year_id - streak+1) %>%
group_by(franchise, begin_year) %>%
filter(streak == max(streak)) %>%
ungroup %>%
arrange(desc(begin_year)) %>%
arrange(desc(streak))
get_player_name <- function(fran, pos, yr){
opening_use %>%
filter(franchise == fran & position == pos & year_id == yr) %>%
pull(player) %>%
.[1]
}
streaks_hit$first_player <- sapply(1:nrow(streaks_hit), function(x) get_player_name(streaks_hit$franchise[x], streaks_hit$position[x], streaks_hit$begin_year[x]))
streaks_hit$last_player <- sapply(1:nrow(streaks_hit), function(x) get_player_name(streaks_hit$franchise[x], streaks_hit$position[x], streaks_hit$year_id[x]))
streaks_hit %>%
na.omit() %>%
select(Franchise = franchise,
Position = position,
Streak = streak,
"First Year" = begin_year,
"Last Year" = year_id,
"First Player" = first_player,
"Last Player" = last_player) %>%
kableExtra::kable() %>%
kableExtra::kable_styling() %>%
kableExtra::scroll_box(height = "500px")
Franchise | Position | Streak | First Year | Last Year | First Player | Last Player |
---|---|---|---|---|---|---|
BAL | 7 | 26 | 1933 | 1958 | Carl Reynolds | Gene Woodling |
DET | 7 | 20 | 1942 | 1961 | Barney McCosky | Rocky Colavito |
BAL | 10 | 19 | 1996 | 2014 | Bobby Bonilla | Delmon Young |
CHA | 5 | 19 | 1948 | 1966 | Luke Appling | Don Buford |
NYA | 7 | 17 | 1920 | 1936 | Duffy Lewis | Roy Johnson |
NYA | 9 | 17 | 1919 | 1935 | Sammy Vick | George Selkirk |
DET | 2 | 16 | 1951 | 1966 | Joe Ginsberg | Bill Freehan |
LAN | 7 | 16 | 1944 | 1959 | Augie Galan | Wally Moon |
SDN | 7 | 15 | 2006 | 2020 | Eric Young | Tommy Pham |
LAN | 7 | 15 | 1927 | 1941 | Jigger Statz | Joe Medwick |
COL | 4 | 14 | 2001 | 2014 | Todd Walker | DJ LeMahieu |
NYA | 7 | 14 | 1990 | 2003 | Mel Hall | Hideki Matsui |
BOS | 9 | 14 | 1988 | 2001 | Mike Greenwell | Trot Nixon |
PHI | 7 | 14 | 1984 | 1997 | Glenn Wilson | Gregg Jefferies |
ANA | 8 | 14 | 1968 | 1981 | Roger Repoz | Fred Lynn |
CLE | 9 | 14 | 1939 | 1952 | Earl Averill | Harry Simpson |
SFN | 5 | 14 | 1937 | 1950 | Lou Chiozza | Hank Thompson |
CIN | 7 | 14 | 1930 | 1943 | Bob Meusel | Eric Tipton |
BOS | 6 | 14 | 1922 | 1935 | Frank O’Rourke | Joe Cronin |
SFN | 7 | 13 | 2008 | 2020 | Dave Roberts | Alex Dickerson |
OAK | 7 | 13 | 2001 | 2013 | Johnny Damon | Yoenis Cespedes |
LAN | 4 | 13 | 1962 | 1974 | Jim Gilliam | Davey Lopes |
SLN | 7 | 13 | 1955 | 1967 | Wally Moon | Lou Brock |
CHN | 8 | 13 | 1932 | 1944 | Kiki Cuyler | Andy Pafko |
ANA | 10 | 12 | 2005 | 2016 | Jeff DaVanon | Albert Pujols |
TBA | 9 | 12 | 2004 | 2015 | Jose Cruz | Steven Souza |
DET | 10 | 12 | 1993 | 2004 | Kirk Gibson | Dmitri Young |
ANA | 4 | 12 | 1991 | 2002 | Luis Sojo | Adam Kennedy |
ANA | 9 | 12 | 1982 | 1993 | Reggie Jackson | Tim Salmon |
TEX | 9 | 12 | 1964 | 1975 | Jim King | Jeff Burroughs |
PHI | 4 | 12 | 1931 | 1942 | Bernie Friberg | Danny Murtaugh |
MIN | 9 | 12 | 1916 | 1927 | Charlie Jamieson | Sam Rice |
NYA | 8 | 12 | 1912 | 1923 | Bert Daniels | Whitey Witt |
KCA | 7 | 11 | 2001 | 2011 | Mark Quinn | Alex Gordon |
TEX | 10 | 11 | 2001 | 2011 | Andres Galarraga | Michael Young |
MIA | 9 | 11 | 1999 | 2009 | Mark Kotsay | Cody Ross |
CHN | 7 | 11 | 1988 | 1998 | Rafael Palmeiro | Henry Rodriguez |
HOU | 9 | 11 | 1988 | 1998 | Terry Puhl | Derek Bell |
NYN | 8 | 11 | 1986 | 1996 | Lenny Dykstra | Lance Johnson |
DET | 9 | 11 | 1974 | 1984 | Jim Northrup | Kirk Gibson |
CHN | 8 | 11 | 1957 | 1967 | Bob Will | Adolfo Phillips |
CHA | 9 | 11 | 1949 | 1959 | Pat Seerey | Al Smith |
ATL | 3 | 11 | 1937 | 1947 | Elbie Fletcher | Earl Torgeson |
ATL | 5 | 11 | 1937 | 1947 | Debs Garms | Bob Elliott |
OAK | 4 | 11 | 1934 | 1944 | Rabbit Warstler | Irv Hall |
SLN | 9 | 11 | 1926 | 1936 | Chick Hafey | Pepper Martin |
ATL | 8 | 11 | 1910 | 1920 | Fred Beck | Ray Powell |
NYA | 4 | 11 | 1908 | 1918 | Harry Niles | Del Pratt |
HOU | 3 | 10 | 2010 | 2019 | Geoff Blum | Yulieski Gurriel |
NYN | 7 | 10 | 2007 | 2016 | Moises Alou | Yoenis Cespedes |
TBA | 5 | 10 | 2000 | 2009 | Herbert Perry | Evan Longoria |
SEA | 7 | 10 | 1997 | 2006 | Lee Tinsley | Raul Ibanez |
BOS | 4 | 10 | 1995 | 2004 | Luis Alicea | Mark Bellhorn |
KCA | 2 | 10 | 1993 | 2002 | Brent Mayne | Brent Mayne |
TEX | 6 | 10 | 1990 | 1999 | Jeff Kunkel | Royce Clayton |
NYA | 10 | 10 | 1989 | 1998 | Tom Brookens | Chili Davis |
BAL | 9 | 10 | 1985 | 1994 | Larry Sheets | Jeffrey Hammonds |
NYA | 10 | 10 | 1974 | 1983 | Bill Sudakis | Don Baylor |
LAN | 7 | 10 | 1966 | 1975 | Lou Johnson | Bill Buckner |
OAK | 7 | 10 | 1961 | 1970 | Norm Siebern | Felipe Alou |
PHI | 3 | 10 | 1961 | 1970 | Pancho Herrera | Deron Johnson |
CIN | 5 | 10 | 1959 | 1968 | Frank Thomas | Tony Perez |
CIN | 7 | 10 | 1959 | 1968 | Jerry Lynch | Alex Johnson |
LAN | 5 | 10 | 1957 | 1966 | Randy Jackson | Jim Lefebvre |
BAL | 9 | 10 | 1955 | 1964 | Gene Woodling | Sam Bowens |
OAK | 4 | 10 | 1951 | 1960 | Billy Hitchcock | Jerry Lumpe |
SLN | 8 | 10 | 1946 | 1955 | Terry Moore | Bill Virdon |
OAK | 7 | 10 | 1943 | 1952 | Jim Tyack | Gus Zernial |
LAN | 8 | 10 | 1940 | 1949 | Charlie Gilbert | Duke Snider |
OAK | 8 | 10 | 1939 | 1948 | Dee Miles | Sam Chapman |
PIT | 9 | 10 | 1939 | 1948 | Fern Bell | Dixie Walker |
PHI | 2 | 10 | 1938 | 1947 | Bill Atwood | Andy Seminick |
ATL | 9 | 10 | 1934 | 1943 | Joe Mowry | Chuck Workman |
BOS | 9 | 10 | 1933 | 1942 | Smead Jolley | Pete Fox |
DET | 8 | 10 | 1930 | 1939 | Liz Funk | Barney McCosky |
BOS | 5 | 10 | 1928 | 1937 | Buddy Myer | Pinky Higgins |
ATL | 7 | 10 | 1923 | 1932 | Bill Bagwell | Red Worthington |
BOS | 4 | 10 | 1918 | 1927 | Dave Shean | Bill Regan |
BAL | 5 | 10 | 1916 | 1925 | Charlie Deal | Gene Robertson |
CHN | 4 | 10 | 1914 | 1923 | Bill Sweeney | George Grantham |