Most Consecutive Years with a New Starter on Opening Day

PUBLISHED ON DEC 17, 2020

Throughout 2020, I have helped answer baseball research questions for the FanGraphs podcast “Effectively Wild.” Every couple of weeks, one of the hosts will reach out to be asking if I can find the answer to a baseball history question for a segment known as a “Stat Blast.” To answer the questions, I have largely used two databases: a Retrosheet database I have on my own laptop (set up using a tutorial from Bill Petti at https://billpetti.github.io/2019-08-10-build-retrosheet-event-roster-database-rstats-baseballr/) and the Lahman database (which I access with the Lahman package in R).

This series shows how I use these databases along with R to answer these questions. Today, I show how I answered a question on the longest streaks for having a different starter at a position on opening day.

The question was framed as the following:

Since the Giants participated in the blackballing of Barry Bonds, they have not had the same opening day starting LF in consecutive years:

  • ’08: Dave Roberts
  • ’09: Fred Lewis
  • ’10: Mark DeRosa
  • ’11: Pat Burrell
  • ’12: Aubrey Huff
  • ’13: Andres Torres
  • ’14: Michael Morse
  • ’15: Nori Aoki
  • ’16: Angel Pagan
  • ’17: Jarrett Parker
  • ’18: Hunter Pence
  • ’19: Connor Joe

2020, whenever it starts, (or 2021) will be the 13th year without the same opening day starter at a position. Is this a record? If not, what is the record?

To answer this question, I first read in the data from my retrosheet database. The file “dbcon.R” is a script that attaches to a PostgreSQL database and the dbGetQuery() function uses a SQL query to read the entire retro_games table into R. The retro_games table contains one row for each game in MLB history, with game information, including the lineup and positions for the starters, which I will use the answer the question.

library(tidyverse)
source("dbcon.R")
games <- dbGetQuery(con, 'select * from retro_games')

There is some data manipulation necessary to answer the question. First, I would like to have one row for each starting position for each game for each team, or 18 rows per game, rather than the one row per game that the data are currently structured as. The below commands do just that.

away_players <- games %>%
  mutate(date_use = as.numeric(substr(game_id, 4,12))) %>% 
  select(date_use, year_id, team = visiting_team, visitor_batting1name,
         visitor_batting2name,
         visitor_batting3name,
         visitor_batting4name,
         visitor_batting5name,
         visitor_batting6name,
         visitor_batting7name,
         visitor_batting8name,
         visitor_batting9name) %>%
  pivot_longer(-c(date_use, year_id, team)) %>% 
  select(date_use, year_id, team, player = value)

away_pos <- games %>% 
  mutate(date_use = as.numeric(substr(game_id, 4,12))) %>% 
  select(date_use, year_id, team = visiting_team, visitor_batting1position,
         visitor_batting2position,
         visitor_batting3position,
         visitor_batting4position,
         visitor_batting5position,
         visitor_batting6position,
         visitor_batting7position,
         visitor_batting8position,
         visitor_batting9position) %>%
  pivot_longer(-c(date_use, year_id, team)) %>% 
  select(position = value)

away_combined <- cbind(away_players, away_pos)

head(away_combined, 20)
##     date_use year_id team            player position
## 1  201903200    2019  SEA        Dee Gordon        4
## 2  201903200    2019  SEA     Mitch Haniger        8
## 3  201903200    2019  SEA         Jay Bruce        3
## 4  201903200    2019  SEA Edwin Encarnacion       10
## 5  201903200    2019  SEA   Domingo Santana        7
## 6  201903200    2019  SEA      Omar Narvaez        2
## 7  201903200    2019  SEA        Ryon Healy        5
## 8  201903200    2019  SEA       Tim Beckham        6
## 9  201903200    2019  SEA     Ichiro Suzuki        9
## 10 201903210    2019  SEA        Dee Gordon        4
## 11 201903210    2019  SEA     Mitch Haniger        8
## 12 201903210    2019  SEA         Jay Bruce        3
## 13 201903210    2019  SEA Edwin Encarnacion       10
## 14 201903210    2019  SEA   Domingo Santana        7
## 15 201903210    2019  SEA      Omar Narvaez        2
## 16 201903210    2019  SEA       Tim Beckham        6
## 17 201903210    2019  SEA        Ryon Healy        5
## 18 201903210    2019  SEA     Ichiro Suzuki        9
## 19 201903280    2019  PIT      Adam Frazier        4
## 20 201903280    2019  PIT     Melky Cabrera        9

In the data frame, for each game, there is a row for each position, the name of the player playing that position, the team for that player, the year, and the date of the game. This dataframe only contains the away teams, so I repeat the code for the home teams and rbind the two data frames together.

home_players <- games %>% mutate(date_use = as.numeric(substr(game_id, 4,12))) %>% 
  select(date_use, year_id, team = home_team, home_batting1name,
         home_batting2name,
         home_batting3name,
         home_batting4name,
         home_batting5name,
         home_batting6name,
         home_batting7name,
         home_batting8name,
         home_batting9name) %>%
  pivot_longer(-c(date_use, year_id, team)) %>% 
  select(date_use, year_id, team, player = value)

home_pos <- games %>% mutate(date_use = as.numeric(substr(game_id, 4,12))) %>% 
  select(date_use, year_id, team = home_team, home_batting1position,
         home_batting2position,
         home_batting3position,
         home_batting4position,
         home_batting5position,
         home_batting6position,
         home_batting7position,
         home_batting8position,
         home_batting9position) %>%
  pivot_longer(-c(date_use, year_id, team)) %>% 
  select(position = value)

home_combined <- cbind(home_players, home_pos)

lineups <- rbind(away_combined, home_combined)

The data set contains information for every game, while I am only interested in the first game of the season for each team. To filter out non-opening games, I group by each year and each team, then filter the date to be equal to the lowest date for each team in each year. The table below shows the opening day lineups for two teams in 1871.

opening_day <- lineups %>%
  group_by(year_id, team) %>% 
  filter(date_use == min(date_use)) %>% 
  ungroup() %>% 
  mutate(year_id = as.numeric(year_id)) %>% 
  arrange(year_id)

head(opening_day, 18)
## # A tibble: 18 x 5
##     date_use year_id team  player        position
##        <dbl>   <dbl> <chr> <chr>            <dbl>
##  1 187105040    1871 CL1   Deacon White         2
##  2 187105040    1871 CL1   Gene Kimball         4
##  3 187105040    1871 CL1   Charlie Pabor        7
##  4 187105040    1871 CL1   Art Allison          8
##  5 187105040    1871 CL1   Elmer White          9
##  6 187105040    1871 CL1   Al Pratt             1
##  7 187105040    1871 CL1   Ezra Sutton          5
##  8 187105040    1871 CL1   Jim Carleton         3
##  9 187105040    1871 CL1   John Bass            6
## 10 187105050    1871 BS1   George Wright        6
## 11 187105050    1871 BS1   Ross Barnes          4
## 12 187105050    1871 BS1   Dave Birdsall        9
## 13 187105050    1871 BS1   Cal McVey            2
## 14 187105050    1871 BS1   Harry Wright         8
## 15 187105050    1871 BS1   Charlie Gould        3
## 16 187105050    1871 BS1   Harry Schafer        5
## 17 187105050    1871 BS1   Fred Cone            7
## 18 187105050    1871 BS1   Al Spalding          1

Next, I need to identify years where there is a new starter at a position for a team. The is_new() function evaluate whether one element of a vector is different from the previous element, while the streak_finder function finds how many consective elements something is present. In this case, the is_new will create a logical vector indicating whether there is a new starter at a position for a team and the streak finder function will use that logical vector to find the number of consectutive years there has been a new starter.

Additionally, since teams can change names over time, I use the “CurrentNames.csv” file to add the franchise to each team.

is_new <- function(vec) {
  test <- NULL
  test[2:length(vec)] <- vec[1:(length(vec)-1)] != vec[2:length(vec)]
  test[1] <- TRUE
  test
}


streak_finder <- function(vec){
  test <- NULL
  count <- 0
  for(i in 1:length(vec)){
    if(vec[i]){
      count <- count +1
    } else(count <- 1)
    test[i] <- count
  }
  test
}

team_names <- read.csv("CurrentNames.csv", header = FALSE)[,1:2]
colnames(team_names) <- c("franchise", "team")
team_names <- distinct(team_names)

opening_clean <- opening_day %>% 
  left_join(team_names, by = "team") %>%
  na.omit()

opening_use <- opening_clean %>%
  arrange(franchise, position) %>% 
  mutate(new_player = is_new(player)) %>% 
  group_by(franchise, position) %>% 
  mutate(streak = streak_finder(new_player)) %>% 
  ungroup() 

With each streak identified, I can identify the beginning and end of each streak and the players the first and last players in the streak. I have listed those below, only showing the streaks that had at least 10 different players.

streaks_hit <- opening_use %>% filter(position != 1) %>% 
  mutate(streak = streak-1) %>% 
  filter(streak >= 10) %>%
  select(year_id,team, player, position, franchise, streak) %>% 
  mutate(begin_year = year_id - streak+1) %>% 
  group_by(franchise, begin_year) %>% 
  filter(streak == max(streak)) %>% 
  ungroup %>% 
  arrange(desc(begin_year)) %>% 
  arrange(desc(streak))

get_player_name <- function(fran, pos, yr){
  opening_use %>%
    filter(franchise == fran & position == pos & year_id == yr) %>% 
    pull(player) %>% 
    .[1]
}

streaks_hit$first_player <- sapply(1:nrow(streaks_hit), function(x) get_player_name(streaks_hit$franchise[x], streaks_hit$position[x], streaks_hit$begin_year[x]))

streaks_hit$last_player <- sapply(1:nrow(streaks_hit), function(x) get_player_name(streaks_hit$franchise[x], streaks_hit$position[x], streaks_hit$year_id[x]))

streaks_hit %>% 
  na.omit() %>% 
  select(Franchise = franchise,
         Position = position,
         Streak = streak,
         "First Year" = begin_year,
         "Last Year" = year_id,
         "First Player" = first_player,
         "Last Player" = last_player) %>% 
  kableExtra::kable() %>% 
  kableExtra::kable_styling() %>% 
  kableExtra::scroll_box(height = "500px")
Franchise Position Streak First Year Last Year First Player Last Player
BAL 7 26 1933 1958 Carl Reynolds Gene Woodling
DET 7 20 1942 1961 Barney McCosky Rocky Colavito
BAL 10 19 1996 2014 Bobby Bonilla Delmon Young
CHA 5 19 1948 1966 Luke Appling Don Buford
NYA 7 17 1920 1936 Duffy Lewis Roy Johnson
NYA 9 17 1919 1935 Sammy Vick George Selkirk
DET 2 16 1951 1966 Joe Ginsberg Bill Freehan
LAN 7 16 1944 1959 Augie Galan Wally Moon
SDN 7 15 2006 2020 Eric Young Tommy Pham
LAN 7 15 1927 1941 Jigger Statz Joe Medwick
COL 4 14 2001 2014 Todd Walker DJ LeMahieu
NYA 7 14 1990 2003 Mel Hall Hideki Matsui
BOS 9 14 1988 2001 Mike Greenwell Trot Nixon
PHI 7 14 1984 1997 Glenn Wilson Gregg Jefferies
ANA 8 14 1968 1981 Roger Repoz Fred Lynn
CLE 9 14 1939 1952 Earl Averill Harry Simpson
SFN 5 14 1937 1950 Lou Chiozza Hank Thompson
CIN 7 14 1930 1943 Bob Meusel Eric Tipton
BOS 6 14 1922 1935 Frank O’Rourke Joe Cronin
SFN 7 13 2008 2020 Dave Roberts Alex Dickerson
OAK 7 13 2001 2013 Johnny Damon Yoenis Cespedes
LAN 4 13 1962 1974 Jim Gilliam Davey Lopes
SLN 7 13 1955 1967 Wally Moon Lou Brock
CHN 8 13 1932 1944 Kiki Cuyler Andy Pafko
ANA 10 12 2005 2016 Jeff DaVanon Albert Pujols
TBA 9 12 2004 2015 Jose Cruz Steven Souza
DET 10 12 1993 2004 Kirk Gibson Dmitri Young
ANA 4 12 1991 2002 Luis Sojo Adam Kennedy
ANA 9 12 1982 1993 Reggie Jackson Tim Salmon
TEX 9 12 1964 1975 Jim King Jeff Burroughs
PHI 4 12 1931 1942 Bernie Friberg Danny Murtaugh
MIN 9 12 1916 1927 Charlie Jamieson Sam Rice
NYA 8 12 1912 1923 Bert Daniels Whitey Witt
KCA 7 11 2001 2011 Mark Quinn Alex Gordon
TEX 10 11 2001 2011 Andres Galarraga Michael Young
MIA 9 11 1999 2009 Mark Kotsay Cody Ross
CHN 7 11 1988 1998 Rafael Palmeiro Henry Rodriguez
HOU 9 11 1988 1998 Terry Puhl Derek Bell
NYN 8 11 1986 1996 Lenny Dykstra Lance Johnson
DET 9 11 1974 1984 Jim Northrup Kirk Gibson
CHN 8 11 1957 1967 Bob Will Adolfo Phillips
CHA 9 11 1949 1959 Pat Seerey Al Smith
ATL 3 11 1937 1947 Elbie Fletcher Earl Torgeson
ATL 5 11 1937 1947 Debs Garms Bob Elliott
OAK 4 11 1934 1944 Rabbit Warstler Irv Hall
SLN 9 11 1926 1936 Chick Hafey Pepper Martin
ATL 8 11 1910 1920 Fred Beck Ray Powell
NYA 4 11 1908 1918 Harry Niles Del Pratt
HOU 3 10 2010 2019 Geoff Blum Yulieski Gurriel
NYN 7 10 2007 2016 Moises Alou Yoenis Cespedes
TBA 5 10 2000 2009 Herbert Perry Evan Longoria
SEA 7 10 1997 2006 Lee Tinsley Raul Ibanez
BOS 4 10 1995 2004 Luis Alicea Mark Bellhorn
KCA 2 10 1993 2002 Brent Mayne Brent Mayne
TEX 6 10 1990 1999 Jeff Kunkel Royce Clayton
NYA 10 10 1989 1998 Tom Brookens Chili Davis
BAL 9 10 1985 1994 Larry Sheets Jeffrey Hammonds
NYA 10 10 1974 1983 Bill Sudakis Don Baylor
LAN 7 10 1966 1975 Lou Johnson Bill Buckner
OAK 7 10 1961 1970 Norm Siebern Felipe Alou
PHI 3 10 1961 1970 Pancho Herrera Deron Johnson
CIN 5 10 1959 1968 Frank Thomas Tony Perez
CIN 7 10 1959 1968 Jerry Lynch Alex Johnson
LAN 5 10 1957 1966 Randy Jackson Jim Lefebvre
BAL 9 10 1955 1964 Gene Woodling Sam Bowens
OAK 4 10 1951 1960 Billy Hitchcock Jerry Lumpe
SLN 8 10 1946 1955 Terry Moore Bill Virdon
OAK 7 10 1943 1952 Jim Tyack Gus Zernial
LAN 8 10 1940 1949 Charlie Gilbert Duke Snider
OAK 8 10 1939 1948 Dee Miles Sam Chapman
PIT 9 10 1939 1948 Fern Bell Dixie Walker
PHI 2 10 1938 1947 Bill Atwood Andy Seminick
ATL 9 10 1934 1943 Joe Mowry Chuck Workman
BOS 9 10 1933 1942 Smead Jolley Pete Fox
DET 8 10 1930 1939 Liz Funk Barney McCosky
BOS 5 10 1928 1937 Buddy Myer Pinky Higgins
ATL 7 10 1923 1932 Bill Bagwell Red Worthington
BOS 4 10 1918 1927 Dave Shean Bill Regan
BAL 5 10 1916 1925 Charlie Deal Gene Robertson
CHN 4 10 1914 1923 Bill Sweeney George Grantham