In their 2008 study, Major League Baseball Players’ Life Expectancies, Saint Onge et al. (2008) (the Authors) used major league baseball (MLB) player data from the Total Baseball (2004) encyclopedia to evaluate the role of anthropomorphic measures (height, body mass, and handedness) and performance measures (career length and total player rating) on longevity. Using a discrete time hazards model, the Authors found:

Compared to 20-year-old U.S. males, MLB players can expect almost five additional years of life,
Height, weight, handedness, and player ratings are not associated with the risk of death, but
Career length is inversely associated with the risk of death.

This workbook reproduces the analysis with an alternative data source, Sean Lahman’s Baseball Database (2022). Differences in players included in the data and their attributes resulted in slightly different coefficient estimates. The Authors’ findings were generally confirmed except that instead of an expected five additional years of life for 20-year-old MLB players, the Lahman-based data produced estimates of only three additional years. Fitting the models with data updated through the 2021 season resulted in lower measured mortality risks that are in closer alignment with the Authors’ findings. A 20-year-old MLB player can expect an additional 4.5 years of additional life over comparable U.S. males. A Cox proportional hazards model with time-dependent covariates confirmed the Authors’ findings that height, weight, and handedness are not associated with risk of death, and career length is inversely associated with risk of death.

Data

The Authors analyzed data from the electronic version of Total Baseball (2004). Total Baseball was last published in 2008. However, another source of detailed baseball player data is now available. The R version of the 2021 edition of Sean Lahman’s Baseball Database, the Lahman R package v10.0-1 (2022) is continuously updated and available for download on CRAN.

The Authors restricted their data set to position players who debuted between 1902 and 2004. They excluded pitchers due to potential differences in career length and selection into major league baseball. They also excluded players whose career consisted of a mere September call-up by removing players who played for a month or less at the end of the season and who did not play the following season.

The Authors did not explain how they define “pitcher”. Occasionally a position player will pitch during a blowout game to avoid wasting actual pitchers in a lost cause. Also, there are instances of players changing positions, the most famous being Babe Ruth. Ruth pitched from 1914 - 1919, and a few games in 1920, 1921, 1930, and 1933. I defined pitchers as players who never played any position other than pitcher. Using the Lahman Fielding data set, I excluded players whose only fielding statistics were for the pitcher position.

library(Lahman)

data("People", "Fielding", package = "Lahman")

person_0 <- People %>%
  semi_join(Fielding %>% filter(POS != "P"), by = "playerID") %>%
  mutate(
    debut = ymd(debut),
    debutMonth = month(debut),
    debutYear = year(debut),
    finalGame = ymd(finalGame),
    finalGameYear = year(finalGame),
    career_length = time_length(interval(debut, finalGame), "years")
  ) %>%
  filter(
    debutYear >= 1902,
    # September call-ups. I originally filtered debutMonth < 9 (naturally), but
    # my data better aligns with the Authors if I use 8 (August). One confirming
    # evidence came from identifying the youngest player to die. The Authors 
    # report it being Al Montegomery. If I use the < 9 filter, it's Newt Halliday.
    debutMonth < 8 | finalGameYear > debutYear
  ) %>%
  select(-c(birthState, birthCity, deathCountry, deathState, deathCity,
            nameGiven, retroID, bbrefID))

The Authors identified eleven players with no death date who, if alive, would have been over 95 years old at the end of 2004. After researching, they dropped one of these records, Jack Dalton, born on July 3, 1885. In the Lahman data, Dalton is recorded with a death year of 1950, making him 65 years old when he died. The Lahman data includes ten players who were over 95 at the end of 2004. All have death dates.

person_0 %>% 
  mutate(
    age_at_2004 = time_length(interval(birthDate, ymd(20041231)), "year") %>% round(),
    age_at_death = time_length(interval(birthDate, deathDate), "year") %>% round()
  ) %>%
  filter(age_at_2004 >= 95 & (is.na(deathYear) | deathYear > 2004)) %>%
  select(nameFirst, nameLast, birthDate, deathDate, age_at_2004, age_at_death)

##    nameFirst   nameLast  birthDate  deathDate age_at_2004 age_at_death
## 1        Ray     Berres 1907-08-31 2007-02-01          97           99
## 2        Ray Cunningham 1905-01-17 2005-07-30         100          101
## 3      Howdy  Groskloss 1906-04-10 2006-07-15          99          100
## 4      Ernie        Koy 1909-09-17 2007-01-01          95           97
## 5         Al      Lopez 1908-08-20 2005-10-30          96           97
## 6       Emil     Mailho 1908-12-16 2007-03-07          96           98
## 7       Tony  Malinosky 1909-10-07 2011-02-08          95          101
## 8      Eddie       Mayo 1910-04-15 2006-11-27          95           97
## 9      Bobby    Stevens 1907-04-17 2005-12-30          98           99
## 10     Billy     Werber 1908-06-20 2009-01-22          97          101

The Lahman data does have seven players with no death date would would be over 95 years old at the end of the 2021 season. The Wikipedia articles on each player as of Jul 2022 indicated all were still alive.

person_0 %>% 
  mutate(
    age_at_2021 = time_length(interval(birthDate, ymd(20211231)), "year") %>% round(),
    age_at_death = time_length(interval(birthDate, deathDate), "year") %>% round()
  ) %>%
  filter(age_at_2021 >= 95 & is.na(deathYear)) %>%
  select(nameFirst, nameLast, birthDate, age_at_2021)

##   nameFirst  nameLast  birthDate age_at_2021
## 1    George     Elder 1921-03-10         101
## 2   Charlie   Maxwell 1927-04-08          95
## 3        Ed Mickelson 1926-09-09          95
## 4     Larry   Miggins 1925-08-20          96
## 5     Bobby    Morgan 1926-06-29          96
## 6     Frank   Saucier 1926-05-28          96
## 7     Bobby    Shantz 1925-09-26          96

However, the Lahman data includes a couple unusual records related to death date. Alfredo Cabrera and Hector Martinez have no death date because of incomplete month and day information.

person_0 %>% 
  filter(is.na(deathDate), !is.na(deathYear)) %>%
  select(nameFirst, nameLast, birthDate, deathYear, deathMonth, deathDay)

##   nameFirst nameLast  birthDate deathYear deathMonth deathDay
## 1   Alfredo  Cabrera 1881-05-11      1964         NA       NA
## 2    Hector Martinez 1939-05-11      1999         12       NA

One option to handle incomplete data is imputing a death date. Another is dropping the record. A death year is present for both players, and in one case the month is present too. I opted to keep both records and impute the middle month of year (Jun) and/or day of month (15th).

person_1 <- person_0 %>%
  mutate(deathDate = coalesce(deathDate, 
                              ymd(paste(deathYear,
                                  coalesce(deathMonth, 6), 
                                  coalesce(deathDay, 15)), 
                                  quiet = TRUE))) %>%
  select(-c(deathMonth, deathDay))

There is a similar problem with birth dates. Two have missing birth months and days.

person_1 %>% 
  filter(is.na(birthDate)) %>%
  select(nameFirst, nameLast, birthYear, birthMonth, birthDay, deathDate)

##   nameFirst nameLast birthYear birthMonth birthDay  deathDate
## 1    Claude  Gouzzie      1873         NA       NA 1907-09-21
## 2        Ed    Irwin      1882         NA       NA 1916-02-05

I handled them the same way, imputing a mid-month and day-of-month.

person_2 <- person_1 %>%
  mutate(birthDate = coalesce(birthDate, 
                              ymd(paste(birthYear,
                                  coalesce(birthMonth, 6), 
                                  coalesce(birthDay, 15)), 
                                  quiet = TRUE))) %>%
  select(-c(birthMonth, birthDay))

The Authors noted that the youngest age at death was for Al Montgomery, 22.5 years. The Lahman data confirms that Montgomery was the youngest, but his age at death was 21.8 years.

person_2 %>% 
  filter(!is.na(deathYear)) %>%
  mutate(age = time_length(interval(birthDate, deathDate), "year")) %>%
  select(nameFirst, nameLast, birthDate, deathDate, age) %>%
  arrange(age) %>%
  head()

##   nameFirst   nameLast  birthDate  deathDate      age
## 1        Al Montgomery 1920-07-03 1942-04-26 21.81370
## 2       Ken      Hubbs 1941-12-23 1964-02-13 22.14208
## 3     Oscar    Taveras 1992-06-19 2014-10-26 22.35342
## 4       Pat      Hynes 1884-03-12 1907-03-12 23.00000
## 5       Joe    Cassidy 1883-02-08 1906-03-25 23.12329
## 6      John      Dodge 1893-04-27 1916-06-19 23.14521

The discrepancy may be explained by how the ages are calculated. The Authors calculates age by person-year. For each calendar year, the player’s age was the number of years between the date of birth and January 1 of the calendar year. So Montgomery’s age on January 1, 1943 would have been 22.5.¹

Otherwise, the birth and death dates are reasonable. The following plot of the age distributions for deceased and living players has reasonable endpoints and modes. The mode age for deceased players is about 82 years.

person_2 %>% 
  mutate(
    all_ages = time_length(interval(birthDate, coalesce(deathDate, ymd(20211231))), 
                           "year"),
    is_alive = factor(if_else(is.na(deathDate), "Alive", "Dead"))
  ) %>%
  ggplot(aes(x = all_ages, fill = is_alive)) + 
  geom_density(alpha = .4, color = "gray40") +
  theme_light() +
  theme(legend.position = "top") +
  labs(title = "Lahman data set age distribution.",
       x = "Age at death, or at end of 2021 (if alive)", fill = NULL)

Variables

The Authors defined several variables from the player attributes.

Cohort. Period of MLB debut (Early (1902–1945), Golden (1946–1968), and Modern (1969–2004)).
Height. Height >= 6 feet indicator (<6 feet, >=6 feet).
Body Mass Index (BMI). BMI range indicators (Normal (18.5 - <25.0), Overweight (25.0 – <30.0), Obese Class I (30.0 – <35.0)).
Seasons Played. Career length >=10 years indicator (<10 years, 10+ years).
Handedness. Dominant hand (Right Handed, Left Handed).

These are all nominal variables, coded as factors in R with the first value as the reference level.

person_3 <- person_2 %>%
  mutate(
    cohort = factor(
      case_when(between(debutYear, 1902, 1945) ~ "Early",
                between(debutYear, 1946, 1968) ~ "Golden",
                debutYear >= 1969 ~ "Modern"),
      levels = c("Early", "Golden", "Modern")
    ),
    height_fct = factor(
      if_else(height >= 72, ">=6 feet", "<6 feet"), 
      levels = c("<6 feet", ">=6 feet")
    ),
    bmi = weight / height^2 * 703,
    bmi_fct = factor(case_when(
      bmi < 18.5 ~ "Underweight (<18.5)",
      bmi >= 18.5 & bmi < 25.0 ~ "Normal (18.5 - <25.0)",
      bmi >= 25.0 & bmi < 30.0 ~ "Overweight (25.0 – <30.0)",
      bmi >= 30.0 & bmi < 35.0 ~ "Obese Class I (30.0 – <35.0)",
      bmi >= 35.0 & bmi < 40.0 ~ "Obese Class II (35.0 - <40.0)",
      bmi >= 40.0 & bmi < 44.9 ~ "Obese Class III (40.0 - <45.0"),
      levels =  c("Underweight (<18.5)",
                  "Normal (18.5 - <25.0)", 
                  "Overweight (25.0 – <30.0)",
                  "Obese Class I (30.0 – <35.0)", 
                  "Obese Class II (35.0 - <40.0)",
                  "Obese Class III (40.0 - <45.0")
    ),
    career_length_fct = factor(
      if_else(career_length < 10, "<10 years", "10+ years"),
      levels = c("<10 years", "10+ years")
    ),
    throws = factor(
      throws, 
      levels = c("R", "L"), 
      labels = c("Right Handed", "Left Handed")
    )
  )

The BMI bins in the Lahman data include values outside the range of the Authors’ study. Their row counts are so low that it is advisable to lump levels. The Lahman data also has insufficient information (missing height and/or weight) to calculate the BMI for 55 players.

person_3 %>% 
  mutate(debut_lte_2004 = if_else(debut <= 2004, "In Study", "After 2004")) %>%
  janitor::tabyl(bmi_fct, debut_lte_2004)

##                        bmi_fct After 2004 In Study
##            Underweight (<18.5)          0        1
##          Normal (18.5 - <25.0)       1217     2646
##      Overweight (25.0 – <30.0)       2437     1816
##   Obese Class I (30.0 – <35.0)        228       11
##  Obese Class II (35.0 - <40.0)          7        0
##  Obese Class III (40.0 - <45.0          1        0
##                           <NA>          0       55

In addition to the missing BMI values, there are missing values in the Lahman data for height and handedness.

person_3 %>%
  select(height_fct, throws) %>%
  summary()

##     height_fct            throws    
##  <6 feet :3643   Right Handed:7223  
##  >=6 feet:4752   Left Handed :1181  
##  NA's    :  24   NA's        :  15

I managed the BMI levels by expanding Normal to include the one Underweight player, and Obese Class I to include the seven Obese Class II players and one Obese Class III player. For the 55 players of unknown BMI, I used multivariate imputations by chained equations (MICE) to impute values. I also used it to impute the missing height and handedness values.

set.seed(2020)

mice_0 <- person_3 %>%
  select(birthCountry, weight, height, throws, bmi, debutYear, career_length, cohort) %>%
  mice::mice(method = "rf", printFlag = FALSE) %>%
  mice::complete()

person_4 <- data.frame(person_3, 
                     bmi_mice = mice_0$bmi,
                     height_mice = mice_0$height,
                     throws_mice = mice_0$throws) %>%
  mutate(
    # Collapse BMI into three levels.
    bmi_fct_2 = case_when(
      bmi_mice < 18.5 ~ "Normal",
      bmi_mice >= 18.5 & bmi_mice < 25.0 ~ "Normal",
      bmi_mice >= 25.0 & bmi_mice < 30.0 ~ "Overweight",
      bmi_mice >= 30.0 & bmi_mice < 35.0 ~ "Obese",
      bmi_mice >= 35.0 & bmi_mice < 40.0 ~ "Obese",
      bmi_mice >= 40.0 & bmi_mice < 44.9 ~ "Obese"
    ),
    bmi_fct_2 = fct_relevel(bmi_fct_2, "Normal", "Overweight", "Obese"),
    height_fct_2 = factor(if_else(height_mice >= 72, ">=6 feet", "<6 feet"),
                          levels = c("<6 feet", ">=6 feet")),
    throws_2 = throws_mice
  )

MICE estimated 32 of the 35 the missing BMIs would be in the Normal range, the rest Overweight. The new BMI distribution is more more aligned with the Authors’, and is compatible with regression algorithms. It estimated 18 of 24 missing heights to be under 6 feet, and all 15 missing handedness to be right-handed.

The Authors reported that the highest recorded BMI was for Calvin Pickering (32.61). The Lahman data reports a BMI of 33.56 for Pickering, ranking him sixth heaviest among players debuting up to 2004. Dmitri Young topped the list at 37.87. Young is shorter and heavier than Pickering. Perhaps Young put on weight over his career and the Lahman data captures the more recent value.

person_4 %>% 
  filter(debutYear <= 2004) %>%
  arrange(desc(bmi_mice)) %>% 
  head(10) %>%
  mutate(rank = row_number()) %>%
  select(rank, nameFirst, nameLast, debut, weight, height, bmi = bmi_mice, bmi_fct_2) %>%
  flextable::flextable() %>%
  flextable::colformat_double(j = 7, digits = 2) %>%
  flextable::autofit() %>%
  flextable::theme_vanilla() %>%
  flextable::bg(i = ~ nameLast == "Pickering", bg = "lightgoldenrod") %>%
  flextable::set_caption("Top 10 BMIs in Lahman data, debut dates <= 2004. Highest BMI in Authors' data highlighted.")

Top 10 BMIs in Lahman data, debut dates <= 2004. Highest BMI in Authors' data highlighted.
rank	nameFirst	nameLast	debut	weight	height	bmi	bmi_fct_2
1	Dmitri	Young	1996-08-29	295	74	37.87	Obese
2	Carlos	Lee	1999-05-07	270	74	34.66	Obese
3	Juan	Diaz	2002-06-12	269	74	34.53	Obese
4	Raul	Chavez	1996-08-30	245	71	34.17	Obese
5	Jose	Molina	1999-09-06	250	72	33.90	Obese
6	Calvin	Pickering	1998-09-12	283	77	33.56	Obese
7	Marlon	Byrd	2002-09-08	245	72	33.22	Obese
8	Juan	Uribe	2001-04-08	245	72	33.22	Obese
9	Bob	Fothergill	1922-04-18	230	70	33.00	Obese
10	Jack	Cust	2001-09-26	250	73	32.98	Obese

It is interesting to consider the data updated through 2021.

person_4 %>% 
  # filter(debutYear <= 2004) %>%
  arrange(desc(bmi_mice)) %>% 
  head(30) %>%
  mutate(rank = row_number()) %>%
  select(rank, nameFirst, nameLast, debut, weight, height, bmi = bmi_mice, bmi_fct_2) %>%
  flextable::flextable() %>%
  flextable::colformat_double(j = 7, digits = 2) %>%
  flextable::autofit() %>%
  flextable::theme_vanilla() %>%
  flextable::bg(i = ~ nameLast %in% c("Young", "Pickering"), bg = "lightgoldenrod") %>%
  flextable::set_caption("Top 30 BMIs in Lahman data, any debut date. Highest BMIs from Authors' and from data through 2004 highlighted.")

Top 30 BMIs in Lahman data, any debut date. Highest BMIs from Authors' and from data through 2004 highlighted.
rank	nameFirst	nameLast	debut	weight	height	bmi	bmi_fct_2
1	Alejandro	Kirk	2020-09-12	265	68	40.29	Obese
2	Pablo	Sandoval	2008-08-14	268	70	38.45	Obese
3	Prince	Fielder	2005-06-13	275	71	38.35	Obese
4	Dmitri	Young	1996-08-29	295	74	37.87	Obese
5	Tommy	Everidge	2009-07-28	275	72	37.29	Obese
6	Dan	Vogelbach	2016-09-12	270	72	36.61	Obese
7	Brayan	Pena	2005-05-23	240	69	35.44	Obese
8	Billy	Butler	2007-05-01	260	72	35.26	Obese
9	Josh	Naylor	2019-05-24	250	71	34.86	Obese
10	Carlos	Lee	1999-05-07	270	74	34.66	Obese
11	Jesus	Aguilar	2014-05-15	277	75	34.62	Obese
12	Juan	Diaz	2002-06-12	269	74	34.53	Obese
13	Edwin	Bellorin	2007-08-07	240	70	34.43	Obese
14	Kennys	Vargas	2014-08-01	290	77	34.39	Obese
15	Ji-Man	Choi	2016-04-05	260	73	34.30	Obese
16	Raul	Chavez	1996-08-30	245	71	34.17	Obese
17	Yermin	Mercedes	2020-08-02	245	71	34.17	Obese
18	Jose	Molina	1999-09-06	250	72	33.90	Obese
19	Sandy	Leon	2012-05-14	235	70	33.72	Obese
20	Tommy	Joseph	2016-05-13	255	73	33.64	Obese
21	Chadwick	Tromp	2020-07-29	221	68	33.60	Obese
22	Calvin	Pickering	1998-09-12	283	77	33.56	Obese
23	A. J.	Reed	2016-06-25	275	76	33.47	Obese
24	Deivy	Grullon	2019-09-22	240	71	33.47	Obese
25	Dayan	Viciedo	2010-06-20	240	71	33.47	Obese
26	Tomas	Telis	2014-08-25	220	68	33.45	Obese
27	Carlos	Maldonado	2006-09-08	260	74	33.38	Obese
28	Cameron	Rupp	2013-09-10	260	74	33.38	Obese
29	Yasmany	Tomas	2015-04-15	260	74	33.38	Obese
30	Harold	Ramirez	2019-05-11	232	70	33.28	Obese

Pickering falls way down the rankings to 22nd, and Dmitri Young falls to fourth. The highest BMI belongs to Alejandro Kirk at 40.29. His entry in Baseball Reference lists him at 245 pounds while the Lahman data lists 265 pounds. Browsing through photos, his weight does seem to vary.

A plot of the relationship between BMI and MLB debut year shows BMIs began to increase in 1990 and peaked in 2014. Perhaps the increasing BMIs are related to sports nutrition and performance enhancing substances rather than body fat. Players debuting after 2004 are colored blue for reference. The updated data has many more overweight players.

person_4 %>%
  mutate(post2004 = if_else(debutYear > 2004, "Post-2004", "1902-2004")) %>%
  ggplot(aes(x = debut, y = bmi_mice)) +
  geom_point(aes(color = post2004), alpha = 0.2) +
  geom_smooth(color = "gray40") + 
  geom_vline(xintercept = ym(201401), linetype = 2, color = "gray60") +
  theme_light() +
  labs(title = "BMIs increased after 1990 and peaked in 2014.")

Kaplan-Meier plots are helpful tools to get a better feel for the variable relationships to mortality. They show the proportion of the number at risk at each follow-up time (death event or end of study) who died.

person <- person_4 %>% 
  mutate(
    fu_date = coalesce(deathDate, max(person_4$deathDate, na.rm = TRUE)),
    fu_time = time_length(interval(debut, fu_date), unit = "year"),
    fu_status = if_else(is.na(deathDate), 0, 1),
    fu_time_from_birth = time_length(interval(birthDate, fu_date), unit = "year")
  ) %>%
  select(-c(bats, throws, debutMonth, height_fct, height_mice, bmi_fct, bmi_mice,
            throws_mice))

The plots below show time to death starting from the birth date. By definition, all MLB players lived into adulthood, so survival probability is 100% through around age 20. The proper way to measure mortality is for time zero to be each player’s MLB debut date. However, for these plots I just want to normalize the data to get a sense of the age at death.

plot_km <- function(x, legend_title, legend_labs) {
  ggsurvplot(
    x, 
    data = person,
    surv.median.line = "hv", 
    ggtheme = theme_light(), 
    legend.title = legend_title,
    legend.labs = legend_labs,
    xlab = "", 
    ylab = ""
  )
}

km1 <- survfit(Surv(fu_time_from_birth, fu_status) ~ cohort, data = person) 
p1 <- plot_km(km1, "Cohort", levels(person$cohort))

km2 <- survfit(Surv(fu_time_from_birth, fu_status) ~ height_fct_2, data = person)
p2 <- plot_km(km2, "Height", levels(person$height_fct_2))

km3 <- survfit(Surv(fu_time_from_birth, fu_status) ~ bmi_fct_2, data = person)
p3 <- plot_km(km3, "BMI", levels(person$bmi_fct_2))

km4 <- survfit(Surv(fu_time_from_birth, fu_status) ~ career_length_fct, data = person)
p4 <- plot_km(km4, "Career Length", levels(person$career_length_fct))

km5 <- survfit(Surv(fu_time_from_birth, fu_status) ~ throws_2, data = person)
p5 <- plot_km(km5, "Handedness", levels(person$throws_2))

p1$plot + p2$plot + p3$plot + p4$plot + p5$plot + plot_layout(nrow = 3) +
  patchwork::plot_annotation(title = "Survival probability by age, selected vars.")

Discrete Times

Discrete time models are fitted to person-year data, with one row per person per year from debut year to the year of death (or censor).

person_year_0 <-
  expand.grid(
    playerID = unique(person$playerID), 
    # Years are through 2021 instead of 2004 so I can run analysis on updated
    # data. I'll filter on yr <= 2004 for reproduction of original analysis.
    yr = c(1902:2021)
  ) %>%
  inner_join(person, by = "playerID") %>%
  filter(debutYear <= yr, coalesce(deathYear, 9999) >= yr)

The Author’s data set included 241,218 person-years for 1902 - 2004 (103 years) with 6,772 players and 3,030 deaths. The data derived from Lahman do not quite match, but are close. It includes 239,678 person-years with 6,827 distinct players and 3,067 deaths. It is worth noting that the data derived from Lahman has more players and deaths, yet less person-years. The smaller number of person-years may indicate that the Lahman data identifies more deaths than the Authors’ source.

The Authors define a few variables related to the person year periods for the regression analysis.

Age Group. Five-year age groups from 20-24 to 85+. Age is defined as the number of years between date of birth and Jan 1 of person-year.
Period. Ten-year period intervals, 1910 to 1970, plus pre-1910, and post-1970.
Status. Survival status at end of the period (0 alive or censored, 1 dead).

person_year <- person_year_0 %>%
  mutate(
    age = time_length(interval(birthDate, ymd(paste0(yr, "0101"))), "years"),
    age_bin = cut(
      age, 
      breaks = c(0, 20 + 5 * (1:13), Inf), 
      dig.lab = 4, 
      right = FALSE
    ),
    period_bin = cut(yr, breaks = c(0, 1900 + 10 * (1:7), Inf), dig.lab = 4, right = FALSE),
    status = if_else(deathYear == yr, 1L, 0L, missing = 0L)
  )

The following table summarizes the finalized data set used in the analysis, including the post 2004 data that are used in the update.

person_year %>% skimr::skim()

Data summary
Name	Piped data
Number of rows	312074
Number of columns	30
_______________________
Column type frequency:
character	4
Date	5
factor	7
numeric	14
________________________
Group variables	None

Variable type: character

skim_variable	complete_rate	min	max	n_unique
playerID	1	6	9	8419
birthCountry	1	3	14	42
nameFirst	1	2	12	1576
nameLast	1	2	14	5060

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
debut	0	1.00	1902-04-17	2021-07-30	1953-04-13	5525
finalGame	0	1.00	1902-04-26	2021-10-03	1959-09-23	4618
deathDate	126394	0.59	1905-11-23	2022-02-09	1985-02-17	3630
birthDate	0	1.00	1869-10-12	2001-03-01	1928-07-13	7620
fu_date	0	1.00	1905-11-23	2022-02-09	2011-03-10	3630

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
cohort	1	FALSE	3	Ear: 135382, Mod: 107752, Gol: 68940
career_length_fct	1	FALSE	2	<10: 226297, 10+: 85777
bmi_fct_2	1	FALSE	3	Nor: 170561, Ove: 138116, Obe: 3397
height_fct_2	1	FALSE	2	>=6: 160479, <6 : 151595
throws_2	1	FALSE	2	Rig: 267766, Lef: 44308
age_bin	1	FALSE	14	[30: 37640, [25: 37027, [35: 35238, [40: 32733
period_bin	1	FALSE	8	[19: 185461, [19: 27959, [19: 26769, [19: 24238

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
yr	0	1.00	1976.06	30.64	1902.00	1952.00	1980.00	2003.00	2021.00	▂▅▆▆▇
birthYear	0	1.00	1928.91	30.88	1869.00	1901.00	1928.00	1954.00	2001.00	▆▇▇▇▂
deathYear	126394	0.59	1984.28	23.03	1905.00	1967.00	1985.00	2004.00	2022.00	▁▂▇▇▇
weight	1801	0.99	180.86	17.89	125.00	170.00	180.00	190.00	295.00	▁▇▂▁▁
height	625	1.00	71.56	2.24	63.00	70.00	72.00	73.00	82.00	▁▃▇▁▁
debutYear	0	1.00	1952.96	30.80	1902.00	1926.00	1953.00	1978.00	2021.00	▇▇▇▆▃
finalGameYear	0	1.00	1959.43	31.73	1902.00	1932.00	1959.00	1985.00	2021.00	▆▇▇▇▅
career_length	0	1.00	6.58	5.39	0.00	1.78	5.65	10.35	31.46	▇▅▂▁▁
bmi	1858	0.99	24.79	1.79	18.46	23.67	24.41	25.80	40.29	▁▇▁▁▁
fu_time	0	1.00	46.44	15.58	0.36	36.49	48.71	58.25	80.57	▁▃▆▇▂
fu_status	0	1.00	0.59	0.49	0.00	0.00	1.00	1.00	1.00	▆▁▁▁▇
fu_time_from_birth	0	1.00	70.47	15.36	20.95	60.73	72.57	82.16	101.47	▁▂▅▇▃
age	0	1.00	46.64	16.38	16.00	33.01	44.25	58.26	101.24	▆▇▆▂▁
status	0	1.00	0.01	0.11	0.00	0.00	0.00	0.00	1.00	▇▁▁▁▁

t1 <- person_year %>%
  filter(yr <= 2004) %>%
  tbl_summary(
    label = list(
      age_bin ~ "Age Group",
      period_bin ~ "Time Period",
      cohort ~ "Cohort",
      bmi_fct_2 ~ "Body Mass Index",
      height_fct_2 ~ "Height",
      throws_2 ~ "Handedness",
      career_length_fct = "Seasons Played",
      status = "Died"
    ),
    include = c(age_bin, period_bin, cohort, height_fct_2, bmi_fct_2, 
                career_length_fct, throws_2, status)
  ) 

t2 <- person_year %>%
  filter(yr <= 2004) %>%
  replace_na(list(status = 0)) %>%
  group_by(playerID, cohort, bmi_fct_2, height_fct_2, throws_2, career_length_fct) %>%
  summarize(.groups = "drop", status = max(status)) %>%
  tbl_summary(
    label = list(
      cohort ~ "Cohort",
      bmi_fct_2 ~ "Body Mass Index",
      height_fct_2 ~ "Height",
      throws_2 ~ "Handedness",
      career_length_fct = "Seasons Played",
      status = "Died"
    ),
    include = c(cohort, height_fct_2, bmi_fct_2, career_length_fct, throws_2, status)
  )

Table 1 presents the descriptive statistics of the modeling covariates based on the pooled person-years. It is directly comparable to the Authors’ Table 1 (external). Both data sets have the most person-years at age group [30, 35), but the Lahman-based data has more person-years for younger people, aged < 45 years, and relatively more person-year records for the periods before 1940 and the Early (1902 - 1945) debut year cohort. Whereas 4.23% of Authors’ rows are in the lowest age group bin, [0,25), the Lahman-based data has many more, 15,371 (6.4%). Conversely, while 1.11% of the Authors’ rows are in the highest age group, [85,Inf), the Lahman-based data has 2,149 (0.9%). Table 1 includes a second column showing distinct player characteristics. It is evident that there are more records associated with earlier players. For example, while 2,800 (41%) of players debuted between 1902 and 1945 (Early cohort), they comprise 134,554 (56%) of the player-year rows.

Consistent with the Authors’ study, 98,446 (41%) of players-year records in the Lahman-based data are classified as overweight and 745 (0.3%) records are classified as obese. 128,764 (54%) records are less than six feet tall (54.81 in Authors’ data), and 206,020 (86%) percent are right-handed (86.47 in Authors’ data). 64,952 (27%) are career lengths of 10 years or more (23.54 in Authors’ data). The Lahman-based data does not have career rating or first-year rating measures, so those attributes are not included.

gtsummary::tbl_merge(list(t1, t2), tab_spanner = c("Person-Years", "Distinct Persons")) %>%
  gtsummary::as_flex_table() %>% 
  flextable::theme_vanilla() %>%
  flextable::bg(i = ~ is.na(stat_0_1), bg = "gray80") %>%
  flextable::set_caption("Table 1. Percentage of selected characteristics, MLB players, 1902-2004.")

Table 1. Percentage of selected characteristics, MLB players, 1902-2004.
	Person-Years	Distinct Persons
Age Group
[0,25)	15,371 (6.4%)
[25,30)	29,709 (12%)
[30,35)	30,089 (13%)
[35,40)	27,976 (12%)
[40,45)	25,633 (11%)
[45,50)	23,221 (9.7%)
[50,55)	20,576 (8.6%)
[55,60)	17,913 (7.5%)
[60,65)	15,187 (6.3%)
[65,70)	12,285 (5.1%)
[70,75)	9,349 (3.9%)
[75,80)	6,503 (2.7%)
[80,85)	3,717 (1.6%)
[85,Inf)	2,149 (0.9%)
Time Period
[0,1910)	2,216 (0.9%)
[1910,1920)	9,820 (4.1%)
[1920,1930)	15,535 (6.5%)
[1930,1940)	20,076 (8.4%)
[1940,1950)	24,238 (10%)
[1950,1960)	26,769 (11%)
[1960,1970)	27,959 (12%)
[1970,Inf)	113,065 (47%)
Cohort
Early	134,554 (56%)	2,800 (41%)
Golden	56,841 (24%)	1,280 (19%)
Modern	48,283 (20%)	2,747 (40%)
Height
<6 feet	128,764 (54%)	3,232 (47%)
>=6 feet	110,914 (46%)	3,595 (53%)
Body Mass Index
Normal	140,487 (59%)	3,727 (55%)
Overweight	98,446 (41%)	3,030 (44%)
Obese	745 (0.3%)	70 (1.0%)
Seasons Played
<10 years	174,726 (73%)	4,966 (73%)
10+ years	64,952 (27%)	1,861 (27%)
Handedness
Right Handed	206,020 (86%)	5,861 (86%)
Left Handed	33,658 (14%)	966 (14%)
Died	3,067 (1.3%)	3,067 (45%)
1n (%)

Method

The Authors used logistic regression to estimate a discrete time hazards model for risk of death between the debut date and the end of the follow-up period. They used debut date for the start date rather than the birth date because by definition all players have survived until their debut. They used a discrete time hazards model instead of a Cox proportional hazards model because the covariate estimates are easily summarized in life tables.

A discrete time logistic regression can be fit with either logit link function or a complementary log-log link function². The coefficients are the log of the hazard odds ratio in a logit link model, and the log hazard ratio in the complementary log-log link model. The coefficients in the complementary log-log link are directly comparable to those in a Cox proportional hazards model. I’m not sure which link function the Authors used. I tried both and achieved similar coefficient estimates. I settled on the complementary log-log link.

person_year_2004 <- person_year %>% filter(yr <= 2004)

# Logistic regression (lr) formulas.
fmla_1_lr <- status ~ age_bin
fmla_2_lr <- status ~ age_bin + period_bin
fmla_3_lr <- status ~ age_bin + period_bin + cohort
fmla_4_lr <- status ~ age_bin + period_bin + bmi_fct_2
fmla_5_lr <- status ~ age_bin + period_bin + height_fct_2
fmla_6_lr <- status ~ age_bin + period_bin + throws_2
fmla_9_lr <- status ~ age_bin + period_bin + career_length_fct

mdl_1_lr_2004 <- glm(fmla_1_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_2_lr_2004 <- glm(fmla_2_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_3_lr_2004 <- glm(fmla_3_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_4_lr_2004 <- glm(fmla_4_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_5_lr_2004 <- glm(fmla_5_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_6_lr_2004 <- glm(fmla_6_lr, data = person_year_2004, family = binomial(link = "cloglog"))
# No player ratings, so no mdl_7 or mdl_8.
mdl_9_lr_2004 <- glm(fmla_9_lr, data = person_year_2004, family = binomial(link = "cloglog"))

The Authors fit the following models. Models 7 and 8 are not fitted in this analysis because the Lahman data does not include player ratings.

Model 1: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin}\)
Model 2: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin}\)
Model 3: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{cohort}\)
Model 4: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin + \operatorname{bmi\_fct\_2}}\)
Model 5: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{height\_fct\_2}\)
Model 6: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{throws\_2}\)
Model 7: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{first\_year\_rating}\)
Model 8: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{career\_rating}\)
Model 9: \(\log\left[ - \log (1 - P( \operatorname{status} = \operatorname{1} ) ) \right] = \operatorname{age\_bin} + \operatorname{period\_bin} + \operatorname{career\_length\_fct}\)

Results

Table 2 presents the coefficient estimates for each model. Table 2 is is directly comparable to the Authors’ Table 2 (external).

astercize <- function(e, p) {
  aster_p <- case_when(
    p < .001 ~ "***",
    p < .01 ~ "**",
    p < .05 ~ "*",
    TRUE ~ ""
  )
  paste0(e, aster_p)
}

bind_rows(
  `Model 1` = tidy(mdl_1_lr_2004, exponentiate = FALSE),
  `Model 2` = tidy(mdl_2_lr_2004, exponentiate = FALSE),
  `Model 3` = tidy(mdl_3_lr_2004, exponentiate = FALSE),
  `Model 4` = tidy(mdl_4_lr_2004, exponentiate = FALSE),
  `Model 5` = tidy(mdl_5_lr_2004, exponentiate = FALSE),
  `Model 6` = tidy(mdl_6_lr_2004, exponentiate = FALSE),
  `Model 9` = tidy(mdl_9_lr_2004, exponentiate = FALSE),
  .id = "model_num"
) %>%
  mutate(lbl = map2_chr(scales::comma(estimate, .01), p.value, astercize)) %>%
  pivot_wider(id_cols = term, names_from = model_num, values_from = lbl) %>%
  flextable::flextable() %>%
  flextable::align(j = 2:8, align = "right") %>%
  flextable::theme_zebra() %>%
  flextable::autofit() %>%
  flextable::set_caption("Table 2. MLB hazard log odds ratio coeficients on mortality, 1902 - 2004.")

Table 2. MLB hazard log odds ratio coeficients on mortality, 1902 - 2004.
term	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 9
(Intercept)	-7.24***	-6.04***	-6.03***	-6.07***	-6.06***	-6.03***	-6.01***
age_bin[25,30)	0.61	0.70*	0.69*	0.70*	0.70*	0.70*	0.68*
age_bin[30,35)	0.64	0.81*	0.78*	0.81*	0.82*	0.81*	0.79*
age_bin[35,40)	1.01**	1.28***	1.22***	1.28***	1.28***	1.28***	1.26***
age_bin[40,45)	1.41***	1.77***	1.67***	1.77***	1.77***	1.76***	1.74***
age_bin[45,50)	2.05***	2.49***	2.35***	2.49***	2.50***	2.49***	2.47***
age_bin[50,55)	2.61***	3.12***	2.94***	3.12***	3.13***	3.12***	3.10***
age_bin[55,60)	2.85***	3.42***	3.20***	3.42***	3.44***	3.42***	3.40***
age_bin[60,65)	3.34***	3.97***	3.71***	3.97***	3.98***	3.96***	3.94***
age_bin[65,70)	3.81***	4.49***	4.20***	4.49***	4.50***	4.49***	4.46***
age_bin[70,75)	4.12***	4.85***	4.52***	4.85***	4.87***	4.85***	4.82***
age_bin[75,80)	4.55***	5.33***	4.97***	5.33***	5.36***	5.33***	5.30***
age_bin[80,85)	5.06***	5.89***	5.48***	5.90***	5.92***	5.89***	5.86***
age_bin[85,Inf)	5.58***	6.46***	6.01***	6.47***	6.50***	6.46***	6.43***
period_bin[1910,1920)		-0.49	-0.47	-0.48	-0.49	-0.49	-0.48
period_bin[1920,1930)		-0.76*	-0.69	-0.76*	-0.77*	-0.76*	-0.75*
period_bin[1930,1940)		-1.34***	-1.19**	-1.33***	-1.34***	-1.34***	-1.32***
period_bin[1940,1950)		-1.37***	-1.17**	-1.37***	-1.38***	-1.37***	-1.35***
period_bin[1950,1960)		-1.67***	-1.41***	-1.66***	-1.68***	-1.67***	-1.65***
period_bin[1960,1970)		-1.69***	-1.38***	-1.68***	-1.70***	-1.69***	-1.66***
period_bin[1970,Inf)		-2.11***	-1.66***	-2.11***	-2.15***	-2.11***	-2.09***
cohortGolden			-0.41***
cohortModern			-0.73***
bmi_fct_2Overweight				0.06
bmi_fct_2Obese				0.45
height_fct_2>=6 feet					0.07
throws_2Left Handed						-0.05
career_length_fct10+ years							-0.13**

Model 1 shows that the odds of death increase with age. Players aged 40–44 are exp(1.41) = 4.09 times as likely to die over the follow-up period as players aged 20–24 (Authors report exp(1.32) = 3.74), those aged 50–54 are exp(2.61) = 13.64 times as likely to die (Authors report exp(2.39) = 10.9), and players aged 70–74 are exp(4.12) = 61.71 times as likely to die (Authors report exp(4.00) = 54.6). The Lahman-based data produced higher mortality risk estimates at all age groups for Model 1.

Model 2 shows that, adjusting for age, more recent calendar periods are associated with lower risks of death. The greatest declines in mortality took place between the 1920–1929 and 1930–1939 periods, exp(-0.76) = 0.47 to exp(-1.34) = 0.26 (Authors report exp(-.54) = 0.58 to exp(-1.08) = 0.34), and between the 1960–1969 and 1970–2004 periods, exp(-1.69) = 0.19 to exp(-2.11) = 0.12) (Authors report exp(-1.47) = 0.23 to exp(-1.97) = 0.14). The Lahman-based data produced a slightly larger negative association between mortality risk and calendar period for the first two periods, and a smaller association for the remaining periods.

The Authors’ Model 3 shows that the coefficients for period and age are reduced but remain strong predictors of mortality after adjusting for the cohort of debut. The Lahman-based data produce a similar result. Compared to those debuting during the Early period (1902-1945), players debuting in the Golden period (1946-1968) were exp(-0.41) = 0.66 as likely to die (Authors report exp(-0.49) = 0.61), and players debuting in the Modern period (1969+) were exp(-0.73) = 0.48 as likely to die (Authors report exp(-1.02) = 0.36). The Lahman-based data produced a smaller benefit from more recent cohorts. The reduction in the period coefficients is due to the correlation between debut era and the period bins. In the plot below, all Modern era player-year records are (by definition) in the 1970+ period.

person_year_2004 %>%
  count(period_bin, cohort) %>%
  ggplot(aes(x = period_bin, y = n, fill = cohort)) +
  geom_col() +
  theme_light() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(y = "player-years", fill = NULL, 
       title = "Period and Cohort (Debut Era) Association")

Models 4, 5, and 6 estimate insignificant (at .05 level) associations between risk and obesity (exp(0.45) = 1.56, Authors report exp(.57) = 1.77), height over six feet (exp(0.07) = 1.07, Authors report exp(.03) = 1.03), and left-handedness (exp(-0.05) = 0.95, Authors report exp(-.02) = 0.98).

Finally, those who play for 10 or more years have (1 - exp(-0.13)) * 100 = 12 percent lower odds of death over the follow-up period than those who have shorter careers (Model 9). The Authors report 14 percent lower odds.

Table 3 presents player life expectancy for the most recent period, 1970-2004, based on the hazard coefficients from Model 2. Table 3 is directly comparable to the Authors’ Table 3. I tried to follow the methodology presented by Measure Evaluation with further explanation from page 442 of The Methods and Materials of Demography (Shryock, et al, 1980), but was not quite successful. Some notes on column definitions:

mx is the age-specific central death rate (the annual mortality rate), \(m_x = 1 / (1 + e^{-z})\) where z is sum of the model coefficients from model 2 for period 1970 - 2004: intercept = -6.04, period 1970 - 2004 = -2.11 and beta = [value from age group]. E.g., \(m_{25-30} = 1 / (1 + e^{-(-6.04 + -2.11 + 0.70)}) = 0.0006\).
nQx is the five-year mortality rate, given by \(\frac{5 \cdot m_x}{1 + g_x m_x}\) where \(g_x\) is a factor measuring the average number of years lived within the age group and is initially assumed by be 2.5 years. This is from equation 25 in Shryock et al (1980]).
1 - nQx is the complement of nQx, an intermediate calc. It is the five-year survival rate. This is where I depart from the Authors. They report 1.0000 for the age 85+ group.
cum(.) is the cumulative product of 1 - nQx. Whereas 1 - nQx is an instantaneous probability, cum(.) is a cumulative probability. It is the probability of living to that age, then living another five years.
lx is the number of persons alive at the beginning of the age interval. It equals the prior cum(.) * 100,000. The 100,000 normalizes the presentation to a hypothetical population of 100,000.
ndx is the number of persons dying during the age interval, lx * nQx.
Lx is the number of person-years in the age interval, (lx - ndx) * 5. It is the average population size.
Tx is the cumulative sum of Lx, moving from bottom to top. It is the total number of person-years that would be lived for the age cohort if the cohort were to progress through the remainder of the life table.
ex is the average remaining life time, Tx / lx.

tdy_2 <- mdl_2_lr_2004 %>% broom::tidy()

life_tbl_period <- tibble(
  age_bin = rep(levels(person_year$age_bin), 8),
  age_bin_term = paste0("age_bin", age_bin),
  period_bin = map(levels(person_year$period_bin), ~rep(.x, 14)) %>% unlist(),
  period_bin_term = paste0("period_bin", period_bin),
  intercept = coef(mdl_2_lr_2004)["(Intercept)"],
  # period_bin_1970_Inf = coef(mdl_2_lr_2004)["period_bin[1970,Inf)"],
  authors_e = rep(authors_e, 8),
  us_male_e = rep(us_male_e, 8)
) %>%
  left_join(tdy_2 %>% select(age_bin_term = term, age_bin_est = estimate), 
            by = "age_bin_term") %>%
  left_join(tdy_2 %>% select(period_bin_term = term, period_bin_est = estimate), 
            by = "period_bin_term") %>%
  replace_na(list(age_bin_est = 0, period_bin_est = 0)) %>%
  group_by(period_bin) %>%
  mutate(
    mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
    nQx = (5 * mx) / (1 + 2.5 * mx),
    nQx_comp = 1 - nQx,
    nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
    lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
    ndx = lx * nQx,
    Lx = (lx - ndx) * 5
  ) %>%
  arrange(period_bin, desc(age_bin)) %>%
  mutate(
    Tx = cumsum(Lx),
    ex = Tx / lx
  ) %>%
  ungroup() %>%
  arrange(period_bin, age_bin) %>%
  select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex, 
         authors_e, us_male_e)

life_flex <- function(dat) {
  dat %>%
    select(-c(intercept, age_bin_est, period_bin_est)) %>%
    flextable::flextable() %>%
    flextable::set_header_labels(
      age_bin = "Age Group",
      # age_bin_est = "Beta",
      nQx_comp = "1 - nQx",
      nQx_cumprod = "cum(.)",
      us_male_e = "U.S. Males",
      authors_e = "Authors"
    ) %>%
    flextable::colformat_double(j = c(2:5), digits = 4) %>%
    flextable::colformat_double(j = c(6:9), digits = 0, big.mark = ",") %>%
    flextable::colformat_double(j = c(10:12), digits = 2) %>%
    flextable::theme_zebra() %>%
    flextable::add_header_row(values = c("", "Life Expectancies"), colwidths = c(9, 3)) %>%
    flextable::border(j = 9, border.right = officer::fp_border(color = "gray60"))
}

life_tbl_period %>%
  filter(period_bin == "[1970,Inf)") %>%
  select(-period_bin) %>%
  life_flex() %>%
  flextable::set_caption("Table 3. Life Table, period 1970-2004 from Lahman-based data through 2004.")

Table 3. Life Table, period 1970-2004 from Lahman-based data through 2004.
	Life Expectancies
[0,25)	0.0003	0.0014	0.9986	0.9986	100,000	144	499,281	5,631,495	56.31	58.11	53.25
[25,30)	0.0006	0.0029	0.9971	0.9957	99,856	289	497,838	5,132,214	51.40	53.19	48.67
[30,35)	0.0006	0.0032	0.9968	0.9939	99,568	323	496,224	4,634,376	46.55	48.31	44.10
[35,40)	0.0010	0.0051	0.9949	0.9916	99,388	511	494,384	4,138,152	41.64	43.45	39.57
[40,45)	0.0017	0.0084	0.9916	0.9865	99,163	829	491,670	3,643,768	36.75	38.62	35.09
[45,50)	0.0035	0.0172	0.9828	0.9746	98,654	1,692	484,807	3,152,098	31.95	33.90	30.66
[50,55)	0.0065	0.0320	0.9680	0.9514	97,463	3,115	471,737	2,667,290	27.37	29.28	26.37
[55,60)	0.0088	0.0428	0.9572	0.9266	95,143	4,074	455,345	2,195,553	23.08	25.01	22.30
[60,65)	0.0150	0.0721	0.9279	0.8882	92,658	6,682	429,881	1,740,209	18.78	20.98	18.53
[65,70)	0.0249	0.1174	0.8826	0.8190	88,815	10,427	391,942	1,310,328	14.75	17.25	15.12
[70,75)	0.0354	0.1626	0.8374	0.7391	81,895	13,316	342,897	918,386	11.21	13.90	12.05
[75,80)	0.0563	0.2467	0.7533	0.6309	73,909	18,230	278,396	575,489	7.79	10.97	9.39
[80,85)	0.0945	0.3822	0.6178	0.4654	63,085	24,112	194,868	297,092	4.71	8.35	7.12
[85,Inf)	0.1558	0.5607	0.4393	0.2714	46,541	26,096	102,225	102,225	2.20	6.70	5.31

Columns “U.S. Males” and “Authors” are taken directly from the Authors’ Table 3. The Authors found higher life expectancies for players relative to the general U.S. male population at all ages, with the advantage diminishing with age. For example, from the Authors Table 3, a 20-year-old MLB players can expect to live 58.11 additional years in the period 1970 - 2004, 4.86 years more than the 53.25 years for comparable males in the U.S. population. Table 3 estimated from the Lahman-based data shows a more modest advantage. A 20-year-old MLB player can expect to live 56.31 additional years, only 3.06 years more than comparable males in the U.S. population. The advantage diminishes, then becomes a disadvantage for 65-70 year old MLB players.

Figure 1 illustrates life expectancies for calendar periods (Panel A) calculated from Model 2 in Table 2 and cohorts (Panel B) from Model 3³. Figure 1 is directly comparable to the Authors’ Figure 1. It shows that each period ushers in higher life expectancies for those alive at the outset. Life expectancies at age 20 were 40.3 years (Authors report 45.7 years) among those alive between 1910 and 1919, 48.9 years (Authors report 53.6 years) among those alive between 1930 and 1939, 52.2 years (Authors report 56.3 years) among those alive between 1950 and 1959, and 56.3 years (Authors report 59.6 years) among those alive between 1970 and 2004. Panel B show life expectancies by debut era. The Authors report 20 year-old players from the Modern Era can expect to live 65.5 years, compared to 52.4 years from the Early Era and 58.3 from the Golden Era. My Panel B does not show that, I think because of the collinearity between era and time period in the model.

# Panel B is constructed from the life table derived from model 3. Repeat the 
# process.
tdy_3 <- mdl_3_lr_2004 %>% broom::tidy()

life_tbl_cohort <- tibble(
  age_bin = rep(levels(person_year$age_bin), 3),
  age_bin_term = paste0("age_bin", age_bin),
  cohort = map(levels(person_year$cohort), ~rep(.x, 14)) %>% unlist(),
  cohort_term = paste0("cohort", cohort),
  intercept = coef(mdl_3_lr_2004)["(Intercept)"],
  us_male_e = rep(us_male_e, 3),
  authors_e = rep(authors_e, 3)
) %>%
  left_join(tdy_3 %>% select(age_bin_term = term, age_bin_est = estimate), 
            by = "age_bin_term") %>%
  left_join(tdy_3 %>% select(cohort_term = term, cohort_est = estimate), 
            by = "cohort_term") %>%
  replace_na(list(age_bin_est = 0, cohort_est = 0)) %>%
  group_by(cohort) %>%
  mutate(
    mx = 1 / (1 + exp(-(intercept + cohort_est + age_bin_est))),
    nQx = (5 * mx) / (1 + 2.5 * mx),
    nQx_comp = 1 - nQx,
    nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
    lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
    ndx = lx * nQx,
    Lx = (lx - ndx) * 5
  ) %>%
  arrange(cohort, desc(age_bin)) %>%
  mutate(
    Tx = cumsum(Lx),
    ex = Tx / lx
  ) %>%
  ungroup() %>%
  arrange(cohort, age_bin) %>%
  select(cohort, age_bin, intercept, age_bin_est, cohort_est, mx:ex, us_male_e,
         authors_e)

fig1a <- life_tbl_period %>%
  filter(period_bin %in% c("[1910,1920)", "[1930,1940)", "[1950,1960)", "[1970,Inf)")) %>%
  ggplot(aes(x = age_bin, y = ex, group = period_bin, color = fct_rev(period_bin))) +
  geom_line() +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(subtitle = "A. Period", color = NULL, x = "Age", y = "e(x)")

fig1b <- life_tbl_cohort %>%
  # filter(period_bin %in% c("[1910,1920)", "[1930,1940)", "[1950,1960)", "[1970,Inf)")) %>%
  ggplot(aes(x = age_bin, y = ex, group = cohort, color = fct_rev(cohort))) +
  geom_line() +
  theme_light() +
  theme(axis.text.x = element_text(angle = 90)) +
  labs(subtitle = "B. Cohort", color = NULL, x = "Age", y = "e(x)")

library(patchwork)
fig1a / fig1b + patchwork::plot_layout(nrow = 2) +
  patchwork::plot_annotation(title = "Figure 1.")

2022 Update

At the time of this analysis, the Lahman R package contains player debuts through Jul 2021. How do the results change with 17 years of additional data? Table 4 presents the model fits with the updated data.

mdl_1_lr_2022 <- glm(fmla_1_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_2_lr_2022 <- glm(fmla_2_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_3_lr_2022 <- glm(fmla_3_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_4_lr_2022 <- glm(fmla_4_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_5_lr_2022 <- glm(fmla_5_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_6_lr_2022 <- glm(fmla_6_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_9_lr_2022 <- glm(fmla_9_lr, data = person_year, family = binomial(link = "cloglog"))

bind_rows(
  `Model 1` = tidy(mdl_1_lr_2022, exponentiate = FALSE),
  `Model 2` = tidy(mdl_2_lr_2022, exponentiate = FALSE),
  `Model 3` = tidy(mdl_3_lr_2022, exponentiate = FALSE),
  `Model 4` = tidy(mdl_4_lr_2022, exponentiate = FALSE),
  `Model 5` = tidy(mdl_5_lr_2022, exponentiate = FALSE),
  `Model 6` = tidy(mdl_6_lr_2022, exponentiate = FALSE),
  `Model 9` = tidy(mdl_9_lr_2022, exponentiate = FALSE),
  .id = "model_num"
) %>%
  mutate(lbl = map2_chr(scales::comma(estimate, .01), p.value, astercize)) %>%
  pivot_wider(id_cols = term, names_from = model_num, values_from = lbl) %>%
  flextable::flextable() %>%
  flextable::align(j = 2:8, align = "right") %>%
  flextable::theme_zebra() %>%
  flextable::autofit() %>%
  flextable::set_caption("Table 4. MLB hazard log odds ratio coefficients on mortality, 1902 - 2021.")

Table 4. MLB hazard log odds ratio coefficients on mortality, 1902 - 2021.
term	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 9
(Intercept)	-7.24***	-5.92***	-5.90***	-5.94***	-5.92***	-5.91***	-5.88***
age_bin[25,30)	0.41	0.53	0.52	0.53	0.53	0.53	0.52
age_bin[30,35)	0.48	0.68*	0.65*	0.68*	0.68*	0.68*	0.66*
age_bin[35,40)	0.84**	1.13***	1.07***	1.13***	1.13***	1.13***	1.11***
age_bin[40,45)	1.22***	1.60***	1.50***	1.60***	1.60***	1.60***	1.58***
age_bin[45,50)	1.89***	2.35***	2.21***	2.35***	2.35***	2.35***	2.33***
age_bin[50,55)	2.40***	2.93***	2.75***	2.94***	2.93***	2.93***	2.91***
age_bin[55,60)	2.65***	3.25***	3.02***	3.26***	3.25***	3.25***	3.23***
age_bin[60,65)	3.17***	3.82***	3.54***	3.83***	3.82***	3.82***	3.80***
age_bin[65,70)	3.61***	4.32***	3.99***	4.33***	4.32***	4.32***	4.30***
age_bin[70,75)	3.98***	4.75***	4.36***	4.76***	4.75***	4.75***	4.72***
age_bin[75,80)	4.37***	5.21***	4.75***	5.22***	5.21***	5.20***	5.18***
age_bin[80,85)	4.89***	5.79***	5.29***	5.80***	5.79***	5.79***	5.76***
age_bin[85,Inf)	5.51***	6.47***	5.91***	6.48***	6.47***	6.47***	6.44***
period_bin[1910,1920)		-0.48	-0.45	-0.47	-0.48	-0.48	-0.47
period_bin[1920,1930)		-0.73*	-0.65	-0.73*	-0.73*	-0.73*	-0.73*
period_bin[1930,1940)		-1.29***	-1.15**	-1.30***	-1.29***	-1.30***	-1.28***
period_bin[1940,1950)		-1.33***	-1.11**	-1.34***	-1.33***	-1.33***	-1.32***
period_bin[1950,1960)		-1.65***	-1.34***	-1.65***	-1.65***	-1.65***	-1.63***
period_bin[1960,1970)		-1.69***	-1.31***	-1.69***	-1.69***	-1.68***	-1.66***
period_bin[1970,Inf)		-2.30***	-1.62***	-2.30***	-2.30***	-2.30***	-2.27***
cohortGolden			-0.45***
cohortModern			-0.94***
bmi_fct_2Overweight				0.06
bmi_fct_2Obese				0.34
height_fct_2>=6 feet					0.00
throws_2Left Handed						-0.05
career_length_fct10+ years							-0.13***

Comparing Table 4 to Table 2,

the mortality risk is lower for all age groups (Model 1). For example, players aged 40–44 were exp(1.41) = 4.09 times as likely to die as those aged 0-20 in the 2004 data set are now estimated to be exp(1.22) = 3.40 times as likely to die.
The mortality risk during each period is about the same (Model 2). For example, players alive in the 1920s were exp(-0.76) = 0.47 times as likely to die as those alive in the 1900s in the 2004 data set are now exp(-0.73) = 0.48 times as likely to die
The mortality risk of debut era is lower (Model 3). For example, players debuting in the Modern period (1969+) were exp(-0.73) = 0.48 as likely to die as those in the Early period in the 2004 data set are now estimated to be exp(-0.94) = 0.39 as likely to die.
Obesity (Model 4), height (Model 5), and handedness (Model 6) continue to be statistically not significant at the \(\alpha\) = .05 level.
The effect of a longer career length is unchanged (Model 9). Players playinger 10+ seasons were exp(-0.13) = 0.88 as likely to die as those in those playing under ten years in the 2004 data set are now estimated to be exp(-0.13) = 0.88 as likely to die.

Table 5 presents an updated life table.

tdy_2022 <- mdl_2_lr_2022 %>% broom::tidy()

life_tbl_2022 <- tibble(
  age_bin = rep(levels(person_year$age_bin), 8),
  age_bin_term = paste0("age_bin", age_bin),
  period_bin = map(levels(person_year$period_bin), ~rep(.x, 14)) %>% unlist(),
  period_bin_term = paste0("period_bin", period_bin),
  intercept = coef(mdl_2_lr_2022)["(Intercept)"],
  # period_bin_1970_Inf = coef(mdl_2_lr_2004)["period_bin[1970,Inf)"],
  authors_e = rep(authors_e, 8),
  us_male_e = rep(us_male_e, 8)
) %>%
  left_join(tdy_2022 %>% select(age_bin_term = term, age_bin_est = estimate), 
            by = "age_bin_term") %>%
  left_join(tdy_2022 %>% select(period_bin_term = term, period_bin_est = estimate), 
            by = "period_bin_term") %>%
  replace_na(list(age_bin_est = 0, period_bin_est = 0)) %>%
  group_by(period_bin) %>%
  mutate(
    mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
    nQx = (5 * mx) / (1 + 2.5 * mx),
    nQx_comp = 1 - nQx,
    nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
    lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
    ndx = lx * nQx,
    Lx = (lx - ndx) * 5
  ) %>%
  arrange(period_bin, desc(age_bin)) %>%
  mutate(
    Tx = cumsum(Lx),
    ex = Tx / lx
  ) %>%
  ungroup() %>%
  arrange(period_bin, age_bin) %>%
  select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex, 
         authors_e, us_male_e)

life_tbl_2022 %>%
  filter(period_bin == "[1970,Inf)") %>%
  select(-period_bin) %>%
  life_flex() %>%
  flextable::set_caption("Table 5. Life Table, period 1970-2021 from Lahman-based data through 2021.")

Table 5. Life Table, period 1970-2021 from Lahman-based data through 2021.
	Life Expectancies
[0,25)	0.0003	0.0014	0.9986	0.9986	100,000	135	499,324	5,783,063	57.83	58.11	53.25
[25,30)	0.0005	0.0023	0.9977	0.9964	99,865	229	498,178	5,283,739	52.91	53.19	48.67
[30,35)	0.0005	0.0027	0.9973	0.9950	99,636	266	496,848	4,785,561	48.03	48.31	44.10
[35,40)	0.0008	0.0042	0.9958	0.9932	99,504	417	495,438	4,288,713	43.10	43.45	39.57
[40,45)	0.0013	0.0067	0.9933	0.9892	99,316	663	493,263	3,793,275	38.19	38.62	35.09
[45,50)	0.0028	0.0140	0.9860	0.9794	98,917	1,388	487,645	3,300,012	33.36	33.90	30.66
[50,55)	0.0051	0.0250	0.9750	0.9614	97,939	2,445	477,472	2,812,367	28.72	29.28	26.37
[55,60)	0.0069	0.0341	0.9659	0.9418	96,136	3,278	464,289	2,334,895	24.29	25.01	22.30
[60,65)	0.0122	0.0593	0.9407	0.9086	94,179	5,584	442,978	1,870,606	19.86	20.98	18.53
[65,70)	0.0200	0.0952	0.9048	0.8511	90,864	8,654	411,047	1,427,628	15.71	17.25	15.12
[70,75)	0.0303	0.1410	0.8590	0.7772	85,112	12,001	365,551	1,016,582	11.94	13.90	12.05
[75,80)	0.0470	0.2104	0.7896	0.6782	77,718	16,353	306,822	651,031	8.38	10.97	9.39
[80,85)	0.0814	0.3381	0.6619	0.5226	67,824	22,932	224,460	344,209	5.08	8.35	7.12
[85,Inf)	0.1486	0.5417	0.4583	0.3033	52,261	28,312	119,749	119,749	2.29	6.70	5.31

The life expectancy estimates are higher for all ages and are closer to the Authors’ results. The Authors estimated a 20-year-old MLB player expect to live 58.11 additional years in the period 1970 - 2004, 4.86 years more than the 53.25 years for comparable males in the U.S. population. Table 3 estimated from the Lahman-based data through 2004 estimated 56.31 additional years. Table 5 estimated from the Lahman-based data through 2021 estimated 57.83 additional years.

Cox Proportional Hazards

The Authors fit a logistic regression model because it enabled them to present their results in life tables. It is more common to see studies fit a Cox proportional hazards model.

Recall that the Cox proportional hazards model expresses mortality hazard as an unspecified baseline hazard multiplied by the exponential of a linear combination of the explanatory variables, \(h(t) = h_0(t) \cdot e^{X\beta}\). \(h_0(t)\) is estimated from the data by maximum likelihood estimation. The proportionality of the model comes from the lack of time dependence in the explanatory variables. Whereas logistic regression predicts the log odds of the response variable, Cox regression predicts the log relative hazard, \(\ln \left[ \frac{h(t)}{h_0(t)} \right] = X\beta\) (relative to the baseline). A change in \(X\) from \(x_a\) to \(x_b\) is associated with a \((x_b - x_a)\beta = \ln \left[ \frac{h_b(t)}{h_0(t)} \right] - \ln \left[ \frac{h_a(t)}{h_0(t)} \right] = \ln \left[ \frac{h_b(t)}{h_a(t)} \right]\) change in the log relative hazard. When \(x\) is a factor variable, \(x_b - x_a\) is a factor level change and the antilog of \(\beta\) is the hazard ratio of level \(b\) relative to level \(a\), \(e^\beta = \frac{h_b(t)}{h_a(t)}\).

There is no need to develop person-year rows for Cox regression. Instead, define a variable for the follow-up time (at death last observation time) from birth (in years), and status (dead or censored). However, given the broad expanse of time covered in the data, the model should control for changes in medical technology. That was the role of the Authors’ Period covariate. And because we are estimating life expectancy by age group, the model needs a time dependent covariate for age.

cox_dat_0 <- person %>%
  mutate(
    debut_age = time_length(interval(birthDate, debut), unit = "year"),
    # follow-up time is earlier of two dates: death or data set date.
    fu_date = coalesce(deathDate, max(person$deathDate, na.rm = TRUE)),
    fu_time = time_length(interval(debut, fu_date), unit = "year"),
    fu_status = if_else(is.na(deathDate), 0, 1),
    # Decade indicators are instrumental variables capturing general advances
    # in life-extending medical technology. *_time = years from debut to start of
    # decade. If debut is in middle of a decade, then *_time is 0 years.
    pd1900_status = "[1900,1910)",
    pd1900_time = pmax(time_length(interval(debut, ym(190001)), unit = "year"), 0),
    pd1900_time = if_else(fu_date >= ym(190001) & debutYear < 1910, pd1900_time, NA_real_),
    pd1910_status = "[1910,1920)",
    pd1910_time = pmax(time_length(interval(debut, ym(191001)), unit = "year"), 0),
    pd1910_time = if_else(fu_date >= ym(191001) & debutYear < 1920, pd1910_time, NA_real_),
    pd1920_status = "[1920,1930)",
    pd1920_time = pmax(time_length(interval(debut, ym(192001)), unit = "year"), 0),
    pd1920_time = if_else(fu_date >= ym(192001) & debutYear < 1930, pd1920_time, NA_real_),
    pd1930_status = "[1930,1940)",
    pd1930_time = pmax(time_length(interval(debut, ym(193001)), unit = "year"), 0),
    pd1930_time = if_else(fu_date >= ym(193001) & debutYear < 1940, pd1930_time, NA_real_),
    pd1940_status = "[1940,1950)",
    pd1940_time = pmax(time_length(interval(debut, ym(194001)), unit = "year"), 0),
    pd1940_time = if_else(fu_date >= ym(194001) & debutYear < 1950, pd1940_time, NA_real_),
    pd1950_status = "[1950,1960)",
    pd1950_time = pmax(time_length(interval(debut, ym(195001)), unit = "year"), 0),
    pd1950_time = if_else(fu_date >= ym(195001) & debutYear < 1960, pd1950_time, NA_real_),
    pd1960_status = "[1960,1970)",
    pd1960_time = pmax(time_length(interval(debut, ym(196001)), unit = "year"), 0),
    pd1960_time = if_else(fu_date >= ym(196001) & debutYear < 1970, pd1960_time, NA_real_),
    pd1970_status = "[1970,Inf)",
    pd1970_time = pmax(time_length(interval(debut, ym(197001)), unit = "year"), 0),
    pd1970_time = if_else(fu_date >= ym(197001) & debutYear < 9999, pd1970_time, NA_real_),
    age00_status = "[0,25)",
    age00_time = time_length(interval(debut, birthDate + years(0)), unit = "year"),
    age00_time = if_else(fu_date >= birthDate + years(0) & age00_time >= -25, pmax(age00_time, 0), NA_real_),
    age25_status = "[25,30)",
    age25_time = time_length(interval(debut, birthDate + years(25)), unit = "year"),
    age25_time = if_else(fu_date >= birthDate + years(25) & age25_time >= -5, pmax(age25_time, 0), NA_real_),
    age30_status = "[30,35)",
    age30_time = time_length(interval(debut, birthDate + years(30)), unit = "year"),
    age30_time = if_else(fu_date >= birthDate + years(30) & age30_time >= -5, pmax(age30_time, 0), NA_real_),
    age35_status = "[35,40)",
    age35_time = time_length(interval(debut, birthDate + years(35)), unit = "year"),
    age35_time = if_else(fu_date >= birthDate + years(35) & age35_time >= -5, pmax(age35_time, 0), NA_real_),
    age40_status = "[40,45)",
    age40_time = time_length(interval(debut, birthDate + years(40)), unit = "year"),
    age40_time = if_else(fu_date >= birthDate + years(40) & age40_time >= -5, pmax(age40_time, 0), NA_real_),
    age45_status = "[45,50)",
    age45_time = time_length(interval(debut, birthDate + years(45)), unit = "year"),
    age45_time = if_else(fu_date >= birthDate + years(45) & age45_time >= -5, pmax(age45_time, 0), NA_real_),
    age50_status = "[50,55)",
    age50_time = time_length(interval(debut, birthDate + years(50)), unit = "year"),
    age50_time = if_else(fu_date >= birthDate + years(50) & age50_time >= -5, pmax(age50_time, 0), NA_real_),
    age55_status = "[55,60)",
    age55_time = time_length(interval(debut, birthDate + years(55)), unit = "year"),
    age55_time = if_else(fu_date >= birthDate + years(55) & age55_time >= -5, pmax(age55_time, 0), NA_real_),
    age60_status = "[60,65)",
    age60_time = time_length(interval(debut, birthDate + years(60)), unit = "year"),
    age60_time = if_else(fu_date >= birthDate + years(60) & age60_time >= -5, pmax(age60_time, 0), NA_real_),
    age65_status = "[65,70)",
    age65_time = time_length(interval(debut, birthDate + years(65)), unit = "year"),
    age65_time = if_else(fu_date >= birthDate + years(65) & age65_time >= -5, pmax(age65_time, 0), NA_real_),
    age70_status = "[70,75)",
    age70_time = time_length(interval(debut, birthDate + years(70)), unit = "year"),
    age70_time = if_else(fu_date >= birthDate + years(70) & age70_time >= -5, pmax(age70_time, 0), NA_real_),
    age75_status = "[75,80)",
    age75_time = time_length(interval(debut, birthDate + years(75)), unit = "year"),
    age75_time = if_else(fu_date >= birthDate + years(75) & age75_time >= -5, pmax(age75_time, 0), NA_real_),
    age80_status = "[80,85)",
    age80_time = time_length(interval(debut, birthDate + years(80)), unit = "year"),
    age80_time = if_else(fu_date >= birthDate + years(80) & age80_time >= -5, pmax(age80_time, 0), NA_real_),
    age85_status = "[85,Inf)",
    age85_time = time_length(interval(debut, birthDate + years(85)), unit = "year"),
    age85_time = if_else(fu_date >= birthDate + years(85) & age85_time >= -5, pmax(age85_time, 0), NA_real_)
  )

# Don't need the newly created period vars in the final data set.
cox_dat_1 <- tmerge(
  cox_dat_0 %>% select(-starts_with("pd"), -starts_with("age")), 
  cox_dat_0, 
  id = playerID, 
  death = event(fu_time, fu_status)
)

# Create the decade and age vars
cox_dat <- tmerge(
  cox_dat_1, cox_dat_0, id = playerID, 
  decade1900 = tdc(pd1900_time, pd1900_status),
  decade1910 = tdc(pd1910_time, pd1910_status),
  decade1920 = tdc(pd1920_time, pd1920_status),
  decade1930 = tdc(pd1930_time, pd1930_status),
  decade1940 = tdc(pd1940_time, pd1940_status),
  decade1950 = tdc(pd1950_time, pd1950_status),
  decade1960 = tdc(pd1960_time, pd1960_status),
  decade1970 = tdc(pd1970_time, pd1970_status),
  isage00 = tdc(age00_time, age00_status),
  isage25 = tdc(age25_time, age25_status),
  isage30 = tdc(age30_time, age30_status),
  isage35 = tdc(age35_time, age35_status),
  isage40 = tdc(age40_time, age40_status),
  isage45 = tdc(age45_time, age45_status),
  isage50 = tdc(age50_time, age50_status),
  isage55 = tdc(age55_time, age55_status),
  isage60 = tdc(age60_time, age60_status),
  isage65 = tdc(age65_time, age65_status),
  isage70 = tdc(age70_time, age70_status),
  isage75 = tdc(age75_time, age75_status),
  isage80 = tdc(age80_time, age80_status),
  isage85 = tdc(age85_time, age85_status)
) %>%
  mutate(decade = factor(coalesce(decade1970, decade1960, decade1950, decade1940, 
                                  decade1930, decade1920, decade1910, decade1900)),
         age = factor(coalesce(isage85, isage80, isage75, isage70, isage65, isage60, 
                                   isage55, isage50, isage45, isage40, isage35, isage30, 
                                   isage25, isage00))) %>%
  select(-starts_with("decade19"), -starts_with("isage")) %>%
  filter(!is.na(decade), !is.na(age))

The following table summarizes the models.

mdl_1_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age, data = cox_dat)
mdl_2_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade, data = cox_dat)
mdl_3_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + cohort, data = cox_dat)
mdl_4_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + bmi_fct_2, data = cox_dat)
mdl_5_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + height_fct_2, data = cox_dat)
mdl_6_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + throws_2, data = cox_dat)
mdl_9_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + career_length_fct, data = cox_dat)

# For text description.
gt_1_cox_2022 <- tbl_regression(mdl_1_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_2_cox_2022 <- tbl_regression(mdl_2_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_3_cox_2022 <- tbl_regression(mdl_3_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_4_cox_2022 <- tbl_regression(mdl_4_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_5_cox_2022 <- tbl_regression(mdl_5_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_6_cox_2022 <- tbl_regression(mdl_6_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_9_cox_2022 <- tbl_regression(mdl_9_cox_2022, exponentiate = TRUE) %>% 
  add_significance_stars(hide_se = TRUE)

# For table.
gt_1_cox_2022_t <- tbl_regression(mdl_1_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_2_cox_2022_t <- tbl_regression(mdl_2_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_3_cox_2022_t <- tbl_regression(mdl_3_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_4_cox_2022_t <- tbl_regression(mdl_4_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_5_cox_2022_t <- tbl_regression(mdl_5_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_6_cox_2022_t <- tbl_regression(mdl_6_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)
gt_9_cox_2022_t <- tbl_regression(mdl_9_cox_2022, exponentiate = FALSE) %>% 
  add_significance_stars(hide_se = TRUE)

gtsummary::tbl_merge(
 list(gt_1_cox_2022_t, gt_2_cox_2022_t, gt_3_cox_2022_t, gt_4_cox_2022_t, 
      gt_5_cox_2022_t, gt_6_cox_2022_t, gt_9_cox_2022_t), 
 tab_spanner = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5", 
                 "Model 6", "Model 9")
)

Characteristic	Model 1	Model 2	Model 3	Model 4	Model 5	Model 6	Model 9
Characteristic	log(HR)^1,2	log(HR)^1,2	log(HR)^1,2	log(HR)^1,2	log(HR)^1,2	log(HR)^1,2	log(HR)^1,2
age
[0,25)	—	—	—	—	—	—	—
[25,30)	-0.33	-0.15	-0.14	-0.16	-0.15	-0.16	-0.20
[30,35)	-0.14	0.01	0.03	0.01	0.01	0.01	-0.08
[35,40)	-0.18	-0.03	-0.01	-0.04	-0.03	-0.03	-0.16
[40,45)	-0.05	0.07	0.10	0.06	0.07	0.07	-0.10
[45,50)	0.71	0.77	0.78	0.76	0.77	0.77	0.56
[50,55)	1.3*	1.3*	1.3*	1.3*	1.3*	1.3*	1.0
[55,60)	1.6**	1.6**	1.5**	1.5**	1.6**	1.6**	1.3*
[60,65)	1.9***	1.9***	1.8***	1.9***	1.9***	1.9***	1.6**
[65,70)	2.4***	2.3***	2.3***	2.3***	2.4***	2.3***	2.0***
[70,75)	2.8***	2.8***	2.6***	2.7***	2.8***	2.7***	2.3***
[75,80)	3.0***	3.0***	2.9***	3.0***	3.0***	3.0***	2.5***
[80,85)	3.5***	3.5***	3.4***	3.5***	3.5***	3.5***	3.0***
[85,Inf)	3.9***	3.9***	3.8***	3.9***	3.9***	3.9***	3.3***
decade
[1900,1910)		—	—	—	—	—	—
[1910,1920)		-0.52	-0.50	-0.52	-0.52	-0.52	-0.53
[1920,1930)		-0.86*	-0.78*	-0.86*	-0.86*	-0.86*	-0.86*
[1930,1940)		-1.4***	-1.3***	-1.4***	-1.4***	-1.4***	-1.4***
[1940,1950)		-1.5***	-1.3***	-1.5***	-1.5***	-1.5***	-1.5***
[1950,1960)		-1.8***	-1.5***	-1.8***	-1.8***	-1.8***	-1.8***
[1960,1970)		-1.8***	-1.5***	-1.8***	-1.9***	-1.8***	-1.8***
[1970,Inf)		-2.5***	-1.8***	-2.5***	-2.5***	-2.5***	-2.5***
cohort
Early			—
Golden			-0.41***
Modern			-0.86***
bmi_fct_2
Normal				—
Overweight				0.08*
Obese				0.36
height_fct_2
<6 feet					—
>=6 feet					0.01
throws_2
Right Handed						—
Left Handed						-0.05
career_length_fct
<10 years							—
10+ years							-0.23***
¹ p<0.05; p<0.01; **p<0.001 ² HR = Hazard Ratio

Model 1 shows that the odds of death increase with age after age 50. Players aged 50–54 are 3.56 (95% CI 1.19, 10.7; p=0.023) times as likely to die over the follow-up period as players aged 20–24 (Authors report exp(2.39) = 10.91). Whereas the discrete time logistic regression estimated a monotonically increasing mortality risk with statistically significant (.05 level) coefficients beginning at age bin 35-39, the Lahman-based data produced the lowest risk at the age 25-29 age range. The difference may be due to the model precision. Players aged 0-25 are not equally distributed in age. The average debut age was 24.3; most debut closer age 25. Older age groups are more likely to have players equally distributed within the 5 year age bucket. In effect, the discrete time model was comparing five-year mortality risk to a reference group of 24 year-olds surviving to age 25. The discrete time analysis does not account for time spent in an age interval, Cox does.

Model 2 shows that, adjusting for age, more recent calendar periods are associated with lower risks of death. The 1920–1929 decade mortality risk was 0.42 (95% CI 0.20, 0.89; p=0.024) times that of the 1900-1909 decade (Authors report exp(-.54) = 0.58). The Lahman-based data produced a slightly stronger negative association between mortality risk and calendar period.

Model 3 shows that era of debut is negatively associated with mortality risk. A player debuting in the Modern era faced a mortality risk 0.42 (95% CI 0.36, 0.50; p<0.001) times that of an Early era player (Authors report exp(-1.02) = 0.36). The Lahman-based data produced a slightly weaker negative association between mortality risk and cohort.

Model 4 estimated a statistically significant increased mortality risk associated with being overweight. The Authors did not measure a statistically significant relationship. An overweight player’s mortality risk is 1.08 (95% CI 1.01, 1.15; p=0.020) that of a player of normal weight. The Authors report exp(0.04) = 1.04.

Models 5 and 6 estimated mortality associates with height and with handedness that were not different from 0 at the .05 confidence level, in agreement with the Authors finding.

Model 9 shows that those who play for 10 or more years have 0.80 (95% CI 0.74, 0.86; p<0.001) times the mortality risk of those who have shorter careers (Authors report exp(-.15) = .86).

# tdy_2022 <- mdl_2_cox_2022 %>% broom::tidy()
# 
# life_tbl_2022 <- tibble(
#   age_bin = rep(levels(cox_dat$age), 8),
#   age_bin_term = paste0("age", age_bin),
#   period_bin = map(levels(cox_dat$decade), ~rep(.x, 14)) %>% unlist(),
#   period_bin_term = paste0("decade", period_bin),
#   intercept = coef(mdl_2_cox_2022)["(Intercept)"],
#   authors_e = rep(authors_e, 8),
#   us_male_e = rep(us_male_e, 8)
# ) %>%
#   left_join(tdy_2022 %>% select(age_bin_term = term, age_bin_est = estimate), 
#             by = "age_bin_term") %>%
#   left_join(tdy_2022 %>% select(period_bin_term = term, period_bin_est = estimate), 
#             by = "period_bin_term") %>%
#   replace_na(list(age_bin_est = 0, period_bin_est = 0, intercept = 0)) %>%
#   group_by(period_bin) %>%
#   mutate(
#     mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
#     nQx = (5 * mx) / (1 + 2.5 * mx),
#     nQx_comp = 1 - nQx,
#     nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
#     lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
#     ndx = lx * nQx,
#     Lx = (lx - ndx) * 5
#   ) %>%
#   arrange(period_bin, desc(age_bin)) %>%
#   mutate(
#     Tx = cumsum(Lx),
#     ex = Tx / lx
#   ) %>%
#   ungroup() %>%
#   arrange(period_bin, age_bin) %>%
#   select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex, 
#          authors_e, us_male_e)
# 
# life_tbl_2022 %>%
#   filter(period_bin == "[1970,Inf)") %>%
#   select(-period_bin) %>%
#   life_flex() %>%
#   flextable::set_caption("Table 5. Life Table, period 1970-2021 from Lahman-based data through 2021.")

The logistic regression facilitated the construction of life tables. The Cox proportional hazards model does not do that, but instead, the model can be used to predict survival at selected predictor levels. The plots below compare cohort survival expectations for two age groups faceted by two decades. In the first facet, the median remaining life expectancy of a player aged 70 - 75 during the 1900-1910 decade is about 18 years. In the second facet, the median jumps to about 28 years.⁴

summary(mdl_2_cox_2022)

## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ age + decade, data = cox_dat)
## 
##   n= 82704, number of events= 3861 
## 
##                       coef exp(coef) se(coef)      z Pr(>|z|)    
## age[25,30)        -0.15394   0.85732  0.36778 -0.419 0.675537    
## age[30,35)         0.01142   1.01149  0.41790  0.027 0.978194    
## age[35,40)        -0.02987   0.97057  0.48133 -0.062 0.950511    
## age[40,45)         0.07272   1.07543  0.51457  0.141 0.887621    
## age[45,50)         0.76889   2.15737  0.53014  1.450 0.146958    
## age[50,55)         1.28244   3.60541  0.54230  2.365 0.018040 *  
## age[55,60)         1.55903   4.75420  0.55188  2.825 0.004729 ** 
## age[60,65)         1.90679   6.73147  0.55850  3.414 0.000640 ***
## age[65,70)         2.34999  10.48550  0.56342  4.171 3.03e-05 ***
## age[70,75)         2.75274  15.68555  0.56767  4.849 1.24e-06 ***
## age[75,80)         3.01296  20.34752  0.57143  5.273 1.34e-07 ***
## age[80,85)         3.50876  33.40666  0.57477  6.105 1.03e-09 ***
## age[85,Inf)        3.89132  48.97548  0.57920  6.718 1.84e-11 ***
## decade[1910,1920) -0.52456   0.59182  0.38947 -1.347 0.178026    
## decade[1920,1930) -0.86342   0.42172  0.38282 -2.255 0.024106 *  
## decade[1930,1940) -1.43405   0.23834  0.38022 -3.772 0.000162 ***
## decade[1940,1950) -1.49439   0.22438  0.37509 -3.984 6.77e-05 ***
## decade[1950,1960) -1.79692   0.16581  0.37426 -4.801 1.58e-06 ***
## decade[1960,1970) -1.84856   0.15746  0.37337 -4.951 7.39e-07 ***
## decade[1970,Inf)  -2.48379   0.08343  0.37159 -6.684 2.32e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                   exp(coef) exp(-coef) lower .95 upper .95
## age[25,30)          0.85732    1.16642   0.41695    1.7628
## age[30,35)          1.01149    0.98864   0.44591    2.2944
## age[35,40)          0.97057    1.03032   0.37785    2.4931
## age[40,45)          1.07543    0.92986   0.39226    2.9484
## age[45,50)          2.15737    0.46353   0.76326    6.0979
## age[50,55)          3.60541    0.27736   1.24551   10.4367
## age[55,60)          4.75420    0.21034   1.61183   14.0229
## age[60,65)          6.73147    0.14856   2.25276   20.1143
## age[65,70)         10.48550    0.09537   3.47540   31.6354
## age[70,75)         15.68555    0.06375   5.15586   47.7198
## age[75,80)         20.34752    0.04915   6.63915   62.3606
## age[80,85)         33.40666    0.02993  10.82907  103.0564
## age[85,Inf)        48.97548    0.02042  15.73834  152.4047
## decade[1910,1920)   0.59182    1.68972   0.27585    1.2697
## decade[1920,1930)   0.42172    2.37127   0.19914    0.8930
## decade[1930,1940)   0.23834    4.19567   0.11312    0.5022
## decade[1940,1950)   0.22438    4.45663   0.10758    0.4680
## decade[1950,1960)   0.16581    6.03104   0.07962    0.3453
## decade[1960,1970)   0.15746    6.35066   0.07575    0.3273
## decade[1970,Inf)    0.08343   11.98667   0.04027    0.1728
## 
## Concordance= 0.639  (se = 0.005 )
## Likelihood ratio test= 709.4  on 20 df,   p=<2e-16
## Wald test            = 739  on 20 df,   p=<2e-16
## Score (logrank) test = 783.8  on 20 df,   p=<2e-16

new_dat <- expand.grid(
  age = c("[0,25)", "[50,55)", "[70,75)"),
  decade = c("[1900,1910)", "[1960,1970)")
) %>%
  mutate(strata = as.factor(row_number()))

survfit(mdl_2_cox_2022, newdata = new_dat, data = cox_dat) %>% 
  surv_summary() %>%
  inner_join(new_dat, by = "strata") %>%
  ggplot(aes(x = time, color = age, fill = age)) + 
  geom_line(aes(y = surv)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.4) +
  geom_hline(yintercept = 0.5, linetype = 2) +
  facet_wrap(facets = vars(decade))

References

Overview of Life Tables and Survival Rates. Lesson 7 of Population Analysis for Planners. Measure Evaluation.

Lahman. Sean Lahman’s R Package, version 10.0-1, 2022-04-07. GitHub

Saint Onge JM, Rogers RG, Krueger PM. Major League Baseball Players’ Life Expectancies. Soc Sci Q. 2008;89(3):817-830. doi:10.1111/j.1540-6237.2008.00562.x. HTML.

Shryock HS, Siegel JS, Larmon EA, (1980).The Methods and Materials of Demography. United States:Department of Commerce, Bureau of the Census. Google Books.

Thorn, J., & Palmer, P. (1990). Total baseball (p. 2294). Haynes Publications. Google Scholar.

This would seem to bias ages downward and estimated lifetimes upward by up to one year.↩︎
See my notes, and Bayesium Analytics)↩︎
The Authors say both panels are from Model 2, but that cannot be right. Cohort is not a predictor variable in Model 2.↩︎
These values are absurdly high. I’m not sure what’s going on here yet.↩︎

Major League Baseball Players’ Life Expectancies

Reproduction and Update to Saint Onge, Rogers, and Krueger (2008)

Michael Foley

2022-07-26