In their 2008 study, Major League Baseball Players’ Life Expectancies, Saint Onge et al. (2008) (the Authors) used major league baseball (MLB) player data from the Total Baseball (2004) encyclopedia to evaluate the role of anthropomorphic measures (height, body mass, and handedness) and performance measures (career length and total player rating) on longevity. Using a discrete time hazards model, the Authors found:
This workbook reproduces the analysis with an alternative data source, Sean Lahman’s Baseball Database (2022). Differences in players included in the data and their attributes resulted in slightly different coefficient estimates. The Authors’ findings were generally confirmed except that instead of an expected five additional years of life for 20-year-old MLB players, the Lahman-based data produced estimates of only three additional years. Fitting the models with data updated through the 2021 season resulted in lower measured mortality risks that are in closer alignment with the Authors’ findings. A 20-year-old MLB player can expect an additional 4.5 years of additional life over comparable U.S. males. A Cox proportional hazards model with time-dependent covariates confirmed the Authors’ findings that height, weight, and handedness are not associated with risk of death, and career length is inversely associated with risk of death.
The Authors analyzed data from the electronic version of Total Baseball (2004). Total Baseball was last published in 2008. However, another source of detailed baseball player data is now available. The R version of the 2021 edition of Sean Lahman’s Baseball Database, the Lahman R package v10.0-1 (2022) is continuously updated and available for download on CRAN.
The Authors restricted their data set to position players who debuted between 1902 and 2004. They excluded pitchers due to potential differences in career length and selection into major league baseball. They also excluded players whose career consisted of a mere September call-up by removing players who played for a month or less at the end of the season and who did not play the following season.
The Authors did not explain how they define “pitcher”. Occasionally a position player will pitch during a blowout game to avoid wasting actual pitchers in a lost cause. Also, there are instances of players changing positions, the most famous being Babe Ruth. Ruth pitched from 1914 - 1919, and a few games in 1920, 1921, 1930, and 1933. I defined pitchers as players who never played any position other than pitcher. Using the Lahman Fielding data set, I excluded players whose only fielding statistics were for the pitcher position.
library(Lahman)
data("People", "Fielding", package = "Lahman")
<- People %>%
person_0 semi_join(Fielding %>% filter(POS != "P"), by = "playerID") %>%
mutate(
debut = ymd(debut),
debutMonth = month(debut),
debutYear = year(debut),
finalGame = ymd(finalGame),
finalGameYear = year(finalGame),
career_length = time_length(interval(debut, finalGame), "years")
%>%
) filter(
>= 1902,
debutYear # September call-ups. I originally filtered debutMonth < 9 (naturally), but
# my data better aligns with the Authors if I use 8 (August). One confirming
# evidence came from identifying the youngest player to die. The Authors
# report it being Al Montegomery. If I use the < 9 filter, it's Newt Halliday.
< 8 | finalGameYear > debutYear
debutMonth %>%
) select(-c(birthState, birthCity, deathCountry, deathState, deathCity,
nameGiven, retroID, bbrefID))
The Authors identified eleven players with no death date who, if alive, would have been over 95 years old at the end of 2004. After researching, they dropped one of these records, Jack Dalton, born on July 3, 1885. In the Lahman data, Dalton is recorded with a death year of 1950, making him 65 years old when he died. The Lahman data includes ten players who were over 95 at the end of 2004. All have death dates.
%>%
person_0 mutate(
age_at_2004 = time_length(interval(birthDate, ymd(20041231)), "year") %>% round(),
age_at_death = time_length(interval(birthDate, deathDate), "year") %>% round()
%>%
) filter(age_at_2004 >= 95 & (is.na(deathYear) | deathYear > 2004)) %>%
select(nameFirst, nameLast, birthDate, deathDate, age_at_2004, age_at_death)
## nameFirst nameLast birthDate deathDate age_at_2004 age_at_death
## 1 Ray Berres 1907-08-31 2007-02-01 97 99
## 2 Ray Cunningham 1905-01-17 2005-07-30 100 101
## 3 Howdy Groskloss 1906-04-10 2006-07-15 99 100
## 4 Ernie Koy 1909-09-17 2007-01-01 95 97
## 5 Al Lopez 1908-08-20 2005-10-30 96 97
## 6 Emil Mailho 1908-12-16 2007-03-07 96 98
## 7 Tony Malinosky 1909-10-07 2011-02-08 95 101
## 8 Eddie Mayo 1910-04-15 2006-11-27 95 97
## 9 Bobby Stevens 1907-04-17 2005-12-30 98 99
## 10 Billy Werber 1908-06-20 2009-01-22 97 101
The Lahman data does have seven players with no death date would would be over 95 years old at the end of the 2021 season. The Wikipedia articles on each player as of Jul 2022 indicated all were still alive.
%>%
person_0 mutate(
age_at_2021 = time_length(interval(birthDate, ymd(20211231)), "year") %>% round(),
age_at_death = time_length(interval(birthDate, deathDate), "year") %>% round()
%>%
) filter(age_at_2021 >= 95 & is.na(deathYear)) %>%
select(nameFirst, nameLast, birthDate, age_at_2021)
## nameFirst nameLast birthDate age_at_2021
## 1 George Elder 1921-03-10 101
## 2 Charlie Maxwell 1927-04-08 95
## 3 Ed Mickelson 1926-09-09 95
## 4 Larry Miggins 1925-08-20 96
## 5 Bobby Morgan 1926-06-29 96
## 6 Frank Saucier 1926-05-28 96
## 7 Bobby Shantz 1925-09-26 96
However, the Lahman data includes a couple unusual records related to death date. Alfredo Cabrera and Hector Martinez have no death date because of incomplete month and day information.
%>%
person_0 filter(is.na(deathDate), !is.na(deathYear)) %>%
select(nameFirst, nameLast, birthDate, deathYear, deathMonth, deathDay)
## nameFirst nameLast birthDate deathYear deathMonth deathDay
## 1 Alfredo Cabrera 1881-05-11 1964 NA NA
## 2 Hector Martinez 1939-05-11 1999 12 NA
One option to handle incomplete data is imputing a death date. Another is dropping the record. A death year is present for both players, and in one case the month is present too. I opted to keep both records and impute the middle month of year (Jun) and/or day of month (15th).
<- person_0 %>%
person_1 mutate(deathDate = coalesce(deathDate,
ymd(paste(deathYear,
coalesce(deathMonth, 6),
coalesce(deathDay, 15)),
quiet = TRUE))) %>%
select(-c(deathMonth, deathDay))
There is a similar problem with birth dates. Two have missing birth months and days.
%>%
person_1 filter(is.na(birthDate)) %>%
select(nameFirst, nameLast, birthYear, birthMonth, birthDay, deathDate)
## nameFirst nameLast birthYear birthMonth birthDay deathDate
## 1 Claude Gouzzie 1873 NA NA 1907-09-21
## 2 Ed Irwin 1882 NA NA 1916-02-05
I handled them the same way, imputing a mid-month and day-of-month.
<- person_1 %>%
person_2 mutate(birthDate = coalesce(birthDate,
ymd(paste(birthYear,
coalesce(birthMonth, 6),
coalesce(birthDay, 15)),
quiet = TRUE))) %>%
select(-c(birthMonth, birthDay))
The Authors noted that the youngest age at death was for Al Montgomery, 22.5 years. The Lahman data confirms that Montgomery was the youngest, but his age at death was 21.8 years.
%>%
person_2 filter(!is.na(deathYear)) %>%
mutate(age = time_length(interval(birthDate, deathDate), "year")) %>%
select(nameFirst, nameLast, birthDate, deathDate, age) %>%
arrange(age) %>%
head()
## nameFirst nameLast birthDate deathDate age
## 1 Al Montgomery 1920-07-03 1942-04-26 21.81370
## 2 Ken Hubbs 1941-12-23 1964-02-13 22.14208
## 3 Oscar Taveras 1992-06-19 2014-10-26 22.35342
## 4 Pat Hynes 1884-03-12 1907-03-12 23.00000
## 5 Joe Cassidy 1883-02-08 1906-03-25 23.12329
## 6 John Dodge 1893-04-27 1916-06-19 23.14521
The discrepancy may be explained by how the ages are calculated. The Authors calculates age by person-year. For each calendar year, the player’s age was the number of years between the date of birth and January 1 of the calendar year. So Montgomery’s age on January 1, 1943 would have been 22.5.1
Otherwise, the birth and death dates are reasonable. The following plot of the age distributions for deceased and living players has reasonable endpoints and modes. The mode age for deceased players is about 82 years.
%>%
person_2 mutate(
all_ages = time_length(interval(birthDate, coalesce(deathDate, ymd(20211231))),
"year"),
is_alive = factor(if_else(is.na(deathDate), "Alive", "Dead"))
%>%
) ggplot(aes(x = all_ages, fill = is_alive)) +
geom_density(alpha = .4, color = "gray40") +
theme_light() +
theme(legend.position = "top") +
labs(title = "Lahman data set age distribution.",
x = "Age at death, or at end of 2021 (if alive)", fill = NULL)
The Authors defined several variables from the player attributes.
These are all nominal variables, coded as factors in R with the first value as the reference level.
<- person_2 %>%
person_3 mutate(
cohort = factor(
case_when(between(debutYear, 1902, 1945) ~ "Early",
between(debutYear, 1946, 1968) ~ "Golden",
>= 1969 ~ "Modern"),
debutYear levels = c("Early", "Golden", "Modern")
),height_fct = factor(
if_else(height >= 72, ">=6 feet", "<6 feet"),
levels = c("<6 feet", ">=6 feet")
),bmi = weight / height^2 * 703,
bmi_fct = factor(case_when(
< 18.5 ~ "Underweight (<18.5)",
bmi >= 18.5 & bmi < 25.0 ~ "Normal (18.5 - <25.0)",
bmi >= 25.0 & bmi < 30.0 ~ "Overweight (25.0 – <30.0)",
bmi >= 30.0 & bmi < 35.0 ~ "Obese Class I (30.0 – <35.0)",
bmi >= 35.0 & bmi < 40.0 ~ "Obese Class II (35.0 - <40.0)",
bmi >= 40.0 & bmi < 44.9 ~ "Obese Class III (40.0 - <45.0"),
bmi levels = c("Underweight (<18.5)",
"Normal (18.5 - <25.0)",
"Overweight (25.0 – <30.0)",
"Obese Class I (30.0 – <35.0)",
"Obese Class II (35.0 - <40.0)",
"Obese Class III (40.0 - <45.0")
),career_length_fct = factor(
if_else(career_length < 10, "<10 years", "10+ years"),
levels = c("<10 years", "10+ years")
),throws = factor(
throws, levels = c("R", "L"),
labels = c("Right Handed", "Left Handed")
) )
The BMI bins in the Lahman data include values outside the range of the Authors’ study. Their row counts are so low that it is advisable to lump levels. The Lahman data also has insufficient information (missing height and/or weight) to calculate the BMI for 55 players.
%>%
person_3 mutate(debut_lte_2004 = if_else(debut <= 2004, "In Study", "After 2004")) %>%
::tabyl(bmi_fct, debut_lte_2004) janitor
## bmi_fct After 2004 In Study
## Underweight (<18.5) 0 1
## Normal (18.5 - <25.0) 1217 2646
## Overweight (25.0 – <30.0) 2437 1816
## Obese Class I (30.0 – <35.0) 228 11
## Obese Class II (35.0 - <40.0) 7 0
## Obese Class III (40.0 - <45.0 1 0
## <NA> 0 55
In addition to the missing BMI values, there are missing values in the Lahman data for height and handedness.
%>%
person_3 select(height_fct, throws) %>%
summary()
## height_fct throws
## <6 feet :3643 Right Handed:7223
## >=6 feet:4752 Left Handed :1181
## NA's : 24 NA's : 15
I managed the BMI levels by expanding Normal to include the one Underweight player, and Obese Class I to include the seven Obese Class II players and one Obese Class III player. For the 55 players of unknown BMI, I used multivariate imputations by chained equations (MICE) to impute values. I also used it to impute the missing height and handedness values.
set.seed(2020)
<- person_3 %>%
mice_0 select(birthCountry, weight, height, throws, bmi, debutYear, career_length, cohort) %>%
::mice(method = "rf", printFlag = FALSE) %>%
mice::complete()
mice
<- data.frame(person_3,
person_4 bmi_mice = mice_0$bmi,
height_mice = mice_0$height,
throws_mice = mice_0$throws) %>%
mutate(
# Collapse BMI into three levels.
bmi_fct_2 = case_when(
< 18.5 ~ "Normal",
bmi_mice >= 18.5 & bmi_mice < 25.0 ~ "Normal",
bmi_mice >= 25.0 & bmi_mice < 30.0 ~ "Overweight",
bmi_mice >= 30.0 & bmi_mice < 35.0 ~ "Obese",
bmi_mice >= 35.0 & bmi_mice < 40.0 ~ "Obese",
bmi_mice >= 40.0 & bmi_mice < 44.9 ~ "Obese"
bmi_mice
),bmi_fct_2 = fct_relevel(bmi_fct_2, "Normal", "Overweight", "Obese"),
height_fct_2 = factor(if_else(height_mice >= 72, ">=6 feet", "<6 feet"),
levels = c("<6 feet", ">=6 feet")),
throws_2 = throws_mice
)
MICE estimated 32 of the 35 the missing BMIs would be in the Normal range, the rest Overweight. The new BMI distribution is more more aligned with the Authors’, and is compatible with regression algorithms. It estimated 18 of 24 missing heights to be under 6 feet, and all 15 missing handedness to be right-handed.
The Authors reported that the highest recorded BMI was for Calvin Pickering (32.61). The Lahman data reports a BMI of 33.56 for Pickering, ranking him sixth heaviest among players debuting up to 2004. Dmitri Young topped the list at 37.87. Young is shorter and heavier than Pickering. Perhaps Young put on weight over his career and the Lahman data captures the more recent value.
%>%
person_4 filter(debutYear <= 2004) %>%
arrange(desc(bmi_mice)) %>%
head(10) %>%
mutate(rank = row_number()) %>%
select(rank, nameFirst, nameLast, debut, weight, height, bmi = bmi_mice, bmi_fct_2) %>%
::flextable() %>%
flextable::colformat_double(j = 7, digits = 2) %>%
flextable::autofit() %>%
flextable::theme_vanilla() %>%
flextable::bg(i = ~ nameLast == "Pickering", bg = "lightgoldenrod") %>%
flextable::set_caption("Top 10 BMIs in Lahman data, debut dates <= 2004. Highest BMI in Authors' data highlighted.") flextable
rank | nameFirst | nameLast | debut | weight | height | bmi | bmi_fct_2 |
1 | Dmitri | Young | 1996-08-29 | 295 | 74 | 37.87 | Obese |
2 | Carlos | Lee | 1999-05-07 | 270 | 74 | 34.66 | Obese |
3 | Juan | Diaz | 2002-06-12 | 269 | 74 | 34.53 | Obese |
4 | Raul | Chavez | 1996-08-30 | 245 | 71 | 34.17 | Obese |
5 | Jose | Molina | 1999-09-06 | 250 | 72 | 33.90 | Obese |
6 | Calvin | Pickering | 1998-09-12 | 283 | 77 | 33.56 | Obese |
7 | Marlon | Byrd | 2002-09-08 | 245 | 72 | 33.22 | Obese |
8 | Juan | Uribe | 2001-04-08 | 245 | 72 | 33.22 | Obese |
9 | Bob | Fothergill | 1922-04-18 | 230 | 70 | 33.00 | Obese |
10 | Jack | Cust | 2001-09-26 | 250 | 73 | 32.98 | Obese |
It is interesting to consider the data updated through 2021.
%>%
person_4 # filter(debutYear <= 2004) %>%
arrange(desc(bmi_mice)) %>%
head(30) %>%
mutate(rank = row_number()) %>%
select(rank, nameFirst, nameLast, debut, weight, height, bmi = bmi_mice, bmi_fct_2) %>%
::flextable() %>%
flextable::colformat_double(j = 7, digits = 2) %>%
flextable::autofit() %>%
flextable::theme_vanilla() %>%
flextable::bg(i = ~ nameLast %in% c("Young", "Pickering"), bg = "lightgoldenrod") %>%
flextable::set_caption("Top 30 BMIs in Lahman data, any debut date. Highest BMIs from Authors' and from data through 2004 highlighted.") flextable
rank | nameFirst | nameLast | debut | weight | height | bmi | bmi_fct_2 |
1 | Alejandro | Kirk | 2020-09-12 | 265 | 68 | 40.29 | Obese |
2 | Pablo | Sandoval | 2008-08-14 | 268 | 70 | 38.45 | Obese |
3 | Prince | Fielder | 2005-06-13 | 275 | 71 | 38.35 | Obese |
4 | Dmitri | Young | 1996-08-29 | 295 | 74 | 37.87 | Obese |
5 | Tommy | Everidge | 2009-07-28 | 275 | 72 | 37.29 | Obese |
6 | Dan | Vogelbach | 2016-09-12 | 270 | 72 | 36.61 | Obese |
7 | Brayan | Pena | 2005-05-23 | 240 | 69 | 35.44 | Obese |
8 | Billy | Butler | 2007-05-01 | 260 | 72 | 35.26 | Obese |
9 | Josh | Naylor | 2019-05-24 | 250 | 71 | 34.86 | Obese |
10 | Carlos | Lee | 1999-05-07 | 270 | 74 | 34.66 | Obese |
11 | Jesus | Aguilar | 2014-05-15 | 277 | 75 | 34.62 | Obese |
12 | Juan | Diaz | 2002-06-12 | 269 | 74 | 34.53 | Obese |
13 | Edwin | Bellorin | 2007-08-07 | 240 | 70 | 34.43 | Obese |
14 | Kennys | Vargas | 2014-08-01 | 290 | 77 | 34.39 | Obese |
15 | Ji-Man | Choi | 2016-04-05 | 260 | 73 | 34.30 | Obese |
16 | Raul | Chavez | 1996-08-30 | 245 | 71 | 34.17 | Obese |
17 | Yermin | Mercedes | 2020-08-02 | 245 | 71 | 34.17 | Obese |
18 | Jose | Molina | 1999-09-06 | 250 | 72 | 33.90 | Obese |
19 | Sandy | Leon | 2012-05-14 | 235 | 70 | 33.72 | Obese |
20 | Tommy | Joseph | 2016-05-13 | 255 | 73 | 33.64 | Obese |
21 | Chadwick | Tromp | 2020-07-29 | 221 | 68 | 33.60 | Obese |
22 | Calvin | Pickering | 1998-09-12 | 283 | 77 | 33.56 | Obese |
23 | A. J. | Reed | 2016-06-25 | 275 | 76 | 33.47 | Obese |
24 | Deivy | Grullon | 2019-09-22 | 240 | 71 | 33.47 | Obese |
25 | Dayan | Viciedo | 2010-06-20 | 240 | 71 | 33.47 | Obese |
26 | Tomas | Telis | 2014-08-25 | 220 | 68 | 33.45 | Obese |
27 | Carlos | Maldonado | 2006-09-08 | 260 | 74 | 33.38 | Obese |
28 | Cameron | Rupp | 2013-09-10 | 260 | 74 | 33.38 | Obese |
29 | Yasmany | Tomas | 2015-04-15 | 260 | 74 | 33.38 | Obese |
30 | Harold | Ramirez | 2019-05-11 | 232 | 70 | 33.28 | Obese |
Pickering falls way down the rankings to 22nd, and Dmitri Young falls to fourth. The highest BMI belongs to Alejandro Kirk at 40.29. His entry in Baseball Reference lists him at 245 pounds while the Lahman data lists 265 pounds. Browsing through photos, his weight does seem to vary.
A plot of the relationship between BMI and MLB debut year shows BMIs began to increase in 1990 and peaked in 2014. Perhaps the increasing BMIs are related to sports nutrition and performance enhancing substances rather than body fat. Players debuting after 2004 are colored blue for reference. The updated data has many more overweight players.
%>%
person_4 mutate(post2004 = if_else(debutYear > 2004, "Post-2004", "1902-2004")) %>%
ggplot(aes(x = debut, y = bmi_mice)) +
geom_point(aes(color = post2004), alpha = 0.2) +
geom_smooth(color = "gray40") +
geom_vline(xintercept = ym(201401), linetype = 2, color = "gray60") +
theme_light() +
labs(title = "BMIs increased after 1990 and peaked in 2014.")
Kaplan-Meier plots are helpful tools to get a better feel for the variable relationships to mortality. They show the proportion of the number at risk at each follow-up time (death event or end of study) who died.
<- person_4 %>%
person mutate(
fu_date = coalesce(deathDate, max(person_4$deathDate, na.rm = TRUE)),
fu_time = time_length(interval(debut, fu_date), unit = "year"),
fu_status = if_else(is.na(deathDate), 0, 1),
fu_time_from_birth = time_length(interval(birthDate, fu_date), unit = "year")
%>%
) select(-c(bats, throws, debutMonth, height_fct, height_mice, bmi_fct, bmi_mice,
throws_mice))
The plots below show time to death starting from the birth date. By definition, all MLB players lived into adulthood, so survival probability is 100% through around age 20. The proper way to measure mortality is for time zero to be each player’s MLB debut date. However, for these plots I just want to normalize the data to get a sense of the age at death.
<- function(x, legend_title, legend_labs) {
plot_km ggsurvplot(
x, data = person,
surv.median.line = "hv",
ggtheme = theme_light(),
legend.title = legend_title,
legend.labs = legend_labs,
xlab = "",
ylab = ""
)
}
<- survfit(Surv(fu_time_from_birth, fu_status) ~ cohort, data = person)
km1 <- plot_km(km1, "Cohort", levels(person$cohort))
p1
<- survfit(Surv(fu_time_from_birth, fu_status) ~ height_fct_2, data = person)
km2 <- plot_km(km2, "Height", levels(person$height_fct_2))
p2
<- survfit(Surv(fu_time_from_birth, fu_status) ~ bmi_fct_2, data = person)
km3 <- plot_km(km3, "BMI", levels(person$bmi_fct_2))
p3
<- survfit(Surv(fu_time_from_birth, fu_status) ~ career_length_fct, data = person)
km4 <- plot_km(km4, "Career Length", levels(person$career_length_fct))
p4
<- survfit(Surv(fu_time_from_birth, fu_status) ~ throws_2, data = person)
km5 <- plot_km(km5, "Handedness", levels(person$throws_2))
p5
$plot + p2$plot + p3$plot + p4$plot + p5$plot + plot_layout(nrow = 3) +
p1::plot_annotation(title = "Survival probability by age, selected vars.") patchwork
Discrete time models are fitted to person-year data, with one row per person per year from debut year to the year of death (or censor).
<-
person_year_0 expand.grid(
playerID = unique(person$playerID),
# Years are through 2021 instead of 2004 so I can run analysis on updated
# data. I'll filter on yr <= 2004 for reproduction of original analysis.
yr = c(1902:2021)
%>%
) inner_join(person, by = "playerID") %>%
filter(debutYear <= yr, coalesce(deathYear, 9999) >= yr)
The Author’s data set included 241,218 person-years for 1902 - 2004 (103 years) with 6,772 players and 3,030 deaths. The data derived from Lahman do not quite match, but are close. It includes 239,678 person-years with 6,827 distinct players and 3,067 deaths. It is worth noting that the data derived from Lahman has more players and deaths, yet less person-years. The smaller number of person-years may indicate that the Lahman data identifies more deaths than the Authors’ source.
The Authors define a few variables related to the person year periods for the regression analysis.
<- person_year_0 %>%
person_year mutate(
age = time_length(interval(birthDate, ymd(paste0(yr, "0101"))), "years"),
age_bin = cut(
age, breaks = c(0, 20 + 5 * (1:13), Inf),
dig.lab = 4,
right = FALSE
),period_bin = cut(yr, breaks = c(0, 1900 + 10 * (1:7), Inf), dig.lab = 4, right = FALSE),
status = if_else(deathYear == yr, 1L, 0L, missing = 0L)
)
The following table summarizes the finalized data set used in the analysis, including the post 2004 data that are used in the update.
%>% skimr::skim() person_year
Name | Piped data |
Number of rows | 312074 |
Number of columns | 30 |
_______________________ | |
Column type frequency: | |
character | 4 |
Date | 5 |
factor | 7 |
numeric | 14 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
playerID | 0 | 1 | 6 | 9 | 0 | 8419 | 0 |
birthCountry | 0 | 1 | 3 | 14 | 0 | 42 | 0 |
nameFirst | 0 | 1 | 2 | 12 | 0 | 1576 | 0 |
nameLast | 0 | 1 | 2 | 14 | 0 | 5060 | 0 |
Variable type: Date
skim_variable | n_missing | complete_rate | min | max | median | n_unique |
---|---|---|---|---|---|---|
debut | 0 | 1.00 | 1902-04-17 | 2021-07-30 | 1953-04-13 | 5525 |
finalGame | 0 | 1.00 | 1902-04-26 | 2021-10-03 | 1959-09-23 | 4618 |
deathDate | 126394 | 0.59 | 1905-11-23 | 2022-02-09 | 1985-02-17 | 3630 |
birthDate | 0 | 1.00 | 1869-10-12 | 2001-03-01 | 1928-07-13 | 7620 |
fu_date | 0 | 1.00 | 1905-11-23 | 2022-02-09 | 2011-03-10 | 3630 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
cohort | 0 | 1 | FALSE | 3 | Ear: 135382, Mod: 107752, Gol: 68940 |
career_length_fct | 0 | 1 | FALSE | 2 | <10: 226297, 10+: 85777 |
bmi_fct_2 | 0 | 1 | FALSE | 3 | Nor: 170561, Ove: 138116, Obe: 3397 |
height_fct_2 | 0 | 1 | FALSE | 2 | >=6: 160479, <6 : 151595 |
throws_2 | 0 | 1 | FALSE | 2 | Rig: 267766, Lef: 44308 |
age_bin | 0 | 1 | FALSE | 14 | [30: 37640, [25: 37027, [35: 35238, [40: 32733 |
period_bin | 0 | 1 | FALSE | 8 | [19: 185461, [19: 27959, [19: 26769, [19: 24238 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yr | 0 | 1.00 | 1976.06 | 30.64 | 1902.00 | 1952.00 | 1980.00 | 2003.00 | 2021.00 | ▂▅▆▆▇ |
birthYear | 0 | 1.00 | 1928.91 | 30.88 | 1869.00 | 1901.00 | 1928.00 | 1954.00 | 2001.00 | ▆▇▇▇▂ |
deathYear | 126394 | 0.59 | 1984.28 | 23.03 | 1905.00 | 1967.00 | 1985.00 | 2004.00 | 2022.00 | ▁▂▇▇▇ |
weight | 1801 | 0.99 | 180.86 | 17.89 | 125.00 | 170.00 | 180.00 | 190.00 | 295.00 | ▁▇▂▁▁ |
height | 625 | 1.00 | 71.56 | 2.24 | 63.00 | 70.00 | 72.00 | 73.00 | 82.00 | ▁▃▇▁▁ |
debutYear | 0 | 1.00 | 1952.96 | 30.80 | 1902.00 | 1926.00 | 1953.00 | 1978.00 | 2021.00 | ▇▇▇▆▃ |
finalGameYear | 0 | 1.00 | 1959.43 | 31.73 | 1902.00 | 1932.00 | 1959.00 | 1985.00 | 2021.00 | ▆▇▇▇▅ |
career_length | 0 | 1.00 | 6.58 | 5.39 | 0.00 | 1.78 | 5.65 | 10.35 | 31.46 | ▇▅▂▁▁ |
bmi | 1858 | 0.99 | 24.79 | 1.79 | 18.46 | 23.67 | 24.41 | 25.80 | 40.29 | ▁▇▁▁▁ |
fu_time | 0 | 1.00 | 46.44 | 15.58 | 0.36 | 36.49 | 48.71 | 58.25 | 80.57 | ▁▃▆▇▂ |
fu_status | 0 | 1.00 | 0.59 | 0.49 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▆▁▁▁▇ |
fu_time_from_birth | 0 | 1.00 | 70.47 | 15.36 | 20.95 | 60.73 | 72.57 | 82.16 | 101.47 | ▁▂▅▇▃ |
age | 0 | 1.00 | 46.64 | 16.38 | 16.00 | 33.01 | 44.25 | 58.26 | 101.24 | ▆▇▆▂▁ |
status | 0 | 1.00 | 0.01 | 0.11 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
<- person_year %>%
t1 filter(yr <= 2004) %>%
tbl_summary(
label = list(
~ "Age Group",
age_bin ~ "Time Period",
period_bin ~ "Cohort",
cohort ~ "Body Mass Index",
bmi_fct_2 ~ "Height",
height_fct_2 ~ "Handedness",
throws_2 career_length_fct = "Seasons Played",
status = "Died"
),include = c(age_bin, period_bin, cohort, height_fct_2, bmi_fct_2,
career_length_fct, throws_2, status)
)
<- person_year %>%
t2 filter(yr <= 2004) %>%
replace_na(list(status = 0)) %>%
group_by(playerID, cohort, bmi_fct_2, height_fct_2, throws_2, career_length_fct) %>%
summarize(.groups = "drop", status = max(status)) %>%
tbl_summary(
label = list(
~ "Cohort",
cohort ~ "Body Mass Index",
bmi_fct_2 ~ "Height",
height_fct_2 ~ "Handedness",
throws_2 career_length_fct = "Seasons Played",
status = "Died"
),include = c(cohort, height_fct_2, bmi_fct_2, career_length_fct, throws_2, status)
)
Table 1 presents the descriptive statistics of the modeling covariates based on the pooled person-years. It is directly comparable to the Authors’ Table 1 (external). Both data sets have the most person-years at age group [30, 35), but the Lahman-based data has more person-years for younger people, aged < 45 years, and relatively more person-year records for the periods before 1940 and the Early (1902 - 1945) debut year cohort. Whereas 4.23% of Authors’ rows are in the lowest age group bin, [0,25), the Lahman-based data has many more, 15,371 (6.4%). Conversely, while 1.11% of the Authors’ rows are in the highest age group, [85,Inf), the Lahman-based data has 2,149 (0.9%). Table 1 includes a second column showing distinct player characteristics. It is evident that there are more records associated with earlier players. For example, while 2,800 (41%) of players debuted between 1902 and 1945 (Early cohort), they comprise 134,554 (56%) of the player-year rows.
Consistent with the Authors’ study, 98,446 (41%) of players-year records in the Lahman-based data are classified as overweight and 745 (0.3%) records are classified as obese. 128,764 (54%) records are less than six feet tall (54.81 in Authors’ data), and 206,020 (86%) percent are right-handed (86.47 in Authors’ data). 64,952 (27%) are career lengths of 10 years or more (23.54 in Authors’ data). The Lahman-based data does not have career rating or first-year rating measures, so those attributes are not included.
::tbl_merge(list(t1, t2), tab_spanner = c("Person-Years", "Distinct Persons")) %>%
gtsummary::as_flex_table() %>%
gtsummary::theme_vanilla() %>%
flextable::bg(i = ~ is.na(stat_0_1), bg = "gray80") %>%
flextable::set_caption("Table 1. Percentage of selected characteristics, MLB players, 1902-2004.") flextable
| Person-Years | Distinct Persons |
Characteristic | N = 239,6781 | N = 6,8271 |
Age Group | ||
[0,25) | 15,371 (6.4%) | |
[25,30) | 29,709 (12%) | |
[30,35) | 30,089 (13%) | |
[35,40) | 27,976 (12%) | |
[40,45) | 25,633 (11%) | |
[45,50) | 23,221 (9.7%) | |
[50,55) | 20,576 (8.6%) | |
[55,60) | 17,913 (7.5%) | |
[60,65) | 15,187 (6.3%) | |
[65,70) | 12,285 (5.1%) | |
[70,75) | 9,349 (3.9%) | |
[75,80) | 6,503 (2.7%) | |
[80,85) | 3,717 (1.6%) | |
[85,Inf) | 2,149 (0.9%) | |
Time Period | ||
[0,1910) | 2,216 (0.9%) | |
[1910,1920) | 9,820 (4.1%) | |
[1920,1930) | 15,535 (6.5%) | |
[1930,1940) | 20,076 (8.4%) | |
[1940,1950) | 24,238 (10%) | |
[1950,1960) | 26,769 (11%) | |
[1960,1970) | 27,959 (12%) | |
[1970,Inf) | 113,065 (47%) | |
Cohort | ||
Early | 134,554 (56%) | 2,800 (41%) |
Golden | 56,841 (24%) | 1,280 (19%) |
Modern | 48,283 (20%) | 2,747 (40%) |
Height | ||
<6 feet | 128,764 (54%) | 3,232 (47%) |
>=6 feet | 110,914 (46%) | 3,595 (53%) |
Body Mass Index | ||
Normal | 140,487 (59%) | 3,727 (55%) |
Overweight | 98,446 (41%) | 3,030 (44%) |
Obese | 745 (0.3%) | 70 (1.0%) |
Seasons Played | ||
<10 years | 174,726 (73%) | 4,966 (73%) |
10+ years | 64,952 (27%) | 1,861 (27%) |
Handedness | ||
Right Handed | 206,020 (86%) | 5,861 (86%) |
Left Handed | 33,658 (14%) | 966 (14%) |
Died | 3,067 (1.3%) | 3,067 (45%) |
1n (%) |
The Authors used logistic regression to estimate a discrete time hazards model for risk of death between the debut date and the end of the follow-up period. They used debut date for the start date rather than the birth date because by definition all players have survived until their debut. They used a discrete time hazards model instead of a Cox proportional hazards model because the covariate estimates are easily summarized in life tables.
A discrete time logistic regression can be fit with either logit link function or a complementary log-log link function2. The coefficients are the log of the hazard odds ratio in a logit link model, and the log hazard ratio in the complementary log-log link model. The coefficients in the complementary log-log link are directly comparable to those in a Cox proportional hazards model. I’m not sure which link function the Authors used. I tried both and achieved similar coefficient estimates. I settled on the complementary log-log link.
<- person_year %>% filter(yr <= 2004)
person_year_2004
# Logistic regression (lr) formulas.
<- status ~ age_bin
fmla_1_lr <- status ~ age_bin + period_bin
fmla_2_lr <- status ~ age_bin + period_bin + cohort
fmla_3_lr <- status ~ age_bin + period_bin + bmi_fct_2
fmla_4_lr <- status ~ age_bin + period_bin + height_fct_2
fmla_5_lr <- status ~ age_bin + period_bin + throws_2
fmla_6_lr <- status ~ age_bin + period_bin + career_length_fct
fmla_9_lr
<- glm(fmla_1_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_1_lr_2004 <- glm(fmla_2_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_2_lr_2004 <- glm(fmla_3_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_3_lr_2004 <- glm(fmla_4_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_4_lr_2004 <- glm(fmla_5_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_5_lr_2004 <- glm(fmla_6_lr, data = person_year_2004, family = binomial(link = "cloglog"))
mdl_6_lr_2004 # No player ratings, so no mdl_7 or mdl_8.
<- glm(fmla_9_lr, data = person_year_2004, family = binomial(link = "cloglog")) mdl_9_lr_2004
The Authors fit the following models. Models 7 and 8 are not fitted in this analysis because the Lahman data does not include player ratings.
Table 2 presents the coefficient estimates for each model. Table 2 is is directly comparable to the Authors’ Table 2 (external).
<- function(e, p) {
astercize <- case_when(
aster_p < .001 ~ "***",
p < .01 ~ "**",
p < .05 ~ "*",
p TRUE ~ ""
)paste0(e, aster_p)
}
bind_rows(
`Model 1` = tidy(mdl_1_lr_2004, exponentiate = FALSE),
`Model 2` = tidy(mdl_2_lr_2004, exponentiate = FALSE),
`Model 3` = tidy(mdl_3_lr_2004, exponentiate = FALSE),
`Model 4` = tidy(mdl_4_lr_2004, exponentiate = FALSE),
`Model 5` = tidy(mdl_5_lr_2004, exponentiate = FALSE),
`Model 6` = tidy(mdl_6_lr_2004, exponentiate = FALSE),
`Model 9` = tidy(mdl_9_lr_2004, exponentiate = FALSE),
.id = "model_num"
%>%
) mutate(lbl = map2_chr(scales::comma(estimate, .01), p.value, astercize)) %>%
pivot_wider(id_cols = term, names_from = model_num, values_from = lbl) %>%
::flextable() %>%
flextable::align(j = 2:8, align = "right") %>%
flextable::theme_zebra() %>%
flextable::autofit() %>%
flextable::set_caption("Table 2. MLB hazard log odds ratio coeficients on mortality, 1902 - 2004.") flextable
term | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 9 |
(Intercept) | -7.24*** | -6.04*** | -6.03*** | -6.07*** | -6.06*** | -6.03*** | -6.01*** |
age_bin[25,30) | 0.61 | 0.70* | 0.69* | 0.70* | 0.70* | 0.70* | 0.68* |
age_bin[30,35) | 0.64 | 0.81* | 0.78* | 0.81* | 0.82* | 0.81* | 0.79* |
age_bin[35,40) | 1.01** | 1.28*** | 1.22*** | 1.28*** | 1.28*** | 1.28*** | 1.26*** |
age_bin[40,45) | 1.41*** | 1.77*** | 1.67*** | 1.77*** | 1.77*** | 1.76*** | 1.74*** |
age_bin[45,50) | 2.05*** | 2.49*** | 2.35*** | 2.49*** | 2.50*** | 2.49*** | 2.47*** |
age_bin[50,55) | 2.61*** | 3.12*** | 2.94*** | 3.12*** | 3.13*** | 3.12*** | 3.10*** |
age_bin[55,60) | 2.85*** | 3.42*** | 3.20*** | 3.42*** | 3.44*** | 3.42*** | 3.40*** |
age_bin[60,65) | 3.34*** | 3.97*** | 3.71*** | 3.97*** | 3.98*** | 3.96*** | 3.94*** |
age_bin[65,70) | 3.81*** | 4.49*** | 4.20*** | 4.49*** | 4.50*** | 4.49*** | 4.46*** |
age_bin[70,75) | 4.12*** | 4.85*** | 4.52*** | 4.85*** | 4.87*** | 4.85*** | 4.82*** |
age_bin[75,80) | 4.55*** | 5.33*** | 4.97*** | 5.33*** | 5.36*** | 5.33*** | 5.30*** |
age_bin[80,85) | 5.06*** | 5.89*** | 5.48*** | 5.90*** | 5.92*** | 5.89*** | 5.86*** |
age_bin[85,Inf) | 5.58*** | 6.46*** | 6.01*** | 6.47*** | 6.50*** | 6.46*** | 6.43*** |
period_bin[1910,1920) | -0.49 | -0.47 | -0.48 | -0.49 | -0.49 | -0.48 | |
period_bin[1920,1930) | -0.76* | -0.69 | -0.76* | -0.77* | -0.76* | -0.75* | |
period_bin[1930,1940) | -1.34*** | -1.19** | -1.33*** | -1.34*** | -1.34*** | -1.32*** | |
period_bin[1940,1950) | -1.37*** | -1.17** | -1.37*** | -1.38*** | -1.37*** | -1.35*** | |
period_bin[1950,1960) | -1.67*** | -1.41*** | -1.66*** | -1.68*** | -1.67*** | -1.65*** | |
period_bin[1960,1970) | -1.69*** | -1.38*** | -1.68*** | -1.70*** | -1.69*** | -1.66*** | |
period_bin[1970,Inf) | -2.11*** | -1.66*** | -2.11*** | -2.15*** | -2.11*** | -2.09*** | |
cohortGolden | -0.41*** | ||||||
cohortModern | -0.73*** | ||||||
bmi_fct_2Overweight | 0.06 | ||||||
bmi_fct_2Obese | 0.45 | ||||||
height_fct_2>=6 feet | 0.07 | ||||||
throws_2Left Handed | -0.05 | ||||||
career_length_fct10+ years | -0.13** |
Model 1 shows that the odds of death increase with age. Players aged 40–44 are exp(1.41) = 4.09 times as likely to die over the follow-up period as players aged 20–24 (Authors report exp(1.32) = 3.74), those aged 50–54 are exp(2.61) = 13.64 times as likely to die (Authors report exp(2.39) = 10.9), and players aged 70–74 are exp(4.12) = 61.71 times as likely to die (Authors report exp(4.00) = 54.6). The Lahman-based data produced higher mortality risk estimates at all age groups for Model 1.
Model 2 shows that, adjusting for age, more recent calendar periods are associated with lower risks of death. The greatest declines in mortality took place between the 1920–1929 and 1930–1939 periods, exp(-0.76) = 0.47 to exp(-1.34) = 0.26 (Authors report exp(-.54) = 0.58 to exp(-1.08) = 0.34), and between the 1960–1969 and 1970–2004 periods, exp(-1.69) = 0.19 to exp(-2.11) = 0.12) (Authors report exp(-1.47) = 0.23 to exp(-1.97) = 0.14). The Lahman-based data produced a slightly larger negative association between mortality risk and calendar period for the first two periods, and a smaller association for the remaining periods.
The Authors’ Model 3 shows that the coefficients for period and age are reduced but remain strong predictors of mortality after adjusting for the cohort of debut. The Lahman-based data produce a similar result. Compared to those debuting during the Early period (1902-1945), players debuting in the Golden period (1946-1968) were exp(-0.41) = 0.66 as likely to die (Authors report exp(-0.49) = 0.61), and players debuting in the Modern period (1969+) were exp(-0.73) = 0.48 as likely to die (Authors report exp(-1.02) = 0.36). The Lahman-based data produced a smaller benefit from more recent cohorts. The reduction in the period coefficients is due to the correlation between debut era and the period bins. In the plot below, all Modern era player-year records are (by definition) in the 1970+ period.
%>%
person_year_2004 count(period_bin, cohort) %>%
ggplot(aes(x = period_bin, y = n, fill = cohort)) +
geom_col() +
theme_light() +
theme(axis.text.x = element_text(angle = 90)) +
labs(y = "player-years", fill = NULL,
title = "Period and Cohort (Debut Era) Association")
Models 4, 5, and 6 estimate insignificant (at .05 level) associations between risk and obesity (exp(0.45) = 1.56, Authors report exp(.57) = 1.77), height over six feet (exp(0.07) = 1.07, Authors report exp(.03) = 1.03), and left-handedness (exp(-0.05) = 0.95, Authors report exp(-.02) = 0.98).
Finally, those who play for 10 or more years have (1 - exp(-0.13)) * 100 = 12 percent lower odds of death over the follow-up period than those who have shorter careers (Model 9). The Authors report 14 percent lower odds.
Table 3 presents player life expectancy for the most recent period, 1970-2004, based on the hazard coefficients from Model 2. Table 3 is directly comparable to the Authors’ Table 3. I tried to follow the methodology presented by Measure Evaluation with further explanation from page 442 of The Methods and Materials of Demography (Shryock, et al, 1980), but was not quite successful. Some notes on column definitions:
<- mdl_2_lr_2004 %>% broom::tidy()
tdy_2
<- tibble(
life_tbl_period age_bin = rep(levels(person_year$age_bin), 8),
age_bin_term = paste0("age_bin", age_bin),
period_bin = map(levels(person_year$period_bin), ~rep(.x, 14)) %>% unlist(),
period_bin_term = paste0("period_bin", period_bin),
intercept = coef(mdl_2_lr_2004)["(Intercept)"],
# period_bin_1970_Inf = coef(mdl_2_lr_2004)["period_bin[1970,Inf)"],
authors_e = rep(authors_e, 8),
us_male_e = rep(us_male_e, 8)
%>%
) left_join(tdy_2 %>% select(age_bin_term = term, age_bin_est = estimate),
by = "age_bin_term") %>%
left_join(tdy_2 %>% select(period_bin_term = term, period_bin_est = estimate),
by = "period_bin_term") %>%
replace_na(list(age_bin_est = 0, period_bin_est = 0)) %>%
group_by(period_bin) %>%
mutate(
mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
nQx = (5 * mx) / (1 + 2.5 * mx),
nQx_comp = 1 - nQx,
nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
ndx = lx * nQx,
Lx = (lx - ndx) * 5
%>%
) arrange(period_bin, desc(age_bin)) %>%
mutate(
Tx = cumsum(Lx),
ex = Tx / lx
%>%
) ungroup() %>%
arrange(period_bin, age_bin) %>%
select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex,
authors_e, us_male_e)
<- function(dat) {
life_flex %>%
dat select(-c(intercept, age_bin_est, period_bin_est)) %>%
::flextable() %>%
flextable::set_header_labels(
flextableage_bin = "Age Group",
# age_bin_est = "Beta",
nQx_comp = "1 - nQx",
nQx_cumprod = "cum(.)",
us_male_e = "U.S. Males",
authors_e = "Authors"
%>%
) ::colformat_double(j = c(2:5), digits = 4) %>%
flextable::colformat_double(j = c(6:9), digits = 0, big.mark = ",") %>%
flextable::colformat_double(j = c(10:12), digits = 2) %>%
flextable::theme_zebra() %>%
flextable::add_header_row(values = c("", "Life Expectancies"), colwidths = c(9, 3)) %>%
flextable::border(j = 9, border.right = officer::fp_border(color = "gray60"))
flextable
}
%>%
life_tbl_period filter(period_bin == "[1970,Inf)") %>%
select(-period_bin) %>%
life_flex() %>%
::set_caption("Table 3. Life Table, period 1970-2004 from Lahman-based data through 2004.") flextable
Life Expectancies | |||||||||||
Age Group | mx | nQx | 1 - nQx | cum(.) | lx | ndx | Lx | Tx | ex | Authors | U.S. Males |
[0,25) | 0.0003 | 0.0014 | 0.9986 | 0.9986 | 100,000 | 144 | 499,281 | 5,631,495 | 56.31 | 58.11 | 53.25 |
[25,30) | 0.0006 | 0.0029 | 0.9971 | 0.9957 | 99,856 | 289 | 497,838 | 5,132,214 | 51.40 | 53.19 | 48.67 |
[30,35) | 0.0006 | 0.0032 | 0.9968 | 0.9939 | 99,568 | 323 | 496,224 | 4,634,376 | 46.55 | 48.31 | 44.10 |
[35,40) | 0.0010 | 0.0051 | 0.9949 | 0.9916 | 99,388 | 511 | 494,384 | 4,138,152 | 41.64 | 43.45 | 39.57 |
[40,45) | 0.0017 | 0.0084 | 0.9916 | 0.9865 | 99,163 | 829 | 491,670 | 3,643,768 | 36.75 | 38.62 | 35.09 |
[45,50) | 0.0035 | 0.0172 | 0.9828 | 0.9746 | 98,654 | 1,692 | 484,807 | 3,152,098 | 31.95 | 33.90 | 30.66 |
[50,55) | 0.0065 | 0.0320 | 0.9680 | 0.9514 | 97,463 | 3,115 | 471,737 | 2,667,290 | 27.37 | 29.28 | 26.37 |
[55,60) | 0.0088 | 0.0428 | 0.9572 | 0.9266 | 95,143 | 4,074 | 455,345 | 2,195,553 | 23.08 | 25.01 | 22.30 |
[60,65) | 0.0150 | 0.0721 | 0.9279 | 0.8882 | 92,658 | 6,682 | 429,881 | 1,740,209 | 18.78 | 20.98 | 18.53 |
[65,70) | 0.0249 | 0.1174 | 0.8826 | 0.8190 | 88,815 | 10,427 | 391,942 | 1,310,328 | 14.75 | 17.25 | 15.12 |
[70,75) | 0.0354 | 0.1626 | 0.8374 | 0.7391 | 81,895 | 13,316 | 342,897 | 918,386 | 11.21 | 13.90 | 12.05 |
[75,80) | 0.0563 | 0.2467 | 0.7533 | 0.6309 | 73,909 | 18,230 | 278,396 | 575,489 | 7.79 | 10.97 | 9.39 |
[80,85) | 0.0945 | 0.3822 | 0.6178 | 0.4654 | 63,085 | 24,112 | 194,868 | 297,092 | 4.71 | 8.35 | 7.12 |
[85,Inf) | 0.1558 | 0.5607 | 0.4393 | 0.2714 | 46,541 | 26,096 | 102,225 | 102,225 | 2.20 | 6.70 | 5.31 |
Columns “U.S. Males” and “Authors” are taken directly from the Authors’ Table 3. The Authors found higher life expectancies for players relative to the general U.S. male population at all ages, with the advantage diminishing with age. For example, from the Authors Table 3, a 20-year-old MLB players can expect to live 58.11 additional years in the period 1970 - 2004, 4.86 years more than the 53.25 years for comparable males in the U.S. population. Table 3 estimated from the Lahman-based data shows a more modest advantage. A 20-year-old MLB player can expect to live 56.31 additional years, only 3.06 years more than comparable males in the U.S. population. The advantage diminishes, then becomes a disadvantage for 65-70 year old MLB players.
Figure 1 illustrates life expectancies for calendar periods (Panel A) calculated from Model 2 in Table 2 and cohorts (Panel B) from Model 33. Figure 1 is directly comparable to the Authors’ Figure 1. It shows that each period ushers in higher life expectancies for those alive at the outset. Life expectancies at age 20 were 40.3 years (Authors report 45.7 years) among those alive between 1910 and 1919, 48.9 years (Authors report 53.6 years) among those alive between 1930 and 1939, 52.2 years (Authors report 56.3 years) among those alive between 1950 and 1959, and 56.3 years (Authors report 59.6 years) among those alive between 1970 and 2004. Panel B show life expectancies by debut era. The Authors report 20 year-old players from the Modern Era can expect to live 65.5 years, compared to 52.4 years from the Early Era and 58.3 from the Golden Era. My Panel B does not show that, I think because of the collinearity between era and time period in the model.
# Panel B is constructed from the life table derived from model 3. Repeat the
# process.
<- mdl_3_lr_2004 %>% broom::tidy()
tdy_3
<- tibble(
life_tbl_cohort age_bin = rep(levels(person_year$age_bin), 3),
age_bin_term = paste0("age_bin", age_bin),
cohort = map(levels(person_year$cohort), ~rep(.x, 14)) %>% unlist(),
cohort_term = paste0("cohort", cohort),
intercept = coef(mdl_3_lr_2004)["(Intercept)"],
us_male_e = rep(us_male_e, 3),
authors_e = rep(authors_e, 3)
%>%
) left_join(tdy_3 %>% select(age_bin_term = term, age_bin_est = estimate),
by = "age_bin_term") %>%
left_join(tdy_3 %>% select(cohort_term = term, cohort_est = estimate),
by = "cohort_term") %>%
replace_na(list(age_bin_est = 0, cohort_est = 0)) %>%
group_by(cohort) %>%
mutate(
mx = 1 / (1 + exp(-(intercept + cohort_est + age_bin_est))),
nQx = (5 * mx) / (1 + 2.5 * mx),
nQx_comp = 1 - nQx,
nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
ndx = lx * nQx,
Lx = (lx - ndx) * 5
%>%
) arrange(cohort, desc(age_bin)) %>%
mutate(
Tx = cumsum(Lx),
ex = Tx / lx
%>%
) ungroup() %>%
arrange(cohort, age_bin) %>%
select(cohort, age_bin, intercept, age_bin_est, cohort_est, mx:ex, us_male_e,
authors_e)
<- life_tbl_period %>%
fig1a filter(period_bin %in% c("[1910,1920)", "[1930,1940)", "[1950,1960)", "[1970,Inf)")) %>%
ggplot(aes(x = age_bin, y = ex, group = period_bin, color = fct_rev(period_bin))) +
geom_line() +
theme_light() +
theme(axis.text.x = element_text(angle = 90)) +
labs(subtitle = "A. Period", color = NULL, x = "Age", y = "e(x)")
<- life_tbl_cohort %>%
fig1b # filter(period_bin %in% c("[1910,1920)", "[1930,1940)", "[1950,1960)", "[1970,Inf)")) %>%
ggplot(aes(x = age_bin, y = ex, group = cohort, color = fct_rev(cohort))) +
geom_line() +
theme_light() +
theme(axis.text.x = element_text(angle = 90)) +
labs(subtitle = "B. Cohort", color = NULL, x = "Age", y = "e(x)")
library(patchwork)
/ fig1b + patchwork::plot_layout(nrow = 2) +
fig1a ::plot_annotation(title = "Figure 1.") patchwork
At the time of this analysis, the Lahman R package contains player debuts through Jul 2021. How do the results change with 17 years of additional data? Table 4 presents the model fits with the updated data.
<- glm(fmla_1_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_1_lr_2022 <- glm(fmla_2_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_2_lr_2022 <- glm(fmla_3_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_3_lr_2022 <- glm(fmla_4_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_4_lr_2022 <- glm(fmla_5_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_5_lr_2022 <- glm(fmla_6_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_6_lr_2022 <- glm(fmla_9_lr, data = person_year, family = binomial(link = "cloglog"))
mdl_9_lr_2022
bind_rows(
`Model 1` = tidy(mdl_1_lr_2022, exponentiate = FALSE),
`Model 2` = tidy(mdl_2_lr_2022, exponentiate = FALSE),
`Model 3` = tidy(mdl_3_lr_2022, exponentiate = FALSE),
`Model 4` = tidy(mdl_4_lr_2022, exponentiate = FALSE),
`Model 5` = tidy(mdl_5_lr_2022, exponentiate = FALSE),
`Model 6` = tidy(mdl_6_lr_2022, exponentiate = FALSE),
`Model 9` = tidy(mdl_9_lr_2022, exponentiate = FALSE),
.id = "model_num"
%>%
) mutate(lbl = map2_chr(scales::comma(estimate, .01), p.value, astercize)) %>%
pivot_wider(id_cols = term, names_from = model_num, values_from = lbl) %>%
::flextable() %>%
flextable::align(j = 2:8, align = "right") %>%
flextable::theme_zebra() %>%
flextable::autofit() %>%
flextable::set_caption("Table 4. MLB hazard log odds ratio coefficients on mortality, 1902 - 2021.") flextable
term | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 9 |
(Intercept) | -7.24*** | -5.92*** | -5.90*** | -5.94*** | -5.92*** | -5.91*** | -5.88*** |
age_bin[25,30) | 0.41 | 0.53 | 0.52 | 0.53 | 0.53 | 0.53 | 0.52 |
age_bin[30,35) | 0.48 | 0.68* | 0.65* | 0.68* | 0.68* | 0.68* | 0.66* |
age_bin[35,40) | 0.84** | 1.13*** | 1.07*** | 1.13*** | 1.13*** | 1.13*** | 1.11*** |
age_bin[40,45) | 1.22*** | 1.60*** | 1.50*** | 1.60*** | 1.60*** | 1.60*** | 1.58*** |
age_bin[45,50) | 1.89*** | 2.35*** | 2.21*** | 2.35*** | 2.35*** | 2.35*** | 2.33*** |
age_bin[50,55) | 2.40*** | 2.93*** | 2.75*** | 2.94*** | 2.93*** | 2.93*** | 2.91*** |
age_bin[55,60) | 2.65*** | 3.25*** | 3.02*** | 3.26*** | 3.25*** | 3.25*** | 3.23*** |
age_bin[60,65) | 3.17*** | 3.82*** | 3.54*** | 3.83*** | 3.82*** | 3.82*** | 3.80*** |
age_bin[65,70) | 3.61*** | 4.32*** | 3.99*** | 4.33*** | 4.32*** | 4.32*** | 4.30*** |
age_bin[70,75) | 3.98*** | 4.75*** | 4.36*** | 4.76*** | 4.75*** | 4.75*** | 4.72*** |
age_bin[75,80) | 4.37*** | 5.21*** | 4.75*** | 5.22*** | 5.21*** | 5.20*** | 5.18*** |
age_bin[80,85) | 4.89*** | 5.79*** | 5.29*** | 5.80*** | 5.79*** | 5.79*** | 5.76*** |
age_bin[85,Inf) | 5.51*** | 6.47*** | 5.91*** | 6.48*** | 6.47*** | 6.47*** | 6.44*** |
period_bin[1910,1920) | -0.48 | -0.45 | -0.47 | -0.48 | -0.48 | -0.47 | |
period_bin[1920,1930) | -0.73* | -0.65 | -0.73* | -0.73* | -0.73* | -0.73* | |
period_bin[1930,1940) | -1.29*** | -1.15** | -1.30*** | -1.29*** | -1.30*** | -1.28*** | |
period_bin[1940,1950) | -1.33*** | -1.11** | -1.34*** | -1.33*** | -1.33*** | -1.32*** | |
period_bin[1950,1960) | -1.65*** | -1.34*** | -1.65*** | -1.65*** | -1.65*** | -1.63*** | |
period_bin[1960,1970) | -1.69*** | -1.31*** | -1.69*** | -1.69*** | -1.68*** | -1.66*** | |
period_bin[1970,Inf) | -2.30*** | -1.62*** | -2.30*** | -2.30*** | -2.30*** | -2.27*** | |
cohortGolden | -0.45*** | ||||||
cohortModern | -0.94*** | ||||||
bmi_fct_2Overweight | 0.06 | ||||||
bmi_fct_2Obese | 0.34 | ||||||
height_fct_2>=6 feet | 0.00 | ||||||
throws_2Left Handed | -0.05 | ||||||
career_length_fct10+ years | -0.13*** |
Comparing Table 4 to Table 2,
Table 5 presents an updated life table.
<- mdl_2_lr_2022 %>% broom::tidy()
tdy_2022
<- tibble(
life_tbl_2022 age_bin = rep(levels(person_year$age_bin), 8),
age_bin_term = paste0("age_bin", age_bin),
period_bin = map(levels(person_year$period_bin), ~rep(.x, 14)) %>% unlist(),
period_bin_term = paste0("period_bin", period_bin),
intercept = coef(mdl_2_lr_2022)["(Intercept)"],
# period_bin_1970_Inf = coef(mdl_2_lr_2004)["period_bin[1970,Inf)"],
authors_e = rep(authors_e, 8),
us_male_e = rep(us_male_e, 8)
%>%
) left_join(tdy_2022 %>% select(age_bin_term = term, age_bin_est = estimate),
by = "age_bin_term") %>%
left_join(tdy_2022 %>% select(period_bin_term = term, period_bin_est = estimate),
by = "period_bin_term") %>%
replace_na(list(age_bin_est = 0, period_bin_est = 0)) %>%
group_by(period_bin) %>%
mutate(
mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
nQx = (5 * mx) / (1 + 2.5 * mx),
nQx_comp = 1 - nQx,
nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
ndx = lx * nQx,
Lx = (lx - ndx) * 5
%>%
) arrange(period_bin, desc(age_bin)) %>%
mutate(
Tx = cumsum(Lx),
ex = Tx / lx
%>%
) ungroup() %>%
arrange(period_bin, age_bin) %>%
select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex,
authors_e, us_male_e)
%>%
life_tbl_2022 filter(period_bin == "[1970,Inf)") %>%
select(-period_bin) %>%
life_flex() %>%
::set_caption("Table 5. Life Table, period 1970-2021 from Lahman-based data through 2021.") flextable
Life Expectancies | |||||||||||
Age Group | mx | nQx | 1 - nQx | cum(.) | lx | ndx | Lx | Tx | ex | Authors | U.S. Males |
[0,25) | 0.0003 | 0.0014 | 0.9986 | 0.9986 | 100,000 | 135 | 499,324 | 5,783,063 | 57.83 | 58.11 | 53.25 |
[25,30) | 0.0005 | 0.0023 | 0.9977 | 0.9964 | 99,865 | 229 | 498,178 | 5,283,739 | 52.91 | 53.19 | 48.67 |
[30,35) | 0.0005 | 0.0027 | 0.9973 | 0.9950 | 99,636 | 266 | 496,848 | 4,785,561 | 48.03 | 48.31 | 44.10 |
[35,40) | 0.0008 | 0.0042 | 0.9958 | 0.9932 | 99,504 | 417 | 495,438 | 4,288,713 | 43.10 | 43.45 | 39.57 |
[40,45) | 0.0013 | 0.0067 | 0.9933 | 0.9892 | 99,316 | 663 | 493,263 | 3,793,275 | 38.19 | 38.62 | 35.09 |
[45,50) | 0.0028 | 0.0140 | 0.9860 | 0.9794 | 98,917 | 1,388 | 487,645 | 3,300,012 | 33.36 | 33.90 | 30.66 |
[50,55) | 0.0051 | 0.0250 | 0.9750 | 0.9614 | 97,939 | 2,445 | 477,472 | 2,812,367 | 28.72 | 29.28 | 26.37 |
[55,60) | 0.0069 | 0.0341 | 0.9659 | 0.9418 | 96,136 | 3,278 | 464,289 | 2,334,895 | 24.29 | 25.01 | 22.30 |
[60,65) | 0.0122 | 0.0593 | 0.9407 | 0.9086 | 94,179 | 5,584 | 442,978 | 1,870,606 | 19.86 | 20.98 | 18.53 |
[65,70) | 0.0200 | 0.0952 | 0.9048 | 0.8511 | 90,864 | 8,654 | 411,047 | 1,427,628 | 15.71 | 17.25 | 15.12 |
[70,75) | 0.0303 | 0.1410 | 0.8590 | 0.7772 | 85,112 | 12,001 | 365,551 | 1,016,582 | 11.94 | 13.90 | 12.05 |
[75,80) | 0.0470 | 0.2104 | 0.7896 | 0.6782 | 77,718 | 16,353 | 306,822 | 651,031 | 8.38 | 10.97 | 9.39 |
[80,85) | 0.0814 | 0.3381 | 0.6619 | 0.5226 | 67,824 | 22,932 | 224,460 | 344,209 | 5.08 | 8.35 | 7.12 |
[85,Inf) | 0.1486 | 0.5417 | 0.4583 | 0.3033 | 52,261 | 28,312 | 119,749 | 119,749 | 2.29 | 6.70 | 5.31 |
The life expectancy estimates are higher for all ages and are closer to the Authors’ results. The Authors estimated a 20-year-old MLB player expect to live 58.11 additional years in the period 1970 - 2004, 4.86 years more than the 53.25 years for comparable males in the U.S. population. Table 3 estimated from the Lahman-based data through 2004 estimated 56.31 additional years. Table 5 estimated from the Lahman-based data through 2021 estimated 57.83 additional years.
The Authors fit a logistic regression model because it enabled them to present their results in life tables. It is more common to see studies fit a Cox proportional hazards model.
Recall that the Cox proportional hazards model expresses mortality hazard as an unspecified baseline hazard multiplied by the exponential of a linear combination of the explanatory variables, \(h(t) = h_0(t) \cdot e^{X\beta}\). \(h_0(t)\) is estimated from the data by maximum likelihood estimation. The proportionality of the model comes from the lack of time dependence in the explanatory variables. Whereas logistic regression predicts the log odds of the response variable, Cox regression predicts the log relative hazard, \(\ln \left[ \frac{h(t)}{h_0(t)} \right] = X\beta\) (relative to the baseline). A change in \(X\) from \(x_a\) to \(x_b\) is associated with a \((x_b - x_a)\beta = \ln \left[ \frac{h_b(t)}{h_0(t)} \right] - \ln \left[ \frac{h_a(t)}{h_0(t)} \right] = \ln \left[ \frac{h_b(t)}{h_a(t)} \right]\) change in the log relative hazard. When \(x\) is a factor variable, \(x_b - x_a\) is a factor level change and the antilog of \(\beta\) is the hazard ratio of level \(b\) relative to level \(a\), \(e^\beta = \frac{h_b(t)}{h_a(t)}\).
There is no need to develop person-year rows for Cox regression. Instead, define a variable for the follow-up time (at death last observation time) from birth (in years), and status (dead or censored). However, given the broad expanse of time covered in the data, the model should control for changes in medical technology. That was the role of the Authors’ Period covariate. And because we are estimating life expectancy by age group, the model needs a time dependent covariate for age.
<- person %>%
cox_dat_0 mutate(
debut_age = time_length(interval(birthDate, debut), unit = "year"),
# follow-up time is earlier of two dates: death or data set date.
fu_date = coalesce(deathDate, max(person$deathDate, na.rm = TRUE)),
fu_time = time_length(interval(debut, fu_date), unit = "year"),
fu_status = if_else(is.na(deathDate), 0, 1),
# Decade indicators are instrumental variables capturing general advances
# in life-extending medical technology. *_time = years from debut to start of
# decade. If debut is in middle of a decade, then *_time is 0 years.
pd1900_status = "[1900,1910)",
pd1900_time = pmax(time_length(interval(debut, ym(190001)), unit = "year"), 0),
pd1900_time = if_else(fu_date >= ym(190001) & debutYear < 1910, pd1900_time, NA_real_),
pd1910_status = "[1910,1920)",
pd1910_time = pmax(time_length(interval(debut, ym(191001)), unit = "year"), 0),
pd1910_time = if_else(fu_date >= ym(191001) & debutYear < 1920, pd1910_time, NA_real_),
pd1920_status = "[1920,1930)",
pd1920_time = pmax(time_length(interval(debut, ym(192001)), unit = "year"), 0),
pd1920_time = if_else(fu_date >= ym(192001) & debutYear < 1930, pd1920_time, NA_real_),
pd1930_status = "[1930,1940)",
pd1930_time = pmax(time_length(interval(debut, ym(193001)), unit = "year"), 0),
pd1930_time = if_else(fu_date >= ym(193001) & debutYear < 1940, pd1930_time, NA_real_),
pd1940_status = "[1940,1950)",
pd1940_time = pmax(time_length(interval(debut, ym(194001)), unit = "year"), 0),
pd1940_time = if_else(fu_date >= ym(194001) & debutYear < 1950, pd1940_time, NA_real_),
pd1950_status = "[1950,1960)",
pd1950_time = pmax(time_length(interval(debut, ym(195001)), unit = "year"), 0),
pd1950_time = if_else(fu_date >= ym(195001) & debutYear < 1960, pd1950_time, NA_real_),
pd1960_status = "[1960,1970)",
pd1960_time = pmax(time_length(interval(debut, ym(196001)), unit = "year"), 0),
pd1960_time = if_else(fu_date >= ym(196001) & debutYear < 1970, pd1960_time, NA_real_),
pd1970_status = "[1970,Inf)",
pd1970_time = pmax(time_length(interval(debut, ym(197001)), unit = "year"), 0),
pd1970_time = if_else(fu_date >= ym(197001) & debutYear < 9999, pd1970_time, NA_real_),
age00_status = "[0,25)",
age00_time = time_length(interval(debut, birthDate + years(0)), unit = "year"),
age00_time = if_else(fu_date >= birthDate + years(0) & age00_time >= -25, pmax(age00_time, 0), NA_real_),
age25_status = "[25,30)",
age25_time = time_length(interval(debut, birthDate + years(25)), unit = "year"),
age25_time = if_else(fu_date >= birthDate + years(25) & age25_time >= -5, pmax(age25_time, 0), NA_real_),
age30_status = "[30,35)",
age30_time = time_length(interval(debut, birthDate + years(30)), unit = "year"),
age30_time = if_else(fu_date >= birthDate + years(30) & age30_time >= -5, pmax(age30_time, 0), NA_real_),
age35_status = "[35,40)",
age35_time = time_length(interval(debut, birthDate + years(35)), unit = "year"),
age35_time = if_else(fu_date >= birthDate + years(35) & age35_time >= -5, pmax(age35_time, 0), NA_real_),
age40_status = "[40,45)",
age40_time = time_length(interval(debut, birthDate + years(40)), unit = "year"),
age40_time = if_else(fu_date >= birthDate + years(40) & age40_time >= -5, pmax(age40_time, 0), NA_real_),
age45_status = "[45,50)",
age45_time = time_length(interval(debut, birthDate + years(45)), unit = "year"),
age45_time = if_else(fu_date >= birthDate + years(45) & age45_time >= -5, pmax(age45_time, 0), NA_real_),
age50_status = "[50,55)",
age50_time = time_length(interval(debut, birthDate + years(50)), unit = "year"),
age50_time = if_else(fu_date >= birthDate + years(50) & age50_time >= -5, pmax(age50_time, 0), NA_real_),
age55_status = "[55,60)",
age55_time = time_length(interval(debut, birthDate + years(55)), unit = "year"),
age55_time = if_else(fu_date >= birthDate + years(55) & age55_time >= -5, pmax(age55_time, 0), NA_real_),
age60_status = "[60,65)",
age60_time = time_length(interval(debut, birthDate + years(60)), unit = "year"),
age60_time = if_else(fu_date >= birthDate + years(60) & age60_time >= -5, pmax(age60_time, 0), NA_real_),
age65_status = "[65,70)",
age65_time = time_length(interval(debut, birthDate + years(65)), unit = "year"),
age65_time = if_else(fu_date >= birthDate + years(65) & age65_time >= -5, pmax(age65_time, 0), NA_real_),
age70_status = "[70,75)",
age70_time = time_length(interval(debut, birthDate + years(70)), unit = "year"),
age70_time = if_else(fu_date >= birthDate + years(70) & age70_time >= -5, pmax(age70_time, 0), NA_real_),
age75_status = "[75,80)",
age75_time = time_length(interval(debut, birthDate + years(75)), unit = "year"),
age75_time = if_else(fu_date >= birthDate + years(75) & age75_time >= -5, pmax(age75_time, 0), NA_real_),
age80_status = "[80,85)",
age80_time = time_length(interval(debut, birthDate + years(80)), unit = "year"),
age80_time = if_else(fu_date >= birthDate + years(80) & age80_time >= -5, pmax(age80_time, 0), NA_real_),
age85_status = "[85,Inf)",
age85_time = time_length(interval(debut, birthDate + years(85)), unit = "year"),
age85_time = if_else(fu_date >= birthDate + years(85) & age85_time >= -5, pmax(age85_time, 0), NA_real_)
)
# Don't need the newly created period vars in the final data set.
<- tmerge(
cox_dat_1 %>% select(-starts_with("pd"), -starts_with("age")),
cox_dat_0
cox_dat_0, id = playerID,
death = event(fu_time, fu_status)
)
# Create the decade and age vars
<- tmerge(
cox_dat id = playerID,
cox_dat_1, cox_dat_0, decade1900 = tdc(pd1900_time, pd1900_status),
decade1910 = tdc(pd1910_time, pd1910_status),
decade1920 = tdc(pd1920_time, pd1920_status),
decade1930 = tdc(pd1930_time, pd1930_status),
decade1940 = tdc(pd1940_time, pd1940_status),
decade1950 = tdc(pd1950_time, pd1950_status),
decade1960 = tdc(pd1960_time, pd1960_status),
decade1970 = tdc(pd1970_time, pd1970_status),
isage00 = tdc(age00_time, age00_status),
isage25 = tdc(age25_time, age25_status),
isage30 = tdc(age30_time, age30_status),
isage35 = tdc(age35_time, age35_status),
isage40 = tdc(age40_time, age40_status),
isage45 = tdc(age45_time, age45_status),
isage50 = tdc(age50_time, age50_status),
isage55 = tdc(age55_time, age55_status),
isage60 = tdc(age60_time, age60_status),
isage65 = tdc(age65_time, age65_status),
isage70 = tdc(age70_time, age70_status),
isage75 = tdc(age75_time, age75_status),
isage80 = tdc(age80_time, age80_status),
isage85 = tdc(age85_time, age85_status)
%>%
) mutate(decade = factor(coalesce(decade1970, decade1960, decade1950, decade1940,
decade1930, decade1920, decade1910, decade1900)),age = factor(coalesce(isage85, isage80, isage75, isage70, isage65, isage60,
isage55, isage50, isage45, isage40, isage35, isage30, %>%
isage25, isage00))) select(-starts_with("decade19"), -starts_with("isage")) %>%
filter(!is.na(decade), !is.na(age))
The following table summarizes the models.
<- coxph(Surv(tstart, tstop, death) ~ age, data = cox_dat)
mdl_1_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade, data = cox_dat)
mdl_2_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + cohort, data = cox_dat)
mdl_3_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + bmi_fct_2, data = cox_dat)
mdl_4_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + height_fct_2, data = cox_dat)
mdl_5_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + throws_2, data = cox_dat)
mdl_6_cox_2022 <- coxph(Surv(tstart, tstop, death) ~ age + decade + career_length_fct, data = cox_dat)
mdl_9_cox_2022
# For text description.
<- tbl_regression(mdl_1_cox_2022, exponentiate = TRUE) %>%
gt_1_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_2_cox_2022, exponentiate = TRUE) %>%
gt_2_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_3_cox_2022, exponentiate = TRUE) %>%
gt_3_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_4_cox_2022, exponentiate = TRUE) %>%
gt_4_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_5_cox_2022, exponentiate = TRUE) %>%
gt_5_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_6_cox_2022, exponentiate = TRUE) %>%
gt_6_cox_2022 add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_9_cox_2022, exponentiate = TRUE) %>%
gt_9_cox_2022 add_significance_stars(hide_se = TRUE)
# For table.
<- tbl_regression(mdl_1_cox_2022, exponentiate = FALSE) %>%
gt_1_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_2_cox_2022, exponentiate = FALSE) %>%
gt_2_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_3_cox_2022, exponentiate = FALSE) %>%
gt_3_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_4_cox_2022, exponentiate = FALSE) %>%
gt_4_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_5_cox_2022, exponentiate = FALSE) %>%
gt_5_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_6_cox_2022, exponentiate = FALSE) %>%
gt_6_cox_2022_t add_significance_stars(hide_se = TRUE)
<- tbl_regression(mdl_9_cox_2022, exponentiate = FALSE) %>%
gt_9_cox_2022_t add_significance_stars(hide_se = TRUE)
::tbl_merge(
gtsummarylist(gt_1_cox_2022_t, gt_2_cox_2022_t, gt_3_cox_2022_t, gt_4_cox_2022_t,
gt_5_cox_2022_t, gt_6_cox_2022_t, gt_9_cox_2022_t), tab_spanner = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5",
"Model 6", "Model 9")
)
Characteristic | Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | Model 9 |
---|---|---|---|---|---|---|---|
log(HR)1,2 | log(HR)1,2 | log(HR)1,2 | log(HR)1,2 | log(HR)1,2 | log(HR)1,2 | log(HR)1,2 | |
age | |||||||
[0,25) | — | — | — | — | — | — | — |
[25,30) | -0.33 | -0.15 | -0.14 | -0.16 | -0.15 | -0.16 | -0.20 |
[30,35) | -0.14 | 0.01 | 0.03 | 0.01 | 0.01 | 0.01 | -0.08 |
[35,40) | -0.18 | -0.03 | -0.01 | -0.04 | -0.03 | -0.03 | -0.16 |
[40,45) | -0.05 | 0.07 | 0.10 | 0.06 | 0.07 | 0.07 | -0.10 |
[45,50) | 0.71 | 0.77 | 0.78 | 0.76 | 0.77 | 0.77 | 0.56 |
[50,55) | 1.3* | 1.3* | 1.3* | 1.3* | 1.3* | 1.3* | 1.0 |
[55,60) | 1.6** | 1.6** | 1.5** | 1.5** | 1.6** | 1.6** | 1.3* |
[60,65) | 1.9*** | 1.9*** | 1.8*** | 1.9*** | 1.9*** | 1.9*** | 1.6** |
[65,70) | 2.4*** | 2.3*** | 2.3*** | 2.3*** | 2.4*** | 2.3*** | 2.0*** |
[70,75) | 2.8*** | 2.8*** | 2.6*** | 2.7*** | 2.8*** | 2.7*** | 2.3*** |
[75,80) | 3.0*** | 3.0*** | 2.9*** | 3.0*** | 3.0*** | 3.0*** | 2.5*** |
[80,85) | 3.5*** | 3.5*** | 3.4*** | 3.5*** | 3.5*** | 3.5*** | 3.0*** |
[85,Inf) | 3.9*** | 3.9*** | 3.8*** | 3.9*** | 3.9*** | 3.9*** | 3.3*** |
decade | |||||||
[1900,1910) | — | — | — | — | — | — | |
[1910,1920) | -0.52 | -0.50 | -0.52 | -0.52 | -0.52 | -0.53 | |
[1920,1930) | -0.86* | -0.78* | -0.86* | -0.86* | -0.86* | -0.86* | |
[1930,1940) | -1.4*** | -1.3*** | -1.4*** | -1.4*** | -1.4*** | -1.4*** | |
[1940,1950) | -1.5*** | -1.3*** | -1.5*** | -1.5*** | -1.5*** | -1.5*** | |
[1950,1960) | -1.8*** | -1.5*** | -1.8*** | -1.8*** | -1.8*** | -1.8*** | |
[1960,1970) | -1.8*** | -1.5*** | -1.8*** | -1.9*** | -1.8*** | -1.8*** | |
[1970,Inf) | -2.5*** | -1.8*** | -2.5*** | -2.5*** | -2.5*** | -2.5*** | |
cohort | |||||||
Early | — | ||||||
Golden | -0.41*** | ||||||
Modern | -0.86*** | ||||||
bmi_fct_2 | |||||||
Normal | — | ||||||
Overweight | 0.08* | ||||||
Obese | 0.36 | ||||||
height_fct_2 | |||||||
<6 feet | — | ||||||
>=6 feet | 0.01 | ||||||
throws_2 | |||||||
Right Handed | — | ||||||
Left Handed | -0.05 | ||||||
career_length_fct | |||||||
<10 years | — | ||||||
10+ years | -0.23*** | ||||||
1
*p<0.05; **p<0.01; ***p<0.001
2
HR = Hazard Ratio
|
Model 1 shows that the odds of death increase with age after age
50. Players aged 50–54 are 3.56 (95% CI 1.19, 10.7; p=0.023) times as
likely to die over the follow-up period as players aged 20–24 (Authors
report exp(2.39) = 10.91). Whereas the discrete time logistic regression
estimated a monotonically increasing mortality risk with statistically
significant (.05 level) coefficients beginning at age bin 35-39, the
Lahman-based data produced the lowest risk at the age
25-29 age range. The difference may be due to the model precision.
Players aged 0-25 are not equally distributed in age. The average debut
age was 24.3; most debut closer age 25. Older age groups are more likely
to have players equally distributed within the 5 year age bucket. In
effect, the discrete time model was comparing five-year mortality risk
to a reference group of 24 year-olds surviving to age 25. The discrete
time analysis does not account for time spent in an age interval, Cox
does.
Model 2 shows that, adjusting for age, more recent calendar periods are associated with lower risks of death. The 1920–1929 decade mortality risk was 0.42 (95% CI 0.20, 0.89; p=0.024) times that of the 1900-1909 decade (Authors report exp(-.54) = 0.58). The Lahman-based data produced a slightly stronger negative association between mortality risk and calendar period.
Model 3 shows that era of debut is negatively associated with mortality risk. A player debuting in the Modern era faced a mortality risk 0.42 (95% CI 0.36, 0.50; p<0.001) times that of an Early era player (Authors report exp(-1.02) = 0.36). The Lahman-based data produced a slightly weaker negative association between mortality risk and cohort.
Model 4 estimated a statistically significant increased mortality risk associated with being overweight. The Authors did not measure a statistically significant relationship. An overweight player’s mortality risk is 1.08 (95% CI 1.01, 1.15; p=0.020) that of a player of normal weight. The Authors report exp(0.04) = 1.04.
Models 5 and 6 estimated mortality associates with height and with handedness that were not different from 0 at the .05 confidence level, in agreement with the Authors finding.
Model 9 shows that those who play for 10 or more years have 0.80 (95% CI 0.74, 0.86; p<0.001) times the mortality risk of those who have shorter careers (Authors report exp(-.15) = .86).
# tdy_2022 <- mdl_2_cox_2022 %>% broom::tidy()
#
# life_tbl_2022 <- tibble(
# age_bin = rep(levels(cox_dat$age), 8),
# age_bin_term = paste0("age", age_bin),
# period_bin = map(levels(cox_dat$decade), ~rep(.x, 14)) %>% unlist(),
# period_bin_term = paste0("decade", period_bin),
# intercept = coef(mdl_2_cox_2022)["(Intercept)"],
# authors_e = rep(authors_e, 8),
# us_male_e = rep(us_male_e, 8)
# ) %>%
# left_join(tdy_2022 %>% select(age_bin_term = term, age_bin_est = estimate),
# by = "age_bin_term") %>%
# left_join(tdy_2022 %>% select(period_bin_term = term, period_bin_est = estimate),
# by = "period_bin_term") %>%
# replace_na(list(age_bin_est = 0, period_bin_est = 0, intercept = 0)) %>%
# group_by(period_bin) %>%
# mutate(
# mx = 1 / (1 + exp(-(intercept + period_bin_est + age_bin_est))),
# nQx = (5 * mx) / (1 + 2.5 * mx),
# nQx_comp = 1 - nQx,
# nQx_cumprod = nQx_comp * lag(nQx_comp, n = 1, default = 1.0),
# lx = 100000 * lag(nQx_cumprod, n = 1, default = 1.0),
# ndx = lx * nQx,
# Lx = (lx - ndx) * 5
# ) %>%
# arrange(period_bin, desc(age_bin)) %>%
# mutate(
# Tx = cumsum(Lx),
# ex = Tx / lx
# ) %>%
# ungroup() %>%
# arrange(period_bin, age_bin) %>%
# select(period_bin, age_bin, intercept, age_bin_est, period_bin_est, mx:ex,
# authors_e, us_male_e)
#
# life_tbl_2022 %>%
# filter(period_bin == "[1970,Inf)") %>%
# select(-period_bin) %>%
# life_flex() %>%
# flextable::set_caption("Table 5. Life Table, period 1970-2021 from Lahman-based data through 2021.")
The logistic regression facilitated the construction of life tables. The Cox proportional hazards model does not do that, but instead, the model can be used to predict survival at selected predictor levels. The plots below compare cohort survival expectations for two age groups faceted by two decades. In the first facet, the median remaining life expectancy of a player aged 70 - 75 during the 1900-1910 decade is about 18 years. In the second facet, the median jumps to about 28 years.4
summary(mdl_2_cox_2022)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ age + decade, data = cox_dat)
##
## n= 82704, number of events= 3861
##
## coef exp(coef) se(coef) z Pr(>|z|)
## age[25,30) -0.15394 0.85732 0.36778 -0.419 0.675537
## age[30,35) 0.01142 1.01149 0.41790 0.027 0.978194
## age[35,40) -0.02987 0.97057 0.48133 -0.062 0.950511
## age[40,45) 0.07272 1.07543 0.51457 0.141 0.887621
## age[45,50) 0.76889 2.15737 0.53014 1.450 0.146958
## age[50,55) 1.28244 3.60541 0.54230 2.365 0.018040 *
## age[55,60) 1.55903 4.75420 0.55188 2.825 0.004729 **
## age[60,65) 1.90679 6.73147 0.55850 3.414 0.000640 ***
## age[65,70) 2.34999 10.48550 0.56342 4.171 3.03e-05 ***
## age[70,75) 2.75274 15.68555 0.56767 4.849 1.24e-06 ***
## age[75,80) 3.01296 20.34752 0.57143 5.273 1.34e-07 ***
## age[80,85) 3.50876 33.40666 0.57477 6.105 1.03e-09 ***
## age[85,Inf) 3.89132 48.97548 0.57920 6.718 1.84e-11 ***
## decade[1910,1920) -0.52456 0.59182 0.38947 -1.347 0.178026
## decade[1920,1930) -0.86342 0.42172 0.38282 -2.255 0.024106 *
## decade[1930,1940) -1.43405 0.23834 0.38022 -3.772 0.000162 ***
## decade[1940,1950) -1.49439 0.22438 0.37509 -3.984 6.77e-05 ***
## decade[1950,1960) -1.79692 0.16581 0.37426 -4.801 1.58e-06 ***
## decade[1960,1970) -1.84856 0.15746 0.37337 -4.951 7.39e-07 ***
## decade[1970,Inf) -2.48379 0.08343 0.37159 -6.684 2.32e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## age[25,30) 0.85732 1.16642 0.41695 1.7628
## age[30,35) 1.01149 0.98864 0.44591 2.2944
## age[35,40) 0.97057 1.03032 0.37785 2.4931
## age[40,45) 1.07543 0.92986 0.39226 2.9484
## age[45,50) 2.15737 0.46353 0.76326 6.0979
## age[50,55) 3.60541 0.27736 1.24551 10.4367
## age[55,60) 4.75420 0.21034 1.61183 14.0229
## age[60,65) 6.73147 0.14856 2.25276 20.1143
## age[65,70) 10.48550 0.09537 3.47540 31.6354
## age[70,75) 15.68555 0.06375 5.15586 47.7198
## age[75,80) 20.34752 0.04915 6.63915 62.3606
## age[80,85) 33.40666 0.02993 10.82907 103.0564
## age[85,Inf) 48.97548 0.02042 15.73834 152.4047
## decade[1910,1920) 0.59182 1.68972 0.27585 1.2697
## decade[1920,1930) 0.42172 2.37127 0.19914 0.8930
## decade[1930,1940) 0.23834 4.19567 0.11312 0.5022
## decade[1940,1950) 0.22438 4.45663 0.10758 0.4680
## decade[1950,1960) 0.16581 6.03104 0.07962 0.3453
## decade[1960,1970) 0.15746 6.35066 0.07575 0.3273
## decade[1970,Inf) 0.08343 11.98667 0.04027 0.1728
##
## Concordance= 0.639 (se = 0.005 )
## Likelihood ratio test= 709.4 on 20 df, p=<2e-16
## Wald test = 739 on 20 df, p=<2e-16
## Score (logrank) test = 783.8 on 20 df, p=<2e-16
<- expand.grid(
new_dat age = c("[0,25)", "[50,55)", "[70,75)"),
decade = c("[1900,1910)", "[1960,1970)")
%>%
) mutate(strata = as.factor(row_number()))
survfit(mdl_2_cox_2022, newdata = new_dat, data = cox_dat) %>%
surv_summary() %>%
inner_join(new_dat, by = "strata") %>%
ggplot(aes(x = time, color = age, fill = age)) +
geom_line(aes(y = surv)) +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.4) +
geom_hline(yintercept = 0.5, linetype = 2) +
facet_wrap(facets = vars(decade))
Overview of Life Tables and Survival Rates. Lesson 7 of Population Analysis for Planners. Measure Evaluation.
Lahman. Sean Lahman’s R Package, version 10.0-1, 2022-04-07. GitHub
Saint Onge JM, Rogers RG, Krueger PM. Major League Baseball Players’ Life Expectancies. Soc Sci Q. 2008;89(3):817-830. doi:10.1111/j.1540-6237.2008.00562.x. HTML.
Shryock HS, Siegel JS, Larmon EA, (1980).The Methods and Materials of Demography. United States:Department of Commerce, Bureau of the Census. Google Books.
Thorn, J., & Palmer, P. (1990). Total baseball (p. 2294). Haynes Publications. Google Scholar.
This would seem to bias ages downward and estimated lifetimes upward by up to one year.↩︎
See my notes, and Bayesium Analytics)↩︎
The Authors say both panels are from Model 2, but that cannot be right. Cohort is not a predictor variable in Model 2.↩︎
These values are absurdly high. I’m not sure what’s going on here yet.↩︎