Brief Overview of Survival Analysis in Python
Survival analysis (SA) is used to study time to an event of interest (usually the event of death). Through SA, we are able to make estimates and predictions regarding the probability and risk of an event occurring over a span of time, otherwise known as survival time.
While the name ‘Survival Analysis’ may mislead individuals into believing that the scope of SA is limited to clinical studies and health research, SA has a wide range of applications across many industries.
Several examples are outlined in the following:
- analyzing time to death (event)
- analyzing time to onset (or relapse) of a disease
- viral load measurements in HIV+ patients
- predicting risk of death for insured clients → pricing premiums, deductibles, policy eligibility, and cancellations based on one’s perceived longevity
- predicting if and when supply chain customers or suppliers might file for bankruptcy so that proactive measures may be taken to avoid supply chain disruption
- analyzing customer churn for subscription-based companies
- predicting credit risk
- predicting risk and time to bankruptcy
- predicting attributes of ex-employees that may help estimate future termination or turnover rates (Bayesian Survival Analysis)
- predicting which contestants move on to the next stage/round and which do not
- assessing time until being locked out from web scraping due to poorly written code and repeated API calls
```python
from lifelines import KaplanMeierFitter
```
Survival Analysis for estimating the endpoint of death for heart attack survivors who are normal (BMI < 25), overweight (25 ≤ BMI < 30), or obese (BMI ≥ 30)
In this post, we are interested in survival time to death for subjects who have experienced a myocardial event (n=500). Baseline is set at the point at which subject randomization occurs, time point zero.
[Figure: Kaplan-Meier survival curve for the cohort; 57% of subjects are censored]
The event time (aka ‘failure time’ or ‘survival time’) random variable T is the focus of survival analysis. To define it, we need the following three features:
- an unambiguous time origin
(e.g. time of randomization for study, time of diagnosis, time of marketing intervention)
- a time scale
(e.g. real time (days, months, years), menstrual cycles, etc)
- definition of the (occurrence of the) event
(e.g. death, recurrence, need a new implant)
This is defined as right censoring because the true, unobserved event occurs after our censoring time; the only assumption we can make is that the event had not yet occurred by the end of follow-up — we do not know anything that happens after censoring!
Caused mainly by:
- loss to follow-up in the study
- subject drop-out
- study termination (administrative censoring)
Right censoring is far more common than left censoring, in which the event is only known to have occurred before the first observation time.
Example 1: A study of the age at which African children learn a task: some already knew it (left-censored), some learned it during the study (exact event time), and some had not yet learned it by the end of the study (right-censored).
Experience: often encountered when measuring mRNA concentrations after purification using older-generation NanoDrop machines. Older models may have a higher minimum readout than newer models; if the assay cannot read concentrations under 20 ng/dL, then any concentration under 20 ng/dL is left-censored at 20 ng/dL.
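Both kinds of censored observations are typically encoded as a pair of parallel arrays: an observed duration and an event flag. A minimal sketch with invented data (not from the study above):

```python
# Hypothetical follow-up data: one entry per subject.
# durations[i] = time observed; events[i] = 1 if the event (death) was
# observed at that time, 0 if the subject was right-censored there.
durations = [5, 6, 6, 2, 4, 4]
events = [1, 0, 0, 1, 1, 0]

# A right-censored subject only tells us they survived *at least* this long;
# we know nothing about what happened afterwards.
censored_times = [t for t, e in zip(durations, events) if e == 0]
print(censored_times)  # times at which subjects left observation event-free
```

This duration/event-flag representation is also the form lifelines fitters expect as input.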
Non-informative vs Informative Censoring
- Censoring is deemed ‘non-informative’ for survival analysis if the observations being censored are representative of all subjects who survive up to that time t.
- Censoring is deemed ‘informative’ if there is evidence of dependence between censoring and the event (death).
An example of informative censoring:
A study conducted at two study centers A and B, which serve somewhat different patient populations:
- Center A has sicker patients with shorter survival times who are harder to recruit: these patients typically entered the study later
- Center B has healthier patients who tend to enroll earlier in the study and thus may stay on the study longer
Censoring patterns affect comparisons between survival distributions and functions, and the conclusions you draw will differ depending on the assumptions you make!
Survival Function — parametric
Survival function, S(t), of a population is defined as:
S(t) = Pr(T > t) = 1 − F(t), where F(t) is the cumulative distribution function (CDF) of T
- Lowercase t represents a specific time of interest for T
- If there is no censoring, S(t) is simply the proportion of individuals with observed event times greater than t
- If there is censoring → use a non-parametric estimator (Kaplan-Meier, life tables, cumulative hazard function)
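The no-censoring case above can be computed directly as a proportion. A minimal sketch with made-up event times:

```python
def empirical_survival(event_times, t):
    # With no censoring, S(t) is just the fraction of event times greater than t.
    return sum(1 for x in event_times if x > t) / len(event_times)

times = [2, 4, 6, 8, 10]  # hypothetical fully observed event times
print(empirical_survival(times, 5))  # 0.6: three of five subjects outlive t=5
```

Once any of the times are censored, this simple proportion is biased, which is what motivates the Kaplan-Meier machinery below.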
Hazard Function — parametric
- the hazard function is the ratio of the probability density function f(t) to the survival function: h(t) = f(t)/S(t)
- the hazard (rate) function is sometimes called the incidence (rate): the expected number of events per unit time
*many other parametric distributions exist (exponential, Weibull, Rayleigh, etc.)
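As a concrete example of h(t) = f(t)/S(t), the exponential distribution has f(t) = λe^(−λt) and S(t) = e^(−λt), so its hazard is the constant rate λ at every t:

```python
import math

def exp_hazard(t, lam):
    f = lam * math.exp(-lam * t)  # density f(t) = lam * e^(-lam * t)
    s = math.exp(-lam * t)        # survival S(t) = e^(-lam * t)
    return f / s                  # hazard h(t) = f(t) / S(t)

# The exponential hazard is constant in time: h(t) = lam for every t.
print(exp_hazard(1.0, 0.5), exp_hazard(10.0, 0.5))  # both ~0.5
```

This memoryless, constant-hazard property is what makes the exponential the simplest parametric survival model; Weibull and other families let the hazard rise or fall with time.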
Estimating Survival Function — with censoring
Due to censored data, we are more often than not unable to estimate the entire survival curve directly, so the following non-parametric methods and estimators are used to estimate the survival function.
1. Kaplan-Meier Estimator
Think of dividing the observed timespan of the study into a series of fine intervals so that there is a separate interval for each time of death or censoring (with possible ties):
Using the law of conditional probability,
Four possibilities for each interval:
1. No events (death or censoring) — conditional probability of surviving the interval is 1
2. Censoring — assume they survive to the end of the interval, so that the conditional probability of surviving the interval is 1
3. Death and no censoring — conditional probability of not surviving the interval is # deaths (d) divided by # ‘at risk’ (r) at the beginning of the interval. So the conditional probability of surviving the interval is 1 − (d/r).
4. Tied deaths and censoring — assume censored subjects last to the end of the interval, so that the conditional probability of surviving the interval is still 1 − (d/r) — KM assumes censoring occurs after the risk of the event!
The general formula for the conditional probability of surviving the j-th interval, which holds for all 4 cases, is 1 − (d_j/r_j), with d_j = 0 in intervals containing no deaths. Multiplying these conditional probabilities over all intervals up to time t gives the KM estimate: Ŝ(t) = ∏ over {j : t_j ≤ t} of (1 − d_j/r_j).
As the intervals get finer and finer, the approximations made in estimating the probabilities of getting through each interval become smaller and smaller, so that the estimator converges to the true S(t). For this reason, Kaplan-Meier is also known as the product-limit estimator.
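The interval logic above can be sketched in a few lines of plain Python. This is a toy illustration of the product-limit idea, not how lifelines implements KaplanMeierFitter:

```python
def kaplan_meier(durations, events):
    """Product-limit estimate of S(t) at each distinct death time."""
    death_times = sorted({t for t, e in zip(durations, events) if e})
    s, curve = 1.0, []
    for t in death_times:
        r = sum(1 for d in durations if d >= t)  # number at risk just before t
        d = sum(1 for dur, e in zip(durations, events) if e and dur == t)
        s *= 1 - d / r  # conditional probability of surviving this interval
        curve.append((t, s))
    return curve

# Hypothetical data: subject 3 is censored at t=3, all others die.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 1, 0, 1, 1]))
```

Note how the subject censored at t=3 simply drops out of the risk set r for later intervals, which is exactly the "censoring occurs after risk of event" assumption above.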
2. Nelson-Aalen Estimator — Cumulative hazard function Λ(t)
- The cumulative hazard function is the integral of the hazard function
- Just as with KM, the observed time span is divided into a series of fine intervals so that there is only one event per interval
- Λ(t) can then be approximated by a sum over intervals: Λ(t) ≈ Σ_j λ_j∆, where λ_j is the value of the hazard in the j-th interval and ∆ is the width of each interval
- Since λ̂_j∆ is approximately the probability of dying in the j-th interval conditional on having survived until its beginning, we can approximate it by d_j/r_j, which gives the Nelson-Aalen estimator: Λ̂(t) = Σ over {j : t_j ≤ t} of d_j/r_j
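The same counting scheme as the KM sketch gives the Nelson-Aalen estimate; only the update rule changes from a product to a sum (again a toy version, not the lifelines NelsonAalenFitter internals):

```python
def nelson_aalen(durations, events):
    """Cumulative hazard estimate: running sum of d_j / r_j over death times."""
    death_times = sorted({t for t, e in zip(durations, events) if e})
    cum_haz, curve = 0.0, []
    for t in death_times:
        r = sum(1 for d in durations if d >= t)  # number at risk just before t
        d = sum(1 for dur, e in zip(durations, events) if e and dur == t)
        cum_haz += d / r  # approximate conditional probability of dying here
        curve.append((t, cum_haz))
    return curve
```

On the same toy data as before ([1, 2, 3, 4, 5] with a censoring at t=3), the cumulative hazard climbs 1/5, then 1/4, then 1/2, then 1/1.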
3. Life Table
The removed column contains the number of observations removed during that time period, whether due to death (the value in the observed column) or censoring; removed is simply the sum of the observed and censored columns. The entrance column tells us how many new subjects entered the population during that time period.
The at_risk column contains the number of subjects still alive at the start of a given period. For each subsequent period, the at_risk value equals the previous period’s at_risk value minus its removed value, plus the current period’s entrance value.
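Assuming everyone enters at time zero (so entrance is nonzero only in the first period), the column relationships above can be sketched as follows; the column names mirror the ones in lifelines' survival table, but this is an illustrative reimplementation:

```python
def life_table(durations, events):
    rows, at_risk = [], len(durations)  # all subjects enter at time zero
    for t in sorted(set(durations)):
        observed = sum(1 for d, e in zip(durations, events) if d == t and e)
        censored = sum(1 for d, e in zip(durations, events) if d == t and not e)
        removed = observed + censored  # removed = observed + censored
        rows.append({"time": t, "at_risk": at_risk, "observed": observed,
                     "censored": censored, "removed": removed})
        at_risk -= removed  # next period: previous at_risk minus removed
    return rows
```

With late entry, the update would instead be `at_risk = at_risk - removed + entrance`, per the recurrence described above.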
Survival Analysis is an enormous area of research that cannot be fully covered within the scope of this blog. The main takeaways of this piece: diligently define your censoring mechanism, choose your distribution and parameters carefully, and know how to estimate the survival function.
Logrank Tests — statistically test for differences between survival curves
Cox Proportional Hazards Regression — semiparametric, adjust for covariates
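As a closing sketch, the two-sample logrank test compares observed vs expected deaths in one group at each death time; the sum of (O − E), squared and scaled by its variance, is approximately chi-square with 1 df under the null of identical curves. A toy version (for real use, lifelines ships a logrank test in lifelines.statistics):

```python
def logrank_statistic(dur_a, ev_a, dur_b, ev_b):
    """Chi-square statistic (1 df) for H0: the two survival curves are equal.
    Assumes at least one observed death across the pooled sample."""
    all_dur, all_ev = dur_a + dur_b, ev_a + ev_b
    death_times = sorted({t for t, e in zip(all_dur, all_ev) if e})
    o_minus_e, var = 0.0, 0.0
    for t in death_times:
        n_a = sum(1 for d in dur_a if d >= t)  # at risk in group A
        n_b = sum(1 for d in dur_b if d >= t)  # at risk in group B
        n = n_a + n_b
        d_a = sum(1 for d, e in zip(dur_a, ev_a) if e and d == t)
        d_b = sum(1 for d, e in zip(dur_b, ev_b) if e and d == t)
        d = d_a + d_b
        if n > 1:
            o_minus_e += d_a - d * n_a / n  # observed minus expected deaths in A
            var += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    return o_minus_e ** 2 / var
```

Two identical samples give a statistic of zero, while a group whose deaths all come much later produces a large statistic, as expected.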