Debiased Maximum-Likelihood Estimators of Hazard Ratios Under Machine-Learning Adjustment

Preprint (this version is not peer-reviewed; a peer-reviewed article of this preprint also exists). Submitted: 24 July 2025. Posted: 25 July 2025.
Abstract
Previous studies have shown that hazard ratios between treatment groups estimated with the Cox model are uninterpretable because the indefinite baseline hazard of the model fails to identify, among multiple contradictory scenarios, the temporal change in the composition of the risk set due to treatment assignment and unobserved factors. To alleviate this problem, especially in studies based on observational data with uncontrolled dynamic treatment and real-time measurement of many covariates, we propose abandoning the baseline hazard and using machine learning to explicitly model the change in the risk set, with or without latent variables. For this framework, we clarify the context in which hazard ratios can be causally interpreted and then develop a method based on Neyman orthogonality to compute debiased maximum-likelihood estimators of hazard ratios. Computing the constructed estimators is more efficient than computing estimators based on weighted regression with marginal structural Cox models. Numerical simulations confirm that the proposed method identifies the ground truth with minimal bias. These results lay the foundation for the development of a useful alternative method for causal inference with uncontrolled observational data in modern epidemiology.

1. Introduction

The use of hazard ratios to measure the causal effect of treatment has recently come under debate. Although it has been standard practice in epidemiological studies to examine hazard ratios using the Cox proportional hazards model and its modified versions [1,2], several studies [3,4,5] have noted that hazard ratios are uninterpretable with regard to causation (see Martinussen (2022) [6] for a review). In particular, Martinussen et al. (2020) [5] provided concrete examples of data-generating processes in which the Cox model is correctly specified but the estimated hazard ratios are difficult to interpret. The main difficulty is that the time courses of different study populations with different treatment effects are described by the same Cox model with the same set of parameter values. Although a few authors have rebutted the uninterpretability of hazard ratios in the Cox model [7,8,9], the unidentifiability among multiple, contradictory scenarios posed by Martinussen et al. (2020) [5] remains unresolved.
To address this issue, researchers have sought alternative measures of the causal effect of treatment, such as differences between counterfactual survival functions or restricted mean survival times [10,11,12]. Methods for applying machine learning to estimate these measures have also been developed [13,14,15]. However, these measures are applicable only to simple, well-controlled settings, such as randomized clinical trials or observational studies of several covariate-adjusted groups with different baseline treatments, even though such analyses make few assumptions about the data-generating process in other respects, as suggested by semiparametric theories for causal inference (see, e.g., Ref.[16]). Modern epidemiology increasingly requires methods for analyzing large quantities of observational data acquired in a more uncontrolled, dynamic manner, in which many covariates are measured in real time. Examples of such data are electronic medical or health records and data acquired by electronic devices for health promotion [17]. The marginal structural Cox model potentially used for this purpose [18] suffers from the uninterpretability problem described above, and thus an alternative method needs to be developed.
The present study proposes a strategy to model heterogeneous risk groups in a study population by using an exponential hazard model based on time-dependent treatment variables and covariates, but with no indefinite baseline hazard. Although Martinussen et al. (2020) did not explicitly describe the role of the baseline hazard in the interpretability problem, it is the indefinite baseline hazard that allows the Cox model to be correctly specified for arbitrary temporal change in the risk set, namely, what they call “selection”. Thus, we abandon the semiparametric nature of the Cox models and instead exploit the descriptive power of machine learning to capture how the risk set changes over time. This approach has become feasible owing to recent advances in incorporating machine learning into rigorous statistical analysis with effect estimation [19,20]. Prior to these developments, machine learning could be used only for outcome prediction in epidemiology, because estimation with machine learning was biased. We thus develop an algorithm that debiases maximum-likelihood (ML) estimators of hazard ratios in the model, using the framework of doubly robust, debiased machine learning (DML) based on Neyman orthogonality (see Ref.[21] for an introductory review). Under mostly testable assumptions, the estimated hazard ratios can be interpreted as a measure of the causal effect of treatment. We show that this approach can be systematically extended to settings with latent variables. Simulation studies with clinically plausible settings confirm that the proposed method identifies the ground truth with minimal bias. We also show that the proposed ML approach has multiple advantages over Cox's maximum-partial-likelihood (MPL) approach.

2. Theories and Methods

2.1. Exponential Parametric Hazard Model Combined with Machine Learning

We consider observational studies in which the occurrence or absence of an event of interest in subjects randomly sampled from a large population and indexed by $i\,(\in I)$ is longitudinally observed over time (indexed by $t$) during a noninformatively right-censored period $0 \le t \le C_i$. With a set of time-dependent covariates, collectively denoted by $X_{i,t}\,(= \{X_{ij,t}\}_{j \in J} \in \mathcal{X})$, we strive to identify the effect of time-dependent treatment described by a collection of binary variables $A_{i,t}\,(= \{A_{ik,t}\}_{k \in K} \in \mathcal{A} \subset \{0,1\}^{|K|})$. For clarity of presentation, we assume that at most a single treatment variable can take a value of unity, and the remaining variables must be zero. Consider an exponential hazard model whose conditional hazard is given by
$$ h(t \mid A_{i,t}, X_{i,t}) = \exp\left(\theta^\top A_{i,t} + f(X_{i,t})\right), \tag{1} $$
where $\theta = \{\theta_k\}_{k \in K} \in \mathbb{R}^{|K|}$ is a set of parameters corresponding to the natural logarithms of the hazard ratios of samples treated to different extents relative to untreated samples, and the function $f$ describes the risk variation due to the given set of covariates, which we adjust using a machine-learning model (i.e., a set of functions $\mathcal{M}$ in which $f$ lies). In addition, hereinafter, $(\cdot)^\top$ denotes the transposition of vectors and matrices. Note that the above model lacks the baseline hazard as a function of $t$ (the time elapsed after enrollment). The covariate $X_{i,t}$ can include temporal information, such as the date and the durations of suffered disorders, but not $t$ itself.
Suppose that a subset of subjects experiences an event of interest at time $T_i$ in the observation period, and the other subjects experience no such event. The full log-likelihood for this observation is then given by
$$ \ln Q_h(T \mid A, X) = \sum_{i \in I} \left[ I_{[0,C_i]}(T_i)\left(\theta^\top A_{i,T_i} + f(X_{i,T_i})\right) - \int_0^{\min(T_i, C_i)} \exp\left(\theta^\top A_{i,t} + f(X_{i,t})\right) dt \right] \tag{2} $$
(see, e.g., Ref.[22]), where $I_{[0,C_i]}(T_i)$ is the indicator function that returns unity for $T_i \in [0, C_i]$ and zero otherwise. To see the meaning of Eq.(2), we consider how the hazard model describes the occurrence of an event in a small interval $(t, t+\delta t]$. Assuming for the moment that (i) $X_{i,s}$ is continuous with respect to $s$ and (ii) $A_{i,s} = \mathrm{const}$ over $s \in [t, t+\delta t]$, the likelihoods of occurrence and non-occurrence of an event in this period conditioned on event-free survival at time $t$ are $h(t \mid A_{i,t}, X_{i,t})\,\delta t$ and $1 - h(t \mid A_{i,t}, X_{i,t})\,\delta t$, respectively, up to first order in $\delta t$. Simple computations show that the log-likelihood for this binary observation is
$$ \ln Q_h\left(I_{(t,t+\delta t]}(T_i) \mid T_i > t, A_{i,t}, X_{i,t}\right) = I_{(t,t+\delta t]}(T_i) \ln\left(h(t \mid A_{i,t}, X_{i,t})\,\delta t\right) + \left(1 - I_{(t,t+\delta t]}(T_i)\right) \ln\left(1 - h(t \mid A_{i,t}, X_{i,t})\,\delta t\right) + O(\delta t^2) $$
$$ \qquad = I_{(t,t+\delta t]}(T_i)\left(\theta^\top A_{i,t} + f(X_{i,t})\right) - \exp\left(\theta^\top A_{i,t} + f(X_{i,t})\right)\delta t + O(\delta t^2) + \mathrm{const}. \tag{3} $$
The time integration of Eq.(3) results in the summand of Eq.(2) up to a constant independent of $\theta$ and $f$. Here, it is reasonable to consider that the two assumptions hold almost surely in the $\delta t \to +0$ limit.
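For concreteness, the following minimal Python sketch evaluates a discretized version of the log-likelihood in Eq.(2) on a person-time grid. The array layout, the at-risk mask and all names are illustrative assumptions of ours, not part of the formal development.

```python
import numpy as np

def log_likelihood(theta, f_vals, A, event, at_risk, dt):
    """Discretized log-likelihood of Eq.(2).

    theta   : (K,) log hazard ratios theta_k
    f_vals  : (n, T) values of f(X_{i,t}) on a grid with step dt
    A       : (n, T, K) binary treatment indicators A_{ik,t}
    event   : (n, T) indicator that subject i has its event in bin t
              (all zeros for a censored subject)
    at_risk : (n, T) indicator that subject i is still under observation
    """
    log_h = A @ theta + f_vals                        # theta^T A_{i,t} + f(X_{i,t})
    event_term = (event * log_h).sum()                # first term of Eq.(2)
    integral = (at_risk * np.exp(log_h) * dt).sum()   # Riemann sum of the integral
    return event_term - integral
```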

2.2. Causal Inference and ML Estimation

This section clarifies the causal interpretation of the estimation results in the model described in the previous section and discusses the necessary assumptions that support this interpretation. First, we make the conventional assumptions of consistency, no unmeasured confounder (ignorability), and positivity.
Assumption 1 (consistency). 
For any counterfactual treatment schedule $a$, we have $X^a_{(-\infty,t]} = X_{(-\infty,t]} \mid A_{(-\infty,t]} = a_{(-\infty,t]}$ and $T^a = T \mid T \le t, A_{(-\infty,t]} = a_{(-\infty,t]}$.
Assumption 2 (no unmeasured confounder). 
For some $\epsilon > 0$, we have, for any $t$ and any counterfactual treatment schedule $a$,
$$ T^a \perp\!\!\!\perp A_{[t,t+\epsilon]} \mid T^a > t,\ X_{(-\infty,t+\epsilon]},\ A_{(-\infty,t)}. \tag{4} $$
Assumption 3 (positivity). 
For any $t > 0$ and any $a \in \mathcal{A}$, we have $P(A_t = a \mid T > t, X_t) > 0$.
Here, a set subscript, as in $X_{(-\infty,t]}$, denotes the collection of all variables with time indices in the set. Throughout the paper, $P$ denotes the ground-truth (conditional) probability of the argument variables. We have used variables for $-\infty < t < 0$ for the technical reason that $A_0$ and $X_0$ must be determined on the basis of past information. This interval may be replaced by some finite interval before enrollment at $t = 0$ for which information about treatment and covariates is available.
Next, in addition to the above, we make the following assumption about the regularity of hazard.
Assumption 4 (regularity of hazard). 
$$ \lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_{(-\infty,t+\delta t]}, X_{(-\infty,t+\delta t]}\right) = \lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_{(-\infty,t]}, X_{(-\infty,t]}\right) < \infty \tag{5} $$
holds almost surely.
Remark 1. 
Assumption 2 essentially implies that the treatment at time $t$ does not affect the time evolution of the covariate $X$ over the subsequent period $(t, t+\epsilon]$, as seen from the following argument. First,
$$ P\left(I_{(t,t+\epsilon]}(T^a) \mid T^a > t, X_{(-\infty,t+\epsilon]}, A_{(-\infty,t)} = a_{(-\infty,t)}\right) = P\left(I_{(t,t+\epsilon]}(T^a) \mid T^a > t, X_{(-\infty,t+\epsilon]}, A_{(-\infty,t+\epsilon]} = a_{(-\infty,t+\epsilon]}\right) = P\left(I_{(t,t+\epsilon]}(T) \mid T > t, X_{(-\infty,t+\epsilon]}, A_{(-\infty,t+\epsilon]} = a_{(-\infty,t+\epsilon]}\right) \tag{6} $$
holds, where the first and second equalities are due to no unmeasured confounder and consistency, respectively. If the time evolution of the covariates were immediately affected by the treatment schedule, and if the outcome were immediately affected by the covariates, then, under the conditioning on $X_{[t,t+\epsilon]}$, $X^a_{[t,t+\epsilon]}$ would depend on the value of $A$, and hence $T^a$ would depend on $A$, which contradicts the assumption. A similar assumption is conventionally made in the analysis of observational data with a marginal structural Cox model [18].
In addition to Assumptions 1–4, we assume that the model in Eq.(1) is correctly specified. More precisely, we make the following four assumptions:
Assumption 5 (hazard conditionally independent of past treatment and covariates). 
For any $t > 0$, $a \in \mathcal{A}$, $x \in \mathcal{X}$ and any events $E_A, E_X$ concerning the histories $A_{(-\infty,t)}$ and $X_{(-\infty,t)}$, we almost surely have
$$ \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x, A \in E_A, X \in E_X\right)} = 1. \tag{7} $$
Assumption 6 (homogeneous treatment effect). 
For any $a, a' \in \mathcal{A}$ and any $x \in \mathcal{X}$, the hazard contrast
$$ \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a', X_t = x\right)} \tag{8} $$
takes a constant value regardless of the value of $x$.
Assumption 7 (time-homogeneous hazard). 
The hazard is independent of its timing during the observation period. Mathematically, for any $t, t' \ge 0$ and any fixed $a \in \mathcal{A}$ and $x \in \mathcal{X}$,
$$ \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x\right)}{P\left(I_{(t',t'+\delta t]}(T) = 1 \mid T > t', A_{t'} = a, X_{t'} = x\right)} = 1. \tag{9} $$
Assumption 8 (correctly specified machine-learning model). 
There exist unique $\theta^* \in \mathbb{R}^{|K|}$ and $f^* \in \mathcal{M}$ that satisfy
$$ \mathbb{E}\left[\ln Q_{h_{\theta^*, f^*}}(T \mid A, X)\right] = \sup_{\theta \in \mathbb{R}^{|K|},\, f \in M(\mathcal{X})} \mathbb{E}\left[\ln Q_{h_{\theta, f}}(T \mid A, X)\right], \tag{10} $$
where $M(\mathcal{X})$ is the set of Borel-measurable functions on $\mathcal{X}$.
Remark 2. 
Assumptions 5–7 can be validated by extending the model to incorporate the effects of past treatment and covariates, an inhomogeneous treatment effect, and time inhomogeneity, respectively. The original and extended models can then be compared in terms of, for example, Bayesian model evidence (BME) (see, e.g., Chapter 3 of Ref.[23]). If Assumption 5 is violated, the definition of the treatment variables and covariates may be extended so that the current values retain past information; this is the same strategy as employed for designing variables in a marginal structural Cox model [18]. If Assumption 6 is violated, the model may be extended by incorporating a covariate-dependent, heterogeneous treatment effect. Violation of Assumption 7 indicates that variations in the covariate values do not account for the temporal change in the risk set; in this case, additional covariates or an extension of the model with latent variables (as described below) may be considered. Finally, Assumption 8 is an a priori assumption that stems from the fact, familiar in statistical learning, that inference in a function space requires a regularity assumption.
Proposition 1. 
Let the maximizer of the expected likelihood be
$$ \theta^*, f^* = \mathop{\mathrm{arg\,max}}_{\theta \in \mathbb{R}^{|K|},\, f \in \mathcal{M}} \mathbb{E}\left[\ln Q_h(T \mid A, X)\right]. \tag{11} $$
Under Assumptions 1–8, suppose two possible treatment schedules $a, a'$ such that $a_{(-\infty,t)} = a'_{(-\infty,t)}$ holds, and that $a_{k,s} = 1$ and $a'_s = 0$ (or $a'_{k',s} = 1$) hold for $s \in [t, t+\epsilon]$ for some $\epsilon > 0$. Then we have, for any sample path $x_{(-\infty,t+\delta t]}$ of the covariates,
$$ \theta_k^*\ \left(-\ \theta_{k'}^*\right) = \lim_{\delta t \to +0} \ln \frac{P\left(I_{(t,t+\delta t]}(T^a) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t)} = a_{(-\infty,t)},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)}{P\left(I_{(t,t+\delta t]}(T^{a'}) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t)} = a_{(-\infty,t)},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)}. \tag{12} $$
Proof. 
We have
$$ \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T^a) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t)} = a_{(-\infty,t)},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)}{P\left(I_{(t,t+\delta t]}(T^{a'}) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t)} = a_{(-\infty,t)},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)} = \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T^a) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t+\delta t]} = a_{(-\infty,t+\delta t]},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)}{P\left(I_{(t,t+\delta t]}(T^{a'}) = 1 \mid T^a, T^{a'} > t,\ A_{(-\infty,t+\delta t]} = a'_{(-\infty,t+\delta t]},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)} $$
$$ = \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t,\ A_{(-\infty,t+\delta t]} = a_{(-\infty,t+\delta t]},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t,\ A_{(-\infty,t+\delta t]} = a'_{(-\infty,t+\delta t]},\ X_{(-\infty,t+\delta t]} = x_{(-\infty,t+\delta t]}\right)} = \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t,\ A_t = a_t,\ X_t = x_t\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t,\ A_t = a'_t,\ X_t = x_t\right)}. \tag{13} $$
The first and second equalities are due to no unmeasured confounder and consistency (Assumptions 2 and 1), respectively. The last equality is due to Assumptions 4 and 5. Identifying the last line with the ML estimator is straightforward, as follows. Using Assumptions 4 and 5 and Eq.(3), the expected likelihood is represented as
$$ \mathbb{E}\left[\ln Q_h(T \mid X, A)\right] = \int_0^\infty \int_0^{C} \int_{\mathcal{X} \times \mathcal{A}} \lim_{\delta t \to +0} \frac{1}{\delta t}\left[P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t, X_t\right)\left(\theta^\top A_t + f(X_t)\right) - \exp\left(\theta^\top A_t + f(X_t)\right)\delta t\right] dP(X_t, A_t \mid T > t)\, P(T > t)\, dt\, dP(C) + \mathrm{const} $$
$$ = \sum_{k \in K} \int_0^\infty \int_0^{C} \int_{\mathcal{X}} \lim_{\delta t \to +0} \frac{1}{\delta t}\left[P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_{k,t} = 1, X_t\right)\left(\theta_k + f(X_t)\right) - \exp\left(\theta_k + f(X_t)\right)\delta t\right] dP(X_t \mid T > t, A_{k,t} = 1)\, P(T > t, A_{k,t} = 1)\, dt\, dP(C) $$
$$ \quad + \int_0^\infty \int_0^{C} \int_{\mathcal{X}} \lim_{\delta t \to +0} \frac{1}{\delta t}\left[P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = 0, X_t\right) f(X_t) - \exp\left(f(X_t)\right)\delta t\right] dP(X_t \mid T > t, A_t = 0)\, P(T > t, A_t = 0)\, dt\, dP(C) + \mathrm{const}. \tag{14} $$
Here, terms independent of $\theta$ and $f$ have been regarded as constants, and integration with respect to the ground-truth probability measure has been denoted by $dP$. Differentiating this representation gives
$$ \partial_{\theta_k} \mathbb{E}\left[\ln Q_h(T \mid X, A)\right] = \int_0^\infty \int_0^{C} \int_{\mathcal{X}} \left[\lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_{k,t} = 1, X_t\right) - \exp\left(\theta_k + f(X_t)\right)\right] dP(X_t \mid T > t, A_{k,t} = 1)\, P(T > t, A_{k,t} = 1)\, dt\, dP(C), \tag{15} $$
$$ \lim_{\epsilon \to +0} \frac{1}{\epsilon} \mathbb{E}\left[\ln Q_{h_{\theta, f+\epsilon\psi}}(T \mid X, A) - \ln Q_{h_{\theta, f}}(T \mid X, A)\right] = \int_0^\infty \int_0^{C} \int_{\mathcal{X} \times \mathcal{A}} \left[\lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t, X_t\right)\psi(X_t) - \psi(X_t)\exp\left(\theta^\top A_t + f(X_t)\right)\right] dP(X_t, A_t \mid T > t)\, P(T > t)\, dt\, dP(C). \tag{16} $$
With Assumptions 6 and 7, it is seen that these derivatives vanish if and only if
$$ \theta_k = \ln \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_{k,t} = 1, X_t\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = 0, X_t\right)}, \qquad f(X_t) = \ln \lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = 0, X_t\right) \tag{17} $$
hold almost everywhere in $\mathcal{X} \times [0, \infty)$ with respect to $dP(X_t, A_t \mid T > t)\, P(T > t)\, P(C > t)$, and the concavity of the expected log-likelihood ensures that this solution is the maximizer. □
Remark 3-1. 
Suppose that Assumption 2 holds for arbitrary $\epsilon > 0$. Then, for arbitrary $t > t'$, we obtain
$$ \lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T^a) = 1 \mid T^a > t,\ X_{(-\infty,t+\delta t]},\ A_{(-\infty,t')} = a_{(-\infty,t')}\right) = \lim_{\delta t \to +0} \frac{1}{\delta t} P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t,\ A_t = a_t,\ X_t\right), \tag{18} $$
with the aid of Assumptions 1–5, in the same manner as for Eqs.(6) and (13). If we can assume $A_{(-\infty,t')} = a_{(-\infty,t')} = 0$ with probability one at some $t'$, the above equation implies that the counterfactual survival function is given by
$$ S^a(t) = \mathbb{E}\left[\exp\left(-\int_0^t \exp\left(\theta^{*\top} a_s + f^*(X_s)\right) ds\right)\right]. \tag{19} $$
Although measures of causal effects can be calculated from this relation, doing so requires an unbiased estimate of $f^*$, which is possible only if a simple model can be used to estimate $f$ without modeling error.
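If estimates of $\theta^*$ and $f^*$ are available, Eq.(19) suggests a simple Monte Carlo evaluation of the counterfactual survival function. The sketch below assumes, purely for illustration, that covariate sample paths can be simulated or resampled on a grid and that `f_hat` is such an estimate of $f^*$.

```python
import numpy as np

def counterfactual_survival(theta_hat, f_hat, a_path, X_paths, dt):
    """Monte Carlo evaluation of S^a(t) on a grid, following Eq.(19).

    theta_hat : (K,) estimated log hazard ratios
    f_hat     : callable mapping an (n, T, p) covariate array to (n, T) values
    a_path    : (T, K) counterfactual treatment schedule a_s on the grid
    X_paths   : (n, T, p) simulated or resampled covariate sample paths
    Returns S^a(t) at every grid point as a (T,) array.
    """
    log_h = a_path @ theta_hat + f_hat(X_paths)          # (n, T)
    cum_hazard = np.cumsum(np.exp(log_h) * dt, axis=1)   # integral up to each t
    return np.exp(-cum_hazard).mean(axis=0)              # average over paths
```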
Remark 3-2. 
The quantity in Eq.(12) is a covariate-adjusted hazard contrast for two counterfactual treatments that branch at time $t$, so it can be interpreted as a measure of causal effect. In the setting discussed in Ref.[18], this corresponds to the treatment effect measured in the next month (for which the covariates at each month can be regarded as baseline covariates [24]).
Remark 3-3. 
As discussed in Martinussen (2022) [6], of more interest is the quantity
$$ \lim_{\delta t \to +0} \ln \frac{P\left(I_{(t,t+\delta t]}(T^a) = 1 \mid T^a, T^{a'} > t,\ X^a_{(-\infty,t+\delta t]} = X^{a'}_{(-\infty,t+\delta t]}\right)}{P\left(I_{(t,t+\delta t]}(T^{a'}) = 1 \mid T^a, T^{a'} > t,\ X^a_{(-\infty,t+\delta t]} = X^{a'}_{(-\infty,t+\delta t]}\right)} \tag{20} $$
for two arbitrary counterfactual treatment schedules that branch before $t$. A stronger, untestable assumption is required for this quantity to be related to the hazard ratios; for example,
$$ T \perp\!\!\!\perp T^a, X^a \mid T > t, \{A_s\}_{s \le t}, \{X_s\}_{s \le t}, \qquad T^a \perp\!\!\!\perp A, T, X \mid T^a > t, \{X^a_s\}_{s \le t}, \qquad T^a \perp\!\!\!\perp T^{a'}, X^{a'} \mid T^a > t, \{X^a_s\}_{s \le t} \tag{21} $$
for any $t$ and any counterfactual treatment schedules $a, a'$. This condition is satisfied if no unmeasured factors affect the outcome, as described by the causal graph in Figure 1(A). If unmeasured factors affect the outcome, as described by the causal graph in Figure 1(B), the above conditions are not satisfied. Although the assumption of time homogeneity (Assumption 7) is strong and does not allow temporal change in the heterogeneity of risk to be left unmodeled, it does not completely exclude the existence of unmeasured factors affecting the outcome. For example, suppose that the survival of subjects under an arbitrary counterfactual treatment schedule $a$ is described by
$$ P(T_i^a > t) = \exp\left(-\int_0^t \exp\left(\theta^{*\top} a_s + f^*(X_{i,s})\right)\left(1 + W_{i,s}\right) ds\right), \qquad X_{i,s} = X^a_{i,s}, \qquad W_{i,s} = W^a_{i,s}, \tag{22} $$
where $W_{i,t} = \sum_j \delta(t - t_{ij})$ is the so-called “shot noise” described by Dirac delta functions centered at time points determined by a homogeneous Poisson point process $\{t_{ij}\}_j$. The subject-wise random effect of $W_i$ shared among $T$, $T^a$ and $T^{a'}$ then violates Eq.(21), whereas the values of $W_i$ at different time points are independent and do not affect time homogeneity.

2.3. Debiasing ML Estimators

Next, we assume that we have consistent estimators for $\theta$ and $f$ that maximize the empirical log-likelihood in Eq.(2) in the presence of a suitable regularizer. Although a large body of the machine-learning literature presents such consistent estimators, they are biased in the sense that we cannot expect $\sqrt{n}\,(\hat{\theta} - \theta^*), \sqrt{n}\,(\hat{f} - f^*) \overset{p}{\to} 0$ as the number of subjects $n \to \infty$. This situation prohibits making decisions on the significance of the estimated results. However, we can apply to this problem the debiased machine learning based on Neyman orthogonality developed by Chernozhukov et al. (2018) [20], who describe how to systematically debias ML estimators and other M-estimators. Applying their idea to our problem setting yields debiased estimators of $\theta^*$.
Definition 1 (Neyman orthogonal scores). Suppose that an estimator $\hat{\theta}\,(\in \Theta \subset \mathbb{R}^d)$ of the quantities of interest $\theta^*$ is given as a zero of the empirical average of score functions $\phi: \mathcal{D} \times \Theta \times \mathcal{H} \to \mathbb{R}^d$ for i.i.d. data $D_i \in \mathcal{D}$:
$$ \frac{1}{n} \sum_i \phi(D_i; \hat{\theta}, \hat{\eta}) = 0, \tag{23} $$
where the third argument in the summands is a (possibly infinite-dimensional) nuisance parameter and an estimator $\hat{\eta}$ of an optimal value $\eta^* \in \mathcal{H}$ is used. Then, $\phi$ is said to be “Neyman orthogonal” if it satisfies
$$ \mathbb{E}\left[\phi(D; \theta^*, \eta^*)\right] = 0, \tag{24} $$
$$ \mathbb{E}\left[\partial_\eta \phi(D; \theta^*, \eta^*)[\eta - \eta^*]\right] = 0, \tag{25} $$
where, in the second line, $\partial_\eta(\cdot)[\eta - \eta^*]$ denotes the Gateaux derivative operator [25].
Given a few conditions on the regularity of the score functions and the convergence of the nuisance parameters, Chernozhukov et al. [20] proved the $\sqrt{n}$-consistency and asymptotic normality of $\hat{\theta}$. In the following, we construct such orthogonal scores from the log-likelihood in Eq.(2).
Proposition 2 (Neyman orthogonal scores derived from log-likelihood). 
Assume that the positive Borel measures $\mu_a$ ($a \in \mathcal{A}$) defined as
$$ \mu_a(E) = \int_0^\infty \int_0^{C} \int_E P(A_t = a \mid X_t, T > t)\, dP(X_t \mid T > t)\, P(T > t)\, dt\, dP(C) \tag{26} $$
(for Borel sets $E \subset \mathcal{X}$) are absolutely continuous with respect to each other and have Radon–Nikodym derivatives
$$ \frac{d\mu_k}{d\mu_0} = e^{g_k^*}, \tag{27} $$
where $\mu_k$ denotes the $\mu_a$ such that $a_k = 1$, and $g_k^*$ is a measurable function on $\mathcal{X}$. Then, the score functions for estimating $\theta$,
$$ \phi_k(D; \theta, (f, g_k)) = I_{[0,C]}(T)\, A_{k,T}\, e^{-\theta_k}\left(1 + e^{-g_k(X_T)}\right) - \int_0^{\min(T,C)} A_{k,t}\, e^{f(X_t)}\left(1 + e^{-g_k(X_t)}\right) dt $$
$$ \qquad + \int_0^{\min(T,C)} \Big(1 - \sum_{k' \in K} A_{k',t}\Big)\, e^{f(X_t)}\left(1 + e^{g_k(X_t)}\right) dt - I_{[0,C]}(T)\Big(1 - \sum_{k' \in K} A_{k',T}\Big)\left(1 + e^{g_k(X_T)}\right) \tag{28} $$
($k \in K$), are Neyman orthogonal. Here, $D$ denotes $(T, C, X, A)$.
Proof. 
Consider the following hazard model that includes a heterogeneous treatment effect:
$$ \tilde{h}_{\theta, f, f_{\mathrm{het}}}(t \mid A_t, X_t) = \exp\left(\theta^\top A_t + f(X_t) + f_{\mathrm{het}}(X_t) A_{k,t}\right). \tag{29} $$
Because we assumed homogeneity of the treatment effect (Assumption 6), we have, for arbitrary $\psi$,
$$ \lim_{\epsilon \to +0} \frac{1}{\epsilon} \mathbb{E}\left[\ln Q_{\tilde{h}_{\theta^*, f^*, \epsilon\psi}}(T \mid A, X) - \ln Q_{\tilde{h}_{\theta^*, f^*, 0}}(T \mid A, X)\right] = \mathbb{E}\left[I_{[0,C]}(T)\, A_{k,T}\, \psi(X_T) - \int_0^{\min(T,C)} A_{k,t}\, \psi(X_t) \exp\left(\theta_k^* A_{k,t} + f^*(X_t)\right) dt\right] = 0. \tag{30} $$
Let the first line on the right-hand side of Eq.(28) be $\phi_{k,1}(D; \theta, (f, g_k))$ and the second line be $\phi_{k,2}(D; \theta, (f, g_k))$. Note that $\mathbb{E}[\phi_{k,1}]$ with $\theta_k = \theta_k^*$ and $f = f^*$ coincides with the above equation with $\psi(X_t) = e^{-\theta_k^*}\left(1 + e^{-g_k(X_t)}\right)$. Similarly, for arbitrary $\psi$, we have
$$ \lim_{\epsilon \to +0} \frac{1}{\epsilon} \mathbb{E}\left[\ln Q_{h_{\theta^*, f^* + \epsilon\psi}}(T \mid A, X) - \ln Q_{h_{\theta^*, f^*}}(T \mid A, X)\right] = \mathbb{E}\left[I_{[0,C]}(T)\, \psi(X_T) - \int_0^{\min(T,C)} \psi(X_t) \exp\left(\theta^{*\top} A_t + f^*(X_t)\right) dt\right] = 0. \tag{31} $$
Subtracting the sum of Eq.(30) over all $k$ from Eq.(31) with $\psi(X_t) = 1 + e^{g_k(X_t)}$, we obtain $\mathbb{E}\left[\phi_{k,2}(D; \theta^*, (f^*, g_k))\right] = 0$. Hence, $\mathbb{E}\left[\phi_k(D; \theta^*, (f^*, g_k))\right] = 0$ has been proven.
Next, we examine the Gateaux derivatives of the expected scores with respect to the nuisance parameters $(f, g_k)$. First, the derivative of $\mathbb{E}[\phi_k]$ with respect to $g_k$ at $\theta_k = \theta_k^*$ and $f = f^*$ is
$$ \lim_{\epsilon \to +0} \frac{1}{\epsilon} \mathbb{E}\left[\phi_k(D; \theta^*, (f^*, g_k + \epsilon\psi)) - \phi_k(D; \theta^*, (f^*, g_k))\right] = \mathbb{E}\left[-I_{[0,C]}(T)\, A_{k,T}\, \psi(X_T)\, e^{-\theta_k^* - g_k(X_T)} + \int_0^{\min(T,C)} A_{k,t}\, \psi(X_t)\, e^{f^*(X_t) - g_k(X_t)}\, dt\right] + \mathbb{E}\left[\int_0^{\min(T,C)} \Big(1 - \sum_{k' \in K} A_{k',t}\Big)\, \psi(X_t)\, e^{f^*(X_t) + g_k(X_t)}\, dt - I_{[0,C]}(T)\Big(1 - \sum_{k' \in K} A_{k',T}\Big)\, \psi(X_T)\, e^{g_k(X_T)}\right] = 0. \tag{32} $$
The combination of Eqs.(30) and (31) again proves the last equality. The derivative with respect to $f$ at $\theta = \theta^*$, $f = f^*$ and $g_k = g_k^*$ is given by
$$ \lim_{\epsilon \to +0} \frac{1}{\epsilon} \mathbb{E}\left[\phi_k(D; \theta^*, (f^* + \epsilon\psi, g_k^*)) - \phi_k(D; \theta^*, (f^*, g_k^*))\right] = \mathbb{E}\left[\int_0^{\min(T,C)} \left\{\Big(1 - \sum_{k' \in K} A_{k',t}\Big)\, \psi(X_t)\, e^{f^*(X_t)}\left(1 + e^{g_k^*(X_t)}\right) - A_{k,t}\, \psi(X_t)\, e^{f^*(X_t)}\left(1 + e^{-g_k^*(X_t)}\right)\right\} dt\right] = 0. \tag{33} $$
Eq.(27) shows that the expected values of the two integrands in the second line cancel out, which proves the last equality. □
Remark 4-1. 
In the above proof, Assumption 8 ensures that the argument $\psi$ of the Gateaux derivative does not need to belong to $\mathcal{M}$. As one might notice, the crux of the construction lies in balancing the second and third terms of Eq.(28) by $g_k$. Roughly speaking, for each value of the covariates, $g_k$ weights the integrands with the inverse probabilities of treatment for $A_k = 1$ and $A = 0$. Note that this treatment probability is marginalized with respect to time.
Remark 4-2. 
Chernozhukov et al. (2018) [20] present a systematic method to derive orthogonal scores from a likelihood [see Eq.(2.29) of Ref.[20]]. A straightforward application of their method results in more complex scores, each of which contains all of $\{\theta_k\}_{k \in K}$. As this fact shows, multiple methods of orthogonalization are available; prioritizing simplicity of exposition, we propose using the scores above.
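To make Eq.(28) concrete, the sketch below accumulates the four terms of the score over a person-time grid and solves the empirical moment condition for $\theta_k$ in closed form, exploiting the fact that the score is linear in $e^{-\theta_k}$. The discretization and array conventions are our own assumptions.

```python
import numpy as np

def debiased_theta_k(k, f_vals, g_vals, A, event, at_risk, dt):
    """Solve sum_i phi_k(D_i; theta, (f, g_k)) = 0 for theta_k, cf. Eq.(28).

    f_vals, g_vals : (n, T) plug-in values of f(X_{i,t}) and g_k(X_{i,t})
    A : (n, T, K) treatment indicators; event, at_risk : (n, T) indicators
    """
    Ak = A[..., k]                       # A_{ik,t}
    untreated = 1.0 - A.sum(axis=-1)     # 1 - sum_{k'} A_{ik',t}
    w_trt = 1.0 + np.exp(-g_vals)        # inverse-probability-type weights
    w_ctl = 1.0 + np.exp(g_vals)
    S1 = (event * Ak * w_trt).sum()                           # event term, treated
    S2 = (at_risk * Ak * np.exp(f_vals) * w_trt * dt).sum()   # integral, treated
    S3 = (at_risk * untreated * np.exp(f_vals) * w_ctl * dt).sum()
    S4 = (event * untreated * w_ctl).sum()
    # The summed score is exp(-theta_k) * S1 - S2 + S3 - S4; its zero is
    # theta_k = ln S1 - ln(S2 - S3 + S4), assuming the bracket is positive.
    return np.log(S1) - np.log(S2 - S3 + S4)
```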

2.4. Estimation Algorithm

In this section, we describe a concrete procedure for computing the debiased estimators of the hazard ratios for treatment. (A schematic implementation of Steps 1 and 3 is sketched after the list of steps.)
Step 1: Estimation of $f$ and $g_k$.
Plug-in estimators for $f$ and $g_k$ are used to debias the estimation of $\theta$, following a procedure called “cross-fitting”. To debias $\theta$, the estimation errors in $f$ and $g_k$ must be independent of the data $D_i$ in the arguments of the score functions. Thus, the entire dataset $D$ is first partitioned into $M$ groups of approximately equal size ($D^{(1)}, \ldots, D^{(M)}$). Using $D_{\mathrm{val}}^{(m)} = D^{(m)}$, $D_{\mathrm{hout}}^{(m)} = D^{(m+1\ (\mathrm{mod}\ M))}$ and $D_{\mathrm{train}}^{(m)} = D \setminus (D_{\mathrm{val}}^{(m)} \cup D_{\mathrm{hout}}^{(m)})$, $\hat{f}^{(m)}$ is estimated as the maximizer of the log-likelihood of $D_{\mathrm{train}}^{(m)}$ with a suitable regularizer. Similarly, $\hat{g}_k^{(m)}$ is estimated as a regularized solution of the following logistic regression:
$$ \min_{g_k^{(m)}} \sum_{i \in D_{\mathrm{train}}^{(m)}} \int_0^{\min(T_i, C_i)} \left[A_{ik,t} \ln\left(1 + e^{-g_k^{(m)}(X_{i,t})}\right) + \Big(1 - \sum_{k' \in K} A_{ik',t}\Big) \ln\left(1 + e^{g_k^{(m)}(X_{i,t})}\right)\right] dt + (\text{regularization term}). \tag{34} $$
To determine the regularization parameters for estimating $f$, we use the BME averaged over the $M$ folds. Because the logarithmic loss in Eq.(34) may underestimate the error due to the large factors $e^{g_k}$ and $e^{-g_k}$ in the score functions, we suggest computing the following cross-validation error for the balance in Eq.(33) that $g_k$ is targeting:
$$ \mathrm{CVErr}_{g_k} = \sum_{1 \le m \le M} \left[\sum_{i \in D_{\mathrm{val}}^{(m)}} \int_0^{\min(T_i, C_i)} \left\{A_{ik,t}\left(e^{-\hat{g}_k^{(m)}(X_{i,t})} - 1\right) + \Big(1 - \sum_{k' \in K} A_{ik',t}\Big)\left(e^{\hat{g}_k^{(m)}(X_{i,t})} - 1\right)\right\} dt\right]^2. \tag{35} $$
The above error design exploits the fact that, for $\hat{P}(A_k = 1 \mid X) \approx P(A_k = 1 \mid X)$,
$$ P(A_k = 1 \mid X)\, \frac{1}{\hat{P}(A_k = 1 \mid X)} + \left(1 - P(A_k = 1 \mid X)\right) \frac{1}{1 - \hat{P}(A_k = 1 \mid X)} \approx 2. \tag{36} $$
Here, the trivial solution $\hat{P}(A_k = 1 \mid X) = \frac{1}{2}$ (i.e., $g_k \equiv 0$) also minimizes $\mathrm{CVErr}_{g_k}$ and should be avoided by checking the BME as well.
Step 2: Validation of hypotheses.
The assumptions for Proposition 1 should be validated. Assumptions 5–7 can be validated by carrying out the ML estimation with $A_t$ and $X_t$ replaced by $(A_t, A_{t-\tau})$ and $(X_t, X_{t-\tau})$ (Assumption 5), with $\theta^\top A$ replaced by $A^\top(\theta + f_{\mathrm{het}}(X))$ (Assumption 6), and with $f(X)$ replaced by $f_1(X) + f_2(X, t)$ (Assumption 7), and comparing the resultant BME values (i.e., the posterior probabilities of the null and alternative hypotheses).
Step 3: Debiased estimation of $\theta$ and its standard error.
The estimator $\hat{\theta}$ of $\theta^*$ is obtained as the zero of
$$ \sum_{1 \le m \le M} \sum_{i \in D^{(m)}} \phi\left(D_i; \theta, (\hat{f}^{(m)}, \hat{g}^{(m)})\right). \tag{37} $$
The asymptotic standard error of the estimator is given by $\hat{\Sigma}^{-1/2}(\hat{\theta} - \theta^*) \overset{d}{\to} N(0, I)$ with
$$ \hat{\Sigma} \overset{\mathrm{def}}{=} \hat{J}^{-1} \left[\frac{1}{n^2} \sum_{1 \le m \le M} \sum_{i \in D^{(m)}} \phi\left(D_i; \hat{\theta}, (\hat{f}^{(m)}, \hat{g}^{(m)})\right) \phi\left(D_i; \hat{\theta}, (\hat{f}^{(m)}, \hat{g}^{(m)})\right)^\top\right] \left(\hat{J}^{-1}\right)^\top, \tag{38} $$
$$ \hat{J} \overset{\mathrm{def}}{=} \frac{1}{n} \sum_{1 \le m \le M} \sum_{i \in D^{(m)}} \left.\partial_\theta \phi\left(D_i; \theta, (\hat{f}^{(m)}, \hat{g}^{(m)})\right)\right|_{\theta = \hat{\theta}}. \tag{39} $$
Thus, in the simulation study described below, the square roots of the diagonal elements of $\hat{\Sigma}$ serve to scale the errors from the ground truth to obtain the t-statistics.
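The following sketch combines Steps 1 and 3 for a single component $\theta_k$: nuisance estimates are computed on training folds, the pooled score is solved in closed form as above, and the standard error follows Eqs.(38) and (39). The fold structure and the learner interfaces (`fit_f`, `fit_g`) are hypothetical simplifications of the text (e.g., the separate holdout folds are omitted).

```python
import numpy as np

def cross_fit_theta_k(folds, k, fit_f, fit_g, dt):
    """Cross-fitted debiased estimate of theta_k with a sandwich standard error.

    folds        : list of (train, val) pairs; each val is a dict of arrays
                   A (n, T, K), event (n, T) and at_risk (n, T)
    fit_f, fit_g : user-supplied learners returning per-person-time values of
                   the plug-ins f^(m) and g_k^(m) on the validation fold
    """
    s1_parts, rest_parts = [], []        # per-subject pieces of the score
    for train, val in folds:
        f_vals, g_vals = fit_f(train, val), fit_g(train, val)   # (n_val, T)
        A, event, at_risk = val["A"], val["event"], val["at_risk"]
        Ak, untreated = A[..., k], 1.0 - A.sum(-1)
        w_trt, w_ctl = 1.0 + np.exp(-g_vals), 1.0 + np.exp(g_vals)
        s1_parts.append((event * Ak * w_trt).sum(1))
        rest_parts.append((at_risk * np.exp(f_vals) * dt
                           * (untreated * w_ctl - Ak * w_trt)).sum(1)
                          - (event * untreated * w_ctl).sum(1))
    s1, rest = np.concatenate(s1_parts), np.concatenate(rest_parts)
    n = len(s1)
    theta = np.log(s1.sum()) - np.log(-rest.sum())   # zero of the pooled score
    phi = np.exp(-theta) * s1 + rest                 # subject-wise scores, Eq.(28)
    J = -np.exp(-theta) * s1.mean()                  # Eq.(39), scalar case
    se = np.sqrt((phi ** 2).mean() / n) / abs(J)     # sandwich error, Eq.(38)
    return theta, se
```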

2.5. Extension to Models with Latent Variables

Suppose that Assumption 7 is violated due to the presence of unobserved factors affecting the outcome, and the covariates cannot adjust for the resultant temporal change in the risk set. In this case, as we see in the simulation study below, the inclusion of the time elapsed after enrollment in the set of covariates may improve the description of event occurrence in terms of BME. For such cases, if additional data about the unobserved factors cannot be acquired, the only alternative is to explicitly model the unobserved factors as latent variables.
To simplify the argument and notation, let us assume that the unobserved factor (denoted by $W \in \{0,1\}$) is a single time-independent binary variable (e.g., a variable indicating the presence or absence of a genetic risk factor). Suppose that we attempt to explicitly model $W$ with a latent variable $Z \in \{0,1\}$ and define the distribution of $Z$ and its influence on the outcome as $Q_\beta(Z \mid X_{[0,\min(T,C)]}, A_{[0,\min(T,C)]})$ and $Q_\kappa(T \mid Z, X_{[0,\min(T,C)]}, A_{[0,\min(T,C)]})$ with parameters $\beta$ and $\kappa$, respectively. For example, in the simulation study described below, we use a logistic model that determines the distribution of the latent variable based on the baseline covariate values:
$$ Q_\beta(Z \mid X_{[0,\min(T,C)]}, A_{[0,\min(T,C)]}) = \frac{1}{1 + \exp\left(-(2Z - 1)\left(\beta^\top X_0 + \beta_0\right)\right)}, \tag{40} $$
and we extend the hazard model as
$$ \tilde{h}(t \mid A_t, X_t, Z) = \exp\left(\theta^\top A_t + f(X_t) + \kappa Z\right). \tag{41} $$
The same argument as for Proposition 1 applies to this model. Suppose that the ground truth is described by
$$ P(W \mid X_{[0,\min(T,C)]}, A_{[0,\min(T,C)]}) = \frac{1}{1 + \exp\left(-(2W - 1)\left(\beta^{*\top} X_0 + \beta_0^*\right)\right)} \tag{42} $$
and
$$ \lim_{\delta t \to +0} \frac{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x, W = 1\right)}{P\left(I_{(t,t+\delta t]}(T) = 1 \mid T > t, A_t = a, X_t = x, W = 0\right)} = \mathrm{const}, \tag{43} $$
regardless of the values of $x$, $a$ and $t$. Then, replacing $X$ with the combination $(X, W)$ or $(X, Z)$ in Assumptions 1–8 and assuming that the model is identifiable, one obtains the same result as in Proposition 1.
One can then also derive Neyman-orthogonal scores for this model. The logarithm of the marginal likelihood of the outcome is given by
$$ \sum_{i \in I} \ln Q_{\tilde{h}}(T_i \mid A_i, X_i) = \sum_{i \in I} \max_{r_i} \sum_{Z_i \in \{0,1\}} r_i(Z_i) \ln \frac{Q_{\tilde{h}}(T_i \mid X_i, A_i, Z_i)\, Q_\beta(Z_i \mid X_i, A_i)}{r_i(Z_i)}, \tag{44} $$
with the aid of variational distributions $\{r_i\}_{i \in I}$. Hereafter, a variable without a time subscript, such as $A_i$ or $X_i$, denotes the collection of the corresponding variables throughout the observation period, as understood from the context. Although the optimal values of $\{r_i\}_i$ depend on $\theta$, $\kappa$, $\beta$ and $f$, the variable $r_i$ can also be regarded as an independent nuisance parameter. Naive, non-orthogonal scores for the ML estimation are obtained by simply differentiating the marginal likelihood with respect to $\theta$:
$$ \phi_{\mathrm{naive},k}(D; \theta, f, \kappa, r) = \sum_{Z \in \{0,1\}} r(Z) \left[I_{[0,C]}(T)\, A_{k,T}\, \exp(-\theta_k) - \int_0^{\min(T,C)} A_{k,t} \exp\left(f(X_t) + \kappa Z\right) dt\right]. \tag{45} $$
These score functions can be orthogonalized in the same manner as for the model without latent variables.
Proposition 3 (Neyman orthogonal scores with a binary latent variable). 
Assume that the positive Borel measures $\mu_a$ ($a \in \mathcal{A}$) defined as
$$ \mu_a(E) = \int_0^\infty \int_0^{C} \int_E P(A_t = a \mid X_t, Z, T > t)\, r^*(Z)\, dP(X_t, r^* \mid T > t)\, P(T > t)\, dt\, dP(C) \tag{46} $$
(for Borel sets $E \subset \mathcal{X} \times \{0,1\}$) are absolutely continuous with respect to each other, with Radon–Nikodym derivatives
$$ \frac{d\mu_k}{d\mu_0} = e^{g_k^*}, \tag{47} $$
where $\mu_k$ denotes the $\mu_a$ such that $a_k = 1$, and $g_k^*$ is a measurable function on $\mathcal{X} \times \{0,1\}$. Here, we have used the random variable $r^*(Z)$ defined as
$$ r^*(Z) = \frac{Q_{\tilde{h}}(T \mid Z, X, A)\, Q_\beta(Z \mid X, A)}{\sum_{\tilde{Z} \in \{0,1\}} Q_{\tilde{h}}(T \mid \tilde{Z}, X, A)\, Q_\beta(\tilde{Z} \mid X, A)}. \tag{48} $$
Then, the score functions for estimating $\{\theta_k\}_{k \in K}$,
$$ \phi_k(D; \theta, (f, g_k, \kappa, r)) = \sum_{Z \in \{0,1\}} r(Z) \Bigg[I_{[0,C]}(T)\, A_{k,T}\, \exp(-\theta_k)\left(1 + e^{-g_k(X_T, Z)}\right) - \int_0^{\min(T,C)} A_{k,t} \exp\left(f(X_t) + \kappa Z\right)\left(1 + e^{-g_k(X_t, Z)}\right) dt $$
$$ \qquad + \int_0^{\min(T,C)} \Big(1 - \sum_{k' \in K} A_{k',t}\Big) \exp\left(f(X_t) + \kappa Z\right)\left(1 + e^{g_k(X_t, Z)}\right) dt - I_{[0,C]}(T)\Big(1 - \sum_{k' \in K} A_{k',T}\Big)\left(1 + e^{g_k(X_T, Z)}\right)\Bigg], \tag{49} $$
are Neyman orthogonal and can be used to debias the maximizer of the marginal log-likelihood in Eq.(44).
Proof. 
The outline of the proof is the same as that for Proposition 2, and so is omitted here. □

2.6. Inference with Multiple-Kernel Models

In the simulation study described below, we use multiple-kernel models for the estimation of $f$ and $g_k$ [26]. The statistical properties of estimation with multiple-kernel models under 1-norm, 2-norm or mixed-norm regularization have been studied extensively [27,28,29,30]. We employ the 2-norm-regularized version, for which the calculation of BME is tractable. Regularized ML estimation of $(\theta, f)$ (or of $g$) with discretized timesteps under 2-norm regularization is formulated as
$$ \max_{\theta, f} \sum_{i \in I} \sum_{0 \le t = m\Delta t < \min(C_i, T_i) + \Delta t} L(D_{i,t}; \theta, f) - \sum_k \frac{\lambda_k}{2} \alpha_k^\top G_k \alpha_k, \tag{50} $$
where the Gram matrix $(G_k)_{it, i't'} = k(X_{i,t}, X_{i',t'})$ relates $f$ and $\alpha_k$ via
$$ f(X_{i,t}) = \sum_k \sum_{i't'} (G_k)_{it, i't'}\, \alpha_{k, i't'}, \tag{51} $$
and
$$ L(D_{i,t}; \theta, f) = I_{[t, t+\Delta t)}(T_i)\left(\theta^\top A_{i,t} + f(X_{i,t})\right) - \exp\left(\theta^\top A_{i,t} + f(X_{i,t})\right) \Delta t \tag{52} $$
denotes the timestep-wise likelihood function. Concretely, we use linear or one- to three-dimensional Gaussian kernels, such as $k_j(X_{i,t}, X_{i',t'}) = X_{ij,t} X_{i'j,t'}$ and, for a $d$-dimensional Gaussian kernel, $k_{\boldsymbol{j}}(X_{i,t}, X_{i',t'}) = \exp\left(-\sum_{1 \le d' \le d} (X_{ij_{d'},t} - X_{i'j_{d'},t'})^2 / 2\sigma_d^2\right)$; see the Results section for the concrete designs of the kernels. Since $\{\alpha_k\}_k$ is high-dimensional, we use the incomplete Cholesky decomposition [31] to approximate the $N \times N$ Gram matrix with an $N \times N'$ matrix $L_k$ as $G_k \approx L_k L_k^\top$ for $N' \ll N$. We then have $f = \sum_k L_k u_k$ with $u_k = L_k^\top \alpha_k$ and $\frac{\lambda_k}{2} \alpha_k^\top G_k \alpha_k \approx \frac{\lambda_k}{2} \|u_k\|_2^2$. We solved this maximization problem by applying the limited-memory BFGS algorithm [32] with backtracking line search terminated by the Armijo condition; the optimization was stopped when the $\ell_2$-norm of the gradient became smaller than $\epsilon_{\mathrm{stop}} = 1.0 \times 10^{-2}$. For the ML estimation, one can approximate the BME by using the Laplace approximation. Assigning the Hessian of the log-likelihood with the regularization term to $\hat{H}$ and regarding the regularization term as the log-likelihood of a Gaussian-process prior [33], we calculate the BME (see, e.g., Chapter 3 of Ref.[23]) using Gaussian integrals as follows:
$$ \ln P_{\mathrm{BME}}(M \mid D) \approx \sum_{i \in I} \sum_{0 \le t = m\Delta t < \min(C_i, T_i) + \Delta t} L\left(D_{i,t}; \hat{\theta}, \sum_k L_k \hat{u}_k\right) - \sum_k \frac{\lambda_k}{2} \|\hat{u}_k\|_2^2 + \sum_k \frac{\dim(u_k)}{2} \ln \lambda_k - \frac{1}{2} \ln |\hat{H}|. \tag{53} $$
In the above approximation, the space of functions perpendicular to the range of $G_k$ does not contribute to $L$ (the representer theorem [34]); therefore, the contributions of the prior and the posterior cancel out. Similarly, the space perpendicular to the range of $L_k$, but not to that of $G_k$, makes negligible contributions to $L$ relative to their variations under the prior process.
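A minimal sketch of the regularized fit described above is given below, assuming the incomplete Cholesky factors have already been stacked into a feature matrix `Phi` over all person-time rows, so that $f = \Phi u$; we use SciPy's L-BFGS-B in place of the custom line-search routine of the text, and a common $\lambda$ for brevity.

```python
import numpy as np
from scipy.optimize import minimize

def fit_theta_f(Phi, A, event, dt, lam):
    """2-norm-regularized ML fit of (theta, u) with f = Phi @ u, cf. Eq.(50).

    Phi   : (N, M) stacked low-rank features from incomplete Cholesky factors
    A     : (N, K) treatment indicators per person-time row
    event : (N,) indicator that the row's interval contains the event
    """
    K, M = A.shape[1], Phi.shape[1]

    def objective(z):
        theta, u = z[:K], z[K:]
        log_h = A @ theta + Phi @ u
        ll = (event * log_h).sum() - (np.exp(log_h) * dt).sum()  # sum of Eq.(52)
        resid = event - np.exp(log_h) * dt        # d(ll)/d(log_h), row-wise
        grad = -np.concatenate([A.T @ resid, Phi.T @ resid - lam * u])
        return -(ll - 0.5 * lam * (u @ u)), grad  # negated for minimization

    res = minimize(objective, np.zeros(K + M), jac=True, method="L-BFGS-B",
                   options={"gtol": 1.0e-2})      # cf. epsilon_stop in the text
    return res.x[:K], res.x[K:]
```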
The above procedure can also be used to obtain $g_k$, where
$$ L_{g_k}(D_{i,t}; g_k) = -A_{ik,t} \ln\left(1 + e^{-g_k(X_{i,t})}\right) - \Big(1 - \sum_{k' \in K} A_{ik',t}\Big) \ln\left(1 + e^{g_k(X_{i,t})}\right) \tag{54} $$
replaces the $L$ for $\theta$ and $f$ in the above.
In the simulation with the latent variable $Z$, we applied the EM algorithm for the optimization (see, e.g., Chapter 9 of Ref.[23]). Since the latent-variable model is indefinite with respect to permutations of the range of $Z$, as are most other models with discrete latent variables, we started the optimization from values close to $\theta^*, f^*, \beta^*, \kappa^*$. By repeating sequences of an update of $r(Z_i = 1 \mid X, A, T)$ with Eq.(48), an optimization of $\theta$ and $f$, and a gradient-ascent step in $\beta$ and $\kappa$, we maximized the marginal log-likelihood in Eq.(44). The nuisance parameter $g_k$ was obtained by performing the logistic regression with $L_{g_k}(D_{i,t}, Z_i; g_k) = \sum_{Z_i \in \{0,1\}} \hat{r}_i(Z_i) \left(-A_{ik,t} \ln\left(1 + e^{-g_k(X_{i,t}, Z_i)}\right) - \left(1 - \sum_{k' \in K} A_{ik',t}\right) \ln\left(1 + e^{g_k(X_{i,t}, Z_i)}\right)\right)$. In this case, we estimated $g_k(X, Z) = g_{k,1}(X) + Z g_{k,2}(X)$ with $g_{k,1}, g_{k,2} \in \mathcal{H}_{k_X}$.
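As an illustration of the E-step, the responsibilities of Eq.(48) for a binary $Z$ can be computed in a numerically stable way as follows; `log_lik_T` is assumed to hold the log-likelihood $\ln Q_{\tilde h}(T_i \mid Z, X_i, A_i)$ evaluated at $Z = 0, 1$.

```python
import numpy as np

def update_r(log_lik_T, X0, beta, beta0):
    """E-step responsibilities r_i(Z=1) of Eq.(48) for a binary latent Z.

    log_lik_T : (n, 2) log-likelihoods of the outcome at Z = 0 and Z = 1
    X0        : (n, p) baseline covariates of the logistic prior, Eq.(40)
    """
    u = X0 @ beta + beta0
    log_prior = np.stack([-np.logaddexp(0.0, u),     # ln Q_beta(Z=0 | X)
                          -np.logaddexp(0.0, -u)],   # ln Q_beta(Z=1 | X)
                         axis=1)
    log_post = log_lik_T + log_prior                 # unnormalized, per Eq.(48)
    log_post -= log_post.max(axis=1, keepdims=True)  # stabilize the exponentials
    post = np.exp(log_post)
    return post[:, 1] / post.sum(axis=1)
```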
In the numerical analysis, all covariate values are normalized to have zero mean and unit variance. For $\sigma_d$ and $\lambda_k$, we performed a grid search, examining values at intervals of approximately $\ln(1.5)$ on the logarithmic scale (e.g., $2.0, 3.0, 5.0, 7.0, 10.0$). In this grid search, we used the same values of $\lambda_k$ and $\sigma_d$ for all $d$-dimensional Gaussian kernels and used $\lambda_k = 0$ for the linear kernels. The goodness of fit was measured in terms of the BME averaged over $\{D_{\mathrm{train}}^{(m)}\}_{1 \le m \le 10}$ and the error defined in Eq.(35).

3. Results of Numerical Simulations

3.1. Simulation Result 1: Adjustment for Observed Confounders

This section presents a simulation study designed to demonstrate that, in a clinically plausible setting, the proposed method estimates hazard ratios for treatment with minimal bias. Suppose we randomly sample subjects from an electronic medical record that contains the medical history of a large local population. Suppose further that 2,000 subjects aged 50–75 years enrolled at uniformly random times between 2000 and 2005. The chosen subjects are followed up for $C_i \overset{\mathrm{i.i.d.}}{\sim} \mathrm{Uniform}[5, 10]$ years, with timely recording of the onset of comorbidities, the administration of drugs and the occurrence of a disorder defined as the outcome event.
To simulate a typical setting in which confounders bias the estimation of the treatment effect, suppose that subjects are newly diagnosed with condition 1 at a constant rate (2.5%/month), which may be a prodrome of the outcome event. A drug that is suspected to be a risk factor for the outcome event may be used to treat this condition. In the simulation, the probability per month of initiating this drug is $0.004 + 0.2\, X_{1t} \exp(-X_{1t}/0.6)/(0.6)^2$, where the covariate $X_{1t}$ [year] indicates the duration of suffering condition 1 before time $t$, and the probability per month of discontinuing the drug is 0.01. Use of this drug increases the risk of developing condition 2: the probability per month of developing condition 2 at time $t$ is $0.05 + 0.05 \int^t B_s\, \Theta(t - s - 1/12)\, e^{-3(t - s - 1/12)}\, ds$, where $B_s$ takes the value 1 if the subject is a current user of the drug at time $s$ and 0 otherwise. The integral, including the Heaviside step function $\Theta$, describes a short-term, delayed increase in the risk of developing condition 2 subsequent to drug use.
Finally, suppose that the occurrence of the outcome event depends on both the treatment and conditions 1 and 2, and that its ground-truth risk per month is $\exp(-7.0 + \theta_1^* A_{1t} + \theta_2^* A_{2t} + 0.04\, \mathrm{age}_t + 0.2 \sin(\pi\, \mathrm{date}_t / 8) - 0.2 \cos(\pi\, \mathrm{date}_t / 6) + 2.0\, X_{1t} \exp(-X_{1t}/1.5) + P_2 (1 - \exp(-X_{2t}/2.5)))/12.0$, with $\theta_1^* = 1.0$ and $\theta_2^* = 2.0$, where we define two treatment variables $A_{1t}$ and $A_{2t}$, which take the value 1 if and only if the subject has used the drug for a period $\le 18$ months and $> 18$ months before time $t$, respectively. $P_2$ in the equation is a parameter for which we use different values. The covariates $\mathrm{age}_t$, $\mathrm{date}_t$ and $X_{2t}$ (the duration of suffering condition 2) are measured in years.
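To make the generative process explicit, the following sketch simulates one subject on a monthly grid according to our reading of the description above; in particular, the coding of $A_{1t}$ and $A_{2t}$ via cumulative months of use and the discretization of the delayed effect on condition 2 are our own interpretations.

```python
import numpy as np

rng = np.random.default_rng(0)
DT = 1.0 / 12.0   # one month, in years

def simulate_subject(theta1=1.0, theta2=2.0, P2=0.5):
    """Simulate one subject of Simulation 1 on a monthly grid (our reading)."""
    age0, date0 = rng.uniform(50.0, 75.0), rng.uniform(0.0, 5.0)
    C = rng.uniform(5.0, 10.0)               # censoring time in years
    x1 = x2 = 0.0                            # durations of conditions 1 and 2
    on_drug, use_months, use_times = False, 0, []
    records, t = [], 0.0
    while t < C:
        if x1 == 0.0 and rng.random() < 0.025:
            x1 = DT                          # onset of condition 1
        p_start = 0.004 + 0.2 * x1 * np.exp(-x1 / 0.6) / 0.6 ** 2
        if on_drug and rng.random() < 0.01:
            on_drug = False                  # discontinuation
        elif not on_drug and rng.random() < p_start:
            on_drug = True                   # initiation
        if on_drug:
            use_months += 1
            use_times.append(t)
        if x2 == 0.0:                        # delayed drug effect on condition 2
            delayed = sum(np.exp(-3.0 * (t - s - DT)) * DT
                          for s in use_times if s < t - DT)
            if rng.random() < 0.05 + 0.05 * delayed:
                x2 = DT
        A1 = float(on_drug and use_months <= 18)
        A2 = float(on_drug and use_months > 18)
        lin = (-7.0 + theta1 * A1 + theta2 * A2 + 0.04 * (age0 + t)
               + 0.2 * np.sin(np.pi * (date0 + t) / 8.0)
               - 0.2 * np.cos(np.pi * (date0 + t) / 6.0)
               + 2.0 * x1 * np.exp(-x1 / 1.5)
               + P2 * (1.0 - np.exp(-x2 / 2.5)))
        event = rng.random() < np.exp(lin) / 12.0   # ground-truth monthly risk
        records.append((t, A1, A2, age0 + t, date0 + t, x1, x2, event))
        if event:
            break
        x1, x2 = x1 + DT * (x1 > 0), x2 + DT * (x2 > 0)
        t += DT
    return records
```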
For this dataset, we performed debiased ML estimation using multiple-kernel models for $f$ and $g_k$. Concretely, we used the following model:
$$ f(X_t) = f_{\mathrm{age}}(\mathrm{age}_t) + f_{\mathrm{date}}(\mathrm{date}_t) + f_1(X_{1t}) + f_2(X_{2t}) + b, \tag{55} $$
with $f_{\mathrm{age}} \in \mathcal{H}_{k_{\mathrm{age}}}$, $f_{\mathrm{date}} \in \mathcal{H}_{k_{\mathrm{date}}}$, $f_1 \in \mathcal{H}_{k_1}$ and $f_2 \in \mathcal{H}_{k_2}$, where $\mathcal{H}_k$ and $b$ are the reproducing-kernel Hilbert space (RKHS) associated with a kernel $k$ and a bias parameter, respectively. To correctly specify the model, we chose $k_{\mathrm{age}}$ to be a linear kernel and $k_{\mathrm{date}}$, $k_1$ and $k_2$ to be one-dimensional Gaussian kernels [e.g., $k_1(X, X') = \exp(-(X - X')^2/\sigma^2)$] with bandwidth hyperparameter $\sigma$. Although here we can identify the set of necessary kernels a priori, in general one can find an appropriate set of kernels by trial and error, using BME as a criterion. In this step, we can also validate the model assumptions (Assumptions 5–7) by examining whether the introduction of an additional kernel improves BME. We briefly illustrate the idea through an example.
First, as we know from the model construction, covariate $X_{2t}$ is necessary for describing the onset of the outcome events. This can be seen by comparing the BME values for the correctly specified model and for a misspecified model built by removing $f_2(X_{2t})$ and $\mathcal{H}_{k_2}$ from it (Fig.2(A)). The question naturally arising here is how one can tell that there is room for improvement when one has only the misspecified model at hand. We propose examining the assumption of time homogeneity (Assumption 7) as an indicator. In our example, comparing the BME values for the misspecified model with and without the inclusion of the time elapsed after enrollment as a covariate reveals the violation of the assumption, depending on the value of $P_2$ (Fig.2(B)), although the sensitivity is not high. Risk variation due to the time elapsed after enrollment indicates a temporal change in the risk set that the covariates at hand do not account for, and therefore the model is missing some necessary factors. For the correctly specified model, no such violation of the assumption is detected (Fig.2(C)). The violation of the time-homogeneity assumption is also observed for models misspecified in other manners (omitted here to avoid redundancy).
The nuisance parameter $g_k$ can be modeled similarly as
$$ g_k(X_t) = g_{k,\mathrm{age}}(\mathrm{age}_t) + g_{k,\mathrm{date}}(\mathrm{date}_t) + g_{k,1}(X_{1t}) + g_{k,2}(X_{2t}) + b', \tag{56} $$
with $g_{k,\mathrm{age}}$, $g_{k,\mathrm{date}}$, $g_{k,1}$ and $g_{k,2}$ lying in RKHSs that are all associated with one-dimensional Gaussian kernels. In practice, on top of the kernels described above, we additionally included two-dimensional Gaussian kernels such as $k_{12}(X, X') = \exp(-((X_1 - X_1')^2 + (X_2 - X_2')^2)/\sigma_2^2)$ and examined whether the inclusion of these additional kernels improves BME; in the present case, the two-dimensional kernels did not improve it. In tuning the hyperparameter values, we observe that BME and $\mathrm{CVErr}_g$ identify different optimal values (Fig.2(D) and (E)). We show the debiased estimates of $\theta_1$ based on $g_1$ estimated with each of these optimal hyperparameter values, $\lambda_g = 70.0$ and $\lambda_g = 1.0$, in Fig.2(F). To measure the bias in the estimator, we obtained estimates for 500 datasets generated from the identical process described above with different seeds for the random-number generator. The t-statistics ($\overset{\mathrm{def}}{=} (\hat{\theta}_k - \theta_k^*)/\mathrm{SE}_{\hat{\theta}_k}$) in Fig.2(F) show that the debiased estimator indeed identified the ground truth with minimal bias, in contrast with the naive ML estimator. In Fig.2(F), we also observe that $g_1$ tuned with $\mathrm{CVErr}_{g_1}$ debiases the estimator more effectively than $g_1$ tuned with BME.

3.2. Simulation Result 2: Estimation of Treatment Effect in Population with Heterogeneous Risk

Suppose that most of the simulation settings are the same as in the previous subsection, but the enrolled subjects are divided into two distinct risk groups denoted by $W \in \{0,1\}$, and this division is reflected in the baseline covariate values. More precisely, the number of subjects, their ages, enrollment dates, censoring, the onset probabilities of conditions 1 and 2, and the treatment assignment are the same as before. Suppose, however, that blood-test results $X_{j0}^{(\mathrm{test})} \overset{\mathrm{i.i.d.}}{\sim} N(0, \sigma_j^{(\mathrm{test})2})$ at baseline ($t = 0$) are available ($1 \le j \le 3$), and the risk group to which a subject belongs is probabilistically related to $\{X_{j0}^{(\mathrm{test})}\}_j$ as
$$ P(W \mid A, X) = \frac{1}{1 + \exp\left(-(2W - 1)\left(\beta^{*\top} X_0^{(\mathrm{test})} + \beta_0^*\right)\right)}, \tag{57} $$
with $\{\sigma_j^{(\mathrm{test})}\}_{1 \le j \le 3} = \{2.0, 1.0, 4.0\}$ and $\{\beta_j^*\}_{0 \le j \le 3} = \{0.5, 1.0, 0.0, 0.0\}$. With the variable $W$, the risk of the outcome event per month is given by $\exp(-7.0 + \theta_1^* A_{1t} + \theta_2^* A_{2t} + 0.04\, \mathrm{age}_t + 0.2 \sin(\pi\, \mathrm{date}_t/8) - 0.2 \cos(\pi\, \mathrm{date}_t/6) + 2.0\, X_{1t} \exp(-X_{1t}/1.5) + P_2(1 - \exp(-X_{2t}/2.5)) + \kappa^* W)/12.0$ with $\theta_1^* = 1.0$, $\theta_2^* = 2.0$, $P_2 = 0.5$ and $\kappa^* = 2.0$.
We first performed the analysis with only the observed covariates $(X_t, X_t^{(\mathrm{test})})$, without latent variables. Here, we set $X_t^{(\mathrm{test})} = X_0^{(\mathrm{test})}$. In the validation of the assumptions, the inclusion of the time elapsed after enrollment in the set of covariates improved BME (Figure 3(A)), suggesting that the enrolled subjects are heterogeneous, presumably due to an unobserved factor. We then analyzed the dataset with a latent variable, maximizing the marginal log-likelihood with the EM algorithm (see, e.g., Chapter 9 of Ref.[23]), which led to a further improvement of BME (Figure 3(B)). In this calculation, we observed that the optimization becomes numerically unstable for small $\kappa\ (\lesssim 1.0)$, presumably because of the singularity of the model. We examined the estimated values of $\theta$ for 500 datasets generated by the identical process with different seeds for the random-number generator, and the results confirmed that the suggested debiased estimators identify the ground truth with minimal bias, in contrast with the naive ML estimator (Figure 3(C)).

4. Discussion

Preceding studies [3,4,5] showed that the Cox model allows multiple, contradictory interpretations. In the present study, we therefore proposed to estimate hazard ratios of treatment options in an exponential hazard model, with or without latent variables, that lacks a baseline hazard as a function of the time elapsed after enrollment. We assumed that the conditional hazard should not depend on the specific time during the observation period, and we measured the degree of violation of this assumption as an indicator of the need to seek additional covariates or latent variables. Given this and several other testable assumptions, as well as the conventional assumptions for causal inference in observational studies, we clarified the context in which the estimated hazard ratios can be causally interpreted. We then proposed methods to debias the ML estimators of hazard ratios in cases with and without latent variables, while simultaneously validating the model assumptions by calculating BME. Numerical simulations confirmed that the proposed method allows us to systematically identify the necessary variables and to estimate the treatment effect with minimal bias.
Epidemiologists and statisticians might find it psychologically difficult to abandon the Cox models with baseline hazard because of their established role as a default model. The popularity of the Cox model is due to its semiparametric nature: leaving the aspects of the data that are difficult to model as a black box is an attractive idea. Furthermore, Cox's MPL estimator [35] for the log-hazard ratios of treatment groups is asymptotically efficient [36,37], which is a natural consequence of obviating the estimation of the baseline hazard (thanks to the cancellation in the representation of the partial likelihood). Because of this asymptotic efficiency, all other estimators, including the ML estimator, have essentially been reduced to theoretical subjects, while a relatively small number of studies have suggested advantages of the alternatives (e.g., [22,38]). The present results suggest the need to reconsider this situation by reevaluating the tradeoff between the efficiency and the uninterpretability due to the indefinite baseline hazard. The disadvantages of using MPL are not limited to uninterpretability. First of all, the indefinite baseline hazard is impossible to interpret from the physical and biological points of view: no physical process depends on the time elapsed after enrollment, so its use is justified solely by convenience. From a technical point of view, comparing the estimation of $g_k$ in the present study with that of the inverse-probability-of-treatment weights for the marginal structural Cox models used for a similar purpose [18], one can see that the former is a marginal estimand in terms of time and hence usually easier to estimate. The standard method for the latter, that is, estimating the transition probabilities of treatment and censoring for each value of the covariates and each time $t$, may work better in a simple model setting, such as an analysis involving only a few to several discrete variables; however, carrying this out under the nonlinear effects of high-dimensional covariates is prohibitive. In fact, the recently proposed applications of doubly robust estimation to marginal structural Cox models are still limited to simple model settings into which machine learning cannot be incorporated [39]. This is why we could not directly compare the results of our method with those of the conventional method based on the marginal structural Cox model. Another negative factor for the MPL approach is that the combination of Bayes' theorem and the partial likelihood is only approximately justifiable, so extending Cox regression to Bayesian settings is not straightforward, although efforts in this direction have been made for Cox models as well [40]. Considering these negative factors for the MPL approach, we conclude that the debiased ML approach should be reconsidered for causal inference in observational studies with uncontrolled dynamic treatment and real-time measurement of a large number of covariates.
On the technical side, for the sake of clear exposition, we restricted our study to a relatively simple setting, in which only noninformative right censoring is considered, only a single binary treatment variable can take the value of one, and only a single time-independent latent variable is allowed. Regarding the first restriction, the advantage of ML estimation has been demonstrated in the context of informative censoring [41]; extending our framework in this direction therefore seems promising. Second, the use of multiple, independently changing treatment variables is straightforward but requires cumbersome modifications, in which finding the root of the scores amounts to solving multivariate algebraic equations of degree greater than one. To apply our framework to models with continuous treatment variables, the conditional distribution of treatment assignment for each value of the covariates must be identified in a manner similar to what is done with propensity scores for continuous treatment [42]. There remains a technical challenge in this step, because machine learning of the conditional distribution of continuous variables is still at the stage of theoretical argument (see, e.g., Refs.[43,44]) rather than practical use. The extension of our framework to models with multiple (possibly continuous, time-dependent) latent variables would face the identifiability problem; although progress has been made in both the theoretical analysis of identifiability [45,46,47,48] and the numerical examination of identifiability in the context of biological modeling [49], a technical challenge still remains. Related to this topic, using a singular model can lead to an effective breakdown of the asymptotic normality of estimators [50]. In this case, the framework of debiased machine learning would itself need modification, and the Laplace approximation of BME would also fail and would need to be replaced by Monte Carlo integration [51] (or by information criteria for singular parametric models [52,53]). These difficulties notwithstanding, the use of the rich classes of latent-variable models for which inference methods have been extensively developed in machine learning (especially those for time-series data [54,55]) remains an attractive and worthy challenge, in which our technique of using variational distributions as a nuisance parameter in the orthogonalization will be useful.

Author Contributions

T.H. conceptualized the study. T.H. performed all of the mathematical work in the study. T.H. and S.A. designed the models and simulation data. T.H. developed all program codes. T.H. and S.A. interpreted the analyzed results. T.H. wrote the manuscript. T.H. and S.A. have read and agreed to its contents and have approved the final manuscript for submission.

Funding

This research was supported by the Ministry of Education, Culture, Sports, Science and Technology (MEXT) of the Japanese government and the Japan Agency for Medical Research and Development (AMED) under grant numbers JP18km0605001 and JP223fa627011.

Data Availability Statement

All of the source codes that support the findings of this study are available as supplementary materials.

Acknowledgments

We would like to express our gratitude to two medical IT companies, 4DIN Ltd. (Tokyo, Japan) and Phenogen Medical Corporation (Tokyo, Japan), for financial support. Neither company had any role in the research design, analysis, data collection, interpretation of data, or review of the manuscript, and no honoraria or payments were made for authorship.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Lin, R.S.; Lin, J.; Roychoudhury, S.; Anderson, K.M.; Hu, T.; Huang, B.; Leon, L.F.; Liao, J.J.; Liu, R.; Luo, X.; et al. Alternative Analysis Methods for Time to Event Endpoints Under Nonproportional Hazards: A Comparative Analysis. Statistics in Biopharmaceutical Research 2020, 12, 187–198.
  2. Bartlett, J.W.; Morris, T.P.; Stensrud, M.J.; Daniel, R.M.; Vansteelandt, S.K.; Burman, C.F. The Hazards of Period Specific and Weighted Hazard Ratios. Statistics in Biopharmaceutical Research 2020, 12, 518–519.
  3. Hernán, M.A. The Hazards of Hazard Ratios. Epidemiology 2010, 21.
  4. Aalen, O.O.; Cook, R.J.; Røysland, K. Does Cox analysis of a randomized survival study yield a causal treatment effect? Lifetime Data Analysis 2015, 21, 579–593.
  5. Martinussen, T.; Vansteelandt, S.; Andersen, P.K. Subtleties in the interpretation of hazard contrasts. Lifetime Data Analysis 2020, 26, 833–855.
  6. Martinussen, T. Causality and the Cox Regression Model. Annual Review of Statistics and Its Application 2022, 9, 249–259.
  7. Prentice, R.L.; Aragaki, A.K. Intention-to-treat comparisons in randomized trials. Statistical Science 2022, 37, 380–393.
  8. Ying, A.; Xu, R. On Defense of the Hazard Ratio. 2023; arXiv:2307.11971 [math.ST].
  9. Fay, M.P.; Li, F. Causal interpretation of the hazard ratio in randomized clinical trials. Clinical Trials 2024, 21, 623–635.
  10. Rufibach, K. Treatment effect quantification for time-to-event endpoints–Estimands, analysis strategies, and beyond. Pharmaceutical Statistics 2019, 18, 145–165.
  11. Kloecker, D.E.; Davies, M.J.; Khunti, K.; Zaccardi, F. Uses and Limitations of the Restricted Mean Survival Time: Illustrative Examples From Cardiovascular Outcomes and Mortality Trials in Type 2 Diabetes. Annals of Internal Medicine 2020, 172, 541–552.
  12. Snapinn, S.; Jiang, Q.; Ke, C. Treatment effect measures under nonproportional hazards. Pharmaceutical Statistics 2023, 22, 181–193.
  13. Cui, Y.; Kosorok, M.R.; Sverdrup, E.; Wager, S.; Zhu, R. Estimating heterogeneous treatment effects with right-censored data via causal survival forests. Journal of the Royal Statistical Society Series B: Statistical Methodology 2023, 85, 179–211.
  14. Xu, S.; Cobzaru, R.; Finkelstein, S.N.; Welsch, R.E.; Ng, K.; Shahn, Z. Estimating Heterogeneous Treatment Effects on Survival Outcomes Using Counterfactual Censoring Unbiased Transformations. 2024; arXiv:2401.11263 [stat.ME].
  15. Frauen, D.; Schröder, M.; Hess, K.; Feuerriegel, S. Orthogonal Survival Learners for Estimating Heterogeneous Treatment Effects from Time-to-Event Data. 2025; arXiv:2505.13072 [cs.LG].
  16. Kennedy, E.H. Semiparametric Theory and Empirical Processes in Causal Inference. In Statistical Causal Inferences and Their Applications in Public Health Research; He, H., Wu, P., Chen, D.G.D., Eds.; Springer International Publishing: Cham, 2016; pp. 141–167. https://doi.org/10.1007/978-3-319-41259-7_8.
  17. Leviton, A.; Loddenkemper, T. Design, implementation, and inferential issues associated with clinical trials that rely on data in electronic medical records: a narrative review. BMC Medical Research Methodology 2023, 23, 271.
  18. Hernán, M.A.; Brumback, B.; Robins, J.M. Marginal Structural Models to Estimate the Joint Causal Effect of Nonrandomized Treatments. Journal of the American Statistical Association 2001, 96, 440–448.
  19. Van der Laan, M.J.; Rose, S. Targeted Learning in Data Science; Springer: Cham, 2018.
  20. Chernozhukov, V.; Chetverikov, D.; Demirer, M.; Duflo, E.; Hansen, C.; Newey, W.; Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal 2018, 21, C1–C68.
  21. Ahrens, A.; Chernozhukov, V.; Hansen, C.; Kozbur, D.; Schaffer, M.; Wiemann, T. An Introduction to Double/Debiased Machine Learning. 2025; arXiv:econ.EM/2504.08324.
  22. Ren, J.J.; Zhou, M. Full likelihood inferences in the Cox model: an empirical likelihood approach. Annals of the Institute of Statistical Mathematics 2011, 63, 1005–1018.
  23. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: New York, 2006.
  24. van der Laan, M.J.; Petersen, M.L.; Joffe, M.M. History-adjusted marginal structural models and statically-optimal dynamic treatment regimens. The International Journal of Biostatistics 2005, 1.
  25. Hille, E.; Phillips, R.S. Functional Analysis and Semi-Groups, 3rd printing of the 1957 revised edition; Colloquium Publications, Vol. 31; American Mathematical Society: Providence, RI, 1974.
  26. Lanckriet, G.R.; Cristianini, N.; Bartlett, P.; Ghaoui, L.E.; Jordan, M.I. Learning the kernel matrix with semidefinite programming. Journal of Machine Learning Research 2004, 5, 27–72.
  27. Bach, F.R. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research 2008, 9, 1179–1225.
  28. Meier, L.; Van de Geer, S.; Bühlmann, P. High-dimensional additive modeling. The Annals of Statistics 2009, 37, 3779–3821.
  29. Koltchinskii, V.; Yuan, M. Sparsity in multiple kernel learning. The Annals of Statistics 2010, 38, 3660–3695.
  30. Suzuki, T.; Sugiyama, M. Fast learning rate of multiple kernel learning: Trade-off between sparsity and smoothness. The Annals of Statistics 2013, 41, 1381–1405.
  31. Bach, F.; Jordan, M. Kernel independent component analysis. Journal of Machine Learning Research 2003, 3, 1–48.
  32. Liu, D.C.; Nocedal, J. On the limited memory BFGS method for large scale optimization. Mathematical Programming 1989, 45, 503–528.
  33. Williams, C.K.; Rasmussen, C.E. Gaussian Processes for Machine Learning; MIT Press: Cambridge, MA, 2006.
  34. Berlinet, A.; Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics; Springer Science & Business Media, 2011.
  35. Cox, D.R. Regression Models and Life-Tables. Journal of the Royal Statistical Society: Series B (Methodological) 1972, 34, 187–202.
  36. Efron, B. The Efficiency of Cox’s Likelihood Function for Censored Data. Journal of the American Statistical Association 1977, 72, 557–565.
  37. Oakes, D. The Asymptotic Information in Censored Survival Data. Biometrika 1977, 64, 441–448.
  38. Thackham, M.; Ma, J. On maximum likelihood estimation of the semi-parametric Cox model with time-varying covariates. Journal of Applied Statistics 2020, 47, 1511–1528.
  39. Luo, J.; Rava, D.; Bradic, J.; Xu, R. Doubly robust estimation under a possibly misspecified marginal structural Cox model. Biometrika 2024, 112, asae065.
  40. Zhang, Z.; Stringer, A.; Brown, P.; Stafford, J. Bayesian inference for Cox proportional hazard models with partial likelihoods, nonlinear covariate effects and correlated observations. Statistical Methods in Medical Research 2023, 32, 165–180.
  41. Ma, J.; Heritier, S.; Lô, S.N. On the maximum penalized likelihood approach for proportional hazard models with right censored survival data. Computational Statistics and Data Analysis 2014, 74, 142–156.
  42. Imai, K.; Van Dyk, D.A. Causal inference with general treatment regimes: Generalizing the propensity score. Journal of the American Statistical Association 2004, 99, 854–866.
  43. Hu, B.; Nan, B. Conditional distribution function estimation using neural networks for censored and uncensored data. Journal of Machine Learning Research 2023, 24, 1–26.
  44. Kostic, V.; Pacreau, G.; Turri, G.; Novelli, P.; Lounici, K.; Pontil, M. Neural conditional probability for uncertainty quantification. Advances in Neural Information Processing Systems 2024, 37, 60999–61039.
  45. Allman, E.S.; Matias, C.; Rhodes, J.A. Identifiability of parameters in latent structure models with many observed variables. The Annals of Statistics 2009, 37, 3099–3132.
  46. Allman, E.S.; Rhodes, J.A.; Stanghellini, E.; Valtorta, M. Parameter identifiability of discrete Bayesian networks with hidden variables. Journal of Causal Inference 2015, 3, 189–205.
  47. Gassiat, E.; Cleynen, A.; Robin, S. Inference in finite state space non parametric Hidden Markov Models and applications. Statistics and Computing 2016, 26, 61–71.
  48. Gassiat, E.; Rousseau, J. Nonparametric finite translation hidden Markov models and extensions. Bernoulli 2016, 22, 193–212.
  49. Wieland, F.G.; Hauber, A.L.; Rosenblatt, M.; Tönsing, C.; Timmer, J. On structural and practical identifiability. Current Opinion in Systems Biology 2021, 25, 60–69.
  50. Watanabe, S. Algebraic Geometry and Statistical Learning Theory; Vol. 25, Cambridge University Press, 2009.
  51. Calderhead, B.; Girolami, M. Estimating Bayes factors via thermodynamic integration and population MCMC. Computational Statistics and Data Analysis 2009, 53, 4028–4045.
  52. Watanabe, S. A widely applicable Bayesian information criterion. Journal of Machine Learning Research 2013, 14, 867–897.
  53. Drton, M.; Plummer, M. A Bayesian Information Criterion for Singular Models. Journal of the Royal Statistical Society Series B: Statistical Methodology 2017, 79, 323–380.
  54. Del Moral, P. Feynman-Kac Formulae: Genealogical and Interacting Particle Systems with Applications; Springer, 2004.
  55. Chopin, N.; Papaspiliopoulos, O. An Introduction to Sequential Monte Carlo; Springer, 2020.
Figure 1. Two causal graphs describing the causal relationships among the counting process for the outcome event, $N_t = I_{(-\infty, t]}(T)$, the treatment variable $A_t$, the covariate $X_t$, and the unmeasured factors $W_t$, which bifurcate into an actual process and a counterfactual process at time $t$. Remark 3 discusses how these causal relationships affect the interpretation of the estimation results in the proposed framework.
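As a minimal, self-contained illustration (not taken from the paper's code), the counting process in the caption is simply the indicator that the outcome event has occurred by time $t$; the event time and time grid below are hypothetical placeholders.

```python
import numpy as np

# Counting process N_t = I(T <= t): jumps from 0 to 1 at the event time T.
# The event time (5.0) and the time grid are hypothetical placeholders.
def counting_process(event_time: float, t_grid: np.ndarray) -> np.ndarray:
    return (t_grid >= event_time).astype(int)

t_grid = np.linspace(0.0, 10.0, 11)
print(counting_process(5.0, t_grid))  # -> [0 0 0 0 0 1 1 1 1 1 1]
```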
Figure 2. Numerical results for a clinically plausible data-generation process. (A)–(C): The mean and standard error of the logarithms of the Bayes factors for (A) the correctly specified model and the misspecified models lacking $f_2(X_{2t})$, (B) the misspecified models with or without the inclusion of $f_t(t)$, and (C) the correctly specified models with or without the inclusion of $f_t(t)$, calculated with ten bootstrap datasets and different regularization parameters $\lambda_g$. The other hyperparameters are fixed to the optimal values for the smaller model. The log-Bayes factors are calculated by subtracting the log-BME of the larger model from that of the smaller model. (D) and (E): The mean and standard error of (D) log-BME and (E) $\mathrm{CVErr}_{g_1}$ calculated in the logistic regression for $g_1$ with ten bootstrap datasets and different regularization parameters. (F): The histograms of t-statistics measuring the bias in the naive ML estimator of $\theta_1^*$ and the debiased ML estimators of $\theta_1^*$ based on the nuisance parameter $g_1$ estimated with $\lambda_g = 70$ and $\lambda_g = 1$, respectively.
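To make the caption's sign convention concrete, the sketch below computes the log-Bayes factor as log-BME(smaller model) minus log-BME(larger model), together with its mean and standard error over bootstrap datasets; the per-bootstrap log-BME values are hypothetical placeholders, not the paper's actual outputs.

```python
import numpy as np

# Hypothetical per-bootstrap log-BME values for a smaller and a larger model
# (ten bootstrap datasets, as in Figure 2); real values come from fitted models.
logbme_small = np.array([-512.3, -510.8, -514.1, -511.5, -513.0,
                         -512.7, -511.9, -513.4, -512.1, -510.6])
logbme_large = np.array([-509.9, -508.4, -511.2, -509.1, -510.5,
                         -510.0, -509.3, -510.8, -509.6, -508.2])

# Caption convention: log-Bayes factor = log-BME(smaller) - log-BME(larger),
# so negative values favor the larger model.
log_bf = logbme_small - logbme_large

mean = log_bf.mean()
stderr = log_bf.std(ddof=1) / np.sqrt(len(log_bf))  # standard error over bootstraps
print(f"log-Bayes factor: {mean:.2f} +/- {stderr:.2f}")
```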
Figure 3. Numerical results for a clinically plausible data-generation process with a time-independent unobserved factor affecting the outcome. (A) and (B): The mean and standard errors of log-BME for (A) the misspecified model with only the observed covariates, with or without the inclusion of $f(X_{1t}^{(\mathrm{test})}, t)$, and (B) the correctly specified model with a latent variable and the misspecified model with only the observed covariates, but without the inclusion of $f(X_{1t}^{(\mathrm{test})}, t)$, calculated with ten bootstrap datasets and different regularization parameters for Gaussian kernels. The other hyperparameters are fixed to the optimal values for the smaller (null) model. (C) The histograms of t-statistics measuring the bias in the naive ML estimator of $\theta_1$ and the debiased ML estimator of $\theta_1$ based on the correctly specified model with a latent variable.
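The t-statistics in panel (C), like those in Figure 2(F), summarize estimator bias across simulated datasets. The sketch below shows one plausible form of such a diagnostic, assuming each simulation replicate yields a point estimate and a standard error; the ground-truth value, the synthetic estimates, and all variable names here are hypothetical, not the paper's actual simulation outputs.

```python
import numpy as np

# Hypothetical per-replicate estimates and standard errors; in Figures 2(F) and
# 3(C) these would come from the naive and debiased ML fits on simulated data.
theta_true = 0.5  # assumed ground-truth value of the hazard-ratio parameter
rng = np.random.default_rng(0)
theta_hat = rng.normal(theta_true, 0.1, size=200)  # synthetic estimates
se_hat = np.full(200, 0.1)                         # synthetic standard errors

# One common bias diagnostic: per-replicate t-statistic (theta_hat - theta_true)/se.
# For an unbiased estimator, these should be centered near zero.
t_stats = (theta_hat - theta_true) / se_hat
print(f"mean t-statistic: {t_stats.mean():.3f} (near 0 suggests little bias)")
```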
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.