Evaluating Dynamic Conditional Quantile Treatment Effects with Applications in Ridesharing

Many modern tech companies, such as Google, Uber, and Didi, utilize online experiments (also known as A/B testing) to evaluate new policies against existing ones. While most studies concentrate on average treatment effects, situations with skewed and heavy-tailed outcome distributions may benefit from alternative criteria, such as quantiles. However, assessing dynamic quantile treatment effects (QTE) remains a challenge, particularly when dealing with data from ride-sourcing platforms that involve sequential decision-making across time and space. In this paper, we establish a formal framework to calculate QTE conditional on characteristics independent of the treatment. Under specific model assumptions, we demonstrate that the dynamic conditional QTE (CQTE) equals the sum of individual CQTEs across time, even though the conditional quantile of cumulative rewards may not necessarily equate to the sum of conditional quantiles of individual rewards. This crucial insight significantly streamlines the estimation and inference processes for our target causal estimand. We then introduce two varying coefficient decision process (VCDP) models and devise an innovative method to test the dynamic CQTE. Moreover, we expand our approach to accommodate data from spatiotemporal dependent experiments and examine both conditional quantile direct and indirect effects. To showcase the practical utility of our method, we apply it to three real-world datasets from a ride-sourcing platform. Theoretical findings and comprehensive simulation studies further substantiate our proposal.


Introduction
Online experiments, often referred to as A/B testing in the computer science literature, are widely utilized by technology companies (e.g., Google, Netflix, Microsoft) to assess the effectiveness of new products or policies in comparison to existing ones. These companies have developed in-house A/B testing platforms for evaluating treatment effects and providing valuable experimental insights. Take ridesourcing platforms like Uber, Lyft, and Didi as examples. These platforms operate within intricate spatiotemporal ecosystems, dynamically matching passengers with drivers (see, for instance, Wang and Yang, 2019; Qin et al., 2020; Zhou et al., 2021). They implement online experiments to explore various order dispatch policies and customer recommendation initiatives. These innovative products hold the potential to enhance passenger engagement and satisfaction, diminish pickup waiting times, and boost driver earnings, ultimately leading to a more efficient and user-friendly transportation system.
In this study, we address the fundamental question of how to evaluate the difference between the quantile return of a new product (treatment) and that of an existing one.
Although the average treatment effect (ATE) is widely used in the literature to quantify the difference between two policies (Imbens and Rubin, 2015; Wang and Tchetgen Tchetgen, 2018; Kong et al., 2022), it only considers the average effect and does not account for variability around the expectation. In applications with skewed and heavy-tailed outcome distributions, decision-makers are more interested in the quantile treatment effect (QTE), which offers a more comprehensive characterization of distributional effects beyond the mean and is robust to heavy-tailed errors (see e.g., Abadie et al., 2002; Chernozhukov and Hansen, 2006; Chen and Hsiang, 2019). For example, in ridesourcing platforms, policymakers may want to determine which policy more effectively raises the lower tail of driver income.
Furthermore, developing valid inferential tools for QTE can reveal how treatment effects differ by quantile and provide valuable information about the entire distribution.
Addressing the problem mentioned earlier presents two significant challenges. The first challenge involves efficiently inferring the dynamic QTE (quantile treatment effect), which is defined as the difference between the quantiles of cumulative outcomes under the new and old policies, in long horizon settings with weak signals. In contrast to single-stage decision-making, policy makers for ridesourcing platforms assign treatments sequentially over time and across various locations. Existing estimators, such as those based on (augmented) inverse probability weighting (see e.g., Wang et al., 2018, Section 4), are subject to the curse of horizon, as described by Liu et al. (2018). This means their variances increase exponentially with respect to the horizon (i.e., the number of decision stages). Such approaches are inadequate in our context, where the horizon typically spans 24 or 48 stages and most policies improve key metrics by only 0.5% to 2% (Qin et al., 2022). Furthermore, unlike the average cumulative outcome, which can be broken down into the sum of individual outcome expectations, the quantile of cumulative outcomes generally does not equal the sum of individual quantiles. This makes estimating our causal effect extremely challenging.
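The exponential variance growth behind the curse of horizon can be illustrated with a small simulation. This is an illustrative sketch, not the paper's estimator: under a behavior policy that assigns each of m binary treatments with probability 0.5, the inverse-probability weight attached to one fixed target treatment sequence has variance 2^m − 1, which roughly doubles with every extra stage.

```python
import numpy as np

# Minimal illustration of the curse of horizon: with m Bernoulli(0.5)
# treatments, the importance weight for the all-ones target sequence is
# 2^m on matching trajectories and 0 otherwise, so Var(w) = 2^m - 1.
rng = np.random.default_rng(0)

def ipw_weight_variance(m, n=200_000):
    # n simulated trajectories of m Bernoulli(0.5) treatment assignments.
    a = rng.integers(0, 2, size=(n, m))
    # Weight: product over stages of 1{A_t = 1} / 0.5.
    w = np.prod(np.where(a == 1, 2.0, 0.0), axis=1)
    return w.var()

v5, v10 = ipw_weight_variance(5), ipw_weight_variance(10)
print(v5, v10)  # variance explodes as the horizon grows
```

With five extra stages the variance grows by roughly a factor of 32, which is why plain inverse-probability weighting is impractical at horizons of 24 or 48.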
Existing efficient evaluation methods designed for mean return, such as those proposed by Kallus and Uehara (2022) and Liao et al. (2022), cannot be easily adapted to our situation.
The second challenge arises from handling the interference effect caused by temporal and spatial proximities in spatiotemporal dependent experiments. This interference effect results in a treatment applied at one location influencing not only its own outcome, but also the outcomes at other locations. Additionally, the current treatment is likely to affect both present and future outcomes. Neglecting these effects would produce a biased QTE estimator. As far as we are aware, there is no existing test capable of concurrently addressing both challenges.

Related work
A/B testing has been extensively researched in the literature, as evidenced by the works of Yang et al. (2017) and Zhou et al. (2020), among other references. In contrast to most existing A/B testing methods that focus on the Average Treatment Effect (ATE), Quantile Treatment Effects (QTE) have received less attention. Among the few available studies, Liu et al. (2019) proposed a scalable method to test QTE and construct associated confidence intervals. Moreover, Wang and Zhang (2021) developed a nonparametric method to estimate QTEs at a continuous range of quantile locations, including point-wise confidence intervals.
More broadly, the estimation and inference of (conditional) QTEs have been considered in the causal inference literature, as seen in the works of Chernozhukov and Hansen (2006), Firpo (2007), and Blanco et al. (2020). However, these methods predominantly address single-stage decision-making. To the best of our knowledge, this paper represents the first attempt to explore QTE in temporally and/or spatially dependent experiments.
Our paper is closely related to the rapidly expanding body of literature on off-policy evaluation in sequential decision-making. The majority of existing studies primarily concentrate on inferring the expected return under a fixed target policy or a data-dependent estimated optimal policy (Zhang et al., 2013; Shi et al., 2020; Kallus and Uehara, 2022).
In recent years, several papers have explored policy evaluation beyond averages (Kallus et al., 2019; Qi et al., 2022; Xu et al., 2022). These works propose using (augmented) inverse probability weighted estimators to evaluate specific robust metrics under a given target policy. As noted previously, these methods are subject to the curse of horizon and become less effective in long-horizon settings. Most notably, policy evaluation in spatiotemporal dependent experiments remains unexplored in the aforementioned studies.
Recent proposals have investigated causal inference with temporal or spatial interference, including studies by Savje et al. (2021) and Hu et al. (2022), among others. However, these methods primarily focus on the average effect. Furthermore, our paper is closely related to the literature on distributional reinforcement learning (see e.g., Zhou et al., 2020). Despite this connection, these studies primarily concentrate on the policy learning problem, and uncertainty quantification of a target policy's quantile value remains unexplored.
Lastly, our paper is connected to a line of research on quantitative analysis of ridesharing across various fields such as economics, operations research, statistics, and computer science (see e.g., Shi et al., 2022; Zhao et al., 2022). Nevertheless, quantile policy evaluation has not been examined in these papers.

Contributions
Our proposal offers three valuable contributions to the existing literature. First, we present a framework for inferring dynamic conditional quantile treatment effects, defined as the dynamic QTE conditional on market features that are unaffected by the treatment history. While the unconditional QTE may be of interest, as previously noted, it assumes a highly complex form in long horizon settings and is extremely challenging to identify when the signal is weak. In contrast, we demonstrate that under certain modeling assumptions, the proposed dynamic conditional QTE (CQTE) is equal to the sum of individual CQTEs at each spatiotemporal unit, even though the conditional quantile of cumulative rewards does not necessarily equate to the sum of conditional quantiles of individual rewards. This finding significantly streamlines the estimation and inference processes for our causal estimand, making our proposal easily implementable in practice. Additionally, the estimated CQTE can exhibit a smaller variance compared to that of its unconditional counterpart.
Second, we introduce an innovative framework to test dynamic CQTE while accounting for the interference effect. We propose two Varying Coefficient Decision Process (VCDP) models, enabling the application of classical quantile regression (Koenker and Hallock, 2001) for parameter estimation and subsequent inference. We then develop a two-step method for estimating CQTE, along with a bootstrap-assisted procedure for testing CQTE. We further extend our proposal to analyze spatiotemporally dependent data and to test Conditional Quantile Direct Effects (CQDE) and Conditional Quantile Indirect Effects (CQIE).
Third, we thoroughly examine the theoretical and finite sample properties of our methods.
Theoretically, we prove the consistency of our proposed test procedure, allowing the horizon to diverge with the sample size. Notably, classical weak convergence theorems (Van Der Vaart and Wellner, 1996) necessitate a fixed horizon and are not directly applicable. Empirically, we apply our proposed method to real datasets obtained from a leading ridesourcing platform to assess the dynamic quantile treatment effects of new policies.

Organization of the paper
The paper's structure is as follows: Section 2 describes data from online randomized experiments. Section 3 covers temporally dependent experiments, the proposed model, and estimation and inference procedures. Section 4 extends the proposal to spatiotemporally dependent data. Section 5 decomposes CQTE into CQDE and CQIE. Section 6 presents the proposed CQTE test's asymptotic results. Section 7 evaluates ridesourcing dispatching and repositioning policies, and Section 8 assesses our methods' finite sample performance using real-data-based simulations.

Data Description
In this paper, we analyze three real datasets collected from Didi Chuxing, one of the world's leading ride-sharing companies. One dataset was collected during a time-dependent A/B experiment conducted in a city from December 10, 2021 to December 23, 2021. The goal of this experiment was to evaluate the performance of a newly designed order dispatching policy, which aimed to increase the number of fulfilled ride requests and boost drivers' total revenue. To protect privacy, we do not disclose the city name or the specific policy used. During the experiment, each day was divided into 24 equally spaced non-overlapping time intervals. The new policy (B) and the old policy (A) were alternated and assigned to these intervals every day. On the first day, we used an alternating sequence of AB...AB, and on the second day, we used BA...BA.
Thereafter, we switched between A and B every two days, ensuring that each policy was used with equal probability at each time interval, meeting the positivity assumption. For more details, see Section 3.1. It is worth noting that such an alternating-time-interval design is commonly used in industry to reduce the variance of treatment effect estimators, as discussed in Section 6.2 of Shi et al. (2022). For further information, please refer to the article by Lyft on experimentation in a ride-sharing marketplace at https://eng.lyft.com/experimentation-in-a-ridesharing-marketplace-b39db027a66e.
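As a sketch, the alternating-time-interval design described above can be encoded as follows. The exact switching convention after day two is our reading of the description (day 1 starts with A, day 2 with B, and the starting policy then switches every two days), so the `start` rule below should be treated as an assumption.

```python
# Sketch of the alternating-time-interval design: 1 encodes the new
# policy (B) and 0 the old policy (A). Day 1 runs AB...AB, day 2 runs
# BA...BA, and thereafter the starting policy switches every two days
# (an assumed reading of the design), so each interval sees each
# policy with equal frequency over a four-day cycle.
def alternating_design(n_days, m):
    """Return an n_days-by-m nested list of treatment indicators."""
    schedule = []
    for day in range(n_days):
        # Starting policy pattern over days: A, B, B, A, A, B, B, ...
        start = ((day + 1) // 2) % 2
        schedule.append([(start + j) % 2 for j in range(m)])
    return schedule

design = alternating_design(4, 6)
```

Over any full four-day cycle, each of the m intervals receives the new policy exactly twice, which is what makes the interval-level treatment probability equal to 0.5.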
The second dataset comes from a spatiotemporal-dependent experiment conducted in another city between February 19, 2020, and March 13, 2020. Each day is divided into 48 non-overlapping, equal time intervals, and the city is partitioned into 12 distinct, non-overlapping regions. On the first day of the experiment, the initial policy in each region is independently set to either the new or old policy with a 50% probability. The temporal alternation design for time-dependent experiments is then applied in each region.
In addition to the two datasets from A/B experiments mentioned earlier, we also analyze a third dataset collected from an A/A experiment. In this case, the two policies being compared are identical, and the treatment effect is zero. The experiment took place in a specific city from July 13, 2021 to September 17, 2021. This analysis serves as a sanity check to examine the size property of the proposed test. We expect that our test will not reject the null hypothesis when applied to this dataset, as the true effect is zero.

Figure 1: Scaled drivers' total income (a), request (b) and drivers' online time (c) in the temporally dependent A/B experiment, and the estimated density of drivers' total income (d) in the spatiotemporally dependent A/B experiment.
The ridesharing system dynamically connects passengers and drivers in real-time. All three datasets include the number of call orders and the total online time of drivers for each time interval. These metrics represent the supply and demand in this two-sided market.
The platform's outcomes include the drivers' total income, the answer rate (the number of call orders responded to), and the completion rate (the number of call orders completed) for each time interval. In our study, we are interested in determining whether the new policy improves drivers' total income at various quantile levels.
The datasets exhibit four distinct characteristics. First, the horizon is typically long (e.g., 24 or 48 decision stages per day), while the treatment effect is usually weak (e.g., 0.5%-2%). Second, both supply and demand are spatiotemporal networks that interact across time and location, as observed in panels (a) and (b) of Figure 1, which display drivers' online time and the number of call orders. Third, the outcome of interest follows a non-normal and heavy-tailed distribution, illustrated in panels (c) and (d) of Figure 1. Finally, there are interference effects over time and space, demonstrated in Figure 2, with temporal interference effects occurring when past actions impact future outcomes.
We focus on answering three key questions with these datasets: (Q1) How can we quantify treatment effects across various quantile levels for the time-dependent A/B experiment data in order to gain a comprehensive understanding of the new policy's effects within the city?
(Q2) How can we evaluate the quantile treatment effects for the spatiotemporally dependent experiment data described above?
(Q3) How can we determine whether or not to replace the old policy with the new one?

Figure 2: This example illustrates the temporal interference effect in ridesharing, where assigning different drivers to pick up a passenger significantly impacts future ride requests. (a) A city with 10 regions has a passenger in region 6 needing a ride, with three drivers in region 3 and one in region 10. Two actions are possible: assigning a driver from region 3 or region 10. (b) Assigning a driver from region 3 might result in an unmatched future request due to the driver in region 10 being too far from region 1. (c) Assigning the driver in region 10 could lead to all future ride requests being matched, preserving all three drivers in region 3.
These questions drive the methodological development outlined in Sections 3 and 4.

Testing CQTE in temporally dependent experiments
In this section, we explicitly state the test hypotheses for our first research question (Q1) and explore the primary challenge encountered in experiments exhibiting temporal dependence.
Subsequently, we detail the key technical assumptions under which the conditional quantile treatment effect (CQTE) equals the sum of individual CQTEs. Finally, to address the third research question (Q3), we present the proposed estimation and testing strategies.

CQTE, test hypotheses and assumptions
We consider the temporal alternation design with a sequence of treatments over time.
Specifically, we divide each day into m non-overlapping intervals. The platform can implement either one of the two policies at each time interval. For any t ≥ 1, let A t denote the policy implemented at the tth time interval where A t = 1 represents exposure to the new policy and A t = 0 represents exposure to the old policy. Let S t and Y t denote the state (e.g., the supply and demand) and the outcome at time t, respectively.
To formulate our problem, we adopt a potential outcome framework (Rubin, 2005).
We also define S*_t(ā_{t−1}) and Y*_t(ā_t) as the counterfactual state and counterfactual outcome, respectively, that would have occurred had the platform followed the treatment history ā_t. Our primary interest lies in quantifying the difference between the τth quantile of the cumulative outcomes under the new policy and that under the old policy, denoted as the quantile treatment effect (QTE):

QTE_τ = Q_τ(∑_{t=1}^m Y*_t(1_t)) − Q_τ(∑_{t=1}^m Y*_t(0_t)),    (1)

where 1_t and 0_t are vectors of 1s and 0s of length t, respectively, and Q_τ(·) denotes the quantile function at the τth level.
However, learning such an unconditional dynamic QTE from our experimental dataset is highly challenging. Recall that in our A/B experiment, the old and new policies are assigned alternately over the m time intervals. Nevertheless, the target policy we aim to evaluate corresponds to the global policy, which allocates the new or old policy globally throughout each day. This leads to an off-policy setting where the target policy differs from the behavior policy that generates the data. Existing off-policy quantile evaluation methods based on inverse probability weighting are inefficient in our setting with a moderately large m. Off-policy evaluation (OPE) methods, including Shi et al. (2020), Liao et al. (2021), and Kallus and Uehara (2022), are semiparametrically efficient in long-horizon settings. Despite this, these methods primarily focus on the mean return, making it difficult to adapt them for quantile evaluation due to the nonlinear quantile function Q_τ. To illustrate, note that there is no guarantee that Q_τ(∑_{t=1}^m Y*_t(ā_t)) equals ∑_{t=1}^m Q_τ(Y*_t(ā_t)). This observation motivates us to seek an alternative definition for QTE.
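The non-additivity of the quantile function is easy to verify numerically. A quick check with hypothetical exponential stage rewards (our choice, purely for illustration):

```python
import numpy as np

# Numerical check: for independent skewed stage rewards, the quantile
# of the cumulative outcome differs from the sum of per-stage quantiles.
rng = np.random.default_rng(1)
m, n, tau = 24, 100_000, 0.25
y = rng.exponential(scale=1.0, size=(n, m))  # skewed stage rewards

q_of_sum = np.quantile(y.sum(axis=1), tau)
sum_of_q = m * np.quantile(y[:, 0], tau)     # all stages share one law
print(q_of_sum, sum_of_q)
```

Here the lower-tail quantile of the 24-stage sum is far larger than 24 times the stage-wise lower-tail quantile, because summing independent rewards pulls the distribution toward its mean.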
Next, let E_t represent the set of features (e.g., extreme weather events) that have an impact on the outcomes up to time t, but are not influenced by the treatment history. This means that for any treatment history ā_t, the potential outcome of these features remains the same, i.e., E*_t(ā_t) = E_t. By definition, S_1 is an element of E_t, which ensures that E_t is non-empty for any t. We introduce CQTE as follows:

CQTE_τ = Q_τ(∑_{t=1}^m Y*_t(1_t) | E_m) − Q_τ(∑_{t=1}^m Y*_t(0_t) | E_m).

The CQTE is a reasonable measure because the set of conditioning variables remains consistent under both new and old policies. When m = 1, this definition reduces to the one used in single-stage decision-making, as discussed in previous literature, such as Chernozhukov and Hansen (2006). We further define the sum of individual CQTEs (SCQTE) as

SCQTE_τ = ∑_{t=1}^m {Q_τ(Y*_t(1_t) | E_t) − Q_τ(Y*_t(0_t) | E_t)}.

Compared to CQTE, SCQTE is easier to learn from observed data. For example, one can fit a quantile regression model at each stage, estimate individual CQTE values, and then sum these estimators together. Although the quantile function is not additive, we demonstrate in the following proposition that SCQTE is equal to CQTE under specific modeling assumptions.

Proposition 1. Suppose that for any time point t, Y*_t(ā_t) follows the structural quantile model Y*_t(ā_t) = φ_t(E_t, ā_t, U) for a specific deterministic function φ_t and a uniformly distributed random variable U ∼ Unif(0, 1), which is independent of {E_t}_t. Furthermore, assume that φ_t(E_t, 1_t, τ) and φ_t(E_t, 0_t, τ) are strictly increasing functions of τ for any E_t. Then CQTE_τ = SCQTE_τ.
Proposition 1 establishes the equivalence between CQTE and SCQTE and serves as a fundamental building block for our proposal. It allows us to focus on SCQTE, a simplified version of CQTE. This simplification greatly facilitates the estimation and inference procedures that follow, which rely on fitting a quantile regression model at each time point to learn the SCQTE. For more details, see Sections 3.2 and 3.3. Moreover, the proposed model in Proposition 1 is related to the structural quantile model in the quantile regression literature (Chernozhukov and Hansen, 2005, 2006). These models assume that, conditional on the covariate X = x, the potential outcome Y*(a) = q(a, x, U) for a = 0, 1 and U ∼ U(0, 1), where q(a, x, τ) is strictly increasing in τ. The uniformly distributed variable U serves as a rank variable that characterizes the heterogeneity of the outcome across different quantile levels. Under the monotonicity constraint, the τth conditional quantile of Y*(a) | X = x can be shown to equal q(a, x, τ). Proposition 1 motivates us to focus on testing the following hypotheses for each quantile level τ:

H_0: CQTE_τ ≤ 0    versus    H_1: CQTE_τ > 0.    (2)

The null and alternative hypotheses state that the treatment effect at the τth quantile is non-positive or positive, respectively.
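The intuition behind Proposition 1 can be checked numerically: when every stage outcome is a strictly increasing function of one shared rank variable U, the outcomes are comonotonic, and the quantile of the sum coincides with the sum of stage-wise quantiles. A minimal sketch, with hypothetical strictly increasing functions φ_t of our own choosing:

```python
import numpy as np

# Under the rank-variable model, each stage outcome is a strictly
# increasing function of one shared U ~ Unif(0, 1). The outcomes are
# then comonotonic, so the quantile of the sum equals the sum of the
# stage-wise quantiles (unlike the independent-rewards case).
rng = np.random.default_rng(2)
n, tau = 200_000, 0.7
u = rng.uniform(size=n)                 # shared rank variable

# Hypothetical strictly increasing structural functions phi_t on (0, 1).
phis = [lambda v: v ** 2,
        lambda v: np.exp(v),
        lambda v: -np.log1p(-v),
        lambda v: 3.0 * v,
        lambda v: np.sqrt(v)]
y = np.column_stack([phi(u) for phi in phis])

q_sum = np.quantile(y.sum(axis=1), tau)
sum_q = sum(np.quantile(y[:, t], tau) for t in range(len(phis)))
print(q_sum, sum_q)   # approximately equal
```

This is exactly the structure that lets the paper replace the hard-to-estimate quantile of a sum with a sum of stage-wise conditional quantiles.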
In this study, we utilize the consistency assumption (CA), sequential ignorability assumption (SRA), and positivity assumption (PA) to identify the causal estimand. These assumptions are frequently used in the dynamic treatment regime literature for learning optimal dynamic treatment policies (Gill and Robins, 2001). The consistency assumption (CA) states that the potential state and outcome, given the observed data history, should align with the actual observed state and outcome. The sequential ignorability assumption (SRA) demands that the action be conditionally independent of all potential variables, given the past data history. In our application, the SRA is inherently satisfied as the policy is assigned according to the alternating-time-interval design, independent of data history.
The positivity assumption (PA) necessitates that the probability of {A t = 1}, given the observed data history, must be strictly between zero and one for any t ≥ 1. Under the alternating-time-interval design, this probability is equal to 0.5, which satisfies the PA automatically. It is essential to note that the combination of CA, SRA, and PA enables the consistent estimation of the potential outcome distribution using the observed data.

VCDP models
Suppose that the experiment is conducted over n consecutive days.
Let (S_{i,j}, A_{i,j}, Y_{i,j}) be the state-treatment-outcome triplet measured at the jth time interval of the ith day for i = 1, . . . , n and j = 1, . . . , m. We assume that these triplets are independent across different days, but may be dependent within each day over time.
We begin by introducing two varying coefficient decision process models, one for the outcome and the other for the state. The first model characterizes the conditional quantile of the outcome and is given by

Y_{i,t} = Z_{i,t}^⊤ θ(t, U_i),    (3)

where Z_{i,t} collects an intercept, the state S_{i,t}, and the treatment A_{i,t}, θ(t, ·) is a vector of time-varying coefficients, and U_i ∼ U(0, 1) is the rank variable. Model (3) extends to sequential decision making the idea of using rank variables to represent unobserved heterogeneity across different quantiles in a single-stage study.
The second model characterizes the conditional mean of the observed state variables and is given by

S_{i,t+1} = φ_0(t) + Φ(t) S_{i,t} + Γ(t) A_{i,t} + E_i(t+1),    (4)

where φ_0(t), Φ(t), and Γ(t) are time-varying intercept and coefficient matrices, and E_i(t+1) is a random error term whose conditional mean given Z_{i,t} equals zero. It is worth noting that models (3) and (4) belong to the class of varying-coefficient regression models. The existing literature on this topic mainly focuses on estimating the relationships between scalar predictors and scalar responses (Sherwood and Wang, 2016), between scalar predictors and functional responses (Zhang et al., 2022), or between longitudinal predictors and responses (Wang et al., 2009). However, to the best of our knowledge, none of these works have utilized varying-coefficient regression models for policy evaluation in sequential decision making.
Furthermore, the temporal independence between the E_i(t + 1)'s implies that the state vector satisfies the Markov property, i.e., S_{i,t+1} is independent of the past data history given (S_{i,t}, A_{i,t}) for any i and t. However, the immediate outcomes are conditionally dependent due to the existence of the rank variable U. Consequently, the resulting data generating process does not fall under the classical Markov decision processes (MDPs; see e.g., Puterman, 2014). Moreover, models (3) and (4) are valid when the potential outcomes satisfy similar assumptions in the quantile varying coefficient models. Please refer to (19) and (20) in the supplementary material for more details.
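To make the two VCDP models concrete, here is a minimal data-generating sketch. The linear forms Y_{i,t} = Z_{i,t}'θ(t, U_i) and S_{i,t+1} = φ_0(t) + Φ(t)S_{i,t} + Γ(t)A_{i,t} + E_i(t+1), the specific coefficient shapes, and the scalar state are all illustrative assumptions; the one feature carried over from the models is that a single rank variable U_i is drawn per day.

```python
import numpy as np

# Minimal data-generating sketch of the two VCDP models (illustrative
# forms and coefficients, scalar state): Y_{i,t} = Z'theta(t, U_i) with
# Z = (1, S, A)', and a linear state transition with i.i.d. errors.
rng = np.random.default_rng(3)
n, m = 50, 24

def theta(t, u):
    # Hypothetical smooth time-varying coefficients; u is the rank variable.
    return np.array([1.0 + np.sin(2 * np.pi * t / m),  # intercept
                     0.5,                              # state effect
                     0.1 * u])                         # treatment effect, increasing in u

def generate_day(a_seq):
    u = rng.uniform()                  # one shared rank variable per day
    s, ys = 0.0, []
    for t, a in enumerate(a_seq):
        z = np.array([1.0, s, float(a)])
        ys.append(z @ theta(t, u))
        # State transition: phi_0 + Phi * s + Gamma * a + noise.
        s = 0.3 + 0.6 * s + 0.2 * a + rng.normal(scale=0.1)
    return ys

day = generate_day([t % 2 for t in range(m)])
```

Because U_i is shared within a day, outcomes at different stages of the same day are dependent even after conditioning on states, which is the non-Markovian feature noted above.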
If the residual E_i(t+1) is independent of the treatment history and U_i, the assumptions in Proposition 1 are satisfied. Hence, under the proposed VCDP models, CQTE is equivalent to SCQTE. The subsequent proposition offers a closed-form formula for CQTE.
Proposition 2 enables us to estimate CQTE through SCQTE under certain assumptions.
To evaluate policy value, we need to estimate the model parameters β, γ, Φ, and Γ.
Notice that under the conditions of Proposition 2, we have that

Y_{i,t} = Z_{i,t}^⊤ θ(t, τ) + e_i(t, τ),    (6)

where e_i(t, τ) = Z_{i,t}^⊤ {θ(t, U_i) − θ(t, τ)}, and its conditional τth quantile given Z_{i,t} equals zero. Therefore, we can employ ordinary quantile regression to learn β and γ. Meanwhile, since the residuals E_i(t)'s are independent over time, ordinary least-squares regression is applicable to the state regression model to estimate Φ and Γ. We detail our estimating procedure in the next section.
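The stage-wise fits described above can be sketched as follows: quantile regression via direct check-loss minimization for the outcome model, and least squares for the state model. The simulated coefficients and the use of SciPy's Nelder-Mead optimizer are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

# Stage-wise estimation sketch at a fixed stage t: quantile regression
# (check-loss minimization) for the outcome model and ordinary least
# squares for the state model. All true coefficients are hypothetical.
rng = np.random.default_rng(4)
n, tau = 500, 0.5
s = rng.normal(size=n)
a = rng.integers(0, 2, size=n)
z = np.column_stack([np.ones(n), s, a])
# Heavy-tailed (t with 3 df) noise, symmetric so the median is the truth.
y = z @ np.array([1.0, 0.5, 0.3]) + rng.standard_t(df=3, size=n) * 0.2

def check_loss(theta):
    u = y - z @ theta
    return np.sum(u * (tau - (u < 0)))   # rho_tau(u) = u(tau - 1{u<0})

theta_hat = minimize(check_loss, x0=np.zeros(3), method="Nelder-Mead",
                     options={"maxiter": 2000, "xatol": 1e-8,
                              "fatol": 1e-8}).x

# State model: OLS of the next state on (1, s, a).
s_next = z @ np.array([0.3, 0.6, 0.2]) + rng.normal(scale=0.1, size=n)
Theta_hat, *_ = np.linalg.lstsq(z, s_next, rcond=None)
print(theta_hat, Theta_hat)
```

In practice one would use a dedicated quantile regression routine; direct minimization of the check loss is shown here only to make the objective explicit.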

Estimation and inference procedures
In this subsection, we outline the procedures for estimating and testing CQTE based on the results in Proposition 2. We first estimate the regression coefficients in models (6) and (4). We then plug these estimates into (5) to estimate CQTE. Finally, we develop a bootstrap-assisted procedure to test CQTE.
Let S^{(ν)}_{i,t+1}, φ^{(ν)}_0(t), and Γ^{(ν)}(t) denote the νth entries of S_{i,t+1}, φ_0(t), and Γ(t), respectively. Let Φ^{(ν)}(t) and Θ^{(ν)}(t) denote the νth rows of Φ(t) and Θ(t) = (φ_0(t), Φ(t), Γ(t)), respectively. It follows from (4) that

S^{(ν)}_{i,t+1} = Θ^{(ν)}(t) Z_{i,t} + E^{(ν)}_i(t+1).

We propose a two-step procedure to estimate θ(t, τ) and Θ(t). In the first step, we minimize the following objective functions at each time t:

θ̃(t, τ) = argmin_θ ∑_{i=1}^n ρ_τ(Y_{i,t} − Z_{i,t}^⊤ θ),    (7)

Θ̃^{(ν)}(t) = argmin_ϑ ∑_{i=1}^n (S^{(ν)}_{i,t+1} − Z_{i,t}^⊤ ϑ)²,    (8)

where ρ_τ(u) = u{τ − 1(u < 0)} denotes the quantile check loss. These one-step estimates can be computed easily but suffer from large variances as they rely solely on observations at time t. In the second step, we employ kernel smoothing to reduce the variances of these initial estimators and identify weak signals (Zhu et al., 2014).
Specifically, for a given kernel function K(·), the second-step estimators θ̂(t, τ) and Θ̂^{(ν)}(t) combine the one-step estimators θ̃(s, τ) and Θ̃^{(ν)}(s) as

θ̂(t, τ) = ∑_{s=1}^m ω_h(s, t) θ̃(s, τ),  Θ̂^{(ν)}(t) = ∑_{s=1}^m ω_h(s, t) Θ̃^{(ν)}(s),    (9)-(10)

where ω_h(s, t) = K((s − t)/h)/∑_{s′=1}^m K((s′ − t)/h) is the weight function and h denotes the kernel bandwidth. The use of kernel smoothing allows us to estimate the varying coefficients θ(t, τ) and Θ(t) for any real-valued t. Given θ̂(t, τ) and Θ̂^{(ν)}(t), we can compute a plug-in estimator of CQTE_τ, denoted by T_τ, from the closed-form expression in Proposition 2. To test (2), we use T_τ as the test statistic. Under the null hypothesis, T_τ is expected to be negative or close to zero; therefore, we reject the null hypothesis for a large value of T_τ. However, deriving the limiting distribution of T_τ for large m is complicated due to the complex dependence of T_τ on the estimated model parameters.
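The second smoothing step can be sketched as follows, with a Gaussian kernel and hypothetical noisy stage-wise estimates; the weight function matches the Nadaraya-Watson form ω_h(s, t) = K((s − t)/h)/∑_{s′} K((s′ − t)/h).

```python
import numpy as np

# Second-step smoothing sketch: combine noisy stage-wise estimates
# theta_tilde(s) with kernel weights w_h(s, t), borrowing strength
# from neighbouring stages to reduce variance.
rng = np.random.default_rng(5)
m, h = 24, 3.0
t_grid = np.arange(1, m + 1)
true = np.sin(2 * np.pi * t_grid / m)                 # smooth varying coefficient
theta_tilde = true + rng.normal(scale=0.5, size=m)    # noisy one-step estimates

def smooth(t, h):
    k = np.exp(-0.5 * ((t_grid - t) / h) ** 2)        # Gaussian kernel
    return np.sum(k * theta_tilde) / np.sum(k)        # normalized weights

theta_hat = np.array([smooth(t, h) for t in t_grid])
# Smoothing should shrink the error of the stage-wise estimates.
print(np.mean((theta_hat - true) ** 2), np.mean((theta_tilde - true) ** 2))
```

The bandwidth h controls the bias-variance trade-off: larger h averages over more stages, which helps when signals are weak but can oversmooth rapidly varying coefficients.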
To address this issue, we use the bootstrap method to simulate the distribution of CQTE τ under the null hypothesis. Specifically, we modify the bootstrap method proposed by Horowitz and Krishnamurthy (2018) and adapt it to our setting as follows. Horowitz and Krishnamurthy (2018) proposed to resample the estimated residuals to infer the conditional quantile function in a nonparametric quantile regression model. In our case, to handle the dependence over time, we resample the entire error process (see Step 3 below for details).
The bootstrap method for the CQTE estimator is implemented as follows:

• Step 1. Compute the estimators θ̂(t, τ) and Θ̂(t) in (9) and (10).

• Step 2. Estimate the residuals by ê_i(t, τ) = Y_{i,t} − Z_{i,t}^⊤ θ̂(t, τ) for t = 1, . . . , m and i = 1, . . . , n.

• Step 3. For b = 1, . . . , B, resample the entire residual processes {(ê_i(1, τ), . . . , ê_i(m, τ))}_{i=1}^n with replacement across days, keeping each day's process intact to preserve the dependence over time, and construct the corresponding pseudo outcomes.

• Step 4. For each b, compute the bootstrap estimates θ̂_b(t, τ) and Θ̂_b(t) according to equations (7)-(10) using the pseudo outcomes.

In the supplementary material, we present Theorem 1, which rigorously establishes the consistency of the aforementioned bootstrap method. It is worth noting that the bootstrap consistency theory elaborated in Horowitz and Krishnamurthy (2018) is not readily applicable to our context, where m can increase along with the sample size.
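The day-level residual-process resampling in Step 3 can be sketched as follows. The fitted values and residual processes are synthetic stand-ins, and the bootstrap statistic here is a simple placeholder rather than the CQTE estimator.

```python
import numpy as np

# Sketch of the residual-process bootstrap: to preserve within-day
# dependence, whole days' residual trajectories e_i = (e_i(1), ...,
# e_i(m)) are resampled with replacement, rather than individual
# residuals, and pseudo outcomes are rebuilt from the fitted values.
rng = np.random.default_rng(6)
n, m, B = 40, 24, 200
fitted = np.tile(np.linspace(1.0, 2.0, m), (n, 1))    # stand-in for Z theta(t, tau)
resid = rng.normal(size=(n, m)).cumsum(axis=1) * 0.1  # dependent-over-time residuals

stats = []
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # resample day indices
    pseudo = fitted + resid[idx]       # each day's residual process kept intact
    stats.append(pseudo.sum(axis=1).mean())  # placeholder statistic
boot = np.array(stats)
print(boot.mean(), boot.std())
```

Resampling whole trajectories rather than individual residuals is what carries the within-day serial dependence into the bootstrap distribution.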

Extension to spatiotemporally dependent experiments
In this section, we aim to address (Q2) and extend the method proposed in Section 3 to analyze data from spatiotemporally dependent experiments, in which multiple non-overlapping regions receive distinct treatments sequentially over time. Let r represent the number of these non-overlapping regions. As previously discussed, these experiments are subject not only to temporal interference effects but also to spatial interference, whereby the policy implemented in one location may influence the outcomes in other locations. We begin by defining the test hypotheses, followed by an introduction to our models and the proposed procedures.

Test hypotheses
For the ιth region, we use ā_{t,ι} = (a_{1,ι}, . . . , a_{t,ι}) to denote its treatment history up to time t. Let ā_{t,[1:r]} = (ā_{t,1}, . . . , ā_{t,r}) represent the treatment history across all regions. Similarly, define S*_{t,ι}(ā_{t−1,[1:r]}) and Y*_{t,ι}(ā_{t,[1:r]}) as the potential observation and outcome for the ιth region, respectively. The set of potential observations at time t is denoted as S*_{t,[1:r]}(ā_{t−1,[1:r]}). In the spatiotemporal context, our focus is on the cumulative quantile treatment effects, aggregated over all regions. Specifically, we define CQTE and SCQTE at the τth quantile level as

CQTE^τ_st = Q_τ(∑_{t=1}^m ∑_{ι=1}^r Y*_{t,ι}(1_{t,[1:r]}) | E_{m,[1:r]}) − Q_τ(∑_{t=1}^m ∑_{ι=1}^r Y*_{t,ι}(0_{t,[1:r]}) | E_{m,[1:r]}),

SCQTE^τ_st = ∑_{t=1}^m ∑_{ι=1}^r {Q_τ(Y*_{t,ι}(1_{t,[1:r]}) | E_{t,[1:r]}) − Q_τ(Y*_{t,ι}(0_{t,[1:r]}) | E_{t,[1:r]})},

respectively, where E_{t,[1:r]} denotes the set of characteristics independent of the treatment history up to time t across all regions. For a given quantile level τ, our goal is to test whether the new policy outperforms the old one:

H_0: CQTE^τ_st ≤ 0    versus    H_1: CQTE^τ_st > 0.    (13)

Compared to the testing problem in (2), (13) focuses on global treatment effects aggregated over time and regions. We assume the consistency assumption holds. Similar to Section 3, under the spatial alternating-time-interval design, one can show that the sequential ignorability assumption and the positivity assumption are automatically satisfied, ensuring that CQTE^τ_st is identifiable from the observed data.

Spatiotemporal VCDP models
Suppose that the experiment lasts for n days, and each day is divided into m time intervals.
For i = 1, . . . , n, t = 1, . . . , m, and ι = 1, . . . , r, let (S_{i,t,ι}, A_{i,t,ι}, Y_{i,t,ι}) represent the state-treatment-outcome triplet measured from the ιth region at the tth time interval of the ith day. For each ι, N_ι denotes the set of neighbouring regions of ι. To model the quantiles of Y_{t,ι} and the dynamics of S_{t,ι}, we extend the two VCDP models in Section 3 to two spatiotemporal VCDP (STVCDP) models.
The first STVCDP model, given in (14), describes the quantile structure of the outcome and rests on two key assumptions.
Firstly, it is assumed that, as long as each experimental region is sufficiently large, the treatments in other regions affect the conditional quantile of $Y_{i,t,\iota}$ only through those of its neighbouring regions. This is because drivers can only travel between neighbouring regions within one time unit, so treatments in non-neighbouring regions are not expected to impact $Y_{i,t,\iota}$. Secondly, it is assumed that the treatments in neighbouring regions influence the conditional quantile of $Y_{i,t,\iota}$ only through their mean. This is a common mean-field assumption used to model spillover effects (e.g., Hudgens and Halloran, 2008; Shi et al., 2022) and can be tested using the observed data.
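As a toy illustration of the mean-field term, the sketch below computes the average treatment of neighbouring regions, $\bar{A}_{i,t,N_\iota}$, which is the only channel through which neighbours' treatments enter the outcome model. The region labels, neighbourhood map, and array layout are illustrative assumptions, not the paper's data structures.

```python
import numpy as np

def neighbor_mean_treatment(A, neighbors):
    """A: (r,) array of treatments at one (day, interval);
    neighbors: dict mapping region index -> list of neighbouring region indices."""
    r = len(A)
    Abar = np.zeros(r)
    for region, nbrs in neighbors.items():
        Abar[region] = A[nbrs].mean()   # mean-field term \bar A_{i,t,N_iota}
    return Abar

A = np.array([1, 0, 1, 1])                                  # treatments of r = 4 regions
neighbors = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}    # hypothetical adjacency
print(neighbor_mean_treatment(A, neighbors))                # region 0 -> (0 + 1) / 2 = 0.5
```

A model using this term would then regress the outcome on the own-region treatment and this neighbour average, rather than on all r treatments.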
The second STVCDP model, given in (15), specifies the conditional distribution of the next state given the current state-action pair, with varying autoregressive coefficient matrices. The conditional mean of each entry of the error process $E_i(t, \iota)$ given $Z_{i,t,\iota}$ is zero. The error process is required to be independent over time, although it may be dependent across different locations. Additionally, the varying coefficients are required to be smooth over the entire spatial domain, which helps to reduce the variances of the model estimators and improve the accuracy of the CQTE estimator. Models (14) and (15) hold under the assumption that the potential outcomes satisfy the quantile varying coefficient models described in the supplementary material (models (21) and (22)).
The following proposition provides a closed-form expression for $\mathrm{CQTE}^\tau_{st}$. Proposition 3. Suppose that CA and the conditions in equations (21) and (22) of the supplementary material hold, that U is independent of the collection of error processes, and that the quantile function is strictly increasing in $e_{t,\iota}$. Then the stated closed-form expression holds. Proposition 3 provides a foundation for constructing a plug-in estimator for $\mathrm{CQTE}^\tau_{st}$.
This forms the basis of the proposed inference procedure, which we discuss in more detail in the next section. Additionally, from models (14) and (15), we can derive an explicit expression for $Y_{i,t,\iota}$ in terms of an error term whose conditional τ-th quantile given $Z_{i,t,\iota}$ equals zero. It is worth mentioning that these models can be further extended to incorporate the effects of states from neighbouring regions on the immediate outcome by including another mean-field term $\Phi_2(t, \iota)\bar{S}_{i,t,N_\iota}$, where $\bar{S}_{i,t,N_\iota} = \sum_{\iota' \in N_\iota} S_{i,t,\iota'}/|N_\iota|$.
In this case, the closed-form expression for CQTE τ st can be similarly derived.

Estimation and inference procedures
In this subsection, we outline the estimation and testing procedures for CQTE τ st .
Firstly, for each time interval and region, we fit models (14) and (15) separately to obtain raw estimators of the varying coefficients. Secondly, we refine these raw estimators by employing kernel smoothing to borrow information across space. Specifically, we define the smoothed estimators through normalized kernel weights, where $\Theta_{0(\nu)}(t, \iota)$ is the ν-th column of $\Theta_0(t, \iota)$ and $\kappa_{h_{st}}(\iota)$ is a normalized kernel function with bandwidth parameter $h_{st}$, evaluated at the longitude and latitude $(u_\iota, v_\iota)$ of region ι. Consequently, regions with smaller spatial distances contribute more significantly.
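The spatial smoothing step can be sketched as follows. A Gaussian kernel over the Euclidean distance between region centroids is assumed here purely for illustration (the paper only requires a normalized kernel with bandwidth $h_{st}$), and the coordinates and raw estimates are made-up values.

```python
import numpy as np

def smooth_over_space(theta_raw, coords, h_st):
    """theta_raw: (r,) raw per-region estimates; coords: (r, 2) lon/lat of region centroids.
    Returns kernel-smoothed estimates: nearby regions receive larger weights."""
    r = theta_raw.shape[0]
    theta_smooth = np.empty(r)
    for j in range(r):
        d2 = np.sum((coords - coords[j]) ** 2, axis=1)   # squared spatial distances
        w = np.exp(-d2 / (2 * h_st ** 2))                # Gaussian kernel weights (assumed form)
        w /= w.sum()                                     # normalize the weights
        theta_smooth[j] = w @ theta_raw
    return theta_smooth

coords = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
theta_raw = np.array([2.0, 2.0, 2.0, 2.0])
print(smooth_over_space(theta_raw, coords, h_st=0.5))    # a constant field is preserved
```

Because the weights are normalized, smoothing leaves a spatially constant coefficient unchanged while averaging away region-level noise in non-constant fields.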
Thirdly, we estimate $\mathrm{CQTE}^\tau_{st}$ by substituting the refined estimators $\tilde\theta_{\tau st}(t, \iota)$ and $\tilde\Theta_{st}(t, \iota)$ into the closed-form expression in Proposition 3, and we use the resulting estimator $\widehat{\mathrm{CQTE}}{}^\tau_{st}$ as the test statistic $T^\tau_{st}$.
Finally, we introduce a bootstrap method to test (13). In each iteration, we resample the estimated error processes to obtain the bootstrap estimates $\tilde\theta^b_{\tau st}(t, \iota)$ and $\tilde\Theta^b_{st}(t, \iota)$, together with the corresponding bootstrapped statistic. As this approach closely parallels the one presented in Section 3.3, we omit further details for brevity.
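The residual-bootstrap step admits the following skeleton. Here `refit` is a hypothetical stand-in for re-estimating the STVCDP coefficients and recomputing the statistic on a bootstrap sample; in the degenerate toy call below it simply averages the resampled residuals, purely so the sketch runs end to end.

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_p_value(T_obs, residuals, refit, B=500):
    """Resample estimated error processes with replacement, recompute the
    statistic B times, and compare with the observed statistic T_obs."""
    T_boot = np.empty(B)
    for b in range(B):
        e_star = rng.choice(residuals, size=residuals.shape[0], replace=True)
        T_boot[b] = refit(e_star)            # bootstrapped statistic for iteration b
    # one-sided test: p-value is the share of bootstrap statistics exceeding T_obs
    return float(np.mean(T_boot >= T_obs))

residuals = np.zeros(50)                     # degenerate toy residuals
p_null = bootstrap_p_value(1.0, residuals, refit=lambda e: float(e.mean()))
print(p_null)                                # -> 0.0: no bootstrap statistic exceeds 1.0
```

In an actual implementation, `refit` would regenerate data from models (14) and (15) with the resampled errors and re-run the two-step estimation.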

Direct and indirect effects
Recall that Proposition 2 provides the closed-form expression of $\mathrm{CQTE}^\tau$, which allows us to divide the quantile treatment effect into two components. The first term, $\sum_{t=1}^{m} \gamma(t, \tau)$, represents the direct effect of the treatment on the immediate outcome (CQDE). Observe that for each t, the two potential outcomes $Y^*_t(\bar{1}_t)$ and $Y^*_t(\bar{1}_{t-1}, 0)$ differ in the treatment received at time t, but share the same treatment history. The second term, $\sum_{t=2}^{m} \beta^\top(t, \tau) \sum_{k=1}^{t-1} \{\prod_{l=k+1}^{t-1} \Phi(l)\} \Gamma(k)$, quantifies the carryover effects of past treatments on the current outcome (CQIE). Similar decompositions have been considered in Li and Wager (2022) and Shi et al. (2022).
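For a one-dimensional state, the two terms of the decomposition can be evaluated numerically as below. The coefficient values are arbitrary illustrative numbers, not estimates from the paper; the code only mirrors the algebraic form of the direct and carryover sums.

```python
import numpy as np

def cqde(gamma):
    """Direct effect: sum_{t=1}^m gamma(t, tau)."""
    return float(np.sum(gamma))

def cqie(beta, Phi, Gamma):
    """Carryover effect for a scalar state:
    sum_{t=2}^m beta(t) * sum_{k=1}^{t-1} (prod_{l=k+1}^{t-1} Phi(l)) * Gamma(k).
    Arrays are 0-indexed, so code index t corresponds to paper index t+1."""
    m = len(beta)
    total = 0.0
    for t in range(1, m):                    # paper t = 2, ..., m
        for k in range(t):                   # paper k = 1, ..., t-1
            prod = 1.0
            for l in range(k + 1, t):        # paper l = k+1, ..., t-1
                prod *= Phi[l]
            total += beta[t] * prod * Gamma[k]
    return total

gamma = np.array([0.1, 0.2, 0.3])            # illustrative gamma(t, tau), m = 3
beta = np.array([0.0, 0.5, 0.5])
Phi = np.array([1.0, 0.8, 0.8])
Gamma = np.array([0.2, 0.2, 0.2])
print(cqde(gamma), cqie(beta, Phi, Gamma))   # CQTE = CQDE + CQIE
```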
The corresponding testing hypotheses are given in (16) and (17). Testing these hypotheses not only enables us to determine whether the new policy is significantly better than the old one, but also helps us understand how the new (or the old) policy outperforms the other.
To test (16) and (17), we use the two-step estimators in (9) and (10) to construct the plug-in estimators CQDE τ and CQIE τ for CQDE and CQIE, respectively. Next, we employ the bootstrap method in Section 3.3 to approximate the limiting distributions of CQDE τ and CQIE τ under the null hypotheses. We note that although CQDE τ has a tractable limiting distribution and is asymptotically normal, estimating its asymptotic variance without using bootstrap remains challenging.
Finally, we can similarly define the direct and indirect effects in the spatiotemporal design, and the corresponding estimation and inference procedures can be derived analogously.

Asymptotic Properties
In this section, we investigate the theoretical properties of the proposed test statistics for CQTE. Firstly, we provide an upper error bound as a function of h, m and n to measure the approximation error of the proposed bootstrap method in temporal dependent experiments.
Under the conditions specified in Theorem 1, the error bound delineated there tends to zero. When the null hypothesis holds, the bootstrap approximation error for $P(T^\tau \le z)$ is $o_p(1)$, where the little-$o_p$ term holds uniformly over z and τ. When the alternative hypothesis is true and the signal satisfies $m^{-1}\mathrm{CQTE}^\tau \gg n^{-1/2}\log(nm)$, the power of the proposed test approaches 1 (refer to the proof of Theorem 1 for more details). Consequently, the consistency of the proposed test is established.
There are two significant challenges in establishing Theorem 1: (i) the non-differentiability of the check loss function, and (ii) the allowance for m to diverge with n. To address the first challenge, we utilize classical M-estimation theory to obtain a Bahadur representation of the proposed estimator (see, for instance, Koenker and Portnoy, 1987). To tackle the second, we adapt arguments from the high-dimensional multiplier bootstrap theory (Chernozhukov et al., 2013) to derive a nonasymptotic error bound that explicitly characterizes the dependence of the bootstrap approximation error on m.
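To give intuition for the multiplier bootstrap argument, the following self-contained toy example approximates the distribution of a max-type statistic of a sample mean by reweighting centered observations with i.i.d. standard normal multipliers. It illustrates the general technique of Chernozhukov et al. (2013), not the paper's actual test statistic; all dimensions and seeds are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

n, m = 200, 10
X = rng.standard_normal((n, m))              # n i.i.d. m-dimensional observations
T_obs = np.sqrt(n) * np.max(np.abs(X.mean(axis=0)))

B = 1000
Xc = X - X.mean(axis=0)                      # centered observations
T_mult = np.empty(B)
for b in range(B):
    xi = rng.standard_normal(n)              # Gaussian multipliers xi_1, ..., xi_n
    T_mult[b] = np.max(np.abs(xi @ Xc)) / np.sqrt(n)

crit = np.quantile(T_mult, 0.95)             # multiplier-bootstrap critical value
print(T_obs <= crit)                         # under the null this holds roughly 95% of the time
```

The point of the technique is that `T_mult` is, conditionally on the data, a Gaussian max statistic whose distribution tracks that of `T_obs` even when m grows with n.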
Secondly, we establish the consistency of the proposed test procedure in spatiotemporal dependent experiments.
Based on Theorem 2, we can demonstrate that the proposed test effectively controls the type-I error and that its power tends to 1 when $m^{-1} r^{-1} \mathrm{CQTE}^\tau_{st} \gg n^{-1/2}\log(nmr)$. More details can be found in the proof of Theorem 2. Furthermore, additional theoretical properties of the proposed test statistics for CQDE and CQIE are provided in Section C of the supplementary material.

Real Data Analysis
To address (Q1)-(Q3), we apply the proposed test procedures to the three real datasets obtained from Didi Chuxing introduced in Section 2.
Firstly, we examine the dataset from a temporally dependent A/B experiment conducted from Dec 10, 2021 to Dec 23, 2021. As detailed in Section 2, two order dispatch policies are tested in alternating one-hour time intervals. The new policy, in comparison to the old one, is designed to fulfill more call orders and elevate drivers' total income. We set drivers' total income as the outcome, and the observation variables include the number of call orders and drivers' total online time. To address question (Q1), we apply model (3) to capture the correlation structure between supply and demand and model (4) to elucidate the temporal interference effects. For question (Q3), we utilize the testing procedure described in Section 3.3 for these temporally dependent experiments. To validate the proposed test, we also apply our procedure to the A/A dataset outlined in Section 2, where a single order dispatch strategy is employed. We anticipate that our test will not reject the null hypothesis when applied to this dataset.
In Figure 3, we display the estimated residuals of the outcome over time for τ ∈ {0.1, 0.5, 0.9}. As can be seen from Figure 3, some residuals are significantly larger than others, suggesting that the outcome likely originates from heavy-tailed distributions. This reinforces the use of quantile treatment effects for policy evaluation. Table 1 presents the p-values of the proposed test for CQTE τ , CQDE τ , and CQIE τ , respectively. Furthermore, Figure 4 illustrates the estimated treatment effects and the p-values across various quantiles.
As expected, the proposed test does not reject the null hypothesis at any quantile level when applied to the A/A experiment. However, when applied to the A/B experiment, the new policy demonstrates significant quantile direct effects on the business outcome at most quantile levels. In contrast, the indirect effects are not significant.
Secondly, we analyze the dataset from the spatiotemporal dependent experiment as described in Section 2. Recall that in this experiment, the city is divided into 12 regions.
Policies are implemented based on alternating 30-minute time intervals within each region.
We concentrate on a data subset collected from 7 am to midnight each day, as there are relatively few order requests from midnight to 7 am. The drivers' total income and the number of call orders are designated as the outcome and state variable, respectively. We fit the spatiotemporal VCDP models (14) and (15) to address (Q2), and apply the testing procedure from Section 4.3 to address (Q3) for this spatiotemporal dependent experiment. Our aim is to determine whether the new policy has significant treatment effects on drivers' total income across various quantile levels.

Simulation Studies

Next, we outline the simulation environment. For a given quantile level τ, we fit the proposed VCDP models (3) and (4) to the data by setting γ(t, τ) = Γ(t) = 0, since the two policies being compared are essentially the same. This enables us to obtain the estimated model parameters β 0τ (t), β τ (t), φ 0 (t), and Φ(t), and the estimated error processes e i,τ (t) and E i (t) for 1 ≤ t ≤ 24 and 1 ≤ i ≤ 68. To simulate data, we set γ(t, τ) = δQ τ (Y t ) and Γ(t) = δE(S t ) for some constant δ ≥ 0, where Q τ (Y t ) and E(S t ) denote the τ-th quantile of Y t and the (elementwise) mean of S t , respectively. We then employ the bootstrap method for data generation. Specifically, in each simulation run, we randomly sample n initial observations and n error processes with replacement. Then, we generate n days of data according to the proposed VCDP models, based on these samples and the estimated model parameters. The treatments A i,t are generated according to the temporal alternation design: we first implement one policy for TI time units, then switch to the other policy for another TI time units, and continue alternating between the two policies. We consider a wide range of simulation settings by setting τ ∈ {0.2, 0.5, 0.8}, n ∈ {20, 40}, TI ∈ {1, 3}, and δ ∈ {0, 0.001, 0.025, 0.05, 0.075, 0.1}. For each scenario, we generate 500 simulation runs to compute the empirical type-I error rate and power.
The significance level is fixed at 5% throughout the simulation.
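The temporal alternation design used to generate the treatments can be sketched as follows; the function name and 0/1 policy coding are illustrative.

```python
import numpy as np

def alternating_design(m, TI, start=1):
    """Return the treatments A_t for t = 1, ..., m under the
    alternating-time-interval design: one policy runs for TI time
    units, then the other for TI units, and so on."""
    blocks = (np.arange(m) // TI) % 2        # 0-blocks and 1-blocks of length TI
    return np.where(blocks == 0, start, 1 - start)

print(alternating_design(8, TI=1))   # -> [1 0 1 0 1 0 1 0]
print(alternating_design(8, TI=3))   # -> [1 1 1 0 0 0 1 1]
```

Setting TI = 1 alternates the two policies every interval, while TI = 3 holds each policy for three consecutive intervals before switching, matching the two settings considered in the simulations.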
Finally, we discuss the simulation results. Figure 7 presents the empirical rejection rates of the proposed test for CQTE (refer also to Table 3 in the supplementary material). The type-I error is approximately at the nominal level in all cases. The empirical power generally increases with the sample size and approaches 1 as the signal strength δ increases to 0.1.
Furthermore, the empirical power increases with the quantile level τ , which is expected since γ τ (t) are set to be proportional to Q τ (Y t ), whose values increase with the quantile level.
These results validate our theoretical assertions. We also report the empirical rejection rates of the proposed test for CQDE and CQIE in Figures 8 and 9 of the supplementary material, respectively. The results are very similar to those for CQTE. It is worth noting that the power for CQDE is generally larger than that for CQTE, whereas the power for CQIE is generally smaller. This is because the test statistics for CQIE have much larger variances than those for CQDE.

Discussion
To account for the temporal interference effect, we proposed two VCDP models for parameter estimation and inference. We developed a two-step method and a bootstrap-based testing procedure for the inference of CQTE. Further, we extended the proposed procedure to accommodate spatiotemporal data, decomposing the CQTE into the sum of CQDE and CQIE. We established consistency results for parameter estimation and the test procedure. Through the analysis of real datasets obtained from DiDi Chuxing, we demonstrated that our proposed method is a valuable statistical tool for assessing the dynamic QTE of new policies.

A Tables for simulation results
We report Tables 3 to 5 in this section. These tables contain empirical rejection rates of the proposed tests for CQTE, CQDE and CQIE in simulation studies, respectively, based on 500 simulation replications. Figures 8 and 9 depict the empirical rejection rates for CQDE and CQIE.
Second, we list the following assumptions to guarantee the theoretical results of the proposed test in temporal dependent experiments. Notice that we allow m to grow with n.
Assumption 1. The kernel function K(·) is a symmetric probability density function supported on the interval [−1, 1]. It is Lipschitz continuous and satisfies $\int_{-1}^{1} |t K'(t)|\, dt < \infty$.
Assumption 2. The probability density function f e (κ, s; τ ) is strictly positive and continuously varies as a function of κ. It is twice differentiable at any s, possessing a second-order derivative that is both continuous and uniformly bounded. Similarly, the joint probability density function f e (κ 1 , κ 2 , s 1 , s 2 ; τ 1 , τ 2 ) is strictly positive and continuously varies as a function of (κ 1 , κ 2 ). It is twice differentiable at any pair (s 1 , s 2 ), maintaining a second-order derivative that is continuous and uniformly bounded.
Assumption 3. The covariates $Z_i$ are drawn independently and identically from a sub-Gaussian process. Furthermore, for any integer t with 1 ≤ t ≤ m, the smallest eigenvalue of $E(Z_{i,t} Z_{i,t}^\top) \in \mathbb{M}^{p \times p}$ is bounded away from zero.
Assumption 4. All components of θ τ (ms) and Θ(ms) possess second-order derivatives with respect to s that are not only bounded but also continuous for each τ .
Assumption 5. For any τ in the interval [0, 1], there exist constants 0 < q < 1,

Assumption 1, which is often found in the literature on varying coefficient models and kernel smoothing, pertains to regularity conditions in kernel methods (see, e.g., Zhu et al., 2012, 2014; Cao et al., 2015). Finally, we present technical assumptions that establish the consistency of the proposed test in a spatiotemporal dependent experiment. We allow both m and r to grow with n.
Assumption 7. The covariates $Z_{i,t,\iota}$ are independently and identically distributed from a sub-Gaussian process. Furthermore, for 1 ≤ t ≤ m and 1 ≤ ι ≤ r, the minimum eigenvalue of $E(Z_{i,t,\iota} Z_{i,t,\iota}^\top) \in \mathbb{M}^{p \times p}$ is bounded away from zero.
Assumption 8. All components of θ τ (ms, rl) and Θ(ms, rl) have bounded and continuous second-order derivatives with respect to (s, l) for each τ .
(iv) If, in addition, Assumption 5 of the supplementary material holds, $h = o(n^{-1/4})$, and $m \asymp n^{c_2}$ for some 1/2 < c_2 < 3/2, then with probability approaching 1 there exist a constant ε ∈ (0, 1) and a positive constant C such that the corresponding bound holds. Part (i) of Theorem 3 is established using standard arguments in the quantile regression literature (see, e.g., Koenker and Hallock, 2001) and provides the limiting normal distribution of the raw estimators. Part (ii) further shows that the smoothing step introduces only a negligible bias and reduces the asymptotic covariance by a factor of 1/m. These two results imply that the proposed test statistic for CQDE, $\sum_{t=1}^{m} \tilde\gamma_\tau(t)$, is asymptotically normal. However, as noted earlier, consistently estimating its asymptotic variance remains a challenge due to its complex form. Part (iii) of Theorem 3 validates the proposed bootstrap procedure for CQDE. Instead of using the empirical quantile of the bootstrapped statistics to determine the critical value, our test procedure relies on normal approximation and uses the bootstrapped statistics to estimate the variance. Part (iv) validates the bootstrap method for CQIE, which follows directly from the bootstrap consistency for CQTE in Theorem 1.

D Proofs of the theorems
For any two positive sequences $\{a_n\}_n$ and $\{b_n\}_n$, the notation $a_n \lesssim b_n$ means that there exists some constant C > 0 such that $a_n \le C b_n$ for all n. We use C to denote a generic constant whose value may change from place to place.
Proof of Proposition 1. Under the monotonicity assumption, it is immediately evident that $\sum_{t=1}^{m} \phi_t(E_t, \bar{a}_t, \tau)$ is a strictly increasing function of τ ∈ (0, 1) when $\bar{a}_t = \bar{1}_t$ or $\bar{0}_t$. The independence between U and $E_t$ then yields the stated identity when $\bar{a}_t = \bar{1}_t$ or $\bar{0}_t$. By definition, $E_1 \subseteq E_2 \subseteq \cdots \subseteq E_m$, which leads to the desired conclusion. This completes the proof.
Proof of Proposition 2. Consider the following varying coefficient models, in which both U and E(t) are independent of the treatment history. Through simple calculations, we obtain the stated expressions. Combining these with Proposition 1 establishes Proposition 2.
Proof of Proposition 3. The proof of Proposition 3 is similar to that of Proposition 2, and we omit it to save space. In the following, we introduce the two varying coefficient models for the potential outcomes, in which the rank variable U and the errors {E(t, ι)} are independent of the treatment history.
Proof of Theorem 1. First, we provide a sketch of the proof, which is divided into three parts. In the initial step, we acquire a uniform Bahadur representation of the first-stage estimator, which is detailed in Lemma 1. In the second step, we dissect the difference between the distributions of the proposed statistic and the bootstrap statistic. These differences are subsequently bounded by employing the technique of comparison of distributions, as elaborated in Chernozhukov et al. (2013). Finally, we investigate the power properties of the proposed test.
We next detail each of these steps. Recall that $\tilde\theta_\tau(t) = \sum_{t'=1}^{m} \omega_{t',h}(t)\, \hat\theta_\tau(t')$. According to the uniform Bahadur representation in Lemma 1, we obtain a representation for $\tilde\theta_\tau(t)$, and similarly for $\tilde\Theta(t)$. For simplicity, let vec(·) be the operator that reshapes a matrix into a vector by stacking its columns on top of one another. We also define $\tilde{e}_{i,\tau}(j)$ and $\tilde{E}_i(j)$ to be the empirical Gaussian analogs, where $\xi_1, \ldots, \xi_n$ are i.i.d. standard normal random variables. Let $X(\tau) = (X_2(\tau), X_3(\tau), \ldots, X_m(\tau)) = n^{-1/2} \sum_{i=1}^{n} x_i(\tau)$. We remark that the term X(τ) represents the difference between the estimators and the true smoothed coefficients in terms of the error processes, the term W(τ) represents the difference between the bootstrap estimators and the obtained estimators, and $\widetilde{W}(\tau)$ is the Gaussian analog of X(τ).
We then define the functions $\kappa_\tau$ and $\kappa^b_\tau$. To verify the proposed bootstrap procedure, we aim to provide an upper bound on the error of approximating the distribution of one by the other. Notice that $\kappa_\tau$ and $\kappa^b_\tau$ can be represented in terms of X(τ) and W(τ), respectively.
Finally, we show that the proposed test has good power properties. Under the alternative that $\mathrm{CQTE}^\tau/m \gg n^{-1/2}\log(nm)$, and according to (27), we obtain the stated bound, where $z_\alpha$ is the critical value. If we can show that $E(\kappa^*_{01,\tau})^2 = O(n^{-1}\log^2(nm))$, then it follows that $P(T^\tau/m \ge z_\alpha)$ approaches one.

Hence, we have
where the last inequality follows from the fact that $P(1 - G) \lesssim \exp(-n + 2n^{1/2}\log m - (\log m)^2)$, according to the Borell–TIS inequality and Lemma 2.2.10 in Van der Vaart and Wellner (1996). This completes the proof.
Proof of Theorem 2. The proof of Theorem 2 is very similar to that of Theorem 1. We omit it here to save space.
Proof of Theorem 3. The proof involves four parts. In part (i), we present the proof of the asymptotic normality of the first-stage estimator θ τ . In part (ii), we give the proof of the asymptotic normality of the second-stage estimator θ τ . In parts (iii) and (iv), we give the proofs of the bootstrap consistency of the testing procedures with respect to CQDE τ and CQIE τ , respectively.
Therefore, we can prove the asymptotic normality of θ τ by using some standard arguments in the quantile regression literature (Koenker, 2005).
It is easy to see that $\tilde{H}_2$ is the leading term of $H_1$, since $\tilde{H}_1$ is of lower order. The smoothing step thus reduces the variance by a factor of 1/m.
With the preparations above, it is straightforward to calculate that, for any vector $a_{n,2}$ with unit norm, $\mathrm{Var}[\sqrt{n}\, a_{n,2}^\top(\tilde\theta_\tau - \theta_\tau)] = a_{n,2}^\top V_{\theta,\tau}\, a_{n,2}$.
Following equation (2) in the supplementary material of Feng et al. (2011), we obtain $S_{nt}(\theta) = E^* S_{it}(\theta) + o_{p^*}(n^{-1/2})$. We now calculate the explicit form of $E^* S_{it}(\theta)$, where the last equality follows from a Taylor expansion of the bootstrap residuals $e^b_{i,\tau}(t)$. Direct calculations lead to $n^{-1}\sum_{i=1}^{n} Z_{i,t}\, \psi_\tau(\tilde{e}^b_{i,\tau}(t)) = D_t(\hat\theta - \theta) + o_{p^*}(1)$.
(iv) The proof of the bootstrap consistency for CQIE follows directly from the proof of Theorem 1, and we omit it here. This completes the proof.
We omit the proof to save space.
Under the sub-Gaussian assumption, we have $E(\max_i \max_t \|Z_{i,t}\|) \lesssim \log(nm)$, and there exist constants c, B > 0 such that the stated moment bounds on $\|Z_{i,t}\|^3$ hold.
This completes the proof.