Modelling of spatio-temporal zero truncated patterns in infectious disease surveillance data

This paper is motivated by spatio-temporal pattern in the occurrence of Leishmaniasis in Afghanistan and the relatively high number of zero counts. We hold the view that correlations that arise from spatial and temporal sources are inherently distinct. Our method decouples these two sources of correlations, there are at least two advantages in taking this approach. First, it circumvents the need to inverting a large correlation matrix, which is a commonly encountered problem in spatio-temporal analyses. Second, it simplifies the modelling of complex relationships such as anisotropy, which would have been extremely difficult or impossible if spatio-temporal correlations were simultaneously considered. We identify three challenges in the modelling of a spatio-temporal process: (1) accommodation of covariances that arise from spatial and temporal sources; (2) choosing the correct covariance structure and (3) extending to situations where a covariance is not the natural measure of association. Moreover, because the data covers a period that overlaps with the US invasion of Afghanistan, the high number of zero counts may be the result of no disease incidence or lapse of data collection. To resolve this issue, a model truncated at zero built on a foundation of the generalized estimating equations was proposed.


Introduction
One of the most challenging issues in modelling spatio-temporal infectious disease data is the choice of a valid and yet flexible correlation (covariance) structure.Some examples of correlation structures can be found in Cressie and Huang (1999); Gneiting (2002); Stein (2005) and Porcu et al. (2007), among others.The correlation structures fall into one of two types: separable in which case it is assumed that the space-time correlation can be written as a product of a correlation for the space dimension and one for the time dimension or nonseparable where the space-time correlation is modelled as a single entity.Unfortunately, most of these correlation structures are either extremely complicated or infeasible to manipulate due to their high dimensions.
The motivating data is a surveillance nationally aggregated monthly counts of leishmaniasis incidence across provinces of Afghanistan between 2003 and 2009.Leishmaniasis is the third most common vector-borne disease and a very important protozoan infection Adegboye et al. (2016).The burden of the disease is overwhelming and the psychological effect can be disturbing.The impact of environmental influences on Leishmaniasis cannot be ruled out and human activities play a significant role in the dispersion of the vectors thereby changing the geographical distribution of the disease.The major goal of the study is to evaluate the influence of Satellite-derived climatic and environmental variable on the occurrence of leishmaniasis while allowing for spatio-temporal correlation in the data.Furthermore, in most previous works, the space-time correlation is considered jointly, a step that we believe is unnecessary or unrealistic.For example, it would be hard to imagine how the disease incidence in Kabul in 2003 would be in any way correlated to that in Hilmand province in 2007, because of their location and time apart.Previous study has shown significant seasonality in the occurrence of the diseaseJanuary to March and September to Decemberwith the highest peak in March, suggesting a peak in the cases of leishmaniasis in March and a through in September of each year Adegboye and Adegboye (2016).
We hold the view that correlations arising from spatial and temporal sources are inherently distinct.In this study we shall decouples these two sources of correlations, an approach that separates the modelling of the space-and time-correlations.There are at least two advantages in taking this approach.First, it circumvents the need to inverting a large correlation matrix, which is a commonly encountered problem in spatio-temporal analyses (e.g., Yasui and Lele, 1997).Second, it simplifies the modelling of complex relationships such as anisotropy, which would have been extremely difficult or impossible if spatio-temporal correlations were simultaneously considered.
Our method begins with a marginal model for Leishmaniasis incidence along the time dimension.Marginal models are natural extensions of the generalized linear models and they are popular in longitudinal analysis.They are well understood and they can be easily fitted using simple modifications to existing programs.One of the most popular marginal models is the generalized estimating equations (GEE, Liang and Zeger, 1986).The standard GEE assumes longitudinal measurements within each observation are correlated but observations are independent of each other.But in our situation, the observations are disease counts and covariates for the different spatial locations and there may be spatial dependencies.
To account for spatial dependency, we create a spatial-GEE by re-weighting the standard GEE so that locations highly correlated with each other would receive less weight.The weights are created from a semivariogram of the spatial data.Because the dimension of a semivariogram is only dependent on the number of spatial locations, it is of manageable size.Furthermore, since a semivariogram measures dissimilarity, there is no need to invert the semivariogram to create weights.The method will be illustrated using data from Leishmaniasis incident in the provinces of Afghanistan in 2009.This model makes it possible to combine the specific provincial rate with the influence of the spatial neighborhood.
The data sets is also characterized by a high percentage of zero disease counts.The data covers a period that overlaps with the US invasion of Afghanistan, the zero counts may be the result of no disease incidence or lapse of data collection.It is a common practice in large survey to use zero (0) as missing value.Faraway (2004) argued that although this is not a good choice since it is a valid value for some of the variables and not mentioning it in their data description, unfortunately this act is common particularly with data sets of any size or complexity .
Apart from the spatial dependency in the data, this study also presents additional challenge.It is very difficult to distinguish between "true" and "imputed" zeros, because of the reporting mechanism of disease in Afghanistan (due to security, technical and logistics issues).These problems prompted us to consider the option of discarding the zeros and model the non-zero data using a Poisson model conditional on greater zero.We make the assumption that "imputed" zeros are a random event.It is often practiced to truncate the values that are bigger than a constant to overcome over-dispersion (Saffari et al., 2011).The analysis of truncated data often arises from a subsidiary set of results that treat a practical problem of how data are gathered and analyzed (Greene, 2005) and incompleteness of this data requires special estimators of the regression coefficients (Karlsson and Lindmark, 2014).To resolve this issue, we use a model truncated at zero. Lee and Kim (1998) provides detail review and a comparison of properties of estimators for regression models under truncated data.
The rest of the paper is structured as follows.In Section 2, we describe the data collection method and the variables that will be used in the study.Section 3 describes the method.Section 4, we give the results of the data analysis, while Section 5 concludes the paper.

Model specification
Let y = {y(s, t), s = s 1 , ..., s S , t = t 1 , ..., t T } where y(s, t) ≡ y st denotes the count of disease at spatial location s and time t.Suppose associated with (s, t) are covariates x(s, t) ≡ x st that record the spatial location and time as well as other information that might affect disease counts.Furthermore, let X = {x(s, t), s = s 1 , ..., s S , t = t 1 , ..., t T } and the disease incidence at each location is model as a Poisson count.
To model spatio-temporal correlation (or equivalently covariance) and overdispersion, we assume there is a non-negative weakly stationary latent process e st such that conditional on the e's, the y's are independent and are assumed to follow a log-linear model given by E(y st |e st ) = exp(x τ st β)e st , and var(y where β is a vector of unknown parameters that captures the association between incidence and the covariates.We assume E(e st ) = 1 so exp(x st β) represents the marginal mean of y st .
The latent process e st is assumed to have a variance of σ 2 and the covariance between e st and e s t is given by cov(e st , e s t ) = σ 2 ρ(z st , z s t , α) where z st , z s t are covariates from (s, t), (s , t ) that jointly induce spatio-temporal correlation and α are unknown parameters.Depending on the context, the covariates z st and x st may be distinct, may share some components or may be the same.This model was considered by Zeger (1988) to model discrete time series data.Under these assumptions, it can easily be shown that If ρ(z st , z s t , α) = 0, then we have a Poisson model with overdispersion.Furthermore, if σ 2 = 0, then we have a standard Poisson model at each spatial location and time.For convenience, we define y s• = (y s1 , ..., y sT ) τ as the vector of counts taken at times 1, ..., T at spatial location s, y •t = (y 1t , ..., y St ) τ as the vector of counts taken at locations 1, ..., S at time t and we use similar definitions for We begin by considering the data y = (y τ 1• , ..., y τ S• ) τ and X = (x τ 1• , ..., x τ S• ) τ as a set of longitudinal data over S spatial locations.If we temporarily treat the observations between spatial locations to be independent of each other, then the GEE (Liang and Zeger, 1986) is given by: where D s = ∂µ s• /∂β τ and V s is the covariance matrix of y s• .The matrix V s can be expressed as and R s (α) is a correlation matrix, with the (t, t )-th element representing the correlation between times t and t at location s.If we let v −1 s,tt as the (t, t ) element of V −1 s , then (4) can be written as where w ss = 1, s = s and w ss = 0, s = s .We can compare (5) to a set of estimating equations that considers space-time correlation simultaneously.Let the data y be stacked as a S × T vector.Then a set of estimating equations can be written as where µ = (µ τ 1• , ...µ τ s• ) τ and D = ∂µ/∂β τ and Ṽ is the covariance matrix of y and we let ṽ−1 st,s t be the (st, st )-th element of Ṽ−1 .We can interpret (6) as a linear combination of ∂µ s t /∂β τ {y st − µ st } with coefficients given by ṽ−1 st,s t .Comparing (6) to (5), we observe that a standard GEE is also a linear combination but it replaces ṽ−1 st,s t with w ss v −1 s,tt .

Parameter estimation
Consider the following; suppose we remove all y st = 0, then conditioned on y st > 0, (1)-( 2) become Let d = {d(s, t) = d st } S×T be a matrix of indicators such that d st = 1 if y st > 0 and d st = 0 otherwise.Note that y st = 0 could mean the count was zero or count was not taken.To resolve the missing counts, we assumed counts were missing completely at random.The problem of counts not missing completely at random can be handled by adding an extra model for the propensity of d st = 1.However, we wanted to illustrate the idea of a zero truncated spatio-temporal GEE and so we chose to minimise any distraction to this main idea.For a particular set of variance covariance matrix v s,tt and spatial weight wss , the spatio-temporal GEE conditioned only on those observations with y st > 0 can be written as where v s,tt is the t, t-th element of V s , the covariance matrix of y s• .The matrix V s can be expressed as and R s (α) is a matrix with its (t, t )-th element representing the correlation between times t and t at location s.
Our primary interest lies in the parameters β but we also must deal with the nuisance parameters α.Let R(α) be a 84 × 84 matrix where α contains the parameters (θ) estimated via weighted least square method as described in Section 3.2.The parameters are estimated via a Newton-Raphson iteration method.To solve for (α, β) jointly, we employed the method of (Prentice, 1988).Let βk and αk be the estimates of β and α at the k-th iteration.We first fitted a GEE with an independence working correlation structure, we then solve the estimating equation for α, and we then iterate until convergence.This step gives the values v s,tt .Denoting s,s ,t,t ≡ s S s=s 1 s S s =s 1 t T t=t 1 t T t =t 1 , we estimate an initial estimate β0 using (5) by assuming an identity matrix for R s (α), equivariance, i.e., v −1 s,tt = 1 and, spatial weight.Then at iteration k, Here we take the slope of the linear regression of log(r k st rk st ) on log(|t − t |) as αk .We then iterate between ( 9) and (10) until convergence.
The standard errors for the parameter estimates can be obtained via the large-sample properties (Liang and Zeger, 1986;Leung et al., 2009;Paul et al., 2013) However, in our experience, the sandwich formula does not work very well.Instead, we use resampling by blocked jackknife (see, e.g., Künsch, 1989 andSherman, 2011, chapter 10).Let β−k be an estimate of β with data from the k-th province removed and let β = 1/K K k=1 β−k .We then define the resampled estimate of var( β) by 3 Application to Afghanistan leishmaniasis incidence data

Data description
The  2 presents the distribution of the disease incidence across the 20 provinces with available data sets.A striking feature of the data is the high number of zero incidence for many locations.Many of the provinces have counts of zeros for months, then a sudden jump to a few hundreds or thousands, then back to zero.Between 2003 and 2006, most of the provinces reported no cases of Leishmaniasis; this claim cannot be verified because this period coincides with the US led war in Afghanistan and disease reporting may only be possible in a relatively safe environment.

Model fitting
The modelling is a 2-step process, we need to find the spatial weight, wss , then the variancecovariance matrix, v s,tt that induce will the temporal dependency.Recall the dimension of Ṽ is ST × ST .For the current data set, S = 20 represents the number of provinces and T = 7 represents the number of years with recorded data.If we use the monthly data, which would allow us to study how seasonality affects the transmission (incidence) of the disease, then T = 84 and so S × T = 20 × 84 = 1680 and therefore Ṽ would be a matrix that cannot feasibly be handled.Furthermore, as we argued in Section 1, the correlation between y(s, t) and y(s , t ) often does not have any practical meaning so (6) does not seem to be a route that we should follow.
Firstly, our idea is to consider the spatial correlations that are omitted by the standard GEE.Therefore, we propose using a set of weight w ss different from those in (5).The weights are derived from a commonly used measure of spatial correlation, the semivariogram (see, e.g., Cressie, 1993).For the data in this paper, we define an empirical semivariogram Figure 1: Map showing the distribution of new cases of Leishmaniasis in Afghanistan in 2009 q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q q Badakhshan Baghlan Balkh  (2003)(2004)(2005)(2006)(2007)(2008)(2009) as where h is a lag distance between spatial locations s and s and N (h) is the number of pairs of spatial locations separated by no more than h.The empirical variogarms is computed at different time scale and then averaged over the same spatial lags.Unlike covariance or correlation, which measure similarity, semivariogram measures dissimilarity.Hence, if we use semivariogram to create weights, we do not have to carry out any matrix inversion.This empirical semivariogram ( 13) can be used to fit a parametric semivariogram model, e.g., nugget, exponential or Gaussian, for illustration, we have chosen powered exponential; where 0 < q ≤ 2, φ > 0 and α = (τ 2 , σ 2 , φ).The quantities τ 2 , σ 2 and φ represent the nugget, sill, and range, respectively.This semivariogram includes as special cases the exponential (q = 1) and Gaussian (q = 2).The corresponding correlation function has the form An exponential model was used to obtain the parameters that were used in the construction of the spatial weights matrix.Secondly, we also need to construct the variance matrix, V s to induce the temporal dependency.Recall that for fixed a s, v s,tt , t, t = t 1 , ..., t T are the elements of the variance covariance matrix of disease counts between times.The data set consists of monthly disease counts for each province that were captured over 7 years, from 2003-2009, with up to 84 observations per province.However, as mentioned before, year is an artificial variable that is not of interest.On the contrary, there might be two different types of temporal correlations: (1) Between months that are nearby and (2) Between the same month in different years (seasonality).A simple temporal correlation function may be of the form α = {α t,t , t, t = 1, ..., 12}, such that α t,t = α |t−t | , 0 < α < 1 and t = 1, ..., 12 be indicator for months of the year, and |t − t | is the time lag.But this seems too simple, for example, supposed January are not likely to be correlated.In order to account for the presence of seasonality in the data i.e. temporal correlation between the same month in different years (Afghanistan is characterized by four seasons namely; winter, spring, summer and autumn), we used dummy (binary) variable to capture and quantify the seasonal effects.
Similarly, a semivariogram was used to parameterize the correlation matrix R(α), with its (t, t )-th element representing the correlation between times t and t at location s.We compute the empirical variograms at different spatial locations and then average the variograms over temporal lags.
For two different times, say t, t , that are t = |t − t | months apart, the correlation between the two times, t, t could be written as: where C 0 T (t) represent an temporal exponential function with parameters α = τ 2 , σ 2 , φ.We defined a 84 × 84 matrix R(α) where α contains the parameters (τ 2 , σ 2 , φ) that can be estimated by using a graphical display of γ(h) at h = h 1 , ..., h K .Another method is to use weighted least squares, i.e., minimise with respect to α for some weights w k 's.Following Cressie (1985), we use

Results
We applied the spatio-temporal zero truncated model to study the effects of environmental variables on Leishmaniasis, while allowing for different dependencies in the data.Population size was added as offset, the environmental variables used as covariates were average monthly temperature (Celsius), average monthly accumulated rainfall (Inches), average monthly wind speed (Knots) and altitude (Metres).In an attempt to investigate the effects of spatial and temporal interaction, the model incorporated both the temporal variance covariance matrix v s,tt and spatial weight wss .
The results from the spatio-temporal truncated model is given in Table 1.The environmental parameters are significant risk factors for leishmaniasis in Afghanistan.There appears to be a negative effect of altitude, temperature, wind and accumulated rainfall as predictors for leishmaniasis incidence.

Conclusion
The technique used is this article allow for correct specification of correlation structures to improve the efficiency of the GEE method.The leishmaniasis data presented several problems with modelling issues, ranging from correlation/covaraince specification to issues with "imputed" or "non true" zeros.The high percentage of zero disease counts may be the result of no disease incidence or lapse of data collection.Moreover, the dependency in the data may be a result of spatial variation, temporal or both.To resolve this issue, a renowned method was used to address the many issues that the data presented in a very novel way.A model truncated at zero was fitted while allowing for dependency in the data via a working correlation matrix using the technique of GEE.
The results from this study are similar to that of (Adegboye and Kotze, 2012;Rajesh and Sanjay, 2013;Thompson et al., 2002;Valderrama-Ardila et al., 2010;Karagiannis-Voules et al., 2013).The model confirms the significant influence of environmental factors on the incidence of Leishmaniasis.The model indicates that high temperatures are associated with a lower incidence of Leishmaniasis; this is similar to the findings of (Rajesh and Sanjay, 2013).The survivability of the sand fly (Leishmaniasis vector) has been reported to reduce during high temperatures (Rajesh and Sanjay, 2013).A negative association between accumulated rainfall and incidence of Leishmaniasis has been found; this is not surprising as extreme rainfall may have a negative effect on the vector such as flooding (Thompson et al., 2002).The negative effect of temperature and rainfall is also in line with what was observed in the exploratory analysis.Two peaks were observed in the disease occurrence between 2003 and 2009 -January to March and September to December -which coincide with the cold period, while July is the hottest month and March is the wettest month.The results also indicate that low altitudes are associated with an increase in the cases of Leishmaniasis, whereas an increase in the wind speed has a negative effect on the disease.
2003 and May 2003 are correlated because they are only 4 months apart, whereas January 2003 and February 2004 are correlated because they are from the same season (Winter in Afghanistan) but different years, on the other hand, January 2003 and May 2003 probably above methodology was applied to the analysis of leishmaniasis incidence data in Afghanistan between 2003 and 2009.The data sets were monthly cases of leishmaniasis reported to the Afghanistan Health Management Information System (HMIS) under the National Malaria and Leishmaniasis Control Programme (NMLCP) of the Ministry of Public Health (MoPH).Leishmaniasis infections were confirmed clinically or calibrated ocular micrometer supported binocular light microscopy of Leishmania parasites.The data consists of 148,945 new cases of Leishmaniasis from 20 provinces in Afghanistan between 2003 and 2009 (of these, 41,072 occurred in 2009)(Figure 1).Satellite-derived environmental and climatic data such as accumulated rainfall, land surface temperature and Wind were obtained from the National Aeronautics and Space Administration-NASA Earth Observations (NEO) [http://earthobservatory.nasa.gov Figure

Table 1 :
Parameter estimates (and standard errors) of the zero-truncated model for Leishmaniasis data