Modeling Categorical Random Fields via Linear Bayesian Updating

Abstract: Categorical variables are common in spatial data analysis. Traditional analytical methods for deriving probabilities of class occurrence, such as kriging-family algorithms, have been hindered by the discrete characteristics of categorical fields. This study introduces the theoretical background of the linear Bayesian updating (LBU) approach for spatial classification within an expert-system framework. Transition probabilities are interpreted as expert opinions for updating the prior marginal probabilities of categorical response variables. The main objective of this paper is to present the solid theoretical foundations of LBU and to provide a categorical random field prediction method that yields higher classification accuracy than the conventional Markov chain random field (MCRF) approach. A real-world case study has also been carried out to demonstrate the superiority of our method. Since the LBU idea originates from aggregating expert opinions and is not restricted by the conditional independence assumption (CIA), it may prove to be reasonably adequate for analyzing complex geospatial data sets, such as remote sensing images or area-class maps.


Introduction
Categorical spatial data, such as lithofacies, land-use/land-cover (LULC) classes, and mineralization phases, are widely investigated geographical and geological data sources. They are typically represented by mutually exclusive and collectively exhaustive (MECE) classes and visualized as area-class maps [1]. Traditionally, indicator variograms or indicator covariances are commonly used in geostatistics to measure spatial continuity in categorical fields [2]. Different from (but related to) those two indicator approaches, transition probabilities [3], or equivalently, transiograms [4], are naturally designed to quantify the spatial variability of categorical variables. The advantages of transition probability-based methods over indicator covariances or variograms have been well discussed [3]. The critical and most challenging step here is to model the data redundancy between these elementary probabilities, or the interactions among multiple variables. Several methods have been developed in consideration of data interaction, e.g., the Tau (τ) model [5,6], the Nu (ν) expression [7], the generalized linear mixed model (GLMM) [8], etc. The paradigm of latent random fields and spatial covariance functions used in GLMM was also discussed in [9]; the results show that the class occurrence probability at a target location can be formulated as a multinomial logistic function of spatial variable covariances between the target and sampled locations. Interaction terms were included in logistic regression models in [10] to account for data dependency in terms of Markov random fields [11,12]. In the geo-information context, Rao's quadratic diversity was used in [13] to measure scale-dependent landscape structure. In the geological counterpart, a spatial hidden Markov chain model was employed in [14] for the estimation of petroleum reservoir categorical variables. Most often, the step of considering data interaction is avoided, or at least simplified, by assuming some form of conditional independence assumption (CIA), as in the Markov chain random field (MCRF) method introduced by [15]. The recently proposed MCRF co-located cosimulation algorithm [16] enriches the earlier MCRF theory by integrating auxiliary data.
Generally speaking, the spatial classification problem can be regarded as combining two-point transition probabilities into a multi-point conditional probability. A formal review of most of the available techniques for aggregating probability distributions in geoscience can be found in [17]. Our focus is to use a probability aggregation method for spatial classification, based on the pioneering work of [18] and [19]. This paper is organized as follows. We start by introducing the basic forms of the linear Bayesian updating (LBU) method in Section 2. In Section 3 we then give detailed proofs of some propositions that were introduced by [18] and [19] but have not yet been proven. Finally, to further investigate the performance of the proposed method in real cases, the lithology types in the well-known Jura data set [20] are used for a case study (Section 4). The contribution is completed by conclusions and acknowledgments in Section 5.

Linear Bayesian updating model
We use A and D_i to represent events in the sample spaces of the categorical random variables at locations x_0 and x_i, respectively, and Ā denotes the complementary event of A. We wish to assess the probability of an event A conditional on the occurrence of a set of data events D_i, i = 1, …, n. For categorical events, the formal probabilistic set-up is the following. We need to consider a sample space Ω such that all events A and D_i are subsets of Ω. In the case of categorical data, let {A_1, …, A_k} be the finite set of events in Ω such that the events A_j are MECE, that is, they form a finite partition of Ω. Without loss of generality, we will use A as a short notation for A_j. The event A can, for example, be a lithofacies category at a specified location x_0, while the data D_i can represent information provided by core samples at surrounding wells x_1, …, x_n. This means that we wish to approximate the multi-point posterior probability

p(A | D_1, …, D_n),  (1)

on the basis of the simultaneous knowledge of all data events. This multi-point posterior probability can be obtained by combining the prior probability p(A) with the two-point transition probabilities p(A | D_i). The prior probability is independent of any other probabilities; it can be thought of as arising from an abstract and never-specified information source. In geoscience, such a prior probability could be, for example, the proportion of a kind of lithofacies varying in space [17]. The n neighboring events D_1, …, D_n are regarded as experts, and thus each conditional probability p(A | D_i) can be considered as expert opinion Q_i for the occurrence of A, the event of interest.
We then examine the situation in which a decision maker (DM) uses the opinions of n experts Q_1, …, Q_n to revise his/her own prior probability p = P(A) for the occurrence of A, and it is desired to use this information to form the DM's posterior probability for A. This process is called Bayesian updating. The experts' opinions are treated as random variables Q = (Q_1, …, Q_n) whose values are to be revealed to the DM. The posterior probability of A given Q = q is then p* = P(A | Q = q). The LBU method proposed by [18] is of the form

p* = p + ∑_{i=1}^{n} λ_i (q_i − μ_i),  (2)

with possibly negative weights λ_i expressing the amounts of correlation between each Q_i and A; μ_i denotes the mathematical expectation of Q_i. When μ_i = p for all i, (2) reduces to the linear opinion pool [21]

p* = (1 − ∑_{i=1}^{n} λ_i) p + ∑_{i=1}^{n} λ_i q_i,  (3)

where the DM is considered as one of the experts, thereby lending support to a supposition of [22] to the effect that (3) sometimes corresponds to an application of Bayes' theorem. To obtain a legitimate probability, the λ_i must obey a number of inequalities [18]. If, for example, the DM feels that all λ_i are positive, the most common case, then they must be chosen so that

∑_{i=1}^{n} λ_i μ_i ≤ p  and  ∑_{i=1}^{n} λ_i (1 − μ_i) ≤ 1 − p,  (4)

which can be considered a sufficient but not necessary condition for the LBU model. The proof of (4) is given in Section 3.1.
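As a concrete illustration, the updating rule (2) and its reduction to the linear opinion pool (3) can be sketched in a few lines of Python. The prior, opinions, and weights below are made-up numbers for illustration only, not values from the case study.

```python
import numpy as np

def lbu_posterior(p, q, mu, lam):
    """Linear Bayesian updating (2): combine a prior p with expert
    opinions q, their expectations mu, and weights lam."""
    q, mu, lam = (np.asarray(a, dtype=float) for a in (q, mu, lam))
    return float(p) + np.sum(lam * (q - mu))

# Hypothetical numbers for illustration:
p = 0.25                            # prior probability of facies A
q = np.array([0.6, 0.4, 0.3])       # opinions p(A | D_i) of three "experts"
mu = np.array([0.25, 0.25, 0.25])   # E[Q_i]; here chosen equal to the prior
lam = np.array([0.5, 0.3, 0.1])     # positive weights

p_star = lbu_posterior(p, q, mu, lam)

# With mu_i = p for all i, (2) reduces to the linear opinion pool (3),
# where the DM enters as an extra expert with weight 1 - sum(lam):
pool = (1 - lam.sum()) * p + np.sum(lam * q)
assert abs(p_star - pool) < 1e-12
```

The equivalence checked at the end is exactly the algebra behind (3): substituting μ_i = p into (2) and collecting the terms in p.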
The LBU method presented above has some parameters that need either to be estimated or specified by the DM. In the context of aggregating expert opinion, [23] introduced four ways of assessing the weights for the linear pool: the first suggestion is to set equal weights when there is no element that allows preferring one source of information to another, or when symmetry of information justifies it; the second and third propositions set weights proportional to a ranking based on experts' advice and self-rating, respectively; the last suggestion is to set weights based on some comparison of previously assessed distributions with actual outcomes. An algorithm in [24] is based on the minimization of a Kullback-Leibler distance for selecting weighting factors in logarithmic opinion pools; the optimal weights are found by solving a quadratic programming problem. An approach based on ordinary kriging, a well-known framework for accounting for data interaction in a spatial context, can also be found [25]. In their discussion, the information interaction between D_i and D_{i−1} is encapsulated by the exponent weight factor τ_i, but the connection between the concept of distance between sources of information and that of a variogram of probabilities is not obvious [17]. Our derivation provides an interpretation of the coefficients λ_i in LBU (2) as the linear regression coefficients of p* − p on Q − μ, so long as the experts are not linearly dependent. The proof of this proposition can be found in Section 3.2. As in multiple regression, each λ_i can thus be thought of as a measure of the additional information that the i-th expert provides over and above the other experts and what the DM already knows.

Parameter ranges for linear Bayesian updating
Since p* in (2) must be a valid probability for every admissible set of opinions q_i ∈ [0, 1], that is to say,

0 ≤ p + ∑_{i=1}^{n} λ_i (q_i − μ_i) ≤ 1.  (5)

Through algebraic transformation, (5) can be simplified as

−p ≤ ∑_{i=1}^{n} λ_i (q_i − μ_i) ≤ 1 − p.  (6)

Suppose that all λ_i are positive. Since 0 ≤ q_i ≤ 1, the middle term of (6) attains its minimum when every q_i = 0 and its maximum when every q_i = 1, so (6) holds for all q as long as

∑_{i=1}^{n} λ_i μ_i ≤ p  and  ∑_{i=1}^{n} λ_i (1 − μ_i) ≤ 1 − p  (7)

can be satisfied. Therefore, (7) can be considered a sufficient but not necessary condition for the LBU model (2): the extreme opinions q_i = 0 or q_i = 1 need not occur in practice, so (2) may remain valid even when (7) fails. Only when (7), or equivalently (4), is satisfied can (2) be guaranteed to be a valid probability model for arbitrary expert opinions.
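The sufficiency argument can be checked numerically: p* is linear in q, so over the cube [0, 1]^n it is extremal at the corners q ∈ {0, 1}^n, and verifying (6) at the corners verifies it everywhere. The sketch below (with made-up priors, expectations, and weights) contrasts the closed-form condition (7) with a brute-force corner check.

```python
import itertools
import numpy as np

def satisfies_condition7(p, mu, lam):
    """Sufficient condition (7): sum lam_i mu_i <= p and
    sum lam_i (1 - mu_i) <= 1 - p (all lam_i assumed positive)."""
    lam, mu = np.asarray(lam, float), np.asarray(mu, float)
    return bool(np.sum(lam * mu) <= p and np.sum(lam * (1 - mu)) <= 1 - p)

def always_valid(p, mu, lam):
    """Brute-force check of (6): p* is linear in q, hence extremal at the
    corners q in {0,1}^n, so checking every corner checks all q."""
    lam, mu = np.asarray(lam, float), np.asarray(mu, float)
    for q in itertools.product([0.0, 1.0], repeat=len(lam)):
        p_star = p + np.sum(lam * (np.asarray(q) - mu))
        if not 0.0 <= p_star <= 1.0:
            return False
    return True

# Hypothetical weights satisfying (7): p* stays in [0, 1] at every corner.
assert satisfies_condition7(0.3, [0.3, 0.3], [0.4, 0.4])
assert always_valid(0.3, [0.3, 0.3], [0.4, 0.4])
```

The same two functions also agree in the other direction: weights that badly violate (7) produce a corner where p* leaves [0, 1].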

Interpreting parameters as regression coefficients for linear Bayesian updating
In addition, (2) can be given as

p* − p = ∑_{i=1}^{n} λ_i (q_i − μ_i).  (8)

Taking expectations on both sides of the equation after this transformation yields E[p*] = p, since E[Q_i] = μ_i. Suppose we have m samples in the training set, and consider the regression model

p*_j − p = ∑_{i=1}^{n} λ_i (q_{ij} − μ_i) + ε_j,  j = 1, …, m,  (9)

where ε_j denotes the random errors. As long as the experts are not linearly dependent, the least squares estimate of the regression coefficients is

λ̂ = (X^t X)^{−1} X^t y,  (10)

written in matrix form with X the m × n matrix of centered opinions, x_{ji} = q_{ij} − μ_i, and y the m-vector with entries y_j = p*_j − p. Therefore, the parameters λ_i in the LBU model can be interpreted as the linear regression coefficients of p* − p on Q − μ.
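A minimal numerical sketch of this interpretation on synthetic data (the prior, true weights, and noise level below are illustrative assumptions): regressing p* − p on the centered opinions Q − μ by least squares recovers the LBU weights, and the covariance form of the same estimate gives the same answer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training set (purely illustrative): m samples, n experts.
m, n = 200, 3
p = 0.3                                   # prior
true_lam = np.array([0.35, 0.2, 0.1])     # weights to recover
Q = rng.uniform(0.0, 1.0, size=(m, n))    # expert opinions q_ij
mu = Q.mean(axis=0)                       # estimate of E[Q_i]
noise = rng.normal(0.0, 0.01, size=m)
p_star = p + (Q - mu) @ true_lam + noise  # regression model (9)

# Least squares on centered data: regress p* - p on Q - mu.
X = Q - mu
y = p_star - p
lam_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Equivalent covariance form: solve Sigma_Q lam = sigma_{p*Q}.
Sigma_Q = np.cov(Q, rowvar=False)
sigma_pQ = np.array([np.cov(p_star, Q[:, i])[0, 1] for i in range(n)])
lam_cov = np.linalg.solve(Sigma_Q, sigma_pQ)

assert np.allclose(lam_hat, true_lam, atol=0.05)
assert np.allclose(lam_cov, lam_hat, atol=0.05)
```

The two estimates coincide because, with mean-centered columns, (X^t X)/(m − 1) is the sample covariance matrix of Q and X^t y/(m − 1) is the sample covariance vector between p* and Q.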

Case study
To investigate the performance of the LBU method in real cases, the lithology types in the well-known Swiss Jura data set [20] are used, in which four rock types were sampled in a 14.5 km² area and 259 samples are selected as the training data set (Figure 1). The class proportions of these four categories are [20.46%, 32.82%, 24.32%, 22.39%], respectively. We consider the 10 nearest samples as the neighbors of each target location, thus we have 10 experts for consultation. Correlation analysis (Figure 2) shows that the 10 experts are mutually correlated to some extent; the two nearest neighbors have the highest correlation coefficient (greater than 0.9). Therefore, a variable-reduction technique is needed. We use principal component analysis (PCA) to convert those correlated experts into a set of values of linearly uncorrelated principal components. Before doing this, the well-known measures of appropriateness of factor analysis, the KMO measure and Bartlett's test, have been performed. Table 1 shows several important parts of the output: the Kaiser-Meyer-Olkin measure of sampling adequacy and Bartlett's test of sphericity. We also use the MCRF method for comparison. As a non-linear and parameter-free method, the MCRF is based on the product of probabilities, which will dramatically amplify higher probabilities and reduce lower probabilities close to zero. The lithology types of the corresponding 259 locations can be predicted based on the maximum a posteriori (MAP) probability criterion (Figure 4). Comparing the prediction results in Table 6, we find that LBU is superior to MCRF in overall prediction accuracy.
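The PCA step can be sketched as follows. Since the Jura samples themselves are not reproduced here, the snippet generates a stand-in set of 10 mutually correlated "experts" driven by one shared latent signal; that generating mechanism is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative stand-in for the 10 correlated "experts": nearby neighbors
# share a common latent signal, so their opinions are mutually correlated.
m, n_experts = 259, 10
latent = rng.uniform(0, 1, size=(m, 1))
Q = np.clip(latent + rng.normal(0, 0.15, size=(m, n_experts)), 0, 1)

# PCA via SVD of the centered opinion matrix.
Qc = Q - Q.mean(axis=0)
U, s, Vt = np.linalg.svd(Qc, full_matrices=False)
scores = Qc @ Vt.T                  # principal component scores
explained = s**2 / np.sum(s**2)     # variance explained per component

# The component scores are uncorrelated by construction ...
C = np.corrcoef(scores[:, :3], rowvar=False)
assert np.allclose(C - np.diag(np.diag(C)), 0, atol=1e-8)

# ... and with one shared latent signal, the first component dominates.
assert explained[0] > 0.5
```

In the real analysis the retained components would be chosen from the cumulative contribution rates (Table 2) rather than from a known latent structure.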

Conclusions
In this work, we consolidate the theoretical foundations of the LBU method for the prediction of categorical spatial data. The philosophy underlying this approach originates from combining probability distributions in expert systems. The Bayesian argument embodied in LBU is both normative and logical. With a linear expression, the probabilities of class occurrence at target locations can be assessed from neighboring sample types. The weights in this model are interpreted as linear regression coefficients. We have enriched our previous findings [19] by adding rigorous theoretical proofs of the LBU method. Experimental results show that our method performs better in terms of classification accuracy than the traditional MCRF method. As pointed out by [19], our method can also be generalized to nonlinear cases, where more confident probability forecasting results can be obtained.
Equivalently, let Σ_Q denote the covariance matrix of Q and σ_{p*Q} the vector of covariances between p* and Q, and let t denote matrix transposition; the least squares coefficients can then be written compactly as λ = Σ_Q^{−1} σ_{p*Q}.

Figure 1. Jura lithology data set with four classes.

Figure 3. Normal probability plot and Jarque-Bera test for goodness-of-fit to a normal distribution.

Figure 4. Predicted lithology types of the corresponding 259 locations based on the MAP probability criterion.

Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 14 July 2016 doi:10.20944/preprints201607.0030.v1

The KMO statistic of each lithology type is greater than 0.5, so we can be confident that PCA is appropriate for these data. Bartlett's measure tests the null hypothesis that the original correlation matrix is an identity matrix. For these data, Bartlett's test rejects this null hypothesis, indicating that the experts are sufficiently correlated for PCA.

Table 2. Cumulative contribution rate of the 10 principal components. Prin1 is short for the first principal component.

Table 3. Factor loadings of the retained principal components.
We then perform logistic regression for each lithology type with regard to the corresponding retained principal components. The purpose is to obtain the continuous posterior occurrence probability p* of the binary event A for estimating the parameters λ_i. Stepwise selection results (Table 4) show that several principal components have been removed from the logistic regression model. Maximum likelihood estimates of the stepwise regression in the LBU model show that all the estimated λ_i are greater than zero, which means that each expert Q_i provides positive incremental information about the occurrence of A.
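The logistic-regression step above can be sketched as follows. Since neither the Jura samples nor the retained components are reproduced here, the snippet fits a plain gradient-ascent logistic regression (a simplified stand-in for the stepwise procedure, with synthetic PC scores and made-up coefficients) to obtain a continuous p* for the LBU fit.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_logistic(X, y, iters=200, lr=0.5):
    """Plain gradient-ascent logistic regression (illustrative stand-in
    for the stepwise procedure described in the text)."""
    X1 = np.column_stack([np.ones(len(X)), X])   # add intercept column
    w = np.zeros(X1.shape[1])
    for _ in range(iters):
        prob = 1.0 / (1.0 + np.exp(-X1 @ w))
        w += lr * X1.T @ (y - prob) / len(y)     # average log-likelihood gradient
    return w

def predict_proba(X, w):
    X1 = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-X1 @ w))

# Synthetic binary indicator of one lithology class vs. two PC scores.
m = 300
Z = rng.normal(size=(m, 2))                      # stand-in PC scores
true_w = np.array([-0.5, 1.5, -1.0])             # made-up intercept + slopes
prob_true = 1.0 / (1.0 + np.exp(-(true_w[0] + Z @ true_w[1:])))
y = (rng.uniform(size=m) < prob_true).astype(float)

w = fit_logistic(Z, y)
p_star = predict_proba(Z, w)                     # continuous p* for the LBU fit
assert np.all((p_star > 0.0) & (p_star < 1.0))
```

The fitted p_star plays the role of the continuous posterior occurrence probability p* that the λ_i regression in Section 3.2 is estimated against.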