Preprint · Article · This version is not peer-reviewed
SAIN: Search-And-INfer, A Mathematical and Computational Framework for Personalised Multimodal Data Modelling with Applications in Health Care

Submitted: 30 August 2025 · Posted: 02 September 2025
Abstract
Personalised modelling has become dominant in personalised medicine and precision health. It creates a computational model for an individual, based on large repositories of existing personalised data, aiming to achieve the best possible personal diagnosis or prognosis and to derive an informative explanation for it. Current methods still work on a single data modality or treat all modalities with the same method. The proposed method, SAIN (Search-And-INfer), offers better results and an informative explanation for the classification of a multimodal object (sample) using a database of similar multimodal objects. The method is based on different distance measures suitable for each data modality and introduces a new formula to rank each sample and then use these ranks for probabilistic inference. The paper describes SAIN and applies it to two types of multimodal data, cardiovascular diagnosis and EEG time series, modelled by integrating modalities such as numbers, categories, images, and time series, using a software implementation of SAIN.
Keywords: search in multimodal data; inference in multimodal data; personalised modelling; precision health

1. Introduction

Multimodal data has been gathered at a personalised level in large quantities, worldwide, for many applications, such as neuro-imaging analysis, personalised health diagnosis and prognosis, environmental modelling, and financial modelling, to mention only a few of them. Still, there are no efficient methods to integrate various multimodal data for a new subject and derive a more accurate and explainable diagnosis or prognosis based on existing multimodal data of many other subjects. The goal of this paper is to create such a method.
There are three main approaches to multimodal data integration in machine learning, explored so far [1]:
1. Early integration, where a common vector represents all modalities used for training a model and for its recall. This approach has been used for integrating time series and textual information [2,3], for image integration [4], and for the integration of clinical, social, and cognitive data modalities to predict psychosis in young adults [5]. While the method in [4] is based on deep, feedforward neural networks, the methods in [2,5] use a brain-inspired spiking neural network architecture NeuCube [6].
2. Late integration, where a model is created and trained for each of the modalities of data, and the results from all models are integrated to calculate the output. This approach has been demonstrated in [1] on integrating clinical, genetic, cognitive, and social data for medical prognosis.
3. Hybrid, early, late, and intermediate integration of data modalities, where the two approaches above are combined [1].
The proposed SAIN method in this paper is designed for early integration of data modalities, where specific encoding and distance metrics are suggested for different types of data, along with novel algorithms for search in a multimodal database and inference. These search and inference algorithms are related to building a personalised model for individual outcome assessment and its explanation.
Personalised modelling is concerned with the creation of an individual model for a new personal (individual) record of data X, using an already existing repository D of many other personal records for which the outcomes are known, to assess the outcome of the new record X [7]. Methods for personalised modelling have been developed to work mainly on a single modality of data [8,9]. These methods have been applied in many applications and constitute the state-of-the-art in the field (e.g. [1,9,10,11,12,13]). In [9,14], personalised modelling based on static and temporal data modalities is proposed, where the data are used to train a spiking neural network model.
The enormous growth of personal multimodal data worldwide demands more advanced methods for personal modelling with the use of multimodal data. This paper offers such a method, called SAIN, where the specificity of each data modality is considered and new algorithms are proposed for the encoding of multimodal data, for search in a multimodal data repository, and for multimodal inference, along with its explanation and visualisation. In contrast with the statistical solutions used in [15,16,17], we adopt a probabilistic framework that gives more precise evaluations of probabilities of outcomes for an individual.

2. Mathematical Description

In this section, we present the mathematical method. We start with the database coding and the class of distances (metrics) used in this article, followed by the list of tasks (problems) and their solutions, illustrated with detailed examples. Solutions to three critical problems (survival analysis, heart disease diagnosis, and time series classification) are then presented and illustrated with numerical examples.

2.1. Database

We will work with multidimensional data described as follows (we write $\ast$ for a missing or uncertain value throughout):
  • $m > 1$ objects (samples) $o_1, \dots, o_m$;
  • each object $o_i$ ($1 \le i \le m$) is defined by $n > 1$ criteria (variables) $c_1, \dots, c_n$ with values in linearly ordered domains $D_1, \dots, D_n$, each $D_j$ having $\min D_j$ and $\max D_j$; if some value $a_{i,j} \in D_j$ ($1 \le i \le m$, $1 \le j \le n$) is either missing or uncertain, then it is recorded as $\ast$;
  • $n > 1$ weights $w_1, \dots, w_n$ in $[0,1]$ with $\sum_{i=1}^{n} w_i = 1$, where each $w_i$ ($1 \le i \le n$) quantifies the importance of the criterion $c_i$; if $w_i = \frac{1}{n}$ for all $1 \le i \le n$, then all criteria are equally important; a criterion $c_i$ is ignored if $w_i = 0$.
Data on the independent variables are organised as in Table 1.

2.2. Distance Metrics

A distance metric on a space X is a non-negative real-valued function $d : X \times X \to \mathbb{R}_{+}$ satisfying the following three conditions for all $x, y, z \in X$: (a) $d(x,y) = 0$ if and only if $x = y$; (b) $d(x,y) = d(y,x)$; (c) $d(x,z) \le d(x,y) + d(y,z)$.
The multicriteria metrics [18,19] (used in multicriteria recommendation systems [20]) presented in this part can be used on a variety of domains $X = D_i$: they can be sets of logical values, rational numbers, percentages, digitally codified images, sounds, videos, and many others. We use a bounded distributive complemented lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ to describe the domains $D_i$ uniformly. We rank all objects according to their aggregated distance to a new one; based on that, we calculate the probabilities of the new object belonging to the different classes represented in the object repository.
Here is a list of illustrative, but far from exhaustive, examples of domains $D_i$:
  • Logical Boolean domain: $(\{0,1\}, \max, \min, \bar{x}, 0, 1)$, where $\bar{x} = 1 - x$, $x \in \{0,1\}$.
  • Logical non-Boolean domain: $\left(\left\{0, \frac{1}{N-1}, \frac{2}{N-1}, \dots, \frac{N-2}{N-1}, 1\right\}, \max, \min, \bar{x}, 0, 1\right)$, where $x \in \left\{0, \frac{1}{N-1}, \dots, 1\right\}$ and $\bar{x} = 1 - x$.
  • Numerical domain with natural values: $(\{0, 1, \dots, N\}, \max, \min, \bar{x}, 0, N)$, where $\bar{x} = N - x$, $x \in \{0, 1, \dots, N\}$.
  • Numerical domain with rational values: $(\{x \mid a \le x \le A\}, \max, \min, \bar{x}, a, A)$, where $\bar{x} = A - x$, $a \le x \le A$.
  • Binary code: $(\{0,1\}^n, \max, \min, \bar{x}, 00\cdots 0, 11\cdots 1)$, where the domain consists of all binary strings of length $n$, $\{0,1\}^n = \{x_1 x_2 \cdots x_n \mid x_i \in \{0,1\}\}$, and for all $x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n \in \{0,1\}^n$: $\max(x_1 \cdots x_n, y_1 \cdots y_n) = \max(x_1, y_1) \max(x_2, y_2) \cdots \max(x_n, y_n)$, $\min(x_1 \cdots x_n, y_1 \cdots y_n) = \min(x_1, y_1) \min(x_2, y_2) \cdots \min(x_n, y_n)$, and $\overline{x_1 x_2 \cdots x_n} = (1 - x_1)(1 - x_2) \cdots (1 - x_n)$.
In the lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ we introduce, following [18], the metric:
$$d(x, y) = \begin{cases} (x \wedge \bar{y}) \vee (\bar{x} \wedge y), & \text{if } x \ne y, \\ 0, & \text{otherwise}, \end{cases}$$
for $x, y \in L$. This metric $d$ can be extended to $L \cup \{\ast\}$ as follows:
$$d_\ast(x, y) = \begin{cases} d(x, y), & \text{if } x, y \in L, \\ \sigma(x), & \text{if } x \in L \text{ and } y = \ast, \\ \sigma(y), & \text{if } y \in L \text{ and } x = \ast, \\ 0, & \text{otherwise}, \end{cases}$$
where $\sigma(x) = \max(x, \bar{x})$.
The metrics $d_{\ast,i}$ on $L_i \cup \{\ast\}$, $1 \le i \le n$, can be extended to $(L_i \cup \{\ast\})^n$, i.e. to $n$-dimensional vectors, as follows:
$$d_\ast(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n) = \sum_{i=1}^{n} d_{\ast,i}(x_i, y_i),$$
where $x_i, y_i \in L_i \cup \{\ast\}$, $1 \le i \le n$.
In what follows, we write $d$ for $d_\ast$ when the meaning is clear from the context.
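To make the lattice metric concrete, here is a minimal Python sketch (our illustration, not the authors' released software) of $d$, $\sigma$, and the extension $d_\ast$ on the unit-interval domain, where $x \wedge y = \min(x,y)$, $x \vee y = \max(x,y)$ and $\bar{x} = 1 - x$; the name `MISSING` stands in for the missing-value symbol $\ast$.

    MISSING = None   # stands in for the missing-value symbol

    def d(x, y):
        """Lattice metric d(x, y) = (x AND not-y) OR (not-x AND y) for x != y."""
        if x == y:
            return 0.0
        return max(min(x, 1 - y), min(1 - x, y))

    def sigma(x):
        """sigma(x) = max(x, complement of x): distance to a missing value."""
        return max(x, 1 - x)

    def d_star(x, y):
        """Extension of d to the domain augmented with the missing value."""
        if x is not MISSING and y is not MISSING:
            return d(x, y)
        if x is not MISSING:
            return sigma(x)          # y is missing
        if y is not MISSING:
            return sigma(y)          # x is missing
        return 0.0                   # both missing

    def d_star_vec(xs, ys):
        """Component-wise sum of d_star over n-dimensional vectors."""
        return sum(d_star(a, b) for a, b in zip(xs, ys))

    # e.g. d_star(1, MISSING) == 1, matching d(1, *) = max(1, 1 - 1) = 1 in Section 3.3.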

2.3. Tasks Specification

Data organised as in Table 2, Table 3, and Table 4 consists of independent objects augmented with a column of labels, the weights of criteria, and a new unlabelled object, respectively.
Additional information associated with data in Table 2 may include the range of each criterion $c_j$ and the associated specific distance, e.g. the Euclidean distance for real numbers and the lattice distance $d_\ast$ for binary strings or strings over a non-binary alphabet (e.g. for images or colours).
We consider the following tasks:
Task 1:
Calculate the distance (or similarity metric) between the new object and each object in Table 2.
If the distance corresponding to criterion $c_j$ is $d_j$, then
$$d(o_i, x) = \sum_{j=1}^{n} w_j \cdot d_j(a_{i,j}, x_j).$$
Task 2:
Given a threshold $\delta > 0$, calculate all objects $o_i$ at a distance at most $\delta$ from x.
Task 3:
Calculate the probability of a new object belonging to a labelled class (e.g. low risk vs. high risk) using a threshold δ and Table 2.
Task 4:
Rank the criteria in Table 2 and identify the marker criterion or criteria, i.e. the most important one(s).
Task 5:
Assign alternative weights to criteria.
Task 6:
Test the accuracy of the data and of the method used for Task 4.

2.4. Tasks Solutions

For Task 1 we calculate the distances $d(o_i, x)$ between each object $o_i$ in Table 2 and the new object x in Table 4.
For Task 2, given a threshold $\delta > 0$, we calculate all objects in Table 2 at a distance at most $\delta$ from x, that is, the objects which are $\delta$-similar to x:
$$C_{\delta,x} = \{o_i \mid d(x, o_i) \le \delta,\ 1 \le i \le m\},$$
and its complement $\overline{C_{\delta,x}}$.
For Task 3 we calculate the probability that x is in the class labelled $l_t$, which is the ratio of the number of objects in $C_{\delta,x}$ with the label $l_t$ to the size of the cluster $C_{\delta,x}$:
$$\mathrm{Prob}(x \text{ has label } l_t) = \frac{\#\{o_i \in C_{\delta,x} \mid l_i = l_t\}}{\#(C_{\delta,x})},$$
where $\#S$ denotes the number of elements of the set $S$.
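A minimal sketch of Tasks 1-3 follows, assuming each criterion comes with its own distance function and weight; the names `weighted_distance`, `delta_cluster`, and `class_probabilities` are ours, for illustration only.

    from collections import Counter

    def weighted_distance(o, x, dists, weights):
        """Task 1: d(o, x) = sum_j w_j * d_j(o_j, x_j)."""
        return sum(w * dj(a, b) for w, dj, a, b in zip(weights, dists, o, x))

    def delta_cluster(objects, x, dists, weights, delta):
        """Task 2: indices of all objects at distance at most delta from x."""
        return [i for i, o in enumerate(objects)
                if weighted_distance(o, x, dists, weights) <= delta]

    def class_probabilities(objects, labels, x, dists, weights, delta):
        """Task 3: Prob(x has label l) over the delta-cluster of x."""
        cluster = delta_cluster(objects, x, dists, weights, delta)
        counts = Counter(labels[i] for i in cluster)
        return {l: c / len(cluster) for l, c in counts.items()} if cluster else {}

Since computing the cluster requires one distance evaluation per stored object, this search is linear in the size of the repository (see Section 2.6).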
For Task 4, we work with Table 2. Recall that for each criterion $c_i$ we have a domain $D_i$ augmented with the information "high" or "low", indicating whether higher or lower values are desirable. Based on this information, we can construct a hypothetical object (see Table 5) which has the most desirable value for each criterion: one could see this object as an "exemplar" one.
Sometimes, criteria are interrelated or correlated. This means that in some cases, there is no unique “exemplar object", but a couple of them have to be studied in ranking the importance of criteria.
For example, fix an "exemplar object" $o_E$.
  • Compute the distances $d(o_i, o_E)$ between each object $o_i$ in Table 2 and $o_E$, obtaining a vector with m non-negative real components, $V_0 = (d_1^0, \dots, d_m^0)$.
  • For each $1 \le t \le n$, compute the distances $d(o_i, o_E)$ taking into consideration all criteria in Table 2 except $c_t$, obtaining the vector $V_t = (d_1^t, \dots, d_m^t)$.
  • Compute the distances $\mathrm{dist}(V_0, V_t)$, $1 \le t \le n$, using the formula
    $$\mathrm{dist}(V_0, V_t) = \sum_{i=1}^{m} |d_i^0 - d_i^t|,$$
    and sort them in increasing order. The criterion $c_t$ is a marker if $\mathrm{dist}(V_0, V_t) \ge \mathrm{dist}(V_0, V_j)$ for every $1 \le j \le n$.
We repeat this procedure for each "exemplar object" and study possible variations.
For Task 5, normalise the distances $\mathrm{dist}(V_0, V_t)$ and use these values as the weights $w_t^*$, $1 \le t \le n$.
For Task 6, assume we have weights $(w_i)$ associated with Table 2 (see Table 1). To test the accuracy of the data and of the method used for Task 4, compare the original weights $(w_i)$ with $(w_i^*)$. Serious discrepancies should signal issues either with the data or with the choices made in applying the method.
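The marker-and-weight procedure of Tasks 4-5 can be sketched as follows, assuming the per-criterion distances of each object to the exemplar have already been computed into a matrix (`dist_matrix[i, t]` holding $d_t(a_{i,t}, e_t)$, a name we introduce for illustration). With additive, unweighted distances, $V_t$ is simply $V_0$ minus column t.

    import numpy as np

    def marker_and_weights(dist_matrix):
        """dist_matrix[i, t]: per-criterion distance of object o_i to the
        exemplar o_E on criterion c_t."""
        dist_matrix = np.asarray(dist_matrix, float)
        v0 = dist_matrix.sum(axis=1)                 # V_0: all criteria included
        dist_v0_vt = np.empty(dist_matrix.shape[1])
        for t in range(dist_matrix.shape[1]):
            vt = v0 - dist_matrix[:, t]              # V_t: criterion c_t removed
            dist_v0_vt[t] = np.abs(v0 - vt).sum()    # dist(V_0, V_t)
        marker = int(np.argmax(dist_v0_vt))          # Task 4 (0-based index)
        weights = dist_v0_vt / dist_v0_vt.sum()      # Task 5: normalised weights w*
        return marker, weights

    # For the example of Section 2.5 this selects index 4, i.e. the criterion c_5.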

2.5. An Example

We illustrate the above tasks with an example of a labelled database (see Table 6) and a new object (see Table 7), all objects having the following seven characteristics (the last column of Table 6 contains the class labels 1 and 2):
$c_1$: real number in [0, 100], e.g. age, weight, BMI;
$c_2$: Boolean value $\{0, 1\}$, e.g. gender;
$c_3$: integer number in $\{0, \dots, 10{,}000\}$, e.g. gene expression;
$c_4$: categorical {small, medium, large}, e.g. size of tumour, body size, keywords;
$c_5$: colour {red, yellow, white, black}, e.g. colour of a spot on the body, on the heart;
$c_6$: spike sequence over $\{-1, 0, 1\}$, e.g. encoded EEG, ECG;
$c_7$: black-and-white image, e.g. MRI, face image.
In this fictitious example, for simplicity, we did not use weights.
The first step is to code the data in Table 6 and Table 7; the coded data are in Table 8 and Table 9.
Then we normalise the data in Table 8 and Table 9: the entries in the first, third, and fourth columns have been divided by 100, 10,000, and 2, respectively; the entries in the last three columns have been transformed into reals in the unit interval, and the column of labels has been removed. In this way, we obtain Table 10 and Table 11.
Next, we choose an appropriate distance for each criterion; in this example, we used the Euclidean distance for all criteria (see Table 12 and Table 13).
We can now compute $C_{\delta,x} = \{o_i \mid d(o_i, x) \le \delta\}$ and, accordingly, the probability that x would be labelled in class 1 or class 2.
If $\delta = 3.5$, then $C_{3.5,x} = \{o_1, o_2, o_3, o_5, o_6, o_7, o_8\}$, so the probability that x is in class 1 is 2/7 and the probability that x is in class 2 is 5/7. If $\delta = 2.5$, then its closest cluster is $C_{2.5,x} = \{o_2, o_3, o_5, o_6, o_7, o_8\}$, so the probability that x is in class 1 is 1/3 and the probability that x is in class 2 is 2/3.
The sorted distances in Table 13 induce a ranking of the objects in Table 8: $o_3, o_6, o_7, o_5, o_2, o_8, o_1, o_9, o_4$.
For Task 4, assume that the criteria $c_1, \dots, c_7$ in Table 10 have the additional information $(m, m, m, m, m, M, M)$, where m (M) means that the exemplar value is the minimum (maximum) value. Based on this vector, we compute the exemplar object (see Table 14).
Next we calculate $V_0, \dots, V_7$ (see Table 15), and finally the distances $\mathrm{dist}(V_0, V_t)$, $t = 1, 2, \dots, 7$, and the weights as their normalised values, see Table 16. The marker, in this case, is the criterion $c_5$.

2.6. Complexity Estimation of the SAIN Method

The proposed method computes the similarity between a new object and the N objects in a data repository in time linear in N; the search is therefore very fast, even on large repositories.

3. Survival Analysis in SAIN

Medical survival analysis evaluates the time until an event of interest occurs, like death or disease recurrence, in a group of patients. This analysis is often used to compare treatment outcomes or predict prognosis. In contrast with the statistical solutions used in [15,16,17], we adopt a probabilistic framework that gives more precise evaluations of probabilities.

3.1. Data and Tasks

We are given the following data:
  • Table 17, in which the first column lists the patients treated for the same disease with the same method under strict conditions, and the last column records the times till the patients' deaths.
  • Table 18, which includes the record of the new patient p.
  • A threshold $\delta$ which defines the acceptable similarity between p and the relevant $p_i$'s in the survival database (i.e. $d(p, p_i) \le \delta$).
We consider the following tasks:
Task 1: What is the life expectancy of p?
Task 2: What is the probability that the life expectancy of p is greater than or equal to a given T?

3.2. Tasks Solutions

Using a standard method of survival analysis:
  • For Task 1,
    (a) compute the set of patients that are similar up to $\delta$ to p:
    $$C_{\delta,p} = \{p_i \mid d(p, p_i) \le \delta,\ 1 \le i \le m\};$$
    (b) using $C_{\delta,p}$, compute the probability that p will survive the time $t_j$:
    $$\mathrm{Prob}_\delta(p \text{ survives time } t_j) = \frac{\#\{p_i \in C_{\delta,p} \mid t_i = t_j\}}{\#(C_{\delta,p})};$$
    (c) compute the life expectancy of p using the formula:
    $$LE_\delta(p) = \sum_{t_j} t_j \times \mathrm{Prob}_\delta(p \text{ survives time } t_j),$$
    where the sum is over the distinct survival times $t_j$ recorded in $C_{\delta,p}$.
  • For Task 2, calculate the probability that the life expectancy of p is at least time T:
    $$\mathrm{Prob}_\delta(LE(p) \ge T) = \sum_{t_j \ge T} \mathrm{Prob}_\delta(p \text{ survives time } t_j),$$
    again summing over the distinct survival times recorded in $C_{\delta,p}$.
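A minimal sketch of both tasks, assuming `distances[i]` holds $d(p, p_i)$ (computed with the per-column distances described in the example below) and `times[i]` holds the survival time $t_i$; the function names are ours.

    from collections import Counter

    def survival_profile(distances, times, delta):
        """Prob_delta(p survives time t) for each distinct time t in C_{delta,p}."""
        cluster_times = [t for d, t in zip(distances, times) if d <= delta]
        counts = Counter(cluster_times)
        return {t: c / len(cluster_times) for t, c in counts.items()}

    def life_expectancy(distances, times, delta):
        """Task 1: LE_delta(p) = sum over distinct times t of t * Prob_delta(...)."""
        return sum(t * p for t, p in survival_profile(distances, times, delta).items())

    def prob_le_at_least(distances, times, delta, T):
        """Task 2: Prob_delta(LE(p) >= T)."""
        return sum(p for t, p in survival_profile(distances, times, delta).items()
                   if t >= T)

    # With the distances of Table 21 and the times of Table 19, delta = 2.5 gives
    # life_expectancy(...) = 62.27 and prob_le_at_least(..., T=45) = 4/6.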

3.3. An Example

We illustrate the above tasks with an example of a database in which columns 2-8 record patients' medical test results and the last column records the time to death (see Table 19), together with a new patient (see Table 20).
The distance for column 4 is $d(x, y) = |x - y|$, extended with $d(x, \ast) = \max(x, 1 - x)$ for missing values; for example, $d(1, \ast) = \max(1, 1 - 1) = 1$. For all other columns, the distance is $d(x, y) = |x - y|$. Finally, the total distance is the sum of the individual distances (7 terms); the results are in Table 21.
The results for Tasks 1 and 2 are listed below.
  • For $\delta \ge 3.37$, $C_{\delta,p} = \{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}$, that is, the entire database. Then:
    (a) $LE_\delta(p) = 50.11$;
    (b) $\mathrm{Prob}_\delta(p \text{ survives time } t_j) = 1/9$ for each $t_j \in \{1.4, 12.3, 15, 40.5, 55.7, 63.7, 68\}$, and $\mathrm{Prob}_\delta(p \text{ survives time } 97.2) = 2/9$;
    (c) $\mathrm{Prob}_\delta(LE_\delta(p) \ge 1.4) = 1$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 12.3) = 8/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 15) = 7/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 40.5) = 6/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 55.7) = 5/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 63.7) = 4/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 68) = 3/9$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 97.2) = 2/9$.
    We can calculate other probabilities, for example, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 60) = \mathrm{Prob}_\delta(p \text{ survives } 63.7) + \mathrm{Prob}_\delta(p \text{ survives } 68) + \mathrm{Prob}_\delta(p \text{ survives } 97.2) = 1/9 + 1/9 + 2/9 = 4/9$.
  • For $\delta = 2.5$, $C_{\delta,p} = \{p_2, p_3, p_5, p_6, p_7, p_8\}$. Then:
    (a) $LE_\delta(p) = 62.27$;
    (b) $\mathrm{Prob}_\delta(p \text{ survives time } t_j) = 1/6$ for each $t_j \in \{15, 40.5, 55.7, 68\}$, and $\mathrm{Prob}_\delta(p \text{ survives time } 97.2) = 2/6$;
    (c) $\mathrm{Prob}_\delta(LE_\delta(p) \ge 15) = 1$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 40) = 5/6$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 55.7) = 4/6$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 68) = 3/6$, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 97.2) = 2/6$.
    Similarly, we can calculate, for example, $\mathrm{Prob}_\delta(LE_\delta(p) \ge 45) = 4/6$ and $\mathrm{Prob}_\delta(LE_\delta(p) \ge 100) = 0$.
In contrast with the statistical solutions used in [15,16,17], we adopted a probabilistic framework that gives more precise evaluations of probabilities. The SAIN algorithms also include some statistically established methods, such as the t-test, for ranking variables before applying the inference method.

4. SAIN: A Modular Diagram and Functional Information Flow

The SAIN framework consists of the following modules (Figure 1):
  • Multimodal data of a new object X.
  • An existing repository D of multimodal data of many objects, labelled with their outcome.
  • A module of algorithms for searching in the database D, based on the distance between X and each object in D.
  • A module defining a subset $D_x$ of D, such that X is close to the objects in $D_x$ according to a given threshold.
  • A module of algorithms for building a model M x in D x .
  • An inference algorithm to derive the output for X from the model $M_x$ and to visualise it for explanation purposes.
Figure 1 gives a modular view of the SAIN framework, and Figure 2 shows the information processing flow:
    (a) Encoding the multimodal data of X and D.
    (b) Choosing a distance metric and similarity search in the data set D.
    (c) Calculating the aggregated distance between the new data vector X and the closest vectors in $D_x$.
    (d) Creating a model $M_x$ in $D_x$.
    (e) Applying inference by calculating $X_{c,j}$ for each class $C_j$ (or output value), using the wwkNN method [7].
    (f) Reporting and visualising the results of the individual model $M_x$, as illustrated in Figure 3.
The inference method is based on wwkNN (weighted variables, weighted samples k-nearest neighbour), proposed by Kasabov [7]. This method first ranks the impact of the (multimodal) variables, using a t-test, to estimate their weights towards the output; it then measures the distance between the new object X and the objects in the database $D_x$ and weighs it accordingly. For each class $C_j$, the higher a variable is ranked and the closer the class $C_j$ samples are to X, the higher the calculated value $X_{c,j}$ is. The new object X is classified in class $C_l$ if $X_{c,l}$ is the highest among all $X_{c,j}$ values.
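A minimal sketch of a wwkNN-style inference step for two classes, assuming numeric, already-encoded variables; variable weights are derived here from a t-test between the two classes and sample weights from the inverse of the variable-weighted distance to X (a common choice, for illustration; [7] should be consulted for the exact weighting).

    import numpy as np
    from scipy import stats

    def wwknn_scores(Dx, labels, x, k=6):
        """Return a normalised score X_{c,j} for each class, given the
        personalised neighbourhood Dx (samples x variables) and labels."""
        Dx, labels, x = np.asarray(Dx, float), np.asarray(labels), np.asarray(x, float)
        classes = np.unique(labels)
        # Variable weights: |t| statistic between the two classes (t-test ranking).
        t_vals, _ = stats.ttest_ind(Dx[labels == classes[0]],
                                    Dx[labels == classes[1]], equal_var=False)
        w_var = np.abs(t_vals) / np.abs(t_vals).sum()
        # Variable-weighted Euclidean distance from X to every sample in D_x.
        dists = np.sqrt(((Dx - x) ** 2 * w_var).sum(axis=1))
        nearest = np.argsort(dists)[:k]
        w_smp = 1.0 / (dists[nearest] + 1e-9)        # closer samples vote more
        scores = {c: float(w_smp[labels[nearest] == c].sum()) for c in classes}
        total = sum(scores.values())
        return {c: s / total for c, s in scores.items()}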
Figure 1. A modular diagram of the proposed SAIN computational framework.
Figure 2. A flow of data and information processing in the SAIN computational framework.
Figure 3. An example of visualisation of a personalised SAIN model. The closest four samples (out of 6) to the new object (star) are from class 1 (in red), using the top three informative variables. Each sample is multimodal, and the top 3 variables can be of different modalities.

5. Case Studies for Medical Diagnosis and Prognosis

We present three case studies in which we applied SAIN.

5.1. Heart Disease Diagnosis

We worked with the well-known Cleveland dataset, which contains multiple data types [21]. The UCI Heart Disease data set includes 76 attributes; as in most articles, we restricted our experiments to 14 of them, see Table 22.
The problem is a binary classification of whether the patient has or does not have heart disease.
First, we selected suitable distance metrics and weights for the attributes. For binary attributes, the distance is simply whether the two values are equal; for non-binary discrete attributes, such as the resting electrocardiographic results, the appropriate distance measure is not obvious and should be informed by an expert. We code the electrocardiographic results as 0 for normal, 1 for having ST-T wave abnormality, and 2 for showing probable or definite left ventricular hypertrophy following Estes' criteria.
Many studies have tested the Cleveland dataset with different machine learning techniques. For example, [21] lists different algorithms with performances ranging from 47% to 80% accuracy; SAIN achieved an accuracy of 82%. Why SAIN? The search is fast, uses appropriate distances chosen by a medical expert, and provides explainability at a personal level, including probabilities. It also offers different scenarios for modelling by experimenting with different sets of features, parameters, and preferred outcome visualisations.
The SAIN experiment used binary and numerical representations for each variable, as described above, with the same data representation (recommended by medical experts) as in the original paper [21]. The accuracy of the SAIN experiment was 82%, the same accuracy as in [6], which used classical machine learning. In addition, SAIN allows each personalised model to be visualised, as shown in the examples in Figure 3 and Figure 8.
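To make the distance choices concrete, a sketch of how per-attribute distances for the Table 22 variables might be declared is given below. The normalising scales are our assumptions for illustration, not the expert-chosen values used in the experiment; `num` is the diagnosis target, so it gets no distance.

    def binary(a, b):                     # sex, fbs, exang
        return 0.0 if a == b else 1.0

    def scaled(scale):                    # numeric and ordinal attributes
        return lambda a, b: abs(a - b) / scale

    DISTANCES = {
        "age": scaled(100), "sex": binary, "cp": scaled(3),
        "trestbps": scaled(200), "chol": scaled(600), "fbs": binary,
        "restecg": scaled(2), "thalach": scaled(220), "exang": binary,
        "oldpeak": scaled(10), "slope": scaled(2), "ca": scaled(3),
        "thal": scaled(4),
    }

Such a declaration plugs directly into the `weighted_distance` search of Section 2.4, one distance function per criterion.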

5.2. Time Series Classification

The proposed SAIN framework can incorporate time series data as another modality, in addition to the other data modalities for a person, making a joint multimodal personal vector. A time series is encoded into a spike vector using spike encoding algorithms [6]: if there is a positive change from one discrete time point to the next, there will be a positive spike (encoded as 1); a negative change results in a negative spike (-1); and no change results in 0. This is illustrated on a hypothetical time series in Figure 4 and sketched in code below. The approach applies to any raw time series data, at any time scale; here we show just two hypothetical examples, of brain EEG data (Figure 5) and cardio data (Figure 6).
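A minimal sketch of this spike encoding (our illustration of the idea in [6]); note that, as written, the output has one element per consecutive pair of time points, a slight simplification of the figure, where the vector length equals the number of time points.

    def spike_encode(series, threshold=0.0):
        """Encode a time series into a {-1, 0, 1} spike vector: +1 for an
        increase between consecutive time points, -1 for a decrease, 0
        otherwise. A positive threshold ignores small (noisy) changes,
        acting as the filter described for the ECG signals in Figure 6."""
        spikes = []
        for prev, curr in zip(series, series[1:]):
            diff = curr - prev
            spikes.append(1 if diff > threshold else -1 if diff < -threshold else 0)
        return spikes

    # spike_encode([0.1, 0.5, 0.4, 0.4, 0.9]) -> [1, -1, 0, 1]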
Many data sets for classifying outcomes of events consist of multiple time series, and each variable in a time series may depend on other variables that change in time. The proposed model deals with this by encoding each time series (signal) into a spike vector, which can then be processed for classification in the SAIN framework. In the EEG case study below, the variables are 14 channels of temporal EEG data, located at places of interest on the human scalp.
Examples of signals measured over the same period are EEG channels, fMRI voxels, ECG electrodes, seismic sensory signals, financial time series, gene expressions, voice, and music frequency bands [11]. Even when the variable (signal) measurements are independent, the signals may impact each other, as they represent the same object/person over the same time period. The number N of these signals can vary from just a few for a short time window T (Figure 4) to hundreds and thousands, when the time varies from a few milliseconds to minutes, hours, days, etc.
Figure 5 shows an EEG experiment, and Figure 6 shows a cardiovascular disease signal.
Next, we present a simple example of how this search can be computed for a new record X consisting of only three variables/signals (e.g., EEG channels, ECG electrodes) over a short period of 5 time moments and the database D consisting of only six such records, which are labelled by outcome labels 1,2,3 (e.g., diagnosis, prognosis).
In addition to the record X, a weight vector is supplied with the weighted importance of the signals at different time points, e.g. W = [ 0.1 , 0.2 , 0.4 , 0.2 , 0.1 ] , meaning that the most important and informative part of the measurements is at time point 3.
The new record is X = (1, 1, 1, 0, 1) (signal, EEG channel 1); (0, 1, 1, 1, -1) (signal, EEG channel 2); (1, 1, -1, -1, 0) (signal, EEG channel 3), with W = [0.1, 0.2, 0.4, 0.2, 0.1].
The database contains six labelled records $R_1, \dots, R_6$, shown in Table 23.
The new record X of EEG signals is classified in class 1, as it is closest, according to the weighted Euclidean distance, to the class 1 samples $R_1$ and $R_2$.
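This toy search can be reproduced directly; the sketch below computes the time-point-weighted Euclidean distance from X to each record of Table 23.

    import numpy as np

    W = np.array([0.1, 0.2, 0.4, 0.2, 0.1])           # time-point weights
    X = np.array([[1, 1, 1, 0, 1],                    # channel 1
                  [0, 1, 1, 1, -1],                   # channel 2
                  [1, 1, -1, -1, 0]])                 # channel 3

    RECORDS = {  # Table 23: (3 channels x 5 time points, class label)
        "R1": ([[1, 1, -1, 0, 1], [0, 1, 1, 1, -1], [1, 1, -1, -1, 0]], 1),
        "R2": ([[1, 0, -1, 0, 1], [0, 1, 1, 1, -1], [1, 0, -1, -1, 1]], 1),
        "R3": ([[1, 1, -1, 0, 1], [0, -1, 1, 1, -1], [1, 1, -1, 0, 1]], 2),
        "R4": ([[1, 1, -1, 0, 1], [0, -1, 1, 0, -1], [1, 1, -1, 0, 1]], 2),
        "R5": ([[1, 1, -1, 0, 0], [0, -1, 0, 1, -1], [1, 1, -1, 1, 1]], 3),
        "R6": ([[1, -1, -1, 0, 1], [0, -1, 1, 0, -1], [1, 1, -1, 0, 1]], 3),
    }

    for name, (rec, label) in RECORDS.items():
        d = np.sqrt((W * (np.array(rec) - X) ** 2).sum())  # weighted Euclidean
        print(name, "class", label, "distance", round(float(d), 3))
    # R1 and R2 (both class 1) come out closest to X, so X is classified in class 1.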

5.3. Predicting Longevity in Cardiac Patients

We utilised a data set [22] to which we applied a binary classification of whether the patient had an event (e.g., death) and, for those who had an event, whether it would occur in the near future (within the next 180 days, i.e., approximately six months). The data set contained 150 variables and an outcome, with 295 patients in the first data set and 49 in the second. The data included a mix of variables that can be grouped as follows:
  • demographics, risk factors, disease states, medication, and deprivation scores;
  • echocardiography and cardiac ultrasound measurements;
  • advanced ECG measurements.
The other data include the days until the event occurred and the censoring date for the Cox proportional hazards analysis.
The objective is to predict an arrhythmic event or death.
Before running the algorithm, the data was normalised and, to account for the data being unbalanced, we applied the SMOTE data balancing method [11] inside each leave-one-out iteration (ensuring that the held-out data point was never used for oversampling). For the event classification data set, the model achieved an accuracy of 79%, broken down into classifying no event with 198/247 (80%) accuracy and an event with 36/49 (73%) accuracy. It is worth noting that the confidence of each individual classification can be explored; a sample of the classification confidence is shown in Figure 7.
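This protocol can be sketched as follows (our illustration using scikit-learn and imbalanced-learn, with a plain kNN classifier standing in for the SAIN search-and-infer step, which is not public):

    import numpy as np
    from imblearn.over_sampling import SMOTE
    from sklearn.neighbors import KNeighborsClassifier

    def loo_with_smote(X, y):
        """Leave-one-out with SMOTE applied only to each training fold, so the
        held-out patient never influences the oversampling."""
        X, y = np.asarray(X, float), np.asarray(y)
        hits = 0
        for i in range(len(X)):
            mask = np.arange(len(X)) != i
            Xr, yr = SMOTE().fit_resample(X[mask], y[mask])   # balance the fold
            clf = KNeighborsClassifier(n_neighbors=5).fit(Xr, yr)
            hits += int(clf.predict(X[i:i + 1])[0] == y[i])
        return hits / len(X)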
For the second experiment, we normalised the dataset and removed any columns with unknown values. We then applied a genetic algorithm to find the set of features to use for classification. We found a set of 34 variables which provided an accuracy of 81%, with 34/34 for class 0 and 6/15 for class 1. Alternatively, if we apply SMOTE and focus more on the accuracy of class 1, we obtain 69% accuracy, more evenly distributed, with 24/34 for class 0 and 10/15 for class 1.
Experiment one, in which we used the non-balanced complete data set, showed satisfactory results of 80% for class 0 (no event) and 73% for class 1 (event). This demonstrates the ability of the SAIN method to work with imbalanced data; it also suggests that retaining records with missing values (experiment one) is preferable to removing them (experiment two). Furthermore, the selection of variables with a genetic algorithm also showed improvement. The genetic algorithm, included in the SAIN software, can also help select biomarker variables in other cases (see Figure 8).
Figure 8. An individual model of survival showing the closest neighbouring samples in the top 3 ranked variables.

6. Data and Software Availability

The data has been obtained from the UCI Cleveland repository (https://archive.ics.uci.edu/dataset/45/heart+disease); the EEG data is available from https://github.com/KEDRI-AUT/NeuCube-Py/tree/master/example_data. Access to the software is available on request.

7. Conclusions

The paper presents a new search and inference method, called SAIN, for multimodal data integration and personalised model creation based on these multimodal data. The model not only evaluates the outcome for a person more accurately than traditional machine learning methods that use a single modality of data, but it also explains the proposed solution in terms of probabilities and a visual explanation.
In its current form, the paper is directed more towards revealing a new methodology and algorithms than towards full-scale medical applications. However, we have illustrated the methods using hypothetical and real health and medical data sets. Further utilisation of the proposed framework is currently being developed for large-scale biomedical data.
The proposed new method offers new functionality and features for personalised search and model creation in multimodal data, some of which are listed below:
  • The method is suitable for multimodal data searches in heterogeneous data sets, e.g., numbers, text, images, sound, categorical data;
  • it is suitable for personalised model creation to classify or predict specific outcomes based on multimodal and heterogeneous data;
  • it uses a similarity measure based on multicriteria metrics, avoiding inaccurate measurement of similarity on a large number of heterogeneous variables;
  • its search is fast even on large data sets and includes advanced personalised searches with multiple parameters and features;
  • it facilitates multiple solutions with corresponding probabilities;
  • it is suitable for unsupervised clustering in multimodal heterogeneous data.
The proposed method is implemented as a computer system and applied to several case studies to illustrate its advantages and applicability. The SAIN method described in Section 4 was implemented as a software system.
In conclusion, integrating all possible data modalities for a single subject to predict or classify the subject's state in relation to existing ones is an open problem in data science. While the creation of personalised models based on single-modality data [23] and the clustering of single-modality data into a single cluster [24] have been successfully developed, the theory, framework, and algorithms proposed in this paper are the first to integrate all data modalities for a single subject into a single vector-based representation and to make an inference based on it. For the first time, time series, such as EEG and ECG data, are included in this unified representation after suitable encoding; in this respect, spike encoding of time series is used, integrating statistical and brain-inspired information representation. The human brain integrates sensory data modalities into its spatio-temporal structure, and brain-inspired models using spike information representation have already been developed for learning [6,11,25] and for explanation of the learned patterns [26]. However, brain-inspired computers are still at an early stage of development [11], and even when they mature, they may not be able to integrate all possible modalities of data into one brain-inspired model.
This paper offers a solution to the problem of multimodal personalised data integration and inference that is characterised by six novel features: (1) it includes all possible modalities of data; (2) it can be implemented on any conventional computer platform; (3) it takes into account the differences across data modalities by offering different distance measures; (4) it offers a new way of ranking existing multimodal objects in order of similarity to a new multimodal object and uses these ranks for building multiple neighbourhood clusters; (5) it offers a probability-based inference with the use of the different similarity clusters; (6) it explains the inferred results, both in terms of probabilities and visual representation. The proposed method is planned to be applied to large-scale multimodal data for biomedical and health applications in the future.

Acknowledgments

We thank Dr. Elena Calude for her contributions to the mathematical model. We also thank the referees for their suggestions, which improved the paper.

References

  1. Budhraja, S.; Singh, B.; Doborjeh, M.; Doborjeh, Z.; Tan, S.; Lai, E.; Goh, W.; Kasabov, N. Mosaic LSM: A Liquid State Machine Approach for Multimodal Longitudinal Data Analysis. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN). IEEE, 2023, pp. 1–8. [CrossRef]
  2. AbouHassan, I.; Kasabov, N.K.; Jagtap, V.; Kulkarni, P. Spiking neural networks for predictive and explainable modelling of multimodal streaming data with a case study on financial time series and online news. Sci. Rep. 2023, 13, 18367. [CrossRef]
  3. Rodrigues, F.; Markou, I.; Pereira, F.C. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Information Fusion 2019, 49, 120–129. [CrossRef]
  4. Li, J.; Liu, J.; Zhou, S.; Zhang, Q.; Kasabov, N.K. GeSeNet: A General Semantic-Guided Network With Couple Mask Ensemble for Medical Image Fusion. IEEE Transactions on Neural Networks and Learning Systems 2024, 35, 16248–16261. [CrossRef]
  5. Doborjeh, Z.; Doborjeh, M.; Sumich, A.; Singh, B.; Merkin, A.; Budhraja, S.; Goh, W.; Lai, E.M.; Williams, M.; Tan, S.; et al. Investigation of social and cognitive predictors in non-transition 'ultra-high-risk' individuals for psychosis using spiking neural networks. Schizophrenia 2023, 9, 10. [CrossRef]
  6. Kasabov, N.K. NeuCube: A spiking neural network architecture for mapping, learning and understanding of spatio-temporal brain data. Neural networks 2014, 52, 62–76. [CrossRef]
  7. Kasabov, N. Global, local and personalised modeling and pattern discovery in bioinformatics: An integrated approach. Pattern Recognition Letters 2007, 28, 673–685. [CrossRef]
  8. Kasabov, N. Data Analysis and Predictive Systems and Related Methodologies, U.S. Patent 9,002,682 B2, 7 April 2015.
  9. Doborjeh, M.; Doborjeh, Z.; Merkin, A.; Bahrami, H.; Sumich, A.; Krishnamurthi, R.; Medvedev, O.N.; Crook-Rumsey, M.; Morgan, C.; Kirk, I.; et al. Personalised predictive modelling with brain-inspired spiking neural networks of longitudinal MRI neuroimaging data and the case study of dementia. Neural Networks 2021, 144, 522–539. [CrossRef]
  10. Kasabov, N.K. Evolving connectionist systems, 2 ed.; Springer: London, England, 2007.
  11. Kasabov, N.K. Time-space, Spiking Neural Networks and Brain-inspired Artificial Intelligence; Vol. 750, Springer, 2019.
  12. Santomauro, D.F.; et al. Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. The Lancet 2021, 398, 1700–1712.
  13. Swaddiwudhipong, N.; Whiteside, D.J.; Hezemans, F.H.; Street, D.; Rowe, J.B.; Rittman, T. Pre-diagnostic cognitive and functional impairment in multiple sporadic neurodegenerative diseases. bioRxiv 2022. [CrossRef]
  14. Kasabov, N.K.; Feigin, V.; et al. Improved method and system for predicting outcomes based on spatio/spectro-temporal data, 2015.
  15. Paprotny, D.; Morales-Nápoles, O.; Worm, D.T.; Ragno, E. BANSHEE–A MATLAB toolbox for non-parametric Bayesian networks. SoftwareX 2020, 12, 100588. [CrossRef]
  16. Koot, P.; Mendoza-Lugo, M.A.; Paprotny, D.; Morales-Nápoles, O.; Ragno, E.; Worm, D.T. PyBanshee version (1.0): A Python implementation of the MATLAB toolbox BANSHEE for Non-Parametric Bayesian Networks with updated features. SoftwareX 2023, 21, 101279. [CrossRef]
  17. Mendoza-Lugo, M.A.; Morales-Nápoles, O. Version 1.3-BANSHEE—A MATLAB toolbox for Non-Parametric Bayesian Networks. SoftwareX 2023, 23, 101479. [CrossRef]
  18. Calude, C.; Calude, E. A metrical method for multicriteria decision making. St. Cerc. Mat 1982, 34, 223–234.
  19. Calude, C. A simple non-uniform operation. Bull. Eur. Assoc. Theor. Comput. Sci. 1983, 20, 40–46.
  20. Akhtarzada, A.; Calude, C.S.; Hosking, J. A Multi-Criteria Metric Algorithm for Recommender Systems. Fundamenta Informaticae 2011, 110, 1–11. [CrossRef]
  21. Kahramanli, H.; Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert systems with applications 2008, 35, 82–89. [CrossRef]
  22. Gleeson, S.; Liao, Y.W.; Dugo, C.; Cave, A.; Zhou, L.; Ayar, Z.; Christiansen, J.; Scott, T.; Dawson, L.; Gavin, A.; et al. ECG-derived spatial QRS-T angle is associated with ICD implantation, mortality and heart failure admissions in patients with LV systolic dysfunction. PLOS ONE 2017, 12, e0171069. [CrossRef]
  23. Song, Q.; Kasabov, N. TWNFI–a transductive neuro-fuzzy inference system with weighted data normalization for personalized modeling. Neural Netw. 2006, 19, 1591–1596. [CrossRef]
  24. Kasabov, N., NeuCube EvoSpike Architecture for Spatio-temporal Modelling and Pattern Recognition of Brain Signals. In Artificial Neural Networks in Pattern Recognition; Springer Berlin Heidelberg, 2012; p. 225–243. [CrossRef]
  25. Kumarasinghe, K.; Kasabov, N.; Taylor, D. Deep learning and deep knowledge representation in Spiking Neural Networks for Brain-Computer Interfaces. Neural Networks 2020, 121, 169–185. [CrossRef]
  26. Futschik, M.; Kasabov, N. Fuzzy clustering of gene expression data. In Proceedings of the 2002 IEEE World Congress on Computational Intelligence. 2002 IEEE International Conference on Fuzzy Systems. FUZZ-IEEE’02. Proceedings (Cat. No.02CH37291), 2002, Vol. 1, pp. 414–419 vol.1. [CrossRef]
Figure 4. Every time series can be represented as a 3-value vector through a spike encoding method over time [11]. If at a time t the time series is increasing in value, there will be a positive spike (1), if decreasing a negative spike (-1), and if no change, no spike (0) (left figure). Each element in this vector represents the signal change at a time. The original signal can be recovered over time using this vector (right figure) if necessary. The length of the vector is equal to the time points measured (reproduced from [11]).
Figure 5. EEG signals from EEG electrodes spatially distributed on the scalp are spatio-temporal signals (left figure). Each time series signal from an electrode is measured every 1 millisecond. The figure on the right shows the measurements of 14 EEG electrodes over 124 milliseconds. Each signal can be encoded into a 124-element vector according to Figure 4, making altogether 14 such vectors to be processed in the SAIN framework (reproduced from [11]).
Figure 6. ECG (electrocardiogram) signals (a: noisy and b: filtered) can be encoded into binary vectors according to the spike encoding method from Figure 4. Spike encoding is robust to noise, as any noise below a threshold would not cause the generation of a spike (either positive or negative), and the encoder will act as a filter. This vector's length will equal the number of measurement time points. The vector data can be further processed in the SAIN framework.
Figure 7. Sample of the classification breakdown and the confusion matrix.
Table 1. Unlabelled database
Objects/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
$o_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$
...
$o_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$
...
$o_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$
$w$ | $w_1$ | $w_2$ | ... | $w_j$ | ... | $w_n$
Table 2. Labelled database
Objects/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Class label
$o_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$ | $l_1$
...
$o_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$ | $l_i$
...
$o_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$ | $l_m$
Table 3. Weights
Criteria weights | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
$w$ | $w_1$ | $w_2$ | ... | $w_j$ | ... | $w_n$
Table 4. New unlabelled object
Object/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
$x$ | $x_1$ | $x_2$ | ... | $x_j$ | ... | $x_n$
Table 5. The hypothetical object
Object/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Class label
$o_E$ | $n_1$ | $n_2$ | ... | $n_j$ | ... | $n_n$ | $l_h$
Table 6. Example of labelled data (the three rows of each 3x3 image are separated by "/"; $\ast$ marks a missing value)
$c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ | $c_6$ | $c_7$ | Class label
68.2 | 0 | 6789 | small | red | 0,1,-1,-1,1,1,0,0,1,-1 | 1,1,0 / 0,0,1 / 0,0,1 | 1
93 | 1 | 98000 | medium | yellow | 0,-1,-1,-1,-1,0,0,1,-1,1 | 1,0,0 / 0,0,1 / 0,0,1 | 1
44.5 | 1 | 5600 | large | red | 0,1,-1,1,-1,1,0,0,1,-1 | 1,1,0 / 1,0,1 / 1,1,1 | 1
56.8 | 0 | 89 | small | white | 1,-1,-1,-1,-1,1,0,0,1,-1 | 1,1,0 / 0,1,1 / 1,0,1 | 1
26.3 | 0 | 9456 | large | black | 1,-1,-1,-1,0,1,0,0,1,-1 | 1,1,0 / 1,1,1 / 1,0,1 | 2
81.5 | 1 | 78955 | medium | red | 0,1,-1,1,-1,-1,0,0,1,-1 | 1,1,0 / 0,0,1 / 1,1,1 | 2
56.7 | 1 | 68900 | small | black | 1,-1,-1,1,-1,1,0,0,1,1 | 1,1,1 / 0,0,1 / 1,1,1 | 2
20 | 0 | 7833 | large | yellow | 1,1,-1,-1,1,1,0,-1,-1,1 | 1,0,0 / 0,0,1 / 1,1,1 | 2
20 | 0 | 7833 | $\ast$ | yellow | 1,1,-1,-1,1,1,0,-1,-1,1 | 1,0,0 / 0,0,1 / 1,1,1 | 2
Table 7. Example of new unlabelled object
48.5 | 1 | 45679 | large | red | 1,0,0,-1,1,-1,1,0,0,1 | 1,1,0 / 0,0,1 / 1,0,1
Table 8. Coded labelled data (each colour code is also shown as its 24-bit binary expansion; $\ast$ marks a missing value)
$o_1$ | 68.2 | 0 | 6789 | 0 | FF0000 (111111110000000000000000) | 0122110012 | 110001001 | 1
$o_2$ | 93 | 0 | 98000 | 1 | FFFF00 (111111111111111100000000) | 0222200121 | 110001001 | 1
$o_3$ | 44.5 | 1 | 5600 | 2 | FF0000 (111111110000000000000000) | 0121210012 | 110101111 | 1
$o_4$ | 56.8 | 0 | 89 | 0 | FFFFFF (111111111111111111111111) | 1222210012 | 110011101 | 1
$o_5$ | 26.3 | 0 | 9456 | 2 | 000000 (000000000000000000000000) | 1222010012 | 110111101 | 2
$o_6$ | 81.5 | 1 | 78955 | 1 | FF0000 (111111110000000000000000) | 0121220012 | 110001111 | 2
$o_7$ | 56.7 | 1 | 68900 | 0 | 000000 (000000000000000000000000) | 1221210011 | 111001111 | 2
$o_8$ | 20 | 0 | 7833 | 2 | FFFF00 (111111111111111100000000) | 1122110221 | 100001111 | 2
$o_9$ | 20 | 0 | 7833 | $\ast$ | FFFF00 (111111111111111100000000) | 1122110221 | 100001111 | 2
Table 9. New unlabelled object coded
$x$ | 48.5 | 1 | 45679 | 2 | FF0000 (111111110000000000000000) | 1002121001 | 110001101
Table 10. Coded labelled normalised data ($\ast$ marks a missing value)
$o_1$ | 0.682 | 0 | 0.06789 | 0 | 0.2 | 0.0122110012 | 0.110001001
$o_2$ | 0.93 | 1 | 0.98 | 0.5 | 0.6 | 0.0222200121 | 0.100001001
$o_3$ | 0.445 | 1 | 0.056 | 1 | 0.2 | 0.0121210012 | 0.110101111
$o_4$ | 0.568 | 0 | 0.00089 | 0 | 1 | 0.1222210012 | 0.110011101
$o_5$ | 0.263 | 0 | 0.09456 | 1 | 0 | 0.1222010012 | 0.110111101
$o_6$ | 0.815 | 1 | 0.78955 | 0.5 | 0.2 | 0.0121220012 | 0.110001111
$o_7$ | 0.567 | 1 | 0.689 | 0 | 0 | 0.1221210011 | 0.111001111
$o_8$ | 0.2 | 0 | 0.07833 | 1 | 0.6 | 0.1122110221 | 0.100001111
$o_9$ | 0.2 | 0 | 0.07833 | $\ast$ | 0.6 | 0.1122110221 | 0.100001111
Table 11. New unlabelled object coded and normalised
$x$ | 0.485 | 1 | 0.45679 | 1 | 0.2 | 0.1002121001 | 0.110001101
Table 12. Normalised distances from the new object to all objects
| $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ | Total
$d(o_1,x)$ | 0.197 | 1 | 0.3889 | 1 | 0 | 0.4 | 0.11111111 | 3.09701111
$d(o_2,x)$ | 0.445 | 0 | 0.52321 | 0.5 | 0.33333333 | 0.6 | 0.22222222 | 2.62376556
$d(o_3,x)$ | 0.04 | 0 | 0.40079 | 0 | 0 | 0.5 | 0.22222222 | 1.16301222
$d(o_4,x)$ | 0.083 | 1 | 0.4559 | 1 | 0.66666667 | 0.45 | 0.11111111 | 3.76667778
$d(o_5,x)$ | 0.222 | 1 | 0.36223 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.58978556
$d(o_6,x)$ | 0.33 | 0 | 0.33276 | 0.5 | 0 | 0.45 | 0.11111111 | 1.72387111
$d(o_7,x)$ | 0.082 | 0 | 0.23221 | 1 | 0.33333333 | 0.45 | 0.22222222 | 2.31976556
$d(o_8,x)$ | 0.285 | 1 | 0.37846 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.66901556
$d(o_9,x)$ | 0.285 | 1 | 0.37846 | 1 | 0.33333333 | 0.45 | 0.22222222 | 3.66901556
Table 13. Ranking of the distances in Table 12 in increasing order
| $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ | Total
$d(o_3,x)$ | 0.04 | 0 | 0.40079 | 0 | 0 | 0.5 | 0.22222222 | 1.16301222
$d(o_6,x)$ | 0.33 | 0 | 0.33276 | 0.5 | 0 | 0.45 | 0.11111111 | 1.72387111
$d(o_7,x)$ | 0.082 | 0 | 0.23221 | 1 | 0.33333333 | 0.45 | 0.22222222 | 2.31976556
$d(o_5,x)$ | 0.222 | 1 | 0.36223 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.58978556
$d(o_2,x)$ | 0.445 | 0 | 0.52321 | 0.5 | 0.33333333 | 0.6 | 0.22222222 | 2.62376556
$d(o_8,x)$ | 0.285 | 1 | 0.37846 | 0 | 0.33333333 | 0.45 | 0.22222222 | 2.66901556
$d(o_1,x)$ | 0.197 | 1 | 0.3889 | 1 | 0 | 0.4 | 0.11111111 | 3.09701111
$d(o_9,x)$ | 0.285 | 1 | 0.37846 | 1 | 0.33333333 | 0.45 | 0.22222222 | 3.66901556
$d(o_4,x)$ | 0.083 | 1 | 0.4559 | 1 | 0.66666667 | 0.45 | 0.11111111 | 3.76667778
Table 14. An exemplar object
$o_E$ | 0.2 | 0 | 0.00089 | 0 | 1 | 0.1222210012 | 0.100001001
Table 15. Vectors $V_0, \dots, V_7$
$V_0$ | $V_1$ | $V_2$ | $V_3$ | $V_4$ | $V_5$ | $V_6$ | $V_7$
1.469 | 0.987 | 1.469 | 1.402 | 1.469 | 0.669 | 1.359 | 1.459
3.709 | 2.979 | 2.709 | 2.730 | 3.209 | 3.309 | 3.609 | 3.709
3.220 | 2.975 | 2.220 | 3.165 | 2.220 | 2.420 | 3.110 | 3.210
0.378 | 0.010 | 0.378 | 0.378 | 0.378 | 0.378 | 0.378 | 0.368
2.167 | 2.104 | 2.167 | 2.073 | 1.167 | 1.167 | 2.167 | 2.157
3.824 | 3.209 | 2.824 | 3.035 | 3.324 | 3.024 | 3.714 | 3.814
3.066 | 2.699 | 2.066 | 2.378 | 3.066 | 2.066 | 3.066 | 3.055
1.487 | 1.487 | 1.487 | 1.410 | 0.487 | 1.087 | 1.477 | 1.487
1.487 | 1.487 | 1.487 | 1.410 | 0.487 | 1.087 | 1.477 | 1.487
Table 16. Distances $\mathrm{dist}(V_0, V_t)$ and (normalised) weights
Distances | 2.870 | 4.00 | 2.826 | 5.00 | 5.60 | 0.450 | 0.061
Weights | 0.137 | 0.192 | 0.135 | 0.240 | 0.269 | 0.021 | 0.002
Table 17. Survival database
Patients/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$ | Units of time
$p_1$ | $a_{1,1}$ | $a_{1,2}$ | ... | $a_{1,j}$ | ... | $a_{1,n}$ | $t_1$
...
$p_i$ | $a_{i,1}$ | $a_{i,2}$ | ... | $a_{i,j}$ | ... | $a_{i,n}$ | $t_i$
...
$p_m$ | $a_{m,1}$ | $a_{m,2}$ | ... | $a_{m,j}$ | ... | $a_{m,n}$ | $t_m$
Table 18. The new patient record
Patient/Criteria | $c_1$ | $c_2$ | ... | $c_j$ | ... | $c_n$
$p$ | $x_1$ | $x_2$ | ... | $x_j$ | ... | $x_n$
Table 19. Patient records ($\ast$ marks a missing value)
Patients | $c_1$ | $c_2$ | $c_3$ | $c_4$ | $c_5$ | $c_6$ | $c_7$ | Units of time
$p_1$ | 0.682 | 0 | 0.06789 | 0 | 0.2 | 0.012211001 | 0.110001001 | 12.3
$p_2$ | 0.93 | 1 | 0.98 | 0.5 | 0.6 | 0.022220012 | 0.100001001 | 15
$p_3$ | 0.445 | 1 | 0.056 | 1 | 0.2 | 0.012121001 | 0.110101111 | 68
$p_4$ | 0.568 | 0 | 0.00089 | 0 | 1 | 0.122221001 | 0.110011101 | 1.4
$p_5$ | 0.263 | 0 | 0.09456 | 1 | 0 | 0.122201001 | 0.110111101 | 40.5
$p_6$ | 0.815 | 1 | 0.78955 | 0.5 | 0.2 | 0.012122001 | 0.110001111 | 97.2
$p_7$ | 0.567 | 1 | 0.689 | 0 | 0 | 0.122121001 | 0.111001111 | 97.2
$p_8$ | 0.2 | 0 | 0.07833 | 1 | 0.6 | 0.112211022 | 0.100001111 | 55.7
$p_9$ | 0.2 | 0 | 0.07833 | $\ast$ | 0.6 | 0.112211022 | 0.100001111 | 63.7
Table 20. The new patient record
$p$ | 0.485 | 1 | 0.45679 | 1 | 0.2 | 0.1002121001 | 0.110001101
Table 21. Distances between all patients and the new patient
| $d_1$ | $d_2$ | $d_3$ | $d_4$ | $d_5$ | $d_6$ | $d_7$ | Distance $d$
$d(p_1,p)$ | 0.1970 | 1 | 0.388900 | 1.0 | 0.0 | 0.08800109890 | 0.000000100 | 2.67390119890
$d(p_2,p)$ | 0.4450 | 0 | 0.523210 | 0.5 | 0.4 | 0.07799208800 | 0.010000100 | 1.95620218800
$d(p_3,p)$ | 0.0400 | 0 | 0.400790 | 0.0 | 0.0 | 0.08809109890 | 0.000100010 | 0.52898110890
$d(p_4,p)$ | 0.0830 | 1 | 0.455900 | 1.0 | 0.8 | 0.02200890110 | 0.000010000 | 3.36091890110
$d(p_5,p)$ | 0.2220 | 1 | 0.362230 | 0.0 | 0.2 | 0.02198890110 | 0.000110000 | 1.80632890110
$d(p_6,p)$ | 0.3300 | 0 | 0.332760 | 0.5 | 0.0 | 0.08809009890 | 0.000000010 | 1.25085010890
$d(p_7,p)$ | 0.0820 | 0 | 0.232210 | 1.0 | 0.2 | 0.02190890100 | 0.001000010 | 1.53711891100
$d(p_8,p)$ | 0.2850 | 1 | 0.378460 | 0.0 | 0.4 | 0.01199892200 | 0.009999990 | 2.08545891200
$d(p_9,p)$ | 0.2850 | 1 | 0.378460 | 1.0 | 0.4 | 0.01199892200 | 0.009999990 | 3.08545891200
Table 22. The 14 variables used in the heart disease diagnosis case
Name | Data type | Definition
age | integer | age in years
sex | binary | sex
cp | {1,2,3,4} | chest pain type
trestbps | integer | resting blood pressure
chol | integer | serum cholesterol in mg/dl
fbs | binary | fasting blood sugar > 120 mg/dl
restecg | {0,1,2} | resting electrocardiographic results
thalach | integer | maximum heart rate achieved
exang | binary | exercise-induced angina
oldpeak | float | ST depression induced by exercise relative to rest
slope | {1,2,3} | the slope of the peak exercise ST segment
ca | {0,1,2,3} | number of major vessels coloured by fluoroscopy
thal | {3,6,7} | heart status
num | {0,1,2,3,4} | diagnosis of heart disease
Table 23. Database of EEG records
Record | Channel 1 | Channel 2 | Channel 3 | Label
$R_1$ | (1, 1, -1, 0, 1) | (0, 1, 1, 1, -1) | (1, 1, -1, -1, 0) | 1
$R_2$ | (1, 0, -1, 0, 1) | (0, 1, 1, 1, -1) | (1, 0, -1, -1, 1) | 1
$R_3$ | (1, 1, -1, 0, 1) | (0, -1, 1, 1, -1) | (1, 1, -1, 0, 1) | 2
$R_4$ | (1, 1, -1, 0, 1) | (0, -1, 1, 0, -1) | (1, 1, -1, 0, 1) | 2
$R_5$ | (1, 1, -1, 0, 0) | (0, -1, 0, 1, -1) | (1, 1, -1, 1, 1) | 3
$R_6$ | (1, -1, -1, 0, 1) | (0, -1, 1, 0, -1) | (1, 1, -1, 0, 1) | 3