Preprint Article. This version is not peer-reviewed.

SAIN: Search-And-INfer, A Mathematical and Computational Framework for Personalised Multimodal Data Modelling with Applications in Health Care

Submitted: 26 June 2025. Posted: 1 July 2025.


Abstract
Personalised modelling has become a dominant approach in personalised medicine and precision health. It creates a computational model for an individual, based on large repositories of existing personalised data, aiming to achieve the best possible personal diagnosis or prognosis and to derive an informative explanation for it. Current methods still work on a single data modality or treat all modalities with the same method. The proposed method, SAIN (Search-And-INfer), offers better results and an informative explanation for classification and prediction tasks on a new multimodal object (sample) using a database of similar multimodal objects. The method is based on different distance measures suitable for each data modality and introduces a new formula to aggregate all modalities into a single vector distance measure, used to find the objects closest to a new one and then to make a probabilistic inference. The paper describes SAIN and applies it to two types of multimodal data, cardiovascular diagnosis and EEG time series, modelled by integrating modalities such as numbers, categories, images, and time series, using a software implementation of SAIN. The method offers personalised explainability for medical diagnosis or prognosis.

Keywords: search in multimodal data; inference in multimodal data; personalised modelling; precision health

1. Introduction

Multimodal data processing has become a new data science trend with applications in neuroimaging, health diagnosis and prognosis, environmental modelling, and financial modelling, see [1,2,3].
Methods for searching for relevant items in databases have been developed and used for decades, improving in accuracy and speed. These searches are a significant part of personalised modelling (e.g. precision medicine), where an optimal model is created for a given person x, represented by a data vector X, using a database D of past personalised records with labelled outcomes, in order to predict the behaviour of x. Most methods (e.g. [3,4]) select a subset $D_X$ of the vectors closest to X from the database D (for example, the K nearest neighbours) and build a machine learning model using this subset $D_X$. To select the subset $D_X$, the method searches D using predominantly Euclidean or Hamming distances to measure the similarity between the new vector X and the vectors in D. These methods have been applied in many applications and constitute the state of the art in the field (e.g. [5,6,7,8]).
The enormous growth of personal multimodal data worldwide demands more advanced personalisation of search and inference methods. Most methods for multimodal data represent all modalities of data for one object as a single vector and then apply a single machine learning method, such as a deep neural network or a statistical regression (e.g. [9,10]). In these cases, the specificity of each data modality cannot be considered, which negatively impacts the inference results and their explanation.
The proposed new method offers new functionality and features for personalised search and model creation in multimodal data, some of which are listed below. The method
  • is suitable for multimodal searches in heterogeneous data sets, e.g. numbers, text, images, sound, and categorical data;
  • uses a novel mathematical similarity measure superseding the single distance (e.g. Euclidean, Hamming) used in existing methods, thereby avoiding inaccurate measurement of similarity over a large number of heterogeneous variables;
  • searches fast even in large data sets, with millions of records and thousands of variables;
  • includes advanced personalised searches with multiple parameters and other features;
  • facilitates multiple solutions with corresponding probabilities;
  • is suitable for unsupervised clustering of multimodal heterogeneous data;
  • is suitable for personalised model creation to classify or predict specific outcomes based on multimodal and heterogeneous data.

2. Mathematical Description

In this section, we present the mathematical method.

2.1. Database

We will work with multidimensional data described as follows:
  • $m > 1$ objects (samples) $o_1, \dots, o_m$;
  • each object $o_i$ ($1 \le i \le m$) is defined by $n > 1$ criteria (variables) $c_1, \dots, c_n$ with values in linearly ordered domains $D_j$ with $\min D_j$ and $\max D_j$; if some value $a_{i,j} \in D_j$ ($1 \le i \le m$, $1 \le j \le n$) is either missing or uncertain, then it is recorded as $\ast$;
  • $n > 1$ weights $w_1, \dots, w_n$ in $[0,1]$ with $\sum_{i=1}^{n} w_i = 1$, where each $w_i$ ($1 \le i \le n$) quantifies the importance of the criterion $c_i$; if $w_i = \frac{1}{n}$ for all $1 \le i \le n$, then all criteria are equally important; a criterion $c_i$ is ignored if $w_i = 0$.
Data of independent variables are organised as follows:
Table 1. Unlabelled database.

Objects/Criteria   $c_1$       $c_2$       ...   $c_j$       ...   $c_n$
$o_1$              $a_{1,1}$   $a_{1,2}$   ...   $a_{1,j}$   ...   $a_{1,n}$
...
$o_i$              $a_{i,1}$   $a_{i,2}$   ...   $a_{i,j}$   ...   $a_{i,n}$
...
$o_m$              $a_{m,1}$   $a_{m,2}$   ...   $a_{m,j}$   ...   $a_{m,n}$
$w$                $w_1$       $w_2$       ...   $w_j$       ...   $w_n$

2.2. Distance Metrics

A distance metric on a space $X$ is a non-negative real-valued function $d : X \times X \to \mathbb{R}_{+}$ satisfying the following three conditions for all $x, y, z \in X$: (a) $d(x,y) = 0$ if and only if $x = y$; (b) $d(x,y) = d(y,x)$; (c) $d(x,z) \le d(x,y) + d(y,z)$.
The domains $X = D_j$ can vary greatly: they can be sets of logical values, rational numbers, percentages, digitally codified images, sounds, videos, and many others. We use a bounded distributive complemented lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ to describe the domains $D_j$ uniformly [11,12].
Here is a list of illustrative, but far from exhaustive, examples of domains $D_j$:
  • Logical Boolean domain: $(\{0,1\}, \max, \min, \bar{x}, 0, 1)$, where $\bar{x} = 1 - x$, $x \in \{0,1\}$.
  • Logical non-Boolean domain: $(\{0, \frac{1}{N-1}, \frac{2}{N-1}, \dots, \frac{N-2}{N-1}, 1\}, \max, \min, \bar{x}, 0, 1)$, where $\bar{x} = 1 - x$ for $x \in \{0, \frac{1}{N-1}, \dots, 1\}$.
  • Numerical domain with natural values: $(\{0, 1, \dots, N\}, \max, \min, \bar{x}, 0, N)$, where $\bar{x} = N - x$, $x \in \{0, 1, \dots, N\}$.
  • Numerical domain with rational values: $(\{x \mid a \le x \le A\}, \max, \min, \bar{x}, a, A)$, where $\bar{x} = A - x$, $a \le x \le A$.
  • Binary code: $(\{0,1\}^n, \max, \min, \bar{x}, 00\cdots0, 11\cdots1)$, where the domain consists of all binary strings of length $n$, $\{0,1\}^n = \{x_1 x_2 \cdots x_n \mid x_i \in \{0,1\}\}$, and, for all $x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n \in \{0,1\}^n$, $\max(x_1 \cdots x_n, y_1 \cdots y_n) = \max(x_1,y_1)\max(x_2,y_2)\cdots\max(x_n,y_n)$, $\min(x_1 \cdots x_n, y_1 \cdots y_n) = \min(x_1,y_1)\min(x_2,y_2)\cdots\min(x_n,y_n)$, and $\overline{x_1 x_2 \cdots x_n} = (1-x_1)(1-x_2)\cdots(1-x_n)$.
In the lattice $(L, \vee, \wedge, \bar{x}, 0, 1)$ we introduce, following [11], the metric
$$d(x,y) = \begin{cases} (x \wedge \bar{y}) \vee (\bar{x} \wedge y), & \text{if } x \ne y, \\ 0, & \text{otherwise}, \end{cases}$$
for $x, y \in L$. This metric $d$ can be extended to $L \cup \{\ast\}$ as follows:
$$d_{\ast}(x,y) = \begin{cases} d(x,y), & \text{if } x, y \in L, \\ \sigma(x), & \text{if } x \in L \text{ and } y = \ast, \\ \sigma(y), & \text{if } y \in L \text{ and } x = \ast, \\ 0, & \text{otherwise}, \end{cases}$$
where $\sigma(x) = \max(x, \bar{x})$.
The metrics $d_{\ast,i}$ on $L_i \cup \{\ast\}$, $1 \le i \le n$, can be extended to $(L_i \cup \{\ast\})^n$, i.e. to $n$-dimensional vectors, as follows:
$$d_{\ast}(x_1 x_2 \cdots x_n, y_1 y_2 \cdots y_n) = \sum_{i=1}^{n} d_{\ast,i}(x_i, y_i),$$
where $x_i, y_i \in L_i \cup \{\ast\}$, $1 \le i \le n$.
In what follows, we write $d$ for $d_{\ast}$ when the meaning is clear from the context.
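To make the definitions concrete, here is a minimal Python sketch of the metric $d$, its extension $d_{\ast}$ to missing values, and the vector distance, for a numerical domain $[a, A]$; the function and variable names are our own illustration, not the SAIN software API.

```python
# A minimal sketch of the lattice metric d and its extension d_* to
# missing values (*), for a numerical domain [a, A]. Names are illustrative.
MISSING = None  # stands for the symbol *

def complement(x, a=0.0, A=1.0):
    """Lattice complement x-bar = A - x on the domain [a, A]."""
    return A - x

def d(x, y, a=0.0, A=1.0):
    """d(x, y) = (x AND y-bar) OR (x-bar AND y) if x != y, else 0.
    On a linearly ordered numerical domain, AND = min and OR = max.
    For the Boolean domain: d(0, 1) = 1."""
    if x == y:
        return 0.0
    return max(min(x, complement(y, a, A)), min(complement(x, a, A), y))

def sigma(x, a=0.0, A=1.0):
    """sigma(x) = max(x, x-bar), the distance to an unknown value."""
    return max(x, complement(x, a, A))

def d_star(x, y, a=0.0, A=1.0):
    """Extension of d to the domain augmented with the missing value *."""
    if x is not MISSING and y is not MISSING:
        return d(x, y, a, A)
    if x is not MISSING:               # y is missing
        return sigma(x, a, A)
    if y is not MISSING:               # x is missing
        return sigma(y, a, A)
    return 0.0                         # both missing

def vector_distance(xs, ys):
    """d_*(x1...xn, y1...yn) = sum_i d_*(x_i, y_i) (unweighted, as above)."""
    return sum(d_star(x, y) for x, y in zip(xs, ys))
```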

2.3. Tasks Specification

The data is now organised as a database of objects augmented with a column of class labels (Table 2), the weights of the criteria (Table 3), and a new unlabelled object (Table 4).
Additional information associated with the data in Table 2 may include the range of each criterion $c_j$ and the associated specific distance, e.g. the Euclidean distance for real numbers and the distance $d_{\ast}$ for binary strings or strings over a non-binary alphabet (e.g. for images or colours).
We consider the following tasks:
Task 1: Calculate the distance (or similarity metric) between the new object $x$ and each object in Table 2. If the distance corresponding to criterion $c_i$ is $d_i$, then
$$d(o_j, x) = \sum_{i=1}^{n} w_i \cdot d_i(a_{j,i}, x_i).$$
Task 2: Given a threshold $\delta > 0$, calculate all objects $o_i$ at distance at most $\delta$ from $x$.
Task 3: Calculate the probability that a new object belongs to a labelled class (e.g. low risk vs. high risk) using a threshold $\delta$ and Table 2.
Task 4: Rank the criteria in Table 2 and determine the marker criterion (or criteria), that is, the most important one(s).
Task 5: Assign alternative weights to the criteria.
Task 6: Test the accuracy of the data and method used for Task 4.

2.4. Tasks Solutions

For Task 1, we calculate the distances $d(o_i, x)$ between each object $o_i$ in Table 2 and the object $x$ in Table 4.
For Task 2, given a threshold $\delta > 0$, we calculate all objects in Table 2 at distance at most $\delta$ from $x$, that is, the objects which are $\delta$-similar to $x$:
$$C_{\delta,x} = \{ o_i \mid d(x, o_i) \le \delta,\ 1 \le i \le m \},$$
and its complement $\overline{C_{\delta,x}}$.
For Task 3, we calculate the probability that $x$ is in the class with label $l_t$, which is the ratio of the number of objects in $C_{\delta,x}$ with the label $l_t$ to the size of the cluster $C_{\delta,x}$:
$$Prob(x \text{ has label } l_t) = \frac{\#\{ o_i \in C_{\delta,x} \mid l_i = l_t \}}{\#(C_{\delta,x})},$$
where $\#S$ denotes the number of elements of the set $S$.
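These three solutions translate directly into code. Below is a hedged Python sketch, assuming one distance function per criterion and the weighted sum from Task 1; all names are illustrative, not the SAIN software API.

```python
def weighted_distance(obj, x, dists, weights):
    """Task 1: d(o_j, x) = sum_i w_i * d_i(a_{j,i}, x_i)."""
    return sum(w * d(a, b) for w, d, a, b in zip(weights, dists, obj, x))

def delta_cluster(objects, x, dists, weights, delta):
    """Task 2: indices of all objects at distance at most delta from x
    (the delta-similar cluster C_{delta,x})."""
    return [i for i, o in enumerate(objects)
            if weighted_distance(o, x, dists, weights) <= delta]

def class_probability(cluster, labels, target):
    """Task 3: fraction of objects in the delta-similar cluster carrying
    the target label."""
    if not cluster:
        return 0.0
    return sum(1 for i in cluster if labels[i] == target) / len(cluster)
```

For example, with $d_i(a, b) = |a - b|$ for every criterion and equal weights, these functions compute clusters and class probabilities of the kind used in the example of Section 2.5.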
For Task 4, we work with Table 2. Recall that for each criterion $c_i$ we have a domain $D_i$ augmented with the information "high" or "low", indicating whether higher or lower values are desirable. Based on this information, we can construct a hypothetical object which has the most desirable value for each criterion: one could see this object as an "exemplar" object.
Table 5. Hypothetical exemplar object.

Object/Criteria   $c_1$   $c_2$   ...   $c_j$   ...   $c_n$   Class label
$o_E$             $n_1$   $n_2$   ...   $n_j$   ...   $n_n$   $l_h$

Sometimes, criteria are interrelated or correlated. This means that in some cases there is no unique "exemplar object"; instead, several of them have to be studied when ranking the importance of the criteria.
For example, fix an "exemplar object" $o_E$. The procedure is described below and sketched in code after this list.
  • Compute the distances $d(o_i, o_E)$ between each object $o_i$ in Table 2 and $o_E$, obtaining a vector with $m$ non-negative real components, $V_0 = (d_1^0, \dots, d_m^0)$.
  • For each $1 \le t \le n$, compute the distances $d(o_i, o_E)$ taking into consideration all criteria in Table 2 except $c_t$, obtaining the vector $V_t = (d_1^t, \dots, d_m^t)$.
  • Compute the distances $dist(V_0, V_t)$, $1 \le t \le n$, using the formula
    $$dist(V_0, V_t) = \sum_{i=1}^{m} |d_i^0 - d_i^t|,$$
    and sort them in increasing order. The criterion $c_t$ is a marker if $dist(V_0, V_t) \ge dist(V_0, V_j)$ for every $1 \le j \le n$.
We repeat this procedure for each "exemplar object" and study possible variations.
For Task 5, normalise the distances $dist(V_0, V_t)$ and use these values to construct alternative weights $w_t^{*}$, $1 \le t \le n$.
For Task 6, assume we have weights $(w_i)$ associated with Table 2 (see Table 1). To test the accuracy of the data and the method used for Task 4, compare the original weights $(w_i)$ with $(w_i^{*})$. Serious discrepancies signal issues either with the data or with the choices made in applying the method.
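A hedged Python sketch of this leave-one-criterion-out procedure for Tasks 4 and 5 follows; as before, the names are illustrative and the per-criterion distance functions are assumed given.

```python
def exemplar_distances(objects, exemplar, dists, skip=None):
    """Vector of distances d(o_i, o_E), optionally ignoring criterion `skip`."""
    return [sum(d(a, e)
                for j, (d, a, e) in enumerate(zip(dists, o, exemplar))
                if j != skip)
            for o in objects]

def criterion_impacts(objects, exemplar, dists):
    """dist(V_0, V_t) = sum_i |d_i^0 - d_i^t| for each left-out criterion c_t."""
    v0 = exemplar_distances(objects, exemplar, dists)
    return [sum(abs(a - b)
                for a, b in zip(v0, exemplar_distances(objects, exemplar,
                                                       dists, skip=t)))
            for t in range(len(dists))]

def marker_and_weights(objects, exemplar, dists):
    """Task 4: the marker is the criterion with the largest impact.
    Task 5: alternative weights are the normalised impacts."""
    impacts = criterion_impacts(objects, exemplar, dists)
    total = sum(impacts) or 1.0          # guard against all-zero impacts
    marker = max(range(len(impacts)), key=impacts.__getitem__)
    return marker, [v / total for v in impacts]
```

For Task 6, the weights returned by this sketch can be compared directly with the original weights $(w_i)$, e.g. component-wise or via their absolute differences.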

2.5. An Example

We illustrate the above tasks with an example of a labelled database (see Table 6) and a new object (see Table 7), all objects having the following seven characteristics (the last column of Table 6 contains the class labels 1 and 2):
  • $c_1$: real number in $[0, 100]$, e.g. age, weight, BMI;
  • $c_2$: Boolean value $\{0, 1\}$, e.g. gender;
  • $c_3$: integer number in $[0, 100{,}000]$, e.g. gene expression;
  • $c_4$: categorical value {small, medium, large}, e.g. size of tumour, body size, keywords;
  • $c_5$: colour {red, yellow, white, black}, e.g. colour of a spot on the body or on the heart;
  • $c_6$: spike sequence over $\{-1, 0, 1\}$, e.g. encoded EEG, ECG;
  • $c_7$: black-and-white image, e.g. MRI, face image.
In this example, for simplicity, we did not use weights.
The first step is to code the data in Table 6 and Table 7. The new data is in Table 8 and Table 9.
Then, we normalise the data in Table 8 and Table 9: the entries in the first, third, and fourth columns are divided by 100, 100,000 and 2, respectively; the entries in the last three columns are transformed into reals in the unit interval; and the column of labels is removed.
In this way, we have obtained Table 10 and Table 11.
Then, we choose an appropriate distance according to each criterion. In this example, we used the Euclidean distance for all criteria.
We can compute $C_{\delta,x} = \{ o_i \mid d(o_i, x) \le \delta \}$ and, accordingly, the probability that $x$ would be labelled with class 1 or class 2.
If $\delta = 3.5$, then $C_{3.5,x} = \{o_1, o_2, o_3, o_5, o_6, o_7, o_8\}$, so the probability that $x$ is in class 1 is 2/7 and the probability that $x$ is in class 2 is 5/7. If $\delta = 2.5$, then the closest cluster is $C_{2.5,x} = \{o_2, o_3, o_5, o_6, o_7, o_8\}$, so the probability that $x$ is in class 1 is 1/3 and the probability that $x$ is in class 2 is 2/3.
Table 12. Normalised distances from the new object to all objects.

             $d_1$   $d_2$   $d_3$     $d_4$   $d_5$        $d_6$   $d_7$        Total
$d(o_1,x)$   0.197   1       0.3889    1       0            0.4     0.11111111   3.09701111
$d(o_2,x)$   0.445   0       0.52321   0.5     0.33333333   0.6     0.22222222   2.62376556
$d(o_3,x)$   0.04    0       0.40079   0       0            0.5     0.22222222   1.16301222
$d(o_4,x)$   0.083   1       0.4559    1       0.66666667   0.45    0.11111111   3.76667778
$d(o_5,x)$   0.222   1       0.36223   0       0.33333333   0.45    0.22222222   2.58978556
$d(o_6,x)$   0.33    0       0.33276   0.5     0            0.45    0.11111111   1.72387111
$d(o_7,x)$   0.082   0       0.23221   1       0.33333333   0.45    0.22222222   2.31976556
$d(o_8,x)$   0.285   1       0.37846   0       0.33333333   0.45    0.22222222   2.66901556
$d(o_9,x)$   0.285   1       0.37846   1       0.33333333   0.45    0.22222222   3.66901556

Table 13. Ranking of the distances in Table 12 in increasing order.

             $d_1$   $d_2$   $d_3$     $d_4$   $d_5$        $d_6$   $d_7$        Total
$d(o_3,x)$   0.04    0       0.40079   0       0            0.5     0.22222222   1.16301222
$d(o_6,x)$   0.33    0       0.33276   0.5     0            0.45    0.11111111   1.72387111
$d(o_7,x)$   0.082   0       0.23221   1       0.33333333   0.45    0.22222222   2.31976556
$d(o_5,x)$   0.222   1       0.36223   0       0.33333333   0.45    0.22222222   2.58978556
$d(o_2,x)$   0.445   0       0.52321   0.5     0.33333333   0.6     0.22222222   2.62376556
$d(o_8,x)$   0.285   1       0.37846   0       0.33333333   0.45    0.22222222   2.66901556
$d(o_1,x)$   0.197   1       0.3889    1       0            0.4     0.11111111   3.09701111
$d(o_9,x)$   0.285   1       0.37846   1       0.33333333   0.45    0.22222222   3.66901556
$d(o_4,x)$   0.083   1       0.4559    1       0.66666667   0.45    0.11111111   3.76667778
This induces a ranking of the objects in Table 8: $o_3, o_6, o_7, o_5, o_2, o_8, o_1, o_9, o_4$.
For Task 4, assume that the criteria $c_1, \dots, c_7$ in Table 10 have the additional information $(m, m, m, m, m, M, M)$, where $m$ ($M$) means that the exemplar value is the minimum (maximum) value. Based on this vector, we compute the exemplar object:
Table 14. Exemplar object.

$o_E$   0.2   0   0.00089   0   1   0.1222210012   0.100001001

Next, we calculate $V_0, \dots, V_7$ (see Table 15), and finally the distances $Dist(V_0, V_t)$, $t = 1, 2, \dots, 7$, and the weights as their normalised values (see Table 16). The marker in this case is the criterion $c_5$.
Table 15. Vectors $V_0, \dots, V_7$, rounded to three decimals.

$V_0$   $V_1$   $V_2$   $V_3$   $V_4$   $V_5$   $V_6$   $V_7$
1.469   0.987   1.469   1.402   1.469   0.669   1.359   1.459
3.709   2.979   2.709   2.730   3.209   3.309   3.609   3.709
3.220   2.975   2.220   3.165   2.220   2.420   3.110   3.210
0.378   0.010   0.378   0.378   0.378   0.378   0.378   0.368
2.167   2.104   2.167   2.073   1.167   1.167   2.167   2.157
3.824   3.209   2.824   3.035   3.324   3.024   3.714   3.814
3.066   2.699   2.066   2.378   3.066   2.066   3.066   3.055
1.487   1.487   1.487   1.410   0.487   1.087   1.477   1.487
1.487   1.487   1.487   1.410   0.487   1.087   1.477   1.487
Table 16. Distances $Dist(V_0, V_t)$ and (normalised) weights.

            $c_1$   $c_2$   $c_3$   $c_4$   $c_5$   $c_6$   $c_7$
Distances   2.870   4.00    2.826   5.00    5.60    0.450   0.061
Weights     0.137   0.192   0.135   0.240   0.269   0.021   0.002

3. Survival Analysis in SAIN

Medical survival analysis evaluates the time until an event of interest, such as death or disease recurrence, occurs in a group of patients. This analysis is often used to compare treatment outcomes or to predict prognosis.

3.1. Data and Tasks

We are given the following data:
  • Table 17, in which the first column lists the patients treated for the same disease with the same method under strict conditions, and the last column records the time until each patient's death.
  • Table 18, which contains the record of the new patient p.
  • A threshold $\delta$, which defines the acceptable similarity between p and the relevant $p_i$'s in the survival database (i.e. $d(p, p_i) \le \delta$).
We consider the following tasks:
Task 1: What is the life expectancy of p?
Task 2: What is the probability that the life expectancy of p is greater than or equal to a given time T?

3.2. Tasks Solutions

Using a standard method of survival analysis:
  • For Task 1,
    (a) Compute the set of patients that are similar up to $\delta$ to p:
    $$C_{\delta,p} = \{ p_i \mid d(p, p_i) \le \delta,\ 1 \le i \le m \}.$$
    (b) Using $C_{\delta,p}$, compute the probability that p will survive the time $t_j$:
    $$Prob_{\delta}(p \text{ survives time } t_j) = \frac{\#\{ p_i \in C_{\delta,p} \mid t_i = t_j \}}{\#(C_{\delta,p})}.$$
    (c) Compute the life expectancy of p using the formula
    $$LE_{\delta}(p) = \sum_{t_j} t_j \times Prob_{\delta}(p \text{ survives time } t_j),$$
    where the sum ranges over the distinct survival times $t_j$ of the patients in $C_{\delta,p}$.
  • For Task 2, calculate the probability that the life expectancy of p is at least time T:
    $$Prob_{\delta}(LE(p) \ge T) = \sum_{t_j \ge T} Prob_{\delta}(p \text{ survives time } t_j),$$
    where the sum ranges over the distinct survival times $t_j \ge T$ of the patients in $C_{\delta,p}$.
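A hedged Python sketch of both tasks, assuming a precomputed patient-to-patient distance function; the names are illustrative, not the SAIN software API.

```python
def life_expectancy(patients, times, p, dist, delta):
    """Tasks 1(a)-(c): build the delta-similar cluster C_{delta,p}, estimate
    Prob_delta(p survives time t_j) for each distinct survival time, and
    return the probability-weighted life expectancy LE_delta(p)."""
    cluster = [i for i, q in enumerate(patients) if dist(p, q) <= delta]
    if not cluster:
        return None, {}
    probs = {}                                 # distinct time -> probability
    for i in cluster:
        probs[times[i]] = probs.get(times[i], 0.0) + 1.0 / len(cluster)
    le = sum(t * pr for t, pr in probs.items())
    return le, probs

def prob_life_expectancy_at_least(probs, T):
    """Task 2: Prob_delta(LE(p) >= T), summed over distinct times t_j >= T."""
    return sum(pr for t, pr in probs.items() if t >= T)
```

On the data of Section 3.3, with $\delta \ge 3.37$ (the whole database), this sketch should reproduce $LE_{\delta}(p) = 50.11$ and $Prob_{\delta}(LE(p) \ge 60) = 4/9$.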

3.3. An Example

We illustrate the above tasks with an example of a database in which columns 2–8 record patients' medical test results and the last column records the time to death (see Table 19), together with a new patient (see Table 20):
Table 19. Patient records.

Patients   $c_1$   $c_2$   $c_3$     $c_4$    $c_5$   $c_6$          $c_7$          Units of time
$p_1$      0.682   0       0.06789   0        0.2     0.012211001    0.110001001    12.3
$p_2$      0.93    1       0.98      0.5      0.6     0.022220012    0.100001001    15
$p_3$      0.445   1       0.056     1        0.2     0.012121001    0.110101111    68
$p_4$      0.568   0       0.00089   0        1       0.122221001    0.110011101    1.4
$p_5$      0.263   0       0.09456   1        0       0.122201001    0.110111101    40.5
$p_6$      0.815   1       0.78955   0.5      0.2     0.012122001    0.110001111    97.2
$p_7$      0.567   1       0.689     0        0       0.122121001    0.111001111    97.2
$p_8$      0.2     0       0.07833   1        0.6     0.112211022    0.100001111    55.7
$p_9$      0.2     0       0.07833   $\ast$   0.6     0.112211022    0.100001111    63.7

Table 20. New patient record.

$p$        0.485   1       0.45679   1        0.2     0.1002121001   0.110001101
The distance for column 4 is $d(x, y) = |x - y|$ when both values are present, and $d(x, \ast) = \max(x, 1 - x)$ when one value is missing. For example, $d(1, \ast) = \max(1, 1 - 1) = 1$. For all other columns, the distance is $d(x, y) = |x - y|$. Finally, the total distance is the sum of the seven individual distances, with the results in Table 21.
The results for Tasks 1 and 2 are listed below:
  • For $\delta \ge 3.37$, $C_{\delta,p} = \{p_1, p_2, p_3, p_4, p_5, p_6, p_7, p_8, p_9\}$, that is, the entire database. Then
    (a) $LE_{\delta}(p) = 50.11$;
    (b)
    • $Prob_{\delta}(p \text{ survives time } 12.3) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 15) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 68) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 1.4) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 40.5) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 97.2) = 2/9$,
    • $Prob_{\delta}(p \text{ survives time } 55.7) = 1/9$,
    • $Prob_{\delta}(p \text{ survives time } 63.7) = 1/9$;
    (c)
    • $Prob_{\delta}(LE(p) \ge 1.4) = 1$,
    • $Prob_{\delta}(LE(p) \ge 12.3) = 8/9$,
    • $Prob_{\delta}(LE(p) \ge 15) = 7/9$,
    • $Prob_{\delta}(LE(p) \ge 40.5) = 6/9$,
    • $Prob_{\delta}(LE(p) \ge 55.7) = 5/9$,
    • $Prob_{\delta}(LE(p) \ge 63.7) = 4/9$,
    • $Prob_{\delta}(LE(p) \ge 68) = 3/9$,
    • $Prob_{\delta}(LE(p) \ge 97.2) = 2/9$.
    We can calculate other probabilities, for example, $Prob_{\delta}(LE(p) \ge 60) = Prob_{\delta}(p \text{ survives time } 63.7) + Prob_{\delta}(p \text{ survives time } 68) + Prob_{\delta}(p \text{ survives time } 97.2) = 1/9 + 1/9 + 2/9 = 4/9$.
  • For $\delta = 2.5$, $C_{\delta,p} = \{p_2, p_3, p_5, p_6, p_7, p_8\}$. Then
    (a) $LE_{\delta}(p) = 62.27$;
    (b)
    • $Prob_{\delta}(p \text{ survives time } 15) = 1/6$,
    • $Prob_{\delta}(p \text{ survives time } 68) = 1/6$,
    • $Prob_{\delta}(p \text{ survives time } 40.5) = 1/6$,
    • $Prob_{\delta}(p \text{ survives time } 97.2) = 2/6$,
    • $Prob_{\delta}(p \text{ survives time } 55.7) = 1/6$;
    (c)
    • $Prob_{\delta}(LE(p) \ge 15) = 1$,
    • $Prob_{\delta}(LE(p) \ge 40) = 5/6$,
    • $Prob_{\delta}(LE(p) \ge 55.7) = 4/6$,
    • $Prob_{\delta}(LE(p) \ge 68) = 3/6$,
    • $Prob_{\delta}(LE(p) \ge 97.2) = 2/6$.
    Similarly, we can calculate the probabilities $Prob_{\delta}(LE(p) \ge 45) = 4/6$ and $Prob_{\delta}(LE(p) \ge 100) = 0$.

4. SAIN: A Modular Diagram and Functional Information Flow

In Figure 1 and Figure 2, we present the modular diagram and the functional information flow of SAIN.

5. Case Studies for Medical Diagnosis and Prognosis

We present three case studies in which we applied SAIN.

5.1. Heart Disease Diagnosis

We worked with the well-known Cleveland heart disease data set, which contains multiple data types [13]. The full UCI Heart Disease data set contains 76 attributes; as in most published studies, our experiments were restricted to 14 of them, see Table 22.
The problem is a binary classification of whether or not the patient has heart disease.
First, we selected suitable distance metrics and weights for the attributes. For binary attributes, the distance is simply whether the two values are equal; for non-binary discrete attributes, such as resting electrocardiographic results, the appropriate distance measure is not obvious and should be informed by an expert. We coded the electrocardiographic results as 0 for normal, 1 for having ST-T wave abnormality, and 2 for showing probable or definite left ventricular hypertrophy by Estes' criteria.
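As an illustration of how such per-attribute distances might be configured for the 14 attributes of Table 22, here is a hedged Python sketch; the scaling ranges and all names are our own assumptions, not values fixed by the data set or by the SAIN software.

```python
# Illustrative per-attribute distances for the Cleveland attributes (Table 22).
# Binary attributes compare by equality; numeric and ordered categorical
# attributes use an absolute difference scaled to [0, 1]. The ranges below
# are assumed plausible bounds, not values taken from the data set.
def equal(a, b):
    """Binary attributes: distance 0 if equal, 1 otherwise."""
    return 0.0 if a == b else 1.0

def scaled_abs(rng):
    """Numeric / ordered categorical attributes: |a - b| scaled by the range."""
    return lambda a, b: abs(a - b) / rng

DISTANCES = {
    "age":      scaled_abs(100),   # years
    "sex":      equal,
    "cp":       scaled_abs(3),     # chest pain type 1..4
    "trestbps": scaled_abs(200),   # resting blood pressure, mm Hg
    "chol":     scaled_abs(600),   # serum cholesterol, mg/dl
    "fbs":      equal,
    "restecg":  scaled_abs(2),     # 0 normal, 1 ST-T abnormality, 2 LVH
    "thalach":  scaled_abs(220),   # maximum heart rate achieved
    "exang":    equal,
    "oldpeak":  scaled_abs(6.2),   # ST depression induced by exercise
    "slope":    scaled_abs(2),     # slope of the peak exercise ST segment
    "ca":       scaled_abs(3),     # number of major vessels, 0..3
    "thal":     scaled_abs(4),     # heart status, coded 3, 6, 7
}
```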
Many studies have tested different machine learning techniques on the Cleveland data set. For example, [14] lists algorithms with performances ranging from 47% to 80% accuracy; SAIN achieved an 82% accuracy score. Why SAIN? The search is fast; it uses appropriate distances chosen by a medical expert; and it provides explainability at a personal level, including probabilities. It also offers different scenarios for modelling by experimenting with different sets of features, parameters, and preferred outcome visualisations.

5.2. Time Series Classification

Many data sets for classifying the outcomes of events consist of multiple time series, and each variable in a time series may depend on other variables that change in time. The proposed model can deal with this by encoding each time series (signal) into a three-valued spike vector, which can then be processed for classification in the SAIN framework. The variables for this data set are 14 channels of temporal EEG data, located at places of interest on the human scalp.
Examples of signals measured over the same time period include EEG channels, fMRI voxels, ECG electrodes, seismic sensor signals, financial time series, gene expressions, voice, and music frequency bands [8]. Even when the variable (signal) measurements are independent, the signals may have an impact on each other, as they represent the same object/person over the same time period. The number N of these signals can vary from just a few, over a short time window T (Figure 3), to hundreds and thousands, while the time span can vary from a few milliseconds to minutes, hours, days, and longer.
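A minimal Python sketch of the spike encoding of Figure 3, assuming a simple fixed threshold on the change between consecutive samples; the function name and the convention of a leading 0 for the first time point are our own illustration.

```python
def spike_encode(signal, threshold=0.0):
    """Encode a time series into a {-1, 0, 1} spike vector: +1 when the signal
    rises by more than the threshold between consecutive samples, -1 when it
    falls by more than the threshold, and 0 otherwise (no spike). Noise below
    the threshold produces no spike, so the encoder also acts as a filter."""
    spikes = [0]                   # no change recorded at the first time point
    for prev, curr in zip(signal, signal[1:]):
        change = curr - prev
        spikes.append(1 if change > threshold
                      else -1 if change < -threshold
                      else 0)
    return spikes

# Example: spike_encode([0.0, 0.4, 0.9, 0.9, 0.5], threshold=0.1) -> [0, 1, 1, 0, -1]
```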
Figure 4 shows an EEG experiment, and Figure 5 shows a cardiovascular disease signal.
Next, we present a simple example of how this search can be computed for a new record X consisting of only 3 variables/signals (e.g. EEG channels, ECG electrodes) over a short period of 5 time points, and a database D consisting of only 6 such records, labelled with the outcome labels 1, 2, 3 (e.g. diagnosis, prognosis).
In addition to the record X, a weight vector is supplied with the importance of the signals at the different time points, e.g. $W = [0.1, 0.2, 0.4, 0.2, 0.1]$, meaning that the most important and informative part of the measurements is at time point 3.
The new record is X = [1, 1, 1, 0, 1] (signal, EEG channel 1), [0, 1, 1, 1, -1] (signal, EEG channel 2), [1, 1, -1, -1, 0] (signal, EEG channel 3), with $W = [0.1, 0.2, 0.4, 0.2, 0.1]$.
The database contains the records $R_1, \dots, R_6$ with their labels $L$, as follows:
Table 23. Database of six labelled spike-encoded records.
Record Channel 1 Channel 2 Channel 3 Label
R1 (1, 1, -1, 0, 1) (0, 1, 1, 1, -1) (1, 1, -1, -1, 0) 1
R2 (1, 0, -1, 0, 1) ( 0, 1, 1, 1, -1) (1, 0, -1, -1, 1 ) 1
R3 (1, 1, -1, 0, 1) (0, -1, 1, 1, -1) (1, 1, -1, 0, 1) 2
R4 (1, 1, -1, 0, 1) (0, -1, 1, 0, -1) (1, 1, -1, 0, 1) 2
R5 (1, 1, -1, 0, 0) (0, -1, 0, 1, -1) (1, 1, -1, 1, 1) 3
R6 (1, -1, -1, 0, 1) (0, -1, 1, 0, -1) (1, 1, -1, 0, 1) 3
The new record X of EEG signals will be classified in class 1, as it is closest, according to the weighted Euclidean distance, to the class 1 data samples $R_1$ and $R_2$.
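A hedged Python sketch of this classification step, using the time-point weights W and a weighted Euclidean distance summed over the channels; the record values follow Table 23, and all names are illustrative.

```python
import math

W = [0.1, 0.2, 0.4, 0.2, 0.1]            # importance of each time point

X = [[1, 1, 1, 0, 1],                    # channel 1
     [0, 1, 1, 1, -1],                   # channel 2
     [1, 1, -1, -1, 0]]                  # channel 3

DB = {                                   # records from Table 23 and their labels
    "R1": ([[1, 1, -1, 0, 1], [0, 1, 1, 1, -1], [1, 1, -1, -1, 0]], 1),
    "R2": ([[1, 0, -1, 0, 1], [0, 1, 1, 1, -1], [1, 0, -1, -1, 1]], 1),
    "R3": ([[1, 1, -1, 0, 1], [0, -1, 1, 1, -1], [1, 1, -1, 0, 1]], 2),
    "R4": ([[1, 1, -1, 0, 1], [0, -1, 1, 0, -1], [1, 1, -1, 0, 1]], 2),
    "R5": ([[1, 1, -1, 0, 0], [0, -1, 0, 1, -1], [1, 1, -1, 1, 1]], 3),
    "R6": ([[1, -1, -1, 0, 1], [0, -1, 1, 0, -1], [1, 1, -1, 0, 1]], 3),
}

def record_distance(rec, x, w):
    """Weighted Euclidean distance between two multichannel records."""
    return math.sqrt(sum(wi * (a - b) ** 2
                         for rc, xc in zip(rec, x)
                         for wi, a, b in zip(w, rc, xc)))

ranked = sorted(DB.items(), key=lambda kv: record_distance(kv[1][0], X, W))
for name, (rec, label) in ranked:
    print(name, label, round(record_distance(rec, X, W), 3))
# The two closest records are R1 and R2, both in class 1,
# so X is classified in class 1.
```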

5.3. Predicting Longevity in Cardiac Patients

We utilised a data set in which we applied a binary classification of whether the patient had an event (e.g. death), and, for those who had an event, whether it would occur in the near future (within the next 180 days, i.e. approximately six months). The data set contained 150 variables and an outcome, with 295 patients in the first task and 49 in the second. The data included a mix of variables that can be grouped as follows:
  • demographics, risk factors, disease states, medication and deprivation scores,
  • echocardiography and cardiac ultrasound measurements,
  • advanced ECG measurements.
The remaining data includes the days until the event occurred or the censor date, for the Cox proportional hazards modelling.
The objectives are to predict an arrhythmic event or death.
Before running the algorithm, the data was normalised and, to account for the data being unbalanced, we applied the SMOTE data balancing method [15] in each leave-one-out iteration (ensuring that the held-out data point never took part in the oversampling). For the event classification data set, the model achieved an accuracy of 79%, broken down into classifying no event (198/247, 80%) and an event (36/49, 73%). It is worth noting that the confidence of each individual classification can be explored; a sample of the classification confidence is shown in Figure 6.
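A hedged Python sketch of this evaluation loop, using the imbalanced-learn and scikit-learn packages on numpy arrays X, y; the k-nearest-neighbours classifier stands in for the SAIN inference step, and all names are our own.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.neighbors import KNeighborsClassifier   # stand-in for SAIN inference

def loo_accuracy_with_smote(X, y):
    """Leave-one-out evaluation in which each training fold is SMOTE-balanced;
    the held-out sample never takes part in the oversampling."""
    correct = 0
    for i in range(len(X)):
        train = np.delete(np.arange(len(X)), i)       # all samples except i
        X_bal, y_bal = SMOTE().fit_resample(X[train], y[train])
        model = KNeighborsClassifier(n_neighbors=5).fit(X_bal, y_bal)
        correct += int(model.predict(X[i:i + 1])[0] == y[i])
    return correct / len(X)
```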
For the second experiment, we normalised the data set and removed any columns with unknown values. We then applied a genetic algorithm to find the set of features to use for classification. We found a set of 34 variables which provided an accuracy of 81%, with (34/34) for class 0 and (6/15) for class 1. Alternatively, if we apply SMOTE and focus more on the accuracy of class 1, we obtain 69% accuracy, distributed more evenly, with (24/34) for class 0 and (10/15) for class 1.

6. Data and Software Availability

The data has been obtained from the following sources: the UCI Cleveland data, available at https://archive.ics.uci.edu/dataset/45/heart+disease, and the EEG data, available at https://github.com/KEDRI-AUT/NeuCube-Py/tree/master/example_data. Access to the software is available on request.

7. Conclusions

The paper presents a new method for search and inference, called SAIN, for multimodal data integration and personalised model creation based on these multimodal data. The model not only evaluates the outcome for a person more accurately than traditional machine learning methods using single-modality data, but also explains the proposed solution in terms of probabilities and visualisations.
The SAIN method, described in Section 4, has been implemented as a software system and applied to several case studies to illustrate its advantages and applicability.
The proposed mathematical method and computational framework can be applied to a broad spectrum of applications, such as [15]: a) medical diagnosis based on multimodal data, such as genetic, clinical, cognitive, and ethnic data, b) early disease prognosis based on multimodal personalised data modelling, c) multimodal neuroimaging data modelling, d) multisensory spatio-temporal data modelling for pollution level estimation and prediction, and e) earthquake prediction based on both seismic and GPS spatio-temporal data.

Acknowledgement

We thank Dr. Elena Calude for her contributions to the mathematical model.

References

  1. AbouHassan, I.; Kasabov, N.K.; Jagtap, V.; Kulkarni, P. Spiking neural networks for predictive and explainable modelling of multimodal streaming data with a case study on financial time series and online news. Sci. Rep. 2023, 13, 18367.
  2. Rodrigues, F.; Markou, I.; Pereira, F.C. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Information Fusion 2019, 49, 120–129.
  3. Li, J.; Liu, J.; Zhou, S.; Zhang, Q.; Kasabov, N.K. GeSeNet: A General Semantic-Guided Network With Couple Mask Ensemble for Medical Image Fusion. IEEE Transactions on Neural Networks and Learning Systems 2024, 35, 16248–16261.
  4. Kasabov, N. Data Analysis and Predictive Systems and Related Methodologies. U.S. Patent 9,002,682 B2, 7 April 2015.
  5. Doborjeh, M.; Doborjeh, Z.; Merkin, A.; Bahrami, H.; Sumich, A.; Krishnamurthi, R.; Medvedev, O.N.; Crook-Rumsey, M.; Morgan, C.; Kirk, I.; et al. Personalised predictive modelling with brain-inspired spiking neural networks of longitudinal MRI neuroimaging data and the case study of dementia. Neural Networks 2021, 144, 522–539.
  6. Budhraja, S.; Singh, B.; Doborjeh, M.; Doborjeh, Z.; Tan, S.; Lai, E.; Goh, W.; Kasabov, N. Mosaic LSM: A Liquid State Machine Approach for Multimodal Longitudinal Data Analysis. In Proceedings of the 2023 International Joint Conference on Neural Networks (IJCNN); IEEE, 2023; pp. 1–8.
  7. Kasabov, N.K. Evolving Connectionist Systems, 2nd ed.; Springer: London, 2007.
  8. Kasabov, N.K. Time-Space, Spiking Neural Networks and Brain-Inspired Artificial Intelligence; Springer: Berlin/Heidelberg, 2019.
  9. Santomauro, D.F.; et al. Global prevalence and burden of depressive and anxiety disorders in 204 countries and territories in 2020 due to the COVID-19 pandemic. The Lancet 2021, 398, 1700–1712.
  10. Swaddiwudhipong, N.; Whiteside, D.J.; Hezemans, F.H.; Street, D.; Rowe, J.B.; Rittman, T. Pre-diagnostic cognitive and functional impairment in multiple sporadic neurodegenerative diseases. bioRxiv 2022.
  11. Calude, C.; Calude, E. A metrical method for multicriteria decision making. St. Cerc. Mat. 1982, 34, 223–234.
  12. Calude, C.; Calude, E. On some discrete metrics. Bulletin mathématique de la Société des Sciences Mathématiques de la République Socialiste de Roumanie.
  13. Gleeson, S.; Liao, Y.W.; Dugo, C.; Cave, A.; Zhou, L.; Ayar, Z.; Christiansen, J.; Scott, T.; Dawson, L.; Gavin, A.; et al. ECG-derived spatial QRS-T angle is associated with ICD implantation, mortality and heart failure admissions in patients with LV systolic dysfunction. PLOS ONE 2017, 12, e0171069.
  14. Kahramanli, H.; Allahverdi, N. Design of a hybrid system for the diabetes and heart diseases. Expert Systems with Applications 2008, 35, 82–89.
  15. Kasabov, N.K. Time-Space, Spiking Neural Networks and Brain-Inspired Artificial Intelligence; Vol. 750, Springer, 2019.
Figure 1. A modular diagram of the proposed SAIN computational framework.

Figure 2. A flow of data and information processing in the SAIN computational framework.

Figure 3. Every time series can be represented as a 3-value vector through a spike encoding method over time [15]. If at a time t the time series is increasing in value, there will be a positive spike (1); if decreasing, a negative spike (-1); and if there is no change, no spike (0) (left figure). Each element in this vector represents the change of the signal at a time point. If necessary, the original signal can be recovered over time from this vector (right figure). The length of the vector is equal to the number of time points measured.

Figure 4. EEG signals taken from EEG electrodes spatially distributed on the scalp are spatio-temporal signals (left figure). Each time series signal from an electrode is measured every 1 millisecond. The figure on the right shows the measurements of 14 EEG electrodes over a period of 124 milliseconds. Each signal can be encoded into a 124-element vector according to Figure 3, making altogether 14 such vectors to be processed in the SAIN framework.

Figure 5. ECG (electrocardiogram) signals ((a) noisy and (b) filtered) can be encoded into spike vectors according to the spike encoding method from Figure 3. Spike encoding is robust to noise, as any noise below a threshold does not cause the generation of a spike (either positive or negative), so the encoder acts as a filter. The length of the vector is equal to the number of measurement time points. The vector data can be further processed in the SAIN framework.

Figure 6. Sample of the classification breakdown and the confusion matrix (class 2 would be utilised when we wish to have an uncertain class, but here we have classified based on probability > 50%).
Table 2. Labelled database.

Objects/Criteria   $c_1$       $c_2$       ...   $c_j$       ...   $c_n$       Class label
$o_1$              $a_{1,1}$   $a_{1,2}$   ...   $a_{1,j}$   ...   $a_{1,n}$   $l_1$
...
$o_i$              $a_{i,1}$   $a_{i,2}$   ...   $a_{i,j}$   ...   $a_{i,n}$   $l_i$
...
$o_m$              $a_{m,1}$   $a_{m,2}$   ...   $a_{m,j}$   ...   $a_{m,n}$   $l_m$

Table 3. Weights.

Criteria weights   $c_1$   $c_2$   ...   $c_j$   ...   $c_n$
$w$                $w_1$   $w_2$   ...   $w_j$   ...   $w_n$

Table 4. New unlabelled object.

Object/Criteria    $c_1$   $c_2$   ...   $c_j$   ...   $c_n$
$x$                $x_1$   $x_2$   ...   $x_j$   ...   $x_n$
Table 6. Example of labelled data. The $c_7$ column lists the three rows of a 3×3 black-and-white image.

$c_1$   $c_2$   $c_3$   $c_4$    $c_5$    $c_6$                      $c_7$                  Class
68.2    0       6789    small    red      0,1,-1,-1,1,1,0,0,1,-1     1,1,0; 0,0,1; 0,0,1    1
93      1       98000   medium   yellow   0,-1,-1,-1,-1,0,0,1,-1,1   1,0,0; 0,0,1; 0,0,1    1
44.5    1       5600    large    red      0,1,-1,1,-1,1,0,0,1,-1     1,1,0; 1,0,1; 1,1,1    1
56.8    0       89      small    white    1,-1,-1,-1,-1,1,0,0,1,-1   1,1,0; 0,1,1; 1,0,1    1
26.3    0       9456    large    black    1,-1,-1,-1,0,1,0,0,1,-1    1,1,0; 1,1,1; 1,0,1    2
81.5    1       78955   medium   red      0,1,-1,1,-1,-1,0,0,1,-1    1,1,0; 0,0,1; 1,1,1    2
56.7    1       68900   small    black    1,-1,-1,1,-1,1,0,0,1,1     1,1,1; 0,0,1; 1,1,1    2
20      0       7833    large    yellow   1,1,-1,-1,1,1,0,-1,-1,1    1,0,0; 0,0,1; 1,1,1    2
20      0       7833    $\ast$   yellow   1,1,-1,-1,1,1,0,-1,-1,1    1,0,0; 0,0,1; 1,1,1    2

Table 7. Example of new unlabelled object.

48.5    1       45679   large    red      1,0,0,-1,1,-1,1,0,0,1      1,1,0; 0,0,1; 1,0,1
Table 8. Coded labelled data. The colour $c_5$ is given as a hex code together with its 24-bit binary expansion.

        $c_1$   $c_2$   $c_3$   $c_4$    $c_5$                               $c_6$        $c_7$       Label
$o_1$   68.2    0       6789    0        FF0000 (111111110000000000000000)   0122110012   110001001   1
$o_2$   93      1       98000   1        FFFF00 (111111111111111100000000)   0222200121   110001001   1
$o_3$   44.5    1       5600    2        FF0000 (111111110000000000000000)   0121210012   110101111   1
$o_4$   56.8    0       89      0        FFFFFF (111111111111111111111111)   1222210012   110011101   1
$o_5$   26.3    0       9456    2        000000 (000000000000000000000000)   1222010012   110111101   2
$o_6$   81.5    1       78955   1        FF0000 (111111110000000000000000)   0121220012   110001111   2
$o_7$   56.7    1       68900   0        000000 (000000000000000000000000)   1221210011   111001111   2
$o_8$   20      0       7833    2        FFFF00 (111111111111111100000000)   1122110221   100001111   2
$o_9$   20      0       7833    $\ast$   FFFF00 (111111111111111100000000)   1122110221   100001111   2

Table 9. New unlabelled object, coded.

$x$     48.5    1       45679   2        FF0000 (111111110000000000000000)   1002121001   110001101
Table 10. Coded labelled normalised data.

$o_1$   0.682   0   0.06789   0        0.2   0.0122110012   0.110001001
$o_2$   0.93    1   0.98      0.5      0.6   0.0222200121   0.100001001
$o_3$   0.445   1   0.056     1        0.2   0.0121210012   0.110101111
$o_4$   0.568   0   0.00089   0        1     0.1222210012   0.110011101
$o_5$   0.263   0   0.09456   1        0     0.1222010012   0.110111101
$o_6$   0.815   1   0.78955   0.5      0.2   0.0121220012   0.110001111
$o_7$   0.567   1   0.689     0        0     0.1221210011   0.111001111
$o_8$   0.2     0   0.07833   1        0.6   0.1122110221   0.100001111
$o_9$   0.2     0   0.07833   $\ast$   0.6   0.1122110221   0.100001111

Table 11. New unlabelled object, coded and normalised.

$x$     0.485   1   0.45679   1        0.2   0.1002121001   0.110001101
Table 17. Survival database.

Patients/Criteria   $c_1$       $c_2$       ...   $c_j$       ...   $c_n$       Units of time
$p_1$               $a_{1,1}$   $a_{1,2}$   ...   $a_{1,j}$   ...   $a_{1,n}$   $t_1$
...
$p_i$               $a_{i,1}$   $a_{i,2}$   ...   $a_{i,j}$   ...   $a_{i,n}$   $t_i$
...
$p_m$               $a_{m,1}$   $a_{m,2}$   ...   $a_{m,j}$   ...   $a_{m,n}$   $t_m$

Table 18. New patient record.

Patient/Criteria    $c_1$   $c_2$   ...   $c_j$   ...   $c_n$
$p$                 $x_1$   $x_2$   ...   $x_j$   ...   $x_n$
Table 21. Distances between all patients and the new patient.

              $d_1$    $d_2$   $d_3$      $d_4$   $d_5$   $d_6$           $d_7$         Total
$d(p_1,p)$    0.1970   1       0.388900   1.0     0.0     0.08800109890   0.000000100   2.67390119890
$d(p_2,p)$    0.4450   0       0.523210   0.5     0.4     0.07799208800   0.010000100   1.95620218800
$d(p_3,p)$    0.0400   0       0.400790   0.0     0.0     0.08809109890   0.000100010   0.52898110890
$d(p_4,p)$    0.0830   1       0.455900   1.0     0.8     0.02200890110   0.000010000   3.36091890110
$d(p_5,p)$    0.2220   1       0.362230   0.0     0.2     0.02198890110   0.000110000   1.80632890110
$d(p_6,p)$    0.3300   0       0.332760   0.5     0.0     0.08809009890   0.000000010   1.25085010890
$d(p_7,p)$    0.0820   0       0.232210   1.0     0.2     0.02190890100   0.001000010   1.53711891100
$d(p_8,p)$    0.2850   1       0.378460   0.0     0.4     0.01199892200   0.009999990   2.08545891200
$d(p_9,p)$    0.2850   1       0.378460   1.0     0.4     0.01199892200   0.009999990   3.08545891200
Table 22. The 14 variables used in the heart disease diagnosis case.

Name       Data type     Definition
age        integer       age in years
sex        binary        sex
cp         {1,2,3,4}     chest pain type
trestbps   integer       resting blood pressure
chol       integer       serum cholesterol in mg/dl
fbs        binary        fasting blood sugar > 120 mg/dl
restecg    {0,1,2}       resting electrocardiographic results
thalach    integer       maximum heart rate achieved
exang      binary        exercise-induced angina
oldpeak    float         ST depression induced by exercise relative to rest
slope      {1,2,3}       the slope of the peak exercise ST segment
ca         {0,1,2,3}     number of major vessels coloured by fluoroscopy
thal       {3,6,7}       heart status
num        {0,1,2,3,4}   diagnosis of heart disease
Disclaimer/Publisher's Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.