Preprint Article (not peer-reviewed)

Brute Force Computations and Reference Solutions

Submitted: 07 December 2024. Posted: 09 December 2024.
Abstract
In this paper we consider the application of brute force computational techniques (BFCT) for solving computational problems of mathematical analysis and matrix algebra in a floating-point computing environment. These techniques include, among others, simple matrix computations and the analysis of graphs of functions. Since BFCT are based on matrix calculations, the program system MATLAB® is suitable for their computer realization. The computations in this paper are done in double precision floating-point arithmetic obeying the 2019 IEEE Standard for binary floating-point calculations. One of the aims of the paper is to analyze cases when popular algorithms and software fail to produce correct answers without warning the user. In real time control applications this may have catastrophic consequences with heavy material damage and human casualties. It is known, or suspected, that a number of man-made catastrophes such as the Dhahran accident (1991), the Ariane 5 launch failure (1996), the Boeing 737 Max tragedies (2018, 2019) and others are due to errors in computer software and hardware. Another application of BFCT is finding good initial guesses for known computational algorithms. Sometimes simple and fast BFCT are useful tools for solving computational problems correctly and in real time. Among the particular problems considered are genuine addition of machine numbers, numerically stable computations, finding minimums of arrays, minimization of functions, solving finite equations, integration and differentiation, computing condensed and canonical forms of matrices, and clarifying the concepts of the least squares method in the light of the conflict Remainders vs. Errors. Usually BFCT are applied under the user's supervision, which is not possible in the automatic implementation of computational methods. Implementing BFCT automatically is a challenging problem in the area of artificial intelligence (AI), and of mathematical artificial intelligence (MAI) in particular. BFCT make it possible to reveal the underlying arithmetic (FPA, VPA, etc.) in the performance of computational algorithms. Last but not least, this paper may have some tutorial value, since at certain places computational algorithms and mathematical software are still taught without taking into account the properties of computational algorithms and machine arithmetic.
Keywords: 

1. Introduction and Notation

1.1. Preliminaries

Sophisticated computational algorithms have been developed during the last 300 years for solving a large variety of problems in the mathematical and engineering sciences [1,2,3,4,5]. When digital computers became available in the middle of the 20th century, it became clear that some computational algorithms of the past are not suitable for computer implementation, at least not without significant modifications. Other algorithms (albeit very clever) became obsolete. Thus the paradigms of scientific computation changed, and new types of modern computational algorithms were developed and implemented as computer codes.
When implemented as computer codes in a floating-point computing environment such as MATLAB® [6,7] and Maple [8], however, some of these computational algorithms may produce wrong results without warning. Other systems, such as Mathematica [9,10], use variable precision arithmetic (VPA) and may avoid such pitfalls at the price of a certain increase in the time needed to perform the algorithms. In such cases BFCT and a detailed analysis of computational problems with reference solution (CPRS) may be useful.
BFCT may also be used to obtain initial guesses for computational algorithms intended to solve finite equations and to minimize functions of one or several scalar arguments. We stress finally that Monte Carlo algorithms [11] are also a form of BFCT. As a perspective, a combination of mathematical AI [12,13] with BFCT is yet to be developed.
The computations in this paper are done in double precision binary floating-point arithmetic (FPA) obeying the IEEE Standard [14]. Some of the matrix notations used are inspired by the language of MATLAB®. The codes for the numerical experiments are given so that the results may be checked independently. BFCT make it possible to reveal the underlying machine arithmetic.

1.2. General Notations

Next we use the following general notations.
  • Z – set of integers
  • Z[m, n] – set of integers m, m + 1, …, n, where m ≤ n
  • Z_k ⊂ Z – set of integers not less than k
  • R (C) – set of real (complex) numbers
  • i ∈ C – imaginary unit, where i² = −1
  • sum(a_k, n) – sum of the scalar quantities a_1, a_2, …, a_n
  • int(f, a, b) – definite integral of the function f: R → R on the interval [a, b]
  • ≺ – lexicographical order on the set of pairs (a, b) ∈ R², defined as (a_1, b_1) ≺ (a_2, b_2) if a_1 < a_2, or a_1 = a_2 and b_1 < b_2; we write (a_1, b_1) ⪯ (a_2, b_2) if (a_1, b_1) ≺ (a_2, b_2) or (a_1, b_1) = (a_2, b_2)
  • K(m, n) – set of m × n matrices A with elements A(p, q) ∈ K, where K is one of the sets Z, R or C
  • K(m) = K(m, m)
  • A(p, :) ∈ K(1, n) – pth row of A ∈ K(m, n)
  • A(:, q) ∈ K(m, 1) – qth column of A ∈ K(m, n)
  • ‖·‖ – spectral norm in K(m, n)
  • ones(m, n) ∈ R(m, n) – matrix with unit elements
  • rank(A) ∈ Z_0 – rank of the matrix A ∈ K(m, n)
  • tr(A) ∈ K – trace of the matrix A ∈ K(m, n)
  • det(A) ∈ K – determinant of the matrix A ∈ K(m)
  • I_n ∈ R(n) – identity matrix with I_n(p, p) = 1 and I_n(p, q) = 0 for p ≠ q
  • E_{i,j} – matrix with a unique nonzero element, equal to 1, in position (i, j)
  • N_n ∈ R(n) – nilpotent matrix with N_n(p, p + 1) = 1 and N_n(p, q) = 0 for q ≠ p + 1
  • λ_k = λ_k(A) – eigenvalues of the matrix A ∈ K(n), where k = 1, 2, …, n; we usually assume that |λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_n|
  • σ_k = σ_k(A) – singular values of the matrix A ∈ K(m, n), where k = 1, 2, …, p, p = min{m, n}; the singular values are ordered as σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_p ≥ 0
  • ‖A‖ = σ_1 – spectral norm of the matrix A ∈ K(m, n)
  • cond(A) = ‖A‖ ‖A^(−1)‖ = σ_1/σ_n ≥ 1 – condition number of the non-singular matrix A ∈ K(n)

1.3. Machine Arithmetic Notations

We use the following notations from binary floating-point arithmetic, where codes and results in the MATLAB® environment are written in typewriter font.
  • M_+ (M_−) – set of finite positive (negative) machine numbers
  • M = M_− ∪ {0} ∪ M_+ – set of finite machine numbers
  • fl(x) ∈ M – rounded value of x ∈ R; fl(x) is computed in the form ±m/2^n by the command sym(x,'f'), where m ∈ Z, n ∈ Z_0 and m is odd if n ∈ Z_1
  • X_+ > X (X_− < X) – machine number right (left) nearest to X ∈ M; X_+ is obtained as X + eps(X)
  • eps = 2^(−52) – machine epsilon; we have 1_+ = 1 + eps, 1_− = 1 − eps/2; eps is obtained as eps(1), or simply as eps
  • ρ = eps/2 = 2^(−53) ≈ 1.11 × 10^(−16) – rounding unit
  • R = 2^53 = 9 007 199 254 740 992 ≈ 9.01 × 10^15 – maximal integer such that all positive integers ≤ R are machine numbers; we have R_+ = R + 2, R_− = R − 1
  • E = 2^1023 – maximal machine integer power of 2; we have 2E ∉ M
  • X_max = (2 − eps)E ≈ 1.80 × 10^308 – maximal machine number; it is also denoted as realmax and we have eps(realmax) = 2^971
  • X_min = 2^(−1022) ≈ 2.23 × 10^(−308) – minimal positive normal machine number; it is also denoted as realmin
  • R_min = 2 eps/E = 2^(−1074) ≈ 4.94 × 10^(−324) – minimal positive subnormal machine number; it is found as eps(0)
  • X = [X_min, X_max] ⊂ R – normal range; numbers x ∈ X are rounded with relative error less than ρ
  • Inf (-Inf) – positive (negative) machine infinity
  • NaN – result of a mathematically undefined operation; comes from Not a Number
  • M_ext = M ∪ {±Inf} ∪ {NaN} – extended machine set

1.4. Abbreviations

AI artificial intelligence
ARRE average rounding relative error
BFCT brute force computational technique
CPRS computational problem with reference solution
FPA double precision binary floating-point arithmetic
LLSP linear least squares problem
LSM least squares method
LSP least squares problem
MAI mathematical artificial intelligence
MRRE maximal rounding relative error
NLF Newton-Leibniz formula
NFP numerical fixed point
RRE rounding relative error
RTA real time application
VPA variable precision arithmetic

1.5. Software and Hardware

The computations are done with the program system MATLAB®, Version 9.9 (R2020b), on Lenovo X1 Carbon 7th Generation Ultrabook.

1.6. Problems with Reference Solution

An effective approach to using BFCT is solving CPRS. CPRS are specially designed computational problems with a priori known, or reference, solutions. It is supposed that the numerical algorithm which is to be checked does not “know” this fact and solves the computational problem in the standard way. Consider for example the finite equation f(x) = 0, where f(x) and x are scalars or vectors. Then a CPRS is obtained choosing f(x) = g(x) − g(x_ref), where x_ref is the reference solution. We usually choose x_ref to be a vector with unit elements, e.g. x_ref = ones(n,1), for which ‖x_ref‖ = √n.
Let x_exact ≠ 0 be a solution of the equation f(x) = 0 and let x_comp be an approximate solution computed in FPA. The absolute error ‖x_comp − x_exact‖ and the relative error
‖x_comp − x_exact‖/‖x_exact‖
of the computed solution x_comp, such that f(x_comp) ≈ 0, depend on the rounding of initial data and on rounding and other errors made during the computational procedure [15,16]. To eliminate the effects of rounding of initial data we may consider CPRS with relatively small integer data or, more generally, with data consisting of machine numbers.
Example 1.
Computational algorithms and software for solving linear algebraic equations A x = b , where A C ( n ) is invertible, may be checked choosing A Z ( n ) and b = A x ref , where x_ref = ones(n,1).
The computed solution is x_comp = A\b and we may compare the relative error norm(x_ref-x_comp)/sqrt(n) with the a priori error estimate eps*cond(A).
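A minimal sketch of Example 1 for an assumed dimension n and a random integer test matrix:
n = 100;
A = fix(100*randn(n));            % integer test matrix (machine-number data)
x_ref = ones(n,1);                % reference solution
b = A*x_ref;                      % right-hand side, computed exactly
x_comp = A\b;                     % computed solution
rel_err = norm(x_ref - x_comp)/sqrt(n)
est = eps*cond(A)                 % a priori error estimate to compare with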
Example 2.
Let F: [a, b] → R be a differentiable, highly oscillating function with derivative F′: [a, b] → R. The numerical integration of f = F′ may be a problem. For such integrands we may formulate a numerically difficult computational problem J = int(f, a, b) with reference solution J_ref = 1. Systems with mathematical intelligence such as MATLAB® [6], Maple [8] and Mathematica [9,10] may find the integral with high accuracy by the NLF J = F(b) − F(a), thus avoiding numerical integration.
Example 3.
To check the codes for eigenstructure analysis of a matrix A ∈ C(n) we may choose A = T(I_n + N_n)T^(−1), where T and T^(−1) are integer matrices with det(T) = 1. The eigenvalues of A are equal to 1 (the reference solution). Computational algorithms such as the QR algorithm and software for spectral matrix analysis may produce wrong results even for matrices A ∈ Z(2) of moderate norm.
Example 4.
When solving the differential equation y′(t) = f(t, y(t)) with initial condition y(t_0) = y_0 we may choose a reference solution y_ref as follows. The function y_ref satisfies the differential equation y′(t) = Φ(t, y(t)), where
Φ(t, y(t)) = f(t, y(t)) + y′_ref(t) − f(t, y_ref(t))
with initial condition y_ref(t_0) = y_0.
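A minimal sketch of Example 4 with assumed choices f(t, y) = −y², y_ref(t) = 1 + sin(t), t_0 = 0 and y_0 = 1:
f      = @(t,y) -y.^2;                    % assumed right-hand side
y_ref  = @(t) 1 + sin(t);                 % assumed reference solution
dy_ref = @(t) cos(t);                     % its derivative
Phi    = @(t,y) f(t,y) + dy_ref(t) - f(t,y_ref(t));
[t,y]  = ode45(Phi, [0 10], y_ref(0));    % solve the modified equation
max(abs(y - y_ref(t)))                    % error of the solver against the reference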
An important observation is that for some computational problems such as finding extrema of functions, BFCT may produce better results than certain sophisticated algorithms used in modern computer systems for doing mathematics, see e.g. Section 7.

2. Machine Arithmetic

2.1. Preliminaries

In this section we consider computations in FPA obeying the IEEE Standard [14]. Numerical computations in MATLAB® are done according to this standard. The FPA consists of the set of machine numbers M and rules for performing operations, e.g. addition/subtraction, multiplication/division, exponentiation, etc. Rounding and arithmetic operations in FPA are performed with RRE of order ρ. The sum A + B + C and the product ABC are computed as (A + B) + C and (AB)C.
A major source of errors in computations is the (catastrophic) cancellation that occurs when close numbers, each known only approximately, are subtracted, so that the information coded in their left-most digits is lost [17]. Thus catastrophic cancellation is the phenomenon that subtracting good approximations of two close numbers may produce a bad approximation of their difference. Less known is that genuine addition in FPA may also cause large and even unlimited errors.
A tutorial example of cancellation is the solution of quadratic equations ax² + bx + c = 0, a ≠ 0. For some 4000 years students have been taught to use the formula x_{1,2} = (−b ± √d)/(2a), d = b² − 4ac, without warning that this may be a bad way to solve quadratic equations numerically.
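A minimal sketch of the standard remedy (not the formula above): compute the root of larger modulus by the explicit formula, where no cancellation occurs, and recover the other root from Vieta's relation x_1 x_2 = c/a. The coefficients below are assumed test values with real roots:
a = 1; b = -1e8; c = 1;                 % assumed test coefficients
d  = sqrt(b^2 - 4*a*c);
x1 = (-b - sign(b)*d)/(2*a);            % root of larger modulus, no cancellation
x2 = c/(a*x1);                          % Vieta's formula instead of (-b + d)/(2a)
[x2, (-b - d)/(2*a)]                    % compare with the naive, cancellation-prone value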
Similar considerations are valid for the use of explicit expressions for solving cubic and quartic algebraic equations, because genuine subtraction is numerically dangerous. Genuine addition in FPA is no less dangerous.

2.2. Violation of Arithmetic Laws

In FPA the commutative law for addition and multiplication is valid, but the associative law for addition and multiplication is violated, i.e. it may happen that
(A + B) + C ≠ A + (B + C),  (AB)C ≠ A(BC)
The distributive law A ( B + C ) = A B + A C is also violated. This may lead to unexpectedly large (in fact arbitrarily large) errors.
For example, we have the wrong result 1 + R − R = (1 + R) − R = R − R = 0. The reason for this machine calculation is that 1 + R is not a machine number (it is the smallest positive integer with this property). It lies in the middle between the successive machine numbers R and 2 + R and is rounded to R. At the same time we have 1 + (R − R) = 1 + 0 = 1, which is the correct answer. Thus we got the famous wrong equality 0 = 1 arising in mathematical jokes.
This is not the end of the story. Consider the expression S = R + 1 + 1 + ⋯ + 1 − R consisting of 2 + R terms, where there are R terms equal to 1. We have S = R. Computing S from left to right we get the wrong answer S = (R + 1 + 1 + ⋯ + 1) − R = R − R = 0. Computing S starting with the ones, i.e. 1 + 1 + ⋯ + 1 + R − R = R + R − R, we get the correct answer S = R. Now we have obtained the more impressive result 0 = 9 007 199 254 740 992. Summing (at least theoretically) a sufficient number of units we compute the sum of 1's as Inf and hence 0 = Inf.
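A minimal sketch of the order dependence of machine summation:
R = 2^53;
(1 + R) - R        % gives 0, since 1 + R is rounded to R
1 + (R - R)        % gives 1, the correct value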
The associative law for multiplication is also violated in FPA. For example, the value of the expression P = 2^512 × 2^512 × 2^(−512) is 2^512 and it is computed correctly as
P = 2^512 × (2^512 × 2^(−512)) = 2^512 × 1 = 2^512
At the same time the machine computation of P from left to right gives
(2^512 × 2^512) × 2^(−512) = fl(2^1024) × 2^(−512) = Inf × 2^(−512) = Inf
where Inf is the symbol for infinity in FPA. Thus we got 2^512 = Inf.
Similar false equalities may also be obtained by addition and subtraction of machine numbers. Set E = 2^1023. In standard arithmetic we have E + E − E − E = 0. In FPA, however, fl(E + E − E − E) = fl(E + E) − E − E = Inf − E − E = Inf. This corresponds to the wrong equality 0 = Inf, although E is a machine number at a good distance E − 2^971 from X_max.
We also have fl(E + E − E − E − E − E) = fl(E + E) − E − E − E − E = Inf, although the exact result is −2E and should be rounded to -Inf. Thus we obtained −Inf = Inf. The reason is that the maximal positive machine number in FPA is X_max = (2 − eps)E and the number E + E = 2^1024 is set to Inf. Moreover, the maximal positive integer that is still rounded to X_max is X_max + 2^969(2 − eps), while the number X_max + 2^970 is set to Inf.
These entertaining exercises may produce undetermined results as well. For example, we have U = 2^1024/2^1024 − 1 = 0, but the computed value is U = Inf/Inf - 1 = NaN. This may be interpreted as the exotic equality 0 = NaN.
Working with positive subnormal numbers also leads to violation of the distributive law (x + y)z = xz + yz. In particular, we usually think that for x ∈ R it is fulfilled (x + x)/2 = x/2 + x/2 = x. But setting x = R_min we get (x + x)/2 = 2*x/2 = x and x/2 + x/2 = 0 + 0 = 0. We stress that the minimal positive quantities that are still rounded to R_min instead of to 0 are those exceeding R_min/2 = 2^(−1075).
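A minimal sketch of this effect at the smallest subnormal number:
x = eps(0);        % R_min = 2^(-1074)
(x + x)/2          % equals x
x/2 + x/2          % equals 0, since x/2 is rounded to 0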
The distributive law in the form (1/m + 1/n)/p = 1/m/p + 1/n/p, where at least one of the operands m, n, p ∈ Z is not an integer power of 2, is also violated. In this case the rounded values of the two sides of the equality are usually neighboring machine numbers.
Example 5.
Let m = 2, n = 3, p = 5. Then fl(1/2/5 + 1/3/5) = fl(1/6) + 2^(−55) and fl(1/2/5 − 1/3/5) = fl(1/30) + 2^(−57).
Example 5 illustrates the fact that machine subtraction of close numbers a , b is usually performed with small relative error. Moreover, if the operands are machine numbers and a / 2 b 2 a then the machine subtraction is exact [17].
The rounded value fl(X) ∈ M of X ∈ R is computed by the command sym(X,'f') in the form ±m/2^n, where m, n ∈ Z_0 and m is odd when n ≥ 1. In particular the command sym(realmax,'f') will give the 309 decimal digits of the integer X_max = realmax.

2.3. Numerical Convergence of Series

We shall conclude our brief excursion into machine summation of numbers with the consideration of numerical series with positive elements a_k. For n ∈ Z_1 set sum(a_k, n) = a_1 + a_2 + ⋯ + a_n.
Definition 1.
The symbol sum ( a k , ) is called numerical series. The quantity A n = sum ( a k , n ) is the n-th partial sum of this series.
The partial sums are computed in FPA as A_1 = a_1, A_{m+1} = fl(a_{m+1} + A_m) ∈ M, m ∈ Z_1. Theoretically, the series sum(a_k, ∞) is divergent when a_k ≥ X_min. In FPA there are three mutually disjoint possibilities according to the next definition.
Definition 2.
The series sum(a_k, ∞) is said to be:
  • numerically convergent if there is a positive S ∈ M and m ≥ 1 such that A_n = S for all n ≥ m; the number S is called the numerical sum of the series sum(a_k, ∞)
  • numerically divergent if there is m > 1 such that A_n = Inf for all n ≥ m; in this case no terms A_n = NaN occur
  • numerically undefined if there is m > 1 such that A_n ≠ NaN for n ≤ m and A_{m+1} = NaN
Divergent numerical series may be (and very often are!) numerically convergent. For example, the divergent harmonic series with a_k = 1/k is numerically convergent in FPA to a sum S ≈ 36. The divergent series with a_k = 1 is numerically convergent to S = R. Also, it is easy to prove the following assertion.
Proposition 1.
For any S ∈ M_+ there is a series sum(a_k, ∞) which is numerically convergent to S.
Proposition 1 is an FPA analogue of the Riemann series theorem [18], which states that the terms of a convergent real series which is not absolutely convergent may be reordered so that the sum of the new series equals any prescribed real number.
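A minimal sketch illustrating Proposition 1 for an assumed target sum S: every term of the series is positive, so the exact series diverges, yet the machine partial sums never change once they reach S.
S = pi;             % assumed target sum
a = eps(S)/4;       % positive term below half an ulp of S
A = S;
for k = 1:1000
    A = A + a;      % each partial sum is rounded back to S
end
[A == S, 1000*a]    % true, although the exact sum has grown by 1000*a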

2.4. Average Rounding Errors

We recall that the normal range of FPA is the interval X = [X_min, X_max] ⊂ R, where numbers are rounded with RRE less than ρ.
Let x ∈ X and let fl(x) ∈ M be the rounded value of x. It is well known [14] that the RRE satisfies the estimate
rre(x) = |x − fl(x)|/|x| ≤ ρ/(1 + ρ) < ρ
The code sym(x,'f') computes fl(x) in the form m/2^n, where m is odd if n ≥ 1. Let n ∈ Z[0, 2^52 − 1]. The behavior of rre(x) between two successive machine numbers
X_n = 1 + 2nρ,  X_{n+1} = 1 + 2(n + 1)ρ
in the interval [1, 2) is as follows. Let
x_n = (X_n + X_{n+1})/2 = 1 + (2n + 1)ρ ∉ M
Then
rre(x) = 1 − X_n/x for X_n ≤ x ≤ x_n,  rre(x) = X_{n+1}/x − 1 for x_n < x ≤ X_{n+1}
The maximum of rre(x) in the interval [X_n, X_{n+1}] is achieved for x = x_n and is equal to
e_n = rre(x_n) = (X_{n+1} − X_n)/(X_n + X_{n+1}) = ρ/(1 + (2n + 1)ρ)
The graph of the function rre: [1, 2) → R resembles a saw, see Figure 1.
Near the left end of the interval the maximum RRE is about ρ, while near the right end it is close to ρ/2. In particular we have
rre(1 + ρ) = ρ/(1 + ρ) = ρ − ρ² + O(ρ³),  rre(2 − ρ) = ρ/(2 − ρ) = ρ/2 + ρ²/4 + O(ρ³)
That the actual behavior of the RRE is not governed by (1), being considerably smaller, has been known to specialists in computer calculations for a long time. To clarify this problem we performed intensive numerical experiments.
BFCT showed that the actual RRE is considerably lower than ρ, and even lower than ρ/2, and this led to the definition and use of average RRE, which are more realistic in comparison with the bound ρ. Based on this observation [19] we have defined and calculated an integral ARRE as the definite integral int(rre, 1, 2). This integral is of order
K ρ + O(ρ²),  K = (1/2) log(2) ≈ 0.3466
At the same time the average of the maximums of the RRE is
2Kρ = log(2) ρ ≈ 0.6931 ρ.
Intensive numerical BFCT experiments [19] with rounding of large sequences of numbers x M had shown that the actual ARRE is about 0.40 ρ . This is 2.5 times less than the widely used bound ρ in the literature [14]. In these experiments numbers x M had been excluded because rre ( x ) = 0 . This explains why the multiplier of ρ was computed as 0.40 instead of 0.35.
In another series of experiments we have computed and averaged the RRE for the fractions x = 1/k, x = 1 + 1/k and x = 2 − 1/k for k = 1:m and m up to 1000, and we have computed the ratios rre(x)/ρ. The results are shown in Table 1. The actual RRE in rounding of the unit fractions 1/m (second column) is very close to the integral ARRE (3).
For fractions 1 + 1/m close to 1 the computed ARRE is larger than for fractions 2 − 1/m close to 2. This confirms the observation that rounding of x ∈ [2^k, 2^(k+1)], k ∈ Z, is done with larger (although not twice larger) RRE for x close to the left limit 2^k than for x close to the right limit 2^(k+1) of the interval, see (2). For example, the result in field (4,2) of Table 1 is obtained by the code
for m = 1:1000
 R(m) = double(abs((sym(1/m,'e')-1/m))*m*2/eps);
end
mean(R)
Table 1. Ratios rre(x)/ρ for certain fractions.
  m        1/m       1 + 1/m    2 − 1/m
10 0.3000 0.2423 0.1950
100 0.3417 0.2443 0.2413
1000 0.3323 0.2649 0.2316
We stress that in such experiments we cannot use pseudo-random number generators such as rand, randn and randi from MATLAB® since they produce machine numbers for which the rounding errors are zero.
We summarize the above considerations in the next important proposition.
Proposition 2.
The known rounding error bounds in all computational algorithms realized in FPA may be reduced three times (!) if integral ARRE are used instead of maximum RRE.

3. Numerically Stable Computations

3.1. Lipschitz Continuous Problems

Numerical stability of computational procedures is a major issue in numerical analysis [15]. A large variety of computational problems (actually, almost all computational problems) may be formulated as follows, see [15,16]. Let D ⊂ C(m,1) and R ⊂ C(n,1) be given sets and let f: D → R be a continuous function. We shall use the following informal definition.
Definition 3.
The computational problem is described by the function f, while the particular way of computing the result r = f ( d ) for a given data d D is the computational algorithm.
Thus the computational problem is identified with the pair ( f , d ) , or with the equality r = f ( d ) . For many computational problems the function f is Lipschitz continuous.
Definition 4.
The computational problem r = f(d) is said to be (locally) Lipschitz continuous if there exist constants L > 0 and b > 0 such that
‖f(d_1) − f(d_2)‖ ≤ L ‖d_1 − d_2‖
whenever ‖d_1‖, ‖d_2‖ ≤ b. The problem is globally Lipschitz continuous if b = ∞.
If the function f has locally bounded derivative it is (locally) Lipschitz continuous. The concept of Lipschitz continuity is illustrated by the next example.
Example 6.
Let f: D → R, D ⊂ R, and let a > 0 be a given constant. Then the following assertions take place (see also the sketch after this list).
  • The function f(d) = 1/d is Lipschitz continuous if D = (−∞, −a] ∪ [a, ∞) and is not Lipschitz continuous if D = R ∖ {0}.
  • The function f(d) = d^(1/3) is Lipschitz continuous if D = (−∞, −a] ∪ [a, ∞) and is not Lipschitz continuous if D = R ∖ {0}.
  • The function f(d) = d^m, m ∈ Z_2, is Lipschitz continuous if D = [−a, a] and is not Lipschitz continuous if D = R.
  • The function f(d) = d^(4/3) sin(1/d) for d ≠ 0, f(0) = 0, is differentiable on R but is not Lipschitz continuous.
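A brute force sketch related to the second assertion of Example 6: estimating the Lipschitz constant of d^(1/3) on an assumed interval [a, 1] from sampled difference quotients. The estimate grows without bound as a approaches 0.
f = @(d) d.^(1/3);
a = 0.01; N = 1e4;                       % assumed left end point and sample size
d = linspace(a, 1, N);
L_est = max(abs(diff(f(d))./diff(d)))    % rough Lipschitz constant on [a, 1]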
Suppose finally that the algorithm is realized in FPA, e.g. in MATLAB® computing environment.
Definition 5.
The computational algorithm r = f ( d ) is said to be numerically stable if the computed result r comp is close to the result f ( d * ) of a close problem with data d * .
Definition 5 includes the popular concepts of forward and backward numerical stability. To make this definition formal, suppose that there exist non-negative constants A, B such that
‖r_comp − f(d*)‖ ≤ ρ A ‖r‖,  ‖d − d*‖ ≤ ρ B ‖d‖
and d ≠ 0, r ≠ 0. Further on we have
‖r_comp − r‖ = ‖r_comp − f(d*) + f(d*) − f(d)‖ ≤ ‖r_comp − f(d*)‖ + ‖f(d*) − f(d)‖ ≤ ρ A ‖r‖ + L ‖d* − d‖ ≤ ρ A ‖r‖ + ρ B L ‖d‖
Dividing the last inequality by ‖r‖ we get the following estimate for the relative error in the computed solution
‖r_comp − r‖/‖r‖ ≤ ρ C,  C = A + B L ‖d‖/‖f(d)‖
Definition 6.
The quantity C = C(A, B, d) is said to be the relative condition number of the computational problem r = f(d).
The remarkable formula (4) reveals the three main factors which determine the accuracy of the computed solution [16,19].
Proposition 3.
The accuracy of the computed solution r comp depends on the following factors.
  • The sensitivity of the computational problem relative to perturbations in the data d measured by the Lipschitz constant L of the function f
  • The stability of the computational algorithm expressed by the constants A , B
  • The FPA characterized by the rounding unit ρ and the requirement that the intermediate computed results belong to the normal range X
The error estimate (4) is used in practice as follows. For a number of problems the Lipschitz constant L may be calculated, or estimated as in the solution of linear algebraic equations and the computation of the eigenstructure of a matrix with simple eigenvalues. Finally, the value of ρ is known exactly for FPA as well as for other machine environments.
The estimate of the constants A, B may be a problem. Often the heuristic assumption A = 0 and B = 1 is made, which gives
C = C(d) = L ‖d‖/‖f(d)‖
It follows from (5) that C will be large if L and/or ‖d‖ are large and/or ‖f(d)‖ is small. The constant L is usually out of the user's control. The quantities ‖d‖ and ‖f(d)‖ may be changed by scaling of the computational problem.
Definition 7.
The computational problem is said to be:
  • well conditioned if ρC ≪ 1; in this case we may expect about −log10(ρC) true decimal digits in the computed solution r_comp
  • poorly conditioned if ρC is of order 1 or larger; in this case there may be no true digits in the computed solution r_comp
More precise classification of the conditioning (well, medium and poor) of computational problems is also possible. Computational problems which are not Lipschitz but Hölder continuous may also be analyzed, see e.g. [19].
The above considerations confirm a fundamental rule in numerical computations, namely that if the data d and/or some of the intermediate results in the computation of f ( d ) are large and/or the result r = f ( d ) is small then large relative errors in the computed solution r comp may be expected, see Section 3.3.
For some computational problems r = f(d) there are a priori error estimates for the relative error ‖r_comp − r‖/‖r‖ in the computed solution r_comp in the form of explicit expressions in ρ and d. The importance of such estimates is that they may be computed (or estimated) a priori, before solving the computational problem.
Most a priori error estimates are heuristic. Among the computational problems with such estimates are the solution of linear algebraic equations Ax = b and the computation of the eigenvalues of a matrix A with simple spectrum. To check the accuracy and practical usefulness of such error estimates by BFCT, a set of CPRS is designed and the error estimates are compared with the observed errors.

3.2. Hölder Continuous Problems

Definition 8.
The computational problem r = f(d) is said to be Hölder continuous with exponent h > 0 if there exist constants L > 0 and b > 0 such that
‖f(d_1) − f(d_2)‖ ≤ L ‖d_1 − d_2‖^h
whenever ‖d_1‖, ‖d_2‖ ≤ b.
Hölder continuity implies uniform continuity but the converse is not true. It is supposed that h < 1 . Indeed, if h = 1 the problem is Lipschitz continuous and if h > 1 the function f is constant.
The machine computation of r = f(d) may be accomplished with large errors, of order eps^h, when the exponent h > 0 is small. Such cases arise in the calculation of a multiple zero d_0 of a smooth function f: R → R, where f^(k)(d_0) = 0 for k ∈ Z[0, m − 1] and f^(m)(d_0) ≠ 0. In this case h = 1/m.
Let r ≠ 0 and let r_comp be the computed value of r. Then a heuristic accuracy estimate for the solution of a Hölder problem is
|r_comp − r|/|r| ≤ H ρ^h,  H = L |d|^h/|r|
Experimental results confirming this estimate are presented later on.
A typical example of a Hölder problem is the computation of the roots of an algebraic equation of degree n with multiple roots, where n ≥ 3 (the case n = 2 is treated by a specific algorithm). There are three codes in MATLAB® that may be used for solving algebraic equations, which may be characterized as follows.
  • The code roots is fast and works, on middle power computer platforms, with equations of degree n up to several thousands but gives large errors in case of multiple roots. This code solves algebraic equations only.
  • The code vpasolve works with VPA corresponding to 32 true decimal digits with equations of degree n up to several hundreds but may be slow in some cases. This code works with general equations as well finding one root at a time.
  • The code fzero is fast but finds one root at a time. It may not work properly in case of multiple roots of even multiplicity. This code works with general finite equations as well.

3.3. Golden Rule of Computations

The results presented in this and previous sections confirm the Golden Rule in doing computations in both exact and machine arithmetic which may be formulated as follows.
Proposition 4.
(Golden Rule of Computations) If in a chain of numerical computations the final result is small relative to the initial data and/or to the intermediate results then large relative errors in the computed solution are to be expected.
The already mentioned catastrophic cancellation in subtraction of close numbers is a particular case of Proposition 4.
An important class of computations in FPA are the so called computations with maximum accuracy. Let the vector r with elements r k be the exact solution of the vector computational problem r = f ( d ) and let the vector r comp with elements r k * M be the computed solution.
Definition 9.
The vector r_comp is computed with maximum accuracy if its elements r_k* satisfy r_k* ∈ {(r_k)_−, (r_k)_+}, i.e. each r_k* is one of the machine numbers nearest to r_k.
Solutions with maximum accuracy are the best that we may expect from a computational algorithm implemented in FPA.

4. Extremal Elements of Arrays

Finding minimal and maximal elements of real and complex arrays is a part of many computational algorithms. This problem may look simple but it has pitfalls in some cases.

4.1. Vectors

Let w = [w_1, w_2, …, w_n] be a given real n-vector. The problem of determining an extremal (minimal or maximal) element w_k of w together with its index k arises as part of many computational problems. The solution of this problem is obtained in MATLAB® by the codes [U,K] = min(w) or [V,L] = max(w), where U = w_K is the minimal element of the vector w and K is its index. Similarly, V = w_L is the maximal element of the vector w and L is its index. These codes work for complex vectors as well if the complex numbers are ordered lexicographically by the relation ⪯, see [6].
An important rule here is that if there are several elements w k of w equal to its minimal element U then the computed index K is supposed to be the minimal one. Similarly, if there are several elements w l of w equal to its maximal element V then the computed index L is again the minimal one. The delicate point here is the calculation of indexes K and L. For example, for w = [ c , c , c ] , where c R or c C , the codes min and max produce U = c , K = 1 and V = c , L = 1 as expected.
Due to rounding the above codes may produce unexpected results. Indeed, the performance of the codes depends on the form in which the elements of the vector w are written.
Example 7.
The quantities A = 1/3, B = 4/3 − 1 and C = 1/2 − 1/6 are equal, but their rounded values are different. The rounded value fl(X) ∈ M of X ∈ R is computed by the code sym(X,'f'). It produces the answer in the form m/2^n, where m, n ∈ Z_0. We have fl(A) = 6004799503160661/2^54 and fl(B) = fl(A) − 2^(−54), fl(C) = fl(A) + 2^(−54). Hence fl(B) < fl(A) < fl(C).
Example 8.
Let w be a 3-vector with elements A , B , C at any positions, where A = B = C are defined in Example 7. Then the code [U,K] = min(w)  will give U = fl ( B ) and K will be the index of B as an element of w. The code  [V,L] = max(w)  will give V = fl ( C ) and L will be the index of C as an element of w. At the same time both codes are supposed to give K = L = 1 and U = V .
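A minimal sketch reproducing Example 8 for one particular ordering of the elements; the indices in the comments follow from the rounded values of Example 7:
A = 1/3; B = 4/3 - 1; C = 1/2 - 1/6;   % mathematically equal quantities
w = [A, B, C];
[U, K] = min(w)      % returns fl(B) and K = 2, not K = 1
[V, L] = max(w)      % returns fl(C) and L = 3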
These details of the performance of the codes min and max may not be very popular even among experienced users of sophisticated computer systems for doing mathematics such as MATLAB®.

4.2. Matrices

Similar problems as in Section 4.1 arise in finding extremal (minimal or maximal) elements of multidimensional arrays and in particular of real and complex matrices A.
Consider a matrix A C ( m , n ) with elements A ( p , q ) . Let the minimal element of A relative to the lexicographical order be A min = A ( i 1 , i 2 ) . Then the code
[a,b] = min(A); [A_min,N] = min(a); A_min, ind = [b(N),N]
finds the minimal element A min of A and the pair ( i 1 , i 2 ) of its indexes as
A_min, ind = i_1  i_2
where i_1 = ind(1), i_2 = ind(2). Here a and b are n-vectors and ind is a 2-vector.
If there is more than one minimal element then the computed pair ( i 1 , i 2 ) of its indexes is minimal relative to the lexicographical order. This result, however, may depend on the way the elements of A are specified.
Finding minimal elements of matrices may be a part of improved algorithms to compute extrema of functions of two variables, see Section 7.2. A scheme for finding minimal elements of n D -arrays, n > 2 , may also be derived and applied to the minimization of functions of n variables.

4.3. Application to Voting Theory

The above details of using the codes min and max from MATLAB® [6], although usually neglected, may be important. For example, in certain automatic voting computations (in which one of the authors of this paper participated back in 2013) with the Hamilton method for the bi-proportional voting system used in Bulgaria, these phenomena actually occurred and caused problems. To resolve the problem quickly, being in an emergency situation, we applied BFCT using hand calculations (!).
The computational algorithms for realization of the Hamilton method must be improved by replacing the fractional remainders with integer remainders [19] as follows. Let n parties with votes v_1, v_2, …, v_n take part in the distribution of S > n parliamentary seats by the Hamilton method [20]. Set V = v_1 + v_2 + ⋯ + v_n and m_k = S v_k/V.
Next the rationals m_k are represented as fix(m_k) + μ_k/V, where fix(m_k) ∈ Z_0 is the integer part of m_k, μ_k/V is the fractional remainder and μ_k ∈ Z[1, V − 1] is the integer remainder of S v_k modulo V. Initially the kth party gets fix(m_k) seats. If the sum S_0 of the initial seats is less than S, then the first S − S_0 parties with largest integer remainders μ_k get one seat more. Thus, to avoid small errors in the computation of fractional remainders that may cheat the code max, we must work with exactly computed integer remainders.
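A minimal sketch of the Hamilton apportionment with integer remainders; the vote counts and seat total are assumed values for illustration, and all comparisons use exactly computed integers:
v = [125311; 98402; 60117; 31170];      % assumed vote counts v_k
S = 24;                                 % assumed number of seats
V = sum(v);
mu = mod(S*v, V);                       % integer remainders, computed exactly
m0 = (S*v - mu)/V;                      % initial seats fix(m_k), also exact
seats = m0;
[~, idx] = sort(mu, 'descend');         % rank parties by integer remainders
left = S - sum(m0);
seats(idx(1:left)) = seats(idx(1:left)) + 1   % distribute the remaining seats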

5. Problems with Reference Solution

5.1. Evaluation of Functions

Evaluation of functions f defined by an explicit expression
y = f(x),  0 ≠ x ∈ R,  0 ≠ y ∈ R
is a CPRS, since f(x) may be found with high precision in a suitable computing environment. However, the relative error in computing y may be large even for well behaved functions f and arguments x far from the limits of the normal range [19]. The reason is that, instead of the argument x ∈ R, the corresponding computational algorithm works with the rounded value x* = fl(x) ∈ M. At the same time, for x ∈ M the computed quantity f(x*) is usually exact to working precision.
Example 9.
We have
[cos(realmax);sin(realmax)] = -0.999987689426560
                               0.004961954789184
The computed result is correct to full precision although the argument realmax is written with 309 decimal digits, which may be displayed by the command sym(realmax,'f').
Let the function f be locally Lipschitz continuous with constant L(x) in a neighborhood of x and let the result computed in FPA be y* = f(x*). Then the relative error of y* is estimated as
|y* − y|/|y| ≤ ρ C,  C = C(f, x) = L(x) |x|/|f(x)|
where C(f, x) is the relative condition number of the computational problem y = f(x).
Example 10.
Consider the vector y = [cos(2kπ); sin(2kπ)], where k ∈ Z. We should have y = [1; 0], but for large k this may not be so. Let the vector X ∈ R(1,8) have elements X(p) = π·2^(46+p), i.e.
X = pi*[2^47,2^48,2^49,2^50,2^51,2^52,2^53,2^54]
The command  Y = [cos(X);sin(X)]  gives the matrix
Y = 0.9999  0.9994  0.9976  0.9905  0.9622  0.8517  0.4509 -0.5934
    0.0172 -0.0345 -0.0689 -0.1374 -0.2723 -0.5240 -0.8926 -0.8049
While the first column Y(:,1) of Y still resembles the exact vector [1;0], the next columns are heavily wrong with Y(:,8) being a complete catastrophe with relative error 170%. The commands Y1 = [C1;S1] and Y2 = [C2;S2], where C1 = vpa(cos(X)), S1 = vpa(sin(X)), C2 = cos(vpa(X)) and S2 = sin(vpa(X)) produce the same wrong result Y1 = Y2 = Y.
A way out from such wrong calculations is to evaluate standard trigonometric functions for arguments modulo 2 π .
Example 11.
The command
YY = [cos(mod(X,2*pi));sin(mod(X,2*pi))]
produces the matrix
YY =  1     1     1     1     1     1     1     1
      0     0     0     0     0     0     0     0
which is exact to full precision.
We recall that for |x| ≥ R the rounded value fl(x) of x is an even integer. For example, the quantity a = 23π^32/20 is rounded to fl(a) = 9 321 670 908 397 306, which is demonstrated by the command sym(23*pi^32/20,'f'). Another instructive example is the rounding of integer powers 10^n.
Example 12.
We have fl(10^n) = 10^n for n ∈ Z[16, 22], but
sym([10^23;10^24;10^25],'f') =   99999999999999991611392
                                999999999999999983222784
                              10000000000000000905969664
If x M then for most elementary and special functions f the quantity y = f ( x ) is computed with maximum accuracy.
If the function f is locally Lipschitz with constant L ( x ) then large relative computational errors in evaluation of y = f ( x ) 0 may occur when C ( f , x ) is large, i.e. when L ( x ) and/or | x | is large and/or | y | is small. For the trigonometric functions above the quantities L ( x ) and | y | were equal to 1 but the argument x M is large.
A standard approach to improving the performance of computational algorithms is to scale the initial and/or intermediate data in order to avoid large errors. In computing the matrix exponential exp(A), A ∈ C(n), the scaling A → A·2^(−m) is done with a scaling factor 2^(−m), m ∈ Z_1, chosen as a power of 2 in order to reduce rounding errors [15].
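A minimal sketch of the scaling idea for the matrix exponential, with an assumed crude truncated Taylor series standing in for the Padé-based algorithm actually used by expm:
A = randn(4);                        % assumed test matrix
m = max(0, ceil(log2(norm(A))));     % scaling exponent, a power of 2
B = A/2^m;                           % scaled matrix with norm(B) <= 1
E = eye(4); T = eye(4);
for k = 1:20                         % truncated Taylor series of exp(B)
    T = T*B/k; E = E + T;
end
for k = 1:m
    E = E*E;                         % undo the scaling by repeated squaring
end
norm(E - expm(A))/norm(expm(A))      % compare with MATLAB's expm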

5.2. Linear Algebraic Equations

An instructive example, illustrating BFCT for the solution of linear algebraic equations Ax = b with integer data, is designed as follows. Let A ∈ Z(n) be an invertible matrix and b ∈ Z(n,1). The command A = fix(100*randn(n)); generates a matrix A ∈ Z(n) with normally distributed integer elements of moderate magnitude and zero mean. Choosing the reference solution as X_0 = ones(n,1) we get b = A*X_0. The computed solution is X_comp = A\b with relative error E_n = norm(X_comp - X_0)/sqrt(n).
The a priori relative error estimate is est = eps*cond(A). Usually E_n is a small multiple of est and this may be checked by BFCT. We consider three types of computed solutions as follows.
  • Standard solution X_1 = A\b obtained by Gauss elimination with partial pivoting based on LR decomposition P A = L R of A, where L is lower triangular matrix with unit diagonal elements, R is upper triangular matrix with nonzero diagonal elements and P is permutation matrix.
  • Solution X_2 = R\(Q’*b) obtained by QR decomposition [Q,R] = qr(A) of A, where A = Q R , the matrix Q is orthogonal and the matrix R is upper triangular.
  • Solution X_3 = inv(A)*b based on finding the inverse inv ( A ) = A 1 of A.
Solution X_1 is computed by default in MATLAB® and is accurate and fast. Solution X_2 is very accurate, although more expensive, and is also used. Solution X_3 is not recommended and is included here for tutorial purposes. Indeed, computing inv(A) and multiplying the vector b by this matrix are unnecessary operations which are expensive and reduce accuracy.
The above computational techniques are illustrated by the commands
n = 1000; A = fix(100*randn(n)); X_0 = ones(n,1); b = A*X_0;...
X_1 = A\b; [Q,R] = qr(A); X_2 = R\(Q'*b); X_3 = inv(A)*b;...
E_1 = norm(X_0-X_1)/sqrt(n); E_2 = norm(X_0-X_2)/sqrt(n);...
E_3 = norm(X_0-X_3)/sqrt(n); est = eps*cond(A);...
e_1 = E_1/est; e_2 = E_2/est; e_3 = E_3/est; [e_1;e_2;e_3]
Each execution of these commands produces a different result because of the code randn. The computed quantity e_1 for systems with up to n = 1000 unknowns is usually less than 20, the quantity e_2 is less than 4 and the quantity e_3 is about 40. More precisely, intensive computations with n = 1000 showed that mean(e_1) = 18, mean(e_2) = 3 and mean(e_3) = 40.
Tests with n = 5000 had also been done but they require more computational time. For such values of n we have observed average values mean ( e 1 ) = 30 , mean ( e 2 ) = 3 , mean ( e 3 ) = 150 .
The comparison of e_1 and e_2 with e_3 confirms, among others, the tutorial conclusion that solving linear equations by matrix inversion must definitely be avoided. Rather, the opposite is true: if the inverse matrix B = A^(−1) is needed for some reason, it is found by solving the n linear equations A b_k = i_k for the columns b_k of B, where i_k = I_n(:,k) is the kth column of I_n.
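A minimal sketch of this recommendation, assuming the inverse really is required:
n = 100; A = fix(100*randn(n));  % assumed test matrix
B = A\eye(n);                    % columns b_k solve A*b_k = I_n(:,k)
norm(A*B - eye(n))               % residual check of the computed inverse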
Our main observation, based on BFCT, about comparison of the methods of Gauss elimination and QR decomposition for solving linear algebraic equations, is formulated in the next proposition.
Proposition 5.
For matrices of order up to 1000 the QR decomposition method gives error that is between 5 and 10 times less than the error of the Gauss elimination method.
In addition, the solution of a linear equation A x = b via QR decomposition is backward stable. Thus Proposition 5 is a message to the developer of MATLAB®, the respected MathWorks, Inc. Corporation.
The Gauss elimination method based on the LR decomposition has traditionally been preferred to the QR decomposition method because it requires about half as many floating-point operations. Since fast computational platforms are now widely available, this consideration is no longer decisive.
As a rule the implementation of the code x = A\b gives good results for matrices A R ( n ) with n of order 10 3 . At the same time large errors in x may occur even for n = 2 .
Example 13.
Consider the algebraic equation A(m)x = b(m), where the matrices A(m) ∈ R(2) are defined by the relation (7) below. For x_ref = [1; 1] we have b(m) = [1 − 10^m; −10^m]. The computed results for m = 2:5 are presented in Table 2. In the cases m = 4, 5 there is a warning that the matrix A(m) is close to singular or badly scaled, since cond(A(m)) is larger than 10^16. As shown in Table 2, for these two cases the relative error is indeed 25% and 100%, respectively.

5.3. Eigenvalues of 2 × 2 Machine Matrices

The integer 2 × 2 matrix
A = [−100009999, 100000000; −100020001, 100010001]
has a double eigenvalue λ = 1 and Jordan form J_2 = I_2 + N_2. The elements of A are integers of magnitude about 10^8. This is far from the quantity R such that integers n > R may not be machine numbers.
Example 14.
The MATLAB® code e = eig(A)' computes the row 2-vector e of eigenvalues of A as e_comp = [−0.3955, 2.3955] instead of e = [1, 1]. The relative error is 140% and there is no true digit in the computed vector e_comp (even the sign of the first computed eigenvalue is wrong). The trace of A is fortunately computed as trace(A) = 2 and is the sum of the computed eigenvalues. The determinant det(A) = 1 is computed wrongly as det(A) = 0.2214.
Surprisingly, the vector p of the coefficients of the characteristic polynomial of A, computed as p = poly(A), is [1,-2,-0.9475]. The last element of the computed vector is not the computed determinant 0.2214, which may confuse the user. Moreover, the computed eigenvalues of A are the roots of this wrong characteristic polynomial and hence they are not computed from the Schur form of A as supposed.
Here we recall the modern paradigm in numerical spectral analysis and root finding of polynomials: the eigenvalues of a matrix are not computed as roots of its characteristic polynomial; rather the roots of any polynomial are computed as the eigenvalues of its companion matrix.
The Maple [8] code jordan(A), incorporated in MATLAB®, computes correctly the Jordan form I_2 + N_2 of A. The vector of eigenvalue condition numbers, computed by the command condeig(A), is [c, c] with c = 7.1665 × 10^7, which is relatively large. This indicates that something is wrong. In reality things are even worse, since the vector of eigenvalue condition numbers of the matrix A does not exist, because A has only one linearly independent eigenvector.
At this moment the user may, or may not be aware that something is wrong (with exception of the code jordan which is not very popular and is of restricted use). Now comes the time of BFCT to find the spectrum and the Jordan form of A as follows.
The sum e 1 + e 2 of the eigenvalues of A is equal to the trace A ( 1 , 1 ) + A ( 2 , 2 ) = 2 which is computed exactly. If the eigenvalues are close or equal, because the eigenvalue condition numbers are large, then we should have e 1 + e 2 2 e 1 = 2 and e 1 = e 2 = 1 . The Jordan form of A is then either J 2 or I 2 . But the only matrix with Jordan form I 2 is I 2 itself and hence the Jordan form of A must be J 2 . We have obtained the correct result using BFCT and simple logical reasoning.
The matrix (6) may be written as
A(m) = [1 − a − a², a²; −1 − 2a − a², 1 + a + a²],  a = 10^m
for m = 4 . Thus we have a family of matrices { A ( m ) } parametrized by the integer m. The trace of A ( m ) is equal to 2, the determinant of A ( m ) is equal to 1, the eigenvalues of A ( m ) are e 1 = e 2 = 1 and the Jordan form of A ( m ) is I 2 + N 2 .
Below we give the values of the coefficients c_2 = tr(A) = 2 and c_3 = det(A) = 1 of the characteristic polynomial λ² − c_2 λ + c_3 of A as found by the code poly(A), and of the eigenvalues e_1 = e_2 = 1 of A, as computed by the code eig(A) in MATLAB®.
  • For m = 2 we have c 2 = 2.0000 , c 3 = 1.0000 and e 1 , 2 = 1.0000 ± 0.0001 i , where the quantity 1.0000 has at least 5 true decimal digits and 0.0001 i is a small imaginary tail which may be neglected. This result seems acceptable.
  • For m = 3 we have c 2 = 2.0000 , c 3 = 1.0000 and e 1 = 0.9937 , e 2 = 1.0063 . With certain optimism these results may also be declared as acceptable.
  • For m = 4 a computational catastrophe occurs, with c_2 = 2.0000 (true), c_3 = 0.9475 (completely wrong) and e_1 = −0.3955, e_2 = 2.3955 (also completely wrong). Here, surprisingly, det(A) is computed as 0.2214, which differs from 1 (as is to be expected) but differs also from c_3. It is not clear what happened and why. Using the computed coefficients c_2, c_3, the roots of the characteristic equation λ² − c_2λ + c_3 = 0 now are λ_1 = 0.7709, λ_2 = 1.2291 instead of the computed e_1 and e_2. This is a strange wrong result.
  • We give also the results for m = 5 which are full trash and are served without any warning for the user. We have c 2 = 2.0000 , c 3 = 5591.5 and e 1 , 2 = 1.0000 ± 74.770 i . We also get det ( A ) = 4472.2 for completeness.
In all cases the sum c 2 = 2.0000 of the wrongly computed eigenvalues is almost exact which is a well known fact in the numerical spectral analysis of matrices [15].
The conclusion from the above considerations is that BFCT in this case, in contrast to the standard software, always gives e_1 ≈ e_2 = c_2/2 = 1.0000 with 5 true decimal digits. This is a good approximation to the eigenvalues e_1 = e_2 = 1 of the matrices A(m) for m = 3, 4, 5, when the codes eig and poly fail.
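A minimal sketch of the whole experiment for the matrix (6); the call to jordan assumes the Symbolic Math Toolbox is available:
A = [-100009999, 100000000; -100020001, 100010001];
e = eig(A)'                      % wrong eigenvalues, delivered without warning
[trace(A), det(A)]               % the trace is exact; trace(A)/2 = 1 recovers the eigenvalues
jordan(sym(A))                   % symbolic computation gives the Jordan form I_2 + N_2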

6. Zeros of Functions

Computing zeros of functions and of polynomials in particular is a classical problem in numerical analysis. Although powerful sophisticated computational algorithms for this purpose have been developed, the use of BFCT in this area is still useful.
Consider first the scalar case. Computing the zeros of a continuous function f: R → R, i.e. solving the equation f(x) = 0 in a given interval T = [a, b], may be a difficult problem. Here we distinguish three main tasks.
  • Find one particular solution r 1 of equation f ( x ) = 0 in the interval T.
  • Find the general solution f^(−1)(0) ∩ T of equation f(x) = 0 in the interval T.
  • Find all roots [ r 1 , r 2 , , r n ] of a given nth degree polynomial P ( x ) .
The condition f ( a ) f ( b ) < 0 guarantees that there is at least one solution in the interval ( a , b ) . Whether f has zeros at the points a , b is checked by direct computation.
Solving real and complex polynomial equations P(x) = 0 of degree n, of the form
P(x) = p_1 x^n + p_2 x^(n−1) + ⋯ + p_n x + p_(n+1) = 0
with p_1 ≠ 0, is done by the code vpasolve(P(x) == 0) from MATLAB®. It finds all roots of P(x) with quadruple precision of 32 true decimal digits.
For n = 100 the speed of the corresponding software on a medium computer platform is still acceptable. For n = 1000 and higher the performance of the code may be slow even on fast platforms, and hence it may not be applicable to RTA (if such computations are ever needed in RTA). Note that we write this at the end of 2024.
A faster solution may be obtained by the code roots(p), where p is the vector of coefficients p_1, p_2, …, p_(n+1) of P(x). This code computes the column vector of roots r = [r_1; r_2; …; r_n] ∈ C(n,1) of P(x) as the vector of eigenvalues of the companion matrix of P(x) using the QR algorithm, i.e. r = eig(compan(p)). The code roots(p) works relatively fast for n up to several thousands, but may produce wrong results even for small values of n when the polynomial has multiple roots and the computational problem is poorly conditioned.
It is known that the relative errors in computing multiple roots of equations f(x) = 0 in FPA, where f has a sufficient number of derivatives around a root x = r, are of order eps^(1/n), where n is the multiplicity of r. Correspondingly, the relative errors of most computational algorithms for solving such equations are of order γ eps^(1/n), where γ is a constant of order O(1).
To estimate γ, consider the equation P(x) = 0 when the nth degree polynomial P(x) has an n-tuple root r_1 = 1. Denote the relative error in the computed vector r_comp as
E_n = ‖r_comp − ones(n,1)‖/√n
The ratio γ_n = E_n/eps^(1/n) is shown in Table 3 for the expanded form x^n − n x^(n−1) + ⋯ + (−1)^n of the polynomial P(x) = (x − 1)^n, and values of n from 3 to 20. For n = 2 the computed roots are exact and γ_2 = 0.
We see that the relative error E_n in the computed solution for n up to 20 satisfies the heuristic estimate E_n ≤ γ_n eps^(1/n), where the average of γ_n is 1.87. Next we assume that γ_n = 1.9 for simplicity.
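A minimal sketch of the experiment behind Table 3 for one assumed value of n:
n = 10;                          % assumed multiplicity
p = poly(ones(n,1));             % coefficients of (x - 1)^n, exact integers
r_comp = roots(p);               % roots via the companion matrix
E_n   = norm(r_comp - ones(n,1))/sqrt(n);
gamma = E_n/eps^(1/n)            % compare with the tabulated values of gamma_n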
The polynomial P ( x ) = ( x 1 ) n may be found using the command expand((x-1)^n), or asking the intelligent dialogue computer system WolframAlpha [10]
Find the coefficients of the polynomial (x-1)^n
We stress that in all cases the code vpasolve(P(x) == 0) gives the exact solution vector ones(n,1).
Usually the computed roots are unacceptable if their relative error is greater than 1%. For example, it follows from Table 3 that the root r = 1 of multiplicity 20 is computed by the code roots with relative error 32%, which is unacceptable. It must be stressed that this is not due to programming shortcomings of the code, but rather to the high sensitivity of the eigenvalues of the companion matrix to perturbations in its elements.
If the polynomial equation P(x) = 0 has a root of multiplicity n, the relative error is estimated as 1.9 eps^(1/n). Solving the equation 1.9 eps^(1/n) = 0.01 we get n = 6.89 and n ≈ 7. We summarize this slightly unexpected observation in a special proposition.
Proposition 6.
Relative errors in computing n-tuple roots of polynomials by the code roots may be unacceptable when n ≥ 7.
Let now a continuous function f be given and suppose that we look for all roots of the equation f(x) = 0 in the interval T = [a, b], i.e. for the set f^(−1)(0) ∩ T. Here the use of BFCT seems necessary. A direct approach is to plot the graph of the function x ↦ |f(x)|, x ∈ T. Let N be a sufficiently large integer, say N = 1000. We may use the commands X = linspace(a,b,N); Y = abs(f(X)); and plot(X,Y) to plot the graph of the function x ↦ |f(x)|, x ∈ T.
The zeros of f are marked as inverted peaks at the roots r_1 < r_2 < ⋯. Next the codes vpasolve(f,r_k) or fzero(f,r_k) may be used around each of the points r_k to compute the roots to full accuracy. This approach is well known in computational practice and is recommended in some MATLAB® guides.
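A minimal sketch of this graphical approach for an assumed test function f(x) = sin(x) − x/10 on T = [−10, 10]; here sign-change bracketing replaces the visual inspection of the inverted peaks:
f = @(x) sin(x) - x/10;
a = -10; b = 10; N = 1000;
X = linspace(a, b, N); Y = f(X);
plot(X, abs(Y))                                 % zeros appear as inverted peaks
k = find(Y(1:end-1).*Y(2:end) < 0);             % brackets containing a sign change
r = arrayfun(@(j) fzero(f, [X(j) X(j+1)]), k)   % refine each root with fzero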
More difficult problems are the solution of nonlinear vector equations f(x) = 0 and of matrix equations F(X) = 0, where x and f(x) are n-vectors, and X and F(X) are n × n matrices. The MATLAB® code fsolve is intended to solve nonlinear vector problems, while the codes are, axxbc, care, dare, dlyap, dric, lyap, ric, sylv and others solve linear and nonlinear matrix equations. The performance of all these codes may be checked by BFCT and CPRS. This is a challenging problem which will be considered elsewhere.

7. Minimization of Functions

7.1. Functions of Scalar Argument

The minimization of the expression y = f(x), x ∈ T = [a, b], where f: T → R is a continuous function, is done in MATLAB® by a variant of the golden section method. The corresponding code [x_0,y_0] = fminbnd(f,a,b) finds a local minimum in the interval T, where y_0 = f(x_0).
To find the global minimum [ x min , y min ] of f in the interval T we may use the graph of the function f plotted by the commands x = linspace(a,b,n); y = f(x); plot(x,y). Then the approximate global minimum [ c , f ( c ) ] of f is determined visually, considering also the pairs [ a , f ( a ) ] and [ b , f ( b ) ] at the end points of T. The code
[x_min,y_min] = fminbnd(f,c-h,c+h)
is then used in a neighborhood of the approximate minimum [c, f(c)]. This approach requires human intervention in the process and may be avoided as shown later.
The problem with the code fminbnd is not that it may find local minima. The real problem with this and similar codes is that they can miss the global minimum even when the function f is unimodal, i.e. when it has a unique minimum. So BFCT for such functions are a must. The problem is that usually we do not know whether the function is unimodal or not, and whether the computed minimum is a minimum at all; all the more so when performing complicated algorithms in RTA.
Example 15.
Consider the function
f(x) = 2 − x² exp(1 − x²),  x ∈ [0, ∞)
This function is unimodal with minimum [ x min ; y min ] = [ 1 ; 1 ] . In some cases this minimum is found as
[x_min,y_min] = fminbnd(f,0,b)
for some b > 0. But in other cases it is not found correctly and here we should use BFCT. The code works well for 1 ≤ b ≤ 16.85 but gives a completely wrong result for b ≥ 16.86 without warning. Indeed, we have the good result
[x_min;y_min] = fminbnd(f,0,16.85) =  0.999992123678799
                                      1.000000000124073
and the bad result
[x_min;y_min] = fminbnd(f,0,16.86) = 16.859937887512295
                                      2
The second result has 1586% relative error for x min and 100% relative error for y min .
The reason for the disturbing events in Example 15 is complicated. We may expect that f(x) is computed as 2 for values of x such that the expression φ(x) = x^2 exp(1 − x^2) is less than ρ, since then f(x) = 2 − φ(x) is rounded to 2. The root r_0 > 0 of the equation φ(x) = ρ is r_0 = 6.38 and indeed the computed value of f(r_0) is 2. But r_0 is much less than 16.86 and hence the code fminbnd works with a certain sophisticated version of VPA instead of FPA. In particular, the quantity α = φ(16.86) is much less than ρ, with α/ρ ≈ 2.46 × 10^(−105) ≪ 1. We summarize the above considerations in the next statement.
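The quantities used in this explanation are easy to check by BFCT; the following minimal sketch assumes that the rounding unit ρ equals eps/2 (an assumption of the sketch, since ρ is defined earlier in the paper).
phi = @(x) x.^2.*exp(1 - x.^2);
f   = @(x) 2 - phi(x);
rho = eps/2;                   % assumed value of the rounding unit
f(16.86)                       % computed exactly as 2 in FPA
phi(16.86)/rho                 % about 2.5e-105, i.e. alpha/rho << 1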
Proposition 7.
The software for minimization of functions of one variable such as fminbnd(f,a,b) from MATLAB® may fail to produce correct results even for some unimodal functions f : T → R.
In contrast, BFCT can always be used to find the global minimum automatically. Let the function f : T → R be unimodal. Consider the vectors X = linspace(a,b,n+1) with elements X(k) = a + (k − 1)(b − a)/n and Y = f(X) with elements Y(k). The MATLAB® command [y,m] = min(Y) computes an approximation [X(m), y] to the minimum [x_min, y_min]. Finally, the (almost) exact minimum is computed over the interval between the neighboring elements X(m−1) and X(m+1) as
[x_min,y_min] = fminbnd(f,X(m-1),X(m+1))
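A minimal sketch of this BFCT recipe, applied to the function of Example 15 (the end point b = 20 and the grid size n = 1000 are illustrative choices):
f = @(x) 2 - x.^2.*exp(1 - x.^2);
a = 0; b = 20; n = 1000;
X = linspace(a,b,n+1); Y = f(X);
[y,m] = min(Y);                                   % coarse minimum on the grid
m = min(max(m,2),n);                              % keep both neighbors inside [a,b]
[x_min,y_min] = fminbnd(f,X(m-1),X(m+1))          % refined minimum, close to [1, 1]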

7.2. Functions of Vector Argument

For n ≥ 2 the minimization of continuous functions f : R(n,1) → R of vector argument meets difficulties similar to those described in Section 7.1, as well as other specific difficulties. Unconstrained minimization f(x) → min is done in MATLAB® by the Nelder–Mead algorithm [23,24], realized by the code fminsearch [6]. We stress that the code finds a local minimizer [x_min, y_min] of the function f, where y_min = f(x_min).
The function f need not be differentiable, which is an advantage of this algorithm. The algorithm compares the function values f(x) at the n + 1 vertices of a simplex and then replaces the vertex with the highest value of f by another vertex. The simplex usually (but not always, as shown below) contracts to a minimum of f, which may be local or global. The corresponding command is
[x_min,y_min] = fminsearch(f,x_0)
where x 0 is the initial vector.
The computed result depends on the following five factors, among others:
  1. the computational problem
  2. the computational algorithm
  3. the FPA
  4. the computer platform
  5. the starting point x_0
Factors 1–3 are usually out of the control of the user. Factor 4 is partially under control, e.g. the user may buy another platform. Only factor 5, which is very important, is entirely under the control of the user. As mentioned above, the computed value x_min depends on the starting point x_0. This is denoted as x_min = Ψ(x_0).
Definition 10.
The point x 0 is said to be a numerical fixed point (NFP) of the computational procedure if x 0 = Ψ ( x 0 ) .
The computed result x_min is a NFP of the computational procedure. The NFP depends not only on the computational problem and the computational algorithm but also on the FPA and on the particular computer platform where the algorithm is realized. Unfortunately, computed values for x_min which are far from any actual minimum are often NFP of the computational procedures. They are returned to the user without warning.
For a class of optimization problems the function f has the form
f(x) = c + p(x) e(x),   e(x) = exp(−a_1 x_1^2 − a_2 x_2^2 − ⋯ − a_n x_n^2)   (8)
where c > 0, p(x) is a polynomial in x and the a_k are positive constants. Since f(x) → c when ‖x‖ → ∞, the function (8) has (at least one) global minimum [x_min, y_min].
Numerical experiments with functions f of type (8) show not only that the computed values for x_min and y_min depend on the initial guess x_0, but also that in many cases the code fminsearch produces a wrong solution for x_min which is far from any minimizer of the function f. The reason is that the value of p(x) e(x) may become very small relative to c. Then f(x) is computed as c and the algorithm stops at a point which is far from any minimizer of f. Usually no warning is issued for the user in such cases. This may be a problem when the minimization code is a part of an automatic computational procedure which is not under human supervision. In such cases application of the techniques of AI and MAI may be useful.
Example 16.
Let n = 2 and consider the function of type (8)
f(x) = 2 − x_1 x_2 exp(1 − (x_1^2 + x_2^2)/2)
i.e. c = 2, p(x) = −exp(1) x_1 x_2 and a_1 = a_2 = 1/2. The function f is coded as
f = @(x)2-x(1)*x(2)*exp(1-x(1)^2/2-x(2)^2/2)
The function y = f(x), x ∈ R × R, has two global minimizers
x_{min,1} = [1, 1],   x_{min,2} = [−1, −1]
for which
y min , 1 = y min , 2 = 1 .
It also has two global maximizers
x_{max,1} = [1, −1],   x_{max,2} = [−1, 1]
for which
y max , 1 = y max , 2 = 3 .
Let the initial guess for the code
[x_min,y_min] = fminsearch(f,x_0)
be x 0 = [ c , c ] . For c = 6.4308 we get the relatively good result
x_min = -1.000041440968428   -0.999998218805417
y_min =  1.000000001720503
For the slightly different value c = 6.4309 the computed result x min = [ c , c ] is a numerical fixed point which is not a minimizer. Indeed, we have
x_min = 6.430900000000000   6.430900000000000
y_min = 2.000000000000000
The second computed result in Example 16 is a numerical catastrophe with 543% relative error in x_min and 100% relative error in y_min. The reason is that c^2 exp(1 − c^2) ≤ ρ for c ≥ 6.4391 and the value of f at [c, c] is rounded to 2. The conclusion is that the algorithm of the moving simplex uses FPA rather than VPA; compare with Example 15.
The minimization problem in Example 16 is not unimodal since it has two solutions for x min and a unique solution for y min . Solving unimodal problems by the code fminsearch also leads to large errors. Consider the next unimodal problem for which the code also fails for moderate data.
Example 17.
Let n = 2 and consider the function of type (8)
f(x) = 2 − x_1 exp(1/2 − x_1^2/2 − (x_2 − 1)^2)
The function f is coded as
f = @(x)2-x(1)*exp(1/2-x(1)^2/2-(x(2)-1)^2)
The minimization problem has the unique solution x_min = [1, 1], y_min = 1. For c = 5.81 the code
[x_min,y_min] = fminsearch(f,[c,c])
gives a quite acceptable result, namely
x_min =  1.000039724696251   1.000017661431510
y_min =  1.000000001889957
For a slightly different value of c the computed result x min = [ c , c ] is a numerical fixed point which is not a minimizer. Indeed, for c = 5.82 the computed result is wrong, namely
x_min =  5.820000000000000   5.820000000000000
y_min =  2
There is 482% relative error in the computed argument x min and 100% relative error in the computed minimum y min .
The way out of such situations is, at least for small n, to use BFCT as follows. Let n = 2 and let f : T = T_1 × T_2 → R be the function which has to be minimized, where T_k = [a_k, b_k] ⊂ R. We choose positive integers n_1, n_2 and compute the grids
X_k = linspace(a_k,b_k,n_k)
and the quantities
F(i, j) = f(X_1(i), X_2(j)).
Thus we construct the matrix F ∈ R(n_1, n_2) which is a discrete analogue of the surface z = f(x, y), (x, y) ∈ T.
Next the minimal element F(k_1, k_2) and its indexes (k_1, k_2) are found by the BFCT described in Section 4.2. Now the approximate global minimum of f is [X_min, Y_min], where X_min ∈ R(2,1) may be used as the starting point x_0 for the code
[x_min,y_min] = fminsearch(f,x_0)
Extensive numerical experiments show that this code now works properly.
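A hedged sketch of this two-dimensional BFCT procedure for the function of Example 17 (the grid limits and sizes below are illustrative choices):
f  = @(x) 2 - x(1)*exp(1/2 - x(1)^2/2 - (x(2)-1)^2);
a1 = -3; b1 = 3; a2 = -3; b2 = 3; n1 = 201; n2 = 201;
X1 = linspace(a1,b1,n1); X2 = linspace(a2,b2,n2);
F  = zeros(n1,n2);
for i = 1:n1
    for j = 1:n2
        F(i,j) = f([X1(i); X2(j)]);               % F(i,j) = f(X_1(i), X_2(j))
    end
end
[Y_min,idx] = min(F(:));                          % minimal element of F
[k1,k2] = ind2sub([n1 n2],idx);                   % its indexes (k_1, k_2)
x_0 = [X1(k1); X2(k2)];                           % BFCT starting point
[x_min,y_min] = fminsearch(f,x_0)                 % now converges to [1; 1], y_min = 1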

8. Canonical Forms of 2 × 2 Matrices

8.1. Preliminaries

An interesting observation is that important properties of linear operators in n-dimensional vector spaces can be revealed for dimensions as low as n = 2 , see [28]. In this section we consider some modified definitions of canonical forms and present instructive examples for the sensitivity of Jordan forms using BFCT.

8.2. Jordan and Generalized Jordan Forms

In this section we use some not very popular definitions of canonical forms, e.g. a definition of Jordan form of a matrix without mentioning eigenvalues and without using the concept of Jordan blocks.
Definition 11.
Let n ∈ Z, n ≥ 2. The upper triangular bi-diagonal matrix J ∈ C(n) is said to be a Jordan matrix if
  • J(k, k+1) ∈ {0, 1}
  • J(k, k+1) = 1 implies J(k, k) = J(k+1, k+1)
  • J(k, k) ≠ J(k+1, k+1) implies J(k, k+1) = 0
for k = 1, 2, …, n − 1. The set of Jordan matrices is denoted as J(n).
Definition 12.
The Jordan matrix J is spectrally ordered if J(k, k) ≽ J(k+1, k+1) for k = 1, 2, …, n − 1. The set of spectrally ordered Jordan matrices is denoted as J̃(n).
Proposition 8.
The set J(2) consists of the matrices diag(λ, μ) and λI_2 + E_{1,2}, where λ, μ ∈ C.
Definition 13.
The Jordan problem for a nonzero matrix A ∈ C(n) is to find a pair (X_A, J_A) ∈ GL(n) × J(n) such that X_A J_A = A X_A. Here X_A is the transformation matrix and J_A is the Jordan form of A.
Note that we do not explicitly use the spectrum of A or the concept of Jordan blocks. Indeed, given a general matrix A (even with integer elements and of low size as in (6)), a priori we know neither its spectrum nor its Jordan structure.
Remark.  The Jordan form J_A is not unique and is determined only up to the ordering of the unit elements on its bi-diagonal, unless A = λI_n, in which case J_A = A is unique and X_A is an arbitrary invertible matrix.
The Jordan form may be defined uniquely, so as to become a canonical Jordan form, in terms of group theory [29] and integer partition theory [30], although this is rarely done to a sufficient extent in the literature. The transformation matrix X_A is always non-unique since, e.g., the matrix γX_A, γ ≠ 0, is also a transformation matrix.
Example 18.
If the matrix A ∈ C(n) has a single eigenvalue then there are 2^(n−1) different orderings of the unit elements on the n − 1 positions (k, k+1) of the super-diagonal of A.
Definition 14.
Let J ∈ J(n). The matrix X ∈ GL(n) is said to be a stabilizer of J if X^(−1) J X ∈ J(n). The set of stabilizers of J is denoted as Stb(J) ⊂ GL(n). The matrix I_n is the center of Stb(J).
Note that the set Stb(A) in general is not necessarily a group unless A = λI_n, and {I_n} is not a center in the group-theoretical sense (we recall that the center of GL(n) is the subgroup of matrices aI_n, a ≠ 0).
Example 19.
Let n = 2 and A ∈ J(2). Then the following three cases are possible.
  • If A = λI_2 then Stb(A) is the group GL(2).
  • If A = diag(λ_1, λ_2), λ_1 ≠ λ_2, then Stb(A) is the set of matrices diag(a_1, a_2) and a_1 E_{1,2} + a_2 E_{2,1}, where a_1 a_2 ≠ 0.
  • If A = λI_2 + E_{1,2} then Stb(A) is the set of matrices a(E_{1,2} + E_{2,1}), a ≠ 0.
Example 20.
 
Definition 15.
The upper triangular bi-diagonal matrix G ∈ C(n) is said to be a generalized Jordan matrix if
  • G(k, k+1) ≠ 0 implies G(k, k) = G(k+1, k+1)
  • G(k, k) ≠ G(k+1, k+1) implies G(k, k+1) = 0.
The set of generalized Jordan matrices is denoted as G ( n ) .
Generalized Jordan matrices G have the structure of Jordan matrices, where the nonzero elements G ( i , i + 1 ) (if any) are not necessarily equal to 1. Hence they are less sensitive to perturbations in the matrix A.
There are two forms, diag(λ, μ) and λI_2 + aE_{1,2}, of the elements of G(2), where λ, μ, a ∈ C and a ≠ 0. Note that we do not define spectrally ordered generalized Jordan matrices since they are not important for the computational practice.
Both Jordan matrices J and generalized Jordan matrices G are bi-diagonal, where the bi-diagonal elements of J are zeros or ones, while the bi-diagonal elements of G are correspondingly zeros or nonzero numbers.
Definition 16.
The generalized Jordan problem for a nonzero matrix A ∈ C(n) is to find a pair (Y_A, G_A) ∈ GL(n) × G(n) such that Y_A G_A = A Y_A. Here Y_A is the transformation matrix and G_A is the generalized Jordan form of A.
Consider now perturbations in Jordan forms. Let A(ε) = A_0 + εA_1, where A_0, A_1 ∈ C(n) and ε ∈ (−ε_0, ε_0), with ε_0 > 0 a small parameter. If the matrix A_0 has multiple eigenvalues then the transformation matrix X(ε) = X_{A(ε)} and the Jordan form J(ε) = J_{A(ε)} of A(ε) may be discontinuous at the point ε = 0. The situation with the transformation matrix X(ε) is even more subtle.
If the matrix A 0 has simple eigenvalues then the matrices X ( ε ) and J ( ε ) depend continuously on ε . In particular X ( 0 ) = X 0 and J ( 0 ) = J 0 .
Next we consider matrices A R ( 2 ) with real spectrum. Then the transformation matrices X , Y , the Jordan forms J and the generalized Jordan forms G are real as well.
For small size matrices A with rational elements (machine numbers in particular) A(i, j), the matrices X_A and J_A are computed exactly by the MATLAB® (Maple) command [X_A,J_A] = jordan(A).
Example 21.
Consider the matrix A = I 2 and let
A(ε) = I_2 + εE_{2,1},   ε = 2^(−m),   m ∈ Z, m ≥ 1.
Then the transformation matrix and the Jordan form of A(ε) are
X(ε) = E_{2,1} + ε^(−1) E_{1,2},   J(ε) = I_2 + E_{1,2}.
The code  [X,J] = jordan(A)  computes the matrices X(ε), J(ε) exactly for m ≤ 1074. For m = 1075 the matrices X_A and J_A are computed as I_2 since ε is rounded to zero.
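A minimal sketch reproducing Example 21, assuming that the Symbolic Math Toolbox (which provides the command jordan) is available:
m  = 1074;
ep = 2^(-m);                          % the smallest positive subnormal number
A  = [1 0; ep 1];                     % A(eps) = I_2 + eps*E_{2,1}
[X,J] = jordan(A)                     % exact result: J = [1 1; 0 1]
[X0,J0] = jordan([1 0; 2^(-1075) 1])  % 2^(-1075) is rounded to 0, so J0 = I_2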
Example 22.
Consider the matrix A = I 2 + E 1 , 2 and let
A(ε) = A + εE_{1,1},   ε = 2^(−m),   m ∈ Z, m ≥ 1.
Then the Jordan forms of A(ε) are
J_1(ε) = diag(1 + ε, 1),   J_2(ε) = diag(1, 1 + ε).
The form J_1(ε) is unachievable. For the form J_2(ε) we have infinitely many transformation matrices X_2(ε), e.g.
X_2(ε) = μ (E_{1,2} + E_{2,1} − ε^(−1) E_{1,1})
where μ ≠ 0 is arbitrary.
The code  [X,J] = jordan(A)  computes exactly the matrices X_2(ε) (with μ = 1) and J_2(ε) for ε = 2^(−m) and m ≤ 44. For m = 45 these matrices are computed wrongly as X_2(ε) = I_2 and J_2(ε) = I_2.
Proposition 9.
Jordan forms J_A of matrices A with multiple eigenvalues are sensitive to perturbations in A for three main reasons: 1) because J_A(i, j) = 0 for i ≥ j + 1; 2) because J_A(i, j) = 0 for j ≥ i + 2 (if n ≥ 3); and 3) because their nonzero elements J_A(i, i+1) (if any) are equal to 1.

8.3. Condensed Schur Form

Schur forms S_A of matrices A satisfy the only condition S_A(i, j) = 0 for i ≥ j + 1 and are thus less sensitive to perturbations in A. Nevertheless, they may be discontinuous in a neighborhood of a matrix A_0 with multiple eigenvalues. The case of simple eigenvalues is studied in more detail, see e.g. [31].
Definition 17.
The matrix S C ( n ) is said to be a Schur matrix if it is upper triangular. The set of Schur matrices is denoted as S ( n ) .
We recall that we study only the cases n ≥ 2. Now it is easy to see that S(n) = G(n) if and only if n = 2.
Each matrix A ∈ C(n) is unitarily similar to a Schur matrix S_A ∈ S(n), i.e. U_A S_A = A U_A for some matrix U_A ∈ U(n). The matrix U_A is the transformation matrix and the matrix S_A ∈ S(n) is the condensed Schur form of A. The Schur problem for A is to describe the set of pairs (U_A, S_A) and to study their sensitivity relative to perturbations in A.
Although less sensitive to perturbations in A, the solution ( U A , S A ) of the Schur problem may also be sensitive and even discontinuous as a function of A at points A 0 , where the matrix A 0 has multiple eigenvalues.
Let A = λI_2 and let εE_{2,1} be a perturbation of A. The Schur form of A is A itself and in this case we may consider the trivial solution (I_2, A) of the Schur problem for A. For arbitrarily small ε ≠ 0 the Schur forms of A + εE_{2,1} are S = λI_2 ± εE_{1,2} with transformation matrices V = ±E_{1,2} ± E_{2,1}. Thus the solution jumps from the trivial transformation U = I_2 to V for ε ≠ 0.
Example 23.
Let m = 50, a = 2^(−m) and A = [1, 0; a, 1]. The command
[X,J] = schur(A)
gives the correct answer
X =  0    -1       J = 1.000000000000000  -0.000000000000001
     1     0           0                   1.000000000000000
For m = 51 we have
X =  1     0       J = 1     0
     0     1           0     1
due to rounding errors. Obviously the code schur works with FPA.

9. Least Squares Revisited

9.1. Preliminaries

In least squares methods (LSM) the aim is to minimize the sum of squared residuals. A not very popular fact is that a smaller residual may correspond to a larger error [21,22]. Moreover, this phenomenon may be observed on an infinite set of residuals and errors. This may have far-reaching consequences. So the LSM has to be revisited taking such cases into account.
A least squares problem (LSP) may arise naturally in connection with a given approximation scheme, or as an equivalent formulation of a zero finding method.

9.2. Nonlinear Problems

The solution of the (overdetermined) equation f(x) = 0, where x and f(x) are vectors and f is a continuous function, may be reduced to minimization of the quantity ‖f(x)‖^2. Such minimization problems also arise in a genuine optimization setting, e.g. in the approximation of data.
Let φ : C(n,1) → R_+ be a continuous function and let the problem φ(x) → min be unimodal, i.e. there is a unique (global) minimum [x_0; y_0] ∈ C(n,1) × R_+, where y_0 = φ(x_0). Denote by e_k = ‖x_k − x_0‖ the distance between the vector x_k and x_0, or the absolute error of the vector x_k when x_k is interpreted as a computed solution.
Let n = 1 and φ : R → R_+. Let (x_k) be a sequence of computed solutions. Then it is possible that r_k = φ(x_k) tends to 0 while e_k tends to ∞ as k → ∞, as the next example shows.
Example 24.
Let
φ(x) = (1 + (x − 1)^2)/(1 + (x − 1)^4),   x ∈ R
The function φ is unimodal with minimum [x_0, y_0] = [1, 1]. Let x_k = k. Then r_k = φ(x_k) → 0 and e_k = |k − 1| → ∞ as k → ∞.
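A minimal sketch illustrating this opposite behavior of the remainder and the error for the function φ of Example 24:
phi = @(x) (1 + (x-1).^2)./(1 + (x-1).^4);
k = (2:100:1e5)';                     % x_k = k
r = phi(k);                           % remainders, tend to 0
e = abs(k - 1);                       % errors, tend to infinity
loglog(e,r), xlabel('e_k'), ylabel('r_k')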
This result is easily extended to the case n ≥ 2.
Example 25.
Let
φ(x) = (1 + (‖x‖ − 1)^2)/(1 + (‖x‖ − 1)^4),   x ∈ C(n,1)
The function φ has infinitely many minimums [x_0, 1] on the unit sphere ‖x_0‖ = 1. Let x_k = k [1; 0; …; 0] for k ∈ Z, k ≥ 1, i.e. ‖x_k‖ = k. Then r_k → 0 and e_k → ∞ as k → ∞.
Thus the residual r_k and the error e_k may have opposite behavior. This phenomenon, namely that a larger remainder may correspond to a smaller error and vice versa, is known as Remainders vs. Errors, see [22]. In particular, it affects the stopping rules used in implementations of the LSM.

9.3. Linear Problems

Consider a matrix A ∈ R(m,n), m > n, of full column rank, and let b ∈ R(m,1) be a given vector which is not in the range of A. The linear least squares problem (LLSP) is to find the vector x ∈ R(n,1) such that r(x) = ‖Ax − b‖ → min. This problem is unimodal and its theoretical solution is A^†b, where A^† = (A^⊤A)^(−1)A^⊤ ∈ R(n,m) is the pseudo-inverse of the matrix A.
This representation of the solution is not used in computational practice since it is inefficient and may be connected with a substantial loss of accuracy. Instead, the solution is found via a QR decomposition A = QR of the matrix A. Let R = [S; T] and Q^⊤b = [c; d], where the matrix S ∈ R(n) is upper triangular and invertible, and c ∈ R(n,1). The solution x_* is obtained by the code x = A\b, which in fact computes x = S\c.
Intensive experiments based on BFCT with random data show that the error of the “bad” solution A^†b, computed from the explicit pseudo-inverse, is between 5 and 10 times larger than the error of the “good” solution x_* = A\b. In addition, computing the “bad” solution requires much more computational time.
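In the spirit of these experiments, the following hedged sketch compares the two routes on a synthetic LLSP with a known solution; the sizes, the condition number and the construction of the data are illustrative choices, not the random data of the paper.
m = 200; n = 50;
[U,~] = qr(randn(m,n),0); [V,~] = qr(randn(n));   % random orthonormal factors
A  = U*diag(logspace(0,-8,n))*V';                 % cond(A) = 1e8
x0 = randn(n,1);                                  % exact LS solution by construction
w  = randn(m,1);
b  = A*x0 + (w - U*(U'*w));                       % residual orthogonal to range(A)
x_good = A\b;                                     % QR-based solution
x_bad  = (A'*A)\(A'*b);                           % normal equations (pseudo-inverse route)
err_good = norm(x_good - x0)                      % typically noticeably smaller ...
err_bad  = norm(x_bad  - x0)                      % ... than the normal equations error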
Let x_1, x_2 ∈ R(n,1) be two approximate solutions of the above LLSP which are different from the exact solution x_0. Set r_k = ‖Ax_k − b‖ and e_k = ‖x_k − x_0‖, k = 1, 2. We have σ_n ≤ r_k/e_k ≤ σ_1, where σ_1 = ‖A‖ and σ_n = 1/‖A^†‖ are the maximal and minimal singular values of A. Next, it is possible [22] to choose x_k so that r_1/e_1 = σ_1 and r_2/e_2 = σ_n. Then
r_1/r_2 = cond(A) · (e_1/e_2),   cond(A) = σ_1/σ_n   (9)
We see that when the matrix A is poorly conditioned, in the sense that cond(A) ≫ 1, the behavior of remainders and errors may be opposite: a larger remainder may correspond to a smaller error. This fact deserves a separate formulation.
Proposition 10.
When cond ( A ) is large, larger remainders may correspond to smaller errors. In this case checking the accuracy of the computed solutions by the remainders is misleading.
The situation with remainders and errors is indeed ironic. If the matrix A is well conditioned, with cond(A) close to 1, the computed approximations are most probably good and there is no need to check their accuracy by remainders. But if the matrix A is poorly conditioned and an accuracy test seems necessary, the accuracy check by remainders may lead to wrong conclusions. Note that this phenomenon is possible only if n ≥ 2.
Example 26.
Consider the simplest case n = 2 and A = [B; 0] ∈ R(m,2), b = 0, where B = [μ^3, 1; −μ^3, 1] ∈ R(2) and μ > 0 is a small parameter. The only solution of the LSP is x_0 = 0. Let x_1 = [0; μ] and x_2 = [1/μ; 0] be two approximate solutions (the vector x_2 with error e_2 = 1/μ ≫ 1 can hardly be recognized as an approximate solution but we shall ignore this fact). We have
r_1 = μ√2,   e_1 = μ,   r_2 = μ^2 √2,   e_2 = 1/μ
and
r_1/r_2 = 1/μ ≫ 1,   e_1/e_2 = μ^2 ≪ 1
This is possible since the matrix B is very badly conditioned, with cond(B) = μ^(−3) ≫ 1, and the estimate (9) is achieved.
Denote T = (0, 1) ⊂ R_+ and let {x(t) : t ∈ T} ⊂ R(n,1) be a family of approximate solutions of the equation Ax = b, where the function x : T → R(n,1) is differentiable. Let x(t) tend to x_0 as t → 0 and denote by e(t) = ‖x(t) − x_0‖ and r(t) = ‖Ax(t) − b‖ the error and the remainder of the approximate solution x(t). These relations define the remainder r as a parametric function of e and vice versa.
Proposition 11.
There exists a smooth function x : T → R(n,1) such that the following assertions hold true.
  • The function e : T → R_+ is smooth and decreasing.
  • The function r : T → R_+ is smooth and non-monotone in each interval (0, β), where 0 < β < 1.
  • The function e ↦ r(e) (whenever defined) is smooth and non-monotone in each arbitrarily small subinterval (0, θ) of the interval (0, ε), where ε = sup{e(t) : t ∈ T}.
Example 27.
Let n = 2 ,
A = [a, 0; 0, 1],   x(t) = [ξ_1(t); ξ_2(t)],   b = [0; 0]
and a ∉ {−1, 0, 1}. Then x_0 = 0 and
r(t) = √(a^2 ξ_1^2(t) + ξ_2^2(t)),   e(t) = √(ξ_1^2(t) + ξ_2^2(t))
Let t ∈ T and ξ_1(t) = t^3 cos(1/t), ξ_2(t) = t^3 sin(1/t). We have e(t) = t^3 and
r(t) = t^3 √(1 + (a^2 − 1) cos^2(1/t))
Therefore
r(e) = e √(1 + (a^2 − 1) cos^2(e^(−1/3)))
The differentiable function e ↦ r(e) thus defined is non-monotone on each interval (0, θ), θ > 0.
This non-monotonicity means that for any θ > 0 there exist infinitely many pairs of points e_1, e_2 ∈ (0, θ) with e_1 < e_2 and r(e_1) > r(e_2). Thus the LSM may be misleading not only for linear algebraic equations but also for the LLSP, for which the method had been specially designed.
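A short sketch visualizing the non-monotone dependence r(e) from Example 27 (with the illustrative choice a = 3):
a = 3;
t = linspace(1e-3,1,2000);
e = t.^3;                                         % error e(t)
r = t.^3.*sqrt(1 + (a^2 - 1)*cos(1./t).^2);       % remainder r(t)
plot(e,r), xlabel('error e'), ylabel('remainder r(e)')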

10. Integrals and Derivatives

10.1. Integrals

The calculation of the definite integral I = int(f, a, b) is one of the oldest computational problems in numerical mathematical analysis. According to [25], a variant of the trapezoid rule was used in Babylon around 50 BCE for integrating the velocity of Jupiter along the ecliptic.
Usually the integral is computed for a, b ∈ R, a < b, but the case when one or both of the limits a, b are infinite and the integral is improper is also considered. There are many sophisticated algorithms and computer codes for computing such integrals, see [1]. In MATLAB® definite integrals of functions of one variable f : [a, b] → R are computed by the command integral(f,a,b).
In this section we apply BFCT to solving an integral with highly oscillating integrand by the standard trapezoidal rule [26].
Let a = x_1 < x_2 < ⋯ < x_n = b be a partition of [a, b]. Sophisticated integration schemes for computing I using the values y_k = f(x_k) had been developed in the pre-computer age (before 1950), when the computation of y_k for n of order 10^3 was a problem. Now BFCT allow one to obtain results for large n, up to 10^8, using simple quadratures such as the formula of trapezoids with equal spacing [27], described below as MATLAB® code
n = 10^m; X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n; ...
T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2)
If the function f is twice continuously differentiable then the absolute error of the trapezoidal method with equal spacing is estimated as
|T_n − I| ≤ E_n(f, a, b) = γ(f, a, b)/n^2
where
γ(f, a, b) = (b − a)^3 μ_2(f)/12,   μ_2(f) = max{|f″(x)| : x ∈ [a, b]}
Example 28.
Consider the integral I = int(f, a, b) = F(b) − F(a), where f = F′, a > 0 and
F(x) = cx + x^2 sin(x^(−1)),   x ∈ [a, b]
Then
f(x) = c + 2x sin(x^(−1)) − cos(x^(−1)),   f″(x) = x^(−4) cos(x^(−1))
For
a = 1/(mπ),   b = 1/π,   c = mπ/(m − 1),   m ∈ Z, m ≥ 2
we have
sin(a^(−1)) = 0,   cos(a^(−1)) = (−1)^m,   F(a) = ca,   sin(b^(−1)) = 0,   F(b) = cb
and I = c(b − a) = 1.
For m = 11 the number a ≈ 0.0289 is relatively small and the function f is highly oscillating close to a, see Figure 2. Next we have
μ_2(f) = |f″(a)| = (11π)^4,   γ(f, a, b) = 2750π/3 ≈ 2880
and E_n(f, a, b) ≈ 2880/n^2.
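A hedged sketch reproducing one row of Table 4 with the trapezoidal code given above (m = 11 as in Example 28; n = 1000 is one of the rows of the table):
m = 11; a = 1/(m*pi); b = 1/pi; c = m*pi/(m-1);
f = @(x) c + 2*x.*sin(1./x) - cos(1./x);          % f = F', see Example 28
n = 1000;
X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n;
T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2);
err = abs(T_n - 1)                                % of order 1e-7, cf. Table 4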
The error |T_n − 1| and the ratio r_n(f, a, b) = |T_n − 1| / E_n(f, a, b) for various n up to 10^8 are shown in Table 4.
For n from 1 000 to 1 000 000 the error (which is both absolute and relative since I = 1) decreases as 0.4/n^2, while the ratio of the error and the estimate E_n(f, a, b) is approximately constant, about 1.5 × 10^(−4). The behavior of the error |T_n − 1| is correctly evaluated, but the bound μ_2(f) = a^(−4) for |f″(x)| considerably overestimates the real behavior of |f″(x)|, although this bound is attained for x = a and is hence not improvable.
For n between 1 500 000 and 100 000 000 the accuracy of the computed solution does not increase with n and shows chaotic behavior due to the accumulation of rounding errors. The result T_n for n = 1000, and even for n = 100, is satisfactory from a practical viewpoint.
For the choice of f, a, b as in Example 28 the MATLAB® code integral(f,a,b) produces the answer with error 7.1720 × 10^(−14). At the same time the NLF F(b) − F(a) gives the result 1 − ρ with error ρ ≈ 1.1011 × 10^(−16). So in this case the code integral(f,a,b) does not use the NLF.

10.2. Derivatives

Numerical differentiation of a scalar function f of scalar or vector argument x is used in many cases, e.g. when an analytical expression for the derivative f′(x) is not available or is too complicated, or when numerical algorithms are applied that use approximate derivatives. In the simplest yet most widespread case the derivative f′(x) is approximated numerically by the first order finite difference Δ_h(f, x) = (f(x + h) − f(x))/h, where |h| is small but not too small. For a given x the behavior of Δ_h(f, x) as a function of h resembles a saw, see Example 29 below.
Example 29.
Consider the first difference Δ_h(x, 1) for the simplest non-constant function f(x) = x at the point x = 1. We have Δ_h(x, 1) = ((1 + h) − 1) h^(−1). The graph of the computed function h ↦ ((1 + h) − 1) h^(−1) for h ∈ [−3 eps, 3 eps] \ {0} is shown in Figure 3. The graph of the exact function h ↦ 1 is the straight line in red.
Let f′(x) ≠ 0 and x ≠ 0. When h/x is too small, e.g. h/x ∈ [ρ/2, ρ], the computed value of f(x + h) is f(x), Δ_h(f, x) = 0 and the derivative f′(x) is computed as 0 with 100% relative error. This numerical catastrophe may be avoided by BFCT revealing the behavior of Δ_h(f, x) for small values of h.
The general principle here is that h must be chosen as C (eps)^(1/2) ≈ C/10^8, and the problem is how to estimate the constant C, especially in RTA. When no information about the constant C > 0 is available, the heuristic rule is to set C = 1.
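A minimal sketch of this step-size rule for the illustrative function f(x) = exp(x) at x = 1 (not an example from the paper):
f = @(x) exp(x);
x = 1; df = exp(1);                               % exact derivative f'(1)
h = 10.^(-(1:16));                                % trial step sizes
D = (f(x+h) - f(x))./h;                           % forward differences
err = abs(D - df);
loglog(h,err)                                     % error is smallest near h ~ 1e-8
xlabel('h'), ylabel('error of the first difference')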

11. VPA versus FPA

There are many types of variable precision arithmetic (VPA) and computer languages in which VPA is realized. Here we also include systems using symbolic objects like π, √2, etc. VPA may work with very high precision (actually, with unlimited precision) but the performance of numerical algorithms with VPA may be slow and not suitable for RTA. In contrast, binary double precision floating-point arithmetic (FPA) works with a finite set of machine numbers, has finite precision and works relatively fast. Thus FPA is suitable for RTA.

12. Conclusion

The combination of BFCT and CPRS provides a powerful tool for solving problems and checking computed solutions and a priori accuracy estimates for many computational problems. Also, according to the authors, BFCT may become a useful complement in the implementation of techniques of artificial intelligence (AI) to the systems for doing mathematics such as MATLAB® [6,7] and Maple [8].
Our main results are summarized in Propositions 1–11. Most of the results presented are valid for using BFCT in FPA. Using extended precision, as in VPA, some of the observed effects may no longer occur. This may be done at the price of a considerable increase of the computational time, thus excluding some RTA.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated during the current study are available from the author upon reasonable request.

Conflicts of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Stoer, J.; Bulirsch, R. Introduction to Numerical Analysis; Springer: New York, 2002; ISBN 978-0387954523.
  2. Faires, D.; Burden, A. Numerical Analysis; Cengage Learning: Boston, 2016; ISBN 978-1305253667.
  3. Chapra, S. Applied Numerical Methods with MATLAB for Engineers and Scientists, 5th ed.; McGraw Hill, 2017; ISBN 978-12644162604.
  4. Novak, K. Numerical Methods for Scientific Computing; Equal Share Press, 2022; ISBN 978-8985421804.
  5. Driscoll, T.; Braun, R. Fundamentals of Numerical Computation; SIAM, 2022.
  6. The MathWorks, Inc. MATLAB Version 9.9.0.1538559 (R2020b); The MathWorks, Inc.: Natick, MA, USA, 2020. Available online: www.mathworks.com.
  7. Higham, D.; Higham, N. MATLAB Guide, 3rd ed.; SIAM: Philadelphia, 2017; ISBN 978-1611974652.
  8. Maplesoft. Maple 2017.3; Maplesoft: Ontario, 2017. Available online: www.maplesoft.com/products/maple/.
  9. Mathematica. Available online: www.wolfram.com/mathematica/.
  10. WolframAlpha. Available online: www.wolframalpha.com/.
  11. Barbu, A.; Zhu, S. Monte Carlo Methods; Springer: Singapore, 2020; ISBN 978-9811329708.
  12. Davies, A.; et al. Advancing mathematics by guiding human intuition with AI. Nature 2021, 600.
  13. Fink, T. Why mathematics is set to be revolutionized by AI. Nature 2024, 629.
  14. IEEE Std 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society: NJ, USA, 2019.
  15. Higham, N. Accuracy and Stability of Numerical Algorithms; SIAM: Philadelphia, 2002; ISBN 0-898715210.
  16. Higham, N.; Konstantinov, M.; Mehrmann, V.; Petkov, P. The sensitivity of computational control problems. IEEE Control Systems Magazine 2004, 24, 28–43.
  17. Goldberg, D. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys 1991, 23, 5–48.
  18. Pugh, C. Real Mathematical Analysis; Springer Nature: Switzerland, 2015; ISBN 978-3319177700.
  19. Konstantinov, M.; Petkov, P. Computational errors. International Journal of Applied Mathematics 2022, 1, 181–203.
  20. Balinski, M.; Young, H. Fair Representation: Meeting the Ideal of One Man, One Vote, 2nd ed.; Brookings Institution Press: New Haven and London, 2001; ISBN 978-0815701118.
  21. Booth, A. Numerical Methods, 3rd ed.; Butterworths Scientific Publications: London, 1966; LCCN 57004892.
  22. Konstantinov, M.; Petkov, P. Remainders vs. errors. AIP Conference Proceedings 2016, 1789, 060007.
  23. Lagarias, J.; Reeds, J.; Wright, M.; Wright, P. Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization 1998, 9, 112–147.
  24. Nelder, J.; Mead, R. A simplex method for function minimization. Computer Journal 1965, 7, 308–313.
  25. Ossendrijver, M. Ancient Babylonian astronomers calculated Jupiter’s position from the area under a time-velocity graph. Science 2016, 351, 482–484.
  26. Rahman, Q.; Schmeisser, G. Characterization of the speed of convergence of the trapezoidal rule. Numerische Mathematik 1990, 57, 123–138.
  27. Atkinson, K. An Introduction to Numerical Analysis, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1989; ISBN 978-0471500230.
  28. Glazman, M.; Ljubich, J. Finite-Dimensional Linear Analysis: A Systematic Presentation in Problem Form (Dover Books in Mathematics); Dover Publications: New York, NY, USA, 2006; ISBN 978-04866453323.
  29. Jacobson, N. Basic Algebra I; W. Freeman & Co Ltd.: London, UK, 1986; ISBN 978-07167114804.
  30. Andrews, G. The Theory of Partitions, reprint ed.; Cambridge University Press: Cambridge, 2008; ISBN 978-0521637664.
  31. Konstantinov, M.; Petkov, P. On Schur forms for matrices with simple eigenvalues. Axioms 2024, 13, 839.
Figure 1. Scaled rounding errors.
Figure 2. Oscillating function.
Figure 3. Computed first difference of the function f(x) = x.
Table 2. Errors and estimates for low order systems.
m     err                est                err/est
2     3.1601 × 10^(−9)   9.0612 × 10^(−8)   0.0349
3     1.6691 × 10^(−5)   8.9027 × 10^(−4)   0.0187
4     2.4926 × 10^(−1)   9.1475             0.0272
5     1.0002             4.6283             0.2161
Table 3. Constants in error estimates for the code roots.
n     3     4     5     6     7     8     9     10    11
γ_n   1.73  1.78  1.53  1.95  1.62  1.60  1.84  1.72  2.03
n     12    13    14    15    16    17    18    19    20
γ_n   1.76  2.04  1.94  2.10  1.92  2.06  2.01  1.97  1.91
Table 4. Errors and estimates for the trapezoidal rule.
n             |T_n − 1|            r_n(f, a, b)
5             2.7522 × 10^(−2)     2.3724 × 10^(−4)
10            9.5472 × 10^(−4)     3.2918 × 10^(−5)
100           2.0633 × 10^(−5)     7.1142 × 10^(−5)
1 000         4.3650 × 10^(−7)     1.5051 × 10^(−4)
10 000        4.3842 × 10^(−9)     1.5117 × 10^(−4)
100 000       4.3844 × 10^(−11)    1.5117 × 10^(−4)
1 000 000     4.4420 × 10^(−13)    1.5316 × 10^(−4)
1 500 000     1.9340 × 10^(−13)    1.5004 × 10^(−4)
2 000 000     1.0059 × 10^(−13)    1.3873 × 10^(−4)
5 000 000     3.4084 × 10^(−14)    1.5189 × 10^(−4)
10 000 000    1.9762 × 10^(−14)    6.8139 × 10^(−4)
50 000 000    2.1316 × 10^(−14)    1.8375 × 10^(−2)
100 000 000   5.0848 × 10^(−14)    1.7533 × 10^(−1)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.