Preprint Article (not peer-reviewed)

Brute Force Computations and Reference Solutions

Submitted: 07 December 2024. Posted: 09 December 2024.
Abstract
In this paper we consider the application of brute force computational techniques (BFCT) for solving computational problems of mathematical analysis and matrix algebra in a floating-point computing environment. These techniques include, among others, simple matrix computations and the analysis of graphs of functions. Since BFCT are based on matrix calculations, the program system MATLAB® is suitable for their computer realization. The computations in this paper are done in double precision floating-point arithmetic obeying the 2019 IEEE Standard for binary floating-point calculations. One of the aims of the paper is to analyze cases when popular algorithms and software fail to produce correct answers without warning the user. In real time control applications this may have catastrophic consequences with heavy material damage and human casualties. It is known, or suspected, that a number of man-made catastrophes such as the Dhahran accident (1991), the Ariane 5 launch failure (1996), the Boeing 737 Max tragedies (2018, 2019) and others are due to errors in computer software and hardware. Another application of BFCT is finding good initial guesses for known computational algorithms. Sometimes simple and fast BFCT are useful tools for solving computational problems correctly and in real time. Among the particular problems considered are genuine addition of machine numbers, numerically stable computations, finding minimums of arrays, minimization of functions, solving finite equations, integration and differentiation, computing condensed and canonical forms of matrices, and clarifying the concepts of the least squares method in the light of the conflict Remainders vs. Errors. Usually BFCT are applied under the user's supervision, which is not possible in the automatic implementation of computational methods. Implementing BFCT automatically is a challenging problem in the area of artificial intelligence (AI), and of mathematical artificial intelligence (MAI) in particular. BFCT make it possible to reveal the underlying arithmetic (FPA, VPA, etc.) in the performance of computational algorithms. Last but not least, this paper may have some tutorial value, since at certain places computational algorithms and mathematical software are still taught without taking into account the properties of computational algorithms and machine arithmetic.
Keywords: 

1. Introduction and Notation

1.1. Preliminaries

Sophisticated computational algorithms have been developed during the last 300 years for solving a large variety of problems in the mathematical and engineering sciences [1,2,3,4,5]. When digital computers became available in the middle of the 20th century, it became clear that some computational algorithms of the past are not suitable for computer implementation, at least not without significant modifications. Other algorithms (albeit very clever) became obsolete. Thus the paradigms of scientific computation changed, and new types of modern computational algorithms were developed and implemented as computer codes.
When implemented as computer codes in a floating-point computing environment such as MATLAB® [6,7] and Maple [8], however, some of these computational algorithms may produce wrong results without warning. Other systems, such as Mathematica [9,10], use variable precision arithmetic (VPA) and may avoid such pitfalls at the price of a certain increase in the time needed to perform the algorithms. In such cases BFCT and a detailed analysis of computational problems with reference solution (CPRS) may be useful.
BFCT may also be used to obtain initial guesses for computational algorithms intended to solve finite equations and to minimize functions of one or several scalar arguments. We stress finally that Monte Carlo algorithms [11] are also a form of BFCT. As a perspective, a combination of mathematical AI [12,13] with BFCT is yet to be developed.
The computations in this paper are done in double precision binary floating-point arithmetic (FPA) obeying the IEEE Standard [14]. Some of the matrix notations used are inspired by the language of MATLAB®. The codes for the numerical experiments are given so that the results may be checked independently. BFCT make it possible to reveal the underlying machine arithmetic.

1.2. General Notations

Next we use the following general notations.
  • Z – set of integers
  • Z[m, n] – set of integers m, m + 1, …, n, where m ≤ n
  • Z_k ⊂ Z – set of integers not less than k
  • R (C) – set of real (complex) numbers
  • i ∈ C – imaginary unit, where i² = −1
  • sum(a_k, n) – sum of the scalar quantities a_1, a_2, …, a_n
  • int(f, a, b) – definite integral of the function f: R → R on the interval [a, b]
  • ≺ – lexicographical order on the set of pairs (a, b) ∈ R², defined as (a_1, b_1) ≺ (a_2, b_2) if a_1 < a_2, or a_1 = a_2 and b_1 < b_2; we write (a_1, b_1) ⪯ (a_2, b_2) if (a_1, b_1) ≺ (a_2, b_2) or (a_1, b_1) = (a_2, b_2)
  • K(m, n) – set of m × n matrices A with elements A(p, q) ∈ K, where K is one of the sets Z, R or C
  • K(m) = K(m, m)
  • A(p, :) ∈ K(1, n) – pth row of A ∈ K(m, n)
  • A(:, q) ∈ K(m, 1) – qth column of A ∈ K(m, n)
  • ‖·‖ – spectral norm in K(m, n)
  • ones(m, n) ∈ R(m, n) – matrix with unit elements
  • rank(A) ∈ Z_0 – rank of the matrix A ∈ K(m, n)
  • tr(A) ∈ K – trace of the matrix A ∈ K(m, n)
  • det(A) ∈ K – determinant of the matrix A ∈ K(m)
  • I_n ∈ R(n) – identity matrix with I_n(p, p) = 1 and I_n(p, q) = 0 for p ≠ q
  • E_{i,j} – matrix with a unique nonzero element, equal to 1, in position (i, j)
  • N_n ∈ R(n) – nilpotent matrix with N_n(p, p + 1) = 1 and N_n(p, q) = 0 for q ≠ p + 1
  • λ_k = λ_k(A) – eigenvalues of the matrix A ∈ K(n), where k = 1, 2, …, n; we usually assume that |λ_1| ≥ |λ_2| ≥ ⋯ ≥ |λ_n|
  • σ_k = σ_k(A) – singular values of the matrix A ∈ K(m, n), where k = 1, 2, …, p, p = min{m, n}; the singular values are ordered as σ_1 ≥ σ_2 ≥ ⋯ ≥ σ_p ≥ 0
  • ‖A‖ = σ_1 – spectral norm of the matrix A ∈ K(m, n)
  • cond(A) = ‖A‖ ‖A^(−1)‖ = σ_1/σ_n ≥ 1 – condition number of the non-singular matrix A ∈ K(n)

1.3. Machine Arithmetic Notations

We use the following notations from binary floating-point arithmetic, where codes and results in the MATLAB® environment are written in typewriter font.
  • M_+ (M_−) – set of finite positive (negative) machine numbers
  • M = M_− ∪ {0} ∪ M_+ – set of finite machine numbers
  • fl(x) ∈ M – rounded value of x ∈ R; fl(x) is computed in the form ±m/2^n by the command sym(x,'f'), where m ∈ Z, n ∈ Z_0 and m is odd if n ∈ Z_1
  • X_+ > X (X_− < X) – machine number right (left) nearest to X ∈ M; X_+ is obtained as X + eps(X)
  • eps = 2^(−52) – machine epsilon; we have 1_+ = 1 + eps, 1_− = 1 − eps/2; eps is obtained as eps(1), or simply as eps
  • ρ = eps/2 = 2^(−53) ≈ 1.11 × 10^(−16) – rounding unit
  • R = 2^53 = 9 007 199 254 740 992 ≈ 9.01 × 10^15 – maximal integer such that all positive integers ≤ R are machine numbers; we have R_+ = R + 2, R_− = R − 1
  • E = 2^1023 – maximal machine integer power of 2; we have 2E ∉ M
  • X_max = (2 − eps)E ≈ 1.80 × 10^308 – maximal machine number; it is also denoted as realmax and we have eps(realmax) = 2^971
  • X_min = 2^(−1022) ≈ 2.23 × 10^(−308) – minimal positive normal machine number; it is also denoted as realmin
  • R_min = 2 eps/E = 2^(−1074) ≈ 4.94 × 10^(−324) – minimal positive subnormal machine number; it is found as eps(0)
  • X = [X_min, X_max] ⊂ R – normal range; numbers x ∈ X are rounded with relative error less than ρ
  • Inf (-Inf) – positive (negative) machine infinity
  • NaN – result of a mathematically undefined operation; comes from Not a Number
  • M_ext = M ∪ {±Inf} ∪ {NaN} – extended machine set

1.4. Abbreviations

AI artificial intelligence
ARRE average rounding relative error
BFCT brute force computational technique
CPRS computational problem with reference solution
FPA double precision binary floating-point arithmetic
LLSP linear least squares problem
LSM least squares method
LSP least squares problem
MAI mathematical artificial intelligence
MRRE maximal rounding relative error
NLF Newton-Leibniz formula
NFP numerical fixed point
RRE rounding relative error
RTA real time application
VPA variable precision arithmetic

1.5. Software and Hardware

The computations are done with the program system MATLAB®, Version 9.9 (R2020b), on Lenovo X1 Carbon 7th Generation Ultrabook.

1.6. Problems with Reference Solution

An effective approach to using BFCT is solving CPRS. CPRS are specially designed computational problems with a priori known, or reference, solutions. It is supposed that the numerical algorithm which is to be checked does not “know” this fact and solves the computational problem in the standard way. Consider for example the finite equation f(x) = 0, where f(x) and x are scalars or vectors. Then a CPRS is obtained choosing f(x) = g(x) − g(x_ref), where x_ref is the reference solution. We usually choose x_ref to be a vector with unit elements, e.g. x_ref = ones(n,1), for which ‖x_ref‖ = √n.
Let x_exact ≠ 0 be a solution of the equation f(x) = 0 and let x_comp be an approximate solution computed in FPA. The absolute error ‖x_comp − x_exact‖ and the relative error
‖x_comp − x_exact‖/‖x_exact‖
of the computed solution x_comp, such that f(x_comp) ≈ 0, depend on the rounding of initial data and on rounding and other errors made during the computational procedure [15,16]. To eliminate the effects of rounding of initial data we may consider CPRS with relatively small integer data or, more generally, with data consisting of machine numbers.
Example 1.
Computational algorithms and software for solving linear algebraic equations A x = b , where A C ( n ) is invertible, may be checked choosing A Z ( n ) and b = A x ref , where x_ref = ones(n,1).
The computed solution is x_comp = A\b and we may compare the relative error norm(x_ref-x_comp)/sqrt(n) with the a priori error estimate eps*cond(A).
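A minimal sketch of Example 1 for an assumed dimension n and a random integer test matrix:
n = 100;
A = fix(100*randn(n));            % integer test matrix (machine-number data)
x_ref = ones(n,1);                % reference solution
b = A*x_ref;                      % right-hand side, computed exactly
x_comp = A\b;                     % computed solution
rel_err = norm(x_ref - x_comp)/sqrt(n)
est = eps*cond(A)                 % a priori error estimate to compare with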
Example 2.
Let F: [a, b] → R be a differentiable, highly oscillating function with derivative F′: [a, b] → R. The numerical integration of f = F′ may be a problem. For such integrands we may formulate a numerically difficult computational problem J = int(f, a, b) with reference solution J_ref = 1. Systems with mathematical intelligence such as MATLAB® [6], Maple [8] and Mathematica [9,10] may find the integral with high accuracy by the NLF J = F(b) − F(a), thus avoiding numerical integration.
Example 3.
To check the codes for eigenstructure analysis of a matrix A ∈ C(n) we may choose A = T(I_n + N_n)T^(−1), where T and T^(−1) are integer matrices with det(T) = 1. The eigenvalues of A are equal to 1 (the reference solution). Computational algorithms such as the QR algorithm and software for spectral matrix analysis may produce wrong results even for matrices A ∈ Z(2) of moderate norm.
Example 4.
When solving the differential equation y′(t) = f(t, y(t)) with initial condition y(t_0) = y_0 we may choose a reference solution y_ref as follows. The function y_ref satisfies the differential equation y′(t) = Φ(t, y(t)), where
Φ(t, y(t)) = f(t, y(t)) + y′_ref(t) − f(t, y_ref(t))
with initial condition y_ref(t_0) = y_0.
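A minimal sketch of Example 4 with assumed choices f(t, y) = −y², y_ref(t) = 1 + sin(t), t_0 = 0 and y_0 = 1:
f      = @(t,y) -y.^2;                    % assumed right-hand side
y_ref  = @(t) 1 + sin(t);                 % assumed reference solution
dy_ref = @(t) cos(t);                     % its derivative
Phi    = @(t,y) f(t,y) + dy_ref(t) - f(t,y_ref(t));
[t,y]  = ode45(Phi, [0 10], y_ref(0));    % solve the modified equation
max(abs(y - y_ref(t)))                    % error of the solver against the reference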
An important observation is that for some computational problems such as finding extrema of functions, BFCT may produce better results than certain sophisticated algorithms used in modern computer systems for doing mathematics, see e.g. Section 7.

2. Machine Arithmetic

2.1. Preliminaries

In this section we consider computations in FPA obeying the IEEE Standard [14]. Numerical computations in MATLAB® are done according to this standard. The FPA consists of the set of machine numbers M and rules for performing operations, e.g. addition/subtraction, multiplication/division, exponentiation, etc. Rounding and arithmetic operations in FPA are performed with RRE of order ρ. The sum A + B + C and the product ABC are computed as (A + B) + C and (AB)C.
A major source of errors in computations is the (catastrophic) cancellation that occurs when close numbers, each known only approximately, are subtracted, so that the information coded in their left-most digits is lost [17]. Thus catastrophic cancellation is the phenomenon that subtracting good approximations of two close numbers may produce a bad approximation of their difference. Less known is that genuine addition in FPA may also cause large and even unlimited errors.
A tutorial example of cancellation is the solution of quadratic equations ax² + bx + c = 0, a ≠ 0. For some 4000 years students have been taught to use the formula x_{1,2} = (−b ± √d)/(2a), d = b² − 4ac, without warning that this may be a bad way to solve quadratic equations numerically.
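A minimal sketch of the standard remedy (not the formula above): compute the root of larger modulus by the explicit formula, where no cancellation occurs, and recover the other root from Vieta's relation x_1 x_2 = c/a. The coefficients below are assumed test values with real roots:
a = 1; b = -1e8; c = 1;                 % assumed test coefficients
d  = sqrt(b^2 - 4*a*c);
x1 = (-b - sign(b)*d)/(2*a);            % root of larger modulus, no cancellation
x2 = c/(a*x1);                          % Vieta's formula instead of (-b + d)/(2a)
[x2, (-b - d)/(2*a)]                    % compare with the naive, cancellation-prone value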
Similar considerations are valid for the use of explicit expressions for solving cubic and quartic algebraic equations, because genuine subtraction is numerically dangerous. Genuine addition in FPA is no less dangerous.

2.2. Violation of Arithmetic Laws

In FPA the commutative law for addition and multiplication is valid, but the associative law for addition and multiplication is violated, i.e. it may happen that
(A + B) + C ≠ A + (B + C),  (AB)C ≠ A(BC)
The distributive law A ( B + C ) = A B + A C is also violated. This may lead to unexpectedly large (in fact arbitrarily large) errors.
For example, we have the wrong result 1 + R − R = (1 + R) − R = R − R = 0. The reason for this machine calculation is that 1 + R is not a machine number (it is the smallest positive integer with this property). It lies in the middle between the successive machine numbers R and 2 + R and is rounded to R. At the same time we have 1 + (R − R) = 1 + 0 = 1, which is the correct answer. Thus we got the famous wrong equality 0 = 1 arising in mathematical jokes.
This is not the end of the story. Consider the expression S = R + 1 + 1 + ⋯ + 1 − R consisting of 2 + R terms, where there are R terms equal to 1. We have S = R. Computing S from left to right we get the wrong answer S = (R + 1 + 1 + ⋯ + 1) − R = R − R = 0. Computing S starting with the ones, i.e. 1 + 1 + ⋯ + 1 + R − R = R + R − R, we get the correct answer S = R. Now we have obtained the more impressive result 0 = 9 007 199 254 740 992. Summing (at least theoretically) a sufficient number of units we compute the sum of 1's as Inf and hence 0 = Inf.
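A minimal sketch of the order dependence of machine summation:
R = 2^53;
(1 + R) - R        % gives 0, since 1 + R is rounded to R
1 + (R - R)        % gives 1, the correct value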
The associative law for multiplication is also violated in FPA. For example, the value of the expression P = 2^512 × 2^512 × 2^(−512) is 2^512 and it is computed correctly as
P = 2^512 × (2^512 × 2^(−512)) = 2^512 × 1 = 2^512
At the same time the machine computation of P from left to right gives
(2^512 × 2^512) × 2^(−512) = fl(2^1024) × 2^(−512) = Inf × 2^(−512) = Inf
where Inf is the symbol for infinity in FPA. Thus we got 2^512 = Inf.
Similar false equalities may also be obtained by addition and subtraction of machine numbers. Set E = 2^1023. In standard arithmetic we have E + E − E − E = 0. In FPA, however, fl(E + E − E − E) = fl(E + E) − E − E = Inf − E − E = Inf. This corresponds to the wrong equality 0 = Inf, although E is a machine number at a good distance E − 2^971 from X_max.
We also have fl(E + E − E − E − E − E) = fl(E + E) − E − E − E − E = Inf, although the exact result is −2E and should be rounded to -Inf. Thus we obtained −Inf = Inf. The reason is that the maximal positive machine number in FPA is X_max = (2 − eps)E and the number E + E = 2^1024 is set to Inf. Moreover, the maximal positive integer that is still rounded to X_max is X_max + 2^969(2 − eps), while the number X_max + 2^970 is set to Inf.
These entertaining exercises may produce undetermined results as well. For example, we have U = 2^1024/2^1024 − 1 = 0, but the computed value is U = Inf/Inf - 1 = NaN. This may be interpreted as the exotic equality 0 = NaN.
Working with positive subnormal numbers also leads to violation of the distributive law (x + y)z = xz + yz. In particular, we usually think that for x ∈ R it is fulfilled (x + x)/2 = x/2 + x/2 = x. But setting x = R_min we get (x + x)/2 = 2*x/2 = x and x/2 + x/2 = 0 + 0 = 0. We stress that the minimal positive quantities that are still rounded to R_min instead of to 0 are those exceeding R_min/2 = 2^(−1075).
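A minimal sketch of this effect at the smallest subnormal number:
x = eps(0);        % R_min = 2^(-1074)
(x + x)/2          % equals x
x/2 + x/2          % equals 0, since x/2 is rounded to 0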
The distributive law in the form (1/m + 1/n)/p = 1/m/p + 1/n/p, where at least one of the operands m, n, p ∈ Z is not an integer power of 2, is also violated. In this case the rounded values of the two sides of the equality are usually neighboring machine numbers.
Example 5.
Let m = 2, n = 3, p = 5. Then fl(1/2/5 + 1/3/5) = fl(1/6) + 2^(−55) and fl(1/2/5 − 1/3/5) = fl(1/30) + 2^(−57).
Example 5 illustrates the fact that machine subtraction of close numbers a , b is usually performed with small relative error. Moreover, if the operands are machine numbers and a / 2 b 2 a then the machine subtraction is exact [17].
The rounded value fl(X) ∈ M of X ∈ R is computed by the command sym(X,'f') in the form ±m/2^n, where m, n ∈ Z_0 and m is odd when n ≥ 1. In particular the command sym(realmax,'f') will give the 309 decimal digits of the integer X_max = realmax.

2.3. Numerical Convergence of Series

We shall conclude our brief excursion into machine summation of numbers with the consideration of numerical series with positive elements a_k. For n ∈ Z_1 set sum(a_k, n) = a_1 + a_2 + ⋯ + a_n.
Definition 1.
The symbol sum ( a k , ) is called numerical series. The quantity A n = sum ( a k , n ) is the n-th partial sum of this series.
The partial sums are computed in FPA as A_1 = a_1, A_{m+1} = fl(a_{m+1} + A_m) ∈ M, m ∈ Z_1. Theoretically, the series sum(a_k, ∞) is divergent when a_k ≥ X_min. In FPA there are three mutually disjoint possibilities according to the next definition.
Definition 2.
The series sum(a_k, ∞) is said to be:
  • numerically convergent if there is a positive S ∈ M and m ≥ 1 such that A_n = S for all n ≥ m; the number S is called the numerical sum of the series sum(a_k, ∞)
  • numerically divergent if there is m > 1 such that A_n = Inf for all n ≥ m; in this case no terms A_n = NaN occur
  • numerically undefined if there is m > 1 such that A_n ≠ NaN for n ≤ m and A_{m+1} = NaN
Divergent numerical series may be (and very often are!) numerically convergent. For example, the divergent harmonic series with a_k = 1/k is numerically convergent in FPA to a sum S ≈ 36. The divergent series with a_k = 1 is numerically convergent to S = R. Also, it is easy to prove the following assertion.
Proposition 1.
For any S ∈ M_+ there is a series sum(a_k, ∞) which is numerically convergent to S.
Proposition 1 is an FPA analogue of the Riemann series theorem [18], which states that the terms of a convergent real series which is not absolutely convergent may be reordered so that the sum of the new series equals any prescribed real number.
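A minimal sketch illustrating Proposition 1 for an assumed target sum S: every term of the series is positive, so the exact series diverges, yet the machine partial sums never change once they reach S.
S = pi;             % assumed target sum
a = eps(S)/4;       % positive term below half an ulp of S
A = S;
for k = 1:1000
    A = A + a;      % each partial sum is rounded back to S
end
[A == S, 1000*a]    % true, although the exact sum has grown by 1000*a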

2.4. Average Rounding Errors

We recall that the normal range of FPA is the interval X = [X_min, X_max] ⊂ R, where numbers are rounded with RRE less than ρ.
Let x ∈ X and let fl(x) ∈ M be the rounded value of x. It is well known [14] that the RRE satisfies the estimate
rre(x) = |x − fl(x)|/|x| ≤ ρ/(1 + ρ) < ρ
The code sym(x,'f') computes fl(x) in the form m/2^n, where m is odd if n ≥ 1. Let n ∈ Z[0, 2^52 − 1]. The behavior of rre(x) between two successive machine numbers
X_n = 1 + 2nρ,  X_{n+1} = 1 + 2(n + 1)ρ
in the interval [1, 2) is as follows. Let
x_n = (X_n + X_{n+1})/2 = 1 + (2n + 1)ρ ∉ M
Then
rre(x) = 1 − X_n/x for X_n ≤ x ≤ x_n,  rre(x) = X_{n+1}/x − 1 for x_n < x ≤ X_{n+1}
The maximum of rre(x) in the interval [X_n, X_{n+1}] is achieved for x = x_n and is equal to
e_n = rre(x_n) = (X_{n+1} − X_n)/(X_n + X_{n+1}) = ρ/(1 + (2n + 1)ρ)
The graph of the function rre: [1, 2) → R resembles a saw, see Figure 1.
Near the left end of the interval the maximum RRE is about ρ, while near the right end it is close to ρ/2. In particular we have
rre(1 + ρ) = ρ/(1 + ρ) = ρ − ρ² + O(ρ³),  rre(2 − ρ) = ρ/(2 − ρ) = ρ/2 + ρ²/4 + O(ρ³)
That the actual behavior of the RRE is not governed by (1), being considerably smaller, has been known to specialists in computer calculations for a long time. To clarify this problem we performed intensive numerical experiments.
BFCT showed that the actual RRE is considerably lower than ρ, and even lower than ρ/2, and this led to the definition and use of average RRE, which are more realistic in comparison with the bound ρ. Based on this observation [19] we have defined and calculated an integral ARRE as the definite integral int(rre, 1, 2). This integral is of order
K ρ + O(ρ²),  K = (1/2) log(2) ≈ 0.3466
At the same time the average of the maximums of the RRE is
2Kρ = log(2) ρ ≈ 0.6931 ρ.
Intensive numerical BFCT experiments [19] with rounding of large sequences of numbers x M had shown that the actual ARRE is about 0.40 ρ . This is 2.5 times less than the widely used bound ρ in the literature [14]. In these experiments numbers x M had been excluded because rre ( x ) = 0 . This explains why the multiplier of ρ was computed as 0.40 instead of 0.35.
In another series of experiments we have computed and averaged the RRE for the fractions x = 1/k, x = 1 + 1/k and x = 2 − 1/k for k = 1:m and m up to 1000, and we have computed the ratios rre(x)/ρ. The results are shown in Table 1. The actual RRE in rounding of the unit fractions 1/m (second column) is very close to the integral ARRE (3).
For fractions 1 + 1/m close to 1 the computed ARRE is larger than for fractions 2 − 1/m close to 2. This confirms the observation that rounding of x ∈ [2^k, 2^(k+1)], k ∈ Z, is done with larger (although not twice larger) RRE for x close to the left limit 2^k than for x close to the right limit 2^(k+1) of the interval, see (2). For example, the result in field (4,2) of Table 1 is obtained by the code
for m = 1:1000
 R(m) = double(abs((sym(1/m,'e')-1/m))*m*2/eps);
end
mean(R)
Table 1. Ratios rre(x)/ρ for certain fractions.
  m        1/m       1 + 1/m    2 − 1/m
10 0.3000 0.2423 0.1950
100 0.3417 0.2443 0.2413
1000 0.3323 0.2649 0.2316
We stress that in such experiments we cannot use pseudo-random number generators such as rand, randn and randi from MATLAB® since they produce machine numbers for which the rounding errors are zero.
We summarize the above considerations in the next important proposition.
Proposition 2.
The known rounding error bounds in all computational algorithms realized in FPA may be reduced three times (!) if integral ARRE are used instead of maximum RRE.

3. Numerically Stable Computations

3.1. Lipschitz Continuous Problems

Numerical stability of computational procedures is a major issue in numerical analysis [15]. A large variety of computational problems (actually, almost all computational problems) may be formulated as follows, see [15,16]. Let D ⊂ C(m,1) and R ⊂ C(n,1) be given sets and let f: D → R be a continuous function. We shall use the following informal definition.
Definition 3.
The computational problem is described by the function f, while the particular way of computing the result r = f ( d ) for a given data d D is the computational algorithm.
Thus the computational problem is identified with the pair ( f , d ) , or with the equality r = f ( d ) . For many computational problems the function f is Lipschitz continuous.
Definition 4.
The computational problem r = f(d) is said to be (locally) Lipschitz continuous if there exist constants L > 0 and b > 0 such that
‖f(d_1) − f(d_2)‖ ≤ L ‖d_1 − d_2‖
whenever ‖d_1‖, ‖d_2‖ ≤ b. The problem is globally Lipschitz continuous if b = ∞.
If the function f has locally bounded derivative it is (locally) Lipschitz continuous. The concept of Lipschitz continuity is illustrated by the next example.
Example 6.
Let f: D → R, D ⊂ R, and let a > 0 be a given constant. Then the following assertions take place (see also the sketch after this list).
  • The function f(d) = 1/d is Lipschitz continuous if D = (−∞, −a] ∪ [a, ∞) and is not Lipschitz continuous if D = R ∖ {0}.
  • The function f(d) = d^(1/3) is Lipschitz continuous if D = (−∞, −a] ∪ [a, ∞) and is not Lipschitz continuous if D = R ∖ {0}.
  • The function f(d) = d^m, m ∈ Z_2, is Lipschitz continuous if D = [−a, a] and is not Lipschitz continuous if D = R.
  • The function f(d) = d^(4/3) sin(1/d) for d ≠ 0, f(0) = 0, is differentiable on R but is not Lipschitz continuous.
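A brute force sketch related to the second assertion of Example 6: estimating the Lipschitz constant of d^(1/3) on an assumed interval [a, 1] from sampled difference quotients. The estimate grows without bound as a approaches 0.
f = @(d) d.^(1/3);
a = 0.01; N = 1e4;                       % assumed left end point and sample size
d = linspace(a, 1, N);
L_est = max(abs(diff(f(d))./diff(d)))    % rough Lipschitz constant on [a, 1]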
Suppose finally that the algorithm is realized in FPA, e.g. in MATLAB® computing environment.
Definition 5.
The computational algorithm r = f ( d ) is said to be numerically stable if the computed result r comp is close to the result f ( d * ) of a close problem with data d * .
Definition 5 includes the popular concepts of forward and backward numerical stability. To make this definition formal, suppose that there exist non-negative constants A, B such that
‖r_comp − f(d*)‖ ≤ ρ A ‖r‖,  ‖d − d*‖ ≤ ρ B ‖d‖
and d ≠ 0, r ≠ 0. Further on we have
‖r_comp − r‖ = ‖r_comp − f(d*) + f(d*) − f(d)‖ ≤ ‖r_comp − f(d*)‖ + ‖f(d*) − f(d)‖ ≤ ρ A ‖r‖ + L ‖d* − d‖ ≤ ρ A ‖r‖ + ρ B L ‖d‖
Dividing the last inequality by ‖r‖ we get the following estimate for the relative error in the computed solution
‖r_comp − r‖/‖r‖ ≤ ρ C,  C = A + B L ‖d‖/‖f(d)‖
Definition 6.
The quantity C = C(A, B, d) is said to be the relative condition number of the computational problem r = f(d).
The remarkable formula (4) reveals the three main factors which determine the accuracy of the computed solution [16,19].
Proposition 3.
The accuracy of the computed solution r comp depends on the following factors.
  • The sensitivity of the computational problem relative to perturbations in the data d measured by the Lipschitz constant L of the function f
  • The stability of the computational algorithm expressed by the constants A , B
  • The FPA characterized by the rounding unit ρ and the requirement that the intermediate computed results belong to the normal range X
The error estimate (4) is used in practice as follows. For a number of problems the Lipschitz constant L may be calculated, or estimated as in the solution of linear algebraic equations and the computation of the eigenstructure of a matrix with simple eigenvalues. Finally, the value of ρ is known exactly for FPA as well as for other machine environments.
The estimate of the constants A, B may be a problem. Often the heuristic assumption A = 0 and B = 1 is made, which gives
C = C(d) = L ‖d‖/‖f(d)‖
It follows from (5) that C will be large if L and/or ‖d‖ are large and/or ‖f(d)‖ is small. The constant L is usually out of the user's control. The quantities ‖d‖ and ‖f(d)‖ may be changed by scaling of the computational problem.
Definition 7.
The computational problem is said to be:
  • well conditioned if ρC ≪ 1; in this case we may expect about −log10(ρC) true decimal digits in the computed solution r_comp
  • poorly conditioned if ρC is of order 1 or larger; in this case there may be no true digits in the computed solution r_comp
More precise classification of the conditioning (well, medium and poor) of computational problems is also possible. Computational problems which are not Lipschitz but Hölder continuous may also be analyzed, see e.g. [19].
The above considerations confirm a fundamental rule in numerical computations, namely that if the data d and/or some of the intermediate results in the computation of f ( d ) are large and/or the result r = f ( d ) is small then large relative errors in the computed solution r comp may be expected, see Section 3.3.
For some computational problems r = f(d) there are a priori error estimates for the relative error ‖r_comp − r‖/‖r‖ in the computed solution r_comp in the form of explicit expressions in ρ and d. The importance of such estimates is that they may be computed (or estimated) a priori, before solving the computational problem.
Most a priori error estimates are heuristic. Among the computational problems with such estimates are the solution of linear algebraic equations Ax = b and the computation of the eigenvalues of a matrix A with simple spectrum. To check the accuracy and practical usefulness of such error estimates by BFCT, a set of CPRS is designed and the error estimates are compared with the observed errors.

3.2. Hölder Continuous Problems

Definition 8.
The computational problem r = f(d) is said to be Hölder continuous with exponent h > 0 if there exist constants L > 0 and b > 0 such that
‖f(d_1) − f(d_2)‖ ≤ L ‖d_1 − d_2‖^h
whenever ‖d_1‖, ‖d_2‖ ≤ b.
Hölder continuity implies uniform continuity but the converse is not true. It is supposed that h < 1 . Indeed, if h = 1 the problem is Lipschitz continuous and if h > 1 the function f is constant.
The machine computation of r = f(d) may be accomplished with large errors, of order eps^h, when the exponent h > 0 is small. Such cases arise in the calculation of a multiple zero d_0 of a smooth function f: R → R, where f^(k)(d_0) = 0 for k ∈ Z[0, m − 1] and f^(m)(d_0) ≠ 0. In this case h = 1/m.
Let r ≠ 0 and let r_comp be the computed value of r. Then a heuristic accuracy estimate for the solution of a Hölder problem is
|r_comp − r|/|r| ≤ H ρ^h,  H = L |d|^h/|r|
Experimental results confirming this estimate are presented later on.
A typical example of a Hölder problem is the computation of the roots of an algebraic equation of degree n with multiple roots, where n ≥ 3 (the case n = 2 is treated by a specific algorithm). There are three codes in MATLAB® that may be used for solving algebraic equations, which may be characterized as follows.
  • The code roots is fast and works, on middle power computer platforms, with equations of degree n up to several thousands but gives large errors in case of multiple roots. This code solves algebraic equations only.
  • The code vpasolve works with VPA corresponding to 32 true decimal digits with equations of degree n up to several hundreds but may be slow in some cases. This code works with general equations as well finding one root at a time.
  • The code fzero is fast but finds one root at a time. It may not work properly in case of multiple roots of even multiplicity. This code works with general finite equations as well.

3.3. Golden Rule of Computations

The results presented in this and previous sections confirm the Golden Rule in doing computations in both exact and machine arithmetic which may be formulated as follows.
Proposition 4.
(Golden Rule of Computations) If in a chain of numerical computations the final result is small relative to the initial data and/or to the intermediate results then large relative errors in the computed solution are to be expected.
The already mentioned catastrophic cancellation in subtraction of close numbers is a particular case of Proposition 4.
An important class of computations in FPA are the so called computations with maximum accuracy. Let the vector r with elements r k be the exact solution of the vector computational problem r = f ( d ) and let the vector r comp with elements r k * M be the computed solution.
Definition 9.
The vector r_comp is computed with maximum accuracy if its elements r_k* satisfy r_k* ∈ {(r_k)_−, (r_k)_+}, i.e. each r_k* is one of the machine numbers nearest to r_k.
Solutions with maximum accuracy are the best that we may expect from a computational algorithm implemented in FPA.

4. Extremal Elements of Arrays

Finding minimal and maximal elements of real and complex arrays is a part of many computational algorithms. This problem may look simple but it has pitfalls in some cases.

4.1. Vectors

Let w = [w_1, w_2, …, w_n] be a given real n-vector. The problem of determining an extremal (minimal or maximal) element w_k of w together with its index k arises as part of many computational problems. The solution of this problem is obtained in MATLAB® by the codes [U,K] = min(w) or [V,L] = max(w), where U = w_K is the minimal element of the vector w and K is its index. Similarly, V = w_L is the maximal element of the vector w and L is its index. These codes work for complex vectors as well if the complex numbers are ordered lexicographically by the relation ⪯, see [6].
An important rule here is that if there are several elements w k of w equal to its minimal element U then the computed index K is supposed to be the minimal one. Similarly, if there are several elements w l of w equal to its maximal element V then the computed index L is again the minimal one. The delicate point here is the calculation of indexes K and L. For example, for w = [ c , c , c ] , where c R or c C , the codes min and max produce U = c , K = 1 and V = c , L = 1 as expected.
Due to rounding the above codes may produce unexpected results. Indeed, the performance of the codes depends on the form in which the elements of the vector w are written.
Example 7.
The quantities A = 1/3, B = 4/3 − 1 and C = 1/2 − 1/6 are equal, but their rounded values are different. The rounded value fl(X) ∈ M of X ∈ R is computed by the code sym(X,'f'). It produces the answer in the form m/2^n, where m, n ∈ Z_0. We have fl(A) = 6004799503160661/2^54 and fl(B) = fl(A) − 2^(−54), fl(C) = fl(A) + 2^(−54). Hence fl(B) < fl(A) < fl(C).
Example 8.
Let w be a 3-vector with elements A , B , C at any positions, where A = B = C are defined in Example 7. Then the code [U,K] = min(w)  will give U = fl ( B ) and K will be the index of B as an element of w. The code  [V,L] = max(w)  will give V = fl ( C ) and L will be the index of C as an element of w. At the same time both codes are supposed to give K = L = 1 and U = V .
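A minimal sketch reproducing Example 8 for one particular ordering of the elements; the indices in the comments follow from the rounded values of Example 7:
A = 1/3; B = 4/3 - 1; C = 1/2 - 1/6;   % mathematically equal quantities
w = [A, B, C];
[U, K] = min(w)      % returns fl(B) and K = 2, not K = 1
[V, L] = max(w)      % returns fl(C) and L = 3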
These details of the performance of the codes min and max may not be very popular even among experienced users of sophisticated computer systems for doing mathematics such as MATLAB®.

4.2. Matrices

Similar problems as in Section 4.1 arise in finding extremal (minimal or maximal) elements of multidimensional arrays and in particular of real and complex matrices A.
Consider a matrix A C ( m , n ) with elements A ( p , q ) . Let the minimal element of A relative to the lexicographical order be A min = A ( i 1 , i 2 ) . Then the code
[a,b] = min(A); [A_min,N] = min(a); A_min, ind = [b(N),N]
finds the minimal element A min of A and the pair ( i 1 , i 2 ) of its indexes as
A_min, ind = i_1  i_2
where i_1 = ind(1), i_2 = ind(2). Here a and b are n-vectors and ind is a 2-vector.
If there is more than one minimal element then the computed pair ( i 1 , i 2 ) of its indexes is minimal relative to the lexicographical order. This result, however, may depend on the way the elements of A are specified.
Finding minimal elements of matrices may be a part of improved algorithms to compute extrema of functions of two variables, see Section 7.2. A scheme for finding minimal elements of n D -arrays, n > 2 , may also be derived and applied to the minimization of functions of n variables.

4.3. Application to Voting Theory

The above details of using the codes min and max from MATLAB® [6], although usually neglected, may be important. For example, in certain automatic voting computations (in which one of the authors of this paper participated back in 2013) with the Hamilton method for the bi-proportional voting system used in Bulgaria, these phenomena actually occurred and caused problems. To resolve the problem quickly, being in an emergency situation, we applied BFCT using hand calculations (!).
The computational algorithms for realization of the Hamilton method must be improved by replacing the fractional remainders with integer remainders [19] as follows. Let n parties with votes v_1, v_2, …, v_n take part in the distribution of S > n parliamentary seats by the Hamilton method [20]. Set V = v_1 + v_2 + ⋯ + v_n and m_k = S v_k/V.
Next the rationals m_k are represented as fix(m_k) + μ_k/V, where fix(m_k) ∈ Z_0 is the integer part of m_k, μ_k/V is the fractional remainder and μ_k ∈ Z[1, V − 1] is the integer remainder of S v_k modulo V. Initially the kth party gets fix(m_k) seats. If the sum S_0 of the initial seats is less than S, then the first S − S_0 parties with largest integer remainders μ_k get one seat more. Thus, to avoid small errors in the computation of fractional remainders that may cheat the code max, we must work with exactly computed integer remainders.
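A minimal sketch of the Hamilton apportionment with integer remainders; the vote counts and seat total are assumed values for illustration, and all comparisons use exactly computed integers:
v = [125311; 98402; 60117; 31170];      % assumed vote counts v_k
S = 24;                                 % assumed number of seats
V = sum(v);
mu = mod(S*v, V);                       % integer remainders, computed exactly
m0 = (S*v - mu)/V;                      % initial seats fix(m_k), also exact
seats = m0;
[~, idx] = sort(mu, 'descend');         % rank parties by integer remainders
left = S - sum(m0);
seats(idx(1:left)) = seats(idx(1:left)) + 1   % distribute the remaining seats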

5. Problems with Reference Solution

5.1. Evaluation of Functions

Evaluation of functions f defined by an explicit expression
y = f(x),  0 ≠ x ∈ R,  0 ≠ y ∈ R
is a CPRS, since f(x) may be found with high precision in a suitable computing environment. However, the relative error in computing y may be large even for well behaved functions f and arguments x far from the limits of the normal range [19]. The reason is that, instead of the argument x ∈ R, the corresponding computational algorithm works with the rounded value x* = fl(x) ∈ M. At the same time, for x ∈ M the computed quantity f(x*) is usually exact to working precision.
Example 9.
We have
[cos(realmax);sin(realmax)] = -0.999987689426560
                               0.004961954789184
The computed result is correct to full precision although the argument realmax is written with 309 decimal digits, which may be displayed by the command sym(realmax,'f').
Let the function f be locally Lipschitz continuous with constant L(x) in a neighborhood of x and let the result computed in FPA be y* = f(x*). Then the relative error of y* is estimated as
|y* − y|/|y| ≤ ρ C,  C = C(f, x) = L(x) |x|/|f(x)|
where C(f, x) is the relative condition number of the computational problem y = f(x).
Example 10.
Consider the vector y = [cos(2kπ); sin(2kπ)], where k ∈ Z. We should have y = [1; 0], but for large k this may not be so. Let the vector X ∈ R(1,8) have elements X(p) = π·2^(46+p), i.e.
X = pi*[2^47,2^48,2^49,2^50,2^51,2^52,2^53,2^54]
The command  Y = [cos(X);sin(X)]  gives the matrix
Y = 0.9999  0.9994  0.9976  0.9905  0.9622  0.8517  0.4509 -0.5934
    0.0172 -0.0345 -0.0689 -0.1374 -0.2723 -0.5240 -0.8926 -0.8049
While the first column Y(:,1) of Y still resembles the exact vector [1;0], the next columns are heavily wrong with Y(:,8) being a complete catastrophe with relative error 170%. The commands Y1 = [C1;S1] and Y2 = [C2;S2], where C1 = vpa(cos(X)), S1 = vpa(sin(X)), C2 = cos(vpa(X)) and S2 = sin(vpa(X)) produce the same wrong result Y1 = Y2 = Y.
A way out from such wrong calculations is to evaluate standard trigonometric functions for arguments modulo 2 π .
Example 11.
The command
YY = [cos(mod(X,2*pi));sin(mod(X,2*pi))]
produces the matrix
YY =  1     1     1     1     1     1     1     1
      0     0     0     0     0     0     0     0
which is exact to full precision.
We recall that for |x| ≥ R the rounded value fl(x) of x is an even integer. For example, the quantity a = 23π^32/20 is rounded to fl(a) = 9 321 670 908 397 306, which is demonstrated by the command sym(23*pi^32/20,'f'). Another instructive example is the rounding of integer powers 10^n.
Example 12.
We have fl(10^n) = 10^n for n ∈ Z[16, 22], but
sym([10^23;10^24;10^25],'f') =   99999999999999991611392
                                999999999999999983222784
                              10000000000000000905969664
If x M then for most elementary and special functions f the quantity y = f ( x ) is computed with maximum accuracy.
If the function f is locally Lipschitz with constant L ( x ) then large relative computational errors in evaluation of y = f ( x ) 0 may occur when C ( f , x ) is large, i.e. when L ( x ) and/or | x | is large and/or | y | is small. For the trigonometric functions above the quantities L ( x ) and | y | were equal to 1 but the argument x M is large.
A standard approach to improving the performance of computational algorithms is to scale the initial and/or intermediate data in order to avoid large errors. In computing the matrix exponential exp(A), A ∈ C(n), the scaling A → A·2^(−m) is done with a scaling factor 2^(−m), m ∈ Z_1, chosen as a power of 2 in order to reduce rounding errors [15].
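A minimal sketch of the scaling idea for the matrix exponential, with an assumed crude truncated Taylor series standing in for the Padé-based algorithm actually used by expm:
A = randn(4);                        % assumed test matrix
m = max(0, ceil(log2(norm(A))));     % scaling exponent, a power of 2
B = A/2^m;                           % scaled matrix with norm(B) <= 1
E = eye(4); T = eye(4);
for k = 1:20                         % truncated Taylor series of exp(B)
    T = T*B/k; E = E + T;
end
for k = 1:m
    E = E*E;                         % undo the scaling by repeated squaring
end
norm(E - expm(A))/norm(expm(A))      % compare with MATLAB's expm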

5.2. Linear Algebraic Equations

An instructive example, illustrating BFCT for the solution of linear algebraic equations Ax = b with integer data, is designed as follows. Let A ∈ Z(n) be an invertible matrix and b ∈ Z(n,1). The command A = fix(100*randn(n)); generates a matrix A ∈ Z(n) with normally distributed integer elements of moderate magnitude and zero mean. Choosing the reference solution as X_0 = ones(n,1) we get b = A*X_0. The computed solution is X_comp = A\b with relative error E_n = norm(X_comp - X_0)/sqrt(n).
The a priori relative error estimate is est = eps*cond(A). Usually E_n is a small multiple of est and this may be checked by BFCT. We consider three types of computed solutions as follows.
  • Standard solution X_1 = A\b obtained by Gauss elimination with partial pivoting based on LR decomposition P A = L R of A, where L is lower triangular matrix with unit diagonal elements, R is upper triangular matrix with nonzero diagonal elements and P is permutation matrix.
  • Solution X_2 = R\(Q’*b) obtained by QR decomposition [Q,R] = qr(A) of A, where A = Q R , the matrix Q is orthogonal and the matrix R is upper triangular.
  • Solution X_3 = inv(A)*b based on finding the inverse inv ( A ) = A 1 of A.
Solution X_1 is computed by default in MATLAB® and is accurate and fast. Solution X_2 is very accurate, although more expensive, and is also used. Solution X_3 is not recommended and is included here for tutorial purposes. Indeed, computing inv(A) and multiplying the vector b by this matrix are unnecessary operations which are expensive and reduce accuracy.
The above computational techniques are illustrated by the commands
n = 1000; A = fix(100*randn(n)); X_0 = ones(n,1); b = A*X_0;...
X_1 = A\b; [Q,R] = qr(A); X_2 = R\(Q'*b); X_3 = inv(A)*b;...
E_1 = norm(X_0-X_1)/sqrt(n); E_2 = norm(X_0-X_2)/sqrt(n);...
E_3 = norm(X_0-X_3)/sqrt(n); est = eps*cond(A);...
e_1 = E_1/est; e_2 = E_2/est; e_3 = E_3/est; [e_1;e_2;e_3]
Each execution of these commands produces a different result because of the code randn. The computed quantity e_1 for systems with up to n = 1000 unknowns is usually less than 20, the quantity e_2 is less than 4 and the quantity e_3 is about 40. More precisely, intensive computations with n = 1000 showed that mean(e_1) = 18, mean(e_2) = 3 and mean(e_3) = 40.
Tests with n = 5000 had also been done but they require more computational time. For such values of n we have observed average values mean ( e 1 ) = 30 , mean ( e 2 ) = 3 , mean ( e 3 ) = 150 .
The comparison of e_1 and e_2 with e_3 confirms, among others, the tutorial conclusion that solving linear equations by matrix inversion must definitely be avoided. Rather, the opposite is true: if the inverse matrix B = A^(−1) is needed for some reason, it is found by solving the n linear equations A b_k = i_k for the columns b_k of B, where i_k = I_n(:,k) is the kth column of I_n.
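A minimal sketch of this recommendation, assuming the inverse really is required:
n = 100; A = fix(100*randn(n));  % assumed test matrix
B = A\eye(n);                    % columns b_k solve A*b_k = I_n(:,k)
norm(A*B - eye(n))               % residual check of the computed inverse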
Our main observation, based on BFCT, about comparison of the methods of Gauss elimination and QR decomposition for solving linear algebraic equations, is formulated in the next proposition.
Proposition 5.
For matrices of order up to 1000 the QR decomposition method gives error that is between 5 and 10 times less than the error of the Gauss elimination method.
In addition, the solution of a linear equation A x = b via QR decomposition is backward stable. Thus Proposition 5 is a message to the developer of MATLAB®, the respected MathWorks, Inc. Corporation.
The Gauss elimination method based on the LR decomposition has traditionally been preferred to the QR decomposition method because it requires about half as many floating-point operations. Since fast computational platforms are now widely available, this consideration is no longer decisive.
As a rule the implementation of the code x = A\b gives good results for matrices A R ( n ) with n of order 10 3 . At the same time large errors in x may occur even for n = 2 .
Example 13.
Consider the algebraic equation A(m)x = b(m), where the matrices A(m) ∈ R(2) are defined by the relation (7) below. For x_ref = [1; 1] we have b(m) = [1 − 10^m; −10^m]. The computed results for m = 2:5 are presented in Table 2. In the cases m = 4, 5 there is a warning that the matrix A(m) is close to singular or badly scaled, since cond(A(m)) is larger than 10^16. As shown in Table 2, for these two cases the relative error is indeed 25% and 100%, respectively.

5.3. Eigenvalues of 2 × 2 Machine Matrices

The integer 2 × 2 matrix
A = [−100009999, 100000000; −100020001, 100010001]
has a double eigenvalue λ = 1 and Jordan form J_2 = I_2 + N_2. The elements of A are integers of magnitude about 10^8. This is far from the quantity R such that integers n > R may not be machine numbers.
Example 14.
The MATLAB® code e = eig(A)' computes the row 2-vector e of eigenvalues of A as e_comp = [−0.3955, 2.3955] instead of e = [1, 1]. The relative error is 140% and there is no true digit in the computed vector e_comp (even the sign of the first computed eigenvalue is wrong). The trace of A is fortunately computed as trace(A) = 2 and is the sum of the computed eigenvalues. The determinant det(A) = 1 is computed wrongly as det(A) = 0.2214.
Surprisingly, the vector p of the coefficients of the characteristic polynomial of A, computed as p = poly(A), is [1,-2,-0.9475]. The last element of the computed vector is not the computed determinant 0.2214, which may confuse the user. Moreover, the computed eigenvalues of A are the roots of this wrong characteristic polynomial and hence they are not computed from the Schur form of A as supposed.
Here we recall the modern paradigm in numerical spectral analysis and root finding of polynomials: the eigenvalues of a matrix are not computed as roots of its characteristic polynomial; rather the roots of any polynomial are computed as the eigenvalues of its companion matrix.
The Maple [8] code jordan(A), incorporated in MATLAB®, computes correctly the Jordan form I_2 + N_2 of A. The vector of eigenvalue condition numbers, computed by the command condeig(A), is [c, c] with c = 7.1665 × 10^7, which is relatively large. This indicates that something is wrong. In reality things are even worse, since the vector of eigenvalue condition numbers of the matrix A does not exist, because A has only one linearly independent eigenvector.
At this moment the user may, or may not be aware that something is wrong (with exception of the code jordan which is not very popular and is of restricted use). Now comes the time of BFCT to find the spectrum and the Jordan form of A as follows.
The sum e 1 + e 2 of the eigenvalues of A is equal to the trace A ( 1 , 1 ) + A ( 2 , 2 ) = 2 which is computed exactly. If the eigenvalues are close or equal, because the eigenvalue condition numbers are large, then we should have e 1 + e 2 2 e 1 = 2 and e 1 = e 2 = 1 . The Jordan form of A is then either J 2 or I 2 . But the only matrix with Jordan form I 2 is I 2 itself and hence the Jordan form of A must be J 2 . We have obtained the correct result using BFCT and simple logical reasoning.
The matrix (6) may be written as
A(m) = [1 − a − a², a²; −1 − 2a − a², 1 + a + a²],  a = 10^m
for m = 4 . Thus we have a family of matrices { A ( m ) } parametrized by the integer m. The trace of A ( m ) is equal to 2, the determinant of A ( m ) is equal to 1, the eigenvalues of A ( m ) are e 1 = e 2 = 1 and the Jordan form of A ( m ) is I 2 + N 2 .
Below we give the values of the coefficients c_2 = tr(A) = 2 and c_3 = det(A) = 1 of the characteristic polynomial λ² − c_2 λ + c_3 of A as found by the code poly(A), and of the eigenvalues e_1 = e_2 = 1 of A, as computed by the code eig(A) in MATLAB®.
  • For m = 2 we have c 2 = 2.0000 , c 3 = 1.0000 and e 1 , 2 = 1.0000 ± 0.0001 i , where the quantity 1.0000 has at least 5 true decimal digits and 0.0001 i is a small imaginary tail which may be neglected. This result seems acceptable.
  • For m = 3 we have c 2 = 2.0000 , c 3 = 1.0000 and e 1 = 0.9937 , e 2 = 1.0063 . With certain optimism these results may also be declared as acceptable.
  • For m = 4 a computational catastrophe occurs, with c_2 = 2.0000 (true), c_3 = 0.9475 (completely wrong) and e_1 = −0.3955, e_2 = 2.3955 (also completely wrong). Here, surprisingly, det(A) is computed as 0.2214, which differs from 1 (as is to be expected) but differs also from c_3. It is not clear what happened and why. Using the computed coefficients c_2, c_3, the roots of the characteristic equation λ² − c_2λ + c_3 = 0 now are λ_1 = 0.7709, λ_2 = 1.2291 instead of the computed e_1 and e_2. This is a strange wrong result.
  • We give also the results for m = 5 which are full trash and are served without any warning for the user. We have c 2 = 2.0000 , c 3 = 5591.5 and e 1 , 2 = 1.0000 ± 74.770 i . We also get det ( A ) = 4472.2 for completeness.
In all cases the sum c 2 = 2.0000 of the wrongly computed eigenvalues is almost exact which is a well known fact in the numerical spectral analysis of matrices [15].
The conclusion from the above considerations is that BFCT in this case, in contrast to the standard software, always gives e_1 ≈ e_2 = c_2/2 = 1.0000 with 5 true decimal digits. This is a good approximation to the eigenvalues e_1 = e_2 = 1 of the matrices A(m) for m = 3, 4, 5, when the codes eig and poly fail.
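A minimal sketch of the whole experiment for the matrix (6); the call to jordan assumes the Symbolic Math Toolbox is available:
A = [-100009999, 100000000; -100020001, 100010001];
e = eig(A)'                      % wrong eigenvalues, delivered without warning
[trace(A), det(A)]               % the trace is exact; trace(A)/2 = 1 recovers the eigenvalues
jordan(sym(A))                   % symbolic computation gives the Jordan form I_2 + N_2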

6. Zeros of Functions

Computing zeros of functions and of polynomials in particular is a classical problem in numerical analysis. Although powerful sophisticated computational algorithms for this purpose have been developed, the use of BFCT in this area is still useful.
Consider first the scalar case. Computing the zeros of a continuous function f: R → R, i.e. solving the equation f(x) = 0 in a given interval T = [a, b], may be a difficult problem. Here we distinguish three main tasks.
  • Find one particular solution r 1 of equation f ( x ) = 0 in the interval T.
  • Find the general solution f^(−1)(0) ∩ T of equation f(x) = 0 in the interval T.
  • Find all roots [ r 1 , r 2 , , r n ] of a given nth degree polynomial P ( x ) .
The condition f ( a ) f ( b ) < 0 guarantees that there is at least one solution in the interval ( a , b ) . Whether f has zeros at the points a , b is checked by direct computation.
Solving real and complex polynomial equations P(x) = 0 of degree n, of the form
P(x) = p_1 x^n + p_2 x^(n−1) + ⋯ + p_n x + p_(n+1) = 0
with p_1 ≠ 0, is done by the code vpasolve(P(x) == 0) from MATLAB®. It finds all roots of P(x) with quadruple precision of 32 true decimal digits.
For n = 100 the speed of the corresponding software on a medium computer platform is still acceptable. For n = 1000 and higher the performance of the code may be slow even on fast platforms, and hence it may not be applicable to RTA (if such computations are ever needed in RTA). Note that we write this at the end of 2024.
A faster solution may be obtained by the code roots(p), where p is the vector of coefficients p_1, p_2, …, p_(n+1) of P(x). This code computes the column vector of roots r = [r_1; r_2; …; r_n] ∈ C(n,1) of P(x) as the vector of eigenvalues of the companion matrix of P(x) using the QR algorithm, i.e. r = eig(compan(p)). The code roots(p) works relatively fast for n up to several thousands, but may produce wrong results even for small values of n when the polynomial has multiple roots and the computational problem is poorly conditioned.
It is known that the relative errors in computing multiple roots of equations f(x) = 0 in FPA, where f has a sufficient number of derivatives around a root x = r, are of order eps^(1/n), where n is the multiplicity of r. Correspondingly, the relative errors of most computational algorithms for solving such equations are of order γ eps^(1/n), where γ is a constant of order O(1).
To estimate γ, consider the equation P(x) = 0 when the nth degree polynomial P(x) has an n-tuple root r_1 = 1. Denote the relative error in the computed vector r_comp as
E_n = ‖r_comp − ones(n,1)‖/√n
The ratio γ_n = E_n/eps^(1/n) is shown in Table 3 for the expanded form x^n − n x^(n−1) + ⋯ + (−1)^n of the polynomial P(x) = (x − 1)^n, and values of n from 3 to 20. For n = 2 the computed roots are exact and γ_2 = 0.
We see that the relative error E_n in the computed solution for n up to 20 satisfies the heuristic estimate E_n ≤ γ_n eps^(1/n), where the average of γ_n is 1.87. Next we assume that γ_n = 1.9 for simplicity.
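A minimal sketch of the experiment behind Table 3 for one assumed value of n:
n = 10;                          % assumed multiplicity
p = poly(ones(n,1));             % coefficients of (x - 1)^n, exact integers
r_comp = roots(p);               % roots via the companion matrix
E_n   = norm(r_comp - ones(n,1))/sqrt(n);
gamma = E_n/eps^(1/n)            % compare with the tabulated values of gamma_n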
The polynomial P ( x ) = ( x 1 ) n may be found using the command expand((x-1)^n), or asking the intelligent dialogue computer system WolframAlpha [10]
Find the coefficients of the polynomial (x-1)^n
We stress that in all cases the code vpasolve(P(x) == 0) gives the exact solution vector ones(n,1).
Usually the computed roots are unacceptable if their relative error is greater than 1%. For example, it follows from Table 3 that the root r = 1 of multiplicity 20 is computed by the code roots with relative error 32%, which is unacceptable. It must be stressed that this is not due to programming shortcomings of the code, but rather to the high sensitivity of the eigenvalues of the companion matrix to perturbations in its elements.
If the polynomial equation P(x) = 0 has a root of multiplicity n, the relative error is estimated as 1.9 eps^(1/n). Solving the equation 1.9 eps^(1/n) = 0.01 we get n = 6.89 and n ≈ 7. We summarize this slightly unexpected observation in a special proposition.
Proposition 6.
Relative errors in computing n-tuple roots of polynomials by the code roots may be unacceptable when n ≥ 7.
Let now a continuous function f be given and suppose that we look for all roots of the equation f(x) = 0 in the interval T = [a, b], i.e. for the set f^(−1)(0) ∩ T. Here the use of BFCT seems necessary. A direct approach is to plot the graph of the function x ↦ |f(x)|, x ∈ T. Let N be a sufficiently large integer, say N = 1000. We may use the commands X = linspace(a,b,N); Y = abs(f(X)); and plot(X,Y) to plot the graph of the function x ↦ |f(x)|, x ∈ T.
The zeros of f are marked as inverted peaks at the roots r_1 < r_2 < ⋯. Next the codes vpasolve(f,r_k) or fzero(f,r_k) may be used around each of the points r_k to compute the roots to full accuracy. This approach is well known in computational practice and is recommended in some MATLAB® guides.
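A minimal sketch of this graphical approach for an assumed test function f(x) = sin(x) − x/10 on T = [−10, 10]; here sign-change bracketing replaces the visual inspection of the inverted peaks:
f = @(x) sin(x) - x/10;
a = -10; b = 10; N = 1000;
X = linspace(a, b, N); Y = f(X);
plot(X, abs(Y))                                 % zeros appear as inverted peaks
k = find(Y(1:end-1).*Y(2:end) < 0);             % brackets containing a sign change
r = arrayfun(@(j) fzero(f, [X(j) X(j+1)]), k)   % refine each root with fzero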
More difficult problems are the solution of nonlinear vector equations f(x) = 0 and of matrix equations F(X) = 0, where x and f(x) are n-vectors, and X and F(X) are n × n matrices. The MATLAB® code fsolve is intended to solve nonlinear vector problems, while the codes are, axxbc, care, dare, dlyap, dric, lyap, ric, sylv and others solve linear and nonlinear matrix equations. The performance of all these codes may be checked by BFCT and CPRS. This is a challenging problem which will be considered elsewhere.

7. Minimization of Functions

7.1. Functions of Scalar Argument

The minimization of the expression y = f(x), x ∈ T = [a, b], where f: T → R is a continuous function, is done in MATLAB® by a variant of the golden section method. The corresponding code [x_0,y_0] = fminbnd(f,a,b) finds a local minimum in the interval T, where y_0 = f(x_0).
To find the global minimum [ x min , y min ] of f in the interval T we may use the graph of the function f plotted by the commands x = linspace(a,b,n); y = f(x); plot(x,y). Then the approximate global minimum [ c , f ( c ) ] of f is determined visually, considering also the pairs [ a , f ( a ) ] and [ b , f ( b ) ] at the end points of T. The code
[x_min,y_min] = fminbnd(f,c-h,c+h)
is then used in a neighborhood of the approximate minimum [c, f(c)]. This approach requires human intervention in the process and may be avoided as shown later.
The problem with the code fminbnd is not that it may find local minima. The real problem with this and similar codes is that they can miss the global minimum even when the function f is unimodal, i.e. when it has a unique minimum. So BFCT for such functions are a must. The problem is that usually we do not know whether the function is unimodal or not, and whether the computed minimum is a minimum at all; all the more so when performing complicated algorithms in RTA.
Example 15.
Consider the function
f(x) = 2 − x² exp(1 − x²),  x ∈ [0, ∞)
This function is unimodal with minimum [ x min ; y min ] = [ 1 ; 1 ] . In some cases this minimum is found as
[x_min,y_min] = fminbnd(f,0,b)
for some b > 0. But in other cases it is not found correctly and here we should use BFCT. The code works well for 1 ≤ b ≤ 16.85 but gives a completely wrong result for b ≥ 16.86 without warning. Indeed, we have the good result
[x_min;y_min] = fminbnd(f,0,16.85) =  0.999992123678799
                                      1.000000000124073
and the bad result
[x_min;y_min] = fminbnd(f,0,16.86) = 16.859937887512295
                                      2
The second result has 1586% relative error for x min and 100% relative error for y min .
The reason for the disturbing events in Example 15 is complicated. We may expect that f(x) is computed as 2 for values of x such that the expression φ(x) = x^2 exp(1 − x^2) is less than ρ, since then f(x) = 2 − φ(x) is rounded to 2. The root r_0 > 0 of the equation φ(x) = ρ is r_0 = 6.38 and indeed the computed value of f(r_0) is 2. But r_0 is much less than 16.86 and hence the code fminbnd works with a certain sophisticated version of VPA instead of FPA. In particular, the quantity α = φ(16.86) is much less than ρ, with α/ρ ≈ 2.46 × 10^(−105) ≪ 1. We summarize the above considerations in the next statement.
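The quantities used in this explanation are easy to check by BFCT; the following minimal sketch assumes that the rounding unit ρ equals eps/2 (an assumption of the sketch, since ρ is defined earlier in the paper).
phi = @(x) x.^2.*exp(1 - x.^2);
f   = @(x) 2 - phi(x);
rho = eps/2;                   % assumed value of the rounding unit
f(16.86)                       % computed exactly as 2 in FPA
phi(16.86)/rho                 % about 2.5e-105, i.e. alpha/rho << 1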
Proposition 7.
The software for minimization of functions of one variable such as fminbnd(f,a,b) from MATLAB® may fail to produce correct results even for some unimodal functions f : T → R.
In contrast, BFCT can always be used to find the global minimum automatically. Let the function f : T → R be unimodal. Consider the vectors X = linspace(a,b,n+1) with elements X(k) = a + (k − 1)(b − a)/n and Y = f(X) with elements Y(k). The MATLAB® command [y,m] = min(Y) computes an approximation [X(m), y] to the minimum [x_min, y_min]. Finally, the (almost) exact minimum is computed over the interval between the neighboring elements X(m−1) and X(m+1) as
[x_min,y_min] = fminbnd(f,X(m-1),X(m+1))
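A minimal sketch of this BFCT recipe, applied to the function of Example 15 (the end point b = 20 and the grid size n = 1000 are illustrative choices):
f = @(x) 2 - x.^2.*exp(1 - x.^2);
a = 0; b = 20; n = 1000;
X = linspace(a,b,n+1); Y = f(X);
[y,m] = min(Y);                                   % coarse minimum on the grid
m = min(max(m,2),n);                              % keep both neighbors inside [a,b]
[x_min,y_min] = fminbnd(f,X(m-1),X(m+1))          % refined minimum, close to [1, 1]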

7.2. Functions of Vector Argument

For n ≥ 2 the minimization of continuous functions f : R(n,1) → R of vector argument meets difficulties similar to those described in Section 7.1, as well as other specific difficulties. Unconstrained minimization f(x) → min is done in MATLAB® by the Nelder–Mead algorithm [23,24], realized by the code fminsearch [6]. We stress that the code finds a local minimizer [x_min, y_min] of the function f, where y_min = f(x_min).
The function f need not be differentiable, which is an advantage of this algorithm. The algorithm compares the function values f(x) at the n + 1 vertices of a simplex and then replaces the vertex with the highest value of f by another vertex. The simplex usually (but not always, as shown below) contracts to a minimum of f, which may be local or global. The corresponding command is
[x_min,y_min] = fminsearch(f,x_0)
where x 0 is the initial vector.
The computed result depends on the following five factors, among others:
  1. the computational problem
  2. the computational algorithm
  3. the FPA
  4. the computer platform
  5. the starting point x_0
Factors 1–3 are usually out of the control of the user. Factor 4 is partially under control, e.g. the user may buy another platform. Only factor 5, which is very important, is entirely under the control of the user. As mentioned above, the computed value x_min depends on the starting point x_0. This is denoted as x_min = Ψ(x_0).
Definition 10.
The point x 0 is said to be a numerical fixed point (NFP) of the computational procedure if x 0 = Ψ ( x 0 ) .
The computed result x_min is a NFP of the computational procedure. The NFP depends not only on the computational problem and the computational algorithm but also on the FPA and on the particular computer platform where the algorithm is realized. Unfortunately, computed values for x_min which are far from any actual minimum are often NFP of the computational procedures. They are returned to the user without warning.
For a class of optimization problems the function f has the form
f(x) = c + p(x) e(x),   e(x) = exp(−a_1 x_1^2 − a_2 x_2^2 − ⋯ − a_n x_n^2)   (8)
where c > 0, p(x) is a polynomial in x and the a_k are positive constants. Since f(x) → c when ‖x‖ → ∞, the function (8) has (at least one) global minimum [x_min, y_min].
Numerical experiments with functions f of type (8) show not only that the computed values for x_min and y_min depend on the initial guess x_0, but also that in many cases the code fminsearch produces a wrong solution for x_min which is far from any minimizer of the function f. The reason is that the value of p(x) e(x) may become very small relative to c. Then f(x) is computed as c and the algorithm stops at a point which is far from any minimizer of f. Usually no warning is issued for the user in such cases. This may be a problem when the minimization code is a part of an automatic computational procedure which is not under human supervision. In such cases application of the techniques of AI and MAI may be useful.
Example 16.
Let n = 2 and consider the function of type (8)
f(x) = 2 − x_1 x_2 exp(1 − (x_1^2 + x_2^2)/2)
i.e. c = 2, p(x) = −exp(1) x_1 x_2 and a_1 = a_2 = 1/2. The function f is coded as
f = @(x)2-x(1)*x(2)*exp(1-x(1)^2/2-x(2)^2/2)
The function y = f(x), x ∈ R × R, has two global minimizers
x_{min,1} = [1, 1],   x_{min,2} = [−1, −1]
for which
y min , 1 = y min , 2 = 1 .
It also has two global maximizers
x_{max,1} = [1, −1],   x_{max,2} = [−1, 1]
for which
y max , 1 = y max , 2 = 3 .
Let the initial guess for the code
[x_min,y_min] = fminsearch(f,x_0)
be x 0 = [ c , c ] . For c = 6.4308 we get the relatively good result
x_min = -1.000041440968428   -0.999998218805417
y_min =  1.000000001720503
For the slightly different value c = 6.4309 the computed result x min = [ c , c ] is a numerical fixed point which is not a minimizer. Indeed, we have
x_min = 6.430900000000000   6.430900000000000
y_min = 2.000000000000000
The second computed result in Example 16 is a numerical catastrophe with 543% relative error in x_min and 100% relative error in y_min. The reason is that c^2 exp(1 − c^2) ≤ ρ for c ≥ 6.4391 and the value of f at [c, c] is rounded to 2. The conclusion is that the algorithm of the moving simplex uses FPA rather than VPA; compare with Example 15.
The minimization problem in Example 16 is not unimodal since it has two solutions for x min and a unique solution for y min . Solving unimodal problems by the code fminsearch also leads to large errors. Consider the next unimodal problem for which the code also fails for moderate data.
Example 17.
Let n = 2 and consider the function of type (8)
f(x) = 2 − x_1 exp(1/2 − x_1^2/2 − (x_2 − 1)^2)
The function f is coded as
f = @(x)2-x(1)*exp(1/2-x(1)^2/2-(x(2)-1)^2)
The minimization problem has the unique solution x_min = [1, 1], y_min = 1. For c = 5.81 the code
[x_min,y_min] = fminsearch(f,[c,c])
gives a quite acceptable result, namely
x_min =  1.000039724696251   1.000017661431510
y_min =  1.000000001889957
For a slightly different value of c the computed result x min = [ c , c ] is a numerical fixed point which is not a minimizer. Indeed, for c = 5.82 the computed result is wrong, namely
x_min =  5.820000000000000   5.820000000000000
y_min =  2
There is 482% relative error in the computed argument x min and 100% relative error in the computed minimum y min .
The way out of such situations is, at least for small n, to use BFCT as follows. Let n = 2 and let f : T = T_1 × T_2 → R be the function which has to be minimized, where T_k = [a_k, b_k] ⊂ R. We choose positive integers n_1, n_2 and compute the grids
X_k = linspace(a_k,b_k,n_k)
and the quantities
F(i, j) = f(X_1(i), X_2(j)).
Thus we construct the matrix F ∈ R(n_1, n_2) which is a discrete analogue of the surface z = f(x, y), (x, y) ∈ T.
Next the minimal element F(k_1, k_2) and its indexes (k_1, k_2) are found by the BFCT described in Section 4.2. Now the approximate global minimum of f is [X_min, Y_min], where X_min ∈ R(2,1) may be used as the starting point x_0 for the code
[x_min,y_min] = fminsearch(f,x_0)
Extensive numerical experiments show that this code now works properly.
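A hedged sketch of this two-dimensional BFCT procedure for the function of Example 17 (the grid limits and sizes below are illustrative choices):
f  = @(x) 2 - x(1)*exp(1/2 - x(1)^2/2 - (x(2)-1)^2);
a1 = -3; b1 = 3; a2 = -3; b2 = 3; n1 = 201; n2 = 201;
X1 = linspace(a1,b1,n1); X2 = linspace(a2,b2,n2);
F  = zeros(n1,n2);
for i = 1:n1
    for j = 1:n2
        F(i,j) = f([X1(i); X2(j)]);               % F(i,j) = f(X_1(i), X_2(j))
    end
end
[Y_min,idx] = min(F(:));                          % minimal element of F
[k1,k2] = ind2sub([n1 n2],idx);                   % its indexes (k_1, k_2)
x_0 = [X1(k1); X2(k2)];                           % BFCT starting point
[x_min,y_min] = fminsearch(f,x_0)                 % now converges to [1; 1], y_min = 1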

8. Canonical Forms of 2 × 2 Matrices

8.1. Preliminaries

An interesting observation is that important properties of linear operators in n-dimensional vector spaces can be revealed for dimensions as low as n = 2 , see [28]. In this section we consider some modified definitions of canonical forms and present instructive examples for the sensitivity of Jordan forms using BFCT.

8.2. Jordan and Generalized Jordan Forms

In this section we use some not very popular definitions of canonical forms, e.g. a definition of Jordan form of a matrix without mentioning eigenvalues and without using the concept of Jordan blocks.
Definition 11.
Let n ∈ Z, n ≥ 2. The upper triangular bi-diagonal matrix J ∈ C(n) is said to be a Jordan matrix if
  • J(k, k+1) ∈ {0, 1}
  • J(k, k+1) = 1 implies J(k, k) = J(k+1, k+1)
  • J(k, k) ≠ J(k+1, k+1) implies J(k, k+1) = 0
for k = 1, 2, …, n − 1. The set of Jordan matrices is denoted as J(n).
Definition 12.
The Jordan matrix J is spectrally ordered if J(k, k) ≽ J(k+1, k+1) for k = 1, 2, …, n − 1. The set of spectrally ordered Jordan matrices is denoted as J̃(n).
Proposition 8.
The set J(2) consists of the matrices diag(λ, μ) and λI_2 + E_{1,2}, where λ, μ ∈ C.
Definition 13.
The Jordan problem for a nonzero matrix A ∈ C(n) is to find a pair (X_A, J_A) ∈ GL(n) × J(n) such that X_A J_A = A X_A. Here X_A is the transformation matrix and J_A is the Jordan form of A.
Note that we do not explicitly use the spectrum of A or the concept of Jordan blocks. Indeed, given a general matrix A (even with integer elements and of low size as in (6)), a priori we know neither its spectrum nor its Jordan structure.
Remark.  The Jordan form J_A is not unique and is determined only up to the ordering of the unit elements on its bi-diagonal, unless A = λI_n, in which case J_A = A is unique and X_A is an arbitrary invertible matrix.
The Jordan form may be defined uniquely, so as to become a canonical Jordan form, in terms of group theory [29] and integer partition theory [30], although this is rarely done to a sufficient extent in the literature. The transformation matrix X_A is always non-unique since, e.g., the matrix γX_A, γ ≠ 0, is also a transformation matrix.
Example 18.
If the matrix A ∈ C(n) has a single eigenvalue then there are 2^(n−1) different orderings of the unit elements on the n − 1 positions (k, k+1) of the super-diagonal of A.
Definition 14.
Let J ∈ J(n). The matrix X ∈ GL(n) is said to be a stabilizer of J if X^(−1) J X ∈ J(n). The set of stabilizers of J is denoted as Stb(J) ⊂ GL(n). The matrix I_n is the center of Stb(J).
Note that the set Stb(A) in general is not necessarily a group unless A = λI_n, and {I_n} is not a center in the group-theoretical sense (we recall that the center of GL(n) is the subgroup of matrices aI_n, a ≠ 0).
Example 19.
Let n = 2 and A ∈ J(2). Then the following three cases are possible.
  • If A = λI_2 then Stb(A) is the group GL(2).
  • If A = diag(λ_1, λ_2), λ_1 ≠ λ_2, then Stb(A) is the set of matrices diag(a_1, a_2) and a_1 E_{1,2} + a_2 E_{2,1}, where a_1 a_2 ≠ 0.
  • If A = λI_2 + E_{1,2} then Stb(A) is the set of matrices a(E_{1,2} + E_{2,1}), a ≠ 0.
Example 20.
 
Definition 15.
The upper triangular bi-diagonal matrix G ∈ C(n) is said to be a generalized Jordan matrix if
  • G(k, k+1) ≠ 0 implies G(k, k) = G(k+1, k+1)
  • G(k, k) ≠ G(k+1, k+1) implies G(k, k+1) = 0.
The set of generalized Jordan matrices is denoted as G ( n ) .
Generalized Jordan matrices G have the structure of Jordan matrices, where the nonzero elements G ( i , i + 1 ) (if any) are not necessarily equal to 1. Hence they are less sensitive to perturbations in the matrix A.
There are two forms, diag(λ, μ) and λI_2 + aE_{1,2}, of the elements of G(2), where λ, μ, a ∈ C and a ≠ 0. Note that we do not define spectrally ordered generalized Jordan matrices since they are not important for the computational practice.
Both Jordan matrices J and generalized Jordan matrices G are bi-diagonal, where the bi-diagonal elements of J are zeros or ones, while the bi-diagonal elements of G are correspondingly zeros or nonzero numbers.
Definition 16.
The generalized Jordan problem for a nonzero matrix A ∈ C(n) is to find a pair (Y_A, G_A) ∈ GL(n) × G(n) such that Y_A G_A = A Y_A. Here Y_A is the transformation matrix and G_A is the generalized Jordan form of A.
Consider now perturbations in Jordan forms. Let A(ε) = A_0 + εA_1, where A_0, A_1 ∈ C(n) and ε ∈ (−ε_0, ε_0), with ε_0 > 0 a small parameter. If the matrix A_0 has multiple eigenvalues then the transformation matrix X(ε) = X_{A(ε)} and the Jordan form J(ε) = J_{A(ε)} of A(ε) may be discontinuous at the point ε = 0. The situation with the transformation matrix X(ε) is even more subtle.
If the matrix A 0 has simple eigenvalues then the matrices X ( ε ) and J ( ε ) depend continuously on ε . In particular X ( 0 ) = X 0 and J ( 0 ) = J 0 .
Next we consider matrices A R ( 2 ) with real spectrum. Then the transformation matrices X , Y , the Jordan forms J and the generalized Jordan forms G are real as well.
For small size matrices A with rational elements (machine numbers in particular) A(i, j), the matrices X_A and J_A are computed exactly by the MATLAB® (Maple) command [X_A,J_A] = jordan(A).
Example 21.
Consider the matrix A = I 2 and let
A(ε) = I_2 + εE_{2,1},   ε = 2^(−m),   m ∈ Z, m ≥ 1.
Then the transformation matrix and the Jordan form of A(ε) are
X(ε) = E_{2,1} + ε^(−1) E_{1,2},   J(ε) = I_2 + E_{1,2}.
The code  [X,J] = jordan(A)  computes the matrices X(ε), J(ε) exactly for m ≤ 1074. For m = 1075 the matrices X_A and J_A are computed as I_2 since ε is rounded to zero.
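A minimal sketch reproducing Example 21, assuming that the Symbolic Math Toolbox (which provides the command jordan) is available:
m  = 1074;
ep = 2^(-m);                          % the smallest positive subnormal number
A  = [1 0; ep 1];                     % A(eps) = I_2 + eps*E_{2,1}
[X,J] = jordan(A)                     % exact result: J = [1 1; 0 1]
[X0,J0] = jordan([1 0; 2^(-1075) 1])  % 2^(-1075) is rounded to 0, so J0 = I_2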
Example 22.
Consider the matrix A = I 2 + E 1 , 2 and let
A(ε) = A + εE_{1,1},   ε = 2^(−m),   m ∈ Z, m ≥ 1.
Then the Jordan forms of A(ε) are
J_1(ε) = diag(1 + ε, 1),   J_2(ε) = diag(1, 1 + ε).
The form J_1(ε) is unachievable. For the form J_2(ε) we have infinitely many transformation matrices X_2(ε), e.g.
X_2(ε) = μ (E_{1,2} + E_{2,1} − ε^(−1) E_{1,1})
where μ ≠ 0 is arbitrary.
The code  [X,J] = jordan(A)  computes exactly the matrices X_2(ε) (with μ = 1) and J_2(ε) for ε = 2^(−m) and m ≤ 44. For m = 45 these matrices are computed wrongly as X_2(ε) = I_2 and J_2(ε) = I_2.
Proposition 9.
Jordan forms J_A of matrices A with multiple eigenvalues are sensitive to perturbations in A for three main reasons: 1) because J_A(i, j) = 0 for i ≥ j + 1; 2) because J_A(i, j) = 0 for j ≥ i + 2 (if n ≥ 3); and 3) because their nonzero elements J_A(i, i+1) (if any) are equal to 1.

8.3. Condensed Schur Form

Schur forms S_A of matrices A satisfy the only condition S_A(i, j) = 0 for i ≥ j + 1 and are thus less sensitive to perturbations in A. Nevertheless, they may be discontinuous in a neighborhood of a matrix A_0 with multiple eigenvalues. The case of simple eigenvalues is studied in more detail, see e.g. [31].
Definition 17.
The matrix S C ( n ) is said to be a Schur matrix if it is upper triangular. The set of Schur matrices is denoted as S ( n ) .
We recall that we study only the cases n ≥ 2. Now it is easy to see that S(n) = G(n) if and only if n = 2.
Each matrix A ∈ C(n) is unitarily similar to a Schur matrix S_A ∈ S(n), i.e. U_A S_A = A U_A for some matrix U_A ∈ U(n). The matrix U_A is the transformation matrix and the matrix S_A ∈ S(n) is the condensed Schur form of A. The Schur problem for A is to describe the set of pairs (U_A, S_A) and to study their sensitivity relative to perturbations in A.
Although less sensitive to perturbations in A, the solution ( U A , S A ) of the Schur problem may also be sensitive and even discontinuous as a function of A at points A 0 , where the matrix A 0 has multiple eigenvalues.
Let A = λI_2 and let εE_{2,1} be a perturbation of A. The Schur form of A is A itself and in this case we may consider the trivial solution (I_2, A) of the Schur problem for A. For arbitrarily small ε ≠ 0 the Schur forms of A + εE_{2,1} are S = λI_2 ± εE_{1,2} with transformation matrices V = ±E_{1,2} ± E_{2,1}. Thus the solution jumps from the trivial transformation U = I_2 to V for ε ≠ 0.
Example 23.
Let m = 50, a = 2^(−m) and A = [1, 0; a, 1]. The command
[X,J] = schur(A)
gives the correct answer
X =  0    -1       J = 1.000000000000000  -0.000000000000001
     1     0           0                   1.000000000000000
For m = 51 we have
X =  1     0       J = 1     0
     0     1           0     1
due to rounding errors. Obviously the code schur works with FPA.

9. Least Squares Revisited

9.1. Preliminaries

In least squares methods (LSM) the aim is to minimize the sum of squared residuals. A not very popular fact is that a smaller residual may correspond to a larger error [21,22]. Moreover, this phenomenon may be observed on an infinite set of residuals and errors. This may have far-reaching consequences. So the LSM has to be revisited taking such cases into account.
A least squares problem (LSP) may arise naturally in connection with a given approximation scheme, or as an equivalent formulation of a zero finding method.

9.2. Nonlinear Problems

The solution of the (overdetermined) equation f(x) = 0, where x and f(x) are vectors and f is a continuous function, may be reduced to minimization of the quantity ‖f(x)‖^2. Such minimization problems also arise in a genuine optimization setting, e.g. in the approximation of data.
Let φ : C(n,1) → R_+ be a continuous function and let the problem φ(x) → min be unimodal, i.e. there is a unique (global) minimum [x_0; y_0] ∈ C(n,1) × R_+, where y_0 = φ(x_0). Denote by e_k = ‖x_k − x_0‖ the distance between the vector x_k and x_0, or the absolute error of the vector x_k when x_k is interpreted as a computed solution.
Let n = 1 and φ : R → R_+. Let (x_k) be a sequence of computed solutions. Then it is possible that r_k = φ(x_k) tends to 0 while e_k tends to ∞ as k → ∞, as the next example shows.
Example 24.
Let
φ(x) = (1 + (x − 1)^2)/(1 + (x − 1)^4),   x ∈ R
The function φ is unimodal with minimum [x_0, y_0] = [1, 1]. Let x_k = k. Then r_k = φ(x_k) → 0 and e_k = |k − 1| → ∞ as k → ∞.
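A minimal sketch illustrating this opposite behavior of the remainder and the error for the function φ of Example 24:
phi = @(x) (1 + (x-1).^2)./(1 + (x-1).^4);
k = (2:100:1e5)';                     % x_k = k
r = phi(k);                           % remainders, tend to 0
e = abs(k - 1);                       % errors, tend to infinity
loglog(e,r), xlabel('e_k'), ylabel('r_k')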
This result is easily extended to the case n ≥ 2.
Example 25.
Let
φ(x) = (1 + (‖x‖ − 1)^2)/(1 + (‖x‖ − 1)^4),   x ∈ C(n,1)
The function φ has infinitely many minimums [x_0, 1] on the unit sphere ‖x_0‖ = 1. Let x_k = k [1; 0; …; 0] for k ∈ Z, k ≥ 1, i.e. ‖x_k‖ = k. Then r_k → 0 and e_k → ∞ as k → ∞.
Thus the residual r_k and the error e_k may have opposite behavior. This phenomenon, namely that a larger remainder may correspond to a smaller error and vice versa, is known as Remainders vs. Errors, see [22]. In particular, it affects the stopping rules used in implementations of the LSM.

9.3. Linear Problems

Consider a matrix A ∈ R(m,n), m > n, of full column rank, and let b ∈ R(m,1) be a given vector which is not in the range of A. The linear least squares problem (LLSP) is to find the vector x ∈ R(n,1) such that r(x) = ‖Ax − b‖ → min. This problem is unimodal and its theoretical solution is A^†b, where A^† = (A^⊤A)^(−1)A^⊤ ∈ R(n,m) is the pseudo-inverse of the matrix A.
This representation of the solution is not used in computational practice since it is inefficient and may be connected with a substantial loss of accuracy. Instead, the solution is found via a QR decomposition A = QR of the matrix A. Let R = [S; T] and Q^⊤b = [c; d], where the matrix S ∈ R(n) is upper triangular and invertible, and c ∈ R(n,1). The solution x_* is obtained by the code x = A\b, which in fact computes x = S\c.
Intensive experiments based on BFCT with random data show that the error of the “bad” solution A^†b, computed from the explicit pseudo-inverse, is between 5 and 10 times larger than the error of the “good” solution x_* = A\b. In addition, computing the “bad” solution requires much more computational time.
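In the spirit of these experiments, the following hedged sketch compares the two routes on a synthetic LLSP with a known solution; the sizes, the condition number and the construction of the data are illustrative choices, not the random data of the paper.
m = 200; n = 50;
[U,~] = qr(randn(m,n),0); [V,~] = qr(randn(n));   % random orthonormal factors
A  = U*diag(logspace(0,-8,n))*V';                 % cond(A) = 1e8
x0 = randn(n,1);                                  % exact LS solution by construction
w  = randn(m,1);
b  = A*x0 + (w - U*(U'*w));                       % residual orthogonal to range(A)
x_good = A\b;                                     % QR-based solution
x_bad  = (A'*A)\(A'*b);                           % normal equations (pseudo-inverse route)
err_good = norm(x_good - x0)                      % typically noticeably smaller ...
err_bad  = norm(x_bad  - x0)                      % ... than the normal equations error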
Let x_1, x_2 ∈ R(n,1) be two approximate solutions of the above LLSP which are different from the exact solution x_0. Set r_k = ‖Ax_k − b‖ and e_k = ‖x_k − x_0‖, k = 1, 2. We have σ_n ≤ r_k/e_k ≤ σ_1, where σ_1 = ‖A‖ and σ_n = 1/‖A^†‖ are the maximal and minimal singular values of A. Next, it is possible [22] to choose x_k so that r_1/e_1 = σ_1 and r_2/e_2 = σ_n. Then
r_1/r_2 = cond(A) · (e_1/e_2),   cond(A) = σ_1/σ_n   (9)
We see that when the matrix A is poorly conditioned, in the sense that cond(A) ≫ 1, the behavior of remainders and errors may be opposite: a larger remainder may correspond to a smaller error. This fact deserves a separate formulation.
Proposition 10.
When cond ( A ) is large, larger remainders may correspond to smaller errors. In this case checking the accuracy of the computed solutions by the remainders is misleading.
The situation with remainders and errors is indeed ironic. If the matrix A is well conditioned, with cond(A) close to 1, the computed approximations are most probably good and there is no need to check their accuracy by remainders. But if the matrix A is poorly conditioned and an accuracy test seems necessary, the accuracy check by remainders may lead to wrong conclusions. Note that this phenomenon is possible only if n ≥ 2.
Example 26.
Consider the simplest case n = 2 and A = [B; 0] ∈ R(m,2), b = 0, where B = [μ^3, 1; −μ^3, 1] ∈ R(2) and μ > 0 is a small parameter. The only solution of the LSP is x_0 = 0. Let x_1 = [0; μ] and x_2 = [1/μ; 0] be two approximate solutions (the vector x_2 with error e_2 = 1/μ ≫ 1 can hardly be recognized as an approximate solution but we shall ignore this fact). We have
r_1 = μ√2,   e_1 = μ,   r_2 = μ^2 √2,   e_2 = 1/μ
and
r_1/r_2 = 1/μ ≫ 1,   e_1/e_2 = μ^2 ≪ 1
This is possible since the matrix B is very badly conditioned, with cond(B) = μ^(−3) ≫ 1, and the estimate (9) is achieved.
Denote T = (0, 1) ⊂ R_+ and let {x(t) : t ∈ T} ⊂ R(n,1) be a family of approximate solutions of the equation Ax = b, where the function x : T → R(n,1) is differentiable. Let x(t) tend to x_0 as t → 0 and denote by e(t) = ‖x(t) − x_0‖ and r(t) = ‖Ax(t) − b‖ the error and the remainder of the approximate solution x(t). These relations define the remainder r as a parametric function of e and vice versa.
Proposition 11.
There exists a smooth function x : T → R(n,1) such that the following assertions hold true.
  • The function e : T → R_+ is smooth and decreasing.
  • The function r : T → R_+ is smooth and non-monotone in each interval (0, β), where 0 < β < 1.
  • The function e ↦ r(e) (whenever defined) is smooth and non-monotone in each arbitrarily small subinterval (0, θ) of the interval (0, ε), where ε = sup{e(t) : t ∈ T}.
Example 27.
Let n = 2 ,
A = [a, 0; 0, 1],   x(t) = [ξ_1(t); ξ_2(t)],   b = [0; 0]
and a ∉ {−1, 0, 1}. Then x_0 = 0 and
r(t) = √(a^2 ξ_1^2(t) + ξ_2^2(t)),   e(t) = √(ξ_1^2(t) + ξ_2^2(t))
Let t ∈ T and ξ_1(t) = t^3 cos(1/t), ξ_2(t) = t^3 sin(1/t). We have e(t) = t^3 and
r(t) = t^3 √(1 + (a^2 − 1) cos^2(1/t))
Therefore
r(e) = e √(1 + (a^2 − 1) cos^2(e^(−1/3)))
The differentiable function e ↦ r(e) thus defined is non-monotone on each interval (0, θ), θ > 0.
This non-monotonicity means that for any θ > 0 there exist infinitely many pairs of points e_1, e_2 ∈ (0, θ) with e_1 < e_2 and r(e_1) > r(e_2). Thus the LSM may be misleading not only for linear algebraic equations but also for the LLSP, for which the method had been specially designed.
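A short sketch visualizing the non-monotone dependence r(e) from Example 27 (with the illustrative choice a = 3):
a = 3;
t = linspace(1e-3,1,2000);
e = t.^3;                                         % error e(t)
r = t.^3.*sqrt(1 + (a^2 - 1)*cos(1./t).^2);       % remainder r(t)
plot(e,r), xlabel('error e'), ylabel('remainder r(e)')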

10. Integrals and Derivatives

10.1. Integrals

The calculation of the definite integral I = int(f, a, b) is one of the oldest computational problems in numerical mathematical analysis. According to [25], a variant of the trapezoid rule was used in Babylon around 50 BCE for integrating the velocity of Jupiter along the ecliptic.
Usually the integral is computed for a, b ∈ R, a < b, but the case when one or both of the limits a, b are infinite and the integral is improper is also considered. There are many sophisticated algorithms and computer codes for computing such integrals, see [1]. In MATLAB® definite integrals of functions of one variable f : [a, b] → R are computed by the command integral(f,a,b).
In this section we apply BFCT to solving an integral with highly oscillating integrand by the standard trapezoidal rule [26].
Let a = x_1 < x_2 < ⋯ < x_n = b be a partition of [a, b]. Sophisticated integration schemes for computing I using the values y_k = f(x_k) had been developed in the pre-computer age (before 1950), when the computation of y_k for n of order 10^3 was a problem. Now BFCT allow one to obtain results for large n, up to 10^8, using simple quadratures such as the formula of trapezoids with equal spacing [27], described below as MATLAB® code
n = 10^m; X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n; ...
T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2)
If the function f is twice continuously differentiable then the absolute error of the trapezoidal method with equal spacing is estimated as
|T_n − I| ≤ E_n(f, a, b) = γ(f, a, b)/n^2
where
γ(f, a, b) = (b − a)^3 μ_2(f)/12,   μ_2(f) = max{|f″(x)| : x ∈ [a, b]}
Example 28.
Consider the integral I = int(f, a, b) = F(b) − F(a), where f = F′, a > 0 and
F(x) = cx + x^2 sin(x^(−1)),   x ∈ [a, b]
Then
f(x) = c + 2x sin(x^(−1)) − cos(x^(−1)),   f″(x) = x^(−4) cos(x^(−1))
For
a = 1/(mπ),   b = 1/π,   c = mπ/(m − 1),   m ∈ Z, m ≥ 2
we have
sin(a^(−1)) = 0,   cos(a^(−1)) = (−1)^m,   F(a) = ca,   sin(b^(−1)) = 0,   F(b) = cb
and I = c(b − a) = 1.
For m = 11 the number a ≈ 0.0289 is relatively small and the function f is highly oscillating close to a, see Figure 2. Next we have
μ_2(f) = |f″(a)| = (11π)^4,   γ(f, a, b) = 2750π/3 ≈ 2880
and E_n(f, a, b) ≈ 2880/n^2.
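A hedged sketch reproducing one row of Table 4 with the trapezoidal code given above (m = 11 as in Example 28; n = 1000 is one of the rows of the table):
m = 11; a = 1/(m*pi); b = 1/pi; c = m*pi/(m-1);
f = @(x) c + 2*x.*sin(1./x) - cos(1./x);          % f = F', see Example 28
n = 1000;
X = linspace(a,b,n+1); Y = f(X); h = (b-a)/n;
T_n = h*(sum(Y(2:n)) + (Y(1) + Y(n+1))/2);
err = abs(T_n - 1)                                % of order 1e-7, cf. Table 4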
The error |T_n − 1| and the ratio r_n(f, a, b) = |T_n − 1| / E_n(f, a, b) for various n up to 10^8 are shown in Table 4.
For n from 1 000 to 1 000 000 the error (which is both absolute and relative since I = 1) decreases as 0.4/n^2, while the ratio of the error and the estimate E_n(f, a, b) is approximately constant, about 1.5 × 10^(−4). The behavior of the error |T_n − 1| is correctly evaluated, but the bound μ_2(f) = a^(−4) for |f″(x)| considerably overestimates the real behavior of |f″(x)|, although this bound is attained for x = a and is hence not improvable.
For n between 1 500 000 and 100 000 000 the accuracy of the computed solution does not increase with n and shows chaotic behavior due to the accumulation of rounding errors. The result T_n for n = 1000, and even for n = 100, is satisfactory from a practical viewpoint.
For the choice of f, a, b as in Example 28 the MATLAB® code integral(f,a,b) produces the answer with error 7.1720 × 10^(−14). At the same time the NLF F(b) − F(a) gives the result 1 − ρ with error ρ ≈ 1.1011 × 10^(−16). So in this case the code integral(f,a,b) does not use the NLF.

10.2. Derivatives

Numerical differentiation of a scalar function f of scalar or vector argument x is used in many cases, e.g. when an analytical expression for the derivative f′(x) is not available or is too complicated, or when numerical algorithms are applied that use approximate derivatives. In the simplest yet most widespread case the derivative f′(x) is approximated numerically by the first order finite difference Δ_h(f, x) = (f(x + h) − f(x))/h, where |h| is small but not too small. For a given x the behavior of Δ_h(f, x) as a function of h resembles a saw, see Example 29 below.
Example 29.
Consider the first difference Δ_h(x, 1) for the simplest non-constant function f(x) = x at the point x = 1. We have Δ_h(x, 1) = ((1 + h) − 1) h^(−1). The graph of the computed function h ↦ ((1 + h) − 1) h^(−1) for h ∈ [−3 eps, 3 eps] \ {0} is shown in Figure 3. The graph of the exact function h ↦ 1 is the straight line in red.
Let f′(x) ≠ 0 and x ≠ 0. When h/x is too small, e.g. h/x ∈ [ρ/2, ρ], the computed value of f(x + h) is f(x), Δ_h(f, x) = 0 and the derivative f′(x) is computed as 0 with 100% relative error. This numerical catastrophe may be avoided by BFCT revealing the behavior of Δ_h(f, x) for small values of h.
The general principle here is that h must be chosen as C (eps)^(1/2) ≈ C/10^8, and the problem is how to estimate the constant C, especially in RTA. When no information about the constant C > 0 is available, the heuristic rule is to set C = 1.
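A minimal sketch of this step-size rule for the illustrative function f(x) = exp(x) at x = 1 (not an example from the paper):
f = @(x) exp(x);
x = 1; df = exp(1);                               % exact derivative f'(1)
h = 10.^(-(1:16));                                % trial step sizes
D = (f(x+h) - f(x))./h;                           % forward differences
err = abs(D - df);
loglog(h,err)                                     % error is smallest near h ~ 1e-8
xlabel('h'), ylabel('error of the first difference')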

11. VPA versus FPA

There are many types of variable precision arithmetic (VPA) and computer languages in which VPA is realized. Here we also include systems using symbolic objects like π, √2, etc. VPA may work with very high precision (actually, with unlimited precision) but the performance of numerical algorithms with VPA may be slow and not suitable for RTA. In contrast, binary double precision floating-point arithmetic (FPA) works with a finite set of machine numbers, has finite precision and works relatively fast. Thus FPA is suitable for RTA.

12. Conclusion

The combination of BFCT and CPRS provides a powerful tool for solving problems and checking computed solutions and a priori accuracy estimates for many computational problems. Also, according to the authors, BFCT may become a useful complement in the implementation of techniques of artificial intelligence (AI) to the systems for doing mathematics such as MATLAB® [6,7] and Maple [8].
Our main results are summarized in Propositions 1–11. Most of the results presented are valid for using BFCT in FPA. Using extended precision, as in VPA, some of the observed effects may no longer occur. This may be done at the price of a considerable increase of the computational time, thus excluding some RTA.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets generated during the current study are available from the author upon reasonable request.

Conflicts of Interest

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Stoer, J.; Bulirsch, R. Introduction to Numerical Analysis; Springer: New York, 2002; ISBN 978-0387954523.
  2. Faires, D.; Burden, A. Numerical Analysis; Cengage Learning: Boston, 2016; ISBN 978-1305253667.
  3. Chapra, S. Applied Numerical Methods with MATLAB for Engineers and Scientists, 5th ed.; McGraw Hill, 2017; ISBN 978-12644162604.
  4. Novak, K. Numerical Methods for Scientific Computing; Equal Share Press, 2022; ISBN 978-8985421804.
  5. Driscoll, T.; Braun, R. Fundamentals of Numerical Computation; SIAM, 2022.
  6. The MathWorks, Inc. MATLAB Version 9.9.0.1538559 (R2020b); The MathWorks, Inc.: Natick, MA, USA, 2020. Available online: www.mathworks.com.
  7. Higham, D.; Higham, N. MATLAB Guide, 3rd ed.; SIAM: Philadelphia, 2017; ISBN 978-1611974652.
  8. Maplesoft. Maple 2017.3; Maplesoft: Ontario, 2017. Available online: www.maplesoft.com/products/maple/.
  9. Mathematica. Available online: www.wolfram.com/mathematica/.
  10. WolframAlpha. Available online: www.wolframalpha.com/.
  11. Barbu, A.; Zhu, S. Monte Carlo Methods; Springer: Singapore, 2020; ISBN 978-9811329708.
  12. Davies, A.; et al. Advancing mathematics by guiding human intuition with AI. Nature 2021, 600.
  13. Fink, T. Why mathematics is set to be revolutionized by AI. Nature 2024, 629.
  14. IEEE Std 754-2019; IEEE Standard for Floating-Point Arithmetic. IEEE Computer Society: NJ, USA, 2019.
  15. Higham, N. Accuracy and Stability of Numerical Algorithms; SIAM: Philadelphia, 2002; ISBN 0-898715210.
  16. Higham, N.; Konstantinov, M.; Mehrmann, V.; Petkov, P. The sensitivity of computational control problems. IEEE Control Systems Magazine 2004, 24, 28–43.
  17. Goldberg, D. What every computer scientist should know about floating-point arithmetic. ACM Computing Surveys 1991, 23, 5–48.
  18. Pugh, C. Real Mathematical Analysis; Springer Nature: Switzerland, 2015; ISBN 978-3319177700.
  19. Konstantinov, M.; Petkov, P. Computational errors. International Journal of Applied Mathematics 2022, 1, 181–203.
  20. Balinski, M.; Young, H. Fair Representation: Meeting the Ideal of One Man, One Vote, 2nd ed.; Brookings Institution Press: New Haven and London, 2001; ISBN 978-0815701118.
  21. Booth, A. Numerical Methods, 3rd ed.; Butterworths Scientific Publications: London, 1966; LCCN 57004892.
  22. Konstantinov, M.; Petkov, P. Remainders vs. errors. AIP Conference Proceedings 2016, 1789, 060007.
  23. Lagarias, J.; Reeds, J.; Wright, M.; Wright, P. Convergence properties of the Nelder–Mead simplex method in low dimensions. SIAM Journal on Optimization 1998, 9, 112–147.
  24. Nelder, J.; Mead, R. A simplex method for function minimization. Computer Journal 1965, 7, 308–313.
  25. Ossendrijver, M. Ancient Babylonian astronomers calculated Jupiter’s position from the area under a time-velocity graph. Science 2016, 351, 482–484.
  26. Rahman, Q.; Schmeisser, G. Characterization of the speed of convergence of the trapezoidal rule. Numerische Mathematik 1990, 57, 123–138.
  27. Atkinson, K. An Introduction to Numerical Analysis, 2nd ed.; John Wiley & Sons: New York, NY, USA, 1989; ISBN 978-0471500230.
  28. Glazman, M.; Ljubich, J. Finite-Dimensional Linear Analysis: A Systematic Presentation in Problem Form (Dover Books in Mathematics); Dover Publications: New York, NY, USA, 2006; ISBN 978-04866453323.
  29. Jacobson, N. Basic Algebra I; W. Freeman & Co Ltd.: London, UK, 1986; ISBN 978-07167114804.
  30. Andrews, G. The Theory of Partitions, reprint ed.; Cambridge University Press: Cambridge, 2008; ISBN 978-0521637664.
  31. Konstantinov, M.; Petkov, P. On Schur forms for matrices with simple eigenvalues. Axioms 2024, 13, 839.
Figure 1. Scaled rounding errors.
Figure 2. Oscillating function.
Figure 3. Computed first difference of the function f(x) = x.
Table 2. Errors and estimates for low order systems.
m     err                est                err/est
2     3.1601 × 10^(−9)   9.0612 × 10^(−8)   0.0349
3     1.6691 × 10^(−5)   8.9027 × 10^(−4)   0.0187
4     2.4926 × 10^(−1)   9.1475             0.0272
5     1.0002             4.6283             0.2161
Table 3. Constants in error estimates for the code roots.
n     3     4     5     6     7     8     9     10    11
γ_n   1.73  1.78  1.53  1.95  1.62  1.60  1.84  1.72  2.03
n     12    13    14    15    16    17    18    19    20
γ_n   1.76  2.04  1.94  2.10  1.92  2.06  2.01  1.97  1.91
Table 4. Errors and estimates for the trapezoidal rule.
n             |T_n − 1|            r_n(f, a, b)
5             2.7522 × 10^(−2)     2.3724 × 10^(−4)
10            9.5472 × 10^(−4)     3.2918 × 10^(−5)
100           2.0633 × 10^(−5)     7.1142 × 10^(−5)
1 000         4.3650 × 10^(−7)     1.5051 × 10^(−4)
10 000        4.3842 × 10^(−9)     1.5117 × 10^(−4)
100 000       4.3844 × 10^(−11)    1.5117 × 10^(−4)
1 000 000     4.4420 × 10^(−13)    1.5316 × 10^(−4)
1 500 000     1.9340 × 10^(−13)    1.5004 × 10^(−4)
2 000 000     1.0059 × 10^(−13)    1.3873 × 10^(−4)
5 000 000     3.4084 × 10^(−14)    1.5189 × 10^(−4)
10 000 000    1.9762 × 10^(−14)    6.8139 × 10^(−4)
50 000 000    2.1316 × 10^(−14)    1.8375 × 10^(−2)
100 000 000   5.0848 × 10^(−14)    1.7533 × 10^(−1)
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.