xjb: Fast Float to String Algorithm

Junbo Xiang; Tiejun Wang

doi:10.20944/preprints202511.1698.v1

Submitted: 21 November 2025

Posted: 24 November 2025


Abstract

With the wide application of numerical computation and data exchange, efficiently and accurately converting floating-point numbers to decimal strings has become an important problem in computer science. Existing floating-point printing algorithms, such as Ryū, Dragonbox, and Schubfach, satisfy the Steele-White (SW) principle in terms of accuracy, but still leave room for performance optimization, especially with respect to branch mispredictions and the overhead of high-precision multiplication. This paper proposes a novel floating-point to string conversion algorithm named "xjb", an improvement on the Schubfach algorithm that aims to further increase conversion efficiency. The algorithm is designed for IEEE754 single-precision (binary32) and double-precision (binary64) floating-point numbers. By reducing instruction dependencies, decreasing the number of multiplication operations, and minimizing branch mispredictions, it significantly improves performance. In addition, the algorithm supports parallel computing, and the core implementation code is concise, with good portability and scalability. We conducted extensive benchmark tests on multiple platforms, including AMD-R7 7840H and Apple M1, using different compilers (gcc, clang, icpx). The results show that the xjb algorithm outperforms the existing mainstream algorithms in most cases.

Keywords: floating-point; printing; performance

Subject: Computer Science and Mathematics - Computer Science

1. Introduction

In 1990, Steele and White[1] published the paper "How to Print Floating-Point Numbers Accurately" and proposed the optimal principle of floating-point number printing algorithms (hereinafter referred to as the SW principle):

1. Information preservation: The print result can be parsed back to the original floating-point number.
2. Minimum length: The print result should be as short as possible.
3. Correct rounding: On the basis of satisfying 1 and 2, if there are two candidate values, they should be correctly rounded (i.e., the even value should be selected).
4. Generate from left to right: The print result is generated from the left.

Floating-point number printing algorithms that satisfy the SW principle convert floating-point numbers into real values with unique and definite results. Over the past few years, a variety of different algorithms have been proposed, such as Grisu3[2], Errol[3], Ryū[4][5], Schubfach[6], Grisu-Exact[7], Dragonbox[8], and yy_double[9].

The algorithm in this paper is based on the Schubfach algorithm, and is inspired by algorithms such as yy_double and Dragonbox. This article only covers two floating-point number types, IEEE754-binary32 and IEEE754-binary64. To simplify the content, float denotes IEEE754-binary32 and double denotes IEEE754-binary64 throughout. The Python code and the algorithm implementation code referred to in this article are available at https://github.com/xjb714/xjb.

2. IEEE754 Floating Point Number Representation

Since the print result of a negative floating-point number only has one more negative sign than the print result of its absolute value, this article only discusses positive floating-point numbers and does not include special values such as 0, NaN, and Inf.

The IEEE754 double-precision floating-point number consists of 64 bits, including 1 sign bit ($\mathit{sign}$), 11 exponent bits ($\mathit{exp}$), and 52 fraction bits ($\mathit{frac}$). The range of $\mathit{sign}$ is 0 or 1, the range of $\mathit{exp}$ is $[0, 2047]$, and the range of $\mathit{frac}$ is $[0, 2^{52} - 1]$.

The IEEE754 single-precision floating-point number consists of 32 bits, including 1 sign bit ($\mathit{sign}$), 8 exponent bits ($\mathit{exp}$), and 23 fraction bits ($\mathit{frac}$). The range of $\mathit{sign}$ is 0 or 1, the range of $\mathit{exp}$ is $[0, 255]$, and the range of $\mathit{frac}$ is $[0, 2^{23} - 1]$.

When $\mathit{frac} = 0$, it is an irregular floating-point number.

The real value of the positive floating-point number v can be expressed as the following expression:

\begin{matrix} \text{double} & : v = (\mathit{frac} + (\mathit{exp} \neq 0 \ ?\ 2^{52} : 0)) \cdot 2^{\max(\mathit{exp}, 1) - 1075} = c \cdot 2^{q} \\ \text{float} & : v = (\mathit{frac} + (\mathit{exp} \neq 0 \ ?\ 2^{23} : 0)) \cdot 2^{\max(\mathit{exp}, 1) - 150} = c \cdot 2^{q} \end{matrix}

(1)

There are two cases in total. When $\mathit{exp}$ equals 0 (referred to as a subnormal floating-point number), we have:

\begin{matrix} \text{double} & : v = \mathit{frac} \cdot 2^{-1074} \\ \text{float} & : v = \mathit{frac} \cdot 2^{-149} \end{matrix}

(2)

When $\mathit{exp}$ is not equal to 0 (referred to as a normal floating-point number), we have:

\begin{matrix} \text{double} & : v = (\mathit{frac} + 2^{52}) \cdot 2^{\mathit{exp} - 1075} \\ \text{float} & : v = (\mathit{frac} + 2^{23}) \cdot 2^{\mathit{exp} - 150} \end{matrix}

(3)
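As a minimal sketch of equations (1) to (3), the following Python snippet extracts $c$ and $q$ from the bit pattern of a positive finite value. The function names `decode_double`, `decode_float`, and `exact_value` are illustrative and are not taken from the xjb repository.

```python
import struct
from fractions import Fraction

def decode_double(v: float):
    """Return (c, q) with v == c * 2**q for a positive finite binary64 value."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)          # low 52 bits
    exp = (bits >> 52) & 0x7FF             # next 11 bits
    c = frac + ((1 << 52) if exp != 0 else 0)
    q = max(exp, 1) - 1075
    return c, q

def decode_float(v: float):
    """Same decomposition after rounding v to binary32."""
    bits = struct.unpack("<I", struct.pack("<f", v))[0]
    frac = bits & ((1 << 23) - 1)          # low 23 bits
    exp = (bits >> 23) & 0xFF              # next 8 bits
    c = frac + ((1 << 23) if exp != 0 else 0)
    q = max(exp, 1) - 150
    return c, q

def exact_value(c: int, q: int) -> Fraction:
    """The real value c * 2**q as an exact rational number."""
    return Fraction(c) * Fraction(2) ** q

c, q = decode_double(1.3)
print(c, q, exact_value(c, q) == Fraction(1.3))
```

Working with `Fraction` keeps every value exact, which makes it easy to compare against the formulas in this section.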

The rounding interval $R_{v}$ of a floating-point number v is the set of real numbers that are rounded to v when parsed. $R_{v}$ is:

\begin{matrix} v_{l} = \begin{cases} (c - \frac{1}{2}) \cdot 2^{q}, & \text{if } \mathit{frac} \neq 0 \text{ or } \mathit{exp} ⩽ 1 \\ (c - \frac{1}{4}) \cdot 2^{q}, & \text{if } \mathit{frac} = 0 \text{ and } \mathit{exp} > 1 \end{cases} \\ v_{r} = (c + \frac{1}{2}) \cdot 2^{q} \\ R_{v} = \begin{cases} [v_{l}, v_{r}], & \text{if } \mathit{frac} \bmod 2 = 0 \\ (v_{l}, v_{r}), & \text{if } \mathit{frac} \bmod 2 = 1 \end{cases} \end{matrix}

(4)

When the floating-point number is a regular floating-point number, $2^{q - 1}$ is the rounding radius (called the rounded radius in the rest of this article).
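The rounding interval of equation (4) can be illustrated with exact rational arithmetic. This is a sketch for binary64 only; `rounding_interval` is a hypothetical helper name, and the narrower lower half-width is applied when $\mathit{frac} = 0$ and $\mathit{exp} > 1$, which is how the two cases of equation (4) are read here.

```python
import struct
from fractions import Fraction

def rounding_interval(v: float):
    """Return (v_l, v_r, closed) for a positive finite binary64 value, per equation (4)."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)
    exp = (bits >> 52) & 0x7FF
    c = frac + ((1 << 52) if exp else 0)
    q = max(exp, 1) - 1075
    unit = Fraction(2) ** q
    # The lower neighbour is only half as far away when frac == 0 and exp > 1.
    narrow = frac == 0 and exp > 1
    v_l = (c - (Fraction(1, 4) if narrow else Fraction(1, 2))) * unit
    v_r = (c + Fraction(1, 2)) * unit
    closed = frac % 2 == 0               # endpoints belong to R_v only for even frac
    return v_l, v_r, closed

v_l, v_r, closed = rounding_interval(1.0)
# For an even frac both endpoints parse back to v (ties round to even).
print(float(v_l) == 1.0, float(v_r) == 1.0, closed)
```

Converting a `Fraction` to `float` in CPython is correctly rounded, so `float(v_l)` and `float(v_r)` show directly whether the endpoints parse back to the original value.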

3. Principle of Algorithm

Existing algorithms use a large number of branches, which can easily incur branch misprediction penalties, and they carry a high multiplication overhead. The algorithm in this paper minimizes the overhead of branch mispredictions and reduces the number of multiplication operations to improve performance. Moreover, the core code of the implementation is only about twenty lines long, and it supports parallel computing.

The process of printing floating-point numbers is usually divided into two parts: the first part converts the floating-point number to a decimal number, and the second part converts the decimal number to a string. This article only introduces the first part.

All double-precision floating-point numbers are classified into two types: irregular values and regular values. An irregular value is one where all the lower 52 bits are 0, meaning the $\mathit{frac}$ value is 0. There are a total of 2046 valid irregular values (i.e., $\mathit{exp}$ values range from 1 to 2046). Excluding the irregular values leaves the regular values. Similarly, there are a total of 254 irregular values among single-precision floating-point numbers. When $\mathit{exp}$ is 0, the value is called a subnormal floating-point number.

The valid range for c and q in regular floating-point numbers is:

\begin{matrix} \text{float} & : \begin{cases} 1 ⩽ c ⩽ 2^{24} - 1, c \neq 2^{23}; & q = -149 \\ 2^{23} + 1 ⩽ c ⩽ 2^{24} - 1; & -148 ⩽ q ⩽ 104 \end{cases} \\ \text{double} & : \begin{cases} 1 ⩽ c ⩽ 2^{53} - 1, c \neq 2^{52}; & q = -1074 \\ 2^{52} + 1 ⩽ c ⩽ 2^{53} - 1; & -1073 ⩽ q ⩽ 971 \end{cases} \end{matrix}

(5)

The valid range for c and q in irregular floating-point numbers is:

\begin{matrix} \text{float} & : c = 2^{23}; \ -149 ⩽ q ⩽ 104 \\ \text{double} & : c = 2^{52}; \ -1074 ⩽ q ⩽ 971 \end{matrix}

(6)

The valid range for c and q in subnormal floating-point numbers is:

\begin{matrix} \text{float} & : c ⩽ 2^{23} - 1; \ q = -149 \\ \text{double} & : c ⩽ 2^{52} - 1; \ q = -1074 \end{matrix}

(7)

Floating-point numbers that do not fall within the subnormal range are called normal floating-point numbers.

Regular floating-point numbers account for the vast majority of all possible floating-point values and are therefore the most important case to discuss. Unless otherwise specified, only regular floating-point numbers are discussed below. Suppose the optimal solution that satisfies the SW principle for the floating-point number v is $\mathit{opt}$, where d is a positive integer and k is an integer, expressed as:

\begin{matrix} v = c \cdot 2^{q} \to \mathit{opt} = d \cdot 10^{k} \\ \mathit{opt} \in R_{v}; \ d \in N^{+}; \ k \in Z \end{matrix}

(8)

For example, the IEEE754-binary64 floating-point number "1.3" has the real value 1.3000000000000000444089209850062616169452667236328125 and the hexadecimal representation 3ff4cccccccccccd, so the $\mathit{opt}$ value that satisfies the SW principle is 1.3. The IEEE754-binary32 floating-point number "1.3" has the real value 1.2999999523162841796875 and the hexadecimal representation 3FA66666, so its $\mathit{opt}$ value that satisfies the SW principle is also 1.3.
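The numbers in this example can be reproduced with the Python standard library. The snippet below is only an illustration; the helper names are arbitrary. Python's `repr` of a float is also a shortest round-tripping string, so it gives the same $\mathit{opt}$ for the double case.

```python
import struct
from decimal import Decimal

def double_hex(v: float) -> str:
    """Big-endian hexadecimal bit pattern of v as binary64."""
    return struct.pack(">d", v).hex()

def float_hex(v: float) -> str:
    """Big-endian hexadecimal bit pattern of v rounded to binary32."""
    return struct.pack(">f", v).hex()

def float32_exact(v: float) -> Decimal:
    """Exact decimal expansion of v after rounding to binary32."""
    (as32,) = struct.unpack(">f", struct.pack(">f", v))
    return Decimal(as32)

print(double_hex(1.3), Decimal(1.3))      # exact value of the binary64 number
print(float_hex(1.3), float32_exact(1.3)) # exact value of the binary32 number
print(repr(1.3))                          # shortest round-tripping string
```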

3.1. Review the Schubfach algorithm and the derivation of the algorithm in this paper

According to the Schubfach[6] algorithm, the possible values of d can be one of the following four situations:

\begin{matrix} 10 \cdot ⌊ v \cdot 10^{- k - 1} ⌋, ⌊ 10 \cdot (v \cdot 10^{- k - 1}) ⌋, ⌊ 10 \cdot (v \cdot 10^{- k - 1}) ⌋ + 1, 10 \cdot ⌊ v \cdot 10^{- k - 1} ⌋ + 10 \end{matrix}

(9)

The value of k in equation (9) is calculated as follows, where $\lg$ denotes the base-10 logarithm:

\begin{matrix} k = \begin{cases} ⌊ q \cdot \lg(2) ⌋, & \text{if } v \text{ is regular} \\ ⌊ q \cdot \lg(2) - \lg(\frac{4}{3}) ⌋, & \text{otherwise} \end{cases} \end{matrix}

(10)

Within the range of float and double, equation (10) is equivalent to:

\begin{matrix} k = (q \cdot 315653 - (v \text{ is regular} \ ?\ 0 : 131237)) ≫ 20 \end{matrix}

(11)
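The regular case of equation (11) can be checked exhaustively over the double range with integer arithmetic alone. The sketch below uses two illustrative helper names: `k_fast` evaluates equation (11), and `floor_log10_pow2_exact` computes $⌊ q \cdot \lg(2) ⌋$ by counting decimal digits. The irregular branch is shown for completeness and is only spot-checked here.

```python
def k_fast(q: int, regular: bool = True) -> int:
    """Equation (11): multiply-and-shift evaluation of k."""
    return (q * 315653 - (0 if regular else 131237)) >> 20

def floor_log10_pow2_exact(q: int) -> int:
    """Exact floor(q * log10(2)), i.e. the k with 10**k <= 2**q < 10**(k + 1)."""
    if q >= 0:
        return len(str(2 ** q)) - 1
    # For q < 0, 2**-q is never a power of ten, so its digit count is the answer.
    return -len(str(2 ** -q))

mismatches = [q for q in range(-1074, 972)
              if k_fast(q) != floor_log10_pow2_exact(q)]
print(len(mismatches))
```

Python's `>>` on negative integers rounds toward negative infinity, which matches the floor in equation (10). In C or C++ the same expression relies on an arithmetic right shift of a signed integer.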

Suppose the integer part of $v \cdot 10^{-k-1}$ is m and the decimal part is n, then we have:

\begin{matrix} ⌊ v \cdot 10^{- k - 1} ⌋ = m \\ v \cdot 10^{- k - 1} = m + n \\ 0 ⩽ n = v \cdot 10^{- k - 1} - ⌊ v \cdot 10^{- k - 1} ⌋ < 1 \end{matrix}

(12)

Then the decimal part of $v \cdot 10^{-k}$ is expressed as:

\begin{matrix} v \cdot 10^{- k} - ⌊ v \cdot 10^{- k} ⌋ = 10 m + 10 n - ⌊ 10 m + 10 n ⌋ = 10 n - ⌊ 10 n ⌋ \end{matrix}

(13)

The possible values of d obtained from equation (9) are:

\begin{matrix} 10 m, ⌊ 10 (m + n) ⌋, ⌊ 10 (m + n) ⌋ + 1, 10 m + 10 \end{matrix}

(14)

The possible values of d in equation (14) can be simplified to:

\begin{matrix} 10 m, 10 m + ⌊ 10 n ⌋, 10 m + ⌊ 10 n ⌋ + 1, 10 m + 10 \end{matrix}

(15)

Among them, $10 m$ represents the minimum possible value and $10 m + 10$ represents the maximum possible value. Let $\mathit{ten}$ denote $10 m$. There are four possible values for $\mathit{one}$, with $d = \mathit{ten} + \mathit{one}$, denoted as:

\begin{matrix} \mathit{ten} = 10 m \\ \mathit{one} \in \{0, ⌊ 10 n ⌋, ⌊ 10 n ⌋ + 1, 10\} \\ d = \mathit{ten} + \mathit{one} \end{matrix}

(16)

Calculating d is thus reduced to calculating $\mathit{ten}$ and $\mathit{one}$.
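To make the four candidates concrete, the following sketch evaluates m, n, and the candidate list of equation (15) exactly for a regular double. The function name `candidates` is illustrative, and k is taken from the regular case of equation (11).

```python
import struct
from fractions import Fraction
from math import floor

def candidates(v: float):
    """Return (k, m, n, [four candidates for d]) for a regular binary64 value."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)
    exp = (bits >> 52) & 0x7FF
    c = frac + ((1 << 52) if exp else 0)
    q = max(exp, 1) - 1075
    k = (q * 315653) >> 20                       # regular case of equation (11)
    x = Fraction(c) * Fraction(2) ** q * Fraction(10) ** (-k - 1)
    m = floor(x)                                 # integer part of v * 10^(-k-1)
    n = x - m                                    # decimal part, 0 <= n < 1
    ten = 10 * m
    ones = [0, floor(10 * n), floor(10 * n) + 1, 10]
    return k, m, n, [ten + one for one in ones]

k, m, n, ds = candidates(1.3)
print(k, m, float(n), ds)
```

For 1.3 the first candidate, $10 m = 13000000000000000$ with $k = -16$, already represents 1.3, which is the value selected by the conditions derived next.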

The final possible values of d are as follows:

$10 m$

When the following condition is met, the result is $10 m$ (equivalently, $\mathit{one} = 0$): the floating-point number v minus the minimum candidate $10 m \cdot 10^{k}$ is less than the rounded radius $2^{q - 1}$.

$\begin{matrix} c \cdot 2^{q} - 10 m \cdot 10^{k} & < 2^{q - 1} \\ c \cdot 2^{q} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \cdot 10^{k + 1} & < 2^{q - 1} \\ c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ n & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} & > n \end{matrix}$

(17)

When $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n$, the candidate lies exactly on the boundary of the rounding interval, so $c \bmod 2 = 0$ must also be satisfied. Therefore, the following condition is valid:

$\begin{matrix} \text{if } 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n \text{ or } (2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n \land c \bmod 2 = 0) : \mathit{one} = 0 \end{matrix}$

(18)

$10 m + 10$

When the following condition is met, the result is $10 m + 10$ (equivalently, $\mathit{one} = 10$): the maximum candidate $(10 m + 10) \cdot 10^{k}$ minus the floating-point number v is less than the rounded radius $2^{q - 1}$.

$\begin{matrix} (10 m + 10) \cdot 10^{k} - c \cdot 2^{q} & < 2^{q - 1} \\ ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \cdot 10^{k + 1} + 10^{k + 1} - c \cdot 2^{q} & < 2^{q - 1} \\ ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ - c \cdot 2^{q} \cdot 10^{- k - 1} + 1 & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ 1 - n & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} & > 1 - n \end{matrix}$

(19)

When $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n$, the candidate again lies on the boundary of the rounding interval, so $c \bmod 2 = 0$ must also be satisfied. Therefore, the following condition is valid:

$\begin{matrix} \text{if } 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n \text{ or } (2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n \land c \bmod 2 = 0) : \mathit{one} = 10 \end{matrix}$

(20)

$10 m + ⌊ 10 n ⌋$ or $10 m + ⌊ 10 n ⌋ + 1$

When neither the condition for $d = 10 m$ nor the condition for $d = 10 m + 10$ is met, d is either $10 m + ⌊ 10 n ⌋$ or $10 m + ⌊ 10 n ⌋ + 1$. The final value is determined by the decimal part of $10 n$. If the decimal part is 0.5, it is rounded to the nearest even value; otherwise, it is rounded to the nearest value. For irregular floating-point numbers, it is also necessary to check whether $10 m + ⌊ 10 n ⌋$ is within the rounding interval $R_{v}$. If it is not, then d is $10 m + ⌊ 10 n ⌋ + 1$.

In summary, the steps of the Schubfach variant proposed in this paper (xjb32 for float, xjb64 for double) are as follows:

\begin{array}{ll}
 & \text{input}: c, q \\
 & \text{output}: d, k \\
 & \text{convert } c \cdot 2^{q} \text{ to } d \cdot 10^{k} \\
(1) & v = c \cdot 2^{q} \\
(2) & k = ⌊ q \cdot \lg(2) ⌋ \text{ if } v \text{ is regular else } ⌊ q \cdot \lg(2) - \lg(\frac{4}{3}) ⌋ \\
(3) & m = ⌊ v \cdot 10^{- k - 1} ⌋, \ n = v \cdot 10^{- k - 1} - m \\
(4) & \mathit{ten} = 10 m \\
(5) & \text{if } 10 n - ⌊ 10 n ⌋ = 0.5 : \mathit{one} = ⌊ 10 n ⌋ \text{ if } (⌊ 10 n ⌋ \bmod 2 = 0) \text{ else } ⌊ 10 n ⌋ + 1 \\
(6) & \text{if } 10 n - ⌊ 10 n ⌋ < 0.5 : \mathit{one} = ⌊ 10 n ⌋ \\
(7) & \text{if } 10 n - ⌊ 10 n ⌋ > 0.5 : \mathit{one} = ⌊ 10 n ⌋ + 1 \\
(8) & \text{if } v \text{ is irregular} : \\
(9) & \quad \text{if } 10 n - ⌊ 10 n ⌋ > 2^{q - 2} \cdot 10^{- k} : \mathit{one} = ⌊ 10 n ⌋ + 1 \\
(10) & \quad \text{if } 2^{q - 2} \cdot 10^{- k - 1} ⩾ n : \mathit{one} = 0 \\
(11) & \text{else} : \\
(12) & \quad \text{if } 2^{q - 1} \cdot 10^{- k - 1} > n \text{ or } (2^{q - 1} \cdot 10^{- k - 1} = n \land c \bmod 2 = 0) : \mathit{one} = 0 \\
(13) & \text{endif} \\
(14) & \text{if } 2^{q - 1} \cdot 10^{- k - 1} > 1 - n \text{ or } (2^{q - 1} \cdot 10^{- k - 1} = 1 - n \land c \bmod 2 = 0) : \mathit{one} = 10 \\
(15) & d = \mathit{ten} + \mathit{one}
\end{array}

(21)
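Algorithm process (21) can be transcribed almost line by line into Python using exact rational arithmetic. The sketch below is a slow reference model for binary64 only, not the optimized implementation from the xjb repository. The name `xjb_reference` is hypothetical, k is computed with equation (11), and a value is treated as irregular when $\mathit{frac} = 0$ and $\mathit{exp} > 1$, following equation (4).

```python
import struct
from decimal import Decimal
from fractions import Fraction
from math import floor

def xjb_reference(v: float):
    """Return (d, k) with d * 10**k the shortest decimal for a positive finite double."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)
    exp = (bits >> 52) & 0x7FF
    c = frac + ((1 << 52) if exp else 0)
    q = max(exp, 1) - 1075
    irregular = frac == 0 and exp > 1
    k = (q * 315653 - (131237 if irregular else 0)) >> 20     # line 2 via (11)
    scale = Fraction(10) ** (-k - 1)
    x = Fraction(c) * Fraction(2) ** q * scale                # v * 10^(-k-1)
    m = floor(x)                                              # line 3
    n = x - m
    ten = 10 * m                                              # line 4
    low = floor(10 * n)
    rem = 10 * n - low                                        # decimal part of 10n
    if rem == Fraction(1, 2):                                 # line 5
        one = low if low % 2 == 0 else low + 1
    elif rem < Fraction(1, 2):                                # line 6
        one = low
    else:                                                     # line 7
        one = low + 1
    r = Fraction(2) ** (q - 1) * scale                        # 2^(q-1) * 10^(-k-1)
    if irregular:                                             # lines 8-10
        if rem > Fraction(2) ** (q - 2) * Fraction(10) ** (-k):
            one = low + 1
        if r / 2 >= n:
            one = 0
    elif r > n or (r == n and c % 2 == 0):                    # lines 11-12
        one = 0
    if r > 1 - n or (r == 1 - n and c % 2 == 0):              # line 14
        one = 10
    return ten + one, k                                       # line 15

for v in (1.3, 0.1, 1e23, 5e-324):
    d, k = xjb_reference(v)
    print(v, d, k, Fraction(d) * Fraction(10) ** k == Fraction(Decimal(repr(v))))
```

The returned d may carry trailing zeros (for example $d = 10^{16}$, $k = 7$ for 1e23). Removing them belongs to the string conversion step, which this article does not cover.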

Algorithm process (21) applies to both float and double floating-point numbers. Given a floating-point number v as input, c and q are extracted, and the results d (line 15) and k (line 2) are returned. The real value represented by the returned results is $d \cdot 10^{k}$, which conforms to the SW principle. The calculation of k is relatively simple and is given by (11). Therefore, the following focuses only on the fast calculation of d.

The following will be divided into five parts to introduce the algorithm process (21) :

Introduce the pre-computation process of the algorithm’s lookup table.
Quickly calculate m.
Quickly determine whether $o n e = 0$ or $o n e = 10$ .
Quickly calculate $⌊ 10 n ⌋$ and determine whether $o n e = ⌊ 10 n ⌋$ or $o n e = ⌊ 10 n ⌋ + 1$ based on the decimal part of $10 n$ .
Processing of irregular floating-point numbers.

3.2. Pre-computation of Lookup Table

The algorithm in this paper uses a lookup table to store the values of $10^{-k-1}$ for q in the range $[-149, 104]$ for float and $[-1074, 971]$ for double. Float uses a lookup table with 64-bit precision and double uses one with 128-bit precision. The code implementation for this section is gen.py. Let B be the bit length of a single entry in the lookup table: $B = 64$ for float and $B = 128$ for double. Suppose $e_{10}$ is an integer, $e_{2}$ is a real number, and f is a real number with $1 ⩽ f < 2$, such that:

\begin{matrix} f \cdot 2^{⌊ e_{2} ⌋} = 2^{e_{2}} = 10^{e_{10}} \end{matrix}

(22)

Since $e_{2} = e_{10} \cdot \log_{2}(10)$, we have:

\begin{matrix} ⌊ e_{2} ⌋ = ⌊ e_{10} \cdot \log_{2}(10) ⌋ \end{matrix}

(23)

Solving for f gives:

\begin{matrix} f = \frac{10^{e_{10}}}{2^{⌊ e_{10} \cdot \log_{2}(10) ⌋}} \end{matrix}

(24)

Each lookup table entry is calculated as follows (rounding up):

\begin{matrix} \mathit{lookup}[e_{10}] = ⌈ f \cdot 2^{B - 1} ⌉ = ⌈ \frac{10^{e_{10}}}{2^{⌊ e_{10} \cdot \log_{2}(10) ⌋}} \cdot 2^{B - 1} ⌉ = ⌈ 10^{e_{10}} \cdot 2^{B - 1 - ⌊ e_{10} \cdot \log_{2}(10) ⌋} ⌉ \end{matrix}

(25)

For float, when $0 ⩽ e_{10} ⩽ 27$, $f \cdot 2^{B - 1}$ in equation (25) is an integer. For double, when $0 ⩽ e_{10} ⩽ 55$, $f \cdot 2^{B - 1}$ in equation (25) is an integer. The detailed calculation process is as follows:

Float

The range of $- k - 1$ is calculated to be $[-32, 44]$ from the range of q in equation (5), so the lookup table contains representations of the powers $10^{-32}$ through $10^{44}$. The calculation process is as follows, where the quotients are rounded down (integer division):

$\begin{matrix} - 32 ⩽ e_{10} ⩽ 44 \\ e_{2} = |⌊ e_{10} \cdot \log_{2}(10) ⌋ - 63| \\ \mathit{pow10t} = \begin{cases} ⌊ 2^{e_{2}} / 10^{|e_{10}|} ⌋; & \text{if } e_{10} < 0 \\ ⌊ 10^{|e_{10}|} / 2^{e_{2}} ⌋; & \text{if } e_{10} ⩾ 20 \\ 10^{|e_{10}|} \cdot 2^{e_{2}}; & \text{if } 0 ⩽ e_{10} ⩽ 19 \end{cases} \\ f_{1, e_{10}} = \mathit{pow10} = \mathit{pow10t} + (0 ⩽ e_{10} ⩽ 27 \ ?\ 0 : 1) \end{matrix}$

(26)

When $0 ⩽ e_{10} ⩽ 27$, the lookup table value is exact, that is, $f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot \log_{2}(10) ⌋ - 63}$ and $10^{e_{10}}$ are equal. In other cases, the relative error is less than $2^{- 63}$. Expressed as:

$\begin{matrix} r_{1, e_{10}} = \frac{f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot \log_{2}(10) ⌋ - 63}}{10^{e_{10}}} \in \begin{cases} \{1\}; & \text{if } 0 ⩽ e_{10} ⩽ 27 \\ (1, 1 + 2^{- 63}); & \text{if } e_{10} < 0 \text{ or } e_{10} > 27 \end{cases} \end{matrix}$

(27)
Double

The range of $- k - 1$ is calculated to be $[-293, 323]$ from the range of q in equation (5), so the lookup table contains representations of the powers $10^{-293}$ through $10^{323}$. The calculation process is as follows, where the quotients are rounded down (integer division):

$\begin{matrix} - 293 ⩽ e_{10} ⩽ 323 \\ e_{2} = |⌊ e_{10} \cdot \log_{2}(10) ⌋ - 127| \\ \mathit{pow10t} = \begin{cases} ⌊ 2^{e_{2}} / 10^{|e_{10}|} ⌋; & \text{if } e_{10} < 0 \\ ⌊ 10^{|e_{10}|} / 2^{e_{2}} ⌋; & \text{if } e_{10} ⩾ 39 \\ 10^{|e_{10}|} \cdot 2^{e_{2}}; & \text{if } 0 ⩽ e_{10} ⩽ 38 \end{cases} \\ f_{2, e_{10}} = \mathit{pow10} = \mathit{pow10t} + (0 ⩽ e_{10} ⩽ 55 \ ?\ 0 : 1) \end{matrix}$

(28)

When $0 ⩽ e_{10} ⩽ 55$, the lookup table value is exact, that is, $f_{2, e_{10}} \cdot 2^{⌊ e_{10} \cdot \log_{2}(10) ⌋ - 127}$ and $10^{e_{10}}$ are equal. In other cases, the relative error is less than $2^{- 127}$. Expressed as:

$\begin{matrix} r_{2, e_{10}} = \frac{f_{2, e_{10}} \cdot 2^{⌊ e_{10} \cdot \log_{2}(10) ⌋ - 127}}{10^{e_{10}}} \in \begin{cases} \{1\}; & \text{if } 0 ⩽ e_{10} ⩽ 55 \\ (1, 1 + 2^{- 127}); & \text{if } e_{10} < 0 \text{ or } e_{10} > 55 \end{cases} \end{matrix}$

(29)

In the following, $r_{1}$ denotes the error of the lookup table values within the float range, $r_{2}$ denotes the error of the lookup table values within the double range, and r denotes the error within either range. In algorithm process (21), an approximate value of $10^{- k - 1}$ needs to be obtained from the lookup table. From equation (27) and equation (29), the lookup table value is error-free when q is within the following range:

\begin{matrix} \text{float} & : 0 ⩽ - k - 1 ⩽ 27 \Rightarrow - 93 ⩽ q ⩽ - 1 \\ \text{double} & : 0 ⩽ - k - 1 ⩽ 55 \Rightarrow - 186 ⩽ q ⩽ - 1 \end{matrix}

(30)

When q is not within the range of equation (30), the error of the lookup table value satisfies:

\begin{matrix} \text{float} & : 0 < r_{1} - 1 < 2^{- 63} \\ \text{double} & : 0 < r_{2} - 1 < 2^{- 127} \end{matrix}

(31)

The introduction of the lookup table calculation process is complete. The storage space required for a float range lookup table is 616 bytes, and that for a double range lookup table is 9872 bytes.
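The table construction can be sketched in a few lines of Python. This is not the gen.py script from the repository; `floor_log2_pow10`, `lookup_entry`, and `build_table` are illustrative names, and the exponent is taken as $⌊ e_{10} \cdot \log_{2}(10) ⌋$, the form used in equations (26) and (28).

```python
from fractions import Fraction
from math import ceil

def floor_log2_pow10(e10: int) -> int:
    """Exact floor(e10 * log2(10)), computed from integer bit lengths."""
    if e10 >= 0:
        return (10 ** e10).bit_length() - 1
    # For e10 < 0, 10**-e10 is not a power of two, so its bit length is the ceiling.
    return -(10 ** -e10).bit_length()

def lookup_entry(e10: int, B: int) -> int:
    """ceil(10**e10 * 2**(B - 1 - floor(e10 * log2(10)))): one B-bit table entry."""
    shift = B - 1 - floor_log2_pow10(e10)
    return ceil(Fraction(10) ** e10 * Fraction(2) ** shift)

def build_table(lo: int, hi: int, B: int) -> dict:
    return {e10: lookup_entry(e10, B) for e10 in range(lo, hi + 1)}

float_table = build_table(-32, 44, 64)       # 77 entries of 8 bytes
double_table = build_table(-293, 323, 128)   # 617 entries of 16 bytes
print(len(float_table) * 8, len(double_table) * 16)
```

The two printed sizes correspond to the storage figures quoted above, and every entry has its top bit set, so it fits exactly in B bits.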

3.3. Quickly Calculate m

Relevant theorems (partially from the Dragonbox[8] algorithm paper): Suppose there are positive integers n, P, and Q, where P and Q are coprime, $P < Q$, $1 ⩽ n ⩽ n_{max}$, and $Q > n_{max}$. Let $P^{*} / Q^{*}$ be the best rational approximation greater than or equal to $P / Q$ and $P_{*} / Q_{*}$ be the best rational approximation less than or equal to $P / Q$, satisfying $Q^{*} ⩽ n_{max}$ and $Q_{*} ⩽ n_{max}$. If $n \cdot P$ is not divisible by Q, then:

\begin{matrix} ⌊ n \cdot \frac{P}{Q} ⌋ + 1 = ⌈ n \cdot \frac{P}{Q} ⌉ \end{matrix}

(32)

Suppose the following holds true:

\begin{matrix} ⌊ n \cdot \frac{P}{Q} ⌋ = ⌊ n \cdot ξ ⌋ \end{matrix}

(33)

Then there are:

\begin{matrix} \frac{P_{*}}{Q_{*}} = \max_{1 ⩽ n ⩽ n_{max}} \frac{⌊ n \cdot \frac{P}{Q} ⌋}{n} ⩽ ξ < \min_{1 ⩽ n ⩽ n_{max}} \frac{⌊ n \cdot \frac{P}{Q} ⌋ + 1}{n} = \min_{1 ⩽ n ⩽ n_{max}} \frac{⌈ n \cdot \frac{P}{Q} ⌉}{n} = \frac{P^{*}}{Q^{*}} \end{matrix}

(34)

Therefore, the range of values for $ξ$ is:

\begin{matrix} \frac{P_{*}}{Q_{*}} ⩽ ξ < \frac{P^{*}}{Q^{*}} \end{matrix}

(35)

And the range of the decimal part of $n \cdot \frac{P}{Q}$ is:

\begin{matrix} [\frac{(Q_{*} P) \bmod Q}{Q}, \frac{(Q^{*} P) \bmod Q}{Q}] \end{matrix}

(36)

That is, when $n = Q_{*}$, the decimal part is the smallest; when $n = Q^{*}$, the decimal part is the largest.

The best rational approximation function is defined as follows (this function is implemented on line 15 of the test1.py file):

\begin{matrix} (\mathit{DN}, \mathit{UP}) = f (C, P, Q) \end{matrix}

(37)

Function (37) calculates the best rational approximations of $P / Q$ with denominators not exceeding C, based on the mediant property of the Farey sequence. $\mathit{DN}$ and $\mathit{UP}$ are two adjacent terms in the Farey sequence $F_{C}$ of order C.
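One simple way to obtain such a pair of adjacent Farey terms is a Stern-Brocot walk that repeatedly takes mediants. The sketch below is not the implementation from test1.py; `farey_neighbors` is a hypothetical name, and the loop takes one mediant per step, which is adequate for small C but would need continued-fraction jumps for $C = 2^{53} - 1$.

```python
from fractions import Fraction

def farey_neighbors(C: int, P: int, Q: int):
    """Return (DN, UP), adjacent terms of F_C with DN < P/Q < UP.

    Assumes 0 < P < Q, gcd(P, Q) == 1 and Q > C, so P/Q itself is not in F_C.
    """
    lo_n, lo_d = 0, 1                      # 0/1 <= P/Q
    hi_n, hi_d = 1, 1                      # 1/1 >= P/Q
    while lo_d + hi_d <= C:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d     # mediant of the two bounds
        if med_n * Q < P * med_d:                   # mediant < P/Q
            lo_n, lo_d = med_n, med_d
        else:                                       # mediant > P/Q
            hi_n, hi_d = med_n, med_d
    return Fraction(lo_n, lo_d), Fraction(hi_n, hi_d)

DN, UP = farey_neighbors(7, 3, 10)
print(DN, UP)
# Equation (35): any xi in [DN, UP) gives the same floors for n = 1..7.
xi = Fraction(4, 13)
print(all((n * 3) // 10 == (n * xi.numerator) // xi.denominator for n in range(1, 8)))
```

With $P / Q = 3 / 10$ and $C = 7$, the neighbours are $2/7$ and $1/3$. Choosing $ξ = 1/3$ exactly would already break the floor equality at $n = 3$, which is why the upper bound in (35) is strict.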

In algorithm process (21), m is calculated as $⌊ v \cdot 10^{- k - 1} ⌋$ (line 3). It suffices to prove that the following equation holds:

\begin{matrix} m = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} ⌋ \end{matrix}

(38)

Here r is the error of the lookup table value, as defined in equation (27) and equation (29). When condition (30) is met, r is 1, and equation (38) clearly holds. When r is not 1, we have:

\begin{matrix} \text{float} & : 1 < r < 1 + 2^{- 63} \\ \text{double} & : 1 < r < 1 + 2^{- 127} \end{matrix}

(39)

The range of $2^{q} \cdot 10^{- k - 1}$ follows from:

\begin{matrix} 2^{q} \cdot 10^{- k - 1} = 10^{- 1} \cdot (10^{q \cdot \lg (2) - ⌊ q \cdot \lg (2) ⌋}) \end{matrix}

(40)

When q is not 0, $q \cdot \lg (2)$ is not an integer, so in equation (40):

\begin{matrix} q \cdot \lg (2) \neq ⌊ q \cdot \lg (2) ⌋ \\ 0 < q \cdot \lg (2) - ⌊ q \cdot \lg (2) ⌋ < 1 \end{matrix}

(41)

When q is 0, $q \cdot \lg (2) - ⌊ q \cdot \lg (2) ⌋ = 0$, so the final conclusion is:

\begin{matrix} 10^{- 1} ⩽ 2^{q} \cdot 10^{- k - 1} < 1 \end{matrix}

(42)

Because there is:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = c \cdot \frac{2^{q - k - 1}}{5^{k + 1}} \in [0.1 c, c) \end{matrix}

(43)

Therefore:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = \{\begin{matrix} \frac{c \cdot 2^{q - k - 1}}{5^{k + 1}}; q ⩾ 1 \\ \frac{c}{2^{1 + k - q} \cdot 5^{k + 1}} = \frac{c}{10}; q = 0 \\ \frac{c \cdot 5^{- k - 1}}{2^{1 + k - q}}; q < 0 \end{matrix} \end{matrix}

(44)

Suppose:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = c \cdot \frac{x}{y} < c \end{matrix}

(45)

Then there are:

\begin{matrix} (x, y) = \{\begin{matrix} (2^{q - k - 1}, 5^{k + 1}); q ⩾ 1 \\ (1, 10); q = 0 \\ (5^{- k - 1}, 2^{1 + k - q}); q < 0 \end{matrix} \end{matrix}

(46)

Suppose:

\begin{matrix} \text{float} & : c ⩽ c_{max} = C_{1} = 2^{24} - 1 \\ \text{double} & : c ⩽ c_{max} = C_{2} = 2^{53} - 1 \end{matrix}

(47)

In the following, C denotes $C_{1}$ within the float range and $C_{2}$ within the double range.

When $y > C$, calculate the $P^{*}$ and $Q^{*}$ corresponding to each q by calling $f (C, x, y)$ according to function (37), and calculate the minimum $\mathit{BIT}$ value that satisfies the following condition:

\begin{matrix} \frac{x}{y} (1 + 2^{- \mathit{BIT}}) < \frac{P^{*}}{Q^{*}} \end{matrix}

(48)

When $y ⩽ C$, we have:

\begin{matrix} c \cdot \frac{x}{y} (1 + \frac{1}{C y}) = \frac{c x + \frac{c}{C} \cdot \frac{x}{y}}{y} < \frac{c x + 1}{y} \end{matrix}

(49)

Therefore:

\begin{matrix} ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} (1 + \frac{1}{C y}) ⌋ \end{matrix}

(50)

Similarly, calculate the minimum $\mathit{BIT}$ value that satisfies:

\begin{matrix} \frac{x}{y} (1 + 2^{- \mathit{BIT}}) < \frac{x}{y} (1 + \frac{1}{C y}) \end{matrix}

(51)

In summary, the maximum over all q of the minimum $\mathit{BIT}$ values is as follows (the result is produced by the test1.py file, which runs in only about 1 to 2 seconds):

\begin{matrix} \text{float} & : \mathit{BIT}_{max} = 52 \\ \text{double} & : \mathit{BIT}_{max} = 113 \end{matrix}

(52)

Therefore, the following conclusions hold:

\begin{matrix} \text{float} & : ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} \cdot (1 + 2^{- 52}) ⌋ = ⌊ c \cdot \frac{x}{y} \cdot r_{1} ⌋ \\ \text{double} & : ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} \cdot (1 + 2^{- 113}) ⌋ = ⌊ c \cdot \frac{x}{y} \cdot r_{2} ⌋ \end{matrix}

(53)

The conclusions of this section have been verified. Once m has been calculated, the value of $\mathit{ten} = 10 m$ is obtained with a single multiplication.
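Equation (38) can be spot-checked numerically for binary64: compute m once with exact rational arithmetic and once with a rounded-up 128-bit table entry, then compare. This is a sketch under the table definition of equation (25) with exponent $⌊ e_{10} \cdot \log_{2}(10) ⌋$. The names `lookup128`, `m_exact`, and `m_table` are illustrative, and the real implementation would use a precomputed table and a 64x128-bit multiply instead of Python big integers.

```python
import random
from fractions import Fraction
from math import ceil, floor

def floor_log2_pow10(e10: int) -> int:
    """Exact floor(e10 * log2(10))."""
    if e10 >= 0:
        return (10 ** e10).bit_length() - 1
    return -(10 ** -e10).bit_length()

def lookup128(e10: int) -> int:
    """128-bit approximation of 10**e10, rounded up."""
    return ceil(Fraction(10) ** e10 * Fraction(2) ** (127 - floor_log2_pow10(e10)))

def m_exact(c: int, q: int) -> int:
    k = (q * 315653) >> 20                      # regular case of equation (11)
    return floor(Fraction(c) * Fraction(2) ** q * Fraction(10) ** (-k - 1))

def m_table(c: int, q: int) -> int:
    k = (q * 315653) >> 20
    e10 = -k - 1
    # c * 2**q * lookup * 2**(floor_log2 - 127), floored via one right shift.
    shift = 127 - floor_log2_pow10(e10) - q
    return (c * lookup128(e10)) >> shift

rng = random.Random(1)
samples = [(rng.randrange(2 ** 52 + 1, 2 ** 53), q) for q in range(-1074, 972)]
print(all(m_exact(c, q) == m_table(c, q) for c, q in samples))
```

The shift amount is always at least 128 here, because $2^{q} \cdot 10^{-k-1}$ lies in $[0.1, 1)$, so the result is simply the high part of the product.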

3.4. Quickly Determine Whether $o n e = 0$ or $o n e = 10$

In algorithm process (21), the conditions for determining $\mathit{one} = 0$ and $\mathit{one} = 10$ are on lines 12 and 14. This section introduces how to quickly determine whether $\mathit{one} = 0$ or $\mathit{one} = 10$ holds by using equivalent conditions.

The case $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n$ (line 12, $\mathit{one}$ might be 0) is equivalent to:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & = 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} & = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \end{matrix}

(54)

The case $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n$ (line 14, $\mathit{one}$ might be 10) is equivalent to:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ - c \cdot 2^{q} \cdot 10^{- k - 1} + 1 & = 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} & = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(55)

From equation (42), we have:

\begin{matrix} 2^{q - 1} \cdot 10^{- k - 1} \in [0.05, 0.5) \end{matrix}

(56)

Therefore, there is:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ - 1 & < c \cdot 2^{q} \cdot 10^{- k - 1} - 0.5 \\ < (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} \\ ⩽ c \cdot 2^{q} \cdot 10^{- k - 1} - 0.05 < ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(57)

Therefore, for equation (54), when $(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}$ is an integer, it must be equal to $⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋$. Similarly, for equation (55), we have:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & < c \cdot 2^{q} \cdot 10^{- k - 1} + 0.05 \\ ⩽ (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} \\ < c \cdot 2^{q} \cdot 10^{- k - 1} + 0.5 < ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 2 \end{matrix}

(58)

Therefore, for equation (55), when $(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}$ is an integer, it must be equal to $⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1$.

In conclusion, both cases reduce to determining whether $(2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}$ is an integer. We have:

\begin{matrix} (2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c \pm 1) \cdot 2^{q - k - 2} \cdot 5^{- k - 1} \end{matrix}

(59)

According to the range of q, there are:

\begin{matrix} \{\begin{matrix} q - k - 2 ⩾ 0, - k - 1 < 0; q ⩾ 2 \\ q - k - 2 < 0, - k - 1 < 0; 1 ⩾ q ⩾ 0 \\ q - k - 2 < 0, - k - 1 ⩾ 0; q < 0 \end{matrix} \end{matrix}

(60)

Therefore, equation (59) is equivalent to:

\begin{matrix} (2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = \{\begin{matrix} \frac{(2 c \pm 1) \cdot 2^{q - k - 2}}{5^{k + 1}}; q ⩾ 2 \\ \frac{(2 c \pm 1)}{2^{2 + k - q} \cdot 5^{k + 1}}; 1 ⩾ q ⩾ 0 \\ \frac{(2 c \pm 1) \cdot 5^{- k - 1}}{2^{2 + k - q}}; q < 0 \end{matrix} \end{matrix}

(61)

According to the different ranges of q, the following situations are discussed:

$q ⩾ 2$

From $q ⩾ 2$, we get $k ⩾ 0$. When $q ⩾ 2$, the question is whether $(2 c \pm 1) \cdot 2^{q - k - 2}$ is divisible by $5^{k + 1}$. Since 2 and 5 are coprime, this is equivalent to whether $(2 c \pm 1)$ is divisible by $5^{k + 1}$:

$\begin{matrix} (2 c \pm 1) \bmod 5^{k + 1} = 0 \end{matrix}$

(62)

Suppose t is a positive integer:

$\begin{matrix} 2 c \pm 1 = t \cdot 5^{k + 1}; t ⩾ 1 \end{matrix}$

(63)

Since $2 c \pm 1$ is odd, t is also odd. Because the following conditions exist:

$\begin{matrix} f l o a t & : 2 c - 1 \in [2^{24} + 1, 2^{25} - 3]; 2 c + 1 \in [2^{24} + 3, 2^{25} - 1]; \\ d o u b l e & : 2 c - 1 \in [2^{53} + 1, 2^{54} - 3]; 2 c + 1 \in [2^{53} + 3, 2^{54} - 1]; \end{matrix}$

(64)

Therefore, the following satisfies:

$\begin{matrix} f l o a t & : 2^{24} + 1 ⩽ t \cdot 5^{k + 1} ⩽ 2^{25} - 1 \\ d o u b l e & : 2^{53} + 1 ⩽ t \cdot 5^{k + 1} ⩽ 2^{54} - 1 \end{matrix}$

(65)

Therefore, the following conclusions are drawn:

$\begin{matrix} f l o a t & : \frac{2^{24} + 1}{5^{k + 1}} ⩽ t ⩽ \frac{2^{25} - 1}{5^{k + 1}}; \\ d o u b l e & : \frac{2^{53} + 1}{5^{k + 1}} ⩽ t ⩽ \frac{2^{54} - 1}{5^{k + 1}}; \end{matrix}$

(66)

In equation (66), the maximum value of k for which the range of t contains at least one odd number is:

$\begin{matrix} \text{float} & : k_{max} = 9 \Rightarrow q_{max} = 33, t = 3 \\ \text{double} & : k_{max} = 22 \Rightarrow q_{max} = 76, t = 1 \end{matrix}$

(67)

Therefore, the maximum value of k is 9 within the float range and 22 within the double range. When k exceeds this value, $(2 c \pm 1)$ is not divisible by $5^{k + 1}$.
$1 ⩾ q ⩾ 0$

Because the denominator $2^{2 + k - q} \cdot 5^{k + 1}$ is even and the numerator $(2 c \pm 1)$ is odd, the condition is not met.
$q < 0$

Because the denominator $2^{2 + k - q}$ is even and the numerator $(2 c \pm 1) \cdot 5^{- k - 1}$ is odd, the condition is not met.

In summary, the situations where $(2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}$ may be an integer are as follows:

\begin{matrix} \text{float} & : 2 ⩽ q ⩽ 33 \land (2 c \pm 1) \bmod 5^{k + 1} = 0; \\ \text{double} & : 2 ⩽ q ⩽ 76 \land (2 c \pm 1) \bmod 5^{k + 1} = 0; \end{matrix}

(68)
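The bounds in (67) and (68) can be recomputed with a few lines of integer arithmetic. In this sketch, `largest_k_with_odd_multiple` searches for the largest k such that some odd multiple of $5^{k+1}$ lies in the range of $2 c \pm 1$ from (65), and `q_max_for` finds the largest q that maps to a given k under the regular case of equation (11). Both names are illustrative.

```python
def largest_k_with_odd_multiple(frac_bits: int):
    """Return (k_max, t) where t is the smallest odd t with t * 5**(k_max + 1) in range."""
    lo = (1 << (frac_bits + 1)) + 1      # lower bound of 2c - 1 in (65)
    hi = (1 << (frac_bits + 2)) - 1      # upper bound of 2c + 1 in (65)
    best = None
    k = 0
    while 5 ** (k + 1) <= hi:
        p = 5 ** (k + 1)
        t = -(-lo // p)                  # ceil(lo / p)
        if t % 2 == 0:
            t += 1                       # t must be odd because 2c +/- 1 is odd
        if t * p <= hi:
            best = (k, t)
        k += 1
    return best

def q_max_for(k: int) -> int:
    """Largest q >= 0 with (q * 315653) >> 20 == k."""
    q = 0
    while ((q + 1) * 315653) >> 20 <= k:
        q += 1
    return q

for name, bits in (("float", 23), ("double", 52)):
    k_max, t = largest_k_with_odd_multiple(bits)
    print(name, k_max, q_max_for(k_max), t)
```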

In these situations, the range of $- k - 1$ is:

\begin{matrix} \text{float} & : - 10 ⩽ - k - 1 ⩽ - 1 \\ \text{double} & : - 23 ⩽ - k - 1 ⩽ - 1 \end{matrix}

(69)

When $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n$, the following conclusions can be drawn:

\begin{matrix} \text{float} & : 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} = n \cdot 2^{36} \Rightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋ \\ \text{double} & : 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} = n \cdot 2^{64} \Rightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{64} ⌋ \end{matrix}

(70)

When $2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n$, the following conclusions can be drawn:

\begin{matrix} \text{float} & : 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} = 2^{36} - n \cdot 2^{36} \Rightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋ = 2^{36} - 1 - ⌊ n \cdot 2^{36} ⌋ \\ \text{double} & : 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} = 2^{64} - n \cdot 2^{64} \Rightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} - n \cdot 2^{64} ⌋ = 2^{64} - 1 - ⌊ n \cdot 2^{64} ⌋ \end{matrix}

(71)

Whether $⌊ 2^{36} - n \cdot 2^{36} ⌋ = 2^{36} - 1 - ⌊ n \cdot 2^{36} ⌋$ in equation (71) holds depends on whether $2^{36} \cdot n$ is an integer: the equality holds exactly when it is not. This is equivalent to asking whether the following values are integers when equation (68) holds (the same applies to double):

\begin{matrix} \text{float} & : c \cdot 2^{q + 36} \cdot 10^{- k - 1} = c \cdot 2^{q - k + 35} \cdot 5^{- k - 1} = c \cdot \frac{2^{q - k + 35}}{5^{k + 1}} \\ \text{double} & : c \cdot 2^{q + 64} \cdot 10^{- k - 1} = c \cdot 2^{q - k + 63} \cdot 5^{- k - 1} = c \cdot \frac{2^{q - k + 63}}{5^{k + 1}} \end{matrix}

(72)

Suppose c is divisible by

5^{k + 1}

(where t is an auxiliary integer):

\begin{matrix} c = t \cdot 5^{k + 1}; t ⩾ 1 \end{matrix}

(73)

Then, if equation (73) holds, we have:

\begin{matrix} 2 c \pm 1 = 2 \cdot t \cdot 5^{k + 1} \pm 1 \end{matrix}

(74)

Expression (74) is not divisible by

5^{k + 1}

, which contradicts equation (68), so c is not divisible by

5^{k + 1}

. Because powers of 2 are coprime to 5, it follows that, for float,

c \cdot 2^{q + 36} \cdot 10^{- k - 1}

is not an integer, and for double,

c \cdot 2^{64 + q} \cdot 10^{- k - 1}

is not an integer, that is:

\begin{matrix} f l o a t & : ⌊ 2^{36} - 2^{36} \cdot n ⌋ = 2^{36} + ⌊ - 2^{36} \cdot n ⌋ = 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋ \\ d o u b l e & : ⌊ 2^{64} - 2^{64} \cdot n ⌋ = 2^{64} + ⌊ - 2^{64} \cdot n ⌋ = 2^{64} - 1 - ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(75)
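The divisibility argument of equations (73) to (75) can be checked numerically on a small range. The sketch below is only an illustration: the function name `check_c_not_multiple` and the search limits are assumptions made here, and `K` stands for $k + 1$.

```python
from fractions import Fraction


def check_c_not_multiple(max_power: int = 4, c_limit: int = 20000) -> int:
    """Return how many (K, c, sign) cases satisfy condition (68).

    For each such case, assert that c is not a multiple of 5 and that
    c * 2**j / 5**K is therefore never an integer.
    """
    checked = 0
    for K in range(1, max_power + 1):           # K plays the role of k + 1
        five_K = 5 ** K
        for c in range(1, c_limit):
            for sign in (-1, 1):                # the two cases 2c - 1 and 2c + 1
                if (2 * c + sign) % five_K != 0:
                    continue                    # condition (68) is not met
                # 2c +/- 1 is a multiple of 5, so c itself cannot be.
                assert c % 5 != 0
                # Powers of 2 are coprime to 5, so the quotient stays fractional.
                for j in (0, 1, 36, 64):
                    assert Fraction(c * 2 ** j, five_K).denominator != 1
                checked += 1
    return checked


if __name__ == "__main__":
    print(check_c_not_multiple())
```

The exponents `j` tried here are arbitrary samples; the argument itself does not depend on the particular power of 2.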

Therefore, conclusion (71) is correct. Next we show that

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋

holds if and only if

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

. The same applies to double. That is:

\begin{matrix} f l o a t & : 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n \Leftrightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋ \\ d o u b l e & : 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n \Leftrightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{64} ⌋ \end{matrix}

(76)

Similarly,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋

holds if and only if

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

. The same applies to double. That is:

\begin{matrix} f l o a t & : 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n \Leftrightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋ \\ d o u b l e & : 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n \Leftrightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} - n \cdot 2^{64} ⌋ \end{matrix}

(77)

The forward implications in equations (76) and (77) follow directly from (70) and (71), so only the converse needs proof. We prove it for equation (76). For float, we must show that

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

holds whenever

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋

holds, or equivalently that

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ n \cdot 2^{36} ⌋

holds whenever

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

. The proof is by contradiction.

Assume that

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋

holds when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

. Then there is:

\begin{matrix} ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋ \\ \Rightarrow 0 < |2^{35} \cdot 2^{q} \cdot 10^{- k - 1} - 2^{36} \cdot n| < 1 \\ \Rightarrow 0 < |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| < 2^{- 36} \end{matrix}

(78)

As is known from equation (57), there is:

\begin{matrix} m - 1 < (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 1 \end{matrix}

(79)

Suppose the decimal part of

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is represented as

n^{-}

, thus we have:

\begin{matrix} |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| = \{\begin{matrix} n^{-}; if (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} > m \\ 1 - n^{-}; if (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m \end{matrix} \end{matrix}

(80)

Substitute expression (80) into expression (78), and we get:

\begin{matrix} 0 < |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| < 2^{- 36} \\ \Rightarrow 0 < n^{-} < 2^{- 36} or 0 < 1 - n^{-} < 2^{- 36} \end{matrix}

(81)

The range of

n^{-}

for double is obtained in the same way. Therefore:

\begin{matrix} f l o a t & : n^{-} \in (0, 2^{- 36}) \cup (1 - 2^{- 36}, 1) \\ d o u b l e & : n^{-} \in (0, 2^{- 64}) \cup (1 - 2^{- 64}, 1) \end{matrix}

(82)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

, it is known from equation (54) that

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. Therefore, there is:

\begin{matrix} 0 < n^{-} < 1 \end{matrix}

(83)

It is only necessary to prove that equation (82) does not hold. Discuss the range of the decimal part

n^{-}

when

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. According to equation (61), there are:

\begin{matrix} (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c - 1) \cdot \frac{x}{y} = \{\begin{matrix} \frac{(2 c - 1) \cdot 2^{q - k - 2}}{5^{k + 1}}; q ⩾ 2 \\ \frac{(2 c - 1)}{2^{2 + k - q} \cdot 5^{k + 1}}; 1 ⩾ q ⩾ 0 \\ \frac{(2 c - 1) \cdot 5^{- k - 1}}{2^{2 + k - q}}; q < 0 \end{matrix} \end{matrix}

(84)

The maximum value of

2 c - 1

is:

\begin{matrix} f l o a t & : {(2 c - 1)}_{max} = 2^{25} - 3 \\ d o u b l e & : {(2 c - 1)}_{max} = 2^{54} - 3 \end{matrix}

(85)

Discuss based on the denominator range in equation (84).

$y ⩽ {(2 c - 1)}_{max}$

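When $y ⩽ {(2 c - 1)}_{max}$ , with $y_{max}$ given by expression (85), a non-integer fraction with denominator $y$ has a fractional part that is a nonzero multiple of $1 / y$ , so the following holds true: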

$\begin{matrix} \frac{1}{y_{max}} ⩽ n^{-} ⩽ 1 - \frac{1}{y_{max}} \\ \frac{1}{y_{max}} ⩽ 1 - n^{-} ⩽ 1 - \frac{1}{y_{max}} \end{matrix}$

(86)

Therefore, when $y ⩽ {(2 c - 1)}_{max}$ , equation (82) does not hold true.
$y > {(2 c - 1)}_{max}$

Call function (37) to calculate the approximation results $P_{*} / Q_{*}$ and $P^{*} / Q^{*}$ of all possible upper and lower limit rational numbers:

$\begin{matrix} (\frac{P_{*}}{Q_{*}}, \frac{P^{*}}{Q^{*}}) = f ({(2 c - 1)}_{max}, x, y) \end{matrix}$

(87)

Therefore, for $n^{-}$ , the following conclusion can be drawn from formula (36).

$\begin{matrix} n^{-} \in [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \end{matrix}$

(88)

By exhausting all possibilities, we thus have (the test code file is test3.py) :

$\begin{matrix} f l o a t & : 2^{- 33} < n^{-} < 1 - 2^{- 29} \\ d o u b l e & : 2^{- 62} < n^{-} < 1 - 2^{- 63} \end{matrix}$

(89)

$\begin{matrix} f l o a t & : [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 36}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 36}, 1) = ⌀ \\ d o u b l e & : [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 64}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 64}, 1) = ⌀ \end{matrix}$

(90)

Therefore, when $y > {(2 c - 1)}_{max}$ , equation (82) does not hold true.
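The two cases above can be illustrated on small numbers. The sketch below is an assumption-laden toy, not the paper's test3.py: `small_denominator_bound_holds` checks the bound of (86) directly, and `brute_force_range` enumerates odd multipliers by brute force where the paper instead calls function (37). Both names are hypothetical.

```python
from fractions import Fraction


def small_denominator_bound_holds(y_limit: int = 60) -> bool:
    """Check that a non-integer p / y has a fractional part in [1/y, 1 - 1/y]."""
    for y in range(2, y_limit + 1):
        for p in range(1, 3 * y):
            frac = Fraction(p % y, y)           # fractional part of p / y
            if frac == 0:
                continue                        # p / y is an integer, skip it
            if not (Fraction(1, y) <= frac <= 1 - Fraction(1, y)):
                return False
    return True


def brute_force_range(x: int, y: int, w_max: int):
    """Smallest and largest fractional part of w * x / y over odd w <= w_max.

    The odd multiplier w stands for 2c - 1; this is a brute-force stand-in
    that is only feasible for tiny bounds.
    """
    fracs = [Fraction((w * x) % y, y) for w in range(1, w_max + 1, 2)]
    return min(fracs), max(fracs)


if __name__ == "__main__":
    print(small_denominator_bound_holds())
    print(brute_force_range(1, 7, 5))
```

For the real float and double ranges the multiplier bound is far too large for enumeration, which is why a dedicated search such as function (37) is needed there.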

In summary, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

, equation (82) does not hold true, that is,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ n \cdot 2^{36} ⌋

must hold true. Therefore, when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ n \cdot 2^{36} ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

must hold true. Therefore, equation (76) holds.

Similarly, it can be proved that when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

must hold true. The same applies to double. Again arguing by contradiction, for float, assume that when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

holds,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋

holds. That is:

\begin{matrix} ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋ \\ \Rightarrow 0 < |2^{35} \cdot 2^{q} \cdot 10^{- k - 1} - 2^{36} + 2^{36} \cdot n| < 1 \\ \Rightarrow 0 < |2^{q - 1} \cdot 10^{- k - 1} - 1 + n| < 2^{- 36} \\ \Rightarrow - 2^{- 36} < (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1 < 2^{- 36} \end{matrix}

(91)

As is known from equation (58), there is:

\begin{matrix} m < (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 2 \end{matrix}

(92)

Suppose the decimal part of

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is represented as

n^{+}

, thus we have:

\begin{matrix} |(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1| = \{\begin{matrix} n^{+}; if (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} > m + 1 \\ 1 - n^{+}; if (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 1 \end{matrix} \end{matrix}

(93)

Substitute expression (93) into expression (91), and we get:

\begin{matrix} 0 < |(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1| < 2^{- 36} \\ \Rightarrow 0 < 1 - n^{+} < 2^{- 36} or 0 < n^{+} < 2^{- 36} \end{matrix}

(94)

The range of

n^{+}

for double is obtained in the same way. Therefore:

\begin{matrix} f l o a t & : n^{+} \in (0, 2^{- 36}) \cup (1 - 2^{- 36}, 1) \\ d o u b l e & : n^{+} \in (0, 2^{- 64}) \cup (1 - 2^{- 64}, 1) \end{matrix}

(95)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

, it is known from equation (55) that

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer.

Therefore, there is:

\begin{matrix} 0 < n^{+} < 1 \end{matrix}

(96)

It is only necessary to prove that equation (95) does not hold. Discuss the range of the decimal part

n^{+}

when

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. According to equation (61), there are:

\begin{matrix} (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c + 1) \cdot \frac{x}{y} = \{\begin{matrix} \frac{(2 c + 1) \cdot 2^{q - k - 2}}{5^{k + 1}}; q ⩾ 2 \\ \frac{(2 c + 1)}{2^{2 + k - q} \cdot 5^{k + 1}}; 1 ⩾ q ⩾ 0 \\ \frac{(2 c + 1) \cdot 5^{- k - 1}}{2^{2 + k - q}}; q < 0 \end{matrix} \end{matrix}

(97)

The maximum value of

2 c + 1

is:

\begin{matrix} f l o a t & : {(2 c + 1)}_{max} = 2^{25} - 1 \\ d o u b l e & : {(2 c + 1)}_{max} = 2^{54} - 1 \end{matrix}

(98)

Discuss based on the denominator range in equation (97).

$y ⩽ {(2 c + 1)}_{max}$

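When $y ⩽ {(2 c + 1)}_{max}$ , with $y_{max}$ given by expression (98), a non-integer fraction with denominator $y$ has a fractional part that is a nonzero multiple of $1 / y$ , so the following holds true: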

$\begin{matrix} \frac{1}{y_{max}} ⩽ n^{+} ⩽ 1 - \frac{1}{y_{max}} \\ \frac{1}{y_{max}} ⩽ 1 - n^{+} ⩽ 1 - \frac{1}{y_{max}} \end{matrix}$

(99)

Therefore, when $y ⩽ {(2 c + 1)}_{max}$ , equation (95) does not hold true.
$y > {(2 c + 1)}_{max}$

Call function (37) to calculate the approximation results $P_{*} / Q_{*}$ and $P^{*} / Q^{*}$ of all possible upper and lower limit rational numbers:

$\begin{matrix} (\frac{P_{*}}{Q_{*}}, \frac{P^{*}}{Q^{*}}) = f ({(2 c + 1)}_{max}, x, y) \end{matrix}$

(100)

Therefore, for $n^{+}$ , the following conclusion can be drawn from formula (36).

$\begin{matrix} n^{+} \in [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \end{matrix}$

(101)

By exhausting all possibilities, we thus have (the test code file is test7.py) :

$\begin{matrix} f l o a t & : 2^{- 33} < n^{+} < 1 - 2^{- 29} \\ d o u b l e & : 2^{- 62} < n^{+} < 1 - 2^{- 63} \end{matrix}$

(102)

$\begin{matrix} f l o a t & : [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 36}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 36}, 1) = ⌀ \\ d o u b l e & : [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 64}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 64}, 1) = ⌀ \end{matrix}$

(103)

Therefore, when $y > {(2 c + 1)}_{max}$ , equation (95) does not hold true.

In summary, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

, equation (95) does not hold true, that is,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ 2^{36} - n \cdot 2^{36} ⌋

must hold true. Therefore, when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - n \cdot 2^{36} ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

must hold true. Therefore, equation (77) holds.

The following conclusions hold:

\begin{matrix} f l o a t : & ⌊ 2^{36} - n \cdot 2^{36} ⌋ & = \{\begin{matrix} 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋; if c \cdot 2^{36 + q} \cdot 10^{- k - 1} \notin Z \\ 2^{36} - ⌊ 2^{36} \cdot n ⌋; if c \cdot 2^{36 + q} \cdot 10^{- k - 1} \in Z \end{matrix} \\ d o u b l e : & ⌊ 2^{64} - n \cdot 2^{64} ⌋ & = \{\begin{matrix} 2^{64} - 1 - ⌊ 2^{64} \cdot n ⌋; if c \cdot 2^{64 + q} \cdot 10^{- k - 1} \notin Z \\ 2^{64} - ⌊ 2^{64} \cdot n ⌋; if c \cdot 2^{64 + q} \cdot 10^{- k - 1} \in Z \end{matrix} \end{matrix}

(104)
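The case split in (104) is the general floor identity for $N - x$ with integer $N$: the extra $- 1$ appears exactly when $x$ is not an integer. A minimal sketch with exact rationals, where the helper name `floor_of_complement` is an assumption made for illustration:

```python
from fractions import Fraction
from math import floor


def floor_of_complement(N: int, x: Fraction) -> int:
    """Return floor(N - x) using the two cases of (104)."""
    if x.denominator == 1:
        return N - floor(x)         # x is an integer: no borrow
    return N - 1 - floor(x)         # x is not an integer: borrow one


if __name__ == "__main__":
    N = 2 ** 36
    for x in (Fraction(5), Fraction(11, 4)):
        print(x, floor_of_complement(N, x), floor(N - x))
```

With $N = 2^{36}$ and $x = 2^{36} \cdot n$ this reproduces the float line of (104); $N = 2^{64}$ gives the double line.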

Discuss whether the following equation (105) holds when conditions (68) and (69) are met:

\begin{matrix} f l o a t & : ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} ⌋ = ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r ⌋ \\ = ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot \frac{(2^{63 - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋} / / 10^{k + 1}) + 1}{10^{- k - 1}} \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} (10) ⌋ - 63} ⌋ \\ d o u b l e & : ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} ⌋ = ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot r ⌋ \\ = ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot \frac{(2^{127 - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋} / / 10^{k + 1}) + 1}{10^{- k - 1}} \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} (10) ⌋ - 127} ⌋ \end{matrix}

(105)

There are:

\begin{matrix} f l o a t & : ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} ⌋ = ⌊ 2^{36} \cdot (m + n) ⌋ = 2^{36} m + ⌊ 2^{36} n ⌋ \\ d o u b l e & : ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} ⌋ = ⌊ 2^{64} \cdot (m + n) ⌋ = 2^{64} m + ⌊ 2^{64} n ⌋ \end{matrix}

(106)

It has been proven earlier that m can be calculated exactly. Then, when equation (105) holds true, the values

⌊ 2^{36} n ⌋

and

⌊ 2^{64} n ⌋

on the right side of equations (70) and (71) can be accurately calculated.

From equation (63), we have:

\begin{matrix} c = \frac{t \cdot 5^{k + 1} - 1}{2} \end{matrix}

(107)

Substituting equation (107) into equation (105), we have:

\begin{matrix} f l o a t : c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} = t \cdot 2^{q + 34 - k} - \frac{2^{q + 34 - k}}{5^{k + 1}} \\ d o u b l e : c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} = t \cdot 2^{q + 62 - k} - \frac{2^{q + 62 - k}}{5^{k + 1}} \end{matrix}

(108)

When conditions (68) and (69) are met,

t \cdot 2^{q + 34 - k}

and

t \cdot 2^{q + 62 - k}

are integers. Under the condition of meeting condition (68), the decimal part of expression (108) is represented as:

\begin{matrix} f l o a t : \frac{2^{q + 34 - k} % 5^{k + 1}}{5^{k + 1}}; 2 ⩽ q ⩽ 33 \\ d o u b l e : \frac{2^{q + 62 - k} % 5^{k + 1}}{5^{k + 1}}; 2 ⩽ q ⩽ 76 \end{matrix}

(109)

For equation (105) to hold, it suffices to show that the fractional part of the left-hand value

c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}}

plus the amount by which the right-hand value

c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r

exceeds it is less than 1. That is:

\begin{matrix} f l o a t : & \frac{2^{q + 34 - k} % 5^{k + 1}}{5^{k + 1}} + (c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r - c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}}) < 1 \\ d o u b l e : & \frac{2^{q + 62 - k} % 5^{k + 1}}{5^{k + 1}} + (c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot r - c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}}) < 1 \end{matrix}

(110)

Substituting the maximum possible value of c for each q into inequality (110) and checking every case exhaustively shows that (110) always holds over both the float range and the double range (the calculation is in test2.py). Therefore, equation (105) holds true, and thus the values of

⌊ 2^{36} n ⌋

and

⌊ 2^{64} n ⌋

on the right side of equations (70) and (71) can be accurately calculated. The values of

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋

and

⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋

on the left side of equations (70) and (71) can be calculated through lookup tables.

\begin{matrix} \text{float} : ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = pow10 ≫ (28 - q - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋) \\ \text{double} : ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = pow10 ≫ (64 - q - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋) \end{matrix}

(111)

The code file for verifying the validity of equation (111) is test4.py. Therefore, when conditions (68) and (69) are met, the values of both sides of equations (70) and (71) can be accurately calculated.

Discuss the relationship between the following two values within all ranges of floating-point numbers:

\begin{matrix} f l o a t : ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋; ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋; \\ d o u b l e : ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋; ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋; \end{matrix}

(112)

When

r = 1

, it is obvious that the two values in expression (112) are equal. When

r \neq 1

, or equivalently

r > 1

, we have:

\begin{matrix} f l o a t : \\ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} & = c \cdot 2^{q + 36} \cdot 10^{- k - 1} + c \cdot 2^{q + 36} \cdot (r - 1) \cdot 10^{- k - 1} \\ < c \cdot 2^{q + 36} \cdot 10^{- k - 1} + 2^{24} \cdot 2^{36} \cdot 2^{q} \cdot 10^{- k - 1} \cdot (r - 1) \\ < c \cdot 2^{q + 36} \cdot 10^{- k - 1} + 2^{- 3} \\ ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋ & ⩽ ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ + 1 \\ d o u b l e : \\ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} & = c \cdot 2^{q + 64} \cdot 10^{- k - 1} + c \cdot 2^{q + 64} \cdot (r - 1) \cdot 10^{- k - 1} \\ < c \cdot 2^{q + 64} \cdot 10^{- k - 1} + 2^{53} \cdot 2^{64} \cdot 2^{q} \cdot 10^{- k - 1} \cdot (r - 1) \\ < c \cdot 2^{q + 64} \cdot 10^{- k - 1} + 2^{- 10} \\ ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋ & ⩽ ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(113)

Therefore, there is:

\begin{matrix} f l o a t & : 0 ⩽ ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋ - ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ ⩽ 1 \\ d o u b l e & : 0 ⩽ ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋ - ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ ⩽ 1 \end{matrix}

(114)

Because there is:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} ⌋ = m \end{matrix}

(115)

\begin{matrix} f l o a t & : ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ = 2^{36} m + ⌊ 2^{36} n ⌋ \\ d o u b l e & : ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ = 2^{64} m + ⌊ 2^{64} n ⌋ \end{matrix}

(116)

Suppose:

\begin{matrix} n_{r} = c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - m \end{matrix}

(117)

Therefore, the following conclusion can be drawn: when condition (68) is met, from equation (105), we have:

\begin{matrix} f l o a t : 2 ⩽ q ⩽ 33 & & (2 c \pm 1) % 5^{k + 1} = 0 \Rightarrow ⌊ 2^{36} \cdot n ⌋ = ⌊ 2^{36} \cdot n_{r} ⌋ \\ d o u b l e : 2 ⩽ q ⩽ 76 & & (2 c \pm 1) % 5^{k + 1} = 0 \Rightarrow ⌊ 2^{64} \cdot n ⌋ = ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(118)

Within the range of floating-point numbers, there exists:

\begin{matrix} f l o a t & : ⌊ 2^{36} \cdot n ⌋ ⩽ ⌊ 2^{36} \cdot n_{r} ⌋ ⩽ ⌊ 2^{36} \cdot n ⌋ + 1 \\ d o u b l e & : ⌊ 2^{64} \cdot n ⌋ ⩽ ⌊ 2^{64} \cdot n_{r} ⌋ ⩽ ⌊ 2^{64} \cdot n ⌋ + 1 \end{matrix}

(119)

To simplify the expression,

e v e n

is used to indicate whether c is an even number:

\begin{matrix} e v e n = (c + 1) % 2 \in \{0, 1\} \end{matrix}

(120)

The proof for the float range follows, from equation (121) to equation (144). When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

is the boundary condition for

o n e = 0

, and

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

is the boundary condition for

o n e = 10

. Whether

o n e

is 0 or 10 is determined based on whether c is an even number. Therefore, the following holds:

\begin{matrix} f l o a t : \{\begin{matrix} o n e = 0 : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ n_{r} \cdot 2^{36} ⌋ \\ o n e = 10 : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{36} - 1 - ⌊ n_{r} \cdot 2^{36} ⌋ \end{matrix} \end{matrix}

(121)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

, we can use the condition (122) to determine whether

o n e = 0

or

o n e = 10

.

\begin{matrix} f l o a t : \{\begin{matrix} if ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ n_{r} \cdot 2^{36} ⌋ : o n e = 0 \\ if ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{36} - 1 - ⌊ n_{r} \cdot 2^{36} ⌋ : o n e = 10 \end{matrix} \end{matrix}

(122)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, we can also use the above condition (122) to determine whether

o n e = 0

or

o n e = 10

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, we can also use the above condition (122) to determine whether

o n e \neq 0

or

o n e \neq 10

. The proof is as follows:

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

, there must exist

o n e \neq 0

, and there is:

\begin{matrix} 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} - 1 \in (2^{- 33} - 1, - 2^{- 29}) \end{matrix}

(123)

Therefore, the following exists:

\begin{matrix} 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n \in (2^{3} - 2^{36}, - 2^{7}) \end{matrix}

(124)

Suppose there are two real numbers a and b, and the following relationship must exist:

\begin{matrix} 0 ⩽ & b - ⌊ b ⌋ < 1 \\ a - ⌊ a ⌋ - 1 < & b - ⌊ b ⌋ < 1 + a - ⌊ a ⌋ \\ a - b - 1 < & ⌊ a ⌋ - ⌊ b ⌋ < a - b + 1 \end{matrix}

(125)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

, the following exists:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ < 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n + 1 \end{matrix}

(126)

From equation (124), we have:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ < 1 - 2^{7} < 0 \end{matrix}

(127)

Therefore, there is:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + 1 \\ < ⌊ 2^{36} \cdot n ⌋ ⩽ ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & < ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix}

(128)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

, the condition (122) can be used to determine that

o n e \neq 0

.

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

, there must exist

o n e = 0

, and there is:

\begin{matrix} 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} \in (2^{- 33}, 1 - 2^{- 29}) \end{matrix}

(129)

Therefore, the following exists:

\begin{matrix} 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n \in (2^{3}, 2^{36} - 2^{7}) \end{matrix}

(130)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

, from equation (125), the following exists:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ > 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n - 1 \end{matrix}

(131)

From equation (130), we have:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ > 2^{3} - 1 ⩾ 0 \end{matrix}

(132)

Therefore, there is:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩾ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ \\ > ⌊ 2^{36} \cdot n ⌋ + 1 ⩾ ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & > ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix}

(133)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

, the condition (122) can be used to determine that

o n e = 0

.

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, there must exist

o n e \neq 10

, and there is:

\begin{matrix} 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} \in (2^{- 33}, 1 - 2^{- 29}) \end{matrix}

(134)

Therefore, the following exists:

\begin{matrix} 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \in (2^{3}, 2^{36} - 2^{7}) \end{matrix}

(135)

Suppose there are two real numbers a and b, and the following relationship must exist:

\begin{matrix} a - 1 & < ⌊ a ⌋ ⩽ a \\ b - 1 & < ⌊ b ⌋ ⩽ b \\ a + b - 2 & < ⌊ a ⌋ + ⌊ b ⌋ ⩽ a + b \end{matrix}

(136)
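Both floor inequalities, (125) for differences and (136) for sums, hold for arbitrary real numbers. The sketch below spot-checks them on random exact rationals; the function names and the sampling ranges are assumptions chosen for illustration only.

```python
from fractions import Fraction
from math import floor
import random


def floor_bounds_hold(a: Fraction, b: Fraction) -> bool:
    """Check the last line of (125) and the last line of (136) for one pair."""
    diff_ok = a - b - 1 < floor(a) - floor(b) < a - b + 1      # (125)
    sum_ok = a + b - 2 < floor(a) + floor(b) <= a + b          # (136)
    return diff_ok and sum_ok


def check_random(trials: int = 2000, seed: int = 0) -> bool:
    """Try many random rational pairs, including negative values."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = Fraction(rng.randint(-10 ** 6, 10 ** 6), rng.randint(1, 1000))
        b = Fraction(rng.randint(-10 ** 6, 10 ** 6), rng.randint(1, 1000))
        if not floor_bounds_hold(a, b):
            return False
    return True


if __name__ == "__main__":
    print(check_random())
```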

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

, the following exists:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ ⩽ 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \end{matrix}

(137)

From equation (135), we have:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ < 2^{36} - 2^{7} \end{matrix}

(138)

Therefore, there is:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + 1 \\ < 2^{36} - 2 - ⌊ 2^{36} \cdot n ⌋ \\ ⩽ 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & < 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix}

(139)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, the condition (122) can be used to determine that

o n e \neq 10

.

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, there must exist

o n e = 10

, and there is:

\begin{matrix} 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} + 1 \in (1 + 2^{- 33}, 2 - 2^{- 29}) \end{matrix}

(140)

Therefore, the following exists:

\begin{matrix} 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \in (2^{3} + 2^{36}, 2^{37} - 2^{7}) \end{matrix}

(141)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

, from equation (136), the following exists:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ > 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n - 2 \end{matrix}

(142)

From equation (141), we have:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ > 2^{36} + 2^{3} - 2 ⩾ 2^{36} \end{matrix}

(143)

Therefore, there is:

\begin{matrix} ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + even & ⩾ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ \\ > 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋ \\ ⩾ 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + even & > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix}

(144)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, the condition (122) can be used to determine that

o n e = 10

.

From the above proof, it can be seen that when condition (68) is met, the condition (122) can be used to determine whether

o n e = 0

or

o n e = 10

when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, the condition (122) can be used to determine whether

o n e = 0

or

o n e = 10

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, the condition (122) can be used to determine whether

o n e \neq 0

or

o n e \neq 10

.

For the double range, the value of

o n e

can be calculated based on the following conditions.

\begin{matrix} \text{double} : \{\begin{matrix} if ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + even > ⌊ n_{r} \cdot 2^{64} ⌋ : one = 0 \\ if ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + even > 2^{64} - 1 - ⌊ n_{r} \cdot 2^{64} ⌋ : one = 10 \end{matrix} \end{matrix}

(145)

The derivation for double follows the same steps as the float proof above and is omitted. This completes the proof of this section. In the code implementation, the two conditions can be evaluated quickly with additions, subtractions and shifts, and the compiler can turn them into cmov instructions, which reduces the performance impact of branch mispredictions.
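As a rough illustration of how the two float conditions in (122) reduce to integer comparisons, the following sketch takes the already computed integers as inputs. The function name `boundary_flags` and its parameters are hypothetical: `a` stands for $⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋$ and `nr36` for $⌊ n_{r} \cdot 2^{36} ⌋$. It is not the paper's implementation, and Python cannot show the cmov code generation itself.

```python
from typing import Tuple

MASK36 = (1 << 36) - 1      # 2**36 - 1


def boundary_flags(a: int, nr36: int, c: int) -> Tuple[int, int]:
    """Evaluate both comparisons of (122) without branching on the data.

    Returns (is_zero, is_ten) as 0/1 flags.
    """
    even = (c + 1) % 2                  # equation (120): 1 when c is even
    lhs = a + even
    is_zero = int(lhs > nr36)           # first line of (122)
    # For 0 <= nr36 <= MASK36, MASK36 - nr36 equals the 36-bit complement
    # MASK36 ^ nr36, so no borrow handling is needed.
    is_ten = int(lhs > MASK36 - nr36)   # second line of (122)
    return is_zero, is_ten


if __name__ == "__main__":
    print(boundary_flags(5, 4, 2))
    print(boundary_flags(10, MASK36 - 3, 1))
```

In a compiled language each flag is a single compare, which is the kind of expression compilers commonly lower to conditional moves rather than branches.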

3.5. Determine whether $o n e = ⌊ 10 n ⌋$ or $o n e = ⌊ 10 n ⌋ + 1$

Determine whether

o n e

is

⌊ 10 n ⌋

or

⌊ 10 n ⌋ + 1

based on the decimal part of

10 n

. There are two cases: the decimal part of

10 n

is 0.5 and it is not 0.5.

3.5.1. $10 n - ⌊ 10 n ⌋ = 0.5$

When the decimal part of

10 n

is 0.5, there must be:

\begin{matrix} 10 n - ⌊ 10 n ⌋ = 0.5 \\ \Rightarrow 10 \cdot c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ 10 \cdot c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = 0.5 \\ \Rightarrow c \cdot 2^{q} \cdot 10^{- k} - ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ = 0.5 \\ \Rightarrow c \cdot 2^{q} \cdot 10^{- k} = ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ + 0.5 \\ \Rightarrow 2 c \cdot 2^{q} \cdot 10^{- k} = 2 ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ + 1 \end{matrix}

(146)

So

2 c \cdot 2^{q} \cdot 10^{- k}

is an odd number. Then the following expression is odd:

\begin{matrix} c \cdot 2^{q + 1} \cdot 10^{- k} = c \cdot 2^{q - k + 1} \cdot 5^{- k} \end{matrix}

(147)

According to the range of q, there are:

\begin{matrix} c \cdot 2^{q + 1} \cdot 10^{- k} = \{\begin{matrix} \frac{c \cdot 2^{q - k + 1}}{5^{k}}; q ⩾ 0 \\ c \cdot 2 \cdot 5^{- k}; q = - 1 \\ \frac{c \cdot 5^{- k}}{2^{k - q - 1}}; q ⩽ - 2 \end{matrix} \end{matrix}

(148)

According to the range of q, the following situations are discussed:

$q ⩾ 0$

When $q ⩾ 0$ , it can be concluded that $q - k + 1 ⩾ 1$ , the numerator $c \cdot 2^{q - k + 1}$ is even and the denominator $5^{k}$ is odd, which does not meet the condition.
$q = - 1$

When $q = - 1$ , it can be concluded that $c \cdot 2 \cdot 5^{- k}$ is even, which does not meet the condition.
$q ⩽ - 2$

$5^{- k}$ is an odd number. c is an odd multiple of $2^{k - q - 1}$ . So:

$\begin{matrix} f l o a t : c ⩾ 2^{k - q - 1} \Rightarrow k - q - 1 ⩽ 22 \Rightarrow q ⩾ - 34 \\ d o u b l e : c ⩾ 2^{k - q - 1} \Rightarrow k - q - 1 ⩽ 51 \Rightarrow q ⩾ - 75 \end{matrix}$

(149)

Therefore, when q meets the above conditions, c must be an odd multiple of $2^{k - q - 1}$ to meet the condition. Therefore, when the following conditions are met, expression (147) is an odd number:

$\begin{matrix} f l o a t & : - 34 ⩽ q ⩽ - 2 & & c % 2^{k - q} = 2^{k - q - 1} \\ d o u b l e & : - 75 ⩽ q ⩽ - 2 & & c % 2^{k - q} = 2^{k - q - 1} \end{matrix}$

(150)

When q is within the above range (150), $r = 1$ follows from equation (30). Therefore, there is:

$\begin{matrix} n_{r} = n \end{matrix}$

(151)

The following equation holds:

$\begin{matrix} 20 m + 20 n = c \cdot 2^{q + 1} \cdot 10^{- k} = c \cdot 2^{q - k + 1} \cdot 5^{- k} = \frac{c}{2^{k - q - 1}} \cdot 5^{- k} \end{matrix}$

(152)

Since $- k ⩾ 1$ , $5^{- k}$ is an odd multiple of 5. Since $\frac{c}{2^{k - q - 1}}$ and $5^{- k}$ are both odd, $20 m + 20 n$ is an odd multiple of 5. Because $20 m$ is a multiple of 20 and $0 ⩽ 20 n < 20$ , $20 n$ is itself an odd multiple of 5. Therefore, there is:

$\begin{matrix} 20 n \in \{5, 15\} \\ \Rightarrow n \in \{0.25, 0.75\} \\ \Rightarrow n_{r} \in \{0.25, 0.75\} \end{matrix}$

(153)

In this tie case, $o n e$ is whichever of $⌊ 10 n ⌋$ and $⌊ 10 n ⌋ + 1$ is even. Therefore:

$\begin{matrix} o n e = \{\begin{matrix} ⌊ 10 n ⌋ = 2, if n = 0.25 \\ ⌊ 10 n ⌋ + 1 = 8, if n = 0.75 \end{matrix} \Rightarrow o n e = ⌊ 20 n + 1 ⌋ / / 2 - (n = 0.25 ? 1 : 0) \end{matrix}$

(154)
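The tie case can be checked with exact rationals. The sketch below tests the odd-multiple condition of (150) and the formula of (154) against Python's built-in `round`, which rounds halves to even for `Fraction` values. The names `is_odd_multiple` and `one_on_tie` are assumptions, and `s` stands for $k - q - 1$.

```python
from fractions import Fraction
from math import floor


def is_odd_multiple(c: int, s: int) -> bool:
    """True when c / 2**s is an odd integer, written as in (150)."""
    return c % (1 << (s + 1)) == (1 << s)


def one_on_tie(n: Fraction) -> int:
    """Equation (154): n is 1/4 or 3/4, so 10n ends in .5; pick the even digit."""
    return floor(20 * n + 1) // 2 - (1 if n == Fraction(1, 4) else 0)


if __name__ == "__main__":
    for n in (Fraction(1, 4), Fraction(3, 4)):
        print(n, one_on_tie(n), round(10 * n))
```

Both values of $n$ give the same digit as round-half-to-even applied to $10 n$.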

3.5.2. $10 n - ⌊ 10 n ⌋ \neq 0.5$

When the decimal part of

10 n

is not 0.5, round to the nearest integer value based on the decimal part of

10 n

. Therefore, there is:

\begin{matrix} o n e = \{\begin{matrix} ⌊ 10 n ⌋, if 10 n - ⌊ 10 n ⌋ < 0.5 \\ ⌊ 10 n ⌋ + 1, if 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} \Rightarrow o n e = ⌊ 10 n + 0.5 ⌋ = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}

(155)

Since

⌊ 20 n + 1 ⌋ = ⌊ 20 n ⌋ + 1

, it is only necessary to accurately calculate the value of

⌊ 20 n ⌋

. And, there is:

\begin{matrix} d & = t e n + o n e \\ = 10 m + ⌊ 20 n + 1 ⌋ / / 2 \\ = (⌊ 20 m + 20 n ⌋ + 1) / / 2 \end{matrix}

(156)
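Equation (156) folds the rounding of $10 n$ into a single floor of $20 m + 20 n$. A minimal sketch with exact rationals, assuming $m$ is a non-negative integer and $0 ⩽ n < 1$ ; the function name `digits_from_20x` is hypothetical:

```python
from fractions import Fraction
from math import floor


def digits_from_20x(m: int, n: Fraction) -> int:
    """d = ten + one computed as (floor(20 * (m + n)) + 1) // 2, as in (156)."""
    return (floor(20 * (m + n)) + 1) // 2


if __name__ == "__main__":
    m, n = 3, Fraction(47, 100)
    # 10 * (m + n) = 34.7, which rounds to 35
    print(digits_from_20x(m, n))
```

Away from ties this agrees with rounding $10 (m + n)$ to the nearest integer; at an exact tie the formula rounds up, which is why the tie case is handled separately in Section 3.5.1.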

Suppose there are:

\begin{matrix} 20 m + 20 n = c \cdot 2^{q + 1} \cdot 10^{- k} = c \cdot 2^{q - k + 1} \cdot 5^{- k} = c \cdot \frac{x}{y} \end{matrix}

(157)

Suppose the decimal part of

20 n

is

n_{20}

.

When

y ⩽ c_{max} = C

, the fractional part satisfies:

\begin{matrix} f l o a t : \frac{1}{2^{24} - 1} = \frac{1}{C} ⩽ n_{20} ⩽ 1 - \frac{1}{C} = \frac{2^{24} - 2}{2^{24} - 1} \\ d o u b l e : \frac{1}{2^{53} - 1} = \frac{1}{C} ⩽ n_{20} ⩽ 1 - \frac{1}{C} = \frac{2^{53} - 2}{2^{53} - 1} \end{matrix}

(158)

When

y > c_{max} = C

, the fractional part satisfies (the test file is test5.py):

\begin{matrix} f l o a t : 2^{- 32} < n_{20} < 1 - 2^{- 30} \\ d o u b l e : 2^{- 64} < n_{20} < 1 - 2^{- 62} \end{matrix}

(159)

Therefore, in both cases $n_{20}$ satisfies equation (159). In the code implementation, only the high 36 bits of $n_{r}$ are retained for float, and only the high 70 bits of $n_{r}$ are retained for double. Let $n_{36}$ denote the discarded part for float and $n_{70}$ the discarded part for double. Then:

$\begin{matrix} \text{float}: & n_{36} \in [0, 2^{- 36}) \\ \text{double}: & n_{70} \in [0, 2^{- 70}) \end{matrix}$

(160)

Consider the bounds of the following expression:

$\begin{matrix} \text{float}: & F = 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{36}) \\ \text{double}: & F = 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{70}) \end{matrix}$

(161)

Therefore:

$\begin{matrix} \text{float}: & F_{min} & > 20 \cdot (c \cdot 2^{q} \cdot 10^{- k - 1} - 2^{- 36}) \\ & & = 20 m + 20 n - 20 \cdot 2^{- 36} \\ & F_{max} & < 20 \cdot (c \cdot 2^{q} \cdot (1 + 2^{- 63}) \cdot 10^{- k - 1} - 0) \\ & & < 20 m + 20 n + 20 \cdot 2^{- 63} \cdot c \\ & & < 20 m + ⌊ 20 n ⌋ + 1 \\ \text{double}: & F_{min} & > 20 \cdot (c \cdot 2^{q} \cdot 10^{- k - 1} - 2^{- 70}) \\ & & = 20 m + 20 n - 20 \cdot 2^{- 70} \\ & & > 20 m + ⌊ 20 n ⌋ \\ & F_{max} & < 20 \cdot (c \cdot 2^{q} \cdot (1 + 2^{- 127}) \cdot 10^{- k - 1} - 0) \\ & & < 20 m + 20 n + 20 \cdot 2^{- 127} \cdot c \\ & & < 20 m + ⌊ 20 n ⌋ + 1 \end{matrix}$

(162)

Therefore:

$\begin{matrix} \text{float}: & ⌊ F ⌋ = 20 m + ⌊ 20 n ⌋ \\ \text{double}: & ⌊ F ⌋ = 20 m + ⌊ 20 n ⌋ \end{matrix}$

(163)

For float, the bound on $F_{min}$ above is not sufficient to guarantee $⌊ F_{min} ⌋ = 20 m + ⌊ 20 n ⌋$, since $20 \cdot 2^{- 36}$ exceeds the lower bound $2^{- 32}$ in equation (159). However, the code implementation has passed the exhaustive test, so this gap in the proof can be disregarded. The calculation of $d$ can therefore be simplified as follows:

$\begin{matrix} d & = t e n + o n e \\ & = (⌊ F ⌋ + 1) / / 2 \\ & = (⌊ 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{x}) ⌋ + 1) / / 2 \end{matrix}$

(164)

For the float range, $n_{x} = n_{36}$; for the double range, $n_{x} = n_{70}$.
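The arithmetic behind equations (159) and (162) can be re-checked with a short Python sketch. It only compares the worst-case truncation and multiplier errors with the slack guaranteed by equation (159), and it is not a substitute for the exhaustive test; the helper name and its parameters are illustrative.

```python
from fractions import Fraction


def error_margins(kept_bits: int, r_error_exp: int, c_bits: int):
    """Worst-case errors on 20m + 20n implied by equation (162).

    kept_bits   : fractional bits of n_r that are retained (36 or 70)
    r_error_exp : r < 1 + 2**-r_error_exp (63 or 127)
    c_bits      : c < 2**c_bits (24 or 53)
    """
    downward = 20 * Fraction(1, 2 ** kept_bits)            # from discarding n_x
    upward = 20 * Fraction(2 ** c_bits, 2 ** r_error_exp)  # from r > 1
    return downward, upward


float_down, float_up = error_margins(36, 63, 24)
double_down, double_up = error_margins(70, 127, 53)

# Slack guaranteed by equation (159): n_20 > low and 1 - n_20 > high.
float_low, float_high = Fraction(1, 2 ** 32), Fraction(1, 2 ** 30)
double_low, double_high = Fraction(1, 2 ** 64), Fraction(1, 2 ** 62)

print("double:", double_down < double_low, double_up < double_high)
print("float: ", float_down < float_low, float_up < float_high)
```

For double both comparisons hold. For float the downward error $20 \cdot 2^{- 36}$ is larger than $2^{- 32}$, which is why the float case relies on exhaustive testing rather than on this bound.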

For double, the condition $n = 0.25$ in equation (154) can be tested quickly as follows.

When $n = 0.25$, $⌊ 2^{64} \cdot n_{r} ⌋ = ⌊ 2^{64} \cdot n ⌋ = 2^{62}$. Therefore, the following condition can be used to quickly determine whether $n = 0.25$:

$\begin{matrix} \text{double}: n = 0.25 \text{ if } ⌊ 2^{64} \cdot n_{r} ⌋ = 2^{62} \end{matrix}$

(165)

When $n \neq 0.25$, consider the fractional part of the following expression:

$\begin{matrix} 4 m + 4 n = c \cdot 2^{q + 2} \cdot 10^{- k - 1} \end{matrix}$

(166)

When the value in equation (166) is not an integer, its fractional part satisfies (verified by test6.py):

$\begin{matrix} 2^{- 62} < 4 n - ⌊ 4 n ⌋ < 1 - 2^{- 62} \end{matrix}$

(167)

Consider the two cases in which $4 n$ is closest to 1:

$\begin{matrix} ⌊ 4 n ⌋ = 0 \Rightarrow 4 n - 0 < 1 - 2^{- 62} \Rightarrow ⌊ 2^{64} \cdot n ⌋ ⩽ 2^{62} - 2 \\ ⌊ 4 n ⌋ = 1 \Rightarrow 4 n - 1 > 2^{- 62} \Rightarrow ⌊ 2^{64} \cdot n ⌋ ⩾ 2^{62} + 1 \end{matrix}$

(168)

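Since $n ⩽ n_{r} < n + 2^{- 74}$ (equation (178)), $⌊ 2^{64} \cdot n_{r} ⌋$ equals either $⌊ 2^{64} \cdot n ⌋$ or $⌊ 2^{64} \cdot n ⌋ + 1$. It follows that:

$\begin{matrix} ⌊ 2^{64} \cdot n ⌋ \neq 2^{62} \; \land \; ⌊ 2^{64} \cdot n ⌋ + 1 \neq 2^{62} \\ \Rightarrow ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}$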

(169)

Therefore, the following condition can be used to quickly determine whether $n \neq 0.25$:

$\begin{matrix} \text{double}: n \neq 0.25 \text{ if } ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}$

(170)

In summary, for double, the following condition can be used to quickly determine whether $n = 0.25$:

$\begin{matrix} \text{double}: & n = 0.25 \text{ if } ⌊ 2^{64} \cdot n_{r} ⌋ = 2^{62} \\ \text{double}: & n \neq 0.25 \text{ if } ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}$

(171)
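The test in equation (171) amounts to a single integer comparison on the top 64 fractional bits of $n_{r}$. The Python sketch below models this with exact fractions; `top64` and `is_quarter` are hypothetical helper names, and the perturbation `eps` stands in for the difference $n_{r} - n < 2^{- 74}$ from equation (178).

```python
from fractions import Fraction
from math import floor

QUARTER_BITS = 1 << 62  # floor(2**64 * 0.25)


def top64(n_r: Fraction) -> int:
    """Return floor(2**64 * n_r) for 0 <= n_r < 1."""
    return floor(n_r * 2 ** 64)


def is_quarter(n_r: Fraction) -> bool:
    """Equation (171): n == 0.25 exactly when the top 64 bits equal 2**62."""
    return top64(n_r) == QUARTER_BITS


eps = Fraction(1, 2 ** 75)  # any perturbation below 2**-74
# Nearest values of n allowed by equation (167) on each side of 0.25.
below = Fraction(1, 4) - Fraction(1, 2 ** 64) - Fraction(1, 2 ** 80)
above = Fraction(1, 4) + Fraction(1, 2 ** 64) + Fraction(1, 2 ** 80)

print(is_quarter(Fraction(1, 4) + eps))              # n = 0.25
print(is_quarter(below + eps), is_quarter(above + eps))  # n != 0.25
```

The gap of $2^{- 64}$ on either side of 0.25, taken from equation (168), is what keeps the perturbed value from landing on $2^{62}$.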

For double, $o n e$ can be calculated quickly as follows:

$\begin{matrix} \text{double}: o n e = ⌊ \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 + ((n = 0.25) \; ? \; 0 : (2^{- 1} + \frac{6}{2^{64}})) ⌋ \end{matrix}$

(172)

The proof of equation (172) is as follows.

When $n = 0.25$, $⌊ \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 ⌋ = ⌊ 10 n ⌋ = 2$.

When $n \neq 0.25$, equation (172) is equivalent to the following:

$\begin{matrix} \text{double}: o n e = ⌊ \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}} ⌋ \end{matrix}$

(173)

According to the range of $10 n - ⌊ 10 n ⌋$, $o n e$ can be expressed as:

$\begin{matrix} \text{double}: o n e = \{\begin{matrix} ⌊ 10 n ⌋, & \text{if } 10 n - ⌊ 10 n ⌋ < 0.5 \\ 8, & \text{if } 10 n - ⌊ 10 n ⌋ = 0.5 \\ ⌊ 10 n ⌋ + 1, & \text{if } 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}$

(174)

Therefore, when $n \neq 0.25$, we need to prove that the following equation holds:

$\begin{matrix} ⌊ \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}} ⌋ = \{\begin{matrix} ⌊ 10 n ⌋, & \text{if } 10 n - ⌊ 10 n ⌋ < 0.5 \\ 8, & \text{if } 10 n - ⌊ 10 n ⌋ = 0.5 \\ ⌊ 10 n ⌋ + 1, & \text{if } 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}$

(175)

From the definition of the floor function, we have:

$\begin{matrix} \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \in (n_{r} - 2^{- 64}, n_{r}] \end{matrix}$

(176)

Because the following relations hold:

$\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} & = m + n \\ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} & = m + n_{r} \end{matrix}$

(177)

it follows that:

$\begin{matrix} n_{r} - n & = (r - 1) \cdot c \cdot 2^{q} \cdot 10^{- k - 1} \\ n_{r} & = (r - 1) \cdot (m + n) + n \\ \Rightarrow n ⩽ n_{r} & < 2^{- 127} \cdot c + n \\ n ⩽ n_{r} & < 2^{- 127} \cdot 2^{53} + n \\ n ⩽ n_{r} & < 2^{- 74} + n \end{matrix}$

(178)

From equations (176) and (178), it can be concluded that:

$\begin{matrix} \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \in (n - 2^{- 64}, n + 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 \in (10 n - 10 \cdot 2^{- 64}, 10 n + 10 \cdot 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 20 \in (20 n - 20 \cdot 2^{- 64}, 20 n + 20 \cdot 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 20 \in (⌊ 20 n ⌋ + n_{20} - 20 \cdot 2^{- 64}, ⌊ 20 n ⌋ + n_{20} + 20 \cdot 2^{- 74}) \end{matrix}$

(179)

We now determine the range of $x$ for which the following condition holds:

$\begin{matrix} ⌊ \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 20 + 1 + x ⌋ / / 2 = ⌊ 20 n + 1 ⌋ / / 2 = o n e \end{matrix}$

(180)

It is sufficient that $x$ satisfies:

$\begin{matrix} ⌊ 20 n ⌋ + n_{20} - 20 \cdot 2^{- 64} + 1 + x ⩾ ⌊ 20 n + 1 ⌋ & \Rightarrow x ⩾ 20 \cdot 2^{- 64} - n_{20} \\ ⌊ 20 n ⌋ + n_{20} + 20 \cdot 2^{- 74} + 1 + x < ⌊ 20 n + 2 ⌋ & \Rightarrow x < 1 - 20 \cdot 2^{- 74} - n_{20} \end{matrix}$

(181)

Let $x = 12 \cdot 2^{- 64}$. An exhaustive search identifies all floating-point numbers that violate the following condition:

$\begin{matrix} x = 12 \cdot 2^{- 64} ⩾ 20 \cdot 2^{- 64} - n_{20} \end{matrix}$

(182)

All floating-point numbers that do not meet condition (182) are as follows (in hexadecimal):

$\begin{matrix} \text{0xd17c0747bd76fa1}, \\ \text{0xd27c0747bd76fa1}, \\ \text{0x4d73de005bd620df}, \\ \text{0x4d83de005bd620df}, \\ \text{0x4d93de005bd620df} \end{matrix}$

(183)

Similarly, an exhaustive search identifies all floating-point numbers that violate the following condition:

$\begin{matrix} x = 12 \cdot 2^{- 64} < 1 - 20 \cdot 2^{- 74} - n_{20} \end{matrix}$

(184)

All floating-point numbers that do not meet condition (184) are as follows (in hexadecimal):

$\begin{matrix} \text{0x612491daad0ba280}, \\ \text{0x6159b651584e8b20}, \\ \text{0x619011f2d73116f4}, \\ \text{0x61c4166f8cfd5cb1}, \\ \text{0x61d4166f8cfd5cb1} \end{matrix}$

(185)

With $x = 12 \cdot 2^{- 64}$, we have:

$\begin{matrix} 2 (\frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}}) = \frac{⌊ 2^{64} n_{r} ⌋}{2^{64}} \cdot 20 + 1 + x \end{matrix}$

(186)

For every floating-point number not listed in (183) or (185), condition (181) is satisfied. For the floating-point numbers listed in (183) and (185), we tested the implementation directly, and it produced the correct result in every case, that is, the output satisfies the SW principle. The test script is test8.py.

In summary, equations (175) and (172) hold. Therefore, equation (172) can be used to quickly calculate $o n e$.
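Equation (172) maps naturally onto fixed-point integer arithmetic. The Python sketch below is an illustration under simplifying assumptions: `bits` stands for $⌊ 2^{64} n_{r} ⌋$, $n_{r}$ is taken equal to $n$ (that is, $r = 1$), and the function names are not taken from the xjb source code.

```python
from fractions import Fraction
from math import floor


def top64(n_r: Fraction) -> int:
    """Return floor(2**64 * n_r) for 0 <= n_r < 1."""
    return floor(n_r * 2 ** 64)


def one_fast(bits: int) -> int:
    """Equation (172): add 0 when n == 0.25, else 2**-1 + 6 * 2**-64, then floor."""
    offset = 0 if bits == (1 << 62) else (1 << 63) + 6
    return (bits * 10 + offset) >> 64


def one_exact(n: Fraction) -> int:
    """Reference value from equations (154) and (155)."""
    return floor(20 * n + 1) // 2 - (1 if n == Fraction(1, 4) else 0)


samples = [Fraction(1, 4), Fraction(3, 4), Fraction(3, 10),
           Fraction(13, 50), Fraction(1, 3), Fraction(349999, 1000000)]
for n in samples:
    assert one_fast(top64(n)) == one_exact(n)
```

Python integers are unbounded, so `bits * 10` needs no special handling here; in C or C++ the same product exceeds 64 bits and would need a wider intermediate type.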

3.6. Irregular Number

The number of irregular floating-point numbers is small: there are 2046 such double values and 254 such float values. The correctness of the algorithm code for these values can therefore be verified by exhaustive testing, and the derivation is not presented in this article. For the implementation, please refer to the source code.
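The counts above can be reproduced with a short enumeration. The sketch assumes that an irregular number is a positive normal value whose fraction field is zero (a power of two), which is an assumption made here because it matches the counts 2046 and 254. The round-trip check uses Python's built-in `repr` as a stand-in for the conversion routine, not the xjb implementation.

```python
import struct


def irregular_doubles():
    """Yield positive doubles with zero fraction field, biased exponent 1..2046."""
    for e in range(1, 2047):
        yield struct.unpack("<d", struct.pack("<Q", e << 52))[0]


def irregular_floats():
    """Yield positive floats with zero fraction field, biased exponent 1..254."""
    for e in range(1, 255):
        yield struct.unpack("<f", struct.pack("<I", e << 23))[0]


doubles = list(irregular_doubles())
floats = list(irregular_floats())
print(len(doubles), len(floats))

# Exhaustive round-trip check over the (assumed) irregular doubles.
assert all(float(repr(x)) == x for x in doubles)
```

Because the set is this small, an exhaustive comparison against a reference conversion takes negligible time.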

4. Benchmark Result

This article discusses only the binary to decimal part and does not discuss the decimal to string part. In the decimal to string part, the NEON instruction set is used on the arm64 architecture and SSE2 is used on the x86-64 architecture to accelerate the conversion; please refer to the source code for details. The benchmark project is available at https://github.com/xjb714/f2dec_bench. In the performance comparison, we measured the time taken by the algorithms listed in Table 1 to convert floating-point numbers to decimal results and to strings. Test process: generate $10^{7}$ random numbers excluding 0, NaN, and Inf, measure the total time spent converting all of them, and compute the average time for converting a single floating-point number to decimal and to string. All compilers use the options "-O3 -march=native". We conducted benchmark tests on two processors; the results are shown in Table 2, Table 3, Table 4, and Table 5.
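The test process described above can be sketched as a small timing harness. This is only an illustration of the methodology: the helper names are hypothetical, random 64-bit patterns are an assumed way of generating the inputs, and Python's built-in `repr` stands in for the routine under test, so the timings are not comparable with the tables in this paper.

```python
import math
import random
import struct
import time


def random_finite_doubles(count: int, seed: int = 0):
    """Draw random bit patterns and keep only nonzero, finite doubles."""
    rng = random.Random(seed)
    values = []
    while len(values) < count:
        x = struct.unpack("<d", struct.pack("<Q", rng.getrandbits(64)))[0]
        if x != 0.0 and math.isfinite(x):   # excludes 0, NaN, and Inf
            values.append(x)
    return values


def average_ns(convert, values) -> float:
    """Total conversion time divided by the number of values, in nanoseconds."""
    start = time.perf_counter_ns()
    for v in values:
        convert(v)
    return (time.perf_counter_ns() - start) / len(values)


if __name__ == "__main__":
    data = random_finite_doubles(10 ** 5)   # the paper uses 10**7 values
    print(f"{average_ns(repr, data):.1f} ns per conversion")
```

Generating the inputs before starting the timer keeps the cost of random number generation out of the measurement.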

Special note: teju_jagua only supports float/double to decimal, because its author has not implemented decimal to string. yy_double only supports double. Dragonbox_comp and fmt_comp denote the versions that use a compressed constant lookup table; Dragonbox_full and fmt_full denote the versions that use an uncompressed constant lookup table.

In the float/double to decimal comparison, the results produced by different algorithms may differ; for example, some results retain trailing zeros in the decimal significand. This comparison is therefore not entirely fair, and the results are for reference only. Since the AMD-R7 7840H supports AVX2 and AVX512, the entries marked AVX2 and AVX512 in the test results are those optimized by the clang compiler for these instruction sets. The benchmark results show that the algorithm in this paper outperforms the other algorithms in most cases. For the double to string algorithm in this paper, gcc generates less efficient code (for example, too many branch instructions), so the performance is lower than with the icpx compiler. The code generated by the clang compiler is almost the same as that generated by the icpx compiler, yet its measured performance is lower than expected. The reason is under investigation.

5. Conclusions and Future Work

This paper proposes a new floating-point number to string conversion algorithm. The algorithm improves the calculation process of the Schubfach [6] algorithm, reduces the number of multiplication operations, and optimizes some calculation steps. The algorithm has been implemented in C/C++ and has passed exhaustive tests. The benchmark results show that the algorithm outperforms most existing algorithms in most cases. Future work includes further optimization of the algorithm to improve performance, especially for parallel computing on the x86-64 and arm64 architectures, and compatibility with the MSVC compiler.

Acknowledgments

This study was funded by the Sichuan Science and Technology Program (Grant No. 2024ZDZX0001) and the Technology Development Program (JCKY2022110C119).

References

[1] G. L. Steele Jr. and J. L. White. How to Print Floating-Point Numbers Accurately. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI 1990. ACM, New York, NY, USA, 112-126.
[2] F. Loitsch. Printing Floating-Point Numbers Quickly and Accurately with Integers. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, PLDI 2010. ACM, New York, NY, USA, 233-243.
[3] M. Andrysco, R. Jhala, and S. Lerner. Printing Floating-Point Numbers: a Faster, Always Correct Method. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016. ACM, New York, NY, USA, 555-567.
[4] Ulf Adams. 2018. Ryū: Fast Float-to-String Conversion. In Proceedings of 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'18). ACM, New York, NY, USA, 13 pages.
[5] Ulf Adams. 2019. Ryū Revisited: Printf Floating Point Conversion. Proc. ACM Program. Lang. 3, OOPSLA, Article 169 (October 2019), 23 pages.
[6] R. Giulietti. The Schubfach Way to Render Doubles. 2020. https://drive.google.com/file/d/1KLtG_LaIbK9ETXI290zqCxvBW94dj058/view (Sep. 2020).
[7] J. Jeon. Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm. 2020. https://github.com/jk-jeon/Grisu-Exact/blob/master/other_files/Grisu-Exact.pdf (Sep. 2020).
[8] Junekey Jeon. 2024. Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm. https://github.com/jk-jeon/Dragonbox.
[9] Guo YaoYuan. https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c (Nov. 2024).
[10] Victor Zverovich. https://github.com/fmtlib/fmt (Oct. 2025).
[11] Guo YaoYuan. https://github.com/ibireme/yyjson (Aug. 2025).
[12] Cassio Neri. https://github.com/cassioneri/teju_jagua (Nov. 2025).

Table 1. All algorithms in the benchmark test.

algorithm	float	double	description
Schubfach	Schubfach32	Schubfach64	author: Raffaello Giulietti, https://github.com/c4f7fcce9cb06515/Schubfach
Schubfach_xjb	Schubfach32_xjb	Schubfach64_xjb	An improved version of Schubfach that produces the same output.
Ryu	Ryu32	Ryu64	author: Ulf Adams, https://github.com/ulfjack/ryu
Dragonbox	Dragonbox32	Dragonbox64	author: Junekey Jeon, https://github.com/jk-jeon/Dragonbox
fmt[10]	fmt32	fmt64	author: Victor Zverovich, https://github.com/fmtlib/fmt, version: 12.1.0
yy_double	-	yy_double	author: Guo YaoYuan, https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c
yy_json[11]	yy_json32	yy_json64	author: Guo YaoYuan, https://github.com/ibireme/yyjson, version: 0.12.0
teju_jagua[12]	teju32	teju64	author: Cassio Neri, https://github.com/cassioneri/teju_jagua
xjb	xjb32	xjb64	this paper, https://github.com/xjb714/xjb

* A dash (-) in the table indicates that no implementation is available.

Table 2. float/double to decimal benchmark results on AMD-R7 7840H and Ubuntu 24.04. The unit is nanoseconds (ns).

algorithm	float			double
	gcc 13.3	icpx 2025.0.4	clang 18.1.3	gcc 13.3	icpx 2025.0.4	clang 18.1.3
Schubfach	11.72	11.26	11.26	11.96	12.07	11.97
Schubfach_xjb	7.50	5.38	5.45	8.70	6.43	7.48
Ryu	14.23	14.21	14.48	13.73	13.53	13.59
Dragonbox	10.47	9.96	9.52	10.44	10.15	10.00
yy_json	6.31	4.84	4.69	7.20	6.10	6.25
yy_double	-	-	-	6.32	5.75	5.22
teju_jagua	13.73	14.90	14.39	13.38	15.23	13.59
xjb	3.38	2.88	4.19	7.15	3.90	3.60
			1.88(AVX2)			3.13(AVX2)
			1.68(AVX512)			3.47(AVX512)

Table 3. float/double to string benchmark results on AMD-R7 7840H and Ubuntu 24.04. The unit is nanoseconds (ns).

algorithm	float			double
	gcc 13.3	icpx 2025.0.4	clang 18.1.3	gcc 13.3	icpx 2025.0.4	clang 18.1.3
Schubfach	20.71	19.67	20.32	25.78	24.55	24.76
Schubfach_xjb	21.04	19.99	20.49	20.14	20.23	20.37
Ryu	22.57	20.58	20.62	26.36	24.71	24.95
Dragonbox_comp	21.73	20.65	22.84	21.69	21.37	22.72
Dragonbox_full	16.93	15.49	17.28	18.88	17.78	18.80
fmt_comp	22.45	22.81	21.76	26.35	27.14	26.78
fmt_full	23.05	23.31	22.21	25.67	27.28	26.13
yy_json	21.71	21.09	21.08	18.13	18.03	19.16
yy_double	-	-	-	17.54	17.52	17.85
xjb	9.04	9.02	8.78	12.15	9.12	15.78

Table 4. float/double to decimal benchmark results on Apple M1 and macOS 26.1. The unit is nanoseconds (ns).

algorithm	float	double
	apple clang 17.0.0	apple clang 17.0.0
Schubfach	10.94	12.62
Schubfach_xjb	5.92	6.58
Ryu	15.40	14.16
Dragonbox	11.94	12.03
yy_json	4.18	4.72
yy_double	-	4.74
teju_jagua	19.27	18.66
xjb	3.24	3.53

Table 5. float/double to string benchmark results on Apple M1 and macOS 26.1. The unit is nanoseconds (ns).

algorithm	float	double
	apple clang 17.0.0	apple clang 17.0.0
Schubfach	23.26	27.51
Schubfach_xjb	23.30	22.44
Ryu	25.27	29.15
Dragonbox_comp	28.55	27.28
Dragonbox_full	21.03	22.12
fmt_comp	37.19	41.73
fmt_full	36.36	41.77
yy_json	15.50	15.58
yy_double	-	15.13
xjb	8.91	10.32

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
