Preprint
Article

This version is not peer-reviewed.

xjb: Fast Float to String Algorithm

Submitted: 21 November 2025

Posted: 24 November 2025

Abstract
With the wide application of numerical computation and data exchange, converting floating-point numbers to decimal strings efficiently and accurately has become an important problem in computer science. Existing floating-point printing algorithms such as Ryū, Dragonbox, and Schubfach satisfy the Steele-White (SW) principle in terms of accuracy, but they still leave room for performance optimization, especially with respect to branch mispredictions and the overhead of high-precision multiplication. This paper proposes a novel floating-point to string conversion algorithm named "xjb", an improvement on the Schubfach algorithm that aims to further increase conversion efficiency. The algorithm is designed for IEEE754 single-precision (binary32) and double-precision (binary64) floating-point numbers. By reducing instruction dependencies, decreasing the number of multiplication operations, and minimizing branch mispredictions, it significantly improves performance. In addition, the algorithm supports parallel computing, and its core implementation is concise, portable, and scalable. We conducted extensive benchmark tests on multiple platforms, including AMD-R7 7840H and Apple M1, using different compilers (gcc, clang, icpx). The results show that the xjb algorithm outperforms the existing mainstream algorithms in most cases.

1. Introduction

In 1990, Steele and White [1] published the paper "How to Print Floating-Point Numbers Accurately" and proposed optimality criteria for floating-point printing algorithms (hereinafter referred to as the SW principle):
  • Information preservation: The printed result can be parsed back to the original floating-point number.
  • Minimum length: The printed result should be as short as possible.
  • Correct rounding: Subject to the first two criteria, if there are two candidate values, the result should be correctly rounded (i.e., the even value should be selected).
  • Generate from left to right: The digits of the printed result are generated from left to right.
A floating-point printing algorithm that satisfies the SW principle produces a unique, well-defined result for every floating-point number. Over the years, a variety of algorithms have been proposed, such as Grisu3 [2], Errol [3], Ryū [4][5], Schubfach [6], Grisu-Exact [7], Dragonbox [8], and yy_double [9].
The algorithm in this paper is based on the Schubfach algorithm and is inspired by algorithms such as yy_double and Dragonbox. This article covers only two floating-point types, IEEE754-binary32 and IEEE754-binary64. To simplify the presentation, float denotes IEEE754-binary32 and double denotes IEEE754-binary64. The Python code and the algorithm implementation referred to in this article are available at https://github.com/xjb714/xjb.

2. IEEE754 Floating Point Number Representation

Since the print result of a negative floating-point number only has one more negative sign than the print result of its absolute value, this article only discusses positive floating-point numbers and does not include special values such as 0, NaN, and Inf.
The IEEE754 double-precision floating-point number consists of 64 bits: 1 sign bit ($\mathit{sign}$), 11 exponent bits ($\mathit{exp}$), and 52 fraction bits ($\mathit{frac}$). The value of $\mathit{sign}$ is 0 or 1, the range of $\mathit{exp}$ is $[0, 2047]$, and the range of $\mathit{frac}$ is $[0, 2^{52}-1]$.
The IEEE754 single-precision floating-point number consists of 32 bits: 1 sign bit ($\mathit{sign}$), 8 exponent bits ($\mathit{exp}$), and 23 fraction bits ($\mathit{frac}$). The value of $\mathit{sign}$ is 0 or 1, the range of $\mathit{exp}$ is $[0, 255]$, and the range of $\mathit{frac}$ is $[0, 2^{23}-1]$.
When $\mathit{frac} = 0$, the number is called an irregular floating-point number.
The real value of the positive floating-point number $v$ can be expressed as follows:
$$
\begin{aligned}
\text{double:}\quad v &= \left(\mathit{frac} + (\mathit{exp} \ne 0\ ?\ 2^{52} : 0)\right) \cdot 2^{\max(\mathit{exp},\,1) - 1075} = c \cdot 2^{q} \\
\text{float:}\quad v &= \left(\mathit{frac} + (\mathit{exp} \ne 0\ ?\ 2^{23} : 0)\right) \cdot 2^{\max(\mathit{exp},\,1) - 150} = c \cdot 2^{q}
\end{aligned}
\tag{1}
$$
There are two cases in total. When $\mathit{exp}$ equals 0 (a subnormal floating-point number), we have:
$$
\begin{aligned}
\text{double:}\quad v &= \mathit{frac} \cdot 2^{-1074} \\
\text{float:}\quad v &= \mathit{frac} \cdot 2^{-149}
\end{aligned}
\tag{2}
$$
When $\mathit{exp}$ is not equal to 0 (a normal floating-point number), we have:
$$
\begin{aligned}
\text{double:}\quad v &= \left(\mathit{frac} + 2^{52}\right) \cdot 2^{\mathit{exp} - 1075} \\
\text{float:}\quad v &= \left(\mathit{frac} + 2^{23}\right) \cdot 2^{\mathit{exp} - 150}
\end{aligned}
\tag{3}
$$
Every real number in the rounding interval $R_v$ of a floating-point number is rounded to that floating-point number when parsed. $R_v$ is:
$$
\begin{aligned}
v_l &= \begin{cases} \left(c - \frac{1}{2}\right) \cdot 2^{q}, & \text{if } \mathit{frac} \ne 0 \text{ or } \mathit{exp} \le 1 \\ \left(c - \frac{1}{4}\right) \cdot 2^{q}, & \text{if } \mathit{frac} = 0 \end{cases} \\
v_r &= \left(c + \tfrac{1}{2}\right) \cdot 2^{q} \\
R_v &= \begin{cases} [v_l, v_r], & \text{if } \mathit{frac}\ \%\ 2 = 0 \\ (v_l, v_r), & \text{if } \mathit{frac}\ \%\ 2 = 1 \end{cases}
\end{aligned}
\tag{4}
$$
When the floating-point number is a regular floating-point number, $2^{q-1}$ is the rounding radius.
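The decomposition $v = c \cdot 2^{q}$ and the rounding interval $R_v$ can be illustrated with a short Python sketch for double. It is not taken from the xjb repository; the function names `decode_double` and `rounding_interval` are hypothetical, and the input is assumed to be a positive finite double.

```python
import struct
from fractions import Fraction


def decode_double(v: float):
    """Return (c, q, frac, exp) with v == c * 2**q, following equation (1)."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)          # low 52 bits
    exp = (bits >> 52) & 0x7FF             # 11 exponent bits
    c = frac + ((1 << 52) if exp != 0 else 0)
    q = max(exp, 1) - 1075
    return c, q, frac, exp


def rounding_interval(v: float):
    """Return (v_l, v_r, closed) as exact fractions, following equation (4)."""
    c, q, frac, exp = decode_double(v)
    ulp = Fraction(2) ** q
    # The lower half-gap shrinks to a quarter step for irregular values.
    lower = Fraction(1, 2) if (frac != 0 or exp <= 1) else Fraction(1, 4)
    v_l = (c - lower) * ulp
    v_r = (c + Fraction(1, 2)) * ulp
    closed = frac % 2 == 0                 # endpoints belong to R_v when frac is even
    return v_l, v_r, closed


c, q, frac, exp = decode_double(1.3)
print(c, q)
print(rounding_interval(1.3))
```

Exact rational arithmetic is used here only to make the interval easy to inspect; a production implementation would work on integers.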

3. Principle of Algorithm

Existing algorithms use a large number of branches, which easily leads to branch misprediction penalties, and they incur considerable high-precision multiplication overhead. The algorithm in this paper minimizes the cost of branch mispredictions and reduces the number of multiplication operations to improve performance. Moreover, the core code of the implementation is only about twenty lines long, and it also supports parallel computing. Printing a floating-point number is usually divided into two parts: the first converts the floating-point number to a decimal number, and the second converts the decimal number to a string. This article covers only the first part. All double-precision floating-point numbers are classified into two types: irregular values and regular values. An irregular value is one whose lower 52 bits are all 0, that is, $\mathit{frac} = 0$. There are 2046 valid irregular values in total ($\mathit{exp}$ ranges from 1 to 2046). All values other than the irregular ones are regular values. Similarly, there are 254 irregular values among single-precision floating-point numbers. When $\mathit{exp}$ is 0, the value is called a subnormal floating-point number.
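The valid range for $c$ and $q$ in regular floating-point numbers is:
$$
\begin{aligned}
\text{float:}\quad & \begin{cases} 1 \le c \le 2^{24}-1,\ c \ne 2^{23}; & q = -149 \\ 2^{23}+1 \le c \le 2^{24}-1; & -148 \le q \le 104 \end{cases} \\
\text{double:}\quad & \begin{cases} 1 \le c \le 2^{53}-1,\ c \ne 2^{52}; & q = -1074 \\ 2^{52}+1 \le c \le 2^{53}-1; & -1073 \le q \le 971 \end{cases}
\end{aligned}
\tag{5}
$$
The valid range for $c$ and $q$ in irregular floating-point numbers is:
$$
\begin{aligned}
\text{float:}\quad & c = 2^{23};\ -149 \le q \le 104 \\
\text{double:}\quad & c = 2^{52};\ -1074 \le q \le 971
\end{aligned}
\tag{6}
$$
The valid range for $c$ and $q$ in subnormal floating-point numbers is:
$$
\begin{aligned}
\text{float:}\quad & c \le 2^{23}-1;\ q = -149 \\
\text{double:}\quad & c \le 2^{52}-1;\ q = -1074
\end{aligned}
\tag{7}
$$
Floating-point numbers that do not fall within the subnormal range are called normal floating-point numbers.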
Regular floating-point numbers account for the vast majority of all possible floating-point values and are the case most worth discussing. Therefore, unless otherwise specified, only regular floating-point numbers are discussed below. Let $\mathit{opt}$ denote the optimal result, satisfying the SW principle, of converting the floating-point number $v$, where $d$ is a positive integer and $k$ is an integer:
$$
v = c \cdot 2^{q}, \qquad \mathit{opt} = d \cdot 10^{k}, \qquad \mathit{opt} \in R_v;\ d \in \mathbb{N}^{+};\ k \in \mathbb{Z}
\tag{8}
$$
For example, the IEEE754-binary64 floating-point number "1.3" has the real value 1.3000000000000000444089209850062616169452667236328125 and the hexadecimal representation 3FF4CCCCCCCCCCCD, so the $\mathit{opt}$ value that satisfies the SW principle is 1.3. The IEEE754-binary32 floating-point number "1.3" has the real value 1.2999999523162841796875 and the hexadecimal representation 3FA66666, so its $\mathit{opt}$ value that satisfies the SW principle is also 1.3.
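The two example values can be reproduced with the Python standard library. The following is only an illustrative check, using `struct` to obtain the bit patterns and `decimal.Decimal` to print the exact binary values; the variable names are chosen for this sketch.

```python
import struct
from decimal import Decimal

# Bit patterns of 1.3 as binary64 and binary32 (big-endian hex).
double_bits = struct.pack(">d", 1.3).hex()
float_bits = struct.pack(">f", 1.3).hex()

# Round 1.3 to binary32 and read it back as a Python float (binary64).
float_value = struct.unpack(">f", struct.pack(">f", 1.3))[0]

# Decimal(x) converts a float exactly, without rounding.
print(double_bits, Decimal(1.3))
print(float_bits, Decimal(float_value))
```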

3.1. Review of the Schubfach Algorithm and Derivation of the Algorithm in This Paper

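According to the Schubfach [6] algorithm, $d$ takes one of the following four values:
$$
10 \cdot \lfloor v \cdot 10^{-k-1} \rfloor,\quad \lfloor 10 \cdot v \cdot 10^{-k-1} \rfloor,\quad \lfloor 10 \cdot v \cdot 10^{-k-1} \rfloor + 1,\quad 10 \cdot \lfloor v \cdot 10^{-k-1} \rfloor + 10
\tag{9}
$$
The value of $k$ in equation (9) is calculated as follows:
$$
k = \begin{cases} \lfloor q \cdot \lg(2) \rfloor, & \text{if } v \in \mathit{regular} \\ \lfloor q \cdot \lg(2) - \lg(\tfrac{4}{3}) \rfloor, & \text{otherwise} \end{cases}
\tag{10}
$$
Within the range of float and double, equation (10) is equivalent to:
$$
k = \left(q \cdot 315653 - (v \in \mathit{regular}\ ?\ 0 : 131237)\right) \gg 20
\tag{11}
$$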
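The equivalence between equations (10) and (11) can be checked numerically. The sketch below is an illustration rather than the authors' test code: `k_fast` and `k_exact` are hypothetical names, and `k_exact` evaluates equation (10) with exact rational comparisons instead of floating-point logarithms. The scan shown covers the regular case over the full double range of $q$.

```python
import math
from fractions import Fraction

LG2 = math.log10(2)


def k_fast(q: int, regular: bool = True) -> int:
    """Equation (11): integer-only evaluation of k (>> is an arithmetic shift)."""
    return (q * 315653 - (0 if regular else 131237)) >> 20


def k_exact(q: int, regular: bool = True) -> int:
    """Equation (10): the largest k with 10**k <= 2**q (or 3/4 * 2**q if irregular)."""
    x = Fraction(2) ** q * (1 if regular else Fraction(3, 4))
    k = math.floor(q * LG2)                # floating-point guess, corrected below
    while Fraction(10) ** k > x:
        k -= 1
    while Fraction(10) ** (k + 1) <= x:
        k += 1
    return k


mismatches = [q for q in range(-1074, 972) if k_fast(q) != k_exact(q)]
print("regular mismatches:", mismatches)
```

Passing `regular=False` to both functions compares the irregular branch in the same way.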
Let $m$ be the integer part of $v \cdot 10^{-k-1}$ and $n$ its decimal part. Then:
$$
\lfloor v \cdot 10^{-k-1} \rfloor = m, \qquad v \cdot 10^{-k-1} = m + n, \qquad 0 \le n = v \cdot 10^{-k-1} - \lfloor v \cdot 10^{-k-1} \rfloor < 1
\tag{12}
$$
The decimal part of $v \cdot 10^{-k}$ is then:
$$
v \cdot 10^{-k} - \lfloor v \cdot 10^{-k} \rfloor = 10m + 10n - \lfloor 10m + 10n \rfloor = 10n - \lfloor 10n \rfloor
\tag{13}
$$
The possible values of $d$ obtained from equation (9) are:
$$
10m,\quad \lfloor 10(m + n) \rfloor,\quad \lfloor 10(m + n) \rfloor + 1,\quad 10m + 10
\tag{14}
$$
The possible values of $d$ in equation (14) can be simplified to:
$$
10m,\quad 10m + \lfloor 10n \rfloor,\quad 10m + \lfloor 10n \rfloor + 1,\quad 10m + 10
\tag{15}
$$
Among them, $10m$ is the minimum possible value and $10m + 10$ is the maximum possible value. Let $\mathit{ten}$ denote $10m$. With $d = \mathit{ten} + \mathit{one}$, there are four possible values for $\mathit{one}$:
$$
\mathit{ten} = 10m, \qquad \mathit{one} \in \{0,\ \lfloor 10n \rfloor,\ \lfloor 10n \rfloor + 1,\ 10\}, \qquad d = \mathit{ten} + \mathit{one}
\tag{16}
$$
Calculating $d$ is thus reduced to calculating $\mathit{ten}$ and $\mathit{one}$.
The final possible values of $d$ are as follows:
  • $10m$
    The result is $10m$ (equivalently, $\mathit{one} = 0$) when the floating-point number $v$ minus the minimum possible value $10m \cdot 10^{k}$ is less than the rounding radius $2^{q-1}$:
    $$
    \begin{aligned}
    & c \cdot 2^{q} - 10m \cdot 10^{k} < 2^{q-1} \\
    \Rightarrow\ & c \cdot 2^{q} - \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor \cdot 10^{k+1} < 2^{q-1} \\
    \Rightarrow\ & c \cdot 2^{q} \cdot 10^{-k-1} - \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor < 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
    \Rightarrow\ & n < 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
    \Rightarrow\ & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} > n
    \end{aligned}
    \tag{17}
    $$
    In the boundary case $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$, the condition $c\ \%\ 2 = 0$ must also be satisfied. Therefore:
    $$
    \text{if } 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} > n \ \text{ or } \ \left(2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n \ \&\&\ c\ \%\ 2 = 0\right):\ \mathit{one} = 0
    \tag{18}
    $$
  • $10m + 10$
    The result is $10m + 10$ (equivalently, $\mathit{one} = 10$) when the maximum possible value $(10m + 10) \cdot 10^{k}$ minus the floating-point number $v$ is less than the rounding radius $2^{q-1}$:
    $$
    \begin{aligned}
    & (10m + 10) \cdot 10^{k} - c \cdot 2^{q} < 2^{q-1} \\
    \Rightarrow\ & \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor \cdot 10^{k+1} + 10^{k+1} - c \cdot 2^{q} < 2^{q-1} \\
    \Rightarrow\ & \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor - c \cdot 2^{q} \cdot 10^{-k-1} + 1 < 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
    \Rightarrow\ & 1 - n < 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
    \Rightarrow\ & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} > 1 - n
    \end{aligned}
    \tag{19}
    $$
    In the boundary case $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$, the condition $c\ \%\ 2 = 0$ must also be satisfied. Therefore:
    $$
    \text{if } 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} > 1 - n \ \text{ or } \ \left(2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n \ \&\&\ c\ \%\ 2 = 0\right):\ \mathit{one} = 10
    \tag{20}
    $$
  • $10m + \lfloor 10n \rfloor$ or $10m + \lfloor 10n \rfloor + 1$
    When neither the condition for $d = 10m$ nor the condition for $d = 10m + 10$ is met, $d$ is either $10m + \lfloor 10n \rfloor$ or $10m + \lfloor 10n \rfloor + 1$. The final value is determined by the decimal part of $10n$: if the decimal part is exactly 0.5, the even candidate is selected; otherwise the nearest candidate is selected. For irregular floating-point numbers, it is also necessary to determine whether $10m + \lfloor 10n \rfloor$ is within the rounding interval $R_v$. If it is not, the result is $10m + \lfloor 10n \rfloor + 1$.
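In summary, the steps of the Schubfach variant proposed in this paper (xjb32 for float, xjb64 for double) are as follows:
$$
\begin{array}{rl}
 & \textbf{input: } c, q \qquad \textbf{output: } d, k \qquad (\text{convert } c \cdot 2^{q} \text{ to } d \cdot 10^{k}) \\
1 & v = c \cdot 2^{q} \\
2 & k = \lfloor q \cdot \lg(2) \rfloor \ \text{ if } v \in \mathit{regular} \ \text{ else } \ \lfloor q \cdot \lg(2) - \lg(\tfrac{4}{3}) \rfloor \\
3 & m = \lfloor v \cdot 10^{-k-1} \rfloor,\quad n = v \cdot 10^{-k-1} - m \\
4 & \mathit{ten} = 10m \\
5 & \text{if } 10n - \lfloor 10n \rfloor = 0.5:\ \mathit{one} = \lfloor 10n \rfloor \ \text{ if } \lfloor 10n \rfloor\ \%\ 2 = 0 \ \text{ else } \ \lfloor 10n \rfloor + 1 \\
6 & \text{if } 10n - \lfloor 10n \rfloor < 0.5:\ \mathit{one} = \lfloor 10n \rfloor \\
7 & \text{if } 10n - \lfloor 10n \rfloor > 0.5:\ \mathit{one} = \lfloor 10n \rfloor + 1 \\
8 & \text{if } v \in \mathit{irregular}: \\
9 & \quad \text{if } 10n - \lfloor 10n \rfloor > 2^{q-2} \cdot 10^{-k}:\ \mathit{one} = \lfloor 10n \rfloor + 1 \\
10 & \quad \text{if } 2^{q-2} \cdot 10^{-k-1} \ge n:\ \mathit{one} = 0 \\
11 & \text{else}: \\
12 & \quad \text{if } 2^{q-1} \cdot 10^{-k-1} > n \ \text{ or } \ (2^{q-1} \cdot 10^{-k-1} = n \ \&\&\ c\ \%\ 2 = 0):\ \mathit{one} = 0 \\
13 & \text{endif} \\
14 & \text{if } 2^{q-1} \cdot 10^{-k-1} > 1 - n \ \text{ or } \ (2^{q-1} \cdot 10^{-k-1} = 1 - n \ \&\&\ c\ \%\ 2 = 0):\ \mathit{one} = 10 \\
15 & d = \mathit{ten} + \mathit{one}
\end{array}
\tag{21}
$$
Algorithm process (21) applies to both float and double. Taking a floating-point number $v$ as input, $c$ and $q$ are extracted, and the results $d$ (line 15) and $k$ (line 2) are returned. The real value represented by the returned results is $d \cdot 10^{k}$, which conforms to the SW principle. The calculation of $k$ is relatively simple and can be obtained from (11). Therefore, the following focuses only on the fast calculation of $d$.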
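To make the steps of algorithm process (21) concrete, the following is a slow reference sketch for regular double values. It is written for illustration under several assumptions: it uses exact `Fraction` arithmetic instead of the lookup-table techniques described later, it handles only the regular branch (lines 1 to 7, 12, 14, and 15), and the names `decode_double` and `xjb_reference` are hypothetical rather than taken from the xjb source.

```python
import math
import struct
from fractions import Fraction


def decode_double(v: float):
    """Return (c, q, frac) with v == c * 2**q for a positive finite double."""
    bits = struct.unpack("<Q", struct.pack("<d", v))[0]
    frac = bits & ((1 << 52) - 1)
    exp = (bits >> 52) & 0x7FF
    return frac + ((1 << 52) if exp else 0), max(exp, 1) - 1075, frac


def xjb_reference(v: float):
    """Exact, unoptimized evaluation of process (21) for regular doubles."""
    c, q, frac = decode_double(v)
    if frac == 0:
        raise ValueError("irregular values are outside the scope of this sketch")
    k = (q * 315653) >> 20                        # line 2, regular case, via (11)
    scaled = c * Fraction(2) ** q * Fraction(10) ** (-k - 1)
    m = math.floor(scaled)                        # line 3
    n = scaled - m
    ten = 10 * m                                  # line 4
    low = math.floor(10 * n)
    rest = 10 * n - low
    if rest == Fraction(1, 2):                    # line 5: tie, pick the even digit
        one = low if low % 2 == 0 else low + 1
    elif rest < Fraction(1, 2):                   # line 6
        one = low
    else:                                         # line 7
        one = low + 1
    radius = Fraction(2) ** (q - 1) * Fraction(10) ** (-k - 1)
    even = c % 2 == 0
    if radius > n or (radius == n and even):      # line 12
        one = 0
    if radius > 1 - n or (radius == 1 - n and even):  # line 14
        one = 10
    return ten + one, k                           # line 15


for value in (1.3, 0.1, 1e23):
    print(value, xjb_reference(value))
```

Note that $d$ may carry trailing zeros (for example $d = 13 \cdot 10^{15}$ with $k = -16$ for 1.3); removing them belongs to the string-generation stage, which this article does not cover.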
The remainder of this section introduces algorithm process (21) in five parts:
  • The pre-computation of the algorithm's lookup table.
  • Quickly calculating $m$.
  • Quickly determining whether $\mathit{one} = 0$ or $\mathit{one} = 10$.
  • Quickly calculating $\lfloor 10n \rfloor$ and determining whether $\mathit{one} = \lfloor 10n \rfloor$ or $\mathit{one} = \lfloor 10n \rfloor + 1$ based on the decimal part of $10n$.
  • Processing of irregular floating-point numbers.

3.2. Pre-computation of Lookup Table

The algorithm in this paper uses a lookup table to store the values of $10^{-k-1}$ for $q$ in the range $[-149, 104]$ for float and $[-1074, 971]$ for double. Float uses a lookup table with 64-bit precision and double uses a lookup table with 128-bit precision. The code implementing this section is gen.py. Let $B$ be the bit length of a single entry in the lookup table: $B = 64$ for float and $B = 128$ for double. Let $e_{10}$ be an integer, $e_2$ a real number, and $f$ a real number with $1 \le f < 2$, such that:
$$
f \cdot 2^{\lfloor e_2 \rfloor} = 2^{e_2} = 10^{e_{10}}
\tag{22}
$$
Then:
$$
e_2 = e_{10} \cdot \log_2(10)
\tag{23}
$$
Solving for $f$ gives:
$$
f = \frac{10^{e_{10}}}{2^{\lfloor e_{10} \cdot \log_2(10) \rfloor}}
\tag{24}
$$
The lookup table is calculated as follows (rounding up):
$$
\mathit{lookup}[e_{10}] = \lceil f \cdot 2^{B-1} \rceil = \left\lceil \frac{10^{e_{10}}}{2^{\lfloor e_{10} \cdot \log_2(10) \rfloor}} \cdot 2^{B-1} \right\rceil = \left\lceil 10^{e_{10}} \cdot 2^{B - 1 - \lfloor e_{10} \cdot \log_2(10) \rfloor} \right\rceil
\tag{25}
$$
For float, $f \cdot 2^{B-1}$ in equation (25) is an integer when $0 \le e_{10} \le 27$. For double, $f \cdot 2^{B-1}$ in equation (25) is an integer when $0 \le e_{10} \le 55$. The detailed calculation process is as follows:
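  • Float
    From the range of $q$ in equation (5), the range of $-k-1$ is $[-32, 44]$, so the lookup table contains representations of the powers of 10 from $10^{-32}$ to $10^{44}$. The calculation process is as follows:
    $$
    \begin{aligned}
    & -32 \le e_{10} \le 44 \\
    & e_2 = \lfloor e_{10} \cdot \log_2(10) \rfloor - 63 \\
    & \mathit{pow10}_t = \begin{cases} 2^{-e_2}\ //\ 10^{-e_{10}}, & \text{if } e_{10} < 0 \\ 10^{e_{10}}\ //\ 2^{e_2}, & \text{if } e_{10} \ge 20 \\ 10^{e_{10}} \cdot 2^{-e_2}, & \text{if } 0 \le e_{10} \le 19 \end{cases} \\
    & f_{1,e_{10}} = \mathit{pow10} = \mathit{pow10}_t + \left((e_{10} \ge 0 \ \&\&\ e_{10} \le 27)\ ?\ 0 : 1\right)
    \end{aligned}
    \tag{26}
    $$
    When $0 \le e_{10} \le 27$, the value represented by the lookup table entry, $f_{1,e_{10}} \cdot 2^{\lfloor e_{10} \cdot \log_2(10) \rfloor - 63}$, is exactly equal to $10^{e_{10}}$. In all other cases, the relative error is less than $2^{-63}$. That is:
    $$
    r_{1,e_{10}} = \frac{f_{1,e_{10}} \cdot 2^{\lfloor e_{10} \cdot \log_2(10) \rfloor - 63}}{10^{e_{10}}} \begin{cases} = 1, & \text{if } 0 \le e_{10} \le 27 \\ \in (1,\ 1 + 2^{-63}), & \text{if } e_{10} < 0 \text{ or } e_{10} > 27 \end{cases}
    \tag{27}
    $$
  • Double
    From the range of $q$ in equation (5), the range of $-k-1$ is $[-293, 323]$, so the lookup table contains representations of the powers of 10 from $10^{-293}$ to $10^{323}$. The calculation process is as follows:
    $$
    \begin{aligned}
    & -293 \le e_{10} \le 323 \\
    & e_2 = \lfloor e_{10} \cdot \log_2(10) \rfloor - 127 \\
    & \mathit{pow10}_t = \begin{cases} 2^{-e_2}\ //\ 10^{-e_{10}}, & \text{if } e_{10} < 0 \\ 10^{e_{10}}\ //\ 2^{e_2}, & \text{if } e_{10} \ge 39 \\ 10^{e_{10}} \cdot 2^{-e_2}, & \text{if } 0 \le e_{10} \le 38 \end{cases} \\
    & f_{2,e_{10}} = \mathit{pow10} = \mathit{pow10}_t + \left((e_{10} \ge 0 \ \&\&\ e_{10} \le 55)\ ?\ 0 : 1\right)
    \end{aligned}
    \tag{28}
    $$
    When $0 \le e_{10} \le 55$, the value represented by the lookup table entry, $f_{2,e_{10}} \cdot 2^{\lfloor e_{10} \cdot \log_2(10) \rfloor - 127}$, is exactly equal to $10^{e_{10}}$. In all other cases, the relative error is less than $2^{-127}$. That is:
    $$
    r_{2,e_{10}} = \frac{f_{2,e_{10}} \cdot 2^{\lfloor e_{10} \cdot \log_2(10) \rfloor - 127}}{10^{e_{10}}} \begin{cases} = 1, & \text{if } 0 \le e_{10} \le 55 \\ \in (1,\ 1 + 2^{-127}), & \text{if } e_{10} < 0 \text{ or } e_{10} > 55 \end{cases}
    \tag{29}
    $$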
In the following, $r_1$ denotes any possible error of a lookup table value within the float range, $r_2$ denotes any possible error of a lookup table value within the double range, and $r$ denotes either of them. In algorithm process (21), an approximate representation of $10^{-k-1}$ is obtained from the lookup table. From equations (27) and (29), the lookup table value is error-free when $q$ is within the following range:
$$
\begin{aligned}
\text{float:}\quad & 0 \le -k-1 \le 27 \ \Leftrightarrow\ -93 \le q \le -1 \\
\text{double:}\quad & 0 \le -k-1 \le 55 \ \Leftrightarrow\ -186 \le q \le -1
\end{aligned}
\tag{30}
$$
When $q$ is not within the range of equation (30), the error of the value represented by the lookup table is bounded as follows:
$$
\begin{aligned}
\text{float:}\quad & 0 < r_1 - 1 < 2^{-63} \\
\text{double:}\quad & 0 < r_2 - 1 < 2^{-127}
\end{aligned}
\tag{31}
$$
This completes the description of the lookup table calculation. The float lookup table requires 616 bytes of storage, and the double lookup table requires 9872 bytes.
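The table construction can be sketched in a few lines of Python. This is an illustration written for this section, not the contents of gen.py: `floor_log2_pow10` and `make_table` are hypothetical names, and each entry is computed directly as the rounded-up value $\lceil 10^{e_{10}} \cdot 2^{B-1-\lfloor e_{10} \cdot \log_2(10) \rfloor} \rceil$ using exact rational arithmetic.

```python
import math
from fractions import Fraction


def floor_log2_pow10(e10: int) -> int:
    """floor(e10 * log2(10)), computed with integers only."""
    if e10 >= 0:
        return (10 ** e10).bit_length() - 1
    # 10**|e10| is never a power of two, so ceil(log2) equals bit_length.
    return -(10 ** -e10).bit_length()


def make_table(lo: int, hi: int, bits: int) -> dict:
    """Map e10 -> ceil(10**e10 * 2**(bits - 1 - floor(e10 * log2(10))))."""
    table = {}
    for e10 in range(lo, hi + 1):
        shift = bits - 1 - floor_log2_pow10(e10)
        table[e10] = math.ceil(Fraction(10) ** e10 * Fraction(2) ** shift)
    return table


table32 = make_table(-32, 44, 64)      # float: 64-bit entries
table64 = make_table(-293, 323, 128)   # double: 128-bit entries
print(len(table32) * 8, len(table64) * 16)   # table sizes in bytes
```

Entries with $0 \le e_{10} \le 27$ (float) are exact because $5^{27} < 2^{64}$; all other entries are rounded up by less than one unit in the last place.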

3.3. Quickly Calculate m

Relevant theorems (partially from the Dragonbox [8] paper): Let $n$, $P$, and $Q$ be positive integers, where $P$ and $Q$ are coprime, $P < Q$, $1 \le n \le n_{max}$, and $Q > n_{max}$. Let $P^{*}/Q^{*}$ be the best rational approximation greater than or equal to $P/Q$ and $P_{*}/Q_{*}$ the best rational approximation less than or equal to $P/Q$, with $Q^{*} \le n_{max}$ and $Q_{*} \le n_{max}$. If $Q$ does not divide $n \cdot P$, then:
$$
\left\lfloor \frac{n \cdot P}{Q} \right\rfloor + 1 = \left\lceil \frac{n \cdot P}{Q} \right\rceil
\tag{32}
$$
Suppose the following holds:
$$
\left\lfloor \frac{n \cdot P}{Q} \right\rfloor = \lfloor n \cdot \xi \rfloor
\tag{33}
$$
Then:
$$
\frac{P_{*}}{Q_{*}} = \max_{1 \le n \le n_{max}} \frac{\lfloor n \cdot P / Q \rfloor}{n} \le \xi < \min_{1 \le n \le n_{max}} \frac{\lfloor n \cdot P / Q \rfloor + 1}{n} = \min_{1 \le n \le n_{max}} \frac{\lceil n \cdot P / Q \rceil}{n} = \frac{P^{*}}{Q^{*}}
\tag{34}
$$
Therefore, the range of values for $\xi$ is:
$$
\frac{P_{*}}{Q_{*}} \le \xi < \frac{P^{*}}{Q^{*}}
\tag{35}
$$
The range of the decimal part of $\frac{n \cdot P}{Q}$ is:
$$
\left[ \frac{Q_{*} P\ \%\ Q}{Q},\ \frac{Q^{*} P\ \%\ Q}{Q} \right]
\tag{36}
$$
That is, the decimal part is smallest when $n = Q_{*}$ and largest when $n = Q^{*}$.
The best rational approximation function is defined as follows (it is implemented on line 15 of the test1.py file):
$$
(\mathit{DN},\ \mathit{UP}) = f(C, P, Q)
\tag{37}
$$
Function (37) calculates the best rational approximations with a denominator not exceeding $C$, based on the mediant property of the Farey sequence. $\mathit{DN}$ and $\mathit{UP}$ are two adjacent terms of the Farey sequence $F_C$ of order $C$.
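One simple way to realize a function with the interface of (37) is to repeatedly take mediants of a lower and an upper bound. The sketch below is an assumption-laden illustration and not the code of test1.py: the name `best_rational_bounds` is hypothetical, and the loop takes one mediant per step, so it is only suitable for small $C$ (a practical version would skip runs of steps as in the continued-fraction algorithm).

```python
from fractions import Fraction


def best_rational_bounds(C: int, P: int, Q: int):
    """Return (DN, UP): the neighbours of P/Q among fractions with denominator <= C.

    Assumes 0 < P < Q, gcd(P, Q) == 1 and Q > C, as in the theorem above.
    """
    lo_n, lo_d = 0, 1            # 0/1 lies below P/Q
    hi_n, hi_d = 1, 1            # 1/1 lies above P/Q
    while lo_d + hi_d <= C:
        med_n, med_d = lo_n + hi_n, lo_d + hi_d   # mediant of the two bounds
        if med_n * Q < P * med_d:                 # mediant < P/Q: tighten lower bound
            lo_n, lo_d = med_n, med_d
        else:                                     # mediant > P/Q (equality needs Q <= C)
            hi_n, hi_d = med_n, med_d
    return Fraction(lo_n, lo_d), Fraction(hi_n, hi_d)


dn, up = best_rational_bounds(10, 3, 17)
print(dn, up)
```

For $P/Q = 3/17$ and $C = 10$ the loop stops when the next mediant would be $2/11$, leaving the adjacent Farey terms $1/6$ and $1/5$.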
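In algorithm process (21), $m$ is calculated as $\lfloor v \cdot 10^{-k-1} \rfloor$ (line 3). It suffices to prove that the following equation holds:
$$
m = \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor c \cdot 2^{q} \cdot r \cdot 10^{-k-1} \rfloor
\tag{38}
$$
where $r$ is the error of the lookup table value, as defined in equations (27) and (29). When condition (30) is met, $r$ is 1 and equation (38) clearly holds. When $r$ is not 1, we have:
$$
\begin{aligned}
\text{float:}\quad & 1 < r < 1 + 2^{-63} \\
\text{double:}\quad & 1 < r < 1 + 2^{-127}
\end{aligned}
\tag{39}
$$
The range of $2^{q} \cdot 10^{-k-1}$ is obtained as follows:
$$
2^{q} \cdot 10^{-k-1} = 10^{-1} \cdot 10^{\,q \cdot \lg(2) - \lfloor q \cdot \lg(2) \rfloor}
\tag{40}
$$
When $q$ is not 0, the exponent in equation (40) satisfies:
$$
q \cdot \lg(2) \ne \lfloor q \cdot \lg(2) \rfloor \ \Rightarrow\ 0 < q \cdot \lg(2) - \lfloor q \cdot \lg(2) \rfloor < 1
\tag{41}
$$
When $q$ is 0, $q \cdot \lg(2) - \lfloor q \cdot \lg(2) \rfloor = 0$, so the final conclusion is:
$$
10^{-1} \le 2^{q} \cdot 10^{-k-1} < 1
\tag{42}
$$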
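Because:
$$
c \cdot 2^{q} \cdot 10^{-k-1} = \frac{c \cdot 2^{q-k-1}}{5^{k+1}} \in [0.1c,\ c)
\tag{43}
$$
we have:
$$
c \cdot 2^{q} \cdot 10^{-k-1} = \begin{cases} \dfrac{c \cdot 2^{q-k-1}}{5^{k+1}}, & q \ge 1 \\[2ex] \dfrac{c}{2^{1+k-q} \cdot 5^{k+1}} = \dfrac{c}{10}, & q = 0 \\[2ex] \dfrac{c \cdot 5^{-k-1}}{2^{1+k-q}}, & q < 0 \end{cases}
\tag{44}
$$
Let:
$$
c \cdot 2^{q} \cdot 10^{-k-1} = c \cdot \frac{x}{y} < c
\tag{45}
$$
Then:
$$
(x,\ y) = \begin{cases} (2^{q-k-1},\ 5^{k+1}), & q \ge 1 \\ (1,\ 10), & q = 0 \\ (5^{-k-1},\ 2^{1+k-q}), & q < 0 \end{cases}
\tag{46}
$$
Let:
$$
\begin{aligned}
\text{float:}\quad & c \le c_{max} = C_1 = 2^{24} - 1 \\
\text{double:}\quad & c \le c_{max} = C_2 = 2^{53} - 1
\end{aligned}
\tag{47}
$$
In the following, $C$ denotes $C_1$ within the float range and $C_2$ within the double range.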
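When $y > C$, calculate the $P^{*}$ and $Q^{*}$ corresponding to each $q$ by calling $f(C, x, y)$ according to function (37), and calculate the minimum $\mathit{BIT}$ value for which the following condition is met:
$$
\frac{x}{y} \cdot \left(1 + 2^{-\mathit{BIT}}\right) < \frac{P^{*}}{Q^{*}}
\tag{48}
$$
When $y \le C$, we have:
$$
c \cdot \frac{x}{y} \cdot \left(1 + \frac{1}{C \cdot y}\right) = \frac{c \cdot x + \frac{c}{C} \cdot \frac{x}{y}}{y} < \frac{c \cdot x + 1}{y}
\tag{49}
$$
Therefore:
$$
\left\lfloor c \cdot \frac{x}{y} \right\rfloor = \left\lfloor c \cdot \frac{x}{y} \cdot \left(1 + \frac{1}{C \cdot y}\right) \right\rfloor
\tag{50}
$$
Similarly, calculate the minimum $\mathit{BIT}$ value for which:
$$
\frac{x}{y} \cdot \left(1 + 2^{-\mathit{BIT}}\right) < \frac{x}{y} \cdot \left(1 + \frac{1}{C \cdot y}\right)
\tag{51}
$$
In summary, the maximum of the minimum $\mathit{BIT}$ values over all $q$ is as follows (the result is produced by the test1.py file, which runs in only about 1 to 2 seconds):
$$
\begin{aligned}
\text{float:}\quad & \mathit{BIT}_{max} = 52 \\
\text{double:}\quad & \mathit{BIT}_{max} = 113
\end{aligned}
\tag{52}
$$
Therefore, the following conclusions hold:
$$
\begin{aligned}
\text{float:}\quad & \left\lfloor c \cdot \frac{x}{y} \right\rfloor = \left\lfloor c \cdot \frac{x}{y} \cdot \left(1 + 2^{-52}\right) \right\rfloor = \left\lfloor c \cdot \frac{x}{y} \cdot r_1 \right\rfloor \\
\text{double:}\quad & \left\lfloor c \cdot \frac{x}{y} \right\rfloor = \left\lfloor c \cdot \frac{x}{y} \cdot \left(1 + 2^{-113}\right) \right\rfloor = \left\lfloor c \cdot \frac{x}{y} \cdot r_2 \right\rfloor
\end{aligned}
\tag{53}
$$
This completes the proof of equation (38). Once $m$ has been calculated, the value of $\mathit{ten} = 10m$ follows immediately.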
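The claim that $m$ can be obtained from a single multiplication by a rounded-up table entry can be spot-checked for float. The sketch below is illustrative only: `pow10_64`, `m_fast`, and `m_exact` are hypothetical names, the table entry is recomputed on the fly instead of being read from a stored table, and the shift amount is derived here from the scaling of the entry rather than quoted from the xjb source.

```python
import math
from fractions import Fraction


def floor_log2_pow10(e10: int) -> int:
    """floor(e10 * log2(10)), computed with integers only."""
    if e10 >= 0:
        return (10 ** e10).bit_length() - 1
    return -(10 ** -e10).bit_length()


def pow10_64(e10: int) -> int:
    """64-bit rounded-up representation of 10**e10."""
    return math.ceil(Fraction(10) ** e10 * Fraction(2) ** (63 - floor_log2_pow10(e10)))


def m_fast(c: int, q: int) -> int:
    """floor(c * 2**q * r * 10**(-k-1)) via one multiplication and one shift."""
    k = (q * 315653) >> 20
    e10 = -k - 1
    shift = 63 - q - floor_log2_pow10(e10)   # undoes the scaling of pow10_64
    return (c * pow10_64(e10)) >> shift


def m_exact(c: int, q: int) -> int:
    """floor(c * 2**q * 10**(-k-1)) with exact rational arithmetic."""
    k = (q * 315653) >> 20
    return math.floor(c * Fraction(2) ** q * Fraction(10) ** (-k - 1))


samples = ((1 << 23) + 1, 12345679, (1 << 24) - 1)
bad = [(c, q) for q in range(-149, 105) for c in samples if m_fast(c, q) != m_exact(c, q)]
print("mismatches:", bad)
```

A spot check over a few significands is of course weaker than the exhaustive argument given by the Farey-sequence bound; it only illustrates what equation (38) asserts.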

3.4. Quickly Determine Whether $\mathit{one} = 0$ or $\mathit{one} = 10$

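In algorithm process (21), the conditions for determining $\mathit{one} = 0$ and $\mathit{one} = 10$ are on lines 12 and 14. This section shows how to decide quickly, using equivalent conditions, whether $\mathit{one} = 0$ or $\mathit{one} = 10$ holds.
The case $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$ (line 12, $\mathit{one}$ might be 0) is equivalent to:
$$
\begin{aligned}
& c \cdot 2^{q} \cdot 10^{-k-1} - \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor = 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
\Leftrightarrow\ & (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} = \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor
\end{aligned}
\tag{54}
$$
The case $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$ (line 14, $\mathit{one}$ might be 10) is equivalent to:
$$
\begin{aligned}
& \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor - c \cdot 2^{q} \cdot 10^{-k-1} + 1 = 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \\
\Leftrightarrow\ & (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} = \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor + 1
\end{aligned}
\tag{55}
$$
From equation (42), we have:
$$
2^{q-1} \cdot 10^{-k-1} \in [0.05,\ 0.5)
\tag{56}
$$
Therefore:
$$
\lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor - 1 < c \cdot 2^{q} \cdot 10^{-k-1} - 0.5 < (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} \le c \cdot 2^{q} \cdot 10^{-k-1} - 0.05 < \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor + 1
\tag{57}
$$
Therefore, for equation (54), when $(2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is an integer, it must be equal to $\lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor$. Similarly, for equation (55), we have:
$$
\lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor < c \cdot 2^{q} \cdot 10^{-k-1} + 0.05 \le (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} < c \cdot 2^{q} \cdot 10^{-k-1} + 0.5 < \lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor + 2
\tag{58}
$$
Therefore, for equation (55), when $(2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is an integer, it must be equal to $\lfloor c \cdot 2^{q} \cdot 10^{-k-1} \rfloor + 1$.
In conclusion, the question reduces to whether $(2c \pm 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is an integer. We have:
$$
(2c \pm 1) \cdot 2^{q-1} \cdot 10^{-k-1} = (2c \pm 1) \cdot 2^{q-k-2} \cdot 5^{-k-1}
\tag{59}
$$
According to the range of $q$:
$$
\begin{cases} q - k - 2 \ge 0,\ -k-1 < 0; & q \ge 2 \\ q - k - 2 < 0,\ -k-1 < 0; & 0 \le q \le 1 \\ q - k - 2 < 0,\ -k-1 \ge 0; & q < 0 \end{cases}
\tag{60}
$$
Therefore, equation (59) is equivalent to:
$$
(2c \pm 1) \cdot 2^{q-1} \cdot 10^{-k-1} = \begin{cases} \dfrac{(2c \pm 1) \cdot 2^{q-k-2}}{5^{k+1}}, & q \ge 2 \\[2ex] \dfrac{2c \pm 1}{2^{2+k-q} \cdot 5^{k+1}}, & 0 \le q \le 1 \\[2ex] \dfrac{(2c \pm 1) \cdot 5^{-k-1}}{2^{2+k-q}}, & q < 0 \end{cases}
\tag{61}
$$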
The following cases are discussed according to the range of $q$:
  • $q \ge 2$
    From $q \ge 2$, we get $k \ge 0$. The question is whether $(2c \pm 1) \cdot 2^{q-k-2}$ is divisible by $5^{k+1}$. Since 2 and 5 are coprime, this is equivalent to asking whether $2c \pm 1$ is divisible by $5^{k+1}$:
    $$
    (2c \pm 1)\ \%\ 5^{k+1} = 0
    \tag{62}
    $$
    Let $t$ be a positive integer with:
    $$
    2c \pm 1 = t \cdot 5^{k+1};\quad t \ge 1
    \tag{63}
    $$
    Since $2c \pm 1$ is odd, $t$ is also odd. Because the following bounds hold:
    $$
    \begin{aligned}
    \text{float:}\quad & 2c - 1 \in [2^{24} + 1,\ 2^{25} - 3];\quad 2c + 1 \in [2^{24} + 3,\ 2^{25} - 1] \\
    \text{double:}\quad & 2c - 1 \in [2^{53} + 1,\ 2^{54} - 3];\quad 2c + 1 \in [2^{53} + 3,\ 2^{54} - 1]
    \end{aligned}
    \tag{64}
    $$
    the following is satisfied:
    $$
    \begin{aligned}
    \text{float:}\quad & 2^{24} + 1 \le t \cdot 5^{k+1} \le 2^{25} - 1 \\
    \text{double:}\quad & 2^{53} + 1 \le t \cdot 5^{k+1} \le 2^{54} - 1
    \end{aligned}
    \tag{65}
    $$
    Therefore:
    $$
    \begin{aligned}
    \text{float:}\quad & \frac{2^{24} + 1}{5^{k+1}} \le t \le \frac{2^{25} - 1}{5^{k+1}} \\
    \text{double:}\quad & \frac{2^{53} + 1}{5^{k+1}} \le t \le \frac{2^{54} - 1}{5^{k+1}}
    \end{aligned}
    \tag{66}
    $$
    In equation (66), the maximum value of $k$ for which $t$ can take at least one odd value is:
    $$
    \begin{aligned}
    \text{float:}\quad & k_{max} = 9 \ \Rightarrow\ q_{max} = 33,\ t = 3 \\
    \text{double:}\quad & k_{max} = 22 \ \Rightarrow\ q_{max} = 76,\ t = 1
    \end{aligned}
    \tag{67}
    $$
    Therefore, the maximum value of $k$ is 9 within the float range and 22 within the double range. When $k$ exceeds this bound, $2c \pm 1$ is not divisible by $5^{k+1}$.
  • $0 \le q \le 1$
    Because the denominator $2^{2+k-q} \cdot 5^{k+1}$ is even and the numerator $2c \pm 1$ is odd, the condition is not met.
  • $q < 0$
    Because the denominator $2^{2+k-q}$ is even and the numerator $(2c \pm 1) \cdot 5^{-k-1}$ is odd, the condition is not met.
In summary, the situations where $(2c \pm 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ may be an integer are as follows:
$$
\begin{aligned}
\text{float:}\quad & 2 \le q \le 33 \ \&\&\ (2c \pm 1)\ \%\ 5^{k+1} = 0 \\
\text{double:}\quad & 2 \le q \le 76 \ \&\&\ (2c \pm 1)\ \%\ 5^{k+1} = 0
\end{aligned}
\tag{68}
$$
In this case, the range of $-k-1$ is:
$$
\begin{aligned}
\text{float:}\quad & -10 \le -k-1 \le -1 \\
\text{double:}\quad & -23 \le -k-1 \le -1
\end{aligned}
\tag{69}
$$
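The integrality criterion derived above can be compared against a direct exact computation. The following float-range sketch is an illustration only; `boundary_is_integer_exact` and `boundary_is_integer_fast` are hypothetical names, `sign` is $+1$ or $-1$, and the sample significands are chosen so that at least some of them hit the divisibility condition.

```python
from fractions import Fraction


def boundary_is_integer_exact(c: int, q: int, sign: int) -> bool:
    """Is (2c + sign) * 2**(q-1) * 10**(-k-1) an integer? Exact evaluation."""
    k = (q * 315653) >> 20
    value = (2 * c + sign) * Fraction(2) ** (q - 1) * Fraction(10) ** (-k - 1)
    return value.denominator == 1


def boundary_is_integer_fast(c: int, q: int, sign: int) -> bool:
    """Divisibility test: q >= 2 and 5**(k+1) divides 2c + sign."""
    if q < 2:
        return False
    k = (q * 315653) >> 20
    return (2 * c + sign) % 5 ** (k + 1) == 0


# 2 * 14648438 - 1 == 3 * 5**10, the t = 3 case for k = 9.
samples = (8388612, 12345679, 14648438, (1 << 24) - 1)
disagreements = [
    (c, q, s)
    for q in range(-149, 105)
    for c in samples
    for s in (-1, 1)
    if boundary_is_integer_exact(c, q, s) != boundary_is_integer_fast(c, q, s)
]
print("disagreements:", disagreements)
```

The upper bound $q \le 33$ for float does not need to be tested separately in this sketch, because for larger $q$ the modulus $5^{k+1}$ already exceeds $2c + 1$.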
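When $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$, the following conclusions can be drawn:
$$
\begin{aligned}
\text{float:}\quad & 2^{35} \cdot 2^{q} \cdot 10^{-k-1} = n \cdot 2^{36} \ \Rightarrow\ \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor \\
\text{double:}\quad & 2^{63} \cdot 2^{q} \cdot 10^{-k-1} = n \cdot 2^{64} \ \Rightarrow\ \lfloor 2^{63} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{64} \rfloor
\end{aligned}
\tag{70}
$$
When $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$, the following conclusions can be drawn:
$$
\begin{aligned}
\text{float:}\quad & 2^{35} \cdot 2^{q} \cdot 10^{-k-1} = 2^{36} - n \cdot 2^{36} \ \Rightarrow\ \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor = 2^{36} - 1 - \lfloor n \cdot 2^{36} \rfloor \\
\text{double:}\quad & 2^{63} \cdot 2^{q} \cdot 10^{-k-1} = 2^{64} - n \cdot 2^{64} \ \Rightarrow\ \lfloor 2^{63} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{64} - n \cdot 2^{64} \rfloor = 2^{64} - 1 - \lfloor n \cdot 2^{64} \rfloor
\end{aligned}
\tag{71}
$$
Whether $\lfloor 2^{36} - n \cdot 2^{36} \rfloor = 2^{36} - 1 - \lfloor n \cdot 2^{36} \rfloor$ in equation (71) holds depends on whether $2^{36} \cdot n$ is an integer, which is equivalent to asking whether the following values are integers when equation (68) holds (the same applies to double):
$$
\begin{aligned}
\text{float:}\quad & c \cdot 2^{q+36} \cdot 10^{-k-1} = c \cdot 2^{q-k+35} \cdot 5^{-k-1} = \frac{c \cdot 2^{q-k+35}}{5^{k+1}} \\
\text{double:}\quad & c \cdot 2^{q+64} \cdot 10^{-k-1} = c \cdot 2^{q-k+63} \cdot 5^{-k-1} = \frac{c \cdot 2^{q-k+63}}{5^{k+1}}
\end{aligned}
\tag{72}
$$
Suppose $c$ is divisible by $5^{k+1}$ (where $t$ is a temporary integer variable):
$$
c = t \cdot 5^{k+1};\quad t \ge 1
\tag{73}
$$
If equation (73) held, we would have:
$$
2c \pm 1 = 2 \cdot t \cdot 5^{k+1} \pm 1
\tag{74}
$$
Expression (74) is not divisible by $5^{k+1}$, which contradicts equation (68), so $c$ is not divisible by $5^{k+1}$. Therefore, for float, $c \cdot 2^{q+36} \cdot 10^{-k-1}$ is not an integer, and for double, $c \cdot 2^{q+64} \cdot 10^{-k-1}$ is not an integer. That is:
$$
\begin{aligned}
\text{float:}\quad & \lfloor 2^{36} - 2^{36} \cdot n \rfloor = 2^{36} + \lfloor -2^{36} \cdot n \rfloor = 2^{36} - 1 - \lfloor 2^{36} \cdot n \rfloor \\
\text{double:}\quad & \lfloor 2^{64} - 2^{64} \cdot n \rfloor = 2^{64} + \lfloor -2^{64} \cdot n \rfloor = 2^{64} - 1 - \lfloor 2^{64} \cdot n \rfloor
\end{aligned}
\tag{75}
$$
Therefore, conclusion (71) is correct. Next, we show that $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor$ is a necessary and sufficient condition for $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$. The same applies to double:
$$
\begin{aligned}
\text{float:}\quad & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n \ \Leftrightarrow\ \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor \\
\text{double:}\quad & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n \ \Leftrightarrow\ \lfloor 2^{63} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{64} \rfloor
\end{aligned}
\tag{76}
$$
Similarly, $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor$ is a necessary and sufficient condition for $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$. The same applies to double:
$$
\begin{aligned}
\text{float:}\quad & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n \ \Leftrightarrow\ \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor \\
\text{double:}\quad & 2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n \ \Leftrightarrow\ \lfloor 2^{63} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{64} - n \cdot 2^{64} \rfloor
\end{aligned}
\tag{77}
$$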
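The left-to-right direction of equations (76) and (77) obviously holds. We now prove the other direction of equation (76). For float, it must be shown that $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$ holds whenever $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor$ holds, or equivalently that $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor \ne \lfloor n \cdot 2^{36} \rfloor$ whenever $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne n$. The proof is by contradiction.
Assume that $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor$ holds while $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne n$. Then:
$$
\begin{aligned}
& \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor \\
\Rightarrow\ & 0 < \left| 2^{35} \cdot 2^{q} \cdot 10^{-k-1} - 2^{36} \cdot n \right| < 1 \\
\Rightarrow\ & 0 < \left| (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m \right| < 2^{-36}
\end{aligned}
\tag{78}
$$
From equation (57), we know:
$$
m - 1 < (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} < m + 1
\tag{79}
$$
Let $n_{-}$ denote the decimal part of $(2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1}$. Then:
$$
\left| (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m \right| = \begin{cases} n_{-}, & \text{if } (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} > m \\ 1 - n_{-}, & \text{if } (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} < m \end{cases}
\tag{80}
$$
Substituting expression (80) into expression (78), we get:
$$
0 < \left| (2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m \right| < 2^{-36} \ \Rightarrow\ 0 < n_{-} < 2^{-36} \ \text{ or } \ 0 < 1 - n_{-} < 2^{-36}
\tag{81}
$$
The range of $n_{-}$ for double is obtained in the same way. Therefore:
$$
\begin{aligned}
\text{float:}\quad & n_{-} \in (0,\ 2^{-36}) \cup (1 - 2^{-36},\ 1) \\
\text{double:}\quad & n_{-} \in (0,\ 2^{-64}) \cup (1 - 2^{-64},\ 1)
\end{aligned}
\tag{82}
$$
When $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne n$, it is known from equation (54) that $(2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is not an integer. Therefore:
$$
0 < n_{-} < 1
\tag{83}
$$
It is only necessary to prove that equation (82) does not hold. We examine the range of the decimal part $n_{-}$ when $(2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is not an integer. According to equation (61):
$$
(2c - 1) \cdot 2^{q-1} \cdot 10^{-k-1} = (2c - 1) \cdot \frac{x}{y} = \begin{cases} \dfrac{(2c - 1) \cdot 2^{q-k-2}}{5^{k+1}}, & q \ge 2 \\[2ex] \dfrac{2c - 1}{2^{2+k-q} \cdot 5^{k+1}}, & 0 \le q \le 1 \\[2ex] \dfrac{(2c - 1) \cdot 5^{-k-1}}{2^{2+k-q}}, & q < 0 \end{cases}
\tag{84}
$$
The maximum value of $2c - 1$ is:
$$
\begin{aligned}
\text{float:}\quad & (2c - 1)_{max} = 2^{25} - 3 \\
\text{double:}\quad & (2c - 1)_{max} = 2^{54} - 3
\end{aligned}
\tag{85}
$$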
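The discussion is split according to the range of the denominator $y$ in equation (84).
  • $y \le (2c - 1)_{max}$
    When $y \le (2c - 1)_{max}$, with $y_{max}$ given by expression (85), the following holds:
    $$
    \frac{1}{y_{max}} \le n_{-} \le 1 - \frac{1}{y_{max}}, \qquad \frac{1}{y_{max}} \le 1 - n_{-} \le 1 - \frac{1}{y_{max}}
    \tag{86}
    $$
    Therefore, when $y \le (2c - 1)_{max}$, equation (82) does not hold.
  • $y > (2c - 1)_{max}$
    Call function (37) to calculate the lower and upper rational approximations $P_{*}/Q_{*}$ and $P^{*}/Q^{*}$ for every possible case:
    $$
    \left( \frac{P_{*}}{Q_{*}},\ \frac{P^{*}}{Q^{*}} \right) = f\left( (2c - 1)_{max},\ x,\ y \right)
    \tag{87}
    $$
    Therefore, the following conclusion for $n_{-}$ can be drawn from formula (36):
    $$
    n_{-} \in \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right]
    \tag{88}
    $$
    Exhausting all possibilities yields (the test code file is test3.py):
    $$
    \begin{aligned}
    \text{float:}\quad & 2^{-33} < n_{-} < 1 - 2^{-29} \\
    \text{double:}\quad & 2^{-62} < n_{-} < 1 - 2^{-63}
    \end{aligned}
    \tag{89}
    $$
    $$
    \begin{aligned}
    \text{float:}\quad & \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (0,\ 2^{-36}) = \emptyset, \quad \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (1 - 2^{-36},\ 1) = \emptyset \\
    \text{double:}\quad & \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (0,\ 2^{-64}) = \emptyset, \quad \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (1 - 2^{-64},\ 1) = \emptyset
    \end{aligned}
    \tag{90}
    $$
    Therefore, when $y > (2c - 1)_{max}$, equation (82) does not hold.
In summary, when $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne n$, equation (82) does not hold, that is, $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor \ne \lfloor n \cdot 2^{36} \rfloor$ must hold. Consequently, when $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor n \cdot 2^{36} \rfloor$ holds, $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = n$ must hold. Therefore, equation (76) holds.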
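Similarly, it can be proved that when $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor$ holds, $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$ must hold. The same applies to double. Again by contradiction, for float, assume that $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - 2^{36} \cdot n \rfloor$ holds while $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne 1 - n$. That is:
$$
\begin{aligned}
& \lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor \\
\Rightarrow\ & 0 < \left| 2^{35} \cdot 2^{q} \cdot 10^{-k-1} - 2^{36} + 2^{36} \cdot n \right| < 1 \\
\Rightarrow\ & 0 < \left| 2^{q-1} \cdot 10^{-k-1} - 1 + n \right| < 2^{-36} \\
\Rightarrow\ & -2^{-36} < (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m - 1 < 2^{-36}
\end{aligned}
\tag{91}
$$
From equation (58), we know:
$$
m < (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} < m + 2
\tag{92}
$$
Let $n_{+}$ denote the decimal part of $(2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1}$. Then:
$$
\left| (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m - 1 \right| = \begin{cases} n_{+}, & \text{if } (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} > m + 1 \\ 1 - n_{+}, & \text{if } (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} < m + 1 \end{cases}
\tag{93}
$$
Substituting expression (93) into expression (91), we get:
$$
0 < \left| (2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} - m - 1 \right| < 2^{-36} \ \Rightarrow\ 0 < 1 - n_{+} < 2^{-36} \ \text{ or } \ 0 < n_{+} < 2^{-36}
\tag{94}
$$
The range of $n_{+}$ for double is obtained in the same way. Therefore:
$$
\begin{aligned}
\text{float:}\quad & n_{+} \in (0,\ 2^{-36}) \cup (1 - 2^{-36},\ 1) \\
\text{double:}\quad & n_{+} \in (0,\ 2^{-64}) \cup (1 - 2^{-64},\ 1)
\end{aligned}
\tag{95}
$$
When $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne 1 - n$, it is known from equation (55) that $(2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is not an integer.
Therefore:
$$
0 < n_{+} < 1
\tag{96}
$$
It is only necessary to prove that equation (95) does not hold. We examine the range of the decimal part $n_{+}$ when $(2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1}$ is not an integer. According to equation (61):
$$
(2c + 1) \cdot 2^{q-1} \cdot 10^{-k-1} = (2c + 1) \cdot \frac{x}{y} = \begin{cases} \dfrac{(2c + 1) \cdot 2^{q-k-2}}{5^{k+1}}, & q \ge 2 \\[2ex] \dfrac{2c + 1}{2^{2+k-q} \cdot 5^{k+1}}, & 0 \le q \le 1 \\[2ex] \dfrac{(2c + 1) \cdot 5^{-k-1}}{2^{2+k-q}}, & q < 0 \end{cases}
\tag{97}
$$
The maximum value of $2c + 1$ is:
$$
\begin{aligned}
\text{float:}\quad & (2c + 1)_{max} = 2^{25} - 1 \\
\text{double:}\quad & (2c + 1)_{max} = 2^{54} - 1
\end{aligned}
\tag{98}
$$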
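The discussion is split according to the range of the denominator $y$ in equation (97).
  • $y \le (2c + 1)_{max}$
    When $y \le (2c + 1)_{max}$, with $y_{max}$ given by expression (98), the following holds:
    $$
    \frac{1}{y_{max}} \le n_{+} \le 1 - \frac{1}{y_{max}}, \qquad \frac{1}{y_{max}} \le 1 - n_{+} \le 1 - \frac{1}{y_{max}}
    \tag{99}
    $$
    Therefore, when $y \le (2c + 1)_{max}$, equation (95) does not hold.
  • $y > (2c + 1)_{max}$
    Call function (37) to calculate the lower and upper rational approximations $P_{*}/Q_{*}$ and $P^{*}/Q^{*}$ for every possible case:
    $$
    \left( \frac{P_{*}}{Q_{*}},\ \frac{P^{*}}{Q^{*}} \right) = f\left( (2c + 1)_{max},\ x,\ y \right)
    \tag{100}
    $$
    Therefore, the following conclusion for $n_{+}$ can be drawn from formula (36):
    $$
    n_{+} \in \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right]
    \tag{101}
    $$
    Exhausting all possibilities yields (the test code file is test7.py):
    $$
    \begin{aligned}
    \text{float:}\quad & 2^{-33} < n_{+} < 1 - 2^{-29} \\
    \text{double:}\quad & 2^{-62} < n_{+} < 1 - 2^{-63}
    \end{aligned}
    \tag{102}
    $$
    $$
    \begin{aligned}
    \text{float:}\quad & \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (0,\ 2^{-36}) = \emptyset, \quad \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (1 - 2^{-36},\ 1) = \emptyset \\
    \text{double:}\quad & \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (0,\ 2^{-64}) = \emptyset, \quad \left[ \frac{Q_{*} x\ \%\ y}{y},\ \frac{Q^{*} x\ \%\ y}{y} \right] \cap (1 - 2^{-64},\ 1) = \emptyset
    \end{aligned}
    \tag{103}
    $$
    Therefore, when $y > (2c + 1)_{max}$, equation (95) does not hold.
In summary, when $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} \ne 1 - n$, equation (95) does not hold, that is, $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor \ne \lfloor 2^{36} - n \cdot 2^{36} \rfloor$ must hold. Consequently, when $\lfloor 2^{35} \cdot 2^{q} \cdot 10^{-k-1} \rfloor = \lfloor 2^{36} - n \cdot 2^{36} \rfloor$ holds, $2^{-1} \cdot 2^{q} \cdot 10^{-k-1} = 1 - n$ must hold. Therefore, equation (77) holds.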
The following conclusions hold:
f l o a t : 2 36 n · 2 36 = 2 36 1 2 36 · n ; if c · 2 36 + q · 10 k 1 Z 2 36 2 36 · n ; if c · 2 36 + q · 10 k 1 Z d o u b l e : 2 64 n · 2 64 = 2 64 1 2 64 · n ; if c · 2 64 + q · 10 k 1 Z 2 64 2 64 · n ; if c · 2 64 + q · 10 k 1 Z
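We now discuss whether the following equation (105) holds when conditions (68) and (69) are met:
$$\begin{aligned}\text{float: } & \left\lfloor \frac{c\cdot 2^{q+35-k}}{5^{k+1}} \right\rfloor = \left\lfloor \frac{c\cdot 2^{q+35-k}}{5^{k+1}}\cdot r \right\rfloor = \left\lfloor \frac{c\cdot 2^{q+35-k}}{5^{k+1}}\cdot \frac{2^{63-\lfloor(-k-1)\cdot\log_2(10)\rfloor} \,//\, 10^{k+1} + 1}{10^{-k-1}\cdot 2^{63-\lfloor(-k-1)\cdot\log_2(10)\rfloor}} \right\rfloor \\ \text{double: } & \left\lfloor \frac{c\cdot 2^{q+63-k}}{5^{k+1}} \right\rfloor = \left\lfloor \frac{c\cdot 2^{q+63-k}}{5^{k+1}}\cdot r \right\rfloor = \left\lfloor \frac{c\cdot 2^{q+63-k}}{5^{k+1}}\cdot \frac{2^{127-\lfloor(-k-1)\cdot\log_2(10)\rfloor} \,//\, 10^{k+1} + 1}{10^{-k-1}\cdot 2^{127-\lfloor(-k-1)\cdot\log_2(10)\rfloor}} \right\rfloor\end{aligned} \tag{105}$$
We have:
$$\begin{aligned}\text{float: } & \left\lfloor \frac{c\cdot 2^{q+35-k}}{5^{k+1}} \right\rfloor = \lfloor 2^{36}\cdot(m+n)\rfloor = 2^{36} m + \lfloor 2^{36} n\rfloor \\ \text{double: } & \left\lfloor \frac{c\cdot 2^{q+63-k}}{5^{k+1}} \right\rfloor = \lfloor 2^{64}\cdot(m+n)\rfloor = 2^{64} m + \lfloor 2^{64} n\rfloor\end{aligned} \tag{106}$$
It has been proven earlier that $m$ can be calculated exactly. Then, when (1-118) holds, the values $\lfloor 2^{36} n\rfloor$ and $\lfloor 2^{64} n\rfloor$ on the right side of equations (70) and (71) can be calculated exactly.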
From equation (63), we have:
$$c = \frac{t\cdot 5^{k+1} \mp 1}{2} \tag{107}$$
Substituting equation (107) into equation (105), we have:
$$\begin{aligned}\text{float: } & \frac{c\cdot 2^{q+35-k}}{5^{k+1}} = t\cdot 2^{q+34-k} \mp \frac{2^{q+34-k}}{5^{k+1}} \\ \text{double: } & \frac{c\cdot 2^{q+63-k}}{5^{k+1}} = t\cdot 2^{q+62-k} \mp \frac{2^{q+62-k}}{5^{k+1}}\end{aligned} \tag{108}$$
When conditions (68) and (69) are met, $t\cdot 2^{q+34-k}$ and $t\cdot 2^{q+62-k}$ are integers. Under condition (68), the fractional part of expression (108) is:
$$\begin{aligned}\text{float: } & \frac{2^{q+34-k} \,\%\, 5^{k+1}}{5^{k+1}};\quad 2 \le q \le 33 \\ \text{double: } & \frac{2^{q+62-k} \,\%\, 5^{k+1}}{5^{k+1}};\quad 2 \le q \le 76\end{aligned} \tag{109}$$
For equation (105) to hold, it is sufficient to prove that the fractional part of the left-hand value $\frac{c\cdot 2^{q+35-k}}{5^{k+1}}$, plus the amount by which the right-hand value $\frac{c\cdot 2^{q+35-k}}{5^{k+1}}\cdot r$ exceeds it, is less than 1. That is:
$$\begin{aligned}\text{float: } & \frac{2^{q+34-k} \,\%\, 5^{k+1}}{5^{k+1}} + \frac{c\cdot 2^{q+35-k}}{5^{k+1}}\cdot r - \frac{c\cdot 2^{q+35-k}}{5^{k+1}} < 1 \\ \text{double: } & \frac{2^{q+62-k} \,\%\, 5^{k+1}}{5^{k+1}} + \frac{c\cdot 2^{q+63-k}}{5^{k+1}}\cdot r - \frac{c\cdot 2^{q+63-k}}{5^{k+1}} < 1\end{aligned} \tag{110}$$
By exhaustively computing the maximum possible value of $c$ for each $q$ and substituting it into equation (110), we find that the inequality always holds over both the float range and the double range (the computation is in test2.py). Therefore, equation (105) holds, and the values $\lfloor 2^{36} n\rfloor$ and $\lfloor 2^{64} n\rfloor$ on the right side of equations (70) and (71) can be calculated exactly. The values $\lfloor 2^{35}\cdot 2^{q}\cdot 10^{-k-1}\rfloor$ and $\lfloor 2^{63}\cdot 2^{q}\cdot 10^{-k-1}\rfloor$ on the left side of equations (70) and (71) can be obtained from lookup tables:
$$\begin{aligned}\text{float: } & \lfloor 2^{35}\cdot 2^{q}\cdot 10^{-k-1}\rfloor = \mathit{pow10} \gg \big(28 - q - \lfloor(-k-1)\cdot\log_2(10)\rfloor\big) \\ \text{double: } & \lfloor 2^{63}\cdot 2^{q}\cdot 10^{-k-1}\rfloor = \mathit{pow10} \gg \big(64 - q - \lfloor(-k-1)\cdot\log_2(10)\rfloor\big)\end{aligned} \tag{111}$$
The code for verifying equation (111) is in test4.py. Therefore, when conditions (68) and (69) are met, both sides of equations (70) and (71) can be calculated exactly.
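We now discuss the relationship between the following two values over the whole floating-point range:
$$\begin{aligned}\text{float: } & \lfloor c\cdot 2^{q+36}\cdot 10^{-k-1}\rfloor;\quad \lfloor c\cdot 2^{q+36}\cdot r\cdot 10^{-k-1}\rfloor \\ \text{double: } & \lfloor c\cdot 2^{q+64}\cdot 10^{-k-1}\rfloor;\quad \lfloor c\cdot 2^{q+64}\cdot r\cdot 10^{-k-1}\rfloor\end{aligned} \tag{112}$$
When $r = 1$, the two values in expression (112) are obviously equal. When $r \ne 1$, or equivalently $r > 1$, we have:
$$\begin{aligned}\text{float: } & c\cdot 2^{q+36}\cdot r\cdot 10^{-k-1} = c\cdot 2^{q+36}\cdot 10^{-k-1} + c\cdot 2^{q+36}\cdot (r-1)\cdot 10^{-k-1} \\ & < c\cdot 2^{q+36}\cdot 10^{-k-1} + 2^{24}\cdot 2^{36}\cdot 2^{q}\cdot 10^{-k-1}\cdot (r-1) < c\cdot 2^{q+36}\cdot 10^{-k-1} + 2^{-3} \\ & \Rightarrow \lfloor c\cdot 2^{q+36}\cdot r\cdot 10^{-k-1}\rfloor \le \lfloor c\cdot 2^{q+36}\cdot 10^{-k-1}\rfloor + 1 \\ \text{double: } & c\cdot 2^{q+64}\cdot r\cdot 10^{-k-1} = c\cdot 2^{q+64}\cdot 10^{-k-1} + c\cdot 2^{q+64}\cdot (r-1)\cdot 10^{-k-1} \\ & < c\cdot 2^{q+64}\cdot 10^{-k-1} + 2^{53}\cdot 2^{64}\cdot 2^{q}\cdot 10^{-k-1}\cdot (r-1) < c\cdot 2^{q+64}\cdot 10^{-k-1} + 2^{-10} \\ & \Rightarrow \lfloor c\cdot 2^{q+64}\cdot r\cdot 10^{-k-1}\rfloor \le \lfloor c\cdot 2^{q+64}\cdot 10^{-k-1}\rfloor + 1\end{aligned} \tag{113}$$
Therefore:
$$\begin{aligned}\text{float: } & 0 \le \lfloor c\cdot 2^{q+36}\cdot r\cdot 10^{-k-1}\rfloor - \lfloor c\cdot 2^{q+36}\cdot 10^{-k-1}\rfloor \le 1 \\ \text{double: } & 0 \le \lfloor c\cdot 2^{q+64}\cdot r\cdot 10^{-k-1}\rfloor - \lfloor c\cdot 2^{q+64}\cdot 10^{-k-1}\rfloor \le 1\end{aligned} \tag{114}$$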
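Because:
$$\lfloor c\cdot 2^{q}\cdot 10^{-k-1}\rfloor = \lfloor c\cdot 2^{q}\cdot r\cdot 10^{-k-1}\rfloor = m \tag{115}$$
$$\begin{aligned}\text{float: } & \lfloor c\cdot 2^{q+36}\cdot 10^{-k-1}\rfloor = 2^{36} m + \lfloor 2^{36} n\rfloor \\ \text{double: } & \lfloor c\cdot 2^{q+64}\cdot 10^{-k-1}\rfloor = 2^{64} m + \lfloor 2^{64} n\rfloor\end{aligned} \tag{116}$$
Define:
$$n_r = c\cdot 2^{q}\cdot r\cdot 10^{-k-1} - m \tag{117}$$
Therefore, the following conclusion can be drawn: when condition (68) is met, from equation (105), we have:
$$\begin{aligned}\text{float: } & 2 \le q \le 33 \ \&\&\ (2c \pm 1) \,\%\, 5^{k+1} = 0 \;\Rightarrow\; \lfloor 2^{36}\cdot n\rfloor = \lfloor 2^{36}\cdot n_r\rfloor \\ \text{double: } & 2 \le q \le 76 \ \&\&\ (2c \pm 1) \,\%\, 5^{k+1} = 0 \;\Rightarrow\; \lfloor 2^{64}\cdot n\rfloor = \lfloor 2^{64}\cdot n_r\rfloor\end{aligned} \tag{118}$$
Over the whole floating-point range, the following holds:
$$\begin{aligned}\text{float: } & \lfloor 2^{36}\cdot n\rfloor \le \lfloor 2^{36}\cdot n_r\rfloor \le \lfloor 2^{36}\cdot n\rfloor + 1 \\ \text{double: } & \lfloor 2^{64}\cdot n\rfloor \le \lfloor 2^{64}\cdot n_r\rfloor \le \lfloor 2^{64}\cdot n\rfloor + 1\end{aligned} \tag{119}$$
To simplify the expressions, $\mathit{even}$ is used to indicate whether $c$ is an even number:
$$\mathit{even} = (c+1) \,\%\, 2 \in \{0, 1\} \tag{120}$$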
The following presents the proof for the float range, from equation (121) to equation (144). The equality $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = n$ is the boundary condition for $\mathit{one} = 0$, and $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = 1-n$ is the boundary condition for $\mathit{one} = 10$. In these boundary cases, whether $\mathit{one}$ takes the value 0 or 10 is determined by whether $c$ is even. Therefore, the following holds:
$$\text{float: } \begin{cases} \mathit{one} = 0: & \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > \lfloor n_r\cdot 2^{36}\rfloor \\ \mathit{one} = 10: & \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > 2^{36} - 1 - \lfloor n_r\cdot 2^{36}\rfloor \end{cases} \tag{121}$$
Therefore, when $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = 1-n$, we can use condition (122) to determine whether $\mathit{one} = 0$ or $\mathit{one} = 10$.
$$\text{float: } \begin{cases} \text{if } \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > \lfloor n_r\cdot 2^{36}\rfloor: & \mathit{one} = 0 \\ \text{if } \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > 2^{36} - 1 - \lfloor n_r\cdot 2^{36}\rfloor: & \mathit{one} = 10 \end{cases} \tag{122}$$
When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > 1-n$, condition (122) also determines that $\mathit{one} = 0$ or $\mathit{one} = 10$, respectively. When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < 1-n$, condition (122) also determines that $\mathit{one} \ne 0$ or $\mathit{one} \ne 10$, respectively. The proof is as follows:
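When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < n$, we must have $\mathit{one} \ne 0$, and:
$$2^{-1}\cdot 2^{q}\cdot 10^{-k-1} - n = n_{-} - 1 \in \left(2^{-33} - 1,\ -2^{-29}\right) \tag{123}$$
Therefore:
$$2^{q+35}\cdot 10^{-k-1} - 2^{36}\cdot n \in \left(2^{3} - 2^{36},\ -2^{7}\right) \tag{124}$$
For any two real numbers $a$ and $b$, the following relationship holds:
$$0 \le b - \lfloor b\rfloor < 1 \;\Rightarrow\; \lfloor a\rfloor - a - 1 < \lfloor b\rfloor - b < 1 + \lfloor a\rfloor - a \;\Rightarrow\; \lfloor a\rfloor - \lfloor b\rfloor - 1 < a - b < \lfloor a\rfloor - \lfloor b\rfloor + 1 \tag{125}$$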
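Relationship (125) says that the difference of two floors is within 1 of the difference of the real numbers themselves. A small sketch that spot-checks it with exact rationals follows; the helper name `floor_gap_bounds` and the random sampling are illustrative assumptions and do not replace the algebraic argument.

```python
from fractions import Fraction
import math
import random


def floor_gap_bounds(a: Fraction, b: Fraction) -> tuple[int, int]:
    """Return the open interval from (125) that must contain a - b."""
    base = math.floor(a) - math.floor(b)
    return base - 1, base + 1


rng = random.Random(0)  # fixed seed so the check is reproducible
for _ in range(1000):
    a = Fraction(rng.randint(-10**6, 10**6), rng.randint(1, 1000))
    b = Fraction(rng.randint(-10**6, 10**6), rng.randint(1, 1000))
    lo, hi = floor_gap_bounds(a, b)
    assert lo < a - b < hi
```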
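When $a = 2^{q+35}\cdot 10^{-k-1}$ and $b = 2^{36}\cdot n$, the following holds:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor - \lfloor 2^{36}\cdot n\rfloor < 2^{q+35}\cdot 10^{-k-1} - 2^{36}\cdot n + 1 \tag{126}$$
From equation (124), we have:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor - \lfloor 2^{36}\cdot n\rfloor < 1 - 2^{7} < 0 \tag{127}$$
Therefore:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} \le \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + 1 < \lfloor 2^{36}\cdot n\rfloor \le \lfloor 2^{36}\cdot n_r\rfloor \;\Rightarrow\; \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} < \lfloor 2^{36}\cdot n_r\rfloor \tag{128}$$
Therefore, when $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < n$, condition (122) can be used to determine that $\mathit{one} \ne 0$.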
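When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > n$, we must have $\mathit{one} = 0$, and:
$$2^{-1}\cdot 2^{q}\cdot 10^{-k-1} - n = n_{-} \in \left(2^{-33},\ 1 - 2^{-29}\right) \tag{129}$$
Therefore:
$$2^{q+35}\cdot 10^{-k-1} - 2^{36}\cdot n \in \left(2^{3},\ 2^{36} - 2^{7}\right) \tag{130}$$
When $a = 2^{q+35}\cdot 10^{-k-1}$ and $b = 2^{36}\cdot n$, from equation (125), the following holds:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor - \lfloor 2^{36}\cdot n\rfloor > 2^{q+35}\cdot 10^{-k-1} - 2^{36}\cdot n - 1 \tag{131}$$
From equation (130), we have:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor - \lfloor 2^{36}\cdot n\rfloor > 2^{3} - 1 > 0 \tag{132}$$
Therefore:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} \ge \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor > \lfloor 2^{36}\cdot n\rfloor + 1 \ge \lfloor 2^{36}\cdot n_r\rfloor \;\Rightarrow\; \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > \lfloor 2^{36}\cdot n_r\rfloor \tag{133}$$
Therefore, when $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > n$, condition (122) can be used to determine that $\mathit{one} = 0$.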
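When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < 1-n$, we must have $\mathit{one} \ne 10$, and:
$$2^{-1}\cdot 2^{q}\cdot 10^{-k-1} + n = n_{+} \in \left(2^{-33},\ 1 - 2^{-29}\right) \tag{134}$$
Therefore:
$$2^{q+35}\cdot 10^{-k-1} + 2^{36}\cdot n \in \left(2^{3},\ 2^{36} - 2^{7}\right) \tag{135}$$
For any two real numbers $a$ and $b$, the following relationship holds:
$$a - 1 < \lfloor a\rfloor \le a,\quad b - 1 < \lfloor b\rfloor \le b \;\Rightarrow\; a + b - 2 < \lfloor a\rfloor + \lfloor b\rfloor \le a + b \tag{136}$$
When $a = 2^{q+35}\cdot 10^{-k-1}$ and $b = 2^{36}\cdot n$, the following holds:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \lfloor 2^{36}\cdot n\rfloor \le 2^{q+35}\cdot 10^{-k-1} + 2^{36}\cdot n \tag{137}$$
From equation (135), we have:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \lfloor 2^{36}\cdot n\rfloor < 2^{36} - 2^{7} \tag{138}$$
Therefore:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} \le \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + 1 < 2^{36} - 2 - \lfloor 2^{36}\cdot n\rfloor \le 2^{36} - 1 - \lfloor 2^{36}\cdot n_r\rfloor \;\Rightarrow\; \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} < 2^{36} - 1 - \lfloor 2^{36}\cdot n_r\rfloor \tag{139}$$
Therefore, when $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < 1-n$, condition (122) can be used to determine that $\mathit{one} \ne 10$.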
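When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > 1-n$, we must have $\mathit{one} = 10$, and:
$$2^{-1}\cdot 2^{q}\cdot 10^{-k-1} + n = n_{+} + 1 \in \left(1 + 2^{-33},\ 2 - 2^{-29}\right) \tag{140}$$
Therefore:
$$2^{q+35}\cdot 10^{-k-1} + 2^{36}\cdot n \in \left(2^{3} + 2^{36},\ 2^{37} - 2^{7}\right) \tag{141}$$
When $a = 2^{q+35}\cdot 10^{-k-1}$ and $b = 2^{36}\cdot n$, from equation (136), the following holds:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \lfloor 2^{36}\cdot n\rfloor > 2^{q+35}\cdot 10^{-k-1} + 2^{36}\cdot n - 2 \tag{142}$$
From equation (141), we have:
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \lfloor 2^{36}\cdot n\rfloor > 2^{36} + 2^{3} - 2 > 2^{36} \tag{143}$$
Therefore, using $\lfloor 2^{36}\cdot n\rfloor \le \lfloor 2^{36}\cdot n_r\rfloor$ from (119):
$$\lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} \ge \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor > 2^{36} - \lfloor 2^{36}\cdot n\rfloor > 2^{36} - 1 - \lfloor 2^{36}\cdot n\rfloor \ge 2^{36} - 1 - \lfloor 2^{36}\cdot n_r\rfloor \;\Rightarrow\; \lfloor 2^{q+35}\cdot 10^{-k-1}\rfloor + \mathit{even} > 2^{36} - 1 - \lfloor 2^{36}\cdot n_r\rfloor \tag{144}$$
Therefore, when $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > 1-n$, condition (122) can be used to determine that $\mathit{one} = 10$.
In summary, the above proof shows that condition (122) gives the correct decision in all cases. When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} = 1-n$ (where condition (68) is met), it determines whether $\mathit{one} = 0$ or $\mathit{one} = 10$. When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} > 1-n$, it determines that $\mathit{one} = 0$ or $\mathit{one} = 10$. When $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < n$ or $2^{-1}\cdot 2^{q}\cdot 10^{-k-1} < 1-n$, it determines that $\mathit{one} \ne 0$ or $\mathit{one} \ne 10$.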
For the double range, the value of $\mathit{one}$ can be determined from the following conditions.
$$\text{double: } \begin{cases} \text{if } \lfloor 2^{q+63}\cdot 10^{-k-1}\rfloor + \mathit{even} > \lfloor n_r\cdot 2^{64}\rfloor: & \mathit{one} = 0 \\ \text{if } \lfloor 2^{q+63}\cdot 10^{-k-1}\rfloor + \mathit{even} > 2^{64} - 1 - \lfloor n_r\cdot 2^{64}\rfloor: & \mathit{one} = 10 \end{cases} \tag{145}$$
The proof follows the same steps as the float case above and is left to the reader. This completes the proof of this section. In the code implementation, the two conditions can be evaluated quickly using additions, subtractions, and shifts, and the compiler can translate them into cmov instructions, which reduces the performance impact of branch mispredictions.
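The decision proved above can be cross-checked against an exact-rational reference. The sketch below is not the fixed-point test (122)/(145) from the implementation; it evaluates the underlying real-number comparison between $2^{-1}\cdot 2^{q}\cdot 10^{-k-1}$ and $n$ or $1-n$ directly. It assumes $k = \lfloor q\cdot\log_{10}2\rfloor$ as in Schubfach, treats the boundary as included only when $c$ is even, and ignores the irregular numbers of Section 3.6. All function names are hypothetical.

```python
from fractions import Fraction
import math
import struct


def decode_double(x: float) -> tuple[int, int]:
    """Split a positive finite double into (c, q) with x == c * 2**q."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    frac = bits & ((1 << 52) - 1)
    expo = (bits >> 52) & 0x7FF
    if expo == 0:  # subnormal: no implicit leading bit
        return frac, -1074
    return frac | (1 << 52), expo - 1075


def floor_log10_pow2(q: int) -> int:
    """floor(q * log10(2)), corrected with exact comparisons."""
    k = math.floor(q * math.log10(2))
    two_q = Fraction(2) ** q
    while Fraction(10) ** k > two_q:
        k -= 1
    while Fraction(10) ** (k + 1) <= two_q:
        k += 1
    return k


def boundary_digit(x: float):
    """Return 0 or 10 when the rounding interval of x reaches a multiple of 10^(k+1), else None."""
    c, q = decode_double(x)
    k = floor_log10_pow2(q)
    scale = Fraction(10) ** (-k - 1)
    value = c * Fraction(2) ** q * scale  # equals m + n
    n = value - math.floor(value)
    half_ulp = Fraction(2) ** (q - 1) * scale  # 2^-1 * 2^q * 10^(-k-1)
    even = (c + 1) % 2  # 1 when c is even, as in equation (120)
    if half_ulp > n or (even and half_ulp == n):
        return 0
    if half_ulp > 1 - n or (even and half_ulp == 1 - n):
        return 10
    return None


print(boundary_digit(1.0), boundary_digit(0.3), boundary_digit(0.1 + 0.2))
```

A result of `None` means that neither boundary applies and the last digit has to be obtained by rounding $10n$, which is the subject of the next section.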

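3.5. Determine whether $\mathit{one} = \lfloor 10n\rfloor$ or $\mathit{one} = \lfloor 10n\rfloor + 1$

Determine whether $\mathit{one}$ is $\lfloor 10n\rfloor$ or $\lfloor 10n\rfloor + 1$ based on the fractional part of $10n$. There are two cases: the fractional part of $10n$ is 0.5, or it is not 0.5.

3.5.1. $10n - \lfloor 10n\rfloor = 0.5$

When the fractional part of $10n$ is 0.5, we must have:
$$\begin{aligned} & 10n - \lfloor 10n\rfloor = 0.5 \\ \Rightarrow\ & 10\cdot c\cdot 2^{q}\cdot 10^{-k-1} - \lfloor 10\cdot c\cdot 2^{q}\cdot 10^{-k-1}\rfloor = 0.5 \\ \Rightarrow\ & c\cdot 2^{q}\cdot 10^{-k} - \lfloor c\cdot 2^{q}\cdot 10^{-k}\rfloor = 0.5 \\ \Rightarrow\ & c\cdot 2^{q}\cdot 10^{-k} = \lfloor c\cdot 2^{q}\cdot 10^{-k}\rfloor + 0.5 \\ \Rightarrow\ & 2c\cdot 2^{q}\cdot 10^{-k} = 2\lfloor c\cdot 2^{q}\cdot 10^{-k}\rfloor + 1\end{aligned} \tag{146}$$
So $2c\cdot 2^{q}\cdot 10^{-k}$ is an odd number. Then the following expression is odd:
$$c\cdot 2^{q+1}\cdot 10^{-k} = c\cdot 2^{q-k+1}\cdot 5^{-k} \tag{147}$$
According to the range of $q$, we have:
$$c\cdot 2^{q+1}\cdot 10^{-k} = \begin{cases} \dfrac{c\cdot 2^{q-k+1}}{5^{k}}, & q \ge 0 \\ c\cdot 2\cdot 5^{-k}, & q = -1 \\ \dfrac{c\cdot 5^{-k}}{2^{k-q-1}}, & q \le -2 \end{cases} \tag{148}$$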
According to the range of $q$, the following situations are discussed:
  • $q \ge 0$
    When $q \ge 0$, it can be concluded that $q-k+1 \ge 1$, so the numerator $c\cdot 2^{q-k+1}$ is even and the denominator $5^{k}$ is odd, which does not meet the condition.
  • $q = -1$
    When $q = -1$, $c\cdot 2\cdot 5^{-k}$ is even, which does not meet the condition.
  • $q \le -2$
    $5^{-k}$ is an odd number, so $c$ must be an odd multiple of $2^{k-q-1}$. So:
    $$\begin{aligned}\text{float: } & c \ge 2^{k-q-1} \;\Rightarrow\; k-q-1 \le 22 \;\Rightarrow\; q \ge -34 \\ \text{double: } & c \ge 2^{k-q-1} \;\Rightarrow\; k-q-1 \le 51 \;\Rightarrow\; q \ge -75\end{aligned} \tag{149}$$
    Therefore, when $q$ meets the above conditions, $c$ must be an odd multiple of $2^{k-q-1}$ to meet the condition. Therefore, when the following conditions are met, expression (147) is an odd number:
    $$\begin{aligned}\text{float: } & -34 \le q \le -2 \ \&\&\ c \,\%\, 2^{k-q} = 2^{k-q-1} \\ \text{double: } & -75 \le q \le -2 \ \&\&\ c \,\%\, 2^{k-q} = 2^{k-q-1}\end{aligned} \tag{150}$$
    When $q$ is within the above range (150), $r = 1$ follows from equation (30). Therefore:
    $$n_r = n \tag{151}$$
    The following equation holds:
    $$20m + 20n = c\cdot 2^{q+1}\cdot 10^{-k} = c\cdot 2^{q-k+1}\cdot 5^{-k} = \frac{c}{2^{k-q-1}}\cdot 5^{-k} \tag{152}$$
    Since $k \le -1$, $5^{-k}$ is an odd multiple of 5. Since $\frac{c}{2^{k-q-1}}$ and $5^{-k}$ are both odd, and $20m$ is even, $20n$ is an odd multiple of 5. Therefore:
    $$20n \in \{5, 15\} \;\Rightarrow\; n \in \{0.25, 0.75\} \;\Rightarrow\; n_r \in \{0.25, 0.75\} \tag{153}$$
    The value of $\mathit{one}$ is whichever of $\lfloor 10n\rfloor$ and $\lfloor 10n\rfloor + 1$ is even. Therefore:
    $$\mathit{one} = \begin{cases} \lfloor 10n\rfloor = 2, & \text{if } n = 0.25 \\ \lfloor 10n\rfloor + 1 = 8, & \text{if } n = 0.75 \end{cases} \;\Rightarrow\; \mathit{one} = (\lfloor 20n\rfloor + 1) \,//\, 2 - (n = 0.25\ ?\ 1 : 0) \tag{154}$$
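The tie analysis above can be checked numerically. The sketch below compares the modular test of condition (150) for the double range with an exact-rational test of whether the fractional part of $10n$ is 0.5, and then evaluates formula (154). It assumes $k = \lfloor q\cdot\log_{10}2\rfloor$ as in Schubfach and samples $c$ from $[2^{52}, 2^{53})$; the function names are hypothetical and do not come from the xjb source code.

```python
from fractions import Fraction
import math
import random


def floor_log10_pow2(q: int) -> int:
    """floor(q * log10(2)), corrected with exact comparisons."""
    k = math.floor(q * math.log10(2))
    two_q = Fraction(2) ** q
    while Fraction(10) ** k > two_q:
        k -= 1
    while Fraction(10) ** (k + 1) <= two_q:
        k += 1
    return k


def is_tie_exact(c: int, q: int) -> bool:
    """True when c * 2^q * 10^-k has fractional part exactly 1/2."""
    k = floor_log10_pow2(q)
    t = c * Fraction(2) ** q * Fraction(10) ** (-k)
    return t - math.floor(t) == Fraction(1, 2)


def is_tie_fast(c: int, q: int) -> bool:
    """Condition (150) for the double range."""
    if not -75 <= q <= -2:
        return False
    s = floor_log10_pow2(q) - q  # s = k - q >= 1 in this range
    return c % (1 << s) == 1 << (s - 1)


def tie_digit(n: Fraction) -> int:
    """Formula (154): pick the even digit when n is 0.25 or 0.75."""
    return (math.floor(20 * n) + 1) // 2 - (1 if n == Fraction(1, 4) else 0)


rng = random.Random(1)
for q in range(-75, -1):
    s = floor_log10_pow2(q) - q
    c_tie = (1 << 52) | (1 << (s - 1))  # constructed odd multiple of 2^(k-q-1)
    assert is_tie_exact(c_tie, q) and is_tie_fast(c_tie, q)
    for _ in range(20):
        c = rng.randrange(1 << 52, 1 << 53)
        assert is_tie_exact(c, q) == is_tie_fast(c, q)

print(tie_digit(Fraction(1, 4)), tie_digit(Fraction(3, 4)))
```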

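3.5.2. $10n - \lfloor 10n\rfloor \ne 0.5$

When the fractional part of $10n$ is not 0.5, round to the nearest integer based on the fractional part of $10n$. Therefore:
$$\mathit{one} = \begin{cases} \lfloor 10n\rfloor, & \text{if } 10n - \lfloor 10n\rfloor < 0.5 \\ \lfloor 10n\rfloor + 1, & \text{if } 10n - \lfloor 10n\rfloor > 0.5 \end{cases} \;\Rightarrow\; \mathit{one} = \lfloor 10n + 0.5\rfloor = \lfloor 20n + 1\rfloor \,//\, 2 \tag{155}$$
Since $\lfloor 20n + 1\rfloor = \lfloor 20n\rfloor + 1$, it is only necessary to calculate $\lfloor 20n\rfloor$ exactly. Moreover:
$$d = \mathit{ten} + \mathit{one} = 10m + (\lfloor 20n\rfloor + 1) \,//\, 2 = (20m + \lfloor 20n\rfloor + 1) \,//\, 2 \tag{156}$$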
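The rounding identities (155) and (156) are easy to confirm with exact arithmetic. The following minimal sketch uses hypothetical helper names and an arbitrary grid of sample values for $n$; it checks that rounding $10n$ to the nearest integer agrees with the floor-based form, and that the digit can be folded into a single integer division for $d$.

```python
from fractions import Fraction
import math


def one_by_rounding(n: Fraction) -> int:
    """floor(10n + 0.5): round 10n to the nearest integer (a tie rounds up)."""
    return math.floor(10 * n + Fraction(1, 2))


def one_by_floor20(n: Fraction) -> int:
    """(floor(20n) + 1) // 2, the form used in equation (155)."""
    return (math.floor(20 * n) + 1) // 2


def d_combined(m: int, n: Fraction) -> int:
    """(20m + floor(20n) + 1) // 2, the combined form of equation (156)."""
    return (20 * m + math.floor(20 * n) + 1) // 2


m = 7  # arbitrary integer part for the check
for num in range(1000):
    n = Fraction(num, 1000)
    assert one_by_rounding(n) == one_by_floor20(n)
    assert d_combined(m, n) == 10 * m + one_by_floor20(n)
```

Note that for an exact tie both forms round up, which is why Section 3.5.1 subtracts 1 when $n = 0.25$.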
Let:
$$20m + 20n = c\cdot 2^{q+1}\cdot 10^{-k} = c\cdot 2^{q-k+1}\cdot 5^{-k} = \frac{c\cdot x}{y} \tag{157}$$
Let $n_{20}$ denote the fractional part of $20n$.
When $y \le c_{\max} = C$, the fractional part satisfies:
$$\begin{aligned}\text{float: } & \frac{1}{2^{24}-1} = \frac{1}{C} \le n_{20} \le 1 - \frac{1}{C} = \frac{2^{24}-2}{2^{24}-1} \\ \text{double: } & \frac{1}{2^{53}-1} = \frac{1}{C} \le n_{20} \le 1 - \frac{1}{C} = \frac{2^{53}-2}{2^{53}-1}\end{aligned} \tag{158}$$
When $y > c_{\max} = C$, the fractional part satisfies (the test file is test5.py):
$$\begin{aligned}\text{float: } & 2^{-32} < n_{20} < 1 - 2^{-30} \\ \text{double: } & 2^{-64} < n_{20} < 1 - 2^{-62}\end{aligned} \tag{159}$$
Therefore, the range of $n_{20}$ satisfies equation (159). In the code implementation, only the high 36 bits of $n_r$ are retained for float, and only the high 70 bits of $n_r$ are retained for double. Let the discarded part be $n_{36}$ for float and $n_{70}$ for double. Therefore:
$$\begin{aligned}\text{float: } & n_{36} \in \left[0,\ 2^{-36}\right) \\ \text{double: } & n_{70} \in \left[0,\ 2^{-70}\right)\end{aligned} \tag{160}$$
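Calculate the bounds of the following expression:
$$\begin{aligned}\text{float: } & F = \lfloor 20\cdot(c\cdot 2^{q}\cdot r\cdot 10^{-k-1} - n_{36})\rfloor \\ \text{double: } & F = \lfloor 20\cdot(c\cdot 2^{q}\cdot r\cdot 10^{-k-1} - n_{70})\rfloor\end{aligned} \tag{161}$$
Therefore:
$$\begin{aligned}\text{float: } & F_{\min} \ge \lfloor 20\cdot(c\cdot 2^{q}\cdot 10^{-k-1} - 2^{-36})\rfloor = \lfloor 20m + 20n - 20\cdot 2^{-36}\rfloor \\ & F_{\max} \le \lfloor 20\cdot(c\cdot 2^{q}\cdot(1 + 2^{-63})\cdot 10^{-k-1} - 0)\rfloor \le \lfloor 20m + 20n + 20\cdot 2^{-63}\cdot c\rfloor < 20m + \lfloor 20n\rfloor + 1 \\ \text{double: } & F_{\min} \ge \lfloor 20\cdot(c\cdot 2^{q}\cdot 10^{-k-1} - 2^{-70})\rfloor = \lfloor 20m + 20n - 20\cdot 2^{-70}\rfloor = 20m + \lfloor 20n\rfloor \\ & F_{\max} \le \lfloor 20\cdot(c\cdot 2^{q}\cdot(1 + 2^{-127})\cdot 10^{-k-1} - 0)\rfloor \le \lfloor 20m + 20n + 20\cdot 2^{-127}\cdot c\rfloor < 20m + \lfloor 20n\rfloor + 1\end{aligned} \tag{162}$$
For double, the last equality in the bound on $F_{\min}$ uses $n_{20} > 2^{-64} > 20\cdot 2^{-70}$ from (159). Therefore:
$$\begin{aligned}\text{float: } & F = 20m + \lfloor 20n\rfloor \\ \text{double: } & F = 20m + \lfloor 20n\rfloor\end{aligned} \tag{163}$$
Strictly speaking, for float the above argument does not rule out $F_{\min} \ne 20m + \lfloor 20n\rfloor$, because $20\cdot 2^{-36}$ exceeds the lower bound $2^{-32}$ of $n_{20}$ in (159). However, the implementation has passed an exhaustive test over all float values, so this gap in the proof does not affect correctness. The calculation of $d$ can therefore be simplified as follows:
$$d = \mathit{ten} + \mathit{one} = (F + 1) \,//\, 2 = \big(\lfloor 20\cdot(c\cdot 2^{q}\cdot r\cdot 10^{-k-1} - n_x)\rfloor + 1\big) \,//\, 2 \tag{164}$$
For the float range, $n_x = n_{36}$; for the double range, $n_x = n_{70}$.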
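For double, the test $n = 0.25$ in equation (154) can be evaluated quickly as follows.
When $n = 0.25$, $\lfloor 2^{64}\cdot n_r\rfloor = \lfloor 2^{64}\cdot n\rfloor = 2^{62}$. Therefore, the following condition can be used to quickly determine whether $n = 0.25$:
$$\text{double: } n = 0.25 \ \text{ if } \lfloor 2^{64}\cdot n_r\rfloor = 2^{62} \tag{165}$$
When $n \ne 0.25$, calculate the range of the fractional part of the following expression:
$$4m + 4n = c\cdot 2^{q+2}\cdot 10^{-k-1} \tag{166}$$
When expression (166) is not an integer, we have (test6.py):
$$2^{-62} < 4n - \lfloor 4n\rfloor < 1 - 2^{-62} \tag{167}$$
Consider the two boundary cases in which $4n$ is closest to 1:
$$\begin{aligned} & \lfloor 4n\rfloor = 0 \;\Rightarrow\; 4n - 0 < 1 - 2^{-62} \;\Rightarrow\; \lfloor 2^{64}\cdot n\rfloor \le 2^{62} - 2 \\ & \lfloor 4n\rfloor = 1 \;\Rightarrow\; 4n - 1 > 2^{-62} \;\Rightarrow\; \lfloor 2^{64}\cdot n\rfloor \ge 2^{62} + 1\end{aligned} \tag{168}$$
Then:
$$\lfloor 2^{64}\cdot n\rfloor \ne 2^{62} \ \&\&\ \lfloor 2^{64}\cdot n\rfloor + 1 \ne 2^{62} \;\Rightarrow\; \lfloor 2^{64}\cdot n_r\rfloor \ne 2^{62} \tag{169}$$
Therefore, the following condition can be used to quickly determine whether $n \ne 0.25$:
$$\text{double: } n \ne 0.25 \ \text{ if } \lfloor 2^{64}\cdot n_r\rfloor \ne 2^{62} \tag{170}$$
In summary, for double, the following condition can be used to quickly determine whether $n = 0.25$:
$$\begin{aligned}\text{double: } & n = 0.25 \ \text{ if } \lfloor 2^{64}\cdot n_r\rfloor = 2^{62} \\ \text{double: } & n \ne 0.25 \ \text{ if } \lfloor 2^{64}\cdot n_r\rfloor \ne 2^{62}\end{aligned} \tag{171}$$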
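For the double range, we introduce a fast way to calculate $\mathit{one}$:
$$\text{double: } \mathit{one} = \left\lfloor \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 + \big(n = 0.25\ ?\ 0 : 2^{-1} + 6\cdot 2^{-64}\big) \right\rfloor \tag{172}$$
The proof of equation (172) is as follows:
when $n = 0.25$, $\left\lfloor \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 \right\rfloor = \lfloor 10n\rfloor = 2$;
when $n \ne 0.25$, equation (172) is equivalent to the following:
$$\text{double: } \mathit{one} = \left\lfloor \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 + 2^{-1} + 6\cdot 2^{-64} \right\rfloor \tag{173}$$
According to the range of $10n - \lfloor 10n\rfloor$, $\mathit{one}$ is given by:
$$\text{double: } \mathit{one} = \begin{cases} \lfloor 10n\rfloor, & \text{if } 10n - \lfloor 10n\rfloor < 0.5 \\ 8, & \text{if } 10n - \lfloor 10n\rfloor = 0.5 \\ \lfloor 10n\rfloor + 1, & \text{if } 10n - \lfloor 10n\rfloor > 0.5 \end{cases} = (\lfloor 20n\rfloor + 1) \,//\, 2 \tag{174}$$
Therefore, when $n \ne 0.25$, we need to prove that the following equation holds:
$$\left\lfloor \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 + 2^{-1} + 6\cdot 2^{-64} \right\rfloor = \begin{cases} \lfloor 10n\rfloor, & \text{if } 10n - \lfloor 10n\rfloor < 0.5 \\ 8, & \text{if } 10n - \lfloor 10n\rfloor = 0.5 \\ \lfloor 10n\rfloor + 1, & \text{if } 10n - \lfloor 10n\rfloor > 0.5 \end{cases} = (\lfloor 20n\rfloor + 1) \,//\, 2 \tag{175}$$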
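From the definition of the floor function, we have:
$$\frac{\lfloor 2^{64} n_r\rfloor}{2^{64}} \in \left(n_r - 2^{-64},\ n_r\right] \tag{176}$$
Because the following hold:
$$c\cdot 2^{q}\cdot 10^{-k-1} = m + n,\qquad c\cdot 2^{q}\cdot r\cdot 10^{-k-1} = m + n_r \tag{177}$$
the following relationship can be concluded:
$$\begin{aligned} & n_r - n = (r-1)\cdot c\cdot 2^{q}\cdot 10^{-k-1} \\ \Rightarrow\ & n_r = (r-1)\cdot(m+n) + n \\ \Rightarrow\ & n \le n_r < 2^{-127}\cdot c + n \\ \Rightarrow\ & n \le n_r < 2^{-127}\cdot 2^{53} + n \\ \Rightarrow\ & n \le n_r < 2^{-74} + n\end{aligned} \tag{178}$$
From equations (176) and (178), it can be concluded that:
$$\begin{aligned} & \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}} \in \left(n - 2^{-64},\ n + 2^{-74}\right) \\ \Rightarrow\ & \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 \in \left(10n - 10\cdot 2^{-64},\ 10n + 10\cdot 2^{-74}\right) \\ \Rightarrow\ & \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 20 \in \left(20n - 20\cdot 2^{-64},\ 20n + 20\cdot 2^{-74}\right) \\ \Rightarrow\ & \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 20 \in \left(\lfloor 20n\rfloor + n_{20} - 20\cdot 2^{-64},\ \lfloor 20n\rfloor + n_{20} + 20\cdot 2^{-74}\right)\end{aligned} \tag{179}$$
We now discuss the range of values of $x$ for which the following condition is met.
$$\left\lfloor \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 20 + 1 + x \right\rfloor \,//\, 2 = (\lfloor 20n\rfloor + 1) \,//\, 2 = \mathit{one} \tag{180}$$
Therefore, the following conclusions can be drawn:
$$\begin{aligned} & \lfloor 20n\rfloor + n_{20} - 20\cdot 2^{-64} + 1 + x \ge \lfloor 20n\rfloor + 1 \;\Rightarrow\; x \ge 20\cdot 2^{-64} - n_{20} \\ & \lfloor 20n\rfloor + n_{20} + 20\cdot 2^{-74} + 1 + x < \lfloor 20n\rfloor + 2 \;\Rightarrow\; x < 1 - 20\cdot 2^{-74} - n_{20}\end{aligned} \tag{181}$$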
Let $x = 12\cdot 2^{-64}$. By exhaustive search, all floating-point numbers that do not meet the following condition can be obtained.
$$x = 12\cdot 2^{-64} \ge 20\cdot 2^{-64} - n_{20} \tag{182}$$
All floating-point numbers that do not meet condition (182) are as follows (in hexadecimal):
$$\texttt{0xd17c0747bd76fa1},\ \texttt{0xd27c0747bd76fa1},\ \texttt{0x4d73de005bd620df},\ \texttt{0x4d83de005bd620df},\ \texttt{0x4d93de005bd620df} \tag{183}$$
By exhaustive search, all floating-point numbers that do not meet the following condition can be obtained.
$$x = 12\cdot 2^{-64} < 1 - 20\cdot 2^{-74} - n_{20} \tag{184}$$
All floating-point numbers that do not meet condition (184) are as follows (in hexadecimal):
$$\texttt{0x612491daad0ba280},\ \texttt{0x6159b651584e8b20},\ \texttt{0x619011f2d73116f4},\ \texttt{0x61c4166f8cfd5cb1},\ \texttt{0x61d4166f8cfd5cb1} \tag{185}$$
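With $x = 12\cdot 2^{-64}$, the two bounds on $x$ from (181) translate into a range for $n_{20}$: the lower bound requires $n_{20} \ge 8\cdot 2^{-64} = 2^{-61}$, and the upper bound requires $n_{20} < 1 - 12\cdot 2^{-64} - 20\cdot 2^{-74}$. Since (159) only guarantees $2^{-64} < n_{20} < 1 - 2^{-62}$, a few inputs can fall outside this range, which is why the exceptions are listed explicitly. The sketch below evaluates the two bounds exactly; the helper name `x_is_admissible` is hypothetical.

```python
from fractions import Fraction

X = Fraction(12, 1 << 64)  # the value of x chosen in the text


def x_is_admissible(n20: Fraction, x: Fraction = X) -> bool:
    """Check both bounds of (181) on x for a given fractional part n20 of 20n."""
    lower = Fraction(20, 1 << 64) - n20
    upper = 1 - Fraction(20, 1 << 74) - n20
    return lower <= x < upper


# n20 = 2^-61 is the smallest value that satisfies the lower bound.
print(x_is_admissible(Fraction(1, 1 << 61)), x_is_admissible(Fraction(1, 1 << 62)))
```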
We have:
$$2\left(\frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 10 + 2^{-1} + 6\cdot 2^{-64}\right) = \frac{\lfloor 2^{64} n_r\rfloor}{2^{64}}\cdot 20 + 1 + x \tag{186}$$
When a floating-point number is not among the exceptions listed in (183) and (185), condition (181) is satisfied. We have tested all floating-point numbers listed in (183) and (185), and the implementation outputs the correct result for each of them, that is, it satisfies the SW principle. The test code is in test8.py.
In summary, equations (175) and (172) hold. Therefore, equation (172) can be used to quickly calculate $\mathit{one}$.
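Equation (172) maps naturally onto 64-bit fixed-point arithmetic. The sketch below is one possible integer form, not the code from the xjb repository: `frac64` stands for $\lfloor 2^{64} n_r\rfloor$, the bias $2^{63} + 6$ encodes $2^{-1} + 6\cdot 2^{-64}$, and the samples take $r = 1$ (so $n_r = n$) purely for illustration. The reference function is the formula from Sections 3.5.1 and 3.5.2.

```python
from fractions import Fraction
import math


def one_fast(frac64: int) -> int:
    """Integer form of equation (172); frac64 stands for floor(2^64 * n_r)."""
    bias = 0 if frac64 == (1 << 62) else (1 << 63) + 6
    return (frac64 * 10 + bias) >> 64


def one_reference(n: Fraction) -> int:
    """(floor(20n) + 1) // 2, with the tie n = 0.25 resolved to the even digit 2."""
    return (math.floor(20 * n) + 1) // 2 - (1 if n == Fraction(1, 4) else 0)


# Sample fractional parts; the only ties included are n = 0.25 and n = 0.75.
samples = [Fraction(1, 4), Fraction(3, 4), Fraction(1, 3), Fraction(3, 10),
           Fraction(9, 10), Fraction(24, 25), Fraction(1, 7)]
for n in samples:
    frac64 = math.floor(n * (1 << 64))  # assumes r = 1, so n_r = n
    assert one_fast(frac64) == one_reference(n)
```

In a C or C++ implementation the same computation would need a 128-bit intermediate (or a high-half multiply) for `frac64 * 10`; Python integers avoid that concern here.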

3.6. Irregular Number

The number of irregular floating-point numbers is small: there are 2046 irregular double values and 254 irregular float values in total. The correctness of the algorithm code for these inputs can therefore be verified exhaustively, so the irregular case is not discussed further in this article. For the specific implementation, please refer to the source code.
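The stated counts can be reproduced under the assumption that an irregular number is a positive normal value whose fraction field is all zeros (a power of two); this definition is an assumption made for the sketch and should be checked against the earlier definition in the paper. The helper names below are hypothetical.

```python
import math
import struct


def irregular_bit_patterns(exp_bits: int, frac_bits: int) -> list[int]:
    """Bit patterns with a zero fraction field and a biased exponent that is neither 0 nor all ones."""
    return [expo << frac_bits for expo in range(1, (1 << exp_bits) - 1)]


def as_double(bits: int) -> float:
    return struct.unpack("<d", struct.pack("<Q", bits))[0]


def as_float(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]


doubles = irregular_bit_patterns(11, 52)  # binary64: 11 exponent bits, 52 fraction bits
floats = irregular_bit_patterns(8, 23)    # binary32: 8 exponent bits, 23 fraction bits

# Every such value is an exact power of two, so frexp returns a mantissa of 0.5.
assert all(math.frexp(as_double(b))[0] == 0.5 for b in doubles)
assert all(math.frexp(as_float(b))[0] == 0.5 for b in floats)
print(len(doubles), len(floats))
```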

4. Benchmark Result

This article discusses only the binary-to-decimal part and does not discuss the decimal-to-string part. In the decimal-to-string part, the NEON instruction set is used on the arm64 architecture and SSE2 is used on the x86-64 architecture to accelerate the conversion; please refer to the source code for details. The benchmark project is available at https://github.com/xjb714/f2dec_bench. In the performance comparison, we measured the time spent by the algorithms listed in Table 1 converting floating-point numbers to decimal results and to strings. Test process: generate $10^7$ random numbers excluding 0, NaN, and Inf, measure the total time spent converting all of them, and derive the average time for converting a single floating-point number to decimal and to string. The compilation options for all compilers are "-O3 -march=native". We conducted benchmark tests on two processors, and the results are shown in Tables 2, 3, 4, and 5.
Special note: teju_jagua only supports float/double to decimal, because its author did not implement decimal to string. yy_double only supports double. Dragonbox_comp and fmt_comp denote the versions that use the compressed constant lookup table. Dragonbox_full and fmt_full denote the versions that use the uncompressed constant lookup table.
In the float/double to decimal comparison, the results produced by different algorithms may differ; for example, some algorithms return the decimal result without removing trailing zeros. This comparison is therefore not entirely fair, and the results are for reference only. Since the AMD-R7 7840H supports AVX2 and AVX512, the AVX2 and AVX512 entries in the test results are the results of code optimized by the clang compiler. The benchmark results show that the algorithm in this paper outperforms the other algorithms in most cases. For the double to string algorithm in this article, the performance with gcc is lower than with the icpx compiler because gcc generates suboptimal code, for example with too many branches. The code generated by the clang compiler is almost the same as that generated by icpx, but its measured performance does not meet expectations. The reason is under investigation.
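The input generation step of the test process can be sketched as follows. This is only an illustration of the described filtering (no 0, NaN, or Inf); it assumes inputs drawn uniformly from 64-bit patterns, which may differ from the distribution used by the actual C++ benchmark in the f2dec_bench repository. The function name is hypothetical.

```python
import math
import random
import struct


def random_finite_nonzero_doubles(count: int, seed: int = 0) -> list[float]:
    """Draw random 64-bit patterns and keep those that decode to a finite, nonzero double."""
    rng = random.Random(seed)
    values = []
    while len(values) < count:
        bits = rng.getrandbits(64)
        x = struct.unpack("<d", struct.pack("<Q", bits))[0]
        if x != 0.0 and math.isfinite(x):  # rejects +0, -0, NaN, +Inf, -Inf
            values.append(x)
    return values


inputs = random_finite_nonzero_doubles(1000)
print(len(inputs))
```

The average per-conversion time is then the total time for converting all inputs divided by the number of inputs.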

5. Conclusions and Future Work

This paper proposes a new algorithm for converting floating-point numbers to strings. The algorithm improves the calculation process of the Schubfach[6] algorithm, reduces the number of multiplication operations, and optimizes several calculation steps. The algorithm has been implemented in C/C++ and has passed exhaustive tests. The benchmark results show that it outperforms the existing algorithms in most cases. Future work includes further optimization of the algorithm to improve performance, especially for parallel computing on the x86-64 and arm64 architectures, and compatibility with the MSVC compiler.

Acknowledgments

This study was funded by the Sichuan Science and Technology Program (Grant No. 2024ZDZX0001) and the Technology Development Program (JCKY2022110C119).

References

  1. G. L. Steele Jr. and J. L. White. How to Print Floating-Point Numbers Accurately. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI 1990. ACM, New York, NY, USA, 112-126.
  2. F. Loitsch. Printing Floating-Point Numbers Quickly and Accurately with Integers. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, PLDI 2010. ACM, New York, NY, USA, 233-243.
  3. M. Andrysco, R. Jhala, and S. Lerner. Printing Floating-Point Numbers: a Faster, Always Correct Method. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016. ACM, New York, NY, USA, 555-567.
  4. Ulf Adams. 2018. Ryū: Fast Float-to-String Conversion. In Proceedings of 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'18). ACM, New York, NY, USA, 13 pages.
  5. Ulf Adams. 2019. Ryū Revisited: Printf Floating Point Conversion. Proc. ACM Program. Lang. 3, OOPSLA, Article 169 (October 2019), 23 pages.
  6. R. Giulietti. The Schubfach Way to Render Doubles. 2020. https://drive.google.com/file/d/1KLtG_LaIbK9ETXI290zqCxvBW94dj058/view (Sep. 2020).
  7. J. Jeon. Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm. 2020. https://github.com/jk-jeon/Grisu-Exact/blob/master/other_files/Grisu-Exact.pdf (Sep. 2020).
  8. Junekey Jeon. 2024. Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm. https://github.com/jk-jeon/Dragonbox.
  9. Guo YaoYuan. https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c (Nov. 2024).
  10. Victor Zverovich. https://github.com/fmtlib/fmt (Oct. 2025).
  11. Guo YaoYuan. https://github.com/ibireme/yyjson (Aug. 2025).
  12. Cassio Neri. https://github.com/cassioneri/teju_jagua (Nov. 2025).
Table 1. All algorithms in the benchmark test.
| algorithm | float | double | description |
|---|---|---|---|
| Schubfach | Schubfach32 | Schubfach64 | author: Raffaello Giulietti, https://github.com/c4f7fcce9cb06515/Schubfach |
| Schubfach_xjb | Schubfach32_xjb | Schubfach64_xjb | An improved version of Schubfach that produces the same output. |
| Ryu | Ryu32 | Ryu64 | author: Ulf Adams, https://github.com/ulfjack/ryu |
| Dragonbox | Dragonbox32 | Dragonbox64 | author: Junekey Jeon, https://github.com/jk-jeon/Dragonbox |
| fmt[10] | fmt32 | fmt64 | author: Victor Zverovich, https://github.com/fmtlib/fmt, version: 12.1.0 |
| yy_double | - | yy_double | author: Guo YaoYuan, https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c |
| yy_json[11] | yy_json32 | yy_json64 | author: Guo YaoYuan, https://github.com/ibireme/yyjson, version: 0.12.0 |
| teju_jagua[12] | teju32 | teju64 | author: Cassio Neri, https://github.com/cassioneri/teju_jagua |
| xjb | xjb32 | xjb64 | this paper, https://github.com/xjb714/xjb |

* A dash (-) in the table indicates that no implementation is available.
Table 2. float/double to decimal benchmark results on AMD-R7 7840H and Ubuntu 24.04. The unit is nanoseconds (ns).
| algorithm | float, gcc 13.3 | float, icpx 2025.0.4 | float, clang 18.1.3 | double, gcc 13.3 | double, icpx 2025.0.4 | double, clang 18.1.3 |
|---|---|---|---|---|---|---|
| Schubfach | 11.72 | 11.26 | 11.26 | 11.96 | 12.07 | 11.97 |
| Schubfach_xjb | 7.50 | 5.38 | 5.45 | 8.70 | 6.43 | 7.48 |
| Ryu | 14.23 | 14.21 | 14.48 | 13.73 | 13.53 | 13.59 |
| Dragonbox | 10.47 | 9.96 | 9.52 | 10.44 | 10.15 | 10.00 |
| yy_json | 6.31 | 4.84 | 4.69 | 7.20 | 6.10 | 6.25 |
| yy_double | - | - | - | 6.32 | 5.75 | 5.22 |
| teju_jagua | 13.73 | 14.90 | 14.39 | 13.38 | 15.23 | 13.59 |
| xjb | 3.38 | 2.88 | 4.19 | 7.15 | 3.90 | 3.60 |
| xjb (AVX2) | | | 1.88 | | | 3.13 |
| xjb (AVX512) | | | 1.68 | | | 3.47 |
Table 3. float/double to string benchmark results on AMD-R7 7840H and Ubuntu 24.04. The unit is nanoseconds (ns).
| algorithm | float, gcc 13.3 | float, icpx 2025.0.4 | float, clang 18.1.3 | double, gcc 13.3 | double, icpx 2025.0.4 | double, clang 18.1.3 |
|---|---|---|---|---|---|---|
| Schubfach | 20.71 | 19.67 | 20.32 | 25.78 | 24.55 | 24.76 |
| Schubfach_xjb | 21.04 | 19.99 | 20.49 | 20.14 | 20.23 | 20.37 |
| Ryu | 22.57 | 20.58 | 20.62 | 26.36 | 24.71 | 24.95 |
| Dragonbox_comp | 21.73 | 20.65 | 22.84 | 21.69 | 21.37 | 22.72 |
| Dragonbox_full | 16.93 | 15.49 | 17.28 | 18.88 | 17.78 | 18.80 |
| fmt_comp | 22.45 | 22.81 | 21.76 | 26.35 | 27.14 | 26.78 |
| fmt_full | 23.05 | 23.31 | 22.21 | 25.67 | 27.28 | 26.13 |
| yy_json | 21.71 | 21.09 | 21.08 | 18.13 | 18.03 | 19.16 |
| yy_double | - | - | - | 17.54 | 17.52 | 17.85 |
| xjb | 9.04 | 9.02 | 8.78 | 12.15 | 9.12 | 15.78 |
Table 4. float/double to decimal benchmark results on Apple M1 and MacOS 26.1. The unit is nanoseconds (ns).
| algorithm | float, apple clang 17.0.0 | double, apple clang 17.0.0 |
|---|---|---|
| Schubfach | 10.94 | 12.62 |
| Schubfach_xjb | 5.92 | 6.58 |
| Ryu | 15.40 | 14.16 |
| Dragonbox | 11.94 | 12.03 |
| yy_json | 4.18 | 4.72 |
| yy_double | - | 4.74 |
| teju_jagua | 19.27 | 18.66 |
| xjb | 3.24 | 3.53 |
Table 5. float/double to string benchmark results on Apple M1 and MacOS 26.1. The unit is nanoseconds (ns).
| algorithm | float, apple clang 17.0.0 | double, apple clang 17.0.0 |
|---|---|---|
| Schubfach | 23.26 | 27.51 |
| Schubfach_xjb | 23.30 | 22.44 |
| Ryu | 25.27 | 29.15 |
| Dragonbox_comp | 28.55 | 27.28 |
| Dragonbox_full | 21.03 | 22.12 |
| fmt_comp | 37.19 | 41.73 |
| fmt_full | 36.36 | 41.77 |
| yy_json | 15.50 | 15.58 |
| yy_double | - | 15.13 |
| xjb | 8.91 | 10.32 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.