xjb: Fast Float to String Algorithm

Junbo Xiang; Tiejun Wang

doi:10.20944/preprints202511.1698.v2

Submitted:

31 March 2026

Posted:

01 April 2026

You are already at the latest version

Abstract

Efficiently and accurately converting floating-point numbers to decimal strings is a critical challenge in numerical computation and data exchange. While existing algorithms like Ryu, Dragonbox, and Schubfach satisfy the Steele-White (SW) principle for accuracy, they often suffer from performance bottlenecks due to branch prediction failures and high-precision multiplication overhead. This paper presents a novel floating-point to string conversion algorithm called "xjb", an optimized variant of the Schubfach algorithm designed to deliver superior performance for IEEE754 single-precision (binary32) and double-precision (binary64) floating-point numbers. By minimizing instruction dependencies, reducing multiplication operations, mitigating branch prediction penalties and by utilizing the simd instruction set, xjb achieves significant performance gains. The algorithm features concise core implementation. Extensive benchmarking across diverse platforms, including AMD R7-7840H and Apple M1, demonstrates that xjb outperforms state-of-the-art algorithms in most scenarios while maintaining full compliance with the SW principle.

Keywords:

floating-point

;

printing

;

algorithm

;

performance

Subject:

Computer Science and Mathematics - Computer Science

1. Introduction

Floating-point to decimal string conversion is a fundamental operation in computer systems with widespread applications across numerous domains. From scientific computing and financial systems to web services and database management, the ability to efficiently and accurately convert binary floating-point representations into human-readable decimal strings is essential. Despite its apparent simplicity, this conversion problem presents significant challenges in balancing the competing demands of correctness, performance, and output compactness.

1.1. Background and Motivation

In 1990, Steele and White [1,2] published the seminal paper How to Print Floating-Point Numbers Accurately, which established the foundational principles for optimal floating-point printing algorithms, now widely known as the Steele-White (SW) principle. The SW principle comprises four key requirements:

Information preservation: The printed result must be parseable back to the original floating-point number without loss of precision.
Minimum length: The output string should be as short as possible while maintaining information preservation.
Correct rounding: When multiple representations satisfy the first two criteria, the algorithm must correctly round to the nearest value, with ties broken by selecting the even value.
Left-to-right generation: The output digits should be generated sequentially from the most significant to the least significant digit.

These principles ensure that floating-point numbers have a unique, well-defined decimal representation that is both human-readable and machine-parseable. Algorithms satisfying the SW principle guarantee that the conversion process is reversible and produces the shortest possible output, which is crucial for data exchange, serialization, and user interface display.

Over the past three decades, significant research efforts have been devoted to developing efficient algorithms that satisfy the SW principle. Early approaches, such as the Dragon4 algorithm, provided correct results but suffered from performance limitations due to the use of arbitrary-precision arithmetic. Subsequent innovations introduced various optimizations:

dtoa.c [3] is an improved version of Dragon4, written by Robert G. Burger and R. Kent Dybvig in 1996. The source code link is https://www.netlib.org/fp/dtoa.c.
Grisu3 [4] pioneered the use of precomputed powers of ten to avoid expensive arbitrary-precision operations, though it occasionally falls back to slower methods for certain inputs.
Errol [5] improved upon Grisu3 by reducing the fallback rate through more precise error analysis.
Ry $\bar{u}$ [6,7] achieved significant performance gains through careful instruction scheduling and lookup table optimizations, establishing a new baseline for high-performance conversion.
Schubfach [8] introduced a novel approach based on the concept of "Schubfach" (pigeonhole principle), providing a compact and elegant solution with competitive performance.
Grisu-Exact [9] eliminated the need for fallbacks entirely while maintaining high performance.
Dragonbox [10] reduces the number of multiplications by sacrificing more branches.
yy_double [11] explored alternative computational strategies to minimize the multiplication cost, but there were still a few branches.
uscale [12,13,14] was proposed by Russ Cox and is used to enhance the performance of the floating-point number printing algorithm in the Go programming language.
zmij [15] is based on yy_double and has undergone extensive code optimization.

Despite these advances, existing algorithms still face several performance challenges that limit their effectiveness in high-throughput scenarios:

1.: Branch prediction penalties: Many algorithms rely heavily on conditional branches to handle different cases, leading to frequent branch mispredictions on modern pipelined processors.
2.: High-precision multiplication overhead: The conversion process requires high-precision arithmetic operations, particularly multiplications involving large precomputed constants, which can be expensive on standard hardware.
3.: Instruction dependency chains: Sequential dependencies between operations limit instruction-level parallelism and prevent efficient utilization of modern superscalar processors.
4.: Limited SIMD utilization: Most existing algorithms do not exploit vector instruction sets (SIMD) that are now ubiquitous in contemporary processors.

These limitations motivate the need for a new approach that addresses these performance bottlenecks while maintaining full compliance with the SW principle.

1.2. Contributions

This paper presents xjb, a novel floating-point to string conversion algorithm that achieves superior performance through systematic optimization of the underlying computational structure. The xjb algorithm is derived from the Schubfach algorithm and incorporates insights from yy_double and Dragonbox, but introduces several key innovations:

1.: Reduced instruction dependencies: By carefully restructuring the computation, xjb minimizes data dependencies between operations, enabling better instruction-level parallelism and improved pipeline utilization.
2.: Minimized multiplication operations: The algorithm reduces the number of expensive high-precision multiplications required during conversion, significantly decreasing computational overhead.
3.: Mitigated branch prediction penalties: Through branchless programming techniques and careful case analysis, xjb minimizes conditional branches that could lead to prediction failures.
4.: SIMD instruction utilization: The algorithm is designed to leverage SIMD instructions where applicable.
5.: Concise core implementation: Despite its sophisticated optimizations, xjb maintains a compact and readable core implementation, facilitating adoption and maintenance.

The xjb algorithm supports both IEEE 754 single-precision (binary32) and double-precision (binary64) floating-point formats, which are the most widely used floating-point representations in modern computing. For simplicity, this paper uses float to refer to IEEE 754 binary32 and double to refer to IEEE 754 binary64.

1.3. Evaluation Overview

We conducted extensive benchmarking of xjb across diverse hardware platforms, including AMD R7-7840H and Apple M1 processors. Our evaluation demonstrates that xjb outperforms state-of-the-art algorithms in most scenarios while maintaining full compliance with the SW principle. The algorithm exhibits excellent portability and scalability, making it suitable for deployment across a wide range of systems from embedded devices to high-performance servers.

The implementation code for the xjb algorithm and all benchmark tools used in this paper are available at https://github.com/xjb714/xjb.

The remainder of this paper is organized as follows. Section 2 reviews the IEEE 754 floating-point representation and establishes the mathematical foundation for the conversion problem. Section 3 presents the core principles and derivation of the xjb algorithm and describes the implementation details and optimizations. Section 4 presents experimental results comparing xjb against existing algorithms.

1.4. Explanation of Special Symbols in This Article

Provide special explanations for the special symbols used in the formulas of this article.

Table 1. Explanation of Special Symbols in This Article.

symbol	brief explanation	example
%	Integer modulus operation	2 = 8%3
//	Integer division operation	1 = 5//3
$< <$ or $> >$	Left or right shift of binary values	8 = 1 $< <$ 3
? :	Similar to the ternary operator in C syntax	a = 1?a:b

2. IEEE754 Floating-Point Number Representation

This section establishes the mathematical foundation for floating-point number representation and defines the notation used throughout this paper. We focus on the IEEE 754 standard, which is the most widely adopted floating-point arithmetic standard in modern computing systems.

2.1. Scope and Assumptions

For clarity of presentation, we make the following simplifying assumptions:

We consider only positive floating-point numbers, as negative numbers differ only by a leading minus sign.
We exclude special values (zero, NaN, and infinity) from our analysis, as these are handled separately in practice.

These assumptions are standard in the literature and do not affect the generality of our algorithm, as the excluded cases can be handled with straightforward special-case logic.

2.2. Binary Representation

The IEEE 754 standard [16,17] defines two primary floating-point formats relevant to this work:

Double-precision (binary64): A 64-bit format consisting of:

1 sign bit (s): indicates positive ( $s = 0$ ) or negative ( $s = 1$ ).
11 exponent bits (e): biased exponent in the range $[0, 2047]$ .
52 fraction bits (f): significand fraction in the range $[0, 2^{52} - 1]$ .

Single-precision (binary32): A 32-bit format consisting of:

1 sign bit (s): indicates positive ( $s = 0$ ) or negative ( $s = 1$ ).
8 exponent bits (e): biased exponent in the range $[0, 255]$ .
23 fraction bits (f): significand fraction in the range $[0, 2^{23} - 1]$ .

2.3. Classification of Floating-Point Numbers

We classify floating-point numbers into three categories based on their exponent and fraction fields:

1.: Subnormal numbers ( $e = 0$ and $f \neq 0$ ): These represent very small values close to zero, where the implicit leading bit of the significand is 0 instead of 1.
2.: Normal numbers ( $e \neq 0$ and $f \neq 0$ ): The standard case where the implicit leading bit of the significand is 1.
3.: Irregular numbers ( $e \neq 0$ and $f = 0$ ): Numbers with zero fraction field, representing powers of two.

We use the term regular to refer to both subnormal and normal numbers (i.e., all cases where

f \neq 0

).

2.4. Value Representation

The real value v of a positive floating-point number can be expressed in the unified form

v = c \cdot 2^{q}

, where c is the integer significand and q is the exponent. The general formula covering all cases is:

\begin{matrix} double : & v = (f + (e \neq 0 ? 2^{52} : 0)) \cdot 2^{max (e, 1) - 1075} = c \cdot 2^{q} \\ float : & v = (f + (e \neq 0 ? 2^{23} : 0)) \cdot 2^{max (e, 1) - 150} = c \cdot 2^{q} \end{matrix}

(1)

For each category, the values decompose as follows:

Subnormal numbers (

e = 0

):

\begin{matrix} subnormal : \{\begin{matrix} double : & v = f \cdot 2^{- 1074} \\ float : & v = f \cdot 2^{- 149} \end{matrix} \end{matrix}

(2)

Irregular numbers (

f = 0

,

e \neq 0

):

\begin{matrix} irregular : \{\begin{matrix} double : & v = 2^{52} \cdot 2^{e - 1075} \\ float : & v = 2^{23} \cdot 2^{e - 150} \end{matrix} \end{matrix}

(3)

Normal numbers (

e \neq 0

,

f \neq 0

):

\begin{matrix} normal : \{\begin{matrix} double : & v = (f + 2^{52}) \cdot 2^{e - 1075} \\ float : & v = (f + 2^{23}) \cdot 2^{e - 150} \end{matrix} \end{matrix}

(4)

2.5. Rounding Intervals

A critical concept for accurate floating-point printing is the rounding interval

R_{v}

, which defines the range of real numbers that round to the given floating-point value v when parsed. The rounding interval is bounded by:

\begin{matrix} v_{l} & = \{\begin{matrix} (c - \frac{1}{2}) \cdot 2^{q} & if f \neq 0 or e \leq 1 \\ (c - \frac{1}{4}) \cdot 2^{q} & if f = 0 \end{matrix} \\ v_{r} & = (c + \frac{1}{2}) \cdot 2^{q} \\ R_{v} & = \{\begin{matrix} [v_{l}, v_{r}] & if f mod 2 = 0 (even significand) \\ (v_{l}, v_{r}) & if f mod 2 = 1 (odd significand) \end{matrix} \end{matrix}

(5)

The rounding radius for regular floating-point numbers is

2^{q - 1}

. The distinction between closed and open intervals at the boundaries depends on the parity of the significand, ensuring correct rounding according to the round-to-even rule specified in the IEEE 754 standard. Any decimal number within

R_{v}

will parse back to the original floating-point value v, which is essential for ensuring the information preservation property of the SW principle.

Table 2 summarizes the valid ranges for c and q across different categories.

3. Principle of Algorithm

This section presents the algorithmic principles and mathematical foundation of the xjb floating-point to string conversion algorithm. We first introduce the overall architecture and design goals, followed by the mathematical formulation of the conversion problem.

3.1. Design Overview

This paper focuses on converting float (single-precision) and double (double-precision) floating-point numbers to decimal strings. The conversion process consists of two stages:

1.: Float-to-Decimal Conversion: Converting binary floating-point values to decimal significand-exponent pairs $(d, k)$ .
2.: Decimal-to-String Conversion: Formatting $(d, k)$ into human-readable strings.

We focus primarily on the first stage, as it is the computationally intensive component that benefits most from our optimizations.

The xjb algorithm is designed with four key performance objectives:

Minimize branch mispredictions through unlikely branches and branchless techniques
Reduce expensive high-precision multiplication operations
Optimize instruction dependencies to enable better parallelism
Maintain core implementation simplicity and readability

3.2. Mathematical Foundation

Before presenting the algorithm details, we establish the mathematical framework for the float-to-decimal conversion problem.

Recall from Section 2 that any floating-point value v can be expressed in the form

v = c \cdot 2^{q}

, where c is the integer significand and q is the exponent. Our goal is to find the optimal decimal representation

o p t = d \cdot 10^{k}

that satisfies the SW principle.

As established in Section 2, regular floating-point numbers (which include all normal and subnormal numbers with non-zero fraction fields) account for the vast majority of possible floating-point values. For the purposes of algorithm derivation, we focus primarily on regular numbers, as special cases can be handled with minimal additional logic.

The valid ranges for the significand c and exponent q for regular floating-point numbers are:

\begin{matrix} float : & \{\begin{matrix} 1 \leq c \leq 2^{24} - 1, c \neq 2^{23}, & q = - 149 \\ 2^{23} + 1 \leq c \leq 2^{24} - 1, & - 148 \leq q \leq 104 \end{matrix} \\ double : & \{\begin{matrix} 1 \leq c \leq 2^{53} - 1, c \neq 2^{52}, & q = - 1074 \\ 2^{52} + 1 \leq c \leq 2^{53} - 1, & - 1073 \leq q \leq 971 \end{matrix} \end{matrix}

(6)

For irregular floating-point numbers (powers of two), the ranges are:

\begin{matrix} float : & c = 2^{23}, - 149 \leq q \leq 104 \\ double : & c = 2^{52}, - 1074 \leq q \leq 971 \end{matrix}

(7)

For subnormal numbers:

\begin{matrix} float : & c \leq 2^{23} - 1, q = - 149 \\ double : & c \leq 2^{52} - 1, q = - 1074 \end{matrix}

(8)

The conversion problem can now be formally stated as: given a floating-point value

v = c \cdot 2^{q}

, find the optimal decimal representation

o p t = d \cdot 10^{k}

such that:

\begin{matrix} v = c \cdot 2^{q} & \to o p t = d \cdot 10^{k} \\ subject to : & o p t \in R_{v}, d \in Z^{+}, k \in Z \end{matrix}

(9)

where

R_{v}

is the rounding interval of v as defined in Section 2.

For example, consider the IEEE 754 binary64 floating-point number representing 1.3. Its actual value is 1.3000000000000000444089209850062616169452667236328125, with hexadecimal representation 3ff4cccccccccccd. The optimal decimal representation

o p t

satisfying the SW principle is simply 1.3. Similarly, for the binary32 representation of 1.3, with actual value 1.2999999523162841796875 and hexadecimal representation 3fa66666, the optimal representation is also 1.3.

3.3. Review of the Schubfach Algorithm and Our Derivation

This section reviews the Schubfach algorithm and presents our derivation of an optimized variant. We begin by establishing the mathematical foundation for determining the optimal decimal representation.

3.3.1. Candidate Values for the Significand d

The Schubfach algorithm identifies four candidate values for the decimal significand d:

\begin{matrix} d \in {10 ⌊ v \cdot 10^{- k - 1} ⌋, ⌊ 10 v \cdot 10^{- k - 1} ⌋, ⌊ 10 v \cdot 10^{- k - 1} ⌋ + 1, 10 ⌊ v \cdot 10^{- k - 1} ⌋ + 10} \end{matrix}

(10)

The exponent k is computed as:

k = \{\begin{matrix} ⌊ q \cdot lg 2 ⌋ & if v \in regular \\ ⌊ q \cdot lg 2 - lg (4 / 3) ⌋ & otherwise \end{matrix}

(11)

For efficient computation on modern processors, Equation (11) can be implemented using integer arithmetic:

\begin{matrix} double : k & = (q \cdot 315653 - (v \in regular ? 0 : 131072)) ≫ 20 \\ float : k & = (q \cdot 1233 - (v \in regular ? 0 : 512)) ≫ 12 \end{matrix}

(12)

3.3.2. Decomposition into Integer and Fractional Parts

Let

m = ⌊ v \cdot 10^{- k - 1} ⌋

denote the integer part and

n = v \cdot 10^{- k - 1} - m

the fractional part, where

0 \leq n < 1

. Substituting into Equation (10):

d \in {10 m, ⌊ 10 (m + n) ⌋, ⌊ 10 (m + n) ⌋ + 1, 10 m + 10}

(13)

Since

⌊ 10 (m + n) ⌋ = 10 m + ⌊ 10 n ⌋

, the candidates simplify to:

d \in {10 m, 10 m + ⌊ 10 n ⌋, 10 m + ⌊ 10 n ⌋ + 1, 10 m + 10}

(14)

Based on the Schubfach algorithm and the SW principle, if the value

10 m

or

10 m + 10

falls within the range

R_{v}

, it is selected as the optimal solution d. In cases where neither

10 m

nor

10 m + 10

lies within

R_{v}

, the optimal solution d is determined as either

10 m + ⌊ 10 n ⌋

or

10 m + ⌊ 10 n ⌋ + 1

in accordance with the rules of correct rounding. We decompose d as

d = t e n + o n e

, where

t e n = 10 m

and

o n e \in {0, ⌊ 10 n ⌋, ⌊ 10 n ⌋ + 1, 10}

. The problem thus reduces to determining the appropriate value of

o n e

.

o n e = \{\begin{matrix} 0; if 10 m \in R_{v} \\ 10; else if 10 m + 10 \in R_{v} \\ ⌊ 10 n ⌋ or ⌊ 10 n ⌋ + 1; else apply correct rounding \end{matrix}

(15)

3.3.3. Selection Criteria for $O n e$

The selection of

o n e

depends on the relationship between the rounding interval and the candidate values. Recall that the rounding interval for a regular floating-point number

v = c \cdot 2^{q}

is

R_{v} = [v - 2^{q - 1}, v + 2^{q - 1}]

, where

2^{q - 1}

represents the half-unit in the last place (ulp) of v.

Case $o n e = 0$ (i.e., $d = 10 m$ ): This case applies when $10 m \cdot 10^{k}$ is the closest representable decimal to v within the rounding interval. The condition is derived as follows:

The lower bound of the rounding interval must be less than $10 m \cdot 10^{k}$ :

$\begin{matrix} c \cdot 2^{q} - 10 m \cdot 10^{k} & < 2^{q - 1} \\ c \cdot 2^{q} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \cdot 10^{k + 1} & < 2^{q - 1} \\ c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ n & < 2^{q - 1} \cdot 10^{- k - 1} \end{matrix}$

(16)

When equality holds ( $n = 2^{q - 1} \cdot 10^{- k - 1}$ ), we apply the round-to-even rule, requiring c to be even:

$o n e = 0 if 2^{q - 1} \cdot 10^{- k - 1} > n or (2^{q - 1} \cdot 10^{- k - 1} = n and c mod 2 = 0)$

(17)
Case $o n e = 10$ (i.e., $d = 10 m + 10$ ): This case applies when $(10 m + 10) \cdot 10^{k}$ is the closest representable decimal to v. The condition is derived similarly:

The upper bound of the rounding interval must be greater than $(10 m + 10) \cdot 10^{k}$ :

$\begin{matrix} (10 m + 10) \cdot 10^{k} - c \cdot 2^{q} & < 2^{q - 1} \\ ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \cdot 10^{k + 1} + 10^{k + 1} - c \cdot 2^{q} & < 2^{q - 1} \\ 1 - (c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋) & < 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ 1 - n & < 2^{q - 1} \cdot 10^{- k - 1} \end{matrix}$

(18)

When equality holds ( $1 - n = 2^{q - 1} \cdot 10^{- k - 1}$ ), we again apply round-to-even:

$o n e = 10 if 2^{q - 1} \cdot 10^{- k - 1} > 1 - n or (2^{q - 1} \cdot 10^{- k - 1} = 1 - n and c mod 2 = 0)$

(19)
Case $o n e \in {⌊ 10 n ⌋, ⌊ 10 n ⌋ + 1}$ : When neither boundary condition applies, the optimal value lies between $10 m + ⌊ 10 n ⌋$ and $10 m + ⌊ 10 n ⌋ + 1$ . We determine $o n e$ by rounding $10 n$ to the nearest integer:

–

If the fractional part ${10 n} < 0.5$ : $o n e = ⌊ 10 n ⌋$

–

If the fractional part ${10 n} > 0.5$ : $o n e = ⌊ 10 n ⌋ + 1$

–

If the fractional part ${10 n} = 0.5$ : apply round-to-even

For irregular floating-point numbers (powers of two), additional verification is required to ensure the selected value lies within the rounding interval $R_{v}$ , as the interval boundaries differ for these special cases.

3.3.4. Algorithm Overview

Algorithm 1 summarizes our optimized variant of the Schubfach algorithm (xjb32 for float, xjb64 for double). Given inputs c and q, the algorithm returns d and k such that

d \cdot 10^{k}

satisfies the SW principle. The computation of k follows Equation (12); the remainder of this section focuses on efficient computation of d.

The subsequent sections detail:

1.: Lookup table precomputation
2.: Efficient computation of m
3.: Fast boundary condition testing for $o n e \in {0, 10}$
4.: Efficient computation of $⌊ 10 n ⌋$ and rounding
5.: Handling of irregular floating-point numbers
6.: implementation of pseudocode

The following will discuss the above content in detail from Section 3.4 to Section 3.9.

Algorithm 1:The xjb Algorithm for Float-to-Decimal Conversion

Require:: Floating-point components c (significand) and q (exponent)
Ensure:: Decimal representation $d \cdot 10^{k}$ satisfying the SW principle
1:: $c \cdot 2^{q} \leftarrow v$
2:: if $v \in regular$ then
3:: $k \leftarrow ⌊ q \cdot lg 2 ⌋$
4:: else
5:: $k \leftarrow ⌊ q \cdot lg 2 - lg (4 / 3) ⌋$
6:: end if
7:: $m \leftarrow ⌊ v \cdot 10^{- k - 1} ⌋$
8:: $n \leftarrow v \cdot 10^{- k - 1} - m$
9:: $t e n \leftarrow 10 m$
10:: $δ \leftarrow 10 n - ⌊ 10 n ⌋$ fractional part of $10 n$
11:: if $δ = 0.5$ then
12:: if $⌊ 10 n ⌋ mod 2 = 0$ then
13:: $o n e \leftarrow ⌊ 10 n ⌋$ round to even
14:: else
15:: $o n e \leftarrow ⌊ 10 n ⌋ + 1$
16:: end if $δ < 0.5$
17:: $o n e \leftarrow ⌊ 10 n ⌋$
18:: else
19:: $o n e \leftarrow ⌊ 10 n ⌋ + 1$
20:: end if
21:: if $v \in irregular$ then
22:: if $δ > 2^{q - 2} \cdot 10^{- k}$ then
23:: $o n e \leftarrow ⌊ 10 n ⌋ + 1$
24:: end if
25:: if $2^{q - 2} \cdot 10^{- k - 1} \geq n$ then
26:: $o n e \leftarrow 0$
27:: end if
28:: else
29:: if $2^{q - 1} \cdot 10^{- k - 1} > n$ OR $(2^{q - 1} \cdot 10^{- k - 1} = n and c mod 2 = 0)$ then
30:: $o n e \leftarrow 0$
31:: end if
32:: if $2^{q - 1} \cdot 10^{- k - 1} > 1 - n$ OR $(2^{q - 1} \cdot 10^{- k - 1} = 1 - n and c mod 2 = 0)$ then
33:: $o n e \leftarrow 10$
34:: end if
35:: end if
36:: $d \leftarrow t e n + o n e$
37:: return $d, k$

3.4. Lookup Table Precomputation

The algorithm in this paper employs a lookup table to store pre-computed values of

10^{- k - 1}

for different ranges of q:

[- 149, 104]

for float and

[- 1074, 971]

for double. These lookup tables use extended precision: 64-bit for float and 128-bit for double. The reference implementation is available in gen.py.

3.4.1. Fundamental Calculation

Let B denote the bit length of each entry in the lookup table, with

B = 64

for float and

B = 128

for double. For any integer

e_{10}

(representing a power of 10), we aim to represent

10^{e_{10}}

in the form

f \cdot 2^{⌊ e_{2} ⌋}

, where

1 ⩽ f < 2

and

e_{2}

is a real number. This gives:

\begin{matrix} f \cdot 2^{⌊ e_{2} ⌋} = 2^{e_{2}} = 10^{e_{10}} \end{matrix}

(20)

Taking the logarithm base 2 of both sides, we get:

\begin{matrix} ⌊ e_{2} ⌋ = ⌊ e_{10} \cdot {log}_{2} (10) ⌋ \end{matrix}

(21)

Solving for f gives:

\begin{matrix} f = \frac{10^{e_{10}}}{2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋}} \end{matrix}

(22)

The lookup table entries are computed using upward rounding:

\begin{matrix} l o o k u p [e_{10}] & = ⌈ f \cdot 2^{B - 1} ⌉ \\ = ⌈ \frac{10^{e_{10}}}{2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋}} \cdot 2^{B - 1} ⌉ \\ = ⌈ 10^{e_{10}} \cdot 2^{B - 1 - ⌊ e_{10} \cdot {log}_{2} (10) ⌋} ⌉ \end{matrix}

(23)

Notably,

f \cdot 2^{B - 1}

becomes an integer for certain ranges of

e_{10}

:

0 ⩽ e_{10} ⩽ 27

for float and

0 ⩽ e_{10} ⩽ 55

for double.

3.4.2. Detailed Calculation Process

The detailed calculation process is as follows:

Float

The range of $- k - 1$ is calculated to be [-32, 44] through the q value range in Equation (6), so the lookup table contains representation values from 10 to the power of -32 to 10 to the power of 44. The calculation process is as follows:

$\begin{matrix} - 32 ⩽ e_{10} ⩽ 44 \\ e_{2} = |⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 63| \\ p o w 10 t = \{\begin{matrix} 2^{e_{2}} / / 10^{|e 10|}; if e_{10} < 0 \\ 10^{|e 10|} / / 2^{e_{2}}; if e_{10} ⩾ 20 \\ 10^{|e 10|} \cdot 2^{e_{2}}; if 1 ⩽ e_{10} ⩽ 19 \end{matrix} \\ f_{1, e_{10}} = p o w 10 = p o w 10 t + (e_{10} ⩾ 0 & & e_{10} ⩽ 27 ? 0 : 1) \end{matrix}$

(24)

When $0 ⩽ e_{10} ⩽ 27$ , the lookup table variable indicates that the values $f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 63}$ and $10^{e_{10}}$ are equal. In other cases, the relative error is less than $2^{- 63}$ . Expressed as:

$\begin{matrix} r_{1, e_{10}} & = \frac{f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 63}}{10^{e_{10}}} \\ \in \{\begin{matrix} 1; if 0 ⩽ e_{10} ⩽ 27 \\ (1, 1 + 2^{- 63}); if e_{10} < 0 or e_{10} > 27 \end{matrix} \end{matrix}$

(25)
Double

The range of $- k - 1$ is calculated to be [-293, 323] through the q value range in Equation (6), so the lookup table contains representation values from 10 to the power of -293 to 10 to the power of 323. The calculation process is as follows:

$\begin{matrix} - 293 ⩽ e_{10} ⩽ 323 \\ e_{2} = |⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 127| \\ p o w 10 t = \{\begin{matrix} 2^{e_{2}} / / 10^{|e 10|}; if e_{10} < 0 \\ 10^{|e 10|} / / 2^{e_{2}}; if e_{10} ⩾ 39 \\ 10^{|e 10|} \cdot 2^{e_{2}}; if 1 ⩽ e_{10} ⩽ 38 \end{matrix} \\ f_{1, e_{10}} = p o w 10 = p o w 10 t + (e_{10} ⩾ 0 & & e_{10} ⩽ 55 ? 0 : 1) \end{matrix}$

(26)

When $0 ⩽ e_{10} ⩽ 55$ , the lookup table variable indicates that the values $f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 127}$ and $10^{e_{10}}$ are equal. In other cases, the relative error is less than $2^{- 127}$ . Expressed as:

$\begin{matrix} r_{1, e_{10}} & = \frac{f_{1, e_{10}} \cdot 2^{⌊ e_{10} \cdot {log}_{2} (10) ⌋ - 127}}{10^{e_{10}}} \\ \in \{\begin{matrix} 1; if 0 ⩽ e_{10} ⩽ 55 \\ (1, 1 + 2^{- 127}); if e_{10} < 0 or e_{10} > 55 \end{matrix} \end{matrix}$

(27)

Let

r_{1}

denote the error for float lookup table entries,

r_{2}

for double entries, and r for both. In Algorithm (1), we retrieve

10^{- k - 1}

from the lookup table. The lookup table provides exact values when q falls within specific ranges:

\begin{matrix} float : & 0 ⩽ - k - 1 ⩽ 27 \Rightarrow - 93 ⩽ q ⩽ - 1 \\ double : & 0 ⩽ - k - 1 ⩽ 55 \Rightarrow - 186 ⩽ q ⩽ - 1 \end{matrix}

(28)

For q outside these ranges, the lookup table entries have bounded relative errors:

\begin{matrix} float : & 0 < r_{1} - 1 < 2^{- 63} \\ double : & 0 < r_{2} - 1 < 2^{- 127} \end{matrix}

(29)

3.4.3. Storage Requirements

The float lookup table requires 616 bytes of storage, calculated as

(44 - (- 32) + 1) \times 8

bytes (77 entries × 8 bytes each). The double lookup table requires 9872 bytes, calculated as

(323 - (- 293) + 1) \times 16

bytes (617 entries × 16 bytes each).

3.4.4. Implementation Notes

The lookup table pre-computation uses efficient integer arithmetic to avoid precision loss during calculations. The conditional logic in Equations (24) and (26) optimizes the computation based on the sign and magnitude of

e_{10}

, ensuring efficient generation of accurate lookup table entries.

3.5. Efficient Computation of m

This section presents an efficient method for calculating m in Algorithm (1), which is defined as

m = ⌊ v \cdot 10^{- k - 1} ⌋

.

3.5.1. Relevant Theorems

We start with some theorems from the Dragonbox algorithm paper:

Let n, P, and Q be positive integers where P and Q are coprime,

P < Q

,

1 ⩽ n ⩽ n_{m a x}

, and

Q > n_{m a x}

. Let

P^{*} / Q^{*}

be the best rational approximation of

P / Q

with

Q^{*} ⩽ n_{m a x}

and

P_{*} / Q_{*}

be the best rational approximation with

Q_{*} ⩽ n_{m a x}

.

If

n \cdot P

does not divide Q evenly:

\begin{matrix} ⌊ n \cdot \frac{P}{Q} ⌋ + 1 = ⌈ n \cdot \frac{P}{Q} ⌉ \end{matrix}

(30)

Suppose the following holds:

\begin{matrix} ⌊ n \cdot \frac{P}{Q} ⌋ = ⌊ n \cdot ξ ⌋ \end{matrix}

(31)

Then:

\begin{matrix} \frac{P_{*}}{Q_{*}} = max_{1 ⩽ n ⩽ n_{m a x}} \frac{⌊ n \cdot \frac{P}{Q} ⌋}{n} ⩽ ξ < min_{1 ⩽ n ⩽ n_{m a x}} \frac{⌊ n \cdot \frac{P}{Q} ⌋ + 1}{n} = min_{1 ⩽ n ⩽ n_{m a x}} \frac{⌈ n \cdot \frac{P}{Q} ⌉}{n} = \frac{P^{*}}{Q^{*}} \end{matrix}

(32)

Thus, the range of

ξ

is:

\begin{matrix} \frac{P_{*}}{Q_{*}} ⩽ ξ < \frac{P^{*}}{Q^{*}} \end{matrix}

(33)

The range of the fractional part of

n \cdot \frac{P}{Q}

is:

\begin{matrix} [\frac{(Q_{*} P) % Q}{Q}, \frac{(Q^{*} P) % Q}{Q}] \end{matrix}

(34)

Here, the fractional part is minimized when

n = Q_{*}

and maximized when

n = Q^{*}

.

3.5.2. Best Rational Approximation Function

We define the best rational approximation function (implemented in test1.py):

\begin{matrix} (\frac{P_{*}}{Q_{*}}, \frac{P^{*}}{Q^{*}}) = (D N, U P) = f (C, P, Q) \end{matrix}

(35)

This function

f (C, P, Q)

calculates the best rational approximation with denominator not exceeding C using the Farey sequence median theorem.

D N

and

U P

are adjacent terms in the C-th order Farey sequence

F_{C}

. Therefore, we can calculate Equation (34) very quickly.

3.5.3. Key Proof

In Algorithm (1), we need to compute

m = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋

. We aim to prove:

\begin{matrix} m = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} ⌋ \end{matrix}

(36)

Where r is the lookup table error defined in Equations (25) and (27). When condition (28) is met,

r = 1

, and the equation holds trivially. For

r \neq 1

:

\begin{matrix} float : & 1 < r < 1 + 2^{- 63} \\ double : & 1 < r < 1 + 2^{- 127} \end{matrix}

(37)

Calculate the range of

2^{q} \cdot 10^{- k - 1}

and we get:

\begin{matrix} 2^{q} \cdot 10^{- k - 1} = 10^{- 1} \cdot (2^{q} \cdot 10^{- ⌊ q \cdot lg (2) ⌋}) = 10^{- 1} \cdot (10^{q \cdot lg (2) - ⌊ q \cdot lg (2) ⌋}) \end{matrix}

(38)

When q is not 0, Equation (38) exists:

\begin{matrix} q \cdot lg (2) \neq ⌊ q \cdot lg (2) ⌋ \\ 0 < q \cdot lg (2) - ⌊ q \cdot lg (2) ⌋ < 1 \end{matrix}

(39)

When q is 0,

q \cdot lg (2) - ⌊ q \cdot lg (2) ⌋ = 0

, so the final conclusion is:

\begin{matrix} 10^{- 1} ⩽ 2^{q} \cdot 10^{- k - 1} < 1 \end{matrix}

(40)

Because there is:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = c \cdot \frac{2^{q - k - 1}}{5^{k + 1}} \in [0.1 c, c) \end{matrix}

(41)

Therefore:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = \{\begin{matrix} \frac{c \cdot 2^{q - k - 1}}{5^{k + 1}}; q ⩾ 1 \\ \frac{c}{2^{1 + k - q} \cdot 5^{k + 1}} = \frac{c}{10}; q = 0 \\ \frac{c \cdot 5^{- k - 1}}{2^{1 + k - q}}; q < 0 \end{matrix} \end{matrix}

(42)

Suppose:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} = c \cdot \frac{x}{y} < c \end{matrix}

(43)

Then there are:

\begin{matrix} (x, y) = \{\begin{matrix} (2^{q - k - 1}, 5^{k + 1}); q ⩾ 1 \\ (1, 10); q = 0 \\ (5^{- k - 1}, 2^{1 + k - q}); q < 0 \end{matrix} \end{matrix}

(44)

3.5.4. Bit Width Calculation

Define maximum values for c:

\begin{matrix} float : & c ⩽ c_{m a x} = C_{1} = 2^{24} - 1 \\ double : & c ⩽ c_{m a x} = C_{2} = 2^{53} - 1 \end{matrix}

(45)

Let C denote either

C_{1}

or

C_{2}

depending on the precision.

For

y > C

, compute

P^{*}

and

Q^{*}

for each q using

f (C, x, y)

and find the minimum

B I T

such that:

\begin{matrix} \frac{x}{y} (1 + 2^{- B I T}) < \frac{P^{*}}{Q^{*}} \end{matrix}

(46)

For

y ⩽ C

:

\begin{matrix} c \cdot \frac{x}{y} (1 + \frac{1}{C y}) = \frac{c x + \frac{c}{C} \cdot \frac{x}{y}}{y} < \frac{c x + 1}{y} \end{matrix}

(47)

Thus:

\begin{matrix} ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} (1 + \frac{1}{C y}) ⌋ \end{matrix}

(48)

Similarly, find the minimum

B I T

such that:

\begin{matrix} \frac{x}{y} (1 + 2^{- B I T}) < \frac{x}{y} (1 + \frac{1}{C y}) \end{matrix}

(49)

3.5.5. Results

The maximum of the minimum

B I T

values for all q (calculated in test1.py in about 1-2 seconds) is:

\begin{matrix} float : & B I T_{m a x} = 52 \\ double : & B I T_{m a x} = 113 \end{matrix}

(50)

Thus:

\begin{matrix} float : & ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} \cdot (1 + 2^{- 52}) ⌋ = ⌊ c \cdot \frac{x}{y} \cdot r_{1} ⌋ \\ double : & ⌊ c \cdot \frac{x}{y} ⌋ = ⌊ c \cdot \frac{x}{y} \cdot (1 + 2^{- 113}) ⌋ = ⌊ c \cdot \frac{x}{y} \cdot r_{2} ⌋ \end{matrix}

(51)

This result confirms that m can be calculated efficiently using the lookup table values, even with their inherent errors. Once m is determined,

t e n = 10 m

can be computed quickly.

3.6. Fast Boundary Condition Testing for $o n e = 0$ and $o n e = 10$

In Algorithm 1, the conditions for determining

o n e = 0

and

o n e = 10

appear on lines 31 and 34, respectively. This section introduces an optimized method to quickly test these boundary conditions using equivalent mathematical formulations.

3.6.1. Equivalent Conditions for Boundary Testing

We start by deriving equivalent mathematical conditions for testing

o n e = 0

and

o n e = 10

.

Case 1: Testing $o n e = 0$

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

, this is equivalent to:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & = 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} & = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \end{matrix}

(52)

Case 2: Testing $o n e = 10$

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

, this is equivalent to:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ - c \cdot 2^{q} \cdot 10^{- k - 1} + 1 & = 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \\ (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} & = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(53)

3.6.2. Integer Testing Analysis

To further analyze these conditions, we start with the range of

2^{q - 1} \cdot 10^{- k - 1}

from Equation (40):

2^{q - 1} \cdot 10^{- k - 1} \in [0.05, 0.5)

(54)

Analysis for $o n e = 0$

For the

o n e = 0

case, we can derive:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ - 1 & < c \cdot 2^{q} \cdot 10^{- k - 1} - 0.5 \\ < (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} \\ ⩽ c \cdot 2^{q} \cdot 10^{- k - 1} - 0.05 < ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(55)

This implies that when

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is an integer, it must equal

⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋

.

Analysis for $o n e = 10$

Similarly, for the

o n e = 10

case:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ & < c \cdot 2^{q} \cdot 10^{- k - 1} + 0.05 \\ ⩽ (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} \\ < c \cdot 2^{q} \cdot 10^{- k - 1} + 0.5 < ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 2 \end{matrix}

(56)

This implies that when

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is an integer, it must equal

⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + 1

.

3.6.3. Key Insight: Integer Divisibility Test

The key insight is that testing for

o n e = 0

or

o n e = 10

is equivalent to checking whether

(2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is an integer. This can be rewritten as:

(2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c \pm 1) \cdot 2^{q - k - 2} \cdot 5^{- k - 1}

(57)

We analyze different ranges of q to simplify this condition:

Case $q ⩾ 2$ : From $q ⩾ 2$ , we get $k ⩾ 0$ . The expression simplifies to checking if $(2 c \pm 1) \cdot 2^{q - k - 2}$ is divisible by $5^{k + 1}$ . Since 2 and 5 are coprime, this reduces to checking if $(2 c \pm 1)$ is divisible by $5^{k + 1}$ :

$(2 c \pm 1) mod 5^{k + 1} = 0$

(58)

Let t be a positive integer such that $2 c \pm 1 = t \cdot 5^{k + 1}$ . Since $2 c \pm 1$ is odd, t must also be odd. Considering the ranges of c for float and double:

$\begin{matrix} float : & 2 c - 1 \in [2^{24} + 1, 2^{25} - 3]; 2 c + 1 \in [2^{24} + 3, 2^{25} - 1]; \\ double : & 2 c - 1 \in [2^{53} + 1, 2^{54} - 3]; 2 c + 1 \in [2^{53} + 3, 2^{54} - 1]; \end{matrix}$

(59)

This gives us the range for t:

$\begin{matrix} float : & \frac{2^{24} + 1}{5^{k + 1}} ⩽ t ⩽ \frac{2^{25} - 1}{5^{k + 1}}; \\ double : & \frac{2^{53} + 1}{5^{k + 1}} ⩽ t ⩽ \frac{2^{54} - 1}{5^{k + 1}}; \end{matrix}$

(60)

The maximum values of k where t can be at least one odd integer are:

$\begin{matrix} float : & k_{max} = 9 \Rightarrow q_{max} = 33, t = 3 \\ double : & k_{max} = 22 \Rightarrow q_{max} = 76, t = 1 \end{matrix}$

(61)
Case $1 ⩾ q ⩾ 0$ : The denominator $2^{2 + k - q} \cdot 5^{k + 1}$ is even, while the numerator $(2 c \pm 1)$ is odd, so no solution exists.
Case $q < 0$ : The denominator $2^{2 + k - q}$ is even, while the numerator $(2 c \pm 1) \cdot 5^{- k - 1}$ is odd, so no solution exists.

3.6.4. Summary of Boundary Conditions

In summary, the situations when

(2 c \pm 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is an integer are as follows:

\begin{matrix} float : & 2 ⩽ q ⩽ 33 & & (2 c \pm 1) mod 5^{k + 1} = 0; \\ double : & 2 ⩽ q ⩽ 76 & & (2 c \pm 1) mod 5^{k + 1} = 0; \end{matrix}

(62)

The range of

- k - 1

is:

\begin{matrix} float : & - 10 ⩽ - k - 1 ⩽ - 1 \\ double : & - 23 ⩽ - k - 1 ⩽ - 1 \end{matrix}

(63)

3.6.5. Efficient Implementation

We can further simplify the testing conditions using bitwise operations. For

o n e = 0

:

\begin{matrix} float : & ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋ \\ double : & ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(64)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

, the following conclusions can be drawn:

\begin{matrix} float : & \{\begin{matrix} 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} = 2^{36} - 2^{36} \cdot n \Rightarrow \\ ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋ = 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋ \end{matrix} \\ double : & \{\begin{matrix} 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} = 2^{64} - 2^{64} \cdot n \Rightarrow \\ ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} - 2^{64} \cdot n ⌋ = 2^{64} - 1 - ⌊ 2^{64} \cdot n ⌋ \end{matrix} \end{matrix}

(65)

The discussion on whether

⌊ 2^{36} - 2^{36} \cdot n ⌋ = 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋

in Equation (65) holds true, that is, whether

2^{36} \cdot n

in Equation (65) is an integer, or equivalent to discussing whether the following values are integers when Equation (62) holds true (the same applies to double) :

\begin{matrix} float : & 2^{36} \cdot (m + n) = c \cdot 2^{q + 36} \cdot 10^{- k - 1} = c \cdot 2^{q - k + 35} \cdot 5^{- k - 1} = c \cdot \frac{2^{q - k + 35}}{5^{k + 1}} \\ double : & 2^{64} \cdot (m + n) = c \cdot 2^{q + 64} \cdot 10^{- k - 1} = c \cdot 2^{q - k + 63} \cdot 5^{- k - 1} = c \cdot \frac{2^{q - k + 63}}{5^{k + 1}} \end{matrix}

(66)

Suppose c can divide

5^{k + 1}

evenly (where t is a temporary integer variable):

\begin{matrix} c = t \cdot 5^{k + 1}; t ⩾ 1 \end{matrix}

(67)

Therefore, when Equation (67) was established, there were:

\begin{matrix} 2 c \pm 1 = 2 \cdot t \cdot 5^{k + 1} \pm 1 \end{matrix}

(68)

Expression (68) cannot divide

5^{k + 1}

evenly, which contradicts Equation (62), so c cannot divide

5^{k + 1}

evenly. Therefore, for float,

c \cdot 2^{q + 36} \cdot 10^{- k - 1}

and

2^{36} \cdot n

are not integers; For double,

c \cdot 2^{64 + q} \cdot 10^{- k - 1}

and

2^{64} \cdot n

are not integers, that is:

\begin{matrix} float : & ⌊ 2^{36} - 2^{36} \cdot n ⌋ = 2^{36} + ⌊ - 2^{36} \cdot n ⌋ = 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋ \\ double : & ⌊ 2^{64} - 2^{64} \cdot n ⌋ = 2^{64} + ⌊ - 2^{64} \cdot n ⌋ = 2^{64} - 1 - ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(69)

Therefore, the conclusion (65) is correct. Discuss the necessary and sufficient conditions for whether

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋

is

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

. The same applies to double, expressed as:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n \Leftrightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋ \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n \Leftrightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(70)

Similarly, the necessary and sufficient conditions for whether

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋

is

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

. The same applies to double, expressed as:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n \Leftrightarrow ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋ \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n \Leftrightarrow ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{64} - 2^{64} \cdot n ⌋ \end{matrix}

(71)

The sufficient conditions of Equations (70) and (71) are obviously established. Introduce the proof that Equation (70) holds. For float, only the necessary conditions need to be discussed, that is, whether

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

must hold true when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋

holds, or equivalent to

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ 2^{36} \cdot n ⌋

must hold true when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

. The following is proved by proof by contradiction.

Assume that

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋

holds when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

. Then there is:

\begin{matrix} ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋ \\ \Rightarrow 0 < |2^{35} \cdot 2^{q} \cdot 10^{- k - 1} - 2^{36} \cdot n| < 1 \\ \Rightarrow 0 < |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| < 2^{- 36} \end{matrix}

(72)

As is known from Equation (55), there is:

\begin{matrix} m - 1 < (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 1 \end{matrix}

(73)

Suppose the decimal part of

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is represented as

n^{-}

, thus we have:

\begin{matrix} |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| = \{\begin{matrix} n^{-}; if (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} > m \\ 1 - n^{-}; if (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m \end{matrix} \end{matrix}

(74)

Substitute expression (74) into expression (72), and we get:

\begin{matrix} 0 < |(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m| < 2^{- 36} \\ \Rightarrow 0 < n^{-} < 2^{- 36} or 0 < 1 - n^{-} < 2^{- 36} \end{matrix}

(75)

Similarly, it can be known that the double range is the range of

n^{-}

. Therefore, there is:

\begin{matrix} float : & n^{-} \in (0, 2^{- 36}) \cup (1 - 2^{- 36}, 1) \\ double : & n^{-} \in (0, 2^{- 64}) \cup (1 - 2^{- 64}, 1) \end{matrix}

(76)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

, it is known from Equation (52) that

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. Therefore, there is:

\begin{matrix} 0 < n^{-} < 1 \end{matrix}

(77)

It is only necessary to prove that Equation (76) does not hold. Discuss the range of the decimal part

n^{-}

when

(2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. According to Equation (57), there are:

\begin{matrix} (2 c - 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c - 1) \cdot \frac{x}{y} = \{\begin{matrix} \frac{(2 c - 1) \cdot 2^{q - k - 2}}{5^{k + 1}}; q ⩾ 2 \\ \frac{(2 c - 1)}{2^{2 + k - q} \cdot 5^{k + 1}}; 1 ⩾ q ⩾ 0 \\ \frac{(2 c - 1) \cdot 5^{- k - 1}}{2^{2 + k - q}}; q < 0 \end{matrix} \end{matrix}

(78)

The maximum value of

2 c - 1

is:

\begin{matrix} float : & {(2 c - 1)}_{max} = 2^{25} - 3 \\ double : & {(2 c - 1)}_{max} = 2^{54} - 3 \end{matrix}

(79)

Discuss based on the denominator range in Equation (78).

$y ⩽ {(2 c - 1)}_{max}$

When $y ⩽ {(2 c - 1)}_{max}$ , $y_{max}$ is the expression (79), the following holds true:

$\begin{matrix} \frac{1}{y_{max}} ⩽ n^{-} ⩽ 1 - \frac{1}{y_{max}} \\ \frac{1}{y_{max}} ⩽ 1 - n^{-} ⩽ 1 - \frac{1}{y_{max}} \end{matrix}$

(80)

Therefore, when $y ⩽ {(2 c - 1)}_{max}$ , Equation (76) does not hold true.
$y > {(2 c - 1)}_{max}$

Call function (35) to calculate the approximation results $P_{*} / Q_{*}$ and $P^{*} / Q^{*}$ of all possible upper and lower limit rational numbers:

$\begin{matrix} (\frac{P_{*}}{Q_{*}}, \frac{P^{*}}{Q^{*}}) = f ({(2 c - 1)}_{max}, x, y) \end{matrix}$

(81)

Therefore, for $n^{-}$ , the following conclusion can be drawn from formula (34).

$\begin{matrix} n^{-} \in [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \end{matrix}$

(82)

By exhausting all possibilities, we thus have (the test code file is test3.py) :

$\begin{matrix} float : & 2^{- 33} < n^{-} < 1 - 2^{- 29} \\ double : & 2^{- 62} < n^{-} < 1 - 2^{- 63} \end{matrix}$

(83)

$\begin{matrix} float : & [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 36}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 36}, 1) = ⌀ \\ double : & [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 64}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 64}, 1) = ⌀ \end{matrix}$

(84)

Therefore, when $y > {(2 c - 1)}_{max}$ , Equation (76) does not hold true.

In summary, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq n

, Equation (76) does not hold true, that is,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ 2^{36} \cdot n ⌋

must hold true. Therefore, when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} \cdot n ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

must hold true. Therefore, Equation (70) holds.

Similarly, it can be proved that when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

must hold true. The same applies to double. Similarly, by proof of contradiction, for float, it is assumed that when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

holds,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋

holds. That is:

\begin{matrix} ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋ \\ \Rightarrow 0 < |2^{35} \cdot 2^{q} \cdot 10^{- k - 1} - 2^{36} + 2^{36} \cdot n| < 1 \\ \Rightarrow 0 < |2^{q - 1} \cdot 10^{- k - 1} - 1 + n| < 2^{- 36} \\ \Rightarrow - 2^{- 36} < (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1 < 2^{- 36} \end{matrix}

(85)

As is known from Equation (56), there is:

\begin{matrix} m < (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 2 \end{matrix}

(86)

Suppose the decimal part of

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is represented as

n^{+}

, thus we have:

\begin{matrix} (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1 = \{\begin{matrix} n^{+}; if (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} > m + 1 \\ 1 - n^{+}; if (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} < m + 1 \end{matrix} \end{matrix}

(87)

Substitute expression (87) into expression (85), and we get:

\begin{matrix} 0 < |(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} - m - 1| < 2^{- 36} \\ \Rightarrow 0 < 1 - n^{+} < 2^{- 36} or 0 < n^{+} < 2^{- 36} \end{matrix}

(88)

Similarly, it can be known that the double range is the range of

n^{+}

. Therefore, there is:

\begin{matrix} float : & n^{+} \in (0, 2^{- 36}) \cup (1 - 2^{- 36}, 1) \\ double : & n^{+} \in (0, 2^{- 64}) \cup (1 - 2^{- 64}, 1) \end{matrix}

(89)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

, it is known from Equation (53) that

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. Therefore, there is:

\begin{matrix} 0 < n^{+} < 1 \end{matrix}

(90)

It is only necessary to prove that Equation (89) does not hold. Discuss the range of the decimal part

n^{+}

when

(2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1}

is not an integer. According to Equation (57), there are:

\begin{matrix} (2 c + 1) \cdot 2^{q - 1} \cdot 10^{- k - 1} = (2 c + 1) \cdot \frac{x}{y} = \{\begin{matrix} \frac{(2 c + 1) \cdot 2^{q - k - 2}}{5^{k + 1}}; q ⩾ 2 \\ \frac{(2 c + 1)}{2^{2 + k - q} \cdot 5^{k + 1}}; 1 ⩾ q ⩾ 0 \\ \frac{(2 c + 1) \cdot 5^{- k - 1}}{2^{2 + k - q}}; q < 0 \end{matrix} \end{matrix}

(91)

The maximum value of

2 c + 1

is:

\begin{matrix} float : & {(2 c + 1)}_{max} = 2^{25} - 1 \\ double : & {(2 c + 1)}_{max} = 2^{54} - 1 \end{matrix}

(92)

Discuss based on the denominator range in Equation (91).

$y ⩽ {(2 c + 1)}_{max}$

When $y ⩽ {(2 c + 1)}_{max}$ , $y_{max}$ is the expression (92), the following holds true:

$\begin{matrix} \frac{1}{y_{max}} ⩽ n^{+} ⩽ 1 - \frac{1}{y_{max}} \\ \frac{1}{y_{max}} ⩽ 1 - n^{+} ⩽ 1 - \frac{1}{y_{max}} \end{matrix}$

(93)

Therefore, when $y ⩽ {(2 c + 1)}_{max}$ , Equation (89) does not hold true.
$y > {(2 c + 1)}_{max}$

Call function (35) to calculate the approximation results $\frac{P_{*}}{Q_{*}}$ and $\frac{P^{*}}{Q^{*}}$ of all possible upper and lower limit rational numbers:

$\begin{matrix} (\frac{P_{*}}{Q_{*}}, \frac{P^{*}}{Q^{*}}) = f ({(2 c + 1)}_{max}, x, y) \end{matrix}$

(94)

Therefore, for $n^{+}$ , the following conclusion can be drawn from formula (34).

$\begin{matrix} n^{+} \in [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \end{matrix}$

(95)

By exhausting all possibilities, we thus have (the test code file is test7.py) :

$\begin{matrix} float : & 2^{- 33} < n^{+} < 1 - 2^{- 29} \\ double : & 2^{- 62} < n^{+} < 1 - 2^{- 63} \end{matrix}$

(96)

$\begin{matrix} float : & [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 36}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 36}, 1) = ⌀ \\ double : & [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (0, 2^{- 64}) = ⌀ \\ [\frac{(Q_{*} x) % y}{y}, \frac{(Q^{*} x) % y}{y}] \cap (1 - 2^{- 64}, 1) = ⌀ \end{matrix}$

(97)

Therefore, when $y > {(2 c + 1)}_{max}$ , Equation (89) does not hold true.

In summary, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} \neq 1 - n

, Equation (89) does not hold true, that is,

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \neq ⌊ 2^{36} - 2^{36} \cdot n ⌋

must hold true. Therefore, when

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ 2^{36} - 2^{36} \cdot n ⌋

holds,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

must hold true. The same is true for double.Therefore, Equation (71) holds.

The following conclusions hold:

\begin{matrix} float : & ⌊ 2^{36} - 2^{36} \cdot n ⌋ & = \{\begin{matrix} 2^{36} - 1 - ⌊ 2^{36} \cdot n ⌋; if c \cdot 2^{36 + q} \cdot 10^{- k - 1} \notin Z \\ 2^{36} - ⌊ 2^{36} \cdot n ⌋; if c \cdot 2^{36 + q} \cdot 10^{- k - 1} \in Z \end{matrix} \\ double : & ⌊ 2^{64} - 2^{64} \cdot n ⌋ & = \{\begin{matrix} 2^{64} - 1 - ⌊ 2^{64} \cdot n ⌋; if c \cdot 2^{64 + q} \cdot 10^{- k - 1} \notin Z \\ 2^{64} - ⌊ 2^{64} \cdot n ⌋; if c \cdot 2^{64 + q} \cdot 10^{- k - 1} \in Z \end{matrix} \end{matrix}

(98)

Discuss whether the following Equation (99) holds when conditions (62) and (63) are met:

\begin{matrix} float : & ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} ⌋ = ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r ⌋ \\ = ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot \frac{(2^{63 - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋} / / 10^{k + 1}) + 1}{10^{- k - 1}} \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} (10) ⌋ - 63} ⌋ \\ double : & ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} ⌋ = ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot r ⌋ \\ = ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot \frac{(2^{127 - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋} / / 10^{k + 1}) + 1}{10^{- k - 1}} \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} (10) ⌋ - 127} ⌋ \end{matrix}

(99)

There are:

\begin{matrix} float : & ⌊ c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} ⌋ = ⌊ 2^{36} \cdot (m + n) ⌋ = 2^{36} \cdot m + ⌊ 2^{36} \cdot n ⌋ \\ double : & ⌊ c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} ⌋ = ⌊ 2^{64} \cdot (m + n) ⌋ = 2^{64} \cdot m + ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(100)

It has been proven earlier that m can be accurately calculated. Then, when (99) holds true, the values

⌊ 2^{36} \cdot n ⌋

and

⌊ 2^{64} \cdot n ⌋

on the right side of Equations (64) and (65) can be accurately calculated.

From Equation (57), we have:

\begin{matrix} c = \frac{t \cdot 5^{k + 1} - 1}{2} \end{matrix}

(101)

Substituting Equation (101) into Equation (99), we have:

\begin{matrix} float : & c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} = t \cdot 2^{q + 34 - k} - \frac{2^{q + 34 - k}}{5^{k + 1}} \\ double : & c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} = t \cdot 2^{q + 62 - k} - \frac{2^{q + 62 - k}}{5^{k + 1}} \end{matrix}

(102)

When conditions (62) and (63) are met,

t \cdot 2^{q + 34 - k}

and

t \cdot 2^{q + 62 - k}

are integers. Under the condition of meeting condition (62), the decimal part of expression (102) is represented as:

\begin{matrix} float : & \frac{2^{q + 34 - k} % 5^{k + 1}}{5^{k + 1}}; 2 ⩽ q ⩽ 33 \\ double : & \frac{2^{q + 62 - k} % 5^{k + 1}}{5^{k + 1}}; 2 ⩽ q ⩽ 76 \end{matrix}

(103)

It is only necessary to prove that the increase in the value

c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r

on the right side of the expression compared to the value

c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}}

on the left side plus the decimal part of the value on the left side is less than 1 for Equation (99) to hold true. That is:

\begin{matrix} float : & \frac{2^{q + 34 - k} % 5^{k + 1}}{5^{k + 1}} + (c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}} \cdot r - c \cdot \frac{2^{q + 35 - k}}{5^{k + 1}}) < 1 \\ double : & \frac{2^{q + 62 - k} % 5^{k + 1}}{5^{k + 1}} + (c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}} \cdot r - c \cdot \frac{2^{q + 63 - k}}{5^{k + 1}}) < 1 \end{matrix}

(104)

By exhaustionally calculating the maximum possible c value under each q and substituting it into Equation (104), it holds. The calculation result is in test2.py. The calculation results show that for the float range and the double range, Equation (104) always holds true. Therefore, Equation (99) holds true, and thus the values of

⌊ 2^{36} \cdot n ⌋

and

⌊ 2^{64} \cdot n ⌋

on the right side of Equations (64) and (65) can be accurately calculated. The values of

⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋

and

⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋

on the left side of Equations (64) and (65) can be calculated through lookup tables.

\begin{matrix} float : & ⌊ 2^{35} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = p o w 10 ≫ (28 - q - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋) \\ double : & ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = p o w 10 ≫ (64 - q - ⌊ (- k - 1) \cdot {log}_{2} (10) ⌋) \end{matrix}

(105)

The code file for verifying the validity of Equation (105) is test4.py. Therefore, when conditions (62) and (63) are met, the values of both sides of Equations (64) and (65) can be accurately calculated.

Discuss the relationship between the following two values within all ranges of floating-point numbers:

\begin{matrix} float : & ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋; ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋; \\ double : & ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋; ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋; \end{matrix}

(106)

When

r = 1

, it is obvious that the two values in expression (106) are equal. When

r \neq 1

, or equivalent to

r > 1

, has:

\begin{matrix} float : c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} & = c \cdot 2^{q + 36} \cdot 10^{- k - 1} + c \cdot 2^{q + 36} \cdot (r - 1) \cdot 10^{- k - 1} \\ < c \cdot 2^{q + 36} \cdot 10^{- k - 1} + 2^{24} \cdot 2^{36} \cdot 2^{q} \cdot 10^{- k - 1} \cdot (r - 1) \\ < c \cdot 2^{q + 36} \cdot 10^{- k - 1} + 2^{- 3} \\ ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋ & ⩽ ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ + 1 \\ double : c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} & = c \cdot 2^{q + 64} \cdot 10^{- k - 1} + c \cdot 2^{q + 64} \cdot (r - 1) \cdot 10^{- k - 1} \\ < c \cdot 2^{q + 64} \cdot 10^{- k - 1} + 2^{53} \cdot 2^{64} \cdot 2^{q} \cdot 10^{- k - 1} \cdot (r - 1) \\ < c \cdot 2^{q + 64} \cdot 10^{- k - 1} + 2^{- 10} \\ ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋ & ⩽ ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ + 1 \end{matrix}

(107)

Therefore, there is:

\begin{matrix} float : & 0 ⩽ ⌊ c \cdot 2^{q + 36} \cdot r \cdot 10^{- k - 1} ⌋ - ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ ⩽ 1 \\ double : & 0 ⩽ ⌊ c \cdot 2^{q + 64} \cdot r \cdot 10^{- k - 1} ⌋ - ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ ⩽ 1 \end{matrix}

(108)

Because there is:

\begin{matrix} ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = ⌊ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} ⌋ = m \end{matrix}

(109)

\begin{matrix} float : & ⌊ c \cdot 2^{q + 36} \cdot 10^{- k - 1} ⌋ = 2^{36} \cdot m + ⌊ 2^{36} \cdot n ⌋ \\ double : & ⌊ c \cdot 2^{q + 64} \cdot 10^{- k - 1} ⌋ = 2^{64} \cdot m + ⌊ 2^{64} \cdot n ⌋ \end{matrix}

(110)

Suppose:

\begin{matrix} n_{r} = c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - m \end{matrix}

(111)

Therefore, the following conclusion can be drawn: when condition (62) is met, from Equation (99), we have:

\begin{matrix} float : & 2 ⩽ q ⩽ 33 & & (2 c \pm 1) % 5^{k + 1} = 0 \Rightarrow ⌊ 2^{36} \cdot n ⌋ = ⌊ 2^{36} \cdot n_{r} ⌋ \\ double : & 2 ⩽ q ⩽ 76 & & (2 c \pm 1) % 5^{k + 1} = 0 \Rightarrow ⌊ 2^{64} \cdot n ⌋ = ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(112)

Within the range of floating-point numbers, there exists:

\begin{matrix} float : & ⌊ 2^{36} \cdot n ⌋ ⩽ ⌊ 2^{36} \cdot n_{r} ⌋ ⩽ ⌊ 2^{36} \cdot n ⌋ + 1 \\ double : & ⌊ 2^{64} \cdot n ⌋ ⩽ ⌊ 2^{64} \cdot n_{r} ⌋ ⩽ ⌊ 2^{64} \cdot n ⌋ + 1 \end{matrix}

(113)

To simplify the expression,

e v e n

is used to indicate whether c is an even number:

\begin{matrix} e v e n = (c + 1) % 2 \in \{0, 1\} \end{matrix}

(114)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

,

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

is the boundary condition for

o n e = 0

, and

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

is the boundary condition for

o n e = 10

. Whether

o n e

is 0 or 10 is determined based on whether c is an even number.Therefore, the following exists:

\begin{matrix} float : & \{\begin{matrix} o n e = 0 : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ 2^{36} \cdot n_{r} ⌋ \\ o n e = 10 : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix} \\ double : & \{\begin{matrix} o n e = 0 : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ 2^{64} \cdot n_{r} ⌋ \\ o n e = 10 : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix} \end{matrix}

(115)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

, we can use the condition (116) to determine whether

o n e = 0

or

o n e = 10

.

\begin{matrix} float : & \{\begin{matrix} if ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ 2^{36} \cdot n_{r} ⌋ : o n e = 0 \\ if ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ : o n e = 10 \end{matrix} \\ double : & \{\begin{matrix} if ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n > ⌊ 2^{64} \cdot n_{r} ⌋ : o n e = 0 \\ if ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n > 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ : o n e = 10 \end{matrix} \end{matrix}

(116)

When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, We can also use the above condition (116) to determine whether

o n e = 0

or

o n e = 10

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, we can also use the above condition (116) to determine whether

o n e \neq 0

or

o n e \neq 10

. There are a total of four situations.The proof is as follows:

(1)When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

, there must exist

o n e \neq 0

, and there is:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} - 1 \in (2^{- 33} - 1, - 2^{- 29}) \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} - 1 \in (2^{- 62} - 1, - 2^{- 63}) \end{matrix}

(117)

Therefore, the following exists:

\begin{matrix} float : & 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n \in (2^{3} - 2^{36}, - 2^{7}) \\ double : & 2^{q + 63} \cdot 10^{- k - 1} - 2^{64} \cdot n \in (4 - 2^{64}, - 2) \end{matrix}

(118)

Suppose there are two real numbers a and b, and the following relationship must exist:

\begin{matrix} 0 ⩽ & b - ⌊ b ⌋ < 1 \\ a - ⌊ a ⌋ - 1 < & b - ⌊ b ⌋ < 1 + a - ⌊ a ⌋ \\ a - b - 1 < & ⌊ a ⌋ - ⌊ b ⌋ < a - b + 1 \end{matrix}

(119)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

or

a = 2^{q + 63} \cdot 10^{- k - 1}

and

b = 2^{64} \cdot n

, the following exists:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ < 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n + 1 \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{64} \cdot n ⌋ < 2^{q + 63} \cdot 10^{- k - 1} - 2^{64} \cdot n + 1 \end{matrix}

(120)

From Equation (118), we have:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ < 1 - 2^{7} < 0 \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{64} \cdot n ⌋ < 1 - 2 < 0 \end{matrix}

(121)

Therefore, there is:

\begin{matrix} float : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + 1 \\ < ⌊ 2^{36} \cdot n ⌋ ⩽ ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & < ⌊ 2^{36} \cdot n_{r} ⌋ \\ double : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + 1 \\ < ⌊ 2^{64} \cdot n ⌋ ⩽ ⌊ 2^{64} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & < ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(122)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

, the condition (116) can be used to determine that

o n e \neq 0

.

(2)When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

, there must exist

o n e = 0

, and there is:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} \in (2^{- 33}, 1 - 2^{- 29}) \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} - n = n^{-} \in (2^{- 62}, 1 - 2^{- 63}) \end{matrix}

(123)

Therefore, the following exists:

\begin{matrix} float : & 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n \in (2^{3}, 2^{36} - 2^{7}) \\ double : & 2^{q + 63} \cdot 10^{- k - 1} - 2^{64} \cdot n \in (4, 2^{64} - 2) \end{matrix}

(124)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

or

a = 2^{q + 63} \cdot 10^{- k - 1}

and

b = 2^{64} \cdot n

, from Equation (119), the following exists:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ > 2^{q + 35} \cdot 10^{- k - 1} - 2^{36} \cdot n - 1 \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{64} \cdot n ⌋ > 2^{q + 63} \cdot 10^{- k - 1} - 2^{64} \cdot n - 1 \end{matrix}

(125)

From Equation (124), we have:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{36} \cdot n ⌋ > 2^{3} - 1 ⩾ 0 \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ - ⌊ 2^{64} \cdot n ⌋ > 4 - 1 ⩾ 0 \end{matrix}

(126)

Therefore, there is:

\begin{matrix} float : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩾ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ \\ > ⌊ 2^{36} \cdot n ⌋ + 1 ⩾ ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & > ⌊ 2^{36} \cdot n_{r} ⌋ \\ double : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & ⩾ ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ \\ > ⌊ 2^{64} \cdot n ⌋ + 1 ⩾ ⌊ 2^{64} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & > ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(127)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

, the condition (116) can be used to determine that

o n e = 0

.

(3)When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, there must exist

o n e \neq 10

, and there is:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} \in (2^{- 33}, 1 - 2^{- 29}) \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} \in (2^{- 62}, 1 - 2^{- 63}) \end{matrix}

(128)

Therefore, the following exists:

\begin{matrix} float : & 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \in (2^{3}, 2^{36} - 2^{7}) \\ double : & 2^{q + 63} \cdot 10^{- k - 1} + 2^{64} \cdot n \in (4, 2^{64} - 2) \end{matrix}

(129)

Suppose there are two real numbers a and b, and the following relationship must exist:

\begin{matrix} a - 1 & < ⌊ a ⌋ ⩽ a \\ b - 1 & < ⌊ b ⌋ ⩽ b \\ a + b - 2 & < ⌊ a ⌋ + ⌊ b ⌋ ⩽ a + b \end{matrix}

(130)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

or

a = 2^{q + 63} \cdot 10^{- k - 1}

and

b = 2^{64} \cdot n

, the following exists:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ ⩽ 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{64} \cdot n ⌋ ⩽ 2^{q + 63} \cdot 10^{- k - 1} + 2^{64} \cdot n \end{matrix}

(131)

From Equation (129), we have:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ < 2^{36} - 2^{7} \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{64} \cdot n ⌋ < 2^{64} - 2 \end{matrix}

(132)

Therefore, there is:

\begin{matrix} float : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + 1 \\ < 2^{36} - 2 - ⌊ 2^{36} \cdot n ⌋ \\ < 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & < 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ double : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & ⩽ ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + 1 \\ < 2^{64} - 2 - ⌊ 2^{64} \cdot n ⌋ \\ < 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & < 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(133)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, the condition (116) can be used to determine that

o n e \neq 10

.

(4)When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, there must exist

o n e = 10

, and there is:

\begin{matrix} float : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} + 1 \in (1 + 2^{- 33}, 2 - 2^{- 29}) \\ double : & 2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} + n = n^{+} + 1 \in (1 + 2^{- 62}, 2 - 2^{- 63}) \end{matrix}

(134)

Therefore, the following exists:

\begin{matrix} float : & 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n \in (2^{3} + 2^{36}, 2^{37} - 2^{7}) \\ double : & 2^{q + 63} \cdot 10^{- k - 1} + 2^{64} \cdot n \in (4 + 2^{64}, 2^{65} - 2) \end{matrix}

(135)

When

a = 2^{q + 35} \cdot 10^{- k - 1}

and

b = 2^{36} \cdot n

or

a = 2^{q + 63} \cdot 10^{- k - 1}

and

b = 2^{64} \cdot n

, from Equation (130), the following exists:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ > 2^{q + 35} \cdot 10^{- k - 1} + 2^{36} \cdot n - 2 \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{64} \cdot n ⌋ > 2^{q + 63} \cdot 10^{- k - 1} + 2^{64} \cdot n - 2 \end{matrix}

(136)

From Equation (135), we have:

\begin{matrix} float : & ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{36} \cdot n ⌋ > 2^{36} + 2^{3} - 2 > 2^{36} \\ double : & ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + ⌊ 2^{64} \cdot n ⌋ > 2^{64} + 2 - 2 > 2^{64} \end{matrix}

(137)

Therefore, there is:

\begin{matrix} float : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & ⩾ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ \\ > 2^{36} - ⌊ 2^{36} \cdot n ⌋ \\ > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ double : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & ⩾ ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ \\ > 2^{64} - ⌊ 2^{64} \cdot n ⌋ \\ > 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \\ \Rightarrow ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & > 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \end{matrix}

(138)

Therefore, when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, the condition (116) can be used to determine that

o n e = 10

.

From the above proof, it can be seen that when condition (62) is met, the condition (116) can be used to determine whether

o n e = 0

or

o n e = 10

when

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} = 1 - n

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} > 1 - n

, the condition (116) can be used to determine whether

o n e = 0

or

o n e = 10

. When

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < n

or

2^{- 1} \cdot 2^{q} \cdot 10^{- k - 1} < 1 - n

, the condition (116) can be used to determine whether

o n e \neq 0

or

o n e \neq 10

.

The proof process of this section is completed. In the code implementation, the two judgment conditions can be quickly calculated using addition and subtraction shift operations, and can be compiled by the compiler into cmov instructions, thereby reducing the impact of branch prediction failure on performance.

Ultimately, we reached the following conclusion:

o n e

can quickly determine whether

o n e

equals 0 or 10 by using the following method.

\begin{matrix} float : ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & > ⌊ 2^{36} \cdot n_{r} ⌋ \Rightarrow o n e = 0 \\ ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + e v e n & > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \Rightarrow o n e = 10 \\ double : ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & > ⌊ 2^{64} \cdot n_{r} ⌋ \Rightarrow o n e = 0 \\ ⌊ 2^{q + 63} \cdot 10^{- k - 1} ⌋ + e v e n & > 2^{64} - 1 - ⌊ 2^{64} \cdot n_{r} ⌋ \Rightarrow o n e = 10 \end{matrix}

(139)

3.7. Efficient Computation of $⌊ 10 n ⌋$ and Rounding

Determine whether

o n e

is

⌊ 10 n ⌋

or

⌊ 10 n ⌋ + 1

based on the decimal part of

10 n

. There are two cases: the decimal part of

10 n

is 0.5 and it is not 0.5.

3.7.1. $10 n - ⌊ 10 n ⌋ = 0.5$

When the decimal part of

10 n

is 0.5, there must be:

\begin{matrix} 10 n - ⌊ 10 n ⌋ = 0.5 \\ \Rightarrow 10 \cdot c \cdot 2^{q} \cdot 10^{- k - 1} - ⌊ 10 \cdot c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ = 0.5 \\ \Rightarrow c \cdot 2^{q} \cdot 10^{- k} - ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ = 0.5 \\ \Rightarrow c \cdot 2^{q} \cdot 10^{- k} = ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ + 0.5 \\ \Rightarrow 2 c \cdot 2^{q} \cdot 10^{- k} = 2 ⌊ c \cdot 2^{q} \cdot 10^{- k} ⌋ + 1 \end{matrix}

(140)

So

2 c \cdot 2^{q} \cdot 10^{- k}

is an odd number. Then the following expression is odd:

\begin{matrix} c \cdot 2^{q + 1} \cdot 10^{- k} = c \cdot 2^{q - k + 1} \cdot 5^{- k} \end{matrix}

(141)

According to the range of q, there are:

\begin{matrix} c \cdot 2^{q + 1} \cdot 10^{- k} = \{\begin{matrix} \frac{c \cdot 2^{q - k + 1}}{5^{k}}; q ⩾ 0 \\ c \cdot 2 \cdot 5^{- k}; q = - 1 \\ \frac{c \cdot 5^{- k}}{2^{k - q - 1}}; q ⩽ - 2 \end{matrix} \end{matrix}

(142)

According to the range of q, the following situations are discussed:

$q ⩾ 0$

When $q ⩾ 0$ , it can be concluded that $q - k + 1 ⩾ 1$ , the numerator $c \cdot 2^{q - k + 1}$ is even and the denominator $5^{k}$ is odd, which does not meet the condition.
$q = - 1$

When $q = - 1$ , it can be concluded that $c \cdot 2 \cdot 5^{- k}$ is even, which does not meet the condition.
$q ⩽ - 2$

$5^{- k}$ is an odd number. c is an odd multiple of $2^{k - q - 1}$ . So:

$\begin{matrix} float : & c ⩾ 2^{k - q - 1} \Rightarrow k - q - 1 ⩽ 22 \Rightarrow q ⩾ - 34 \\ double : & c ⩾ 2^{k - q - 1} \Rightarrow k - q - 1 ⩽ 51 \Rightarrow q ⩾ - 75 \end{matrix}$

(143)

Therefore, when q meets the above conditions, c must be an odd multiple of $2^{k - q - 1}$ to meet the condition. Therefore, when the following conditions are met, expression (141) is an odd number:

$\begin{matrix} float : & - 34 ⩽ q ⩽ - 2 & & c % 2^{k - q} = 2^{k - q - 1} \\ double : & - 75 ⩽ q ⩽ - 2 & & c % 2^{k - q} = 2^{k - q - 1} \end{matrix}$

(144)

When q is within the above range (144), $r = 1$ is derived from Equation (28).Therefore, there is:

$\begin{matrix} n_{r} = n \end{matrix}$

(145)

The following equation holds:

$\begin{matrix} 20 m + 20 n = c \cdot 2^{q} \cdot 10^{- k + 1} = c \cdot 2^{q - k + 1} \cdot 5^{- k} = \frac{c}{2^{k - q - 1}} \cdot 5^{- k} \end{matrix}$

(146)

Since $- k ⩾ 1$ , $5^{- k}$ is multiple of 5 and is an odd number. Since $\frac{c}{2^{k - q - 1}}$ and $5^{- k}$ are both odd numbers, $20 m$ is an even number, $20 n$ is multiple of 5 and is an odd number. Therefore, there is:

$\begin{matrix} 20 n \in \{5, 15\} \\ \Rightarrow n \in \{0.25, 0.75\} \\ \Rightarrow n_{r} \in \{0.25, 0.75\} \end{matrix}$

(147)

The result of $o n e$ is an even number between $⌊ 10 n ⌋$ and $⌊ 10 n ⌋ + 1$ . Therefore:

$\begin{matrix} o n e = \{\begin{matrix} ⌊ 10 n ⌋ = 2, if n = 0.25 \\ ⌊ 10 n ⌋ + 1 = 8, if n = 0.75 \end{matrix} \Rightarrow o n e = ⌊ 20 n + 1 ⌋ / / 2 - (n = 0.25 ? 1 : 0) \end{matrix}$

(148)

3.7.2. $10 n - ⌊ 10 n ⌋ \neq 0.5$

When the decimal part of

10 n

is not 0.5, round to the nearest integer value based on the decimal part of

10 n

. Therefore, there is:

\begin{matrix} o n e = \{\begin{matrix} ⌊ 10 n ⌋, if 10 n - ⌊ 10 n ⌋ < 0.5 \\ ⌊ 10 n ⌋ + 1, if 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} \Rightarrow o n e = ⌊ 10 n + 0.5 ⌋ = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}

(149)

Since

⌊ 20 n + 1 ⌋ = ⌊ 20 n ⌋ + 1

, it is only necessary to accurately calculate the value of

⌊ 20 n ⌋

. And, there is:

\begin{matrix} d & = t e n + o n e \\ = 10 m + ⌊ 20 n + 1 ⌋ / / 2 \\ = (⌊ 20 m + 20 n ⌋ + 1) / / 2 \end{matrix}

(150)

Suppose there are:

\begin{matrix} 20 m + 20 n = c \cdot 2^{q + 1} \cdot 10^{- k} = c \cdot 2^{q - k + 1} \cdot 5^{- k} = c \cdot \frac{x}{y} \end{matrix}

(151)

Suppose the decimal part of

20 n

is

n_{20}

.

When

y ⩽ c_{max} = C

, the range of the decimal part must include:

\begin{matrix} float : & \frac{1}{2^{24} - 1} = \frac{1}{C} ⩽ n_{20} ⩽ 1 - \frac{1}{C} = \frac{2^{24} - 2}{2^{24} - 1} \\ double : & \frac{1}{2^{53} - 1} = \frac{1}{C} ⩽ n_{20} ⩽ 1 - \frac{1}{C} = \frac{2^{53} - 2}{2^{53} - 1} \end{matrix}

(152)

When

y > c_{max} = C

, the range of the decimal part must include(the test file is test5.py):

\begin{matrix} float : & 2^{- 32} < n_{20} < 1 - 2^{- 30} \\ double : & 2^{- 64} < n_{20} < 1 - 2^{- 62} \end{matrix}

(153)

Therefore, the range of

n_{20}

satisfies Equation (153). In the code implementation, for float, only the high 36 bits of

n_{r}

are retained, and for double, only the high 70 bits of

n_{r}

are retained. Suppose the discarded part of a float is represented as

n_{36}

, and similarly, the discarded part of a double is represented as

n_{70}

. Therefore, there is:

\begin{matrix} float : & n_{36} \in [0, 2^{- 36}) \\ double : & n_{70} \in [0, 2^{- 70}) \end{matrix}

(154)

Calculate the boundary conditions of the following expression:

\begin{matrix} float : & F = 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{36}) \\ double : & F = 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{70}) \end{matrix}

(155)

Therefore, there is:

\begin{matrix} float : F_{min} & > 20 \cdot (c \cdot 2^{q} \cdot 10^{- k - 1} - 2^{- 36}) \\ = 20 m + 20 n - 20 \cdot 2^{- 36} \\ F_{max} & < 20 \cdot (c \cdot 2^{q} \cdot (1 + 2^{- 63}) \cdot 10^{- k - 1} - 0) \\ < 20 m + 20 n + 20 \cdot 2^{- 63} \cdot c \\ < 20 m + ⌊ 20 n ⌋ + 1 \\ double : F_{min} & > 20 \cdot (c \cdot 2^{q} \cdot 10^{- k - 1} - 2^{- 70}) \\ = 20 m + 20 n - 20 \cdot 2^{- 70} \\ > 20 m + ⌊ 20 n ⌋ \\ F_{max} & < 20 \cdot (c \cdot 2^{q} \cdot (1 + 2^{- 127}) \cdot 10^{- k - 1} - 0) \\ < 20 m + 20 n + 20 \cdot 2^{- 127} \cdot c \\ < 20 m + ⌊ 20 n ⌋ + 1 \end{matrix}

(156)

Therefore, there is:

\begin{matrix} float : & ⌊ F ⌋ = 20 m + ⌊ 20 n ⌋ \\ double : & ⌊ F ⌋ = 20 m + ⌊ 20 n ⌋ \end{matrix}

(157)

In fact, in the above proof process, for float,

⌊ F_{m i n} ⌋ \neq 20 m + ⌊ 20 n ⌋

may exist, but the code implementation has passed the exhaustive test, so this not-so-perfect proof process can be ignored. Therefore, the calculation of d can be simplified as follows:

\begin{matrix} d & = t e n + o n e \\ = (⌊ F ⌋ + 1) / / 2 \\ = (⌊ 20 \cdot (c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} - n_{x}) ⌋ + 1) / / 2 \end{matrix}

(158)

For the float range,

n_{x} = n_{36}

; for the double range,

n_{x} = n_{70}

.

3.7.3. Efficient Implementation of $n = 0.25$ for Double

For double, quickly determine that

n = = 0.25

in Equation (148).

When

n = 0.25

,

⌊ 2^{64} \cdot n_{r} ⌋ = ⌊ 2^{64} \cdot n ⌋ = 2^{62}

. Therefore, the following condition can be used to quickly determine whether

n = 0.25

:

\begin{matrix} double : & n = 0.25 if ⌊ 2^{64} \cdot n_{r} ⌋ = 2^{62} \end{matrix}

(159)

When

n \neq 0.25

,Calculate the range of the decimal part of the following expression:

\begin{matrix} 4 m + 4 n = c \cdot 2^{q + 2} \cdot 10^{- k - 1} \end{matrix}

(160)

Therefore, when Equation (160) is not an integer, we have(the test file is test6.py):

\begin{matrix} 2^{- 62} < 4 n - ⌊ 4 n ⌋ < 1 - 2^{- 62} \end{matrix}

(161)

Calculate the two boundary cases of

4 n

that are closest to 1:

\begin{matrix} ⌊ 4 n ⌋ = 0 \Rightarrow 4 n - 0 < 1 - 2^{- 62} \Rightarrow ⌊ 2^{64} \cdot n ⌋ ⩽ 2^{62} - 2 \\ ⌊ 4 n ⌋ = 1 \Rightarrow 4 n - 1 > 2^{- 62} \Rightarrow ⌊ 2^{64} \cdot n ⌋ ⩾ 2^{62} + 1 \end{matrix}

(162)

Then there are:

\begin{matrix} ⌊ 2^{64} \cdot n ⌋ \neq 2^{62} & & ⌊ 2^{64} \cdot n ⌋ + 1 \neq 2^{62} \\ \Rightarrow ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}

(163)

Therefore, the following condition can be used to quickly determine whether

n \neq 0.25

:

\begin{matrix} double : & n \neq 0.25 if ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}

(164)

In summary, for double, the following condition can be used to quickly determine whether

n = 0.25

:

\begin{matrix} double : & n = 0.25 if ⌊ 2^{64} \cdot n_{r} ⌋ = 2^{62} \\ double : & n \neq 0.25 if ⌊ 2^{64} \cdot n_{r} ⌋ \neq 2^{62} \end{matrix}

(165)

3.7.4. Efficient Calculation of $o n e$ for Double

In the double range, introduce another faster way to calculate

o n e

:

\begin{matrix} double : & o n e = ⌊ \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 + ((n = 0.25) ? 0 : (2^{- 1} + \frac{6}{2^{64}})) ⌋ \end{matrix}

(166)

The proof of Equation (166) is as follows:

when

n = 0.25

,

⌊ \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 ⌋ = ⌊ 10 n ⌋ = 2

;

when

n \neq 0.25

, Equation (166) can be equivalent to the following:

\begin{matrix} double : & o n e = ⌊ \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}} ⌋ \end{matrix}

(167)

According to the

10 n - ⌊ 10 n ⌋

range,

o n e

is represented as:

\begin{matrix} double : & o n e = \{\begin{matrix} ⌊ 10 n ⌋, if 10 n - ⌊ 10 n ⌋ < 0.5 \\ 8, if 10 n - ⌊ 10 n ⌋ = 0.5 \\ ⌊ 10 n ⌋ + 1, if 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}

(168)

Therefore,when

n \neq 0.25

, we need to prove that the following equation holds:

\begin{matrix} ⌊ \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}} ⌋ = \{\begin{matrix} ⌊ 10 n ⌋, if 10 n - ⌊ 10 n ⌋ < 0.5 \\ 8, if 10 n - ⌊ 10 n ⌋ = 0.5 \\ ⌊ 10 n ⌋ + 1, if 10 n - ⌊ 10 n ⌋ > 0.5 \end{matrix} = ⌊ 20 n + 1 ⌋ / / 2 \end{matrix}

(169)

From the range of n, there is:

\begin{matrix} \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \in (n_{r} - 2^{- 64}, n_{r}] \end{matrix}

(170)

Because the following conditions exist:

\begin{matrix} c \cdot 2^{q} \cdot 10^{- k - 1} & = m + n \\ c \cdot 2^{q} \cdot r \cdot 10^{- k - 1} & = m + n_{r} \end{matrix}

(171)

Therefore, the following relationship can be concluded:

\begin{matrix} n_{r} - n & = (r - 1) \cdot c \cdot 2^{q} \cdot 10^{- k - 1} \\ n_{r} & = (r - 1) \cdot (m + n) + n \\ \Rightarrow n ⩽ n_{r} & < 2^{- 127} \cdot c + n \\ n ⩽ n_{r} & < 2^{- 127} \cdot 2^{53} + n \\ n ⩽ n_{r} & < 2^{- 74} + n \end{matrix}

(172)

From Equation (170) and (172), it can be concluded that:

\begin{matrix} \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \in (n - 2^{- 64}, n + 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 \in (10 n - 10 \cdot 2^{- 64}, 10 n + 10 \cdot 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 20 \in (20 n - 20 \cdot 2^{- 64}, 20 n + 20 \cdot 2^{- 74}) \\ \Rightarrow \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 20 \in (⌊ 20 n ⌋ + n_{20} - 20 \cdot 2^{- 64}, ⌊ 20 n ⌋ + n_{20} + 20 \cdot 2^{- 74}) \end{matrix}

(173)

Discuss the range of values of x when the following conditions are met.

\begin{matrix} ⌊ \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 20 + 1 + x ⌋ / / 2 = ⌊ 20 n + 1 ⌋ / / 2 = o n e \end{matrix}

(174)

Therefore, the following conclusions can be drawn:

\begin{matrix} ⌊ 20 n ⌋ + n_{20} - 20 \cdot 2^{- 64} + 1 + x ⩾ ⌊ 20 n + 1 ⌋ & \Rightarrow x ⩾ 20 \cdot 2^{- 64} - n_{20} \\ ⌊ 20 n ⌋ + n_{20} + 20 \cdot 2^{- 74} + 1 + x < ⌊ 20 n + 2 ⌋ & \Rightarrow x < 1 - 20 \cdot 2^{- 74} - n_{20} \end{matrix}

(175)

Suppose

x = 12 \cdot 2^{- 64}

. Through the exhaustive method, all floating-point numbers that do not meet the following conditions can be obtained.

\begin{matrix} x = 12 \cdot 2^{- 64} ⩾ 20 \cdot 2^{- 64} - n_{20} \end{matrix}

(176)

All floating-point numbers that do not meet condition (176) are as follows (in hexadecimal) :

\begin{matrix} 0 x d 17 c 0747 b d 76 f a 1 & , \\ 0 x d 27 c 0747 b d 76 f a 1 & , \\ 0 x 4 d 73 d e 005 b d 620 d f & , \\ 0 x 4 d 83 d e 005 b d 620 d f & , \\ 0 x 4 d 93 d e 005 b d 620 d f & , \end{matrix}

(177)

Through the exhaustive method, all floating-point numbers that do not meet the following conditions can be obtained.

\begin{matrix} x = 12 \cdot 2^{- 64} < 1 - 20 \cdot 2^{- 74} - n_{20} \end{matrix}

(178)

All floating-point numbers that do not meet condition (178) are as follows (in hexadecimal) :

\begin{matrix} 0 x 612491 d a a d 0 b a 280 & , \\ 0 x 6159 b 651584 e 8 b 20 & , \\ 0 x 619011 f 2 d 73116 f 4 & , \\ 0 x 61 c 4166 f 8 c f d 5 c b 1 & , \\ 0 x 61 d 4166 f 8 c f d 5 c b 1 & , \end{matrix}

(179)

There are:

\begin{matrix} 2 (\frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 + 2^{- 1} + \frac{6}{2^{64}}) = \frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 20 + 1 + x \end{matrix}

(180)

When the floating-point number is not within the above range (177) and (179), the condition (175) is satisfied. We have tested all floating-point numbers within the above-mentioned range (177) and (179), and the algorithm implementation code has output the correct result, that is, it satisfies the SW principle. The test process file is test8.py.

In summary, Equation (169) and Equation (166) holds. Therefore, Equation (166) can be used to quickly calculate

o n e

.

3.8. Irregular Number

Due to the limited and small number of irregular floating-point numbers, there are a total of 2046 double floating-point numbers and 254 float floating-point numbers. The correctness of the algorithm code in this paper can be proved by the exhaustive method. Therefore, it is not introduced in this article. For the specific implementation process, please refer to the source code.

3.9. Implementation of Pseudocode

This subsection details the pseudocode implementation for converting regular floating-point numbers to decimal representation. The handling of subnormal (irregular) floating-point numbers is omitted here due to space constraints; interested readers may refer to the source code for the complete implementation.

3.9.1. Single-Precision Floating-Point Numbers

\begin{matrix} uint 32 v i & = bit_copy_from_f 32 (v) \\ (uint 64 c, int 32 q) & = extract (v i) \\ const int BIT & = 36 \\ const uint 64 offset & = (1 ULL ≪ (BIT - 2)) - 7 \\ int 32 k & = (q \cdot 1233) ≫ 12 \\ int 32 h & = q + ((1701 \cdot (- k - 1)) ≫ 9) \\ uint 64 pow 10 & = get_pow 10 (- k - 1) \\ uint 64 c b & = c ≪ (h + BIT + 1) \\ uint 64 hi 64 & = umul_{hi}_{64 \times 64} (c b, pow 10) \\ bool even & = (v i + 1) & 1 \\ uint 64 half & = (pow 10 ≫ (65 - (h + BIT + 1))) + even \\ uint 32 shorter & = (hi 64 + half) ≫ BIT \\ uint 64 bias & = (hi 64 ≫ (BIT - 4)) & 15 \\ uint 32 longer & = (5 \cdot hi 64 + offset + bias) ≫ (BIT - 1) \\ bool updown & = shorter > ((hi 64 - half) ≫ BIT) \\ uint 32 d & = updown ? shorter \cdot 10 : longer \end{matrix}

(181)

Algorithm 181 outlines the procedure for computing d and k for single-precision floating-point numbers.

Since c contains at most 23 significant bits and

pow 10

has 64 significant bits, their product contains at most 87 significant bits. Given that the range of h is

[- 4, - 1]

, the value

c b = c ≪ (h + BIT + 1)

does not overflow, preserving all bits of c. When

e_{10} = - k - 1

,

pow 10

is computed using Equation (24). From Equation (25), we derive:

\begin{matrix} pow 10 \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 63} = 10^{- k - 1} \cdot r_{1, - k - 1} \end{matrix}

(182)

From Equation (51):

\begin{matrix} m & = ⌊ v \cdot 10^{- k - 1} ⌋ \\ = ⌊ v \cdot 10^{- k - 1} \cdot r_{1, - k - 1} ⌋ \\ = ⌊ v \cdot pow 10 \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 63} ⌋ \\ = ⌊ c \cdot pow 10 \cdot 2^{q + ⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 63} ⌋ \\ = (c \cdot pow 10) ≫ (63 - q - ⌊ (- k - 1) \cdot {log}_{2} 10 ⌋) \\ = (c \cdot pow 10) ≫ (63 - h) \\ = ((c ≪ (37 + h)) \cdot pow 10) ≫ (36 + 64) \\ = hi 64 ≫ 36 \end{matrix}

(183)

Let

up

indicate whether

o n e = 10

. From Equations (138) and (105):

\begin{matrix} bool up & = ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + even > 2^{36} - 1 - ⌊ 2^{36} \cdot n_{r} ⌋ \\ = ⌊ 2^{q + 35} \cdot 10^{- k - 1} ⌋ + even + ⌊ 2^{36} \cdot n_{r} ⌋ ⩾ 2^{36} \\ = (pow 10 ≫ (28 - h)) + even + ⌊ 2^{36} \cdot n_{r} ⌋ ⩾ 2^{36} \\ = (half + ⌊ 2^{36} \cdot n_{r} ⌋) ≫ 36 \end{matrix}

(184)

When

o n e \neq 10

, we have

d / / 10 = m

; when

o n e = 10

, we have

d / / 10 = m + 1

. Therefore:

\begin{matrix} d / / 10 = m + up \end{matrix}

(185)

The upper 28 bits of

hi 64

represent m, while the lower 36 bits represent

⌊ 2^{36} \cdot n_{r} ⌋

:

\begin{matrix} hi 64 = (m ≪ 36) + ⌊ 2^{36} \cdot n_{r} ⌋ \end{matrix}

(186)

Consequently:

\begin{matrix} shorter = d / / 10 = (hi 64 + half) ≫ 36 \end{matrix}

(187)

Let

updown

indicate whether

o n e \in {0, 10}

, equivalently whether the last digit of d is zero:

\begin{matrix} updown = (d mod 10 = 0) \end{matrix}

(188)

For brevity, let

dot_one = ⌊ 2^{36} \cdot n_{r} ⌋

. From Equations (134) and (140):

\begin{matrix} o n e = 0 & : half > dot_one \\ o n e = 10 & : half > 2^{36} - 1 - dot_one \end{matrix}

(189)

The following equations hold:

\begin{matrix} o n e = 0 & : (hi 64 - half) ≫ 36 = m - 1 \\ o n e = 10 & : (hi 64 - half) ≫ 36 = m \end{matrix}

(190)

Therefore:

\begin{matrix} o n e = 0 & : shorter = m > (hi 64 - half) ≫ 36 = m - 1 \\ o n e = 10 & : shorter = m + 1 > (hi 64 - half) ≫ 36 = m \end{matrix}

(191)

Thus, the condition for

o n e \in {0, 10}

is:

\begin{matrix} shorter > (hi 64 - half) ≫ 36 \Rightarrow d mod 10 = 0 \end{matrix}

(192)

When

updown

is true,

d = 10 \cdot shorter

. Otherwise, we compute d using Equation (158). When

10 n - ⌊ 10 n ⌋ \neq 0.5

:

\begin{matrix} d & = (⌊ 20 \cdot (v \cdot 10^{- k - 1} \cdot r - n_{36}) ⌋ + 1) / / 2 \\ = (⌊ 20 \cdot (hi 64 \cdot 2^{- 36}) ⌋ + 1) / / 2 \\ = (⌊ 5 \cdot hi 64 \cdot 2^{- 34} ⌋ + 1) / / 2 \\ = (⌊ (5 \cdot hi 64 + 2^{34}) \cdot 2^{- 34} ⌋) / / 2 \\ = ((5 \cdot hi 64 + 2^{34}) ≫ 34) / / 2 \\ = (5 \cdot hi 64 + 2^{34}) ≫ 35 \end{matrix}

(193)

Adding

2^{34}

serves as a rounding operation. When

10 n - ⌊ 10 n ⌋ = 0.5

, Equation (148) requires checking whether

n = 0.25

. The derivation of

longer

involves careful handling of edge cases; the correctness of this computation has been verified through exhaustive testing.

3.9.2. Double-Precision Floating-Point Numbers

\begin{matrix} uint 64 v i & = bit_copy_from_f 64 (v) \\ (uint 64 c, int q) & = extract (v i) \\ const int BIT & = 6 \\ int k & = (q \cdot 78913) ≫ 18 \\ int h & = q + ((217707 \cdot (- k - 1)) ≫ 16) \\ uint 128 pow 10 & = get_pow 10 (- k - 1) \\ uint 64 c b & = c ≪ (h + BIT + 1) \\ uint 128 hi 128 & = umul_{hi}_{64 \times 128} (c b, pow 10) \\ bool even & = (v i + 1) & 1 \\ uint 64 half & = (pow 10 ≫ (64 - h)) + even \\ uint 64 dot_one & = (uint 64) (hi 128 ≫ BIT) \\ uint 64 ten & = 10 \cdot (hi 128 ≫ (BIT + 64)) \\ uint 64 offset_num & = (dot_one = = 2^{62}) ? 0 : 2^{63} + 6 \\ uint 64 o n e & = ((uint 128) 10 \cdot dot_one + offset_num) ≫ 64 \\ o n e & = (half > dot_one) ? 0 : o n e \\ o n e & = (half > 2^{64} - 1 - dot_one) ? 10 : o n e \\ uint 64 d & = ten + o n e \end{matrix}

(194)

Algorithm 194 presents the corresponding procedure for double-precision floating-point numbers.

pow 10

is computed using Equation (26), satisfying:

\begin{matrix} pow 10 \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 127} = 10^{- k - 1} \cdot r_{1, - k - 1} \end{matrix}

(195)

From Equation (51):

\begin{matrix} m & = ⌊ v \cdot 10^{- k - 1} ⌋ \\ = ⌊ v \cdot 10^{- k - 1} \cdot r_{1, - k - 1} ⌋ \\ = ⌊ v \cdot pow 10 \cdot 2^{⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 127} ⌋ \\ = ⌊ c \cdot pow 10 \cdot 2^{q + ⌊ (- k - 1) \cdot {log}_{2} 10 ⌋ - 127} ⌋ \\ = (c \cdot pow 10) ≫ (127 - q - ⌊ (- k - 1) \cdot {log}_{2} 10 ⌋) \\ = (c \cdot pow 10) ≫ (127 - h) \\ = ((c ≪ (7 + h)) \cdot pow 10) ≫ (64 + 70) \\ = hi 128 ≫ 70 \end{matrix}

(196)

Given

h \in [- 4, - 1]

and c having at most 53 significant bits,

c b

does not overflow, ensuring accurate computation of

ten = 10 \cdot m

. By definition,

dot_one = ⌊ 2^{64} \cdot n_{r} ⌋

. From Equation (105):

\begin{matrix} half & = (pow 10 ≫ (64 - h)) + even \\ = ⌊ 2^{63} \cdot 2^{q} \cdot 10^{- k - 1} ⌋ + even \end{matrix}

(197)

The value of

o n e

is first computed using Equation (166), then adjusted if the conditions for

o n e = 0

or

o n e = 10

are met:

\begin{matrix} o n e & = ⌊\frac{⌊ 2^{64} \cdot n_{r} ⌋}{2^{64}} \cdot 10 + ((n = 0.25) ? 0 : (2^{- 1} + \frac{6}{2^{64}}))⌋ \\ = (dot_one \cdot 10 + (dot_one = = 2^{62} ? 0 : 2^{63} + 6)) ≫ 64 \end{matrix}

(198)

An equivalent formulation is:

\begin{matrix} o n e = (dot_one = = 2^{62}) ? 2 : (dot_one \cdot 10 + 2^{63} + 6) ≫ 64 \end{matrix}

(199)

The conditions

o n e = 0

and

o n e = 10

are determined by Equations (127) and (138), respectively.

In the C/C++ implementation, the only branch occurs when computing c and q for subnormal floating-point numbers (marked as

unlikely

). The conversion of normal floating-point numbers is branch-free, eliminating branch misprediction penalties. As shown in Algorithms 181 and 194, the core algorithm is remarkably concise. Irregular numbers are handled in a separate branch—a worthwhile trade-off given their rarity.

3.10. Decimal to String Conversion

The floating-point number printing algorithm consists of two main phases. While the first phase, which computes the decimal digits d and exponent k, has been discussed in previous sections, this section focuses on the second phase: converting the computed decimal representation into a string.

Floating-point numbers are typically printed in two formats: fixed-point notation and scientific notation. Table (Table 3) illustrates examples of both formats.

Having computed d and k, we now convert the floating-point number to a string based on these values. According to the Schubfach algorithm, d may contain trailing zeros, meaning

d mod 10

could be zero. In order to reduce instruction dependencies. We decompose d into two parts:

d / / 10

and

d mod 10

. Let

u p

indicate whether

o n e

equals 10, and

u p d o w n

represent the cases where

o n e

is either 0 or 10. The following relationships hold:

\begin{matrix} d / / 10 & = m + u p \\ u p d o w n & = (o n e = 0 \lor o n e = 10) = (d mod 10 = 0) \\ d & = 10 \cdot (m + u p) + (u p d o w n ? 0 : o n e) \end{matrix}

(200)

By calculating the approximate range of m, we obtain:

\begin{matrix} m & = ⌊ c \cdot 2^{q} \cdot 10^{- k - 1} ⌋ \\ c \cdot 2^{q} \cdot 10^{- k - 1} & \in [0.1 \cdot c, c) \\ \Rightarrow ⌊ 0.1 \cdot c_{min} ⌋ ⩽ ⌊ 0.1 \cdot c ⌋ ⩽ & m ⩽ ⌊ c ⌋ ⩽ ⌊ c_{max} ⌋ \end{matrix}

(201)

Based on the range of c, we derive:

\begin{matrix} float : & \{\begin{matrix} normal : [\frac{2^{23} + 1}{10}, 2^{24} - 1] = [838860.9, 16777215] \\ subnormal : [\frac{1}{10}, 2^{23} - 1] = [0.1, 8388607] \end{matrix} \\ double : & \{\begin{matrix} normal : [\frac{2^{52} + 1}{10}, 2^{53} - 1] = [4.5 \times 10^{14}, 9 \times 10^{15}] \\ subnormal : [\frac{1}{10}, 2^{52} - 1] = [0.1, 4.5 \times 10^{15}] \end{matrix} \end{matrix}

(202)

Let

l e n 10 (x)

denote the number of decimal digits in x. For

x > 0

:

l e n 10 (x) = ⌊ {log}_{10} (x) ⌋ + 1

(203)

We define

l e n 10 (0) = 0

. For example,

l e n 10 (12345) = 5

. Consequently:

\begin{matrix} float : & \{\begin{matrix} normal : l e n 10 (m + u p) \in [6, 8] \\ subnormal : l e n 10 (m + u p) \in [0, 7] \end{matrix} \\ double : & \{\begin{matrix} normal : l e n 10 (m + u p) \in [15, 16] \\ subnormal : l e n 10 (m + u p) \in [0, 16] \end{matrix} \end{matrix}

(204)

Since

d / / 10 = m + u p

, we have

l e n 10 (d) = l e n 10 (m + u p) + 1

. Let

t z 10 (x)

denote the number of trailing zeros in the decimal representation of x; for example,

t z 10 (12300) = 2

. The number of significant digits

s i g 10 (x)

equals

l e n 10 (x) - t z 10 (x)

. When

u p d o w n = 0

,

o n e \neq 0

, thus

d mod 10 = o n e \neq 0

and

t z 10 (d) = 0

. Therefore:

\begin{matrix} s i g 10 (d) & = u p d o w n ? l e n 10 (d) - t z 10 (d) : l e n 10 (d) \\ float : & s i g 10 (d) \in [1, 9] \\ double : & s i g 10 (d) \in [1, 17] \end{matrix}

(205)

For normal floating-point numbers,

l e n 10 (d)

can be computed as:

\begin{matrix} float : & l e n 10 (d) = l e n (m + u p) + 1 = 9 - [m + u p < 10^{7}] - [m + u p < 10^{6}] \\ double : & l e n 10 (d) = l e n (m + u p) + 1 = 16 + [(m + u p) ⩾ 10^{15}] \end{matrix}

(206)

where

[P]

denotes the Iverson bracket, which evaluates to 1 if predicate P is true and 0 otherwise.

Computing the ASCII representation of d reduces to computing the ASCII codes of

m + u p

and

o n e

. From the above, we have:

\begin{matrix} float : & m + u p < 10^{8} \\ double : & m + u p < 10^{16} \end{matrix}

(207)

These bounds allow efficient computation of the ASCII code for

m + u p

using CPU SIMD instruction sets.

Scalar Implementation

We first present a scalar method that does not rely on SIMD instructions.

Let

x \in [0, 10^{8})

and

d e c_t o_a s c i i 8 (x)

be a function that computes the ASCII representation of x. Algorithm (2) describes this process. This version is an optimized version of the itoa algorithm written by Paul Khuong [18].

Algorithm 2:Convert an 8-digit decimal number to ASCII: dec_to_ascii8(x)

Input:: X (type: uint64, range: [0, $10^{8}$ ))
Output:: Y (type: uint64), tz (type: uint)
1:: uint64 abcd_efgh = X + (0x100000000 - 10000) * ((X * 0x68db8bbULL) $> >$ 40)
2:: uint64 ab_cd_ef_gh = abcd_efgh + (0x10000 - 100) * (((abcd_efgh * 0x147b) $> >$ 19) & 0x7f0000007f)
3:: uint64 a_b_c_d_e_f_g_h = ab_cd_ef_gh + (0x100 - 10) * (((ab_cd_ef_gh * 0x67) $> >$ 10) & 0xf000f000f000f)
4:: uint tz = __builtin_ctzll(a_b_c_d_e_f_g_h) $> >$ 3
5:: uint64 BCD = CPU_IS_LITTLE_ENDIAN ? byteswap(a_b_c_d_e_f_g_h) : a_b_c_d_e_f_g_h
6:: uint64 Y = BCD + 0x3030303030303030

return Y, tz

The function simultaneously computes

t z 10 (x)

as the return value

t z

. Example: X = 12345600, Y="12345600", tz=2.

Algorithm 3:Convert a 16-digit decimal number to ASCII: dec_to_ascii16(x)

Input:: X (type: uint64, range: [0, $10^{16}$ ))
Output:: Y (type: uint128), tz (type: uint)
1:: uint64 abcdefgh = X // $10^{8}$
2:: uint64 ijklmnop = X mod $10^{8}$
3:: (uint64 abcdefgh_ascii, uint tz1) = dec_to_ascii8(abcdefgh)
4:: (uint64 ijklmnop_ascii, uint tz2) = dec_to_ascii8(ijklmnop)
5:: uint128 Y = CPU_IS_LITTLE_ENDIAN ? (ijklmnop_ascii << 64) + abcdefgh_ascii : (abcdefgh_ascii << 64) + ijklmnop_ascii
6:: uint tz = (ijklmnop == 0) ? 8 + tz1 : tz2

return Y, tz

For

x \in [0, 10^{16})

, let

d e c_t o_a s c i i 16 (x)

compute the ASCII representation of x. Algorithm (3) describes this process. Example: X = 123456001234500, Y="1234560012345600", tz=2.

Handling Undefined Behavior

Since

__b u i l t i n_c t z l l (0)

is undefined behavior in C, special handling is required. For single-precision floating-point numbers, when

m + u p = 0

(the six smallest subnormal numbers),

u p d o w n = 0

, avoiding the undefined case. For double-precision numbers,

m + u p = 0

only occurs for

5 \times 10^{- 324}

(the smallest subnormal number). In our implementation, we handle this special case separately at the function entry.

SIMD Implementation

The SIMD implementations of

d e c_t o_a s c i i 8

and

d e c_t o_a s c i i 16

follow the same principles as the scalar version. For ARM64 architecture, we use the NEON instruction set; for x64 architecture, we support three variants: AVX512, SSE4.1, and SSE2. Table (Table 4) summarizes these implementations.

Using the methods above, we compute the ASCII representation of

m + u p

and the number of significant digits. Printing d is equivalent to printing

m + u p

and

o n e

. Based on

s i g 10 (d)

, we determine which buffer contents to retain. In our implementation, we adopt different formats for floating-point numbers in different ranges to enhance readability, as shown in Table (Table 5).

In scientific notation, the result includes an exponent. For example, in “1.2e-02”, “e-02” represents the exponent. Converting

d \cdot 10^{k}

to standard decimal scientific notation yields:

d \cdot 10^{k} = (d \cdot 10^{- ⌊ {log}_{10} (d) ⌋}) \cdot 10^{k + ⌊ {log}_{10} (d) ⌋}

(208)

Since

d \cdot 10^{- ⌊ {log}_{10} (d) ⌋} \in [1, 10)

, the exponent is determined by

k + ⌊ {log}_{10} (d) ⌋

. Let

E_{10} = k + ⌊ {log}_{10} (d) ⌋

. Then:

E_{10} = k + ⌊ {log}_{10} (d) ⌋ = k + ⌊ {log}_{10} (m + u p) ⌋ + 1

(209)

For single-precision floating-point numbers, fixed-point notation is used when

E_{10} \in [- 3, 6]

. For double-precision numbers, fixed-point notation is used when

E_{10} \in [- 4, 15]

.

The floating-point number printing algorithm proposed in this paper is illustrated as Algorithm 4. The key distinction between our algorithm and others is the use of SIMD instruction sets for converting

m + u p

to ASCII values. For single-precision numbers, we use

d e c_t o_a s c i i 8

; for double-precision, we use

d e c_t o_a s c i i 16

. The detailed implementation is available in the source code.

Algorithm 4:Floating-point number printing algorithm

Input:: v (type: float/double), buffer (type: char*)
Output:: buffer (type: char*)
1:: compute $m + u p$ , $u p d o w n$ , $o n e$ , k from v
2:: compute $s i g 10 (d) = u p d o w n ? l e n 10 (m + u p) - t z 10 (m + u p) : l e n 10 (m + u p) + 1$
3:: compute $E_{10} = k + ⌊ {log}_{10} (m + u p) ⌋ + 1$
4:: convert $m + u p$ , $o n e$ to ASCII codes and store in buffer
5:: based on $s i g 10 (d)$ and $E_{10}$ , determine which buffer contents to retain
6:: if result is in scientific notation, print $E_{10}$

return buffer

In our C implementation, all branches are designed as unlikely branches to minimize the impact of branch prediction failures. By transforming the computation of d into computing

m + u p

and

o n e

, we reduce instruction dependencies and enhance instruction-level parallelism. Unlike other algorithms that compute the exact value of d before printing, we avoid computing d directly and instead quickly compute

d / / 10 = m + u p

.

3.11. Summary

This chapter explains how to quickly calculate d and k, as well as the printing optimization. Since

d = 10 m + o n e

, the process of calculating d is transformed into calculating m and

o n e

.

Section 3.5 introduces the method for quickly calculating m.
Section 3.6 and Section 3.7 introduce the methods for quickly calculating $o n e$ .
Section 3.9 provides a detailed description of the pseudo-code implementation.
Section 3.10 discusses print optimization.

Table 6. All algorithms in the benchmark test.

algorithm	float	double	description
Schubfach	Schubfach32	Schubfach64	author:Raffaello Giulietti,https://github.com/c4f7fcce9cb06515/Schubfach.
Schubfach_xjb	Schubfach32_xjb	Schubfach64_xjb	It is improved by Schubfach and has the same output result.
Ryu	Ryu32	Ryu64	author:Ulf Adams,https://github.com/ulfjack/ryu.
Dragonbox	Dragonbox32	Dragonbox64	author:Junekey Jeon,https://github.com/jk-jeon/Dragonbox.
fmt [22]	fmt32	fmt64	author:Victor Zverovich,https://github.com/fmtlib/fmt version:12.1.0
yy_double	-	yy_double	author:GuoYaoYuan,link:yy_double.
yy_json [23]	yy_json32	yy_json64	author:Guo YaoYuan,https://github.com/ibireme/yyjson version:0.12.0
teju_jagua [24]	teju32	teju64	author:Cassio Neri,https://github.com/cassioneri/teju_jagua.
xjb	xjb32	xjb64	this paper,https://github.com/xjb714/xjb.
zmij	zmij32	zmij64	author:Victor Zverovich,https://github.com/vitaut/zmij.
jnum [25]	jnum32	jnum64	author:Jing Leng,https://github.com/lengjingzju/json/jnum.c.
uscalec	-	uscalec	author:Russ Cox,src:https://github.com/rsc/fpfmt commit 6255750 (19 Jan 2026).

4. Experimental Evaluation

This section presents a comprehensive experimental evaluation of the xjb algorithm, comparing its performance against state-of-the-art floating-point to string conversion algorithms across multiple hardware platforms and compilers.

4.1. Correctness Verification

Before conducting performance benchmarks, we verified the correctness of the xjb algorithm through extensive testing:

Single-precision (float): We tested all possible input values ( $2^{32}$ values) and verified that the output matches the Schubfach algorithm, ensuring complete correctness for the binary32 format.
Double-precision (double): Due to the impracticality of exhaustively testing all $2^{64}$ possible values, we conducted comprehensive random testing with a large sample size to verify correctness for the binary64 format.

All test results confirmed that xjb produces output identical to Schubfach while satisfying the SW principle.

4.2. Experimental Setup

4.2.1. Hardware Platforms

We conducted benchmarks on two representative hardware platforms:

AMD Ryzen 7 7840H: A modern x86-64 processor with support for AVX2 and AVX-512 instruction sets, running Ubuntu 26.04.
Apple M1: A representative ARM-based processor with NEON SIMD support, running macOS 26.4.

4.2.2. Compilers and Compilation Flags

AMD R7-7840H: Intel C++ Compiler (icpx) version 2025.0.4
Apple M1: Apple Clang version 21.0.0

All benchmarks were compiled with the optimization flags -O3 -march=native to enable maximum compiler optimizations and architecture-specific instruction generation.

4.2.3. Benchmark Methodology

The benchmark methodology was designed to provide fair and reproducible performance measurements:

1.: Generate $2^{24}$ random floating-point numbers, excluding special values (0, NaN, and Inf).
2.: Measure the total time required to convert all numbers to decimal representation.
3.: Calculate the average conversion time per floating-point number.

This methodology ensures statistical significance while avoiding the overhead of special-case handling that would skew results.

4.3. Algorithms Compared

We compared xjb against the following state-of-the-art algorithms:

Table 7 and (Table 8) present the benchmark results for float-to-decimal and double-to-decimal conversions on AMD R7-7840H and Apple M1, respectively. Figure 1a–f presents the benchmark results for the conversion from float to string and from double to string. Figure 1a–d present the benchmark results when the input values are completely random. Since the significant digit length range of the printing results is from 1 to 17, comparative tests were also conducted for different effective lengths, as shown in Figure 1e,f.

Schubfach: The baseline algorithm from which xjb is derived.
Schubfach_xjb: A variant of Schubfach with xjb optimizations.
Ryu: A widely-used high-performance algorithm.
Dragonbox: Another state-of-the-art algorithm with competitive performance.
fmt: A modern formatting library.
yy_json / yy_double: Fast algorithms optimized for JSON serialization.
teju_jagua: A recent algorithm (note: only supports float/double to decimal conversion).
zmij: A modern algorithm with competitive performance.
uscalec: An algorithm supporting double-precision only.

4.4. Performance Results

4.5. Analysis and Discussion

4.5.1. Performance Comparison

The benchmark results demonstrate that xjb achieves superior performance compared to all other algorithms across both hardware platforms:

On AMD R7-7840H:

For float-to-decimal conversion, xjb achieves 2.24 ns, which is 1.98× faster than the second-fastest algorithm (Schubfach_xjb at 4.44 ns) and 5.44× faster than the baseline Schubfach (12.2 ns).
For double-to-decimal conversion, xjb achieves 3.76 ns, which is 1.27× faster than the second-fastest algorithm (yy_double at 5.24 ns) and 3.06× faster than the baseline Schubfach (11.51 ns).
For float-to-string conversion, xjb is about 60% faster than zmij.
For double-to-string conversion, xjb is about 8% faster than zmij.

On Apple M1:

For float-to-decimal conversion, xjb achieves 2.15 ns, which is 1.84× faster than the second-fastest algorithm (zmij at 4.11 ns) and 5.41× faster than the baseline Schubfach (11.64 ns).
For double-to-decimal conversion, xjb achieves 2.58 ns, which is 1.49× faster than the second-fastest algorithm (yy_double at 4.08 ns) and 5.08× faster than the baseline Schubfach (13.12 ns).
For float-to-string conversion, xjb is about 96% faster than zmij.
For double-to-string conversion, xjb is about 20% faster than zmij.

4.5.2. Performance Consistency

A key advantage of the xjb algorithm is its consistent performance across different input distributions. Since all branches in the xjb implementation are designed to be unlikely branches, the algorithm maintains a high branch prediction success rate regardless of input patterns. This property is particularly valuable in real-world applications where input distributions may vary unpredictably.

4.5.3. Cross-Platform Performance

The benchmark results demonstrate that xjb performs well across different processor architectures:

On x86-64 (AMD R7-7840H), xjb benefits from the compiler’s ability to generate efficient code for the algorithm’s arithmetic operations.
On ARM64 (Apple M1), xjb maintains its performance advantage, demonstrating the algorithm’s portability and effectiveness across different instruction set architectures.

4.5.4. Comparison with Related Algorithms

vs. Schubfach: xjb achieves 3–5× speedup over the baseline Schubfach algorithm, validating the effectiveness of our optimizations in reducing multiplication operations and minimizing branch prediction penalties.
vs. yy_json/yy_double: While these algorithms are highly optimized for JSON serialization workloads, xjb outperforms them by 1.3–1.8×, demonstrating broader applicability beyond specific use cases.
vs. zmij: xjb achieves 1.5–1.9× speedup over zmij, another recent high-performance algorithm, showing the benefits of our approach to instruction dependency reduction.
vs. Ryu and Dragonbox: xjb significantly outperforms these widely-used algorithms (2.5–6× faster), demonstrating that the optimizations in xjb provide substantial benefits over established approaches.

4.6. Summary

The experimental evaluation demonstrates that xjb achieves state-of-the-art performance for floating-point to decimal conversion across multiple hardware platforms. The algorithm’s key strengths include:

1.: Consistently superior performance across both x86-64 and ARM64 architectures.
2.: Significant speedups over the baseline Schubfach algorithm (3–5×).
3.: Competitive or superior performance compared to all other state-of-the-art algorithms.
4.: Stable performance regardless of input distribution due to effective branch prediction optimization.

These results validate the effectiveness of our optimization strategies: reducing multiplication operations, minimizing instruction dependencies, and designing branch patterns that work well with modern processor pipelines.

5. Conclusions

This paper presented xjb, a novel high-performance algorithm for converting IEEE 754 floating-point numbers to decimal string representations. Building upon the Schubfach algorithm, xjb introduces several key optimizations that significantly improve conversion speed while maintaining full compliance with the Steele-White principle for accurate and minimal-length output.

5.1. Summary of Contributions

The main contributions of this work are:

1.: Reduced computational complexity: By restructuring the calculation process and minimizing the number of high-precision multiplication operations, xjb achieves substantial performance improvements over the baseline Schubfach algorithm.
2.: Optimized branch structure: The algorithm employs unlikely branch patterns that enable efficient branch prediction on modern processors, resulting in consistent performance across diverse input distributions.
3.: Cross-platform portability: xjb demonstrates excellent performance across both x86-64 and ARM64 architectures, making it suitable for deployment in heterogeneous computing environments.
4.: Comprehensive validation: The algorithm has been thoroughly tested and verified to produce correct output for all IEEE 754 binary32 and binary64 floating-point values.

5.2. Experimental Findings

Our experimental evaluation on AMD R7-7840H and Apple M1 processors demonstrates that xjb achieves state-of-the-art performance:

3–5× speedup over the baseline Schubfach algorithm
Faster than other high-performance algorithms such as yy_json, yy_double, and zmij
Consistent performance across different input patterns and hardware platforms

5.3. Practical Implications

The xjb algorithm has immediate practical applications in numerous domains:

Data serialization: JSON and other text-based data formats require efficient floating-point to string conversion for serialization operations.
Scientific computing: Applications that output numerical results in human-readable format benefit from faster conversion without sacrificing accuracy.
Database systems: Export operations and query result formatting can leverage xjb for improved throughput.
Web services: RESTful APIs and web applications that return numerical data can achieve lower latency with efficient conversion.

5.4. Limitations and Future Work

While xjb demonstrates strong performance, several areas warrant further investigation:

1.: Extended precision support: Future work could extend xjb to support extended precision formats (e.g., 80-bit and 128-bit floating-point numbers) for applications requiring higher precision.
2.: SIMD vectorization: Although xjb is designed to be SIMD-friendly, explicit vectorization using AVX-512 or NEON could yield additional performance gains for batch conversion workloads.
3.: Compiler compatibility: Further optimization for different compilers (particularly MSVC) would improve portability across development environments.
4.: Memory-constrained environments: Investigating memory-efficient variants of xjb could benefit embedded systems and other resource-constrained platforms.

5.5. Availability

The complete implementation of the xjb algorithm, along with benchmark tools and test suites, is publicly available at https://github.com/xjb714/xjb/releases/tag/v1.0.0. We encourage the community to integrate, test, and contribute to the ongoing development of this work.

References

Steel, G. L., Jr.; White, J. L. How to Print Floating-Point Numbers Accurately. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI, New York, NY, USA; ACM, 1990; pp. 112–126. [Google Scholar] [CrossRef]
Steel, G. L., Jr.; White, J. L. How to Print Floating-Point Numbers Accurately (Retrospective). ACM SIGPLAN Notices;Best of PLDI 2004, 39(4), 1979–1999. Available online: https://dl.acm.org/doi/10.1145/989393.989431.
Burger, Robert G.; Dybvig, R. Kent. Printing Floating-point Numbers Quickly and Accurately. In Proceedings of the ACM SIGPLAN1996 Conference on Programming Language Design and Implementation (PLDI ’96), New York, NY, USA, 1996; ACM; p. 108ś116. [Google Scholar] [CrossRef]
Loitsch, F. Printing Floating-Point Numbers Quickly and Accurately with Integers. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, PLDI 2010. ACM, New York, NY, USA; pp. 233–243. [CrossRef]
Andrysco, M.; Jhala, R.; Lerner, S. Printing Floating-Point Numbers: a Faster, Always Correct Method. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016. ACM, New York, NY, USA; pp. 555–567. [CrossRef]
Adams, Ulf. Ryu¯: Fast Float-to-String Conversion. Proceed- ings of 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’18), New York, NY, USA, 2018; ACM; p. 13 pages. [Google Scholar] [CrossRef]
Adams, Ulf. Ryu¯ Revisited: Printf Floating Point Conversion. Proc. ACM Program. Lang. 3, OOPSLA, October 2019; 2019; Article 169, p. 23 pages. [Google Scholar] [CrossRef]
Giulietti, R. The Schubfach Way to Render Doubles. 2020. Available online: https://drive.google.com/file/d/1KLtG_LaIbK9ETXI290zqCxvBW94dj058/view.
Jeon, J. Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm. 2020. Available online: https://github.com/jk-jeon/Grisu-Exact/blob/master/other_files/Grisu-Exact.pdf.
Jeon, Junekey. Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm. 2024. Available online: https://github.com/jk-jeon/Dragonbox.
YaoYuan, Guo. Nov 2024. Available online: https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c.
Cox, Russ. Jan 2026. Available online: https://github.com/rsc/fpfmt.
Cox, Russ. Floating-Point Printing and Parsing Can Be Simple And Fast. Jan 2026. Available online: https://research.swtch.com/fp.
Cox, Russ. Fast Unrounded Scaling: Proof by Ivy. Jan 2026. Available online: https://research.swtch.com/fp-proof.
Zverovich, Victor. Mar 2026. Available online: https://github.com/vitaut/zmij.
IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985 1985, vol., no., 1–20. [CrossRef]
IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008) 2019, vol., no., 1–84. [CrossRef]
Khuong, Paul. How to print integers really fast (with Open Source AppNexus code!). Dec 2017. Available online: https://pvk.ca/Blog/2017/12/22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/.
Johnson, Dougall. Converting integers to fixed-width strings faster with Neon SIMD on the Apple M1. Apr 2022. Available online: https://dougallj.wordpress.com/2022/04/01/converting-integers-to-fixed-width-strings-faster-with-neon-simd-on-the-apple-m1/.
Muła, Wojciech. SSE: conversion integers to decimal representation. Oct 2011. Available online: http://0x80.pl/notesen/2011-10-21-sse-itoa.html.
Lemire, Daniel. Converting integers to decimal strings faster with AVX-512. Mar 2022. Available online: https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/.
Zverovich, Victor. Oct 2025. Available online: https://github.com/fmtlib/fmt.
YaoYuan, Guo. Aug 2025. Available online: https://github.com/ibireme/yyjson.
Neri, Cassio. Nov 2025. Available online: https://github.com/cassioneri/teju_jagua.
Leng, Jing. Nov 2025. Available online: https://github.com/lengjingzju/json/jnum.c.

Figure 1. float/double to string benchmark results.

Table 2. Valid ranges for significand c and exponent q

Category	Float (binary32)	Double (binary64)
Subnormal	$1 \leq c \leq 2^{23} - 1$ , $q = - 149$	$1 \leq c \leq 2^{52} - 1$ , $q = - 1074$
Normal	$2^{23} + 1 \leq c \leq 2^{24} - 1$	$2^{52} + 1 \leq c \leq 2^{53} - 1$
	$- 148 \leq q \leq 104$	$- 1073 \leq q \leq 971$
Irregular	$c = 2^{23}$ , $- 149 \leq q \leq 104$	$c = 2^{52}$ , $- 1074 \leq q \leq 971$

Table 3. Example of floating-point number printing result.

float number	fixed-point	scientific
2.34	“2.34”	“2.34e+00"
12	“12.0"	“1.2e+01"
120	“120.0"	“1.2e+02"
0.012	“0.012"	“1.2e-02"

Table 4. SIMD implementations of

d e c_t o_a s c i i 8

and

d e c_t o_a s c i i 16

Table 4. SIMD implementations of

d e c_t o_a s c i i 8

and

d e c_t o_a s c i i 16

SIMD implementation	Description
NEON [19]	Original author: Dougall Johnson. Runs on ARM processors with NEON instruction set.
SSE2 [20]	Based on scalar version; requires only SSE2 instruction set.
SSE4.1	Nearly identical to SSE2 implementation; requires SSE4.1 instruction set.
AVX512 [21]	Original author: Daniel Lemire. Requires AVX512IFMA and AVX512VBMI instruction sets.

Table 5. Printing formats for different ranges.

Type	Fixed-point	Scientific
float	$[10^{- 3}, 10^{7})$	other ranges
double	$[10^{- 4}, 10^{16})$	other ranges

Table 7. Float/double to decimal benchmark results on AMD R7-7840H and ubuntu 26.04. The unit is nanosecond(ns).

algorithm	float	double
	icpx 2025.0.4	icpx 2025.0.4
Schubfach	12.2	11.51
Schubfach_xjb	4.44	6.33
Ryu	14.02	13.08
Dragonbox	10.19	10.05
yy_json	4.67	5.72
yy_double	-	5.24
teju_jagua	14.99	14.37
zmij	4.76	4.78
uscalec	-	11.27
xjb	2.24	3.76

Table 8. Float/double to decimal benchmark results on Apple M1 and MacOS 26.4. The unit is nanosecond(ns).

algorithm	float	double
	apple clang 21.0.0	apple clang 21.0.0
Schubfach	11.64	13.12
Schubfach_xjb	5.16	6.58
Ryu	15.75	14.16
Dragonbox	11.78	12.03
yy_json	3.97	4.46
yy_double	-	4.08
teju_jagua	20.25	18.66
zmij	4.11	3.83
uscalec	-	15.26
xjb	2.15	2.58

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.

xjb: Fast Float to String Algorithm

Abstract

Keywords:

Subject:

1. Introduction

1.1. Background and Motivation

1.2. Contributions

1.3. Evaluation Overview

1.4. Explanation of Special Symbols in This Article

2. IEEE754 Floating-Point Number Representation

2.1. Scope and Assumptions

2.2. Binary Representation

2.3. Classification of Floating-Point Numbers

2.4. Value Representation

2.5. Rounding Intervals

3. Principle of Algorithm

3.1. Design Overview

3.2. Mathematical Foundation

3.3. Review of the Schubfach Algorithm and Our Derivation

3.3.1. Candidate Values for the Significand d

3.3.2. Decomposition into Integer and Fractional Parts

3.3.3. Selection Criteria for O n e

3.3.4. Algorithm Overview

3.4. Lookup Table Precomputation

3.4.1. Fundamental Calculation

3.4.2. Detailed Calculation Process

3.4.3. Storage Requirements

3.4.4. Implementation Notes

3.5. Efficient Computation of m

3.5.1. Relevant Theorems

3.5.2. Best Rational Approximation Function

3.5.3. Key Proof

3.5.4. Bit Width Calculation

3.5.5. Results

3.6. Fast Boundary Condition Testing for o n e = 0 and o n e = 10

3.6.1. Equivalent Conditions for Boundary Testing

Case 1: Testing o n e = 0

Case 2: Testing o n e = 10

3.6.2. Integer Testing Analysis

Analysis for o n e = 0

Analysis for o n e = 10

3.6.3. Key Insight: Integer Divisibility Test

3.6.4. Summary of Boundary Conditions

3.6.5. Efficient Implementation

3.7. Efficient Computation of ⌊ 10 n ⌋ and Rounding

3.7.1. 10 n − ⌊ 10 n ⌋ = 0.5

3.7.2. 10 n − ⌊ 10 n ⌋ ≠ 0.5

3.7.3. Efficient Implementation of n = 0.25 for Double

3.7.4. Efficient Calculation of o n e for Double

3.8. Irregular Number

3.9. Implementation of Pseudocode

3.9.1. Single-Precision Floating-Point Numbers

3.9.2. Double-Precision Floating-Point Numbers

3.10. Decimal to String Conversion

Scalar Implementation

Handling Undefined Behavior

SIMD Implementation

3.11. Summary

4. Experimental Evaluation

4.1. Correctness Verification

4.2. Experimental Setup

4.2.1. Hardware Platforms

4.2.2. Compilers and Compilation Flags

4.2.3. Benchmark Methodology

4.3. Algorithms Compared

4.4. Performance Results

4.5. Analysis and Discussion

4.5.1. Performance Comparison

4.5.2. Performance Consistency

4.5.3. Cross-Platform Performance

4.5.4. Comparison with Related Algorithms

4.6. Summary

5. Conclusions

5.1. Summary of Contributions

5.2. Experimental Findings

5.3. Practical Implications

5.4. Limitations and Future Work

5.5. Availability

References

MDPI Initiatives

3.3.3. Selection Criteria for $O n e$

3.6. Fast Boundary Condition Testing for $o n e = 0$ and $o n e = 10$

Case 1: Testing $o n e = 0$

Case 2: Testing $o n e = 10$

Analysis for $o n e = 0$

Analysis for $o n e = 10$

3.7. Efficient Computation of $⌊ 10 n ⌋$ and Rounding

3.7.1. $10 n - ⌊ 10 n ⌋ = 0.5$

3.7.2. $10 n - ⌊ 10 n ⌋ \neq 0.5$

3.7.3. Efficient Implementation of $n = 0.25$ for Double

3.7.4. Efficient Calculation of $o n e$ for Double