Preprint
Article

This version is not peer-reviewed.

xjb: Fast Float to String Algorithm

Submitted:

31 March 2026

Posted:

01 April 2026

You are already at the latest version

Abstract
Efficiently and accurately converting floating-point numbers to decimal strings is a critical challenge in numerical computation and data exchange. While existing algorithms like Ryu, Dragonbox, and Schubfach satisfy the Steele-White (SW) principle for accuracy, they often suffer from performance bottlenecks due to branch prediction failures and high-precision multiplication overhead. This paper presents a novel floating-point to string conversion algorithm called "xjb", an optimized variant of the Schubfach algorithm designed to deliver superior performance for IEEE754 single-precision (binary32) and double-precision (binary64) floating-point numbers. By minimizing instruction dependencies, reducing multiplication operations, mitigating branch prediction penalties and by utilizing the simd instruction set, xjb achieves significant performance gains. The algorithm features concise core implementation. Extensive benchmarking across diverse platforms, including AMD R7-7840H and Apple M1, demonstrates that xjb outperforms state-of-the-art algorithms in most scenarios while maintaining full compliance with the SW principle.
Keywords: 
;  ;  ;  

1. Introduction

Floating-point to decimal string conversion is a fundamental operation in computer systems with widespread applications across numerous domains. From scientific computing and financial systems to web services and database management, the ability to efficiently and accurately convert binary floating-point representations into human-readable decimal strings is essential. Despite its apparent simplicity, this conversion problem presents significant challenges in balancing the competing demands of correctness, performance, and output compactness.

1.1. Background and Motivation

In 1990, Steele and White [1,2] published the seminal paper How to Print Floating-Point Numbers Accurately, which established the foundational principles for optimal floating-point printing algorithms, now widely known as the Steele-White (SW) principle. The SW principle comprises four key requirements:
  • Information preservation: The printed result must be parseable back to the original floating-point number without loss of precision.
  • Minimum length: The output string should be as short as possible while maintaining information preservation.
  • Correct rounding: When multiple representations satisfy the first two criteria, the algorithm must correctly round to the nearest value, with ties broken by selecting the even value.
  • Left-to-right generation: The output digits should be generated sequentially from the most significant to the least significant digit.
These principles ensure that floating-point numbers have a unique, well-defined decimal representation that is both human-readable and machine-parseable. Algorithms satisfying the SW principle guarantee that the conversion process is reversible and produces the shortest possible output, which is crucial for data exchange, serialization, and user interface display.
Over the past three decades, significant research efforts have been devoted to developing efficient algorithms that satisfy the SW principle. Early approaches, such as the Dragon4 algorithm, provided correct results but suffered from performance limitations due to the use of arbitrary-precision arithmetic. Subsequent innovations introduced various optimizations:
  • dtoa.c [3] is an improved version of Dragon4, written by Robert G. Burger and R. Kent Dybvig in 1996. The source code link is https://www.netlib.org/fp/dtoa.c.
  • Grisu3 [4] pioneered the use of precomputed powers of ten to avoid expensive arbitrary-precision operations, though it occasionally falls back to slower methods for certain inputs.
  • Errol [5] improved upon Grisu3 by reducing the fallback rate through more precise error analysis.
  • Ry u ¯ [6,7] achieved significant performance gains through careful instruction scheduling and lookup table optimizations, establishing a new baseline for high-performance conversion.
  • Schubfach [8] introduced a novel approach based on the concept of "Schubfach" (pigeonhole principle), providing a compact and elegant solution with competitive performance.
  • Grisu-Exact [9] eliminated the need for fallbacks entirely while maintaining high performance.
  • Dragonbox [10] reduces the number of multiplications by sacrificing more branches.
  • yy_double [11] explored alternative computational strategies to minimize the multiplication cost, but there were still a few branches.
  • uscale [12,13,14] was proposed by Russ Cox and is used to enhance the performance of the floating-point number printing algorithm in the Go programming language.
  • zmij [15] is based on yy_double and has undergone extensive code optimization.
Despite these advances, existing algorithms still face several performance challenges that limit their effectiveness in high-throughput scenarios:
1.
Branch prediction penalties: Many algorithms rely heavily on conditional branches to handle different cases, leading to frequent branch mispredictions on modern pipelined processors.
2.
High-precision multiplication overhead: The conversion process requires high-precision arithmetic operations, particularly multiplications involving large precomputed constants, which can be expensive on standard hardware.
3.
Instruction dependency chains: Sequential dependencies between operations limit instruction-level parallelism and prevent efficient utilization of modern superscalar processors.
4.
Limited SIMD utilization: Most existing algorithms do not exploit vector instruction sets (SIMD) that are now ubiquitous in contemporary processors.
These limitations motivate the need for a new approach that addresses these performance bottlenecks while maintaining full compliance with the SW principle.

1.2. Contributions

This paper presents xjb, a novel floating-point to string conversion algorithm that achieves superior performance through systematic optimization of the underlying computational structure. The xjb algorithm is derived from the Schubfach algorithm and incorporates insights from yy_double and Dragonbox, but introduces several key innovations:
1.
Reduced instruction dependencies: By carefully restructuring the computation, xjb minimizes data dependencies between operations, enabling better instruction-level parallelism and improved pipeline utilization.
2.
Minimized multiplication operations: The algorithm reduces the number of expensive high-precision multiplications required during conversion, significantly decreasing computational overhead.
3.
Mitigated branch prediction penalties: Through branchless programming techniques and careful case analysis, xjb minimizes conditional branches that could lead to prediction failures.
4.
SIMD instruction utilization: The algorithm is designed to leverage SIMD instructions where applicable.
5.
Concise core implementation: Despite its sophisticated optimizations, xjb maintains a compact and readable core implementation, facilitating adoption and maintenance.
The xjb algorithm supports both IEEE 754 single-precision (binary32) and double-precision (binary64) floating-point formats, which are the most widely used floating-point representations in modern computing. For simplicity, this paper uses float to refer to IEEE 754 binary32 and double to refer to IEEE 754 binary64.

1.3. Evaluation Overview

We conducted extensive benchmarking of xjb across diverse hardware platforms, including AMD R7-7840H and Apple M1 processors. Our evaluation demonstrates that xjb outperforms state-of-the-art algorithms in most scenarios while maintaining full compliance with the SW principle. The algorithm exhibits excellent portability and scalability, making it suitable for deployment across a wide range of systems from embedded devices to high-performance servers.
The implementation code for the xjb algorithm and all benchmark tools used in this paper are available at https://github.com/xjb714/xjb.
The remainder of this paper is organized as follows. Section 2 reviews the IEEE 754 floating-point representation and establishes the mathematical foundation for the conversion problem. Section 3 presents the core principles and derivation of the xjb algorithm and describes the implementation details and optimizations. Section 4 presents experimental results comparing xjb against existing algorithms.

1.4. Explanation of Special Symbols in This Article

Provide special explanations for the special symbols used in the formulas of this article.
Table 1. Explanation of Special Symbols in This Article.
Table 1. Explanation of Special Symbols in This Article.
symbol brief explanation example
% Integer modulus operation 2 = 8%3
// Integer division operation 1 = 5//3
< < or > > Left or right shift of binary values 8 = 1 < < 3
? : Similar to the ternary operator in C syntax a = 1?a:b

2. IEEE754 Floating-Point Number Representation

This section establishes the mathematical foundation for floating-point number representation and defines the notation used throughout this paper. We focus on the IEEE 754 standard, which is the most widely adopted floating-point arithmetic standard in modern computing systems.

2.1. Scope and Assumptions

For clarity of presentation, we make the following simplifying assumptions:
  • We consider only positive floating-point numbers, as negative numbers differ only by a leading minus sign.
  • We exclude special values (zero, NaN, and infinity) from our analysis, as these are handled separately in practice.
These assumptions are standard in the literature and do not affect the generality of our algorithm, as the excluded cases can be handled with straightforward special-case logic.

2.2. Binary Representation

The IEEE 754 standard [16,17] defines two primary floating-point formats relevant to this work:
Double-precision (binary64): A 64-bit format consisting of:
  • 1 sign bit (s): indicates positive ( s = 0 ) or negative ( s = 1 ).
  • 11 exponent bits (e): biased exponent in the range [ 0 , 2047 ] .
  • 52 fraction bits (f): significand fraction in the range [ 0 , 2 52 1 ] .
Single-precision (binary32): A 32-bit format consisting of:
  • 1 sign bit (s): indicates positive ( s = 0 ) or negative ( s = 1 ).
  • 8 exponent bits (e): biased exponent in the range [ 0 , 255 ] .
  • 23 fraction bits (f): significand fraction in the range [ 0 , 2 23 1 ] .

2.3. Classification of Floating-Point Numbers

We classify floating-point numbers into three categories based on their exponent and fraction fields:
1.
Subnormal numbers ( e = 0 and f 0 ): These represent very small values close to zero, where the implicit leading bit of the significand is 0 instead of 1.
2.
Normal numbers ( e 0 and f 0 ): The standard case where the implicit leading bit of the significand is 1.
3.
Irregular numbers ( e 0 and f = 0 ): Numbers with zero fraction field, representing powers of two.
We use the term regular to refer to both subnormal and normal numbers (i.e., all cases where f 0 ).

2.4. Value Representation

The real value v of a positive floating-point number can be expressed in the unified form v = c · 2 q , where c is the integer significand and q is the exponent. The general formula covering all cases is:
double : v = f + ( e 0 ? 2 52 : 0 ) · 2 max ( e , 1 ) 1075 = c · 2 q float : v = f + ( e 0 ? 2 23 : 0 ) · 2 max ( e , 1 ) 150 = c · 2 q
For each category, the values decompose as follows:
Subnormal numbers ( e = 0 ):
subnormal : double : v = f · 2 1074 float : v = f · 2 149
Irregular numbers ( f = 0 , e 0 ):
irregular : double : v = 2 52 · 2 e 1075 float : v = 2 23 · 2 e 150
Normal numbers ( e 0 , f 0 ):
normal : double : v = ( f + 2 52 ) · 2 e 1075 float : v = ( f + 2 23 ) · 2 e 150

2.5. Rounding Intervals

A critical concept for accurate floating-point printing is the rounding interval  R v , which defines the range of real numbers that round to the given floating-point value v when parsed. The rounding interval is bounded by:
v l = c 1 2 · 2 q if f 0 or e 1 c 1 4 · 2 q if f = 0 v r = c + 1 2 · 2 q R v = [ v l , v r ] if f mod 2 = 0 ( even significand ) ( v l , v r ) if f mod 2 = 1 ( odd significand )
The rounding radius for regular floating-point numbers is 2 q 1 . The distinction between closed and open intervals at the boundaries depends on the parity of the significand, ensuring correct rounding according to the round-to-even rule specified in the IEEE 754 standard. Any decimal number within R v will parse back to the original floating-point value v, which is essential for ensuring the information preservation property of the SW principle.
Table 2 summarizes the valid ranges for c and q across different categories.

3. Principle of Algorithm

This section presents the algorithmic principles and mathematical foundation of the xjb floating-point to string conversion algorithm. We first introduce the overall architecture and design goals, followed by the mathematical formulation of the conversion problem.

3.1. Design Overview

This paper focuses on converting float (single-precision) and double (double-precision) floating-point numbers to decimal strings. The conversion process consists of two stages:
1.
Float-to-Decimal Conversion: Converting binary floating-point values to decimal significand-exponent pairs ( d , k ) .
2.
Decimal-to-String Conversion: Formatting ( d , k ) into human-readable strings.
We focus primarily on the first stage, as it is the computationally intensive component that benefits most from our optimizations.
The xjb algorithm is designed with four key performance objectives:
  • Minimize branch mispredictions through unlikely branches and branchless techniques
  • Reduce expensive high-precision multiplication operations
  • Optimize instruction dependencies to enable better parallelism
  • Maintain core implementation simplicity and readability

3.2. Mathematical Foundation

Before presenting the algorithm details, we establish the mathematical framework for the float-to-decimal conversion problem.
Recall from Section 2 that any floating-point value v can be expressed in the form v = c · 2 q , where c is the integer significand and q is the exponent. Our goal is to find the optimal decimal representation o p t = d · 10 k that satisfies the SW principle.
As established in Section 2, regular floating-point numbers (which include all normal and subnormal numbers with non-zero fraction fields) account for the vast majority of possible floating-point values. For the purposes of algorithm derivation, we focus primarily on regular numbers, as special cases can be handled with minimal additional logic.
The valid ranges for the significand c and exponent q for regular floating-point numbers are:
float : 1 c 2 24 1 , c 2 23 , q = 149 2 23 + 1 c 2 24 1 , 148 q 104 double : 1 c 2 53 1 , c 2 52 , q = 1074 2 52 + 1 c 2 53 1 , 1073 q 971
For irregular floating-point numbers (powers of two), the ranges are:
float : c = 2 23 , 149 q 104 double : c = 2 52 , 1074 q 971
For subnormal numbers:
float : c 2 23 1 , q = 149 double : c 2 52 1 , q = 1074
The conversion problem can now be formally stated as: given a floating-point value v = c · 2 q , find the optimal decimal representation o p t = d · 10 k such that:
v = c · 2 q o p t = d · 10 k subject to : o p t R v , d Z + , k Z
where R v is the rounding interval of v as defined in Section 2.
For example, consider the IEEE 754 binary64 floating-point number representing 1.3. Its actual value is 1.3000000000000000444089209850062616169452667236328125, with hexadecimal representation 3ff4cccccccccccd. The optimal decimal representation o p t satisfying the SW principle is simply 1.3. Similarly, for the binary32 representation of 1.3, with actual value 1.2999999523162841796875 and hexadecimal representation 3fa66666, the optimal representation is also 1.3.

3.3. Review of the Schubfach Algorithm and Our Derivation

This section reviews the Schubfach algorithm and presents our derivation of an optimized variant. We begin by establishing the mathematical foundation for determining the optimal decimal representation.

3.3.1. Candidate Values for the Significand d

The Schubfach algorithm identifies four candidate values for the decimal significand d:
d { 10 v · 10 k 1 , 10 v · 10 k 1 , 10 v · 10 k 1 + 1 , 10 v · 10 k 1 + 10 }
The exponent k is computed as:
k = q · lg 2 if v regular q · lg 2 lg ( 4 / 3 ) otherwise
For efficient computation on modern processors, Equation (11) can be implemented using integer arithmetic:
double : k = ( q · 315653 ( v regular ? 0 : 131072 ) ) 20 float : k = ( q · 1233 ( v regular ? 0 : 512 ) ) 12

3.3.2. Decomposition into Integer and Fractional Parts

Let m = v · 10 k 1 denote the integer part and n = v · 10 k 1 m the fractional part, where 0 n < 1 . Substituting into Equation (10):
d { 10 m , 10 ( m + n ) , 10 ( m + n ) + 1 , 10 m + 10 }
Since 10 ( m + n ) = 10 m + 10 n , the candidates simplify to:
d { 10 m , 10 m + 10 n , 10 m + 10 n + 1 , 10 m + 10 }
Based on the Schubfach algorithm and the SW principle, if the value 10 m or 10 m + 10 falls within the range R v , it is selected as the optimal solution d. In cases where neither 10 m nor 10 m + 10 lies within R v , the optimal solution d is determined as either 10 m + 10 n or 10 m + 10 n + 1 in accordance with the rules of correct rounding. We decompose d as d = t e n + o n e , where t e n = 10 m and o n e { 0 , 10 n , 10 n + 1 , 10 } . The problem thus reduces to determining the appropriate value of o n e .
o n e = 0 ; if 10 m R v 10 ; else if 10 m + 10 R v 10 n or 10 n + 1 ; else apply correct rounding

3.3.3. Selection Criteria for O n e

The selection of o n e depends on the relationship between the rounding interval and the candidate values. Recall that the rounding interval for a regular floating-point number v = c · 2 q is R v = [ v 2 q 1 , v + 2 q 1 ] , where 2 q 1 represents the half-unit in the last place (ulp) of v.
  • Case o n e = 0 (i.e., d = 10 m ): This case applies when 10 m · 10 k is the closest representable decimal to v within the rounding interval. The condition is derived as follows:
    The lower bound of the rounding interval must be less than 10 m · 10 k :
    c · 2 q 10 m · 10 k < 2 q 1 c · 2 q c · 2 q · 10 k 1 · 10 k + 1 < 2 q 1 c · 2 q · 10 k 1 c · 2 q · 10 k 1 < 2 1 · 2 q · 10 k 1 n < 2 q 1 · 10 k 1
    When equality holds ( n = 2 q 1 · 10 k 1 ), we apply the round-to-even rule, requiring c to be even:
    o n e = 0 if 2 q 1 · 10 k 1 > n or 2 q 1 · 10 k 1 = n and c mod 2 = 0
  • Case o n e = 10 (i.e., d = 10 m + 10 ): This case applies when ( 10 m + 10 ) · 10 k is the closest representable decimal to v. The condition is derived similarly:
    The upper bound of the rounding interval must be greater than ( 10 m + 10 ) · 10 k :
    ( 10 m + 10 ) · 10 k c · 2 q < 2 q 1 c · 2 q · 10 k 1 · 10 k + 1 + 10 k + 1 c · 2 q < 2 q 1 1 c · 2 q · 10 k 1 c · 2 q · 10 k 1 < 2 1 · 2 q · 10 k 1 1 n < 2 q 1 · 10 k 1
    When equality holds ( 1 n = 2 q 1 · 10 k 1 ), we again apply round-to-even:
    o n e = 10 if 2 q 1 · 10 k 1 > 1 n or 2 q 1 · 10 k 1 = 1 n and c mod 2 = 0
  • Case o n e { 10 n , 10 n + 1 } : When neither boundary condition applies, the optimal value lies between 10 m + 10 n and 10 m + 10 n + 1 . We determine o n e by rounding 10 n to the nearest integer:
    If the fractional part { 10 n } < 0.5 : o n e = 10 n
    If the fractional part { 10 n } > 0.5 : o n e = 10 n + 1
    If the fractional part { 10 n } = 0.5 : apply round-to-even
    For irregular floating-point numbers (powers of two), additional verification is required to ensure the selected value lies within the rounding interval R v , as the interval boundaries differ for these special cases.

3.3.4. Algorithm Overview

Algorithm 1 summarizes our optimized variant of the Schubfach algorithm (xjb32 for float, xjb64 for double). Given inputs c and q, the algorithm returns d and k such that d · 10 k satisfies the SW principle. The computation of k follows Equation (12); the remainder of this section focuses on efficient computation of d.
The subsequent sections detail:
1.
Lookup table precomputation
2.
Efficient computation of m
3.
Fast boundary condition testing for o n e { 0 , 10 }
4.
Efficient computation of 10 n and rounding
5.
Handling of irregular floating-point numbers
6.
implementation of pseudocode
The following will discuss the above content in detail from Section 3.4 to Section 3.9.
Algorithm 1:The xjb Algorithm for Float-to-Decimal Conversion
Require: 
Floating-point components c (significand) and q (exponent)
Ensure: 
Decimal representation d · 10 k satisfying the SW principle
1:
c · 2 q v
2:
if v regular then
3:
    k q · lg 2
4:
else
5:
    k q · lg 2 lg ( 4 / 3 )
6:
end if
7:
m v · 10 k 1
8:
n v · 10 k 1 m
9:
t e n 10 m
10:
δ 10 n 10 n fractional part of 10 n
11:
if δ = 0.5 then
12:
   if  10 n mod 2 = 0  then
13:
      o n e 10 n round to even
14:
   else
15:
      o n e 10 n + 1
16:
   end if δ < 0.5
17:
    o n e 10 n
18:
else
19:
    o n e 10 n + 1
20:
end if
21:
if v irregular then
22:
   if  δ > 2 q 2 · 10 k  then
23:
      o n e 10 n + 1
24:
   end if
25:
   if  2 q 2 · 10 k 1 n  then
26:
      o n e 0
27:
   end if
28:
else
29:
   if  2 q 1 · 10 k 1 > n OR 2 q 1 · 10 k 1 = n and c mod 2 = 0  then
30:
      o n e 0
31:
   end if
32:
   if  2 q 1 · 10 k 1 > 1 n OR 2 q 1 · 10 k 1 = 1 n and c mod 2 = 0  then
33:
      o n e 10
34:
   end if
35:
end if
36:
d t e n + o n e
37:
return d , k

3.4. Lookup Table Precomputation

The algorithm in this paper employs a lookup table to store pre-computed values of 10 k 1 for different ranges of q: 149 , 104 for float and 1074 , 971 for double. These lookup tables use extended precision: 64-bit for float and 128-bit for double. The reference implementation is available in gen.py.

3.4.1. Fundamental Calculation

Let B denote the bit length of each entry in the lookup table, with B = 64 for float and B = 128 for double. For any integer e 10 (representing a power of 10), we aim to represent 10 e 10 in the form f · 2 e 2 , where 1 f < 2 and e 2 is a real number. This gives:
f · 2 e 2 = 2 e 2 = 10 e 10
Taking the logarithm base 2 of both sides, we get:
e 2 = e 10 · log 2 ( 10 )
Solving for f gives:
f = 10 e 10 2 e 10 · log 2 ( 10 )
The lookup table entries are computed using upward rounding:
l o o k u p [ e 10 ] = f · 2 B 1 = 10 e 10 2 e 10 · log 2 ( 10 ) · 2 B 1 = 10 e 10 · 2 B 1 e 10 · log 2 ( 10 )
Notably, f · 2 B 1 becomes an integer for certain ranges of e 10 : 0 e 10 27 for float and 0 e 10 55 for double.

3.4.2. Detailed Calculation Process

The detailed calculation process is as follows:
  • Float
    The range of k 1 is calculated to be [-32, 44] through the q value range in Equation (6), so the lookup table contains representation values from 10 to the power of -32 to 10 to the power of 44. The calculation process is as follows:
    32 e 10 44 e 2 = e 10 · log 2 ( 10 ) 63 p o w 10 t = 2 e 2 / / 10 e 10 ; if e 10 < 0 10 e 10 / / 2 e 2 ; if e 10 20 10 e 10 · 2 e 2 ; if 1 e 10 19 f 1 , e 10 = p o w 10 = p o w 10 t + e 10 0 & & e 10 27 ? 0 : 1
    When 0 e 10 27 , the lookup table variable indicates that the values f 1 , e 10 · 2 e 10 · log 2 ( 10 ) 63 and 10 e 10 are equal. In other cases, the relative error is less than 2 63 . Expressed as:
    r 1 , e 10 = f 1 , e 10 · 2 e 10 · log 2 ( 10 ) 63 10 e 10 1 ; if 0 e 10 27 1 , 1 + 2 63 ; if e 10 < 0 or e 10 > 27
  • Double
    The range of k 1 is calculated to be [-293, 323] through the q value range in Equation (6), so the lookup table contains representation values from 10 to the power of -293 to 10 to the power of 323. The calculation process is as follows:
    293 e 10 323 e 2 = e 10 · log 2 ( 10 ) 127 p o w 10 t = 2 e 2 / / 10 e 10 ; if e 10 < 0 10 e 10 / / 2 e 2 ; if e 10 39 10 e 10 · 2 e 2 ; if 1 e 10 38 f 1 , e 10 = p o w 10 = p o w 10 t + e 10 0 & & e 10 55 ? 0 : 1
    When 0 e 10 55 , the lookup table variable indicates that the values f 1 , e 10 · 2 e 10 · log 2 ( 10 ) 127 and 10 e 10 are equal. In other cases, the relative error is less than 2 127 . Expressed as:
    r 1 , e 10 = f 1 , e 10 · 2 e 10 · log 2 ( 10 ) 127 10 e 10 1 ; if 0 e 10 55 1 , 1 + 2 127 ; if e 10 < 0 or e 10 > 55
Let r 1 denote the error for float lookup table entries, r 2 for double entries, and r for both. In Algorithm  (1), we retrieve 10 k 1 from the lookup table. The lookup table provides exact values when q falls within specific ranges:
float : 0 k 1 27 93 q 1 double : 0 k 1 55 186 q 1
For q outside these ranges, the lookup table entries have bounded relative errors:
float : 0 < r 1 1 < 2 63 double : 0 < r 2 1 < 2 127

3.4.3. Storage Requirements

The float lookup table requires 616 bytes of storage, calculated as ( 44 ( 32 ) + 1 ) × 8 bytes (77 entries × 8 bytes each). The double lookup table requires 9872 bytes, calculated as ( 323 ( 293 ) + 1 ) × 16 bytes (617 entries × 16 bytes each).

3.4.4. Implementation Notes

The lookup table pre-computation uses efficient integer arithmetic to avoid precision loss during calculations. The conditional logic in Equations (24) and (26) optimizes the computation based on the sign and magnitude of e 10 , ensuring efficient generation of accurate lookup table entries.

3.5. Efficient Computation of m

This section presents an efficient method for calculating m in Algorithm  (1), which is defined as m = v · 10 k 1 .

3.5.1. Relevant Theorems

We start with some theorems from the Dragonbox algorithm paper:
Let n, P, and Q be positive integers where P and Q are coprime, P < Q , 1 n n m a x , and Q > n m a x . Let P * / Q * be the best rational approximation of P / Q with Q * n m a x and P * / Q * be the best rational approximation with Q * n m a x .
If n · P does not divide Q evenly:
n · P Q + 1 = n · P Q
Suppose the following holds:
n · P Q = n · ξ
Then:
P * Q * = max 1 n n m a x n · P Q n ξ < min 1 n n m a x n · P Q + 1 n = min 1 n n m a x n · P Q n = P * Q *
Thus, the range of ξ is:
P * Q * ξ < P * Q *
The range of the fractional part of n · P Q is:
( Q * P ) % Q Q , ( Q * P ) % Q Q
Here, the fractional part is minimized when n = Q * and maximized when n = Q * .

3.5.2. Best Rational Approximation Function

We define the best rational approximation function (implemented in test1.py):
( P * Q * , P * Q * ) = ( D N , U P ) = f ( C , P , Q )
This function f ( C , P , Q ) calculates the best rational approximation with denominator not exceeding C using the Farey sequence median theorem. D N and U P are adjacent terms in the C-th order Farey sequence F C . Therefore, we can calculate Equation (34) very quickly.

3.5.3. Key Proof

In Algorithm  (1), we need to compute m = c · 2 q · 10 k 1 . We aim to prove:
m = c · 2 q · 10 k 1 = c · 2 q · r · 10 k 1
Where r is the lookup table error defined in Equations (25) and (27). When condition (28) is met, r = 1 , and the equation holds trivially. For r 1 :
float : 1 < r < 1 + 2 63 double : 1 < r < 1 + 2 127
Calculate the range of 2 q · 10 k 1 and we get:
2 q · 10 k 1 = 10 1 · 2 q · 10 q · lg ( 2 ) = 10 1 · 10 q · lg ( 2 ) q · lg ( 2 )
When q is not 0, Equation (38) exists:
q · lg ( 2 ) q · lg ( 2 ) 0 < q · lg ( 2 ) q · lg ( 2 ) < 1
When q is 0, q · lg ( 2 ) q · lg ( 2 ) = 0 , so the final conclusion is:
10 1 2 q · 10 k 1 < 1
Because there is:
c · 2 q · 10 k 1 = c · 2 q k 1 5 k + 1 0.1 c , c
Therefore:
c · 2 q · 10 k 1 = c · 2 q k 1 5 k + 1 ; q 1 c 2 1 + k q · 5 k + 1 = c 10 ; q = 0 c · 5 k 1 2 1 + k q ; q < 0
Suppose:
c · 2 q · 10 k 1 = c · x y < c
Then there are:
x , y = 2 q k 1 , 5 k + 1 ; q 1 1 , 10 ; q = 0 5 k 1 , 2 1 + k q ; q < 0

3.5.4. Bit Width Calculation

Define maximum values for c:
float : c c m a x = C 1 = 2 24 1 double : c c m a x = C 2 = 2 53 1
Let C denote either C 1 or C 2 depending on the precision.
For y > C , compute P * and Q * for each q using f ( C , x , y ) and find the minimum B I T such that:
x y ( 1 + 2 B I T ) < P * Q *
For y C :
c · x y 1 + 1 C y = c x + c C · x y y < c x + 1 y
Thus:
c · x y = c · x y 1 + 1 C y
Similarly, find the minimum B I T such that:
x y ( 1 + 2 B I T ) < x y 1 + 1 C y

3.5.5. Results

The maximum of the minimum B I T values for all q (calculated in test1.py in about 1-2 seconds) is:
float : B I T m a x = 52 double : B I T m a x = 113
Thus:
float : c · x y = c · x y · ( 1 + 2 52 ) = c · x y · r 1 double : c · x y = c · x y · ( 1 + 2 113 ) = c · x y · r 2
This result confirms that m can be calculated efficiently using the lookup table values, even with their inherent errors. Once m is determined, t e n = 10 m can be computed quickly.

3.6. Fast Boundary Condition Testing for o n e = 0 and o n e = 10

In Algorithm 1, the conditions for determining o n e = 0 and o n e = 10 appear on lines 31 and 34, respectively. This section introduces an optimized method to quickly test these boundary conditions using equivalent mathematical formulations.

3.6.1. Equivalent Conditions for Boundary Testing

We start by deriving equivalent mathematical conditions for testing o n e = 0 and o n e = 10 .
Case 1: Testing o n e = 0
When 2 1 · 2 q · 10 k 1 = n , this is equivalent to:
c · 2 q · 10 k 1 c · 2 q · 10 k 1 = 2 1 · 2 q · 10 k 1 ( 2 c 1 ) · 2 q 1 · 10 k 1 = c · 2 q · 10 k 1
Case 2: Testing o n e = 10
When 2 1 · 2 q · 10 k 1 = 1 n , this is equivalent to:
c · 2 q · 10 k 1 c · 2 q · 10 k 1 + 1 = 2 1 · 2 q · 10 k 1 ( 2 c + 1 ) · 2 q 1 · 10 k 1 = c · 2 q · 10 k 1 + 1

3.6.2. Integer Testing Analysis

To further analyze these conditions, we start with the range of 2 q 1 · 10 k 1 from Equation (40):
2 q 1 · 10 k 1 [ 0.05 , 0.5 )
Analysis for o n e = 0
For the o n e = 0 case, we can derive:
c · 2 q · 10 k 1 1 < c · 2 q · 10 k 1 0.5 < ( 2 c 1 ) · 2 q 1 · 10 k 1 c · 2 q · 10 k 1 0.05 < c · 2 q · 10 k 1 + 1
This implies that when ( 2 c 1 ) · 2 q 1 · 10 k 1 is an integer, it must equal c · 2 q · 10 k 1 .
Analysis for o n e = 10
Similarly, for the o n e = 10 case:
c · 2 q · 10 k 1 < c · 2 q · 10 k 1 + 0.05 ( 2 c + 1 ) · 2 q 1 · 10 k 1 < c · 2 q · 10 k 1 + 0.5 < c · 2 q · 10 k 1 + 2
This implies that when ( 2 c + 1 ) · 2 q 1 · 10 k 1 is an integer, it must equal c · 2 q · 10 k 1 + 1 .

3.6.3. Key Insight: Integer Divisibility Test

The key insight is that testing for o n e = 0 or o n e = 10 is equivalent to checking whether ( 2 c ± 1 ) · 2 q 1 · 10 k 1 is an integer. This can be rewritten as:
( 2 c ± 1 ) · 2 q 1 · 10 k 1 = ( 2 c ± 1 ) · 2 q k 2 · 5 k 1
We analyze different ranges of q to simplify this condition:
  • Case q 2 : From q 2 , we get k 0 . The expression simplifies to checking if ( 2 c ± 1 ) · 2 q k 2 is divisible by 5 k + 1 . Since 2 and 5 are coprime, this reduces to checking if ( 2 c ± 1 ) is divisible by 5 k + 1 :
    ( 2 c ± 1 ) mod 5 k + 1 = 0
    Let t be a positive integer such that 2 c ± 1 = t · 5 k + 1 . Since 2 c ± 1 is odd, t must also be odd. Considering the ranges of c for float and double:
    float : 2 c 1 [ 2 24 + 1 , 2 25 3 ] ; 2 c + 1 [ 2 24 + 3 , 2 25 1 ] ; double : 2 c 1 [ 2 53 + 1 , 2 54 3 ] ; 2 c + 1 [ 2 53 + 3 , 2 54 1 ] ;
    This gives us the range for t:
    float : 2 24 + 1 5 k + 1 t 2 25 1 5 k + 1 ; double : 2 53 + 1 5 k + 1 t 2 54 1 5 k + 1 ;
    The maximum values of k where t can be at least one odd integer are:
    float : k max = 9 q max = 33 , t = 3 double : k max = 22 q max = 76 , t = 1
  • Case 1 q 0 : The denominator 2 2 + k q · 5 k + 1 is even, while the numerator ( 2 c ± 1 ) is odd, so no solution exists.
  • Case q < 0 : The denominator 2 2 + k q is even, while the numerator ( 2 c ± 1 ) · 5 k 1 is odd, so no solution exists.

3.6.4. Summary of Boundary Conditions

In summary, the situations when ( 2 c ± 1 ) · 2 q 1 · 10 k 1 is an integer are as follows:
float : 2 q 33 & & ( 2 c ± 1 ) mod 5 k + 1 = 0 ; double : 2 q 76 & & ( 2 c ± 1 ) mod 5 k + 1 = 0 ;
The range of k 1 is:
float : 10 k 1 1 double : 23 k 1 1

3.6.5. Efficient Implementation

We can further simplify the testing conditions using bitwise operations. For o n e = 0 :
float : 2 35 · 2 q · 10 k 1 = 2 36 · n double : 2 63 · 2 q · 10 k 1 = 2 64 · n
When 2 1 · 2 q · 10 k 1 = 1 n , the following conclusions can be drawn:
float : 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n = 2 36 1 2 36 · n double : 2 63 · 2 q · 10 k 1 = 2 64 2 64 · n 2 63 · 2 q · 10 k 1 = 2 64 2 64 · n = 2 64 1 2 64 · n
The discussion on whether 2 36 2 36 · n = 2 36 1 2 36 · n in Equation (65) holds true, that is, whether 2 36 · n in Equation (65) is an integer, or equivalent to discussing whether the following values are integers when Equation (62) holds true (the same applies to double) :
float : 2 36 · ( m + n ) = c · 2 q + 36 · 10 k 1 = c · 2 q k + 35 · 5 k 1 = c · 2 q k + 35 5 k + 1 double : 2 64 · ( m + n ) = c · 2 q + 64 · 10 k 1 = c · 2 q k + 63 · 5 k 1 = c · 2 q k + 63 5 k + 1
Suppose c can divide 5 k + 1 evenly (where t is a temporary integer variable):
c = t · 5 k + 1 ; t 1
Therefore, when Equation (67) was established, there were:
2 c ± 1 = 2 · t · 5 k + 1 ± 1
Expression (68) cannot divide 5 k + 1 evenly, which contradicts Equation (62), so c cannot divide 5 k + 1 evenly. Therefore, for float, c · 2 q + 36 · 10 k 1 and 2 36 · n are not integers; For double, c · 2 64 + q · 10 k 1 and 2 64 · n are not integers, that is:
float : 2 36 2 36 · n = 2 36 + 2 36 · n = 2 36 1 2 36 · n double : 2 64 2 64 · n = 2 64 + 2 64 · n = 2 64 1 2 64 · n
Therefore, the conclusion (65) is correct. Discuss the necessary and sufficient conditions for whether 2 35 · 2 q · 10 k 1 = 2 36 · n is 2 1 · 2 q · 10 k 1 = n . The same applies to double, expressed as:
float : 2 1 · 2 q · 10 k 1 = n 2 35 · 2 q · 10 k 1 = 2 36 · n double : 2 1 · 2 q · 10 k 1 = n 2 63 · 2 q · 10 k 1 = 2 64 · n
Similarly, the necessary and sufficient conditions for whether 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n is 2 1 · 2 q · 10 k 1 = 1 n . The same applies to double, expressed as:
float : 2 1 · 2 q · 10 k 1 = 1 n 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n double : 2 1 · 2 q · 10 k 1 = 1 n 2 63 · 2 q · 10 k 1 = 2 64 2 64 · n
The sufficient conditions of Equations (70) and (71) are obviously established. Introduce the proof that Equation (70) holds. For float, only the necessary conditions need to be discussed, that is, whether 2 1 · 2 q · 10 k 1 = n must hold true when 2 35 · 2 q · 10 k 1 = 2 36 · n holds, or equivalent to 2 35 · 2 q · 10 k 1 2 36 · n must hold true when 2 1 · 2 q · 10 k 1 n . The following is proved by proof by contradiction.
Assume that 2 35 · 2 q · 10 k 1 = 2 36 · n holds when 2 1 · 2 q · 10 k 1 n . Then there is:
2 35 · 2 q · 10 k 1 = 2 36 · n 0 < 2 35 · 2 q · 10 k 1 2 36 · n < 1 0 < 2 c 1 · 2 q 1 · 10 k 1 m < 2 36
As is known from Equation (55), there is:
m 1 < 2 c 1 · 2 q 1 · 10 k 1 < m + 1
Suppose the decimal part of 2 c 1 · 2 q 1 · 10 k 1 is represented as n , thus we have:
2 c 1 · 2 q 1 · 10 k 1 m = n ; if 2 c 1 · 2 q 1 · 10 k 1 > m 1 n ; if 2 c 1 · 2 q 1 · 10 k 1 < m
Substitute expression (74) into expression (72), and we get:
0 < 2 c 1 · 2 q 1 · 10 k 1 m < 2 36 0 < n < 2 36 or 0 < 1 n < 2 36
Similarly, it can be known that the double range is the range of n . Therefore, there is:
float : n 0 , 2 36 1 2 36 , 1 double : n 0 , 2 64 1 2 64 , 1
When 2 1 · 2 q · 10 k 1 n , it is known from Equation (52) that 2 c 1 · 2 q 1 · 10 k 1 is not an integer. Therefore, there is:
0 < n < 1
It is only necessary to prove that Equation (76) does not hold. Discuss the range of the decimal part n when 2 c 1 · 2 q 1 · 10 k 1 is not an integer. According to Equation (57), there are:
2 c 1 · 2 q 1 · 10 k 1 = 2 c 1 · x y = 2 c 1 · 2 q k 2 5 k + 1 ; q 2 2 c 1 2 2 + k q · 5 k + 1 ; 1 q 0 2 c 1 · 5 k 1 2 2 + k q ; q < 0
The maximum value of 2 c 1 is:
float : 2 c 1 max = 2 25 3 double : 2 c 1 max = 2 54 3
Discuss based on the denominator range in Equation (78).
  • y 2 c 1 max
    When y 2 c 1 max , y max is the expression (79), the following holds true:
    1 y max n 1 1 y max 1 y max 1 n 1 1 y max
    Therefore, when y 2 c 1 max , Equation (76) does not hold true.
  • y > 2 c 1 max
    Call function (35) to calculate the approximation results P * / Q * and P * / Q * of all possible upper and lower limit rational numbers:
    P * Q * , P * Q * = f 2 c 1 max , x , y
    Therefore, for n , the following conclusion can be drawn from formula (34).
    n Q * x % y y , Q * x % y y
    By exhausting all possibilities, we thus have (the test code file is test3.py) :
    float : 2 33 < n < 1 2 29 double : 2 62 < n < 1 2 63
    float : Q * x % y y , Q * x % y y 0 , 2 36 = Q * x % y y , Q * x % y y 1 2 36 , 1 = double : Q * x % y y , Q * x % y y 0 , 2 64 = Q * x % y y , Q * x % y y 1 2 64 , 1 =
    Therefore, when y > 2 c 1 max , Equation (76) does not hold true.
In summary, when 2 1 · 2 q · 10 k 1 n , Equation (76) does not hold true, that is, 2 35 · 2 q · 10 k 1 2 36 · n must hold true. Therefore, when 2 35 · 2 q · 10 k 1 = 2 36 · n holds, 2 1 · 2 q · 10 k 1 = n must hold true. Therefore, Equation (70) holds.
Similarly, it can be proved that when 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n holds, 2 1 · 2 q · 10 k 1 = 1 n must hold true. The same applies to double. Similarly, by proof of contradiction, for float, it is assumed that when 2 1 · 2 q · 10 k 1 1 n holds, 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n holds. That is:
2 35 · 2 q · 10 k 1 = 2 36 2 36 · n 0 < 2 35 · 2 q · 10 k 1 2 36 + 2 36 · n < 1 0 < 2 q 1 · 10 k 1 1 + n < 2 36 2 36 < 2 c + 1 · 2 q 1 · 10 k 1 m 1 < 2 36
As is known from Equation (56), there is:
m < 2 c + 1 · 2 q 1 · 10 k 1 < m + 2
Suppose the decimal part of 2 c + 1 · 2 q 1 · 10 k 1 is represented as n + , thus we have:
2 c + 1 · 2 q 1 · 10 k 1 m 1 = n + ; if 2 c + 1 · 2 q 1 · 10 k 1 > m + 1 1 n + ; if 2 c + 1 · 2 q 1 · 10 k 1 < m + 1
Substitute expression (87) into expression (85), and we get:
0 < 2 c + 1 · 2 q 1 · 10 k 1 m 1 < 2 36 0 < 1 n + < 2 36 or 0 < n + < 2 36
Similarly, it can be known that the double range is the range of n + . Therefore, there is:
float : n + 0 , 2 36 1 2 36 , 1 double : n + 0 , 2 64 1 2 64 , 1
When 2 1 · 2 q · 10 k 1 1 n , it is known from Equation (53) that 2 c + 1 · 2 q 1 · 10 k 1 is not an integer. Therefore, there is:
0 < n + < 1
It is only necessary to prove that Equation (89) does not hold. Discuss the range of the decimal part n + when 2 c + 1 · 2 q 1 · 10 k 1 is not an integer. According to Equation (57), there are:
2 c + 1 · 2 q 1 · 10 k 1 = 2 c + 1 · x y = 2 c + 1 · 2 q k 2 5 k + 1 ; q 2 2 c + 1 2 2 + k q · 5 k + 1 ; 1 q 0 2 c + 1 · 5 k 1 2 2 + k q ; q < 0
The maximum value of 2 c + 1 is:
float : 2 c + 1 max = 2 25 1 double : 2 c + 1 max = 2 54 1
Discuss based on the denominator range in Equation (91).
  • y 2 c + 1 max
    When y 2 c + 1 max , y max is the expression (92), the following holds true:
    1 y max n + 1 1 y max 1 y max 1 n + 1 1 y max
    Therefore, when y 2 c + 1 max , Equation (89) does not hold true.
  • y > 2 c + 1 max
    Call function (35) to calculate the approximation results P * Q * and P * Q * of all possible upper and lower limit rational numbers:
    P * Q * , P * Q * = f 2 c + 1 max , x , y
    Therefore, for n + , the following conclusion can be drawn from formula (34).
    n + Q * x % y y , Q * x % y y
    By exhausting all possibilities, we thus have (the test code file is test7.py) :
    float : 2 33 < n + < 1 2 29 double : 2 62 < n + < 1 2 63
    float : Q * x % y y , Q * x % y y 0 , 2 36 = Q * x % y y , Q * x % y y 1 2 36 , 1 = double : Q * x % y y , Q * x % y y 0 , 2 64 = Q * x % y y , Q * x % y y 1 2 64 , 1 =
    Therefore, when y > 2 c + 1 max , Equation (89) does not hold true.
In summary, when 2 1 · 2 q · 10 k 1 1 n , Equation (89) does not hold true, that is, 2 35 · 2 q · 10 k 1 2 36 2 36 · n must hold true. Therefore, when 2 35 · 2 q · 10 k 1 = 2 36 2 36 · n holds, 2 1 · 2 q · 10 k 1 = 1 n must hold true. The same is true for double.Therefore, Equation (71) holds.
The following conclusions hold:
float : 2 36 2 36 · n = 2 36 1 2 36 · n ; if c · 2 36 + q · 10 k 1 Z 2 36 2 36 · n ; if c · 2 36 + q · 10 k 1 Z double : 2 64 2 64 · n = 2 64 1 2 64 · n ; if c · 2 64 + q · 10 k 1 Z 2 64 2 64 · n ; if c · 2 64 + q · 10 k 1 Z
Discuss whether the following Equation (99) holds when conditions (62) and (63) are met:
float : c · 2 q + 35 k 5 k + 1 = c · 2 q + 35 k 5 k + 1 · r = c · 2 q + 35 k 5 k + 1 · 2 63 k 1 · log 2 ( 10 ) / / 10 k + 1 + 1 10 k 1 · 2 k 1 · log 2 ( 10 ) 63 double : c · 2 q + 63 k 5 k + 1 = c · 2 q + 63 k 5 k + 1 · r = c · 2 q + 63 k 5 k + 1 · 2 127 k 1 · log 2 ( 10 ) / / 10 k + 1 + 1 10 k 1 · 2 k 1 · log 2 ( 10 ) 127
There are:
float : c · 2 q + 35 k 5 k + 1 = 2 36 · m + n = 2 36 · m + 2 36 · n double : c · 2 q + 63 k 5 k + 1 = 2 64 · m + n = 2 64 · m + 2 64 · n
It has been proven earlier that m can be accurately calculated. Then, when (99) holds true, the values 2 36 · n and 2 64 · n on the right side of Equations (64) and (65) can be accurately calculated.
From Equation (57), we have:
c = t · 5 k + 1 1 2
Substituting Equation (101) into Equation (99), we have:
float : c · 2 q + 35 k 5 k + 1 = t · 2 q + 34 k 2 q + 34 k 5 k + 1 double : c · 2 q + 63 k 5 k + 1 = t · 2 q + 62 k 2 q + 62 k 5 k + 1
When conditions (62) and (63) are met, t · 2 q + 34 k and t · 2 q + 62 k are integers. Under the condition of meeting condition (62), the decimal part of expression (102) is represented as:
float : 2 q + 34 k % 5 k + 1 5 k + 1 ; 2 q 33 double : 2 q + 62 k % 5 k + 1 5 k + 1 ; 2 q 76
It is only necessary to prove that the increase in the value c · 2 q + 35 k 5 k + 1 · r on the right side of the expression compared to the value c · 2 q + 35 k 5 k + 1 on the left side plus the decimal part of the value on the left side is less than 1 for Equation (99) to hold true. That is:
float : 2 q + 34 k % 5 k + 1 5 k + 1 + c · 2 q + 35 k 5 k + 1 · r c · 2 q + 35 k 5 k + 1 < 1 double : 2 q + 62 k % 5 k + 1 5 k + 1 + c · 2 q + 63 k 5 k + 1 · r c · 2 q + 63 k 5 k + 1 < 1
By exhaustionally calculating the maximum possible c value under each q and substituting it into Equation (104), it holds. The calculation result is in test2.py. The calculation results show that for the float range and the double range, Equation (104) always holds true. Therefore, Equation (99) holds true, and thus the values of 2 36 · n and 2 64 · n on the right side of Equations (64) and (65) can be accurately calculated. The values of 2 35 · 2 q · 10 k 1 and 2 63 · 2 q · 10 k 1 on the left side of Equations (64) and (65) can be calculated through lookup tables.
float : 2 35 · 2 q · 10 k 1 = p o w 10 28 q k 1 · log 2 ( 10 ) double : 2 63 · 2 q · 10 k 1 = p o w 10 64 q k 1 · log 2 ( 10 )
The code file for verifying the validity of Equation (105) is test4.py. Therefore, when conditions (62) and (63) are met, the values of both sides of Equations (64) and (65) can be accurately calculated.
Discuss the relationship between the following two values within all ranges of floating-point numbers:
float : c · 2 q + 36 · 10 k 1 ; c · 2 q + 36 · r · 10 k 1 ; double : c · 2 q + 64 · 10 k 1 ; c · 2 q + 64 · r · 10 k 1 ;
When r = 1 , it is obvious that the two values in expression (106) are equal. When r 1 , or equivalent to r > 1 , has:
float : c · 2 q + 36 · r · 10 k 1 = c · 2 q + 36 · 10 k 1 + c · 2 q + 36 · r 1 · 10 k 1 < c · 2 q + 36 · 10 k 1 + 2 24 · 2 36 · 2 q · 10 k 1 · r 1 < c · 2 q + 36 · 10 k 1 + 2 3 c · 2 q + 36 · r · 10 k 1 c · 2 q + 36 · 10 k 1 + 1 double : c · 2 q + 64 · r · 10 k 1 = c · 2 q + 64 · 10 k 1 + c · 2 q + 64 · r 1 · 10 k 1 < c · 2 q + 64 · 10 k 1 + 2 53 · 2 64 · 2 q · 10 k 1 · r 1 < c · 2 q + 64 · 10 k 1 + 2 10 c · 2 q + 64 · r · 10 k 1 c · 2 q + 64 · 10 k 1 + 1
Therefore, there is:
float : 0 c · 2 q + 36 · r · 10 k 1 c · 2 q + 36 · 10 k 1 1 double : 0 c · 2 q + 64 · r · 10 k 1 c · 2 q + 64 · 10 k 1 1
Because there is:
c · 2 q · 10 k 1 = c · 2 q · r · 10 k 1 = m
float : c · 2 q + 36 · 10 k 1 = 2 36 · m + 2 36 · n double : c · 2 q + 64 · 10 k 1 = 2 64 · m + 2 64 · n
Suppose:
n r = c · 2 q · r · 10 k 1 m
Therefore, the following conclusion can be drawn: when condition (62) is met, from Equation (99), we have:
float : 2 q 33 & & 2 c ± 1 % 5 k + 1 = 0 2 36 · n = 2 36 · n r double : 2 q 76 & & 2 c ± 1 % 5 k + 1 = 0 2 64 · n = 2 64 · n r
Within the range of floating-point numbers, there exists:
float : 2 36 · n 2 36 · n r 2 36 · n + 1 double : 2 64 · n 2 64 · n r 2 64 · n + 1
To simplify the expression, e v e n is used to indicate whether c is an even number:
e v e n = ( c + 1 ) % 2 0 , 1
When 2 1 · 2 q · 10 k 1 = n or 2 1 · 2 q · 10 k 1 = 1 n , 2 1 · 2 q · 10 k 1 = n is the boundary condition for o n e = 0 , and 2 1 · 2 q · 10 k 1 = 1 n is the boundary condition for o n e = 10 . Whether o n e is 0 or 10 is determined based on whether c is an even number.Therefore, the following exists:
float : o n e = 0 : 2 q + 35 · 10 k 1 + e v e n > 2 36 · n r o n e = 10 : 2 q + 35 · 10 k 1 + e v e n > 2 36 1 2 36 · n r double : o n e = 0 : 2 q + 63 · 10 k 1 + e v e n > 2 64 · n r o n e = 10 : 2 q + 63 · 10 k 1 + e v e n > 2 64 1 2 64 · n r
Therefore, when 2 1 · 2 q · 10 k 1 = n or 2 1 · 2 q · 10 k 1 = 1 n , we can use the condition (116) to determine whether o n e = 0 or o n e = 10 .
float : if 2 q + 35 · 10 k 1 + e v e n > 2 36 · n r : o n e = 0 if 2 q + 35 · 10 k 1 + e v e n > 2 36 1 2 36 · n r : o n e = 10 double : if 2 q + 63 · 10 k 1 + e v e n > 2 64 · n r : o n e = 0 if 2 q + 63 · 10 k 1 + e v e n > 2 64 1 2 64 · n r : o n e = 10
When 2 1 · 2 q · 10 k 1 > n or 2 1 · 2 q · 10 k 1 > 1 n , We can also use the above condition (116) to determine whether o n e = 0 or o n e = 10 . When 2 1 · 2 q · 10 k 1 < n or 2 1 · 2 q · 10 k 1 < 1 n , we can also use the above condition (116) to determine whether o n e 0 or o n e 10 . There are a total of four situations.The proof is as follows:
(1)When 2 1 · 2 q · 10 k 1 < n , there must exist o n e 0 , and there is:
float : 2 1 · 2 q · 10 k 1 n = n 1 2 33 1 , 2 29 double : 2 1 · 2 q · 10 k 1 n = n 1 2 62 1 , 2 63
Therefore, the following exists:
float : 2 q + 35 · 10 k 1 2 36 · n 2 3 2 36 , 2 7 double : 2 q + 63 · 10 k 1 2 64 · n 4 2 64 , 2
Suppose there are two real numbers a and b, and the following relationship must exist:
0 b b < 1 a a 1 < b b < 1 + a a a b 1 < a b < a b + 1
When a = 2 q + 35 · 10 k 1 and b = 2 36 · n or a = 2 q + 63 · 10 k 1 and b = 2 64 · n , the following exists:
float : 2 q + 35 · 10 k 1 2 36 · n < 2 q + 35 · 10 k 1 2 36 · n + 1 double : 2 q + 63 · 10 k 1 2 64 · n < 2 q + 63 · 10 k 1 2 64 · n + 1
From Equation (118), we have:
float : 2 q + 35 · 10 k 1 2 36 · n < 1 2 7 < 0 double : 2 q + 63 · 10 k 1 2 64 · n < 1 2 < 0
Therefore, there is:
float : 2 q + 35 · 10 k 1 + e v e n 2 q + 35 · 10 k 1 + 1 < 2 36 · n 2 36 · n r 2 q + 35 · 10 k 1 + e v e n < 2 36 · n r double : 2 q + 63 · 10 k 1 + e v e n 2 q + 63 · 10 k 1 + 1 < 2 64 · n 2 64 · n r 2 q + 63 · 10 k 1 + e v e n < 2 64 · n r
Therefore, when 2 1 · 2 q · 10 k 1 < n , the condition (116) can be used to determine that o n e 0 .
(2)When 2 1 · 2 q · 10 k 1 > n , there must exist o n e = 0 , and there is:
float : 2 1 · 2 q · 10 k 1 n = n 2 33 , 1 2 29 double : 2 1 · 2 q · 10 k 1 n = n 2 62 , 1 2 63
Therefore, the following exists:
float : 2 q + 35 · 10 k 1 2 36 · n 2 3 , 2 36 2 7 double : 2 q + 63 · 10 k 1 2 64 · n 4 , 2 64 2
When a = 2 q + 35 · 10 k 1 and b = 2 36 · n or a = 2 q + 63 · 10 k 1 and b = 2 64 · n , from Equation (119), the following exists:
float : 2 q + 35 · 10 k 1 2 36 · n > 2 q + 35 · 10 k 1 2 36 · n 1 double : 2 q + 63 · 10 k 1 2 64 · n > 2 q + 63 · 10 k 1 2 64 · n 1
From Equation (124), we have:
float : 2 q + 35 · 10 k 1 2 36 · n > 2 3 1 0 double : 2 q + 63 · 10 k 1 2 64 · n > 4 1 0
Therefore, there is:
float : 2 q + 35 · 10 k 1 + e v e n 2 q + 35 · 10 k 1 > 2 36 · n + 1 2 36 · n r 2 q + 35 · 10 k 1 + e v e n > 2 36 · n r double : 2 q + 63 · 10 k 1 + e v e n 2 q + 63 · 10 k 1 > 2 64 · n + 1 2 64 · n r 2 q + 63 · 10 k 1 + e v e n > 2 64 · n r
Therefore, when 2 1 · 2 q · 10 k 1 > n , the condition (116) can be used to determine that o n e = 0 .
(3)When 2 1 · 2 q · 10 k 1 < 1 n , there must exist o n e 10 , and there is:
float : 2 1 · 2 q · 10 k 1 + n = n + 2 33 , 1 2 29 double : 2 1 · 2 q · 10 k 1 + n = n + 2 62 , 1 2 63
Therefore, the following exists:
float : 2 q + 35 · 10 k 1 + 2 36 · n 2 3 , 2 36 2 7 double : 2 q + 63 · 10 k 1 + 2 64 · n 4 , 2 64 2
Suppose there are two real numbers a and b, and the following relationship must exist:
a 1 < a a b 1 < b b a + b 2 < a + b a + b
When a = 2 q + 35 · 10 k 1 and b = 2 36 · n or a = 2 q + 63 · 10 k 1 and b = 2 64 · n , the following exists:
float : 2 q + 35 · 10 k 1 + 2 36 · n 2 q + 35 · 10 k 1 + 2 36 · n double : 2 q + 63 · 10 k 1 + 2 64 · n 2 q + 63 · 10 k 1 + 2 64 · n
From Equation (129), we have:
float : 2 q + 35 · 10 k 1 + 2 36 · n < 2 36 2 7 double : 2 q + 63 · 10 k 1 + 2 64 · n < 2 64 2
Therefore, there is:
float : 2 q + 35 · 10 k 1 + e v e n 2 q + 35 · 10 k 1 + 1 < 2 36 2 2 36 · n < 2 36 1 2 36 · n r 2 q + 35 · 10 k 1 + e v e n < 2 36 1 2 36 · n r double : 2 q + 63 · 10 k 1 + e v e n 2 q + 63 · 10 k 1 + 1 < 2 64 2 2 64 · n < 2 64 1 2 64 · n r 2 q + 63 · 10 k 1 + e v e n < 2 64 1 2 64 · n r
Therefore, when 2 1 · 2 q · 10 k 1 < 1 n , the condition (116) can be used to determine that o n e 10 .
(4)When 2 1 · 2 q · 10 k 1 > 1 n , there must exist o n e = 10 , and there is:
float : 2 1 · 2 q · 10 k 1 + n = n + + 1 1 + 2 33 , 2 2 29 double : 2 1 · 2 q · 10 k 1 + n = n + + 1 1 + 2 62 , 2 2 63
Therefore, the following exists:
float : 2 q + 35 · 10 k 1 + 2 36 · n 2 3 + 2 36 , 2 37 2 7 double : 2 q + 63 · 10 k 1 + 2 64 · n 4 + 2 64 , 2 65 2
When a = 2 q + 35 · 10 k 1 and b = 2 36 · n or a = 2 q + 63 · 10 k 1 and b = 2 64 · n , from Equation (130), the following exists:
float : 2 q + 35 · 10 k 1 + 2 36 · n > 2 q + 35 · 10 k 1 + 2 36 · n 2 double : 2 q + 63 · 10 k 1 + 2 64 · n > 2 q + 63 · 10 k 1 + 2 64 · n 2
From Equation (135), we have:
float : 2 q + 35 · 10 k 1 + 2 36 · n > 2 36 + 2 3 2 > 2 36 double : 2 q + 63 · 10 k 1 + 2 64 · n > 2 64 + 2 2 > 2 64
Therefore, there is:
float : 2 q + 35 · 10 k 1 + e v e n 2 q + 35 · 10 k 1 > 2 36 2 36 · n > 2 36 1 2 36 · n r 2 q + 35 · 10 k 1 + e v e n > 2 36 1 2 36 · n r double : 2 q + 63 · 10 k 1 + e v e n 2 q + 63 · 10 k 1 > 2 64 2 64 · n > 2 64 1 2 64 · n r 2 q + 63 · 10 k 1 + e v e n > 2 64 1 2 64 · n r
Therefore, when 2 1 · 2 q · 10 k 1 > 1 n , the condition (116) can be used to determine that o n e = 10 .
From the above proof, it can be seen that when condition (62) is met, the condition (116) can be used to determine whether o n e = 0 or o n e = 10 when 2 1 · 2 q · 10 k 1 = n or 2 1 · 2 q · 10 k 1 = 1 n . When 2 1 · 2 q · 10 k 1 > n or 2 1 · 2 q · 10 k 1 > 1 n , the condition (116) can be used to determine whether o n e = 0 or o n e = 10 . When 2 1 · 2 q · 10 k 1 < n or 2 1 · 2 q · 10 k 1 < 1 n , the condition (116) can be used to determine whether o n e 0 or o n e 10 .
The proof process of this section is completed. In the code implementation, the two judgment conditions can be quickly calculated using addition and subtraction shift operations, and can be compiled by the compiler into cmov instructions, thereby reducing the impact of branch prediction failure on performance.
Ultimately, we reached the following conclusion: o n e can quickly determine whether o n e equals 0 or 10 by using the following method.
float : 2 q + 35 · 10 k 1 + e v e n > 2 36 · n r o n e = 0 2 q + 35 · 10 k 1 + e v e n > 2 36 1 2 36 · n r o n e = 10 double : 2 q + 63 · 10 k 1 + e v e n > 2 64 · n r o n e = 0 2 q + 63 · 10 k 1 + e v e n > 2 64 1 2 64 · n r o n e = 10

3.7. Efficient Computation of 10 n and Rounding

Determine whether o n e is 10 n or 10 n + 1 based on the decimal part of 10 n . There are two cases: the decimal part of 10 n is 0.5 and it is not 0.5.

3.7.1. 10 n 10 n = 0.5

When the decimal part of 10 n is 0.5, there must be:
10 n 10 n = 0.5 10 · c · 2 q · 10 k 1 10 · c · 2 q · 10 k 1 = 0.5 c · 2 q · 10 k c · 2 q · 10 k = 0.5 c · 2 q · 10 k = c · 2 q · 10 k + 0.5 2 c · 2 q · 10 k = 2 c · 2 q · 10 k + 1
So 2 c · 2 q · 10 k is an odd number. Then the following expression is odd:
c · 2 q + 1 · 10 k = c · 2 q k + 1 · 5 k
According to the range of q, there are:
c · 2 q + 1 · 10 k = c · 2 q k + 1 5 k ; q 0 c · 2 · 5 k ; q = 1 c · 5 k 2 k q 1 ; q 2
According to the range of q, the following situations are discussed:
  • q 0
    When q 0 , it can be concluded that q k + 1 1 , the numerator c · 2 q k + 1 is even and the denominator 5 k is odd, which does not meet the condition.
  • q = 1
    When q = 1 , it can be concluded that c · 2 · 5 k is even, which does not meet the condition.
  • q 2
    5 k is an odd number. c is an odd multiple of 2 k q 1 . So:
    float : c 2 k q 1 k q 1 22 q 34 double : c 2 k q 1 k q 1 51 q 75
    Therefore, when q meets the above conditions, c must be an odd multiple of 2 k q 1 to meet the condition. Therefore, when the following conditions are met, expression (141) is an odd number:
    float : 34 q 2 & & c % 2 k q = 2 k q 1 double : 75 q 2 & & c % 2 k q = 2 k q 1
    When q is within the above range (144), r = 1 is derived from Equation (28).Therefore, there is:
    n r = n
    The following equation holds:
    20 m + 20 n = c · 2 q · 10 k + 1 = c · 2 q k + 1 · 5 k = c 2 k q 1 · 5 k
    Since k 1 , 5 k is multiple of 5 and is an odd number. Since c 2 k q 1 and 5 k are both odd numbers, 20 m is an even number, 20 n is multiple of 5 and is an odd number. Therefore, there is:
    20 n 5 , 15 n 0.25 , 0.75 n r 0.25 , 0.75
    The result of o n e is an even number between 10 n and 10 n + 1 . Therefore:
    o n e = 10 n = 2 , if n = 0.25 10 n + 1 = 8 , if n = 0.75 o n e = 20 n + 1 / / 2 ( n = 0.25 ? 1 : 0 )

3.7.2. 10 n 10 n 0.5

When the decimal part of 10 n is not 0.5, round to the nearest integer value based on the decimal part of 10 n . Therefore, there is:
o n e = 10 n , if 10 n 10 n < 0.5 10 n + 1 , if 10 n 10 n > 0.5 o n e = 10 n + 0.5 = 20 n + 1 / / 2
Since 20 n + 1 = 20 n + 1 , it is only necessary to accurately calculate the value of 20 n . And, there is:
d = t e n + o n e = 10 m + 20 n + 1 / / 2 = ( 20 m + 20 n + 1 ) / / 2
Suppose there are:
20 m + 20 n = c · 2 q + 1 · 10 k = c · 2 q k + 1 · 5 k = c · x y
Suppose the decimal part of 20 n is n 20 .
When y c max = C , the range of the decimal part must include:
float : 1 2 24 1 = 1 C n 20 1 1 C = 2 24 2 2 24 1 double : 1 2 53 1 = 1 C n 20 1 1 C = 2 53 2 2 53 1
When y > c max = C , the range of the decimal part must include(the test file is test5.py):
float : 2 32 < n 20 < 1 2 30 double : 2 64 < n 20 < 1 2 62
Therefore, the range of n 20 satisfies Equation (153). In the code implementation, for float, only the high 36 bits of n r are retained, and for double, only the high 70 bits of n r are retained. Suppose the discarded part of a float is represented as n 36 , and similarly, the discarded part of a double is represented as n 70 . Therefore, there is:
float : n 36 0 , 2 36 double : n 70 0 , 2 70
Calculate the boundary conditions of the following expression:
float : F = 20 · c · 2 q · r · 10 k 1 n 36 double : F = 20 · c · 2 q · r · 10 k 1 n 70
Therefore, there is:
float : F min > 20 · c · 2 q · 10 k 1 2 36 = 20 m + 20 n 20 · 2 36 F max < 20 · c · 2 q · 1 + 2 63 · 10 k 1 0 < 20 m + 20 n + 20 · 2 63 · c < 20 m + 20 n + 1 double : F min > 20 · c · 2 q · 10 k 1 2 70 = 20 m + 20 n 20 · 2 70 > 20 m + 20 n F max < 20 · c · 2 q · 1 + 2 127 · 10 k 1 0 < 20 m + 20 n + 20 · 2 127 · c < 20 m + 20 n + 1
Therefore, there is:
float : F = 20 m + 20 n double : F = 20 m + 20 n
In fact, in the above proof process, for float, F m i n 20 m + 20 n may exist, but the code implementation has passed the exhaustive test, so this not-so-perfect proof process can be ignored. Therefore, the calculation of d can be simplified as follows:
d = t e n + o n e = ( F + 1 ) / / 2 = ( 20 · ( c · 2 q · r · 10 k 1 n x ) + 1 ) / / 2
For the float range, n x = n 36 ; for the double range, n x = n 70 .

3.7.3. Efficient Implementation of n = 0.25 for Double

For double, quickly determine that n = = 0.25 in Equation (148).
When n = 0.25 , 2 64 · n r = 2 64 · n = 2 62 . Therefore, the following condition can be used to quickly determine whether n = 0.25 :
double : n = 0.25 if 2 64 · n r = 2 62
When n 0.25 ,Calculate the range of the decimal part of the following expression:
4 m + 4 n = c · 2 q + 2 · 10 k 1
Therefore, when Equation (160) is not an integer, we have(the test file is test6.py):
2 62 < 4 n 4 n < 1 2 62
Calculate the two boundary cases of 4 n that are closest to 1:
4 n = 0 4 n 0 < 1 2 62 2 64 · n 2 62 2 4 n = 1 4 n 1 > 2 62 2 64 · n 2 62 + 1
Then there are:
2 64 · n 2 62 & & 2 64 · n + 1 2 62 2 64 · n r 2 62
Therefore, the following condition can be used to quickly determine whether n 0.25 :
double : n 0.25 if 2 64 · n r 2 62
In summary, for double, the following condition can be used to quickly determine whether n = 0.25 :
double : n = 0.25 if 2 64 · n r = 2 62 double : n 0.25 if 2 64 · n r 2 62

3.7.4. Efficient Calculation of o n e for Double

In the double range, introduce another faster way to calculate o n e :
double : o n e = 2 64 · n r 2 64 · 10 + n = 0.25 ? 0 : 2 1 + 6 2 64
The proof of Equation (166) is as follows:
when n = 0.25 , 2 64 · n r 2 64 · 10 = 10 n = 2 ;
when n 0.25 , Equation (166) can be equivalent to the following:
double : o n e = 2 64 · n r 2 64 · 10 + 2 1 + 6 2 64
According to the 10 n 10 n range, o n e is represented as:
double : o n e = 10 n , if 10 n 10 n < 0.5 8 , if 10 n 10 n = 0.5 10 n + 1 , if 10 n 10 n > 0.5 = 20 n + 1 / / 2
Therefore,when n 0.25 , we need to prove that the following equation holds:
2 64 · n r 2 64 · 10 + 2 1 + 6 2 64 = 10 n , if 10 n 10 n < 0.5 8 , if 10 n 10 n = 0.5 10 n + 1 , if 10 n 10 n > 0.5 = 20 n + 1 / / 2
From the range of n, there is:
2 64 · n r 2 64 n r 2 64 , n r
Because the following conditions exist:
c · 2 q · 10 k 1 = m + n c · 2 q · r · 10 k 1 = m + n r
Therefore, the following relationship can be concluded:
n r n = r 1 · c · 2 q · 10 k 1 n r = r 1 · m + n + n n n r < 2 127 · c + n n n r < 2 127 · 2 53 + n n n r < 2 74 + n
From Equation (170) and (172), it can be concluded that:
2 64 · n r 2 64 n 2 64 , n + 2 74 2 64 · n r 2 64 · 10 10 n 10 · 2 64 , 10 n + 10 · 2 74 2 64 · n r 2 64 · 20 20 n 20 · 2 64 , 20 n + 20 · 2 74 2 64 · n r 2 64 · 20 20 n + n 20 20 · 2 64 , 20 n + n 20 + 20 · 2 74
Discuss the range of values of x when the following conditions are met.
2 64 · n r 2 64 · 20 + 1 + x / / 2 = 20 n + 1 / / 2 = o n e
Therefore, the following conclusions can be drawn:
20 n + n 20 20 · 2 64 + 1 + x 20 n + 1 x 20 · 2 64 n 20 20 n + n 20 + 20 · 2 74 + 1 + x < 20 n + 2 x < 1 20 · 2 74 n 20
Suppose x = 12 · 2 64 . Through the exhaustive method, all floating-point numbers that do not meet the following conditions can be obtained.
x = 12 · 2 64 20 · 2 64 n 20
All floating-point numbers that do not meet condition (176) are as follows (in hexadecimal) :
0 x d 17 c 0747 b d 76 f a 1 , 0 x d 27 c 0747 b d 76 f a 1 , 0 x 4 d 73 d e 005 b d 620 d f , 0 x 4 d 83 d e 005 b d 620 d f , 0 x 4 d 93 d e 005 b d 620 d f ,
Through the exhaustive method, all floating-point numbers that do not meet the following conditions can be obtained.
x = 12 · 2 64 < 1 20 · 2 74 n 20
All floating-point numbers that do not meet condition (178) are as follows (in hexadecimal) :
0 x 612491 d a a d 0 b a 280 , 0 x 6159 b 651584 e 8 b 20 , 0 x 619011 f 2 d 73116 f 4 , 0 x 61 c 4166 f 8 c f d 5 c b 1 , 0 x 61 d 4166 f 8 c f d 5 c b 1 ,
There are:
2 ( 2 64 · n r 2 64 · 10 + 2 1 + 6 2 64 ) = 2 64 · n r 2 64 · 20 + 1 + x
When the floating-point number is not within the above range (177) and (179), the condition (175) is satisfied. We have tested all floating-point numbers within the above-mentioned range (177) and (179), and the algorithm implementation code has output the correct result, that is, it satisfies the SW principle. The test process file is test8.py.
In summary, Equation (169) and Equation (166) holds. Therefore, Equation (166) can be used to quickly calculate o n e .

3.8. Irregular Number

Due to the limited and small number of irregular floating-point numbers, there are a total of 2046 double floating-point numbers and 254 float floating-point numbers. The correctness of the algorithm code in this paper can be proved by the exhaustive method. Therefore, it is not introduced in this article. For the specific implementation process, please refer to the source code.

3.9. Implementation of Pseudocode

This subsection details the pseudocode implementation for converting regular floating-point numbers to decimal representation. The handling of subnormal (irregular) floating-point numbers is omitted here due to space constraints; interested readers may refer to the source code for the complete implementation.

3.9.1. Single-Precision Floating-Point Numbers

uint 32 v i = bit _ copy _ from _ f 32 ( v ) ( uint 64 c , int 32 q ) = extract ( v i ) const int BIT = 36 const uint 64 offset = ( 1 ULL ( BIT 2 ) ) 7 int 32 k = ( q · 1233 ) 12 int 32 h = q + ( ( 1701 · ( k 1 ) ) 9 ) uint 64 pow 10 = get _ pow 10 ( k 1 ) uint 64 c b = c ( h + BIT + 1 ) uint 64 hi 64 = umul _ hi 64 × 64 ( c b , pow 10 ) bool even = ( v i + 1 ) & 1 uint 64 half = ( pow 10 ( 65 ( h + BIT + 1 ) ) ) + even uint 32 shorter = ( hi 64 + half ) BIT uint 64 bias = ( hi 64 ( BIT 4 ) ) & 15 uint 32 longer = ( 5 · hi 64 + offset + bias ) ( BIT 1 ) bool updown = shorter > ( ( hi 64 half ) BIT ) uint 32 d = updown ? shorter · 10 : longer
Algorithm 181 outlines the procedure for computing d and k for single-precision floating-point numbers.
Since c contains at most 23 significant bits and pow 10 has 64 significant bits, their product contains at most 87 significant bits. Given that the range of h is [ 4 , 1 ] , the value c b = c ( h + BIT + 1 ) does not overflow, preserving all bits of c. When e 10 = k 1 , pow 10 is computed using Equation (24). From Equation (25), we derive:
pow 10 · 2 ( k 1 ) · log 2 10 63 = 10 k 1 · r 1 , k 1
From Equation (51):
m = v · 10 k 1 = v · 10 k 1 · r 1 , k 1 = v · pow 10 · 2 ( k 1 ) · log 2 10 63 = c · pow 10 · 2 q + ( k 1 ) · log 2 10 63 = ( c · pow 10 ) ( 63 q ( k 1 ) · log 2 10 ) = ( c · pow 10 ) ( 63 h ) = ( ( c ( 37 + h ) ) · pow 10 ) ( 36 + 64 ) = hi 64 36
Let up indicate whether o n e = 10 . From Equations (138) and (105):
bool up = 2 q + 35 · 10 k 1 + even > 2 36 1 2 36 · n r = 2 q + 35 · 10 k 1 + even + 2 36 · n r 2 36 = ( pow 10 ( 28 h ) ) + even + 2 36 · n r 2 36 = ( half + 2 36 · n r ) 36
When o n e 10 , we have d / / 10 = m ; when o n e = 10 , we have d / / 10 = m + 1 . Therefore:
d / / 10 = m + up
The upper 28 bits of hi 64 represent m, while the lower 36 bits represent 2 36 · n r :
hi 64 = ( m 36 ) + 2 36 · n r
Consequently:
shorter = d / / 10 = ( hi 64 + half ) 36
Let updown indicate whether o n e { 0 , 10 } , equivalently whether the last digit of d is zero:
updown = ( d mod 10 = 0 )
For brevity, let dot _ one = 2 36 · n r . From Equations (134) and (140):
o n e = 0 : half > dot _ one o n e = 10 : half > 2 36 1 dot _ one
The following equations hold:
o n e = 0 : ( hi 64 half ) 36 = m 1 o n e = 10 : ( hi 64 half ) 36 = m
Therefore:
o n e = 0 : shorter = m > ( hi 64 half ) 36 = m 1 o n e = 10 : shorter = m + 1 > ( hi 64 half ) 36 = m
Thus, the condition for o n e { 0 , 10 } is:
shorter > ( hi 64 half ) 36 d mod 10 = 0
When updown is true, d = 10 · shorter . Otherwise, we compute d using Equation (158). When 10 n 10 n 0.5 :
d = ( 20 · ( v · 10 k 1 · r n 36 ) + 1 ) / / 2 = ( 20 · ( hi 64 · 2 36 ) + 1 ) / / 2 = ( 5 · hi 64 · 2 34 + 1 ) / / 2 = ( ( 5 · hi 64 + 2 34 ) · 2 34 ) / / 2 = ( ( 5 · hi 64 + 2 34 ) 34 ) / / 2 = ( 5 · hi 64 + 2 34 ) 35
Adding 2 34 serves as a rounding operation. When 10 n 10 n = 0.5 , Equation (148) requires checking whether n = 0.25 . The derivation of longer involves careful handling of edge cases; the correctness of this computation has been verified through exhaustive testing.

3.9.2. Double-Precision Floating-Point Numbers

uint 64 v i = bit _ copy _ from _ f 64 ( v ) ( uint 64 c , int q ) = extract ( v i ) const int BIT = 6 int k = ( q · 78913 ) 18 int h = q + ( ( 217707 · ( k 1 ) ) 16 ) uint 128 pow 10 = get _ pow 10 ( k 1 ) uint 64 c b = c ( h + BIT + 1 ) uint 128 hi 128 = umul _ hi 64 × 128 ( c b , pow 10 ) bool even = ( v i + 1 ) & 1 uint 64 half = ( pow 10 ( 64 h ) ) + even uint 64 dot _ one = ( uint 64 ) ( hi 128 BIT ) uint 64 ten = 10 · ( hi 128 ( BIT + 64 ) ) uint 64 offset _ num = ( dot _ one = = 2 62 ) ? 0 : 2 63 + 6 uint 64 o n e = ( ( uint 128 ) 10 · dot _ one + offset _ num ) 64 o n e = ( half > dot _ one ) ? 0 : o n e o n e = ( half > 2 64 1 dot _ one ) ? 10 : o n e uint 64 d = ten + o n e
Algorithm 194 presents the corresponding procedure for double-precision floating-point numbers. pow 10 is computed using Equation (26), satisfying:
pow 10 · 2 ( k 1 ) · log 2 10 127 = 10 k 1 · r 1 , k 1
From Equation (51):
m = v · 10 k 1 = v · 10 k 1 · r 1 , k 1 = v · pow 10 · 2 ( k 1 ) · log 2 10 127 = c · pow 10 · 2 q + ( k 1 ) · log 2 10 127 = ( c · pow 10 ) ( 127 q ( k 1 ) · log 2 10 ) = ( c · pow 10 ) ( 127 h ) = ( ( c ( 7 + h ) ) · pow 10 ) ( 64 + 70 ) = hi 128 70
Given h [ 4 , 1 ] and c having at most 53 significant bits, c b does not overflow, ensuring accurate computation of ten = 10 · m . By definition, dot _ one = 2 64 · n r . From Equation (105):
half = ( pow 10 ( 64 h ) ) + even = 2 63 · 2 q · 10 k 1 + even
The value of o n e is first computed using Equation (166), then adjusted if the conditions for o n e = 0 or o n e = 10 are met:
o n e = 2 64 · n r 2 64 · 10 + ( n = 0.25 ) ? 0 : ( 2 1 + 6 2 64 ) = ( dot _ one · 10 + ( dot _ one = = 2 62 ? 0 : 2 63 + 6 ) ) 64
An equivalent formulation is:
o n e = ( dot _ one = = 2 62 ) ? 2 : ( dot _ one · 10 + 2 63 + 6 ) 64
The conditions o n e = 0 and o n e = 10 are determined by Equations (127) and (138), respectively.
In the C/C++ implementation, the only branch occurs when computing c and q for subnormal floating-point numbers (marked as unlikely ). The conversion of normal floating-point numbers is branch-free, eliminating branch misprediction penalties. As shown in Algorithms 181 and 194, the core algorithm is remarkably concise. Irregular numbers are handled in a separate branch—a worthwhile trade-off given their rarity.

3.10. Decimal to String Conversion

The floating-point number printing algorithm consists of two main phases. While the first phase, which computes the decimal digits d and exponent k, has been discussed in previous sections, this section focuses on the second phase: converting the computed decimal representation into a string.
Floating-point numbers are typically printed in two formats: fixed-point notation and scientific notation. Table (Table 3) illustrates examples of both formats.
Having computed d and k, we now convert the floating-point number to a string based on these values. According to the Schubfach algorithm, d may contain trailing zeros, meaning d mod 10 could be zero. In order to reduce instruction dependencies. We decompose d into two parts: d / / 10 and d mod 10 . Let u p indicate whether o n e equals 10, and u p d o w n represent the cases where o n e is either 0 or 10. The following relationships hold:
d / / 10 = m + u p u p d o w n = ( o n e = 0 o n e = 10 ) = ( d mod 10 = 0 ) d = 10 · ( m + u p ) + ( u p d o w n ? 0 : o n e )
By calculating the approximate range of m, we obtain:
m = c · 2 q · 10 k 1 c · 2 q · 10 k 1 [ 0.1 · c , c ) 0.1 · c min 0.1 · c m c c max
Based on the range of c, we derive:
float : normal : [ 2 23 + 1 10 , 2 24 1 ] = [ 838860.9 , 16777215 ] subnormal : [ 1 10 , 2 23 1 ] = [ 0.1 , 8388607 ] double : normal : [ 2 52 + 1 10 , 2 53 1 ] = [ 4.5 × 10 14 , 9 × 10 15 ] subnormal : [ 1 10 , 2 52 1 ] = [ 0.1 , 4.5 × 10 15 ]
Let l e n 10 ( x ) denote the number of decimal digits in x. For x > 0 :
l e n 10 ( x ) = log 10 ( x ) + 1
We define l e n 10 ( 0 ) = 0 . For example, l e n 10 ( 12345 ) = 5 . Consequently:
float : normal : l e n 10 ( m + u p ) [ 6 , 8 ] subnormal : l e n 10 ( m + u p ) [ 0 , 7 ] double : normal : l e n 10 ( m + u p ) [ 15 , 16 ] subnormal : l e n 10 ( m + u p ) [ 0 , 16 ]
Since d / / 10 = m + u p , we have l e n 10 ( d ) = l e n 10 ( m + u p ) + 1 . Let t z 10 ( x ) denote the number of trailing zeros in the decimal representation of x; for example, t z 10 ( 12300 ) = 2 . The number of significant digits s i g 10 ( x ) equals l e n 10 ( x ) t z 10 ( x ) . When u p d o w n = 0 , o n e 0 , thus d mod 10 = o n e 0 and t z 10 ( d ) = 0 . Therefore:
s i g 10 ( d ) = u p d o w n ? l e n 10 ( d ) t z 10 ( d ) : l e n 10 ( d ) float : s i g 10 ( d ) [ 1 , 9 ] double : s i g 10 ( d ) [ 1 , 17 ]
For normal floating-point numbers, l e n 10 ( d ) can be computed as:
float : l e n 10 ( d ) = l e n ( m + u p ) + 1 = 9 [ m + u p < 10 7 ] [ m + u p < 10 6 ] double : l e n 10 ( d ) = l e n ( m + u p ) + 1 = 16 + [ ( m + u p ) 10 15 ]
where [ P ] denotes the Iverson bracket, which evaluates to 1 if predicate P is true and 0 otherwise.
Computing the ASCII representation of d reduces to computing the ASCII codes of m + u p and o n e . From the above, we have:
float : m + u p < 10 8 double : m + u p < 10 16
These bounds allow efficient computation of the ASCII code for m + u p using CPU SIMD instruction sets.

Scalar Implementation

We first present a scalar method that does not rely on SIMD instructions.
Let x [ 0 , 10 8 ) and d e c _ t o _ a s c i i 8 ( x ) be a function that computes the ASCII representation of x. Algorithm (2) describes this process. This version is an optimized version of the itoa algorithm written by Paul Khuong [18].
Algorithm 2:Convert an 8-digit decimal number to ASCII: dec_to_ascii8(x)
Input: 
X (type: uint64, range: [0, 10 8 ))
Output: 
Y (type: uint64), tz (type: uint)
1:
uint64 abcd_efgh = X + (0x100000000 - 10000) * ((X * 0x68db8bbULL) > > 40)
2:
uint64 ab_cd_ef_gh = abcd_efgh + (0x10000 - 100) * (((abcd_efgh * 0x147b) > > 19) & 0x7f0000007f)
3:
uint64 a_b_c_d_e_f_g_h = ab_cd_ef_gh + (0x100 - 10) * (((ab_cd_ef_gh * 0x67) > > 10) & 0xf000f000f000f)
4:
uint tz = __builtin_ctzll(a_b_c_d_e_f_g_h) > > 3
5:
uint64 BCD = CPU_IS_LITTLE_ENDIAN ? byteswap(a_b_c_d_e_f_g_h) : a_b_c_d_e_f_g_h
6:
uint64 Y = BCD + 0x3030303030303030
return Y, tz
The function simultaneously computes t z 10 ( x ) as the return value t z . Example: X = 12345600, Y="12345600", tz=2.
Algorithm 3:Convert a 16-digit decimal number to ASCII: dec_to_ascii16(x)
Input: 
X (type: uint64, range: [0, 10 16 ))
Output: 
Y (type: uint128), tz (type: uint)
1:
uint64 abcdefgh = X // 10 8
2:
uint64 ijklmnop = X mod 10 8
3:
(uint64 abcdefgh_ascii, uint tz1) = dec_to_ascii8(abcdefgh)
4:
(uint64 ijklmnop_ascii, uint tz2) = dec_to_ascii8(ijklmnop)
5:
uint128 Y = CPU_IS_LITTLE_ENDIAN ? (ijklmnop_ascii << 64) + abcdefgh_ascii : (abcdefgh_ascii << 64) + ijklmnop_ascii
6:
uint tz = (ijklmnop == 0) ? 8 + tz1 : tz2
return Y, tz
For x [ 0 , 10 16 ) , let d e c _ t o _ a s c i i 16 ( x ) compute the ASCII representation of x. Algorithm (3) describes this process. Example: X = 123456001234500, Y="1234560012345600", tz=2.

Handling Undefined Behavior

Since _ _ b u i l t i n _ c t z l l ( 0 ) is undefined behavior in C, special handling is required. For single-precision floating-point numbers, when m + u p = 0 (the six smallest subnormal numbers), u p d o w n = 0 , avoiding the undefined case. For double-precision numbers, m + u p = 0 only occurs for 5 × 10 324 (the smallest subnormal number). In our implementation, we handle this special case separately at the function entry.

SIMD Implementation

The SIMD implementations of d e c _ t o _ a s c i i 8 and d e c _ t o _ a s c i i 16 follow the same principles as the scalar version. For ARM64 architecture, we use the NEON instruction set; for x64 architecture, we support three variants: AVX512, SSE4.1, and SSE2. Table (Table 4) summarizes these implementations.
Using the methods above, we compute the ASCII representation of m + u p and the number of significant digits. Printing d is equivalent to printing m + u p and o n e . Based on s i g 10 ( d ) , we determine which buffer contents to retain. In our implementation, we adopt different formats for floating-point numbers in different ranges to enhance readability, as shown in Table (Table 5).
In scientific notation, the result includes an exponent. For example, in “1.2e-02”, “e-02” represents the exponent. Converting d · 10 k to standard decimal scientific notation yields:
d · 10 k = d · 10 log 10 ( d ) · 10 k + log 10 ( d )
Since d · 10 log 10 ( d ) [ 1 , 10 ) , the exponent is determined by k + log 10 ( d ) . Let E 10 = k + log 10 ( d ) . Then:
E 10 = k + log 10 ( d ) = k + log 10 ( m + u p ) + 1
For single-precision floating-point numbers, fixed-point notation is used when E 10 [ 3 , 6 ] . For double-precision numbers, fixed-point notation is used when E 10 [ 4 , 15 ] .
The floating-point number printing algorithm proposed in this paper is illustrated as Algorithm 4. The key distinction between our algorithm and others is the use of SIMD instruction sets for converting m + u p to ASCII values. For single-precision numbers, we use d e c _ t o _ a s c i i 8 ; for double-precision, we use d e c _ t o _ a s c i i 16 . The detailed implementation is available in the source code.
Algorithm 4:Floating-point number printing algorithm
Input: 
v (type: float/double), buffer (type: char*)
Output: 
buffer (type: char*)
1:
compute m + u p , u p d o w n , o n e , k from v
2:
compute s i g 10 ( d ) = u p d o w n ? l e n 10 ( m + u p ) t z 10 ( m + u p ) : l e n 10 ( m + u p ) + 1
3:
compute E 10 = k + log 10 ( m + u p ) + 1
4:
convert m + u p , o n e to ASCII codes and store in buffer
5:
based on s i g 10 ( d ) and E 10 , determine which buffer contents to retain
6:
if result is in scientific notation, print E 10
return buffer
In our C implementation, all branches are designed as unlikely branches to minimize the impact of branch prediction failures. By transforming the computation of d into computing m + u p and o n e , we reduce instruction dependencies and enhance instruction-level parallelism. Unlike other algorithms that compute the exact value of d before printing, we avoid computing d directly and instead quickly compute d / / 10 = m + u p .

3.11. Summary

This chapter explains how to quickly calculate d and k, as well as the printing optimization. Since d = 10 m + o n e , the process of calculating d is transformed into calculating m and o n e .
  • Section 3.5 introduces the method for quickly calculating m.
  • Section 3.6 and Section 3.7 introduce the methods for quickly calculating o n e .
  • Section 3.9 provides a detailed description of the pseudo-code implementation.
  • Section 3.10 discusses print optimization.
Table 6. All algorithms in the benchmark test.
Table 6. All algorithms in the benchmark test.
algorithm float double description
Schubfach Schubfach32 Schubfach64 author:Raffaello Giulietti,https://github.com/c4f7fcce9cb06515/Schubfach.
Schubfach_xjb Schubfach32_xjb Schubfach64_xjb It is improved by Schubfach and has the same output result.
Ryu Ryu32 Ryu64 author:Ulf Adams,https://github.com/ulfjack/ryu.
Dragonbox Dragonbox32 Dragonbox64 author:Junekey Jeon,https://github.com/jk-jeon/Dragonbox.
fmt [22] fmt32 fmt64 author:Victor Zverovich,https://github.com/fmtlib/fmt version:12.1.0
yy_double - yy_double author:GuoYaoYuan,link:yy_double.
yy_json [23] yy_json32 yy_json64 author:Guo YaoYuan,https://github.com/ibireme/yyjson version:0.12.0
teju_jagua [24] teju32 teju64 author:Cassio Neri,https://github.com/cassioneri/teju_jagua.
xjb xjb32 xjb64 this paper,https://github.com/xjb714/xjb.
zmij zmij32 zmij64 author:Victor Zverovich,https://github.com/vitaut/zmij.
jnum [25] jnum32 jnum64 author:Jing Leng,https://github.com/lengjingzju/json/jnum.c.
uscalec - uscalec author:Russ Cox,src:https://github.com/rsc/fpfmt commit 6255750 (19 Jan 2026).

4. Experimental Evaluation

This section presents a comprehensive experimental evaluation of the xjb algorithm, comparing its performance against state-of-the-art floating-point to string conversion algorithms across multiple hardware platforms and compilers.

4.1. Correctness Verification

Before conducting performance benchmarks, we verified the correctness of the xjb algorithm through extensive testing:
  • Single-precision (float): We tested all possible input values ( 2 32 values) and verified that the output matches the Schubfach algorithm, ensuring complete correctness for the binary32 format.
  • Double-precision (double): Due to the impracticality of exhaustively testing all 2 64 possible values, we conducted comprehensive random testing with a large sample size to verify correctness for the binary64 format.
All test results confirmed that xjb produces output identical to Schubfach while satisfying the SW principle.

4.2. Experimental Setup

4.2.1. Hardware Platforms

We conducted benchmarks on two representative hardware platforms:
  • AMD Ryzen 7 7840H: A modern x86-64 processor with support for AVX2 and AVX-512 instruction sets, running Ubuntu 26.04.
  • Apple M1: A representative ARM-based processor with NEON SIMD support, running macOS 26.4.

4.2.2. Compilers and Compilation Flags

  • AMD R7-7840H: Intel C++ Compiler (icpx) version 2025.0.4
  • Apple M1: Apple Clang version 21.0.0
All benchmarks were compiled with the optimization flags -O3 -march=native to enable maximum compiler optimizations and architecture-specific instruction generation.

4.2.3. Benchmark Methodology

The benchmark methodology was designed to provide fair and reproducible performance measurements:
1.
Generate 2 24 random floating-point numbers, excluding special values (0, NaN, and Inf).
2.
Measure the total time required to convert all numbers to decimal representation.
3.
Calculate the average conversion time per floating-point number.
This methodology ensures statistical significance while avoiding the overhead of special-case handling that would skew results.

4.3. Algorithms Compared

We compared xjb against the following state-of-the-art algorithms:
Table 7 and (Table 8) present the benchmark results for float-to-decimal and double-to-decimal conversions on AMD R7-7840H and Apple M1, respectively. Figure 1a–f presents the benchmark results for the conversion from float to string and from double to string. Figure 1a–d present the benchmark results when the input values are completely random. Since the significant digit length range of the printing results is from 1 to 17, comparative tests were also conducted for different effective lengths, as shown in Figure 1e,f.
  • Schubfach: The baseline algorithm from which xjb is derived.
  • Schubfach_xjb: A variant of Schubfach with xjb optimizations.
  • Ryu: A widely-used high-performance algorithm.
  • Dragonbox: Another state-of-the-art algorithm with competitive performance.
  • fmt: A modern formatting library.
  • yy_json / yy_double: Fast algorithms optimized for JSON serialization.
  • teju_jagua: A recent algorithm (note: only supports float/double to decimal conversion).
  • zmij: A modern algorithm with competitive performance.
  • uscalec: An algorithm supporting double-precision only.

4.4. Performance Results

4.5. Analysis and Discussion

4.5.1. Performance Comparison

The benchmark results demonstrate that xjb achieves superior performance compared to all other algorithms across both hardware platforms:
On AMD R7-7840H:
  • For float-to-decimal conversion, xjb achieves 2.24 ns, which is 1.98× faster than the second-fastest algorithm (Schubfach_xjb at 4.44 ns) and 5.44× faster than the baseline Schubfach (12.2 ns).
  • For double-to-decimal conversion, xjb achieves 3.76 ns, which is 1.27× faster than the second-fastest algorithm (yy_double at 5.24 ns) and 3.06× faster than the baseline Schubfach (11.51 ns).
  • For float-to-string conversion, xjb is about 60% faster than zmij.
  • For double-to-string conversion, xjb is about 8% faster than zmij.
On Apple M1:
  • For float-to-decimal conversion, xjb achieves 2.15 ns, which is 1.84× faster than the second-fastest algorithm (zmij at 4.11 ns) and 5.41× faster than the baseline Schubfach (11.64 ns).
  • For double-to-decimal conversion, xjb achieves 2.58 ns, which is 1.49× faster than the second-fastest algorithm (yy_double at 4.08 ns) and 5.08× faster than the baseline Schubfach (13.12 ns).
  • For float-to-string conversion, xjb is about 96% faster than zmij.
  • For double-to-string conversion, xjb is about 20% faster than zmij.

4.5.2. Performance Consistency

A key advantage of the xjb algorithm is its consistent performance across different input distributions. Since all branches in the xjb implementation are designed to be unlikely branches, the algorithm maintains a high branch prediction success rate regardless of input patterns. This property is particularly valuable in real-world applications where input distributions may vary unpredictably.

4.5.3. Cross-Platform Performance

The benchmark results demonstrate that xjb performs well across different processor architectures:
  • On x86-64 (AMD R7-7840H), xjb benefits from the compiler’s ability to generate efficient code for the algorithm’s arithmetic operations.
  • On ARM64 (Apple M1), xjb maintains its performance advantage, demonstrating the algorithm’s portability and effectiveness across different instruction set architectures.

4.5.4. Comparison with Related Algorithms

  • vs. Schubfach: xjb achieves 3–5× speedup over the baseline Schubfach algorithm, validating the effectiveness of our optimizations in reducing multiplication operations and minimizing branch prediction penalties.
  • vs. yy_json/yy_double: While these algorithms are highly optimized for JSON serialization workloads, xjb outperforms them by 1.3–1.8×, demonstrating broader applicability beyond specific use cases.
  • vs. zmij: xjb achieves 1.5–1.9× speedup over zmij, another recent high-performance algorithm, showing the benefits of our approach to instruction dependency reduction.
  • vs. Ryu and Dragonbox: xjb significantly outperforms these widely-used algorithms (2.5–6× faster), demonstrating that the optimizations in xjb provide substantial benefits over established approaches.

4.6. Summary

The experimental evaluation demonstrates that xjb achieves state-of-the-art performance for floating-point to decimal conversion across multiple hardware platforms. The algorithm’s key strengths include:
1.
Consistently superior performance across both x86-64 and ARM64 architectures.
2.
Significant speedups over the baseline Schubfach algorithm (3–5×).
3.
Competitive or superior performance compared to all other state-of-the-art algorithms.
4.
Stable performance regardless of input distribution due to effective branch prediction optimization.
These results validate the effectiveness of our optimization strategies: reducing multiplication operations, minimizing instruction dependencies, and designing branch patterns that work well with modern processor pipelines.

5. Conclusions

This paper presented xjb, a novel high-performance algorithm for converting IEEE 754 floating-point numbers to decimal string representations. Building upon the Schubfach algorithm, xjb introduces several key optimizations that significantly improve conversion speed while maintaining full compliance with the Steele-White principle for accurate and minimal-length output.

5.1. Summary of Contributions

The main contributions of this work are:
1.
Reduced computational complexity: By restructuring the calculation process and minimizing the number of high-precision multiplication operations, xjb achieves substantial performance improvements over the baseline Schubfach algorithm.
2.
Optimized branch structure: The algorithm employs unlikely branch patterns that enable efficient branch prediction on modern processors, resulting in consistent performance across diverse input distributions.
3.
Cross-platform portability: xjb demonstrates excellent performance across both x86-64 and ARM64 architectures, making it suitable for deployment in heterogeneous computing environments.
4.
Comprehensive validation: The algorithm has been thoroughly tested and verified to produce correct output for all IEEE 754 binary32 and binary64 floating-point values.

5.2. Experimental Findings

Our experimental evaluation on AMD R7-7840H and Apple M1 processors demonstrates that xjb achieves state-of-the-art performance:
  • 3–5× speedup over the baseline Schubfach algorithm
  • Faster than other high-performance algorithms such as yy_json, yy_double, and zmij
  • Consistent performance across different input patterns and hardware platforms

5.3. Practical Implications

The xjb algorithm has immediate practical applications in numerous domains:
  • Data serialization: JSON and other text-based data formats require efficient floating-point to string conversion for serialization operations.
  • Scientific computing: Applications that output numerical results in human-readable format benefit from faster conversion without sacrificing accuracy.
  • Database systems: Export operations and query result formatting can leverage xjb for improved throughput.
  • Web services: RESTful APIs and web applications that return numerical data can achieve lower latency with efficient conversion.

5.4. Limitations and Future Work

While xjb demonstrates strong performance, several areas warrant further investigation:
1.
Extended precision support: Future work could extend xjb to support extended precision formats (e.g., 80-bit and 128-bit floating-point numbers) for applications requiring higher precision.
2.
SIMD vectorization: Although xjb is designed to be SIMD-friendly, explicit vectorization using AVX-512 or NEON could yield additional performance gains for batch conversion workloads.
3.
Compiler compatibility: Further optimization for different compilers (particularly MSVC) would improve portability across development environments.
4.
Memory-constrained environments: Investigating memory-efficient variants of xjb could benefit embedded systems and other resource-constrained platforms.

5.5. Availability

The complete implementation of the xjb algorithm, along with benchmark tools and test suites, is publicly available at https://github.com/xjb714/xjb/releases/tag/v1.0.0. We encourage the community to integrate, test, and contribute to the ongoing development of this work.

References

  1. Steel, G. L., Jr.; White, J. L. How to Print Floating-Point Numbers Accurately. In Proceedings of the ACM SIGPLAN 1990 Conference on Programming Language Design and Implementation, PLDI, New York, NY, USA; ACM, 1990; pp. 112–126. [Google Scholar] [CrossRef]
  2. Steel, G. L., Jr.; White, J. L. How to Print Floating-Point Numbers Accurately (Retrospective). ACM SIGPLAN Notices;Best of PLDI 2004, 39(4), 1979–1999. Available online: https://dl.acm.org/doi/10.1145/989393.989431.
  3. Burger, Robert G.; Dybvig, R. Kent. Printing Floating-point Numbers Quickly and Accurately. In Proceedings of the ACM SIGPLAN1996 Conference on Programming Language Design and Implementation (PLDI ’96), New York, NY, USA, 1996; ACM; p. 108ś116. [Google Scholar] [CrossRef]
  4. Loitsch, F. Printing Floating-Point Numbers Quickly and Accurately with Integers. In Proceedings of the ACM SIGPLAN 2010 Conference on Programming Language Design and Implementation, PLDI 2010. ACM, New York, NY, USA; pp. 233–243. [CrossRef]
  5. Andrysco, M.; Jhala, R.; Lerner, S. Printing Floating-Point Numbers: a Faster, Always Correct Method. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL 2016. ACM, New York, NY, USA; pp. 555–567. [CrossRef]
  6. Adams, Ulf. Ryu¯: Fast Float-to-String Conversion. Proceed- ings of 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI’18), New York, NY, USA, 2018; ACM; p. 13 pages. [Google Scholar] [CrossRef]
  7. Adams, Ulf. Ryu¯ Revisited: Printf Floating Point Conversion. Proc. ACM Program. Lang. 3, OOPSLA, October 2019; 2019; Article 169, p. 23 pages. [Google Scholar] [CrossRef]
  8. Giulietti, R. The Schubfach Way to Render Doubles. 2020. Available online: https://drive.google.com/file/d/1KLtG_LaIbK9ETXI290zqCxvBW94dj058/view.
  9. Jeon, J. Grisu-Exact: A Fast and Exact Floating-Point Printing Algorithm. 2020. Available online: https://github.com/jk-jeon/Grisu-Exact/blob/master/other_files/Grisu-Exact.pdf.
  10. Jeon, Junekey. Dragonbox: A New Floating-Point Binary-to-Decimal Conversion Algorithm. 2024. Available online: https://github.com/jk-jeon/Dragonbox.
  11. YaoYuan, Guo. Nov 2024. Available online: https://github.com/ibireme/c_numconv_benchmark/blob/master/vendor/yy_double/yy_double.c.
  12. Cox, Russ. Jan 2026. Available online: https://github.com/rsc/fpfmt.
  13. Cox, Russ. Floating-Point Printing and Parsing Can Be Simple And Fast. Jan 2026. Available online: https://research.swtch.com/fp.
  14. Cox, Russ. Fast Unrounded Scaling: Proof by Ivy. Jan 2026. Available online: https://research.swtch.com/fp-proof.
  15. Zverovich, Victor. Mar 2026. Available online: https://github.com/vitaut/zmij.
  16. IEEE Standard for Binary Floating-Point Arithmetic. ANSI/IEEE Std 754-1985 1985, vol., no., 1–20. [CrossRef]
  17. IEEE Standard for Floating-Point Arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008) 2019, vol., no., 1–84. [CrossRef]
  18. Khuong, Paul. How to print integers really fast (with Open Source AppNexus code!). Dec 2017. Available online: https://pvk.ca/Blog/2017/12/22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/.
  19. Johnson, Dougall. Converting integers to fixed-width strings faster with Neon SIMD on the Apple M1. Apr 2022. Available online: https://dougallj.wordpress.com/2022/04/01/converting-integers-to-fixed-width-strings-faster-with-neon-simd-on-the-apple-m1/.
  20. Muła, Wojciech. SSE: conversion integers to decimal representation. Oct 2011. Available online: http://0x80.pl/notesen/2011-10-21-sse-itoa.html.
  21. Lemire, Daniel. Converting integers to decimal strings faster with AVX-512. Mar 2022. Available online: https://lemire.me/blog/2022/03/28/converting-integers-to-decimal-strings-faster-with-avx-512/.
  22. Zverovich, Victor. Oct 2025. Available online: https://github.com/fmtlib/fmt.
  23. YaoYuan, Guo. Aug 2025. Available online: https://github.com/ibireme/yyjson.
  24. Neri, Cassio. Nov 2025. Available online: https://github.com/cassioneri/teju_jagua.
  25. Leng, Jing. Nov 2025. Available online: https://github.com/lengjingzju/json/jnum.c.
Figure 1. float/double to string benchmark results.
Figure 1. float/double to string benchmark results.
Preprints 205942 g001
Table 2. Valid ranges for significand c and exponent q
Table 2. Valid ranges for significand c and exponent q
Category Float (binary32) Double (binary64)
Subnormal 1 c 2 23 1 , q = 149 1 c 2 52 1 , q = 1074
Normal 2 23 + 1 c 2 24 1 2 52 + 1 c 2 53 1
148 q 104 1073 q 971
Irregular c = 2 23 , 149 q 104 c = 2 52 , 1074 q 971
Table 3. Example of floating-point number printing result.
Table 3. Example of floating-point number printing result.
float number fixed-point scientific
2.34 “2.34” “2.34e+00"
12 “12.0" “1.2e+01"
120 “120.0" “1.2e+02"
0.012 “0.012" “1.2e-02"
Table 4. SIMD implementations of d e c _ t o _ a s c i i 8 and d e c _ t o _ a s c i i 16
Table 4. SIMD implementations of d e c _ t o _ a s c i i 8 and d e c _ t o _ a s c i i 16
SIMD implementation Description
NEON [19] Original author: Dougall Johnson. Runs on ARM processors with NEON instruction set.
SSE2 [20] Based on scalar version; requires only SSE2 instruction set.
SSE4.1 Nearly identical to SSE2 implementation; requires SSE4.1 instruction set.
AVX512 [21] Original author: Daniel Lemire. Requires AVX512IFMA and AVX512VBMI instruction sets.
Table 5. Printing formats for different ranges.
Table 5. Printing formats for different ranges.
Type Fixed-point Scientific
float [ 10 3 , 10 7 ) other ranges
double [ 10 4 , 10 16 ) other ranges
Table 7. Float/double to decimal benchmark results on AMD R7-7840H and ubuntu 26.04. The unit is nanosecond(ns).
Table 7. Float/double to decimal benchmark results on AMD R7-7840H and ubuntu 26.04. The unit is nanosecond(ns).
algorithm float double
icpx 2025.0.4 icpx 2025.0.4
Schubfach 12.2 11.51
Schubfach_xjb 4.44 6.33
Ryu 14.02 13.08
Dragonbox 10.19 10.05
yy_json 4.67 5.72
yy_double - 5.24
teju_jagua 14.99 14.37
zmij 4.76 4.78
uscalec - 11.27
xjb 2.24 3.76
Table 8. Float/double to decimal benchmark results on Apple M1 and MacOS 26.4. The unit is nanosecond(ns).
Table 8. Float/double to decimal benchmark results on Apple M1 and MacOS 26.4. The unit is nanosecond(ns).
algorithm float double
apple clang 21.0.0 apple clang 21.0.0
Schubfach 11.64 13.12
Schubfach_xjb 5.16 6.58
Ryu 15.75 14.16
Dragonbox 11.78 12.03
yy_json 3.97 4.46
yy_double - 4.08
teju_jagua 20.25 18.66
zmij 4.11 3.83
uscalec - 15.26
xjb 2.15 2.58
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2026 MDPI (Basel, Switzerland) unless otherwise stated