Before we start.

Last week has been quite eventful for me. I taught my first university lecture on information theory. I had to fight my E16 a bit to let me actually finish my slide deck, and I wrote a short blog post at 5:00 a.m. out of frustration. Then I went to sleep, and by the time I woke up I found out that I was in the news…

If you’re here looking for my work, keep searching. The E16 bug is definitely not as serious as the news outlets like to claim. But actual ground-breaking scientific work rarely breaks into the mainstream, so I’m genuinely as surprised as you are. I have done nothing to end up being famous for a window manager. It would be great if news sites reported as vividly on, for example, the fact that I work on genomics and produced the best available (Pareto-optimal) digital genome storage solution, or that I did some cool stuff when I was 16, or that I wrote a SoTA data compressor that is seemingly used by CERN and FitGirl Repacks (oh irony!) during my Abitur exams. But none of these stories sell.

I am also eternally grateful to the many men who have been kind enough to spend their time bitching about me online for something I had no control over, and to the weirdos who complain that I don’t spend my time getting pregnant or that I look ugly. Doing God’s work, really. I don’t need to be told that “one day I will get it” because the media reported on my 20-page bug fix. I probably deserve some punishment for the sins I have committed in my life, but having to read the opinions of Polish people online is a disproportionate one. If you’re so smart, look at the things of mine that can’t reasonably break into the mainstream. And if you’re also (regrettably) living in Poland and wondering why Poles abroad want, on average, nothing to do with you, imagine what my e-mail inbox looks like.

Anyway…

Introduction

Andrzej proposes a curious operator, Exp-Minus-Log, defined as

$$\operatorname{EML}(x, y) = \exp(x) - \ln(y)$$

Much ink has been spilled about it, and you can probably read about the sensation online. But the core claim is that you can use EML to compute most elementary functions, so it could serve as a two-variable basis function for some type of calculator.

This looks deceptively simple to compute and I can’t wait for people to tell me to pull libm and try subtracting the exps and logs. Not so simple. The C standard library typically gives you both to 1 ULP each, but the trouble is the word *difference*. When $e^x \approx \ln y$, the subtraction cancels most of the significant bits, and the relative error in the result can explode to thousands of ULPs. This is catastrophic cancellation, and it is the central enemy of this post.

I’ll walk you through some progressively cleverer implementations that I’ve looked into, benchmark them against __float128 reference values over 20 million random inputs, and explain the IEEE 754 and algebraic tricks that make the final version work.

All code in this article requires a C compiler with hardware FMA support (-march=native or -mfma), IEEE 754 binary64 arithmetic (essentially any modern CPU), and linking against -lm.

The Problem

exp and log are each faithfully rounded by glibc/musl on x86-64: the returned double differs from the mathematical value by at most 1 ULP. So each operand carries at most $\sim 2^{-53}$ relative error, i.e., error in the last bit of the 53-bit significand.

Subtraction of nearby values introduces no new rounding error (Sterbenz’s lemma), but it amplifies existing error. If $e^x$ and $\ln y$ agree in their top $k$ bits, the difference $f$ satisfies $|f| \approx 2^{-k} \cdot \max(e^x, \ln y)$, and the 1 ULP error in each operand becomes $\sim 2^{k-1}$ ULP in the result.

For $k = 20$, that is half a million ULP: we get a result with only about 33 correct bits out of 53. Not very useful. Many would nonetheless write:

#include <math.h>

double eml(double x, double y) {
    return exp(x) - log(y);
}

This is the baseline. It works perfectly when there is little cancellation, and catastrophically when there is. Over $2 \times 10^7$ random $(x, y)$ pairs with $x \in [-10, 10]$, $y \in (0, 1000]$, measured against __float128:

Metric Value
ns/call 12.4
worst ULP 851,150
0 ULP 69.913%
below 1 ULP 97.659%
below 2 ULP 98.638%
above 1024 ULP 673 (0.0034%)

The vast majority of inputs are fine. But 1.4% of cases exceed 2 ULP, and the worst case has fewer than 33 correct bits.

We can start tackling the problem by reusing the fairly popular trick of double-double arithmetic. Such a number is a pair (h,l)(h, l) of IEEE 754 doubles satisfying lulp(h)/2|l| \le \text{ulp}(h)/2, representing the unevaluated sum h+lh + l. This gives roughly 106 bits of significand, so enough headroom to survive 20+ bits of cancellation with precision to spare.

We have two key building blocks, both of which compile to a handful of instructions with FMA: Knuth’s 2Sum and Dekker’s 2Prod. Given $a, b$, they produce $(s, t)$ such that $a + b = s + t$ exactly, or $(p, e)$ such that $a \cdot b = p + e$ exactly.

static inline void two_sum(double a, double b, double * s, double * t) {
  *s = a + b;
  double v = *s - a;
  *t = (a - (*s - v)) + (b - v);
}

static inline void two_prod(double a, double b, double * p, double * e) {
  *p = a * b;
  *e = fma(a, b, -*p);
}

So the basic approach is to compute $e^x$ and $\ln y$ each as a double-double, subtract in extended precision, and round once.

exp/log in double-double domain

For $\exp(x)$, the argument is reduced via $x = (64n + j) \cdot \frac{\ln 2}{64} + r$, where $|r| < \frac{\ln 2}{128} \approx 0.0054$. The reduction is carried out in double-double using FMA product residuals:

double p = kf * L_HI;
double p_err = fma(kf, L_HI, -p);     /* residual of kf * L_HI */
double r_hi, r_lo;
two_sum(x, -p, &r_hi, &r_lo);         /* exact subtraction      */
r_lo -= p_err + kf * L_LO;

Then $e^r$ is approximated by an 8-term Estrin polynomial (serial depth of 3 FMAs instead of 8 for Horner), with the $r^2 \cdot c$ tail kept as a double-double via two_prod:

double r2 = r * r, r4 = r2 * r2;
double A  = fma(r, 1.0/6.0,      0.5);
double B  = fma(r, 1.0/120.0,    1.0/24.0);
double C  = fma(r, 1.0/5040.0,   1.0/720.0);
double D  = fma(r, 1.0/362880.0, 1.0/40320.0);
double AB = fma(r2, B, A);
double CD = fma(r2, D, C);
double c  = fma(r4, CD, AB);

The result is scaled by a 64-entry precomputed table $T_j = e^{j \ln 2 / 64}$ (stored as double-double, $\ge 107$ bits each) and then by $2^n$ via IEEE exponent injection:

double S = u2d((uint64_t)(n + 1023) << 52);   /* 2^n, no multiply */
*hi = s * S;
*lo = e * S;

For $\ln y$, the input is decomposed as $y = 2^e \cdot m$ by extracting the IEEE exponent bits, and $m$ is reduced via a 64-entry table $L_j = \ln(1 + j/64)$. The kernel is an atanh series in $s = f/(m+F)$, where $F = 1 + j/64$ and $f = m - F$.

One critical subtlety is that the denominator $m + F$ must be kept as a double-double. Without this, its $\sim 2^{-51}$ rounding error propagates through $2 \cdot \operatorname{atanh}'(s) \approx 2$ and dominates the result:

double d_hi, d_lo;
two_sum(m, F, &d_hi, &d_lo);
double s_hi = f / d_hi;
double err  = fma(-s_hi, d_hi, f);   /* exact: f - s_hi * d_hi */
err        -= s_hi * d_lo;           /* correct for d_lo       */
double s_lo = err / d_hi;

The final subtraction is a double-double subtract followed by a single rounding:

two_sum(eh, -lh, &sh, &sl);
sl += el - ll;
return sh + sl;

Metric Value
ns/call 27.1
worst ULP 7
0 ULP 99.999%
below 1 ULP 100.000%
above 1024 ULP 0

This is accurate but expensive: on my machine (Ryzen 7 PRO 7840U), about 2.2 times the naive (glibc) cost. It also costs us two 64-entry tables (2 KB total). The 7 ULP worst case comes from inputs with $\sim 19$ bits of cancellation, right at the limit of our $\sim 74$-bit intermediates.

The elegant solution

The unspoken key insight to evaluating eml, from my experiments, is the following. When $t = \ln y > 0$, the identity

$$e^x - \ln y = \ln y \cdot \operatorname{expm1}\big(x - \ln(\ln y)\big)$$

follows directly from $e^x = t \cdot e^{x - \ln t}$, and therefore $e^x - t = t(e^{x - \ln t} - 1) = t \cdot \operatorname{expm1}(x - \ln t)$.

This is actually a really good way to tackle the problem, because expm1 by design handles the cancellation that exp(x) - 1 would otherwise suffer near zero. The problematic subtraction is absorbed into a dedicated function that already knows how to deal with it.

The identity reduces the problem to computing the argument $\delta = x - \ln(\ln y)$ accurately. When cancellation is severe, $\delta \approx 0$ while both $x$ and $\ln(\ln y)$ are $\mathcal O(1)$, so even a small error in $\ln(\ln y)$ gets amplified by $\operatorname{expm1}' \approx 1$.

This is why the slow path still needs double-double; we call log_dd(y) to get $\ln y = t_h + t_l$, then log_dd(t_h) to get $\ln t_h = u_h + u_l$, and reduce:

double dh, dl;
two_sum(x, -uh, &dh, &dl);         /* exact by Sterbenz (x ≈ uh) */
dl -= ul;

Then a Taylor-corrected expm1 on the double-double δ\delta:

$$\operatorname{expm1}(\delta_h + \delta_l) \approx \operatorname{expm1}(\delta_h) + e^{\delta_h} \cdot \delta_l$$

The $\mathcal O(\delta_l^2)$ remainder is $< 2^{-104}$, negligible. The final assembly uses FMA for an exact product residual:

double em   = expm1(dh);
double corr = (1.0 + em) * dl;       /* = exp(dh) * dl */
double ph   = th * em;
double pl   = fma(th, em, -ph);      /* exact residual */
return ph + (pl + th * corr - tl);

Conversely, for $\sim 93\%$ of inputs there is no significant cancellation. In this regime the naive exp(x) - log(y) is already $\le 2$ ULP accurate. The guard condition is thus a single comparison after the subtraction:

double e    = exp(x);
double diff = e - t;
double mag  = e > t ? e : t;
if (fabs(diff) * 2.0 >= mag)     /* < 1 bit cancelled */
  return diff;
return eml_slow(x, y);           /* dd path, cold */

Hence, the full code is as follows:

#include <math.h>
#include <stdint.h>
#include <string.h>

static inline uint64_t d2u(double d) {
  uint64_t u; memcpy(&u, &d, 8); return u;
}
static inline double u2d(uint64_t u) {
  double d; memcpy(&d, &u, 8); return d;
}
static inline void two_sum(double a, double b,
                           double * s, double * t) {
  *s = a + b;
  double v = *s - a;
  *t = (a - (*s - v)) + (b - v);
}

/*
 * 64-entry table: LT[j] = ln(1 + j/64) as (hi, lo).
 */
static const double LT_hi[64] = { 0x0.0p+0, 0x1.fc0a8b0fc03e4p-7, 0x1.f829b0e783300p-6, 0x1.77458f632dcfcp-5, 0x1.f0a30c01162a6p-5, 0x1.341d7961bd1d1p-4, 0x1.6f0d28ae56b4cp-4, 0x1.a926d3a4ad563p-4, 0x1.e27076e2af2e6p-4, 0x1.0d77e7cd08e59p-3, 0x1.29552f81ff523p-3, 0x1.44d2b6ccb7d1ep-3, 0x1.5ff3070a793d4p-3, 0x1.7ab890210d909p-3, 0x1.9525a9cf456b4p-3, 0x1.af3c94e80bff3p-3, 0x1.c8ff7c79a9a22p-3, 0x1.e27076e2af2e6p-3, 0x1.fb9186d5e3e2bp-3, 0x1.0a324e27390e3p-2, 0x1.1675cababa60ep-2, 0x1.22941fbcf7966p-2, 0x1.2e8e2bae11d31p-2, 0x1.3a64c556945eap-2, 0x1.4618bc21c5ec2p-2, 0x1.51aad872df82dp-2, 0x1.5d1bdbf5809cap-2, 0x1.686c81e9b14afp-2, 0x1.739d7f6bbd007p-2, 0x1.7eaf83b82afc3p-2, 0x1.89a3386c1425bp-2, 0x1.947941c2116fbp-2, 0x1.9f323ecbf984cp-2, 0x1.a9cec9a9a084ap-2, 0x1.b44f77bcc8f63p-2, 0x1.beb4d9da71b7cp-2, 0x1.c8ff7c79a9a22p-2, 0x1.d32fe7e00ebd5p-2, 0x1.dd46a04c1c4a1p-2, 0x1.e744261d68788p-2, 0x1.f128f5faf06edp-2, 0x1.faf588f78f31fp-2, 0x1.02552a5a5d0ffp-1, 0x1.0723e5c1cdf40p-1, 0x1.0be72e4252a83p-1, 0x1.109f39e2d4c97p-1, 0x1.154c3d2f4d5eap-1, 0x1.19ee6b467c96fp-1, 0x1.1e85f5e7040d0p-1, 0x1.23130d7bebf43p-1, 0x1.2795e1289b11bp-1, 0x1.2c0e9ed448e8cp-1, 0x1.307d7334f10bep-1, 0x1.34e289d9ce1d3p-1, 0x1.393e0d3562a1ap-1, 0x1.3d9026a7156fbp-1, 0x1.41d8fe84672aep-1, 0x1.4618bc21c5ec2p-1, 0x1.4a4f85db03ebbp-1, 0x1.4e7d811b75bb1p-1, 0x1.52a2d265bc5abp-1, 0x1.56bf9d5b3f399p-1, 0x1.5ad404c359f2dp-1, 0x1.5ee02a9241675p-1};
static const double LT_lo[64] = { 0x0.0p+0, -0x1.83092c59642a1p-62, 0x1.33e3f04f1ef23p-60, 0x1.18d3ca87b9296p-59, 0x1.85f325c5bbacdp-59, -0x1.b599f227becbbp-58, -0x1.906d99184b992p-58, 0x1.942f48aa70ea9p-58, -0x1.61578001e0162p-60, 0x1.9a5dc5e9030acp-57, 0x1.301771c407dbfp-57, 0x1.9f4f6543e1f88p-57, -0x1.bc60efafc6f6ep-58, 0x1.be36b2d6a0608p-59, 0x1.d904c1d4e2e26p-57, -0x1.398cff3641985p-58, -0x1.4f689f8434012p-57, -0x1.61578001e0162p-59, -0x1.caaae64f21acbp-57, 0x1.7dcfde8061c03p-56, 0x1.ce63eab883717p-61, -0x1.76f5eb09628afp-56, -0x1.8f4cdb95ebdf9p-56, -0x1.c68651945f97cp-57, 0x1.f42decdeccf1dp-56, 0x1.3927ac19f55e3p-59, 0x1.4236383dc7fe1p-56, -0x1.ddea0f7f58e3dp-57, -0x1.8c76ceb014b04p-56, 0x1.92ce979ed2950p-56, -0x1.29639dfbbf0fbp-56, -0x1.16cc8bae0bbe4p-56, -0x1.a92e513217f5cp-59, -0x1.cadec02b436afp-56, -0x1.cd04495459c78p-56, -0x1.0f3c590a887cap-59, -0x1.4f689f8434012p-56, 0x1.877b232fafa37p-56, -0x1.0467656d8b892p-56, -0x1.c825c90c344b9p-58, -0x1.328df13bb38c3p-56, -0x1.328260d8abca0p-57, -0x1.cb1cb51408c00p-56, 0x1.395e58e2445bbp-55, -0x1.259da11330801p-55, -0x1.0e09b27a4373ap-60, -0x1.59c33171a6876p-55, -0x1.9d1a11443f10cp-56, 0x1.ef62cd2f9f1e3p-56, -0x1.f48725e374d6ep-55, -0x1.487c0c246978ep-57, -0x1.1a158f3917586p-55, 0x1.fb590a1f566dap-57, 0x1.6eb92d885ce4fp-57, -0x1.58eef67f2483ap-55, -0x1.6fef670bd4b62p-55, 0x1.9192f30bd1806p-55, 0x1.f42decdeccf1dp-55, 0x1.13dfa3d3761b6p-60, -0x1.8d3d9ea6e9ea9p-55, -0x1.1883750ea4d0ap-57, 0x1.0471885cd8ff3p-55, -0x1.35955683f7196p-59, 0x1.c358257f49082p-55};

static void log_dd(double y, double * hi, double * lo) {
  const double LN2_HI = 0x1.62e42fefa39efp-1;
  const double LN2_LO = 0x1.abc9e3b39803fp-56;

  uint64_t bits = d2u(y);
  int biased = (int)((bits >> 52) & 0x7FF);
  int subnorm = (biased == 0);
  if (__builtin_expect(subnorm, 0)) {
      y *= 0x1.0p52;
      bits = d2u(y);
      biased = (int)((bits >> 52) & 0x7FF);
  }
  int e = biased - 1023 - (subnorm ? 52 : 0);
  uint64_t m_bits = (bits & 0x000FFFFFFFFFFFFFull)
                  | 0x3FF0000000000000ull;
  double m = u2d(m_bits);
  int j    = (int)((m_bits >> 46) & 0x3F);
  double F = 1.0 + (double)j * 0x1.0p-6;
  double f = m - F;

  /* dd denominator: avoids 2^-51 contamination */
  double d_hi, d_lo;
  two_sum(m, F, &d_hi, &d_lo);
  double s_hi = f / d_hi;
  double err  = fma(-s_hi, d_hi, f) - s_hi * d_lo;
  double s_lo = err / d_hi;

  double s2 = s_hi * s_hi, s4 = s2 * s2;
  double Aq = fma(s2, 1.0/5.0, 1.0/3.0);
  double Bq = fma(s2, 1.0/9.0, 1.0/7.0);
  double q  = fma(s4, Bq, Aq);

  double log_m_hi   = 2.0 * s_hi;
  double log_m_tail = 2.0 * fma(s_hi, s2 * q, s_lo);

  double a, b;
  two_sum(LT_hi[j], log_m_hi, &a, &b);
  b += LT_lo[j] + log_m_tail;

  double ef = (double)e;
  double eln2_hi = ef * LN2_HI;
  double eln2_lo = fma(ef, LN2_HI, -eln2_hi)
                  + ef * LN2_LO;
  double c, d;
  two_sum(eln2_hi, a, &c, &d);
  d += eln2_lo + b;
  two_sum(c, d, hi, lo);
}

static __attribute__((noinline, cold))
double eml_slow(double x, double y) {
  double th, tl;
  log_dd(y, &th, &tl);
  double uh, ul;
  log_dd(th, &uh, &ul);
  double dh, dl;
  two_sum(x, -uh, &dh, &dl);
  dl -= ul;
  double em   = expm1(dh);
  double corr = (1.0 + em) * dl;
  double ph   = th * em;
  double pl   = fma(th, em, -ph);
  return ph + (pl + th * corr - tl);
}

double eml(double x, double y) {
  // IEEE754 nonsense.
  if (__builtin_expect(x != x || y != y, 0))
    return x + y;
  if (__builtin_expect(y <  0.0, 0)) return NAN;
  if (__builtin_expect(y == 0.0, 0)) return INFINITY;
  if (__builtin_expect(y == INFINITY, 0))
    return -INFINITY;
  if (__builtin_expect(x >= 0x1.62e42fefa39efp+9, 0))
    return INFINITY;
  // Hot path.
  double t = log(y);
  if (t <= 0.0)
    return exp(x) - t;
  double e    = exp(x);
  double diff = e - t;
  double mag  = e > t ? e : t;
  if (__builtin_expect(fabs(diff) * 2.0 >= mag, 1))
    return diff;
  return eml_slow(x, y);
}

Finally, the results are again very good:

Metric Value
ns/call 21.4
worst ULP 2
0 ULP 73.016%
below 1 ULP 99.958%
below 2 ULP 100.000%
above 1024 ULP 0

A comparison of the methods tested:

            Naive      DD brute-force    New identity
ns/call     12.4       27.1              21.4
worst ULP   851,150    7                 2
tables      0          2 KB (exp/log)    1 KB (log only)
code        1 line     ~200 lines        ~130 lines

All benchmarks ran on x86-64 with glibc, GCC -O2 -march=native, measured against libquadmath’s 128-bit reference over $2 \times 10^7$ uniformly random $(x, y)$ pairs with $x \in [-10, 10]$ and $y \in (0, 1000]$.