Understanding Poly1305

14 Apr, 2021

This is first post in the series where I'm going to go over various cryptographic algorithms, and try to explore their design and how they are implemented in various libraries.

One of the reasons why I'm writing this post is because I think there is a lot of gatekeeping cryptography behind obscure mathematics and dense, undocumented code. I'm not a genius mathematician myself, but I'm a fine programmer — so I want to show that it's possible to learn and enjoy cryptography, without having to take a full gruaduate course in it.

Poly1305 is a message authentication algorithm (MAC) designed in 2005 by Daniel Bernstein. In the original Poly1305 paper it was paired with AES, but we are going to look into far more popular variant, ChaCha20-Poly1305.

Poly1305 algorithm takes as an input an arbitrarily long message and a 32-byte key, and outputs a 16-byte tag. This tag uniquely identifies that messages under that key, but can be the same for different messages that are using different keys. In that sense, we can think of Poly1305 as a keyed hash-alike function (hash-alike because its goal wasn't being collision-resistant, unlike more popular SHA-3 or Blake2).

A high level overview of Poly1305 is that it's a polynomial evaluation MAC. We take the message and split it into 16-byte blocks, which we then treat as little-endian numbers, and evaluate a polynomial of the form

r * (r * (r * ... (r * m[0] + m[1]) + m[2]) ... + m[n-1]) + s % p

Here, r and s are components of the key, and p is a some large prime number, 2¹³⁰ - 5 in case of Poly1305.

Although this translates fairly easily to code, it has two issues — first,we are adding and multiplying very large numbers, numbers that are way bigger than any available registers, and second, the final product is going to be as large as the prime p. To avoid the second issue, we have to reduce the number modulo p, but doing so we perform n multiplications and n divisions, which is really bad for performance on modern computers.

With that out of the way, let's look at how we can implement it!

We shift each number by two bits to the right to make sure that we are working with the actual data of the r, not with r shifted by two. If this sounds weird, that's beacause it is, but try to give it some time to think it through. At some point it should become clear why does this work.

s0 = (r0 >> 2) * 5
s1 = (r1 >> 2) * 5
s2 = (r2 >> 2) * 5
s3 = (r3 >> 2) * 5

Notice how Poly1305 requires us the bottom 2 bits of each of r segments to be already set to 0 — this means that we don't lose any information by shifting by two and fixing it to be in desired range. Also, if you consider that in each one top 4 bits are cleared, then in each segment of r we clear 6 bits — which means we only have 26 bits meaningful information! That's why we are shifting by two to the right — to fix that meaningful data to be a normal, 26-bit limb.

But if you don't get it yet, don't worry It took me almost a full week to understand the reasoning behind this "trick".

OpenSSL lists a couple of ways the 130-bit number used in various—software and hardware—implementations is encoded:

base 2³² floating-point implementation utilizing extended-precision operations supported by Intel x87 ISA,
mixed base 2¹⁶ and 2³² floating-point implementation for double-precision arithmetics,
base 2²²-ish vectorized floating-point implementation targeting x86_64 AVX unit,
base 2²⁶ vectorized implementation targeting integer SIMD units in x86, x86_64 and ARM processors,
and base 2²⁶ scalar implementation as a reference or fall-back implementation.

Note that there is no base 2⁶⁴ there, as it is not as popular or optimised for low-end devices as 32-bit implementations are.

The base 2⁶⁴ approach is shown to be the best choice for 64-bit processors that lack a SIMD unit.