Today's article marks my first step on a journey to create a fast protocol buffer implementation: introducing protobuf-varint.
This article assumes some basic knowledge of programming and JavaScript but should also be engaging for computer enthusiasts.
Aside from delving into the basics and internals of protobufs, you will also learn a few tricks on how to write faster JavaScript codecs.
Background
JSON, the most common format for sending data, was designed to outperform XML, which was slow, hard for humans to read, and introduced considerable overhead.
JSON works great unless you're dealing with a lot of data. It doesn't stream well, and human readability becomes unnecessary once you're handling more data than a human could read in a lifetime.
A company that extensively transfers data — Google — developed the protobuf
transfer format for improved efficiency. Unlike JSON, it is a binary format that supports streaming and aims to minimize the amount of data transferred.
Google's JavaScript library, protobuf.js
, is okay. However, my experience with hypercore
revealed its relatively slow performance, while Mafintosh's protocol-buffers
is notably faster and more lightweight.
Upon examining the internals, I discovered that protocol-buffers
utilizes varint
in the background. varint
is a handy little library that I thought worked well for most use cases, but I have since encountered some issues with it...
Varint is cool
Before delving into the code, let's have a primer on var-ints. The protocol buffers specification has four basic integer data types: 32-bit and 64-bit integers, each in a signed (positive and negative) and an unsigned (exclusively positive) variant.
type | bits | bytes | min value | max value |
---|---|---|---|---|
int32 | 32 | 4 | -0x8000_0000 | +0x7fff_ffff |
uint32 | 32 | 4 | +0x0000_0000 | +0xffff_ffff |
int64 | 64 | 8 | -0x8000_0000_0000_0000 | +0x7fff_ffff_ffff_ffff |
uint64 | 64 | 8 | +0x0000_0000_0000_0000 | +0xffff_ffff_ffff_ffff |
Actually, there are two signed encodings: the standard int32 and int64 encodings, along with the "zig-zag" variants sint32 and sint64.
A dedicated section on this topic is reserved for the end of this article. Until then, the focus of the article is mostly on uint32 and uint64.
All var-int codecs employ the same intriguing trick: instead of always storing all 4 or 8 bytes, var-int encoding uses only 7 of the 8 bits in each byte. The remaining bit, known as the "most significant bit" (MSB), signals that additional bytes follow.
For unsigned numbers ranging from 0~127 (0b00000000~0b01111111), a single byte suffices. Numbers from 128 (0b10000000) onwards set the MSB, necessitating at least two bytes. For example, 128 is encoded as [0b10000000, 0b00000001].
The first byte holds the lowest 7 bits of 128 (all 0) along with the MSB, while the second byte contains the remaining bit needed for 128.
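To make this concrete, here is a minimal sketch of a generic unsigned var-int encoder (the encodeVarint name and the code are just for illustration, not the optimized approach discussed later):

// Minimal, unoptimized var-int encoding for unsigned numbers up to 32 bits:
// emit 7 bits per byte and set the MSB while more bits remain.
function encodeVarint (num) {
  const bytes = []
  while (num > 0x7f) {
    bytes.push((num & 0x7f) | 0x80) // lower 7 bits, MSB set
    num = Math.floor(num / 128)     // drop the 7 bits we just wrote
  }
  bytes.push(num) // final byte, MSB cleared
  return bytes
}

encodeVarint(127) // [0b01111111]
encodeVarint(128) // [0b10000000, 0b00000001]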
As numbers grow, like 314_151_314 (0b00010010_10111001_10010001_10010010), additional bytes become necessary for encoding. Let's examine this number, grouped into 8-bit and 7-bit blocks.
4 3 2 1
┏━━━━━━━━┓ ┏━━━━━━━━┓ ┏━━━━━━━━┓ ┏━━━━━━━━┓
8 bit encoded: 00010010 10111001 10010001 10010010
╭──╯│ │╭──╯ │ │╭──╯ │ │╭──╯│
7 bit blocks: 0001 0010101 1100110 0100011 0010010
varint encoded: 0000001 10010101 11100110 10100011 10010010
┗━━┷━━━━┛┗┷━━━━━━━┛┗┷━━━━━━━┛┗┷━━━━━━━┛┗┷━━━━━━━┛
5 4 3 2 1
In the best-case scenario, this means that up to 75% of the bytes can be omitted; the downside is that in the worst case 25% more are required.
The impact can be particularly significant when dealing with sensor data that typically falls within a specific range but may have peaks and valleys. However, since this isn't universally the case, protocol buffers require you to choose the encoding that fits the nature of the data you are working with.
Note: Varint is similar to UTF-8 encoding.
JavaScript Number Notation
There are various ways to represent numbers in JavaScript. The default assumes a decimal system: 326
is written as 326
. In the example above, I utilized the 0x...
notation for numbers in a hexadecimal system: 326
is represented as 0x146
.
While working on this project, I found the binary notation particularly useful: 326
can be also written as 0b101000110
.
Numeric separators prove quite handy as well. The underscore _
can be employed to enhance number readability, and you can place it wherever you prefer: 3_123_4172_93
is equivalent to 3123417293
. I used it to group bytes in binary numbers for improved readability: 0b11111111_11111111
can also be written as 0xFF_FF
.
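Putting these notations together, all of the following comparisons hold:

// The same number written in different JavaScript notations:
326 === 0x146          // hexadecimal
326 === 0b101000110    // binary
326 === 0b1_0100_0110  // binary with numeric separators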
Slow JavaScript code
The varint
library is not just faster than Google's library; it is also smaller and more versatile. Aside from functional issues (that I will address later), there are quite a few other things that seem like bad decisions for processing a lot of data.
do {
  if (counter >= l || shift > 49) {
    read.bytes = 0
    throw new RangeError('Could not decode varint')
  }
  // ..
For every byte that is read, the if (...) condition is executed. counter >= l checks against the length of the array. Before we read the contents of the array, it is unknown how many bytes we need to read and whether all of them are present in the binary data chunk. This is a problem particularly for streaming, where an arbitrary number of bytes is received at a time and we don't know how many numbers each chunk contains.
for await (const chunk of connection) {
  // How many numbers are in the chunk?
}
To eliminate this check, I recalled that neither Uint8Array
nor Array
throws an error when you access data outside the bounds.
(new Uint8Array)[1] === undefined // No problem.
When you have, let's say, a 5MB buffer to decode, almost all numbers can be read without the check. Only at the very end of the buffer could reading become an issue. To speed up the process, it is better to assume everything is okay for the common case and check only the last number that was read.
const numbers = []
const buffer = new Uint8Array(/* ... your input chunk */)
let offset = 0
// decode() is a var-int decoder that reports the bytes it consumed
// via decode.bytes (the same API the varint library uses)
while (offset < buffer.byteLength) {
  numbers.push(decode(buffer, offset))
  offset += decode.bytes
}
if (offset > buffer.byteLength) { // read over the end
  numbers.pop() // the last number wasn't read correctly
  offset -= decode.bytes
}
After all the data is read, we check whether the last number actually used more bytes than are in the chunk and in that case, remove the last number from the list.
So, it would be faster (and not a problem) to simply skip the counter >= l check against the buffer. Before we can remove the if entirely, though, we need to contend with the other check: shift > 49. This check is needed because without it, the loop might produce unusable numbers:
do {
  if (/* ... */ shift > 49) {
    // …
  }
  // …
  shift += 7
} while (b >= MSB)
The shift > 49 and b >= MSB conditions mean that the loop will iterate until either a byte without the MSB is read or the number read contains more than 49 bits.
However, since each iteration uses shift += 7, the maximum number of iterations is 5 for 32-bit numbers and 10 for 64-bit numbers.
We can skip the entire if (...) block by copying the code 5 times! After all, the encoded number cannot be larger than 32 bits anyway.
b = buf[counter++]
res += shift < 28
  ? (b & REST) << shift
  : (b & REST) * Math.pow(2, shift)
shift += 7
// ---
b = buf[counter++]
res += shift < 28
  ? (b & REST) << shift
  : (b & REST) * Math.pow(2, shift)
shift += 7
// ... 3 more times.
There are several advantages to simply copying the code 5 times. For one, we know the values of counter
and shift
: 0
~4
and 0, 7, 14, 21, 28
. So we can fill these in:
b = buf[counter]
if (b < MSB) {
  return b
}
let res = b & REST
b = buf[counter + 1]
if (b < MSB) {
  return res + (b << 7)
}
res += (b & REST) << 7
b = buf[counter + 2]
// ... etc.
This code avoids quite a few branches (if/for/while) and also inlines several values that the computer doesn't need to compute at runtime anymore.
Here is the complete uint32
decode function I came up with:
→ martinheidegger/protobuf-varint/uint32#L17-L44
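If you don't want to click through, here is a condensed sketch of what such a fully unrolled decoder roughly looks like (a simplification for illustration, not the exact linked source; MSB is 0x80 and REST is 0x7f):

const MSB = 0x80
const REST = 0x7f

// Condensed sketch of a fully unrolled uint32 var-int decoder. The 5th byte
// would overflow a 32-bit shift, so multiplication is used for it instead.
function decodeUint32 (buf, offset = 0) {
  let b = buf[offset]
  if (b < MSB) return b
  let res = b & REST

  b = buf[offset + 1]
  if (b < MSB) return res + (b << 7)
  res += (b & REST) << 7

  b = buf[offset + 2]
  if (b < MSB) return res + (b << 14)
  res += (b & REST) << 14

  b = buf[offset + 3]
  if (b < MSB) return res + (b << 21)
  res += (b & REST) << 21

  b = buf[offset + 4]
  return res + (b & REST) * 0x10000000 // 2 ** 28
}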
JavaScript's missing 64-bit number
You may have noticed earlier that the varint library only supports up to 49 bits. This limitation exists because all numbers in JavaScript are double-precision floating-point values, which can only represent integers exactly up to 53 bits. That's why we have the Number.MAX_SAFE_INTEGER constant, which is 0x1f_ffff_ffff_ffff (a 53-bit number).
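A quick way to see that limit in action:

// Above Number.MAX_SAFE_INTEGER, integer arithmetic silently loses precision:
Number.MAX_SAFE_INTEGER === 9007199254740991 // true (2 ** 53 - 1 = 0x1f_ffff_ffff_ffff)
9007199254740992 + 1 === 9007199254740992    // true - the +1 is silently lost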
Using the varint
library for 64-bit de-/encoding is a bad idea. Don't use it!
If your protocol buffer implementation returns regular numbers for 64-bit types, that likely indicates it is not compatible!
There is an easy way to work around this: use two numbers, low for the lower 32 bits and high for the upper 32 bits.
This is a very common solution. Google's protobuf.js implements the LongBits
class that has a lo
and hi
property. The much more popular library long
also implements 64-bit numbers.
However, my recommendation for working with long numbers is my own, long-compatible library: → longfn
The advantage of longfn is that you don't need any custom classes and can simply use regular JavaScript objects. Not only is this faster, it also comes with fewer dependencies.
import { util } from 'protobufjs'
import Long from 'long'

let int64
int64 = new util.LongBits(0, 0) // slow
int64 = new Long(0, 0) // ... also slow
int64 = { low: 0, high: 0 } // Fast and no dependencies
Without any dependency, protobuf-varint
will return a simple object.
import { decode } from 'protobuf-varint/int64'
import { add, fromInt } from 'longfn'
const three = fromInt(3)
const long = decode([0])
// long === { low: 0, high: 0, unsigned: false }
add(long, three, long) // long + 3
This way, you have a long number object that you can use both for manual operations and together with any library that you like.
About BigInt
Some time ago, BigInt numbers were introduced to JavaScript engines.
When I wrote the code, I contemplated whether to use BigInt or longfn objects. Two factors were crucial in my decision not to choose BigInt. Firstly, BigInt is still relatively new, and some older packaging libraries don't support its syntax. Secondly, and more importantly, BigInt numbers - by definition - need bounds checks and other operations that make them slower than two regular number operations.
longfn comes with useful fromBigInt and toBigInt helpers in case you prefer that.
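A sketch of how that could look (I'm assuming the simplest possible signatures here; check the longfn documentation for the exact API):

import { fromBigInt, toBigInt } from 'longfn'

// Assumed usage: fromBigInt(bigint) creates a { low, high, unsigned } object,
// toBigInt(long) converts it back into a native BigInt.
const long = fromBigInt(12345678901234567890n)
const big = toBigInt(long) // 12345678901234567890n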
Bring your own memory (BYOM)
The naive way to encode binary data is to always return a new Uint8Array
of the needed size.
function myDummyEncoder (input: number): Uint8Array {
  const buf = new Uint8Array(2)
  buf[0] = input & 0xff
  buf[1] = (input >> 8) & 0xff
  return buf
}
const asByte = myDummyEncoder(123)
The problem with this approach is that, typically, when you encode data, you need more than one value. In typical protocol buffers, there are values of mixed types. What you would then need to do is:
const result = new Uint8Array([
  ...myDummyEncoder(123),
  ...myDummyEncoder(77)
])
Every time the function is called, memory gets allocated, is used for a bit, and then garbage collected. This is terribly slow.
If you use any kind of binary encoding, you want to provide your own memory (a Uint8Array) to encode the value into, along with an offset from where to start writing.
function myEncoder (input: number, target: Uint8Array, offset: number): Uint8Array {
  target[offset] = input & 0xff
  target[offset + 1] = (input >> 8) & 0xff
  return target
}
const asByte = new Uint8Array(2)
myEncoder(123, asByte, 0)
For binary encoding, this has long been common practice. The term BYOB (bring your own buffer) is often used for it. I decided to also use this technique for longfn and for decoding. You can simply pass the target value to the decoder!
import { decode } from 'protobuf-varint/int64'
const long = { low: 0, high: 0, unsigned: false }
decode([123], 0, long) // Will write directly into the long value
This way we can avoid garbage collection and are very fast both when encoding and decoding.
Execute once, not twice
By using the previous BYOM technique, we run into the issue that we need to know, before we encode the value, whether enough memory is allocated.
import { encode } from 'protobuf-varint/uint32'

encode(
  41415256,
  new Uint8Array(/* How many bytes do we need? */)
)
The common way for codecs is to implement an encodeLength function for a given type to evaluate the size before writing it.
import { encode, encodeLength } from 'protobuf-varint/uint32'

const num = 41415256
encode(
  num,
  new Uint8Array(encodeLength(num))
)
The problem with this approach is that the entire logic of encodeLength
needs to be executed again when calling encode
.
By changing the API, I was able to improve encoding performance by yet another bit:
import { fixedEncoder } from 'protobuf-varint/uint32'

const num = 41415256
const encoder = fixedEncoder(num)
encoder.encode(
  num,
  new Uint8Array(encoder.bytes)
)
This way, the fixedEncoder function figures out once which encoder should be used for the encode call.
Here is what the 1-byte encoder that you get from fixedEncoder(0) looks like:
{
  bytes: 1,
  encode(int, out = [], offset = 0) {
    out[offset] = int
    return out
  }
},
This is as fast as it gets in JavaScript. No branches, no operations. In my first tests this approach is so fast that it is quicker than a lookup for values.
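For illustration, the selection behind such an API could look something like this (an assumption about the general approach, not the actual protobuf-varint source):

// Sketch: pick one of several branch-free encoders up front, based on how
// many 7-bit groups the value needs, so encode() itself checks nothing.
const oneByte = {
  bytes: 1,
  encode (int, out = [], offset = 0) {
    out[offset] = int
    return out
  }
}
const twoBytes = {
  bytes: 2,
  encode (int, out = [], offset = 0) {
    out[offset] = (int & 0x7f) | 0x80 // lower 7 bits, MSB set
    out[offset + 1] = int >>> 7       // remaining bits, MSB cleared
    return out
  }
}
// ... plus similar encoders for 3, 4 and 5 bytes

function pickEncoder (num) {
  if (num < 0x80) return oneByte     // fits in 7 bits
  if (num < 0x4000) return twoBytes  // fits in 14 bits
  // ... and so on up to 5 bytes
}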
Treacherous Code Reuse
Until now, I haven't gone into detail on the difference between uint32 (unsigned) encoding and its signed variants int32 and sint32. Before we start, an important note:
sint32 is much better than int32
int32 encodes negative numbers in their two's-complement form, which means the sign only shows up as a bit in the last byte.
+1 → [0b0_0000001]
-1 → [0b1_1111111, 0b1_1111111, 0b1_1111111, 0b1_1111111, 0b0_0001111]
This means that while positive numbers can start with few bytes and grow as needed, every negative number will use all 5 bytes.
sint32 uses zig-zag encoding. It sets the first bit to 1 if the number is negative and starts the encoding of the value from the second bit. This means both positive and negative numbers grow similarly.
starts here ↘︎ ↙︎ 1 if negative, 0 if positive
MSB↘ ↓ ↓
+1 → [0b0_000001_0]
-1 → [0b0_000000_1]
+5 → [0b0_000100_0]
-5 → [0b0_000100_1]
Practically, this means you should almost never use int32
or int64
and instead you should use sint32
or sint64
.
With that big notice out of the way: in many implementations, you might notice that the zig-zag encoding is implemented like in the Google documentation:
function encode32(num) {
  return (num << 1) ^ (num >> 31)
}
The problem with this code is that it doesn't take JavaScript number limitations into account.
encode32(+1) === 2
encode32(-1) === 1
encode32(+5) === 10
encode32(-5) === 9
// ↑ Looks good
// ↓ Oops
encode32(-0x80000000) === -1
To avoid this issue and still be fast in the de-/encoding process, I implemented the variants separately to make sure that all edge cases work well!
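One way to handle that edge case is, for example, to keep the intermediate result unsigned with an explicit >>> 0 (a sketch of the idea, not necessarily the exact protobuf-varint code):

// Zig-zag encoding that stays a positive 32-bit value, even for -0x80000000:
function zigzag32 (num) {
  return ((num << 1) ^ (num >> 31)) >>> 0
}

zigzag32(+1)          // 2
zigzag32(-1)          // 1
zigzag32(-0x80000000) // 4294967295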
Wrapping it up
I have a fondness for protocol buffers and the insights I've gained so far while working on the protobuf-varint
implementation. Currently, it's not useful for my work, and chances are it'll remain in its initial state for a while. However, if you find value in this and share the enthusiasm for creating a fast, next-level protobuf implementation, I would be thrilled to collaborate with you on it.
2023 has been quite a year for me, filled with numerous events. OWDDM has become more active, and I was grateful to rejoin JSConf.jp as staff for the first offline event in years. Having danced on the edge of burnout for a while, I wasn't able to wrap up this article before Christmas as well as I'd have liked. Still, I'm happy to have managed this, but in 2024, I need to work on my energy planning.
Thank you for reading! I hope you found it enjoyable, and I wish you all the best in the next year!