Rasmus Ros

Originally published at eignex.com

Building a Compact Encoder on kotlinx.serialization

Over the past few years, I have occasionally found myself writing the same boilerplate: manually packing bits of application state into tight, heavily character-limited strings. That eventually led me to create a library for it called kencode. But first it's story time, then a little explanation of why the underlying tech, kotlinx.serialization, is so cool, and THEN I'll go over kencode.

It all started with URL callback links on an integrated Search Engine Results Page (SERP). In a previous project at Theca, we had built a search engine embedded directly into a client's website. When users clicked a search result, the link first redirected to our servers so we could register telemetry for the click before finally sending them to the actual target page.

This is standard tracking infrastructure stuff. But if enough state can be encoded directly into the URL, the tracking server can bypass an expensive database lookup entirely. In this particular case, we needed to pass the query ID, the user ID, the document ID, and the exact position in the SERP. One database call is not much, but latency does matter for initial impressions.

Having a short URL here is nice: it looks more professional, and there is a limit to how long URLs can be. We also want no special characters in the encoded result. That includes hyphens and underscores, since those would break double-click word selection. Try to select the entire path by double-clicking in this URL and you'll see: https://example.com/hyphen-path. But here it works just fine to select dQw...: https://www.youtube.com/watch?v=dQw4w9WgXcQ since it's a single word.

Then the same encoding problem happened again with Kubernetes pod names. I was dynamically spinning up short-lived jobs and wanted to embed trace IDs somehow. Naturally, this metadata should also be stored in Kubernetes labels so it remains queryable with kubectl. Kubernetes also imposes a strict 63-character limit on names and only allows alphanumeric characters and hyphens. Encoding efficiency becomes a limiting factor here.

Later, I ran into this encoding problem a third time while implementing stateless pagination links for that SERP. Paginating correctly through blended keyword + vector search results meant we had to carry internal ranking state from page to page. This state lived entirely inside a ?next=xxx query parameter, meaning the payload had to be compact, URL-safe, and opaque to the user.

And now, I find myself needing it a fourth time for my current project Eignex. It's an optimization engine for tuning things like model parameters or ranking weights in production. By passing chosen-value state in a token to the front-end and back, we avoid storing a massive user-ID-to-settings map on the back-end.

I realize this is not an everyday problem, but I have now encountered it four separate times. I think the ability to pack complex state into a tiny string is a useful architectural trick. Doing it manually each time is error-prone.

Homer Simpson stuffing a payload into a tight string
Pack lots of structured data into a tight string.

This is where kencode shines. You define a data class and get strong typing directly from the decoded payload:

@Serializable
data class JobState(
    val clientId: Int,
    val batchId: Int,
    val retryCount: Int?,
    val isPriority: Boolean
)

val state = JobState(119, 210, null, true)

val encodedState = EncodedFormat.encodeToString(state)
// This encodes the object into the string:
// 03W8mJ

val decodedState = EncodedFormat.decodeFromString<JobState>(encodedState)

For comparison, the same object in other encodings:

Encoding            Length    Output
JSON                66 chars  {"clientId":119,"batchId":210,"retryCount":null,"isPriority":true}
Protobuf + Base64   10 chars  CHcQ0gEgAQ
kencode (Base62)     6 chars  03W8mJ

kencode is implemented as a custom format on top of kotlinx.serialization, which has quite a different approach to serialization compared to other JVM libraries. Why that is the case requires some context.

Why kotlinx.serialization?

Before libraries like Jackson became the standard, serializing Java objects usually involved writing manual boilerplate. If you need to support multiple formats, like Protobuf in addition to JSON, you will suffer. Manually crafting custom serializers for every single combination of data type and output format (the classic NxM problem) is simply not the way.

To reduce this boilerplate, runtime reflection libraries like Gson and Jackson became popular. Under the hood, when an object is serialized, these libraries inspect the class at runtime to find its fields, their types, and their values. They map these fields to sequential tokens on the fly. This makes standard JSON-focused libraries easy to use, but not necessarily easy to extend.

This sequential model of serialization makes it difficult to create formats that perform aggregate operations on the entire class. kencode relies on exactly this kind of optimization to compact the payload, like grouping all boolean fields and nullability flags into a single bitmask header.

There is also a hard performance ceiling on reflection. Reflection libraries do usually cache the reflection steps, but the issue is not the reflection itself. It's that interpreting these cached steps at runtime is inherently slower than executing statically compiled code. When a reflection library loops over the fields of your class, it essentially calls a method like serializer.write(fieldValue) over and over. Since your fields are all different types, that is a megamorphic call site which the compiler can't inline or optimize well.

This is why kotlinx.serialization takes another approach completely. Instead of relying on reflection at runtime, it generates static serializers at compile time. The approach is similar to Rust's serde framework, allowing for highly optimized serialization without resorting to manual boilerplate.

In kotlinx.serialization, when a class is annotated with @Serializable, a compiler plugin generates a custom KSerializer at build time. For the JobState class above, it produces something like:

// Generated automatically by the @Serializable compiler plugin
object JobStateSerializer : KSerializer<JobState> {

    override val descriptor: SerialDescriptor =
        buildClassSerialDescriptor("JobState") {
            element<Int>("clientId")
            element<Int>("batchId")
            element<Int?>("retryCount")
            element<Boolean>("isPriority")
        }

    override fun serialize(encoder: Encoder, value: JobState) {
        val composite = encoder.beginStructure(descriptor)
        composite.encodeIntElement(descriptor, 0, value.clientId)
        composite.encodeIntElement(descriptor, 1, value.batchId)
        composite.encodeNullableSerializableElement(
            descriptor, 2, Int.serializer(), value.retryCount
        )
        composite.encodeBooleanElement(descriptor, 3, value.isPriority)
        composite.endStructure(descriptor)
    }

    override fun deserialize(decoder: Decoder): JobState {
        // Analogous to serialize, slightly longer because of formats
        // with arbitrary ordering like JSON.
    }
}

serialize just calls typed methods on an Encoder. The KSerializer provides the data shape; the Encoder writes it. That separation is why custom formats are so convenient: a new format only has to implement an Encoder/Decoder pair, and every existing @Serializable class works with it for free.
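
To make that concrete, here is a toy format adapted from the custom-format example in the kotlinx.serialization guide: an Encoder that simply collects every primitive the generated serializer emits into a flat list. PackedFormat's Encoder does the same walk over the descriptor, just writing bits instead of list entries.

```kotlin
import kotlinx.serialization.ExperimentalSerializationApi
import kotlinx.serialization.KSerializer
import kotlinx.serialization.Serializable
import kotlinx.serialization.encoding.AbstractEncoder
import kotlinx.serialization.modules.EmptySerializersModule
import kotlinx.serialization.modules.SerializersModule

@Serializable
data class JobState(
    val clientId: Int,
    val batchId: Int,
    val retryCount: Int?,
    val isPriority: Boolean
)

// Toy format: records each primitive value the serializer emits, in order.
@OptIn(ExperimentalSerializationApi::class)
class ListEncoder : AbstractEncoder() {
    val list = mutableListOf<Any?>()
    override val serializersModule: SerializersModule = EmptySerializersModule()
    override fun encodeValue(value: Any) { list.add(value) }
    override fun encodeNull() { list.add(null) }
}

fun <T> encodeToList(serializer: KSerializer<T>, value: T): List<Any?> =
    ListEncoder().also { serializer.serialize(it, value) }.list
```

Calling `encodeToList(JobState.serializer(), JobState(119, 210, null, true))` produces `[119, 210, null, true]`, with no reflection anywhere: the generated serializer drives the typed `encodeXxxElement` calls, and `AbstractEncoder` funnels them into `encodeValue`.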

How It Works

Let's dive into kencode. I split it into three pieces: a compact binary format, a general byte-to-text encoder, and a small composition layer that turns the whole thing into a normal string format. The binary format and text encoders can be used separately.

1. PackedFormat

PackedFormat is the biggest part of the library. It contains the logic to serialize Kotlin objects into small byte arrays.

The format assumes both sides already agree on the schema. This is a strong assumption and definitely not what you want for persistence or cross-language communication. But when the assumption holds, we save a lot of space by not encoding structural information that both sides already know.

Its other core optimizations:

  • Bitmask headers: boolean fields and nullability markers are packed into a compact bitset header, costing 1 bit per field instead of the usual 1 byte.
  • Merged nested headers: bitmask bits from nested class fields are collected into a single root-level header, eliminating per-class byte-alignment padding that would otherwise be wasted at each nesting boundary.
  • Variable-length integers: standard integer fields waste space because they always consume 4 or 8 bytes, even for small numbers. We shrink them using varint (LEB128) and ZigZag encodings. Varint uses the most significant bit of each byte as a "continuation flag", letting small positive numbers squeeze into a single byte. ZigZag fixes a flaw in plain varint by mapping small negative numbers to small positives (0 → 0, -1 → 1, 1 → 2, -2 → 3) so they pack tightly too. Varint is the default in kencode (and in protobuf); enum ordinals are always varint-encoded automatically.
  • Collection bitmaps: boolean lists and nullable element lists pack their flags into a leading bitmap rather than storing one byte per element.
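
The varint and ZigZag steps above can be sketched in a few lines (a simplified version, not kencode's actual implementation):

```kotlin
// ZigZag: interleave signed ints so small magnitudes become small
// unsigned values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
fun zigZag(n: Int): Int = (n shl 1) xor (n shr 31)

// LEB128 varint: 7 payload bits per byte, most significant bit set
// while more bytes follow. Values below 128 fit in a single byte.
fun writeVarint(value: Int): ByteArray {
    var v = value
    val out = ArrayList<Byte>()
    while (v and 0x7F.inv() != 0) {
        out.add(((v and 0x7F) or 0x80).toByte())
        v = v ushr 7
    }
    out.add(v.toByte())
    return out.toByteArray()
}
```

`writeVarint(119)` takes one byte and `writeVarint(210)` takes two, which is exactly where the JobState integer sizes in the example come from.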

Together these optimizations explain how the JobState example was compacted: the boolean and the nullability flag combine into the header byte, and the two ID integers take one and two varint bytes respectively.

PackedFormat payload layout
How PackedFormat lays out the JobState example.

The header for a flat class is straightforward: one bit per boolean, one bit per nullable field (0 = null, 1 = present), packed into the smallest number of bytes. For JobState that's two bits, so a single byte. Nested classes complicate this because per-class headers waste bits to byte alignment, so kencode merges every non-nullable nested class's bits into one shared header at the very front.
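
As an illustration of the flat case, packing flags into header bytes is a small loop (bit order here is my assumption, not necessarily kencode's):

```kotlin
// Pack one bit per flag (booleans and null/present markers) into the
// smallest number of bytes, LSB-first within each byte.
fun packHeader(flags: List<Boolean>): ByteArray {
    val bytes = ByteArray((flags.size + 7) / 8)
    flags.forEachIndexed { i, set ->
        if (set) bytes[i / 8] = (bytes[i / 8].toInt() or (1 shl (i % 8))).toByte()
    }
    return bytes
}
```

For JobState that's two flags, `isPriority = true` and `retryCount = null`, so `packHeader(listOf(true, false))` yields a single byte.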

This header-first layout requires writing data you haven't processed yet, which standard streaming frameworks can't do without first materializing the whole object into an intermediate tree. Because kencode knows the exact schema via the SerialDescriptor, it skips the tree: beginStructure allocates a small byte array and reserves the right number of header bits, and endStructure flushes the bitmask followed by the buffered field data.

PackedFormat is the layer that actually reduces the payload. Everything after this is really about transport.

2. The Text Layer: ASCII-Safe Codecs

Transporting byte data as text is a common operation and usually handled by Base64. In kencode, we have more ByteEncoding options. Base64 and Base64Url are there mostly for interoperability, and they're a bit faster than the base-N codecs since the encoding is just a simple bit-shuffle. Base85 is useful when density matters more than a conservative character set. The most interesting one is Base62 (also the default). It solves the problem of using non-alphanumeric characters while staying reasonably dense.

BaseRadix handles arbitrary alphabets generically. It works like this: you treat the entire array of bytes as one massive number, divide it by your base (like 62), and map the remainders to characters in your alphabet. It's the same underlying math as converting binary to decimal, just using a custom string of letters and digits. So any alphabet works: Base36 uses only lowercase, and you could also plug in the base-58 alphabet Bitcoin uses to avoid visually ambiguous characters like 0/O and I/l.
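
The naive version of that division loop might look like this (illustrative alphabet and code, not kencode's; it also ignores leading zero bytes, which a real codec must preserve):

```kotlin
import java.math.BigInteger

const val BASE62 = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

fun encodeRadix(bytes: ByteArray, alphabet: String = BASE62): String {
    var n = BigInteger(1, bytes)  // the whole payload as one unsigned number
    val base = BigInteger.valueOf(alphabet.length.toLong())
    if (n == BigInteger.ZERO) return alphabet[0].toString()
    val sb = StringBuilder()
    while (n > BigInteger.ZERO) {
        val (q, r) = n.divideAndRemainder(base)  // remainder picks the next char
        sb.append(alphabet[r.toInt()])
        n = q
    }
    return sb.reverse().toString()  // digits come out least significant first
}
```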

But there's a catch when implementing this. To do that base conversion math, you have to load the raw bytes into a BigInteger. As the payload gets larger, BigInteger division becomes slower, so the naïve version is O(n²). The encoder uses a trick: chunk the input into pre-defined sizes. Instead of processing the whole payload as one giant number, it slices the data into fixed blocks and converts each block individually. This brings the cost back to O(n), just like Base64. You do lose a tiny fraction of a byte to rounding overhead every time a new block starts; 32 bytes turned out to be a good sweet spot.
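
The chunking wrapper itself is a simple loop (a sketch under my own naming; the real codec also pads each block to a fixed character width so the decoder can find the boundaries):

```kotlin
const val BLOCK_SIZE = 32  // bytes per block, the sweet spot from the text

// Convert each fixed-size slice independently so the BigInteger math
// stays constant-cost per block: O(n) overall instead of O(n^2).
fun encodeChunked(data: ByteArray, encodeBlock: (ByteArray) -> String): String {
    val sb = StringBuilder()
    var offset = 0
    while (offset < data.size) {
        val end = minOf(offset + BLOCK_SIZE, data.size)
        sb.append(encodeBlock(data.copyOfRange(offset, end)))
        offset = end
    }
    return sb.toString()
}
```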

The chunking also needs an inverse mapping for the decoder. For a given block, encoding N bytes produces a fixed number of characters M, but because M = ceil(N * 8 / log2(base)) rounds up, multiple byte counts can land on the same character count. So we precompute a lookup that goes the other way (character count back to byte count) so decoding a partial trailing block doesn't have to guess the original length.
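
Assuming the formula is evaluated per block, the forward computation and the inverse lookup might be built like this (hypothetical helper names):

```kotlin
import kotlin.math.ceil
import kotlin.math.log2

// Characters needed to represent n bytes in the given base.
fun charsFor(n: Int, base: Int): Int =
    ceil(n * 8 / log2(base.toDouble())).toInt()

// Precomputed inverse for decoding a partial trailing block:
// character count -> original byte count.
fun inverseTable(maxBytes: Int, base: Int): Map<Int, Int> =
    (1..maxBytes).associateBy { charsFor(it, base) }
```

For a full 32-byte block in Base62, `charsFor(32, 62)` gives 43 characters, which is where the ~1.34 chars/byte figure in the table below comes from.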

The asymptotic cost per input byte falls out of the alphabet size:

Codec    chars / byte  Alphabet
Base36   1.55          0-9 a-z
Base62   1.34          0-9 a-z A-Z
Base64   1.33          0-9 a-z A-Z + 2 symbols
Base85   1.25          85 printable ASCII, incl. punctuation

Base64 and Base62 are nearly tied, with Base64 winning by a hair because its math aligns on bit boundaries. But Base62 buys you an alphanumeric-only output, which is usually the reason you reached for it in the first place.

For a concrete example, here is The quick brown fox jumps over the lazy dog (43 bytes) in each:

Base36    (68): 23qhn8p9aco732ripmr6mhzfrtsmxcxxzjdmm3vgas1xzpdkz80fuvjknh7nfo0s6fdz
Base62    (58): k0YiLeAWe79bmxSBiGjowzAh4fSmcMsLmNNmsSowlyAaaWecFKMVGnsquH
Base64Url (58): VGhlIHF1aWNrIGJyb3duIGZveCBqdW1wcyBvdmVyIHRoZSBsYXp5IGRvZw
Base85    (54): <+ohcEHPu*CER),Dg-(AAoDo:C3=B4F!,CEATAo8BOr<&@=!2AA8c)

At this size Base62 happens to match Base64Url because of where the block rounding lands. On longer payloads Base64 edges ahead by a small constant factor, and Base85 stays the densest at the cost of a much noisier alphabet.

3. The Composition Layer: EncodedFormat

Finally there is EncodedFormat, which is the glue that combines the binary format and a chosen text codec into a single StringFormat. Between those two layers is an optional transform step for arbitrary byte manipulation.

val format = EncodedFormat {
    binaryFormat = PackedFormat { defaultEncoding = IntPacking.SIGNED }
    transform = encryptingTransform
    codec = Base62
}

val token = format.encodeToString(payload)

A PayloadTransform is just a pair of encode/decode functions on a ByteArray. You get the packed bytes, return whatever bytes you want, and the text codec runs on that. Two of them chain together with .then(...).

I mainly added this for encryption. In the Eignex case, the token rides along on the front-end between requests, so it has to be opaque. Wrapping a cipher is basically a few lines:

val encryptingTransform = object : PayloadTransform {
    override fun encode(data: ByteArray): ByteArray = cipher.encrypt(data)
    override fun decode(data: ByteArray): ByteArray = cipher.decrypt(data)
}

The same interface covers a bunch of other useful things: error-correcting codes (wrap Reed-Solomon and you get tokens that survive a couple of mangled characters), compression for larger payloads, or a CRC checksum if you're worried about users truncating tokens they pasted from a log (there's a checksum = Crc16 shorthand for that one).

PackedFormat is for dense transport, not durable storage. If you want something you can persist and evolve more comfortably over time, swap in ProtoBuf instead.

Anyway, that's kencode. Let me know if you find a fifth reason to pack state into a string. Source is at github.com/Eignex/kencode if you want to poke at it.
