Is it fast?
Yes, pretty much. Kaitai Struct is not a runtime interpreter, but a compiler — thus it imposes no additional runtime performance penalty. Code that it generates is about as fast as one can write in a particular language to parse a certain data format.
That said, note that Kaitai Struct is all about producing a clean API for parsing binary data. That means that the general usage plan is:
Create object (structure) in memory and parse stream into it
Use it via API afterwards
This pattern is generally a good fit for most applications, but for some types of workloads you might want a completely different approach: acting on each chunk of the data stream as soon as it is parsed, based on that chunk only. This calls for an event-based parsing model (i.e. you define some code that will be executed on each particular state of the parser), so you'll probably find that other tools like parser combinators, finite-state machine generators or even plain lexer/parser generators suit that approach better than Kaitai Struct. In these cases, Kaitai Struct's "read-then-use" approach might be slower than an event-based "read-and-act-simultaneously" approach.
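To make the contrast concrete, here is a minimal Python sketch of the two models. It assumes a made-up format (a sequence of records, each a one-byte tag followed by a two-byte signed little-endian value); the names `RECORD`, `parse_all` and `parse_events` are hypothetical and are not Kaitai Struct output.

```python
import io
import struct

# Hypothetical record layout: u1 tag followed by an s2le value (3 bytes).
RECORD = struct.Struct("<Bh")


def parse_all(stream):
    """Read-then-use: build the whole structure in memory first."""
    records = []
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        records.append(RECORD.unpack(chunk))
    return records


def parse_events(stream, on_record):
    """Read-and-act: call a handler as soon as each record is parsed."""
    while True:
        chunk = stream.read(RECORD.size)
        if len(chunk) < RECORD.size:
            break
        on_record(*RECORD.unpack(chunk))


data = bytes([1, 0x0A, 0x00, 2, 0xFE, 0xFF])  # records (1, 10) and (2, -2)

# Read-then-use: everything is parsed before any of it is used.
records = parse_all(io.BytesIO(data))
total = sum(value for _tag, value in records)

# Read-and-act: nothing is retained except the running total.
running = {"total": 0}


def add_value(_tag, value):
    running["total"] += value


parse_events(io.BytesIO(data), add_value)
```

Both variants compute the same sum; the difference is that the first keeps every record in memory, while the second only keeps whatever the handler chooses to keep.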
Is output of KS readable?
Yes, the Kaitai Struct compiler generates very human-readable files, which can be examined with the naked eye, debugged if needed, etc. For example, reading a two-byte signed little-endian integer is usually translated into something like:
field = _io.readS2Le();
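As a rough illustration of what such a call does, the following Python sketch reads the same kind of value using only the standard `struct` module. The helper `read_s2le` is written for this example and is only a stand-in; it is not copied from generated code or from the runtime library.

```python
import io
import struct


def read_s2le(stream):
    """Read a two-byte signed little-endian integer from a binary stream."""
    raw = stream.read(2)
    if len(raw) != 2:
        raise EOFError("requested 2 bytes, but only %d available" % len(raw))
    return struct.unpack("<h", raw)[0]


stream = io.BytesIO(b"\x34\x12\xff\xff")
first = read_s2le(stream)   # bytes 34 12 are 0x1234 in little-endian order
second = read_s2le(stream)  # bytes ff ff are -1 as a signed 16-bit value
```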
Does it support writing (generation, serialization) of structures into stream, or only reading (parsing, deserialization) of structures from the stream?
So far Kaitai Struct focuses on reading (parsing) only. There are plans to support writing, but don’t hold your breath for it — it’s a pretty major change and it’ll probably happen after 1.x.
There is a relevant issue in our issue tracker, which sports a proof-of-concept compiler branch that has some writing support (Java only).
Does it support mmap (memory-mapped) to access files?
Some language runtimes support so-called "memory mapped" files. The idea is simple: using an OS-provided mechanism, one maps a certain memory area so that it exactly reflects the contents of a file. After that, one can parse the file by accessing that memory area, just as one would with a normal in-memory buffer.
Right now, mmap support is available:
In Java — by using ByteBufferKaitaiStream
in C# — by:
In all other languages — by invoking mmap operation manually (getting a pointer to in-memory buffer), and then wrapping that buffer into KaitaiStream
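The manual route can be sketched in Python with the standard `mmap` module. This is only an illustration under assumptions: the sample file layout (a two-byte magic followed by a four-byte little-endian integer) is made up, and the mapped buffer is read with `struct` rather than wrapped into KaitaiStream, so that the sketch runs without the runtime library installed.

```python
import mmap
import os
import struct
import tempfile

# Create a small sample file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"KS" + struct.pack("<I", 42))

with open(path, "rb") as f:
    # Length 0 maps the whole file; its size is fixed at this point.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as buf:
        magic = bytes(buf[0:2])                      # copy out the 2-byte magic
        (value,) = struct.unpack_from("<I", buf, 2)  # u4le at offset 2

os.remove(path)
```

In a real project, the mapped buffer `buf` is the object that would be handed to the stream wrapper instead of being read directly.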
Note that memory-mapped files are not "the silver bullet" and have both their pros and cons, namely:
Setting up a memory map is a relatively slow operation compared to simply opening a file. If you want to process lots of small files, chances are the memory-mapped approach will be slower just because of the per-file mmap overhead.
A memory map operation requires specifying the exact file size. The file size must be known a priori and must remain constant for the duration of parsing. That means that one can't use mmap on:
Files that get appended to during parsing, e.g. live packet capture stream files, live log files, etc.
Virtual files of unknown size (such as the majority of Linux procfs / sysfs files)
Concurrent access of different processes to the same file using mmap might be non-trivial.
How does it compare to …
… Python library Construct?
Actually, Construct is the closest analog to Kaitai Struct. It is also a declarative and symmetric binary parsing library, but there are significant differences:
Construct does both parsing and serialization, instead of only parsing (feature #27).
Construct is a Python-only module, instead of supporting multiple languages.
Construct is "declarative" in the sense that it defines data structures instead of parsing code. The structures are still defined using the Python language, instead of YAML.
Construct aims at offering more sophisticated building blocks, including those only available on Python like Pickle and Numpy protocols, instead of most basic/common elements.
In fact, there is an open and active collaboration between Construct and Kaitai Struct. Import/export tools that allow translating schemas between the two frameworks are currently being implemented.
Main documentation site: https://construct.readthedocs.io
… Google Protocol Buffers, ASN.1, Apache Thrift, Apache Avro, BSON, etc?
They’re completely different. The projects mentioned are actually different serialization specifications that map existing data into some sort of extensible binary stream, usually for easy transmission / interchange. The binary representation is driven by the data and encoded according to the particular standard of a given protocol, which usually has a fixed representation for integers, for strings, for arrays, for dictionaries, etc. Most of these projects allow generated formats to be automatically extensible, to carry versioning information, and to automatically embed typing information of some sort.
KS approaches from the other end: given some sort of existing (or planned) binary representation, build a set of classes that the data inside this representation can be held in and build a parser for it. You can’t read an arbitrary binary format (like, for example,
… Cap’N Proto?
Most of the arguments from the previous answer (for Google Protocol Buffers, ASN.1, Apache Thrift, Apache Avro, BSON) apply here as well. Cap’N Proto is not a tool for reading or writing arbitrary formats. Instead, it uses a couple of clever tricks to make serialization and deserialization more efficient (casting binary structures as blocks, not assigning individual fields), but, essentially, it emphasizes content, and offers very limited control over serialization format.
In theory, [Cap’N Proto encoding scheme](https://capnproto.org/encoding.html) is well documented and can be implemented in .ksy to parse Cap’N Proto encoded messages.
… GNU Bison, Yacc, Lex, Flex, etc?
All these tools actually work on parsing text (most usually, source code) using context-free grammars. The core problem they solve is ambiguity of whatever was read. For example, a single letter a might be part of a string literal, part of an identifier, part of a tag name, etc. In most cases, the parsers that they generate have a concept of state and a fairly complex ruleset to change states. On the other hand, binary files are usually structured in a non-ambiguous way: there’s no need to do complex backtracking, re-interpreting everything in a different fashion just because we’ve encountered something near the end of the file. There’s usually no state beyond the pointer in the stream and the pointer in the code that does the parsing.
… SweetScape 010 Editor, Synalysis, Hexinator, Okteta, iBored?
All these tools are advanced hex editors with some sort of template language, which is actually pretty close to the .ksy language. One major difference is that .ksy files, unlike per-editor templates, can be compiled right into parser source code in any supported language.
… Preon?
Both Preon and KS are declarative
Preon is a Java-only library, KS is a cross-language tool
Preon’s data structure definitions are done as annotations inside .java source files, KS keeps structure definitions in separate .ksy files
Preon interprets data structure annotations at runtime, KS compiles .ksy files into .java files first, then they’re compiled normally by the Java compiler as part of the project
Preon supports unaligned bit streams, KS does not (yet)
Format specification: how to …
… use variable-length integer quantities (AKA VLQ, varint, vint, LEB128/ULEB128, 7-bit encoded int, Base-128 encoding)?
In most cases, you can just import an existing implementation from our stdlib.
Typical usage example:
meta:
  id: test_vlq
  imports:
    - /common/vlq_base128_le
seq:
  - id: len
    type: vlq_base128_le
  - id: buf
    size: len.value
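For intuition about what a little-endian base-128 quantity encodes, here is a minimal Python sketch of an unsigned LEB128 decoder. It is an illustration of the encoding only, not the stdlib implementation; the function name `read_vlq_base128_le` is borrowed from the import above purely for readability, and the sample bytes are made up.

```python
import io


def read_vlq_base128_le(stream):
    """Decode an unsigned little-endian base-128 integer (ULEB128)."""
    value = 0
    shift = 0
    while True:
        raw = stream.read(1)
        if not raw:
            raise EOFError("stream ended inside a variable-length integer")
        byte = raw[0]
        value |= (byte & 0x7F) << shift  # low 7 bits carry the payload
        if (byte & 0x80) == 0:           # a clear high bit marks the last byte
            return value
        shift += 7


# Mirrors the usage example: a length, then a buffer of that many bytes.
stream = io.BytesIO(b"\x03abc")
length = read_vlq_base128_le(stream)
buf = stream.read(length)

# A multi-byte quantity: e5 8e 26 decodes to 624485.
big = read_vlq_base128_le(io.BytesIO(b"\xe5\x8e\x26"))
```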
… binary-coded decimals (BCD)?
There’s a lot of variety when it comes to BCD representations:
The number of decimal digits varies
BCDs may use a byte per digit or a nibble (half of a byte) per digit
Endianness: might be little or big
Kaitai Struct stdlibs include a parameterized type bcd which supports the majority of these BCD versions using parameters (available in Kaitai Struct v0.8+):
num_digits: integer, number of digits (valid values: 1..8)
bits_per_digit: integer, number of bits per digit (valid values: 4 or 8)
is_le: boolean, specifies order of digits: true if little-endian, false if big-endian
Typical usage example:
meta:
  id: test_bcd
  imports:
    - /common/bcd
seq:
  - id: len # In stream: 03 02 01 00 00
    type: bcd(5, 8, true)
  - id: buf # Buffer of 123 bytes
    size: len.as_int
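To show how the three parameters interact, here is a small Python sketch of a BCD decoder. It is an illustration rather than the stdlib type: `decode_bcd` is a hypothetical helper, and the assumption that the high nibble comes first in the 4-bit case is made for this example.

```python
def decode_bcd(data, num_digits, bits_per_digit, is_le):
    """Decode BCD bytes into an integer (illustrative helper)."""
    if bits_per_digit == 8:
        digits = list(data[:num_digits])
    elif bits_per_digit == 4:
        digits = []
        for byte in data:
            # Assumption: the high nibble precedes the low nibble.
            digits.extend((byte >> 4, byte & 0x0F))
        digits = digits[:num_digits]
    else:
        raise ValueError("bits_per_digit must be 4 or 8")
    if len(digits) < num_digits:
        raise ValueError("not enough data for the requested digits")
    if any(d > 9 for d in digits):
        raise ValueError("value is not a decimal digit")
    if is_le:
        digits.reverse()  # make the most significant digit come first
    value = 0
    for d in digits:
        value = value * 10 + d
    return value


# The stream bytes from the usage example: 03 02 01 00 00 as bcd(5, 8, true).
length = decode_bcd(bytes([3, 2, 1, 0, 0]), 5, 8, True)
```

With these inputs, `length` is 123, matching the "Buffer of 123 bytes" comment in the usage example.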
If you don’t need to access BCD value as an integer or a string (for example, it is very often used to store serial numbers and identifiers in hardware protocols), consider just treating it as an opaque byte array.