How to Analyze Assembly Code to Guide Optimization Strategies

Bin2c is a tool for embedding files in C89 compliant source code. The advantages and disadvantages of bin2c over existing methods of embedding files are discussed at length in the below project’s ReadMe. In short: bin2c’s goal is to quickly generate C code that is fast to compile and very compact. As such, using it is more portable but slower than directly embedding the file using ld for example. Compared to xxd -i, its output is much more compact, quicker to generate, and quicker to compile.

How bin2c achieves this level of efficiency is discussed in this post!

Choosing the output

Before we are able to implement the converter, we need to decide which format we are going to produce. xxd -i encodes binary data as an array literal containing hexadecimal integers:

$ echo -n 'Hello World my big floof' | xxd -i 0x48, 0x65, 0x6c, 0x6c, 0x6f, 0x20, 0x57, 0x6f, 0x72, 0x6c, 0x64, 0x20, 0x6d, 0x79, 0x20, 0x62, 0x69, 0x67, 0x20, 0x66, 0x6c, 0x6f, 0x6f, 0x66

This format is quite readable, but it also has a very high overhead needing 6.167 characters to encode one byte of input data. By getting rid of the extra spaces and using decimal numbers instead of hex (getting rid of 0x) we can reduce space requirement down to 3.58 chars/byte. The resulting format is less readable than before:

72,101,108,108,111,32,87,111,114,108,100,32, 109,121,32,98,105,103,32,102,108,111,111,102

We could also consider storing the data as a hex encoded array of eight byte words (uint64_t):

0x48656c6c6f20576f,0x726c64206d792062,0x696720666c6f6f66

This format is even more compact (at 2.4 chars/byte), however it is also really fragile: The memory area would have to be cast to a byte array and the encoding of each hex number depends on the endianess of the target platform.

Bin2c follows an entirely different approach: Instead of encoding the data as a byte array, we encode it as a string! This way, text files can be included almost verbatim; just a few characters like " or \n have to be escaped. Unicode characters and binary can be encoded using escape sequences.

$ echo -n 'Hello World my big floof' | build/bin2c
hHello World my big floof
$ head -c 20 </dev/urandom | build/bin2c
\241dT\000\314\022\241\351\355\377#\352\244\216\334\336\004\002\350\31

This encoding is extremely compact for text (requiring 1.2 chars to encode one byte) and still decently small binary (3.1 chars/byte). Turns out compilers are also 20x-30x faster at parsing this format.

Our first implementation

As usual, when approaching optimization problems, we should focus on writing an obvious, easy-to-read implementation of our solution. This way we will have a baseline to compare our code against. Here it is:

#include <stdio.h>
#include <ctype.h>

void bin2c() {
  for (int inc = getchar(); inc != EOF; inc = getchar()) {
    // Detect characters that need a special escape sequence
    switch ((char)inc) {
      case '\a': fputs("\\a", stdout);     continue;
      case '\b': fputs("\\b", stdout);     continue;
      case '\t': fputs("\\t", stdout);     continue;
      case '\n': fputs("\\n\\\n", stdout); continue;
      case '\v': fputs("\\v", stdout);     continue;
      case '\f': fputs("\\f", stdout);     continue;
      case '\r': fputs("\\r", stdout);     continue;
      case '\\': fputs("\\\\", stdout);    continue;
      case '"':  fputs("\\\"", stdout);    continue;
      default: {} // pass
    }

// Handle printable characters (excluding $, @ and ?)
    if (isprint(inc) && inc != '$' && inc != '@' && inc != '?') {
      putchar(inc);

// Fall back to octal encoding
    } else {
      printf("\\%.3o", inc);
    }
  }
}

This very first version already outperforms XXD, albeit by a relatively small factor of 1.6; I am not entirely sure why this is the case. The hot loop for encoding C source files in XXD seems to call fprintf in a loop while using some complicated logic to choose the format string and supply the delimiter string. This is probably one of the major reasons for xxds slowness; clever code does not tend to perform well. Coders have a hard time reading — and thus optimizing — clever code and sometimes, so do compilers.

It probably also helps that bin2c can skip a call to fprintf in many cases and just use puts or putc directly.

Inlining

Let us now take a look at the assembly produced when compiling the code above with optimization. Not many surprises there: getchar() turns into a call _IO_getc, our puts() calls in the switch statement become fwrite(), putchar turns into _IO_putc, isprint() turns into__ctype_b_loc , and the printf() call stays the same in the assembly.

That is a lot of call instructions — two or three per character encoded and the functions are pretty complex. The stream writing functions need to do error handling, isprint takes the locale into account, printf needs to parse the format string. Optimizing out the bin2c function is going to involve creating alternate implementations of the called functions that are both simpler and can be inlined.

We will have to use manual buffering coupled with fread and fwrite and use setvbuf to disable the built in stream buffering. This way our hot loop just has to write from one buffer and into another, deferring all the complexities of writing to C streams to whenever the buffers are flushed.

isprint() can be replaced with chr >= ' ' && chr <= '~' , if we handle any special cases in the switch statement and octal encoding is really easy to implement because we can get the ascii digit for a numerical value by adding '0' or 48.

#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>

// Not using the normal strcpy here because the `call` instruction emitted would be expensive…
inline size_t b2c_strcpy(char *from, char *to) {
    size_t r = strlen(from);
    strcpy(to, from);
    return r;
}

inline size_t bin2c_single(uint8_t chr, char *out) {
  switch ((char)chr) {
    case '\a': return b2c_strcpy("\\a", out);
    case '\b': return b2c_strcpy("\\b", out);
    case '\t': return b2c_strcpy("\\t", out);
    case '\n': return b2c_strcpy("\\n\\\n", out);
    case '\v': return b2c_strcpy("\\v", out);
    case '\f': return b2c_strcpy("\\f", out);
    case '\r': return b2c_strcpy("\\r", out);
    case '\\': return b2c_strcpy("\\\\", out);
    case '"':  return b2c_strcpy("\\\"", out);
    case '$':
    case '@':
    case '?':
      goto octal;
  }

// isprint, but inline
  if (chr >= ' ' && chr <= '~') {
    out[0] = chr;
    return 1;
  }

octal:
  // Octal digits are three bit long, so we can just a bit mask 0b111
  // and bit shifts to extract each digit
  out[0] = '\\'; // octal encode
  out[1] = (chr >> 6 & 7) + '0';
  out[2] = (chr >> 3 & 7) + '0';
  out[3] = (chr >> 0 & 7) + '0';
  return 4;
}

void bin2c(uint8_t **in, uint8_t *in_end, char **out, char *out_end) {
  // We just work around the 4 byte buffer limitation in the enclosing
  // read/process/write loop…not an issue for our specific application
  assert(out_end-*out >= 4);
  // (hot loop) While data in inbuff & outbuf has 4 free slots
  // (bin2c needs four free slots)
  for (; *in < in_end && out_end-*out >= 4; (*in)++)
    *out += bin2c_single(**in, *out);
}

These changes just gave us an 8x improvement in performance. Take a look at the assembly again; looks less complex, not a single call instruction.

Lookup Table

Having eliminated call instructions is great, as doing so improves cache coherence and means less work to store parameters (fulfilling the calling convention).

In fact, all sorts of jumps and branch instructions are somewhat expensive. Again, this is because the cache coherence will be slightly worse, but also because of the impact on CPU pipelining. Optimally we should not use any conditional execution (if statements, switch statements or ternary operators) in our hot loop for optimum throughput.

Luckily, there is a way to achieve that in our case: Our encoder is just a mapping from one specific input byte to between one and four output bytes. This is not a lot of data so we can simply use a lookup table! Four bytes per value, zero padded — we need 1KB of lookup table; that is small enough to easily fit into most CPU caches!

That should be pretty easy: Use a lookup table with 256 32bit words (uint32_t), lookup the value for the input char, and copy the output value to the output buffer. Unfortunately, because our lookup table uses variable sizes, we still need an efficient way to determine the length of our output sequence. We could, for instance, use strnlen to determine the length (which results in a call instruction being emitted). Failing that we could use a similar pattern as before and implement our own, inlinable strnlen, but that results in a few conditional jumps.

After playing around with various implementations, even writing my own branch free version of strnlen (based on the Bit Scan Reverse instruction), I decided to simply store the string length inside some bits of the lookup table, which happened to be unused.

Then I just made sure that clearing those bits and copying the data to the output was done in four byte blocks (uint32), instead of byte for byte. Even if the output sequence is short, we still copy all four bytes from the lookup table. It might seem like copying fewer bytes would be less work, but that would require us to use conditional execution which is simply not worth it if the payoff is copying just one byte instead of three or four.

#include <stdint.h>
#include <stdlib.h>
#include <assert.h>
#include <string.h>

static inline size_t bin2c_single(uint8_t chr, char *out) {
  extern const char bin2c_lookup_table_[];

  // NOTE: The length is in the two most significant bits of
  // the last character of the output. The length in there is
  // 0..3, our actual value is 1..4 so we need to add 1.
  // In order to store this mask in an endianess independent way
  // we store it a byte array; in order to apply it efficiently
  // we cast to an uint32_t so we can perform a one 4 byte instruction
  // instead of multiple single byte instructions when copying into
  // the output buffer.

  // Lookup table char -> escaped for C.
  // Each code is between one and 4 bytes and padded by zeroes.
  const uint8_t *slot8 = ((uint8_t*)bin2c_lookup_table_) + chr*4;

  // The bit mask for erasing the length information; stored as a byte
  // array instead of a uint32_t literal to avoid endianess issues
  uint8_t mask8[] = { 0xff, 0xff, 0xff, 0x3f };

  // Copy the data from the lookup table and erase length info
  // This really is just `out = slot & mask;`.
  // Both casting to a packed struct and using memcpy are ways to express
  // unaligned 32 bit pointer access; this is faster than doing the copying
  // byte by byte on x86_64
#ifdef __GNUC__ // GCC/Clang optimized version
  struct __attribute__((__packed__)) uint32_noalign {
    uint32_t v;
  };
  ((struct uint32_noalign*)out)->v = ((struct uint32_noalign*)slot8)->v & ((struct uint32_noalign*)mask8)->v;
#else 
  uint32_t mask32, slot32;
  memcpy(&mask32, mask8, 4);
  memcpy(&slot32, slot8, 4);
  slot32 &= mask32;
  memcpy(out, &slot32, 4);
#endif

// Extract length info from lookup
  return ((slot8[3] & 0xc0) >> 6) + 1;
}

void bin2c(const uint8_t **in, const uint8_t *in_end, char **out, const char *out_end) {
  // (hot loop) While data in inbuff & outbuf has 4 free slots
  // (bin2c needs four free slots)
  for (; *in < in_end && out_end-*out >= 4; (*in)++)
    *out += bin2c_single(**in, *out);

}

Using the lookup table strategy yielded another 2.3x speedup. In the assembly we can see why: Just two jump statements left in the hot loop, the code is also much more compact than before and all code in the hot loop is executed every time, so there is no opportunity for the branch predictor to generate incorrect predictions.

The original implementation is still included, by the way, since we need a way to generate the lookup table in the first place.

Conclusions

The optimization strategies discussed in this post demonstrate how reading assembly can give vital clues when trying to improve your code’s efficiency and that even basic operations like calls and jump instructions can slow down your code when used inside the hot portion of your code. We also discussed a common optimization strategy: Using a lookup table.

The greatest gains though can often be reaped from taking a step back and reexamining your approach; changing the format from hex numbers to escaped string not only improved output size, sped up compilation, and laid the foundation for later performance improvements.

Benchmarks

These benchmarks show the optimization progress; taken on a i7–9700K CPU @ 3.60GHz. This contains some more development steps, like fixing unaligned access to uint32_t (which is undefined behavior; fixing this resulted in a performance hit).

xxd                             13.9282  Mb/s
bin2c-0001_initial              24.5181  Mb/s
bin2c-0003_inline_octal         129.244  Mb/s
bin2c-0002_manual_buffer        129.292  Mb/s
bin2c-0005_lookup               165.952  Mb/s
bin2c-0004_inline_isprint       180.759  Mb/s
bin2c-0006_branchless_length    218.387  Mb/s
bin2c-0007_lookup_length        435.427  Mb/s
bin2c-0008_32bit_lookup_access  593.669  Mb/s
bin2c-0009_fix_unaligned        437.677  Mb/s
bin2c-000a_packed_unaligned     553.805  Mb/s
bin2c-000b_discard_b2c_memcpy   555.226  Mb/s

Date:

Fri May 2020 14:09 UTC

Category:

tech

Tags:

Comments

Write a comment.