Part 5: Optimizations: Speeding up the fuzzer

CMPLOG and COMPCOV from Part 4 help AFL++ get deeper coverage per execution. This part is about getting more executions per second in QEMU-mode. For that, we will take advantage of persistent mode.

Why persistent mode

By default, AFL++ in QEMU mode forks a fresh child for every input. Forking is cheap on Linux, but in QEMU the cost is much higher because the emulator state has to be re-initialized too.

Persistent mode keeps a single QEMU process alive and re-runs a chosen address range in a loop, one pass per input. It no longer forks or re-runs initialization code for every input. The AFL++ docs say it’s a 2-5x improvement, and we’ll see something in that range on harness_qdecode.

There are three new environment variables to set:

  • AFL_QEMU_PERSISTENT_ADDR is the address where each iteration starts (typically the start of main or whatever per-input function you want to loop on)
  • AFL_ENTRYPOINT is where the forkserver attaches. We set it to the start of main below, and the harness’s init guard keeps the dlopen/dlsym initialization from running more than once
  • AFL_QEMU_PERSISTENT_GPR=1 restores general-purpose registers at each iteration so things like argc/argv survive the loop. You will crash otherwise.

We also need to extend AFL_QEMU_INST_RANGES to include the harness binary, because that’s where the persistent and entrypoint addresses live.

Finding the addresses

Disassemble main:

$ r2 -e scr.color=0 -a x86 -b 32 -qc 'aaa; s main; pdf' harness_qdecode | grep -C5 'mov byte.*, 1\|int main\|ret$'
┌ 502: int main (char **envp, int32_t argv);
│ `- args(sp[0x8..0xc]) vars(3:sp[0x1c..0x24])
│           ; harness_qdecode.c:38    int main(int argc, char **argv) {
│           0x00001240      55             push ebp                    
│           0x00001241      53             push ebx
│           0x00001242      57             push edi
│           0x00001243      56             push esi
│           0x00001244      83ec1c         sub esp, 0x1c
--   ...    ...             ...            ...
│   │││││   0x000012db      e810feffff     call sym.imp.dlsym
│   │││││   ; harness_qdecode.c:35:9    if (log_level) *log_level = 0;
│   │││││   0x000012e0      85c0           test eax, eax
│  ┌──────< 0x000012e2      7406           je 0x12ea
│  ││││││   ; harness_qdecode.c:35:31    if (log_level) *log_level = 0;
│  ││││││   0x000012e4      c70000000000   mov dword [eax], 0
│  ││││││   ; CODE XREF from main @ 0x12e2(x)
│  ││││││   ; harness_qdecode.c:41:35    if (!init) { load_lib(); init = 1; }
│  └──────> 0x000012ea      c6834c0000..   mov byte [ebx + 0x4c], 1
│   │││││   ; CODE XREF from main @ 0x126a(x)
│   │││└──> 0x000012f1      8b442434       mov eax, dword [argv]       ; harness_qdecode.c:0:35
│   │││ │   ; harness_qdecode.c:43:21    FILE *f = fopen(argv[1], "rb");
│   │││ │   0x000012f5      8b4004         mov eax, dword [eax + 4]
│   │││ │   ; harness_qdecode.c:43:15    FILE *f = fopen(argv[1], "rb");
│   │││ │   0x000012f8      8d8b14e0ffff   lea ecx, [ebx - 0x1fec]
│   │││ │   0x000012fe      894c2404       mov dword [format], ecx     ; const char *mode

You’re looking for two offsets:

  1. The beginning of the function, which is where we’ll set AFL_ENTRYPOINT so that the forkserver starts after __libc_start_main, right at our harness code. The init guard makes sure load_lib() only runs once.
  2. The start of the per-iteration work. Skip past the initialization/load_lib part and find where the testcase gets loaded from a file into memory. In this output that is 0x000012f1, just after the init block exits and before fopen.

Make sure to add 0x40000000 (the QEMU 32-bit base address) to each offset before exporting them. For example, if main starts at offset 0x1240 and the post-init instruction is at 0x12f1:

$ python3 -c "print(hex(0x40000000+0x12f1))"
0x400012f1
$ export AFL_QEMU_PERSISTENT_ADDR=0x400012f1
$ python3 -c "print(hex(0x40000000+0x1240))"
0x40001240
$ export AFL_ENTRYPOINT=0x40001240

Your offsets may be different. Compiler version, optimization, and even minor source changes shift them around.
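Since the offsets are likely to move, it can help to keep the arithmetic in the shell so it is easy to redo. A minimal sketch, assuming the offsets from the disassembly above; the variable names QEMU_BASE, MAIN_OFF, and LOOP_OFF are only illustrative and not something AFL++ reads:

```shell
# Assumed values from this walkthrough; replace with your own offsets.
QEMU_BASE=0x40000000   # base address added to each offset, as described above
MAIN_OFF=0x1240        # offset of main (forkserver entry point)
LOOP_OFF=0x12f1        # offset of the first post-init instruction

# Shell arithmetic accepts hex literals; printf turns the sum back into hex.
AFL_ENTRYPOINT=$(printf '0x%x' $(($QEMU_BASE + $MAIN_OFF)))
AFL_QEMU_PERSISTENT_ADDR=$(printf '0x%x' $(($QEMU_BASE + $LOOP_OFF)))
export AFL_ENTRYPOINT AFL_QEMU_PERSISTENT_ADDR

echo "AFL_ENTRYPOINT=$AFL_ENTRYPOINT"
echo "AFL_QEMU_PERSISTENT_ADDR=$AFL_QEMU_PERSISTENT_ADDR"
```

After a rebuild, only the two offset variables need to change.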

Test with afl-qemu-trace

Before launching afl-fuzz, run a single input through afl-showmap in QEMU mode (-Q), which drives afl-qemu-trace for us, to confirm nothing crashes and the persistent loop survives one round:

AFL_USE_QASAN=1 \
AFL_QEMU_INST_RANGES=0x08048000-0x082da000,0x40001000-0x40002000 \
AFL_ENTRYPOINT=0x40001240 \
AFL_QEMU_PERSISTENT_ADDR=0x400012f1 \
AFL_QEMU_PERSISTENT_GPR=1 \
afl-showmap -Q -o /dev/null -- ./harness_qdecode corpus_qdecode/plain.txt
echo $?

Exit code 0 and no error output means the addresses are valid. If you see “forkserver was not found”, the AFL_ENTRYPOINT or AFL_QEMU_PERSISTENT_ADDR is wrong.
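If the test run fails, one quick check is whether both addresses actually fall inside AFL_QEMU_INST_RANGES, since the harness range has to be included. The helper below is a hypothetical sketch, not an AFL++ tool; it assumes the comma-separated START-END format used above and treats END as exclusive, which is an assumption:

```shell
# in_ranges ADDR RANGES: succeed if ADDR lies inside any START-END pair
# of a comma-separated list. END is treated as exclusive (an assumption).
in_ranges() {
    addr=$(($1))
    old_ifs=$IFS
    IFS=,
    for r in $2; do
        start=$((${r%-*}))   # text before the dash
        end=$((${r#*-}))     # text after the dash
        if [ "$addr" -ge "$start" ] && [ "$addr" -lt "$end" ]; then
            IFS=$old_ifs
            return 0
        fi
    done
    IFS=$old_ifs
    return 1
}

# Values from this walkthrough; substitute your own.
RANGES=0x08048000-0x082da000,0x40001000-0x40002000
in_ranges 0x40001240 "$RANGES" && echo "entrypoint is inside the ranges"
in_ranges 0x400012f1 "$RANGES" && echo "persistent address is inside the ranges"
```

If either line prints nothing, widen the harness range before looking for other causes.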

Bake it into the run script

Once the addresses work, drop them into run_fuzz_qdecode_qasan.sh as exports. The actual afl-fuzz invocation doesn’t change. Compare your exec/sec in the AFL++ status screen to a run without persistent mode, and you should see a healthy increase.
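To compare the two runs without watching the status screen, the execs_per_sec field in each run's fuzzer_stats file can be read directly. A rough sketch: the output directory names out_fork and out_persistent are hypothetical, and the default/ subdirectory assumes afl-fuzz was started without -M or -S:

```shell
# execs_per_sec FILE: print the execs_per_sec value from a fuzzer_stats file.
execs_per_sec() {
    awk -F: '$1 ~ /^execs_per_sec *$/ { gsub(/ /, "", $2); print $2 }' "$1"
}

# Hypothetical output directories: one run without persistent mode, one with.
for run in out_fork out_persistent; do
    stats="$run/default/fuzzer_stats"
    if [ -f "$stats" ]; then
        echo "$run: $(execs_per_sec "$stats") execs/sec"
    else
        echo "$run: no fuzzer_stats found"
    fi
done
```

Let both runs go for a few minutes on the same corpus before comparing, since the rate fluctuates early on.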

Optional knobs

  • Use stdin instead of a file: File I/O takes up a lot of CPU cycles. Try replacing the fopen, fclose, etc. calls with a single fread from stdin into a large buffer, and note the performance increase you get. You’ll probably still want to malloc a buffer and memcpy from the large buffer into it before calling the target function.
  • **AFL_QEMU_PERSISTENT_CNT** determines how many iterations to reuse the same QEMU process before forking a fresh one. Default is 1000. Drop it to 100-500 if your target leaks memory or accumulates state and starts behaving oddly mid-loop. Crank it up to 10000 if the loop is perfectly clean.
  • **AFL_QEMU_PERSISTENT_HOOK=/path/to/hook.so** bypasses the file-read on each iteration and writes the input straight into the target’s memory. Big additional speedup, but it requires writing a small shared object. See AFL++’s [utils/qemu_persistent_hook](https://github.com/AFLplusplus/AFLplusplus/tree/stable/utils/qemu_persistent_hook) for an example and more information.
  • **AFL_QEMU_PERSISTENT_RET** is an explicit end-of-loop address. When the loop starts at the beginning of a function, QEMU mode automatically picks up the return address and ends the loop there. Since our AFL_QEMU_PERSISTENT_ADDR is in the middle of a function rather than at its entry, we need to set this so QEMU knows where to end its loop.

Next: Part 6: Exercise