Part 5: Optimizations: Speeding up the fuzzer

CMPLOG and COMPCOV from Part 4 help AFL++ reach deeper coverage per execution. This part is about getting more executions per second in QEMU mode. For that, we will take advantage of persistent mode.

Why persistent mode

By default, AFL++ in QEMU mode forks a fresh child for every input. Forking is cheap on Linux, but in QEMU the cost is much higher because the emulator state has to be re-initialized too.

Persistent mode keeps a single QEMU process alive and re-runs a chosen address range in a loop, one iteration per input. It no longer forks or re-runs the initialization code for every input. The AFL++ docs say it’s a 2-5x improvement, and we’ll see something in that range on harness_qdecode.

There are three new environment variables to set:

  • AFL_QEMU_PERSISTENT_ADDR is the address where each iteration starts (typically the start of main or whatever per-input function you want to loop on)
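  • AFL_ENTRYPOINT is where the forkserver attaches. Ideally it sits after one-time setup such as our dlopen/dlsym block so the initialization only runs once. Below we start with the beginning of main and rely on the harness’s init guard instead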
  • AFL_QEMU_PERSISTENT_GPR=1 restores general-purpose registers at each iteration so things like argc/argv survive the loop. You will crash otherwise.

We also need to extend AFL_QEMU_INST_RANGES to include the harness binary, because that’s where the persistent and entrypoint addresses live.

Finding the addresses

Disassemble main:

r2 -e scr.color=0 -e asm.bytes=0 -a x86 -b 32 -qc 'aaa; s main; pdf' harness_qdecode | grep -C5 'mov byte.*, 1\|int main\|ret$'
┌ 502: int main (char **envp, int32_t argv);
│ `- args(sp[0x8..0xc]) vars(3:sp[0x1c..0x24])
│           0x00001240      push ebp                 ; <--- beginning of function
│           0x00001241      push ebx
│           0x00001242      push edi
│           0x00001243      push esi
│           0x00001244      sub esp, 0x1c
│           0x00001247      call 0x124c
│           ; CALL XREF from main @ 0x1247(x)
│           0x0000124c      pop ebx
│           0x0000124d      add ebx, 0x2da8
│           0x00001253      mov esi, 1
│           0x00001258      cmp dword [envp], 2      ; if (argc < 2) return 1;
--
│   │││││   0x000012db      call sym.imp.dlsym
│   │││││   0x000012e0      test eax, eax            ; if (log_level) *log_level = 0;
│  ┌──────< 0x000012e2      je 0x12ea
│  ││││││   0x000012e4      mov dword [eax], 0       ; if (log_level) *log_level = 0;
│  ││││││   ; CODE XREF from main @ 0x12e2(x)
│  └──────> 0x000012ea      mov byte [ebx + 0x4c], 1 ; if (!init) { load_lib(); init = 1; }
│   │││││   ; CODE XREF from main @ 0x126a(x)
│   │││└──> 0x000012f1      mov eax, dword [argv]    ; <--- past load_lib
│   │││ │   0x000012f5      mov eax, dword [eax + 4] ; FILE *f = fopen(argv[1], "rb");
│   │││ │   0x000012f8      lea ecx, [ebx - 0x1fec]  ; FILE *f = fopen(argv[1], "rb");
│   │││ │   0x000012fe      mov dword [format], ecx  ; const char *mode
--
│   │││     0x000013cd      add esp, 0x1c
│   │││     0x000013d0      pop esi
│   │││     0x000013d1      pop edi
│   │││     0x000013d2      pop ebx
│   │││     0x000013d3      pop ebp
│   │││     0x000013d4      ret                          ; <--- return
│   │││     ; CODE XREF from main @ 0x1288(x)
│   ││└───> 0x000013d5      mov eax, dword [ebx - 0x14]  ; fprintf(stderr, "dlopen stubs: %s\n", dlerror()); _exit(1);
│   ││      0x000013db      mov esi, dword [eax]
│   ││      0x000013dd      call sym.imp.dlerror         ; fprintf(stderr, "dlopen stubs: %s\n", dlerror()); _exit(1);
│   ││      0x000013e2      mov dword [whence], eax      ; fprintf(stderr, "dlopen stubs: %s\n", dlerror()); _exit(1);

For our speed optimizations we need two things:

  1. The beginning of the function, which is where we’ll set AFL_ENTRYPOINT so that the forkserver starts past __libc_start_main, right in our harness code. The init guard makes sure load_lib() only runs once. We’ll also set AFL_QEMU_PERSISTENT_ADDR to this address for now, since it’s safer than re-using a possibly stale argv from a spot later in the function.
  2. The return instruction. In harness_qdecode that is 0x000013d4.
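
As a cross-check on the radare2 listing, the offset of main can also be read from the symbol table. This is only a sketch: it assumes the harness has not been stripped, and the function name main_offset is made up for illustration.

```shell
# Print the offset of main from `nm`-style lines such as "00001240 T main".
# Assumption: the harness still has its symbol table (not stripped).
main_offset() {
    awk '$3 == "main" && ($2 == "T" || $2 == "t") { print "0x" $1; exit }'
}

# Usage against the real binary:
#   nm ./harness_qdecode | main_offset
```

The return instruction has no symbol of its own, so it still has to come from the disassembly.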

Make sure to add 0x40000000 (the QEMU 32-bit base address) to each offset before exporting them. For example, if main starts at offset 0x1240 and its return instruction is at 0x13d4:

python3 -c "print(hex(0x40000000+0x1240))"
0x40001240
python3 -c "print(hex(0x40000000+0x13d4))"
0x400013d4
export AFL_ENTRYPOINT=0x40001240
export AFL_QEMU_PERSISTENT_ADDR=0x40001240
export AFL_QEMU_PERSISTENT_RET=0x400013d4

Your offsets may be different. Compiler version, optimization, and even minor source changes shift them around.
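
Since the offsets shift between builds, it can help to keep the raw offsets in one place and derive the exported addresses from them. A minimal sketch, assuming the same 0x40000000 base as above; rebase, QEMU_BASE, MAIN_OFF and RET_OFF are hypothetical names, not part of AFL++:

```shell
# Hypothetical helper: add the QEMU load base to a file offset.
# 0x40000000 is the 32-bit base used in this walkthrough.
QEMU_BASE=0x40000000

rebase() {
    printf '0x%x\n' $(( $QEMU_BASE + $1 ))
}

MAIN_OFF=0x1240   # start of main in our build; yours may differ
RET_OFF=0x13d4    # ret at the end of main in our build; yours may differ

export AFL_ENTRYPOINT="$(rebase "$MAIN_OFF")"
export AFL_QEMU_PERSISTENT_ADDR="$(rebase "$MAIN_OFF")"
export AFL_QEMU_PERSISTENT_RET="$(rebase "$RET_OFF")"
```

After a rebuild, only the two *_OFF lines need updating.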

Test with afl-showmap

Before launching afl-fuzz, run a single input through afl-showmap -Q (which drives afl-qemu-trace for us) to confirm nothing crashes and the persistent loop survives one round:

AFL_USE_QASAN=1 \
AFL_QEMU_INST_RANGES=0x08048000-0x082da000,0x40001000-0x40002000 \
AFL_ENTRYPOINT=0x40001240 \
AFL_QEMU_PERSISTENT_ADDR=0x40001240 \
AFL_QEMU_PERSISTENT_RET=0x400013d4 \
AFL_QEMU_PERSISTENT_GPR=1 \
afl-showmap -Q -o /dev/null -- ./harness_qdecode corpus_qdecode/plain.txt
echo $?
afl-showmap++4.41a by Michal Zalewski
[*] Executing './harness_qdecode'...
-- Program output begins --
-- Program output ends --

+++ Program timed off +++
[+] Hash of coverage map: 917471071e6c3e30
[+] Captured 67 tuples (map size 65536, highest value 7, total values 280) in '/dev/null'.
0

Exit code 0 and no error output means the addresses are valid. If you see “forkserver was not found”, the AFL_ENTRYPOINT or AFL_QEMU_PERSISTENT_ADDR is wrong.

After that, we check against all of our corpus files:

AFL_USE_QASAN=1 \
AFL_QEMU_INST_RANGES=0x08048000-0x082da000,0x40001000-0x40002000 \
AFL_ENTRYPOINT=0x40001240 \
AFL_QEMU_PERSISTENT_ADDR=0x40001240 \
AFL_QEMU_PERSISTENT_RET=0x400013d4 \
AFL_QEMU_PERSISTENT_GPR=1 \
afl-showmap -Q -o /tmp/showmap -i ./corpus_qdecode -- ./harness_qdecode @@
echo $?
afl-showmap++4.41a by Michal Zalewski
[*] Executing './harness_qdecode'...
[*] Reading from directory './corpus_qdecode'...
[*] Scanning './corpus_qdecode'...
...
+++ Program killed by signal 11 +++
-- Program output begins --
-- Program output ends --
[+] Processed 7 input files.
[+] Captured 71 tuples (map size 65536, highest value 5, total values 540) in '/tmp/showmap'.

Uh oh! We see a crash when trying to run our corpus. Run the same command again with AFL_DEBUG=1 tacked on in front of it. Now you should see which file caused the crash.

afl-showmap++4.41a by Michal Zalewski
[*] Executing './harness_qdecode'...
...
+++ Program killed by signal 11 +++
[!] WARNING: crashed: ./corpus_qdecode/truncated.txt
-- Program output begins --
-- Program output ends --
[+] Processed 7 input files.
[+] Captured 71 tuples (map size 65536, highest value 5, total values 540) in '/tmp/showmap'.

So the issue is our ./corpus_qdecode/truncated.txt. We may have accidentally found a bug in this program we’re fuzzing, but for now we’ll just:

rm -f ./corpus_qdecode/truncated.txt

Now if you run the command above again, it should check out ok!
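
If you would rather keep the crashing input around for later triage than delete it, the removal step can be turned into a quarantine step. The sketch below is not an AFL++ feature: quarantine_crashers is a hypothetical helper that runs a command once per corpus file and moves aside any file for which the command exits non-zero. It assumes the per-file command reports a crash through its exit status, which is worth confirming for afl-showmap on your version first.

```shell
# Hypothetical helper: run a command once per corpus file and move any
# file that makes it exit non-zero into a quarantine directory.
# Usage: quarantine_crashers <corpus_dir> <quarantine_dir> <cmd> [args...]
# The file path is appended as the last argument of <cmd>.
quarantine_crashers() {
    corpus=$1
    quarantine=$2
    shift 2
    mkdir -p "$quarantine"
    for f in "$corpus"/*; do
        [ -f "$f" ] || continue
        if ! "$@" "$f" >/dev/null 2>&1; then
            echo "failed: $f"
            mv "$f" "$quarantine"/
        fi
    done
}

# Possible usage, with the AFL_* variables from above already exported
# (the quarantine directory name is an assumption):
#   quarantine_crashers ./corpus_qdecode ./corpus_quarantine \
#       afl-showmap -Q -o /dev/null -- ./harness_qdecode
```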

Bake it into the run script

Once the addresses work, drop them into run_fuzz_qdecode_qasan.sh as exports. The actual afl-fuzz invocation doesn’t change. Compare your exec/sec in the AFL++ status screen to a run without persistent mode, and you should see a healthy increase.
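
To compare the two runs with numbers instead of eyeballing the status screen, the execs_per_sec field that afl-fuzz writes to its fuzzer_stats file can be read directly. A small sketch; the function name and the output directory names in the usage comment are assumptions, so substitute whatever your run script passes to -o:

```shell
# Print the execs_per_sec value from an afl-fuzz fuzzer_stats file.
# Lines in that file look like "execs_per_sec     : 1234.56".
execs_per_sec() {
    awk -F: '$1 ~ /^execs_per_sec[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2; exit }' "$1"
}

# Possible usage (directory names are assumptions):
#   execs_per_sec out_baseline/default/fuzzer_stats
#   execs_per_sec out_persistent/default/fuzzer_stats
```

Let both runs go for a few minutes before comparing, since the figure is noisy early on.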

Optional knobs

  • Use stdin instead of a file: File I/O takes up a lot of CPU cycles. Try replacing all the fopen, fclose, etc. with a single fread from stdin into a large buffer and note the performance increase you get. You’ll probably still want to malloc a buffer and memcpy from your large buffer into it before calling the target function. This also lets us move AFL_QEMU_PERSISTENT_ADDR a little further into the function, since we no longer rely on argv being stable.
  • AFL_QEMU_PERSISTENT_CNT determines how many iterations to reuse the same QEMU process before forking a fresh one. Default is 1000. Drop it to 100-500 if your target leaks memory or accumulates state and starts behaving oddly mid-loop. Crank it up to 10000 if the loop is perfectly clean.
  • AFL_QEMU_PERSISTENT_HOOK=/path/to/hook.so bypasses the file-read on each iteration and writes the input straight into the target’s memory. Big additional speedup, but it requires writing a small shared object. See AFL++’s utils/qemu_persistent_hook for an example and more information.
  • AFL_QEMU_PERSISTENT_RET is an explicit end-of-loop address. It is needed when PERSISTENT_ADDR is in the middle of a function rather than at its entry, so QEMU knows where to end its loop. When we start at the beginning of a function, as we do here, QEMU mode automatically picks up the return address and ends the loop there, so setting it is optional.

Next: Part 6: Exercise