dd86k's blog

Machine code enthusiast

Designing the Disassembler Formatter

Author: dd
Published: February 14, 2020
Last modified: December 25, 2022 at 15h02

Debuggers have always amazed me, so why not make one myself?

I’ve wanted to write a debugger and a disassembler ever since I saw the Visual Studio disassembler while debugging a C# application back in 2014. And so I recently started my alicedbg project. Biggest project yet, and a lot to learn!

Let There be Text

Settings!

I said to myself, thinking it was a good solution.

When I first started making the disassembler, there were simple functions to append to the machine code and instruction mnemonic buffers, for example: mnaddf(p, "ADD EAX, %d", 5);. This was perfect for prototyping.

The code looked like this:

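// Append the raw byte to the machine code buffer
if (INCLUDE_MACHINECODE)
	mcaddf(p, "%08X ", *p.addru8);
// Append the mnemonic with its immediate operand (a byte, hence %u)
if (INCLUDE_MNEMONICS)
	mnaddf(p, "INT %u", *p.addru8);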

The decoder was doing 100% of the formatting.

Why include settings? Well, I thought that explicitly choosing whether to output machine code bytes, instruction mnemonics, or both was a good idea. I’ll go more in-depth about this later.

Alas, soon enough, a friend of mine was jokingly eager to see some AT&T syntax. Good thing she said that, because otherwise I would have hand-written the formatting for the entire thing.

An Attempt was Made

I’m keeping the settings!

I claimed once again, thinking settings were still a better solution than operating modes.

So I sat down and started thinking of ways I could implement a module that takes care of formatting automatically. I started calling it the Styling Engine, as it would handle special formatting cases depending on the selected disassembler syntax. It had functions that formatted each item according to its type, so the Intel syntax would leave a register name as is, while the AT&T syntax would prefix it with a percent (‘%’) character.
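To make the register case concrete, here is a minimal sketch in C of what such a styling function could look like. The names syntax_t, SYNTAX_INTEL, SYNTAX_ATT and style_reg are made up for this illustration and are not taken from alicedbg.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical syntax selector; names are assumptions for this sketch. */
typedef enum { SYNTAX_INTEL, SYNTAX_ATT } syntax_t;

/* Write a register name into buf using the selected syntax.
 * Intel leaves the name as is, AT&T prefixes it with '%'.
 * Returns buf so the call can be used inline. */
const char *style_reg(syntax_t syntax, const char *name,
                      char *buf, size_t size) {
	if (syntax == SYNTAX_ATT)
		snprintf(buf, size, "%%%s", name); /* "%%" emits a literal '%' */
	else
		snprintf(buf, size, "%s", name);
	return buf;
}
```

With a 16-byte buffer, style_reg(SYNTAX_INTEL, "eax", buf, sizeof buf) would give eax, and the same call with SYNTAX_ATT would give %eax. Note that the decision is still made eagerly, at the moment the decoder calls the function.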

                      ** Intel **

mov eax, dword ptr es:[eax]
||| |||  ||||||||| ++++++++- style_modrm_rm <----+
||| |||  +++++++++---------- style_modrm_width <-+- style_modrm
||| +++--------------------- style_modrm_reg <---+
+++------------------------- style_mn

                      ** AT&T **

movd %es:(%eax), %eax
|||| ||||||||||  ++++- style_modrm_reg <---+
|||| ++++++++++------- style_modrm_rm <----+- style_modrm
|||+------------------ style_modrm_width <-+
+++------------------- style_mn

The disassembling module was doing 50% decoding and 50% formatting. The code looked like this:

if (INCLUDE_MACHINECODE)
	style_mc_x8(p, *p.addru8);
if (INCLUDE_MNEMONICS)
	style_mn_f(p, "int %s", style_mn_imm(p, *p.addru8));

Not too bad, the decoder is now doing 50% of the formatting.

But then, further down the road, I noticed so, so many issues, especially when it came to the ModR/M and SIB bytes. I had trouble specifying where those cases would be handled, so I wrote them off entirely in the styling module.

I had kept the include settings up to this point, but I noticed that manually typing out both include-setting scopes while also doing the formatting increased implementation time. Boo!

Worst of all, the memory operation size specifier (*d for AT&T, dword ptr for Intel) was going to be cruel to implement with this approach.

I had to go back to the whiteboard.

I’d Rather Imitate Art

So, I went back to my NOTES file (you’ll likely notice it in my .gitignore files) and started thinking: I need a module that lets me format items and insert, move around, and adjust them at will.

Well, what does printf do? Items get pushed onto the stack, and depending on the format specifier (%d, %s, etc.), each item is processed differently. Great! Now, I don’t want the disassembler doing the formatting work, so no format specifiers for the disassembler. That leaves only the pushing part for the decoder, with the formatting done at the very end.

                      ** Intel **

         |---- modr/m ----|
mov eax, dword ptr es:[eax]
||| |||  ||||||||| ||||+++|- format_reg
||| |||  ||||||||| |||+~~~+- format_mem
||| |||  ||||||||| +++------ format_reg
||| |||  +++++++++---------- setting.width
||| +++--------------------- format_reg
+++------------------------- instruction

                      ** AT&T **

   * | modr/m |
movd %es:(%eax), %eax
|||| ||||||||||  ++++- format_reg
|||| |||||++++|------- format_reg
|||| ||||+~~~~+------- format_mem
|||| ++++------------- format_reg
|||+------------------ setting.width
+++------------------- instruction

New code looks like this:

if (p.mode >= DisasmMode.File) {
	disasm_push_x8(p, *p.addru8);
	disasm_push_str(p, "int");
	disasm_push_imm(p, *p.addru8);
}

The decoder is now doing none of the formatting!

Once the decoder completes an instruction, the formatter processes the pushed items, and depending on the requested syntax style, it may process them in a different order.
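The push-then-format idea can be sketched in C as follows. This is only an illustration under my own assumptions: the types (item_t, insn_t) and functions (push_reg, push_imm, format_insn) are hypothetical, not alicedbg's API, and the sketch leaves out machine code bytes, memory operands, and the width specifier.

```c
#include <stdio.h>
#include <string.h>

/* All names here are hypothetical and only mirror the idea. */
typedef enum { SYNTAX_INTEL, SYNTAX_ATT } syntax_t;
typedef enum { ITEM_REG, ITEM_IMM } item_kind_t;

typedef struct {
	item_kind_t kind;
	const char *reg; /* register name, used by ITEM_REG */
	unsigned imm;    /* immediate value, used by ITEM_IMM */
} item_t;

#define MAX_OPERANDS 4

typedef struct {
	const char *mnemonic;
	item_t operands[MAX_OPERANDS]; /* pushed in Intel order: destination first */
	size_t count;
} insn_t;

/* Decoder side: record the item, format nothing. */
void push_reg(insn_t *in, const char *name) {
	if (in->count < MAX_OPERANDS)
		in->operands[in->count++] = (item_t){ ITEM_REG, name, 0 };
}

void push_imm(insn_t *in, unsigned value) {
	if (in->count < MAX_OPERANDS)
		in->operands[in->count++] = (item_t){ ITEM_IMM, NULL, value };
}

/* Append text to a NUL-terminated buffer without overflowing it. */
void append(char *out, size_t size, const char *text) {
	size_t used = strlen(out);
	if (used + 1 < size)
		snprintf(out + used, size - used, "%s", text);
}

/* Formatter side: render a single item in the requested syntax. */
void format_item(const item_t *it, syntax_t syntax, char *out, size_t size) {
	int att = syntax == SYNTAX_ATT;
	if (it->kind == ITEM_REG)
		snprintf(out, size, "%s%s", att ? "%" : "", it->reg);
	else
		snprintf(out, size, "%s%u", att ? "$" : "", it->imm);
}

/* Formatter side: render the whole instruction once decoding is done. */
void format_insn(const insn_t *in, syntax_t syntax, char *out, size_t size) {
	char item[32];
	if (size == 0)
		return;
	out[0] = '\0';
	append(out, size, in->mnemonic);
	for (size_t i = 0; i < in->count; ++i) {
		/* AT&T lists the source first, so walk the operands backwards. */
		size_t k = syntax == SYNTAX_ATT ? in->count - 1 - i : i;
		format_item(&in->operands[k], syntax, item, sizeof item);
		append(out, size, i == 0 ? " " : ", ");
		append(out, size, item);
	}
}
```

Pushing "eax" and then 5 onto an instruction with the mnemonic "mov" renders as mov eax, 5 in the Intel style and as mov $5, %eax in the AT&T style. The decoder code stays the same for both, because the operand order and prefixes are chosen only when format_insn runs.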

This brings these improvements over the older system:

  • Reduced compiled binary size
  • Reduced code size
  • Reduced implementation time
  • Simplified and separated the decoding and formatting phases
    • Tweaking a module is now easier, with less to worry about
  • The formatter works on a “lazy” concept (instead of eagerly formatting things right away)
    • Less CPU time spent in general (no eager formatting of each item type)
    • Maximizes the amount of information available when it is actually time to process things

Phew! I was worried this wouldn’t work out but hey, I’m happy with this little thing.

Always Room for Improvement

This is only the beginning; there is still vast room for improvements, tweaks, and most importantly, optimization! Eventually there will be decoder functions to aid fetching immediate values regardless of the host endianness, because I still want to disassemble x86 binaries on an ARM machine (note: ARM uses little-endian by default, unless configured otherwise), perhaps even my phone? Hurrah for software!
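One common way to fetch immediates independently of host endianness is to assemble them byte by byte. A minimal sketch in C, where the helper names fetch_u16le and fetch_u32le are hypothetical and not part of alicedbg:

```c
#include <stdint.h>

/* Hypothetical helpers: read little-endian immediates from the
 * instruction stream one byte at a time, so the result depends
 * neither on host endianness nor on pointer alignment. */
uint16_t fetch_u16le(const uint8_t *p) {
	return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t fetch_u32le(const uint8_t *p) {
	return (uint32_t)p[0]
	     | ((uint32_t)p[1] << 8)
	     | ((uint32_t)p[2] << 16)
	     | ((uint32_t)p[3] << 24);
}
```

For instance, x86 stores the immediate 0x12345678 as the bytes 78 56 34 12, and fetch_u32le returns 0x12345678 for those bytes on both little-endian and big-endian hosts.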

Moral

In conclusion, don’t be shy about spending as much time as you need designing something and trying things out to see if it all works out. I spent around 3 evenings after work thinking up this system. It’s very likely nothing new to people experienced in the domain of disassembly, but as something I’m learning myself, I’d say I had a good run, and I hope I keep improving in the near future.

The alicedbg project is nowhere near usable at the moment, but once the x86-32 and x86-64 decoders are ready, I do plan to use the debugger myself (at least with its loop UI). The debugging core is already somewhat working on Windows and wherever ptrace(2) is supported (so far using glibc).

Always experimenting with code makes me somewhat of a scientist, doesn’t it?