Recently I spent an afternoon reverse-engineering a few packed and obfuscated malware binaries because I was curious about the tactics and methods they used. I want to share some of my notes on the techniques these malware programs used, along with some of my own analysis techniques and a few of the scripts that helped me. Most of the obfuscation techniques fall under polymorphism, which has been known for years, even decades. I want to show you how a few scripting and graphing tools can ease the burden of de-obfuscating and understanding these malware binaries.
Fake Exports
Yes, fake exports. In this case, the executable was exporting multiple entries, which is unusual for an executable. The malware exports arbitrary addresses inside the binary under random names, and IDA (Interactive DisAssembler) is so sure that each exported address starts a separate function that it breaks up the control flow of the real function.
Illustration 1: A function split by exported entry
It's very hard to remove a function in IDA when it's exported, and the fake exports tend to have random names. So I wrote an IDAPython script that searches for functions whose names were not auto-generated by IDA and removes them.
The script picks out any function whose name is not "start" and does not begin with the default prefix "sub_". [Update: I got a comment that you can also use the GetEntryPointQty()/GetEntryXXX() IDA API to achieve this in a more stable fashion.]
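The name filter described above can be sketched in plain Python. This is not the original IDAPython script: `is_auto_generated` and `suspicious_exports` are hypothetical helper names, the export name in the sample is made up, and the function list is assumed to have been collected already as (address, name) pairs.

```python
# Sketch only: the name filter from the text, with no IDA API calls.
AUTO_PREFIXES = ("sub_",)   # default prefix IDA gives to unnamed functions
KEEP_NAMES = {"start"}      # the real entry point must survive


def is_auto_generated(name):
    """Return True for names that should be kept ("start" or "sub_...")."""
    return name in KEEP_NAMES or name.startswith(AUTO_PREFIXES)


def suspicious_exports(functions):
    """Given (address, name) pairs, return the ones that look like fake exports."""
    return [(ea, name) for ea, name in functions if not is_auto_generated(name)]


functions = [
    (0x401000, "start"),
    (0x401020, "sub_401020"),
    (0x401037, "xkqjzt"),   # made-up random export name
]
fake = suspicious_exports(functions)
```

Inside IDA, each flagged address would then be handed to whatever function-removal call your IDAPython version provides. A filter this simple would also flag functions you renamed by hand, so it is best run on a fresh database.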
Visual Interference with NOP Repetitions
As I continued through the disassembly, I found a repeating pattern that had no meaning at all. The pattern pushed a register, modified that same register, and then popped it off again, so the register ended up unchanged. In effect, this is just a NOP pattern. It was inserted all over the place, which made the analysis very tiresome.
The NOP pattern looked like the following.
Illustration 2: Repeating NOPs
Actually, the "db 3eh" byte is not valid according to IDA, but the processor didn't care. [Update: Many folks pointed out that "db 3eh" doesn't mean the byte is invalid; "meaningless" is the more correct word.] So, all of the red-boxed regions are simply meaningless overhead put there to make the code more overwhelming.
To make analysis easier, I simply replaced all such instances with real NOPs, and I felt a lot better after I did. I used a script that searches for the "50 3e 0f c8 58" byte pattern, which is the hex representation of the NOP parts, and patches in real NOPs (0x90).
Here's what I got after running the script.
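The search-and-patch step can be sketched on a raw byte buffer, outside of IDA. This is not the original script; `patch_junk` is a hypothetical name and the sample bytes are invented. The bytes appear to decode as push eax, a 3e prefix in front of bswap eax, and pop eax, so this pattern only covers the eax variant; other registers would need their own patterns.

```python
# Sketch only: overwrite the junk sequence with NOPs in a raw byte buffer.
JUNK = bytes.fromhex("503e0fc858")   # push eax / 3e prefix + bswap eax / pop eax
NOP = 0x90


def patch_junk(data, pattern=JUNK):
    """Return (patched_bytes, offsets) with every pattern hit replaced by NOPs."""
    buf = bytearray(data)
    offsets = []
    pos = buf.find(pattern)
    while pos != -1:
        # Same length in, same length out, so no other offsets shift.
        buf[pos:pos + len(pattern)] = bytes([NOP]) * len(pattern)
        offsets.append(pos)
        pos = buf.find(pattern, pos + len(pattern))
    return bytes(buf), offsets


# Invented sample: two real instructions and a ret, with junk in between.
sample = bytes.fromhex("8bc1" "503e0fc858" "03c2" "503e0fc858" "c3")
patched, hits = patch_junk(sample)
```

Because the replacement has the same length as the pattern, jump targets and other offsets in the binary stay valid after patching.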
Illustration 3: NOPs converted to real NOPs
Basic Block Chunks
The malware had a lot of chunked code. Malware often includes it in abundance because it is widely known that IDA doesn't deal with it well. Here's an example of the chunked code: it is split into small pieces scattered through the binary and linked together with jmp instructions.
Illustration 4: Chunked code using jmps
When you look at it in the graph view, it's almost impossible to decipher. IDA failed to create a useful graph.
Illustration 5: IDA is especially poor at dealing with code chunks.
So I wrote a script called IDAGrapher to do the analysis correctly even with chunked basic blocks. Here's the graph that the tool generated. [Update: A reader pointed out that you can fix the xrefs to make IDA show correct graphs, but in my case that didn't work for some reason. My grapher doesn't use any additional algorithm to draw the graphs correctly, so I suppose the IDA graph functionality is not perfect, or is sometimes not adequate for use.]
Illustration 6: So it looks much better
It uses the power of the Graphviz graph tool (http://www.graphviz.org/) to generate the graph file.
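As a rough illustration of the idea, and not the actual IDAGrapher code, the sketch below folds chunks that are linked only by unconditional jmps into a single node and then emits Graphviz DOT text. Collapsing the chunks is my choice for this example; the block names, `merge_chunks`, and `to_dot` are all hypothetical, and the control-flow edges are assumed to have been extracted already.

```python
# Sketch only: merge jmp-linked chunks and emit Graphviz DOT text.
def merge_chunks(blocks, jmp_only):
    """blocks: {name: [successors]}; jmp_only: names ending in an unconditional jmp.

    A block ending in an unconditional jmp is folded into its target when the
    target has no other predecessor, so a chain of chunks becomes one node.
    """
    blocks = {name: list(succs) for name, succs in blocks.items()}
    jmp_only = set(jmp_only)
    changed = True
    while changed:
        changed = False
        preds = {}
        for name, succs in blocks.items():
            for succ in succs:
                preds.setdefault(succ, []).append(name)
        for name in list(blocks):
            succs = blocks[name]
            if name not in jmp_only or len(succs) != 1:
                continue
            target = succs[0]
            if target != name and target in blocks and preds.get(target) == [name]:
                blocks[name] = blocks.pop(target)   # inherit the target's exits
                if target not in jmp_only:
                    jmp_only.discard(name)          # merged block no longer ends in jmp
                changed = True
                break                               # recompute predecessors
    return blocks


def to_dot(blocks, name="cfg"):
    """Render {name: [successors]} as DOT source text."""
    lines = ["digraph %s {" % name, "    node [shape=box];"]
    for src in sorted(blocks):
        lines.append('    "%s";' % src)
        for dst in blocks[src]:
            lines.append('    "%s" -> "%s";' % (src, dst))
    lines.append("}")
    return "\n".join(lines)


# Invented example: A jmps to B, B jmps to C, and C branches to D or E.
chunks = {"A": ["B"], "B": ["C"], "C": ["D", "E"], "D": [], "E": []}
merged = merge_chunks(chunks, jmp_only={"A", "B"})
dot = to_dot(merged)
```

The resulting text can be saved to a file and rendered with the Graphviz `dot` command, for example `dot -Tpng cfg.dot -o cfg.png`.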
I started with malware that was packed and whose unpacking routine was obfuscated, and I de-obfuscated the unpacking routine. Next I wanted to find out where the original routine starts.
Here's a portion of the whole graph. The whole graph is huge; however, I just needed to concentrate on the green blocks, which are terminal blocks. A terminal block ends with a return or with a jmp to a location held in a register or on the stack. I may not be able to determine where control goes next by reading the code, but I should be able to determine it dynamically. This makes each green block a strong candidate for the jump to the OEP (original entry point).
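The definition of a terminal block can be sketched as a simple check on the text of a block's last instruction. This is an assumption about how such blocks might be picked out, not the tool's actual logic; `is_terminal`, `terminal_blocks`, and the sample block names are hypothetical, and only 32-bit general-purpose register names are covered.

```python
# Sketch only: flag "terminal" blocks from the text of their last instruction.
REGISTERS = {"eax", "ebx", "ecx", "edx", "esi", "edi", "ebp", "esp"}


def is_terminal(last_insn):
    """True if the instruction returns or jumps to a register/memory-held target."""
    parts = last_insn.lower().split(None, 1)
    if not parts:
        return False
    mnemonic = parts[0]
    operand = parts[1].strip() if len(parts) > 1 else ""
    if mnemonic in ("ret", "retn", "retf"):
        return True
    if mnemonic == "jmp":
        # jmp eax, jmp dword ptr [esp+4], ...: the target is only known at run time
        return operand in REGISTERS or "[" in operand
    return False


def terminal_blocks(blocks):
    """blocks: {block_name: last instruction text}; returns sorted terminal names."""
    return sorted(name for name, insn in blocks.items() if is_terminal(insn))


# Invented sample blocks.
blocks = {
    "loc_401000": "jmp loc_401050",
    "loc_401050": "retn",
    "loc_401060": "jmp eax",
    "loc_401070": "jz short loc_401000",
    "loc_401080": "jmp dword ptr [esp+4]",
}
terminal = terminal_blocks(blocks)
```

The blocks flagged this way are the ones worth a breakpoint, since their real destination only shows up when the code runs.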
Here's the zoomed-in shot.
Illustration 8: The block that returns to original routines
We can see that it's returning, but sometimes malware just pushes a jump address onto the stack and uses "ret*" instructions to achieve code jumps.
So I placed a breakpoint on the specific instruction and executed the binary using the IDA debugger. Here's what I got:
Illustration 9: The point before the jump to original routines
After that, if we "step over" the instruction by pressing the F8 key, we get this.
So "sub_416616" might be the first function called inside the packed binary. To make sure, I looked at the same location in the original packed binary, which looked like the following.
So I can be sure that the binary is unpacked now.
Tracing through the program step by step without any idea of the control flow is very tedious and time-consuming. The graphical view generated by the simple script showed me where to put a breakpoint to catch the OEP. In this case, I just had to look into the green blocks.
The source code for this IDA graphing tool is a little bit big, so I put it on Google Code. You can grab the code here (http://code.google.com/p/idagrapher/). The code is far from complete; it's really just a proof of concept. You can modify the code for your own situation.
Security Researchers: Matt Oh