Focus of this post will be a new widget that I call Crypto Identification.
Now you may say "oh no, yet another crypto detection tool?" Well, yes, but before you stop reading let me introduce you to an approach you might find useful.
Heuristics-based crypto detection by code propertiesAbout 2 years ago, during literature research on network protocol reverse engineering, I came across an interesting paper called "Dispatcher: Enabling Active Botnet Infiltration using Automatic Protocol Reverse-Engineering" by Juan Caballero et al. Besides the description of an approach on how to identify and dissect message buffers into protocol fields, it contains a section on automated detection of cryptographic routines ("Detecting Encoding Functions", p. 10).
The main idea is pretty straight forward:
- Evaluate the ratio of arithmetic/logic instructions related to all instructions in a function.
Assumption: Cryptographic functions usually consist mainly of
arithmetic/logic instructions, thus they should have a higher ratio.
- If the function has a size of 20 or more instructions, flag the function as encoding function.
I use the following set of arithmetic/logic instructions, please tell me if I missed something:
- ["add", "and", "or", "adc", "sbb", "and", "sub", "xor", "inc", "dec", "daa",
"aaa", "das", "aas", "imul", "aam", "aad", "salc", "not", "neg", "test", "sar",
|IDAscope: Crypto Identification widget|
- Range of Arithmetic/Logic Rating: The above mentioned ratio of
arithmetic/logic instruction to total instructions, but calculated on basic
block level instead of function level.
- Considered Basic Block Size: Only blocks having a size within the
boundaries are taken into concern.
- Allowed Calls in Function: Number of calls allowed from the function
containing the analyzed basic block to any other code location. This is
based on the assumption that most actual cryptographic/compression
functions are "leafs" in the overall program flow graph, not having any
The two checkboxes give further ways to refine the search:
- Exclude zeroing Instructions: This can be used to reduce false positives
that may distort the rating. You will often find instructions like xor eax, eax
or sbb eax, eax being used to clear register contents. However, they would
normally be included in the calculation of the rating because XOR is in the
set of arithmetic/logic instructions.
- Group Results by Functions: This is just an alternative display method,
giving a better overview on how many suspicious blocks are contained in
the same functions.
In the above screenshot (from a current Citadel sample with 724 functions) you can see that the candidate blocks have been reduced to 23 out of 9526 basic blocks. The filters are set to show only blocks with a rating of above 30%, with a size of 10 or more instructions and 1 or 0 call instructions. 23 blocks is a number small enough for me to look at in just a few minutes, identifying the relevant parts in a very short amount of time.
Among the 23 blocks is the following one:
|Citadel's modified stream cipher.|
While this may be a weak example because the block is easily identified by searching for exactly this hash, you probably get the idea on how to use the widget.
The heuristic also successfully identifies all the other crypto parts in the sample like the AES and CRC32 algorithms.
If you wonder about how you get double-sliders in Qt (because it is not a standard widget): The idea and code of this widget called "BoundsEditor" is adapted from Enthought's TraitsUI, which luckily is open-source software. I took the code and reduced it back to a standard Qt widget, having a great and compact control element to adjust my parameters.
Signature-based crypto identificationThe second part of the widget does what you might have expected in the first place. It simply uses a set of constants in order to find well-known cryptographic algorithms. It's basically inspired by tools like the IDA findcrypt plugin or the KANAL plugin for PEiD. It does the same job, except being directly coupled to IDA and allowing to instantly jump to the code locations referencing the identified constants.
The following screenshot (from an old but gold conficker sample) shows both types of matches:
- [black] referenced by: constant somewhere (e.g. data section), referenced by code.
- [red] referenced by: constant immediately used in code, just as shown in the basic block to the left.
- ADLER 32
- MD2, MD5, MD6
- PKCS (various initialization values)
- RC2, RC5
Other changes to IDAscopeTo conclude this post, I want to briefly discuss some more changes I did to IDAscope since the last post.
- In my last post, I mentioned that the WinAPI widget only worked against the
offline data from the Windows SDK. This is no longer the case, as it now
supports doing online lookups (controllable by a checkbox) in the case it
does not find local information. This is great because by that, the missing
documentation of CRT and NTDLL functions are now also covered. Parsing
of the MSDN webpage can be optimized but works for now.
- Hotkey support for widgets. As an example, [CTRL+Y] will now look up the
currently highlighted identifier (in IDA View) in the WinAPI widget and
change focus to this widget.
- More changes under the hood, data structures, refactoring, etc. I feel that
the code is better organized and easier to understand now.
- Experimental code for visualizing the function relationship starting by
Thread start addresses (cmp. Alex' last blog post).