Sunday, November 18, 2012

TDD & IDAPython

This is just a short note in which I want to share my experiences with writing test code for (IDAPython) scripts I use and produce on a daily basis.

The case for Test-Driven Development

A while ago, I thought it would be a good idea to advance my coding skills. So I looked around for methodologies that are popular in software development but that I had not tried myself.
Test-Driven Development (TDD) seemed the best candidate to me: I was familiar with the concept of unit testing, but I could not believe that TDD can drive architecture and design decisions.

I started reading the Clean Code series of books by Uncle Bob. The books start out with general advice on how to structure your code so that it is easier to understand and maintain. I consider these books an efficient way to lift your own coding habits to an acceptable level if you plan on publishing code.
In the later chapters the book focusses on TDD. I ported the given example projects to Python (instead of Java, which the book uses) and really tried hard to embrace TDD as the driving method for writing code.

- Well, it hasn't worked out for me so far. ;)
Personally, I still have the impression that TDD slows me down too much when initially implementing functionality. That's because there is only a very limited time frame when doing analyses, and helper scripts are mostly tailored to specific use cases and often not part of the analysis result. So the code has limited value to me.
Additionally, I assumed that refactoring and restructuring would become more painful, as you obviously have to change both production and testing code. But this is a wrong assumption, as I will point out later.
However, I understand the argument that finding & fixing bugs costs more time than preventing them in the first place. But as my projects (helper scripts) usually have a few hundred lines at most and many are even one-shot tools, the overhead does not pay off. For large projects, I would definitely give TDD a shot.

Nevertheless, while trying TDD, I really started to like having tests for my code, for the following reasons:
  • Tests give me increasing confidence instead of the feeling that I'm building a house of cards that may collapse with every addition.
  • Writing tests to fix bugs both documents the bugs and offers valuable insights into my shortcomings when writing the code in the first place. This helps to avoid the same errors in the future.
  • My code itself has become much more modular as I take care to keep it testable. Refactorings have actually become easier.
  • Tests serve as free documentation on how to actually use the code, helping both myself (when looking at my code again after some months) and others.
  • I only have to write tests for the parts of the code that I think are worth covering ("complex" parts), indicated by my having had to think about them for some time before pinning them down.
  • Executing successful tests is quite satisfying.
So I now regularly produce "test-covered" code instead of "test-driven" code, which I'm pretty happy with. I should have done that with IDAscope as well, but I guess I'll add tests for all future bugs I find.

Tests in IDAPython

So how to use this now in IDA? Here is my template file for writing tests:

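import sys
import unittest
import datetime
from test import *

import idautils

class IDAPythonTests(unittest.TestCase):

    def setUp(self):
        pass

    def test_fileLoaded(self):
        assert idautils.GetInputFileMD5() is not None

def main(argv):
    print "#" * 10 + " NEW TEST RUN ## " \
        + datetime.datetime.utcnow().strftime("%A, %d. %B %Y %I:%M:%S") \
        + " " + "##"
    # collect all test_* methods of the test case and run them;
    # a TextTestRunner does not call sys.exit(), so IDA keeps running
    suite = unittest.TestLoader().loadTestsFromTestCase(IDAPythonTests)
    unittest.TextTestRunner().run(suite)

if __name__ == "__main__":
    main(sys.argv)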

In this template we have only one test in our test case "IDAPythonTests" called "test_fileLoaded". Tests to be executed by the Python unittest testrunner use the prefix "test_".
Normally you would not test directly against IDAPython's API as in this example but would rather test your own code through function calls, with your code usually being located in a different file and imported into the test case.
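
To illustrate what testing your own code through function calls can look like, here is a minimal sketch that runs outside of IDA. The helper align_up() and the test case name are hypothetical and only stand in for one of your own helper functions, which would normally live in a separate module and be imported. The sketch is written for a current Python 3 interpreter rather than the Python 2.7 bundled with IDA at the time.

```python
import unittest


def align_up(ea, alignment):
    """Hypothetical helper: round an address up to the next multiple of alignment."""
    return (ea + alignment - 1) // alignment * alignment


class AlignUpTests(unittest.TestCase):

    def test_alreadyAligned(self):
        # an aligned address must not be moved
        self.assertEqual(align_up(0x1000, 0x10), 0x1000)

    def test_roundsUp(self):
        # an unaligned address is rounded up to the next boundary
        self.assertEqual(align_up(0x1001, 0x10), 0x1010)


def run_tests():
    # load the test case explicitly and run it without calling sys.exit()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AlignUpTests)
    return unittest.TextTestRunner().run(suite)


if __name__ == "__main__":
    run_tests()
```

Keeping helpers like this free of IDA calls is what makes them testable without a loaded database.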

You can run this as a script within IDA while a file is loaded for analysis. This allows you to test your code against IDAPython's API on the one hand and to use the contents of the file under analysis for verification on the other hand.
To demonstrate the test's behaviour, here is the output of the above script, first with a file loaded and then without:

Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] 
IDAPython v1.5.5 final (c) The IDAPython Team <>
########## NEW TEST RUN ## Sunday, 18. November 2012 11:49:29 ##
Ran 1 test in 0.010s

########## NEW TEST RUN ## Sunday, 18. November 2012 11:50:06 ##
FAIL: test_fileLoaded (__main__.IDAPython_Tests)
Traceback (most recent call last):
  File "Z:/", line 14, in test_fileLoaded
    assert idautils.GetInputFileMD5() is not None

Ran 1 test in 0.000s

FAILED (failures=1)

Pretty much the IDAPython shell we are used to + the nice output from Python's unit testing framework.

Thursday, November 1, 2012

PKCS detection

While I already announced in my last blog post that the PKCS detection feature was implemented in my private development branch of IDAscope, I wanted to do another technical write-up in order to cover it properly.

I think from now on I'm going to write more about the actual development process and the code produced, in order to share some insights into how to use IDA's Python API.

Feature-wise, this is not a big deal but I thought it would be fun to have this integrated.

How it began...

So this story began last week at where mortis talked to me about IDAscope. He then said that some time ago he used some older tools in order to detect/extract PKCS components from a binary and I told him that it would be actually a nice idea to have this in IDAscope as well as there is malware with signed updates using asymmetric cryptography.

He pointed me to this 2010 script by kyprizel, which was a good starting point. It's based on a tool by Tobias Klein, which he sadly no longer makes available due to German law (the so-called "anti hacking" paragraph §202c).

I looked at kyprizel's script and thought that something like this should be doable in a short amount of time. I changed IDAscope's code quickly but then I remembered that you often find base64 encoded certificate and key data.

I got motivated by that and wanted to cover it as well. On the other hand, I realized that by implementing this feature in kyprizel's way, I would open IDAscope to scanning for arbitrary wildcard signatures, which I saw as a big opportunity. So here we go.

PKCS detection

As written before, the goal of this feature (and blog post) is detecting data fragments that might be involved in asymmetric crypto schemes, such as public/private keys and certificate data. You might know them e.g. from using keys instead of passwords for SSH login.

Here is a random private key (PEM format), easily generated on the shell with:
pnx@box:~/tmp/keys$ openssl genrsa -out privkey1024.pem 1024
resulting in:

and the corresponding public key (PEM format), obtained via openssl, once again:
pnx@box:~/tmp/keys$ openssl rsa -in privkey1024.pem -pubout -out pubkey1024.pem
-----END PUBLIC KEY-----

One thing that might get one's attention is that both keys' base64 representations shown here start with "MI". The reason for this is that both keys are not encrypted and this way provide a hint to the underlying data structures, in this case Distinguished Encoding Rules (DER), specified in X.690. That's the point where we can now dive into PKCS standards. :)

First off, I won't go into details but focus on the things needed to implement a working detection of these data structures. For more information, there is a good Phrack article from 1998 by Yggdrasil on PKCS #7.

Distinguished Encoding Rules (DER)

DER features a type system that can be used to encode the elements of which keys and certificates consist. RFC 3447 tells us how to use these elements to specify the above private and public keys. I'll continue with the public key because it's shorter and suffices for the example.

First, here is a hexdump of the base64-decoded public key above:
0000000: 3081 9f30 0d06 092a 8648 86f7 0d01 0101  0..0...*.H......
0000010: 0500 0381 8d00 3081 8902 8181 00d8 3add  ......0.......:.
0000020: 1181 da08 aa0b b59d c1de 324a 9e24 d73a  ..........2J.$.:
0000030: c452 9f33 d50e 3a5f 7b7d 72c3 b68b 797e  .R.3..:_{}r...y~
0000040: 979d fc42 eb47 c193 7162 8ad6 aa2d c376  ...B.G..qb...-.v
0000050: a565 47cc b34a b8b7 cbdd edfe 056d 9512  .eG..J.......m..
0000060: bfe3 ec83 5b48 ca25 d76e f9eb 6e4e 534e  ....[H.%.n..nNSN
0000070: 0f97 2741 6a2e 4cfb 53c3 b1c8 c8d1 f87c  ..'Aj.L.S......|
0000080: a030 bdff dcfa baa1 8646 92b2 c5d1 792a  .0.......F....y*
0000090: 1e64 16ab d105 d281 7d05 16e8 e102 0301  .d......}.......
00000a0: 0001                                     ..

According to the RFC, an RSAPublicKey is a SEQUENCE of two INTEGERs (ASN.1 notation):
RSAPublicKey ::= SEQUENCE {
          modulus           INTEGER,  -- n
          publicExponent    INTEGER   -- e
}

Looking at the DER specification, we see that elements are usually specified as tag-length-value (TLV) tuples. A SEQUENCE is such an element and starts off with a fixed tag-byte of 0x30, which matches the 1st byte of the above hexdump.
The 2nd byte is the length byte. DER length bytes have a special encoding depending on the length they shall express. If the target length is less than 128 bytes (and thus expressible in 7 bits), the byte itself specifies the length. This covers bytes 0x00 - 0x7f.

If the length is equal to or above 128 bytes, bit 7 is set (thus setting the length byte to 0x80 or above) and bits 0-6 specify the number of bytes that immediately follow the length byte and hold the actual length.
In the above hexdump, we can see that the length byte is 0x81. First, this indicates that the length is 128 bytes or more. Second, it indicates that the length is stored in one additional byte. This is the third byte of the dump, 0x9f, showing that there are 159 bytes in this SEQUENCE.
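
The length rules just described can be sketched in a few lines of Python 3. The function name read_der_length() is made up for this illustration; the input bytes are the first five bytes of the hexdump above (30 81 9f 30 0d).

```python
def read_der_length(data, offset):
    """Decode the DER length field at data[offset].

    Returns a tuple (length, offset of the first value byte).
    """
    first = data[offset]
    if first < 0x80:
        # short form: the byte itself is the length
        return first, offset + 1
    # long form: bits 0-6 give the number of length bytes that follow
    num_bytes = first & 0x7F
    if num_bytes == 0:
        raise ValueError("indefinite length is not allowed in DER")
    length = int.from_bytes(data[offset + 1:offset + 1 + num_bytes], "big")
    return length, offset + 1 + num_bytes


header = bytes.fromhex("30819f300d")
print(read_der_length(header, 1))  # outer SEQUENCE, long form:  (159, 3)
print(read_der_length(header, 4))  # inner SEQUENCE, short form: (13, 5)
```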

The third part of the tuple is the actual value of the SEQUENCE.

In this case it is beginning with another SEQUENCE (indicated by the 4th byte of the dump, 0x30) of length 0x0d (5th byte of the dump, 13 bytes) and value 06092a864886f70d0101010500.

The first byte of this inner SEQUENCE is again a tag-byte. 0x06 indicates an OBJECT IDENTIFIER. Its length is 0x09 (9 bytes) and the value is 0x2a864886f70d010101, which translates to 1.2.840.113549.1.1.1 (rsaEncryption, read this for more details on decoding). The remaining 2 bytes of the sequence, 0x0500, are a NULL element, which again is a TLV with tag 0x05 and length 0x00, having no actual value.
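
For the curious, the translation of the OBJECT IDENTIFIER value into its dotted form can be sketched as follows (Python 3). The function name decode_oid() is made up for this illustration, and the sketch only handles the common case where the first arc is 0 or 1, which holds for rsaEncryption.

```python
def decode_oid(value):
    """Translate the value bytes of a DER OBJECT IDENTIFIER to dotted notation.

    Simplification: assumes the first arc is 0 or 1, so the first byte
    encodes the first two arcs as 40 * arc1 + arc2.
    """
    arcs = [value[0] // 40, value[0] % 40]
    current = 0
    for byte in value[1:]:
        # every byte contributes 7 bits; bit 7 set means "more bytes follow"
        current = (current << 7) | (byte & 0x7F)
        if not byte & 0x80:
            arcs.append(current)
            current = 0
    return ".".join(str(arc) for arc in arcs)


print(decode_oid(bytes.fromhex("2a864886f70d010101")))  # 1.2.840.113549.1.1.1
```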

Now that we have handled the first part of the sequence, we can move on, starting with the 19th byte and a new element. This time, the tag-byte 0x03 indicates a BIT STRING of length 0x8d (141 bytes). The first value-byte in a BIT STRING signals how many bits at the end of the BIT STRING are unused, in our case 0x00 = zero bits.

The BIT STRING encapsulates another SEQUENCE, as the 23rd byte (hexdump position 0x16, value 0x30) tells us. The length is indicated by the next two bytes and set to 0x89 (137 bytes).

The first element in this SEQUENCE is a new type that we haven't seen before, indicated by tag-byte 0x02. This is an INTEGER element of length 0x81 (129 bytes). Remembering what we learned from the RFC about RSAPublicKey, this is now finally our modulus n! But we generated a 1024 bit (=128 byte) key via OpenSSL before, so why is this INTEGER of length 129 bytes? The explanation is simple: DER INTEGERs are signed, and since the most significant bit of our modulus is set (first byte 0xd8), a leading 0x00 byte is prepended to mark it as a positive number.
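
The effect of that leading byte is easy to reproduce in Python 3. The helper der_integer_bytes() is a made-up name for this sketch and only covers non-negative numbers.

```python
def der_integer_bytes(number):
    """Value bytes of a DER INTEGER for a non-negative number."""
    # one extra bit is needed for the sign, so a set top bit forces a 0x00 prefix
    length = number.bit_length() // 8 + 1
    return number.to_bytes(length, "big")


# the first two bytes of the modulus, read as a signed number without the prefix
print(int.from_bytes(bytes.fromhex("d83a"), "big", signed=True))    # -10182
# with the leading zero byte the value stays positive
print(int.from_bytes(bytes.fromhex("00d83a"), "big", signed=True))  # 55354

print(len(der_integer_bytes(1 << 1023)))  # 129 bytes for a 1024 bit modulus
print(der_integer_bytes(65537).hex())     # 010001
```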

The second element of the SEQUENCE is another INTEGER (beginning at hexdump position 0x9d, value 0x02), the publicExponent of length 0x03. Its value is 0x010001=65537, which is pretty standard for keys generated with OpenSSL.

That's basically the complete walkthrough of this DER-encoded public key. Just to wrap up, we have:
SEQUENCE
  SEQUENCE
    OBJECT IDENTIFIER   <- rsaEncryption
    NULL
  BIT STRING
    SEQUENCE            <- RSAPublicKey
      INTEGER           <- modulus
      INTEGER           <- publicExponent
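
The manual walkthrough can also be reproduced with a small recursive TLV walker (Python 3). This is only a sketch: the names read_tlv() and walk() are made up, only the five tags seen above are named, and it assumes that every BIT STRING wraps another DER structure, which is true for this key but not in general. The bytes are copied from the hexdump above.

```python
PUBKEY_DER = bytes.fromhex(
    "30819f300d06092a864886f70d010101"
    "050003818d0030818902818100d83add"
    "1181da08aa0bb59dc1de324a9e24d73a"
    "c4529f33d50e3a5f7b7d72c3b68b797e"
    "979dfc42eb47c19371628ad6aa2dc376"
    "a56547ccb34ab8b7cbddedfe056d9512"
    "bfe3ec835b48ca25d76ef9eb6e4e534e"
    "0f9727416a2e4cfb53c3b1c8c8d1f87c"
    "a030bdffdcfabaa1864692b2c5d1792a"
    "1e6416abd105d2817d0516e8e1020301"
    "0001"
)

TAG_NAMES = {0x02: "INTEGER", 0x03: "BIT STRING", 0x05: "NULL",
             0x06: "OBJECT IDENTIFIER", 0x30: "SEQUENCE"}


def read_tlv(data, offset):
    """Return (tag, value bytes, offset of the next element)."""
    tag = data[offset]
    length = data[offset + 1]
    value_start = offset + 2
    if length & 0x80:
        # long form: the low 7 bits give the number of length bytes
        num_bytes = length & 0x7F
        length = int.from_bytes(data[value_start:value_start + num_bytes], "big")
        value_start += num_bytes
    return tag, data[value_start:value_start + length], value_start + length


def walk(data, depth=0, lines=None):
    """Collect one indented line per element, descending into containers."""
    lines = [] if lines is None else lines
    offset = 0
    while offset < len(data):
        tag, value, offset = read_tlv(data, offset)
        lines.append("  " * depth + TAG_NAMES.get(tag, "tag 0x%02x" % tag))
        if tag == 0x30:
            walk(value, depth + 1, lines)
        elif tag == 0x03:
            # assumption: skip the unused-bits byte and parse the rest as DER
            walk(value[1:], depth + 1, lines)
    return lines


print("\n".join(walk(PUBKEY_DER)))
```

Run on the key above, it prints the same nesting of SEQUENCE, OBJECT IDENTIFIER, NULL, BIT STRING and INTEGER elements that we derived by hand.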

Deriving signatures

Okay, as we have seen, the binary DER format with its TLV elements gives us multiple points we can attack with a signature. Back to Python, I have decided to attack the transition from the inner SEQUENCE to its first INTEGER, arriving at the following binary signature for a 1024bit public key:
{VariablePattern("30 81 ? 02 81 81"): "PKCS: Public-Key (1024 bit)"}
Just to explain, VariablePattern is a simple type derived from str, just to indicate to IDAscope that we have a variable pattern of hexbytes that may contain wildcards such as "?", feedable into the great
idaapi.find_binary(start_ea, end_ea, pattern, radix=16, direction=SEARCH_DOWN)
in order to search in our IDB. Radix is set to 16 because our pattern consists of hexbytes, and SEARCH_DOWN equals the value 1.

Signatures for other bit lengths of public keys look much the same, just with adjusted TLV length fields. Private keys simply have another INTEGER (being zero) in front of the modulus, indicating that it is a 2-prime key:
{VariablePattern("30 82 ? ? 02 01 00 02 81 81"): "PKCS: Private-Key (1024 bit)"}
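
To try such wildcard signatures outside of IDA, their matching can be approximated with a regular expression over raw bytes (Python 3). This is not how find_binary() works internally; pattern_to_regex() and find_pattern() are made-up names for this sketch, and "?" is treated as exactly one arbitrary byte.

```python
import re


def pattern_to_regex(pattern):
    """Compile a pattern like "30 81 ? 02 81 81" into a bytes regex."""
    parts = []
    for token in pattern.split():
        if token == "?":
            parts.append(b".")  # wildcard: any single byte
        else:
            parts.append(re.escape(bytes([int(token, 16)])))
    # DOTALL so that "." also matches the byte 0x0a
    return re.compile(b"".join(parts), re.DOTALL)


def find_pattern(data, pattern):
    """Return the offsets of all non-overlapping matches of pattern in data."""
    return [match.start() for match in pattern_to_regex(pattern).finditer(data)]


# start of the RSAPublicKey SEQUENCE from the hexdump, behind four padding bytes
public_blob = b"\x00" * 4 + bytes.fromhex("30818902818100d8")
print(find_pattern(public_blob, "30 81 ? 02 81 81"))  # [4]

# made-up header of a 1024 bit private key: SEQUENCE, version 0, modulus
private_blob = bytes.fromhex("3082025c02010002818100")
print(find_pattern(private_blob, "30 82 ? ? 02 01 00 02 81 81"))  # [0]
```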


However, if we look back to where we came from, we had base64-encoded structures and not the plain binary as shown in the hexdump.

Scanning for those binary keys is straightforward. But to make the base64-encoded ones searchable in IDA as well, we need some extra effort.

I decided to temporarily map all potentially base64 encoded strings into the memory space of IDA, more precisely in a bonus segment.

We can easily get all strings that allow base64-decoding by looping over the names, checking if they are ASCII strings and attempting to decode them via trial and error:
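def getDecodedBase64Strings(self):
    decoded_names = []
    for name in idautils.Names():
        # name is a tuple (address, name); only look at ASCII string items
        flags = idaapi.GetFlags(name[0])
        if not idaapi.isASCII(flags):
            continue
        ascii = idc.GetString(name[0])
        try:
            decoded = ascii.decode("base64")
            decoded_names.append(decoded)
        except Exception:
            # not decodable as base64 (or no string at this address), skip it
            continue
    return decoded_names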
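
The trial-and-error decoding itself can be tried without IDA. The following Python 3 sketch takes a plain list of strings instead of IDA's names; decode_base64_candidates() is a made-up name, and strict validation is my own choice here to keep ordinary strings with punctuation out.

```python
import base64
import binascii


def decode_base64_candidates(strings):
    """Return the decoded bytes of all strings that are valid base64."""
    decoded = []
    for text in strings:
        try:
            # validate=True rejects characters outside the base64 alphabet
            blob = base64.b64decode(text, validate=True)
        except (binascii.Error, ValueError):
            continue
        if blob:
            decoded.append(blob)
    return decoded


candidates = ["Hello, world!", "MIGfMA0GCSqGSIb3", "kernel32.dll"]
for blob in decode_base64_candidates(candidates):
    print(blob.hex())  # 30819f300d06092a864886f7
```

Note that short ordinary words can also happen to be valid base64, so decoding alone is only a filter; the actual detection still comes from the signature scan over the decoded bytes.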

I decided to create a new Segment and put all decoded strings there:
# any currently unused space will do
start_ea = 0x1000
# we need enough space to fill in all our base64-decoded strings.
end_ea = 0x1000 + sum(map(len, decoded_names))
# new segment shall have paragraph alignment and public access
idc.AddSeg(start_ea, end_ea, 0, True, idc.SA_REL_PARA, idc.SC_PUB)
offset = start_ea
for name in decoded_names:
    for byte in name:
        idc.PatchByte(offset, ord(byte))
        offset += 1
The decoded strings are now directly searchable with find_binary() as described before.

In IDAscope I'm doing a bit more in order to extract the exact positions where the keys (base64 or not) sit in the binary.
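
One possible way to map a hit in the temporary segment back to the string it came from is to remember where each decoded blob starts. The following Python 3 sketch is not the IDAscope implementation; build_layout(), source_of_hit() and the example addresses are made up for illustration.

```python
import bisect


def build_layout(decoded_strings, start_ea=0x1000):
    """Lay out decoded blobs back to back, as in the temporary segment.

    decoded_strings is a list of (source address, decoded bytes) tuples.
    Returns (segment bytes, start address of each blob, source addresses).
    """
    starts, sources, chunks = [], [], []
    offset = start_ea
    for source_ea, blob in decoded_strings:
        starts.append(offset)
        sources.append(source_ea)
        chunks.append(blob)
        offset += len(blob)
    return b"".join(chunks), starts, sources


def source_of_hit(hit_ea, starts, sources):
    """Return the source address of the blob that contains hit_ea."""
    # the last blob starting at or before the hit is the one containing it
    index = bisect.bisect_right(starts, hit_ea) - 1
    return sources[index]


decoded = [(0x401000, b"A" * 10), (0x402000, b"B" * 20)]
segment, starts, sources = build_layout(decoded)
print(hex(source_of_hit(0x1003, starts, sources)))  # 0x401000
print(hex(source_of_hit(0x100c, starts, sources)))  # 0x402000
```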

Currently supported are the following PKCS structures:
  • unencrypted RSA public key 512bit - 8192bit
  • unencrypted RSA private key 512bit - 8192bit
  • X.509 Certificates

Test Case

What would a blog post be without a demonstration? I took a Kelihos / HLUX sample (MD5: 14ff8123f58df1ec4a49afe70c84723b) which has proven quite good for testing lately.
It has 5600+ functions (huge!) and features a lot of crypto signature hits:

Detection of two RSA 2048bit public keys in HLUX.
Among those hits are two base64-encoded public keys that have been detected by IDAscope. You can see that those also start with "MI", and if you have read the whole post, you can deduce at this point that this has to be related to the 0x3081/0x3082 (SEQUENCE tag plus long-form length byte) with which the binary data starts.
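
The "MI" deduction can be checked quickly in Python 3. The first two base64 characters only depend on the first 12 bits of the data, and those are identical for 0x3081 and 0x3082:

```python
import base64

# first three bytes of the public key from the hexdump: SEQUENCE, long form, 159
print(base64.b64encode(bytes.fromhex("30819f")))  # b'MIGf'
# SEQUENCE with a two-byte length field, as used by larger structures
print(base64.b64encode(bytes.fromhex("308201")))  # b'MIIB'

# whatever the third byte is, both headers always encode to an "MI" prefix
always_mi = all(
    base64.b64encode(bytes([0x30, second, third])).startswith(b"MI")
    for second in (0x81, 0x82)
    for third in range(256)
)
print(always_mi)  # True
```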

Other IDAscope Changes

There have been a lot of refactoring steps in the codebase that are not visible from the outside. I will likely go on with that, with the goal of moving towards a point where you can use IDA+IDAscope without its GUI, basically by using an IDAscope API that allows for further automation.

A minor change has happened to FunctionInspection with another button in the toolbar:
There are now two "fix code options" in Function Inspection
The "Fix unknown code to functions" option has been split up. There is now one button (plain plus sign) for only converting those undefined code regions that start with a valid function prologue and a second button (double plus sign) that will try to fix all code to functions.

The reason for the split-up is that converting all code can mess up the number of functions pretty badly, while looking for function prologues produces only a very limited number of hits.

Next Steps

Right after telling Alex about the extended signature detection with wildcards, he asked me if YARA support in IDAscope could be an option. I'm still thinking about it but I definitely see the advantages as it would allow the easy reuse of existing signature databases. So this might come to the agenda.

Marco Ramilli blogged about IDAscope some days ago and suggested building a more extended "behaviour analysis" upon the existing tagging feature. We had already experimented with this "static sandbox" idea before, but the code is still too experimental to find its way into the production branch. So this is also a potential feature for the future.

At and in the slideset I showed an idea of improving the visualization of functional relationship. This is also something I want to work on in the near future as I believe that this would really aid the reversing workflow by providing more overview.

So stay tuned for the future development and as always, write us mails to
if you want to give us feedback or submit ideas for development.