Dexploration: What a default Dex looks like

During the research phase of my Blackhat talk, I was digging into how to detect the default layout of a dex file as generated by the normal dx tool. Originally, I wanted my tool to “stack” things inside the file the same way the Dalvik compiler would, but I couldn’t find any resources on what that layout actually looked like. After a few hours of digging through the AOSP code and tearing apart an actual dex file to look at the innards, I came up with the quick little ASCII diagram below:

+--------------------------------------------------------------------+
| Dex header
| * offsets and sizes of all sections
| - default size 0x70
+--------------------------------------------------------------------+
| String_id_list
| * offsets into data
| - size: number of strings * 4
+--------------------------------------------------------------------+
| Type_id_list
| * index into string_id_list
| - size: number of types * 4
+--------------------------------------------------------------------+
| Proto_id_list
| * index into string_id_list
| * index into type_id_list
| * offsets into data section (params)
| - size: number of protos * 12
+--------------------------------------------------------------------+
| Field_id_list
| * 2 indexes into type_id_list
| * index into string_id_list
| - size: number of fields * 8
+--------------------------------------------------------------------+
| Method_id_list
| * index into Type_id_list
| * index into Proto_id_list
| * index into String_id_list
| - size: number of methods * 8
+--------------------------------------------------------------------+
| Class_def_items
| * 2 indexes into Type_id_list
| * offsets into data for interfaces
| * indexes into Type_id_list
| * index into string_id_list for source file
| * offsets into data for annotation
| * offsets into data for annotation_set
| * offsets into class data for annotation item
| * offsets into data for class_data_items
| * index into method_id
| * offsets into data for static_values
| * offsets into data for code_item
| * offsets into data for debug_item
| - size: number of classes * 32 (0x20)
+--------------------------------------------------------------------+
| Data section (default layout)
| * annotation items
| * code items
| * annotation_directory
| * interfaces
| * parameters - used by proto section
| * strings
| * debug items
| * annotation_sets
| * static values
| * class_data
| * map list
+--------------------------------------------------------------------+
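
The section sizes in the diagram can be checked against a real file by reading the header. Below is a minimal sketch in Python, assuming a little-endian dex file with the standard 0x70-byte header; the helper names `parse_dex_header` and `section_table` are my own for illustration and do not come from dx or any other tool.

```python
import struct

# Names of the twenty 32-bit header fields that follow the magic,
# checksum and SHA-1 signature (which together occupy 32 bytes).
HEADER_FIELDS = [
    "file_size", "header_size", "endian_tag", "link_size", "link_off",
    "map_off", "string_ids_size", "string_ids_off", "type_ids_size",
    "type_ids_off", "proto_ids_size", "proto_ids_off", "field_ids_size",
    "field_ids_off", "method_ids_size", "method_ids_off",
    "class_defs_size", "class_defs_off", "data_size", "data_off",
]

# Fixed size in bytes of one entry in each index section.
ENTRY_SIZES = {
    "string_ids": 4, "type_ids": 4, "proto_ids": 12,
    "field_ids": 8, "method_ids": 8, "class_defs": 32,
}


def parse_dex_header(buf):
    """Return the header fields of a little-endian dex file as a dict."""
    magic, checksum, signature = struct.unpack_from("<8sI20s", buf, 0)
    if magic[:4] != b"dex\n":
        raise ValueError("not a dex file")
    values = struct.unpack_from("<20I", buf, 32)
    header = dict(zip(HEADER_FIELDS, values))
    header["magic"] = magic
    return header


def section_table(header):
    """Return (name, offset, length_in_bytes) for each top-level section."""
    rows = [("header", 0, header["header_size"])]
    for name, entry_size in ENTRY_SIZES.items():
        count = header[name + "_size"]  # the *_size fields are entry counts
        rows.append((name, header[name + "_off"], count * entry_size))
    # data_size is already expressed in bytes
    rows.append(("data", header["data_off"], header["data_size"]))
    return rows
```

Feeding it the bytes of a `classes.dex` (for example `section_table(parse_dex_header(open("classes.dex", "rb").read()))`, where the file name is only a placeholder) should give one row per box in the diagram.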

The output of APKfuscator actually ended up being quite different from the above mapping. It’s definitely possible to retain the structure, but the sections can easily be interchanged. The resulting sections from my tool look like the following:

The sections above the data section keep the same layout as before, but could be shifted around if need be.

...
+--------------------------------------------------------------------+
| Data section (APKfuscator layout)
| * strings
| * parameters (proto section)
| * interfaces
| * annotation items (visibility of item (flags),
| annotation type, number of name/value pairs,
| encoded annotation)
| * class annotations (size of items, offsets to items)
| * annotation data (offset to class annotations,
| fields size, methods size,
| parameters size)
| * code items
| * class data
| * static values
| * debug items (currently stripped)
| * map list
+--------------------------------------------------------------------+

The normal dx compiler appears to always lay out its output the same way, so if someone has run a post-compilation modification tool (e.g. APKfuscator or (bak)smali), it might be possible to see that a dex file has been “changed”. If someone were to develop a tool that looks for patterns in how this data is laid out, it could lead to some interesting results. Detecting these changes and patterns on a large enough scale could be a quick way to find out whether someone has messed with a file. Hopefully I’ll have more time to research this area and either prove or disprove this theory. Until then, hopefully the small ASCII layouts will help someone else with whatever Dalvik research they’re doing.
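
As a rough starting point for such a tool, the sketch below reads the map list, orders the data-section item types by offset, and compares that order with the dx order from the first diagram. It is a sketch under assumptions: the helper names are hypothetical, `ASSUMED_DX_ORDER` is just my diagram transcribed into map-list type names (interfaces and parameters are both `type_list` entries, so they collapse into one) and has not been validated across dx versions, and the file is assumed to be little-endian.

```python
import struct

# Map-list type codes for items that live in the data section.
DATA_ITEM_NAMES = {
    0x1000: "map_list",
    0x1001: "type_list",
    0x1002: "annotation_set_ref_list",
    0x1003: "annotation_set_item",
    0x2000: "class_data_item",
    0x2001: "code_item",
    0x2002: "string_data_item",
    0x2003: "debug_info_item",
    0x2004: "annotation_item",
    0x2005: "encoded_array_item",
    0x2006: "annotations_directory_item",
}

# Assumption: the dx order as drawn in the first diagram, not a verified rule.
ASSUMED_DX_ORDER = [
    "annotation_item", "code_item", "annotations_directory_item",
    "type_list", "string_data_item", "debug_info_item",
    "annotation_set_item", "encoded_array_item", "class_data_item",
    "map_list",
]


def data_layout(buf):
    """Return the data-section item types ordered by their file offset."""
    (map_off,) = struct.unpack_from("<I", buf, 0x34)  # map_off header field
    (count,) = struct.unpack_from("<I", buf, map_off)
    items = []
    for i in range(count):
        # Each map_item is 12 bytes: type, unused, size, offset.
        type_code, _unused, _size, offset = struct.unpack_from(
            "<HHII", buf, map_off + 4 + 12 * i)
        if type_code in DATA_ITEM_NAMES:
            items.append((offset, DATA_ITEM_NAMES[type_code]))
    return [name for _offset, name in sorted(items)]


def matches_assumed_dx_order(layout):
    """True if the item types present appear in the assumed dx order."""
    expected = [name for name in ASSUMED_DX_ORDER if name in layout]
    observed = [name for name in layout if name in ASSUMED_DX_ORDER]
    return observed == expected
```

A file for which `matches_assumed_dx_order(data_layout(buf))` is false would only be a candidate for closer inspection; a mismatch alone does not prove the file was modified.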