[GUIDE] Reversing of insertions for ARM-based mobile phones

Katerina Dekha

Forum posts: 11

Aug 19, 2015, 5:24:05 PM via Website

Aug 19, 2015 5:24:05 PM via Website

Author: Apriorit (Research and Reverse Engineering Team)

Permanent link: www(dot)apriorit(dot)com/dev-blog/76-reversing-of-mobile-phones-insertions

Once we faced the need to investigate how Samsung cellular phones work; we required some information from them, which is not documented (and will never be, for sure). So this article is about interesting points our reverser had met while working with Samsung cellular phones firmware.

Reversing of Insertions for ARM-based Mobile Phones

I have managed to research insertions of all Samsung' generations, including CDMA (except for the smartphones only). In every Samsung phone the ARM-compatible processor with a set of ARM7TDMI commands is used. Insertions are built on the basis of three OS: RTCX, RTK, Nucleus, and compiled on different compilers. I have seen insertions compiled on ADS (SDT) and IAR.
On forums people call Samsung's generations in different way: somebody divides them into Gumi/Suvon (2 cities in Korea), others give code names - "Sysol", "Agere", "VLSI", "Conexant" and "Ancient". I have come to a conclusion that it's more correctly to divide them according to the phone processor.

Processor Models

OM6357 (aka Sysol) E100, E700, E720, E800, E820, S50x, X100, X460, X60x
M46 (aka Conexant) A100, A110, A200, A300, A400, M100, T208
SkyWorks (aka Conexant) C100, C108, C110, P510, P518
ONE-C (aka VLSI) R2XX, Nxxx, Txxx(except for T208)
Trident (aka Agere) Dxxx, Qxxx, Sxxx(except for S50x), Vxxx, C200, E105, E310, E400, E600, E710, E810, X105, X400, X42x, X450 etc.
MSMxxxx all CDMA

Hope, I haven't made any mistakes in this list. )

According to the list, insertions within the same generation are very similar and, to be honest, sometimes they are twins at all (with extremely slight changes). For example, in X100 there are obvious traces of E100/E700/X600 - why then there is a code for working with the second display, camera and IRDA, which it didn't have in a whole life?

Naturally, OS is the same for the whole generation:

OM6357 - RTK
M46 - RTCX OSE
SkyWorks - RTCX OSE
ONE-C - RTK
Trident - Nucleus
MSMxxxx - don't know exactly, it might be any OS from Qualcomm. It's just clear that they are collected to ADS/SDT.

If you are going to investigate the low level, then SDK from corresponding OS will be to the point. Another helpful thing is the symbolical information, which can be met in some insertions archives. Sometimes you can come across the insertions with .lst, .sym, .map, .out files, containing the information, extremely useful in researches. In particular, such files occur in almost all C100, S500 insertions. When talking about the other models, the situation is worse and you have to content yourself with symbols signatures, made for insertion of the same generation. For example, for M46 I have managed to find just one insertion with symbols and it was from A110. But signatures made from it perfectly lie down on A200, A300 etc.
Interpretation of the symbolical information

MAP format

.map files contain the information on modules included in the insertion and look like

Base Size Type RO? Name
0 20 CODE RO AAA_vectors from object file obj/isr.o
20 38e8 CODE RO C$$code from object file../../src/t9latin.o
3908 30 CODE RO C$$code from object file obj/mmi_date.o
3938 5a4 CODE RO C$$code from object file hw_slow.o
3edc 874 CODE RO C$$code from object file rtkgo.o

etc.
where

Base - displacement in an insertion file.
Size - length.
Type - region type.
RO? - region access type.
Name - original file name, part of which was included in the insertion

How all this can be interpreted? For example, this way: starting with displacement 20, there is a block of the code (CODE) 38e8 length - it's an access to Read Only block. The fact that block has CODE attribute is far from being means that the WHOLE area is filled with a code. Actually, it is a code plus data, just as if the block has DATA type it does not mean that it is necessary to make it all by data in IDA.

Without the names/symbols file this information can be used only for determination of insertion code size (i.e. to not get into the graphics). Therefore, we will better examine SYM format.

SYM Format

.sym files are the mines of information. They look like:

Symbol Table
AAA_vectors$$Base 000000
AAA_vectors$$Limit 000020
VectorMap$$Base 1006a3c
VectorMap$$Limit 1006a60
isr$$Base 12774c
isr$$Limit 127bb0
gl_MaskIT 1000078
Rtk_RegionCount 100564c
rtk_WorthItSched 10056a0
Rtk11_Schedule 11f5c8

etc.

It is a little bit easier here, because the name-address correspondence exists. But as for the addresses, there are some secrets - a set of names exists, containing $ sign and having the special status. Symbols with $$Base at the end indicate the beginning of virtual address space area, $$Limit indicates the end. I.e. here we have the information on segments. It is possible to make a memory map of these segments and see how the parts of binary code are being thrown to different addresses. Building memory map should be started with such symbols:

Image$$RO$$Base 000000
Image$$RO$$Limit 1afef4
Image$$RW$$Base 1000000
Image$$RW$$Limit 107dad4
Image$$ZI$$Base 1006a60
Image$$ZI$$Limit 107dad4

RO - Read Only, indicates code addresses.
RW - Read/Write i.e. it is RAM.
ZI - Zero Initialized. RAM, which is being stuffed with zero values when mobile phone is turned on.

Thus segments can be easily created on these addresses. Now we look further:

AAA_vectors$$Base 000000
AAA_vectors$$Limit 000020
C$$code$$Base 000020
C$$code$$Limit 127310
C$$code$$__call_via$$Base 127310
C$$code$$__call_via$$Limit 127320
Example$$Base 127320
Example$$Limit 127324
HAL_boot$$Base 127324
HAL_boot$$Limit 12735c
RtkCode$$Base 12735c
RtkCode$$Limit 127408
SysSupportCode$$Base 127408
SysSupportCode$$Limit 12744c
boot$$Base 12744c
boot$$Limit 127654
clib$$Base 127654
clib$$Limit 12774c
isr$$Base 12774c
isr$$Limit 127bb0
C$$constdata$$Base 127bb0
C$$constdata$$Limit 1afef4
C$$data$$Base 1000000
C$$data$$Limit 1005a38
Stacks$$Base 1005a38
Stacks$$Limit 1006a3c
VectorMap$$Base 1006a3c
VectorMap$$Limit 1006a60
C$$zidata$$Base 1006a60
C$$zidata$$Limit 107dad4

In this interesting way they go one after another. If you wish, it is possible to divide them into segments to corresponding addresses, but this is merely a logic division. Moreover, in .sym file these lines are scattered badly. And more sooner or later a question appears: why the code size is 1afef4, if length of insertion file is 1b6950? Where to put the rest 6a60 byte? We look again on the initial memory map:

Image$$RW$$Base 1000000
Image$$RW$$Limit 107dad4
Image$$ZI$$Base 1006a60
Image$$ZI$$Limit 107dad4

RAM ends on 107dad4 address, block 1006a60-107dad4 is zero initialized, hence there is a question: what does initialize the 1000000-1006a60 block, which size is exactly 6a60? Absolutely right, those odd bytes. If analyse the OS start code, then in the RAM initialization procedure you will find the same copying.

In the newer insertions there is a chance to come across the next inscriptions:

Load$$IRAM$$Base 639a74
Image$$IRAM$$Base 2010000
Image$$IRAM$$Length 0015a4

They should be understood this way: data of 15a4 length are being loaded from 639a74 file displacement to the 2010000 address.

We continue the analysis of symbols with the $ sign:
x$litpool$ - Literal Pool, pieces of the data from functions. At the end of many functions indexes, lines, constants are placed, and x$litpool$ specifies the beginning of such constants.
x$litpool_e$ - Literal Pool end.
$T is merely for debugger. Indicates the addresses where the PC register change take place. So, at these addresses transition commands BL/BEQ/B/BX etc. are placed.
$$- addresses where there is a change of ARM/THUMB state.

There are also C$$code symbols, but I haven't found what it is.

Other names without $ sign are the names for constants and functions. They can be freely used.

If the archive with an insertion contains both MAP and SYM, it is an ideal variant - when you set a name taken from SYM it is possible to check up whether it lays in the code area by using data from MAP. If yes, we may freely indicate it as code not being afraid, that code/data will be determined in IDA incorrectly.

LST Format

It's a real paradise for a reverser, in these files lays all at once. They consist of five parts:
Image Symbol Table - symbols... their meaning I have not understood yet
Local Symbols - everything is clear from the name
Global Symbols - .sym file analogue.

Memory Map of the image - memory map! All at once!
Image component sizes - .map file analogue
The information is so detailed, that even the processor mode for each function is specified.

OUT Format

Have met it only in the Nucleus-based insertions. Here can be tlink.out and tsymb.out files:
tsymb.out - ordinary SYM
think.out - MAP file to which almost useless linker information is added.
Now when we are armed with the symbolical information we can load the insertion in IDA.

What to do if there are no symbols at all

"When there is no toothbrush at hand..." Yes, we take IDA, emulating debuggerand brains in the hands. IDA is "must have". The emulating debugger for ARM, called Trace32, can be taken here.
First of all, we load the insertion in IDA to 0 address. I.e. the whole insertion is being loaded to default addresses. Then look what is on 0 address.

BOOT:00000000 B ResetHandler
BOOT:00000004 B loc_3B4
BOOT:00000008 B loc_410
BOOT:0000000C B loc_42C
BOOT:00000010 B loc_488

etc.

The code in any case begins with 0 address. In all Samsungs and, as I guess, not only in Samsungs an insertion begins with the interruption vectors. These are eight B commands in ARM state, i.e. 8 vectors. 0 address is a vector of null interruption or insertion start/restart. This zero interruption simply starts the mobile phone and thus handler leads to the system loader:

BOOT:00000048 ResetHandler; CODE XREF: BOOT:loc_0 _ j
BOOT:00000048 MRS R0,CPSR
BOOT:0000004C BIC R0,R0,#0x1F
BOOT:00000050 ORR R0,R0,#0x13
BOOT:00000054 ORR R0,R0,#0xC0
BOOT:00000058 MSR CPSR_cxsf,R0
BOOT:0000005C LDR R3,=(InitialHWConfig+1)
BOOT:00000060 MOV LR,PC
BOOT:00000064 BX R3

If the jump from zero address goes to the non-existent address it means that the rest part of the code is mapped to some other addresses. It's easy to determine to which exactly. For example, we have such beginning:

BOOT:00000000 B 0x4003CE

And there is no code on the 4003CE address. We look on 3CE displacement and see an ARM-code. It means the rest part of insertion is displaced on 0x400000. So we have to copy piece of insertion with interruption handlers, load them to zero address and then load an insertion from 400000 address. Now our code is in the right place. We go further. It is necessary to find out where are the RAM and area of input/output ports. The ports are usually either in the end (addresses from about e0000000 and higher) or in the beginning of the memory (up to 0x200000), depending on where the insertion is being loaded. There can be several RAM areas. First of all, we see ports initialization:

BOOT:00000588 MOV R1,#1
BOOT:0000058A LDR R0,=0xE0006000
BOOT:0000058C LSL R1,R1,#0x1B
BOOT:0000058E STR R1,[R0]
BOOT:00000590 STR R1,[R0,#0x10]
BOOT:00000592 STR R1,[R0,#0x20]
BOOT:00000594 LDR R1,=loc_20102
BOOT:00000596 LDR R0,=0xE0003040
BOOT:00000598 STR R1,[R0, #4]
BOOT:0000059A LDR R1,=0x20003
BOOT:0000059C STR R1,[R0, #8]
BOOT:0000059E LDR R0,=0xE0003000
BOOT:000005A0 MOV R1,#0xC
BOOT:000005A2 STR R1,[R0,#0x24]

I.e. since around E0000000 there is an area of input/output ports. Its size doesn't exceed the size of segment and therefore it's possible to create a segment of 0x10000 size. Now we go further. In any insertion there are RAM area which is initialized by zero values and the area which is filled by initial settings which are taken from an insertion. We are looking for copy cycles, so we need the debugger.

Here we see copying:

BOOT:000000D4 LDR R0,=0x63B018
BOOT:000000D8 LDR R1,=0x1000000
BOOT:000000DC LDR R3,=0x1045B38
BOOT:000000E0 CMP R1,R3
BOOT:000000E4 BEQ loc_F8
BOOT:000000E8
BOOT:000000E8 loc_E8; CODE XREF: BOOT:000000F4 _ j
BOOT:000000E8 CMP R1,R3
BOOT:000000EC LDRCC R2,[R0],#4
BOOT:000000F0 STRCC R2,[R1],#4
BOOT:000000F4 BCC loc_E8

The block is being copied from 63B018 address to 1000000 address of insertion. The length is 45B38.
This is the first RAM area. Now we look for the second one, whose zero initialization should be nearby:

BOOT:000000F8 LDR R1,=0x11ED9E4
BOOT:000000FC MOV R2,#0
BOOT:00000100 CMP R3,R1
BOOT:00000104
BOOT:00000104 loc_104; BOOT:00000108 _ j
BOOT:00000104
BOOT:00000104 STRCC R2,[R3],#4
BOOT:00000108 BCC loc_100

Indeed, there is a stuffing with zero values in the area from 1045B38 to 11ED9E4, so here we have the second part. If there are any areas, then there will certainly be zero or copy initialization. Other memory pieces can be found only analytically, but we have got the basis already.

The further research depends on the presence of symbols/signatures. If yes, then everything comes to looking for the necessary function in the names list. What to do if not? First of all, it is necessary to determine approximate code bounds and, if possible, to find functions in the code. The most primitive and effective way is to search for a push command with which 60 % of insertion code begins. Insertion code usually consists of Thumb code on 90 %, so we should look for B5 byte (push) and try to define it as the code in IDA. Insertion code usually takes less than 50 % of the whole size, the rest part is for graphics and language resources. Else I can say that very often at the end of the code there are copyrights lines, a kind of "Samsung corp. 199x-200x ARM ADS 1.2".

Some code has been revealed, around 20% were harmed by IDA itself, because it often can't cope with THUMB/ARM transition. And now we have to take anything left lying around loose, i.e. what had been left by programmers. And what they had left? Trace and Assert. And any trace and assert doesn't go without sprintf/printf. We have to find it. It's easy - we should just look for the "%s" line. We need that which obviously contains a pattern of the error message. With xref we find where this line is used and it will be exactly sprintf, followed by Trace or Assert. Now, with basing on the error messages, we can name the functions. I.e. walking with xref to the Trace/Assert function, we can find output of more than half of mistakes. Further functions naming is possible by searching the following words:

Bad
Fail
Incorrect
Invalid
Error
Memory
File
Null
No
Critical
Abnormal

etc.

This way we will find some more error output functions. Thus we will gradually gain the information, not being based on anything except for the insertion.