January 8, 2019

Understand EVM bytecode – Part 2

In the first section, Understand EVM bytecode – Part 1, we inspected the contract creation part of a smart contract’s EVM bytecode. In this section we will analyze the runtime EVM bytecode. We will keep using the sample contract Demo1 from the previous section.

pragma solidity 0.4.25;

contract Demo1 {
    uint public balance;

    function add(uint value) public returns (uint256) {
        balance = balance + value;
        return balance;
    }
}

Besides going through the whole compiled bytecode to find the EVM runtime part, there is an easier way: the Remix online IDE. After compiling the source code, click the “Details” button and scroll down. The “RUNTIME BYTECODE” section has a member field called “object” that holds the runtime bytecode.

The whole runtime bytecode is as follows:

60806040526004361060485763ffffffff7c010000000000000000000000000000
00000000000000000000000000006000350416631003e2d28114604d578063b69e
f8a8146074575b600080fd5b348015605857600080fd5b5060626004356086565b
60408051918252519081900360200190f35b348015607f57600080fd5b50606260
95565b60008054820190819055919050565b600054815600a165627a7a72305820
63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef0029

This bytecode is much longer than the creation part, so let’s split it into parts and analyze them one by one.

0000    60  PUSH1 0x80
0002    60  PUSH1 0x40
0004    52  MSTORE
0005    60  PUSH1 0x04
0007    36  CALLDATASIZE
0008    10  LT
0009    60  PUSH1 0x48
000B    57  *JUMPI
...
0048    5B  JUMPDEST
0049    60  PUSH1 0x00
004B    80  DUP1
004C    FD  *REVERT

If you have read the first section, you will have no problem understanding the first three instructions. If you haven’t, I recommend reading it first. These instructions store the value 0x80 at offset 0x40 in memory as the free memory pointer for later use. At 0x07 there is a new opcode, CALLDATASIZE, which we haven’t met before. It pushes the size of the data payload of the current call onto the stack. LT compares the top two items on the stack and pushes 1 if the top item is less than the second one, and 0 otherwise. Here the top item is the payload size and the second is 0x04, so the JUMPI at 0x0B jumps to the revert code at 0x48 when the payload is shorter than 4 bytes. Putting the pieces together, the equivalent Solidity assembly code is as follows:

mstore(0x40,0x80);
if(msg.data.length < 0x04) { revert(0,0); }

These instructions initialize the free memory pointer and check that the data payload is at least 4 bytes long. The reason is that in a regular external call to a smart contract, the first 4 bytes of the data payload are the hash value of the function signature. The contract uses this 4-byte value to select the function that receives the rest of the data, which holds the parameters for that function. For example, if you call the function withdrawn(0xABCD) on a smart contract, the data payload for the call looks like this:

0x3823D66C000000000000000000000000000000000000000000000000000000000000ABCD

In this example the first 4-byte value is 0x3823D66C, which is the first 4 bytes of the SHA3 (Keccak-256) hash of the signature string “withdrawn(bytes32)”. The following 32-byte word is the function parameter 0xABCD. This is a simple example with a fixed-size parameter. Things get more complicated with variable-size parameters, which we will talk about later.
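As a rough illustration of that layout, the following Python sketch assembles the same payload from its two parts. The helper name encode_call is hypothetical, and the selector is copied from the example rather than computed, because Python’s standard library has no Keccak-256 (hashlib.sha3_256 is the standardized SHA-3, which produces a different digest).

```python
def encode_call(selector: bytes, arg: int) -> bytes:
    """Build a data payload: 4-byte selector followed by one 32-byte word."""
    if len(selector) != 4:
        raise ValueError("selector must be exactly 4 bytes")
    # A fixed-size argument occupies a full 32-byte word, big-endian.
    return selector + arg.to_bytes(32, "big")


# Selector taken from the example in the text (assumed, not recomputed here).
payload = encode_call(bytes.fromhex("3823d66c"), 0xABCD)
print("0x" + payload.hex())
print(len(payload))  # 4 + 32 = 36 bytes
```

This only covers a single fixed-size argument; variable-size types use a different layout.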

For now, let’s go back to the instructions we have discussed. They check that the data payload is at least 4 bytes long and revert if it is not. But is this the case for all smart contracts? What if we just send some Ether to the contract without calling any function? You might already recall a Solidity feature for this. Yes, this is where the fallback function comes in. To verify this, compile a sample with a fallback function implemented and check the branch taken after the msg.data.length validation code.

Continuing with our sample bytecode, we see the following code snippet:

000C    63  PUSH4 0xffffffff
0011    7C  PUSH29 0x100000000000000000000000000000000000000000000000000000000
002F    60  PUSH1 0x00
0031    35  CALLDATALOAD
0032    04  DIV
0033    16  AND
0034    63  PUSH4 0x1003e2d2
0039    81  DUP2
003A    14  EQ
003B    60  PUSH1 0x4d
003D    57  *JUMPI
003E    80  DUP1
003F    63  PUSH4 0xb69ef8a8
0044    14  EQ
0045    60  PUSH1 0x74
0047    57  *JUMPI

If you look at EVM bytecode from other smart contracts, you will keep finding the same code snippet or a functionally equivalent one. It does not come from anything you wrote in your source; it was injected by the compiler. Let’s go through it line by line.

First, the code pushes 0xFFFFFFFF onto the stack. This value will be one of the operands of the AND opcode at 0x33. Then the code pushes another huge constant, 2^224, which will be the divisor of DIV at 0x32. The instructions at 0x2F and 0x31 load the first 32-byte word of the data payload using CALLDATALOAD(0x0). Dividing that word by 2^224 shifts it right by 28 bytes, and the AND keeps only the lowest 4 bytes, so together these instructions extract the first 4 bytes of the data payload, which is the function signature hash. The result is left on the stack and compared with 0x1003e2d2 by the EQ opcode at 0x3A. If they are equal, JUMPI transfers execution to address 0x4D. If not, execution continues and the value is compared with another hash value, 0xb69ef8a8.

Now the logic of this code snippet is pretty clear. It takes the first 4 bytes of the data payload and decides which function to run. The code addresses 0x4D and 0x74 are the entry points of the public functions that callers can access on the smart contract. If none of the hash values matches, execution falls through to the fallback function of the smart contract. If no fallback function is defined, as in this sample, it simply reverts.
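The dispatch logic can be sketched in a few lines of Python. This is only a model of the bytecode above, not how the EVM is implemented: the function names selector_of and dispatch are made up for illustration, and the jump targets are copied from the listing.

```python
def selector_of(calldata: bytes) -> int:
    """Mimic CALLDATALOAD(0) / 2**224 & 0xffffffff."""
    # CALLDATALOAD reads 32 bytes and zero-pads past the end of the data.
    word = int.from_bytes(calldata[:32].ljust(32, b"\x00"), "big")
    return (word // 2**224) & 0xFFFFFFFF


# Jump targets copied from the dispatch code in the listing.
DISPATCH = {0x1003E2D2: 0x4D, 0xB69EF8A8: 0x74}
REVERT_ADDR = 0x48


def dispatch(calldata: bytes) -> int:
    """Return the address where execution would continue."""
    if len(calldata) < 4:
        return REVERT_ADDR
    return DISPATCH.get(selector_of(calldata), REVERT_ADDR)


call_add = bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big")
print(hex(selector_of(call_add)))  # 0x1003e2d2
print(hex(dispatch(call_add)))     # 0x4d
```

An unknown selector ends up at 0x48 because the second JUMPI at 0x47 falls through to the revert block when its condition is false.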

Since a hash function discards information, it is theoretically impossible to recover the original function signature from its hash. However, you can still look up the hash in a large collection of known signatures. That is what the website www.4byte.directory does. It collects tons of function signatures and their hashes and provides a web API for searching. For example, by browsing the following link:

https://www.4byte.directory/signatures/?bytes4_signature=0x1003e2d2

You will get the result as:

add(uint256)

Amazingly, that is the original function definition in our demo1.sol program. So with this service we can recover most function signatures, as long as they have been collected before.

So far we have seen how public functions are reached by external callers. Now let’s look into the specific functions. Address 0x4D is the entry point of the add() function:

004D    5B  JUMPDEST
004E    34  CALLVALUE
004F    80  DUP1
0050    15  ISZERO
0051    60  PUSH1 0x58
0053    57  *JUMPI
0054    60  PUSH1 0x00
0056    80  DUP1
0057    FD  *REVERT
0058    5B  JUMPDEST
0059    50  POP
005A    60  PUSH1 0x62
005C    60  PUSH1 0x04
005E    35  CALLDATALOAD
005F    60  PUSH1 0x86
0061    56  *JUMP

At the entry of the function add() there is a JUMPDEST opcode. This is a special opcode that only marks an address that can be jumped to. It has no effect when executed, but the EVM only accepts JUMP and JUMPI targets that are JUMPDEST instructions. It also helps to determine the control flow graph (CFG) of the bytecode, which we will discuss in a future section.
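Because every jump target is marked by a JUMPDEST, collecting them is a natural first step toward a CFG. The sketch below is one simple way to do it, a linear sweep that skips the immediate bytes of PUSH instructions so that data is not mistaken for an opcode. The function name jump_destinations and its base parameter are assumptions made for this example.

```python
def jump_destinations(code: bytes, base: int = 0) -> list:
    """Linear sweep that lists the addresses of all JUMPDEST opcodes."""
    dests = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == 0x5B:            # JUMPDEST
            dests.append(base + pc)
        if 0x60 <= op <= 0x7F:    # PUSH1..PUSH32 carry 1..32 immediate bytes
            pc += op - 0x5F       # skip the immediate so it is not decoded
        pc += 1
    return dests


# Bytes of the add() entry block from the listing, which starts at 0x4D.
entry = bytes.fromhex("5b348015605857600080fd5b506062600435608656")
print([hex(d) for d in jump_destinations(entry, base=0x4D)])
```

For this block the sweep reports the two JUMPDEST addresses that appear in the listing, 0x4D and 0x58.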

The instructions at 0x4E-0x57 were seen in the previous section. The compiler injects them for non-payable functions: they revert if any Ether is sent with the call. After this validation code, the PUSH1 opcode at address 0x5A is easy to overlook. However, this PUSH1 is very important for the code to jump back later. For now, just remember that the address 0x62 is pushed onto the stack. Then CALLDATALOAD(0x04) loads the parameter from the data payload at offset 0x04, right after the 4-byte function hash. After getting the parameter, the code jumps to 0x86 for execution:

0086    5B  JUMPDEST
0087    60  PUSH1 0x00
0089    80  DUP1
008A    54  SLOAD
008B    82  DUP3
008C    01  ADD
008D    90  SWAP1
008E    81  DUP2
008F    90  SWAP1
0090    55  SSTORE
0091    91  SWAP2
0092    90  SWAP1
0093    50  POP
0094    56  *JUMP

In the above code snippet, the value at slot 0x0 in storage is loaded with SLOAD(0x0). This value is then added to the parameter loaded from the data payload, and the sum is saved back to the same slot 0x0 in storage. At last we see the code we put inside the add() function:

balance = balance + value;

Since balance is the only storage variable in the smart contract, the compiler assigns it slot 0x0. So any read or write of balance goes to slot 0x0 in storage. At the end of the code snippet, a JUMP is used to jump back to 0x62, the address that was pushed onto the stack at 0x5A. This might remind you of something on the x86 architecture. Yes, it is the equivalent of call and ret for function calls. Since the EVM has no call and return instructions for internal functions at the bytecode level, the compiler has to use PUSH and JUMP opcodes instead. This makes it a lot harder to rebuild the CFG from bytecode.
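To see the push-then-jump convention in action, here is a minimal sketch of a stack machine that runs the 15 bytes at 0x86-0x94. It implements only the opcodes this snippet uses and is not a full EVM. The function name run and the starting values (a balance of 35 and a parameter of 7) are arbitrary choices for the example.

```python
def run(code: bytes, stack: list, storage: dict) -> int:
    """Execute a tiny opcode subset until JUMP; return the jump target."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == 0x5B:                      # JUMPDEST: no effect
            pass
        elif op == 0x60:                    # PUSH1 <byte>
            pc += 1
            stack.append(code[pc])
        elif 0x80 <= op <= 0x8F:            # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:            # SWAP1..SWAP16
            n = op - 0x8F
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x54:                    # SLOAD
            stack.append(storage.get(stack.pop(), 0))
        elif op == 0x55:                    # SSTORE: key on top, value below
            key = stack.pop()
            storage[key] = stack.pop()
        elif op == 0x01:                    # ADD, wrapping at 256 bits
            stack.append((stack.pop() + stack.pop()) % 2**256)
        elif op == 0x50:                    # POP
            stack.pop()
        elif op == 0x56:                    # JUMP
            return stack.pop()
        else:
            raise NotImplementedError(hex(op))
        pc += 1
    raise RuntimeError("fell off the end of the snippet")


body = bytes.fromhex("5b6000805482019081905591905056")  # bytes 0x86-0x94
stack = [0x62, 7]        # return address pushed at 0x5A, then the parameter
storage = {0: 35}        # slot 0 holds balance
target = run(body, stack, storage)
print(hex(target), stack, storage)
```

The run ends with the jump target 0x62 popped from the stack, the new balance left on the stack as the return value, and slot 0 updated, which mirrors what call and ret would do on x86.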

Let’s jump back to address 0x62 to see what happens next:

0062    5B  JUMPDEST
0063    60  PUSH1 0x40
0065    80  DUP1
0066    51  MLOAD
0067    91  SWAP2
0068    82  DUP3
0069    52  MSTORE
006A    51  MLOAD
006B    90  SWAP1
006C    81  DUP2
006D    90  SWAP1
006E    03  SUB
006F    60  PUSH1 0x20
0071    01  ADD
0072    90  SWAP1
0073    F3  *RETURN

This code snippet rearranges the items on the stack several times with the opcodes DUP and SWAP, so the meaning of the instructions is not obvious. I will leave it to you to simulate the stack during execution. The equivalent assembly code is as follows:

mstore(mload(0x40), value);
return(mload(0x40), 0x20);

The value in this code is the result of the previous add operation, which is the new balance. We have now gone through all the bytecode of the add() function.
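If you would rather check the stack simulation than do it by hand, the following sketch prints the stack after each instruction of the snippet at 0x62. It is a simplified model under stated assumptions: memory is a dictionary of whole 32-byte words, which is only valid for the aligned accesses used here, and the function name trace and the input value 42 are made up for the example.

```python
NAMES = {0x5B: "JUMPDEST", 0x51: "MLOAD", 0x52: "MSTORE",
         0x01: "ADD", 0x03: "SUB", 0xF3: "RETURN"}


def trace(code: bytes, stack: list, memory: dict) -> tuple:
    """Run the snippet, print each step, return (offset, size) of RETURN."""
    pc = 0
    while pc < len(code):
        op = code[pc]
        name = NAMES.get(op)
        if op == 0x60:                      # PUSH1 <byte>
            pc += 1
            name = "PUSH1 0x%02x" % code[pc]
            stack.append(code[pc])
        elif 0x80 <= op <= 0x8F:            # DUP1..DUP16
            n = op - 0x7F
            name = "DUP%d" % n
            stack.append(stack[-n])
        elif 0x90 <= op <= 0x9F:            # SWAP1..SWAP16
            n = op - 0x8F
            name = "SWAP%d" % n
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x51:                    # MLOAD
            stack.append(memory.get(stack.pop(), 0))
        elif op == 0x52:                    # MSTORE: offset on top, value below
            offset = stack.pop()
            memory[offset] = stack.pop()
        elif op == 0x01:                    # ADD
            stack.append((stack.pop() + stack.pop()) % 2**256)
        elif op == 0x03:                    # SUB: top minus second
            a = stack.pop()
            stack.append((a - stack.pop()) % 2**256)
        elif op == 0xF3:                    # RETURN: offset on top, size below
            offset = stack.pop()
            return offset, stack.pop()
        elif op != 0x5B:                    # JUMPDEST needs no action
            raise NotImplementedError(hex(op))
        print("%-11s %s" % (name, [hex(v) for v in stack]))
        pc += 1
    raise RuntimeError("snippet ended without RETURN")


snippet = bytes.fromhex("5b60408051918252519081900360200190f3")  # 0x62-0x73
memory = {0x40: 0x80}                       # free memory pointer
offset, size = trace(snippet, [42], memory)
print(hex(offset), hex(size), memory[offset])
```

The trace ends with RETURN reading 0x20 bytes at the free memory pointer, where the value was stored, which matches the two-line assembly above.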

Now let’s go back to the function dispatch code for the second function hash, 0xb69ef8a8. The entry point for that function is at 0x74:

0074    5B  JUMPDEST
0075    34  CALLVALUE
0076    80  DUP1
0077    15  ISZERO
0078    60  PUSH1 0x7f
007A    57  *JUMPI
007B    60  PUSH1 0x00
007D    80  DUP1
007E    FD  *REVERT
007F    5B  JUMPDEST
0080    50  POP
0081    60  PUSH1 0x62
0083    60  PUSH1 0x95
0085    56  *JUMP
...
0095    5B  JUMPDEST
0096    60  PUSH1 0x00
0098    54  SLOAD
0099    81  DUP2
009A    56  *JUMP

We can see that the first part of the code is very similar to the previous function, except that no parameter is loaded. The function then calls the code snippet at 0x95. The instructions at 0x95-0x9A just load the value at slot 0x0 in storage and jump back. Note that the code snippet at 0x62 is reused by both functions, because both return the storage variable balance.

You may wonder why the function hash 0xb69ef8a8 appears in the function dispatch code. Isn’t add() the only function in the smart contract? If you look that hash up in the 4byte database, you will get balance(). The compiler generates a public getter function without parameters for every public storage variable.

To summarize, we have discussed the whole structure of the runtime part of the bytecode: how functions are reached by external callers and how parameters are passed. But this demo only contains one integer storage variable. What does the bytecode look like for mappings or variable-length arrays? How are string parameters laid out in the data payload? We will talk about all of these in the next section.

Understand EVM bytecode – Part 3