Understand EVM bytecode – Part 2
In the first section, Understand EVM bytecode – Part 1, we inspected the contract creation part of a smart contract's EVM bytecode. In this section we will analyze the runtime EVM bytecode. We will keep using the sample contract Demo1 from the previous section.
pragma solidity 0.4.25;
contract Demo1 {
uint public balance;
function add(uint value) public returns (uint256) {
balance = balance + value;
return balance;
}
}
Besides going through the whole compiled bytecode to find the EVM runtime part, there is an easier way: the Remix online IDE. After compiling the source code, click the “Details” button and scroll down. The “RUNTIME BYTECODE” section contains a member field called “object”.
The whole EVM runtime part bytecode can be shown as follows:
60806040526004361060485763ffffffff7c010000000000000000000000000000
00000000000000000000000000006000350416631003e2d28114604d578063b69e
f8a8146074575b600080fd5b348015605857600080fd5b5060626004356086565b
60408051918252519081900360200190f35b348015607f57600080fd5b50606260
95565b60008054820190819055919050565b600054815600a165627a7a72305820
63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef0029
This part of the bytecode is much longer than the creation part, so let’s split it into pieces and analyze them one by one.
0000 60 PUSH1 0x80
0002 60 PUSH1 0x40
0004 52 MSTORE
0005 60 PUSH1 0x04
0007 36 CALLDATASIZE
0008 10 LT
0009 60 PUSH1 0x48
000B 57 *JUMPI
...
0048 5B JUMPDEST
0049 60 PUSH1 0x00
004B 80 DUP1
004C FD *REVERT
If you have read the first section, you will have no trouble understanding the first three instructions. If you haven’t, I really recommend reading it first. These instructions store the value 0x80 at memory offset 0x40 as the free memory pointer for future use. At 0x07 there is a new opcode, CALLDATASIZE, which we haven’t met before. It pushes the size of the transaction's data payload onto the stack. LT compares the top two stack items and pushes 1 (TRUE) if the top item is less than the second one, and 0 otherwise. Here the top item is the payload size and the second one is 0x04, so JUMPI jumps to the revert code at 0x48 when the payload is shorter than 4 bytes. Putting the pieces together, we get the following equivalent Solidity-style code:
mstore(0x40,0x80);
if(msg.data.length < 0x04) { revert(0,0); }
We can see that these instructions initialize the free memory pointer and check that the data payload is at least 4 bytes long. The reason is that for a regular external call to a smart contract, the first 4 bytes of the data payload are the hash value of the function signature. The contract uses this 4-byte value to select the function that receives the rest of the data, which holds the parameters for that function. For example, if you call the function withdrawn(0xABCD) on a smart contract, the data payload for this call looks like this:
0x3823D66C000000000000000000000000000000000000000000000000000000000000ABCD
In this example the first 4-byte value is 0x3823D66C, which is the first 4 bytes of the SHA3 (Keccak-256) hash of “withdrawn(bytes32)”. The following 32-byte word is the function parameter 0xABCD. This is a simple example with a fixed-size parameter. Things get more complicated with variable-size parameters. We will talk about them later.
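To make the payload layout concrete, here is a minimal Python sketch that splits a payload into its 4-byte selector and the 32-byte words that follow. The helper name split_calldata is made up for this illustration, and the sketch only handles fixed-size parameters like the one in the example.

```python
def split_calldata(payload_hex):
    """Split a hex-encoded call payload into (selector, list of 32-byte words)."""
    text = payload_hex[2:] if payload_hex.lower().startswith("0x") else payload_hex
    data = bytes.fromhex(text)
    if len(data) < 4:
        # Mirrors the runtime check: no room for a function selector.
        raise ValueError("payload is shorter than 4 bytes")
    selector, body = data[:4], data[4:]
    words = [body[i:i + 32] for i in range(0, len(body), 32)]
    return selector, words


# The example payload from the text: selector followed by one 32-byte word.
payload = "0x3823D66C" + "ABCD".rjust(64, "0")
selector, words = split_calldata(payload)
print("selector:", "0x" + selector.hex())
print("argument:", hex(int.from_bytes(words[0], "big")))
```

The selector always occupies bytes 0 to 3, which is why the first parameter is later read from offset 0x04.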
For now, let’s go back to the instructions we have discussed. They check that the data payload is at least 4 bytes long and revert if it is not. But is this the case for all smart contracts? What if we just send some Ether to this smart contract without calling any function? You might already recall a feature of Solidity: yes, this is where the fallback function comes in. To verify this, take a sample contract that implements a fallback function and check the branch taken after the msg.data.length check.
Continuing with our sample bytecode, we see the following code snippet:
0C: 63 PUSH4 0xffffffff
11: 7C PUSH29 0x100000000000000000000000000000000000000000000000000000000
2F: 60 PUSH1 0x00
31: 35 CALLDATALOAD
32: 04 DIV
33: 16 AND
34: 63 PUSH4 0x1003e2d2
39: 81 DUP2
3A: 14 EQ
3B: 60 PUSH1 0x4d
3D: 57 *JUMPI
3E: 80 DUP1
3F: 63 PUSH4 0xb69ef8a8
44: 14 EQ
45: 60 PUSH1 0x74
47: 57 *JUMPI
If you look at EVM bytecode from different smart contracts, you will find the same code snippet, or a functionally equivalent one, again and again. This snippet does not come from the code you wrote in your source; it was injected by the compiler. Let’s go through it line by line.
First, the code pushes 0xFFFFFFFF onto the stack. This value will be one of the operands of the AND opcode at 0x33. Then the code pushes another huge constant, which will be the divisor of the DIV at 0x32. The instructions at 0x2F and 0x31 load the first 32-byte word of the data payload using CALLDATALOAD(0x0). Dividing this word by the huge constant shifts it right by 28 bytes, and the AND keeps only the lowest 4 bytes, so together these instructions extract the first 4 bytes of the data payload, which is the function signature hash. The result is left on the stack and compared with 0x1003e2d2 by the EQ opcode at 0x3A. If they are equal, JUMPI transfers execution to address 0x4D. If not, execution continues and the value is compared with another hash, 0xb69ef8a8.
Now the logic of this code snippet is pretty clear. It takes the first 4 bytes of the data payload and decides which function to run. The addresses 0x4D and 0x74 are the entry points of the public functions that callers can reach in this smart contract. If none of the hash values matches, execution falls through to the fallback function of the smart contract. If no fallback function is defined, as in our sample, the code simply reverts.
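The dispatch logic can be mirrored in a few lines of Python. This is only a sketch of what the compiler-generated snippet does for our sample: the function names calldataload and dispatch are chosen for this illustration, and the returned numbers are the bytecode addresses from the listing above.

```python
SELECTOR_ADD = 0x1003E2D2
SELECTOR_BALANCE = 0xB69EF8A8
REVERT_ADDRESS = 0x48  # no fallback function in the sample, so this reverts


def calldataload(data, offset):
    """Mimic CALLDATALOAD: read 32 bytes, zero-padded past the end of the data."""
    chunk = data[offset:offset + 32]
    return int.from_bytes(chunk.ljust(32, b"\x00"), "big")


def dispatch(data):
    """Return the bytecode address the dispatcher would jump to."""
    if len(data) < 4:
        return REVERT_ADDRESS
    # DIV by 2**224 drops the low 28 bytes, AND keeps the remaining 4 bytes.
    selector = (calldataload(data, 0) // 2**224) & 0xFFFFFFFF
    if selector == SELECTOR_ADD:
        return 0x4D
    if selector == SELECTOR_BALANCE:
        return 0x74
    return REVERT_ADDRESS


add_call = bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big")
print(hex(dispatch(add_call)))
print(hex(dispatch(bytes.fromhex("b69ef8a8"))))
print(hex(dispatch(b"")))
```

The divisor 2**224 is the PUSH29 constant from the listing: a 0x01 byte followed by 28 zero bytes.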
Since hash algorithms lose information during the calculation, it is theoretically impossible to recover the original function signature from the hash. However, you can still guess the original signature with the help of a huge collection of known hash values. That is what the website www.4byte.directory does. It collects tons of function signatures and their hashes and provides a web API for searching. For example, by browsing the following link:
https://www.4byte.directory/signatures/?bytes4_signature=0x1003e2d2
You will get the result as:
add(uint256)
Sure enough, that is the original function signature in our demo1.sol program. Using this service we can recover most function signatures, as long as they have been collected before.
So far we have seen how public functions are accessed by external callers. Now let’s look into the specific functions. Address 0x4D is the entry point of the add() function:
004D 5B JUMPDEST
004E 34 CALLVALUE
004F 80 DUP1
0050 15 ISZERO
0051 60 PUSH1 0x58
0053 57 *JUMPI
0054 60 PUSH1 0x00
0056 80 DUP1
0057 FD *REVERT
0058 5B JUMPDEST
0059 50 POP
005A 60 PUSH1 0x62
005C 60 PUSH1 0x04
005E 35 CALLDATALOAD
005F 60 PUSH1 0x86
0061 56 *JUMP
At the entry of the function add() there is a JUMPDEST opcode. This is a special opcode that does nothing except mark an address as a valid jump target: JUMP and JUMPI may only land on a JUMPDEST. It may not look important, but you will see that it helps a lot in determining the control flow graph (CFG) of the bytecode. We will discuss this in a future section.
The instructions at 0x4E-0x59 were seen in the previous section. The compiler injects them for non-payable functions. After this check, the PUSH1 opcode at address 0x5A is easy to overlook, but it is very important for the code to jump back later. For now, just remember that the address 0x62 is pushed onto the stack. Then CALLDATALOAD(0x04) loads the parameter from the data payload, where it sits at offset 0x04. After loading the parameter, the code jumps to 0x86 to continue execution:
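As a small illustration of why JUMPDEST matters for analysis, the following Python sketch walks over the executable part of our runtime bytecode and collects every JUMPDEST address. The function name find_jumpdests is made up for this example. The scan skips the immediate bytes of PUSH instructions, so a 0x5B byte inside push data is not counted.

```python
# Executable part of the runtime bytecode (0x00-0x9B), without the metadata.
CODE = bytes.fromhex(
    "6080604052" "60043610604857"             # 0x00: memory setup, size check
    "63ffffffff" "7c01" + "00" * 28 +         # 0x0C: mask and PUSH29 divisor
    "6000350416"                              # 0x2F: selector extraction
    "631003e2d2" "8114604d57"                 # 0x34: add(uint256) -> 0x4D
    "8063b69ef8a8" "14607457"                 # 0x3E: balance() -> 0x74
    "5b600080fd"                              # 0x48: revert
    "5b348015605857600080fd"                  # 0x4D: add() entry
    "5b506062600435608656"                    # 0x58: load parameter, jump
    "5b60408051918252519081900360200190f3"    # 0x62: shared return code
    "5b348015607f57600080fd"                  # 0x74: balance() entry
    "5b506062609556"                          # 0x7F: push return address, jump
    "5b6000805482019081905591905056"          # 0x86: add() body
    "5b6000548156"                            # 0x95: balance() body
    "00"                                      # 0x9B: STOP
)


def find_jumpdests(code):
    """Linear scan that returns the addresses of all JUMPDEST opcodes."""
    dests = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == 0x5B:                # JUMPDEST
            dests.append(pc)
        if 0x60 <= op <= 0x7F:        # PUSH1..PUSH32: skip the immediate bytes
            pc += op - 0x5F
        pc += 1
    return dests


print([hex(address) for address in find_jumpdests(CODE)])
```

Every address found this way is the start of a basic block, which is the first step towards rebuilding a CFG.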
0086 5B JUMPDEST
0087 60 PUSH1 0x00
0089 80 DUP1
008A 54 SLOAD
008B 82 DUP3
008C 01 ADD
008D 90 SWAP1
008E 81 DUP2
008F 90 SWAP1
0090 55 SSTORE
0091 91 SWAP2
0092 90 SWAP1
0093 50 POP
0094 56 *JUMP
In the code snippet above, the value at slot 0x0 in storage is loaded with SLOAD(0x0). This value is then added to the parameter loaded from the data payload, and the sum is saved back to the same slot 0x0 in storage. At last we see the code we put inside the add() function:
balance = balance + value;
Since balance is the only storage variable in the smart contract, the compiler assigns it slot 0x0. Any read or write of balance therefore goes to slot 0x0 in storage. At the end of the code snippet, a JUMP jumps back to 0x62, the value that was pushed onto the stack at 0x5A. This might remind you of something on the x86 architecture: yes, it plays the role of call and ret for function calls. Since the EVM has no function call instructions at the bytecode level, the compiler can only use PUSH and JUMP opcodes for function calls. This makes it much harder to rebuild the CFG from bytecode.
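To see the call-and-return pattern in action, here is a minimal Python sketch that interprets only the bytes at 0x86-0x94. It is not a real EVM: it supports just the opcodes used in this snippet, ignores gas, and the function name run_add_body is made up for this illustration. The stack starts with the return address 0x62 and the parameter on top of it.

```python
SNIPPET = bytes.fromhex("5b6000805482019081905591905056")  # bytes at 0x86-0x94
BASE = 0x86


def run_add_body(stack, storage):
    """Execute the add() body and return the address of the final JUMP."""
    pc = BASE
    while True:
        op = SNIPPET[pc - BASE]
        if op == 0x5B:                       # JUMPDEST: no effect
            pass
        elif op == 0x60:                     # PUSH1
            pc += 1
            stack.append(SNIPPET[pc - BASE])
        elif op == 0x54:                     # SLOAD
            stack.append(storage.get(stack.pop(), 0))
        elif op == 0x55:                     # SSTORE: key on top, value below
            key, value = stack.pop(), stack.pop()
            storage[key] = value
        elif op == 0x01:                     # ADD, modulo 2**256
            stack.append((stack.pop() + stack.pop()) % 2**256)
        elif op == 0x50:                     # POP
            stack.pop()
        elif 0x80 <= op <= 0x8F:             # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:             # SWAP1..SWAP16
            depth = op - 0x8F
            stack[-1], stack[-1 - depth] = stack[-1 - depth], stack[-1]
        elif op == 0x56:                     # JUMP: target is taken from the stack
            return stack.pop()
        else:
            raise ValueError("unsupported opcode 0x%02x" % op)
        pc += 1


storage = {0: 10}        # balance currently holds 10
stack = [0x62, 5]        # return address, then the parameter value
target = run_add_body(stack, storage)
print(hex(target), stack, storage)
```

The final JUMP takes its target from the stack, which is exactly what makes static CFG recovery hard: the destination is data, not part of the instruction.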
Let’s jump back to address 0x62 to see what happens next:
0062 5B JUMPDEST
0063 60 PUSH1 0x40
0065 80 DUP1
0066 51 MLOAD
0067 91 SWAP2
0068 82 DUP3
0069 52 MSTORE
006A 51 MLOAD
006B 90 SWAP1
006C 81 DUP2
006D 90 SWAP1
006E 03 SUB
006F 60 PUSH1 0x20
0071 01 ADD
0072 90 SWAP1
0073 F3 *RETURN
This code snippet shuffles items on the stack several times with the DUP and SWAP opcodes, so the meaning of the instructions is not obvious. I will leave it to you to simulate the stack during execution. The equivalent assembly code is as follows:
mstore(mload(0x40), value);
return(mload(0x40), 0x20);
The value in this code is the result of the previous add operation, that is, the new balance. With that, we have gone through all the bytecode of the add() function.
Now let’s go back to the function dispatch snippet for the second function hash, 0xb69ef8a8. The entry point of that function is at 0x74:
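If you want to check your own stack simulation of the code at 0x62, the following Python sketch executes just those bytes. It is a simplified model under a few assumptions: memory is a fixed 256-byte buffer, the free memory pointer has already been set by mstore(0x40, 0x80), and only the opcodes of this snippet are supported. The name run_return_code is made up for this illustration.

```python
SNIPPET = bytes.fromhex("5b60408051918252519081900360200190f3")  # 0x62-0x73


def run_return_code(stack):
    """Execute the shared return code and give back the returned bytes."""
    memory = bytearray(0x100)
    memory[0x40:0x60] = (0x80).to_bytes(32, "big")   # free memory pointer
    index = 0
    while True:
        op = SNIPPET[index]
        if op == 0x5B:                       # JUMPDEST: no effect
            pass
        elif op == 0x60:                     # PUSH1
            index += 1
            stack.append(SNIPPET[index])
        elif op == 0x51:                     # MLOAD
            offset = stack.pop()
            stack.append(int.from_bytes(memory[offset:offset + 32], "big"))
        elif op == 0x52:                     # MSTORE: offset on top, value below
            offset, value = stack.pop(), stack.pop()
            memory[offset:offset + 32] = value.to_bytes(32, "big")
        elif op == 0x01:                     # ADD
            stack.append((stack.pop() + stack.pop()) % 2**256)
        elif op == 0x03:                     # SUB: top minus second
            top, second = stack.pop(), stack.pop()
            stack.append((top - second) % 2**256)
        elif 0x80 <= op <= 0x8F:             # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:             # SWAP1..SWAP16
            depth = op - 0x8F
            stack[-1], stack[-1 - depth] = stack[-1 - depth], stack[-1]
        elif op == 0xF3:                     # RETURN: offset on top, size below
            offset, size = stack.pop(), stack.pop()
            return bytes(memory[offset:offset + size])
        else:
            raise ValueError("unsupported opcode 0x%02x" % op)
        index += 1


output = run_return_code([15])   # 15 stands for the new balance
print(len(output), int.from_bytes(output, "big"))
```

The result is a single 32-byte word read from the address stored in the free memory pointer, which matches the two-line assembly version above.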
0074 5B JUMPDEST
0075 34 CALLVALUE
0076 80 DUP1
0077 15 ISZERO
0078 60 PUSH1 0x7f
007A 57 *JUMPI
007B 60 PUSH1 0x00
007D 80 DUP1
007E FD *REVERT
007F 5B JUMPDEST
0080 50 POP
0081 60 PUSH1 0x62
0083 60 PUSH1 0x95
0085 56 *JUMP
...
0095 5B JUMPDEST
0096 60 PUSH1 0x00
0098 54 SLOAD
0099 81 DUP2
009A 56 *JUMP
We can see that the first part of the code is very similar to the previous function, except that no parameter is loaded. The function then calls the code snippet at 0x95. The instructions at 0x95-0x9A just load the value at slot 0x0 in storage and jump back to the return address 0x62. Note that the code snippet at 0x62 is reused by both functions, because both return the value of the storage variable balance.
You may wonder why the function hash 0xb69ef8a8 appears in the function dispatch code. Isn’t add() the only function in the smart contract? If you look up that hash in the 4byte database, you will get balance(). Because balance is declared public, the compiler generates a getter function without parameters for it.
To summarize, in this section we have discussed the whole structure of the runtime part of the bytecode: how functions are accessed by external callers and how parameters are passed. But this demo example only contains a single integer storage variable. What will the bytecode look like for mappings or variable-length arrays? How are string parameters laid out in the data payload? We will talk about all of these in the next section.
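To check that both functions really end up in the same return code, the whole runtime can be run in a toy interpreter. The Python sketch below is far from a complete EVM: it implements only the opcodes that appear in our sample, uses a fixed 256-byte memory, and ignores gas and jump-target validation. The name execute and its parameters are made up for this illustration.

```python
WORD = 2**256

# Executable part of the runtime bytecode (0x00-0x9B), without the metadata.
CODE = bytes.fromhex(
    "6080604052" "60043610604857"             # 0x00: memory setup, size check
    "63ffffffff" "7c01" + "00" * 28 +         # 0x0C: mask and PUSH29 divisor
    "6000350416"                              # 0x2F: selector extraction
    "631003e2d2" "8114604d57"                 # 0x34: add(uint256) -> 0x4D
    "8063b69ef8a8" "14607457"                 # 0x3E: balance() -> 0x74
    "5b600080fd"                              # 0x48: revert
    "5b348015605857600080fd"                  # 0x4D: add() entry
    "5b506062600435608656"                    # 0x58: load parameter, jump
    "5b60408051918252519081900360200190f3"    # 0x62: shared return code
    "5b348015607f57600080fd"                  # 0x74: balance() entry
    "5b506062609556"                          # 0x7F: push return address, jump
    "5b6000805482019081905591905056"          # 0x86: add() body
    "5b6000548156"                            # 0x95: balance() body
    "00"                                      # 0x9B: STOP
)

# Two-operand opcodes; `a` is the top of the stack, `b` the item below it.
BINARY = {
    0x01: lambda a, b: (a + b) % WORD,        # ADD
    0x03: lambda a, b: (a - b) % WORD,        # SUB
    0x04: lambda a, b: a // b if b else 0,    # DIV
    0x10: lambda a, b: int(a < b),            # LT
    0x14: lambda a, b: int(a == b),           # EQ
    0x16: lambda a, b: a & b,                 # AND
}


def execute(code, calldata, storage, callvalue=0):
    """Run `code` until RETURN or REVERT; returns (status, output bytes)."""
    stack, memory, pc = [], bytearray(0x100), 0
    while True:
        op = code[pc]
        if 0x60 <= op <= 0x7F:                            # PUSH1..PUSH32
            size = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1:pc + 1 + size], "big"))
            pc += size
        elif 0x80 <= op <= 0x8F:                          # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:                          # SWAP1..SWAP16
            depth = op - 0x8F
            stack[-1], stack[-1 - depth] = stack[-1 - depth], stack[-1]
        elif op in BINARY:
            a, b = stack.pop(), stack.pop()
            stack.append(BINARY[op](a, b))
        elif op == 0x15:                                  # ISZERO
            stack.append(int(stack.pop() == 0))
        elif op == 0x34:                                  # CALLVALUE
            stack.append(callvalue)
        elif op == 0x35:                                  # CALLDATALOAD
            offset = stack.pop()
            chunk = calldata[offset:offset + 32].ljust(32, b"\x00")
            stack.append(int.from_bytes(chunk, "big"))
        elif op == 0x36:                                  # CALLDATASIZE
            stack.append(len(calldata))
        elif op == 0x50:                                  # POP
            stack.pop()
        elif op == 0x51:                                  # MLOAD
            offset = stack.pop()
            stack.append(int.from_bytes(memory[offset:offset + 32], "big"))
        elif op == 0x52:                                  # MSTORE
            offset, value = stack.pop(), stack.pop()
            memory[offset:offset + 32] = value.to_bytes(32, "big")
        elif op == 0x54:                                  # SLOAD
            stack.append(storage.get(stack.pop(), 0))
        elif op == 0x55:                                  # SSTORE
            key, value = stack.pop(), stack.pop()
            storage[key] = value
        elif op in (0x56, 0x57):                          # JUMP / JUMPI
            destination = stack.pop()
            if op == 0x56 or stack.pop() != 0:
                pc = destination
                continue
        elif op in (0xF3, 0xFD):                          # RETURN / REVERT
            offset, size = stack.pop(), stack.pop()
            status = "return" if op == 0xF3 else "revert"
            return status, bytes(memory[offset:offset + size])
        elif op != 0x5B:                                  # JUMPDEST is a no-op
            raise ValueError("unsupported opcode 0x%02x at 0x%02x" % (op, pc))
        pc += 1


storage = {}
add_5 = bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big")
print(execute(CODE, add_5, storage))
print(execute(CODE, bytes.fromhex("b69ef8a8"), storage))
```

Calling add(5) and then balance() should both return the same 32-byte word, because both paths finish in the code at 0x62.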