Smart Contract Insight – an online EVM decompiler

Since I started working in the Ethereum ecosystem and auditing Ethereum smart contract in bytecode format. I have evaluated many well-known projects which claimed they can decompile EVM (Ethereum Virtual Machine) bytecode. However, none of them really show good result for real world examples. So reading the EVM opcodes from the smart contracts is a really frustrating job and you can be lost anywhere among the JUMPs and POPs. So an idea popped into my mind that why not make a really working one to speed up the audit of raw EVM bytecode.

Fortunately, there are already several good articles talking about EVM bytecode structures. However, those resources still won’t make you fully understand how to make a real EVM decompiler. There are a lot of details lack from either official documents or other research. I have written a series of articles about the things I have learned from the development of the tool. If you are interested in them, please feel free to have a look:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

Understand EVM bytecode – Part 3

Understand EVM bytecode – Part 4

Besides all the information I have mentioned in those articles, there are still more things need to pay attention to if you want to development your own decompiler or automation tool based on EVM bytecode. The most difficult part I have encountered during the development is how to retrieve back the Control Flow Graph (CFG) for the code. If you are familiar with EVM opcodes, you might have noted there is no function call instruction and return. Instead there is only conditional and unconditional JUMPs. So in order to extract back the original function logic, you need to analyze the pushed return address of the instructions in the stack to decide the range of a function. However, if you think you can skip this step and take the whole code as one function, you will have even more troublesome when dealing with loops. Because without defining functions any re-entrancy of code blocks might be judged as a loop mistakenly.

Another thing worth mentioning is that EVM itself is under development too. Also there are several high level languages you can choose to compile a smart contract. So the compiled EVM bytecode is very compiler dependent, even version dependent on by same compiler. For my research work, I used Remix online Solidity compiler. If you choose different versions to compile on a same piece of code, you will be surprised by the results. Since Solidity is under development too, it is no way you can expect consistent compiled results. So these facts will make the developer life harder, especially when more readable content is expected from the decompiler.

We have talked the problems you might encountered during the development. Let’s look at a demo of the online decompiler we have published. You can easily use it by following link:

https://www.trustlook.com/products/smartcontractinsight

To analyze a piece of EVM bytecode, you can either specify the smart contract address which has been deployed on the main network of Ethereum, or simply copy and paste your bytecode into the text box. After clicking on the “Decompile Now” button, the result will be shown on the page. Let’s take some simple Solidity source code for a testing:

pragma solidity ^0.4.18;

contract Bank {

  // balances, indexed by addresses
mapping(address => uint256) public balanceOf;

function deposit(uint256 amount) public payable {
require(msg.value == amount);


  // adjust the account's balance
balanceOf[msg.sender] += amount;
}

}

After compiler the code using Solidity 0.4.18, you will get following EVM runtime bytecode:

60606040526004361060485763ffffffff7c01000000000000000000000000
0000000000000000000000000000000060003504166370a082318114604d57
8063b6b55f25146088575b600080fd5b3415605757600080fd5b607673ffff
ffffffffffffffffffffffffffffffffffff600435166093565b6040519081
5260200160405180910390f35b609160043560a5565b005b60006020819052
908152604090205481565b34811460b057600080fd5b73ffffffffffffffff
ffffffffffffffffffffffff33166000908152602081905260409020805490
910190555600a165627a7a7230582023b9654f2013fddc872624140433014f
75b2dc3414e8d795bf7330ebc818ef6a0029

To understand why we call it “runtime” bytecode, you can refer to my write-ups about “Understanding EVM bytecode”. You can simply copy above bytecode and submit to the online decompiler like this:

Then you can click on the “Decompile Now” button. After a few seconds, you could have the decompiled result shown in the text box:

From the decompiled result, we can clearly see the original variable declarations and functions. Besides that, it also inspect the bytecode and look for potential known vulnerabilities. Here, an integer overflow was discovered in function deposit. If you do not trust the automation inspection result, you can always click on the “Request Audit” button for a manual review, our team will be happily to assistant you with our expertise.

Of course this sample is pretty simple, the real world example can be much more complicated. You can also check them by using the tool. The tool currently have following features:

  • Public and private functions recognition
  • Storage variables recognition
  • Memory related operations optimization
  • External calls commentating
  • Embed Solidity functions recognitions (ecrecover, sha256, ripemd160, sha3)
  • Well-known vulnerabilities inspection

Besides the above features, there are still more things which can be improved in future. Loops recognition can be done in better way other than giving goto statements for now. Also dynamical size variables like bytes and string can be optimized in better way for readability. There might also be bugs on showing local variables since there is no RET instruction in EVM. So the stack alignment is always an issue when function returns. Since the tool is still on experimental, you are very welcome to contact us for any issue you had found.

Hope you enjoy using it, let’s us know your comments and advise!

Understand EVM bytecode – Part 2

In the first section,

Understand EVM bytecode – Part 1

We have inspected the contract creation part of the EVM bytecode of the smart contract. In this section we will analyze the runtime EVM bytecode. We will still use the sample code demo1 from previous section.

pragma solidity 0.4.25;

contract Demo1 {
uint public balance;

function add(uint value) public returns (uint256) {
balance = balance + value;
return balance;
}
}

Besides going through the whole compiled bytecode to find the EVM runtime part, there is an easier way. We can also use Remix online portal. When you compiled the source code, click on the “Details” button and scroll down. There is a member field called “object” in the “RUNTIME BYTECODE” section.

The whole EVM runtime part bytecode can be shown as follows:

60806040526004361060485763ffffffff7c010000000000000000000000000000
00000000000000000000000000006000350416631003e2d28114604d578063b69e
f8a8146074575b600080fd5b348015605857600080fd5b5060626004356086565b
60408051918252519081900360200190f35b348015607f57600080fd5b50606260
95565b60008054820190819055919050565b600054815600a165627a7a72305820
63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef0029

This part of bytecode seems much longer than the creation part. But let’s split it into parts the analyze them one by one.

0000    60  PUSH1 0x80
0002 60 PUSH1 0x40
0004 52 MSTORE
0005 60 PUSH1 0x04
0007 36 CALLDATASIZE
0008 10 LT
0009 60 PUSH1 0x48
000B 57 *JUMPI
...
0048 5B JUMPDEST
0049 60 PUSH1 0x00
004B 80 DUP1
004C FD *REVERT

If you have finished reading my first section, you would have no problem to understand the first 3 instructions set. If you haven’t done so, I really recommend you to read it first. These instructions actually save the address 0x80 in offset 0x40 in the memory as the free memory pointer for future use. There is a new opcode CALLDATASIZE which we haven’t met before at 0x07. It will get the EVM data payload size from this transaction. LT is a opcode to compare 2 items in the stack, it will returns TRUE if the comparison satisfied.  So put all pieces together we can get the equivalent Solidity assembler code as follows:

mstore(0x40,0x80);
if(msg.data.length < 0x04) { revert(0,0); }

We can see that basically these instructions will initialize the memory pointers and validate if the size of data payload is at least 4 bytes long. The reason for this is that for a regular external call to a smart contract, the first 4 bytes in the data payload is HASH value for function signature. This 4-byte value will be used by the contract to select which function to delivery the rest data which are the parameters for that function. For example, if you will call the function withdraw(0xABCD) on a smart contract, the data payload for this call will be like this:

0x3823D66C000000000000000000000000000000000000000000000000000000000000ABCD

In this example the first 4-byte value is 0x3823D66C, which is the SHA3 hash value of “withdrawn(bytes32)”. The following 32-byte integer is the parameter of the function call 0xABCD. This is a simple example of integer function parameters. Things will get more complicated when handling variable size parameters. We will talk about them later.

For now, let’s go back the instructions we have discussed. It will validate the size of data payload to be at least 4 byte long. If not, it will revert. But will this be the case for all smart contracts? What if we just send some Ethers to this smart contract without calling any function? You might already recall some functionality in Solidity programs. Yes, it is where the fallback function implemented. To approve this, you can get sample with fallback function implemented, and then check the instruction branch after the msg.data.length validation code.

To continue on our sample bytecode, we can see following code snippet:

0C: 63  PUSH4 0xffffffff
11: 7C PUSH29 0x100000000000000000000000000000000000000000000000000000000
2F: 60 PUSH1 0x00
31: 35 CALLDATALOAD
32: 04 DIV
33: 16 AND
34: 63 PUSH4 0x1003e2d2
39: 81 DUP2
3A: 14 EQ
3B: 60 PUSH1 0x4d
3D: 57 *JUMPI
3E: 80 DUP1
3F: 63 PUSH4 0xb69ef8a8
44: 14 EQ
45: 60 PUSH1 0x74
47: 57 *JUMPI

If you look at more EVM bytecode from different smart contracts, you will always find the same code snippet or functionality equivalent code snippet. This code snippet is not from the code you have put into your source, and it was injected by the compiler. Let’s go through every lines.

First, the code pushes 0xFFFFFFFF into the stack. This value will be used as one of the operands of the AND opcode at 0x33. Then the code will pushes another huge constant value, which will be used as the divider of DIV at 0x32. For instructions at 0x2F, 0x31, the code will get the first 32-byte value from data payload using CALLDATALOAD(0x0). Combined with the instruction set of the later DIV and AND, we can see these instructions actually get the first 4-byte value of data payload, which is the function signature HASH value. The calculated result is pushed into stack. Then the value is compared with 0x1003e2d2 using the EQ opcode at 0x3A. If it was true, the execution will be led to address 0x4D by JUMPI. If not the code will keep continue and the result will be compared with another HASH value 0xb69ef8a8.

Now the logic of this code snippet is pretty clear. It gets the first 4-byte value from data payload and decides which function it will be selected to run. For the code addresses at 0x4D and 0x74 they are the entrance of each public functions the caller can access to the smart contracts. If none of the HASH value is satisfied from the code, then it will result in the fallback function of the smart contracts. If it was not defined, it will simply revert.

Since HASH algorithms will make information lost during the calculation, it is impossible to get the original function signature information back theatrically. However, you can still make a guess on the original information based on a huge collection of HASH values. That is what website www.4byte.directory does. It collects tons of function signatures and their hashes and provide a WEB API for searching. For example, by browsing following link:

https://www.4byte.directory/signatures/?bytes4_signature=0x1003e2d2

You will get the result as:

add(uint256)

Amazingly, that is the original function definition in our demo1.sol program. So by using this service we can pretty much reverse back most of the function signatures if they are collected before.

Until now we have known how public functions can be accessed by external callers. Now let’s go further into the specific functions. Address 0x4D is the entrance of add() function:

004D    5B  JUMPDEST
004E 34 CALLVALUE
004F 80 DUP1
0050 15 ISZERO
0051 60 PUSH1 0x58
0053 57 *JUMPI
0054 60 PUSH1 0x00
0056 80 DUP1
0057 FD *REVERT
0058 5B JUMPDEST
0059 50 POP
005A 60 PUSH1 0x62
005C 60 PUSH1 0x04
005E 35 CALLDATALOAD
005F 60 PUSH1 0x86
0061 56 *JUMP

At the entrance of the function add(), there is an opcode JUMPDEST. This is a special opcode which only marks an address that can be jumped to. It does not seem to play an important role to the EVM implementation. However, you will see it does help to determine the control flow graph (CFG) for bytecode. We will discuss it in the future section.

The instructions set at 0x4F-0x57 have been seen in previous section. They were injected by compiler for the non-payable function. After this validation code, the PUSH1 opcode at address 0x5A is easy to be ignored. However, this PUSH1 is very important for the code to jump back later. Let’s just remember an address 0x62 is pushed into the stack for now. Then, CALLDATALOAD(0x04) is called to load the parameter from data payload, which is located at offset 0x04. After getting the parameter, the code will jump to 0x86 for execution:

0086    5B  JUMPDEST
0087 60 PUSH1 0x00
0089 80 DUP1
008A 54 SLOAD
008B 82 DUP3
008C 01 ADD
008D 90 SWAP1
008E 81 DUP2
008F 90 SWAP1
0090 55 SSTORE
0091 91 SWAP2
0092 90 SWAP1
0093 50 POP
0094 56 *JUMP

In above code snippet, the value at 0x0 in the storage is loaded by using SLOAD(0x0). Then, this value will be added with the parameter loaded from data payload and saved back to the same location 0x0 in storage. Finally we see the code we have put inside the add() function:

balance = balance + value;

Since we only used one integer variable balance inside the smart contract. The compiler give the offset 0x0 to this variable. So any read or write operation on this balance variable will be put on the offset 0x0 in the storage. At the end of the code snippet, a JUMP is used to jump back to 0x62. If you still recall where this value 0x62 was pushed into the stack. This operation might remind you of something on X86 architecture. Yes, that is actually the the call and ret for function calls. Since EVM doesn’t support function calls on bytecode level, it can only use PUSH and JUMP opcodes for function calls. This way will get a lot of trouble to build back the CFG from bytecode.

Let’s jump back to address 0x62 to see what happens next:

0062    5B  JUMPDEST
0063 60 PUSH1 0x40
0065 80 DUP1
0066 51 MLOAD
0067 91 SWAP2
0068 82 DUP3
0069 52 MSTORE
006A 51 MLOAD
006B 90 SWAP1
006C 81 DUP2
006D 90 SWAP1
006E 03 SUB
006F 60 PUSH1 0x20
0071 01 ADD
0072 90 SWAP1
0073 F3 *RETURN

This code snippet has several items adjustments on the stack by opcodes DUP and SWAP. It is not obvious to understand the meaning of the instructions. I will leave this to you to simulate a stack for the execution. For the equivalent assembler code of above code is as follows:

mstore(mload(0x40), value);
return(mload(0x40), 0x20);

The value inside the code was the calculated result from previous add operation, which is the new balance value. Finally, we have gone through all bytecode inside the add() function.

Now let’s go back to the function dispatch code snippet for the second function HASH 0xb69ef8a8. The entrance for that function is at 0x74:

0074    5B  JUMPDEST
0075 34 CALLVALUE
0076 80 DUP1
0077 15 ISZERO
0078 60 PUSH1 0x7f
007A 57 *JUMPI
007D 80 DUP1
007E FD *REVERT
007F 5B JUMPDEST
0080 50 POP
0081 60 PUSH1 0x62
0083 60 PUSH1 0x95
0085 56 *JUMP
...
0095 5B JUMPDEST
0096 60 PUSH1 0x00
0098 54 SLOAD
0099 81 DUP2
009A 56 *JUMP

We can see the first part of the code is really similar to the previous function except no parameter was loaded. Then the function will call a code snippet at 0x95. The instructions at 0x95-0x98 just load the value at 0x0 in the storage and return. We noted that the code snippet at 0x62 was reused for both functions. That is because both functions will return the storage variable balance back.

You may wonder why there is the function HASH 0xb69ef8a8 inside the function dispatch code? Isn’t there only one function add() inside the smart contract? If you use 4bytes database to check that HASH, you will get balance(). Apparently, the storage variable is recognized as a public function without parameters by compiler.

To summarize the section, we have discussed the whole structure of the runtime part of the bytecode. How functions are accessed by the external callers, how parameters are transferred. But for this demo example, we only put in some integer storage variables. How will it like for mappings or variable length arrays? How will parameters presented in the data payload for strings? We will talk about all these in next section.

Understand EVM bytecode – Part 3