Smart Contract Insight – an online EVM decompiler

Since I started working in the Ethereum ecosystem and auditing smart contracts in bytecode form, I have evaluated many well-known projects that claim to decompile EVM (Ethereum Virtual Machine) bytecode. However, none of them produces good results on real-world examples. Reading raw EVM opcodes is a frustrating job, and it is easy to get lost among the JUMPs and POPs. So an idea popped into my mind: why not build a decompiler that really works, to speed up the audit of raw EVM bytecode?

Fortunately, there are already several good articles about the structure of EVM bytecode. However, those resources alone won’t teach you how to build a real EVM decompiler; many details are missing from both the official documents and other research. I have written a series of articles about what I learned while developing the tool. If you are interested, please feel free to have a look:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

Understand EVM bytecode – Part 3

Understand EVM bytecode – Part 4

Besides the information in those articles, there are more things to pay attention to if you want to develop your own decompiler or automation tool based on EVM bytecode. The most difficult part I encountered during development was recovering the Control Flow Graph (CFG) of the code. If you are familiar with EVM opcodes, you may have noticed that there are no function call or return instructions. Instead, there are only conditional and unconditional JUMPs. So to recover the original function logic, you need to analyze the return addresses pushed onto the stack to determine the range of each function. If you think you can skip this step and treat the whole code as one function, you will run into even more trouble when dealing with loops: without function boundaries, any re-entry into a code block may be mistaken for a loop.

Another thing worth mentioning is that the EVM itself is still under development. There are also several high-level languages you can choose from to write a smart contract. So the compiled EVM bytecode depends heavily on the compiler, and even on the version of the same compiler. For my research, I used the Remix online Solidity compiler. If you compile the same piece of code with different versions, you will be surprised by the results. Since Solidity is under development too, you cannot expect consistent compiled output. These facts make a developer’s life harder, especially when more readable output is expected from the decompiler.

We have talked about the problems you might encounter during development. Now let’s look at a demo of the online decompiler we have published. You can use it through the following link:

https://www.trustlook.com/products/smartcontractinsight

To analyze a piece of EVM bytecode, you can either specify the address of a smart contract deployed on the Ethereum main network, or simply copy and paste your bytecode into the text box. After you click the “Decompile Now” button, the result is shown on the page. Let’s take some simple Solidity source code as a test:

pragma solidity ^0.4.18;

contract Bank {

    // balances, indexed by addresses
    mapping(address => uint256) public balanceOf;

    function deposit(uint256 amount) public payable {
        require(msg.value == amount);

        // adjust the account's balance
        balanceOf[msg.sender] += amount;
    }

}

After compiling the code with Solidity 0.4.18, you will get the following EVM runtime bytecode:

60606040526004361060485763ffffffff7c01000000000000000000000000
0000000000000000000000000000000060003504166370a082318114604d57
8063b6b55f25146088575b600080fd5b3415605757600080fd5b607673ffff
ffffffffffffffffffffffffffffffffffff600435166093565b6040519081
5260200160405180910390f35b609160043560a5565b005b60006020819052
908152604090205481565b34811460b057600080fd5b73ffffffffffffffff
ffffffffffffffffffffffff33166000908152602081905260409020805490
910190555600a165627a7a7230582023b9654f2013fddc872624140433014f
75b2dc3414e8d795bf7330ebc818ef6a0029

To understand why we call it “runtime” bytecode, you can refer to my write-ups on “Understand EVM bytecode”. Simply copy the bytecode above and submit it to the online decompiler.

Then click on the “Decompile Now” button. After a few seconds, the decompiled result is shown in the text box.

From the decompiled result, we can clearly see the original variable declarations and functions. Besides that, the tool also inspects the bytecode and looks for known potential vulnerabilities. Here, an integer overflow was discovered in the function deposit. If you do not trust the automated inspection result, you can always click on the “Request Audit” button for a manual review; our team will be happy to assist you with our expertise.

Of course, this sample is pretty simple; real-world examples can be much more complicated. You can check those with the tool as well. The tool currently has the following features:

  • Public and private function recognition
  • Storage variable recognition
  • Memory-related operation optimization
  • External call annotation
  • Built-in Solidity function recognition (ecrecover, sha256, ripemd160, sha3)
  • Well-known vulnerability inspection

Besides the above features, there are still more things that can be improved in the future. Loop recognition can be done in a better way than emitting goto statements as it does now. Dynamically sized variables like bytes and string can also be presented in a more readable way. There may also be bugs in how local variables are shown: since there is no RET instruction in the EVM, stack alignment is always an issue when a function returns. Since the tool is still experimental, you are very welcome to contact us about any issue you find.

Hope you enjoy using it. Let us know your comments and advice!

Understand EVM bytecode – Part 3

In previous sections:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

We have talked about the creation and runtime parts of the EVM bytecode. We have seen that stack variables are commonly used to supply operands to opcodes or to pass integer arguments between internal function calls. All state variables need to be saved in storage. However, storage is designed as a dictionary (hash table) that holds all data as key-value pairs: each 32-byte key maps to one 32-byte value. So how can it implement more complicated data structures such as structs, mappings, and variable-length arrays? Let’s dig into their implementations.

Let’s start with the relatively easy ones: fixed-size data types. We already know that the EVM supports integers of different lengths, from 1 byte to 32 bytes. If you only define several 32-byte integer variables in your smart contract, the compiler simply assigns them a sequence of storage addresses starting from 0x0. For example:

contract data1 {
    uint256 public balance1;
    uint256 public balance2;
    uint256 public balance3;
}

If you compile the above code and check the EVM bytecode, you will find that all code that reads or writes the 3 variables balance1, balance2 and balance3 maps to SLOAD or SSTORE operations on addresses 0x0, 0x1, and 0x2 respectively. However, we all know that storage usage costs a lot of gas. If you have no idea what gas is in the EVM, I recommend doing a bit of research on it; there are tons of articles about it. It is basically the execution cost of your contract code. To reduce the gas cost of storage, multiple small integers that are declared next to each other are packed into 1 storage slot. For example:

contract data1 {
    uint128 public balance1;
    uint128 public balance2;
    uint256 public balance3;
}

When 2 variables are declared as uint128, they fit together in one storage slot at address 0x0, and the third variable balance3 gets address 0x1 to itself. When the code refers to balance1 or balance2, only the relevant 16 bytes are extracted from the slot after SLOAD, or updated before SSTORE. As the bytecode below shows, balance1 occupies the low-order 16 bytes of the slot and balance2 the high-order 16 bytes. Let’s look at a real example:

pragma solidity 0.4.25;
contract Demo1 {

    uint128 public balance1;
    uint128 public balance2;
    uint256 public balance3;

    function add() public returns (uint256) {
        balance3 = balance1 + balance2;
        return balance3;
    }
}

After compiling it using Remix, we can get the runtime bytecode as:

6080604052600436106100615763ffffffff7c0100000000000000000000000000
00000000000000000000000000000060003504166340441eec8114610066578063
4f2be91f146100a0578063c45c4f58146100c7578063f24a0faa146100dc575b60
0080fd5b34801561007257600080fd5b5061007b6100f1565b604080516fffffff
ffffffffffffffffffffffffff9092168252519081900360200190f35b34801561
00ac57600080fd5b506100b561011d565b60408051918252519081900360200190
f35b3480156100d357600080fd5b5061007b610158565b3480156100e857600080
fd5b506100b5610170565b60005470010000000000000000000000000000000090
046fffffffffffffffffffffffffffffffff1681565b6000546fffffffffffffff
ffffffffffffffffff808216700100000000000000000000000000000000909204
81169190910116600181905590565b6000546fffffffffffffffffffffffffffff
ffff1681565b600154815600a165627a7a72305820fa0e623e455a9cc0439ea393
dff5b50cc571150034eb57658dd9718e3982b1590029

To save time, we won’t go through all of it. Let’s just look at the addition inside the add() function:

011E 60 PUSH1 0x00
0120 54 SLOAD
0121 6F PUSH16 0xffffffffffffffffffffffffffffffff
0132 80 DUP1
0133 82 DUP3
0134 16 AND
0135 70 PUSH17 0x0100000000000000000000000000000000
0147 90 SWAP1
0148 92 SWAP3
0149 04 DIV
014A 81 DUP2
014B 16 AND
014C 91 SWAP2
014D 90 SWAP1
014E 91 SWAP2
014F 01 ADD
0150 16 AND
0151 60 PUSH1 0x01
0153 81 DUP2
0154 90 SWAP1
0155 55 SSTORE

To make it more readable, here is the equivalent assembler-style code:

sstore(0x1,uint128((uint128((sload(0x0) / 0x100000000000000000000000000000000)) + uint128(sload(0x0)))));

The use of storage slots 0x0 and 0x1 is clearly visible in the line above, which corresponds to balance3 = balance1 + balance2;.

We can see that one line of Solidity code can result in a lot of opcode instructions. When we discuss even more complicated data structures later, looking at every opcode would be overwhelming. So to keep the content readable, I will only show the equivalent assembler code to demonstrate the logic. However, you are always welcome to spend more time digging through the bytecode one instruction at a time for practice.
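The slot arithmetic in that expression can be replayed outside the EVM. The following Python sketch is only an illustration: the helper names (pack_slot0, add) and the sample values are made up for this example, and a plain dict stands in for contract storage.

```python
MASK128 = (1 << 128) - 1  # same value as PUSH16 0xffffffffffffffffffffffffffffffff
SHIFT128 = 1 << 128       # same value as PUSH17 0x0100000000000000000000000000000000


def pack_slot0(balance1: int, balance2: int) -> int:
    """Pack two uint128 values into one 32-byte slot (balance1 in the low half)."""
    return ((balance2 & MASK128) << 128) | (balance1 & MASK128)


def add(storage: dict) -> int:
    """Model of sstore(0x1, uint128(uint128(sload(0x0) / 2**128) + uint128(sload(0x0))))."""
    slot0 = storage.get(0, 0)
    balance1 = slot0 & MASK128                    # AND with the 16-byte mask
    balance2 = (slot0 // SHIFT128) & MASK128      # DIV by 2**128, then AND
    storage[1] = (balance1 + balance2) & MASK128  # the last AND truncates to uint128
    return storage[1]


storage = {0: pack_slot0(balance1=7, balance2=35)}
print(add(storage))  # 42
```

Note that the final mask makes the sum wrap around at 2**128 before it is written to slot 0x1, exactly as the trailing AND in the listing does.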

We have talked about the most basic data type, Integer, in EVM. Now let’s look at struct in Solidity:

pragma solidity 0.4.25;
contract Data2 {
    struct Funder {
        address addr;
        uint256 amount;
    }
    Funder test;
    function deposit(address addr, uint256 amount) public returns (uint256) {
        test.addr = addr;
        test.amount = amount;
        return amount;
    }
}

After generating the runtime EVM bytecode and locating the deposit() function, we can see that the equivalent assembler code looks like this:

sstore(0x0,uint160(arg0));
sstore(0x1,arg1);

We can see that even though we define a struct Funder in our Solidity program, the compiler generates the same code as for 2 ordinary storage variables. Also, if multiple members of the struct fit in one storage slot, they are packed as described earlier.

Now let’s talk about a very popular data type, mapping, in Solidity.

pragma solidity ^0.4.25;
contract Bank {
    mapping(address => uint256) public balanceOf; // balances, indexed by addresses
    function deposit(uint256 amount) public payable {
        require(msg.value == amount);

        balanceOf[msg.sender] += amount; // adjust the account's balance
    }
}

If you are familiar with smart contract development, you have probably seen many Dapps use a mapping like this to record account balances. Let’s compile it and check the assembler code for the mapping variable balanceOf:

mstore(0x20,0x0);
mstore(0x0,msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF);
temp0 = keccak256(0x0,0x40);
return(sload(temp0));

We can see that msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF reads the parameter from the data payload as an address value. This argument is then written to memory address 0x0, and memory address 0x20 is set to the value 0x0. This 0x0 is the declaration index of the storage variable; it is 0x0 only because balanceOf is the first variable declared in the smart contract. Once the memory is set up, the SHA3 opcode hashes these 0x40 bytes, and the result is used as the storage key for the mapping entry. From the code above, we can see that when you declare a one-level mapping variable, the compiler effectively treats it as a function that takes one argument and returns one value. For readability, let’s use arg0 from now on for the value placed at memory address 0x0 in the mapping calculation. Now let’s take a look at a 2-level mapping:

mapping (address => mapping (address => uint256)) public tokens;

You might have seen mapping variables similar to the one above. When you compile the code and generate the bytecode, its assembler code looks like this:

mstore(0x20,0x0);
mstore(0x0,arg0);
temp0 = keccak256(0x0,0x40);
mstore(0x20,temp0);
mstore(0x0,arg1);
temp1 = keccak256(0x0,0x40);
return(sload(temp1));

Apparently, a 2-level mapping variable is effectively a function that takes 2 arguments and returns one value. For this tokens example, the code first sets up memory with 0x0 at address 0x20 (we know 0x0 is the variable declaration index) and arg0 at address 0x0. Then the first hash value is calculated with the SHA3 opcode. This calculated value is then written to memory address 0x20 again, and the second argument arg1 is written to address 0x0 for another SHA3 calculation. The result of the second hash is used as the storage key for the entry. By the same logic, you can keep adding levels to your mapping variables, and the storage key will always be the last calculated SHA3 result.

By now you might better understand why the free memory pointer is always saved at 0x40, not at 0x0 or 0x20: those addresses are reserved for hash calculations. So far we have visited several data structures in Solidity. What if we put some of them together? For example, what if we have a mapping variable that returns a struct value:
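The same key derivation can be reproduced off-chain. The sketch below is my own illustration rather than output of the tool: it carries a small pure-Python Keccak-256 (the hash behind the SHA3 opcode, which differs from hashlib.sha3_256 in its padding byte), and the helper names mapping_slot and nested_mapping_slot as well as the sample addresses are invented for the example.

```python
MASK64 = (1 << 64) - 1


def rol64(value: int, n: int) -> int:
    n %= 64
    return ((value << n) | (value >> (64 - n))) & MASK64


def keccak_f1600(lanes):
    """Keccak-f[1600] permutation on a 5x5 matrix of 64-bit lanes, lanes[x][y]."""
    r = 1
    for _ in range(24):
        # theta
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4] for x in range(5)]
        d = [c[(x + 4) % 5] ^ rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        # rho and pi
        x, y = 1, 0
        current = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            current, lanes[x][y] = lanes[x][y], rol64(current, (t + 1) * (t + 2) // 2)
        # chi
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ (~row[(x + 1) % 5] & row[(x + 2) % 5])
        # iota: round constants generated by an LFSR
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    rate = 136  # bytes absorbed per block for a 256-bit digest
    padded = bytearray(data)
    padded.append(0x01)  # Keccak padding (NIST SHA-3 would use 0x06 here)
    while len(padded) % rate:
        padded.append(0x00)
    padded[-1] |= 0x80
    lanes = [[0] * 5 for _ in range(5)]
    for off in range(0, len(padded), rate):
        for i in range(rate // 8):
            chunk = padded[off + 8 * i: off + 8 * i + 8]
            lanes[i % 5][i // 5] ^= int.from_bytes(chunk, "little")
        lanes = keccak_f1600(lanes)
    return b"".join(lanes[i % 5][i // 5].to_bytes(8, "little") for i in range(4))


def word(value: int) -> bytes:
    return value.to_bytes(32, "big")


def mapping_slot(key: int, base: int) -> int:
    """Key at memory 0x00, declaration index (or previous hash) at 0x20, hash 0x40 bytes."""
    return int.from_bytes(keccak256(word(key) + word(base)), "big")


def nested_mapping_slot(key0: int, key1: int, base: int) -> int:
    """Two-level mapping: the first hash replaces the declaration index."""
    return mapping_slot(key1, mapping_slot(key0, base))


owner, spender = 0x1111, 0x2222  # made-up addresses
print(hex(mapping_slot(owner, 0)))
print(hex(nested_mapping_slot(owner, spender, 0)))
```

Any number of levels follows the same pattern: each extra key is hashed together with the result of the previous level.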

pragma solidity 0.4.25;
contract Data4 {
    struct Funder {
        uint256 last_deposit;
        uint256 amount;
    }
    mapping(address => Funder) public balanceOf;
}

Let’s have a look at the decompiled code for balanceOf then:

function balanceOf( address arg0) public return (var0,var1)
{
    mstore(0x20,0x0);
    mstore(0x0,arg0);
    temp0 = keccak256(0x0,0x40);
    return(sload(temp0),sload((temp0 + 0x1)));
}

We can see that the SHA3 calculation is the same as for a regular 1-level mapping variable. The only difference is that a regular data type like an integer uses the hash result directly as its storage key, whereas each member of a struct adds its declaration index within the struct to that hash value to form its storage key. So in this case, the 2 members of struct Funder add 0x0 and 0x1 respectively to the hash value temp0 to get their storage keys.

Next, we will have a look at the data structures with variable length. For example, we have a smart contract coded as follows:

pragma solidity 0.4.25;
contract Data4 {
    address[] public senders;
    function add() public {
        senders.push(msg.sender);
    }
}

We defined a variable-length array senders. Let’s see how the EVM implements this variable in storage:

function senders( uint256 arg0) public return (var0)
{
    assert((arg0 < sload(0x0)));
    mstore(0x0,0x0);
    temp0 = keccak256(0x0,0x20);
    return(uint160(sload((temp0 + arg0))));
}

From the above assembler code, we can observe that the variable is implemented like a function that takes an index as its one argument and returns the corresponding item of the array. In the Solidity source we only defined one array variable, senders. As discussed earlier, every storage variable is assigned an address index, and since there is only one variable, it gets 0x0. However, the single slot at 0x0 cannot hold all the data of the array. Instead, it holds the length of the array. For the array data, storage uses the SHA3 hash of the variable’s address index (0x0 in this case) as the start of the data, and the array index is added to this hash value to get the storage key of the respective item. Now let’s look back at the code snippet. When an external caller accesses an item of the array by index, the code first checks that the index is smaller than the length of the array. If it is, the hash value is calculated and the index is added to it to locate the item.

So far we have discussed struct, mapping, and array. There is one more data type you might be interested in: string. In Solidity, string and bytes are stored the same way, as a variable-length sequence of bytes. Let’s see what a string variable declared in storage looks like:
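As a rough model of this layout, the sketch below keeps storage in a Python dict. The constant is the Keccak-256 hash of 32 zero bytes, that is, the data start of a dynamic array declared at index 0x0; the function names push_sender and read_sender and the sample addresses are assumptions made for the illustration.

```python
# keccak256 of the 32-byte word 0x00...00: where the data of the array at index 0x0 begins.
ARRAY_DATA_START = 0x290DECD9548B62A8D60345A988386FC84BA6BC95484008F6362F93160EF3E563
MASK160 = (1 << 160) - 1
WORD = 1 << 256


def push_sender(storage: dict, sender: int) -> None:
    """Model of senders.push(msg.sender): write the item, then bump the length in slot 0x0."""
    length = storage.get(0, 0)
    storage[(ARRAY_DATA_START + length) % WORD] = sender & MASK160
    storage[0] = length + 1


def read_sender(storage: dict, index: int) -> int:
    """Model of the senders(uint256) getter shown above."""
    if not index < storage.get(0, 0):  # assert((arg0 < sload(0x0)))
        raise IndexError("index out of range")
    return storage.get((ARRAY_DATA_START + index) % WORD, 0) & MASK160


storage = {}
push_sender(storage, 0xAAAA)
push_sender(storage, 0xBBBB)
print(storage[0])                    # 2
print(hex(read_sender(storage, 1)))  # 0xbbbb
```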

pragma solidity 0.4.25;
contract Data6 {
    string public hello = "Hello World!";
}

This smart contract does nothing but return a string to the caller. Let’s see how the string is saved in storage. Even though it is a single line of Solidity, the compiled assembler code is fairly involved:

temp0 = mload(0x40);
mstore(0x40,(temp0 + (0x20 + (((0x1F + ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2)) / 0x20) * 0x20))));
mstore(temp0,((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2));
var5 = (0x20 + temp0);
var7 = ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2);
if (var7)
{
    if ((0x1F < var7))
    {
        temp1 = (var5 + var7);
        var5 = temp1;
        mstore(0x0,0x0);
        temp2 = keccak256(0x0,0x20);
        var7 = var5;
        var6 = temp2;
    label_00000148:
        mstore(var7,sload(var6));
        var6 = (0x1 + var6);
        var7 = (0x20 + var7);
        if ((var5 > var7))
        {
            goto label_00000148;
        }
        else
        {
            temp4 = (var5 + (0x1F & (var7 - var5)));
            var5 = temp4;
    label_00000165:
            return;
        }
    }
    else
    {
        mstore(var5,((sload(0x0) / 0x100) * 0x100));
        goto label_00000165;
    }
}

The assembler code is a bit complicated, so let’s explore its logic. First, let’s figure out the expression “((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2)”. Since hello is the only variable defined in the contract, storage address 0x0 is reserved for it. But the slot is not just a length field as it is for arrays. From the comparison “((0x1 & sload(0x0)) == 0)” we know the lowest bit of the slot is a flag. If the bit is 1, the comparison is false (0x0), the mask becomes 0x0 - 0x1, which is all ones, and the whole expression reduces to “(uint256(sload(0x0)) / 0x2)”. If the bit is 0, the mask is 0x100 - 0x1 = 0xFF and the expression reduces to “(uint8(sload(0x0)) / 0x2)”. Either way the result is the string length, which is assigned to var7 in the code above. This value is then compared with 0x1F. If it is not greater than 0x1F, the line “mstore(var5,((sload(0x0) / 0x100) * 0x100));” copies the first 0x1F bytes of storage slot 0x0 to memory. If it is greater than 0x1F, the data is read in a loop much like the one we have seen for variable-length arrays.

Based on the assembler code, we can work out how strings are stored. To make the best use of storage, a string of at most 31 (0x1F) bytes is saved together with its length in a single slot: the string occupies the first 31 bytes of the slot, the last byte holds the length multiplied by 2, and so the lowest bit is 0. If the string is longer than 31 bytes, one storage slot can’t hold all the data. In that case the slot only holds the length field (with the lowest bit set to 1), and the data is stored the same way as a variable-length array.

So far, we have discussed how most of the data types in Solidity are implemented in storage. In the next section we will talk about memory and its interaction with the data payload.
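The two cases can be summarized in a few lines of Python. This is a sketch of the layout described above, not code taken from the compiler; encode_short and decode_slot are names chosen for the example, and for long strings the value of the slot is the length times 2 plus 1, which is what makes the flag bit 1.

```python
def encode_short(text: bytes) -> int:
    """Slot value for a string of at most 31 bytes: data left-aligned, length * 2 in the last byte."""
    if len(text) > 31:
        raise ValueError("long strings keep only the length in the slot")
    return int.from_bytes(text.ljust(31, b"\x00") + bytes([2 * len(text)]), "big")


def decode_slot(slot_value: int):
    """Return (length, inline_data); inline_data is None when the data lives elsewhere."""
    if slot_value & 1 == 0:
        # Flag bit 0: short string, length = uint8(slot) / 2, data in the leading bytes.
        length = (slot_value & 0xFF) // 2
        return length, slot_value.to_bytes(32, "big")[:length]
    # Flag bit 1: the slot holds length * 2 + 1, data starts at keccak256(slot index).
    return (slot_value - 1) // 2, None


slot0 = encode_short(b"Hello World!")
print(hex(slot0 & 0xFF))   # 0x18, i.e. 12 * 2
print(decode_slot(slot0))  # (12, b'Hello World!')
```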


Understand EVM bytecode – Part 2

In the first section,

Understand EVM bytecode – Part 1

We have inspected the contract creation part of the EVM bytecode of the smart contract. In this section we will analyze the runtime EVM bytecode. We will keep using the sample contract Demo1 from the previous section.

pragma solidity 0.4.25;

contract Demo1 {
    uint public balance;

    function add(uint value) public returns (uint256) {
        balance = balance + value;
        return balance;
    }
}

Besides going through the whole compiled bytecode to find the EVM runtime part, there is an easier way: we can use the Remix online portal again. After compiling the source code, click on the “Details” button and scroll down. There is a field called “object” in the “RUNTIME BYTECODE” section.

The whole runtime part of the EVM bytecode is as follows:

60806040526004361060485763ffffffff7c010000000000000000000000000000
00000000000000000000000000006000350416631003e2d28114604d578063b69e
f8a8146074575b600080fd5b348015605857600080fd5b5060626004356086565b
60408051918252519081900360200190f35b348015607f57600080fd5b50606260
95565b60008054820190819055919050565b600054815600a165627a7a72305820
63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef0029

This part of the bytecode is much longer than the creation part. But let’s split it into parts and analyze them one by one.

0000 60 PUSH1 0x80
0002 60 PUSH1 0x40
0004 52 MSTORE
0005 60 PUSH1 0x04
0007 36 CALLDATASIZE
0008 10 LT
0009 60 PUSH1 0x48
000B 57 *JUMPI
...
0048 5B JUMPDEST
0049 60 PUSH1 0x00
004B 80 DUP1
004C FD *REVERT
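A listing like the one above can be reproduced with a very small linear disassembler. The Python sketch below is only an illustration: it knows just the opcodes that appear in this article, and the function name disassemble is my own choice. The one non-trivial rule is that a PUSHn opcode is followed by n bytes of immediate data, which must be skipped rather than decoded as instructions.

```python
NAMES = {
    0x00: "STOP", 0x01: "ADD", 0x03: "SUB", 0x04: "DIV", 0x10: "LT",
    0x14: "EQ", 0x15: "ISZERO", 0x16: "AND", 0x34: "CALLVALUE",
    0x35: "CALLDATALOAD", 0x36: "CALLDATASIZE", 0x50: "POP", 0x51: "MLOAD",
    0x52: "MSTORE", 0x54: "SLOAD", 0x55: "SSTORE", 0x56: "JUMP",
    0x57: "JUMPI", 0x5B: "JUMPDEST", 0xF3: "RETURN", 0xFD: "REVERT",
}


def disassemble(code: bytes):
    """Return a list of (offset, mnemonic, immediate_hex_or_None) tuples."""
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:      # PUSH1..PUSH32 carry 1..32 immediate bytes
            n = op - 0x5F
            out.append((pc, f"PUSH{n}", code[pc + 1: pc + 1 + n].hex()))
            pc += 1 + n
            continue
        if 0x80 <= op <= 0x8F:      # DUP1..DUP16
            name = f"DUP{op - 0x7F}"
        elif 0x90 <= op <= 0x9F:    # SWAP1..SWAP16
            name = f"SWAP{op - 0x8F}"
        else:
            name = NAMES.get(op, f"UNKNOWN_{op:02x}")
        out.append((pc, name, None))
        pc += 1
    return out


# First 12 bytes of the runtime bytecode shown above.
for offset, name, imm in disassemble(bytes.fromhex("608060405260043610604857")):
    print(f"{offset:04X} {name}" + (f" 0x{imm}" if imm else ""))
```

A linear sweep like this is enough for compiler-generated code up to the metadata trailer; bytes after it are data and would be decoded as meaningless instructions.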

If you have finished reading the first section, you will have no problem understanding the first 3 instructions. If you haven’t, I really recommend reading it first. These instructions save the value 0x80 at offset 0x40 in memory as the free memory pointer for future use. At 0x07 there is a new opcode, CALLDATASIZE, which we haven’t met before. It pushes the size of the data payload of this transaction. LT is an opcode that compares 2 items on the stack; it returns TRUE if the top item is less than the second one. Putting all the pieces together, we get the following equivalent Solidity assembler code:

mstore(0x40,0x80);
if(msg.data.length < 0x04) { revert(0,0); }

We can see that these instructions initialize the free memory pointer and check that the data payload is at least 4 bytes long. The reason is that for a regular external call to a smart contract, the first 4 bytes of the data payload are taken from the hash of the function signature. The contract uses this 4-byte value to select the function to dispatch to; the rest of the payload holds the parameters for that function. For example, if you call the function withdrawn(0xABCD) on a smart contract, the data payload for this call looks like this:

0x3823D66C000000000000000000000000000000000000000000000000000000000000ABCD

In this example the first 4-byte value is 0x3823D66C, which comes from the SHA3 hash of “withdrawn(bytes32)”. The 32-byte value that follows is the parameter of the function call, 0xABCD. This is a simple example with a fixed-size function parameter. Things get more complicated with variable-size parameters. We will talk about them later.

For now, let’s go back to the instructions we have discussed. They check that the data payload is at least 4 bytes long and revert if it is not. But is this the case for all smart contracts? What if we just send some Ether to the smart contract without calling any function? You might recall a feature of Solidity programs here. Yes, this is where the fallback function comes in. To verify this, you can take a sample with a fallback function implemented and check the instruction branch that follows the msg.data.length validation code.
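The layout of such a payload is simple enough to build and take apart by hand. Here is a minimal Python sketch for fixed-size (32-byte) arguments only; encode_call and decode_call are hypothetical helper names, and the selector is the one quoted in the text rather than one computed here.

```python
def encode_call(selector: bytes, *args: int) -> bytes:
    """4-byte selector followed by each argument as a big-endian 32-byte word."""
    if len(selector) != 4:
        raise ValueError("selector must be 4 bytes")
    return selector + b"".join(arg.to_bytes(32, "big") for arg in args)


def decode_call(payload: bytes):
    """Split a payload into (selector, [32-byte words as integers])."""
    selector, body = payload[:4], payload[4:]
    words = [int.from_bytes(body[i:i + 32], "big") for i in range(0, len(body), 32)]
    return selector, words


payload = encode_call(bytes.fromhex("3823d66c"), 0xABCD)
print(len(payload))          # 36 bytes: 4 + 32
print("0x" + payload.hex())
print(decode_call(payload))
```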

Continuing with our sample bytecode, we can see the following code snippet:

0C: 63 PUSH4 0xffffffff
11: 7C PUSH29 0x100000000000000000000000000000000000000000000000000000000
2F: 60 PUSH1 0x00
31: 35 CALLDATALOAD
32: 04 DIV
33: 16 AND
34: 63 PUSH4 0x1003e2d2
39: 81 DUP2
3A: 14 EQ
3B: 60 PUSH1 0x4d
3D: 57 *JUMPI
3E: 80 DUP1
3F: 63 PUSH4 0xb69ef8a8
44: 14 EQ
45: 60 PUSH1 0x74
47: 57 *JUMPI

If you look at EVM bytecode from different smart contracts, you will always find the same code snippet or a functionally equivalent one. This snippet does not come from the code you wrote in your source; it is injected by the compiler. Let’s go through it line by line.

First, the code pushes 0xFFFFFFFF onto the stack. This value will be used as one of the operands of the AND opcode at 0x33. Then the code pushes another huge constant, which will be used as the divisor of the DIV at 0x32. With the instructions at 0x2F and 0x31, the code loads the first 32-byte value of the data payload using CALLDATALOAD(0x0). Combined with the later DIV and AND, these instructions extract the first 4 bytes of the data payload, which is the hash value of the function signature. The result is left on the stack. It is then compared with 0x1003e2d2 using the EQ opcode at 0x3A. If they are equal, JUMPI transfers execution to address 0x4D. If not, execution continues and the value is compared with another hash value, 0xb69ef8a8.

Now the logic of this code snippet is pretty clear. It takes the first 4 bytes of the data payload and decides which function to run. The code addresses 0x4D and 0x74 are the entry points of the public functions that callers can access in the smart contract. If none of the hash values matches, execution falls through to the fallback function of the smart contract. If no fallback function is defined, it simply reverts.

Since hashing is a one-way operation, it is theoretically impossible to recover the original function signature from the hash. However, you can still guess the original signature by looking it up in a large collection of known hash values. That is what the website www.4byte.directory does. It collects tons of function signatures and their hashes and provides a web API for searching. For example, by browsing the following link:
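The dispatcher logic can be mirrored in a few lines of Python. This is a sketch of the control flow only: the two selectors and the code addresses are the ones from the listing above, while the function names and the REVERT_BLOCK label are mine.

```python
DISPATCH_TABLE = {0x1003E2D2: 0x4D, 0xB69EF8A8: 0x74}  # selector -> entry address
REVERT_BLOCK = 0x48                                     # JUMPDEST; PUSH1 0; DUP1; REVERT


def calldataload(calldata: bytes, offset: int) -> int:
    """CALLDATALOAD: read 32 bytes, padding with zeros past the end of the payload."""
    return int.from_bytes(calldata[offset:offset + 32].ljust(32, b"\x00"), "big")


def dispatch(calldata: bytes) -> int:
    """Return the code address the dispatcher would jump to."""
    if len(calldata) < 4:                        # CALLDATASIZE < 0x04
        return REVERT_BLOCK
    word = calldataload(calldata, 0)
    selector = (word // 2 ** 224) & 0xFFFFFFFF   # DIV by the PUSH29 constant, then AND
    return DISPATCH_TABLE.get(selector, REVERT_BLOCK)


add_call = bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big")
print(hex(dispatch(add_call)))                   # 0x4d
print(hex(dispatch(bytes.fromhex("b69ef8a8"))))  # 0x74
print(hex(dispatch(b"\x00")))                    # 0x48
```

Dividing by 2**224 shifts the 32-byte word right by 28 bytes, so only the leading 4 bytes remain; the AND with 0xFFFFFFFF is then a no-op safety mask.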

https://www.4byte.directory/signatures/?bytes4_signature=0x1003e2d2

You will get the result as:

add(uint256)

Amazingly, that is the original function definition in our demo1.sol program. So with this service we can recover most function signatures, as long as they have been collected before.

So far we have learned how public functions are reached by external callers. Now let’s go further into the specific functions. Address 0x4D is the entry point of the add() function:

004D 5B JUMPDEST
004E 34 CALLVALUE
004F 80 DUP1
0050 15 ISZERO
0051 60 PUSH1 0x58
0053 57 *JUMPI
0054 60 PUSH1 0x00
0056 80 DUP1
0057 FD *REVERT
0058 5B JUMPDEST
0059 50 POP
005A 60 PUSH1 0x62
005C 60 PUSH1 0x04
005E 35 CALLDATALOAD
005F 60 PUSH1 0x86
0061 56 *JUMP

At the entrance of the function add(), there is a JUMPDEST opcode. It is a special opcode that only marks an address as a valid jump target: executing it does nothing, but a JUMP or JUMPI may only land on a JUMPDEST. You will see that it also helps to determine the control flow graph (CFG) of the bytecode. We will discuss that in a future section.

The instructions at 0x4E-0x57 were seen in the previous section. They are injected by the compiler for non-payable functions. After this validation code, the PUSH1 opcode at address 0x5A is easy to overlook. However, this PUSH1 is very important for the code to jump back later. For now, just remember that the address 0x62 is pushed onto the stack. Then CALLDATALOAD(0x04) loads the parameter from the data payload, where it is located at offset 0x04. After getting the parameter, the code jumps to 0x86 for execution:

0086 5B JUMPDEST
0087 60 PUSH1 0x00
0089 80 DUP1
008A 54 SLOAD
008B 82 DUP3
008C 01 ADD
008D 90 SWAP1
008E 81 DUP2
008F 90 SWAP1
0090 55 SSTORE
0091 91 SWAP2
0092 90 SWAP1
0093 50 POP
0094 56 *JUMP

In the above code snippet, the value at 0x0 in storage is loaded using SLOAD(0x0). This value is then added to the parameter loaded from the data payload and saved back to the same location 0x0 in storage. Finally we see the code we put inside the add() function:

balance = balance + value;

Since we only used one integer variable, balance, in the smart contract, the compiler gives this variable the offset 0x0. So any read or write of balance is performed at offset 0x0 in storage. At the end of the code snippet, a JUMP is used to jump back to 0x62; recall that this value was pushed onto the stack before the jump to 0x86. This may remind you of something on the x86 architecture. Yes, it is effectively the call and ret pair used for function calls. Since the EVM doesn’t support function calls at the bytecode level, it can only use PUSH and JUMP opcodes to implement them. This makes it a lot harder to rebuild the CFG from bytecode.

Let’s jump back to address 0x62 to see what happens next:

0062 5B JUMPDEST
0063 60 PUSH1 0x40
0065 80 DUP1
0066 51 MLOAD
0067 91 SWAP2
0068 82 DUP3
0069 52 MSTORE
006A 51 MLOAD
006B 90 SWAP1
006C 81 DUP2
006D 90 SWAP1
006E 03 SUB
006F 60 PUSH1 0x20
0071 01 ADD
0072 90 SWAP1
0073 F3 *RETURN

This code snippet rearranges several stack items with the DUP and SWAP opcodes, so the meaning of the instructions is not obvious at first sight. I will leave it to you to simulate the stack during execution. The equivalent assembler code is as follows:

mstore(mload(0x40), value);
return(mload(0x40), 0x20);

The value in this code is the result of the previous add operation, which is the new balance. With that, we have gone through all the bytecode inside the add() function.
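If you would rather let a machine simulate the stack, a toy interpreter is enough to replay this contract. The Python sketch below is an assumption-laden illustration, not a real EVM: it implements only the opcodes that occur in this runtime bytecode, ignores gas, and takes the bytecode from this section with the trailing metadata cut off. The names RUNTIME and run are mine.

```python
WORD = 1 << 256

# Runtime bytecode of Demo1 from this section, up to the STOP before the metadata.
RUNTIME = bytes.fromhex(
    "608060405260043610604857"               # 0x00: free memory pointer, size check
    "63ffffffff" "7c01" + "00" * 28 +        # 0x0c: PUSH4 mask, PUSH29 2**224
    "6000350416"                             # 0x2f: extract the 4-byte selector
    "631003e2d28114604d57"                   # 0x34: add(uint256) -> 0x4d
    "8063b69ef8a814607457"                   # 0x3e: balance()    -> 0x74
    "5b600080fd"                             # 0x48: revert
    "5b348015605857600080fd"                 # 0x4d: add() entry, non-payable check
    "5b506062600435608656"                   # 0x58: push return address 0x62, jump 0x86
    "5b60408051918252519081900360200190f3"   # 0x62: return one 32-byte word
    "5b348015607f57600080fd"                 # 0x74: balance() entry
    "5b506062609556"                         # 0x7f: push return address 0x62, jump 0x95
    "5b6000805482019081905591905056"         # 0x86: balance = balance + value
    "5b6000548156"                           # 0x95: load balance
    "00"
)

BINARY = {  # a is the top of the stack, b the item below it
    0x01: lambda a, b: (a + b) % WORD, 0x03: lambda a, b: (a - b) % WORD,
    0x04: lambda a, b: a // b if b else 0, 0x10: lambda a, b: int(a < b),
    0x14: lambda a, b: int(a == b), 0x16: lambda a, b: a & b,
}


def run(code: bytes, calldata: bytes, storage: dict, callvalue: int = 0) -> bytes:
    stack, memory, pc = [], bytearray(), 0

    def grow(end):  # extend memory with zeros up to `end` bytes
        if len(memory) < end:
            memory.extend(bytes(end - len(memory)))

    def jump(dest):
        if code[dest] != 0x5B:
            raise RuntimeError("jump target is not a JUMPDEST")
        return dest

    while True:
        op = code[pc]
        pc += 1
        if 0x60 <= op <= 0x7F:                   # PUSHn
            n = op - 0x5F
            stack.append(int.from_bytes(code[pc:pc + n], "big"))
            pc += n
        elif 0x80 <= op <= 0x8F:                 # DUPn
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:                 # SWAPn
            n = op - 0x8F
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op in BINARY:
            a, b = stack.pop(), stack.pop()
            stack.append(BINARY[op](a, b))
        elif op == 0x15:                         # ISZERO
            stack.append(int(stack.pop() == 0))
        elif op == 0x34:                         # CALLVALUE
            stack.append(callvalue)
        elif op == 0x35:                         # CALLDATALOAD
            off = stack.pop()
            stack.append(int.from_bytes(calldata[off:off + 32].ljust(32, b"\x00"), "big"))
        elif op == 0x36:                         # CALLDATASIZE
            stack.append(len(calldata))
        elif op == 0x50:                         # POP
            stack.pop()
        elif op == 0x51:                         # MLOAD
            off = stack.pop()
            grow(off + 32)
            stack.append(int.from_bytes(memory[off:off + 32], "big"))
        elif op == 0x52:                         # MSTORE
            off, value = stack.pop(), stack.pop()
            grow(off + 32)
            memory[off:off + 32] = value.to_bytes(32, "big")
        elif op == 0x54:                         # SLOAD
            stack.append(storage.get(stack.pop(), 0))
        elif op == 0x55:                         # SSTORE
            key, value = stack.pop(), stack.pop()
            storage[key] = value
        elif op == 0x56:                         # JUMP
            pc = jump(stack.pop())
        elif op == 0x57:                         # JUMPI
            dest, cond = stack.pop(), stack.pop()
            if cond:
                pc = jump(dest)
        elif op == 0x5B:                         # JUMPDEST
            pass
        elif op == 0xF3:                         # RETURN
            off, size = stack.pop(), stack.pop()
            grow(off + size)
            return bytes(memory[off:off + size])
        elif op == 0xFD:                         # REVERT
            raise RuntimeError("revert")
        else:
            raise NotImplementedError(hex(op))


storage = {}
result = run(RUNTIME, bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big"), storage)
print(int.from_bytes(result, "big"), storage)  # 5 {0: 5}
```

Printing the stack at the top of the loop shows the return address 0x62 sitting below the argument while the body at 0x86 runs, which is exactly the call and ret emulation described above.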

Now let’s go back to the function dispatch code snippet for the second function HASH 0xb69ef8a8. The entrance for that function is at 0x74:

0074 5B JUMPDEST
0075 34 CALLVALUE
0076 80 DUP1
0077 15 ISZERO
0078 60 PUSH1 0x7f
007A 57 *JUMPI
007B 60 PUSH1 0x00
007D 80 DUP1
007E FD *REVERT
007F 5B JUMPDEST
0080 50 POP
0081 60 PUSH1 0x62
0083 60 PUSH1 0x95
0085 56 *JUMP
...
0095 5B JUMPDEST
0096 60 PUSH1 0x00
0098 54 SLOAD
0099 81 DUP2
009A 56 *JUMP

We can see that the first part of the code is very similar to the previous function, except that no parameter is loaded. Then the function calls the code snippet at 0x95. The instructions at 0x95-0x9A just load the value at 0x0 in storage and jump back to the pushed return address. Note that the code snippet at 0x62 is reused by both functions. That is because both functions return the storage variable balance.

You may wonder why the function hash 0xb69ef8a8 appears in the function dispatch code at all. Isn’t there only one function, add(), in the smart contract? If you look that hash up in the 4byte database, you will get balance(). Apparently, the compiler exposes a public storage variable as a public function without parameters.

To summarize this section, we have discussed the whole structure of the runtime part of the bytecode: how functions are reached by external callers and how parameters are passed. But in this demo we only used an integer storage variable. What will it look like for mappings or variable-length arrays? How are string parameters presented in the data payload? We will talk about all of this in the next section.


Understand EVM bytecode – Part 1

If you have started reading this article, I guess you already know what EVM stands for. So I wouldn’t spend too much time on the background of Ethereum. If you do need some basics of it, please go ahead google “Ethereum Virtual Machine”. The main goal of these series of articles is to help understanding everything about EVM bytecode in case you will be involved in some work about bytecode level contract audit or develop a decompiler of EVM bytecode.

Now let’s start with some basics of EVM bytecode. The EVM is a stack-based virtual machine. If you have experience with similar VMs (like the Java VM, DVM or .NET VM), you won’t have much difficulty understanding the basic idea. EVM bytecode is the machine language of this VM. As you can imagine, code at this level, like any low-level machine code, is not meant for humans to read. It is produced by compiling high-level EVM languages, the most popular of which is Solidity for now. To explain EVM bytecode better, I will use a lot of simple Solidity samples for demonstration. So let’s start with our very first example:

pragma solidity 0.4.25;

contract Demo1 {
    uint public balance;

    function add(uint value) public returns (uint256) {
        balance = balance + value;
        return balance;
    }
}

You may ask why I didn’t use the common HelloWorld as the first example. That is because a HelloWorld example usually uses a string variable, and in EVM bytecode a string is a dynamic-length variable, which we will cover in a later article. So let’s just start with a simple add operation for the very first demo.

To compile this Solidity program, we need a compiler. I really recommend Remix for this job. Remix is not just an online compiler; it also supports a lot of great features you will love. Please visit the following link to start using it:

https://remix.ethereum.org

The main GUI of Remix is shown in the following screenshot:

The portal is straightforward to use. In the right column of the page, there are tabs you can select.

After adding a new file demo1.sol in the Remix portal, you can choose the right compiler version from the “Compile” tab. Here we are using “0.4.25”. When the compilation finishes without errors, you can click on “Details” and get the EVM bytecode from the value of “object” in the “BYTECODE” section of the pop-up page.

The whole string of it is:

608060405234801561001057600080fd5b5060c78061001f6000396000f3
0060806040526004361060485763ffffffff7c0100000000000000000000
0000000000000000000000000000000000006000350416631003e2d28114
604d578063b69ef8a8146074575b600080fd5b348015605857600080fd5b
5060626004356086565b60408051918252519081900360200190f35b3480
15607f57600080fd5b5060626095565b6000805482019081905591905056
5b600054815600a165627a7a7230582063aa00920d824233ab5307ef3a379
c757bdbee62fe00fe36a5d852c766e58fef0029

At first glance you might feel lost, right? But don’t worry, we will explore the whole binary string to understand the ins and outs of it.

First, if you look at the string closely, you will notice that it is a hex string representing a piece of binary data. The real EVM bytecode is a binary string, but to make it easier to show to others it is always presented in hex format. To understand every byte inside the string, we first need to know some basics of EVM opcodes.

An opcode is an instruction of the EVM. Every opcode is an 8-bit unsigned integer. For example, 0x00 means STOP and 0x01 means ADD. For the meaning of all the opcodes, please refer to the Ethereum Yellow Paper at:

https://ethereum.github.io/yellowpaper/paper.pdf

For now, we won’t go through all of the opcodes. We just need to know the basics and will explain new opcodes as we encounter them. So let’s start with the first part of the EVM bytecode we got from Remix:

6080604052

If we map every opcode to a readable instruction, we get the following code:

00: 6080 PUSH1 0x80
02: 6040 PUSH1 0x40
04: 52 MSTORE

In the above code snippet we can see 2 opcodes, PUSH1 and MSTORE. PUSH1 pushes a 1-byte integer onto the stack for future use. There are also PUSH2, PUSH3 … up to PUSH32, so a pushed integer can be from 1 byte to 32 bytes long, while every item on the stack is a 32-byte word. The PUSH family opcodes are the only ones that come with operands in EVM bytecode; all the other opcodes take their inputs from the stack. In this example, the two PUSH1 instructions push 0x80 and 0x40 onto the stack, and then MSTORE uses these 2 stack items for a memory write. So the above code snippet is equivalent to the EVM assembly code:

mstore(0x40,0x80)

When MSTORE uses the 2 items on the stack, they are popped. Usually the result of an opcode is pushed onto the stack for later use. However, MSTORE does not have a return value, so it does not push anything.
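The decoding rule described above (only PUSH opcodes carry operand bytes, everything else is a single byte) can be sketched as a tiny disassembler. This is a minimal Python illustration, not the decompiler discussed in this series: the `disassemble` function and the `OPCODES` table are names made up for this example, and the table only lists the opcodes that appear in this article.

```python
# Minimal disassembler sketch. OPCODES lists only the opcodes that appear in
# this article's creation code; a real tool needs the full table.
OPCODES = {
    0x00: "STOP", 0x01: "ADD", 0x15: "ISZERO", 0x34: "CALLVALUE",
    0x39: "CODECOPY", 0x50: "POP", 0x51: "MLOAD", 0x52: "MSTORE",
    0x54: "SLOAD", 0x55: "SSTORE", 0x56: "JUMP", 0x57: "JUMPI",
    0x5B: "JUMPDEST", 0xF3: "RETURN", 0xFD: "REVERT",
}


def disassemble(code: bytes):
    """Return a list of (offset, mnemonic, operand or None) tuples."""
    out, pc = [], 0
    while pc < len(code):
        op = code[pc]
        if 0x60 <= op <= 0x7F:            # PUSH1 .. PUSH32
            n = op - 0x5F                 # number of operand bytes
            operand = code[pc + 1:pc + 1 + n]
            out.append((pc, f"PUSH{n}", "0x" + operand.hex()))
            pc += 1 + n                   # skip the opcode and its operand
        elif 0x80 <= op <= 0x8F:          # DUP1 .. DUP16
            out.append((pc, f"DUP{op - 0x7F}", None))
            pc += 1
        else:                             # anything not listed is marked unknown
            out.append((pc, OPCODES.get(op, f"UNKNOWN_0x{op:02x}"), None))
            pc += 1
    return out


listing = disassemble(bytes.fromhex("6080604052"))
for offset, name, operand in listing:
    print(f"{offset:02X}: {name} {operand or ''}".rstrip())
```

Skipping the operand bytes is the important detail: if a disassembler treated the 0x80 after PUSH1 as an opcode, it would wrongly report a DUP1 instruction.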

If you keep going through the whole EVM bytecode that Remix returned in this way, you will get the whole list of opcodes. But before we explore more opcodes, let’s talk about 2 more concepts in the EVM environment: memory and storage.

Memory is a readable and writable structure designed for hash calculation and for passing data to external calls or returns. Like the stack, memory starts empty each time the EVM starts executing a contract. The difference from the stack is that memory can be accessed by address. In the earlier example, MSTORE saves the value 0x80 at address 0x40. You might wonder what this is for. Address 0x40 in EVM memory is reserved by the Solidity compiler for the “free memory pointer”, so when the code needs some memory, it reads the free memory pointer from 0x40. Also, if you don’t want that memory to be overwritten by a later operation, you need to update the value at 0x40 so that later operations will not use the same memory again.
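The free memory pointer convention can be sketched with a toy memory model. The Python code below is a simplified illustration: the `Memory` class and the `allocate` helper are hypothetical names, and details of the real EVM such as gas cost are ignored.

```python
# Toy model of EVM memory: a byte array addressed by offset, where
# MSTORE/MLOAD always move one 32-byte big-endian word.
class Memory:
    def __init__(self):
        self.data = bytearray()

    def _grow(self, end):
        if end > len(self.data):
            self.data.extend(bytes(end - len(self.data)))  # zero-filled

    def mstore(self, offset, value):
        self._grow(offset + 32)
        self.data[offset:offset + 32] = value.to_bytes(32, "big")

    def mload(self, offset):
        self._grow(offset + 32)
        return int.from_bytes(self.data[offset:offset + 32], "big")


def allocate(mem, size):
    """Hypothetical helper: reserve `size` bytes via the pointer at 0x40."""
    ptr = mem.mload(0x40)
    mem.mstore(0x40, ptr + size)   # bump the pointer so nobody reuses it
    return ptr


mem = Memory()
mem.mstore(0x40, 0x80)             # what PUSH1 0x80 PUSH1 0x40 MSTORE does
first = allocate(mem, 0x20)
second = allocate(mem, 0x20)
print(hex(first), hex(second), hex(mem.mload(0x40)))
```

Because each allocation bumps the value stored at 0x40, the two regions handed out here do not overlap.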

Unlike memory and the stack, storage variables hold state, so they are not reset every time the EVM starts executing. You can think of storage as a dictionary or hash table. Every change in storage is recorded in the world state of the Ethereum ecosystem. The storage-related opcodes are SLOAD and SSTORE. We will talk more about storage variables when analyzing more complicated structures like mappings or arrays.
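As a rough mental model, storage behaves like a dictionary from 256-bit slots to 256-bit values that outlives each call. The Python sketch below applies that model to Demo1, with `balance` in slot 0. The `Storage` class and the `add` function are hypothetical names for this example, and the modulo reflects that arithmetic in this compiler version wraps around at 2^256.

```python
from collections import defaultdict

WORD = 2**256   # storage keys and values are 256-bit words


class Storage:
    """Dictionary-like model: unset slots read as 0."""

    def __init__(self):
        self.slots = defaultdict(int)

    def sload(self, key):
        return self.slots[key]

    def sstore(self, key, value):
        self.slots[key] = value % WORD


def add(storage, value):
    """Model of Demo1.add(): balance lives in storage slot 0."""
    new_balance = (storage.sload(0) + value) % WORD
    storage.sstore(0, new_balance)
    return new_balance


storage = Storage()           # persists across the two "transactions" below
first = add(storage, 5)
second = add(storage, 7)
print(first, second, storage.sload(0))
```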

Based on this information, let’s continue with the bytecode string.

05: 34 CALLVALUE
06: 80 DUP1
07: 15 ISZERO
08: 610010 PUSH2 0x0010
0B: 57 JUMPI
0C: 6000 PUSH1 0x00
0E: 80 DUP1
0F: FD REVERT
10: 5B JUMPDEST
11: 50 POP
12: 60C7 PUSH1 0xc7
14: 80 DUP1
15: 61001F PUSH2 0x001f
18: 6000 PUSH1 0x00
1A: 39 CODECOPY
1B: 6000 PUSH1 0x00
1D: F3 RETURN
1E: 00 STOP

This code snippet is a bit long, but don’t worry. Let’s go through it step by step. CALLVALUE pushes msg.value onto the stack, then DUP1 duplicates that value and ISZERO checks whether it is 0. If the value ISZERO takes from the stack is 0, it pushes 1 (TRUE) onto the stack for the next instructions; otherwise it pushes 0. The next PUSH2 pushes the code address 0x0010 onto the stack for JUMPI. JUMPI is a conditional jump instruction that uses 2 items from the stack: one is the jump address and the other is the condition. If the condition (in this case ISZERO(msg.value)) is satisfied, execution jumps to 0x0010; otherwise it falls through and ends with REVERT(0,0). So the bytecode at 0x05-0x0F can be translated to the following equivalent Solidity code:

if(msg.value != 0) revert();

The reason we don’t see this line in our original Solidity code is that the check is injected by the compiler for non-payable functions. Here it guards the contract creation code, because the contract does not define a payable constructor.

Moving on to the later part of the bytecode, if you arrange the stack manually, you can see the instruction CODECOPY(0x0, 0x001F, 0xC7). It copies 0xC7 bytes of code, starting from offset 0x1F, into memory at 0x0. Then the code calls RETURN(0x0, 0xC7) to hand the copied data back to the EVM. By now you might have guessed the logic of this operation and the purpose of this piece of bytecode.

The bytecode generated by the Remix compiler has multiple parts. The bytes from 0x00 to 0x1E are the creation part of the contract. This code is only executed during contract creation. It runs the constructor of the contract and also copies the runtime part of the code back to the EVM. After the contract account is created, the runtime part of the code, 0xC7 bytes starting from 0x1F, is executed for future transactions on this contract, and the constructor is never called again. You might also have noticed that the creation part contains no JUMP or JUMPI instruction that leads execution into the runtime part.
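The split between creation and runtime code can be checked by actually running the creation part. The following Python sketch is a hypothetical mini-interpreter, not a full EVM: `run_creation` and `write` are made-up helpers that implement only the opcodes found at 0x00-0x1E of the Demo1 bytecode shown earlier, and gas and memory expansion rules are ignored.

```python
# Demo1 bytecode from the article: creation part followed by runtime part.
BYTECODE = bytes.fromhex(
    "608060405234801561001057600080fd5b5060c78061001f6000396000f300"
    "608060405260043610604857" "63ffffffff" "7c01" + "00" * 28 +
    "6000350416631003e2d28114"
    "604d578063b69ef8a8146074575b600080fd5b348015605857600080fd5b"
    "5060626004356086565b60408051918252519081900360200190f35b3480"
    "15607f57600080fd5b5060626095565b6000805482019081905591905056"
    "5b600054815600" "a165627a7a72305820"
    "63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef"
    "0029"
)


def write(memory, offset, data):
    """Grow the zero-filled memory if needed, then copy data in."""
    end = offset + len(data)
    if end > len(memory):
        memory.extend(bytes(end - len(memory)))
    memory[offset:end] = data


def run_creation(code, callvalue=0):
    """Run from offset 0 until RETURN or REVERT; return (status, data)."""
    stack, memory, pc = [], bytearray(), 0
    while True:
        op = code[pc]
        if 0x60 <= op <= 0x7F:                        # PUSH1..PUSH32
            n = op - 0x5F
            stack.append(int.from_bytes(code[pc + 1:pc + 1 + n], "big"))
            pc += 1 + n
            continue
        if 0x80 <= op <= 0x8F:                        # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif op == 0x34:                              # CALLVALUE
            stack.append(callvalue)
        elif op == 0x15:                              # ISZERO
            stack.append(int(stack.pop() == 0))
        elif op == 0x50:                              # POP
            stack.pop()
        elif op == 0x5B:                              # JUMPDEST (no effect)
            pass
        elif op == 0x57:                              # JUMPI
            dest, cond = stack.pop(), stack.pop()
            if cond:
                pc = dest
                continue
        elif op == 0x52:                              # MSTORE
            offset, value = stack.pop(), stack.pop()
            write(memory, offset, value.to_bytes(32, "big"))
        elif op == 0x39:                              # CODECOPY
            dest, src, size = stack.pop(), stack.pop(), stack.pop()
            write(memory, dest, code[src:src + size].ljust(size, b"\0"))
        elif op == 0xF3:                              # RETURN
            offset, size = stack.pop(), stack.pop()
            return "return", bytes(memory[offset:offset + size])
        elif op == 0xFD:                              # REVERT
            return "revert", b""
        else:
            raise NotImplementedError(hex(op))
        pc += 1


status, runtime = run_creation(BYTECODE)
print(status, hex(len(runtime)), runtime[:5].hex())
```

With a zero call value the creation part returns the bytes starting at 0x1F, and with a non-zero call value it reverts before the copy happens.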

To verify that our guess is correct, let’s write another Solidity contract with a constructor function:

pragma solidity 0.4.25;

contract Demo2 {
    uint public balance;

    function add(uint value) public returns (uint256) {
        balance = balance + value;
        return balance;
    }

    constructor (uint value) public {
        balance = value;
    }
}

After compiling it with Remix, we get the following creation part of the bytecode:

608060405234801561001057600080fd5b506040516020806100fa833981016040
525160005560c7806100336000396000f300

The code is longer than the previous one since we defined a constructor function. So let’s disassemble the bytecode into more readable code:

0000 60 PUSH1 0x80
0002 60 PUSH1 0x40
0004 52 MSTORE
0005 34 CALLVALUE
0006 80 DUP1
0007 15 ISZERO
0008 61 PUSH2 0x0010
000B 57 JUMPI
000C 60 PUSH1 0x00
000E 80 DUP1
000F FD REVERT
0010 5B JUMPDEST
0011 50 POP
0012 60 PUSH1 0x40
0014 51 MLOAD
0015 60 PUSH1 0x20
0017 80 DUP1
0018 61 PUSH2 0x00fa
001B 83 DUP4
001C 39 CODECOPY
001D 81 DUP2
001E 01 ADD
001F 60 PUSH1 0x40
0021 52 MSTORE
0022 51 MLOAD
0023 60 PUSH1 0x00
0025 55 SSTORE

0026 60 PUSH1 0xc7
0028 80 DUP1
0029 61 PUSH2 0x0033
002C 60 PUSH1 0x00
002E 39 CODECOPY
002F 60 PUSH1 0x00
0031 F3 RETURN
0032 00 STOP

We can see similar code at the start and the end, but the code between 0x12 and 0x25 is new, so let’s focus on this part. First, the instructions at 0x12 and 0x14 call MLOAD(0x40) to get the value from memory at address 0x40. From the previous section, we already know that address 0x40 in memory holds the free memory pointer, which in this case is 0x80. Then, after arranging the stack with PUSH and DUP, the stack holds [… 0x20, 0x00FA, 0x80] (top of the stack on the right) before CODECOPY is called. So the code calls CODECOPY(0x80, 0x00FA, 0x20). This action did not appear in the previous demo bytecode, so it has something to do with the new code we put inside the constructor. It copies 32 bytes from code offset 0xFA, right after the end of the compiled bytecode (0x33 + 0xC7 = 0xFA), into the free memory address. This is likely the parameter value supplied during the deployment of the contract. Let’s keep going with the rest of the bytecode.

In the instructions at 0x1D-0x21, the code adds 0x20 to the current free memory pointer 0x80 and saves the result back to address 0x40 with MSTORE(0x40, 0x80+0x20).

Then the instruction at 0x22 pushes the value returned by MLOAD(0x80) onto the stack, which is the 32-byte value copied from the code. The instructions at 0x23 and 0x25 save that value into storage slot 0x0 using SSTORE(0x0, MLOAD(0x80)). So, in summary, the instructions between 0x12 and 0x25 do something like the following (in pseudo-notation, since CODECOPY writes to memory rather than returning a value):

SSTORE(0x0,CODECOPY(0x80, 0x00FA, 0x20))

So, during the deployment of a new contract, the constructor arguments are appended to the end of the EVM bytecode in the transaction data payload. Then, during creation, the constructor gets the parameter by using CODECOPY.
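The layout of such a deployment payload can be sketched in a few lines of Python. This is an illustration under stated assumptions: `deployment_payload` and `read_constructor_arg` are hypothetical helpers, the sample value 1000 is arbitrary, and the 0xC7-byte runtime part is replaced by zero bytes because only the creation part of Demo2 is shown above.

```python
# Demo2 creation part from the article (0x33 bytes).
CREATION = bytes.fromhex(
    "608060405234801561001057600080fd5b506040516020806100fa833981016040"
    "525160005560c7806100336000396000f300"
)
# Assumption: zero bytes stand in for the real 0xC7-byte runtime part.
RUNTIME_PLACEHOLDER = bytes(0xC7)


def deployment_payload(value: int) -> bytes:
    """Append the constructor argument as one 32-byte big-endian word."""
    return CREATION + RUNTIME_PLACEHOLDER + value.to_bytes(32, "big")


def read_constructor_arg(payload: bytes) -> int:
    """What CODECOPY(0x80, 0xFA, 0x20) followed by MLOAD(0x80) yields."""
    return int.from_bytes(payload[0xFA:0xFA + 0x20], "big")


payload = deployment_payload(1000)
print(hex(len(CREATION)), hex(len(payload)), read_constructor_arg(payload))
```

The offset 0xFA is simply the length of the creation part (0x33) plus the length of the runtime part (0xC7), which is why the constructor can find its argument at a fixed code offset.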

So far, we have covered the basics of EVM bytecode, including the three types of data structures in the EVM (stack, memory and storage), some common opcodes involved in smart contract creation, how constructor parameters are passed, and the structure of compiled EVM bytecode. In the next section we will talk about the runtime part of the bytecode.

Understand EVM bytecode – Part 2

PS: We have published our online EVM decompiler for everyone. Please feel free to use it. Any comments are welcome.

https://www.trustlook.com/products/smartcontractinsight