January 8, 2019

Understand EVM bytecode – Part 3

Understand EVM bytecode – Part 3

In previous sections:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

We have talked about creation and runtime parts of the EVM bytecode. We have seen that the stack variables are commonly used to supply operands to opcodes or transfer integer arguments between internal function calls. All state related variables need to be saved in storage. However, storage is designed as a dictionary or hash table.  It will hold all data by key-value pairs. For each address key it can store a 32-byte integer. So how it can implement complicated data structures, struct, mapping, variable length arrays etc? Let’s dig more about the implementations of them.

Let’s start with the relatively easy ones: Fixed size data types. We all already know that EVM support different length of integers, from 1-byte to 32-byte. If you only defined several 32-byte integer variables in your smart contracts, the compiler will simply assign a consequence of addresses starting from 0x0 for the variables. For example:

contract data1 {
  uint256 public balance1;
  uint256 public balance2;
  uint256 public balance3;
}

If you compiled above code and check the EVM bytecode, you will found all code read or write the 3 variables balance1, balance2 and balance3  will be mapping to the SLOAD or SSTORE operations on addresses 0x0, 0x1, and 0x2 accordingly. However, we all know the storage usage costs a lot of gas. If you have no idea what gas is in EVM, I recommend you to have a bit research on it. There are tons of articles talking about it. It is basically the execution cost for your contract code. So to optimize the gas cost of storage usage, when you have multiple small length integers, they will be optimized to use 1 storage slot. For example:

contract data1 {
  uint128 public balance1;
  uint128 public balance2;
  uint256 public balance3;
}

When you have 2 variables defined as uint128, these 2 variables can be fit in one storage slot at address 0x0, and the third variable balance3 will be assigned at address 0x1 for its own. When you refer to the variable balance1, the first 16 byte value will be extracted to the referrer after SLOAD or SSTORE. Let’s look at some real example of it:

pragma solidity 0.4.25;
contract Demo1 {

  uint128 public balance1;
  uint128 public balance2;
  uint256 public balance3;

  function add() public returns (uint256) {
        balance3 = balance1 + balance2;
        return balance3;
    }
}

After compiling it using Remix, we can get the runtime bytecode as:

6080604052600436106100615763ffffffff7c0100000000000000000000000000
00000000000000000000000000000060003504166340441eec8114610066578063
4f2be91f146100a0578063c45c4f58146100c7578063f24a0faa146100dc575b60
0080fd5b34801561007257600080fd5b5061007b6100f1565b604080516fffffff
ffffffffffffffffffffffffff9092168252519081900360200190f35b34801561
00ac57600080fd5b506100b561011d565b60408051918252519081900360200190
f35b3480156100d357600080fd5b5061007b610158565b3480156100e857600080
fd5b506100b5610170565b60005470010000000000000000000000000000000090
046fffffffffffffffffffffffffffffffff1681565b6000546fffffffffffffff
ffffffffffffffffff808216700100000000000000000000000000000000909204
81169190910116600181905590565b6000546fffffffffffffffffffffffffffff
ffff1681565b600154815600a165627a7a72305820fa0e623e455a9cc0439ea393
dff5b50cc571150034eb57658dd9718e3982b1590029

For time-being reason, we won’t go through all of them, let’s just look at the add calculation line inside add() function:

011E    60  PUSH1 0x00
0120    54  SLOAD
0121    6F  PUSH16 0xffffffffffffffffffffffffffffffff
0132    80  DUP1
0133    82  DUP3
0134    16  AND
0135    70  PUSH17 0x0100000000000000000000000000000000
0147    90  SWAP1
0148    92  SWAP3
0149    04  DIV
014A    81  DUP2
014B    16  AND
014C    91  SWAP2
014D    90  SWAP1
014E    91  SWAP2
014F    01  ADD
0150    16  AND
0151    60  PUSH1 0x01
0153    81  DUP2
0154    90  SWAP1
0155    55  SSTORE

To make it more readable, we can give the mathematical equivalent assembler code as follows:

sstore(0x1,uint128((uint128((sload(0x0) / 0x100000000000000000000000000000000)) + uint128(sload(0x0)))));

The logic of using storage slot 0x0 and 0x1 is clearly showing in above line, which is the line of balance3 = balance1 + balance2;.

We can see one line Solidity code can result in a lot of opcode instructions. In future then we discuss even more complicated data structures, it will be overwhelming if we look at all opcodes. So to make our content more readable, I will just show the equivalent assembler code to prove the logic. However, you are always welcome to spend more time to dig the bytecode one by one to practise.

We have talked about the most basic data type, Integer, in EVM. Now let’s look at struct in Solidity:

pragma solidity 0.4.25;
contract Data2 {
      struct Funder {
        address addr;
        uint256 amount;
     }
    Funder test;
    function deposit(address addr, uint256 amount) public returns (uint256) {
        test.addr = addr;
        test.amount = amount;
        return amount;
     }
}

After generating the runtime EVM bytecode and locate the deposit() function. We can see equivalent assembler code was generated like this:

sstore(0x0,uint160(arg0));
sstore(0x1,arg1);

We can see even we define a struct Funder in our Solidity program, the compiler still generate the code without difference than just 2 normal storage variables. Also, if multiple member items inside the struct can fit in one storage slot, optimization will also happen as we said early.

Now let’s talk about a very popular data type, mapping, in Solidity.

pragma solidity ^0.4.25;
contract Bank {
    mapping(address => uint256) public balanceOf;   // balances, indexed by addresses
    function deposit(uint256 amount) public payable {
        require(msg.value == amount);

        balanceOf[msg.sender] += amount;     // adjust the account's balance
    }
}

If you are familiar with smart contract development, you probably will see a lot of Dapps has similar usage of mapping to record the balance of an account. Let’s compile it and check the assembler code of the mapping variable balanceOf:

mstore(0x20,0x0);
mstore(0x0,msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF);
temp0 = keccak256(0x0,0x40);
return(sload(temp0));

We can see msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF will get the parameter from data payload as an address value. Then, this argument will be put into the memory address 0x0. Also, the address 0x20 in the memory is set up with value 0x0. This 0x0 is actually an index of the declaration of storage variables. It only because balanceOf is the first declared variable in the smart contract. After the memory is set up, SHA3 opcode is called to calculate the HASH value of the inputs, And the result will be used as the storage key for the mapping variable. From above code, we can see when you declare a one level mapping variable, the compiler actually take it as function with 1 argument and return one value. For readability reason, let’s use arg0 for the value set up in the 0x0 address in memory of mapping calculation. Let’s take a look at 2 level mappings:

mapping (address => mapping (address => uint256)) public tokens;

You might have seen some similar mappings variable like above. When you compile the code and generate the bytecode, the assembler code of it will be like:

mstore(0x20,0x0);
mstore(0x0,arg0);
temp0 = keccak256(0x0,0x40);
mstore(0x20,temp0);
mstore(0x0,arg1);
temp1 = keccak256(0x0,0x40);
return(sload(temp1));

Apparently, for a 2-level mapping variable it is actually a function with 2 arguments and return one value. For this tokens example, it first set up memory with 0x0 at address 0x20 (we know 0x0 is the variable declaration index), arg0 at address 0x0. Then the first HASH value was calculated by using SHA3 opcode. However, this calculated value will put in to memory address 0x20 again, and the second argument arg1 is put at address 0x0 this time for another SHA3 calculation. Then the result will be used as the storage key for the variable. In same logic, you can keep adding level for your mapping variables and the key for the storage will always be the last calculated SHA3 result.

Until now you might have better understanding why a free memory pointer is always saved at 0x40, not at 0x0 or 0x20. That is only because those addresses are reserved for HASH calculations.  Until now we have visited several data structures in Solidity. What if we put some of them together? For example, what about if we have a mapping variable which returns a struct value:

pragma solidity 0.4.25;
contract Data4 {
    struct Funder {
        uint256 last_deposit;
        uint256 amount;
    }  
    mapping(address => Funder) public balanceOf;
}

Let’s have a look at the decompiled code for balanceOf then:

function balanceOf( address arg0) public return (var0,var1)
{
    mstore(0x20,0x0);
    mstore(0x0,arg0);
    temp0 = keccak256(0x0,0x40);
    return(sload(temp0),sload((temp0 + 0x1)));
}

We can see the SHA3 calculation part is the same as a regular 1-level mapping variable. The only difference is that the regular data type like integer will just use the hash result as storage key. But the member item inside a struct will apply a declaration index of the member to that hash value for the storage key. So for this case, the 2 member items in struct Funder will add 0x0, and 0x01 on the hash value temp0 for the storage key.

Next, we will have a look at the data structures with variable length. For example, we have a smart contract coded as follows:

pragma solidity 0.4.25;
contract Data4 {
    address[] public senders;
    function add() public {
        senders.push(msg.sender);
    }
}

We defined a variable length array senders. Let’s see how EVM implement this variable in storage:

function senders( uint256 arg0) public return (var0)
{
    assert((arg0 < sload(0x0)));
    mstore(0x0,0x0);
    temp0 = keccak256(0x0,0x20);
    return(uint160(sload((temp0 + arg0))));
}

From above assembler code, we can observe that the variable is implemented more like a function with one argument index and return the item from the array. In the Solidity source we know that we only defined one array variable senders. As we discussed early the storage will assign an address index for every variable, since there is only one, 0x0 is assigned to it. However, the address 0x0 is not enough to hold all data of the array. Instead it will hold the length of the array. For the data of the array, the storage use the SHA3 value of its address index (0x0 in this case) as the start of the array data. The array index will be added into this hash value as the the storage key for the respective item. So let’s look back at above code snippet. When external callers want to access an item in the array by an index, it will first check if the index is bigger than the length of the array. If not, the hash value will be calculated, and the index will be added to this hash to get the item of the array.

So far we have discussed strut, mapping, and array. There is one more data type you might be interested in. That is string in Solidity. String and bytes are the same data type to hold a variable length of bytes. Let’s have a look when we declared a string variable in storage, what it will look like:

pragma solidity 0.4.25;
contract Data6 {
    string public hello = "Hello World!";
}

This smart contract is doing nothing than just return a string to the caller. Let’s see how the string is saved in the storage. Even there is one line of Solidity, the compiled assembler code is a bit complicated:

temp0 = mload(0x40);
mstore(0x40,(temp0 + (0x20 + (((0x1F + ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2)) / 0x20) * 0x20))));
mstore(temp0,((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2));
var5 = (0x20 + temp0);
var7 = ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2);
if (var7) 
{
    if ((0x1F < var7)) 
    {
        temp1 = (var5 + var7);
        var5 = temp1;
        mstore(0x0,0x0);
        temp2 = keccak256(0x0,0x20);
        var7 = var5;
        var6 = temp2;
label_00000148:
        mstore(var7,sload(var6));
        var6 = (0x1 + var6);
        var7 = (0x20 + var7);
        if ((var5 > var7)) 
        {
            goto label_00000148;
        }
        else
        {
            temp4 = (var5 + (0x1F & (var7 - var5)));
            var5 = temp4;
label_00000165:
            return;
        }
    }
    else
    {
        mstore(var5,((sload(0x0) / 0x100) * 0x100));
        goto label_00000165;
    }
}

The assembler code is a bit complicated, let’s explore the logic of it. First,  let’s figure out the logic of “((((0x100 * ((0x1 & sload(0x0)) == 0)) – 0x1) & sload(0x0)) / 0x2)”. Apparently, since there is only one variable hellodefined in the contract, so storage address 0x0 is reserved for this variable. But it seems it is not just the length filed like arrays. From this comparison “((0x1 & sload(0x0)) == 0)) ” we know the last bit of this field is a flag. If the bit is set to 1, then this comparison is False (0x0), so the whole line will be “(uint256(sload(0x0)) / 0x2)”. If the bit is set to 0, then the whole line will be “(uint8(sload(0x0)) / 0x2)”. This value is assigned to var7 in above code. Then, this value is compared with 0x1F. If it is smaller than 0x1F, this line “mstore(var5,((sload(0x0) / 0x100) * 0x100));” will copy the first 0x1F bytes from storage 0x0 to the memory. If it is bigger than 0x1F, then some similar operation we have seen in variable-length array will be resulted in.

Based on what we have seen from the assembler code, we can know the basic logic of strings. To use the best of storage, if the string is smaller than 0x1F, it means including its length it can be saved in one slot in storage. So the last bit of the field is set to 0, and the last byte has the length of the string multiplied by 2. And the string is saved in the first 31 bytes in the slot. However, if the length of string is longer than 31, then one storage slot can’t put in all data. So the field slot will only hold the length field (last bit is set to 1), and the data is saved in the way of the variable-length array.

Until now, we have discussed the implementations in storage of the most of the data types in Solidity. For next section we will talk about memory and its interaction with data payload.

Understand EVM bytecode – Part 4