26M TRX is gone due to a “backdoor” in bytecode

On May 3, an attacker known as “wojak” (Discord name; address: THeRTTCvN4SHEVYNqcLVLNGGVsWLR4smyH) took 26.7 million TRX. The tokens were worth around 700K USD at the time. This is the second major security incident involving the TronBank team, following the large theft of BTT tokens from BTTBank. This time the cause was a backdoor that someone planted in the withdraw function of the smart contract.

According to the Tronbank team’s statement, they tried many times to compile the smart contract themselves and verify it on TSC (TronSmartContract.space), but every attempt failed. So they used TSC itself to compile and deploy the contract, and, according to the statement, the backdoor was somehow injected by Khanh (author of https://tronsmartcontract.space/#/author, Tron address: TTX5N2wxLeyWBSNE6UeaBjCFZbpa2FH6jr).

The above statement was from the post here.

Once the smart contract was deployed, the compiled bytecode was not human-friendly and was hard to inspect without reverse engineering tools. The backdoor was added to the smart contract function withdraw(). Using our online decompiling tool, we can easily find the backdoor code:

From the snapshot above, we can clearly see that when msg.value is 0x2E87 (11911), the smart contract sends its entire balance to the sender. The attacker wojak therefore transferred all the TRX to his own address by calling the withdraw function with a value of 0.011911 TRX.
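The link between the two numbers is the unit: msg.value is denominated in sun, the smallest unit on TRON, and 1 TRX equals 1,000,000 sun. A minimal Python sketch of the conversion follows; the helper name sun_to_trx is ours and is only for illustration.

```python
from decimal import Decimal

SUN_PER_TRX = 1_000_000  # 1 TRX = 1,000,000 sun


def sun_to_trx(sun: int) -> Decimal:
    """Convert an amount in sun (the unit of msg.value on TRON) to TRX."""
    return Decimal(sun) / Decimal(SUN_PER_TRX)


trigger = 0x2E87            # the magic msg.value checked by the backdoor
print(trigger)              # 11911
print(sun_to_trx(trigger))  # 0.011911
```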

One interesting part of the story is that, after the attack, wojak claimed in the Discord channel that he had not noticed the token transfer until a later review, and that he had originally been testing his program against all smart contracts on the Tronbank network.

He also stated that he would refund all tokens to the original investors, but he then disappeared.

One lesson we can learn from this attack is that a security issue in the blockchain ecosystem is no longer just an advisory or a patch. It means real financial loss for investors and users. Also, since a smart contract’s final form is meant for a virtual machine and not for humans, we should really be aware of the actual logic in the bytecode rather than the source code, especially when the binary was not compiled with your own compiler. Bytecode is not meant to be read by humans, but with the help of a decompiler such as Smart Contract Guardian, its logic can still be retrieved clearly. We also offer a bytecode-level smart contract audit service to make sure your contracts are safe and operate with the logic they are supposed to have.

Trustlook Launches Smart Contract Auditing Platform Smart Contract Guardian

SAN JOSE, Calif., Jan. 11, 2019 (GLOBE NEWSWIRE) — Trustlook, the global leader of AI-powered cybersecurity, launched Smart Contract Guardian (SCG), a smart contract bytecode decompiling platform and announced a free smart contract auditing service based on this platform.

https://www.trustlook.com/products/smartcontractguardian

The open source spirit of Ethereum is intended to allow developers to share their work with the community so innovative platforms and applications can be built. However, according to a survey conducted by Trustlook at the end of 2018, roughly 2 million smart contracts have been built and deployed on the Ethereum network, but over 80% were published as unreadable low-level byte-code. This makes it nearly impossible for the average developer to analyze the contents of these smart contracts, which leads to the widespread use of insecure and unreliable contracts. This is likely a key enabler for a number of serious incidents on the Ethereum network.

Trustlook’s SCG decompiles unreadable smart contract byte-code into Solidity, a familiar and readable high-level language. There is currently no product on the market that matches SCG in maturity or capability. Additionally, the decompiler has been released for free online use, giving developers a convenient tool for analyzing previously opaque smart contracts. Trustlook is hopeful that in the process of using SCG, community developers will also improve the quality and security of the Ethereum network.

The security of smart contracts is critical since they may not be altered once deployed. Accordingly, the ability to audit smart contracts is necessary to ensure their security, as their bugs can directly cause thousands or millions of dollars in damages to digital currency exchanges and users.

Therefore, Trustlook has decided to provide the smart contract auditing service for free to the community while launching the SCG platform.

The founding team at Trustlook are cybersecurity veterans with over ten years of industry experience, with deep understanding of traditional cybersecurity fields as well as cutting edge blockchain security issues. Trustlook seeks to provide security and reliability to all smart contract based services in order to build a safer and more mature Ethereum network.

About Trustlook
Trustlook (www.trustlook.com) is a global leader in next-generation cyber security products based on artificial intelligence. Their innovative SECUREai engine delivers the performance and scalability needed to provide total threat protection against malware and other forms of attack. Trustlook’s solutions protect users from both known and zero-day threats by analyzing millions of code-level and behavior combinations to find malicious patterns. Founded in 2013, the company is headquartered in San Jose and managed by leading security experts from Palo Alto Networks, FireEye, Google and Yahoo.

Smart Contract Guardian – an online EVM decompiler

Since I started working in the Ethereum ecosystem and auditing Ethereum smart contracts in bytecode format, I have evaluated many well-known projects that claim they can decompile EVM (Ethereum Virtual Machine) bytecode. However, none of them really shows good results on real-world examples. Reading the EVM opcodes of a smart contract is a really frustrating job, and you can get lost anywhere among the JUMPs and POPs. So an idea popped into my mind: why not build one that really works, to speed up the audit of raw EVM bytecode?

Fortunately, there are already several good articles about EVM bytecode structures. However, those resources still won’t give you everything you need to build a real EVM decompiler. A lot of details are missing from both the official documents and other research. I have written a series of articles about the things I learned while developing the tool. If you are interested in them, please feel free to have a look:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

Understand EVM bytecode – Part 3

Understand EVM bytecode – Part 4

Besides everything mentioned in those articles, there are more things to pay attention to if you want to develop your own decompiler or automation tool based on EVM bytecode. The most difficult part I encountered during development was recovering the Control Flow Graph (CFG) of the code. If you are familiar with EVM opcodes, you might have noticed that there are no function call and return instructions. Instead, there are only conditional and unconditional JUMPs. So, to recover the original function logic, you need to analyze the return addresses pushed onto the stack to determine the range of each function. If you think you can skip this step and treat the whole code as one function, you will have even more trouble when dealing with loops, because without function boundaries any re-entry into a code block might mistakenly be judged to be a loop.
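To make the starting point concrete, here is a minimal Python sketch of the very first step of CFG recovery: a linear sweep that splits bytecode into basic blocks at JUMPDESTs and after terminating instructions. It is not how Smart Contract Guardian works internally; the function name and the sample bytecode are ours, and the hard part described above (tracking pushed return addresses to find function boundaries) is deliberately left out.

```python
def split_basic_blocks(code: bytes):
    """Linear sweep over EVM bytecode; returns the start offsets of basic blocks."""
    JUMP, JUMPI, JUMPDEST = 0x56, 0x57, 0x5B
    # STOP, JUMP, JUMPI, RETURN, REVERT, INVALID, SELFDESTRUCT end a block
    terminators = {0x00, JUMP, JUMPI, 0xF3, 0xFD, 0xFE, 0xFF}
    starts = {0}
    pc = 0
    while pc < len(code):
        op = code[pc]
        if op == JUMPDEST:
            starts.add(pc)            # every jump target opens a new block
        size = 1
        if 0x60 <= op <= 0x7F:        # PUSH1..PUSH32 carry 1..32 bytes of data
            size += op - 0x5F
        if op in terminators and pc + size < len(code):
            starts.add(pc + size)     # the instruction after a terminator
        pc += size
    return sorted(starts)


# Hypothetical sample: PUSH1 1, PUSH1 6, JUMPI, STOP, JUMPDEST, STOP
sample = bytes.fromhex("6001600657005b00")
print(split_basic_blocks(sample))     # [0, 5, 6]
```

Skipping over PUSH data matters: a 0x5B byte inside a PUSH immediate is data, not a JUMPDEST, and a sweep that ignores this produces bogus blocks.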

Another thing worth mentioning is that the EVM itself is still under development, and there are several high-level languages you can choose from to write a smart contract. The compiled EVM bytecode is therefore very compiler dependent, and even version dependent within the same compiler. For my research I used the Remix online Solidity compiler. If you compile the same piece of code with different versions, you will be surprised by the results. Since Solidity is under development too, there is no way to expect consistent compiled output. These facts make the developer’s life harder, especially when more readable content is expected from the decompiler.

We have talked about the problems you might encounter during development. Now let’s look at a demo of the online decompiler we have published. You can use it by following this link:

https://www.trustlook.com/products/smartcontractguardian

To analyze a piece of EVM bytecode, you can either specify the address of a smart contract deployed on the Ethereum main network, or simply copy and paste your bytecode into the text box. After you click the “Decompile Now” button, the result is shown on the page. Let’s take some simple Solidity source code for testing:

pragma solidity ^0.4.18;

contract Bank {

    // balances, indexed by addresses
    mapping(address => uint256) public balanceOf;

    function deposit(uint256 amount) public payable {
        require(msg.value == amount);

        // adjust the account's balance
        balanceOf[msg.sender] += amount;
    }

}

After compiling the code with Solidity 0.4.18, you will get the following EVM runtime bytecode:

60606040526004361060485763ffffffff7c01000000000000000000000000
0000000000000000000000000000000060003504166370a082318114604d57
8063b6b55f25146088575b600080fd5b3415605757600080fd5b607673ffff
ffffffffffffffffffffffffffffffffffff600435166093565b6040519081
5260200160405180910390f35b609160043560a5565b005b60006020819052
908152604090205481565b34811460b057600080fd5b73ffffffffffffffff
ffffffffffffffffffffffff33166000908152602081905260409020805490
910190555600a165627a7a7230582023b9654f2013fddc872624140433014f
75b2dc3414e8d795bf7330ebc818ef6a0029

To understand why we call it “runtime” bytecode, you can refer to my write-ups on “Understand EVM bytecode”. You can simply copy the bytecode above and submit it to the online decompiler like this:

Then click the “Decompile Now” button. After a few seconds, the decompiled result is shown in the text box:

From the decompiled result, we can clearly see the original variable declarations and functions. Besides that, the tool also inspects the bytecode and looks for known potential vulnerabilities. Here, an integer overflow was discovered in the function deposit. If you do not trust the automated inspection result, you can always click the “Request Audit” button for a manual review; our team will be happy to assist you with our expertise.
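Even without a full decompiler, the public functions can be spotted in the runtime bytecode by looking at its dispatcher. The Python sketch below is a heuristic of ours, not the tool’s algorithm: it scans for PUSH4 constants that are immediately compared with EQ (directly or after a DUP2), which is the pattern visible at the start of the bytecode above. Only the dispatcher prefix of that bytecode is reproduced here.

```python
def find_selectors(code: bytes):
    """Collect PUSH4 constants that are compared with EQ right away."""
    PUSH4, DUP2, EQ = 0x63, 0x81, 0x14
    found = []
    pc = 0
    while pc < len(code):
        op = code[pc]
        # PUSH1..PUSH32 are followed by 1..32 bytes of immediate data
        size = 1 + (op - 0x5F if 0x60 <= op <= 0x7F else 0)
        if op == PUSH4:
            nxt = code[pc + size: pc + size + 2]
            if nxt[:1] == bytes([EQ]) or nxt == bytes([DUP2, EQ]):
                found.append(code[pc + 1: pc + 5].hex())
        pc += size
    return found


# Dispatcher prefix of the Bank runtime bytecode shown above
dispatcher = bytes.fromhex(
    "6060604052" "6004361060485763ffffffff"
    "7c01" + "00" * 28 +
    "6000350416" "6370a08231" "8114604d57"
    "80" "63b6b55f25" "146088575b600080fd"
)
print(find_selectors(dispatcher))   # ['70a08231', 'b6b55f25']
```

The two values are the selectors of the contract’s two public functions, balanceOf(address) and deposit(uint256). The PUSH4 0xffffffff that precedes them is only a mask and is correctly skipped because no EQ follows it.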

Of course this sample is pretty simple, and real-world examples can be much more complicated. You can check those with the tool as well. The tool currently has the following features:

  • Recognition of public and private functions
  • Recognition of storage variables
  • Optimization of memory-related operations
  • Comments on external calls
  • Recognition of built-in Solidity functions (ecrecover, sha256, ripemd160, sha3)
  • Inspection for well-known vulnerabilities

Besides the features above, there are still things that can be improved in the future. Loop recognition can be done better than emitting goto statements, as it does for now. Dynamically sized variables such as bytes and string can also be presented in a more readable way. There might also be bugs in how local variables are shown, since there is no RET instruction in the EVM, so stack alignment is always an issue when a function returns. Since the tool is still experimental, you are very welcome to contact us about any issue you find.

We hope you enjoy using it. Let us know your comments and advice!

Understand EVM bytecode – Part 4

In previous sections:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

Understand EVM bytecode – Part 3

We have talked about how different Solidity data types are implemented in storage. In this section we will dig deeper into memory and its usage in external calls.

We have learned some basics about memory from the previous sections. We know memory is designed for HASH calculation and for interactions with external calls or returns. The memory layout reserves 0x0 and 0x20 for HASH calculation. Address 0x40 stores a free memory pointer for future use. When some memory needs to be allocated, the pointer is adjusted accordingly so that the allocated memory won’t be handed out again. Also, memory is only accessible during contract execution. Once execution is finished, its contents are discarded. Compared to storage, it is more like the RAM in a computer.
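The allocation scheme can be illustrated with a toy model in Python. This is a sketch under assumptions, not an EVM implementation: the class and method names are ours, and the initial pointer value 0x80 is taken from the 6080604052 prologue (PUSH1 0x80, PUSH1 0x40, MSTORE) of the Solidity 0.4.25 bytecode shown in Part 3; the 0.4.18 bytecode starts with 6060604052 and therefore uses 0x60.

```python
class Memory:
    """Toy model of EVM memory: byte-addressed, zero-initialised, grows on demand."""

    FREE_PTR = 0x40  # slot holding the free memory pointer

    def __init__(self):
        self.data = bytearray()
        self.mstore(self.FREE_PTR, 0x80)  # assumption: prologue of solc 0.4.25

    def _grow(self, end: int):
        if end > len(self.data):
            self.data.extend(bytes(end - len(self.data)))

    def mstore(self, offset: int, value: int):
        self._grow(offset + 32)
        self.data[offset:offset + 32] = value.to_bytes(32, "big")

    def mload(self, offset: int) -> int:
        self._grow(offset + 32)
        return int.from_bytes(self.data[offset:offset + 32], "big")

    def alloc(self, size: int) -> int:
        """Return the current free pointer and bump it by size bytes."""
        ptr = self.mload(self.FREE_PTR)
        self.mstore(self.FREE_PTR, ptr + size)
        return ptr


mem = Memory()
first = mem.alloc(0x40)    # e.g. room for two 32-byte words
second = mem.alloc(0x20)
print(hex(first), hex(second), hex(mem.mload(0x40)))  # 0x80 0xc0 0xe0
```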

Let’s look at all the opcodes that depend on memory besides MLOAD and MSTORE:

SHA3
CALLDATACOPY
CODECOPY
EXTCODECOPY
RETURNDATACOPY
LOG1,LOG2,LOG3,LOG4
CREATE
CALL
CALLCODE
RETURN
DELEGATECALL
STATICCALL
REVERT

We can see that, apart from SHA3, which is used to calculate HASH values, most of the remaining opcodes deal with interactions with the EVM. They either load data from the EVM data payload or code, or arrange data for external calls and returns.

First, let’s look at the opcode CALLDATACOPY. In the Solidity documentation, “calldatacopy(t, f, s)” is defined as “copy s bytes from calldata at position f to mem at position t”. If you have experience analyzing contracts at the bytecode level, you may have observed that a similar opcode, CALLDATALOAD, is more common than this one. The difference between the two opcodes is that CALLDATALOAD only loads 32 bytes of data, and it loads them onto the stack instead of into memory. If a contract’s public function only takes integers as arguments, then CALLDATALOAD is good enough for the call. The format of the data payload will be as follows:

0x00: <function signature hash>
0x04: <first integer argument if any>
0x24: <second integer argument if any>

...
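A short Python sketch shows how a payload in this layout can be split off-chain. The function name is ours; the example payload uses 0xb6b55f25, the selector that appears for deposit(uint256) in the Bank bytecode of the decompiler demo, together with an arbitrary amount of 1000.

```python
def decode_static_calldata(data: bytes):
    """Split calldata into the 4-byte selector and a list of 32-byte words."""
    selector = data[:4].hex()
    body = data[4:]
    words = [int.from_bytes(body[i:i + 32], "big")
             for i in range(0, len(body), 32)]
    return selector, words


# 0x00: selector, 0x04: first (and only) integer argument
payload = bytes.fromhex("b6b55f25") + (1000).to_bytes(32, "big")
print(decode_static_calldata(payload))   # ('b6b55f25', [1000])
```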

However, there are exceptions. Functions in Solidity can take more data types than plain integers. For example, here is a function that takes a fixed-size array:

contract Data7 {
    address a;
    address b;

    function test(address[2] addresses) public returns (bool) {
        a = addresses[0];
        b = addresses[1];
        return true;
    }
}

When you look at the bytecode generated from the code above, you can see:

temp0 = mload(0x40);
mstore(0x40,(0x40 + temp0));
calldatacopy(temp0,0x4,0x40);

This time CALLDATACOPY is used to copy the whole argument into memory for later use. This does not apply only to fixed-size arrays. For any parameter with a fixed size, such as a struct, CALLDATACOPY is the opcode that does the work.

However, things get even more complicated when there are dynamic arrays. For example:

contract Data8 {
    address a;
    address b;

    function test(address[] addresses) public returns (bool) {
        a = addresses[0];
        b = addresses[1];
        return true;
    }
}

We can see there is only one function, test(), which takes an address array as its argument. So how is the data arranged inside the data payload? Let’s go into the bytecode again to find out. Here is the code snippet that runs before the test() function is called:

temp0 = mload(0x40);
temp1 = msg.data(0x4);
mstore(0x40,(0x20 + (temp0 + (msg.data((0x4 + temp1)) * 0x20))));
mstore(temp0,msg.data((0x4 + temp1)));
calldatacopy((temp0 + 0x20),(0x24 + temp1),(msg.data((0x4 + temp1)) * 0x20));
var1 = test(temp0);

It might not be obvious from this how the data payload is structured, but don’t worry. Let’s go through it together. The first line, “temp0 = mload(0x40);”, is a very common one. It loads the free memory pointer from memory address 0x40 into the variable temp0. Then temp1 is assigned the value read from the data payload at offset 0x4, which normally holds the first parameter when its type is integer. In this case, however, that is not the whole story. The value in temp1 is used as an offset, counted from 0x4, to locate the data. The value found there can be written as “msg.data((0x4 + temp1))”, and from the code snippet above we can tell that it is the length of the array. Each item in the array takes 0x20 bytes, so “mstore(0x40,(0x20 + (temp0 + (msg.data((0x4 + temp1)) * 0x20))));” advances the free memory pointer to reserve a piece of memory for this argument. Then the array length is copied to the old free memory pointer, and the array items are copied right after it by CALLDATACOPY. Finally, the memory pointer is passed to the test() function.
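The same walk through the payload can be written as a small Python sketch. The function names and the selector 0x12345678 are hypothetical; the encoder assumes the usual layout for a single dynamic argument, in which the offset word at 0x4 holds 0x20.

```python
def decode_address_array(data: bytes):
    """Decode calldata for f(address[]) following the walkthrough above."""
    def word(pos: int) -> int:
        return int.from_bytes(data[pos:pos + 32], "big")

    offset = word(0x4)             # temp1: offset of the array, counted from 0x4
    length = word(0x4 + offset)    # msg.data(0x4 + temp1): number of items
    start = 0x24 + offset          # first item, right after the length word
    mask160 = (1 << 160) - 1       # an address occupies the low 20 bytes
    return [word(start + 0x20 * i) & mask160 for i in range(length)]


def encode_address_array(selector: bytes, addrs):
    """Build a payload: selector, offset word, length word, then the items."""
    out = selector + (0x20).to_bytes(32, "big") + len(addrs).to_bytes(32, "big")
    for addr in addrs:
        out += addr.to_bytes(32, "big")
    return out


payload = encode_address_array(bytes.fromhex("12345678"), [0xAA, 0xBB])
print([hex(a) for a in decode_address_array(payload)])   # ['0xaa', '0xbb']
```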

Now that we have a basic idea of how the EVM data payload is arranged, we can see how the CALL-related opcodes work and how memory is involved. The Solidity documentation states: “call(g, a, v, in, insize, out, outsize) – call contract at address a with input mem[in..(in+insize)) providing g gas and v wei and output area mem[out..(out+outsize)) returning 0 on error (eg. out of gas) and 1 on success”. So when you use this opcode to call another smart contract, you need to supply all of this information. Let’s look at a real example of an external contract reference:

pragma solidity ^0.4.18;
contract Deployed {
    function a() public pure returns (uint) {}
}
contract Existing {
    Deployed dc;
    function Existing(address _t) public {
        dc = Deployed(_t);
    }
    function getA() public view returns (uint result) {
        return dc.a();
    }
}

Here we define 2 smart contracts. The getA() function in the Existing contract calls the external function a() in Deployed. Here is the decompiled form of the compiled bytecode:

temp0 = mload(0x40);
mstore(temp0,0xDBE671F00000000000000000000000000000000000000000000000000000000);
var11 = uint160(sload(0x0));
require(extcodesize(var11));
var6 = var11.gas(gasleft).value(0).call(temp0,4,temp0,0x20);

As always, the free memory pointer is loaded into the variable temp0. Then the function hash 0xDBE671F (that is, the 4-byte selector 0x0DBE671F, left-aligned in a 32-byte word) is saved at the free memory location. Next, var11 holds the first storage variable, which is dc in this case. The require line checks whether the address in var11 is a contract address. Finally, this address is used to make the external call “var11.gas(gasleft).value(0).call(temp0,4,temp0,0x20)”. The first pair of parameters (temp0 and 4) describes the data payload to send to the external contract. In this case it sends 4 bytes starting at address temp0. From the earlier code we know these 4 bytes are the function signature hash of a(). Because the function takes no arguments, only the 4-byte function hash is needed to make this call. If the function you want to call does have arguments, the compiler arranges the memory for the call in the way we discussed earlier. The second pair of parameters (temp0 and 0x20) describes where the return data from the external call goes. The EVM takes the data returned from the external call and puts it at the memory address that the CALL opcode specified.
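The relationship between the stored word and the 4 bytes that are actually sent can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are ours.

```python
SELECTOR_A = 0x0DBE671F            # selector of a(), as used in the mstore above

word = SELECTOR_A << (28 * 8)      # left-align the 4 bytes in a 32-byte word
memory_word = word.to_bytes(32, "big")
call_input = memory_word[:4]       # call(temp0, 4, ...) sends only these bytes

print(hex(word))                   # 0xdbe671f followed by 56 zeros
print(call_input.hex())            # 0dbe671f
```

Printed as a number, the word loses its leading zero and has 63 hex digits, which is presumably why the decompiled code shows the constant as 0xDBE671F followed by zeros.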

One thing worth mentioning about external calls is that there are some hard-coded addresses for built-in functions in Solidity. If you look at real-world EVM bytecode, you will often find external calls to the addresses 1, 2, 3 and 4. Clearly they are not normal smart contract addresses. I searched the Internet and could not find much information about them, but I did find a clue in the Solidity documentation:

Expressions that might have a side-effect on memory allocation are allowed, but those that might have a side-effect on other memory objects are not. The built-in functions keccak256, sha256, ripemd160, ecrecover, addmod and mulmod are allowed (even though they do call external contracts).

Look at what it says in the brackets: these built-in functions call external contracts. So I wrote a smart contract that uses these functions and looked at the hard-coded addresses it calls:

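1 - ecrecover
2 - sha256
3 - ripemd160
4 - identity (data copy)

Note that address 0x4 is not a hash function. It is the identity precompiled contract, which simply returns its input and is typically used to copy memory. SHA3 (keccak256) has its own opcode, so no external call is needed for that calculation. You can still see plenty of deployed smart contracts making external calls to 0x4, but those calls copy data rather than hash it.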

So far, we have discussed how memory plays an important role in the EVM environment, especially when making external calls to other smart contracts. This is the last section of this series. We have covered most of the things you might encounter when you want to analyze EVM bytecode. I hope it helps you understand a little better how it works.


Understand EVM bytecode – Part 3

In previous sections:

Understand EVM bytecode – Part 1

Understand EVM bytecode – Part 2

We have talked about the creation and runtime parts of EVM bytecode. We have seen that stack variables are commonly used to supply operands to opcodes or to pass integer arguments between internal function calls. All state-related variables need to be saved in storage. However, storage is designed as a dictionary or hash table that holds all data as key-value pairs. Each address key can store one 32-byte integer. So how can it implement complicated data structures such as structs, mappings and variable-length arrays? Let’s dig into their implementations.

Let’s start with the relatively easy ones: fixed-size data types. We already know that the EVM supports integers of different lengths, from 1 byte to 32 bytes. If you only define several 32-byte integer variables in your smart contract, the compiler simply assigns the variables a sequence of addresses starting from 0x0. For example:

contract data1 {
    uint256 public balance1;
    uint256 public balance2;
    uint256 public balance3;
}

If you compile the code above and check the EVM bytecode, you will find that all code that reads or writes the 3 variables balance1, balance2 and balance3 maps to SLOAD or SSTORE operations on addresses 0x0, 0x1 and 0x2 respectively. However, we all know that storage usage costs a lot of gas. If you have no idea what gas is in the EVM, I recommend doing a bit of research on it. There are tons of articles about it. It is basically the execution cost of your contract code. So, to reduce the gas cost of storage usage, multiple small integers are packed into a single storage slot. For example:

contract data1 {
    uint128 public balance1;
    uint128 public balance2;
    uint256 public balance3;
}

When 2 variables are defined as uint128, both of them fit in one storage slot at address 0x0, and the third variable, balance3, gets address 0x1 for itself. When you refer to the variable balance1, its 16-byte value is extracted from the slot after the SLOAD, or merged into the slot before the SSTORE. Let’s look at a real example:

pragma solidity 0.4.25;
contract Demo1 {

    uint128 public balance1;
    uint128 public balance2;
    uint256 public balance3;

    function add() public returns (uint256) {
        balance3 = balance1 + balance2;
        return balance3;
    }
}

After compiling it using Remix, we can get the runtime bytecode as:

6080604052600436106100615763ffffffff7c0100000000000000000000000000
00000000000000000000000000000060003504166340441eec8114610066578063
4f2be91f146100a0578063c45c4f58146100c7578063f24a0faa146100dc575b60
0080fd5b34801561007257600080fd5b5061007b6100f1565b604080516fffffff
ffffffffffffffffffffffffff9092168252519081900360200190f35b34801561
00ac57600080fd5b506100b561011d565b60408051918252519081900360200190
f35b3480156100d357600080fd5b5061007b610158565b3480156100e857600080
fd5b506100b5610170565b60005470010000000000000000000000000000000090
046fffffffffffffffffffffffffffffffff1681565b6000546fffffffffffffff
ffffffffffffffffff808216700100000000000000000000000000000000909204
81169190910116600181905590565b6000546fffffffffffffffffffffffffffff
ffff1681565b600154815600a165627a7a72305820fa0e623e455a9cc0439ea393
dff5b50cc571150034eb57658dd9718e3982b1590029

In the interest of time, we won’t go through all of it. Let’s just look at the addition inside the add() function:

011E 60 PUSH1 0x00
0120 54 SLOAD
0121 6F PUSH16 0xffffffffffffffffffffffffffffffff
0132 80 DUP1
0133 82 DUP3
0134 16 AND
0135 70 PUSH17 0x0100000000000000000000000000000000
0147 90 SWAP1
0148 92 SWAP3
0149 04 DIV
014A 81 DUP2
014B 16 AND
014C 91 SWAP2
014D 90 SWAP1
014E 91 SWAP2
014F 01 ADD
0150 16 AND
0151 60 PUSH1 0x01
0153 81 DUP2
0154 90 SWAP1
0155 55 SSTORE

To make it more readable, we can write the mathematically equivalent assembler code as follows:

sstore(0x1,uint128((uint128((sload(0x0) / 0x100000000000000000000000000000000)) + uint128(sload(0x0)))));

The use of storage slots 0x0 and 0x1 shows clearly in the line above, which corresponds to the statement balance3 = balance1 + balance2;.
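The same arithmetic can be replayed in Python. This is a sketch with helper names of our own. It relies on Solidity’s packing rule that the first item in a slot is stored lower-order aligned, so balance1 sits in the low 16 bytes of slot 0x0 and balance2 in the high 16 bytes.

```python
MASK128 = (1 << 128) - 1


def pack(balance1: int, balance2: int) -> int:
    """Slot 0x0: balance1 in the low 16 bytes, balance2 in the high 16 bytes."""
    return ((balance2 & MASK128) << 128) | (balance1 & MASK128)


def add_from_slot(slot0: int) -> int:
    """Mirror of sstore(0x1, uint128(uint128(sload(0x0) / 2**128) + uint128(sload(0x0))))."""
    low = slot0 & MASK128                    # uint128(sload(0x0))
    high = (slot0 // (1 << 128)) & MASK128   # uint128(sload(0x0) / 0x1000...0)
    return (high + low) & MASK128            # the outer uint128(...) truncates


print(add_from_slot(pack(3, 4)))             # 7
print(add_from_slot(pack(MASK128, 1)))       # 0, the 128-bit sum wraps around
```

The second call shows that the sum is truncated to 128 bits before it is written to the 256-bit balance3, which is exactly what the final AND in the opcode listing does.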

We can see that one line of Solidity code can result in a lot of opcode instructions. Later, when we discuss even more complicated data structures, it would be overwhelming to look at every opcode. So, to keep the content readable, I will only show the equivalent assembler code to demonstrate the logic. However, you are always welcome to spend more time digging through the bytecode instruction by instruction for practice.

We have talked about the most basic data type in the EVM, the integer. Now let’s look at struct in Solidity:

pragma solidity 0.4.25;
contract Data2 {
    struct Funder {
        address addr;
        uint256 amount;
    }
    Funder test;
    function deposit(address addr, uint256 amount) public returns (uint256) {
        test.addr = addr;
        test.amount = amount;
        return amount;
    }
}

After generating the runtime EVM bytecode and locating the deposit() function, we can see that the equivalent assembler code looks like this:

sstore(0x0,uint160(arg0));
sstore(0x1,arg1);

We can see that even though we define a struct Funder in our Solidity program, the compiler generates the same code as it would for 2 normal storage variables. Also, if multiple members of the struct fit in one storage slot, they are packed together as described earlier.

Now let’s talk about a very popular data type, mapping, in Solidity.

pragma solidity ^0.4.25;
contract Bank {
    mapping(address => uint256) public balanceOf; // balances, indexed by addresses
    function deposit(uint256 amount) public payable {
        require(msg.value == amount);

        balanceOf[msg.sender] += amount; // adjust the account's balance
    }
}

If you are familiar with smart contract development, you have probably seen a lot of Dapps use a mapping in a similar way to record the balance of an account. Let’s compile it and check the assembler code for the mapping variable balanceOf:

mstore(0x20,0x0);
mstore(0x0,msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF);
temp0 = keccak256(0x0,0x40);
return(sload(temp0));

We can see that msg.data(0x04) & 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF reads the parameter from the data payload as an address value. This argument is then put at memory address 0x0, and memory address 0x20 is set to the value 0x0. This 0x0 is actually the declaration index of the storage variable. It is 0x0 only because balanceOf is the first declared variable in the smart contract. After the memory is set up, the SHA3 opcode is called to calculate the HASH value of these inputs, and the result is used as the storage key for the mapping entry. From the code above, we can see that when you declare a one-level mapping variable, the compiler effectively treats it as a function that takes 1 argument and returns one value. For readability, let’s use arg0 for the value placed at address 0x0 in memory for the mapping calculation. Now let’s take a look at a 2-level mapping:

mapping (address => mapping (address => uint256)) public tokens;

You might have seen mapping variables similar to the one above. When you compile the code and generate the bytecode, its assembler code looks like this:

mstore(0x20,0x0);
mstore(0x0,arg0);
temp0 = keccak256(0x0,0x40);
mstore(0x20,temp0);
mstore(0x0,arg1);
temp1 = keccak256(0x0,0x40);
return(sload(temp1));

A 2-level mapping variable is effectively a function that takes 2 arguments and returns one value. For this tokens example, it first sets up memory with 0x0 at address 0x20 (we know 0x0 is the variable declaration index) and arg0 at address 0x0. Then the first HASH value is calculated with the SHA3 opcode. This calculated value is then put at memory address 0x20 again, and the second argument, arg1, is put at address 0x0 for another SHA3 calculation. That result is used as the storage key for the variable. By the same logic, you can keep adding levels to your mapping variables, and the storage key will always be the last calculated SHA3 result.

By now you might have a better understanding of why the free memory pointer is always saved at 0x40, and not at 0x0 or 0x20. It is simply because those addresses are reserved for HASH calculations. So far we have visited several data structures in Solidity. What if we put some of them together? For example, what if we have a mapping variable that returns a struct value:

pragma solidity 0.4.25;
contract Data4 {
    struct Funder {
        uint256 last_deposit;
        uint256 amount;
    }
    mapping(address => Funder) public balanceOf;
}

Let’s have a look at the decompiled code for balanceOf then:

function balanceOf( address arg0) public return (var0,var1)
{
    mstore(0x20,0x0);
    mstore(0x0,arg0);
    temp0 = keccak256(0x0,0x40);
    return(sload(temp0),sload((temp0 + 0x1)));
}

We can see that the SHA3 calculation is the same as for a regular 1-level mapping variable. The only difference is that a regular data type such as an integer uses the hash result directly as the storage key, whereas a member of a struct adds its declaration index within the struct to that hash value to get its storage key. So in this case, the 2 members of the struct Funder add 0x0 and 0x1 to the hash value temp0 to get their storage keys.

Next, we will have a look at data structures with variable length. For example, take a smart contract coded as follows:

pragma solidity 0.4.25;
contract Data4 {
    address[] public senders;
    function add() public {
        senders.push(msg.sender);
    }
}

We defined a variable-length array, senders. Let’s see how the EVM implements this variable in storage:

function senders( uint256 arg0) public return (var0)
{
    assert((arg0 < sload(0x0)));
    mstore(0x0,0x0);
    temp0 = keccak256(0x0,0x20);
    return(uint160(sload((temp0 + arg0))));
}

From the assembler code above, we can observe that the variable is implemented much like a function that takes one argument, the index, and returns the corresponding item of the array. In the Solidity source we only defined one array variable, senders. As discussed earlier, storage assigns an address index to every variable, and since there is only one, 0x0 is assigned to it. However, address 0x0 is not enough to hold all the data of the array. Instead, it holds the length of the array. For the array data, storage uses the SHA3 value of the array’s address index (0x0 in this case) as the start of the data. The array index is added to this hash value to get the storage key for the respective item. So let’s look back at the code snippet above. When an external caller wants to access an item of the array by index, the code first checks that the index is smaller than the length of the array. If it is, the hash value is calculated, and the index is added to this hash to locate the item of the array.
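The storage-key rules covered so far (mapping, nested mapping, struct member and variable-length array) can be reproduced off-chain. The Python sketch below is ours and makes one practical choice: Python’s standard library has no Keccak-256 (hashlib.sha3_256 uses different padding), so a compact, slow, pure-Python Keccak-256 is included. The helper names are hypothetical.

```python
def _rol64(a, n):
    return ((a >> (64 - (n % 64))) + (a << (n % 64))) % (1 << 64)


def _keccak_f(lanes):
    """Keccak-f[1600] permutation on a 5x5 array of 64-bit lanes."""
    r = 1
    for _ in range(24):
        c = [lanes[x][0] ^ lanes[x][1] ^ lanes[x][2] ^ lanes[x][3] ^ lanes[x][4]
             for x in range(5)]
        d = [c[(x + 4) % 5] ^ _rol64(c[(x + 1) % 5], 1) for x in range(5)]
        lanes = [[lanes[x][y] ^ d[x] for y in range(5)] for x in range(5)]
        x, y = 1, 0
        cur = lanes[x][y]
        for t in range(24):
            x, y = y, (2 * x + 3 * y) % 5
            cur, lanes[x][y] = lanes[x][y], _rol64(cur, (t + 1) * (t + 2) // 2)
        for y in range(5):
            row = [lanes[x][y] for x in range(5)]
            for x in range(5):
                lanes[x][y] = row[x] ^ ((~row[(x + 1) % 5]) & row[(x + 2) % 5])
        for j in range(7):
            r = ((r << 1) ^ ((r >> 7) * 0x71)) % 256
            if r & 2:
                lanes[0][0] ^= 1 << ((1 << j) - 1)
    return lanes


def keccak256(data: bytes) -> bytes:
    """Keccak-256 as used by the EVM SHA3 opcode (padding byte 0x01)."""
    rate = 136
    padded = bytearray(data) + b"\x01" + bytes(-(len(data) + 1) % rate)
    padded[-1] ^= 0x80
    state = bytearray(200)
    for off in range(0, len(padded), rate):
        for i in range(rate):
            state[i] ^= padded[off + i]
        lanes = [[int.from_bytes(state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8], "little")
                  for y in range(5)] for x in range(5)]
        lanes = _keccak_f(lanes)
        for x in range(5):
            for y in range(5):
                state[8 * (x + 5 * y): 8 * (x + 5 * y) + 8] = lanes[x][y].to_bytes(8, "little")
    return bytes(state[:32])


def _word(n: int) -> bytes:
    return n.to_bytes(32, "big")


def mapping_slot(key: int, slot: int) -> int:
    # mstore(0x0, key); mstore(0x20, slot); keccak256(0x0, 0x40)
    return int.from_bytes(keccak256(_word(key) + _word(slot)), "big")


def nested_mapping_slot(key0: int, key1: int, slot: int) -> int:
    # the first hash takes the place of the declaration index for the second one
    return mapping_slot(key1, mapping_slot(key0, slot))


def struct_member_slot(key: int, slot: int, member_index: int) -> int:
    return (mapping_slot(key, slot) + member_index) % (1 << 256)


def array_item_slot(slot: int, index: int) -> int:
    # mstore(0x0, slot); keccak256(0x0, 0x20); then add the item index
    return (int.from_bytes(keccak256(_word(slot)), "big") + index) % (1 << 256)


print(hex(array_item_slot(0, 0)))   # storage key of senders[0]
```

As a sanity check, the first four bytes of keccak256(b"balanceOf(address)") are 70a08231, the selector seen in the Bank bytecode.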

So far we have discussed struct, mapping and array. There is one more data type you might be interested in: string. In Solidity, string and bytes are stored the same way, as a variable-length sequence of bytes. Let’s see what a string variable declared in storage looks like:

pragma solidity 0.4.25;
contract Data6 {
    string public hello = "Hello World!";
}

This smart contract does nothing but return a string to the caller. Let’s see how the string is saved in storage. Even though there is only one line of Solidity, the compiled assembler code is a bit complicated:

temp0 = mload(0x40);
mstore(0x40,(temp0 + (0x20 + (((0x1F + ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2)) / 0x20) * 0x20))));
mstore(temp0,((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2));
var5 = (0x20 + temp0);
var7 = ((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2);
if (var7)
{
    if ((0x1F < var7))
    {
        temp1 = (var5 + var7);
        var5 = temp1;
        mstore(0x0,0x0);
        temp2 = keccak256(0x0,0x20);
        var7 = var5;
        var6 = temp2;
label_00000148:
        mstore(var7,sload(var6));
        var6 = (0x1 + var6);
        var7 = (0x20 + var7);
        if ((var5 > var7))
        {
            goto label_00000148;
        }
        else
        {
            temp4 = (var5 + (0x1F & (var7 - var5)));
            var5 = temp4;
label_00000165:
            return;
        }
    }
    else
    {
        mstore(var5,((sload(0x0) / 0x100) * 0x100));
        goto label_00000165;
    }
}

The assembler code is a bit complicated, so let’s explore its logic. First, let’s figure out the logic of “((((0x100 * ((0x1 & sload(0x0)) == 0)) - 0x1) & sload(0x0)) / 0x2)”. Since there is only one variable, hello, defined in the contract, storage slot 0x0 is reserved for this variable. But it seems it is not just a length field as it is for arrays. From the comparison “((0x1 & sload(0x0)) == 0)” we know the last bit of this field is a flag. If the bit is set to 1, the comparison is False (0x0), the mask “(0x100 * 0) - 0x1” wraps around to all ones, and the whole line becomes “(uint256(sload(0x0)) / 0x2)”. If the bit is set to 0, the mask is 0xFF and the whole line becomes “(uint8(sload(0x0)) / 0x2)”. This value is assigned to var7 in the code above. Then, this value is compared with 0x1F. If it is not bigger than 0x1F, the line “mstore(var5,((sload(0x0) / 0x100) * 0x100));” copies the first 0x1F bytes from storage slot 0x0 to the memory. If it is bigger than 0x1F, the data is copied by a loop similar to the one we have seen for variable-length arrays.

Based on what we have seen in the assembler code, we can work out the basic logic of strings. To make the best use of storage, if the string is at most 0x1F (31) bytes long, it can be saved together with its length in a single storage slot. In that case the last bit of the slot is set to 0, the last byte holds the length of the string multiplied by 2, and the string itself is saved in the first 31 bytes of the slot. However, if the string is longer than 31 bytes, one storage slot cannot hold all the data. So the slot only holds the length field (the length multiplied by 2, with the last bit set to 1), and the data is saved in the same way as a variable-length array.
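The two layouts can be decoded with a short Python sketch. It is an illustration only: storage is again modelled as a dictionary, read_string is a hypothetical helper, and the start of the long-string data (keccak256 of the slot index) is passed in as a parameter, using the known hash of slot 0x0 as a hard-coded constant.

```python
# keccak256 of 32 zero bytes (data start for a variable at slot 0x0), hard-coded because
# hashlib.sha3_256 is not the EVM keccak256.
SLOT0_DATA_START = 0x290DECD9548B62A8D60345A988386FC84BA6BC95484008F6362F93160EF3E563
MASK256 = (1 << 256) - 1


def read_string(storage, slot, data_start):
    """Decode a string/bytes storage variable (hypothetical helper)."""
    head = storage.get(slot, 0)
    if head & 1 == 0:
        # Short form: the last byte is length * 2, the data sits in the first 31 bytes.
        length = (head & 0xFF) // 2
        return head.to_bytes(32, "big")[:length]
    # Long form: the slot holds length * 2 + 1, the data starts at keccak256(slot).
    length = head // 2
    words = (length + 31) // 32
    data = b"".join(
        storage.get((data_start + i) & MASK256, 0).to_bytes(32, "big")
        for i in range(words)
    )
    return data[:length]


# "Hello World!" is 12 bytes long, so it fits in slot 0x0 with 2 * 12 in the last byte.
short_head = int.from_bytes(b"Hello World!".ljust(31, b"\x00") + bytes([2 * 12]), "big")
print(read_string({0x0: short_head}, 0x0, SLOT0_DATA_START))
```

A 40-byte string would instead store 81 (40 * 2 + 1) in slot 0x0 and spread its bytes over the two slots starting at data_start.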

So far, we have discussed how most of the data types in Solidity are implemented in storage. In the next section we will talk about memory and its interaction with the data payload.


Understand EVM bytecode – Part 2

In the first section, Understand EVM bytecode – Part 1, we inspected the contract creation part of the EVM bytecode of the smart contract. In this section we will analyze the runtime EVM bytecode. We will still use the sample code demo1 from the previous section.

pragma solidity 0.4.25;

contract Demo1 {
    uint public balance;

    function add(uint value) public returns (uint256) {
        balance = balance + value;
        return balance;
    }
}

Besides going through the whole compiled bytecode to find the EVM runtime part, there is an easier way: the Remix online portal. After compiling the source code, click on the “Details” button and scroll down. There is a member field called “object” in the “RUNTIME BYTECODE” section.

The whole EVM runtime part bytecode can be shown as follows:

60806040526004361060485763ffffffff7c010000000000000000000000000000
00000000000000000000000000006000350416631003e2d28114604d578063b69e
f8a8146074575b600080fd5b348015605857600080fd5b5060626004356086565b
60408051918252519081900360200190f35b348015607f57600080fd5b50606260
95565b60008054820190819055919050565b600054815600a165627a7a72305820
63aa00920d824233ab5307ef3a379c757bdbee62fe00fe36a5d852c766e58fef0029

This part of the bytecode seems much longer than the creation part. But let’s split it into parts and analyze them one by one.

0000 60 PUSH1 0x80
0002 60 PUSH1 0x40
0004 52 MSTORE
0005 60 PUSH1 0x04
0007 36 CALLDATASIZE
0008 10 LT
0009 60 PUSH1 0x48
000B 57 *JUMPI
...
0048 5B JUMPDEST
0049 60 PUSH1 0x00
004B 80 DUP1
004C FD *REVERT

If you have finished reading the first section, you will have no problem understanding the first 3 instructions. If you haven’t done so, I really recommend you read it first. These instructions save the value 0x80 at offset 0x40 in the memory as the free memory pointer for future use. There is a new opcode, CALLDATASIZE, at 0x07, which we haven’t met before. It gets the size of the data payload of this transaction. LT is an opcode that compares 2 items on the stack; it returns TRUE if the first is less than the second. Putting all the pieces together, we get the equivalent Solidity assembler code as follows:

mstore(0x40,0x80);
if(msg.data.length < 0x04) { revert(0,0); }

We can see that these instructions initialize the free memory pointer and check that the data payload is at least 4 bytes long. The reason is that for a regular external call to a smart contract, the first 4 bytes of the data payload are the HASH value of the function signature. This 4-byte value is used by the contract to select the function that receives the rest of the data, which holds the parameters for that function. For example, if you call the function withdrawn(0xABCD) on a smart contract, the data payload for this call will look like this:

0x3823D66C000000000000000000000000000000000000000000000000000000000000ABCD

In this example the first 4-byte value is 0x3823D66C, which is taken from the first 4 bytes of the SHA3 hash value of “withdrawn(bytes32)”. The following 32-byte value is the parameter of the function call, 0xABCD. This is a simple example with a fixed-size function parameter. Things get more complicated when handling variable-size parameters. We will talk about them later.
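As a small illustration of this layout, the following Python sketch splits a data payload into the 4-byte function selector and the 32-byte parameter words that follow it. The helper name split_payload is made up for this example, and it only handles fixed-size parameters like the one above.

```python
def split_payload(payload_hex):
    """Split a hex data payload into (selector, [32-byte words as integers])."""
    if payload_hex.startswith("0x"):
        payload_hex = payload_hex[2:]
    data = bytes.fromhex(payload_hex)
    if len(data) < 4:
        # Mirrors the msg.data.length < 0x04 check: there is no selector to read.
        raise ValueError("payload shorter than 4 bytes")
    selector, body = data[:4], data[4:]
    words = [int.from_bytes(body[i:i + 32], "big") for i in range(0, len(body), 32)]
    return selector.hex(), words


# The payload from the example: selector 0x3823D66C followed by 0xABCD padded to 32 bytes.
PAYLOAD = "0x3823D66C" + "0" * 60 + "ABCD"
selector, params = split_payload(PAYLOAD)
print(selector, [hex(p) for p in params])
```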

For now, let’s go back to the instructions we have discussed. They check that the data payload is at least 4 bytes long. If not, the call reverts. But is this the case for all smart contracts? What if we just send some Ether to this smart contract without calling any function? You might already recall a feature of Solidity: yes, this is where the fallback function is implemented. To verify this, you can compile a sample with a fallback function implemented, and then check the instruction branch after the msg.data.length validation code.

To continue with our sample bytecode, we can see the following code snippet:

0C: 63 PUSH4 0xffffffff
11: 7C PUSH29 0x100000000000000000000000000000000000000000000000000000000
2F: 60 PUSH1 0x00
31: 35 CALLDATALOAD
32: 04 DIV
33: 16 AND
34: 63 PUSH4 0x1003e2d2
39: 81 DUP2
3A: 14 EQ
3B: 60 PUSH1 0x4d
3D: 57 *JUMPI
3E: 80 DUP1
3F: 63 PUSH4 0xb69ef8a8
44: 14 EQ
45: 60 PUSH1 0x74
47: 57 *JUMPI

If you look at more EVM bytecode from different smart contracts, you will always find the same code snippet or a functionally equivalent one. This code snippet does not come from the code you put into your source; it was injected by the compiler. Let’s go through it line by line.

First, the code pushes 0xFFFFFFFF onto the stack. This value will be used as one of the operands of the AND opcode at 0x33. Then the code pushes another huge constant value (0x01 followed by 28 zero bytes, that is 2^224), which will be used as the divisor of DIV at 0x32. With the instructions at 0x2F and 0x31, the code gets the first 32-byte value from the data payload using CALLDATALOAD(0x0). Combined with the later DIV and AND, these instructions extract the first 4 bytes of the data payload, which is the HASH value of the function signature. The calculated result is left on the stack. Then the value is compared with 0x1003e2d2 using the EQ opcode at 0x3A. If they are equal, execution is led to address 0x4D by JUMPI. If not, the code continues and the result is compared with another HASH value, 0xb69ef8a8.

Now the logic of this code snippet is pretty clear. It gets the first 4-byte value from the data payload and decides which function to run. The code addresses 0x4D and 0x74 are the entrances of the public functions that callers can access in the smart contract. If none of the HASH values matches, execution falls through to the fallback function of the smart contract. If no fallback function is defined, it simply reverts.
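The dispatch logic can be mirrored in Python to make the arithmetic concrete. This is a sketch, not compiler output: the function names are invented for the example, and the table simply maps the two HASH values from the snippet to their jump targets, with 0x48 (the revert block of this contract, which has no fallback function) as the default.

```python
# Jump targets taken from the disassembly of Demo1.
DISPATCH_TABLE = {0x1003E2D2: 0x4D, 0xB69EF8A8: 0x74}
REVERT_BLOCK = 0x48  # reached when the payload is too short or no HASH value matches


def selector_of(calldata):
    """Reproduce CALLDATALOAD(0x0) / 2**224 & 0xFFFFFFFF."""
    # CALLDATALOAD pads with zero bytes when fewer than 32 bytes are available.
    word = int.from_bytes(calldata[:32].ljust(32, b"\x00"), "big")
    return (word // 2**224) & 0xFFFFFFFF


def dispatch(calldata):
    """Return the code address the dispatcher would jump to."""
    if len(calldata) < 4:
        return REVERT_BLOCK
    return DISPATCH_TABLE.get(selector_of(calldata), REVERT_BLOCK)


add_call = bytes.fromhex("1003e2d2") + (5).to_bytes(32, "big")
print(hex(selector_of(add_call)), hex(dispatch(add_call)))
```

Dividing by 2^224 shifts the 32-byte word right by 28 bytes, so only the top 4 bytes remain; the AND with 0xFFFFFFFF is then a no-op for a 32-byte word, and the compiler emits it only as a safety mask.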

Since HASH algorithms lose information during the calculation, it is theoretically impossible to get the original function signature back from its hash. However, you can still make a guess at the original information based on a huge collection of HASH values. That is what the website www.4byte.directory does. It collects tons of function signatures and their hashes and provides a web API for searching. For example, by browsing the following link:

https://www.4byte.directory/signatures/?bytes4_signature=0x1003e2d2

You will get the result as:

add(uint256)

Amazingly, that is the original function definition in our demo1.sol program. So by using this service we can recover most function signatures, as long as they have been collected before.

So far we have seen how public functions are accessed by external callers. Now let’s go further into the specific functions. Address 0x4D is the entrance of the add() function:

004D 5B JUMPDEST
004E 34 CALLVALUE
004F 80 DUP1
0050 15 ISZERO
0051 60 PUSH1 0x58
0053 57 *JUMPI
0054 60 PUSH1 0x00
0056 80 DUP1
0057 FD *REVERT
0058 5B JUMPDEST
0059 50 POP
005A 60 PUSH1 0x62
005C 60 PUSH1 0x04
005E 35 CALLDATALOAD
005F 60 PUSH1 0x86
0061 56 *JUMP

At the entrance of the function add(), there is an opcode JUMPDEST. This is a special opcode which only marks an address that can be jumped to. It performs no operation of its own, but JUMP and JUMPI may only target an address marked this way. You will also see that it helps to determine the control flow graph (CFG) of the bytecode. We will discuss it in a future section.

The instructions at 0x4E-0x57 have been seen in the previous section. They are injected by the compiler for a non-payable function and revert the call if any value is sent with it. After this validation code, the PUSH1 opcode at address 0x5A is easy to overlook. However, this PUSH1 is very important for the code to jump back later. For now, let’s just remember that the address 0x62 is pushed onto the stack. Then, CALLDATALOAD(0x04) is called to load the parameter from the data payload, which is located at offset 0x04. After getting the parameter, the code jumps to 0x86 for execution:

0086 5B JUMPDEST
0087 60 PUSH1 0x00
0089 80 DUP1
008A 54 SLOAD
008B 82 DUP3
008C 01 ADD
008D 90 SWAP1
008E 81 DUP2
008F 90 SWAP1
0090 55 SSTORE
0091 91 SWAP2
0092 90 SWAP1
0093 50 POP
0094 56 *JUMP

In the code snippet above, the value at 0x0 in the storage is loaded using SLOAD(0x0). Then, the parameter loaded from the data payload is added to this value, and the result is saved back to the same location 0x0 in storage. This is the code we put inside the add() function:

balance = balance + value;

Since we only used one integer variable, balance, inside the smart contract, the compiler gives the offset 0x0 to this variable. So any read or write operation on the balance variable goes to offset 0x0 in the storage. At the end of the code snippet, a JUMP is used to jump back to 0x62. You may recall where this value 0x62 was pushed onto the stack. This operation might remind you of something on the x86 architecture. Yes, it is the equivalent of call and ret for function calls. Since the EVM doesn’t support internal function calls at the bytecode level, it can only use PUSH and JUMP opcodes for them. This makes it much harder to rebuild the CFG from bytecode.

Let’s jump back to address 0x62 to see what happens next:

0062 5B JUMPDEST
0063 60 PUSH1 0x40
0065 80 DUP1
0066 51 MLOAD
0067 91 SWAP2
0068 82 DUP3
0069 52 MSTORE
006A 51 MLOAD
006B 90 SWAP1
006C 81 DUP2
006D 90 SWAP1
006E 03 SUB
006F 60 PUSH1 0x20
0071 01 ADD
0072 90 SWAP1
0073 F3 *RETURN

This code snippet makes several adjustments to the items on the stack with the opcodes DUP and SWAP. The meaning of the instructions is not obvious. I will leave it to you to simulate the stack during execution. The equivalent assembler code of the snippet above is as follows:

mstore(mload(0x40), value);
return(mload(0x40), 0x20);

The value in the code is the result of the previous add operation, which is the new balance value. We have now gone through all the bytecode inside the add() function.

Now let’s go back to the function dispatch code snippet for the second function HASH 0xb69ef8a8. The entrance for that function is at 0x74:

0074 5B JUMPDEST
0075 34 CALLVALUE
0076 80 DUP1
0077 15 ISZERO
0078 60 PUSH1 0x7f
007A 57 *JUMPI
007B 60 PUSH1 0x00
007D 80 DUP1
007E FD *REVERT
007F 5B JUMPDEST
0080 50 POP
0081 60 PUSH1 0x62
0083 60 PUSH1 0x95
0085 56 *JUMP
...
0095 5B JUMPDEST
0096 60 PUSH1 0x00
0098 54 SLOAD
0099 81 DUP2
009A 56 *JUMP

We can see the first part of the code is really similar to the previous function, except that no parameter is loaded. Then the function calls the code snippet at 0x95. The instructions at 0x95-0x9A just load the value at 0x0 in the storage and jump back to 0x62. Note that the code snippet at 0x62 is reused by both functions. That is because both functions return the storage variable balance.

You may wonder why the function HASH 0xb69ef8a8 appears in the function dispatch code. Isn’t there only one function, add(), in the smart contract? If you look up that HASH in the 4byte database, you will get balance(). Apparently, the compiler exposes the public storage variable as a public function without parameters.
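The walk-through of both functions can be replayed with a tiny interpreter. The Python sketch below is a toy, not a real EVM: it implements only the opcodes that appear in this runtime bytecode, ignores gas, and models storage as a dictionary. The names run and encode are made up for this example, and the bytecode is the listing from this section with the trailing metadata left out.

```python
MASK = (1 << 256) - 1

# Runtime bytecode of Demo1 (0x00-0x9B), without the trailing metadata.
RUNTIME = bytes.fromhex(
    "60806040526004361060485763ffffffff7c01" + "00" * 28 +
    "6000350416631003e2d28114604d578063b69ef8a8146074575b600080fd"
    "5b348015605857600080fd5b5060626004356086565b"
    "60408051918252519081900360200190f3"
    "5b348015607f57600080fd5b5060626095565b"
    "60008054820190819055919050565b600054815600"
)

# Two-operand opcodes: `a` is the top of the stack, `b` the item below it.
BINOPS = {
    0x01: lambda a, b: (a + b) & MASK,      # ADD
    0x03: lambda a, b: (a - b) & MASK,      # SUB
    0x04: lambda a, b: a // b if b else 0,  # DIV
    0x10: lambda a, b: int(a < b),          # LT
    0x14: lambda a, b: int(a == b),         # EQ
    0x16: lambda a, b: a & b,               # AND
}


def run(code, calldata, storage, callvalue=0):
    """Execute `code` and return (success, return_data)."""
    stack, mem, pc = [], bytearray(), 0

    def grow(end):
        if len(mem) < end:
            mem.extend(bytes(end - len(mem)))

    while True:
        op = code[pc]
        pc += 1
        if op in BINOPS:
            a, b = stack.pop(), stack.pop()
            stack.append(BINOPS[op](a, b))
        elif 0x60 <= op <= 0x7F:  # PUSH1..PUSH32
            n = op - 0x5F
            stack.append(int.from_bytes(code[pc:pc + n], "big"))
            pc += n
        elif 0x80 <= op <= 0x8F:  # DUP1..DUP16
            stack.append(stack[-(op - 0x7F)])
        elif 0x90 <= op <= 0x9F:  # SWAP1..SWAP16
            n = op - 0x8F
            stack[-1], stack[-1 - n] = stack[-1 - n], stack[-1]
        elif op == 0x15:  # ISZERO
            stack.append(int(stack.pop() == 0))
        elif op == 0x34:  # CALLVALUE
            stack.append(callvalue)
        elif op == 0x35:  # CALLDATALOAD
            off = stack.pop()
            stack.append(int.from_bytes(calldata[off:off + 32].ljust(32, b"\x00"), "big"))
        elif op == 0x36:  # CALLDATASIZE
            stack.append(len(calldata))
        elif op == 0x50:  # POP
            stack.pop()
        elif op == 0x51:  # MLOAD
            off = stack.pop()
            grow(off + 32)
            stack.append(int.from_bytes(mem[off:off + 32], "big"))
        elif op == 0x52:  # MSTORE
            off, val = stack.pop(), stack.pop()
            grow(off + 32)
            mem[off:off + 32] = val.to_bytes(32, "big")
        elif op == 0x54:  # SLOAD
            stack.append(storage.get(stack.pop(), 0))
        elif op == 0x55:  # SSTORE
            key, val = stack.pop(), stack.pop()
            storage[key] = val
        elif op in (0x56, 0x57):  # JUMP / JUMPI
            dest = stack.pop()
            cond = stack.pop() if op == 0x57 else 1
            if cond:
                if code[dest] != 0x5B:
                    raise ValueError("jump target is not a JUMPDEST")
                pc = dest
        elif op == 0x5B:  # JUMPDEST: a marker only
            pass
        elif op == 0xF3:  # RETURN
            off, size = stack.pop(), stack.pop()
            grow(off + size)
            return True, bytes(mem[off:off + size])
        elif op == 0xFD:  # REVERT
            return False, b""
        elif op == 0x00:  # STOP
            return True, b""
        else:
            raise NotImplementedError(hex(op))


def encode(selector_hex, *args):
    """Build a data payload: 4-byte selector followed by 32-byte words."""
    return bytes.fromhex(selector_hex) + b"".join(a.to_bytes(32, "big") for a in args)


storage = {}
print(run(RUNTIME, encode("1003e2d2", 5), storage))  # add(5)
print(run(RUNTIME, encode("b69ef8a8"), storage))     # balance()
```

Adding a print of the stack inside the loop is an easy way to follow the DUP and SWAP shuffling at 0x62 and to see the return address 0x62 being consumed by the JUMP at 0x94.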

To summarize, in this section we have discussed the whole structure of the runtime part of the bytecode: how functions are accessed by external callers and how parameters are passed. But in this demo example we only used a single integer storage variable. What will it look like for mappings or variable-length arrays? How are string parameters presented in the data payload? We will talk about all of these in the next section.

Understand EVM bytecode – Part 3

Trustlook is one of the best Anti-Virus engine providers

A Silicon Valley researcher recently released a report about the Malware scanning capabilities of global cybersecurity providers. Trustlook, one of the providers included in this report, is specifically identified as one of the top tier performers.

The researcher actively conducted a 14-day comparison of available cybersecurity providers on the market in preparation of selecting a new vendor for his firm. The following is a detailing of the survey methodology:

First, the researcher built a testing dataset, mostly sourced from the VirusTotal (VT) database. “Our company has some benign APK samples, but no malicious samples,” the researcher said. “We also select some malicious APK samples and benign APK samples using a very conservative labeling policy from VT’s live feed samples.”

VT, which was founded in 2004 and acquired by Google in 2012, is well known in the cybersecurity industry. The website hosts many antivirus scanners from global cybersecurity companies. Users can upload different types of files to VT, and it returns scan results from all of the hosted vendors. All of the scan reports are shared with the public VT community; importantly, these test results can be verified by querying the testing samples on VT. In addition, VT provides an API to get a live feed of the latest samples submitted to VT, which is the data source of this research. That is how the researcher obtained his research data and subjects.

Then, the researcher fed the dataset to malware scanners from 66 vendors hosted on VT for a period of 14 days, from 11/01 to 11/14.

In the final accounting, Trustlook ranked second among the vendors, as shown in the following table.

The primary metric used to compare malware detection capabilities is the True Positive Rate (TPR). This represents the rate at which the scanner successfully detected known malware samples in the provided dataset. Trustlook’s scanner finished at number two with an admirable TPR of 98.33%.

“In the table, we sorted the vendors according to their TPR which represents their malware detection capabilities. In total, four vendors have achieved more than 95% detection rate which are ESET-NOD32 from Slovakia, Trustlook from United States, AhnLab-V3 from Korea, and K7GW from India. Among them, Trustlook has the lowest FPR of 0.12%. Among the eight vendors with over 90% TPR, Fortinet has the lowest FPR which is close to 0,” the report said.

Significantly, Trustlook’s performance also surpasses globally renowned vendors such as Avast, BitDefender, McAfee, and Symantec.

As a cybersecurity startup Trustlook has always focused heavily on cutting-edge technology development, and we are excited to see our capabilities confirmed by third-party research. Our promise is that we will continue to build powerful and reliable products for individual and corporate consumers in the future.