estools Javascript

0xCC 2015/08/07 10:26

0x00


JavaScript is a scripting language that runs on the client, and its source code is fully visible to users. Not every JS developer wants their code to be read directly, malware authors least of all. To make analysis harder, obfuscation tools are applied to many kinds of malicious code (0-day drive-by exploits, cross-site scripting attacks, and so on). To lift the veil on such malware, an analyst must first deobfuscate the script.

This article introduces some common obfuscation methods and then shows how to do static code analysis with estools.

0x01 Common obfuscation methods


Encryption

The key idea of this kind of obfuscation is to encode the code that is to be executed, restore it at run time to a valid script the browser can execute, and then run it. It resembles the packing of executable files. JavaScript can evaluate strings as code: a string can be handed to the JS engine for parsing and execution through the Function constructor, eval, setTimeout, and setInterval. The most common variant is base62 encoding; its most obvious feature is that the generated code starts with eval(function(p,a,c,k,e,r)).

No matter how the code is transformed, it eventually calls eval or one of the other functions. There is no need to analyze the decoding algorithm: simply find the final call and change it to console.log or another output method, and the decoded result is printed as a string when the program runs. Many articles already cover automating this, so I won't go into detail here.
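The "replace the final call" idea can be sketched as follows. This is a minimal illustration, not the method of any particular tool: the packed sample and the helper name captureEval are made up for the example, and because the sample's decoder still runs in the host process, real malware should only be handled this way inside a disposable environment.

```javascript
// Hypothetical packed sample: the real payload only exists as an encoded
// array until the final eval call.
const packed = `
  var encoded = [97, 108, 101, 114, 116, 40, 49, 41];
  eval(String.fromCharCode.apply(null, encoded));
`;

// Run the sample with `eval` shadowed by a function that records its
// argument instead of executing it. A parameter named `eval` is legal here
// because code created with the Function constructor is non-strict.
function captureEval(source) {
  const captured = [];
  const fakeEval = function (code) { captured.push(String(code)); };
  new Function('eval', source)(fakeEval);
  return captured;
}

console.log(captureEval(packed)); // [ 'alert(1)' ]
```

The same shadowing trick applies to setTimeout, setInterval, and the Function constructor when a sample uses those instead of eval.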

Steganography

Strictly speaking, this cannot be called obfuscation; it merely hides the JS code in a particular carrier. Examples include embedding it in the RGB channels of an image with the least significant bit (LSB) algorithm, hiding it in an image's EXIF metadata, or hiding it in whitespace characters in HTML.

Take, for example, the sensational talk [A picture hacks you: Embedding a malicious program in a picture]: according to its published slides, it uses the least significant bit plane algorithm. Combined with HTML5 canvas or typed arrays (TypedArray) for processing binary data, a script can extract the hidden data (such as code) from the carrier.

Steganography also needs a decoding routine and dynamic execution, so it is defeated the same way as the previous method: hijack the key function calls in the browser context and replace them with text output to obtain the code hidden in the carrier.
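To make the LSB idea concrete, here is a small sketch that hides a string in the lowest bit of the R, G and B bytes of an RGBA buffer and reads it back. The bit order, the NUL terminator, and the function names are assumptions chosen for the example; real samples use their own layouts, and in a browser the buffer would come from canvas getImageData rather than a hand-made array.

```javascript
// Indexes of the R, G and B bytes; every 4th byte (alpha) is skipped.
function channelIndexes(length) {
  const out = [];
  for (let i = 0; i < length; i++) if (i % 4 !== 3) out.push(i);
  return out;
}

// Write each bit of `text` (ASCII, most significant bit first) into the
// lowest bit of successive colour bytes, followed by a NUL terminator.
function embedLsb(pixels, text) {
  const slots = channelIndexes(pixels.length);
  const bytes = [...text].map(c => c.charCodeAt(0)).concat(0);
  bytes.forEach((byte, n) => {
    for (let bit = 0; bit < 8; bit++) {
      const slot = slots[n * 8 + bit];
      pixels[slot] = (pixels[slot] & 0xfe) | ((byte >> (7 - bit)) & 1);
    }
  });
  return pixels;
}

// Collect the lowest bits again and rebuild the bytes until the terminator.
function extractLsb(pixels) {
  const slots = channelIndexes(pixels.length);
  let text = '';
  for (let n = 0; (n + 1) * 8 <= slots.length; n++) {
    let byte = 0;
    for (let bit = 0; bit < 8; bit++) {
      byte = (byte << 1) | (pixels[slots[n * 8 + bit]] & 1);
    }
    if (byte === 0) break; // terminator reached
    text += String.fromCharCode(byte);
  }
  return text;
}

// Stand-in for ImageData.data: 64 grey pixels.
const image = new Uint8ClampedArray(64 * 4).fill(200);
embedLsb(image, 'alert(1)');
console.log(extractLsb(image)); // alert(1)
```

Each colour byte changes by at most 1, which is why the carrier image looks unchanged to the eye.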

Complicated expression

Obfuscated code does not necessarily call eval. Complexity can also be increased, and readability greatly reduced, by padding the code with instructions that have no effect. JavaScript has many maddening features, and in combination they can make originally simple literals (Literal), member accesses (MemberExpression), function calls (CallExpression), and other fragments hard to read.

Literals in JS include strings, numbers, and regular expressions.

Here is a brief example.

  • There are two ways to access a member of an object: the dot operator and the subscript operator. A call to window's eval method can be written either as window.eval() or as window['eval']();

  • To make the code more convoluted, the obfuscator picks the second form and then goes to work on the string literal. First, split the string into several parts: 'e' + 'v' + 'al';

  • That is still easy to read, so next apply a number-to-string conversion trick: 14..toString(15) + 31..toString(32) + 0xf1.toString(22);

  • Going all the way, expand the numbers as well: (0b1110).toString(4<<2) + (' '.charCodeAt() - 1).toString(Math.log(0x100000000)/Math.log(2)) + 0xf1.toString(11 << 1)

  • The final result: window[(2*7).toString(4<<2) + (' '.charCodeAt() - 1).toString(Math.log(0x100000000)/Math.log(2)) + 0xf1.toString(11 << 1)]('alert(1)')

Many such reversible operations can be found in JS, and combining them through random generation can make a simple expression arbitrarily complicated.
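The first steps of the example above can be checked directly. The snippet below only re-evaluates expressions taken from the list; the variable names are arbitrary.

```javascript
// Each step is a different spelling of the same string.
const step1 = 'e' + 'v' + 'al';
const step2 = 14..toString(15) + 31..toString(32) + 0xf1.toString(22);

// Pieces of the expanded form: 14 in base 16 and 31 in base 32.
const first = (0b1110).toString(4 << 2);
const second = (' '.charCodeAt() - 1).toString(32);
const step3 = first + second + 0xf1.toString(11 << 1);

console.log(step1, step2, step3); // eval eval eval

// The computed string is then used as a subscript on the global object.
console.log(globalThis[step2] === eval); // true
```

Every one of these expressions has a constant result, which is exactly what makes them candidates for static simplification.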

0x02 Static analysis implementation


Parse and transform code

The deobfuscation approach in this article is to simulate code whose result is predictable: write a simple script execution engine that executes only the code fragments matching predefined rules, then replace the original lengthy code with the computed result, thereby simplifying the expressions.

If you have a basic understanding of how a script interpreter works, you know that to "understand" the code the interpreter performs lexical and syntactic analysis on the source, converting the code string into an abstract syntax tree (AST).

Such as this code:

var a = 42; var b = 5; function addA(d) { return a + d; } var c = addA(2) + b;

The corresponding syntax tree is shown in the figure:

(Generated with the JointJS online tool)

Leaving JIT technology aside, the interpreter can start at the root of the syntax tree, traverse all nodes depth-first, and execute them one by one according to the instruction each node represents, until the script ends and returns a result.
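A toy version of such a tree-walking evaluator is sketched below. It assumes ESTree-shaped nodes (the format esprima produces) but builds them by hand so the snippet has no dependencies, and it covers only a reduced form of the sample above without the function call. The builder and function names are made up for the example.

```javascript
// Tiny builders for ESTree-shaped nodes.
const lit = value => ({ type: 'Literal', value });
const id = name => ({ type: 'Identifier', name });
const bin = (operator, left, right) => ({ type: 'BinaryExpression', operator, left, right });
const decl = (name, init) => ({
  type: 'VariableDeclaration',
  kind: 'var',
  declarations: [{ type: 'VariableDeclarator', id: id(name), init }],
});

// Hand-built AST for: var a = 42; var b = 5; var c = a + 2 + b;
const program = {
  type: 'Program',
  body: [
    decl('a', lit(42)),
    decl('b', lit(5)),
    decl('c', bin('+', bin('+', id('a'), lit(2)), id('b'))),
  ],
};

// Depth-first evaluation: each node type maps to one small action.
function evaluate(node, scope) {
  switch (node.type) {
    case 'Program':
      node.body.forEach(stmt => evaluate(stmt, scope));
      return scope;
    case 'VariableDeclaration':
      node.declarations.forEach(d => {
        scope[d.id.name] = d.init ? evaluate(d.init, scope) : undefined;
      });
      return undefined;
    case 'BinaryExpression': {
      const left = evaluate(node.left, scope);
      const right = evaluate(node.right, scope);
      if (node.operator === '+') return left + right;
      if (node.operator === '-') return left - right;
      if (node.operator === '*') return left * right;
      if (node.operator === '/') return left / right;
      throw new Error('unsupported operator ' + node.operator);
    }
    case 'Identifier':
      if (!(node.name in scope)) throw new Error(node.name + ' is not defined');
      return scope[node.name];
    case 'Literal':
      return node.value;
    default:
      throw new Error('unsupported node type ' + node.type);
  }
}

const result = evaluate(program, Object.create(null));
console.log(result.c); // 49
```

A deobfuscator does not need the whole language: it only has to evaluate the node types its rules accept and leave everything else untouched.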

There are many tools for generating an abstract syntax tree from JS code, such as the parser bundled with the UglifyJS compressor, and esprima, which is used in this article.

The interface provided by esprima is simple:

  var ast = require('esprima').parse(code)
 

In addition, Esprima provides an online tool that parses any (valid) JavaScript code into an AST and displays it: http://esprima.org/demo/parse.html

Combined with several auxiliary libraries from estools, static analysis of JS code becomes possible:

  • escope: JavaScript scope analysis tool

  • esutil: helper library that checks whether a syntax tree node meets certain conditions

  • estraverse: syntax tree traversal helper whose interface is somewhat like SAX parsing of XML

  • esrecurse: another syntax tree traversal tool, based on recursion

  • esquery: uses CSS selector syntax to extract matching nodes from the syntax tree

  • escodegen: the inverse of esprima; it turns a syntax tree back into code

The traversal tool used in this project is estraverse. It provides two static methods, estraverse.traverse and estraverse.replace. The former simply walks the AST nodes, and the visitor's return value controls whether traversal continues down toward the leaf nodes; the replace method can modify the AST directly during traversal, which is how code is rewritten. For specific usage, refer to its official documentation or the sample code accompanying this article.
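The replace style of traversal can be illustrated without any dependency. The replace helper below is a hand-rolled stand-in, not estraverse itself; it only borrows the convention that a visitor returns a new node to substitute or nothing to keep the current one. The dotify rule, which turns window['eval'] back into window.eval, is a made-up example of a rewrite.

```javascript
// Post-order rewrite: children first, then the node itself. `leave` returns
// a replacement node, or undefined to keep the node as it is.
function replace(node, leave) {
  const isNode = value => Boolean(value) && typeof value.type === 'string';
  for (const key of Object.keys(node)) {
    const child = node[key];
    if (Array.isArray(child)) {
      node[key] = child.map(c => (isNode(c) ? replace(c, leave) : c));
    } else if (isNode(child)) {
      node[key] = replace(child, leave);
    }
  }
  const result = leave(node);
  return result === undefined ? node : result;
}

// Example rule: obj['name'] becomes obj.name when the string is a plain
// identifier.
const IDENTIFIER = /^[A-Za-z_$][A-Za-z0-9_$]*$/;
function dotify(node) {
  if (node.type === 'MemberExpression' && node.computed &&
      node.property.type === 'Literal' &&
      typeof node.property.value === 'string' &&
      IDENTIFIER.test(node.property.value)) {
    return {
      type: 'MemberExpression',
      computed: false,
      object: node.object,
      property: { type: 'Identifier', name: node.property.value },
    };
  }
  return undefined;
}

// Hand-built AST for: window['eval']
const ast = {
  type: 'MemberExpression',
  computed: true,
  object: { type: 'Identifier', name: 'window' },
  property: { type: 'Literal', value: 'eval' },
};

const out = replace(ast, dotify);
console.log(out.computed, out.property.name); // false eval
```

Working bottom-up matters here: once inner expressions have been simplified to literals, rules on the enclosing node get a chance to match.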

Rule design

Let's start with code actually encountered in the wild. Recently, while studying some XSS worms, I came across obfuscated code similar to the following:

Looking at its code style, I found that this obfuscator does several things:

  • String literal obfuscation: all strings are first extracted into a string array in the global scope and their characters escaped to make reading harder; each place where a string appeared is then replaced with a reference to an array element

  • Variable name obfuscation: unlike the shortened names a compressor produces, names here are an underscore plus a number, so variables are hard to tell apart and harder to read than single letters

  • Member operator obfuscation: the dot operator is replaced with the string subscript form, and the string is then obfuscated

  • Removal of extra whitespace: this reduces file size and is something every compressor does

A search suggests that such code was most likely generated by the free version of javascriptobfuscator.com. The three options available in the free version (Encode Strings / Move Strings / Replace Names) also match what was observed above.

Of these transformations, variable renaming is irreversible. A tool that can name variables intelligently would help; the JSNice website, for example, offers an online tool that infers what each variable does and renames it automatically. Where that falls short, fall back on manual work: use the refactoring features of an IDE (such as WebStorm), combined with analysis of the code's behavior, to rename variables by hand.

Now consider string handling. Since the strings are extracted into a global array, the syntax tree shows a recognizable pattern: in the global scope there is a VariableDeclarator whose init property is an ArrayExpression, and all of its elements are Literal nodes, which means every element of the array is a constant. Simply evaluate it and associate it with the variable name (identifier). Note that, to keep processing simple, the scope chain of variable names is not considered here. In JS, name resolution follows the scope chain; for example, a global variable name can be shadowed by a local one. If the obfuscator is more aggressive and reuses the same variable name in different scopes, a deobfuscator that ignores scope will produce incorrect code.
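The pattern described above can be sketched in two passes: collect the constant arrays, then inline indexed references to them. As in the text, scope is ignored. The AST is built by hand in ESTree shape, and the identifier _0xa1 and the function names are invented for the example.

```javascript
const id = name => ({ type: 'Identifier', name });
const lit = value => ({ type: 'Literal', value });
const member = (object, property) =>
  ({ type: 'MemberExpression', computed: true, object, property });

// Hand-built AST for:
//   var _0xa1 = ['log', 'hi'];
//   console[_0xa1[0]](_0xa1[1]);
const program = {
  type: 'Program',
  body: [
    { type: 'VariableDeclaration', kind: 'var', declarations: [{
      type: 'VariableDeclarator',
      id: id('_0xa1'),
      init: { type: 'ArrayExpression', elements: [lit('log'), lit('hi')] },
    }] },
    { type: 'ExpressionStatement', expression: {
      type: 'CallExpression',
      callee: member(id('console'), member(id('_0xa1'), lit(0))),
      arguments: [member(id('_0xa1'), lit(1))],
    } },
  ],
};

// Pass 1: top-level `var x = [literal, ...]` declarations become a table.
function collectConstantArrays(ast) {
  const table = new Map();
  for (const stmt of ast.body) {
    if (stmt.type !== 'VariableDeclaration') continue;
    for (const d of stmt.declarations) {
      const init = d.init;
      if (init && init.type === 'ArrayExpression' &&
          init.elements.every(e => e && e.type === 'Literal')) {
        table.set(d.id.name, init.elements.map(e => e.value));
      }
    }
  }
  return table;
}

// Pass 2: rewrite name[<number>] into the literal it refers to.
function inline(node, table) {
  if (Array.isArray(node)) return node.map(n => inline(n, table));
  if (!node || typeof node.type !== 'string') return node;
  for (const key of Object.keys(node)) node[key] = inline(node[key], table);
  if (node.type === 'MemberExpression' && node.computed &&
      node.object.type === 'Identifier' && table.has(node.object.name) &&
      node.property.type === 'Literal' &&
      typeof node.property.value === 'number') {
    const values = table.get(node.object.name);
    if (node.property.value in values) return lit(values[node.property.value]);
  }
  return node;
}

const call = inline(program, collectConstantArrays(program)).body[1].expression;
console.log(call.callee.property.value, call.arguments[0].value); // log hi
```

After this pass the call reads console['log']('hi'); the array declaration itself is left in place, since removing it safely would require the scope analysis that this sketch skips.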

In the test program, I set the following replacement rules:

  • String arrays declared as global variables: references to an element by numeric subscript are replaced directly with its value

  • Sequences of binary operations with a definite result, such as 1 * 2 + 3/4 - 6 % 5

  • The source property of a regular expression literal and the length property of a string literal

  • For arrays made up entirely of string constants, the return value of methods such as join/reverse/slice

  • For string constants, the return value of methods such as substr/charAt

  • Calls to global functions such as decodeURIComponent whose arguments are all constants are replaced with their return values

  • Calls to mathematical functions whose result is a constant, such as Math.sin(3.14)
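Several of these rules amount to constant folding, which can be sketched as a bottom-up pass over the tree. The whitelists below are illustrative assumptions rather than the demo project's actual lists, the ASTs are built by hand in ESTree shape, and a real tool would also need to guard against calls that throw (for example decodeURIComponent on malformed input).

```javascript
const lit = value => ({ type: 'Literal', value });
const isLit = node => Boolean(node) && node.type === 'Literal';

// Illustrative whitelists; extend them with care.
const BINARY = new Map([
  ['+', (a, b) => a + b], ['-', (a, b) => a - b], ['*', (a, b) => a * b],
  ['/', (a, b) => a / b], ['%', (a, b) => a % b],
]);
const GLOBAL_FUNCTIONS = new Map([['decodeURIComponent', decodeURIComponent]]);
const STRING_METHODS = new Set(['substr', 'charAt']);
const MATH_METHODS = new Set(['sin', 'cos', 'floor', 'abs']);

// Returns a Literal for a whitelisted call with constant arguments, else null.
function foldCall(callee, args) {
  if (callee.type === 'Identifier' && GLOBAL_FUNCTIONS.has(callee.name)) {
    return lit(GLOBAL_FUNCTIONS.get(callee.name)(...args));
  }
  if (callee.type !== 'MemberExpression' || callee.computed) return null;
  const { object, property } = callee;
  if (isLit(object) && typeof object.value === 'string' &&
      STRING_METHODS.has(property.name)) {
    return lit(object.value[property.name](...args));
  }
  if (object.type === 'Identifier' && object.name === 'Math' &&
      MATH_METHODS.has(property.name)) {
    return lit(Math[property.name](...args));
  }
  return null;
}

// Bottom-up pass: fold the children first, then the node itself.
function fold(node) {
  if (Array.isArray(node)) return node.map(fold);
  if (!node || typeof node.type !== 'string') return node;
  for (const key of Object.keys(node)) node[key] = fold(node[key]);
  if (node.type === 'BinaryExpression' && BINARY.has(node.operator) &&
      isLit(node.left) && isLit(node.right)) {
    return lit(BINARY.get(node.operator)(node.left.value, node.right.value));
  }
  if (node.type === 'CallExpression' && node.arguments.every(isLit)) {
    return foldCall(node.callee, node.arguments.map(a => a.value)) || node;
  }
  return node;
}

const bin = (operator, left, right) =>
  ({ type: 'BinaryExpression', operator, left, right });

// Hand-built AST for: 1 * 2 + 3 / 4 - 6 % 5
const arithmetic = bin('-',
  bin('+', bin('*', lit(1), lit(2)), bin('/', lit(3), lit(4))),
  bin('%', lit(6), lit(5)));

// Hand-built AST for: 'abcdef'.substr(1, 3)
const substrCall = {
  type: 'CallExpression',
  callee: {
    type: 'MemberExpression',
    computed: false,
    object: lit('abcdef'),
    property: { type: 'Identifier', name: 'substr' },
  },
  arguments: [lit(1), lit(3)],
};

console.log(fold(arithmetic).value, fold(substrCall).value); // 1.75 bcd
```

Anything that is not on a whitelist is returned unchanged, so the pass can be run repeatedly together with other rules until the tree stops changing.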

As for restoring indentation, that is a feature of escodegen: calling escodegen.generate with the default configuration (that is, omitting the second argument) produces formatted code.

DEMO program

The prototype of this deobfuscator is on GitHub: github.com/ChiChou/eta...

Refer to the repository's README for the runtime environment and usage.

I took a piece of code from YOU MIGHT NOT NEED JQUERY and ran it through javascriptobfuscator.com to test the obfuscation:

Deobfuscating the result (github.com/ChiChou/eta...) gives the following:

Although the readability of variable names is still poor, the behavior of the code can be seen in general.

The demo program currently has many limitations. It can only be regarded as a semi-automatic aid, and many features are not yet implemented.

Some obfuscators protect string literals in more complex ways, converting each string to the form f(x), where f is a decryption function and the argument x is the ciphertext. Others generate an anonymous function in place whose return value is the string. Function expressions used this way are usually context-independent: the return value depends only on the function's input and not on the surrounding code (such as class members or values read from the DOM). The xor function in the following snippet is an example:

  var xor = function(str, a, b) {
    return String.fromCharCode.apply(null, str.split('').map(function(c, i) {
      var ascii = c.charCodeAt(0);
      return ascii ^ (i % 2 ? a : b);
    }));
  };
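As a quick check of that property, the snippet below repeats the xor helper so it runs on its own, then encodes and decodes a string with arbitrarily chosen keys 1 and 2. Because the result depends only on the arguments, a call such as xor('gwcm', 1, 2) could be replaced statically with the string it returns.

```javascript
// The xor helper from the article, repeated so the snippet is self-contained.
const xor = function (str, a, b) {
  return String.fromCharCode.apply(null, str.split('').map(function (c, i) {
    var ascii = c.charCodeAt(0);
    return ascii ^ (i % 2 ? a : b);
  }));
};

// What an obfuscator would ship instead of the plain string.
const cipher = xor('eval', 1, 2);
console.log(cipher); // gwcm

// XOR with the same keys is its own inverse.
console.log(xor(cipher, 1, 2)); // eval
```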

How can we tell whether a function has this property? First, some built-in functions clearly qualify, such as btoa, escape, and String.fromCharCode: as long as the input is constant, the return value is fixed. Build a whitelist of such built-in functions and then traverse the AST of the function expression. If none of the values used in the function's computation comes from the external context, and every CallExpression callee in it is on the whitelist, the function qualifies; applying the check recursively confirms whether a function that calls other functions meets the condition.
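A conservative version of that check might look like the sketch below. The whitelists, the function name isContextFree, and the decision to treat all nested function parameters as one flat set of locals are simplifying assumptions for the example; the check rejects anything it does not recognize, so it can miss pure functions but should not accept ones that read outside state.

```javascript
// Illustrative whitelists: free names that may be referenced or called, and
// method names that may be invoked on values already known to be local.
const PURE_GLOBALS = new Set(['btoa', 'escape', 'String', 'Math']);
const PURE_METHODS = new Set(['fromCharCode', 'charCodeAt', 'split', 'map', 'join']);

function isContextFree(fn) {
  const locals = new Set(fn.params.map(p => p.name));
  const check = node => {
    if (Array.isArray(node)) return node.every(check);
    if (!node || typeof node.type !== 'string') return true;
    switch (node.type) {
      case 'ThisExpression':
        return false;
      case 'Identifier':
        return locals.has(node.name) || PURE_GLOBALS.has(node.name);
      case 'VariableDeclarator':
        locals.add(node.id.name);
        return check(node.init);
      case 'FunctionExpression':
        // Simplification: nested parameters join the same set of locals.
        node.params.forEach(p => locals.add(p.name));
        return check(node.body);
      case 'MemberExpression':
        // A non-computed property is a name, not a variable reference.
        return check(node.object) && (node.computed ? check(node.property) : true);
      case 'CallExpression': {
        const callee = node.callee;
        let calleeOk = false;
        if (callee.type === 'Identifier') {
          calleeOk = PURE_GLOBALS.has(callee.name);
        } else if (callee.type === 'MemberExpression' && !callee.computed) {
          calleeOk = PURE_METHODS.has(callee.property.name) && check(callee.object);
        }
        return calleeOk && check(node.arguments);
      }
      default:
        return Object.keys(node).every(key => check(node[key]));
    }
  };
  return check(fn.body);
}

const id = name => ({ type: 'Identifier', name });
const fnReturning = argument => ({
  type: 'FunctionExpression',
  params: [id('s')],
  body: { type: 'BlockStatement', body: [{ type: 'ReturnStatement', argument }] },
});

// function (s) { return btoa(s); }
const pure = fnReturning({ type: 'CallExpression', callee: id('btoa'), arguments: [id('s')] });

// function (s) { return document.title + s; }
const impure = fnReturning({
  type: 'BinaryExpression',
  operator: '+',
  left: { type: 'MemberExpression', computed: false, object: id('document'), property: id('title') },
  right: id('s'),
});

console.log(isContextFree(pure), isContextFree(impure)); // true false
```

Once a function passes the check, calls to it with constant arguments can be evaluated in a sandbox and replaced with the resulting literal.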

Other obfuscators create a large number of references to the same variable, that is, multiple aliases for one object, which makes reading very tedious. The escope tool can be used to analyze the data flow of each identifier and replace it with the value it actually points to. Mathematical identities are also used for obfuscation. For example, given a declared variable a, if a is a finite Number the expressions a - a and a * 0 both evaluate to 0. But if a is NaN or an infinity, or any other value for which a - a is not a number, the expressions return NaN. Cleaning up this kind of code also requires data flow analysis.
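The identity example can be checked directly. The helper name canFoldToZero is invented for the illustration; it expresses the condition under which replacing a - a or a * 0 with 0 is certainly safe, which is the fact a data flow analysis would have to establish about a.

```javascript
// Folding `a - a` or `a * 0` to 0 is certainly safe when `a` is known to be
// a finite number; this sketch makes no claim about other values.
function canFoldToZero(a) {
  return typeof a === 'number' && Number.isFinite(a);
}

const samples = [7, -2.5, NaN, Infinity, undefined, 'x'];
for (const a of samples) {
  console.log(String(a), canFoldToZero(a), a - a, a * 0);
}
```

For 7 and -2.5 both expressions are zero (a * 0 gives negative zero for -2.5, which still compares equal to 0); for every other sample both are NaN.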

So far I have not seen JS obfuscation samples implemented with control-flow flattening. I think this is related to the usage scenarios and characteristics of the language itself: JS code is generally business-oriented and contains little complicated control flow or algorithms, so the obfuscation effect might not be ideal.

0x03 Concluding remarks


JavaScript is indeed a magical language, and you often run into surprising tricks. Unpicking protected code is interesting too. Several major technology companies are reportedly planning a common bytecode standard for browser applications, WebAssembly. Once it is realized, code protection will be able to introduce real "packing" or virtual-machine protection, and countermeasure techniques will have to rise to a new level.

The demo project code is hosted on GitHub: github.com/ChiChou/eta...

0x04 References


  1. tobyho.com/2013/12/02/
  2. github.com/estree/estr...
  3. developer.mozilla.org/en-US/docs/...
  4. jointjs.com/demos/javas...