SimuVEX and Bare-Bones Symbolic Execution

Most analyses require an understanding of what the code is doing (semantic meaning), not just what the code is (syntactic meaning). For this, we developed a module called SimuVEX (https://github.com/angr/simuvex). SimuVEX provides a semantic understanding of what a given piece of VEX code does on a given machine state.

In a nutshell, SimuVEX is a symbolic VEX emulator. Given a machine state and a VEX IR block, SimuVEX provides a resulting machine state (or, in the case of condition jumps, several resulting machine states).

Semantic Translation

SimuVEX's ultimate goal is to provide a semantic meaning to blocks of binary code.

While the IR gives us syntactic meaning (i.e., what statements the block contains), SimuVEX can provide us semantic meaning (i.e., what the block does to a given state). We'll now move on to how to do that.

Accessing SimIRSBs

We get semantic meaning by converting an IRSB into a SimIRSB. While the former focuses on providing a cross-architecture, programmatically-accessible representation of what a block is, the latter provides a cross-platform, programmatically-accessible representation of what a block did, given an input state.

Note that you will almost never need to interact with SimIRSBs directly. The higher level angr interfaces complete abstract away simuvex's pedantic execution model, but it's important to understand the way that execution happens.

To create a SimIRSB directly, use the sim_block(state) constructor in the project factory. Here's an example.

# This creates a symbolic state with execution starting at 0x400664,
# and then symbolically executes the IRSB at its instruction pointer.
>>> import angr, simuvex
>>> b = angr.Project('examples/fauxware/fauxware')
>>> state = b.factory.blank_state(addr=0x400664)
>>> sirsb = b.factory.sim_block(state)

# this is the address of the first instruction in the block
>>> assert sirsb.addr == 0x400664

What happened?

Symbolic execution happened! sirsb is now a record of what happened during the symbolic execution of this basic block. This information is exposed in the form of SimActions. In case you hadn't noticed, everything that comes out of simuvex has "Sim-" on the front of its name.

An IRSB has Statements, so a SimIRSB has SimStatements. Each statement has a list of SimActions recording each of its reads and writes to and from memory, registers, and temps. A SimAction has an associated type (i.e., "mem" for memory, "reg" for registers, "tmp" for temps), and action ("read", "write").

# Print out all the actions for this block:
>>> for sstmt in sirsb.statements:
...     print '[+] Actions for statement %d' % sstmt.stmt_idx
...     for a in sstmt.actions:
...         if a.type == 'mem':
...             print "Memory write to", a.addr.ast
...             print "... address depends on registers", a.addr.reg_deps, "and temps", a.addr.tmp_deps
...             print "... data is:", a.data.ast
...             print "... data depends on registers", a.data.reg_deps, "and temps", a.data.tmp_deps
...             if a.condition is not None:
...                 print "... condition is:", a.condition.ast
...             if a.fallback is not None:
...                 print "... alternate write in case of condition fail:", a.fallback.ast
...         elif a.type == 'reg':
...             print 'Register write to registerfile offset', a.offset
...             print "... data is:", a.data.ast
...         elif a.type == 'tmp':
...             print 'Tmp write to tmp', a.tmp
...             print "... data is:", a.data.ast

Where are we going?

For each Exit in the IRSB, there will be a new successor SimState! Each successor will be a copy of the original state you started with, plus the modifications described by the SimActions, plus the constraints added to the solver engine by the exit you've taken. For boring jump exits, no constraints are added, but for conditional jumps, the successor that took the jump gets the guard condition of the jump exit, while the default successor gets the negation.

# The list of successors:
>>> for succ in sirsb.all_successors:
...     print succ

# The default successor, i.e. the one that runs off the end of the block:
>>> print sirsb.default_exit

# Any unconstrained successors, that is, successors with symbolic instruction pointers:
>>> for succ in sirsb.unconstrained_successors:
...     print succ

# Any successors whose constraints contain a contradition (not necessarily a complete list):
>>> for succ in sirsb.unsat_successors:
...     print succ

SimProcedures

Fun fact: SimIRSBs aren't the only way to step symbolic execution forward! SimIRSBs inherit the SimRun class, which is a generic class for describing execution steps.

At the moment, the only other important SimRun is the SimProcedure. SimProcedures are, first and foremost, symbolic function summaries: angr handles functions imported into the binary by executing a SimProcedure that symbolically implements the given library function, if one exists. SimProcedures are a generic enough interface to do more than this, though - they can be used to run Python code to mutate a state at any point in execution.

SimProcedures are injected into angr's execution pipeline through an interface called hooking. The full interface is described here, but the most important part is the Project.hook(address, procedure) method. After running this, whenever execution in this project reaches address, instead of running the SimIRSB at that address, we run the SimProcedure specified by the procedure argument.

Project.hook can also take a plain python function as an argument, instead of a SimProcedure class. That function will be automatically wrapped by a SimProcedure and executed (with the current SimState) as its argument.

Nobody actually calls Project.factory.sim_block, like in the example above. Instead, use Project.factory.sim_run, which has the exact same interface but will chose what kind of SimRun to create. Again, this is still a very low-level interface that you normally shouldn't have to touch while writing analyses with angr.

TODO: Programming SimProcedures. Cover all the kinds of control flow, inline calls, etc. If you want to program a SimProcedure now, look at the library of already-written ones.

Putting it all together

Now you have all the components necessary to do the most basic symbolic execution with angr! The below code will execute all possible paths through a binary.

# Note: please don't actually do this. This will eat up all your computer's RAM and you will die.
# Also there's no error handling at all!

states = [b.factory.entry_state()]
while len(states) > 0:
    successors = []
    for state in states:
        successors.extend(b.factory.sim_run(state).all_successors)
    states = successors

Breakpoints!

Like any decent execution engine, SimuVEX supports breakpoints. This is pretty cool! A point is set as follows:

# get our state
>>> s = b.factory.entry_state()

# add a breakpoint. This breakpoint will drop into ipdb right before a memory write happens.
>>> s.inspect.b('mem_write')

# on the other hand, we can have a breakpoint trigger right *after* a memory write happens. On top of that, we
# can have a specific function get run instead of going straight to ipdb.
>>> def debug_func(state):
...     print "State %s is about to do a memory write!"

>>> s.inspect.b('mem_write', when=simuvex.BP_AFTER, action=debug_func)

# or, you can have it drop you in an embedded ipython!
>>> s.inspect.b('mem_write', when=simuvex.BP_AFTER, action='ipython')

There are many other places to break than a memory write. Here is the list. You can break at BP_BEFORE or BP_AFTER for each of these events.

Event type	Event meaning
mem_read	Memory is being read.
mem_write	Memory is being written.
reg_read	A register is being read.
reg_write	A register is being written.
tmp_read	A temp is being read.
tmp_write	A temp is being written.
expr	An expression is being created (i.e., a result of an arithmetic operation or a constant in the IR).
statement	An IR statement is being translated.
instruction	A new (native) instruction is being translated.
irsb	A new basic block is being translated.
constraints	New constraints are being added to the state.
exit	A SimExit is being created from a SimIRSB.
symbolic_variable	A new symbolic variable is being created.
call	A call instruction is hit.
address_concretization	A symbolic memory access is being resolved.

These events expose different attributes:

Event type	Attribute name	Attribute availability	Attribute meaning
mem_read	mem_read_address	BP_BEFORE or BP_AFTER	The address at which memory is being read.
mem_read	mem_read_length	BP_BEFORE or BP_AFTER	The length of the memory read.
mem_read	mem_read_expr	BP_AFTER	The expression at that address.
mem_write	mem_write_address	BP_BEFORE or BP_AFTER	The address at which memory is being written.
mem_write	mem_write_length	BP_BEFORE or BP_AFTER	The length of the memory write.
mem_write	mem_write_expr	BP_BEFORE or BP_AFTER	The expression that is being written.
reg_read	reg_read_offset	BP_BEFORE or BP_AFTER	The offset of the register being read.
reg_read	reg_read_length	BP_BEFORE or BP_AFTER	The length of the register read.
reg_read	reg_read_expr	BP_AFTER	The expression in the register.
reg_write	reg_write_offset	BP_BEFORE or BP_AFTER	The offset of the register being written.
reg_write	reg_write_length	BP_BEFORE or BP_AFTER	The length of the register write.
reg_write	reg_write_expr	BP_BEFORE or BP_AFTER	The expression that is being written.
tmp_read	tmp_read_num	BP_BEFORE or BP_AFTER	The number of the temp being read.
tmp_read	tmp_read_expr	BP_AFTER	The expression of the temp.
tmp_write	tmp_write_num	BP_BEFORE or BP_AFTER	The number of the temp written.
tmp_write	tmp_write_expr	BP_AFTER	The expression written to the temp.
expr	expr	BP_AFTER	The value of the expression.
statement	statement	BP_BEFORE or BP_AFTER	The index of the IR statement (in the IR basic block).
instruction	instruction	BP_BEFORE or BP_AFTER	The address of the native instruction.
irsb	address	BP_BEFORE or BP_AFTER	The address of the basic block.
constraints	added_constrints	BP_BEFORE or BP_AFTER	The list of contraint expressions being added.
call	function_name	BP_BEFORE or BP_AFTER	The name of the function being called.
exit	exit_target	BP_BEFORE or BP_AFTER	The expression representing the target of a SimExit.
exit	exit_guard	BP_BEFORE or BP_AFTER	The expression representing the guard of a SimExit.
exit	jumpkind	BP_BEFORE or BP_AFTER	The expression representing the kind of SimExit.
symbolic_variable	symbolic_name	BP_BEFORE or BP_AFTER	The name of the symbolic variable being created. The solver engine might modify this name (by appending a unique ID and length). Check the symbolic_expr for the final symbolic expression.
symbolic_variable	symbolic_size	BP_BEFORE or BP_AFTER	The size of the symbolic variable being created.
symbolic_variable	symbolic_expr	BP_AFTER	The expression representing the new symbolic variable.
address_concretization	address_concretization_strategy	BP_BEFORE or BP_AFTER	The SimConcretizationStrategy being used to resolve the address. This can be modified by the breakpoint handler to change the strategy that will be applied. If your breakpoint handler sets this to None, this strategy will be skipped.
address_concretization	address_concretization_action	BP_BEFORE or BP_AFTER	The SimAction object being used to record the memory action.
address_concretization	address_concretization_memory	BP_BEFORE or BP_AFTER	The SimMemory object on which the action was taken.
address_concretization	address_concretization_expr	BP_BEFORE or BP_AFTER	The AST representing the memory index being resolved. The breakpoint handler can modify this to affect the address being resolved.
address_concretization	address_concretization_add_constraints	BP_BEFORE or BP_AFTER	Whether or not constraints should/will be added for this read.
address_concretization	address_concretization_result	BP_AFTER	The list of resolved memory addresses (integers). The breakpoint handler can overwrite these to effect a different resolution result.

These attributes can be accessed as members of state.inspect during the appropriate breakpoint callback to access the appropriate values. You can even modify these value to modify further uses of the values!

>>> def track_reads(state):
...     print 'Read', state.inspect.mem_read_expr, 'from', state.inspect.mem_read_address
...
>>> s.inspect.b('mem_read', when=simuvex.BP_AFTER, action=track_reads)

Additionally, each of these properties can be used as a keyword argument to inspect.b to make the breakpoint conditional:

# This will break before a memory write if 0x1000 is a possible value of its target expression
>>> s.inspect.b('mem_write', mem_write_address=0x1000)

# This will break before a memory write if 0x1000 is the *only* value of its target expression
>>> s.inspect.b('mem_write', mem_write_address=0x1000, mem_write_address_unique=True)

# This will break after instruction 0x8000, but only 0x1000 is a possible value of the last expression that was read from memory
>>> s.inspect.b('instruction', when=simuvex.BP_AFTER, instruction=0x8000, mem_read_expr=0x1000)

Cool stuff! In fact, we can even specify a function as a condition:

# this is a complex condition that could do anything! In this case, it makes sure that RAX is 0x41414141 and
# that the basic block starting at 0x8004 was executed sometime in this path's history
>>> def cond(state):
...     return state.any_str(state.regs.rax) == 'AAAA' and 0x8004 in state.inspect.backtrace

>>> s.inspect.b('mem_write', condition=cond)

That is some cool stuff!

Symbolic memory indexing

SimuVEX supports symbolic memory addressing, meaning that offsets into memory may be symbolic. Our implementation of this is inspired by "Mayhem".

This resolution behavior is governed by concretization strategies, which are subclasses of simuvex.concretization_strategies.SimConcretizationStrategy. Concretization strategies for reads are set in state.memory.read_strategies and for writes in state.memory.write_strategies. These strategies are called, in order, until one of them is able to resolve addresses for the symbolic index. By setting your own concretization managers (or through the use of SimInspect address_concretization breakpoints, described above), you can change the way SimuVEX resolves symbolic addresses.

TODO: elaborate

SimuVEX Options

SimuVEX is extremely customizable through the use of state options, a set of constants stored in state.options. These options are documented in the source code.

実行エンジン