
SOFTWARE TESTING

DEPENDENCE AND DATA FLOW MODELS

Objective

- Understand the basics of data-flow models and related concepts (def-use pairs, dominators).
- Understand some analyses that can be performed with the data-flow model of a program: the data flow analyses used to build the models, and analyses that use the data flow models.
- Understand the basic trade-offs in modeling data flow: variations and limitations of data-flow models and analyses, differing in precision and cost.

Data Flow

An abstract representation of the sequence and possible changes of state of data objects, where the state of an object is any of creation, usage, modification, or destruction.

Data Flow Graph

A picture of the system as a network of functional processes connected to each other by pipelines of data (data flows) and holding tanks of data (data stores). The processes transform the data. The data flow graph partitions the system into its component parts. Data flow structure follows the trail of a data item as it is accessed and modified by the code.

Why data flow models? The models seen so far emphasized control, e.g. control flow graphs, call graphs, and finite state machines. We also need to reason about dependence: Where does this value of x come from? What would be affected by changing it?

Many program analyses and test design techniques use data flow information, often in combination with control flow. Examples: taint analysis to prevent SQL injection attacks; data flow test criteria.

Definition-Use

A definition-use (or def-use) pair is the most basic data flow relationship: a value was defined (produced or stored) at one point and used (referenced or retrieved from a variable) at another point. A def-use (du) pair associates a point in a program where a value is produced with a point where it is used.

Definition: where a variable gets a value
- Variable declaration (often the special value "uninitialized")
- Variable initialization
- Assignment
- Values received by a parameter

Use: extraction of a value from a variable
- Expressions
- Conditional statements
- Parameter passing
- Returns

Example:
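The original example figure is not shown; the following minimal Java sketch (the variable names and the line labels A, B, C are illustrative) shows definitions and uses, and also serves the killing discussion in the next section:

static int example(int input) {
    int x = 7;        // A: definition of x
    x = input;        // B: another definition of x; it kills the definition at A
    int y = x + 1;    // C: use of x. (B, C) is a def-use pair, but (A, C) is not,
                      //    because the definition at A cannot reach C
    return y;         // use of y
}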

Definition-Clear or Killing

When x gets a new value at line B (as in the sketch above), we say the earlier definition at line A is killed. This means a def-use path with A as the definition cannot pass through B.

Dominators

Pre-dominators in a rooted, directed graph can be used to make this intuitive notion of a controlling decision precise. Node M dominates node N if every path from the root to N passes through M. A node will typically have many dominators, but except for the root, there is a unique immediate dominator of node N, which is closest to N on any path from the root and which is in turn dominated by all the other dominators of N. Because each node (except the root) has a unique immediate dominator, the immediate dominator relation forms a tree. Post-dominators are calculated in the reverse of the control flow graph, using a special exit node as the root.

Data Flow Analysis

The algorithms used to compute data flow information. These are basic algorithms used widely in compilers, test and analysis tools, and other software tools.

It is worthwhile to understand basic data flow analysis algorithms, even though we can usually use tools without writing them ourselves. The basic ideas are applicable to other problems for which tools are not readily available, and a basic understanding of how they work is key to understanding their limitations and the trade-offs we must make between efficiency and precision.

DF Algorithm

An efficient algorithm for computing reaching definitions (and several other properties) is based on the way reaching definitions at one node are related to the reaching definitions at an adjacent node. Suppose we are calculating the reaching definitions of node n, and there is an edge (p,n) from an immediate predecessor node p:
- If the predecessor node p can assign a value to variable v, then the definition v_p reaches n. We say the definition v_p is generated at p.
- If a definition v_p of variable v reaches a predecessor node p, and if v is not redefined at that node (in which case we say v_p is killed at that point), then the definition is propagated on from p to n.

public class GCD {
    public int gcd(int x, int y) {  // A: def x, y
        int tmp;                    // A: def tmp
        while (y != 0) {            // B: use y
            tmp = x % y;            // C: def tmp; use x, y
            x = y;                  // D: def x; use y
            y = tmp;                // E: def y; use tmp
        }
        return x;                   // F: use x
    }
}

Reach(E) = ReachOut(D)
ReachOut(E) = (Reach(E) \ {y_A}) ∪ {y_E}

The first (simple) equation says that values can reach E only by going through D. The second describes what happens at E: the previous definition of y at node A is overwritten (killed), and a new value of y is defined here at E. We can associate two data flow equations of this form (what reaches here, and what this node changes) with every line in the program or node in the control flow graph.

General equations for reaching analysis

Each node of the control flow graph (or each line of the program) can be associated with set equations of the same general form, with gen and kill sets. For reaching definitions, the gen set is just the definition made at that node; it is the empty set if no variable is changed on that line. The kill set is the set of values that are replaced there. The subscripts are the lines or nodes where an assignment takes place, so that we can keep track of different assignments to the same variable; for example, v_99 represents the value that v received at node (or line) 99.

Reach(n) = ∪_{m ∈ pred(n)} ReachOut(m)
ReachOut(n) = (Reach(n) \ kill(n)) ∪ gen(n)
gen(n)  = { v_n | v is defined or modified at n }
kill(n) = { v_x | v is defined or modified at x, x ≠ n }
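As a worked sketch for the GCD example above, using the node labels from the code comments, the gen and kill sets are:

gen(A) = {x_A, y_A, tmp_A}    kill(A) = {x_D, y_E, tmp_C}
gen(B) = { }                  kill(B) = { }
gen(C) = {tmp_C}              kill(C) = {tmp_A}
gen(D) = {x_D}                kill(D) = {x_A}
gen(E) = {y_E}                kill(E) = {y_A}
gen(F) = { }                  kill(F) = { }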

Avail equations

Available expressions is another classic data flow analysis. An expression becomes available where it is computed, and it remains available until one of the variables in the expression is changed. Unlike reaching definitions, though, it is not enough for an expression to be available on ANY path to a node; it has to be available on ALL paths to the node.

Avail(n) = ∩_{m ∈ pred(n)} AvailOut(m)
AvailOut(n) = (Avail(n) \ kill(n)) ∪ gen(n)
gen(n)  = { exp | exp is computed at n }
kill(n) = { exp | exp has variables assigned at n }

The gen and kill sets are different, reflecting the definition of the property. We use intersection instead of union when there is more than one path to a node, because this is an all-paths analysis rather than an any-paths analysis.
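For instance, in the following minimal Java sketch (the names are illustrative):

static int availExample(int b, int c) {
    int a = b + c;    // gen: the expression b + c becomes available here
    int d = b + c;    // b + c is still available: no operand has changed
    b = 1;            // kill: assigning b makes b + c unavailable
    int e = b + c;    // b + c must be recomputed here
    return a + d + e;
}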

Live variable equations

Live variables is a third classic data flow analysis. A variable is live at a node if the value currently in the variable may be used later (before being replaced by another value). It is an any-paths property, like reaching definitions. Note what is different here: we look at successors of the current node rather than predecessors. We say live is a backward analysis, while reach and avail are forward analyses.

Live(n) = ∪_{m ∈ succ(n)} LiveOut(m)
LiveOut(n) = (Live(n) \ kill(n)) ∪ gen(n)
gen(n)  = { v | v is used at n }
kill(n) = { v | v is modified at n }
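For instance, in this minimal Java sketch (the names are illustrative):

static int liveExample(int input) {
    int x = input;    // def of x; x is live here because of the use below
    int y = x + 1;    // use of x; x is dead after this point...
    x = 0;            // ...so this definition is useless unless x is used again
    return y;         // y is live from its definition to this use
}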

Classification of analyses

- Forward/backward: a node's set depends on that of its predecessors/successors.
- Any-path/all-path: a node's set contains a value iff the value comes from any/all of its inputs.

                Forward (pred)    Backward (succ)
Any-path (∪)    Reach             Live
All-paths (∩)   Avail             Inevitable

Live, Avail, and Reach fill three of the four possible combinations of {forward, backward} × {any path, all paths}. For the fourth combination (all paths, backward), a natural term would be inevitability: it would propagate values backward from the points where they appear in a gen set, but only as long as all possible paths forward definitely reach the point of gen.

From Execution to Conservative Flow Analysis

We can use the same data flow algorithms to approximate other dynamic properties:
- The gen set contains facts that become true here.
- The kill set contains facts that are no longer true here.
- Flow equations describe propagation.

Example: taintedness (in web form processing). A taint is a user-supplied value (e.g., from a web form) that has not been validated. Gen: we get a value from an untrusted source here. Kill: we validated the value to make sure it is proper. (See the sketch below.)
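A minimal Java sketch of how the gen and kill sets might arise; readUntrustedInput, isValid, and runQuery are hypothetical helpers, not a specific API:

class TaintSketch {
    static void taintExample() {
        String name = readUntrustedInput();  // gen: name receives an untrusted (tainted) value
        if (isValid(name)) {
            runQuery("SELECT ... " + name);  // ok: on this branch validation killed the taint
        } else {
            runQuery("SELECT ... " + name);  // flagged: a possibly tainted value reaches the query
        }
    }
    // Hypothetical stubs so the sketch is self-contained:
    static String readUntrustedInput() { return new java.util.Scanner(System.in).nextLine(); }
    static boolean isValid(String s)   { return s.matches("[A-Za-z]+"); }
    static void runQuery(String q)     { System.out.println(q); }
}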

Monotonicity: y ≥ x implies f(y) ≥ f(x), where f is the application of the flow equations to values from successor or predecessor nodes, and ≥ is movement up the lattice.

Flow equations must be monotonic. Initialize the analysis to the bottom element of a lattice of approximations; each new value that changes must move up the lattice. Typically this is a powerset lattice: bottom is the empty set and top is the universe (or the empty set is at the top for an all-paths analysis).

Work List Algorithm for Data Flow

One way to iterate to a fixed-point solution. The general idea:
- Initially, all nodes are on the work list and have default values. The default for an any-path problem is the empty set; the default for an all-path problem is the set of all possibilities (the union of all gen sets).
- While the work list is not empty:
  - Pick any node n on the work list and remove it from the list.
  - Apply the data flow equations for that node to get new values.
  - If the new value changed (from the old value at that node), add its successors (for a forward analysis) or predecessors (for a backward analysis) to the work list.
- Eventually the work list will be empty (because the newly computed values equal the old values for each node) and the algorithm stops.

This is a classic iteration to a fixed point. Perhaps the only subtlety is how we initialize values.
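A sketch of the work list algorithm in Java, for reaching definitions (an any-path, forward analysis). The encoding of CFG nodes as integers and of definitions as strings like "xA" (for x_A) is an assumption for illustration:

import java.util.*;

public class WorklistReach {
    /** Computes ReachOut for every node of a CFG by iterating to a fixed
        point. Nodes are 0..n-1; pred/succ give the CFG edges; gen/kill
        are the per-node sets of definitions. */
    static List<Set<String>> reachOut(int n,
                                      List<List<Integer>> pred,
                                      List<List<Integer>> succ,
                                      List<Set<String>> gen,
                                      List<Set<String>> kill) {
        // Any-path (union) problem: the default value is the empty set.
        List<Set<String>> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new HashSet<>());
        // Initially, all nodes are on the work list.
        Deque<Integer> work = new ArrayDeque<>();
        for (int i = 0; i < n; i++) work.add(i);
        while (!work.isEmpty()) {
            int node = work.remove();
            // Reach(node) = union of ReachOut(m) over predecessors m of node
            Set<String> reach = new HashSet<>();
            for (int m : pred.get(node)) reach.addAll(out.get(m));
            // ReachOut(node) = (Reach(node) \ kill(node)) ∪ gen(node)
            Set<String> newOut = new HashSet<>(reach);
            newOut.removeAll(kill.get(node));
            newOut.addAll(gen.get(node));
            if (!newOut.equals(out.get(node))) {  // value changed: moved up the lattice
                out.set(node, newOut);
                work.addAll(succ.get(node));      // forward analysis: requeue successors
            }
        }
        return out;
    }
}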

Data flow analysis with arrays and pointers

Arrays and pointers introduce uncertainty: do different expressions access the same storage? For example, a[i] is the same as a[k] when i == k, and a[i] is the same as b[i] when a == b (aliasing). The uncertainty is accommodated according to the kind of analysis:
- Any-path: gen sets should include all potential aliases, and kill sets should include only what is definitely modified.
- All-path: vice versa.

Scope of Data Flow Analysis

Intraprocedural: within a single method or procedure, as described so far. Interprocedural: across several methods (and classes) or procedures. Cost/precision trade-offs for interprocedural analysis are critical, and difficult.

Context sensitivity

A context-sensitive (interprocedural) analysis distinguishes sub() called from foo() from sub() called from bar(), as in the sketch below.
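A minimal sketch; the method bodies are illustrative:

class Contexts {
    static void foo() { sub(1); }            // sub called from foo
    static void bar() { sub(2); }            // sub called from bar
    static int sub(int n) { return n + 1; }  // a context-sensitive analysis keeps the
                                             // two calls (and their returns) separate
}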

A context-insensitive (interprocedural) analysis does not separate them, as if foo() could call sub() and sub() could then return to bar(). The problem with context sensitivity is that it is not enough to keep track of the immediate calling context: we may also need to consider all the methods that could call foo, and all the methods that could call those methods, and so on. A single procedure or method can have exponentially many calling contexts, relative to the number of procedures or methods in the program. A context-insensitive analysis does not distinguish each context, so it can be much cheaper, but it can be very imprecise. For example, we could get an analysis result that corresponds to foo calling sub and sub returning to bar, which is impossible.

Flow sensitivity

Within a single method or procedure, we typically use a flow-sensitive data flow analysis, like those we have seen (avail, live, reach, etc.). The cost may be cubic in the size of a procedure or method, but procedures and methods are usually short, and for well-structured code the algorithms often run much faster than their theoretical worst-case time bound.

Interprocedural analysis, on the other hand, may need to handle many thousands or even millions of lines of code. For the properties that really demand interprocedural analysis, a flow-insensitive analysis is often good enough. It may be imprecise, but it is much better than just making worst-case assumptions. Type checking is a familiar example of flow-insensitive analysis.
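For example, a variable's declared type holds on every path, so the checker need not trace control flow (a minimal sketch):

static int typeExample(boolean cond) {
    String s;             // the declared type String holds on all paths
    if (cond) s = "yes";
    else s = "no";
    return s.length();    // type-checks without any path-by-path analysis
}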

Data flow concept
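The original figure is not shown; the fragment below is a hypothetical reconstruction consistent with the line numbers used in the text:

static int example(int a, int b, int c) {
    int x, y;
    /* 1 */ x = a + b;        // def of x at line 1
    /* 2 */ if (c > 0)
    /* 4 */     x = c * 2;    // def of x at line 4
    /* 6 */ y = x;            // use of x at line 6
    return y;
}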

In the fragment above, the value of x at line 6 could be computed at line 1 or at line 4. A bad computation at 1 or 4 could be revealed only if the value is used at 6. So (1,6) and (4,6) are def-use (DU) pairs: defs at 1 and 4, use at 6.

DU pair: a pair of definition and use for some variable, such that at least one DU path exists from the definition to the use

"x = ..." is a definition of x; "= ... x ..." is a use of x.

DU path: a definition-clear path on the CFG from a definition to a use of the same variable. Definition-clear means the value is not replaced along the path. Note that loops could create infinitely many DU paths between a def and a use.

Adequacy criteria

- All DU pairs: each DU pair is exercised by at least one test case.
- All DU paths: each simple (non-looping) DU path is exercised by at least one test case.
- All definitions: for each definition, there is at least one test case which exercises a DU pair containing it (every computed value is used somewhere).

Corresponding coverage fractions can also be defined:

C_DUpairs = (# of exercised DU pairs) / (# of DU pairs)
C_DUpaths = (# of exercised simple DU paths) / (# of simple DU paths)

Difficult cases

- x[i] = ... ; ... ; y = x[j] is a DU pair only if i == j.
- p = &x ; ... ; *p = 99 ; ... ; q = x, where *p is an alias of x.
- m.putFoo(...); ... ; y = n.getFoo(...); are m and n the same object? Do m and n share a foo field?

The problem of aliases: which references are (always or sometimes) the same?

Data flow coverage with complex structures

Arrays and pointers are critical for data flow analysis. Under-estimation of aliases may fail to include some DU pairs; over-estimation, on the other hand, may introduce infeasible test obligations. For testing, it may be preferable to accept under-estimation of the alias set rather than over-estimation or expensive analysis. This is controversial: in other applications (e.g., compilers), a conservative over-estimation of aliases is usually required. (See the array sketch below.)
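A minimal Java sketch of the array case (the names are illustrative):

static int aliasExample(int[] x, int i, int j) {
    x[i] = 42;      // def of x[i]
    int y = x[j];   // use of x[j]: a DU pair with the def above only if i == j,
                    // which a static analysis generally cannot decide
    return y;
}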

Alias analysis may rely on external guidance or other global analyses to calculate good estimates. Undisciplined use of dynamic storage, pointer arithmetic, etc. may make the whole analysis infeasible.

Infeasibility

The path-oriented nature of data flow analysis makes the infeasibility problem especially relevant: combinations of elements matter. It is impossible to (infallibly) distinguish feasible from infeasible paths, and more paths mean more work to check manually. In practice, reasonable coverage is (often, though not always) achievable. The number of paths is exponential in the worst case, but often linear; even so, all DU paths is more often impractical.

Not all elements of a program are executable, and the path-oriented nature of data flow aggravates the problem: infeasibility creates test obligations not only for isolated unexecutable elements but also for infeasible combinations of feasible elements. Complex data structures may amplify the infeasibility problem by adding infeasible paths as a result of alias computation. For example, x[i] is an alias of x[j] when i == j, but we cannot in general determine whether i == j before the program executes. Still, the problem of infeasible paths is modest for the all DU pairs adequacy criterion.
