#+title: All about Soot (draft)
#+date: <2022-11-15 Tue 12:51>
#+author: thebesttv

- Official Soot documents
  - [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/options/soot_options.html][Soot cli Options]]
  - [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/jdoc/index.html][Soot javadoc]]
  - [[https://github.com/soot-oss/soot/wiki][Soot wiki]]
- Tutorials
  - [[https://github.com/noidsirius/SootTutorial][SootTutorial]] A step-by-step tutorial for Soot
  - [[https://blog.csdn.net/qq_45401577/article/details/123958021][Soot入门(1): 安装与生成Jimple文件]]
  - [[https://www.brics.dk/SootGuide/][A Survivor's Guide to Java Program Analysis with Soot]]
    简直是救世主!!!
    里面的代码是 Latin1 编码的, 转换成 UTF8 好点.
    #+begin_src bash
      find . -name '*.java' -exec iconv -f latin1 -t utf8 -o \{} \{} \;
    #+end_src
- Theses
  - <<<Sable thesis>>>: An [[https://courses.cs.washington.edu/courses/cse501/01wi/project/sable-thesis.pdf][107-page-long thesis]] by Raja Vallee-Rai,
    which gives much information about Soot, especially the Jimple
    grammar.

* Preliminaries

JVM 4 种函数调用
- invoke special: call constructor, superclass methods, private method
- invoke virtual: normal instance method call (virtual dispatch)
- invoke interface: like invoke virtual, but cannot optimize,
  additionally, check interface implementation
- invoke static: call static methods
- invoke dynamic (after Java 7): allows dynamic typing language to run
  on JVM (Java is static typing)

* Basic concepts

Soot has its own class path, which by default is empty.  When specifying
class path of Soot using =-cp=, do not use =~=.  Instead, use absolute
or relative paths.

Jimple 尖括号中为 method signature: =class-name: return-type method-name
(parameter-type1, ...)=

** Three types of classes
:PROPERTIES:
:CUSTOM_ID: analyzed-classes
:END:

There are [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/options/soot_options.html#description][three kinds of classes]] (these are classes *analyzed by Soot*,
not the ones [[#main-impl-classes][owned by Soot]]):
- argument class: specified explicitly in Soot cli as argument, also
  become application class
- *application class*: classes that Soot analyzes, transforms, and turns
  into output files
- library class: classes which are *referred to*, directly or
  indirectly, by the application classes, but which are not themselves
  application classes.  Only used for *type resolution*.
Since argument classes automatically become application classes,
there are inherently only two classes---application class & library
class.

When you use the =-app= option, however, then Soot also processes all
classes referenced by these classes.  It will not, however, process any
classes in the JDK, i.e. classes in one of the =java.*= and =com.sun.*=
packages.  If you wish to include those too you have to use the special
=–i= option, e.g. =-i= java.

** Packs & phases

#+begin_quote
The execution of Soot is separated into several phases called /packs/.
#+end_quote

The role of a pack
- =b=: body creation
- =t=: user-defined transformations.  This is of special interest since
  it allows us to inject custom analysis.
- =o=: optimizations
- =a=: annotation (attribute generation)

*** Whole Program Analysis Packs

Before running the aforementioned packs, some packs are run
- =wjpp=: here =w= stands for /whole-program/.
- =cg=: call-graph generation
- =wjtp=: whole Jimple transformation pack
- =wjop=: whole Jimple optimization pack (this is disabled unless =-W=
  is specified)
- =wjap=: whole Jimple annotation pack
The information generated in these packs are made available to the rest
of Soot through =Scene.v()=.

*** Cli Options

To show help:
- =-pl=, =-phase-list=: Print list of available phases
- =-ph PACK=, =-phase-help PACK=: Print help for the specified =PACK=.
  Here =PACK= can be either generic (e.g. =jop=), or specific
  (e.g. =jop.cpf=)

To set an option to a pack, use =-p= or =-phase-option= in the form of
=-p PACK OPTION:VALUE=, which sets =PACK='s =OPTION= to =VALUE=, e.g. to
turn off all user-defined intra-procedural transformations (in pack
=jtp=):
#+begin_src bash
  soot -p jtp enabled:false ...
#+end_src

* Building Soot

#+begin_src bash
  mvn clean compile assembly:single
#+end_src

** Javadoc

#+begin_src bash
  mvn javadoc:javadoc
#+end_src

* Soot in cli

#+begin_src bash
  soot -v -process-dir code/ -d out
  soot -cp . -pp Circle
  soot -cp . -pp Circle -p cg.spark verbose:true,on-fly-cg:true
#+end_src

Cli options are defined in =src/main/xml/options/soot_options.xml=.

* Different IRs

{{{fig(Soot IRs, ir, 80)}}}
[[./soot/ir.jpg]]

** Baf

Baf is
- a compact representation of bytecode
- stack-based

The common interface is =soot.baf.Inst=.

Available optimizations are in =soot.baf.toolkits.base=.

** Jimple

Jimple is
- typed: all local variables are typed
- stackless
- 3-address (statements reference at most 3 local variables or
  constants)
  - this requires linearization of some complex expressions, e.g. =a*b +
    c*d= is converted to multiple 3-address statements.

For a complete explanation of Jimple, see [[#jimple][section Jimple]].

** Shimple

Shimple is
- SSA-version (Static Single Assignment) of Jimple: each local variable
  has a single static point of definition.
  - this introduces a /Phi node/.

** Grimp

Grimp preserves =new= operator and complex expressions (no
linearization).

** Dava

* Main implementation classes
:PROPERTIES:
:CUSTOM_ID: main-impl-classes
:END:

Thses are *implementation classes of Soot*, i.e. they are owned by Soot.
For a classification of classes *analyzed by Soot*, see [[#analyzed-classes][this section]].
Fig. [[main-class-relation]] shows fun-call relations of some of the most
important classes.

{{{fig(Main class relationships, main-class-relation, 80)}}}
[[./soot/main-class-relation.jpg]]

- [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/jdoc/soot/Scene.html][=Scene=]] Manages the =SootClass=​es of the application being analyzed.
- [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/jdoc/soot/SootClass.html][=SootClass=]] Soot representation of a Java class.  They are usually
  created by a =Scene=, but can also be constructed manually through the
  given constructors.
  #+begin_src java
    // for methods
    SootMethod getMethod(String subsignature);
    SootMethod getMethod(String name, List<Type> parameterTypes);
    SootMethod getMethodByName(String name);
    int getMethodCount();
    List<SootMethod> getMethods();
    // for fields, alike
    Chain<SootField> getFields();
  #+end_src
- =SootMethod=
  - =Body=, =JimpleBody=
- =SootField=
- =Unit=
- =UnitGraph=
  - =ExceptionalUnitGraph=: use
    =ExceptionalUnitGraphFactory.createExceptionalUnitGraph()= to create

** Scene

=Scene= is a singleton class that keeps all classes which are
represented by =SootClass=.  Each =SootClass= may contain several
methods (=SootMethod=) and each method may have a =Body= object that
keeps the statements (=Unit=​s).

Scene

There are two scenes:
- =soot.Scene=: which manages all the =SootClass=​es being analyzed.
- =soot.ModuleScene=: a subclass of =Scene= used to analyze Java 9
  modules.

Methods of =soot.Scene=:
- =loadClassAndSupport(String className)=: loads the given class and all
  the required support classes.
- =loadNecessaryClass(String name)=
  #+begin_src java
    protected void loadNecessaryClass(String name) {
        loadClassAndSupport(name).setApplicationClass();
    }
  #+end_src
- =loadNecessaryClasses()=: loads the set of classes that soot needs,
  including those *specified on the command-line*.  This is the standard
  way of initialising the list of classes soot should use.

  The classes specified in the command-line include:
  - individual classes specified in command-line.  e.g. =java soot.Main
    -cp . -pp A B=, then =opts.classes()= returns the list ={"A", "B"}=.
    #+begin_src java
      for (String name : opts.classes()) {
          loadNecessaryClass(name);
      }
    #+end_src
  - =-process-dir=: all classes specified in directories
    #+begin_src java
      for (String path : opts.process_dir()) {
          for (String cl : SourceLocator.v().getClassesUnder(path)) {
              SootClass theClass = loadClassAndSupport(cl);
              if (!theClass.isPhantom) {
                  theClass.setApplicationClass();
              }
          }
      }
    #+end_src

** SootMethod

SootMethod
- =getActiveBody()= throws an exception when no active body is present.
  This cannot be called before =PackManager.v().runPacks();= in =Main=.
- =retrieveActiveBody()= will construct an active body if none is
  present.

*** Printing a Method

In =soot.Body::toString()=, =Printer.v().printTo()= is used to print a
method body:
#+begin_src java
  Printer.v().printTo(this, writerOut);
#+end_src

** SootField

** Graph

Different kinds of graphs (partial)
#+begin_example
  DirectedBodyGraph (I)
      ExceptionalGraph (I)
          CompleteUnitGraph (C)
          ExceptionalUnitGraph (C)
              CompleteUnitGraph (C)
          CompleteBlockGraph (C)
          ExceptionalBlockGraph (C)
              CompleteBlockGraph (C)
      CompleteUnitGraph (C)
      ExceptionalUnitGraph (C)
          CompleteUnitGraph (C)
      BriefUnitGraph (C)
      TrapUnitGraph (C)
      UnitGraph (C)
          ExceptionalUnitGraph (C)
              CompleteUnitGraph (C)
          BriefUnitGraph (C)
          TrapUnitGraph (C)
#+end_example

* Jimple
:PROPERTIES:
:CUSTOM_ID: jimple
:END:

A complete description of the Jimple grammar can be seen in Figure 2.9
and 2.10 of the Sable thesis.

The common interface is =soot.jimple.Stmt=.

There are 15 =Stmt=​s (=Stmt= is instance of =Unit=)
- Core statements
  - =NopStmt=
  - =DefinitionStmt=: its left op can either be a primitive (=PrimType=)
    or a ref-like type (=RefLikeType=).  To check:
    #+begin_src java
      if (defStmt.getLeftOp().getType() instanceof RefLikeType) {
          // ...
      }
    #+end_src
    - =IdentityStmt=: assigns parameters and =this= reference to local
      variables.  This ensures that all local variables have at least
      one definition point.
      #+begin_src text
        r0 := @this;
        i1 := @parameter0;
      #+end_src
    - =AssignStmt=
- Intra-procedual control-flow statements
  - =IfStmt=
    #+begin_src text
      if r1 != null goto label0;
    #+end_src
    In a =BranchedFlowAnalysis=, there're two flows out of an =IfStmt=:
    the fall-through flow and branched flow.
  - =GotoStmt=
  - =SwitchStmt=
    - =TableSwitchStmt=
    - =LookupSwitchStmt=
- Inter-procedual control-flow statements
  - =InvokeStmt=
  - =ReturnStmt=
  - =ReturnVoidStmt=
- Monitor statements: for mutual exclusion
  - =EnterMonitorStmt=
  - =ExitMonitorStmt=
- =ThrowStmt=: throws an exception
- =RetStmt=: not used; returns from a JSR
  - JSR & RET are JVM instructions for subroutine.  It seems that they
    are [[https://stackoverflow.com/q/5871190/11938767][deprecated Java bytecode]], as using them causes more harm than
    good.  According to [[http://www.sable.mcgill.ca/listarchives/soot-list/msg00509.html][this]] mail and its [[http://www.sable.mcgill.ca/listarchives/soot-list/msg00510.html][reply]], JVM subroutines (JSR &
    RET) "cause huge problems with analysis and optimization" and are
    removed by Jimple's JSR inliner.

#+begin_quote
The local variables which start with a dollar sign (=$=) represent *stack
positions* and not local variables in the original program whereas those
without =$= represent real local variables e.g. =i0= in the main method
corresponds to =a= in the Java source.
#+end_quote

The main structure of a Jimple method (from Section 2.3.6 of the Sable
thesis):
- All local variables are declared at the top of the method.
- Identity statements follow the local variable declarations, which
  marks the local variables that have values upon method entry.
- Then comes the method body, which are mostly assignment statements.

- See the [[https://soot-oss.github.io/soot/docs/4.4.0-SNAPSHOT/jdoc/soot/jimple/internal/package-tree.html][Hierarchy For Package soot.jimple.internal]], all statements are
  under =soot.AbstractUnit= \to =soot.jimple.internal.AbstractStmt=.

** FieldRef

=FieldRef= 分为 =InstanceFieldRef= 和 =StaticFieldRef=
#+begin_example
FieldRef (I)
|- InstanceFieldRef (I)
|  |- JInstanceFieldRef (C, for Jimple)
|  |- GInstanceFieldRef (C, for Grimp)
|  `- ...
|- StaticFieldRef (C)
`- ...
#+end_example

** Labels

Labels are displayed using =Printer=.

* Body

Body has three chains
- Units chain: the actual code.  Jimple provides the =Stmt=
  implementation of =Unit=, while Grimp provides the =Inst=
  implementation.
- Locals chain: local variables
- Traps chain: trap handlers, in the form of
  #+begin_src text
    catch java.lang.Exception from label0 to label1 with label2;
  #+end_src

* Value

Value
- =Local=: a local variable
  - =JimpleLocal=
- =Expr=: expression.  An =Expr= carries out some action on one or
  several =Value=​s and returns another =Value=.
  - package =soot.jimple=
    - =BinopExpr=
    - =NewExpr=
    - =NewArrayExpr=
    - =NewMultiArrayExpr=
  - package =soot.jimple.internal=
    - =JCastExpr=
    - ...
  - ...
- =Immediate=
  - =Constant=
- =Ref=
  - =ParameterRef=
  - =CaughtExceptionRef=
  - =ThisRef=

** ValueBox

A =ValueBox= is a pointer to some value.  It can be visualized as a box
containing some value.

- =getValue()=: dereferences the pointer
- =setValue()=: mutates value in the box
- A unit has both DefBox & UseBox
  - =getUseBoxes()= returns a list of =ValueBox=​es, corresponding to
    *all* =Value=​s used in the unit.
  - =getDefBoxes()= returns all =Values=​s defined in the unit.
  - For example, for unit =x=y*z=, there're 3 use boxes: =[y*z]= (an
    =Expr=), =[y]= (a =Local=), and =[z]= (another =Local=); and one
    def box: =[x]= (a =Local=).  The brackets (=[]=) represent the
    box.

For example, the following Java code
#+begin_src java
  int a = 12;
  int b = 24;
  int x = a * b;
#+end_src
is translated to
#+begin_src text
  a = 12;
  b = 24;
  temp$0 = a * b;
  x = temp$0;
#+end_src
The DefBox & UseBox of each statement is as follows
#+begin_src text
  a = 12
    Def:
      LinkedVariableBox[JimpleLocal: a]
    Use:
      LinkedRValueBox[IntConstant: 12]

  b = 24
    Def:
      LinkedVariableBox[JimpleLocal: b]
    Use:
      LinkedRValueBox[IntConstant: 24]

  temp$0 = a * b
    Def:
      LinkedVariableBox[JimpleLocal: temp$0]
    Use:
      LinkedRValueBox[JMulExpr: a * b]
      ImmediateBox[JimpleLocal: a]
      ImmediateBox[JimpleLocal: b]

  x = temp$0
    Def:
      LinkedVariableBox[JimpleLocal: x]
    Use:
      LinkedRValueBox[JimpleLocal: temp$0]
#+end_src

* Type

Class hierarchy of =Type=:
#+begin_src text
  Type
  |- PrimType: including int, float, char ...
  |  |- BooleanType
  |  |- CharType
  |  |- IntType
  |  `- ...
  |- RefLikeType
  |  |- ArrayType: array reference
  |  |- NullType
  |  `- RefType: simple reference
  `- VoidType: void
#+end_src

* Analyses

** Off-The-Shelf Analyses

- Null Pointer Checker
  - =jap.npc=
  - =jap.npcolorer=
- Array Bound Checker
  - =jap.abc=
- Liveness Analysis
  - =jap.lvtagger=

** Custom Analyses

Inject custom inter-procedural analyses into =wjtp= pack and
intra-procedural analyses into =jtp= pack.

#+begin_src java
  public class MySootMainExtension {
      public static void main(String[] args) {
          // Inject the analysis tagger into Soot
          PackManager.v().getPack("jtp")
              .add(new Transform("jpt.myanalysistagger",
                                 MyAnalysisTagger.instance()));
          // Invoke soot.Main with arguments given
          Main.main(args);
      }
  }
#+end_src

*** Very Busy Expressions Analysis

- [[https://www.cis.upenn.edu/~mhnaik/edu/cis700/lessons/dataflow_analysis.pdf][dataflow\under{}analysis.pdf]] very good explanation
- [[https://pages.cs.wisc.edu/~fischer/cs701.f08/lectures/Lecture18.4up.pdf][Lecture18.4up.pdf]] another explanation

The goal of Very Busy Expressions analysis is to compute expressions
that are very busy at the exit from each program point.

An expression is very busy if, *no matter what path is taken*, the
expression is always used before any of the variables occurring in it
are redefined.

This is a must analysis, since if in either one of the path, the
expression $e$ is not used, it is not considered very busy.

This is a backwards analysis, as the fact at node $d$ is deduced from
later (TODO: change word) nodes.

For expression $e = x + y$ from node $s$ to $p$, if either $x$ or $y$ is
redefined along the path, then even if $p$ uses expression $e$, it's not
very busy at $s$.