---
name: "Source code analyzers: how generalizable are they?"
theme: "white"
transition: "slide"
slideNumber: "true"
pdfMaxPagesPerSlide: 1
highlightTheme: "default"
mouseWheel: true
---


Source code analyzers:
how generalizable are they?

Ivan Kochurkin
Positive Technologies
Team Lead

Slides: kvanttt.github.io
---

# 📝 About me

* Ivan Kochurkin
* Team Lead at [Positive Technologies](https://www.ptsecurity.com/), Data Flow Source Code Analyzer
* Developer at [Swiftify](http://swiftify.io/), Objective-C → Swift Source Code Converter
* Active Contributor on GitHub: [KvanTTT](https://github.com/KvanTTT)
* Tech Article Writer at [habr.com](https://habr.com/users/kvanttt/) and other blogs

---

# 📋 Analyzer Types

1. Regular Expressions
2. Tokens
3. Parse Trees and AST
4. Data & Control Flow Graphs (DFG & CFG)
5. Binary | Intermediate Language

---

# ⏭️ Regular Expressions

1. `(.*?)`
2. Attributes? `(.*?)`
3. Elements? `tr`, `td`
4. Comments? ``
5. ...
6. [NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̅ͫ͏̙̤g͇̫͛͆̾ͫ̑͆l͖͉̗̩̳̟̍ͫͥͨe̠̅s ͎a̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZA̡͊͠͝LGΌ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ ͠P̯͍̭O̚​N̐Y̡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̬̩̾͛ͪ̈́̀́͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̙̲̝͖ͭ̏ͥͮ͟Oͮ͏̮̪̝͍M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ](https://stackoverflow.com/a/1732454/1046374)

---

# ㊙️ Regex DSL

* `[ ]` - Matches a single character that is contained within the brackets.
* `[^ ]` - Matches a single character that is not contained within the brackets.
* `?` - Zero or one occurrence (the preceding element is optional).
* `*` - Zero or more occurrences. `ab*c` matches `ac`, `abc`, `abbc`
* `+` - One or more occurrences.
* `|` - Or. `gray|grey` can match `gray` or `grey`.

---

# 🔲 Regex Patterns

| Advantages                   | Disadvantages                                          |
|------------------------------|--------------------------------------------------------|
| Very simple                  | Hard to maintain                                       |
| Formal model is not required | Generally not recursive                                |
| Universal                    | Slow                                                   |
|                              | Hidden tokens (whitespace, comments) cannot be skipped |

---

# 🔲 Regex Patterns

* Floating Point Numbers: `[-+]?[0-9]*\.?[0-9]+`
* 📧 Email Addresses

```
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
```

* IP Address Find | Validation

---

# ⏭️ Tokens

* Lexeme - Recognized char sequence
* Token = Lexeme + Type
* Grammar

```
Keyword: 'var';
Id: [a-z]+;
Digit: [0-9]+;
Comment: '/*' .*? '*/';
Semi: ';';
Whitespace: ' '+;
```

* Code Sample

```csharp
var a = 17; /* comment */
```

---

# ㊙️ Token DSL Example

Regex + Additional Syntax

* `<[regex]>` - Id token by custom regex
* `<"regex">` - String by custom regex
* `<(begin..end)>` - Numbers (range)
* `` - Comments by custom regex

---

# 🔲 Token Patterns

Simple, but still not recursive

* `<[password]> = <"">`
* ``
* `<[md5|sha1]>(`
* `<"(?i)select\s\w*"> + <~> <"\w*">`

---

# 😲 An error in the code caused by an error in the parser?

#### Grammar

```
Identifier: [A-Za-z]+
```

#### ❌ Wrong

```sql
add constraint С_PK primary key (ID);
```

#### ✔️ Right

```sql
add constraint C_PK primary key (ID);
```

#### 😲 WTF?

---

# 🕵️ Text fingerprinting with zero-width characters

Be c​aref​ul wh​at yo​u copy 🕵️

[https://diffchecker.com](https://www.diffchecker.com/M2PvqSXw)

Be c•aref•ul wh•at yo•u copy•
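
---

# 🕵️ Spotting invisible and look-alike characters

Both surprises above come down to characters the eye cannot tell apart. Here is a minimal sketch of how such characters could be flagged before text reaches a parser. It assumes the "wrong" SQL identifier starts with Cyrillic `С` (U+0421) rather than Latin `C`, and that the fingerprinted sentence hides zero-width spaces (U+200B). The helper name `find_suspicious_chars` and the `ZERO_WIDTH` set are hypothetical and exist only for this illustration.

```python
import unicodedata

# Assumption: a small, non-exhaustive set of zero-width code points
# that are commonly used for invisible text fingerprinting.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}


def find_suspicious_chars(text):
    """Return (index, code point, kind, Unicode name) for every non-ASCII character."""
    findings = []
    for index, ch in enumerate(text):
        if ord(ch) > 127:
            kind = "zero-width" if ch in ZERO_WIDTH else "non-ASCII"
            name = unicodedata.name(ch, "UNKNOWN")
            findings.append((index, f"U+{ord(ch):04X}", kind, name))
    return findings


# Cyrillic "С" (U+0421) looks like Latin "C" but is not matched by [A-Za-z]+.
wrong = "add constraint \u0421_PK primary key (ID);"
right = "add constraint C_PK primary key (ID);"
# Zero-width spaces (U+200B) hidden inside ordinary-looking text.
marked = "Be c\u200baref\u200bul wh\u200bat yo\u200bu copy"

for sample in (wrong, right, marked):
    print(find_suspicious_chars(sample))
```

Flagging everything outside ASCII is deliberately crude: it suits identifiers that a grammar restricts to `[A-Za-z]+`, but it would be far too noisy for string literals or comments in other languages.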