---
layout: post
title: A Compact Linux Detours Library (x86_64)
author: Dylan Müller
author_url: https://linkedin.com/in/dylanmuller
---

> Function detouring is a powerful hooking technique that allows for the
> interception of C/C++ functions. `cdl86` aims to be a compact detours library
> for x86_64 Linux.

1. [Overview](#overview)
2. [JMP Patching](#jmp-patching)
3. [INT3 Patching](#int3-patching)
4. [Code Injection](#code-injection)
5. [API](#api)
6. [Source Code](#source-code)

# Overview

Note: This article covers the Linux-specific details of the library. Windows support has since been added. See:
[https://github.com/lunarjournal/cdl86](https://github.com/lunarjournal/cdl86)

[Microsoft Research](https://en.wikipedia.org/wiki/Microsoft_Research) currently maintains a library known as [MS Detours](https://github.com/microsoft/Detours). It allows for the interception of Windows API calls within the memory address space of a process.

This might be useful in certain situations, such as when you are writing a D3D9 (DirectX) hook and need to intercept certain graphics routines. This is commonly done for ESP and wallhacks, where the Z depth buffer needs to be disabled for certain character models. For D3D9 this might involve hooking `DrawIndexedPrimitive`.

```
HRESULT WINAPI hkDrawIndexedPrimitive(LPDIRECT3DDEVICE9 pDevice, ...args)
{
    // Check player model strides, primitive count, etc
    ...
    pDevice->SetRenderState(D3DRS_ZENABLE, false);
    ...
    // Call original function and return
    oDrawIndexedPrimitive(...)
    return ...
}
```

In order to disable the Z buffer in this example we need access to a valid `LPDIRECT3DDEVICE9` context within the running process. This is where detours comes in handy. Generally, the procedure to hook a specific function is as follows:

- Declare a function pointer with the target function's signature.
```
typedef HRESULT (WINAPI* tDrawIndexedPrimitive)(LPDIRECT3DDEVICE9 pDevice, ...args);
```
- Define a detour function with the same function signature.
```
HRESULT WINAPI hkDrawIndexedPrimitive(LPDIRECT3DDEVICE9 pDevice, ...args)
```
- Assign the target function's address in memory to the function pointer. In this case it is a VTable entry.
```
#define DIP 0x55
tDrawIndexedPrimitive oDrawIndexedPrimitive = (tDrawIndexedPrimitive)SomeVTable[DIP];
```
- Call DetourFunction.
```
DetourFunction((void**)&oDrawIndexedPrimitive, &hkDrawIndexedPrimitive);
```

`DetourFunction` then uses the `oDrawIndexedPrimitive` function pointer to locate the target function and modifies the instructions there in order to transfer control flow to the detour function.

At this point any calls to `DrawIndexedPrimitive` within the `LPDIRECT3DDEVICE9` class will be rerouted to `hkDrawIndexedPrimitive`. You can see that this is a very powerful concept and gives us access to the callee's function arguments. As demonstrated, it is possible to hook both C and C++ functions. The difference generally is that the first argument to a non-static C++ member function is a hidden `this` pointer. Therefore you can define a C++ detour in C with this extra argument.

Detours is great, but it is only available for Windows. The aim of the `cdl86` project is to create a simple, compact detours library for x86_64 Linux. What follows is a brief explanation of how the library was designed.

# Detour methods

Two different approaches to function detouring were investigated and implemented in the `cdl86` C library. First let's have a look at a typical function call for a simple C program. We will be using GDB to inspect the resulting disassembly.

```
#include <stdio.h>

int add(int x, int y)
{
    return x + y;
}

int main()
{
    printf("%i", add(1,1));
    return 0;
}
```

Compile with:
```
gcc main.c -o main
```
and then debug with GDB:
```
gdb main
```

To list all the functions in the binary, supply `info functions` to the gdb command prompt.
```
0x0000000000001100  __do_global_dtors_aux
0x0000000000001140  frame_dummy
0x0000000000001149  add
0x0000000000001161  main
0x00000000000011a0  __libc_csu_init
0x0000000000001210  __libc_csu_fini
0x0000000000001218  _fini
```

Let's disassemble the main function with `disas /r main`:
```
Dump of assembler code for function main:
   0x0000000000001161 <+0>:     f3 0f 1e fa       endbr64
   0x0000000000001165 <+4>:     55                push   %rbp
   0x0000000000001166 <+5>:     48 89 e5          mov    %rsp,%rbp
   0x0000000000001169 <+8>:     be 01 00 00 00    mov    $0x1,%esi
   0x000000000000116e <+13>:    bf 01 00 00 00    mov    $0x1,%edi
   0x0000000000001173 <+18>:    e8 d1 ff ff ff    callq  0x1149 <add>
   0x0000000000001178 <+23>:    89 c6             mov    %eax,%esi
```

`callq` has one operand, which is the address of the function being called. It pushes the current value of `%rip` (the address of the instruction after the call) onto the stack and then transfers control flow to the target function.

You may have also noticed the presence of the `endbr64` instruction. This instruction was introduced by Intel as part of [Intel's Control-Flow Enforcement Technology (CET)](https://software.intel.com/content/www/us/en/develop/articles/technical-look-control-flow-enforcement-technology.html). CET is designed to provide hardware protection against ROP (Return-Oriented Programming) and similar methods which manipulate control flow using *existing* code.

Its two main features are:

* A shadow stack for tracking return addresses.
* Indirect branch tracking, which `endbr64` is a part of.

Intel CET however does not prevent us from modifying control flow **directly** by inserting instructions into memory.

# JMP Patching

The first method of function detouring we will explore is inserting a `JMP` instruction at the beginning of the target function to transfer control over to the detour function. It should be noted that in order to preserve the stack we need to use a `JMP` (specifically `jmpq`) instruction rather than a `CALL`, since a `CALL` would push an extra return address.

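
As a minimal sketch of what such a patch looks like in memory, the fragment below emits an absolute jump by loading the destination into `%rax` and jumping through the register; the reason for this two-instruction form is discussed next. The helper name `gen_jmpq_rax` and the `JMP_PATCH_LEN` constant are assumptions made for illustration and are not taken from the `cdl86` sources.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 10 bytes of movabs plus 2 bytes of jmpq. */
#define JMP_PATCH_LEN 12

/* Hypothetical helper, not the cdl86 implementation: writes
 * "movabs $dest,%rax; jmpq *%rax" into buf and returns the
 * number of bytes written. buf must hold JMP_PATCH_LEN bytes. */
size_t gen_jmpq_rax(uint8_t *buf, uint64_t dest)
{
    assert(buf != NULL);

    buf[0] = 0x48; /* REX.W prefix: 64-bit operand size */
    buf[1] = 0xB8; /* mov imm64 into %rax */

    /* The 64-bit immediate is stored least significant byte first. */
    for (int i = 0; i < 8; i++)
        buf[2 + i] = (uint8_t)(dest >> (8 * i));

    buf[10] = 0xFF; /* opcode group that contains the indirect jmp */
    buf[11] = 0xE0; /* ModRM byte selecting jmpq *%rax */

    return JMP_PATCH_LEN;
}
```

Calling `gen_jmpq_rax(buf, 0x5555555613b1)` fills `buf` with `48 b8 b1 13 56 55 55 55 00 00 ff e0`. Writing those bytes over live code additionally requires the page to be writable, which this sketch does not handle.
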
Since `jmpq` has no encoding that takes a 64-bit immediate address, we first have to store the address we want to jump to in a register. We need to choose a register that is not used to pass arguments under the default calling convention, which on x86_64 Linux is the System V AMD64 ABI rather than `__cdecl`. `%rax` is not used for regular argument passing in this convention (only variadic calls place a count in `%al`), so for simplicity we use this register in our design.

The following is a disassembly of the instructions required for a `JMP` to a 64-bit immediate address:
```
   0x0000555555561389 <+0>:     48 b8 b1 13 56 55 55 55 00 00   movabs $0x5555555613b1,%rax
   0x0000555555561393 <+10>:    ff e0                           jmpq   *%rax
```

You can see that 12 bytes are required in total: 10 to encode the `movabs` instruction (which moves the detour address into `%rax`) and 2 for the `jmpq` instruction. Immediate values are stored in little-endian (LE) encoding.

We can therefore conclude that we need to patch **at least** 12 bytes in memory at the location of our target function. The original bytes at that location are important, however, and we cannot simply discard them. Instead we place them at the start of what I will call a 'trampoline function'. Its layout is as follows:

```
trampoline <0x23215412>:
    (original instruction bytes which were patched)
    JMP (target + JMP patch length)
```

Simply put, the trampoline function behaves as the original, unpatched function. As shown above it consists of the target function's original instruction bytes followed by a jump back into the target function, offset by the `JMP` patch length. The trampoline generation code for `cdl86` is shown below:

```
uint8_t *cdl_gen_trampoline(uint8_t *target, uint8_t *bytes_orig, int size)
{
    uint8_t *trampoline;
    int prot = 0x0;
    int flags = 0x0;

    /* New function should have read, write and
     * execute permissions.
     */
    prot = PROT_READ | PROT_WRITE | PROT_EXEC;
    flags = MAP_PRIVATE | MAP_ANONYMOUS;

    /* We use mmap to allocate trampoline memory pool. */
    trampoline = mmap(NULL, size + BYTES_JMP_PATCH, prot, flags, -1, 0);
    memcpy(trampoline, bytes_orig, size);
    /* Generate jump to address just after call
     * to detour in trampoline. */
    cdl_gen_jmpq_rax(trampoline + size, target + size);

    return trampoline;
}
```

You can see that the allocation of the trampoline function occurs through a call to `mmap` with the `PROT_READ | PROT_WRITE | PROT_EXEC` memory protection flags. The correct memory permissions must likewise be set on the target function's page before it is modified. Here is a snippet from the `cdl86` library for setting memory attributes:

```
/* Set R/W memory protections for code page. */
int cdl_set_page_protect(uint8_t *code)
{
    int perms = 0x0;
    int ret = 0x0;

    /* Read, write and execute perms. */
    perms = PROT_EXEC | PROT_READ | PROT_WRITE;
    /* Calculate page size */
    uintptr_t page_size = sysconf(_SC_PAGE_SIZE);
    ret = mprotect(code - ((uintptr_t)(code) % page_size), page_size, perms);

    return ret;
}
```

The general procedure to place the `JMP` hook is as follows:

1. Determine the minimum number of bytes required for a `JMP` patch.
2. Create the trampoline function.
3. Set memory permissions (read, write, execute).
4. Generate the `JMP` to the detour at the target function.
5. Fill unused bytes with NOP.
6. Assign the trampoline address to the target function pointer.

Let's have a look at all of this in action using GDB. I will be using the [basic_jmp.c](https://github.com/lunarjournal/cdl86/blob/master/tests/basic_jmp.c) test case in the `cdl86` library.
The source code for this test case is shown below:

```
#include "cdl.h"

typedef int add_t(int x, int y);
add_t *addo = NULL;

int add(int x, int y)
{
    printf("Inside original function\n");
    return x + y;
}

int add_detour(int x, int y)
{
    printf("Inside detour function\n");
    return addo(5,5);
}

int main()
{
    struct cdl_jmp_patch jmp_patch = {};
    addo = (add_t*)add;

    printf("Before attach: \n");
    printf("add(1,1) = %i\n\n", add(1,1));

    jmp_patch = cdl_jmp_attach((void**)&addo, add_detour);
    if(jmp_patch.active)
    {
        printf("After attach: \n");
        printf("add(1,1) = %i\n\n", add(1,1));
        printf("== DEBUG INFO ==\n");
        cdl_jmp_dbg(&jmp_patch);
    }

    cdl_jmp_detach(&jmp_patch);
    printf("\nAfter detach: \n");
    printf("add(1,1) = %i\n\n", add(1,1));

    return 0;
}
```

We compile this source file with (modified from the makefile):
```
gcc -I../ -g basic_jmp.c ../cdl.c ../lib/libudis86/*.c -o basic_jmp
```
Then load it into GDB using:
```
gdb basic_jmp
```

Once GDB has loaded, we insert breakpoints at lines 24 and 27 using the commands:
```
break 24
break 27
```
We start execution of the program with:
```
run
```

GDB will then inform you that the first breakpoint has been triggered. For this first breakpoint we are interested in the `add()` function's assembly before the hook has taken place. To inspect this assembly, provide:
```
disas /r add
```
```
Dump of assembler code for function add:
   0x0000555555561389 <+0>:     f3 0f 1e fa     endbr64
   0x000055555556138d <+4>:     55              push   %rbp
   0x000055555556138e <+5>:     48 89 e5        mov    %rsp,%rbp
   0x0000555555561391 <+8>:     48 83 ec 10     sub    $0x10,%rsp
   0x0000555555561395 <+12>:    89 7d fc        mov    %edi,-0x4(%rbp)
```

This is the disassembly of the unaltered target function. 12 bytes for the `JMP` patch will have to be written at this address. The first 4 instructions occupy exactly 12 bytes (4 + 1 + 3 + 4), so they need to be copied to the trampoline function, followed by a `JMP` to address `0x0000555555561395`, and that's all we need for the trampoline!

Now the fun part!
Let's continue execution to the next breakpoint, where our `JMP` hook will be placed.
```
continue
```

Let's examine the disassembly of our `add()` function once again:
```
Dump of assembler code for function add:
   0x0000555555561389 <+0>:     48 b8 b1 13 56 55 55 55 00 00   movabs $0x5555555613b1,%rax
   0x0000555555561393 <+10>:    ff e0                           jmpq   *%rax
   0x0000555555561395 <+12>:    89 7d fc                        mov    %edi,-0x4(%rbp)
   0x0000555555561398 <+15>:    89 75 f8                        mov    %esi,-0x8(%rbp)
```

`0x5555555613b1` is the address of our detour/intercept function. Let's examine the disassembly of our detour function:
```
disas /r 0x5555555613b1
```
```
Dump of assembler code for function add_detour:
   0x00005555555613b1 <+0>:     f3 0f 1e fa             endbr64
   0x00005555555613b5 <+4>:     55                      push   %rbp
   0x00005555555613b6 <+5>:     48 89 e5                mov    %rsp,%rbp
   0x00005555555613b9 <+8>:     48 83 ec 10             sub    $0x10,%rsp
   0x00005555555613bd <+12>:    89 7d fc                mov    %edi,-0x4(%rbp)
   0x00005555555613c0 <+15>:    89 75 f8                mov    %esi,-0x8(%rbp)
   0x00005555555613c3 <+18>:    48 8d 3d 53 5c 00 00    lea    0x5c53(%rip),%rdi
   0x00005555555613ca <+25>:    e8 b1 fd ff ff          callq  0x555555561180
   0x00005555555613cf <+30>:    48 8b 05 ba bc 01 00    mov    0x1bcba(%rip),%rax        # 0x55555557d090 <addo>
   0x00005555555613d6 <+37>:    be 05 00 00 00          mov    $0x5,%esi
   0x00005555555613db <+42>:    bf 05 00 00 00          mov    $0x5,%edi
   0x00005555555613e0 <+47>:    ff d0                   callq  *%rax
   0x00005555555613e2 <+49>:    c9                      leaveq
   0x00005555555613e3 <+50>:    c3                      retq
```

We can see that the detour calls our trampoline function through the address stored in the QWORD (our function pointer) at address `0x55555557d090`. Let's dereference it:
```
print /x *(long unsigned int*)(0x55555557d090)
```
```
$20 = 0x7ffff7ffb000
```

So the function pointer is pointing to address `0x7ffff7ffb000`, which is our trampoline function. Let's disassemble it:
```
x/10i 0x7ffff7ffb000
```
```
   0x7ffff7ffb000:      endbr64
   0x7ffff7ffb004:      push   %rbp
   0x7ffff7ffb005:      mov    %rsp,%rbp
   0x7ffff7ffb008:      sub    $0x10,%rsp
   0x7ffff7ffb00c:      movabs $0x555555561395,%rax
   0x7ffff7ffb016:      jmpq   *%rax
   0x7ffff7ffb018:      add    %al,(%rax)
   0x7ffff7ffb01a:      add    %al,(%rax)
   0x7ffff7ffb01c:      add    %al,(%rax)
   0x7ffff7ffb01e:      add    %al,(%rax)
```

You can see that our trampoline contains the first 4 instructions that were replaced when the `JMP` patch was placed in our target function, followed by a jump back to address `0x555555561395`, which was disassembled earlier. The trailing `add %al,(%rax)` lines are simply zero bytes past the end of the trampoline.

This should give you an idea of how the control flow modification is achieved.

# INT3 Patching

There is another method of function detouring which involves placing `INT3` breakpoints at the start of the target function in memory. `INT3` breakpoints are encoded with the `0xCC` opcode:

```
/* Generate int3 instruction. */
uint8_t *cdl_gen_swbp(uint8_t *code)
{
    *(code + 0x0) = 0xCC;
    return code;
}
```

So rather than placing a `JMP` patch to the detour, we simply write the byte `0xCC` to the start of the target function, being careful to NOP the unused bytes.

Once the `RIP` register reaches the address of an `INT3` breakpoint, the Linux kernel sends a `SIGTRAP` signal to the process. We can register our own signal handler, but we need some additional information about the signal, namely its context. A context is the state of a program's registers and stack. We need this information to compare the breakpoint's `RIP` value to any active global software breakpoints.

This is how the signal handler is registered in `cdl86`:

```
struct sigaction sa = {};

/* Initialise cdl signal handler. */
if (!cdl_swbp_init)
{
    /* Request signal context info which
     * is required for RIP register comparison.
     */
    sa.sa_flags = SA_SIGINFO | SA_ONESHOT;
    sa.sa_sigaction = (void *)cdl_swbp_handler;
    sigaction(SIGTRAP, &sa, NULL);
    cdl_swbp_init = true;
}
...
```

Note the use of `SA_SIGINFO` to get context information. The software breakpoint handler is then defined as follows:

```
void cdl_swbp_handler(int sig, siginfo_t *info, struct ucontext_t *context)
{
    int i = 0x0;
    bool active = false;
    uint8_t *bp_addr = NULL;

    /* RIP register points to the instruction after the
     * int3 breakpoint so we subtract 0x1.
     */
    bp_addr = (uint8_t *)(context->uc_mcontext.gregs[REG_RIP] - 0x1);

    /* Iterate over all breakpoint structs. */
    for (i = 0; i < cdl_swbp_size; i++)
    {
        active = cdl_swbp_hk[i].active;
        /* Compare breakpoint addresses. */
        if (bp_addr == cdl_swbp_hk[i].bp_addr)
        {
            /* Update RIP and reset context. */
            context->uc_mcontext.gregs[REG_RIP] = (greg_t)cdl_swbp_hk[i].detour;
            setcontext(context);
        }
    }
}
```

Note that if the `RIP` value matches any known breakpoint, the `RIP` value for the current context is set to the detour address and the new context is applied using `setcontext()`.

A trampoline function similar to the one used for our `JMP` patch is allocated and serves the same purpose.

# Code Injection

`cdl86` assumes that you are operating in the address space of the target process. Therefore code injection is often required in practice, which calls for the use of an [injector](https://github.com/lunarjournal/robocraft/tree/main/injector).

Once a shared library (`.so`) has been injected you can use the following code to get the base address of the main executable module:

```
#include <dlfcn.h>
#include <inttypes.h>
#include <link.h>
#include <stdio.h>

int __attribute__((constructor)) init()
{
    ...
    struct link_map *lm = dlopen(0, RTLD_NOW);
    printf("base = %" PRIx64, lm->l_addr);
    ...
}
```

Or find the address of a function by symbol name:

```
void* dl_handle = dlopen(NULL, RTLD_LAZY);
void* add_ptr = dlsym(dl_handle, "add");
```

# API

The API for the `cdl86` library is shown below:

```
struct cdl_jmp_patch cdl_jmp_attach(void **target, void *detour);
struct cdl_swbp_patch cdl_swbp_attach(void **target, void *detour);
void cdl_jmp_detach(struct cdl_jmp_patch *jmp_patch);
void cdl_swbp_detach(struct cdl_swbp_patch *swbp_patch);
void cdl_jmp_dbg(struct cdl_jmp_patch *jmp_patch);
void cdl_swbp_dbg(struct cdl_swbp_patch *swbp_patch);
```

# Source code

You can find the `cdl86` source code [here](https://github.com/lunarjournal/cdl86).
This project was inspired by some reverse engineering research I did for my [undergraduate thesis](https://github.com/lunarjournal/research).
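
As a closing aside on the `JMP` patching procedure described earlier, the first and fifth steps (finding how many bytes to patch and how many NOPs to add) can be sketched as below. The patch may not cut an instruction in half, so the patched span is the smallest run of whole instructions that covers at least 12 bytes. The helper name `patch_span` is hypothetical and not part of the `cdl86` API; the instruction lengths are assumed to come from a disassembler and are passed in as a plain array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the cdl86 API. Given the lengths of
 * the instructions at the start of a function, return the number of
 * bytes covered by the fewest whole instructions that span at least
 * min_bytes. Returns 0 if the listed instructions are too short. */
size_t patch_span(const uint8_t *insn_len, size_t count, size_t min_bytes)
{
    size_t total = 0;

    assert(insn_len != NULL);

    for (size_t i = 0; i < count; i++) {
        total += insn_len[i];
        if (total >= min_bytes)
            return total; /* ends on an instruction boundary */
    }
    return 0;
}
```

For the `add()` prologue shown earlier the instruction lengths are 4, 1, 3 and 4 bytes, so the span is exactly 12 and no NOP padding is needed. A prologue with lengths 4, 1, 3, 3 and 3 would give a span of 14, leaving 2 bytes after the 12-byte patch to fill with NOPs.
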