Code, once written by the programmer and compiled into an executable, is usually treated as immutable. Yet at the lowest level of execution, this immutability is more of a convention than a rule: it is a policy imposed by the operating system, not a property of the instructions themselves. In reality, code is just bytes stored in memory, in the text section, and like any other location in memory, it can be modified at runtime, provided you have the right privileges.

This simple realisation opens the door to several possibilities, the greatest being the ability to create programs that mutate themselves at runtime: functions that are programmed to do one task but end up doing something else entirely. In this post, we explore how to create such mutating programs, what uses they have, and why they can be dangerous.

The Background

In a program’s memory, code is stored as instructions in the text section. These instructions are stored in the typical opcode-operand format. Now, if a different instruction changes the value of either the opcode or the operand at the memory location of the first instruction, the first instruction has been mutated. The CPU itself won’t object to this, because it only reads and interprets the memory.

The only thing stopping someone from doing so is the operating system. In a typical program, different sections of memory are assigned different privileges. Each segment has three permission bits: read, write, and execute. The text segment of the program conventionally has only read and execute privileges. This restriction serves several purposes, such as preventing attacks that inject executable code and preventing bugs from corrupting program logic. Code that never changes is also friendlier to the CPU’s instruction caches, which have to be invalidated whenever the underlying bytes are modified.

The Implementation

section .data
    SYSCALL_write equ 4             ; equ makes these assemble-time constants;
    SYSCALL_mprotect equ 125        ; with db, the names would resolve to addresses
    
    PROT_READ equ 1
    PROT_WRITE equ 2
    PROT_EXEC equ 4

    PROT_RWX equ (PROT_READ | PROT_WRITE | PROT_EXEC)

    newline db 10

section .bss
    result resb 4       ; Result buffer to temporarily store value

section .text
    global _start

modifiable_func:
instr_to_modify:
    mov eax, 5          ; Instruction that we intend to modify
    ret

_start:
    mov eax, SYSCALL_mprotect       ; mprotect syscall
    mov ebx, modifiable_func        ; ebx contains memory address (for mprotect)
    and ebx, 0xFFFFF000             ; align the memory address to the page boundary
    mov ecx, 4096                   ; ecx contains the size of the memory block (for mprotect)
    mov edx, PROT_RWX               ; edx contains the new protection flags (for mprotect)
    int 0x80

    call modifiable_func            ; First time we call, the value in eax will be 5

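    add eax, '0'                    ; convert the single digit to its ASCII character
    mov [result], eax               ; store it in the buffer that write will print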
    mov eax, SYSCALL_write
    mov ebx, 1
    mov ecx, result
    mov edx, 1
    int 0x80                        ; Output : 5

    mov eax, SYSCALL_write
    mov ebx, 1
    mov ecx, newline
    mov edx, 1
    int 0x80

    mov byte [instr_to_modify + 1], 6 ; Modifying the operand (byte 0 is the opcode, byte 1 the immediate's low byte)

    call modifiable_func            ; Second time we call, the value in eax will be 6

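    add eax, '0'                    ; convert the single digit to its ASCII character
    mov [result], eax               ; store it in the buffer that write will print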
    mov eax, SYSCALL_write
    mov ebx, 1
    mov ecx, result
    mov edx, 1
    int 0x80                        ; Output : 6

    mov eax, 1
    xor ebx, ebx
    int 0x80

In the above program, we declare the function we intend to modify in the text section, and add a label to the specific instruction that we want to modify.

At the very beginning of _start, we make a system call to mprotect, the syscall used to change the access protections of the calling process’s memory pages. It takes three arguments: a start address, which must be aligned to a page boundary; a length in bytes; and the new protection flags to assign to that range. In this case, we pass the page-aligned address of the modifiable function, a length of 4096 (one page), and flags requesting read, write, and execute privileges.

Then, the program calls the modifiable function, which places 5 in the eax register. It then prints the contents of eax. After this, the program actually starts mutating code. It goes to the memory location of the instruction to be modified, moves one byte further (skipping the one-byte opcode), and changes the operand from a 5 to a 6. After calling the function again and printing the contents of eax, we can see that the mutation has succeeded.

Applications

Practical applications of self-modifying code are few and far between. From an optimization standpoint, mutating code can be used to optimize loops, for example by unrolling them or adjusting loop counters based on runtime data sizes. However, any benefit gained from this is likely to be negated by the performance drop caused by invalidating cached instructions.

A more realistic use is in DRM protection and anti-reverse-engineering measures. For example, if a program detects that it is being run inside a debugger, it can deliberately obfuscate its own code to hinder analysis and subsequent reverse engineering.

However, this paradigm also lends itself to more malicious applications. Using this technique, one can create a polymorphic virus, with code that mutates with every infection. Changing minor things, like the instruction order or the registers used, can be enough to alter the signature and evade detection. This technique can also be used to create runtime metamorphic viruses, where the virus rewrites its own code completely while still preserving its function.

Conclusion

Just like most of my previous projects, this one is not something that should be applied in the real world. However, watching a function return different values from the exact same function call, knowing I had rewritten its instructions mid-execution, gave me a visceral understanding of how fragile the boundary between code and data really is. The full source code of the program can be found here.