Aug 12, 2022

# Pass by Value vs Pass by Pointer

When I was in college, our computer science instructor introduced us to the struct and then gave us the following rule: when you pass a struct into or out of a function, you should use a pointer. The reasoning was that a struct is full of other values, and you don’t want to copy all of that onto the stack.

I followed that rule for a long time, and I see a lot of other programmers (particularly C programmers) who still follow it. Whether out of habit or preference, I don’t know.

At some point I decided that all of that struct pointer stuff made for some ugly, hard-to-read code, especially with something like vector math, and so I switched to passing nearly all structs by value unless the function needed to modify the struct itself.

But I wondered: did it really matter from a performance perspective? What about large structs? And are there any non-performance benefits?

## Small Struct

A good example of a reasonably small struct is something like a vec4, which is four floats totaling 16 bytes.

```c
typedef struct {
    float x;
    float y;
    float z;
    float w;
} vec4;
```

The old school C-style way of doing an add operation with two of them would be the following:

```c
void vec4_add(vec4* out, vec4* a, vec4* b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}

vec4 a = {1.0f, 3.0f, 5.0f, 7.0f};
vec4 b = {2.0f, 4.0f, 6.0f, 8.0f};
vec4 c;

vec4_add(&c, &a, &b);
```

The reason I think this makes for ugly code is that you need to declare your output on a separate line from the function and its inputs, and I don’t like passing the addresses of things to functions when not necessary.

My preferred way looks like this:

```c
vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 result;

    result.x = a.x + b.x;
    result.y = a.y + b.y;
    result.z = a.z + b.z;
    result.w = a.w + b.w;

    return result;
}

vec4 a = {1.0f, 3.0f, 5.0f, 7.0f};
vec4 b = {2.0f, 4.0f, 6.0f, 8.0f};

vec4 c = vec4_add(a, b);
```

In this version the entire thing reads like the math operation it’s performing: `c = a + b`
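A side benefit of this style is that calls compose: an intermediate result can feed straight into the next call without a named temporary. A minimal sketch, repeating the definitions so it stands alone, with `vec4_scale` and `midpoint` as hypothetical companion functions I made up for the example:

```c
typedef struct { float x, y, z, w; } vec4;

vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = {a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w};
    return r;
}

vec4 vec4_scale(vec4 v, float s)  /* hypothetical helper for this sketch */
{
    vec4 r = {v.x * s, v.y * s, v.z * s, v.w * s};
    return r;
}

/* By value, the expression reads like the math: m = (a + b) * 0.5 */
vec4 midpoint(vec4 a, vec4 b)
{
    return vec4_scale(vec4_add(a, b), 0.5f);
}
```

With the pointer style, the same computation needs a named temporary to hold the sum before it can be scaled.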

Great, so it looks better (in my opinion), but what about performance? Let’s consult the almighty Godbolt.

The compiler that I’ll be using in the tests is x86-64 clang 14.0.0, and I’ve removed the boilerplate instructions that occur at the start (setting up the stack frame) and end (returning to the caller) of every function, as they are the same for all examples.

### Pass by Pointer (-O0)

First, the pointer version without optimizations.

```asm
# Put the three pointers onto the stack
mov qword ptr [rbp - 8], rdi        # out
mov qword ptr [rbp - 16], rsi       # a
mov qword ptr [rbp - 24], rdx       # b

# out->x = a->x + b->x;
mov rax, qword ptr [rbp - 16]       # get address of a
movss xmm0, dword ptr [rax]         # place a->x in xmm0
mov rax, qword ptr [rbp - 24]       # get address of b
addss xmm0, dword ptr [rax]         # add b->x to a->x in xmm0
mov rax, qword ptr [rbp - 8]        # get address of out
movss dword ptr [rax], xmm0         # put sum in out->x

# out->y = a->y + b->y;
mov rax, qword ptr [rbp - 16]       # get address of a
movss xmm0, dword ptr [rax + 4]     # place a->y in xmm0
mov rax, qword ptr [rbp - 24]       # get address of b
addss xmm0, dword ptr [rax + 4]     # add b->y to a->y in xmm0
mov rax, qword ptr [rbp - 8]        # get address of out
movss dword ptr [rax + 4], xmm0     # put sum in out->y

# out->z = a->z + b->z;
mov rax, qword ptr [rbp - 16]       # get address of a
movss xmm0, dword ptr [rax + 8]     # place a->z in xmm0
mov rax, qword ptr [rbp - 24]       # get address of b
addss xmm0, dword ptr [rax + 8]     # add b->z to a->z in xmm0
mov rax, qword ptr [rbp - 8]        # get address of out
movss dword ptr [rax + 8], xmm0     # put sum in out->z

# out->w = a->w + b->w;
mov rax, qword ptr [rbp - 16]       # get address of a
movss xmm0, dword ptr [rax + 12]    # place a->w in xmm0
mov rax, qword ptr [rbp - 24]       # get address of b
addss xmm0, dword ptr [rax + 12]    # add b->w to a->w in xmm0
mov rax, qword ptr [rbp - 8]        # get address of out
movss dword ptr [rax + 12], xmm0    # put sum in out->w
```

Total Instructions: 27

I’ve annotated what’s happening to make it more clear. The key thing to notice is that each add requires six instructions, and three of them are fetches from memory.

### Pass by Value (-O0)

Let’s compare that to the alternative:

```asm
# Put a and b onto the stack
movlpd qword ptr [rbp - 32], xmm0   # a.x, a.y
movlpd qword ptr [rbp - 24], xmm1   # a.z, a.w
movlpd qword ptr [rbp - 48], xmm2   # b.x, b.y
movlpd qword ptr [rbp - 40], xmm3   # b.z, b.w

# result.x = a.x + b.x;
movss xmm0, dword ptr [rbp - 32]    # put a.x in xmm0
movss xmm1, dword ptr [rbp - 48]    # put b.x in xmm1
addss xmm0, xmm1                    # add b.x to a.x in xmm0
movss dword ptr [rbp - 16], xmm0    # put sum into result.x

# result.y = a.y + b.y;
movss xmm0, dword ptr [rbp - 28]    # put a.y in xmm0
movss xmm1, dword ptr [rbp - 44]    # put b.y in xmm1
addss xmm0, xmm1                    # add b.y to a.y in xmm0
movss dword ptr [rbp - 12], xmm0    # put sum into result.y

# result.z = a.z + b.z;
movss xmm0, dword ptr [rbp - 24]    # put a.z in xmm0
movss xmm1, dword ptr [rbp - 40]    # put b.z in xmm1
addss xmm0, xmm1                    # add b.z to a.z in xmm0
movss dword ptr [rbp - 8], xmm0     # put sum into result.z

# result.w = a.w + b.w;
movss xmm0, dword ptr [rbp - 20]    # put a.w in xmm0
movss xmm1, dword ptr [rbp - 36]    # put b.w in xmm1
addss xmm0, xmm1                    # add b.w to a.w in xmm0
movss dword ptr [rbp - 4], xmm0     # put sum into result.w

# Put result into xmm0 and xmm1
movsd xmm0, qword ptr [rbp - 16]    # result.x, result.y
movsd xmm1, qword ptr [rbp - 8]     # result.z, result.w
```

Total Instructions: 22

To my eyes, even the assembly is more readable in this form. We see a clear pattern: load, load, add, store. Each add is four instructions this time instead of six, and the loads come from the function’s own stack frame rather than through pointers to data somewhere else in memory.

There could be performance benefits here in two forms: fewer instructions (22 vs 27) and no pointer-chasing reads. Because all of the operations happen on values already in the stack frame, which is almost certainly hot in cache, we don’t have to worry about the latency of a read from arbitrary memory.

The compiler does use four xmm registers when two could have held everything (each fits four floats). As far as I can tell, that’s the calling convention at work: the System V ABI passes a 16-byte struct of floats split across two xmm registers, eight bytes apiece, so a arrives in xmm0 and xmm1 and b in xmm2 and xmm3.

What if we turn optimizations on?

### Pass by Pointer (-O2)

```asm
movups xmm0, xmmword ptr [rsi]      # put a in xmm0
movups xmm1, xmmword ptr [rdx]      # put b in xmm1
addps xmm1, xmm0                    # add a and b
movups xmmword ptr [rdi], xmm1      # put sum into out
```

Total Instructions: 4

The compiler realizes that a and b are four floats each which happens to be the size of the xmm registers, so it saves a lot of effort by simply placing the four floats of a into xmm0 and the four floats of b into xmm1 and doing a single add instruction that takes care of the four individual adds.

Surely passing by value can’t beat that?

### Pass by Value (-O2)

```asm
addps xmm0, xmm2
addps xmm1, xmm3
```

Total Instructions: 2

It doesn’t need to load anything from memory, so it can do everything within the registers themselves, in just two instructions. The reason it needs two add instructions while the pointer version required only one appears to be the calling convention: each 16-byte vec4 is split across two xmm registers, so a arrives in xmm0 and xmm1, b arrives in xmm2 and xmm3, and the result must be returned in xmm0 and xmm1, meaning each eight-byte half gets its own add.
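If you want the packed addps form guaranteed, rather than hoping the optimizer finds it, SSE intrinsics let you write it explicitly. A minimal sketch, assuming an x86 target with SSE; `vec4_addps` is a name I made up for this example:

```c
#include <immintrin.h>  /* SSE intrinsics */

typedef struct { float x, y, z, w; } vec4;

vec4 vec4_addps(vec4 a, vec4 b)
{
    __m128 va = _mm_loadu_ps(&a.x);  /* load a's four floats into one register */
    __m128 vb = _mm_loadu_ps(&b.x);  /* load b's four floats into another */
    __m128 vr = _mm_add_ps(va, vb);  /* one packed add: the addps instruction */
    vec4 r;
    _mm_storeu_ps(&r.x, vr);         /* store the four sums */
    return r;
}
```

This works because the four floats of a vec4 are contiguous in memory, so a single unaligned load picks up the whole struct.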

## Large Struct

The previous example had the benefit that the struct only contained four floats totaling sixteen bytes, and it also had the advantage that an entire vec4 could fit into a single xmm register. Let’s try another common game math structure: the 4x4 matrix.

```c
typedef struct {
    float e00; float e01; float e02; float e03;
    float e10; float e11; float e12; float e13;
    float e20; float e21; float e22; float e23;
    float e30; float e31; float e32; float e33;
} mat4;
```

For our example function, let’s do something a bit contrived: matrix addition. I’ve never actually needed it, but it keeps the assembly shorter than something like matrix multiplication would.

The pass-by-pointer form would look like this:

```c
void mat4_add(mat4* out, mat4* a, mat4* b)
{
    out->e00 = a->e00 + b->e00;
    out->e01 = a->e01 + b->e01;
    out->e02 = a->e02 + b->e02;
    out->e03 = a->e03 + b->e03;

    out->e10 = a->e10 + b->e10;
    out->e11 = a->e11 + b->e11;
    out->e12 = a->e12 + b->e12;
    out->e13 = a->e13 + b->e13;

    out->e20 = a->e20 + b->e20;
    out->e21 = a->e21 + b->e21;
    out->e22 = a->e22 + b->e22;
    out->e23 = a->e23 + b->e23;

    out->e30 = a->e30 + b->e30;
    out->e31 = a->e31 + b->e31;
    out->e32 = a->e32 + b->e32;
    out->e33 = a->e33 + b->e33;
}
```

And the pass-by-value form:

```c
mat4 mat4_add(mat4 a, mat4 b)
{
    mat4 result;

    result.e00 = a.e00 + b.e00;
    result.e01 = a.e01 + b.e01;
    result.e02 = a.e02 + b.e02;
    result.e03 = a.e03 + b.e03;

    result.e10 = a.e10 + b.e10;
    result.e11 = a.e11 + b.e11;
    result.e12 = a.e12 + b.e12;
    result.e13 = a.e13 + b.e13;

    result.e20 = a.e20 + b.e20;
    result.e21 = a.e21 + b.e21;
    result.e22 = a.e22 + b.e22;
    result.e23 = a.e23 + b.e23;

    result.e30 = a.e30 + b.e30;
    result.e31 = a.e31 + b.e31;
    result.e32 = a.e32 + b.e32;
    result.e33 = a.e33 + b.e33;

    return result;
}
```

### Pass by Pointer (-O0)

```asm
# Put the three pointers onto the stack
mov qword ptr [rbp - 8], rdi        # out
mov qword ptr [rbp - 16], rsi       # a
mov qword ptr [rbp - 24], rdx       # b

# out->e00 = a->e00 + b->e00;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax], xmm0

# out->e01 = a->e01 + b->e01;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 4]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 4]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 4], xmm0

# out->e02 = a->e02 + b->e02;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 8]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 8]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 8], xmm0

# out->e03 = a->e03 + b->e03;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 12]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 12]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 12], xmm0

# out->e10 = a->e10 + b->e10;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 16]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 16]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 16], xmm0

# out->e11 = a->e11 + b->e11;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 20]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 20]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 20], xmm0

# out->e12 = a->e12 + b->e12;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 24]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 24]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 24], xmm0

# out->e13 = a->e13 + b->e13;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 28]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 28]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 28], xmm0

# out->e20 = a->e20 + b->e20;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 32]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 32]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 32], xmm0

# out->e21 = a->e21 + b->e21;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 36]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 36]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 36], xmm0

# out->e22 = a->e22 + b->e22;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 40]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 40]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 40], xmm0

# out->e23 = a->e23 + b->e23;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 44]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 44]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 44], xmm0

# out->e30 = a->e30 + b->e30;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 48]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 48]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 48], xmm0

# out->e31 = a->e31 + b->e31;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 52]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 52]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 52], xmm0

# out->e32 = a->e32 + b->e32;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 56]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 56]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 56], xmm0

# out->e33 = a->e33 + b->e33;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 60]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 60]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 60], xmm0
```

Total Instructions: 98

It’s about the same as the vec4 version except that there are four times as many add sections because there are four times as many floats involved. There really isn’t any difference otherwise. It still places the pointers onto the stack frame and then loads the values from memory into registers, adds, and places the values back into memory.

### Pass by Value (-O0)

```asm
# Store the stack addresses of a and b in registers (for easier offsets, I assume)
mov rax, rdi
lea rcx, [rbp + 80]                 # b
lea rdx, [rbp + 16]                 # a

# result.e00 = a.e00 + b.e00;
movss xmm0, dword ptr [rdx]
addss xmm0, dword ptr [rcx]
movss dword ptr [rdi], xmm0

# result.e01 = a.e01 + b.e01;
movss xmm0, dword ptr [rdx + 4]
addss xmm0, dword ptr [rcx + 4]
movss dword ptr [rdi + 4], xmm0

# result.e02 = a.e02 + b.e02;
movss xmm0, dword ptr [rdx + 8]
addss xmm0, dword ptr [rcx + 8]
movss dword ptr [rdi + 8], xmm0

# result.e03 = a.e03 + b.e03;
movss xmm0, dword ptr [rdx + 12]
addss xmm0, dword ptr [rcx + 12]
movss dword ptr [rdi + 12], xmm0

# result.e10 = a.e10 + b.e10;
movss xmm0, dword ptr [rdx + 16]
addss xmm0, dword ptr [rcx + 16]
movss dword ptr [rdi + 16], xmm0

# result.e11 = a.e11 + b.e11;
movss xmm0, dword ptr [rdx + 20]
addss xmm0, dword ptr [rcx + 20]
movss dword ptr [rdi + 20], xmm0

# result.e12 = a.e12 + b.e12;
movss xmm0, dword ptr [rdx + 24]
addss xmm0, dword ptr [rcx + 24]
movss dword ptr [rdi + 24], xmm0

# result.e13 = a.e13 + b.e13;
movss xmm0, dword ptr [rdx + 28]
addss xmm0, dword ptr [rcx + 28]
movss dword ptr [rdi + 28], xmm0

# result.e20 = a.e20 + b.e20;
movss xmm0, dword ptr [rdx + 32]
addss xmm0, dword ptr [rcx + 32]
movss dword ptr [rdi + 32], xmm0

# result.e21 = a.e21 + b.e21;
movss xmm0, dword ptr [rdx + 36]
addss xmm0, dword ptr [rcx + 36]
movss dword ptr [rdi + 36], xmm0

# result.e22 = a.e22 + b.e22;
movss xmm0, dword ptr [rdx + 40]
addss xmm0, dword ptr [rcx + 40]
movss dword ptr [rdi + 40], xmm0

# result.e23 = a.e23 + b.e23;
movss xmm0, dword ptr [rdx + 44]
addss xmm0, dword ptr [rcx + 44]
movss dword ptr [rdi + 44], xmm0

# result.e30 = a.e30 + b.e30;
movss xmm0, dword ptr [rdx + 48]
addss xmm0, dword ptr [rcx + 48]
movss dword ptr [rdi + 48], xmm0

# result.e31 = a.e31 + b.e31;
movss xmm0, dword ptr [rdx + 52]
addss xmm0, dword ptr [rcx + 52]
movss dword ptr [rdi + 52], xmm0

# result.e32 = a.e32 + b.e32;
movss xmm0, dword ptr [rdx + 56]
addss xmm0, dword ptr [rcx + 56]
movss dword ptr [rdi + 56], xmm0

# result.e33 = a.e33 + b.e33;
movss xmm0, dword ptr [rdx + 60]
addss xmm0, dword ptr [rcx + 60]
movss dword ptr [rdi + 60], xmm0
```

Total Instructions: 51

The result here is a bit interesting. It first takes the stack addresses of a and b and places them into the registers rdx and rcx. It then uses offsets from rdx and rcx to access the values rather than offsets from the stack frame pointer (rbp) itself.

Also, each add is now three instructions which is one fewer than the pass-by-value vec4_add. Here it’s able to move one value into xmm0 and then perform the add directly from the value on the stack.

So the pass-by-pointer version is 98 instructions while the pass-by-value version is only 51.

### Pass by Pointer (-O2)

```asm
# Add four floats
movups xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rdx]
addps xmm1, xmm0
movups xmmword ptr [rdi], xmm1

# Add four floats
movups xmm0, xmmword ptr [rsi + 16]
movups xmm1, xmmword ptr [rdx + 16]
addps xmm1, xmm0
movups xmmword ptr [rdi + 16], xmm1

# Add four floats
movups xmm0, xmmword ptr [rsi + 32]
movups xmm1, xmmword ptr [rdx + 32]
addps xmm1, xmm0
movups xmmword ptr [rdi + 32], xmm1

# Add four floats
movups xmm0, xmmword ptr [rsi + 48]
movups xmm1, xmmword ptr [rdx + 48]
addps xmm1, xmm0
movups xmmword ptr [rdi + 48], xmm1
```

Total Instructions: 16

As with the vec4 version, it’s able to add in four-float chunks, which reduces the instruction count significantly, but it still has to load the values through pointers from memory, which slows things a bit.

### Pass by Value (-O2)

```asm
# Place return value into rax
mov rax, rdi

# Add four floats
movaps xmm0, xmmword ptr [rsp + 8]
addps xmm0, xmmword ptr [rsp + 72]
movups xmmword ptr [rdi], xmm0

# Add four floats
movaps xmm0, xmmword ptr [rsp + 24]
addps xmm0, xmmword ptr [rsp + 88]
movups xmmword ptr [rdi + 16], xmm0

# Add four floats
movaps xmm0, xmmword ptr [rsp + 40]
addps xmm0, xmmword ptr [rsp + 104]
movups xmmword ptr [rdi + 32], xmm0

# Add four floats
movaps xmm0, xmmword ptr [rsp + 56]
addps xmm0, xmmword ptr [rsp + 120]
movups xmmword ptr [rdi + 48], xmm0
```

Total Instructions: 13

Again, it’s able to add in chunks of four floats but there is less overhead because everything is happening within the stack frame.

## Conclusion

After doing these experiments I’m convinced that in most cases (on a PC with a modern x86 CPU) passing by value is the way to go, in regards to both readability and performance.

Passing by pointer leads to reads from memory which can be slow and often require additional instructions.

Passing by pointer can cause ambiguity about ownership because anything with the pointer can do whatever it wants with the data.

Passing by pointer can lead to pointer aliasing, where the compiler can’t prove that two pointers don’t refer to the same data and so must skip certain optimizations.
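In C, you can resolve the aliasing problem while keeping pointers by using the restrict qualifier, which promises the compiler that the pointers never overlap. A sketch using the vec4 from earlier; `vec4_add_nr` is a made-up name:

```c
typedef struct { float x, y, z, w; } vec4;

/* restrict promises the compiler that out, a, and b never point to
 * overlapping memory, so it is free to reorder and vectorize the
 * loads and stores. Passing overlapping pointers here would be
 * undefined behavior, so the promise is on the caller. */
void vec4_add_nr(vec4* restrict out,
                 const vec4* restrict a,
                 const vec4* restrict b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}
```

This keeps the pointer calling style but gives the optimizer the same freedom it gets with pass by value.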

Passing by value keeps the work within the function’s own stack frame and avoids fetching data through pointers.

Passing by value makes for more readable function calls as you clearly see that the inputs are the function arguments and the output is the return value from the function.

Passing by value effectively makes the caller’s variable const: the function receives its own copy and cannot modify the original.
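That last point is easy to demonstrate: the callee receives a copy, so even a deliberate modification never reaches the caller’s variable. A tiny sketch; `first_component_doubled` is a made-up name:

```c
typedef struct { float x, y, z, w; } vec4;

float first_component_doubled(vec4 v)
{
    v.x *= 2.0f;   /* modifies only the local copy */
    return v.x;
}

/* After calling this on some vec4 a, a.x is unchanged in the caller. */
```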

That’s not to say I never pass by pointer. If I have a function that needs to modify an existing piece of data then that is a good use for a pointer. It wouldn’t make sense to take in a copy, modify the copy, and return a new copy. Or if a struct were truly huge (hundreds of bytes or more), then I would use a pointer, or at the very least check the compiled assembly to see what it was doing.
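For the modify-in-place case, the pointer version is the natural fit, since the caller’s data itself is the thing being changed. A sketch; `vec4_scale_inplace` is a made-up name:

```c
typedef struct { float x, y, z, w; } vec4;

/* Scales v in place. A pointer makes sense here: copying the struct,
 * scaling the copy, and returning it would just add busywork. */
void vec4_scale_inplace(vec4* v, float s)
{
    v->x *= s;
    v->y *= s;
    v->z *= s;
    v->w *= s;
}
```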

But for a function that only needs to read from a struct, and/or return a brand new struct, I think it makes sense to pass by value.

Last Edited: Dec 20, 2022