Aug 12, 2022
Pass by Value vs Pass by Pointer
When I was in college, our computer science instructor introduced us to the struct and then gave
us the following rule: when you pass a struct into or out of a function, you should use a pointer.
The reasoning was that a struct is full of other values, and you don’t want to pass all of that on
the stack.
I followed that rule for a long time, and I see a lot of other programmers (particularly C
programmers) who still follow it. Whether out of habit or preference, I don’t know.
At some point I decided that all of that struct pointer stuff made for some ugly, hard-to-read
code, especially with something like vector math, and so I switched to passing nearly all structs
by value unless the function needed to modify the struct itself.
But I wondered: did it really matter from a performance perspective? What about large structs? And
are there any non-performance benefits?
Small Struct
A good example of a reasonably small struct is something like a vec4, which is four floats
totaling 16 bytes.
```c
typedef struct {
    float x;
    float y;
    float z;
    float w;
} vec4;
```
The old school C-style way of doing an add operation with two of them would be the following:
```c
void
vec4_add(vec4* out, vec4* a, vec4* b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}

vec4 a = {1.0f, 3.0f, 5.0f, 7.0f};
vec4 b = {2.0f, 4.0f, 6.0f, 8.0f};
vec4 c;
vec4_add(&c, &a, &b);
```
The reason I think this makes for ugly code is that you need to declare your output on a separate
line from the function and its inputs, and I don’t like passing the addresses of things to functions
when not necessary.
My preferred way looks like this:
```c
vec4
vec4_add(vec4 a, vec4 b)
{
    vec4 result;
    result.x = a.x + b.x;
    result.y = a.y + b.y;
    result.z = a.z + b.z;
    result.w = a.w + b.w;
    return result;
}

vec4 a = {1.0f, 3.0f, 5.0f, 7.0f};
vec4 b = {2.0f, 4.0f, 6.0f, 8.0f};
vec4 c = vec4_add(a, b);
```
In this version the entire thing reads like the math operation it’s performing: c = a + b.
Great, so it looks better (in my opinion), but what about performance? Let’s consult the almighty
Godbolt.
The compiler that I’ll be using in the tests is x86-64 clang 14.0.0, and I’ve removed the
boilerplate instructions that occur at the start (setting up the stack frame) and end (returning to
the caller) of every function, as they are the same for all examples.
Pass by Pointer (-O0)
First, the pointer version without optimizations.
```asm
# Put the three pointers onto the stack
mov qword ptr [rbp - 8], rdi          # out
mov qword ptr [rbp - 16], rsi         # a
mov qword ptr [rbp - 24], rdx         # b
# out->x = a->x + b->x;
mov rax, qword ptr [rbp - 16]         # get address of a
movss xmm0, dword ptr [rax]           # place a->x in xmm0
mov rax, qword ptr [rbp - 24]         # get address of b
addss xmm0, dword ptr [rax]           # add b->x to a->x in xmm0
mov rax, qword ptr [rbp - 8]          # get address of out
movss dword ptr [rax], xmm0           # put sum in out->x
# out->y = a->y + b->y;
mov rax, qword ptr [rbp - 16]         # get address of a
movss xmm0, dword ptr [rax + 4]       # place a->y in xmm0
mov rax, qword ptr [rbp - 24]         # get address of b
addss xmm0, dword ptr [rax + 4]       # add b->y to a->y in xmm0
mov rax, qword ptr [rbp - 8]          # get address of out
movss dword ptr [rax + 4], xmm0       # put sum in out->y
# out->z = a->z + b->z;
mov rax, qword ptr [rbp - 16]         # get address of a
movss xmm0, dword ptr [rax + 8]       # place a->z in xmm0
mov rax, qword ptr [rbp - 24]         # get address of b
addss xmm0, dword ptr [rax + 8]       # add b->z to a->z in xmm0
mov rax, qword ptr [rbp - 8]          # get address of out
movss dword ptr [rax + 8], xmm0       # put sum in out->z
# out->w = a->w + b->w;
mov rax, qword ptr [rbp - 16]         # get address of a
movss xmm0, dword ptr [rax + 12]      # place a->w in xmm0
mov rax, qword ptr [rbp - 24]         # get address of b
addss xmm0, dword ptr [rax + 12]      # add b->w to a->w in xmm0
mov rax, qword ptr [rbp - 8]          # get address of out
movss dword ptr [rax + 12], xmm0      # put sum in out->w
```
Total Instructions: 27
I’ve annotated what’s happening to make it more clear. The key thing to notice is that each add
requires six instructions, and three of them are fetches from memory.
Pass by Value (-O0)
Let’s compare that to the alternative:
```asm
# Put a and b onto stack
movlpd qword ptr [rbp - 32], xmm0     # a.x, a.y
movlpd qword ptr [rbp - 24], xmm1     # a.z, a.w
movlpd qword ptr [rbp - 48], xmm2     # b.x, b.y
movlpd qword ptr [rbp - 40], xmm3     # b.z, b.w
# result.x = a.x + b.x;
movss xmm0, dword ptr [rbp - 32]      # put a.x in xmm0
movss xmm1, dword ptr [rbp - 48]      # put b.x in xmm1
addss xmm0, xmm1                      # add b.x to a.x in xmm0
movss dword ptr [rbp - 16], xmm0      # put sum into result.x
# result.y = a.y + b.y;
movss xmm0, dword ptr [rbp - 28]      # put a.y in xmm0
movss xmm1, dword ptr [rbp - 44]      # put b.y in xmm1
addss xmm0, xmm1                      # add b.y to a.y in xmm0
movss dword ptr [rbp - 12], xmm0      # put sum into result.y
# result.z = a.z + b.z;
movss xmm0, dword ptr [rbp - 24]      # put a.z in xmm0
movss xmm1, dword ptr [rbp - 40]      # put b.z in xmm1
addss xmm0, xmm1                      # add b.z to a.z in xmm0
movss dword ptr [rbp - 8], xmm0       # put sum into result.z
# result.w = a.w + b.w;
movss xmm0, dword ptr [rbp - 20]      # put a.w in xmm0
movss xmm1, dword ptr [rbp - 36]      # put b.w in xmm1
addss xmm0, xmm1                      # add b.w to a.w in xmm0
movss dword ptr [rbp - 4], xmm0       # put sum into result.w
# Put result into xmm0 and xmm1
movsd xmm0, qword ptr [rbp - 16]      # result.x, result.y
movsd xmm1, qword ptr [rbp - 8]       # result.z, result.w
```
Total Instructions: 22
To my eyes, even the assembly is more readable in this form. We see a clear pattern: load, load,
add, store. Each add is four instructions this time instead of six, and the loads come from the
function’s own stack frame rather than through pointers to somewhere else in memory.
There could be performance benefits here in two forms: fewer instructions (22 vs 27) and no
pointer-chasing reads. Because all of the operations happen on values already in the stack frame,
which is almost certainly sitting in cache, we don’t have to worry as much about the latency of
reading from memory.
I’m not sure why the compiler used four xmm registers when two would have sufficed (each can hold
four floats); most likely it’s the calling convention, which splits each 16-byte struct across two
xmm registers, two floats apiece.
What if we turn optimizations on?
Pass by Pointer (-O2)
```asm
movups xmm0, xmmword ptr [rsi]        # put a in xmm0
movups xmm1, xmmword ptr [rdx]        # put b in xmm1
addps xmm1, xmm0                      # add a and b
movups xmmword ptr [rdi], xmm1        # put sum into out
```
Total Instructions: 4
The compiler realizes that a and b are four floats each which happens to be the size of the
xmm registers, so it saves a lot of effort by simply placing the four floats of a into
xmm0 and the four floats of b into xmm1 and doing a single add instruction that takes
care of the four individual adds.
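For the curious, you can write that four-wide add by hand with SSE intrinsics. This is just a sketch of what the optimizer produced, not code from the article’s examples; the function name is made up, and it assumes an x86 target where `<xmmintrin.h>` is available.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

typedef struct { float x, y, z, w; } vec4;  /* same vec4 as above, repeated so the snippet stands alone */

/* Hand-written version of what the compiler generated: one packed
 * load per operand, a single addps, and one store for the result. */
vec4 vec4_add_sse(vec4 a, vec4 b)
{
    __m128 va  = _mm_loadu_ps(&a.x);  /* a.x..a.w into one xmm register */
    __m128 vb  = _mm_loadu_ps(&b.x);  /* b.x..b.w into another */
    __m128 sum = _mm_add_ps(va, vb);  /* one instruction, four adds */
    vec4 out;
    _mm_storeu_ps(&out.x, sum);
    return out;
}
```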
Surely passing by value can’t beat that?
Pass by Value (-O2)
```asm
addps xmm0, xmm2
addps xmm1, xmm3
```
Total Instructions: 2
It doesn’t need to load anything from memory, so it’s able to do everything within the registers
themselves, for a total of two instructions. The reason it needs two add instructions while the
pointer version needed only one is likely the calling convention again: each vec4 arrives split
across two xmm registers (two floats apiece), and the result leaves the same way, so the compiler
adds the two halves separately.
Large Struct
The previous example had the benefit that the struct only contained four floats totaling sixteen
bytes, and it also had the advantage that an entire vec4 could fit into a single xmm
register. Let’s try another common game math structure: the 4x4 matrix.
```c
typedef struct {
    float e00; float e01; float e02; float e03;
    float e10; float e11; float e12; float e13;
    float e20; float e21; float e22; float e23;
    float e30; float e31; float e32; float e33;
} mat4;
```
For our example function, let’s do something a bit contrived and use matrix addition. I’ve never
actually needed it, but it keeps the assembly shorter than something like matrix multiplication.
The pass-by-pointer form would look like this:
```c
void
mat4_add(mat4* out, mat4* a, mat4* b)
{
    out->e00 = a->e00 + b->e00;
    out->e01 = a->e01 + b->e01;
    out->e02 = a->e02 + b->e02;
    out->e03 = a->e03 + b->e03;
    out->e10 = a->e10 + b->e10;
    out->e11 = a->e11 + b->e11;
    out->e12 = a->e12 + b->e12;
    out->e13 = a->e13 + b->e13;
    out->e20 = a->e20 + b->e20;
    out->e21 = a->e21 + b->e21;
    out->e22 = a->e22 + b->e22;
    out->e23 = a->e23 + b->e23;
    out->e30 = a->e30 + b->e30;
    out->e31 = a->e31 + b->e31;
    out->e32 = a->e32 + b->e32;
    out->e33 = a->e33 + b->e33;
}
```
And the pass-by-value form:
```c
mat4
mat4_add(mat4 a, mat4 b)
{
    mat4 result;
    result.e00 = a.e00 + b.e00;
    result.e01 = a.e01 + b.e01;
    result.e02 = a.e02 + b.e02;
    result.e03 = a.e03 + b.e03;
    result.e10 = a.e10 + b.e10;
    result.e11 = a.e11 + b.e11;
    result.e12 = a.e12 + b.e12;
    result.e13 = a.e13 + b.e13;
    result.e20 = a.e20 + b.e20;
    result.e21 = a.e21 + b.e21;
    result.e22 = a.e22 + b.e22;
    result.e23 = a.e23 + b.e23;
    result.e30 = a.e30 + b.e30;
    result.e31 = a.e31 + b.e31;
    result.e32 = a.e32 + b.e32;
    result.e33 = a.e33 + b.e33;
    return result;
}
```
Pass by Pointer (-O0)
```asm
# Put the three pointers onto the stack
mov qword ptr [rbp - 8], rdi          # out
mov qword ptr [rbp - 16], rsi         # a
mov qword ptr [rbp - 24], rdx         # b
# out->e00 = a->e00 + b->e00;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax], xmm0
# out->e01 = a->e01 + b->e01;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 4]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 4]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 4], xmm0
# out->e02 = a->e02 + b->e02;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 8]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 8]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 8], xmm0
# out->e03 = a->e03 + b->e03;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 12]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 12]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 12], xmm0
# out->e10 = a->e10 + b->e10;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 16]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 16]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 16], xmm0
# out->e11 = a->e11 + b->e11;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 20]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 20]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 20], xmm0
# out->e12 = a->e12 + b->e12;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 24]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 24]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 24], xmm0
# out->e13 = a->e13 + b->e13;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 28]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 28]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 28], xmm0
# out->e20 = a->e20 + b->e20;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 32]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 32]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 32], xmm0
# out->e21 = a->e21 + b->e21;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 36]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 36]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 36], xmm0
# out->e22 = a->e22 + b->e22;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 40]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 40]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 40], xmm0
# out->e23 = a->e23 + b->e23;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 44]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 44]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 44], xmm0
# out->e30 = a->e30 + b->e30;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 48]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 48]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 48], xmm0
# out->e31 = a->e31 + b->e31;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 52]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 52]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 52], xmm0
# out->e32 = a->e32 + b->e32;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 56]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 56]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 56], xmm0
# out->e33 = a->e33 + b->e33;
mov rax, qword ptr [rbp - 16]
movss xmm0, dword ptr [rax + 60]
mov rax, qword ptr [rbp - 24]
addss xmm0, dword ptr [rax + 60]
mov rax, qword ptr [rbp - 8]
movss dword ptr [rax + 60], xmm0
```
Total Instructions: 99
It’s about the same as the vec4 version except that there are four times as many add sections
because there are four times as many floats involved. There really isn’t any difference otherwise.
It still places the pointers onto the stack frame and then loads the values from memory into
registers, adds, and places the values back into memory.
Pass by Value (-O0)
```asm
# Store pointers to the stack in registers (for easier offsets I assume)
mov rax, rdi
lea rcx, [rbp + 80]                   # b
lea rdx, [rbp + 16]                   # a
# result.e00 = a.e00 + b.e00;
movss xmm0, dword ptr [rdx]
addss xmm0, dword ptr [rcx]
movss dword ptr [rdi], xmm0
# result.e01 = a.e01 + b.e01;
movss xmm0, dword ptr [rdx + 4]
addss xmm0, dword ptr [rcx + 4]
movss dword ptr [rdi + 4], xmm0
# result.e02 = a.e02 + b.e02;
movss xmm0, dword ptr [rdx + 8]
addss xmm0, dword ptr [rcx + 8]
movss dword ptr [rdi + 8], xmm0
# result.e03 = a.e03 + b.e03;
movss xmm0, dword ptr [rdx + 12]
addss xmm0, dword ptr [rcx + 12]
movss dword ptr [rdi + 12], xmm0
# result.e10 = a.e10 + b.e10;
movss xmm0, dword ptr [rdx + 16]
addss xmm0, dword ptr [rcx + 16]
movss dword ptr [rdi + 16], xmm0
# result.e11 = a.e11 + b.e11;
movss xmm0, dword ptr [rdx + 20]
addss xmm0, dword ptr [rcx + 20]
movss dword ptr [rdi + 20], xmm0
# result.e12 = a.e12 + b.e12;
movss xmm0, dword ptr [rdx + 24]
addss xmm0, dword ptr [rcx + 24]
movss dword ptr [rdi + 24], xmm0
# result.e13 = a.e13 + b.e13;
movss xmm0, dword ptr [rdx + 28]
addss xmm0, dword ptr [rcx + 28]
movss dword ptr [rdi + 28], xmm0
# result.e20 = a.e20 + b.e20;
movss xmm0, dword ptr [rdx + 32]
addss xmm0, dword ptr [rcx + 32]
movss dword ptr [rdi + 32], xmm0
# result.e21 = a.e21 + b.e21;
movss xmm0, dword ptr [rdx + 36]
addss xmm0, dword ptr [rcx + 36]
movss dword ptr [rdi + 36], xmm0
# result.e22 = a.e22 + b.e22;
movss xmm0, dword ptr [rdx + 40]
addss xmm0, dword ptr [rcx + 40]
movss dword ptr [rdi + 40], xmm0
# result.e23 = a.e23 + b.e23;
movss xmm0, dword ptr [rdx + 44]
addss xmm0, dword ptr [rcx + 44]
movss dword ptr [rdi + 44], xmm0
# result.e30 = a.e30 + b.e30;
movss xmm0, dword ptr [rdx + 48]
addss xmm0, dword ptr [rcx + 48]
movss dword ptr [rdi + 48], xmm0
# result.e31 = a.e31 + b.e31;
movss xmm0, dword ptr [rdx + 52]
addss xmm0, dword ptr [rcx + 52]
movss dword ptr [rdi + 52], xmm0
# result.e32 = a.e32 + b.e32;
movss xmm0, dword ptr [rdx + 56]
addss xmm0, dword ptr [rcx + 56]
movss dword ptr [rdi + 56], xmm0
# result.e33 = a.e33 + b.e33;
movss xmm0, dword ptr [rdx + 60]
addss xmm0, dword ptr [rcx + 60]
movss dword ptr [rdi + 60], xmm0
```
Total Instructions: 51
The result here is a bit interesting. It first takes the stack addresses of a and b and
places them into the registers rdx and rcx. It then uses offsets from rdx and rcx to
access the values rather than offsets from the stack frame pointer (rbp) itself.
Also, each add is now three instructions which is one fewer than the pass-by-value vec4_add.
Here it’s able to move one value into xmm0 and then perform the add directly from the value on
the stack.
So the pass-by-pointer version is 99 instructions while the pass-by-value version is only 51.
But what about with optimizations?
Pass by Pointer (-O2)
```asm
# Add four floats
movups xmm0, xmmword ptr [rsi]
movups xmm1, xmmword ptr [rdx]
addps xmm1, xmm0
movups xmmword ptr [rdi], xmm1
# Add four floats
movups xmm0, xmmword ptr [rsi + 16]
movups xmm1, xmmword ptr [rdx + 16]
addps xmm1, xmm0
movups xmmword ptr [rdi + 16], xmm1
# Add four floats
movups xmm0, xmmword ptr [rsi + 32]
movups xmm1, xmmword ptr [rdx + 32]
addps xmm1, xmm0
movups xmmword ptr [rdi + 32], xmm1
# Add four floats
movups xmm0, xmmword ptr [rsi + 48]
movups xmm1, xmmword ptr [rdx + 48]
addps xmm1, xmm0
movups xmmword ptr [rdi + 48], xmm1
```
Total Instructions: 16
Similarly to the vec4 version, it’s able to add in four float chunks which reduces the number of
instructions significantly, but it does still need to load from memory which slows things a bit.
Pass by Value (-O2)
```asm
# Place return value into rax
mov rax, rdi
# Add four floats
movaps xmm0, xmmword ptr [rsp + 8]
addps xmm0, xmmword ptr [rsp + 72]
movups xmmword ptr [rdi], xmm0
# Add four floats
movaps xmm0, xmmword ptr [rsp + 24]
addps xmm0, xmmword ptr [rsp + 88]
movups xmmword ptr [rdi + 16], xmm0
# Add four floats
movaps xmm0, xmmword ptr [rsp + 40]
addps xmm0, xmmword ptr [rsp + 104]
movups xmmword ptr [rdi + 32], xmm0
# Add four floats
movaps xmm0, xmmword ptr [rsp + 56]
addps xmm0, xmmword ptr [rsp + 120]
movups xmmword ptr [rdi + 48], xmm0
```
Total Instructions: 13
Again, it’s able to add in chunks of four floats but there is less overhead because everything is
happening within the stack frame.
Conclusion
After doing these experiments I’m convinced that in most cases (on a PC with a modern x86 CPU)
passing by value is the way to go, for both readability and performance.
Passing by pointer leads to reads through pointers into memory, which can be slow and often
requires extra instructions.
Passing by pointer can cause ambiguity about ownership, because anything holding the pointer can
do whatever it wants with the data.
Passing by pointer can lead to pointer aliasing, where the compiler can’t be sure two pointers
don’t refer to the same data and so can’t perform certain optimizations.
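If you do pass by pointer, `restrict` is the standard C way to promise the compiler that the pointers never overlap, which hands back some of those lost optimizations. A minimal sketch (the function name is mine, not one of the examples above):

```c
typedef struct { float x, y, z, w; } vec4;  /* repeated so the snippet stands alone */

/* Without restrict, the compiler must assume out might alias a or b
 * and keep every load and store in order. With restrict, we promise
 * the three pointers never refer to overlapping memory. */
void vec4_add_restrict(vec4* restrict out,
                       const vec4* restrict a,
                       const vec4* restrict b)
{
    out->x = a->x + b->x;
    out->y = a->y + b->y;
    out->z = a->z + b->z;
    out->w = a->w + b->w;
}
```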
Passing by value operates on the stack frame and avoids the need to fetch through pointers.
Passing by value makes for more readable function calls, since you clearly see that the inputs are
the function arguments and the output is the return value.
Passing by value also makes the inputs effectively const: the callee works on copies, so it can’t
modify the caller’s data.
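That last point is easy to demonstrate: the callee receives a copy, so nothing it does can reach back into the caller’s variable. A small sketch (a hypothetical function, not from the examples above):

```c
typedef struct { float x, y, z, w; } vec4;  /* repeated so the snippet stands alone */

/* Scribbling on the parameter only touches the local copy. */
float doubled_x(vec4 v)
{
    v.x *= 2.0f;  /* the caller's struct is unaffected */
    return v.x;
}
```

After `doubled_x(a)` returns, `a.x` still holds its original value.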
That’s not to say I never pass by pointer. If I have a function that needs to modify an existing
piece of data then that is a good use for a pointer. It wouldn’t make sense to take in a copy,
modify the copy, and return a new copy. Or if a struct were truly huge (hundreds of bytes or more),
then I would use a pointer, or at the very least check the compiled assembly to see what it was
doing.
But for a function that only needs to read from a struct, and/or return a brand new struct, I
think it makes sense to pass by value.
Last Edited: Dec 20, 2022