AUSTIN MORLAN

Jun 03, 2022

Working with Jumbo/Unity Builds (Single Translation Unit)


There’s been a lot of talk in recent years about something called a “jumbo build” or “unity build” or sometimes “single translation unit build”. Essentially it’s an alternate way of compiling C/C++ by putting all of the code (headers and source both) into a single file.

Some people love them. Some people think they’re evil. I’ve been using them in my personal projects for a while now and thought I’d share my experiences.

It's unfortunate that "Unity Build" is the prevailing term because it's impossible to do a search without getting a lot of results about the Unity game engine.

Translation Units


In C and C++, code is split up into separate files: source files (.c) and header files (.h). A source file together with everything it #includes forms a translation unit, and the compiler turns each translation unit into an object file (.o). The compiler can only see what’s inside of that single translation unit and nothing else, so it takes it on faith that symbols referenced there exist in some other translation unit. It’s blind to everything outside of the one it’s compiling.

The linker then takes all of the object files generated by the compiler and attempts to connect up the symbols. If foo is defined in foo.c and referenced in bar.c, the linker connects the symbols such that the reference in bar.o connects with the definition in foo.o.

The purpose of header files is to allow a source file to use types that exist in other translation units. The compiler can only look at one source file at a time, but a source file can reference types and objects that exist outside of it. The header defines the interface for the compiler so that it knows how to handle those objects, but it’s the linker that actually connects them.

That system leads to a nasty pattern of needing to declare a thing in one file (header), define it in another file (source), then include that header in any file that uses it.

The Problem

You have a source file bar.c that defines a function called bar().

/* bar.c */

void bar(void)
{
}

You then want to call bar() from the source file foo.c.

/* foo.c */

void foo(void)
{
	bar();
}

But when the compiler encounters the symbol bar(), it doesn’t know what it returns or what arguments it takes so it gives a warning.

$ clang -c foo.c -o foo.o

foo.c:4:2: warning: implicit declaration of function 'bar' is invalid in C99 [-Wimplicit-function-declaration]
        bar();
        ^
1 warning generated.

The compiler gives a warning instead of an error because implicit function declarations were legal in C89 (the compiler assumed the function returned int) and Clang still accepts them as an extension. C99 removed them, so this is invalid C99 code, and it could go wrong if the compiler’s assumptions are incorrect. I like to enable pedantic errors to ensure I conform to the standard properly.

$ clang -pedantic-errors -c foo.c -o foo.o

foo.c:4:2: error: implicit declaration of function 'bar' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        bar();
        ^
1 error generated.

To get rid of the error, the compiler needs more information about bar() which means we need to create a header file bar.h.

/* bar.h */

void bar(void);

Now we can include that header file in foo.c to make the compiler happy.

/* foo.c */

#include "bar.h"

void foo(void)
{
	bar();
}

If we compile now the compiler issues no errors.

$ clang -pedantic-errors -c foo.c -o foo.o

Great. Now let’s say that we also have a struct called bartender that is created in bar() and returned. If we want to use that struct inside of foo.c then we need to know its definition, so the struct definition needs to go into the header. But bar.c also needs to know the definition because it’s creating the thing. So we have to add the struct to bar.h and also include bar.h in bar.c.

/* bar.h */

typedef struct bartender {
	int x;
} bartender_t;

bartender_t bar(void);


/* bar.c */

#include "bar.h"

bartender_t bar(void)
{
	bartender_t result;

	result.x = 5;

	return result;
}


/* foo.c */

#include "bar.h"

void foo(void)
{
	bartender_t bartender = bar();
}

This annoys me because you have to update multiple files when you make changes rather than making the change in a single place. Imagine if we wanted to change bar() so that it took in an argument. We would have to make the change in both bar.c and bar.h.

Another trouble with this system is that, in larger projects, it takes time to parse all of the source and header files and then link them together. This is especially bad in C++ because header files can get very heavy with templates.

You can compile multiple source files at the same time because they’re all isolated (make -j8), but the link is a single step that has to wait for every object file, and most of its work traditionally runs serially. I’ve seen large games take twice as long to link as they did to compile.

The Solution

Wouldn’t it be nice if we could have one file bar.c that had a struct and function definitions, and one file foo.c that used that struct and function, without the need for including a header file?

Here’s how. Remove bar.h and place its contents into the top of bar.c.

/* bar.c */

typedef struct bartender {
	int x;
} bartender_t;

bartender_t bar(void)
{
	bartender_t result;

	result.x = 5;

	return result;
}

Remove the includes from foo.c.

/* foo.c */

void foo(void)
{
	bartender_t bartender = bar();
}

Now create a source file that includes all of the other source files. I like to name mine all.c because it is a clear name and it usually will appear first in a list of source files when sorted alphabetically.

/* all.c */

#include "bar.c"
#include "foo.c"
#include "main.c"

To build your project, all you have to do is compile that one file. No need to create a list of source files in your build system and manage include directories or anything like that. Just run the compiler on that one file.

$ clang all.c

That’s really all there is to it.

Each source file will be pasted into all.c in order of inclusion by the preprocessor and then compiled as one translation unit.
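
To make the pasting concrete, here is roughly what the compiler sees once the preprocessor has expanded all.c, using the bar.c and foo.c from above. This is a hand-written sketch rather than real preprocessor output (which also carries line markers). main.c is left out because its contents are not shown here, and the (void) cast is added only to silence an unused-variable warning.

```c
/* Effective translation unit after the preprocessor expands all.c. */

/* --- pasted from bar.c --- */
typedef struct bartender {
	int x;
} bartender_t;

bartender_t bar(void)
{
	bartender_t result;

	result.x = 5;

	return result;
}

/* --- pasted from foo.c --- */
void foo(void)
{
	/* bartender_t and bar() are visible because bar.c was pasted first. */
	bartender_t bartender = bar();

	(void)bartender; /* added in this sketch only to avoid an unused-variable warning */
}
```

Running clang -E all.c prints the real expansion if you want to compare.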

I also like to add common system headers and library headers at the top of all.c so that they’re included once at the beginning and the entire program has access to them throughout.

/* all.c */

#include <assert.h>
#include <math.h>
#include <stdarg.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#include <glad/gl.h>
#include <SDL.h>

#include "bar.c"
#include "foo.c"
#include "main.c"

Advantages


Faster Compilation

The compilation will (often) be faster because it only needs to parse every file once. In a normal build, every source file pulls in its own copy of each header it includes, so a header used by fifty source files gets preprocessed and parsed fifty times. With this method the preprocessor goes through each #include in all.c, pasting the contents, and then is done. Just the one time.

Faster Linking

Linking will be faster because all of the symbols reside in the one translation unit so the linker doesn’t have to resolve symbols across multiple object files.

More Optimizations

The compiler has access to all of the source of the entire program so optimizations can be made that would otherwise be impossible.
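
As a rough illustration, consider a small helper defined in one file and called in a loop from another. The file names and functions below (vec.c, sum.c, square, sum_of_squares) are made up for this sketch. In a conventional build without link-time optimization, the compiler working on sum.c would see only a prototype for square() and would have to emit a real call. In a single translation unit it sees the body and is free to inline it. Whether it actually does depends on the compiler and the optimization level.

```c
/* --- vec.c (hypothetical) --- */
int square(int v)
{
	return v * v;
}

/* --- sum.c (hypothetical), included after vec.c in all.c --- */
int sum_of_squares(const int *values, int count)
{
	int total = 0;

	for (int i = 0; i < count; i++) {
		/* The body of square() is visible here, so the optimizer may inline it. */
		total += square(values[i]);
	}

	return total;
}
```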

No Juggling Headers

You no longer need to have a set of include statements at the top of each file, and you no longer need to keep the header file and source file in sync when making changes.

Disadvantages


No Interface

When you get rid of header files you also get rid of the concept of the interface. One nice thing about header files is that someone who wants to use the functionality can look at the header file and see all of the data structures that were defined and all of the possible functions that could be used.

Merging the header and the source file together removes that level of separation and mixes together the interface with the implementation.

Possible Solution

You can still use header files if you want to; they’re just no longer strictly necessary. You’re free to put struct definitions and function prototypes into a header file if you’d like.

Order Matters

The order that you include the source files in all.c matters. In the above example, bar.c had to be included before foo.c because foo.c used a struct and function that was defined in bar.c.

A file’s dependencies must always be included before the file itself, or else you’ll get undeclared identifier errors.

Possible Solution

Either be mindful of the order (which isn’t too terrible), or place the module’s external definitions into a header file and include all of them in all.c at the top. That way all of them are visible to the source files that are included further down. But this still requires that the header files are included in a certain order because one might reference symbols from another.
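
A minimal sketch of the second option, shown as one listing for brevity. The declarations under the bar.h banner are hypothetical (the earlier example deleted bar.h), and foo_value() is an invented variant of foo() that returns a value so the result can be observed. foo.c is deliberately placed before bar.c to show that the order of the source files stops mattering once the declarations come first.

```c
/* --- bar.h (hypothetical): included at the top of all.c --- */
typedef struct bartender {
	int x;
} bartender_t;

bartender_t bar(void);

/* --- foo.c: now safe to include before bar.c --- */
int foo_value(void)
{
	bartender_t bartender = bar(); /* only the declaration is visible so far */

	return bartender.x;
}

/* --- bar.c: the definition arrives later in the same translation unit --- */
bartender_t bar(void)
{
	bartender_t result;

	result.x = 5;

	return result;
}
```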

No More Static

The keyword static when used at the file scope means that the symbol is local to the translation unit that it was defined in, so you could have a static variable or a static function that was only visible inside a single source file.

static no longer means anything when you have only one translation unit. If you define a static variable or function in bar.c, the compiler will happily allow you to reference them in foo.c (assuming bar.c was included before foo.c).
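
A small sketch of the leak, with both files shown as they would appear after being pasted into all.c. The names call_count, bump, bar_work and foo_peek are invented for illustration.

```c
/* --- bar.c --- */
static int call_count; /* meant to be private to bar.c */

static void bump(void)
{
	call_count++;
}

void bar_work(void)
{
	bump();
}

/* --- foo.c, included after bar.c --- */
int foo_peek(void)
{
	bump();            /* compiles: bar.c's "private" function is in scope */
	return call_count; /* and so is its "private" variable */
}
```

In a conventional multi-file build the same foo.c would be rejected, because call_count is not declared anywhere in its translation unit.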

From what I’ve gathered, this is the number one reason that many people think jumbo builds are bad. They absolutely hate the idea of everything becoming global.

Possible Solution

The best solution I’ve come up with is a sort of pseudo-namespace scheme. It won’t stop someone from intentionally referencing something intended to be static, but it will at least help with accidental usage.

For example, let’s say bar.c had a static variable called tender and a static function called hop. In the normal scheme it would look like this:

/* bar.c */

static int tender;

static void hop(void);

void bar(void)
{
	tender = 5;
	hop();
}

If something in foo.c tried to reference either of those it would get an undeclared identifier error.

For the single translation unit, we can put all of the static variables into a struct with the file name, and we can prefix each static function as well.

/* bar.c */

static struct {
	int tender;
} priv_bar;

static void priv_bar__hop(void);

void bar(void)
{
	priv_bar.tender = 5;
	priv_bar__hop();
}

A C++ solution might be to use a special namespace.

/* bar.cpp */

namespace priv_bar {
	static int tender;
	static void hop(void);
}

void bar(void)
{
	priv_bar::tender = 5;
	priv_bar::hop();
}

A different file could then have its own variable named tender or function named hop() without clashing with this one. It’s not perfect but it helps prevent accidents.
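
To make that concrete in C, here is a sketch with two files that each keep a private tender and a private hop(). The priv_foo names, the value 7, and the bodies of the two hop functions are invented for illustration; only the priv_bar naming comes from the example above.

```c
/* --- bar.c --- */
static struct {
	int tender;
} priv_bar;

static void priv_bar__hop(void)
{
	priv_bar.tender += 1;
}

void bar(void)
{
	priv_bar.tender = 5;
	priv_bar__hop();
}

/* --- foo.c: same private names, different prefix, no clash --- */
static struct {
	int tender;
} priv_foo;

static void priv_foo__hop(void)
{
	priv_foo.tender *= 2;
}

void foo(void)
{
	priv_foo.tender = 7;
	priv_foo__hop();
}
```

Nothing stops foo.c from writing priv_bar.tender directly, but the prefix makes it obvious at the call site that it is reaching into another file's internals.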

I still use the static keyword so that behavior would be as expected if I decided to stop using the jumbo build. (It isn’t needed for zero-initialization: file-scope variables have static storage duration and start out as zero with or without the keyword.)

Compile Everything Every Time

The separate compilation and linking steps of multiple translation units means that you only have to compile a source file if it’s changed since the last time you compiled it. If you’re actively working in foo.c, and you run the build, it will leave everything else alone and compile only foo.c.

Because all.c includes everything, making a change to foo.c means that bar.c will compile as well even though it hasn’t changed.

This seems to be the other reason that many people think jumbo builds are bad, likely because they’re working in large C++ projects that take an hour or more to build fresh while a partial compilation only takes five minutes (although I think even five minutes is way too long).

Possible Solution

You don’t have to literally have only one translation unit. You could have two, or four, or eight, or whatever number you want. You could put systems that rarely change into one translation unit and more actively developed systems into a different one. That would grant you the benefits of both; you just need to separate things appropriately.

No Parallel Compilation

You can no longer run one compiler per thread and compile multiple source files simultaneously, which means you’re wasting 119 threads of your monster Threadripper.

Possible Solution

Similar to the last, you can break your source tree up into multiple translation units so that you can use your processor more effectively.

More RAM Usage

This is unlikely to be much of an issue on most machines, but a sufficiently large project could consume a lot of RAM as all of the source is loaded into memory.

Possible Solution

Again, break your source tree up into multiple translation units. Experiment and see what works best for you.

Conclusion


I find the advantages of the jumbo build outweigh the disadvantages.

Because I work mostly in C and on smaller projects, I don’t really benefit from the faster compilation and linking.

What I love is the removal of header file juggling. I’ve always found the concept of header files to be a big pain in the ass and C’s biggest flaw.

I find it very irritating to have to edit two files for one module, to compile something only to realize I was missing an include, to worry about forward declaring structs, to always wonder if my list of includes per file is out of date after a refactor, etc.

I would encourage people to try it out on their own personal projects and see how they like it.
