Memory allocation in game code requires thought and care if you want to avoid the performance pitfalls of typical C++ development practices. We have good reason to look beyond the general-purpose memory allocator given to us in malloc/new. The standard allocator is slow, comes with hidden/amortized overhead, may fail at run time, and doesn't offer any protection against heap fragmentation.
Some of the traditional solutions to this problem are:
- Avoiding dynamic allocation. Just use fixed sized arrays... You probably won't ever need more than 4096 entities, right? And if you do, just increase the array size. Well, this may be a pragmatic way to sidestep the issue in small games, but it's not a good fit for my project. I need to be able to scale to hundreds of thousands, possibly millions, of procedurally generated entities with only a small subset of them resident in memory at any time.
- Being a memory glutton. Just allocate a huge block from the OS up front and then divvy up the memory as you see fit... This is fine but doesn't actually solve any problems by itself. The real work remains; what to do with the memory that you've reserved. This is not a silver bullet to mitigate sloppy allocation practices.
- Custom C++ allocators. The STL answer... Pushing the solution into the C++ type system via templates doesn't sit well with me. I don't want to introduce any more templated types across my code base than absolutely necessary, and the memory allocation strategy I use should not be conflated with the type of data I'm allocating.
- Override global new/delete. This solution implies that you should feel free to use new/new[] and delete/delete[] at will throughout your code base, and you're covered... No, I want to avoid the explicit (and implicit) use of new and delete within the game loop as much as possible.
- class-specific new/delete. This has all the wrong of the above solution, with the added benefit that you get to do it dozens of times!
- global macro
#define MY_NEW ...
. This is a rather invasive solution that tightly couples your code to its own project. I don't know about you, but I'm always borrowing code from past projects, but I'm not trying to adopt the framework along with it. If done well, the semantics of your custom "new" function will be identical, so a simple find/replace should take care of things. But I come right back to the same argument as the last two; I want to avoid the explicit use of "new" as much as possible inside my game loop.
Simple Goals
To be perfectly clear, I'm not suggesting that fast linear allocators, stack allocators, memory pools and the rest don't have their place. What I'm shooting for is a simple strategy to avoid expensive allocations while the game is running. I want to get up and running quickly. I want to avoid accruing technical debt in my project that will have to be cleaned up and optimized later. I want to avoid making large architectural decisions early on that may have unforeseen consequences. I want something that might even be viable to ship, if performance turns out to be acceptable.
It doesn't matter how fast the allocation of a block is, if it's done during initialization. It doesn't matter where on the heap a block resides, as long as the memory itself is contiguous, tightly packed, and data-oriented design principles are informing the data layout and access patterns. Fragmentation of the heap over time is not a major concern if the blocks of memory never move, or move very rarely.
On the STL
Many containers in the standard library require an individual heap allocation per node in order to satisfy certain big O complexity requirements, and the abstraction of generic iterators. The trade-offs decided for these containers attempt to satisfy the "general" case. These are often not the same trade-offs we would choose in game development. With a data-oriented design approach, writing generic code is not a goal in itself. Being unnecessarily generic is often at odds with writing code that is clearly representative of actual memory access patterns. We can't effectively choose the algorithmic complexity trade-offs of our containers until we understand the statistically common use case for our data.
Sure, generic iterators dovetail nicely with the algorithms library. In many cases, thinking in terms of algorithms can help clean up nasty imperative code, or can make difficult problems more approachable. But we can't ignore the reality that modern CPU architectures are largely bound by memory latency. It takes a sizable data set before gains in efficiency outweigh the cost of constant cache misses. It kind of blew my mind when I learned that for small N's, an in-place bubble sort will outperform a more efficient quicksort. The key lesson here is that efficiency means doing less work, and performance means doing work faster. We're after performance. If performance can be improved through efficiency, great. However, there are large swaths of game code where achieving O(1) is less important than ensuring that your data and algorithm mesh for ideal cache locality.
So we're mostly going to leave the STL containers behind, except for vector. Over the next few posts we'll implement our own set of containers on top of vector, including the vector_queue, concurrent_queue, handle_map, and hash_map.
Why vector?
If there is one redeeming quality of the STL, it's the good old vector. Aside from the absence of a stack-based small storage optimization (a la std::string), there isn't much left to be desired. Vector gives us the contiguous memory block and performance characteristics of a basic array, with the safety and flexibility of being able to grow to accommodate more data. Vector internally implements the RAII idiom, in that the heap-allocated contiguous block is managed by a small stack-allocated stub. Make no mistake, std::vector<T>
is a good and reasonable use of C++ templates. This application of templates isn't going to murder your compile times or your debug messages. Anti-C++/template zealots don't often distinguish between this simple usage of templates, and Andrei Alexandrescu-esq metaprogramming with template-template arguments, variadic templates, CRTP, SFINAE, etc. Full disclosure, I do that stuff from time to time, fully aware that I'm probably doing "the wrong thing."
My Solution
It's common for game programmers to roll their own containers on top of vector in a pinch. My proposal is a dead simple pattern to mitigate the downsides of vector, while taking advantage of the benefits. I don't call it a poor man's solution for nothing.
- Keep a memory_reserve.h containing the reserve sizes for all important data containers in your project.
- Reserve during initialization before the game loop starts.
- On exit, write to the log or console if capacity exceeds the original reserve.
- If capacity exceeds the original reserve, increase the reserve.
Here's how this might look in practice:
// memory_reserve.h
#pragma once
#ifndef GRIFFIN_MEMORY_RESERVE_H_
#define GRIFFIN_MEMORY_RESERVE_H_
// Input System
#define RESERVE_INPUTSYSTEM_MAPPINGS 32
#define RESERVE_INPUTSYSTEM_CONTEXTS 32
#define RESERVE_INPUTSYSTEM_CALLBACKS 10
#define RESERVE_INPUTSYSTEM_EVENTSQUEUE 40
#define RESERVE_INPUTSYSTEM_MOTIONEVENTSQUEUE 100
#define RESERVE_INPUTSYSTEM_POPQUEUE 30
#define RESERVE_INPUTSYSTEM_MOTIONPOPQUEUE 100
#define RESERVE_INPUTCONTEXT_MAPPINGS 64
// Entity System
#define RESERVE_ENTITY_COMPONENTS 20
#define RESERVE_ENTITYMANAGER_ENTITIES 1000
#define RESERVE_ENTITYMANAGER_COMPONENTS 100
// Resource System
#define RESERVE_RESOURCELOADER_CALLBACKS 5
#define RESERVE_RESOURCECACHE_PERMANENT 100
#define RESERVE_RESOURCECACHE_MATERIALS 100
#define RESERVE_RESOURCECACHE_MODELS 100
#define RESERVE_RESOURCECACHE_SCRIPTS 100
// Render System
#define RESERVE_RENDER_QUEUE 500
#define RESERVE_SHADER_PROGRAMS 100
#define RESERVE_SHADER_PROGRAM_PREPROCESSORS 20
#define RESERVE_MODELS 100
// Scene System
#define RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE 100
#define RESERVE_SCENEMANAGER_SCENES 3
#define RESERVE_SCENE_CAMERAS 10
// Scripting System
#define RESERVE_LUA_STATES 10
// Concurrency System
#define RESERVE_CONCURRENCY_POP_TASK_LIST 32
// ...
#endif
// SceneGraph.cpp
// ...
SceneGraph::SceneGraph()
{
m_bfsQueue.reserve(RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE);
m_handleBuffer.reserve(RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE);
}
SceneGraph::~SceneGraph() {
if (m_bfsQueue.capacity() > RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE) {
SDL_Log("check RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE: original=%d, highest=%d", RESERVE_SCENEGRAPH_TRAVERSAL_QUEUE, m_bfsQueue.capacity());
}
}
Conclusion
Our use of std::vector as the basic building block for further containers ensures that we'll be storing game and engine data contiguously in memory. We keep track of the highest capacity encountered during playtesting, and reserve during initialization to decrease the chance of reallocation within the game loop. We also benefit from the peace of mind in knowing that if we get it wrong and the capacity spikes, vector will still grow and allow the game to continue running as usual.
In future posts we'll build custom containers to take the place of their standard library counterparts (vector_queue, and hash_map). We'll introduce a very useful container that I call handle_map (I've seen others refer to this container as a "slot map"), which is used heavily in the griffin engine's entity-component system implementation. We'll also create a "good enough" concurrent_queue, with plans to launch into a series of posts on a task-based concurrency framework for games.
The memory allocation strategy presented here will be utilized in almost all areas of the griffin engine and game, meaning we'll have almost no worries about the cost of allocation at run time for our engine data and game data. We hardly have to do anything to achieve this - just write a few extra lines of code when a new container is introduced, and occasionally scan for logged messages on shutdown.