
Exploring the Polycube Universe

Generation, Counting, and the Safety-Performance Trade-off

TJ Gokken is an AI Systems Architect with a passion for bridging the gap between technology and practical application. Specializing in .NET frameworks and machine learning, TJ helps software teams operationalize AI to drive innovation and efficiency. With over two decades of experience in programming and technology integration, he is a trusted advisor and thought leader in the AI community.

TL;DR

This long article documents my journey to generate and count polycubes—3D shapes formed by connecting unit cubes face-to-face. Starting with a YouTube video, I ported the code to Rust and achieved dramatic performance improvements through targeted optimizations like FxHashSet, SmallVec, and Zstd compression. For generating polycubes of size 12, I initially reduced processing time from 46.5 minutes to 13.5 minutes while shrinking cache file sizes by 90%.

I then made another breakthrough by replacing string-based canonical forms with hash-based representation, which slashed processing time even further—down to just 2 minutes for n=12 (a 98% reduction from the original). This optimization demonstrated how seemingly small changes like avoiding string operations can have outsized performance impacts.

I also investigated counting-only algorithms based on Stanley Dodds's work, discovering the fundamental trade-offs between memory safety and performance. While my safe Rust implementation worked instantly for n=11, it struggled with n=12, revealing that some calculations truly benefit from unsafe memory manipulation for maximum performance.

Overall, it was a great learning experience. I learnt a lot about algorithm optimization, language capabilities, and the sometimes unavoidable trade-offs between safety and speed in computational mathematics.


Well, this one is different from The Marvel or DC Universe.

About a week ago, a friend of mine introduced me to a video about generating polycubes. I certainly did not know about them before, but after talking to him and watching the video, I have to say I became obsessed with them.

First of all, what are polycubes? If you have ever played Tetris, you kind of know what they are: the same kind of shapes, but in 3D. Simply put, a polycube is a solid figure formed by joining one or more equal cubes face to face. This is important: they cannot be joined at the edges. Take a look at the image below:

a pentacube

Before we go any further, here is the video (about 15 minutes), if you would like to watch: https://www.youtube.com/watch?v=g9n0a0644B4

The author of the video wrote Python code and generated polycubes up to 11 (n=11) cubes. After that, he says he stopped, since he was happy with his implementation. I used his Python code as a starting point.

This got me thinking: Python is not the most performant language, so what if I used a faster one? After all, how hard can it be, right? (Famous last words.)

Here is the problem: while generating unique shapes is straightforward for a small number of cubes (represented by n), once you get to 3D space, things get very complicated very fast.

For example, “L” and "⅃" are considered to be the same shape. So, since you are also dealing with 3D shapes, you need to rotate the polycubes and compare them to see if they are the same shape in a different orientation. Then you add reflection to it, and all of a sudden you are dealing with a lot of math and possibilities.

Here is a demo of reflection below. In this case, the red cube represents the reflected version of the blue cube (source: https://demonstrations.wolfram.com/Understanding3DReflection/).

Keeping that in mind, as the number of cubes (n) goes up, the number of unique shapes grows exponentially - to the point that our computers struggle to generate them. Look at the table below, where the number of unique polycubes is listed against the number of cubes:

Yeah, that last number, for 18 cubes, is some 3 trillion! As you can see, the number of unique shapes goes up exponentially.

💡
Fixed polycubes are considered distinct if they have different shapes or orientations in 3D space; free polycubes are considered the same in that case. For example, imagine a straight line that consists of 3 cubes, like an “I”. The vertical and horizontal “I” are considered different as fixed polycubes but the same as free polycubes - hence the discrepancy in the number of unique shapes between the two.

Right. So, after a crash course on 3D math, I decided to use the good old C# - because after all it is much more performant than Python, right?

I had some ground rules before I started:

  • Going to keep Dr. Matt’s approach and create the polycubes

  • No cloud computing - speed comparisons were going to be made running on my laptop

  • Not trying to see how many cubes (n) I can count, but more like how fast I can get with this methodology for, let’s say, n=11 or n=12

Following the logic of the YouTube video, my approach was simple: create polycubes, expand them, and eliminate the edge connections and duplicates. Oh, and I wanted to do it all on my local PC - 32GB RAM. No cloud, no distributed computing. Simple.

Except that it wasn’t. For the life of me, I could not get the C# app to successfully eliminate the duplicates, or when I did, I could not get rid of the edge connections. However, as buggy as my code was, when I ran it for n=9 it took some 35 seconds to generate the 48,311 unique shapes. That time may not look so bad, but for larger numbers, let’s say n=11, it would take a good while.

So, at that point, I became obsessed with getting it accurate and efficient. My code needed to be fast. I looked for a way to count the shapes without generating them and checking all rotations, etc., but could not come up with one.

I needed efficiency and speed.

So, I ditched C# and decided to tackle this issue in Rust. I thought the math libraries in Rust might provide the shortcut I needed instead of trying to reinvent the wheel in C#. Hey, after all, the next step up is assembly language.

Architecture

The approach stayed the same; the core idea of the architecture was as follows:

  • Foundational Structures

    • Pos: A simple 3D coordinate system (x,y,z)

    • Polycube: Represents a polycube as a collection of cube positions (a minimal sketch of both structs follows at the end of this list)

  • Generation Engine

    • Builds polycubes incrementally starting from smaller ones

    • Uses a recursive technique: generate all n-cube polycubes by adding one cube to all (n-1)-cube polycubes.

      • What we do is we cache (save the results to the disk) the previous n results - we generate them if they do not exist.

      • Then we take the (n-1) results and simply add one cube in all possible positions. This guarantees face connectivity and also saves memory, as we only need to store polycubes of the current and previous size rather than all sizes.

      • If we tried to generate n-cubes from scratch (e.g., by checking all possible arrangements of n cubes), we'd need to examine 3^n potential configurations. Starting from (n-1)-cubes reduces this drastically.

    • Parallelizes generation with Rayon for multi-core performance.

  • Canonicalization & Uniqueness

    • Handles the most challenging aspect: determining whether two differently-oriented polycubes are actually the same shape. This is where rotations come into play.

    • Implements rotation matrices to check all 24 possible 3D orientations.

    • Uses canonical form representation for efficient comparison.

  • Performance Optimizations

    • Implements caching to avoid regenerating the same polycubes

    • Uses bit-packed representations for memory efficiency

    • Parallelizes processing across available CPU cores

  • Export & Visualization

    • Exports to CSV format for analysis

    • Creates text-based representations for smaller polycubes

    • Generates interactive HTML/Three.js visualizations

      • Uses the exported csv information for the visualization
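
Before moving on, here is a minimal sketch of what the two foundational structures might look like. The field layout and the i8 coordinate type are my assumptions for illustration here (the rotation matrices later in the article also use i8), not necessarily the exact definitions in the final code:

#[derive(Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord, Debug)]
pub struct Pos {
    pub x: i8,
    pub y: i8,
    pub z: i8,
}

impl Pos {
    pub fn new(x: i8, y: i8, z: i8) -> Self {
        Pos { x, y, z }
    }
}

#[derive(Clone, Debug)]
pub struct Polycube {
    // The unit-cube positions that make up the shape
    pub cubes: Vec<Pos>,
}

impl Polycube {
    pub fn unit_cube() -> Self {
        Polycube { cubes: vec![Pos::new(0, 0, 0)] }
    }

    pub fn domino() -> Self {
        Polycube { cubes: vec![Pos::new(0, 0, 0), Pos::new(1, 0, 0)] }
    }
}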

Let me explain how we add cubes with an example:

Let’s say we are looking at generating polycubes for n=3. We start with the only 2-cube shape, which is the domino. We then find all valid positions where a third cube can be added - there are 10 of them. Then we generate all potential 3-cubes by adding a cube at each of those positions. We check the rotations, eliminate the duplicates, and are left with the 2 unique 3-cubes.
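
In code, the expansion step can be sketched like this (assuming the Pos and Polycube sketches above, plus the adjacent_positions helper shown later in the article): every empty position that shares a face with an existing cube is a valid spot for the next cube.

use std::collections::HashSet;

impl Polycube {
    // Sketch: collect every empty, face-adjacent position.
    // For the domino this yields exactly the 10 positions mentioned above
    // (2 cubes x 6 neighbors, minus the 2 faces they share with each other).
    pub fn get_expansion_positions(&self) -> Vec<Pos> {
        let occupied: HashSet<Pos> = self.cubes.iter().copied().collect();
        let mut candidates: HashSet<Pos> = HashSet::new();
        for cube in &self.cubes {
            for neighbor in cube.adjacent_positions() {
                if !occupied.contains(&neighbor) {
                    candidates.insert(neighbor);
                }
            }
        }
        candidates.into_iter().collect()
    }
}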

So, at the end of the day I approached this in a 2-step process:

  1. Port from Python to Rust

  2. Improve Rust code-base

Port from Python to Rust

The first version of the code was a direct port of the Python code. For example, it generated the polycubes in almost the same way:

fn generate_polycubes(n: u8, use_cache: bool) -> Vec<Polycube> {
    // Handle base cases
    if n < 1 { return Vec::new(); }
    else if n == 1 { return vec![Polycube::unit_cube()]; }
    else if n == 2 { return vec![Polycube::domino()]; }

    // Check cache
    if use_cache && cache_exists(n) {
        return load_from_cache(n);
    }

    // Get (n-1)-cubes
    let base_cubes = generate_polycubes(n - 1, use_cache);

    // Store results
    let mut polycubes = Vec::new();
    let mut unique_forms = HashSet::new();

    // Process each base cube sequentially
    for (idx, base_cube) in base_cubes.iter().enumerate() {
        // Find expansion positions
        let positions = base_cube.get_expansion_positions();

        for position in positions {
            // Create expanded shape
            let expanded = base_cube.expand(position);

            // Check if it's unique
            if !cube_exists(&expanded, &unique_forms) {
                polycubes.push(expanded.clone());
                unique_forms.insert(expanded.get_canonical_form());
            }
        }

        // Show progress
        if idx % 100 == 0 {
            print!("\rGenerating polycubes n={}: {}%", n, (idx * 100) / base_cubes.len());
        }
    }

    // Cache results
    if use_cache {
        save_to_cache(&polycubes, n);
    }

    polycubes
}

Checking for uniqueness was straightforward but computationally expensive:

fn cube_exists(polycube: &Polycube, unique_forms: &HashSet<String>) -> bool {
    // Generate all 24 rotations
    for rotation in all_rotations(polycube) {
        if unique_forms.contains(&rotation.get_canonical_form()) {
            return true;
        }
    }

    false
}

In the code above, we generate 24 possible 3D rotations for each new polycube. This means:

  • Creating 24 new copies of the polycube

  • Applying 24 different rotation matrices

  • Normalizing each rotated result

  • Calculating the canonical form for each rotation

That is a lot of computation, and as the number of cubes gets larger, this work becomes extremely costly.
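
For reference, here is one self-contained way to produce those 24 rotation matrices - a sketch of the standard technique (compose 90° rotations about the x- and y-axes until the set stops growing), not necessarily the code the project ended up with:

type Mat = [[i8; 3]; 3];

// Multiply two 3x3 integer rotation matrices
fn mat_mul(a: &Mat, b: &Mat) -> Mat {
    let mut r = [[0i8; 3]; 3];
    for i in 0..3 {
        for j in 0..3 {
            for k in 0..3 {
                r[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    r
}

// Generate all 24 orientation-preserving rotations of the cube by
// closing the set under two generators (90° about x, 90° about y)
fn all_rotation_matrices() -> Vec<Mat> {
    let identity: Mat = [[1, 0, 0], [0, 1, 0], [0, 0, 1]];
    let rot_x: Mat = [[1, 0, 0], [0, 0, -1], [0, 1, 0]];
    let rot_y: Mat = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]];

    let mut found = vec![identity];
    let mut frontier = vec![identity];
    while let Some(m) = frontier.pop() {
        for g in [&rot_x, &rot_y] {
            let candidate = mat_mul(g, &m);
            if !found.contains(&candidate) {
                found.push(candidate);
                frontier.push(candidate);
            }
        }
    }
    found // exactly 24 matrices
}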

However, that was not all. This algorithm also checked uniqueness for every single candidate:

for position in positions {
    // Create expanded shape
    let expanded = base_cube.expand(position);

    // Check if it's unique
    if !cube_exists(&expanded, &unique_forms) {
        polycubes.push(expanded.clone());
        unique_forms.insert(expanded.get_canonical_form());
    }
}

This meant every candidate required a full rotation check.

I also used string representations for canonical forms:

fn get_canonical_form(&self) -> String {
    // Sort positions lexicographically
    let mut positions = self.cubes.clone();
    positions.sort_by(|a, b| {
        a.x.cmp(&b.x)
            .then(a.y.cmp(&b.y))
            .then(a.z.cmp(&b.z))
    });

    // Create string representation
    positions.iter()
        .map(|p| format!("{},{},{};", p.x, p.y, p.z))
        .collect()
}

While this made life somewhat easier for me (in the end, it turned out not so easy), it also meant cloning the entire vector of positions, sorting on every check, and expensive string operations, including comparisons and hash lookups with complex keys. Expensive.

I did accomplish what I wanted to do, though: it worked. Then, as a bonus, I also put together a 3D visualizer. All the user needed to do was export the results to a csv file at the end and load that csv file into the visualizer. It worked beautifully, and it was invaluable at the beginning to be able to compare the results and look at the shapes.

You can compare 2 shapes, and rotate the shapes using your mouse. It is pretty cool and I was very happy with it.

I used n=5 as my test case and then went up to 12 to get a benchmark (so I could compare with the improved version later on). Here is what I got - at this point, keep in mind that my ineffective C# code produced n=9 in about 35 seconds.

Number of cubes (n) | Unique polycubes | Results Generated (seconds)
1 | 1 | -
2 | 1 | -
3 | 2 | -
4 | 8 | -
5 | 29 | 0.01
6 | 166 | 0.03
7 | 1,023 | 0.08
8 | 6,922 | 0.53
9 | 48,311 | 4.08
10 | 346,543 | 29.45
11 | 2,522,522 | 239.71 → ~4 mins
12 | 18,598,427 | 2793 → ~46.5 mins

This was a huge improvement. Even if you just look at n=9, the speed difference of 35 secs vs 4 secs is a vast improvement. Go Rust!

As I mentioned before, I was also caching the results for each n by saving them as bincode files. Here is what they looked like, along with their sizes:

Obviously, as the number of cubes got bigger, the cache files were going to get pretty big too. Remember, if we were to go for n=13, the app would load cubes_12.bincode first and start working from there. In this case, that would mean loading an 800MB file, deserializing it, and then processing it. Once again, an expensive operation.

Now that I had this working, it was time to see what could be improved and satisfy my need for speed.

Improving The Code

FxHashSet

The first thing was to do something about those hash sets. The original code used Rust’s HashSet. The issue, in our case, is that HashSet uses a secure hashing algorithm called SipHash, which puts security over speed. The security it provides comes at a cost: it is more complex and performs additional calculations.

FxHashSet, on the other hand, is much simpler and uses a non-cryptographic hashing function that prioritizes speed over security.

The code is extremely close:

use rustc_hash::FxHashSet;

// Before:
let positions: HashSet<_> = polycube.cubes.iter().cloned().collect();

// After:
let positions: FxHashSet<_> = polycube.cubes.iter().copied().collect();

So, the rule is: use HashSet when security matters, like for user-provided input, and FxHashSet when performance is more important, you don’t need protection against hash-flooding attacks, and you know your keys won’t be malicious (like in-memory processing of trusted data). It was as if it were made for our case.

💡
Hash-flooding attacks are a type of denial-of-service (DoS) attack that targets hash table implementations. For example, if a web server uses a hash table to store form data and the hash function is predictable, an attacker could submit thousands of form fields that all hash to the same value. This can cause the server to consume excessive CPU time processing these inputs.

SmallVec for Adjacent Positions

The original code used Vec<T> for small, fixed-size collections, but Vec<T> is meant for dynamic arrays: the vector is allocated on the heap.

The improvement came from using SmallVec<[T; N]>, which stores the vector inline (on the stack). It is a hybrid container that stores up to N elements inline; if more than N elements are added, it spills over into a heap allocation just like Vec<T>, and there is an extra move cost when the data spills to the heap.

Once again, the Rust code is extremely similar:

// Before:
pub fn adjacent_positions(&self) -> Vec<Pos> {
    vec![
        Pos::new(self.x + 1, self.y, self.z),
        // ... 5 more positions
    ]
}

// After:
pub fn adjacent_positions(&self) -> SmallVec<[Pos; 6]> {
    smallvec![
        Pos::new(self.x + 1, self.y, self.z),
        // ... 5 more positions
    ]
}

Zstd Compression for Cache Files

You have seen how big the .bincode files can get. Past a certain size, loading and deserializing these files becomes a serious bottleneck.

To overcome this issue, the v2 code uses Zstd compression. It achieves strong compression ratios while maintaining good performance, which becomes increasingly important as cache files grow exponentially with larger n values.

use std::fs::File;
use std::io::Write;

// Before: plain bincode straight to disk
fn save_to_cache(polycubes: &[Polycube], path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::create(path)?;
    bincode::serialize_into(file, &polycubes)?;
    Ok(())
}

// After: serialize, then compress with Zstd (level 3) on the way to disk.
// Note: the error type is boxed because bincode and I/O errors differ.
fn save_to_cache(polycubes: &[Polycube], path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let serialized = bincode::serialize(polycubes)?;
    let file = File::create(path)?;
    let mut encoder = zstd::Encoder::new(file, 3)?;
    encoder.write_all(&serialized)?;
    encoder.finish()?;
    Ok(())
}
💡
It is worth noting that the code above uses Zstd compression to store the final generated polycubes for each value of n, not the intermediate sets. Another optimization opportunity would be reducing redundant storage of intermediate structures but I was happy with the results so I left it alone.
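
For completeness, the matching loader would look something like this - a sketch that assumes Polycube derives serde's Deserialize: wrap the file in a zstd::Decoder and let bincode deserialize straight from the decompressing stream.

use std::fs::File;

// Sketch of the matching cache loader (assumes Polycube: serde::Deserialize)
fn load_from_cache(path: &str) -> Result<Vec<Polycube>, Box<dyn std::error::Error>> {
    let file = File::open(path)?;
    let decoder = zstd::Decoder::new(file)?;
    let polycubes: Vec<Polycube> = bincode::deserialize_from(decoder)?;
    Ok(polycubes)
}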

Performance Optimization: Strategic Inlining

#[inline]
pub fn apply_rotation(&self, rotation: &[[i8; 3]; 3]) -> Self {
    // ... rotation code
}

Adding #[inline] attributes to performance-critical functions encourages the compiler to inline these functions, eliminating function call overhead in tight loops.

💡
In Rust, the compiler (LLVM) decides whether to inline a function based on its heuristics. However, by adding #[inline], you hint to the compiler that inlining might be beneficial. This eliminates function call overhead, and enables further optimizations due to the function body becoming part of the calling function.

Performance Optimization: Capacity Pre-allocation

Throughout the code, I ended up pre-allocating vectors with known or estimated capacities.

// Before:
let mut matrices = Vec::new();

// After:
let mut matrices = Vec::with_capacity(24);

This simple change prevents multiple reallocations as the vectors grow, which is particularly important in hot code paths.

💡
Hot Code Path: A hot code path is a section of code that runs frequently and is responsible for a significant portion of execution time. Optimizing these paths can have a major impact on performance.

Running The Improved Version

Once these changes were in place, it was time to test it. I cleared the cache and ran the app again for numbers from n=5 to n=12.

Number of cubes (n) | Unique polycubes | Results Generated (seconds) | V2 - Results Generated (seconds)
1 | 1 | - | -
2 | 1 | - | -
3 | 2 | - | -
4 | 8 | - | -
5 | 29 | 0.01 | 0.00
6 | 166 | 0.03 | 0.01
7 | 1,023 | 0.08 | 0.02
8 | 6,922 | 0.53 | 0.14
9 | 48,311 | 4.08 | 1.08
10 | 346,543 | 29.45 | 9.27
11 | 2,522,522 | 239.71 → ~4 mins | 98.07 → ~1.5 mins
12 | 18,598,427 | 2793 → ~46.5 mins | 819 → ~13.5 mins

I have to admit, the results were beyond my wildest dreams. But, wait! Look at these cache files:

The cache files are roughly 1/10 of the size of the files generated by the first version of the code.

Take that... whatever! Huh, in your face! I was extremely happy, and as I was waiting for n=12 to finish and getting ready to do my victory dance, I came across this article:

https://philthompson.me/2024/Counting-Polycubes-of-Size-21.html

That algorithm managed to count n=11 in 0 seconds, n=12 in about a second, and so on. Wait, hold on? What? How did they do it? What was different?

Naturally, I dug a bit deeper. Phil Thompson’s code was based on a C# (C#?) program written by a brilliant mathematician named Stanley Dodds. Here is his program: https://oeis.org/A000162/a000162.cs.txt.

Phil Thompson ported this code to Rust and managed to produce results for really high values of n. The latest count is n=22. You can track this by visiting this site: https://oeis.org/A000162

The core of Dodds' algorithm is a clever mathematical encoding of 3D space that allows efficient counting without explicitly generating each polycube. Rather than storing each polycube, the algorithm recursively builds partial polycubes and counts valid extensions mathematically. At a certain recursion depth (typically 4), it switches to an optimized counting function that uses mathematical formulas to count large sets of configurations at once.

The magic happens in this counting function, which uses combinatorial formulas to calculate how many valid polycubes could be formed from a given partial state. This is where the algorithm gains its incredible performance advantage - counting many configurations with a single arithmetic operation rather than explicitly generating and checking each one.

On top of that, to take advantage of modern multi-core processors, the algorithm uses a "filter depth" parameter to divide the work into independent tasks that can be processed in parallel.

At the specified filter depth in the recursion, the algorithm creates a task that contains the current state. These tasks can then be distributed across multiple threads or processors for parallel execution.
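
To illustrate the pattern only - this is a generic toy sketch of depth-based task splitting with Rayon, not Dodds's or Thompson's actual code; State, expand, and count_extensions are hypothetical stand-ins:

use rayon::prelude::*;

#[derive(Clone)]
struct State(u64); // hypothetical stand-in for a partial-polycube encoding

// Hypothetical expansion step; the real algorithm enumerates valid ways
// to grow the partial configuration
fn expand(s: &State) -> Vec<State> {
    (0..3).map(|i| State(s.0 * 3 + i)).collect()
}

// Hypothetical leaf counter; in the real algorithm this is the
// combinatorial counting function
fn count_extensions(_s: &State) -> u64 {
    1
}

// Recurse down to the filter depth, collecting each partial state as a task...
fn collect_tasks(state: State, depth: usize, filter_depth: usize, tasks: &mut Vec<State>) {
    if depth == filter_depth {
        tasks.push(state);
        return;
    }
    for next in expand(&state) {
        collect_tasks(next, depth + 1, filter_depth, tasks);
    }
}

// ...then fan the tasks out across cores and sum the partial counts
fn count_parallel(root: State, filter_depth: usize) -> u64 {
    let mut tasks = Vec::new();
    collect_tasks(root, 0, filter_depth, &mut tasks);
    tasks.par_iter().map(count_extensions).sum()
}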

Adding A Counter

I wanted to take Phil Thompson’s Rust port and integrate it into my project to see what would happen. However, I immediately ran into issues.

Dodds's algorithm uses sophisticated pointer manipulation, bit-level operations, and complex mathematical optimizations that are challenging to translate correctly, even to unsafe Rust. So, I used Thompson’s Rust port as a guide; however, his code was a standalone application specifically designed for counting. Integrating it into my existing codebase proved challenging, as it would require a lot of refactoring.

Unsafe code, even when it is already written for you and you just need to adapt it, is challenging to say the least. Lots of memory corruption.

On top of all that, the existing codebase had a different approach and API design than Thompson’s implementation, which made direct use difficult.

So, I decided to go for the memory-safe option and see how we could integrate this approach.

The results were mixed, to say the least. If you want to skip the next section and jump directly to the conclusion, I’ll tell you right away: with n=11, the safe counter code ran instantly. n=12 proved to be difficult, and instead of debugging it, since it was taking a long time, I decided to call it quits and be happy with where I was, because 1) this section was not part of my original goal anyway, and 2) well, diminishing returns. I may come back to it later, but no promises.

Memory-Safe Counter

The memory-safe implementation had the following project goals:

  • Use only safe Rust code

  • Support parallel processing

  • Handle both fixed and free polycube counting

This approach worked correctly and efficiently for smaller values of n. For example, n=11 was instant. However, n >= 12 proved to be difficult, and it could not compete with either Dodds’s or Thompson’s applications.

The problem was that the safe data structures consumed more memory than unsafe ones (direct pointer manipulation), and as a result, finding canonical forms became increasingly expensive.

Recursion was another issue, as deep recursion with safe data structures led to stack overflows. So, surprisingly, my original code performed better than this version for n=12, but was slower for n=11.

Last Hurrah: I Feel The Need, The Need for Speed

As I was writing this article and getting ready to publish it, I was also thinking about what else I could do to optimize my original project. Once I am done with a project, I usually walk away from it for a while, especially if I am thinking about how to solve a problem or optimize something, to let my subconscious thinking do the work.

Anyway, as I was thinking, I realized that a significant bottleneck was hiding in the canonical form representation.

You see, the code used a string-based representation for uniqueness checking. It works fine, but string operations in any language are expensive. For small n, this performance hit is negligible, but as we get to the bigger numbers, say n=10 and above, the performance hits just keep coming.

The solution to this problem was straightforward but powerful: replace string representations with direct lexicographic comparison and 64-bit hashes.

💡
Lexicographic comparison is a way of ordering elements based on their component parts, similar to how words are ordered in a dictionary. In the context of polycubes, lexicographic comparison means comparing the coordinates of the cubes in a specific order: we sort the cubes in each polycube by their coordinates, then we first compare the x-coordinates; if they are equal, we compare the y-coordinates; and if those are equal too, we compare the z-coordinates.

We can directly compare position vectors and compute a hash from the canonical form, entirely avoiding string operations. No more converting coordinates to strings. We directly compare the coordinates themselves.

pub fn get_canonical_hash(&self) -> u64 {
    let rotations = all_rotations(self);

    // Find the lexicographically smallest rotation
    let mut smallest: Option<Vec<Pos>> = None;

    for rotation in &rotations {
        let mut positions: Vec<Pos> = rotation.cubes.clone();
        // Sort by x, then y, then z (the lexicographic order described above)
        positions.sort_by(|a, b| {
            a.x.cmp(&b.x)
                .then(a.y.cmp(&b.y))
                .then(a.z.cmp(&b.z))
        });

        if smallest.is_none() || lexicographically_smaller(&positions, smallest.as_ref().unwrap()) {
            smallest = Some(positions);
        }
    }

    // Compute a 64-bit hash of the canonical form
    let canonical_positions = smallest.unwrap();
    let mut hasher = FxHasher::default();
    canonical_positions.hash(&mut hasher);
    hasher.finish()
}
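
The lexicographically_smaller helper can be trivial, since Rust already compares slices of Ord types element by element (a sketch, assuming Pos derives Ord as in the struct sketch earlier):

// Sketch: Vec<Pos> and &[Pos] compare lexicographically when Pos: Ord,
// so the helper can lean on the built-in ordering
fn lexicographically_smaller(a: &[Pos], b: &[Pos]) -> bool {
    a < b
}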

Using it as:

let canonical_hash = normalized.get_canonical_hash();

// Try to add to global uniqueness set
let mut unique_hashes_guard = unique_hashes.lock().unwrap();
if unique_hashes_guard.insert(canonical_hash) {
    local_polycubes.push(normalized);
}

The results were nothing short of spectacular:

n / Polycubes | Results in (secs) | V2 - Results in (secs) | V3 - Results in (secs)
11 / 2,522,522 | 239.71 → ~4 mins | 98.07 → ~1.5 mins | 13.3
12 / 18,598,427 | 2793 → ~46.5 mins | 819 → ~13.5 mins | 123.61 → ~2 mins

So, no matter what language you are using, keep looking for ways to improve. Direct integer comparisons and hashing are much more efficient than string operations.

Conclusion

This was an invaluable exercise, as I learnt a lot about Rust, efficiency, and unsafe code. It also showed that you may be a lot more efficient in something familiar to you than in something that is not, even if the unfamiliar is faster on paper.

Another lesson was that memory safety comes at a cost, but the other option is extremely hard. Both Dodds’s C# code and Thompson’s Rust code use unsafe code for speed.

Also, mathematical insight and knowledge matter. I am not a mathematician, and where I got stuck with Dodds’s formula, I could not come up with an alternative.

If you really want to go for higher n values, you can download either Dodds’s or Thompson’s code and run them.

In the end, I accomplished what I set out to do and this exercise deepened my understanding of the fascinating world of polycubes and the deceptively hard nature of the algorithmic challenges they present.
