Experimentally testing Collatz conjecture

I came across a great 2020 paper by David Barina with a brilliant faster way of experimentally testing the Collatz conjecture! I simplified his CPU code here (see my comments at the top). The purpose of this webpage is to show how I continued making progress to speed up his algorithm even further!

Making his CPU code a bit faster

With David Barina's generous help, I wrote my own code (collatzCPU.c), and it is a bit faster than his code! Using k=34 on my I5-3210M CPU running Ubuntu Linux and running 212 numbers for each number in the 234 sieve, my code tests 3.6 × 109 numbers/second, whereas his tests 2.6 × 109 numbers/second. My guess is that: (1) my way of doing modulo operation helps the speed (calculating modulo on large numbers takes time!), (2) I also use a 32 sieve instead of a 31 sieve, and (3) I use a small lookup table to check for overflow instead of doing a calculation each time. I don't calculate checksums, though I probably should to test my code. Like his code, I also use 128-bit integers. In his code (not in the one that I simplified), he has an option to use gmp.h to test rare numbers that need more than 128-bit integers (I can just use this tiny Python script on those numbers). I believe his code would be as fast as mine if he focused on his CPU-only code, but his main goal was to create fast GPU code.

My code simply prints any numbers that overflow or seem to be stuck in an infinite cycle (to standard output), and these are the numbers that could disprove the Collatz conjecture. See my extensive comments at the top of the file to learn about how my code (and how David Barina's code) works.

On Windows, you'll need to get GCC (macOS and Linux should already have it, though gcc links to clang on macOS, and clang is fine). On Windows, download GCC from here. When running the downloaded installer, select the architecture to be 64-bit (x86_64). I then added C:\Program Files\mingw-w64\x86_64-8.1.0-posix-seh-rt_v6-rev0\mingw64\bin to Windows' Path so that I can compile code in Command Prompt.

Note that my code generates the 2k sieve and stores it in RAM before testing numbers. Ideally, my code would just load a pre-made lookup table from the storage drive, and it could do this using a buffer to prevent needing a noticeable amount of RAM (that is, load in only pieces at a time), but my main goal was just to create a fast algorithm, and I wanted the complete algorithm to be in a single file.

Extending beyond k=34 to k=40

Why do we want larger k? The fraction of numbers not excluded by a 2k sieve for k=34 is 0.007246. Less than 1% need to be tested! However, k=40 has 0.004782. Larger k means faster code! This is a time-space tradeoff since larger k also takes more RAM and/or room in storage drive. Also, you can pre-calculate the first k steps of non-excluded numbers, so larger k is even better!

David Barina uses a clever compression to store a 234 sieve. 234 bits is 2 GiB, but he found that there are only 50 unique 64-bit patterns. Since the integer 50 fits in a byte (8 bits), he can use 8 bits to store 64 bits. This allows him to store a 234 sieve in 256 MiB. However, k=40 would require 16 GiB using this compression.

A much larger concern is that generating the sieve took a lot of RAM. For k=34, there are 234 128-bit integers needed to test whether different paths join. This alone is 256 GiB, and there are other things needed in RAM. So k=34 seemed to be a limit.

Before I discuss the fix to these issues, I discovered a few basic things. You don't need 128-bit integers for generating the sieve if k < 41. I use 64-bit integers, which use half of the RAM and use less CPU time. Also, I discovered that there are actually 53 unique 64-bit patterns, though the "extra 3" only occur for k=6 and k=7.

The first big idea to realize is that not all paths can join. Only "nearby" paths can join. As k gets larger, more distant paths can join, but still within a k-dependent distance. If you look in my code, you'll see these distances called deltaN. For k=34, deltaN = 46. For k=40, deltaN = 120. Using deltaN, I can create any 2k sieve using very little RAM! collatzSieve2toK_FindPatterns.c is my code that experimentally finds deltaN (among other things) for any k. Note that this code takes over a day to run k=40.

But, to experimentally find deltaN, you would still need all 2k values stored in RAM because you don't know deltaN yet. If you read the comments at the top of collatzSieve2toK_FindPatterns.c, you see that deltaN can be bounded by 2(k - minC), where minC > k / log2(3) is the lowest c that does not include already-ruled-out numbers, where c is how many increases are encountered in k steps (steps and "already-ruled-out n" are defined in the code's comments). Confused? Just read the copious comments at the top of the code! I was proud of myself for figuring this out, but, even with this bound, for k=40, putting 2^(40 - ceiling(40/log2(3))) into WolframAlpha gives a bound of 16384. This solves the issue with RAM, but CPU time is going to take almost half a year. The code's comments mention a code I use to get a tighter bound, and collatzFindDeltaNbound.c is this tricky code. I developed the algorithm myself, and I wonder if anyone else has done this. Read the comments at the top with pen and enough paper to figure out how the math of it works! For k=40, the tighter bound is just 679, which can be run in a week by collatzSieve2toK_FindPatterns.c to find deltaN = 120.

But 16 GiB for k=40 is still a huge file to share between computers or just have on a computer. I can compress the k=40 sieve to 8 GiB by finding unique 256-bit patterns and referencing them using 2 bytes (16 bits). For k < 41, there are 13374 unique patterns (will likely increase for larger k). This is more than 53, but less than the 216 limit for 2 bytes.

Why did I choose 256-bit patterns? First of all, pattern sizes should be of size 2K due to exclusion rules being of form n%(2K) == value. For example, n%2 == 0 and n%4 == 1 do not need to be tested (as long as k is not less than K).
64-bit patterns: This is what David Barina used for an 8-to-1 compression. I get a 16-to-1 compression using 256-bit patterns. Note that my compression would take about 30% more time to generate a sieve due to having to search through more patterns, but I implemented a binary search to prevent a slow down.
128-bit patterns: There are more than 256 unique 128-bit patterns, so they would need 2 bytes. This compression is identical to using 64-bit patterns.
512-bit patterns: These would need 3-byte labels, which is fine, except I would use uint32_t because I am too lazy to get a 24-bit integer, so it would actually take 4 bytes unless you do something fancy like uint8_t label[3]. 4 bytes gives no advantage over using 256-bit patterns. More importantly, there would now be so many unique 128-bit patterns that they could no longer be expected to fit in CPU cache (the patterns are 128-bit since I only need to consider n%4 == 3).
1024-bit patterns: I have not tested whether all unique 1024-bit patterns fit in 4 bytes because I am currently unable to handle these because, even with my 4-to-1 compression trick (only looking at n%4 == 3), I cannot fit these unique patterns in a 128-bit integer. I could find a different container for these patterns, but I don't care because there would be so many of them that they would fill up my storage drive and cache, and, for each k above 40, you would find many more unique patterns to add to the list.

As for how to implement these 256-bit-pattern sieves, set K=8 in collatzSieve2toK_FindPatterns.c and run through all the k values of interest. Note that this capital K has different meaning than the one in the previous paragraph! Luckily for you, I have already done this, and all needed information is in this spreadsheet (has lots of other great info too!), and collatzCreatePatternsForSieve.c has cumulative unique patterns for k < 41. You can then generate a sieve using collatzCreateSieve16to1compression.c. k=40 will likely take over a day. You would then modify my collatzCPU.c to load in a sieve instead of generating one, but I am not motivated to do that. Instead, to show you how to load in the sieve using basically no RAM, here is a code: count16to1sieve.c. It just outputs how many numbers need to be tested in the sieve, but look at how the storage drive is read to a buffer to greatly reduce RAM usage.

On second thought, not writing a code to use these compressed sieves seems weird, so I made collatzCPU16to1.c! I still don't much care to run it with k=40 because...
    • 8 GiB is a lot to write to a storage drive, and creating this k=40 sieve would take me 31 hours.
    • Then, using the sieve would take me 2.6 days if I test 210 numbers for each number in the 240 sieve (that is, to test 250 numbers).
But I got curious about what the numbers per second would be! I got 5.03 × 109 numbers/second on the same I5-3210M CPU. This wasn't as fast as I was expecting, but then I realized something: the numbers that are excluded by a k=40 sieve but are not excluded by a k=34 sieve are often the numbers that don't take long to test!

GPU!

David Barina also made some very fast GPU code. Next summer (in 2021), I want to tackle using the GPU!