I came across a great 2020 paper by David Barina with a brilliant, faster way of experimentally testing the Collatz conjecture! I simplified his CPU code here (see my comments at the top). The purpose of this webpage is to show how I kept making progress to speed up his algorithm even further!

With David Barina's generous help, I wrote my own code (collatzCPU.c), and it is a bit faster than his! **Using k=34 on my i5-3210M CPU running Ubuntu Linux and testing 2^{12} numbers for each number in the 2^{34} sieve, my code tests 3.6 × 10^{9} numbers/second, whereas his tests 2.6 × 10^{9} numbers/second.** My guesses are that (1) my way of doing the modulo operation helps the speed (calculating a modulo on large numbers takes time!), and (2) I also use a 3

My code simply prints any numbers that overflow or seem to be stuck in an infinite cycle (to standard output), and these are the numbers that could disprove the Collatz conjecture. **See my extensive comments at the top of the file to learn about how my code (and how David Barina's code) works**.

On Windows, you'll need to get GCC; macOS and Linux should already have it (though *gcc* links to *clang* on macOS, and *clang* is fine). Download GCC for Windows from here. When running the downloaded installer, select the architecture to be 64-bit (x86_64). I then added *C:\Program Files\mingw-w64\x86_64-8.1.0-posix-seh-rt_v6-rev0\mingw64\bin* to Windows' Path so that I could compile code in Command Prompt.

Note that my code generates the 2^{k} sieve and stores it in RAM before testing numbers. Ideally, my code would just load a pre-made lookup table from the storage drive, and it could do this using a buffer to prevent needing a noticeable amount of RAM (that is, load in only pieces at a time), but my main goal was just to create a fast algorithm, and I wanted the complete algorithm to be in a single file.

Why do we want larger k? The fraction of numbers not excluded by a 2^{k} sieve with k=34 is 0.007246, so less than 1% need to be tested! For k=40, the fraction drops to 0.004782. Larger k means faster code! This is a time-space tradeoff, since larger k also takes more RAM and/or room on the storage drive. Also, you can pre-calculate the first k steps of the non-excluded numbers, so larger k is even better!

David Barina uses a clever compression to store a 2^{34} sieve. 2^{34} bits is 2 GiB, but he found that the sieve contains only 50 unique 64-bit patterns. Since an index into 50 patterns fits in a byte (8 bits), he can use 8 bits to reference each 64 bits. This allows him to store a 2^{34} sieve in 256 MiB. However, k=40 would require 16 GiB using this compression.

A much larger concern is that *generating* the sieve took **a lot** of RAM. For k=34, there are 2^{34} 128-bit integers needed to test whether different paths join. This alone is 256 GiB, and there are other things needed in RAM. So k=34 seemed to be a limit.

Before I discuss the fix to these issues, I discovered a few basic things. You don't need 128-bit integers for *generating* the sieve if k < 41; I use 64-bit integers, which use half the RAM and less CPU time. Also, I discovered that there are actually 53 unique 64-bit patterns, though the "extra 3" occur only for k=6 and k=7.

The first big idea to realize is that not all paths can join; only "nearby" paths can. As k gets larger, more distant paths can join, but still only within a k-dependent distance. If you look in my code, you'll see these distances called deltaN. For k=34, deltaN = 46. For k=40, deltaN = 120. Using deltaN, I can create any 2^{k} sieve using very little RAM! collatzSieve2toK_FindPatterns.c is my code that experimentally finds deltaN (among other things) for any k. Note that this code takes over a day to run for k=40.

But, to experimentally find deltaN, you would still need all 2^{k} values stored in RAM, because you don't yet know deltaN. If you read the comments at the top of *collatzSieve2toK_FindPatterns.c*, you'll see that deltaN can be bounded by 2^{(k - minC)}, where *minC > k / log2(3)* is the lowest c that does not include already-ruled-out numbers, and c is how many increases are encountered in k steps (steps and "already-ruled-out n" are defined in the code's comments). **Confused? Just read the copious comments at the top of the code!** I was proud of myself for figuring this out, but, even with this bound, putting *2^(40 - ceiling(40/log2(3)))* into WolframAlpha for k=40 gives a bound of 16384. This solves the issue with RAM, but the CPU time would take almost half a year. The code's comments mention a code I use to get a tighter bound: collatzFindDeltaNbound.c is this tricky code. I developed the algorithm myself, and I wonder if anyone else has done this. Read the comments at the top with a pen and enough paper to figure out how the math works! For k=40, the tighter bound is just 679, which *collatzSieve2toK_FindPatterns.c* can run in a week to find deltaN = 120.

But 16 GiB for k=40 is still a huge file to share between computers or even just to keep on one. I can compress the k=40 sieve to 8 GiB by finding unique 256-bit patterns and referencing them using 2 bytes (16 bits). For k < 41, there are 13374 unique patterns (this count will likely increase for larger k). This is more than 53, but less than the 2^{16} limit for 2 bytes.

Why did I choose 256-bit patterns? First of all, pattern sizes should be powers of two, 2^{K} bits, because the exclusion rules have the form n % 2^{K} == value. For example, n%2 == 0 and n%4 == 1 do not need to be tested (as long as k is not less than K).

**64-bit patterns**: This is what David Barina used for an 8-to-1 compression. I get a 16-to-1 compression using 256-bit patterns. Note that *generating* a sieve with my compression would take about 30% more time due to having to search through more patterns, but I implemented a binary search to prevent the slowdown.

**128-bit patterns**: There are more than 256 unique 128-bit patterns, so they would need 2-byte labels. That is 2 bytes per 128 bits, the same 8-to-1 ratio as 64-bit patterns.

**512-bit patterns**: These would need 3-byte labels, which is fine, except I would use uint32_t because I am too lazy to get a 24-bit integer, so it would actually take 4 bytes unless you do something fancy like *uint8_t label[3]*. 4 bytes gives no advantage over using 256-bit patterns. More importantly, there would now be so many unique 128-bit patterns that they could no longer be expected to fit in CPU cache (the patterns are 128-bit since I only need to consider n%4 == 3).

**1024-bit patterns**: I have not tested whether all unique 1024-bit patterns fit in 4 bytes because I currently cannot handle them: even with my 4-to-1 compression trick (only looking at n%4 == 3), the unique patterns do not fit in a 128-bit integer. I could find a different container for them, but I don't care to, because there would be so many of them that they would fill up my storage drive and cache, and, for each k above 40, you would find *many* more unique patterns to add to the list.

As for how to implement these 256-bit-pattern sieves, set K=8 in *collatzSieve2toK_FindPatterns.c* and run through all the k values of interest. Note that this capital K has a different meaning than the earlier capital K! Luckily for you, I have already done this: all the needed information is in this spreadsheet (which has lots of other great info too!), and collatzCreatePatternsForSieve.c has the cumulative unique patterns for k < 41. You can then generate a sieve using collatzCreateSieve16to1compression.c; k=40 will likely take over a day. You would then modify my *collatzCPU.c* to load in a sieve instead of generating one, but I am not motivated to do that. Instead, to show how to load in the sieve using basically no RAM, here is some code: count16to1sieve.c. It just outputs how many numbers in the sieve need to be tested, but look at how the storage drive is read into a buffer to greatly reduce RAM usage.

On second thought, not writing a code to use these compressed sieves seems weird, so I made collatzCPU16to1.c! I still don't much care to run it with k=40 because...

• 8 GiB is a lot to write to a storage drive, and creating this k=40 sieve would take me 31 hours.

• Then, using the sieve would take me 2.6 days if I test 2^{10} numbers for each number in the 2^{40} sieve (that is, to test 2^{50} numbers).

But I got curious about what the numbers per second would be! I got **5.03 × 10^{9} numbers/second on the same i5-3210M CPU**. This wasn't as fast as I was expecting, but then I realized something: the numbers that are excluded by a k=40 sieve but not by a k=34 sieve are often the numbers that don't take long to test anyway!

David Barina also made some very fast GPU code. Next summer (in 2021), I want to tackle using the GPU!