Experimentally testing Collatz conjecture

I came across a great 2020 paper by David Bařina with a brilliant faster way of experimentally testing the Collatz conjecture! I simplified his CPU code here (see my comments at the top). I will call his algorithm the "n++ n--" algorithm. The purpose of this webpage is to show how I continued making progress to speed up testing the Collatz conjecture even further! I do this in a chronological way so that you may understand my progress.

Personally, I think the conjecture has now been tested enough experimentally, but, if you disagree, you should be using the fastest code with the best mathematical tricks! I highly enjoy the math and code behind experimentally testing Collatz conjecture. Please let me know if you want to use my code!

An often overlooked aspect of productivity is motivation, and my motivation to fine tune my code is gone. I once had tremendous motivation by getting different perspectives from conversations with David Bařina and Eric Roosendaal, and I'd be very happy to help anyone try to take the baton from me partly because it might inspire me to make more progress! I especially would appreciate someone with low-level CPU or GPU expertise (or who is trying to acquire expertise) optimizing my code! Just don't be that horrible sort of person who doesn't communicate when trying to improve!

Abstract

I have combined many clever already-known math tricks with my new "sieveless" approach. If, in addition to testing the Collatz conjecture, you want to find the max n, you should use my "n++ n--" code. If you just want to test the Collatz conjecture, you should use my "repeated k steps" code. Regardless, use my "sieveless" approach! I provide benchmarked code for both CPU and GPU.

To do this "sieveless" approach, I first studied a value I call deltaN that determines how close numbers must be for them to join paths in k steps (with the same number of steps that are increases). This number allows for very large sieves to be generated, and these sieves can now be generated in a "sieveless" way, meaning that sieves are never saved to a storage drive. In fact, the sieves are so huge that they cannot be stored on a storage drive. Note that, when I count steps, I count (3*n + 1)/2 as one step.

2021-03 Update: I wrote hybrid codes called "partially sieveless" to speed things up even more!
2021-04 Update: I do the "partially sieveless" algorithm using CUDA instead of OpenCL!

My "sieveless" approach could also be applied to a "reduce to 1" code, so I also explore and expose various challenges of any "reduce to 1" code.

I then study the number of steps needed to reduce a number below its starting value and the number of steps needed to reduce a number to 1.

Making his "n++ n--" CPU code a bit faster

With David Bařina's generous help, I wrote my own code (collatzCPU.c), and it is a bit faster than his code (that is, faster than the code I provided above; GitHub is updated)! Using k=34 on my I5-3210M CPU running Ubuntu Linux and running 2^12 numbers for each number in the 2^34 sieve, my code tests 3.6 × 10^9 numbers/second, whereas his tests 2.6 × 10^9 numbers/second (using a single CPU core for both and my simplified code to test his). My guess is that: (1) my way of doing the modulo operation helps the speed (calculating modulo on large numbers takes time!), (2) I also use a 3^2 sieve instead of a 3^1 sieve, and (3) I use a small lookup table to check for overflow instead of doing a calculation each time. I don't calculate checksums (or max n), though I probably should to test my code. Like his code, I also use 128-bit integers. In his code (not in the one that I simplified), he has an option to use gmp.h to test rare numbers that need more than 128-bit integers (I can just use a tiny Python script collatzTestOverflow.py on those numbers).

My code simply prints any numbers that overflow or seem to be stuck in an infinite cycle (to standard output), and these are the numbers that could disprove the Collatz conjecture. See my extensive comments at the top of the file to learn about how my code (and how David Bařina's code) works.
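
To give a rough idea of why the "n++ n--" algorithm is fast (this is my paraphrase of the trick, not a copy of his code or mine): for odd n, write n + 1 = m * 2^a with m odd; then a consecutive (3x+1)/2 steps turn n into 3^a * m - 1, so a whole run of increases is done with one shift and one multiply...
    #include <stdint.h>

    /* Sketch of the "n++ n--" inner loop.  Assumes n0 is odd, assumes no value
       ever has 64 or more trailing zeros, and omits the overflow and
       infinite-cycle checks that real code needs.  pow3[] holds 3^0, 3^1, ... */
    static void testOneNumber(__uint128_t n0, const __uint128_t *pow3)
    {
        __uint128_t n = n0;
        do {
            n++;                                   /* n was odd, so n + 1 is even           */
            int a = __builtin_ctzll((uint64_t)n);  /* a = trailing zeros of n + 1           */
            n >>= a;                               /* n is now the odd m                    */
            n *= pow3[a];                          /* a steps of (3x+1)/2 done at once      */
            n--;                                   /* ... giving 3^a * m - 1, which is even */
            n >>= __builtin_ctzll((uint64_t)n);    /* divide out all factors of 2           */
        } while (n > n0);                          /* done once n drops below its start     */
    }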

On Windows, you'll need to get GCC (macOS and Linux should already have it, though gcc links to clang on macOS, and clang is fine); download GCC from here. When running the downloaded installer, select the architecture to be 64-bit (x86_64). I then added C:\Program Files\mingw-w64\x86_64-8.1.0-posix-seh-rt_v6-rev0\mingw64\bin to Windows' Path so that I can compile code in Command Prompt.

Note that my code generates the 2k sieve and stores it in RAM before testing numbers. Ideally, my code would just load a pre-made lookup table from the storage drive, and it could do this using a buffer to prevent needing a noticeable amount of RAM (that is, load in only pieces at a time), but my main goal was just to create a fast algorithm, and I wanted the complete algorithm to be in a single file.

Extending beyond k=34 to k=40

Why do we want larger k? The fraction of numbers not excluded by a 2^k sieve for k=34 is 0.007246. Less than 1% need to be tested! However, k=40 has 0.004782. Larger k means faster code! This is a time-space tradeoff since larger k also takes more RAM and/or room in the storage drive. Also, you can pre-calculate the first k steps of non-excluded numbers, so larger k is even better!

David Bařina uses a clever compression to store a 2^34 sieve. 2^34 bits is 2 GiB, but he found that there are only 50 unique 64-bit patterns. Since the integer 50 fits in a byte (8 bits), he can use 8 bits to store 64 bits. This allows him to store a 2^34 sieve in 256 MiB. However, k=40 would require 16 GiB using this compression.
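
Reading such a compressed sieve back is just a double lookup. Here is a sketch with my own variable names (not his)...
    #include <stdint.h>

    /* 8-to-1 compression: one byte per 64-bit sieve block, where the byte is an
       index into the short table of unique 64-bit patterns. */
    extern const uint64_t uniquePatterns[];   /* the ~50 unique 64-bit blocks  */
    extern const uint8_t  *compressedSieve;   /* one index byte per 64 numbers */

    static inline int needsTesting(uint64_t n)
    {
        uint64_t block = n >> 6;   /* which group of 64 numbers   */
        int      bit   = n & 63;   /* which number in that group  */
        return (uniquePatterns[compressedSieve[block]] >> bit) & 1;
    }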

A much larger concern is that generating the sieve took a lot of RAM. For k=34, there are 2^34 128-bit integers needed to test whether different paths join. This alone is 256 GiB, and there are other things needed in RAM. So k=34 seemed to be a limit.

Before I discuss the fix to these issues, I discovered a few basic things. You don't need 128-bit integers for generating the sieve if k < 41. I use 64-bit integers, which use half of the RAM and use less CPU time. Also, I discovered that there are actually 53 unique 64-bit patterns, though the "extra 3" only occur for k=6 and k=7.

The first big idea to realize is that not all paths can join. Only "nearby" paths can join. As k gets larger, more distant paths can join, but still within a k-dependent distance. If you look in my code, you'll see these distances called deltaN. For k=34, deltaN = 46. For k=40, deltaN = 120. Using deltaN, I can create any 2^k sieve using very little RAM! collatzSieve2toK_FindPatterns.c is my code that experimentally finds deltaN (among other things) for any k. Note that this code takes over a day to run k=40.

But, to experimentally find deltaN, you would still need all 2^k values stored in RAM because you don't know deltaN yet. If you read the comments at the top of collatzSieve2toK_FindPatterns.c, you see that deltaN can be bounded by 2^(k - minC), where minC > k / log2(3) is the lowest c that does not include already-ruled-out numbers, and c is how many increases are encountered in k steps (steps and "already-ruled-out n" are defined in the code's comments). Confused? Just read the copious comments at the top of the code! I was proud of myself for figuring this out, but, even with this bound, for k=40, putting 2^(40 - ceiling(40/log2(3))) into WolframAlpha gives a bound of 16384. This solves the issue with RAM, but CPU time is going to take almost half a year. The code's comments mention a code I use to get a tighter bound, and collatzFindDeltaNbound.c is this tricky code. I developed the algorithm myself, and I wonder if anyone else has done this. Read the comments at the top with pen and enough paper to figure out how the math of it works! For k=40, the tighter bound is just 679, which can be run in a week by collatzSieve2toK_FindPatterns.c to find deltaN = 120.
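
If you want to reproduce the crude bound without WolframAlpha, a throwaway calculation like this (not one of my files) prints 16384 for k=40...
    #include <math.h>
    #include <stdio.h>

    /* Crude deltaN bound 2^(k - minC) with minC = ceiling(k / log2(3)). */
    int main(void)
    {
        int k = 40;
        int minC = (int)ceil(k / log2(3.0));
        printf("k = %d: crude deltaN bound = %.0f\n", k, pow(2.0, k - minC));
        return 0;
    }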

But 16 GiB for k=40 is still a huge file to share between computers or just have on a computer. I can compress the k=40 sieve to 8 GiB by finding unique 256-bit patterns and referencing them using 2 bytes (16 bits). For k < 41, there are 13374 unique patterns (will likely increase for larger k). This is more than 53, but less than the 2^16 limit for 2 bytes.

Why did I choose 256-bit patterns? First of all, pattern sizes should be of size 2^K due to exclusion rules being of form n%(2^K) == value. For example, n%2 == 0 and n%4 == 1 do not need to be tested (as long as k is not less than K). Here are the alternatives to using 256-bit patterns...
64-bit patterns: This is what David Bařina used for an 8-to-1 compression. I get a 16-to-1 compression using 256-bit patterns. Note that my compression would take about 30% more time to generate a sieve due to having to search through more patterns, but I implemented a binary search to prevent a slow down.
128-bit patterns: There are more than 256 unique 128-bit patterns, so they would need 2 bytes. This compression is identical to using 64-bit patterns.
512-bit patterns: These would need 3-byte labels, which is fine, except I would use uint32_t because I am too lazy to get a 24-bit integer, so it would actually take 4 bytes unless you do something fancy like uint8_t label[3]. 4 bytes gives no advantage over using 256-bit patterns. More importantly, there would now be so many unique 128-bit patterns that they could no longer be expected to fit in CPU cache (the patterns are 128-bit since I only need to consider n%4 == 3).
1024-bit patterns: I have not tested whether all unique 1024-bit patterns fit in 4 bytes because I am currently unable to handle these because, even with my 4-to-1 compression trick (only looking at n%4 == 3), I cannot fit these unique patterns in a 128-bit integer. I could find a different container for these patterns, but I don't care because there would be so many of them that they would fill up my storage drive and cache, and, for each k above 40, you would find many more unique patterns to add to the list.

As for how to implement these 256-bit-pattern sieves, set K=8 in collatzSieve2toK_FindPatterns.c and run through all the k values of interest. Note that this capital K has a different meaning than the one in the previous paragraph! Luckily for you, I have already done this, and all needed information is in my spreadsheet largeK.xlsx (it has lots of other great info too!), and collatzCreatePatternsForSieve.c has cumulative unique patterns for k < 46. You can then generate a sieve using collatzCreateSieve16to1compression.c. k=40 will likely take over a day. You would then modify my collatzCPU.c to load in a sieve instead of generating one, but I am not motivated to do that. Instead, to show you how to load in the sieve using basically no RAM, here is some code: count16to1sieve.c. It just outputs how many numbers need to be tested in the sieve, but look at how the storage drive is read into a buffer to greatly reduce RAM usage.

On second thought, not writing a code to use these compressed sieves seems weird, so I made collatzCPU16to1.c! I still don't much care to run it with k=40 because...
    • 8 GiB is a lot to write to a storage drive, and creating this k=40 sieve would take me 31 hours.
    • Then, using the sieve would take me a couple days if I test 2^10 numbers for each number in the 2^40 sieve (that is, to test 2^50 numbers).
But I got curious about what the numbers per second would be! I got 5.03 × 10^9 numbers/second on the same I5-3210M CPU. This is certainly faster but wasn't as fast as I was expecting; I then realized something: the numbers that are excluded by a k=40 sieve but are not excluded by a k=34 sieve are often the numbers that don't take long to test!

Update: By talking with Eric Roosendaal, I learned of another way to store 256 bits into 2 bytes. A 2^8 sieve (has 256 numbers) has only 16 numbers that need to be tested. So, for each consecutive group of 256 numbers in a larger 2^k sieve, you just need 16 bits (2 bytes) to say whether each of these 16 numbers are needed. I do something similar but with a 2^2 sieve, which only requires 1 number to be tested, allowing me to store the 256-bit patterns as 64-bit patterns, but Eric's idea lets us think of them as 16-bit patterns! The advantage of Eric's idea is that we don't have to find all the unique patterns beforehand! My "sieveless" codes don't need this compression, but my "partially sieveless" codes use it!
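
Here is roughly what reading Eric's layout looks like (my own sketch and names; the 16 surviving residues mod 256 would come from the 2^8 sieve)...
    #include <stdint.h>

    /* A 2^8 sieve leaves only 16 residues mod 256 that can ever need testing, so
       each group of 256 numbers in the big sieve is stored as one 16-bit word,
       one bit per surviving residue. */
    extern const uint8_t  survivingResidues[16];   /* the 16 residues mod 256   */
    extern const uint16_t *groupBits;              /* one word per group of 256 */

    static inline int needsTesting256(uint64_t n)
    {
        uint64_t group = n >> 8;
        uint8_t  r     = (uint8_t)(n & 255);
        for (int i = 0; i < 16; i++)               /* find r among the 16 residues */
            if (survivingResidues[i] == r)
                return (groupBits[group] >> i) & 1;
        return 0;                                  /* other residues never need testing */
    }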

"n++ n--" algorithm on GPU!

David Bařina also made some very fast GPU code. We really want to be using a GPU to do these calculations (not just a CPU)! In Dec. 2020, I tackled using the GPU!

Update 1: I have experimentally found that OpenCL on all my Intel and AMD GPUs on macOS and Windows makes various subtle 64-bit and 128-bit integer arithmetic errors (getting the wrong answer—yikes!!), in addition to some GPUs not implementing ctz() and printf() acting strangely. Different GPUs make slightly different arithmetic errors for 128-bit integers, but the 64-bit error seems to be the same for all platforms that make it. As for the arithmetic errors, here are two of my (sloppy) test codes: testBug64.c and testBug128.c. Here is cldemo.c to probe your OpenCL platforms and devices. David Bařina only used OpenCL on Nvidia GPUs on Linux and experienced no errors and could use 128-bit integers (he used Nvidia CUDA Toolkit 10.x or 11.x), so I'll try that (via Ubuntu's nvidia-cuda-toolkit package), which will require me to SSH into a computer that has an Nvidia device! I also want to try Nvidia on Windows. Since macOS no longer supports CUDA and OpenCL, I have no desire to try Nvidia on macOS. To compile OpenCL...
• On macOS, I just used the -framework opencl option of gcc (instead of the usual -lOpenCL option).
• On Windows, I tried MinGW-w64, like this. To allow various Linux C libraries to work and to get ctz() to work in a kernel, I needed Cygwin instead. Select the gcc-core and libOpenCL-devel packages when installing Cygwin. For Cygwin, gcc needs /cygdrive/c/Windows/System32/OpenCL.DLL as an argument (instead of the usual -lOpenCL option). To get a printf() in an OpenCL kernel to work in Cygwin, I had to add C:\cygwin64\bin\ to the Windows Path then run via the usual cmd.exe.
Nvidia on Windows works! There are no arithmetic errors! Just download and install the CUDA Toolkit from Nvidia before trying to compile using Cygwin! The only weird thing is that a printf() inside an OpenCL kernel needs llu for ulong (lu only prints 32 bits), but, even though this is OpenCL and lu works on Windows with Intel and AMD GPUs, needing llu is a typical Windows thing I guess. On second thought, a long integer is always 64-bit in OpenCL, so this behavior by Nvidia is actually a bug. Luckily, llu works for Intel-GPU-on-Windows (not AMD GPUs on Windows) and for Nvidia-on-Linux, so I just switched all my OpenCL kernels to use llu. Also, llu doesn't work when using Beignet on Linux because Beignet seems to strictly abide by standards! Anyway, I wonder if PRIu64 works in OpenCL for all devices and platforms.
• When using Linux, I never get the scary arithmetic errors that macOS and Windows have given me (for Intel and AMD). Using Ubuntu's beignet-dev package (for Intel GPUs) avoids all problems on all GPUs, but is slow and does not have 128-bit integers. For Beignet on Ubuntu, I also needed to install the ocl-icd-opencl-dev and opencl-headers packages. On Linux, the usual -lOpenCL option is needed in gcc. Instead of beignet-dev, I tried the better intel-opencl-icd package, but my GPUs were too old to be supported, and it wouldn't work. Instead of an Intel GPU and beignet-dev, I used AMD and mesa-opencl-icd, but many things didn't work (though no arithmetic errors!). For AMD, instead of mesa-opencl-icd, I looked into AMD's Linux drivers, but my GPU wasn't on the list of supported devices, so I didn't try it.

Based on my limited data, I now make the following conclusions: (1) Linux, unlike Windows and macOS, never makes arithmetic errors regardless of GPU, (2) Nvidia, unlike Intel and AMD GPUs, never makes arithmetic errors, (3) everything, such as ctz() and printf(), works on Nvidia, and (4) the device, not the OS, determines if there is a watchdog timer (only my Intel GPUs have a watchdog timer, and they have it on all OSs).

Update 2: Using an Nvidia Quadro P4000 (and Xeon W-2135) and Ubuntu's nvidia-cuda-toolkit package, OpenCL and David Bařina's code run successfully! I simplified his GPU code here (see my comments at the top), and here is his simplified OpenCL kernel. His code achieved 5.73 × 10^10 numbers/second on the Nvidia Quadro P4000 (GitHub is updated with faster code by now), so my goal is 1 × 10^11 numbers/second! I didn't fine tune any parameters; I just asked David Bařina what the standard use is, then I increased TASK_SIZE by 4 (from 40 to 44). I then increased TASK_UNITS by 4 (from 16 to 20) in order to keep TASK_SIZE - TASK_UNITS the same, and the code did slightly better giving the speed above.

Update 3: I worked with him to modify his code to achieve 7.77 × 10^10 numbers/second! Here is my host program, gpuworkerBrad.c, and OpenCL kernel, kernelBrad.cl. Unlike my CPU code, which was from scratch, I only modified his GPU code. I somewhat cheated by no longer searching for the largest number or making a checksum. Read my notes at the top of each file. Anyway, my next goal is to improve a pre-calculation (interlacing) kernel as I feel it has much potential!

Update 4: His pre-calculation (interlacing) code has been improved! Here is my kernelBrad-precalc.cl that runs 1.20 × 10^11 numbers/second, surpassing my goal! As always, see my comments at the top, and it runs with the same host program as the other kernel (just change the kernel filename in the host program). In the process of making this, he also got his pre-calculation kernel running much better! I find myself happy but also annoyed with the idiosyncrasies of GPUs. Something that would help the GPU that I was kindly allowed to use (Pascal microarchitecture) would not help the GPU that David Bařina was using (newer Turing microarchitecture), even though nothing else differed. Also, things that should obviously help reduce computation would greatly slow down the calculations on both architectures. From my limited experience experimentally testing the Collatz conjecture, I conclude that CUDA Cores hate reading from RAM unless it's the RAM that is shared with the threads in the same work group. Anyway, I am happy to have learned so much about GPUs! I want to thank Brad Philipbar for letting me SSH into his fantastic machine during my winter break, and I thank David Bařina for answering my emails long before I knew much of anything about the Collatz conjecture!

Update 5: I made 2 improvements: (1) generating my own format of sieves and (2) running 3 × 2^TASK_SIZE numbers. The goal of both is to reduce the amount of time threads spend waiting on other threads in the work group. I am now at 1.30 × 10^11 numbers/second when running 3 × 2^46 numbers with a 2^34 sieve (same sieve size as prior runs). Here is the sieve generator, collatzCreateSieveUintList.c, the host program, gpuworkerBrad-precalc2.c, and the kernel, kernelBrad-precalc2.cl. The cost of my using my own sieves is that they: (1) use more GPU RAM, (2) use more of the storage drive, (3) require setting more parameters when changing the sieve size, and (4) take more time to run unless TASK_SIZE - SIEVE_LOGSIZE is at least 10. My next goal is to use a 2^40 sieve, which will probably require running the kernel multiple times on different parts of the sieve.

Update 6: The 2^40 sieve helped (see the sieve creator in the previous update)! It took a day to create, and it requires almost 40 GiB of your storage drive, but you never have to create it again after the first time! The Nvidia Quadro P4000 now does 1.83 × 10^11 numbers/second when running 3 × 2^48 numbers! A newer GPU microarchitecture and more CUDA cores would certainly increase this speed! Turing microarchitecture seems to be about twice as fast as Pascal per CUDA Core. Looks like this GPU is 36 times faster than a single I5-3210M CPU core. I also increased the speed (or allowed for a smaller TASK_SIZE) by loading the next part of the sieve from the storage drive to CPU RAM while the GPU works on the previous part of the sieve. Here is the host program, gpuworkerBrad-precalc3.c, and the kernel, kernelBrad-precalc3.cl. If I cared to save space on my storage drive, I'm sure the 8 GiB 2^40 sieve could also be used by populating my section_of_sieve array using a compressed sieve (to be done while the GPU is working), but I'll leave doing this to someone who cares, though I'd start by looking at my count16to1sieve.c to see how to look through those sieves.

Update 7: I guess I'll write the code to run my compressed sieves! Here is the host program, gpuworkerBrad-precalc3b.c, which still uses kernelBrad-precalc3.cl. I need to run extra numbers, 3 × 2^50, to get the same speed. This is probably because decompressing the sieves prevents the CPU core from best controlling the GPU: with a not-large-enough TASK_SIZE, the CPU doesn't finish before the GPU finishes.

Extending beyond k=40 to k=80 using a GPU

The next step is to start working up to k=80. Since the fraction of numbers that need to be tested approximately exponentially decays as k increases, a larger k would be great! This requires replacing various 64-bit integers with 128-bit integers in collatzSieve2toK_FindPatterns.c (all 64-bit integers should be replaced) and collatzCreateSieve16to1compression.c (all 64-bit integers except those having to do with patterns and bytes should be replaced). My collatzCPU16to1.c should be ready to go, but kernelBrad-precalc3.cl needs variable L to be changed to a 128-bit integer when k > 40.

For k=41, I get a reduced deltaN bound of 1215. For k=46, I get 5231. Finding the actual (experimental) deltaN from this bound for k=41 would take about a month on a CPU core! The only reasonable way to continue would be to try to use a GPU!

My idea is to have the GPU run a kernel that does k steps of something like 2^26 numbers (probably have each GPU thread do some group of these numbers). The GPU will create a uint64_t array that is 2^27 long. The extra "times 2" is because I actually want a uint128_t array, but those can't be passed to a kernel. This array will store the result after k steps. The GPU will create another array that is uint8_t that is 2^26 long that will be set 0 or 1 depending on if the number ever reduces during those k steps (though I could make it a fourth of this length by only saving when n%4==3). The CPU can then use the huge array to see if paths joined for any of the numbers that have a 1 in the uint8_t array (changing some of the 1's to 0's). The CPU will have to hold on to deltaN of the end of the previous huge array. The CPU can then deal with finding patterns and any writing to the storage drive while the GPU is running the next chunk of 2^26 numbers. The GPU might need to create another uint8_t array that counts the number of increases during the k steps.
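
Here is my rough picture of the CPU-side join check described above, with names that are mine (the real arrays pack the 128-bit values into pairs of uint64_t, and the start of each chunk also has to be compared against the tail of the previous chunk, which I omit here)...
    #include <stdint.h>

    /* valueAfterKsteps[] and increases[] come back from the GPU kernel;
       stillNeedsTesting[] holds 1 for numbers that did not reduce within k steps.
       A number can be dropped if it lands on the same value as a nearby smaller
       number with the same count of increases. */
    void removeJoinedPaths(uint64_t chunkSize, uint64_t deltaN,
                           const __uint128_t *valueAfterKsteps,
                           const uint8_t *increases, uint8_t *stillNeedsTesting)
    {
        for (uint64_t i = deltaN; i < chunkSize; i++) {
            if (!stillNeedsTesting[i]) continue;          /* already reduced         */
            for (uint64_t m = 1; m <= deltaN; m++) {
                if (valueAfterKsteps[i] == valueAfterKsteps[i - m]
                      && increases[i] == increases[i - m]) {
                    stillNeedsTesting[i] = 0;             /* paths joined: remove it */
                    break;
                }
            }
        }
    }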

Using the above idea, I can create a new collatzSieve2toK_FindPatterns.c that finds the experimentally reduced deltaN for k > 40 using a kernel file. By running each k in order, I can also create an updated patterns file that allows sieves to go above k=40. So far, I'm not saving the sieves out to the storage drive yet. Here is collatzSieve2toK_FindPatterns_GPU.c and the kernel, kernel.cl. Running it will take me time, but I add to the spreadsheet and update collatzCreatePatternsForSieve.c as I get results.
Note: The code currently makes and uses a uint8_t array that counts the number of increases during the k steps, though, without it, the code runs a bit faster and, at least for k=40 (with deltaN = 120), creates the same sieve. It seems that for small n and large deltaN, it should be necessary to check the number of increases to prevent numbers from being incorrectly removed from the sieve. Since it would be a disaster to skip numbers that should be tested, I must include a comparison of the number of increases. Currently, the code will output some information if this comparison is ever useful! Maybe I can prove that this comparison is never necessary? So far, running k = 41, 42, 43, 44, and 45 (using deltaN bound, not the experimental deltaN) have not required this comparison.

If I run the same GPU code but with a larger deltaN, the code can be much slower. This is because the CPU core cannot process the results from the GPU fast enough. Multiple CPU threads via OpenMP to the rescue! My collatzSieve2toK_FindPatterns_GPU.c now uses many CPU threads (the number of threads can be set in the code). In the future, as GPUs get faster and CPU cores don't, OpenMP will become more and more necessary. On the Nvidia setup I'm using, a deltaN over 200 or 300 could justify using OpenMP. Note that, as k increases, ignoring the effect of k on deltaN, the CPU doesn't have to work as hard to keep up with the GPU because the CPU has fewer numbers to test (and the GPU has to test each number farther).
If you don't want to use multiple cores, you could set the num_threads() to 1 or just comment out the entire pragma omp line.
On Windows, OpenMP works great using gcc in Cygwin (not so great in MinGW-w64). On macOS, to get OpenMP to work, I installed gcc9 via MacPorts, then I use gcc-mp-9 instead of clang. Of course, OpenMP works flawlessly in Linux.

For k > 43, OpenMP alone wasn't enough for my comfort because the runs now take too long, and there is a risk of a power outage. Since these runs use a lot of electricity ($1 USD per day), I don't want the runs to go to waste due to a power outage at the last minute. To fix this problem, my collatzSieve2toK_FindPatterns_GPU.c now saves checkpoints to the storage drive at the frequency of your choosing. Set log2saves to 0 to turn off checkpoints, and set loadCheckpoint to an integer larger than 0 to load a saved checkpoint.
Using checkpoints, I can now see how deltaN (to be specific, the variable maxM) increases as the sieve is analyzed, and, interestingly, it doesn't immediately reach the final value meaning that the numbers that require the large deltaN are very rare.

Here is collatzCreateSieve16to1compression_GPU.c for creating sieves! It uses the same kernel, kernel.cl. It doesn't yet use OpenMP because the deltaN that this file runs is much less than the deltaN when first running collatzSieve2toK_FindPatterns_GPU.c, and it doesn't yet save at checkpoints because I don't really care to save out huge sieves (see the next section).

As for going up to k=80, going above k=64 will require kernelBrad-precalc3b.cl to receive a 128-bit sieve array, which, due to the limits of OpenCL, must instead be a 64-bit array that is twice as long (arrayLarge[] in my kernel.cl already passes 128-bit numbers this way). You will also have to change how bMod is calculated in collatzCPU16to1.c (to be done like k2Mod). Another issue with collatzCPU16to1.c is that lenC3 will need to be increased from 65 to 81. Going above k=65 will require maxDecreases[] in collatzFindDeltaNbound.c to be lengthened (very easy!).

Creating sieves above k=80 would require integers that are larger than 128-bit because 2^80 - 1 increases 80 times to 3^80 - 1, which fits in a 128-bit integer, but 3^81 does not fit. You would want to rewrite the code using smaller integer types, then implement low level arithmetic operations yourself. In OpenCL, you can use the vector data type uint8 (or ulong4). Perhaps a good intermediate step would be a uint4 and uint2 (or a ulong2 and ulong). As far as I can tell, the only advantage of using uint instead of ulong in OpenCL is not getting subtle arithmetic errors on Intel and AMD GPUs, but, I am so horrified that these errors occurred, that I'd only ever use Nvidia for serious OpenCL.

"Sieveless" runs! (using the "n++ n--" algorithm)

Since k=46 would require 512 GiB of storage drive (ideally an SSD), I have an idea that never requires running collatzCreateSieve16to1compression_GPU.c to create a sieve on the storage drive allowing for much larger sieves!

My idea is to run very large sieves "without a sieve" in that I will generate the sieve live without reading from a storage drive. Ironically, this was my first idea (see collatzCPU.c), but now I will implement it in a very good way. Instead of having some sort of task ID that you increment each run, you will have two task IDs: the first controls a huge amount of numbers such as 9 * 2^65, and the second refers to a chunk of these 9 * 2^65 numbers. Only increment the first when the second has reached its final value. The second task ID determines which segment of the very large sieve will be produced. The total size (9 * 2^65 in this example) should be chosen carefully because it would be a shame to have to leave a chunk unfinished. The idea is that you make the 2^k sieve as large as possible while keeping the sieve generation time negligible compared to time running numbers using that chunk of the sieve.
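
Here is roughly how I picture the bookkeeping, simplified and with my own names (the real codes also carry a factor of 9, and the two log2 sizes are the TASK_SIZE0/TASK_SIZE_KERNEL2 and TASK_SIZE parameters discussed below)...
    #include <stdint.h>

    /* task_id0 picks which block of 2^TASK_SIZE0 numbers is being worked on;
       task_id picks which 2^TASK_SIZE slice of the virtual 2^k sieve this run
       will generate live and then use. */
    void runTask(uint64_t task_id0, uint64_t task_id,
                 int k, int TASK_SIZE0, int TASK_SIZE)
    {
        uint64_t sliceStart = task_id << TASK_SIZE;          /* first residue mod 2^k in this slice */
        uint64_t sliceEnd   = sliceStart + (1ULL << TASK_SIZE);
        for (uint64_t n = sliceStart; n < sliceEnd; n++) {
            /* generate this piece of the sieve live: do k steps for residue n
               (plus the deltaN checks) and skip n if it is excluded */
            for (uint64_t j = 0; j < (1ULL << (TASK_SIZE0 - k)); j++) {
                __uint128_t N = ((__uint128_t)task_id0 << TASK_SIZE0)
                              + ((__uint128_t)j << k) + n;
                (void)N;   /* run N with "n++ n--" (or "repeated k steps"), not shown */
            }
        }
    }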

For the GPU code, you will want two kernels. Kernel1 will generate the part of the sieve determined by the second task ID. Kernel2 will test 9 * 2^(65 - k) numbers for each number in the sieve (this number should be large to not notice the extra time needed to generate the sieve live). Luckily, most of what kernel2 needs as inputs are the same as what kernel1 creates as outputs, so the GPU RAM can mostly just stay in the GPU between the kernels! This allows kernel2 to start with pre-calculated numbers!

As for the CPU-only code (that is, no GPU), I won't check any maxSteps variable anymore, for consistency with my GPU code. Perhaps one day I will try a "semi sieveless" code that loads in a smaller sieve to speed up the pre-calculation.

With this sieveless approach, I might try running collatzSieve2toK_FindPatterns_GPU.c up to k=48, or maybe even k=51 (or anything k<81). The main result I need is deltaN, so, with the help of the checkpoints that can be saved, I could always just run enough of these huge sieves to find a big-enough deltaN (or just have it print each time maxM is increased)! Besides, if a large deltaN is rarely needed, I don't want it slowing down my CPUs! To view the current deltaN from a checkpoint, something like od -N 2 -t d2 temp1maxM should print it! My sieveless codes allow you to choose a deltaN_max to override any other deltaN. This can be used to run k=65 for example.

With this sieveless approach, I won't maintain all the files here anymore. The following general files will be maintained...
    • collatzTestOverflow.py
    • collatzFindDeltaNbound.c
    • collatzSieve2toK_FindPatterns.c
    • largeK.xlsx         ← I have now highlighted the best k values to use!
I will of course also maintain...
    • collatzSieveless.c
I will maintain the following GPU files...
    • testBug64.c
    • testBug128.c
    • cldemo.c
    • collatzSieve2toK_FindPatterns_GPU.c
    • kernel.cl
    • collatzSieveless_GPU.c
    • kernel1.cl
    • kernel2.cl
The above were last updated 2021-02-22.

I have had incredible results so far (January 2021). I tried k=51 with the CPU code, and I easily got 9.0 × 10^9 numbers/second on the same I5-3210M CPU. I was running a task that only took a minute. My GPU code was also a huge success. With k=51, I easily got 3.1 × 10^11 numbers/second on the same Nvidia Quadro P4000. This was for a task that took less than a minute.

Getting exact speeds is difficult because each task has a different part of the sieve, but the sieveless approach is the only way to go. The only real con is that you must finish running every value of the second task ID for the results to be valid (or to have any useful results at all). Note that, for the CPU code, you can always set TASK_SIZE = k.

While running various tests on my GPU code to see if it was working correctly, I discovered something interesting: my hold[] and holdC[] were never being used yet I was always getting the correct results! So, for k=34, I searched for numbers of the binary form ...0000011 that did not reduce but joined the path of smaller numbers (because these are the numbers that would require hold[] and holdC[]), but I didn't find any! I only found (many) numbers of the form ...1011. I could then test hold[] and holdC[] using the very unrealistic settings: TASK_SIZE = 3 and task_id = 120703 (and k=34). This is interesting because maybe I could prove that, if TASK_SIZE is large enough, hold[] and holdC[] are unnecessary!

A larger k (and therefore TASK_SIZE) might be better even if a larger fraction of the time will be spent making the sieve. If you get a sieve that makes testing 10% faster at the cost of sieve creation taking 5% of the time, you still win!

Here are the steps to setting your parameters...
  (1) Choose an amount of numbers that you could commit to finishing in a month or whatever your timescale is. If 2^68, set TASK_SIZE_KERNEL2 in the GPU code and TASK_SIZE0 in the CPU code to be 68 (you'll also need to remove the 9's in the code, as explained in the code).
  (2) Then set k and deltaN_max to maximize the numbers per second! If you set k too large, then the code spends too large a fraction of time generating the sieve. I have found that the very large deltaN only occur very rarely, so deltaN_max need not be huge. If your TASK_SIZE_KERNEL2 and TASK_SIZE0 are not large enough and you find that the fastest k is something small like 35, you may wish to not use this sieveless approach and just use a sieve that you store on your storage drive. See my spreadsheet, largeK.xlsx, for good k values to run. Also, it might be faster to change the code to load in a smaller sieve to help speed up the creation of the actual sieve.
  (3) Choose the amount of numbers that each task should test—let's say 2^40—then set TASK_SIZE = 40 + k - (TASK_SIZE0 or TASK_SIZE_KERNEL2). Each task will look at a 2^TASK_SIZE chunk of the sieve to test 2^(TASK_SIZE0 - k) or 2^(TASK_SIZE_KERNEL2 - k) numbers per number in the sieve. Because there is apparent randomness to the sieve, TASK_SIZE should at least be 10 to give each process something to run! Unlike the CPU code, which needs very little RAM, TASK_SIZE in the GPU code determines RAM usage, and you want it large enough to minimize your CUDA Cores sitting around doing nothing.

If you want to run both the GPU and CPU code simultaneously, be sure to use the same sieve size for both, and have the size controlled by the first task ID be the same for both. I would normally say that TASK_SIZE needs to also be the same, but, if the CPU tasks need to finish in a certain time, it may be best to set TASK_SIZE differently: the CPU should have a lower one. This would require just a bit of organization when setting the second task ID to make sure that the tiny CPU tasks fit around the GPU tasks and to make sure that all of the sieve is eventually run.

If you have already tested up to some number using a non-sieveless approach, you can still easily use this sieveless approach! For my CPU code, adjust aStart manually. For my GPU code, adjust h_minimum and h_supremum in kernel2.cl, keeping in mind that integer literals over 64-bit don't work (so use bit shifts).

If you want to use general GPUs...
    • For computers with watchdog timers, a maxSteps variable must be introduced, and a message should be printed whenever testing a number goes beyond those steps. This is crucial.
    • For computers with watchdog timers, you will likely have to modify the code to call kernel2.cl multiple times, each time with a different h_minimum and h_supremum.
    • To accommodate GPUs (and CPUs) without ECC RAM, you'll have to run each task twice to validate.
    • The 128-bit integers should become OpenCL's uint4 to prevent arithmetic errors on non-Nvidia GPUs. I believe that ulong arithmetic should never be used on Intel or AMD GPUs. Another solution could be to look at my testBug64.c and testBug128.c to understand the bugs, then test each device for these bugs before running the Collatz code.
    • To accommodate weaker GPUs, set TASK_SIZE to be smaller.

To find the largest number or to calculate checksums, you can easily edit kernel2.cl to output some extra things. To see how, search David Bařina's kernel for mxoffset, max_n, and checksum_alpha. Of course, something similar can be done for my CPU code. I don't expect this to greatly slow down the code. I temporarily added the checksum code to my CPU and GPU codes, and the checksums matched those of David Bařina's codes.

If finding the max number and using a 2^k sieve, you have to be careful that the max number does not occur in the first k steps. That is, your sieve cannot be too large. Even after k increases, the number will only get approximately (3/2)^k times larger, requiring approximately log2(3/2) * k more bits (about 47 extra bits for k=80). From this list of max numbers, finding a maximum number in these k steps would be difficult to do since even a k=80 sieve is very safe (unless you start testing up to 2^89 and no new records are found!).

On the Nvidia-Quadro-P4000 Linux setup I am using, I use clFlush() and clFinish() for all of my sieveless GPU code after calling kernel2, and a CPU core waits at 100% usage for the GPU to finish. I don't think non-Nvidia GPUs would ever busy wait. To greatly reduce CPU usage when using Nvidia, instead of using clFlush() and clFinish(), I tried to return an event object when calling clEnqueueNDRangeKernel() then, in a loop containing usleep(), use clGetEventInfo() to see if the event has finished. I read somewhere that waking up from a sleep() only takes about 30 microseconds, so I could check many times per second! However, it seems that my code starts CPU busy waiting at clEnqueueNDRangeKernel(), but commenting out the printf() in kernel2 fixes this behavior. But, then the code waits at clGetEventInfo() with the CPU running at 100%. The developers at primegrid.com then helped me figure out that I needed a clFlush() before checking clGetEventInfo()! If you want to stop CPU busy waiting on Nvidia, check out my "partially sieveless" OpenCL host codes where I have commented-out code that can prevent busy waiting if you were to get rid of the printf() in kernel2, but getting rid of printf() in kernel2 would require some other means of communicating overflow!
Update: See my CUDA section below! Nvidia using my CUDA code does not cause CPU busy waiting even when I use a printf() in kernel2!

A different faster "repeated k steps" algorithm to be combined with my "sieveless" approach

99% of the time, my codes are running just a tiny amount of code: collatzSieveless.c is in the
    for (a=aStart; a<aEnd; a++)
loop, and collatzSieveless_GPU.c is in kernel2.cl's
    for (uint128_t h = h_minimum; h < h_supremum; ++h)
loop. I'm using the "n++ n--" algorithm for this code, and my "sieveless" approach allows me to use huge sieves to prevent this code from needing to be run as often. My CPU code is, as expected, likely the world's fastest at the moment. But this 2017 paper has much faster GPU code than mine! Their Nvidia GeForce GTX TITAN X has more CUDA Cores than I do, 3072 CUDA Cores vs. my card's 1792, and is an older Maxwell microarchitecture (vs. my Pascal).

Their code is faster because they repeatedly do k2 steps at a time. I currently do k steps then just a few steps at a time until the number is reduced. A long time ago, I found that my approach was faster (for CPU at least), but the paper actually describes two sieves being used simultaneously. For GPU, they use k=37 and k2=18. For CPU, they use k=37 and k2=11. The k=37 sieve is what I can turn into something like k=51 by using my "sieveless" approach. It is what determines which numbers are run and controls the initial k steps. The smaller sieve is for creating a table to do k2 steps at a time after the initial k steps. I suppose this k2 sieve isn't really acting as a sieve, so I suppose I should call it a lookup table, but I probably will accidentally call it a sieve from time to time! This "repeated k steps" approach is a replacement for the "n++ n--" algorithm (though the "n++ n--" algorithm can be used in generating sieves).
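
The core trick fits in a few lines. Here is my sketch (my notation, not theirs; overflow checking omitted): write n = 2^K2 * a + b, and after K2 combined steps, counting (3*n + 1)/2 as one step, n becomes 3^c[b] * a + d[b], where d[b] is what b turns into after K2 steps and c[b] counts how many of those steps were increases.
    #include <stdint.h>

    #define K2 11                        /* lookup tables have 2^K2 entries */

    extern const uint64_t d[1 << K2];    /* b after K2 steps                */
    extern const uint8_t  c[1 << K2];    /* number of increases for b       */
    extern const uint64_t pow3[K2 + 1];  /* 3^0 .. 3^K2                     */

    /* Do K2 steps at once using the lookup tables. */
    static inline __uint128_t doK2steps(__uint128_t n)
    {
        uint64_t b = (uint64_t)n & ((1u << K2) - 1);
        return (n >> K2) * pow3[c[b]] + d[b];
    }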

The best code depends on your purpose.
If you want to find the max n, you'll want to use the "n++ n--" algorithm with my "sieveless" approach.
If you just want to test the Collatz conjecture, then you'll want to use the "repeated k steps" algorithm with my "sieveless" approach.

By the way, shame on them for not making their code readily available, whereas David Bařina put a link to his code in his paper. If I couldn't do better than them, I would have tried Googling the Japanese authors to see if they can send me the code.

Below is CPU and GPU code! My CPU code is currently set to k=51 and k2=11. My GPU code is currently set to k=51 and k2=18. I easily got 1.8 × 10^10 numbers/second on my I5-3210M CPU and 1.5 × 10^12 numbers/second on the same Nvidia Quadro P4000. This is faster than the 2017 paper! So I'd say that, starting January 2021, I have the fastest CPU and GPU code in the world!
    • collatzSieveless_repeatedKsteps.c is my CPU code
    • collatzSieveless_repeatedKsteps_GPU.c
    • kernel1_repeatedKsteps.cl to make the k sieve (identical to kernel1.cl)
    • kernel1_2_repeatedKsteps.cl to make the k2 sieve
    • kernel2_repeatedKsteps.cl to use the sieves
The above were last updated 2021-05-03.

I again want to thank Brad Philipbar for letting me use his GPU, and I want to thank David Bařina, Brad Philipbar, and Trevor Wommack for their helpful conversations!

The GPU saw a gain larger than the CPU's gain. This is because a work group for a GPU must wait for the slowest thread, so using an algorithm like "repeated k steps" that caters to the hard-to-reduce numbers is wise!

"Partially sieveless" code

Great news! I learned that nearly all of the numbers removed by a deltaN (after checking for reduction in no more than k steps) are from deltaN = 1!
For k=21, 38019 numbers need testing if using all deltaN, 38044 need testing if only using deltaN=1, and 46611 need testing if not using any deltaN.
For k=32, 33880411 numbers need testing if using all deltaN, 33985809 need testing if only using deltaN=1, and 41347483 need testing if not using any deltaN.
I am adding a column to the spreadsheet, largeK.xlsx, to give the numbers that need testing if only using deltaN = 1. I now recommend setting deltaN_max to 1! This should cover most removals by a deltaN, allowing for a larger k! If you for sure want to use deltaN = 1, you can get rid of hold[] and holdC[] from the GPU code, and arrayLarge[] and arrayIncreases[] can be made half their length or even shorter.

An idea: I once had the idea of saving out a 2^32 sieve that a "partially sieveless" code could load in to help generate a much larger "sieveless" 2^k sieve. My GPU approach is to do k steps at a time for all numbers, so, for large deltaN, I'd still have to do k steps for all numbers anyway. My CPU-only approach would benefit, but I didn't think it would benefit enough to be revolutionary. So I settled for only using a 2^2 sieve to help generate a much larger 2^k sieve. However, a deltaN-equals-1 2^k sieve could use this 2^32 sieve very nicely! For each n in the 2^k sieve for which n % 2^32 belongs to the 2^32 sieve, do k steps for n and n-1. In fact, since the 2^32 sieve uses all possible deltaN values, doing k steps for n-1 (that is, using deltaN = 1) for the numbers in the 2^32 sieve barely helps. This should greatly speed up sieve generation allowing for larger k values! Also, for GPU code, TASK_SIZE could be increased more without running out of RAM. I realized that this using-multiple-sieves idea could be great for deltaN=1 when Eric Roosendaal was telling me about how he uses a 2^8 sieve (which only requires 16 numbers to be run), to create a larger sieve. I hope to (soon) create "partially sieveless" CPU and GPU codes for "n++ n--" as well as for "repeated k steps"!
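
In code, the filtering looks something like this (simplified, with my own names; the buffered reading of the 2^k1 file is what keeps RAM usage tiny)...
    #include <stdint.h>

    extern int inSmallSieve(uint64_t nModK1);   /* reads the on-disk 2^k1 sieve (buffered) */

    /* Decide whether residue n belongs to the big virtual 2^k sieve: only residues
       that survive the small 2^k1 sieve get the expensive k steps, and, since the
       big sieve uses deltaN = 1, only the n - 1 neighbor is checked for a joined path. */
    int inBigSieve(uint64_t n, int k1)
    {
        if (!inSmallSieve(n & ((1ULL << k1) - 1)))
            return 0;          /* excluded by the 2^k1 sieve */
        /* do k steps for n: exclude it if it drops below n, or if it lands on the
           same value as n - 1 with the same number of increases (deltaN = 1) */
        return 1;              /* placeholder for the k-step test */
    }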

I will call these codes "partially sieveless" because the smaller sieve of size 2^k1 is saved to disk, so it's not "sieveless" (that is, it's not generated as you go). However, the larger 2^k sieve will be "sieveless" (that is, it is generated as you go).

I have been using k1 = 32 because a 2^32 sieve can be created in a couple minutes by a single CPU core. The only con of a larger k1 is that the sieve file is harder to create, transfer, and store. Because a buffer is used to load only parts of the 2^k1 sieve, a larger k1 does not take more time or RAM to use. Once you have a large-k1 sieve file, use it! Especially good values for k1 are 32, 35, 37, 40, ... See my largeK.xlsx for why I say this.

Almost everything mentioned in the "Sieveless" runs! section and the "repeated k steps" algorithm section above still applies. Read them!

Here is code to create the 2^k1 sieve...
    • collatzCreateSieve.c.
It uses a 2^8 sieve to compress the output file by a factor of 16 and to speed up sieve creation.
Here are the CPU codes...
    • collatzPartiallySieveless.c
    • collatzPartiallySieveless_repeatedKsteps.c
Here are the GPU codes...
    • collatzPartiallySieveless_GPU.c
    • kern1.cl
    • kern2.cl
    • collatzPartiallySieveless_repeatedKsteps_GPU.c
    • kern1_repeatedKsteps.cl is identical to kern1.cl
    • kern1_2_repeatedKsteps.cl is identical to kernel1_2_repeatedKsteps.cl
    • kern2_repeatedKsteps.cl
The above were last updated 2021-05-03.

On CPU, creating the 2^k sieve is vastly faster with this approach! Even though deltaN = 1, the amount of numbers tested by a "partially sieveless" or an any-deltaN "sieveless" code is basically the same, partly because the 2^k1 sieve uses any deltaN. There is no need for me to provide speeds because the speeds of a k=51 sieve will be the same as with the "sieveless" approach. The difference from "sieveless" code is that we can now run larger k values! Or, to think of it another way, use the same k value, but commit to a much smaller TASK_SIZE0 or TASK_SIZE_KERNEL2.

On GPU, the "partially sieveless" is many times faster than the "sieveless" code for making the sieve, but it doesn't get the extreme speedup that the CPU sees. Perhaps this is because, for "partially sieveless", the different threads in the work group can test a very different number of numbers (the threads in the same work group must wait on the thread that takes the longest time). Note that the "sieveless" code spends most of the time to make the sieve on the CPU (when using my usual Nvidia device), so a very fast GPU would prefer my "partially sieveless" code even more. Feel free to use "partially sieveless" code for CPU and "sieveless" for GPU (or vice versa)! If you do this, as previously discussed, just be sure to use the same k and same TASK_SIZE0 or TASK_SIZE_KERNEL2 (different k1 or k2 values are just fine). Also, for large TASK_SIZE, my "partially sieveless" code uses about 9% of the RAM compared to my "sieveless" code!

As for how to test the validity of this code, I tested the 2^k1 sieve by removing the 2^k code that does the first k steps, and I tested the 2^k sieve by removing the code that checks against the 2^k1 sieve. When comparing to my "sieveless" codes, keep in mind that 2^k1 uses any deltaN, but the 2^k sieve uses deltaN = 1.

Making 128-bit Integers for CUDA (and OpenCL)

I was curious to try Nvidia's CUDA (which only works on Nvidia GPUs) instead of OpenCL. Learning CUDA can be useful because (1) it might be faster, (2) it sure is easier to code, and (3), on Nvidia GPUs, CPUs don't have to busy wait (Nvidia using OpenCL currently causes CPU busy waiting due to an Nvidia bug). CUDA currently has no native support for 128-bit integers, so CUDA requires that I find my own 128-bit integer class. Learning to do 128-bit integers can be useful because (1) not all OpenCL implementations have 128-bit integers and (2) one could perhaps speed up the Collatz algorithm by optimizing this 128-bit-integer code for Collatz.

I obtained 128-bit integers in CUDA via this GitHub project. The idea is to download a single file into your current folder, and put...
    #include "cuda_uint128.h"
in your .cu file. Note that...
    ♦ multiply and divide require the 2nd argument to be uint64_t
    ♦ casting to 128-bit works as usual with something like (uint128_t)foo, but casting away from 128-bit requires using .lo
    ♦ a ++ or -- must be placed before the variable (not after it)
    ♦ subtraction always has the 2nd argument be uint128_t
    ♦ addition has a version where the 2nd argument is uint128_t and another that has the 2nd argument be uint64_t, but ++ and += always make the 2nd argument be uint128_t
    ♦ bit shifts and comparisons are included, but *= is not even though -= and += are defined
I found several serious errors in the code (causing incorrect arithmetic to be done), so I told the author and they were fixed. Then I made some additions to the code to overcome the last three of the above limitations, and then I made some optimizations specific to my Collatz code, so I will provide my version of the .h file (see below) complete with comments explaining all my changes.

I converted my "partially sieveless" codes into CUDA! My goal was to keep the CUDA code as similar as I could to my OpenCL code. However, I did make a change to prevent the CPU from busy waiting during kernel2, which I couldn't successfully do on my Nvidia GPU using OpenCL due to Nvidia's less than perfect implementation of OpenCL. I checked this code's validity by temporarily adding in code to calculate checksums (as described above).
    • cuda_uint128.h
    • collatzPartiallySieveless_CUDA.cu
    • collatzPartiallySieveless_repeatedKsteps_CUDA.cu
The above were last updated 2021-04-27.

The CUDA code is about 10% faster than the OpenCL code!

Since implementing 128-bit integers as two 64-bit integers in OpenCL might be better than using __uint128_t in OpenCL, I want to try it! However, I still do not recommend using anything but Nvidia due to the unforgivable ulong arithmetic errors that Intel and AMD GPUs make (as previously mentioned). For CUDA, I used Nvidia's assembly language, PTX, via asm() in any addition involving 128-bit integers (all other arithmetic used in kernels was done without using PTX). Since OpenCL can run on many GPU types, there is no standard assembly language, so my 128-bit implementation may be slower than __uint128_t! The following are written in C (because I was curious how it could work in C), which doesn't allow me to overload operators like I did using C++ for CUDA, so the codes are hard to read!
    • kern1_128byHand.cl works in both collatzPartiallySieveless_GPU.c and collatzPartiallySieveless_repeatedKsteps_GPU.c
    • kern2_128byHand.cl works in collatzPartiallySieveless_GPU.c
    • kern2_repeatedKsteps_128byHand.cl works in collatzPartiallySieveless_repeatedKsteps_GPU.c
The above were last updated 2021-04-27.

The "repeated k steps" algorithm uses lookup tables instead of doing intense computation on 128-bit integers, so there wasn't a noticeable speedup (slower than CUDA still), but the "n++ n--" algorithm is now as fast as CUDA! As always, I'm testing these speeds on a Nvidia Quadro P4000.

Conclusions. On an Nvidia GPU, use the CUDA code because it is faster and never causes CPU busy waiting. On a non-Nvidia GPU, I'd use the above kernels where I implemented 128-bit integers myself! I do not have any new Intel or AMD GPUs to test the code on, but I hope it works!

I have had minimal success getting this OpenCL code to work on my old Intel HD Graphics 4000. I could only get the code to "successfully" run on macOS in that it would find the correct numbers that would overflow, but it consistently returned an incorrect checksum (I temporarily added some code to calculate checksums), probably due to the bug described in my testBug64.c (or a similar bug!). If I care, I might try to implement a 128-bit integer as 4 32-bit integers to try to fix that problem.
Update: I was using another old Intel GPU (on Windows) that had the 64-bit bug, but it was returning the correct checksum! Does this guarantee that the 64-bit bug will never cause problems? Heck no! Does this mean that there are other problems with the Intel HD Graphics 4000 on macOS? I'd say so! I was playing around with the Intel HD Graphics 4000 on macOS, and simply commenting out a printf() I added in kernel2 would change the checksum!
Here are things I typically had to do to get things to work...
- In kern2, I always had to move const struct uint128_t UINT128_MAX = {~(ulong)0, ~(ulong)0} into the actual kernel.
- For my Intel HD Graphics 4000, I often had to change all instances of -cl-std=CL2.0 to -cl-std=CL1.2 in the host code. Simply not specifying the OpenCL version also works!
- I think I always have to get rid of all goto statements in the kernels. I don't really know if this is always required because only my macOS code actually runs successfully! For sure, macOS didn't like the goto statements in the kernels. Getting rid of goto statements wasn't super hard: (1) one goto isn't even necessary, (2) several goto statements can be replaced with logic using variable R, and (3) the rest of the goto statements can be replaced with logic of an int you create (I called mine go, and I initialized it to 1).
- Using Beignet on Linux, I had to change %llu to %lu in the kernels (as previously discussed).
- On macOS, I had to write my own ctz() function! I shouldn't have to do this!
- For my Intel HD Graphics 4000, the variable TASK_SIZE_KERNEL2 must be reduced because my GPU has a watchdog timer of about 5 seconds. My kern2 could easily be changed to run only a fraction of the h values, then I could call clEnqueueNDRangeKernel() multiple times so that each call would take less than 5 seconds, but I'll worry about that if I ever get non-Nvidia GPUs to work!

I tried my AMD Radeon R5 (Stoney Ridge) GPU on Windows next (I don't think there are good Linux OpenCL drivers for it?). The -cl-std=CL2.0 is required (specifying nothing doesn't work), the goto statement is not allowed in kernels (see above paragraph for how to change this), and changing %llu to %lu in the kernels is required. It gave the correct checksum! This GPU also passes the testBug64.c test (though it fails the testBug128.c test), so I'm not surprised it works. Also, it doesn't seem to have a watchdog timer, which is great!
For running high TASK_SIZE_KERNEL2 values, the code (kern2) would sometimes just never complete! The problem is not a watchdog timer because I made a test kernel that successfully ran for many minutes. This is very strange because, as far as I can tell, the only resource that a larger TASK_SIZE_KERNEL2 requires is more time. Setting TASK_SIZE lower helps, as did raising clEnqueueNDRangeKernel()'s local_work_size argument. I was worried that I had a coding error, but testing (manually editing indices[]) eventually revealed that running the same numbers but with multiple smaller kernel executions worked perfectly. I believe there to be a problem with the GPU or its drivers because I often run this GPU on BOINC, and it will too often take a very long time to run a task, then time out, and the returned task is then found to be invalid. But I feel that I have succeeded in finally getting the code to run on a non-Nvidia GPU! When errors do occur on this GPU, at least their occurrence is not hidden. Instead of using clFinish(), I could use the commented out loop with usleep() to enforce a time limit.
Update: I was using another old Intel GPU (on Windows) that clearly had a watchdog timer, but this timer would either cause clEnqueueReadBuffer() to fail, or, for longer tasks, would cause the kernel to never complete, just like my AMD would do! I now believe that my AMD has a watchdog timer, but it is more complicated than stopping a kernel after a certain amount of time! Just like the AMD GPU's issue with BOINC, this other old Intel GPU would get stuck then time out on BOINC!

Here is "repeated k steps" code that should run on not-ancient non-Nvidia GPUs (maybe I'll do the "n++ n--" code in the future). I used what I learned in the above two paragraphs to make it. Depending on your GPU, you may need to set g_ocl_ver1 to 1 for it to work. I permanently added checksum code. With its current parameters, running ./a.out 0 0 should return a checksum of 8639911566. Running ./a.out 26 71353562 should return an overflow. Current parameters are set so that the kernels should finish within a 5 second watchdog timer. I have also added a check to see if 64-bit arithmetic has a certain bug, and there is now commented-out code that could enforce a time limit on kern2. You will still need to create a sieve using collatzCreateSieve.c (default is a 2^27 sieve).
    • collatzPartiallySieveless_repeatedKsteps_GPU_nonNvidia.c
    • kern1_128byHand_nonNvidia.cl
    • kern1_2_repeatedKsteps_nonNvidia.cl
    • kern2_repeatedKsteps_128byHand_nonNvidia.cl
The above were last updated 2021-05-07.
Next, I will edit kern2 to be able to call it in a loop so that a watchdog timer doesn't stop the code from running.

Things that may or may not be minor improvements to my codes

If you enjoy the details of coding, please improve my code!

(1) If you want to speed things up a bit more, my CPU-only sieveless codes don't yet use "n++ n--" when generating the sieve. In my GPU code that uses hold[] and holdC[] (any code before "partially sieveless"), the initial calculation of these by the CPU could also be converted to the "n++ n--" algorithm.

(2) I am still somewhat confused about how caching on GPUs works with OpenCL. For my "repeated k steps" code, there may be a better way to get the lookup table (arrayLarge2[] in kernel2) into a faster cache.

(3) In kernel2.cl, I have some code commented out that pre-calculates a variable, threeToSalpha, because using threeToSalpha slows down the code for some reason. Why? Would threeToSalpha slow down CUDA code also?

(4) On a GPU, why does a 3^2 sieve run slower than a 3^1 sieve? On a CPU, the 3^2 sieve is faster than a 3^1 sieve. I never compared speeds of 3^1 vs 3^2 for any "repeated k steps" code or any CUDA code.

(5) In all of my kernel2s, I calculate N0Mod from aMod, bMod, and cMod, where aMod increases by 1 each iteration. However, this could be sped up. If cMod is 1, then N0Mod will simply increase by 1 each iteration. If cMod is 2, then N0Mod will simply decrease by 1 each iteration. Note that cMod is either 1 or 2. Since cMod only depends on SIEVE_LOGSIZE, an #if could be used to choose the algorithm at compile time. My CPU-only codes could do the same thing but the variables are now called nMod, aMod, k2Mod, k, etc.
Also, in all my kernel2 (or kern2) files, I for some reason calculate N0 on each iteration while finding a good N0Mod value for each thread in the work group. My CPU-only codes do this better by calculating the start value only after checking the modulus!
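To be concrete about the incremental N0Mod update, here is a tiny stand-alone demo of the compile-time choice (the names N0Mod and SIEVE_LOGSIZE come from my description above, but the modulus of 3 and the reading of cMod as 2^SIEVE_LOGSIZE mod 3 are assumptions here, not a copy of my kernel2): since consecutive N0 values differ by 2^SIEVE_LOGSIZE, and 2^SIEVE_LOGSIZE mod 3 is 1 for even SIEVE_LOGSIZE and 2 for odd SIEVE_LOGSIZE, the running modulus just steps up or down by 1...

#include <stdio.h>
#include <stdint.h>

#define SIEVE_LOGSIZE 27                 /* example value */

int main(void) {
    /* N0 = b + i * 2^SIEVE_LOGSIZE; keep N0 mod 3 up to date without ever
       taking a modulo of the full (possibly 128-bit) N0 */
    const uint64_t c = (uint64_t)1 << SIEVE_LOGSIZE;
    uint64_t b = 7, N0 = b;
    int N0Mod = (int)(b % 3);
    for (int i = 0; i < 10; i++) {
        printf("%llu mod 3 = %d (check: %d)\n",
               (unsigned long long)N0, N0Mod, (int)(N0 % 3));
#if (SIEVE_LOGSIZE % 2 == 0)             /* 2^even mod 3 == 1: N0Mod goes up by 1 */
        N0Mod = (N0Mod == 2) ? 0 : N0Mod + 1;
#else                                    /* 2^odd mod 3 == 2: N0Mod goes down by 1 */
        N0Mod = (N0Mod == 0) ? 2 : N0Mod - 1;
#endif
        N0 += c;
    }
    return 0;
}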

(6) In my "partially sieveless" GPU codes, I set sieve8[] to be in constant GPU RAM simply because of an online example I was looking at. I don't believe this was the best choice, and maybe also copying to shared/local RAM would help, but I also don't think it really matters!?

(7) In many of my kernels, I calculate the powers of 3 and save them into shared/local RAM. Is there a better way? I also do this calculation using only 1 thread in the group of threads that is sharing the RAM (while the other threads wait), but I don't see why I couldn't use more threads!
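For anyone curious, the pattern I mean looks roughly like this (a sketch, not a copy of my kernels; the kernel name and POW3_TABLE_SIZE are just illustrative): one work-item fills the __local table, and the barrier makes the rest of the work-group wait before anyone reads it.

#define POW3_TABLE_SIZE 34               /* illustrative; 3^33 still fits in a ulong */

__kernel void exampleKernel(__global ulong *out) {
    __local ulong pow3[POW3_TABLE_SIZE];

    if (get_local_id(0) == 0) {          /* one work-item builds the table... */
        pow3[0] = 1;
        for (int i = 1; i < POW3_TABLE_SIZE; i++)
            pow3[i] = 3 * pow3[i - 1];
    }
    barrier(CLK_LOCAL_MEM_FENCE);        /* ...and everyone waits for it */

    /* the real work would use pow3[] here */
    out[get_global_id(0)] = pow3[get_local_id(0) % POW3_TABLE_SIZE];
}

One alternative that might answer my own question: each work-item i < POW3_TABLE_SIZE could compute pow3[i] with its own small loop before the barrier; whether that is actually faster probably depends on the device.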

(8) In some of my kernels, I set alpha and beta variables to size_t since they are used to index arrays, but would it be faster to use 32-bit integers?

(9) I always wonder if I've optimized the very-low-level stuff. For example, in kern2_128byHand.cl, I do mulEquals(N, (ulong)lut[alpha]), but this actually might cause (ulong)lut[alpha] to be calculated more than once since mulEquals() is a macro. I'm not about to dig into the assembly code to see how the compiler optimizes this. Would it be better to calculate (ulong)lut[alpha] before calling mulEquals()? I'm sure there are lots of possible low-level optimizations for either my CUDA or OpenCL by-hand implementations of 128-bit integers.
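To illustrate the concern with a toy example (this is not my mulEquals macro; SQUARE and the two helper functions are made up, and I'm guessing at the element type of lut[]):

/* A function-like macro that names its argument twice expands it twice,
   so lut[alpha] may be loaded twice unless the compiler merges the loads. */
#define SQUARE(x) ((x) * (x))

ulong twoLoadsMaybe(__global const ushort *lut, int alpha) {
    return SQUARE((ulong)lut[alpha]);     /* lut[alpha] appears twice after expansion */
}

ulong oneLoadForSure(__global const ushort *lut, int alpha) {
    ulong tmp = (ulong)lut[alpha];        /* hoist the load so it happens exactly once */
    return SQUARE(tmp);
}

In practice the compiler can usually prove the two loads are identical and merge them, which is why I'm not sure this matters; hoisting the value just removes the doubt.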

Challenges of any "reduce to 1" algorithm that counts steps

The 2017 paper I linked to also has a "repeated k steps" algorithm to count steps to 1 with some great results regarding the best size of lookup table (what I call k2) to use, though I'd imagine that they'd get somewhat different results if reducing much larger numbers. I started trying to write code to do this, but I soon found myself realizing that there is no clear way to define a step. Should (3*n + 1)/2 count as 1 step or 2 steps? The convention seems to be to call it 2 steps, but I feel that it is far more natural to call it 1 step. In light of David Bařina's new "n++ n--" algorithm, a step could be each bit shift! Regardless of how steps towards 1 are counted, when doing the k steps of a 2^k sieve, (3*n + 1)/2 must be counted as 1 step.

An annoying aspect of a "reduce to 1" code is that the code will take more time per number as you test larger numbers. You'll probably want to optimize your code for rather large n. Benchmarking your code will also need to be done carefully to make sure the same numbers are run.

For any scan such as the "reduce to 1" scan for max steps or a scan for max n, it is important to start testing numbers beginning at 1 without skipping. Anyone can find a number that has a huge number of steps, but the goal is to find the smallest number to require that many steps. 2^10000 takes 10000 steps to reduce to 1, but who cares? If you want to argue that 2^10000 can be reduced in 1 step (a single bit shift), I would probably agree with you, but you can always run the Collatz algorithm backwards to get more interesting numbers that require any number of steps! Here is collatzSetDelay.py to do this. Whenever (2*n - 1)%3 == 0, decrease to (2*n - 1)/3 unless (2*n - 1)%9 == 0. The "mod 9" check is needed to prevent getting numbers that have 3 as a prime factor because such numbers can then only increase by factors of 2 forever.

You can then calculate that, on average, 3/7 of the steps will be decreases because, as you increase, (2*n - 1)%9 cycles through a repeating pattern of 3, 7, 6, 4, 0, 1, and the 3 decreases to a 1, 4, or 7, and the 6 decreases to a 0, 3, or 6. The next decrease is twice as likely to be from a 3 as from a 6. Also, n%27 == 11 can decrease immediately, but it is best to not decrease because, if it decreases, there can be 4 increases before the 3rd decrease, but, if it increases, there can be 3 increases before the 3rd decrease. Instead of the n%27 == 11 rule, using the following rules does even better: n%729 in [209, 587, 452, 695, 533, 47, 344, 281, 38, 524, 245, 362, 416, 470, 335, 173, 425, 686, 443, 389, 119, 605, 659, 200, 92].

A better way of getting a larger fraction of decreases is to have the code look ahead a certain number of steps, as I have done in collatzSetDelay2.py. Setting the code to look ahead 25 steps, I asked for a number that takes 1000 steps, and I got 476,003724,222891.
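To make the backward rule concrete, here is a small stand-alone C sketch (collatzSetDelay.py is the real script; this is only an illustration, and the seed of 4 and the 80 steps are arbitrary choices that avoid the 1 <-> 2 loop and keep the result within 64 bits):

#include <stdio.h>

int main(void) {
    /* Run the Collatz map backwards: from n, either go to 2*n (undoing a halving)
       or, when allowed, to (2*n - 1)/3 (undoing n -> (3*n + 1)/2). The %9 check
       avoids landing on a multiple of 3, which could then only be doubled forever.
       Each backward step adds exactly one forward step, so the printed number
       takes steps + 2 forward steps to reach 1 (counting (3*n + 1)/2 as one step),
       since the seed 4 itself takes 2 steps. */
    unsigned long long n = 4;
    const int steps = 80;
    for (int i = 0; i < steps; i++) {
        unsigned long long m = 2*n - 1;
        if (m % 3 == 0 && m % 9 != 0)
            n = m / 3;                   /* backward "decrease" */
        else
            n = 2*n;                     /* backward "increase" */
    }
    printf("%llu takes %d steps to reach 1\n", n, steps + 2);
    return 0;
}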

There is a BOINC project that counts the steps to 1 that has been running since 2009. However, they made a fatal error by using a certain type of sieve, so all of their "high step" results are invalid. I have many other serious concerns with the project, but the person who runs the project, Jon Sonntag, never responded. In January 2021, I let the forums know of this fatal error. Jon Sonntag finally responded by deleting my forum posts, so I posted it at the main BOINC forum. Likely, none of their results are valid. Certainly, my code is vastly faster. On my I5-3210M CPU, his code only tests 2.3 × 10^8 numbers/second! On the Nvidia Quadro P4000, his code only achieves 4.14 × 10^10 numbers/second! To calculate these speeds, I first needed to figure out that his code tests (tested?) 3*2^44 numbers each task. The time, electricity, and opportunity wasted by this BOINC project can be seen here. I had fun doing much better than this project, but it was so sad to see the blind faith of some of Jon's followers when they would get so angry with my accurate harsh statements. The lesson to be learned here is to listen to the scientists and to the experts, and don't follow people who are all show and no substance. Just because someone knows more than you about something doesn't mean that they are trained in thinking scientifically.

Certain types of sieves can still be used. A 3^1 or 3^2 sieve can still be used because the numbers removed by these sieves come from smaller numbers, and these smaller numbers will require more steps to reduce than the larger numbers that result from them. To generate my 2^k sieves up until now, I have (1) removed numbers that reduce in k steps then (2) removed numbers that join the path of smaller numbers in k steps. The second of these approaches can still be done because paths that join will require the same number of steps, and the smaller number is the one that needs to be found. My "sieveless" approach could still work! However, I am not sure if the second of these approaches is valid if counting steps towards 1 as bit shifts.

Note that all of my above code for deltaN including collatzFindDeltaNbound.c only try to join paths for numbers that do not reduce, so they need to be modified. My first step was modifying collatzFindDeltaNbound.c by (1) removing the two checks for isS1 and (2) starting the c loop with 1 instead of minC. These modifications are very easy, so I won't provide the modified code, but I put the results in a column in largeK.xlsx. The deltaN bounds are much larger!

To find the experimental (actual) deltaN and to count the numbers that need testing in each sieve, I then modified collatzSieve2toK_FindPatterns.c by (1) removing the bit of code under the comment that says "check to see if if 2^k*N + b0 is reduced in no more than k steps", (2) updating the starting deltaN, and (3) making the b0 loop now start at 0 and increase by 1 each iteration (instead of starting at 3 and increasing by 4 each time). For the final change, I added a test for if b0 is 0 before seeing if it was in the sieve. I learned that not all numbers that join a path of a smaller number work! For a 2^3 sieve, 5 becomes 2 after 3 steps (one of them is an increase), and 4 becomes 2 after 3 steps (one of them is an increase). The problem is that the path that started with 4 should no longer be followed after it reduced to 1 because the 5 takes more steps to reduce to 1, so I added the following line after the nList[m] >>= 1 line...
    if ( nList[m] == 1 && m>0 ) nList[m] = 0;
and I had to make the loop over steps the outer loop, with the m loop (including the check against the 0th element) as the inner loop. Note that, with the code in this configuration, the code could claim an exclusion by a larger deltaN even though a smaller deltaN would exclude it in a later step.
Results...
- A very interesting pattern emerged! The experimental deltaN are all 2^(k-4). Though, I only tested up to k=16. The explanation seems to be how, for k=4, 13 joins paths with 12. For k=5, just double the 13 and the 12, and we have that 26 joins paths with 24. Doubled numbers must join paths because doubling just adds a decrease step for each. Doubling also doubles the deltaN between them.
- Regarding the density of large-deltaN exclusions, for k=16, only 1 number required deltaN=4096, only 1 number required the next deltaN=3243, and only 3 numbers required the next deltaN=2048.
- Of course, these "reduce to 1" sieves take more time to analyze than the usual reduce-below-start-n sieves. The overall design of my code is not optimized for such large deltaN. My "sieveless" approach also does not favor a large deltaN.
- These sieves are not nearly as good as the ones you can use if not wanting to reduce to 1! As k increases, there is less gain with these "reduce to 1" sieves than with the usual sieves because the fraction that needs testing no longer exponentially decays.

I wonder if the "numbers cannot join numbers that have already reduced to 1" rule can be lifted somewhat. For our 2^3 sieve, 5 joins 4 after 3 steps, but 4 already reduced to 1. This is a problem because 4 takes 2 steps to reduce, and 5 takes 4 steps to reduce, so 5 needs to be tested. However, 2^3 + 5 must also join 2^3 + 4 in 3 steps, but 2^3 + 4 has not yet reduced to 1. Perhaps I could prove that for A*2^k + B that joins A*2^k + b, where b is smaller than B, A*2^k + b with A > 0 never reduces to 1 before the paths join. If I could prove this, then better sieves could be made for testing all numbers except the A=0 numbers. Actually, the proof is very easy: 2^k is the largest number to reduce to 1 in k steps, and nothing above it can join its path. For large enough k, the sieves are barely better if I let numbers join numbers that have already reduced to 1, and the cost of doing this, as far as I have experimentally tested, is that deltaN doubles plus the cost of having to do A=0 separately. Doing A=0 separately is no minor cost if using my "sieveless" approach, which aims to use huge sieves.

As for making these sieves using a GPU, my approach on GPU so far won't work with the "numbers cannot join numbers that have already reduced to 1" rule. My GPU approach would not be able to find cases where numbers joined before reducing to 1 in k steps. That is, my GPU code creates A > 0 sieves. However, unlike the modified CPU code I mentioned, the approach I use in the GPU code would be optimized for these larger deltaN! Unlike my non "reduce to 1" version of this code, comparing number of increases is now certainly required, hold[] and holdC[] arrays are also certainly required, and you'll start using considerable RAM after k=26 (due to deltaN being large). My GPU runs for only a tiny fraction of the run time, so I should probably explore making a second kernel instead of OpenMP! To do this, my 2 collect arrays, my 2 hold arrays, bits, and n0start would need to be passed to the GPU. In the meantime, I wrote a CPU-only code that makes A > 0 sieves that are not limited by the amount of GPU RAM (limited only by CPU RAM).
    • collatzSieve2toK_FindPatterns_reduceTo1.c is my CPU-only code
    • collatzSieve2toK_FindPatterns_GPU_reduceTo1.c
    • kernel_reduceTo1.cl
The following is OpenMP CPU-only code for A >= 0 sieves. The algorithm is much slower, but it requires basically no RAM. If I ever care, I'll write GPU code for this...
    • collatzSieve2toK_FindPatterns_reduceTo1_Aequals0.c
The above were last updated 2021-02-22.

"Reduce to 1" conclusion: A 3^2 (or 3^1) sieve works very nicely, but the 2^k sieve that you can create has a huge deltaN that is a real pain for not-miraculous gain.

It seems that I'm not the first to conclude such things! See his section called "The class record algorithm".

The strategy I would recommend is to use the "A > 0" sieves because they are easier to generate. For A=0, you can just look up the record for highest steps, so you don't have to run it (note that this list counts (3*n + 1)/2 as 2 steps). Use the same 3 steps I gave for choosing "sieveless" parameters, and your k will have to be small unless you set your deltaN_max to be low enough.

I wrote collatzSieveless_reduceTo1_Aequals0.c to run the A=0 case. I quickly ran up to 2^36, and I verified that, up through 2^36, counting (3*n + 1)/2 as a single step gives the same records as the table I just linked to (with simply a smaller delay). I also found that 762 is the record delay for a 2^36 sieve.

I ran the following "sieveless" files using a 2^36 sieve to allow for a 2^50 sieve. A 2^50 sieve would require setting STEPS_MAX (or stepsMax for CPU-only code) to at least 1161. I verified that, up through 9*2^47, counting (3*n + 1)/2 as a single step gives the same records as counting it as 2 steps (with simply a different delay).
    • collatzSieveless_reduceTo1.c is my CPU-only code
    • collatzSieveless_reduceTo1_GPU.c
    • kernel1_reduceTo1.cl
    • kernel1_2_reduceTo1.cl
    • kernel2_reduceTo1.cl
My results were added to the following file (the other delays in the file are from here)...
    • delayRecords.txt
The above were last updated 2021-04-25.

To analyze the log file from the GPU code, I ran the Linux commands...

grep "steps = " log.txt | sed 's/^.steps.=.\([0-9]*\).found:.(\([0-9]*\)....64)...\([0-9]*\).*/\1 \3/' | sort -n -k 1,1 -k 2,2 | uniq -w4 > lowestForEachDelay.txt
cat lowestForEachDelay.txt | sort -n -k2 > lowestN.txt

I then ran the Python 3 script...

stepsMax = 762      # record delay already found up through 2^36
f = open("lowestN.txt", "r", encoding="utf8").read().splitlines()
for line in f:      # each line is "steps n", sorted by n (see the sort above)
  steps = int(line.split(" ")[0])
  if steps > stepsMax:    # a new delay record
    stepsMax = steps
    print(line)

As for the density of deltaN in these "A > 0" sieves, the large deltaN are very rare. k=16 has deltaN = 8192 and a total of 46261 numbers removed from the sieve. For the following deltaN_max settings, I provide the numbers not excluded by the sieve...
  deltaN_max = 8192: 19275
  deltaN_max = 4096: 19276
  deltaN_max = 2048: 19279
  deltaN_max = 1024: 19288
  deltaN_max = 512: 19308
  deltaN_max = 256: 19356
  deltaN_max = 128: 19457
  deltaN_max = 64: 19693
  deltaN_max = 32: 20285
  deltaN_max = 16: 21582
  deltaN_max = 8: 24000
  deltaN_max = 4: 27984
  deltaN_max = 2: 32997
  deltaN_max = 1: 40295 out of 65536 must be run
Here are the numbers not excluded as a function of deltaN_max for k=10 with deltaN = 128 and a total of 586 numbers removed from the sieve...
  deltaN_max = 128: 438
  deltaN_max = 64: 439
  deltaN_max = 32: 442
  deltaN_max = 16: 452
  deltaN_max = 8: 471
  deltaN_max = 4: 513
  deltaN_max = 2: 572
  deltaN_max = 1: 673 out of 1024 must be run
Note that deltaN's that are of the form 2^integer are the most common.

Other types of searches

In addition to testing the Collatz conjecture, why search for other things like records for max n or max steps? Statistics of max n from a random walk can be predicted, so I see why max n is searched for (so far, experimental data is consistent with a random walk). Also, on a practical level, the max n affects how many bits are needed to test the Collatz conjecture. As for max steps, perhaps these records can be predicted from various stochastic models? Certainly, max n is easier to find than max steps!

I found a website with a list of their results of various other searches. In addition to searching for max n and max steps to 1 (where (3*n + 1)/2 is counted as 2 steps), here are the other searches they have done...
  (1) "Glide" records are provided (N=2 is missing), where "glide" is the number of steps needed to reduce an n below its start value (where (3*n + 1)/2 is counted as 2 steps).
  (2) First n to take N steps to reduce to 1 for all N are provided (where (3*n + 1)/2 is counted as 2 steps). Instead of asking for the first n to take more steps than any previous n required, they are wondering, given any N, which is the first n to require exactly that many steps. Note that my sieveless "reduce to 1" approach can naturally find these results, but a 3^1 or 3^2 sieve can no longer be used.
  (3) Ratios involving counts of odd and even n's when reducing starting n to 1 are provided (where (3*n + 1)/2 is counted as 2 steps), though N=2 is missing. This is clearly closely related to max-step records because numbers that require many steps for their size have a higher ratio of odd to even, whereas a number that reduces quickly for its size would produce mostly even numbers. In fact, all records in the first table of results are found in the second one.
  (4) "Residue" results are provided, where residue is defined using the counts of odd and even n's when reducing starting N to 1 (where (3*n + 1)/2 is counted as 2 steps). The final 1 is not counted as an odd, and the initial n counts as an even or an odd. Residue is then defined as (2^evens / 3^odds) / N. I think the reason we care about this is that, if the Collatz conjecture didn't have the the "+1" in "3*n + 1", N = 2^evens / 3^odds would be true for numbers that reduced to 1 (assuming that such an N would be an integer), and the residue would be 1. Since the "+1" only significantly matters when n is small, the residue's difference from 1 is measuring how long n is putzing around at small values before reaching 1. Note that, if residue were always equal to 1, paths would never join other paths, and nothing interesting could happen.
  (5) Just found another one: strength records. It wasn't linked on the main list, but I found it. Strength is defined as 5*odds - 3*evens for when (3*n + 1)/2 is counted as two steps. Strength has to do with how long a number can "tread water" compared to dropping down to 1. While "treading water", I find that strength is approximately equal to 0.245 * odds, where odds increases as the treading continues, but then any subsequent drop to 1 will reduce the strength. All strength records are delay records, but the strength records are the delay records that make a particularly high delay for the starting n.
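Since the residue definition is the one that is easiest to get slightly wrong, here is a small stand-alone sketch of it as I understand the website's convention (the residue() function is just an illustration; double precision is fine for small N, and there are no overflow checks):

#include <stdio.h>
#include <math.h>

/* Count odd and even values along the path to 1, counting the starting N but
   not the final 1, with n -> 3n+1 and n -> n/2 as separate steps; then
   residue = (2^evens / 3^odds) / N. */
double residue(unsigned long long N) {
    unsigned long long n = N;
    int odds = 0, evens = 0;
    while (n != 1) {
        if (n & 1) { odds++;  n = 3*n + 1; }
        else       { evens++; n >>= 1;     }
    }
    return pow(2.0, evens) / pow(3.0, odds) / (double)N;
}

int main(void) {
    printf("residue(27) = %f\n", residue(27));
    return 0;
}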

For (3), (4), and (5), the website makes some conjectures that are interesting, so I understand why they are tested, but I don't really understand why (1) and (2) are tested. In general, I do not understand why people are experimentally counting steps to reduce to 1; perhaps this is because I do not know much about the theory behind trying to prove the Collatz conjecture. Perhaps the only reason is that getting experimental data can guide the mathematical theory if something surprising is found!

Anyway, I mention all this for two reasons. The first is to maybe have someone explain to me why we care about counting steps to reduce to 1. The second is to have a complete list of experimental searches so that I can evaluate if my "sieveless" approach can be applied to them. For some of the above searches, I see certain cases where some form of a sieve can be used. I might give this more thought at some point.

One thing that would interest me is to see a rolling average of the "glide", defined by the website as the number of steps to reduce a number below where it started. I would expect this to be more-or-less constant as the starting n increases! Not many tricks could be used: no sieves, no "n++ n--", and no "repeated k steps". Though maybe "n++ n--" or "repeated k steps" could be used up until just before n reduces below its start value. For simplicity, I would just average the glides in each consecutive 2^30 block of numbers. I would define (3*n + 1)/2 as a single step because then the glide tells you which sieves will exclude that number. Better yet, define a step as a bit shift so that the "n++ n--" algorithm can be used, and collatzCPUglide_BitShift.c is my code that does this (a minimal sketch of this bit-shift step counting appears just after the list below)! For the following chunks of 2^30, I get the average glide...
  0: 2.246348
  1: 2.246315
  2: 2.246323
  3: 2.246313
  4: 2.246330
  5: 2.246352
  6: 2.246339
  7: 2.246338
  8: 2.246330
  9: 2.246311
  1000: 2.246329
  1000000: 2.246361
  1000000000: 2.246318
  1000000000000: 2.246360
  1000000000000000: 2.246339
  1000000000000000000: 2.246354
The last chunk of 2^30 runs numbers 2^30*1000000000000000000 + 2 to 2^30*1000000000000000001 + 1.
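To make the bit-shift step counting concrete, here is a minimal stand-alone sketch of the glide of a single number under that definition (an illustration only, not my collatzCPUglide_BitShift.c; it ignores overflow and assumes n0 >= 2):

#include <stdio.h>

/* Glide of n0 when every halving (bit shift) counts as one step and the
   3n+1 multiplication itself is not counted. */
unsigned long long glideBitShift(unsigned long long n0) {
    unsigned long long n = n0, steps = 0;
    while (n >= n0) {
        if (n & 1) n = 3*n + 1;          /* not counted as a step */
        else       { n >>= 1; steps++; } /* one bit shift = one step */
    }
    return steps;
}

int main(void) {
    for (unsigned long long n = 2; n <= 16; n++)
        printf("glide(%llu) = %llu\n", n, glideBitShift(n));
    return 0;
}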
I would now like to predict the above average and standard deviation from a random walk. In trying this, I see that defining (3*n + 1)/2 as a single step is the easier way to go, so collatzCPUglide.c is my code that finds experimental values when (3*n + 1)/2 is a single step! For the following chunks of 2^30, I get the average glide...
  0: 3.492694
  1: 3.492627
  2: 3.492635
  3: 3.492623
  4: 3.492647
  5: 3.492700
  6: 3.492652
  7: 3.492654
  8: 3.492659
  9: 3.492626
  1000: 3.492654
  1000000: 3.492710
  1000000000: 3.492637
  1000000000000: 3.492715
  1000000000000000: 3.492683
  1000000000000000000: 3.492708
Note that, if counting (3*n + 1)/2 as two steps, my results become 3/2 larger. My code that predicts the above from a random walk is collatzGlide.c, which gives 3.492652 with a standard deviation of 6.481678. Since I'm running 2^30 numbers, the central limit theorem says that my experimental results, if from a random walk, should have a standard deviation of 6.481678 / sqrt(2^30) = 0.00020 between them. Seems that the Collatz conjecture is consistent enough with a random walk! Though, I am surprised at how similar all the glides are considering that, if running chunks of 2^1 instead of 2^30, there would be strong oscillation in the glide because n%4 == 3 are the only numbers that do anything interesting. 2^30 averages out enough of the oscillations because a sieve of that size would find that most numbers would reduce in no more than 30 steps.

Another thing I would like to see is the average of (delay - expectedDelay), where delay is defined by the website as the number of steps to reduce to 1, and expectedDelay is given by a random walk. Unlike the website, I will count (3*n + 1)/2 as a single step. Since the glide has already been compared to a random walk, there probably isn't much to gain scientifically from looking at the delay, but writing the code sounds fun. I think the reason I really care is that I am curious to see the expectedDelay values, and collatzDelay.c is my code that does this! I get expectedDelay = f(N) log2(N), where f(N) seems to be approaching a number, but I am unable to trust my results when I run N larger than 2^80 due to my steps being limited to 1023 (else overflow). Here are my results...
  N = 2^10: expectedDelay = 5.030079 log2(N)
  N = 2^20: expectedDelay = 4.923198 log2(N)
  N = 2^30: expectedDelay = 4.888867 log2(N)
  N = 2^40: expectedDelay = 4.871146 log2(N)
  N = 2^50: expectedDelay = 4.860508 log2(N)
  N = 2^60: expectedDelay = 4.853471 log2(N)
  N = 2^70: expectedDelay = 4.848431 log2(N)
  N = 2^80: expectedDelay = 4.844653 log2(N)
Multiply this by 3/2 if you count (3*n + 1)/2 as two steps. I believe the results are approaching
    expectedDelay = [2 / log2(4/3)] * log2(N)
                  = 4.818842 log2(N)
The above comes from writing N approximately as (4/3)^(delay/2) by building N up by an equal number of "backward" increases and decreases. Each pair of increases and decreases changes N by a factor of 4/3 and takes 2 steps. I want to thank Eric Roosendaal for this formula. My code also predicts a standard deviation of about 3.8 sqrt(expectedDelay).

I decided to write collatzDelayGMP.c to use the GMP library to allow me to run much larger numbers! Unlike the original non-GMP code, I optimized this code for speed. Here are the results...
  N = 2^1000: expectedDelay = 4.820915 log2(N), standard deviation = 8.384883 sqrt(log2(N))
  N = 2^2000: expectedDelay = 4.819878 log2(N), standard deviation = 8.383898 sqrt(log2(N))
  N = 2^5000: expectedDelay = 4.819256 log2(N), standard deviation = 8.383410 sqrt(log2(N))
  N = 2^10000: expectedDelay = 4.819049 log2(N), standard deviation = 8.383238 sqrt(log2(N))
  N = 2^20000: expectedDelay = 4.818945 log2(N), standard deviation = 8.383170 sqrt(log2(N))
  N = 2^50000: expectedDelay = 4.818883 log2(N), standard deviation = 8.383100 sqrt(log2(N))
  N = 2^100000: expectedDelay = 4.818862 log2(N), standard deviation = 8.383085 sqrt(log2(N))
  N = 2^200000: expectedDelay = 4.818852 log2(N), standard deviation = 8.383074 sqrt(log2(N))
  N = 2^500000: expectedDelay = 4.818846 log2(N), standard deviation = 8.383073 sqrt(log2(N))
  N = 2^1000000: expectedDelay = 4.818844 log2(N), standard deviation = 8.383070 sqrt(log2(N))
I was curious about the prediction for what the standard deviation is approaching. First of all, for very large N, the standard deviation of a random walk is effectively constant as steps increases during the time when most numbers are reaching 1, giving a standard deviation of Δincreases = sqrt(steps) / 2 = sqrt(expectedDelay) / 2. The division by 2 is because our random walk graphed increases vs. steps spreads half as fast as the "usual" random walk. Note that most of the walks occur near the line increases = steps/2 on this graph. The standard deviations listed above are Δsteps = step2 - step1, where step2 = 2 log2(N) / (2 - log2(3)) is when the line increases = (steps - log2(N)) / log2(3) crosses the line increases = steps/2, and step1 = 2 (log2(N) - log2(3) Δincreases) / (2 - log2(3)) is when the line increases = (steps - log2(N)) / log2(3) crosses the line increases = steps/2 - Δincreases. This gives Δsteps = Δincreases / (1/log2(3) - 1/2) = 8.383068 sqrt(log2(N)).

As for experimentally testing the above delays, I chose to run chunks of 2^30 numbers. Here is my code collatzCPUdelay.c that does this. The expected delay noticeably drifts for a run of 2^30 when starting at 2^40, so I started at 2^50. By the way, 2^100 caused overflow (I suppose this depends on how k2 is set). When starting at 2^50, I got the average delay to be 4.833497 log2(N), which seemed promising! But 2^60 gave 5.020711 log2(N). I then ran 2^60.1 and got 4.587167 log2(N), so there's clearly a huge standard deviation! For running 2^30 numbers, the standard deviation should be 0.00026 sqrt(log2(N)), which is around 0.002 steps, not the roughly 10 steps of deviation that I'm seeing! If I were to run 2^42 instead of 2^30 numbers, I could reduce the standard deviation by a factor of 2^6 = 64 due to the central limit theorem allowing for a better measurement of the average. 2^42 would take many days to run on a single CPU core using a "repeated k steps" code. I ran the following amounts of numbers starting at N = 2^60, giving the following results...
  running 2^30 numbers starting at N=2^60: delay = 5.020711 log2(N)
  running 2^31 numbers starting at N=2^60: delay = 5.086590 log2(N)
  running 2^32 numbers starting at N=2^60: delay = 5.207521 log2(N)
  running 2^33 numbers starting at N=2^60: delay = 5.281781 log2(N)
  running 2^34 numbers starting at N=2^60: delay = 5.228182 log2(N)
  running 2^35 numbers starting at N=2^60: delay = 5.158507 log2(N)
  running 2^36 numbers starting at N=2^60: delay = 5.172008 log2(N)
  running 2^37 numbers starting at N=2^60: delay = 5.211181 log2(N)
  running 2^38 numbers starting at N=2^60: delay = 5.135604 log2(N)
  running 2^39 numbers starting at N=2^60: delay = 4.937276 log2(N)
  running 2^40 numbers starting at N=2^60: delay = 4.940875 log2(N)
  running 2^41 numbers starting at N=2^60: delay = 4.914249 log2(N)
  running 2^42 numbers starting at N=2^60: delay = 4.897469 log2(N)

Why does the delay have such a large standard deviation compared to a random walk? The glide is an average over oscillations of various sizes. For example, only numbers with n%4 == 3 do anything interesting. The delay does not have these simple patterns! The glide just has to reduce a number, but the delay must "stuff the number down a thinner funnel", which might explain things. I would love to see a rolling average of the delay! I am now interested in counting steps to 1! There are already some graphs here that show it, but I'm more interested in the patterns that arise.

Here are my results from again running 2^30 numbers, but now as 2^10 consecutive groups of 2^20 numbers, starting at 2^70 and starting at 2^90. Various features such as oscillations and sudden jumps can be seen! Note that the numbers in the graphs should have a standard deviation between data points of about 0.001 if from a random walk.

References

[1] Bařina, David. (2020). Convergence verification of the Collatz problem. The Journal of Supercomputing. 10.1007/s11227-020-03368-x. Full text.

[2] Honda, Takumi & Ito, Yasuaki & Nakano, Koji. (2017). GPU-accelerated Exhaustive Verification of the Collatz Conjecture. International Journal of Networking and Computing. 7. 69-85. 10.15803/ijnc.7.1_69. Full text.