I came across a great 2020 paper by David Bařina with a brilliant faster way of experimentally testing the Collatz conjecture! I simplified his CPU code here (see my comments at the top). I will call his algorithm the "n++ n--" algorithm. The purpose of this webpage is to show how I continued making progress to speed up testing the Collatz conjecture even further! I do this in a chronological way so that you may understand my progress.
Personally, I think the conjecture has now been tested enough experimentally, but, if you disagree, you should be using the fastest code with the best mathematical tricks! I highly enjoy the math and code behind experimentally testing the Collatz conjecture. Please let me know if you want to use my code!
I have combined many clever already-known math tricks with my new "sieveless" approach. If, in addition to testing the Collatz conjecture, you want to find the max n, you should use my "n++ n--" code. If you just want to test the Collatz conjecture, you should use my "repeated k steps" code. Regardless, use my "sieveless" approach! I provide benchmarked code for both CPU and GPU.
To do this "sieveless" approach, I first studied a value I call deltaN that determines how close numbers must be for them to join paths in k steps (with the same number of steps that are increases). This number allows for very large sieves to be generated, and these sieves can now be generated in a "sieveless" way, meaning that sieves are never saved to a storage drive. In fact, the sieves are so huge that they cannot be stored on a storage drive. Note that, when I count steps, I count (3*n + 1)/2 as one step.
Update: I am now working on a hybrid code called "partially sieveless" to speed things up even more!
My "sieveless" approach could also be applied to a "reduce to 1" code, so I also explore and expose various challenges of any "reduce to 1" code.
With David Bařina's generous help, I wrote my own code (collatzCPU.c), and it is a bit faster than his code (that is, faster than the code I provided above; GitHub is updated)! Using k=34 on my i5-3210M CPU running Ubuntu Linux and running 2^12 numbers for each number in the 2^34 sieve, my code tests 3.6 × 10^9 numbers/second, whereas his tests 2.6 × 10^9 numbers/second (using a single CPU core for both and my simplified code to test his). My guess is that: (1) my way of doing the modulo operation helps the speed (calculating modulo on large numbers takes time!), (2) I also use a 3^2 sieve instead of a 3^1 sieve, and (3) I use a small lookup table to check for overflow instead of doing a calculation each time. I don't calculate checksums (or max n), though I probably should to test my code. Like his code, I also use 128-bit integers. In his code (not in the one that I simplified), he has an option to use gmp.h to test rare numbers that need more than 128-bit integers (I can just use a tiny Python script, collatzTestOverflow.py, on those numbers). I believe his code would be as fast as mine if he focused on his CPU-only code, but his main goal was to create fast GPU code.
My code simply prints any numbers that overflow or seem to be stuck in an infinite cycle (to standard output), and these are the numbers that could disprove the Collatz conjecture. See my extensive comments at the top of the file to learn about how my code (and how David Bařina's code) works.
On Windows, you'll need to get GCC (macOS and Linux should already have it, though gcc links to clang on macOS, and clang is fine); download GCC from here. When running the downloaded installer, select the architecture to be 64-bit (x86_64). I then added C:\Program Files\mingw-w64\x86_64-8.1.0-posix-seh-rt_v6-rev0\mingw64\bin to Windows' Path so that I can compile code in Command Prompt.
Note that my code generates the 2k sieve and stores it in RAM before testing numbers. Ideally, my code would just load a pre-made lookup table from the storage drive, and it could do this using a buffer to prevent needing a noticeable amount of RAM (that is, load in only pieces at a time), but my main goal was just to create a fast algorithm, and I wanted the complete algorithm to be in a single file.
Why do we want larger k? The fraction of numbers not excluded by a 2^k sieve for k=34 is 0.007246: less than 1% need to be tested! For k=40, the fraction is 0.004782. Larger k means faster code! This is a time-space tradeoff, since larger k also takes more RAM and/or room on the storage drive. Also, you can pre-calculate the first k steps of non-excluded numbers, so larger k is even better!
David Bařina uses a clever compression to store a 2^34 sieve. 2^34 bits is 2 GiB, but he found that there are only 50 unique 64-bit patterns. Since the integer 50 fits in a byte (8 bits), he can use 8 bits to store 64 bits. This allows him to store a 2^34 sieve in 256 MiB. However, k=40 would require 16 GiB using this compression.
A much larger concern is that generating the sieve took a lot of RAM. For k=34, there are 2^34 128-bit integers needed to test whether different paths join. This alone is 256 GiB, and there are other things needed in RAM. So k=34 seemed to be a limit.
Before I discuss the fix to these issues, I discovered a few basic things. You don't need 128-bit integers for generating the sieve if k < 41. I use 64-bit integers, which use half of the RAM and use less CPU time. Also, I discovered that there are actually 53 unique 64-bit patterns, though the "extra 3" only occur for k=6 and k=7.
The first big idea to realize is that not all paths can join. Only "nearby" paths can join. As k gets larger, more distant paths can join, but still within a k-dependent distance. If you look in my code, you'll see these distances called deltaN. For k=34, deltaN = 46. For k=40, deltaN = 120. Using deltaN, I can create any 2^k sieve using very little RAM! collatzSieve2toK_FindPatterns.c is my code that experimentally finds deltaN (among other things) for any k. Note that this code takes over a day to run k=40.
But, to experimentally find deltaN, you would still need all 2^k values stored in RAM because you don't know deltaN yet. If you read the comments at the top of collatzSieve2toK_FindPatterns.c, you see that deltaN can be bounded by 2^(k - minC), where minC > k / log2(3) is the lowest c that does not include already-ruled-out numbers, where c is how many increases are encountered in k steps (steps and "already-ruled-out n" are defined in the code's comments). Confused? Just read the copious comments at the top of the code! I was proud of myself for figuring this out, but, even with this bound, for k=40, putting 2^(40 - ceiling(40/log2(3))) into WolframAlpha gives a bound of 16384. This solves the issue with RAM, but CPU time is going to take almost half a year. The code's comments mention the code I use to get a tighter bound, and collatzFindDeltaNbound.c is this tricky code. I developed the algorithm myself, and I wonder if anyone else has done this. Read the comments at the top with pen and enough paper to figure out how the math of it works! For k=40, the tighter bound is just 679, which can be run in a week by collatzSieve2toK_FindPatterns.c to find deltaN = 120.
But 16 GiB for k=40 is still a huge file to share between computers or just to have on a computer. I can compress the k=40 sieve to 8 GiB by finding unique 256-bit patterns and referencing them using 2 bytes (16 bits). For k < 41, there are 13374 unique patterns (this count will likely increase for larger k). This is more than 53 but less than the 2^16 limit for 2 bytes.
Why did I choose 256-bit patterns? First of all, pattern sizes should be of size 2^K due to exclusion rules being of the form n%(2^K) == value. For example, n%2 == 0 and n%4 == 1 do not need to be tested (as long as k is not less than K). Here are the alternatives to using 256-bit patterns...
64-bit patterns: This is what David Bařina used for an 8-to-1 compression. I get a 16-to-1 compression using 256-bit patterns. Note that generating a sieve with my compression would take about 30% more time due to having to search through more patterns, so I implemented a binary search to prevent a slowdown.
128-bit patterns: There are more than 256 unique 128-bit patterns, so they would need 2 bytes. That gives the same 8-to-1 compression ratio as using 64-bit patterns.
512-bit patterns: These would need 3-byte labels, which is fine, except I would use uint32_t because I am too lazy to make a 24-bit integer, so each label would actually take 4 bytes unless you do something fancy like uint8_t labels. 4 bytes gives no advantage over using 256-bit patterns. More importantly, there would now be so many unique patterns that they could no longer be expected to fit in CPU cache (the stored patterns are 128-bit since I only need to consider n%4 == 3).
1024-bit patterns: I have not tested whether all unique 1024-bit patterns fit in 4 bytes because I am currently unable to handle them: even with my 4-to-1 compression trick (only looking at n%4 == 3), I cannot fit these patterns in a 128-bit integer. I could find a different container for them, but I don't care to because there would be so many of them that they would fill up my storage drive and cache, and, for each k above 40, you would find many more unique patterns to add to the list.
As for how to implement these 256-bit-pattern sieves, set K=8 in collatzSieve2toK_FindPatterns.c and run through all the k values of interest. Note that this capital K has a different meaning than the one in the previous paragraph! Luckily for you, I have already done this, and all needed information is in my spreadsheet largeK.xlsx (it has lots of other great info too!), and collatzCreatePatternsForSieve.c has cumulative unique patterns for k < 46. You can then generate a sieve using collatzCreateSieve16to1compression.c. k=40 will likely take over a day. You would then modify my collatzCPU.c to load in a sieve instead of generating one, but I am not motivated to do that. Instead, to show you how to load in the sieve using basically no RAM, here is some code: count16to1sieve.c. It just outputs how many numbers need to be tested in the sieve, but look at how the storage drive is read into a buffer to greatly reduce RAM usage.
On second thought, not writing a code to use these compressed sieves seems weird, so I made collatzCPU16to1.c! I still don't much care to run it with k=40 because...
• 8 GiB is a lot to write to a storage drive, and creating this k=40 sieve would take me 31 hours.
• Then, using the sieve would take me a couple of days if I test 2^10 numbers for each number in the 2^40 sieve (that is, to test 2^50 numbers).
But I got curious about what the numbers per second would be! I got 5.03 × 10^9 numbers/second on the same i5-3210M CPU. This is certainly faster but wasn't as fast as I was expecting; I then realized something: the numbers that are excluded by a k=40 sieve but not by a k=34 sieve are often the numbers that don't take long to test!
Update: By talking with Eric Roosendaal, I learned of another way to store 256 bits in 2 bytes. A 2^8 sieve (256 numbers) has only 16 numbers that need to be tested. So, for each consecutive group of 256 numbers in a larger 2^k sieve, you just need 16 bits (2 bytes) to say whether each of those 16 numbers is needed. I do something similar but with a 2^2 sieve, which leaves only 1 number to be tested, allowing me to store the 256-bit patterns as 64-bit patterns, but Eric's idea lets us think of them as 16-bit patterns! The advantage of Eric's idea is that we don't have to find all the unique patterns beforehand! I won't implement it because I'm using my sieveless codes now, but I thought I'd share it!
David Bařina also made some very fast GPU code. We really want to be using a GPU to do these calculations (not just a CPU)! In Dec. 2020, I tackled using the GPU!
Update 1: I have experimentally found that OpenCL on all my Intel and AMD GPUs on macOS and Windows make various subtle 64-bit and 128-bit integer arithmetic errors (getting the wrong answer—yikes!!) in addition to not implementing ctz() and to printf() acting strangely. Different GPUs make different errors. As for the arithmetic errors, here are two of my (sloppy) test codes: testBug64.c and testBug128.c. Here is cldemo.c to probe your OpenCL platforms and devices. David Bařina only used OpenCL on Nvidia GPUs on Linux and experienced no errors and could use 128-bit integers (used Nvidia CUDA Toolkit 10.x or 11.x), so I'll try that (via Ubuntu's nvidia-cuda-toolkit package), which will require me SSHing into a computer that has an Nvidia device! I also want to try Nvidia on Windows. Since macOS no longer supports CUDA and OpenCL, I have no desire to try Nvidia on macOS. To compile OpenCL...
• On macOS, I just used the -framework opencl option of gcc.
• On Windows, I used MinGW-w64, like this. Better yet, on Windows, install Cygwin (selecting the gcc-core and libOpenCL-devel packages) to allow various Linux C libraries to work and to get ctz() to work in a kernel. For Cygwin, gcc needs /cygdrive/c/Windows/System32/OpenCL.DLL as an argument. To get a printf() in an OpenCL kernel to work in Cygwin, I had to add C:\cygwin64\bin\ to the Windows Path then run via the usual cmd.exe.
Nvidia on Windows works! There are no arithmetic errors! Just download and install the CUDA Toolkit from Nvidia before trying to compile using Cygwin! The only weird thing is that a printf() inside an OpenCL kernel needs llu for ulong (lu only prints 32 bits), but, even though this is OpenCL and lu works on Windows with Intel and AMD GPUs, needing llu is a typical Windows thing I guess. On second thought, a long integer is always 64-bit in OpenCL, so this behavior by Nvidia is actually a bug. Luckily, llu works for Intel-GPU-on-Windows (not AMD GPUs on Windows) and for Nvidia-on-Linux, so I just switched all my OpenCL kernels to use llu. Also, llu doesn't work when using Beignet on Linux because Beignet seems to strictly abide by standards! Anyway, I wonder if PRIu64 works in OpenCL for all devices and platforms.
• When using Linux, I never get the scary arithmetic errors that macOS and Windows have given me (for Intel and AMD). Using Ubuntu's beignet-dev package (for Intel GPUs) avoids all problems on all GPUs, but is slow and does not have 128-bit integers. On Ubuntu, I also need the ocl-icd-opencl-dev and opencl-headers packages. On Linux, the -lOpenCL option is needed in gcc. Instead of beignet-dev, I tried the intel-opencl-icd package, but my GPUs were too old to be supported, and it wouldn't work. Instead of Intel GPU and beignet-dev, I used AMD and mesa-opencl-icd, but many things didn't work (though no arithmetic errors!). For AMD, instead of mesa-opencl-icd, I looked into AMD's Linux drivers, but my GPU wasn't on the list of supported devices, so I didn't try it.
Based on my limited data, I now make the following conclusions: (1) Linux, unlike Windows and macOS, never makes arithmetic errors regardless of GPU, (2) Nvidia, unlike Intel and AMD GPUs, never makes arithmetic errors, (3) everything, such as ctz() and printf(), works on Nvidia, and (4) the device, not the OS, determines if there is a watchdog timer (only my Intel GPUs have a watchdog timer, and they have it on all OSs).
Update 2: Using an Nvidia Quadro P4000 (and Xeon W-2135) and Ubuntu's nvidia-cuda-toolkit package, OpenCL and David Bařina's code run successfully! I simplified his GPU code here (see my comments at the top), and here is his simplified OpenCL kernel. His code achieved 5.73 × 10^10 numbers/second on the Nvidia Quadro P4000 (GitHub is updated with faster code by now), so my goal is 1 × 10^11 numbers/second! I didn't fine tune any parameters. I just asked David Bařina what the standard use is, then I increased TASK_SIZE by 4 (from 40 to 44). I then increased TASK_UNITS by 4 (from 16 to 20) in order to keep TASK_SIZE - TASK_UNITS the same, and the code did slightly better, giving the speed above.
Update 3: I worked with him to modify his code to achieve 7.77 × 10^10 numbers/second! Here is my host program, gpuworkerBrad.c, and OpenCL kernel, kernelBrad.cl. Unlike my CPU code, which was from scratch, I only modified his GPU code. I somewhat cheated by no longer searching for the largest number or making a checksum. Read my notes at the top of each file. Anyway, my next goal is to improve a pre-calculation (interlacing) kernel as I feel it has much potential!
Update 4: His pre-calculation (interlacing) code has been improved! Here is my kernelBrad-precalc.cl that runs 1.20 × 10^11 numbers/second, surpassing my goal! As always, see my comments at the top, and it runs with the same host program as the other kernel (just change the kernel filename in the host program). In the process of making this, he also got his pre-calculation kernel running much better! I find myself happy but also annoyed with the idiosyncrasies of GPUs. Something that would help the GPU that I was kindly allowed to use (Pascal microarchitecture) would not help the GPU that David Bařina was using (newer Turing microarchitecture), even though the code was the same. Also, things that should obviously help reduce computation would greatly slow down the calculations on both architectures. From my limited experience experimentally testing the Collatz conjecture, I conclude that CUDA Cores hate reading from RAM unless it's the RAM that is shared with the threads in the same work group. Anyway, I am happy to have learned so much about GPUs! I want to thank Brad Philipbar for letting me SSH into his fantastic machine during my winter break, and I thank David Bařina for answering my emails long before I knew much of anything about the Collatz conjecture!
Update 5: I made 2 improvements: (1) generating my own format of sieves and (2) running 3 × 2^TASK_SIZE numbers. The goal of both is to reduce the amount of time threads spend waiting on other threads in the work group. I am now at 1.30 × 10^11 numbers/second when running 3 × 2^46 numbers with a 2^34 sieve (same sieve size as prior runs). Here is the sieve generator, collatzCreateSieveUintList.c, the host program, gpuworkerBrad-precalc2.c, and the kernel, kernelBrad-precalc2.cl. The cost of using my own sieves is that they: (1) use more GPU RAM, (2) use more of the storage drive, (3) require setting more parameters when changing the sieve size, and (4) take more time to run unless TASK_SIZE - SIEVE_LOGSIZE is at least 10. My next goal is to use a 2^40 sieve, which will probably require running the kernel multiple times on different parts of the sieve.
Update 6: The 2^40 sieve helped (see the sieve creator in the previous update)! It took a day to create, and it requires almost 40 GiB of your storage drive, but you never have to create it again after the first time! The Nvidia Quadro P4000 now does 1.83 × 10^11 numbers/second when running 3 × 2^48 numbers! A newer GPU microarchitecture and more CUDA Cores would certainly increase this speed! Turing microarchitecture seems to be about twice as fast as Pascal per CUDA Core. Looks like this GPU is 36 times faster than a single i5-3210M CPU core. I also increased the speed (or allowed for a smaller TASK_SIZE) by loading the next part of the sieve from the storage drive to CPU RAM while the GPU works on the previous part of the sieve. Here is the host program, gpuworkerBrad-precalc3.c, and the kernel, kernelBrad-precalc3.cl. If I cared to save space on my storage drive, I'm sure the 8 GiB 2^40 sieve could also be used by populating my section_of_sieve array from a compressed sieve (to be done while the GPU is working), but I'll leave doing this to someone who cares, though I'd start by looking at my count16to1sieve.c to see how to read through those sieves.
Update 7: I guess I'll write the code to run my compressed sieves! Here is the host program, gpuworkerBrad-precalc3b.c, which still uses kernelBrad-precalc3.cl. I need to run extra numbers, 3 × 2^50, to get the same speed. This is probably because decompressing the sieves keeps the CPU core from best controlling the GPU: with a not-large-enough TASK_SIZE, the CPU doesn't finish before the GPU does.
The next step is to start working up to k=80. Since the fraction of numbers that need to be tested approximately exponentially decays as k increases, a larger k would be great! This requires replacing various 64-bit integers with 128-bit integers in collatzSieve2toK_FindPatterns.c (all 64-bit integers should be replaced) and collatzCreateSieve16to1compression.c (all 64-bit integers except those having to do with patterns and bytes should be replaced). My collatzCPU16to1.c should be ready to go, but kernelBrad-precalc3.cl needs variable L to be changed to a 128-bit integer when k > 40.
For k=41, I get a reduced deltaN bound of 1215. For k=46, I get 5231. Finding the actual (experimental) deltaN from this k=41 bound would take about a month on a CPU core! The only reasonable way to continue is to use a GPU!
My idea is to have the GPU run a kernel that does k steps of something like 2^26 numbers (probably having each GPU thread do some group of these numbers). The GPU will create a uint64_t array that is 2^27 long. The extra "times 2" is because I actually want a uint128_t array, but those can't be passed to a kernel. This array will store the result after k steps. The GPU will create another array, a uint8_t array that is 2^26 long, that will be set to 0 or 1 depending on whether the number ever reduces during those k steps (though I could make it a fourth of this length by only saving when n%4 == 3). The CPU can then use the huge array to see if paths joined for any of the numbers that have a 1 in the uint8_t array (changing some of the 1's to 0's). The CPU will have to hold on to the last deltaN entries of the previous huge array. The CPU can then deal with finding patterns and any writing to the storage drive while the GPU is running the next chunk of 2^26 numbers. The GPU might need to create another uint8_t array that counts the number of increases during the k steps.
Using the above idea, I can create a new collatzSieve2toK_FindPatterns.c that finds the experimentally reduced deltaN for k > 40 using a kernel file. By running each k in order, I can also create an updated patterns file that allows sieves to go above k=40. So far, I'm not saving the sieves out to the storage drive yet. Here is collatzSieve2toK_FindPatterns_GPU.c and the kernel, kernel.cl. Running it will take me time, but I add to the spreadsheet and update collatzCreatePatternsForSieve.c as I get results.
Note: The code currently makes and uses a uint8_t array that counts the number of increases during the k steps, though, without it, the code runs a bit faster and, at least for k=40 (with deltaN = 120), creates the same sieve. It seems that, for small n and large deltaN, checking the number of increases could be necessary to prevent numbers from being incorrectly removed from the sieve. Since it would be a disaster to skip numbers that should be tested, I must include a comparison of the number of increases. Currently, the code will output some information if this comparison is ever useful! Maybe I can prove that this comparison is never necessary? So far, running k = 41, 42, 43, 44, and 45 (using the deltaN bound, not the experimental deltaN) has not required this comparison.
If I run the same GPU code but with a larger deltaN, the code can be much slower. This is because the CPU core cannot process the results from the GPU fast enough. Multiple CPU threads via OpenMP to the rescue! My collatzSieve2toK_FindPatterns_GPU.c now uses many CPU threads (the number of threads can be set in the code). In the future, as GPUs get faster and CPU cores don't, OpenMP becomes more and more necessary. On the Nvidia setup I'm using, a deltaN over 200 or 300 could justify using OpenMP. Note that, as k increases, ignoring the effect of k on deltaN, the CPU doesn't have to work as hard to keep up with the GPU because the CPU has fewer numbers to test (and the GPU has to take each number through more steps).
If you don't want to use multiple cores, you could set the num_threads() to 1 or just comment out the entire pragma omp line.
On Windows, OpenMP works great using gcc in Cygwin (not so great in MinGW-w64). On macOS, to get OpenMP to work, I installed gcc9 via MacPorts, then I use gcc-mp-9 instead of clang. Of course, OpenMP works flawlessly in Linux.
To run k > 43, OpenMP wasn't enough to speed things up for my comfort. Since these runs use a lot of electricity ($1 USD per day), I don't want a run to go to waste due to a power outage at the last minute. My collatzSieve2toK_FindPatterns_GPU.c now saves checkpoints to the storage drive at the frequency of your choosing. Set log2saves to 0 to turn off checkpoints, and set loadCheckpoint to an integer larger than 0 to load that saved checkpoint.
I can now see how deltaN (to be specific, the variable maxM) increases as the sieve is analyzed, and, interestingly, it doesn't immediately reach its final value, meaning that the numbers that require the large deltaN are very rare.
Here is collatzCreateSieve16to1compression_GPU.c for creating sieves! It uses the same kernel, kernel.cl. It doesn't yet use OpenMP because the deltaN that this file runs is much less than the deltaN when first running collatzSieve2toK_FindPatterns_GPU.c, and it doesn't yet save at checkpoints because I don't really care to save out huge sieves (see the next section).
As for going up to k=80, going above k=64 will require kernelBrad-precalc3b.cl to receive a 128-bit sieve array, which, due to the limits of OpenCL, must instead be a 64-bit array that is twice as long (arrayLarge in my kernel.cl already passes 128-bit numbers this way). You will also have to change how bMod is calculated in collatzCPU16to1.c (to be done like k2Mod). Another issue with collatzCPU16to1.c is that lenC3 will need to be increased from 65 to 81. Going above k=65 will require maxDecreases in collatzFindDeltaNbound.c to be lengthened (very easy!).
Creating sieves above k=80 would require integers larger than 128 bits because 2^80 - 1 increases 80 times to 3^80 - 1, which fits in a 128-bit integer, but 3^81 does not fit. You would want to rewrite the code using smaller integer types, then implement low-level arithmetic operations yourself. In OpenCL, you can use the vector data type uint8 (or ulong4). Perhaps a good intermediate step would be uint4 and uint2 (or ulong2 and ulong). As far as I can tell, the only advantage of using uint instead of ulong in OpenCL is not getting subtle arithmetic errors on Intel and AMD GPUs, but I am so horrified that these errors occurred that I'd only ever use Nvidia for serious OpenCL.
Since k=46 would require 512 GiB of storage drive (ideally an SSD), I have an idea that never requires running collatzCreateSieve16to1compression_GPU.c to create a sieve on the storage drive allowing for much larger sieves!
My idea is to run very large sieves "without a sieve" in that I will generate the sieve live without reading from a storage drive. Ironically, this was my first idea (see collatzCPU.c), but now I will implement it in a very good way. Instead of having some sort of task ID that you increment each run, you will have two task IDs: the first controls a huge amount of numbers such as 9 * 2^65, and the second refers to a chunk of these 9 * 2^65 numbers. Only increment the first when the second has reached its final value. The second task ID determines which segment of the very large sieve will be produced. The total size (9 * 2^65 in this example) should be chosen carefully because it would be a shame to have to leave a chunk unfinished. The idea is that you make the 2^k sieve as large as possible while keeping the sieve generation time negligible compared to time running numbers using that chunk of the sieve.
For the GPU code, you will want two kernels. Kernel1 will generate the part of the sieve determined by the second task ID. Kernel2 will test 9 * 2^(65 - k) numbers for each number in the sieve (this number should be large to not notice the extra time needed to generate the sieve live). Luckily, most of what kernel2 needs as inputs are the same as what kernel1 creates as outputs, so the GPU RAM can mostly just stay in the GPU between the kernels! This allows kernel2 to start with pre-calculated numbers!
As for the CPU-only code (that is, no GPU), I won't check any maxSteps variable anymore, for consistency with my GPU code. Perhaps one day I will try a "semi sieveless" code that loads in a smaller sieve to speed up the pre-calculation.
With this sieveless approach, I might try running collatzSieve2toK_FindPatterns_GPU.c up to k=48, or maybe even k=51 (or anything k<81). The main result I need is deltaN, so, with the help of the checkpoints that can be saved, I could always just run enough of these huge sieves to find a big-enough deltaN (or just have it print each time maxM is increased)! Besides, if a large deltaN is rarely needed, I don't want it slowing down my CPUs! To view the current deltaN from a checkpoint, something like od -N 2 -t d2 temp1maxM should print it! My sieveless codes allow you to choose a deltaN_max to override any other deltaN. This can be used to run k=65 for example.
With this sieveless approach, I won't maintain all the files here anymore. The following general files will be maintained...
• largeK.xlsx ← I have now highlighted the best k values to use!
I will of course also maintain...
I will maintain the following GPU files...
The above were last updated 2021-02-22.
I have had incredible results so far (January 2021). I tried k=51 with the CPU code, and I easily got 9.0 × 10^9 numbers/second on the same i5-3210M CPU. I was running a task that only took a minute. My GPU code was also a huge success. With k=51, I easily got 3.1 × 10^11 numbers/second on the same Nvidia Quadro P4000. This was for a task that took less than a minute.
Getting exact speeds is difficult because each task has a different part of the sieve, but the sieveless approach is the only way to go. The only real con is that you must finish running each of the second task ID for the results to be valid (or to have any useful results at all). Note that, for the CPU code, you can always set TASK_SIZE = k.
While running various tests on my GPU code to see if it was working correctly, I discovered something interesting: my hold and holdC were never being used, yet I was always getting the correct results! So, for k=34, I searched for numbers of the binary form ...0000011 that did not reduce but joined the path of smaller numbers (because these are the numbers that would require hold and holdC), but I didn't find any! I only found (many) numbers of the form ...1011. I could then test hold and holdC using the very unrealistic settings TASK_SIZE = 3 and task_id = 120703 (and k=34). This is interesting because maybe I could prove that, if TASK_SIZE is large enough, hold and holdC are unnecessary!
A larger k (and therefore TASK_SIZE) might be better even if a larger fraction of the time is spent making the sieve. If you get a 10% faster sieve at the cost of sieve creation taking 5% of the total time, you still win!
Here are the steps to setting your parameters...
(1) Choose an amount of numbers that you could commit to finishing in a month (or whatever your timescale is). If 2^68, set TASK_SIZE_KERNEL2 in the GPU code and TASK_SIZE0 in the CPU code to be 68 (you'll also need to remove some of the 9's in the code, as explained in the code).
(2) Then set k and deltaN_max to maximize the numbers per second! If you set k too large, then the code spends too large a fraction of time generating the sieve. I have found that the very large deltaN only occur very rarely, so deltaN_max need not be huge. If your TASK_SIZE_KERNEL2 and TASK_SIZE0 are not large enough and you find that the fastest k is something small like 35, you may wish to not use this sieveless approach and just use a sieve that you store on your storage drive. See my spreadsheet, largeK.xlsx, for good k values to run. Also, it might be faster to change the code to load in a smaller sieve to help speed up the creation of the actual sieve.
(3) Choose the amount of numbers that each task should test—let's say 2^40—then set TASK_SIZE = 40 + k - (TASK_SIZE0 or TASK_SIZE_KERNEL2). Each task will look at a 2^TASK_SIZE chunk of the sieve to test 2^(TASK_SIZE0 - k) or 2^(TASK_SIZE_KERNEL2 - k) numbers per number in the sieve. Because there is apparent randomness to the sieve, TASK_SIZE should be at least 10 to give each process something to run! Unlike the CPU code, which needs very little RAM, TASK_SIZE in the GPU code determines RAM usage, and you want it large enough to minimize your CUDA Cores sitting around doing nothing.
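To make these steps concrete, here is the arithmetic of step (3) as a tiny sketch (the values 68, 51, and 2^40 are just this page's example numbers, and the function names are mine, not names from my codes):

```c
#include <stdint.h>

// Worked example of step (3): TASK_SIZE0 (or TASK_SIZE_KERNEL2) = 68
// from step (1), k = 51 from step (2), and 2^40 numbers per task.
int taskSize(int perTaskLog2, int k, int taskSize0) {
    return perTaskLog2 + k - taskSize0;   // TASK_SIZE = 40 + k - TASK_SIZE0
}

// Each task scans a 2^TASK_SIZE chunk of the sieve and tests
// 2^(TASK_SIZE0 - k) numbers per number in the sieve, so the two
// exponents must add back up to the per-task amount.
int perTaskCheck(int taskSize, int k, int taskSize0) {
    return taskSize + (taskSize0 - k);
}
```

With these example values, taskSize(40, 51, 68) gives TASK_SIZE = 23, and 23 + (68 - 51) = 40 as required.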
If you want to run both the GPU and CPU code simultaneously, be sure to use the same sieve size for both, and have the size controlled by the first task ID be the same for both. I would normally say that TASK_SIZE needs to also be the same, but, if the CPU tasks need to finish in a certain time, it may be best to set TASK_SIZE differently: the CPU should have a lower one. This would require just a bit of organization when setting the second task ID to make sure that the tiny CPU tasks fit around the GPU tasks and to make sure that all of the sieve is eventually run.
If you have already tested up to some number with a non-sieveless approach, you can still easily use this sieveless approach! For my CPU code, adjust aStart manually. For my GPU code, adjust h_minimum and h_supremum in kernel2.cl, keeping in mind that integer literals over 64-bit don't work (so use bit shifts).
If you want to use general GPUs...
• For computers with watchdog timers, a maxSteps variable must be introduced, and a message should be printed whenever testing a number exceeds maxSteps. This is crucial.
• For computers with watchdog timers, you will likely have to modify the code to call kernel2.cl multiple times, each time with a different h_minimum and h_supremum.
• To accommodate GPUs (and CPUs) without ECC RAM, you'll have to run each task twice to validate.
• The 128-bit integers should become OpenCL's uint4 to prevent arithmetic errors on non-Nvidia GPUs. I believe that ulong arithmetic should never be used on Intel or AMD GPUs. Another solution could be to look at my testBug64.c and testBug128.c to understand the bugs, then test each device for these bugs before running the Collatz code.
• To accommodate weaker GPUs, set TASK_SIZE to be smaller.
To find the largest number or to calculate checksums, you can easily edit kernel2.cl to output some extra things. To see how, search David Bařina's kernel for mxoffset, max_n, and checksum_alpha. Of course, something similar can be done for my CPU code. I don't expect this to greatly slow down the code.
If finding the max number and using a 2^k sieve, you have to be careful that the max number does not occur in the first k steps. That is, your sieve cannot be too large. Even after k increases, the number only gets approximately (3/2)^k times larger, requiring approximately log2(3/2) × k ≈ 0.585 k more bits. Judging from this list of max numbers, a new maximum occurring within those first k steps would be unlikely since even a k=80 sieve is very safe (unless you start testing up to 2^89 and no new records are found!).
If you want to speed things up a bit more, my CPU-only sieveless codes don't yet use "n++ n--" when generating the sieve.
On the Nvidia-Quadro-P4000 Linux setup I am using, I use clFlush() and clFinish() for all of my sieveless GPU code after calling kernel2, and a CPU core waits at 100% usage for the GPU to finish. To greatly reduce this CPU usage, instead of using clFlush() and clFinish(), I tried to return an event object when calling clEnqueueNDRangeKernel(), then, in a loop containing usleep(), use clGetEventInfo() to see if the event has finished. I read somewhere that waking up from a sleep() only takes about 30 microseconds, so I could check many times per second! However, it seems that my code waits at clEnqueueNDRangeKernel() (any printf() and fflush(stdout) right after it don't print until the GPU is finished), though commenting out the printf() in kernel2 fixes this behavior. But then the code waits at clGetEventInfo() with the CPU running at 100%. I'm not sure if this behavior is due to my GPU or my code, but let me know if you can stop a CPU core from running at 100% while kernel2 runs!
99% of the time, my codes are running just a tiny amount of code: collatzSieveless.c is in the
for (a=aStart; a<aEnd; a++)
loop, and collatzSieveless_GPU.c is in kernel2.cl's
for (uint128_t h = h_minimum; h < h_supremum; ++h)
loop. I'm using the "n++ n--" algorithm for this code, and my "sieveless" approach allows me to use huge sieves to prevent this code from needing to be run as often. My CPU code is, as expected, likely the world's fastest at the moment. But this 2017 paper has much faster GPU code than mine! Their Nvidia GeForce GTX TITAN X has more CUDA Cores than my card (3072 vs. 1792) but uses the older Maxwell microarchitecture (vs. my Pascal).
Their code is faster because they repeatedly do k2 steps at a time. I currently do k steps then just a few steps at a time until the number is reduced. A long time ago, I found that my approach was faster (for CPU at least), but the paper actually describes two sieves being used simultaneously. For GPU, they use k=37 and k2=18. For CPU, they use k=37 and k2=11. The k=37 sieve is what I can turn into something like k=51 by using my "sieveless" approach. It is what determines which numbers are run and controls the initial k steps. The smaller sieve is for creating a table to do k2 steps at a time after the initial k steps. I suppose this k2 sieve isn't really acting as a sieve, so I suppose I should call it a lookup table, but I probably will accidentally call it a sieve from time to time! This "repeated k steps" approach is a replacement for the "n++ n--" algorithm (though the "n++ n--" algorithm can be used in generating sieves).
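To illustrate the lookup-table idea (this is my own minimal sketch, not the paper's code, and all the names here are mine), a 2^K table indexed by the low K bits of n advances n by K shortcut steps at once, using the identity T^K(2^K*a + b) = 3^c(b)*a + T^K(b), where c(b) counts the increase steps among the first K steps of b:

```c
#include <stdint.h>

#define K 8                  // table covers 2^K residues (they used k2 = 18 on GPU)

uint64_t tableC[1 << K];     // number of increase steps among the first K steps of b
uint64_t tableD[1 << K];     // b after K steps
uint64_t pow3[K + 1];        // pow3[i] = 3^i

// one shortcut step: (3n+1)/2 if odd, n/2 if even
uint64_t T(uint64_t n) { return (n & 1) ? (3*n + 1)/2 : n/2; }

void buildTable(void) {
    pow3[0] = 1;
    for (int i = 1; i <= K; i++) pow3[i] = 3*pow3[i-1];
    for (uint64_t b = 0; b < (1 << K); b++) {
        uint64_t n = b, odd = 0;
        for (int i = 0; i < K; i++) { odd += n & 1; n = T(n); }
        tableC[b] = odd;
        tableD[b] = n;
    }
}

// advance n by K steps with one multiply and one add, using
// T^K(2^K * a + b) = 3^c(b) * a + T^K(b)
uint64_t stepK(uint64_t n) {
    uint64_t a = n >> K, b = n & ((1 << K) - 1);
    return pow3[tableC[b]] * a + tableD[b];
}

// apply T K times one step at a time (only for checking stepK)
uint64_t naiveK(uint64_t n) {
    for (int i = 0; i < K; i++) n = T(n);
    return n;
}
```

The identity works because the parity sequence of the first K steps depends only on n mod 2^K, so hard-to-reduce numbers cost the same as easy ones, which is exactly what a GPU work group wants.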
The best code depends on your purpose.
If you want to find the max n, you'll want to use the "n++ n--" algorithm with my "sieveless" approach.
If you just want to test the Collatz conjecture, then you'll want to use the "repeated k steps" algorithm with my "sieveless" approach.
By the way, shame on them for not making their code readily available, whereas David Bařina put a link to his code in his paper. If I couldn't do better than them, I would have tried Googling the Japanese authors to see if they can send me the code.
Below is CPU and GPU code! My CPU code is currently set to k=51 and k2=11. My GPU code is currently set to k=51 and k2=18. I easily got 1.8 × 10^10 numbers/second on my I5-3210M CPU and 1.5 × 10^12 numbers/second on the same Nvidia Quadro P4000. This is faster than the 2017 paper! So I'd say that, starting January 2021, I have the fastest CPU and GPU code in the world!
• collatzSieveless_repeatedKsteps.c is my CPU code
• kernel1_repeatedKsteps.cl to make the k sieve
• kernel1_2_repeatedKsteps.cl to make the k2 sieve
• kernel2_repeatedKsteps.cl to use the sieves
The above were last updated 2021-02-22.
I again want to thank Brad Philipbar for letting me use his GPU, and I want to thank David Bařina, Brad Philipbar, and Trevor Wommack for their helpful conversations!
The GPU saw a gain larger than the CPU's gain. This is because a work group for a GPU must wait for the slowest thread, so using an algorithm like "repeated k steps" that caters to the hard-to-reduce numbers is wise!
Great news! I learned that nearly all of the numbers removed by a deltaN (after checking for reduction in no more than k steps) are from deltaN = 1!
For k=21, 38019 numbers need testing if using all deltaN, 38044 need testing if only using deltaN=1, and 46611 need testing if not using any deltaN.
For k=32, 33880411 numbers need testing if using all deltaN, 33985809 need testing if only using deltaN=1, and 41347483 need testing if not using any deltaN.
I am adding a column to the spreadsheet, largeK.xlsx, to give the numbers that need testing if only using deltaN = 1. I now recommend setting deltaN_max to 1! This should cover most removals by a deltaN while allowing for a larger k! If you are sure you want to use deltaN = 1, you can get rid of hold and holdC from the GPU code, and arrayLarge and arrayIncreases can be made half their length or even shorter.
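As a concrete check of the no-deltaN counts above, here is a small sketch (my own code; countSurvivors is my name, and I'm assuming the no-deltaN criterion is just the reduction check). Writing n = 2^k*a + b, after j steps the coefficient of a is 2^(k-j)*3^c, where c is the number of increase steps so far, so n is guaranteed below its start (for large a) exactly when 3^c < 2^j. Counting the residues b that never satisfy this within k steps:

```c
#include <stdint.h>

// Count the residues b mod 2^k that a plain 2^k sieve (reduction check
// only, no deltaN) cannot exclude. A residue is excluded once
// 3^c < 2^j at some step j <= k, where c = increases so far.
uint64_t countSurvivors(int k) {
    uint64_t count = 0;
    for (uint64_t b = 0; b < (1ULL << k); b++) {
        uint64_t n = b;
        uint64_t pow3c = 1;          // 3^c, where c = increase steps so far
        int survives = 1;
        for (int j = 1; j <= k; j++) {
            if (n & 1) { n = (3*n + 1) / 2; pow3c *= 3; }
            else       { n /= 2; }
            if (pow3c < (1ULL << j)) { survives = 0; break; }
        }
        count += survives;
    }
    return count;
}
```

countSurvivors(21) reproduces the 46611 quoted above for k=21 without any deltaN.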
An idea: I once had the idea of saving out a 2^32 sieve that a "partially sieveless" code could load in to help generate a much larger "sieveless" 2^k sieve. My GPU approach is to do k steps at a time for all numbers, so, for large deltaN, I'd still have to do k steps for all numbers anyway. My CPU-only approach would benefit, but I didn't think it would benefit enough to be revolutionary. So I settled for only using a 2^2 sieve to help generate a much larger 2^k sieve. However, a deltaN = 1 sieve could use this 2^32 sieve very nicely! For each n in the 2^k sieve for which n % 2^32 belongs to the 2^32 sieve, do k steps for n and n-1. This should greatly speed up sieve generation, allowing for larger k values! Also, for GPU code, TASK_SIZE could be increased more without running out of RAM. I realized that this using-multiple-sieves idea could be great for deltaN=1 when Eric Roosendaal was telling me about how he uses a 2^8 sieve (which only requires 16 numbers to be run) to create a larger sieve. I hope to (soon) create "partially sieveless" CPU and GPU codes for "n++ n--" as well as for "repeated k steps"!
I will call these codes "partially sieveless" because the smaller sieve of size 2^k1 is saved to disk, so it's not "sieveless" (that is, it's not generated as you go). However, the larger 2^k sieve will be "sieveless" (that is, it is generated as you go).
The 2017 paper I linked to also has a "repeated k steps" algorithm to count steps to 1, with some great results regarding the best size of lookup table (what I call k2) to use, though I'd imagine that they'd get somewhat different results if reducing much larger numbers. I started trying to write code to do this, but I soon realized that there is no clear way to define a step. Should (3*n + 1)/2 count as 1 step or 2 steps? The convention seems to be to call it 2 steps, but doesn't it seem more natural to call it 1? In light of David Bařina's new "n++ n--" algorithm, a step should probably be each bit shift! Regardless of how steps towards 1 are counted, when doing the k steps of a 2^k sieve, (3*n + 1)/2 must be counted as 1 step.
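Whichever convention you prefer, the two counts are related in a simple way: counting (3*n + 1)/2 as two steps just adds one step per increase. A small sketch (my own code and names) showing the delay counted both ways:

```c
#include <stdint.h>

// Delay with (3n+1)/2 counted as ONE step (the shortcut convention used
// for the sieves on this page); also returns the number of increase
// (odd) steps via *increases.
int delayShortcut(uint64_t n, int *increases) {
    int steps = 0, odd = 0;
    while (n != 1) {
        if (n & 1) { n = (3*n + 1)/2; odd++; }
        else       { n /= 2; }
        steps++;
    }
    *increases = odd;
    return steps;
}

// Delay with 3n+1 and n/2 counted as SEPARATE steps (the convention
// used by most published delay records).
int delayStandard(uint64_t n) {
    int steps = 0;
    while (n != 1) {
        if (n & 1) n = 3*n + 1;
        else       n /= 2;
        steps++;
    }
    return steps;
}
```

For example, n = 27 gives the well-known 111 steps under the two-step convention, which is its shortcut delay of 70 plus its 41 increases.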
An annoying aspect of a "reduce to 1" code is that the code will take more time per number as you test larger numbers. You'll probably want to optimize your code for rather large n. Benchmarking your code will also need to be done carefully to make sure the same numbers are run.
For any scan such as the "reduce to 1" scan for max steps or a scan for max n, it is important to start testing numbers beginning at 1 without skipping. Anyone can find a number that has a huge number of steps, but the goal is to find the smallest number to require that many steps. 2^10000 takes 10000 steps to reduce to 1, but who cares? If you want to argue that 2^10000 can be reduced in 1 step (a single bit shift), I would probably agree with you, but you can always run the Collatz algorithm backwards to get more interesting numbers that require any number of steps. To do these backwards steps, when n satisfies (2*n-1)%3 == 0 and n != 2, you may randomly consider setting the next n to (2*n-1)/3, else set next n to 2*n.
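Those backward steps can be sketched in a few lines (my own code; backwardStep and preferOdd are names I invented). With (3*n + 1)/2 counted as one step, the odd predecessor of n is (2*n - 1)/3 whenever that is an integer:

```c
#include <stdint.h>

// one forward shortcut step: (3n+1)/2 if odd, n/2 if even
uint64_t T(uint64_t n) { return (n & 1) ? (3*n + 1)/2 : n/2; }

// One backward step. Every n has the even predecessor 2n, and, when
// (2n-1) % 3 == 0, also the odd predecessor (2n-1)/3, since
// T((2n-1)/3) = n. The odd branch is skipped at n == 2 because
// (2*2-1)/3 = 1 would walk back into the trivial 1 -> 2 -> 1 cycle.
uint64_t backwardStep(uint64_t n, int preferOdd) {
    if (preferOdd && n != 2 && (2*n - 1) % 3 == 0)
        return (2*n - 1) / 3;
    return 2*n;
}
```

Applying the forward step to the result always recovers n, so repeatedly calling backwardStep with preferOdd chosen at random builds more interesting numbers requiring any desired number of steps.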
There is a BOINC project that counts the steps to 1 that has been running since 2009. However, they made a fatal error by using a certain type of sieve, so all of their "high step" results are invalid. I have many other serious concerns with the project, but the person who runs the project, Jon Sonntag, never responded. In January 2021, I let the forums know of this fatal error. Jon Sonntag finally responded by deleting my forum posts, so I posted it at the main BOINC forum. Likely, none of their results are valid. Certainly, my code is vastly faster: on my I5-3210M CPU, his code only tests 2.3 × 10^8 numbers/second, and on the Nvidia Quadro P4000, his code only achieves 4.14 × 10^10 numbers/second! To calculate these speeds, I first needed to figure out that his code tests (tested?) 3*2^44 numbers each task. The time, electricity, and opportunity wasted by this BOINC project can be seen here. I had fun doing much better than this project, but it was sad to see the blind faith of some of Jon's followers when they would get so angry with my accurate harsh statements. The lesson to be learned here is to listen to the scientists and to the experts, and don't follow people who are all show and no substance. Just because someone knows more than you about something doesn't mean that they are trained in thinking scientifically.
Certain types of sieves can still be used. A 3^1 or 3^2 sieve can still be used because the numbers removed by these sieves come from smaller numbers, and these smaller numbers will require more steps to reduce than the larger numbers that result from them. To generate my 2^k sieves up until now, I have (1) removed numbers that reduce in k steps then (2) removed numbers that join the path of smaller numbers in k steps. The second of these approaches can still be done because paths that join will require the same number of steps, and the smaller number is the one that needs to be found. My "sieveless" approach could still work! However, I am not sure if the second of these approaches is valid if counting steps towards 1 as bit shifts.
Note that all of my above code for deltaN including collatzFindDeltaNbound.c only try to join paths for numbers that do not reduce, so they need to be modified. My first step was modifying collatzFindDeltaNbound.c by (1) removing the two checks for isS1 and (2) starting the c loop with 1 instead of minC. These modifications are very easy, so I won't provide the modified code, but I put the results in a column in largeK.xlsx. The deltaN bounds are much larger!
To find the experimental (actual) deltaN and to count the numbers that need testing in each sieve, I then modified collatzSieve2toK_FindPatterns.c by (1) removing the bit of code under the comment that says "check to see if if 2^k*N + b0 is reduced in no more than k steps", (2) updating the starting deltaN, and (3) making the b0 loop now start at 0 and increase by 1 each iteration (instead of starting at 3 and increasing by 4 each time). For the final change, I added a test for if b0 is 0 before seeing if it was in the sieve. I learned that not all numbers that join a path of a smaller number work! For a 2^3 sieve, 5 becomes 2 after 3 steps (one of them is an increase), and 4 becomes 2 after 3 steps (one of them is an increase). The problem is that the path that started with 4 should no longer be followed after it reduced to 1 because the 5 takes more steps to reduce to 1, so I added the following line after the nList[m] >>= 1 line...
if ( nList[m] == 1 && m>0 ) nList[m] = 0;
and I had to loop over the steps as the outer loop making the m loop (including checking against the 0th element) the interior loop. Note that, with the code in this configuration, the code could claim an exclusion by a larger deltaN even though a smaller deltaN will exclude it in a later step.
- A very interesting pattern emerged! The experimental deltaN are all 2^(k-4), though I have only tested up to k=16. The explanation seems to be how, for k=4, 13 joins paths with 12. For k=5, just double the 13 and the 12, and we have that 26 joins paths with 24. Doubled numbers must join paths because doubling just adds a decrease step for each. Doubling also doubles the deltaN between them.
- Regarding the density of large-deltaN exclusions, for k=16, only 1 number required deltaN=4096, only 1 number required the next deltaN=3243, and only 3 numbers required the next deltaN=2048.
- Of course, these "reduce to 1" sieves take more time to analyze than the usual reduce-below-start-n sieves. The overall design of my code is not optimized for such large deltaN. My "sieveless" approach also does not favor a large deltaN.
- These sieves are not nearly as good as the ones you can use if not wanting to reduce to 1! As k increases, there is less gain with these "reduce to 1" sieves than with the usual sieves because the fraction that needs testing no longer exponentially decays.
I wonder if the "numbers cannot join numbers that have already reduced to 1" rule can be lifted somewhat. For our 2^3 sieve, 5 joins 4 after 3 steps, but 4 already reduced to 1. This is a problem because 4 takes 2 steps to reduce, and 5 takes 4 steps to reduce, so 5 needs to be tested. However, 2^3 + 5 must also join 2^3 + 4 in 3 steps, but 2^3 + 4 has not yet reduced to 1. Perhaps I could prove that for A*2^k + B that joins A*2^k + b, where b is smaller than B, A*2^k + b with A > 0 never reduces to 1 before the paths join. If I could prove this, then better sieves could be made for testing all numbers except the A=0 numbers. Actually, the proof is very easy: 2^k is the largest number to reduce to 1 in k steps, and nothing above it can join its path. For large enough k, the sieves are barely better if I let numbers join numbers that have already reduced to 1, and the cost of doing this, as far as I have experimentally tested, is that deltaN doubles plus the cost of having to do A=0 separately. Doing A=0 separately is no minor cost if using my "sieveless" approach, which aims to use huge sieves.
As for making these sieves using a GPU, my approach on GPU so far won't work with the "numbers cannot join numbers that have already reduced to 1" rule. My GPU approach would not be able to find cases where numbers joined before reducing to 1 in k steps. That is, my GPU code creates A > 0 sieves. However, unlike the modified CPU code I mentioned, the approach I use in the GPU code would be optimized for these larger deltaN!
Unlike my non "reduce to 1" version of this code, comparing number of increases is now certainly required, hold and holdC arrays are also certainly required, and you'll start using considerable RAM after k=26 (due to deltaN being large). My GPU runs for only a tiny fraction of the run time, so I should probably explore making a second kernel instead of OpenMP! To do this, my 2 collect arrays, my 2 hold arrays, bits, and n0start would need to be passed to the GPU. In the meantime, I wrote a CPU-only code that makes A > 0 sieves that are not limited by the amount of GPU RAM (limited only by CPU RAM).
• collatzSieve2toK_FindPatterns_reduceTo1.c is my CPU-only code
The following is OpenMP CPU-only code for A >= 0 sieves. The algorithm is much slower, but it requires basically no RAM. If I ever care, I'll write GPU code for this...
The above were last updated 2021-02-22.
"Reduce to 1" conclusion: A 3^2 (or 3^1) sieve works very nicely, but the 2^k sieve that you can create has a huge deltaN that is a real pain for not-miraculous gain.
It seems that I'm not the first to conclude such things! See the section called "The class record algorithm".
The strategy I would recommend is to use the "A > 0" sieves because they are easier to generate. For A=0, you can just look up the record for highest steps, so you don't have to run it (note that this list counts (3*n + 1)/2 as 2 steps). Use the same 3 steps I gave for choosing "sieveless" parameters, and your k will have to be small unless you set your deltaN_max to be low enough.
I wrote collatzSieveless_reduceTo1_Aequals0.c to run the A=0 case. I quickly ran up to 2^36, and I verified that, up through 2^36, counting (3*n + 1)/2 as a single step gives the same records as the table I just linked to (with simply a smaller delay). I also found that 762 is the record delay for a 2^36 sieve.
I ran the following "sieveless" files using a 2^36 sieve to allow for a 2^50 sieve by setting STEPS_MAX (or stepsMax for CPU-only code) to 1161. I verified that, up through 9*2^47, counting (3*n + 1)/2 as a single step gives the same records as counting it as 2 steps (with simply a different delay).
• collatzSieveless_reduceTo1.c is my CPU-only code
My results were added to the following file (the other delays in the file are from here)...
The above were last updated 2021-02-25.
To analyze the log file from the GPU code, I ran the Linux commands...
grep "steps = " log.txt | sed 's/^.steps.=.\([0-9]*\).found:.(\([0-9]*\)....64)...\([0-9]*\).*/\1 \3/' | sort -n -k 1,1 -k 2,2 | uniq -w4 > lowestForEachDelay.txt
cat lowestForEachDelay.txt | sort -n -k2 > lowestN.txt
I then ran the Python 3 script...
stepsMax = 762
f = open("lowestN.txt", "r", encoding="utf8").read().splitlines()
for line in f:
    steps = int(line.split(" ")[0])
    if steps > stepsMax:
        stepsMax = steps
        print(line)
As for the density of deltaN in these "A > 0" sieves, the large deltaN are very rare. k=16 has deltaN = 8192 and a total of 46261 numbers removed from the sieve. For the following deltaN_max settings, I provide the numbers not excluded by the sieve...
deltaN_max = 8192: 19275
deltaN_max = 4096: 19276
deltaN_max = 2048: 19279
deltaN_max = 1024: 19288
deltaN_max = 512: 19308
deltaN_max = 256: 19356
deltaN_max = 128: 19457
deltaN_max = 64: 19693
deltaN_max = 32: 20285
deltaN_max = 16: 21582
deltaN_max = 8: 24000
deltaN_max = 4: 27984
deltaN_max = 2: 32997
deltaN_max = 1: 40295 out of 65536 must be run
Here are the numbers not excluded as a function of deltaN_max for k=10 with deltaN = 128 and a total of 586 numbers removed from the sieve...
deltaN_max = 128: 438
deltaN_max = 64: 439
deltaN_max = 32: 442
deltaN_max = 16: 452
deltaN_max = 8: 471
deltaN_max = 4: 513
deltaN_max = 2: 572
deltaN_max = 1: 673 out of 1024 must be run
Note that deltaN's that are of the form 2^integer are the most common.
In addition to testing the Collatz conjecture, why search for other things like records for max n or max steps? Statistics of max n from a random walk can be predicted, so I see why max n is searched for (so far, experimental data is consistent with a random walk). Also, on a practical level, the max n affects how many bits are needed to test the Collatz conjecture. As for max steps, perhaps these records can be predicted from various stochastic models? Certainly, max n is easier to find than max steps!
I found a website with a list of their results of various other searches. In addition to searching for max n and max steps to 1 (where (3*n + 1)/2 is counted as 2 steps), here are the other searches they have done...
(1) "Glide" records are provided (N=2 is missing), where "glide" is the number of steps needed to reduce an n below its start value (where (3*n + 1)/2 is counted as 2 steps).
(2) First n to take N steps to reduce to 1 for all N are provided (where (3*n + 1)/2 is counted as 2 steps). Instead of asking for the first n to take more steps than any previous n required, they are wondering, given any N, which is the first n to require exactly that many steps. Note that my sieveless "reduce to 1" approach can naturally find these results, but a 3^1 or 3^2 sieve can no longer be used.
(3) Ratios involving counts of odd and even n's when reducing starting n to 1 are provided (where (3*n + 1)/2 is counted as 2 steps), though N=2 is missing. This is closely related to max-step records because numbers that require many steps for their size have a higher ratio of odd to even, whereas a number that reduces quickly for its size would produce mostly even numbers. In fact, all records in the first table of results are found in the second one.
(4) "Residue" results are provided, where residue is defined using the counts of odd and even n's when reducing starting N to 1 (where (3*n + 1)/2 is counted as 2 steps). The final 1 is not counted as an odd, and the initial n counts as an even or an odd. Residue is then defined as (2^evens / 3^odds) / N. I think the reason we care about this is that, if the Collatz conjecture didn't have the "+1" in "3*n + 1", N = 2^evens / 3^odds would be true for numbers that reduced to 1 (assuming that such an N would be an integer), and the residue would be 1. Since the "+1" only significantly matters when n is small, the residue's difference from 1 measures how long n is putzing around at small values before reaching 1.
(5) Just found another one: strength records. It wasn't linked on the main list, but I found it.
For (3), (4), and (5), the website makes some conjectures that are interesting, so I understand why they are tested, but I don't really understand why (1) and (2) are tested. In general, I do not understand why people are experimentally counting steps to reduce to 1; perhaps this is because I do not know much about the theory behind trying to prove the Collatz conjecture. Perhaps the only reason is that getting experimental data can guide the mathematical theory if something surprising is found!
Anyway, I mention all this for two reasons. The first is to maybe have someone explain to me why we care about counting steps to reduce to 1. The second is to have a complete list of experimental searches so that I can evaluate if my "sieveless" approach can be applied to them. For some of the above searches, I see certain cases where some form of a sieve can be used. I might give this more thought at some point.
One thing that would interest me is to see a rolling average of the "glide" as defined by the website: the number of steps to reduce a number below where it started. I would expect this to be more-or-less constant as the starting n increases! Not many tricks could be used: no sieves, no "n++ n--", and no "repeated k steps". Though maybe "n++ n--" or "repeated k steps" could be used up until just before n reduces below its start value. For simplicity, I would just average the glides in each consecutive 2^30 block of numbers. I would define (3*n + 1)/2 as a single step because the glide then tells you which sieves will exclude that number. Better yet, define a step as a bit shift so that the "n++ n--" algorithm can be used, and collatzCPUglide_BitShift.c is my code that does this! For the following chunks of 2^30, I get the average glide...
The last chunk of 2^30 runs numbers 2^30*1000000000000000000 + 2 to 2^30*1000000000000000001 + 1.
I would now like to predict the above average and standard deviation from a random walk. In trying this, I see that defining (3*n + 1)/2 as a single step may have been the better way to go, and collatzCPUglide.c is my code that does this! For the following chunks of 2^30, I get the average glide...
Note that, if counting (3*n + 1)/2 as two steps, my results become 3/2 larger. My code that predicts the above from a random walk is collatzGlide.c, which gives 3.492652 with a standard deviation of 6.481678. Since I'm running 2^30 numbers, the central limit theorem says that my experimental results, if from a random walk, should have a standard deviation of 6.481678 / sqrt(2^30) = 0.00020 between them. Seems that the Collatz conjecture is consistent enough with a random walk! Though, I am surprised at how similar all the glides are considering that, if running chunks of 2^1 instead of 2^30, there would be strong oscillation in the glide because n%4 == 3 are the only numbers that do anything interesting. 2^30 averages out enough of the oscillations because a sieve of that size would find that most numbers would reduce in no more than 30 steps.
Another thing I would like to see is the average of (delay - expectedDelay), where delay is defined by the website as the number of steps to reduce to 1, and expectedDelay is given by a random walk. Unlike the website, I will count (3*n + 1)/2 as a single step. Since the glide has already been compared to a random walk, there probably isn't much to gain scientifically from looking at the delay, but writing the code sounds fun. I think the reason I really care is that I am curious to see the expectedDelay values, and collatzDelay.c is my code that does this! I get expectedDelay = f(N) log2(N), where f(N) seems to be approaching a number, but I am unable to trust my results when I run N larger than 2^80 due to my steps being limited to 1023 (else overflow). Here are my results...
N = 2^10: expectedDelay = 5.030079 log2(N)
N = 2^20: expectedDelay = 4.923198 log2(N)
N = 2^30: expectedDelay = 4.888867 log2(N)
N = 2^40: expectedDelay = 4.871146 log2(N)
N = 2^50: expectedDelay = 4.860508 log2(N)
N = 2^60: expectedDelay = 4.853471 log2(N)
N = 2^70: expectedDelay = 4.848431 log2(N)
N = 2^80: expectedDelay = 4.844653 log2(N)
Multiply this by 3/2 if you count (3*n + 1)/2 as two steps. I believe the results are approaching
expectedDelay = 2 / log2(4/3) log2(N)
= 4.818842 log2(N)
The above comes from writing N approximately as (4/3)^(delay/2) by building N up by an equal number of "backward" increases and decreases. Each pair of increases and decreases changes N by a factor of 4/3 and takes 2 steps. I want to thank Eric Roosendaal for this formula. My code also predicts a standard deviation of about 3.8 sqrt(expectedDelay).
I decided to write collatzDelayGMP.c to use the GMP library to allow me to run much larger numbers! Here are the results...
N = 2^1000: expectedDelay = 4.820915 log2(N), standard deviation = 8.384883 sqrt(log2(N))
N = 2^2000: expectedDelay = 4.819878 log2(N), standard deviation = 8.383898 sqrt(log2(N))
N = 2^5000: expectedDelay = 4.819256 log2(N), standard deviation = 8.383410 sqrt(log2(N))
N = 2^10000: expectedDelay = 4.819049 log2(N), standard deviation = 8.383238 sqrt(log2(N))
N = 2^20000: expectedDelay = 4.818945 log2(N), standard deviation = 8.383170 sqrt(log2(N))
N = 2^50000: expectedDelay = 4.818883 log2(N), standard deviation = 8.383100 sqrt(log2(N))
N = 2^100000: expectedDelay = 4.818862 log2(N), standard deviation = 8.383085 sqrt(log2(N))
I was curious about the prediction for what the standard deviation is approaching. First of all, for very large N, the standard deviation of a random walk is effectively constant as steps increases during the time when most numbers are reaching 1, giving a standard deviation of Δincreases = sqrt(steps) / 2 = sqrt(expectedDelay) / 2. The division by 2 is because our random walk graphed increases vs. steps spreads half as fast as the "usual" random walk. Note that most of the walks occur near the line increases = steps/2 on this graph. The standard deviations listed above are Δsteps = step2 - step1, where step2 = 2 log2(N) / (2 - log2(3)) is when the line increases = (steps - log2(N)) / log2(3) crosses the line increases = steps/2, and step1 = 2 (log2(N) - log2(3) Δincreases) / (2 - log2(3)) is when the line increases = (steps - log2(N)) / log2(3) crosses the line increases = steps/2 - Δincreases. This gives Δsteps = Δincreases / (1/log2(3) - 1/2) = 8.383068 sqrt(log2(N)).
As for experimentally testing the above delays, this is challenging. I chose to run chunks of 2^30 numbers. The expected delay noticeably drifts for a run of 2^30 when starting at 2^40, so I started at 2^50. By the way, 2^100 caused overflow (I suppose this depends on how k2 is set). When starting at 2^50, I got the average delay to be 4.833497 log2(N), which seemed promising! But 2^60 gave 5.020711 log2(N). I then ran 2^60.1 and got 4.587167 log2(N), so there's clearly a huge standard deviation! For running 2^30 numbers, the standard deviation should be 0.00026 sqrt(log2(N)), which should be around 0.002, not approximately 10! If I were to run 2^40 instead of 2^30 numbers, I could reduce the standard deviation by a factor of 2^5 = 32 due to the central limit theorem allowing for a better measurement of the average. 2^40 would take a couple days to run on a CPU core using a "repeated k steps" code. Here is my code collatzCPUdelay.c that does this.
Why does the delay have such a large standard deviation compared to a random walk? The glide is an average over oscillations of various size. For example, only n%4 == 3 do anything interesting. The delay must not have any such simple patterns! I would love to see a rolling average of the delay! Regardless, I am now interested in counting steps to 1! There are already some graphs here that show it, but I'm more interested in the patterns that arise.
Here are my results from running 2^10 consecutive groups of 2^20 numbers starting at 2^70 and starting at 2^90. Various features such as oscillations and sudden jumps can be seen! Note that the numbers in the graphs should have a standard deviation between data points of about 0.001.
Bařina, David (2020). Convergence verification of the Collatz problem. The Journal of Supercomputing. doi:10.1007/s11227-020-03368-x. Full text.
Honda, Takumi; Ito, Yasuaki; Nakano, Koji (2017). GPU-accelerated Exhaustive Verification of the Collatz Conjecture. International Journal of Networking and Computing, 7, 69-85. doi:10.15803/ijnc.7.1_69. Full text.