(c) 2026, Roberto A. Foglietta, CC BY-NC-ND 4.0

### rationale ##################################################################

Collecting entropy is a metaphor, because only information can be "collected",
ergo stored in a precise notation like the number 123 or a string of bits. The
aim of "collecting entropy" is to obtain a sequence of numbers (information)
which cannot be anticipated. However, as you can imagine, because that series
of numbers is stored somewhere, it is possible to anticipate it, because it is
known. The fundamental point is about WHO can access that information and HOW
that information is structured.

The structure of the information is fundamental to avoid (or minimise) the risk
that someone can guess the sequence. For example, we can use the Fibonacci
series and decide to start from a specific number of the original series
instead of 1. Unfortunately, anyone who can see or guess a single number of the
series knows all the rest of it. We can use the Fibonacci series from N with
step 2, or step N-4, or whatever: it still has a structure. Structured
information can be guessed, it is just a matter of how difficult it is to make
that guess (or, given a unit of computational power per second, how many
seconds it will take). This is related to how complicated the relationship is
between a number and the one after it, e.g. Fibonacci(N, 4).

When this relationship is "random", and in particular "white noise random", it
means that every next 8 bits (or 16 bits, or 32 bits) of information are
totally uncorrelated with anything before and after, thus the chance to guess
the next 8-bit string is 1:2^8 (no structure can push it lower). This is
randomness, white noise.

The 'drama' is about HOW to generate a random sequence of numbers from a system
that has been designed to be predictable and deterministic (and also error
auto-correcting, like ECC memory, for example). The "easiest" answer is having
a very sensitive thermometer (e.g. 0.00001°C precision) and collecting, each dT
of time, the least significant digit of the temperature measurement. Each dT is
a fixed and repetitive timing, the stated precision is an assumption, and the
wires are EM emitters. Which in practice is easier to sniff than we might
suppose, because extracting "entropy" from heat requires physical work (L) or a
Maxwell's demon.

Under this perspective, what really matters is the predictability from the
known (so, how hard it is to guess from what is known or plausibly known) and
how difficult it is to access that information (storage safety). Is it stored
inside the CPU registers? The chip is the safety boundary. In memory? The EM
S/N ratio of the conductive tracks between CPU and RAM is the safety boundary.
In kernel space or in user-land? For a local attacker the safety boundary is
the root privilege against local escalation. For a remote attacker, wait...

Here is the MAIN point. This "obscurity by randomness" is not used to protect
the root password secret only, but mainly to protect the ciphering of the
communications. Because a crypto algorithm can be extremely hard to break,
unless guessing the seed of the random number generator is easy instead. The
idea behind crypto-safety is that whoever knows the key (a string of bits) can
de/cipher the message (Enigma, WW2) despite the de/ciphering algorithm being
public (or unveiled, because leaked).

Therefore, about randomness, everything is about HOW hard it is to predict or
sniff the sequence of numbers from what we already know or can reasonably
suppose and check in a very short time, short enough to be useful: 1s or
1E100s?
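As a side illustration of the Fibonacci example above (a minimal sketch, not
part of the original text; the names fib_stream and fib_recover are made up for
the purpose): structured information can be fully reconstructed from a single
observed value, which is exactly what a random source must not allow.

fib_stream() {  # print $2 numbers of the Fibonacci series, after skipping $1
    a=1; b=1; s=${1:-10}; n=${2:-5}
    while [ $s -gt 0 ]; do c=$((a+b)); a=$b; b=$c; s=$((s-1)); done
    while [ $n -gt 0 ]; do c=$((a+b)); a=$b; b=$c; echo $c; n=$((n-1)); done
}

fib_recover() { # given ONE observed value, regenerate the next 5 of the series
    a=1; b=1
    while [ $b -lt $1 ]; do c=$((a+b)); a=$b; b=$c; done
    [ $b -ne $1 ] && return 1
    n=5; while [ $n -gt 0 ]; do c=$((a+b)); a=$b; b=$c; echo $c; n=$((n-1)); done
}

fib_stream 20 1                   # the attacker observes just this one value...
fib_recover $(fib_stream 20 1)    # ...and regenerates the following ones anyway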
In modern systems the CPU is multi-core, multi-thread, multi-branch, and all of
these "features" sit behind multi-layered caches. Many tasks are running even
if the whole userland system is "doing nothing", and task preemption can kick
in at any time, creating latencies in HOW long a deterministic task takes.
Because complex systems often require a very low latency (e.g. industry and
CNC), these latencies are not wildly unpredictable, and usually they are not
large but small. Fortunately, the latency is not ALWAYS the same down to the
nanosecond, and the variation of a latency is its jitter, e.g. a latency in
[ 80, 120 ] ns has a maximum jitter of 20 (or 40, peak-to-peak) ns. Usually the
distribution is a bell curve, so dt = 1 ns (actual jitter) is more frequent
than dt = 10 ns.

The main idea could be that within [ 95, 105 ] ns the jitter distribution is
almost flat, so every value in that range is almost equally probable to pop up.
When a value is read outside that interval, it is discarded and thus not
stored. At the next turn it might be discarded again, and a continuous run of
rejections which stalls the sequence has a non-zero probability of happening.
But dT % 10 is always flat when the average is 100 and the jitter is 40
(T >> dT >> dT%10). A small remainder taken on a number which is not white
noise but Cauchy^2 distributed is theoretically totally flat in terms of
distribution, and also in practice when T >> dT >> dT%10 are separated by 4x or
10x ratios, like the LSB of a 32-bit value.

https://github.com/robang74/roberto-a-foglietta/blob/main/data/Tesi_Foglietta-051a.pdf (jitter)

Two functions are great at multiplying "entropy": we cannot create entropy with
a deterministic system, but we can multiply it, because we can spread that
little bit of randomness over a larger set of bits with a hash and a zip
algorithm. The hash "diffuses" in a way that is fast to compute but infeasible
to recover, and the zip algorithm provides a nearly flat distribution of values
which are "hard" to guess when the first part of the output (the table) is lost
(until the next chunk). For example, if 128 KB is the input chunk and 4x is the
maximum compression ratio expected in a single chunk, the output chunk will be
32 KB, thus a 64-byte pick has a great chance to fall somewhere the
unpredictability is almost randomly flat.

The zip algorithm provides (1) a nearly flat distribution of values which is
(2) hard to guess without knowing the encoding table, by definition of "zip"
itself. A decent zip algorithm should tear down the redundancy of the input,
granting that the output value distribution is nearly flat (file header apart,
etc.), as per (1). A zip algorithm whose output can be predicted once the
encoding table is lost means that the zipped part is not flat enough and the
zip is very poor in size shrinking or fidelity (e.g. it is a copy, not a
zipping, or a string of zeros).

A process PT that creates a relatively large jittered latency and clogs the CPU
multi-threading management for a little while, its execution being scattered
among internal computational pipelines and caches, interrupted and pre-empted,
is a great "random" generator in terms of nanosecond fluctuations. Thus the PT
start and stop times in nanoseconds are not related in the least significant
digits of the stop. Expressing these two numbers in digits 0-9, the string
$stop$start has the LSB randomness in the middle of the string, which will look
something like: nnXnnx.
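A minimal sketch of the measurement idea above, assuming a GNU userland (date
with %N) and a POSIX-like shell; the function name jitter_lsb and the fixed dd
workload are illustrative choices, not taken from the original code:

jitter_lsb() {
    t0=$(date +%s%N)                                         # start, nanoseconds
    dd if=/dev/zero of=/dev/null bs=1k count=64 2>/dev/null   # fixed deterministic workload
    t1=$(date +%s%N)                                          # stop, nanoseconds
    echo $(( (t1 - t0) % 1000 ))                              # keep only the LSB digits, where the jitter lives
}

# distribution check: for i in $(seq 200); do jitter_lsb; done | sort -n | uniq -c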
The start can be guessed apart from a few nanoseconds, but the relatively large
jitter causes the LSB part of the stop to have many more digits that can be
considered strongly random in their nature. Once a string with the random
nature of nnXnnx is hashed, the result is XXXXXX, because nobody can say
anymore (once the start/stop values are lost) how the randomness has been
diffused along the checksum string: it is everywhere, making each bit totally
unpredictable in the same manner (multiplication by diffusion).

If a process like the one described here were operated by the CPU microcode, it
would be safely bounded by the CPU itself; modern CPUs can be seen as very good
random generators because of the unpredictability of their jitter in executing
tasks. How good is this kind of random generator, based on CPU architecture and
microcode, at providing unbreakable cryptography, assuming the whole process is
done at the state of the art? Whoever designs the chip (or manufactures it) and
delivers the microcode can definitely have access to every ciphered
communication, no matter what.

### original function (2023) ###################################################

{
    n=$((33 + ${RANDOM:-15}%32))
    dd if=/dev/random bs=$n count=1 2>&1
    cat /proc/cmdline /proc/*stat /init*
} | pigz -$((1 + n%9))c > /dev/urandom &

### untested functions (2026) ##################################################

rafrand_zro() {
    echo $((33 + ${1:-$RANDOM}%32))
}

rafrand_ubf() {
    cmd=""; if [ -n "$1" ]; then cmd="chrt -$1"; fi; shift
    ionice -c3 $cmd stdbuf -o0 -e0 "$@"
}

rafrand_one() {
    k=$(rafrand_zro ${1:-})
    echo ${1:-} &
    (
        set -x
        rafrand_ubf "" dd if=/dev/random bs=$k count=1 &
        rafrand_ubf "i 0" tail -n$((k*16)) /var/log/syslog &
        {
            for f in $(ls -1 /proc/cmdline /proc/*stat /init*)
            do
                rafrand_ubf "b 0" cat $f &
            done &
        }
    ) 2>&1
}

rafrand_two() {
    n=$(rafrand_zro ${1:-})
    m=$((9 - ${2:-$RANDOM}%9))
    rafrand_one ${3:-} |{\
        pigz -${m}cp$((4 * $(nproc))) -b32 |\
        dd bs=1 skip=$n count=64
    } 2>/dev/null
}

rafrand() {
    a=$(date +%N)
    c=$(rafrand_two | md5sum)
    b=$(date +%N)
    c=$(echo $b$a$c | md5sum | tr [0a-f-\ ] [1-9])
    a=$(echo $c | cut -c1-5)
    b=$(echo $c | cut -c6-10)
    c=$(echo $c | cut -c11-15)
    rafrand_two $c $a $b >/dev/urandom
}

### apparent issues #############################################################

1. The "tr" Command Entropy Bias

   It is theoretically correct to raise a "bias issue", but that string is
   split into 3 parts and those parts are used as seeds for other strong-level
   randomness production tasks. While the randomness of the seed seems
   essential, the values used in that process are relatively small compared to
   the 5 digits (%32 to %3). Hence, it does not matter, because the remainders
   are what really matters.

2. The "dd" Skipping on the Compressed Stream

   The variable "n" controls the skip into the compressed data and creates a
   dependency: if the compressed output is smaller than $n (unlikely here but
   possible in other contexts), the command fails / outputs nothing. Which is
   correct, in theory. However, the main issue lies in "cat"-ing files that can
   be empty or non-existent, which is the reason why 2>&1 has been used to "at
   least" collect something, and why "echo ${1:-}" has been added for a little
   more data: cat might partially or totally fail, while /dev/random is not
   supposed to provide a lot of data but just a bit of extra initial
   randomness. In any case, the shell implementation is just a PoC, because the
   C language is the way to go.
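As a way to start addressing the issues above, here is a minimal sanity-check
sketch (not part of the original set of functions; the name rafrand_check and
the use of the "ent" tool are assumptions): it verifies the size emitted by
rafrand_two and accumulates a small sample for byte-level statistics.

rafrand_check() {
    out=$(rafrand_two | wc -c)
    echo "rafrand_two emitted $out bytes (expected 64)"
    # accumulate a few KB of output, then inspect it with "ent" if available
    # (any byte-level statistics tool would do)
    for i in $(seq 64); do rafrand_two; done > /tmp/rafrand.sample 2>/dev/null
    command -v ent >/dev/null && ent /tmp/rafrand.sample
}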
Even in the laziest option, the shell implementation should be contextualised
on the target system which, for example, might not have "pigz", or might not
have a multi-threaded processor but a "dumb" 32-bit micro which provides no
significant jitter, so "pigz" would be almost useless for this goal on that
system.

### updates ####################################################################

rafrand_one: it spawns many processes and it uses "2>&1" and "set -x" to mix,
by an unpredictable sequence of task switches, different data flows which can
potentially be interleaved ("stdbuf") into a single output stream, not only
because of task switching but also because of output buffering suppression (or
changes).

rafrand_ubf: another way to introduce task-switching uncertainty, setting
scheduling policies and priorities.

Both should be tuned, or rely on templates specific to the target architecture
and customisable for the specific system. This approach has a good chance to
"collect entropy" also from simpler CPUs. By the way, leveraging a "complex"
scheduler is an assumption, compared to just having a simple rotating-tasks
scheduler in the OS. However, rotating tasks may introduce some glitches which
bring in those "stochastic" features we would like to have. It is worth noting
that, because just a small portion of the zipped data stream is taken into
account, it is hard to believe that this would have a practical effect unless
hashed chunks were also included into that data-flow mixing process.

### tiny random generator ######################################################

The rafrand_tiny() leverages rafgen5sum(), which produces random output by
multiplying the jitter of I/O tasks such as dd, whose output looks like this:

echo | dd
0+1 records in
0+1 records out
1 byte copied, 4.5698e-05 s, 21.9 kB/s

There is little randomness here, because the execution time varies between
4e-05 s and 8e-05 s in my tests, which is about 4.5 digits, circa 5E4
combinations: less than 16 bits of "entropy", because everything else is
correlated with those 4.5 digits of information.

rafgen5sum() {
    n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1 | md5sum; done
}

Since rafgen5sum() runs that dd five times, it creates a data stream with 64+
bits of unstructured information (randomness), which is enough to fill the
128-bit md5sum.

ug() { rafgen5sum; } # 128-bit random
ng() { { ug;ug;ug;ug; }| pigz -11cp8 | dd bs=32 skip=1 count=2 status=none; }

A more sophisticated implementation cuts the obvious redundancy in the sources
and ensures enough data for terminating the task at the required length:

cg() { n=${1:-6}; while let n--; do rafgen5sum 4 | cut -zc-32; done |\
    pigz -9cp8 | tail -c130 | head -c 128; }

rafrand_tiny() {
    n=64; cg | while let n--; do dd bs=1 skip=1 count=1 status=none; done
}

The changes are: the fixed length, and collecting the last part of the
zip-data.

### information density ########################################################

The "pigz -11" is peculiar because it is one of the most extreme lossless
compressions easily available. Thus it is strong enough to provide a reasonable
quantisation of the information contained in a chunk of data, because it
removes redundancy. It adds a file header, but its size is fixed (10 bytes) and
thus irrelevant.
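A quick way to see this quantisation at work (a side sketch, not part of the
original test set; the 1000-byte size is an arbitrary choice): highly redundant
input shrinks to little more than the fixed overhead, while already-random
input does not shrink at all.

head -c 1000 /dev/zero    | pigz -11c | wc -c   # a few tens of bytes: almost no information
head -c 1000 /dev/urandom | pigz -11c | wc -c   # about 1000 bytes plus overhead: incompressible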
tg () { n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1; done | $FUNC; }

info: { tg;tg;tg;tg;tg; } | cut -d' ' -f1 | pigz -11cp8 | wc -c; # -10b
32ch: { tg;tg;tg;tg;tg; } | cut -c-32 | pigz -11cp8 | wc -c; # -10b
size: { tg;tg;tg;tg;tg; } | wc -c

The functions above are made to check the impact of the hashing function, as
reported in the table here below:

|      | file | cat  | md5  | sha1 | sha256 | sha512 |
| byte |      |      |  16  |  20  |   32   |   64   |
| full |  10  | +210 | +122 | +144 |  +205  |  +367  |
| 32ch |  10  | n/a  | +122 | +122 |  +122  |  +122  |
| size | n/a  | 1752 |  180 |  220 |   340  |   660  |
| cr%  | n/a  |  12% |  68% |  65% |   60%  |   56%  |
| ir%  | n/a  |  n/a |   1  |  85% |   59%  |   33%  |

The hash function does not matter, provided that it is good enough as a hash
function. The hash function matters for avoiding collisions in forensic
signatures, but in terms of diffusion it has no impact. The length of the
digest has an impact on the size of the output, but taking this as a general
principle we would just use the clear text. The numbers in the table above show
that there is no advantage in using a longer hash, because in any case the
total randomness injected remains the same, just diluted. Among these hash
functions md5 is the simplest, sha256 has cr:ci = 1 (its cr% and ir% in the
table above nearly match).

hg () { n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1; done; }
h5 () { { hg;hg;hg;hg; } | sha512sum | cut -d' ' -f1; }
cg () { { h5;h5; } | pigz -9cp8 | tail -c130 | head -c 128; }

With the set of functions above, tuned for sha512, the rafrand_tiny() is 2x
faster than using md5sum (ca. 2.5x before its md5sum 85% optimisation).

hg () { echo | dd bs=32 count=32 2>&1; }
h6 () { { hg; hg; } & { hg; hg; } & { hg; hg; } }
h2 () { { h6 & h6 & h6; } | sha512sum -b | cut -d' ' -f1; }
cg () { { h2 & h2; } | pigz -9c | tail -c130 | head -c128; }

However, the winner is sha512 because it scales better on an 8-core CPU due to
a multi-thread approach. In fact, the algorithm is newer and heavier, but its
implementation is also better suited for modern CPUs. Also, the sha256
alternative is fine:

h2 () { { hg & hg; } | sha256sum | cut -d' ' -f1; }
cg () { { h2 & h2 & h2 & h2; } | pigz -9c -J4 -p auto | tail -c130 | head -c 128; }

This variation, which leverages sha256, has cr:ci = 1 and with 8-core
parallelism it is about 4x faster than the first stable version based on
md5sum.

hg() { printf "%05d" $RANDOM; }

The function above is the reference to compare with, because $RANDOM is the
default source of pseudo-random numbers. If the approach leveraging jitter is
faster and/or provides better quality "entropy", the reference test is passed:
!kernel but shell!

### randomness density #########################################################

Stronger cg() variance with 2x the randomness density (rnd/bit), even if the
total number of "dd" instances is about 10% less (6 x 3 x 3 < 5 x 4 x 2), but
the concurrency among the instances rises from 8 to 18, which clogs even an
8-core CPU, increasing the jitter amplitude and thus the randomness (as per
unpredictability).

hc() { pigz -cp$(nproc) "$@" | tail -c+16 | head -c-8; } # run.33
hs() { sha512sum | xxd -r -p; }
hg() { { time echo | dd bs=1 count=1k; } 2>&1; }
h4() { { hg & hg & hg & hg; } | hs; }
h3() { { h4 & h4 & h4; } | hc -1; } # h3( i:192 --> hc(-c+24) --> o:192 )
cg() { { h3 & h3; } | hc -9 | hs; }

Then "time" has been included in hg() in order to add at least 3 more digits of
randomness. By a raw estimation, a single hg() call provides 26 bits of
randomness, times 24 calls. Which is a particularly good balance because, with
N calls of N bits each: N² >= 512 --> N = 23.
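A quick arithmetic check of the estimation above, using only the numbers
already stated in the text (awk is assumed just for the square root):

echo $(( 24 * 26 ))                         # 624 bits collected, above the 512-bit target
awk 'BEGIN { printf "%.1f\n", sqrt(512) }'  # ~22.6, which rounds up to N = 23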
In practice, out of 165 bytes only 128 are selected and then trimmed down to 64
(1:1). In this scenario 165 x 8 = 1320 and 24 x 26 = 624, but because of the
zip table the first 32 bytes (instead of 10, the file header) provide a lower
randomness density. A raw estimation indicates a need of randomness between 600
and 630 bits in total. The main principle behind this is that hash and zip give
their best at diffusing and levelling the values when they can work in their
sweet spot of 1:2, aka cr% and ci% around 50%.

-- RUN 28 ----------------------------------------------------------------------

let nr++; echo run.$nr
while sleep 0.01; do cg >> run.$nr; done &
while sleep 60; do ent run.$nr; done

Evolution of the statistics from 62K to 360K file size using the "ent" command:

File size     : 61696       --> 359552      byte
Entropy       : 7.996529    --> 7.999392    bits per byte
\missing gap  : -0.04339%   --> -0.00760%
Chi square    : 296.75      --> 303.19
\dev. freq.   : 3.71%       --> 2.06%
Average       : 127.5678    --> 127.4090
\Err. on avg. : 0.053%      --> 0.071%
Monte Carlo Pi: 3.175646761 --> 3.143662912
\Err. on Pi   : 1.08%       --> 0.07%
Correlation   : -0.006375   --> -0.001009   (0: 100% uncorrelated)

### rationale 2nd part #########################################################

Almost all security by ciphering is based on the principle that TRUE security
relies on fundamental uncertainty. So, supposing we have a source of
fundamental uncertainty, we will not be able to find out whether it is a real
fundamental uncertainty or a tricking system that catches our attempts and
shuffles the answers. There is no way out of this dilemma, apparently. Unless a
change of paradigm kicks in.

Uncertainty is just ONE way to achieve unpredictability. But in reality,
classic deterministic systems can evolve into a state in which unpredictability
is a fundamental trait. The theory of chaos and the theory of constrained
systems control are complementary, NOT mutually exclusive. Keeping a system (or
bringing a system, for a little while) into an unpredictability zone is
possible, and it is possible to verify within a useful timeframe / delay that
it is working in that zone.

A classic example is the 7 hits of a ball in a billiard game. If an attacker
can predict 7 moves but NOT the 8th and beyond, then the attacker is done.
Because this means that however s/he manages to hook the system, the system
will shortly get out of predictability, despite any attempt to keep it under
observation. It means that by the time a system is hooked, it is already out of
the observation scope and the attacker is just looking at something in the
past.

A move in chess is a letter, a number, another letter and another number
(plainly encodable in 12 bits). Let's say that 512 bits (of unpredictability)
is not an entire game but enough to check this assumption (43 chess moves vs
the billiard 7-cushion). Failing the check with 64 bytes means that 128 bytes
enter into the scene. It is useless to try on the high ground of 128 bytes: if
it works, it will work at 32 or 64 bytes. That's my bold assumption, and it is
not so bold considering what I wrote here.

The history of physics is full of classic systems whose unpredictability runs
away. So, I am not inventing anything, just repeating the mistakes of the past
to create the conditions that people usually avoid, like unpredictability.
Curiously, while we -- as humans -- are avoiding facing unpredictability, at
the same time we are seeking uncertainty for safety.
This is a clear psychological pathema: we run away from our fears to embrace
fundamental uncertainty as our comfort zone, and it is an illusion,
orchestrated or not.

### the masterchef cuisine #####################################################

Another wrong belief is that structured information fed into a deterministic
algorithm will create structure in the output or, at least, that the output
will continue to carry on some characteristics of the original structure.
Boiling a fish tank will provide a fish soup; reversing this process isn't
feasible. Blending and mixing fruits and milk provides a milk-fruit smoothie;
reversing this process isn't feasible. However, in IT the backdoor question is:
what if, by analysing the output, I could discover the "recipe" and the
"ingredients", like how the blender works or what the fish tank population was?
Which is the same as asking how well the blender and the mixer are working. Is
the hash good? Are the zippers good?

We know the smoothie by tasting it, and the taste of the smoothie can be
replicated, for sure: it is deterministic. What is not deterministic is the
"information collected" on the surface of the boiling bubbles, the glitches in
the sound of the mixer blades cutting the fruit. That information has been
collected but at the same time lost, because it is mixed into the soup and the
smoothie. Although someone can argue that s/he can clearly determine it is a
fish soup rather than a fruit milkshake, it doesn't help. What remains of the
previous input information structure is enough to clear some uncertainty, but
it still does not provide a useful predictability. The watermark of the
compression, even if it survives the 1-took-1-lost selection, isn't
predictable. Saying "this is pigz compression output at -p8 -9c" sounds magical
but gives nothing.

In the same manner, we know how ciphers work: they keep the data secret if the
secret of the seed is kept. Then, military-grade "white noise" is achieved
using good and quick ciphers: cipher the blender output and forget the key. Is
violating a safe to find a fish soup or a milkshake recipe worth the effort? Or
it can be done the opposite way: cipher the input before blending and mixing.
When the blender and the mixer are working well enough to compete with
/dev/random, there is a good reason to think that adding another NP-layer would
easily reach /dev/urandom and beyond.

Cryptography's need for randomness is a self-solving problem. Create an
algorithm that, while it ciphers or deciphers (risky, because the input can be
manipulated by the attacker), also creates unpredictable randomness at no extra
cost, because every work (L) produces entropy, the ciphering as well: just
collect it. In fact, observing the 1-took-1-lost process, it runs "dd" 64
times, which creates on the stderr the same kind of "entropy" we started from
initially. It is a self-sustaining unpredictability engine: it makes the wheels
run and keeps the cabin "warm" at the same time, with the same fuel. Is this
masterchef, not IT anymore? Right, because of the day in which the word
"entropy" gets into the discourse: who said entropy, shame on them!
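A minimal sketch of the "cipher the blender output and forget the key" idea
above, assuming openssl is available (the function name whiten is made up, and
any fast stream/CTR cipher would do): the throwaway key and IV are derived from
the jitter mixer itself and discarded right after use.

whiten() {
    key=$(cg | sha256sum | cut -d' ' -f1)   # 64 hex chars = 256-bit throwaway key
    iv=$(cg | md5sum | cut -d' ' -f1)       # 32 hex chars = 128-bit throwaway IV
    openssl enc -aes-256-ctr -K "$key" -iv "$iv"
    unset key iv                            # "forget the key"
}

# usage sketch: cg | whiten > /tmp/whitened.sample   # then: ent /tmp/whitened.sample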