(c) 2026, Roberto A. Foglietta, CC BY-NC-ND 4.0

### rationale ##################################################################

Collecting entropy is a metaphor, because only information can be "collected",
ergo stored in a precise notation like the number 123 or a string of bits. The
aim of "collecting entropy" is to obtain a sequence of numbers (information)
which cannot be anticipated. However, as you can imagine, because that series
of numbers is stored somewhere, it is possible to anticipate it, because it is
known. The fundamental point is about WHO can access that information and HOW
that information is structured.

The structure of the information is fundamental to avoid (or minimise) the risk
that someone can guess the sequence. For example, we can use the Fibonacci
series and decide to start from a specific number of the original series
instead of 1. Unfortunately, anyone who can see or guess a single number of the
series knows all the rest of it. We can use the Fibonacci series from N with
step 2, or step N-4, or whatever: it still has a structure. Structured
information can be guessed, it is just a matter of how difficult it is to make
that guess (or, given a unit of computational power per second, how many
seconds it will take). This is related to how complicated the relationship is
between a number and the one after it, e.g. Fibonacci(N, 4).

When this relationship is "random", and in particular "white noise random", it
means that every next 8 bits (or 16 bits, or 32 bits) of information are
totally uncorrelated with anything before and after, thus the chance to guess
the next 8-bit string is 1:2^8 (no structure can push it lower). This is
randomness, white noise.

The 'drama' is about HOW to generate a random sequence of numbers from a system
that has been designed to be predictable and deterministic (and also error
auto-correcting, like ECC memory, for example). The "easiest" answer is having
a very sensitive thermometer (e.g. 0.00001°C precision) and collecting, each dT
of time, the least significant digit of the temperature measurement. Each dT is
a fixed and repetitive timing, the stated precision is an assumption, and the
wires are EM emitters. Which in practice is easier to sniff than we might
suppose, because extracting "entropy" from heat requires physical work (L) or a
Maxwell's demon.

Under this perspective, what really matters is the predictability from the
known (so, how hard it is to guess from what is known or plausibly known) and
how difficult it is to access that information (storage safety). Is it stored
inside the CPU registers? The chip is the safety boundary. In memory? The EM
S/N ratio of the conductive tracks between CPU and RAM is the safety boundary.
In kernel space or in user-land? For a local attacker the safety boundary is
the root privilege against local escalation. For a remote attacker, wait...

Here is the MAIN point. This "obscurity by randomness" is not used to protect
the root password secret only, but mainly to protect the ciphering of the
communications. Because a crypto algorithm can be extremely hard to break,
unless guessing the seed of the random number generator is easy instead. The
idea behind crypto-safety is that whoever knows the key (a string of bits) can
de/cipher the message (Enigma, WW2) despite the de/ciphering algorithm being
public (or unveiled, because leaked).

Therefore, about randomness, everything is about HOW hard it is to predict or
sniff the sequence of numbers from what we already know or can reasonably
suppose and check in a very short time, short enough to be useful: 1s or
1E100s?
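As a side illustration of the Fibonacci example above (a minimal sketch, not
part of the original text; the names fib_stream and fib_recover are made up for
the purpose): structured information can be fully reconstructed from a single
observed value, which is exactly what a random source must not allow.

fib_stream() {  # print $2 numbers of the Fibonacci series, after skipping $1
    a=1; b=1; s=${1:-10}; n=${2:-5}
    while [ $s -gt 0 ]; do c=$((a+b)); a=$b; b=$c; s=$((s-1)); done
    while [ $n -gt 0 ]; do c=$((a+b)); a=$b; b=$c; echo $c; n=$((n-1)); done
}

fib_recover() { # given ONE observed value, regenerate the next 5 of the series
    a=1; b=1
    while [ $b -lt $1 ]; do c=$((a+b)); a=$b; b=$c; done
    [ $b -ne $1 ] && return 1
    n=5; while [ $n -gt 0 ]; do c=$((a+b)); a=$b; b=$c; echo $c; n=$((n-1)); done
}

fib_stream 20 1                   # the attacker observes just this one value...
fib_recover $(fib_stream 20 1)    # ...and regenerates the following ones anyway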
In modern systems the CPU is multi-core, multi-thread, multi-branch, and all of
these "features" sit behind multi-layered caches. Many tasks are running even
if the whole userland system is "doing nothing", and task preemption can kick
in at any time, creating latencies in HOW long a deterministic task takes.
Because complex systems often require a very low latency (e.g. industry and
CNC), these latencies are not wildly unpredictable, and usually they are not
large but small. Fortunately, the latency is not ALWAYS the same down to the
nanosecond, and the variation of a latency is its jitter, e.g. a latency in
[ 80, 120 ] ns has a maximum jitter of 20 (or 40, peak-to-peak) ns. Usually the
distribution is a bell curve, so dt = 1 ns (actual jitter) is more frequent
than dt = 10 ns.

The main idea could be that within [ 95, 105 ] ns the jitter distribution is
almost flat, so every value in that range is almost equally probable to pop up.
When a value is read outside that interval, it is discarded and thus not
stored. At the next turn it might be discarded again, and a continuous run of
rejections which stalls the sequence has a non-zero probability of happening.
But dT % 10 is always flat when the average is 100 and the jitter is 40
(T >> dT >> dT%10). A small remainder taken on a number which is not white
noise but Cauchy^2 distributed is theoretically totally flat in terms of
distribution, and also in practice when T >> dT >> dT%10 are separated by 4x or
10x ratios, like the LSB of a 32-bit value.

https://github.com/robang74/roberto-a-foglietta/blob/main/data/Tesi_Foglietta-051a.pdf (jitter)

Two functions are great at multiplying "entropy": we cannot create entropy with
a deterministic system, but we can multiply it, because we can spread that
little bit of randomness over a larger set of bits with a hash and a zip
algorithm. The hash "diffuses" in a way that is fast to compute but infeasible
to recover, and the zip algorithm provides a nearly flat distribution of values
which are "hard" to guess when the first part of the output (the table) is lost
(until the next chunk). For example, if 128 KB is the input chunk and 4x is the
maximum compression ratio expected in a single chunk, the output chunk will be
32 KB, thus a 64-byte pick has a great chance to fall somewhere the
unpredictability is almost randomly flat.

The zip algorithm provides (1) a nearly flat distribution of values which is
(2) hard to guess without knowing the encoding table, by definition of "zip"
itself. A decent zip algorithm should tear down the redundancy of the input,
granting that the output value distribution is nearly flat (file header apart,
etc.), as per (1). A zip algorithm whose output can be predicted once the
encoding table is lost means that the zipped part is not flat enough and the
zip is very poor in size shrinking or fidelity (e.g. it is a copy, not a
zipping, or a string of zeros).

A process PT that creates a relatively large jittered latency and clogs the CPU
multi-threading management for a little while, its execution being scattered
among internal computational pipelines and caches, interrupted and pre-empted,
is a great "random" generator in terms of nanosecond fluctuations. Thus the PT
start and stop times in nanoseconds are not related in the least significant
digits of the stop. Expressing these two numbers in digits 0-9, the string
$stop$start has the LSB randomness in the middle of the string, which will look
something like: nnXnnx.
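A minimal sketch of the measurement idea above, assuming a GNU userland (date
with %N) and a POSIX-like shell; the function name jitter_lsb and the fixed dd
workload are illustrative choices, not taken from the original code:

jitter_lsb() {
    t0=$(date +%s%N)                                         # start, nanoseconds
    dd if=/dev/zero of=/dev/null bs=1k count=64 2>/dev/null   # fixed deterministic workload
    t1=$(date +%s%N)                                          # stop, nanoseconds
    echo $(( (t1 - t0) % 1000 ))                              # keep only the LSB digits, where the jitter lives
}

# distribution check: for i in $(seq 200); do jitter_lsb; done | sort -n | uniq -c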
The start can be guessed apart from a few nanoseconds, but the relatively large
jitter causes the LSB part of the stop to have many more digits that can be
considered strongly random in their nature. Once a string with the random
nature of nnXnnx is hashed, the result is XXXXXX, because nobody can say
anymore (once the start/stop values are lost) how the randomness has been
diffused along the checksum string: it is everywhere, making each bit totally
unpredictable in the same manner (multiplication by diffusion).

If a process like the one described here were operated by the CPU microcode, it
would be safely bounded by the CPU itself; modern CPUs can be seen as very good
random generators because of the unpredictability of their jitter in executing
tasks. How good is this kind of random generator, based on CPU architecture and
microcode, at providing unbreakable cryptography, assuming the whole process is
done at the state of the art? Whoever designs the chip (or manufactures it) and
delivers the microcode can definitely have access to every ciphered
communication, no matter what.

### original function (2023) ###################################################

{
    n=$((33 + ${RANDOM:-15}%32))
    dd if=/dev/random bs=$n count=1 2>&1
    cat /proc/cmdline /proc/*stat /init*
} | pigz -$((1 + n%9))c > /dev/urandom &

### untested functions (2026) ##################################################

rafrand_zro() {
    echo $((33 + ${1:-$RANDOM}%32))
}

rafrand_ubf() {
    cmd=""; if [ -n "$1" ]; then cmd="chrt -$1"; fi; shift
    ionice -c3 $cmd stdbuf -o0 -e0 "$@"
}

rafrand_one() {
    k=$(rafrand_zro ${1:-})
    echo ${1:-} &
    (
        set -x
        rafrand_ubf "" dd if=/dev/random bs=$k count=1 &
        rafrand_ubf "i 0" tail -n$((k*16)) /var/log/syslog &
        {
            for f in $(ls -1 /proc/cmdline /proc/*stat /init*)
            do
                rafrand_ubf "b 0" cat $f &
            done &
        }
    ) 2>&1
}

rafrand_two() {
    n=$(rafrand_zro ${1:-})
    m=$((9 - ${2:-$RANDOM}%9))
    rafrand_one ${3:-} |{\
        pigz -${m}cp$((4 * $(nproc))) -b32 |\
        dd bs=1 skip=$n count=64
    } 2>/dev/null
}

rafrand() {
    a=$(date +%N)
    c=$(rafrand_two | md5sum)
    b=$(date +%N)
    c=$(echo $b$a$c | md5sum | tr [0a-f-\ ] [1-9])
    a=$(echo $c | cut -c1-5)
    b=$(echo $c | cut -c6-10)
    c=$(echo $c | cut -c11-15)
    rafrand_two $c $a $b >/dev/urandom
}

### apparent issues #############################################################

1. The "tr" Command Entropy Bias

   It is theoretically correct to raise a "bias issue", but that string is
   split into 3 parts and those parts are used as seeds for other strong-level
   randomness production tasks. While the randomness of the seed seems
   essential, the values used in that process are relatively small compared to
   the 5 digits (%32 to %3). Hence, it does not matter, because the remainders
   are what really matters.

2. The "dd" Skipping on the Compressed Stream

   The variable "n" controls the skip into the compressed data and creates a
   dependency: if the compressed output is smaller than $n (unlikely here but
   possible in other contexts), the command fails / outputs nothing. Which is
   correct, in theory. However, the main issue lies in "cat"-ing files that can
   be empty or non-existent, which is the reason why 2>&1 has been used to "at
   least" collect something, and why "echo ${1:-}" has been added for a little
   more data: cat might partially or totally fail, while /dev/random is not
   supposed to provide a lot of data but just a bit of extra initial
   randomness. In any case, the shell implementation is just a PoC, because the
   C language is the way to go.
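As a way to start addressing the issues above, here is a minimal sanity-check
sketch (not part of the original set of functions; the name rafrand_check and
the use of the "ent" tool are assumptions): it verifies the size emitted by
rafrand_two and accumulates a small sample for byte-level statistics.

rafrand_check() {
    out=$(rafrand_two | wc -c)
    echo "rafrand_two emitted $out bytes (expected 64)"
    # accumulate a few KB of output, then inspect it with "ent" if available
    # (any byte-level statistics tool would do)
    for i in $(seq 64); do rafrand_two; done > /tmp/rafrand.sample 2>/dev/null
    command -v ent >/dev/null && ent /tmp/rafrand.sample
}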
Even in the laziest option, the shell implementation should be contextualised
on the target system which, for example, might not have "pigz", or might not
have a multi-threaded processor but a "dumb" 32-bit micro which provides no
significant jitter, so "pigz" would be almost useless for this goal on that
system.

### updates ####################################################################

rafrand_one: it spawns many processes and it uses "2>&1" and "set -x" to mix,
by an unpredictable sequence of task switches, different data flows which can
potentially be interleaved ("stdbuf") into a single output stream, not only
because of task switching but also because of output buffering suppression (or
changes).

rafrand_ubf: another way to introduce task-switching uncertainty, setting
scheduling policies and priorities.

Both should be tuned, or rely on templates specific to the target architecture
and customisable for the specific system. This approach has a good chance to
"collect entropy" also from simpler CPUs. By the way, leveraging a "complex"
scheduler is an assumption, compared to just having a simple rotating-tasks
scheduler in the OS. However, rotating tasks may introduce some glitches which
bring in those "stochastic" features we would like to have. It is worth noting
that, because just a small portion of the zipped data stream is taken into
account, it is hard to believe that this would have a practical effect unless
hashed chunks were also included into that data-flow mixing process.

### tiny random generator ######################################################

The rafrand_tiny() leverages rafgen5sum(), which produces random output by
multiplying the jitter of I/O tasks such as dd, whose output looks like this:

echo | dd
0+1 records in
0+1 records out
1 byte copied, 4.5698e-05 s, 21.9 kB/s

There is little randomness here, because the execution time varies between
4e-05 s and 8e-05 s in my tests, which is about 4.5 digits, circa 5E4
combinations: less than 16 bits of "entropy", because everything else is
correlated with those 4.5 digits of information.

rafgen5sum() {
    n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1 | md5sum; done
}

Since rafgen5sum() runs that dd five times, it creates a data stream with 64+
bits of unstructured information (randomness), which is enough to fill the
128-bit md5sum.

ug() { rafgen5sum; } # 128-bit random
ng() { { ug;ug;ug;ug; }| pigz -11cp8 | dd bs=32 skip=1 count=2 status=none; }

A more sophisticated implementation cuts the obvious redundancy in the sources
and ensures enough data for terminating the task at the required length:

cg() { n=${1:-6}; while let n--; do rafgen5sum 4 | cut -zc-32; done |\
    pigz -9cp8 | tail -c130 | head -c 128; }

rafrand_tiny() {
    n=64; cg | while let n--; do dd bs=1 skip=1 count=1 status=none; done
}

The changes are: the fixed length, and collecting the last part of the
zip-data.

### information density ########################################################

The "pigz -11" is peculiar because it is one of the most extreme lossless
compressions easily available. Thus it is strong enough to provide a reasonable
quantisation of the information contained in a chunk of data, because it
removes redundancy. It adds a file header, but its size is fixed (10 bytes) and
thus irrelevant.
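A quick way to see this quantisation at work (a side sketch, not part of the
original test set; the 1000-byte size is an arbitrary choice): highly redundant
input shrinks to little more than the fixed overhead, while already-random
input does not shrink at all.

head -c 1000 /dev/zero    | pigz -11c | wc -c   # a few tens of bytes: almost no information
head -c 1000 /dev/urandom | pigz -11c | wc -c   # about 1000 bytes plus overhead: incompressible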
tg () { n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1; done | $FUNC; }

info: { tg;tg;tg;tg;tg; } | cut -d' ' -f1 | pigz -11cp8 | wc -c; # -10b
32ch: { tg;tg;tg;tg;tg; } | cut -c-32 | pigz -11cp8 | wc -c; # -10b
size: { tg;tg;tg;tg;tg; } | wc -c

The functions above are made to check the impact of the hashing function, as
reported in the table here below:

|      | file | cat  | md5  | sha1 | sha256 | sha512 |
| byte |      |      |  16  |  20  |   32   |   64   |
| full |  10  | +210 | +122 | +144 |  +205  |  +367  |
| 32ch |  10  | n/a  | +122 | +122 |  +122  |  +122  |
| size | n/a  | 1752 |  180 |  220 |   340  |   660  |
| cr%  | n/a  |  12% |  68% |  65% |   60%  |   56%  |
| ir%  | n/a  |  n/a |   1  |  85% |   59%  |   33%  |

The hash function does not matter, provided that it is good enough as a hash
function. The hash function matters for avoiding collisions in forensic
signatures, but in terms of diffusion it has no impact. The length of the
digest has an impact on the size of the output, but taking this as a general
principle we would just use the clear text. The numbers in the table above show
that there is no advantage in using a longer hash, because in any case the
total randomness injected remains the same, just diluted. Among these hash
functions md5 is the simplest, sha256 has cr:ci = 1 (its cr% and ir% in the
table above nearly match).

hg () { n=${1:-5}; echo | while let n--; do dd bs=1 count=1k 2>&1; done; }
h5 () { { hg;hg;hg;hg; } | sha512sum | cut -d' ' -f1; }
cg () { { h5;h5; } | pigz -9cp8 | tail -c130 | head -c 128; }

With the set of functions above, tuned for sha512, the rafrand_tiny() is 2x
faster than using md5sum (ca. 2.5x before its md5sum 85% optimisation).

hg () { echo | dd bs=32 count=32 2>&1; }
h6 () { { hg; hg; } & { hg; hg; } & { hg; hg; } }
h2 () { { h6 & h6 & h6; } | sha512sum -b | cut -d' ' -f1; }
cg () { { h2 & h2; } | pigz -9c | tail -c130 | head -c128; }

However, the winner is sha512 because it scales better on an 8-core CPU due to
a multi-thread approach. In fact, the algorithm is newer and heavier, but its
implementation is also better suited for modern CPUs. Also, the sha256
alternative is fine:

h2 () { { hg & hg; } | sha256sum | cut -d' ' -f1; }
cg () { { h2 & h2 & h2 & h2; } | pigz -9c -J4 -p auto | tail -c130 | head -c 128; }

This variation, which leverages sha256, has cr:ci = 1 and with 8-core
parallelism it is about 4x faster than the first stable version based on
md5sum.

hg() { printf "%05d" $RANDOM; }

The function above is the reference to compare with, because $RANDOM is the
default source of pseudo-random numbers. If the approach leveraging jitter is
faster and/or provides better quality "entropy", the reference test is passed:
!kernel but shell!

### randomness density #########################################################

Stronger cg() variance with 2x the randomness density (rnd/bit), even if the
total number of "dd" instances is about 10% less (6 x 3 x 3 < 5 x 4 x 2), but
the concurrency among the instances rises from 8 to 18, which clogs even an
8-core CPU, increasing the jitter amplitude and thus the randomness (as per
unpredictability).

hc() { pigz -cp$(nproc) "$@" | tail -c+16 | head -c-8; } # run.33
hs() { sha512sum | xxd -r -p; }
hg() { { time echo | dd bs=1 count=1k; } 2>&1; }
h4() { { hg & hg & hg & hg; } | hs; }
h3() { { h4 & h4 & h4; } | hc -1; } # h3( i:192 --> hc(-c+24) --> o:192 )
cg() { { h3 & h3; } | hc -9 | hs; }

Then "time" has been included in hg() in order to add at least 3 more digits of
randomness. By a raw estimation, a single hg() call provides 26 bits of
randomness, times 24 calls. Which is a particularly good balance because, with
N calls of N bits each: N² >= 512 --> N = 23.
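A quick arithmetic check of the estimation above, using only the numbers
already stated in the text (awk is assumed just for the square root):

echo $(( 24 * 26 ))                         # 624 bits collected, above the 512-bit target
awk 'BEGIN { printf "%.1f\n", sqrt(512) }'  # ~22.6, which rounds up to N = 23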
In practice, out of 165 bytes only 128 are selected and then trimmed down to 64
(1:1). In this scenario 165 x 8 = 1320 and 24 x 26 = 624, but because of the
zip table the first 32 bytes (instead of 10, the file header) provide a lower
randomness density. A raw estimation indicates a need of randomness between 600
and 630 bits in total. The main principle behind this is that hash and zip give
their best at diffusing and levelling the values when they can work in their
sweet spot of 1:2, aka cr% and ci% around 50%.

-- RUN 28 ----------------------------------------------------------------------

let nr++; echo run.$nr
while sleep 0.01; do cg >> run.$nr; done &
while sleep 60; do ent run.$nr; done

Evolution of the statistics from 62K to 360K file size using the "ent" command:

File size     : 61696       --> 359552      byte
Entropy       : 7.996529    --> 7.999392    bits per byte
\missing gap  : -0.04339%   --> -0.00760%
Chi square    : 296.75      --> 303.19
\dev. freq.   : 3.71%       --> 2.06%
Average       : 127.5678    --> 127.4090
\Err. on avg. : 0.053%      --> 0.071%
Monte Carlo Pi: 3.175646761 --> 3.143662912
\Err. on Pi   : 1.08%       --> 0.07%
Correlation   : -0.006375   --> -0.001009   (0: 100% uncorrelated)

### rationale 2nd part #########################################################

Almost all security by ciphering is based on the principle that TRUE security
relies on fundamental uncertainty. So, supposing we have a source of
fundamental uncertainty, we will not be able to find out whether it is a real
fundamental uncertainty or a tricking system that catches our attempts and
shuffles the answers. There is no way out of this dilemma, apparently. Unless a
change of paradigm kicks in.

Uncertainty is just ONE way to achieve unpredictability. But in reality,
classic deterministic systems can evolve into a state in which unpredictability
is a fundamental trait. The theory of chaos and the theory of constrained
systems control are complementary, NOT mutually exclusive. Keeping a system (or
bringing a system, for a little while) into an unpredictability zone is
possible, and it is possible to verify within a useful timeframe / delay that
it is working in that zone.

A classic example is the 7 hits of a ball in a billiard game. If an attacker
can predict 7 moves but NOT the 8th and beyond, then the attacker is done.
Because this means that however s/he manages to hook the system, the system
will shortly get out of predictability, despite any attempt to keep it under
observation. It means that by the time a system is hooked, it is already out of
the observation scope and the attacker is just looking at something in the
past.

A move in chess is a letter, a number, another letter and another number
(plainly encodable in 12 bits). Let's say that 512 bits (of unpredictability)
is not an entire game but enough to check this assumption (43 chess moves vs
the billiard 7-cushion). Failing the check with 64 bytes means that 128 bytes
enter into the scene. It is useless to try on the high ground of 128 bytes: if
it works, it will work at 32 or 64 bytes. That's my bold assumption, and it is
not so bold considering what I wrote here.

The history of physics is full of classic systems whose unpredictability runs
away. So, I am not inventing anything, just repeating the mistakes of the past
to create the conditions that people usually avoid, like unpredictability.
Curiously, while we -- as humans -- are avoiding facing unpredictability, at
the same time we are seeking uncertainty for safety.
This is a clear psychological pathema: we run away from our fears to embrace
fundamental uncertainty as our comfort zone, and it is an illusion,
orchestrated or not.

### the masterchef cuisine #####################################################

Another wrong belief is that structured information fed into a deterministic
algorithm will create structure in the output or, at least, that the output
will continue to carry on some characteristics of the original structure.
Boiling a fish tank will provide a fish soup; reversing this process isn't
feasible. Blending and mixing fruits and milk provides a milk-fruit smoothie;
reversing this process isn't feasible. However, in IT the backdoor question is:
what if, by analysing the output, I could discover the "recipe" and the
"ingredients", like how the blender works or what the fish tank population was?
Which is the same as asking how well the blender and the mixer are working. Is
the hash good? Are the zippers good?

We know the smoothie by tasting it, and the taste of the smoothie can be
replicated, for sure: it is deterministic. What is not deterministic is the
"information collected" on the surface of the boiling bubbles, the glitches in
the sound of the mixer blades cutting the fruit. That information has been
collected but at the same time lost, because it is mixed into the soup and the
smoothie. Although someone can argue that s/he can clearly determine it is a
fish soup rather than a fruit milkshake, it doesn't help. What remains of the
previous input information structure is enough to clear some uncertainty, but
it still does not provide a useful predictability. The watermark of the
compression, even if it survives the 1-took-1-lost selection, isn't
predictable. Saying "this is pigz compression output at -p8 -9c" sounds magical
but gives nothing.

In the same manner, we know how ciphers work: they keep the data secret if the
secret of the seed is kept. Then, military-grade "white noise" is achieved
using good and quick ciphers: cipher the blender output and forget the key. Is
violating a safe to find a fish soup or a milkshake recipe worth the effort? Or
it can be done the opposite way: cipher the input before blending and mixing.
When the blender and the mixer are working well enough to compete with
/dev/random, there is a good reason to think that adding another NP-layer would
easily reach /dev/urandom and beyond.

Cryptography's need for randomness is a self-solving problem. Create an
algorithm that, while it ciphers or deciphers (risky, because the input can be
manipulated by the attacker), also creates unpredictable randomness at no extra
cost, because every work (L) produces entropy, the ciphering as well: just
collect it. In fact, observing the 1-took-1-lost process, it runs "dd" 64
times, which creates on the stderr the same kind of "entropy" we started from
initially. It is a self-sustaining unpredictability engine: it makes the wheels
run and keeps the cabin "warm" at the same time, with the same fuel. Is this
masterchef, not IT anymore? Right, because of the day in which the word
"entropy" gets into the discourse: who said entropy, shame on them!
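A minimal sketch of the "cipher the blender output and forget the key" idea
above, assuming openssl is available (the function name whiten is made up, and
any fast stream/CTR cipher would do): the throwaway key and IV are derived from
the jitter mixer itself and discarded right after use.

whiten() {
    key=$(cg | sha256sum | cut -d' ' -f1)   # 64 hex chars = 256-bit throwaway key
    iv=$(cg | md5sum | cut -d' ' -f1)       # 32 hex chars = 128-bit throwaway IV
    openssl enc -aes-256-ctr -K "$key" -iv "$iv"
    unset key iv                            # "forget the key"
}

# usage sketch: cg | whiten > /tmp/whitened.sample   # then: ent /tmp/whitened.sample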