Sunday, January 12, 2014

Derivation of Fractals and Self-Similarity from Information Theory

In a paper published a couple of years ago (Pilger, 2012), I describe the application of a simple principle, transformed into a distinctive abstract object, to an optimization problem (within the plate tectonics paradigm): simultaneous reconstruction of lithospheric plates for a range of ages from marine geophysical data . It is rare that the relation of the principle, maximum entropy, with a particular transformation, power-series fractals, is recognized, since Pastor-Satorras and Wagensberg derived it. I'm unaware of any other application of fractal forms to optimization problems analogous to the paper. The following derivation is taken from the 2012 paper, with slight modification, in hopes that it might prove useful in other fields, not merely the earth sciences, but beyond. I'm investigating  applications in a variety of other areas, from plate tectonics, to petroleum geology, and, oddly enough, the arts.
Pastor-Satorras and Wagensberg (1998) showed that fractal self-similarity could be derived utilizing Shannon’s (1948) information theory and Jaynes’ (1957) maximum entropy formalism. For convenience, straight-forward derivations of both maximum entropy and fractal distributions are provided here.
Mandelbrot (1982) provided a basic definition of the mathematical formulation of fractals: 
p = k d-D 
For statistical fractals the formula is 
p @ k d-D.  
In either case, p is a measure of a self similar structure, such as length, width, mass, or even probability, over a range of scales, d; k is a constant, in part dependent on the units of the measure, if any; and D is the fractal dimension.
Given a measure space – for the purposes of this study, a two-dimensional map – the map area can be divided into successive subareas. If the dimensions of the map are 1 by 1 unit, divide the map into square cells of dimension 1/k by 1/k, in which, for convenience, k=1, 2, 4, 16…N {kj = [1, for j=1; 1/( 2kj-1) for j>1]}.  
1. The amount of information, I, supplied for each successive division j, is equal to ln(1/kj2) = ln (dj) (see Shannon, 1948; or Kanasewich, 1974). The average (expected) information
<I> = Sj=1,N pj ln(1/kj2) = Sj=1,N pj ln(dj)     (eq. 1)
in which pj is the probability of each division. (Note that dj could be equal to either the inverse cell width, kj, or area, kj2, without affecting the derivation.)
2. In the presence of inadequate information, Jaynes (1957) proposed (using Grandy’s, 1992, paraphrase here): “The optimal probability assignment describing that situation is the one which maximizes the information-theoretic entropy subject to constraints imposed by the information that is available.” What is the information-theoretic entropy? From Grandy’s derivation:
An experiment is repeated n times with potentially m results. Thus there are potentially mn outcomes of the experiments. Each outcome produces a set of sample numbers nj and frequencies of occurrence of those sample numbers, fj = nj/n, 1 j m.
As a result, the number of outcomes that yield a particular set of frequences fj is given by the multiplicity factor:
W = n!/[(nf1)! . .. (nfm)!]     (2)
The set of frequencies (fj) that can be realized in the most ways is that which maximizes the multiplicity, W.
It is convenient to note that maximizing W is equivalent to maximizing ln (W). Thus
ln(W) = ln {n!/[(nf1)! . .. (nfm)!]} = ln (n!) – S j = 1,m ln (nfj)!     (3)
Stirling’s approximation, ln(q!) @ q ln(q) – q , is applicable for large n:
ln(W) @ n ln(n) – n – S j = 1,m [n fj ln(fj) – nfj ln (n) – nfj ]     (4)
and, then
ln(W)/n @ ln(n) – 1 – S j = 1,m fj ln(nfj) – ln n S j = 1,m fj - S j = 1,m fj     (5)
As Sj = 1,m f= 1, therefore,
ln(W)/n @S j = 1,m fj ln(fj)     (6)
If the set of frequencies accurately represent the probabilities (pj , j = 1,m) of the phenomenon being investigated, then,
S = ln(W)/n @Sj = 1,m pj ln(pj)     (7)
which is Shannon information entropy.         
Shannon information entropy has the following properties (Jaynes, 1957): (1) It is a continuous function of the probabilities, (2) if all probabilities are equal, it is a monotonic function of n (the number of probabilities), (3) grouping of events produces the same value as separate events, and (4) S(m) + S(n) = S(mn), which can only be satisfied by a form such as S(n) = k ln (n).
3. In the Pastor-Satorras and Wagensberg formalism, using the Lagrangian variational principle, information entropy (eq. A.7) is maximized subject to the constraining information (eq. A.1) and normalization of the probabilities (Sj=1,N pj = 1) with Lagrange multipliers a and b:
F = – Sj = 1,N pj ln(pj) + a [- Sj=1,N pj ln(dj)] + b [1 - Sj=1,N pj]     (8)
Maximizing F for each pi:
F/¶pi = 0 = - ln(pi) – 1 – a ln(di) – b     (9)
ln(pi) = – 1 + a  ln(di) + b     (10)
exp[ln(pi)] = pi = exp (-b -1) di-a     (11)
pi = g di -a     (12)
in which a and g = (-b -1) are constants.
For such a power series, eq. 12, the probabilities exhibit self-similarity across scales, a,, and a, then, is equivalent to Mandelbrot’s (1975, 1982) fractal dimension, D, above.
The novelty of the Pastor-Satorras and Wagensberg (1998) derivation is the incorporation of the constraint of average information, = Sj=1,N pj ln(d), into the maximum entropy formalism (eq. A.8). Conventional applications of the formalism utilize statistical measures, such as the mean or variance from experimental data, as, e.g., = Sj=1,N pj f(xj), rather than information.
Fractal structure can be interpreted as the manifestation of iterative processes which propagate information kernels across a range of scales, in such a manner that they achieve the most probable outcome. And, the fractal structure, the information, is thereby apparent across every scale over which the process operates.
(Links as of January 13, 2014; if link is broken, it may be necessary to search for it.)