| Université Paris 6
Pierre et Marie Curie |
Université Paris 7
Denis Diderot
|
|
|
|
||
|
|
||
|
|
||
|
|
||
O. Catoni Data compression and adaptive histograms, International Conference on Foundations of Computational Mathematics in honor of Professor Steve Smale's 70th Birthday July 13-17 2000
It is distributed under the GNU Lesser General Public License.
tar xvzf gibbsHist.tar.gz. Then go to gibbsHist/src
and execute make. After this you should be ready to run the
tclsh8.0 scripts located in gibbsHist/data. You need
tclsh8.0, however adapting the scripts to another
language, such as bash, should be straightforward. You need also
gnuplot, to output the graphics. This software has been developped
under Suse Linux : I did not test it on other Unix platforms, although I do not
see why it should not work as well (provided the proper utilities, such as
gnumake and gnuplot, are installed).
gibbsHist, you will find the following subdirectories :
bin where the executable files testHist and
textHist are located.
testHist generates a sample from a histogram with a constant bin width.
By taking the bin width to be very small, it is possible to approximate
smooth distributions, such as mixtures of Gaussians. textHist generates
a sample from an ascii text file. The text file is divided into chunks of
four bytes, each one is represented as a real number in the unit interval.
The sample is drawn according to the uniform distribution on this set of
numbers. Then both testHist and textHist computes
a first histogram estimate based on simulated data, and a second histogram
estimate for simulated data composed with the cumulated distribution function
of the first estimate. The estimation scheme is described in greater details
in the preprint. The syntax of the two executable files
is :testHist DistDataFileName NumberofSamples ratio0 ratio1 ratio2 depth \ first_branching_rate second_branching_rate [LogFileStem|[-batch nb_of_trials]] textHist DataFileName NumberofSamples ratio0 ratio1 ratio2 \ depth first_branching_rate second_branching_rate refDepth [LogFileStem|[-batch nb_of_trials]]
DistDataFileName is the name of the input file describing
the distribution of the sample. Its format is explained in the API section below.
The ratios ratio0 ratio1 ratio2 describe the way the
sample is split respectively between the first and the second estimator and
between the weight estimate and the aggregation estimate for each estimator.
Branching rates are used to form the Gibbs aggregation rules, they could be safely set
to 0.5, as is pointed out in the preprint. The
depth parameter is the depth of the cell structure
used for histograms : the finer histogram has 2 to the power
depth bins. The parameter refDepth is the depth used to compute
the reference measure for the text experiment : the reference measure is
the empirical measure of the text from which the observed sample is bootstrapped,
projected on a histogram with 2 to the power refDepth
bins. This way of computing the reference measure is not completely satisfactory
from the mathematical point of view : it is approximately right if refDepth
is much larger than depth. LogFileStem is an optional
argument providing a stem for output file names. If it is omitted, no output
files are generated. The -batch option generates nb_of_trials
consecutive runs and prints statistics on the standard output.
src where the source files and the Makefile are located. To compile
the program, simply type `` make '';
lib contains the compiled library libhisto.a.
The library functions are listed
in the source header file precode.h;
doc contains a copy of this documentation, as well as licence
informations;
data contains examples of data files used to generate the simulations
shown in the paper, and four tclsh scripts :
testSession launches one instance of testHist, displays the result
with gnuplot and generates a postscript figure called graphic.eps, its syntax is
testSession DataFile nbOfSamples ratio depthwhere
DataFile is the name of the input file describing the true sample
distribution, ratio is the ratio used to split the sample between the
first and the second estimate, and depth is the depth of the cell
structure,
batchSession runs testHist several times without display and prints
statistics on stdout, its syntax is
batchSession DataFile nbOfSamples ratio depth nbOfTrials
textSession launches textHist, displays the result with gnuplot and
generates a postscript figure stored in graphic.eps, its syntax is
textSession DataFile nbOfSamples ratio depth refDepth
DataFile is the ascci text file, ratio and depth
have the same meaning as for testSession, and refDepth
is the depth of the cell structure used to approximate the empirical text distribution,
as explained above about textHist.
batchTextS runs textHist several times without display and prints
statistics on stdout, its syntax is
batchTextS DataFile nbOfSamples ratio depth refDepth nbOfTrials
data/readme . Filenames
beginning with ``Dump'' are output files. Unless you compile with
#define DEBUG
uncommented, there are three of them : Dump.Dist.log, containing the true
sample distribution, Dump.DensEst.log, containing the first estimate and
Dump.finalDens.log containing the second estimate. To obtain profile
informations about the time spent in each function, execute in directory
gibbsHist/data the command
gprof ../bin/testHist right after running testSession or
batchSession,
gprof ../bin/textHist right after running textSession
or batchTextS.
A mathematical description of the implemented algorithms and explanations about the meaning of the output are given in the above mentionned preprint.
typedef struct {
int size;
double *first;
} Sample;
/* contains the size and values of a sample of points in the unit interval */
/* Warning : d->first has size d->size+1 , and d->first[0] = 0*/
typedef struct {
int size;
double *first;
double *dens;
} Dist;
/* Defines the distribution of a histogram with dist->size bins of constant width.
* dist->first is an array containing the cumulated distribution
* function,
* whereas dist->dens is an array of density values with respect to the
* Lebesgue measure.
*/
typedef struct {
int depth;
int *first;
} Count;
/* a binary tree of counters.
* tree nodes are indexed by integers, 1 for the root, 2 for its left son,
* 3 for its right son, 4 for the left son of its left son etc.
*/
typedef struct {
int depth;
double *first;
} Weight;
/* a binary tree of weights, could represent different things : conditional
* probability densities, logarithms of conditional densities ...
*/
/* possibly irregular histogram distribution */
/* Warning : dens->first has size dens->size+1 , and dens->first[0] = 0*/
typedef struct {
int size;
double *first;
double *dens;
} Dens;
/* It sounds like Dist but is not used in the same way :
* dens->first is an array of bin boundaries, whereas dist->dens contains
* the corresponding densities. Unlike Dist, Dens represents a histogram
* with variable bin widths.
*/
The following functions are implemented :
void RandInit()Initializes the random number generator. You should
provide a seed in file .seed, unless a previous state has been saved in
file .random. Random numbers are generated from the function random
declared in <stdlib.h>.
void RandSave() Saves the state of the random number generator
in file .random.
Sample *SampleNew(int size)Allocates a sample of size size
and sets its size field.
void SampleFree(Sample *s)Frees the sample structure s.
Count *CountNew(int depth)Allocates a counter tree structure
Count of depth depth. The tree structure is implemented
as an array of size 2 to the power depth + 1.
void CountFree(Count *c) Frees the Count structure.
Weight *WeightNew(int depth) Same as CountNew,
but for the weight tree structure.
void WeightFree(Weight *w) Frees the weight structure w.
void SampleFill(Sample *s,Dist *d) Creates a sample with
a prescribed distribution.
Dens *DensNew(int size) Allocates a Dens structure and set its
size field.
void DensFree(Dens *dens) Frees a Dens structure.
double DistApply(Dist *d, double x) computes the value of the
cumulated distribution function at point x.
double QuantileApply(Dist *d, double alpha) computes the value
of the quantile function at quantile alpha (i.e. of the inverse of the
distribution function).
void CountFill(Count *c, Sample *s) fills the counters according
to the sample s.
void WeightFill(Weight *w, Count *c) computes the Laplace estimators
for the conditional distributions
.
void WeightMult(Weight *w)Computes the non conditional distribution
as the product of conditional distributions. (Should be called
after WeightFill.)
void WeightAdd(Weight *w)Cumulates the weights to
produce a cumulated distribution function on each line of the tree.
Used by Weight2Dist (see below) to create a cumulated
distribution from a weight tree.
Dist *Weight2Dist(Weight *w) Allocates and fills a
Dist structure from a tree of weights. The weights should
be non conditional (created by the sequence WeightFill and
WeightMult). Warning : The function has a side effect on its argument :
it calls WeightAdd(w).
Sample *SampleCode(Sample *s,Dist *d)Code sample s,
applying to it the cumulated distribution function defined by d.
void SampleCodeFill(Sample *s,Dist *d,Dist *estD)Fills sample s
according to distribution d and code it applying the cumulated distribution
function of estD. Same as SampleFill followed by
SampleCode, except for memory management : it allows to allocate
only one sample structure for the generated data.
void WeightLog(Weight *w)Works on conditional weights, (to be
called straight after WeightFill. Computes the logarithm of the
conditional ratio
, where
is the Lebesgue measure.
Weight *WeightMix(Weight *w,Count *c,double alpha, double beta)
is the main function of the package : it computes mixed weights from a weight array
w in log format (produced by WeightLog). The posterior
distribution is computed from the likelihood obtained from the tree of counters
c, the different subhistograms are mixed according to a branching
process with birth rate alpha, and the Gibbs aggregation rule
at inverse temperature beta is applied. The returned array of weights
is in a suitable format to be directly processed by Weight2Dist,
that is in a non conditional, non logarithmic format.
double BetaSet(Weight *w) Sets the value of the inverse temperature
in the Gibbs aggregation rule to the largest value allowed by the theoretical
bounds contained in the preprint.
void BootStrap(Sample *from,Sample *result)Fills result
according to the empirical distribution defined by from.Note that
result has to be allocated beforehand with SampleNew.
Dens *Count2Dens(Count *c)Creates a density structure Dens
containing the density corresponding to the empirical distribution defined by the
counters c.
Dens *Dist2Dens(Dist *dist) Creates a Dens structure
from a Dist structure.
Dens *Est2Dens(Dist *dist, Dist *codedDist) Computes the density
corresponding to the first estimate dist and the second estimate
codedDist. The cumulated distribution function of this density
is the compose of the cumulated distribution function of the second estimate
and the cumulated distribution function of the first estimate. The corresponding
density is obtained by differentiating this compose.
double Kullback(Dens *P,Dens *Q) Computes the Kullback divergence
function (also known as the relative entropy) between densities P
and Q.
Dist *DistRead (char *name) a wrapper around
internal function Dist * DistNew(int size, double *data).
Creates a histogram distribution on the unit interval equipped with the
Lebesgue measure. There are size bins of equal width,
whose densities are proportional to given weights.
The input ascii file should be named name and should have the
following format :
size first_weight second_weight etc.Note that the weights need not be normalized.
Sample *SampleFromFile(char *name,int N) Computes the empirical
distribution of four bytes chunks extracted from the file named name,
represented as real numbers in the unit interval.
The number of samples N is truncated to the file size if necessary.
When this is the case, the truncated sampleSize is written to the
standard ouptut. A value of N=0 will result in a sample size equal
to the file size (given in four byte units).
void DistHistWrite (Dist *d, char *name) Writes a distribution
to a file named name, in a suitable format to be displayed
with the gnuplot function plot with boxes.
void DensHistWrite(Dens *dens, char *name) Writes a density to
a file named name in a suitable format for gnuplot function
plot with boxes.
void SampleWrite (Sample *s, char *name) for debugging.
void CountWrite(Count *c, char *name) for debugging.
void WeightWrite(Weight *w, char *name) for debugging.