# Statistical analysis


## Description

This repository is structured as follows:

- `lectures`: a summary of the lectures of the course

- `notes`: an explanation of the solutions of the exercises

- `ex-n`: programs written for each exercise


## Building the documents

The two documents `excercise.pdf` and `lectures.pdf` are written in Pandoc
markdown. XeTeX (with some standard LaTeX packages), the
[pandoc-crossref](https://github.com/lierdakil/pandoc-crossref) filter and a
Make program are required to build. Simply typing `make` in the respective
directory will build the document, provided the above dependencies are met.


## Building the programs

The programs used to solve the exercises are written in standard C99 (the only
exception being the `#pragma once` directive) and require the following
libraries to build:

- [GMP](https://gmplib.org/)

- [GSL](https://www.gnu.org/software/gsl/)

- [pkg-config](https://www.freedesktop.org/wiki/Software/pkg-config/)
  (build-time only)

Additionally, Python (version 3) with `numpy` and `matplotlib` is required to
generate plots.

For convenience, a `shell.nix` file is provided to set up the build environment.
See this [guide](https://nixos.org/nix/manual/#chap-quick-start) if you have
never used Nix before. Running `nix-shell` in the top-level will drop you into
the development shell.

Once ready, invoke `make` with the program you wish to build. For example

    $ make ex-1/bin/main

or, to build every program of an exercise

    $ make ex-1

To clean up the build results run

    $ make clean


## Running the programs

Notes:

- Many programs generate random numbers using a PRNG that is seeded with a
  fixed value, for reproducibility. To test a program on different samples,
  change the seed via the environment variable `GSL_RNG_SEED`.


### Exercise 1

`ex-1/bin/main` generates random numbers following the Landau distribution and
runs a series of tests to check whether they really follow that distribution.  
The size of the sample can be controlled with the argument `-n N`.  
The program outputs the results of a Kolmogorov-Smirnov test and of t-tests
comparing the sample mode, FWHM and median, in this order.
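
For reference, the Kolmogorov-Smirnov statistic is the largest distance between
the empirical CDF of the sample and the reference CDF. The following Python
snippet is a minimal sketch of that computation, not the code used in `ex-1`.
Since the Landau CDF has no closed form, a unit exponential distribution stands
in for it here; the names `ks_statistic` and `exp_cdf` are illustrative.

```python
import math
import random


def ks_statistic(sample, cdf):
    """Return the largest distance between the empirical CDF and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x: check both sides.
        d = max(d, abs(f - i / n), abs((i + 1) / n - f))
    return d


def exp_cdf(x):
    """CDF of the unit exponential distribution (stand-in for the Landau)."""
    return 1.0 - math.exp(-x) if x > 0 else 0.0


rng = random.Random(42)  # fixed seed, for reproducibility
sample = [rng.expovariate(1.0) for _ in range(1000)]
d = ks_statistic(sample, exp_cdf)
print(f"D = {d:.4f}, sqrt(n) * D = {math.sqrt(len(sample)) * d:.4f}")
```

A small value of `sqrt(n) * D` indicates that the sample is compatible with the
reference distribution.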

`ex-1/bin/pdf` prints a list of x-y points of the Landau PDF to `stdout`.
The output can be piped to `ex-1/pdf-plot.py` to generate a plot.


### Exercise 2

Every program in `ex-2` computes the best available approximation (with a given
method) to the Euler-Mascheroni γ constant and prints[1]:

1. the leading decimal digits of the approximate value found

2. the exact decimal digits of γ

3. the absolute difference between 1. and 2.

[1]: Some programs may also print additional debugging information.

`ex-2/bin/fancy` and `ex-2/bin/fancier` can compute γ to a variable precision
and therefore take the required number of decimal places as their only
argument. The exact γ digits (used in the comparison) are limited to 50 and 500
places, respectively.
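
To illustrate the three quantities printed, here is a minimal Python sketch
based on the definition of γ as the limit of H_n - ln(n), where H_n is the n-th
harmonic number. This is not necessarily one of the methods implemented in
`ex-2`; the function name `gamma_naive` and the choice of `n` are assumptions
made for the example.

```python
import math

# Reference value of γ, rounded to double precision.
GAMMA_REF = 0.5772156649015329


def gamma_naive(n):
    """Approximate γ as H_n - ln(n), where H_n is the n-th harmonic number."""
    harmonic = math.fsum(1.0 / k for k in range(1, n + 1))
    return harmonic - math.log(n)


approx = gamma_naive(100_000)
print(f"approx: {approx:.12f}")
print(f"exact:  {GAMMA_REF:.12f}")
print(f"diff:   {abs(approx - GAMMA_REF):.3e}")
```

The error of this definition decreases only as 1/(2n), which is why faster
converging methods are needed to reach many decimal places.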


### Exercise 3

`ex-3/bin/main` generates a sample of particle decay events and attempts to
recover the distribution parameters via both an MLE and a χ² method. In both
cases the best fit and the parameter covariance matrix are printed.  
The program then performs a t-test to assess the compatibility of the data with
two hypotheses and prints the results in a table.

To plot a 2D histogram of the generated sample do

    $ ex-3/bin/main -i | ex-3/plot.py

In addition, the program accepts a few more parameters to control the histogram
and the number of events; run it with `-h` to see their usage.

Note: the histogram parameters affect the computation of the χ² and the
resulting parameter estimates.


### Exercise 4

`ex-4/bin/main` generates a sample of particles with randomly oriented momentum
and creates a histogram of the average vertical component, in modulus, versus
the horizontal component. It is possible to set the maximum momentum with the
option `-p`. A χ² fit and a t-test of compatibility with the expected
distribution are performed and the results are printed.

To plot a histogram of the generated sample do

    $ ex-4/bin/main -o | ex-4/plot.py

It is possible to set the number of particles and bins with the options `-n`
and `-b`.


### Exercise 5

`ex-5/main` computes the integral of exp(x) between 0 and 1 with the plain MC,
MISER and VEGAS methods. Since these are iterative routines, the program takes
the number of iterations as its only argument.  
It prints the obtained value and its estimated error for each method.
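
As a rough illustration of the simplest of the three methods, the following
Python sketch estimates the integral with plain Monte Carlo, assuming the
integrand is exp(x) on [0, 1] (exact value e - 1). It is not the code of
`ex-5`; the function name `plain_mc` and the number of samples are arbitrary
choices made for the example.

```python
import math
import random


def plain_mc(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform samples.

    Returns the estimate and its standard error.
    """
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        y = f(rng.uniform(a, b))
        total += y
        total_sq += y * y
    mean = total / n
    variance = total_sq / n - mean * mean
    width = b - a
    return width * mean, width * math.sqrt(variance / n)


rng = random.Random(1)  # fixed seed, for reproducibility
value, error = plain_mc(math.exp, 0.0, 1.0, 100_000, rng)
print(f"I = {value:.5f} ± {error:.5f} (exact: {math.e - 1:.5f})")
```

The estimated error decreases as the inverse square root of the number of
samples; MISER and VEGAS reduce it further by sampling the domain non-uniformly.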


### Exercise 6

`ex-6/bin/main` simulates a Fraunhofer diffraction experiment. The program
prints to `stdout` the bin counts of the intensity as a function of the
diffraction angle. To plot a histogram simply pipe the output to the
program `ex-6/plot.py`.

The program convolves the original signal with a Gaussian kernel (`-s` to
change the σ), optionally adds Poisson noise (`-m` to change the mean μ) and
performs either a naive deconvolution by FFT (`-m fft` mode) or the
Richardson-Lucy deconvolution algorithm (`-m rl` mode), which is expected to
perform optimally in this case.
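
The Richardson-Lucy iteration repeatedly multiplies the current estimate by the
back-projected ratio between the observed counts and the re-blurred estimate.
The following pure Python snippet is a minimal, noise-free sketch of the idea
and not the implementation in `ex-6`: the test signal, the zero-padded
boundaries and all function names are assumptions made for the example.

```python
import math


def gaussian_kernel(sigma, half_width):
    """Return a normalized, symmetric Gaussian kernel of odd length."""
    values = [math.exp(-0.5 * (i / sigma) ** 2)
              for i in range(-half_width, half_width + 1)]
    norm = sum(values)
    return [v / norm for v in values]


def convolve_same(signal, kernel):
    """Convolve `signal` with an odd-length kernel, keeping the same length.

    Samples outside the signal are treated as zero.
    """
    half = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        for j, k in enumerate(kernel):
            idx = i + half - j
            if 0 <= idx < n:
                out[i] += k * signal[idx]
    return out


def richardson_lucy(observed, kernel, iterations):
    """Iteratively estimate the signal that was blurred into `observed`."""
    flipped = kernel[::-1]  # the adjoint of the convolution
    estimate = [sum(observed) / len(observed)] * len(observed)  # flat start
    for _ in range(iterations):
        blurred = convolve_same(estimate, kernel)
        ratio = [o / b if b > 0 else 0.0 for o, b in zip(observed, blurred)]
        correction = convolve_same(ratio, flipped)
        estimate = [e * c for e, c in zip(estimate, correction)]
    return estimate


# Hypothetical test signal: two sharp peaks on an empty background.
signal = [0.0] * 41
signal[15], signal[25] = 100.0, 60.0

kernel = gaussian_kernel(sigma=1.5, half_width=6)
observed = convolve_same(signal, kernel)
restored = richardson_lucy(observed, kernel, iterations=500)

print(f"peak height: blurred {observed[15]:.1f}, restored {restored[15]:.1f}")
print(f"total counts: blurred {sum(observed):.3f}, restored {sum(restored):.3f}")
```

Each iteration keeps the estimate non-negative and conserves the total number
of counts, which is what makes the algorithm well suited to histogram data.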

The `-c` and `-d` options control whether the convolved or deconvolved
histogram counts should be printed to `stdout`. For more options, run the
program with `-h` to see the usage screen.


### Exercise 7

`ex-7/bin/main` generates a sample with two classes of 2D points (signal,
noise) and trains either a Fisher linear discriminant or a single perceptron to
classify them (`-m` argument to change mode). Alternatively the weights can be
set manually via the `-w` argument. In either case the program then prints the
classified data in this order: signal then noise.
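
For reference, the Fisher discriminant projects the points onto the direction
w ∝ S_w⁻¹ (μ_signal − μ_noise), where S_w is the within-class scatter matrix.
The following Python snippet is a minimal sketch of this computation in 2D, not
the code of `ex-7`: the sample parameters, the midpoint cut and all function
names are assumptions made for the example.

```python
import random


def mean2(points):
    """Return the mean of a list of 2D points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter2(points, mu):
    """Return the 2x2 scatter matrix [[a, b], [c, d]] as (a, b, c, d)."""
    a = sum((p[0] - mu[0]) ** 2 for p in points)
    b = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    d = sum((p[1] - mu[1]) ** 2 for p in points)
    return a, b, b, d


def fisher_direction(signal, noise):
    """Return w = S_w^{-1} (mu_signal - mu_noise)."""
    mu_s, mu_n = mean2(signal), mean2(noise)
    s, t = scatter2(signal, mu_s), scatter2(noise, mu_n)
    a, b, c, d = (s[i] + t[i] for i in range(4))  # within-class scatter
    det = a * d - b * c
    dx, dy = mu_s[0] - mu_n[0], mu_s[1] - mu_n[1]
    # Apply the inverse of [[a, b], [c, d]] to the difference of the means.
    return ((d * dx - b * dy) / det, (-c * dx + a * dy) / det)


def project(w, p):
    """Project the point p onto the direction w."""
    return w[0] * p[0] + w[1] * p[1]


# Hypothetical samples: a narrow signal cloud and a broad noise cloud.
rng = random.Random(7)  # fixed seed, for reproducibility
signal = [(rng.gauss(0.0, 0.3), rng.gauss(0.0, 0.3)) for _ in range(500)]
noise = [(rng.gauss(4.0, 1.0), rng.gauss(4.0, 1.0)) for _ in range(500)]

w = fisher_direction(signal, noise)
# Midpoint between the projected means: a simple choice, assumed here.
cut = 0.5 * (project(w, mean2(signal)) + project(w, mean2(noise)))
# w points from noise towards signal, so signal projects above the cut.
accepted = sum(project(w, p) > cut for p in signal)
print(f"w = ({w[0]:.4f}, {w[1]:.4f}), signal accepted: {accepted}/500")
```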

To plot the result of the linear classification, pipe the output to
`ex-7/plot.py`. The script generates two figures:

- a scatter plot showing the Fisher projection line and the cut line

- two histograms of the projected data and the cut line

`ex-7/bin/test` takes a model trained by `ex-7/bin/main` and tests it against
newly generated datasets (`-i` to set the number of test iterations). The
program prints the statistics of the number of false positives and false
negatives, and finally the purity and efficiency of the classification.
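
Assuming the usual definitions, purity is the fraction of points classified as
signal that really are signal, and efficiency is the fraction of signal points
that are classified as such. A minimal Python sketch, with a hypothetical
function name and made-up counts:

```python
def purity_efficiency(true_pos, false_pos, false_neg):
    """Return (purity, efficiency) of a signal selection.

    purity     = TP / (TP + FP): fraction of accepted points that are signal
    efficiency = TP / (TP + FN): fraction of signal points that are accepted
    """
    purity = true_pos / (true_pos + false_pos)
    efficiency = true_pos / (true_pos + false_neg)
    return purity, efficiency


# Hypothetical counts, for illustration only.
purity, efficiency = purity_efficiency(true_pos=780, false_pos=20, false_neg=220)
print(f"purity = {purity:.3f}, efficiency = {efficiency:.3f}")
```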