analistica/README.md

266 lines
9.5 KiB
Markdown
Raw Permalink Normal View History

2020-04-26 00:30:18 +02:00
# Statistical analysis
## Description
2020-04-28 00:23:13 +02:00
This repository is structured as follows:
2020-04-26 00:30:18 +02:00
- `lectures`: notes and slides of the course lectures
2020-04-26 00:30:18 +02:00
- `notes`: an explanation of the solutions of the exercises
- `slides`: a slideshow about some further researches
2020-04-28 00:23:13 +02:00
* `ex-n`: programs written for each exercise
2020-04-26 00:30:18 +02:00
## Building the documents
The two documents `excercise.pdf` and `lectures.pdf` are written in Pandoc
markdown. XeTeX (with some standard LaTeX packages), the
[pandoc-crossref](https://github.com/lierdakil/pandoc-crossref) filter and a
Make program are required to build. Simply typing `make` in the respective
directory will build the document, provided the above dependencies are met.
## Building the programs
The programs used to solve the exercise are written in standard C99 (with the
only exception of the `#pragma once` clause) and require the following
libraries to build:
- [GMP] (≥ 6.2)
- [GSL] (≥ 2.6)
* [pkg-config] (≥ 0.29, build-time only)
2020-04-26 00:30:18 +02:00
To generate plots, Python (version 3) with
- [numpy] (≥ 1.18)
- [matplotlib] (≥ 2.2)
- [scipy] (≥ 1.4, optional)
is required to generate plots.
[GMP]: https://gmplib.org/
[GSL]: https://www.gnu.org/software/gsl/
[pkg-config]: https://www.freedesktop.org/wiki/Software/pkg-config/
[numpy]: https://numpy.org/
[scipy]: https://www.scipy.org/scipylib/index.html
[matplotlib]: https://matplotlib.org/
2020-04-26 00:30:18 +02:00
2020-04-30 22:28:27 +02:00
For convenience, a `shell.nix` file is provided to set up the build environment.
2020-04-26 00:30:18 +02:00
See this [guide](https://nixos.org/nix/manual/#chap-quick-start) if you have
never used Nix before. Running `nix-shell` in the top-level will drop you into
the development shell.
Once ready, invoke `make` with the program you wish to build. For example:
2020-04-26 00:30:18 +02:00
$ make ex-1/bin/main
or, to build every program of an exercise:
2020-04-26 00:30:18 +02:00
$ make ex-1
To clean up the build results run:
2020-04-26 00:30:18 +02:00
$ make clean
## Running the programs
Notes:
- Many programs generate random numbers using a PRNG that is seeded with a
fixed value, for reproducibility. It's possible to test the program on
different samples by changing the seed via the environment variable
`GSL_RNG_SEED`.
### Exercise 1
`ex-1/bin/main` generate random numbers following either the Landau or Moyal
distributions (controlled by the argument `-m`) and run a series of statistical
test to check if the points where samples from a Landau.
2020-04-27 23:51:34 +02:00
The size of the sample can be controlled with the argument `-n N`.
2020-04-26 00:30:18 +02:00
The program outputs the result of a Kolmogorov-Smirnov test and t-tests
comparing the sample mode, FWHM and median, in this order.
2020-04-27 23:51:34 +02:00
2020-06-03 14:27:26 +02:00
`ex-1/bin/pdf` prints a list of x-y points of the Landau PDF to the `stdout`.
2020-04-27 23:51:34 +02:00
The output can be redirected to `ex-1/pdf-plot.py` to generate a plot.
(optional) `ex-1/plots/kde.py` makes the example plot (shown in exercises.pdf,
fig. 4) of the kernel density estimation used to compute a non-parametric FWHM
from a sample of random points. To run this program you must additionally
install [scipy].
(optional) `ex-1/plots/slides.py` makes two plots. The first (shown
in fig. 3, exercises.pdf) is an illustration of the Landau distribution
FWHM and the second (shown in slides.pdf) is a comparison of the Landau
and Moyal distributions.
2020-04-27 23:51:34 +02:00
### Exercise 2
Every program in `ex-2` computes the best available approximation (with a given
method) to the Euler-Mascheroni γ constant and prints[1]:
1. the leading decimal digits of the approximate value found;
2020-04-27 23:51:34 +02:00
2. the exact decimal digits of γ;
2020-04-27 23:51:34 +02:00
3. the absolute difference between the 1. and 2.
[1]: Some program may also print additional debugging information.
`ex-2/bin/fancy`, `ex-2/bin/fancier` can compute γ to a variable precision and
take therefore the required number of decimal places as their only argument.
The exact γ digits (used in comparison) are limited to 50 and 500 places,
respectively.
`ex-2/bin/fast` is a highly optimized version of `ex-2/bin/fancier`, meant to
compute a very large number of digits and therefore doesn't come with a
verified, fixed, approximation of γ.
2020-06-09 14:32:36 +02:00
`ex-2/digits` containes compressed text files of the first 1M digits
of γ, obtained from `ex-2/bin/fast` and from `sympy` (using `mpmath`).
2020-04-27 23:51:34 +02:00
### Exercise 3
`ex-3/bin/main` generates a sample of particle decay events and attempts to
recover the distribution parameters via both a MLE and a χ² method. In both
cases the best fit and the parameter covariance matrix are printed.
The program then performs a t-test to assert the compatibility of the data with
two hypothesis and print the results in a table.
To plot a 2D histogram of the generated sample do:
2020-04-27 23:51:34 +02:00
$ ex-3/bin/main -i | ex-3/plot.py
In addition the program accepts a few more parameters to control the histogram
and number of events, run it with `-h` to see their usage.
Note: the histogram parameters affect the computation of the χ² and the
relative parameter estimation.
2020-04-28 00:23:13 +02:00
2020-04-30 22:28:27 +02:00
### Exercise 4
`ex-4/bin/main` generates a sample of particles with random oriented momentum
and creates an histogram with average vertical component, in modulus, versus
horizontal component. It is possible to set the maximum momentum with the
option `-p`. A χ² fit and a t-test compatibility are performed with respect
to the expected distribution and results are printed.
To plot a histogram of the generated sample do:
2020-04-30 22:28:27 +02:00
$ ex-4/bin/main -o | ex-4/plot.py
It is possible to set the number of particles and bins with the options `-n`
and `-b`.
### Exercise 5
`ex-5/main` compute estimations of the integral of exp(x) between 0 and 1
2020-05-24 22:04:31 +02:00
using several methods: a plain Monte Carlo, the MISER and VEGAS algorithms
with different number of samples. The program takes no arguments and prints
a table of the result and its error for each method.
To visualise the results, you can plot the table by doing:
$ ex-5/bin/main | ex-5/plot.py
(optional) `ex-6/plots/fit.py` makes the plot (shown in exercises.pdf, fig. 13)
of the standard deviation vs function calls for the plain MC method. The
program takes the tabular results of `ex-5/bin/main` as input, so run it as:
2020-06-09 18:55:54 +02:00
$ ex-5/bin/main | ex-5/plots/fit.py
2020-04-30 22:28:27 +02:00
2020-04-28 00:23:13 +02:00
### Exercise 6
`ex-6/bin/main` simulates a Fraunhöfer diffraction experiment. The program
prints to `stdout` the bin counts of the intensity as a function of the
diffraction angle. To plot a histogram do:
$ ex-6/bin/main | ex-6/plot.py
2020-04-28 00:23:13 +02:00
The program convolves the original signal with a gaussian kernel (`-s` to
change the kernel σ), optionally adds a gaussian noise (`-n` to change the
noise σ) and performs either a naive deconvolution by a FFT (`-m fft` mode)
or applying the Richardson-Lucy deconvolution algorithm (`-m rl` mode).
2020-04-28 00:23:13 +02:00
The `-o`, `-c` and `-d` options control whether the original, convolved or
deconvolved histogram counts should be printed to `stdout`. For more options
2020-04-28 00:23:13 +02:00
run the program with `-h` to see the usage screen.
2020-04-28 22:16:21 +02:00
`ex-6/bin/test` simulates a customizable number of experiments and prints
to `stdout` the histograms of the distribution of the EMD from the original
signal to:
1. the result of the FFT deconvolution
2. the result of the Richardson-Lucy deconvolution
3. the convolved signal (with noise if `-n` has been given)
It also prints to `stderr` the average, standard deviation and skewness of
each distribution. To plot the histograms, do:
$ ex-6/bin/test | ex-6/dist-plot.py
The program accepts some parameters to control the histogram and number of
events, run it with `-h` to see their usage.
(optional) `ex-6/plots/emd.py` makes the plots of the EMD statistics of the RL
2020-05-24 22:04:31 +02:00
deconvolution (shown in exercises.pdf, section 6.6) as a function of the number
of rounds. The programs sources its data from two files in the same directory,
these were obtained by running `ex-6/bin/test`. Do:
$ ex-6/plots/emd.py noisy
for the plots of the experiment with gaussian noise, and
$ ex-6/plots/emd.py noiseless
for the experiment without noise.
2020-04-28 22:16:21 +02:00
### Exercise 7
`ex-7/bin/main` generates a sample with two classes of 2D points (signal,
noise) and trains either a Fisher linear discriminant or a single perceptron to
classify them (`-m` argument to change mode). Alternatively the weights can be
set manually via the `-w` argument. In either case the program then prints the
classified data in this order: signal then noise.
To plot the result of the linear classification pipe the output to
`ex-7/plot.py`. The program generates two figures:
- a scatter plot showing the Fisher projection line and the cut line;
- two histograms of the projected data and the cut line.
2020-04-28 22:16:21 +02:00
`ex-7/bin/test` takes a model trained in `ex-7/bin/main` and test it against
newly generated datasets (`-i` to set the number of test iterations). The
program prints the statistics of the number of false positives, false
negatives and finally the purity and efficiency of the classification.
(optional) `ex-7/plots/fisher.py` makes the example plot (shown in
exercises.pdf, fig. 27) of the naïve projection vs Fisher projection.
2020-07-05 13:54:16 +02:00
## License
Copyright (C) 2020 Giulia Marcer, Michele Guerini Rocco
All the programs, scripts, libraries and document source codes
included in this repository are free software: you can redistribute
them and/or modify them under the terms of the GNU General Public
License as published by the Free Software Foundation, either version
3 of the License, or (at your option) any later version.
This software is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see <http://www.gnu.org/licenses/>.