Collection of exercise solutions, notes and documents for the course of Statistical Analysis.
 
 
 
 
 

Statistical analysis

Description

This repository is structured as follows:

  • lectures: notes and slides of the course lectures

  • notes: an explanation of the solutions of the exercises

  • slides: a slideshow about some further research

  • ex-n: programs written for each exercise

Building the documents

The two documents exercises.pdf and lectures.pdf are written in Pandoc markdown. XeTeX (with some standard LaTeX packages), the pandoc-crossref filter and a Make program are required to build them. Typing make in the respective directory builds the document, provided the above dependencies are met.

Building the programs

The programs used to solve the exercises are written in standard C99 (with the only exception of the #pragma once directive) and require the following libraries to build:

To generate plots, Python (version 3) with

is required.

For convenience, a shell.nix file is provided to set up the build environment. See this guide if you have never used Nix before. Running nix-shell in the top-level directory will drop you into the development shell.

Once ready, invoke make with the program you wish to build. For example:

$ make ex-1/bin/main

or, to build every program of an exercise:

$ make ex-1

To clean up the build results run:

$ make clean

Running the programs

Notes:

  • Many programs generate random numbers using a PRNG that is seeded with a fixed value, for reproducibility. It's possible to test the program on different samples by changing the seed via the environment variable GSL_RNG_SEED.

Exercise 1

ex-1/bin/main generates random numbers following either the Landau or the Moyal distribution (controlled by the argument -m) and runs a series of statistical tests to check whether the points were sampled from a Landau distribution.
The size of the sample can be controlled with the argument -n N.
The program outputs the result of a Kolmogorov-Smirnov test and t-tests comparing the sample mode, FWHM and median, in this order.
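
To illustrate what a Kolmogorov-Smirnov test measures, here is a minimal Python sketch. It is not the code of ex-1/bin/main: since the Landau CDF has no closed form, the sketch uses the standard Moyal distribution as the reference instead, and the helper names (sample_moyal, moyal_cdf, ks_statistic) are made up for this example.

```python
import math
import random


def sample_moyal(n, rng):
    """Draw n standard Moyal variates: if Z ~ N(0, 1), then -ln(Z^2) is Moyal."""
    sample = []
    while len(sample) < n:
        z = rng.gauss(0.0, 1.0)
        if z != 0.0:  # avoid log(0)
            sample.append(-math.log(z * z))
    return sample


def moyal_cdf(x):
    """CDF of the standard Moyal distribution."""
    return math.erfc(math.exp(-0.5 * x) / math.sqrt(2.0))


def ks_statistic(sample, cdf):
    """Largest distance between the empirical CDF of `sample` and `cdf`."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


rng = random.Random(1)  # fixed seed, for reproducibility
sample = sample_moyal(5000, rng)
d_same = ks_statistic(sample, moyal_cdf)
d_shifted = ks_statistic(sample, lambda x: moyal_cdf(x - 0.5))
print(f"D (matching CDF) = {d_same:.4f}")
print(f"D (shifted CDF)  = {d_shifted:.4f}")
```

The statistic stays small when the sample really follows the reference distribution and grows when the reference is wrong, which is the quantity a KS test turns into a p-value.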

ex-1/bin/pdf prints a list of x-y points of the Landau PDF to stdout. The output can be piped to ex-1/pdf-plot.py to generate a plot.

(optional) ex-1/plots/kde.py makes the example plot (shown in exercises.pdf, fig. 4) of the kernel density estimation used to compute a non-parametric FWHM from a sample of random points. To run this program you must additionally install scipy.

(optional) ex-1/plots/slides.py makes two plots. The first (shown in fig. 3, exercises.pdf) is an illustration of the Landau distribution FWHM and the second (shown in slides.pdf) is a comparison of the Landau and Moyal distributions.

Exercise 2

Every program in ex-2 computes the best available approximation (with a given method) to the Euler-Mascheroni γ constant and prints[1]:

  1. the leading decimal digits of the approximate value found;

  2. the exact decimal digits of γ;

  3. the absolute difference between 1. and 2.

[1]: Some programs may also print additional debugging information.
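
As a rough illustration of the three printed quantities, the following Python sketch approximates γ from its elementary definition as the limit of H_n − ln(n). This is only an assumption about what a simple method could look like, not necessarily one of the methods implemented in ex-2; the function names are hypothetical and the reference value is limited to double precision.

```python
import math

# Reference value of the Euler-Mascheroni constant, to double precision.
GAMMA = 0.5772156649015329


def gamma_naive(n):
    """Approximate gamma as H_n - ln(n), where H_n is the n-th harmonic number."""
    harmonic = math.fsum(1.0 / k for k in range(1, n + 1))
    return harmonic - math.log(n)


def gamma_corrected(n):
    """Same sum, minus the first two terms of the asymptotic expansion of the error."""
    return gamma_naive(n) - 1.0 / (2 * n) + 1.0 / (12 * n * n)


for name, approx in (("naive", gamma_naive(100_000)),
                     ("corrected", gamma_corrected(100_000))):
    print(f"{name:>9}: approx = {approx:.15f}")
    print(f"{'':>9}  exact  = {GAMMA:.15f}")
    print(f"{'':>9}  diff   = {abs(approx - GAMMA):.3e}")
```

The naive sum converges slowly (its error is about 1/(2n)), which hints at why more refined methods are needed to obtain many digits.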

ex-2/bin/fancy and ex-2/bin/fancier can compute γ to a variable precision and therefore take the required number of decimal places as their only argument. The exact γ digits (used in comparison) are limited to 50 and 500 places, respectively.

ex-2/bin/fast is a highly optimized version of ex-2/bin/fancier. It is meant to compute a very large number of digits and therefore does not come with a verified, fixed approximation of γ.

ex-2/digits contains compressed text files of the first 1M digits of γ, obtained from ex-2/bin/fast and from sympy (using mpmath).

Exercise 3

ex-3/bin/main generates a sample of particle decay events and attempts to recover the distribution parameters via both an MLE and a χ² method. In both cases the best fit and the parameter covariance matrix are printed.
The program then performs a t-test to assess the compatibility of the data with two hypotheses and prints the results in a table.

To plot a 2D histogram of the generated sample do:

$ ex-3/bin/main -i | ex-3/plot.py

In addition, the program accepts a few more parameters to control the histogram and the number of events; run it with -h to see their usage.

Note: the histogram parameters affect the computation of the χ² and the corresponding parameter estimates.

Exercise 4

ex-4/bin/main generates a sample of particles with randomly oriented momenta and creates a histogram of the average absolute vertical component versus the horizontal component. It is possible to set the maximum momentum with the option -p. A χ² fit and a t-test for compatibility with the expected distribution are performed, and the results are printed.

To plot a histogram of the generated sample do:

$ ex-4/bin/main -o | ex-4/plot.py

It is possible to set the number of particles and bins with the options -n and -b.

Exercise 5

ex-5/bin/main computes estimates of the integral of exp(x) between 0 and 1 using several methods: plain Monte Carlo and the MISER and VEGAS algorithms, each with different numbers of samples. The program takes no arguments and prints a table of the result and its error for each method. To visualise the results, you can plot the table by doing:

$ ex-5/bin/main | ex-5/plot.py
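
For readers unfamiliar with the plain Monte Carlo method, here is a minimal Python sketch of it applied to the same integral. It is an illustration written for this README, not a port of ex-5/bin/main; the function name plain_mc and the sample sizes are arbitrary choices.

```python
import math
import random


def plain_mc(f, a, b, n, rng):
    """Estimate the integral of f over [a, b] from n uniform samples.

    Returns the estimate and its standard error.
    """
    total = 0.0
    total_sq = 0.0
    for _ in range(n):
        y = f(rng.uniform(a, b))
        total += y
        total_sq += y * y
    mean = total / n
    variance = max(total_sq / n - mean * mean, 0.0)
    width = b - a
    return width * mean, width * math.sqrt(variance / n)


rng = random.Random(1)  # fixed seed, for reproducibility
exact = math.e - 1.0
for n in (1_000, 10_000, 100_000):
    estimate, error = plain_mc(math.exp, 0.0, 1.0, n, rng)
    print(f"n = {n:>7}: {estimate:.5f} +/- {error:.5f} (exact {exact:.5f})")
```

The error shrinks like 1/√n, so every additional decimal digit costs about a hundred times more function calls.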

(optional) ex-5/plots/fit.py makes the plot (shown in exercises.pdf, fig. 13) of the standard deviation vs function calls for the plain MC method. The program takes the tabular results of ex-5/bin/main as input, so run it as:

$ ex-5/bin/main | ex-5/plots/fit.py

Exercise 6

ex-6/bin/main simulates a Fraunhofer diffraction experiment. The program prints to stdout the bin counts of the intensity as a function of the diffraction angle. To plot a histogram do:

$ ex-6/bin/main | ex-6/plot.py

The program convolves the original signal with a Gaussian kernel (-s to change the kernel σ), optionally adds Gaussian noise (-n to change the noise σ) and performs either a naive deconvolution by FFT (-m fft mode) or a Richardson-Lucy deconvolution (-m rl mode).
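
The Richardson-Lucy iteration can be sketched in a few lines of Python. This is a toy version under stated assumptions, not the implementation in ex-6: the signal is an arbitrary box plus a narrow peak rather than a diffraction pattern, no noise is added, the kernel is a truncated symmetric Gaussian, and the iteration starts from a flat estimate. All function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized, symmetric Gaussian kernel with 2*radius + 1 taps."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    norm = sum(weights)
    return [w / norm for w in weights]


def convolve(signal, kernel):
    """'Same'-size convolution with zero padding (kernel assumed symmetric)."""
    radius = len(kernel) // 2
    n = len(signal)
    out = [0.0] * n
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * signal[j]
        out[i] = acc
    return out


def richardson_lucy(data, kernel, rounds):
    """Iterate u <- u * K^T(data / (K u)), starting from a flat estimate."""
    estimate = [1.0] * len(data)
    for _ in range(rounds):
        blurred = convolve(estimate, kernel)
        ratio = [d / b if b > 0.0 else 0.0 for d, b in zip(data, blurred)]
        # For a symmetric kernel the adjoint K^T is the same convolution.
        correction = convolve(ratio, kernel)
        estimate = [u * c for u, c in zip(estimate, correction)]
    return estimate


def l1_distance(a, b):
    """Sum of the absolute bin-by-bin differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Toy signal: a box and a narrow peak, well away from the edges.
original = [0.0] * 64
for i in range(20, 25):
    original[i] = 10.0
original[40] = 20.0

kernel = gaussian_kernel(sigma=1.5, radius=6)
convolved = convolve(original, kernel)
restored = richardson_lucy(convolved, kernel, rounds=200)

print(f"L1 distance, convolved vs original: {l1_distance(convolved, original):.2f}")
print(f"L1 distance, restored vs original:  {l1_distance(restored, original):.2f}")
```

Because every update is a multiplication by a non-negative factor, the estimate never becomes negative, a property the naive FFT division does not guarantee in the presence of noise.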

The -o, -c and -d options control whether the original, convolved or deconvolved histogram counts should be printed to stdout. For more options run the program with -h to see the usage screen.

ex-6/bin/test simulates a customizable number of experiments and prints to stdout the histograms of the distribution of the EMD from the original signal to:

  1. the result of the FFT deconvolution
  2. the result of the Richardson-Lucy deconvolution
  3. the convolved signal (with noise if -n has been given)

It also prints to stderr the average, standard deviation and skewness of each distribution. To plot the histograms, do:

$ ex-6/bin/test | ex-6/dist-plot.py

The program accepts some parameters to control the histogram and the number of events; run it with -h to see their usage.
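
Assuming EMD stands for the earth mover's distance, the following Python sketch shows the standard way to compute it between two 1D histograms with identical binning: sum the absolute differences of the two cumulative distributions. Whether ex-6/bin/test normalizes the histograms or uses the same units is not stated here, so treat both choices as assumptions; the function name emd is hypothetical.

```python
def emd(hist_a, hist_b):
    """Earth mover's distance between two 1D histograms with equal binning.

    Both histograms are normalized to unit area; the distance is returned
    in units of the bin width.
    """
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    total_a = sum(hist_a)
    total_b = sum(hist_b)
    distance = 0.0
    carried = 0.0  # running difference of the two cumulative distributions
    for a, b in zip(hist_a, hist_b):
        carried += a / total_a - b / total_b
        distance += abs(carried)
    return distance


print(emd([1, 0, 0], [0, 0, 1]))  # all the mass moves by two bins: 2.0
print(emd([2, 4, 2], [1, 2, 1]))  # same shape after normalization: 0.0
```

Unlike a bin-by-bin difference, this distance grows with how far the counts have to be moved, which makes it a natural measure of how well a deconvolution restores the position of the peaks.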

(optional) ex-6/plots/emd.py makes the plots of the EMD statistics of the RL deconvolution (shown in exercises.pdf, section 6.6) as a function of the number of rounds. The program reads its data from two files in the same directory, which were obtained by running ex-6/bin/test. Do:

$ ex-6/plots/emd.py noisy

for the plots of the experiment with Gaussian noise, and

$ ex-6/plots/emd.py noiseless

for the experiment without noise.

Exercise 7

ex-7/bin/main generates a sample with two classes of 2D points (signal, noise) and trains either a Fisher linear discriminant or a single perceptron to classify them (-m argument to change mode). Alternatively the weights can be set manually via the -w argument. In either case the program then prints the classified data in this order: signal then noise.
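
As an illustration of the Fisher linear discriminant, this Python sketch computes the projection direction w ∝ S_W⁻¹(μ_signal − μ_noise) for 2D points, where S_W is the within-class scatter matrix. It is a textbook version written for this README, not the code of ex-7/bin/main; the sign convention, the function names and the toy data are assumptions.

```python
import math


def mean(points):
    """Component-wise mean of a list of (x, y) points."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def scatter(points, mu):
    """2x2 scatter matrix sum((x - mu)(x - mu)^T), returned as (sxx, sxy, syy)."""
    sxx = sum((p[0] - mu[0]) ** 2 for p in points)
    sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
    syy = sum((p[1] - mu[1]) ** 2 for p in points)
    return sxx, sxy, syy


def fisher_direction(signal, noise):
    """Unit vector w proportional to S_W^-1 (mu_signal - mu_noise)."""
    mu_s, mu_n = mean(signal), mean(noise)
    a = scatter(signal, mu_s)
    b = scatter(noise, mu_n)
    sxx, sxy, syy = a[0] + b[0], a[1] + b[1], a[2] + b[2]
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = mu_s[0] - mu_n[0], mu_s[1] - mu_n[1]
    # Explicit inverse of the symmetric 2x2 matrix applied to (dx, dy).
    wx = (syy * dx - sxy * dy) / det
    wy = (-sxy * dx + sxx * dy) / det
    norm = math.hypot(wx, wy)
    return wx / norm, wy / norm


# Toy data: two unit squares separated along the x axis.
signal = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
noise = [(4.0, 0.0), (5.0, 0.0), (4.0, 1.0), (5.0, 1.0)]
w = fisher_direction(signal, noise)
print("projection direction:", w)
print("signal projections:", [p[0] * w[0] + p[1] * w[1] for p in signal])
print("noise projections: ", [p[0] * w[0] + p[1] * w[1] for p in noise])
```

Once the points are projected on w, classification reduces to choosing a single cut value on the projected coordinate.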

To plot the result of the linear classification, pipe the output to ex-7/plot.py. The program generates two figures:

  • a scatter plot showing the Fisher projection line and the cut line;
  • two histograms of the projected data and the cut line.

ex-7/bin/test takes a model trained by ex-7/bin/main and tests it against newly generated datasets (-i to set the number of test iterations). The program prints the statistics of the number of false positives, false negatives and finally the purity and efficiency of the classification.
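
The relation between these counts can be sketched as follows, assuming the usual definitions: efficiency is the fraction of true signal points classified as signal, and purity is the fraction of points classified as signal that really are signal. The exact definitions used by ex-7/bin/test are not stated here, so this Python snippet (with a hypothetical function name and made-up counts) is only indicative.

```python
def purity_efficiency(true_pos, false_pos, false_neg):
    """Return (purity, efficiency) of a signal selection from its counts.

    true_pos:  signal points classified as signal
    false_pos: noise points classified as signal
    false_neg: signal points classified as noise
    """
    efficiency = true_pos / (true_pos + false_neg)
    purity = true_pos / (true_pos + false_pos)
    return purity, efficiency


# Made-up counts: 120 signal points, of which 90 are selected, plus 10 noise points.
purity, efficiency = purity_efficiency(true_pos=90, false_pos=10, false_neg=30)
print(f"purity = {purity:.2f}, efficiency = {efficiency:.2f}")
```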

(optional) ex-7/plots/fisher.py makes the example plot (shown in exercises.pdf, fig. 27) of the naïve projection vs Fisher projection.

License

Copyright (C) 2020 Giulia Marcer, Michele Guerini Rocco

All the programs, scripts, libraries and document source codes included in this repository are free software: you can redistribute them and/or modify them under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.