MHPC thesis discussion

The MHPC thesis discussion takes place on Friday, December 20, 2019, at SISSA (via Bonomea), in the Big Meeting Room on the 7th floor, with the following schedule:

09:00 Dr. Jiaxin Wang
09:30 Federico Barone
10:00 Luca Bonaldo
10:30 Eduardo Quintana Miranda
11:00 Break
11:30 Tommaso Ronconi
12:00 Dr. Khalil Ahmed Mohamed Saleh Hassanin
12:30 Matteo Zampieri
13:00 Dr. Costantino Pacilio

Dr. Jiaxin Wang
Title: "Cosmic Ray Propagation with High Dimensional Finite Element Method"
Supervisors: Prof. Luca Heltai (SISSA), Prof. Piero Ullio (SISSA), Dr. Alberto Sartori (SISSA)

The Galactic synchrotron emission not only carries abundant physical information about the Galactic magnetized interstellar medium, but also has a prominent effect on our understanding of the cosmic microwave background, especially its B-mode polarization. To keep up with the growing precision of astrophysical observations, we need to build a consistent numerical framework in which simulating cosmic ray (electron) propagation is a major task. In this master project, we propose to use the finite element method to solve the cosmic ray (electron) transport equation in a phase space of dimension ranging from two to six. The numerical package BIFET is developed on top of the deal.II library, with support for adaptive mesh refinement. We mainly present the design and the methods for solving advection-diffusion problems, and demonstrate the package's capability and precision with physically simplified tests and examples.

Federico Barone
Title: "Compressing Medical Images with Minimal Information Loss"
Supervisors: Prof. Alessandro Laio (SISSA), Prof. Luca Heltai (SISSA), and Dr. Alberto Sartori (SISSA)

This thesis aims to explore the potential of neural networks as compression algorithms for medical images. The objective is to develop a compressed image representation suitable for image comparison. In particular, we studied different autoencoder architectures, varying the encoding mechanism in order to achieve a high degree of compression while also retaining a meaningful feature space. Our work is focused on mammograms, but the methods introduced here can be extended to other types of medical images.

Luca Bonaldo
Title: "Image classification methods for small or highly unbalanced datasets"
Supervisors: Ing. Daniele Fornasier (Beantech), Dr. Alessio Ansuini (Area Science Park), and Dr. Alberto Sartori (SISSA)

Optical inspection and standard computer vision techniques are today two of the most widely used ways of performing quality control in manufacturing. However, these two approaches limit the full automation of production and impose solutions that depend on the specific product, which translates into a higher cost than a fully automatic approach. Moreover, standard machine vision algorithms rely on ad hoc parameters that have to be adjusted case by case, with a loss of robustness.
In this thesis, an innovative approach to quality control of industrial products, based on deep neural networks, was studied and tested. A deep learning model does not require parameters to be fixed in advance: it uses a data-driven approach and adapts itself using the features present in the training set in order to classify a product as conforming or non-conforming as accurately as possible. Moreover, by coupling the model with a proper in-line acquisition system and suitable hardware, the monitoring of a production line can be completely automated.

Eduardo Quintana Miranda
Title: "Forwarding GADGET to exa-scale"
Supervisors: Dr. Luca Tornatore (INAF), Dr. Alberto Sartori (SISSA)

This thesis focuses on one particular code for cosmological simulations named GADGET, an acronym that stands for GAlaxies with Dark matter and Gas intEracT, whose latest public version is GADGET-2. The GADGET code has been publicly available since the beginning, and that has led to a large number of custom versions by several different groups that developed special features of their own interest. The mainline and the custom codes have been used to perform many of the most advanced cosmological simulations for almost 20 years, bringing results that were, and still are, at the cutting edge of research in cosmology.
In spite of several efforts in recent years, GADGET is still conceived and written as a monolithic code. Moreover, the overall code architecture does not allow it to scale efficiently to many thousands of processors when tackling exceptionally challenging problems at the current cutting edge of research. To address these issues, and the need to update the code's design for the forthcoming exa-scale architectures, a thorough redesign of crucial parts of the code has been undertaken by a core team of long-term developers of the code's mainline. This thesis is part of that effort, focusing on two of its fundamental pillars: the domain decomposition and the tree building. It analyzes their current behaviour and weaknesses, and proposes some well-motivated improvements.

Tommaso Ronconi
Title: "Renewal and Optimization of ScamPy: a Python API for painting galaxies on top of dark matter simulations"
Supervisors: Prof. Andrea Lapi (SISSA), Prof. Matteo Viel (SISSA), Dr. Alberto Sartori (SISSA)

In spite of the accuracy and reliability reached in modelling the dark sector, the formation and evolution of the luminous component is far from being well constrained with cosmological N-body simulations. Among the several possible approaches to address this modelling issue, empirical models are designed to reproduce observable properties of a target population of objects at a given moment of their evolution. This design choice makes them particularly suitable for modelling the high-redshift Universe. We have developed a computational framework for "painting" galaxies on top of the Dark Matter Halo/Sub-Halo hierarchy obtained from N-body simulations. The method we use is based on the sub-halo clustering and abundance matching (SCAM) scheme, which requires observations of the 1- and 2-point statistics of the target population we want to reproduce. Starting from a rough preliminary prototype, we have developed a hybrid C++/Python API which implements a wide range of functions and custom types intended to efficiently produce mock-galaxy catalogues. The core functionalities are written in C++ and exploit Object Oriented Programming, with a wide use of polymorphism, to achieve flexibility and high computational efficiency. In order to provide an easily accessible interface, all the libraries are wrapped in Python and provided with extensive documentation. In the development we have applied the best practices in High Performance Computing and Advanced Programming: modularization, unit testing, and cross-platform building are all aspects that have been covered in detail. We have validated all the results returned by the API and we are starting to apply our tool to a set of scientific cases. Finally, we have benchmarked the crucial sections of our library, measuring a performance increase of up to a factor of 10,000.

Dr. Saleh Hassanin Khalil Ahmed Mohamed
Title: "Data Management Tools for NFFA-Europe"
Supervisor: Dr. Stefano Cozzini (CNR-IOM, eXact Lab)

Overall goal: Set up well-organized data services to manage all the SEM images within the NFFA-EUROPE project in order to make them FAIR.
Specific goals within this thesis:
- Creation of a Python application to collect/define metadata for SEM images
- Massively parallel processing of all images to collect/define metadata
- Definition of an easy-to-set-up and portable computational ecosystem to accomplish the above goal (Kubernetes + Spark, on different computational infrastructures)
- Performance measurement on different computational infrastructures

Matteo Zampieri
Title: "Multi-label Classification of Computed Tomography Scan Reports"
Supervisors: Prof. Luca Heltai (SISSA), Prof. Luca Bortolussi (UniTS)

The digitalization of clinical reports and the ever-growing usage of electronic health records make it possible to collect huge amounts of data. This data can be used to explore strategies to assist both patients and clinical personnel, either as inference tools that could inform diagnostic decisions in a relevant manner, or as a general research pool. This project specifically makes use of reports of Computed Tomography scans of patients with metastatic breast cancer. The aim of the thesis is to explore methods for multi-label text classification. The reports of interest are classified with a varying number of tags, depending on the location of the metastasis inferred from the report, which comes in the form of a free-text description. To address this problem, I used a set of algorithms, namely logistic regression (multinomial and one-vs-rest), k-Nearest-Neighbors (with 'uniform' and 'distance' weights), Multi-k-Nearest-Neighbors, and Support Vector Classifier; these algorithms were fed with different types of word embeddings (TF-IDF and doc2vec). Moreover, the fastText library was explored for its integrated word embedding and text classification capabilities. Finally, I used Fast-Bert, an open-source extension of Google's BERT designed specifically for text classification. The results were not satisfactory, due to the small size and the high class imbalance of the dataset. However, the investigation of different techniques has shed light on the promising possibilities of some of the strategies used.
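As a rough illustration of what the one-vs-rest setting mentioned above involves, the following minimal sketch turns per-report tag lists into one binary target per tag, so that a separate binary classifier could be trained for each. The tag names and the helper functions are hypothetical and chosen only for illustration; the actual label set and code of the thesis are not described in the abstract.

```python
# Hypothetical metastasis-site tags; the real label set is not given in the abstract.
TAGS = ["bone", "liver", "lung"]


def to_indicator(report_tags, all_tags=TAGS):
    """Return a 0/1 vector with one entry per tag (the multi-label target)."""
    present = set(report_tags)
    unknown = present - set(all_tags)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    return [1 if tag in present else 0 for tag in all_tags]


def one_vs_rest_targets(reports_tags, all_tags=TAGS):
    """Split multi-label data into one binary target list per tag."""
    rows = [to_indicator(tags, all_tags) for tags in reports_tags]
    return {tag: [row[i] for row in rows] for i, tag in enumerate(all_tags)}


# Three toy reports: one tag, two tags, and no tag at all.
labels = [["bone"], ["liver", "lung"], []]
targets = one_vs_rest_targets(labels)
```

Each entry of `targets` is then an ordinary binary classification problem. With few reports and rare tags, most of these binary targets are dominated by zeros, which is the class imbalance the abstract refers to.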

Dr. Costantino Pacilio
Title: "Multioutput regression of noisy time series using convolutional neural networks, with applications to gravitational waves"
Supervisors: Prof. Luca Heltai (SISSA) and Prof. Enrico Barausse (SISSA)

In this thesis I implement a deep learning algorithm to perform a multioutput regression. The dataset is a collection of one-dimensional time series arrays, corresponding to simulated gravitational waveforms emitted by a black hole binary, and labeled by the masses of the two black holes. In addition, white Gaussian noise is added to the arrays, to simulate a signal detection in the presence of noise. A convolutional neural network is trained to infer the output labels in the presence of noise, and the resulting model generalizes over many orders of magnitude in the noise level. From the results I argue that the hidden layers of the model successfully denoise the signals before the inference step. The entire code is implemented in the form of a Python module, and the neural network is written in PyTorch. The training of the network is sped up using a single GPU, and I report on efforts to improve the scaling of the training time with respect to the size of the training sample.
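The data preparation step described above, adding white Gaussian noise to a one-dimensional time series, can be sketched as follows. This is only a toy example under stated assumptions: the waveform is a simple chirp-like sinusoid standing in for a simulated gravitational waveform, and the function names and the noise level `sigma` are hypothetical, not taken from the thesis.

```python
import math
import random


def make_waveform(n_samples=256, freq=5.0):
    """Toy stand-in for a simulated waveform: a sinusoid with growing amplitude."""
    return [
        (i / n_samples) * math.sin(2 * math.pi * freq * i / n_samples)
        for i in range(n_samples)
    ]


def add_white_noise(signal, sigma, rng):
    """Add independent zero-mean Gaussian noise of standard deviation sigma to each sample."""
    return [s + rng.gauss(0.0, sigma) for s in signal]


rng = random.Random(0)  # fixed seed so the example is reproducible
clean = make_waveform()
noisy = add_white_noise(clean, sigma=0.1, rng=rng)
```

Varying `sigma` over several orders of magnitude would produce training samples at different noise levels, which is the regime in which the abstract reports that the trained model generalizes.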