STRUCTURE AND SCIENTIFIC PROGRAMME OF MDMC

STRUCTURE OF THE COURSE

MDMC will be a 9-month training programme: an initial 6 weeks of intensive lectures in Trieste, followed by a 7-month internship in a laboratory of the participant's Operating Unit that is involved in one of the supporting projects (i.e., NFFA-DI and PRP@CERIC).

The course structure is outlined in detail below:

 

| | Part I | Part II | Part III | Part IV |
|---|---|---|---|---|
| Duration | 6 weeks (160 hours) | 2-3 days | 7 months | 2-3 days |
| Dates | September 16th - October 25th 2024 | October 28th - 30th | November 2024 - May 2025 | End of May 2025 |
| Topic | Intensive training on Research Data Management and tools | Definition of FAIR-by-design approach in the labs | Implementation of FAIR-by-design approach in the labs | Thesis discussion |
| Location | Trieste | Trieste | Labs | Trieste |

 

Part I: (From September 16th to October 25th 2024) in Trieste, 6 weeks of intensive lectures (6-8 hours per day, ~160 hours in total), described in detail below in the SCIENTIFIC PROGRAMME.

Part II: (From October 28th to October 30th 2024) in Trieste, two to three days of presentations in which each participant outlines the FAIR-by-design thesis project agreed upon with the supervisor of the selected laboratory and a supervisor in Trieste (AREA SCIENCE PARK – CNR).

Part III: (From November 2024 to May 2025) in the selected laboratory of the Operating Unit of origin, 7 months of thesis work to implement the FAIR-by-design project tailored to the needs of the specific laboratory. Students will receive periodic follow-ups and tutoring from the teachers involved in Part I of the course to verify the progress of the practical work.

Part IV: (End of May, dates TBD) in Trieste, two to three days of presentations in which each participant describes the FAIR-by-design thesis project developed and carried out in the laboratory. Each student will have 20 + 10 minutes (discussion + questions) to present their thesis before a board composed of selected lecturers and invited experts.


SCIENTIFIC PROGRAMME

The intensive training modules of Part I are designed to provide the skills and competencies needed to develop and carry out the subsequent FAIR-by-design project in the laboratory hosting the 7-month internship.
The 7 modules are:
- Introduction to Open Science
- Scientific Programming Environment
- Cloud Data Environment
- Programming: Python for data management (comprising three parts)
- Data Infrastructures
- Tools for Data Management and Curation
- Introduction to Statistical Data Analysis and Machine learning

The list of lecturers and their details are available on this page.


Introduction to Open Science – OS (20 hours)

Lecturers: Mariarita de Luca, Elena Giglia
MAIN TOPICS:
- Concepts and principles of Open Science
- Open Research Data and FAIR principles
- The European context: Horizon Europe and EOSC
- Research Data Management concepts and techniques
- Open Access to Publications
- Open Science policies
- Open Licensing and File Formats
- Practical exercises (e.g., board game "Super-Open Researcher", DOAJ, Sherpa Romeo, Re3Data, Zenodo)


Scientific Programming Environment - SPE (20 hours)

Lecturers: Ruggero Lot, Gianfranco Gallizia, Niccolò Tosato 
MAIN TOPICS:
- Introduction to Unix-like operating systems (kernel vs. userspace, processes/threads, file system semantics)
- Shell scripting (Bourne shell)
- Configuring, compiling, linking software packages
- Principal commands, shell, tools, and scripting
- Package managers (dnf, yum, zypper, pacman)
- Cgroups, Resource monitoring
- Editors (Vim, nano) and file managers (ranger or broot, vim)
- Command line environment, org.freedesktop standard
- Access control (permissions, groups, home), SELinux
- Collaborative source code management: Git  
- File system
- Network configuration
- Integrated development environments
- Visualization tools
- Debugging tools


Cloud Data Environment – CDE (20 hours)

Lecturer: Ruggero Lot
MAIN TOPICS:
- Introduction to Virtual Machines and Containers
- Dockerfiles and Podman
- Kubernetes
       • Pods, services
       • LoadBalancer, Ingress, gateway
       • CNI (Flannel, Calico)
       • Storage: PV, PVC, CSI (NFS, Ceph)
- Helm
- GitOps


Programming: Python for data management – PY (30 hours)

Lecturers: Matteo Poggi, Marco Franzon, Jacopo Nespolo
MAIN TOPICS - First part: introduction to Python
- Why it is so important in data science
- Basic features and good practices in Python
- How to check the installed Python version
- Create an env
- Different env managers
- Python on different OS
- IDEs
- Datatypes
- Context managers
- Functions
- Classes
- Super quick overview of software design (SOLID principles) in Python
- Unit testing with pytest

MAIN TOPICS - Second part: Jupyter notebooks (install, configure, etc...)
- Git
- Release code on GitHub/GitLab
- Sphinx or MkDocs
- How to document your code

MAIN TOPICS - Third part: data handling and visualization
- NumPy
       • Intro
       • Broadcast
       • Big data in NumPy
- Dask/Zarr
       • Alternative to NumPy for big data
       • When and why to use Dask and Zarr
- Pandas
       • Series and data frames
       • Missing data aggregation
       • Pandas to SQL
- Matplotlib
       • Data visualization


Data Infrastructures – DI (20 hours)

Lecturer: Marco Prenassi
MAIN TOPICS:
- The lower level: Ansible, Ceph, buckets, modern filesystem features for data management and resilience
- Infrastructure scalability (horizontal vs vertical)
- Relational databases and SQL
- Optimizing DB schema
- ORM, SQLAlchemy
- REST API in Flask
- NoSQL databases
- Data lake concepts and implementation
- Data access control


Tools for Data Management and Curation – TDMC (24 hours)

Lecturers: Federica Bazzocchi, Stefano Cozzini
MAIN TOPICS:
- Data management and curation: Definition, meaning and their interplay. Data management as a combination of software, tools and best practices. Different approaches to data curation. Data and metadata concepts.
- Data Management Plans: Scientific data lifecycles. Data management plan structure and requirements. Useful tools for data management plans. FAIR assessment tools.
- Open Data Repository: Worldwide scenario, FAIR certification, solutions adopted in NFFA-DI.
- Metadata: Definition and importance in the data lifecycle. Collecting metadata. Electronic notebooks. Ontologies and vocabularies. Choosing a metadata schema.
- Formats: Metadata formats. Open data formats. Hierarchical Data Format and Research Object Crate. Useful packages.
- Scientific Data Management System: Architecture & Interfaces. Database system for scientific data. Relational and non-relational database systems.
- Workflows: Planning workflows for scientific data curation. Workflows for open science. Workflow tools and library examples.


Introduction to Statistical Data Analysis and Machine learning – SDA (24 hours)

Lecturers: Matteo Biagetti, Tommaso Rodani
MAIN TOPICS:
Introduction to probability theory and statistics:
- Elements of probability theory
- Sampling and sampling distributions
- Point and interval estimation
- Hypothesis testing
- Chi-square test
- Nonparametric methods
- Elements of Bayesian inference
- Model fitting and comparison
Introduction to Machine Learning and Neural Networks:
- Linear methods for regression and classification
- Kernel methods for regression and classification
- Learning with imbalanced/missing data
- Hyperparameter tuning, cross-validation
- Intrinsic dimension of data and dimensionality reduction techniques
- Clustering methods