STRUCTURE AND SCIENTIFIC PROGRAMME OF MDMC

STRUCTURE OF THE COURSE

MDMC will be a 9-month training programme: an initial 6 weeks of intensive lectures in Trieste, followed by a 7-month internship in a laboratory of the participant's Operating Unit that is involved in one of the supporting projects (i.e., NFFA-DI and PRP@CERIC).

The course structure is outlined in detail below:

	Part I	Part II	Part III	Part IV
Duration	6 weeks (160 hours)	2-3 days	7 months	2-3 days
Dates	September 16th - October 25th 2024	October 28th - 30th	November 2024 - May 2025	End of May 2025
Topic	Intensive training on Research Data Management and tools	Definition of FAIR-by-design approach in the labs	Implementation of FAIR-by-design approach in the labs	Thesis discussion
Location	Trieste	Trieste	Labs	Trieste

Part I: (From September 16th to October 25th 2024) in Trieste, 6 weeks of intensive lectures (6-8 hours per day, ~160 hours total), described in detail below in the SCIENTIFIC PROGRAMME.

Part II: (From October 28th to October 30th 2024) in Trieste, two to three days of presentations in which each participant outlines the FAIR-by-design thesis project agreed upon with the supervisor of the selected laboratory and a supervisor in Trieste (AREA SCIENCE PARK – CNR).

Part III: (From November 2024 to May 2025) in the selected laboratory of the Operating Unit of origin, 7 months of thesis work to implement the FAIR-by-design project tailored to the needs of the specific laboratory. Students will be supported through periodic follow-ups and tutoring by the teachers involved in Part I of the course, who will verify the progress of the practical work.

Part IV: (End of May 2025, dates TBD) in Trieste, two to three days of presentations in which each participant describes the FAIR-by-design thesis project developed and carried out in the laboratory. Each student will have 20+10 minutes (presentation + questions) to present their thesis before a board composed of selected lecturers and invited experts.

SCIENTIFIC PROGRAMME

The intensive training modules in Part I have been designed to provide all the skills and competencies needed to develop and carry out the subsequent FAIR-by-design project in the laboratory hosting the 7-month internship.
The 7 modules are listed below:
• Introduction to Open Science
• Scientific Programming Environment
• Cloud Data Environment
• Programming: Python for data management (comprising three parts)
• Data Infrastructures
• Tools for Data Management and Curation
• Introduction to Statistical Data Analysis and Machine Learning

The list of lecturers and their details are available at this page.

Introduction to Open Science – OS (20 hours)

Lecturers: Mariarita de Luca, Elena Giglia
MAIN TOPICS:
- Concepts and principles of Open Science
- Open Research Data and FAIR principles
- The European context: Horizon Europe and EOSC
- Research Data Management concepts and techniques
- Open Access to Publications
- Open Science policies
- Open Licensing and File Formats
- Practical exercises (e.g., the board game "Super-Open Researcher", DOAJ, Sherpa Romeo, Re3Data, Zenodo)

Scientific Programming Environment - SPE (20 hours)

Lecturers: Ruggero Lot, Gianfranco Gallizia, Niccolò Tosato
MAIN TOPICS:
- Introduction to Unix-like operating systems (kernel vs. userspace, processes/threads, file system semantics)
- Shell scripting (Bourne shell)
- Configuring, compiling, linking software packages
- Principal commands, shell, tools, and scripting
- Package managers (dnf, yum, zypper, pacman)
- Cgroups, resource monitoring
- Editors (Vim, nano) and file managers (ranger or broot, vim)
- Command line environment, org.freedesktop standard
- Access control (permissions, groups, home), SELinux
- Collaborative source code management: Git
- File system
- Network configuration
- Integrated development environments
- Visualization tools
- Debugging tools

Cloud Data Environment – CDE (20 hours)

Lecturer: Ruggero Lot
MAIN TOPICS:
- Introduction to Virtual Machines and Containers
- Dockerfiles and Podman
- Kubernetes
• Pods, Services
• LoadBalancer, Ingress, Gateway
• CNI (Flannel, Calico)
• Storage: PV, PVC, CSI (NFS, Ceph)
- Helm
- GitOps

Programming: Python for data management – PY (30 hours)

Lecturers: Matteo Poggi, Marco Franzon, Jacopo Nespolo
MAIN TOPICS - First part: introduction to Python
- Why it is so important in data science
- Basic features and good practices in Python
- How to check which version of Python is installed
- Creating an environment
- Different environment managers
- Python on different OS
- IDEs
- Datatypes
- Context managers
- Functions
- Classes
- Quick overview of software design (SOLID principles) in Python
- Unit testing with pytest
MAIN TOPICS - Second part: Jupyter notebooks (install, configure, etc.)
- Git
- Releasing code on GitHub/GitLab
- Sphinx or MkDocs
- How to document your code
MAIN TOPICS - Third part: data handling and visualization
- NumPy
• Intro
• Broadcasting
• Big data in NumPy
- Dask/Zarr
• Alternatives to NumPy for big data
• When and why to use Dask and Zarr
- Pandas
• Series and data frames
• Missing data aggregation
• Pandas to SQL
- Matplotlib
• Data visualization

Data Infrastructures – DI (20 hours)

Lecturer: Marco Prenassi
MAIN TOPICS:
- The lower level: Ansible, Ceph, buckets, modern file system features for data management and resilience
- Infrastructure scalability (horizontal vs vertical)
- Relational databases and SQL
- Optimizing DB schema
- ORM, SQLAlchemy
- REST API in Flask
- NoSQL databases
- Data lake concepts and implementation
- Data access control

Tools for Data Management and Curation – TDMC (24 hours)

Lecturers: Federica Bazzocchi, Stefano Cozzini
MAIN TOPICS:
- Data management and curation: Definitions, meaning and interplay. Data management as a combination of software, tools and best practices. Different approaches to data curation. Data and metadata concepts.
- Data Management Plans: Scientific data lifecycles. Data management plan structure and requirements. Useful tools for data management plans. FAIR assessment tools.
- Open Data Repository: Worldwide scenario, FAIR certification, solutions adopted in NFFA-DI.
- Metadata: Definition and importance in the data lifecycle. Collecting metadata. Electronic notebooks. Ontologies and vocabularies. Choosing a metadata schema.
- Formats: Metadata formats. Open data formats. Hierarchical Data Format and Research Object Crate. Useful packages.
- Scientific Data Management System: Architecture & Interfaces. Database system for scientific data. Relational and non-relational database systems.
- Workflows: Planning workflows for scientific data curation. Workflows for open science. Workflow tools and example libraries.

Introduction to Statistical Data Analysis and Machine Learning – SDA (24 hours)

Lecturers: Matteo Biagetti, Tommaso Rodani
MAIN TOPICS:
Introduction to probability theory and statistics:
- Elements of probability theory
- Sampling and sampling distributions
- Point and interval estimation
- Hypothesis testing
- Chi-square test
- Nonparametric methods
- Elements of Bayesian inference
- Model fitting and comparison
Introduction to Machine Learning and Neural Networks:
- Linear methods for regression and classification
- Kernel methods for regression and classification
- Learning with imbalanced/missing data
- Hyperparameter tuning, cross-validation
- Data intrinsic dimension and dimensionality reduction techniques
- Clustering methods