alpha-unito/streamflow-fl

Federated Learning with StreamFlow

This repository contains a StreamFlow Federated Learning (FL) pipeline based on PyTorch. The workflow trains a VGG16 model with Group Normalization over two datasets:

  • A standard version of MNIST;
  • A grayscaled version of SVHN.

The workflow is described with an extended version of CWL that introduces support for the Loop construct, necessary to describe the training-aggregate iteration of FL workloads.
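The training-aggregate iteration can be sketched in plain Python as follows. This is only an illustration of the loop structure, not the code used in this repository: the aggregation rule (a sample-weighted average, in the style of FedAvg) is an assumption, and `local_train`, `aggregate`, and `federated_loop` are hypothetical names. Local training is replaced by a trivial stand-in so the example runs without PyTorch.

```python
from typing import Dict, List

# A model is represented as a mapping from parameter name to a flat list of values.
Weights = Dict[str, List[float]]


def local_train(weights: Weights, shift: float) -> Weights:
    """Stand-in for one round of local training (hypothetical).

    A real worker would run PyTorch epochs on its private dataset;
    here every parameter is simply shifted by a constant.
    """
    return {name: [v + shift for v in values] for name, values in weights.items()}


def aggregate(updates: List[Weights], num_samples: List[int]) -> Weights:
    """Sample-weighted average of the workers' models (assumed FedAvg-style rule)."""
    total = sum(num_samples)
    first = updates[0]
    return {
        name: [
            sum(u[name][i] * n for u, n in zip(updates, num_samples)) / total
            for i in range(len(first[name]))
        ]
        for name in first
    }


def federated_loop(
    initial: Weights, shifts: List[float], num_samples: List[int], rounds: int
) -> Weights:
    """Repeat the train-then-aggregate step for a fixed number of rounds."""
    global_weights = initial
    for _ in range(rounds):
        # Each worker starts from the current global model.
        updates = [local_train(global_weights, s) for s in shifts]
        # The aggregator merges the local models into a new global model.
        global_weights = aggregate(updates, num_samples)
    return global_weights
```

In the actual workflow, each iteration of this loop corresponds to one pass through the CWL Loop construct, with the training steps running on the worker sites and the aggregation step running on the aggregator.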

Datasets have been placed onto two different HPC facilities:

  • Training on MNIST ran on the EPITO cluster at the University of Torino (one 80-core Arm Neoverse N1, 512 GB RAM, and two NVIDIA A100 GPUs per node);
  • Training on SVHN ran on the CINECA MARCONI100 cluster in Bologna (two 16-core IBM POWER9 AC922, 256 GB RAM, and four NVIDIA V100 GPUs per node).

Since HPC worker nodes cannot access the Internet through outbound connections, this workload cannot be managed by FL frameworks that require direct bidirectional connections between worker and aggregator nodes. Conversely, StreamFlow relies on a pull-based data transfer mechanism that overcomes this limitation.

To enable a direct comparison between StreamFlow and the Intel OpenFL framework, the pipeline has also been executed on two VMs (8 cores, 32 GB RAM, and 1 NVIDIA T4 GPU each) hosted on the HPC4AI Cloud at the University of Torino, acting as workers. In every configuration, the aggregation plane has been placed on the Cloud.

If you want to cite this work, please use the reference below:

@inproceedings{22:ml4astro,
  location  = {Catania, Italy},
  author    = {Iacopo Colonnelli and
               Bruno Casella and
               Gianluca Mittone and
               Yasir Arfat and
               Barbara Cantalupo and
               Roberto Esposito and
               Alberto Riccardo Martinelli and
               Doriana Medi\'{c} and
               Marco Aldinucci},
  booktitle = {Astrophysics and Space Science Proceedings},
  doi       = {10.1007/978-3-031-34167-0_39},
  editor    = {Filomena Bufano and
               Simone Riggi and
               Eva Sciacca and
               Francesco Schillir\`{o}},
  isbn      = {978-3-031-34167-0},
  pages     = {193--199},
  publisher = {Springer},
  address   = {Cham, Switzerland},
  title     = {Federated Learning meets {HPC} and cloud},
  volume    = {60},
  year      = {2023}
}

Usage

To run the experiment as is, clone this repository on the aggregator node and use the following commands:

python -m venv venv
source venv/bin/activate
pip install "streamflow==0.2.0.dev2"
pip install -r requirements.txt
streamflow run streamflow.yml

Reproducing the experiments in the same environment requires access to both HPC facilities and the HPC4AI Cloud. However, interested users can run the same pipeline on their preferred infrastructure by changing the deployment definitions in the streamflow.yml file and the corresponding Slurm/SSH scripts inside the environments folder.

Also, note that the Python dependencies listed in the requirements.txt file must be manually installed at every involved location (both the workers and the aggregator), and the datasets are expected to be already present on the worker nodes.
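Since the dependencies have to be installed by hand at each location, a quick check before launching the workflow can save a failed run. The script below is a minimal sketch, not part of this repository: `requirement_names` and `missing_packages` are hypothetical helpers, and the parsing only handles simple requirement lines (a package name optionally followed by a version specifier, extras, or a comment).

```python
import re
from importlib import metadata
from typing import Iterable, List


def requirement_names(lines: Iterable[str]) -> List[str]:
    """Extract bare package names from simple requirements.txt lines."""
    names = []
    for raw in lines:
        # Drop inline comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "-r other.txt".
        if not line or line.startswith("-"):
            continue
        # The name ends at the first character that cannot be part of it
        # (version specifier, extras bracket, environment marker, ...).
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing
```

To use it, open requirements.txt on each worker and on the aggregator, pass the file object to `requirement_names`, and pass the result to `missing_packages`; an empty list means every listed package is installed. The check only verifies that a package is present, not that its version matches the pinned one.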

Contributors

Iacopo Colonnelli iacopo.colonnelli@unito.it
Bruno Casella bruno.casella@unito.it
Marco Aldinucci marco.aldinucci@unito.it