ML System Design Session #1: Federated Learning

Jul 9, 2021

Artem Yushkovsky (Neu.ro) and Paulo Maia (NILG.AI)

The initiative

The original idea of these ML System Design Sessions was to be somewhat similar to a community reading club format, where one or multiple team members research a paper, present it to the group and then discuss it. But this format, while awesome for diving into new research topics, can sometimes feel lacking in the practical part.

So, we at NILG.AI and Neu.ro decided to try another format where the topic is not a specific paper but an entire research area. After a short discussion, we move on to a System Design part where the team describes a specific use-case to which the new approach is applied. Ideally, the discussion would stick to the format of a typical System Design interview; our first exploratory attempt, however, turned out to be rather freestyle.

In our Session #1, held on 2021-05-27, an ML team from NILG.AI led by Paulo Maia met with an MLOps team from Neu.ro led by Artem Yushkovsky. The leaders researched the topic beforehand and prepared a ~30-minute theoretical presentation so that everyone could be on the same page. Then, we had a ~90-minute practical part where both teams discussed technical aspects (both ML and MLOps) of the architecture for a given use-case, putting their thoughts on a Miro board.

In this article, we would like to share a high-level overview of the outcomes of this first session and briefly discuss our takeaways.

Please Note: none of us are experts in the topic of Federated Learning and Differential Privacy — our conclusions are based on the performed research and the following brainstorming session.

Inputs (reading materials)

Topic: Federated Learning.
https://arxiv.org/abs/1902.01046 Original paper: Towards Federated Learning at Scale: System Design (Bonawitz et al; Feb 2019)
https://federated.withgoogle.com/ — An online comic explaining the concepts of federated learning.
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html — Google AI Blog: Federated Learning: Collaborative Machine Learning without Centralized Training Data (McMahan et al; April, 2017)
https://www.tensorflow.org/federated/tutorials/federated_learning_for_image_classification TensorFlow tutorial: Federated Learning for Image Classification.

Outputs (planned outcomes)

Both teams to have a general understanding of the concepts of Federated Learning.
Deepened theoretical understanding of the topic through a system design session where the key aspects of the solution are discussed.

Theoretical part (presentation)

Please find the slides here.

What is Federated Learning and where can it be applied?

In a few words, Federated Learning (FL) is a set of techniques to train ML models on end-user devices while preserving the privacy of the user’s sensitive data. At each iteration, devices download the current model, improve it by learning from local sensitive data, and summarize the changes as a small, focused model update, which is then sent to the cloud in encrypted form. The cloud then averages the updates and uses the result to improve the shared model. Thus, the cloud cannot see any individual update, only the averaged one. The sensitive data itself is stored solely on the devices and is never shared.
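To make the averaging step concrete, here is a minimal sketch in plain Python. It assumes that each device reports a weight delta together with the number of local examples it trained on, and that the server weights each delta by that count (as in the commonly used Federated Averaging scheme). The names `federated_average` and `apply_update` and the sample numbers are ours, chosen for illustration only; encryption is left out entirely.

```python
from typing import List, Sequence, Tuple

# (weight delta computed on the device, number of local training examples)
Update = Tuple[Sequence[float], int]


def federated_average(updates: List[Update]) -> List[float]:
    """Average per-device weight deltas, weighted by local example counts."""
    if not updates:
        raise ValueError("need at least one device update")
    total_examples = sum(n for _, n in updates)
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for delta, n in updates:
        for i in range(dim):
            averaged[i] += delta[i] * n / total_examples
    return averaged


def apply_update(global_weights: Sequence[float], avg_delta: Sequence[float]) -> List[float]:
    """Move the shared model by the averaged delta."""
    return [w + d for w, d in zip(global_weights, avg_delta)]


# Three hypothetical devices holding different amounts of local data.
device_updates = [
    ([0.10, -0.20], 10),
    ([0.30, 0.00], 30),
    ([-0.10, 0.40], 60),
]
avg_delta = federated_average(device_updates)
new_weights = apply_update([1.0, 1.0], avg_delta)
```

Note that only the deltas and the example counts reach the server in this sketch; the local data that produced them never appears in the code path.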

Federated Learning can be a good solution in cases where strict performance and privacy requirements apply. Initially, FL was developed to support self-adjusting suggestions for Google Keyboard (Android phones). The user’s phone locally stores information about the current context and whether the user clicked the suggestion. However, FL has many more applications, such as voice recognition, photo ranking, autonomous cars or medical data (the use-case of this session).

Federated Learning — Technical Overview

In the FL scheme, end-user devices participate in the model improvement process voluntarily. For example, a mobile phone would compute the model update only when it’s connected to a high-bandwidth network and to a power supply, so as not to degrade the user experience. As such, we are dealing with a potentially very large population of user devices, only a fraction of which may be available for training at a given point in time. The whole process, shown in the figure below, consists of several training rounds, each with the following steps:

Selection: a subset of the devices available for training is selected by the server, while the others are rejected.
Training: the training is done on-device for variable durations, after an initial model configuration is loaded from persistent storage. Some devices might fail and are dropped from the current round.
Reporting: results are sent to the server and aggregated. The next training round then starts again from the selection step.

Figure source: Towards Federated Learning at Scale: System Design (2019)
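The three steps of a round can be simulated in a few lines. The sketch below is a toy under stated assumptions, not the protocol from the paper: the "model" is a single number, `local_training` is a hypothetical stand-in for on-device training, and the 20% dropout rate, the 30% availability and the goal count of 10 devices are all made up for illustration.

```python
import random
from typing import List, Optional


def local_training(model: float, rng: random.Random) -> Optional[float]:
    """Hypothetical on-device step: return a model delta, or None if the device drops out."""
    if rng.random() < 0.2:  # assumed 20% failure rate, illustration only
        return None
    local_target = 5.0 + rng.uniform(-1.0, 1.0)  # stands in for private local data
    return 0.5 * (local_target - model)


def run_round(model: float, available: List[int], goal_count: int, rng: random.Random) -> float:
    # 1. Selection: the server picks a subset of the available devices, the rest are rejected.
    selected = rng.sample(available, min(goal_count, len(available)))
    # 2. Training: each selected device trains locally; some fail and report nothing.
    reports = [local_training(model, rng) for _ in selected]
    deltas = [delta for delta in reports if delta is not None]
    # 3. Reporting: the server aggregates whatever arrived and updates the shared model.
    if not deltas:
        return model  # nothing reported, keep the current model
    return model + sum(deltas) / len(deltas)


rng = random.Random(0)
model = 0.0
for _ in range(20):
    # Only a fraction of the 100 simulated devices is available in any given round.
    available = [device for device in range(100) if rng.random() < 0.3]
    model = run_round(model, available, goal_count=10, rng=rng)
```

Even with devices dropping out and a different subset taking part in each round, the shared value drifts towards the population average of the local targets, which is the behaviour the round structure is designed to provide.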

While some devices train the model, others can be used to test it. Evaluation, like training, is done with no need to access user data. You can also use proxy data for unit tests, such as text from Wikipedia that mimics the text typed on a mobile keyboard.

Federated Learning, due to its uniqueness, has some specific challenges regarding data quality, model training, model QA and privacy concerns:

Privacy concerns

Sensitive data protection: the ML Engineers don’t have access to the raw data used for training the model
Anonymous system monitoring and analytics

Data Quality

Can be noisy, unreliably labeled
Is highly unbalanced (different amounts of data on different devices)
Individual training examples are not directly inspectable

Model Training

Strict requirements on the resources (RAM, CPU, GPU, etc.), different runtime versions
Longer convergence time
Need device synchronization mechanisms
Devices are unreliable (might fail)

Model QA

On-device model evaluation
Rollout and rollback of model versions on the devices
A/B testing
Data drift detection

Many of these challenges were considered during the System Design part (see below).

Federated Learning — Server Architecture

On the server side, an actor model is followed, with the coordinators providing global synchronization, the master aggregators managing the rounds of each FL task, and the selectors accepting and forwarding device connections. This provides failure robustness and lets us anonymously log some device activity and health parameters to the cloud as well as automatically monitor time series and create system alerts.

There are some extra privacy features, such as Secure Aggregation (which enables the server to combine the encrypted results and only decrypt the aggregate) and differential privacy (a limit on how much each device can contribute, plus the ability to add noise to obscure rare data), that will not be covered in more detail. If you are curious, see for example Learning Differentially Private Recurrent Language Models (2018).
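For intuition only, both ideas can be sketched in a toy form. Assumptions: `clip_update` bounds each device's contribution by its L2 norm, `noisy_sum` adds Gaussian noise to the aggregate, and `mask_updates` imitates Secure Aggregation with pairwise random masks that cancel in the sum. The real Secure Aggregation protocol is considerably more involved (key agreement, handling of dropped devices), and every name and parameter value here is hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]


def clip_update(update: Vector, max_norm: float) -> Vector:
    """Scale the update down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm:
        return list(update)
    return [x * max_norm / norm for x in update]


def mask_updates(updates: List[Vector], rng: random.Random) -> List[Vector]:
    """Add pairwise masks: device i adds the mask, device j subtracts it, so the sum is unchanged."""
    masked = [list(u) for u in updates]
    dim = len(updates[0])
    for i in range(len(updates)):
        for j in range(i + 1, len(updates)):
            mask = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] += mask[k]
                masked[j][k] -= mask[k]
    return masked


def noisy_sum(masked: List[Vector], noise_std: float, rng: random.Random) -> Vector:
    """Sum the masked updates coordinate-wise and add Gaussian noise to each coordinate."""
    dim = len(masked[0])
    return [sum(u[k] for u in masked) + rng.gauss(0.0, noise_std) for k in range(dim)]


rng = random.Random(42)
raw = [[3.0, 4.0], [0.3, 0.4], [-0.6, 0.8]]
clipped = [clip_update(u, max_norm=1.0) for u in raw]
masked = mask_updates(clipped, rng)
# noise_std=0.0 here only to show that the masks cancel exactly;
# a differentially private setup would use a positive value.
aggregate = noisy_sum(masked, noise_std=0.0, rng=rng)
```

Each masked vector on its own looks like random noise to the server, yet the sum of the masked vectors equals the sum of the clipped updates.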

System Design Discussion

For the practical part, we have chosen the following use case:

You work for a startup making ML solutions for medical institutes
There’s a central cloud owned by the startup
Patients’ personal data (ID info) must not leave the hospital
M models for image processing/patient diagnosis
N hospitals (each can use multiple models)
K devices (doctors’ mobile phones)
Each hospital has a central data hub connected to its IT system. There are some networking concerns: data can only be accessed within the hospital network (but the hospital network, from the inside, can access external networks).
Users: a hospital lab worker (U1) loads X-ray images; a doctor (U2) loads the image, processes it via the on-device model and accesses the predictions.
After some time (up to 3 months, once more information is available), these results may be updated, and only then do we have access to the labels.
Three times per week, at night, the devices are connected to power and the internet and can perform updates.
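The last constraint translates into a simple eligibility check on the device. The sketch below is one way to express it; the concrete weekdays and the 01:00 to 05:00 window are assumptions (the use case only says "3 times per week, at night"), and `DeviceState` and `can_train_now` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass
class DeviceState:
    is_charging: bool
    on_hospital_network: bool
    local_time: datetime


# Assumed schedule, not specified in the use case.
UPDATE_WEEKDAYS = {0, 2, 4}  # Monday, Wednesday, Friday
NIGHT_START, NIGHT_END = time(1, 0), time(5, 0)


def can_train_now(state: DeviceState) -> bool:
    """A device may train only when charging, online, and inside the nightly window."""
    in_window = (
        state.local_time.weekday() in UPDATE_WEEKDAYS
        and NIGHT_START <= state.local_time.time() < NIGHT_END
    )
    return state.is_charging and state.on_hospital_network and in_window


# 2021-05-26 was a Wednesday, so 02:30 falls inside the assumed window.
eligible = can_train_now(DeviceState(True, True, datetime(2021, 5, 26, 2, 30)))
```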

Specific Challenges

In our brainstorming session, we covered several topics related to building Federated Learning systems, such as Performance Issues, Model and Data Issues, Networking, …

For each of these, we dove a bit deeper into what could happen. For brevity, the detailed conclusions are not reproduced here, only an example of how the brainstorming session was run. However, you can find the view-only Miro board here.

In the end, we defined an overall system architecture:

Each hospital would be connected to one aggregator, which manages the aggregation of results for that hospital. A master aggregator then combines the results from all hospital aggregators. This way, each aggregator needs access to only a single location’s network, and since there is no external access to that network, the updates are pull-based.
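A rough sketch of this two-level aggregation follows. It rests on two assumptions of ours: updates are combined as example-weighted averages, and "pull-based" means that every connection is initiated from inside the hospital network (the hospital aggregator pulls the current model and hands over a single aggregate). The class and method names are hypothetical and were not part of the session's design.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

WeightedUpdate = Tuple[List[float], int]  # (averaged delta, number of examples behind it)


def combine(updates: List[WeightedUpdate]) -> WeightedUpdate:
    """Example-weighted average of several updates, plus the total example count."""
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [sum(delta[i] * n for delta, n in updates) / total for i in range(dim)]
    return avg, total


@dataclass
class MasterAggregator:
    model: List[float]
    inbox: List[WeightedUpdate] = field(default_factory=list)

    def current_model(self) -> List[float]:
        return list(self.model)

    def submit(self, summary: WeightedUpdate) -> None:
        self.inbox.append(summary)

    def finish_round(self) -> List[float]:
        """Combine the per-hospital summaries and apply them to the shared model."""
        if self.inbox:
            avg, _ = combine(self.inbox)
            self.model = [w + d for w, d in zip(self.model, avg)]
            self.inbox = []
        return list(self.model)


@dataclass
class HospitalAggregator:
    name: str
    pending: List[WeightedUpdate] = field(default_factory=list)

    def collect(self, delta: List[float], n_examples: int) -> None:
        """Receive one device update inside the hospital network."""
        self.pending.append((list(delta), n_examples))

    def sync(self, master: MasterAggregator) -> List[float]:
        """Initiated from inside the hospital: pull the model, hand over one aggregate."""
        model = master.current_model()
        if self.pending:
            master.submit(combine(self.pending))
            self.pending = []
        return model


master = MasterAggregator(model=[0.0])
hospital_a = HospitalAggregator("hospital-a")
hospital_b = HospitalAggregator("hospital-b")
hospital_a.collect([1.0], 10)
hospital_a.collect([3.0], 30)
hospital_b.collect([5.0], 60)
for hospital in (hospital_a, hospital_b):
    hospital.sync(master)
updated_model = master.finish_round()
```

Because the example counts travel with each summary, the two-level result matches what a single flat weighted average over all three device updates would give, while the master only ever sees one aggregate per hospital.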

Whenever a new device/hospital is registered, all that is required is to add a new node and update the Federated Learning plans.

Conclusions

Federated Learning is an interesting approach for use-cases with strict patient privacy requirements and minimal hardware requirements, and it offers the capability of using data from different sources and locations.
However, the applicable ML algorithms are limited, and as such model performance can be worse than with traditional centralized training, given that we’re training models with small batches and have less visibility into input quality. It’s a better approach for sharing models with other hospitals: you can even have more variability of data samples, with fewer geographic restrictions.
There are some open topics regarding model governance: who owns the model in this case? The developers, the labelers (doctors), or the hospital itself? As a consequence, who is responsible for any mistakes that might be made? Who is monitoring, analysing and making decisions about the deployed model?
It was a very interesting experience to system-design a solution in such a collaborative format, joining the expertise of one company in Data Science (NILG.AI) with that of another in MLOps (Neu.ro).