SWARM Parallel with Asynchronous Updates
We significantly improve the reliability, robustness, and speed of pipeline-parallel training by making weight updates asynchronous.
Protocol Learning is multi-participant collaborative training and inference of foundation models over the internet. Core to the realisation of Protocol Learning is the ability to carry out a multi-party training run over the internet, where the model itself is split across an elastic and dynamic pool of compute. Today, we're previewing more research from our lab, presented as a talk at the ICLR 2025 Modularity for Collaborative, Decentralized, and Continual Deep Learning (MCDC) workshop. This work explores how to make pipeline parallelism more practical for fully elastic swarms with heterogeneous devices.
Exploring Asynchronism in SWARM Parallelism
Our starting point is the SWARM parallelism framework, which, aside from our recent work, is one of the few efforts focused on model-parallel training in a low-bandwidth setting. SWARM dynamically routes each forward pass through the computational graph across the available participants, and is hence robust to a noisy and elastic pool of devices. Built on the Hivemind framework, it supports fault-tolerant, pipeline-parallel training with multiple worker nodes per stage; a visualisation from the original paper appears below.
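To make the routing behaviour concrete, here is a minimal sketch of how a node might pick a peer for the next pipeline stage and reroute on failure. The class name, the peer handles, and the `peer.process` call are hypothetical illustrations, not the actual Hivemind API:

```python
import random

class StageRouter:
    """Minimal sketch of SWARM-style stochastic routing (hypothetical names,
    not the real Hivemind API). Each pipeline stage is served by a pool of
    interchangeable peers; a sender samples one and, on failure, simply
    reroutes to another, which is what makes the pipeline tolerant of
    peers dropping out mid-run."""

    def __init__(self, peers_per_stage):
        # peers_per_stage: dict mapping stage index -> list of peer handles
        self.peers_per_stage = peers_per_stage

    def forward(self, stage, activations, max_retries=3):
        candidates = list(self.peers_per_stage[stage])
        for _ in range(max_retries):
            if not candidates:
                break
            peer = random.choice(candidates)  # could also weight by measured throughput
            try:
                return peer.process(activations)  # remote call to the next stage (assumed)
            except ConnectionError:
                candidates.remove(peer)  # peer left the swarm; reroute
        raise RuntimeError(f"no live peer available for stage {stage}")
```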
In our research, we identified a bottleneck: the synchronous nature of the periodic weight synchronization step forces all nodes to wait for each other. This is especially problematic in elastic swarms, where nodes can join or leave dynamically, and when dealing with heterogeneous devices of varying speeds. By allowing nodes to perform local updates asynchronously and only periodically synchronizing their states, we've made the system roughly twice as fast. The main technical challenge is maintaining model convergence despite the "staleness" that naturally arises in asynchronous settings, where gradients are computed against weights that have since been updated.
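As a rough sketch, each worker's loop then looks like the following. This is illustrative pseudocode, not our exact implementation; `average_weights` is an assumed helper standing in for the periodic synchronization:

```python
def worker_loop(model, optimizer, data_iter, average_weights, sync_every=100):
    """Illustrative sketch of the asynchronous scheme (our pseudocode, not
    the exact implementation). Each worker takes local optimizer steps
    without any global barrier, and only every `sync_every` steps averages
    its weights with whichever peers are currently in the swarm. Between
    syncs, replicas drift apart, so gradients are computed against slightly
    stale weights -- the effect the NAG correction (next section) targets."""
    for step, batch in enumerate(data_iter):
        loss = model(batch)            # assume the model returns a scalar loss
        loss.backward()
        optimizer.step()               # local update; no waiting on other nodes
        optimizer.zero_grad()
        if (step + 1) % sync_every == 0:
            average_weights(model)     # periodic synchronization point (assumed helper)
```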

NAG Adaptation in SWARM
We developed a novel approach using Nesterov Accelerated Gradient (NAG) to counteract gradient staleness in asynchronous training. Unlike standard momentum, which accumulates past gradients into a velocity term and applies it to the current update, NAG evaluates the gradient at a look-ahead point (the current weights shifted by the velocity), which reduces overshooting and improves stability.
The velocity component acts as a weight-correction term, compensating for gradient staleness by anticipating the future weight position. This allows NAG to mitigate stale updates by adjusting the weight trajectory based on past updates and the estimated future gradient.
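Concretely, the base update rule is the standard NAG rewrite used by deep-learning frameworks. Below is a minimal PyTorch-style sketch in our own notation; the hyperparameter values are placeholders, and this shows only the optimizer step, not how it is wired into the asynchronous swarm:

```python
import torch

@torch.no_grad()
def nag_step(params, velocities, lr=0.01, mu=0.9):
    """One NAG step in the common 'momentum rewrite' form (a sketch in our
    own notation; `lr` and `mu` are placeholder values, not from the paper):
        v <- mu * v + g
        w <- w - lr * (g + mu * v)
    The extra mu * v term applies the gradient as if it were taken at the
    anticipated future weights -- the look-ahead that counteracts stale
    gradients in the asynchronous swarm."""
    for w, v in zip(params, velocities):
        g = w.grad
        v.mul_(mu).add_(g)               # v <- mu * v + g
        w.add_(g + mu * v, alpha=-lr)    # w <- w - lr * (g + mu * v)
```

For reference, PyTorch exposes this same rule as `torch.optim.SGD(params, lr=lr, momentum=mu, nesterov=True)`; the sketch above only spells out the arithmetic the look-ahead correction relies on.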
Results
We performed experiments on the WikiText dataset against the SWARM baseline, training a Transformer language model with a Llama-3-based architecture; to our knowledge, this is the first time SWARM results have been reported with a modern LLM architecture. The results exceeded our expectations:

Up to 45.9% improvement in wall-clock time with a batch size of 64.
Consistent outperformance of both synchronous SWARM and basic asynchronous implementations.
Stability even with reduced warmup periods and artificial network delays.

What We Found
The main result of our work is that the NAG-adjusted asynchrony we introduce directly improves training time without sacrificing model performance. The system becomes faster, better handles elastic swarms where nodes come and go, and is more robust to heterogeneous devices with varying compute capabilities. In our experiments, we observed no drawbacks.

Final Thoughts
We're sharing this research as a window into our ongoing work on Protocol Learning. While there's significantly more to explore, we hope this small modification to existing approaches helps make decentralized model training a bit more practical. We'll continue investigating ways to make collaborative training more accessible and efficient, but for now, we're encouraged that even simple changes can yield meaningful improvements in this domain.