Performance fix for allreduce on large buffers (facebookincubator#192)
Summary: See facebookincubator#169 for the issue and the original post reporting it.

The root of this performance issue is an optimization that keeps intermediate buffers bounded, regardless of the size of the input/output buffers. To pipeline communication with reduction, every chunk sent to a neighboring process is split into a minimum of 2 segments. For large buffers, the number of segments per chunk may be larger, depending on the maximum segment size setting. The implementation of the algorithm still assumed exactly 2 segments per chunk and would run for more iterations than needed. This produced garbage values for the additional segments processed in the reduce-scatter phase. The garbage did not affect correctness, because it was later overwritten by the allgather phase. The only side effect of the extra work was additional bytes on the wire, so the algorithm ran slower. For large inputs (>= 10MB) and a low number of processes (e.g. 2), this meant up to twice the projected amount of data on the wire.

This commit addresses the error and restores performance for large input buffers to what we expect from the algorithm. It also addresses a potential issue with segment size imbalance: if the natural segment size is a single byte larger than the maximum, the algorithm would run as if the input were padded to the closest integer multiple of the maximum segment size. In terms of performance, this can look like a performance cliff for input sizes that result in a segment size close to the maximum segment size.

The benchmark tool is updated to include a run for the new style ring allreduce algorithm. The benchmark results below come from running the new version on 2 machines connected by 10Gb Ethernet. In particular, look at the average bandwidth for runs with 1000000 elements and larger. Without the fix, there is a steep drop. With the fix, it remains at the link bandwidth.
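The reduce-scatter/allgather structure described above can be illustrated with a pure-Python simulation of a ring allreduce (sum reduction). This is a hypothetical sketch of the phase structure only, not the actual Gloo implementation; the function name and the assumption that the buffer length divides evenly by the number of ranks are illustrative:

```python
def ring_allreduce(bufs):
    # Simulate a ring allreduce (sum) over per-rank buffers.
    # Assumes len(buffer) is divisible by the number of ranks.
    p = len(bufs)
    out = [list(b) for b in bufs]
    c = len(out[0]) // p  # elements per chunk

    # Reduce-scatter: after p-1 steps, rank r holds the fully
    # reduced chunk with index (r + 1) % p.
    for step in range(p - 1):
        segs = [((r + 1) % p, (r - step) % p,
                 out[r][(r - step) % p * c:((r - step) % p + 1) * c])
                for r in range(p)]
        for dst, i, seg in segs:
            for k in range(c):
                out[dst][i * c + k] += seg[k]

    # Allgather: circulate the reduced chunks until every rank
    # holds the complete reduced buffer.
    for step in range(p - 1):
        segs = [((r + 1) % p, (r + 1 - step) % p,
                 out[r][(r + 1 - step) % p * c:((r + 1 - step) % p + 1) * c])
                for r in range(p)]
        for dst, i, seg in segs:
            out[dst][i * c:(i + 1) * c] = seg
    return out
```

The bug described above amounts to the reduce-scatter loop running for extra iterations when a chunk was split into more than 2 segments, which put garbage into chunks that the allgather phase later overwrote with correct values.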
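The segment sizing logic, including the fix for the imbalance cliff, can be sketched as follows. This is a hypothetical illustration under assumed names (`segments_per_chunk`, `segment_size`), not the actual Gloo code:

```python
import math

def segments_per_chunk(chunk_bytes, max_segment_bytes):
    # Pipelining requires at least 2 segments per chunk; larger
    # chunks need more segments so none exceeds the maximum.
    return max(2, math.ceil(chunk_bytes / max_segment_bytes))

def segment_size(chunk_bytes, max_segment_bytes):
    # Divide the chunk evenly over its segments instead of padding
    # to a multiple of the maximum segment size. This avoids the
    # cliff where a chunk one byte over the maximum behaves as if
    # it were padded to the next multiple of the maximum.
    n = segments_per_chunk(chunk_bytes, max_segment_bytes)
    return math.ceil(chunk_bytes / n)
```

For example, with a 1000-byte maximum, a 1001-byte chunk splits into 2 segments of 501 bytes each, rather than behaving like a 2000-byte chunk.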
Without the fix:
```
Devices:
- tcp, pci=0000:01:00.0, iface=enp1s0f0, speed=10000, addr=[100.97.16.108]
- tcp, pci=0000:01:00.0, iface=enp1s0f0, speed=10000, addr=[100.97.16.108]
Algorithm: new_allreduce_ring
Options:   processes=2, inputs=1, threads=2

 elements   min (us)   p50 (us)   p99 (us)   max (us)   avg (GB/s)   samples
   100000        672        945       1180       1825        0.785      3664
   200000       1190       1652       2086       2229        0.920      2290
   500000       2957       3424       4101       4624        1.072       872
  1000000       5916       6820       7642       7846        1.088       568
  2000000      17172      21496      22860      23047        0.691       180
  5000000      57343      64547      75051      75051        0.575        60
```

With the fix:
```
Devices:
- tcp, pci=0000:01:00.0, iface=enp1s0f0, speed=10000, addr=[100.97.16.108]
- tcp, pci=0000:01:00.0, iface=enp1s0f0, speed=10000, addr=[100.97.16.108]
Algorithm: new_allreduce_ring
Options:   processes=2, inputs=1, threads=2

 elements   min (us)   p50 (us)   p99 (us)   max (us)   avg (GB/s)   samples
   100000        696        956       1214       2156        0.777      3492
   200000       1252       1667       2225       2906        0.910      2240
   500000       2606       3424       4096       4425        1.070      1026
  1000000       4367       6825       9270      11283        1.081       494
  2000000      10980      13661      14527      14674        1.086       292
  5000000      32987      34462      35088      35854        1.082       114
```

Closes facebookincubator#169.

Pull Request resolved: facebookincubator#192

Reviewed By: mrshenli

Differential Revision: D16517254

Pulled By: pietern

fbshipit-source-id: bce3f6af30b747e89cc96de9331441267be340e0