Difference between revisions of "Resource:Seminar"

Revision as of 23:41, 4 October 2021

Time: 2021-10-08 8:40
Address: Main Building B1-612
Useful links: Reading list; Schedules; Previous seminars.

[SIGCOMM 2021] Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems, Xianyang
Abstract: Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively. Vid: https://www.youtube.com/watch?v=pHLIrkNj4w0
[NSDI 2021] EPaxos Revisited, Jianfei
Abstract: This paper re-evaluates the performance of the EPaxos consensus protocol for geo-replication and proposes an enhancement that uses synchronized clocks to reduce operation latency. The benchmarking approach used for the original EPaxos evaluation does not trigger or measure the full impact of conflict behavior on system performance. Our re-evaluation confirms the original claim that EPaxos provides optimal median commit latency in a WAN, but it shows much worse tail latency than previously reported (more than 4x worse than Multi-Paxos). Furthermore, performance is highly sensitive to application workloads, particularly at the tail. In addition, we show how synchronized clocks can be used to reduce conflicts in geo-replication. By imposing intentional delays on message processing, we can achieve roughly in-order deliveries to multiple replicas. When applied to EPaxos, this technique reduced conflicts by at least 50% without introducing additional overhead, decreasing mean latency by up to 7.5%. Vid: https://www.usenix.org/conference/nsdi21/presentation/tollman

[Topic] The path planning algorithm for multiple mobile edge servers in EdgeGO, Rong Cong, 2020-11-18

[Mobisys20] Combating packet collisions using non-stationary signal scaling in LPWANs, Wenliang Mao, 2020-11-18
[Topic] Dependency-Aware and Latency-Optimal Service Cache in Edge networks, Jiwei Mo, 2020-11-18
[talk] Paper Carnival 2020, ALL, 2020-09-24,25,26

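Please use the Latest_seminar and Hist_seminar templates to update the information on this page.

- Update the time and location information
- Copy the code of the current latest seminar section into this page
- Change {{Latest_seminar... to {{Hist_seminar..., and add the corresponding date information |date=
- Fill in each field of the latest seminar
- Never leave link empty; if there is no link, use this page's address https://mobinets.org/index.php?title=Resource:Seminar

Format description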
- Latest_seminar:

{{Latest_seminar
|confname=
|link=
|title=
|speaker=
}}

- Hist_seminar

{{Hist_seminar
|confname=
|link=
|title=
|speaker=
|date=
}}
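
- Example (illustrative only): a Hist_seminar entry filled in with the Hoplite talk listed on this page. The confname, link, title, and speaker values are copied from the listing and the revision diff on this page. The date value is an assumption taken from the "Time" line at the top, and its YYYY-MM-DD form is assumed from the history entries.

{{Hist_seminar
|confname=SIGCOMM 2021
|link=https://dl.acm.org/doi/pdf/10.1145/3452296.3472897
|title=Hoplite: efficient and fault-tolerant collective communication for task-based distributed systems
|speaker=Xianyang
|date=2021-10-08
}}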

@@ Line 7: / Line 7: @@
 ===Latest===
 {{Latest_seminar
-|abstract=Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively.
+|abstract=Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As more data-intensive applications move to run on top of task-based systems, collective communication efficiency has become an important problem. Unfortunately, traditional collective communication libraries (e.g., MPI, Horovod, NCCL) are an ill fit, because they require the communication schedule to be known before runtime and they do not provide fault tolerance. We design and implement Hoplite, an efficient and fault-tolerant collective communication layer for task-based distributed systems. Our key technique is to compute data transfer schedules on the fly and execute the schedules efficiently through fine-grained pipelining. At the same time, when a task fails, the data transfer schedule adapts quickly to allow other tasks to keep making progress. We apply Hoplite to a popular task-based distributed framework, Ray. We show that Hoplite speeds up asynchronous stochastic gradient descent, reinforcement learning, and serving an ensemble of machine learning models that are difficult to execute efficiently with traditional collective communication by up to 7.8x, 3.9x, and 3.3x, respectively. Vid: https://www.youtube.com/watch?v=pHLIrkNj4w0
-Vid: https://www.youtube.com/watch?v=pHLIrkNj4w0
 |confname=SIGCOMM 2021
 |link=https://dl.acm.org/doi/pdf/10.1145/3452296.3472897
@@ Line 15: / Line 14: @@
 }}
 {{Latest_seminar
-|abstract=This paper re-evaluates the performance of the EPaxos consensus protocol for geo-replication and proposes an enhancement that uses synchronized clocks to reduce operation latency. The benchmarking approach used for the original EPaxos evaluation does not trigger or measure the full impact of conflict behavior on system performance. Our re-evaluation confirms the original claim that EPaxos provides optimal median commit latency in a WAN, but it shows much worse tail latency than previously reported (more than 4x worse than Multi-Paxos). Furthermore, performance is highly sensitive to application workloads, particularly at the tail. In addition, we show how synchronized clocks can be used to reduce conflicts in geo-replication. By imposing intentional delays on message processing, we can achieve roughly in-order deliveries to multiple replicas. When applied to EPaxos, this technique reduced conflicts by at least 50% without introducing additional overhead, decreasing mean latency by up to 7.5%.
+|abstract=This paper re-evaluates the performance of the EPaxos consensus protocol for geo-replication and proposes an enhancement that uses synchronized clocks to reduce operation latency. The benchmarking approach used for the original EPaxos evaluation does not trigger or measure the full impact of conflict behavior on system performance. Our re-evaluation confirms the original claim that EPaxos provides optimal median commit latency in a WAN, but it shows much worse tail latency than previously reported (more than 4x worse than Multi-Paxos). Furthermore, performance is highly sensitive to application workloads, particularly at the tail. In addition, we show how synchronized clocks can be used to reduce conflicts in geo-replication. By imposing intentional delays on message processing, we can achieve roughly in-order deliveries to multiple replicas. When applied to EPaxos, this technique reduced conflicts by at least 50% without introducing additional overhead, decreasing mean latency by up to 7.5%. Vid: https://www.usenix.org/conference/nsdi21/presentation/tollman
-Vid: https://www.usenix.org/conference/nsdi21/presentation/tollman
 |confname=NSDI 2021
 |link=https://www.usenix.org/system/files/nsdi21-tollman.pdf
