Difference between revisions of "Resource:Seminar"

Latest revision as of 11:36, 28 May 2024

Time: Friday 10:30-12:00
Address: 4th Research Building A518
Useful links: Readling list; Schedules; Previous seminars.

[SIGCOMM '23] Masking Corruption Packet Losses in Datacenter Networks with Link-local Retransmission, Jiacheng
Abstract: Packet loss due to link corruption is a major problem in large warehouse-scale datacenters. The current state-of-the-art approach of disabling corrupting links is not adequate because, in practice, all the corrupting links cannot be disabled due to capacity constraints. In this paper, we show that, it is feasible to implement link-local retransmission at sub-RTT timescales to completely mask corruption packet losses from the transport endpoints. Our system, LinkGuardian, employs a range of techniques to (i) keep the packet buffer requirement low, (ii) recover from tail packet losses without employing timeouts, and (iii) preserve packet ordering. We implement LinkGuardian on the Intel Tofino switch and show that for a 100G link with a loss rate of 10−3, LinkGuardian can reduce the loss rate by up to 6 orders of magnitude while incurring only 8% reduction in effective link speed. By eliminating tail packet losses, LinkGuardian improves the 99.9th percentile flow completion time (FCT) for TCP and RDMA by 51x and 66x respectively. Finally, we also show that in the context of datacenter networks, simple out-of-order retransmission is often sufficient to significantly mitigate the impact of corruption packet loss for short TCP flows.
[NSDI '23] ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems, Haotian
Abstract: Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting the high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, the ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide rang query service via the high-performance network. However, existing ordered KVs fail to work well on the disaggregated memory systems, due to either consuming multiple network roundtrips to search the remote data or heavily relying on the memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to coalesce the ordered KV store in the disaggregated systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme to dissociate the model retraining from data modification operations via adding a bias and some data-movement constraints to learned models. Based on the operation decoupling, data modifications are directly executed in compute nodes via one-sided RDMA verbs with high scalability. The model retraining is hence removed from the critical path of data modification and asynchronously executed in memory nodes by using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on the static workloads, as well as significantly improving the performance on dynamic workloads by up to 2.2 times than state-of-the-art schemes on the disaggregated memory systems. We have released the open-source codes for public use in GitHub.

[Topic] [ The path planning algorithm for multiple mobile edge servers in EdgeGO], Rong Cong, 2020-11-18

[Mobisys20] Combating packet collisions using non-stationary signal scaling in LPWANs, Wenliang Mao, 2020-11-18
[Topic] [ Dependency-Aware and Latency-Optimal Service Cache in Edge networks], Jiwei Mo, 2020-11-18
[talk] Paper Carnival 2020, ALL, 2020-09-24,25,26

请使用Latest_seminar和Hist_seminar模板更新本页信息.

- 修改时间和地点信息
- 将当前latest seminar部分的code复制到这个页面中
- 将{{Latest_seminar... 修改为 {{Hist_seminar...，并增加对应的日期信息|date=
- 填入latest seminar各字段信息
- link请务必不要留空，如果没有link则填本页地址 https://mobinets.org/index.php?title=Resource:Seminar

格式说明
- Latest_seminar:

{{Latest_seminar
|confname=
|link=
|title=
|speaker=
}}

- Hist_seminar

{{Hist_seminar
|confname=
|link=
|title=
|speaker=
|date=
}}

@@ Line 1: / Line 1: @@
 {{SemNote
-|time='''2022-6-27 10:30'''
+|time='''Friday 10:30-12:00'''
-|addr=4th Research Building A527-B
+|addr=4th Research Building A518
 |note=Useful links: [[Resource:Reading_List|Readling list]]; [[Resource:Seminar_schedules|Schedules]]; [[Resource:Previous_Seminars|Previous seminars]].
 }}
@@ Line 7: / Line 7: @@
 ===Latest===
 {{Latest_seminar
-|abstract = Federated learning (FL) has emerged in edge computing to address limited bandwidth and privacy concerns of traditional cloud-based centralized training. However, the existing FL mechanisms may lead to long training time and consume a tremendous amount of communication resources. In this paper, we propose an efficient FL mechanism, which divides the edge nodes into K clusters by balanced clustering. The edge nodes in one cluster forward their local updates to cluster header for aggregation by synchronous method, called cluster aggregation, while all cluster headers perform the asynchronous method for global aggregation. This processing procedure is called hierarchical aggregation. Our analysis shows that the convergence bound depends on the number of clusters and the training epochs. We formally define the resource-efficient federated learning with hierarchical aggregation (RFL-HA) problem. We propose an efficient algorithm to determine the optimal cluster structure (i.e., the optimal value of K) with resource constraints and extend it to deal with the dynamic network conditions. Extensive simulation results obtained from our study for different models and datasets show that the proposed algorithms can reduce completion time by 34.8%-70% and the communication resource by 33.8%-56.5% while achieving a similar accuracy, compared with the well-known FL mechanisms.
+|abstract=Packet loss due to link corruption is a major problem in large warehouse-scale datacenters. The current state-of-the-art approach of disabling corrupting links is not adequate because, in practice, all the corrupting links cannot be disabled due to capacity constraints. In this paper, we show that, it is feasible to implement link-local retransmission at sub-RTT timescales to completely mask corruption packet losses from the transport endpoints. Our system, LinkGuardian, employs a range of techniques to (i) keep the packet buffer requirement low, (ii) recover from tail packet losses without employing timeouts, and (iii) preserve packet ordering. We implement LinkGuardian on the Intel Tofino switch and show that for a 100G link with a loss rate of 10−3, LinkGuardian can reduce the loss rate by up to 6 orders of magnitude while incurring only 8% reduction in effective link speed. By eliminating tail packet losses, LinkGuardian improves the 99.9th percentile flow completion time (FCT) for TCP and RDMA by 51x and 66x respectively. Finally, we also show that in the context of datacenter networks, simple out-of-order retransmission is often sufficient to significantly mitigate the impact of corruption packet loss for short TCP flows.
-|confname= INFOCOM 2021
+|confname=SIGCOMM '23
-|link=https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=9488756
+|link=https://dl.acm.org/doi/pdf/10.1145/3603269.3604853
-|title=Resource-Efficient Federated Learning with Hierarchical Aggregation in Edge Computing
+|title=Masking Corruption Packet Losses in Datacenter Networks with Link-local Retransmission
-|speaker=Jianqi
+|speaker=Jiacheng
-}}
+|date=2024-05-31}}
+{{Latest_seminar
-=== History ===
+|abstract=Disaggregated memory systems separate monolithic servers into different components, including compute and memory nodes, to enjoy the benefits of high resource utilization, flexible hardware scalability, and efficient data sharing. By exploiting the high-performance RDMA (Remote Direct Memory Access), the compute nodes directly access the remote memory pool without involving remote CPUs. Hence, the ordered key-value (KV) stores (e.g., B-trees and learned indexes) keep all data sorted to provide rang query service via the high-performance network. However, existing ordered KVs fail to work well on the disaggregated memory systems, due to either consuming multiple network roundtrips to search the remote data or heavily relying on the memory nodes equipped with insufficient computing resources to process data modifications. In this paper, we propose a scalable RDMA-oriented KV store with learned indexes, called ROLEX, to coalesce the ordered KV store in the disaggregated systems for efficient data storage and retrieval. ROLEX leverages a retraining-decoupled learned index scheme to dissociate the model retraining from data modification operations via adding a bias and some data-movement constraints to learned models. Based on the operation decoupling, data modifications are directly executed in compute nodes via one-sided RDMA verbs with high scalability. The model retraining is hence removed from the critical path of data modification and asynchronously executed in memory nodes by using dedicated computing resources. Our experimental results on YCSB and real-world workloads demonstrate that ROLEX achieves competitive performance on the static workloads, as well as significantly improving the performance on dynamic workloads by up to 2.2 times than state-of-the-art schemes on the disaggregated memory systems. We have released the open-source codes for public use in GitHub.
+|confname=NSDI '23
+|link=https://www.usenix.org/system/files/fast23-li-pengfei.pdf
+|title=ROLEX: A Scalable RDMA-oriented Learned Key-Value Store for Disaggregated Memory Systems
+|speaker=Haotian
+|date=2024-05-31}}
 {{Resource:Previous_Seminars}}

Navigation menu

Difference between revisions of "Resource:Seminar"

Latest revision as of 11:36, 28 May 2024

Contents

Difference between revisions of "Resource:Seminar"

Latest revision as of 11:36, 28 May 2024

Latest

History

2024

2023

2022

2021

2020

2019

2018

2017

Instructions