Key Idea
Dirigent is a new cluster manager architecture designed to efficiently orchestrate short-lived, sporadically invoked serverless functions, addressing the performance limitations of existing FaaS platforms that build on top of generic cluster management systems.
Abstract
The paper proposes Dirigent, a clean-slate system architecture for serverless (Function as a Service, FaaS) cluster management, designed to address the performance limitations of existing FaaS platforms that build on top of generic cluster management systems like Kubernetes.
Key insights and highlights:
- Current FaaS cluster managers built on Kubernetes suffer from high scheduling latency, especially when handling bursts of concurrent function invocations that require creating many new function sandboxes (containers) on worker nodes.
- The root cause is the complex, hierarchical state management and persistent state updates in Kubernetes-based designs, which become a bottleneck under the high churn of short-lived function sandboxes.
- Dirigent adopts three key design principles to address these issues:
  - Simplified internal cluster management abstractions to minimize state management complexity.
  - Elimination of persistent state updates on the critical path of function invocations, relaxing exact state reconstruction guarantees.
  - Monolithic control and data planes to minimize internal communication overheads.
- Dirigent can create 2500 function sandboxes per second, 1250x more than Knative, a representative Kubernetes-based FaaS platform.
- For a production FaaS workload trace, Dirigent reduces 99th percentile per-function scheduling latency by 2.79x compared to AWS Lambda.
- Dirigent maintains fault tolerance guarantees comparable to existing FaaS platforms while improving performance.
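The second design principle can be illustrated with a minimal sketch. This is not Dirigent's code: the class, method names, and the in-memory list standing in for a durable store are all hypothetical, and rebuilding sandbox state from worker reports after a crash is an assumption made here for illustration. The point is only that rare operations (registering a function) are persisted, while sandbox creation on the invocation path touches volatile memory alone.

```python
from dataclasses import dataclass, field


@dataclass
class ControlPlane:
    """Toy control plane; every name here is hypothetical."""

    # Stand-in for a durable store (a real system would use replicated storage).
    durable_log: list = field(default_factory=list)
    # Volatile state: lost on a crash, never written to durable_log.
    functions: set = field(default_factory=set)
    sandboxes: dict = field(default_factory=dict)  # sandbox_id -> function name
    next_id: int = 0

    def register_function(self, name: str) -> None:
        # Rare operation, off the invocation critical path: persist it.
        self.durable_log.append(("register", name))
        self.functions.add(name)

    def create_sandbox(self, name: str) -> int:
        # Critical path: in-memory update only, no durable write.
        if name not in self.functions:
            raise KeyError(name)
        sandbox_id = self.next_id
        self.next_id += 1
        self.sandboxes[sandbox_id] = name
        return sandbox_id

    @classmethod
    def recover(cls, durable_log, worker_reports):
        # Assumed recovery path: registrations come back from the log,
        # sandboxes from whatever the workers report as still running.
        # The result may differ from the pre-crash state (relaxed guarantee).
        cp = cls(durable_log=list(durable_log))
        for op, name in durable_log:
            if op == "register":
                cp.functions.add(name)
        cp.sandboxes = dict(worker_reports)
        cp.next_id = max(cp.sandboxes, default=-1) + 1
        return cp


cp = ControlPlane()
cp.register_function("resize-image")
ids = [cp.create_sandbox("resize-image") for _ in range(3)]

# Simulate a crash in which the workers report only two surviving sandboxes.
recovered = ControlPlane.recover(cp.durable_log, {0: "resize-image", 2: "resize-image"})
```

After three sandbox creations the durable log still holds a single entry, which is what keeps persistence off the hot path in this sketch; the cost is that the recovered sandbox map reflects what workers report rather than an exact replay.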
Statistics
Dirigent can create 2500 function sandboxes per second, 1250x more than Knative.
For a production FaaS workload trace, Dirigent reduces 99th percentile per-function scheduling latency by 2.79x compared to AWS Lambda.
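As a quick sanity check on the first figure, the two numbers together imply a Knative creation rate of about 2 sandboxes per second. The short calculation below derives that and, under the simplifying assumption that each system sustains its peak rate with no queueing effects, estimates how long a burst would take to provision. The burst size of 1000 is a hypothetical value chosen for illustration, not a number from the paper.

```python
dirigent_rate = 2500  # sandboxes per second, as reported above
speedup = 1250        # Dirigent relative to Knative, as reported above

# Implied Knative throughput in sandboxes per second.
knative_rate = dirigent_rate / speedup

# Hypothetical burst of cold starts arriving at once.
burst = 1000
dirigent_seconds = burst / dirigent_rate
knative_seconds = burst / knative_rate

print(f"Implied Knative rate: {knative_rate} sandboxes/s")
print(f"Time to create {burst} sandboxes: "
      f"Dirigent {dirigent_seconds} s, Knative {knative_seconds} s")
```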
Quotes
"While initializing function sandboxes on worker nodes takes 10-100s of milliseconds with today's FaaS worker system software [34, 37, 43, 60, 74, 80, 81], we find that the end-to-end latency to initialize function sandboxes is often one or more orders of magnitude higher in operational FaaS environments."
"We find that the current approach of building FaaS cluster managers on top of legacy orchestration systems like Kubernetes leads to high scheduling delay at high sandbox churn, which is typical in FaaS clusters."