Scaling ETL (Extract, Transform, Load) workloads efficiently in the cloud is critical to managing large-scale data pipelines. Clearwater Analytics, a leader in web-based investment accounting and analytics, faced the challenge of scaling their ETL dispatchers within AWS while ensuring minimal impact on active tasks. Clearwater successfully optimized their pipeline processes by implementing an innovative detachment strategy for auto-scaling. In this blog post, we’ll explore how they addressed these challenges and the solutions they applied to achieve efficient ETL scaling.

ETL Dispatchers: The Core of Clearwater Analytics’ Data Pipeline

ETL dispatchers play a vital role in Clearwater’s data pipeline. They manage large volumes of data processing tasks, coordinating the extraction, transformation, and loading of information across multiple systems. However, scaling dispatchers to meet fluctuating demand while maintaining task continuity is a challenging endeavor, especially when tasks can be long-running.

The Challenge: Gracefully Scaling Down ETL Dispatchers with Active Tasks

One of the primary challenges Clearwater faced was scaling down ETL dispatchers without disrupting ongoing tasks. AWS Auto Scaling adjusts compute capacity dynamically, but it doesn’t natively wait for long-running processes to finish before terminating an instance. Without careful management, active ETL tasks could be interrupted mid-run, leading to data loss or processing delays.

Exploring Solutions: Custom Termination Policies for AWS Auto Scaling

Clearwater investigated various solutions to address this, focusing on custom termination policies within AWS Auto Scaling. These policies give more control over which instances are terminated first, considering factors like instance age, remaining tasks, and Availability Zones. While promising, the standard policies could not accommodate long-running ETL jobs that couldn’t be quickly interrupted.
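To make the idea concrete, here is a minimal sketch of the selection logic a Lambda-backed custom termination policy might use. It is not Clearwater’s implementation. The event fields (CapacityToTerminate, Instances) and the InstanceIDs response key follow the shape AWS documents for custom termination policies, while is_idle is a hypothetical callback standing in for however a dispatcher reports that it has no active tasks. The sketch also assumes unweighted instances, so each one counts as one unit of capacity.

```python
from typing import Any, Callable


def choose_instances_to_terminate(
    event: dict[str, Any],
    is_idle: Callable[[str], bool],
) -> dict[str, list[str]]:
    """Pick only idle instances, staying within the per-AZ capacity requested."""
    # Capacity that Auto Scaling wants removed, keyed by Availability Zone.
    remaining: dict[str, int] = {}
    for item in event.get("CapacityToTerminate", []):
        az = item["AvailabilityZone"]
        remaining[az] = remaining.get(az, 0) + item["Capacity"]

    selected: list[str] = []
    for instance in event.get("Instances", []):
        az = instance["AvailabilityZone"]
        instance_id = instance["InstanceId"]
        # Skip busy dispatchers (is_idle is a hypothetical check); they are
        # simply not offered for termination in this round.
        if remaining.get(az, 0) > 0 and is_idle(instance_id):
            selected.append(instance_id)
            remaining[az] -= 1
    return {"InstanceIDs": selected}
```

A policy like this only influences which instances are offered for termination. If every dispatcher is busy, it returns an empty list, and the scale-in has to be retried later.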

Pause, Drain, Kill: A Simple Yet Limited Approach

Initially, Clearwater explored the “Pause, Drain, Kill” approach. This technique involves pausing new task assignments, draining existing tasks, and terminating the instance once it’s safe. While this worked well for brief tasks, it became problematic for tasks with extended run times. Additionally, this approach lacked flexibility, especially when managing complex task dependencies and resource reallocation.
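The three steps can be sketched as a small routine. This is an illustrative model only: the Dispatcher class, the terminate callback, and the timeout are assumptions made for the example, not details from Clearwater’s system.

```python
import time
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Dispatcher:
    """Hypothetical stand-in for an ETL dispatcher process."""
    accepting: bool = True
    active_tasks: set[str] = field(default_factory=set)

    def assign(self, task_id: str) -> bool:
        # A paused dispatcher refuses new work.
        if not self.accepting:
            return False
        self.active_tasks.add(task_id)
        return True

    def finish(self, task_id: str) -> None:
        self.active_tasks.discard(task_id)


def pause_drain_kill(
    dispatcher: Dispatcher,
    terminate: Callable[[], None],
    timeout_s: float,
    poll_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the dispatcher drained and was terminated in time."""
    dispatcher.accepting = False          # Pause: stop taking new tasks.
    deadline = time.monotonic() + timeout_s
    while dispatcher.active_tasks:        # Drain: wait for in-flight tasks.
        if time.monotonic() >= deadline:
            return False                  # Still busy; do not kill.
        sleep(poll_s)
    terminate()                           # Kill: safe to shut down.
    return True
```

The timeout is where the limitation shows up. A short drain window works for brief tasks, but a long-running job either forces a very long wait or gets cut off.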

Detach, Pause, Drain, Kill: A Promising Enhancement

To address the limitations of the “Pause, Drain, Kill” approach, Clearwater introduced an enhancement: the “Detach, Pause, Drain, Kill” method. Detaching an instance from the Auto Scaling group before pausing task assignments takes it out of the group’s management, so existing tasks can finish without the group terminating the instance underneath them. This offered a cleaner, more controlled shutdown process and improved task continuity.
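A minimal sketch of this sequence follows, assuming boto3-style clients are passed in (for example boto3.client("autoscaling") and boto3.client("ec2")). The pause and is_drained callbacks are hypothetical hooks into the dispatcher, and decrementing the desired capacity on detach is an assumption chosen for the scale-in case, so that the group does not launch a replacement.

```python
import time
from typing import Any, Callable


def detach_pause_drain_kill(
    autoscaling: Any,
    ec2: Any,
    group_name: str,
    instance_id: str,
    pause: Callable[[], None],
    is_drained: Callable[[], bool],
    poll_s: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    # Detach: the instance keeps running but leaves the group's control.
    # Decrementing desired capacity avoids launching a replacement.
    autoscaling.detach_instances(
        InstanceIds=[instance_id],
        AutoScalingGroupName=group_name,
        ShouldDecrementDesiredCapacity=True,
    )
    pause()                       # Pause: stop assigning new tasks.
    while not is_drained():       # Drain: wait as long as the tasks need.
        sleep(poll_s)
    # Kill: the group no longer manages the instance, so terminate it directly.
    ec2.terminate_instances(InstanceIds=[instance_id])
```

Because the instance is already outside the group, the drain loop needs no deadline imposed by Auto Scaling. In practice an upper bound and alerting would still be sensible.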

The Detachment Approach: Advantages and Future Considerations

The detachment strategy significantly improved the scalability and resilience of Clearwater’s ETL infrastructure. Some key advantages of this approach include:

  • Minimized Task Disruptions: Because tasks are allowed to finish before instances are terminated, data loss is avoided and processing continues uninterrupted.
  • Customizable: The detachment process can be fine-tuned to handle various task durations and priorities.
  • Scalability: With AWS Auto Scaling still managing the overall fleet, resources can be dynamically adjusted without manual intervention.

Clearwater is considering further enhancements to improve efficiency and reduce costs, such as integrating predictive scaling policies based on historical task data and expanding the detachment strategy to other long-running services.

Deployments Streamlined, Costs Reduced

By implementing this detachment strategy, Clearwater achieved two crucial goals: streamlined deployments and reduced operational costs. Auto Scaling allowed them to optimize resource utilization while avoiding unnecessary infrastructure spending. Together, the cost reduction and more reliable task processing represent a significant win for their cloud infrastructure strategy.

Overcoming Limitations: Lifecycle Hooks and AZ Rebalancing

Despite the success of the detachment strategy, some challenges required additional solutions. One such issue was ensuring smooth rebalancing across Availability Zones (AZs) when instances were detached. Clearwater used AWS Auto Scaling lifecycle hooks to manage instance terminations, adding delays where necessary so that tasks could complete gracefully. This allowed for more robust control over scaling actions, especially during high-demand periods.
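As a rough illustration of how a termination lifecycle hook can add such a delay, the sketch below keeps an instance in its wait state by sending heartbeats until the dispatcher reports it has drained, then completes the lifecycle action. It assumes a boto3-style Auto Scaling client is passed in. The hook name, group name, is_drained callback, and heartbeat limit are hypothetical values for the example.

```python
import time
from typing import Any, Callable


def hold_until_drained(
    autoscaling: Any,
    hook_name: str,
    group_name: str,
    instance_id: str,
    is_drained: Callable[[], bool],
    heartbeat_s: float = 60.0,
    max_heartbeats: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True if the instance drained before the heartbeat budget ran out."""
    target = {
        "LifecycleHookName": hook_name,
        "AutoScalingGroupName": group_name,
        "InstanceId": instance_id,
    }
    drained = is_drained()
    beats = 0
    while not drained and beats < max_heartbeats:
        # Each heartbeat restarts the hook's timeout, extending the wait state.
        autoscaling.record_lifecycle_action_heartbeat(**target)
        beats += 1
        sleep(heartbeat_s)
        drained = is_drained()
    # Release the instance so Auto Scaling can proceed with termination.
    autoscaling.complete_lifecycle_action(LifecycleActionResult="CONTINUE", **target)
    return drained
```

AWS documents an upper bound on how long an instance can stay in a wait state, so very long ETL tasks may still need the detachment approach instead of a hook alone.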

Conclusion: Achieving Efficient Auto Scaling for Long-Running Processes

Clearwater Analytics’ detachment strategy provides an effective solution for scaling ETL workloads in AWS. By carefully managing the scaling-down process of ETL dispatchers through a combination of custom termination policies, detachment strategies, and lifecycle hooks, they achieved a highly efficient, cost-optimized data pipeline. This approach is a valuable reference for organizations facing similar challenges in achieving resilient and scalable cloud infrastructure.
