Comparing AWS Services for hosting DataOps Runners
Executive summary
We explore the suitability of three AWS offerings—EC2, EKS, and ECS—for hosting a DataOps Runner designed to execute DataOps Orchestrator workloads. The objective is to help you make informed decisions based on your requirements and workloads.
By comparing each AWS service's pros, cons, and specific use cases, this document aims to give you the knowledge needed to choose the best infrastructure for your needs. Among existing DataOps.live customers, EC2 is the most common setup, while EKS offers scalable, managed Kubernetes integration. ECS is currently unsupported due to technical limitations.
A runner on Kubernetes offers the most cost-effective and flexible approach for enterprises with established Kubernetes workloads.
| Service | Pros | Cons |
|---|---|---|
| EC2 | Most commonly used for runners. Full control over configuration. | More setup and maintenance. Manual scaling and security patching. |
| EKS | Scalable. Reduced operational overhead. | Requires Kubernetes expertise. Ongoing cost for EKS control plane. |
| ECS | Has potential in the future. | Currently unsupported. |
DataOps runner and orchestrator workloads
- Runner Functionality: A persistent service communicating with DataOps.live for job execution and log management.
- Job Execution: Jobs run on DataOps.live Orchestrator container images and are grouped into pipelines.
- Job Duration: Based on ~1.9 million jobs from June 2024, 90% of jobs complete within ~4 minutes.
- Pipeline Duration: Out of ~260k pipelines, 75% complete within ~15 minutes, with a median of 7.3 jobs per pipeline.
Key Advice
- Configure runners for concurrent execution to minimize job queuing.
- Prefer compute-optimized hosts for steady memory utilization.
- Monitor disk usage vigilantly to avoid failures from insufficient disk space.
Introduction
This section gives a high-level overview of the workloads that run on runner infrastructure and are monitored by DataOps.live.
Runner functionality
The DataOps Runner is a persistent service that communicates with app.dataops.live, requests new jobs for execution, and sends back job execution logs. These jobs use DataOps.live Orchestrators to do the work of delivering DataOps.
Jobs and pipelines on DataOps.live
Each runner on DataOps.live executes jobs that run one of our Orchestrator container images. These jobs are logically connected together into a Pipeline.
The bar graph below shows the time jobs in DataOps.live Operate take to complete.
Job duration distribution of jobs in June 2024
Note that these jobs run on various compute architectures with varying available resources.
As the 90th percentile of jobs takes up to ~4 minutes to complete, we recommend runners be configured with concurrent execution to avoid job queuing.
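If your runner exposes a GitLab-Runner-style config.toml (an assumption about the Docker-based DataOps Runner; check the runner documentation for your version), concurrent execution is controlled by the top-level concurrent setting. A minimal sketch:

```toml
# Illustrative config.toml fragment. The key names assume a
# GitLab-Runner-style configuration; the values and runner name
# are placeholders to adapt.
concurrent = 4        # allow up to four jobs to run at once on this host
check_interval = 3    # seconds between checks for new jobs

[[runners]]
  name = "dataops-runner-ec2"   # hypothetical name
  executor = "docker"
```

Size the concurrent value against the host's CPU and memory; four concurrent ~4-minute jobs on a compute-optimized instance is a reasonable starting point to tune from.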
Each pipeline is crafted for the project's needs, and projects may have many pipeline structures. The bar graph below shows the time pipelines in DataOps.live Operate take to complete.
The median is 7.3 jobs per pipeline in June 2024.
These pipeline stats may help set expectations of how long DataOps pipelines will take to complete. 75% of customer pipelines are completed in ~15 minutes or less.
Key advice
Configure your runner with concurrent execution to avoid queuing of long-running jobs.
Favor compute-optimized hosts, as memory use by the DataOps Runner and Orchestrators remains steady.
Monitor disk usage and alert when it exceeds an acceptable threshold. The most common cause of runner failure is Docker consuming all available disk space on a host, resulting in failed jobs and possibly critical failure of the host system.
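The disk-usage advice above can be sketched as a small shell check. The 80% threshold and the filesystem to watch are assumptions to adapt; on a real runner host you would typically watch the filesystem holding /var/lib/docker and wire the alert branch into your alerting tool.

```shell
#!/bin/sh
# Sketch of a disk-usage check for a runner host.
# The threshold and filesystem are assumptions; adjust for your environment.
THRESHOLD=80

# Returns success (0) when the given usage percentage is below the threshold.
disk_usage_ok() {
    [ "$1" -lt "$THRESHOLD" ]
}

# Read the current usage percentage of the root filesystem, digits only.
usage=$(df -P / | awk 'NR==2 {gsub("%",""); print $5}')

if disk_usage_ok "${usage:-100}"; then
    echo "disk usage ${usage}% is below ${THRESHOLD}%"
else
    echo "ALERT: disk usage ${usage}% exceeds ${THRESHOLD}% threshold" >&2
fi
```

Run from cron or a systemd timer, this gives you an early warning before Docker exhausts the disk and jobs start failing.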
AWS Elastic Compute Cloud (EC2)
Pros
- High customization and full control over instance configuration.
- Extensive experience at DataOps.live.
Cons
- More manual setup and maintenance are required.
- Scaling and security are handled manually unless automated tools are deployed.
Setup and Monitoring
- Setup: Requires manual configuration of EC2 and Docker installation.
- Monitoring: Utilize AWS CloudWatch and third-party tools like DataDog for comprehensive monitoring.
- Maintenance: Handle OS updates, security patches, and scaling.
Suitability
- Best for proof of technology workshops and production workloads requiring full control and customization.
Introduction
In this section, we provide a high-level overview of AWS EC2 for deploying a DataOps Runner.
Setup and monitoring journey
In this section, we explore the high-level steps to setting up and monitoring. For detailed setup instructions, review the Running the DataOps Runner with Docker documentation.
Journey
Main steps for setting up AWS EC2 and a DataOps Runner:
- Configure and deploy EC2 and install Docker via CloudFormation or in the AWS Console.
- Set registration secrets on EC2 as environment variables.
- Register and configure runner.
- Configure instance monitoring.
Maintaining after setup:
- Configure alarms and alerting for CPU, memory, and disk usage.
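The alarm step above could look like the following CloudFormation fragment; the alarm thresholds, the SNS topic, and the instance reference are placeholders, not the exact resources your stack will use.

```yaml
# Illustrative CloudFormation fragment: alert when average CPU on the
# runner host stays above 80% for two consecutive 5-minute periods.
Resources:
  RunnerCpuAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: DataOps Runner host CPU sustained above 80%
      Namespace: AWS/EC2
      MetricName: CPUUtilization
      Dimensions:
        - Name: InstanceId
          Value: !Ref RunnerInstance   # assumes an EC2 instance defined elsewhere
      Statistic: Average
      Period: 300
      EvaluationPeriods: 2
      Threshold: 80
      ComparisonOperator: GreaterThanThreshold
      AlarmActions:
        - !Ref AlertTopic              # assumes an SNS topic defined elsewhere
```

Note that EC2 does not publish memory or disk metrics by default; alarming on those requires installing the CloudWatch agent on the instance.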
EC2 setup
- Ease of Setup: Compared to managed services, setup is more manual. You'll need to provision and configure the EC2 instances.
- Customization: A high level of customization is possible since you have full control over the instance configuration.
Runner setup
- Ease of Setup: Direct access to logs makes it simple to diagnose issues.
- Skills and privileges: Requires deployers comfortable with EC2 remote access and command-line tools. Deployers do not need access to secrets other than the app.dataops.live registration token.
Host monitoring
- AWS CloudWatch: Can be used for basic monitoring of EC2 metrics (CPU, memory, disk usage).
- Third-party Tools: Can integrate with external monitoring tools like Prometheus, Grafana, etc. DataOps.live uses DataDog and a DataDog agent running on the EC2.
Runner monitoring
- app.dataops.live: Basic monitoring of a registered runner, showing last contact time and version.
- Third-party Tools: Can integrate with external monitoring tools like Prometheus, Grafana, etc.
Maintenance considerations
- Instance Management: Responsible for OS updates, security patches, and instance scaling.
- Scaling: Manual unless you implement auto-scaling groups, which require additional configuration.
- Security: You handle security groups, IAM roles, and key management for AWS instances.
Suitability
- Flexibility: Suitable if you need full control over your environment and customization. Scaling up is limited to instance size and configuration.
- Cost: The trade-off between predictable running costs and the complexity of the ongoing maintenance effort required to update the OS and security patching.
- Use Case: Runners running proof of technology workshops and production workloads.
- Experience: DataOps.live engineers and supporters have extensive experience in working with runners on EC2.
AWS Elastic Kubernetes Service (EKS)
Pros
- Managed service reduces some operational overhead.
- High scalability with Kubernetes' built-in features like Horizontal Pod Autoscaler.
Cons
- Requires Kubernetes expertise for setup and management.
- Ongoing cost for EKS control plane and EC2 instances.
Setup and Monitoring
- Setup: Easier with Helm charts but requires Kubernetes knowledge.
- Monitoring: Leverage CloudWatch and Kubernetes-native tools for comprehensive monitoring.
- Maintenance: EKS manages the control plane; you manage worker nodes and scaling.
Suitability
- Ideal for containerized workloads needing dynamic scaling and Kubernetes orchestration.
Introduction
In this section, we provide a high-level overview of AWS EKS in the context of deploying a DataOps Runner.
Setup and monitoring journey
This section explores the high-level steps for setting up and monitoring. Review the Running the DataOps Runner with Kubernetes documentation for detailed setup instructions.
The DataOps Runner for Kubernetes has its configuration files packaged into a Helm chart for easy installation and maintenance.
Journey
Operations engineers will need knowledge and permissions to apply the required EFS storage class.
Main steps for setting up AWS EKS and a DataOps Runner:
- Create (or gain access to) your EKS cluster via the eksctl CLI or in the AWS Console.
- Configure cluster remote registry and PVCs. Set registration secrets as Kubernetes secrets.
- Configure the DataOps Runner helm chart and install.
- Configure cluster monitoring.
Maintaining after setup:
- Configure alarms and alerting for CPU, memory, and disk usage.
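The secrets step in the journey above might look like the following manifest. The secret and key names are assumptions; use whatever names the DataOps Runner Helm chart expects, and prefer creating the secret with kubectl at deploy time rather than committing the token to source control.

```yaml
# Illustrative Kubernetes Secret holding the runner registration token.
# Name, namespace, and key are placeholders to match your chart's values.
apiVersion: v1
kind: Secret
metadata:
  name: dataops-runner-registration
  namespace: dataops
type: Opaque
stringData:
  registration-token: "<token from app.dataops.live>"
```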
EKS setup
- Ease of Setup: Setup and configuration require knowledge of Kubernetes. The managed service helps ease the burden, but it still involves the complexities of Kubernetes.
- Customization: High level of customization with Kubernetes features and configurations.
Runner setup
- Ease of Setup: Once the cluster is configured, installing the runner is straightforward using Helm. Without Helm, deeper Kubernetes configuration knowledge would be required.
- Skills and privileges: Requires deployers comfortable with EKS remote access and command-line tools. Deployers do not need access to secrets other than an app.dataops.live registration token.
Cluster monitoring
- AWS CloudWatch: Native integration with CloudWatch for monitoring.
- Kubernetes Tools: Can leverage Kubernetes-native monitoring tools such as Prometheus, Grafana, and the EFK stack (Elasticsearch, Fluentd, Kibana), plus K9s as a visual interactive management terminal.
Runner monitoring
- app.dataops.live: Basic monitoring of a registered runner, showing the last contact time and version.
- Third-party Tools: Can integrate with external monitoring tools like Prometheus, Grafana, etc.
Maintenance
- Cluster Management: EKS manages the control plane, but you still need to manage worker nodes.
- Scaling: Kubernetes' built-in scaling features (Horizontal Pod Autoscaler, Cluster Autoscaler) make it easier to handle changing workloads.
- Security: Kubernetes offers robust security features but requires a good understanding of how to configure properly (RBAC, network policies).
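As an illustration of the built-in scaling mentioned above, a Horizontal Pod Autoscaler can target the runner Deployment. The Deployment name is hypothetical, and whether the runner chart produces a Deployment suited to replica scaling is an assumption to verify against the chart.

```yaml
# Illustrative HPA: scale a hypothetical dataops-runner Deployment between
# 1 and 5 replicas to hold average CPU utilization around 70%.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: dataops-runner-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: dataops-runner
  minReplicas: 1
  maxReplicas: 5
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

An HPA requires the Kubernetes Metrics Server (or another metrics source) to be installed in the cluster.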
Suitability
- Scalability: Highly scalable and suitable for complex workloads.
- Cost: Ongoing cost for EKS control plane and underlying EC2 instances. Cost-effectiveness depends on usage and configuration.
- Use Case: Best for containerized workloads with tailored dynamic scaling needs and those who prefer using Kubernetes orchestration.
- Experience: DataOps.live engineers have moderate experience working with Kubernetes on EKS and AKS.
AWS Elastic Container Service (ECS)
Pros
- Not applicable due to limitations.
Cons
- Technical limitations prevent effective use with DataOps.live.
- High IO workloads are not well-supported on Fargate.
Suitability
- Not recommended for DataOps Runners due to ECS architecture constraints.
Suitability
DataOps.live doesn't currently support ECS because of the following technical limitations:

- The ECS architecture requires a long-running EC2 instance to manage the jobs running on ECS.
- Pipeline jobs use the image defined in the ECS task rather than the value of the image: keyword in your full-ci.yml (pipeline definition).
AWS Fargate abstracts container hosts, which limits configurability for container host properties. This affects runner workloads that require high IO to disk or network since these properties have limited or no configurability with Fargate. Before you use DataOps Runner on Fargate, ensure runner workloads with high or extreme compute characteristics on CPU, memory, disk IO, or network IO are suitable for Fargate.
Success stories and infrastructure recommendations
EC2
- Success: Widely used among DataOps.live customers, especially for workshops.
- Recommended Infrastructure:
  - Proof of Technology Workloads: c5.large, 300 GiB EBS.
  - Production Workloads: c5.xlarge, 300 GiB EBS.
EKS
- Success: Used less frequently but chosen for its scalability.
- Recommended Infrastructure: Minimum c6i.large, 300 GiB EBS (GP3).
Conclusion
EC2 and EKS offer robust solutions for hosting DataOps Runners, with EC2 being the most versatile and widely used option. However, EKS provides excellent scalability for those experienced with Kubernetes. Due to its technical limitations, ECS is not recommended at this time. This comprehensive analysis should guide customers in selecting the optimal AWS service to meet their specific operational needs.