Comparing AWS Services for hosting DataOps Runners

Executive summary

We explore the suitability of three AWS offerings—EC2, EKS, and ECS—for hosting a DataOps Runner designed to execute DataOps Orchestrator workloads. The objective is to help you make informed decisions based on your requirements and workloads.

By comparing each AWS service's pros, cons, and specific use cases, this document aims to give you the knowledge needed to choose the best infrastructure for your needs. Based on this comparison, EC2 is the most common setup among existing DataOps.live customers, while EKS offers scalable, managed Kubernetes integration. ECS is currently unsupported due to technical limitations.

Advice

A runner on Kubernetes offers the most cost-effective and flexible approach for enterprises with established Kubernetes workloads.

Service	Pros	Cons
EC2	Most commonly used for runners.	More setup and maintenance.
	Full control over configuration.	Manual scale and security patching.
EKS	Scalable.	Requires Kubernetes expertise.
	Reduced operational overhead.	Ongoing cost for EKS control plane.
ECS	Has potential in the future.	Currently unsupported.

DataOps runner and orchestrator workloads

Summary

Runner Functionality: A persistent service communicating with DataOps.live for job execution and log management.
Job Execution: Jobs run on DataOps.live Orchestrator container images and are grouped into pipelines.
Job Duration: Based on ~1.9 million jobs run in June 2024, 90% of jobs complete within ~4 minutes.
Pipeline Duration: Out of ~260k pipelines, 75% complete within ~15 minutes, with a median of 7.3 jobs per pipeline.

Key Advice

Configure runners for concurrent execution to minimize job queuing.
Prefer compute-optimized hosts, because memory utilization stays steady.
Monitor disk usage vigilantly to avoid failures from insufficient disk space.

Introduction

This section gives a high-level overview of the workloads run on runner infrastructure and monitored by DataOps.live.

Runner functionality

The DataOps Runner is a persistent service that communicates with app.dataops.live, requests new jobs for execution, and sends back job execution logs. These jobs use DataOps.live Orchestrators to do the work of delivering DataOps.

Jobs and pipelines on DataOps.live

Each runner on DataOps.live executes jobs that run one of our Orchestrator container images. These jobs are logically connected into a pipeline.

The bar graph below shows the time jobs in DataOps.live Operate take to complete.

bar graph showing time jobs in DataOps.live Operate take to complete

Job duration distribution of jobs in June 2024

note

Note that these jobs run on various compute architectures with varying available resources.

As the 90th percentile of jobs takes up to ~4 minutes to complete, we recommend runners be configured with concurrent execution to avoid job queuing.
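As a rough illustration of how to size concurrency, the sketch below applies Little's law (average jobs in flight = arrival rate × time per job). The function name, the `headroom` factor, and the load of 90 jobs per hour are assumptions for illustration only; the ~4 minute figure is the 90th percentile quoted above, which makes the estimate deliberately conservative.

```python
import math


def required_concurrency(jobs_per_hour: float, job_minutes: float,
                         headroom: float = 1.0) -> int:
    """Estimate how many concurrent job slots keep jobs from queuing.

    Little's law: average jobs in flight = arrival rate x time per job.
    `headroom` is a hypothetical safety multiplier for bursty load.
    """
    if jobs_per_hour < 0 or job_minutes < 0 or headroom <= 0:
        raise ValueError("rates and durations must be non-negative, headroom positive")
    jobs_in_flight = (jobs_per_hour / 60.0) * job_minutes
    # Always allow at least one slot, even for an idle runner.
    return max(1, math.ceil(jobs_in_flight * headroom))


# Hypothetical load: 90 jobs per hour, sized at the ~4 minute 90th percentile.
print(required_concurrency(jobs_per_hour=90, job_minutes=4))
```

Treat the result as a starting point for the runner's concurrency setting, then adjust it against observed queue times and host CPU usage.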

Each pipeline is crafted for the project's needs, and projects may have many pipeline structures. The bar graph below shows the time pipelines in DataOps.live Operate take to complete.

bar graph showing the time pipelines in DataOps.live Operate take to complete

note

The median is 7.3 jobs per pipeline in June 2024.

These pipeline stats may help set expectations of how long DataOps pipelines will take to complete. 75% of customer pipelines are completed in ~15 minutes or less.

Key advice

Advice

Configure your runner with concurrent execution to avoid queuing of long-running jobs.

Advice

Favor compute-optimized hosts, because memory use by the DataOps Runner and Orchestrators stays steady.

Advice

Monitor disk usage and alert when it exceeds an acceptable threshold. The most common cause of runner failure is Docker using all available disk space on a host, resulting in failed jobs and possibly a critical failure of the host system.
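A minimal sketch of such a disk check, using only the Python standard library, might look like the following. The function names and the 80% threshold are assumptions, not DataOps.live recommendations. On a Linux runner host you would typically point the check at Docker's data directory (`/var/lib/docker` by default) rather than `/`.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return the percentage of disk space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def disk_alert(path: str = "/", threshold_percent: float = 80.0) -> bool:
    """Return True when usage at `path` meets or exceeds the threshold."""
    return disk_usage_percent(path) >= threshold_percent


# Example check against the root filesystem with an assumed 80% threshold.
if disk_alert("/", threshold_percent=80.0):
    print("WARNING: disk usage is above the threshold")
else:
    print(f"Disk usage OK: {disk_usage_percent('/'):.1f}% used")
```

A script like this could run from a scheduler and feed whichever alerting tool you already use; a monitoring agent that reports disk metrics achieves the same result with less custom code.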

AWS Elastic Compute Cloud (EC2)

Summary

Pros

High customization and full control over instance configuration.
Extensive experience at DataOps.live.

Cons

More manual setup and maintenance are required.
Scaling and security are handled manually unless automated tools are deployed.

Setup and Monitoring

Setup: Requires manual configuration of EC2 and Docker installation.
Monitoring: Use AWS CloudWatch and third-party tools like Datadog for comprehensive monitoring.
Maintenance: Handle OS updates, security patches, and scaling.

Suitability

Best for proof of technology workshops and production workloads requiring full control and customization.

Introduction

In this section, we provide a high-level overview of AWS EC2 for deploying a DataOps Runner.

Setup and monitoring journey

In this section, we explore the high-level steps for setting up and monitoring a runner on EC2. For detailed setup instructions, review the Running the DataOps Runner with Docker documentation.

Journey

Main steps for setting up AWS EC2 and a DataOps Runner:

Configure and deploy EC2 and install Docker via CloudFormation or in the AWS Console.
Set registration secrets on EC2 as environment variables.
Register and configure runner.
Configure instance monitoring.

Maintaining after setup:

Configure alarms and alerting for CPU, memory, and disk usage.
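As one possible way to script the CPU alarm, the sketch below builds the arguments for the CloudWatch `put_metric_alarm` call in `boto3`. The alarm name, the 80% threshold, the evaluation window, and the SNS topic ARN are assumptions for illustration. Memory and disk metrics are not published by EC2 by default; they need the CloudWatch agent (or another agent) on the instance, so they are left out here.

```python
def cpu_alarm_params(instance_id: str, sns_topic_arn: str,
                     threshold_percent: float = 80.0) -> dict:
    """Build put_metric_alarm arguments for a high-CPU alarm on one instance."""
    return {
        "AlarmName": f"runner-{instance_id}-cpu-high",  # hypothetical naming scheme
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,            # seconds per datapoint
        "EvaluationPeriods": 2,   # alarm after two consecutive breaches
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "AlarmActions": [sns_topic_arn],
    }


def create_alarm(params: dict) -> None:
    """Create the alarm; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the sketch loads without the AWS SDK

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Placeholder identifiers; replace with your instance ID and SNS topic.
params = cpu_alarm_params(
    "i-0123456789abcdef0",
    "arn:aws:sns:eu-west-1:123456789012:runner-alerts",
)
print(params["AlarmName"])
```

Calling `create_alarm(params)` would then register the alarm. The same structure can be reused for agent-published memory and disk metrics by changing the namespace, metric name, and dimensions.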

EC2 setup

Ease of Setup: Compared to managed services, setup is more manual. You'll need to provision and configure the EC2 instances.
Customization: A high level of customization is possible since you have full control over the instance configuration.

Runner setup

Ease of Setup: Direct access to logs makes it simple to diagnose issues.
Skills and privileges: Requires deployers comfortable with EC2 remote access and command-line tools. Deployers do not need access to secrets other than the app.dataops.live registration token.

Host monitoring

AWS CloudWatch: Can be used for basic monitoring of EC2 metrics. CPU is reported by default; memory and disk usage require the CloudWatch agent on the instance.
Third-party Tools: Can integrate with external monitoring tools like Prometheus, Grafana, etc. DataOps.live uses Datadog with a Datadog agent running on the EC2 instance.

Runner monitoring

app.dataops.live: Basic monitoring of a registered runner, showing last contact time and version.
Third-party Tools: Can integrate with external monitoring tools like Prometheus, Grafana, etc.

Maintenance considerations

Instance Management: You are responsible for OS updates, security patches, and instance scaling.
Scaling: Manual unless you implement auto-scaling groups, which require additional configuration.
Security: You handle security groups, IAM roles, and key management for AWS instances.

Suitability

Flexibility: Suitable if you need full control over your environment and customization. Scaling up is limited to instance size and configuration.
Cost: Running costs are predictable, but the trade-off is the ongoing maintenance effort required for OS updates and security patching.
Use Case: Runners running proof of technology workshops and production workloads.
Experience: DataOps.live engineers and supporters have extensive experience in working with runners on EC2.

AWS Elastic Kubernetes Service (EKS)

Summary

Pros

Managed service reduces some operational overhead.
High scalability with Kubernetes' built-in features like Horizontal Pod Autoscaler.

Cons

Requires Kubernetes expertise for setup and management.
Ongoing cost for EKS control plane and EC2 instances.

Setup and Monitoring

Setup: Easier with Helm charts but requires Kubernetes knowledge.
Monitoring: Leverage CloudWatch and Kubernetes-native tools for comprehensive monitoring.
Maintenance: EKS manages the control plane; you manage worker nodes and scaling.

Suitability

Ideal for containerized workloads needing dynamic scaling and Kubernetes orchestration.

Introduction

In this section, we provide a high-level overview of AWS EKS in the context of deploying a DataOps Runner.

Setup and monitoring journey

This section explores the high-level steps for setting up and monitoring. Review the Running the DataOps Runner with Kubernetes documentation for detailed setup instructions.

The DataOps Runner for Kubernetes has its configuration files packaged into a Helm chart for easy installation and maintenance.

Journey

info

Operations engineers will need knowledge and permissions to apply the required EFS storage class.

Main steps for setting up AWS EKS and a DataOps Runner:

Create (or gain access to) your EKS cluster via eksctl CLI or in the AWS Console.
Configure cluster remote registry and PVCs. Set registration secrets as Kubernetes secrets.
Configure the DataOps Runner Helm chart and install it.
Configure cluster monitoring.
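To illustrate the "registration secrets as Kubernetes secrets" step, the sketch below builds a standard `v1` Secret manifest and prints it as JSON, which `kubectl apply -f` accepts. The secret name, namespace, and key are hypothetical; use the names that the DataOps Runner Helm chart documentation expects.

```python
import base64
import json


def registration_secret_manifest(token: str,
                                 name: str = "dataops-runner-registration",
                                 namespace: str = "default",
                                 key: str = "registration-token") -> dict:
    """Build a Kubernetes Secret manifest holding the runner registration token.

    The default name, namespace, and key are placeholders for illustration.
    """
    # Values under `data` in a Secret must be base64-encoded strings.
    encoded = base64.b64encode(token.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


# "example-token" stands in for the real app.dataops.live registration token.
manifest = registration_secret_manifest("example-token")
print(json.dumps(manifest, indent=2))
```

Avoid committing the generated manifest to source control, because base64 is an encoding and not encryption.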

Maintaining after setup:

Configure alarms and alerting for CPU, memory, and disk usage.

EKS setup

Ease of Setup: Setup and configuration require knowledge of Kubernetes. Managed service helps ease the burden, but it still involves the complexities of Kubernetes.

Customization: High level of customization with Kubernetes features and configurations.

Runner setup

Ease of setup: Once the tools are configured, installing the runner with Helm is straightforward. Without Helm, deeper Kubernetes configuration knowledge would be required.

Skills and privileges: This requires deployers to be comfortable with EKS remote access and command-line tools. Deployers do not need access to secrets other than an app.dataops.live registration token.

Cluster monitoring

AWS CloudWatch: Native integration with CloudWatch for monitoring.

Kubernetes Tools: Can leverage Kubernetes-native monitoring tools such as Prometheus, Grafana, and the EFK stack (Elasticsearch, Fluentd, Kibana). K9s provides an interactive terminal UI for cluster management.

Runner monitoring

app.dataops.live: Basic monitoring of a registered runner, showing the last contact time and version.

Third-party Tools: These can be integrated with external monitoring tools like Prometheus, Grafana, etc.

Maintenance

Cluster Management: EKS manages the control plane, but you still need to manage worker nodes.
Scaling: Kubernetes' built-in scaling features (Horizontal Pod Autoscaler, Cluster Autoscaler) make it easier to handle changing workloads.
Security: Kubernetes offers robust security features but requires a good understanding of how to configure them properly (RBAC, network policies).

Suitability

Scalability: Highly scalable and suitable for complex workloads.
Cost: Ongoing cost for EKS control plane and underlying EC2 instances. Cost-effectiveness depends on usage and configuration.
Use Case: Best for containerized workloads with tailored dynamic scaling needs and those who prefer using Kubernetes orchestration.
Experience: DataOps.live engineers have moderate experience working with Kubernetes on EKS and AKS.

AWS Elastic Container Service (ECS)

Summary

Pros

Not applicable due to limitations.

Cons

Technical limitations prevent effective use with DataOps.live.
High IO workloads are not well-supported on Fargate.

Suitability

Not recommended for DataOps Runners due to ECS architecture constraints.

Suitability

DataOps.live doesn't currently support ECS due to technical limitations with ECS.

note

The ECS architecture requires a long-running EC2 instance to manage the jobs running on ECS.

warning

Pipeline jobs use the image defined in the ECS task rather than the value of the image: keyword in your full-ci.yml (pipeline definition).

warning

AWS Fargate abstracts container hosts, which limits configurability for container host properties. This affects runner workloads that require high IO to disk or network since these properties have limited or no configurability with Fargate. Before you use DataOps Runner on Fargate, ensure runner workloads with high or extreme compute characteristics on CPU, memory, disk IO, or network IO are suitable for Fargate.

Success stories and infrastructure recommendations

EC2

Success: Widely used among DataOps.live customers, especially for workshops.
Recommended Infrastructure:
- Proof of Technology Workloads: c5.large, 300 GiB EBS.
- Production Workloads: c5.xlarge, 300 GiB EBS.
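The recommendations above could be captured as launch parameters for the EC2 `run_instances` call in `boto3`, as in the sketch below. The function name, the workload keys, the AMI ID placeholder, and the `/dev/xvda` root device name are assumptions; the root device name depends on the AMI you choose.

```python
# Instance types taken from the recommendations above.
RECOMMENDED_INSTANCE_TYPES = {
    "proof_of_technology": "c5.large",
    "production": "c5.xlarge",
}


def runner_instance_params(workload: str, ami_id: str,
                           root_device: str = "/dev/xvda",
                           volume_gib: int = 300) -> dict:
    """Build run_instances arguments for a single runner host."""
    if workload not in RECOMMENDED_INSTANCE_TYPES:
        raise ValueError(f"unknown workload: {workload!r}")
    return {
        "ImageId": ami_id,
        "InstanceType": RECOMMENDED_INSTANCE_TYPES[workload],
        "MinCount": 1,
        "MaxCount": 1,
        "BlockDeviceMappings": [
            {
                "DeviceName": root_device,  # assumed; check the AMI's root device
                "Ebs": {"VolumeSize": volume_gib, "DeleteOnTermination": True},
            }
        ],
    }


# "ami-EXAMPLE" is a placeholder, not a real image ID.
params = runner_instance_params("production", ami_id="ami-EXAMPLE")
print(params["InstanceType"], params["BlockDeviceMappings"][0]["Ebs"]["VolumeSize"])
```

With `boto3` installed and credentials configured, `boto3.client("ec2").run_instances(**params)` would launch the host. Security groups, IAM roles, and Docker installation still need to be handled separately.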

EKS

Success: Used less frequently but chosen for its scalability.
Recommended Infrastructure: Minimum c6i.large, 300 GiB EBS (GP3).

Conclusion

EC2 and EKS both offer robust solutions for hosting DataOps Runners. EC2 is the most versatile and widely used option, while EKS provides excellent scalability for teams experienced with Kubernetes. Due to its technical limitations, ECS is not recommended at this time. This analysis should help customers select the AWS service that best meets their operational needs.
