DataOps Security and Governance Appendix

Overall DataOps Architecture

Broadly speaking, there are three components to the overall DataOps solution, as indicated in this image:

[Figure: overall DataOps architecture]

These components are discussed in detail in the following points:

The DataOps Runner and Orchestrators

The DataOps Runner is a long-running process (usually in a long-running container) installed on customer infrastructure (on-prem or private cloud). At DataOps Runner installation time, a registration token from the DataOps platform is provided, and this is used to create and exchange authentication keys between the DataOps Runner and the DataOps Platform.

These keys authenticate the DataOps Runner when connecting to DataOps. The registration token determines which DataOps projects' jobs a DataOps Runner can handle. For example:

  • If a DataOps project registration token is used, the DataOps Runner will only be allowed to run jobs for that specific project.
  • If a DataOps group registration token is provided, then all projects in the group, and all subgroups, will be able to run jobs served by the specific DataOps Runner.
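The scoping rule above can be illustrated with a minimal Python sketch. It assumes that projects and groups are identified by slash-separated paths (for example `acme/finance/reporting`). The `runner_can_serve` function, the `token_kind` labels, and the path format are hypothetical and are not part of the DataOps API.

```python
def runner_can_serve(token_scope: str, token_kind: str, project_path: str) -> bool:
    """Return True if a runner registered with this token may run jobs for project_path.

    token_kind is "project" or "group" (hypothetical labels used for illustration).
    """
    if token_kind == "project":
        # A project registration token only covers that exact project.
        return project_path == token_scope
    if token_kind == "group":
        # A group registration token covers every project in the group
        # and in all of its subgroups.
        return project_path.startswith(token_scope.rstrip("/") + "/")
    raise ValueError(f"unknown token kind: {token_kind!r}")
```

Comparing against the group path plus a trailing slash avoids treating `acmecorp/x` as part of the group `acme`.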

You can install any number of DataOps Runners. The number of runners usually reflects the different data centers or locations where data needs to be orchestrated.

Each DataOps Runner can then execute any number of DataOps Orchestrators that provide individual functionality to ELT data.

The DataOps Platform

DataOps is a distributed, secure SaaS application. It is only available via secure transport protocols. The HTTP port is open but only redirects to HTTPS. It serves:

  • UI Access via HTTPS
  • RESTful API/Web Service via HTTPS
  • Git access over HTTPS or SSH

DataOps also handles core functions such as:

  • Git repository storage
  • DataOps pipeline running and scheduling
  • DataOps project and group administration
  • Historical storage of pipeline and job history plus created artifacts

Snowflake and other Data Consumers

Snowflake is a very well-understood and documented technology. See Snowflake.com for more information on this technology.

Snowflake is where all the organization's data is encrypted and stored. Data consumers cover any tools that access data from Snowflake, including BI tools such as Tableau, Power BI, Looker, ThoughtSpot, and libraries for more direct access through Python, Go, and Java.

Network Connectivity

[Figure: network connectivity]

Runner Connection

Direction of connection - The DataOps Runner makes the connection to the DataOps SaaS platform.

Type of connection - HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional, initiated from the DataOps Runner.

Description

  • By default, the DataOps Runner connects to DataOps every second. This interval is configurable.

  • The DataOps Runner provides its agent name and pre-exchanged authentication key (to validate its identity) to the DataOps Platform and asks if there are any pending jobs it is authorized to orchestrate.

  • If there are none, the connection is closed.

  • If there are any jobs, the required configuration for the jobs to be executed is passed down to the DataOps Runner. See Authentication Credentials and Secrets Management for further information.

  • The DataOps Runner then spawns additional DataOps orchestrators (Docker containers) using an image specific to the type of job it must orchestrate. For instance, the Python3 Orchestrator container image is used to create the relevant orchestrator if a Python 3 job needs to be run.

  • The STDOUT and STDERR for these additional containers are streamed back to DataOps in close to real-time for monitoring and debugging purposes.

  • Lastly, if the job is configured to produce artifacts (typically rendered configuration files, reports, etc.), these are sent back to DataOps.
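The sequence above can be sketched as a single poll cycle in Python. This is an illustration under stated assumptions, not the runner's actual implementation: every function and parameter name here (`poll_once`, `fetch_pending_job`, `run_orchestrator`, `send_log_line`, `upload_artifacts`, and the `artifacts` job key) is hypothetical. The network and container calls are passed in as callables so the control flow stands on its own.

```python
import time
from typing import Callable, Iterable, Optional


def poll_once(
    fetch_pending_job: Callable[[str, str], Optional[dict]],
    run_orchestrator: Callable[[dict], Iterable[str]],
    send_log_line: Callable[[str], None],
    upload_artifacts: Callable[[dict], None],
    agent_name: str,
    auth_key: str,
) -> bool:
    """Run one poll cycle and return True if a job was executed."""
    # Identify the runner and ask for any pending job it is authorized to run.
    job = fetch_pending_job(agent_name, auth_key)
    if job is None:
        return False  # No pending jobs: the connection is simply closed.
    # Spawn the orchestrator for this job and stream its output back line by line.
    for line in run_orchestrator(job):
        send_log_line(line)
    # Send artifacts back only if the job is configured to produce them.
    if job.get("artifacts"):
        upload_artifacts(job)
    return True


def run_forever(poll: Callable[[], bool], interval_seconds: float = 1.0) -> None:
    """Repeat the poll cycle at a configurable interval (one second by default)."""
    while True:
        poll()
        time.sleep(interval_seconds)
```

Because every connection is initiated by the runner, this pattern needs only outbound access from the customer network.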

Registry Connection

Direction of connection - The DataOps Runner connects to the public container registry https://hub.docker.com.

Type of connection - HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional, initiated from the DataOps orchestrator host.

Description

  • The DataOps Runner downloads one or more orchestrator images to execute the job logic as defined by the DataOps pipeline.

Orchestrator Connection

Direction of connection - The DataOps Runner connects to the local container infrastructure (local Docker engine) from its host machine. The orchestrator container spawns on the same host as the DataOps Runner. For security reasons, the network connection used will be a local network in most scenarios.

Type of connection - Localhost and intranet only

Direction of data flow - Bidirectional, initiated from the DataOps Runner image to the orchestrator image.

Description

  • The DataOps Runner executes the orchestrator container images and monitors their status until they exit.

Local Connection

Direction of connection - The DataOps orchestrator connects to an on-premises or private cloud system such as Matillion to ELT data from the system.

Type of connection - Dictated by the capabilities of the system to be orchestrated. Usually, this is HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional, initiated from the DataOps orchestrator host.

Description

  • The specific DataOps Orchestrator, like Matillion, uses configuration information from the job or the DataOps Vault. The DataOps Vault securely provides connection credentials using the Secrets Manager Orchestrator.

  • The actual work, such as starting an ETL/ELT job, is requested.

  • If this is an asynchronous request, it returns before the actual task is completed. The DataOps orchestrator then enters a polling loop to check the status of the asynchronous task until it finishes or a configurable timeout is reached.
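A polling loop with a timeout, as described above, might look like the following Python sketch. The `wait_for_task` function, the `"RUNNING"` status string, and the default timeout and interval values are assumptions made for illustration. Real orchestrators define their own status values and settings.

```python
import time
from typing import Callable


def wait_for_task(
    get_status: Callable[[], str],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until the task leaves the RUNNING state or the timeout expires."""
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status != "RUNNING":
            return status  # The task finished, successfully or not.
        if clock() >= deadline:
            raise TimeoutError(f"task still running after {timeout_seconds} seconds")
        sleep(poll_interval_seconds)
```

Using a monotonic clock for the deadline keeps the timeout correct even if the system clock is adjusted while the task runs.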

Indirect Snowflake Connection

Direction of connection - The orchestrated system connects to Snowflake.

Type of connection - Dictated by the capabilities of the orchestrated system. However, this can only be within the secure methods supported by Snowflake.

Direction of data flow - Bidirectional

Description - See the Snowflake data consumer and other documentation for the orchestrated system.

Direct Snowflake Connection

Direction of connection - One of the DataOps Orchestrators connects to Snowflake directly.

Type of connection - HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional, initiated via the DataOps Orchestrator from the DataOps Runner host.

Description

SaaS Connection

Identical to the Local Connection except that traffic routes via the public internet.

Consumer Connection

Direction of connection - Data Consumers connect to Snowflake

Type of connection - HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional

Description

User Connection

Direction of connection - DataOps end users connect to the DataOps app.

Type of connection - HTTPS/TLS 1.2. See TLS Protocol for further information.

Direction of data flow - Bidirectional

Description

The main use cases are:

  • Standard Web UI access
  • RESTful API/Web Service calls
  • Git access over HTTPS

TLS Protocol

DataOps supports only TLS protocol version 1.2. Older, insecure versions are disabled.

Only the following TLS ciphers are enabled:

  • ECDHE-ECDSA-AES128-GCM-SHA256
  • ECDHE-RSA-AES128-GCM-SHA256
  • ECDHE-ECDSA-AES128-SHA256
  • ECDHE-RSA-AES128-SHA256
  • ECDHE-ECDSA-AES256-GCM-SHA384
  • ECDHE-RSA-AES256-GCM-SHA384
  • ECDHE-ECDSA-AES256-SHA384
  • ECDHE-RSA-AES256-SHA384
  • AES128-GCM-SHA256
  • AES128-SHA256
  • AES256-GCM-SHA384
  • AES256-SHA256
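For clients that want to mirror these settings, the following Python sketch builds a TLS context restricted to TLS 1.2 and the cipher names listed above, using the standard `ssl` module. The `make_tls12_context` helper and the `ALLOWED_CIPHERS` constant are hypothetical names. Pinning the maximum version on the client is optional and shown only for illustration. Which of these ciphers are actually available also depends on the local OpenSSL build.

```python
import ssl

# Cipher names copied from the list above (OpenSSL naming).
ALLOWED_CIPHERS = [
    "ECDHE-ECDSA-AES128-GCM-SHA256",
    "ECDHE-RSA-AES128-GCM-SHA256",
    "ECDHE-ECDSA-AES128-SHA256",
    "ECDHE-RSA-AES128-SHA256",
    "ECDHE-ECDSA-AES256-GCM-SHA384",
    "ECDHE-RSA-AES256-GCM-SHA384",
    "ECDHE-ECDSA-AES256-SHA384",
    "ECDHE-RSA-AES256-SHA384",
    "AES128-GCM-SHA256",
    "AES128-SHA256",
    "AES256-GCM-SHA384",
    "AES256-SHA256",
]


def make_tls12_context() -> ssl.SSLContext:
    """Build a client context that negotiates TLS 1.2 only, with the listed ciphers."""
    ctx = ssl.create_default_context()  # Certificate and hostname checks stay enabled.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    ctx.maximum_version = ssl.TLSVersion.TLSv1_2
    # OpenSSL silently skips names it does not support and raises
    # ssl.SSLError only if none of them can be selected.
    ctx.set_ciphers(":".join(ALLOWED_CIPHERS))
    return ctx
```

The returned context can be passed to standard library clients, for example `urllib.request.urlopen(url, context=make_tls12_context())`.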

Static IP Addresses

DataOps users, Git clients, and the DataOps Runners must open their firewall for outbound HTTPS connections to the IPv4 address 3.9.0.146 to connect to DataOps (via the Runner Connection and the User Connection).

Note: IPv6 is currently not supported.

Authentication Credentials and Secrets Management

Authentication details are fetched via the Secrets Manager Orchestrator by the DataOps Runner. Credentials are either stored in the encrypted local DataOps Vault or (recommended) stored in a third-party secrets manager like AWS Keystore.

Secrets and Credentials Management

User Access Control

DataOps supports the following user access control methods:

  • Username and Password Authentication
  • Access Token Authentication

Two-Factor Authentication (MFA/2FA)

For Username and Password Authentication, DataOps supports time-based one-time passwords, compatible with applications such as:

  • Authy
  • Duo Mobile
  • LastPass Authenticator
  • Authenticator
  • andOTP
  • Google Authenticator
  • Microsoft Authenticator
  • SailOTP
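For readers unfamiliar with time-based one-time passwords, the following Python sketch shows how an authenticator app derives a code according to RFC 6238. The exact parameters DataOps uses are not stated here, so the sketch assumes the common defaults (HMAC-SHA-1, 30-second time step, 6 digits). The `totp` function name is hypothetical.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA-1 variant) for the given time."""
    # The moving factor is the number of whole time steps since the Unix epoch.
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation (RFC 4226): the low nibble of the last byte picks
    # a 4-byte window, and the top bit of that window is masked off.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    # Keep the requested number of decimal digits, padding with leading zeros.
    return str(code % 10 ** digits).zfill(digits)
```

With the RFC 6238 test secret `b"12345678901234567890"` at Unix time 59, `totp(secret, 59, digits=8)` returns `"94287082"`, which matches the published test vector. Authenticator apps usually receive the secret Base32-encoded, so it would need decoding (for example with `base64.b32decode`) before being passed in.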

Single Sign-On (SSO)

DataOps supports Single Sign-On (SSO) via a third-party provider. Supported external identity providers include Azure Directory Service, Microsoft's Active Directory Federation Services (ADFS), Ping Federate, and other SSO providers that implement the standard protocols SAML 2.0 or OpenID Connect (OIDC). See the Single Sign-On documentation page for more information.

Role Based Access Control (RBAC)

DataOps supports user restrictions in two dimensions:

Scope

The following points apply to the DataOps scope:

  • By default, a user has no access to any DataOps projects or groups until assigned access.
  • If a user is given access to a specific project, this allows access to only that project.
  • If users are given access to a particular group, they will have access to all projects and sub-groups underneath that group.

Function

When a user is assigned a scope, they are also assigned a functional role from the following available roles:

  • Reporter: Read-only access
  • Developer: Read-write access to all but protected branches (configurable)
  • Maintainer: Read-write access to all branches, including protected branches (configurable)
  • Owner: Read-write access to all branches plus the ability to register DataOps runners, change permissions for other users in their group, and perform specific project and group admin functions (such as rename, transfer, etc.)
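The two dimensions can be combined in a small Python model. This is a sketch under several assumptions, not the platform's actual permission engine. Projects and groups are identified by slash-separated paths, the highest role among matching grants is taken as the effective role, and protected-branch behavior uses the defaults listed above (which are configurable). The names `Role`, `effective_role`, and `can_push` are hypothetical.

```python
from enum import IntEnum
from typing import Dict, Optional


class Role(IntEnum):
    REPORTER = 1
    DEVELOPER = 2
    MAINTAINER = 3
    OWNER = 4


def effective_role(grants: Dict[str, Role], project_path: str) -> Optional[Role]:
    """Return the highest role that applies to project_path, or None if no grant applies.

    grants maps a project or group path to the role assigned at that scope.
    """
    best: Optional[Role] = None
    for path, role in grants.items():
        # A grant applies to the exact project, or to anything beneath a group.
        in_scope = project_path == path or project_path.startswith(path.rstrip("/") + "/")
        if in_scope and (best is None or role > best):
            best = role
    return best


def can_push(role: Optional[Role], branch_protected: bool) -> bool:
    """Apply the default branch rules for each functional role."""
    if role is None or role == Role.REPORTER:
        return False  # No access at all, or read-only access.
    if branch_protected:
        return role >= Role.MAINTAINER
    return True
```

For example, a user with `Reporter` on the group `acme` and `Developer` on the project `acme/finance/reporting` can push to unprotected branches of that project only.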

Penetration Testing

The DataOps platform is subject to regular penetration testing. The latest report can be obtained by contacting our support team after signing an NDA.

API Rate Limiting

API rate limiting is a common technique for improving the security and durability of a web application, whether excessive requests are malicious or simply caused by a bug. Rate limiting also mitigates brute-force attacks. DataOps uses rate limiting to protect many aspects of the platform, including but not limited to:

  • General API access
  • Failed Authentication attempts
  • Import and export rates of files and projects

This ensures that requests from IP addresses generating large request volumes are throttled. IP block-and-allow lists are in place to ban abusive clients.
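As a generic illustration of per-IP throttling, the following Python sketch implements a sliding-window limiter. It does not describe how DataOps implements rate limiting. The class name `SlidingWindowLimiter` and its parameters are assumptions chosen for the example.

```python
from collections import defaultdict, deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per client within any window_seconds interval."""

    def __init__(self, max_requests: int, window_seconds: float) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # Timestamps of recently accepted requests, tracked per client IP.
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, client_ip: str, now: float) -> bool:
        hits = self._hits[client_ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # Over the limit: the request should be throttled.
        hits.append(now)
        return True
```

A caller would pass a monotonic timestamp such as `time.monotonic()` as `now` and, in an HTTP service, typically answer rejected requests with status 429.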

Business Continuity Management

The DataOps.live platform ensures high availability (>99.9%) and protection against unplanned interruptions.

Snowflake

Snowflake is a highly secure platform used in many highly-sensitive areas. Snowflake has excellent security documentation, including:

Physical and Environmental Security

All aspects of the DataOps platform are hosted on Amazon Web Services; therefore, both physical and environmental security is provided by AWS. This is summarized in the AWS Data Center Controls web page.

Operations and Processes

Staff Onboarding and Offboarding

Human Resource Security:

  • Does your organization perform background screening of applicants, including prior employment, criminal records, credit checks, professional and academic references, and drug screening (unless prohibited by law)?

    • N/A - Internal Control Response is sufficient.
  • Are all employees and non-employees who manage your and/or your clients' data on behalf of your company required, upon hire, to:

    • Sign a Code of Ethics or any agreement(s) that require non-disclosure, preservation of confidentiality, and/or acceptable use, and undergo information security awareness training?

      • N/A - Internal Control Response is sufficient.
    • Does your organization have a formal process to remove/change user access and obtain assets (as applicable) within 24 hours after HR's notification of termination, or when an employee / non-employee changes position?

Internal Company IT/Security Operations

DataOps.live has policies that cover:

  • Information Security
  • Information Security Training
  • Password Management
  • Information Security Incident Management
  • Asset Management
  • Acceptable Use
  • Access Control
  • Business Continuity Management
  • Management Review
  • Social Media
  • Data Encryption
  • Anti-Virus
  • Data Protection
  • Internal Audit
  • Document Control and Record Management
  • Supplier Security
  • BYOD (Bring Your Own Device)
  • Secure Development

All access to the DataOps.live infrastructure for maintenance and administration is whitelist-controlled and only available via secure protocols (e.g., TLS 1.2 or VPN). Unsecured protocols (e.g., telnet, FTP) are prohibited and blocked.

All firewall rules and access groups controlling administrative access to DataOps are set to deny by default.

Development and Maintenance

DataOps.live development practices include:

  • Requirements tracking using a standard requirements tracking tool
  • Source Code Management (SCM) based on Git
  • Code review processes
  • Automated build and test CI/CD Infrastructure
  • Automated security testing scans (see Penetration Testing). Critical vulnerabilities are corrected within 30 days. High-risk vulnerabilities are fixed within 90 days.

Incident Management

DataOps.live has an Incident Management and Escalation Process as well as Cybersecurity Insurance Coverage.