Using Snowpark Container Services with DataOps.live

What are Snowpark Container Services?

Snowpark Container Services (SPCS) is a feature of the Snowflake Data Cloud, in public preview since late December 2023. It is a fully managed service that lets you run containerized applications directly within Snowflake.

These services are designed specifically for Snowflake. They are fully integrated and optimized to run containerized workloads. They provide a secure and governed infrastructure with flexible hardware options, including GPUs, for a seamless and efficient experience.

This new Snowpark runtime simplifies the complexities of managing and maintaining compute resources and clusters for containers. For additional information, see the Snowflake documentation Snowpark Container Services.

Combined capabilities: container services and DataOps.live

The introduction of Snowpark Container Services is a significant enhancement to DataOps processes within the Snowflake ecosystem. You can now run a wide range of application images in Snowpark Container Services, making your DataOps pipelines and jobs more capable and efficient.

By coupling DataOps.live application images with Snowpark Container Services, the deployment and management of containerized applications becomes more accessible than ever. This integration offers increased flexibility and efficiency in managing various workloads.

You can concentrate on creating your data products and apps on the data product platform and quickly launch them on the Snowflake platform, without worrying about managing infrastructure in either environment.

How does it work?

For most DataOps pipeline jobs, the standard built-in orchestrator images are usually sufficient. However, DataOps.live also lets you build images from scratch to address specific customer use cases. This is where Snowpark Container Services becomes particularly valuable, introducing a key capability to the process.

This guide shows how to link your DataOps CI/CD pipelines, including the pipeline for your app, with Snowflake Snowpark containerization, focusing on flexibility and support for different workloads. It demonstrates how to create application images in the data product platform and, once the images are built, how to use them inside Container Services.

In the data product platform, you will work with the DataOps data pipelines, jobs, orchestrators, SOLE, and MATE.

  1. Start DevReady, our browser-based development environment.
  2. In the CI/CD pipeline of your DataOps project, define a job to build your custom application image.
  3. Register and deploy the image into the Snowpark image registry within your Snowflake account.

In your Snowflake account, you use an image registry and a compute pool to run containerized apps. Once deployed and running, containers are exposed as services, service functions, or jobs.

  1. Upload your application images to the registry in your Snowflake account.
  2. Once uploaded, run your application containers as a service, service function, or job. Create a YAML file to provide Snowflake with the necessary information to configure and run them.
  3. When creating a service or job, create and specify a compute pool where they will operate.
  4. Use service functions to communicate with a service from an SQL query.

Image Registries

Image registries are Snowflake schema objects. You can have multiple registries per schema. The naming format is

account-identifier.registry.snowflakecomputing.com/<database>/<schema>/<registry_name>
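
In SQL, a registry is created as an image repository object in a database and schema. A minimal sketch, assuming the database and schema already exist (all names are illustrative):

-- Create an image repository (the registry object) in the current database and schema
CREATE IMAGE REPOSITORY IF NOT EXISTS spcs_repository;

-- List repositories in the current schema; the repository_url column is the path you push images to
SHOW IMAGE REPOSITORIES IN SCHEMA;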

Note that docker login to the registry fails if two-factor authentication (2FA) is enabled for the user.

Image registries appear as a Snowflake stage. However, the Snowsight UI cannot list the registry's contents and throws errors when you try to access the stage, so there is currently no way to browse a registry's contents from the UI.

Building images

Build images during software development

During your development cycle, you can build images directly in DevReady.

Build your images for the linux/amd64 platform using a command similar to:

docker build --rm --platform linux/amd64 -t <name> .
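
To push a locally built image into the registry, a typical sequence looks like the following; the account identifier and the database, schema, registry, and image names are placeholders that must match the naming format shown above:

docker login <account-identifier>.registry.snowflakecomputing.com -u <snowflake_username>
docker tag <name> <account-identifier>.registry.snowflakecomputing.com/<database>/<schema>/<registry_name>/<name>
docker push <account-identifier>.registry.snowflakecomputing.com/<database>/<schema>/<registry_name>/<name>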

Build images with CI/CD

We recommend using Kaniko as the image builder when automating your builds. A working DataOps job definition is:

build-spcs-image.yml
Build SPCS image:
  extends:
    - .agent_tag
  stage: Build
  image:
    name: gcr.io/kaniko-project/executor:v1.19.2-debug
    entrypoint: [""]
  variables:
    SPCS_ACCOUNT: DATAOPS_VAULT(SNOWFLAKE.ACCOUNT)
    SPCS_REGISTRY: $SPCS_ACCOUNT.registry.snowflakecomputing.com/spcs_db/spcs_schema/spcs_repository
    SPCS_CONTAINER_USER: DATAOPS_VAULT(SNOWFLAKE.INGESTION.USERNAME)
    SPCS_CONTAINER_PASSWORD: DATAOPS_VAULT(SNOWFLAKE.INGESTION.PASSWORD)
    IMAGE_NAME: my-target-image-name:latest # change this
  script:
    - ${CI_PROJECT_DIR}/create_auth.sh
    - /kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --destination ${SPCS_REGISTRY}/${IMAGE_NAME}

You need a script named create_auth.sh that creates the Docker config.json Kaniko uses to authenticate against the registry. Below, you can see a sample create_auth.sh.

#!/bin/sh

# Write the Docker config.json that Kaniko uses to authenticate to the Snowflake image registry
CONFIGFILE=/kaniko/.docker/config.json
# Use printf so no trailing newline is encoded into the credentials; strip base64 line wrapping
CRED=$(printf '%s' "${SPCS_CONTAINER_USER}:${SPCS_CONTAINER_PASSWORD}" | base64 | tr -d '\n')
echo '{"auths":{"'${SPCS_REGISTRY}'": { "auth": "'${CRED}'" }}}' > ${CONFIGFILE}

Compute pools

Compute pools are the hosts that Snowflake uses to run containers. To clean up and stop being billed, follow these steps.

  1. Stop all services and jobs on the compute pool.

    ALTER COMPUTE POOL tutorial_compute_pool STOP ALL;

  2. Then, proceed to delete the compute pool.

    DROP COMPUTE POOL tutorial_compute_pool;
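
For reference, a compute pool such as the one dropped above can be created with a statement like the following; the pool name, node counts, and instance family are illustrative and should be sized for your workload:

CREATE COMPUTE POOL tutorial_compute_pool
  MIN_NODES = 1
  MAX_NODES = 1
  INSTANCE_FAMILY = CPU_X64_XS;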

Running images as containers

Services

A service can contain multiple containers and is the unit of management within SPCS. You create a service using SQL, either with an inline specification or from a definition file in a stage (see the example after this list). The definition contains:

  1. List of containers with name, image, command, args, environment variables, volume mounts, resources
  2. List of endpoints with name, type, and if public
  3. List of volumes with name, source, configuration
  4. Log level configuration, e.g. INFO, WARN
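
As a minimal sketch, assuming a specification file has already been uploaded to a stage (the service, compute pool, stage, and file names are placeholders):

CREATE SERVICE my_service
  IN COMPUTE POOL tutorial_compute_pool
  FROM @specs_stage
  SPECIFICATION_FILE = 'my_service_spec.yaml'
  MIN_INSTANCES = 1
  MAX_INSTANCES = 1;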

Service endpoints

Service endpoints are externally available at https://<host>-account-identifier.snowflakecomputing.app/ui. The host part is randomly generated and identifies the container. The /ui path is defined by the application code you deployed.

The service implementation listens on port 8000, which is proxied to port 443. Snowflake also adds an OAuth login in front of this connection, so Snowflake SSO users can log in using the normal SSO login flow.

A DESCRIBE SERVICE <name> command will describe the service, and the endpoint will be in JSON format in the public_endpoints column.
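
To reach a service from a SQL query rather than over HTTPS, you wrap one of its endpoints in a service function. A minimal sketch, assuming a service named echo_service with an endpoint named echoendpoint whose application serves an /echo route (all names are illustrative):

CREATE FUNCTION my_echo_udf (input_text VARCHAR)
  RETURNS VARCHAR
  SERVICE = echo_service
  ENDPOINT = echoendpoint
  AS '/echo';

-- Call the containerized service directly from SQL
SELECT my_echo_udf('hello');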

Networking

Containers cannot connect to the internet by default. You can configure access to specific external hosts. The supported ports are 22 (SSH and Git), 80 (HTTP), 443 (HTTPS), and 1024 and above. For simpler cases, you would configure only ports 80 and 443.

Configuring network egress is a multi-step process. Start by creating a network rule:

CREATE OR REPLACE NETWORK RULE google_network_rule
MODE = EGRESS
TYPE = HOST_PORT
VALUE_LIST = ('google.com:80', 'google.com:443');

Next, reference the rule as part of an external access integration:

CREATE EXTERNAL ACCESS INTEGRATION google_apis_access_integration
ALLOWED_NETWORK_RULES = (google_network_rule)
ENABLED = true;

For other Snowflake services, an external access integration can reference Snowflake Secrets. For Container Services, you can pass secrets as either environment variables or files.

Finally, create the service with a reference to the external access integration:

CREATE SERVICE eai_service
IN COMPUTE POOL MYPOOL
FROM SPECIFICATION
$$
spec:
  containers:
  - name: main
    image: /db/data_schema/tutorial_repository/my_echo_service_image:tutorial
    env:
      TEST_FILE_STAGE: source_stage/test_file
    args:
    - read_secret.py
  endpoints:
  - name: read
    port: 8080
$$
EXTERNAL_ACCESS_INTEGRATIONS = (google_apis_access_integration);

Logs

Snowflake provides an account-wide event log facility. All logs from your containers arrive in a single event table, including the STDOUT of the containers. Expect a lag of a few minutes between a message being written in the container and it appearing in the table.

You can use the RESOURCE_ATTRIBUTES column to identify the source of the log message. This column is a JSON object that includes the container name, database, and schema.
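
A hedged sketch of such a query, assuming your account's active event table is my_events_db.public.my_event_table; the attribute keys follow the SPCS event documentation and may evolve:

SELECT timestamp,
       resource_attributes['snow.service.name']::string AS service_name,
       resource_attributes['container.name']::string AS container_name,
       value AS log_line
FROM my_events_db.public.my_event_table
WHERE record_type = 'LOG'
ORDER BY timestamp DESC;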

Persistent storage volumes

Volumes are available for persistent storage in three types, illustrated in the sketch after this list:

  • Memory - in-memory storage scoped to that container
  • Local - shared by all containers in a single service instance; not shared across service instances
  • Stage - an internal Snowflake stage (not external) with server-side encryption (SSE) enabled
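
A hedged sketch of how these volume types appear in a service specification, using the same inline-specification style as the earlier CREATE SERVICE example; the service, pool, image, volume names, mount paths, size, and stage are placeholders:

CREATE SERVICE volume_demo_service
IN COMPUTE POOL tutorial_compute_pool
FROM SPECIFICATION
$$
spec:
  containers:
  - name: main
    image: /db/data_schema/tutorial_repository/my_image:latest
    volumeMounts:
    - name: cache
      mountPath: /cache
    - name: instance-local
      mountPath: /local
    - name: shared-stage
      mountPath: /data
  volumes:
  - name: cache
    source: memory
    size: 1Gi
  - name: instance-local
    source: local
  - name: shared-stage
    source: "@my_internal_stage"
$$;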