Why SOLE?
In any DataOps project, it is necessary to create, update and drop Snowflake objects. The need to maintain this object lifecycle has prompted the creation of many tools in the extended database ecosystem. However, most of these tools use a procedural, imperative approach to making additions and alterations to the database structure, a state-sensitive and often error-prone technique.
SOLE uses a declarative configuration to define Snowflake objects, with all logic for creation, alteration and destruction being generated automatically.
Imperative vs declarative
Consider the following real-world example, where the objective is to:
Go to 221B Baker Street, London
Imperative
An imperative approach to achieve the above objective might take the form of step-by-step directions. This would, of course, be sensitive to initial conditions, i.e. the current location of the individual following the directions. Then, something like the following procedural method could be followed:
- Face north,
- walk 1.2 miles,
- turn right onto Oxford Street,
- walk 200 yards,
- turn left and then it's 245 ft on the right.
Naturally, there are a number of issues that could (and frequently do) arise from this type of procedure:
- A misunderstanding of initial conditions could lead to instructions becoming meaningless, or maybe worse, silently following the wrong course.
- Any mistake in the individual instructions could break the procedure in the same way.
- Fixed instructions like these cannot deal with sudden environmental problems, such as a road closure.
Declarative
Alternatively, we may wish to offload the process of computing how we achieve our objective to another system:
"OK Google, show me directions to 221B Baker Street, London"
A navigation service does exactly that. The initial conditions (the starting point) are computed automatically (if you enable location sharing), a route is calculated, and, if you are moving, the route is recalculated as conditions change.
All that was needed was to set out the objective in a declarative way, and let the system do the rest!
Imperative vs declarative in SQL
So how does this apply to the management of database objects?
Let's take a simple example: a database table definition has been updated to add a new column named ADDRESS:
CREATE TABLE PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR, ADDRESS VARCHAR);
SQL is an imperative language, with most statements starting with a verb (e.g. CREATE, ALTER, DROP).
Therefore, the above statement is sensitive to initial conditions. If the table does not exist, it will be created. If there is already a table named PEOPLE, the statement will fail.
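To see this sensitivity in action, consider running the same statement twice against an initially empty schema (a sketch; the wording of the error is Snowflake's, not reproduced exactly here):

-- First run: PEOPLE does not exist yet, so the table is created.
CREATE TABLE PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR, ADDRESS VARCHAR);
-- Second run: PEOPLE now exists, so the statement fails with an
-- "already exists" compilation error rather than doing nothing.
CREATE TABLE PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR, ADDRESS VARCHAR);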
The property of being able to run an action once or multiple times, leading to the same result, is known as idempotence. It is an important concept for DataOps, particularly when life-cycling Snowflake objects.
Imperative Snowflake SQL
Whilst still using imperative statements, Snowflake does give us a couple of mechanisms to make this slightly better.
CREATE TABLE IF NOT EXISTS PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR, ADDRESS VARCHAR);
This will allow the statement to be run repeatedly - if there is already a table named PEOPLE, it will do nothing and return successfully. However, the existing table will not be updated with the new ADDRESS column.
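One way to confirm this (a sketch, assuming the original two-column PEOPLE table was already in place):

-- The IF NOT EXISTS statement above returned successfully, but:
DESCRIBE TABLE PEOPLE;
-- still lists only FIRST_NAME and LAST_NAME; ADDRESS was never added.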
What about:
CREATE OR REPLACE TABLE PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR);
Now, if there is already a table called PEOPLE, Snowflake will drop it and recreate it from the CREATE statement. The structure will then match the updated specification in the statement. Shame about the data though!
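A quick sketch of that data loss (the row values are purely illustrative):

INSERT INTO PEOPLE (FIRST_NAME, LAST_NAME) VALUES ('Sherlock', 'Holmes');
CREATE OR REPLACE TABLE PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR);
SELECT COUNT(*) FROM PEOPLE;  -- returns 0: the replace dropped every existing row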
Could we not, though, combine these approaches in some way?
CREATE TABLE IF NOT EXISTS PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR);
ALTER TABLE PEOPLE ADD COLUMN ADDRESS VARCHAR;
Now this is becoming more complex, and would get much more so as additional changes are made to the table.
Also, this is no longer idempotent - if this block is run a second time, the second statement will fail, as the column ADDRESS will already exist.
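At the time of writing, Snowflake also accepts an IF NOT EXISTS guard on ADD COLUMN, which would make this particular pair of statements safe to rerun:

CREATE TABLE IF NOT EXISTS PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR);
-- The guard means a repeat run simply skips the column addition.
ALTER TABLE PEOPLE ADD COLUMN IF NOT EXISTS ADDRESS VARCHAR;

However, every subsequent change then needs its own carefully chosen guard, and the script still grows without bound.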
Log-based imperative
So, if SQL is imperative, why not just keep a log of statements so they can be replayed? Many existing tools use this approach, maintaining a sequential set of SQL tasks.
stmt-00001.sql
CREATE TABLE IF NOT EXISTS PEOPLE (FIRST_NAME VARCHAR, LAST_NAME VARCHAR);
stmt-00002.sql
ALTER TABLE PEOPLE ADD COLUMN ADDRESS VARCHAR;
This will now be repeatable, as long as the log position is maintained for each environment where this is deployed. However, as with the street directions example above, this will become an inflexible procedure, felled by the slightest error in any step or any environmental change (e.g. a manual alteration to Snowflake).
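A minimal sketch of the bookkeeping such tools rely on (the table and column names here are illustrative, not taken from any particular tool):

-- Records which statement files have already been replayed in this environment.
CREATE TABLE IF NOT EXISTS MIGRATION_HISTORY (
    STMT_FILE  VARCHAR,                                    -- e.g. 'stmt-00002.sql'
    APPLIED_AT TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
);
-- Before replaying, the runner consults this table and executes only the
-- statements recorded after the last applied file.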
Check-based imperative
Rather than blindly following a sequence of SQL statements, we can examine the condition of the target object at each stage, deciding whether it is necessary to execute each statement.
if table PEOPLE does not exist:
    execute stmt-00001.sql
if table PEOPLE has no column ADDRESS:
    execute stmt-00002.sql
...
This can be replayed in an environment, applying only the changes that have not yet been applied, based on an examination of the actual database environment at each stage. However, this will quickly become very complex, needing increasingly specific tests against every single change.
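In Snowflake, such a check might be written against the information schema (a sketch, assuming PEOPLE lives in the current database and schema):

-- Returns 0 when the column is missing, in which case the runner
-- executes stmt-00002.sql; returns 1 once it exists.
SELECT COUNT(*)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'PEOPLE'
  AND COLUMN_NAME = 'ADDRESS';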
There must be a better way!
Declarative
If we move from imperative SQL to a more declarative language, such as YAML, we can express the same table in this form:
PEOPLE:
  columns:
    FIRST_NAME:
      type: VARCHAR
    LAST_NAME:
      type: VARCHAR
    ADDRESS:
      type: VARCHAR
This can then be fed into a black-box style engine (such as SOLE...), which compares it against the current database configuration to internally generate and execute a set of SQL statements. This list of statements, which the operator doesn't need to interact with, is generated dynamically each time, so is not sensitive to the database being in a "known state".
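For example, if the PEOPLE table already exists with only FIRST_NAME and LAST_NAME, such an engine only needs to generate something like the following (the exact SQL an engine such as SOLE emits is internal to it; this is purely an illustration):

ALTER TABLE PEOPLE ADD COLUMN ADDRESS VARCHAR;

Against an empty database, the same configuration would instead yield a single CREATE TABLE.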
SOLE employs this declarative approach, using a YAML configuration very similar to the above example, and runs natively within the DataOps environment.