Cheat Sheet
Overview
The data product platform is extremely powerful but requires a bit of knowledge. Like most powerful systems, the things you use daily will become second nature. This cheat sheet is for everything else.
Conventions
By convention, all VARIABLES, ENVIRONMENT_VARIABLES, or PLACE_HOLDERS in templates or pipeline definition files are in upper case. This naming convention makes them easy to identify against most other text and configuration, which is usually lower case, e.g.:
```yaml
Run stack sales sources:
  extends: .modelling_and_transformation_base
  variables:
    TRANSFORM_ACTION: RUN
    TRANSFORM_MODEL_SELECTOR: "tag:source_stack_sales"
```
or
```yaml
dbname: "{{ DATABASE }}" # Snowflake database name
user: "{{ SNOWFLAKE_USERNAME }}" # Snowflake user
password: "{{ SNOWFLAKE_PASSWORD }}" # Plain string or vault encrypted
```
Note that template rendering and variable substitution are case-sensitive.
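Because rendering is case-sensitive, a lower-case key will not satisfy an upper-case placeholder. Here is a minimal Python sketch of case-sensitive substitution (the render function is illustrative only, not the platform's actual template engine):

```python
import re

def render(template, variables):
    """Naive, case-sensitive {{ VAR }} substitution (a sketch, not the real engine)."""
    def replace(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return str(variables[name])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)

print(render('dbname: "{{ DATABASE }}"', {"DATABASE": "DATAOPS_PROD"}))
# dbname: "DATAOPS_PROD"

# A lower-case key does not match the upper-case placeholder:
try:
    render('dbname: "{{ DATABASE }}"', {"database": "DATAOPS_PROD"})
except KeyError as exc:
    print("failed as expected:", exc)
```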
Project structure
The default project structure for any DataOps repository should be:
/dataops
/dataops/snowflake - this is where all configurations for the DataOps Snowflake Object Lifecycle Engine go
/dataops/modelling - this is where all configurations for the DataOps Modelling and Transformation Engine go
/pipelines - this is where all configurations for DataOps Pipelines go
You can create any other root-level directories for storing other code/configuration related to other orchestrators e.g. for Talend Jobs.
For more information, see DataOps Project Structure.
Git workflow and Git command line
If you are unfamiliar with git, we recommend reading Git in 30 Seconds.
Naming conventions
Branches
The following branch names have special meanings in a DataOps project and should not be used for any other purpose:
- main
- qa
- dev
Branches can have any other name as long as it doesn't contain white space or special characters, but the following best practices are strongly recommended:
- A branch name must immediately tell other people what is in this branch
- Where possible, a branch name should have a reference to the ticketing system/project management system
- Don't use main, qa, or dev as part of other branch names. A really good branch name would be something like DATATEAM-435-add-length-of-service-to-HR-Employee-Consumption-Model
- Remember that the branch name will be used to create a dynamic Feature Database (see below)
- Many tutorials suggest having branch names such as feat/new-table-creation. The / character will cause problems with DataOps projects, and so should be avoided.
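The naming rules above can be sketched as a small check. This is a hypothetical helper, not an official DataOps tool; in particular, treating main/qa/dev as forbidden hyphen-separated parts of a name, and allowing only letters, digits, hyphens, and underscores, are this sketch's interpretation of the guidance:

```python
import re

RESERVED = ("main", "qa", "dev")

def check_branch_name(name):
    """Return a list of problems with a proposed branch name (empty means OK)."""
    problems = []
    # White space and special characters (including /) cause problems.
    if re.search(r"[^A-Za-z0-9_-]", name):
        problems.append("contains white space or special characters (including /)")
    # Reserved names should not appear as part of other branch names.
    for word in RESERVED:
        if word in name.lower().split("-"):
            problems.append(f"contains reserved name {word!r}")
    return problems

print(check_branch_name("DATATEAM-435-add-length-of-service-to-HR-Employee-Consumption-Model"))
# []
print(check_branch_name("feat/new-table-creation"))
```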
For more information, see Branching Strategies.
Database structure
Database names
The data product platform automatically creates databases as needed using the following naming:
[DATABASE_NAME_PREFIX]_[DATABASE_IDENTIFIER]
DATABASE_NAME_PREFIX is set in the -ci.yml file. By default, this is set to DATAOPS. DATABASE_IDENTIFIER is calculated using logic in the -ci.yml file. The default behavior is:
- If branch=main then DATABASE_IDENTIFIER=PROD, and therefore the full DATABASE would be something like DATAOPS_PROD
- If branch=qa then DATABASE_IDENTIFIER=QA, and therefore the full DATABASE would be something like DATAOPS_QA
- If the branch name is anything else, then DATABASE_IDENTIFIER=FEATURE_[BRANCH_NAME], but with everything other than alphanumeric characters removed, e.g. if the branch name is DATATEAM-435-add-length-of-service-to-HR-Employee-Consumption-Model then the full DATABASE would be DATAOPS_FEATURE_DATATEAM435ADDLENGTHOFSERVICETOHREMPLOYEECONSUMPTIONMODEL. This is at the very upper limit of a reasonable database name length, although Snowflake technically supports up to 255 characters.
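The default naming logic above can be sketched in Python (database_name is a hypothetical helper written for illustration; the real logic lives in the -ci.yml file):

```python
import re

def database_name(branch, prefix="DATAOPS"):
    """Sketch of the default branch-to-database naming logic."""
    if branch == "main":
        identifier = "PROD"
    elif branch == "qa":
        identifier = "QA"
    else:
        # Feature branches: strip everything except alphanumeric characters
        # and upper-case the result.
        identifier = "FEATURE_" + re.sub(r"[^A-Za-z0-9]", "", branch).upper()
    return f"{prefix}_{identifier}"

print(database_name("main"))
# DATAOPS_PROD
print(database_name("DATATEAM-435-add-length-of-service-to-HR-Employee-Consumption-Model"))
# DATAOPS_FEATURE_DATATEAM435ADDLENGTHOFSERVICETOHREMPLOYEECONSUMPTIONMODEL
```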
For more information, see Database Objects Namespacing.
Schema names
Schema names should follow this naming convention:
[source|business]_[stack_name]_[mda_layer]
For example, in a system ingesting from two source systems, hr and sales (source stacks), and serving two sets of business needs, hr and sales forecasting (business stacks), the following schemas would exist:
source_hr_curation
source_hr_ingestion
source_sales_curation
source_sales_ingestion
business_hr_calculation
business_hr_consumption
business_sales_forecasting_calculation
business_sales_forecasting_consumption
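Assuming the MDA layers shown in the example (ingestion and curation for source stacks, calculation and consumption for business stacks; this pairing is taken from the list above, not a platform rule), the schema list can be generated mechanically. schema_names is a hypothetical helper:

```python
from itertools import product

SOURCE_LAYERS = ("ingestion", "curation")
BUSINESS_LAYERS = ("calculation", "consumption")

def schema_names(source_stacks, business_stacks):
    """Generate schema names following [source|business]_[stack_name]_[mda_layer]."""
    names = [f"source_{stack}_{layer}"
             for stack, layer in product(source_stacks, SOURCE_LAYERS)]
    names += [f"business_{stack}_{layer}"
              for stack, layer in product(business_stacks, BUSINESS_LAYERS)]
    return sorted(names)

for name in schema_names(["hr", "sales"], ["hr", "sales_forecasting"]):
    print(name)
```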
Modeling and transformation
Directory structure
Modelling and Transformation projects often end up with a large number of SQL and YAML files in various subdirectories.
Underneath /dataops/modelling, there are several standard directories:
- /dataops/modelling/models - this is where most of your work will be done
- /dataops/modelling/macros - this is where custom test definitions, macros, etc., are stored. See the detailed documentation for Modelling and Transformation
- /dataops/modelling/snapshots - this is where you will define your slowly changing dimension tables (sometimes referred to as snapshots)
Modeling naming
Model names should follow this naming convention: [schema_name]_[model_name].[sql|yml]. Following the previous example, this may create:
/dataops/modelling/sources/hr/ingestion.yml
/dataops/modelling/sources/hr/source_hr_curation_employee.yml
/dataops/modelling/sources/hr/source_hr_curation_employee.sql
/dataops/modelling/sources/sales/ingestion.yml
/dataops/modelling/sources/sales/source_sales_curation_orders.yml
/dataops/modelling/sources/sales/source_sales_curation_orders.sql
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_calculation_salestotals.sql
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_calculation_salestotals.yml
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_calculation_salesmissed.sql
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_calculation_salesmissed.yml
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_consumption_salescommission.sql
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_consumption_salescommission.yml
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_consumption_salesperformance.sql
/dataops/modelling/models/business/sales_forecasting/business_sales_forecasting_consumption_salesperformance.yml
There is some duplication between the directory path and the filename because the filename must be unique across the whole modeling and transformation project.
Useful DataOps.live tricks
Make a commit without running a pipeline
Include [skip ci] in your commit message from any Git client, not just the Web IDE.
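For example, from the command line (the repository path, file, and commit message are purely illustrative; the user.name/user.email flags are only there to make the throwaway demo self-contained):

```shell
# Throwaway repository purely to demonstrate the commit message
git init -q /tmp/dataops-skip-ci-demo
cd /tmp/dataops-skip-ci-demo
echo "example" > notes.txt
git add notes.txt
# [skip ci] tells the platform not to run a pipeline for this commit
git -c user.name="Demo" -c user.email="demo@example.com" commit -q -m "Update notes [skip ci]"
```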