Snowflake

๐Ÿ‘

Estimated Setup time

15 minutes

There are 2 ways to provide Snowflake data access to Aampe

  1. Unloading data into Google Cloud Storage from Snowflake [Link]
  2. Provide direct access for Aampe user to your snowflake dataset (and tables) with event data [Link]

Unloading data into Google Cloud Storage from Snowflake

๐Ÿ“˜

If you already are pushing to a GCP bucket, we would be able to set up a direct transfer from that bucket. If not the setup work in itself should take just a few minutes

  1. Create Snowflake Integration

Integrations are named, first-class Snowflake objects that avoid the need for passing explicit cloud provider credentials such as secret keys or access tokens; instead, integration objects reference a Cloud Storage service account. We will provide you with the bucket link mentioned below

CREATE STORAGE INTEGRATION gcs_int
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = GCS
  ENABLED = TRUE
  STORAGE_ALLOWED_LOCATIONS = ('gcs-bucket-link');
  1. Obtain GCP Service Account

A GCP Service account is created by Snowflake as part of (Step 1)

DESC STORAGE INTEGRATION gcs_int;

In the output of the above command, the STORAGE_GCP_SERVICE_ACCOUNT property contains the GCP service account name. An example output here:

DESC STORAGE INTEGRATION gcs_int;
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------+
| property                    | property_type | property_value                                                              | property_default |
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------|
| ENABLED                     | Boolean       | true                                                                        | false            |
| STORAGE_ALLOWED_LOCATIONS   | List          | bucket-link-that-we-will-send-you                                           | []               |             
| STORAGE_GCP_SERVICE_ACCOUNT | String        | [email protected]                  |                  |
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------+

We will take this service account ([email protected]) and provide it permissions to add/modify data to the GCS bucket.

  1. Create File Formats

A parquet file format can be created in Snowflake as follows to unload snowflake data in the parquet format. Any other format like CSV or JSON also can be created. Here is an example script to create parquet format:

USE DB_NAME;
USE SCHEMA_NAME;
CREATE
OR REPLACE FILE FORMAT parquet_unload_file_format
TYPE = PARQUET
SNAPPY_COMPRESSION = TRUE
COMMENT = 'FILE FORMAT FOR UNLOADING AS PARQUET FILES';

More options and file types can be found here: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html

  1. Create External Stage

In Snowflake, an EXTERNAL STAGE object references data files stored in a location outside of Snowflake.

USE DB_NAME;
USE SCHEMA SCHEMA_NAME;
CREATE OR REPLACE STAGE parquet_unload_gcs_stage
URL = 'gcs-'bucket-link'
storage_integration  = gcs_int
FILE_FORMAT = parquet_unload_file_format
COMMENT = 'GCS Stage for the Snowflake external Parquet export';
  1. Unload Data

From Snowflake, use the COPY command in Snowflake to unload data from a Snowflake table into a GCS bucket. To retain the column names in the output file, use the HEADER = TRUE copy option.

COPY INTO @<EXTERNAL_STAGE_NAME>/<file_name_prefix>
FROM DB_NAME.SCHEMA_NAME.TABLE_NAME
HEADER = TRUE;


For example,
COPY INTO @parquet_unload_gcs_stage/region
FROM SF01.TEST01.REGION
HEADER=TRUE;

For the minimal fields we need, you can refer to Data Models

There are various parameters for COPY command which can limit the partitions and provide optimisations: https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html

  1. Automate the unloading

What we recommend is to automate the COPY command that unloads each dayโ€™s events (single partition) into the external stage by using Snowflake TASK for it to run at a schedule
https://docs.snowflake.com/en/sql-reference/sql/create-task.html#

An example of same would be

CREATE TASK copy_to_gcs 
SCHEDULE = 'USING CRON 0 10 * * * Europe/Rome' 
COMMENT = 'Test Schedule' AS 
COPY INTO @parquet_unload_gcs_stage/region
from SF01.TEST01.REGION
HEADER=TRUE;

If you use Airflow on your side for orchestration you can also use that to schedule this query at a regular interval

References:

https://docs.snowflake.com/en/user-guide/data-load-gcs-config.html
https://docs.snowflake.com/en/user-guide/data-unload-prepare.html

Provide direct access for Aampe user to your snowflake dataset (and tables) with event data

Our main purpose will be to create a user for Aampe which will have read access to the data warehouse and underlying tables. We will use this user (and in turn role) to fetch data into our system

  1. Start by using account admin to grant access in your snowflake worksheet console
USE ROLE ACCOUNTADMIN;
  1. Create a custom role to provide read access. If you already have a role that you use to define read only access you can skip this step and use that directly everywhere instead of AAMPE_ROLE
CREATE ROLE AAMPE_ROLE;


#Create user with password authentication
CREATE USER AAMPE_DEVELOPER
 PASSWORD             = 'aampe_developer'     
 EMAIL                = '[email protected]'
 MUST_CHANGE_PASSWORD = FALSE
 DEFAULT_ROLE         = AAMPE_ROLE
 DEFAULT_WAREHOUSE    = <YOUR_WAREHOUSE>
  1. Grant the read role to the user (if you already have a role with read access then you can use that directly)
GRANT ROLE AAMPE_ROLE TO USER AAMPE_DEVELOPER;
  1. Grant access to the new role. You can restrict to certain db/tables as per needed. Small reminder to execute all queries if doing through UI
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON WAREHOUSE <MY_WH> TO ROLE AAMPE_ROLE;
GRANT USAGE ON DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON ALL TABLES IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON ALL VIEWS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;

USE ROLE ACCOUNTADMIN;
GRANT USAGE ON future SCHEMAS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON future TABLES IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON future VIEWS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;

Now all you need now is to send us these values:

a) Account
e.g (If your URL is https://tzviibq-hr75240.snowflakecomputing.com account will be tzviibq-hr75240)
b) Data Warehouse Name
c) Database Name
d) Table Names (that contain events data)
e) Schema Name

and we will be setting up the direct access to your snowflake table