Snowflake
Estimated Setup time
1 hour
There are 2 ways to provide Snowflake data access to Aampe
For both the methods, customers generally create a dataset for Aampe with limited data tables/views that are only necessary for Aampe for the restricted access.
- Unloading data into Google Cloud Storage from Snowflake [Link]
- Provide direct access for Aampe user to your snowflake dataset (and tables) with event data [Link]
Unloading data into Google Cloud Storage from Snowflake
- Create Snowflake Integration
Integrations are named, first-class Snowflake objects that avoid the need for passing explicit cloud provider credentials such as secret keys or access tokens; instead, integration objects reference a Cloud Storage service account. We will provide you with the bucket link mentioned below
CREATE STORAGE INTEGRATION gcs_int
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = GCS
ENABLED = TRUE
STORAGE_ALLOWED_LOCATIONS = ('gcs-bucket-link');
- Obtain GCP Service Account
A GCP Service account is created by Snowflake as part of (Step 1)
DESC STORAGE INTEGRATION gcs_int;
In the output of the above command, the STORAGE_GCP_SERVICE_ACCOUNT property contains the GCP service account name. An example output here:
DESC STORAGE INTEGRATION gcs_int;
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------+
| property | property_type | property_value | property_default |
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------|
| ENABLED | Boolean | true | false |
| STORAGE_ALLOWED_LOCATIONS | List | bucket-link-that-we-will-send-you | [] |
| STORAGE_GCP_SERVICE_ACCOUNT | String | [email protected] | |
+-----------------------------+---------------+-----------------------------------------------------------------------------+------------------+
We will take this service account ([email protected]) and provide it permissions to add/modify data to the GCS bucket.
- Create File Formats
A parquet file format can be created in Snowflake as follows to unload snowflake data in the parquet format. Any other format like CSV or JSON also can be created. Here is an example script to create parquet format:
USE DB_NAME;
USE SCHEMA_NAME;
CREATE
OR REPLACE FILE FORMAT parquet_unload_file_format
TYPE = PARQUET
SNAPPY_COMPRESSION = TRUE
COMMENT = 'FILE FORMAT FOR UNLOADING AS PARQUET FILES';
More options and file types can be found here: https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html
- Create External Stage
In Snowflake, an EXTERNAL STAGE object references data files stored in a location outside of Snowflake.
USE DB_NAME;
USE SCHEMA SCHEMA_NAME;
CREATE OR REPLACE STAGE parquet_unload_gcs_stage
URL = 'gcs-'bucket-link'
storage_integration = gcs_int
FILE_FORMAT = parquet_unload_file_format
COMMENT = 'GCS Stage for the Snowflake external Parquet export';
- Unload Data
From Snowflake, use the COPY command in Snowflake to unload data from a Snowflake table into a GCS bucket. To retain the column names in the output file, use the HEADER = TRUE copy option.
COPY INTO @<EXTERNAL_STAGE_NAME>/<file_name_prefix>
FROM DB_NAME.SCHEMA_NAME.TABLE_NAME
HEADER = TRUE;
For example,
COPY INTO @parquet_unload_gcs_stage/region
FROM SF01.TEST01.REGION
HEADER=TRUE;
For the minimal fields we need, you can refer to Data Models
There are various parameters for COPY command which can limit the partitions and provide optimisations: https://docs.snowflake.com/en/sql-reference/sql/copy-into-location.html
- Automate the unloading
What we recommend is to automate the COPY command that unloads each day’s events (single partition) into the external stage by using Snowflake TASK for it to run at a schedule
https://docs.snowflake.com/en/sql-reference/sql/create-task.html#
An example of same would be
CREATE TASK copy_to_gcs
SCHEDULE = 'USING CRON 0 10 * * * Europe/Rome'
COMMENT = 'Test Schedule' AS
COPY INTO @parquet_unload_gcs_stage/region
from SF01.TEST01.REGION
HEADER=TRUE;
If you use Airflow on your side for orchestration you can also use that to schedule this query at a regular interval
References:
https://docs.snowflake.com/en/user-guide/data-load-gcs-config.html
https://docs.snowflake.com/en/user-guide/data-unload-prepare.html
Provide direct access for Aampe user to your snowflake dataset (and tables) with event data
Our main purpose will be to create a user for Aampe which will have read access to the data warehouse and underlying tables. We will use this user (and in turn role) to fetch data into our system
- Start by using account admin to grant access in your snowflake worksheet console
USE ROLE ACCOUNTADMIN;
- Create a custom role to provide read access. If you already have a role that you use to define read only access you can skip this step and use that directly everywhere instead of AAMPE_ROLE
CREATE ROLE AAMPE_ROLE;
#Create user with password authentication
CREATE USER AAMPE_DEVELOPER
PASSWORD = 'aampe_developer'
EMAIL = '[email protected]'
MUST_CHANGE_PASSWORD = FALSE
DEFAULT_ROLE = AAMPE_ROLE
DEFAULT_WAREHOUSE = <YOUR_WAREHOUSE>
- Grant the read role to the user (if you already have a role with read access then you can use that directly)
GRANT ROLE AAMPE_ROLE TO USER AAMPE_DEVELOPER;
- Grant access to the new role. You can restrict to certain db/tables as per needed. Small reminder to execute all queries if doing through UI
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON WAREHOUSE <MY_WH> TO ROLE AAMPE_ROLE;
GRANT USAGE ON DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT USAGE ON ALL SCHEMAS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON ALL TABLES IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON ALL VIEWS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON future SCHEMAS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON future TABLES IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
GRANT SELECT ON future VIEWS IN DATABASE <MY_DB> TO ROLE AAMPE_ROLE;
Now all you need now is to send us these values:
a) Account
e.g (If your URL is https://tzviibq-hr75240.snowflakecomputing.com account will be tzviibq-hr75240)
b) Data Warehouse Name
c) Database Name
d) Table Names (that contain events data)
e) Schema Name
and we will be setting up the direct access to your snowflake table to fetch the required data at regular intervals
Updated 4 months ago