Snowflake data warehouse was built from the cloud up to support modern data warehousing at any scale. Its multicluster, scale-out approach is designed to separate compute from storage: the architecture allows multiple, independent compute clusters to access a shared pool of data without resource contention. Customers pay for what they use, without a stair-step approach to resources and pricing. The model, cost, features, and scalability have already led some organizations to postpone Hadoop adoption, and Snowflake counts more than 600 customers ranging from small startups to large enterprises.
Snowflake data warehouse’s pricing model includes two items: the cost of storage and the cost of compute resources consumed. The charge for storage is per terabyte, compressed, per month. The charge for compute is based on the processing consumed to run your queries or perform a service.
Snowflake SQL includes support for objects in JSON, XML, Avro, and Parquet using a special data type that can handle flexible-schema, nested, and hierarchical data in table form. There are no indexes; instead, zone maps give the engine an abstract understanding of where data lives in the database. Also, no downtime is required for anything, including upgrades or cluster expansion.
Concurrency, a clear challenge in database scale-out, is a focus at Snowflake. Its automatic concurrency scaling presents a single logical virtual warehouse composed of multiple compute clusters spread across availability zones.
Here at Intersys, we tested Snowflake’s data loading and querying capabilities with respect to heterogeneous data formats in the standard edition and intend to continue evaluating its features.
About Sample Data
We obtained public data from the National Center for Charitable Statistics (NCCS) along with a semi-structured tax filing return in JSON format. Intersys frequently volunteers for various organizations and found this data interesting to connect.
For this Proof of Concept, we used 350 MB of structured data (NCCS data) in CSV format and 60 MB of semi-structured data (tax filings) in JSON format. We hosted both files in Amazon S3 (Simple Storage Service), a web service for storing data in the Amazon cloud.
Loading Data into Snowflake
The data loading process into Snowflake is intuitive and easy to follow. To load data from the S3 bucket into Snowflake, we followed these steps:
- Step 1: Create a new table to persist the data in Snowflake
- Step 2: Create a File Format
- Step 3: Create a Named Stage Object
- Step 4: Load the data (Structured and Semi-structured) into the new table
Let’s go through these steps one by one.
NOTE: We have demonstrated the data load process only for the JSON file (semi-structured data) to keep this blog concise. Please note that the data loading process for the CSV file (structured data) is identical and easy to follow.
Step 1: First, we have to create a table to hold the data from the data files. Tables for structured data are created like in any other database, but for semi-structured data Snowflake provides its own purpose-built data types: VARIANT, OBJECT, and ARRAY.
Of these data types, we will use VARIANT for this use case, as it is a tagged universal type that can store values of any other type, including OBJECT and ARRAY.
The below screenshot shows the table definition:
In this code snippet, we have created a table named User_Tax_Data_JSON containing a single column named ‘src’ of the VARIANT data type.
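For readers following along without the screenshot, the table definition would look roughly like this sketch (the exact DDL in our POC may have differed slightly):

```sql
-- Table to hold raw JSON documents; one VARIANT column per the description above
CREATE OR REPLACE TABLE User_Tax_Data_JSON (
  src VARIANT
);
```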
Step 2: Next, we create a file format. A named file format provides a convenient way to store all of the format details required for loading data from files into tables. This step is optional; however, it is highly recommended when you plan to load a large number of files of a specific format. For this POC, we created a file format as shown below:
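As a sketch of what such a file format might look like (the option values here are illustrative assumptions, not necessarily the exact ones from our POC):

```sql
-- Named file format for JSON loads; STRIP_OUTER_ARRAY is an assumption,
-- useful when the source file wraps all records in one top-level JSON array
CREATE OR REPLACE FILE FORMAT jsonformat
  TYPE = 'JSON'
  STRIP_OUTER_ARRAY = TRUE;
```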
Step 3: Create a Named Stage Object
The CREATE STAGE command is used to create a named external stage. An external named stage object references files stored in an S3 bucket. It acts like a pointer to the bucket: once we create a named stage object, we can access all the public files within that folder.
This step is recommended when you plan to frequently load a lot of files from the same source. Below is the named external stage that we created:
Notice that we referenced the file format (jsonformat) created in Step 2 when creating the external stage. Note that the named stage object will only work if the files have public access. If the files are private, you would need to provide an AWS key and secret key.
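The stage creation can be sketched as follows; the stage name and bucket path below are hypothetical placeholders, not the actual ones from our POC:

```sql
-- Named external stage pointing at a public S3 location (path is a placeholder)
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-public-bucket/tax-data/'
  FILE_FORMAT = jsonformat;   -- file format created in Step 2

-- For private files, credentials would be supplied on the stage instead:
-- CREATE OR REPLACE STAGE my_s3_stage
--   URL = 's3://my-bucket/tax-data/'
--   CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...')
--   FILE_FORMAT = jsonformat;
```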
Step 4: The last step is to load the data from the files into the table, using the COPY INTO &lt;table&gt; command.
As shown in the snapshot below, we load the table User_Tax_Data_JSON from the named stage object we created. We load the file index_2011_form_990_Updated.json and, if any errors are encountered during the loading process, we skip the file.
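A sketch of that load command (the stage name here is hypothetical; the table and file names are from our POC):

```sql
-- Load the JSON file from the stage; skip the whole file on any error
COPY INTO User_Tax_Data_JSON
  FROM @my_s3_stage/index_2011_form_990_Updated.json
  ON_ERROR = 'SKIP_FILE';
```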
Data Loading and Querying Statistics
We observed that Snowflake’s structured and semi-structured data loading capabilities are among the best in the business. We loaded 60 MB (around 203K records) of semi-structured data in 6.5 seconds, as shown:
We had spun up a “Small” size warehouse for this POC; note that with a larger warehouse, the loading and querying processes would be even faster because of Snowflake’s linear scalability.
Similarly, it took around 58 seconds to load 700 MB (1.5 Million records) of structured data into a Snowflake table from Amazon S3.
Querying Capabilities and Related Statistics
Next, we tried to gauge the querying capabilities involving structured and semi-structured data. It is safe to say that Snowflake is very fast at querying both structured and semi-structured data, thanks to its columnar storage and data compression features. The snapshot below showcases a query run jointly on structured and semi-structured data: over 188K rows were returned in less than 1.2 seconds, an impressive result in terms of speed and efficiency for modern data warehousing applications.
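To give a flavor of such a joint query, here is an illustrative sketch; the structured table name and all column and JSON field names below are hypothetical, since only the JSON table name comes from our POC. The `src:field::TYPE` syntax is Snowflake’s path access into a VARIANT column:

```sql
-- Hypothetical join of structured NCCS data with the JSON tax filings;
-- nccs_data, ein, org_name, EIN, and TaxPeriod are assumed names
SELECT n.ein,
       n.org_name,
       t.src:TaxPeriod::STRING AS tax_period  -- path access into the VARIANT column
FROM   nccs_data n
JOIN   User_Tax_Data_JSON t
  ON   n.ein = t.src:EIN::STRING;
```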
Snowflake data warehouse has moved past the constraints of earlier-generation databases and honed a promising cloud analytic database, eminently elastic on a foundation of separated compute and storage. After our POC, we concluded that Snowflake is market-leading in what you would want from a multi-purpose cloud data warehouse / analytical database.