Use cases
Parquet Load suits a broad range of analytical and data lake ingestion scenarios, given Parquet’s widespread use across the modern data stack. For example, you can use this component to:- Load Parquet files exported from Spark, Pandas, or dbt pipelines into your warehouse for downstream reporting or transformation.
- Ingest data lake exports from lakehouse architectures such as Delta Lake, Apache Iceberg, or Apache Hudi.
- Bring in machine learning (ML) feature datasets or model outputs stored as Parquet files for monitoring or further analysis.
- Load third-party data exports produced by tools such as AWS Glue, Azure Data Factory, or Google Dataflow.
- Load Parquet-format exports from cloud analytics services such as Amazon Athena, Google BigQuery, or Azure Synapse for consolidation in your warehouse.
- Cloud Provider Credentials: By default, explicit Cloud Provider Credentials are used to facilitate cross-account or cross-cloud data loads.
- Agent Identity: If cloud provider credentials are not specified, the component attempts to use the Matillion Agent’s native service identity to access the source data.
Properties
A human-readable name for the component.
Connect
- Snowflake
- Databricks
- Google BigQuery
- Snowflake Managed: Load your file from a Snowflake internal stage.
- S3: Load your file from an S3 bucket.
- Google Cloud Storage: Load your file from a Google Cloud Storage bucket.
- Azure Blob Storage: Load your file from an Azure Blob Storage container.
Select a staging area for the data. The special value [New Stage] will create a temporary stage to be used for loading the data when the corresponding parameter values are provided.The options in this drop-down menu depend on the values you select for the
Database and Schema parameters. If you change these values, the list of available options updates automatically, and the previously selected option may become invalid.When [New Stage] is selected, the component uses the cloud credentials configured for your environment to access the required resources.- Snowflake Managed
- S3
- Google Cloud Storage
- Azure Blob Storage
A stage may include directories. The user has the option to browse and select files from a specific directory path, for example,
/Example/Path.If this is left blank, and if the Pattern parameter is also empty, all files from the stage will be loaded.A valid regular expression (regex) to match part of a file’s path or name. Files that match the pattern will be included in the load.If this parameter is left empty, the component automatically uses the pattern
*, which matches all files within the specified Staged file path or Stage parameters. The pattern applies to the entire file path, not just the directory defined in Staged file path.The subfolder containing the object to load must be included here.- Only Parquet files are supported.
- Ensure that the pattern entered correctly targets the files you intend to load.
Configure
- Snowflake
- Databricks
- Google BigQuery
The default value is set to
[New File Format]. Specify a file format, and a temporary format with those settings will be used when the component runs. Alternatively, select a pre-made file format.Select the compression algorithm used on the file being loaded.The default setting is AUTO.
Determines how columns without an assigned logical data type are handled. When True, such columns are interpreted as UTF-8 text; when False, they are treated as binary data.
When True, removes whitespace from fields.This will only work when
Match by column name is set to either Case Insensitive or Case Sensitive in the Advanced Settings below.When set to True, interprets Parquet logical type definitions according to the Parquet specification.
When True, uses a vectorized scanner for loading Parquet files.
Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When False, the load operation produces an error when invalid UTF-8 character encoding is detected.
Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string.This will only work when
Match by column name is set to either Case Insensitive or Case Sensitive in the Advanced Settings below.Destination
- Snowflake
- Databricks
- Google BigQuery
Specify the name of the newly created or existing table to load data into. The table will be created according to the
Load strategy you select below.- Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
- Truncate and Insert: Each time the pipeline runs, two operations are performed: first, the table is truncated, meaning all existing rows are deleted. Then, your new rows are inserted. The table itself is never destroyed and recreated.
- Fail if Exists: If the specified table name already exists, this pipeline will fail to run, and no data will be copied to the table.
- Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It’s appended onto the end of the existing data in the table. If the specified table name doesn’t exist, then an error message will appear and the table is not created. You must always provide a table name.
Advanced Settings
- Snowflake
- Databricks
- Google BigQuery
Specify the number of errors or the percentage of errors required to skip the file. This parameter only accepts integer characters.
% is not accepted. Specify percentages as a number only.This parameter is only available when On Error is set to either Skip File When n Errors or Skip File When n% Errors.Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, refer to the Snowflake documentation.
When True, purges data files after the data is successfully loaded. Default setting is False.
Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data.
- Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-insensitive.
- Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-sensitive.
- None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.
When True, strings are automatically truncated to the target column length. When False (default), the COPY statement produces an error if a loaded string exceeds the target column length.
When True, loads all files, regardless of whether they have been loaded previously and haven’t changed since they were loaded. Default setting is False.When set to True, this option reloads files and can lead to duplicated data in a table.
