Parquet Load - Maia Documentation

The Parquet Load orchestration component loads data into a table from one or more Parquet files. The component automatically infers the schema from the structure of the selected file(s), creates the corresponding table, and then loads the data into the table.

Use cases

Parquet Load suits a broad range of analytical and data lake ingestion scenarios, given Parquet’s widespread use across the modern data stack. For example, you can use this component to:

Load Parquet files exported from Spark, Pandas, or dbt pipelines into your warehouse for downstream reporting or transformation.
Ingest data lake exports from lakehouse architectures such as Delta Lake, Apache Iceberg, or Apache Hudi.
Bring in machine learning (ML) feature datasets or model outputs stored as Parquet files for monitoring or further analysis.
Load third-party data exports produced by tools such as AWS Glue, Azure Data Factory, or Google Dataflow.
Load Parquet-format exports from cloud analytics services such as Amazon Athena, Google BigQuery, or Azure Synapse for consolidation in your warehouse.

You can authorize this component using either of these two methods:

Cloud Provider Credentials: By default, explicit Cloud Provider Credentials are used to facilitate cross-account or cross-cloud data loads.
Agent Identity: If cloud provider credentials are not specified, the component attempts to use the Matillion Agent’s native service identity to access the source data.

Properties

Name

string

required

A human-readable name for the component.

Connect

Snowflake
Databricks
Google BigQuery

Type

drop-down

required

Snowflake Managed: Load your file from a Snowflake internal stage.
S3: Load your file from an S3 bucket.
Google Cloud Storage: Load your file from a Google Cloud Storage bucket.
Azure Blob Storage: Load your file from an Azure Blob Storage container.

Stage

drop-down

required

Select a staging area for the data. The special value [New Stage] will create a temporary stage to be used for loading the data when the corresponding parameter values are provided. The options in this drop-down menu depend on the values you select for the Database and Schema parameters. If you change these values, the list of available options updates automatically, and the previously selected option may become invalid. When [New Stage] is selected, the component uses the cloud credentials configured for your environment to access the required resources.

Snowflake Managed
S3
Google Cloud Storage
Azure Blob Storage

Staged File Path

string

A stage may include directories. You can browse and select files from a specific directory path, for example, /Example/Path. If this is left blank and the Pattern parameter is also empty, all files from the stage are loaded.

S3 object prefix

file explorer

required

To retrieve the intended files, use the file explorer to enter the container path where the S3 bucket is located, or select from the list of S3 buckets. This must have the format s3://<bucket>/<path>.
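As a rough illustration of the s3://<bucket>/<path> format, the following Python sketch splits a location into its bucket and path. The function name and the example bucket are hypothetical and are not part of the component.

```python
from urllib.parse import urlsplit


def split_s3_location(url: str) -> tuple[str, str]:
    """Split an s3://<bucket>/<path> location into (bucket, path)."""
    parts = urlsplit(url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"expected s3://<bucket>/<path>, got {url!r}")
    # The path component keeps its leading slash; strip it for readability.
    return parts.netloc, parts.path.lstrip("/")


# Hypothetical bucket and prefix, for illustration only.
print(split_s3_location("s3://example-bucket/exports/2024/"))
```

The same approach could be adapted to the gs:// and azure:// formats by changing the expected scheme.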

Encryption

drop-down

required

Decide how the files are encrypted inside the S3 bucket. This property is available when using an existing Amazon S3 location for staging.

Client Side Encryption: Encrypt the data according to a client-side master key. For more information, read Protecting data using client-side encryption.
KMS Encryption: Encrypt the data according to a key stored on KMS. For more information, read Using server-side encryption with AWS KMS keys.
S3 Encryption: Encrypt the data according to a key stored on an S3 bucket. For more information, read Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3).
None: No encryption.

Master key

drop-down

required

The secret definition denoting your master key for client-side encryption. Your password should be saved as a secret definition before using this component. Only available when Encryption is set to Client Side Encryption.

KMS key ID

drop-down

required

The ID of the KMS encryption key you have chosen to use in the Encryption property. Only available when Encryption is set to KMS Encryption.

Storage integration

drop-down

required

Select the storage integration. Storage integrations are required to permit Snowflake to read from and write to a cloud storage location. Integrations must be set up in advance and configured to support Google Cloud Storage. For more information, read Create storage integration.

Google Storage URL location

file explorer

required

Azure storage location

file explorer

required

To retrieve the intended files, use the file explorer to enter the container path where the Azure Storage account is located, or select from the list of storage accounts. This must have the format azure://<STORAGEACCOUNT>/<path>.

Pattern

string

A valid regular expression (regex) to match part of a file’s path or name. Files that match the pattern will be included in the load. If this parameter is left empty, the component automatically uses the pattern *, which matches all files within the specified Staged file path or Stage parameters. The pattern applies to the entire file path, not just the directory defined in Staged file path. The subfolder containing the object to load must be included here.

Only Parquet files are supported.
Ensure that the pattern entered correctly targets the files you intend to load.
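Because the pattern applies to the entire file path, it can help to test a candidate regex against sample paths before running the load. The sketch below uses Python's re module as a stand-in; the file paths are hypothetical, and the use of full-path matching is an assumption of this sketch rather than a statement about how the component evaluates the pattern.

```python
import re

# Hypothetical staged file paths, for illustration only.
paths = [
    "exports/2024/orders_01.parquet",
    "exports/2024/orders_02.parquet",
    "exports/2024/customers.parquet",
    "exports/archive/orders_old.parquet",
]

# The subfolder is part of the pattern because matching covers the full path.
pattern = re.compile(r".*2024/orders_.*\.parquet")

# Assumption: the whole path must match, so the pattern starts with ".*".
matched = [p for p in paths if pattern.fullmatch(p)]
print(matched)
```

Note that in regex syntax, .* is the expression that matches any sequence of characters, and the dot before the file extension is escaped so that it matches a literal dot.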

Type

drop-down

required

S3: Load your file from an S3 bucket.
Google Cloud Storage: Load your file from a Google Cloud Storage bucket.
Azure Blob Storage: Load your file from an Azure Blob Storage container.

S3
Google Cloud Storage
Azure Blob Storage

S3 object prefix

file explorer

required

Google Storage URL location

file explorer

required

To retrieve the intended files, use the file explorer to enter the container path where the Google Cloud Storage bucket is located, or select from the list of Google Cloud Storage buckets. This must have the format gs://<bucket>/<path>. For this to work, an external location with access to the specified location must be configured. For more information, read External Locations | Databricks on AWS.

Azure storage location

file explorer

required

Glob pattern

string

A valid glob pattern to match part of a file’s path or name. Files that match the pattern will be included in the load. If this parameter is left empty, the component automatically uses the pattern *, which matches all files within the specified location parameters. The pattern applies to the entire file path, not just the directory defined in the location parameter.

Only Parquet files are supported.
Ensure that the pattern entered correctly targets the files you intend to load.
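A glob pattern can be tried out against sample paths in a similar way. This sketch uses Python's fnmatch module, whose wildcard rules are only an approximation of the glob dialect the warehouse applies (for example, fnmatch lets * match across "/" separators). The paths are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical file paths under the selected location, for illustration only.
candidate_paths = [
    "exports/2024/orders_01.parquet",
    "exports/2024/orders_02.parquet",
    "exports/2024/customers.parquet",
    "exports/archive/orders_old.parquet",
]

# "*" matches any run of characters; the rest must match literally.
glob_pattern = "exports/2024/orders_*.parquet"

glob_matched = [p for p in candidate_paths if fnmatchcase(p, glob_pattern)]
print(glob_matched)
```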

Type

drop-down

required

Select Google Cloud Storage to load your file from a Google Cloud Storage bucket.

Google Storage URL location

file explorer

required

File filter

string

Filter for specific files by their file name, including their file extension. You can use wildcards in the file name, for example prefix_*{filetype}. If this field is left empty, all files at the specified Google Storage URL location are loaded. If the specified Google Storage URL location contains a *, you can’t apply a file filter.

Configure

Snowflake
Databricks
Google BigQuery

File format

drop-down

required

The default value is set to [New File Format]. Specify a file format, and a temporary format with those settings will be used when the component runs. Alternatively, select a pre-made file format.

Compression

drop-down

required

Select the compression algorithm used on the file being loaded. The default setting is AUTO.

Binary as text

boolean

Determines how columns without an assigned logical data type are handled. When True, such columns are interpreted as UTF-8 text; when False, they are treated as binary data.

Trim space

boolean

When True, removes whitespace from fields. This will only work when Match by column name is set to either Case Insensitive or Case Sensitive in the Advanced Settings below.

Use logical type

boolean

When set to True, interprets Parquet logical type definitions according to the Parquet specification.

Use vectorized scanner

boolean

When True, uses a vectorized scanner for loading Parquet files.

Replace invalid characters

boolean

When True, Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When False, the load operation produces an error when invalid UTF-8 character encoding is detected.

Null if

editor

Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string. This will only work when Match by column name is set to either Case Insensitive or Case Sensitive in the Advanced Settings below.

Ignore corrupt files

boolean

When set to Yes, corrupt input files are skipped and not included in the load.

Ignore missing files

boolean

When set to Yes, files that cannot be found are ignored during the load process.

Modified after

string

An optional timestamp used as a filter. Only files with a modification timestamp after the provided timestamp are ingested.

Modified before

string

An optional timestamp used as a filter. Only files with a modification timestamp before the provided timestamp are ingested.
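The combined effect of Modified after and Modified before can be pictured as a window over file modification times. The following Python sketch is a toy model with hypothetical file names and timestamps. Treating both bounds as exclusive is an assumption of this sketch.

```python
from datetime import datetime, timezone

# Hypothetical (path, last-modified) pairs, for illustration only.
files = [
    ("a.parquet", datetime(2024, 1, 10, tzinfo=timezone.utc)),
    ("b.parquet", datetime(2024, 2, 15, tzinfo=timezone.utc)),
    ("c.parquet", datetime(2024, 3, 20, tzinfo=timezone.utc)),
]


def select_by_modified(files, modified_after=None, modified_before=None):
    """Keep files whose modification time falls inside the optional bounds."""
    selected = []
    for path, modified in files:
        # Assumption: both bounds are exclusive.
        if modified_after is not None and not modified > modified_after:
            continue
        if modified_before is not None and not modified < modified_before:
            continue
        selected.append(path)
    return selected


window = select_by_modified(
    files,
    modified_after=datetime(2024, 2, 1, tzinfo=timezone.utc),
    modified_before=datetime(2024, 3, 1, tzinfo=timezone.utc),
)
print(window)
```

Leaving either bound unset keeps that side of the window open, which mirrors both properties being optional.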

Recursive file lookup

boolean

When set to Yes, files are read recursively from all subdirectories under the specified path. When enabled, partition inference is disabled. To control which files are loaded, use the Glob pattern property instead.

Datetime rebase mode

drop-down

Controls the rebasing of DATE and TIMESTAMP values between Julian and Proleptic Gregorian calendars. Valid values are EXCEPTION, LEGACY, or CORRECTED.

LEGACY: To rebase values from the Julian calendar to the Proleptic Gregorian calendar.
CORRECTED: To read values as-is without rebasing.
EXCEPTION: To throw an error if legacy date/time values are encountered.

INT96 rebase mode

drop-down

Controls the rebasing of INT96 timestamp values between Julian and Proleptic Gregorian calendars. Valid values are EXCEPTION, LEGACY, or CORRECTED.

LEGACY: To rebase values from the Julian calendar to the Proleptic Gregorian calendar.
CORRECTED: To read values as-is without rebasing.
EXCEPTION: To throw an error if legacy INT96 timestamp values are encountered.

Merge schema

boolean

When set to Yes, schemas from all input files are merged during the load.

Reader case sensitive

boolean

If Yes, the column names in the Parquet file are matched with the table schema in a case-sensitive manner. If No, column name matching is case-insensitive.

Rescued data column

string

An optional column name to store data that does not match the schema. When specified, fields that don’t match the expected schema are collected into this column as a JSON string.
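To make the idea of a rescued data column concrete, the sketch below keeps the fields that appear in an expected schema and collects everything else into one JSON string. It is a simplified model: it only considers unexpected field names, and the function name, the example column name _rescued_data, and the sample record are all assumptions made for illustration.

```python
import json


def apply_schema(record: dict, schema: set, rescued_column: str = "_rescued_data") -> dict:
    """Keep fields named in the schema; collect the rest as a JSON string."""
    row = {key: value for key, value in record.items() if key in schema}
    extras = {key: value for key, value in record.items() if key not in schema}
    # Store None when nothing needed rescuing, otherwise a JSON document.
    row[rescued_column] = json.dumps(extras, sort_keys=True) if extras else None
    return row


# Hypothetical record with one field that is not in the expected schema.
print(apply_schema({"id": 1, "name": "a", "extra": True}, {"id", "name"}))
```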

Destination

Snowflake
Databricks
Google BigQuery

Warehouse

drop-down

required

Database

drop-down

required

Schema

drop-down

required

Target table name

string

required

Specify the name of the newly created or existing table to load data into. The table will be created according to the Load strategy you select below.

Load strategy

drop-down

required

Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
Truncate and Insert: Each time the pipeline runs, two operations are performed: first, the table is truncated, meaning all existing rows are deleted. Then, your new rows are inserted. The table itself is never destroyed and recreated.
Fail if Exists: If the specified table name already exists, this pipeline will fail to run, and no data will be copied to the table.
Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It’s appended onto the end of the existing data in the table. If the specified table name doesn’t exist, then an error message will appear and the table is not created. You must always provide a table name.
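The four load strategies described above can be summarized with a small in-memory model, where a dictionary stands in for the warehouse and each table is a list of rows. This is a sketch of the described semantics only, not of how the component is implemented. Creating a missing table under Truncate and Insert is an assumption of the sketch, since the description does not cover that case.

```python
def load(tables: dict, name: str, rows: list, strategy: str) -> None:
    """Apply one load strategy to an in-memory dictionary of tables."""
    exists = name in tables
    if strategy == "Replace":
        # The existing table, if any, is dropped and recreated.
        tables[name] = list(rows)
    elif strategy == "Truncate and Insert":
        # Existing rows are deleted, but the table object itself is kept.
        # Assumption: a missing table is created.
        table = tables.setdefault(name, [])
        table.clear()
        table.extend(rows)
    elif strategy == "Fail if Exists":
        if exists:
            raise RuntimeError(f"table {name!r} already exists")
        tables[name] = list(rows)
    elif strategy == "Append":
        if not exists:
            raise RuntimeError(f"table {name!r} does not exist")
        tables[name].extend(rows)
    else:
        raise ValueError(f"unknown load strategy: {strategy!r}")


# Hypothetical table and rows, for illustration only.
tables = {"orders": [1]}
load(tables, "orders", [2], "Append")
print(tables["orders"])
load(tables, "orders", [3], "Truncate and Insert")
print(tables["orders"])
```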

Catalog

drop-down

required

Schema (database)

drop-down

required

Target table name

string

required

The name of the table to be created. The table will be created according to the Load strategy you select below.

Load strategy

drop-down

required

Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
Truncate and Insert: Each time the pipeline runs, two operations are performed: first, the table is truncated, meaning all existing rows are deleted. Then, your new rows are inserted. The table itself is never destroyed and recreated.
Fail if Exists: If the specified table name already exists, this pipeline will fail to run, and no data will be copied to the table.
Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It’s appended onto the end of the existing data in the table. If the specified table name doesn’t exist, then an error message will appear and the table is not created. You must always provide a table name.

If not specified, the default is Fail if Exists.

Databricks uses COPY INTO, which is idempotent. This means it tracks which source files have already been loaded into the target Delta table and skips them on subsequent runs. Enabling Force load overrides this behavior for the Truncate and Insert and Append load strategies, causing all matching source files to be reprocessed.
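The file-tracking behavior described above, and the effect of Force load, can be pictured with a toy model in which a set records the files already loaded. The real tracking is internal to Databricks. The function and variable names here are hypothetical.

```python
def copy_into(table_rows: list, loaded_files: set, source_files: dict, force: bool = False) -> list:
    """Load rows from files not yet recorded as loaded, unless force is set."""
    processed = []
    for name in sorted(source_files):
        if name in loaded_files and not force:
            continue  # already loaded on a previous run, so skip it
        table_rows.extend(source_files[name])
        loaded_files.add(name)
        processed.append(name)
    return processed


# Hypothetical source files mapped to the rows they contain.
source = {"a.parquet": [1, 2], "b.parquet": [3]}
rows, seen = [], set()

print(copy_into(rows, seen, source))              # first run loads both files
print(copy_into(rows, seen, source))              # second run skips both files
print(copy_into(rows, seen, source, force=True))  # forced run reloads both
print(rows)
```

The final list contains each row twice, which illustrates why forcing a reload with the Append strategy can duplicate data.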

Project

drop-down

required

Select the ID of the Google Cloud Platform project to load data into. The special value [Environment Default] uses the project defined in the environment.

Dataset

drop-down

required

Select the Google BigQuery dataset to load data into. The special value [Environment Default] uses the dataset defined in the environment.

Target table name

string

required

The name of the table to be created in your Google BigQuery project. You can use a Table Input component in a transformation pipeline to access and transform this data after it has been loaded.

Load strategy

drop-down

required

Define what happens if the table name already exists in the specified destination.

Replace: If the specified table name already exists, that table will be destroyed and replaced by the table created during this pipeline run.
Truncate and Insert: Each time the pipeline runs, two operations are performed: first, the table is truncated, meaning all existing rows are deleted. Then, your new rows are inserted. The table itself is never destroyed and recreated.
Fail if Exists: If the specified table name already exists, this pipeline will fail to run.
Append: If the specified table name already exists, then the data is inserted without altering or deleting the existing data in the table. It’s appended onto the end of the existing data in the table. If the specified table name doesn’t exist, then the table will be created, and your data will be inserted into the table.

Advanced Settings

Snowflake
Databricks
Google BigQuery

integer

required

Specify the number of errors or the percentage of errors required to skip the file. This parameter only accepts integer values. The % character is not accepted, so specify a percentage as a number only. This parameter is only available when On Error is set to either Skip File When n Errors or Skip File When n% Errors.
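The difference between the two Skip File options can be sketched as a count threshold versus a percentage threshold. The function below is hypothetical, and the exact boundary comparisons (at least n errors for the count, more than n percent for the percentage) are assumptions of this sketch. Consult the Snowflake documentation for the precise behavior.

```python
def should_skip_file(error_rows: int, total_rows: int, n: int, percent: bool = False) -> bool:
    """Decide whether a file is skipped under a count or percentage threshold."""
    if percent:
        # Assumption: skip when the error percentage exceeds n.
        # Integer arithmetic avoids floating-point rounding at the boundary.
        return total_rows > 0 and error_rows * 100 > n * total_rows
    # Assumption: skip when the error count reaches n.
    return error_rows >= n


print(should_skip_file(3, 100, 3))                 # count threshold reached
print(should_skip_file(10, 100, 10, percent=True)) # exactly 10% does not exceed 10%
```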

Size limit (in bytes)

integer

Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, refer to the Snowflake documentation.

Purge files

boolean

required

When True, purges data files after the data is successfully loaded. Default setting is False.

Match by column name

drop-down

required

Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data.

Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-insensitive.
Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-sensitive.
None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.

Truncate columns

boolean

required

When True, strings are automatically truncated to the target column length. When False (default), the COPY statement produces an error if a loaded string exceeds the target column length.

Force load

boolean

required

Force load

boolean

required

When True, loads all files, regardless of whether they have been loaded previously and haven’t changed since they were loaded. Default setting is False. When set to True, this option reloads files and can lead to duplicated data in a table. The Load strategy in the destination needs to be Truncate and Insert or Append for this to work.

Default rounding mode

drop-down

Select the rounding mode to use when storing numeric data types. For more information, read the Google BigQuery RoundingMode documentation.

ROUND_HALF_AWAY_FROM_ZERO rounds half values away from zero.
ROUND_HALF_EVEN rounds half values to the nearest even value.
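The two rounding modes differ only in how they treat values that lie exactly halfway between two candidates. The sketch below reproduces both behaviors with Python's decimal module. Python's ROUND_HALF_UP mode sends ties away from zero, so it is used here as a stand-in for ROUND_HALF_AWAY_FROM_ZERO. The sample values are arbitrary.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

values = [Decimal("2.5"), Decimal("3.5"), Decimal("-2.5")]

# Ties go away from zero: 2.5 -> 3, 3.5 -> 4, -2.5 -> -3.
away_from_zero = [v.quantize(Decimal("1"), rounding=ROUND_HALF_UP) for v in values]

# Ties go to the nearest even value: 2.5 -> 2, 3.5 -> 4, -2.5 -> -2.
half_even = [v.quantize(Decimal("1"), rounding=ROUND_HALF_EVEN) for v in values]

print(away_from_zero)
print(half_even)
```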
