S3 Load - Maia Documentation

The S3 Load component lets users load data into an existing table from objects stored in Amazon Simple Storage Service (Amazon S3). The S3 Load component never truncates the target table. It has no built-in truncate table option and performs an append-only load. Properties like Truncate Columns and Force Load are unrelated to clearing the table: Truncate Columns governs string value length truncation, while Force Load controls whether already-loaded files are reloaded. This component requires working AWS Credentials, with Read access to the bucket containing the source data files. To access an S3 bucket from a different AWS account, read Background: Cross-account permissions and using IAM roles.

If you’re using a Matillion Full SaaS solution, you may need to allow these IP address ranges from which Full SaaS s will call out to their source systems or to cloud data platforms.

If the component requires access to a cloud provider (AWS, Azure, or GCP), it will use credentials as follows:

If using Matillion Full SaaS: The component will use the cloud credentials associated with your environment to access resources.
If using Hybrid SaaS: By default the component will inherit the agent’s execution role (service account role). However, if there are cloud credentials associated to your environment, these will overwrite the role.

Video example

Properties

Name

string

required

A human-readable name for the component.

Snowflake
Databricks
Amazon Redshift

Stage

drop-down

required

Select a staging area for the data. Staging areas can be created through Snowflake using the CREATE STAGE command. Internal stages can be set up this way to store staged data within Snowflake. Selecting [Custom] will avail the user of properties to specify a custom staging area on S3. Users can add a fully qualified stage by typing the stage name. This should follow the format databaseName.schemaName.stageName

Authentication

drop-down

required

Select the authentication method. Users can choose either:

Credentials: Uses AWS security credentials.
Storage Integration: Use a Snowflake storage integration. A storage integration is a Snowflake object that stores a generated identity and access management (IAM) entity for your external cloud storage, along with an optional set of permitted or blocked storage locations (Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage). More information can be found at CREATE STORAGE INTEGRATION.

Storage Integration

drop-down

required

Select the storage integration. Storage integrations are required to permit Snowflake to read data from and write to a cloud storage location. Integrations must be set up in advance of selecting them. Storage integrations can be configured to support Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage, regardless of the cloud provider that hosts your Snowflake account.

S3 Object Prefix

file explorer

required

To retrieve the intended files, use the file explorer to enter the container path where the S3 bucket is located, or select from the list of S3 buckets.This must have the format S3://<bucket>/<path>.

Pattern

string

required

A string that will partially match all file paths and names that are to be included in the load. Defaults to .* indicating all files within the S3 Object Prefix. This property is a pattern on the complete path of the file, and is not just relative to the directory configured in the S3 Object Prefix property.The subfolder containing the object to load must be included here.

Encryption

drop-down

required

Decide how the files are encrypted inside the S3 bucket. This property is available when using an existing Amazon S3 location for staging.

None: No encryption.
SSE KMS: Encrypt the data according to a key stored on KMS. Read AWS Key Management Service (AWS KMS) to learn more.
SSE S3: Encrypt the data according to a key stored on an S3 bucket. Read Using server-side encryption with Amazon S3-managed encryption keys (SSE-S3) to learn more.

KMS Key ID

drop-down

The ID of the KMS encryption key you have chosen to use in the Encryption property.

Master Key

string

required

Your client-side encryption master key. This property is only available when using client-side encryption.

Warehouse

drop-down

required

The Snowflake warehouse used to run the queries. The special value [Environment Default] uses the warehouse defined in the environment. Read Overview of Warehouses to learn more.

Database

drop-down

required

The Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema

drop-down

required

The Snowflake schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Target Table

string

required

Select an existing table to load data into. The tables available for selection depend on the chosen schema.

Load Columns

dual listbox

Choose the columns to load. If you leave this parameter empty, all columns will be loaded.

Format

drop-down

required

Select a pre-made file format that will automatically set many of the S3 Load component properties. These formats can be created through the Create File Format component.Users can add a fully qualified format by typing the format name. This should read as databaseName.schemaName.formatName

File Type

drop-down

required

Select the type of data to load. Available data types are: AVRO, CSV, JSON, ORC, PARQUET, and XML. Some file types may require additional formatting—this is explained in the Snowflake documentation.Component properties will change to reflect the selected file type. Click one of the tabs below for properties applicable to that file type.

Compression

drop-down

required

Select the compression method if you wish to compress your data. If you do not wish to compress at all, select NONE. The default setting is AUTO.

Null If

editor

required

Specify one or more strings (one string per row of the table) to convert to NULL values. When one of these strings is encountered in the file, it is replaced with an SQL NULL value for that field in the loaded table. Click + to add a string.

Trim Space

boolean

required

When True, removes whitespace from fields. Default setting is False.

Compression

drop-down

required

Select the compression method if you wish to compress your data. If you do not wish to compress at all, select NONE. The default setting is AUTO.

Record Delimiter

string

Input a delimiter for records. This can be one or more single-byte or multibyte characters that separate records in an input file.Accepted values include: leaving the field empty; a newline character \ or its hex equivalent 0x0a; a carriage return \\r or its hex equivalent 0x0d. Also accepts a value of NONE.If you set the Skip Header to a value such as 1, then you should use a record delimiter that includes a line feed or carriage return, such as \ or \\r. Otherwise, your entire file will be interpreted as the header row, and no data will be loaded.The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.The default (if the field is left blank) is a newline character.

Field Delimiter

string

Input a delimiter for fields. This can be one or more single-byte or multibyte characters that separate fields in an input file.Accepted characters include common escape sequences, octal values (prefixed by \), or hex values (prefixed by 0x). Also accepts a value of NONE.This delimiter is limited to a maximum of 20 characters.While multi-character delimiters are supported, the field delimiter cannot be a substring of the record delimiter, and vice versa. For example, if the field delimiter is “aa”, the record delimiter cannot be “aabb”.The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.Do not specify characters used for other file type options such as Escape or Escape Unenclosed Field.The Default setting is a comma: ,.

Skip Header

integer

Specify the number of rows to skip. The default is 0.If Skip Header is used, the value of the record delimiter will not be used to determine where the header line is. Instead, the specified number of CRLF will be skipped. For example, if the value of Skip Header = 1, skips to the first CRLF that it finds. If you have set the Field Delimiter property to be a single character without a CRLF, then skips to the end of the file (treating the entire file as a header).

Skip Blank Lines

boolean

required

When True, ignores blank lines that only contain a line feed in a data file and does not try to load them. Default setting is False.

Date Format

string

Define the format of date values in the data files to be loaded. If a value is not specified or is AUTO, the value for the DATE_INPUT_FORMAT session parameter is used. The default setting is AUTO.

Time Format

string

Define the format of time values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIME_INPUT_FORMAT session parameter is used. The default setting is AUTO.

Timestamp Format

string

Define the format of timestamp values in the data files to be loaded. If a value is not specified or is AUTO, the value for the TIMESTAMP_INPUT_FORMAT session parameter is used.

Escape

string

Specify a single character to be used as the escape character for field values that are enclosed. Default is NONE.

Escape Unenclosed Field

string

Specify a single character to be used as the escape character for unenclosed field values only. Default is \\. If you have set a value in the property Field Optionally Enclosed, all fields will become enclosed, rendering the Escape Unenclosed Field property redundant, in which case, it will be ignored.

Trim Space

boolean

required

When True, removes whitespace from fields. Default setting is False.

Field Optionally Enclosed

string

Specify a character used to enclose strings. The value can be NONE, single quote character ', or double quote character ". To use the single quote character, use the octal or hex representation 0x27 or the double single-quoted escape ''. Default is NONE.When a field contains one of these characters, escape the field using the same character. For example, to escape a string like this: 1 “2” 3, use double quotation to escape, like this: 1 ""2"" 3.

Null If

editor

required

Error On Column Count Mismatch

boolean

required

When True, generates an error if the number of delimited columns in an input file does not match the number of columns in the corresponding table. When False (default), an error is not generated and the load continues. If the file is successfully loaded in this case:

Where the input file contains records with more fields than columns in the table, the matching fields are loaded in order of occurrence in the file, and the remaining fields are not loaded.
Where the input file contains records with fewer fields than columns in the table, the non-matching columns in the table are loaded with NULL values.

Empty Field As Null

boolean

required

When True, inserts NULL values for empty fields in an input file. This is the default setting.

Replace Invalid Characters

boolean

required

Snowflake replaces invalid UTF-8 characters with the Unicode replacement character. When False (default), the load operation produces an error when invalid UTF-8 character encoding is detected.

Encoding Type

drop-down

required

Select the string that specifies the character set of the source data when loading data into a table. Refer to the Snowflake documentation for more information.

On Error

drop-down

required

Decide how to proceed upon an error.

Abort Statement: Aborts the load if any error is encountered. This is the default setting.
Continue: Continue loading the file.
Skip File: Skip file if any errors are encountered in the file.
Skip File When n Errors: Skip file when the number of errors in the file is equal to or greater than the specified number in the next property, n.
Skip File When n% Errors: Skip file when the percentage of errors in the file exceeds the specified percentage of n.

integer

required

Specify the number of errors or the percentage of errors required to skip the file. This parameter only accepts integer characters. % is not accepted. Specify percentages as a number only.

Size Limit (B)

integer

required

Specify the maximum size, in bytes, of data to be loaded for a given COPY statement. If the maximum is exceeded, the COPY operation discontinues loading files. For more information, refer to the Snowflake documentation.

Purge Files

boolean

required

When True, purges data files after the data is successfully loaded. Default setting is False.

Match By Column Name

drop-down

required

Specify whether to load semi-structured data into columns in the target table that match corresponding columns represented in the data.

Case Insensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-insensitive.
Case Sensitive: Load semi-structured data into columns in the target table that match corresponding columns represented in the data. Column names should be case-sensitive.
None: The COPY operation loads the semi-structured data into a variant column or, if a query is included in the COPY statement, transforms the data.

Truncate Columns

boolean

required

When True, strings are automatically truncated to the target column length. When False (default), the COPY statement produces an error if a loaded string exceeds the target column length.

Force Load

boolean

required

When True, loads all files, regardless of whether they have been loaded previously and haven’t changed since they were loaded. Default setting is False.When set to True, this option reloads files and can lead to duplicated data in a table.

Metadata Fields

dual listbox

required

Snowflake metadata columns available to include in the load. For more information, read Querying Metadata for Staged Files. This property is only available when an external stage is selected.

Location

file explorer

required

Pattern

string

required

Files in the specified Location will only be loaded if their names match the pattern you specify here. You can use wildcards in the pattern. Enter .* to match all files in the location. This property is mandatory.

Catalog

drop-down

required

Select a Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.

Schema (Database)

drop-down

required

The Databricks schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.

Table

string

required

Select an existing table to load data into. The tables available for selection depend on the chosen schema (database).

Load Columns

dual listbox

Choose the columns to load. If you leave this parameter empty, all columns will be loaded.

File Type

drop-down

required

Select the type of data to load. Available data types are: CSV, JSON, PARQUET, and AVRO.Component properties will change to reflect the selected file type. Click one of the tabs below for properties applicable to that file type.

CSV
JSON
PARQUET
AVRO

Header

boolean

required

Select Yes to use the first line of the file as column names. If not specified, the default is No.

Field Delimiter

string

required

Enter the delimiter character used to separate fields in the CSV file. This can be one or more single-byte or multibyte characters that separate fields in an input file. If none is specified, the default is a comma.Accepted characters include common escape sequences, octal values (prefixed by \\), or hex values (prefixed by 0x). This delimiter is limited to a maximum of 20 characters. The specified delimiter must be a valid UTF-8 character and not a random sequence of bytes.

Date Format

string

required

Describe the format of date values in the data files to be loaded. If none is specified, the default is yyyy-MM-dd.

Timestamp Format

string

required

Describe the format of timestamp values in the data files to be loaded. If none is specified, the default is yyyy-MM-dd'T'HH:mm:ss.[SSS][XXX].

Encoding Type

string

required

The encoding type to use when decoding the input files. If none is specified, the default is UTF-8.

Ignore Leading Whitespace

boolean

required

When Yes, removes whitespace from the beginning of fields. If not specified, the default is No.

Ignore Trailing Whitespace

boolean

required

When Yes, removes whitespace from the end of fields. If not specified, the default is No.

Infer Schema

boolean

required

If Yes, will attempt to determine the input schema automatically from the data contained in the CSV file. If not specified, the default is No.

Multi Line

boolean

required

If Yes, will parse records which may span multiple lines. If not specified, the default is No.

Null Value

string

required

Sets the string representation of a null value. If not specified, the default value is an empty string.

Empty Value

string

required

Sets the string representation of an empty value. If not specified, the default value is an empty string.

Recursive File Lookup

boolean

required

If Yes, partition inference is disabled. If not specified, the default is No. To control which files are loaded, use the Pattern property instead.

Force Load

boolean

required

If Yes, files are loaded regardless of whether they’ve been loaded before. If not specified, the default is No.

Date Format

string

required

Describe the format of date values in the data files to be loaded. If none is specified, the default is yyyy-MM-dd.

Timestamp Format

string

required

Describe the format of timestamp values in the data files to be loaded. If none is specified, the default is yyyy-MM-dd'T'HH:mm:ss.[SSS][XXX].

Encoding Type

string

required

The encoding type to use when decoding the input files. If none is specified, the default is UTF-8.

Multi Line

boolean

required

If Yes, will parse records which may span multiple lines. If not specified, the default is No.

Primitives As String

boolean

required

If Yes, primitive data types are interpreted as strings in JSON files. If not specified, the default is No.

Prefers Decimals

boolean

required

If Yes, all floating-point values will be treated as a decimal type in JSON files. If not specified, the default is No.

Allow Comments

boolean

required

If Yes, will allow JAVA/C++ comments in JSON records. If not specified, the default is No.

Allow Unquoted Field Names

boolean

required

If Yes, will allow unquoted JSON field names. If not specified, the default is No.

Allow Single Quotes

boolean

required

If Yes, will allow single quotes in addition to double quotes in JSON records. If not specified, the default is No.

Allow Numeric Leading Zeros

boolean

required

If Yes, will allow leading zeros in numbers in JSON records. If not specified, the default is No.

Allow Backslash Escaping Any Character

boolean

required

If Yes, will allow the quoting of all characters using the backslash quoting mechanism in JSON records. If not specified, the default is No.

Allow Unquoted Control Chars

boolean

required

If Yes, will allow JSON strings to include unquoted control characters in JSON records. If not specified, the default is No.

Drop Field If All Null

boolean

required

If Yes, will ignore columns that contain only null values in JSON records. If not specified, the default is No.

Recursive File Lookup

boolean

required

If Yes, partition inference is disabled. If not specified, the default is No. To control which files are loaded, use the Pattern property instead.

Force Load

boolean

required

If Yes, files are loaded regardless of whether they’ve been loaded before. If not specified, the default is No.

Schema

drop-down

required

Select the table schema. The special value [Environment Default] uses the schema defined in the environment. For more information on using multiple schemas, read Schemas.

Target Table Name

drop-down

required

Select an existing table to load data into. The selected target table is where data will be loaded into from the S3 object.When selecting a target table with the Redshift SUPER data type, the S3 Load component will only offer the following data file types: JSON, ORC, and PARQUET. However, if you select to include columns of data file types other than SUPER in the Load Columns property, all data file types will be available for selection.

Load Columns

dual listbox

required

Select which of the target table’s columns should be loaded.

S3 URL Location

file explorer

required

S3 Object Prefix

string

required

All files that begin with this prefix will be included in the load into the target table.

IAM Role ARN

drop-down

required

Select an IAM role Amazon Resource Name (ARN) that is already attached to your Redshift cluster, and that has the necessary permissions to access S3.This setting is optional, since without this style of setup, the credentials of the environment will be used.Read the Redshift documentation for more information about using a role ARN with Redshift.

Data File Type

drop-down

required

The type of expected data to load. Some may require additional formatting, explained in the Amazon documentation.Available options are: Avro, CSV, Delimited, Fixed Width, JSON, ORC, PARQUET.Component properties will change to reflect the choice made here and give options based on the specific data file type.

Fixed Width Spec

string

required

Fixed width only. Loads the data from a file where each column width is a fixed length, rather than columns being separated by a delimiter. For more information, read the AWS documentation.To use variables in this field, type the name of the variable prefixed by the dollar symbol and surrounded by { } brackets, as follows: ${variable}. Once you type ${, a drop-down list of autocompleted suggested variables will appear. This list updates as you type; for example, if you type ${date, functions and variables containing date will be listed.

JSON Layout

string

required

(JSON only) Defaults to “auto”, which will work for the majority of JSON files if the fields match the table field names. More information about JSON file types can be found in the AWS documentation.

Delimiter

string

required

(CSV only) Specify a delimiting character to separate columns. The default character is a comma: ,A [TAB] character can be specified using ”\ ”.

CSV Quoter

string

required

(CSV only) Specify the character to be used as the quote character when using the CSV file type.

Explicit IDs

drop-down

required

Select whether to load data from the S3 objects into an IDENTITY column. Read the Redshift documentation for more information.

S3 Bucket Region

drop-down

required

Select the Amazon S3 region that hosts the selected S3 bucket. This setting is optional and can be left as “None” if the bucket is in the same region as your Redshift cluster.

Compression Method

drop-down

required

Specify whether the input is compressed in any of the following formats: BZip2, gzip, LZop, Zstd, or not at all (None).

Encoding

drop-down

required

Select the encoding format the data is in. The default for this setting is UTF-8.

Replace Invalid Characters

string

required

Specify a single character that will replace any invalid unicode characters in the data. The default is: ?.

Remove Quotes

drop-down

required

(Delimited, Fixed Width only) When “Yes”, removes surrounding quotation marks from strings in the incoming data. All characters within the quotation marks, including delimiters, are retained.If a string has a beginning single or double quotation mark but no corresponding ending mark, the COPY command fails to load that row and returns an error.The default setting is “No”.For more information, read the AWS documentation.

Maximum Errors

integer

required

The maximum number of individual parsing errors permitted before the load will fail. If the number of errors reaches this limit, the load fails; errors below the limit are skipped and logged. The Amazon default is 1000.

Date Format

string

required

Defaults to ‘auto’. Users can specify their preferred date format manually if they wish. For more information on date formats, read the AWS documentation.

Time Format

string

required

Defaults to ‘auto’. Users can specify their preferred time format manually if they wish. For more information on time formats, read the AWS documentation.

Ignore Header Rows

integer

required

Specify the number of rows at the top of the file to ignore. The default is 0 (no rows ignored).

Accept Any Date

drop-down

required

Yes: Invalid dates such as ‘45-65-2020’ are not considered an error, but will be loaded as NULL values. This is the default setting.
No: Invalid dates will return an error.

Ignore Blank Lines

drop-down

required

(CSV only) When “Yes”, ignores blank lines that only contain a line feed in a data file and does not try to load them. This is the default setting.

Truncate Columns

drop-down

required

Yes: Any instance of data in the input file that is too long to fit into the specified target column width will be truncated to fit, instead of causing an error.
No: Any data in the input file that is too long to fit into the specified target column width will cause an error. This is the default setting.

Fill Record

drop-down

required

When “Yes”, allows data files to be loaded when contiguous columns are missing at the end of some records. The missing columns are filled with either zero-length strings or NULLs, as appropriate for the data types of the columns in question. This is the default setting.

Trim Blanks

drop-down

required

Yes: Removes the trailing white space characters from a VARCHAR string. This property only applies to columns with a VARCHAR data type.
No: Does not remove trailing white space characters from the input data.

NULL As

string

required

Loads fields that match the specified NULL string. The default is \\N with an additional \ at the start to escape. Case-sensitive.For more information, read the AWS documentation.

Empty As Null

drop-down

required

When “Yes”, empty columns in the input file will become NULL values. This is the default setting.

Blanks As Null

drop-down

required

When “Yes”, loads blank columns, which consist of only whitespace characters, as NULL. This is the default setting.This option applies only to CHAR and VARCHAR columns. Blank fields for other data types, such as INT, are always loaded with NULL. For example, a string that contains three space characters in succession (and no other characters) is loaded as a NULL. The default behavior, without this option, is to load the space characters as is.

Comp Update

drop-down

required

When “On”, compression encodings are automatically applied during a COPY command. This is the default setting.This is usually a good idea to optimize the compression used when storing the data.

Stat Update

drop-down

required

When “On”, governs automatic computation and refreshing of optimizer statistics at the end of a successful COPY command. This is the default setting.

Escape

drop-down

required

(Delimited only) When “Yes”, the backslash character \ in input data is treated as an escape character. The character that immediately follows the backslash character is loaded into the table as part of the current column value, even if it is a character that normally serves a special purpose. For example, you can use this parameter to escape the delimiter character, a quotation mark, an embedded newline character, or the escape character itself when any of these characters is a legitimate part of a column value.The default setting is “No”.

Round Decimals

drop-down

required

When “Yes”, any decimals are rounded to fit into the column in any instance where the number of decimal places in the input is larger than defined for the target column. This is the default setting.

Manifest

drop-down

required

When “Yes”, the given object prefix is that of a manifest file. The default setting is “No”.For more information, read the AWS documentation.

S3 access

To access an S3 bucket from a different AWS account, the following is required:

Set up cross-account access via AWS roles.
The user must type in the bucket they want to access or use a variable to load/unload to those structures.

File patterns with Snowflake

In Snowflake, the Pattern parameter in the COPY INTO syntax is a pattern on the complete path of the file and is not just relative to the directory configured in the S3 Object Prefix parameter. The table below provides an example of S3 Object Prefix and Pattern behaviors, including success and failure states.

S3Object Prefix	Pattern	Outcome	Comments
s3://testbucket/	testDirectory/alphabet_0_0_0.csv.gz	Success	This is the format that the S3 Load Generator will generate.
s3://testbucket/testDirectory/	testDirectory/alphabet_0_0_0.csv.gz	Success	Loads the file successfully because the pattern is matching the full path.
s3://testbucket/testDirectory/	.*.csv.gz	Success	Would load all files ending in .csv.gz in the testDirectory directory.
s3://testbucket/testDirectory/	alphabet_0_0_0.csv.gz	Failure	Does not load the file because the pattern does not match.

Documentation Index

​Video example

​Properties

​S3 access

​File patterns with Snowflake

Video example

Properties

S3 access

File patterns with Snowflake