> ## Documentation Index
> Fetch the complete documentation index at: https://docs.maia.ai/llms.txt
> Use this file to discover all available pages before exploring further.

# Sync All Tables

export const m_runner = "Maia runner";

export const maia = "Maia";

Process the latest files from cloud storage to maintain tables in your cloud data platform.

If you have configured a streaming pipeline with Amazon S3 or Azure Blob as the destination, these {maia} pre-built pipelines can be used to load the Avro files into Snowflake, Databricks, or Amazon Redshift.

Pre-built pipelines are available at [Matillion Exchange](https://exchange.matillion.com/data-productivity-cloud/pipeline/pre-built-pipelines-for-streaming/).

***

## Compatibility

| Cloud data platform | Full SaaS | Hybrid SaaS |
| ------------------- | --------- | ----------- |
| Snowflake           | ✅         | ✅           |
| Databricks          | ❌         | ✅           |
| Amazon Redshift     | ❌         | ✅           |

***

## Installation

1. Open a branch on your {maia} project.
2. In the **Files** panel, click **Add** → **Browse Exchange**.
3. Search for `streaming`, and locate the pipeline **Load streaming data from cloud storage**.
4. Click the pipeline to import it into your project.

You should now have a folder named **Imported from Exchange**, with a sub-folder named **Load Streaming Data from Cloud Storage**. This will contain the latest version of the pre-built pipelines.

***

## Usage

1. Open the "Example" pipeline.
2. Copy the "Sync All Tables - Template" [Run Orchestration](/docs/components/run-orchestration) component and paste it into your own orchestration pipeline.
3. Click into the "Sync All Tables - Template" component and edit the `Set Scalar Variables` and `Set Grid Variables` parameters accordingly.
4. You can now run or schedule your orchestration pipeline to keep Snowflake up to date with your streaming files.

***

## Set Scalar Variables

The following variables can be set in the `Set Scalar Variables` parameter.

<Tabs>
  <Tab title="Snowflake">
    <ResponseField name="cloud_storage_url" type="string" required>
      The url of the cloud storage location that the streaming pipeline is writing to. This should take one of the following formats:

      * `s3://<bucket>/<prefix>`
      * `azure://<storage_account>.blob.core.windows.net/<container>/<prefix>`
      * `gs://<bucket>/<prefix>`
    </ResponseField>

    <ResponseField name="warehouse" type="string" required>
      The Snowflake virtual warehouse used to execute the SQL statements. Read [Overview of Warehouses](https://docs.snowflake.com/en/user-guide/warehouses-overview.html) to learn more.
    </ResponseField>

    <ResponseField name="target_database" type="string" required>
      The Snowflake database where the external table and target tables will be created. Read [Databases, Tables and Views - Overview](https://docs.snowflake.com/en/guides-overview-db) to learn more.
    </ResponseField>

    <ResponseField name="stage_schema" type="string">
      The schema containing the external stage, and where the external table will be created. If not specified, the **target\_schema** will be used.
    </ResponseField>

    <ResponseField name="target_schema" type="string" required>
      The schema where the target tables will be created, unless **use\_source\_schemas** has been set to **Y**. Read [Database, Schema, and Share DDL](https://docs.snowflake.com/en/sql-reference/ddl-database.html) to learn more.
    </ResponseField>

    <ResponseField name="external_stage" type="string" required>
      The name of an existing external stage that contains the files output by the streaming pipeline. The URL of the external stage must contain the `cloud_storage_url`.
    </ResponseField>

    <ResponseField name="external_table" type="string" required>
      The external table that will be created to read the files output by the streaming pipeline.
    </ResponseField>

    <ResponseField name="use_source_schemas" type="string">
      Create the target tables in a schema with the same name as the schema containing the source table. If the schema doesn't already exist, the pipeline will try to create it. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="target_prefix" type="string" required>
      A prefix to add to the source table name to generate the target table name. If no prefix is specified, the target table will have the same name as the source table.
    </ResponseField>

    <ResponseField name="fully_qualify_target_table" type="string">
      Includes the source database and schema in the target table name. If `use_source_schemas = N`, it is recommended to set this to `Y`, unless you are confident that your source table names will always be unique. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="transformation_type" type="string" required>
      The type of transformation used when applying the change events to the target table. Available options:

      * **Copy Table:** The target table will be maintained as a copy of the source table.
      * **Copy Table with Soft Deletes:** Same as `Copy Table`, but records deleted in the source table will be retained in the target table.
      * **Change Log:** All change events will be extracted and appended to the target table.

      <Note>
        A primary key is required on the source table for `Copy Table` and `Copy Table with Soft Deletes` transformations. The primary key is used by the pre-built pipeline to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to `Change Log` for that table.
      </Note>
    </ResponseField>

    <ResponseField name="append_metadata" type="string" required>
      Whether to add all metadata columns to the target table, or just the minimum required for the selected `transformation_type`. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="bytes_to_decimal_function" type="string" required>
      The name of a user defined function (UDF) that will be created to convert VariableScaleDecimals back to a decimal representation. If no function name is specified, any columns of type VariableScaleDecimal in the Avro files will be created as Variants in Snowflake.
    </ResponseField>

    <ResponseField name="schema_drift_action" type="string" required>
      If the pipeline detects that there have been schema changes in the source table, and which are not compatible with the current target table, the target table can be altered to accept the new data. Options are `Update Target` or `Fail Job`.
    </ResponseField>
  </Tab>

  <Tab title="Databricks">
    <ResponseField name="cloud_storage_url" type="string" required>
      The url of the cloud storage location that the streaming pipeline is writing to. This should take one of the following formats:

      * `s3://<bucket>/<prefix>`
      * `abfss://<container>@<storage_account>.dfs.core.windows.net/<prefix>`
    </ResponseField>

    <ResponseField name="catalog" type="string" required>
      The name of the [Databricks Unity Catalog](https://docs.databricks.com/en/data-governance/unity-catalog/index.html) containing the schemas. For workspaces that do not have Unity Catalog enabled, this should be set to hive\_metastore.
    </ResponseField>

    <ResponseField name="stage_schema" type="string">
      The schema where the working tables will be created. If not specified, the **target\_schema** will be used.
    </ResponseField>

    <ResponseField name="target_schema" type="string" required>
      The [Databricks schema](https://docs.databricks.com/en/data-governance/unity-catalog/create-schemas.html) where the target tables will be created, unless **use\_source\_schemas** has been set to **Y**.
    </ResponseField>

    <ResponseField name="use_source_schemas" type="string">
      Create the target tables in a schema with the same name as the schema containing the source table. If the schema doesn't already exist, the pipeline will try to create it. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="tag_processed_files" type="string" required>
      Once the Avro files have been processed, update the tags or metadata in cloud storage. Options are `Y` or `N`.

      For Amazon S3, a tag of `matillion_cdc_processed = true` will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped. To manage cloud storage costs, a lifecycle rule can be created on the bucket which filters on this tag.

      <Note>
        To read and write the tags in Amazon S3, the `ECSTaskRole` for the {m_runner} will need to be updated to include the `s3:GetObjectTagging` and `s3:GetObjectTagging` actions.
      </Note>

      For Azure storage accounts, the file metadata will be updated with `matillion_cdc_processed = true`. This option currently has limited benefits with Azure storage accounts. On storage accounts with a hierarchical namespace (required for Databricks Volumes), Azure lifecycle management rules do not provide an option to filter on file metadata.

      <Note>
        To add the file metadata, the {m_runner} will need to be updated to load the Python packages `azure-identity` and `azure-storage-file-datalake`. For more information, read [Loading additional Python libraries](/docs/guides/optional-runner-parameters#loading-additional-python-libraries).
      </Note>
    </ResponseField>

    <ResponseField name="stage_prefix" type="string" required>
      A prefix to add to the names of the working tables.
    </ResponseField>

    <ResponseField name="target_prefix" type="string" required>
      A prefix to add to the source table name to generate the target table name. If no prefix is specified, the target table will have the same name as the source table.
    </ResponseField>

    <ResponseField name="fully_qualify_target_table" type="string">
      Includes the source database and schema in the target table name. If `use_source_schemas` = `N`, it is recommended to set this to `Y`, unless you are confident that your source table names will always be unique. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="transformation_type" type="string" required>
      The type of transformation used when applying the change events to the target table. Available options:

      * **Copy Table:** The target table will be maintained as a copy of the source table.
      * **Copy Table with Soft Deletes:** Same as `Copy Table`, but records deleted in the source table will be retained in the target table.
      * **Change Log:** All change events will be extracted and appended to the target table.

      <Note>
        A primary key is required on the source table for `Copy Table` and `Copy Table with Soft Deletes` transformations. The primary key is used by the pre-built pipeline to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to `Change Log` for that table.
      </Note>
    </ResponseField>

    <ResponseField name="append_metadata" type="string" required>
      Whether to add all metadata columns to the target table, or just the minimum required for the selected `transformation_type`. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="bytes_to_decimal_function" type="string" required>
      Reserved for future use.
    </ResponseField>

    <ResponseField name="schema_drift_action" type="string" required>
      If the pipeline detects that there have been schema changes in the source table, and which are not compatible with the current target table, the target table can be altered to accept the new data. Options are `Update Target`, `Fail Job`, or `Skip`. For non-Unity Catalog workspaces, this must be set to `Skip`.
    </ResponseField>
  </Tab>

  <Tab title="Redshift">
    <ResponseField name="cloud_storage_url" type="string" required>
      The URL of the cloud storage location that the streaming pipeline is writing to. This should take the following format: `s3://<bucket>/<prefix>`.
    </ResponseField>

    <ResponseField name="iam_role" type="string" required>
      The Amazon Resource Name (ARN) for an IAM Role with the necessary permissions to access the S3 bucket containing the streaming event data.
    </ResponseField>

    <ResponseField name="region" type="string">
      The AWS region of the S3 bucket that contains the streaming event data, if it's different from the region of the Redshift cluster.
    </ResponseField>

    <ResponseField name="stage_schema" type="string">
      The schema where the working tables will be created. If not specified, the **target\_schema** will be used.
    </ResponseField>

    <ResponseField name="target_schema" type="string" required>
      The Redshift schema where the target tables will be created, unless **use\_source\_schemas** has been set to **Y**.
    </ResponseField>

    <ResponseField name="use_source_schemas" type="string">
      Create the target tables in a schema with the same name as the schema containing the source table. If the schema doesn't already exist, the pipeline will try to create it. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="stage_prefix" type="string" required>
      A prefix to add to the source table name to generate the stage and metadata table names.
    </ResponseField>

    <ResponseField name="target_prefix" type="string" required>
      A prefix to add to the source table name to generate the target table name. If no prefix is specified, the target table will have the same name as the source table.
    </ResponseField>

    <ResponseField name="fully_qualify_target_table" type="string">
      Includes the source database and schema in the target table name. If `use_source_schemas` = `N`, it is recommended to set this to `Y`, unless you are confident that your source table names will always be unique. Options are `Y` or `N`.
    </ResponseField>

    <ResponseField name="transformation_type" type="string" required>
      The type of transformation used when applying the change events to the target table. Available options:

      * **Copy Table:** The target table will be maintained as a copy of the source table.
      * **Copy Table with Soft Deletes:** Same as `Copy Table`, but records deleted in the source table will be retained in the target table.
      * **Change Log:** All change events will be extracted and appended to the target table.

      <Note>
        A primary key is required on the source table for `Copy Table` and `Copy Table with Soft Deletes` transformations. The primary key is used by the pre-built pipeline to merge updates into the target table. If the source table doesn't have a primary key, the transformation type will be updated to `Change Log` for that table.
      </Note>
    </ResponseField>

    <ResponseField name="max_files_to_process" type="number" required>
      The maximum number of files that will be processed per source table in a single pipeline execution.
    </ResponseField>

    <ResponseField name="tag_processed_files" type="string" required>
      Once the Avro files have been processed, update the tags or metadata in cloud storage. Options are `Y` or `N`. A tag of `matillion_cdc_processed = true` will be added to each file. If a file already has the maximum 10 tags, the tagging process will be skipped. To manage cloud storage costs, a lifecycle rule can be created on the bucket which filters on this tag.

      <Note>
        To be able to read and write the tags in S3, the `ECSTaskRole` for the {m_runner} will need to be updated to include the `s3:GetObjectTagging` and `s3:GetObjectTagging` actions.
      </Note>
    </ResponseField>
  </Tab>
</Tabs>

***

## Set Grid Variables

The following variables can be set in the `Set Grid Variables` parameter.

<Tabs>
  <Tab title="Snowflake">
    <ResponseField name="primary_key_override" type="drop-down">
      Provide a list of primary key columns for the source tables. By default, the pipeline will read the primary key columns from the change data capture Avro files. However, if the source table does not a have a primary key defined in its Data Definition Language (DDL), a list of unique columns can be specified here to enable **Copy Table** transformations.

      <Note>
        The values for the source\_database, source\_schema, source\_table, and source\_column are case sensitive, and must match the source database.
      </Note>
    </ResponseField>
  </Tab>

  <Tab title="Databricks">
    <ResponseField name="primary_key_override" type="drop-down">
      Provide a list of primary key columns for the source tables. By default, the pipeline will read the primary key columns from the change data capture Avro files. However, if the source table does not a have a primary key defined in its Data Definition Language (DDL), a list of unique columns can be specified here to enable **Copy Table** transformations.

      <Note>
        The values for the source\_database, source\_schema, source\_table, and source\_column are case sensitive, and must match the source database.
      </Note>
    </ResponseField>
  </Tab>

  <Tab title="Amazon Redshift">
    <ResponseField name="primary_key_override" type="drop-down">
      Provide a list of primary key columns for the source tables. By default, the pipeline will read the primary key columns from the change data capture Avro files. However, if the source table does not a have a primary key defined in its Data Definition Language (DDL), a list of unique columns can be specified here to enable **Copy Table** transformations.

      <Note>
        The values for the source\_database, source\_schema, source\_table, and source\_column are case sensitive, and must match the source database.
      </Note>
    </ResponseField>

    <ResponseField name="exclude_columns" type="drop-down" required>
      The list of columns in the Avro files that should be excluded from loading into Redshift. For example, if the source table has columns that are too large to load into Redshift, they can be listed here.
    </ResponseField>
  </Tab>
</Tabs>
