Chunk Text - Maia Documentation

Production use of this feature is available for specific editions only. Contact our sales team for more information.

Chunk Text is an orchestration component that chunks text inside Snowflake. The work is pushed down to a Python user-defined function (UDF) that runs on your Snowflake warehouse. Specify an existing Snowflake source table and set a target table. If the target table already exists, it will be overwritten. You can choose Text, Markdown, or HTML as your data format.

Properties

Name

string

required

A human-readable name for the component.

Database

drop-down

required

The Snowflake source database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema

drop-down

required

The Snowflake source schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Table

drop-down

required

An existing Snowflake table to use as the input. The tables available will depend on the schema you select.

Text Column

drop-down

required

The column in your table that holds the text data you wish to chunk.

Include Input Columns

dual listbox

Select any other input columns that you wish to include in the output table.

Data Format

drop-down

required

The format of the text to chunk.

Text: Your text data is chunked using a recursive character splitting method.
Markdown: Your text data is chunked using a header splitting method.
HTML: Your text data is chunked using a header splitting method.

If your text data is Markdown, you can still set this parameter to Text.

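The chunking properties that follow depend on the Data Format you select: the recursive character splitting properties apply to Text, and the header splitting properties apply to Markdown and HTML.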

Chunking Method

drop-down

required

Currently supports recursive character splitting. Recursive character splitting lets you define characters to recursively split your text on until the chunks are small enough. Common characters are \n\n, \n, a single space, and a period (.). This method attempts to keep paragraphs (and then sentences, and then words) together as long as possible.
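The recursive behavior described above can be sketched in plain Python. This is a minimal illustration, not the component's actual UDF: the function name `recursive_split` and its exact merging rules are assumptions made for this example, and overlap is left out to keep the idea visible.

```python
def recursive_split(text, separators, chunk_size):
    """Split on the first separator; recurse with the remaining
    separators for any piece that is still larger than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width cuts.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)

    chunks, current = [], ""
    for piece in text.split(sep):
        # Try to keep neighbouring pieces together while they still fit.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: split it on finer separators.
            chunks.extend(recursive_split(piece, rest, chunk_size))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is here.\n\nPara two is a bit longer than one.\n\nThree."
for chunk in recursive_split(sample, ["\n\n", "\n", " ", "."], 30):
    print(repr(chunk))
```

In this sketch the first paragraph fits in one chunk, while the second is longer than 30 characters and is therefore split again on spaces.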

Chunk Size

integer

required

The maximum size of each chunk, in characters. For example, 100 or 250.

Chunk Overlap

integer

required

The number of overlapping characters between two chunks. Overlapping chunks can help to preserve context across chunks. The integer value sets the total number of characters to overlap. For example, 10 or 25.
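To make the relationship between chunk size and chunk overlap concrete, here is a simplified fixed-width sketch. It is an assumption-laden illustration: the helper `sliding_chunks` is hypothetical, and it ignores separators, whereas the component combines overlap with recursive character splitting.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-width chunks where each chunk repeats the
    last chunk_overlap characters of the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=3)
print(chunks)
```

Each chunk after the first starts with the final three characters of the chunk before it, which is the context-preserving effect the overlap setting is meant to provide.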

Separators

column editor

required

Define separator characters to recursively split on. Common characters are \n\n, \n, a single space, and a period (.). Order matters in this list. To preserve the structure of a text document as much as possible, order your separators from coarsest to finest, for example with \n\n above the period (.). You can reorder your rows with click-and-drag.
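The effect of separator order can be seen by looking only at the first pass of a split. This is a hedged sketch: `first_level_pieces` is a hypothetical helper written for this example and shows just the first split, not the full recursive process.

```python
def first_level_pieces(text, separators):
    """Return the pieces produced by the first separator in the list
    that actually occurs in the text."""
    for sep in separators:
        if sep in text:
            return [piece for piece in text.split(sep) if piece]
    return [text]


text = "First point. Second point.\n\nNew paragraph. Final point."

# Paragraph break listed first: the text is cut at the paragraph boundary.
by_paragraph = first_level_pieces(text, ["\n\n", "."])

# Period listed first: the text is cut at sentence ends, and the
# paragraph break ends up buried inside one of the pieces.
by_sentence = first_level_pieces(text, [".", "\n\n"])

print(by_paragraph)
print(by_sentence)
```

With the paragraph separator first, whole paragraphs are candidates for a chunk; with the period first, the paragraph boundary no longer drives where chunks begin.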

Chunking Method

drop-down

required

Currently supports header splitting, which lets you specify the HTML headers to split your text on. You don’t need to include the HTML tag characters <>; h1, h2, and so on are sufficient.

Headers To Split On

column editor

required

Header: Specify HTML header syntax to define headers to split on. For example, use “h1” to add a Header 1 to the list of headers to split on. Your output table will include metadata for each row of chunked text, showing which header element the chunk falls under.
Alias: An alias for this header to contextualize the operation. For example, “Header 1”, “H1”, “Page title”, and so on.
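Header splitting with Header and Alias pairs can be sketched using Python's standard `html.parser` module. This is an illustrative assumption, not the component's implementation: the class `HeaderSplitter`, the function `split_on_headers`, and the shape of the returned rows (a `chunk` plus a `metadata` dictionary keyed by alias) are all hypothetical.

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    def __init__(self, headers_to_split_on):
        super().__init__()
        self.aliases = dict(headers_to_split_on)  # e.g. {"h1": "Header 1"}
        self.active = {}       # header tag -> title currently in effect
        self.in_header = None  # header tag being read, if any
        self.header_text = []
        self.buffer = []
        self.rows = []

    def flush(self):
        # Emit the text gathered so far, tagged with the active headers.
        text = " ".join(" ".join(self.buffer).split())
        if text:
            metadata = {self.aliases[t]: v for t, v in self.active.items()}
            self.rows.append({"chunk": text, "metadata": metadata})
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.aliases:
            self.flush()
            self.in_header, self.header_text = tag, []

    def handle_endtag(self, tag):
        if tag == self.in_header:
            # A new header replaces headers at the same or a deeper level
            # ("h1" < "h2" holds as a plain string comparison for h1..h6).
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.active[tag] = "".join(self.header_text).strip()
            self.in_header = None

    def handle_data(self, data):
        (self.header_text if self.in_header else self.buffer).append(data)


def split_on_headers(html, headers_to_split_on):
    parser = HeaderSplitter(headers_to_split_on)
    parser.feed(html)
    parser.close()
    parser.flush()
    return parser.rows


html = "<h1>Guide</h1><p>Intro text.</p><h2>Setup</h2><p>Install it.</p>"
for row in split_on_headers(html, [("h1", "Header 1"), ("h2", "Header 2")]):
    print(row)
```

Each row pairs a chunk with the headers it sits under, keyed by the alias you chose, which mirrors the kind of metadata described for the output table.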

Timeout

integer

required

The number of seconds to wait for script termination. After the set number of seconds has elapsed, the script is forcibly terminated. The default is 360 seconds (6 minutes).

Database

drop-down

required

The target Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema

drop-down

required

The target Snowflake schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Table

string

required

The name of your Snowflake target table. If the table already exists, it will be overwritten when you run this pipeline.

Output Column Name

string

required

Set a contextual name for your chunked text output column.