# Cortex Parse Document

export const ComponentMetadata = ({warehouses, unsupportedWarehouses = [], componentType, connectionInputs, connectionOutputs}) => {
  const allWarehouses = [...warehouses.map(w => ({
    name: w,
    supported: true
  })), ...unsupportedWarehouses.map(w => ({
    name: w,
    supported: false
  }))];
  return <div style={{
    background: 'var(--colors-background-light, #f9fafb)',
    border: '1px solid var(--colors-border-default, #e5e7eb)',
    borderRadius: '12px',
    padding: '20px 28px',
    marginBottom: '28px',
    boxShadow: '0 1px 4px rgba(0,0,0,0.10)'
  }}>
      <table style={{
    width: '100%',
    borderCollapse: 'collapse'
  }}>
        <tbody>
          <tr>
            <td style={{
    fontWeight: '600',
    paddingRight: '32px',
    paddingBottom: '14px',
    whiteSpace: 'nowrap',
    verticalAlign: 'middle',
    width: '180px'
  }}>Project Availability</td>
            <td style={{
    paddingBottom: '14px',
    verticalAlign: 'middle'
  }}>
              <div style={{
    display: 'flex',
    flexWrap: 'wrap',
    gap: '8px'
  }}>
                {allWarehouses.map((w, i) => <span key={i} style={{
    background: w.supported ? '#dcfce7' : '#fee2e2',
    color: w.supported ? '#15803d' : '#b91c1c',
    border: `1px solid ${w.supported ? '#bbf7d0' : '#fca5a5'}`,
    borderRadius: '9999px',
    padding: '3px 12px',
    fontSize: '0.85rem',
    fontWeight: '500',
    whiteSpace: 'nowrap'
  }}>
                    {w.name} {w.supported ? '✅' : '❌'}
                  </span>)}
              </div>
            </td>
          </tr>
          <tr>
            <td style={{
    fontWeight: '600',
    paddingRight: '32px',
    paddingBottom: '14px',
    whiteSpace: 'nowrap',
    verticalAlign: 'middle'
  }}>Component Type</td>
            <td style={{
    paddingBottom: '14px',
    verticalAlign: 'middle'
  }}>{componentType}</td>
          </tr>
          <tr>
            <td style={{
    fontWeight: '600',
    paddingRight: '32px',
    paddingBottom: '14px',
    whiteSpace: 'nowrap',
    verticalAlign: 'middle'
  }}>Connection Inputs</td>
            <td style={{
    paddingBottom: '14px',
    verticalAlign: 'middle'
  }}>{connectionInputs}</td>
          </tr>
          <tr>
            <td style={{
    fontWeight: '600',
    paddingRight: '32px',
    whiteSpace: 'nowrap',
    verticalAlign: 'middle'
  }}>Connection Outputs</td>
            <td style={{
    verticalAlign: 'middle'
  }}>{connectionOutputs}</td>
          </tr>
        </tbody>
      </table>
    </div>;
};

<ComponentMetadata warehouses={["Snowflake"]} unsupportedWarehouses={["Databricks", "Amazon Redshift", "BigQuery"]} componentType="Transformation" connectionInputs="None" connectionOutputs="Unlimited" />

<Badge color="green" shape="pill" stroke size="lg">Public preview</Badge>

<Info>
  Production use of this feature is available for specific editions only. [Contact our sales team](https://www.matillion.com/contact) for more information.
</Info>

The [Cortex Parse Document](https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex) transformation component extracts content from PDFs and image files for use in your pipeline. It uses [Snowflake Cortex](https://www.snowflake.com/en/data-cloud/cortex/) to read a document from an internal or external stage and returns the extracted content as an object that contains JSON-encoded objects as strings.

To use this component, you must use a Snowflake role that has been granted the [SNOWFLAKE.CORTEX\_USER database role](https://docs.snowflake.com/en/sql-reference/snowflake-db-roles#label-snowflake-db-roles-cortex-schema). Read [Required Privileges](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions#label-cortex-llm-privileges) to learn more about granting this privilege.

To learn more about Snowflake Cortex, such as availability, usage quotas, managing costs, and more, read [Large Language Model (LLM) Functions (Snowflake Cortex)](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions).

As per [Snowflake's documentation](https://docs.snowflake.com/en/user-guide/snowflake-cortex/parse-document#using-parse-document):

> *PARSE\_DOCUMENT supports processing of documents stored in an internal Snowflake stage, or an external stage. In creating your stage, Server Side Encryption is required. Otherwise, PARSE\_DOCUMENT will return an error that the provided file isn't in the expected format or is client-side encrypted.*

### Use case

You can use this component to obtain a range of information from PDFs and image files. For example, use it to:

* Automatically extract customer details from completed PDF forms.
* Extract details from scanned receipts when processing expenses.

***

## Video example

<iframe width="560" height="315" src="https://www.youtube.com/embed/h-gQ9JEfkqc?si=h4XiiXUN4P-mNLDx&enablejsapi=1" title="YouTube video player" frameBorder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" referrerPolicy="strict-origin-when-cross-origin" allowFullScreen />

***

## Properties

<ResponseField name="Name" type="string" required>
  A human-readable name for the component.
</ResponseField>

{/* <!-- param-start:[database] | warehouses: [snowflake] --> */}

<ResponseField name="Database" type="drop-down" required>
  The Snowflake database. The special value `[Environment Default]` uses the database defined in the environment. Read [Databases, Tables and Views - Overview](https://docs.snowflake.com/en/guides-overview-db) to learn more.
</ResponseField>

{/* <!-- param-start:[schema] | warehouses: [snowflake] --> */}

<ResponseField name="Schema" type="drop-down" required>
  The Snowflake schema. The special value `[Environment Default]` uses the schema defined in the environment. Read [Database, Schema, and Share DDL](https://docs.snowflake.com/en/sql-reference/ddl-database.html) to learn more.
</ResponseField>

{/* <!-- param-start:[stage] | warehouses: [snowflake] --> */}

<ResponseField name="Stage" type="drop-down" required>
  The internal stage (Snowflake managed) or external stage (such as Amazon S3, Azure Blob Storage, or Google Cloud Storage) that stores the files to extract content from.
</ResponseField>

{/* <!-- param-start:[filePattern] | warehouses: [snowflake] --> */}

<ResponseField name="File Pattern" type="string">
  Use a file pattern to match files by name or extension. For example, `file1.pdf` matches any file with that name and extension.

  Regular expressions are supported. For example, `.*\.pdf` matches every file in the stage whose name ends in `.pdf`.
</ResponseField>
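To preview which staged files a pattern might select, you can try it locally first. The following is a minimal sketch using Python's `re` module, assuming the pattern must match the whole file name and that matching is case-sensitive. Snowflake may evaluate patterns differently, and the `match_files` helper and the file names are hypothetical.

```python
import re


def match_files(pattern: str, file_names: list[str]) -> list[str]:
    """Return the names that the pattern matches in full (assumed semantics)."""
    compiled = re.compile(pattern)
    return [name for name in file_names if compiled.fullmatch(name)]


# Hypothetical file names, used for illustration only.
staged_files = [
    "file1.pdf",
    "invoices/2024/file2.pdf",
    "scan.PDF",
    "receipt.png",
    "notes.pdf.bak",
]

# All names ending in ".pdf" (lowercase only under this assumption).
print(match_files(r".*\.pdf", staged_files))

# A single, exactly named file.
print(match_files(r"file1\.pdf", staged_files))
```

In this sketch, `scan.PDF` is not selected because the match is case-sensitive, and `notes.pdf.bak` is not selected because the whole name must match the pattern.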

{/* <!-- param-start:[mode] | warehouses: [snowflake] --> */}

<ResponseField name="Mode" type="drop-down" required>
  * **OCR:** Optimized for text extraction. Recommended for documents that don't have a strong semantic structure. This is the default setting.
  * **Layout:** Optimized for extracting both text and layout, including elements such as tables. According to the [Snowflake documentation](https://docs.snowflake.com/en/sql-reference/functions/parse_document-snowflake-cortex), this mode returns the data as Markdown and can capture the layout and structure of content elements better than OCR.

  Read [How Parse Document works](https://docs.snowflake.com/en/user-guide/snowflake-cortex/parse-document#how-parse-document-works) to learn more about Snowflake's suggested usage of the OCR and Layout modes.
</ResponseField>

{/* <!-- param-start:[outputColumnName] | warehouses: [snowflake] --> */}

<ResponseField name="Output Column Name" type="string" required>
  The name of the new column that is output when the component is executed.
</ResponseField>
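If you read the output column downstream as a JSON string, a small amount of parsing recovers the extracted text. The following is a minimal sketch, assuming the value is a JSON object with a `content` field and a `metadata.pageCount` field, as described in Snowflake's PARSE_DOCUMENT documentation. The sample value and the `read_parsed_document` helper are hypothetical, so check the actual output of your pipeline before relying on these field names.

```python
import json
from typing import Optional, Tuple

# Hypothetical value from the output column; the real structure may differ.
sample_output = (
    '{"content": "Invoice 1042\\nTotal: 18.50", '
    '"metadata": {"pageCount": 1}}'
)


def read_parsed_document(raw: str) -> Tuple[str, Optional[int]]:
    """Return the extracted text and the page count, if one is present."""
    parsed = json.loads(raw)
    content = parsed.get("content", "")
    # "metadata" and "pageCount" are assumed field names.
    page_count = parsed.get("metadata", {}).get("pageCount")
    return content, page_count


content, page_count = read_parsed_document(sample_output)
print(content.splitlines()[0])
print(page_count)
```

Using `.get()` with defaults keeps the sketch tolerant of rows where a field is missing, for example when a file could not be parsed.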
