Pinecone Vector Upsert - Maia Documentation

Production use of this feature is available for specific editions only. Contact our sales team for more information.

The Pinecone Vector Upsert component lets you convert data stored in your cloud data warehouse into embeddings and then store these embeddings as vectors in your Pinecone vector database. If the component requires access to a cloud provider (AWS, Azure, or GCP), it will use credentials as follows:

If using Matillion Full SaaS: The component will use the cloud credentials associated with your environment to access resources.
If using Hybrid SaaS: By default, the component will inherit the agent’s execution role (service account role). However, if there are cloud credentials associated with your environment, these will override the role.
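
The precedence described above can be sketched as a small function. This is an illustration only, not the component's implementation: the function name, the deployment labels `"full_saas"` and `"hybrid_saas"`, and the string placeholders for credentials are all hypothetical.

```python
from typing import Optional


def resolve_credentials(deployment: str,
                        environment_credentials: Optional[str],
                        agent_role: Optional[str]) -> Optional[str]:
    """Return the credential source that would be used (illustrative only)."""
    if deployment == "full_saas":
        # Full SaaS: the cloud credentials associated with the environment.
        return environment_credentials
    if deployment == "hybrid_saas":
        # Hybrid SaaS: environment credentials, when present, take precedence
        # over the agent's execution role.
        return environment_credentials or agent_role
    raise ValueError(f"unknown deployment model: {deployment!r}")
```

For example, `resolve_credentials("hybrid_saas", None, "agent-role")` falls back to the agent's role, while passing environment credentials as the second argument returns those instead.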


Data freshness

According to Pinecone’s documentation:

Pinecone is eventually consistent, so there can be a slight delay before new or changed records are visible to queries.

Keep this in mind when you run query operations shortly after upsert operations.
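
One way to allow for this delay in downstream code is to poll until the expected records become visible. The sketch below is a generic pattern, not part of the component: `fetch_visible_ids` is a hypothetical callable that you would supply (for example, a wrapper around your own Pinecone query), and the timeout and interval values are arbitrary assumptions.

```python
import time
from typing import Callable, Iterable, Set


def wait_until_visible(fetch_visible_ids: Callable[[], Iterable[str]],
                       expected_ids: Iterable[str],
                       timeout_s: float = 30.0,
                       interval_s: float = 1.0) -> bool:
    """Poll until every expected ID is returned, or until the timeout expires."""
    expected: Set[str] = set(expected_ids)
    deadline = time.monotonic() + timeout_s
    while True:
        # fetch_visible_ids is a caller-supplied (hypothetical) lookup.
        if expected.issubset(set(fetch_visible_ids())):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

The function returns `True` as soon as all expected IDs are visible and `False` if the timeout is reached first, so the caller can decide whether to proceed or fail.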

Properties

Name

string

required

A human-readable name for the component.

Select your cloud data warehouse.

Snowflake
Databricks
Amazon Redshift

Database

drop-down

required

The Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema

drop-down

required

The Snowflake schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Table

string

required

The Snowflake table that holds your source data.

Catalog

drop-down

required

Select a Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.

Schema (Database)

drop-down

required

The Databricks schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.

Table

drop-down

required

The Databricks table that holds your source data.

Schema

drop-down

required

The Amazon Redshift schema. The special value [Environment Default] uses the schema defined in the environment. For more information on using multiple schemas, read Schemas.

Table

drop-down

required

An existing Redshift table to use as the input.

Key Column

drop-down

required

Set a column as the primary key.

Text Column

drop-down

required

The column whose data is converted into embeddings and then upserted into your Pinecone vector database.

Limit

integer

Set a limit for the number of rows to load from the table. The default is 1000.

Embedding Provider

drop-down

required

The embedding provider is the API service used to convert your text into vectors. It receives a piece of text (e.g. “How do I log in?”) and returns a vector. Choose your provider:

OpenAI
Amazon Bedrock

OpenAI API Key

drop-down

required

Use the drop-down menu to select the corresponding secret definition that denotes the value of your OpenAI API key. Read Secrets and secret definitions to learn how to create a new secret definition. To create a new OpenAI API key:

Log in to OpenAI.
Click your avatar in the top-right of the UI.
Click View API keys.
Click + Create new secret key.
Give a name for your new secret key and click Create secret key.
Copy your new secret key and save it. Then click Done.

Embedding Model

drop-down

required

Select an embedding model. Currently supports:

Model	Dimension
text-embedding-ada-002	1536
text-embedding-3-small	1536
text-embedding-3-large	3072
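
A Pinecone index is created with a fixed dimension, so the vectors produced by the selected model need to match it. The sketch below shows one way to check this before running a pipeline. It assumes the dimensions listed in the table above; the function name and the `index_dimension` argument are hypothetical.

```python
# Dimensions as listed in the table above.
EMBEDDING_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def check_index_dimension(model: str, index_dimension: int) -> None:
    """Raise ValueError if the model's dimension differs from the index's."""
    expected = EMBEDDING_DIMENSIONS[model]
    if expected != index_dimension:
        raise ValueError(
            f"{model} produces {expected}-dimensional vectors, "
            f"but the index expects {index_dimension}"
        )
```

For example, `check_index_dimension("text-embedding-3-large", 3072)` passes silently, while pairing `text-embedding-3-small` with a 3072-dimension index raises an error.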

API Batch Size

integer

required

Set the number of rows sent to the embedding provider per API call. The default size is 10. When set to 10, 1000 rows would therefore require 100 API calls. You may wish to reduce this number if each row contains a high volume of data and, conversely, increase it for rows with low data volume.
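
The relationship between row count, batch size, and number of API calls is a ceiling division. A minimal sketch, with a hypothetical function name:

```python
def api_calls_needed(row_count: int, api_batch_size: int) -> int:
    """Number of embedding API calls needed to cover row_count rows."""
    if api_batch_size < 1:
        raise ValueError("api_batch_size must be at least 1")
    # Ceiling division: a final, partially filled batch still costs one call.
    return (row_count + api_batch_size - 1) // api_batch_size


print(api_calls_needed(1000, 10))  # 100, matching the example above
```

Doubling the batch size to 20 would halve this to 50 calls, at the cost of larger payloads per call.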

For Amazon Bedrock, the supported embedding model is:

Model	Dimension
Titan Embeddings G1 - Text	1536

Load Strategy

drop-down

required

Upsert: If a record within a namespace exists, upsert updates it with new values. If a record doesn’t exist, the upsert inserts a new record.
Truncate and Insert: If the specified namespace already exists, it will be destroyed and recreated with new rows the next time this pipeline runs.
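
The difference between the two strategies can be illustrated with an in-memory dictionary standing in for a namespace. This is a conceptual sketch only and does not call Pinecone; the function names and the record layout (ID mapped to a vector) are assumptions made for illustration.

```python
from typing import Dict, List

Namespace = Dict[str, List[float]]  # record ID -> vector (illustrative layout)


def upsert(namespace: Namespace, records: Namespace) -> Namespace:
    """Update records whose ID exists, insert the rest, keep everything else."""
    namespace.update(records)
    return namespace


def truncate_and_insert(namespace: Namespace, records: Namespace) -> Namespace:
    """Discard all existing records, then insert the new ones."""
    namespace.clear()
    namespace.update(records)
    return namespace
```

Starting from records `"1"` and `"2"`, upserting `"2"` and `"3"` leaves three records with `"2"` updated, whereas Truncate and Insert leaves only `"2"` and `"3"`.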

Pinecone API Key

drop-down

required

Use the drop-down menu to select the corresponding secret definition that denotes the value of your Pinecone API key. Read Secrets and secret definitions to learn how to create a new secret definition.

Pinecone Index Name

drop-down

required

The name of the Pinecone vector search index to connect to. The list is generated once you pass a valid Pinecone API key.

Pinecone Namespace

string

The name of the Pinecone namespace. Pinecone lets you partition records in an index into namespaces. To retrieve a namespace name:

Log in to Pinecone.
Click PROJECTS in the left sidebar.
Click a project tile. This action will open the list of vector search indexes in your project.
Click on your vector search index tile.
Click the NAMESPACES tab. Your namespaces will be listed.

Upsert Batch Size

integer

required

Set the size of the batches of vectors that Pinecone receives. The default size is 100 vectors per request. You may wish to reduce this number if a row contains a high volume of data; and conversely, increase this number for rows with low data volume.
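
Splitting a list of vectors into request-sized batches can be sketched as a simple generator. This illustrates the idea of batching only and is not the component's implementation; the helper name `batched` is hypothetical.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def batched(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of items, each with at most batch_size elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])
```

With the default of 100, a run of 250 vectors would be sent as three requests containing 100, 100, and 50 vectors.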
