Amazon OpenSearch Upsert - Maia Documentation

Production use of this feature is available for specific editions only. Contact our sales team for more information.

Amazon OpenSearch Upsert is an orchestration component that converts data stored in your cloud data warehouse into vector embeddings and stores them in an Amazon OpenSearch index. This lets you use alternative embedding models (for example, OpenAI or Amazon Bedrock) instead of Amazon OpenSearch’s built-in embedding.

Currently, this component only supports provisioned OpenSearch services, not serverless.

Prerequisites

Before you use the Amazon OpenSearch Upsert component, you’ll need to add AWS cloud credentials to .

Permissions

You’ll need to ensure you have permissions for Amazon OpenSearch Service. If you’re using Amazon Bedrock for your embeddings, you’ll also need permission to invoke the model.

Amazon OpenSearch Service

To use the Amazon OpenSearch Upsert component, ensure that your IAM role or user has the necessary permissions to interact with Amazon OpenSearch. Below is an example of an IAM policy that grants the required permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "es:ESHttpPost",
        "es:ESHttpPut",
        "es:ESHttpGet",
        "es:ESHttpDelete"
      ],
      "Resource": "arn:aws:es:<region>:<account-id>:domain/<your-domain-name>/*"
    }
  ]
}

If you are using fine-grained access control in Amazon OpenSearch Service, you’ll also need OpenSearch user/role permissions (for example, a role mapped to the appropriate OpenSearch index with write access).
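The IAM policy above can also be generated rather than edited by hand. The following is a minimal sketch using only the Python standard library; the helper names `opensearch_resource` and `opensearch_policy`, and the sample region, account ID, and domain name, are hypothetical placeholders rather than values from this documentation.

```python
import json


def opensearch_resource(region: str, account_id: str, domain_name: str) -> str:
    """Build the ARN that covers every path under one OpenSearch domain."""
    return f"arn:aws:es:{region}:{account_id}:domain/{domain_name}/*"


def opensearch_policy(region: str, account_id: str, domain_name: str) -> dict:
    """Return an IAM policy document with the four HTTP actions listed above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "es:ESHttpPost",
                    "es:ESHttpPut",
                    "es:ESHttpGet",
                    "es:ESHttpDelete",
                ],
                "Resource": opensearch_resource(region, account_id, domain_name),
            }
        ],
    }


# Hypothetical sample values; substitute your own region, account, and domain.
policy = opensearch_policy("us-east-1", "123456789012", "my-domain")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM console or saved to a file for use with your usual deployment tooling.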

Amazon Bedrock

If you are using Amazon Bedrock as your embedding provider, ensure that your IAM role or user has the necessary permissions to invoke the model. Below is an example of an IAM policy that grants the required permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Resource": "arn:aws:bedrock:<region>::foundation-model/<model-name>"
    }
  ]
}

Properties

Reference material is provided below for the Source, Configure, and Destination properties.

Name

string

required

A human-readable name for the component.

Source

Snowflake
Databricks
Amazon Redshift

Database

drop-down

required

The Snowflake database. The special value [Environment Default] uses the database defined in the environment. Read Databases, Tables and Views - Overview to learn more.

Schema

drop-down

required

The Snowflake source schema. The special value [Environment Default] uses the schema defined in the environment. Read Database, Schema, and Share DDL to learn more.

Catalog

drop-down

required

Select a Databricks Unity Catalog. The special value [Environment Default] uses the catalog defined in the environment. Selecting a catalog will determine which databases are available in the next parameter.

Schema (Database)

drop-down

required

The Databricks schema. The special value [Environment Default] uses the schema defined in the environment. Read Create and manage schemas to learn more.

Schema

drop-down

required

The Amazon Redshift source schema. The special value [Environment Default] uses the schema defined in the environment. Read Schemas to learn more.

Table

drop-down

required

Select the table that contains the data you want to upsert into Amazon OpenSearch.

Key Column

drop-down

required

This column is used to uniquely identify each row in the table. It is used to ensure that the data is not duplicated when it is loaded into the destination. An example use case would be a column of product IDs that you want to use to identify each product in the table.
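The role of the key column can be pictured with a small in-memory model. This is a conceptual sketch only, not the component's implementation: it assumes that a row whose key already exists in the destination replaces the earlier entry, and the `upsert` function and the sample product data are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def upsert(index: Dict[str, str], rows: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Insert each (key, text) row, overwriting any entry that shares the key."""
    for key, text in rows:
        index[key] = text  # new key: insert; existing key: update in place
    return index


index: Dict[str, str] = {}
upsert(index, [("P-1", "Great fit"), ("P-2", "Too small")])
upsert(index, [("P-1", "Great fit, still happy after a month")])

print(len(index))  # 2: the second load updated P-1 instead of adding a duplicate
```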

Text Column

drop-down

required

This column is used to generate vectors for the text data in the table, which are then upserted as embeddings to Amazon OpenSearch. An example use case of this column would be a column of product reviews that you want to convert into vectors for semantic search or to perform sentiment analysis on.

Limit

integer

Set the Limit to control the maximum number of records (rows) to load from the table. The default is 1000.

Configure

Embedding Provider

drop-down

required

The embedding provider is the API service used to convert the text in your Text Column into vectors. Choose either OpenAI or Amazon Bedrock. The embedding provider receives a piece of text (e.g. “How do I log in?”) and returns a vector.

OpenAI
Amazon Bedrock

OpenAI API Key

drop-down

required

Use the drop-down menu to select the corresponding secret definition that denotes the value of your OpenAI API key. Read Secrets and secret definitions to learn how to create a new secret definition.

To create a new OpenAI API key:

Log in to OpenAI.
Click your avatar in the top-right of the UI.
Click View API keys.
Click + Create new secret key.
Give a name for your new secret key and click Create secret key.
Copy your new secret key and save it. Then click Done.

Model

drop-down

required

Select an OpenAI embedding model. The following models are currently supported:

text-embedding-ada-002
text-embedding-3-small
text-embedding-3-large

API Batch Size

integer

required

Set the number of rows sent in each API call. The default size is 10. When set to 10, 1000 rows would therefore require 100 API calls. You may wish to reduce this number if each row contains a high volume of data and, conversely, increase it for rows with low data volume.
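The arithmetic behind the batch size can be sketched as follows. This is an illustration only, not the component's internal logic; the function names `api_call_count` and `batches` are hypothetical.

```python
import math
from typing import Iterator, List, Sequence


def api_call_count(row_count: int, batch_size: int) -> int:
    """Number of API calls needed when each call carries up to batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return math.ceil(row_count / batch_size)


def batches(rows: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of rows, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


print(api_call_count(1000, 10))  # 100
```

When the row count is not a multiple of the batch size, the final call simply carries fewer rows, so 1001 rows at a batch size of 10 would need 101 calls.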

Destination

Endpoint URL

string

required

The URL of the Amazon OpenSearch domain endpoint to upsert your vector embeddings to. To find your endpoint URL:

Log in to the Amazon OpenSearch Service console.
Navigate to the Domains page.
Click on the domain you want to use.
Copy the Domain Endpoint URL from the domain details page.

Index

string

required

The name of an existing Amazon OpenSearch index where the vector embeddings will be upserted. An index in Amazon OpenSearch is similar to a table in a relational database.

Below is an example code snippet you could use to create an index. You can run the following command in the OpenSearch Dev Tools console or a REST API client like curl or Postman:

PUT /test-index
{
  "settings": {
    "index.knn": true
  },
  "mappings": {
    "properties": {
      "vector": {
        "type": "knn_vector",
        "dimension": 1536
      },
      "rawData": {
        "type": "text",
        "index": false
      },
      "key": {
        "type": "object"
      }
    }
  }
}

The dimension value must match the output dimension of the embedding model you have chosen:

text-embedding-ada-002 and text-embedding-3-small output vectors of dimension 1536.
text-embedding-3-large outputs vectors of dimension 3072.

If you’re using the text-embedding-3-large model, update "dimension": 1536 to "dimension": 3072 in the mapping above.
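To avoid a mismatch between the model and the index mapping, the request body can be derived from the chosen model. The sketch below uses only the Python standard library and the dimensions listed above; the names `MODEL_DIMENSIONS` and `index_body` are hypothetical, and the field names mirror the example mapping rather than any confirmed requirement of the component.

```python
import json

# Output dimensions for the supported OpenAI models, as listed above.
MODEL_DIMENSIONS = {
    "text-embedding-ada-002": 1536,
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
}


def index_body(model: str) -> dict:
    """Build the index creation body with a dimension that matches the model."""
    if model not in MODEL_DIMENSIONS:
        raise ValueError(f"Unsupported model: {model}")
    return {
        "settings": {"index.knn": True},
        "mappings": {
            "properties": {
                "vector": {"type": "knn_vector", "dimension": MODEL_DIMENSIONS[model]},
                "rawData": {"type": "text", "index": False},
                "key": {"type": "object"},
            }
        },
    }


# Print a body that can be pasted after "PUT /<index-name>" in Dev Tools.
print(json.dumps(index_body("text-embedding-3-large"), indent=2))
```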

Region

drop-down

required

Select the AWS region of your Amazon OpenSearch Service domain.