Production use of this feature is available for specific editions only. Contact our sales team for more information.
Make sure you have read and understood the Requirements set out by Databricks before using this component.
Use case
Some typical use cases for this component include:
- Deduplication of text data, by identifying and grouping duplicate or near-duplicate entries in datasets such as product descriptions, survey responses, or user comments. For example, “iPhone 14 Pro Max 256GB” and “Apple iPhone 14 Pro Max, 256 GB” are not identical strings, but they have a high similarity score and can therefore be considered duplicates.
- Record linking through semantic joins on datasets where the matching field contains slightly different wording.
- Detecting content overlaps to check whether content is reworded or copied from other sources.
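To make the deduplication example concrete, the following is a minimal Python sketch of how two non-identical strings can still score as highly similar. It is only a lexical stand-in: the helper names `tokenize` and `toy_similarity` are hypothetical, and the token-count cosine similarity used here is an assumption chosen for illustration, not the scoring method this component uses.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    # Lowercase, then split into runs of letters or runs of digits,
    # so "256GB" and "256 GB" produce the same tokens.
    return Counter(re.findall(r"[a-z]+|\d+", text.lower()))


def toy_similarity(a: str, b: str) -> float:
    # Cosine similarity between token-count vectors.
    # Hypothetical helper: a lexical stand-in, not the component's semantic scoring.
    va, vb = tokenize(a), tokenize(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


base = "iPhone 14 Pro Max 256GB"
comparison = "Apple iPhone 14 Pro Max, 256 GB"
score = toy_similarity(base, comparison)

# An exact-match comparison fails, but the similarity score is high.
print(base == comparison, round(score, 2))  # False 0.93
```

A real semantic score would also rate reworded text with few shared tokens as similar, which this toy measure cannot do.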
Properties
Name: A human-readable name for the component.
Base Column: The column containing the base text that the comparison column is scored against.
Comparison Column: The column to compare against your base column.
- Yes: Includes both your input columns and the new semantic similarity scores column. This will also include those input columns not selected in Columns.
- No: Only includes the new semantic similarity scores column.
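The effect of the Yes and No options on the output can be sketched in plain Python. This is an illustration under assumptions, not the component's implementation: the function `build_output`, the score column name `similarity_score`, the sample rows, and the placeholder score values are all hypothetical.

```python
def build_output(rows, scores, include_inputs, score_column="similarity_score"):
    # rows: list of dicts, one per input row; scores: one score per row.
    # score_column is a hypothetical name for the new scores column.
    output = []
    for row, score in zip(rows, scores):
        # Yes: keep every input column; No: start from an empty row.
        result = dict(row) if include_inputs else {}
        result[score_column] = score
        output.append(result)
    return output


# Sample input with one column ("sku") that is not being compared.
rows = [
    {"sku": "A1", "base": "iPhone 14 Pro Max 256GB", "comparison": "Apple iPhone 14 Pro Max, 256 GB"},
]
scores = [0.9]  # placeholder value, not a real component output

with_inputs = build_output(rows, scores, include_inputs=True)
scores_only = build_output(rows, scores, include_inputs=False)

print(sorted(with_inputs[0]))  # ['base', 'comparison', 'similarity_score', 'sku']
print(sorted(scores_only[0]))  # ['similarity_score']
```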

