Azure AKS deployment guide for Matillion agents

This document helps you understand Azure-specific architecture decisions, deployment considerations, and readiness requirements for running Matillion agents on Azure Kubernetes Service (AKS). AKS provides a managed Kubernetes control plane for running Matillion agents in your Azure infrastructure. This deployment model combines Azure-native security features (Managed Identity, Workload Identity) with Kubernetes operational flexibility. For complete Terraform modules, Helm charts, and step-by-step implementation instructions, see the Azure agent directory in the Matillion deployment library on GitHub. You should read the general Kubernetes deployment guide before reading this document.

What you get with AKS deployment

Managed Kubernetes control plane. Azure handles the Kubernetes API server, etcd, and control plane upgrades.
Workload Identity/Managed Identity. Allows credential-free authentication to Azure services.
Flexible node pools. VM scale sets with configurable instance types.
Azure integration. Native support for Azure Monitor, VNet networking, and Azure Load Balancers.
Horizontal Pod Autoscaler. Scale agent pods based on metrics.
Cluster Autoscaler. Automatically adjust VM capacity in node pools.

When to choose AKS

Choose AKS for deployment when:

You have existing Azure infrastructure and expertise.
You need Managed Identity for secure, credential-free access to Azure services (Storage, Key Vault).
You require integration with Azure monitoring and security tools (Azure Monitor, Sentinel).
You want Kubernetes operational flexibility with Azure managed services.
You plan to deploy agents across multiple availability zones for high availability.

Prerequisites and readiness

Azure account requirements

Required Azure services:

Azure Kubernetes Service enabled in your subscription.
Sufficient VM quotas for worker nodes.
Virtual Network (VNet) with subnet configuration.

Your Azure identity needs permissions to:

Create and manage AKS clusters and node pools.
Create and manage virtual machines and VM scale sets.
Create Managed Identities and role assignments.
Manage VNet resources (subnets, route tables, network security groups).
Access Azure Key Vault (for storing OAuth credentials).
Create Storage Accounts (for agent staging data).
Configure Azure Monitor and Log Analytics.

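We recommend you use the Azure Contributor role for initial deployment, then scope down to least-privilege for ongoing operations. Contributor alone cannot create role assignments, so the deploying identity also needs a role that permits this, such as User Access Administrator, on the relevant scope.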

Matillion account setup

Before deploying infrastructure, create an agent in the Matillion console. You need to obtain the following information about the agent you created:

Account ID: Your Matillion organization identifier.
Agent ID: The unique identifier for this agent (auto-generated).
OAuth Client ID and Secret: Agent authentication credentials.
Region: us1 (United States) or eu1 (Europe).

These credentials are required for the Helm deployment in Phase 4. Store them securely. For details, read Create an agent.

Required tools

Ensure these tools are installed and configured on your deployment workstation:

Terraform 1.0+ for infrastructure provisioning.
Azure CLI configured with credentials (az login).
kubectl for Kubernetes cluster management.
Helm 3.x for application deployment.

Verify prerequisites:

# Verify Azure CLI authentication
az account show

# Verify tool versions
terraform --version
kubectl version --client
helm version

Architecture decision points

Before deploying, make these key architectural decisions.

1. VNet strategy

Decide whether to use an existing VNet or create a new one.

Option	When to use	What gets created
Create new VNet	Isolated agent deployment, no existing VNet.	New VNet with subnets across availability zones, route tables, network security groups.
Use existing VNet	Integrate with existing Azure infrastructure.	AKS cluster in existing VNet; may create new subnets if needed.

Set use_existing_vnet = true or use_existing_vnet = false in Terraform variables. See terraform.tfvars.example in the deployment library for an example.

2. Subnet strategy

If using an existing VNet, decide whether to use existing subnets or create new ones.

Option	Requirements	Considerations
Existing subnets	Subnet with outbound internet access (NAT Gateway or default route).	Must have available IP addresses for AKS nodes and pods.
Create new subnets	Room in existing VNet CIDR for new subnet ranges.	Terraform creates new subnets within existing VNet.

Size the node subnet at /24 or larger (a prefix length of 24 or shorter) so that it can accommodate node IPs and pod IP allocation.
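As a rough illustration of the sizing guidance above, the following sketch counts the addresses a subnet leaves assignable. It relies on the fact that Azure reserves five addresses in every subnet; the CIDR ranges are hypothetical examples, and the function names are not part of the deployment library.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet
# (network address, default gateway, two for Azure DNS, broadcast).
AZURE_RESERVED_IPS = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses Azure leaves assignable in a subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_IPS, 0)


def meets_minimum(cidr: str, minimum_prefix: int = 24) -> bool:
    """True when the subnet is a /24 or larger (prefix length <= 24)."""
    return ipaddress.ip_network(cidr).prefixlen <= minimum_prefix


# Hypothetical example ranges.
print(usable_ips("10.0.1.0/24"))     # 251
print(meets_minimum("10.0.1.0/26"))  # False
```

How many of those addresses pods consume depends on the network plugin the cluster uses, so treat the result as an upper bound for planning.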

3. Public vs private cluster

Setting	API server access	Use case
Public cluster	API server publicly accessible from authorized IP ranges.	Development, testing, faster initial setup.
Private cluster	API server accessible only from within VNet.	Production, enhanced security, requires bastion host or VPN.

Set is_private_cluster = true or is_private_cluster = false in Terraform variables. See terraform.tfvars.example in the deployment library for an example. For private clusters, ensure:

Deployment workstation has VPN or bastion access to VNet.
CI/CD runners can access cluster API server.

For public clusters, ensure authorized IP ranges include your access points. In AKS, authorized IP ranges apply to the public API server endpoint only.

4. Authentication strategy

Workload Identity (recommended for AKS 1.25+):
- Agent pods authenticate to Azure using federated identity credentials.
- No secrets stored in cluster.
- Azure AD-backed, automatic token rotation.
- Best-practice security model for AKS.
Managed Identity (Azure AD Pod Identity):
- Agent pods assume Managed Identity without storing credentials.
- Supports AKS clusters older than 1.25.
- Automatic credential rotation by Azure.
Static OAuth Credentials:
- OAuth credentials stored in Kubernetes Secrets.
- Use only if identity-based authentication is unavailable (not recommended for AKS).

The Terraform module can configure either Workload Identity or Managed Identity.

Recommendation: Use Workload Identity for AKS 1.25+ deployments. The deployment library Terraform module creates the required identity and role assignments automatically.

5. Node pool strategy

Node VM Sizing:

VM size	vCPU	Memory	Use case
`Standard_D2s_v4`	2	8 GB	Development, testing, low workload.
`Standard_D4s_v4`	4	16 GB	Small to medium production workloads.
`Standard_D8s_v4`	8	32 GB	Production workloads.
`Standard_D16s_v4`	16	64 GB	High-throughput production workloads.

Considerations:

Transformation-heavy workloads: SQL generation tasks, low agent CPU usage → Smaller VMs sufficient.
Data ingestion/scripting workloads: High data transfer, processing on agent → Larger VMs needed.
Pod density: Larger VMs allow more agent pods per node, reducing operational overhead.

Configure VM size in Terraform node pool settings.

6. Scaling strategy

Static Replica Count:

Fixed number of agent pods (e.g., 2, 5, 10).
Predictable capacity and costs.
Suitable for steady-state workloads.

Horizontal Pod Autoscaler (HPA):

Automatically scales agent pods based on CPU, memory, or custom metrics.
Configure min/max replicas (e.g. min: 2, max: 10).
Responds to workload spikes dynamically.

Cluster Autoscaler:

Automatically adds/removes VMs in node pool based on pod scheduling needs.
Works in tandem with HPA.
Optimizes infrastructure costs.

Recommendation: Start with static replicas, then add HPA as you understand workload patterns.

Container images

Agent images are available in Azure Container Registry.

Image repository: matillion.azurecr.io/cloud-agent

Available tags:

:stable - Slower release cycle, maximum stability, recommended for production.
:current - Faster release cycle, earlier access to new features.

Both tags are production-ready. Choose whichever suits your preference for stability vs. early feature access. No authentication is required. Matillion’s public Azure Container Registry allows anonymous pulls.

Deployment journey

Expected timeline

Phase 1 - Agent registration: 10 minutes (Matillion console).
Phase 2 - Infrastructure provisioning: 15-20 minutes (Terraform: VNet, AKS cluster, Managed Identity).
Phase 3 - Configure kubectl access: 2 minutes (Azure CLI + kubectl).
Phase 4 - Agent deployment: 5-10 minutes (Helm chart).
Phase 5 - Validation: 15-30 minutes (Pre-deployment checks + testing).

Total: approximately 50-75 minutes for a first-time deployment (the phase estimates above sum to 47-72 minutes).
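The total can be checked by summing the per-phase estimates. This small sketch copies the minute ranges from the timeline above; the summed range is slightly below the quoted total, which is best read as a rounded figure.

```python
# (min, max) minutes per phase, copied from the timeline above.
PHASES = {
    "Agent registration": (10, 10),
    "Infrastructure provisioning": (15, 20),
    "Configure kubectl access": (2, 2),
    "Agent deployment": (5, 10),
    "Validation": (15, 30),
}


def total_range(phases):
    """Sum the lower and upper bounds across all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


print(total_range(PHASES))  # (47, 72)
```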

Phase 1: Agent registration (Matillion console)

Refer to Prerequisites, above, for details of agent creation. What you’ll have at the end:

Account ID.
Agent ID.
OAuth Client ID and Secret.
Region (us1 or eu1).

Store these securely. You’ll need them for Helm deployment in Phase 4.

Phase 2: Infrastructure provisioning (Terraform)

The Terraform module creates:

Azure AKS cluster:
- Managed Kubernetes control plane (API server, etcd, controller manager).
- AKS-managed upgrades and patching.
- Azure Monitor integration for control plane logs.
Node Pools:
- VM Scale Set with configurable VM sizes.
- Azure-managed node lifecycle and upgrades.
- Kubernetes node labels and taints (if configured).
Managed Identity/Workload Identity:
- User-assigned Managed Identity for agent workloads.
- Role assignments for Storage Account access and Key Vault access.
- Federated identity credentials (for Workload Identity).
VNet and Networking (if creating new):
- Virtual Network with subnets across availability zones.
- Network security groups for cluster communication.
- Route tables for outbound connectivity.
Network Security Groups:
- Rules for control plane and node communication.
- HTTPS outbound to Matillion control plane.

In terraform.tfvars you will need to make these configuration changes:

azure_subscription_id: Your Azure subscription ID.
azure_tenant_id: Your Azure AD tenant ID.
resource_group_name: Resource group name (e.g. matillion-agent-rg).
location: Azure region (e.g. East US, West Europe).
name: Cluster name prefix (e.g. matillion-agent).
vm_size: Node pool VM size (e.g. Standard_D4s_v4).
desired_node_count: Initial node count (e.g. 3).
is_private_cluster: true or false.
authorized_ip_ranges: List of CIDRs allowed to access API server.
workload_identity_enabled: true (recommended) or false.
tags: Resource tags for cost allocation.
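One way to assemble these settings is to generate a terraform.tfvars.json file, which Terraform loads automatically alongside terraform.tfvars. The sketch below is illustrative only: the variable names come from the list above, but every value is a placeholder, and the value types (for example, tags as a map and desired_node_count as a number) are assumptions to check against terraform.tfvars.example in the deployment library. The build_tfvars helper is hypothetical.

```python
import ipaddress
import json


def build_tfvars(authorized_ip_ranges, **settings):
    """Validate the CIDR list and return the variables as a JSON document."""
    for cidr in authorized_ip_ranges:
        # Raises ValueError if the string is not a valid network.
        ipaddress.ip_network(cidr, strict=False)
    variables = dict(settings, authorized_ip_ranges=list(authorized_ip_ranges))
    return json.dumps(variables, indent=2, sort_keys=True)


# All values below are placeholders, not real identifiers.
tfvars_json = build_tfvars(
    ["203.0.113.0/24"],
    azure_subscription_id="00000000-0000-0000-0000-000000000000",
    azure_tenant_id="00000000-0000-0000-0000-000000000000",
    resource_group_name="matillion-agent-rg",
    location="East US",
    name="matillion-agent",
    vm_size="Standard_D4s_v4",
    desired_node_count=3,
    is_private_cluster=False,
    workload_identity_enabled=True,
    tags={"cost-center": "data-platform"},  # assumed to be a map
)
print(tfvars_json)
```

Validating the CIDR list before running terraform apply catches malformed ranges early, before they reach the API server access configuration.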

After terraform apply completes, retrieve the Terraform outputs using:

terraform output cluster_name
terraform output managed_identity_client_id

Where to implement: AKS Terraform Module.

Phase 3: Configure kubectl access

You must configure kubectl to authenticate to your AKS cluster using the Azure CLI. The az aks get-credentials command retrieves cluster endpoint and certificate authority data, then configures your local kubeconfig file with Azure AD authentication. The command is:

az aks get-credentials --resource-group <resource-group> --name <cluster-name>

Use the <resource-group> and <cluster-name> from your Terraform variables. See terraform.tfvars.example in the deployment library for an example. Verification:

kubectl get nodes
kubectl get namespaces

You should see AKS worker nodes and default Kubernetes namespaces.

Phase 4: Agent deployment (Helm)

The Helm chart deploys:

Agent Pods:
- Deployment with configurable replica count (default: 2).
- Each pod runs the Matillion agent binary.
- Resource requests and limits for CPU and memory.
ServiceAccount:
- Kubernetes ServiceAccount configured for Workload Identity or Managed Identity.
- Annotated with Managed Identity client ID (from Phase 2).
ConfigMaps:
- Agent configuration (account ID, agent ID, region).
- Environment-specific settings.
Secrets:
- OAuth Client ID and Secret for Matillion control plane authentication.
Service:
- Kubernetes service exposing Prometheus metrics endpoint (port 8080).
- Annotated for Prometheus service discovery.

You will provide the following configuration values:

Value	Source	Example
`cloudProvider`	Static	`"azure"`
`config.oauthClientId`	Phase 1 (Matillion console)	`"abc123..."`
`config.oauthClientSecret`	Phase 1 (Matillion console)	`"secret456..."`
`dpcAgent.dpcAgent.env.accountId`	Phase 1 (Matillion console)	`"12345"`
`dpcAgent.dpcAgent.env.agentId`	Phase 1 (Matillion console)	`"agent-prod-01"`
`dpcAgent.dpcAgent.env.matillionRegion`	Phase 1 (Matillion console)	`"us1"` or `"eu1"`
`dpcAgent.replicas`	Your decision	`2` (baseline) to `10+` (high throughput)
`dpcAgent.dpcAgent.image.repository`	Static	`"matillion.azurecr.io/cloud-agent"`
`dpcAgent.dpcAgent.image.tag`	Your decision	`"stable"` or `"current"`
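The dotted names in the table describe paths into the chart's nested values. As a sketch of how they map onto a values structure, the following expands the dotted keys into nested dictionaries and prints them as JSON. The keys and example values are taken from the table; the credential placeholders and the nest helper are hypothetical. JSON is also valid YAML, so the output can generally be saved and used as a values file, but confirm this against the chart in the deployment library.

```python
import json


def nest(flat):
    """Expand dotted keys such as 'a.b.c' into nested dictionaries."""
    root = {}
    for dotted_key, value in flat.items():
        node = root
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


flat_values = {
    "cloudProvider": "azure",
    "config.oauthClientId": "<oauth-client-id>",          # placeholder
    "config.oauthClientSecret": "<oauth-client-secret>",  # placeholder
    "dpcAgent.dpcAgent.env.accountId": "12345",
    "dpcAgent.dpcAgent.env.agentId": "agent-prod-01",
    "dpcAgent.dpcAgent.env.matillionRegion": "us1",
    "dpcAgent.replicas": 2,
    "dpcAgent.dpcAgent.image.repository": "matillion.azurecr.io/cloud-agent",
    "dpcAgent.dpcAgent.image.tag": "stable",
}

values = nest(flat_values)
print(json.dumps(values, indent=2))
```

Avoid committing a generated file that contains the real OAuth secret to source control.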

Where to implement:

Phase 5: Validation and testing

Run automated pre-deployment validation scripts to verify agent pod environment:

# From deployment library root
./agent/helm/checks/run-check.sh --namespace matillion --release matillion-agent

What gets checked:

Python 3 and Java runtime available.
Filesystem permissions correct.
Environment variables set (ACCOUNT_ID, AGENT_ID, etc.).
cgroup CPU and memory limits applied.
Network connectivity to Matillion control plane.
Security agents that might interfere (CrowdStrike, Prisma Cloud).

Manual verification:

Matillion Console: Navigate to Manage → Agents. Verify agent status shows “Connected”.
Test Pipeline: Create a simple pipeline (e.g. “Hello World” transformation) and execute.
Prometheus Metrics: Verify metrics available at http://<pod-ip>:8080/actuator/prometheus.

Agent application logs are available in Azure Monitor (if Container Insights is enabled):

Log Analytics workspace query: ContainerLog | where PodName contains "matillion-agent"

Agent architecture on AKS

How data flows

The diagram shows the architecture flow from the Matillion Control Plane through the Azure AKS cluster down to the agent pods connecting to your data services.

[Figure: Agent architecture diagram]

Workload Identity / Managed Identity

How Workload Identity works:

Kubernetes ServiceAccount is annotated with Managed Identity client ID.
AKS OIDC Issuer allows Kubernetes to issue tokens trusted by Azure AD.
Agent pod requests Azure AD token using projected service account token.
Azure AD exchanges token for access token (valid 1 hour, auto-refreshed).
Agent accesses Azure services (Storage, Key Vault) without storing credentials.

Security benefits:

No long-lived credentials in cluster.
Automatic token rotation (every hour).
Least-privilege access (Managed Identity scoped to specific resources).
Pod-level isolation (each pod authenticates independently).

What the Terraform module creates:

User-assigned Managed Identity.
Federated identity credential (Workload Identity).
Role assignments for Storage Account, Key Vault.
AKS cluster OIDC issuer configuration.

Task capacity and throughput

Per-pod capacity: Each agent pod can execute up to 20 concurrent tasks.

Throughput calculation: Maximum concurrent tasks = (Number of agent pods) × 20.

Examples:

2 pods (default) = 40 concurrent tasks.
5 pods = 100 concurrent tasks.
10 pods = 200 concurrent tasks.

Scaling guidance:

For transformation workloads: Tasks generate SQL executed by data warehouse. Agent CPU/memory usage is low. Fewer pods needed.
For data ingestion workloads: Tasks transfer and process data on agent. Agent CPU/memory usage is high. More pods needed.

Queuing behavior: When all pods are at capacity (20 tasks each), new tasks queue in Matillion’s agent gateway until capacity becomes available.
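The capacity arithmetic above can be expressed as a few helper functions. This is a minimal sketch using the 20-tasks-per-pod figure from this section; the function names are illustrative, and the queue estimate simply assumes every task beyond capacity waits.

```python
import math

TASKS_PER_POD = 20  # per-pod concurrency stated in this section


def max_concurrent_tasks(pods: int) -> int:
    """Maximum concurrent tasks = number of agent pods x 20."""
    return pods * TASKS_PER_POD


def pods_needed(target_concurrent_tasks: int) -> int:
    """Smallest pod count whose capacity covers the target concurrency."""
    return max(1, math.ceil(target_concurrent_tasks / TASKS_PER_POD))


def queued_tasks(pods: int, submitted_tasks: int) -> int:
    """Tasks left waiting once every pod is at capacity."""
    return max(0, submitted_tasks - max_concurrent_tasks(pods))


print(max_concurrent_tasks(5))  # 100
print(pods_needed(130))         # 7
print(queued_tasks(2, 55))      # 15
```

For example, a peak of 130 concurrent tasks needs 7 pods, because 6 pods provide capacity for only 120.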

Monitoring and observability

Native Prometheus metrics

Agent pods expose Prometheus-compatible metrics at:

Endpoint: http://<pod-ip>:8080/actuator/prometheus.
Service: Automatically created by Helm chart with Prometheus annotations.

Key metrics:

app_version_info: Agent version and build metadata.
app_agent_status: Agent status (1 = running, 0 = stopped).
app_active_task_count: Current number of executing tasks.
app_active_request_count: Active HTTP requests to agent.
app_open_sessions_count: Open connections to data warehouses.

The Helm chart includes annotations for automatic Prometheus service discovery:

prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"

If Prometheus is deployed in your cluster, it will automatically discover and scrape these metrics.
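To inspect these metrics without a Prometheus server, the text returned by the endpoint can be parsed directly. The sketch below handles the simple "name value" lines of the Prometheus text format; the sample payload is invented for illustration (real output, labels, and values will differ), and the parser assumes samples carry no trailing timestamps.

```python
# Hypothetical sample of what the endpoint might return.
SAMPLE = """\
# HELP app_active_task_count Current number of executing tasks
# TYPE app_active_task_count gauge
app_active_task_count 7.0
app_agent_status 1.0
app_open_sessions_count 3.0
app_version_info{version="x.y.z"} 1.0
"""


def parse_metrics(text):
    """Return {metric_name: value}, ignoring comments and dropping labels."""
    metrics = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        name_part, value = line.rsplit(None, 1)
        metrics[name_part.split("{", 1)[0]] = float(value)
    return metrics


metrics = parse_metrics(SAMPLE)
# Fraction of the 20-task per-pod capacity currently in use.
utilisation = metrics["app_active_task_count"] / 20
print(metrics["app_agent_status"], utilisation)
```

In practice the text would come from an HTTP GET against the endpoint above, issued from somewhere that can reach the pod IP.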

Azure Monitor integration

Enable Azure Monitor Container Insights for comprehensive AKS monitoring:

Cluster-level metrics (CPU, memory, network).
Pod-level metrics (resource usage per agent pod).
Node-level metrics (VM health, disk usage).

Agent application logs are streamed to Log Analytics workspace:

Centralized log aggregation across all pods.
Query with Kusto Query Language (KQL).
Set up alerts on error patterns.

Recommended Azure Monitor alerts:

Agent pod restarts > threshold.
Agent pods in CrashLoopBackOff state.
Task execution failures (requires custom metric from agent logs).
Node pool CPU/memory > 80%.

Recommended monitoring setup

Use Prometheus for agent-specific metrics (task count, session count, agent status).
Use Azure Monitor for infrastructure metrics (cluster health, node capacity, pod restarts).
Set up Grafana dashboards combining Prometheus and Azure Monitor metrics.
Configure alerts:
- Agent connectivity to Matillion control plane lost.
- Task queue depth increasing (capacity insufficient).
- Agent pod memory usage approaching limits.

Security best practices

Network security

VNet configuration:

Deploy agent pods in private subnets for enhanced security.
Use NAT Gateway or default route for outbound internet access (required for Matillion control plane).
Restrict network security groups to minimum required ingress/egress.

Outbound connectivity requirements:

HTTPS (443) to Matillion control plane (region-specific endpoints).
HTTPS/JDBC to data warehouse endpoints (Snowflake, Synapse, BigQuery).
HTTPS (443) to Azure APIs (Storage, Key Vault, Azure AD for Workload Identity).

Network security group recommendations:

Egress: Allow HTTPS (443) to specific endpoints only.
Ingress: No inbound traffic required (agent initiates all connections).

Private cluster considerations:

API server accessible only from VNet (or authorized VPN/bastion).
Requires VPN or Azure Bastion for kubectl access.
CI/CD pipelines need VNet connectivity or VPN access.

Pod security standards

The Helm chart implements Kubernetes pod security standards. Security context configuration:

Run as non-root user (UID 65534).
Read-only root filesystem.
No privilege escalation.
Drop all Linux capabilities.
Seccomp profile: RuntimeDefault.

Example from Helm chart:

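# Pod-level settings: apply to every container in the pod
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  fsGroup: 65534
  seccompProfile:
    type: RuntimeDefault

# Container-level settings: set on each container
containers:
  - securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]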

Secrets management

OAuth Credentials Storage Options:

Azure Key Vault (Recommended):
- Store OAuth credentials in Azure Key Vault.
- Use External Secrets Operator or Azure Key Vault Provider for CSI Driver to sync to Kubernetes Secrets.
- Automatic rotation support.
- Centralized secret management across environments.
Kubernetes Secrets (Default):
- Credentials provided via Helm values.
- Stored as base64-encoded Kubernetes Secret.
- Not encrypted at rest by default (enable Azure Disk encryption).
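Base64 is an encoding, not encryption, which is why the default Kubernetes Secret option needs additional protection. The short sketch below shows that anyone who can read the stored value can recover the original; the secret string is the placeholder from the Helm values table, not a real credential.

```python
import base64

# Placeholder value, not a real credential.
secret = "secret456"

# What a Kubernetes Secret stores in its data field.
stored = base64.b64encode(secret.encode("ascii")).decode("ascii")

# Decoding needs no key: read access to the Secret is enough.
recovered = base64.b64decode(stored).decode("ascii")

print(stored != secret, recovered == secret)  # True True
```

This is why access to Secrets should be restricted with Kubernetes RBAC, in addition to any encryption at rest.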

Recommendation: For production, use Azure Key Vault with External Secrets Operator for centralized, auditable secret management.

Scaling considerations

When to scale

Indicators to add more agent pods:

Task queue depth consistently > 0 (check the Task history or metrics).
Pipeline execution time increases due to task queuing.
More concurrent pipelines being executed.
Workload characteristics change (more data ingestion vs transformation).

Indicators to keep current capacity:

Task queue depth consistently = 0.
Agent pod CPU < 60%, memory < 70%.
Pipeline execution times stable.
Workload primarily transformation (SQL generation).
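The indicators above can be combined into a simple decision rule. This is one possible reading, not an official policy: it treats "consistently" as "in every recent sample", uses the CPU and memory thresholds from the list, and returns "review" for cases the guidance does not cover. The function name and inputs are hypothetical.

```python
def scaling_recommendation(queue_depth_samples, cpu_pct, memory_pct):
    """Map recent observations onto the indicators listed above."""
    if not queue_depth_samples:
        return "review"  # no data to judge by
    if all(depth > 0 for depth in queue_depth_samples):
        return "add pods"  # tasks are consistently queuing
    if (
        all(depth == 0 for depth in queue_depth_samples)
        and cpu_pct < 60
        and memory_pct < 70
    ):
        return "keep"  # no queuing and comfortable headroom
    return "review"  # mixed signals: look at workload characteristics


print(scaling_recommendation([3, 5, 2], cpu_pct=40, memory_pct=50))  # add pods
print(scaling_recommendation([0, 0, 0], cpu_pct=40, memory_pct=50))  # keep
print(scaling_recommendation([0, 0, 0], cpu_pct=85, memory_pct=50))  # review
```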

Horizontal Pod Autoscaler (HPA)

How it works:

Kubernetes HPA monitors pod metrics (CPU, memory, or custom metrics).
Automatically scales Deployment replicas within configured min/max range.
Evaluates every 15 seconds (default), scales up/down based on thresholds.

Example HPA configuration:

Min replicas: 2 (baseline availability).
Max replicas: 10 (cost control).
Target CPU: 70% (scale up when average CPU > 70%).

Configure via Helm values or separate HPA manifest. Read the HPA documentation for details.
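To see how the example configuration behaves, the core HPA calculation from the Kubernetes documentation can be sketched as follows: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default) and clamped to the min/max range. This is a simplification that ignores stabilization windows, pod readiness, and missing metrics.

```python
import math


def desired_replicas(
    current_replicas,
    current_cpu_pct,
    target_cpu_pct=70,
    min_replicas=2,
    max_replicas=10,
    tolerance=0.1,
):
    """Approximate the HPA replica calculation for a CPU utilization target."""
    ratio = current_cpu_pct / target_cpu_pct
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no change
    desired = math.ceil(current_replicas * ratio)
    return min(max(desired, min_replicas), max_replicas)


print(desired_replicas(4, 90))   # 6
print(desired_replicas(4, 72))   # 4 (within tolerance)
print(desired_replicas(8, 140))  # 10 (capped at max replicas)
print(desired_replicas(4, 20))   # 2 (held at min replicas)
```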

Cluster autoscaler

How it works:

Monitors pods in Pending state (unable to schedule due to insufficient node capacity).
Automatically adds VMs to node pool (VM Scale Set).
Removes underutilized nodes after 10 minutes of low usage.

Works with HPA:

HPA scales agent pods based on metrics.
If pods can’t schedule (no node capacity), cluster autoscaler adds VMs.
Agent pods schedule on new VMs.
When load decreases, HPA scales down pods, cluster autoscaler removes empty VMs.
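A rough way to anticipate when the cluster autoscaler will add VMs is to estimate how many agent pods fit on one node. In the sketch below all resource figures are hypothetical: the per-pod requests depend on your Helm values, and node allocatable capacity is lower than the VM's nominal size because AKS reserves resources for system components.

```python
import math


def nodes_required(
    pod_count,
    pod_cpu_request_m,
    pod_memory_request_mi,
    node_allocatable_cpu_m,
    node_allocatable_memory_mi,
):
    """Estimate the node count needed to schedule pod_count identical pods."""
    pods_per_node = min(
        node_allocatable_cpu_m // pod_cpu_request_m,
        node_allocatable_memory_mi // pod_memory_request_mi,
    )
    if pods_per_node < 1:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pod_count / pods_per_node)


# Hypothetical figures: 1 vCPU / 2 GiB requested per pod,
# 3800m CPU / 12000 MiB allocatable per node.
print(nodes_required(10, 1000, 2048, 3800, 12000))  # 4
```

With these figures three pods fit per node, so an HPA scale-out from 6 to 10 pods would leave pods pending until the autoscaler grows the pool from 2 to 4 nodes.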

Read the AKS cluster autoscaler documentation for more details.

Vertical scaling

Adjust CPU and memory limits per pod via Helm values:

Useful when individual tasks require more resources than current pod limits.
Requires pod restart to apply new resource limits.
Consider workload characteristics (transformation vs ingestion).

Cost optimization

AKS pricing components

AKS Control Plane: Free on the AKS Free tier. The Standard tier, which adds an uptime SLA, is billed per cluster per hour.
VM Node Pool: Pay for VMs (varies by VM size and region):
- Standard_D2s_v4: ~$70/month (2 vCPU, 8 GB RAM).
- Standard_D4s_v4: ~$140/month (4 vCPU, 16 GB RAM).
- Standard_D8s_v4: ~$280/month (8 vCPU, 32 GB RAM).
Data Transfer: Outbound data transfer charges (data warehouse connections, Matillion control plane).
Managed Disks: For node OS disks (~$5-20/month per node depending on disk type).
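For a first-order budget, the approximate figures above can be combined per node pool. This sketch uses the VM prices listed in this section and assumes a mid-range disk cost of $10 per node; it excludes data transfer and any control plane charge, and actual prices vary by region.

```python
# Approximate monthly VM prices (USD) from the list above.
VM_MONTHLY_USD = {
    "Standard_D2s_v4": 70,
    "Standard_D4s_v4": 140,
    "Standard_D8s_v4": 280,
}


def monthly_estimate(vm_size, node_count, disk_usd_per_node=10):
    """Estimate monthly node pool cost: (VM + OS disk) x node count."""
    return node_count * (VM_MONTHLY_USD[vm_size] + disk_usd_per_node)


print(monthly_estimate("Standard_D4s_v4", 3))  # 450
```

Varying disk_usd_per_node between 5 and 20 gives the range for the same pool: $435 to $480 per month.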

Cost optimization strategies

Right-size VMs: Match VM type to workload (transformation-heavy = smaller, ingestion-heavy = larger).
Use Cluster Autoscaler: Automatically remove unused VMs during low-usage periods.
Consider Azure Reserved Instances: For predictable baseline capacity (1-year or 3-year commitment).
Use Spot VMs for non-critical workloads: Up to 80% cost savings (with eviction risk).
Monitor data transfer: Ensure data warehouses in same region to avoid cross-region charges.

Additional resources

Implementation and deployment

For complete Terraform modules, Helm charts, and step-by-step implementation, see the Matillion Deployment Library on GitHub at github.com/matillion-public/deployment-library.

General Kubernetes guide

You should read the general Kubernetes deployment guide for platform-agnostic concepts and architecture.

Matillion documentation

For deployment models, read Agent overview.
For agent registration, read Create an agent.
For capacity planning, read Scaling best practices.

Azure documentation

For AKS concepts and operations, read Azure Kubernetes Service documentation.
For Azure AD Workload Identity for AKS, read Workload Identity.
For automatic node scaling, read AKS Cluster Autoscaler.
