What you get with AKS deployment
- Managed Kubernetes control plane. Azure handles the Kubernetes API server, etcd, and control plane upgrades.
- Workload Identity/Managed Identity. Allows credential-free authentication to Azure services.
- Flexible node pools. VM scale sets with configurable instance types.
- Azure integration. Native support for Azure Monitor, VNet networking, and Azure Load Balancers.
- Horizontal Pod Autoscaler. Scale pods based on metrics.
- Cluster Autoscaler. Automatically adjust VM capacity in node pools.
When to choose AKS
Choose AKS for deployment when:
- You have existing Azure infrastructure and expertise.
- You need Managed Identity for secure, credential-free access to Azure services (Storage, Key Vault).
- You require integration with Azure monitoring and security tools (Azure Monitor, Sentinel).
- You want Kubernetes operational flexibility with Azure managed services.
- You plan to deploy Maia runners across multiple availability zones for high availability.
Prerequisites and readiness
Azure account requirements
Required Azure services:
- Azure Kubernetes Service enabled in your subscription.
- Sufficient VM quotas for worker nodes.
- Virtual Network (VNet) with subnet configuration.
Required permissions:
- Create and manage AKS clusters and node pools.
- Create and manage virtual machines and VM scale sets.
- Create Managed Identities and role assignments.
- Manage VNet resources (subnets, route tables, network security groups).
- Access Azure Key Vault (for storing OAuth credentials).
- Create Storage Accounts (for staging data).
- Configure Azure Monitor and Log Analytics.
Maia account setup
Before deploying infrastructure, create a Maia runner in the Matillion console. You need the following information about the runner you created:
- Account ID: Your Matillion organization identifier.
- Runner ID: The unique identifier for this runner (auto-generated).
- OAuth Client ID and Secret: Runner authentication credentials.
- Region: us1 (United States), eu1 (Europe), or au1 (Australia/Asia-Pacific).
Required tools
Ensure these tools are installed and configured on your deployment workstation:
- Terraform 1.0+ for infrastructure provisioning.
- Azure CLI configured with credentials (az login).
- kubectl for Kubernetes cluster management.
- Helm 3.x for application deployment.
Architecture decision points
Before deploying, make these key architectural decisions.
1. VNet strategy
Decide whether to use an existing VNet or create a new one.
| Option | When to use | What gets created |
|---|---|---|
| Create new VNet | Isolated deployment, no existing VNet. | New VNet with subnets across availability zones, route tables, network security groups. |
| Use existing VNet | Integrate with existing Azure infrastructure. | AKS cluster in existing VNet; may create new subnets if needed. |
Set use_existing_vnet = true or use_existing_vnet = false in Terraform variables. See terraform.tfvars.example in the deployment library for an example.
2. Subnet strategy
If using an existing VNet, decide whether to use existing subnets or create new ones.
| Option | Requirements | Considerations |
|---|---|---|
| Existing subnets | Subnet with outbound internet access (NAT Gateway or default route). | Must have available IP addresses for AKS nodes and pods. |
| Create new subnets | Room in existing VNet CIDR for new subnet ranges. | Terraform creates new subnets within existing VNet. |
Recommended: a /24 CIDR for the node subnet, to accommodate node IPs and pod IP allocation.
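As a rough way to sanity-check the /24 sizing above, the sketch below counts assignable addresses in a subnet and estimates how many nodes fit. It assumes a networking mode in which every pod takes a VNet IP, and the value of 30 pod slots per node is an illustrative assumption, not a setting from the deployment library. Azure reserves five addresses in every subnet.

```python
import ipaddress

# Azure reserves five addresses in every subnet.
AZURE_RESERVED_IPS = 5


def usable_ips(cidr: str) -> int:
    """Return the number of assignable addresses in an Azure subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AZURE_RESERVED_IPS, 0)


def max_nodes(cidr: str, pods_per_node: int) -> int:
    """Estimate node count when each node needs 1 IP plus 1 IP per pod slot.

    pods_per_node is an assumption for illustration; check your cluster's
    actual networking mode and max-pods setting.
    """
    return usable_ips(cidr) // (1 + pods_per_node)


if __name__ == "__main__":
    subnet = "10.0.1.0/24"  # hypothetical node subnet
    print(usable_ips(subnet))
    print(max_nodes(subnet, pods_per_node=30))
```

If pods do not consume VNet addresses in your configuration, only the node count matters and the same subnet goes much further.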
3. Public vs private cluster
| Setting | API server access | Use case |
|---|---|---|
| Public cluster | API server publicly accessible from authorized IP ranges. | Development, testing, faster initial setup. |
| Private cluster | API server accessible only from within VNet. | Production, enhanced security, requires bastion host or VPN. |
Set is_private_cluster = true or is_private_cluster = false in Terraform variables. See terraform.tfvars.example in the deployment library for an example.
For private clusters, ensure:
- Deployment workstation has VPN or bastion access to VNet.
- CI/CD runners can access cluster API server.
- Authorized IP ranges include your access points.
4. Authentication strategy
- Workload Identity (recommended for AKS 1.25+):
  - Runner pods authenticate to Azure using federated identity credentials.
  - No secrets stored in cluster.
  - Azure AD-backed, automatic token rotation.
  - Best-practice security model for AKS.
- Managed Identity (Azure AD Pod Identity):
  - Runner pods assume Managed Identity without storing credentials.
  - Supports AKS clusters older than 1.25.
  - Automatic credential rotation by Azure.
  - Terraform module can configure either approach.
- Static OAuth Credentials:
  - OAuth credentials stored in Kubernetes Secrets.
  - Use only if identity-based authentication is unavailable (not recommended for AKS).
5. Node pool strategy
Node VM sizing:
| VM size | vCPU | Memory | Use case |
|---|---|---|---|
| Standard_D2s_v4 | 2 | 8 GB | Development, testing, low workload. |
| Standard_D4s_v4 | 4 | 16 GB | Small to medium production workloads. |
| Standard_D8s_v4 | 8 | 32 GB | Production workloads. |
| Standard_D16s_v4 | 16 | 64 GB | High-throughput production workloads. |
Workload considerations:
- Transformation-heavy workloads: SQL generation tasks, low CPU usage → Smaller VMs are sufficient.
- Data ingestion/scripting workloads: High data transfer, processing on the runner → Larger VMs needed.
- Pod density: Larger VMs allow more pods per node, reducing operational overhead.
6. Scaling strategy
Static replica count:
- Fixed number of pods (e.g., 2, 5, 10).
- Predictable capacity and costs.
- Suitable for steady-state workloads.
Horizontal Pod Autoscaler (HPA):
- Automatically scales pods based on workload metrics.
- Configure min/max replicas (e.g., min: 2, max: 10).
- Responds to workload spikes dynamically.
Cluster Autoscaler:
- Automatically adds/removes VMs in node pool based on pod scheduling needs.
- Works in tandem with HPA.
- Optimizes infrastructure costs.
Container images
Runner images are available in Azure Container Registry. Image repository: matillion.azurecr.io/cloud-agent.
Available tags:
- :stable - Slower release cycle, maximum stability, recommended for production.
- :current - Faster release cycle, earlier access to new features.
Deployment journey
Expected timeline
- Phase 1 - Maia runner registration: 10 minutes (Matillion console).
- Phase 2 - Infrastructure provisioning: 15-20 minutes (Terraform: VNet, AKS cluster, Managed Identity).
- Phase 3 - Configure kubectl access: 2 minutes (Azure CLI + kubectl).
- Phase 4 - Maia runner deployment: 5-10 minutes (Helm chart).
- Phase 5 - Validation: 15-30 minutes (Pre-deployment checks + testing).
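Adding up the phase estimates above gives a rough end-to-end window. The short sketch below does that arithmetic; the phase labels are shortened for illustration and the durations are the ones listed in the timeline.

```python
# (phase, minimum minutes, maximum minutes), taken from the timeline above.
PHASES = [
    ("registration", 10, 10),
    ("infrastructure provisioning", 15, 20),
    ("kubectl access", 2, 2),
    ("deployment", 5, 10),
    ("validation", 15, 30),
]


def total_minutes(phases):
    """Return the (minimum, maximum) total duration across all phases."""
    low = sum(minimum for _, minimum, _ in phases)
    high = sum(maximum for _, _, maximum in phases)
    return low, high


if __name__ == "__main__":
    low, high = total_minutes(PHASES)
    print(f"Estimated total: {low}-{high} minutes")
```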
Phase 1: Maia runner registration (Matillion console)
Refer to Prerequisites, above, for details of runner creation.
What you’ll have at the end:
- Account ID.
- Runner ID.
- OAuth Client ID and Secret.
- Region (us1, eu1, or au1).
Phase 2: Infrastructure provisioning (Terraform)
The Terraform module creates:
- Azure AKS cluster:
  - Managed Kubernetes control plane (API server, etcd, controller manager).
  - AKS-managed upgrades and patching.
  - Azure Monitor integration for control plane logs.
- Node pools:
  - VM scale set with configurable VM sizes.
  - Azure-managed node lifecycle and upgrades.
  - Kubernetes node labels and taints (if configured).
- Managed Identity/Workload Identity:
  - User-assigned Managed Identity for workloads.
  - Role assignments for Storage Account access and Key Vault access.
  - Federated identity credentials (for Workload Identity).
- VNet and networking (if creating new):
  - Virtual Network with subnets across availability zones.
  - Network security groups for cluster communication.
  - Route tables for outbound connectivity.
- Network security groups:
  - Rules for control plane and node communication.
  - HTTPS outbound to Matillion control plane.
In terraform.tfvars you will need to make these configuration changes:
- azure_subscription_id: Your Azure subscription ID.
- azure_tenant_id: Your Azure AD tenant ID.
- resource_group_name: Resource group name (for example, matillion-agent-rg).
- location: Azure region (for example, East US, West Europe).
- name: Cluster name prefix (for example, matillion-agent).
- vm_size: Node pool VM size (for example, Standard_D4s_v4).
- desired_node_count: Initial node count (for example, 3).
- is_private_cluster: true or false.
- authorized_ip_ranges: List of CIDRs allowed to access API server.
- workload_identity_enabled: true (recommended) or false.
- tags: Resource tags for cost allocation.
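Before running Terraform, it can help to check the values you plan to put in terraform.tfvars. The sketch below is a hypothetical pre-flight helper, not part of the Terraform module: it uses the variable names listed above, placeholder IDs, and validation rules that are assumptions chosen for illustration.

```python
import ipaddress
from typing import List

# Variables from the list above that should not be left empty (assumed rule).
REQUIRED_KEYS = [
    "azure_subscription_id",
    "azure_tenant_id",
    "resource_group_name",
    "location",
    "name",
    "vm_size",
]


def validate_settings(settings: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not settings.get(key):
            problems.append(f"missing value for {key}")

    count = settings.get("desired_node_count")
    if isinstance(count, bool) or not isinstance(count, int) or count < 1:
        problems.append("desired_node_count must be a positive integer")

    for cidr in settings.get("authorized_ip_ranges", []):
        try:
            ipaddress.ip_network(cidr, strict=False)
        except ValueError:
            problems.append(f"invalid CIDR in authorized_ip_ranges: {cidr}")
    return problems


# Placeholder values for illustration only.
EXAMPLE_SETTINGS = {
    "azure_subscription_id": "00000000-0000-0000-0000-000000000000",
    "azure_tenant_id": "00000000-0000-0000-0000-000000000000",
    "resource_group_name": "matillion-agent-rg",
    "location": "East US",
    "name": "matillion-agent",
    "vm_size": "Standard_D4s_v4",
    "desired_node_count": 3,
    "is_private_cluster": False,
    "authorized_ip_ranges": ["203.0.113.0/24"],
    "workload_identity_enabled": True,
}

if __name__ == "__main__":
    print(validate_settings(EXAMPLE_SETTINGS))
```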
After terraform apply completes, retrieve the Terraform outputs using the terraform output command.
Phase 3: Configure kubectl access
You must configure kubectl to authenticate to your AKS cluster using the Azure CLI. The az aks get-credentials command retrieves the cluster endpoint and certificate authority data, then configures your local kubeconfig file with Azure AD authentication.
The command is: az aks get-credentials --resource-group <resource-group> --name <cluster-name>
Replace <resource-group> and <cluster-name> with the values from your Terraform variables. See terraform.tfvars.example in the deployment library for an example.
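Verification: confirm that kubectl can reach the cluster, for example by running kubectl get nodes and checking that the nodes are listed as Ready.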
Phase 4: Maia runner deployment (Helm)
The Helm chart deploys:
- Runner pods:
  - Deployment with configurable replica count (default: 2).
  - Each pod runs the Matillion binary.
  - Resource requests and limits for CPU and memory.
- ServiceAccount:
  - Kubernetes ServiceAccount configured for Workload Identity or Managed Identity.
  - Annotated with Managed Identity client ID (from Phase 2).
- ConfigMaps:
  - Runner configuration (account ID, runner ID, region).
  - Environment-specific settings.
- Secrets:
  - OAuth Client ID and Secret for Matillion control plane authentication.
- Service:
  - Kubernetes service exposing Prometheus metrics endpoint (port 8080).
  - Annotated for Prometheus service discovery.
| Value | Source | Example |
|---|---|---|
| cloudProvider | Static | "azure" |
| config.oauthClientId | Phase 1 (Matillion console) | "abc123..." |
| config.oauthClientSecret | Phase 1 (Matillion console) | "secret456..." |
| dpcAgent.dpcAgent.env.accountId | Phase 1 (Matillion console) | "12345" |
| dpcAgent.dpcAgent.env.agentId | Phase 1 (Matillion console) | "agent-prod-01" |
| dpcAgent.dpcAgent.env.matillionRegion | Phase 1 (Matillion console) | "us1", "eu1", or "au1" |
| dpcAgent.replicas | Your decision | 2 (baseline) to 10+ (high throughput) |
| dpcAgent.dpcAgent.image.repository | Static | "matillion.azurecr.io/cloud-agent" |
| dpcAgent.dpcAgent.image.tag | Your decision | "stable" or "current" |
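The dotted value names in the table above correspond to nested keys in a Helm values file. As an illustration of that mapping, the sketch below expands dotted keys into a nested structure. The helper name is hypothetical, and the OAuth entries are placeholders rather than real credentials.

```python
import json


def nest(flat: dict) -> dict:
    """Expand dotted keys such as 'a.b.c' into nested dictionaries."""
    nested = {}
    for dotted_key, value in flat.items():
        *parents, leaf = dotted_key.split(".")
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# Example values mirror the table above; OAuth entries are placeholders.
FLAT_VALUES = {
    "cloudProvider": "azure",
    "config.oauthClientId": "<oauth-client-id>",
    "config.oauthClientSecret": "<oauth-client-secret>",
    "dpcAgent.dpcAgent.env.accountId": "12345",
    "dpcAgent.dpcAgent.env.agentId": "agent-prod-01",
    "dpcAgent.dpcAgent.env.matillionRegion": "us1",
    "dpcAgent.replicas": 2,
    "dpcAgent.dpcAgent.image.repository": "matillion.azurecr.io/cloud-agent",
    "dpcAgent.dpcAgent.image.tag": "stable",
}

if __name__ == "__main__":
    print(json.dumps(nest(FLAT_VALUES), indent=2))
```

The printed structure shows which keys sit under the outer dpcAgent block (replicas) and which sit under the inner one (env and image settings).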
Phase 5: Validation and testing
Run automated pre-deployment validation scripts to verify the pod environment:
- Python 3 and Java runtime available.
- Filesystem permissions correct.
- Environment variables set (ACCOUNT_ID, AGENT_ID, etc.).
- cgroup CPU and memory limits applied.
- Network connectivity to Matillion control plane.
- Security agents that might interfere (Crowdstrike, Prisma Cloud).
Post-deployment testing:
- Matillion Console: Navigate to Manage runners. Verify status shows “Connected”.
- Test pipeline: Create a simple pipeline (for example, “Hello World” transformation) and execute.
- Prometheus metrics: Verify metrics are available at http://<pod-ip>:8080/actuator/prometheus.
- Log Analytics workspace query: ContainerLog | where PodName contains "matillion-agent".
Maia runner architecture on AKS
Workload Identity / Managed Identity
How Workload Identity works:
- Kubernetes ServiceAccount is annotated with Managed Identity client ID.
- AKS OIDC Issuer allows Kubernetes to issue tokens trusted by Azure AD.
- The runner pod requests an Azure AD token using the projected service account token.
- Azure AD exchanges the token for an access token (valid 1 hour, auto-refreshed).
- The runner accesses Azure services (Storage, Key Vault) without storing credentials.
Security benefits:
- No long-lived credentials in cluster.
- Automatic token rotation (every hour).
- Least-privilege access (Managed Identity scoped to specific resources).
- Pod-level isolation (each pod authenticates independently).
What Terraform creates:
- User-assigned Managed Identity.
- Federated identity credential (Workload Identity).
- Role assignments for Storage Account, Key Vault.
- AKS cluster OIDC issuer configuration.
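To make the ServiceAccount-to-identity link above more concrete, the sketch below builds the two pieces that have to agree: the subject string a federated identity credential matches on, and a ServiceAccount manifest carrying the client ID annotation. The namespace, ServiceAccount name, and client ID are hypothetical placeholders. The annotation key and audience are the ones Azure AD Workload Identity uses.

```python
# Audience expected by Azure AD when exchanging the service account token.
TOKEN_EXCHANGE_AUDIENCE = "api://AzureADTokenExchange"


def federated_subject(namespace: str, service_account: str) -> str:
    """Subject a federated identity credential must match for this ServiceAccount."""
    return f"system:serviceaccount:{namespace}:{service_account}"


def service_account_manifest(namespace: str, name: str, client_id: str) -> dict:
    """ServiceAccount annotated with the Managed Identity client ID."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"azure.workload.identity/client-id": client_id},
        },
    }


if __name__ == "__main__":
    # Hypothetical names and a placeholder client ID.
    namespace, name = "matillion", "matillion-agent-sa"
    print(federated_subject(namespace, name))
    print(service_account_manifest(namespace, name, "00000000-0000-0000-0000-000000000000"))
```

If the subject on the federated credential does not match the namespace and ServiceAccount the pods actually run under, the token exchange is rejected, which is a common cause of authentication failures after renaming either one.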
Task capacity and throughput
Per-pod capacity: Each pod can execute up to 20 concurrent tasks.
Throughput calculation: Maximum concurrent tasks = (Number of pods) × 20.
Examples:
- 2 pods (default) = 40 concurrent tasks.
- 5 pods = 100 concurrent tasks.
- 10 pods = 200 concurrent tasks.
Workload considerations:
- For transformation workloads: Tasks generate SQL executed by the data warehouse. CPU/memory usage is low. Fewer pods needed.
- For data ingestion workloads: Tasks transfer and process data on the runner. CPU/memory usage is high. More pods needed.
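The throughput formula above can also be inverted to estimate how many pods a target concurrency needs. The following sketch shows both directions, using the 20-tasks-per-pod figure from this section. It ignores CPU and memory pressure, so treat the result as a lower bound for ingestion-heavy workloads.

```python
import math

TASKS_PER_POD = 20  # per-pod concurrency limit stated above


def max_concurrent_tasks(pods: int) -> int:
    """Maximum concurrent tasks for a given number of pods."""
    return pods * TASKS_PER_POD


def pods_needed(concurrent_tasks: int) -> int:
    """Smallest pod count whose capacity covers the requested concurrency."""
    return math.ceil(concurrent_tasks / TASKS_PER_POD)


if __name__ == "__main__":
    print(max_concurrent_tasks(5))
    print(pods_needed(45))
```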
Monitoring and observability
Native Prometheus metrics
Runner pods expose Prometheus-compatible metrics at:
- Endpoint: http://<pod-ip>:8080/actuator/prometheus.
- Service: Automatically created by Helm chart with Prometheus annotations.
Key metrics:
- app_version_info: Runner version and build metadata.
- app_agent_status: Runner status (1 = running, 0 = stopped).
- app_active_task_count: Current number of executing tasks.
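For a quick manual check of the metrics listed above, the sketch below parses Prometheus text exposition output and reads the status and task count. The sample payload is made up for illustration (including the version label), and the parser assumes samples carry no trailing timestamp.

```python
# Made-up sample of what the endpoint might return; label names are hypothetical.
SAMPLE = """\
# HELP app_agent_status Runner status
# TYPE app_agent_status gauge
app_agent_status 1.0
app_active_task_count 7.0
app_version_info{version="1.2.3"} 1.0
"""


def parse_metrics(text: str) -> dict:
    """Map metric name to its last sample value, ignoring comments and labels."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the final space-separated token (no timestamp assumed).
        series, _, raw_value = line.rpartition(" ")
        name = series.split("{", 1)[0]
        values[name] = float(raw_value)
    return values


def is_running(metrics: dict) -> bool:
    """True when app_agent_status reports 1 (running)."""
    return metrics.get("app_agent_status") == 1.0


if __name__ == "__main__":
    metrics = parse_metrics(SAMPLE)
    print(is_running(metrics), metrics["app_active_task_count"])
```

In practice a Prometheus server scrapes the endpoint for you; a parser like this is mainly useful for one-off checks against a single pod.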
Azure Monitor integration
Enable Azure Monitor Container Insights for comprehensive AKS monitoring:
- Cluster-level metrics (CPU, memory, network).
- Pod-level metrics (resource usage per pod).
- Node-level metrics (VM health, disk usage).
- Centralized log aggregation across all pods.
- Query with Kusto Query Language (KQL).
- Set up alerts on error patterns.
Alert conditions to consider:
- Runner pod restarts > threshold.
- Runner pods in CrashLoopBackOff state.
- Task execution failures (requires custom metric from logs).
- Node pool CPU/memory > 80%.
Security best practices
Network security
VNet configuration:
- Deploy pods in private subnets for enhanced security.
- Use NAT Gateway or default route for outbound internet access (required for Matillion control plane).
- Restrict network security groups to minimum required ingress/egress.
Required network access:
- Egress:
  - HTTPS (443) to Matillion control plane (region-specific endpoints).
  - HTTPS/JDBC to your relevant data warehouse endpoints.
  - HTTPS (443) to Azure APIs (Storage, Key Vault, Azure AD for Workload Identity).
  - HTTP (80) to Snowflake endpoints.
  - HTTPS (443) to all other required specific endpoints.
- Ingress: No inbound traffic required (the runner initiates all connections).
Private cluster considerations:
- API server accessible only from VNet (or authorized VPN/bastion).
- Requires VPN or Azure Bastion for kubectl access.
- CI/CD pipelines need VNet connectivity or VPN access.
Pod security standards
The Helm chart implements Kubernetes pod security standards.
Security context configuration:
- Run as non-root user (UID 65534).
- Read-only root filesystem.
- No privilege escalation.
- Drop all Linux capabilities.
- Seccomp profile: RuntimeDefault.
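The settings above correspond to standard Kubernetes securityContext fields. The sketch below expresses them as a dictionary and checks a context against them. It illustrates the field names only and is not the chart's actual template; the helper name is hypothetical.

```python
from typing import List

# Standard Kubernetes securityContext fields matching the settings listed above.
EXPECTED_CONTEXT = {
    "runAsNonRoot": True,
    "runAsUser": 65534,
    "readOnlyRootFilesystem": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
}


def violations(context: dict) -> List[str]:
    """Return the names of fields that differ from the expected hardening."""
    return [
        field
        for field, expected in EXPECTED_CONTEXT.items()
        if context.get(field) != expected
    ]


if __name__ == "__main__":
    # A context that forgets to make the root filesystem read-only.
    weakened = dict(EXPECTED_CONTEXT, readOnlyRootFilesystem=False)
    print(violations(weakened))
```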
Secrets management
OAuth credentials storage options:
- Azure Key Vault (recommended):
  - Store OAuth credentials in Azure Key Vault.
  - Use External Secrets Operator or Azure Key Vault Provider for CSI Driver to sync to Kubernetes Secrets.
  - Automatic rotation support.
  - Centralized secret management across environments.
- Kubernetes Secrets (default):
  - Credentials provided via Helm values.
  - Stored as base64-encoded Kubernetes Secret.
  - Not encrypted at rest by default (enable Azure Disk encryption).
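The point about base64 above is worth seeing directly: it is an encoding, not encryption, so anyone who can read the Secret object can recover the value. A minimal demonstration with a made-up secret value:

```python
import base64


def encode_secret_value(plaintext: str) -> str:
    """Encode a value the way it appears in a Kubernetes Secret's data field."""
    return base64.b64encode(plaintext.encode("utf-8")).decode("ascii")


def decode_secret_value(encoded: str) -> str:
    """Reverse the encoding; no key or password is required."""
    return base64.b64decode(encoded).decode("utf-8")


if __name__ == "__main__":
    stored = encode_secret_value("example-oauth-secret")  # made-up value
    print(stored)
    print(decode_secret_value(stored))
```

This is why access to Secrets should be restricted with RBAC, and why keeping the source of truth in Key Vault is the recommended option.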
Scaling considerations
When to scale
Indicators to add more pods:
- Task queue depth consistently > 0 (check the Task history or metrics).
- Pipeline execution time increases due to task queuing.
- More concurrent pipelines being executed.
- Workload characteristics change (more data ingestion vs transformation).
Indicators that current capacity is sufficient:
- Task queue depth consistently = 0.
- Pipeline execution times stable.
- Workload primarily transformation (SQL generation).
Horizontal Pod Autoscaler (HPA)
How it works:
- Kubernetes HPA monitors pod metrics.
- Automatically scales Deployment replicas within configured min/max range.
- Evaluates every 15 seconds (default), scales up/down based on thresholds.
Example configuration:
- Min replicas: 2 (baseline availability).
- Max replicas: 10 (cost control).
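The scaling decision described above follows the documented Kubernetes HPA rule: desired replicas = ceil(current replicas × current metric / target metric), clamped to the min/max range. The sketch below reproduces that rule with the 2 and 10 bounds from this section. The metric values are illustrative, and the real controller applies further behavior (stabilization windows, handling of not-ready pods) that is omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 2,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA calculation: scale by the metric ratio, then clamp."""
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the replica count unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Illustrative: 2 pods averaging 90% CPU against a 60% target.
    print(desired_replicas(2, current_metric=90, target_metric=60))
```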
Cluster autoscaler
How it works:
- Monitors pods in Pending state (unable to schedule due to insufficient node capacity).
- Automatically adds VMs to node pool (VM Scale Set).
- Removes underutilized nodes after 10 minutes of low usage.
How it works with HPA:
- HPA scales pods based on metrics.
- If pods can’t schedule (no node capacity), cluster autoscaler adds VMs.
- Pending pods schedule on new VMs.
- When load decreases, HPA scales down pods, cluster autoscaler removes empty VMs.
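As a back-of-the-envelope model of the interaction above, the sketch below estimates how many VMs the cluster autoscaler would need to add once the HPA raises the pod count. The pods-per-node figure is an assumption for illustration. Real scheduling depends on resource requests, node allocatable capacity, and other workloads on the nodes.

```python
import math


def extra_nodes_needed(desired_pods: int, current_nodes: int, pods_per_node: int) -> int:
    """Nodes to add so that every desired pod can be scheduled."""
    capacity = current_nodes * pods_per_node
    pending = max(desired_pods - capacity, 0)  # pods that would stay Pending
    return math.ceil(pending / pods_per_node)


if __name__ == "__main__":
    # Illustrative: HPA wants 10 pods, 2 nodes exist, each fits 3 pods.
    print(extra_nodes_needed(desired_pods=10, current_nodes=2, pods_per_node=3))
```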
Vertical scaling
Adjust CPU and memory limits per pod via Helm values:
- Useful when individual tasks require more resources than current pod limits.
- Requires pod restart to apply new resource limits.
- Consider workload characteristics (transformation vs ingestion).
Cost optimization
Cost optimization strategies
- Right-size VMs: Match VM type to workload (transformation-heavy = smaller, ingestion-heavy = larger).
- Use Cluster Autoscaler: Automatically remove unused VMs during low-usage periods.
- Consider Azure Reserved Instances: For predictable baseline capacity (1-year or 3-year commitment).
- Monitor data transfer: Ensure data warehouses in same region to avoid cross-region charges.
Additional resources
Implementation and deployment
For complete Terraform modules, Helm charts, and step-by-step implementation, see the Matillion Deployment Library on GitHub at github.com/matillion-public/deployment-library.
General Kubernetes guide
You should read the general Kubernetes deployment guide for platform-agnostic concepts and architecture.
Matillion documentation
- For deployment models, read the Maia runner overview.
- For runner registration, read Create a Maia runner.
- For capacity planning, read Scaling best practices.
Azure documentation
- For AKS concepts and operations, read Azure Kubernetes Service documentation.
- For Azure AD Workload Identity for AKS, read Workload Identity.
- For automatic node scaling, read AKS Cluster Autoscaler.
