# Azure AKS deployment guide for Maia runners

export const m_runner = "Maia runner";

export const maia = "Maia";

This document helps you understand Azure-specific architecture decisions, deployment considerations, and readiness requirements for running {m_runner}s on Azure Kubernetes Service (AKS). AKS provides a managed Kubernetes control plane for running Matillion {m_runner}s in your Azure infrastructure. This deployment model combines Azure-native security features (Managed Identity, Workload Identity) with Kubernetes operational flexibility.

For complete Terraform modules, Helm charts, and step-by-step implementation instructions, see the [Azure {m_runner} directory](https://github.com/matillion-public/deployment-library/tree/main/agent/azure) in the Matillion deployment library on GitHub.

You should read the general [Kubernetes deployment guide](/docs/guides/kubernetes-deployment-guide) before reading this document.

### What you get with AKS deployment

* Managed Kubernetes control plane. Azure handles the Kubernetes API server, etcd, and control plane upgrades.
* Workload Identity/Managed Identity. Allows credential-free authentication to Azure services.
* Flexible node pools. VM scale sets with configurable instance types.
* Azure integration. Native support for Azure Monitor, VNet networking, and Azure Load Balancers.
* Horizontal Pod Autoscaler. Scale {m_runner} pods based on metrics.
* Cluster Autoscaler. Automatically adjust VM capacity in node pools.

### When to choose AKS

Choose AKS for {m_runner} deployment when:

* You have existing Azure infrastructure and expertise.
* You need Managed Identity for secure, credential-free access to Azure services (Storage, Key Vault).
* You require integration with Azure monitoring and security tools (Azure Monitor, Sentinel).
* You want Kubernetes operational flexibility with Azure managed services.
* You plan to deploy {m_runner}s across multiple availability zones for high availability.

***

## Prerequisites and readiness

### Azure account requirements

Required Azure services:

* Azure Kubernetes Service enabled in your subscription.
* Sufficient VM quotas for worker nodes.
* Virtual Network (VNet) with subnet configuration.

Your Azure identity needs permissions to:

* Create and manage AKS clusters and node pools.
* Create and manage virtual machines and VM scale sets.
* Create Managed Identities and role assignments.
* Manage VNet resources (subnets, route tables, network security groups).
* Access Azure Key Vault (for storing OAuth credentials).
* Create Storage Accounts (for {m_runner} staging data).
* Configure Azure Monitor and Log Analytics.

We recommend you use the Azure **Contributor** role for initial deployment, then scope down to least-privilege for ongoing operations.

### Maia account setup

Before deploying infrastructure, create a {m_runner} in {maia}.

You need to obtain the following information about the {m_runner} you created:

* **Account ID:** Your Matillion organization identifier.
* **Runner ID:** The unique identifier for this {m_runner} (auto-generated).
* **OAuth Client ID and Secret:** {m_runner} authentication credentials.
* **Region:** `us1` (United States), `eu1` (Europe), or `au1` (Australia/Asia-Pacific).

These credentials are required for the Helm deployment in [Phase 4](#phase-4-maia-runner-deployment-helm). Store them securely.

For details, read [create a {m_runner}](/docs/guides/create-a-runner#prerequisites).

### Required tools

Ensure these tools are installed and configured on your deployment workstation:

* [Terraform 1.0+](https://www.terraform.io/downloads.html) for infrastructure provisioning.
* [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli) configured with credentials (`az login`).
* [kubectl](https://kubernetes.io/docs/tasks/tools/) for Kubernetes cluster management.
* [Helm 3.x](https://helm.sh/docs/intro/install/) for application deployment.

Verify prerequisites:

```bash theme={null}
# Verify Azure CLI authentication
az account show

# Verify tool versions
terraform --version
kubectl version --client
helm version
```

***

## Architecture decision points

Before deploying, make these key architectural decisions.

### 1. VNet strategy

Decide whether to use an existing VNet or create a new one.

| Option                | When to use                                       | What gets created                                                                       |
| --------------------- | ------------------------------------------------- | --------------------------------------------------------------------------------------- |
| **Create new VNet**   | Isolated {m_runner} deployment, no existing VNet. | New VNet with subnets across availability zones, route tables, network security groups. |
| **Use existing VNet** | Integrate with existing Azure infrastructure.     | AKS cluster in existing VNet; may create new subnets if needed.                         |

Set `use_existing_vnet = true` or `use_existing_vnet = false` in Terraform variables. See `terraform.tfvars.example` in the deployment library for an example.

### 2. Subnet strategy

If using an existing VNet, decide whether to use existing subnets or create new ones.

| Option                 | Requirements                                                         | Considerations                                           |
| ---------------------- | -------------------------------------------------------------------- | -------------------------------------------------------- |
| **Existing subnets**   | Subnet with outbound internet access (NAT Gateway or default route). | Must have available IP addresses for AKS nodes and pods. |
| **Create new subnets** | Room in existing VNet CIDR for new subnet ranges.                    | Terraform creates new subnets within existing VNet.      |

Ensure the node subnet is at least a `/24` CIDR block so that it can accommodate both node IPs and pod IP allocation.
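As a rough sanity check on the `/24` minimum, the sketch below computes how many addresses a given prefix length leaves available. It assumes Azure's standard reservation of five addresses per subnet. `usable_ips` is a hypothetical helper name, and how many of these addresses pods consume depends on the network plugin your cluster uses.

```bash
# Hypothetical helper: usable IPv4 addresses in an Azure subnet of a given prefix length.
# Azure reserves 5 addresses in every subnet, so they are subtracted from the total.
usable_ips() {
  local prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

usable_ips 24   # prints 251
usable_ips 26   # prints 59
```

A `/26` leaves only 59 addresses, which can be exhausted quickly when pods draw IPs from the same subnet as the nodes.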

### 3. Public vs private cluster

| Setting             | API server access                                         | Use case                                                     |
| ------------------- | --------------------------------------------------------- | ------------------------------------------------------------ |
| **Public cluster**  | API server publicly accessible from authorized IP ranges. | Development, testing, faster initial setup.                  |
| **Private cluster** | API server accessible only from within VNet.              | Production, enhanced security, requires bastion host or VPN. |

Set `is_private_cluster = true` or `is_private_cluster = false` in Terraform variables. See `terraform.tfvars.example` in the deployment library for an example.

For private clusters, ensure:

* Deployment workstation has VPN or bastion access to VNet.
* CI/CD runners can access cluster API server.
* Authorized IP ranges include your access points.

### 4. Authentication strategy

* Workload Identity (recommended for AKS 1.25+):

  * {m_runner} pods authenticate to Azure using federated identity credentials.
  * No secrets stored in cluster.
  * Azure AD-backed, automatic token rotation.
  * Best-practice security model for AKS.

* Managed Identity (Azure AD Pod Identity):

  * {m_runner} pods assume Managed Identity without storing credentials.
  * Supports AKS clusters older than 1.25.
  * Automatic credential rotation by Azure.
  * Terraform module can configure either approach.

* Static OAuth Credentials:

  * OAuth credentials stored in Kubernetes Secrets.
  * Use only if identity-based authentication is unavailable (not recommended for AKS).

Recommendation: Use Workload Identity for AKS 1.25+ deployments. The deployment library Terraform module creates the required identity and role assignments automatically.

### 5. Node pool strategy

Node VM sizing:

| VM size            | vCPU | Memory | Use case                              |
| ------------------ | ---- | ------ | ------------------------------------- |
| `Standard_D2s_v4`  | 2    | 8 GB   | Development, testing, low workload.   |
| `Standard_D4s_v4`  | 4    | 16 GB  | Small to medium production workloads. |
| `Standard_D8s_v4`  | 8    | 32 GB  | Production workloads.                 |
| `Standard_D16s_v4` | 16   | 64 GB  | High-throughput production workloads. |

Considerations:

* **Transformation-heavy workloads:** Tasks mainly generate SQL that runs in the data warehouse, so {m_runner} CPU usage is low and smaller VMs are sufficient.
* **Data ingestion/scripting workloads:** Data is transferred and processed on the {m_runner}, so larger VMs are needed.
* **Pod density:** Larger VMs allow more {m_runner} pods per node, reducing operational overhead.

Configure VM size in Terraform node pool settings.

### 6. Scaling strategy

Static replica count:

* Fixed number of {m_runner} pods (e.g., 2, 5, 10).
* Predictable capacity and costs.
* Suitable for steady-state workloads.

Horizontal Pod Autoscaler (HPA):

* Automatically scales {m_runner} pods based on workload metrics.
* Configure min/max replicas (e.g., min: 2, max: 10).
* Responds to workload spikes dynamically.

Cluster Autoscaler:

* Automatically adds/removes VMs in node pool based on pod scheduling needs.
* Works in tandem with HPA.
* Optimizes infrastructure costs.

Recommendation: Start with static replicas, then add HPA once you understand your workload patterns.

***

## Container images

{m_runner} images are available in Azure Container Registry.

Image repository: `matillion.azurecr.io/cloud-agent`.

Available tags:

* `:stable` - Slower release cycle, maximum stability, recommended for production.
* `:current` - Faster release cycle, earlier access to new features.

Both tags are production-ready. Choose whichever suits your preference for stability vs. early feature access.

No authentication is required. Matillion's public Azure Container Registry allows anonymous pulls.

***

## Deployment journey

### Expected timeline

* **Phase 1 — {m_runner} registration:** 10 minutes (Matillion console).
* **Phase 2 — Infrastructure provisioning:** 15-20 minutes (Terraform: VNet, AKS cluster, Managed Identity).
* **Phase 3 — Configure kubectl access:** 2 minutes (Azure CLI + kubectl).
* **Phase 4 — {m_runner} deployment:** 5-10 minutes (Helm chart).
* **Phase 5 — Validation:** 15-30 minutes (Pre-deployment checks + testing).

**Total:** Approximately 50-75 minutes for a first-time deployment.

### Phase 1: Maia runner registration (Matillion console)

Refer to [Prerequisites](#prerequisites-and-readiness), above, for details of {m_runner} creation.

What you'll have at the end:

* Account ID.
* {m_runner} ID.
* OAuth Client ID and Secret.
* Region (`us1`, `eu1`, or `au1`).

Store these securely. You'll need them for Helm deployment in [Phase 4](#phase-4-maia-runner-deployment-helm).

### Phase 2: Infrastructure provisioning (Terraform)

The Terraform module creates:

1. **Azure AKS cluster:**

   * Managed Kubernetes control plane (API server, etcd, controller manager).
   * AKS-managed upgrades and patching.
   * Azure Monitor integration for control plane logs.

2. **Node pools:**

   * VM scale set with configurable VM sizes.
   * Azure-managed node lifecycle and upgrades.
   * Kubernetes node labels and taints (if configured).

3. **Managed Identity/Workload Identity:**

   * User-assigned Managed Identity for {m_runner} workloads.
   * Role assignments for Storage Account access and Key Vault access.
   * Federated identity credentials (for Workload Identity).

4. **VNet and networking (if creating new):**

   * Virtual Network with subnets across availability zones.
   * Network security groups for cluster communication.
   * Route tables for outbound connectivity.

5. **Network security groups:**

   * Rules for control plane and node communication.
   * HTTPS outbound to Matillion control plane.

Set the following variables in `terraform.tfvars`:

* `azure_subscription_id`: Your Azure subscription ID.
* `azure_tenant_id`: Your Azure AD tenant ID.
* `resource_group_name`: Resource group name (for example, `matillion-agent-rg`).
* `location`: Azure region (for example, `East US`, `West Europe`).
* `name`: Cluster name prefix (for example, `matillion-agent`).
* `vm_size`: Node pool VM size (for example, `Standard_D4s_v4`).
* `desired_node_count`: Initial node count (for example, `3`).
* `is_private_cluster`: `true` or `false`.
* `authorized_ip_ranges`: List of CIDRs allowed to access the API server.
* `workload_identity_enabled`: `true` (recommended) or `false`.
* `tags`: Resource tags for cost allocation.
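The variables listed above could be combined into a file along the following lines. This is a sketch only: the variable names come from this guide, every value is a placeholder, and the value types (for example, whether `tags` is a map) are assumptions. Treat `terraform.tfvars.example` in the deployment library as the authoritative reference. Note that the command overwrites any existing `terraform.tfvars` in the current directory.

```bash
# Write an illustrative terraform.tfvars. All values are placeholders;
# check names and types against terraform.tfvars.example before applying.
cat > terraform.tfvars <<'EOF'
azure_subscription_id     = "00000000-0000-0000-0000-000000000000"
azure_tenant_id           = "00000000-0000-0000-0000-000000000000"
resource_group_name       = "matillion-agent-rg"
location                  = "East US"
name                      = "matillion-agent"
vm_size                   = "Standard_D4s_v4"
desired_node_count        = 3
use_existing_vnet         = false
is_private_cluster        = false
authorized_ip_ranges      = ["203.0.113.0/24"]
workload_identity_enabled = true
tags = {
  environment = "dev"
}
EOF
```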

After `terraform apply` completes, retrieve the Terraform outputs using:

```bash theme={null}
terraform output cluster_name
terraform output managed_identity_client_id
```
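If you want to reuse these outputs in later phases, one option is to capture them as shell variables. The sketch below assumes you run it from the directory containing the Terraform state; `load_tf_outputs` and the variable names are hypothetical.

```bash
# Capture the Terraform outputs as shell variables for the later phases.
# -raw prints the bare string value without surrounding quotes.
load_tf_outputs() {
  CLUSTER_NAME="$(terraform output -raw cluster_name)" || return 1
  MANAGED_IDENTITY_CLIENT_ID="$(terraform output -raw managed_identity_client_id)" || return 1
  export CLUSTER_NAME MANAGED_IDENTITY_CLIENT_ID
}
```

After calling `load_tf_outputs`, `$CLUSTER_NAME` and `$MANAGED_IDENTITY_CLIENT_ID` are available to subsequent commands in the same shell session.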

**Where to implement:** [AKS Terraform module](https://github.com/matillion-public/deployment-library/tree/main/agent/azure/aks).

### Phase 3: Configure kubectl access

You must configure kubectl to authenticate to your AKS cluster using the Azure CLI.

The `az aks get-credentials` command retrieves the cluster endpoint and certificate authority data, then configures your local `kubeconfig` file with Azure AD authentication.

The command is:

```bash theme={null}
az aks get-credentials --resource-group <resource-group> --name <cluster-name>
```

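For `<resource-group>`, use the `resource_group_name` value from your Terraform variables. For `<cluster-name>`, use the `cluster_name` Terraform output from [Phase 2](#phase-2-infrastructure-provisioning-terraform). See `terraform.tfvars.example` in the deployment library for an example.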

Verification:

```bash theme={null}
kubectl get nodes
kubectl get namespaces
```

You should see AKS worker nodes and default Kubernetes namespaces.
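The two steps in this phase can be wrapped in a small helper that also waits for the nodes to become ready. This is a sketch: `connect_to_cluster` is a hypothetical function name, and the five-minute timeout is an arbitrary choice.

```bash
# Fetch kubeconfig credentials for the cluster, then wait for the nodes.
connect_to_cluster() {
  local resource_group="$1" cluster_name="$2"
  az aks get-credentials --resource-group "$resource_group" --name "$cluster_name" || return 1
  # Block until every node reports Ready, or give up after five minutes.
  kubectl wait --for=condition=Ready nodes --all --timeout=300s
}
```

Call it as `connect_to_cluster <resource-group> <cluster-name>`. A non-zero exit status means either the credentials could not be retrieved or the nodes did not become ready in time.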

### Phase 4: Maia runner deployment (Helm)

The Helm chart deploys:

1. **{m_runner} pods:**

   * Deployment with configurable replica count (default: 2).
   * Each pod runs the Matillion {m_runner} binary.
   * Resource requests and limits for CPU and memory.

2. **ServiceAccount:**

   * Kubernetes ServiceAccount configured for Workload Identity or Managed Identity.
   * Annotated with Managed Identity client ID (from [Phase 2](#phase-2-infrastructure-provisioning-terraform)).

3. **ConfigMaps:**

   * {m_runner} configuration (account ID, {m_runner} ID, region).
   * Environment-specific settings.

4. **Secrets:**

   * OAuth Client ID and Secret for Matillion control plane authentication.

5. **Service:**

   * Kubernetes service exposing Prometheus metrics endpoint (port 8080).
   * Annotated for Prometheus service discovery.

You will provide the following configuration values:

| Value                                   | Source                                                                             | Example                                   |
| --------------------------------------- | ---------------------------------------------------------------------------------- | ----------------------------------------- |
| `cloudProvider`                         | Static                                                                             | `"azure"`                                 |
| `config.oauthClientId`                  | [Phase 1](#phase-1-maia-runner-registration-matillion-console) (Matillion console) | `"abc123..."`                             |
| `config.oauthClientSecret`              | [Phase 1](#phase-1-maia-runner-registration-matillion-console) (Matillion console) | `"secret456..."`                          |
| `dpcAgent.dpcAgent.env.accountId`       | [Phase 1](#phase-1-maia-runner-registration-matillion-console) (Matillion console) | `"12345"`                                 |
| `dpcAgent.dpcAgent.env.agentId`         | [Phase 1](#phase-1-maia-runner-registration-matillion-console) (Matillion console) | `"agent-prod-01"`                         |
| `dpcAgent.dpcAgent.env.matillionRegion` | [Phase 1](#phase-1-maia-runner-registration-matillion-console) (Matillion console) | `"us1"`, `"eu1"`, or `"au1"`              |
| `dpcAgent.replicas`                     | Your decision                                                                      | `2` (baseline) to `10+` (high throughput) |
| `dpcAgent.dpcAgent.image.repository`    | Static                                                                             | `"matillion.azurecr.io/cloud-agent"`      |
| `dpcAgent.dpcAgent.image.tag`           | Your decision                                                                      | `"stable"` or `"current"`                 |

**Where to implement:**

* [Helm chart documentation](https://github.com/matillion-public/deployment-library/tree/main/agent/helm).
* [values.yaml reference](https://github.com/matillion-public/deployment-library/blob/main/agent/helm/agent/values.yaml).
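The dotted value paths in the table above map to nested YAML keys. The sketch below writes a values file with placeholder values and defines a helper that installs the chart. The file name `runner-values.yaml`, the function name `deploy_runner`, the chart path, the release name, and the namespace are all assumptions; check them against your checkout of the deployment library and the `values.yaml` reference.

```bash
# Write a values file using the keys from the table above. All values are placeholders.
cat > runner-values.yaml <<'EOF'
cloudProvider: "azure"
config:
  oauthClientId: "<oauth-client-id>"
  oauthClientSecret: "<oauth-client-secret>"
dpcAgent:
  replicas: 2
  dpcAgent:
    env:
      accountId: "<account-id>"
      agentId: "<runner-id>"
      matillionRegion: "us1"
    image:
      repository: "matillion.azurecr.io/cloud-agent"
      tag: "stable"
EOF

# Install or upgrade the release from the deployment library root.
# Chart path, release name, and namespace are assumptions.
deploy_runner() {
  helm upgrade --install matillion-agent ./agent/helm/agent \
    --namespace matillion --create-namespace \
    --values runner-values.yaml
}
```

Because the file contains the OAuth secret once the placeholders are filled in, keep it out of version control.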

### Phase 5: Validation and testing

Run the automated pre-deployment validation script to verify the {m_runner} pod environment:

```bash theme={null}
# From deployment library root
./agent/helm/checks/run-check.sh --namespace matillion --release matillion-agent
```

What gets checked:

* Python 3 and Java runtime available.
* Filesystem permissions correct.
* Environment variables set (`ACCOUNT_ID`, `AGENT_ID`, etc.).
* cgroup CPU and memory limits applied.
* Network connectivity to Matillion control plane.
* Security agents that might interfere (CrowdStrike, Prisma Cloud).

Manual verification:

1. **Matillion console:** Navigate to **Manage runners**. Verify that the {m_runner} status shows "Connected".
2. **Test pipeline:** Create a simple pipeline (for example, "Hello World" transformation) and execute.
3. **Prometheus metrics:** Verify metrics available at `http://<pod-ip>:8080/actuator/prometheus`.

{m_runner} application logs are available in Azure Monitor (if Container Insights enabled):

* Log Analytics workspace query (ContainerLogV2 schema, which includes a `PodName` column): `ContainerLogV2 | where PodName contains "matillion-agent"`.

***

## Maia runner architecture on AKS

### Workload Identity / Managed Identity

How Workload Identity works:

1. **Kubernetes ServiceAccount** is annotated with Managed Identity client ID.
2. **AKS OIDC Issuer** allows Kubernetes to issue tokens trusted by Azure AD.
3. **{m_runner} pod** requests Azure AD token using projected service account token.
4. **Azure AD** exchanges token for access token (valid 1 hour, auto-refreshed).
5. **{m_runner}** accesses Azure services (Storage, Key Vault) without storing credentials.

Security benefits:

* No long-lived credentials in cluster.
* Automatic token rotation (every hour).
* Least-privilege access (Managed Identity scoped to specific resources).
* Pod-level isolation (each pod authenticates independently).

What the Terraform module creates:

* User-assigned Managed Identity.
* Federated identity credential (Workload Identity).
* Role assignments for Storage Account, Key Vault.
* AKS cluster OIDC issuer configuration.
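To confirm that the pieces of the chain described above are in place, you can inspect the ServiceAccount annotation and the cluster's OIDC issuer. The sketch below uses the standard `azure.workload.identity/client-id` annotation key. The function names are hypothetical, and the ServiceAccount name depends on what the Helm chart creates in your release.

```bash
# Print the client ID annotation that links a ServiceAccount to the Managed Identity.
# Dots inside the annotation key are escaped for kubectl's jsonpath syntax.
sa_client_id() {
  local namespace="$1" service_account="$2"
  kubectl get serviceaccount "$service_account" --namespace "$namespace" \
    -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}'
}

# Print the cluster's OIDC issuer URL (empty if the issuer is not enabled).
oidc_issuer_url() {
  local resource_group="$1" cluster_name="$2"
  az aks show --resource-group "$resource_group" --name "$cluster_name" \
    --query "oidcIssuerProfile.issuerUrl" --output tsv
}
```

The value printed by `sa_client_id` should match the `managed_identity_client_id` Terraform output, and `oidc_issuer_url` should print a non-empty URL.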

### Task capacity and throughput

**Per-pod capacity:** Each {m_runner} pod can execute up to 20 concurrent tasks.

**Throughput calculation:** Maximum concurrent tasks = (Number of {m_runner} pods) × 20.

Examples:

* 2 pods (default) = 40 concurrent tasks.
* 5 pods = 100 concurrent tasks.
* 10 pods = 200 concurrent tasks.
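The same arithmetic can be run in reverse to estimate how many pods a given peak requires. This is a minimal sketch using the 20-tasks-per-pod limit stated above; the function names are hypothetical.

```bash
TASKS_PER_POD=20  # per-pod concurrency limit stated above

# Maximum concurrent tasks for a given pod count.
max_concurrent_tasks() { echo $(( $1 * TASKS_PER_POD )); }

# Smallest pod count whose combined capacity covers the given peak (ceiling division).
pods_needed() { echo $(( ($1 + TASKS_PER_POD - 1) / TASKS_PER_POD )); }

max_concurrent_tasks 5   # prints 100
pods_needed 45           # prints 3
```

A peak of 45 concurrent tasks needs 3 pods, because 2 pods cover only 40 tasks and the remaining 5 would queue.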

**Scaling guidance:**

* For transformation workloads: Tasks generate SQL executed by data warehouse. {m_runner} CPU/memory usage is low. Fewer pods needed.
* For data ingestion workloads: Tasks transfer and process data on {m_runner}. {m_runner} CPU/memory usage is high. More pods needed.

**Queuing behavior:** When all pods are at capacity (20 tasks each), new tasks queue in Matillion's agent gateway until capacity becomes available.

***

## Monitoring and observability

### Native Prometheus metrics

{m_runner} pods expose Prometheus-compatible metrics at:

* **Endpoint:** `http://<pod-ip>:8080/actuator/prometheus`.
* **Service:** Automatically created by Helm chart with Prometheus annotations.

Key metrics:

* `app_version_info`: {m_runner} version and build metadata.
* `app_agent_status`: {m_runner} status (1 = running, 0 = stopped).
* `app_active_task_count`: Current number of executing tasks.

The Helm chart includes annotations for automatic Prometheus service discovery:

```yaml theme={null}
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/actuator/prometheus"
```

If Prometheus is deployed in your cluster, it will automatically discover and scrape these metrics.
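If Prometheus is not deployed, you can still read individual metrics straight from the endpoint. The sketch below defines a hypothetical `metric_value` helper that extracts one metric from Prometheus text-format output on standard input. It assumes samples carry no trailing timestamp, so the value is the last field on the line.

```bash
# Extract the value of a metric from Prometheus text-format output read on stdin.
# Matches both bare samples ("name 3.0") and labelled ones ("name{...} 3.0").
# Comment lines (# HELP, # TYPE) never match because they start with "#".
metric_value() {
  local name="$1"
  awk -v name="$name" '
    index($0, name) == 1 {
      next_char = substr($0, length(name) + 1, 1)
      if (next_char == " " || next_char == "{") { print $NF }
    }'
}
```

For example, with a port-forward to a pod in place (`kubectl port-forward --namespace matillion <pod-name> 8080:8080`), running `curl -s http://localhost:8080/actuator/prometheus | metric_value app_active_task_count` prints the number of tasks that pod is currently executing.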

### Azure Monitor integration

Enable Azure Monitor Container Insights for comprehensive AKS monitoring:

* Cluster-level metrics (CPU, memory, network).
* Pod-level metrics (resource usage per {m_runner} pod).
* Node-level metrics (VM health, disk usage).

{m_runner} application logs are streamed to Log Analytics workspace:

* Centralized log aggregation across all pods.
* Query with Kusto Query Language (KQL).
* Set up alerts on error patterns.

Recommended Azure Monitor alerts:

* {m_runner} pod restarts > threshold.
* {m_runner} pods in CrashLoopBackOff state.
* Task execution failures (requires custom metric from {m_runner} logs).
* Node pool CPU/memory > 80%.

***

## Security best practices

### Network security

VNet configuration:

* Deploy {m_runner} pods in **private subnets** for enhanced security.
* Use NAT Gateway or default route for outbound internet access (required for Matillion control plane).
* Restrict network security groups to minimum required ingress/egress.

Network connectivity requirements:

Outbound:

* **HTTPS (443)** to Matillion control plane (region-specific endpoints).
* **HTTPS/JDBC** to your relevant data warehouse endpoints.
* **HTTPS (443)** to Azure APIs (Storage, Key Vault, Azure AD for Workload Identity).
* **HTTP (80)** to Snowflake endpoints.
* **HTTPS (443)** to any other endpoints that your pipelines need to reach.

Inbound:

* **Ingress:** No inbound traffic required ({m_runner} initiates all connections).

Private cluster considerations:

* API server accessible only from VNet (or authorized VPN/bastion).
* Requires VPN or Azure Bastion for kubectl access.
* CI/CD pipelines need VNet connectivity or VPN access.

### Pod security standards

The Helm chart implements Kubernetes pod security standards.

Security context configuration:

* Run as non-root user (UID 65534).
* Read-only root filesystem.
* No privilege escalation.
* Drop all Linux capabilities.
* Seccomp profile: RuntimeDefault.

Example from Helm chart:

```yaml theme={null}
securityContext:
  runAsNonRoot: true
  runAsUser: 65534
  fsGroup: 65534
  seccompProfile:
    type: RuntimeDefault

containers:
  - securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
```

### Secrets management

OAuth credentials storage options:

* Azure Key Vault (recommended):
  * Store OAuth credentials in Azure Key Vault.
  * Use External Secrets Operator or Azure Key Vault Provider for CSI Driver to sync to Kubernetes Secrets.
  * Automatic rotation support.
  * Centralized secret management across environments.
* Kubernetes Secrets (default):
  * Credentials provided via Helm values.
  * Stored as base64-encoded Kubernetes Secret.
  * Not encrypted at rest by default (enable Azure Disk encryption).

Recommendation: For production, use Azure Key Vault with External Secrets Operator for centralized, auditable secret management.
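As a starting point for the Key Vault option, the credentials can be written with the Azure CLI. This is a sketch: `store_oauth_credentials` and the secret names are hypothetical, and the sync to Kubernetes Secrets (External Secrets Operator or the CSI driver) still needs to be configured separately.

```bash
# Store the runner's OAuth credentials in Key Vault.
# The secret names are hypothetical; use whatever naming your sync tool expects.
# --output none stops the CLI from echoing the stored secret value.
store_oauth_credentials() {
  local vault_name="$1" client_id="$2" client_secret="$3"
  az keyvault secret set --vault-name "$vault_name" \
    --name "maia-runner-oauth-client-id" --value "$client_id" --output none || return 1
  az keyvault secret set --vault-name "$vault_name" \
    --name "maia-runner-oauth-client-secret" --value "$client_secret" --output none
}
```

Passing secrets as arguments can leave them in shell history, so consider reading them from environment variables or a prompt rather than typing them inline.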

***

## Scaling considerations

### When to scale

Indicators to add more {m_runner} pods:

* Task queue depth consistently > 0 (check the {maia} [Task history](/docs/guides/designer-ui-basics#task-history) or metrics).
* Pipeline execution time increases due to task queuing.
* More concurrent pipelines being executed.
* Workload characteristics change (more data ingestion vs transformation).

Indicators to keep current capacity:

* Task queue depth consistently = 0.
* Pipeline execution times stable.
* Workload primarily transformation (SQL generation).

### Horizontal Pod Autoscaler (HPA)

How it works:

* Kubernetes HPA monitors pod metrics.
* Automatically scales Deployment replicas within configured min/max range.
* Evaluates every 15 seconds (default), scales up/down based on thresholds.

Example HPA configuration:

* Min replicas: 2 (baseline availability).
* Max replicas: 10 (cost control).

Configure via Helm values or separate HPA manifest.
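If you choose the separate manifest route, a minimal HPA could look like the sketch below. The replica bounds match the example above. The Deployment name, namespace, and the 70% CPU utilization target are assumptions, so verify the Deployment name in your release before applying.

```bash
# Write an illustrative HPA manifest targeting the runner Deployment.
cat > runner-hpa.yaml <<'EOF'
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: matillion-agent
  namespace: matillion
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: matillion-agent   # assumed Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # assumed threshold
EOF
# Apply with: kubectl apply -f runner-hpa.yaml
```

CPU-based scaling relies on the pods declaring CPU requests, which the Helm chart sets. For transformation-heavy workloads where CPU stays low, a CPU target may never trigger, so a static replica count can be the better fit.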

Read the [HPA documentation](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) for details.

### Cluster autoscaler

How it works:

* Monitors pods in Pending state (unable to schedule due to insufficient node capacity).
* Automatically adds VMs to node pool (VM Scale Set).
* Removes underutilized nodes after 10 minutes of low usage.

Works with HPA:

1. HPA scales {m_runner} pods based on metrics.
2. If pods can't schedule (no node capacity), cluster autoscaler adds VMs.
3. {m_runner} pods schedule on new VMs.
4. When load decreases, HPA scales down pods, cluster autoscaler removes empty VMs.
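To observe step 2 of this sequence, you can list the pods that are waiting for node capacity. The sketch below assumes the runners are deployed in a namespace called `matillion`; `pending_runner_pods` is a hypothetical helper name.

```bash
# List runner pods stuck in Pending, the signal the cluster autoscaler reacts to.
pending_runner_pods() {
  local namespace="${1:-matillion}"
  kubectl get pods --namespace "$namespace" --field-selector=status.phase=Pending
}
```

If pods stay in Pending for more than a few minutes, check that the node pool's maximum node count has not been reached.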

Read the [AKS cluster autoscaler documentation](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler) for more details.

### Vertical scaling

Adjust CPU and memory limits per pod via Helm values:

* Useful when individual tasks require more resources than current pod limits.
* Requires pod restart to apply new resource limits.
* Consider workload characteristics (transformation vs ingestion).

***

## Cost optimization

### Cost optimization strategies

* **Right-size VMs:** Match VM type to workload (transformation-heavy = smaller, ingestion-heavy = larger).
* **Use Cluster Autoscaler:** Automatically remove unused VMs during low-usage periods.
* **Consider Azure Reserved Instances:** For predictable baseline capacity (1-year or 3-year commitment).
* **Monitor data transfer:** Keep data warehouses in the same region as the AKS cluster to avoid cross-region data transfer charges.

***

## Additional resources

### Implementation and deployment

For complete Terraform modules, Helm charts, and step-by-step implementation, see the following in the Matillion Deployment Library on GitHub:

* [Azure {m_runner} directory](https://github.com/matillion-public/deployment-library/tree/main/agent/azure).
* [AKS Terraform module](https://github.com/matillion-public/deployment-library/tree/main/agent/azure/aks).
* [Helm charts](https://github.com/matillion-public/deployment-library/tree/main/agent/helm).

You can find the Matillion Deployment Library at [github.com/matillion-public/deployment-library](https://github.com/matillion-public/deployment-library).

### General Kubernetes guide

You should read the general [Kubernetes deployment guide](/docs/guides/kubernetes-deployment-guide) for platform-agnostic concepts and architecture.

### Matillion documentation

* For deployment models, read [{m_runner} overview](/docs/guides/runner-overview).
* For {m_runner} registration, read [Create a {m_runner}](/docs/guides/create-a-runner).
* For capacity planning, read [Scaling best practices](/docs/guides/scaling-best-practices).

### Azure documentation

* For AKS concepts and operations, read [Azure Kubernetes Service documentation](https://learn.microsoft.com/en-us/azure/aks/).
* For Azure AD Workload Identity for AKS, read [Workload Identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview).
* For automatic node scaling, read [AKS Cluster Autoscaler](https://learn.microsoft.com/en-us/azure/aks/cluster-autoscaler).
