This document helps you understand Google Cloud-specific architecture decisions, deployment considerations, and readiness requirements for running Maia runners on Google Kubernetes Engine (GKE). GKE provides a managed Kubernetes control plane for running Maia runners in your Google Cloud infrastructure. This deployment model combines Google Cloud-native security features (Workload Identity) with Kubernetes operational flexibility. For complete Terraform modules, Helm charts, and step-by-step implementation instructions, see the GCP directory in the Matillion deployment library on GitHub. You should read the general Kubernetes deployment guide before reading this document.
What you get with GKE deployment
- Managed Kubernetes control plane. Google handles the Kubernetes API server, etcd, and control plane upgrades.
- Workload Identity. Credential-free authentication from pods to Google Cloud services using federated identity credentials.
- Flexible node pools. Configurable machine types with autoscaling support.
- Google Cloud integration. Native support for Cloud Monitoring, VPC networking, and Google Cloud Load Balancers.
- Horizontal Pod Autoscaler. Scale pods based on metrics.
- Cluster Autoscaler. Automatically adjust node capacity in node pools.
Prerequisites and readiness
Google Cloud requirements
Required Google Cloud services:
- Google Kubernetes Engine API enabled in your target project.
- Billing enabled for your GCP project.
- Sufficient compute quotas for node pool VMs.
- VPC with subnet configuration.
Required IAM permissions:
- Create and manage GKE clusters and node pools.
- Create and manage Compute Engine instances.
- Create service accounts and IAM bindings (including Workload Identity bindings).
- Manage VPC resources (subnets, routes, Cloud NAT, firewall rules).
- Access Secret Manager (for storing OAuth credentials).
- Create GCS buckets (for staging data).
- Configure Cloud Logging and Cloud Monitoring.
Matillion account setup
Before deploying infrastructure, create a Maia runner in the Matillion console. You need to obtain the following information about the Maia runner you created:
- Account ID: Your Matillion organization identifier.
- Agent ID: The unique identifier for this Maia runner (auto-generated).
- OAuth Client ID and Secret: Maia runner authentication credentials.
- Region: us1 (United States), eu1 (Europe), or au1 (Australia/Asia-Pacific).
Required tools
Ensure these tools are installed and configured on your deployment workstation:
- Terraform 1.0+ for infrastructure provisioning.
- Google Cloud SDK (gcloud) configured with credentials (gcloud auth application-default login).
- gke-gcloud-auth-plugin for kubectl authentication to GKE.
- kubectl for Kubernetes cluster management.
- Helm 3.x for application deployment.
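As a quick readiness check, the tool list above can be verified from the workstation shell. This is a minimal sketch: the function name missing_tools is hypothetical, and it only confirms that each binary is on PATH, not that its version or credentials are correct.

```shell
# Print each named tool that is not found on PATH; print nothing if all exist.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Usage: missing_tools terraform gcloud gke-gcloud-auth-plugin kubectl helm
```

An empty result means every listed tool was found.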
Architecture decision points
Before deploying, make these key architectural decisions.
1. VPC and networking
The GKE Terraform module always creates a dedicated VPC. The module does not support attaching to an existing VPC. What gets created:
- VPC with subnets and secondary IP ranges for GKE pods and services.
- Cloud NAT (when enable_cloud_nat = true) for outbound internet access from private nodes.
- Firewall rules for cluster communication.
GKE requires secondary IP ranges for pod and service IPs. The module configures these automatically, so no manual subnet configuration is required.
2. Public vs private cluster
| Setting | API server access | Use case |
|---|---|---|
| Public cluster | API server publicly accessible from authorized IP ranges. | Development, testing, faster initial setup. |
| Private cluster | API server accessible only from within VPC. | Production, enhanced security, requires Cloud NAT for node egress. |
Set is_private_cluster = true or is_private_cluster = false in the Terraform variables. See terraform.tfvars.example in the deployment library for an example.
For private clusters, ensure:
- Cloud NAT is enabled so private nodes can pull container images and reach external endpoints.
- Deployment workstation has VPN or bastion access to VPC.
- Authorized IP ranges include your access points.
3. Authentication strategy
- Workload Identity (recommended):
  - Maia runner pods authenticate to Google Cloud APIs using federated identity credentials.
  - No static credentials stored in the cluster.
  - Automatic token rotation by Google Cloud STS.
  - Best-practice security model for GKE.
- Static OAuth credentials:
  - OAuth credentials stored in Kubernetes Secrets.
  - Use only if Workload Identity cannot be implemented (not recommended for GKE).
4. Node pool strategy
Node machine type sizing:
| Machine type | vCPU | Memory | Use case |
|---|---|---|---|
| e2-standard-2 | 2 | 8 GB | Development, testing, low workload. |
| e2-standard-4 | 4 | 16 GB | Small to medium production workloads. |
| e2-standard-8 | 8 | 32 GB | Production workloads. |
| e2-standard-16 | 16 | 64 GB | High-throughput production workloads. |
Sizing considerations:
- Transformation-heavy workloads: SQL generation tasks with low CPU usage, so smaller machines are sufficient.
- Data ingestion/scripting workloads: High data transfer and processing on the Maia runner, so larger machines are needed.
- Pod density: Larger machines allow more pods per node, reducing operational overhead.
5. Scaling strategy
Static replica count:
- Fixed number of pods (for example, 2, 5, 10).
- Predictable capacity and costs.
- Suitable for steady-state workloads.
Horizontal Pod Autoscaler (HPA):
- Automatically scales pods based on workload metrics.
- Configure min/max replicas (for example, min: 2, max: 10).
- Responds to workload spikes dynamically.
Cluster Autoscaler:
- Automatically adds/removes nodes in the node pool based on pod scheduling needs.
- Works in tandem with HPA.
- Optimizes infrastructure costs.
Container images
Maia runner images are available in Google Artifact Registry. Image repositories:
- US region: us-docker.pkg.dev/maia-492711/maia-runners/maia-runner
- EU region: europe-docker.pkg.dev/maia-492711/maia-runners/maia-runner
- AU region: australia-southeast1-docker.pkg.dev/maia-492711/maia-runners/maia-runner
Image tags:
- :stable. Slower release cycle, maximum stability, recommended for production.
- :current. Faster release cycle, earlier access to new features.
Choose :stable for stability-first deployments, or :current for early access to features.
Select the repository that matches your Matillion region (us1 → US registry, eu1 → EU registry, au1 → AU registry) to minimize latency and egress costs.
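The region-to-registry mapping above can be captured in a small helper so that scripts never mix regions. This is an illustrative sketch; the function name runner_image_repository is hypothetical, while the repository paths are the ones listed in this section.

```shell
# Map a Matillion region (us1, eu1, au1) to its Artifact Registry repository.
runner_image_repository() {
  case "$1" in
    us1) echo "us-docker.pkg.dev/maia-492711/maia-runners/maia-runner" ;;
    eu1) echo "europe-docker.pkg.dev/maia-492711/maia-runners/maia-runner" ;;
    au1) echo "australia-southeast1-docker.pkg.dev/maia-492711/maia-runners/maia-runner" ;;
    *)   echo "unknown Matillion region: $1" >&2; return 1 ;;
  esac
}

# Usage: image="$(runner_image_repository eu1):stable"
```

Failing on an unrecognized region, rather than falling back to a default, surfaces mistakes such as eu instead of eu1 early.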
Deployment journey
Expected timeline
- Phase 1, Maia runner registration: 10 minutes (Matillion console).
- Phase 2, infrastructure provisioning: 15-20 minutes (Terraform: VPC, GKE cluster, Workload Identity).
- Phase 3, configure kubectl access: 2 minutes (gcloud CLI + kubectl).
- Phase 4, Maia runner deployment: 5-10 minutes (Helm chart).
- Phase 5, validation: 15-30 minutes (pre-deployment checks + testing).
Phase 1: Maia runner registration (Matillion console)
Refer to Prerequisites, above, for details of Maia runner creation. What you’ll have at the end:
- Account ID.
- Agent ID.
- OAuth Client ID and Secret.
- Region (us1, eu1, or au1).
Phase 2: Infrastructure provisioning (Terraform)
The Terraform module creates:
- GKE cluster:
  - Managed Kubernetes control plane (API server, etcd, controller manager).
  - GKE-managed upgrades and patching.
  - Cloud Logging and Cloud Monitoring integration.
- Node pool:
  - Managed instance group with configurable machine types.
  - Shielded nodes with Secure Boot enabled.
  - Kubernetes node labels (if configured).
- Workload Identity:
  - GCP service account for workloads.
  - IAM binding linking the GCP service account to the Kubernetes service account.
  - Role assignments for Secret Manager and GCS access.
- VPC and networking:
  - VPC with subnets and secondary IP ranges for pods and services.
  - Cloud NAT for outbound internet from private nodes.
  - Firewall rules for cluster communication.
- Supporting services:
  - GCS bucket for staging storage.
  - Secret Manager secret for credential storage.
In terraform.tfvars you will need to make these configuration changes:
- project_id: Your GCP project ID.
- region: GCP region (for example, us-central1, europe-west1, australia-southeast1).
- name: Cluster name prefix (for example, matillion-runner).
- desired_node_count: Initial node count (for example, 2).
- machine_type: Node pool machine type (for example, e2-standard-4).
- is_private_cluster: true or false.
- master_ipv4_cidr_block: CIDR for the private cluster control plane (for example, 172.16.0.0/28).
- authorized_ip_ranges: List of CIDRs allowed to access the API server.
- enable_cloud_nat: true (required when is_private_cluster = true).
- labels: Resource labels for cost allocation.
See terraform.tfvars.example in the deployment library for a complete example.
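To make the variable list concrete, the sketch below writes an illustrative variables file for a private cluster. The variable names come from the list above; every value is a placeholder, and the exact shape of authorized_ip_ranges and labels is an assumption, so treat terraform.tfvars.example in the deployment library as authoritative. The helper name write_example_tfvars is hypothetical.

```shell
# Write an illustrative tfvars file to the path given as $1.
write_example_tfvars() {
  cat > "$1" <<'EOF'
project_id             = "your-gcp-project-id"
region                 = "us-central1"
name                   = "matillion-runner"
desired_node_count     = 2
machine_type           = "e2-standard-4"
is_private_cluster     = true
master_ipv4_cidr_block = "172.16.0.0/28"
authorized_ip_ranges   = ["203.0.113.0/24"] # assumed shape: list of CIDR strings
enable_cloud_nat       = true               # required for private clusters
labels = {
  environment = "dev" # hypothetical label for cost allocation
}
EOF
}

# Usage: write_example_tfvars terraform.tfvars
```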
Why the three project-wide IAM grants?
The service account is granted three project-wide IAM roles: roles/browser, roles/secretmanager.secretAccessor, and roles/secretmanager.viewer. Matillion models the GCP project as the “vault” for this Maia runner, equivalent to an Azure Key Vault or a Snowpark schema in other providers. The UI therefore surfaces a GCP Project ID selector when users define a secret, which needs roles/browser to enumerate projects and the two Secret Manager roles to list and read secrets within them.
After terraform apply completes, retrieve the Terraform outputs with terraform output. The cluster_name and runner_workload_sa_email values are needed in later phases.
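One way to read the two outputs that later phases refer to is sketched below. It assumes the module exposes outputs named cluster_name and runner_workload_sa_email, as referenced elsewhere in this document, and that the command runs in the directory where terraform apply was executed. The function name show_runner_outputs is hypothetical.

```shell
# Print the cluster name and the runner service account email, one per line.
show_runner_outputs() {
  terraform output -raw cluster_name
  echo
  terraform output -raw runner_workload_sa_email
  echo
}
```

Running terraform output with no arguments lists every output if you want to see the full set.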
If you see an “identity pool does not exist” error during terraform apply, run terraform apply a second time. This occurs because the GKE Workload Identity pool takes a moment to propagate after cluster creation.
Phase 3: Configure kubectl access
You must configure kubectl to authenticate to your GKE cluster using the Google Cloud SDK. The gcloud container clusters get-credentials command retrieves the cluster endpoint and certificate authority data, then configures your local kubeconfig file with GKE authentication.
Install the GKE authentication plugin (if not already installed):
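A possible sketch of both steps, plugin installation followed by credential retrieval, is shown here. The wrapper name configure_gke_kubectl is hypothetical; the two gcloud commands are the ones named in this document. If gcloud was installed through an OS package manager, the plugin may need to be installed through that package manager instead.

```shell
# Install the auth plugin, then write GKE credentials into the local kubeconfig.
# Arguments: <cluster-name> <region> <project_id>
configure_gke_kubectl() {
  gcloud components install gke-gcloud-auth-plugin --quiet &&
  gcloud container clusters get-credentials "$1" --region "$2" --project "$3"
}

# Usage: configure_gke_kubectl <cluster-name> us-central1 your-gcp-project-id
```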
When you fetch credentials, use cluster_name from the Terraform output for <cluster-name>, and the same region and project_id as in your Terraform variables.
If the plugin was just installed, add the Google Cloud SDK binary directory to your PATH before running the get-credentials command, for example: export PATH="/opt/homebrew/share/google-cloud-sdk/bin:$PATH".
Phase 4: Maia runner deployment (Helm)
The Helm chart deploys:
- Maia runner pods:
  - Deployment with configurable replica count (default: 2).
  - Each pod runs the Maia runner binary.
  - Resource requests and limits derived from runnerSize t-shirt sizing.
- ServiceAccount:
  - Kubernetes ServiceAccount configured for Workload Identity.
  - Annotated with the GCP service account email (from Phase 2).
- ConfigMaps:
  - Maia runner configuration (account ID, Agent ID, region, default GCP project).
  - Environment-specific settings.
- Secrets:
  - OAuth Client ID and Secret for Matillion control plane authentication.
- Service:
  - Kubernetes service exposing Prometheus metrics endpoint (port 8080).
  - Annotated for Prometheus service discovery.
Key Helm values:
| Value | Source | Example |
|---|---|---|
| cloudProvider | Static | "gcp" |
| runnerSize | Your decision | small, medium, large, or xlarge |
| config.oauthClientId | Phase 1 (Matillion console) | "abc123..." |
| config.oauthClientSecret | Phase 1 (Matillion console) | "secret456..." |
| dpcAgent.dpcAgent.env.accountId | Phase 1 (Matillion console) | "12345" |
| dpcAgent.dpcAgent.env.agentId | Phase 1 (Matillion console) | "agent-prod-01" |
| dpcAgent.dpcAgent.env.matillionRegion | Phase 1 (Matillion console) | "us1", "eu1", or "au1" |
| dpcAgent.dpcAgent.env.defaultGcpProject | Your GCP project ID | "your-gcp-project-id" |
| dpcAgent.dpcAgent.image.repository | Region-specific registry | "us-docker.pkg.dev/maia-492711/maia-runners/maia-runner" |
| dpcAgent.dpcAgent.image.tag | Your decision | "stable" or "current" |
| gcp.workloadIdentity.serviceAccountEmail | Phase 2 Terraform output | "runner-sa@project.iam.gserviceaccount.com" |
| dpcAgent.replicas | Your decision | 2 (baseline) to 10+ (high throughput) |
- defaultGcpProject is the fallback vault used when a secret definition does not specify a project.
- Additional GCP projects can be granted access so they also appear in the Matillion UI. See Phase 4a below if you need multi-project access.
- For dev and preprod environments, set matillionEnv in the dpcAgent.dpcAgent.env block. Without it the Maia runner defaults to prod, which will cause it to connect to the wrong environment.
- The optional environment variables (proxyHttp, customCertLocation, etc.) must be explicitly set to empty strings if not used. Leaving them as template placeholders (for example, <CustomCertLocation>) causes the Maia runner to exit immediately on startup with no further error.
- The Helm release name and Kubernetes namespace must match. The Workload Identity binding in Terraform is created for {namespace}/{namespace}-sa. If the namespace differs from the release name, authentication will silently fail.
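The values table and the notes above can be combined into a values file and an install command, sketched below. The key paths follow the dotted names in the table and the values are its placeholder examples. Placing proxyHttp and customCertLocation under dpcAgent.dpcAgent.env is an assumption, as are the function names and the chart reference passed as the second argument.

```shell
# Emit an illustrative Helm values file on stdout (placeholder values only).
render_runner_values() {
  cat <<'EOF'
cloudProvider: "gcp"
runnerSize: medium
config:
  oauthClientId: "abc123..."
  oauthClientSecret: "secret456..."
gcp:
  workloadIdentity:
    serviceAccountEmail: "runner-sa@project.iam.gserviceaccount.com"
dpcAgent:
  replicas: 2
  dpcAgent:
    image:
      repository: "us-docker.pkg.dev/maia-492711/maia-runners/maia-runner"
      tag: "stable"
    env:
      accountId: "12345"
      agentId: "agent-prod-01"
      matillionRegion: "us1"
      defaultGcpProject: "your-gcp-project-id"
      # Assumed location of the optional variables; unused ones stay empty.
      proxyHttp: ""
      customCertLocation: ""
EOF
}

# Install with the release name equal to the namespace.
# $1 = release name and namespace, $2 = chart path or reference (placeholder).
install_runner() {
  render_runner_values > values.yaml &&
  helm upgrade --install "$1" "$2" --namespace "$1" --create-namespace -f values.yaml
}
```

Passing the same argument as both release name and namespace makes the matching requirement hard to violate by accident.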
Phase 4a (Optional): Granting access to additional GCP projects
Each GCP project acts as a vault, that is, a namespace for secrets. By default, the Maia runner can only read secrets from the project set in defaultGcpProject. If your organization stores secrets in multiple GCP projects (for example, one per environment or team), you can grant the Maia runner service account access to each of them. Once granted, those projects appear in the UI’s GCP Project ID drop-down alongside the default.
In each additional GCP project, ensure the required APIs (such as the Secret Manager API) are enabled. The project owner can do this once.
Option 1: Terraform (recommended)
Add the additional project IDs to your terraform.tfvars, then run terraform apply. Terraform will automatically grant roles/secretmanager.secretAccessor, roles/secretmanager.viewer, and roles/browser to the Maia runner service account in each listed project.
Option 2: Manual (gcloud)
If you prefer to grant access manually, use gcloud to grant the same three roles to the service account in each additional project, replacing RUNNER_SA with the value from the Terraform output runner_workload_sa_email (noted in Phase 2).
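A minimal sketch of the manual grant follows. It binds the three roles named in Option 1 using gcloud projects add-iam-policy-binding. The function name grant_runner_project_access is hypothetical, and the project ID and service account email are supplied by you.

```shell
# Grant the three roles to the runner service account in one additional project.
# $1 = additional project ID, $2 = RUNNER_SA (service account email).
grant_runner_project_access() {
  for role in roles/secretmanager.secretAccessor roles/secretmanager.viewer roles/browser; do
    gcloud projects add-iam-policy-binding "$1" \
      --member="serviceAccount:$2" \
      --role="$role" || return 1
  done
}

# Usage: grant_runner_project_access other-project-id "$RUNNER_SA"
```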
defaultGcpProject remains the fallback when no project is explicitly selected.
Security consideration
A single service account with access to multiple projects means that a compromise of that Maia runner can affect secrets across all projects. For strict isolation between projects or teams, the recommended approach is to deploy separate Maia runner instances, one per project, each with its own service account.
Phase 5: Validation and testing
Run automated pre-deployment validation scripts to verify the pod environment:
- Python 3 and Java runtime available.
- Filesystem permissions correct.
- Environment variables set (ACCOUNT_ID, AGENT_ID, etc.).
- cgroup CPU and memory limits applied.
- Network connectivity to Matillion control plane.
- Security agents that might interfere (CrowdStrike, Prisma Cloud).
Post-deployment checks:
- Matillion Console: Navigate to Manage runners. Verify that the status shows “Connected”.
- Test pipeline: Create a simple pipeline (for example, “Hello World” transformation) and execute.
- Prometheus metrics: Verify metrics are available at http://<pod-ip>:8080/actuator/prometheus.
Maia runner architecture on GKE
Workload Identity
How Workload Identity works:
- Kubernetes ServiceAccount is annotated with the GCP service account email.
- GKE OIDC issuer allows Kubernetes to issue tokens trusted by Google Cloud IAM.
- Maia runner pod requests a Google Cloud access token using the projected service account token.
- Google Cloud STS exchanges the token for short-lived access credentials (valid one hour, auto-refreshed).
- Maia runner accesses Google Cloud services (Secret Manager, GCS) without storing credentials.
Security benefits:
- No long-lived credentials in cluster.
- Automatic token rotation (every hour).
- Least-privilege access (GCP service account scoped to specific resources).
- Pod-level isolation (each pod authenticates independently).
What Terraform creates:
- GCP service account for workloads.
- Workload Identity IAM binding between the GCP service account and the Kubernetes service account.
- Role assignments for Secret Manager (secretAccessor, viewer) and GCS.
- GKE Workload Identity pool configuration on the cluster.
Task capacity and throughput
Per-pod capacity: Each Maia runner pod can execute up to 20 concurrent tasks.
Throughput calculation: Maximum concurrent tasks = (Number of pods) × 20.
Examples:
- Two pods (default) = 40 concurrent tasks.
- Five pods = 100 concurrent tasks.
- 10 pods = 200 concurrent tasks.
Workload considerations:
- For transformation workloads: Tasks generate SQL executed by the data warehouse. CPU/memory usage is low. Fewer pods needed.
- For data ingestion workloads: Tasks transfer and process data on the Maia runner. CPU/memory usage is high. More pods needed.
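The throughput calculation above, and its inverse for capacity planning, can be sketched as shell arithmetic. The function names are hypothetical, and the only input taken from this document is the limit of 20 tasks per pod.

```shell
TASKS_PER_POD=20  # per-pod limit stated above

# Maximum concurrent tasks for a given pod count.
max_concurrent_tasks() { echo $(( $1 * TASKS_PER_POD )); }

# Minimum pods needed for a target number of concurrent tasks (rounded up).
pods_for_tasks() { echo $(( ($1 + TASKS_PER_POD - 1) / TASKS_PER_POD )); }

# Usage: max_concurrent_tasks 5    (prints 100)
#        pods_for_tasks 45         (prints 3)
```

The result of pods_for_tasks is a lower bound: ingestion-heavy tasks can exhaust CPU or memory before a pod reaches 20 tasks.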
Monitoring and observability
Native Prometheus metrics
Maia runner pods expose Prometheus-compatible metrics at:
- Endpoint: http://<pod-ip>:8080/actuator/prometheus.
- Service: Automatically created by Helm chart with Prometheus annotations.
Key metrics:
- app_version_info: Maia runner version and build metadata.
- app_agent_status: Maia runner status (1 = running, 0 = stopped).
- app_active_task_count: Current number of executing tasks.
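For a quick manual check without a Prometheus server, a single metric can be pulled out of the endpoint's text output. The helper below is a sketch with a hypothetical name, metric_value. It assumes the standard Prometheus text exposition format with no trailing timestamps and no spaces inside label values.

```shell
# Read Prometheus text-format metrics on stdin and print the value of one metric.
# Matches both "name value" and "name{labels} value" sample lines.
metric_value() {
  awk -v m="$1" '$1 == m || index($1, m "{") == 1 { print $NF }'
}

# Usage (the pod IP is a placeholder):
#   curl -s "http://<pod-ip>:8080/actuator/prometheus" | metric_value app_active_task_count
```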
Google Cloud Monitoring integration
Enable Cloud Monitoring for comprehensive GKE monitoring:
- Cluster-level metrics (CPU, memory, network).
- Pod-level metrics (resource usage per pod).
- Node-level metrics (VM health, disk usage).
Cloud Logging:
- Centralized log aggregation across all pods.
- Query with the Logs Explorer or gcloud CLI.
- Set up log-based alerts on error patterns.
Recommended alerts:
- Maia runner pod restarts > threshold.
- Maia runner pods in CrashLoopBackOff state.
- Task execution failures (requires custom metric from logs).
- Node pool CPU/memory > 80%.
Security best practices
Network security
VPC configuration:
- Deploy Maia runner pods in private node pools for enhanced security.
- Use Cloud NAT for outbound internet access (required for Matillion control plane and image pulls when is_private_cluster = true).
- Restrict firewall rules to minimum required ingress/egress.
Required network connectivity:
- HTTPS (443) to Matillion control plane (region-specific endpoints).
- HTTPS/JDBC to data warehouse endpoints (Snowflake, BigQuery).
- HTTPS (443) to Google Cloud APIs (Secret Manager, GCS, IAM, Artifact Registry).
- HTTP (80) to Snowflake endpoints.
- Ingress: No inbound traffic required (the Maia runner initiates all connections).
Private cluster considerations:
- API server accessible only from VPC (or authorized VPN/bastion).
- Requires VPN or Identity-Aware Proxy for kubectl access.
- CI/CD pipelines need VPC connectivity or VPN access.
Pod security standards
The Helm chart implements Kubernetes pod security standards. Security context configuration:
- Run as non-root user (UID 65534).
- Read-only root filesystem.
- No privilege escalation.
- Drop all Linux capabilities.
- Seccomp profile: RuntimeDefault.
Secrets management
OAuth credentials storage options:
- Google Cloud Secret Manager (recommended):
  - Store OAuth credentials in Secret Manager.
  - Use External Secrets Operator to sync to Kubernetes Secrets.
  - Automatic rotation support.
  - Centralized secret management across environments.
- Kubernetes Secrets (default):
  - Credentials provided via Helm values.
  - Stored as base64-encoded Kubernetes Secret.
  - Base64 is an encoding, not encryption. GKE encrypts stored data at rest by default, but application-layer encryption of Secrets is off unless you enable it with Cloud KMS.
Scaling considerations
When to scale
Indicators to add more pods:
- Task queue depth consistently > 0 (check the Task history or metrics).
- Pipeline execution time increases due to task queuing.
- More concurrent pipelines being executed.
- Workload characteristics change (more data ingestion vs transformation).
Indicators that current capacity is sufficient:
- Task queue depth consistently = 0.
- Pipeline execution times stable.
- Workload primarily transformation (SQL generation).
Horizontal Pod Autoscaler (HPA)
How it works:
- Kubernetes HPA monitors Maia runner pod metrics.
- Automatically scales Deployment replicas within configured min/max range.
- Evaluates every 15 seconds (default), scales up/down based on thresholds.
Key setting: hpa.metrics.target.averageValue, the target number of in-flight tasks per pod.
- Hard cap: 20. Each pod runs a maximum of 20 concurrent tasks. With a value above 20, the per-pod average can never reach the target, so the HPA never scales out.
- Recommended range: 15–17.
  - 15: proactive (spiky or latency-sensitive workloads, more headroom).
  - 16: balanced (recommended default).
  - 17: reactive (steady workloads, some queueing acceptable, lower cost).
Cluster autoscaler
How it works:
- Monitors pods in Pending state (unable to schedule due to insufficient node capacity).
- Automatically adds VMs to node pool (managed instance group).
- Removes underutilized nodes after 10 minutes of low usage.
Interaction with HPA:
- HPA scales Maia runner pods based on metrics.
- If pods can’t schedule (no node capacity), cluster autoscaler adds nodes.
- Maia runner pods schedule on new nodes.
- When load decreases, HPA scales down pods, cluster autoscaler removes empty nodes.
Vertical scaling
Adjust CPU and memory limits per pod via the runnerSize Helm value:
- Useful when individual tasks require more resources than current pod limits.
- Requires pod restart to apply new resource limits.
- Consider workload characteristics (transformation vs ingestion).
Cost optimization
Cost optimization strategies
- Right-size nodes: Match machine type to workload (transformation-heavy = smaller, ingestion-heavy = larger).
- Use Cluster Autoscaler: Automatically remove unused nodes during low-usage periods.
- Consider Committed Use Discounts: For predictable baseline capacity (1-year or 3-year commitment).
- Monitor data transfer: Ensure data warehouses are in the same region as the cluster to avoid cross-region egress charges.
- Use Spot VMs for non-critical workloads: Configure node pool with Spot VMs for cost savings on interruptible workloads.
Troubleshooting
Maia runner exits immediately with no error (exit code 1)
The startup script treats unresolved placeholder values (for example, <CustomCertLocation>) as fatal. Set all optional environment variables to "" in your Helm values file if not used.
BeanCreationException: gcpSecretManager
If you see Error creating bean with name 'gcpSecretManager', ensure defaultGcpProject is set in your Helm values under dpcAgent.dpcAgent.env.defaultGcpProject.
Unknown Matillion region
Ensure matillionRegion uses the full region identifier: us1, eu1, or au1. Using a partial identifier (for example, eu without the digit) will fail.
Workload Identity binding—identity pool does not exist
This is a known race condition: the IAM binding is created before GKE finishes provisioning its Workload Identity pool. Run terraform apply a second time. It will succeed once the cluster is fully ready.
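In automated pipelines, the second apply can be scripted. The wrapper below is a sketch with a hypothetical name, apply_with_retry. It retries once after any failure, not only the identity pool error, and the 60-second default pause is an arbitrary assumption, so check the output before relying on it.

```shell
# Run terraform apply, retrying once after a pause if the first attempt fails.
# Extra arguments (for example -auto-approve) are passed through unchanged.
apply_with_retry() {
  terraform apply "$@" && return 0
  echo "terraform apply failed; retrying in ${RETRY_DELAY:-60}s" >&2
  sleep "${RETRY_DELAY:-60}"
  terraform apply "$@"
}

# Usage: apply_with_retry -auto-approve
```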
gke-gcloud-auth-plugin not found
After installing via gcloud components install gke-gcloud-auth-plugin, the plugin may not be on your PATH in the current shell session. Add the Google Cloud SDK binary directory to your PATH, for example: export PATH="/opt/homebrew/share/google-cloud-sdk/bin:$PATH".
Maia runner gateway connection unhealthy
If the Maia runner starts successfully but logs show connection is unhealthy. lastKeepAliveTime=[null], this is expected until the OAuth client credentials and Maia runner registration are correctly configured. The infrastructure and Maia runner startup itself are healthy.
Additional resources
Implementation and deployment
For complete Terraform modules, Helm charts, and step-by-step implementation, see the GCP directory in the Matillion Deployment Library on GitHub. You can find the Matillion Deployment Library at github.com/matillion-public/deployment-library.
General Kubernetes guide
You should read the general Kubernetes deployment guide for platform-agnostic concepts and architecture.
Matillion documentation
- For Maia runner deployment models, read the Maia runner overview.
- For Maia runner registration, read Create a Maia runner.
- For capacity planning, read Scaling best practices.
Google Cloud documentation
- For GKE concepts and operations, read Google Kubernetes Engine documentation.
- For Workload Identity setup, read Use Workload Identity.
- For automatic node scaling, read Cluster Autoscaler.
