MatrixGard Blog

AWS IAM Access Analyzer: The 6 Findings I See Most in Pre-Seed Accounts

noreply@matrixgard.com (Avinash S) — Thu, 04 Jun 2026 03:45:00 GMT

AWS IAM Access Analyzer is one of the few security services that costs nothing to switch on and starts earning its keep the same afternoon. Yet in most pre-seed and seed AWS accounts I open, one of two things is true: it was never enabled at all, or it was enabled once, produced a wall of findings nobody triaged, and has been quietly ignored ever since. Both outcomes leave the same gap. The account is sharing something with the outside world that the founders do not know about.

This is not a feature tour. The AWS documentation already covers every knob. This is a focused list of the six findings I see most often in early-stage accounts: what each one actually means, why it tends to show up at this stage, and how to either fix it or safely archive it so it stops competing for attention with the findings that matter.

Where I am offering a practitioner judgement rather than restating AWS documentation, I have labelled it inline.

Quick context: what Access Analyzer actually does in 2026

IAM Access Analyzer is not one feature. By 2026 it is three distinct engines under one console, and conflating them is the first mistake teams make.

External access analysis (free): continuously reviews resource-based policies and flags any resource that can be reached by a principal outside your defined zone of trust, meaning your account or your AWS Organization. This is the original feature and the one this post is mostly about.
Unused access analysis (paid): flags IAM roles, access keys, passwords, and individual permissions that have not been used within a tracking window. It is priced per analyzed IAM role and user per month; check the pricing page for current rates.
Policy validation and custom policy checks: over a hundred automated checks that validate a policy against IAM grammar and security best practice, plus automated-reasoning checks against your own standards.

The concept that ties the external findings together is the zone of trust. You set it to either your single account or your whole Organization, and anything reachable from outside that boundary becomes a finding. Findings flow to AWS Security Hub and Amazon EventBridge, so you can route them into Slack or a ticket queue instead of checking the console by hand. With that frame, here are the six findings I see most.
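As a rough sketch of that routing, the snippet below builds an EventBridge event pattern for active findings and formats one finding into a single chat line. The source and detail-type values are what I understand Access Analyzer to emit, and the sample event and the format_finding helper are hypothetical, so check the field names against a real event before wiring this up.

```python
import json

# Event pattern for an EventBridge rule. The source and detail-type values are
# assumptions based on the Access Analyzer event documentation; verify them.
FINDING_PATTERN = {
    "source": ["aws.access-analyzer"],
    "detail-type": ["Access Analyzer Finding"],
    "detail": {"status": ["ACTIVE"]},
}


def format_finding(event: dict) -> str:
    """Turn a finding event into a one-line message for Slack or a ticket."""
    detail = event.get("detail", {})
    scope = "PUBLIC" if detail.get("isPublic") else "external"
    return (
        f"[{scope}] {detail.get('resourceType', 'unknown type')} "
        f"{detail.get('resource', 'unknown resource')} "
        f"(finding {detail.get('id', 'n/a')})"
    )


# Hypothetical event, trimmed to the fields used above.
sample_event = {
    "source": "aws.access-analyzer",
    "detail-type": "Access Analyzer Finding",
    "detail": {
        "id": "example-finding-id",
        "status": "ACTIVE",
        "resourceType": "AWS::S3::Bucket",
        "resource": "arn:aws:s3:::example-assets",
        "isPublic": True,
    },
}

print(json.dumps(FINDING_PATTERN))
print(format_finding(sample_event))
```

Filtering on status in the pattern keeps archived findings out of the channel, which matters once archive rules are in place.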

1. An S3 bucket reachable from outside your account

What the finding says: a resource-based policy or ACL on an S3 bucket grants access to a principal outside your zone of trust, sometimes the entire internet ("Principal": "*"), sometimes a specific external account.

This is the single most common external-access finding in pre-seed accounts, and it is almost always one of three causes. Either a developer made a bucket public to serve static assets and never moved that content behind CloudFront with Origin Access Control. Or a bucket policy was copied from a tutorial that used a wildcard principal. Or a third-party tool (an analytics vendor, a backup product) was granted cross-account read access and the grant was never scoped down or removed when the tool was dropped.

The danger is not theoretical. Public buckets remain one of the most reliable sources of accidental data exposure, which is why AWS now enables S3 Block Public Access by default on new buckets. Access Analyzer catches the cases that slip past that default, particularly cross-account grants, which Block Public Access does not stop.

What fixing looks like: if the bucket genuinely needs to serve public content, put it behind a CloudFront distribution with Origin Access Control and keep the bucket itself private. If it was a cross-account grant for a tool you still use, scope the policy to the specific external account and the specific prefix, and add a condition. If the tool is gone, delete the statement.
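A minimal sketch of the check you can run on a bucket policy before Access Analyzer does: flag every Allow statement whose principal is a wildcard. The sample policy, bucket name, and account ID are hypothetical, and the check deliberately ignores conditions, so a wildcard statement that is constrained by a condition will still be flagged for a human to look at.

```python
def wildcard_statements(policy: dict) -> list:
    """Return Allow statements whose principal is the wildcard '*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be wrapped in a list
        statements = [statements]
    flagged = []
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal")
        # Principal is either the string "*" or a mapping such as {"AWS": ...}.
        aws = principal.get("AWS") if isinstance(principal, dict) else principal
        values = aws if isinstance(aws, list) else [aws]
        if "*" in values:
            flagged.append(stmt)
    return flagged


# Hypothetical policy: one public statement, one scoped cross-account grant.
sample_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublicRead",
            "Effect": "Allow",
            "Principal": "*",
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-assets/*",
        },
        {
            "Sid": "VendorReadExports",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
            "Action": "s3:GetObject",
            "Resource": "arn:aws:s3:::example-assets/exports/*",
            "Condition": {"Bool": {"aws:SecureTransport": "true"}},
        },
    ],
}

for stmt in wildcard_statements(sample_policy):
    print("wildcard principal in statement:", stmt.get("Sid"))
```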

Takeaway: a public-asset bucket is fine; a bucket that is public by accident is a breach waiting for a scanner to find it before you do.

2. An IAM role an external account can assume without an external ID

What the finding says: a role trust policy allows a principal in another AWS account to call sts:AssumeRole, and your zone of trust does not include that account.

Almost every startup creates one of these on purpose. The monitoring vendor, the CI provider, the cost-optimization tool, the security scanner: each asks you to create a role their account can assume. That is a legitimate pattern. The finding exists so you can confirm each external trust is one you meant to create, and, more importantly, that it is protected against the confused deputy problem.

The confused deputy risk is specific. If a third party tells thousands of customers to create a role trusting their account, and they do not isolate each customer, a malicious actor who is also their customer could register your role's ARN as their own and trick the vendor into assuming your role on their behalf. AWS's published mitigation is the external ID: a unique value the vendor passes in the assume-role call and that you require in your trust-policy condition. Reputable vendors hand you one. The finding is your prompt to check that the condition is actually present.

What fixing looks like: for every external-account trust, confirm the vendor is real and current, add an sts:ExternalId condition matching the value the vendor provides, and prefer scoping the trust to a specific role ARN in their account rather than the whole account root. Then archive the finding so it stops reappearing.
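The check described above can be sketched as a small function over a role trust policy: list every principal outside your own account whose statement has no sts:ExternalId condition. The account IDs and the sample policy are hypothetical, and the sketch only looks for a StringEquals condition, so treat it as a starting point.

```python
import re

# Matches a bare 12-digit account ID or an IAM/STS ARN and captures the account.
ACCOUNT_RE = re.compile(r"^(?:arn:aws:(?:iam|sts)::)?(\d{12})(?::.*)?$")


def external_trusts_missing_external_id(trust_policy: dict, own_account_id: str) -> list:
    """Return external AWS principals trusted without an sts:ExternalId condition."""
    findings = []
    for stmt in trust_policy.get("Statement", []):
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        aws = principal.get("AWS", []) if isinstance(principal, dict) else []
        principals = aws if isinstance(aws, list) else [aws]
        string_equals = stmt.get("Condition", {}).get("StringEquals", {})
        has_external_id = "sts:ExternalId" in string_equals
        for p in principals:
            match = ACCOUNT_RE.match(p)
            account = match.group(1) if match else None
            if account != own_account_id and not has_external_id:
                findings.append(p)
    return findings


# Hypothetical trust policy with one protected and one unprotected vendor trust.
sample_trust = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::111122223333:role/vendor-a-collector"},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": "example-external-id"}},
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::444455556666:root"},
            "Action": "sts:AssumeRole",
        },
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::123456789012:role/ci"},
            "Action": "sts:AssumeRole",
        },
    ],
}

print(external_trusts_missing_external_id(sample_trust, "123456789012"))
```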

Takeaway: external trust is normal; external trust with no external-ID condition is the gap an attacker looks for first.

3. A KMS key usable by an outside account

What the finding says: a KMS key policy grants a principal outside your zone of trust permission to use or manage the key.

KMS findings are less frequent than S3 ones, but they carry more weight, because a shared key often means shared data. The usual cause at pre-seed: a key was created for a cross-account data-sharing setup (a shared S3 bucket encrypted with a customer-managed key, a snapshot shared with a sister account), and the key policy was opened up more than the actual sharing required. Because KMS evaluates the key policy first, an over-broad key policy can silently undo the careful scoping you did everywhere else.

The subtlety here is that for a KMS key, the key policy is the root of trust. For most services, an identity policy in the same account can grant access on its own; for a KMS key, no principal can use the key unless the key policy allows it, either directly or by delegating to IAM. The reverse also holds: if the key policy grants an external account access, that account only needs a matching allow in its own IAM policies, which it controls, and does not need anything else from you.

What fixing looks like: open the key policy and confirm every external principal is intentional. Scope grants to the specific external role rather than the account root, restrict to the specific actions needed (often just kms:Decrypt or kms:GenerateDataKey, not kms:*), and add conditions such as kms:ViaService where the key is only meant to be used through one service. The key-policy docs cover the precedence rules.
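To make the scoping concrete, here is a hedged sketch of what a narrowed key-policy statement could look like, plus a crude check for the over-broad shape. The role ARN, the Sid, and the is_overbroad helper are hypothetical; apply the check only to statements that name external principals, since the default statement for your own account root legitimately uses kms:*.

```python
def scoped_key_statement(external_role_arn: str, via_service: str) -> dict:
    """Build a key-policy statement limited to one role, two actions, one service."""
    return {
        "Sid": "AllowExternalDecryptViaService",  # hypothetical Sid
        "Effect": "Allow",
        "Principal": {"AWS": external_role_arn},
        "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
        "Resource": "*",  # in a key policy this refers to the key itself
        "Condition": {"StringEquals": {"kms:ViaService": via_service}},
    }


def is_overbroad(stmt: dict) -> bool:
    """Flag Allow statements that use kms:* or trust a whole account root or '*'."""
    actions = stmt.get("Action", [])
    actions = [actions] if isinstance(actions, str) else actions
    principal = stmt.get("Principal", {})
    aws = principal.get("AWS", "") if isinstance(principal, dict) else principal
    principals = [aws] if isinstance(aws, str) else aws
    broad_action = any(a in ("kms:*", "*") for a in actions)
    root_principal = any(p == "*" or p.endswith(":root") for p in principals)
    return stmt.get("Effect") == "Allow" and (broad_action or root_principal)


scoped = scoped_key_statement(
    "arn:aws:iam::111122223333:role/partner-reader",  # hypothetical external role
    "s3.us-east-1.amazonaws.com",
)
too_broad = {
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::111122223333:root"},
    "Action": "kms:*",
    "Resource": "*",
}

print(is_overbroad(scoped), is_overbroad(too_broad))
```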

Takeaway: a shared key is a shared door, and the key policy is the only lock on it.

4. A publicly shared RDS or EBS snapshot

What the finding says: an RDS DB snapshot, an RDS cluster snapshot, or an EBS volume snapshot has been shared so that accounts outside your zone of trust, sometimes all AWS accounts, can restore it.

This is the finding founders react to most strongly when they see it, because the failure mode is stark: a database snapshot marked public means anyone with an AWS account can restore your production data into their own account and read all of it. There is no further authentication step. Public snapshots have been the root cause of several well-documented data exposures over the years.

It happens by accident more often than you would expect. An engineer shares a snapshot with a second account for a migration or a staging refresh, picks the wrong sharing option, and sets it to public rather than to the specific account. Or a snapshot was made public years ago for a one-off and never reverted. Access Analyzer surfaces both because it inspects the snapshot sharing attributes, not just bucket and role policies.

What fixing looks like: change the snapshot sharing from public to the specific account IDs that need it, or stop sharing entirely if the need has passed. For RDS this is the snapshot visibility setting (the restore attribute in the API); for EBS it is the createVolumePermission attribute. Then encrypt future snapshots with a customer-managed KMS key, because an encrypted snapshot cannot be made public at all, which removes the failure mode structurally.
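If you want to spot-check snapshots yourself, the sharing attributes are easy to read once you have the API responses. The sketch below works on dictionaries shaped like the EC2 DescribeSnapshotAttribute and RDS DescribeDBSnapshotAttributes responses as I understand them; the sample responses are hypothetical, and fetching the real ones is left to whichever SDK or CLI you use.

```python
def ebs_snapshot_is_public(attr_response: dict) -> bool:
    """True if createVolumePermission includes the 'all' group."""
    perms = attr_response.get("CreateVolumePermissions", [])
    return any(p.get("Group") == "all" for p in perms)


def rds_snapshot_is_public(attr_response: dict) -> bool:
    """True if the 'restore' attribute includes the value 'all'."""
    result = attr_response.get("DBSnapshotAttributesResult", {})
    for attr in result.get("DBSnapshotAttributes", []):
        if attr.get("AttributeName") == "restore" and "all" in attr.get("AttributeValues", []):
            return True
    return False


# Hypothetical responses: a public EBS snapshot and an RDS snapshot shared
# with one specific account.
ebs_public = {"SnapshotId": "snap-example", "CreateVolumePermissions": [{"Group": "all"}]}
rds_shared = {
    "DBSnapshotAttributesResult": {
        "DBSnapshotIdentifier": "example-snapshot",
        "DBSnapshotAttributes": [
            {"AttributeName": "restore", "AttributeValues": ["111122223333"]}
        ],
    }
}

print(ebs_snapshot_is_public(ebs_public), rds_snapshot_is_public(rds_shared))
```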

Takeaway: a public snapshot is the highest-severity finding on this list; treat it as an incident, not a backlog item.

5. Unused IAM roles and access keys

What the finding says (unused access analyzer): an IAM role has not been used within your tracking window, or an IAM user has an access key or password that has gone unused.

This is a different engine from the external-access findings above, and it is the paid one, but at pre-seed scale the cost is small and the signal is high. Early accounts accumulate dead credentials fast: the access key from the founder's first laptop setup, the role left behind by a deleted service, the contractor user nobody offboarded. Each one is standing attack surface that does nothing useful.

Unused credentials matter because they are the quietest way in. A leaked key for a role you forgot exists will not trip any behavioural alarm, because there is no baseline of normal use to deviate from. The recurring wave of attacks against long-lived access keys leaked in public code repositories is the same story told over and over.

What fixing looks like: for unused access keys, deactivate first (it is reversible), wait, then delete. For unused roles, confirm nothing references them and remove. The durable fix is structural: move humans to short-lived credentials via IAM Identity Center so there are no long-lived user keys to leave lying around, and prefer roles over IAM users for workloads. AWS documents the unused-access analyzer and its tracking-period settings.
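The deactivate, wait, delete sequence can be written down as a tiny decision function. This is a sketch under stated assumptions: the 90-day and 14-day thresholds are illustrative rather than AWS defaults, and inactive_since is a timestamp you would have to record yourself when you deactivate a key.

```python
from datetime import datetime, timedelta, timezone


def next_step(status, last_used, inactive_since, now, unused_days=90, wait_days=14):
    """Return 'keep', 'deactivate', 'wait', or 'delete' for one access key.

    status is 'Active' or 'Inactive'. last_used and inactive_since are
    timezone-aware datetimes or None.
    """
    if status == "Active":
        if last_used is not None and now - last_used < timedelta(days=unused_days):
            return "keep"
        # Never used, or idle past the threshold. A never-used key may simply be
        # new, so check its creation date before acting on this result.
        return "deactivate"
    # Inactive keys are deleted only after the observation window has passed.
    if inactive_since is not None and now - inactive_since >= timedelta(days=wait_days):
        return "delete"
    return "wait"


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
print(next_step("Active", now - timedelta(days=200), None, now))
print(next_step("Inactive", None, now - timedelta(days=30), now))
```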

Takeaway: every credential not in use is risk with no offsetting benefit; the cheapest security win available to you is deletion.

6. Over-broad permissions on the roles you do use

What the finding says (unused access analyzer, action level): an active role is granted services or individual actions it has not actually used during the tracking window.

This is the finding that maps directly to least privilege, and it is the one that pays off long after pre-seed. The common pattern: a role was created with AdministratorAccess or a broad managed policy to unblock a deploy, and it was never narrowed once the team learned what the workload actually needs. The action-level unused findings tell you precisely which granted actions were never called, which turns least-privilege from a guessing game into a list.

The reason this matters more than it looks: in a breach, the blast radius is whatever the compromised role can do, not whatever it actually did. A deploy role with unused iam:* and s3:* is a privilege-escalation path even if your pipeline only ever pushed to one bucket. Trimming unused actions directly shrinks blast radius.

What fixing looks like: use the unused-action findings together with Access Analyzer's policy generation, which reads CloudTrail history and drafts a least-privilege policy for the role. Treat the generated policy as a strong first draft, not a final answer, because it only knows what happened during the logged window. Review it, apply it, then re-run after a full business cycle to catch periodic jobs.
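To show the idea behind trimming, here is a local illustration that compares granted action patterns with the actions a role was observed calling. It is not Access Analyzer's policy generation, just the same set-difference reasoning on hypothetical inputs, and like the generated policy it only knows about the actions you feed it.

```python
from fnmatch import fnmatchcase


def unused_grants(granted: list, used: list) -> list:
    """Return granted patterns that matched none of the observed actions."""
    used_lower = [u.lower() for u in used]
    return [
        g for g in granted
        if not any(fnmatchcase(u, g.lower()) for u in used_lower)
    ]


def narrowed_actions(granted: list, used: list) -> list:
    """Return the observed actions that the current grants actually cover."""
    return sorted(
        {u for u in used if any(fnmatchcase(u.lower(), g.lower()) for g in granted)}
    )


# Hypothetical deploy role: broad grants, narrow real usage.
granted = ["s3:*", "iam:*", "logs:PutLogEvents"]
used = ["s3:PutObject", "logs:PutLogEvents"]

print("never used:", unused_grants(granted, used))
print("first-draft allow list:", narrowed_actions(granted, used))
```

In this example the iam:* grant is removed outright, and s3:* collapses to the one action the pipeline really calls.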

Takeaway: the permissions a role never uses are pure downside, and Access Analyzer hands you the exact list to cut.

Triage: archive rules are what keep the tool usable

The reason most teams abandon Access Analyzer is not that it is wrong. It is that a fresh account generates a batch of findings on day one and there is no obvious way to separate "expected and fine" from "investigate now". The mechanism AWS provides for this is the archive rule.

An archive rule auto-archives findings that match criteria you trust. The disciplined workflow is: triage every finding once, fix the genuine ones, and for each finding that is intentional and safe (the monitoring role you confirmed has an external ID, the public-assets bucket fronted by CloudFront), write an archive rule so that finding and future identical ones move out of the active view. What remains in "active" is then, by construction, the set that needs human eyes. Without this, the active list grows until it is noise and the tool gets muted.
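A sketch of what such a rule looks like as data, with a local matcher to reason about which findings it would swallow. The filter shape (a key mapped to an eq or contains list) follows the archive-rule criteria as I understand them; the matcher is a simplified emulation, the flattened finding dictionaries are hypothetical, and the real service evaluates rules for you.

```python
# Archive rule for a vendor role you have already verified (hypothetical values).
VENDOR_ROLE_RULE = {
    "resourceType": {"eq": ["AWS::IAM::Role"]},
    "principal.AWS": {"eq": ["111122223333"]},
    "resource": {"contains": ["role/monitoring-vendor"]},
}


def matches(rule_filter: dict, finding: dict) -> bool:
    """Simplified emulation: every criterion in the filter must hold."""
    for key, criterion in rule_filter.items():
        value = str(finding.get(key, ""))
        if "eq" in criterion and value not in criterion["eq"]:
            return False
        if "contains" in criterion and not any(c in value for c in criterion["contains"]):
            return False
    return True


expected = {
    "resourceType": "AWS::IAM::Role",
    "principal.AWS": "111122223333",
    "resource": "arn:aws:iam::123456789012:role/monitoring-vendor",
}
surprise = {
    "resourceType": "AWS::IAM::Role",
    "principal.AWS": "999988887777",
    "resource": "arn:aws:iam::123456789012:role/admin",
}

print(matches(VENDOR_ROLE_RULE, expected), matches(VENDOR_ROLE_RULE, surprise))
```

Keeping rules this narrow means a new, unexpected trust on a different role or from a different account still lands in the active list.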

Practitioner opinion: do the archive-rule pass in the same session as your first triage. An Access Analyzer console with forty unreviewed active findings teaches the team to ignore it within a week, and an ignored security tool is worse than none, because it creates false confidence.

What Access Analyzer will not catch

Knowing the boundary matters as much as knowing the findings. Access Analyzer reasons about policies, not network paths or application logic, so several real exposures sit outside its remit.

Network exposure. A security group open to 0.0.0.0/0 on a database port is a serious problem, but it is not an Access Analyzer finding. That is the job of AWS Config rules, VPC reachability analysis, or a dedicated audit pass.
Application-layer access. An API with broken authorization, or a public endpoint that should be private: Access Analyzer never sees these, because they live above the IAM and resource-policy layer.
Unsupported resource types. External access analysis covers a defined list of resource types (S3, IAM roles, KMS, Lambda, SQS, Secrets Manager, SNS, snapshots, ECR, EFS, DynamoDB, and more). Anything off that list is not analyzed, so check the supported-resources list rather than assuming full coverage.

Access Analyzer is the policy-exposure layer of a defence-in-depth setup, not the whole thing. Pair it with network and configuration scanning to cover the gaps it leaves.

The honest summary table

Finding	Engine	Severity at pre-seed	Typical fix
Public or cross-account S3 bucket	External access (free)	High	CloudFront with OAC, or scope the bucket policy
External-account role with no external ID	External access (free)	High	Add sts:ExternalId condition, scope the trust
KMS key shared with an outside account	External access (free)	High	Scope key policy to specific role and actions
Public RDS or EBS snapshot	External access (free)	Critical	Unshare or restrict; encrypt future snapshots
Unused roles and access keys	Unused access (paid)	Medium	Deactivate then delete; move to Identity Center
Over-broad active permissions	Unused access (paid)	Medium	Policy generation; trim unused actions

Stage-specific recommendation

If you are pre-seed (under 10 engineers, one AWS account): turn on the free external-access analyzer today and set the zone of trust to your account. Do one triage pass, fix the public buckets, public snapshots, and any external trust without an external ID, then write archive rules for everything intentional. That is a half-day of work that closes your most likely accidental-exposure paths. Defer the paid unused-access analyzer until you have more than a handful of roles.

If you are seed (10 to 30 engineers, moving to AWS Organizations and multiple accounts): create the analyzer at the Organization level so the zone of trust is the whole org and routine cross-account access between your own accounts stops generating noise. Turn on unused-access analysis now, because this is the stage where dead roles and over-broad permissions accumulate fastest. Route findings to Security Hub and into a Slack channel via EventBridge so triage is continuous, not quarterly.

If you are Series A (multiple accounts, first dedicated security or platform hire): wire Access Analyzer findings into your ticketing system with an SLA by severity, add custom policy checks to your CI so a pull request that would create external access is caught before merge, and make the policy-generation workflow part of how every new role is created. At this stage Access Analyzer should be a guardrail in the pipeline, not a console someone occasionally visits.

If you want a second pair of eyes on your IAM exposure

MatrixGard runs a free 20-minute IAM and external-access review for pre-seed and seed founders: what your account is sharing outside its boundary, which findings are real, and the three fixes worth doing first. No NDA required for the first conversation. Send a note.

Avinash S is the founder of MatrixGard. Fractional DevSecOps for pre-seed and seed startups across India, the GCC, the UK, and the US. Almost a decade of building, breaking, and securing cloud infrastructure for fintech, healthtech, and SaaS workloads.

Methodology note. Feature behaviour, resource coverage, and finding types are drawn from the AWS IAM Access Analyzer documentation current as of May 2026. The "six most common" framing is a practitioner opinion based on the pattern frequency I see in early-stage AWS accounts, not a published AWS statistic. Pricing for the unused-access analyzer changes over time and varies by region; consult the official pricing page for current rates rather than relying on any figure quoted elsewhere. Severity labels are practitioner judgement for a typical pre-seed context and will differ with your data sensitivity and architecture.

GCP Workload Identity Federation: How Startups Kill Static Keys

noreply@matrixgard.com (Avinash S) — Thu, 04 Jun 2026 03:45:00 GMT

Most guides on Google Cloud service accounts still tell you to download the JSON key, drop it in a secret manager, and rotate it every 90 days. That advice is a decade old and it is now actively wrong. In 2026 the correct number of static service account keys in a startup GCP project is zero.

This post is for founders and engineers running pre-seed and seed startups on Google Cloud who still have at least one credentials.json sitting in a CI variable, a developer laptop, or a Kubernetes Secret. It covers what a service account key actually is, why it is the single credential most likely to leak your entire project, and how Workload Identity Federation removes the need for it across GKE, CI/CD pipelines, and multicloud workloads.

What generic articles get wrong: they treat key rotation as the goal. Rotation is damage control for a credential that should not exist. The real goal is to never hold a long-lived key at all, so there is nothing to leak, rotate, or revoke under pressure at 2 AM. Google has been steering customers this way since 2023, and for new organizations the platform now blocks key creation by default.

The state of GCP service account keys in 2026

Google Cloud now disables service account key creation by default for new customers. If your organization was created on or after May 3, 2024, the organization policy constraint iam.disableServiceAccountKeyCreation is enforced from day one, and any attempt to create a key fails with FAILED_PRECONDITION: Key creation is not allowed on this service account. The behaviour is documented in the organization policy reference for restricting service accounts.

This is not a soft suggestion buried in a best-practices PDF. It is the platform default. Google's own best-practices documentation states plainly that Workload Identity Federation is the preferred way to configure identities for external workloads, because it relies on short-lived credentials instead of long-lived secrets. If you are still building around downloaded keys, you are swimming against the direction the platform is moving.

1. What a service account key actually is, and why it is dangerous

A Google Cloud service account key is an RSA private key wrapped in a JSON file. It does not expire. There is no second factor on it, no IP restriction by default, no session length. Anyone who holds the file can authenticate as that service account from any machine on earth and act with its full set of permissions until a human notices and deletes the key.

Compare that to a user password, which at least sits behind multi-factor authentication and conditional access. A service account key has none of that. It is a bearer credential: possession equals identity.

The leak paths are mundane and constant. Keys get committed to git history, printed into CI logs, baked into container images, copied into Slack, left on a stolen laptop, or pasted into a third-party tool during a debugging session. GitHub secret scanning catches some, but only after the key is already public. The blast radius is whatever the service account can do, which at a pre-seed startup is almost always more than it should be, because nobody scoped it down when they were shipping the MVP.

Takeaway: treat any service account JSON key on disk as already compromised. The question is not whether it leaks, but when, and how much it can touch when it does.

2. How Workload Identity Federation actually works

Workload Identity Federation removes the key by removing the need to prove identity with a secret you store. Instead, it trusts an identity the workload already has from an external issuer.

The model has three parts. First, you create a workload identity pool that represents a set of external identities. Second, you add a provider to that pool that trusts a specific issuer: GitHub's OIDC endpoint, an AWS account, an Azure tenant, or any provider that speaks OpenID Connect or SAML 2.0. Third, at runtime the workload presents its native token to Google's Security Token Service, which validates the token against the provider's attribute mapping and conditions, then hands back a short-lived federated access token. The Workload Identity Federation documentation lists the supported sources: AWS, Azure, on-premises Active Directory, GitHub, GitLab, workloads using X.509 client certificates, and any OIDC or SAML 2.0 identity provider.

The federated credentials are short-lived. By default the access token expires one hour after it is created, which sharply limits how long a stolen token is useful. Because the trust lives in configuration rather than in a file, there is no secret to rotate or store after the initial setup.
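For intuition about the runtime step, the sketch below builds the request body a workload would send to the Security Token Service to trade an external OIDC token for a federated one. The field names and URNs follow the STS token-exchange API as I understand it; the project number, pool, and provider IDs are hypothetical, and in practice the Google auth libraries perform this call for you.

```python
import json

STS_URL = "https://sts.googleapis.com/v1/token"


def build_token_exchange(project_number: str, pool_id: str, provider_id: str,
                         external_jwt: str) -> dict:
    """Build the JSON body for an OIDC-to-federated-token exchange."""
    audience = (
        f"//iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{pool_id}/providers/{provider_id}"
    )
    return {
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        "audience": audience,
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
        "subjectToken": external_jwt,
    }


# Hypothetical identifiers; the token is a placeholder, not a real JWT.
body = build_token_exchange("123456789", "ci-pool", "github", "<external-oidc-token>")
print("POST", STS_URL)
print(json.dumps(body, indent=2))
```

Note that nothing in the request is a stored secret: the only credential is the external token the workload already holds.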

Takeaway: the security win is structural. You are not protecting a key better. You are deleting the key and proving identity with a token that expires before most attackers can act on it.

3. Workload Identity Federation for GKE: the most common startup case

If you run Google Kubernetes Engine, this is where you start, because GKE is where most startups accidentally store keys as Kubernetes Secrets. Workload Identity Federation for GKE lets each pod authenticate as its Kubernetes service account, with no JSON key ever entering the cluster.

You enable the feature on the cluster and on each node pool. The GKE metadata server, which runs as a DaemonSet on every node per the GKE Workload Identity concepts page, intercepts the pod's credential request and performs the token exchange transparently.

There are two modes. In the older impersonation mode, you annotate the Kubernetes service account with iam.gke.io/gcp-service-account pointing at a Google service account, and you grant the Kubernetes identity the roles/iam.workloadIdentityUser role on that Google service account so the pod is allowed to impersonate it. In the newer direct-access mode, you address the Kubernetes service account directly as an IAM principal, which removes the intermediate Google service account and its extra bindings entirely. The how-to guide walks both paths.
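The two modes differ mainly in the IAM member string you bind. The helpers below build both forms, following the identifier formats in the GKE documentation as I understand them; the project, namespace, and service account names are hypothetical.

```python
def impersonation_member(project_id: str, namespace: str, ksa: str) -> str:
    """Member that receives roles/iam.workloadIdentityUser on the Google
    service account in the older impersonation mode."""
    return f"serviceAccount:{project_id}.svc.id.goog[{namespace}/{ksa}]"


def direct_principal(project_number: str, project_id: str, namespace: str, ksa: str) -> str:
    """Principal identifier for the Kubernetes service account in direct-access
    mode; roles are granted to this principal with no Google service account."""
    return (
        f"principal://iam.googleapis.com/projects/{project_number}/locations/global/"
        f"workloadIdentityPools/{project_id}.svc.id.goog/subject/ns/{namespace}/sa/{ksa}"
    )


# Hypothetical project and workload names.
print(impersonation_member("example-project", "payments", "api"))
print(direct_principal("123456789", "example-project", "payments", "api"))
```

A typo in the namespace or service account portion of either string is a common cause of the pod authentication failures covered later.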

Takeaway: enable Workload Identity Federation on the cluster and node pools, map your Kubernetes service accounts to permissions, then delete every Kubernetes Secret that holds a service account key. A pod that needs BigQuery should get there through its identity, not through a mounted file.

4. Keyless CI/CD with GitHub Actions and GitLab

CI/CD is the most common place a startup leaks a key, because a deploy pipeline needs broad permissions and the path of least resistance is to paste a JSON key into a repository secret. Workload Identity Federation kills that pattern.

GitHub Actions can mint an OIDC token from the issuer https://token.actions.githubusercontent.com that uniquely identifies the repository, workflow, branch, and environment. You configure a workload identity pool provider to trust that issuer, set an attribute condition that pins access to your specific repository, and use the google-github-actions/auth action in the workflow. GitHub's own OIDC configuration guide and Google's keyless authentication announcement both cover the setup end to end.

The token lifetimes are tight: the GitHub OIDC token lives roughly five minutes, and the derived Google credential expires within the hour. The one mistake to avoid is leaving the attribute condition too loose. If you trust the issuer without pinning the repository, any GitHub repository in the world can request your identity. Pin it to your org and repo, and ideally restrict by branch or environment for production deploys.
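A minimal sketch of what pinning looks like: a helper that assembles the attribute condition as a CEL expression over the GitHub token's claims. The claim names repository_owner, repository, and ref are standard GitHub OIDC claims; the organization and repository names are hypothetical, and the helper itself is only for illustration.

```python
def github_condition(owner: str, repo: str, ref: str = None) -> str:
    """Build a CEL attribute condition pinned to one repository and,
    optionally, one git ref."""
    parts = [
        f"assertion.repository_owner == '{owner}'",
        f"assertion.repository == '{owner}/{repo}'",
    ]
    if ref:
        parts.append(f"assertion.ref == '{ref}'")
    return " && ".join(parts)


# Looser condition for non-production use, tighter one for production deploys.
print(github_condition("example-org", "api"))
print(github_condition("example-org", "api", ref="refs/heads/main"))
```

Pinning to a single ref is the right default for production, but it is also why a workflow can succeed on main and fail on pull requests, so keep a separate, less privileged provider or condition for those.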

Takeaway: delete the service account key secret from your CI configuration today. It is usually the single highest-value secret a startup stores, because it can deploy.

5. Authenticating AWS and Azure workloads to GCP

Plenty of startups are not single-cloud by choice. A Lambda function writes to BigQuery, an Azure function calls a Vertex AI endpoint, an on-premises job pushes data to Cloud Storage. The old answer was to courier a GCP service account key into the other cloud's secret store. Workload Identity Federation removes the courier.

For AWS, the workload uses its existing IAM role. The federation flow validates a signed AWS GetCallerIdentity request as proof of the role, and you restrict the pool to a specific AWS account and role ARN. For Azure, the workload presents the token from its managed identity, and you restrict by tenant and object ID. The deployment-pipelines guide documents these attribute conditions.

No GCP key crosses the cloud boundary in either direction. The AWS or Azure workload keeps using the credential its own platform already manages, and GCP trusts that credential through configuration.

Takeaway: cross-cloud access does not require a key to travel between providers. Map the foreign identity into a pool and scope it tightly to the exact role or managed identity that needs access.

6. Lock the door with organization policy

Migrating your workloads is necessary but not sufficient. An engineer under deadline pressure can create a fresh key in thirty seconds and undo the whole effort. You close that door with organization policy.

Enforce iam.disableServiceAccountKeyCreation at the organization or folder level. Organizations created on or after May 3, 2024 have it enforced already; older organizations must set it explicitly. Pair it with iam.disableServiceAccountKeyUpload so nobody re-introduces an externally generated key. Google also offers a newer managed constraint, iam.managed.disableServiceAccountKeyCreation, which supports conditions and dry-run mode for a staged rollout. Both are covered in the disable and enable service account keys documentation.

Set the policy at the highest scope you can, then grant narrow exceptions on the rare project that genuinely needs a key for a legacy integration. Exceptions should be the documented anomaly, not the default.

Takeaway: the migration is not finished until policy makes regression impossible. A keyless project that allows new keys is one rushed pull request away from being a key project again.

7. Finding and killing the keys you already have

You cannot delete what you cannot see, so the migration starts with an inventory, not a deletion. List the keys on every service account with gcloud iam service-accounts keys list, and filter for USER_MANAGED keys. Ignore the SYSTEM_MANAGED keys: those are the ones Google creates and rotates for you, and they are fine.

Before you delete anything, check whether each key is still in use. Activity Analyzer (part of Policy Intelligence) and the service account authentication logs expose the last authentication time for a key. A key that has not authenticated in 90 days is almost certainly safe to remove. A key that authenticated an hour ago is load-bearing, and you need to find the workload first.

Then disable before you delete. Disabling a key is reversible; deletion is not. Disable the key, watch for breakage for a week, and only then delete it. Work in order of blast radius: kill CI keys first, then GKE Secrets, then human-developer keys, which you replace with gcloud auth login and Application Default Credentials so engineers stop carrying personal copies.
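The inventory step can be sketched as a small triage over the key list. The keyType field matches what gcloud returns with JSON output; the last_auth mapping is a hypothetical structure you would fill from your own last-authentication data, and the 90-day threshold is the rule of thumb above rather than a platform setting.

```python
from datetime import datetime, timedelta, timezone


def triage_keys(keys: list, last_auth: dict, now: datetime, idle_days: int = 90) -> dict:
    """Map each user-managed key name to 'disable' or 'find workload first'."""
    plan = {}
    for key in keys:
        if key.get("keyType") != "USER_MANAGED":
            continue  # system-managed keys are created and rotated by Google
        seen = last_auth.get(key["name"])
        # No recorded authentication is treated as idle here; confirm the
        # lookback window of your data before trusting that.
        idle = seen is None or now - seen >= timedelta(days=idle_days)
        plan[key["name"]] = "disable" if idle else "find workload first"
    return plan


now = datetime(2026, 6, 1, tzinfo=timezone.utc)
# Hypothetical inventory, trimmed to the fields used above.
keys = [
    {"name": "projects/p/serviceAccounts/ci@p.iam.gserviceaccount.com/keys/aaa",
     "keyType": "USER_MANAGED"},
    {"name": "projects/p/serviceAccounts/app@p.iam.gserviceaccount.com/keys/bbb",
     "keyType": "USER_MANAGED"},
    {"name": "projects/p/serviceAccounts/app@p.iam.gserviceaccount.com/keys/ccc",
     "keyType": "SYSTEM_MANAGED"},
]
last_auth = {keys[0]["name"]: now - timedelta(hours=1)}

for name, action in triage_keys(keys, last_auth, now).items():
    print(action, "->", name)
```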

Takeaway: disable then delete, never delete blind. The goal is a clean cutover, not a self-inflicted outage that makes the security team look reckless.

8. Common failure modes and how to debug them

Almost every Workload Identity Federation failure traces back to an attribute-condition mismatch or a missing IAM binding, not to a platform bug. The error messages point at the cause if you read them in that frame.

A token exchange that returns permission denied usually means the incoming token's claims do not satisfy the provider's attribute condition: the repository, branch, role ARN, or audience does not match what you mapped. A GKE pod that cannot authenticate usually means Workload Identity Federation is not enabled on the node pool, the Kubernetes service account annotation has a typo, or the roles/iam.workloadIdentityUser binding is missing. A GitHub Actions workflow that works on the main branch but fails on pull requests usually means the attribute condition is pinned to a single branch. And the error Key creation is not allowed on this service account is not a bug at all: it is the organization policy from section 6 doing its job. Do not disable the policy to make the error go away. Fix the workload to use federation instead.

Takeaway: when federation fails, read the rejected token's claims and compare them to your attribute conditions line by line. The mismatch is nearly always there.
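A short sketch of that comparison: decode the claims from a token without verifying it, then list the claims that differ from what your condition expects. The sample token is built locally from hypothetical claims; decoding without signature verification is for debugging only and must never be used to make an access decision.

```python
import base64
import json


def decode_claims(jwt: str) -> dict:
    """Decode the payload segment of a JWT without verifying the signature."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload))


def claim_mismatches(claims: dict, expected: dict) -> dict:
    """Return each expected claim whose actual value differs."""
    return {
        name: {"got": claims.get(name), "expected": value}
        for name, value in expected.items()
        if claims.get(name) != value
    }


def _b64url(data: dict) -> str:
    raw = json.dumps(data).encode()
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()


# Hypothetical token from a pull-request run, checked against a condition
# that was pinned to the main branch.
sample_claims = {"repository": "example-org/api", "ref": "refs/pull/42/merge"}
sample_token = ".".join([_b64url({"alg": "none"}), _b64url(sample_claims), "signature"])
expected = {"repository": "example-org/api", "ref": "refs/heads/main"}

print(claim_mismatches(decode_claims(sample_token), expected))
```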

Summary table

Workload	Old (key) pattern	Keyless pattern	Key control to set
GKE pods	JSON key in a Kubernetes Secret	WIF for GKE, KSA mapped to IAM	Delete the Secret after cutover
GitHub Actions	Key in a repo secret	OIDC token to a scoped pool	Pin attribute condition to repo
GitLab CI	Key in a CI variable	OIDC token to a scoped pool	Pin to project and ref
AWS workload	GCP key in AWS Secrets Manager	IAM role via GetCallerIdentity	Restrict to account and role ARN
Azure workload	GCP key in Key Vault	Managed identity token	Restrict to tenant and object ID
Developer laptop	Personal JSON key	gcloud auth login plus ADC	Delete the user-managed key
Whole org	Keys allowed by default	Federation everywhere	Enforce key-creation org policy

What to do at each stage

Pre-seed (under 10 engineers, one GKE cluster, a single CI pipeline): this is one focused day of work. Enable Workload Identity Federation on the cluster, move GitHub Actions to OIDC, delete the keys you find, and turn on the iam.disableServiceAccountKeyCreation org policy. You are small enough that there is no legacy integration to babysit. Do it before you have ten more services.

Seed (10 to 30 engineers, multiple environments, maybe a second cloud): add the multicloud federation for any AWS or Azure workloads, run an Activity Analyzer pass to find keys that survived the first sweep, and pin per-environment attribute conditions so a staging pipeline cannot deploy to production. This is the stage where a forgotten key in a side project becomes the breach.

Series A and beyond: move to the managed constraint with dry-run mode so you can stage policy changes across many projects without breaking a team, enforce conditions per environment, and audit federation configuration the same way you audit IAM roles. At this size the risk is not a single leaked key, it is configuration drift across dozens of projects.

The honest bottom line

Workload Identity Federation is not a nice-to-have for 2026. It is the default the platform now ships, and the keyed alternative is a credential that cannot be made safe, only watched. For a pre-seed startup the entire migration is roughly a day of work, and it removes the single most dangerous credential class you own. That is one of the best security returns on a day you will find anywhere in your cloud setup.

If you want a second set of eyes on your specific GCP setup, I run a free 20-minute cloud audit for founders. Your workloads, your CI, your IAM, and an honest read on where the keys are hiding and what it takes to remove them. Send a note.

Methodology note. Technical claims reference public Google Cloud IAM, GKE, and Organization Policy documentation, the GitHub Actions OIDC documentation, and the google-github-actions/auth project, all current as of June 2026. Default-enforcement dates and token lifetimes are taken from Google's published documentation; verify them against your own organization's policy state before you act, since defaults differ by organization creation date. Operational sequencing advice (disable before delete, order of migration) is practitioner opinion grounded in production experience.

Kubernetes Audit Log Analysis: 7 Patterns That Signal a Compromise

noreply@matrixgard.com (Avinash S) — Tue, 26 May 2026 03:57:00 GMT

Kubernetes audit logs are the single richest forensic surface most engineering teams already produce and almost no team reads. Every request that hits the API server, from a kubectl on an engineer's laptop to a controller reconciling a Deployment, leaves an immutable JSON line behind. If your cluster gets compromised, the answer to "what did the attacker do, when, from where, with which identity" lives in those lines. The problem is that the volume is overwhelming (a busy cluster emits millions of events per day) and the default tooling does almost nothing to surface what matters.

This post is a practitioner's read on the seven compromise patterns I look for first when an audit log lands on my desk, plus what each one looks like in a real JSON event, and the simplest reliable detection you can wire up for it today. It is written for pre-seed and seed startup CTOs and platform engineers running production workloads on EKS, GKE, AKS, or self-managed Kubernetes. The patterns themselves are cluster-agnostic; the wiring differs by provider.

What generic K8s security posts get wrong. Most blog posts on this topic stop at "enable audit logging and use Falco or Tetragon." That is not wrong, but it is also not enough. The runtime tools watch syscalls; the audit log watches the API. An attacker who knows what they are doing can do enormous damage purely through the API (exfiltrate secrets, pivot via service accounts, plant persistence in CRDs) without ever touching a pod's syscalls. The audit log is the only place these patterns show up.

Quick context: how Kubernetes audit logging actually works in 2026

The kube-apiserver supports four audit levels per rule: None, Metadata, Request, RequestResponse. The default in managed clusters varies. EKS Control Plane Logs default to logging API requests at Metadata level when you enable the audit log type. GKE Cloud Audit Logs include admin activity by default at Metadata level; data access logs are off by default. AKS audit-control-plane logs ship to Log Analytics when you enable the diagnostic setting. Reference: Kubernetes Auditing docs.

What you should actually enable. For a pre-seed or seed startup, Metadata level on all verbs (enough to detect patterns 1 to 4 below) plus Request level on Secrets, ServiceAccounts, and RBAC objects (because you need the request body to spot privilege escalation). Skip RequestResponse globally; the storage cost climbs fast and Metadata covers most patterns. Reference: audit-policy example from upstream.
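The recommendation above can be sketched as an audit policy. Audit rules are evaluated in order and the first match wins, so the Request-level rules have to sit above the catch-all Metadata rule. The following is a minimal sketch that builds the policy as a Python dict and prints it as JSON for inspection; the omitStages entry and the explicit serviceaccounts/token subresource are assumptions added here, not something the recommendation spells out.

```python
import json

def build_audit_policy():
    """Return an audit.k8s.io/v1 Policy as a plain dict (illustrative sketch)."""
    return {
        "apiVersion": "audit.k8s.io/v1",
        "kind": "Policy",
        # Assumption: skip the RequestReceived stage to roughly halve event volume.
        "omitStages": ["RequestReceived"],
        "rules": [
            # Request level: keeps the request body for the objects that matter.
            # Note that this records the body of Secret create/update calls,
            # so the log store itself becomes sensitive.
            {"level": "Request",
             "resources": [{"group": "",
                            "resources": ["secrets", "serviceaccounts",
                                          "serviceaccounts/token"]}]},
            {"level": "Request",
             "resources": [{"group": "rbac.authorization.k8s.io"}]},
            # Catch-all: everything else at Metadata level. Must come last.
            {"level": "Metadata"},
        ],
    }

if __name__ == "__main__":
    print(json.dumps(build_audit_policy(), indent=2))
```

Translate the output to YAML for the policy file passed to the API server. On managed providers the policy is usually fixed by the provider, so treat this as a reference for self-managed clusters.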

Storage and query. Ship the events into the provider's native log service (CloudWatch Logs Insights for EKS, Cloud Logging for GKE, Log Analytics for AKS), or into an object store with Athena or BigQuery for ad-hoc query. Real-time alerting needs a stream-processing layer; CloudWatch metric filters and Cloud Logging log-based alerts are the cheap entry point. Falco's k8s-audit plugin ingests the audit stream directly. Reference: Falco k8s-audit plugin docs.

1. Anonymous or system:unauthenticated requests that succeed

The Kubernetes API server, if anonymous auth is enabled (the default in many self-managed clusters and historically in older kops and kubeadm setups), treats unauthenticated requests as the user system:anonymous belonging to the group system:unauthenticated. Most requests from this principal should be rejected with a 401 or 403. A successful request, in particular any 2xx response to a read or write from this principal, is a five-alarm signal.

In the audit event, the user.username field reads system:anonymous and responseStatus.code is the verdict. Filter for any event where user.username equals system:anonymous AND responseStatus.code is between 200 and 299. Excluding the health and version endpoints (/healthz, /livez, /readyz, /version), which default RBAC exposes to unauthenticated callers, the count on a healthy cluster should be zero. Anything above zero needs investigation today.
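For a self-managed cluster where the audit log is a file of JSON lines, the same filter can be sketched in a few lines of Python. This is a minimal sketch, not a production detector: the function name is hypothetical, and the ignore list of health and version paths is an assumption based on the endpoints that default RBAC exposes to unauthenticated callers.

```python
import json

# Assumption: endpoints that default RBAC exposes to unauthenticated callers.
HEALTH_PATHS = ("/healthz", "/livez", "/readyz", "/version")

def anonymous_successes(lines, ignore_paths=HEALTH_PATHS):
    """Return audit events where system:anonymous received a 2xx response."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        username = (event.get("user") or {}).get("username")
        code = (event.get("responseStatus") or {}).get("code")
        uri = event.get("requestURI", "")
        if username != "system:anonymous":
            continue
        if not isinstance(code, int) or not 200 <= code < 300:
            continue
        # Skip health and version probes, including sub-paths and query strings.
        if any(uri == p or uri.startswith(p + "/") or uri.startswith(p + "?")
               for p in ignore_paths):
            continue
        hits.append(event)
    return hits
```

Pass it an open file handle on the audit log; any non-empty result deserves a look the same day.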

The historical compromise pattern. CVE-2018-1002105 (the "Kubernetes API privilege escalation" bug) and a long line of API server misconfigurations have led to clusters where the kubelet's API or the API server itself accepts unauthenticated requests for specific resources. In 2025 the Tigera and Aqua research teams documented multiple Indian and Southeast Asian self-managed clusters with anonymous read access to pods and secrets; some had write access to events, enabling cryptominer-injection attacks.

Detection. CloudWatch Insights query for EKS:

fields @timestamp, user.username, verb, objectRef.resource, responseStatus.code
| filter user.username = "system:anonymous" and responseStatus.code >= 200 and responseStatus.code < 300
| sort @timestamp desc
| limit 100

Practical takeaway: disable anonymous auth at the API server with --anonymous-auth=false unless you have a documented reason to keep it on. On managed providers you usually cannot change the flag, and anonymous access is typically limited to health and version endpoints by default; verify with a periodic unauthenticated curl test against your API server. Reference: Anonymous requests in kube-apiserver.

2. Pod exec sessions from outside CI or break-glass

The kubectl exec and kubectl attach commands cause the API server to log a request against pods/exec or pods/attach. This is normal during incident response. It is not normal as a steady-state operation. If your audit log shows pod exec requests from a user that is not your break-glass admin or your debugging proxy, you have either an engineer doing debugging from their laptop directly (which is its own RBAC problem) or an attacker who has obtained credentials.

The audit signature. The verb field is create, the objectRef.resource is pods, and the objectRef.subresource is exec or attach. The user.username tells you who; the sourceIPs array tells you from where. A burst of exec requests from a single user against multiple distinct pods in a short window is the textbook lateral-movement pattern after an initial credential leak.
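The burst pattern can be sketched as a sliding window over exec and attach events per user. This is a minimal sketch over already-parsed audit events; the function names, the 5-minute window, and the 3-pod threshold are illustrative assumptions to tune against your own baseline.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def parse_ts(value):
    # Audit timestamps are RFC 3339, e.g. 2026-05-26T03:57:00.000000Z
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def exec_bursts(events, window=timedelta(minutes=5), min_pods=3):
    """Return {username: max distinct pods exec'd into within any window}."""
    per_user = defaultdict(list)
    for event in events:
        ref = event.get("objectRef") or {}
        if (event.get("verb") == "create"
                and ref.get("resource") == "pods"
                and ref.get("subresource") in ("exec", "attach")):
            user = (event.get("user") or {}).get("username", "")
            pod = (ref.get("namespace"), ref.get("name"))
            per_user[user].append((parse_ts(event["requestReceivedTimestamp"]), pod))

    flagged = {}
    for user, hits in per_user.items():
        hits.sort(key=lambda hit: hit[0])
        start = 0
        for end in range(len(hits)):
            # Shrink the window from the left until it spans at most `window`.
            while hits[end][0] - hits[start][0] > window:
                start += 1
            pods = {pod for _, pod in hits[start:end + 1]}
            if len(pods) >= min_pods:
                flagged[user] = max(flagged.get(user, 0), len(pods))
    return flagged
```

Pair each flagged user with the sourceIPs on the matching events to see where the session came from.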

Why this matters at startup scale. Pre-seed and seed teams often share one cluster-admin kubeconfig over the team Slack. Every engineer can exec into every pod. When that kubeconfig leaks (and it leaks), the attacker has root on every workload. The audit log will show exec activity from an unfamiliar IP; that is your only chance to catch them in time.

Detection. Cloud Logging query for GKE:

resource.type="k8s_cluster"
protoPayload.methodName=~"^io.k8s.core.v1.pods.(exec|attach).create$"
protoPayload.authenticationInfo.principalEmail!="break-glass@example.com"

Practical takeaway: bind exec permission to a single break-glass role assumable only with MFA; bind it nowhere else. Alert on every exec call. The signal-to-noise is high. Reference: pods/exec subresource RBAC.

3. Secret access from a service account that does not normally read secrets

Service accounts in Kubernetes get bound to roles that grant them very specific resource access. A logging agent's service account should read pods. A workload's service account might read a single secret it depends on. When that same service account suddenly reads dozens of secrets across multiple namespaces, somebody has either misconfigured RBAC or compromised the pod.

The audit signature. user.username starts with system:serviceaccount:. The verb is get or list. The objectRef.resource is secrets. The detection pattern is volume- and namespace-spread, not absolute count: a service account reading 1 secret per hour is normal; reading 30 secrets across 8 namespaces in 5 minutes is not.
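Volume and namespace spread can be sketched by bucketing secret reads per service account into fixed 5-minute windows. This is a minimal sketch: the function names and the thresholds (10 secrets across 3 namespaces) are hypothetical starting points, not values taken from any tool, and a list call without a secret name is counted as a single wildcard read.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 300  # 5-minute tumbling windows

def secret_read_spread(events):
    """Map (service_account, window_start) -> (distinct secrets, distinct namespaces)."""
    seen = defaultdict(set)
    for event in events:
        user = (event.get("user") or {}).get("username", "")
        ref = event.get("objectRef") or {}
        if not user.startswith("system:serviceaccount:"):
            continue
        if ref.get("resource") != "secrets" or event.get("verb") not in ("get", "list"):
            continue
        ts = datetime.fromisoformat(
            event["requestReceivedTimestamp"].replace("Z", "+00:00"))
        bucket = int(ts.timestamp()) // WINDOW_SECONDS * WINDOW_SECONDS
        # list calls carry no secret name; record them as "*".
        seen[(user, bucket)].add((ref.get("namespace", ""), ref.get("name", "*")))
    return {key: (len(items), len({ns for ns, _ in items}))
            for key, items in seen.items()}

def suspicious(spread, min_secrets=10, min_namespaces=3):
    """Return service accounts that crossed both thresholds in any window."""
    return sorted({user for (user, _), (n_secrets, n_namespaces) in spread.items()
                   if n_secrets >= min_secrets and n_namespaces >= min_namespaces})
```

A service account that shows up here but has never read more than one secret per hour is the one to investigate first.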

Why this is the highest-value pattern. Most modern Kubernetes attacks pivot through secrets. Compromise a pod, dump its mounted service account token, use it to read secrets in the same namespace, escalate to a cluster-admin secret if one exists, walk laterally. The audit log is the only place this pivot leaves an irrefutable trail. Reference: Service account tokens and audit.

Detection. Falco's k8s-audit rule for unexpected secret access (k8s_audit_rules.yaml, rule Get Secret) is a strong starting point and works out of the box on EKS, GKE, AKS audit streams.

Practical takeaway: use the External Secrets Operator or your cloud provider's secret manager (AWS Secrets Manager, GCP Secret Manager) instead of mounting raw Kubernetes Secret objects whenever possible; the audit trail is then in the cloud secret service, which has finer-grained access reporting. Reference: External Secrets Operator.

4. Impersonation requests, especially toward system:masters

The Kubernetes API server supports impersonation: a privileged user can include Impersonate-User, Impersonate-Group, Impersonate-Uid, and Impersonate-Extra-* headers in a request, and the API server processes the request as if it came from the impersonated principal. Used legitimately, impersonation lets a controller act on behalf of a user (the kubectl auth can-i --as flow uses it).

Used illegitimately, impersonation is a privilege-escalation primitive. A user with the impersonate verb on users and groups can impersonate any principal in the cluster, including the all-powerful system:masters group. That bypasses every RBAC role binding you have configured.

The audit signature. The impersonatedUser field is populated in the audit event. The impersonatedUser.groups array might contain system:masters or another privileged group. Filter for any audit event where impersonatedUser.groups contains system:masters and the request is anything other than a known controller path.
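Because impersonatedUser.groups is an array, an exact membership test is easier to express in code than in a log-query regex. A minimal sketch over parsed audit events follows; the function name and the allowed_actors parameter (your list of known controller identities) are hypothetical.

```python
PRIVILEGED_GROUPS = {"system:masters"}

def privileged_impersonations(events, allowed_actors=frozenset()):
    """Return (actor, impersonated user, privileged groups) for each hit."""
    hits = []
    for event in events:
        impersonated = event.get("impersonatedUser") or {}
        groups = set(impersonated.get("groups") or []) & PRIVILEGED_GROUPS
        actor = (event.get("user") or {}).get("username", "")
        # Known controller identities are skipped; everything else is reported.
        if groups and actor not in allowed_actors:
            hits.append((actor, impersonated.get("username"), sorted(groups)))
    return hits
```

Keep allowed_actors as short as possible; an empty set is a reasonable default for most early-stage clusters.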

The historical pattern. In 2023 the SecureStack research team published a write-up on cluster takeover via the cert-manager service account when it had an over-broad impersonate permission. The fix was an RBAC tightening; the audit log was the only evidence that anyone had tried. Reference: Impersonation in Kubernetes auth.

Detection. The simplest CloudWatch Insights query for EKS:

fields @timestamp, user.username, impersonatedUser.username, impersonatedUser.groups, verb, objectRef.resource
| filter impersonatedUser.groups like /system:masters/
| sort @timestamp desc

Practical takeaway: grant the impersonate verb sparingly and only on specific user names or groups using RBAC resourceNames, never with wildcard *. Treat any audit event impersonating system:masters as a compromise until proven otherwise.

5. ClusterRoleBinding creation or modification to a privileged role

Privilege escalation in Kubernetes most often takes the shape of a new ClusterRoleBinding that binds the attacker's principal to a powerful role like cluster-admin. Public Kubernetes attack write-ups have shown the same sequence repeatedly since the original TeamTNT cryptomining campaigns: the attacker creates a binding, does their work, optionally cleans up by deleting the binding, and moves on.

The audit signature. The verb is create, update, patch, or delete. The objectRef.apiGroup is rbac.authorization.k8s.io. The objectRef.resource is clusterrolebindings or rolebindings. The request body (only present when audit level is Request or RequestResponse) shows the binding target.

Why you need Request-level audit for this rule. Metadata-level audit tells you that a ClusterRoleBinding was created, but not who it bound to which role. You need the request body to extract the roleRef.name and the subjects array. This is the reason I recommend Request-level audit specifically for RBAC objects, even though the rest of the cluster can stay on Metadata level. Reference: RBAC audit guidance.
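To make the difference concrete, here is a minimal sketch that pulls roleRef.name and the subjects out of the requestObject of binding writes. The function name and the set of roles treated as sensitive are illustrative assumptions. Events logged at Metadata level carry no requestObject, so they produce no findings here, which is exactly the blind spot described above.

```python
SENSITIVE_ROLES = {"cluster-admin", "admin", "edit"}
BINDING_RESOURCES = {"clusterrolebindings", "rolebindings"}

def risky_bindings(events):
    """Return one record per binding write that targets a sensitive role."""
    findings = []
    for event in events:
        ref = event.get("objectRef") or {}
        if ref.get("apiGroup") != "rbac.authorization.k8s.io":
            continue
        if ref.get("resource") not in BINDING_RESOURCES:
            continue
        if event.get("verb") not in ("create", "update", "patch"):
            continue
        # requestObject is only present at Request or RequestResponse level.
        body = event.get("requestObject") or {}
        role = (body.get("roleRef") or {}).get("name")
        if role not in SENSITIVE_ROLES:
            continue
        subjects = [
            (s.get("kind"), s.get("namespace", ""), s.get("name"))
            for s in body.get("subjects") or []
        ]
        findings.append({
            "actor": (event.get("user") or {}).get("username"),
            "role": role,
            "subjects": subjects,
        })
    return findings
```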

Detection. Cloud Logging query for GKE:

resource.type="k8s_cluster"
protoPayload.methodName=~"clusterrolebindings.(create|patch|update)"
protoPayload.request.roleRef.name=~"cluster-admin|admin|edit"

Practical takeaway: enable GitOps for RBAC. Every ClusterRoleBinding should come from a reviewed pull request in your IaC repo. An audit event creating a binding that does not exist in your Git history is, by definition, an out-of-band change and worth alerting on.

6. Privileged pod creation or hostPath, hostNetwork, hostPID workloads

A privileged pod (containers with securityContext.privileged: true, hostNetwork: true, hostPID: true, or mounted hostPath volumes pointing at sensitive host paths like /, /etc, /var/run/docker.sock) is a container escape primitive. The pod can read or modify the host filesystem, see all host processes, and in the worst case execute arbitrary host commands.

Most attackers, after they gain RBAC permission to create pods, immediately create a privileged pod that mounts the host filesystem and gives them shell access to the underlying node. This is the canonical Kubernetes-to-host pivot.

The audit signature. The verb is create. The objectRef.resource is pods. The request body (Request-level audit needed here too) contains the pod spec. The detection pattern is any of: containers[].securityContext.privileged: true, containers[].securityContext.capabilities.add containing SYS_ADMIN, hostNetwork: true, hostPID: true, or volumes[].hostPath.path matching a sensitive prefix.
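The checks listed above can be sketched as a function over the pod spec found in requestObject.spec of a pod create event. This is a minimal sketch: the function name is hypothetical, the sensitive path list mirrors the examples in this section and should be extended for your environment, and scanning initContainers as well as containers is an assumption added here.

```python
# Sensitive host paths from the examples above; "/" is matched exactly,
# the others also match anything beneath them.
SENSITIVE_EXACT = {"/"}
SENSITIVE_PREFIXES = ("/etc", "/var/run/docker.sock")

def pod_spec_findings(spec):
    """Return a list of risky settings found in a pod spec dict."""
    findings = []
    if spec.get("hostNetwork"):
        findings.append("hostNetwork")
    if spec.get("hostPID"):
        findings.append("hostPID")
    containers = (spec.get("containers") or []) + (spec.get("initContainers") or [])
    for container in containers:
        ctx = container.get("securityContext") or {}
        if ctx.get("privileged"):
            findings.append("privileged:%s" % container.get("name"))
        added = (ctx.get("capabilities") or {}).get("add") or []
        if "SYS_ADMIN" in added:
            findings.append("SYS_ADMIN:%s" % container.get("name"))
    for volume in spec.get("volumes") or []:
        path = (volume.get("hostPath") or {}).get("path")
        if path is None:
            continue
        if path in SENSITIVE_EXACT or any(
                path == p or path.startswith(p + "/") for p in SENSITIVE_PREFIXES):
            findings.append("hostPath:%s" % path)
    return findings
```

Call it with event["requestObject"]["spec"] for each pod create event; an empty list means none of the listed settings were present.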

Why Pod Security Admission helps but is not enough. Kubernetes 1.25+ ships Pod Security Admission (PSA) with three profiles: privileged, baseline, restricted. Labeling namespaces with pod-security.kubernetes.io/enforce: restricted blocks privileged pod creation at admission. This is the right preventive control. The audit log is the detective control on top: PSA blocks the attempt, but the audit event tells you somebody tried. Reference: Pod Security Standards.

Detection. Falco's k8s-audit rules Create Privileged Pod and Create HostNetwork Pod cover this out of the box.

Practical takeaway: enforce the restricted PSA profile on all namespaces by default, with baseline only on the system namespaces that need it. Alert on every attempt to create a pod in the privileged profile.

7. Token request burst from a single service account

Kubernetes replaced the older long-lived service-account secret model in stages: the TokenRequest API reached general availability in v1.20, bound service account token volumes in v1.22, and v1.24 (2022) stopped auto-creating token Secrets for service accounts. Pods now receive time-bound, audience-scoped tokens that get rotated automatically. The TokenRequest API itself shows up in audit logs as a request against serviceaccounts/token.

The benign pattern. Every pod requests one or two tokens per hour for token refresh. A logging or monitoring agent might request a token per scrape interval. The volume per service account is predictable and stable.

The compromise pattern. An attacker who has obtained a pod's service account token tries to mint additional tokens (sometimes with extended expiry, sometimes targeting different audiences) to maintain persistence. A sudden burst of TokenRequest events from a single service account, especially with expirationSeconds near the maximum or with audiences different from the normal audience for that account, is a strong persistence signal.

The audit signature. The verb is create. The objectRef.resource is serviceaccounts. The objectRef.subresource is token. The requestObject.spec.audiences array and requestObject.spec.expirationSeconds are visible in Request-level audit events. Detection is volume- and parameter-deviation based: any service account that triples its 24-hour TokenRequest baseline is worth investigating.
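The baseline comparison can be sketched as two counts per service account: one over a reference 24-hour period and one over the current period. This is a minimal sketch with hypothetical function names; it keys on objectRef (the service account whose token is being minted) and treats an account with no baseline as having a baseline of 1, so a brand-new account is flagged once it reaches the factor.

```python
from collections import Counter

def token_request_counts(events):
    """Count TokenRequest creations per (namespace, service account)."""
    counts = Counter()
    for event in events:
        ref = event.get("objectRef") or {}
        if (event.get("verb") == "create"
                and ref.get("resource") == "serviceaccounts"
                and ref.get("subresource") == "token"):
            counts[(ref.get("namespace", ""), ref.get("name", ""))] += 1
    return counts

def above_baseline(baseline, current, factor=3):
    """Return service accounts whose current count is >= factor x baseline."""
    return sorted(
        account for account, count in current.items()
        # Unknown accounts get a floor of 1 so they are not flagged on one call.
        if count >= factor * max(baseline.get(account, 0), 1)
    )
```

Feed it yesterday's events as the baseline and today's as current, then inspect requestObject.spec.audiences and expirationSeconds on the flagged accounts.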

Detection. CloudWatch Insights query for EKS (counts TokenRequests per service account in a 1-hour window):

fields @timestamp, user.username, objectRef.namespace, objectRef.name
| filter objectRef.resource = "serviceaccounts" and objectRef.subresource = "token"
| stats count() as tokens by objectRef.namespace, objectRef.name, bin(1h)
| sort tokens desc

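Practical takeaway: where you control the API server, cap TokenRequest expirationSeconds at the cluster level with the --service-account-max-token-expiration apiserver flag; 24 hours is a reasonable ceiling for most workloads. Also check --service-account-extend-token-expiration, which is on by default and can extend injected tokens to 1 year, far too generous for most workloads. Reference: TokenRequest API.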

Summary table: the 7 patterns, the audit fields, the detective control

Pattern	Audit fields to filter on	Required audit level	Primary detective control
1. Anonymous success	user.username = system:anonymous, responseStatus.code 200-299	Metadata	Log-based alert in cloud log service
2. Pod exec from non-break-glass	verb = create, objectRef.subresource in (exec, attach)	Metadata	RBAC binding for exec to single role, alert on every event
3. Unexpected secret access	user.username starts with system:serviceaccount:, objectRef.resource = secrets, verb in (get, list)	Metadata	Falco k8s-audit rule, baseline volume thresholds
4. Impersonation to system:masters	impersonatedUser.groups contains system:masters	Metadata	Restrict impersonate verb with resourceNames
5. CRB to privileged role	verb in (create, patch, update), objectRef.resource = clusterrolebindings, request.roleRef.name in (cluster-admin, admin, edit)	Request	GitOps for RBAC, alert on out-of-band changes
6. Privileged pod creation	verb = create, objectRef.resource = pods, request body shows privileged or hostPath	Request	Pod Security Admission restricted profile
7. TokenRequest burst	verb = create, objectRef.resource = serviceaccounts, objectRef.subresource = token	Request	Cap service-account-max-token-expiration, baseline per SA

Stage-specific recommendations

Pre-seed (1 to 5 engineers, 1 cluster). Enable Metadata-level audit on every verb, Request-level for secrets, serviceaccounts, and rbac. Ship logs to your cloud provider's native log service (CloudWatch, Cloud Logging, Log Analytics) with a 30-day retention. Set up log-based alerts for patterns 1, 2, 4, and 5. Patterns 3, 6, and 7 need a runtime tool; defer those for now. Total monthly cost: under USD 30 for a small cluster.

Seed (5 to 15 engineers, 2 to 4 clusters). Add Falco with the k8s-audit plugin on every cluster (free, open source, 100MB pod). Ship Falco alerts into your incident channel. Enable Pod Security Admission with the restricted profile on all application namespaces; baseline only on kube-system and ingress namespaces. Switch raw Kubernetes Secrets to External Secrets Operator pointing at AWS Secrets Manager or GCP Secret Manager. Cap TokenRequest expiry at 24 hours.

Series A (15 to 50 engineers, 4 plus clusters). Adopt a managed Kubernetes runtime security platform (Sysdig Secure, Wiz Runtime Sensor, Datadog Cloud Workload Security, or the open-source Tetragon plus your own pipeline). Centralize audit logs into a SIEM (Datadog, Sumo Logic, Elastic, Chronicle) with cross-cluster correlation. Enforce GitOps for every RBAC object via Flux or Argo CD with policy-as-code gates (Kyverno or OPA Gatekeeper). The audit log is no longer your sole detective control; it is one of three (audit + runtime + IaC drift).

The hidden audit-log antipattern: sampling

Some cluster operators, faced with audit-log storage cost climbing, reach for sampling: log 10 percent of events. Do not do this. Sampling defeats the entire forensic value of the audit log because the one event that matters (the impersonation attempt, the CRB create, the privileged pod) is precisely the rare event that sampling drops. The correct cost optimization is the audit policy, not the sample rate. Drop verbs and resources you do not care about (events, leases, and endpointslices often dominate volume under the default audit policies) and keep 100 percent of the verbs you do care about. Reference: audit-policy syntax.

If you want a second opinion on your Kubernetes audit setup

I run a free 20-minute Kubernetes audit-log and RBAC review for early-stage startups. Bring your audit-policy YAML, your RBAC binding list, and your top 5 service accounts by token volume. I will tell you which of the seven patterns above are already covered, which are blind spots, and the three highest-leverage fixes specific to your cluster size and provider. No NDA needed for the first conversation. Send a note.

Avinash S is the founder of MatrixGard. Fractional DevSecOps for pre-seed and seed startups across India, the GCC, the UK, and the US. Nearly a decade of running production Kubernetes workloads on EKS, GKE, AKS, and self-managed clusters from 10 to 500 nodes, including audit-log analysis during three real incident-response engagements.

Methodology note. All technical references taken from the public Kubernetes documentation, the Falco security project pages, the AWS EKS, GCP GKE, and Azure AKS provider documentation, and publicly published security research write-ups, current as of May 2026. Failure modes and detection queries are drawn from production audit-log reviews I have performed; specific incidents are described generically. Stage-specific recommendations are practitioner judgment and will vary by team composition and risk appetite.

Terraform State for Startups: 5 Patterns and When Each Breaks at Scale

noreply@matrixgard.com (Avinash S) — Sat, 23 May 2026 03:45:00 GMT

Terraform state is the single piece of your infrastructure setup that, when it goes wrong, costs you a weekend and possibly a production outage. State files are not Terraform code. They are a serialised snapshot of every resource Terraform has provisioned for you, with attributes, dependencies, and (often) sensitive values embedded in plain JSON. Lose the file, and Terraform forgets what exists. Corrupt the file, and Terraform tries to recreate things that already exist. Let two engineers run apply at the same time without locking, and the file becomes a race condition that leaves your account in a half-applied state.

This is the honest 2026 review of the five state-management patterns I see in pre-seed and seed startups, what each one is good at, and the specific scale point at which each one breaks. Recommendations are stage-specific (pre-seed, seed, Series A) and grounded in actual production failures, not the marketing pages of the runner vendors. OpenTofu is now a real fork with its own release cadence and a growing user base; the patterns below apply to both Terraform and OpenTofu unless I call out the difference.

Quick context. A Terraform run does two things to state. First, on terraform init it reads the backend block and figures out where state lives. Second, on terraform plan and apply it reads the existing state, compares it to your desired configuration, and writes a new state back to the backend. The backend is the contract between Terraform and the persistence layer. Local backend means a terraform.tfstate file in your working directory. Remote backends include S3 with DynamoDB, Google Cloud Storage, Azure Blob Storage, Terraform Cloud (now HCP Terraform), and a handful of others. Reference: HashiCorp State documentation.

The choice of backend governs three things that matter at scale: where the file lives, how it is locked during a run, and who can read or write it. The five patterns below differ on those three axes. The breakage points are where the chosen pattern stops scaling on at least one of them.

1. Pattern one: Local state on a laptop

The default backend when you run terraform init with no backend block is the local backend. State is a JSON file named terraform.tfstate in your working directory, with a backup at terraform.tfstate.backup. It is the simplest possible setup, and for a single engineer prototyping in a sandbox account on day one, it is fine.

Where it breaks. The moment a second person needs to run Terraform against the same infrastructure, local state is dead. There is no shared source of truth, no locking, no way to know if the file in your colleague's checkout is current. Engineers compensate by emailing tfstate around, committing it to git (please do not), or running everything through one person. All three are anti-patterns. Committing state to git additionally leaks every secret Terraform put in state into the repo history.
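To see how much sensitive material a state file actually carries, a rough scan for suspicious attribute names is enough. The sketch below assumes the version 4 state layout (a top-level resources list, each with instances holding an attributes map); the function name and the key pattern are illustrative guesses and will miss secrets stored under innocuous names. It reports attribute paths only, never values.

```python
import re

# Assumption: attribute names that commonly hold credentials.
SUSPECT_KEY = re.compile(r"password|secret|token|private_key", re.IGNORECASE)

def suspect_attributes(state):
    """Return dotted paths of non-empty string attributes with suspicious names."""
    hits = []

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, path + [str(key)])
        elif isinstance(value, list):
            for index, child in enumerate(value):
                walk(child, path + [str(index)])
        elif isinstance(value, str) and value and SUSPECT_KEY.search(path[-1]):
            hits.append(".".join(path))

    for resource in state.get("resources", []):
        address = "%s.%s" % (resource.get("type"), resource.get("name"))
        for instance in resource.get("instances", []):
            walk(instance.get("attributes") or {}, [address])
    return hits
```

Run it against the output of terraform state pull parsed with json.load; every path it prints is a value that would land in git history if the file were committed.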

The other quiet failure is laptop loss. If the state file lives only on a MacBook and the MacBook dies, the infrastructure is orphaned. Terraform does not know it exists. You either reconstruct state by writing a long sequence of terraform import commands, one per resource, or you destroy and rebuild. Both options are days of work and real production risk.

Practical takeaway: use local state only for throwaway sandbox experiments. The moment the work matters, move to a remote backend on day one. Reference: Local backend docs.

2. Pattern two: S3 with DynamoDB locking on AWS

This is the workhorse pattern for AWS-based startups and probably the single most common backend I encounter on audits. State lives in an S3 bucket with versioning and server-side encryption enabled. A DynamoDB table with a partition key named LockID provides the lock; Terraform writes a row to the table at the start of a run and deletes it at the end. The backend block is short, the IAM is straightforward, and the costs are negligible at startup scale.

Canonical setup. One S3 bucket per environment or per account, versioning ON, default encryption with a customer-managed KMS key, and a bucket policy that denies any non-TLS access. A single DynamoDB table per account is enough. Reference: HashiCorp S3 backend docs.

One important 2024 change. HashiCorp shipped native S3 locking via the use_lockfile = true option (Terraform 1.10+), which stores a lock file alongside the state file in S3 itself, no DynamoDB table required. For new setups in 2025 and 2026, you can skip DynamoDB entirely. Existing setups with DynamoDB locking continue to work and do not need urgent migration. Reference: S3 native locking.

Where it breaks. Three failure modes. First, a stale lock: a run that is killed (laptop sleep, Ctrl-C, runner crash) leaves the DynamoDB row in place and blocks the next run. Clear it with terraform force-unlock LOCK_ID, but only when you are certain no one else is running. Second, S3 versioning is mandatory; without it, a corrupted state file leaves you nothing to restore from. Third, IAM permissions for the bucket and table tend to be over-scoped at startup, so any engineer can overwrite production state. Tighten with object-level S3 conditions and per-environment role separation before you hire the third engineer.

3. Pattern three: GCS or Azure Blob with native locking

The Google Cloud Storage backend uses native object locking via the GCS API; there is no separate DynamoDB-equivalent table to provision. The Azure Blob Storage backend uses lease-based locking on the blob itself. Both are conceptually cleaner than the historical AWS pattern because the lock and the state live in the same primitive.

GCS setup. A single bucket per environment, uniform bucket-level access ON, customer-managed encryption keys, and Object Versioning enabled. The backend block is five lines and Terraform handles locking for you. Reference: GCS backend docs.

Azure setup. A storage account with a container, soft delete enabled, and a Service Principal or Managed Identity that has Storage Blob Data Contributor on the container. Reference: azurerm backend docs.

Where it breaks. On GCS, if soft delete is not enabled and an engineer accidentally deletes the state object, you lose everything. Versioning is the safety net; enable it before your second engineer touches the project. On Azure, lease-based locks expire after 60 seconds by default and a long-running apply on a large state file can race itself if the apply takes longer than the lease and the lease cannot be renewed cleanly. Rare, but it has happened to teams running 2000+ resources in a single state.

Practical takeaway. For GCP-first or Azure-first startups, use these native backends; no need to bolt on a third-party tool. Make sure versioning and soft delete are enabled before any non-prototype run.

4. Pattern four: Terraform Cloud (HCP Terraform) or Spacelift

HCP Terraform (the rebrand of Terraform Cloud since 2024) is HashiCorp's hosted runner. It stores state, runs plans and applies on managed workers, surfaces a web UI for plan approvals, and integrates with VCS providers for plan-on-PR workflows. Free for the first 500 resources, then per-resource or per-seat pricing tiers above that. Reference: HCP Terraform docs.

Spacelift is the most-cited independent alternative. Native support for Terraform, OpenTofu, Pulumi, CloudFormation, and Kubernetes manifests. Stack-and-policy model, drift detection, and a richer permissions surface than HCP Terraform at the team scale. Reference: Spacelift docs.

The pattern. State is stored by the runner platform itself (you do not configure an S3 or GCS backend). Engineers commit code, open a PR, the runner posts a plan as a PR comment, a reviewer approves, the apply runs on a managed worker, the state updates. The lock is implicit in the run queue: only one run executes per stack at a time.

Where it breaks. Vendor coupling is the first one. Once you have a year of run history, audit trails, and policy code in the platform, migrating off is a real project. Cost is the second; HCP Terraform's per-resource pricing climbs faster than most teams expect once you cross 1,000 resources. The third is OpenTofu compatibility; HCP Terraform's terms of service restrict OpenTofu use, so if your team has standardised on OpenTofu, Spacelift or Atlantis is the better choice.

5. Pattern five: Atlantis or a CI-driven runner on your own infrastructure

Atlantis is the open-source pull-request automation server for Terraform and OpenTofu. You deploy it as a single container in your own cloud (an ECS task, a small GKE pod, a Fly machine), point your VCS webhooks at it, and it runs plan on every PR and apply on a comment trigger. State lives in whichever backend you configured (S3, GCS, or otherwise); Atlantis is the orchestration layer, not the persistence layer. Reference: Atlantis docs.

The lighter-weight version is GitHub Actions or GitLab CI running plan and apply jobs directly, with state in S3 or GCS. This is what most pre-seed teams converge on after they outgrow laptop state: a single workflow file, OIDC federation to assume a cloud role, S3 backend with native locking, plan-on-PR with a manual approval gate on apply. No external runner platform to pay for.

Where it breaks. Two failure modes. First, the runner becomes a single point of failure once your apply jobs depend on it; an Atlantis pod that crashes during an apply leaves you with stale-lock recovery work plus the operational burden of figuring out which run was in flight. Second, the security posture of the runner itself matters more than people realise. Whoever can push to the workflow file effectively has cloud admin rights, because they can change what Terraform runs. Lock down the workflow file with CODEOWNERS, require signed commits, and audit-log every apply.

Practical takeaway. This is the most honest fit for a pre-seed or seed startup comfortable operating its own tools. Atlantis or a CI workflow plus S3 plus native locking covers 95 percent of what HCP Terraform sells you, at zero platform cost, with full control. The gap is the polished web UI for non-engineers and the drift-detection feature, which most early teams do not need.

6. The cross-cutting issue: workspaces and environment isolation

Independent of which backend you pick, you have to decide how to split state between environments (dev, staging, prod) and between concerns (network, data, application). Terraform offers two mechanisms: workspaces (one backend, multiple named state files) and full directory or backend separation (one backend per environment, completely independent state).

Workspaces are seductive because they are easy: terraform workspace new prod, run apply, done. They are also dangerous because every workspace lives in the same bucket, under the same IAM, accessible to the same credentials. The blast radius of a misconfigured run is every workspace, not just the one you thought you were targeting. HashiCorp's own workspace docs are explicit that workspaces are not a substitute for environment isolation.

Full separation means a per-environment directory, per-environment backend, per-environment cloud account, and per-environment credentials. The prod state file lives in a prod-only bucket that the dev role cannot read. This is the only configuration where a leaked dev credential cannot destroy prod by accident. Use workspaces only for short-lived, identical environments inside the same trust boundary (ephemeral PR environments are the canonical example).

7. State splitting: one big state file vs many small ones

Past 200 to 300 resources in a single state file, Terraform performance starts to degrade noticeably. Plans take minutes instead of seconds. Refresh storms when a tag changes across hundreds of resources. The probability of a partial-apply failure goes up because the run window is longer. Past 1000 resources in a single state, an apply that fails halfway through can leave you with hours of reconciliation work.

The standard fix is state splitting. Carve the infrastructure into bounded contexts (one state for the VPC and networking, one for the database tier, one for the Kubernetes cluster, one for the application services, one for IAM and identity) and let each have its own state file. Modules that need outputs from another state read them via terraform_remote_state data sources or, better, via SSM Parameter Store or Secrets Manager so the coupling is loose.

Where splitting itself breaks. Too many small states becomes a coordination problem. If your application service state depends on five other states and any of those need a coordinated change, you now have a multi-state apply sequence with no transactional guarantee. Split along ownership and change-frequency lines, not arbitrary technical lines. Start with one state file at pre-seed, split at seed when you cross 300+ resources, aim for 5 to 10 states maximum at Series A.

8. The secrets-in-state problem

Terraform state stores every attribute of every resource, including attributes the provider marks as sensitive. Database passwords, RDS master credentials, IAM access keys generated inline, KMS key material wrapped during initial provisioning, all of it ends up as plain JSON in the state file. The state file is encrypted at rest in S3 or GCS, but anyone with read access to the backend has the plaintext secrets the moment they pull state.
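A quick way to see how exposed a given state file is: scan it for secret-looking attribute names. The sketch below assumes the version 4 JSON state layout (a top-level resources list, each entry carrying instances with an attributes map). The name hints and the sample state are illustrative assumptions, and the scan only looks at top-level attributes, so treat it as a smoke test rather than a detector.

```python
import json

# Attribute-name fragments that usually indicate a secret (illustrative, not exhaustive).
SECRET_HINTS = ("password", "secret", "private_key", "token")


def find_secret_attributes(state: dict) -> list[str]:
    """Return addresses such as 'aws_db_instance.main.password' for non-empty secret-looking attributes."""
    hits = []
    for resource in state.get("resources", []):
        address = f"{resource.get('type')}.{resource.get('name')}"
        for instance in resource.get("instances", []):
            for key, value in (instance.get("attributes") or {}).items():
                if value and any(hint in key.lower() for hint in SECRET_HINTS):
                    hits.append(f"{address}.{key}")
    return hits


# Hypothetical, trimmed-down state document for demonstration only.
sample_state = json.loads("""
{
  "version": 4,
  "resources": [
    {
      "type": "aws_db_instance",
      "name": "main",
      "instances": [
        {"attributes": {"identifier": "app-db", "username": "app", "password": "hunter2"}}
      ]
    }
  ]
}
""")

print(find_secret_attributes(sample_state))
```

Run it against a copy of state pulled by the runner role, not against a copy an engineer downloaded to a laptop; the point is to measure the exposure, not to widen it.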

HashiCorp official guidance, as of 2026, is to treat the state file as sensitive and restrict access accordingly. Reference: Sensitive Data in State. Practical fixes:

Generate secrets outside Terraform and inject by reference. Create the database password in AWS Secrets Manager or GCP Secret Manager via a separate workflow, then have Terraform read the secret name and pass it to the RDS instance, never the value.
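Use providers that support secret references. AWS provider's aws_secretsmanager_secret_version with secret_string sourced from a data block keeps the value out of your Terraform code, but a value read through an ordinary data block is still recorded in state. On Terraform 1.10+, ephemeral resources are the mechanism designed to keep such values out of state.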
Enable state encryption with a customer-managed KMS key. Default S3 encryption with SSE-S3 is not enough; SSE-KMS with a CMK lets you audit every state read.
Restrict S3 GetObject on the state bucket to the runner role only. Engineers should not pull production state to their laptops.

Practitioner opinion: the most common audit finding I see in this category is a state bucket where every engineer's IAM role has s3:GetObject. Lock that down before anything else.
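One possible shape for that lockdown is a bucket policy that denies s3:GetObject to everyone except the runner role. The sketch below only builds the policy document; the bucket name and role ARN are hypothetical placeholders, and a real policy would likely need an extra exception for a break-glass role.

```python
import json


def runner_only_state_policy(bucket: str, runner_role_arn: str) -> dict:
    """Bucket policy that denies s3:GetObject to every principal except the runner role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyStateReadsExceptRunner",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                # An explicit deny wins over any allow attached to an engineer's role.
                "Condition": {
                    "ArnNotEquals": {"aws:PrincipalArn": runner_role_arn}
                },
            }
        ],
    }


# Hypothetical names for illustration only.
policy = runner_only_state_policy(
    bucket="example-prod-tfstate",
    runner_role_arn="arn:aws:iam::111122223333:role/terraform-runner",
)
print(json.dumps(policy, indent=2))
```

An explicit deny is used here, rather than relying on the absence of an allow, so that a broad s3:* grant added later to an engineer role does not silently reopen the bucket.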

9. Refactoring state: moved blocks, import blocks, and state mv

Terraform code changes over time. You rename a module, you split a resource group, you adopt a new naming convention. State has to follow the code, or Terraform will plan to destroy and recreate every renamed resource. Three tools matter.

The moved block (Terraform 1.1+) lets you declare a refactor in code. When you rename a resource from aws_instance.web to aws_instance.web_server, you add a moved block with from = aws_instance.web and to = aws_instance.web_server (each argument on its own line). Terraform then plans the rename as a move and records it in state on the next apply, with no destroy-recreate. Reference: moved blocks docs.

The import block (Terraform 1.5+) lets you adopt resources that exist in the cloud but not in Terraform state. Write the import block, run plan, Terraform shows you what it would import, run apply. Replaces the older interactive terraform import CLI for production-style workflows. Reference: import block docs.

The terraform state mv CLI is the older mechanism, still useful for one-off surgery. Manual, requires the state lock, leaves no audit trail in your code. Prefer moved blocks in code over state mv on the CLI: code is reviewable, auditable, and survives engineer turnover.

10. The honest summary table

Pattern	Locking	Cost	Breaks at	Best stage
Local state	None	Free	Second engineer	Sandbox only
S3 + DynamoDB or native	DynamoDB or lockfile	Pennies / month	Mis-scoped IAM, stale locks	Pre-seed and seed AWS
GCS or Azure Blob	Native (GCS) or lease (Azure)	Pennies / month	Missing versioning, long applies on Azure	Pre-seed and seed GCP or Azure
HCP Terraform or Spacelift	Run queue	$0 to $20+ per resource per month	Vendor lock-in, cost at 1000+ resources, OpenTofu (HCP only)	Seed with budget, Series A
Atlantis or CI runner	Backend-level (S3, GCS)	Self-hosted compute	Runner single point of failure, workflow-file security	Pre-seed and seed with ops appetite

11. Stage-specific recommendations

Pre-seed (1 to 5 engineers, less than 100 cloud resources). S3 plus native locking (or GCS, Azure Blob equivalent) plus GitHub Actions with OIDC. One state file. One backend bucket per cloud account. KMS-encrypted, versioned, IAM tight. Zero platform cost, full control, scales comfortably to 200+ resources. Do not buy HCP Terraform at this stage.

Seed (5 to 15 engineers, 100 to 500 cloud resources). Same backend but split state along environment lines (dev, staging, prod, each in a separate bucket and ideally a separate cloud account). Introduce Atlantis if you want PR-comment workflows without writing them yourself. Evaluate HCP Terraform free tier if you want the polished UI for plan reviews. Tighten IAM so engineers cannot read prod state from their laptops.

Series A (15 to 50 engineers, 500 to 2000 cloud resources). Split state along service-ownership lines as well as environment lines. Introduce a runner platform (HCP Terraform, Spacelift, or Env0) for the audit trail, drift detection, and policy-as-code surface. Plan a deliberate migration if you are still on Terraform 1.5 or earlier; the 1.10+ native S3 locking and import-block ergonomics are worth the version bump. If your team is on OpenTofu, Spacelift or Atlantis are your runner options.

The trap: changing backends late is expensive

Every team that starts with local state and grows out of it pays a one-time migration tax to move to a remote backend. Every team that starts on HCP Terraform and decides to move off pays a similar tax in the other direction. The cost is roughly one engineering week per backend per environment, not counting the institutional knowledge encoded in the runner platform itself (run history, policy configuration, workspace settings). The cheapest path is to pick the right backend on day one and stick with it. For 90 percent of pre-seed startups in 2026, that is S3 (or GCS, Azure Blob) plus native locking plus a CI runner. Upgrade to HCP Terraform or Spacelift when you have a clear reason: non-engineers approving runs, the audit-trail threshold for SOC 2, or a coordination bottleneck the runner platform genuinely solves. Do not upgrade because a marketing page told you to.

If you want a second opinion on your Terraform setup

I run a free 20-minute Terraform state and IaC audit for early-stage startups. Pull your backend config, your workspace structure, your state file count and resource count; bring them. I will give you a ranked list of the three highest-leverage fixes specific to your stage, with rough effort estimates. No NDA needed for the first conversation. Send a note.

Avinash S is the founder of MatrixGard. Fractional DevSecOps for pre-seed and seed startups across India, the GCC, the UK, and the US. Almost a decade of running production workloads across AWS, GCP, and Azure, including Terraform and OpenTofu infrastructure-as-code at 100 to 5000-resource scale.

Methodology note. All technical references taken from public HashiCorp documentation, the Atlantis and Spacelift docs, the OpenTofu project pages, and the AWS, GCP, and Azure provider documentation, current as of May 2026. No vendor sales decks were used. Failure modes are drawn from production audits I have performed across pre-seed and seed startups; specific incidents are described generically. Stage-specific recommendations are practitioner judgment and will vary by team composition and risk appetite.

Cloud Egress Costs in 2026: AWS vs GCP vs Azure for High-Traffic SaaS Startups

noreply@matrixgard.com (Avinash S) — Tue, 19 May 2026 08:30:00 GMT

Egress is the cloud bill line item that high-traffic SaaS founders almost always underestimate. Compute and database costs are predictable. You provision them, you watch them, you pay for them. Egress is different. It scales with user behaviour, with feature shape, with one accidental misconfiguration in a webhook fanout. It hides under different names on different providers (Data Transfer Out, Internet Egress, Outbound Data Transfer), it has tiered pricing that no built-in dashboard summarises clearly, and it is the single category most likely to surprise an early-stage SaaS team on the bill that arrives after a launch week.

This is the honest 2026 breakdown of cloud egress costs across AWS, GCP, Azure, and the egress-disrupting alternatives (Cloudflare R2, Backblaze B2). What the numbers actually are. What changed after the EU Data Act forced free egress on exit in 2024. The hidden inter-region and inter-AZ bills. The six engineering tactics that move the cost needle for high-traffic SaaS, and the stage-specific recommendations for pre-seed, seed, and Series A teams.

1. What "egress" actually means on a cloud bill

Cloud egress is the umbrella term for outbound data transfer that leaves the cloud provider's network. The bill breaks it into three buckets, priced very differently from each other:

Internet egress. Data that leaves the provider's edge and reaches the public internet (your users, third-party APIs, on-prem systems). Most expensive bucket. Tiered by monthly volume.
Inter-region egress. Data moving between two regions of the same provider (e.g. ap-south-1 to us-east-1 on AWS, or asia-south1 to us-central1 on GCP). Cheaper than internet, more expensive than zero, and almost never visualised on default dashboards.
Inter-AZ egress. Data moving between availability zones of the same region. The smallest line item in any specific request but the largest in volume for HA architectures, since every multi-AZ Postgres replica, every Kafka broker spread across zones, every Application Load Balancer fan-out generates this traffic.

Public sources for the canonical pricing: AWS EC2 Data Transfer pricing, GCP VPC Network Pricing, Azure Bandwidth Pricing. Treat these as the single source of truth; vendor sales decks routinely round in the direction that flatters their position.

2. The 2024 EU Data Act and the new free-egress-on-exit policy

The most important regulatory change for cloud egress in the last two years is the EU Data Act, which entered into force on 11 January 2024. Articles 23 to 25 of the Act target what the EU called "unjustified obstacles to switching" between cloud providers, with the specific goal of removing egress fees as a switching barrier. The Act gave providers a transitional period and a final deadline of January 2027, after which all switching-related data transfer charges must be removed.

The hyperscalers responded in early 2024:

AWS announced free data transfer out to the internet when moving out of AWS on 5 March 2024. Customers must request the credit, fully migrate workloads, and close their AWS accounts to qualify.
GCP eliminated data transfer fees when migrating off Google Cloud in January 2024, ahead of AWS. Similar request-based credit model.
Microsoft extended free data transfers out for customers leaving Azure shortly after, with the same pattern.

What this does NOT change: your day-to-day egress bill. The free-egress-on-exit policies cover only the one-time migration scenario when you fully leave the provider. Webhook traffic to a third party, user downloads from your S3 bucket, cross-cloud replication between AWS and a GCP analytics warehouse, none of that is touched by the EU Data Act provisions. The hyperscalers continue to charge their standard tiered egress rates for everything other than the explicit exit case.

Practitioner opinion: the press coverage of "AWS makes egress free" in 2024 was misleading for SaaS operators. Treat the policy as a switching-cost relief valve, not a structural change to your operating bill.

3. Internet egress: the actual per-GB numbers in 2026

The headline-grabbing numbers from each provider's pricing page, current as of May 2026. Note that GCP and Azure express egress pricing by destination zone (where the traffic lands), while AWS prices by source region (where the traffic originates), so the comparison is region-pair dependent.

AWS internet egress, US East (us-east-1) source:

First 10 TB / month: $0.09 per GB
Next 40 TB: $0.085 per GB
Next 100 TB: $0.07 per GB
Over 150 TB: $0.05 per GB (committed contract pricing can go lower)
First 100 GB per month is free across all AWS accounts

AWS internet egress, Mumbai (ap-south-1) source:

First 10 TB / month: $0.1093 per GB
Next 40 TB: $0.085 per GB
Next 100 TB: $0.082 per GB
Over 150 TB: $0.075 per GB

India egress is roughly 20 percent more expensive than US East at the entry tier. On the numbers above, the gap closes in the 10 to 50 TB tier and reopens above 50 TB. The same shape repeats for Singapore, Sao Paulo, and other emerging-market AWS regions.
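Tiered pricing is easier to reason about as a function than as a table. The sketch below applies the two AWS rate cards listed above; it assumes 1 TB = 1,000 GB and that the free 100 GB comes off before the first tier, both of which are simplifications on my part.

```python
# (tier size in GB, price per GB); None marks the open-ended top tier.
US_EAST_1 = [(10_000, 0.09), (40_000, 0.085), (100_000, 0.07), (None, 0.05)]
AP_SOUTH_1 = [(10_000, 0.1093), (40_000, 0.085), (100_000, 0.082), (None, 0.075)]
FREE_GB = 100


def monthly_egress_cost(gb: float, tiers, free_gb: float = FREE_GB) -> float:
    """Bill each slice of traffic at the rate of the tier it falls into."""
    remaining = max(gb - free_gb, 0.0)
    cost = 0.0
    for size, price in tiers:
        chunk = remaining if size is None else min(remaining, size)
        cost += chunk * price
        remaining -= chunk
        if remaining <= 0:
            break
    return round(cost, 2)


for tb in (5, 30, 100, 200):
    gb = tb * 1_000
    us = monthly_egress_cost(gb, US_EAST_1)
    mumbai = monthly_egress_cost(gb, AP_SOUTH_1)
    print(f"{tb:>4} TB  us-east-1 ${us:>10,.2f}  ap-south-1 ${mumbai:>10,.2f}")
```

Swap in the rate card for your own region before trusting the output; the tier boundaries are the part that carries over, not the prices.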

GCP internet egress (worldwide destinations, excluding China and Australia):

First 1 TB / month: $0.12 per GB
Next 9 TB: $0.11 per GB
Over 10 TB: $0.08 per GB
Egress to a Google-owned destination (e.g. user traffic landing on the Google ASN): pricing varies by tier

GCP egress to Australia is priced separately and is higher; GCP egress to China is the most expensive of any provider-destination pair across the three clouds.

Azure internet egress (Zone 1 source: North America, Europe):

First 100 GB / month: free
Next 10 TB: $0.087 per GB
Next 40 TB: $0.083 per GB
Next 100 TB: $0.07 per GB
Over 150 TB: $0.05 per GB

Azure Zone 2 (India, Singapore, Hong Kong, Japan, Korea) is priced separately and runs around 10-15 percent higher than Zone 1 at every tier.

Cloudflare R2 (object storage, designed as an S3 alternative):

Internet egress: $0 per GB. Cloudflare publicly committed to zero egress fees at launch and has held the line.
You pay only for storage at $0.015 per GB-month and operations (Class A writes, Class B reads).

Backblaze B2 (object storage):

Internet egress: $0.01 per GB. Egress up to 3x your average stored data is free.

4. Inter-region egress: the bill that surprises HA architectures

The moment your architecture spans two regions, inter-region egress shows up. For high-availability database replicas, cross-region object storage replication, or multi-region Kafka, this becomes a meaningful line item that is rarely visible on default cost dashboards.

Reference numbers as of May 2026:

AWS: $0.02 per GB for inter-region transfer in the same continent (us-east to us-west), $0.05-$0.09 per GB for cross-continent (us-east to ap-south-1). Mumbai outbound to other AWS regions is in the $0.08-$0.09 range. AWS public pricing.
GCP: $0.02 per GB within North America, $0.05-$0.08 per GB cross-continent. India to North America runs at the upper end. Cloud Interconnect changes the calculation; see section 6.
Azure: $0.02 per GB Zone 1 to Zone 1, $0.05 per GB Zone 1 to Zone 2, $0.087 per GB Zone 2 outbound to any other zone.

Operational example. A startup running Aurora Postgres Multi-AZ in ap-south-1 with a cross-region read replica in us-east-1 will pay roughly $0.08-$0.09 per GB of WAL traffic shipped to the replica. For a transactional workload generating 200 GB of WAL per day, that is roughly $480-$540 over a 30-day month on cross-region replication egress alone, on top of the database instance cost. Most early-stage teams do not see this line item because it is bundled into a generic "Data Transfer" category on the Cost Explorer default view.
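The arithmetic behind that example is short enough to keep in a scratch file and rerun with your own replication volume. This sketch assumes a 30-day month and a flat per-GB rate; the function name is mine.

```python
def monthly_replication_cost(gb_per_day: float, price_per_gb: float, days: int = 30) -> float:
    """Cross-region replication egress for a steady daily volume."""
    return round(gb_per_day * days * price_per_gb, 2)


low = monthly_replication_cost(200, 0.08)
high = monthly_replication_cost(200, 0.09)
print(f"200 GB/day of replication traffic: ${low:,.0f} to ${high:,.0f} per month")
```

At 200 GB per day and $0.08-$0.09 per GB this works out to about $480 to $540 for a 30-day month.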

5. Inter-AZ egress: the invisible HA tax

Same provider, same region, different availability zones. The smallest per-GB number on the bill, the largest cumulative line item for properly-architected HA systems.

AWS: $0.01 per GB in each direction (so $0.02 per GB round-trip) for inter-AZ. Same number across all regions. AWS Data Transfer within the same Region.
GCP: $0.01 per GB within the same region across zones, charged on the sender side only.
Azure: $0.01 per GB Availability Zone egress within a region.

Where this surprises teams:

Kafka clusters spread across three AZs. Default replication factor 3 means every produced byte is shipped to two replica brokers, both in different AZs. A 500 MB / second produce rate becomes 1 GB / second of inter-AZ traffic, or about 86 TB / day. That is $860 / day, $26,000 / month, of pure inter-AZ egress on a single Kafka cluster. The AWS MSK pricing page does not show this; it appears in EC2 Data Transfer.
Cross-AZ database replicas. Aurora Multi-AZ does not incur inter-AZ egress (Aurora uses a shared storage layer that pre-replicates), but classic RDS Multi-AZ does. Cloud SQL HA same shape. Verify on your specific managed database before assuming.
EKS / GKE cluster pods talking to each other across AZs. The default Kubernetes scheduler does not consider AZ-affinity for inter-service traffic. A pod in zone A talking to a service IP that routes to a backend pod in zone B generates inter-AZ egress on every request.
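The Kafka figure above is worth recomputing for your own produce rate. A minimal sketch, assuming every follower replica sits in a different AZ from the leader, 1 GB = 1,000 MB, a 30-day month, and the $0.01 per GB rate used in the text. If your provider bills both the sending and the receiving side, pass the doubled rate.

```python
def inter_az_replication(produce_mb_per_s: float,
                         replication_factor: int = 3,
                         price_per_gb: float = 0.01) -> dict:
    """Estimate inter-AZ replication traffic and cost for a Kafka cluster."""
    replica_copies = replication_factor - 1          # followers that receive each byte
    gb_per_s = produce_mb_per_s * replica_copies / 1_000
    gb_per_day = gb_per_s * 86_400
    cost_per_day = gb_per_day * price_per_gb
    return {
        "tb_per_day": round(gb_per_day / 1_000, 2),
        "cost_per_day": round(cost_per_day, 2),
        "cost_per_month": round(cost_per_day * 30, 2),
    }


print(inter_az_replication(500))
print(inter_az_replication(500, price_per_gb=0.02))  # both directions billed
```

Consumer traffic is not included; consumers reading from a leader in another AZ add to the total unless rack-aware fetching keeps them local.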

Practitioner opinion: for a high-throughput SaaS at the seed stage and beyond, inter-AZ egress is often 20-40 percent of total egress spend. The default operating posture should be: place latency-sensitive call graphs in the same AZ via topology-aware routing, and accept the slightly reduced HA blast radius. Spreading a microservice mesh across three AZs by default, with no topology awareness, is operationally expensive and almost never delivers the HA benefit it implies.

6. Private connectivity: Direct Connect, Cloud Interconnect, ExpressRoute

If your egress volume to a specific destination crosses 5-10 TB / month, private connectivity becomes a real cost lever, not a luxury.

AWS Direct Connect. A 1 Gbps Dedicated Connection from an AWS Direct Connect location runs around $0.30 per port-hour plus data transfer at $0.02 per GB outbound to the internet (versus $0.09-$0.11 per GB on the standard egress path). Break-even versus standard egress: roughly 5-7 TB / month. Public reference: AWS Direct Connect pricing.

GCP Cloud Interconnect. Dedicated Interconnect at 10 Gbps runs around $1,700 / month for the port (regional availability dependent) plus $0.02 per GB outbound. Partner Interconnect at smaller commits (50 Mbps to 10 Gbps) at proportional pricing. Public reference: GCP Interconnect pricing.

Azure ExpressRoute. Local SKU starts at around $55 / month for 50 Mbps to a metro circuit; Standard SKU at 1 Gbps runs around $300 / month plus $0.025 per GB egress (Zone 1) on Metered plans, or unlimited egress on the Unlimited Data plan. Public reference: Azure ExpressRoute pricing.

For a high-traffic SaaS pushing 50-100 TB / month to a small set of large enterprise customers (typical B2B SaaS shape), private connectivity is the largest single FinOps lever. A 1 Gbps Direct Connect carrying 50 TB / month costs roughly $220 in port-hours for a single port at the $0.30 rate above (more once you add a redundant port, cross-connect, and carrier fees) and another $1,000 in data transfer, roughly $1,200-$2,000 all in, versus the same 50 TB at standard egress rates of $0.085-$0.09 per GB, which runs $4,250-$4,500. The savings compound at higher volumes.
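A rough model of the Direct Connect comparison, using the per-hour and per-GB figures quoted above. It assumes 730 hours in a month and 1 TB = 1,000 GB, and it deliberately leaves out cross-connect, carrier, and redundancy costs, so the break-even it prints is a floor rather than a forecast.

```python
HOURS_PER_MONTH = 730


def direct_connect_monthly(gb: float, port_hourly: float = 0.30,
                           dx_per_gb: float = 0.02, ports: int = 1) -> float:
    """Port-hours plus per-GB transfer over the private link."""
    return round(ports * port_hourly * HOURS_PER_MONTH + gb * dx_per_gb, 2)


def standard_monthly(gb: float, per_gb: float = 0.09) -> float:
    """Flat-rate approximation of standard internet egress."""
    return round(gb * per_gb, 2)


def break_even_gb(port_hourly: float = 0.30, dx_per_gb: float = 0.02,
                  std_per_gb: float = 0.09, ports: int = 1) -> float:
    """Monthly volume at which the port fee is paid back by the cheaper per-GB rate."""
    return round(ports * port_hourly * HOURS_PER_MONTH / (std_per_gb - dx_per_gb), 1)


volume_gb = 50_000
print("Direct Connect, 1 port :", direct_connect_monthly(volume_gb))
print("Direct Connect, 2 ports:", direct_connect_monthly(volume_gb, ports=2))
print("Standard egress        :", standard_monthly(volume_gb))
print("Break-even GB (1 port) :", break_even_gb())
```

On port and transfer fees alone the break-even sits near 3 TB / month; the fixed costs this sketch omits are what push a realistic threshold higher.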

Caveat: private connectivity adds operational complexity (circuit ordering through a carrier or DC partner, BGP peering, routing policy, redundancy planning). For workloads under 5 TB / month it is rarely worth the engineering time.

7. CDN egress: CloudFront, Cloud CDN, Azure Front Door

For consumer-facing or content-heavy SaaS, the right question is rarely "how do I cut origin egress" and almost always "how do I serve from a cache that does not bill origin egress on every hit." The CDN tier is where this happens.

AWS CloudFront. Per-GB pricing is broadly cheaper than direct S3 / EC2 egress in most regions, especially under the free 1 TB / month tier and the CloudFront Security Savings Bundle. India (Asia Pacific) CloudFront pricing: $0.109 per GB first 10 TB, $0.085 next 40 TB. North America: $0.085 first 10 TB, $0.080 next 40 TB. CloudFront-to-S3 origin pulls are free, which is the key economic property.

GCP Cloud CDN. Cache egress to internet (cache fill from origin is free for GCS origins in the same region). Tier 1 (worldwide destinations excluding Australia, China): $0.08-$0.12 per GB depending on volume. GCP Cloud CDN pricing.

Azure Front Door / Azure CDN. Standard tier egress $0.081-$0.087 per GB for Zone 1 destinations. Azure Front Door pricing.

Cloudflare (used as a CDN in front of any origin). Cloudflare's CDN egress to internet is included in the plan flat fee. The Free, Pro ($25 / month), and Business ($250 / month) plans all carry unmetered bandwidth for typical web traffic. Enterprise plans negotiate. For an early-stage SaaS, putting Cloudflare in front of an AWS / GCP / Azure origin and caching aggressively turns most of the egress bill into a flat monthly Cloudflare fee. This is the largest possible cost lever for a content-heavy or read-heavy workload.

Note: Cloudflare's Terms of Service section 2.8 historically restricted unmetered bandwidth for non-HTML / non-website traffic on lower tiers. Video streaming, large file distribution, and similar workloads can trip the AUP. Read the AUP before betting your architecture on "unmetered."

8. Object storage egress: S3, GCS, Azure Blob, R2, B2

Object storage egress deserves its own treatment because it is the single most common surprise on a startup's bill. Numbers as of May 2026:

S3 internet egress (us-east-1): $0.09 per GB tier 1 (uses the same EC2 Data Transfer Out tiering).
S3 internet egress (ap-south-1): $0.1093 per GB tier 1.
S3 to CloudFront: free ("origin fetch"). This is why CDN-fronted S3 is the standard pattern.
GCS internet egress (worldwide, tier-1 excluding China and Australia): $0.12 per GB first 1 TB, $0.11 per GB next, $0.08 per GB over 10 TB.
Azure Blob internet egress (Zone 1): $0.087 per GB first 10 TB.
Cloudflare R2 internet egress: $0 per GB. Architecturally the most disruptive option for egress-heavy workloads.
Backblaze B2 internet egress: $0.01 per GB, with egress up to 3x your average stored data free.

For pure object storage backed by frequent egress (CDN origin for static assets, software downloads, media libraries, on-demand video), the gap between R2 / B2 and the hyperscalers is structural. A 100 TB / month egress workload runs roughly $7,700-$8,500 on S3 / GCS / Blob at the tiered rates in section 3, $1,000 on B2, and effectively zero on R2 (only the storage and operations fees, around $1,500 / month for 100 TB stored).
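To compare providers at a given volume, the tiered rates from section 3 can be run side by side. This sketch assumes 1 TB = 1,000 GB, ignores every free allowance (the AWS and Azure 100 GB, the B2 3x allowance), and leaves out request fees; the dictionary keys are my own labels.

```python
# (tier size in GB, price per GB); None marks the open-ended top tier.
EGRESS_TIERS = {
    "S3 (us-east-1)": [(10_000, 0.09), (40_000, 0.085), (100_000, 0.07), (None, 0.05)],
    "GCS (worldwide)": [(1_000, 0.12), (9_000, 0.11), (None, 0.08)],
    "Azure Blob (Zone 1)": [(10_000, 0.087), (40_000, 0.083), (100_000, 0.07), (None, 0.05)],
    "Backblaze B2": [(None, 0.01)],
    "Cloudflare R2": [(None, 0.0)],
}
R2_STORAGE_PER_GB_MONTH = 0.015


def tiered_cost(gb: float, tiers) -> float:
    """Walk the tiers in order, billing each slice at its own rate."""
    cost, remaining = 0.0, gb
    for size, price in tiers:
        chunk = remaining if size is None else min(remaining, size)
        cost += chunk * price
        remaining -= chunk
    return round(cost, 2)


egress_gb = 100_000
for provider, tiers in EGRESS_TIERS.items():
    print(f"{provider:<22} ${tiered_cost(egress_gb, tiers):>9,.2f} egress")

print(f"R2 storage for 100 TB  ${egress_gb * R2_STORAGE_PER_GB_MONTH:>9,.2f}")
```

Tiered billing lands a little below a flat-rate estimate at this volume because half of the 100 TB falls into the $0.07-$0.08 tiers.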

Practitioner opinion: if you are running a static-asset-heavy SaaS and your egress bill is more than $2,000 / month, R2 or B2 should be on your six-month roadmap. The migration is mechanical and the savings recover the engineering time within one to two billing cycles.

9. Which SaaS workload shapes get hurt most by egress

Some workloads are egress-light by nature; others are structurally egress-heavy. Recognising your shape early matters because the architectural response is different for each.

API-only SaaS (CRM, accounting, project management). Egress is usually 5-15 percent of bill. Response sizes are small, JSON payloads compress well, mostly TLS overhead. Low priority for egress optimisation work.
Webhook-heavy fintech and notification platforms. Outbound webhook delivery to thousands of external endpoints, often retrying on failure. Egress can run 15-30 percent of bill. Look at retry-storm patterns, exponential backoff configuration, and dead-letter queues before optimising the data path itself.
Media-heavy SaaS (video editing, photo sharing, podcast hosting). Egress is often 40-70 percent of the bill once the user base crosses a few thousand active accounts. R2 / B2 plus aggressive CDN caching is the structural fix. Origin-served media without a CDN is a financial mistake at scale.
Data and analytics SaaS (BI, data warehouse, observability). Egress shows up two ways: customer-facing exports (CSV / Parquet downloads) and cross-cloud replication if the analytics tier lives somewhere different from operational data. Cross-cloud replication is the more dangerous of the two because it is steady, predictable, and rarely visible to the engineering team.
AI inference SaaS. Egress includes both the response payloads (large for image / video generation, small for text) and any audio / video streaming back to the client. For a video-generation SaaS pushing 50-200 MB outputs at scale, egress can equal compute spend.
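For the webhook-heavy shape, the retry policy is usually the first thing to check. The sketch below shows one common pattern, capped exponential backoff with full jitter and a dead-letter list once attempts run out. The function names, the attempt limit, and the in-memory dead-letter list are illustrative assumptions, not a reference to any particular queueing product.

```python
import random
import time


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0, rng=None) -> float:
    """Full jitter: a uniform delay between 0 and min(cap, base * 2**attempt)."""
    rng = rng or random.Random()
    return rng.uniform(0, min(cap_s, base_s * 2 ** attempt))


def deliver(send, payload, dead_letter: list, max_attempts: int = 5,
            sleep=time.sleep, rng=None) -> bool:
    """Try a webhook up to max_attempts times, then park it instead of retrying forever."""
    for attempt in range(max_attempts):
        if send(payload):
            return True
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt, rng=rng))
    dead_letter.append(payload)
    return False


# Demo with an endpoint that never answers; sleep is stubbed so the demo runs instantly.
parked = []
ok = deliver(lambda p: False, {"event": "invoice.paid"}, parked, sleep=lambda s: None)
print(ok, parked)
```

The bounded attempt count is what protects the egress bill: a dead endpoint costs five deliveries, not five thousand.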

10. Six engineering tactics that move the egress bill

In order of leverage, highest first:

(a) Put a CDN in front of everything that can cache. CloudFront, Cloud CDN, Azure Front Door, Cloudflare. The economics of S3-to-CloudFront-free and Cloudflare's flat-fee bandwidth make this the single biggest cost lever for any read-heavy workload. If you are not running a CDN today, this is week-one work.

(b) Move static-asset and download-heavy object storage to R2 or B2. If your egress is dominated by static-asset serving, the price differential is too large to ignore. R2 specifically eliminates the egress line item entirely. The S3 API compatibility makes the migration a config change for most SDK-based workflows.

(c) Topology-aware AZ routing for inter-service traffic. In Kubernetes, use service topology hints or the TopologyAwareRouting feature to keep client-server traffic in the same AZ when possible. In AWS classic VPC architectures, place tightly-coupled services (web server + cache + database) in the same AZ. Accept that one-AZ-down loses that service tier, and rely on multi-AZ ELB / ALB for fan-out resilience rather than mesh-level multi-AZ chatter.

(d) Compress everything. gzip, Brotli, and zstd at the application level for HTTP responses. zstd at the storage tier for cold data. For JSON-heavy APIs, Brotli at quality 4-6 typically compresses 60-75 percent versus uncompressed, and your egress bill drops in roughly the same proportion for that traffic.
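To get a feel for what compression is worth on your own payloads, measure it. The sketch below uses gzip from the Python standard library as a stand-in, since Brotli and zstd need third-party packages; the synthetic payload is far more repetitive than real API traffic, so expect a lower ratio on production responses.

```python
import gzip
import json


def compression_saving(payload: bytes, level: int = 6) -> float:
    """Fraction of bytes saved by gzip at the given level (0.7 means 70 percent smaller)."""
    compressed = gzip.compress(payload, compresslevel=level)
    return 1 - len(compressed) / len(payload)


# Synthetic JSON list response; field names and values are made up for the demo.
records = [
    {"id": i, "status": "active", "plan": "starter", "region": "ap-south-1"}
    for i in range(500)
]
body = json.dumps(records).encode("utf-8")

print(f"uncompressed: {len(body):,} bytes")
print(f"saving at gzip level 6: {compression_saving(body):.0%}")
```

Run the same function over a sample of captured responses from your busiest endpoints; that number, multiplied by the endpoint's share of egress, is the saving to expect.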

(e) Replace cross-cloud or cross-region replication with private connectivity. If you have a steady 5+ TB / month flowing between AWS and GCP, or between two AWS regions, the economics of Direct Connect / Interconnect / ExpressRoute pay back inside a quarter for most volume tiers. Combine with replication-friendly database engines that ship deltas rather than full rows.

(f) Audit your Cost Explorer for the "Data Transfer" bucket every month. Most early-stage teams look at compute and database costs first, egress last. Flip that order. The biggest single optimisation discovery I have personally found across audits is a misconfigured cross-region replication shipping a database in real time to a region that was supposed to be the cold DR target. Six months of $4,000 / month bills before anyone noticed.

11. The honest summary table

Workload	AWS	GCP	Azure	Best alternative
Internet egress, US source, <10 TB	$0.09/GB	$0.12/GB	$0.087/GB	Cloudflare R2 ($0)
Internet egress, India source, <10 TB	$0.1093/GB	$0.12/GB	$0.10/GB (Zone 2)	Cloudflare R2 ($0)
Inter-region (same continent)	$0.02/GB	$0.02/GB	$0.02/GB	Private connectivity if >5 TB / mo
Inter-region (cross-continent)	$0.05-$0.09/GB	$0.05-$0.08/GB	$0.05-$0.087/GB	Private connectivity, async replication
Inter-AZ same region	$0.01/GB each way	$0.01/GB sender	$0.01/GB	Topology-aware routing
Free-egress on exit (EU Data Act)	Yes, since Mar 2024	Yes, since Jan 2024	Yes, since 2024	One-time only
Object storage egress (heavy CDN origin)	S3 $0.09/GB direct, free to CloudFront	GCS $0.12/GB direct, free to Cloud CDN	Blob $0.087/GB direct, free to Azure CDN	R2 ($0 egress), B2 ($0.01/GB)
CDN egress (cached delivery)	CloudFront $0.085-$0.109/GB	Cloud CDN $0.08-$0.12/GB	Front Door $0.081-$0.087/GB	Cloudflare (flat plan fee)

12. Stage-specific recommendations

Pre-seed (1-5 engineers, <$5k / month cloud bill). Egress is probably 5-10 percent of your bill. Do not over-engineer. Put Cloudflare in front of your origin (free or $25 Pro tier), enable gzip on every endpoint, leave the rest alone. The opportunity cost of optimising egress at this stage is much higher than the dollar savings.

Seed (5-20 engineers, $5k-$30k / month cloud bill). Egress is probably 10-25 percent of bill. Audit the Data Transfer line on Cost Explorer / Billing once a month. If you are serving static assets, move them behind Cloudflare with aggressive caching; if you are running cross-region replication, verify it is necessary and configured efficiently. For media-heavy workloads, evaluate R2 / B2 migration as a one-quarter project.

Series A (20-50 engineers, $30k-$200k / month cloud bill). Egress is probably 20-40 percent of bill. Hire or assign a part-time FinOps owner. Audit inter-AZ traffic patterns (especially Kafka and Kubernetes service mesh). Evaluate Direct Connect / Cloud Interconnect / ExpressRoute for the top 2-3 destinations. Consider negotiating committed egress pricing with your account team; at this volume, 15-25 percent discounts versus published rates are routinely available with a 1-3 year commit.

Series B and beyond. Egress economics start to drive architectural decisions: where to place compute relative to users, whether to operate your own edge POPs (rare, but real at the scale of Netflix, Cloudflare, Spotify), and whether multi-cloud is paying for itself or quietly bleeding 1.6-1.8x on egress with no offsetting benefit.

The trap: free-egress-on-exit makes the day-to-day bill look smaller than it is

The 2024 EU Data Act coverage made cloud egress sound like a solved problem in the press. It is not. The free-egress-on-exit policy applies only when you fully leave a provider, and even then you need to actively request the credit and close the account. Daily operational egress to your users, to third-party APIs, to your other cloud, continues to bill at standard tiered rates and remains one of the largest single optimisable line items on any high-traffic SaaS bill.

Treat egress as you would any other unbounded cost driver: instrument it, tag it, alert on anomalies, and assign a clear owner to optimise it. The teams I have seen most surprised by their egress bill are uniformly the teams that had no one looking at it month-over-month.

If you want a second opinion on your egress posture

I run a free 20-minute cloud cost audit for SaaS founders looking at high-traffic workloads. Pull your Cost Explorer / Billing report for Data Transfer for the last 90 days; bring the breakdown; I will give you a ranked list of the three highest-leverage optimisations specific to your architecture, with rough payback timelines. No NDA needed for the first conversation. Send a note.

Avinash S is the founder of MatrixGard. Fractional DevSecOps for pre-seed and seed startups across India, the GCC, the UK, and the US. Almost a decade of running production workloads across AWS, GCP, and Azure, including egress-heavy CDN, media, and data-replication architectures.

Methodology note. All pricing references taken from public AWS, GCP, and Azure pricing pages, plus the public Cloudflare and Backblaze pricing pages, current as of May 2026. Regulatory references taken from the European Commission's Data Act materials and the public AWS / GCP / Azure announcement blogs on free-egress-on-exit. Vendor sales decks and analyst reports were not used. Cloud pricing changes quarterly; verify the specific numbers against the source pages before committing them to a budget. Operational opinions are mine, labelled inline. The summary table aggregates published prices and rounds to the nearest commonly-cited tier; reasonable practitioners working from the same primary sources will arrive at substantially the same conclusions, though stage-specific recommendations vary by workload shape.

PCI DSS 4.0 in 2026: The 9 Most-Missed Requirements for Pre-Seed Fintech CTOs

noreply@matrixgard.com (Avinash S) — Tue, 19 May 2026 08:00:00 GMT

PCI DSS 4.0 has been fully in force globally since March 31, 2025. By May 2026, every entity touching cardholder data, whether a payment-processing startup or an e-commerce shop accepting card payments, is expected to be compliant against the updated standard. Yet most pre-seed and seed fintech teams are still operating against PCI DSS 3.2.1 mental models. The result: their first formal assessment lands with avoidable failures.

This is not another generic walkthrough of the 12 PCI DSS requirement areas. The PCI Security Standards Council publishes those, and they are exhaustive. This is a focused list of the nine requirements I see pre-seed and seed fintechs most often miss when preparing for their first assessment, drawn from public PCI Council documentation and patterns common across early-stage cloud-native startups.

Each section names the requirement, cites its PCI DSS 4.0 reference, explains why startups miss it, and outlines what "passing" actually looks like at the cloud-native engineering level. Where I am stating practitioner opinion rather than the standard's text, I have labelled it inline.

Quick context: what changed in 4.0

PCI DSS 4.0 was published by the PCI Security Standards Council in March 2022. The transition timeline was: PCI DSS 3.2.1 retired in March 2024, and the "future-dated" requirements (the most operationally demanding changes) became mandatory in March 2025. As of 2026, full 4.0 compliance is required for any merchant or service provider in scope.

Four structural shifts matter for the engineering team:

Customised approach: a new option (Appendix D, with sample templates in Appendix E) that lets entities meet a requirement through alternative controls, provided they document a Targeted Risk Analysis. This is genuine flexibility, but it adds documentation overhead.
Continuous focus, not point-in-time: many controls now require ongoing monitoring rather than annual proof.
Multi-factor authentication everywhere: no longer just for admin access.
Stronger cryptography and inventory requirements: full crypto inventories, mandatory keyed hashing, and longer minimum password lengths.

With that frame, here are the nine requirements I see startups miss most.

1. MFA on ALL access to the CDE, not just admin

PCI DSS 4.0 Reference: Requirement 8.4.2

Under 3.2.1, multi-factor authentication was required only for non-console administrative access to the cardholder data environment (CDE) and for remote access. Under 4.0, MFA is required for all access into the CDE, regardless of whether the user is an administrator or a regular employee. This is the single most common gap I see at startups.

The typical failure pattern: the engineering team has MFA enforced on their cloud console (AWS Console, GCP Console) for admin roles via IAM Identity Center or similar. But the backend admin portal that customer support staff use to look up a transaction, a portal that touches cardholder data, only requires a password. That portal is now non-compliant.

What passing looks like: every system that stores, processes, or transmits cardholder data, plus every system connected to the CDE, enforces MFA for all users. For cloud-native startups this typically means: IAM Identity Center with MFA enforced at the SSO layer for all human access, plus application-level MFA on internal admin portals via your auth provider (Auth0, Clerk, WorkOS).

2. Twelve-character minimum passwords

PCI DSS 4.0 Reference: Requirement 8.3.6

The minimum password length under 3.2.1 was seven characters. Under 4.0 it is twelve characters (or eight where a system cannot support twelve), and passwords must contain both numeric and alphabetic characters. Most pre-seed startups still have their authentication providers configured to the shorter minimum that was the industry standard a decade ago.

This sounds trivial. It is not. Changing the minimum length triggers password resets for the existing user base, which means a forced support workload spike on the day of the change. Startups that defer this hit the deadline scramble at month eleven of their compliance prep.

What passing looks like: your IdP (Okta, Microsoft Entra ID, Auth0) password policy is updated to a 12-character minimum, and the change is rolled out with enough communication time that users do not get locked out. It is a single configuration change in the Auth0 or Okta admin console, but plan it for a low-traffic week.

3. Authenticated internal vulnerability scans

PCI DSS 4.0 Reference: Requirement 11.3.1.2

Under 3.2.1, internal vulnerability scans had to be performed quarterly but could be unauthenticated (the scanner did not need to log in to the systems it was scanning). Under 4.0, internal vulnerability scans must be authenticated, meaning the scanner runs with credentials that allow it to inspect the actual configuration of each host.

The failure mode: startups run Nessus or Qualys scans without configuring credentialed scanning, then submit the results as evidence. The auditor flags this immediately. Authenticated scanning surfaces a different (and larger) set of findings, because it can read configuration files, package versions, and patch levels that unauthenticated scanning cannot see.

What passing looks like: your vulnerability scanner is configured with a dedicated service account on each in-scope host (or via cloud-native agents) that has read-only access to package managers, registry/config stores, and OS-level metadata. Scans run at least quarterly, and reports are reviewed within an SLA. For containerised workloads this typically means using Amazon Inspector or equivalent.

4. Targeted Risk Analysis documentation

PCI DSS 4.0 Reference: Requirements 12.3.1 and 12.3.2

Under 4.0, the entity must perform and document a Targeted Risk Analysis (TRA) for every requirement it meets through the customised approach (Requirement 12.3.2) and for every requirement where the standard lets the entity set how often a control is performed (Requirement 12.3.1). For the customised approach, the TRA must justify that the alternative control addresses the risk at least as well as the defined approach would.

Most startups discover this requirement on the day they realise a particular defined-approach control will not work for them. They reach for the customised approach as a workaround, then learn that customised approach requires extensive TRA documentation: threat modelling, control effectiveness analysis, residual risk justification, annual review.

What passing looks like: a documented TRA for each requirement where you deviate from the defined approach. The PCI Council publishes a TRA template; use it. The TRA is a written artefact, not a verbal explanation to the auditor. Annual review is required, so calendar a TRA refresh review every twelve months.

Practitioner opinion: for a pre-seed startup, the customised approach is usually not worth the documentation overhead. Stick to the defined approach wherever possible and only invoke customised approach for the one or two genuinely awkward controls.

5. Detection of changes to payment pages (anti-skimming)

PCI DSS 4.0 Reference: Requirements 6.4.3 and 11.6.1

This is the single most consequential new requirement in 4.0 for e-commerce merchants and payment-page integrators. The standard now requires:

Req 6.4.3: a mechanism to authorise all scripts loaded on payment pages, plus an integrity check to detect unauthorised script changes.
Req 11.6.1: a change-and-tamper-detection mechanism that alerts the entity to unauthorised modifications of HTTP headers or the payment-page DOM.

The threat being mitigated here is Magecart-style attacks, where a malicious script is injected into a payment page and silently exfiltrates card data to an attacker-controlled domain. Most startups have no monitoring at all on their payment-page integrity.

What passing looks like: implementation of either Content Security Policy with strict source allowlisting, Subresource Integrity (SRI) hashes for every third-party script, or a payment-page monitoring tool (Source Defense, Imperva Client-Side Protection, Akamai Page Integrity Manager) that detects DOM/script changes in real time. For a pre-seed shop the cheapest viable path is CSP plus SRI, configured carefully and tested against the actual payment integration (Stripe, Razorpay, Adyen). Many fintechs offload this entirely to the payment processor by using a hosted payment page (Stripe Checkout, Razorpay Standard Checkout) where the merchant page never directly handles the card data, narrowing PCI scope.
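To make the CSP-plus-SRI path concrete, here is a minimal sketch that computes a Subresource Integrity value for a script and assembles a strict script-src policy. The helper names and the example domain are hypothetical. A hash pins an exact file, so SRI only suits scripts whose bytes you control or that are published as immutable versions; check your processor's guidance before pinning its loader script.

```python
import base64
import hashlib


def sri_hash(script_bytes: bytes) -> str:
    """Return an SRI integrity value (sha384, base64-encoded) for a script body."""
    digest = hashlib.sha384(script_bytes).digest()
    return "sha384-" + base64.b64encode(digest).decode("ascii")


def script_tag(url: str, script_bytes: bytes) -> str:
    """Render a script tag that the browser will refuse to run if the bytes change."""
    return (
        f'<script src="{url}" integrity="{sri_hash(script_bytes)}" '
        'crossorigin="anonymous"></script>'
    )


def build_csp(script_sources) -> str:
    """Build a Content-Security-Policy header value with an explicit script allowlist."""
    sources = " ".join(["'self'"] + list(script_sources))
    return f"default-src 'self'; script-src {sources}; object-src 'none'"


if __name__ == "__main__":
    body = b"console.log('analytics loaded');"  # stand-in for a vendored script file
    print(script_tag("https://cdn.example.test/analytics.js", body))
    print(build_csp(["https://cdn.example.test"]))
```

Generating the hash in the build pipeline, not by hand, keeps the integrity attribute in step with the file you actually ship, which is the evidence an assessor will want to see for script authorisation.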

6. Cryptographic inventory

PCI DSS 4.0 Reference: Requirement 12.3.3

The entity must maintain a documented inventory of all cryptographic cipher suites and protocols in use, reviewed at least annually. This includes both data-at-rest and data-in-transit cryptography, across all systems in scope.

Most cloud-native startups have no formal inventory. They know "we use TLS 1.2 or higher" and "we encrypt with AES-256" but cannot produce a written document listing: which TLS versions are enabled on which load balancers, which cipher suites are accepted, which KMS keys exist, which symmetric and asymmetric algorithms are used by which application, which hash functions are used for password storage, what the key rotation schedule is for each key.

What passing looks like: a single document (typically a spreadsheet or a Confluence page) listing every cryptographic algorithm, cipher suite, and key in use across the in-scope environment, mapped to the system that uses it, the rotation schedule, and the responsible team. Reviewed annually with a documented sign-off. This document is one of the highest leverage compliance artefacts to build early because it surfaces weak-cipher misconfigurations that would have been failures regardless of PCI DSS.
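If the inventory lives in a spreadsheet, the same rows can be checked mechanically. The sketch below is one possible shape, not a PCI-mandated schema: the field names, the weak-protocol list, and the 365-day review window are my assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed deny-list for illustration; extend it to match your own policy.
WEAK_PROTOCOLS = {"SSLv3", "TLS 1.0", "TLS 1.1"}


@dataclass
class CryptoEntry:
    system: str          # e.g. a load balancer or service name
    purpose: str         # "in transit", "at rest", "password hashing", ...
    algorithm: str       # protocol, cipher suite, or algorithm in use
    owner: str           # responsible team
    rotation_days: int   # key rotation schedule, 0 if not applicable
    last_reviewed: date


def review_findings(entries, today):
    """Return (system, issue) pairs for weak protocols and overdue reviews."""
    findings = []
    for entry in entries:
        if entry.algorithm in WEAK_PROTOCOLS:
            findings.append((entry.system, "weak protocol: " + entry.algorithm))
        if today - entry.last_reviewed > timedelta(days=365):
            findings.append((entry.system, "review overdue"))
    return findings


if __name__ == "__main__":
    inventory = [
        CryptoEntry("public-alb", "in transit", "TLS 1.2", "platform", 0, date(2026, 1, 10)),
        CryptoEntry("legacy-sftp", "in transit", "TLS 1.0", "platform", 0, date(2024, 11, 2)),
    ]
    for system, issue in review_findings(inventory, date(2026, 5, 19)):
        print(system, "-", issue)
```

Running a check like this on a schedule turns the annual sign-off into a confirmation step; it no longer has to be a discovery exercise.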

7. Anti-phishing controls

PCI DSS 4.0 Reference: Requirement 5.4.1

The entity must deploy automated mechanisms that detect and protect personnel against phishing attacks. This is a new explicit requirement in 4.0; under 3.2.1, anti-phishing was implicit under broader malware-protection language.

Most startups rely on the default phishing protection that ships with Google Workspace or Microsoft 365. That default is good but does not by itself satisfy the requirement. The standard expects active configuration plus visible evidence of detection capability.

What passing looks like: a documented anti-phishing technology stack (the email provider's protection settings, configured rather than at default, plus optionally a dedicated tool like Abnormal Security, Material Security, or Tessian for higher-risk environments) and quarterly phishing simulation runs with results tracked. For a pre-seed team, the cheapest viable path is enabling Google Workspace's advanced phishing protection settings (Strict mode, external sender warnings, encrypted external email warnings) plus running a quarterly phishing simulation via a free tier of KnowBe4 or GoPhish.

8. Manual code review for bespoke software in the CDE

PCI DSS 4.0 Reference: Requirement 6.2.4

Under 4.0, software developed internally for use in the CDE (custom and bespoke software) must be reviewed at least annually using either manual code review by qualified personnel or automated tools (or both). The wording is important: "either" is acceptable, but pure reliance on automated SAST scanning without any manual review is not sufficient if the SAST tool has known limitations on the language or framework used.

The failure pattern: a startup runs GitHub Advanced Security or Snyk Code, generates a clean scan report, and assumes that suffices. The auditor asks: what does the SAST tool's documentation say about its coverage of your stack? If there are known gaps (and there always are: SAST tools struggle with custom DSLs, complex business-logic vulnerabilities, and certain serverless patterns), some level of manual review is required to compensate.

What passing looks like: automated SAST in CI/CD (GitHub Advanced Security, Snyk, Semgrep), plus an annual targeted manual review of the in-scope code paths by either a qualified team member or an external code-review service. Documented review notes, not just the SAST report.

9. Continuous monitoring for service providers

PCI DSS 4.0 Reference: Requirement A.3.5

For entities that meet the definition of a service provider (which includes most B2B fintech startups that process or store cardholder data on behalf of another entity), 4.0 introduces continuous monitoring obligations that go beyond the annual assessment. Service providers must perform and document ongoing reviews of their PCI DSS scope, the in-scope systems, and the effectiveness of their controls.

Pre-seed and seed fintechs often miss this because they treat PCI DSS compliance as a one-and-done event (pass the assessment, file the AOC, ship). The standard now expects ongoing operational rigour: quarterly internal reviews of scope changes, ongoing control effectiveness validation, change-driven re-assessment when the architecture shifts.

What passing looks like: a documented quarterly compliance review cadence with assigned owner, output artefacts (a quarterly compliance status report), and evidence of scope re-validation when significant architectural changes occur. This is calendar discipline more than engineering work, but startups that skip it find themselves scrambling to reconstruct evidence at re-assessment time.

The honest summary table

Most-missed requirement	PCI 4.0 Section	Typical fix effort
MFA on all CDE access, not just admin	8.4.2	1-2 weeks (IdP reconfiguration + comms)
12-character minimum passwords	8.3.6	1 day (IdP config) + 1-2 weeks for rollout
Authenticated internal vulnerability scans	11.3.1.2	2-4 weeks (scanner credentials, agent rollout, baseline)
Targeted Risk Analysis documentation	12.3.1 / 12.3.2	1-2 weeks per TRA (compounds quickly)
Payment-page change detection (anti-skimming)	6.4.3 / 11.6.1	2-6 weeks (CSP + SRI + monitoring tool)
Cryptographic inventory	12.3.3	1-2 weeks (audit + documentation)
Anti-phishing controls	5.4.1	1 week (Workspace/M365 config + sim setup)
Manual code review for bespoke software	6.2.4	Annual; 1-2 weeks per cycle
Continuous monitoring (service providers)	A.3.5	Ongoing; quarterly cadence

Stage-specific recommendation

If you are a pre-seed fintech (under 15 engineers) just starting PCI DSS scoping: reduce scope first. Use a hosted payment page (Stripe Checkout, Razorpay Standard, Adyen Drop-in) so your application never touches raw card data. This narrows PCI scope dramatically, typically from SAQ D to SAQ A or A-EP. Several of the requirements above either drop out of scope or become straightforward at that lower SAQ tier.

If you are a seed fintech processing card data through your own systems (SAQ D-merchant or SAQ D-service-provider): the nine requirements above are your highest-priority gaps. Order of operations: passwords (Req 8.3.6, fastest) and MFA (Req 8.4.2, near-fastest), then payment-page anti-skimming (Req 6.4.3 / 11.6.1, the highest-risk if missing), then cryptographic inventory (Req 12.3.3, foundational documentation), then the rest in the order above.

If you are a service-provider fintech with enterprise customers asking for an AOC: the continuous-monitoring requirement (A.3.5) is your enterprise-customer-facing signal. Build the quarterly review cadence early. Enterprise procurement teams will ask for evidence of ongoing compliance posture, not just an annual certificate.

The trap: assuming PCI DSS is a 12-month project

The most expensive mistake I see Indian and GCC fintech founders make is treating PCI DSS compliance as a 12-month preparation project culminating in an audit. The reality is closer to: PCI DSS becomes a baseline operational rhythm from the day cardholder data first touches your infrastructure. The annual assessment is just the visible checkpoint.

The startups that pass cleanly are not the ones who hire a consultant for a final-month sprint. They are the ones who built the controls in continuously from week one of touching card data. The gap requirements above are the ones that compound when deferred: passwords, MFA, crypto inventory, and TRA documentation all become exponentially harder to retrofit after the system has grown.

If you want a second opinion on your PCI DSS 4.0 scope and gaps

MatrixGard runs a free 20-minute PCI DSS scope and gap-readiness audit for pre-seed and seed fintech founders. Your specific cardholder data flow, your current SAQ tier, your most likely gaps against the 4.0 requirements, my honest read in 20 minutes. No NDA required for the first conversation. Send a note.

Methodology note. All requirement references taken from the PCI DSS v4.0 specification as published by the PCI Security Standards Council. The "most missed" framing is a practitioner opinion based on pattern frequency, not a published PCI Council statistic. Fix-effort estimates are practitioner ranges; actual effort varies with architecture and team maturity. The list is not exhaustive; PCI DSS 4.0 introduced 64 new requirements across its 12 principal requirement areas, and full compliance requires meeting all applicable controls for the entity's SAQ tier.

AWS vs GCP for Indian Fintech: The 12 Decision Points No One Writes About

noreply@matrixgard.com (Avinash S) — Fri, 15 May 2026 08:30:00 GMT

The standard AWS-vs-GCP comparisons online miss the realities that matter for an Indian fintech building in 2026. Most are written from a US-enterprise perspective. The factors that actually decide cloud choice for an RBI-regulated, India-incorporated fintech serving Indian users with a 5-50 person engineering team are different.

This is the breakdown across 12 decision points, with honest verdicts per factor. Both clouds are good. Neither is universally right. The right answer depends on which of these 12 you weight highest.

I have shipped production workloads on both AWS and GCP across most of the last decade, including India-region workloads with payment, KYC, and compliance scope. What follows is operational opinion grounded in that, plus public AWS, GCP, RBI, and MeitY documentation. Where I am stating opinion rather than fact, I have labelled it as such.

1. India region maturity and latency

AWS opened Mumbai (ap-south-1) in June 2016 and Hyderabad (ap-south-2) in November 2022. Three Availability Zones in Mumbai, three in Hyderabad. The Mumbai region carries almost every AWS service within months of US launch and has the densest CloudFront edge network in India (Mumbai, Chennai, Delhi, Hyderabad, Bengaluru, Kolkata).

GCP opened Mumbai (asia-south1) in November 2017 and Delhi (asia-south2) in July 2021. Three zones each. Service coverage has caught up substantially since 2022, though a handful of services (some newer Vertex AI features, certain Anthos add-ons) still reach the Mumbai region 3-6 months after their US launch.

Verdict for Indian fintech: AWS wins on maturity, especially if you need active-active across two Indian regions for RBI Business Continuity Planning expectations. Hyderabad as a second AWS region is more mature than Delhi as a second GCP region today. Latency to users in Mumbai, Bengaluru, and Delhi is similar from both providers; the CDN footprints are comparable. Practitioner estimate: the maturity gap is closing by roughly 30-50% per year, so by late 2026 this factor becomes near-neutral.

2. RBI Data Localisation and regulatory comfort

The relevant policies for Indian fintech are: RBI Storage of Payment System Data 2018 (payment data must be stored only in India), RBI Master Direction on Outsourcing of IT Services 2023, and the DPDP Act 2023 rules notification.

Both AWS and GCP are listed as eligible cloud service providers in MeitY's empanelment. Both publish RBI-aligned shared-responsibility models. Both offer India-resident customer data isolation, region-locked storage, and contractual commitments around regulator access. Customers of both have passed actual RBI inspections.

The operational difference is in how much paperwork the vendor already has signed for Indian regulators. AWS has had more Indian banks and NBFCs as customers for longer, which means standard MSAs already include RBI-acceptable clauses (data residency, audit rights, exit assistance, supervisory access). GCP has caught up, but for first-time RBI-regulated buyers the AWS legal package is more out-of-the-box.

Verdict: AWS, narrowly, on regulatory comfort. Once GCP has signed an MSA with you that includes the standard RBI clauses, the difference disappears. Plan an extra 2-4 weeks of legal review if you go GCP-first as a regulated Indian fintech.

3. Pricing for fintech-shaped workloads

The default pricing pages mislead. Indian fintech has a workload shape (compute + managed database + KMS + outbound bandwidth for webhooks + log retention) where the real cost lives in three line items: compute commit discounts, managed-database HA, and egress.

For equivalent on-demand compute (general-purpose VMs in Mumbai), GCP n2-standard pricing runs around 10-20% lower than AWS m6i in 2026, before any commit discount. With Committed Use Discounts (CUDs) of 1-year, GCP can drop another 30-35%. AWS Savings Plans (1-year, all-upfront) typically discount 35-50%. The math evens out at the upper commit tier; GCP wins at the no-commit floor.
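The commit arithmetic is easier to see normalised. The sketch below indexes AWS on-demand at 100 and applies the ranges quoted above. It assumes the GCP CUD discount applies on top of GCP's already-lower on-demand price, which is my reading of the paragraph, not a published formula.

```python
def effective_range(on_demand_low, on_demand_high, discount_low, discount_high):
    """Return (best_case, worst_case) cost after a commit discount.

    best case  = cheapest on-demand price with the deepest discount
    worst case = priciest on-demand price with the shallowest discount
    """
    best = on_demand_low * (1 - discount_high)
    worst = on_demand_high * (1 - discount_low)
    return best, worst


if __name__ == "__main__":
    # AWS on-demand indexed at 100; GCP on-demand is 10-20% lower, i.e. 80-90.
    gcp = effective_range(80, 90, 0.30, 0.35)    # 1-year CUD: 30-35% off
    aws = effective_range(100, 100, 0.35, 0.50)  # 1-year Savings Plan: 35-50% off
    print("GCP with CUD:          %.1f to %.1f" % gcp)
    print("AWS with Savings Plan: %.1f to %.1f" % aws)
```

Under these assumptions the two committed ranges overlap almost entirely (roughly 52-63 for GCP against 50-65 for AWS), which is the "math evens out" point, while the uncommitted 80-90 versus 100 gap is where GCP's advantage sits.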

Managed databases: Cloud SQL for PostgreSQL HA is about 15-25% cheaper than equivalent AWS RDS Multi-AZ for the same vCPU + memory + storage spec in Mumbai region. Aurora pricing is higher than both but you are buying a different engine architecture. Spanner, GCP's globally distributed SQL database, has no AWS equivalent at the same consistency tier (DynamoDB global tables are eventually consistent at the table level; Spanner is strongly consistent at the row level globally).

Egress bandwidth, the line item most fintech founders ignore until the bill arrives: AWS lists Mumbai egress at $0.1093 per GB up to 10 TB/month. GCP lists Mumbai egress at $0.12 per GB up to 1 TB/month, then $0.11 / $0.08 per GB at higher tiers. AWS's Reserved Instances do not reduce egress; GCP's commits do not either. For a webhook-heavy fintech (payment notifications, account updates, sync to external KYC providers) egress can be 15-30% of the monthly bill.
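For a feel of what the tiers mean on an invoice, here is a small tiered-pricing sketch using the rates quoted above. The paragraph does not say where GCP's higher tiers begin, so the 10 TB boundary below is an assumption, as is counting 1 TB as 1024 GB. Only the first AWS tier is modelled because that is the only one quoted, and free allowances are ignored.

```python
GB_PER_TB = 1024  # assumption: binary terabytes


def tiered_cost(gb, tiers):
    """Cost of `gb` of egress under ascending tiers.

    tiers: list of (upper_bound_gb or None, price_per_gb); None means open-ended.
    """
    cost, lower = 0.0, 0.0
    for upper, price in tiers:
        if upper is None or gb <= upper:
            return cost + (gb - lower) * price
        cost += (upper - lower) * price
        lower = upper
    raise ValueError("usage exceeds the last tier; add an open-ended tier")


# Rates from the paragraph above; tier boundaries past 1 TB are assumed.
AWS_MUMBAI = [(10 * GB_PER_TB, 0.1093)]
GCP_MUMBAI = [(1 * GB_PER_TB, 0.12), (10 * GB_PER_TB, 0.11), (None, 0.08)]

if __name__ == "__main__":
    usage_gb = 5 * GB_PER_TB  # hypothetical 5 TB/month of webhook and API egress
    print("AWS: $%.2f" % tiered_cost(usage_gb, AWS_MUMBAI))
    print("GCP: $%.2f" % tiered_cost(usage_gb, GCP_MUMBAI))
```

At a few terabytes a month the two come out within a few percent of each other, so the lever is reducing the volume itself; the choice of provider moves the number far less.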

KMS: AWS KMS charges per key ($1/month per CMK) plus per request ($0.03 per 10,000 requests). GCP KMS charges $0.06 per active key version per month plus $0.03 per 10,000 operations. For a fintech with 50-200 CMKs (one per service per environment), the KMS line item is small on both clouds, though the per-key fee makes AWS the more expensive of the two.
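Plugging the listed rates into a quick calculation shows how the per-key fee and the per-request fee combine. The key count and request volume below are hypothetical, and the sketch assumes one active key version per key on GCP.

```python
def aws_kms_monthly(keys, requests):
    """AWS KMS: $1 per key per month plus $0.03 per 10,000 requests."""
    return keys * 1.00 + requests / 10_000 * 0.03


def gcp_kms_monthly(key_versions, operations):
    """GCP KMS: $0.06 per active key version per month plus $0.03 per 10,000 operations."""
    return key_versions * 0.06 + operations / 10_000 * 0.03


if __name__ == "__main__":
    keys = 200             # hypothetical: one key per service per environment
    requests = 10_000_000  # hypothetical monthly cryptographic operations
    print("AWS KMS: $%.2f/month" % aws_kms_monthly(keys, requests))
    print("GCP KMS: $%.2f/month" % gcp_kms_monthly(keys, requests))
```

With these inputs the request charge is identical on both sides, so the whole difference comes from the per-key fee. Either way the total is a rounding error next to compute and egress.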

Verdict: GCP cheaper at the no-commit floor and for moderate workloads. AWS competitive at high commit tiers (3-year Savings Plans). Honest call: a single early-stage fintech burning ₹5-15 lakh/month on cloud will save 10-25% on GCP. Past ₹50 lakh/month, the gap closes or reverses depending on commit posture.

4. Database choices that matter for ledger systems

This is the factor where the two clouds diverge most for fintech. The choice is rarely simple.

AWS: Aurora PostgreSQL/MySQL is the workhorse for transactional workloads. Aurora Serverless v2 scales between 0.5 and 256 ACUs without read-replica downtime. DynamoDB for high-throughput key-value, with Global Tables for multi-region. RDS Proxy for connection pooling. Redshift for analytical workloads. The fintech-standard stack is: Aurora for ledger + DynamoDB for hot lookups + S3 + Athena for cold analytics.

GCP: Cloud SQL for PostgreSQL/MySQL is operationally simpler than RDS, but lacks Aurora's high-throughput storage architecture. Spanner is the unique GCP capability, globally distributed strongly-consistent SQL with five-nines SLA, but pricing starts around $0.90/node-hour minimum, so the floor for a non-toy Spanner instance is roughly $650/month. Firestore for document/key-value. BigQuery for analytics, the strongest analytical database on either cloud by significant margin.

For an Indian fintech building a ledger system that needs strong consistency at scale (think: settling cross-border remittances or running an in-house wallet), Spanner is genuinely a category-of-one product. AWS does not have a direct equivalent.

For a fintech building a simpler ledger + reads-heavy analytics workload, BigQuery beats Redshift on time-to-insight and price-per-query for ad-hoc fraud and risk queries.

Verdict: GCP wins on analytics (BigQuery) and globally-distributed SQL (Spanner). AWS wins on the operational maturity of Aurora and the depth of the surrounding ecosystem (RDS Proxy, Aurora Serverless v2 autoscaling). For most Indian fintechs at seed stage, Aurora is the safer default. For a fintech that will live or die on real-time analytics, GCP is the better long-term bet.

5. IAM, credential management, and secret rotation

This is the factor I have the strongest opinion on, having operationally maintained both.

AWS IAM is more powerful, more granular, and more complex than GCP IAM. SCPs at the Organizations level, permission boundaries, resource-based policies, and policy simulators give you control that GCP cannot match. AWS IAM Access Analyzer surfaces unintended external sharing more comprehensively than GCP's IAM Recommender.

GCP IAM is simpler, more opinionated, and frequently safer-by-default. The killer feature: Workload Identity Federation for GKE, which eliminates static service account keys for pods. Pods authenticate as Kubernetes service accounts; GCP IAM maps those to GCP service accounts; no JSON keys distributed, no secrets to rotate. AWS has IRSA (IAM Roles for Service Accounts on EKS), which achieves a similar result, but the GCP implementation requires less ceremony.

Secret management: AWS Secrets Manager is mature, integrates with Lambda, RDS auto-rotation, and CloudWatch Events for custom rotation hooks. GCP Secret Manager is simpler, with versioning baked in, but lacks the same depth of automated-rotation hooks.

Verdict: GCP wins on default-safety (Workload Identity, simpler IAM, fewer ways to misconfigure). AWS wins on advanced control surface (SCPs, permission boundaries, organization-level governance). For a startup with a 5-15 person engineering team that does not have a dedicated cloud security engineer, GCP's defaults reduce risk. For a fintech that needs fine-grained policy control across hundreds of accounts, AWS is more capable.

6. PCI DSS scope and shared-responsibility nuances

Both clouds carry PCI DSS 4.0 attestation. Both publish the Responsibility Matrix and the AOC (Attestation of Compliance) for download.

The operational difference: AWS Marketplace has more PCI-scope tooling (log management, file integrity monitoring, vulnerability scanners) that integrates AWS-first. The major compliance-automation platforms used by Indian startups (Sprinto, Scrut, Drata, Vanta) all integrate AWS deeply; GCP integrations exist but cover fewer evidence sources. Practitioner estimate: for a fintech going through a first PCI assessment, AWS reduces evidence-collection friction by 20-40%.

Specific PCI DSS 4.0 control areas where AWS has more out-of-box options: log retention with immutability (S3 Object Lock + S3 Glacier for 1-year retention), file integrity monitoring (CloudWatch + Inspector + third-party tools), and network segmentation (more granular Security Group + NACL options than GCP firewall rules).

Verdict: AWS for a first PCI DSS assessment. GCP is fully capable but you will spend more engineering time wiring up evidence collection.

7. Networking for payment-gateway connectivity patterns

Indian fintech needs hybrid connectivity to: bank partners (often via leased lines or MPLS), payment switches (Mindgate, AGS, FSS), KYC providers (Karza, Hyperverge, Signzy), and Aadhaar AUA/KUA infrastructure (UIDAI-mandated VPN tunnels). The cloud needs to support direct-connect to all of these.

AWS Direct Connect has more India-resident colocation partners (CtrlS, NTT, Sify, Reliance Jio) and more pre-existing private connectivity to NPCI, NSE, BSE, and major Indian banks. AWS Transit Gateway as the hub for multi-VPC + on-prem networking is more mature than GCP's equivalent (Network Connectivity Center + Cloud Router).

GCP's Shared VPC is simpler than AWS's account-per-environment VPC peering pattern, and is a genuine operational advantage at the 5-50 engineer scale.

For Aadhaar-bound workloads (eKYC, Aadhaar-linked payouts), both clouds have customers operating UIDAI-approved AUA/KUA architectures. AWS has more documented reference architectures published by Indian fintechs.

Verdict: AWS for hybrid connectivity to Indian banking infrastructure. GCP for cleaner internal networking when you do not need many partner connections.

8. Kubernetes: EKS vs GKE

This is the clearest verdict on the list. GKE wins.

GKE Autopilot mode runs the control plane and node infrastructure for you, billed per-pod. EKS requires you to either run nodes (more ops) or use Fargate (more cost). GKE upgrades, network policy, and HPA work out-of-the-box without the EKS-typical add-on installation ceremony (aws-load-balancer-controller, cluster-autoscaler, external-dns, kube-state-metrics, etc.).

GKE pricing for the managed control plane is comparable to EKS at $0.10/hour per cluster. The hidden cost difference is operational: a typical Indian fintech engineering team will spend 0.5-1 FTE-equivalent on EKS operational toil that simply does not exist on GKE Autopilot.

Verdict: GKE, unambiguously, for any Indian fintech that does not already have deep EKS operational expertise. Alongside Spanner, it is a category-of-one product on GCP.

9. Serverless for India-specific bursty workloads

India has bursty traffic patterns that pure serverless suits well: NPS / TDS deadlines, IPL match windows, festival sale events, salary-day banking traffic.

AWS Lambda has the deepest ecosystem (custom runtimes, Lambda Layers, X-Ray integration, Step Functions for orchestration), the largest set of trigger sources, and the most mature observability tooling.

GCP Cloud Run is operationally simpler. Container-based, autoscale to zero, supports any runtime that builds to a container, billed per request + CPU-second. For a fintech that already builds Docker images for its services, Cloud Run is essentially "Lambda but you bring your own runtime, and the pricing model is cleaner." Cloud Run jobs and Cloud Run for Anthos add long-running and Kubernetes-bound variants.

Verdict: Cloud Run for simple HTTP-triggered services where you already have containerised builds. Lambda for event-driven workflows with a rich AWS trigger graph (S3, DynamoDB Streams, SQS, EventBridge). Most Indian fintechs will use both eventually; pick by where the first 5 services need to live.

10. Security observability and threat detection

AWS approach: a stack of independent services. GuardDuty (threat detection), Security Hub (aggregation + CIS benchmark), AWS Config (configuration drift), Amazon Inspector (vulnerability scanning), Macie (data classification), Detective (forensics), Audit Manager (compliance evidence). Each is good. Together, they are powerful but require integration effort.

GCP approach: Security Command Center as the unified pane. Bundled threat detection, vulnerability findings, sensitive-data discovery, posture management, and IAM Recommender all in one product. The Premium tier (required for most of the value) is expensive, but covers what AWS spreads across 5-7 separate services.

For a small fintech team (1-3 engineers responsible for cloud security), GCP's unified surface reduces operational fragmentation. For a larger team with a dedicated security engineer, AWS's specialised services give more depth per domain.

Verdict: GCP Security Command Center wins for small-team operational simplicity. AWS wins for advanced specialisation.

11. Indian talent availability

The hiring market is the factor most cloud-comparison articles ignore. For Indian fintech building in 2026, it is one of the most important.

AWS-certified engineers in India outnumber GCP-certified engineers roughly 5-7 to 1, based on public certification numbers, LinkedIn job posting data, and Naukri search ratios. AWS Solutions Architect is the most common cloud certification on Indian engineering resumes. GCP Professional Cloud Architect is rarer, and commands a 15-25% salary premium in 2026 because supply is constrained.

What this means operationally: if you build on AWS, you can hire mid-level cloud engineers from a pool of ~150,000 in India. If you build on GCP, the pool drops to ~25,000-40,000, and they are more expensive. For senior platform engineers (5+ years cloud-native), the gap narrows somewhat as senior engineers tend to be cloud-agnostic, but the rate premium for GCP senior is real.

The flip side: GCP engineers are often more recent (the certification programmes are newer), and the Indian GCP community runs a tighter set of regular meetups and conferences (GDG, Google Cloud Next India). The talent pool is small but higher-engagement on average.

Verdict: AWS for ease of hiring at mid-level. GCP for a smaller, more recent, more expensive pool. If your hiring runway is short, this factor alone may push you to AWS.

12. Marketplace and ecosystem

The AWS Marketplace has more compliance, security, and observability ISVs available with INR billing through Indian resellers. The compliance-automation platforms most Indian startups shortlist (Sprinto, Scrut, Drata, Vanta) integrate AWS first; GCP integrations exist but cover fewer evidence sources.

Indian managed-service-provider (MSP) ecosystem: AWS has the larger India MSP community by 3-4x. If you plan to outsource cloud operations to an Indian MSP (TCS, Infosys, Wipro, smaller specialists like Minfy, Searce, BluePi), AWS is the more common skill set.

GCP's marketplace has caught up substantially in 2024-2025 with the launch of GCP Marketplace India billing, but the depth of third-party offerings still trails AWS by roughly 2-3x in count.

Verdict: AWS for ecosystem depth and Indian MSP availability. GCP for native Google integrations (Workspace, BigQuery, Looker).

The honest summary table

Decision factor	AWS	GCP	Lean
India region maturity	3 regions, longer history	2 regions, catching up	AWS
RBI regulatory comfort	More pre-signed MSA paperwork	Capable but newer for Indian regulated buyers	AWS
Pricing (no commit)	Higher floor	10-20% cheaper floor	GCP
Pricing (3-year commit)	Aggressive Savings Plans	Strong CUDs	Roughly even
Ledger DB	Aurora, mature	Spanner, unique at scale	Depends on workload
Analytics DB	Redshift	BigQuery	GCP
IAM (default safety)	Powerful, complex	Simpler, safer defaults	GCP
IAM (advanced control)	SCPs, permission boundaries	Simpler, less granular	AWS
PCI DSS evidence collection	Deeper marketplace tooling	Fewer integrations	AWS
Hybrid connectivity (India banks)	More Direct Connect partners	Cleaner internal VPC model	AWS
Kubernetes	EKS, more ops	GKE Autopilot, less ops	GCP
Serverless	Lambda ecosystem	Cloud Run simplicity	Depends on workload
Security observability	Specialised, fragmented	Unified Security Command Center	GCP for small teams
Indian talent pool	5-7x larger	Smaller, more expensive	AWS
Marketplace + MSP	Deeper	Newer, narrower	AWS

The honest recommendation depending on your fintech stage

If you are a seed-stage Indian fintech with under 15 engineers and your first compliance gate is PCI DSS or RBI Master Direction: default to AWS. Lower legal friction, deeper ecosystem, easier hiring. The savings on GCP do not yet outweigh the operational overhead of a smaller talent pool and fewer integrations.

If you are a fintech where analytics and risk modelling are core differentiators: seriously consider GCP. BigQuery is enough of a category-of-one product that the rest of the trade-offs become acceptable.

If your engineering team has strong Kubernetes preferences and wants to spend zero time on cluster operations: GKE Autopilot makes GCP the better choice on day one, and the operational savings compound.

If you are building a globally-distributed ledger or a strong-consistency cross-region payment switch: Spanner is the right tool, and Spanner only exists on GCP.

If none of the above are decisive: AWS as default for Indian fintech in 2026, GCP for specific workloads where the unique capabilities (Spanner, BigQuery, GKE Autopilot) carry real weight.

The trap: defaulting to both

The mistake I see most often with Indian fintechs at the 30-50 engineer stage is "multi-cloud by accident." One team builds on AWS, another picks GCP for an analytics project, two years later the SRE team is maintaining two sets of IAM, two sets of networking, two sets of monitoring, two sets of compliance evidence. Cost increases roughly 1.6-1.8x for the same workload because the commit discount is split across two providers.

Pick one as primary. Use the other for one specific workload where the unique capability justifies the operational overhead. Resist the rest. Multi-cloud as a strategy is rarely a fit for a seed-stage Indian fintech; it is most often a sign that platform decisions were made by feature-team consensus rather than by an architect with the operational picture.

If you want a second opinion on your specific stack

I run a free 20-minute cloud audit for Indian fintech founders evaluating cloud choices. No NDA needed for the first conversation. Your specific workload, your specific compliance gates, my honest read on AWS vs GCP for your situation. Send a note.

Methodology note. Pricing references taken from public AWS and GCP pricing pages as of May 2026; numbers shift quarterly. Regulatory references taken from public RBI, MeitY, and IRDAI notifications. Operational opinions are mine, labelled inline. Where I have stated a verdict, the underlying tradeoffs are documented above; reasonable practitioners can weight them differently and arrive at the opposite call.

AWS S3 Block Public Access: Four Settings, What Each One Does, and Why You Need All Four

noreply@matrixgard.com (Avinash S) — Tue, 12 May 2026 17:30:00 GMT

The pattern doesn't start with a hacker. It starts with a developer in a hurry.

Someone needs to share a file with a vendor. They right-click the S3 object, click "Make public," see it works, move on. Six weeks later, a security researcher with a search index finds the URL.

That's how most S3 incidents actually begin. The breach is a checkbox that got flipped by someone who didn't know what the checkbox protected against.

AWS knows this. In November 2018, they shipped a feature called Block Public Access to fix it. In April 2023, they made the strict version the default for every new bucket. In 2026, public S3 misconfigurations still appear regularly in disclosed breaches, often on buckets created before 2023 or accounts where Block Public Access was deliberately switched off.

This post is the boring reference your team should have read before configuring a bucket. Four settings, what each one does, and why none of them are individually enough.

The four settings

AWS Block Public Access is a set of four boolean controls. They sit at two levels: the AWS account and the individual bucket. The four:

Setting	What it blocks
`BlockPublicAcls`	New ACLs that grant public access. Existing public ACLs continue to work.
`IgnorePublicAcls`	All public ACLs are ignored at evaluation time. Public ACLs continue to exist but have no effect.
`BlockPublicPolicy`	New bucket policies that grant public access.
`RestrictPublicBuckets`	Anonymous and cross-account access to any bucket whose policy is public. Only AWS service principals and authorized users in the bucket owner's account get through.

These four are layered, not redundant. Each blocks a different way an S3 object can become public.

One. BlockPublicAcls

S3 has two access models. Bucket policies are JSON IAM-style documents. Bucket ACLs are an older system Amazon kept around for compatibility. ACLs let you grant access to specific AWS accounts, the bucket owner, the special AllUsers group (everyone on the internet), or the special AuthenticatedUsers group (anyone with an AWS account).

BlockPublicAcls=true prevents new ACLs being applied that grant access to AllUsers or AuthenticatedUsers. It also blocks PUT Object requests that include an ACL grant to those groups, and PUT Object requests with --acl public-read arguments. The API call returns AccessDenied instead of silently succeeding.

Important: this setting does not retroactively remove public ACLs that already exist. If a developer set an ACL last year before the setting was enabled, the object is still public until the ACL is removed.

Two. IgnorePublicAcls

This is the retroactive fix. IgnorePublicAcls=true tells S3 to treat any existing public ACL as if it doesn't exist when an access request comes in. The object stays in the bucket, the ACL stays on the object, but the public read never resolves.

Most teams enable BlockPublicAcls and IgnorePublicAcls together. The first blocks new mistakes. The second neutralises old ones.

Three. BlockPublicPolicy

ACLs are one path to a public object. Bucket policies are the other. A bucket policy that allows s3:GetObject to Principal: "*" makes every object in the bucket world-readable.

BlockPublicPolicy=true rejects any new bucket policy that would grant public access. Existing public policies continue to operate. This blocks the most common path teams take to share a bucket with the world: pasting a public-bucket policy template from Stack Overflow.
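To make "a policy that would grant public access" concrete, here is a minimal sketch that flags the classic pasted template. It is a rough heuristic, not AWS's own evaluation: S3's definition of "public" also weighs condition keys, and this sketch simply treats any statement with a Condition as not public. The function names and the bucket name `example-bucket` are placeholders.

```python
import json


def statement_is_public(statement: dict) -> bool:
    """Rough heuristic: an Allow with a wildcard principal and no Condition."""
    if statement.get("Effect") != "Allow":
        return False
    if statement.get("Condition"):
        # Conditions may or may not narrow access; this sketch does not evaluate them.
        return False
    principal = statement.get("Principal")
    if principal == "*":
        return True
    if isinstance(principal, dict):
        aws = principal.get("AWS")
        values = aws if isinstance(aws, list) else [aws]
        return "*" in values
    return False


def policy_is_public(policy_json: str) -> bool:
    """True if any statement in the policy looks public under the heuristic above."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    return any(statement_is_public(s) for s in statements)


# The kind of template that gets pasted in; "example-bucket" is a placeholder.
template = """{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": "*",
    "Action": "s3:GetObject",
    "Resource": "arn:aws:s3:::example-bucket/*"
  }]
}"""
print(policy_is_public(template))  # True
```

With BlockPublicPolicy on, S3 refuses to attach a policy like this one, so the mistake fails loudly at write time instead of silently succeeding.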

Four. RestrictPublicBuckets

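The strictest of the four. When enabled, S3 restricts access to any bucket that has a public policy to AWS service principals and authorized users within the bucket owner's account. Anonymous and cross-account requests are rejected. The bucket can still have a public policy attached. The policy is just non-functional for everyone outside the account.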

This is the setting that protects you from a bucket policy that already exists and grants public access. BlockPublicPolicy prevents new ones. RestrictPublicBuckets neutralises old ones.

Two levels, not one

These four settings can be configured at the bucket level and at the account level. The account level is an envelope that applies to every bucket.

If account-level BlockPublicAcls=true is set, every bucket in the account behaves as if it had BlockPublicAcls=true, regardless of what the bucket-level setting says. S3 applies the most restrictive combination of the two levels: a setting is in effect if it is true at either the account or the bucket, so the logical OR wins.

This matters because most accidental exposures happen at the bucket level. A developer with s3:PutBucketPublicAccessBlock permission can disable the bucket setting and turn the bucket public. They cannot do the same at the account level without s3:PutAccountPublicAccessBlock, which is normally restricted to a small group.

The clean rule: set all four at the account level, and only allow exceptions case by case. Most teams skip the account-level step. That's the gap.
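The way the two levels combine can be sketched in a few lines. This is an illustrative model only, assuming an unconfigured level behaves as all-false; the class and function names are made up for the example and are not part of any AWS SDK.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PublicAccessBlock:
    """The four Block Public Access flags; an unset level is modelled as all False."""
    BlockPublicAcls: bool = False
    IgnorePublicAcls: bool = False
    BlockPublicPolicy: bool = False
    RestrictPublicBuckets: bool = False


def effective_settings(account: PublicAccessBlock, bucket: PublicAccessBlock) -> PublicAccessBlock:
    """Combine both levels: a flag is in effect if either level turns it on."""
    return PublicAccessBlock(
        **{
            f.name: getattr(account, f.name) or getattr(bucket, f.name)
            for f in fields(PublicAccessBlock)
        }
    )


# Account has all four on; a developer has switched every bucket-level flag off.
account = PublicAccessBlock(True, True, True, True)
bucket = PublicAccessBlock()
print(effective_settings(account, bucket))
```

In this model the bucket-level change has no effect, because the account-level envelope still holds all four flags on. That is the practical argument for setting them at the account level first.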

The April 2023 default change everyone forgets

In April 2023, AWS changed the defaults for new S3 buckets. All four Block Public Access settings now default to true. ACLs are disabled by default. A new bucket created in 2024 or later is private out of the box.

This sounds like the end of the problem. It isn't, for three reasons:

Pre-2023 buckets retain their old configuration. A bucket created in 2019 with all four settings off is still that way unless someone explicitly remediated it.
Account-level defaults were not changed automatically. Your account-level Block Public Access settings are whatever you set them to when you opened the account, or all-off if you never touched them.
The defaults only protect against accidental public access. Deliberately public buckets (static website hosting, public CDN origins) are still common, and once a bucket is intentionally public, every object inside inherits the risk.

The pattern we still see: an Indian seed startup creates an AWS account in 2021, gets a bucket public for a CDN, leaves account-level Block Public Access off, then later creates a private bucket assuming "AWS defaults are safe now." The new bucket is fine. The old one isn't. Account-level was never enabled.

The DPDP and RBI angle

For an Indian startup, public S3 isn't just a security mistake. It's a regulatory event.

Under the DPDP Act 2023, a Data Fiduciary is liable for personal data exposure regardless of intent. The penalty for a significant breach can reach Rs 250 crore. "We left a bucket public by accident" is not a defence under the Act. The duty is to maintain reasonable security safeguards, and exposing personal data through misconfigured S3 fails that test.

For RBI-regulated fintechs, the same exposure also triggers reporting obligations under the Cyber Security and Resilience Framework. The clock starts the moment the misconfiguration is discovered, internally or externally.

The technical fix for both regimes is the same: turn all four Block Public Access settings on, at the account level, and audit existing buckets for pre-2023 settings.

The five-minute audit

For each AWS account you operate:

# Check account-level Block Public Access
aws s3control get-public-access-block --account-id YOUR_ACCOUNT_ID

# Check every bucket
aws s3api list-buckets --query "Buckets[].Name" --output text | \
  tr "\t" "\n" | while read bucket; do
    echo "--- $bucket ---"
    aws s3api get-public-access-block --bucket "$bucket" 2>&1
  done

If any of the four settings return false, or the API returns NoSuchPublicAccessBlockConfiguration (meaning nothing has been configured at that level), that bucket is in the danger zone unless the account-level settings cover it.
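If you capture the output of those commands, a small helper can turn each response into a list of gaps. This is a sketch that assumes the CLI's default JSON output, where the flags sit under a `PublicAccessBlockConfiguration` key; the helper name `disabled_settings` is invented for the example, and any non-JSON text (such as the NoSuchPublicAccessBlockConfiguration error) is treated as "nothing configured".

```python
import json

REQUIRED = ("BlockPublicAcls", "IgnorePublicAcls", "BlockPublicPolicy", "RestrictPublicBuckets")


def disabled_settings(cli_output: str) -> list:
    """Return the settings that are not true in a get-public-access-block response."""
    try:
        config = json.loads(cli_output)["PublicAccessBlockConfiguration"]
    except (json.JSONDecodeError, KeyError, TypeError):
        # Error text instead of JSON: treat the level as having nothing configured.
        return list(REQUIRED)
    return [name for name in REQUIRED if config.get(name) is not True]


sample = (
    '{"PublicAccessBlockConfiguration": {"BlockPublicAcls": true, '
    '"IgnorePublicAcls": true, "BlockPublicPolicy": false, "RestrictPublicBuckets": false}}'
)
print(disabled_settings(sample))  # ['BlockPublicPolicy', 'RestrictPublicBuckets']
```

An empty list means all four settings are on at that level.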

The remediation in the AWS Console: S3, Block Public Access settings for this account, Edit, tick all four, Save. Then for each bucket that's intentionally public, document why, and add an exception only at the bucket level.

What this doesn't cover

Block Public Access is necessary, not sufficient. It does nothing about:

Pre-signed URLs that leak personal data
IAM users with overly broad S3 permissions
Cross-account bucket sharing through policies or ACL grants that name specific AWS accounts
Data accidentally written to a bucket that was never meant to hold it
Server-side encryption gaps

If you want the rest of the layered defence, that's the AWS Security Baseline for Indian Startups we maintain. Block Public Access is one of nine controls in it.

TL;DR

Four settings: BlockPublicAcls, IgnorePublicAcls, BlockPublicPolicy, RestrictPublicBuckets. Each blocks a different path to a public object. None of them is enough alone. Set all four, at the account level, for every AWS account you run.

For Indian operators, this is also a DPDP control. Treat it that way.

I Audited Five OTT Platforms With Browser Devtools. The Cache Headers Told a Story.

noreply@matrixgard.com (Avinash S) — Thu, 07 May 2026 06:30:00 GMT

A few weeks ago I was watching a cricket match on my phone. The stream dropped to what looked like 480p mid-over.

I cursed my wifi. Then I started wondering whether it actually was my wifi.

So I spent three weeks running technical audits across five OTT streaming platforms. Standard browser developer tools, signed in as a paying or registered user. No DRM bypass, no unauthorized access, no clever exploits. Just the network panel, the Performance API, and a careful eye on what each platform's player was actually doing on the wire.

What I found was less about whose stream is "best." It was about how differently platforms make architectural choices when solving the same problem: get video to a paying user reliably.

Same technical problem. Five completely different answers.

This piece pulls together what I observed. Platforms are anonymized A through E. The methodology section at the bottom explains what was measured and what wasn't.

The cache TTL finding that surprised me most

Streaming video works by chopping content into small segments (2 to 10 seconds each) and delivering them on demand. The CDN caches these segments at edge locations close to viewers. How long a segment stays in cache is set by a Cache-Control: max-age header.

Long cache: origin server gets hit rarely, costs are low. Short cache: origin server gets hit constantly, costs scale linearly with traffic.

Across the five platforms, segment cache TTLs ranged from 5 minutes to nearly a year for the same kind of asset.

Platform	Manifest TTL	Segment TTL
A (global hyperscale)	Signed, ~1hr expiry	Signed, ~1hr expiry
B (Indian market leader)	37 minutes	~1 year
C (Indian, mid-market)	2 minutes	5 minutes
D (Indian, regional)	~3 months	~3 months
E (global hyperscale)	Signed via private protocol	Signed

Read that table again.

Platform B caches each video segment for nearly a year. Platform C caches the same kind of object for five minutes. Both serve Indian users. Both run on commercial CDNs.

The difference is a deliberate engineering choice with massive cost implications.

A segment cached for a year hits origin roughly once per edge node and then serves from edge for everyone for as long as it stays in cache. A segment cached for 5 minutes hits origin every five minutes per edge node, multiplied by every edge node serving traffic. At scale, this is the difference between a CDN bill that works and one that doesn't.

The reason Platform B can cache aggressively: they treat segments as immutable. Once packaged, never changed. Platform C re-validates them constantly, probably out of caution about content updates, but the caution is unnecessary if your packaging pipeline is right.

This choice doesn't show up on any architecture diagram. But it separates teams that have thought hard about CDN economics from teams that haven't.
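The TTL arithmetic is easy to put numbers on. The sketch below is a deliberately crude upper bound. It assumes every edge node sees steady demand for the segment, refetches once per TTL expiry, and has no origin shield or tiered cache behind it. The edge-node count of 100 is a made-up figure for illustration, not something measured in the audit.

```python
import math


def origin_fetches(ttl_seconds: int, window_seconds: int, edge_nodes: int) -> int:
    """Upper-bound origin fetches for one segment over a time window.

    Each edge node fetches at least once, then once more per TTL expiry.
    """
    refreshes_per_edge = max(1, math.ceil(window_seconds / ttl_seconds))
    return edge_nodes * refreshes_per_edge


DAY = 24 * 60 * 60
EDGES = 100  # hypothetical edge-node count

print(origin_fetches(5 * 60, DAY, EDGES))     # 28800 fetches per day at a 5-minute TTL
print(origin_fetches(365 * DAY, DAY, EDGES))  # 100 fetches per day at a ~1-year TTL
```

Under these assumptions the short TTL costs 288 times more origin traffic per segment per day, and that ratio applies to every segment in the catalog that stays popular.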

URL signing: the security layer most platforms skip

When you watch a video, your player fetches segment URLs from the CDN. Whether those URLs are signed determines whether they can be shared.

Platform B signs every segment URL with an HMAC token that expires in about an hour. The URL is bound to a session. Try to use it from a different IP or after expiry, and you get a 403.

Platforms C and D ship plain, unsigned URLs.

Anyone who pulls a URL from their browser's network panel can paste it into another browser, on another network, and stream the content directly. With Platform D's months-long cache TTL, a leaked URL stays valid for an absurdly long time.

The DRM on the segment bytes still protects against re-distribution of decrypted content. But unsigned URLs eliminate the first layer of defense. They make scraping easier. They make casual sharing trivially possible. They turn the CDN into a public file server with extra steps.

Most platforms that skip URL signing aren't doing it deliberately. They inherited a CDN config that didn't include token authorization, and nobody went back to fix it.
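For readers who have not seen token authorization up close, here is a minimal sketch of the idea: an HMAC over the path and an expiry timestamp, checked before the segment is served. It is not any CDN's actual token format; each vendor defines its own. It also binds only path and expiry, not session or IP. The secret, parameter names, and segment path are placeholders.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"demo-secret"  # placeholder; a real key would come from a secret store


def sign_url(path: str, ttl_seconds: int, now: Optional[int] = None) -> str:
    """Append an expiry and an HMAC token to a segment path."""
    expires = (int(time.time()) if now is None else now) + ttl_seconds
    token = hmac.new(SECRET, f"{path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return f"{path}?{urlencode({'expires': expires, 'token': token})}"


def verify_url(url: str, now: Optional[int] = None) -> bool:
    """Accept the URL only if the token matches and the expiry has not passed."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        token = params["token"][0]
    except (KeyError, ValueError):
        return False
    current = int(time.time()) if now is None else now
    if current > expires:
        return False
    expected = hmac.new(SECRET, f"{parsed.path}:{expires}".encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(token, expected)


url = sign_url("/vod/example/seg_0001.m4s", ttl_seconds=3600, now=1_000_000)
print(verify_url(url, now=1_000_000 + 60))    # True: inside the hour
print(verify_url(url, now=1_000_000 + 7200))  # False: expired
```

A copied URL stops working after expiry, and editing the path or the expiry invalidates the token. In a real deployment the check runs at the CDN edge, configured through the vendor's token-auth feature rather than hand-written.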

Where auth tokens live

This is the finding that surprised me least but matters most.

Every modern web platform stores a session token somewhere on the client. Two options: a cookie marked httpOnly (JavaScript on the page cannot read it), or localStorage (any JavaScript on the page can read it).

The pattern was striking:

Platform	Auth storage
A	httpOnly cookies only
B	httpOnly cookies only
C	Tokens duplicated across cookies and localStorage
D	OAuth2 access and refresh tokens in localStorage
E	httpOnly cookies + private protocol

Why does this matter?

If anyone successfully injects JavaScript into the platform's pages, through stored XSS, a compromised third-party SDK, or a malicious browser extension, they can read whatever's in localStorage and exfiltrate it. They cannot read httpOnly cookies. An injected script can still make requests that carry the cookie while the page is open, but it cannot read the raw token and ship it elsewhere.

Refresh tokens are the highest-stakes case. An access token is usually short-lived. A refresh token might be valid for days or weeks. An attacker who exfiltrates a refresh token can mint new access tokens long after the user has logged out and gone to bed.

Platforms that get this wrong usually have an architectural reason. A third-party SDK or a legacy OAuth flow that needed JavaScript access at some point. The fix is well-documented. The cost of not fixing it scales with your XSS exposure, which scales with your third-party JS footprint.

This is one of those "the cost is invisible until something goes wrong, and then the cost is enormous" patterns.
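On the server side the fix is a handful of cookie attributes. Below is a minimal sketch using Python's standard library to build the Set-Cookie value. The cookie name `session` and the token value are placeholders, and a real application would normally set these attributes through its web framework.

```python
from http.cookies import SimpleCookie


def session_cookie_header(token: str) -> str:
    """Build a Set-Cookie value that page JavaScript cannot read."""
    cookie = SimpleCookie()
    cookie["session"] = token
    cookie["session"]["httponly"] = True      # not exposed to document.cookie
    cookie["session"]["secure"] = True        # sent over HTTPS only
    cookie["session"]["samesite"] = "Strict"  # not attached to cross-site requests
    cookie["session"]["path"] = "/"
    return cookie["session"].OutputString()


print(session_cookie_header("opaque-session-id"))
```

The same attributes apply to refresh tokens: issued this way, a refresh token never appears in any storage that injected JavaScript can enumerate.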

Player choices: build, buy, or wrap

Three strategies for getting a video player on your platform.

Build it yourself. Platform A built Cadmium, an entirely proprietary player that talks to its CDN over a private protocol. Platform E went the same route. Multi-year investment, dedicated player team, only justified at hyperscale.

Buy a vendor. Platform D uses a commercial player engine bundled into their app. The vendor handles the player, the DRM integration, the ABR controller. The platform handles UI and CMS.

Wrap an open-source player. Platform B uses Shaka Player (Google maintains it) under their own branded wrapper with custom telemetry, DRM orchestration, and UI. Platform C does the same with Video.js.

For the longest time I assumed the "best" platforms wrote their own players. The audit data corrected me.

Platform B is widely considered best-in-class for its market. They use off-the-shelf Shaka with a thin wrapper. They wrote the parts that matter (telemetry, ABR memory, DRM caching) and let Google maintain the player engine.

If you're building an OTT at any scale below Netflix, you almost certainly don't need to write a player from scratch. Pick an open-source engine, wrap it well, ship it.

CDN topology: owning vs renting the wire

This is where Platform A is in a class of its own.

Most platforms (B, C, D) use commercial CDNs. Akamai, CloudFront, Cloudflare. Their video segments live on the CDN's edge servers, which are geographically distributed but run by the CDN, not the platform.

Platform A built and operates Open Connect Appliances. Physical servers shipped to ISPs, who install them inside their own networks.

When you watch Platform A's content from a major Indian ISP, your video doesn't traverse the public internet. It comes from a Platform A appliance physically located inside the ISP's data center, on the ISP's own network, often with zero transit cost.

The hostnames told the story. I observed segments served from clusters in two different Indian cities, inside two different ISPs, simultaneously, on a single playback session. The platform's client was steering between four different appliances mid-playback based on conditions I couldn't see.

This is a 10+ year capital investment that no other platform in my audit comes close to matching. It's not replicable at small scale, and it's not even strictly necessary at small scale.

But it explains why Platform A's streams feel different. They're physically closer to the user than anyone else's, by a wide margin.

Telemetry: centralized vs federated

How does each platform know what's happening with your stream? They send telemetry beacons.

Platform A: small number of beacons per session, all to its own first-party endpoint, in JSON, with an outbox pattern (failed sends queued in localStorage and retried). Telemetry treated as a first-class engineering concern.

Platform B: beacons in Protobuf (a binary wire format) to a single first-party endpoint. Response acknowledgment is two bytes. Beacons are 5 to 12 KB. Under surge conditions, this matters. Telemetry itself becomes a load source if you're not careful.

Platforms C, D, and others: beacons fanned out to multiple third-party SDKs simultaneously. Mixpanel, CleverTap, NPAW Youbora, Branch.io, Facebook, Google Analytics, Comscore, Conviva, AppsFlyer. One platform's watch page made requests to over 30 distinct hosts.

There's a cost to this federation.

During my audit, one platform's video QoE telemetry endpoint was returning HTTP 503 errors. Their pipeline was broken at the moment I measured it, and presumably had been for some time without detection.

Centralized telemetry has fewer places to fail silently than federated telemetry, and it is easier to observe.

The pattern is consistent. Platforms that take observability seriously consolidate. Platforms that treat telemetry as a checkbox spray it across vendors.
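The outbox pattern mentioned for Platform A is simple enough to sketch. This is an in-memory illustration of the idea, not that platform's implementation. A browser client would persist the queue (the audit saw localStorage used for this), and the `send` callable here stands in for whatever transport the player uses.

```python
from collections import deque
from typing import Callable, Deque


class Outbox:
    """Queue beacons locally and retry the ones that fail to send."""

    def __init__(self, send: Callable[[dict], bool], max_size: int = 100) -> None:
        self._send = send
        # When the queue is full, the oldest beacons are dropped first.
        self._pending: Deque[dict] = deque(maxlen=max_size)

    def emit(self, beacon: dict) -> None:
        """Queue a beacon, then attempt delivery of everything pending."""
        self._pending.append(beacon)
        self.flush()

    def flush(self) -> int:
        """Send queued beacons in order; stop at the first failure."""
        sent = 0
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.popleft()
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


state = {"up": False}  # simulated network status
delivered = []


def send(beacon: dict) -> bool:
    """Stand-in transport: succeeds only while the simulated network is up."""
    if state["up"]:
        delivered.append(beacon)
    return state["up"]


outbox = Outbox(send)
outbox.emit({"event": "play"})
outbox.emit({"event": "rebuffer"})
print(len(outbox))  # 2: both beacons are held while the endpoint is down

state["up"] = True
outbox.flush()
print(len(outbox), len(delivered))  # 0 2
```

A growing queue is itself a signal. A client that sees its outbox length climb knows the telemetry endpoint is failing, which is the kind of detection a long-broken pipeline lacks.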

Accessibility: the largest gap I observed

I expected to find architectural differences. I didn't expect the gap on accessibility to be this stark.

For a single drama series episode:

Platform	Audio tracks	Subtitle tracks	Audio descriptions
A	35 across 23 languages	42 across 33 languages	14 tracks
B (Indian leader)	1 (English)	1 (English)	None
C (Indian)	1 (regional language)	1 (regional language)	None
D (Indian regional)	1 (English, on a regional drama)	1 (English)	None
E	Multiple	Multiple	Not measurable

Platform A's catalog has been built for a global multi-language audience for over a decade, and it shows.

Platform D, which positions itself as a regional Indian OTT, shipped English-only audio on a regional-language drama series. That's either a packaging mistake on the title I watched, or a capability gap, or a cost choice. Whichever it is, it directly contradicts the platform's stated regional positioning.

Audio descriptions, narration tracks for visually impaired viewers, are present on exactly one of the five platforms. Fourteen tracks across multiple languages on Platform A. Zero on the others.

Accessibility is the dimension where the gap between "platform that takes its users seriously" and "platform that ships the minimum" is most visible.

It's not a hard problem. It's a priority.

What this means if you're building a streaming platform

A few patterns worth taking seriously.

Cache asymmetry is your friend. Manifests should be cached briefly, if at all. Segments should be cached forever, or close to it. They have completely different lifecycles and need completely different cache strategies.

Sign your segment URLs. Every CDN supports it. There's no good reason to ship plain URLs in 2026.

Keep auth out of localStorage. httpOnly cookies have been the right answer for fifteen years. The exceptions are vanishingly rare and almost always trace back to a third-party SDK someone forgot to question.

Don't write a player from scratch unless you're at hyperscale. Wrap Shaka or hls.js. Spend your engineering on the parts users actually feel: telemetry, ABR memory, DRM caching, UI.

Centralize your telemetry. If you're sending the same events to five vendors, you're paying five times for the same insight, debugging five integrations, and giving five third parties access to your user data. Pick one. Build the rest yourself.

Treat accessibility as core, not as an add-on. Multi-language audio and subtitles aren't extras for a global platform. They're the product.

Methodology

All observations were made via standard browser developer tools while signed in as a paying or registered user. No DRM was bypassed. No access controls were circumvented. No license server payloads were captured beyond noting that requests fired and to which endpoints.

Platform identities are anonymized. Findings that could uniquely identify a platform have been described in general terms or omitted.

Single VOD title per platform, on desktop Chrome, on a residential Indian connection. Network throttling and mobile network behavior were not in scope.

If you're building or scaling an OTT, talk to us

The wire tells stories the marketing doesn't. If you recognized your platform in the audit above (good or bad), or if you're building one and want a second set of engineering eyes on your architecture, that is exactly what MatrixGard does.

We do read-only infrastructure audits across cloud, security, and delivery layers. Same methodology as the audit above, but applied to your own stack with full access and a written report at the end. See how a MatrixGard audit works or start with the free 2-minute readiness checklist.

Avinash S is the founder of MatrixGard, a fractional DevSecOps practice helping founder-led teams ship cloud infrastructure that holds up under audit, scale, and incident pressure. Eight-plus years across enterprise and startup cloud environments. M.Tech Cyber Security at SRMIST.

What SOC 2 Actually Costs an Indian Seed Startup in 2026: A Line Item Breakdown

noreply@matrixgard.com (Avinash S) — Thu, 23 Apr 2026 07:30:00 GMT

An Indian seed-stage SaaS founder told me last month that his investor had recommended Vanta + a Big-4 audit firm + a boutique vCISO. The combined quote came to ₹34 lakh. He nearly signed.

We ran the same scope through the Indian-market stack, Sprinto + a small AICPA-licensed Indian audit firm + Astra for the pen test. Total: ₹10 lakh. Same Type II attestation. Same opinion letter. Same customer-facing security page. (Story details changed for anonymity; the price gap is real and recurring.)

This post is the breakdown nobody on a SaaS pricing page will give you, grounded in actual Indian-market quotes (grcdesk.in, neumetric.com, parafoxtechnologies.in, soc2.in), not US-buyer aggregators that overstate Indian pricing 2-4x.

Scope: SOC 2 Type II, the one your enterprise customers actually demand, for an Indian-incorporated SaaS company with a 5-15 person team, in the first audit cycle (12-month observation period).

Why every SOC 2 cost article you've read is misleading

Three reasons, named honestly:

The big-three SaaS (Vanta / Drata / Sprinto) price themselves, not the project. Their pricing page is one bill of five. They don't tell you about the others because if you saw the total upfront, signing the SaaS subscription would feel like a much bigger commitment than the pricing page suggests.
Most "cost of SOC 2" articles are written by the SaaS vendors themselves. Read the byline. The incentive is to make their slice look like the whole pie.
Audit firm quotes are padded. A Big-4 audit typically costs 2-3x what a small AICPA-licensed specialist firm charges for the same scope of Type II opinion under the same standard. Most Indian startups default to the Big-4 they recognise. Most Indian customers don't actually care which audit firm signed the report; they just want to see SOC 2 Type II on a security page.

The result of all three: founders walk in expecting a ₹6 lakh project and walk out three quarters later having written ₹20+ lakh in cheques across five vendors. The over-spend isn't fraud. It's information asymmetry. This post is the symmetry restored.

The five line items, in rupees

1. Compliance automation SaaS, ₹2-5 lakh/year (Indian path)

The platform that automates evidence collection. You'll need one. The choice is which, and the Indian buyer reality is very different from the US-aggregator number you'll see online.

Sprinto (Bengaluru-HQ, Indian-founded, INR billing): ₹2-5L/year for a startup tier with single framework; ₹5-15L for multi-framework setups (grcdesk.in, cybersecify.com). Pricing is gated behind a demo call, so verify directly.
Scrut Automation (also Bengaluru-HQ): ₹2-5L/year at startup tier, comparable feature set to Sprinto for a single-product Indian SaaS (mitigata.com).
Drata (US, no India tier published): Indian buyers report ₹5-15L/year. Built for US mid-market, quoted in USD, no FX cushion (grcdesk.in).
Vanta (US, no India tier): same band as Drata, ₹5-15L/year. Heaviest brand recognition outside India, which is why investors recommend it, not because it's better.

The honest math: Sprinto and Scrut are 2-3x cheaper than Vanta/Drata at the Indian seed tier, with INR billing avoiding FX swing. Capability gap on Trust Services Criteria automation: minimal for a single-product seed-stage SaaS. The reason Western funds push you toward Vanta is unfamiliarity with the Indian alternatives.

What to actually spend the savings on if you have it: a better auditor (next line item).

2. The actual audit (Type II), ₹3-6 lakh from the right firm

This is the part the SaaS pricing page doesn't include and the part most founders forget exists until month four. The auditor, a CPA firm, independently inspects your evidence and issues the opinion letter your customers will ask for.

Indian-market pricing tiers for first-year Type II:

Smaller Indian CA firms / India-resident SOC 2 boutiques (e.g. soc2.in): ₹3-4.2L for a starter package, often bundled with pen-test.
Indian compliance-first shops (Parafox, Neumetric, GRCDesk): ₹4-6L for 10-30 employees; ₹7-10L for 30-100 employees (parafoxtechnologies.in, zcybersecurity.com).
A-LIGN India / Schellman India (US specialist firms with AICPA-licensed Indian teams): buyers report ₹6-10L on calls; neither firm publishes INR pricing.
Big-4 India (PwC / Deloitte / EY / KPMG): ₹15-30L+. They typically don't quote sub-50-FTE SaaS, and when they do, it's at this band.

Picking a smaller Indian CPA firm over Big-4 saves ₹10-25L for the same scope of opinion under the same AICPA standard. The opinion letter has the same legal weight. The customers asking you for SOC 2 won't reject A-LIGN, Schellman, or a credible Indian CA firm; all are on the AICPA's licensed-CPA-firm list.

In our experience, an explicit "Big-4 only" requirement from customers is uncommon. Most enterprise procurement asks for "a recognized AICPA-licensed firm," which any specialist auditor satisfies. When the Big-4-specific demand does appear, it's usually a procurement-team box-tick, and typically negotiable at the contract stage.

3. Consulting / vCISO / readiness, ₹0-15 lakh

This is the line item with the widest range and the highest founder confusion.

DIY with the SaaS tooling: ₹0. The platform's inbuilt readiness assessment + control templates can carry you, if someone on your team can absorb the work.
Boutique vCISO retainer (3-6 month engagement): ₹5-15L. Useful when nobody on your team has done compliance before.
Big-name consultancy (the Deloittes of the world, but for advisory): ₹15-30L. Rare for seed stage. Almost always overkill.

You save ₹5-15L by DIY-ing this. The catch: it requires 80-150 engineering hours across the year, owned by the right person. If your team is 3 backend engineers and a designer, you don't have that person, and the SaaS platform won't carry you the rest of the way.

The honest test: ask whichever of your engineers will own this whether they've ever read AICPA Trust Services Criteria. If yes, DIY. If no, budget vCISO.

4. Engineering hours (the hidden cost), ₹3-10 lakh equivalent

This is the cost no SaaS marketing page admits exists.

SOC 2 Type II requires evidence: log retention configs, change-management workflows, access reviews, vulnerability scan outputs, vendor-management documentation, security training records. The SaaS platform pulls a lot of this automatically. It does not pull all of it. The remainder requires engineers.

Plan for 80-200 engineering hours over the 12-month observation period. At a fully-loaded cost of ₹3,000-5,000 per hour for a senior engineer (salary + benefits + opportunity cost), that's ₹2.4-10L in real engineering capacity diverted from product.
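The range above is plain multiplication, and it is worth redoing with your own numbers. A minimal sketch, using only the hour and rate figures quoted in this section (the function name is mine, not from any tool):

```python
# Illustrative arithmetic only; the ranges are the ones quoted in the paragraph above.
HOURS_RANGE = (80, 200)          # engineering hours over the 12-month observation period
RATE_RANGE_INR = (3_000, 5_000)  # fully loaded cost per senior-engineer hour, in rupees
LAKH = 100_000

def engineering_cost_lakh(hours: int, rate_inr: int) -> float:
    """Return the diverted engineering cost in lakh rupees."""
    return hours * rate_inr / LAKH

low = engineering_cost_lakh(HOURS_RANGE[0], RATE_RANGE_INR[0])
high = engineering_cost_lakh(HOURS_RANGE[1], RATE_RANGE_INR[1])
print(f"Rs {low:.1f}L to Rs {high:.1f}L")
```

Swap in your own loaded hourly rate and a realistic hour count for your stack; the low end assumes both the fewest hours and the cheapest rate, which rarely happen together.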

Reduce this by picking the SaaS with the best evidence-collection automation for your stack. Drata generally edges out Vanta on this dimension as of early 2026; Sprinto is improving fast on Indian-stack integrations.

Do not pretend this cost is zero. It's the most common reason a SOC 2 budget triples mid-year.

5. Pen test (auditor will require it), ₹1.5-3 lakh from Indian vendors

The auditor will require a pen test result for the application within scope. You can't skip this. You can choose how to deliver it.

CERT-In empanelled small Indian firms: ₹40K-1.5L for a single web-app VAPT with a usable certificate (Astra India VAPT guide). Cheapest defensible option.
Astra Security (Delhi-HQ, CERT-In + CREST): single VAPT scan ₹40K-2L; continuous pentest plan ~₹5L/year, overkill for a single SOC 2 cycle (getastra.com/pricing).
Payatu / SAFE Security / NotSoSecure: typical Indian VAPT range ₹1.5-3L for a thorough manual + automated SaaS test (neumetric.com, bminfotrade.com).
Western firm: ₹5-8L. Same opinion letter on the auditor's desk. Usually picked by founders unfamiliar with Indian options.

The auditor doesn't care which path you pick. Pick by your team's preference and your stack's complexity.

Bonus line, bridge letters between Type II cycles, ₹50K-1.5L per letter

Customers often ask for bridge letters (mini-attestations the auditor issues between annual Type II cycles, confirming nothing material has changed). Each one your auditor issues costs ₹50K-1.5L.

The cheapest path: negotiate 1-2 bridge letters into the original audit scope at signing. After signing, each one becomes a separate engagement at full price.

The total, three real scenarios

Every Indian seed-stage SaaS founder we've helped through SOC 2 ends up at one of three roughly-shaped totals. The spread between them is enormous.

Scenario	Automation	Audit	Pen test	Readiness	Total
Cheap DIY (Indian boutique) (Sprinto + soc2.in-style starter + CERT-In small firm)	₹2.5L	₹3L	₹1.5L	₹0 (founder-led)	₹7L
Typical Indian seed-stage (Sprinto/Scrut + mid-tier Indian CPA + Astra/Payatu + light consulting)	₹3L	₹4-5L	₹2L	₹1L	₹10-11L
Western-default-imported (the trap) (Drata/Vanta + Big-4 + vCISO retainer + Western pen-test)	₹8L+	₹15L+	₹5L	₹6L	₹34L+

The headline: the spread between the cheapest defensible Indian path and the Western-default trap is roughly ₹27 lakh. Customers can't tell them apart. The opinion letter reads the same. The Trust Services Criteria coverage is identical. Most Indian seed-stage SaaS land in the middle row at ₹8-14L all-in.
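If you want to check the table against your own quotes, the totals are simple sums. A small sketch using the table's own figures in lakh rupees; treating an open-ended "+" figure as its floor is my assumption, and like the table it leaves engineering hours out:

```python
# Each figure is the table's own number, in lakh rupees, as a (low, high) pair.
# Open-ended "+" figures are represented by their floor (an assumption).
SCENARIOS = {
    "Cheap DIY": {
        "automation": (2.5, 2.5), "audit": (3, 3), "pen_test": (1.5, 1.5), "readiness": (0, 0),
    },
    "Typical Indian seed-stage": {
        "automation": (3, 3), "audit": (4, 5), "pen_test": (2, 2), "readiness": (1, 1),
    },
    "Western-default-imported": {
        "automation": (8, 8), "audit": (15, 15), "pen_test": (5, 5), "readiness": (6, 6),
    },
}

def total_range(line_items):
    """Sum the low ends and the high ends of every line item."""
    low = sum(lo for lo, _ in line_items.values())
    high = sum(hi for _, hi in line_items.values())
    return low, high

totals = {name: total_range(items) for name, items in SCENARIOS.items()}
# Spread between the floor of the expensive path and the cheapest path.
spread = totals["Western-default-imported"][0] - totals["Cheap DIY"][1]

for name, (low, high) in totals.items():
    print(f"{name}: Rs {low:g}L to Rs {high:g}L")
print(f"Spread: Rs {spread:g}L")
```

Replace the tuples with the quotes you actually receive and the spread for your own situation falls out the same way.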

Many US-funded Indian startups default to Vanta or Drata plus a US audit firm, usually because that's what their investors and US customers recognize, not because the Indian alternatives can't deliver the same attestation.

What the SaaS sales reps won't tell you

Five specific things, named:

You don't need their consulting add-on if you have a competent senior engineer. The platform IS the consulting layer for most of the work. The add-on is for companies without infrastructure understanding. If your CTO can read the AICPA Trust Services Criteria PDF without flinching, skip the add-on.
You can switch SaaS platforms mid-year. Evidence portability across compliance platforms is real now: Vanta, Drata, and Sprinto all export evidence in standard formats. If your pricing surprises you at renewal, switch.
The auditor doesn't care which SaaS you use. They care about evidence quality and completeness. You can switch auditors and SaaS independently.
Type II isn't "another full audit" after Type I. Type I confirms your controls exist on a single date; Type II confirms they operated effectively over 6-12 months. Type II typically prices at 1.3-1.5x Type I: same controls, longer observation window, more evidence sampling (Sprinto, Comp AI).
The "you must use a Big-4" customer demand is rare. Specialist firms (A-LIGN, Schellman, Sensiba) appear on the same AICPA-licensed-CPA-firm list. In our experience the demand is almost always negotiable and usually softens once the AICPA-licensed status is shown.

What about ISO 27001? HIPAA? PCI?

Same line items, different multipliers:

ISO 27001: comparable first-year cost in India, with recurring surveillance audits roughly ₹4-10L/year (Wattlecorp) vs SOC 2's annual re-audit cycle. Indian certification bodies (BSI India, TÜV, BV, DNV) compete on price against UK/US bodies.
HIPAA: not a certification, it's compliance with US healthcare regulation. No formal audit unless a Business Associate contract demands one. Tooling cost roughly the same; engineering cost higher because of mandatory encryption and access control depths.
PCI DSS: variable from ₹5L (SAQ A self-assessment for Stripe-style flows where you never touch card numbers) to ₹40L+ (mid-scope QSA assessment). Level 1 (>6M transactions/year) can exceed ₹1Cr and is out of scope for most seed-stage. Most Indian fintech founders dramatically over-scope this. If you can use Stripe / Razorpay / Cashfree as the payment processor, you almost never need a full PCI assessment.

The pattern repeats: SaaS automation, an audit body, optional consulting, engineering hours, and at least one external test. The rupee amounts vary by framework. The five-line structure does not.

When you'd actually want to bring in help

Three triggers where DIY stops being the right call:

You have an enterprise customer demanding SOC 2 in under 90 days. The DIY path takes 6+ months end-to-end. If the timeline is forced, buy your way in with a vCISO retainer and an auditor that has Type II completion in <120 days as a stated capability. A few specialists offer this; most don't.
You don't have a senior engineer who's done compliance work before. The platform won't save you. The engineering hours will quietly compound past the consulting fee you would have paid. A boutique vCISO at ₹1-2L/month for 6 months is often cheaper than 200 untracked engineering hours.
You're targeting HIPAA / PCI DSS / FedRAMP / RBI Master Direction next year. Don't DIY SOC 2 if you'll need a real GRC function in 18 months. Build the muscle now with a vCISO who can carry you across multiple frameworks. The marginal cost of the second framework is much lower than the first if you build the right operating model up front.

If none of those apply, you can probably DIY the first SOC 2 cycle and revisit the question at year two.

What this post is missing

I deliberately didn't cover:

Trust Services Criteria selection (Security only vs Security + Availability + Confidentiality, etc.). That's a separate post. For almost all seed-stage SaaS, Security-only is correct, but the reasoning matters.
Specific control implementation (how to actually configure CloudTrail / Cloud Audit Logs / vendor reviews / change management). Each of those is a post on its own.
The exact AICPA TSC text. It's free at aicpa-cima.com. Read it once. It's 40 pages. It will save you weeks of consulting time.

If you want me to look at your specific SOC 2 path

I do this for ~10 startups a quarter, free, no NDA needed: 30 minutes, your specific stack, where the cheapest viable path lives, what you can DIY, what's worth paying for. Mostly because it's the fastest way I know to find startups who actually need the work I do once the audit cycle starts.

Send me a note with what framework you're targeting and your timeline. I'll reply with a 5-line read on the cheapest viable path for your situation.

Avinash S is the founder of MatrixGard. Cloud and DevSecOps for startups who can't afford the team they need. Almost a decade of building, breaking, and securing cloud infrastructure across India, Singapore, and the US.

Methodology note. Pricing ranges are sourced exclusively from Indian-market public references (GRCDesk, Neumetric, Parafox, soc2.in, Cybersecify, Z Cybersecurity, Astra Security, Neumetric VAPT, BM Infotrade), combined with quotes shared by Indian founders in our network for first-time, single-criterion SOC 2 Type II engagements at seed-stage SaaS (10-50 FTE). US-buyer aggregators (Vendr / Spendflo / ComplyJet / Comp AI / SOC2Auditors.org) are deliberately excluded; their numbers reflect US enterprise tiers that are 2-4x higher than what Indian SaaS actually pay. Multi-product, multi-region, or multi-framework scope pushes the upper end significantly. All numbers are directional; get a real quote before you budget.

Ghost Hunter: The $28,000 Question Your Dashboard Won't Answer

noreply@matrixgard.com (Avinash S) — Sun, 19 Apr 2026 06:30:00 GMT

It's 11:47 PM. The CEO sends a two-word email.

Subject: Bill?

The AWS bill went from $135,000 to $163,000 in a single month. The board call is at 9 AM tomorrow. The CFO wants a cause, not a number.

The on-call engineer opens the console. Sees the spike. Does not see the cause. Starts digging.

Three hours, eleven browser tabs, and one cold coffee later, the answer surfaces. A single forgotten GPU instance in us-east-1, launched more than two weeks ago by someone who has since left the team. $1.62 an hour. 24 hours a day. 18 days.

This scene plays out in every cloud-native company, every month. The senior SRE it takes to resolve it is one of the most expensive people in engineering.

I built Ghost-hunter to play that SRE. At 11:47 PM. When nobody else is awake.

Dashboards describe. They do not diagnose.

Cloud dashboards are the smoke detector. They tell you there is a fire. They cannot tell you which wire frayed.

The "why" lives in three places the dashboard cannot reach:

Command-line output from service-specific tools (aws, gcloud, kubectl)
Log data the dashboard never ingested
Tribal knowledge. Who launched what. Which account is test. What's normal for this team.

A human SRE walks that terrain by hand. They form a theory. Run a read-only command. Read the output. Adjust.

Ghost-hunter does the same. No human required at 11:47 PM.

Two detectives, not one

Most AI tools wrap a single model. You ask a question. It writes commands. It runs them. It tells you what it thinks.

For a chatbot, that's fine. For anything that touches your cloud, it's reckless.

Picture a detective investigating a scene. If the same person forms theories AND handles raw evidence, two things go wrong. They miss what a fresh eye would catch. And they're one bad assumption away from contaminating the scene.

Ghost-hunter uses two.

The lead detective. Forms theories. Weighs evidence. Decides what to investigate next. Never touches the crime scene directly. (This is Claude Opus.)
The evidence technician. Follows instructions. Collects samples. Writes one-line summaries. Signs off on the chain of custody before anything crosses. (This is Claude Sonnet.)

"Contaminating the scene" in this analogy is running a command that damages your cloud. The detective never writes commands. The technician writes them. A seven-gate safety system verifies them. Nothing runs until every gate signs off.

A case, five scenes

I ran Ghost-hunter against the FinOps Foundation's public FOCUS 1.0 sample. Real shape, anonymized data, no customer exposure. The dollar amounts are scaled down. The mechanics are what you'd see in production.

Scene 1. The scene of the spike

Ghost-hunter in advisor mode. "Will not touch your cloud. Reads your billing export, proposes read-only commands, asks you to run them yourself."

EC2 at the top of the list, up 185.5%. 27 other services scanned and ranked by dollar impact.

The investigation starts with a fact, not a guess.

Scene 2. The suspects

The lead detective pulls the file apart. Top SKUs. Top accounts. Top regions. One account, 11353890204, is responsible for 91% of the spend. 92% of it landed in us-east-1.

Four theories go on the board:

H1 (55%). GPU instances running for ML or rendering, driving most of the bill.
H2 (30%). A general-purpose instance left running longer than it should have.
H3 (35%). A CI or batch pipeline spinning up short-burst instances.
H4 (10%). Storage growth as a secondary contributor.

Each one has a confidence score. Each one is testable. The detective picks the strongest.
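To make "picks the strongest" concrete, here is a toy sketch of the board. It is not Ghost-hunter's code; the `Hypothesis` class and both helper functions are hypothetical, and it assumes the scores are independent per-hypothesis confidences rather than a probability distribution (they add up to more than 100).

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    label: str
    confidence: int  # percent, scored independently for each hypothesis
    summary: str

board = [
    Hypothesis("H1", 55, "GPU instances running for ML or rendering"),
    Hypothesis("H2", 30, "General-purpose instance left running too long"),
    Hypothesis("H3", 35, "CI or batch pipeline spinning up short-burst instances"),
    Hypothesis("H4", 10, "Storage growth as a secondary contributor"),
]

def strongest(hypotheses):
    """Return the hypothesis with the highest confidence score."""
    return max(hypotheses, key=lambda h: h.confidence)

def ranked(hypotheses):
    """Return hypotheses ordered from most to least confident."""
    return sorted(hypotheses, key=lambda h: h.confidence, reverse=True)

print(strongest(board).label)
print([h.label for h in ranked(board)])
```

The point of keeping every hypothesis on the board, not just the winner, is that a later piece of evidence can reorder it.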

Scene 3. The interview that goes sideways

The evidence technician drafts a command. Read-only. Validated by four security layers. Copied to the user's clipboard automatically.

The user replies:

"i dont have access to the aws account to run any commands"

Most AI tools break here. Either they freeze. Or they hallucinate a result. Or they quietly pretend the user did run the command.

Ghost-hunter does none of that. The detective takes the refusal as information. Re-reads what's on the board. Updates the confidence scores (H1 climbs from 55 to 75). Concludes with what's actually provable from billing alone.

"Understood. You don't have CLI access. No problem. The billing data is quite revealing on its own. Let me work with what we have and wrap this up."

Fake confidence would be worse than no tool at all. Ghost-hunter lands on 72%. Not 95. Not 100. Seventy-two.

Scene 4. The plan

Not "do these twelve things and good luck." A prioritized ladder.

NOW. Contact the owner of account 11353890204. Check running g5.4xlarge instances in us-east-1.
THIS WEEK. Set a Cost Anomaly Detection monitor. Add a $5 budget with email alerts at 80% and 100%.
THIS MONTH. Evaluate Savings Plans. Add an IAM guardrail to block expensive GPU launches without approval.

Every "NOW" item is under five minutes. Nothing in the list is a write command against production. Ghost-hunter will never tell you to delete, terminate, or modify anything without your finger on the key.

Scene 5. The verdict, with honest gaps

A root cause. Five cited pieces of evidence. A list of five things Ghost-hunter could not verify.

This part matters more than the conclusion itself.

Most AI tools close with false certainty because false certainty feels polished. Ghost-hunter tells you what it does not know. "Could not confirm which specific EC2 instances are running." "Could not determine who launched the GPU instances or for what purpose."

That transparency is what makes the conclusion trustworthy. You can read the transcript, see what was cited, see what was not, and decide if 72% is good enough to act on.

The seven doors

Every command Ghost-hunter proposes passes through a vault with seven doors. Miss any one door, the command dies.

 1. Fast reject      shell metacharacters blocked (;, &&, unquoted $())
 2. Allowlist        is this verb on the read-only list?
 3. Flag check       every flag safe for this verb?
 4. Input hygiene    length, encoding, empty-command?
 5. Budget           caps on commands, cost, time per run
 6. Semantic check   does this actually test the stated hypothesis?
 7. Sandbox          environment isolation (active mode only)

A system that sometimes lets through commands its validator was unsure about is a system that will one day run delete by accident. Ghost-hunter has no "helpful override." A command that cannot pass every door does not run.
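For readers who want to picture the first four doors, here is a rough sketch of a fail-closed validator. It is not Ghost-hunter's implementation: the allowlist contents, the length cap, and the `validate` function are all hypothetical, and the metacharacter check is deliberately stricter than the one described above (it rejects `$(` even inside quotes).

```python
import shlex

# Hypothetical allowlist: read-only verbs mapped to the flags approved for each.
READ_ONLY_VERBS = {
    ("aws", "ec2", "describe-instances"): {"--region", "--filters", "--query", "--output"},
    ("aws", "ec2", "describe-volumes"): {"--region", "--filters", "--query", "--output"},
}
FORBIDDEN_SNIPPETS = (";", "&&", "|", "`", "$(", ">", "<")
MAX_COMMAND_LENGTH = 500  # assumed cap, chosen for illustration

def validate(command: str) -> tuple[bool, str]:
    """Return (allowed, reason). The first failing gate rejects the command."""
    # Door 1: fast reject on shell metacharacters.
    if any(snippet in command for snippet in FORBIDDEN_SNIPPETS):
        return False, "shell metacharacter"
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False, "unparseable quoting"
    # Door 2: the verb must be on the read-only list (empty commands fail here too).
    allowed_flags = READ_ONLY_VERBS.get(tuple(tokens[:3]))
    if allowed_flags is None:
        return False, "verb not on allowlist"
    # Door 3: every flag must be approved for this verb.
    for token in tokens[3:]:
        if token.startswith("--") and token.split("=", 1)[0] not in allowed_flags:
            return False, f"flag not allowed: {token}"
    # Door 4: input hygiene on length and encoding.
    if len(command) > MAX_COMMAND_LENGTH or not command.isascii():
        return False, "failed input hygiene"
    return True, "passed doors 1-4"

print(validate("aws ec2 describe-instances --region us-east-1"))
print(validate("aws ec2 terminate-instances --instance-ids i-123"))
```

The design choice that matters is the default: anything the validator does not explicitly recognise is rejected, so a gap in the allowlist costs you a useful command, never a destructive one.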

Three lines I refuse to cross

No writes. Ever. Read-only is the whole product. The detective does not hold the keys to the cloud.
No hardcoded answers. Most "AI FinOps" tools win benchmarks by memorizing patterns. "If NAT Gateway plus high bytes, the answer is missing VPC endpoint." Ghost-hunter refuses. The CI pipeline literally fails commits that put scenario names in prompts. If the reasoning isn't in the transcript, it isn't in the product.
No data leaves your machine. Your bill stays local. The only thing that moves is compressed evidence summaries, through your own Anthropic API key.

Why this matters

Most AI tools in this space are lookup tables with a nice voice. They recognize the shapes they were trained on. They miss the shapes they weren't.

Ghost-hunter is slower. On a known pattern, a memorizing tool will beat it every time.

Ghost-hunter wins on the bill nobody has seen before. Your bill. Your configuration. The spike caused by your ML team's experiment, your third-party vendor's bug, the intern who cloned a production pipeline for testing. Every hypothesis, every command, every piece of evidence sits in a transcript you can read.

You do not trust the conclusion because an AI said so. You trust it because you can audit the reasoning yourself.

That's the product.

Private beta

Ghost-hunter is not yet public. If you run cloud infrastructure and you've ever been the person answering the 11:47 PM email, I'll open access to you first.

Book a 20-minute call and I'll walk you through Ghost-hunter against a billing export of your choosing. Or send me a note with what you'd want it to solve first.

Avinash S is the founder of MatrixGard. Cloud and DevSecOps for startups who cannot afford the team they need. Almost a decade of building, breaking, and securing cloud infrastructure.

I Looked at 30 Startups' Infrastructure. Every Single One Had the Same Problem.

noreply@matrixgard.com (Avinash S) — Sun, 12 Apr 2026 10:00:00 GMT

Over the last 8 years working in cloud infrastructure, I have seen the inside of startups at every stage. Seed rounds running on a single AWS account. Series B companies with 40 engineers and no one owning security. Teams that shipped a product customers love, built on infrastructure that keeps the CTO up at night.

Every single one had the same fundamental problem.

Not a specific vulnerability. Not a misconfigured S3 bucket. Something deeper.

Nobody owned security.

The CTO was doing it. The same person writing architecture docs, reviewing PRs, managing the cloud bill, handling incidents at 2 AM, and pitching to investors on Friday. Security was somewhere on the list. Usually at the bottom.

Not because they did not care. Because there was nobody else.

Here are the 7 things I found in every startup under 50 engineers

1. The CTO is the entire infrastructure team

In 28 out of 30 startups, the CTO or a co-founder was the only person who understood how the infrastructure worked. No DevOps engineer. No SRE. No security person. Just one technical founder wearing four hats and hoping nothing breaks on the weekend.

The engineering budget went to product engineers. Which makes sense when you are trying to ship features and close customers. But it means the person responsible for security is also the person who has the least time for it.

2. Secrets were everywhere except a vault

API keys in environment variables. Database passwords in config files committed to the repo. AWS credentials shared over Slack. One startup had their production database password in a shared Notion page that the entire team could access.

Not one of the 30 startups was using a proper secrets manager. Not AWS Secrets Manager, not HashiCorp Vault, not even a basic encrypted store. The reason was always the same: "We will set it up when we have time."

3. Antivirus was the entire security stack

When I asked about cloud security, the most common answer was: "We have antivirus on our laptops." Endpoint protection was the entire security posture. Nothing in the cloud.

No CloudTrail. No GuardDuty. No WAF. No container scanning. No dependency vulnerability checks. The cloud infrastructure was completely unmonitored. Somebody could be running crypto miners on their AWS account right now and they would not know until the bill arrives.

4. The last security review was never

"When was your last infrastructure security review?"

The most common answer: silence. Followed by: "We have been meaning to do one."

22 out of 30 startups had never done a security review of any kind. Not a penetration test. Not a vulnerability scan. Not even an internal audit. The infrastructure was built to work, not to be secure. And nobody had gone back to check.

5. No incident response plan exists

If a breach happened at 2 AM tonight, what happens?

In most of these startups, the answer is: the CTO's phone rings. Maybe. If someone notices. There is no runbook, no escalation procedure, no communication template, no forensic capability. Just a person waking up and figuring it out in real time.

For fintechs under RBI regulation, the reporting window is 2-6 hours. For DPDP Act compliance, it is 72 hours to the Data Protection Board. You cannot meet those timelines if your incident response plan is "call the CTO."

6. Compliance was a future problem that became a today problem

The pattern repeats: startup builds product, gets traction, raises funding, starts talking to enterprise customers. Enterprise customer sends a vendor assessment. The assessment asks for a SOC 2 Type II report, or an ISO 27001 audit report, or evidence of RBI compliance.

The startup does not have any of these. The deal stalls. The CTO scrambles to figure out what SOC 2 even requires. The timeline is 3-6 months to get there. The enterprise customer moves on.

I have seen this exact scenario play out at 4 startups in the last 2 months alone. The compliance gap is not just a security risk. It is a revenue blocker.

7. The AWS bill was hiding real problems

When I asked to look at cloud costs, every single startup had waste. Dev environments running 24/7. Oversized instances nobody had right-sized since launch. Unattached EBS volumes accumulating charges. Load balancers pointing to nothing.

The average waste I found: 30-40% of the monthly cloud bill. One startup was spending over Rs 5 lakh per month on AWS. Nearly 40% of that was resources nobody was using: roughly Rs 2 lakh a month, or about Rs 24 lakh a year in ghost costs.

The cloud bill is not just a cost problem. Unmonitored resources are also unmonitored attack surface. That idle EC2 instance nobody remembers? It has not been patched in 18 months.

Why this keeps happening

It is not negligence. It is prioritization under pressure.

When you have 15 engineers and 200 things to build, security does not make the sprint. The CTO knows it should. But there is a product launch next week, three customer bugs to fix, a hiring pipeline to manage, and an investor update due Friday.

Security gets pushed to "next quarter." Next quarter it gets pushed again. Until something forces the issue: an enterprise deal that requires SOC 2, an RBI audit notice, a customer who finds a vulnerability, or worse.

The startups that avoid this trap are the ones that treat security as infrastructure, not as a project. It is not something you "do" once. It is something that runs alongside your product, maintained by someone whose job it is.

What to do about it

If you recognized your startup in the list above, here are three things you can do this week:

1. Take 2 minutes to score yourself. We built a free security readiness quiz that asks 7 questions and tells you exactly where you stand. No signup required to start.

2. Fix the free stuff today. Enable MFA on your AWS root account (5 minutes). Turn on CloudTrail (10 minutes). Check for public S3 buckets (one CLI command). These cost nothing and close the most obvious gaps.

3. Get an outside set of eyes. You are too close to your own infrastructure to see the gaps. Someone who has looked at 30 other startups will spot patterns in 20 minutes that would take you weeks to find on your own. Book a free 20-minute infrastructure review and find out what is actually hiding.

The best time to fix your security was when you launched. The second best time is before the next audit, the next enterprise deal, or the next incident forces your hand.

Avinash S is the founder of MatrixGard, a DevSecOps consultancy that helps startups get infrastructure-ready in weeks, not months. Previously 8+ years in cloud infrastructure across enterprise and startup environments.

RBI Compliance for Fintech Startups: Security Checklist 2026

noreply@matrixgard.com (Avinash S) — Sun, 05 Apr 2026 14:00:00 GMT

If you are building a fintech startup in India, RBI compliance is not optional. It is the difference between getting a banking partnership and getting shut down. The Reserve Bank of India issued three major master directions in 2024-2025 alone, each tightening the technical requirements for payment aggregators, NBFCs, and digital lending platforms.

Most fintech founders treat compliance as a legal problem. It is not. It is an infrastructure problem. The RBI does not care about your privacy policy. They care about whether your data is encrypted, whether your cloud runs in India, whether you can detect a breach in 6 hours, and whether you have the audit trails to prove it.

Here is the checklist your CERT-In empanelled auditor will actually check.

Which RBI Framework Applies to You?

Before building anything, know which direction you fall under:

If you are a...	Your governing framework	Compliance deadline
Payment Aggregator	PA Master Direction 2025	Active now
NBFC (Top/Upper/Middle layer)	IT Governance Master Direction 2024	Active since Apr 2024
Non-bank PSO (large)	Cyber Resilience Direction 2024	Active since Apr 2025
Non-bank PSO (medium)	Cyber Resilience Direction 2024	April 1, 2026
Digital lending platform	Digital Lending Directions 2025	Active now

If you process payments, lend money, or route funds through your platform, at least one of these applies to you. Many startups think they are "just an interface." The moment you touch, hold, or settle funds, licensing and compliance requirements kick in.

The Infrastructure Checklist

1. Data Must Live in India

This is non-negotiable. All payment system data must be stored on servers physically located in India. This includes transaction records, card credentials, timestamps, user details, and payment profiles.

What this means for your infrastructure:

AWS: ap-south-1 (Mumbai) only for payment and financial data
Azure: Central India or South India regions
GCP: asia-south1 (Mumbai)
Your Terraform or Pulumi code must enforce region constraints. No exceptions.
If data is processed overseas temporarily, a complete copy must return to India within 24 hours and the foreign copy must be deleted
RBI must have unrestricted audit access to all stored data

The most expensive compliance mistake I see: startups that launch on us-east-1 because it was the default, then discover they need to migrate everything to Mumbai. Retrofitting costs 5x more than building it right from day one.
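One cheap way to catch region drift before it becomes a migration is a check in CI over whatever resource inventory you already export. A minimal sketch, assuming a hypothetical inventory format of dicts with `provider`, `name`, and `region` keys; the Azure entries use the programmatic region names `centralindia` and `southindia`:

```python
# Allowed India regions per provider, taken from the list above.
ALLOWED_REGIONS = {
    "aws": {"ap-south-1"},
    "azure": {"centralindia", "southindia"},
    "gcp": {"asia-south1"},
}

def residency_violations(resources):
    """Return the resources whose region is outside the allowlist for their provider."""
    return [
        r for r in resources
        if r["region"] not in ALLOWED_REGIONS.get(r["provider"], set())
    ]

# Hypothetical inventory, e.g. flattened from a Terraform plan or a cloud asset export.
inventory = [
    {"provider": "aws", "name": "payments-db", "region": "ap-south-1"},
    {"provider": "aws", "name": "legacy-ledger", "region": "us-east-1"},
    {"provider": "gcp", "name": "txn-archive", "region": "asia-south1"},
]

for resource in residency_violations(inventory):
    print(f"{resource['name']} is in {resource['region']}, outside India")
```

Failing the build on a non-empty result is blunt, but it turns "no exceptions" from a policy sentence into something the pipeline enforces. An unknown provider is treated as a violation on purpose.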

2. Encryption Everywhere

The RBI mandates encryption in transit and at rest. Specifically:

In transit: TLS 1.2 or higher on all connections. No self-signed certificates in production.
At rest: AES-256 encryption for databases, object storage, and volumes. Use AWS KMS, Azure Key Vault, or GCP Cloud KMS for key management.
Card data: Tokenization required. Storing actual card details is banned.
PCI-DSS compliance mandatory for payment aggregators and their onboarded merchants.

Quick check: run this against your AWS account to find unencrypted EBS volumes:

aws ec2 describe-volumes --filters Name=encrypted,Values=false --query 'Volumes[*].[VolumeId,Size,State]' --output table

If that returns results, you have a compliance gap.
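If you would rather keep a record of the gap than eyeball a table, the same check can run over the JSON that `aws ec2 describe-volumes --output json` returns. A small sketch; the sample payload is made up for illustration and the function name is mine:

```python
import json

def unencrypted_volumes(describe_volumes_json: str):
    """Return the VolumeIds of volumes whose Encrypted field is false or missing."""
    data = json.loads(describe_volumes_json)
    return [
        volume["VolumeId"]
        for volume in data.get("Volumes", [])
        if not volume.get("Encrypted", False)
    ]

# Made-up sample shaped like describe-volumes output (most fields omitted).
SAMPLE = """
{
  "Volumes": [
    {"VolumeId": "vol-0aaa", "Size": 100, "State": "in-use", "Encrypted": true},
    {"VolumeId": "vol-0bbb", "Size": 500, "State": "available", "Encrypted": false}
  ]
}
"""

print(unencrypted_volumes(SAMPLE))
```

Run it per region, since the CLI call only covers the region it is pointed at, and keep the output with your audit evidence.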

3. Access Controls and MFA

RBI requires access on a need-to-know basis with time-limited duration. In practice:

Multi-factor authentication on everything: AWS console, VPN, admin panels, deployment pipelines
No administrative rights on end-user workstations
Privileged access management with audit logging
Regular access reviews (quarterly minimum)
Service accounts with least-privilege IAM policies

I audit fintech startups where the CEO still has root access to production databases. That is a finding your auditor will flag on page one.

4. 24/7 Security Monitoring

The Cyber Resilience Direction requires a Security Operations Center. This means:

Continuous monitoring with log correlation and threat detection
Automated alerting for suspicious activity
Log management with retention (minimum 1 year)
Threat intelligence integration

You do not need to build an in-house SOC. Outsourced SOC services work and are specifically permitted. But "we check logs when something breaks" is not a SOC.

At minimum, set up CloudWatch Alarms + CloudTrail + GuardDuty on AWS, or the equivalent on Azure/GCP. Configure alerts for: root account usage, IAM policy changes, security group modifications, and unusual API call patterns.
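To show what the first three alerts key on, here is a sketch that classifies a single CloudTrail record. It assumes the standard record fields `userIdentity.type`, `eventSource`, and `eventName`; the event-name sets are a partial, illustrative selection, and "unusual API call patterns" is not covered because it needs a baseline rather than a lookup.

```python
# Partial, illustrative sets of CloudTrail event names worth alerting on.
IAM_POLICY_EVENTS = {
    "PutUserPolicy", "PutRolePolicy", "PutGroupPolicy",
    "AttachUserPolicy", "AttachRolePolicy", "AttachGroupPolicy",
    "CreatePolicyVersion", "DeleteRolePolicy", "DetachRolePolicy",
}
SECURITY_GROUP_EVENTS = {
    "AuthorizeSecurityGroupIngress", "AuthorizeSecurityGroupEgress",
    "RevokeSecurityGroupIngress", "RevokeSecurityGroupEgress",
    "CreateSecurityGroup", "DeleteSecurityGroup",
}

def alert_reasons(event: dict) -> list:
    """Return the reasons, if any, that this CloudTrail record should raise an alert."""
    reasons = []
    if event.get("userIdentity", {}).get("type") == "Root":
        reasons.append("root account usage")
    source, name = event.get("eventSource"), event.get("eventName")
    if source == "iam.amazonaws.com" and name in IAM_POLICY_EVENTS:
        reasons.append("IAM policy change")
    if source == "ec2.amazonaws.com" and name in SECURITY_GROUP_EVENTS:
        reasons.append("security group modification")
    return reasons

record = {
    "eventSource": "ec2.amazonaws.com",
    "eventName": "AuthorizeSecurityGroupIngress",
    "userIdentity": {"type": "Root"},
}
print(alert_reasons(record))
```

In practice you would express the same conditions as CloudWatch metric filters or EventBridge rules; the sketch is only meant to make the matching logic visible.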

5. Incident Response (2-6 Hours)

When a security incident happens, RBI reporting timelines are tight:

Banks and NBFCs: Report within 2-6 hours of discovery
Non-bank PSOs: Report cyber-attacks, outages, internal frauds, and settlement delays within 6 hours

Your incident response plan must include:

Automated breach detection (not a human checking dashboards)
Escalation procedures with named owners
Communication templates pre-approved by legal
Forensic analysis capability for severity, impact, and root cause
Cyber Crisis Management Plan (CCMP) approved by the board

6 hours from detection to RBI notification. If your team's current incident response is "someone posts in Slack and we figure it out," you will miss that window.
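
It helps to put the clock into your tooling so nobody has to do the arithmetic at 2 a.m. The sketch below assumes the 6-hour window described above and a timezone-aware detection timestamp. The function names and the example incident time are hypothetical.

```python
from datetime import datetime, timedelta, timezone

REPORTING_WINDOW = timedelta(hours=6)  # the 6-hour window described above


def reporting_deadline(detected_at):
    """Return the latest time the regulator notification can go out."""
    if detected_at.tzinfo is None:
        raise ValueError("detected_at must be timezone-aware")
    return detected_at + REPORTING_WINDOW


def time_remaining(detected_at, now):
    """Return how much of the window is left (negative once it is missed)."""
    return reporting_deadline(detected_at) - now


IST = timezone(timedelta(hours=5, minutes=30))
detected = datetime(2026, 1, 10, 23, 15, tzinfo=IST)  # hypothetical incident
print(reporting_deadline(detected).isoformat())  # 2026-01-11T05:15:00+05:30
```

Posting the deadline into the incident channel the moment an alert fires gives the escalation owner a concrete target instead of a vague sense of urgency.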

6. VAPT: Not Once, Not Annually, Continuously

Vulnerability Assessment and Penetration Testing requirements:

Vulnerability Assessment: Every 6 months minimum
Penetration Testing: At least annually, by a CERT-In empanelled auditor
Best practice: Quarterly VAPT, plus after major app or infrastructure changes
Must be performed before regulatory audits and before onboarding banking partners

Integrate vulnerability scanning into your CI/CD pipeline. Tools like Trivy for container scanning, Snyk for dependency vulnerabilities, and OWASP ZAP for web application testing should run on every deployment. The formal CERT-In audit happens annually, but you should be catching issues continuously.
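
As one way to wire a scanner into a pipeline, the sketch below reads a Trivy report produced with --format json and collects the findings that should fail a build. It assumes the report has a top-level Results list whose entries carry a Vulnerabilities list with VulnerabilityID and Severity fields, which is my understanding of Trivy's JSON layout; confirm it against the version you run. The function name, the blocking severities, and the sample report are illustrative.

```python
import json

BLOCKING = {"CRITICAL", "HIGH"}  # assumed policy: these severities fail the build


def blocking_findings(report):
    """Return (target, vulnerability id, severity) for each blocking finding."""
    found = []
    for result in report.get("Results") or []:
        # "Vulnerabilities" can be missing or null when a target is clean.
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING:
                found.append(
                    (result.get("Target"), vuln.get("VulnerabilityID"), vuln["Severity"])
                )
    return found


# Hand-written sample in the assumed shape, not real scanner output.
sample_report = json.loads("""
{"Results": [
  {"Target": "app:latest", "Vulnerabilities": [
    {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
    {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"}
  ]},
  {"Target": "package-lock.json", "Vulnerabilities": null}
]}
""")

findings = blocking_findings(sample_report)
print(len(findings), "blocking finding(s)")
```

A CI step would exit non-zero when the list is not empty. Trivy can also enforce this itself through its --severity and --exit-code options, which is simpler if you do not need custom logic.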

7. Business Continuity and Disaster Recovery

The RBI requires:

Board-approved BCP/DR plan
Documented data migration policy with audit trails
Regular DR testing (not just documentation, actual failover tests)
Defined Recovery Time Objective (RTO) and Recovery Point Objective (RPO)

If your DR plan is a document nobody has read since it was written, that is a compliance gap. Test it. Quarterly.

8. Vendor Risk Management

Every vendor that processes data for you is part of your compliance surface. RBI requires:

Security controls to prevent infiltration from vendor networks
Network segmentation between your environment and vendor access
Certified assurance from an independent auditor for vendors involved in critical processes
Regular vendor risk assessments

Your payment gateway, KYC provider, cloud hosting, SMS gateway, analytics tools: each one needs a risk assessment. If your vendor has a breach, it is your compliance problem.

The Annual Audit: What Happens

Every year, a CERT-In empanelled auditor will:

Review your IS (Information Security) policies and whether they are actually followed
Check encryption implementation across your infrastructure
Verify access controls, MFA, and privilege management
Test your incident response readiness
Validate data localization (is all payment data in India?)
Review VAPT reports and whether findings were remediated
Check BCP/DR documentation and testing evidence
Assess vendor risk management practices

The audit report goes to RBI's Regional Office. Material findings can trigger enforcement actions, restrictions on launching new products, or worse.

Penalties That Have Actually Been Enforced

This is not theoretical. RBI issued 79 enforcement actions in FY 2024-25:

Paytm penalized for KYC non-compliance, with additional FIU-IND penalty for AML violations
PhonePe fined Rs 21 lakh for PPI guideline violations
Four NBFCs fined Rs 76.6 lakh combined for P2P lending violations
Payment aggregators (PAs) that missed the December 2025 authorization deadline must wind down by February 2026

On top of RBI penalties, the DPDP Act adds penalties up to Rs 250 crore for data protection failures.

The 6 Mistakes I See in Every Fintech Audit

Wrong cloud region. Payment data on us-east-1. This is the most expensive mistake to fix after the fact.
No MFA on the AWS root account. First thing every auditor checks. Takes 5 minutes to fix.
Production database accessible from the internet. Security groups with 0.0.0.0/0 on port 5432 or 3306.
No audit logging. CloudTrail not enabled, or enabled but nobody reviews the logs.
VAPT reports with open critical findings. Getting the test done is not enough. You must remediate the findings.
"We will do compliance later." By the time a banking partner asks for your audit report, it is too late to start.

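The third mistake in the list above, a database port open to the internet, is easy to check for in code. This Python sketch works on the parsed JSON from aws ec2 describe-security-groups and flags inbound rules that expose port 5432 or 3306 to 0.0.0.0/0 or ::/0. The function names and sample group IDs are hypothetical, and the port list covers only the two engines named above.

```python
DB_PORTS = {5432: "PostgreSQL", 3306: "MySQL"}


def _open_to_world(permission):
    """True if the rule allows traffic from any IPv4 or IPv6 address."""
    v4 = any(r.get("CidrIp") == "0.0.0.0/0" for r in permission.get("IpRanges", []))
    v6 = any(r.get("CidrIpv6") == "::/0" for r in permission.get("Ipv6Ranges", []))
    return v4 or v6


def _covers(permission, port):
    """True if the rule's protocol and port range include the given TCP port."""
    protocol = permission.get("IpProtocol")
    if protocol == "-1":  # "-1" means all protocols and all ports
        return True
    if protocol != "tcp":
        return False
    return permission.get("FromPort", 0) <= port <= permission.get("ToPort", 65535)


def exposed_database_ports(response):
    """Return (GroupId, port, engine) for each world-open database port."""
    findings = []
    for group in response.get("SecurityGroups", []):
        for permission in group.get("IpPermissions", []):
            if not _open_to_world(permission):
                continue
            for port, engine in DB_PORTS.items():
                if _covers(permission, port):
                    findings.append((group["GroupId"], port, engine))
    return findings


# Illustrative sample data, not real security groups.
sample_groups = {"SecurityGroups": [
    {"GroupId": "sg-0db", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 5432, "ToPort": 5432,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0web", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
         "IpRanges": [{"CidrIp": "0.0.0.0/0"}]}]},
    {"GroupId": "sg-0int", "IpPermissions": [
        {"IpProtocol": "tcp", "FromPort": 3306, "ToPort": 3306,
         "IpRanges": [{"CidrIp": "10.0.0.0/16"}]}]},
]}

print(exposed_database_ports(sample_groups))
```

Only sg-0db is flagged in the sample: the web group exposes 443, which is expected, and the internal group restricts MySQL to a private range.
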
Start Here

If you are a fintech startup preparing for your first RBI audit, or a growing platform that knows the infrastructure has gaps, here is what to do this week:

Verify all payment data is on India-region servers
Enable MFA on every admin account
Turn on CloudTrail and GuardDuty (or equivalent)
Check for unencrypted storage volumes
Document your incident response process

If you want someone to do a full audit and tell you exactly where the gaps are, book a free 20-minute infrastructure review. We specialize in getting fintech startups audit-ready in 4-6 weeks.

MatrixGard is a DevSecOps consultancy for funded startups. See our services or view pricing.

DPDP Act Compliance for Startups: What Your Dev Team Needs to Build Before May 2027

noreply@matrixgard.com (Avinash S) — Sun, 05 Apr 2026 10:00:00 GMT

The Digital Personal Data Protection Act is not coming. It is here. The Rules were notified in November 2025, the Data Protection Board is operational, and full enforcement begins May 13, 2027. That gives your startup roughly 13 months to get compliant or face penalties that can reach INR 250 crore (about $30 million) per violation.

Most founders I talk to think this only applies to large enterprises. It does not. The DPDP Act applies to every business processing digital personal data in India, regardless of size. If your SaaS product collects user emails, if your fintech app stores KYC data, if your healthtech platform handles patient records, you are a Data Fiduciary under this law.

Here is what your dev team actually needs to build.

The Timeline You Cannot Ignore

The enforcement rolls out in three phases:

Phase 1 (November 2025, already live): The Data Protection Board of India is established and operational. Administrative provisions are in effect.

Phase 2 (November 2026): Consent Manager registration framework goes live. If your business acts as a consent intermediary, this is your deadline.

Phase 3 (May 13, 2027): Everything else. Consent requirements, Data Principal rights, security safeguards, breach notification, data retention and erasure, cross-border transfer rules. This is the date that matters for most startups.

The 18-month transition window from November 2025 sounds generous. It is not. Building consent infrastructure, auditing data flows, training teams, and implementing security safeguards takes longer than founders expect.

What the DPDP Act Actually Requires From Your Startup

1. Consent Management

Every time you collect personal data, you need explicit, informed, purpose-specific consent. Not a pre-ticked checkbox buried in your terms of service.

The requirements:

Consent must be free, specific, informed, and unambiguous
Each purpose needs separate consent (no bundling)
Withdrawal must be as easy as giving consent
You must provide a clear privacy notice listing exactly what data you collect and why
Consent records must be retained

If you process data from users under 18, you need verifiable parental consent. OTP to parent's mobile, identity document upload, digital signature, or Aadhaar-based authentication. No exceptions.
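
For the dev team, the requirements above point to a per-purpose, append-only consent log rather than a single boolean on the user record. The following is a minimal in-memory sketch of that idea in Python. The class names, field names, and purposes are hypothetical, and a real system would persist the events in a database.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ConsentEvent:
    user_id: str
    purpose: str          # one event per purpose, never a bundle
    action: str           # "granted" or "withdrawn"
    notice_version: str   # which privacy notice the user saw
    at: datetime


@dataclass
class ConsentLedger:
    events: list = field(default_factory=list)

    def record(self, user_id, purpose, action, notice_version, at=None):
        """Append a consent event; nothing is ever updated or deleted."""
        if action not in ("granted", "withdrawn"):
            raise ValueError("action must be 'granted' or 'withdrawn'")
        when = at or datetime.now(timezone.utc)
        self.events.append(ConsentEvent(user_id, purpose, action, notice_version, when))

    def has_consent(self, user_id, purpose):
        """True only if the most recent event for this purpose is a grant."""
        latest = None
        for event in self.events:
            if event.user_id == user_id and event.purpose == purpose:
                latest = event
        return latest is not None and latest.action == "granted"


ledger = ConsentLedger()
ledger.record("u1", "billing", "granted", "notice-v3")
ledger.record("u1", "marketing", "granted", "notice-v3")
ledger.record("u1", "marketing", "withdrawn", "notice-v3")
print(ledger.has_consent("u1", "billing"), ledger.has_consent("u1", "marketing"))
```

Withdrawal is the same single call as granting, which is one way to satisfy the rule that withdrawing must be as easy as consenting. Keeping every event rather than overwriting a flag is what gives you the retained consent record.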

2. Security Safeguards

This is where the biggest penalty sits: INR 250 crore for failure to implement "reasonable security safeguards." The Rules specify:

Encryption of data at rest and in transit
Access controls with access logs and regular reviews
Intrusion detection systems
Data masking and obfuscation
Regular data backups
Data retention for minimum 1 year for breach investigation

If you are running a startup on AWS or Azure, this translates to: enable encryption everywhere, implement IAM properly, set up CloudTrail or Azure Monitor, configure alerts, and actually review access logs. Most startups I audit have none of this in place.

3. Breach Notification

When (not if) a breach happens, you have two deadlines:

Immediately: First intimation to the Data Protection Board and affected individuals. No delay.
Within 72 hours: Detailed report including what happened, what data was affected, and what you are doing about it.

Without automated detection tools and pre-built incident response templates, most startups will miss the 72-hour window. Build this infrastructure now, not after the breach.

4. Data Principal Rights

Your users have the right to:

Access a summary of their personal data and know who you have shared it with
Correct inaccurate data
Request erasure when the purpose is fulfilled
Withdraw consent at any time
File complaints with the Data Protection Board

You need to build these capabilities into your product. A "delete my data" button is not optional anymore.

5. Data Inventory

You cannot comply with a law about data protection if you do not know what data you have. Map every piece of personal data your startup collects: what data, where stored, who accesses, which vendors touch it, how long you retain it, and whether you can delete it on request.

Every vendor processing personal data for you is part of your risk surface.

The Penalty Table

These are per violation, per instance. A single incident can trigger multiple penalties:

Violation	Maximum Penalty
Failure to implement security safeguards	INR 250 crore (~$30M)
Failure to notify breach within 72 hours	INR 200 crore (~$24M)
Breach of children's data obligations	INR 200 crore (~$24M)
Breach of Significant Data Fiduciary obligations	INR 150 crore (~$18M)
Any other Data Fiduciary violation	INR 50 crore (~$6M)

The Board considers: gravity of breach, data sensitivity, whether it was repeated, what mitigation efforts were taken, and proportionality to your turnover. Being a startup does not give you a pass, but showing good-faith compliance efforts matters.

DPDP Act vs GDPR: Key Differences

If you are already GDPR compliant, you are not automatically DPDP compliant. Critical differences:

No "legitimate interests" basis. Under GDPR, you can process data without consent if you have a legitimate business reason. Under DPDP, it is consent or nothing (with narrow exceptions).
All breaches must be reported. GDPR only requires notification for breaches that risk individual rights. DPDP requires notification for every breach, regardless of severity.
Children's age threshold is 18. GDPR allows 13-16 depending on the member state. DPDP says 18 across the board.
Consent Managers are a new concept. GDPR has no equivalent. DPDP creates registered intermediaries specifically for consent management.
No data portability right. Unlike GDPR, DPDP does not include the right to data portability.
Cross-border transfers use a blacklist model. GDPR requires approved countries (whitelist). DPDP allows transfers everywhere unless a country is specifically restricted.

The 7 Mistakes Startups Make With DPDP Compliance

Assuming it is only for big companies. It is not. Every business processing digital personal data in India is covered.
Copy-pasting a GDPR privacy policy. The consent and notice requirements are different. Generic policies will not satisfy the itemized disclosure requirements.
Bundling consent. "By signing up, you agree to everything" is non-compliant. Each processing purpose needs separate consent.
No data inventory. If you do not know what personal data you have, where it is, and who can access it, you cannot comply.
Ignoring vendor risk. Your AWS account, analytics tools, CRM, payment processor: every third party that touches user data is your responsibility.
No breach response plan. The 72-hour notification window starts from when the breach is detected. Without automated detection and pre-built templates, you will miss it.
Treating security as a Phase 2 problem. The highest penalty (INR 250 crore) is for inadequate security safeguards. This is not something you bolt on later.

Your 6-Month Compliance Roadmap

Month 1: Data Discovery

Complete data inventory: what personal data, where stored, who accesses, which vendors
Map data flows across your application and infrastructure
Identify gaps in your current privacy notice

Month 2: Consent Infrastructure

Build purpose-specific consent collection
Implement consent withdrawal mechanism
Create itemized privacy notice per DPDP requirements
If handling children's data, implement parental consent verification

Month 3: Security Hardening

Enable encryption at rest and in transit across all services
Implement proper IAM with least-privilege access
Set up access logging and monitoring
Configure intrusion detection

Month 4: Breach Response

Build automated breach detection
Create incident response playbook with clear roles
Prepare notification templates for the Board and affected users
Run a tabletop exercise

Month 5: Data Principal Rights

Build data access, correction, and deletion capabilities
Create user-facing dashboard for consent management
Test the full lifecycle: user requests data, receives it, requests deletion, data is deleted

Month 6: Audit and Documentation

Internal compliance audit
Document everything (the Board wants to see evidence of good-faith effort)
Train team members who handle personal data
Set up ongoing monitoring and review cadence

Do Not Wait Until 2027

The startups that start now will be compliant by May 2027. The startups that wait will be scrambling, cutting corners, and hoping the Board does not come knocking.

If you want a clear picture of where your startup stands today, book a free 20-minute infrastructure review. We will tell you exactly what is broken and what it costs to fix. No pitch, just a practical assessment.

MatrixGard helps funded startups get audit-ready in 4-6 weeks. See how we work or view our pricing.

AWS IAM Audit for Startups: A Step-by-Step Guide to Finding and Fixing Risky Permissions

noreply@matrixgard.com (Avinash S) — Thu, 26 Mar 2026 13:02:17 GMT

Most startups don't have an IAM problem. They have ten IAM problems, and they don't know about any of them. A developer needed S3 access six months ago, got AdministratorAccess because it was faster, and that credential is still active. A Lambda function has a role that can write to every DynamoDB table in the account. An intern who left in March still has a login. This is the normal state of AWS IAM at a Series A company, and it is a serious liability.

This guide walks you through an AWS IAM audit for your startup using the AWS CLI and the IAM console. No paid tools required to start. You will know exactly what to look for, what to fix first, and what mistakes to avoid.

Why IAM Audits Matter More at Startups

Larger companies have dedicated security teams running automated compliance checks. Startups move fast, give developers broad access to unblock them, and rarely clean up afterward. That combination means your AWS blast radius, the scope of damage an attacker can do with one compromised credential, is usually much larger than it should be.

IAM misconfigurations are consistently in the top causes of AWS-related breaches. Stolen credentials with overly broad permissions turn a phishing email or a leaked .env file into a full account compromise. An audit does not take days. A focused review takes two to four hours and can significantly reduce your exposure.

Step 1: Generate the IAM Credential Report

Start here. Run this command to generate a CSV of every IAM user, their last activity, and whether MFA is enabled:

aws iam generate-credential-report

Then download it:

aws iam get-credential-report --query Content --output text | base64 -d > iam_report.csv

Open the CSV and look for three things immediately. First, any user where password_enabled is true and password_last_used is more than 90 days ago or shows no_information (the password has never been used). Those logins are dormant and should be disabled or deleted. Second, any user where mfa_active is false and password_enabled is true. That is a human login without MFA, which is unacceptable. Third, any active access key where access_key_1_last_used_date or access_key_2_last_used_date is older than 90 days or shows N/A. Rotate or delete it.
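
Reading the CSV by eye stops working once you have more than a handful of users. Here is a minimal Python sketch of the same three checks. It assumes the column names used by the credential report and treats N/A and no_information as "never used". The function names, finding labels, and sample rows are illustrative, and the sample contains only the columns the checks need.

```python
import csv
import io
from datetime import datetime, timedelta, timezone

NO_DATE = {"N/A", "no_information", "not_supported", ""}


def _older_than(value, cutoff):
    """True if the timestamp is missing or earlier than the cutoff."""
    if value in NO_DATE:
        return True
    return datetime.fromisoformat(value.replace("Z", "+00:00")) < cutoff


def audit_credential_report(csv_text, now, max_age_days=90):
    """Return (user, finding) pairs for dormant logins, missing MFA, stale keys."""
    cutoff = now - timedelta(days=max_age_days)
    findings = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        user = row["user"]
        has_password = row.get("password_enabled") == "true"
        if has_password and _older_than(row.get("password_last_used", "N/A"), cutoff):
            findings.append((user, "dormant-login"))
        if has_password and row.get("mfa_active") == "false":
            findings.append((user, "no-mfa"))
        for n in ("1", "2"):  # each user can have two access keys
            active = row.get(f"access_key_{n}_active") == "true"
            last_used = row.get(f"access_key_{n}_last_used_date", "N/A")
            if active and _older_than(last_used, cutoff):
                findings.append((user, f"stale-access-key-{n}"))
    return findings


# Illustrative rows, not a real report.
sample_report = "\n".join([
    "user,password_enabled,password_last_used,mfa_active,"
    "access_key_1_active,access_key_1_last_used_date,"
    "access_key_2_active,access_key_2_last_used_date",
    "alice,true,2026-03-20T09:00:00+00:00,true,false,N/A,false,N/A",
    "bob,true,2025-10-01T09:00:00+00:00,false,true,2025-09-15T12:00:00+00:00,false,N/A",
    "ci-bot,false,N/A,false,true,2026-03-25T01:00:00+00:00,true,N/A",
])

now = datetime(2026, 3, 26, tzinfo=timezone.utc)
for user, finding in audit_credential_report(sample_report, now):
    print(user, finding)
```

One caveat: a key created yesterday and not yet used is flagged as stale by this logic, so check the key's rotation date before deleting anything.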

Step 2: Find Overprivileged Users and Roles

Run this to list all users with attached managed policies:

aws iam list-users --query 'Users[*].UserName' --output text | tr '\t' '\n' | while read -r user; do echo "== $user"; aws iam list-attached-user-policies --user-name "$user" --query 'AttachedPolicies[].PolicyName' --output text; done

You are specifically looking for AdministratorAccess or PowerUserAccess attached to any user who is not a break-glass emergency account. If a developer has AdministratorAccess for day-to-day work, that is the first thing to fix.

For roles, do the same check:

aws iam list-roles --query 'Roles[*].RoleName' --output text | tr '\t' '\n' | while read -r role; do echo "== $role"; aws iam list-attached-role-policies --role-name "$role" --query 'AttachedPolicies[].PolicyName' --output text; done

Pay close attention to roles used by Lambda functions, ECS tasks, and EC2 instances. These are frequently over-permissioned because they were set up quickly and never revisited.
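
If you collect those responses into a file, the filtering step can be scripted. The sketch below assumes a dict that maps each user or role name to the parsed JSON from list-attached-user-policies or list-attached-role-policies. The function name, the break_glass parameter, and the sample principals are hypothetical.

```python
BROAD_POLICIES = {"AdministratorAccess", "PowerUserAccess"}


def overprivileged(attachments, break_glass=()):
    """Return (principal, policy) pairs where a broad managed policy is attached.

    `attachments` maps a user or role name to the parsed CLI response.
    Names listed in `break_glass` are skipped on purpose.
    """
    flagged = []
    for principal, response in sorted(attachments.items()):
        if principal in break_glass:
            continue
        for policy in response.get("AttachedPolicies", []):
            if policy["PolicyName"] in BROAD_POLICIES:
                flagged.append((principal, policy["PolicyName"]))
    return flagged


# Illustrative sample data.
attachments = {
    "dev-anna": {"AttachedPolicies": [
        {"PolicyName": "AdministratorAccess",
         "PolicyArn": "arn:aws:iam::aws:policy/AdministratorAccess"}]},
    "break-glass": {"AttachedPolicies": [
        {"PolicyName": "AdministratorAccess",
         "PolicyArn": "arn:aws:iam::aws:policy/AdministratorAccess"}]},
    "ci-deploy": {"AttachedPolicies": [
        {"PolicyName": "AmazonS3ReadOnlyAccess",
         "PolicyArn": "arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess"}]},
}

print(overprivileged(attachments, break_glass={"break-glass"}))
```

This only looks at the two AWS managed policies named above. A customer managed policy with Action "*" on Resource "*" is just as broad and needs a separate check of the policy document.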

Step 3: Use IAM Access Analyzer

Enable IAM Access Analyzer in the IAM console if you have not already. External access analysis is free, and it will flag any resource policies that allow access from outside your AWS account or organization. Go to IAM, click Access Analyzer, create an analyzer for your account or organization, and review the findings. Any finding labeled as external access to an S3 bucket, KMS key, or Lambda function deserves immediate attention.

Step 4: Review Inline Policies and Old Roles

Inline policies are easy to miss because they do not show up in managed policy lists. Check them with:

aws iam list-user-policies --user-name YOURUSERNAME

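Roles and groups can carry inline policies too. The equivalent commands are aws iam list-role-policies --role-name and aws iam list-group-policies --group-name.

Also audit roles that have not been used recently. AWS records last role activity in the console under IAM, Roles. Sort by last activity and flag anything unused for 60 days or more for deletion.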

Common Mistakes Startups Make

Using the root account for anything operational. Create an admin IAM user or use AWS IAM Identity Center (formerly AWS SSO). Lock down root and store those credentials offline.
Sharing access keys across team members. Every person and every service should have its own credential. Shared keys make audit logs useless.
Attaching policies directly to users instead of groups or roles. This makes permissions impossible to manage at scale. Use groups for humans and roles for services.
Skipping the permission boundary on developer roles. If developers can create IAM roles themselves, they can escalate their own privileges. Use permission boundaries to cap what they can grant.
Never reviewing third-party cross-account roles. Every SaaS tool you connected to AWS may have a cross-account role sitting in your account with broad access. List them and verify they are still needed and still scoped correctly.
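
The last item in that list, third-party cross-account roles, can be checked from role trust policies. This sketch takes one role's AssumeRolePolicyDocument, as returned in the aws iam list-roles output, and lists the AWS account IDs other than your own that are allowed to assume it. The function name and the account IDs in the sample are made up.

```python
import re

# Matches a bare 12-digit account ID or an IAM/STS ARN that contains one.
ACCOUNT_RE = re.compile(r"^(?:arn:aws[a-z-]*:(?:iam|sts)::)?(\d{12})(?::.*)?$")


def external_accounts(trust_policy, own_account):
    """Return sorted external account IDs (or "*") trusted by this policy."""
    external = set()
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may not be in a list
        statements = [statements]
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if principal == "*":
            external.add("*")
            continue
        if not isinstance(principal, dict):
            continue
        aws_principals = principal.get("AWS", [])
        if isinstance(aws_principals, str):
            aws_principals = [aws_principals]
        for value in aws_principals:
            if value == "*":
                external.add("*")
                continue
            match = ACCOUNT_RE.match(value)
            if match and match.group(1) != own_account:
                external.add(match.group(1))
    return sorted(external)


# Illustrative trust policy with made-up account IDs.
trust = {"Version": "2012-10-17", "Statement": [
    {"Effect": "Allow", "Action": "sts:AssumeRole",
     "Principal": {"AWS": "arn:aws:iam::999988887777:root"}},
    {"Effect": "Allow", "Action": "sts:AssumeRole",
     "Principal": {"Service": "lambda.amazonaws.com"}},
    {"Effect": "Allow", "Action": "sts:AssumeRole",
     "Principal": {"AWS": ["111122223333", "arn:aws:iam::111122223333:role/deploy"]}},
]}

print(external_accounts(trust, own_account="111122223333"))  # ['999988887777']
```

For every account ID this returns, you should be able to name the vendor behind it. A "*" in the result means anyone can attempt to assume the role, subject to any conditions on the statement, and deserves a look the same day.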

Run this audit quarterly at minimum. If you are preparing for SOC 2 or a security review from an enterprise customer, you will need evidence that you do this regularly. A spreadsheet log of findings and remediations is enough to start.

Need help?

If you'd rather have someone do this for you, book a free 20-minute call with MatrixGard. We'll tell you what's broken and what it costs to fix.

Cloud Cost Optimization for Startups: Cut AWS Bills Fast

noreply@matrixgard.com (Avinash S) — Thu, 26 Mar 2026 13:00:38 GMT

Cloud bills have a way of sneaking up on you. One quarter you are running lean, and the next you are staring at a $40,000 AWS invoice wondering where it all went. For startups, that kind of surprise can derail a runway projection and trigger uncomfortable conversations with your board. The good news is that most cloud waste follows predictable patterns, and fixing them does not require a dedicated FinOps team.

Start With Visibility Before You Cut Anything

The single biggest mistake I see startup teams make is jumping straight to reserved instances or savings plans without first understanding where money is actually going. Turn on AWS Cost Explorer or the equivalent in your cloud provider and tag every resource by environment, team, and service. Without tagging, you are flying blind.

A practical first step: run this AWS CLI command to find untagged EC2 instances.

aws ec2 describe-instances --query 'Reservations[*].Instances[?!not_null(Tags)]'
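
Finding instances with no tags at all is only half the job, because an instance with a single Name tag still tells you nothing about ownership. The Python sketch below checks the parsed output of aws ec2 describe-instances for the three tag keys suggested above. The exact key names and the case-insensitive matching are my assumptions; adjust them to your own convention.

```python
REQUIRED_TAGS = ("environment", "team", "service")  # assumed naming convention


def missing_tags(response, required=REQUIRED_TAGS):
    """Return {InstanceId: [missing tag keys]} for instances lacking required tags."""
    report = {}
    for reservation in response.get("Reservations", []):
        for instance in reservation.get("Instances", []):
            # Untagged instances have no "Tags" key at all in the response.
            present = {tag["Key"].lower() for tag in instance.get("Tags", [])}
            missing = [key for key in required if key not in present]
            if missing:
                report[instance["InstanceId"]] = missing
    return report


# Illustrative sample data, not real instances.
sample_instances = {"Reservations": [{"Instances": [
    {"InstanceId": "i-0aaa", "Tags": [
        {"Key": "Environment", "Value": "prod"},
        {"Key": "Team", "Value": "payments"},
        {"Key": "Service", "Value": "api"}]},
    {"InstanceId": "i-0bbb", "Tags": [{"Key": "Name", "Value": "test-box"}]},
    {"InstanceId": "i-0ccc"},
]}]}

print(missing_tags(sample_instances))
```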

Once you have tagging in place, set up a weekly cost report delivered to a Slack channel. Visibility alone tends to change behavior. Engineers who see their service costs start making smarter decisions about instance sizes and data transfer.

Right-Size Your Compute First

Compute is almost always the largest line item for early-stage startups, and it is almost always over-provisioned. A team will launch a service on an m5.2xlarge during a high-traffic test and forget to scale it back down. That single instance running idle costs roughly $280 per month.

Use AWS Compute Optimizer or Datadog's infrastructure recommendations to find instances running below 20 percent CPU utilization for more than two weeks. Those are your first targets. Downsizing from an m5.2xlarge to an m5.large on a low-traffic internal service can save over $200 per month per instance.

Check CPU and memory utilization over a 30-day window, not just peak hours
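Consider Graviton-based instances (m7g, c7g), which are typically priced below comparable x86 instances; AWS cites up to 40 percent better price-performance, so benchmark your own workload before committing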
Use Spot Instances for batch jobs, data pipelines, and non-critical background workers
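
The right-sizing numbers above are simple arithmetic, shown here as a short Python sketch. The hourly rates are assumed us-east-1 Linux On-Demand prices and the 730-hour month is the usual billing approximation. Check both against the current pricing page before quoting a figure to your board.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year divided by 12

# Assumed us-east-1 Linux On-Demand hourly rates in USD; verify before use.
HOURLY_RATE = {"m5.2xlarge": 0.384, "m5.large": 0.096}


def monthly_cost(instance_type, count=1):
    """Approximate monthly On-Demand cost for always-on instances."""
    return round(HOURLY_RATE[instance_type] * HOURS_PER_MONTH * count, 2)


def monthly_savings(current, target, count=1):
    """Approximate monthly saving from moving `count` instances to `target`."""
    return round(monthly_cost(current, count) - monthly_cost(target, count), 2)


print(monthly_cost("m5.2xlarge"))                 # 280.32
print(monthly_savings("m5.2xlarge", "m5.large"))  # 210.24
```

Under those assumed rates, the idle m5.2xlarge comes to about $280 per month and the downsize saves about $210, in line with the figures quoted above.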

Storage Costs Compound Quietly

S3 buckets, EBS volumes, and RDS snapshots accumulate over time without anyone noticing. A startup I worked with was spending $3,200 per month on S3 alone, and nearly half of it was old build artifacts and test data nobody had touched in over a year.

Set lifecycle policies on every S3 bucket. For most engineering assets, moving objects to S3 Intelligent-Tiering after 30 days and to Glacier after 90 days cuts storage costs by 60 percent or more with zero code changes.
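
As a sketch of what that policy looks like, the Python function below builds a lifecycle configuration with the two transitions described above. The dict follows the shape accepted by the S3 PutBucketLifecycleConfiguration API, for example through boto3's put_bucket_lifecycle_configuration or the aws s3api put-bucket-lifecycle-configuration command. The rule ID, function name, and default prefix are hypothetical.

```python
import json


def lifecycle_configuration(prefix="", tiering_days=30, glacier_days=90):
    """Build a lifecycle configuration: Intelligent-Tiering first, Glacier later."""
    if glacier_days <= tiering_days:
        raise ValueError("glacier_days must be greater than tiering_days")
    return {
        "Rules": [{
            "ID": "tier-then-archive",     # hypothetical rule name
            "Status": "Enabled",
            "Filter": {"Prefix": prefix},  # an empty prefix matches every object
            "Transitions": [
                {"Days": tiering_days, "StorageClass": "INTELLIGENT_TIERING"},
                {"Days": glacier_days, "StorageClass": "GLACIER"},
            ],
        }]
    }


# Print the JSON you could save to a file and review before applying.
print(json.dumps(lifecycle_configuration(prefix="build-artifacts/"), indent=2))
```

Review the generated JSON against one bucket before rolling it out everywhere. Objects moved to Glacier take time and money to retrieve, so scope the prefix to data you are confident nobody reads.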

For RDS, audit your automated snapshot retention settings. The console default is 7 days, but teams raise it to the 35-day maximum and forget. Also check for unattached EBS volumes using:

aws ec2 describe-volumes --filters Name=status,Values=available

Available volumes are not attached to any instance. You are paying for storage that is doing nothing.

Data Transfer Is a Hidden Budget Killer

Data transfer fees are confusing by design, and they catch a lot of startup teams off guard. Traffic leaving AWS to the public internet costs $0.09 per GB in us-east-1. If your application is pulling data from S3 in one region and processing it in another, you are paying cross-region transfer fees on top of that.

Use gateway VPC endpoints for S3 and DynamoDB to eliminate NAT Gateway data processing charges on that traffic
Co-locate your compute and storage in the same region and availability zone where possible
Enable S3 Transfer Acceleration only when users are globally distributed, not as a default

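A single NAT Gateway processing 10 TB per month adds roughly $450 in processing fees alone, separate from the hourly charge. Routing S3 and DynamoDB traffic through gateway endpoints removes that cost for those services. Interface endpoints for other services carry their own hourly and per-GB charges, so compare the two before switching.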
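
The NAT Gateway figure above works out as follows. This Python sketch assumes a processing rate of $0.045 per GB, which is my understanding of the us-east-1 price, and reuses the $0.09 per GB internet egress rate quoted earlier. Treat both as inputs to verify against the pricing page.

```python
NAT_PROCESSING_PER_GB = 0.045   # assumed us-east-1 NAT Gateway processing rate, USD
INTERNET_EGRESS_PER_GB = 0.09   # us-east-1 egress rate quoted above, USD


def nat_processing_cost(gb_per_month):
    """Monthly NAT Gateway data processing cost, excluding the hourly charge."""
    return round(gb_per_month * NAT_PROCESSING_PER_GB, 2)


def internet_egress_cost(gb_per_month):
    """Monthly cost of traffic leaving AWS for the public internet."""
    return round(gb_per_month * INTERNET_EGRESS_PER_GB, 2)


ten_tb_in_gb = 10 * 1000  # using decimal terabytes, as in the estimate above
print(nat_processing_cost(ten_tb_in_gb))   # 450.0
print(internet_egress_cost(ten_tb_in_gb))  # 900.0
```

Run the same function over your actual NAT Gateway bytes-processed metric to see how much of the bill a gateway endpoint would remove.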

Build Cost Checks Into Your Engineering Workflow

Cloud cost optimization for startups is not a one-time audit. It is a habit. The teams that keep bills under control treat infrastructure spend the same way they treat security, which means they review it regularly and they catch regressions early.

Add Infracost to your Terraform pull requests so engineers see cost diffs before merging
Set alerts at 80 percent and 100 percent of your monthly budget in AWS Budgets, or use CloudWatch billing alarms if you prefer fixed dollar thresholds
Schedule a 30-minute monthly cost review with your lead engineer and someone from finance
Use AWS Budgets with service-level breakdowns so you can spot anomalies by resource type

The goal is not to make engineers afraid to provision resources. The goal is to make costs visible so that decisions are intentional. A startup that builds this muscle early will scale infrastructure spending in proportion to revenue instead of in spite of it.
