In this post, I summarize the key concepts and ideas I took away from the book “Building Multi-Tenant SaaS Architectures” by Tod Golding, which I started reading while working on the Tenant Management System at Money Forward.

The book provides a structured view of SaaS and multi-tenancy and clearly connects business requirements with technical architecture decisions. It broadened my understanding of the SaaS ecosystem, especially the non-obvious challenges engineers face when designing, scaling, and evolving multi-tenant systems.

This post focuses on those architectural concepts from a developer’s point of view.

Getting into SaaS

Classic Software Model vs SaaS

In the classic software model, applications were deployed on customer-owned infrastructure. Each customer environment was independently configured, deployed, and operated. Vendors shipped versions; customers ran them.

This led to per-customer system administration, slow and risky upgrades, version fragmentation, and limited runtime visibility for vendors. Operational effort scaled linearly with the number of customers. The model optimized for delivery and sales, not for scalability or continuous operation.

SaaS Changes the Model: It’s Tenants Now, Not Customers

SaaS moves applications to vendor-managed infrastructure and replaces customers with tenants. The platform owns deployment, upgrades, monitoring, and reliability.

This enables centralized control, shared infrastructure, continuous delivery, faster feature rollout, and flexible pricing models. At the same time, complexity shifts into the platform. Every request must be tenant-aware. Isolation, noisy-neighbor prevention, tenant context resolution, and non-disruptive deployments become core engineering concerns.

SaaS complexity is runtime and architectural, not deployment-time.

SaaS Is More Than Shared Infrastructure

SaaS is not just applications running on shared infrastructure.

graph TD
    subgraph CP["Control Plane (cross-tenant)"]
        ONB[Onboarding]
        TM[Tenant Management]
        IAM[Identity & Access]
        BILL[Billing & Metering]
        OBS[Observability]
    end

    subgraph AP["Application Plane (tenant-scoped)"]
        SVC[Product Services]
        DATA[(Tenant Data)]
    end

    T1[Tenant A] -->|request| AP
    T2[Tenant B] -->|request| AP
    T3[Tenant C] -->|request| AP

    CP -->|lifecycle & policy| AP
    SVC --> DATA

A SaaS platform includes control-plane systems such as onboarding, tenant management, identity, billing, metering, and observability. All of these must be designed for multi-tenancy, even if the application itself is simple.

SaaS is a platform-driven business model that enables continuous delivery, operational efficiency, and frictionless tenant lifecycle management. The defining property of SaaS is not “software delivered online,” but the ability to operate, evolve, and scale a single system safely across many tenants. Hybrid models such as managed service providers (MSPs) still exist. They combine SaaS-style centralized operations with customer-installed or tenant-specific deployments, usually due to legacy or regulatory constraints.

Breaking Down SaaS Multi-Tenant Architecture

Multi-tenant SaaS architectures are typically divided into two major planes:

Control plane
Application plane

The diagram below shows how the two planes differ and how they interact.

graph LR
    subgraph CP["Control Plane"]
        direction TB
        ONB[Onboarding & Lifecycle]
        IAM[Identity & Access]
        BILL[Billing & Metering]
        GOV[Governance & Metrics]
    end

    subgraph AP["Application Plane"]
        direction TB
        GW[Ingress / Gateway]
        SVC[Business Services]
        DB[(Tenant Data)]
        GW --> SVC --> DB
    end

    CP -- "tenant config\npolicies & tiers" --> AP
    AP -- "usage metrics\naudit events" --> CP

Control Plane

The control plane handles cross-tenant and operational concerns. It owns tenant lifecycle and platform-wide policies but does not implement tenant-specific business logic.

Typical responsibilities include tenant onboarding and lifecycle management, identity and access management, billing and metering, entitlement and tier management, system-wide metrics, and governance. The control plane is tenant-aware but intentionally generic, allowing it to operate uniformly across all tenants.

Application Plane

The application plane delivers product functionality while enforcing tenant boundaries. This is where most multi-tenancy complexity surfaces at runtime.

Key concerns include tenant context propagation, isolation guarantees, data partitioning, tenant-aware routing (especially in siloed or hybrid deployments), and context-aware business logic. In siloed models, the application plane may also manage tenant-specific deployments or routing to isolated stacks.

The primary challenge in this plane is maintaining correctness and isolation without degrading performance or developer velocity.

User Roles

SaaS platforms typically define three distinct user categories:

tenant users who consume the product
tenant admins who manage users and configuration within a tenant, and
system admins who operate the platform across all tenants.

graph TD
    SA[System Admin] -->|platform ops| ADMIN
    TA[Tenant Admin] -->|user + config mgmt| AC
    TU[Tenant User] -->|product usage| APP

    subgraph CP["Control Plane"]
        ADMIN[Admin Console]
        ONB[Onboarding]
        IAM[Identity]
        BILL[Billing]
    end

    subgraph AP["Application Plane"]
        AC[Tenant Admin Console]
        APP[SaaS Application]
        DATA[(Tenant Data)]
    end

    ADMIN -->|lifecycle + policies| APP
    APP --> DATA

System admins usually require a dedicated admin interface to troubleshoot issues, manage tenants, and handle escalations. Tenant admins often use a separate admin console to manage users and settings across one or more SaaS applications. Depending on coupling and ownership, these admin interfaces may live in the control plane or the application plane.

note: There is no single correct multi-tenant architecture. Most SaaS systems evolve over time, shifting responsibilities between planes as scale, compliance, and operational constraints change.

Deployment Models

SaaS deployment models define how tenants are isolated and how infrastructure is shared. Most platforms evolve toward hybrids rather than using a single model. At a high level, deployments fall into siloed (isolated) and pooled (shared) categories.

Full-Stack Siloed

Each tenant runs in a fully isolated application stack, typically including separate deployments and databases. This provides strong isolation, a clear blast radius, and simple cost attribution. The downside is higher infrastructure cost, slower onboarding, and increased operational complexity. This model is usually reserved for regulated or high-value enterprise tenants.

Full-Stack Pooled

All tenants share the same application stack, with isolation enforced at the application and data layers using tenant context. This model optimizes for cost and scalability but requires strict tenant-scoped authorization, data isolation, rate limiting, and protection against noisy-neighbor issues. Failures have a larger blast radius.

Hybrid

Most tenants run on pooled infrastructure, while selected tenants are fully siloed due to compliance, performance, or contractual requirements. This model balances cost and isolation but introduces complexity in onboarding, routing, and operations.

Mixed-Mode

Only specific components are isolated, such as databases or compute-heavy services, while the rest of the stack remains pooled. This reduces cost compared to full siloing but requires well-defined service boundaries to avoid isolation leaks.

Pod-Based

Tenants are grouped into pods, each running a shared stack with fixed capacity. New pods are added as the system scales. This limits blast radius while preserving many benefits of pooled deployments and aligns well with Kubernetes clusters or cloud account boundaries.
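As a rough illustration of pod-based placement, here is a minimal Python sketch. It assumes a first-fit policy and a fixed tenant count per pod; `PodPlacer`, `capacity_per_pod`, and the `pod-N` naming are hypothetical and not taken from the book.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    capacity: int                       # fixed tenant count per pod (assumed)
    tenants: list[str] = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.tenants) < self.capacity


class PodPlacer:
    """Places each tenant on the first pod that still has spare capacity."""

    def __init__(self, capacity_per_pod: int) -> None:
        self.capacity_per_pod = capacity_per_pod
        self.pods: list[Pod] = []
        self.placement: dict[str, str] = {}    # tenant_id -> pod name

    def place(self, tenant_id: str) -> str:
        # Idempotent: a tenant that is already placed keeps its pod.
        if tenant_id in self.placement:
            return self.placement[tenant_id]
        pod = next((p for p in self.pods if p.has_room()), None)
        if pod is None:
            # Scale out by adding a pod rather than growing an existing one.
            pod = Pod(f"pod-{len(self.pods) + 1}", self.capacity_per_pod)
            self.pods.append(pod)
        pod.tenants.append(tenant_id)
        self.placement[tenant_id] = pod.name
        return pod.name
```

A real placer would likely weigh tenant size or tier rather than a plain tenant count, but the fixed capacity is what bounds the blast radius of each pod.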

Onboarding and Identity

Onboarding and identity belong to the control plane. Before the application goes into production, the platform must reliably create tenants, provision required resources, and establish tenant boundaries.

Onboarding as a Workflow

Onboarding is not a single operation. It is typically triggered by an internal admin console or via self-service signup. The flow creates tenant metadata, assigns plan and tier, provisions infrastructure (depending on silo or pooled model), initializes identity and access, and configures quotas or billing.

Because onboarding is distributed and failure-prone, each step must be idempotent and tracked using explicit states. A tenant moves through: provisioning → active → suspended → deactivated → decommissioned. Invalid transitions (jumping from provisioning to deactivated, for example) should be rejected. Each transition must be logged with who triggered it and why.

State transitions must be idempotent and auditable. Decommissioning must comply with data retention policies, which usually means a grace period before actual deletion.
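The lifecycle rules above can be sketched as a small state machine. This is an illustrative Python sketch, not the book's implementation: the `ALLOWED` table is an assumption (in particular the suspended → active edge for reinstating a tenant), and the in-memory `audit_log` stands in for a durable audit store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Allowed transitions. The suspended -> active edge is an assumption;
# the other edges follow the linear lifecycle described in the text.
ALLOWED = {
    "provisioning": {"active"},
    "active": {"suspended"},
    "suspended": {"active", "deactivated"},
    "deactivated": {"decommissioned"},
    "decommissioned": set(),
}


class InvalidTransition(Exception):
    pass


@dataclass
class Tenant:
    tenant_id: str
    state: str = "provisioning"
    audit_log: list[dict] = field(default_factory=list)

    def transition(self, target: str, actor: str, reason: str) -> bool:
        """Move to `target`. Returns False if already there (idempotent no-op)."""
        if target == self.state:
            return False
        if target not in ALLOWED[self.state]:
            raise InvalidTransition(f"{self.state} -> {target}")
        # Record who triggered the change and why before applying it.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "tenant_id": self.tenant_id,
            "from": self.state,
            "to": target,
            "actor": actor,
            "reason": reason,
        })
        self.state = target
        return True
```

Retrying `transition("active", ...)` after a timeout is safe because the second call is a no-op, while `provisioning → deactivated` raises `InvalidTransition`.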

Tenant-Aware Identity

sequenceDiagram
    participant U as User
    participant RP as SaaS App (RP)
    participant IDP as IdP (OIDC)
    participant TR as Tenant Resolver
    participant SVC as Service

    U->>RP: Login
    RP->>IDP: OIDC Authorization Request
    IDP-->>U: Authenticate
    U-->>IDP: Credentials
    IDP->>TR: Resolve tenant context
    TR-->>IDP: tenant_id, tier, roles
    IDP-->>RP: ID Token (JWT)<br/>sub + email + tenant_id + tier + roles
    RP->>SVC: API Request + JWT
    SVC->>SVC: Validate JWT<br/>Extract tenant context
    SVC-->>RP: Tenant-scoped Response

Identity in SaaS must carry tenant context. Most platforms use OIDC and issue JWTs that include both user identity and tenant context via custom claims: tenant_id, tenant_tier, roles, and permissions alongside the standard sub and email. This lets services enforce tenant-scoped authorization without additional lookups on every request.
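To make the claim layout concrete, here is a standard-library-only Python sketch that signs and verifies an HS256 token and turns its claims into a tenant context. It is a teaching aid under stated assumptions: `sign_hs256` stands in for the IdP, the shared secret is hypothetical, and checks such as `exp`, `iss`, `aud`, and the `alg` header are omitted. A production service would use a vetted JWT library and typically the IdP's published asymmetric keys.

```python
import base64
import hashlib
import hmac
import json
from dataclasses import dataclass


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _b64url_decode(text: str) -> bytes:
    # Restore the padding that the compact JWT encoding strips.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def sign_hs256(claims: dict, secret: bytes) -> str:
    """Stand-in for the IdP: issues a compact HS256 JWT."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    sig = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(sig)}"


@dataclass(frozen=True)
class TenantContext:
    user_id: str
    tenant_id: str
    tier: str
    roles: tuple[str, ...]


def tenant_context_from_jwt(token: str, secret: bytes) -> TenantContext:
    """Verifies the signature, then reads the tenant claims."""
    header, payload, sig = token.split(".")
    expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(expected, _b64url_decode(sig)):
        raise ValueError("invalid signature")
    claims = json.loads(_b64url_decode(payload))
    return TenantContext(
        user_id=claims["sub"],
        tenant_id=claims["tenant_id"],
        tier=claims["tenant_tier"],
        roles=tuple(claims.get("roles", [])),
    )
```

Because the tenant claims travel inside the signed token, a service can build its `TenantContext` without calling the tenant management service on every request.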

In more flexible setups, tenant context is resolved separately from authentication. The IdP authenticates the user; a dedicated service resolves tenant membership and roles. This simplifies multi-tenant users and multi-IdP support.

For enterprise customers, SaaS platforms typically support federated identity using SAML or OIDC. Since external IdPs rarely include tenant identifiers, tenant context must be resolved from IdP configuration, issuer, or domain mappings. SCIM is commonly used alongside federation for user and group provisioning but handles neither authentication nor tenant resolution.
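For the federated case, one possible shape of the issuer-to-tenant lookup is sketched below in Python. The `ISSUER_TO_TENANT` table and its URLs are hypothetical; the sketch assumes the external token has already been validated and that the control plane registers each enterprise IdP's issuer during onboarding.

```python
# Hypothetical mapping maintained by the control plane at onboarding time.
ISSUER_TO_TENANT = {
    "https://login.acme.example/oidc": "tenant-acme",
    "https://sso.globex.example": "tenant-globex",
}


def resolve_tenant_from_issuer(claims: dict) -> str:
    """Maps the `iss` claim of a validated external token to an internal tenant ID."""
    issuer = claims.get("iss", "").rstrip("/")
    try:
        return ISSUER_TO_TENANT[issuer]
    except KeyError:
        # Fail closed: an unknown issuer never falls back to a default tenant.
        raise LookupError(f"no tenant registered for issuer {issuer!r}") from None
```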

Tenant Management

Tenant management is a core part of the control plane. It covers onboarding, offboarding, billing, tenant attributes, tenant identity configuration, routing configuration (which depends on whether the tenant is siloed or pooled), and tenant users.

Tenant identifiers

Tenant identifiers should be globally unique and immutable (UUID). Avoid business identifiers like domain or org name, which may change.
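A small Python sketch of this separation between identity and business attributes follows. `TenantRecord` and `new_tenant` are illustrative names, and a random UUID (version 4) is assumed.

```python
import uuid
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class TenantRecord:
    tenant_id: uuid.UUID    # generated once and never derived from business data
    display_name: str       # business attribute that may change
    domain: str             # business attribute that may change


def new_tenant(display_name: str, domain: str) -> TenantRecord:
    return TenantRecord(uuid.uuid4(), display_name, domain)


acme = new_tenant("Acme Inc.", "acme.example")
# A rename yields a new record with the same identifier, so foreign keys,
# logs, and metrics that reference tenant_id stay valid.
renamed = replace(acme, display_name="Acme Corporation")
```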

Lifecycle

State transitions must be idempotent and auditable. Decommissioning requires compliance with data retention policies, which usually means a grace period before actual deletion.

Tier and Plan Changes

Tier changes are effectively infrastructure and policy migrations, not just billing updates. Common challenges:

Provisioning new resources (or moving from dedicated to shared)
Updating routing rules
Migrating existing tenant state safely

Feature flags help decouple rollout from deployment, but data shape and backward compatibility must be guaranteed. Complexity increases significantly in hybrid or siloed models, where a tier upgrade might mean spinning up a new isolated stack and migrating data.

Tenant Authentication and Routing

Tenant authentication and routing sit at the front door of the system. This is where tenant context is established, and that context shapes downstream authentication, authorization, and routing decisions.

Most SaaS platforms support one of the following domain patterns in addition to the platform’s own domain:

The subdomain per tenant model

Each tenant is mapped to a platform-managed subdomain (e.g., tenant-a.app.com). This simplifies certificate management, makes tenant resolution easy, and suits pooled architectures with centralized routing. The trade-off is that it limits the tenant’s brand ownership and makes stricter tenant isolation harder.

The vanity domain-per-tenant model

Here tenants bring their own custom domain (e.g., login.tenant-a.com). This requires DNS ownership verification, dynamic host-based routing, and a runtime lookup to resolve tenant context.

Note that domain-based tenant onboarding should be relatively lightweight: it should not generally require application redeploys or static routing rules.
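Both domain patterns reduce to a lookup on the request host. The Python sketch below assumes two hypothetical lookup tables that the tenant management service would own and update at onboarding time, which is what keeps onboarding free of redeploys. IPv6 host literals are not handled.

```python
from typing import Optional

PLATFORM_SUFFIX = ".app.com"    # platform-managed domain from the example above

# Hypothetical lookup tables, updated at onboarding without a redeploy.
SUBDOMAIN_TO_TENANT = {"tenant-a": "t-001"}
VANITY_TO_TENANT = {"login.tenant-a.com": "t-001", "portal.globex.example": "t-002"}


def resolve_tenant_from_host(host: str) -> Optional[str]:
    """Returns the tenant ID for a Host header value, or None if unknown."""
    host = host.lower().split(":")[0].rstrip(".")    # drop port and trailing dot
    if host.endswith(PLATFORM_SUFFIX):
        label = host[: -len(PLATFORM_SUFFIX)]
        # Accept exactly one label, so "a.b.app.com" is rejected.
        if label and "." not in label:
            return SUBDOMAIN_TO_TENANT.get(label)
        return None
    # Anything else is treated as a vanity domain and looked up at runtime.
    return VANITY_TO_TENANT.get(host)
```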

Tenant resolution

Some platforms try to resolve the tenant from user identifiers (e.g., email domain → tenant). This approach breaks when consultants or external users span multiple tenants, so production systems typically resolve tenant context before authentication. When tenant context cannot be derived deterministically (shared domains, cross-tenant users), the system must explicitly introduce tenant selection.
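Explicit tenant selection can be sketched as follows in Python. The `MEMBERSHIPS` table and the `Resolution` type are hypothetical; the point is only that the resolver never guesses when a user maps to more than one tenant.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical membership data: one user can belong to several tenants.
MEMBERSHIPS = {
    "dev@acme.example": ["tenant-acme"],
    "consultant@agency.example": ["tenant-acme", "tenant-globex"],
}


@dataclass(frozen=True)
class Resolution:
    tenant_id: Optional[str]            # set when resolution is deterministic
    choices: tuple[str, ...] = ()       # tenants to offer when it is not


def resolve_tenant(email: str, selected: Optional[str] = None) -> Resolution:
    tenants = MEMBERSHIPS.get(email.lower(), [])
    if selected is not None:
        # An explicit choice is honored only if the user is a member.
        if selected not in tenants:
            raise PermissionError(f"{email} is not a member of {selected}")
        return Resolution(tenant_id=selected)
    if len(tenants) == 1:
        return Resolution(tenant_id=tenants[0])
    # Zero or several memberships: do not guess, return the choices instead.
    return Resolution(tenant_id=None, choices=tuple(tenants))
```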

note: Federated IdPs introduce additional constraints, such as limits on custom claim injection, so tenant-specific authorization logic may need to live outside the IdP.

Routing Authenticated Tenants

Once tenant context is established, routing becomes a function of deployment model.

Serverless (API Gateway / Lambda): resolve tenant at the edge, enable per-tenant throttling and auth policies, and isolate execution paths. Per-tenant infrastructure is possible but increases provisioning complexity.

Container-based (Kubernetes): the Ingress layer resolves domain and injects tenant context as a header. For pooled tenants, a single Ingress routes all traffic and services extract X-Tenant-ID downstream. For siloed tenants, each tenant gets its own namespace with a NetworkPolicy that restricts ingress and egress to that namespace and shared services only. Envoy or Istio handles L7 routing, mTLS, and policy enforcement across both models.
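The routing decision itself can be sketched in a few lines of Python. This assumes a hypothetical convention of one `tenant-<id>` namespace per siloed tenant and a single `pooled` namespace for everyone else; in a real cluster the decision would usually live in Ingress or service-mesh configuration rather than application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantRoute:
    tenant_id: str
    model: str    # "pooled" or "siloed", taken from tenant configuration


def upstream_for(route: TenantRoute, service: str = "orders") -> tuple[str, dict]:
    """Returns the upstream URL and the headers to inject for a tenant."""
    # Tenant context is propagated the same way in both models.
    headers = {"X-Tenant-ID": route.tenant_id}
    if route.model == "siloed":
        namespace = f"tenant-{route.tenant_id}"    # hypothetical naming scheme
    elif route.model == "pooled":
        namespace = "pooled"                       # hypothetical shared namespace
    else:
        raise ValueError(f"unknown deployment model: {route.model}")
    # Kubernetes service DNS: <service>.<namespace>.svc.cluster.local
    return f"http://{service}.{namespace}.svc.cluster.local", headers
```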

The invariant across all models: tenant context must be resolved early, propagated consistently, and enforced everywhere.

Building Multi-Tenant Services

Start from the domain. SaaS systems align well with business subdomains, DDD, and event-driven patterns. Early decisions around compute (containers, serverless, dedicated) and storage (shared DB, schema-per-tenant, DB-per-tenant) directly define service architecture, scalability, and isolation guarantees.

Inside each service, tenant context is everywhere: request handling, logging, metrics, auth, and data access. Keep tenant resolution out of business logic. Centralize it using interception patterns: gateway, middleware, sidecar, or Lambda layer. The middleware extracts the JWT, validates it, and puts the resolved tenant context (tenant ID, tier, roles) into the request context. Business logic reads from that context and never parses tokens or headers directly.
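One way to implement this interception pattern in Python is with `contextvars`. The sketch below is framework-agnostic and makes several assumptions: requests are plain dicts, `verify_token` is an injected callable that validates the JWT and returns its claims, and the claim names follow the ones used earlier in this post.

```python
import contextvars
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantContext:
    tenant_id: str
    tier: str
    roles: tuple


_current_tenant = contextvars.ContextVar("current_tenant")


def current_tenant() -> TenantContext:
    try:
        return _current_tenant.get()
    except LookupError:
        raise RuntimeError("no tenant context; was the middleware skipped?") from None


def tenant_middleware(handler, verify_token):
    """Wraps a handler so that it runs with the tenant context set."""
    def wrapped(request: dict):
        raw = request["headers"]["Authorization"].removeprefix("Bearer ")
        claims = verify_token(raw)    # validation is delegated to the caller
        ctx = TenantContext(claims["tenant_id"], claims["tenant_tier"],
                            tuple(claims.get("roles", [])))
        token = _current_tenant.set(ctx)
        try:
            return handler(request)
        finally:
            # Always clear the context so it cannot leak into the next request.
            _current_tenant.reset(token)
    return wrapped


def list_orders(request: dict) -> str:
    # Business logic reads the context; it never parses tokens or headers.
    return f"orders for {current_tenant().tenant_id}"
```

Failing loudly in `current_tenant()` is a deliberate choice: a handler that runs without tenant context is a bug, and a silent default tenant would hide it.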

If your metrics are not broken down per tenant, your multi-tenancy design is incomplete. Emit tenant_id as a label on every service metric.
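The idea can be shown with a tiny in-memory counter in Python. `TenantMetrics` is a stand-in for a real metrics client, not an existing library API; it only illustrates that `tenant_id` is a mandatory label rather than an optional one.

```python
from collections import Counter


class TenantMetrics:
    """In-memory counter registry where tenant_id is a required label."""

    def __init__(self) -> None:
        self._counters: Counter = Counter()

    def inc(self, name: str, tenant_id: str, **labels: str) -> None:
        # tenant_id is a positional argument, so callers cannot forget it.
        all_labels = {"tenant_id": tenant_id, **labels}
        self._counters[(name, tuple(sorted(all_labels.items())))] += 1

    def by_tenant(self, name: str) -> dict[str, int]:
        """Sums a metric across all other labels, grouped by tenant."""
        totals: Counter = Counter()
        for (metric, labels), value in self._counters.items():
            if metric == name:
                totals[dict(labels)["tenant_id"]] += value
        return dict(totals)
```

With many tenants, a per-tenant label multiplies the number of time series, so label cardinality is worth watching in whichever metrics backend is used.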

The noisy-neighbor problem is inevitable unless you explicitly identify services that need stronger isolation and design or deploy them separately from the start.

Data Partitioning

Data partitioning directly impacts SLAs, tenant experience, blast radius, and operational complexity. Default to the pooled model and progressively move toward siloed isolation only when workloads, compliance, or performance demand it.

Over-isolating early leads to management overhead and poor storage utilization. Under-isolating increases noisy-neighbor risk. Throughput limits and throttling must be enforced per tenant regardless of model.
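Per-tenant throttling is commonly implemented with a token bucket per tenant. The Python sketch below is one such implementation under simple assumptions: a single process, in-memory state, and the same rate for every tenant. A real platform would likely derive the rate from the tenant's tier and keep the state in a shared store.

```python
import time


class TenantRateLimiter:
    """Token bucket per tenant: `rate_per_sec` refill, up to `burst` tokens."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic) -> None:
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock    # injectable so tests can control time
        self._buckets: dict[str, tuple[float, float]] = {}    # tenant -> (tokens, last)

    def allow(self, tenant_id: str) -> bool:
        now = self.clock()
        # A tenant seen for the first time starts with a full bucket.
        tokens, last = self._buckets.get(tenant_id, (float(self.burst), now))
        # Refill for the elapsed time, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        allowed = tokens >= 1.0
        if allowed:
            tokens -= 1.0
        self._buckets[tenant_id] = (tokens, now)
        return allowed
```

Because each tenant has its own bucket, one tenant exhausting its budget does not reduce what another tenant is allowed.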

Relational Databases

There are four main strategies, ordered by isolation strength:

Pooled schema with tenant_id column is the simplest. Every tenant shares the same tables, rows are tagged with tenant_id, and queries always filter by it. You must index tenant_id on every high-traffic table, or queries will degrade as the table grows. The blast radius of a bug is the entire database.
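A minimal sketch of the pooled schema in Python follows. It uses the standard-library `sqlite3` module purely as a stand-in for the real database, and the `orders` table and `TenantOrders` wrapper are hypothetical. The wrapper shows one way to make the tenant filter impossible to forget.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id        INTEGER PRIMARY KEY,
        tenant_id TEXT NOT NULL,
        item      TEXT NOT NULL
    );
    -- Lead the index with tenant_id so tenant-scoped queries avoid full scans.
    CREATE INDEX idx_orders_tenant ON orders (tenant_id, id);
""")


class TenantOrders:
    """Data-access wrapper that adds the tenant filter to every statement."""

    def __init__(self, conn: sqlite3.Connection, tenant_id: str) -> None:
        self._conn = conn
        self._tenant_id = tenant_id

    def add(self, item: str) -> int:
        cur = self._conn.execute(
            "INSERT INTO orders (tenant_id, item) VALUES (?, ?)",
            (self._tenant_id, item),
        )
        return cur.lastrowid

    def list(self) -> list[str]:
        rows = self._conn.execute(
            "SELECT item FROM orders WHERE tenant_id = ? ORDER BY id",
            (self._tenant_id,),
        )
        return [row[0] for row in rows]

    def get(self, order_id: int):
        # Filtering by id alone would leak rows across tenants.
        row = self._conn.execute(
            "SELECT item FROM orders WHERE tenant_id = ? AND id = ?",
            (self._tenant_id, order_id),
        ).fetchone()
        return row[0] if row else None
```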

Row-level security (RLS) in PostgreSQL adds a database-enforced isolation layer on top of the pooled schema. The application sets a session variable (app.current_tenant_id) on the connection for each request or transaction (with pooled connections, setting it only once when the connection is opened is not enough), and a policy on each table ensures queries only return rows matching that tenant. This is a strong defense-in-depth measure: even if application-level authorization has a bug, the database will not return rows from another tenant.
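The PostgreSQL statements involved can be sketched as follows. To keep one language for the examples in this post, they are shown as Python strings that a migration or database client would execute. The policy name `tenant_isolation`, a `uuid`-typed `tenant_id` column, and the `%s` placeholder style are assumptions.

```python
def rls_setup_sql(table: str) -> list[str]:
    """DDL that turns on row-level security for one pooled table.

    `table` must be a trusted identifier (for example from migration code),
    because it is interpolated directly into the statements.
    """
    return [
        f"ALTER TABLE {table} ENABLE ROW LEVEL SECURITY",
        # Without FORCE, the table owner bypasses the policy.
        f"ALTER TABLE {table} FORCE ROW LEVEL SECURITY",
        f"CREATE POLICY tenant_isolation ON {table} "
        "USING (tenant_id = current_setting('app.current_tenant_id')::uuid)",
    ]


# Executed inside each request's transaction with the tenant ID bound as a
# parameter. The third argument (true) limits the setting to that transaction,
# so it cannot leak to the next request that reuses a pooled connection.
SET_TENANT_SQL = "SELECT set_config('app.current_tenant_id', %s, true)"
```

Superusers and roles with the `BYPASSRLS` attribute are not subject to policies, so the application should connect with an ordinary role.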

Schema-per-tenant gives each tenant their own PostgreSQL schema. Migrations can be applied per-tenant independently, and backup and restore at tenant granularity is straightforward. The tradeoff is that schema management complexity scales linearly with tenant count.

Database-per-tenant is the siloed extreme, reserved for enterprise tiers or regulated workloads. It enables dedicated connection pools, separate backup schedules, and independent scaling.

NoSQL

For systems like DynamoDB, the partition key must include tenant_id to prevent hot partitions and enable per-tenant throttling. A composite key like TENANT#<tenant_id>#ORDER with a sort key of ORDER#<order_id> works well. Avoid using tenant_id as the sole partition key for large tenants, since all their traffic hits the same partition. Adding entity type to the key distributes load more evenly.
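The key layout above can be captured in a couple of helper functions. This Python sketch only builds the key strings; the attribute names `PK` and `SK` are a common convention assumed here, and no DynamoDB client call is shown.

```python
def partition_key(tenant_id: str, entity: str) -> str:
    """Builds TENANT#<tenant_id>#<ENTITY>, e.g. TENANT#t-001#ORDER."""
    if "#" in tenant_id or "#" in entity:
        # The delimiter must not appear inside a component, otherwise two
        # different tenants could produce the same key.
        raise ValueError("'#' is reserved as the key delimiter")
    return f"TENANT#{tenant_id}#{entity.upper()}"


def order_key(tenant_id: str, order_id: str) -> dict[str, str]:
    """Full primary key for one order item."""
    return {
        "PK": partition_key(tenant_id, "order"),
        "SK": f"ORDER#{order_id}",
    }
```

A query for all orders of a tenant then targets the single partition key value `TENANT#<tenant_id>#ORDER`, while other entity types for the same tenant live under different partition keys.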

Mixed-Mode in Practice

Most systems evolve into mixed-mode: pooled for standard tenants, siloed for enterprise tiers. The tier boundary is the migration boundary. Plan the data migration path before the first enterprise customer asks for it.

Tenant Isolation

Tenant isolation spans from full-stack isolation (dedicated infra per tenant) to item-level isolation within shared services. Stronger isolation improves safety and blast-radius control but increases cost and operational complexity.

At the application layer, isolation is enforced through authorization, RBAC, and tenant-aware access controls. Use interception patterns (middleware, gateways, sidecars) rather than scattering tenant checks through business logic.

Infrastructure isolation (separate clusters, namespaces, cloud accounts) provides stronger guarantees but reduces flexibility. Effective systems combine:

Deployment-time isolation: how services are deployed (namespace, cluster, account boundaries)
Runtime isolation: how requests are handled (middleware, auth, data access controls)

At scale, isolation policies must be explicitly managed, versioned, and observable. Drift is subtle and usually discovered during an incident. Build tenant isolation into your observability from day one, not after the first cross-tenant data exposure.

Every tenant-scoped authorization decision should emit an audit event carrying at minimum: timestamp, tenant ID, actor, action, resource, and whether it was allowed. This is what lets you audit access patterns, detect anomalies, and prove compliance.
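A minimal Python sketch of such an audit event follows. The `authorize` function applies only the simplest possible rule (the actor's tenant must own the resource), and `sink` is a hypothetical callable standing in for a log pipeline or event bus.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass(frozen=True)
class AuditEvent:
    timestamp: str
    tenant_id: str
    actor: str
    action: str
    resource: str
    allowed: bool


def authorize(actor_tenant_id: str, actor: str, action: str,
              resource_tenant_id: str, resource: str,
              sink: Callable[[str], None]) -> bool:
    """Checks the tenant boundary and emits an audit event either way."""
    allowed = actor_tenant_id == resource_tenant_id
    event = AuditEvent(
        timestamp=datetime.now(timezone.utc).isoformat(),
        tenant_id=actor_tenant_id,
        actor=actor,
        action=action,
        resource=resource,
        allowed=allowed,
    )
    # Denied decisions are logged too; they are often the interesting ones.
    sink(json.dumps(asdict(event), sort_keys=True))
    return allowed
```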

Conclusion

Building a multi-tenant SaaS system is fundamentally about taking ownership of complexity deliberately. SaaS shifts responsibility from customers to the platform, which means tenant awareness, isolation, data partitioning, routing, and lifecycle management are no longer optional; they are core system properties.

There is no single “correct” architecture. Real SaaS platforms evolve from pooled to hybrid and mixed models as scale, compliance, and customer value change. The key is to design for explicit tenant boundaries, centralized control planes, early tenant resolution, and measurable isolation, while keeping business logic clean and adaptable. If multi-tenancy is not observable, enforceable, and evolvable, it will eventually break, usually at scale, and always at the worst possible time.

Most SaaS outages are process failures, not code failures.

Building Multi-Tenant SaaS Applications

Published Jan 19, 2025 by bala in SaaS, MultiTenant at https://blog.balashekhar.me/blogs/building-multi-saas-apps/