Server Sizing Calculator
Estimate RAM and CPU requirements for web applications using Little's Law and memory footprint modelling, then match to real cloud instance families.
Section A – Memory Sizing
Total RAM = (OS baseline + workers × per-worker memory) × (1 + headroom %)
Section B – CPU Sizing (Little's Law)
Concurrent requests = RPS × (response time ms / 1000) | CPU cores = concurrent requests / (utilisation / 100)
Published: April 2026 | Author: TriVolt Editorial Team
Server Sizing: Why Guessing Is Expensive
Server sizing is a decision with real financial consequences in both directions. Over-provisioning (picking a larger instance than you need) wastes money on every hour the server runs. In cloud environments this is a silent tax: an m5.2xlarge at $0.38/hour that should have been an m5.large at $0.10/hour costs roughly $2,450 extra per year for a single service. Multiply by a microservices fleet of 30 services and the waste becomes significant.
Under-provisioning causes production incidents. An under-resourced web server saturates its CPU or runs out of memory during traffic peaks, manifesting as degraded response times, dropped connections, and ultimately cascading failures if upstream services cannot absorb the pressure. The cost of an incident (engineer time, customer trust, potential SLA penalties) typically dwarfs months of over-provisioning bills.
Systematic sizing (queuing theory for CPU, memory footprint analysis for RAM) transforms a guessing game into a calculation. The numbers are not perfect; real workloads are non-uniform and traffic is bursty. But they provide a defensible starting point that is far more accurate than instinct or copying a similar service's configuration without understanding the differences in workload characteristics.
Little's Law: L = λW
Little's Law is one of the most useful results in queuing theory, and it applies directly to web server capacity planning. The law states that in a stable system, the average number of items in the system (L) equals the average arrival rate (λ, pronounced lambda) multiplied by the average time each item spends in the system (W): L = λW.
In web server terms: L is the number of requests being concurrently processed, λ is your requests-per-second arrival rate, and W is the average response time in seconds. If your service handles 200 RPS with an average response time of 150 ms (0.15 s), then at any given moment there are 200 × 0.15 = 30 requests being actively processed.
Each concurrently processed request occupies a CPU thread or an event loop slot. By dividing the concurrent request count by your target CPU utilisation (e.g. 70%), you get the minimum number of CPU cores required. At 70% utilisation with 30 concurrent requests, you need at least 30 / 0.70 ≈ 43 CPU "slots." For a synchronous thread-per-request model (Java Tomcat, PHP-FPM), this translates directly to CPU cores. For an event-loop model (Node.js, Nginx), it translates to worker processes.
The elegance of Little's Law is that it makes no assumptions about request size distribution, arrival pattern, or processing complexity. It holds for any stable system. The practical constraint is "stable": the law breaks down under overload where the queue grows unboundedly.
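The calculation above can be sketched as a small helper, using the document's own example figures (200 RPS, 150 ms, 70% target utilisation):

```python
import math

def concurrent_requests(rps: float, response_time_ms: float) -> float:
    """Little's Law: L = lambda * W, with W converted from ms to seconds."""
    return rps * (response_time_ms / 1000.0)

def required_cores(rps: float, response_time_ms: float,
                   target_utilisation_pct: float = 70.0) -> int:
    """Minimum CPU 'slots' at the target utilisation, rounded up."""
    in_flight = concurrent_requests(rps, response_time_ms)
    return math.ceil(in_flight / (target_utilisation_pct / 100.0))

# 200 RPS at 150 ms average response time:
print(concurrent_requests(200, 150))   # 30.0 requests in flight
print(required_cores(200, 150))        # 43 slots at 70% utilisation
```

Whether a "slot" is a core or a worker process depends on the concurrency model, as discussed above.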
Memory Sizing for Web Applications
Memory sizing starts by identifying and summing every significant consumer of RAM on the server:
OS baseline is the memory consumed by the operating system kernel, system daemons (syslog, cron, SSH), and infrastructure agents (monitoring, log shipping). On a minimal Ubuntu 22.04 server, this is typically 200–400 MiB. Including agents and a basic observability stack, 512 MiB is a realistic planning baseline.
Per-worker memory is the memory footprint of each application worker process at idle. A Node.js worker starts at roughly 60–80 MiB; under load with cached query results and in-flight request objects, it may grow to 150–200 MiB. PHP-FPM workers start at 20–30 MiB but grow with each request. Java applications are different: the JVM itself reserves heap and non-heap memory at startup, so a single Java process (running a thread pool) might consume 512 MiB–2 GiB and serve the same load as dozens of Node.js workers.
Headroom accounts for OS page cache, temporary buffers during burst, and memory that is not easily attributable to specific processes. A 20% headroom is a reasonable default. Reduce it if your workload is extremely predictable and well-characterised; increase it to 30–40% for applications with known memory spikes during batch operations.
The output is rounded up to the nearest standard server RAM size (1, 2, 4, 8, 16, 32, 64 GiB). Cloud instances do not come in arbitrary sizes; the next standard size up is what you will actually provision, so the rounding step matters for accurate cost estimation.
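Putting the Section A formula and the rounding step together, a minimal sketch (using the 512 MiB baseline and 20% headroom defaults from the text; the worker figures in the example are illustrative):

```python
import math

STANDARD_SIZES_GIB = [1, 2, 4, 8, 16, 32, 64]

def total_ram_gib(os_baseline_mib: float, workers: int,
                  per_worker_mib: float, headroom_pct: float = 20.0) -> float:
    """Total RAM = (OS baseline + workers * per-worker memory) * (1 + headroom %)."""
    raw_mib = (os_baseline_mib + workers * per_worker_mib) * (1 + headroom_pct / 100.0)
    return raw_mib / 1024.0

def round_to_standard(gib: float) -> int:
    """Round up to the next standard server RAM size."""
    for size in STANDARD_SIZES_GIB:
        if size >= gib:
            return size
    return math.ceil(gib)  # beyond the largest listed standard size

# Example: 512 MiB OS baseline, 8 Node.js workers at 200 MiB each under load
need = total_ram_gib(512, 8, 200)   # ~2.48 GiB
print(round_to_standard(need))      # 4 (next standard size up)
```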
CPU Sizing: The Utilisation Headroom Rule
The 70% target CPU utilisation used in this calculator is not arbitrary: it follows from queuing theory. As a system approaches 100% CPU utilisation, the time requests spend waiting in the queue increases non-linearly. A system at 90% utilisation has approximately 9× the queue wait of a system at 50% utilisation. At 99% utilisation, average response times are dominated by queue wait, not by actual processing time.
This means designing a system to operate at 70% average CPU gives you a 30% buffer before queuing effects become significant. It also provides capacity for traffic bursts: if your average load is 70%, a 40% traffic spike brings you to 98% temporarily but does not immediately cause catastrophic queue growth.
In practice, modern autoscaling changes the calculus. If your cloud environment can add instances in under 60 seconds, you might comfortably target 80% utilisation knowing that scale-out handles spikes. On-premise or slow-scaling environments should maintain larger headroom: target 60% or even 50%.
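The non-linear queue growth can be illustrated with the classic M/M/1 queuing model (an assumption; real traffic is burstier), where mean queue wait scales as ρ / (1 − ρ) for utilisation ρ:

```python
def relative_queue_wait(utilisation_pct: float) -> float:
    """M/M/1 queue: mean queue wait grows as rho / (1 - rho).
    Returns wait relative to a 50%-utilised system (whose factor is 1.0)."""
    rho = utilisation_pct / 100.0
    return rho / (1.0 - rho)

for u in (50, 70, 90, 99):
    print(u, round(relative_queue_wait(u), 1))
# 50% -> 1.0, 70% -> 2.3, 90% -> 9.0, 99% -> 99.0
```

This reproduces the 9× figure quoted above: 0.9/0.1 = 9 versus 0.5/0.5 = 1.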
Web Server Worker Models
Different server architectures translate CPU and memory resources into request capacity in fundamentally different ways. Understanding your application's concurrency model is essential to applying this calculator correctly.
Prefork (Apache MPM Prefork, PHP-FPM): Each request is handled by a dedicated OS process. Memory usage scales linearly with worker count: add 10 workers, add 10× per-worker memory. CPU is utilised one thread per active request. This model is memory-intensive but simple to reason about. The per-worker memory figure directly sets your memory sizing.
Event loop (Node.js, Nginx, Python async frameworks): A single worker process handles many concurrent connections using non-blocking I/O. While waiting for a database query or an HTTP upstream response, the worker serves other requests. Memory per concurrent connection is low, but the event loop is single-threaded, so CPU-intensive work blocks all other requests. Typical deployment: 1 Node.js worker per CPU core, each handling hundreds of concurrent in-flight I/O operations.
Thread pool (Java Tomcat, Spring Boot): A single JVM process spawns a pool of threads. Threads block on I/O but are multiplexed across a shared heap. Memory sizing is dominated by JVM heap (-Xmx), thread stack size (typically 512 KB–1 MB per thread), and non-heap (metaspace, code cache). A single Java process with a 128-thread pool and 2 GiB heap may serve the same load as 8 Node.js workers at 256 MiB each.
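The thread-pool memory components can be summed in a rough sketch. The 256 MiB non-heap figure is an assumed planning value, not a measured one; heap, thread count, and stack size come from the example above:

```python
def jvm_memory_mib(heap_mib: int, threads: int,
                   stack_kib_per_thread: int = 1024,
                   non_heap_mib: int = 256) -> float:
    """Rough JVM footprint: heap (-Xmx) + thread stacks + non-heap
    (metaspace, code cache). non_heap_mib is an assumed planning figure."""
    stacks_mib = threads * stack_kib_per_thread / 1024.0
    return heap_mib + stacks_mib + non_heap_mib

# 2 GiB heap, 128-thread pool with 1 MiB stacks, ~256 MiB non-heap:
print(jvm_memory_mib(2048, 128))   # 2432.0 MiB, i.e. ~2.4 GiB
```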
Reactive / virtual threads (Project Loom, Vert.x, Quarkus): Modern Java runtimes with virtual threads blur the line between event loop and thread pool models. Virtual threads have very low stack overhead (a few KB versus 512 KB for platform threads), enabling massive concurrency with thread-per-request code style. This changes memory sizing significantly: a Loom-based Java service can handle thousands of concurrent requests per GiB of heap.
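The stack-overhead difference alone shows why virtual threads change the arithmetic. A back-of-envelope comparison (the ~8 KiB virtual-thread figure is an assumption standing in for "a few KB"; it ignores heap and other per-request state):

```python
def threads_per_gib_of_stack(stack_kib: int) -> int:
    """How many thread stacks fit in 1 GiB, ignoring heap and other overhead."""
    return (1024 * 1024) // stack_kib

print(threads_per_gib_of_stack(1024))  # 1024 platform threads (1 MiB stacks)
print(threads_per_gib_of_stack(8))     # 131072 virtual threads (assumed ~8 KiB)
```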
Cloud Instance Families
Cloud providers organise instances into families optimised for different CPU-to-memory ratios. Choosing the wrong family can mean paying for resources you cannot use.
General purpose (AWS m-series, GCP n-series, Azure D-series): Balanced CPU-to-memory ratio, typically 4 GiB RAM per vCPU. Suitable for most web applications and APIs. The safe default if you have not profiled your workload.
Compute-optimised (AWS c-series, GCP c-series, Azure F-series): More CPU per GiB of RAM, typically 2 GiB per vCPU. Costs less per vCPU than general purpose but provides less memory. Ideal for CPU-intensive workloads: video encoding, scientific computing, high-frequency trading, or any CPU-bound service where memory is not the bottleneck.
Memory-optimised (AWS r-series, GCP n-highmem, Azure E-series): High RAM per vCPU, typically 8 GiB or more per vCPU. Suited for in-memory databases (Redis, memcached), large data processing, JVM applications with very large heaps, and services that cache heavily in process memory.
A quick heuristic: calculate your required RAM-to-vCPU ratio. Between 2 and 4 GiB of RAM per vCPU, general purpose is appropriate. Below 2 GiB per vCPU (CPU-heavy), choose compute-optimised. Above 8 GiB per vCPU (RAM-heavy), choose memory-optimised.
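The heuristic can be encoded directly. Note the text leaves ratios between 4 and 8 GiB per vCPU unassigned; this sketch assumes they fall back to general purpose:

```python
def instance_family(vcpus: float, ram_gib: float) -> str:
    """Pick an instance family from the RAM-per-vCPU ratio heuristic.
    Ratios between 4 and 8 default to general purpose (an assumption)."""
    ratio = ram_gib / vcpus   # GiB of RAM per vCPU
    if ratio < 2:
        return "compute-optimised"
    if ratio > 8:
        return "memory-optimised"
    return "general purpose"

print(instance_family(4, 16))   # general purpose (4 GiB/vCPU)
print(instance_family(8, 8))    # compute-optimised (1 GiB/vCPU)
print(instance_family(2, 32))   # memory-optimised (16 GiB/vCPU)
```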
Baseline, Burst, and Autoscaling
This calculator gives you a single-instance minimum. Real production deployments must account for two additional dimensions: burst capacity and fault tolerance.
Horizontal scaling adds more instances of the same size behind a load balancer. It is more resilient (no single point of failure), scales more granularly, and allows rolling deployments. It requires stateless application design: session state and caches must be externalised to Redis or a database.
Vertical scaling moves to a larger instance. Simpler to operate for stateful applications (one larger database instance rather than a distributed cluster), but has harder scaling limits and requires a restart (brief downtime) to resize on most cloud platforms.
For autoscaling fleets, size your instances for your typical load, then configure autoscaling to handle peaks. A fleet of small instances autoscales more cheaply and granularly than a fleet of large ones. However, very small instances (1–2 vCPU) can be problematic for JVM applications, where JIT compilation, GC, and application threads compete for a single core; prefer 4+ vCPU instances for Java workloads.
The minimum fleet size should be at least 2 instances so that a single instance failure does not take down the service. For critical services, minimum 3 instances across 3 availability zones provides robust failure isolation.
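Combining the capacity calculation with the fault-tolerance floor, a sketch of fleet sizing (the N+1 spare-instance approach is one common interpretation of "a single instance failure does not take down the service", not the only one):

```python
import math

def fleet_size(required_vcpus: float, vcpus_per_instance: int,
               min_instances: int = 2) -> int:
    """Instances needed for capacity, plus one spare so the fleet
    still carries the load if a single instance fails (N+1)."""
    for_capacity = math.ceil(required_vcpus / vcpus_per_instance)
    return max(for_capacity + 1, min_instances)

print(fleet_size(43, 16))   # 4 instances (3 for capacity + 1 spare)
print(fleet_size(6, 8))     # 2 instances (fault-tolerance minimum)
```

For critical services, raise `min_instances` to 3 and spread the fleet across three availability zones, as noted above.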
Disclaimer
This calculator provides estimates for planning purposes based on simplified models. Actual resource requirements depend on your application architecture, traffic patterns, data access patterns, third-party dependencies, and deployment configuration. Instance family pricing changes frequently, so verify current pricing on your cloud provider's pricing page before making purchasing decisions. The instance recommendations shown are illustrative examples; always consult the provider's current instance catalogue for availability, pricing, and latest generation options.
Related calculators: Docker / Kubernetes Resource Calculator · Rack Density Calculator
Also in Data Center
- DC Cooling Load Calculator – Calculate the cooling capacity required for a given IT load and PUE target. Outputs in kW and Tons of Refrigeration.
- DC Critical Equipment Sizing – Size all critical power and cooling equipment for a new data center build or upgrade: UPS, generator, transformer, switchgear, cooling, PDUs, and battery. Tier I–IV, N to 2N+1 redundancy, PDF export.
- DC Efficiency Audit – Enter PUE, power chain efficiency, cooling strategy, redundancy tier, and battery runtime to get an overall efficiency score with recommendations.
- Docker / Kubernetes Resource Calculator – Calculate container CPU and memory requests, limits, Kubernetes QoS class, and node packing capacity. Covers Guaranteed/Burstable/BestEffort QoS, CPU throttling, OOMKill, and kubelet overhead.