Benchmarking Autoscaling Postgres

Adam Fletcher
Published in The Inner Join
Oct 26, 2022

Scaling a bit.io database from 0 to more than 20,000 TPS

Today, bit.io enabled powerful autoscaling functionality that allows all of our pro and enterprise tier serverless postgres databases to handle tens of thousands of transactions per second (TPS) across thousands of connections. We wanted all databases on bit.io to be blazing fast under any load. To reach this goal, we had to pressure test bit.io databases and the new autoscaling functionality, and we had to be able to observe the results.

This isn’t as easy as it may sound. “Pressure testing” can mean a lot of different things, and optimizing for the wrong metrics can have serious performance consequences. The tools for measuring postgres performance aren’t entirely straightforward, either, and the results can easily lead careless users astray.

One thing we wanted to be sure our autoscaling could deliver was high throughput, as measured by TPS (transactions per second). We absolutely needed our system to scale up to 20,000 TPS with < 50ms transaction latency. To make sure it could get there, we needed a benchmark. We turned to pgbench, which “runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second).”

This sounds great — but the postgres docs caution that “It is very easy to use pgbench to produce completely meaningless numbers.” And we needed to be sure the benchmark simulated the kinds of real-world conditions the new autoscaling capability would have to handle.

In this post, we’ll describe how we set up the pgbench benchmark to push the autoscaling system to its limits, and we’ll discuss the results (spoiler: the new functionality was more than capable of keeping up with the load).

Getting Started: Goals and Environment

When load testing, defining goals is critical. The best way to do that is to understand the expected workload. We chose high throughput, but it isn’t the only goal we could have chosen. We might, for instance, have been interested in how the autoscaling system handles sudden changes or spikes in usage, or shifts in the ratio of database reads to writes.

It pays to be explicit when defining goals. We chose the goal of supporting 20,000 TPS because this significantly exceeds the maximum load of our current users and we want to make sure new and existing users have a lot of room for growth. We could have just said “go fast” or “handle lots of load” — but what is “fast” and what is “lots of load”? We couldn’t know whether we’d achieved our throughput goals without putting numbers behind them.

It’s also important to properly set up a testing environment. There are two key components to this environment: the database server (which receives traffic) and the load generator (which runs pgbench and applies load to the database server). One potential pitfall is underprovisioning the load generator: a load generator with insufficient RAM and CPU cores may cause a bottleneck. In some cases, it makes more sense to use multiple load generators instead of just one.

It helps to have the load generators as close to the database under test as possible, ideally in the same network/VPC. First, keeping the load generators and database close minimizes network latency, including the irreducible speed-of-light latency. Second, close physical proximity mitigates the effects of slower network links that may lie in the path between the load generator and the database server. Third, latency results are more reliable because they are less likely to be affected by unrelated network issues.

Our load generators were set up in the same cloud project as our Kubernetes cluster to minimize the network hops between the load generators and the load balancer for our service. This way, we avoided the unpredictable latency seen on the internet at large.

Using pgbench

We have a goal, and we have a test environment. Now comes the fun part: putting the database under load and seeing how it performs. As mentioned above, we used pgbench to benchmark our autoscaling system. The default “TPC-B-like” benchmark essentially tries to run as many transactions as possible, as quickly as possible, against a set of tables generated by pgbench (figure 1). Note that the row counts indicated in figure 1 correspond to a scale factor of 1 (more on scale below).

Figure 1: Number of rows in the test tables built by pgbench with a scale factor of 1.
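For reference, here is a minimal initialization command; the host, user, and database names are illustrative placeholders, and the commented row counts are the scale-factor-1 values from the pgbench documentation that figure 1 depicts.

    # Initialize the pgbench tables at scale factor 1 (connection details are placeholders).
    pgbench -i -s 1 -h db.example.com -U benchmark_user my_database
    #
    # At scale factor 1, pgbench creates four tables; each row count is multiplied
    # by the scale factor at higher scales:
    #   pgbench_branches:       1 row
    #   pgbench_tellers:       10 rows
    #   pgbench_accounts: 100,000 rows
    #   pgbench_history:        0 rows (populated during the run)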

There are many pgbench configuration options, but we stuck to a few basic ones (a sample invocation follows the list):

  • Scale: how large the tables are
  • Clients: how many connections to make
  • Threads: how many load creation threads to use
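
Here is a representative run showing how those options map onto pgbench’s command-line flags; the host, user, database name, and specific values are illustrative rather than our actual configuration.

    # A representative pgbench run (values and connection details are illustrative).
    #   -c : clients (concurrent database connections)
    #   -j : threads (worker threads on the load generator)
    #   -T : duration of the run, in seconds
    #   -P : print a progress report, including current TPS, every N seconds
    pgbench -c 256 -j 16 -T 600 -P 10 -h db.example.com -U benchmark_user my_database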

Scale

The “scale” pgbench option (the -s option to pgbench -i) determines the amount of data involved in the benchmark: the number of rows in each of the tables shown in figure 1 is multiplied by the scale factor. It is important to have enough data to be representative of a real database, but figuring out a) what a “real” database looks like and b) what that means in terms of pgbench’s scale option is tricky.

It’s easy to set the scale option in such a way as to generate misleading results. A low scale with many database connections may result in situations where many clients try to update the same table row, leading to lock contention and artificially deflated throughput.

Setting a very high scale, on the other hand, will prevent lock contention but result in a lot of time spent hitting the disk on the target server, possibly more than might be realistic, again artificially deflating throughput.

The problem goes the other way, as well. Real workloads do sometimes result in lock contention and time spent waiting on disk I/O. It’s tempting to look for the scale option that maximizes the metric of interest, but that scale is unlikely to best reflect the real-world workload of interest.

The pgbench docs state that scale “should be at least as large as the largest number of clients you intend to test (-c); else you’ll mostly be measuring update contention.” In practical testing, we found that following this recommendation still resulted in more lock contention than we wanted in this particular test, so we set the scale option to 1.25 times the number of clients.
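
As a minimal sketch of that heuristic, with illustrative values and placeholder connection details, initializing at 1.25 times the planned client count looks like this:

    # Initialize with a scale factor 1.25x the planned client count (illustrative values).
    CLIENTS=1024
    SCALE=$(( CLIENTS * 5 / 4 ))   # 1.25x the client count, i.e. 1280
    pgbench -i -s "$SCALE" -h db.example.com -U benchmark_user my_database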

Threads and Clients

Picking good values for threads and clients is a bit simpler. Threads should match the CPU count on the load-generation machine, and clients should match the number of concurrent connections expected on the database server. (If more than one load-generating machine is used, the clients should be split among those machines in a ratio that reflects the number of threads on each machine.)
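
A sketch of how a single load generator might be configured under these rules; the machine size, client totals, and connection details are illustrative.

    # Configure one of several identical load generators (illustrative values).
    THREADS=16                    # one pgbench thread per CPU core on this 16-core machine
    TOTAL_CLIENTS=1024            # total concurrent connections expected on the database
    GENERATORS=4                  # number of load-generating machines
    CLIENTS=$(( TOTAL_CLIENTS / GENERATORS ))   # this machine's share: 256 clients
    # (An even split assumes identical machines; otherwise weight each machine's
    # share by its thread count, as described above.)
    pgbench -c "$CLIENTS" -j "$THREADS" -T 600 -P 10 \
      -h db.example.com -U benchmark_user my_database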

Running the Benchmark

Actually running the benchmark was not 100% straightforward. We ran pgbench until the point of saturation, the point at which the database server could not support more TPS at our target latency. This triggered the autoscaling mechanism, which briefly dropped the connection and terminated the benchmark run. Each time this occurred, we restarted pgbench with a higher load (a higher TPS target and more connections).
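
The ramp-up procedure looked roughly like the following sketch; the client steps, durations, and connection details are illustrative rather than our exact values.

    # Re-run pgbench with progressively more clients until the target load is reached
    # (illustrative sketch; each restart mirrors the manual restarts described above).
    THREADS=16                          # matches the core count of the load generator
    for CLIENTS in 128 256 512 1024 2048; do
      echo "Ramping up: $CLIENTS clients"
      pgbench -c "$CLIENTS" -j "$THREADS" -T 300 -P 10 \
        -h db.example.com -U benchmark_user my_database || true   # keep ramping even if a reconnect cuts a run short
    done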

We ran the benchmark until we hit our target of 20,000 TPS. Note that, for the purposes of this benchmark, we disabled WAL archiving. However, as seen in figure 2, we did not disable autovacuum, a routine automated database maintenance process that, among other things, recovers disk space from updated and deleted table rows; autovacuum is responsible for the brief drops in TPS seen in the results.
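
For anyone reproducing a similar setup, the two settings mentioned above can be checked on the server under test; $DATABASE_URL here is a placeholder connection string.

    # Verify the relevant server settings (standard postgresql.conf parameters).
    psql "$DATABASE_URL" -c "SHOW archive_mode;"   # 'off' -- WAL archiving disabled for this benchmark
    psql "$DATABASE_URL" -c "SHOW autovacuum;"     # 'on'  -- autovacuum left enabled, causing the brief TPS dips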

Results

The bit.io database we ran the benchmark on scaled from 0 to more than 20,000 TPS with no intervention required from the user (figure 2). Recall that pgbench runs a sequence of queries against the database as fast as possible. As the TPS increased, so too did the number of CPU cores used by the database server, enabling it to handle the increased load — automatically.

Figure 2: Increasing load triggered the bit.io autoscaling mechanism, enabling the database to exceed our goal of 20,000 TPS with < 50ms latency.

This benchmark and others like it give us a lot of confidence in our autoscaling capabilities because we set them up to reflect various challenging real-world usage scenarios. For context, on October 24, 2022, Nasdaq processed 30,293,529 trades, or 1,295 trades per second. From June 2021 to June 2022, Visa averaged about 8,100 transactions per second. These aren’t exactly the same measures, but they should give a sense of scale: 20,000 TPS is a lot.

Some Notes on Postgres Benchmarking

  • There are many other Postgres benchmarking tools available, and they can run a wide variety of benchmarks. We recommend starting with pgbench because it ships with postgres, it is easy to start using, and there are all kinds of great guides online on how to use it.
  • We did not see much difference in using pgbench’s TPC-B-like benchmark versus other benchmarks when benchmarking bit.io, but your experience will depend on your specific use case and configuration.

If benchmarking and all of the above seem like a huge pain, just use bit.io! As we showed, bit.io pro and enterprise databases will scale up and down to meet your database’s workload. You don’t need to know anything about pgbench, resource consumption, or postgresql.conf values; we take care of all of that automatically.

More importantly, our pricing model hasn’t changed. Pricing is based on the number of rows read, not on the server resources. To be clear: the price per row queried stays the same even when the autoscaling system increases the database server resources. Sign up for bit.io pro or enterprise and give it a try!

Co-authored with Daniel Liden.
