Introduction

At CAW, we greatly value PostgreSQL – it’s our go-to database whenever our products require a robust relational database management system (RDBMS). Over the years, we’ve come across a fair number of surprises (and shocks) while exploring this incredible database.

We’re excited to share our insights and learnings with you through this two-part series:

1. Connection Pooling in PostgreSQL 

2. VACUUM for PostgreSQL Optimisation (You are here)

Here’s something surprising about PostgreSQL:

When you update a record, it doesn’t update the existing row – it creates a new one and marks the old one as obsolete. This means a simple UPDATE operation leaves you with more rows than you started with, though your queries only show the current versions.

For example, let’s say you have a table “Employees” with 4 records, and you want to update 2 of them. After these updates, SELECT queries will only show 4 records but the table internally contains 6 rows – 4 current and 2 obsolete ones. This article explains why PostgreSQL works this way, what happens to these obsolete rows, and how the database manages this through a process called VACUUM.

An Overview of Dead Tuples in PostgreSQL

When you update a row, PostgreSQL marks the existing version as a dead tuple (similar to a soft delete) and inserts a new row containing the updated values. That’s why the table in the example above internally ends up with 6 rows.

Every table in PostgreSQL contains a few hidden system columns; xmin, xmax, and ctid are three of them.
Here’s what each of those columns means:

– xmin: The transaction ID (TXID) of the transaction that inserted this row version.

– xmax: The TXID of the transaction that deleted or superseded this row version. A value of 0 means the row has never been deleted or replaced by an update.

– ctid: The physical location of the record in the table. It is represented as a set of two values.

For example, if a row has a ctid of (0,5), the row is located in block 0 (the first block in the table) at position 5 within that block. A block in this context is often referred to as a page.
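You can inspect these system columns directly. A minimal sketch, assuming the Employees table from the earlier example has hypothetical id and name columns:

SELECT ctid, xmin, xmax, id, name FROM employees;

UPDATE employees SET name = 'New Name' WHERE id = 1;

-- Running the SELECT again shows the updated row with a new ctid
-- (it was physically written to a new location); the old version is
-- now a dead tuple and no longer appears in query results.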

But why does PostgreSQL behave this way? Because it uses a multi-version concurrency control (MVCC) mechanism.

Now the question is, what’s the point of not deleting those dead tuples? We will answer this in the following sections.


What Is Multi-Version Concurrency Control?

Multi-Version Concurrency Control (MVCC) is a technique used by PostgreSQL to allow multiple transactions to access the same data simultaneously without interfering with each other. The technique is implemented using a combination of snapshot isolation and versioning.

In MVCC, each transaction sees a snapshot of the database as of a certain point in time, rather than the live state. A snapshot is not a physical copy of the data; it simply records which transactions had committed when it was taken. As the transaction reads data, it sees only the row versions that were visible at that moment.

For contrast, let’s look at how a lock-based approach handles concurrent transactions (MySQL behaves this way when locking reads such as SELECT ... FOR UPDATE are used). Consider two transactions:

– Transaction A wants to read data from Table X.

– Transaction B wants to simultaneously add new data to Table X.

If these transactions work with different rows in the table, they won’t interfere with each other. Both can proceed normally.

However, things change when both transactions target the same row. In this case:

– Transaction A holds a lock on the row while reading.

– Transaction B must wait for this lock to be released.

– Only then can Transaction B acquire its own exclusive lock and modify the row.

PostgreSQL handles these situations differently. Thanks to its MVCC mechanism, multiple transactions can read and write to the same row without blocking each other. Here’s how it works:

– Transaction A can read data from a row.

– At the same time, Transaction B can acquire a lock and modify the same row.

– Other transactions can continue reading the row’s previous version.

– The new version becomes visible only after Transaction B commits its changes.
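You can watch this behaviour from two psql sessions. A minimal sketch, again using the hypothetical employees table:

-- Session 1:
BEGIN;
UPDATE employees SET name = 'Alice' WHERE id = 1;
-- (no COMMIT yet)

-- Session 2, at the same time:
SELECT name FROM employees WHERE id = 1;
-- Returns the old value straight away; it is not blocked by Session 1.

-- Session 1:
COMMIT;

-- Session 2, under the default READ COMMITTED isolation level:
SELECT name FROM employees WHERE id = 1;
-- Now returns 'Alice'.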

This approach ensures data consistency while preventing conflicts between transactions. It’s one of PostgreSQL’s key advantages for handling concurrent operations.

Therefore, MVCC in PostgreSQL is a powerful mechanism that allows for high concurrency and scalability in a multi-user environment. However, it comes with certain trade-offs. The system needs to store multiple versions of each row, increasing storage requirements and processing overhead. This storage approach affects various database operations, including one you might use frequently – SELECT COUNT(*).

It is clear at this point why SELECT COUNT(*) might not be the best choice in PostgreSQL. Since the database has to scan the table and check the visibility of every row version, producing an exact count can be resource-intensive.

If you don’t need an exact count, you can use this more efficient alternative:

SELECT reltuples AS estimate FROM pg_class WHERE relname = 'table_name';

Keep in mind that reltuples is only refreshed by commands such as VACUUM, ANALYZE, and CREATE INDEX, so the estimate can lag behind the true count. You can read more in PostgreSQL’s documentation on Slow Counting and Count Estimate.

PostgreSQL Bloat: An Overview of Its Types

In PostgreSQL, bloat refers to the storage inefficiencies that can occur as tables and indexes grow over time. Several different types of bloat can occur in PostgreSQL:

1. Dead Rows
As discussed above, when rows are updated or deleted, they are not immediately removed from the table. Instead, they are marked as “dead” and left in the table until the next time VACUUM runs. This can result in tables with a large number of dead rows, consuming disk space and slowing down queries.

2. Index Bloat
As indexes grow, they can become inefficient and consume more disk space than necessary. This can be caused by fragmentation, where index pages become partially filled, or by the presence of duplicate index entries.

3. Table Bloat
Tables can become bloated when they contain a large number of dead rows or when the statistics used by the query planner become outdated. Table bloat can lead to slower queries and increased disk usage.

4. Transaction ID Wraparound
PostgreSQL uses a 32-bit transaction ID (XID) counter that eventually wraps around and starts again. XID comparisons are circular, so if old row versions are not “frozen” by VACUUM before roughly two billion transactions have passed, their XIDs would suddenly appear to be in the future and the rows would vanish from query results – effectively data loss. PostgreSQL defends against this with aggressive anti-wraparound vacuums, and in the worst case it stops accepting new transactions until a VACUUM is run.

5. TOAST Table Bloat
Large values in TEXT, VARCHAR, and BYTEA columns are stored out of line in an associated table known as the TOAST (The Oversized-Attribute Storage Technique) table. If these values are very large or updated frequently, the TOAST table can become bloated and slow down queries.

To check whether a table has an associated TOAST table, you can use the query below. The pg_class catalogue table contains information about all tables in the database, including their attributes, indexes, and storage options.

SELECT relname, relkind, reltoastrelid 
FROM pg_class 
WHERE relname = 'tableName';

If the reltoastrelid column is non-zero, the table has an associated TOAST table. The relkind column will show ‘r’ for a regular table or ‘t’ for a TOAST table. Note that a table may TOAST some columns but not others: the attstorage column in the pg_attribute catalogue table records each column’s storage strategy, and a value of ‘x’ (extended) means the column’s values may be compressed and moved out of line into the TOAST table.
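For example, to list each column’s storage strategy (the table name is a placeholder):

SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'tableName'::regclass
AND attnum > 0
AND NOT attisdropped;

Here ‘p’ means plain (in-line, uncompressed), ‘m’ main (compressed, kept in line where possible), ‘e’ external (out of line, uncompressed), and ‘x’ extended (out of line with compression).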

But what do we do about this bloat data? The answer is VACUUM.

How VACUUM Works in PostgreSQL

VACUUM is PostgreSQL’s built-in maintenance command for reclaiming storage and keeping performance healthy. It removes dead rows, refreshes statistics (when run with ANALYZE), and makes space available for reuse. VACUUM is essential for preventing performance degradation and keeping the database running smoothly.

Benefits of VACUUM in PostgreSQL

Reclaims Disk Space
When data is deleted or updated, the database doesn’t immediately free the space occupied by the old row versions. Plain VACUUM marks this space as reusable by the same table (and can return trailing empty pages to the operating system), while VACUUM FULL rewrites the table and hands all the reclaimed space back to the OS.

Improves Query Performance
When run as VACUUM ANALYZE, it also refreshes the statistics about data distribution that the query optimiser uses to generate better query plans. This can lead to faster and more efficient queries.

Prevents Transaction ID Wraparound
VACUUM removes dead rows and “freezes” sufficiently old ones, marking their transaction IDs as permanently in the past so that the 32-bit XID counter can safely wrap around.
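To keep an eye on how close each database is to wraparound, you can check the age of its oldest unfrozen transaction ID with a commonly used monitoring query:

SELECT datname, age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;

If xid_age approaches autovacuum_freeze_max_age (200 million by default), PostgreSQL launches an aggressive anti-wraparound autovacuum.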

Compacts the Tables
Tables can become fragmented over time, with live data spread thinly across many pages. Plain VACUUM can compact rows within a page, but only VACUUM FULL rewrites the whole table into the minimum number of pages, reducing the disk I/O operations required to access the data.
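The common variants look like this (using the employees table from earlier):

VACUUM employees;                    -- reclaim dead-row space for reuse within the table
VACUUM ANALYZE employees;            -- also refresh planner statistics
VACUUM (VERBOSE, ANALYZE) employees; -- print progress details as well
VACUUM FULL employees;               -- rewrite the table compactly; takes an exclusive lock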

In short, VACUUM takes care of bloat in PostgreSQL. Yet, while vacuuming is an important maintenance task, there are some potential downsides and trade-offs to consider.

Challenges and Limitations of VACUUM in PostgreSQL

Resource Usage
Vacuuming can be a resource-intensive operation, especially for large tables with many dead rows. The process can consume significant CPU, memory, and I/O resources, which may impact the performance of other processes running on the same system.

Index Maintenance
As we mentioned earlier, vacuuming can have an impact on indexes, as it involves scanning the indexes to find obsolete entries and removing them. This can be an expensive operation, especially for large tables with many indexes, and can increase the time required to complete the vacuum.

Slowdowns During Peak Usage
If you vacuum a large table during peak usage hours, it can impact the performance of the application or service using the table. In this case, it is recommended to schedule vacuuming during off-peak hours to minimise any potential impact on users.

Increased Storage Requirements
VACUUM FULL in particular writes a brand-new copy of the table without the dead rows and drops the old copy only once the rewrite finishes. Until then, the operation temporarily needs up to twice the table’s disk space, which may be a concern for systems with limited storage.
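Before running VACUUM FULL, it is worth checking how much space a table (including its indexes and TOAST data) currently occupies; a quick check, again assuming the employees table:

SELECT pg_size_pretty(pg_total_relation_size('employees'));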

Locking and Concurrency
Plain VACUUM runs alongside normal reads and writes: it takes a SHARE UPDATE EXCLUSIVE lock, which only blocks schema changes and other vacuums on the same table. VACUUM FULL, however, requires exclusive access. It takes an ACCESS EXCLUSIVE lock, which:

– Prevents other sessions from performing any operations on the table.
– Blocks both read and write operations until completion.
– Acts as a heavyweight lock that supersedes all other lock types.

This exclusive lock, while necessary for the rewrite, can cause significant contention in heavily used databases. To minimise the impact:

– Schedule VACUUM operations during low-usage periods.
– Partition large tables.
– Archive tables wherever appropriate.

Setting Up Efficient VACUUM Operations
It is important to balance the benefits of vacuuming against the downsides and to plan and schedule vacuuming operations carefully. Companies whose workloads involve few updates but millions of daily inserts may even consider turning off auto-vacuum for the affected tables. This prevents auto-vacuum from activating in the middle of bulk operations and causing unwanted locks. Instead, run VACUUM ANALYZE from a cron job, as shown in the sketch below – this is what Zerodha, the largest stock broker in India, does. For more information, we recommend reading this article by their tech team.
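A minimal sketch of that setup, with a hypothetical insert-heavy events table and database name:

-- Disable auto-vacuum for this one table only (not cluster-wide):
ALTER TABLE events SET (autovacuum_enabled = false);

-- Then schedule a nightly VACUUM ANALYZE via cron, e.g. at 3 AM:
-- 0 3 * * * psql -d mydb -c 'VACUUM ANALYZE events;'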

Conclusion

Understanding PostgreSQL’s VACUUM mechanism is crucial for maintaining optimal database performance. While it comes with certain challenges, proper scheduling and monitoring can help minimise its impact on your operations. For further optimisation, consider implementing automated VACUUM scheduling based on your specific usage patterns and performance requirements.

If you’d like additional assistance, schedule a call with us to discuss the right VACUUM strategy for your specific needs. Our engineering team at CAW can help you achieve optimal database performance — be it managing complex data operations or millions of daily transactions. Book a call with us today to learn how we can support your PostgreSQL optimisation needs.

FAQs

What is VACUUM in PostgreSQL?
VACUUM is a maintenance operation in PostgreSQL that reclaims storage by removing dead rows, refreshing statistics (when run with ANALYZE), and, in its FULL form, compacting tables. It is essential for maintaining database performance, preventing bloat, and ensuring efficient use of disk space.

Why does PostgreSQL create dead rows during updates?
PostgreSQL uses Multi-Version Concurrency Control (MVCC), which creates new versions of rows when updates or deletes are performed. The old versions, marked as dead, remain in the table until they are cleaned up by VACUUM. This mechanism ensures data consistency, allowing multiple transactions to work concurrently without locking issues.

How does VACUUM improve database performance?
VACUUM reclaims disk space, refreshes statistics for the query planner (with ANALYZE), reduces the I/O needed to access data, and prevents transaction ID wraparound issues. This helps keep queries fast and efficient.

When should you schedule VACUUM operations?
It’s best to schedule VACUUM operations during off-peak hours to reduce the impact on performance. For systems with high insert rates and low updates, disabling auto-vacuum and using scheduled VACUUM operations may be more efficient.