Killing postgres the hard way

2015-10-22

So today, I finally decided to do something about a query that we saw had been running for 25 DAYS on our server

Note: If you find this post and you need to follow the hard way, backup your data first if possible.

First I could obviously see the culprit: each postgresql query runs it's own process so I could see in "htop" that there was this process that had been running for 600 hours, or about 25 days

Next, I opened a psql console and ran this query:

SELECT datname,procpid,current_query FROM pg_stat_activity WHERE datname='database_name' ORDER BY procpid ;

This returns which actual queries are being run on the database at any given time. I could easily see the one problematic query being run, which was a badly constructed intermine template query that resulted in a weird "recursion" essentially.

I wanted to try just terminating this query itself, so I ran this

SELECT pg_cancel_backend(29033);

Each time I ran it, it would say it returned one result but it did nothing.

I also read that you can try to nicely "kill" it from the command line (no kill -9) so I ran

> kill 29033

This also did not work!

I thought perhaps all these problems were because tomcat was still active, so we shut down tomcat, and retried killing the specific query, but to no avail

At this point, I just wanted to restart the whole database server. Kind of a risky move... but I am sort of a risky kind of guy...(that is not a good thing with databases). If you are doing this, make backups! I didn't. Luckily I suffered no data loss but what follows is kind of intense.

So first, I try and stop the database service

> service postgresql-9.1 stop

Unfortunately, this [FAILED] ! And of course, even though it failed, the database is now unusable. No logging into it anymore, we have to go with the hard way now...

Looking at /etc/init.d/postgres-9.1 told me that the service stop command was effectively using something like this:

> pg_ctl -D /db/postgres/data -m fast stop

After some reading, I learned that you can try using a slightly different flag to restart it

> pg_ctl -D /db/postgres/data -m immediate  stop

I ran this and to my horror/surprise, it actually worked! At this point I decided to start postgresql back up again!

> service postgresql-9.1 start

The service start quickly returned a SUCCESS, which was great, but then I tried to start a psql console and the console froze on me! I could not even ctrl+c it!

I got really worried at this point and I looked at the process manager, and saw that there was one postmaster process running but it was not clear what it was doing. I actually tried to shutdown the server again in a panic mode but at this point it said

> /usr/pgsql-9.1/bin/pg_ctl stop -D /db/postgres/data/ -m immediate
> waiting for server to shut
> down...............................................................
> failed

It was probably good that it didn't shut down, because I would quickly find out that it was in recovery mode. I looked at the postgresql logs and I saw this, reproduced here for full detail (from before the shutdown to the restart)

> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
> WARNING:  pgstat wait timeout
>
> ERROR:  canceling statement due to user request
> STATEMENT:  CREATE TABLE precomp_90519 AS SELECT DISTINCT a1_.id AS
> a1_id, a2_.id AS a2_id, a3_.id AS a3_id, a4_.id AS a4_id,
> a5_.id AS a5_id, a6_.id AS a6_id, a12_.id AS a12_id, a10_.id AS
> a10_id, a1_.id AS a13_, a1_.primaryIdentifier AS a14_,
> a1_.secondaryIdentifier AS a15_, a2_.type AS a16_, a3_.name AS
> a17_, a4_.primaryIdentifier AS a18_, a5_.primaryIdentifier AS
> a19_, a6_.shortName AS a20_, a12_.identifier AS a21_, a10_.code
> AS a22_ FROM Gene AS a1_, Homologue AS a2_, Organism AS a3_, Gene
> AS a4_, Gene AS a5_, Organism AS a6_, GOAnnotation AS a7_,
> GOEvidence AS a8_, OntologyTerm AS a9_, GOEvidenceCode AS a10_,
> OntologyAnnotation AS a11_, OntologyTerm AS a12_, GeneGoAnnotation
> AS indirect0, EvidenceGOAnnotation AS indirect1 WHERE a1_.id =
> a2_.geneId AND a1_.organismId = a3_.id AND a2_.geneId = a4_.id
> AND a2_.homologueId = a5_.id AND a5_.organismId = a6_.id AND
> a1_.id = indirect0.Gene AND indirect0.GoAnnotation = a7_.id AND
> a7_.id = indirect1.GOAnnotation AND indirect1.Evidence = a8_.id AND
> a7_.ontologyTermId = a9_.id AND a8_.codeId = a10_.id AND a9_.id =
> a11_.ontologyTermId AND a11_.ontologyTermId = a12_.id ORDER BY
> a1_.primaryIdentifier, a1_.secondaryIdentifier, a2_.type,
> a3_.name, a4_.primaryIdentifier, a5_.primaryIdentifier,
> a6_.shortName, a12_.identifier, a10_.code, a1_.id, a2_.id,
> a3_.id, a4_.id, a5_.id, a6_.id, a12_.id, a10_.id
> LOG:  could not send data to client: Broken pipe
> STATEMENT:  CREATE TABLE precomp_90519 AS SELECT DISTINCT a1_.id AS
> a1_id, a2_.id AS a2_id, a3_.id AS a3_id, a4_.id AS a4_id,
> a5_.id AS a5_id, a6_.id AS a6_id, a12_.id AS a12_id, a10_.id AS
> a10_id, a1_.id AS a13_, a1_.primaryIdentifier AS a14_,
> a1_.secondaryIdentifier AS a15_, a2_.type AS a16_, a3_.name AS
> a17_, a4_.primaryIdentifier AS a18_, a5_.primaryIdentifier AS
> a19_, a6_.shortName AS a20_, a12_.identifier AS a21_, a10_.code
> AS a22_ FROM Gene AS a1_, Homologue AS a2_, Organism AS a3_, Gene
> AS a4_, Gene AS a5_, Organism AS a6_, GOAnnotation AS a7_,
> GOEvidence AS a8_, OntologyTerm AS a9_, GOEvidenceCode AS a10_,
> OntologyAnnotation AS a11_, OntologyTerm AS a12_, GeneGoAnnotation
> AS indirect0, EvidenceGOAnnotation AS indirect1 WHERE a1_.id =
> a2_.geneId AND a1_.organismId = a3_.id AND a2_.geneId = a4_.id
> AND a2_.homologueId = a5_.id AND a5_.organismId = a6_.id AND
> a1_.id = indirect0.Gene AND indirect0.GoAnnotation = a7_.id AND
> a7_.id = indirect1.GOAnnotation AND indirect1.Evidence = a8_.id AND
> a7_.ontologyTermId = a9_.id AND a8_.codeId = a10_.id AND a9_.id =
> a11_.ontologyTermId AND a11_.ontologyTermId = a12_.id ORDER BY
> a1_.primaryIdentifier, a1_.secondaryIdentifier, a2_.type,
> a3_.name, a4_.primaryIdentifier, a5_.primaryIdentifier,
> a6_.shortName, a12_.identifier, a10_.code, a1_.id, a2_.id,
> a3_.id, a4_.id, a5_.id, a6_.id, a12_.id, a10_.id
> LOG:  unexpected EOF on client connection
> LOG:  unexpected EOF on client connection
> LOG:  unexpected EOF on client connection
> LOG:  unexpected EOF on client connection
> LOG:  received fast shutdown request
> LOG:  aborting any active transactions
> LOG:  autovacuum launcher shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> FATAL:  the database system is shutting down
> LOG:  received immediate shutdown request
> WARNING:  terminating connection because of crash of another server
> process
> DETAIL:  The postmaster has commanded this server process to roll back
> the current transaction and exit, because another server process
> exited abnormally and possibly corrupted shared memory.
> HINT:  In a moment you should be able to reconnect to the database and
> repeat your command.
> LOG:  received fast shutdown request
> LOG:  database system was interrupted; last known up at 2015-10-22
> 15:47:43 CDT
> LOG:  received immediate shutdown request
> LOG:  database system was interrupted; last known up at 2015-10-22
> 15:47:43 CDT
> LOG:  database system was not properly shut down; automatic recovery
> in progress
> LOG:  record with zero length at BBD/1CC2F0C0
> ...

You can see all the weird activity that was done here

first the attempt to "canceling statement due to user request" did not work
then the database stop using -m fast
then the database stop using -m immediate
the restart (with the HINT, should be ready soon)
the panic mode where i tried to shut it again anyways

During the recovery period, I was still very concerned about the database was doing, so I used "strace" to look at the main postmaster process.

I was pleasantly surprised to see that the postmaster process was just cleaning up files in /db/postgres/data/base/pgsql_tmp/, I could see the file system "unlink" command with successful status codes.

There were about 150 large files in /db/postgres/data/base/pgsql_tmp/, and I waited about an hour for them to be deleted, and after that, the postgresql log file said it was ready, and indeed, it was perfect :)

> LOG:  redo is not required
> LOG:  database system is ready to accept connections
> LOG:  autovacuum launcher started

What a relief!

I hope this might help any wayward stragglers to see how the postgresql restart process works. Sometimes things don't shut down cleanly, but I think it is still good to know some alternative steps to kill -9

#Footnote

This all happened right before the demo of our project to some stakeholders (it was a small academic lab project, but still crucial!) so that made it extra scary