So today, I finally decided to do something about a query that we saw had been
running for 25 DAYS on our server
Note: If you find this post and you need to follow the hard way, backup your
data first if possible.
First I could obviously see the culprit: each postgresql query runs it's own
process so I could see in "htop" that there was this process that had been
running for 600 hours, or about 25 days
Next, I opened a psql console and ran this query:
This returns which actual queries are being run on the database at any given
time. I could easily see the one problematic query being run, which was a badly
constructed intermine template query that resulted in a weird "recursion"
essentially.
I wanted to try just terminating this query itself, so I ran this
SELECT pg_cancel_backend(29033);
Each time I ran it, it would say it returned one result but it did nothing.
I also read that you can try to nicely "kill" it from the command line (no kill
-9) so I ran
This also did not work!
I thought perhaps all these problems were because tomcat was still active, so we
shut down tomcat, and retried killing the specific query, but to no avail
At this point, I just wanted to restart the whole database server. Kind of a
risky move... but I am sort of a risky kind of guy...(that is not a good thing
with databases). If you are doing this, make backups! I didn't. Luckily I
suffered no data loss but what follows is kind of intense.
So first, I try and stop the database service
Unfortunately, this [FAILED] ! And of course, even though it failed, the
database is now unusable. No logging into it anymore, we have to go with the
hard way now...
Looking at /etc/init.d/postgres-9.1 told me that the service stop command was
effectively using something like this:
After some reading, I learned that you can try using a slightly different flag
to restart it
I ran this and to my horror/surprise, it actually worked! At this point I
decided to start postgresql back up again!
The service start quickly returned a SUCCESS, which was great, but then I tried
to start a psql console and the console froze on me! I could not even ctrl+c it!
I got really worried at this point and I looked at the process manager, and saw
that there was one postmaster process running but it was not clear what it was
doing. I actually tried to shutdown the server again in a panic mode but at this
point it said
It was probably good that it didn't shut down, because I would quickly find out
that it was in recovery mode. I looked at the postgresql logs and I saw this,
reproduced here for full detail (from before the shutdown to the restart)
You can see all the weird activity that was done here
first the attempt to "canceling statement due to user request" did not work
then the database stop using -m fast
then the database stop using -m immediate
the restart (with the HINT, should be ready soon)
the panic mode where i tried to shut it again anyways
During the recovery period, I was still very concerned about the database was
doing, so I used "strace" to look at the main postmaster process.
I was pleasantly surprised to see that the postmaster process was just cleaning
up files in /db/postgres/data/base/pgsql_tmp/, I could see the file system
"unlink" command with successful status codes.
There were about 150 large files in /db/postgres/data/base/pgsql_tmp/, and I
waited about an hour for them to be deleted, and after that, the postgresql log
file said it was ready, and indeed, it was perfect :)
What a relief!
I hope this might help any wayward stragglers to see how the postgresql restart
process works. Sometimes things don't shut down cleanly, but I think it is still
good to know some alternative steps to kill -9
This all happened right before the demo of our project to some stakeholders (it
was a small academic lab project, but still crucial!) so that made it extra
scary