Running a buildfarm member requires very little attention once you have it set up. That's as it should be, and was always a design goal. From time to time we get persistent failures, and I send the owners an email asking them to do a little cleanup. Usually this results in very prompt action. However, I've just had to take the very rare step of disabling three members whose owner has apparently ignored my request to fix things. I don't know if he's left the job where these were running, or he's on a very long vacation, or what. But I don't see why we should have endlessly failing machines on the buildfarm dashboard if their owners go silent. If he fixes them we can re-enable them, but it's going to take positive action on his part.
In the aftermath of this huge storm in the USA, a number of web sites I use on a daily basis are either offline, threatening to go offline, or operating with reduced functionality. If your business relies on your site and your data being available, you need to make sure that every part of it can survive a major event like this and keep running. That means having failover systems that you can bring online rapidly. In the Postgres world, that means having at least one standby replica that you can activate on very short notice. This replica should be placed either in the cloud or in an alternative data centre, but in either case in some physical location far from your primary. For example, if your main data centre is on the east coast, the standby might be on the west coast.

There is no excuse these days for not having a good disaster recovery plan. Most large businesses I am familiar with learned this lesson long ago, but it's one I have seen ignored many times by smaller and even medium-sized businesses. The other thing I have often seen is a DR plan that's never been tested. That's as bad as having a backup mechanism that's never been tested. Just the other day I saw a business suffer 100% data loss because they had never tested their backups, which turned out to be useless. It's like running with scissors.
If you run on the Amazon cloud, it's tempting to use Amazon's own Linux distro. One might expect it to be built to run well on the platform, and updates will be close at hand and won't incur any traffic charges. However, they seem to have made a major blunder in packaging PostgreSQL.
I found out about this when a very angry user came into the IRC channel yesterday. He had done a "yum update" and suddenly found that his database would not work. He shared the log with us, and here's a small excerpt from it:
---> Package postgresql9-server.x86_64 0:9.1.5-1.23.amzn1 will be obsoleted
---> Package postgresql9-upgrade.x86_64 0:9.2.1-1.28.amzn1 will be obsoleting
--> Processing Dependency: postgresql9-server(x86-64) = 9.2.1-1.28.amzn1 for package: postgresql9-upgrade-9.2.1-1.28.amzn1.x86_64
Wow. They are upgrading the postgresql9-server package from 9.1 to 9.2 on a simple "yum update". That is just awful. You don't get anything like that happening on any other RPM-based system I know of, certainly not on any Red Hat or SUSE derived system I have ever seen. "yum update" should not be doing this sort of thing.
I suspect that it's happened because the Amazon packagers don't actually understand PostgreSQL version numbering. They have two collections of packages, one with "postgresql8" and one with "postgresql9" as the prefix. But as most (I hope) PostgreSQL users know, our major version numbers have two parts, not one, as can be seen in the versioning policy. So 9.1.5 to 9.1.6 would be a minor update, safe to apply automatically, while 9.1 to 9.2 is a major version upgrade, with a new on-disk format that requires a dump/restore or pg_upgrade. Upgrading to a new major version via a simple "yum update" is a totally bad thing to be doing.
Apparently it has caused enough problems for them to have issued an FAQ about it. But really, that's no answer. This is exactly the sort of thing you should NOT have to do after a simple "yum update". pg_upgrade is not guaranteed to work in all circumstances, and if it fails in one of the many ways it can fail, the user will be left scrambling.
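Until Amazon fixes the packaging, one way to defend yourself is yum's standard exclude mechanism. This is a generic yum feature, not anything Amazon-specific, and I haven't tested it against their repos, so treat it as a sketch; the idea is to stop the obsoleting package from being considered at all. In the [main] section of /etc/yum.conf add:

exclude=postgresql9-upgrade*

or pass it for a one-off update:

yum update --exclude=postgresql9-upgrade\*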
What's really annoying about this is that users are just as likely to blame Postgres for this mess as they are to blame Amazon.
I don't know who to contact at Amazon to urge them to fix this mess, so I've written this in the hope it might help someone who encounters the problem. For now I am going to advise customers and others to avoid Amazon Linux; there are lots of other AMIs that can be used on Amazon. If you must use it, I suggest using the PostgreSQL community yum repos, although this probably requires a tiny bit of work to start with: see how I did this before we got support for ScientificLinux.
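For the impatient, the setup is roughly this (a sketch only: the repo RPM name below is a placeholder, so check yum.postgresql.org for the right one for your platform and PostgreSQL version):

rpm -ivh pgdg-redhat92-9.2-X.noarch.rpm   # placeholder name; get the real RPM from yum.postgresql.org
yum install postgresql92-server
service postgresql-9.2 initdb
service postgresql-9.2 start

The community packages use names like postgresql92-server, so they shouldn't collide with, or be obsoleted by, the Amazon packages.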
This is something that I have had a number of discussions about recently, including with several clients, and also with Simon Riggs of 2ndQuadrant at PostgresOpen. Sometimes, especially with larger development items, it is quite difficult to estimate the effort a certain item will require without first doing a substantial amount of work. I am often very conservative in giving estimates, and won't do so unless I feel reasonably confident that I can hit the target in the specified time. This is to protect myself, PostgreSQL Experts Inc., and the clients. In at least one case recently a client decided not to pursue a certain item because I would not put a firm estimate on it without this initial investigation.
A case in point is the Grouping Sets feature (CUBE and ROLLUP are instances of it). It's one I would like to land. Some work was done on it a few years ago, but clearly a lot remains to be done, and I have no idea how much. Finding out would probably cost me a lot of time and effort, which I can't afford to spend unfunded.
Some people get this, I'm happy to say. Yesterday we reached agreement in principle with a client on proceeding this way for a feature that I hope will eventually land in Postgres (details will have to wait, I'm afraid). Without this approach we would almost certainly have declined the work, and the feature might never have seen the light of day. I hope other people wanting to sponsor developments can see the point of this.
The buildfarm server was rejecting status updates this afternoon for about two hours. My fault: I neglected to add "RETURN NEW;" to a trigger function. I had been giving the server some overdue attention, and improved the speed of a couple of important pages that were running slowish queries. Everything is working again now.
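For anyone who hasn't hit this before: a plpgsql trigger function must return a value, and if control falls off the end without a RETURN, the function errors out with "control reached end of function without RETURN" and takes the triggering statement down with it. A minimal sketch (the table and column names here are made up, not the actual buildfarm schema):

CREATE FUNCTION stamp_row() RETURNS trigger LANGUAGE plpgsql AS $$
BEGIN
    NEW.updated_at := current_timestamp;
    RETURN NEW;  -- omit this and every INSERT/UPDATE through the trigger fails
END;
$$;

CREATE TRIGGER stamp_row_trg BEFORE INSERT OR UPDATE ON some_table
    FOR EACH ROW EXECUTE PROCEDURE stamp_row();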
This tripped me up for about two hours this morning. If you are creating an aggregate with a dynamically allocated C object as the state value (type "internal" at the SQL level), you need to make sure that it is allocated in a memory context with the same lifetime as the aggregate call, so that it remains valid across calls to the transition function. So if your state object is a StringInfo, as mine was, instead of doing:
state = makeStringInfo();
you need to do something like:
oldcontext = MemoryContextSwitchTo(aggcontext);
state = makeStringInfo();
MemoryContextSwitchTo(oldcontext);   /* switch back so per-call allocations don't leak into aggcontext */
There is a good complete example of this in the source code in the file src/backend/utils/adt/varlena.c - see the functions string_agg_transfn() and makeStringAggState().
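To put it all together, here is a minimal sketch of a transition function. The function and argument names are mine, not from the buildfarm or core code, but AggCheckCallContext() is the standard way to verify you're in an aggregate call and to find the right memory context:

#include "postgres.h"
#include "fmgr.h"
#include "lib/stringinfo.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(my_agg_transfn);

Datum
my_agg_transfn(PG_FUNCTION_ARGS)
{
    MemoryContext aggcontext;
    MemoryContext oldcontext;
    StringInfo    state;

    /* make sure we're being called as an aggregate, and get its context */
    if (!AggCheckCallContext(fcinfo, &aggcontext))
        elog(ERROR, "my_agg_transfn called in non-aggregate context");

    if (PG_ARGISNULL(0))
    {
        /* first call: allocate the state where it will survive across calls */
        oldcontext = MemoryContextSwitchTo(aggcontext);
        state = makeStringInfo();
        MemoryContextSwitchTo(oldcontext);
    }
    else
        state = (StringInfo) PG_GETARG_POINTER(0);

    /* accumulate the next value; repalloc keeps the buffer in aggcontext */
    if (!PG_ARGISNULL(1))
        appendStringInfoString(state, text_to_cstring(PG_GETARG_TEXT_PP(1)));

    PG_RETURN_POINTER(state);
}

You would pair this with a final function that turns the StringInfo into a text datum, and declare the aggregate with stype = internal.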
Once I got this fixed, my aggregate was much happier.