What’s Up Lately With Slony?

What’s up Lately? 2011-04-12 Tue

Git Changeover

In July 2010, we switched over to Git, which has been working out quite well so far. The official repository is at git.postgresql.org; some developers also publish their own repositories publicly at GitHub.

In those “private” repositories you can find the branches that developers have opened to work on various bug fixes and features.

The next big version

We have been working on what seems most likely to be called the “2.1 release.”

  • There are quite a lot of fixes and enhancements already in place. We have been faithful about integrating release notes as changes are made, so the master RELEASE notes should accurately represent what has changed. Some highlights include:
    • Changes to the queries against sl_log_* tables improve performance when working through a large backlog
    • Slonik now supports commands to bulk-add tables and sequences
    • Integration of the clustertest framework, which runs rather more sophisticated tests, obsoleting the previous “ducttape” and shell script tests
    • Cleanup of a bunch of things
      • Use of named parameters in all functions.
      • Removal of the SNMP support, which no longer seemed to work and was never part of any regression tests.
  • It is unlikely that this will get dubbed “version 3,” as there are not the sorts of deep changes that would warrant it.
    • The database schema has not materially changed in any way that would require re-initializing clusters, as was the case between versions 1.2 and 2.0.
    • The changes generally are not huge, with the exception of a couple of features that are not quite ready yet (which deserve their own separate discussion)
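As an illustration of the bulk-add support, here is a sketch of what a 2.1 slonik script might look like. The cluster, connection, and table names are made up, and the regular-expression form of the TABLES/SEQUENCES options is assumed from the 2.1 grammar:

```
cluster name = mycluster;
node 1 admin conninfo = 'dbname=app host=db1';

# Hypothetical table names; one command can now match many tables
set add table (set id = 1, tables = 'public\\.billing_.*');
set add sequence (set id = 1, sequences = 'public\\..*_seq');
```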

Still Outstanding

There are two features being worked on that we had hoped would be ready around the time of PGCon 2011:

Implicit WAIT FOR EVENT
This feature causes most slonik commands to wait for the relevant event confirmations to be received before they are considered complete. For instance, SUBSCRIBE SET would wait until the subscription has completed before proceeding.
Multinode FAIL OVER
For clusters where there are multiple origins for different sets, this allows properly reshaping the entire cluster, which has historically been rather more troublesome than people usually recognized.

Unfortunately, neither of these is quite ready yet. It is conceivable that the automatic waiting may be mostly ready, but complications and interruptions have gotten in the way of completing multinode failover.
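For context, today a slonik script has to issue those waits explicitly itself; implicit waiting would fold this sort of boilerplate into each command. A minimal sketch, with illustrative node numbers and connection strings:

```
cluster name = mycluster;
node 1 admin conninfo = 'dbname=app host=db1';
node 2 admin conninfo = 'dbname=app host=db2';

subscribe set (id = 1, provider = 1, receiver = 2, forward = yes);
# Without implicit waiting, the script must block by hand until
# node 2 has confirmed the subscription event from node 1
sync (id = 1);
wait for event (origin = 1, confirmed = 2, wait on = 1);
```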

When will 2.1 be ready?

Three possibilities seem to present themselves:

  1. Release what we’ve got as 2.1, and let the outstanding items arrive in a future version. Unfortunately, this would seem to dictate that we support a “version 2.1” for an extended period of time, complete with the trouble and effort of backpatching. It’s not very attractive.
  2. Draw in Implicit WAIT FOR EVENT, which would make for a substantially more featureful 2.1, and let multinode FAIL OVER come along later. We had been hoping that there would be common functionality between these two features, so we had imagined it a bad idea to do one without the other. But perhaps that’s wrong: Implicit WAIT FOR EVENT may not need multinode failover to be meaningful, and that does seem like it may be true.

    There is still the same issue as with 1. above, that this would mean having an extra version of Slony to support, which isn’t something anyone is too keen on.

  3. Wait until it’s all ready. This gets rid of the version proliferation problem, but means that it’s going to be a while (several months, perhaps quite a few) before users may benefit from any of these enhancements.

    Development of the failover facility seems like it will be bottlenecked for a while on Jan, so this suggests that it may be timely to solicit features that Steve and I might work on concurrently in the interim.

So, what might still go into 2.1?

  • We periodically get bug reports from people about this and that, and minor things will certainly get drawn in, particularly if they represent incorrect behaviour.
  • ABORT script: I plan to send a note out soon describing my thoughts thus far.
  • Cluster analysis tooling: I think it would be pretty neat to connect to a Slony cluster, pull out some data, and generate web pages and GraphViz diagrams characterizing the status and health of the cluster.
  • There was evidently discussion at PGEast about trying to get the altperl scripts improved/cleaned up. My personal opinion (cbbrowne) is that they’re not quite general enough, and that making them so would be more trouble than it’s worth, so my “vote” would be to deprecate them.

    But that is certainly not the only opinion out there – there are apparently others that regularly use them.

    While I’m not keen on putting effort into them, if there is some consensus on what to do, I’d go along with it. That might include:

    • Adding scripts to address slonik features that have not thus far been included in altperl.
    • Integrating tests into the set of tests run using the clustertest framework, so that we have some verification that this stuff works properly.
  • Insert your pet feature here? Maybe there’s some low-hanging fruit that we’re not aware of that’s worth poking at.
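On the cluster analysis idea, a minimal sketch of the sort of query such a tool might run, assuming a cluster named “mycluster” (the column names follow the Slony-I catalog; the DOT wrapping is left to the tool):

```sql
-- Hypothetical sketch: emit GraphViz DOT edges for the subscription
-- graph; wrap the output in "digraph cluster { ... }" to get a diagram.
SELECT 'n' || sub_provider || ' -> n' || sub_receiver ||
       ' [label="set ' || sub_set || '"];' AS dot_edge
  FROM _mycluster.sl_subscribe
 WHERE sub_active;
```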

More Slony work

I have been way too busy to do substantial Slony work for a while, being very heavily engaged in internal (infernal?) DB apps work.

At long last I reached a certain degree of completion that allowed me a breather, and a little time to look at Slony 2.0 issues.

I have been experimenting with Git lately, in several contexts, so pulled the PostgreSQL Git repo, with a view to using that as my “PostgreSQL HEAD” for testing. While the “official” PG version for our apps is 8.3, I usually do my builds/tests on either 8.4 or CVS HEAD, or, I guess, now, Git “master” ;-).

After checking out Git master, I found problems with both the internal app (a minor thing in accessing information_schema) and, alas, Slony :-(. A function now takes 3 arguments (and, in the Klingon tradition, always wins them!), thus needing a bit of autoconf remediation. I hate autoconf… but absent some substantial Tilbrook contributions, that won’t be changing soon! 🙁

I surely hope I can run through a set of regression tests this coming week so as to get 2.0.3 released!

Slony-I DDL problem

Yay, I figured out the nature of the problem with running per-node DDL scripts on v2.0.

It’s fairly deep; there is something about the semantics of “local” execution that’s wrong. Gonna need to pass this upstream to Jan for more input…

Slony-I 2.0 fixes…

Apparently, it’s a good day to CVS COMMIT…

I was able to get fixes in to

  • fix testpartition
    The partitioning functions were still modelled on the “old” way that Slony-I 1.2 and earlier handled triggers, where triggers had to be dropped and re-added each time you ran DDL. They need to use the new scheme.
  • fix multipaths
    This test exercised code paths revealing a number of remaining references to storenode_int(int,text,boolean); in 2.0, we’re dropping the boolean flag that was intended to indicate that the node was to be used as a log shipping source.

Several tests are not working at present in the 2.0 branch, notably (but possibly not only) the following, which will be the targets of further work:

  • testddl
  • testlogship

Slony-I 2.0 Testing

The “CVS HEAD” branch of Slony-I has been open for rather a while, and has rather a lot of changes in it. ’Tis time, now, to validate that it is generally working, so that we can, with some confidence, let it out and allow others to start identifying any remaining issues that would prevent a release.

Today’s extra entertainment: I have added a bit of code to my test script that publishes to Twitter each time I run a test…

My Next Slony-I trick…

I’m working on a “CANCEL SUBSCRIPTION” feature that has been talked about for rather a long time. (It’s on the bug list: http://www.slony.info/bugzilla/show_bug.cgi?id=10.)

The notion is that if you have a subscription that keeps failing for some reason, we should have a way of dropping the subscription request without being forced to throw away the entire cluster configuration. The special thing about CANCEL SUBSCRIPTION is that it needs to be able to take action against events EARLIER in the event stream than itself.

In analyzing the environment of data available to CANCEL SUBSCRIPTION, it looks pretty highly constrained: there are quite a number of cases where it should decline to do anything.
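To make the shape of the thing concrete: the command does not exist yet, and the syntax below is purely hypothetical, but the intent is roughly a slonik command of this form:

```
# Hypothetical syntax; the feature is still being designed
cancel subscription (id = 1, receiver = 3);
```

Whatever form it finally takes, the unusual requirement stands: it must reach back and neutralize an earlier SUBSCRIBE SET event in the stream rather than merely appending a new one.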