Wednesday, July 23, 2008

Inside ALUI Grid Search: Redundancy Bug (6.1 on Windows at least)

With ALUI 6.1, BEA introduced a completely revamped search component for ALUI, allowing for better redundancy and better throughput: Grid Search. The main advantages of that new search component are:

  • Multiple search nodes to provide redundancy for serving search requests.
  • The search index can be split into multiple partitions, each attached to one or more search nodes, to increase throughput.

All nodes on the same partition automatically replicate their search index locally to guarantee redundancy and performance.

Although multiple nodes can be deployed to guarantee redundancy, all the nodes need to access a central "cluster" data repository located somewhere on the network (through a file share). By default it is located at <ALUI_HOME>/ptsearchserver/6.1/cluster. The usual setup is to share that folder (a simple network share if you are on Windows) and configure the other nodes to use that share as their cluster repository. This cluster repository holds the cluster information (nodes and partitions info) and the multiple search checkpoints that allow for search index backup.

The main problem I personally experienced with that design is that this cluster repository represents a single point of failure... if the cluster share suddenly becomes unavailable (hard disk, server, or network failure), none of the nodes can talk to the cluster anymore, and problems are to be expected.

And actually, a huge problem occurs in that case: if the cluster share is not available, all the nodes suddenly hit an "Out Of Memory" exception and shut down abruptly. Thus, although you deployed multiple nodes and partitions, if the cluster share is down, your search architecture is...down.

It is pretty easy to test (at least I successfully reproduced the bug on ALUI 6.1 MP1 Patch 1 on Windows Server 2003): have all your nodes running, and simply remove the share from your cluster folder...all your nodes will go down (apart from the one that accesses the cluster folder locally, if the cluster share is hosted on the same server as one of the nodes).
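If you want to see exactly when the share disappears while running that test, a small probe can help. This is only a minimal sketch in Python, not anything shipped with ALUI: the function name and the example UNC path are hypothetical, and it simply checks whether the cluster folder can still be listed from the machine where it runs.

```python
import os


def cluster_share_available(cluster_path):
    """Return True if the cluster folder can be listed, False otherwise.

    cluster_path is whatever path your nodes use for the cluster
    repository (hypothetical example: r"\\searchhost\cluster").
    """
    try:
        # Listing the folder forces a real access to the share, so a
        # removed share or a dead server surfaces as an OSError here.
        os.listdir(cluster_path)
        return True
    except OSError:
        return False
```

Run it in a loop (or from a scheduled task) on one of the remote nodes before and after you remove the share, and compare the moment it starts returning False with the moment the search nodes go down.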

There are two options from there:

  • make sure the share is never down (Windows clustering, a redundant NAS cluster, or PolyServe technologies)
  • install the critical fix from BEA that fixes this bug

If you don't have an infrastructure that provides the first (expensive) option, you might want to look seriously into the second one...and contact your sales rep ASAP. Basically, the critical fix allows the nodes to continue serving requests even if the cluster share is no longer available. All the nodes automatically switch to read-only mode, without the "Out Of Memory" exception that was occurring before.

Although it is much better, some problems are still present with that critical fix. When in read-only mode, the nodes no longer index new content...your search index is frozen at the point in time when the cluster share went down, and any new object or document will not show up in search as long as the cluster share is not restored. The second problem is that the nodes will NOT automatically switch back to read/write mode when the cluster share is available again. That requires a manual restart.
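Since the patched nodes stay in read-only mode until someone restarts them, it can be useful to track when a restart is actually due. Here is a rough sketch of that bookkeeping in Python. It assumes the behavior described above (read-only after an outage, manual restart needed once the share is back); the class and method names are hypothetical, and how you obtain the "share is up" boolean is up to you.

```python
class ReadOnlyTracker:
    """Remembers share outages and flags when a manual restart is due."""

    def __init__(self):
        self.outage_seen = False
        self.restart_needed = False

    def observe(self, share_up):
        """Record one probe result; return True if a restart is due."""
        if not share_up:
            # Share is down: patched nodes are (assumed to be) read-only,
            # and restarting them now would not bring indexing back.
            self.outage_seen = True
            self.restart_needed = False
        elif self.outage_seen:
            # Share is back after an outage: nodes stay read-only
            # until somebody restarts them.
            self.restart_needed = True
        return self.restart_needed

    def mark_restarted(self):
        """Call this once the nodes have been restarted manually."""
        self.outage_seen = False
        self.restart_needed = False
```

Feed `observe()` the result of a periodic check of the cluster share, send yourself an alert when it returns True, and call `mark_restarted()` after you have restarted the search nodes.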

But compared to a total shutdown of search, these problems are much less serious indeed!

I am not 100% sure this fix has been included in ALUI 6.5, but I sure hope so. And by fix, I mean a total fix, including automatic switch back to "normal" mode when the share is available again, or even allowing for TOTAL continuity of service when this share goes down...

Please let me know (leave a comment) if you have that information on 6.5, or if you reproduce this with other versions of the portal.
