Non-multicast Tomcat Session Clustering with RabbitMQ
Session replication and clustering is one of those annoyingly persistent problems when hosting web applications within a platform-agnostic, horizontally-scalable, enterprise cloud environment. Sticky sessions are a valid option to preserve a user’s login information. But that’s far from being the dynamically-load-balanced Cloud Nirvana I’m aiming for.
I don’t want to have to use sticky sessions. I tried the built-in multicast-based clustering and found it would break randomly in my virtual machine environment, causing a loss of user sessions. Maybe I wasn’t configuring it right (though I tried going strictly by the documentation, such as there is) or maybe our network isn’t suited to doing that kind of session replication. I finally had to ditch it and go back to regular sticky sessions.
But even that wasn’t enough for moving forward. I also want to have a load-balanced pool of Apache proxy servers on the front end. My goal is a completely dynamic environment where a user can come to any front-end Apache server and have their application service requests forwarded to any active backend server (we use SpringSource’s tcServer 6.0) whether that user has ever hit that server before or not. I also don’t want a full copy of every session in the cloud on every tcServer node. That’s asking a lot, I know. I found out that there isn’t any software out there to do this. Maybe it’s because it’s not a good idea and shouldn’t be done! I hope that’s not the case. But whether it’s a good idea or not, I plunged ahead and created a replacement Manager and Store that manages Tomcat Session objects in a dynamic cloud environment using RabbitMQ as the asynchronous message broker.
Each node in the cloud listens for events on three different kinds of exchanges:
- A fanout exchange that blasts tiny event messages to every active node. Consumers of this exchange are considered part of a “cluster” or “group”.
- A direct exchange which is for messages intended for specific nodes (e.g. to get a user session). These are considered “source” events (it’s not the most intuitive name, unfortunately…I’m open to suggestions :).
- A topic exchange which is used to keep copies of user sessions somewhere other than in the internal Map of one of the nodes.
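To make the difference between the first two exchange types concrete, here is a toy, in-memory sketch of fanout versus direct delivery. It does not use RabbitMQ at all: the `ToyBroker` class, its method names, and the node ids are hypothetical and exist only for illustration. The topic exchange is left out because its routing keys aren't described here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a broker. It only imitates the
// delivery rules of a fanout and a direct exchange; it is not RabbitMQ.
class ToyBroker {
    // node id -> messages delivered to that node's queue
    private final Map<String, List<String>> queues = new LinkedHashMap<>();

    void register(String nodeId) {
        queues.put(nodeId, new ArrayList<>());
    }

    // Fanout: every registered node gets a copy, the sender included.
    void publishFanout(String message) {
        for (List<String> queue : queues.values()) {
            queue.add(message);
        }
    }

    // Direct: only the node whose id equals the routing key gets the message.
    void publishDirect(String routingKey, String message) {
        List<String> queue = queues.get(routingKey);
        if (queue != null) {
            queue.add(message);
        }
    }

    List<String> received(String nodeId) {
        return queues.get(nodeId);
    }
}
```

With two registered nodes, a fanout publish lands in both queues, while a direct publish keyed on one node id lands only in that node's queue. That is the split between "tell everyone who owns session X" and "ask one node for session X".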
When a new session is created and saved to the Store, that session object is added to an internal Map. A “touch” message is sent to the fanout exchange letting everyone in the cloud know that, for session ID “X”, I (the “source”) have the actual session object. Each node that receives this message updates an internal Map with this reference so it knows where to go get the actual session object when it needs it.
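The bookkeeping behind a “touch” can be sketched roughly like this. It is a minimal illustration under my own assumptions, not the actual Store code: the class name `TouchTracker` and its methods are hypothetical, and the messaging is left out so that only the two Maps are shown.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the two Maps a node keeps. Sending and receiving
// the actual "touch" message over the fanout exchange is not shown.
class TouchTracker {
    private final String storeId;
    // session id -> session object held on this node
    private final Map<String, Object> localSessions = new ConcurrentHashMap<>();
    // session id -> storeId of the node that claims to be the "source"
    private final Map<String, String> cloudSessions = new ConcurrentHashMap<>();

    TouchTracker(String storeId) {
        this.storeId = storeId;
    }

    String storeId() {
        return storeId;
    }

    // Saving a session locally; the caller would then publish a "touch".
    void save(String sessionId, Object session) {
        localSessions.put(sessionId, session);
        cloudSessions.put(sessionId, storeId);
    }

    // Handling a "touch" from any node: remember who has the session.
    void onTouch(String sessionId, String sourceId) {
        cloudSessions.put(sessionId, sourceId);
    }

    // Returns the source node for a session, or null if the cloud has none.
    String sourceOf(String sessionId) {
        return cloudSessions.get(sessionId);
    }
}
```

If node A saves session “X” and node B then handles the resulting touch via `onTouch("X", a.storeId())`, B's `sourceOf("X")` returns A's id, which is all B needs in order to know where to send a “load” later.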
When the user makes a subsequent request to a different server, the Session Store knows it doesn’t have that object in its internal Map, so it checks the cloud Map to see if that session exists somewhere in the cloud. If it gets a “source” for that session, it sends the source node (identified by the “storeId” property of the Store) a “load” message and identifies itself as the requestor. The Store gets the Session object from its internal Map, serializes it, then sends the requestor an “update” message with the bytes of the serialized session as the body of the message.
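The “serialize it, send the bytes, deserialize on the other side” part can be sketched with plain Java serialization. This is only an illustration: `SessionCodec` is a hypothetical helper name, and a real Tomcat Store would most likely also need to resolve classes through the web application's class loader, which this sketch ignores.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical helper: turns a serializable object into the byte[] body of
// an "update" message and back again using standard Java serialization.
class SessionCodec {
    static byte[] toBytes(Serializable session) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Closing the ObjectOutputStream flushes everything into the buffer.
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(session);
        }
        return buffer.toByteArray();
    }

    static Object fromBytes(byte[] body) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(body))) {
            return in.readObject();
        }
    }
}
```

On the source node the result of `SessionCodec.toBytes(...)` would become the message body; on the requestor, `SessionCodec.fromBytes(body)` gives back an equal copy of the object.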
When the requestor receives the “update” message, it deserializes the session, checks its internal Map of SessionLoaders to see if any are waiting on this session and, if so, places the Session object on that loader’s queue so it can, in turn, pass that object back into the Manager and on through the Request lifecycle. The session is then stored locally and another “touch” message is sent to the cloud, notifying the other nodes that it now claims to be the “source” for session ID “X”. When the node that just sent the Session to the requestor sees this new “touch” message and notices it has a session for ID “X” in its internal Map, it checks to see if that Session was put there as the result of a replication event or not. If it’s not a replica, then it assumes it no longer needs to keep a copy of that Session object because another node in the cloud now has it. It deletes that Session from its internal Map.
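The hand-off between the message listener and a waiting SessionLoader might look something like the following. It is a sketch under assumptions: `PendingLoads`, the one-slot queue, and the timeout are my own illustrative choices, not necessarily how the real loaders are built.

```java
import java.util.Map;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;

// Hypothetical registry of loaders waiting for an "update" message.
class PendingLoads {
    // session id -> one-slot queue that the waiting loader blocks on
    private final Map<String, BlockingQueue<Object>> loaders = new ConcurrentHashMap<>();

    // Register interest BEFORE sending the "load" message, so that a very
    // fast reply cannot arrive while nobody is listening for it.
    BlockingQueue<Object> expect(String sessionId) {
        return loaders.computeIfAbsent(sessionId, id -> new ArrayBlockingQueue<>(1));
    }

    // Called by the message listener when an "update" arrives.
    // Returns true if a loader was waiting and received the session.
    boolean onUpdate(String sessionId, Object session) {
        BlockingQueue<Object> queue = loaders.remove(sessionId);
        return queue != null && queue.offer(session);
    }

    // Called on the request thread; returns null if nothing arrived in time.
    Object await(String sessionId, BlockingQueue<Object> queue, long timeoutMs)
            throws InterruptedException {
        try {
            return queue.poll(timeoutMs, TimeUnit.MILLISECONDS);
        } finally {
            // Clean up the registration if it is still ours (e.g. on timeout).
            loaders.remove(sessionId, queue);
        }
    }
}
```

The request thread calls `expect`, sends the “load” message, then blocks in `await`; the listener thread calls `onUpdate` with the deserialized session. A timeout keeps a request from hanging forever if the source node never answers.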
The idea here is that the CloudSession object bounces from one node to the next in the cloud, going wherever it is needed. Session membership in the cloud is managed by having every node maintain a String -> String Map that tells it what sessions are in the cloud (I’m still working on the valid/invalid functionality) and what node is the “source” (where to go to get the actual Session object).
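The rule that makes the session “bounce” (drop your copy when someone else claims it, unless your copy is a replica) can be sketched as below. The `HandOff` class and its `Entry` type are hypothetical. The check that skips a node's own touch is also my assumption, based on a fanout message reaching the sender as well.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the hand-off rule applied when a "touch" arrives.
class HandOff {
    // A locally held session plus a flag saying whether it is a replica.
    static final class Entry {
        final Object session;
        final boolean replica;

        Entry(Object session, boolean replica) {
            this.session = session;
            this.replica = replica;
        }
    }

    private final String storeId;
    private final Map<String, Entry> local = new ConcurrentHashMap<>();
    // session id -> storeId of the current "source" (the String -> String Map)
    private final Map<String, String> cloud = new ConcurrentHashMap<>();

    HandOff(String storeId) {
        this.storeId = storeId;
    }

    void store(String sessionId, Object session, boolean replica) {
        local.put(sessionId, new Entry(session, replica));
    }

    void onTouch(String sessionId, String sourceId) {
        cloud.put(sessionId, sourceId);
        if (sourceId.equals(storeId)) {
            return; // assumption: our own touch comes back to us; keep our copy
        }
        Entry entry = local.get(sessionId);
        if (entry != null && !entry.replica) {
            local.remove(sessionId); // another node owns it now; drop our copy
        }
    }

    boolean holds(String sessionId) {
        return local.containsKey(sessionId);
    }

    String sourceOf(String sessionId) {
        return cloud.get(sessionId);
    }
}
```

After `onTouch("X", "node-b")` on node A, A no longer holds a non-replica “X” but still knows that node B is the source, so a later request for “X” on A would trigger another “load”.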
I’m not even going to pretend that this is just as performant as using sticky sessions. To be honest, I don’t think any robust, cloud-friendly session clustering would be as fast as using the PersistentManager implementation. I also haven’t load-tested this. I don’t know what’s going to break when a bunch of users start hitting it at once. But in my defense, this isn’t release-quality software. This is about as alpha and rough-around-the-edges as it comes. But the basics are all there. The foundation has been laid to build upon, I hope. In my ad-hoc tests, session load runtimes are 10-20 milliseconds (I only use sessions for the purposes of authentication…any persistence our applications do is done through Postgres). I’m only running two updaters at a time, but higher throughput sites might have to run more.
This module, like all the vCloud Utilities I’m writing, is Apache 2.0 licensed and available on GitHub:
I apologize in advance for not having much in the way of documentation on building, installing, and using this session manager. Once I work the bugs out and it is more stable and I know what I’m working with here, I’ll start adding some of that.