Tuesday, December 14, 2010

Clustered Sessions of Resin 4.0 in a Nutshell

When migrating from Resin 3.1.3 to Resin 4.0.13, we got a shock: Resin 4.x does not support JDBC persistence for sessions.

Here is something to help you get started on a deep dive into the world of "clusters":
  • Resin's concept of "cluster" is a group of servers doing the same thing.
    • For clustering to work, all the servers in a cluster should have exactly the same configuration file.
    • The only difference between servers should be jvm-args (or load balance weights) in case you have a lightly-loaded performance-testing server.
    • In particular, the server order must be the same or Resin will get very confused.
    • If you have different configurations, multiple clusters will be created, each doing a slightly different thing.

  • The first three servers listed in the config file are known as the Resin triad servers.
    • For sites with a single server, the server is considered a triad server. For sites with two servers, both servers are considered triad servers, and back each other up.
    • Resin's clustering uses the triad as a reliable store for shared data across the Resin system.
    • All session data is replicated to all three triad servers for both reliability and load balancing.
    • A session is owned by a specific triad server (one of the three), chosen by a hash of the session id. Since one third of the sessions are owned by server A, one third by server B, and one third by server C, the load is evenly distributed across the triad. This triad load balancing lets the cluster pod scale up to its full size of 64 servers.

  • The hash of the session is used to determine the owning triad server.
    • The same session id hashing is used on both triad and non-triad servers.
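
To make the hashing idea concrete, here is a minimal sketch in Python. It is an illustration only: the post does not say which hash function Resin uses, so SHA-256 and the server names `server-a`, `server-b` and `server-c` are assumptions, and both function names are hypothetical.

```python
import hashlib

# Hypothetical triad server names; Resin's real hash function is not
# described in this post, so SHA-256 is only a stand-in.
TRIAD = ["server-a", "server-b", "server-c"]


def owning_triad_server(session_id, triad=TRIAD):
    """Map a session id to one triad server using a stable hash."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(triad)
    return triad[index]


def ownership_counts(session_ids, triad=TRIAD):
    """Count how many of the given sessions each triad server owns."""
    counts = {name: 0 for name in triad}
    for session_id in session_ids:
        counts[owning_triad_server(session_id, triad)] += 1
    return counts
```

Because every server computes the same hash, any server can work out the owner of a session without asking anyone, and over many session ids each of the three servers ends up owning roughly a third of them.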

  • As part of Resin's health check system, the cluster continually checks that all servers are alive and responding properly. Every 60 seconds, each server sends a heartbeat to the triad, and each triad server sends a heartbeat to all the other servers.
    • When a server fails, Resin detects the failure immediately, logs it for an administrator's analysis, and internally prepares to fail over to backup servers for any messaging for distributed storage, such as the clustered sessions and the clustered deployment.
    • When the server comes back up, the heartbeat is reestablished and any missing data is recovered.
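
The heartbeat bookkeeping can be sketched roughly like this in Python. It is not Resin's implementation: the class name `HeartbeatMonitor` is hypothetical, and treating a server as failed after two missed 60-second intervals is an assumption made for the example. The current time is passed in explicitly so the sketch is easy to follow and test.

```python
HEARTBEAT_INTERVAL = 60.0  # seconds, the interval quoted above


class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each server (illustrative only)."""

    def __init__(self, servers, timeout=2 * HEARTBEAT_INTERVAL):
        # Assumption: a server counts as failed once no heartbeat has
        # arrived for `timeout` seconds (two intervals by default).
        self.timeout = timeout
        self.last_seen = {name: None for name in servers}

    def record_heartbeat(self, server, now):
        """Remember when a heartbeat from `server` arrived."""
        self.last_seen[server] = now

    def live_servers(self, now):
        """Servers whose last heartbeat is recent enough."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if seen is not None and now - seen <= self.timeout
        )

    def failed_servers(self, now):
        """Servers that never reported or whose heartbeat is too old."""
        live = set(self.live_servers(now))
        return sorted(name for name in self.last_seen if name not in live)
```

When a failed server starts sending heartbeats again, the next `record_heartbeat` call puts it back in the live set, which mirrors the "heartbeat is reestablished" step above.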

How it works, by example

Suppose we have 6 servers (nodes) in a cluster. Servers 1, 2 and 3 form the triad.
  • When a request comes to server 4 how does it retrieve a session?
    • When a request comes to server 4 and server 4 does not have the current session information, it asks the owning triad server for the session data.

  • In the case of a new session, how does it get distributed across the triad servers?
    • A new session is assigned to one of the triad servers using a hash of the session id, which is effectively random. So each triad server gets about 1/3 of the session load.

  • Do the nodes that are not part of the triad have any local store of their own?
    • The non-triad nodes have a local cache. Each non-triad node keeps a copy of the session data it has loaded, so it can avoid a network request when the data is in its cache.

  • If sticky sessions are disabled, will the local cache of the non-triad nodes ever be used, or does a node always get the session information from the triad owner?
    • With non-sticky sessions, the session will be loaded from the triad on each request. (Unless you happen to get lucky and end up on the same server.) It's always better to enable sticky sessions when possible. Even IP-based sticky sessions are better than nothing.
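
To see why sticky sessions matter, here is a deliberately simplified Python model that counts how often a session has to be loaded from the triad. The assumptions are mine, not Resin's: a server can reuse its cached copy only if it also handled the previous request for that session, sticky routing is faked with a simple hash of the session id, and non-sticky routing is plain round-robin. The function name `count_triad_loads` is hypothetical.

```python
def count_triad_loads(requests, servers, sticky):
    """Count the requests that need the session loaded from the triad.

    `requests` is a list of session ids in arrival order.
    Assumption: a cached copy is usable only when the same server also
    handled the previous request for that session.
    """
    last_server = {}  # session id -> server that handled its last request
    loads = 0
    for i, session_id in enumerate(requests):
        if sticky:
            # Hypothetical sticky routing: one fixed server per session.
            index = sum(session_id.encode("utf-8")) % len(servers)
        else:
            # Non-sticky routing modelled as round-robin.
            index = i % len(servers)
        server = servers[index]
        if last_server.get(session_id) != server:
            loads += 1  # cache miss: fetch from the owning triad server
        last_server[session_id] = server
    return loads
```

In this model, six requests for one session spread over servers 4, 5 and 6 cost one triad load with sticky routing and six loads with round-robin, which is the "loaded from the triad on each request" behaviour described above.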

  • Where does the triad store the session data? Is it configurable?
    • The session data is in resin-data/<server>. The resin-data location can be configured, but not the structure underneath it.

  • What happens when one of the nodes in the triad goes down? How do the other nodes which are not in the triad know about this?
    • The heartbeat service continually checks which servers are up using a dedicated TCP connection. If a triad server goes down, all other servers will see the TCP connection close within a second for normal exits. If the machine is unplugged, the next heartbeat, up to 60 seconds later, will detect the crashed server.

  • When a triad server goes down, are the sessions it owned distributed equally between the two remaining triad servers? If so, how?
    • Since all the servers, including the non-triad servers, know that one of the triad servers is down (because they received a TCP close), the session id mapping will then point to the other two triad servers. All three triad servers hold copies of all the sessions, so even if two triad servers go down, the sessions are still backed up.
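
One way to picture the remapping is the Python sketch below. It is only an illustration: the post does not describe how Resin picks the replacement owner, so spreading the orphaned sessions over the surviving triad servers with a second slice of the same hash is an assumption, as are SHA-256 and the names `server-1` to `server-3`.

```python
import hashlib

# Hypothetical names for the three triad servers in the example cluster.
TRIAD = ["server-1", "server-2", "server-3"]


def session_hash(session_id):
    """Stable integer hash of a session id (SHA-256 is only a stand-in)."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner_with_failover(session_id, triad=TRIAD, down=frozenset()):
    """Pick the owning triad server, skipping servers known to be down."""
    h = session_hash(session_id)
    primary = triad[h % len(triad)]
    if primary not in down:
        return primary  # the normal case: the hashed owner is alive
    live = [name for name in triad if name not in down]
    if not live:
        raise RuntimeError("no triad server is available")
    # Assumption: orphaned sessions are spread over the survivors by
    # reusing the hash, so no single survivor takes all of them.
    return live[(h // len(triad)) % len(live)]
```

Sessions whose owner is still alive keep the same owner, so only the sessions of the dead server move. Since every triad server already holds a copy of every session, the new owner can serve them without any data being transferred first.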

For more detailed information, check out the Resin 4.0 Technical White Paper at http://www.caucho.com/articles/resin-cloud.pdf

Till next time.....................