TAO/orbsvcs/examples/FaultTolerance/RolyPoly/README

Overview

RolyPoly is a simple example that shows how to increase application
reliability by using replication to tolerate faults. It allows you
to start two replicas of the same object which are logically seen as
one object by a client. Furthermore, you can terminate one of the
replicas without interrupting the service provided by the object.

RolyPoly uses request/reply logging to suppress repeated requests
(thus guaranteeing exactly-once semantics) and state synchronization
to ensure that all replicas are in a consistent state. Since replicas
are generally distributed across multiple nodes in the network,
logging and state synchronization are done using a multicast group
communication protocol.

To make the example illustrative, each replica can be set to
fail at one of the predefined places called crash points. The
following crash point numbers are defined:

0 - no crash point (default).

1 - fail before reply logging/state synchronization.

2 - fail after reply logging/state synchronization but before
    returning the reply to the client.

The essential difference between crash points 1 and 2 is that in
the second case the logged reply should be replayed, while in the
first case the request is simply re-executed (this can be observed
in the trace messages of the replicas).
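
The difference between the two crash points can be sketched with a
small stand-alone model. This is only an illustration under stated
assumptions: SharedLog, Replica, Crash, Outcome and invoke are
hypothetical names that do not come from the RolyPoly sources, and a
single in-memory struct stands in for the log and state that the real
example keeps consistent through group communication.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>

// Hypothetical model; none of these names come from the RolyPoly sources.

// Log and state shared by all replicas (assumption: one in-memory struct
// replaces the multicast-based synchronization).
struct SharedLog {
  std::map<int, int> replies;  // request id -> logged reply
  int state = 0;               // last synchronized object state
};

// Thrown to simulate a replica failing at a crash point.
struct Crash : std::runtime_error {
  Crash() : std::runtime_error("replica crashed") {}
};

struct Outcome {
  int reply;      // value returned to the client
  bool replayed;  // true if taken from the log, false if executed
};

class Replica {
public:
  Replica(SharedLog& log, int crash_point)
      : log_(log), crash_point_(crash_point) {
    assert(crash_point >= 0 && crash_point <= 2);
  }

  Outcome handle(int request_id) {
    // Repeated request: replay the logged reply instead of re-executing.
    auto it = log_.replies.find(request_id);
    if (it != log_.replies.end()) return {it->second, true};

    int reply = log_.state + 1;  // "execute": produce the next number

    if (crash_point_ == 1) throw Crash();  // before logging/synchronization

    log_.replies[request_id] = reply;  // reply logging
    log_.state = reply;                // state synchronization

    if (crash_point_ == 2) throw Crash();  // logged, but reply never sent

    return {reply, false};
  }

private:
  SharedLog& log_;
  int crash_point_;
};

// Client-side failover: retry the same request on the backup replica.
Outcome invoke(Replica& primary, Replica& backup, int request_id) {
  try {
    return primary.handle(request_id);
  } catch (const Crash&) {
    return backup.handle(request_id);
  }
}
```

In this model, a primary created with crash point 1 leaves nothing in
the log, so the backup executes the request itself. With crash point 2
the reply is already logged, so the backup replays it and the state is
advanced only once.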


Execution Scenario

In this example scenario we will start three replicas. For one
of them (let us call it the primary) we will specify a crash point
other than 0. Then we will start a client to execute requests
on the resulting object. After a few requests, the primary will
fail and we will be able to observe the client transparently
shifting to another replica. We will also be able to verify
that, after this shift, the object is still in the expected state
(i.e. the sequence of returned numbers is not interrupted and,
in the case of crash point 2, the request is not re-executed).

Note that, due to the underlying group communication architecture,
a group with only one member (a replica in our case) can exist
only for a very short period of time. This, in turn, means
that we need to start the first two replicas at virtually the same
time. This is also the reason why we need three replicas instead
of two: if one replica is going to fail, the other one
won't live very long alone. For more information on why it
works this way, please see the documentation for TMCast
available at $(ACE_ROOT)/ace/TMCast/README.

Suppose we have node0, node1 and node2 on which we are going
to start our replicas (they could all be the same node). Then, to
start our replicas, we can execute the following commands:

node0$ ./server -o replica-0.ior -c 2
node1$ ./server -o replica-1.ior
node2$ ./server -o replica-2.ior

When all replicas are up we can start the client:

$ ./client -k file://replica-0.ior -k file://replica-1.ior

In this scenario, after executing a few requests, replica-0
will fail at crash point 2. After that, replica-1 will continue
executing client requests. You can see what is going on with
the replicas by looking at the various trace messages printed
during execution.


Architecture

Most of the replication logic is carried out by the
ReplicaController. In particular, it performs the
following tasks:

* management of the distributed request/reply log

* state synchronization

* repeated request suppression


The object implementation (interface RolyPoly in our case) can use
two different strategies for delivering state updates to the
ReplicaController:

* push model: the object implementation calls
  Checkpointable::associate_state to associate the state update
  with the current request.

* pull model: the ReplicaController calls Checkpointable::get_state,
  which is implemented by the servant.

These two models can be used simultaneously. In the RolyPoly
interface implementation you can comment out the corresponding
piece of code to choose one of the strategies.
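
The two delivery strategies can be contrasted in a small stand-alone
sketch. Everything here is a simplification under stated assumptions:
Controller, PushServant and PullServant are hypothetical classes, the
state is a plain string, and the signatures do not match the real
Checkpointable interface. The rule that a pushed state takes precedence
over a pulled one is also an assumption made for the illustration.

```cpp
#include <string>

// Hypothetical model; the real Checkpointable interface and
// ReplicaController use different types and signatures.
class Controller {
public:
  // Push model: the servant hands over the state update that belongs
  // to the request currently being processed.
  void associate_state(const std::string& state) {
    pushed_ = state;
    has_pushed_ = true;
  }

  // Called once the servant has finished a request. Uses the pushed
  // state if there is one (assumption), otherwise pulls it.
  template <typename Servant>
  std::string checkpoint(Servant& servant) {
    std::string state = has_pushed_ ? pushed_ : servant.get_state();
    has_pushed_ = false;
    pushed_.clear();
    return state;
  }

private:
  std::string pushed_;
  bool has_pushed_ = false;
};

// Servant using the push model.
class PushServant {
public:
  explicit PushServant(Controller& controller) : controller_(controller) {}

  int next() {
    ++number_;
    controller_.associate_state(std::to_string(number_));  // push
    return number_;
  }

  std::string get_state() { return std::string(); }  // unused when pushing

private:
  Controller& controller_;
  int number_ = 0;
};

// Servant using the pull model.
class PullServant {
public:
  int next() { return ++number_; }

  std::string get_state() { return std::to_string(number_); }  // pulled

private:
  int number_ = 0;
};
```

With the push model the servant decides when and what to publish for
each request. With the pull model the servant only has to be able to
report its state when asked.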

--
Boris Kolpackov <boris@dre.vanderbilt.edu>