Cassandra split-brain partition


We are running a 6-node Cassandra cluster across two AWS regions, ap-southeast-1 (Singapore) and ap-southeast-2 (Sydney).

After running happily for several months, the cluster was given a rolling restart to fix a hung repair, and now each group of nodes thinks the other group is down.

cluster information:
    name: megaportglobal
    snitch: org.apache.cassandra.locator.dynamicendpointsnitch
    partitioner: org.apache.cassandra.dht.murmur3partitioner
    schema versions:
        220727fa-88d2-366f-9473-777e32744c37: [10.5.13.117, 10.5.12.245, 10.5.13.93]

        unreachable: [10.4.0.112, 10.4.0.169, 10.4.2.186]

cluster information:
    name: megaportglobal
    snitch: org.apache.cassandra.locator.dynamicendpointsnitch
    partitioner: org.apache.cassandra.dht.murmur3partitioner
    schema versions:
        3932d237-b907-3ef8-95bc-4276dc7f32e6: [10.4.0.112, 10.4.0.169, 10.4.2.186]

        unreachable: [10.5.13.117, 10.5.12.245, 10.5.13.93]
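The two "cluster information" blocks above appear to be the output of nodetool describecluster, one run on each side of the partition (an inference from the format, not something stated explicitly in the post). For reference:

    nodetool describecluster

Note that the two sides report different schema versions, and each side lists the other region's nodes as unreachable.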

From Sydney, 'nodetool status' reports the Singapore nodes as down:

datacenter: ap-southeast-2
==========================
status=up/down
|/ state=normal/leaving/joining/moving
--  address      load       tokens  owns    host id                               rack
un  10.4.0.112   9.04 gb    256     ?       b9c19de4-4939-4112-bf07-d136d8a57b57  2a
un  10.4.0.169   9.34 gb    256     ?       2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
un  10.4.2.186   10.72 gb   256     ?       4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
datacenter: ap-southeast-1
==========================
status=up/down
|/ state=normal/leaving/joining/moving
--  address      load       tokens  owns    host id                               rack
un  10.5.13.117  9.45 gb    256     ?       324ee189-3e72-465f-987f-cbc9f7bf740b  1a
dn  10.5.12.245  10.25 gb   256     ?       bee281c9-715b-4134-a033-00479a390f1e  1b
dn  10.5.13.93   12.29 gb   256     ?       a8262244-91bb-458f-9603-f8c8fe455924  1a

But from Singapore, the Sydney nodes are reported as down:

datacenter: ap-southeast-2
==========================
status=up/down
|/ state=normal/leaving/joining/moving
--  address      load       tokens  owns    host id                               rack
dn  10.4.0.112   8.91 gb    256     ?       b9c19de4-4939-4112-bf07-d136d8a57b57  2a
dn  10.4.0.169   ?          256     ?       2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
dn  10.4.2.186   ?          256     ?       4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
datacenter: ap-southeast-1
==========================
status=up/down
|/ state=normal/leaving/joining/moving
--  address      load       tokens  owns    host id                               rack
un  10.5.13.117  9.45 gb    256     ?       324ee189-3e72-465f-987f-cbc9f7bf740b  1a
un  10.5.12.245  10.25 gb   256     ?       bee281c9-715b-4134-a033-00479a390f1e  1b
un  10.5.13.93   12.29 gb   256     ?       a8262244-91bb-458f-9603-f8c8fe455924  1a

Even more confusing, 'nodetool gossipinfo' executed in Sydney reports everything as fine, with a status of normal for every node:

/10.5.13.117   generation:1440735653   heartbeat:724504   severity:0.0   dc:ap-southeast-1   load:1.0149565738e10   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:1a   status:normal,-1059943672916788858   release_version:2.1.6   net_version:8   rpc_address:10.5.13.117   internal_ip:10.5.13.117   host_id:324ee189-3e72-465f-987f-cbc9f7bf740b
/10.5.12.245   generation:1440734497   heartbeat:728014   severity:0.0   dc:ap-southeast-1   load:1.100647505e10   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:1b   status:normal,-1029869455226513030   release_version:2.1.6   net_version:8   rpc_address:10.5.12.245   internal_ip:10.5.12.245   host_id:bee281c9-715b-4134-a033-00479a390f1e
/10.4.0.112   generation:1440973751   heartbeat:4135   severity:0.0   dc:ap-southeast-2   load:9.70297176e9   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:2a   release_version:2.1.6   status:normal,-1016623069114845926   net_version:8   rpc_address:10.4.0.112   internal_ip:10.4.0.112   host_id:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93   generation:1440734532   heartbeat:727909   severity:0.0   dc:ap-southeast-1   load:1.3197536002e10   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:1a   status:normal,-1021689296016263011   release_version:2.1.6   net_version:8   rpc_address:10.5.13.93   internal_ip:10.5.13.93   host_id:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169   generation:1440974511   heartbeat:1832   severity:0.0   dc:ap-southeast-2   load:1.0023502338e10   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:2a   release_version:2.1.6   status:normal,-1004223692762353764   net_version:8   rpc_address:10.4.0.169   internal_ip:10.4.0.169   host_id:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186   generation:1440734382   heartbeat:730171   severity:0.0   dc:ap-southeast-2   load:1.1507595081e10   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   rack:2b   status:normal,-10099894685483463   release_version:2.1.6   net_version:8   rpc_address:10.4.2.186   internal_ip:10.4.2.186   host_id:4dc8b155-8f9a-4532-86ec-d958ac207f40

The same command executed in Singapore does not include a status for the nodes in Sydney:

/10.5.12.245   generation:1440974710   heartbeat:1372   severity:0.0   load:1.100835806e10   rpc_address:10.5.12.245   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   status:normal,-1029869455226513030   dc:ap-southeast-1   rack:1b   internal_ip:10.5.12.245   host_id:bee281c9-715b-4134-a033-00479a390f1e
/10.5.13.117   generation:1440974648   heartbeat:1561   severity:0.0   load:1.0149992022e10   rpc_address:10.5.13.117   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   status:normal,-1059943672916788858   dc:ap-southeast-1   rack:1a   host_id:324ee189-3e72-465f-987f-cbc9f7bf740b   internal_ip:10.5.13.117
/10.4.0.112   generation:1440735420   heartbeat:23   severity:0.0   load:9.570546197e9   rpc_address:10.4.0.112   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   dc:ap-southeast-2   rack:2a   internal_ip:10.4.0.112   host_id:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93   generation:1440734532   heartbeat:729862   severity:0.0   load:1.3197536002e10   rpc_address:10.5.13.93   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   status:normal,-1021689296016263011   dc:ap-southeast-1   rack:1a   internal_ip:10.5.13.93   host_id:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169   generation:1440974511   heartbeat:15   severity:0.5076141953468323   rpc_address:10.4.0.169   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   dc:ap-southeast-2   rack:2a   internal_ip:10.4.0.169   host_id:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186   generation:1440734382   heartbeat:15   severity:0.0   rpc_address:10.4.2.186   net_version:8   schema:7bf335ee-61ae-36c6-a902-c70d785ec7a3   release_version:2.1.6   dc:ap-southeast-2   rack:2b   internal_ip:10.4.2.186   host_id:4dc8b155-8f9a-4532-86ec-d958ac207f40

During the restart, each node can see the remote DC for a little while:

info  [gossipstage:1] 2015-08-31 10:53:07,638 outboundtcpconnection.java:97 - outboundtcpconnection using coalescing strategy disabled
info  [handshake-/10.4.2.186] 2015-08-31 10:53:08,267 outboundtcpconnection.java:485 - handshaking version /10.4.2.186
info  [handshake-/10.4.0.169] 2015-08-31 10:53:08,287 outboundtcpconnection.java:485 - handshaking version /10.4.0.169
info  [handshake-/10.5.12.245] 2015-08-31 10:53:08,391 outboundtcpconnection.java:485 - handshaking version /10.5.12.245
info  [handshake-/10.5.13.93] 2015-08-31 10:53:08,498 outboundtcpconnection.java:485 - handshaking version /10.5.13.93
info  [gossipstage:1] 2015-08-31 10:53:08,537 gossiper.java:987 - node /10.5.12.245 has restarted,
info  [handshake-/10.5.13.117] 2015-08-31 10:53:08,537 outboundtcpconnection.java:485 - handshaking version /10.5.13.117
info  [gossipstage:1] 2015-08-31 10:53:08,656 storageservice.java:1642 - node /10.5.12.245 state jump normal
info  [gossipstage:1] 2015-08-31 10:53:08,820 gossiper.java:987 - node /10.5.13.117 has restarted,
info  [gossipstage:1] 2015-08-31 10:53:08,852 gossiper.java:987 - node /10.5.13.93 has restarted,
info  [sharedpool-worker-33] 2015-08-31 10:53:08,907 gossiper.java:954 - inetaddress /10.5.12.245
info  [gossipstage:1] 2015-08-31 10:53:08,947 storageservice.java:1642 - node /10.5.13.93 state jump normal
info  [gossipstage:1] 2015-08-31 10:53:09,007 gossiper.java:987 - node /10.4.0.169 has restarted,
warn  [gossiptasks:1] 2015-08-31 10:53:09,123 failuredetector.java:251 - not marking nodes down due local pause of 7948322997 > 5000000000
info  [gossipstage:1] 2015-08-31 10:53:09,192 storageservice.java:1642 - node /10.4.0.169 state jump normal
info  [handshake-/10.5.12.245] 2015-08-31 10:53:09,199 outboundtcpconnection.java:485 - handshaking version /10.5.12.245
info  [gossipstage:1] 2015-08-31 10:53:09,203 gossiper.java:987 - node /10.4.2.186 has restarted,
info  [gossipstage:1] 2015-08-31 10:53:09,206 storageservice.java:1642 - node /10.4.2.186 state jump normal
info  [sharedpool-worker-34] 2015-08-31 10:53:09,215 gossiper.java:954 - inetaddress /10.5.13.93
info  [sharedpool-worker-33] 2015-08-31 10:53:09,259 gossiper.java:954 - inetaddress /10.5.13.117
info  [sharedpool-worker-33] 2015-08-31 10:53:09,259 gossiper.java:954 - inetaddress /10.4.0.169
info  [sharedpool-worker-33] 2015-08-31 10:53:09,259 gossiper.java:954 - inetaddress /10.4.2.186
info  [gossipstage:1] 2015-08-31 10:53:09,296 storageservice.java:1642 - node /10.4.0.169 state jump normal
info  [gossipstage:1] 2015-08-31 10:53:09,491 storageservice.java:1642 - node /10.5.12.245 state jump normal
info  [handshake-/10.5.13.117] 2015-08-31 10:53:09,509 outboundtcpconnection.java:485 - handshaking version /10.5.13.117
info  [gossipstage:1] 2015-08-31 10:53:09,511 storageservice.java:1642 - node /10.5.13.93 state jump normal
info  [handshake-/10.5.13.93] 2015-08-31 10:53:09,538 outboundtcpconnection.java:485 - handshaking version /10.5.13.93

Then, without any errors being logged, the nodes are marked down:

info  [gossiptasks:1] 2015-08-31 10:53:34,410 gossiper.java:968 - inetaddress /10.5.13.117 down
info  [gossiptasks:1] 2015-08-31 10:53:34,411 gossiper.java:968 - inetaddress /10.5.12.245 down
info  [gossiptasks:1] 2015-08-31 10:53:34,411 gossiper.java:968 - inetaddress /10.5.13.93 down

We have tried multiple restarts, but the behaviour remains the same.

Edit:

It looks to be related to the gossip protocol... turning on debug logging shows the phi values steadily increasing:

trace [gossiptasks:1] 2015-08-31 16:46:44,706 failuredetector.java:262 - phi /10.4.0.112 : 2.9395029255
trace [gossiptasks:1] 2015-08-31 16:46:45,727 failuredetector.java:262 - phi /10.4.0.112 : 3.449690761
trace [gossiptasks:1] 2015-08-31 16:46:46,728 failuredetector.java:262 - phi /10.4.0.112 : 3.95049114
trace [gossiptasks:1] 2015-08-31 16:46:47,730 failuredetector.java:262 - phi /10.4.0.112 : 4.451317456
trace [gossiptasks:1] 2015-08-31 16:46:48,732 failuredetector.java:262 - phi /10.4.0.112 : 4.952114357
trace [gossiptasks:1] 2015-08-31 16:46:49,733 failuredetector.java:262 - phi /10.4.0.112 : 5.4529339645
trace [gossiptasks:1] 2015-08-31 16:46:50,735 failuredetector.java:262 - phi /10.4.0.112 : 5.953951289
trace [gossiptasks:1] 2015-08-31 16:46:51,737 failuredetector.java:262 - phi /10.4.0.112 : 6.4547808165
trace [gossiptasks:1] 2015-08-31 16:46:52,738 failuredetector.java:262 - phi /10.4.0.112 : 6.955600038
trace [gossiptasks:1] 2015-08-31 16:46:53,740 failuredetector.java:262 - phi /10.4.0.112 : 7.456422601
trace [gossiptasks:1] 2015-08-31 16:46:54,742 failuredetector.java:262 - phi /10.4.0.112 : 7.957303284
trace [gossiptasks:1] 2015-08-31 16:46:55,751 failuredetector.java:262 - phi /10.4.0.112 : 8.461658576
trace [gossiptasks:1] 2015-08-31 16:46:56,755 failuredetector.java:262 - phi /10.4.0.112 : 8.9636610545
trace [gossiptasks:1] 2015-08-31 16:46:57,763 failuredetector.java:262 - phi /10.4.0.112 : 9.4676926445

The phi values steadily increase after the restart until they exceed the failure detection threshold, at which point the nodes are marked down.
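For anyone wanting to reproduce the trace above: those phi readings come from the failure detector's TRACE logging. Assuming Cassandra 2.1's nodetool and the stock failure detector class (an assumption on my part, not something stated in the post), it can be enabled at runtime with something like:

    nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector TRACE

Running the same command again with a higher level (e.g. INFO) turns it back off and keeps the log volume down.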

Any suggestions on how to proceed?

For a laggy network, raise the phi failure detection threshold (phi_convict_threshold) to 12 or 15. This is commonly required in AWS, particularly for cross-region clusters.
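A minimal sketch of that change, assuming the stock cassandra.yaml layout (12 here is just one reasonable value from the range suggested above):

    # cassandra.yaml -- set on every node
    # default is 8; a higher value makes the failure detector more tolerant
    # of cross-region latency and pauses before convicting a node as down
    phi_convict_threshold: 12

The setting is read at startup, so it takes effect after a rolling restart of the cluster.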

