Cassandra split-brain partition
We are running a 6-node Cassandra cluster across two AWS regions (ap-southeast-1 and ap-southeast-2).
After running happily for several months, the cluster was given a rolling restart to fix a hung repair, and now each side thinks the other is down. 'nodetool describecluster' from each side shows that the schema versions have diverged and that the other data centre's nodes are unreachable:
Cluster Information:
    Name: megaportglobal
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        220727fa-88d2-366f-9473-777e32744c37: [10.5.13.117, 10.5.12.245, 10.5.13.93]
        UNREACHABLE: [10.4.0.112, 10.4.0.169, 10.4.2.186]

Cluster Information:
    Name: megaportglobal
    Snitch: org.apache.cassandra.locator.DynamicEndpointSnitch
    Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
    Schema versions:
        3932d237-b907-3ef8-95bc-4276dc7f32e6: [10.4.0.112, 10.4.0.169, 10.4.2.186]
        UNREACHABLE: [10.5.13.117, 10.5.12.245, 10.5.13.93]
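The schema split can also be cross-checked per node from the system tables. A minimal sketch, assuming cqlsh is on the path, the node accepts unauthenticated connections, and cqlsh's -e option is available (the target IP is just one of the Sydney nodes above):

# Run against one node on each side; every peer should report the same schema_version.
cqlsh 10.4.0.112 -e "SELECT schema_version FROM system.local;"
cqlsh 10.4.0.112 -e "SELECT peer, schema_version FROM system.peers;"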
From Sydney, 'nodetool status' reports the Singapore nodes as down:
Datacenter: ap-southeast-2
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns  Host ID                               Rack
UN  10.4.0.112   9.04 GB   256     ?     b9c19de4-4939-4112-bf07-d136d8a57b57  2a
UN  10.4.0.169   9.34 GB   256     ?     2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
UN  10.4.2.186   10.72 GB  256     ?     4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns  Host ID                               Rack
UN  10.5.13.117  9.45 GB   256     ?     324ee189-3e72-465f-987f-cbc9f7bf740b  1a
DN  10.5.12.245  10.25 GB  256     ?     bee281c9-715b-4134-a033-00479a390f1e  1b
DN  10.5.13.93   12.29 GB  256     ?     a8262244-91bb-458f-9603-f8c8fe455924  1a
But from Singapore, it is the Sydney nodes that are reported down:
Datacenter: ap-southeast-2
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns  Host ID                               Rack
DN  10.4.0.112   8.91 GB   256     ?     b9c19de4-4939-4112-bf07-d136d8a57b57  2a
DN  10.4.0.169   ?         256     ?     2d7c3ac4-ae94-43d6-9afe-7d421c06b951  2a
DN  10.4.2.186   ?         256     ?     4dc8b155-8f9a-4532-86ec-d958ac207f40  2b
Datacenter: ap-southeast-1
==========================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address      Load      Tokens  Owns  Host ID                               Rack
UN  10.5.13.117  9.45 GB   256     ?     324ee189-3e72-465f-987f-cbc9f7bf740b  1a
UN  10.5.12.245  10.25 GB  256     ?     bee281c9-715b-4134-a033-00479a390f1e  1b
UN  10.5.13.93   12.29 GB  256     ?     a8262244-91bb-458f-9603-f8c8fe455924  1a
Even more confusingly, 'nodetool gossipinfo' executed in Sydney reports a status of NORMAL for every node:
/10.5.13.117
  generation:1440735653
  heartbeat:724504
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.0149565738E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1a
  STATUS:NORMAL,-1059943672916788858
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.13.117
  INTERNAL_IP:10.5.13.117
  HOST_ID:324ee189-3e72-465f-987f-cbc9f7bf740b
/10.5.12.245
  generation:1440734497
  heartbeat:728014
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.100647505E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1b
  STATUS:NORMAL,-1029869455226513030
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.12.245
  INTERNAL_IP:10.5.12.245
  HOST_ID:bee281c9-715b-4134-a033-00479a390f1e
/10.4.0.112
  generation:1440973751
  heartbeat:4135
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:9.70297176E9
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2a
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1016623069114845926
  NET_VERSION:8
  RPC_ADDRESS:10.4.0.112
  INTERNAL_IP:10.4.0.112
  HOST_ID:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93
  generation:1440734532
  heartbeat:727909
  SEVERITY:0.0
  DC:ap-southeast-1
  LOAD:1.3197536002E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:1a
  STATUS:NORMAL,-1021689296016263011
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.5.13.93
  INTERNAL_IP:10.5.13.93
  HOST_ID:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169
  generation:1440974511
  heartbeat:1832
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:1.0023502338E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2a
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1004223692762353764
  NET_VERSION:8
  RPC_ADDRESS:10.4.0.169
  INTERNAL_IP:10.4.0.169
  HOST_ID:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186
  generation:1440734382
  heartbeat:730171
  SEVERITY:0.0
  DC:ap-southeast-2
  LOAD:1.1507595081E10
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RACK:2b
  STATUS:NORMAL,-10099894685483463
  RELEASE_VERSION:2.1.6
  NET_VERSION:8
  RPC_ADDRESS:10.4.2.186
  INTERNAL_IP:10.4.2.186
  HOST_ID:4dc8b155-8f9a-4532-86ec-d958ac207f40
The same command executed in Singapore does not include a STATUS entry at all for the nodes in Sydney:
/10.5.12.245
  generation:1440974710
  heartbeat:1372
  SEVERITY:0.0
  LOAD:1.100835806E10
  RPC_ADDRESS:10.5.12.245
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1029869455226513030
  DC:ap-southeast-1
  RACK:1b
  INTERNAL_IP:10.5.12.245
  HOST_ID:bee281c9-715b-4134-a033-00479a390f1e
/10.5.13.117
  generation:1440974648
  heartbeat:1561
  SEVERITY:0.0
  LOAD:1.0149992022E10
  RPC_ADDRESS:10.5.13.117
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1059943672916788858
  DC:ap-southeast-1
  RACK:1a
  HOST_ID:324ee189-3e72-465f-987f-cbc9f7bf740b
  INTERNAL_IP:10.5.13.117
/10.4.0.112
  generation:1440735420
  heartbeat:23
  SEVERITY:0.0
  LOAD:9.570546197E9
  RPC_ADDRESS:10.4.0.112
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2a
  INTERNAL_IP:10.4.0.112
  HOST_ID:b9c19de4-4939-4112-bf07-d136d8a57b57
/10.5.13.93
  generation:1440734532
  heartbeat:729862
  SEVERITY:0.0
  LOAD:1.3197536002E10
  RPC_ADDRESS:10.5.13.93
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  STATUS:NORMAL,-1021689296016263011
  DC:ap-southeast-1
  RACK:1a
  INTERNAL_IP:10.5.13.93
  HOST_ID:a8262244-91bb-458f-9603-f8c8fe455924
/10.4.0.169
  generation:1440974511
  heartbeat:15
  SEVERITY:0.5076141953468323
  RPC_ADDRESS:10.4.0.169
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2a
  INTERNAL_IP:10.4.0.169
  HOST_ID:2d7c3ac4-ae94-43d6-9afe-7d421c06b951
/10.4.2.186
  generation:1440734382
  heartbeat:15
  SEVERITY:0.0
  RPC_ADDRESS:10.4.2.186
  NET_VERSION:8
  SCHEMA:7bf335ee-61ae-36c6-a902-c70d785ec7a3
  RELEASE_VERSION:2.1.6
  DC:ap-southeast-2
  RACK:2b
  INTERNAL_IP:10.4.2.186
  HOST_ID:4dc8b155-8f9a-4532-86ec-d958ac207f40
During the restart, each node can see the remote DC for a little while:
INFO  [GossipStage:1] 2015-08-31 10:53:07,638 OutboundTcpConnection.java:97 - OutboundTcpConnection using coalescing strategy DISABLED
INFO  [HANDSHAKE-/10.4.2.186] 2015-08-31 10:53:08,267 OutboundTcpConnection.java:485 - Handshaking version with /10.4.2.186
INFO  [HANDSHAKE-/10.4.0.169] 2015-08-31 10:53:08,287 OutboundTcpConnection.java:485 - Handshaking version with /10.4.0.169
INFO  [HANDSHAKE-/10.5.12.245] 2015-08-31 10:53:08,391 OutboundTcpConnection.java:485 - Handshaking version with /10.5.12.245
INFO  [HANDSHAKE-/10.5.13.93] 2015-08-31 10:53:08,498 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.93
INFO  [GossipStage:1] 2015-08-31 10:53:08,537 Gossiper.java:987 - Node /10.5.12.245 has restarted, now UP
INFO  [HANDSHAKE-/10.5.13.117] 2015-08-31 10:53:08,537 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.117
INFO  [GossipStage:1] 2015-08-31 10:53:08,656 StorageService.java:1642 - Node /10.5.12.245 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:08,820 Gossiper.java:987 - Node /10.5.13.117 has restarted, now UP
INFO  [GossipStage:1] 2015-08-31 10:53:08,852 Gossiper.java:987 - Node /10.5.13.93 has restarted, now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:08,907 Gossiper.java:954 - InetAddress /10.5.12.245 is now UP
INFO  [GossipStage:1] 2015-08-31 10:53:08,947 StorageService.java:1642 - Node /10.5.13.93 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:09,007 Gossiper.java:987 - Node /10.4.0.169 has restarted, now UP
WARN  [GossipTasks:1] 2015-08-31 10:53:09,123 FailureDetector.java:251 - Not marking nodes down due to local pause of 7948322997 > 5000000000
INFO  [GossipStage:1] 2015-08-31 10:53:09,192 StorageService.java:1642 - Node /10.4.0.169 state jump to normal
INFO  [HANDSHAKE-/10.5.12.245] 2015-08-31 10:53:09,199 OutboundTcpConnection.java:485 - Handshaking version with /10.5.12.245
INFO  [GossipStage:1] 2015-08-31 10:53:09,203 Gossiper.java:987 - Node /10.4.2.186 has restarted, now UP
INFO  [GossipStage:1] 2015-08-31 10:53:09,206 StorageService.java:1642 - Node /10.4.2.186 state jump to normal
INFO  [SharedPool-Worker-34] 2015-08-31 10:53:09,215 Gossiper.java:954 - InetAddress /10.5.13.93 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.5.13.117 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.4.0.169 is now UP
INFO  [SharedPool-Worker-33] 2015-08-31 10:53:09,259 Gossiper.java:954 - InetAddress /10.4.2.186 is now UP
INFO  [GossipStage:1] 2015-08-31 10:53:09,296 StorageService.java:1642 - Node /10.4.0.169 state jump to normal
INFO  [GossipStage:1] 2015-08-31 10:53:09,491 StorageService.java:1642 - Node /10.5.12.245 state jump to normal
INFO  [HANDSHAKE-/10.5.13.117] 2015-08-31 10:53:09,509 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.117
INFO  [GossipStage:1] 2015-08-31 10:53:09,511 StorageService.java:1642 - Node /10.5.13.93 state jump to normal
INFO  [HANDSHAKE-/10.5.13.93] 2015-08-31 10:53:09,538 OutboundTcpConnection.java:485 - Handshaking version with /10.5.13.93
Then, without any errors being logged, the remote nodes are simply marked down:
INFO  [GossipTasks:1] 2015-08-31 10:53:34,410 Gossiper.java:968 - InetAddress /10.5.13.117 is now DOWN
INFO  [GossipTasks:1] 2015-08-31 10:53:34,411 Gossiper.java:968 - InetAddress /10.5.12.245 is now DOWN
INFO  [GossipTasks:1] 2015-08-31 10:53:34,411 Gossiper.java:968 - InetAddress /10.5.13.93 is now DOWN
We have tried multiple restarts; the behaviour remains the same.
*EDIT*
It looks related to the gossip protocol. Turning on debug logging shows the phi value for each remote node steadily increasing:
TRACE [GossipTasks:1] 2015-08-31 16:46:44,706 FailureDetector.java:262 - PHI for /10.4.0.112 : 2.9395029255
TRACE [GossipTasks:1] 2015-08-31 16:46:45,727 FailureDetector.java:262 - PHI for /10.4.0.112 : 3.449690761
TRACE [GossipTasks:1] 2015-08-31 16:46:46,728 FailureDetector.java:262 - PHI for /10.4.0.112 : 3.95049114
TRACE [GossipTasks:1] 2015-08-31 16:46:47,730 FailureDetector.java:262 - PHI for /10.4.0.112 : 4.451317456
TRACE [GossipTasks:1] 2015-08-31 16:46:48,732 FailureDetector.java:262 - PHI for /10.4.0.112 : 4.952114357
TRACE [GossipTasks:1] 2015-08-31 16:46:49,733 FailureDetector.java:262 - PHI for /10.4.0.112 : 5.4529339645
TRACE [GossipTasks:1] 2015-08-31 16:46:50,735 FailureDetector.java:262 - PHI for /10.4.0.112 : 5.953951289
TRACE [GossipTasks:1] 2015-08-31 16:46:51,737 FailureDetector.java:262 - PHI for /10.4.0.112 : 6.4547808165
TRACE [GossipTasks:1] 2015-08-31 16:46:52,738 FailureDetector.java:262 - PHI for /10.4.0.112 : 6.955600038
TRACE [GossipTasks:1] 2015-08-31 16:46:53,740 FailureDetector.java:262 - PHI for /10.4.0.112 : 7.456422601
TRACE [GossipTasks:1] 2015-08-31 16:46:54,742 FailureDetector.java:262 - PHI for /10.4.0.112 : 7.957303284
TRACE [GossipTasks:1] 2015-08-31 16:46:55,751 FailureDetector.java:262 - PHI for /10.4.0.112 : 8.461658576
TRACE [GossipTasks:1] 2015-08-31 16:46:56,755 FailureDetector.java:262 - PHI for /10.4.0.112 : 8.9636610545
TRACE [GossipTasks:1] 2015-08-31 16:46:57,763 FailureDetector.java:262 - PHI for /10.4.0.112 : 9.4676926445
The phi values keep increasing after the restart until they exceed the failure detection threshold, at which point the remote node is marked down.
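For anyone wanting to reproduce this, the phi tracing can be toggled at runtime with nodetool; a sketch, assuming the stock org.apache.cassandra.gms.FailureDetector logger name in 2.1:

# Enable phi tracing on the failure detector (reverts to the logback default on restart)
nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector TRACE
# Turn it back down once the trace has been captured
nodetool setlogginglevel org.apache.cassandra.gms.FailureDetector INFO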
Any suggestions on how to proceed?
For a laggy network, raise the phi failure detection threshold (phi_convict_threshold in cassandra.yaml) to 12 or 15. This is commonly required in AWS, especially cross-region.
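A minimal sketch of that change, assuming the default config location (often /etc/cassandra/cassandra.yaml on package installs); the value is read at startup, so restart each node after editing:

# cassandra.yaml, on every node in both DCs (the default is 8)
# Higher values make the failure detector tolerate longer gossip gaps before convicting a node.
phi_convict_threshold: 12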