
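Data Migration High-Availability Drill

Haaahei, published 2022-03-24

To make sure DM can run stably in production, we planned a drill of its high-availability mechanisms. The drill covers the following items: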

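Item: dm-worker HA
Verification: dm-worker failure
  • Whether the replication task is transferred to another dm-worker
  • Replication task condition (latency, status, etc.)
  • After the failed dm-worker comes back up, whether it starts automatically and rejoins the cluster
Steps: see below
Conclusion: see below

Item: dm-master HA
Verification: dm-master leader failure
  • Whether a new leader is elected normally
  • Replication task condition during the election (latency, status, etc.)
  • After the machine hosting dm-master comes back up, whether dm-master starts automatically and rejoins the cluster
Steps: see below
Conclusion: see below

Item: Rolling upgrade
Verification: upgrade DM to v2.0.6
  • Whether a new leader is elected normally
  • Replication task condition
Steps: see below
Conclusion: see below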

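Steps and Conclusions

dm-worker HA

  1. Simulate a dm-worker failure

    date; kill -9 <pid>; mv <deploy dir> <deploy dir>-1  # force-kill the dm-worker process, then rename the deploy directory so it cannot restart automatically
    
  2. Observe how the task is switched over

  3. Record the relevant data: switchover time, task status, latency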
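
Steps 2 and 3 can be scripted rather than watched by hand. The following is a minimal Python sketch, not part of the original drill: it polls a status command and records a timestamped entry each time the (source, worker, stage) summary changes. The JSON keys (`sources`, `sourceStatus`, `subTaskStatus`, `stage`) and a `dmctl` binary on PATH are assumptions about the query-status output of your DM version, and `summarize`, `watch`, and `dmctl_fetch` are hypothetical helper names.

```python
import json
import subprocess
import time
from datetime import datetime


def summarize(status_text):
    """Reduce query-status JSON to (source, worker, stage) tuples.

    Assumption: the output is a JSON object with sources[].sourceStatus and
    sources[].subTaskStatus[] entries; adjust the keys if yours differs.
    """
    doc = json.loads(status_text)
    rows = []
    for src in doc.get("sources") or []:
        meta = src.get("sourceStatus") or {}
        for sub in src.get("subTaskStatus") or []:
            rows.append((meta.get("source"), meta.get("worker"), sub.get("stage")))
    return rows


def watch(fetch, polls, interval_s=1.0, sleep=time.sleep, now=datetime.now):
    """Poll fetch() and record a timestamped entry whenever the summary changes."""
    history, last = [], None
    for _ in range(polls):
        try:
            current = summarize(fetch())
        except (ValueError, OSError, subprocess.SubprocessError) as exc:
            # Unparseable output or a failed command is itself worth recording.
            current = [("error", type(exc).__name__, None)]
        if current != last:
            history.append((now(), current))
            last = current
        sleep(interval_s)
    return history


def dmctl_fetch(master_addr, task):
    """Hypothetical wrapper: assumes dmctl is on PATH and prints JSON to stdout."""
    def fetch():
        done = subprocess.run(
            ["dmctl", "--master-addr", master_addr, "query-status", task],
            capture_output=True, text=True, timeout=30, check=True,
        )
        return done.stdout
    return fetch
```

For example, `watch(dmctl_fetch("<master addr>", "dm-mysql_report"), polls=180)` would sample once per second for three minutes; the timestamps of the recorded entries bracket the switchover.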

Conclusions:

  • Whether the replication task is transferred

    [2021/08/17 13:28:04.712 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    ...
    [2021/08/17 13:28:51.576 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:28:54.876 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:28:57.913 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:28:58.159 +08:00] [INFO] [keepalive.go:216] ["receive dm-worker keep alive event"] [operation=DELETE] [kv=/dm-worker/a/646d2d3137322e31372e3230312e3131352d38323632]
    [2021/08/17 13:28:58.163 +08:00] [INFO] [scheduler.go:1506] ["receive worker status change event"] [component=scheduler] [delete=true] [event="{\"worker-name\":\"dm-172.17.201.115-8262\",\"join-time\":\"0001-01-01T00:00:00Z\"}"]
    [2021/08/17 13:28:58.165 +08:00] [INFO] [scheduler.go:1662] ["unbound the worker for source"] [component=scheduler] [bound="{\"source\":\"ds-mysql_report\",\"worker\":\"dm-172.17.201.115-8262\"}"] [event="{\"worker-name\":\"dm-172.17.201.115-8262\",\"join-time\":\"0001-01-01T00:00:00Z\"}"]
    [2021/08/17 13:28:58.165 +08:00] [INFO] [scheduler.go:1838] ["found free worker when source bound"] [component=scheduler] [worker=dm-172.18.78.254-8265] [source=ds-mysql_report]
    [2021/08/17 13:28:58.168 +08:00] [INFO] [scheduler.go:1876] ["bound the source to worker"] [component=scheduler] [bound="{\"source\":\"ds-mysql_report\",\"worker\":\"dm-172.18.78.254-8265\"}"]
    
  • After roughly 60s, the new dm-worker successfully took over the replication task, and query-status showed the replication status as normal

  • Replication task condition

    [2021/08/17 13:28:58.168 +08:00] [INFO] [server.go:581] ["receive source bound"] [bound="{\"source\":\"ds-mysql_report\",\"worker\":\"dm-172.18.78.254-8265\"}"] ["is deleted"=false] [2021/08/17 13:28:58.170 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 13:28:58.170 +08:00] [INFO] [server.go:836] ["will start a new worker"] [sourceID=ds-mysql_report] [2021/08/17 13:28:58.170 +08:00] [INFO] [worker.go:120] [initialized] [component="worker controller"] [cfg="{\"enable-gtid\":true,\"auto-fix-gtid\":false,\"relay-dir\":\"relay-dir\",\"meta-dir\":\"\",\"flavor\":\"mysql\",\"charset\":\"\",\"enable-relay\":false,\"relay-binlog-name\":\"\",\"relay-binlog-gtid\":\"\",\"source-id\":\"ds-mysql_report\",\"from\":{\"host\":\"172.16.150.53\",\"port\":15381,\"user\":\"dm_sync\",\"max-allowed-packet\":null,\"session\":{\"time_zone\":\"+00:00\"},\"security\":null},\"purge\":{\"interval\":3600,\"expires\":0,\"remain-space\":15},\"checker\":{\"check-enable\":true,\"backoff-rollback\":{\"Duration\":\"5m0s\"},\"backoff-max\":{\"Duration\":\"5m0s\"}},\"server-id\":429548349,\"case-sensitive\":false,\"filters\":null}"] [2021/08/17 13:28:58.170 +08:00] [INFO] [worker.go:135] ["start running"] [component="worker controller"] [2021/08/17 13:28:58.270 +08:00] [INFO] [worker.go:310] ["enter EnableHandleSubtasks"] [component="worker controller"] [2021/08/17 13:28:58.272 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 13:28:58.272 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 13:28:58.273 +08:00] [INFO] [worker.go:326] ["starting to handle mysql source"] [component="worker controller"] 
[sourceCfg="{\"enable-gtid\":true,\"auto-fix-gtid\":false,\"relay-dir\":\"relay-dir\",\"meta-dir\":\"\",\"flavor\":\"mysql\",\"charset\":\"\",\"enable-relay\":false,\"relay-binlog-name\":\"\",\"relay-binlog-gtid\":\"\",\"source-id\":\"ds-mysql_report\",\"from\":{\"host\":\"172.16.150.53\",\"port\":15381,\"user\":\"dm_sync\",\"max-allowed-packet\":null,\"session\":{\"time_zone\":\"+00:00\"},\"security\":null},\"purge\":{\"interval\":3600,\"expires\":0,\"remain-space\":15},\"checker\":{\"check-enable\":true,\"backoff-rollback\":{\"Duration\":\"5m0s\"},\"backoff-max\":{\"Duration\":\"5m0s\"}},\"server-id\":429548349,\"case-sensitive\":false,\"filters\":null}"] [subTasks="{\"dm-mysql_report\":{\"is-sharding\":false,\"shard-mode\":\"\",\"online-ddl-scheme\":\"gh-ost\",\"case-sensitive\":false,\"name\":\"dm-mysql_report\",\"mode\":\"incremental\",\"ignore-checking-items\":[\"dump_privilege\"],\"source-id\":\"ds-mysql_report\",\"server-id\":429548349,\"flavor\":\"mysql\",\"meta-schema\":\"dm_meta\",\"heartbeat-update-interval\":1,\"heartbeat-report-interval\":10,\"enable-heartbeat\":false,\"meta\":{\"BinLogName\":\"\",\"BinLogPos\":0,\"BinLogGTID\":\"34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-168290280,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207\"},\"timezone\":\"\",\"relay-dir\":\"relay-dir\",\"use-relay\":false,\"from\":{\"host\":\"172.16.150.53\",\"port\":15381,\"user\":\"dm_sync\",\"max-allowed-packet\":null,\"session\":{\"time_zone\":\"+00:00\"},\"security\":null},\"to\":{\"host\":\"172.21.35.233\",\"port\":15381,\"user\":\"dm_load\",\"max-allowed-packet\":null,\"session\":{\"tidb_txn_mode\":\"optimistic\",\"time_zone\":\"+00:00\"},\"security\":null},\"route-rules\":[{\"schema-pattern\":\"revers
e_flow\",\"table-pattern\":\"\",\"target-schema\":\"reverse_center\",\"target-table\":\"\"}],\"filter-rules\":[],\"mapping-rule\":[],\"black-white-list\":null,\"block-allow-list\":{\"do-tables\":[{\"db-name\":\"reverse_flow\",\"tbl-name\":\"rc_reverse_record_integration\"}],\"do-dbs\":[\"reverse_flow\"],\"ignore-tables\":null,\"ignore-dbs\":null},\"mydumper-path\":\"./bin/mydumper\",\"threads\":1,\"chunk-filesize\":\"64\",\"statement-size\":0,\"rows\":1000,\"where\":\"\",\"skip-tz-utc\":true,\"extra-args\":\"--consistency none\",\"pool-size\":8,\"dir\":\"./dm-mysql_report.dm-mysql_report\",\"meta-file\":\"\",\"worker-count\":128,\"batch\":100,\"queue-size\":1024,\"checkpoint-flush-interval\":30,\"max-retry\":0,\"auto-fix-gtid\":false,\"enable-gtid\":true,\"disable-detect\":false,\"safe-mode\":false,\"enable-ansi-quotes\":false,\"log-level\":\"\",\"log-file\":\"\",\"log-format\":\"\",\"log-rotate\":\"\",\"pprof-addr\":\"\",\"status-addr\":\"\",\"config-file\":\"\",\"clean-dump-file\":false,\"ansi-quotes\":false}}"] [2021/08/17 13:28:58.273 +08:00] [INFO] [worker.go:333] ["start to create subtask"] [component="worker controller"] [sourceID=ds-mysql_report] [task=dm-mysql_report] [2021/08/17 13:28:58.273 +08:00] [INFO] [worker.go:426] ["subtask created"] [component="worker controller"] 
[config="{\"is-sharding\":false,\"shard-mode\":\"\",\"online-ddl-scheme\":\"gh-ost\",\"case-sensitive\":false,\"name\":\"dm-mysql_report\",\"mode\":\"incremental\",\"ignore-checking-items\":[\"dump_privilege\"],\"source-id\":\"ds-mysql_report\",\"server-id\":429548349,\"flavor\":\"mysql\",\"meta-schema\":\"dm_meta\",\"heartbeat-update-interval\":1,\"heartbeat-report-interval\":10,\"enable-heartbeat\":false,\"meta\":{\"BinLogName\":\"\",\"BinLogPos\":0,\"BinLogGTID\":\"34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-168290280,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207\"},\"timezone\":\"\",\"relay-dir\":\"relay-dir\",\"use-relay\":false,\"from\":{\"host\":\"172.16.150.53\",\"port\":15381,\"user\":\"dm_sync\",\"max-allowed-packet\":null,\"session\":{\"time_zone\":\"+00:00\"},\"security\":null},\"to\":{\"host\":\"172.21.35.233\",\"port\":15381,\"user\":\"dm_load\",\"max-allowed-packet\":null,\"session\":{\"tidb_txn_mode\":\"optimistic\",\"time_zone\":\"+00:00\"},\"security\":null},\"route-rules\":[{\"schema-pattern\":\"reverse_flow\",\"table-pattern\":\"\",\"target-schema\":\"reverse_center\",\"target-table\":\"\"}],\"filter-rules\":[],\"mapping-rule\":[],\"black-white-list\":null,\"block-allow-list\":{\"do-tables\":[{\"db-name\":\"reverse_flow\",\"tbl-name\":\"rc_reverse_record_integration\"}],\"do-dbs\":[\"reverse_flow\"],\"ignore-tables\":null,\"ignore-dbs\":null},\"mydumper-path\":\"./bin/mydumper\",\"threads\":1,\"chunk-filesize\":\"64\",\"statement-size\":0,\"rows\":1000,\"where\":\"\",\"skip-tz-utc\":true,\"extra-args\":\"--consistency 
none\",\"pool-size\":8,\"dir\":\"./dm-mysql_report.dm-mysql_report\",\"meta-file\":\"\",\"worker-count\":128,\"batch\":100,\"queue-size\":1024,\"checkpoint-flush-interval\":30,\"max-retry\":0,\"auto-fix-gtid\":false,\"enable-gtid\":true,\"disable-detect\":false,\"safe-mode\":false,\"enable-ansi-quotes\":false,\"log-level\":\"\",\"log-file\":\"\",\"log-format\":\"\",\"log-rotate\":\"\",\"pprof-addr\":\"\",\"status-addr\":\"\",\"config-file\":\"\",\"clean-dump-file\":false,\"ansi-quotes\":false}"] [2021/08/17 13:28:58.273 +08:00] [INFO] [syncer.go:3024] ["use timezone"] [task=dm-mysql_report] [unit="binlog replication"] [location=UTC] [2021/08/17 13:28:58.891 +08:00] [INFO] [config.go:599] ["detect server type"] [task=dm-mysql_report] [unit="binlog replication"] [scope=upstream] [type=MySQL] [2021/08/17 13:28:58.891 +08:00] [INFO] [config.go:618] ["detect server version"] [task=dm-mysql_report] [unit="binlog replication"] [scope=upstream] [version=5.7.20-log] [2021/08/17 13:28:58.894 +08:00] [INFO] [config.go:599] ["detect server type"] [task=dm-mysql_report] [unit="binlog replication"] [scope=downstream] [type=TiDB] [2021/08/17 13:28:58.894 +08:00] [INFO] [config.go:618] ["detect server version"] [task=dm-mysql_report] [unit="binlog replication"] [scope=downstream] [version=4.0.13] [2021/08/17 13:28:59.422 +08:00] [INFO] [checkpoint.go:699] ["create checkpoint schema"] [task=dm-mysql_report] [unit="binlog replication"] [component="remote checkpoint"] [statement="CREATE SCHEMA IF NOT EXISTS `dm_meta`"] [2021/08/17 13:28:59.426 +08:00] [INFO] [checkpoint.go:723] ["create checkpoint table"] [task=dm-mysql_report] [unit="binlog replication"] [component="remote checkpoint"] [statements="[\"CREATE TABLE IF NOT EXISTS `dm_meta`.`dm-mysql_report_syncer_checkpoint` (\\n\\t\\t\\tid VARCHAR(32) NOT NULL,\\n\\t\\t\\tcp_schema VARCHAR(128) NOT NULL,\\n\\t\\t\\tcp_table VARCHAR(128) NOT NULL,\\n\\t\\t\\tbinlog_name VARCHAR(128),\\n\\t\\t\\tbinlog_pos INT 
UNSIGNED,\\n\\t\\t\\tbinlog_gtid TEXT,\\n\\t\\t\\texit_safe_binlog_name VARCHAR(128) DEFAULT '',\\n\\t\\t\\texit_safe_binlog_pos INT UNSIGNED DEFAULT 0,\\n\\t\\t\\texit_safe_binlog_gtid TEXT,\\n\\t\\t\\ttable_info JSON NOT NULL,\\n\\t\\t\\tis_global BOOLEAN,\\n\\t\\t\\tcreate_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,\\n\\t\\t\\tupdate_time timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,\\n\\t\\t\\tUNIQUE KEY uk_id_schema_table (id, cp_schema, cp_table)\\n\\t\\t)\"]"] [2021/08/17 13:28:59.429 +08:00] [INFO] [checkpoint.go:785] ["fetch global checkpoint from DB"] [task=dm-mysql_report] [unit="binlog replication"] [component="remote checkpoint"] ["global checkpoint"="position: (mysql-bin.001906, 820109405), gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207(flushed position: (mysql-bin.001906, 820109405), gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207)"] [2021/08/17 13:28:59.431 +08:00] [INFO] [subtask.go:226] ["start to run"] [subtask=dm-mysql_report] [unit=Sync] [2021/08/17 13:28:59.431 +08:00] [INFO] [worker.go:351] ["handling subtask enabled"] [component="worker controller"] [2021/08/17 13:28:59.432 +08:00] [INFO] [syncer.go:1342] ["replicate binlog from checkpoint"] [task=dm-mysql_report] [unit="binlog replication"] [checkpoint="position: (mysql-bin.001906, 820109405), 
gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207"] [2021/08/17 13:28:59.440 +08:00] [INFO] [streamer_controller.go:72] ["last slave connection"] [task=dm-mysql_report] [unit="binlog replication"] ["connection ID"=31610609] [2021/08/17 13:28:59.440 +08:00] [INFO] [mode.go:100] ["change count"] [task=dm-mysql_report] [unit="binlog replication"] ["previous count"=0] ["new count"=0] [2021/08/17 13:28:59.440 +08:00] [INFO] [mode.go:100] ["change count"] [task=dm-mysql_report] [unit="binlog replication"] ["previous count"=0] ["new count"=1] [2021/08/17 13:28:59.440 +08:00] [INFO] [mode.go:59] ["enable safe-mode because of task initialization"] [task=dm-mysql_report] [unit="binlog replication"] ["duration in seconds"=60] [2021/08/17 13:29:00.075 +08:00] [INFO] [syncer.go:1690] ["meet heartbeat event and then flush jobs"] [task=dm-mysql_report] [unit="binlog replication"] [2021/08/17 13:29:00.075 +08:00] [INFO] [syncer.go:2746] ["flush all jobs"] [task=dm-mysql_report] [unit="binlog replication"] ["global checkpoint"="position: (mysql-bin.001906, 820109405), gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207(flushed position: (mysql-bin.001906, 820109405), gtid-set: 
34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207)"] [2021/08/17 13:29:00.080 +08:00] [INFO] [syncer.go:1003] ["flushed checkpoint"] [task=dm-mysql_report] [unit="binlog replication"] [checkpoint="position: (mysql-bin.001906, 820109405), gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207(flushed position: (mysql-bin.001906, 820109405), gtid-set: 34474b1e-2bf3-11e8-8515-00163e1040fb:1-1278385,767d5889-e08e-11ea-bf83-00163e0e3732:1,7fbb40a3-8240-11eb-8cda-00163e17fb0e:1-344070205,803ffea1-7b9d-11e9-87b0-00163e0e3732:1-435424493,a2d27a7f-de3c-11e7-82cd-00163e1040fb:1-304056,a33473fb-de3c-11e7-8140-00163e0e6470:1-68232043,bfbebe4d-1582-11e9-8e63-00163e082a23:1-36011466,e033b7c4-7b9d-11e9-8e45-00163e097eeb:1-608218207)"] [2021/08/17 13:29:13.098 +08:00] [INFO] [server.go:753] [request=QueryStatus] [payload="name:\"dm-mysql_report\" "] [2021/08/17 13:29:13.098 +08:00] [INFO] [worker.go:509] ["will open a connection to get master status"] [component="worker controller"] ["upstream config"="{\"host\":\"172.16.150.53\",\"port\":15381,\"user\":\"dm_sync\",\"max-allowed-packet\":null,\"session\":{\"time_zone\":\"+00:00\"},\"security\":null}"] [2021/08/17 13:29:29.443 +08:00] [INFO] [syncer.go:2627] ["binlog replication progress"] [task=dm-mysql_report] [unit="binlog replication"] ["total binlog size"=12632410] ["last binlog 
size"=0] ["cost time"=30] [bytes/Second=421080] ["unsynced binlog size"=0] ["estimate time to catch up"=0]
    
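  • After the new dm-worker took over, the replication task ran normally. Because the switchover takes about 60s, replication lag during the failover is at least about 60s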

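  • After the failed dm-worker comes back up, whether it starts automatically and rejoins the cluster: it rejoins the cluster automatically. While it is down, the dm-master leader keeps retrying the connection to the failed dm-worker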

    [2021/08/17 13:30:28.796 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:30:31.625 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:30:35.190 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8262  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8262: connect: connection refused\". Reconnecting..."] [component="embed etcd"]
    [2021/08/17 13:30:37.523 +08:00] [INFO] [server.go:2206] [payload="name:\"dm-172.17.201.115-8262\" address:\"172.17.201.115:8262\" "] [request=RegisterWorker]
    [2021/08/17 13:30:37.523 +08:00] [WARN] [scheduler.go:836] ["add the same worker again"] [component=scheduler] ["worker info"="{\"name\":\"dm-172.17.201.115-8262\",\"addr\":\"172.17.201.115:8262\"}"]
    [2021/08/17 13:30:37.523 +08:00] [INFO] [server.go:309] ["register worker successfully"] [name=dm-172.17.201.115-8262] [address=172.17.201.115:8262]
    [2021/08/17 13:30:37.529 +08:00] [INFO] [keepalive.go:216] ["receive dm-worker keep alive event"] [operation=PUT] [kv=/dm-worker/a/646d2d3137322e31372e3230312e3131352d38323632]
    [2021/08/17 13:30:37.529 +08:00] [INFO] [scheduler.go:1506] ["receive worker status change event"] [component=scheduler] [delete=false] [event="{\"worker-name\":\"dm-172.17.201.115-8262\",\"join-time\":\"2021-08-17T13:30:37.524837339+08:00\"}"]
    [2021/08/17 13:30:37.529 +08:00] [INFO] [scheduler.go:1739] ["no unbound sources need to bound"] [component=scheduler] [worker="{\"name\":\"dm-172.17.201.115-8262\",\"addr\":\"172.17.201.115:8262\"}"]
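
The durations quoted in the dm-worker conclusions can be read straight off the log timestamps above. Here is a small Python sketch that does the arithmetic. The timestamps are copied from the log excerpts in this section and the labels are illustrative. The earliest "connection refused" entry shown stands in for the kill time, because the output of `date` is not recorded here, so the real figures are somewhat larger.

```python
from datetime import datetime

# Layout of the timestamp at the start of each DM log entry.
TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f %z"

# Timestamps copied from the log excerpts in this section; labels are illustrative.
EVENTS = {
    "first connection refused": "2021/08/17 13:28:04.712 +08:00",
    "keep-alive DELETE for old worker": "2021/08/17 13:28:58.159 +08:00",
    "source bound to new worker": "2021/08/17 13:28:58.168 +08:00",
    "subtask start to run": "2021/08/17 13:28:59.431 +08:00",
    "old worker registered again": "2021/08/17 13:30:37.523 +08:00",
}


def parse_ts(text):
    """Parse a timestamp such as '2021/08/17 13:28:04.712 +08:00'."""
    return datetime.strptime(text, TS_FORMAT)


def elapsed_since(start_label, events=EVENTS):
    """Seconds from the start event to every other event."""
    start = parse_ts(events[start_label])
    return {
        label: (parse_ts(stamp) - start).total_seconds()
        for label, stamp in events.items()
        if label != start_label
    }


if __name__ == "__main__":
    for label, seconds in elapsed_since("first connection refused").items():
        print(f"{label}: {seconds:.3f}s")
```

On these numbers the source is rebound about 53.5s after the first refused connection and the subtask is running on the new worker about 1.3s after that, which is consistent with the "roughly 60s" observation. The re-registration time depends on when the old worker was brought back, so it says nothing about DM itself.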
    

dm-master HA

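  1. Simulate a dm-master failure

    date; kill -9 <pid>; mv <deploy dir> <deploy dir>-1  # force-kill the dm-master process, then rename the deploy directory so it cannot restart automatically
    
  2. Observe the leader switchover

  3. Record the relevant data: leader switchover time, status of all tasks, latency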
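
To record when leadership moves, the leader reported by dmctl can be sampled during the drill. The following is a minimal Python sketch under stated assumptions, not part of the original drill: it expects the JSON printed by `list-member --leader` to contain a `members` list with a `leader` entry holding `name` and `addr`. That shape and the helper names `leader_of` and `leader_changes` are assumptions to check against your DM version.

```python
import json


def leader_of(list_member_text):
    """Return (name, addr) of the reported leader, or None if it cannot be read.

    Assumption: the text is JSON shaped like
    {"members": [{"leader": {"name": ..., "addr": ...}}, ...]}.
    """
    try:
        doc = json.loads(list_member_text)
    except ValueError:
        return None
    if not isinstance(doc, dict):
        return None
    for member in doc.get("members") or []:
        leader = member.get("leader") if isinstance(member, dict) else None
        if leader and leader.get("name"):
            return (leader.get("name"), leader.get("addr"))
    return None


def leader_changes(samples):
    """From (timestamp, text) samples, keep those where the leader differs from the previous one."""
    changes = []
    last = object()  # sentinel so the first sample is always recorded
    for stamp, text in samples:
        current = leader_of(text)
        if current != last:
            changes.append((stamp, current))
            last = current
    return changes
```

The samples would come from running `dmctl --master-addr <addr> list-member --leader` once per second against a dm-master that stays alive. A `None` entry between two leaders marks the window in which no leader could be read.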

Conclusions:

  • Whether a new leader is elected normally

    [2021/08/17 14:17:36.240 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [component="embed etcd"] [stream-reader-type="stream MsgApp v2"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [error="unexpected EOF"] [2021/08/17 14:17:36.240 +08:00] [WARN] [peer_status.go:68] ["peer became inactive (message send to peer failed)"] [component="embed etcd"] [peer-id=201495974e8233cd] [error="failed to read 201495974e8233cd on stream MsgApp v2 (unexpected EOF)"] [2021/08/17 14:17:36.240 +08:00] [WARN] [stream.go:436] ["lost TCP streaming connection with remote peer"] [component="embed etcd"] [stream-reader-type="stream Message"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [error="unexpected EOF"] [2021/08/17 14:17:36.241 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:37.241 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:38.097 +08:00] [WARN] [stream.go:193] ["lost TCP streaming connection with remote peer"] [component="embed etcd"] [stream-writer-type="stream Message"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [2021/08/17 14:17:38.855 +08:00] [WARN] [util.go:163] ["apply request took too long"] [component="embed etcd"] [took=2.096232825s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/dm-master/bound-worker/646d2d3137322e31372e3230312e3131362d38323634\" "] [response=] [error="etcdserver: leader changed"] [2021/08/17 14:17:38.857 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:38.857 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:38.958 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:39.763 +08:00] [WARN] [stream.go:193] ["lost TCP streaming connection with remote peer"] [component="embed etcd"] [stream-writer-type="stream MsgApp v2"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [2021/08/17 14:17:41.719 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:42.859 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:42.859 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:45.053 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:45.855 +08:00] [WARN] [v3_server.go:746] ["timed out waiting for read index response (local node might have slow network)"] [component="embed etcd"] [timeout=7s] [2021/08/17 14:17:45.855 +08:00] [WARN] [util.go:163] ["apply request took too long"] [component="embed etcd"] [took=9.033455931s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/dm-master/bound-worker/646d2d3137322e31372e3230312e3131352d38323633\" "] [response=] [error="etcdserver: request timed out"] [2021/08/17 14:17:45.855 +08:00] [WARN] [util.go:163] ["apply request took too long"] [component="embed etcd"] [took=9.085679831s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/dm-master/bound-worker/646d2d3137322e31382e37382e3235342d38323636\" "] [response=] [error="etcdserver: request timed out"] [2021/08/17 14:17:45.855 +08:00] [WARN] [util.go:163] ["apply request took too long"] [component="embed etcd"] [took=9.085819911s] [expected-duration=100ms] [prefix="read-only range "] 
[request="key:\"/dm-master/bound-worker/646d2d3137322e31382e37382e3235342d38323632\" "] [response=] [error="etcdserver: request timed out"] [2021/08/17 14:17:45.856 +08:00] [WARN] [util.go:163] ["apply request took too long"] [component="embed etcd"] [took=6.99962841s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/dm-master/relay-worker/646d2d3137322e31372e3230312e3131362d38323634\" "] [response="range_response_count:0 size:5"] [] [2021/08/17 14:17:46.860 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:46.860 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:47.820 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:49.716 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:50.305 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:50.862 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:50.862 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:50.909 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:53.418 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:54.717 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:54.863 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:54.863 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:55.305 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:56.854 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:17:58.864 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:58.865 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:59.717 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:17:59.859 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:00.305 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:02.469 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:02.866 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:02.866 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:04.717 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:05.305 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:05.898 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:06.867 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:06.867 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:08.588 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:09.717 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:10.306 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:10.868 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:10.868 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get 
http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:11.740 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:14.586 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:14.717 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:14.869 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:14.869 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:15.306 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:17.707 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to 
{172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:18.870 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:18.870 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:19.718 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:19.876 +08:00] [INFO] [server.go:2206] [payload="leader:true master:true names:\"master-3\" "] [request=ListMember] [2021/08/17 14:18:19.876 +08:00] [INFO] [server.go:2221] ["will forward after a short interval"] [from=master-3] [to=master-1] [request=ListMember] [2021/08/17 14:18:20.306 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:20.388 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:22.871 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:22.871 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:23.242 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:24.718 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:25.306 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:25.676 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". 
Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:26.872 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:26.873 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:28.384 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:29.718 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:30.306 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:30.874 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:30.874 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get 
http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:31.067 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:34.088 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:34.718 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:34.875 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:34.875 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:35.307 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:37.289 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to 
{172.17.201.115:8261  <nil> 0 <nil>}. Err :connection error: desc = \"transport: Error while dialing dial tcp 172.17.201.115:8261: connect: connection refused\". Reconnecting..."] [component="embed etcd"] [2021/08/17 14:18:38.876 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:38.876 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:39.719 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_SNAPSHOT] [remote-peer-id=201495974e8233cd] [rtt=331.518µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:18:40.003 +08:00] [INFO] [election.go:292] ["get response from election observe"] [component=election] [key=/dm-master/leader/1ba57a3843f19704] [value="{\"id\":\"master-2\",\"addr\":\"172.18.78.254:8261\"}"] [2021/08/17 14:18:40.003 +08:00] [INFO] [election.go:316] ["current member is not the leader"] [component=election] ["current member"="{\"id\":\"master-3\",\"addr\":\"172.17.201.116:8261\"}"] [leader="{\"id\":\"master-2\",\"addr\":\"172.18.78.254:8261\"}"] [2021/08/17 14:18:40.003 +08:00] [INFO] [election.go:97] ["get new leader"] [leader=master-2] ["current member"=master-3]
    
  • The leader election completed normally and took about 60s.

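  • Replication tasks during the election (latency, status, etc.): running replication tasks are not affected and show no latency. The dm-worker log contains related ERROR and WARN entries, which can be ignored.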

    [2021/08/17 14:17:36.269 +08:00] [ERROR] [server.go:596] ["WatchSourceBound received an error"] [error="etcdserver: mvcc: required revision has been compacted"] [2021/08/17 14:17:36.269 +08:00] [ERROR] [server.go:635] ["WatchRelayConfig received an error"] [error="etcdserver: mvcc: required revision has been compacted"] [2021/08/17 14:17:36.269 +08:00] [ERROR] [worker.go:675] ["WatchSubTaskStage received an error"] [error="etcdserver: mvcc: required revision has been compacted"] [2021/08/17 14:17:38.857 +08:00] [INFO] [worker.go:274] ["enter DisableRelay"] [component="worker controller"] [2021/08/17 14:17:38.858 +08:00] [WARN] [worker.go:278] ["already disabled relay"] [component="worker controller"] [2021/08/17 14:17:38.858 +08:00] [INFO] [keepalive.go:149] ["ignore same keepalive TTL change"] [TTL=60] [2021/08/17 14:17:38.858 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 14:17:38.858 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 14:17:38.859 +08:00] [INFO] [worker.go:473] ["resume sub task"] [component="worker controller"] [task=dm-rds_master] [2021/08/17 14:17:38.859 +08:00] [ERROR] [worker.go:585] ["fail to operate subtask stage"] [stage="{\"expect\":2,\"source\":\"ds-rds_master\",\"task\":\"dm-rds_master\"}"] [task=dm-rds_master] [error="[code=40051:class=dm-worker:scope=internal:level=high], Message: current stage is Running but not paused, invalid"] [errorVerbose="[code=40051:class=dm-worker:scope=internal:level=high], Message: current stage is Running but not paused, 
invalid\ngithub.com/pingcap/dm/pkg/terror.(*Error).Generate\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/pkg/terror/terror.go:265\ngithub.com/pingcap/dm/dm/worker.(*SubTask).Resume\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/subtask.go:486\ngithub.com/pingcap/dm/dm/worker.(*Worker).OperateSubTask\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/worker.go:474\ngithub.com/pingcap/dm/dm/worker.(*Worker).operateSubTaskStage\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/worker.go:706\ngithub.com/pingcap/dm/dm/worker.(*Worker).resetSubtaskStage\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/worker.go:582\ngithub.com/pingcap/dm/dm/worker.(*Worker).observeSubtaskStage\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/worker.go:631\ngithub.com/pingcap/dm/dm/worker.(*Worker).EnableHandleSubtasks.func1\n\t/home/jenkins/agent/workspace/build_dm_multi_branch_v2.0.4/go/src/github.com/pingcap/dm/dm/worker/worker.go:347\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"] [2021/08/17 14:17:45.859 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 14:17:47.316 +08:00] [INFO] [syncer.go:1003] ["flushed checkpoint"] [task=dm-rds_master] [unit="binlog replication"] [checkpoint="position: (mysql-bin.013427, 319445109), gtid-set: 
007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745467924,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680(flushed position: (mysql-bin.013427, 319445109), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745467924,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680)"]  [2021/08/17 14:18:01.661 +08:00] [INFO] [syncer.go:2627] ["binlog replication progress"] [task=dm-rds_master] [unit="binlog replication"] ["total binlog size"=461082248942] ["last binlog size"=461058611217] ["cost time"=30] [bytes/Second=787924] ["unsynced binlog size"=0] ["estimate time 
to catch up"=0] [2021/08/17 14:18:01.661 +08:00] [INFO] [syncer.go:2654] ["binlog replication status"] [task=dm-rds_master] [unit="binlog replication"] [total_events=461442238] [total_tps=562] [tps=162] [master_position="(mysql-bin.013427, 329839142)"] [master_gtid=007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745480546,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680] [checkpoint="position: (mysql-bin.013427, 329839142), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745480546,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680(flushed position: 
(mysql-bin.013427, 319445109), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745467924,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680)"] [2021/08/17 14:18:17.523 +08:00] [INFO] [syncer.go:1003] ["flushed checkpoint"] [task=dm-rds_master] [unit="binlog replication"] [checkpoint="position: (mysql-bin.013427, 337456334), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745489146,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680(flushed position: (mysql-bin.013427, 337456334), gtid-set: 
007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745489146,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680)"] [2021/08/17 14:18:31.661 +08:00] [INFO] [syncer.go:2627] ["binlog replication progress"] [task=dm-rds_master] [unit="binlog replication"] ["total binlog size"=461093711235] ["last binlog size"=461082248942] ["cost time"=30] [bytes/Second=382076] ["unsynced binlog size"=0] ["estimate time to catch up"=0] [2021/08/17 14:18:31.661 +08:00] [INFO] [syncer.go:2654] ["binlog replication status"] [task=dm-rds_master] [unit="binlog replication"] [total_events=461447300] [total_tps=562] [tps=168] [master_position="(mysql-bin.013427, 341302136)"] 
[master_gtid=007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745492663,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680] [checkpoint="position: (mysql-bin.013427, 341300826), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745492663,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680(flushed position: (mysql-bin.013427, 337456334), gtid-set: 
007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745489146,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680)"] [2021/08/17 14:18:40.034 +08:00] [INFO] [server.go:581] ["receive source bound"] [bound="{\"source\":\"ds-rds_master\",\"worker\":\"dm-172.17.201.115-8263\"}"] ["is deleted"=false] [2021/08/17 14:18:40.036 +08:00] [WARN] [task.go:826] ["session variable 'time_zone' is overwritten by default UTC timezone."] [time_zone=+00:00] [2021/08/17 14:18:40.036 +08:00] [INFO] [server.go:830] ["mysql source is being handled"] [sourceID=ds-rds_master] [2021/08/17 14:18:40.036 +08:00] [INFO] [worker.go:310] ["enter EnableHandleSubtasks"] [component="worker controller"] [2021/08/17 14:18:40.036 +08:00] [WARN] [worker.go:314] ["already enabled handling subtasks"] [component="worker controller"] [2021/08/17 14:18:47.719 +08:00] [INFO] [syncer.go:1003] ["flushed checkpoint"] [task=dm-rds_master] [unit="binlog replication"] [checkpoint="position: (mysql-bin.013427, 345689091), gtid-set: 
007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745496851,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680(flushed position: (mysql-bin.013427, 345689091), gtid-set: 007c9203-833d-11e7-aff9-6c92bf21bbe1:1-931221440,11efeebc-747e-11e5-8527-d89d672b3674:1-36643616,1d258984-586b-11e8-9e16-7cd30ae3fcb8:1-1251269710,20dc5615-747e-11e5-8528-1051721b3701:1-39135349,22fb37d8-09ec-11e9-a38f-7cd30a5a2712:1-11655067164,4e09d36c-bb22-11e8-a1cb-7cd30ad38860:1-530901004,8940bea7-54b1-11e6-bb21-6c92bf31607b:1-1096743131,89cc53a8-e019-11e7-8d82-7cd30adae7d0:1-213160520,96767ef0-54b1-11e6-bb21-6c92bf31493f:1-325177501,9b5f2a07-a37c-11e8-8798-7cd30adae88c:1-140532061,a3408fb3-7479-11e5-850a-008cfae41260:1-6695,a7b9cbdc-e019-11e7-8d83-7cd30adbc3c8:1-1582021685,cc6f5f0f-014d-11e6-9b5c-6c92bf21d7b1:1-329680794,ce41dd8e-a37c-11e8-8799-7cd30ac427e8:1-985022222,d17b141f-e217-11ea-aa81-b8599f52a93c:1-4745496851,db92643a-014d-11e6-9b5c-6c92bf21bb19:1-4,f650defc-09eb-11e9-a38e-7cd30ae4109e:1-166445680)"]
    
  • After the dm-master host comes back up, does dm-master start automatically and rejoin the cluster?

    [2021/08/17 14:19:34.892 +08:00] [WARN] [cluster_util.go:315] ["failed to reach the peer URL"] [component="embed etcd"] [address=http://172.17.201.115:8291/version] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:19:34.892 +08:00] [WARN] [cluster_util.go:168] ["failed to get version"] [component="embed etcd"] [remote-member-id=201495974e8233cd] [error="Get http://172.17.201.115:8291/version: dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:19:35.309 +08:00] [WARN] [probing_status.go:70] ["prober detected unhealthy status"] [component="embed etcd"] [round-tripper-name=ROUND_TRIPPER_RAFT_MESSAGE] [remote-peer-id=201495974e8233cd] [rtt=593.425µs] [error="dial tcp 172.17.201.115:8291: connect: connection refused"] [2021/08/17 14:19:38.496 +08:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [component="embed etcd"] [stream-writer-type="stream MsgApp v2"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [2021/08/17 14:19:38.496 +08:00] [WARN] [stream.go:277] ["established TCP streaming connection with remote peer"] [component="embed etcd"] [stream-writer-type="stream Message"] [local-member-id=db326cb7fba547f5] [remote-peer-id=201495974e8233cd] [2021/08/17 14:19:49.428 +08:00] [INFO] [server.go:2206] [payload="leader:true master:true names:\"master-3\" "] [request=ListMember]
    
  • Yes, it rejoins the cluster automatically; the dm-master leader tries to restart the dm-master that went down.
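
The election time reported above can be cross-checked against the log timestamps. The following is a minimal sketch, not part of the original drill: the helper names `tod_ms` and `elapsed_ms` are hypothetical, both lines are assumed to fall on the same day, and the first worker-side watch error (14:17:36.269) is assumed to roughly mark the moment the old leader went away. The two sample lines are shortened copies of entries from the logs above.

```shell
#!/usr/bin/env bash
# Sketch: measure the gap between two DM log lines from the same day.

# Print milliseconds since midnight for the leading
# "[YYYY/MM/DD hh:mm:ss.mmm +08:00]" timestamp of a log line.
tod_ms() {
  local t h m s ms
  t=${1#*' '}     # drop the date part, e.g. "[2021/08/17 "
  t=${t%%' '*}    # keep only "hh:mm:ss.mmm"
  IFS=':.' read -r h m s ms <<<"$t"
  # 10# forces base 10 so that "08" and "09" are not read as octal
  echo $(( (10#$h * 60 + 10#$m) * 60000 + 10#$s * 1000 + 10#$ms ))
}

# Print the elapsed milliseconds between two log lines (same day assumed).
elapsed_ms() {
  echo $(( $(tod_ms "$2") - $(tod_ms "$1") ))
}

# Assumption: the first worker-side watch error approximates the leader outage.
first_error='[2021/08/17 14:17:36.269 +08:00] [ERROR] [server.go:596] ["WatchSourceBound received an error"]'
new_leader='[2021/08/17 14:18:40.003 +08:00] [INFO] [election.go:97] ["get new leader"] [leader=master-2]'

echo "gap: $(elapsed_ms "$first_error" "$new_leader") ms"
```

With these two lines the script prints `gap: 63734 ms`, which is consistent with the "about 60s" figure. For a stricter measurement, the output of the `date` command run just before `kill -9` would be a better starting point than a log line.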

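dm-master and dm-worker down at the same time

  1. Simulate dm-master and dm-worker going down at the same time

    date; kill -9 m_pid; kill -9 w_pid; mv <dm-master deploy dir> <dm-master deploy dir>-1; mv <dm-worker deploy dir> <dm-worker deploy dir>-1  # force-kill the dm-master and dm-worker processes, and rename both deploy directories to prevent automatic restart
    
  2. Observe the leader switch and the task transfer

  3. Record the relevant data: leader switch time, status of all tasks, and latency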

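Conclusions:

  1. dm-master elected a new leader successfully, taking about 60s

  2. Replication tasks that were running on other dm-workers are not affected

  3. Tasks on the dm-worker that went down together with the dm-master cannot be transferred to another dm-worker in the free state; they only resume after the failed dm-worker is restarted (versions 2.0.6 and earlier)

    If the dm-worker cannot be started, perform the following steps: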

    # 1. Run stop-task to clear the task information in the cluster; the data in the dm-meta tables is not cleared
    tiup dmctl --master-addr=<master-addr> stop-task <task-name>
    {
        "op": "Stop",
        "result": true,
        "msg": "",
        "sources": [
            {
                "result": false,
                "msg": "<source-id> relevant worker-client not found",
                "source": "<source-id>",
                "worker": ""
            }
        ]
    }

    # 2. Run operate-source stop to clear the source binding information
    tiup dmctl --master-addr=<master-addr> operate-source stop <source-id>
    {
        "result": true,
        "msg": "",
        "sources": [
            {
                "result": true,
                "msg": "source is stopped and hasn't bound to worker before being stopped",
                "source": "<source-id>",
                "worker": ""
            }
        ]
    }

    # 3. Re-create the data source so that it is bound again
    tiup dmctl --master-addr=<master-addr> operate-source create <source-yaml-config>

    # 4. Restart the task; the position is read from the dm-meta tables and replication continues. The latency depends on how long the operation takes
    tiup dmctl --master-addr=<master-addr> start-task <task-yaml-config>
    

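      Risk: dm-master and dm-worker are currently co-located in production, which is risky: the node hosting the dm-master leader must not run dm-worker tasks. A bug report has been filed.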
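
The four recovery steps above can be collected into a small helper that prints the commands in order for review before they are run by hand. This is only a sketch: the function name `recovery_commands`, its argument names, and the example values are hypothetical, and the commands are exactly the ones listed in the steps, with nothing added.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print the four recovery commands in order so they can be reviewed
# before being run by hand. All argument names are hypothetical.
recovery_commands() {
  local master_addr=$1 task_name=$2 source_id=$3 source_yaml=$4 task_yaml=$5
  local dmctl="tiup dmctl --master-addr=${master_addr}"
  printf '%s\n' \
    "${dmctl} stop-task ${task_name}" \
    "${dmctl} operate-source stop ${source_id}" \
    "${dmctl} operate-source create ${source_yaml}" \
    "${dmctl} start-task ${task_yaml}"
}

# Example values are placeholders, not taken from the cluster in this article.
recovery_commands "127.0.0.1:8261" "my-task" "my-source" "./source.yaml" "./task.yaml"
```

Printing rather than executing is a deliberate choice here: each step returns JSON that should be checked (for example `"result": true`) before moving on to the next one.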

Rolling upgrade

  1. Use the rolling upgrade command to upgrade the DM cluster to v2.0.6

    tiup dm upgrade <dm-cluster> v2.0.6
    

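Conclusions:

The upgrade is performed in a rolling fashion, in this order: dm-master → dm-worker → prometheus → grafana

  1. The dm-master leader switches to another node automatically
  2. During the rolling upgrade, replication tasks on the dm-worker being upgraded are transferred to another dm-worker in the free state and continue replicating with almost no latency
  3. During the rolling upgrade, replication tasks on dm-workers that have not been upgraded yet are not affected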


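Copyright notice: this is an original article by a TiDB community user, licensed under CC BY-NC-SA 4.0. When reposting, please include a link to the original source and this notice.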
