Three Cases Encountered While Upgrading TiDB Clusters to v6.1.5: Troubleshooting Notes & Upgrade Benefits

Yifei_Jia · published 2023-12-15

Our team had been planning a TiDB version upgrade for a long time. We kept putting it off because we weren't sure we could handle all the unexpected problems that might come up.

This article was written at the end of August; while writing a summary today I came across it again, so I'm sharing it as a reference for anyone planning a version upgrade.

Our business clusters hold a lot of data, tens of TB per cluster, and the workloads are business-critical, so we follow the principle of not upgrading unless we have to.

But as TiDB kept iterating, the new features, performance improvements, and fixes for bugs we had actually hit kept nudging us toward the decision to upgrade.

The three cases recorded here are problems we ran into during a planned upgrade of 16 TiDB clusters, in which the clusters under the heaviest performance pressure and best matching the upgrade criteria were upgraded to v6.1.5.
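For context, the upgrades were driven by tiup (the output shown in Issue 2 below comes from tiup cluster upgrade). A minimal sketch of the flow we would run, with a placeholder cluster name:

tiup update cluster                        # make sure the tiup cluster component itself is up to date
tiup cluster check tidb-prod --cluster     # optional pre-upgrade health check; "tidb-prod" is a placeholder name
tiup cluster upgrade tidb-prod v6.1.5      # rolling upgrade of all components to v6.1.5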

Issue 1: After the upgrade, one TiFlash node could not restart and kept OOMing

Troubleshooting process:

1. Error reported on the application side:
uncategorized SQLException; SQL state [HY000]; error code [1105]; other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail; nested exception is java.sql.SQLException: other error for mpp stream: DB::Exception: Exchange receiver meet error : DB::Exception: Exchange receiver meet error : send async stream request fail
2. Error log on the TiDB node:
[ERROR] 2023-08-23 ****** [http-nio-8080-exec-6] [com.*****.man.web.framework.exception.GlobalErrHandler] [7c25fb69df2800c5] - 通用异常信息:[
### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout
### The error may exist in URL [jar:file:/web/servers/tomcat/webapps/ROOT/WEB-INF/lib/***-man-dao-2.0.0-RELEASE.jar!/mapper/tidb/SheetListTidbMapper.xml]
### The error may involve *****.man.dao.tidb.SheetListTidbMapper.selectTidbPageByQuery-Inline
### The error occurred while setting parameters
### SQL: SELECT  ****** ;
### Cause: java.sql.SQLException: TiFlash server timeout
; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout]
org.springframework.jdbc.UncategorizedSQLException: 
### Error querying database.  Cause: java.sql.SQLException: TiFlash server timeout
; uncategorized SQLException; SQL state [HY000]; error code [9012]; TiFlash server timeout; nested exception is java.sql.SQLException: TiFlash server timeout

The query plan showed that no suitable index could be used to read the data from TiKV; the query could only go to TiFlash.

The first error corresponds to an MPP bug in older versions that had already been fixed, so in theory it should not have appeared.

We later confirmed that the TiFlash node had crashed after the upgrade and could not start again. Even though there are two TiFlash nodes, the data was effectively running with a single replica, which is why the errors kept coming.

Analysis and recommendations:

When upgrading TiFlash from v5.1 to v6.1.5 there is no guarantee that nothing else will go wrong along the way. On this machine, data recovery during the upgrade temporarily needed more memory than the machine's 16 GB, so the upgrade could not complete in place.

When you run a version upgrade in the future: first, make sure the hardware resources are up to the task; second, make sure the components being upgraded have multiple replicas.

That way you won't have to sit through the data migration caused by scaling TiFlash in and out again to resync data during the upgrade.
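Before and after the upgrade it is also worth confirming how many TiFlash replicas each table really has and whether they are fully available; a minimal sketch via the mysql client (connection details are placeholders):

# List TiFlash replica count and sync progress for every table that has TiFlash replicas.
mysql -h 10.0.0.1 -P 4000 -u root -p -e "
  SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT, AVAILABLE, PROGRESS
  FROM information_schema.tiflash_replica;"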

Issue 2: Upgrading the cluster with the --offline flag could not proceed: "cluster is running and cannot be upgraded offline"

will upgrade component           "tiflash" to "v6.1.5",
will upgrade component               "cdc" to "v6.1.5",
will upgrade component                "pd" to "v6.1.5",
will upgrade component              "tikv" to "v6.1.5",
will upgrade component              "tidb" to "v6.1.5",
will upgrade component        "prometheus" to "v6.1.5",
will upgrade component           "grafana" to "v6.1.5",
will upgrade component      "alertmanager" to "",
will upgrade component     "node-exporter" to "",
will upgrade component "blackbox-exporter" to "".
Do you want to continue? [y/N]:(default=N) y
Upgrading cluster...
Error: cluster is running and cannot be upgraded offline

As for why this happened, we didn't dig further. What we eventually confirmed was: tiup had stopped the TiFlash component and its status showed it as down, but on the server systemctl status tiflash-9000 still showed running, so the component had never actually stopped. That is easy enough to understand, since components are shut down asynchronously.

The fix is simply to log in to the TiFlash server and stop the component there.
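Concretely, on the affected TiFlash host it comes down to something like the following (tiflash-9000 is the systemd unit in our deployment, following tiup's component-port naming):

systemctl status tiflash-9000        # tiup says "Down", but the unit is still running
sudo systemctl stop tiflash-9000     # stop it for real
systemctl is-active tiflash-9000     # confirm it is inactive before retrying the offline upgrade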

Issue 3: A very large DDL was running during the upgrade, so after the upgrade finished the TiDB compute nodes could not serve traffic

The service was simply unavailable, so let's look at the logs of each component.

When troubleshooting we usually grep the logs for errors. The error logs were concentrated on PD, so at first we suspected PD itself.

Relevant PD logs:

[2023/08/24 22:51:27.992 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.993 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.994 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:27.995 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.265 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.266 +08:00] [ERROR] [base_client.go:144] ["[pd] failed updateMember"] [error="[PD:client:ErrClientGetLeader]get leader from [http://10.16.57.106:2379 http://10.16.57.62:2379 http://10.16.57.82:2379] error"]
[2023/08/24 22:51:28.266 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetLeader]get leader from leader address don't exist error"]
[2023/08/24 22:51:28.476 +08:00] [ERROR] [client.go:847] ["[pd] getTS error"] [dc-location=global] [stream-addr=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster"]

In short, every request was rejected with the complaint that no PD leader was available to handle it.
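At this point the PD membership and current leader can be checked directly with pd-ctl; a sketch against one of the PD addresses from the logs:

tiup ctl:v6.1.5 pd -u http://10.16.57.62:2379 member     # list PD members and the current leader
tiup ctl:v6.1.5 pd -u http://10.16.57.62:2379 health     # health state of each PD member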

TiDB logs:

[2023/08/24 23:16:29.989 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.106:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.106:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.62:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.62:2379 status:READY"]
[2023/08/24 23:16:29.990 +08:00] [INFO] [base_client.go:275] ["[pd] cannot update member from this address"] [address=http://10.16.57.82:2379] [error="[PD:client:ErrClientGetMember]error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY: error:rpc error: code = Unknown desc = [PD:server:ErrServerNotStarted]server not started target:10.16.57.82:2379 status:READY"]
[2023/08/25 01:07:49.023 +08:00] [INFO] [printer.go:34] ["Welcome to TiDB."] ["Release Version"=v6.1.5] [Edition=Community] ["Git Commit Hash"=73d82e330b02a39f74073d98daefbadd7deab9b9] ["Git Branch"=heads/refs/tags/v6.1.5] ["UTC Build Time"="2023-02-22 06:21:24"] [GoVersion=go1.19.5] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2023/08/25 01:07:49.024 +08:00] [INFO] [printer.go:48] ["loaded config"] [config="{\"host\":\"0.0.0.0\",\"advertise-address\":\"10.16.57.82\",\"port\":4000,\"cors\":\"\",\"store\":\"tikv\",\"path\":\"10.16.57.82:2379,10.16.57.106:2379,10.16.57.62:2379\",\"socket\":\"/tmp/tidb-4000.sock\",\"lease\":\"45s\",\"run-ddl\":true,\"split-table\":true,\"token-limit\":1000,\"oom-use-tmp-storage\":true,\"tmp-storage-path\":\"/tmp/0_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage\",\"tmp-storage-quota\":-1,\"server-version\":\"\",\"version-comment\":\"\",\"tidb-edition\":\"\",\"tidb-release-version\":\"\",\"log\":{\"level\":\"info\",\"format\":\"text\",\"disable-timestamp\":null,\"enable-timestamp\":null,\"disable-error-stack\":null,\"enable-error-stack\":null,\"file\":{\"filename\":\"/data/tidb-deploy/tidb-4000/log/tidb.log\",\"max-size\":300,\"max-days\":0,\"max-backups\":0},\"enable-slow-log\":true,\"slow-query-file\":\"/data/tidb-deploy/tidb-4000/log/tidb_slow_query.log\",\"slow-threshold\":1000,\"expensive-threshold\":10000,\"record-plan-in-slow-log\":1,\"query-log-max-len\":4096},\"instance\":{\"tidb_general_log\":false,\"tidb_pprof_sql_cpu\":false,\"ddl_slow_threshold\":300,\"tidb_expensive_query_time_threshold\":60,\"tidb_enable_slow_log\":true,\"tidb_slow_log_threshold\":1000,\"tidb_record_plan_in_slow_log\":1,\"tidb_check_mb4_value_in_utf8\":true,\"tidb_force_priority\":\"NO_PRIORITY\",\"tidb_memory_usage_alarm_ratio\":0.8,\"tidb_enable_collect_execution_info\":true,\"plugin_dir\":\"/data/deploy/plugin\",\"plugin_load\":\"\"},\"security\":{\"skip-grant-table\":false,\"ssl-ca\":\"\",\"ssl-cert\":\"\",\"ssl-key\":\"\",\"cluster-ssl-ca\":\"\",\"cluster-ssl-cert\":\"\",\"cluster-ssl-key\":\"\",\"cluster-verify-cn\":null,\"spilled-file-encryption-method\":\"plaintext\",\"enable-sem\":false,\"auto-tls\":false,\"tls-version\":\"\",\"rsa-key-size\":4096,\"secure-bootstrap\":false},\"status\":{\"status-host\":\"0.0.0.0\",\"metrics-addr\":\"\",\"status-port\":10080,\"metrics-interval\":15,\"report-status\":true,\"record-db-qps\":false,\"grpc-keepalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-concurrent-streams\":1024,\"grpc-initial-window-size\":2097152,\"grpc-max-send-msg-size\":2147483647},\"performance\":{\"max-procs\":0,\"max-memory\":0,\"server-memory-quota\":0,\"memory-usage-alarm-ratio\":0.8,\"stats-lease\":\"3s\",\"stmt-count-limit\":5000,\"feedback-probability\":0,\"query-feedback-limit\":512,\"pseudo-estimate-ratio\":0.8,\"force-priority\":\"NO_PRIORITY\",\"bind-info-lease\":\"3s\",\"txn-entry-size-limit\":6291456,\"txn-total-size-limit\":104857600,\"tcp-keep-alive\":true,\"tcp-no-delay\":true,\"cross-join\":true,\"distinct-agg-push-down\":false,\"projection-push-down\":false,\"max-txn-ttl\":3600000,\"index-usage-sync-lease\":\"0s\",\"plan-replayer-gc-lease\":\"10m\",\"gogc\":100,\"enforce-mpp\":false,\"stats-load-concurrency\":5,\"stats-load-queue-size\":1000,\"enable-stats-cache-mem-quota\":false,\"committer-concurrency\":128,\"run-auto-analyze\":true},\"prepared-plan-cache\":{\"enabled\":true,\"capacity\":100,\"memory-guard-ratio\":0.1},\"opentracing\":{\"enable\":false,\"rpc-metrics\":false,\"sampler\":{\"type\":\"const\",\"param\":1,\"sampling-server-url\":\"\",\"max-operations\":0,\"sampling-refresh-interval\":0},\"reporter\":{\"queue-size\":0,\"buffer-flush-interval\":0,\"log-spans\":false,\"local-agent-host-port\":\"\"}},\"proxy-protocol\":{\"networks\":\"\",\"header-timeout\":5},\"pd-client\":{\"pd-server-timeout\":3},\"tikv-client\":{\"grpc-connection-count\":4,\"grpc-ke
epalive-time\":10,\"grpc-keepalive-timeout\":3,\"grpc-compression-type\":\"none\",\"commit-timeout\":\"41s\",\"async-commit\":{\"keys-limit\":256,\"total-key-size-limit\":4096,\"safe-window\":2000000000,\"allowed-clock-drift\":500000000},\"max-batch-size\":128,\"overload-threshold\":200,\"max-batch-wait-time\":0,\"batch-wait-size\":8,\"enable-chunk-rpc\":true,\"region-cache-ttl\":600,\"store-limit\":0,\"store-liveness-timeout\":\"1s\",\"copr-cache\":{\"capacity-mb\":1000},\"ttl-refreshed-txn-size\":33554432,\"resolve-lock-lite-threshold\":16},\"binlog\":{\"enable\":false,\"ignore-error\":false,\"write-timeout\":\"15s\",\"binlog-socket\":\"\",\"strategy\":\"range\"},\"compatible-kill-query\":false,\"plugin\":{\"dir\":\"/data/deploy/plugin\",\"load\":\"\"},\"pessimistic-txn\":{\"max-retry-count\":256,\"deadlock-history-capacity\":10,\"deadlock-history-collect-retryable\":false,\"pessimistic-auto-commit\":false},\"check-mb4-value-in-utf8\":true,\"max-index-length\":3072,\"index-limit\":64,\"table-column-count-limit\":1017,\"graceful-wait-before-shutdown\":0,\"alter-primary-key\":false,\"treat-old-version-utf8-as-utf8mb4\":true,\"enable-table-lock\":false,\"delay-clean-table-lock\":0,\"split-region-max-num\":1000,\"top-sql\":{\"receiver-address\":\"\"},\"repair-mode\":false,\"repair-table-list\":[],\"isolation-read\":{\"engines\":[\"tikv\",\"tiflash\",\"tidb\"]},\"max-server-connections\":1000,\"new_collations_enabled_on_first_bootstrap\":true,\"experimental\":{\"allow-expression-index\":false},\"enable-collect-execution-info\":true,\"skip-register-to-dashboard\":false,\"enable-telemetry\":false,\"labels\":{},\"enable-global-index\":false,\"deprecate-integer-display-length\":false,\"enable-enum-length-limit\":true,\"stores-refresh-interval\":60,\"enable-tcp4-only\":false,\"enable-forwarding\":false,\"max-ballast-object-size\":0,\"ballast-object-size\":0,\"enable-global-kill\":true,\"enable-batch-dml\":false,\"mem-quota-query\":2147483648,\"oom-action\":\"cancel\"}"]
[2023/08/25 01:07:49.024 +08:00] [INFO] [main.go:343] ["disable Prometheus push client"]
These logs show that the TiDB node loaded its configuration and started up normally, but it never opened port 4000 to serve clients.
[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]
[2023/08/25 01:07:51.442 +08:00] [INFO] [ddl_worker.go:339] ["[ddl] add DDL jobs"] ["batch count"=1] [jobs="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0; "]
[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]
We suspected the upgrade had not executed properly.
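A quick way to observe this state from the host is to compare the service status with the listening ports; a sketch (the unit name tidb-4000 follows tiup's component-port convention and is an assumption here):

systemctl status tidb-4000       # the tidb-server process is up
ss -lntp | grep ':4000'          # ...but nothing is listening on the SQL port yet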

Since the upgrade was carried out as planned and we were not running any DDL on TiDB ourselves, we never suspected that this was the key to the problem. But the key often lies exactly in the place you overlooked. We checked TiKV and everything was healthy, which indirectly showed that PD should be fine too; otherwise the multi-replica storage service could not have been working.

Turning back to TiDB itself, we suddenly noticed a physical add-index DDL that had been replicated from MySQL into this TiDB cluster.

The picture gradually became clear. MySQL is sharded and keeps only about half a year of hot data, so the DDL finished quickly there. TiDB aggregates the full data set into a single table, and the same physical add-index DDL was automatically replicated to TiDB, where it ran for a very long time.

With the overall picture in mind, we went back to the TiDB logs.

[2023/08/24 22:00:27.008 +08:00] [WARN] [session.go:1533] ["run statement failed"] [conn=5897899] [schemaVersion=3240] [error="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`"] [session="{\n  \"currDBName\": \"***_voucher\",\n  \"id\": 5897899,\n  \"status\": 2,\n  \"strictMode\": true,\n  \"user\": {\n    \"Username\": \"tidb_cold_w\",\n    \"Hostname\": \"10.16.56.178\",\n    \"CurrentUser\": false,\n    \"AuthUsername\": \"tidb_cold_w\",\n    \"AuthHostname\": \"10.%.%.%\"\n  }\n}"]
[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`\ngithub.com/pingcap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/
# cat tidb.log |grep "run reorg job wait timeout"
[2023/08/24 22:09:51.308 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7104213] [doneKey=6774522551193595979] []
[2023/08/24 22:09:56.356 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7183573] [doneKey=6774522551193595979] []
[2023/08/24 22:10:01.391 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7262165] [doneKey=6774522551193595979] []
[2023/08/24 22:10:06.481 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7336405] [doneKey=6774522551193595979] []
[2023/08/24 22:10:11.570 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7419143] [doneKey=6774522551193595979] []
[2023/08/24 22:10:16.640 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7464902] [doneKey=6774522551193595979] []
[2023/08/24 22:10:21.804 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=7544549] [doneKey=6774692334375633788.next] []
[2023/08/24 22:11:43.519 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8656519] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:48.593 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8737159] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:53.648 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8792281] [doneKey=6775048577556492528.next] []
[2023/08/24 22:11:58.740 +08:00] [INFO] [reorg.go:284] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=8826499] [doneKey=6775048577556492528.next] []
[2023/08/25 04:37:44.221 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256292839] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:49.308 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256353399] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:54.366 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256386509] [doneKey=6858402545246070082.next] []
[2023/08/25 04:37:59.438 +08:00] [INFO] [reorg.go:301] ["[ddl] run reorg job wait timeout"] [waitTime=5s] [elementType=_idx_] [elementID=11] [totalAddedRowCount=256442291] [doneKey=6858463469667393591.next] []

Root cause analysis:

[2023/08/24 22:00:27.008 +08:00] [INFO] [conn.go:877] ["command dispatched failed"] [conn=5897899] [connInfo="id:5897899, addr:10.16.56.178:28898 status:10, collation:utf8_general_ci, user:tidb_cold_w"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)"] [txn_mode=PESSIMISTIC] [err="[ddl:1061]index already exist idx_create; a background job is trying to add the same index, please check by `ADMIN SHOW DDL JOBS`

[2023/08/25 01:07:51.410 +08:00] [INFO] [session.go:3311] ["CRUCIAL OPERATION"] [conn=0] [schemaVersion=3242] [cur_db=] [sql="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"] [user=]

[2023/08/25 01:07:51.444 +08:00] [INFO] [ddl.go:632] ["[ddl] start DDL job"] [job="ID:4591, Type:add column, State:queueing, SchemaState:none, SchemaID:3, TableID:21, RowCount:0, ArgLen:3, start time: 2023-08-25 01:07:51.419 +0800 CST, Err:<nil>, ErrCount:0, SnapshotVersion:0"] [query="ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0"]

The internal statement "ALTER TABLE mysql.stats_meta ADD COLUMN snapshot BIGINT(64) UNSIGNED NOT NULL DEFAULT 0", executed as part of the upgrade, was queued in the DDL execution queue behind the physical DDL "ALTER TABLE `voucher_detail` ADD INDEX idx_create(created, vender_id)". As a result the upgrade never truly finished, and the step that opens port 4000 was stuck.

Improvement: when doing a major version upgrade, make sure no large DDL is running on the cluster; check in advance with ADMIN SHOW DDL JOBS, as in the sketch below.
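A pre-upgrade check could look like this (connection details are placeholders); nothing should be left in a running or queueing state before the upgrade starts:

# List recent DDL jobs and make sure none are still running or queued.
mysql -h 10.0.0.1 -P 4000 -u root -p -e "ADMIN SHOW DDL JOBS 20;"
# If a long add-index job has to be stopped, it can be cancelled by job ID:
# mysql -h 10.0.0.1 -P 4000 -u root -p -e "ADMIN CANCEL DDL JOBS <job_id>;"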

The strangest part: after the physical DDL finished and the TiDB compute nodes came back up, we checked the database and found that the index had not actually been added.
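If you run into the same situation, it is worth verifying the index explicitly and re-running the DDL if it is missing; a sketch built from the statement seen in the logs (database name and connection details are placeholders):

mysql -h 10.0.0.1 -P 4000 -u root -p voucher_db -e "SHOW INDEX FROM voucher_detail WHERE Key_name = 'idx_create';"
# Re-add the index if it is absent:
# mysql -h 10.0.0.1 -P 4000 -u root -p voucher_db -e "ALTER TABLE voucher_detail ADD INDEX idx_create(created, vender_id);"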

Benefits of the TiDB version upgrade

Here are some of the positive results, briefly:

We went from v5.1.x before the upgrade to v6.1.5 after: the Duration 99 (p99 latency) dropped by roughly half.

Memory usage dropped by about half.

TSO wait time also dropped noticeably after the upgrade.

To sum up the issues above: upgrading a TiDB cluster is a heavyweight operation; test and verify as much as you can.

Refer to the official upgrade documentation: https://docs.pingcap.com/zh/tidb/stable/smooth-upgrade-tidb

If your cluster includes TiFlash, also collect the relevant operations documentation: https://docs.pingcap.com/zh/tidb/stable/tiflash-upgrade-guide

Go into the upgrade knowing what to expect. Stability, stability, stability!

That's all.

Technical discussion is welcome.

Copyright notice: This is an original article by a TiDB community user, licensed under CC BY-NC-SA 4.0. Please include a link to the original and this notice when reposting.
