bug排查之空想导致的祸之MariaDB Galera Cluster的docker镜像构建中的遇到的坑爹事

  • A+
所属分类:Docker  虚拟化/私有云
摘要

唉,技术不行就不能装逼,不然就我这样,自己把自己搞死了。什么事情都需要用实际验证之后才能下结论,不能用经验主义和理论主义。

之前自己构建了一个Docker的MariaDB Galera Cluster的镜像,测试的时候我只测试了普通功能,集群功能其实自己没测试,只是想当然的用自己之前的经验认为是OK的,谁知道今天测试的时候还真发现问题了,

1、第一次发现我在构建镜像的时候系统上没装which命令,估计你会觉得搞笑,数据库要which命令干什么,我也呵呵。。。看下面截图
bug排查之空想导致的祸之MariaDB Galera Cluster的docker镜像构建中的遇到的坑爹事

既然要用Garela那么wsrep_sst_common肯定要调用的so。。。。you need have which command。

2、第二次发现构建的镜像没有rsync命令,有些人要说了这又是什么鬼。。。看Galera官方的参数说明来说事情。其实这个问题要不是自己为了装逼也不会呈现出来
bug排查之空想导致的祸之MariaDB Galera Cluster的docker镜像构建中的遇到的坑爹事
为了装逼我在构建MariaDB Galera Cluster镜像的时候把method的参数都给做了变量支持了,不在局限于默认的mysqldump模式。。

3、第三次是个严重的问题了,不想详细说了,先看看err日志吧

[root@docker-node01 ~]# cat /data/mariadb/mysql-error.log 
151119 22:19:33 [Note] WSREP: Read nil XID from storage engines, skipping position init
151119 22:19:33 [Note] WSREP: wsrep_load(): loading provider library 'none'
151119 22:19:33 [Note] InnoDB: Using mutexes to ref count buffer pool pages
151119 22:19:33 [Note] InnoDB: The InnoDB memory heap is disabled
151119 22:19:33 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
151119 22:19:33 [Note] InnoDB: Memory barrier is not used
151119 22:19:33 [Note] InnoDB: Compressed tables use zlib 1.2.3
151119 22:19:33 [Note] InnoDB: Using CPU crc32 instructions
151119 22:19:33 [Note] InnoDB: Initializing buffer pool, size = 128.0M
151119 22:19:33 [Note] InnoDB: Completed initialization of buffer pool
151119 22:19:33 [Note] InnoDB: The first specified data file ./ibdata1 did not exist: a new database to be created!
151119 22:19:33 [Note] InnoDB: Setting file ./ibdata1 size to 12 MB
151119 22:19:33 [Note] InnoDB: Database physically writes the file full: wait...
151119 22:19:33 [Note] InnoDB: Setting log file ./ib_logfile101 size to 48 MB
151119 22:19:33 [Note] InnoDB: Setting log file ./ib_logfile1 size to 48 MB
151119 22:19:33 [Note] InnoDB: Setting log file ./ib_logfile2 size to 48 MB
151119 22:19:34 [Note] InnoDB: Renaming log file ./ib_logfile101 to ./ib_logfile0
151119 22:19:34 [Warning] InnoDB: New log files created, LSN=45781
151119 22:19:34 [Note] InnoDB: Doublewrite buffer not found: creating new
151119 22:19:34 [Note] InnoDB: Doublewrite buffer created
151119 22:19:34 [Note] InnoDB: 128 rollback segment(s) are active.
151119 22:19:34 [Warning] InnoDB: Creating foreign key constraint system tables.
151119 22:19:34 [Note] InnoDB: Foreign key constraint system tables created
151119 22:19:34 [Note] InnoDB: Creating tablespace and datafile system tables.
151119 22:19:34 [Note] InnoDB: Tablespace and datafile system tables created.
151119 22:19:34 [Note] InnoDB: Waiting for purge to start
151119 22:19:34 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.25-73.1 started; log sequence number 0
151119 22:19:34 [Warning] Failed to load slave replication state from table mysql.gtid_slave_pos: 1146: Table 'mysql.gtid_slave_pos' doesn't exist
151119 22:19:34 [Note] WSREP: Service disconnected.
151119 22:19:35 [Note] WSREP: Some threads may fail to exit.
151119 22:19:35 [Note] InnoDB: FTS optimize thread exiting.
151119 22:19:35 [Note] InnoDB: Starting shutdown...
151119 22:19:37 [Note] InnoDB: Shutdown completed; log sequence number 1616697
151119 22:19:37 [Note] WSREP: Read nil XID from storage engines, skipping position init
151119 22:19:37 [Note] WSREP: wsrep_load(): loading provider library 'none'
151119 22:19:37 [Note] InnoDB: Using mutexes to ref count buffer pool pages
151119 22:19:37 [Note] InnoDB: The InnoDB memory heap is disabled
151119 22:19:37 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
151119 22:19:37 [Note] InnoDB: Memory barrier is not used
151119 22:19:37 [Note] InnoDB: Compressed tables use zlib 1.2.3
151119 22:19:37 [Note] InnoDB: Using CPU crc32 instructions
151119 22:19:37 [Note] InnoDB: Initializing buffer pool, size = 128.0M
151119 22:19:37 [Note] InnoDB: Completed initialization of buffer pool
151119 22:19:37 [Note] InnoDB: Highest supported file format is Barracuda.
151119 22:19:37 [Note] InnoDB: 128 rollback segment(s) are active.
151119 22:19:37 [Note] InnoDB: Waiting for purge to start
151119 22:19:37 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.25-73.1 started; log sequence number 1616697
151119 22:19:37 [Note] WSREP: Service disconnected.
151119 22:19:38 [Note] WSREP: Some threads may fail to exit.
151119 22:19:38 [Note] InnoDB: FTS optimize thread exiting.
151119 22:19:38 [Note] InnoDB: Starting shutdown...
151119 22:19:40 [Note] InnoDB: Shutdown completed; log sequence number 1616707
151119 22:19:40 [Note] WSREP: Read nil XID from storage engines, skipping position init
151119 22:19:40 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/galera/libgalera_smm.so'
151119 22:19:41 [Note] WSREP: wsrep_load(): Galera 25.3.9(r3385) by Codership Oy <info@codership.com> loaded successfully.
151119 22:19:41 [Note] WSREP: CRC-32C: using hardware acceleration.
151119 22:19:41 [Warning] WSREP: Could not open saved state file for reading: /data/mariadb//grastate.dat
151119 22:19:41 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
151119 22:19:41 [Note] WSREP: Passing config to GCS: base_host = IP隐藏1; base_port = 4567; cert.log_conflicts = no; debug = no; evs.auto_evict = 0; evs.delay_margin = PT1S; evs.delayed_keep_period = PT30S; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 3; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /data/mariadb/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/mariadb//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.recover
151119 22:19:41 [Note] WSREP: Service thread queue flushed.
151119 22:19:41 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
151119 22:19:41 [Note] WSREP: wsrep_sst_grab()
151119 22:19:41 [Note] WSREP: Start replication
151119 22:19:41 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
151119 22:19:41 [Note] WSREP: protonet asio version 0
151119 22:19:41 [Note] WSREP: Using CRC-32C for message checksums.
151119 22:19:41 [Note] WSREP: backend: asio
151119 22:19:41 [Warning] WSREP: access file(gvwstate.dat) failed(No such file or directory)
151119 22:19:41 [Note] WSREP: restore pc from disk failed
151119 22:19:41 [Note] WSREP: GMCast version 0
151119 22:19:41 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
151119 22:19:41 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
151119 22:19:41 [Note] WSREP: EVS version 0
151119 22:19:41 [Note] WSREP: gcomm: connecting to group '*********1', peer 'IP隐藏1:,IP隐藏2:,IP隐藏2:'
151119 22:19:41 [Warning] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' points to own listening address, blacklisting
151119 22:19:41 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' pointing to uuid 92b4ea9d is blacklisted, skipping
151119 22:19:41 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://IP隐藏:4567 
151119 22:19:41 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' pointing to uuid 92b4ea9d is blacklisted, skipping
151119 22:19:42 [Note] WSREP: declaring 4580556b at tcp://IP隐藏:4567 stable
151119 22:19:42 [Note] WSREP: declaring 64914067 at tcp://IP隐藏:4567 stable
151119 22:19:42 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' pointing to uuid 92b4ea9d is blacklisted, skipping
151119 22:19:42 [Note] WSREP: Node 4580556b state prim
151119 22:19:42 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' pointing to uuid 92b4ea9d is blacklisted, skipping
151119 22:19:42 [Note] WSREP: (92b4ea9d, 'tcp://0.0.0.0:4567') address 'tcp://IP隐藏:4567' pointing to uuid 92b4ea9d is blacklisted, skipping
151119 22:19:42 [Note] WSREP: view(view_id(PRIM,4580556b,48) memb {
    4580556b,0
    64914067,0
    92b4ea9d,0
} joined {
} left {
} partitioned {
})
151119 22:19:42 [Note] WSREP: save pc into disk
151119 22:19:42 [Note] WSREP: discarding pending addr without UUID: tcp://IP隐藏:4567
151119 22:19:42 [Note] WSREP: discarding pending addr proto entry 0x1d16450
151119 22:19:42 [Note] WSREP: gcomm: connected
151119 22:19:42 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
151119 22:19:42 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
151119 22:19:42 [Note] WSREP: Opened channel 'LegionMariadbGaleraCluster1'
151119 22:19:42 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 2, memb_num = 3
151119 22:19:42 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
151119 22:19:42 [Note] WSREP: Waiting for SST to complete.
151119 22:19:42 [Note] WSREP: STATE EXCHANGE: sent state msg: 936ecf63-8ec8-11e5-bdc6-5f35b34c1bd4
151119 22:19:42 [Note] WSREP: STATE EXCHANGE: got state msg: 936ecf63-8ec8-11e5-bdc6-5f35b34c1bd4 from 0 (node1)
151119 22:19:42 [Note] WSREP: STATE EXCHANGE: got state msg: 936ecf63-8ec8-11e5-bdc6-5f35b34c1bd4 from 1 (node3)
151119 22:19:42 [Note] WSREP: STATE EXCHANGE: got state msg: 936ecf63-8ec8-11e5-bdc6-5f35b34c1bd4 from 2 (node10)
151119 22:19:42 [Note] WSREP: Quorum results:
    version    = 3,
    component  = PRIMARY,
    conf_id    = 43,
    members    = 2/3 (joined/total),
    act_id     = 113579,
    last_appl. = -1,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = 92a4834f-7ee8-11e5-963f-a24c2a696910
151119 22:19:42 [Note] WSREP: Flow-control interval: [28, 28]
151119 22:19:42 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 113579)
151119 22:19:42 [Note] WSREP: State transfer required: 
    Group state: 92a4834f-7ee8-11e5-963f-a24c2a696910:113579
    Local state: 00000000-0000-0000-0000-000000000000:-1
151119 22:19:42 [Note] WSREP: New cluster view: global state: 92a4834f-7ee8-11e5-963f-a24c2a696910:113579, view# 44: Primary, number of nodes: 3, my index: 2, protocol version 3
151119 22:19:42 [Warning] WSREP: Gap in state sequence. Need state transfer.
151119 22:19:42 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address 'IP隐藏' --auth 'cluster:nicaia' --datadir '/data/mariadb/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --parent '1' --binlog 'mysql-bin' '
151119 22:19:43 [Note] WSREP: Prepared SST request: rsync|IP隐藏:4444/rsync_sst
151119 22:19:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
151119 22:19:43 [Note] WSREP: REPL Protocols: 7 (3, 2)
151119 22:19:43 [Note] WSREP: Service thread queue flushed.
151119 22:19:43 [Note] WSREP: Assign initial position for certification: 113579, protocol version: 3
151119 22:19:43 [Note] WSREP: Service thread queue flushed.
151119 22:19:43 [Warning] WSREP: Failed to prepare for incremental state transfer: Local state UUID (00000000-0000-0000-0000-000000000000) does not match group state UUID (92a4834f-7ee8-11e5-963f-a24c2a696910): 1 (Operation not permitted)
     at galera/src/replicator_str.cpp:prepare_for_IST():456. IST will be unavailable.
151119 22:19:43 [Note] WSREP: Member 2.0 (node10) requested state transfer from '*any*'. Selected 0.0 (node1)(SYNCED) as donor.
151119 22:19:43 [Note] WSREP: Shifting PRIMARY -> JOINER (TO: 113579)
151119 22:19:43 [Note] WSREP: Requesting state transfer: success, donor: 0
151119 22:19:43 [Warning] WSREP: 0.0 (node1): State transfer to 2.0 (node10) failed: -255 (Unknown error 255)
151119 22:19:43 [ERROR] WSREP: gcs/src/gcs_group.cpp:int gcs_group_handle_join_msg(gcs_group_t*, const gcs_recv_msg_t*)():731: Will never receive state. Need to abort.
151119 22:19:43 [Note] WSREP: gcomm: terminating thread
151119 22:19:43 [Note] WSREP: gcomm: joining thread
151119 22:19:43 [Note] WSREP: gcomm: closing backend
151119 22:19:44 [Note] WSREP: view(view_id(NON_PRIM,4580556b,48) memb {
    92b4ea9d,0
} joined {
} left {
} partitioned {
    4580556b,0
    64914067,0
})
151119 22:19:44 [Note] WSREP: view((empty))
151119 22:19:44 [Note] WSREP: gcomm: closed
151119 22:19:44 [Note] WSREP: mysqld: Terminated.
151119 22:19:44 [ERROR] mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

Server version: 10.0.21-MariaDB-wsrep-log
key_buffer_size=0
read_buffer_size=524288
max_used_connections=0
max_threads=102
thread_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 106485 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x0 thread_stack 0x48000
addr2line: 'mysqld': No such file
mysqld(my_print_stacktrace+0x29)[0xae93f9]
mysqld(handle_fatal_signal+0x40a)[0x70005a]
/lib64/libpthread.so.0(+0xf790)[0x7f8a67dcd790]
/lib64/libc.so.6(abort+0xd4)[0x7f8a66415d64]
/usr/lib64/galera/libgalera_smm.so(+0xd0ba7)[0x7f8a62a30ba7]
/usr/lib64/galera/libgalera_smm.so(_Z13gcs_core_recvP8gcs_coreP12gcs_act_rcvdx+0x2e9)[0x7f8a62b29089]
/usr/lib64/galera/libgalera_smm.so(+0x1cf630)[0x7f8a62b2f630]
/lib64/libpthread.so.0(+0x7a51)[0x7f8a67dc5a51]
/lib64/libc.so.6(clone+0x6d)[0x7f8a664ca93d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

这他妈逆天到了要提交bug去官方么?,我还就不信邪,于是慢慢的各种排查。。。。最后定位问题出现在了MySQL初始化sql语句上,如果之前有人看了我构建MariaDB Galera cluster镜像的三个文件,那么会发现MariaDB的初始化是用--init-file=参数带sql脚本来实现的,其中做sql安全配置的第一段就是

-- What's done in this file shouldn't be replicated
--  or products like mysql-fabric won't work
SET @@SESSION.SQL_LOG_BIN=0;
 
DELETE FROM mysql.user;
CREATE USER 'root'@'%' IDENTIFIED BY '${MYSQL_ROOT_PASSWORD}';
GRANT ALL ON *.* TO 'root'@'%' WITH GRANT OPTION;
DROP DATABASE IF EXISTS test;

问题就是出在刚买命令的高亮行上。不多解释了,有兴趣的朋友可以自己试试,为什么会出现问题,最后我把这地调整成为了

-- What's done in this file shouldn't be replicated
--  or products like mysql-fabric won't work
SET @@SESSION.SQL_LOG_BIN=0;
         
DELETE FROM mysql.user;
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '${MYSQL_ROOT_PASSWORD}' WITH GRANT OPTION;
DROP DATABASE IF EXISTS test;

然后重新构建镜像,于是好了

docker run -d --name=mariadb_cluster_node3 \
-p 3306:3306 -p 4444:4444 -p 4567-4568:4567-4568 \
-p 4444:4444/udp -p 4567-4568:4567-4568/udp \
-v /data/mariadb:/data/mariadb \
-e TIMEZONE=Asia/Shanghai \
-e MYSQL_ROOT_PASSWORD=nicaia \
-e REPLICATION_USERNAME=cluster \
-e REPLICATION_PASSWORD=nicaia \
-e MAX_CONNECTIONS=20 \
-e LOG_BIN=mysql-bin \
-e SYNC_BINLOG=1 \
-e GALERA=ON \
-e NODE_NAME=node11 \
-e CLUSTER_METHOD=rsync \
-e CLUSTER_NAME=nicaia \
-e CLUSTER_NODE_ADDRESS=1.2.3.4 \
-e CLUSTER_ADDRESS="gcomm://1.2.3.4,1.2.3.5,1.2.3.6" \
benyoo/mariadb_build:latest

这是我最后成功的启动命令
bug排查之空想导致的祸之MariaDB Galera Cluster的docker镜像构建中的遇到的坑爹事

记录下这次bug排查所发现的几个好的网页:
http://galeracluster.com/documentation-webpages/mysqlwsrepoptions.html
http://www.cnblogs.com/Richardzhu/p/3318595.html
http://dba.stackexchange.com/questions/105459/percona-mysqld-got-signal-11?answertab=votes#tab-top
http://dev.mysql.com/doc/refman/5.6/en/update.html

lookback

发表评论

:?: :razz: :sad: :evil: :!: :smile: :oops: :grin: :eek: :shock: :???: :cool: :lol: :mad: :twisted: :roll: :wink: :idea: :arrow: :neutral: :cry: :mrgreen: