2014年10月28日星期二

Innodb engine Waiting for table level lock

show processlist 看到大量的 Waiting for table level lock

 video_file set dmd5='3b6950150e45488268431b612c81eb1e', filesize=2790658, totaltime=261625, v |         0 |             0 |         1 |
| 2090679959 | 58bb520         | 10.11.80.84:16894   | 56_flash_flv | Query       |     1047 | Waiting for table level lock                                          | update video_file set dmd5='8242e23856e6fe9457a41f26076d1cec', filesize=7871452, totaltime=304600, v |         0 |             0 |         1 |
| 2090679973 | 58bb520         | 10.11.80.82:29175   | 56_flash_flv | Query       |     1012 | Waiting for table level lock                                          | update video_file set dmd5='aa859956958e7aa4c86bd3b0b3b3133b', filesize=10095169, totaltime=286000,  |         0 |             0 |         1 |
| 2090679997 | 58bb520         | 10.11.80.117:37906  | 56_flash_flv | Query       |     1023 | Waiting for table level lock                                          | update video_file set dmd5='5152806cba75a0c659ccb684ae88f3d6', filesize=7750179, totaltime=107880, v |         0 |             0 |         1 |
| 2090680656 | c26140          | 10.11.80.117:38628  | 56_flash_flv | Query       |     1098 | Writing to net                                                        | SELECT /*!40001 SQL_NO_CACHE */ * FROM `flash_flv_date`                                              |  13222420 |             0 |  13222420 |
| 2090680664 | 58bb520         | 10.11.80.117:38634  | 56_flash_flv | Query       |     1098 | Waiting for table level lock                                          | insert into `flash_flv_date` set

表是 innodb 的表,这个可以肯定,如果是对表进行修改操作,会出现 metadata 的锁表等待,但是和平常看到的不太一样。

从 processlist 输出注意到  sql 中的注释 /*!40001 SQL_NO_CACHE */  来看,是在导数据,并且它的执行时间最长,导致大量更新表操作失败。

innodb 是支持事物型的,直接锁住整个表的情况不合理,另外即便是锁表的话,最常见的也是这种:

Waiting for table metadata lock


为了验证来源,进行测试,在 mysqldump 的时候增加 --lock-all-tables 发现看到的信息是这样的,和实际的情况有些不一致

 Waiting for global read lock

通过 grep mysqldump 的 help 输出,使用 --add-locks 参数验证,在 state 列的输出重看到 table level lock 的提示:

Waiting for table level lock

2014年10月23日星期四

mysql-mmm-server 启动失败

[root@ZHAOW80-34-DX-APP mysql-mmm]# /etc/init.d/mysql-mmm-vdb start
Daemon bin: '/usr/sbin/mmm_mond @vdb'
Daemon pid: '/var/run/mmm_mond-vdb.pid'
Starting MMM Monitor daemon: Configuration file /etc/mysql-mmm/mmm_common_vdb.conf is world readable! at /usr/share/perl5/vendor_perl/MMM/Common/Config.pm line 132
failed


解决办法:
chmod 640 /etc/mysql-mmm/*


2014/08/07 15:31:02 DEBUG Daemon: Answer = 'ERROR: Invalid hostname in command (album19431)! My name is 'vcomment19431''


重启服务,检查各配置文件是否存在旧的残留

haproxy 代理 http 出现 502 分析

收到一个 502 问题反馈:

# curl "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.61:80

<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>

但是直接请求 nginx 是没有问题的,增加 -I 参数对比:

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.61:80

HTTP/1.0 502 Bad Gateway
Cache-Control: no-cache
Connection: close
Content-Type: text/html

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.67:80

HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 14:57:16 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
-abm@11583750: 0
-hfopn: 1
-abvids@11583750@0: 0

后端 nginx 的日志,可以证明两次产生的 access.log 都是相同的 HTTP 200 。初步确认问题出在 haproxy 上。
10.62.80.67 info.album.56.com - [08/Aug/2014:15:48:44 +0800] HEAD http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c HTTP/1.1 "200" 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.15.3 zlib/1.2.3 libidn/1.18 libssh2/1.4.2" "-" "0.007"
10.62.80.61 info.album.56.com - [08/Aug/2014:15:48:53 +0800] HEAD http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c HTTP/1.1 "200" 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2" "10.62.80.61" "0.008"


折腾:换了几个不同版本的 haproxy,又升级了 pcre 库,发现问题依旧。

回到原点:通过上面的对比,发现请求通过 haproxy 代理访问和 使用 nginx 直接访问的不同之处是代码中自定义的 header 没有了。
但不能直接就此定位为 header 原因导致,有可能是其它地方中断导致后面的 header 不完整。在未下定论之前,把我们
怀疑的重点逐个排除。

为了确认问题的症状,我逐个去掉 url 中的参数测试,从 dy=c 中就发现不一样:
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 15:55:27 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 15:55:34 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a

经过多次验证确认,不带 dy=c 的情况下,haproxy 是正常的,难道这个 dy=c 触发 haproxy 什么 bug ?
让晓杰帮忙看了一下 haproxy 源码,也没有很好定论,讨论了一份 tcpdump 抓包的结果:

 tcpdump -i eth0 port 80 and host info.album.56.com -s 0 -n

-------------------200------------------------------------
17:59:51.097947 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [S], seq 3664157090, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
17:59:51.098063 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [S.], seq 3363980584, ack 3664157091, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
17:59:51.098077 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [.], ack 1, win 457, length 0
17:59:51.098093 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [P.], seq 1:349, ack 1, win 457, length 348
17:59:51.098248 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [.], ack 349, win 490, length 0
17:59:51.106903 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [P.], seq 1:222, ack 349, win 490, length 221
17:59:51.106914 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [.], ack 222, win 490, length 0
17:59:51.106919 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [F.], seq 222, ack 349, win 490, length 0
17:59:51.107322 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [F.], seq 349, ack 223, win 490, length 0
17:59:51.107411 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [.], ack 350, win 490, length 0

-------------------502------------------------------------
18:00:04.856421 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [S], seq 2761391685, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
18:00:04.856612 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [S.], seq 2370805417, ack 2761391686, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
18:00:04.856624 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [.], ack 1, win 457, length 0
18:00:04.856633 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [P.], seq 1:348, ack 1, win 457, length 347
18:00:04.856769 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [.], ack 348, win 490, length 0
18:00:04.936327 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [P.], seq 1:296, ack 348, win 490, length 295
18:00:04.936360 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [.], ack 296, win 490, length 0
18:00:04.936366 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [F.], seq 296, ack 348, win 490, length 0
18:00:04.936398 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [R.], seq 348, ack 297, win 490, length 0

从抓包的结果看,tcp 的连接是被 reset,没有正常的 ack,导致 502 的发生。附一段关于 tcp Flag 位 R 的说明:
An RST is a type of TCP segment that is sent by TCP when something is wrong. Three conditions that generate an RST are: when a SYN arrives for a port that has no listening server, when TCP wants to abort an existing connection, and when TCP receives a segment for a connection that does not exist. (TCPv1 [pp. 246–250] contains additional information.)

从响应时间上讲,这是一个非正常的 502 超时或者无响应导致的错误问题,而是被终止的问题。这个问题出现在 haproxy 内部,问题不在网络
传输上,还是只能证明问题出现在 haproxy 上。

想了想,开启 haproxy 的终端模式 debug:

# haproxy -d -f /etc/haproxy/haproxy.cfg

在另外机器上分别测试一条包含 dy=c 和 不包含 dy=c 参数,可以对比发现,输出完全一样:

00000006:http-in.accept(0004)=0006 from [10.62.80.61:34201]
00000006:http-in.clireq[0006:ffff]: GET http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c HTTP/1.1
00000006:http-in.clihdr[0006:ffff]: User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
00000006:http-in.clihdr[0006:ffff]: Host: info.album.56.com
00000006:http-in.clihdr[0006:ffff]: Accept: */*
00000006:http-in.clihdr[0006:ffff]: Proxy-Connection: Keep-Alive
00000006:infoalbumpool.srvrep[0006:0007]: HTTP/1.1 200 OK
00000006:infoalbumpool.srvhdr[0006:0007]: Server: ngx_openresty
00000006:infoalbumpool.srvhdr[0006:0007]: Date: Fri, 08 Aug 2014 07:31:38 GMT
00000006:infoalbumpool.srvhdr[0006:0007]: Content-Type: text/html; charset=UTF-8
00000006:infoalbumpool.srvhdr[0006:0007]: Transfer-Encoding: chunked
00000006:infoalbumpool.srvhdr[0006:0007]: Connection: close
00000006:infoalbumpool.srvhdr[0006:0007]: Cache-Control: no-cache
00000006:infoalbumpool.srvhdr[0006:0007]: Expires: Mon, 20 Jul 1999 23:00:00 GMT
00000006:infoalbumpool.srvhdr[0006:0007]: lualu: 20140730a
00000006:infoalbumpool.srvcls[0006:0007]
00000006:infoalbumpool.clicls[0006:0007]
00000006:infoalbumpool.closed[0006:0007]
00000007:http-in.accept(0004)=0006 from [10.62.80.61:34223]
00000007:http-in.clireq[0006:ffff]: GET http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20 HTTP/1.1
00000007:http-in.clihdr[0006:ffff]: User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
00000007:http-in.clihdr[0006:ffff]: Host: info.album.56.com
00000007:http-in.clihdr[0006:ffff]: Accept: */*
00000007:http-in.clihdr[0006:ffff]: Proxy-Connection: Keep-Alive
00000007:infoalbumpool.srvrep[0006:0007]: HTTP/1.1 200 OK
00000007:infoalbumpool.srvhdr[0006:0007]: Server: ngx_openresty
00000007:infoalbumpool.srvhdr[0006:0007]: Date: Fri, 08 Aug 2014 07:31:44 GMT
00000007:infoalbumpool.srvhdr[0006:0007]: Content-Type: text/html; charset=UTF-8
00000007:infoalbumpool.srvhdr[0006:0007]: Transfer-Encoding: chunked
00000007:infoalbumpool.srvhdr[0006:0007]: Connection: close
00000007:infoalbumpool.srvhdr[0006:0007]: Cache-Control: no-cache
00000007:infoalbumpool.srvhdr[0006:0007]: Expires: Mon, 20 Jul 1999 23:00:00 GMT
00000007:infoalbumpool.srvhdr[0006:0007]: lualu: 20140730a
00000007:infoalbumpool.srvcls[0006:0007]
00000007:infoalbumpool.clicls[0006:0007]
00000007:infoalbumpool.closed[0006:0007]

从上面两处返回的最后的 header 都是 lualu:20140730a,回顾前面的测试,看 dy=c 直接用 nginx 返回 和不带 dy=c  nginx 分别返回的
header 是  -abvids@11583750@0: 0   和 lualu: 20140730a 。因此 debug 带 dy=c 的是不完整的,不带 dy=c 的 url 是完整的,所以最终
返回的一个正常,一个不正常。

竟然问题出在 dy=c 上,再看一下程序代码中 关于 dy=c 触发的一些条件:
# grep "dy == 'c'" -r . |grep -v svn
./model/album.lua:            if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' then
./model/album.lua:            if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua:            if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua:            if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua:            if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua:            if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./lib/hosts.lua:                if ngx.var.arg_dy == 'c' then
./ainfo.lua:if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' and ngx.var.host == config:get('WEB_HOST') then

看一下 ainfo.lua 和 module/album.lua 文件相关部分,截取其中一段 abm@:

    if aid then
        local rkey = 'abm@' .. aid
        local rds = self:_getRedis()

        --[[ try to fetch from redis ]]
        local tmpdata
        if rds then
            if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' then
                rds:del(rkey)
            else
                tmpdata = rds:get(rkey)
                if tmpdata then
                    tmpdata = cjson.decode(tmpdata)
                end
            end
        end

        if tmpdata then
            data = tmpdata
        else
            header['-' .. rkey] = 0

dy=c 触发了 rkey 增加这一条 header:-abm@11583750: 0。 
是否是关键字引起的呢?首先怀疑到的是 - 。将涉及到的几个字段的 - 都去掉,测试发现问题依旧。
再次怀疑另外一个符号 @ ,同样的方法将其也去掉,再次测试,发现正常了:

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 01:38:47 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids115837500: 0

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 01:38:55 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids115837500: 0


所以问题的根本原因是触发了定义好的特殊字符,问题到这里可以确定改改 lua 程序代码就可以解决。

反正失眠,不如进一步看看是哪里的原因导致,根据符号 @ 的线索,
到 haproxy 的源码目录 grep 关键字符 @ ,找到了相关文件 proto_http.c, 定义的关键字符代码如下:
/* It is about twice as fast on recent architectures to lookup a byte in a
 * table than to perform a boolean AND or OR between two tests. Refer to
 * RFC2616 for those chars.
 */
const char http_is_sep[256] = {
        ['('] = 1, [')']  = 1, ['<']  = 1, ['>'] = 1,
        ['@'] = 1, [',']  = 1, [';']  = 1, [':'] = 1,
        ['"'] = 1, ['/']  = 1, ['[']  = 1, [']'] = 1,
        ['{'] = 1, ['}']  = 1, ['?']  = 1, ['='] = 1,
        [' '] = 1, ['\t'] = 1, ['\\'] = 1,
};
 .......................................
const char http_is_token[256] = {
        [' '] = 0, ['!'] = 1, ['"'] = 0, ['#'] = 1,
        ['$'] = 1, ['%'] = 1, ['&'] = 1, ['\''] = 1,
        ['('] = 0, [')'] = 0, ['*'] = 1, ['+'] = 1,
        [','] = 0, ['-'] = 1, ['.'] = 1, ['/'] = 0,
        ['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1,
        ['4'] = 1, ['5'] = 1, ['6'] = 1, ['7'] = 1,
        ['8'] = 1, ['9'] = 1, [':'] = 0, [';'] = 0,
        ['<'] = 0, ['='] = 0, ['>'] = 0, ['?'] = 0,
        ['@'] = 0, ['A'] = 1, ['B'] = 1, ['C'] = 1,
        ['D'] = 1, ['E'] = 1, ['F'] = 1, ['G'] = 1,
        ['H'] = 1, ['I'] = 1, ['J'] = 1, ['K'] = 1,
        ['L'] = 1, ['M'] = 1, ['N'] = 1, ['O'] = 1,
.......................................
        ['|'] = 1, ['}'] = 0, ['~'] = 1,
};

分隔符和 token 这两部分决定了不能作为普通字符使用,特别是关于遵循 rfc2616 定义的规则,这也就是问题的源头。修改源码,
将这两部分的 @ 定义的值置反,重新编译了一次 haproxy,再次测试:

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 02:09:36 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids@115837500: 0

# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 02:11:44 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids@115837500: 0

这样 @ 也可以直接兼容了。
rfc2616 相关文档链接:http://tools.ietf.org/html/rfc2616