收到一个 502 问题反馈:
# curl "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.61:80
<html><body><h1>502 Bad Gateway</h1>
The server returned an invalid or incomplete response.
</body></html>
但是直接请求 nginx 是没有问题的,增加 -I 参数对比:
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.61:80
HTTP/1.0 502 Bad Gateway
Cache-Control: no-cache
Connection: close
Content-Type: text/html
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 14:57:16 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
-abm@11583750: 0
-hfopn: 1
-abvids@11583750@0: 0
后端 nginx 的日志,可以证明两次产生的 access.log 都是相同的 HTTP 200 。初步确认问题出在 haproxy 上。
10.62.80.67 info.album.56.com - [08/Aug/2014:15:48:44 +0800] HEAD http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c HTTP/1.1 "200" 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.15.3 zlib/1.2.3 libidn/1.18 libssh2/1.4.2" "-" "0.007"
10.62.80.61 info.album.56.com - [08/Aug/2014:15:48:53 +0800] HEAD http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c HTTP/1.1 "200" 0 "-" "curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2" "10.62.80.61" "0.008"
折腾:换了几个不同版本的 haproxy,又升级了 pcre 库,发现问题依旧。
回到原点:通过上面的对比,发现请求通过 haproxy 代理访问和 使用 nginx 直接访问的不同之处是代码中自定义的 header 没有了。
但不能直接就此定位为 header 原因导致,有可能是其它地方中断导致后面的 header 不完整。在未下定论之前,把我们
怀疑的重点逐个排除。
为了确认问题的症状,我逐个去掉 url 中的参数测试,从 dy=c 中就发现不一样:
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 15:55:27 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 08 Aug 2014 15:55:34 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
经过多次验证确认,不带 dy=c 的情况下,haproxy 是正常的,难道这个 dy=c 触发 haproxy 什么 bug ?
让晓杰帮忙看了一下 haproxy 源码,也没有很好定论,讨论了一份 tcpdump 抓包的结果:
tcpdump -i eth0 port 80 and host info.album.56.com -s 0 -n
-------------------200------------------------------------
17:59:51.097947 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [S], seq 3664157090, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
17:59:51.098063 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [S.], seq 3363980584, ack 3664157091, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
17:59:51.098077 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [.], ack 1, win 457, length 0
17:59:51.098093 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [P.], seq 1:349, ack 1, win 457, length 348
17:59:51.098248 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [.], ack 349, win 490, length 0
17:59:51.106903 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [P.], seq 1:222, ack 349, win 490, length 221
17:59:51.106914 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [.], ack 222, win 490, length 0
17:59:51.106919 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [F.], seq 222, ack 349, win 490, length 0
17:59:51.107322 IP 10.62.80.61.62329 > 10.62.80.67.http: Flags [F.], seq 349, ack 223, win 490, length 0
17:59:51.107411 IP 10.62.80.67.http > 10.62.80.61.62329: Flags [.], ack 350, win 490, length 0
-------------------502------------------------------------
18:00:04.856421 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [S], seq 2761391685, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
18:00:04.856612 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [S.], seq 2370805417, ack 2761391686, win 14600, options [mss 1460,nop,nop,sackOK,nop,wscale 5], length 0
18:00:04.856624 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [.], ack 1, win 457, length 0
18:00:04.856633 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [P.], seq 1:348, ack 1, win 457, length 347
18:00:04.856769 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [.], ack 348, win 490, length 0
18:00:04.936327 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [P.], seq 1:296, ack 348, win 490, length 295
18:00:04.936360 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [.], ack 296, win 490, length 0
18:00:04.936366 IP 10.62.80.67.http > 10.62.80.61.62502: Flags [F.], seq 296, ack 348, win 490, length 0
18:00:04.936398 IP 10.62.80.61.62502 > 10.62.80.67.http: Flags [R.], seq 348, ack 297, win 490, length 0
从抓包的结果看,tcp 的连接是被 reset,没有正常的 ack,导致 502 的发生。附一段关于 tcp Flag 位 R 的说明:
An RST is a type of TCP segment that is sent by TCP when something is wrong. Three conditions that generate an RST are: when a SYN arrives for a port that has no listening server, when TCP wants to abort an existing connection, and when TCP receives a segment for a connection that does not exist. (TCPv1 [pp. 246–250] contains additional information.)
从响应时间上讲,这是一个非正常的 502 超时或者无响应导致的错误问题,而是被终止的问题。这个问题出现在 haproxy 内部,问题不在网络
传输上,还是只能证明问题出现在 haproxy 上。
想了想,开启 haproxy 的终端模式 debug:
# haproxy -d -f /etc/haproxy/haproxy.cfg
在另外机器上分别测试一条包含 dy=c 和 不包含 dy=c 参数,可以对比发现,输出完全一样:
00000006:http-in.accept(0004)=0006 from [10.62.80.61:34201]
00000006:http-in.clireq[0006:ffff]: GET http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20&dy=c HTTP/1.1
00000006:http-in.clihdr[0006:ffff]: User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
00000006:http-in.clihdr[0006:ffff]: Host: info.album.56.com
00000006:http-in.clihdr[0006:ffff]: Accept: */*
00000006:http-in.clihdr[0006:ffff]: Proxy-Connection: Keep-Alive
00000006:infoalbumpool.srvrep[0006:0007]: HTTP/1.1 200 OK
00000006:infoalbumpool.srvhdr[0006:0007]: Server: ngx_openresty
00000006:infoalbumpool.srvhdr[0006:0007]: Date: Fri, 08 Aug 2014 07:31:38 GMT
00000006:infoalbumpool.srvhdr[0006:0007]: Content-Type: text/html; charset=UTF-8
00000006:infoalbumpool.srvhdr[0006:0007]: Transfer-Encoding: chunked
00000006:infoalbumpool.srvhdr[0006:0007]: Connection: close
00000006:infoalbumpool.srvhdr[0006:0007]: Cache-Control: no-cache
00000006:infoalbumpool.srvhdr[0006:0007]: Expires: Mon, 20 Jul 1999 23:00:00 GMT
00000006:infoalbumpool.srvhdr[0006:0007]: lualu: 20140730a
00000006:infoalbumpool.srvcls[0006:0007]
00000006:infoalbumpool.clicls[0006:0007]
00000006:infoalbumpool.closed[0006:0007]
00000007:http-in.accept(0004)=0006 from [10.62.80.61:34223]
00000007:http-in.clireq[0006:ffff]: GET http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&vidlist=1&orderby=4&page=2&rows=20 HTTP/1.1
00000007:http-in.clihdr[0006:ffff]: User-Agent: curl/7.19.7 (x86_64-redhat-linux-gnu) libcurl/7.19.7 NSS/3.13.1.0 zlib/1.2.3 libidn/1.18 libssh2/1.2.2
00000007:http-in.clihdr[0006:ffff]: Host: info.album.56.com
00000007:http-in.clihdr[0006:ffff]: Accept: */*
00000007:http-in.clihdr[0006:ffff]: Proxy-Connection: Keep-Alive
00000007:infoalbumpool.srvrep[0006:0007]: HTTP/1.1 200 OK
00000007:infoalbumpool.srvhdr[0006:0007]: Server: ngx_openresty
00000007:infoalbumpool.srvhdr[0006:0007]: Date: Fri, 08 Aug 2014 07:31:44 GMT
00000007:infoalbumpool.srvhdr[0006:0007]: Content-Type: text/html; charset=UTF-8
00000007:infoalbumpool.srvhdr[0006:0007]: Transfer-Encoding: chunked
00000007:infoalbumpool.srvhdr[0006:0007]: Connection: close
00000007:infoalbumpool.srvhdr[0006:0007]: Cache-Control: no-cache
00000007:infoalbumpool.srvhdr[0006:0007]: Expires: Mon, 20 Jul 1999 23:00:00 GMT
00000007:infoalbumpool.srvhdr[0006:0007]: lualu: 20140730a
00000007:infoalbumpool.srvcls[0006:0007]
00000007:infoalbumpool.clicls[0006:0007]
00000007:infoalbumpool.closed[0006:0007]
从上面两处返回的最后的 header 都是 lualu:20140730a,回顾前面的测试,看 dy=c 直接用 nginx 返回 和不带 dy=c nginx 分别返回的
header 是 -abvids@11583750@0: 0 和 lualu: 20140730a 。因此 debug 带 dy=c 的是不完整的,不带 dy=c 的 url 是完整的,所以最终
返回的一个正常,一个不正常。
竟然问题出在 dy=c 上,再看一下程序代码中 关于 dy=c 触发的一些条件:
# grep "dy == 'c'" -r . |grep -v svn
./model/album.lua: if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' then
./model/album.lua: if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua: if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua: if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua: if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./model/album.lua: if ngx.var.arg_dy == 'c' or ngx.var.arg_dy == 'cc' then
./lib/hosts.lua: if ngx.var.arg_dy == 'c' then
./ainfo.lua:if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' and ngx.var.host == config:get('WEB_HOST') then
看一下 ainfo.lua 和 module/album.lua 文件相关部分,截取其中一段 abm@:
if aid then
local rkey = 'abm@' .. aid
local rds = self:_getRedis()
--[[ try to fetch from redis ]]
local tmpdata
if rds then
if ngx.var.arg_dy == 'cc' or ngx.var.arg_dy == 'c' then
rds:del(rkey)
else
tmpdata = rds:get(rkey)
if tmpdata then
tmpdata = cjson.decode(tmpdata)
end
end
end
if tmpdata then
data = tmpdata
else
header['-' .. rkey] = 0
dy=c 触发了 rkey 增加这一条 header:-abm@11583750: 0。
是否是关键字引起的呢?首先怀疑到的是 - 。将涉及到的几个字段的 - 都去掉,测试发现问题依旧。
再次怀疑另外一个符号 @ ,同样的方法将其也去掉,再次测试,发现正常了:
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 01:38:47 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids115837500: 0
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 01:38:55 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids115837500: 0
所以问题的根本原因是触发了定义好的特殊字符,问题到这里可以确定改改 lua 程序代码就可以解决。
反正失眠,不如进一步看看是哪里的原因导致,根据符号 @ 的线索,
到 haproxy 的源码目录 grep 关键字符 @ ,找到了相关文件 proto_http.c, 定义的关键字符代码如下:
/* It is about twice as fast on recent architectures to lookup a byte in a
* table than to perform a boolean AND or OR between two tests. Refer to
* RFC2616 for those chars.
*/
const char http_is_sep[256] = {
['('] = 1, [')'] = 1, ['<'] = 1, ['>'] = 1,
['@'] = 1, [','] = 1, [';'] = 1, [':'] = 1,
['"'] = 1, ['/'] = 1, ['['] = 1, [']'] = 1,
['{'] = 1, ['}'] = 1, ['?'] = 1, ['='] = 1,
[' '] = 1, ['\t'] = 1, ['\\'] = 1,
};
.......................................
const char http_is_token[256] = {
[' '] = 0, ['!'] = 1, ['"'] = 0, ['#'] = 1,
['$'] = 1, ['%'] = 1, ['&'] = 1, ['\''] = 1,
['('] = 0, [')'] = 0, ['*'] = 1, ['+'] = 1,
[','] = 0, ['-'] = 1, ['.'] = 1, ['/'] = 0,
['0'] = 1, ['1'] = 1, ['2'] = 1, ['3'] = 1,
['4'] = 1, ['5'] = 1, ['6'] = 1, ['7'] = 1,
['8'] = 1, ['9'] = 1, [':'] = 0, [';'] = 0,
['<'] = 0, ['='] = 0, ['>'] = 0, ['?'] = 0,
['@'] = 0, ['A'] = 1, ['B'] = 1, ['C'] = 1,
['D'] = 1, ['E'] = 1, ['F'] = 1, ['G'] = 1,
['H'] = 1, ['I'] = 1, ['J'] = 1, ['K'] = 1,
['L'] = 1, ['M'] = 1, ['N'] = 1, ['O'] = 1,
.......................................
['|'] = 1, ['}'] = 0, ['~'] = 1,
};
分隔符和 token 这两部分决定了不能作为普通字符使用,特别是关于遵循 rfc2616 定义的规则,这也就是问题的源头。修改源码,
将这两部分的 @ 定义的值置反,重新编译了一次 haproxy,再次测试:
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.61:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 02:09:36 GMT
Content-Type: text/html; charset=UTF-8
Connection: close
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids@115837500: 0
# curl -I "http://info.album.56.com/?a=albuminfo&aid=11583750&vids=&dy=c" -x 10.62.80.67:80
HTTP/1.1 200 OK
Server: ngx_openresty
Date: Sat, 09 Aug 2014 02:11:44 GMT
Content-Type: text/html; charset=UTF-8
Connection: keep-alive
Cache-Control: no-cache
Expires: Mon, 20 Jul 1999 23:00:00 GMT
lualu: 20140730a
abm11583750: 0
-hfopn: 1
-abvids@115837500: 0
这样 @ 也可以直接兼容了。
rfc2616 相关文档链接:http://tools.ietf.org/html/rfc2616