之所以没有昨天发wiki,是原本想再次确认一下和前天一样能够在tcpcopy 压测一段时间后看到效果,但是
昨天的测试持续好几个小时,问题不会重现;同样,我对报错的请求进行循环测试,依然没有问题。也就是
从昨天开始,缓存建立到一定程度后,这个bug 如果没有确认的话,估计就又隐藏起来了。
故障描述
从 10 月底到 11 月 22 号之前,info.v 使用 lua 取代 php 后,从数据监控图上看,性能提高至少一倍以上。
由于11 月 22 号一次清缓存,加节点的调整,导致nginx worker cpu 一段时间 100% 后无响应(即便无访问
情况,cpu 仍然100%),web 请求随之也无响应。
出现问题后的想法:
1. 缓存大的变动,导致缓存读库后响应慢引起;
2. 代码是否有bug,进程死循环了;
初步做法如下:
1. 为了先临时解决问题,通过脚本监控重启nginx,在 worker 都死完之前重启保证服务影响时间尽量短;
2. 和昊哥一再确认服务器配置和代码的修改都没有问题;验证缓存和库的访问是正常。
3. 检查access log 访问量,和前一天相差不大,error log 无变化,但仍在对可疑error 进行逐步修复。
Result: cpu 100% 问题依然复发。
商讨之后第二步做法:
1. 再次检查代码逻辑,是否有逻辑bug 问题,调整相关连接超时;
2. 和php 不同的是,没有相关slow log 可以查看,只能去找有效的工具,查相关的错误/或者瓶颈所在。
Result:代码上线了好几个版本,修复了一些错误,增加退出的逻辑,没有任何效果;
查找相关工具,openresty systemtap tools 能有效的查看信息,但需要内核相关软件支持,
systemtap 2.1 kerneldebuginfo ,由于线上服务器内核版本都是非rpm 包,想使用该版本,需要
集团那边协助才能安装。
23 号晚上,昊继续修复 nginx 中的 error;我在空闲测试机上搭建systemap 环境,使用tcpcopy 独立测试;
使用系统命令工具:
1. lsof -p pid 100% cpu的 nginx worker,和正常的worker 对比,没有想像的存在很多文件未释放情况;
2. strace -p pid 100% cpu 的 nginx worker,系统相关调用无任何输出,是不干活了;
3. pstack 和 gdb pid 100% cpu 的 nginx worker,有了以下发现:
[root@APP ~]# pstack 18523
#0 0x00002ae5d9be1208 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#1 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#2 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#3 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#4 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#5 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#6 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#7 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#8 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#9 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#10 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#11 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#12 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#13 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#14 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#15 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#16 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#17 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#18 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#19 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#20 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#21 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#22 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#23 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#24 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#25 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#26 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#27 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#28 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#29 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#30 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#31 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#32 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#33 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#34 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#35 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#36 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#37 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#38 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#39 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#40 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#41 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#42 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#43 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#44 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#45 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#46 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#47 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#48 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#49 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#50 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#51 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#52 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#53 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#54 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#55 0x00002ae5d9be1392 in match () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#56 0x00002ae5d9be1858 in lj_cf_string_gsub () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#57 0x00002ae5d9b9cd3b in lj_BC_FUNCC () from /home/openresty//luajit/lib/libluajit-5.1.so.2
#58 0x00000000004a0ca2 in ngx_http_lua_run_thread ()
#59 0x00000000004a8634 in ngx_http_lua_socket_tcp_resume ()
#60 0x00000000004a335b in ngx_http_lua_content_wev_handler ()
#61 0x00000000004a7735 in ngx_http_lua_socket_handle_success ()
#62 0x00000000004a9959 in ngx_http_lua_socket_tcp_read ()
#63 0x00000000004aa14e in ngx_http_lua_socket_read_handler ()
#64 0x00000000004a887a in ngx_http_lua_socket_tcp_handler ()
#65 0x000000000042d14a in ngx_event_process_posted ()
#66 0x000000000042d00f in ngx_process_events_and_timers ()
#67 0x0000000000432d28 in ngx_worker_process_cycle ()
#68 0x0000000000431687 in ngx_spawn_process ()
#69 0x000000000043240e in ngx_start_worker_processes ()
#70 0x00000000004332ab in ngx_master_process_cycle ()
#71 0x00000000004181cd in main ()
(gdb) attach 18523
Attaching to process 18523
Reading symbols from /home/openresty/nginx/sbin/nginx...done.
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libcrypt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypt.so.1
Reading symbols from /usr/lib64/libdrizzle.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libdrizzle.so.0
Reading symbols from /home/openresty//luajit/lib/libluajit-5.1.so.2...done.
Loaded symbols for /home/openresty//luajit/lib/libluajit-5.1.so.2
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libssl.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libssl.so.6
Reading symbols from /lib64/libcrypto.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcrypto.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libz.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libgcc_s.so.1
Reading symbols from /usr/lib64/libgssapi_krb5.so.2...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libgssapi_krb5.so.2
Reading symbols from /usr/lib64/libkrb5.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5.so.3
Reading symbols from /lib64/libcom_err.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libcom_err.so.2
Reading symbols from /usr/lib64/libk5crypto.so.3...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libk5crypto.so.3
Reading symbols from /usr/lib64/libkrb5support.so.0...(no debugging symbols found)...done.
Loaded symbols for /usr/lib64/libkrb5support.so.0
Reading symbols from /lib64/libkeyutils.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libkeyutils.so.1
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libselinux.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libselinux.so.1
Reading symbols from /lib64/libsepol.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libsepol.so.1
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /home/openresty//lualib/cjson.so...done.
Loaded symbols for /home/openresty//lualib/cjson.so
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffcb582000
0x00002ae5d9be11c0 in match (ms=0x2d,
s=0x4270e44f "\r\n\r\n\343\200\220\345\244\247\345\236\213\346\274\224\345\207\272\347\273\217\345\216\206\343\200\221\357\274\232\r\n2005-12 \350\276\275\345\256\201\347\224\265\350\247\206\345\217\260\346\230\245\350\212\202\346\231\232\344\274\232\350\241\250\346\274\224\345\230\211\345\256\276\r\n2006-11 CCTV4\347\224\265\350\247\206\350\266\205\344\272\272\350\212\202\347\233\256\347\211\271\347\272\246\345\230\211\345\256\276\r\n2006-12 \345\217\202\345\212\240\346\271\226\345\215\227\345\215\253\350\247\206<\345\277\253\344\271\220\345\244\247\346\234\254\350\220\245>\350\212\202\347\233\256\345\275\225\345\210\266,\344\270\223\350\256\277\345\230\211\345\256\276\r\n200"..., p=<value optimized out>) at lib_string.c:451
451 }
(gdb) bt
Cannot access memory at address 0x7fffcb56b388
这个 worker 是陷于 match 函数里了,为进一步验证,我们对另外几个死掉的worker 同样做了查看,情况基本都相同,以此定位到问题出根本原因;
再回过头查看相关的nginx error log 对应 match 相关调用:
/diska/htdocs/openresty/vinfo.lua:217: in function </diska/htdocs/openresty/vinfo.lua:1>, client: 10.11.80.54, server: info.v.56.com, request: "GET /?ids=51594989,52711230,47843719,52546854,29120020,52614818,51764257,51747382,52419156,52078753,52181976,52723946,51480892,52214031,39500819,39500976,46742475,47681770,47831882,47710651,52680032,52666705,51171938,51584136,52420233,52419871,52420185,52419754,52419307,52406085,52405507,32986833,19141870,52092021,47540485,47451641,52400473,52377987,52550667,51659238,51113427,26454976,51711240,14369633&fli=id,totaltime,times,comment,public,chk_yn,public_time,user_id HTTP/1.0", host: "info.v.56.com"
一个正则表达式匹配失败报出来的error;这种由于字符escape 失败导致的一个兼容性错误之前也是有的。
这个故障也要求深入底层的c 看问题了,要不然从表面看到的,只是一个黑盒子。