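Fixing Nginx "no live upstreams" Errors in Practice

1. Background

In production, Nginx intermittently logs a large number of "no live upstreams while connecting to upstream" errors. While this is happening every API endpoint is unavailable, so the impact is severe. So far, restarting the rule-engine application has cleared the errors every time. That strongly suggests that some rule-engine endpoints hang, Nginx concludes that all upstream servers are unavailable, and it rejects every API request outright. The current upstream configuration is: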

upstream gateway_api {
    server 127.0.0.1:9099 max_fails=2 fail_timeout=15s;
    server 127.0.0.2:9099 max_fails=2 fail_timeout=15s;
}

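2. Configuration Analysis

Before analyzing the cause, let's go over the parameters related to the upstream configuration above, with reference to the parameter tables in the official documentation.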

2.1 ngx_http_proxy_module

The three parameters to focus on here are proxy_connect_timeout, proxy_send_timeout, and proxy_read_timeout. Each of them defaults to 60s.

Official documentation: http://nginx.org/en/docs/http/ngx_http_proxy_module.html#proxy_connect_timeout

2.2 upstream_server

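Again, focus on the two parameters we actually use, max_fails and fail_timeout. Our configuration means that 2 failed attempts within 15s mark the server as unavailable for the next 15s. Combine that with the 60s default of the proxy timeouts (proxy_connect_timeout, and proxy_read_timeout for a backend that accepts the connection but never answers) and the cause of the "no live upstreams while connecting to upstream" errors becomes clear.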

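3. Reproducing the Problem

From the analysis above the cause should be clear. When a request to some endpoint takes longer than 60s, Nginx treats it as timed out and counts it as a failed attempt. Once a server accumulates 2 such failures within 15 seconds (max_fails=2, fail_timeout=15s), it is marked unavailable for 15 seconds and requests are sent to the other server. When both servers have reached 2 failures within 15 seconds, the whole upstream is unavailable, which produces the "no live upstreams while connecting to upstream" error described at the beginning of this article. The steps below reproduce the problem to understand it better.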
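The failure accounting described above can be modeled in a few lines. This is a simplified sketch, not Nginx's actual implementation: the `Peer` class, the `pick` and `simulate_timeouts` helpers, and the first-available selection order are assumptions made for illustration. Only the max_fails=2 and fail_timeout=15s values come from the configuration at the top of this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Peer:
    """Simplified model of one upstream server with passive failure accounting."""
    name: str
    max_fails: int = 2          # failures inside the window that mark the peer down
    fail_timeout: float = 15.0  # window length and down time, in seconds
    failures: List[float] = field(default_factory=list)
    down_until: float = 0.0

    def available(self, now: float) -> bool:
        # max_fails=0 disables accounting, so the peer is always eligible.
        return self.max_fails == 0 or now >= self.down_until

    def record_failure(self, now: float) -> None:
        if self.max_fails == 0:
            return
        # Keep only failures that are still inside the fail_timeout window.
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            self.down_until = now + self.fail_timeout
            self.failures.clear()


def pick(peers: List[Peer], now: float) -> Optional[Peer]:
    """Return the first available peer, or None for "no live upstreams"."""
    for peer in peers:
        if peer.available(now):
            return peer
    return None


def simulate_timeouts(peers: List[Peer], times: List[float]) -> None:
    """Send one request at each time; every request is assumed to time out."""
    for now in times:
        peer = pick(peers, now)
        if peer is not None:
            peer.record_failure(now)
```

In this model, four timed-out requests at t = 0, 1, 2, 3 seconds take both peers out of rotation, so `pick(peers, 4.0)` returns `None`, which corresponds to the "no live upstreams" state. A peer created with `max_fails=0` stays eligible no matter how many failures are recorded.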

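3.1 Prepare Nginx and configure the upstream as shown at the beginning of this article

3.2 Prepare the test endpoints: one normal endpoint and one that times out (use sleep to make it wait)

3.3 Load-test the endpoints, here with JMeter; the goal is to trigger several timeouts within 15 seconds

3.4 Watch the Nginx log; as soon as a timeout or `no live upstreams` entry appears, immediately call another service endpoint repeatedly and observe the results

As the Nginx log shows, once the timeout endpoint starts failing the service becomes unavailable. Calls to other services also hit "no live upstreams", and the front end cannot be reached either.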
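For step 3.2, a throwaway backend with one normal and one slow endpoint is enough. The sketch below uses only the Python standard library. The `/ok` and `/slow` paths, the `make_server` helper, and the 70-second sleep (chosen only because it exceeds the 60s default proxy timeout) are assumptions for illustration; port 9099 matches the upstream configuration above.

```python
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def make_handler(slow_seconds: float):
    """Build a handler class whose /slow endpoint sleeps for slow_seconds."""

    class Handler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path.startswith("/slow"):
                # Simulate a hung interface that outlasts the proxy timeout.
                time.sleep(slow_seconds)
                body = b"slow\n"
            else:
                body = b"ok\n"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, fmt, *args):
            # Keep the console quiet during load tests.
            pass

    return Handler


def make_server(host: str = "127.0.0.1", port: int = 9099,
                slow_seconds: float = 70.0) -> ThreadingHTTPServer:
    """Create (but do not start) a threaded test backend."""
    return ThreadingHTTPServer((host, port), make_handler(slow_seconds))
```

Calling `make_server().serve_forever()` starts the backend. Pointing the load test at `/slow` through Nginx while calling `/ok` in parallel should then show whether healthy endpoints are affected.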

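4. Solutions

Option 1: Set the proxy timeouts according to the business, and tune the fail_timeout and max_fails thresholds
Instead of relying on the 60s default, set the Nginx proxy timeout (proxy_connect_timeout, and for slow responses proxy_read_timeout) above the longest execution time of the backend service. The Nginx defaults are fail_timeout=10s and max_fails=1. With larger values Nginx in effect absorbs slow requests instead of quickly marking the server as down. As long as the backend as a whole is still healthy, moderately raising these values is effective in high-concurrency scenarios.
Option 2: Optimize the backend endpoints that time out
Option 3: Remove the fail_timeout and max_fails settings or raise them, and add active health checking (by default Nginx only checks passively) with the nginx_upstream_check_module plugin. The rest of this article covers this option in detail.

Project page: https://github.com/yaoweibin/nginx_upstream_check_module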

5. Installing the upstream_check_module Plugin

5.1 A Simple Configuration Example

Name
    nginx_http_upstream_check_module - support upstream health check with
    Nginx
Synopsis
    http {
        upstream cluster {
            # simple round-robin; max_fails=0 disables failure accounting, so the server always stays available
            server 192.168.0.1:80 max_fails=0 fail_timeout=15s;
            server 192.168.0.2:80 max_fails=0 fail_timeout=15s;

            check interval=5000 rise=1 fall=3 timeout=4000 type=http;
            #check interval=3000 rise=2 fall=5 timeout=1000 type=ssl_hello;
            #check interval=3000 rise=2 fall=5 timeout=1000 type=http;
            #check_http_send "HEAD / HTTP/1.0\r\n\r\n";
            #check_http_expect_alive http_2xx http_3xx;
        }
        server {
            listen 80;
            location / {
                proxy_pass http://cluster;
            }
            location /status {
                check_status;
                access_log   off;
                allow SOME.IP.ADD.RESS;
                deny all;
            }
        }

    }

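5.2 Directive Parameters

The parameters of the check directive have the following meanings:

interval: interval between health check packets sent to the backend, in milliseconds.
fall (fall_count): if the number of consecutive failures reaches fall_count, the server is considered down.
rise (rise_count): if the number of consecutive successes reaches rise_count, the server is considered up.
timeout: timeout of the health check request to the backend, in milliseconds.
default_down: initial state of the server. If true, the server starts as down; if false, it starts as up. The default is true, meaning the server is considered unavailable at first and only becomes healthy after the health check has succeeded the required number of times.
type: type of the health check packet. The following types are currently supported:
- tcp: a plain TCP connection; if the connection succeeds, the backend is considered healthy.
- ssl_hello: sends an initial SSL hello packet and receives the server's SSL hello packet.
- http: sends an HTTP request and decides whether the backend is alive from the status of its response.
- mysql: connects to the MySQL server and decides whether the backend is alive from the server's greeting packet.
- ajp: sends an AJP Cping packet to the backend and decides whether the backend is alive from the Cpong reply.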
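How rise, fall, and default_down interact can be sketched as a small state machine. This is an illustrative model of the behavior described above, not the module's source code; the `CheckedPeer` class and its `report` method are names invented for this example.

```python
class CheckedPeer:
    """Model of one backend under active health checking."""

    def __init__(self, rise: int = 2, fall: int = 5, default_down: bool = True):
        self.rise = rise
        self.fall = fall
        # default_down=True means the peer starts as down until checks succeed.
        self.up = not default_down
        self.rise_count = 0  # consecutive successful checks
        self.fall_count = 0  # consecutive failed checks

    def report(self, ok: bool) -> bool:
        """Record one health check result and return the resulting state."""
        if ok:
            self.rise_count += 1
            self.fall_count = 0
            if not self.up and self.rise_count >= self.rise:
                self.up = True
        else:
            self.fall_count += 1
            self.rise_count = 0
            if self.up and self.fall_count >= self.fall:
                self.up = False
        return self.up
```

With `rise=1, fall=3, default_down=False`, the peer stays up through two failed checks, goes down on the third consecutive failure, and comes back after a single success. At `interval=5000` that means a dead backend is detected after roughly three check intervals.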

5.3 Downloading upstream_check_module

# Download and unpack in the nginx root directory
wget https://codeload.github.com/yaoweibin/nginx_upstream_check_module/zip/refs/heads/master
unzip master

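5.4 Configuring and Compiling

# Check the current nginx version to pick the matching patch in nginx_upstream_check_module. For example, nginx-1.7.2+ uses 'check_1.7.2+.patch'; see the project's README for details.
# Also note the existing configure arguments. They must be repeated when adding the module; adding the module alone would overwrite the original build configuration.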
nginx -V
nginx version: nginx/1.12.2
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) 
built with OpenSSL 1.0.2k-fips  26 Jan 2017
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_ssl_module

# Apply the patch. If the nginx directory has no source tree, download the source for the same version and unpack it there
cd nginx-1.12.2/
patch -p1 < ../nginx_upstream_check_module-master/check_1.12.1+.patch

# Configure with the module added. Repeat the original arguments, otherwise they are overwritten. The new argument is --add-module=../nginx_upstream_check_module-master
./configure --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_ssl_module --add-module=../nginx_upstream_check_module-master

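# Compile. This creates an objs directory in the current directory containing the new nginx binary. Do not replace the existing nginx binary with it before testing it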
make
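The patch selection rule used above (take the newest `check_X.Y.Z+.patch` whose version does not exceed the installed nginx version) can be written down as a small helper. This is a sketch under that assumed naming rule; the `choose_patch` function is hypothetical, and the file list passed to it should be taken from the unpacked module directory rather than from this example.

```python
import re
from typing import List, Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '1.12.2' into (1, 12, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def choose_patch(nginx_version: str, patch_files: List[str]) -> Optional[str]:
    """Pick the newest 'check_<version>+.patch' not newer than nginx_version."""
    target = parse_version(nginx_version)
    best = None
    for name in patch_files:
        match = re.fullmatch(r"check_(\d+(?:\.\d+)*)\+\.patch", name)
        if not match:
            continue  # ignore files that do not follow the assumed pattern
        version = parse_version(match.group(1))
        if version <= target and (best is None or version > best[0]):
            best = (version, name)
    return best[1] if best else None
```

For nginx 1.12.2 and a directory containing `check_1.7.2+.patch`, `check_1.12.1+.patch`, and a hypothetical newer `check_1.14.0+.patch`, the helper returns `check_1.12.1+.patch`, the same file applied in the patch step above.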

5.5 Verifying the Installation

# Test that the newly built nginx binary works
./objs/nginx -t
nginx: the configuration file /usr/local/nginx/conf/nginx.conf syntax is ok
nginx: configuration file /usr/local/nginx/conf/nginx.conf test is successful

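# Back up the existing nginx binary (nginx.bak is just an example name), then put the new one in place
mv /usr/local/nginx/sbin/nginx /usr/local/nginx/sbin/nginx.bak
cp ./objs/nginx /usr/local/nginx/sbin/nginx

# Verify the installation: the configure arguments now include the upstream check module
/usr/local/nginx/sbin/nginx -V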
nginx version: nginx/1.12.2
built by gcc 4.8.5 20150623 (Red Hat 4.8.5-44) (GCC) 
built with OpenSSL 1.0.2k-fips  26 Jan 2017
TLS SNI support enabled
configure arguments: --prefix=/usr/local/nginx --with-http_stub_status_module --with-http_ssl_module --add-module=/usr/local/nginx/nginx_upstream_check_module-master

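# A restart is required for the new binary to take effect. If nginx is not registered as a system service, find the process and stop it with kill instead
# Stop the service
systemctl stop nginx.service
# Start the service
systemctl start nginx.service
# Restart the service
systemctl restart nginx.service

If Nginx has not been restarted, changing the configuration as described below produces the following error:

unknown directive "check" in /usr/local/nginx/conf/nginx.conf:61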

6. Configuring and Using upstream_check_module

6.1 Editing the Nginx Configuration File

http {
    upstream cluster {
        # simple round-robin; max_fails=0 disables failure accounting, so the server always stays available
        server 192.168.0.1:80 max_fails=0 fail_timeout=15s;
        server 192.168.0.2:80 max_fails=0 fail_timeout=15s;

        # 1. The key setting: heartbeat (health) check
        check interval=5000 rise=1 fall=3 timeout=5000 default_down=false type=http;
        check_keepalive_requests 100;
        check_http_send "HEAD / HTTP/1.1\r\nConnection: keep-alive\r\n\r\n";
        check_http_expect_alive http_2xx http_3xx http_4xx http_5xx;
        #check interval=3000 rise=2 fall=5 timeout=1000 type=ssl_hello;
        #check interval=3000 rise=2 fall=5 timeout=1000 type=http;
        #check_http_send "HEAD / HTTP/1.0\r\n\r\n"; (HTTP/1.0 is the default)
        #check_http_send "HEAD / HTTP/1.1\r\nConnection: keep-alive\r\n\r\n";
        #check_http_expect_alive http_2xx http_3xx http_4xx http_5xx;
        #Default: if no parameters are given, the defaults are: interval=30000 fall=5 rise=2 timeout=1000 default_down=true type=tcp

    }
    server {
        listen 80;
        location / {
            proxy_pass http://cluster;
        }
        # 2. Health status page for the servers; open http://xxxx/check_status in a browser
        location /check_status {
            check_status;
            access_log   off;
            allow SOME.IP.ADD.RESS;
            deny all;
        }
    }
}

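6.2 Verifying the Configuration

# Apply the module configuration
./nginx -s reload

Open http://xxxx/check_status in a browser. If the status page is returned, the configuration works.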
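The status page can also be checked from a script. The sketch below assumes that the module accepts a `?format=json` query argument and returns the JSON layout shown in its README (a `servers` object with a `server` list whose entries carry `upstream`, `name`, and `status`). Verify both against the output of your own build before relying on them. The `down_servers` and `fetch_status` helpers are names made up for this example.

```python
import json
import urllib.request
from typing import List


def down_servers(payload: str) -> List[str]:
    """Return 'upstream/name' for every server whose status is not 'up'."""
    data = json.loads(payload)
    servers = data.get("servers", {}).get("server", [])
    return [
        f'{server["upstream"]}/{server["name"]}'
        for server in servers
        if server.get("status") != "up"
    ]


def fetch_status(url: str, timeout: float = 5.0) -> str:
    """Fetch the status page body; the URL is supplied by the caller."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

A monitoring job could call `down_servers(fetch_status(url))` with the status URL of the deployment and raise an alert whenever the returned list is not empty.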

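6.3 Verifying the Fix

With the configuration above in place, repeat the steps in section 3. The log shows a large number of timeout entries but no "no live upstreams" entries, and the other endpoints respond normally.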

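Author: 山村放牛娃
Link: https://juejin.cn/post/7208839038018142245
Source: 稀土掘金
Copyright belongs to the author. For commercial reprints, contact the author for authorization; for non-commercial reprints, credit the source.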

Author: Jeebiz  Created: 2023-09-14 11:46
Last edited by: Jeebiz  Updated: 2025-06-29 00:34