Nginx: recv() failed (104: connection reset by peer) troubleshooting

In a recent project, Nginx acted as a reverse proxy in front of a nodejs service (built with the nestjs framework). During stress testing, requests occasionally failed with 502 Bad Gateway, at a low rate of around 0.005%. The error message in the Nginx log was recv() failed (104: Connection reset by peer) while reading response header from upstream. After some searching, we learned that the direct cause of the error was:

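The nodejs service has already closed the connection, but Nginx has not been notified and keeps sending and receiving data on that connection, which eventually produces this error.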

Since we had configured Nginx to use long (keep-alive) connections to the nodejs server, namely:

proxy_http_version 1.1;
proxy_set_header Connection "";

we started with the settings related to long connections. The common parameters that Nginx uses to control long connections are keepalive_timeout, keepalive_requests, and keepalive:
1) keepalive_timeout: sets the timeout for the client's long connection. If the client makes no request within this time, the Nginx server actively closes the long connection. Nginx defaults to keepalive_timeout 75s;. Some browsers only hold a connection for 60 seconds at most, so we usually set it to 60s. Setting it to 0 disables long connections.
2) keepalive_requests: sets the maximum number of requests that one long connection with the client can serve. Once this number is reached, Nginx actively closes the long connection. The default value is 100 (1000 since nginx 1.19.10). Under normal circumstances, keepalive_requests 100 is enough. Under high QPS, however, long connections keep reaching the request limit and being closed, which means Nginx has to keep creating new long connections to handle the requests.
3) keepalive: sets the maximum number of idle long connections to the upstream servers that are kept open. When the number of idle connections exceeds this value, the least recently used connections are closed. If the value is too small and the request rate or request processing time fluctuates over some period, connections may be repeatedly closed and re-created. We usually set 1024. In special scenarios, the value can be estimated from the average response time and QPS of the interface.
4) Opening a long connection with the upstream server: by default Nginx talks to the backend over short (HTTP 1.0) connections. A connection is opened for each request, closed after the request is processed, and reopened for the next request. The HTTP protocol has supported long connections since version 1.1, so we set the following parameters in location to open a long connection:

  http {
      server {
          location / {
              proxy_http_version 1.1;
              proxy_set_header Connection "";
          }
      }
  }

Items 1) and 2) configure long connections between the client and Nginx; items 3) and 4) configure long connections between Nginx and the upstream server.
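The estimate mentioned for the keepalive value can be sketched with a little arithmetic. The snippet below is only a rough illustration: it assumes Little's law (connections in use ≈ QPS × average response time), which the original text does not name, and the function name estimateUpstreamConnections and the sample numbers are hypothetical.

```javascript
// Hypothetical helper: rough estimate of how many upstream connections are
// busy at once, assuming Little's law (concurrency = arrival rate * latency).
// This is a sizing guide for the keepalive value, not an exact figure.
function estimateUpstreamConnections(qps, avgResponseMs) {
  return Math.ceil((qps * avgResponseMs) / 1000);
}

// Example figures (assumed, not measured): 10000 QPS at 100 ms per request.
const estimated = estimateUpstreamConnections(10000, 100);
console.log("estimated concurrent upstream connections:", estimated);
```

With these sample figures the estimate is 1000, which is in the same range as the 1024 we usually set. Real traffic is bursty, so some headroom above the estimate is a reasonable choice.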
After checking the configuration above, we looked for keepalive-related settings on the node side. According to the nodejs documentation, the default server.keepAliveTimeout is 5000ms. Once a connection has been idle for that long, node closes it. If Nginx still considers the connection alive and sends the next request on it, the error above can occur.
So we increased node's keepAliveTimeout value (see https://shuheikagawa.com/blog/2019/04/25/keep-alive-timeout/):
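To see which values a given runtime actually uses, the timeouts can be read from a server object before it starts listening. This is a minimal sketch using the built-in http module rather than the nestjs app from the project; the request handler is a placeholder.

```javascript
const http = require("http");

// Placeholder server, never started: it is only used to read the defaults.
const server = http.createServer((req, res) => res.end("ok"));

// Idle time in milliseconds after which node closes a keep-alive connection.
console.log("keepAliveTimeout:", server.keepAliveTimeout);
// The headersTimeout default differs between Node.js versions, so print it too.
console.log("headersTimeout:", server.headersTimeout);
```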


// Set up the app...
const server = app.listen(8080);

server.keepAliveTimeout = 61 * 1000;
server.headersTimeout = 65 * 1000; // Node.js >= 10.15.2 requires this value to be greater than keepAliveTimeout

The purpose of increasing node's keepAliveTimeout is to prevent the node service from closing the connection before Nginx does: the timeout of the callee needs to be longer than that of the caller.
After this change, Nginx no longer returned 502 Bad Gateway during the stress test.
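The rule "callee timeout longer than caller timeout" can be written down as a small helper. This is a sketch under assumptions: the function name nodeTimeoutsFor, the 1 second margin, and the 4 second gap for headersTimeout are hypothetical and simply reproduce the 61s and 65s used above, assuming Nginx keeps idle upstream connections for 60 seconds.

```javascript
// Hypothetical helper: derive node timeouts (in ms) from the number of
// seconds the proxy keeps an idle upstream connection open.
function nodeTimeoutsFor(proxyIdleSeconds, marginSeconds = 1) {
  // The callee (node) must outlive the caller (Nginx) on idle connections.
  const keepAliveTimeout = (proxyIdleSeconds + marginSeconds) * 1000;
  // Keep headersTimeout above keepAliveTimeout (assumed gap: 4 seconds).
  const headersTimeout = keepAliveTimeout + 4000;
  return { keepAliveTimeout, headersTimeout };
}

// Assumed proxy idle timeout of 60 seconds.
const timeouts = nodeTimeoutsFor(60);
console.log(timeouts);
```

The returned values can then be assigned to server.keepAliveTimeout and server.headersTimeout, so that changing the proxy timeout does not silently break the ordering.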
