本文共 12765 字,大约阅读时间需要 42 分钟。
近期一个比较有意思的问题,有用户反馈说他们的服务器一直在连接阿里云的服务,把端口都占满了,导致网络中断,业务连接建立不起来,telnet本地端口也不通。
按照产品设计的角度来说,这种情况一般不可能发生,但是用户的netstat的结果来看,的确显示大部分的连接是跟阿里云的服务器建立的。
由于服务器都是非阿里云实例,无法直接登陆排查,因此,准备了排查方案。
类似的从本地telnet本地端口不通的问题一般是动态端口用光。默认Windows Server 2008 之后的系统使用49152 - 65536范围共计 16384 个端口。一般情况下是用不完的,如果有用光的情况常常伴随Port Leak(端口泄露)问题。
netsh int ipv4 show dyn tcp
Protocol tcp Dynamic Port Range
Start Port : 49152
Number of Ports : 16384
验证方案是使用命令 netsh int ipv4 set dynamicport tcp start=10000 num=50000 来开放更多的动态端口使用。
由于 netsh 直接通过 NSI 接口调用操作 Windows Kernel 的 TCPIP 方法来修改系统配置,该命令立时生效。
不过,用户很快反馈是没有任何三方软件。
考虑到系统中多数动态端口在 TIME_WAIT 状态,可以尝试通过 TcpTimedWaitDelay 注册表来缩短 TIME_WAIT 的释放时间。
Key: HKEY_LOCAL_MACHINESYSTEMCurrentControlSetServicesTcpipParameters
Value: TcpTimedWaitDelayData Type: REG_DWORDRange: 30-300 (decimal)Default value: 0x78 (120 decimal)Recommended value: 30注册表的配置需要重启才会生效。
用户一直对他们服务器访问的阿里云服务器 IP 存有疑虑,不过大多数是 443 端口的访问,抓包很难证明确切是哪个应用产生的连接。因此,我们需要 Windows 提供的 ETL Trace 方法来获取应用程序对 TCPIP 等 Network API 的调用信息。
netsh trace start provider={2F07E2EE-15DB-40F1-90EF-9D7BA282188A} keywords=0xffffffffffffffff level=0xff provider={E53C6823-7BB8-44BB-90DC-3F86090D48A6} keywords=0xffffffffffffffff level=0xff provider={7D44233D-3055-4B9C-BA64-0D47CA40A232} keywords=0xffffffffffffffff level=0xff provider={50B3E73C-9370-461D-BB9F-26F32D68887D} keywords=0xffffffffffffffff level=0xff provider={43D1A55C-76D6-4F7E-995C-64C711E5CAFE} keywords=0xffffffffffffffff level=0xff maxSize=500MB fileMode=circular persistent=no overwrite=yes report=yes correlation=yes traceFile=c:NetworkTrace.etl capture=yes packettruncatebytes=128
从收集的日志看,我们发现请求对应的Process ID = 0x0600 = 1536,
16767 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-TCPIP]TCP: endpoint 0xFFFFFA80C4CF0CF0 (Family=IPV6 , PID=1536) created with status = STATUS_SUCCESS. 16768 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]socket: 1: Process 0xFFFFFA80C3FB0860 (0x600), Endpoint 0xFFFFFA80C259FE50, Family 0, Type 0, Protocol 0, Seq 1013, Status STATUS_SUCCESS 16769 [3]0600.061C::?2018?-?12?-?24 20:15:22.978 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x1, Seq 11002, Status STATUS_SUCCESS 16770 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address ::, Seq 7010, Status STATUS_SUCCESS 16771 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 acquired port number 10010. 16772 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) bound. 16773 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]bind: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::]:10010, Seq 7022, Status STATUS_SUCCESS 16774 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-Winsock-AFD]connect: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Address [::ffff:120.55.35.39]:443, Seq 5023, Status STATUS_SUCCESS 16775 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosedState to SynSentState , SndNxt = 0. 16776 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 attempted to acquire weak reference on port number 10010 inherited from endpoint 0xFFFFFA80C4CF0CF0. Successful = TRUE . 16777 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) requested to connect. 16779 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: Tcb 0xFFFFFA80C3956B20 is going to output SYN with ISN = 2438595996, RcvWnd = 8192, RcvWndScale = 8. 16780 [3]0600.061C::?2018?-?12?-?24 20:15:22.979 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect proceeding. 16789 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from SynSentState to EstablishedState , SndNxt = 2438595997. 16790 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1. ThSeq = 2246241665. 16791 [7]0000.0000::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) connect completed. PID = 1536. 16793 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]connect: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 5024, Status STATUS_SUCCESS 16794 [3]0600.061C::?2018?-?12?-?24 20:15:23.008 [Microsoft-Windows-Winsock-AFD]Socket option: 4: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Option FIONBIO , Value 0x0, Seq 11002, Status STATUS_SUCCESS 16795 [3]0600.061C::?2018?-?12?-?24 20:15:23.009 [Microsoft-Windows-TCPIP]TCP: Option TCP_OFFLOAD_NOT_PREFERRED is going to be set for connection 0xFFFFFA80C3956B20. 16796 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3047, Status STATUS_SUCCESS 16797 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3056, Status STATUS_SUCCESS 16798 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 send posted posted 279 bytes at 2438595997. 16807 [3]0600.061C::?2018?-?12?-?24 20:15:23.019 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3051, Status STATUS_SUCCESS 16815 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]send: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0xFFFFFA80C56959B8, Length 279, Seq 3024, Status STATUS_SUCCESS 16818 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2C56830, Length 1452, Seq 9000 16819 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246241665. 16820 [7]3ECC.2FC8::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246241665. 16823 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4115, Status STATUS_SUCCESS 16824 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]Data indication: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer 0xFFFFFA80C2D36F50, Length 1452, Seq 9000 16825 [3]0600.061C::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-Winsock-AFD]recv: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Buffer Count 1, Buffer 0x0, Length 5, Seq 4116, Status STATUS_SUCCESS 16826 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 indicated 0x5AC bytes accepted 0x5AC bytes, status = STATUS_SUCCESS. RcvNxt = 2246243117. 16827 [6]0000.0000::?2018?-?12?-?24 20:15:23.049 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20: Received data with number of bytes = 1452. ThSeq = 2246243117. ... 17644 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from EstablishedState to FinWait1State , SndNxt = 2438604647. 17650 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2002, Status STATUS_SUCCESS 17651 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint (sockaddr=[::]:10010) closed. 17652 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C4CF0CF0 released port number 10010. WeakReference = FALSE . 17653 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]socket cleanup: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2003, Status STATUS_SUCCESS 17654 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 0: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2000, Status STATUS_SUCCESS 17655 [4]0600.A53C::?2018?-?12?-?24 20:15:28.520 [Microsoft-Windows-Winsock-AFD]closesocket: 1: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 2001, Status STATUS_SUCCESS 17664 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from FinWait1State to ClosingState , SndNxt = 2438604648. 17665 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 delivery 0xFFFFFA80C3956C80 delivering FIN. RcvNxt = 2246247426. 17666 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-Winsock-AFD]disconnect indicated: 3: Process 0xFFFFFA80C3FB0860, Endpoint 0xFFFFFA80C259FE50, Seq 12001 17671 [7]0000.0000::?2018?-?12?-?24 20:15:28.549 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from ClosingState to TimeWaitState , SndNxt = 2438604648. 17673 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) disconnect completed. 17675 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) close issued. 17676 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 (local=[::ffff:----------]:10010 remote=[::ffff:120.55.35.39]:443) shutdown initiated (The transport connection was aborted by the local system.). PID = 1536. 17677 [7]0000.0000::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: connection 0xFFFFFA80C3956B20 transition from TimeWaitState to ClosedState , SndNxt = 2438604648. 17678 [2]0004.0058::?2018?-?12?-?24 20:15:28.550 [Microsoft-Windows-TCPIP]TCP: endpoint/connection 0xFFFFFA80C3956B20 released port number 10010. WeakReference = TRUE .
PID = 1536 通过 Process Explorer 我们能找到对应的Java.exe,
由于通过第四步我们找到去往阿里云服务器 120.55.35.x 的连接请求是 CloudMonitor 触发的。
考虑到没有配置 TcpTimedWaitDelay 的操作系统默认使用 2MSL (120s) TIME_WAIT timeout 时间,如果一个系统近 15000 的端口都处于 TIME_WAIT 状态,理论上来说 Cloud Monitor 至少需要每秒产生 125 条以上的连接。而无论从 Cloud Monitor 的日志还是之前的 ETL Trace ,我们都没有发现此类行为,Cloud Monitor 程序比较规律的每几秒钟建立连接用于提交 "指标"。
41931919752 INFO 2018-12-24 17:57:17.748 [ricGatherServiceHttp] 提交指标完成,耗时:78ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931935181 INFO 2018-12-24 17:57:33.177 [ricGatherServiceHttp] 提交指标完成,耗时:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
41931950625 INFO 2018-12-24 17:57:48.621 [ricGatherServiceHttp] 提交指标完成,耗时:47ms. SystemInfo [serialNumber=------------------------------------, hostname=----------, localIPs=[----------], name=Windows Server -------, version=---, arch=amd64]
"这台服务器运行了很长时间,一直都没有出问题"
具体有多长?我们通过事件日志 Eventlog 6013 确认已经运行了 43225197s 近 500 多天。这引起我们的注意,
All the TCP/IP ports that are in a TIME_WAIT status are not closed after 497 days from system startup in Windows Vista, in Windows 7, in Windows Server 2008 and in Windows Server 2008 R2
至此,整个问题得到了完整的分析和解决。
简单说一下这个 bug 的产生的原因,在 Windows 系统中,使用当前系统运行时间加上 TIMEOUT 时间来决定什么时候释放 TIME_WAIT 状态的 TCP Endpoint。在没有安装补丁的系统中,32位寄存器所能存放的最大值 0xFFFFFFFF,寄存器中存放的数值为 "时间 * 0n100",也就是说在系统运行超过 "0XFFFFFFFF/0n100" 秒之后,就可能发生问题。
0xFFFFFFFF/(0n86400 * 0n100)= 497.1026961805...
转载地址:http://gcgum.baihongyu.com/