frustrating log forwarding issue, resolved it today!

I ran into a technical logging obstacle and I finally passed it today.

Setup:

  • A large number of linux servers deployed over disparate locations and connected via consumer cable and xDSL lines
  • All of them are configured to forward logs to a remote central log server via syslog-ng and UDP (client prefers UDP over TCP)

Problem:

  • We were noticing a growing percentage would stop forwarding logs to the central log server

Troubleshooting notes:

  • Timing seemed random; many clients forwarded logs successfully for weeks or months, some clients would stop forwarding after a few weeks, a few would stop in a couple of days
  • All other connectivity appeared fine when troubleshooting (e.g. ssh worked, http worked, icmp, etc.)
  • On each server, syslog-ng was running, logging locally fine, and showed no error conditions (but we observed no outbound traffic when executing tcpdump port 514)
  • We did notice if we restart syslog-ng, forwarding would work again and we would immediately observe new logs appearing on the central log server

Quick thoughts and things we tried:

  • Okay, configs are likely good because all works when restarting
  • Network issues? But then why does restarting syslog-ng have an immediate and somewhat lasting effect?
  • Alright, let’s upgrade to the lastest sylog-ng 2.x suite (libol, eventlog, syslog-ng) because maybe we’re running into a weird bug

Finali:

  • After upgrading a dozen or so and waiting a week, we noticed two or three stopped forwarding logs again
  • We executed tcpdump port 514 and again observed no outbound traffic
  • We restarted syslog-ng and as before, observed new logs appearing on the central log server. This time it caught our attention that our tcpdump window still showed no outbound traffic even after the syslog-ng restart. Ah, now we’ve identified the information we were neglecting to consider: multiple interfaces.

[tt@00-e0-4d-8d-ce-3b ~]$ /sbin/ifconfig
eth0 Link encap:Ethernet HWaddr
[]
tun0 Link encap:UNSPEC HWaddr
[]

We executed tcpdump -i tun0 port 514 on a server for which we had not observed log messages on the central log server in some time. By watching tun0, we now observed outbound UDP packets, but the central log server was still not receiving them. When we restarted syslog-ng on this server, we now noticed a significant change in the tcpdump output. The first three lines below were observed when the central syslog server was not receiving messages and the second three lines are after we restarted syslog-ng.

15:57:22.009114 IP 01-01-01-01.att-inc.com.40655 > 555.55.253.145.syslog: SYSLOG authpriv.notice, length: 132
15:57:22.187959 IP 01-01-01-01.att-inc.com.40655 > 555.55.253.145.syslog: SYSLOG syslog.info, length: 104
15:57:22.188023 IP 01-01-01-01.att-inc.com.40655 > 555.55.253.145.syslog: SYSLOG syslog.notice, length: 97
15:57:22.699615 IP 777-777-229-216.static.data393.net.60408 > 555.55.253.145.syslog: SYSLOG kernel.notice, length: 249
15:57:22.700051 IP 777-777-229-216.static.data393.net.60408 > 555.55.253.145.syslog: SYSLOG kernel.notice, length: 254
15:57:22.700108 IP 777-777-229-216.static.data393.net.60408 > 555.55.253.145.syslog: SYSLOG kernel.notice, length: 246

Problem discovered (some of the IP information above was changed for obvious reasons).

On reboot or startup, tun0 is brought up and syslog-ng forwarded logs with the correct source IP for this environment (the one assigned to tun0). The catch is when an openvpn connection would bounce on a live system (i.e. tun0 would go down), syslog-ng would fall back to using the source IP assigned to eth0 and continued to use that IP even after tun0 was brought back up, which broke the forwarding of logs in this environment. Syslog-ng required a restart after tun0 would go down in order to use the proper IP (as observed in the tcpdump output for tun0 above). The flapping openvpn connection was the culprit.

Posted by tate Mon, 20 Apr 2009 22:04:00 GMT