New Post has been published on Event Enrichment.Org - http://www.eventenrichment.org/event-enrichment-unix-rsyslogd-imuxsock-message-drops/
Event Enrichment : Unix : Rsyslogd : IMUXSOCK MESSAGE DROPS
This is the first in our sample Event Enrichment series!
While enjoying your shift in the quiet solitude of the NOC ;), you suddenly receive an alert from PagerDuty or your NMS. Depending on your level of expertise, you would typically need to open a runbook or Ops Wiki to determine how to handle the event.
Instead, let’s explore a different method, Event Enrichment, using the following syslog entry as our reference.
It all starts with an alert…
Jan 28 12:01:20 zenoss-mon rsyslogd-2177: imuxsock lost 5 messages from pid 1169 due to rate-limiting
This event carries some useful information (we are losing messages that could be important, or even critical), but someone still has to investigate before anything can be done about it. Now, assume that this same event arrives at the NOC already enriched with the steps required to handle it. Mean Time to Repair (MTTR) would decrease, because the information needed to triage the problem is already included in the initial alert.
Event Enrichments have two components: remediation and escalation. Remediation consists of the steps necessary to rectify the problem, beginning with troubleshooting. Escalation covers the information to pass along and its intended recipient (a team or an individual engineer).
The first step in investigating this alert is to log into the device / server generating the error.
ops@noc-jump:~$ ssh ops@zenoss-mon
Last login: Tue Jan 28 17:38:54 2014 from 172.25.230.5
[ops@zenoss-mon ~]#
The next step is to determine the process associated with the PID referenced in the alert.
[ops@zenoss-mon ~]# ps aux | grep 1169
root 1169 0.2 0.1 203312 10368 ? S Jan23 22:48 /usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid
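The lookup above can also be scripted. The following is a minimal sketch, not part of the original runbook: it pulls the PID out of the alert text with sed and queries that exact PID with ps -p, which avoids matching the grep process itself or unrelated lines that happen to contain "1169". The function name extract_rate_limited_pid is hypothetical, and the pattern assumes the message wording shown in the alert above.

```shell
# Hypothetical helper: print the PID named in an imuxsock rate-limit notice.
# Prints nothing if the line does not match the expected wording.
extract_rate_limited_pid() {
  printf '%s\n' "$1" |
    sed -n 's/.*imuxsock lost [0-9][0-9]* messages from pid \([0-9][0-9]*\) .*/\1/p'
}

event='Jan 28 12:01:20 zenoss-mon rsyslogd-2177: imuxsock lost 5 messages from pid 1169 due to rate-limiting'
pid=$(extract_rate_limited_pid "$event")

# Query only that PID; fall back to a note if the process has already exited.
ps -p "$pid" -o user,pid,etime,args 2>/dev/null ||
  echo "No process with PID $pid is currently running"
```

Run this on the host that generated the alert, since the PID is only meaningful there.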
From the result of this command we can conclude that snmpd is logging more messages than rsyslogd's rate limit allows (the default is 200 messages in a 5-second interval), so rsyslogd is discarding the excess. On-call systems engineering will need to investigate why snmpd is logging at this rate.
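Before escalating, it can help to know how many messages are being dropped and whether one process is responsible for all of them. The sketch below is an illustration under assumptions: the function name tally_imuxsock_drops is hypothetical, the message wording is assumed to match the alert above, and the log path in the example comment varies by distribution.

```shell
# Hypothetical helper: sum the "lost N messages" counts per PID.
# Reads the named log files, or standard input if none are given.
# Output: one "PID TOTAL" line per PID, in no particular order.
tally_imuxsock_drops() {
  awk '/imuxsock lost [0-9]+ messages from pid [0-9]+/ {
         for (i = 1; i < NF; i++) {
           if ($i == "lost") n = $(i + 1)   # number of dropped messages
           if ($i == "pid")  p = $(i + 1)   # PID that was rate-limited
         }
         lost[p] += n
       }
       END { for (p in lost) printf "%s %d\n", p, lost[p] }' "$@"
}

# Example (log path is an assumption; some systems use /var/log/syslog):
#   tally_imuxsock_drops /var/log/messages | sort -k2,2nr
```

A single PID with a large total points at one noisy daemon, as in this example; many PIDs with small totals suggest the rate limit itself may be too low for the host.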
Escalate to the on-call SysEng team using the PagerDuty SysEng Service (or other alerting mechanism) and include the following information:
Original Event Summary: [Jan 28 12:01:20 zenoss-mon rsyslogd-2177: imuxsock lost 5 messages from pid 1169 due to rate-limiting]
The dropped messages come from /usr/sbin/snmpd (PID 1169), which is logging faster than the rsyslogd rate limit allows.
[ops@zenoss-mon ~]# ps aux | grep 1169
root 1169 0.2 0.1 203312 10368 ? S Jan23 22:48 /usr/sbin/snmpd -LS0-6d -Lf /dev/null -p /var/run/snmpd.pid
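The escalation text above could be assembled automatically, so the enriched alert arrives with the process details already attached. This is a rough sketch under assumptions: build_escalation_note is a hypothetical name, it expects the original syslog line and the PID from that line as arguments, and it must run on the host that raised the alert. How the note is delivered to PagerDuty or another alerting tool is left out.

```shell
# Hypothetical helper: print an escalation note for an imuxsock rate-limit event.
# $1 = original syslog line, $2 = PID named in that line.
build_escalation_note() {
  note_event=$1
  note_pid=$2
  # "args=" prints the full command line without a header; empty if the PID is gone.
  note_cmd=$(ps -p "$note_pid" -o args= 2>/dev/null) || note_cmd=''
  printf 'Original Event Summary: [%s]\n' "$note_event"
  if [ -n "$note_cmd" ]; then
    printf 'PID %s is currently running: %s\n' "$note_pid" "$note_cmd"
  else
    printf 'PID %s is no longer running; the process may have exited or restarted.\n' "$note_pid"
  fi
}

build_escalation_note \
  'Jan 28 12:01:20 zenoss-mon rsyslogd-2177: imuxsock lost 5 messages from pid 1169 due to rate-limiting' \
  1169
```

Reporting the case where the PID has disappeared matters here: a daemon that restarted between the alert and the lookup is itself useful information for the SysEng team.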
Adopting the Event Enrichment methodology enhances the standardization and scalability of your NOC and on-call processes.
Now check out the Beginner’s Guide to Event Enrichment to deepen your understanding of the methodology.