Monday, June 1, 2015

Broadband Connectivity Monitor


In this article I'll demonstrate a way to keep tabs on your Internet connectivity.  Anyone who has had uptime problems with their ISP knows what I'm talking about.
I looked at several external services (Pingdom, UptimeRobot, etc.) and other folks' code, but didn't find anything I particularly liked: they wanted money, their monitoring intervals were too long, and so on.  Instead, I wrote a fairly simple Linux shell script to do the job myself.

Design goals:
  • Simple to deploy/use
  • One-minute monitoring granularity
  • Logging with sufficient detail that I can go back to my ISP and get refunds for service disruptions
  • Email alerts


The two main requirements for this script are the Bash shell and a Mail Transfer Agent (MTA).

The MTA is needed to send the email alerts.  I used the Heirloom Mailx agent in testing on both Debian (Ubuntu) and Red Hat (CentOS) environments.  My ISP blocks direct SMTP traffic (spam prevention, no doubt), so I needed an MTA that could relay outbound email through an external SMTP service.  Mailx provides that.

I decided to use Google's email service (Gmail) for the relay.  Below is the configuration I have working on Ubuntu (nail.rc file):
set smtp-use-starttls
set smtp=smtp://
set ssl-verify=ignore
set smtp-auth=login
set smtp-auth-user=""
set smtp-auth-password="yourPassword"
set from=""

For CentOS, I had to add the following line in addition to the ones above (mail.rc file in this case):
 set nss-config-dir="/etc/pki/nssdb"
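
Before relying on the monitor, it is worth checking that the relay actually delivers mail. Here is a minimal sketch, assuming the mail command on your system is the Mailx configured above. The function name send_test_mail and the recipient address are placeholders of mine, not part of the monitor script.

```shell
#!/bin/bash
# Hypothetical helper: send one test message through the configured relay.
send_test_mail() {
  local recipient=$1
  # Bail out early if no mail command is installed.
  if ! command -v mail >/dev/null 2>&1; then
    echo "mail command not found" >&2
    return 1
  fi
  # Same invocation style the monitor uses: body on stdin, subject via -s.
  echo "Relay test sent at $(date)" | mail -s "Relay test" "$recipient"
}
```

Call it as send_test_mail you@example.com (with your own address) and check the inbox. If nothing arrives, the relay settings are the first place to look.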


As mentioned previously, I wrote this monitor completely in Linux shell script.  The overall program logic is as follows (loops forever at a user-configurable time interval):
  • Send an ICMP echo request (Ping) to a user-configurable target.
  • If the target replies, do nothing.  
  • If the target does not reply, I've experienced a broadband/Internet connectivity outage.  Log the details locally.
  • If we've recorded an outage and now have a successful ping, calculate the service disruption time, log it, and send an email alert that the outage occurred.  Since I'm doing the monitoring locally, there's no need to attempt an email alert till connectivity is restored, for obvious reasons.
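
The decision logic in the list above can be sketched as a small function. This is only an illustration, not the script itself: the function name next_action and the action labels are made up for the sketch, and it leaves out the "log only once" handling for ping errors. The exit codes are the ones described in the ping man page.

```shell
#!/bin/bash
# Hypothetical helper: map a ping exit status and the current outage flag
# to the action the monitor should take. Names are illustrative only.
next_action() {
  local status=$1      # ping exit status: 0 = reply, 1 = no reply, other = error
  local failedTime=$2  # 0 when no outage is recorded, else outage start (epoch seconds)

  case "$status" in
    0) if [ "$failedTime" -ne 0 ]; then
         echo "log-restore-and-email"   # outage just ended
       else
         echo "nothing"                 # all is well
       fi ;;
    1) if [ "$failedTime" -eq 0 ]; then
         echo "log-outage"              # new outage
       else
         echo "nothing"                 # outage already recorded
       fi ;;
    *) echo "log-internal-error" ;;     # ping itself failed
  esac
}

next_action 0 0            # healthy, no outage pending
next_action 1 0            # first failed ping
next_action 1 1433242711   # still down
next_action 0 1433242711   # back up
```

The four calls at the bottom walk through one full outage cycle: nothing to do, a new outage, an ongoing outage, and the restoration that triggers the email.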

Main body of the shell script below:
while :
do
 results=$(ping -qc "$COUNT" "$TARGET")
 case "$?" in
  0) if [ "$failedTime" -ne 0 ]
   then
    restoredTime=$(date +%s)
    duration=$(( restoredTime - failedTime ))
    s=$(( duration%60 ))
    h=$(( duration/3600 ))
    (( duration/=60 ))
    m=$(( duration%60 ))
    # Build the log record, then log it and email it
    logRec="Service Restored, Approx Outage Duration:"
    logRec+=$(printf "%02d %s %02d %s %02d %s" "$h" "hrs" "$m" "min" "$s" "sec")
    logger -t "$(basename "$0")" "$logRec"
    t1=$(date -d @"$failedTime" -Iseconds)
    t2=$(date -d @"$restoredTime" -Iseconds)
    printf "%s %s\n%s %s" "$t1" "$msg" "$t2" "$logRec" | mail -s "Service Outage Occurred" "$EMAIL"
    failedTime=0
    internalError=0
   fi
   ;;
  1) if [ "$failedTime" -eq 0 ]
   then
    failedTime=$(date +%s)
    logRec=$(echo "Service Outage:" "$results" | tr '\n' ' ')
    msg="$logRec"
    logger -t "$(basename "$0")" "$logRec"
   fi
   ;;
  # Any other return code (2 per the ping man page), e.g. the local
  # interface is down: log it once until the next successful ping
  *) if [ "$internalError" -eq 0 ]
   then
    logger -t "$(basename "$0")" "Internal Error"
    (( internalError+=1 ))
   fi
   ;;
 esac

 sleep "$INTERVAL"
done

Line numbers below refer to the complete loop, counting from the while line.
Line 1:  Loop, like forever.
Line 3:  Executes the ping command with a user-configurable ping count and target.  Those settings are stored in an external config file.
Line 4:  Set up a switch on the return value of the ping command.  Per the man page, ping will return 0 if it gets a reply, 1 if it gets no reply at all, and 2 on any other sort of error.
Line 5:  This would be the case that ping received a reply.  I only need to take action if there has been an outage recorded earlier.  That outage flag is the time of occurrence, stored in the failedTime variable.
Line 7: An outage and resulting service restoration is in progress.  Store the time of the restoration (in seconds since 1970).
Lines 8-12:  Calculate the total duration of the outage as the difference between the start and stop times, then convert that number of seconds to hours, minutes, and seconds.
Lines 14-15:  Do some prettifying of a log message of the service restoration notice.
Line 16:  Send the notice to the syslog process on the local server.
Lines 17-18:  Put some timestamps on the message that will be sent as an email alert (syslog does this automatically, so I didn't need to timestamp the log messages).
Line 19:  Send alert message out via email.
Lines 20-21:  Reset some variable flags.
Line 24:  The case here is ping has returned a "1", meaning it did not receive a reply.  If the failedTime flag is not set, this indicates a fabulous new outage event.
Line 26:  Save the current time (in seconds since 1970) in the failedTime variable.
Line 27:  Build a string that contains the output of the original ping command, with all newlines replaced by spaces (syslog logs one line at a time).
Line 28:  Save that outage string in a variable for use later for the email alert when connectivity has been restored.
Line 29:  Send the log message to syslog.
Line 34:  This covers the degenerate case (ping return code of "2").  One scenario where this can happen is the local server's network interface going down.
Line 36:  Simply log a message of the issue, but only do it one time.
Line 42:  Pause till the next ping for a user-configurable amount of time.
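
For completeness, here is a minimal sketch of the setup the loop relies on. The variable names COUNT, TARGET, INTERVAL, EMAIL, failedTime and internalError come from the script. The values are placeholders I picked for illustration (192.0.2.1 is a reserved documentation address, so substitute a host that actually answers pings), and the real script reads the settings from its external config file rather than setting them inline.

```shell
#!/bin/bash
# Settings the loop expects. In the real script these live in an external
# config file; the values below are placeholders, not the author's.
COUNT=3                   # pings per check (the sample log shows 3 packets)
TARGET="192.0.2.1"        # placeholder target; use a host that answers pings
INTERVAL=60               # seconds between checks, for one-minute granularity
EMAIL="you@example.com"   # placeholder alert recipient

# Flags tested inside the loop. Both must start at 0, otherwise the
# numeric tests ([ ... -ne 0 ], [ ... -eq 0 ]) fail on an empty string.
failedTime=0      # 0 = no outage recorded, else outage start in epoch seconds
internalError=0   # 0 = no ping error logged yet

echo "Checking $TARGET with $COUNT pings every $INTERVAL seconds"
```

Using 0 as the "no outage" value is what lets failedTime double as both a flag and a timestamp in the loop.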

This script can be fired off and run forever simply like this:
nohup ./ > /dev/null 2>&1 &
For those more motivated, you can set this up as a regular Linux daemon in init.d.


Sample syslog output below:
Jun  2 04:58:31 intel3770k Service Outage: PING ( 56(84) bytes of data.  --- ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2014ms 
Jun  2 04:59:33 intel3770k Service Restored, Approx Outage Duration:00 hrs 01 min 02 sec

Email alert text from the sample above:
2015-06-02T04:58:31-0600 Service Outage: PING ( 56(84) bytes of data.  --- ping statistics --- 3 packets transmitted, 0 received, 100% packet loss, time 2014ms 
2015-06-02T04:59:33-0600 Service Restored, Approx Outage Duration:00 hrs 01 min 02 sec
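
As a check on the arithmetic, the sample outage ran from 04:58:31 to 04:59:33, which is 62 seconds. The sketch below repeats the script's seconds-to-hours/minutes/seconds conversion inside a function. The name format_duration is mine, not part of the script, and the division is written as a plain assignment rather than the script's (( duration/=60 )).

```shell
#!/bin/bash
# Hypothetical wrapper around the script's duration conversion.
format_duration() {
  local duration=$1                # outage length in seconds
  local s=$(( duration % 60 ))     # leftover seconds
  local h=$(( duration / 3600 ))   # whole hours
  duration=$(( duration / 60 ))    # total whole minutes
  local m=$(( duration % 60 ))     # minutes past the hour
  printf "%02d %s %02d %s %02d %s\n" "$h" "hrs" "$m" "min" "$s" "sec"
}

format_duration 62     # prints: 00 hrs 01 min 02 sec
format_duration 3725   # prints: 01 hrs 02 min 05 sec
```

The first call reproduces the duration shown in the sample log and email above.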
Full source here.
