Monitoring batch workers using CloudWatch Metrics

In our project, we use the standard CloudWatch Metrics to monitor performance and issues in production. For example, if CPU utilization on a server exceeds 80% for more than 5 minutes, an alarm triggers and sends us an email.
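
To illustrate, here is roughly what such an alarm looks like when created with the AWS SDK for Ruby. This is a minimal sketch: the instance ID, SNS topic ARN, and alarm name are placeholders, not our actual configuration.

require 'aws-sdk-cloudwatch' # or require 'aws-sdk' for SDK v2

cloudwatch = Aws::CloudWatch::Client.new(region: 'ap-northeast-1')

# Alarm when average CPU utilization stays above 80% for one 5-minute period.
cloudwatch.put_metric_alarm(
  alarm_name: 'high-cpu-utilization',
  namespace: 'AWS/EC2',
  metric_name: 'CPUUtilization',
  dimensions: [{ name: 'InstanceId', value: 'i-0123456789abcdef0' }],
  statistic: 'Average',
  period: 300,
  evaluation_periods: 1,
  threshold: 80,
  comparison_operator: 'GreaterThanThreshold',
  alarm_actions: ['arn:aws:sns:ap-northeast-1:123456789012:ops-alerts'] # SNS topic that emails us
)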

We've also built a few simple CloudWatch Dashboards for monitoring different parts of our system. It's a great way to get a quick overview.
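
Dashboards can also be managed from code via the SDK's put_dashboard call. A small sketch (the dashboard name, instance ID, and widget layout below are made-up examples):

require 'aws-sdk-cloudwatch'
require 'json'

# One widget graphing average CPU utilization for a single instance.
dashboard_body = {
  widgets: [
    {
      type: 'metric', x: 0, y: 0, width: 12, height: 6,
      properties: {
        title: 'Worker CPU',
        region: 'ap-northeast-1',
        stat: 'Average',
        period: 300,
        metrics: [['AWS/EC2', 'CPUUtilization', 'InstanceId', 'i-0123456789abcdef0']]
      }
    }
  ]
}

Aws::CloudWatch::Client.new(region: 'ap-northeast-1')
                       .put_dashboard(dashboard_name: 'projectx-overview', dashboard_body: dashboard_body.to_json)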

Missing metrics: memory, swap, disk utilization

In addition to the standard metrics, we've added system-level metrics like memory, swap and disk utilization. AWS provides monitoring scripts (Perl) for this, which can be run as a cron job:
http://docs.aws.amazon.com/ja_jp/AWSEC2/latest/UserGuide/mon-scripts.html
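
For example, a crontab entry along the lines of the documented one reports memory, swap, and disk utilization every five minutes (the install path here is the default from the docs; adjust to your setup):

*/5 * * * * ~/aws-scripts-mon/mon-put-instance-data.pl --mem-util --swap-util --disk-space-util --disk-path=/ --from-cron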


Custom metrics

We use batch workers (ActiveJob via delayed_job) to handle long-running requests, such as importing bulk data from files or rendering large PDF files. Customers use these features all the time, so we need a health check for our worker processes.

To do this, we implemented a simple custom CloudWatch metric: "delayed-job::WorkerProcesses".

# This works on macOS and Amazon Linux servers.
# Count the running delayed_job worker processes by matching the full command line.
pids = `pgrep -f delayed_job`.split
Rails.logger.info("Worker health check. #{pids.size} worker processes found (#{pids.join(',')})")

# Push the process count as a custom metric (see the wrapper below).
log_cloudwatch_metrics(
  namespace: 'delayed-job',
  aws_metrics: [AWSMetrics::Metric.new(name: 'WorkerProcesses', value: pids.size)]
)

Now we can see the metric in CloudWatch like any other metric, and set up alarms, dashboards, etc.

The check itself runs as a periodic batch job via delayed_job_recurring. That way we know both that the worker processes are running and that jobs are actually being processed.
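
A minimal sketch of such a recurring health-check job, following the delayed_job_recurring README (the class name and the 5-minute interval are our assumptions here):

class WorkerHealthCheckJob < Delayed::RecurringJob
  include AWSMetrics

  run_every 5.minutes # assumed interval; pick whatever resolution you alarm on

  def perform
    # Same check as above: count delayed_job worker processes and report the number.
    pids = `pgrep -f delayed_job`.split
    log_cloudwatch_metrics(
      namespace: 'delayed-job',
      aws_metrics: [AWSMetrics::Metric.new(name: 'WorkerProcesses', value: pids.size)]
    )
  end
end

# Schedule it once, e.g. from an initializer:
WorkerHealthCheckJob.schedule!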

CloudWatch Metrics wrapper

Sending metrics is quite simple using the AWS SDK for Ruby, but the request format is a bit verbose, with many fields to fill in. To make it simpler, we implemented a small wrapper with sensible defaults:

require 'socket'             # for Socket.gethostname
require 'aws-sdk-cloudwatch' # or require 'aws-sdk' for SDK v2

# Mixin for publishing custom CloudWatch metrics under our project namespace.
module AWSMetrics

  def log_cloudwatch_metrics(namespace:, aws_metrics:)
    cloudwatch_client.put_metric_data(
      namespace: "ProjectX/#{namespace}",
      metric_data: aws_metrics.map(&:to_h)
    )
  end

  def cloudwatch_client
    # .presence comes from ActiveSupport (Rails)
    Aws::CloudWatch::Client.new(region: ENV['AWS_REGION'].presence || 'ap-northeast-1')
  end

  # A single data point, convertible to the hash format expected by put_metric_data.
  class Metric
    # Units accepted by the CloudWatch PutMetricData API.
    VALID_UNITS = [
      'Seconds', 'Microseconds', 'Milliseconds',
      'Bytes', 'Kilobytes', 'Megabytes', 'Gigabytes', 'Terabytes',
      'Bits', 'Kilobits', 'Megabits', 'Gigabits', 'Terabits',
      'Percent', 'Count',
      'Bytes/Second', 'Kilobytes/Second', 'Megabytes/Second', 'Gigabytes/Second', 'Terabytes/Second',
      'Bits/Second', 'Kilobits/Second', 'Megabits/Second', 'Gigabits/Second', 'Terabits/Second',
      'Count/Second',
      'None'
    ].freeze

    attr_reader :name, :value, :aws_unit_string, :hostname, :timestamp

    def initialize(name:, value:, aws_unit_string: 'Count', hostname: Socket.gethostname, timestamp: Time.now)
      raise ArgumentError, "Invalid unit: #{aws_unit_string}" unless VALID_UNITS.include?(aws_unit_string)
      @name = name
      @value = value
      @aws_unit_string = aws_unit_string
      @hostname = hostname
      @timestamp = timestamp
    end

    def to_h
      {
        metric_name: name,
        # Tag each data point with the reporting host so we can graph per-instance values.
        dimensions: [
          { name: 'hostname', value: hostname }
        ],
        timestamp: timestamp,
        value: value,
        unit: aws_unit_string
      }
    end
  end

end
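
With the wrapper in place, reporting a metric from anywhere in the app is a one-liner. A hypothetical example, timing the PDF rendering mentioned above (class and metric names are made up for illustration):

class PdfRenderer
  include AWSMetrics

  def render(document)
    started = Time.now
    # ... render the PDF ...
    log_cloudwatch_metrics(
      namespace: 'pdf',
      aws_metrics: [AWSMetrics::Metric.new(
        name: 'RenderTimeMs',
        value: ((Time.now - started) * 1000).round,
        aws_unit_string: 'Milliseconds'
      )]
    )
  end
end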

Thanks for reading!
