In the opening part of this series, I briefly introduced the challenges of cloud operations and automation. The following parts work through those challenges step by step.
The first challenge I kept running into during my years of working with cloud infrastructure goes like this. My Product Manager calls me on a beautiful weekend evening to tell me that he feels our service is responding slowly and nothing is loading, interrupting the well-deserved beer I was enjoying after a productive week. Slowly, I move to my desk, turn on my laptop, open my favourite browser, grab my phone and look for my OTP, log in to the cloud provider console and open the monitoring service, only to find out that our genius PM (when I say genius, I do not mean it) is checking our app from his smartphone in a rural area where he is on vacation, hence the slowness (please upgrade your data plan with that overpaid PM salary). Too many steps just to investigate a false alarm from an untrustworthy source. "There must be something we can do about it," said my PM in the next meeting. "Oh, I am so sorry for interrupting your weekend, I must buy you a drink next time," said not my PM.
The story may not convince everyone, but most of us have experienced something similar, whether from a PM, Customer Service, or the customers themselves, a.k.a. non-tech people.
Being summoned during a vacation or weekend should be avoided at all costs unless it is a 100% true, confirmed, real incident from a trustworthy source. And by trustworthy source, I mean Wikipedia. And by Wikipedia, I mean the page I just edited. Jokes aside, we cannot predict when a real incident will happen, and in some scenarios we cannot afford one. Monitoring is how we prepare for such incidents and prevent them, or at least make sure they happen while we are present. A simple notification system based on app and infra metrics can count as a trustworthy source.
There are countless tutorials showing how to build a reliable monitoring system with metrics and dashboards, and how to combine metrics for more confident results that help engineers troubleshoot their systems, so I will not spend time writing yet another post on metrics and dashboards. My focus is on how to use such a monitoring system to deliver its results into engineers' hands. That way, we always have a reliable source of truth telling us what is happening to our infrastructure, based on a well-built monitoring system.
An event-based application is made for this use case: it keeps development cost and time low, and the infrastructure cost as well. From an AWS perspective: EventBridge + SNS + Lambda function + any messaging platform/app (Slack, Discord, Telegram, MS Teams) = Automated Notification System.
The solution design: as simple as that, because AWS is good at making things easier to build.
Here we can use Amazon CloudWatch to let AWS collect the metrics and logs, then use log filtering and composite alarms to deliver the result. Beyond that, Amazon EventBridge is useful when the monitoring target is not just applications and servers but also sensitive AWS API calls, such as who logs in to or assumes an admin role on a weekend, a budget being exceeded, or something as simple as S3 object creation in a specific path of a specific bucket. An SNS topic is created and allows subscriptions; it acts as a funnel that collects all of the alarms and events mentioned above and delivers them to the Lambda handler. The last piece, which does all of the heavy lifting, is the Lambda function, which can be written in the language of your choice. It receives the message from the SNS topic in JSON format. The attributes of that message vary depending on the source of the event and can be found in the AWS documentation.
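To make the JSON part concrete, here is a minimal sketch of unwrapping what SNS hands to a Lambda function. The `Records` / `Sns` / `Message` / `Subject` keys are the standard SNS envelope; the function name `unwrap_sns_event` and the sample event are hypothetical and only for illustration, not taken from my actual script.

```python
import json


def unwrap_sns_event(event):
    """Yield (subject, body) for every SNS record in a Lambda event.

    SNS delivers Message as a string, so it is decoded here when it is
    valid JSON and passed through as raw text when it fails to parse.
    """
    for record in event.get("Records", []):
        sns = record["Sns"]
        raw = sns["Message"]
        try:
            body = json.loads(raw)
        except json.JSONDecodeError:
            body = raw
        yield sns.get("Subject"), body


# Hypothetical sample, trimmed down to the fields used above.
sample_event = {
    "Records": [
        {
            "EventSource": "aws:sns",
            "Sns": {
                "Subject": "ALARM: high-latency",
                "Message": json.dumps(
                    {"AlarmName": "high-latency", "NewStateValue": "ALARM"}
                ),
            },
        }
    ]
}

for subject, body in unwrap_sns_event(sample_event):
    print(subject, body)
```

The inner message is where the source-specific fields live, so everything after this step depends on which service published to the topic.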
The script:
I am not a good programmer myself, and I have no defence for that. However, I can still make my script work, and that shows how simple it is. In the end, this is a short demonstration of a use case, not a competitive programming contest. All my script does is receive the event delivered by SNS, find out whether it comes from EventBridge or a CloudWatch alarm, parse it accordingly using predefined fields, decorate it with some proper formatting for the sake of readability, then send it to the desired URL/webhook/API/...
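The flow described above could be sketched roughly like this. It is not my actual script: the function names are hypothetical, the `{"text": ...}` body is an assumption that happens to suit Slack-style incoming webhooks, and the detection relies on `AlarmName` being present in CloudWatch alarm messages and `detail-type` in EventBridge events. The `WEBHOOK_URL`, `APP` and `ENV` variables match the ones passed to the Lambda in the Terraform example in this post.

```python
import json
import os
import urllib.request


def classify(message):
    """Guess the origin of a decoded SNS message from its top-level keys."""
    if not isinstance(message, dict):
        return "unknown"
    if "AlarmName" in message:
        return "cloudwatch_alarm"
    if "detail-type" in message:
        return "eventbridge"
    return "unknown"


def format_text(message):
    """Turn a decoded message into one short, readable line."""
    kind = classify(message)
    if kind == "cloudwatch_alarm":
        return "[{}] {}: {}".format(
            message.get("NewStateValue", "?"),
            message["AlarmName"],
            message.get("NewStateReason", ""),
        )
    if kind == "eventbridge":
        return "[{}] {} in {}".format(
            message.get("source", "?"),
            message["detail-type"],
            message.get("region", "?"),
        )
    # Unknown shape: forward it untouched rather than drop it.
    return message if isinstance(message, str) else json.dumps(message)


def post_to_webhook(url, text):
    """POST the text as JSON. The {"text": ...} shape is an assumption."""
    request = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def lambda_handler(event, context):
    """Entry point: forward every SNS record to the configured webhook."""
    url = os.environ["WEBHOOK_URL"]
    prefix = "{}/{}".format(os.environ.get("APP", "app"), os.environ.get("ENV", "env"))
    records = event.get("Records", [])
    for record in records:
        raw = record["Sns"]["Message"]
        try:
            message = json.loads(raw)
        except json.JSONDecodeError:
            message = raw
        post_to_webhook(url, "{} {}".format(prefix, format_text(message)))
    return {"forwarded": len(records)}
```

Keeping `classify` and `format_text` free of network calls means they can be checked locally with a sample message before anything is deployed.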
The IaC:
Anyone can do all of the good work above by hand, but no one can repeat it 10, 100, or 1000 times without getting bored and making mistakes. Here comes the automation part for deploying the automation solution: a Terraform module. Since IaC means Infrastructure as Code, I will let the code speak for itself with an example of how the module is used:
```hcl
# variables
variable "app_name" { default = "my-ground-breakking-app" }
variable "env" { default = "dev" }
variable "webhook" { default = "https://webhook.com/api/v1/example" }

# Get AWS identity
data "aws_caller_identity" "current" {}

locals {
  common_tags = {
    "Manage_by"   = "Terraform"
    "Environment" = "dev"
  }
}

# SNS Topic for notification
resource "aws_sns_topic" "this" {
  name = "${var.app_name}-monitoring-noti-topic"
}

# SNS Topic policy
resource "aws_sns_topic_policy" "this" {
  arn    = aws_sns_topic.this.arn
  policy = data.aws_iam_policy_document.this.json
}

# Data policy
data "aws_iam_policy_document" "this" {
  policy_id = "__default_policy_ID"

  statement {
    sid = "__default_statement_ID"
    actions = [
      "SNS:Subscribe",
      "SNS:SetTopicAttributes",
      "SNS:RemovePermission",
      "SNS:Receive",
      "SNS:Publish",
      "SNS:ListSubscriptionsByTopic",
      "SNS:GetTopicAttributes",
      "SNS:DeleteTopic",
      "SNS:AddPermission",
    ]
    condition {
      test     = "StringEquals"
      variable = "AWS:SourceOwner"
      values   = [data.aws_caller_identity.current.account_id]
    }
    effect = "Allow"
    principals {
      type        = "AWS"
      identifiers = ["*"]
    }
    resources = [aws_sns_topic.this.arn]
  }
  statement {
    sid = "AWSBudgets"
    actions = [
      "SNS:Publish"
    ]
    effect = "Allow"
    principals {
      type        = "Service"
      identifiers = ["budgets.amazonaws.com"]
    }
    resources = [aws_sns_topic.this.arn]
  }
  statement {
    sid     = "CodeStartNotification"
    actions = ["sns:Publish"]
    principals {
      type        = "Service"
      identifiers = ["codestar-notifications.amazonaws.com"]
    }
    resources = [aws_sns_topic.this.arn]
  }
  statement {
    sid     = "AWSEvents"
    actions = ["sns:Publish"]
    principals {
      type        = "Service"
      identifiers = ["events.amazonaws.com"]
    }
    resources = [aws_sns_topic.this.arn]
  }
}

module "sns2webhook_noti_forwarder" {
  source      = "../path/to/your/module/aws-noti-forwarder"
  name_prefix = var.app_name
  common_tags = local.common_tags

  forwarder_type = "sns2webhook"
  python_runtime = "python3.9"
  lambda_environments = {
    "ENV"         = var.env
    "APP"         = var.app_name
    "WEBHOOK_URL" = var.webhook
  }
}

# SNS topic subscription
resource "aws_lambda_permission" "sns2webhook_noti_forwarder" {
  statement_id  = "AllowExecutionFromSNS"
  action        = "lambda:InvokeFunction"
  function_name = module.sns2webhook_noti_forwarder.lambda_function_name
  principal     = "sns.amazonaws.com"
  source_arn    = aws_sns_topic.this.arn
}

resource "aws_sns_topic_subscription" "sns2webhook_noti_forwarder" {
  topic_arn = aws_sns_topic.this.arn
  protocol  = "lambda"
  endpoint  = module.sns2webhook_noti_forwarder.lambda_function_arn
}
```
Remember the devbox we created in the previous post? It is time to make use of it: try running this stack from your devbox.
I know this is simple stuff and may not need this many lines of text making it sound complicated. But what I am trying to share here is the approach and mindset: solve day-to-day operational work by automating it, and make an engineer's life simpler.
Hopefully you found something interesting in this post. Stay tuned for more...
Title | URL |
---|---|
Part 1: CloudOps with a touch of Automation | Here |
Part 2: Jump Your First Box | Here |
Part 3: SNS Integration Forwarder Function - SNIFF | Here |
Part 4: Automation Start Stop Enhanced System - ASSES | Here |