AWS Cost Optimization Playbook

What you'll achieve: identify hidden cost drivers, implement automated cost controls, optimize instance types and storage, and set up effective monitoring and alerts.

The AWS Cost Optimization Framework

The 4-Phase Approach

  1. Discovery (Days 1-7): Audit current usage and identify waste
  2. Quick Wins (Days 8-14): Implement immediate cost reductions
  3. Strategic Optimization (Weeks 3-8): Long-term architectural improvements
  4. Continuous Optimization (Ongoing): Automated monitoring and governance

Phase 1: Cost Discovery and Audit

Set Up Cost Monitoring Foundation

Step 1: Enable Cost and Usage Reports

# AWS CLI command to create a detailed Cost and Usage Report
# (the S3 bucket, prefix, and region below are placeholders -- substitute your own;
#  the cur API is only available in us-east-1)
aws cur put-report-definition \
  --region us-east-1 \
  --report-definition 'ReportName=detailed-usage-report,TimeUnit=DAILY,Format=Parquet,Compression=Parquet,AdditionalSchemaElements=RESOURCES,S3Bucket=my-billing-reports,S3Prefix=cur,S3Region=us-east-1,ReportVersioning=OVERWRITE_REPORT'

Step 2: Implement Cost Allocation Tags

{
  "TagSpecifications": [
    {
      "ResourceType": "instance",
      "Tags": [
        {"Key": "Environment", "Value": "production"},
        {"Key": "Team", "Value": "backend"},
        {"Key": "Project", "Value": "api-service"}
      ]
    }
  ]
}
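
Launch-time tag specifications only cover new resources. Existing resources can be tagged after the fact, and the tag keys must also be activated as cost allocation tags in the Billing console before they show up in Cost Explorer. A minimal sketch for back-tagging a running instance with boto3 (the instance ID is a placeholder):

import boto3

ec2 = boto3.client('ec2')

# Apply the same cost allocation tags to an existing instance
ec2.create_tags(
    Resources=['i-0abc1234def567890'],
    Tags=[
        {'Key': 'Environment', 'Value': 'production'},
        {'Key': 'Team', 'Value': 'backend'},
        {'Key': 'Project', 'Value': 'api-service'}
    ]
)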

Identify Top Cost Drivers

import boto3
from datetime import datetime, timedelta

def get_top_cost_services():
    ce = boto3.client('ce')
    
    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
            'End': datetime.now().strftime('%Y-%m-%d')
        },
        Granularity='MONTHLY',
        Metrics=['BlendedCost'],
        GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
    )
    
    costs = []
    for group in response['ResultsByTime'][0]['Groups']:
        service = group['Keys'][0]
        cost = float(group['Metrics']['BlendedCost']['Amount'])
        costs.append((service, cost))
    
    return sorted(costs, key=lambda x: x[1], reverse=True)[:10]
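
A quick way to run this and review the output (a sketch):

if __name__ == '__main__':
    for service, cost in get_top_cost_services():
        print(f"{service}: ${cost:,.2f}")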

Phase 2: Quick Wins (15-25% cost reduction)

Eliminate Idle Resources

Identify Idle EC2 Instances:

def find_idle_instances():
    ec2 = boto3.client('ec2')
    cloudwatch = boto3.client('cloudwatch')
    
    instances = ec2.describe_instances(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    
    idle_instances = []
    
    for reservation in instances['Reservations']:
        for instance in reservation['Instances']:
            instance_id = instance['InstanceId']
            
            # Check CPU utilization for last 7 days
            cpu_metrics = cloudwatch.get_metric_statistics(
                Namespace='AWS/EC2',
                MetricName='CPUUtilization',
                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                StartTime=datetime.utcnow() - timedelta(days=7),
                EndTime=datetime.utcnow(),
                Period=3600,
                Statistics=['Average']
            )
            
            # Consider idle if average CPU < 5%
            if cpu_metrics['Datapoints']:
                avg_cpu = sum(dp['Average'] for dp in cpu_metrics['Datapoints']) / len(cpu_metrics['Datapoints'])
                if avg_cpu < 5.0:
                    idle_instances.append({
                        'InstanceId': instance_id,
                        'InstanceType': instance.get('InstanceType'),
                        'AvgCPU': avg_cpu
                    })
    
    return idle_instances
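
Once the list has been reviewed manually, the flagged instances can be stopped in a single call (a sketch; stopping rather than terminating, so anything flagged by mistake can be restarted):

def stop_instances(idle_instances):
    """Stop the instances returned by find_idle_instances (review the list first)."""
    ec2 = boto3.client('ec2')
    instance_ids = [i['InstanceId'] for i in idle_instances]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids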

Right-Size Over-Provisioned Resources

EC2 Right-Sizing Analysis:

def analyze_instance_utilization():
    ce = boto3.client('ce')

    response = ce.get_rightsizing_recommendation(
        Service='AmazonEC2',
        Configuration={
            'BenefitsConsidered': True,
            'RecommendationTarget': 'SAME_INSTANCE_FAMILY'
        }
    )

    recommendations = []
    for recommendation in response['RightsizingRecommendations']:
        current_instance = recommendation['CurrentInstance']
        modify_detail = recommendation.get('ModifyRecommendationDetail')

        # Skip terminate-type recommendations; we only want resize suggestions
        if not modify_detail or not modify_detail.get('TargetInstances'):
            continue

        target = modify_detail['TargetInstances'][0]
        savings = float(target['EstimatedMonthlySavings'])

        # Only surface recommendations worth more than $10/month
        if savings > 10:
            recommendations.append({
                'instance_id': current_instance['ResourceId'],
                'current_type': current_instance['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                'recommended_type': target['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                'monthly_savings': savings
            })

    return sorted(recommendations, key=lambda x: x['monthly_savings'], reverse=True)

Phase 3: Strategic Optimization

Implement Reserved Instances Strategy

def get_ri_recommendations():
    ce = boto3.client('ce')

    response = ce.get_reservation_purchase_recommendation(
        # Cost Explorer expects the full service name here
        Service='Amazon Elastic Compute Cloud - Compute',
        PaymentOption='PARTIAL_UPFRONT',
        TermInYears='ONE_YEAR'
    )

    recommendations = []
    for recommendation in response['Recommendations']:
        # Each recommendation carries a list of per-instance-type purchase details
        for rec_details in recommendation['RecommendationDetails']:
            instance_details = rec_details['InstanceDetails']['EC2InstanceDetails']

            recommendations.append({
                'instance_type': instance_details['InstanceType'],
                'recommended_quantity': rec_details['RecommendedNumberOfInstancesToPurchase'],
                'estimated_monthly_savings': rec_details['EstimatedMonthlySavingsAmount'],
                'upfront_cost': rec_details['UpfrontCost']
            })

    return recommendations

Leverage Spot Instances

# Mixed instance types across launch specifications for resilience
# (AMI, key, security group, and subnet IDs are placeholders)
spot_fleet_config = {
    "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
    "AllocationStrategy": "diversified",
    "TargetCapacity": 10,
    "SpotPrice": "0.50",
    "LaunchSpecifications": [
        {
            "ImageId": "ami-12345678",
            "InstanceType": "m5.large",
            "KeyName": "my-key",
            "SecurityGroups": [{"GroupId": "sg-12345678"}],
            "SubnetId": "subnet-12345678"
        },
        {
            "ImageId": "ami-12345678",
            "InstanceType": "m5a.large",
            "KeyName": "my-key",
            "SecurityGroups": [{"GroupId": "sg-12345678"}],
            "SubnetId": "subnet-12345678"
        }
    ]
}
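
Submitting the fleet request is a single API call; a minimal sketch using boto3 (all IDs above are placeholders):

ec2 = boto3.client('ec2')

# Submit the Spot Fleet request defined above and keep its ID for later tracking
response = ec2.request_spot_fleet(SpotFleetRequestConfig=spot_fleet_config)
print(response['SpotFleetRequestId'])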

Phase 4: Continuous Optimization

Automated Cost Monitoring

def setup_cost_alarms():
    # Billing metrics are only published to CloudWatch in us-east-1, and
    # "Receive Billing Alerts" must be enabled in the billing preferences first
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')

    cloudwatch.put_metric_alarm(
        AlarmName='MonthlyBudgetExceeded',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=1,
        MetricName='EstimatedCharges',
        Namespace='AWS/Billing',
        Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
        Period=86400,
        Statistic='Maximum',
        Threshold=5000.0,
        ActionsEnabled=True,
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:cost-alerts'],
        AlarmDescription='Monthly AWS charges exceeded $5000'
    )
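
CloudWatch alarms catch absolute thresholds; AWS Cost Anomaly Detection catches unusual spend patterns. A minimal sketch of a per-service monitor with a daily email digest (the names, threshold, and address are assumptions):

def setup_anomaly_detection():
    ce = boto3.client('ce')

    # Monitor spend anomalies broken down by AWS service
    monitor = ce.create_anomaly_monitor(
        AnomalyMonitor={
            'MonitorName': 'ServiceSpendMonitor',
            'MonitorType': 'DIMENSIONAL',
            'MonitorDimension': 'SERVICE'
        }
    )

    # Email the team daily when an anomaly's estimated impact exceeds $100
    ce.create_anomaly_subscription(
        AnomalySubscription={
            'SubscriptionName': 'DailyAnomalyDigest',
            'MonitorArnList': [monitor['MonitorArn']],
            'Subscribers': [{'Address': 'devops@company.com', 'Type': 'EMAIL'}],
            'Threshold': 100.0,
            'Frequency': 'DAILY'
        }
    )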

Real-World Case Studies

Case Study 1: E-commerce Startup ($5k/month → $1.8k/month)

Optimizations Applied:

optimization_results = {
    'instance_scheduling': {
        'action': 'Schedule dev/staging instances 8AM-6PM weekdays',
        'savings': '$1200/month'
    },
    'reserved_instances': {
        'action': 'Purchase 6 m5.large RIs for production',
        'savings': '$800/month'
    },
    's3_lifecycle': {
        'action': 'Move logs/backups to Glacier after 30 days',
        'savings': '$320/month'
    },
    'rightsizing': {
        'action': 'Downsize 4 instances from m5.large to m5.medium',
        'savings': '$480/month'
    }
}

# Itemized savings above: $2,800/month; overall bill went from $5k to $1.8k/month (a 64% reduction)
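
The instance_scheduling line item is usually the easiest to reproduce. A minimal sketch of the "stop" half, meant to be triggered by a scheduled Lambda or cron job at the end of the workday (the tag values and schedule are assumptions):

def stop_non_production_instances():
    """Stop running dev/staging instances; schedule for ~6PM on weekdays."""
    ec2 = boto3.client('ec2')
    response = ec2.describe_instances(
        Filters=[
            {'Name': 'tag:Environment', 'Values': ['development', 'staging']},
            {'Name': 'instance-state-name', 'Values': ['running']}
        ]
    )
    instance_ids = [
        instance['InstanceId']
        for reservation in response['Reservations']
        for instance in reservation['Instances']
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
    return instance_ids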

Case Study 2: SaaS Company ($15k/month → $7k/month)

Major Optimizations:

  • Migrated 60% of workloads to Spot Instances
  • Implemented auto-scaling based on CloudWatch metrics
  • Used S3 Intelligent Tiering for data lake
  • Purchased Compute Savings Plans instead of RIs (see the sketch after this list)
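
Savings Plans recommendations can be pulled from the same Cost Explorer API used for RIs above; a minimal sketch (the parameter choices are assumptions, and field access is kept defensive because the response is rich):

def get_savings_plans_recommendations():
    ce = boto3.client('ce')

    response = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType='COMPUTE_SP',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='THIRTY_DAYS'
    )

    recommendation = response.get('SavingsPlansPurchaseRecommendation', {})
    return [
        {
            'hourly_commitment': detail.get('HourlyCommitmentToPurchase'),
            'estimated_monthly_savings': detail.get('EstimatedMonthlySavingsAmount')
        }
        for detail in recommendation.get('SavingsPlansPurchaseRecommendationDetails', [])
    ]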

Cost Optimization Tools

Automated Resource Cleanup

#!/bin/bash
# AWS Resource Cleanup Script
echo "Starting AWS resource cleanup..."

# Stop development instances that have been running since before the cutoff date
# (adjust the date below, or replace this filter with your own idle-detection logic)
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=development" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[?LaunchTime<=`2024-01-01T00:00:00.000Z`].[InstanceId]' \
  --output text | \
while read -r instance; do
  if [[ -n "$instance" ]]; then
    echo "Stopping long-running development instance: $instance"
    aws ec2 stop-instances --instance-ids "$instance"
  fi
done

echo "Resource cleanup completed."

Best Practices and Governance

Implement Comprehensive Tagging

{
  "required_tags": {
    "Environment": ["production", "staging", "development"],
    "Team": ["frontend", "backend", "devops"],
    "Project": ["api", "web-app", "mobile-app"],
    "Owner": "email@company.com",
    "CostCenter": "department-code"
  }
}
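
Required tags only help if they are actually applied, so audit them regularly. A minimal sketch that flags running instances missing any required tag key (the key list mirrors the policy above):

import boto3

REQUIRED_TAG_KEYS = {'Environment', 'Team', 'Project', 'Owner', 'CostCenter'}

def find_untagged_instances():
    """Return running instances missing one or more required tag keys."""
    ec2 = boto3.client('ec2')
    untagged = []
    paginator = ec2.get_paginator('describe_instances')
    for page in paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    ):
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tag_keys = {tag['Key'] for tag in instance.get('Tags', [])}
                missing = REQUIRED_TAG_KEYS - tag_keys
                if missing:
                    untagged.append({
                        'InstanceId': instance['InstanceId'],
                        'MissingTags': sorted(missing)
                    })
    return untagged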

Multi-level Budget Structure

def create_comprehensive_budgets():
    budgets_client = boto3.client('budgets')
    
    # Overall company budget
    budgets_client.create_budget(
        AccountId='123456789012',
        Budget={
            'BudgetName': 'CompanyWideAWSBudget',
            'BudgetLimit': {'Amount': '10000', 'Unit': 'USD'},
            'TimeUnit': 'MONTHLY',
            'BudgetType': 'COST'
        },
        NotificationsWithSubscribers=[
            {
                'Notification': {
                    'NotificationType': 'ACTUAL',
                    'ComparisonOperator': 'GREATER_THAN',
                    'Threshold': 80
                },
                'Subscribers': [
                    {'Address': 'finance@company.com', 'SubscriptionType': 'EMAIL'}
                ]
            }
        ]
    )
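
The same API supports the lower levels of the hierarchy by filtering on cost allocation tags. A sketch of a per-team budget, assuming the Team tag has been activated as a cost allocation tag (account ID, amount, and addresses are placeholders):

def create_team_budget():
    budgets_client = boto3.client('budgets')

    budgets_client.create_budget(
        AccountId='123456789012',
        Budget={
            'BudgetName': 'BackendTeamBudget',
            'BudgetLimit': {'Amount': '3000', 'Unit': 'USD'},
            'TimeUnit': 'MONTHLY',
            'BudgetType': 'COST',
            # Scope the budget to resources tagged Team=backend
            'CostFilters': {'TagKeyValue': ['user:Team$backend']}
        },
        NotificationsWithSubscribers=[
            {
                'Notification': {
                    'NotificationType': 'ACTUAL',
                    'ComparisonOperator': 'GREATER_THAN',
                    'Threshold': 80
                },
                'Subscribers': [
                    {'Address': 'backend-lead@company.com', 'SubscriptionType': 'EMAIL'}
                ]
            }
        ]
    )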

Implementation Roadmap

Week 1-2: Foundation Setup

  • Enable Cost and Usage Reports
  • Implement comprehensive tagging strategy
  • Set up Cost Explorer and basic monitoring
  • Create initial cost baseline

Week 3-4: Quick Wins Implementation

  • Identify and shut down idle resources
  • Implement instance scheduling for dev/staging
  • Set up S3 lifecycle policies (a sketch follows this list)
  • Purchase first batch of Reserved Instances
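
For the lifecycle item, a minimal sketch that transitions objects under a logs/ prefix to Glacier after 30 days and expires them after a year (the bucket name, prefix, and retention periods are assumptions):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-log-archive-bucket',
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-then-expire-logs',
                'Filter': {'Prefix': 'logs/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
                'Expiration': {'Days': 365}
            }
        ]
    }
)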

Week 5-8: Strategic Optimization

  • Implement auto-scaling groups
  • Deploy spot instance workloads
  • Purchase Savings Plans or additional RIs
  • Set up advanced monitoring and alerting

Expected Results Timeline

  • Month 1: 15-25% cost reduction through quick wins
  • Month 2: 25-35% cost reduction with strategic optimizations
  • Month 3: 30-50% cost reduction with full implementation

Cost Optimization Checklist

Discovery Phase

  • Cost and Usage Reports enabled
  • Comprehensive tagging implemented
  • Cost baseline established
  • Top cost drivers identified

Quick Wins Phase

  • Idle resources identified and stopped
  • Development/staging scheduling implemented
  • S3 lifecycle policies applied
  • Unattached resources cleaned up

Strategic Phase

  • Reserved Instances purchased
  • Spot Instances implemented
  • Auto-scaling configured
  • Savings Plans evaluated

Automation Phase

  • Cost monitoring alerts set up
  • Automated cleanup scripts deployed
  • Anomaly detection configured
  • Regular reporting established

Remember: Cost optimization is not a one-time activity but an ongoing process. The key to long-term success is building a culture of cost awareness and implementing automated monitoring and optimization processes.


This guide is based on real experience helping companies reduce AWS costs by millions of dollars. For personalized cost optimization consultation, contact me directly.