AWS Cost Optimization Playbook
What you'll achieve: identify hidden cost drivers, implement automated cost controls, optimize instance types and storage, and set up effective monitoring and alerts.
The AWS Cost Optimization Framework
The 4-Phase Approach
- Discovery (Days 1-7): Audit current usage and identify waste
- Quick Wins (Days 8-14): Implement immediate cost reductions
- Strategic Optimization (Weeks 3-8): Long-term architectural improvements
- Continuous Optimization (Ongoing): Automated monitoring and governance
Phase 1: Cost Discovery and Audit
Set Up Cost Monitoring Foundation
Step 1: Enable Cost and Usage Reports
# Create a detailed Cost and Usage Report (the CUR API is only available in us-east-1);
# the S3 bucket, prefix, and region below are placeholders to replace with your own
aws cur put-report-definition --region us-east-1 \
--report-definition '{"ReportName": "detailed-usage-report", "TimeUnit": "DAILY",
  "Format": "Parquet", "Compression": "Parquet", "AdditionalSchemaElements": ["RESOURCES"],
  "S3Bucket": "my-cur-bucket", "S3Prefix": "cur", "S3Region": "us-east-1"}'
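To confirm the report definition was registered, you can list the existing definitions. A minimal sketch using boto3 (the CUR API client must also point at us-east-1):
import boto3

# The Cost and Usage Report API lives in us-east-1 only
cur = boto3.client('cur', region_name='us-east-1')

# List all report definitions and confirm ours is present
for report in cur.describe_report_definitions()['ReportDefinitions']:
    print(report['ReportName'], report['TimeUnit'], report['S3Bucket'])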
Step 2: Implement Cost Allocation Tags
{
"TagSpecifications": [
{
"ResourceType": "instance",
"Tags": [
{"Key": "Environment", "Value": "production"},
{"Key": "Team", "Value": "backend"},
{"Key": "Project", "Value": "api-service"}
]
}
]
}
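The tag specification above applies at launch time. Existing resources can be tagged after the fact and the tag keys then activated as cost allocation tags in the Billing console. A minimal sketch for back-filling tags on resources that already exist (the instance ID is a placeholder):
import boto3

ec2 = boto3.client('ec2')

# Back-fill cost allocation tags on an existing instance
ec2.create_tags(
    Resources=['i-0123456789abcdef0'],  # placeholder instance ID
    Tags=[
        {'Key': 'Environment', 'Value': 'production'},
        {'Key': 'Team', 'Value': 'backend'},
        {'Key': 'Project', 'Value': 'api-service'}
    ]
)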
Identify Top Cost Drivers
import boto3
from datetime import datetime, timedelta
def get_top_cost_services():
ce = boto3.client('ce')
response = ce.get_cost_and_usage(
TimePeriod={
'Start': (datetime.now() - timedelta(days=30)).strftime('%Y-%m-%d'),
'End': datetime.now().strftime('%Y-%m-%d')
},
Granularity='MONTHLY',
Metrics=['BlendedCost'],
GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
)
costs = []
for group in response['ResultsByTime'][0]['Groups']:
service = group['Keys'][0]
cost = float(group['Metrics']['BlendedCost']['Amount'])
costs.append((service, cost))
return sorted(costs, key=lambda x: x[1], reverse=True)[:10]
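A quick way to review the output (column width is arbitrary):
# Print the ten most expensive services over the last 30 days
for service, cost in get_top_cost_services():
    print(f"{service:<45} ${cost:,.2f}")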
Phase 2: Quick Wins (typically 15-25% cost reduction)
Eliminate Idle Resources
Identify Idle EC2 Instances:
def find_idle_instances():
ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')
instances = ec2.describe_instances(
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
)
idle_instances = []
for reservation in instances['Reservations']:
for instance in reservation['Instances']:
instance_id = instance['InstanceId']
# Check CPU utilization for last 7 days
cpu_metrics = cloudwatch.get_metric_statistics(
Namespace='AWS/EC2',
MetricName='CPUUtilization',
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
StartTime=datetime.utcnow() - timedelta(days=7),
EndTime=datetime.utcnow(),
Period=3600,
Statistics=['Average']
)
# Consider idle if average CPU < 5%
if cpu_metrics['Datapoints']:
avg_cpu = sum(dp['Average'] for dp in cpu_metrics['Datapoints']) / len(cpu_metrics['Datapoints'])
if avg_cpu < 5.0:
idle_instances.append({
'InstanceId': instance_id,
'InstanceType': instance.get('InstanceType'),
'AvgCPU': avg_cpu
})
return idle_instances
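After reviewing the list manually, the same API can stop the instances. A minimal sketch with DryRun left on by default so nothing actually changes until you explicitly disable it:
from botocore.exceptions import ClientError

def stop_idle_instances(idle_instances, dry_run=True):
    """Stop instances returned by find_idle_instances(); keep dry_run=True until reviewed."""
    ec2 = boto3.client('ec2')
    instance_ids = [i['InstanceId'] for i in idle_instances]
    if not instance_ids:
        return
    try:
        ec2.stop_instances(InstanceIds=instance_ids, DryRun=dry_run)
    except ClientError as e:
        # A successful dry run raises DryRunOperation instead of stopping anything
        if e.response['Error']['Code'] != 'DryRunOperation':
            raise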
Right-Size Over-Provisioned Resources
EC2 Right-Sizing Analysis:
def analyze_instance_utilization():
    ce = boto3.client('ce')
    response = ce.get_rightsizing_recommendation(
        Service='AmazonEC2',
        Configuration={
            'BenefitsConsidered': True,
            'RecommendationTarget': 'SAME_INSTANCE_FAMILY'
        }
    )
    recommendations = []
    for recommendation in response['RightsizingRecommendations']:
        current_instance = recommendation['CurrentInstance']
        # Only "Modify" recommendations carry target instances; "Terminate" ones do not
        targets = recommendation.get('ModifyRecommendationDetail', {}).get('TargetInstances', [])
        if not targets:
            continue
        target = targets[0]
        savings = float(target['EstimatedMonthlySavings'])
        if savings > 10:  # Ignore recommendations worth less than $10/month
            recommendations.append({
                'instance_id': current_instance['ResourceId'],
                'current_type': current_instance['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                'recommended_type': target['ResourceDetails']['EC2ResourceDetails']['InstanceType'],
                'monthly_savings': savings
            })
    return sorted(recommendations, key=lambda x: x['monthly_savings'], reverse=True)
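Acting on a recommendation means changing the instance type, which requires a stop/start cycle. A minimal sketch (the instance ID and target type in the example call are placeholders):
import boto3

def resize_instance(instance_id, new_type):
    """Stop an instance, change its type, and start it again."""
    ec2 = boto3.client('ec2')
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter('instance_stopped').wait(InstanceIds=[instance_id])
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={'Value': new_type}
    )
    ec2.start_instances(InstanceIds=[instance_id])

# Example (placeholder values):
# resize_instance('i-0123456789abcdef0', 'm5.medium')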
Phase 3: Strategic Optimization
Implement Reserved Instances Strategy
def get_ri_recommendations():
    ce = boto3.client('ce')
    response = ce.get_reservation_purchase_recommendation(
        # Cost Explorer expects the full service name for EC2 RI recommendations
        Service='Amazon Elastic Compute Cloud - Compute',
        PaymentOption='PARTIAL_UPFRONT',
        TermInYears='ONE_YEAR'
    )
    recommendations = []
    for recommendation in response['Recommendations']:
        # RecommendationDetails is a list, one entry per instance type/size
        for rec_details in recommendation['RecommendationDetails']:
            recommendations.append({
                'instance_type': rec_details['InstanceDetails']['EC2InstanceDetails']['InstanceType'],
                'recommended_quantity': rec_details['RecommendedNumberOfInstancesToPurchase'],
                'estimated_monthly_savings': rec_details['EstimatedMonthlySavingsAmount'],
                'upfront_cost': rec_details['UpfrontCost']
            })
    return recommendations
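Compute Savings Plans (used in Case Study 2 below) have an analogous recommendation API. A hedged sketch -- the response field names follow the Cost Explorer GetSavingsPlansPurchaseRecommendation shape and are worth verifying against the docs:
def get_savings_plans_recommendations():
    ce = boto3.client('ce')
    response = ce.get_savings_plans_purchase_recommendation(
        SavingsPlansType='COMPUTE_SP',
        TermInYears='ONE_YEAR',
        PaymentOption='NO_UPFRONT',
        LookbackPeriodInDays='THIRTY_DAYS'
    )
    details = response['SavingsPlansPurchaseRecommendation'].get(
        'SavingsPlansPurchaseRecommendationDetails', [])
    return [
        {
            'hourly_commitment': d['HourlyCommitmentToPurchase'],
            'estimated_monthly_savings': d['EstimatedMonthlySavingsAmount']
        }
        for d in details
    ]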
Leverage Spot Instances
# Spot Fleet request using a diversified allocation strategy across
# multiple instance types for resilience against interruptions
spot_fleet_config = {
    "IamFleetRole": "arn:aws:iam::123456789012:role/fleet-role",
    "AllocationStrategy": "diversified",
    "TargetCapacity": 10,
    "SpotPrice": "0.50",
    "LaunchSpecifications": [
        {
            "ImageId": "ami-12345678",
            "InstanceType": "m5.large",
            "KeyName": "my-key",
            "SecurityGroups": [{"GroupId": "sg-12345678"}],
            "SubnetId": "subnet-12345678"
        },
        {
            "ImageId": "ami-12345678",
            "InstanceType": "m4.large",
            "KeyName": "my-key",
            "SecurityGroups": [{"GroupId": "sg-12345678"}],
            "SubnetId": "subnet-12345678"
        }
    ]
}
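The configuration above is submitted through the RequestSpotFleet API. A minimal sketch using boto3 (all IDs in the config are placeholders):
import boto3

ec2 = boto3.client('ec2')

# Submit the Spot Fleet request defined above
response = ec2.request_spot_fleet(SpotFleetRequestConfig=spot_fleet_config)
print("Spot Fleet request ID:", response['SpotFleetRequestId'])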
Phase 4: Continuous Optimization
Automated Cost Monitoring
def setup_cost_alarms():
    # Billing metrics are published to CloudWatch in us-east-1 only and require
    # "Receive Billing Alerts" to be enabled in the account's billing preferences
    cloudwatch = boto3.client('cloudwatch', region_name='us-east-1')
    cloudwatch.put_metric_alarm(
        AlarmName='MonthlyBudgetExceeded',
        ComparisonOperator='GreaterThanThreshold',
        EvaluationPeriods=1,
        MetricName='EstimatedCharges',
        Namespace='AWS/Billing',
        Dimensions=[{'Name': 'Currency', 'Value': 'USD'}],
        Period=86400,
        Statistic='Maximum',
        Threshold=5000.0,
        ActionsEnabled=True,
        AlarmActions=['arn:aws:sns:us-east-1:123456789012:cost-alerts'],
        AlarmDescription='Monthly AWS charges exceeded $5000'
    )
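Billing alarms only trigger on an absolute threshold. Cost Explorer anomaly detection complements them by flagging unusual spend patterns; a hedged sketch -- the monitor name, subscription name, alert address, and $100 impact threshold are assumptions:
def setup_anomaly_detection():
    ce = boto3.client('ce')
    # Monitor per-service spend for anomalies
    monitor_arn = ce.create_anomaly_monitor(
        AnomalyMonitor={
            'MonitorName': 'ServiceSpendMonitor',
            'MonitorType': 'DIMENSIONAL',
            'MonitorDimension': 'SERVICE'
        }
    )['MonitorArn']
    # Email alerts for anomalies with an estimated impact above $100
    ce.create_anomaly_subscription(
        AnomalySubscription={
            'SubscriptionName': 'CostAnomalyAlerts',
            'MonitorArnList': [monitor_arn],
            'Subscribers': [{'Address': 'cost-alerts@company.com', 'Type': 'EMAIL'}],
            'Frequency': 'DAILY',
            'Threshold': 100.0
        }
    )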
Real-World Case Studies
Case Study 1: E-commerce Startup ($5k/month → $1.8k/month)
Optimizations Applied:
optimization_results = {
'instance_scheduling': {
'action': 'Schedule dev/staging instances 8AM-6PM weekdays',
'savings': '$1200/month'
},
'reserved_instances': {
'action': 'Purchase 6 m5.large RIs for production',
'savings': '$800/month'
},
's3_lifecycle': {
'action': 'Move logs/backups to Glacier after 30 days',
'savings': '$320/month'
},
'rightsizing': {
'action': 'Downsize 4 instances from m5.large to m5.medium',
'savings': '$480/month'
}
}
# Itemized savings: $2,800/month out of a ~$3,200/month (64%) total reduction
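The instance-scheduling line item above is usually implemented as a small scheduled job (for example, a Lambda function on an EventBridge cron). A minimal sketch that starts or stops instances by tag -- the Schedule tag key and office-hours value are assumptions:
import boto3

def set_scheduled_instances_state(action):
    """Start or stop instances tagged Schedule=office-hours.

    Intended to be called from two scheduled jobs, e.g. 'start' at 8AM
    and 'stop' at 6PM on weekdays. The tag key/value are assumptions.
    """
    ec2 = boto3.client('ec2')
    reservations = ec2.describe_instances(
        Filters=[{'Name': 'tag:Schedule', 'Values': ['office-hours']}]
    )['Reservations']
    ids = [i['InstanceId'] for r in reservations for i in r['Instances']]
    if not ids:
        return
    if action == 'start':
        ec2.start_instances(InstanceIds=ids)
    else:
        ec2.stop_instances(InstanceIds=ids)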
Case Study 2: SaaS Company ($15k/month → $7k/month)
Major Optimizations:
- Migrated 60% of workloads to Spot Instances
- Implemented auto-scaling based on CloudWatch metrics
- Used S3 Intelligent Tiering for data lake
- Purchased Compute Savings Plans instead of RIs
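The S3 savings in both case studies come from lifecycle rules. A minimal sketch that transitions objects under a logs/ prefix to Glacier after 30 days and expires them after a year (bucket name and prefix are placeholders):
import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='my-log-bucket',  # placeholder bucket name
    LifecycleConfiguration={
        'Rules': [
            {
                'ID': 'archive-logs',
                'Filter': {'Prefix': 'logs/'},
                'Status': 'Enabled',
                'Transitions': [{'Days': 30, 'StorageClass': 'GLACIER'}],
                'Expiration': {'Days': 365}
            }
        ]
    }
)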
Cost Optimization Tools
Automated Resource Cleanup
#!/bin/bash
# AWS resource cleanup script: stops development instances launched before a cutoff date
# (adjust the LaunchTime cutoff in the --query expression to suit your environment)
echo "Starting AWS resource cleanup..."
# Find running development instances older than the cutoff and stop them
aws ec2 describe-instances \
  --filters "Name=tag:Environment,Values=development" "Name=instance-state-name,Values=running" \
  --query 'Reservations[*].Instances[?LaunchTime<=`2024-01-01T00:00:00.000Z`].[InstanceId]' \
  --output text | \
while read -r instance; do
  if [[ -n "$instance" ]]; then
    echo "Stopping old development instance: $instance"
    aws ec2 stop-instances --instance-ids "$instance"
  fi
done
echo "Resource cleanup completed."
Best Practices and Governance
Implement Comprehensive Tagging
{
"required_tags": {
"Environment": ["production", "staging", "development"],
"Team": ["frontend", "backend", "devops"],
"Project": ["api", "web-app", "mobile-app"],
"Owner": "email@company.com",
"CostCenter": "department-code"
}
}
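Tag standards only pay off if they are enforced. A minimal sketch that flags running instances missing any required key (the key set mirrors the policy above):
import boto3

REQUIRED_TAG_KEYS = {'Environment', 'Team', 'Project', 'Owner', 'CostCenter'}

def find_untagged_instances():
    """Return running instances that are missing one or more required tag keys."""
    ec2 = boto3.client('ec2')
    violations = []
    paginator = ec2.get_paginator('describe_instances')
    pages = paginator.paginate(
        Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
    )
    for page in pages:
        for reservation in page['Reservations']:
            for instance in reservation['Instances']:
                tag_keys = {t['Key'] for t in instance.get('Tags', [])}
                missing = REQUIRED_TAG_KEYS - tag_keys
                if missing:
                    violations.append((instance['InstanceId'], sorted(missing)))
    return violations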
Multi-level Budget Structure
def create_comprehensive_budgets():
budgets_client = boto3.client('budgets')
# Overall company budget
budgets_client.create_budget(
AccountId='123456789012',
Budget={
'BudgetName': 'CompanyWideAWSBudget',
'BudgetLimit': {'Amount': '10000', 'Unit': 'USD'},
'TimeUnit': 'MONTHLY',
'BudgetType': 'COST'
},
NotificationsWithSubscribers=[
{
'Notification': {
'NotificationType': 'ACTUAL',
'ComparisonOperator': 'GREATER_THAN',
'Threshold': 80
},
'Subscribers': [
{'Address': 'finance@company.com', 'SubscriptionType': 'EMAIL'}
]
}
]
)
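A truly multi-level structure adds narrower budgets beneath the company-wide one, for example per team, filtered by the Team cost allocation tag. A hedged sketch -- the CostFilters tag syntax ('TagKeyValue' with 'user:<key>$<value>') and the team email address are assumptions worth verifying against the Budgets API docs:
def create_team_budget(team, monthly_limit):
    budgets_client = boto3.client('budgets')
    budgets_client.create_budget(
        AccountId='123456789012',
        Budget={
            'BudgetName': f'{team}-monthly-budget',
            'BudgetLimit': {'Amount': str(monthly_limit), 'Unit': 'USD'},
            'TimeUnit': 'MONTHLY',
            'BudgetType': 'COST',
            # Assumed filter format: cost allocation tag Team=<team>
            'CostFilters': {'TagKeyValue': [f'user:Team${team}']}
        },
        NotificationsWithSubscribers=[
            {
                'Notification': {
                    'NotificationType': 'ACTUAL',
                    'ComparisonOperator': 'GREATER_THAN',
                    'Threshold': 80
                },
                'Subscribers': [
                    {'Address': f'{team}-leads@company.com', 'SubscriptionType': 'EMAIL'}
                ]
            }
        ]
    )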
Implementation Roadmap
Week 1-2: Foundation Setup
- Enable Cost and Usage Reports
- Implement comprehensive tagging strategy
- Set up Cost Explorer and basic monitoring
- Create initial cost baseline
Week 3-4: Quick Wins Implementation
- Identify and shut down idle resources
- Implement instance scheduling for dev/staging
- Set up S3 lifecycle policies
- Purchase first batch of Reserved Instances
Week 5-8: Strategic Optimization
- Implement auto-scaling groups
- Deploy spot instance workloads
- Purchase Savings Plans or additional RIs
- Set up advanced monitoring and alerting
Expected Results Timeline
- Month 1: 15-25% cost reduction through quick wins
- Month 2: 25-35% cost reduction with strategic optimizations
- Month 3: 30-50% cost reduction with full implementation
Cost Optimization Checklist
Discovery Phase
- Cost and Usage Reports enabled
- Comprehensive tagging implemented
- Cost baseline established
- Top cost drivers identified
Quick Wins Phase
- Idle resources identified and stopped
- Development/staging scheduling implemented
- S3 lifecycle policies applied
- Unattached resources cleaned up
Strategic Phase
- Reserved Instances purchased
- Spot Instances implemented
- Auto-scaling configured
- Savings Plans evaluated
Automation Phase
- Cost monitoring alerts set up
- Automated cleanup scripts deployed
- Anomaly detection configured
- Regular reporting established
Remember: Cost optimization is not a one-time activity but an ongoing process. The key to long-term success is building a culture of cost awareness and implementing automated monitoring and optimization processes.
This guide is based on real experience helping companies reduce AWS costs by millions of dollars. For personalized cost optimization consultation, contact me directly.