Proposed Plan to Mitigate SNAT Port Exhaustion in AKS

Problem Identification and Root Cause Analysis

From the provided SNAT connection count chart, the backend IP addresses and
frontend IP address experiencing SNAT port exhaustion are:

Frontend IP Address: 20.227.30.35

Backend IP Addresses:
10.224.1.166
10.224.1.34
10.224.0.69
10.224.0.199

Possible Root Causes:

High number of concurrent connections from the nodes.
Long-lived idle connections not being closed in a timely manner.
Specific services or applications generating excessive outbound traffic.

Immediate Mitigation:

1. Adjust Pre-allocated Ports per Node:

Increase the number of pre-allocated ports per node from the default 1,024 to
3,000. This can be done without adding public IPs: a single outbound public IP
provides 64,000 SNAT ports, so at 3,000 ports per node the current three-node
pool consumes only 9,000 of them.

2. Reduce TCP Idle Timeout:

Set the TCP idle timeout to 4 minutes to release idle connections faster. This
frees up SNAT ports more quickly and reduces the chance of port exhaustion.

3. Identify the Root Cause:

Monitor and Analyze Metrics:

Use the metrics available in the current setup to identify patterns in SNAT port
exhaustion.
Focus on the frontend and backend IP address connections, along with the
destination IP addresses and ports, to pinpoint the specific services causing the
issue.
Pay attention to spikes in the metrics around specific times or nodes.

Examine Service Usage:

Investigate the usage patterns of frontend applications, API services, SQL
databases, service bus services, and node pool services.
Identify whether any particular service is making excessive outbound connections
that lead to port exhaustion.

Further Investigation Required:

Analyze logs to identify the specific services or applications causing high SNAT
usage.
Monitor the outbound connection patterns of API services and SQL node pool
services.
Immediate Mitigation
Option 1: Increase Pre-Allocated Ports per Node

Current Configuration:

One outbound public IP with 64,000 available ports.
Default port allocation per node: 1,024 ports.

Proposed Change:

Increase port allocation per node to 3,000 ports.

Terraform Configuration:

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
managed_outbound_ip_count = 1
outbound_ip_address_ids = []
outbound_ip_prefix_ids = []

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
***********************************************
Mid-term Mitigation
Option 2: Set TCP Idle Timeout to 4 Minutes

Current Configuration:

Idle TCP connections are released after 30 minutes.

Proposed Change:

Set the TCP idle timeout to 4 minutes to release idle connections faster.

Terraform Configuration:

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
managed_outbound_ip_count = 1
outbound_ip_address_ids = []
outbound_ip_prefix_ids = []

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}

idle_timeout_in_minutes = 4
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
*************************************************
Long-term Strategy
Option 3: Add Additional Outbound Public IPs

Current Configuration:

One outbound public IP with 64,000 available ports.

Proposed Change:

Add multiple outbound public IPs to increase the total number of available SNAT
ports.

Terraform Configuration:

resource "azurerm_public_ip" "lb_public_ip" {


count = 2
name = "myPublicIP${count.index}"
location = "East US"
resource_group_name = "myResourceGroup"
allocation_method = "Static"
sku = "Standard"
}

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
outbound_ip_address_ids = [
azurerm_public_ip.lb_public_ip[0].id,
azurerm_public_ip.lb_public_ip[1].id
]

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}

idle_timeout_in_minutes = 4
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
Monitoring and Diagnostics
Azure Monitor and Log Analytics:

Use Azure Monitor and Log Analytics to collect and analyze logs.

Network Watcher:

Enable Network Watcher to monitor and diagnose network issues (see the sketch
below).
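
As an illustrative sketch, Network Watcher can be enabled through Terraform as
shown below. The resource name is a placeholder, and Azure often creates a default
Network Watcher per region automatically, in which case this resource can be
skipped or imported instead.

Terraform Configuration (illustrative sketch):

resource "azurerm_network_watcher" "network_watcher" {
  # Skip or import this if a default "NetworkWatcher_<region>" already exists.
  name                = "myNetworkWatcher"
  location            = "East US"
  resource_group_name = "myResourceGroup"
}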


Diagnostics Settings:

Configure diagnostics settings to collect metrics and logs.


Terraform Configuration for Enabling Diagnostics:

resource "azurerm_monitor_diagnostic_setting" "aks_diagnostic" {


name = "aksDiagnostics"
target_resource_id = azurerm_kubernetes_cluster.aks_cluster.id
log_analytics_workspace_id =
azurerm_log_analytics_workspace.log_analytics_workspace.id

log {
category = "kube-apiserver"
enabled = true

retention_policy {
enabled = true
days = 30
}
}

metric {
category = "AllMetrics"
enabled = true

retention_policy {
enabled = true
days = 30
}
}
}

resource "azurerm_log_analytics_workspace" "log_analytics_workspace" {


name = "myLogAnalyticsWorkspace"
location = "East US"
resource_group_name = "myResourceGroup"
sku = "PerGB2018"
retention_in_days = 30
}

Cost Implications
The cost implications of implementing these changes include:

Additional Public IPs:

Adding more public IPs will incur additional costs; each public IP address is
billed separately.

Log Analytics Workspace:

Enabling diagnostic settings and sending logs to a Log Analytics workspace will
incur data ingestion and retention costs.

VM Costs:

Costs may increase if the cluster scales up due to autoscaling and a higher node
count.

Note: For detailed cost estimates, use the Azure Pricing Calculator with the
specific services and configurations you plan to use.

Cost Considerations:

Adjusting the idle timeout and the pre-allocated ports per node can be done with
minimal cost impact.
Adding more public IPs will incur additional costs; each Standard public IP address
has an associated cost as per Azure pricing.
Monitor and optimize usage to balance cost and performance.

Recommendations:

1. Regular Monitoring and Alerts:

Set up alerts based on SNAT port usage metrics to proactively identify and address
issues before they lead to port exhaustion (see the alert sketch after this item).
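
The following is an illustrative sketch of such an alert, defined with Terraform
against the AKS outbound load balancer. The metric name (UsedSnatPorts), the
threshold, and the two variables for the load balancer and action group IDs are
assumptions to verify and adapt to your environment.

Terraform Configuration (illustrative sketch):

variable "aks_load_balancer_id" {
  description = "Resource ID of the AKS outbound load balancer (placeholder)."
  type        = string
}

variable "alert_action_group_id" {
  description = "Resource ID of the action group to notify (placeholder)."
  type        = string
}

resource "azurerm_monitor_metric_alert" "snat_port_alert" {
  name                = "snatPortUsageAlert"
  resource_group_name = "myResourceGroup"
  # The AKS-managed outbound load balancer lives in the node resource group;
  # pass its actual resource ID through the placeholder variable above.
  scopes      = [var.aks_load_balancer_id]
  description = "High SNAT port usage on the AKS outbound load balancer."
  severity    = 2
  frequency   = "PT5M"
  window_size = "PT15M"

  criteria {
    metric_namespace = "Microsoft.Network/loadBalancers"
    metric_name      = "UsedSnatPorts"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 2500 # assumed ~80% of the 3,000 ports allocated per node
  }

  action {
    action_group_id = var.alert_action_group_id
  }
}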

2. Review and Optimize Code:

Conduct a thorough review of the application code to ensure efficient use of
connections.
Implement best practices for connection management in your applications.

3. Scaling Considerations:

Regularly review and adjust the load balancer and node configurations based on the
cluster's scaling requirements.
Use the cluster autoscaler to automatically manage the number of nodes in response
to load changes (a minimal autoscaler sketch follows).
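
As an illustrative sketch, the default_node_pool block in the configurations above
could enable the cluster autoscaler as shown below. The minimum and maximum node
counts are assumptions, not recommendations, and newer azurerm provider versions
rename enable_auto_scaling to auto_scaling_enabled.

Terraform Configuration (illustrative sketch):

  # Replaces the default_node_pool block in the azurerm_kubernetes_cluster
  # resources shown earlier.
  default_node_pool {
    name                = "default"
    vm_size             = "Standard_DS2_v2"
    enable_auto_scaling = true
    min_count           = 3 # assumed lower bound
    max_count           = 6 # assumed upper bound
  }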

Conclusion
By increasing the number of pre-allocated ports per node, reducing the TCP idle
timeout, and adding additional outbound public IPs,
you can effectively mitigate SNAT port exhaustion in your AKS cluster. Implementing
monitoring and diagnostic tools will help you
continuously analyze and optimize your setup.
