Proposed Plan to Mitigate SNAT Port Exhaustion in AKS

Problem Identification and Root Cause Analysis

From the provided SNAT connection count chart, the backend IP addresses and
frontend IP address experiencing SNAT port exhaustion are:

Frontend IP Address: 20.227.30.35

Backend IP Addresses:
10.224.1.166
10.224.1.34
10.224.0.69
10.224.0.199

Possible Root Causes:

High number of concurrent connections from the nodes.
Long-lived idle connections not being closed in a timely manner.
Specific services or applications generating excessive outbound traffic.

Immediate Mitigation:

1. Adjust Pre-allocated Ports per Node:

Increase the number of pre-allocated ports per node from the default 1,024 to
3,000. This can be done without adding public IPs: a single outbound public IP
provides 64,000 SNAT ports, so at 3,000 ports per node the current three-node
pool consumes only 9,000 of them.

2. Reduce TCP Idle Timeout:

Set the TCP idle timeout to 4 minutes to release idle connections faster. This
frees up SNAT ports more quickly and reduces the chance of port exhaustion.

3. Identify the Root Cause:

Monitor and Analyze Metrics:

Use the metrics available in the current setup to identify patterns in SNAT port
exhaustion.
Focus on the frontend and backend IP address connections, along with the
destination IP addresses and ports, to pinpoint the specific services causing the
issue.
Pay attention to spikes in the metrics around specific times or nodes.

Examine Service Usage:

Investigate the usage patterns of frontend applications, API services, SQL
databases, service bus services, and node pool services.
Identify whether any particular service is making excessive outbound connections
that lead to port exhaustion.

Further Investigation Required:

Analyze logs to identify the specific services or applications causing high SNAT
usage.
Monitor the outbound connection patterns of API services and SQL node pool
services.
Immediate Mitigation
Option 1: Increase Pre-Allocated Ports per Node

Current Configuration:

One outbound public IP with 64,000 available ports.
Default port allocation per node: 1,024 ports.

Proposed Change:

Increase port allocation per node to 3,000 ports.

Terraform Configuration:

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
managed_outbound_ip_count = 1
outbound_ip_address_ids = []
outbound_ip_prefix_ids = []

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
***********************************************
Mid-term Mitigation
Option 2: Set TCP Idle Timeout to 4 Minutes

Current Configuration:

Idle TCP connections are released after 30 minutes.

Proposed Change:

Set the TCP idle timeout to 4 minutes to release idle connections faster.

Terraform Configuration:

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
managed_outbound_ip_count = 1
outbound_ip_address_ids = []
outbound_ip_prefix_ids = []

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}

idle_timeout_in_minutes = 4
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
*************************************************
Long-term Strategy
Option 3: Add Additional Outbound Public IPs

Current Configuration:

One outbound public IP with 64,000 available ports.

Proposed Change:

Add multiple outbound public IPs to increase the total number of available SNAT
ports.

Terraform Configuration:

resource "azurerm_public_ip" "lb_public_ip" {


count = 2
name = "myPublicIP${count.index}"
location = "East US"
resource_group_name = "myResourceGroup"
allocation_method = "Static"
sku = "Standard"
}

resource "azurerm_kubernetes_cluster" "aks_cluster" {


name = "myAKSCluster"
location = "East US"
resource_group_name = "myResourceGroup"
dns_prefix = "myaks"

default_node_pool {
name = "default"
node_count = 3
vm_size = "Standard_DS2_v2"
}

network_profile {
load_balancer_profile {
outbound_ip_address_ids = [
azurerm_public_ip.lb_public_ip[0].id,
azurerm_public_ip.lb_public_ip[1].id
]

outbound_ports_allocated_per_node {
min_count = 3000
max_count = 3000
}

idle_timeout_in_minutes = 4
}

network_plugin = "azure"
network_policy = "calico"
load_balancer_sku = "standard"
}

identity {
type = "SystemAssigned"
}
}
Monitoring and Diagnostics
Azure Monitor and Log Analytics:

Use Azure Monitor and Log Analytics to collect and analyze logs.

Network Watcher:

Enable Network Watcher to monitor and diagnose network issues (see the sketch
below).
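
As an illustrative sketch, Network Watcher can be enabled through Terraform as
shown below. The resource name is a placeholder, and Azure often creates a default
Network Watcher per region automatically, in which case this resource can be
skipped or imported instead.

Terraform Configuration (illustrative sketch):

resource "azurerm_network_watcher" "network_watcher" {
  # Skip or import this if a default "NetworkWatcher_<region>" already exists.
  name                = "myNetworkWatcher"
  location            = "East US"
  resource_group_name = "myResourceGroup"
}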


Diagnostics Settings:

Configure diagnostics settings to collect metrics and logs.


Terraform Configuration for Enabling Diagnostics:

resource "azurerm_monitor_diagnostic_setting" "aks_diagnostic" {


name = "aksDiagnostics"
target_resource_id = azurerm_kubernetes_cluster.aks_cluster.id
log_analytics_workspace_id =
azurerm_log_analytics_workspace.log_analytics_workspace.id

log {
category = "kube-apiserver"
enabled = true

retention_policy {
enabled = true
days = 30
}
}

metric {
category = "AllMetrics"
enabled = true

retention_policy {
enabled = true
days = 30
}
}
}

resource "azurerm_log_analytics_workspace" "log_analytics_workspace" {


name = "myLogAnalyticsWorkspace"
location = "East US"
resource_group_name = "myResourceGroup"
sku = "PerGB2018"
retention_in_days = 30
}

Cost Implications
The cost implications of implementing these changes include:

Additional Public IPs:

Adding more public IPs will incur additional costs; each public IP address is
billed separately.

Log Analytics Workspace:

Enabling diagnostic settings and sending logs to a Log Analytics workspace will
incur data ingestion and retention costs.

VM Costs:

Costs may increase if the cluster scales up due to autoscaling and a higher node
count.

Note: For detailed cost estimates, use the Azure Pricing Calculator with the
specific services and configurations you plan to use.

Cost Considerations:

Adjusting the idle timeout and the pre-allocated ports per node can be done with
minimal cost impact.
Adding more public IPs will incur additional costs; each Standard public IP address
has an associated cost as per Azure pricing.
Monitor and optimize usage to balance cost and performance.

Recommendations:

1. Regular Monitoring and Alerts:

Set up alerts based on SNAT port usage metrics to proactively identify and address
issues before they lead to port exhaustion (see the alert sketch after this item).
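
The following is an illustrative sketch of such an alert, defined with Terraform
against the AKS outbound load balancer. The metric name (UsedSnatPorts), the
threshold, and the two variables for the load balancer and action group IDs are
assumptions to verify and adapt to your environment.

Terraform Configuration (illustrative sketch):

variable "aks_load_balancer_id" {
  description = "Resource ID of the AKS outbound load balancer (placeholder)."
  type        = string
}

variable "alert_action_group_id" {
  description = "Resource ID of the action group to notify (placeholder)."
  type        = string
}

resource "azurerm_monitor_metric_alert" "snat_port_alert" {
  name                = "snatPortUsageAlert"
  resource_group_name = "myResourceGroup"
  # The AKS-managed outbound load balancer lives in the node resource group;
  # pass its actual resource ID through the placeholder variable above.
  scopes      = [var.aks_load_balancer_id]
  description = "High SNAT port usage on the AKS outbound load balancer."
  severity    = 2
  frequency   = "PT5M"
  window_size = "PT15M"

  criteria {
    metric_namespace = "Microsoft.Network/loadBalancers"
    metric_name      = "UsedSnatPorts"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 2500 # assumed ~80% of the 3,000 ports allocated per node
  }

  action {
    action_group_id = var.alert_action_group_id
  }
}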

2. Review and Optimize Code:

Conduct a thorough review of the application code to ensure efficient use of
connections.
Implement best practices for connection management in your applications.

3. Scaling Considerations:

Regularly review and adjust the load balancer and node configurations based on the
cluster's scaling requirements.
Use the cluster autoscaler to automatically manage the number of nodes in response
to load changes (a minimal autoscaler sketch follows).
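
As an illustrative sketch, the default_node_pool block in the configurations above
could enable the cluster autoscaler as shown below. The minimum and maximum node
counts are assumptions, not recommendations, and newer azurerm provider versions
rename enable_auto_scaling to auto_scaling_enabled.

Terraform Configuration (illustrative sketch):

  # Replaces the default_node_pool block in the azurerm_kubernetes_cluster
  # resources shown earlier.
  default_node_pool {
    name                = "default"
    vm_size             = "Standard_DS2_v2"
    enable_auto_scaling = true
    min_count           = 3 # assumed lower bound
    max_count           = 6 # assumed upper bound
  }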

Conclusion
By increasing the number of pre-allocated ports per node, reducing the TCP idle
timeout, and adding additional outbound public IPs,
you can effectively mitigate SNAT port exhaustion in your AKS cluster. Implementing
monitoring and diagnostic tools will help you
continuously analyze and optimize your setup.
