A “performance issue” is every developer’s nightmare, and this is especially true when all your components are in the cloud. I recently went through the tedious journey of figuring out why our existing Web API, connected to Azure Table storage, suddenly started taking 20-30 seconds to return a response instead of about a second.
When an existing Web API starts acting weird and slows down, the analysis has to go in several directions, including but not limited to a sudden increase in load, memory leaks, CPU utilization, and so on. My journey started there and went through different Azure support teams, with lessons learned along the way. I have put down these learnings in the hope that someone will benefit from them –
a. Increase in load/hits
Let’s check if there is a significant increase in the load/hits to your API/App compared to the days when it was working as expected. If so, start thinking about increasing the number of instances or implementing Traffic Manager to distribute the load.
b. Memory leaks & CPU utilization checks
This should be the next area to check. The Azure portal provides these details as self-explanatory graphs under the “Diagnose and solve problems” section. A sample of such graphs is shown below –
Fig: Memory Usage
Fig: CPU Utilization
If these graphs indicate a spike, start looking at the API/App code and fix the faulty code.
c. NAT limitation check
What is NAT?
Most APIs/Apps have to make outbound calls to other endpoints. These include calls to Azure SQL DB and/or Azure Storage, as well as calls to other applications over HTTP/HTTPS, for example a search API or another of your APIs that implements the core logic of your application. In all these cases, the calling API/App implicitly opens a network socket to make the outbound call. All such calls made from an API/App on Azure App Service to a remote endpoint rely on Azure Networking to set up and manage a table of Network Address Translation (NAT) mappings.
Creating and removing entries in this NAT table takes time, and there is a limit on the total number of NAT mappings that can be established for a single Azure App Service. App Service therefore limits the number of outbound connections that can be outstanding at any given point in time. The maximum connection limit depends on the App Service pricing tier; the limits per tier are listed below (please check the latest numbers on Azure’s website) –
- 1,920 connections per B1/S1/P1 instance
- 3,968 connections per B2/S2/P2 instance
- 8,064 connections per B3/S3/P3 instance
- 64K max upper limit per App Service Environment
Based on these details, check your API/App’s maximum number of outbound connections at any given time and compare it with your App Service limit. This could be the reason for the performance issue, as additional connection requests are either queued and processed in order or fail intermittently.
The most important thing to look for in the code is any “leaky” connections, which invariably run into these connection limits. Remember, it is always good practice to close connections explicitly after receiving the response.
Other best practices to avoid this situation –
- Use database connection pooling.
- For outbound HTTP/HTTPS calls, pool and reuse instances of System.Net.Http.HttpClient, or use keep-alive connections with System.Net.HttpWebRequest (a short sketch follows the note below).
Note: Remember to increase System.Net.ServicePointManager.DefaultConnectionLimit, because you’ll otherwise be limited to two concurrent outbound connections to the same endpoint.
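To illustrate, here is a minimal sketch of reusing a single HttpClient and raising the default connection limit, assuming a .NET Framework Web API; the endpoint URL and the limit of 50 are placeholders, not recommendations –

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class SearchApiClient
{
    // A single, shared HttpClient is reused for all calls to the same endpoint,
    // so sockets are pooled instead of being opened per request.
    private static readonly HttpClient Client = new HttpClient
    {
        BaseAddress = new Uri("https://example-search-api.azurewebsites.net/") // placeholder endpoint
    };

    static SearchApiClient()
    {
        // Raise the default of 2 concurrent connections per endpoint.
        // The value 50 is only an example; size it to your workload and tier limits.
        ServicePointManager.DefaultConnectionLimit = 50;
    }

    public static async Task<string> GetAsync(string relativePath)
    {
        // Disposing the response releases the underlying connection back to the pool.
        using (var response = await Client.GetAsync(relativePath))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}
```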
Check
if your application is under the limit.
d. Enable Table storage tracing & analysis
- Enable Table storage logging so that the requests made to the Table service are recorded (a sketch of enabling this from code follows the note below).
- Then download the logs from the $logs blob container with Storage Explorer (www.storageexplorer.com). The logs will show every request made to storage, including client and server latency, content size, etc.
Note: These logs cannot be viewed/downloaded from Visual Studio’s Cloud Explorer.
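As a rough sketch, logging can also be enabled from code rather than from the portal. The snippet below assumes the older Microsoft.WindowsAzure.Storage SDK and a placeholder connection string; adjust it to whichever storage SDK version you are on –

```csharp
using System.Threading.Tasks;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Shared.Protocol;

public static class TableLoggingSetup
{
    public static async Task EnableTableLoggingAsync(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var tableClient = account.CreateCloudTableClient();

        // Fetch the current service properties, switch on logging for all
        // operations, and keep the logs for 7 days (example retention only).
        ServiceProperties properties = await tableClient.GetServicePropertiesAsync();
        properties.Logging.LoggingOperations = LoggingOperations.All;
        properties.Logging.RetentionDays = 7;
        properties.Logging.Version = "1.0";

        await tableClient.SetServicePropertiesAsync(properties);
    }
}
```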
e. Throttling
a. Table storage APIs
This is a long shot but very much possible. Just like our custom APIs/App Services, the Table storage service scales up and down based on traffic. There is a possibility that the Table storage APIs were being throttled at that moment while the service was busy scaling up.
You may be surprised to hear this, but in my case this was the reason for the slow performance: the scaling up of the Table Storage APIs ran into a problem and took between 1-2 hours to get up and running. Until then, it was throwing 503 ServerBusy (ServerPartitionRequestThrottlingError). I had to raise a ticket with the Azure Storage team to get this confirmed.
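There is not much you can do about server-side throttling other than backing off and retrying. As a minimal sketch (again assuming the Microsoft.WindowsAzure.Storage SDK), an exponential retry policy keeps the client from hammering a busy partition; the delay and attempt count are example values only –

```csharp
using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.RetryPolicies;
using Microsoft.WindowsAzure.Storage.Table;

public static class TableClientFactory
{
    public static CloudTableClient Create(string connectionString)
    {
        var account = CloudStorageAccount.Parse(connectionString);
        var client = account.CreateCloudTableClient();

        // Retry throttled (503 ServerBusy) requests with exponential backoff:
        // up to 5 attempts, starting at a 2-second delay (example values only).
        client.DefaultRequestOptions.RetryPolicy =
            new ExponentialRetry(TimeSpan.FromSeconds(2), 5);
        client.DefaultRequestOptions.ServerTimeout = TimeSpan.FromSeconds(30);

        return client;
    }
}
```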
b. Your API/App
Check whether your API/App itself is being throttled and causing this performance degradation. If so, it is time to start thinking about implementing throttling logic (a minimal sketch follows).
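What that logic looks like depends on your stack; the snippet below is only a minimal sketch of capping concurrent requests with a message handler in ASP.NET Web API, where the limit of 100 is a placeholder –

```csharp
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

// Rejects requests with HTTP 429 once the configured number of concurrent
// requests is reached, instead of letting them pile up.
public class ConcurrencyLimitHandler : DelegatingHandler
{
    private static readonly SemaphoreSlim Slots = new SemaphoreSlim(100); // placeholder limit

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Try to take a slot without waiting; refuse the request if none is free.
        if (!await Slots.WaitAsync(0, cancellationToken))
        {
            return new HttpResponseMessage((HttpStatusCode)429)
            {
                ReasonPhrase = "Too Many Requests",
                RequestMessage = request
            };
        }

        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            Slots.Release();
        }
    }
}
```

The handler would be registered in your Web API configuration, for example via GlobalConfiguration.Configuration.MessageHandlers.Add(new ConcurrencyLimitHandler()).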
I hope this helped you find the reason for your API/App’s performance degradation. If you think there are other things to be considered, please put them in the comments section and I will include them. Thank you.