Tuesday, December 18, 2018

Performance degradation – Debugging existing Azure API/App


“Performance issue” is every developer’s nightmare, and that is doubly true when all your components are in the cloud. I recently went through the tedious journey of figuring out why our existing Web API, backed by Azure Table storage, suddenly started taking 20-30 seconds to return a response instead of under a second.

When an existing Web API starts acting weird and slows down, the analysis has to go in several directions. It includes, but is not limited to, a sudden increase in load, memory leaks, CPU utilization, and so on. My journey started there and went through different Azure support teams, with plenty of learning in between. I have put these learnings down here in the hope that someone else will benefit from them –

a.       Increase in load/hits

Check whether there is a significant increase in the load/hits to your API/App compared to the days when it was working as expected. If so, start thinking about increasing the number of instances, putting Azure Traffic Manager in front to distribute the load, etc.

b.       Memory leaks & CPU utilization checks

This should be the next area to check. The Azure portal provides these details under the “Diagnose and solve problems” section, and the graphs are largely self-explanatory. A sample of one such graph is shown below –


Fig: Memory Usage



Fig: CPU Utilization

If these graphs show a spike, start looking at your API/App code and fix the faulty code.

c.       NAT limitation check

What is NAT?
Most APIs/Apps have to make outbound calls to other endpoints. These include calls to Azure SQL Database and/or Azure Storage, as well as calls to other applications over HTTP/HTTPS – for example, calling a search API or calling another of your APIs that implements the core logic of your application. In all these cases, the calling API/App implicitly opens a network socket and makes an outbound call. Every such call made from an API/App on Azure App Service to a remote endpoint relies on Azure Networking to set up and manage a table of Network Address Translation (NAT) mappings.

Creating and removing entries in this NAT mapping table takes time, and there is a limit on the total number of NAT mappings that can be established for a single Azure App Service. App Service therefore limits the number of outbound connections that can be outstanding at any given point in time. The maximum connection limit depends on the App Service pricing tier; the limits per tier are listed below (please do check the latest numbers on Azure’s website):

-          1,920 connections per B1/S1/P1 instance
-          3,968 connections per B2/S2/P2 instance
-          8,064 connections per B3/S3/P3 instance
-          64K max upper limit per App Service Environment

Based on these details, check what your API/App’s maximum number of outbound connections is at any point in time and compare it against your App Service limit. This could be the reason for the performance issue, because the additional connection requests are either queued and processed in order, or fail intermittently.

The most important thing to look for in the code is any “leaky” connection that will invariably run into these connection limits. Remember, it is always good practice to close connections explicitly after receiving the response.

Other best practices to avoid this situation –
-         Use database connection pooling
-  For making outbound HTTP/HTTPS calls, pool and reuse instances of System.Net.Http.HttpClient, or use keep-alive connections with System.Net.HttpWebRequest (a minimal sketch follows at the end of this section).
Note: Remember to increase System.Net.ServicePointManager.DefaultConnectionLimit, because otherwise you will be limited to two concurrent outbound connections to the same endpoint.

Check if your application is under the limit.
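To make the HttpClient point concrete, below is a minimal sketch in C# (assuming a .NET Framework Web API; the class name, endpoint URL and connection limit are made up for illustration) showing a single shared HttpClient and a raised DefaultConnectionLimit:

using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

public static class DownstreamClient
{
    // One shared HttpClient for the lifetime of the app, instead of
    // creating (and leaking) a new one per incoming request.
    private static readonly HttpClient Client = new HttpClient();

    static DownstreamClient()
    {
        // The default is only 2 concurrent connections per endpoint on
        // .NET Framework; raise it so parallel outbound calls are not serialized.
        ServicePointManager.DefaultConnectionLimit = 50;
    }

    public static async Task<string> GetOrdersAsync()
    {
        // Placeholder endpoint, replace with your downstream API.
        using (var response = await Client.GetAsync("https://example.com/api/orders"))
        {
            response.EnsureSuccessStatusCode();
            return await response.Content.ReadAsStringAsync();
        }
    }
}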

d.       Enable Table storage tracing & analysis

-          Enable Table storage logging (Storage Analytics), so that every request made to the Table service is recorded. This can be turned on from the portal or from code; a small sketch is included after this list.



-  Then download the logs from the $logs blob container using Azure Storage Explorer (www.storageexplorer.com). The logs show every request made to storage, including client and server latency, content size, etc.

Note: This cannot be viewed/downloaded from Visual Studio’s Cloud Explorer.

-          Details about the format and how to read these logs can be found here.
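If you prefer to turn the logging on from code rather than from the portal, here is a rough sketch using the classic WindowsAzure.Storage SDK (the connection string placeholder and the 7-day retention are examples only; double-check against the current SDK documentation):

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Shared.Protocol;

class EnableTableLogging
{
    static void Main()
    {
        // Placeholder: read the real connection string from configuration.
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var tableClient = account.CreateCloudTableClient();

        // Fetch the current analytics settings and enable logging for
        // read, write and delete requests, keeping the logs for 7 days.
        ServiceProperties properties = tableClient.GetServiceProperties();
        properties.Logging.LoggingOperations = LoggingOperations.All;
        properties.Logging.RetentionDays = 7;
        properties.Logging.Version = "1.0";

        tableClient.SetServiceProperties(properties);
    }
}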



e.       Throttling
a.       Table storage APIs
This is a long shot but very much possible. Just like our custom APIs/App Services, the Table storage service scales up and down based on traffic. There is a possibility that Table storage requests were being throttled at that moment while the service was busy scaling up.
You may be surprised to hear this, but in my case this turned out to be the reason for the slow performance: scaling up of the Table Storage partition ran into a problem and took 1-2 hours to complete. Until then, requests were failing with 503 ServerBusy (ServerPartitionRequestThrottlingError). I had to raise a ticket with the Azure Storage team to get this confirmed.
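You cannot do much about the storage service itself scaling, but you can make sure the storage SDK retries throttled requests with a back-off instead of surfacing the 503 straight away. A minimal sketch with the classic WindowsAzure.Storage SDK (table name, keys and delays are just examples):

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.RetryPolicies;
using Microsoft.WindowsAzure.Storage.Table;

class ThrottleAwareRead
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>");
        var table = account.CreateCloudTableClient().GetTableReference("orders"); // example table name

        // Exponential back-off: retry up to 5 times, starting at roughly 2 seconds,
        // so transient 503 ServerBusy responses are absorbed instead of failing the call.
        var options = new TableRequestOptions
        {
            RetryPolicy = new ExponentialRetry(TimeSpan.FromSeconds(2), 5)
        };

        var retrieve = TableOperation.Retrieve("example-partition-key", "example-row-key");
        TableResult result = table.Execute(retrieve, options);
        Console.WriteLine(result.HttpStatusCode);
    }
}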
b.       Your API/App
Check whether your own API/App is being throttled and causing this performance degradation. If so, it is time to start thinking about implementing throttling logic.
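As a starting point, here is a rough sketch of what that throttling logic could look like in an ASP.NET Web API pipeline, using a message handler that caps concurrent requests (the handler name and the limit of 100 are made up for illustration):

using System;
using System.Net;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public class ConcurrencyThrottleHandler : DelegatingHandler
{
    // Allow at most 100 requests to be processed at the same time.
    private static readonly SemaphoreSlim Slots = new SemaphoreSlim(100);

    protected override async Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // If no slot is free, answer 429 (Too Many Requests) immediately.
        if (!await Slots.WaitAsync(TimeSpan.Zero))
        {
            return new HttpResponseMessage((HttpStatusCode)429)
            {
                Content = new StringContent("Too many requests, please retry later."),
                RequestMessage = request
            };
        }

        try
        {
            return await base.SendAsync(request, cancellationToken);
        }
        finally
        {
            Slots.Release();
        }
    }
}

// Registered once at startup, e.g. in WebApiConfig.Register:
// config.MessageHandlers.Add(new ConcurrencyThrottleHandler());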

I hope this helped you find the reason for your API/App’s performance degradation. If you think there are other things that should be considered, please put them in the comments section and I will include them. Thank you.

