Troubleshooting Application and Citrix Performance During a Data Center Migration
Have you ever moved and forgotten to change your address with your local post office? Or changed banks and forgotten to update your account information for automatic billing with one of your providers?
I’m sure we've all had something similar happen to us at some point. Oftentimes, it’s overlooking the little details that can cause a much bigger problem.
Such was the case for one Riverbed customer during a data center migration project: a minor configuration setting for one of the migrated applications was overlooked and never updated. As a result, performance for that application degraded considerably, and the many user complaints made that clear.
As the customer worked to identify the root cause, they asked Riverbed Professional Services for assistance. Here's how we analyzed the performance issues and diagnosed the cause, so that the customer could resolve the problem and enjoy the benefits of their data center migration.
Analyzing and diagnosing application performance problems
We were already working with the customer to predict performance for many other applications that were being migrated to another region. In fact, we had previously performed some predictive analysis on this particular application to forecast its performance before it was deployed in the new region.
We used SteelCentral™ Transaction Analyzer to do this predictive analysis. But as you’ll see, the product is extremely useful for application troubleshooting as well.
The problematic app was a J2EE application with a global user base. Our analysis showed that many of the application’s transactions had a high number of application turns, that is, request/response round trips between client and server. In other words, the application was pretty chatty. With a chatty application, every turn pays the network round-trip time, so transaction performance suffers as the distance between the users and the servers increases. While Riverbed does have products and solutions that mitigate application chattiness, they were not available within the customer’s environment.
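To make the chattiness effect concrete, here is a minimal back-of-the-envelope sketch. It assumes each application turn costs one full network round trip and that everything else (server processing, transmission time) is a fixed amount. The function name, the turn count, and the latency values are hypothetical and chosen only for illustration; they are not the customer's numbers, and this is not how SteelCentral Transaction Analyzer computes its predictions.

```python
def estimate_response_time(app_turns, rtt_seconds, other_delay_seconds=0.0):
    """Rough estimate: every application turn pays one network round trip.

    other_delay_seconds lumps together server processing, transmission
    time, and anything else that does not scale with latency.
    """
    return app_turns * rtt_seconds + other_delay_seconds


# Hypothetical values for illustration only.
turns = 200
other_delay = 0.3  # seconds

for label, rtt_ms in [("LAN", 1), ("WAN", 150)]:
    total = estimate_response_time(turns, rtt_ms / 1000.0, other_delay)
    print(f"{label} (RTT {rtt_ms} ms): {total:.2f} s")
```

The same 200 turns that are barely noticeable at 1 ms of round-trip time add up to roughly half a minute at 150 ms, which is why reducing turns or keeping the chatty traffic close to the servers are the usual remedies.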
The screenshot above shows us the summary of all the delays for one key transaction the application owners identified. This is how that one transaction performed in the local environment.
525 milliseconds? That’s pretty fast! OK, we’re done here, right?
Unfortunately, no—we weren’t done.
We commonly come across transactions that perform well locally, but terribly over a WAN. So next, we looked at the Diagnosis tab within SteelCentral Transaction Analyzer to get more detail about this transaction’s performance.
Right off the bat, we knew this transaction was chatty. And when we ran it through the QuickPredict simulation with the parameters of the new data center, we saw that the 525-millisecond response time was expected to increase to about 35 seconds.
Wow! An almost 70x increase in response time! User productivity would slow to a crawl at this rate.
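For anyone who wants to check the math, the two figures quoted above (525 milliseconds locally, about 35 seconds predicted) work out as follows. The variable names are just for this sketch.

```python
local_seconds = 0.525      # measured locally (figure from this article)
predicted_seconds = 35.0   # approximate QuickPredict estimate (figure from this article)

slowdown = predicted_seconds / local_seconds
added_delay = predicted_seconds - local_seconds

print(f"Slowdown factor: {slowdown:.1f}x")
print(f"Added delay per transaction: {added_delay:.1f} s")
```

That is a slowdown of roughly 67x, or more than 34 extra seconds every time a user runs this one transaction.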
We see in the screenshot above that latency was the largest cause of performance delay. And that’s due to the chattiness bottleneck that SteelCentral Transaction Analyzer warned us about.
As a result of the predictive analysis, our first recommendation was to modify the application to generate fewer application turns. If that were not possible, our second recommendation was to deploy this application using thin client technology.
Partly due to migration timelines and other planned projects, the customer chose our second recommendation and elected to deploy this application in an existing Citrix environment. However, the customer still experienced performance issues when users began using the application as the servers were being migrated.
We reviewed our original analysis. We verified the accuracy of the customer’s network parameters. We looked at network performance statistics. All seemed well.
Since the customer was using Citrix, we knew something else was up. We just had to help analyze, diagnose, and resolve the problem through an application troubleshooting engagement.
Application performance troubleshooting in action
As we began to ask more questions about this issue, we discovered that users would complain only on certain days, but always within the same two-hour window.
Of course, one of the first things we had to do was discuss the usual suspect. Yep, you guessed it – the network. But none of the network performance data we looked at specifically pointed to the network.
Since this issue happened only intermittently, on some days, we decided to install SteelCentral Transaction Analyzer agents, which allowed us to capture data continuously and to start and stop captures remotely when required. A continuous capture records data up to a specified buffer size and wraps once the buffer is full, overwriting the oldest data. As long as there is enough hard drive space, the buffer can be as large as needed. In this case, we had a 1 GB buffer.
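The wrapping behavior is easier to picture with a small sketch. This is only an illustration of the general idea of a size-bounded capture buffer that discards its oldest data first. The class and method names are hypothetical, and it says nothing about how the SteelCentral agents are actually implemented.

```python
from collections import deque


class WrappingCaptureBuffer:
    """Keeps the most recent packets within a fixed byte budget."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self.packets = deque()  # (timestamp, payload) pairs, oldest first

    def add(self, timestamp, payload):
        if len(payload) > self.max_bytes:
            return  # a packet larger than the whole buffer cannot be kept
        self.packets.append((timestamp, payload))
        self.used_bytes += len(payload)
        # Wrap: drop the oldest packets until we fit in the budget again.
        while self.used_bytes > self.max_bytes:
            _, oldest = self.packets.popleft()
            self.used_bytes -= len(oldest)

    def time_slice(self, start, end):
        """Return only the packets captured between start and end (inclusive)."""
        return [(t, p) for t, p in self.packets if start <= t <= end]


# Tiny demonstration with a 10-byte budget instead of 1 GB.
buf = WrappingCaptureBuffer(max_bytes=10)
buf.add(1, b"aaaa")
buf.add(2, b"bbbb")
buf.add(3, b"cccc")  # pushes the total to 12 bytes, so the oldest packet is dropped
print(len(buf.packets), buf.used_bytes)
```

The trade-off is the usual one: a larger buffer lets you look further back in time, at the cost of disk space on the machine running the capture.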
We started a continuous capture on both tiers and waited until we received notification that users were experiencing degraded application performance.
We didn’t have to wait long. Users complained that the application was very slow early in the morning, around 4:00 a.m.
Before downloading the data that was captured on the Citrix server, we previewed the network throughput and downloaded data for the slice of time we wanted. Here’s what we saw:
Looks like the Citrix server had a burst of traffic around the time the users were complaining. From the SteelCentral Transaction Analyzer agent trace, we determined that the burst was an SSH data transfer initiated by the Citrix server to another local server. This did not generate a lot of data, so the problem wasn’t increased utilization on the LAN.
Using our continuous capture, we were able to go back a few days and compare periods with user complaints against periods without them. We noticed this same SSH data transfer happening at the same time every day, including weekends. On complaint days, 5x more data was transferred.
We provided our analysis to the customer. Armed with this information, our customer found that a system security scan was scheduled for the exact time the bursts occurred.
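The day-by-day comparison can be sketched as a simple check against a baseline. The example below assumes you have already totaled the bytes transferred in the same time window for each day. The function name, the threshold, and the per-day byte counts are all hypothetical and are not taken from the customer's capture.

```python
from statistics import median


def flag_heavy_days(bytes_by_day, factor=3.0):
    """Return the days whose transfer volume is at least `factor` times the median."""
    baseline = median(bytes_by_day.values())
    return sorted(day for day, total in bytes_by_day.items() if total >= factor * baseline)


# Hypothetical per-day totals (in MB) for the same window each day.
window_totals_mb = {
    "day-1": 20,
    "day-2": 100,
    "day-3": 20,
    "day-4": 20,
    "day-5": 100,
    "day-6": 20,
    "day-7": 20,
}

print(flag_heavy_days(window_totals_mb))
```

Lining up the flagged days against the dates of user complaints is a quick way to confirm or rule out a suspected correlation before digging into what generated the traffic.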
You read that right—a scan.
This scan helps protect global servers from intrusion. It can be quite CPU intensive, which explains the intermittent nature of this issue. Because of this, it is scheduled to run after local business hours, which is early morning in this case.
“So what’s the problem with that?” you ask. Local time! That’s the problem!
When the application was moved from the other region, the scan schedule was not updated to match after-hours local time in the new region’s time zone. The scan’s “local time” now fell during business hours in the previous region. Users in the previous region were therefore still working in the application while the scan was running, which caused slow performance during business hours for any users accessing the application via Citrix.
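A quick way to catch this kind of mismatch is to convert the scheduled time into each user population's time zone and check it against business hours. The sketch below uses fixed UTC offsets from Python's standard library. The offsets, the date, the 9-to-5 business hours, and the function name are hypothetical, since the regions involved are not named here, and fixed offsets ignore daylight saving time.

```python
from datetime import datetime, timedelta, timezone


def overlaps_business_hours(scan_start, user_tz, open_hour=9, close_hour=17):
    """True if the scan start, seen in the users' time zone, falls in business hours."""
    local_start = scan_start.astimezone(user_tz)
    return open_hour <= local_start.hour < close_hour


# Hypothetical time zones: where the scan is scheduled vs. where the users sit.
scan_tz = timezone(timedelta(hours=8))
user_tz = timezone(timedelta(hours=-5))

# A scan scheduled for 4:00 a.m. in the scan's own time zone.
scan_start = datetime(2024, 1, 16, 4, 0, tzinfo=scan_tz)

print("Scan start for users:", scan_start.astimezone(user_tz).strftime("%Y-%m-%d %H:%M"))
print("During user business hours:", overlaps_business_hours(scan_start, user_tz))
```

With these made-up offsets, a 4:00 a.m. scan lands at 3:00 p.m. the previous day for the users, squarely in their working hours. Running a check like this for every scheduled job on a migrated server is cheap insurance.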
Riverbed Professional Services: the problem solver and money saver
Using SteelCentral Transaction Analyzer, we were able to identify the specific time of increased server use, which helped the customer’s server team identify what was running on the server during that time.
Needless to say, the customer was happy. Our discoveries helped them move forward with the server migration for this particular application. As a result, they were able to realize the expected cost savings of the data center migration for this application.
Data center migrations are a large endeavor. Incomplete configuration changes such as this can, and do, happen. Keep in mind that it can be the little details that cause big performance problems. You need the appropriate people, processes, and tools to quickly troubleshoot issues when they arise. That’s why Riverbed Professional Services offers a Data Center Migration Assessment that can, among other things, help document dependencies such as this. We’re prepared to help!
Have you forgotten a key piece during your company’s data center migration project? How did it impact application performance? Let us know how you solved it.