SteelCentral AppInternals 10.4—Enhanced Cloud Support, Bigger Data, and Business-relevant Analysis
Cloud. Elastic computing. Containers. Microservices. Shared Libraries. These technologies are being adopted at almost exponential rates since they greatly simplify application design, development, deployment and manageability.
Unfortunately, that surface-level simplicity comes at the price of dramatically more complex under-the-covers behaviors and relationships. Application topology is often transient, nebulous, and unpredictably interrelated with countless other applications and services. This has posed a serious challenge for performance management, since many tools and techniques are still rooted in static physical concepts like tiers, servers, and JVMs.
There’s also been an expansion of the consumption of APM data, as LOB management and application owners look for more KPIs to help them improve the bottom line of the business. APM isn’t just a tool for developers and troubleshooters anymore.
SteelCentral AppInternals 10.4 introduces some revolutionary new functionality to not only overcome these challenges, but also extract business-relevant information buried deep in these mind-bogglingly complex architectures.
Dynamic “Scaling Up” and “Scaling Out” of application instances presents numerous challenges for monitoring. Since the physical identities and quantity of instances can change from moment-to-moment, a logical approach must be employed to collate and analyze APM data.
The following example shows an HR application in an elastic environment. If you were to monitor these instances based on their physical identity, you’d have a non-contiguous and overlapping dataset making it very difficult to get an overall understanding of the application’s behavior:
SteelCentral AppInternals 10.4 allows you to define arbitrary tags (role, environment, region etc.) so these transient instances can be combined in logical ways:
Even if your application has hundreds of instances spinning up and down across the globe every day, AppInternals will organize that data as simply as if you only had a single, static instance.
Tags also ensure that every brand-new instance is automatically provisioned with a specific AppInternals configuration (which JVMs/CLRs to monitor, what code detail to collect etc.) so that even short-lived instances are fully configured upon initialization without requiring any manual intervention or restarts.
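To make the idea of logical grouping concrete, here is a minimal sketch in Python. It is not the AppInternals API; the instance IDs, tag names, and response times are all made up for illustration. It simply shows how samples from transient instances can be collated by tag values rather than by physical identity.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (instance_id, tags, response_time_seconds).
# Instance IDs, tag names, and timings are invented for illustration.
SAMPLES = [
    ("i-0a1", {"role": "hr-web", "environment": "prod", "region": "us-east"}, 0.8),
    ("i-0a2", {"role": "hr-web", "environment": "prod", "region": "us-east"}, 1.2),
    ("i-0b7", {"role": "hr-web", "environment": "prod", "region": "eu-west"}, 2.0),
    ("i-0c3", {"role": "hr-db", "environment": "prod", "region": "us-east"}, 0.5),
]


def group_by_tags(samples, tag_keys):
    """Average response time per tag combination, ignoring instance identity."""
    groups = defaultdict(list)
    for _instance_id, tags, seconds in samples:
        key = tuple(tags.get(k) for k in tag_keys)
        groups[key].append(seconds)
    return {key: mean(values) for key, values in groups.items()}


# One logical series per role, no matter how many instances came and went.
by_role = group_by_tags(SAMPLES, ["role"])
# The same data sliced by role and region.
by_role_region = group_by_tags(SAMPLES, ["role", "region"])
```

Because the grouping key is built only from tags, an instance that lives for five minutes contributes to the same logical series as one that runs all month.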
Microsoft Azure and Amazon Web Services
AppInternals works with any cloud computing service, but 10.4 has additional support for Azure and Amazon Web Services (AWS).
The AppInternals Analysis Server can be easily spun up from the Azure Marketplace or AWS Marketplace in a matter of minutes.
For applications running on Azure Cloud Services (PaaS) or Azure Virtual Machines (IaaS), the AppInternals Agent can be bundled and deployed with the application code via a NuGet package from Visual Studio etc.
AppInternals is all about Big Data: capturing all transactions, all the time, without sampling, in immense detail. Even if you have billions of transactions per month, every single one is captured with call stacks, method parameters, SQL statements, web service calls and multi-tier transaction maps. All of those details are persisted for months and are not watered down by roll-ups or aggregation. Most importantly, that data can be searched, filtered and analyzed in true Big Data fashion, based on any combination of 100+ different criteria.
Version 10.4 introduces a number of exciting features which leverage that rich dataset even further to more rapidly reveal root causes and expose unexpected patterns of behavior and relationships between shared components. The remainder of this article will focus on some of those enhancements.
Rich transaction details
The low-level details of delay, such as methods and SQL, are now displayed in real time without having to drill down into individual transactions or perform any additional aggregate analysis. As you select the applications, transactions and periods of interest, these charts update dynamically to immediately reveal the root causes of slowness.
In this example we can immediately see that the method getAnalystRecommendation is the top contributor to overall delay, but the getstockquote SQL call is the root cause for the spike in response time at 10am. In one screen you have all the information you need to start solving the problem. One and done.
If there’s one thing all APM tools have in common, it’s time series charts for response time. Although they may seem the same in concept, they can vary greatly in accuracy depending on how the transactional data is captured and stored. AppInternals captures all transactions without sampling, and persists those transactions in full detail with millisecond accuracy, so its response time charts are as accurate as possible:
Unfortunately, time series charts by their nature are very limited. They can’t effectively relate the subtleties of rich datasets and can inadvertently hide behavioral patterns which are crucial to determining true root cause. In this regard, all APM tools are fundamentally flawed.
To combat this, version 10.4 introduces TruePlot, which leverages AppInternals’ Big Data to display hundreds of thousands of data points at once, colored by various facets (transaction type, server, username etc.) to clearly reveal patterns masked by simple line charts. Here we see the previous example rendered as a TruePlot:
This tells a completely different story. It tells the true story of what’s really happening. The horizontal orange line at 7 seconds shows that there’s some artificial behavior occurring for the Securities transaction which needs to be optimized. The vertical multi-colored spikes show that response time is increasing for multiple transactions at once, which is a clear indicator of what I call a “Performance Phantom”—an issue which initially appears to be related to slow code, but is in fact due to an issue in some shared dependency like hypervisor over-commitment, and not a code issue at all:
TruePlot helps you differentiate symptoms from root causes so you don’t waste time futilely trying to fix the wrong thing.
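A small Python sketch shows why an averaged line chart can hide a pattern that plotting every point reveals. This is not how TruePlot is implemented; the data, the function names, and the band-detection rule are assumptions chosen only to illustrate the principle.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions: (minute, transaction_type, response_seconds).
TRANSACTIONS = [
    (0, "Quote", 0.6), (0, "Quote", 1.4), (0, "Securities", 7.0),
    (1, "Quote", 0.9), (1, "Quote", 1.1), (1, "Securities", 7.0),
    (2, "Quote", 0.4), (2, "Quote", 1.6), (2, "Securities", 7.0),
]


def average_per_minute(transactions):
    """What a line chart plots: one averaged value per time bucket."""
    buckets = defaultdict(list)
    for minute, _tx_type, seconds in transactions:
        buckets[minute].append(seconds)
    return {minute: mean(values) for minute, values in sorted(buckets.items())}


def horizontal_bands(transactions, min_minutes=3):
    """Find (type, rounded response time) pairs that recur in many minutes.

    On a plot of every data point these would show up as a horizontal line.
    """
    minutes_seen = defaultdict(set)
    for minute, tx_type, seconds in transactions:
        minutes_seen[(tx_type, round(seconds))].add(minute)
    return sorted(key for key, mins in minutes_seen.items() if len(mins) >= min_minutes)


line_chart = average_per_minute(TRANSACTIONS)  # flat at about 3 seconds
bands = horizontal_bands(TRANSACTIONS)         # the Securities band at 7 seconds
```

The averaged series sits at roughly 3 seconds in every minute, a value that no individual transaction actually had, while the per-point view isolates the Securities transactions pinned at 7 seconds.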
For more examples of why TruePlot is so valuable, and the case studies which inspired it, watch my talk from Collision Conference.
As application architectures continue to increase in complexity, and are further complicated by the elasticity and transience of cloud/containers, relationship mapping based on physical concepts like servers and JVMs/CLRs becomes less practical.
Here’s an example of a physical map for a single transaction type from a large customer in production:
This type of complexity is increasingly common, and all APM vendors struggle to deliver practical value from these massive, physically organized maps.
AppInternals 10.4 introduces Performance Graph, which leverages the power of mathematical graph theory to model the logical relationships of application transactions and their sub-components (methods, SQL, etc.) to reveal the top sources of delay for individual transaction types or entire applications simultaneously.
Here we see the Performance Graph built from all 11 million production transactions in a 24-hour period for one of our large customers:
It clearly shows which transaction types are consuming the most processing time, as well as root causes for those delays, down to specific methods, SQL, etc.
For comparison, here is the traditional, physical map for this same dataset:
It has limited practical value due to its immense complexity, whereas Performance Graph doesn’t suffer from the issues that arise from “Web Scale.”
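The general idea of a logical graph can be sketched in a few lines of Python. This is not the Performance Graph implementation; the record layout, the "Portfolio" and "renderPage" names, and all timings are hypothetical. It shows how individual transactions can be collapsed into weighted edges between transaction types and their sub-components, so the graph stays the same size however many servers or transactions are involved.

```python
from collections import defaultdict

# Hypothetical per-transaction records: a transaction type plus the seconds
# spent in each sub-component (method or SQL statement). Timings are invented.
TRANSACTIONS = [
    ("Securities", {"getAnalystRecommendation": 2.0, "getstockquote": 0.5}),
    ("Securities", {"getAnalystRecommendation": 3.0, "getstockquote": 1.5}),
    ("Portfolio", {"getstockquote": 1.0, "renderPage": 0.25}),
]


def build_graph(transactions):
    """Collapse individual transactions into weighted logical edges."""
    edges = defaultdict(float)  # (transaction type, component) -> total seconds
    for tx_type, components in transactions:
        for component, seconds in components.items():
            edges[(tx_type, component)] += seconds
    return dict(edges)


def top_components(edges):
    """Rank components by total delay across every transaction type."""
    totals = defaultdict(float)
    for (_tx_type, component), seconds in edges.items():
        totals[component] += seconds
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


graph = build_graph(TRANSACTIONS)
ranking = top_components(graph)
```

The number of edges depends on how many distinct transaction types and components exist, not on how many transactions or instances produced them, which is why a logical view can stay readable at scale.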
AppInternals 10.4 provides different Performance Graph views organized by combinations of Transaction Type, Category, Package and Method/SQL:
The Show Financial Impact option for Performance Graph is truly revolutionary. It enables you to perform business relevant analysis of your APM data, based on dollars (or whatever your local currency is) rather than simply seconds.
You may now specify the financial impact per minute of response time for different transaction types in your application. For example, certain mission-critical end-user transactions may have higher downtime or slow time cost than internal administrative or reporting transactions. Determining the top candidates to optimize based on business impact may yield very different results than those based purely on response time or total processing seconds.
In this example, when determining what to optimize based on total processing seconds, database transactions have a negligible impact to overall delay and would not be considered for improvement:
But when the very same data is rendered based on the relative financial impact of different types of transactions, database transactions become very significant, with just one of them accounting for over $170,000 per day. That transaction is clearly a candidate for performance and reliability improvements.
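The arithmetic behind this kind of re-ranking is simple: minutes of response time multiplied by a cost per minute. The Python sketch below uses invented transaction types, timings, and costs (none of them come from the customer example above) to show how the ordering can flip.

```python
# Hypothetical inputs: total processing seconds per transaction type, and an
# assumed cost per minute of response time for each type.
PROCESSING_SECONDS = {"Report": 90_000, "Checkout": 30_000, "Database": 6_000}
COST_PER_MINUTE = {"Report": 0.10, "Checkout": 5.00, "Database": 100.00}


def financial_impact(processing_seconds, cost_per_minute):
    """Convert processing time into cost: minutes of delay x cost per minute."""
    return {
        tx_type: seconds / 60 * cost_per_minute[tx_type]
        for tx_type, seconds in processing_seconds.items()
    }


def rank(values):
    """Order transaction types from largest to smallest value."""
    return sorted(values, key=values.get, reverse=True)


impact = financial_impact(PROCESSING_SECONDS, COST_PER_MINUTE)
by_seconds = rank(PROCESSING_SECONDS)  # Report first, Database last
by_cost = rank(impact)                 # Database first, Report last
```

With these made-up numbers, the transaction type with the least processing time carries the largest cost, so a ranking by seconds and a ranking by money point at opposite ends of the list.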
I’ve been using AppInternals for almost 13 years, and I truly believe that we’ve created something really special with version 10.4—Performance Graph, TruePlot and our one-and-done method-level analysis are going to change the way APM is done.
See the power of SteelCentral AppInternals for yourself. Try it for free today!