What to do in software performance testing?
By DeviQA on Wed Nov 30 2016 00:00:00 GMT+0000 (Coordinated Universal Time)
At a time when the growth of computing power is going down (vertical scalability), and the volume of tasks is growing at the same speed, performance problem is becoming more up to date. But before you rush to test performance, it is necessary to obtain a quantitative characteristic. Our task is to carry out a sufficient number of tests to predict the system's behavior in a given situation.
How to do performance testing
Let’s consider the system as a "black box." Since the system can operate under a virtual machine using the database and without it, it makes no sense to dismember the system and to take measurements somewhere in the middle (usually leads to unrecorded important points). But if you have the opportunity to test the system in a different configuration, please make sure to do this.For testing, even for the simplest test, it is better to have an independent system like JMeter. It can calculate the results and average them. Of course, setting up a system takes time, but want to get the result much faster, and we write something like this:
Recording the results
After the first run, we get the X number. Run again to get 1.5*X, 2*X, 0.7*X. In fact, you can and should stop here, if we make a brief test and we do not need to insert the number into the report. If we want to share this number with someone, it is important that it was similar to other checkups and do not arouse suspicion.
The first "logical" step seems to be to take the X numbers and average them. However, averaging will mean nothing more than an increase in the count for one test. Count increase can lead to even more unstable results, especially if you run the test and doing something else at the same time.
The minimum time for best performance statisticsThe problem is that the average is not a stable characteristic. One test that performed 10 times longer will be enough to let your results not coincide with the other. What is performance testing in software testing? For the simple performance tests, it is desirable to take the minimum time of execution. Naturally, operation() should be measured in this context, typically 100 ms - 500 ms is more than enough for the system. Of course, the minimum time would not match 1:1 with the observed or real effect, but this number is compared with other dimensions.
Quintiles10%, 50% (median), 90% quintiles are more stable than the average, and they are more informative. We can find out how likely the request will be executed while 0.5*X and 0.7*X. There is another problem with them, calculating quantile is much harder than to take min or max. JMeter provides a measurement of the median and 90% of Aggregate Report then you should use.
A parameterized test
After the checkups we get a number (median, minimum, or other), but what we can say about our system (function) with only one number? Imagine, you automate the process of obtaining this number and every day you'll check it when you need to sound the alarm and point fingers? For example, here is a typical plot of the same test every day.
If we want to record a performance report, it will look empty with only one number. So we’ll certainly test for different parameters. Tests for various parameters will give us a nice chart and an idea of the scalability of the system.
What is Performance Test with a single parameter?Let’s consider a system with a single numeric parameter. First of all, you should select the parameter values. It makes no sense to choose the number of different ranges such as [1, 100, 10000], you simply spend three entirely different tests. It won’t be possible to find the dependence on such numbers. Imagine you want to build a graph, which numbers would you choose? Something similar to [X*1, X*2, X*3, X*4, X*5].
Let’s choose 5-10 (7 is the best) control points, hold 3-7 measurements for each point, take the minimum number for each position and build a graph.
The complexity of the algorithm
As you can see, the points are arranged in an imaginary straight line. Now we come to the most interesting part of the measurements. We can determine the complexity of the algorithm based on the measurements.
Determination of the compound nature of the algorithm by the tests is much easier for complex programs than the original text analysis programs. With the help of the test it is possible to find a particular point when complexity varies, for example, the launch of Full GC.
Determining program complexity according to the test results is the direct object of the regression analysis. It is evident that there is no general method for finding the function only on the point. So art first we make an assumption of a particular dependency, and then we check it. In our case, and in most others, we assume that the function is a linear relationship.
Least Squares Method (linear model)
To find a linear dependence of the coefficients, there is a straightforward and optimal method called OLS. The advantage of this approach is that the formula can be programmed in 10 lines and they are entirely clear.
Take a look at our calculations in the table.
In the highlighted line we see a linear coefficient of our model. It is equal to 95.54, free factor 2688. Now we can make a simple, but not an obvious trick. We can add meaning to these numbers. 95.54 is measured in milliseconds (as well as our measurements). 95.54 is the time we spend on each component, and the 2688 ms time that we devote to the system itself does not depend on the number of components.This method lets us identify the time of the external system. In this case, it’s the database, although it is dozens of times higher than the single component. If we used Time_with_N_component / N formula, we would have to measure for N> 1000 that the error to be less than 10%.
The linear model is the most important factor of the number of our measurement, and strangely enough, it is the most stable among our measurements. The same linear coefficient can adequately measure and interpret the operations that occur for a nanosecond and separating the overhead of the system.
The graph of the independent test that we run at different times, which confirms the greater stability of the linear coefficient than the individual tests.
Estimation of linear models and Pearson coefficient
Using the chart we can clearly see that our measurements do satisfy the linear model, but this method is not accurate, and we can not automate it. In this case, the Pearson coefficient can help us. This method really shows the deviation from a straight line, but it is not enough for 5-10 measurements. Here is an example of not a linear dependence, but the Pearson coefficient is quite high.
Going back to our table, knowing the free rate we can calculate the linear coefficient for each dimension, as it’s done in the table. As we see the numbers (100, 101, 93, 96, 102, 81, 94, 100, 93, 98) are distributed quite randomly around 95, that gives us a good reason to believe that the dependence is linear. In fact, deviations from the average value should satisfy the standard distribution.
We should admit that not all dependencies are linear. The first thing that comes to mind is a quadratic dependence. In fact, the quadratic dependence is very dangerous, it kills performance slowly at first and then rapidly.
Even if you have 1000 items that are performed in a fraction of a second, for the 10,000 it already will be tens of seconds (multiplied by 100). Another example is sorting, which can not be solved in linear time. Let’s calculate how we apply the linear method of analysis for algorithms with complexity O(n*log n).
Let’s take a look at the example of a nonlinear dependence.
The first thing to note is the negative overhead. For obvious reasons, the system can not have a negative time spent on preparation. The second point to note is looking at the column of coefficients. We can see a pattern, a linear coefficient decreases and then increases. This looks like a parabola.
Multiplayer tests and tests of multi-core systems
«Throughput per core» myth
One of the favorite things to do with performance tests is an extrapolation. Especially, people like to be extrapolated to the area for which they can not get values. For example, in a system having 2 cores or 1 core, I want to extrapolate how the system would behave with 10 cores. And of course, the first false assumption is that the relationship is linear. For the common definition of a linear dependence, there must be 5 points, that indeed can not be obtained on the 2 cores.
One of the closest approximations is Amdahl's law. It is based on calculating the percentage of the code alpha (outside the block synchronization) and the synchronized code (1 - α). If a process takes time T to one core. While in multiple running tasks, N times is T '= T*α + (1-α)*N*T. On average, of course, it’s N/2, but Amdahl allows N. S = N*T /T'=N/(α+(1-α)\*N)=1/(α/N+(-α)). Of course, the above graph is not as dramatic in reality (there is, in fact, a logarithmic scale on the X). However, the asymptote is a significant disadvantage of sync blocks. Relatively speaking, it is impossible to overcome the increasing power of the acceleration limit at Kim S = 1/(1-α). And this limit is quite hard, that is for 10% of synchronized code, the limit it 10, for 5% (excellent) the limit is 20. The function has a limit, but it is growing. Therefore we have the task: to evaluate a hardware optimized for our algorithm. In reality, the percentage of parallelized items increase can be difficult. Let us return to the formula T'=T*α+(1-α)*N*T. It’s the best regarding efficiency: if the core is idle, it will work for the same time as well. That is, T*α = (1-α)*N*T, we obtain N = α/(1-α). For 10% - 9 cores, for 5% - 19 cores.
The number of users and the number of cores. The ideal graph
Acceleration computing model is theoretically possible, but not always realistic. Let’s consider a client-server situation where N customers are always launching some tasks on the server one by one, and we do not spend any expenses on the synchronization results, as customers are not entirely dependant!Maintenance of statistics introduces a timing element. With M-core machine, we expect that the average query time is T is equal where N
What is bottleneck in performance testing? Experimental measurementsNaturally, perfect graphics are achievable if we have 0% synchronized blocks (critical sections). Here are the actual measurements of the same algorithm with different parameter values.
We can calculate the linear coefficient and construct a graph. Just having a machine with 8 cores, it is possible to conduct a series of experiments for 2, 4, 8 cores. From the test results, it can be seen that a system with 4 cores and 4 users behaves the same way as a system with 8 cores and 4 users. Of course, it was expected, and it gives us the opportunity to carry out only one series of tests for a machine with a maximum number of cores.
Experimental measurements using linear coefficientAnalysis graphs are close to the law of Amdahl but are still significantly different. With measurements for 1, 2, 4, 8 cores, one can calculate the amount of non-parallelized code with the formula: Estimated = (1/SU - 1)/(1/NP - 1). Where NP is the number of cores, SU = Throughput NP core / Throughput 1 core. According to Amdahl's law, this number should be constant, but in all dimensions, the number decreases, but not significantly (91% - 85%).
A throughput per core graph is not always a continuous line. For example, when memory is low or when there’s a GC, variations in throughput can be very significant.