One of our clients, a betting company, needed to performance-test its web application with 5000 “real users.”
That means each virtual user had to perform its activity in a real browser. Unfortunately, common performance tools such as JMeter, NeoLoad, etc. are not suited to this case.
Our main goal was to put load on the most critical actions, the ones real users rely on most heavily. Users had to perform their activities in steps, for example:
All users log in within 5 minutes and then wait. Once all users are successfully logged in, they perform bulk actions during the same period.
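The stepwise flow above can be sketched with a barrier: each user logs in at a random point inside the ramp-up window, waits until everyone else is logged in, and only then starts its bulk action. This is a minimal illustration using plain Python threads, not the Selenide code used on the project; `run_phased_load`, `login`, and `bulk_action` are hypothetical names, and the ramp-up is scaled down to a fraction of a second.

```python
import random
import threading
import time


def run_phased_load(user_count, ramp_seconds, login, bulk_action):
    """Log in `user_count` users spread over `ramp_seconds`, wait until all
    of them are logged in, then run `bulk_action` for every user."""
    all_logged_in = threading.Barrier(user_count)
    events = []  # (phase, user_id) tuples in completion order
    lock = threading.Lock()

    def user_session(user_id):
        # Spread logins across the ramp-up window instead of firing at once.
        time.sleep(random.uniform(0, ramp_seconds))
        login(user_id)
        with lock:
            events.append(("login", user_id))
        # Block until every user has finished logging in.
        all_logged_in.wait()
        bulk_action(user_id)
        with lock:
            events.append(("bulk", user_id))

    threads = [
        threading.Thread(target=user_session, args=(user_id,))
        for user_id in range(user_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return events


if __name__ == "__main__":
    # Placeholder actions; a real test would drive a browser here.
    log = run_phased_load(10, 0.5, lambda u: None, lambda u: None)
    print(f"{len(log)} events recorded")
```

In a real run, a failed login would leave the other users waiting at the barrier, so passing a timeout to `wait()` would probably be needed.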
During testing, we faced many issues related to the real-time updates of bets. They prevented us from running performance tests, so we had to add complicated logic with different conditions to make our tests succeed.
Because we ran our tests on production (this was the client's main requirement), we had to add extra timeouts to avoid triggering the DDoS protection.
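One common way to add such timeouts is a randomized pause between consecutive actions, so that requests do not arrive in a tight, regular burst. The sketch below only illustrates that idea; the helper name `paced` and the pause bounds are assumptions, and the actual timeout values used on the project are not stated here.

```python
import random
import time


def paced(actions, min_pause, max_pause, sleep=time.sleep, rng=random):
    """Run each action in order, pausing a random time between actions.

    `sleep` and `rng` are injectable so the pacing can be checked
    without actually waiting.
    """
    results = []
    for index, action in enumerate(actions):
        if index > 0:
            # Randomized pause between min_pause and max_pause seconds.
            sleep(rng.uniform(min_pause, max_pause))
        results.append(action())
    return results


if __name__ == "__main__":
    # Placeholder actions standing in for browser steps.
    steps = [lambda: "open page", lambda: "place bet", lambda: "check slip"]
    print(paced(steps, 0.1, 0.3))
```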
Another problem was running all the tests from one machine. All requests ended up in a queue, which is not what performance testing needs.
Our client could not pause the deployment of new features, so the UI and backend were updated twice a week, and we had to update our automated tests regularly. This slowed our work down.
We decided to use the Selenide automation framework to develop user actions and Zalenium to run this huge number of tests. Zalenium is essentially Selenium Grid, but in Kubernetes.
Kubernetes was chosen for its ability to orchestrate Selenium nodes and scale them up quickly. Generating that many users required a huge amount of resources, and luckily AWS was able to provide them.
Our Kubernetes cluster contained 200 c5d.18xlarge nodes, each with 72 cores and 144 GB of RAM.
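As a rough back-of-the-envelope check, the cluster figures above can be turned into a per-browser budget. The calculation assumes the 5000 browsers are spread evenly across the nodes and ignores whatever Kubernetes and the grid themselves consume, so the real budget per browser would be somewhat lower.

```python
NODES = 200
CORES_PER_NODE = 72
RAM_GB_PER_NODE = 144
TARGET_BROWSERS = 5000  # one real browser per simulated user

total_cores = NODES * CORES_PER_NODE          # 14400 cores
total_ram_gb = NODES * RAM_GB_PER_NODE        # 28800 GB of RAM
browsers_per_node = TARGET_BROWSERS / NODES   # 25 browsers on each node

# Upper bounds, assuming an even spread and no system overhead.
cores_per_browser = total_cores / TARGET_BROWSERS    # 2.88 cores
ram_gb_per_browser = total_ram_gb / TARGET_BROWSERS  # 5.76 GB

if __name__ == "__main__":
    print(f"{total_cores} cores, {total_ram_gb} GB RAM in total")
    print(f"{browsers_per_node:.0f} browsers per node")
    print(f"{cores_per_browser:.2f} cores, {ram_gb_per_browser:.2f} GB per browser")
```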
We developed a test suite that generated test data based on the number of browsers in the Kubernetes cluster.
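A minimal sketch of what sizing test data to the browser count could look like: one unique account per browser, so that no two browsers share a session. The `LoadTestUser` type, the `generate_test_users` helper, and the naming scheme are hypothetical and are not taken from the project's test suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadTestUser:
    username: str
    password: str
    browser_index: int  # which browser in the cluster uses this account


def generate_test_users(browser_count, prefix="loadtest"):
    """Create one unique test account per available browser."""
    # Zero-pad the index so usernames sort in browser order.
    width = len(str(browser_count))
    return [
        LoadTestUser(
            username=f"{prefix}_{index:0{width}d}",
            password=f"pw-{index:0{width}d}",  # placeholder credential
            browser_index=index,
        )
        for index in range(1, browser_count + 1)
    ]


if __name__ == "__main__":
    users = generate_test_users(5000)
    print(len(users), users[0].username, users[-1].username)
```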
A test farm was developed to execute our tests in parallel from 200 machines.
Major bottlenecks were found on the frontend, backend, and GraphQL server. The client expected to support a load of 5000+ users, but after the first test run the maximum number of users without downtime was only 717.
Our engineers developed more than 1000 test cases in addition to the 300 test cases from the client’s in-house testing team.
Three full-time automation QA engineers were involved in the project.
Our QA Lead was involved in preparing test cases for each sprint.