
Written by: Anastasiia Sokolinska, Chief Operating Officer
Posted: 02.06.2026
14 min read
You inherit a 500-test Playwright suite. The README says it follows best practices. Page Object Model — check. getByRole locators — check. Headless CI — check. Then you look at the numbers: 11% flaky rate, 55-minute CI runs, auth setup copy-pasted into 60 spec files. The suite technically works. It just doesn't scale.
This is the gap nobody writes about. Playwright best practices at the tutorial level are table stakes. What actually separates production-grade automation from code that slowly collapses under its own weight is a layer of architectural decisions — the kind that don't appear in documentation but become obvious the moment you're maintaining 500 tests written by someone who's no longer on the team.
Here's what those decisions look like, and why they matter.
Why most Playwright suites look fine until they don't
Playwright now pulls around 33 million weekly npm downloads. It's not a differentiator anymore — it's the default. Which means the competitive advantage isn't in the tool. It's in how the team uses it.
The surface-level adoption pattern is predictable: the team reads the official docs, sets up POM, uses getByRole, runs tests headless in GitHub Actions, and ships. That's not wrong. But it's also not enough. Tutorials optimize for comprehension, not for the reality of 500 tests across 20 feature areas with 5 user roles and a CI pipeline that needs to give feedback in under 10 minutes.
The failure modes that follow are specific. Auth setup that made sense for the first 5 tests becomes 200 lines of duplication by test 80. Fixtures get treated as syntax — a place to put beforeEach logic — rather than as the primary architectural layer. The CI pipeline runs everything on one runner, takes 55 minutes, and nobody really knows which tests flake and how often.
None of this is visible from a code review. It accumulates.
Locator strategy is a hierarchy, not a preference
Every tutorial says "use getByRole." The part that gets skipped: why, and what to do when you can't.
Playwright's locator priority stack isn't arbitrary. It mirrors how assistive technologies interact with your application, which means it's also the layer most resistant to implementation churn. The full hierarchy:
- ARIA roles (getByRole('button', { name: 'Submit' }))
- Accessible names and labels (getByLabel, getByPlaceholder)
- Text content (getByText)
- Test IDs (getByTestId)
- CSS selectors (last resort only)
The key distinction that most articles miss: data-testid attributes are not a fallback. They're a contract. When you reach for getByTestId, you're making a team-level commitment: QA and frontend agree that this element needs a stable, explicitly named hook in the DOM. That decision requires a naming convention (data-testid="checkout-submit-btn", not data-testid="btn1"), a review process, and ideally a lint rule that prevents inconsistencies.
Here's what the wrong call looks like in practice:
// Fragile: breaks the moment the class changes or the grid reorders
const fragileSubmitButton = page.locator('.MuiGrid-root > div:nth-child(3) > button');
// Preferred: role + accessible name; survives styling and layout changes
const submitButton = page.getByRole('button', { name: 'Complete order' });
The CSS locator isn't just slower. It couples your test to the implementation. Every design system update becomes a potential test failure, regardless of whether the actual behavior changed.
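The test ID naming convention mentioned earlier is easier to keep when something checks it mechanically. The following is a minimal sketch of such a check, not an existing lint rule: the `<feature>-<element>-<type>` shape, the list of allowed type suffixes, and the function names are all assumptions made for illustration.

```typescript
// Hypothetical convention: lowercase kebab-case with at least two segments,
// ending in an element type from a short agreed list. Adjust both to your team's rules.
const ALLOWED_SUFFIXES = ['btn', 'input', 'link', 'list', 'modal'] as const;

const TEST_ID_PATTERN = /^[a-z][a-z0-9]*(-[a-z0-9]+)+$/;

type TestIdCheck = { id: string; valid: boolean; reason?: string };

function checkTestId(id: string): TestIdCheck {
  if (!TEST_ID_PATTERN.test(id)) {
    return { id, valid: false, reason: 'not kebab-case with at least two segments' };
  }
  const suffix = id.slice(id.lastIndexOf('-') + 1);
  if (!(ALLOWED_SUFFIXES as readonly string[]).includes(suffix)) {
    return { id, valid: false, reason: `unknown element type "${suffix}"` };
  }
  return { id, valid: true };
}

// Returns only the offending IDs so a CI step can fail on a non-empty result.
function findInvalidTestIds(ids: string[]): TestIdCheck[] {
  return ids.map(checkTestId).filter((result) => !result.valid);
}
```

Under these assumed rules, `checkout-submit-btn` passes while `btn1` is rejected. How the IDs get collected (from source files or from a rendered page) is left open here.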
Fixtures are your architecture — not just a setup helper
This is where the gap between intermediate and senior Playwright usage is sharpest. Intermediate engineers use fixtures to avoid repeating beforeEach. Senior engineers use fixtures to define the architecture of the entire suite.
Test-scope vs. worker-scope: the decision that matters
Playwright gives you two fixture scopes. Test-scope fixtures re-run for every test — isolated by default, predictable, but expensive if they involve auth or database setup. Worker-scope fixtures run once per worker process and are shared across all tests on that worker — faster, but they introduce shared state risk if you're not deliberate about it.
The concrete rule: use worker-scope for fixtures that are genuinely stateless from the test's perspective (a seeded read-only user, a shared API client pointing to a stable seed dataset) and test-scope for anything that writes, modifies, or depends on a clean slate.
Using worker-scope incorrectly is how you get "tests that pass in isolation but fail in CI" — a flakiness pattern that's extremely hard to diagnose without understanding fixture lifecycle.
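To make the shared-state risk concrete, here is a toy model of the two lifecycles. It is not Playwright's implementation and uses no Playwright APIs; the `Cart` fixture, the two tests, and `runOnOneWorker` are hypothetical stand-ins that only mimic "built once per worker" versus "built once per test".

```typescript
// A toy model of fixture lifecycle, not Playwright's implementation.
type Cart = { items: string[] };
type ToyTest = { name: string; run: (cart: Cart) => boolean };

function runOnOneWorker(tests: ToyTest[], scope: 'test' | 'worker'): Record<string, boolean> {
  const createCart = (): Cart => ({ items: [] });
  const shared = createCart(); // built once, like a worker-scoped fixture
  const results: Record<string, boolean> = {};
  for (const t of tests) {
    // Test scope: every test gets a fresh object. Worker scope: all tests share one.
    const cart = scope === 'worker' ? shared : createCart();
    results[t.name] = t.run(cart);
  }
  return results;
}

const toyTests: ToyTest[] = [
  {
    name: 'adds an item',
    run: (cart) => {
      cart.items.push('sku-1'); // a write: this is what makes sharing unsafe
      return cart.items.length === 1;
    },
  },
  { name: 'starts with an empty cart', run: (cart) => cart.items.length === 0 },
];

const isolatedResults = runOnOneWorker(toyTests, 'test'); // both tests pass
const sharedResults = runOnOneWorker(toyTests, 'worker'); // second test sees the first test's write
```

The second test passes when run alone in either mode. It fails only when it shares a worker-scoped object with a test that writes, which is the "passes in isolation, fails in CI" pattern.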
Composing fixtures to eliminate boilerplate
The pattern that makes the biggest dent in test file length: extending the base test object to inject page objects, authenticated API clients, and auth state simultaneously. Done right, your spec files become 10-line behavioral descriptions instead of 60-line setup marathons.
// fixtures/index.ts
import { test as baseTest, expect } from '@playwright/test';
import { DashboardPage } from '../pages/DashboardPage';
import { ApiClient } from '../support/ApiClient';

// Test-scoped and worker-scoped fixtures are typed separately:
// extend<TestFixtures, WorkerFixtures>
type TestFixtures = {
  dashboardPage: DashboardPage;
};
type WorkerFixtures = {
  apiClient: ApiClient;
};

export const test = baseTest.extend<TestFixtures, WorkerFixtures>({
  // Test-scoped: fresh page object per test
  dashboardPage: async ({ page }, use) => {
    await use(new DashboardPage(page));
  },
  // Worker-scoped: one API client per worker, shared across tests
  apiClient: [
    async ({}, use) => {
      const client = new ApiClient(process.env.API_BASE_URL!);
      await client.authenticate();
      await use(client);
    },
    { scope: 'worker' },
  ],
});

export { expect };
// tests/dashboard.spec.ts: what the test file actually looks like
import { test, expect } from '../fixtures';

test('admin sees pending approvals count', async ({ dashboardPage, apiClient }) => {
  const pendingCount = await apiClient.getPendingApprovals();
  await dashboardPage.goto();
  await expect(dashboardPage.pendingApprovalsLabel).toHaveText(`${pendingCount} pending`);
});
No auth setup in the test file. No page instantiation. The test describes behavior; the fixtures handle everything else.
Authentication at scale means more than storageState
The standard advice — save a session with storageState, load it in your tests — works for a single user. It breaks the moment your B2B SaaS application has multiple roles, which is always.
A real product has an admin, a billing manager, a read-only viewer, an API consumer, and maybe a support agent with elevated access. Each role sees a different UI, triggers different API responses, and needs to be tested against different scenarios. Handling all of this with a single storageState file either means serialized tests (slow) or cross-role contamination (unreliable).
The right approach: generate separate auth state files per role in a dedicated auth.setup.ts, reference them via named projects in playwright.config.ts, and inject them via worker-scoped fixtures.
// auth.setup.ts
import { test as setup } from '@playwright/test';
import path from 'path';

const AUTH_DIR = path.join(__dirname, '.auth');

const roles = [
  { name: 'admin', email: 'admin@example.com', password: process.env.ADMIN_PASS! },
  { name: 'billing', email: 'billing@example.com', password: process.env.BILLING_PASS! },
  { name: 'viewer', email: 'viewer@example.com', password: process.env.VIEWER_PASS! },
];

for (const role of roles) {
  setup(`authenticate as ${role.name}`, async ({ page }) => {
    await page.goto('/login');
    await page.getByLabel('Email').fill(role.email);
    await page.getByLabel('Password').fill(role.password);
    await page.getByRole('button', { name: 'Sign in' }).click();
    await page.waitForURL('/dashboard');
    await page.context().storageState({ path: `${AUTH_DIR}/${role.name}.json` });
  });
}
// playwright.config.ts (relevant section)
projects: [
  { name: 'setup', testMatch: /auth\.setup\.ts/ },
  {
    name: 'admin-tests',
    use: { storageState: '.auth/admin.json' },
    dependencies: ['setup'],
  },
  {
    name: 'billing-tests',
    use: { storageState: '.auth/billing.json' },
    dependencies: ['setup'],
  },
  {
    name: 'viewer-tests',
    use: { storageState: '.auth/viewer.json' },
    dependencies: ['setup'],
  },
],
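As the role list grows, the per-role project entries can be derived from a single list instead of being written out by hand. This is one possible sketch: `buildRoleProjects`, `storageStatePath`, and the `RoleProject` type are hypothetical helpers, and the returned objects only mirror the three fields used in the config above.

```typescript
// Hypothetical helper: derives one project entry per role so the role list
// lives in a single place. The object shape mirrors the config section above.
type RoleProject = {
  name: string;
  use: { storageState: string };
  dependencies: string[];
};

const ROLES = ['admin', 'billing', 'viewer'] as const;

function storageStatePath(role: string): string {
  return `.auth/${role}.json`;
}

function buildRoleProjects(roles: readonly string[]): RoleProject[] {
  return roles.map((role) => ({
    name: `${role}-tests`,
    use: { storageState: storageStatePath(role) },
    dependencies: ['setup'],
  }));
}

const roleProjects = buildRoleProjects(ROLES);
// In playwright.config.ts these could be spread after the setup project:
// projects: [{ name: 'setup', testMatch: /auth\.setup\.ts/ }, ...roleProjects]
```

Adding a support-agent role then means one new entry in `ROLES` plus its credentials in the setup file, rather than another copied project block.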
One silent failure that almost nobody writes about: forgetting to add .auth/*.json to .gitignore. Those files contain live session tokens. If they land in version control — even briefly — you've shipped valid credentials into your repo history.
Add this before anything else:
# .gitignore
.auth/
Flaky tests have root causes — retries just hide them
Every Playwright article says "avoid waitForTimeout." That's correct but incomplete. Flakiness has distinct root causes, and each one requires a different fix. Treating them all with retries is like treating every headache with painkillers — it masks the symptom while the underlying problem grows.
The root-cause taxonomy that actually matters:
Async timing gaps: the test proceeds before the UI has caught up. Fix: use Playwright's built-in auto-waiting and assertion-level waits (await expect(locator).toBeVisible()) rather than arbitrary delays.
Resource-affected flaky tests (RAFTs): shared database state, port collisions, or test data written by one test and read by another. Research cited by QA tooling practitioners puts resource-affected failures at roughly 46.5% of all flaky test root causes. Fix: isolate test data per test run using @faker-js/faker for generated data and API-level teardown.
Environment drift: this one is subtle and commonly missed. Docker's default /dev/shm allocation is 64MB. Chromium uses shared memory for rendering and can silently crash when that limit is hit under parallel load, without any obvious error in your logs. Fix: pass --shm-size=2g to your Docker run command, or mount /dev/shm explicitly.
Selector fragility: a locator that works today breaks when a developer renames a CSS class or restructures a component. Fix: go back up the locator hierarchy. If you're using a CSS selector, find the ARIA role equivalent.
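For the test data isolation fix, the essential property is that no two tests ever receive the same record. The sketch below shows that property with a dependency-free stand-in rather than @faker-js/faker itself; `uniqueUser`, the `TestUser` shape, and the `example.test` domain are assumptions for illustration.

```typescript
// Stand-in for a data generator such as @faker-js/faker: the only goal here is
// that no two tests (or workers) ever receive the same record.
type TestUser = { email: string; displayName: string };

let userSequence = 0;

function uniqueUser(workerIndex: number, label: string): TestUser {
  userSequence += 1;
  // Worker index separates parallel workers, the timestamp separates CI runs,
  // and the counter separates tests within one worker process.
  const suffix = `w${workerIndex}-${Date.now().toString(36)}-${userSequence}`;
  return {
    email: `${label}.${suffix}@example.test`,
    displayName: `${label} ${suffix}`,
  };
}
```

Inside a Playwright test, the worker index could come from `testInfo.workerIndex`. Records created this way still need API-level teardown; uniqueness only prevents tests from colliding with each other.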
For prevention at the project level, eslint-plugin-playwright catches the most common anti-patterns before they reach the suite:
// eslint.config.mjs
import playwright from 'eslint-plugin-playwright';

export default [
  {
    ...playwright.configs['flat/recommended'],
    files: ['tests/**/*.ts'],
    rules: {
      ...playwright.configs['flat/recommended'].rules,
      'playwright/no-wait-for-timeout': 'error',
      'playwright/no-force-option': 'warn',
      'playwright/prefer-web-first-assertions': 'error',
    },
  },
];
Here's the metric that frames this correctly: if a test is retried in more than 2% of its runs over a rolling 30-day window, it has a bug. Retries are a circuit breaker for transient infrastructure noise, not a long-term stability strategy. If you're seeing retry rates above 5% on any individual test, you're paying a "flaky tax": CI time burned on reruns, developer trust eroded, and a signal-to-noise ratio that drops with every deployment.
If your flaky rate is above 5%, that's not a Playwright problem — it's a system design problem.
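The 2% rule over a rolling 30-day window is simple to compute once run records exist. Here is a minimal sketch of that calculation; the `TestRun` record shape is an assumption about how you store results (for example, flattened from a JSON report), not a Playwright type.

```typescript
// Hypothetical record shape: one entry per test execution.
type TestRun = { testId: string; finishedAt: Date; retried: boolean };

const DAY_MS = 24 * 60 * 60 * 1000;

// Retry rate per test (retried runs / total runs) within the rolling window.
function retryRates(runs: TestRun[], now: Date, windowDays = 30): Map<string, number> {
  const cutoff = now.getTime() - windowDays * DAY_MS;
  const totals = new Map<string, { count: number; retried: number }>();
  for (const run of runs) {
    if (run.finishedAt.getTime() < cutoff) continue; // outside the rolling window
    const entry = totals.get(run.testId) ?? { count: 0, retried: 0 };
    entry.count += 1;
    if (run.retried) entry.retried += 1;
    totals.set(run.testId, entry);
  }
  const rates = new Map<string, number>();
  totals.forEach((entry, testId) => rates.set(testId, entry.retried / entry.count));
  return rates;
}

// Tests whose retry rate exceeds the threshold (2% by default), worst first.
function testsToTriage(rates: Map<string, number>, threshold = 0.02): string[] {
  return Array.from(rates.entries())
    .filter(([, rate]) => rate > threshold)
    .sort((a, b) => b[1] - a[1])
    .map(([testId]) => testId);
}
```

A test retried in 3 of its last 100 runs lands on the triage list; one retried once in 100 runs does not. Runs older than the window are ignored, so a test that was fixed drops off the list on its own.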
CI pipeline design is a test architecture decision
Sharding syntax is well-documented. How to design the pipeline around it is not.
The right CI configuration depends on suite size. Here are three distinct models:
Under ~200 tests: A single runner with fullyParallel: true in playwright.config.ts is sufficient. Adding sharding here introduces coordination overhead that costs more than it saves.
200–1,000 tests: Matrix sharding across multiple runners is where you start to see meaningful time savings. The pattern:
# .github/workflows/playwright.yml
name: Playwright Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        shard: [1, 2, 3, 4]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - name: Install dependencies
        run: npm ci
      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium
      # merge-reports consumes blob reports, so each shard emits one
      - name: Run tests (shard ${{ matrix.shard }}/4)
        run: npx playwright test --shard=${{ matrix.shard }}/4 --reporter=blob
      - name: Upload shard blob report
        uses: actions/upload-artifact@v4
        if: always()
        with:
          name: blob-report-${{ matrix.shard }}
          path: blob-report/
          retention-days: 7
  merge-reports:
    needs: test
    runs-on: ubuntu-latest
    if: always()
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - name: Download all shard blob reports
        uses: actions/download-artifact@v4
        with:
          path: all-blob-reports/
          pattern: blob-report-*
          merge-multiple: true
      - name: Merge reports
        run: npx playwright merge-reports --reporter html ./all-blob-reports
      - name: Upload merged report
        uses: actions/upload-artifact@v4
        with:
          name: playwright-merged-report
          path: playwright-report/
Above 1,000 tests: Tag-based tiering. @smoke runs on every PR (fast feedback, catches regressions in critical paths). Full @regression runs nightly. @cross-browser gates releases only. This isn't about speed alone — it's about matching test granularity to the cost of being wrong at each gate.
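One way to wire the tiers into CI is to map each pipeline trigger to a tag and pass it to Playwright's `--grep` option. The sketch below assumes three hypothetical trigger names and a one-tag-per-gate mapping taken directly from the tiers above; whether later gates should also include earlier tags is a team decision this example does not make.

```typescript
// Hypothetical trigger names; map them to whatever your CI actually emits.
type Trigger = 'pull_request' | 'nightly' | 'release';

// One tag per gate, following the tiering described above.
const TIER_TAG: Record<Trigger, string> = {
  pull_request: '@smoke',
  nightly: '@regression',
  release: '@cross-browser',
};

// Pattern for Playwright's --grep option.
function grepForTrigger(trigger: Trigger): string {
  return TIER_TAG[trigger];
}

// Argument list a CI script could hand to npx.
function playwrightArgs(trigger: Trigger): string[] {
  return ['playwright', 'test', '--grep', grepForTrigger(trigger)];
}
```

For a pull request this yields `playwright test --grep @smoke`, which keeps the tier decision in one place instead of scattering tag names across workflow files.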
One setting that separates senior CI configuration from junior: forbidOnly: !!process.env.CI in playwright.config.ts. This fails the build immediately if any test.only reaches CI, rather than silently running a single test while everyone thinks the full suite passed. It's one line. It prevents an entire category of release incidents.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  forbidOnly: !!process.env.CI,
  retries: process.env.CI ? 1 : 0,
  workers: process.env.CI ? 4 : undefined,
  // ...
});
The test suite health metrics actually worth tracking
Most teams measure coverage. Coverage is not a health metric. It tells you what's touched; it doesn't tell you whether your suite is degrading.
Four metrics that do:
Flaky rate per test over 30 days. Playwright's built-in reporter gives you per-test retry data. Aggregate it over a rolling window and surface the worst offenders. Any test with a >2% retry rate gets triaged before new tests get written.
Mean time to green. Measured from commit to passing CI status. If this is creeping past 10 minutes for typical PRs, you have a pipeline architecture problem, not a test count problem. Sharding should bring this under 8 minutes for most suites up to 1,000 tests.
Test-to-feature coverage ratio. Are you adding tests faster than the application is gaining features? Slower? If the ratio is drifting downward, you're accumulating untested surface area. This shows up as late-discovered regressions.
Auth setup time as a fixture efficiency proxy. If your auth setup runs on every test instead of being worker-scoped, you're paying the cost of a full login flow N times per worker. Track it in your CI timing output. If auth accounts for more than 15% of total test time, your fixture architecture needs attention.
These aren't complex to track. Playwright's JSON reporter gives you the raw data. The point is to look at them consistently rather than reacting when CI starts to hurt.
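Two of these thresholds (auth above 15% of test time, mean time to green above 10 minutes) reduce to a few lines of arithmetic. The following is a rough sketch under assumed input shapes: `RunTiming` and the commit/green timestamp pairs are hypothetical summaries you would build from your own CI timing output.

```typescript
// Hypothetical timing summary for one CI run, in milliseconds.
type RunTiming = { authSetupMs: number; totalTestMs: number };
type GreenSample = { committedAt: Date; greenAt: Date };

// Share of total test time spent on authentication, as a fraction between 0 and 1.
function authShare(timing: RunTiming): number {
  if (timing.totalTestMs <= 0) return 0;
  return timing.authSetupMs / timing.totalTestMs;
}

// Mean minutes from commit to passing CI status across a set of PRs.
function meanTimeToGreenMinutes(samples: GreenSample[]): number {
  if (samples.length === 0) return 0;
  const totalMs = samples.reduce(
    (sum, s) => sum + (s.greenAt.getTime() - s.committedAt.getTime()),
    0,
  );
  return totalMs / samples.length / 60000;
}

// Applies the two thresholds from this section and returns readable warnings.
function healthWarnings(timing: RunTiming, meanGreenMinutes: number): string[] {
  const warnings: string[] = [];
  if (authShare(timing) > 0.15) warnings.push('auth setup exceeds 15% of test time');
  if (meanGreenMinutes > 10) warnings.push('mean time to green exceeds 10 minutes');
  return warnings;
}
```

Running this on every pipeline and logging the warnings is one lightweight way to see drift before CI starts to hurt.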
What not to test with Playwright
Senior engineers spend as much energy deciding what to remove from E2E coverage as what to add. The suite that includes everything is not thorough — it's expensive, slow, and hard to maintain.
The framework for the decision:
If unit tests cover it fully, E2E adds no value. A form validation rule that's tested at the component level doesn't need a full browser test. The E2E version tests infrastructure (browser rendering, network calls), not logic. If the logic is already covered, the browser test mostly adds ways to fail that have nothing to do with the rule you care about.
If it hits a real payment gateway or external service, it's a liability. Tests that require live third-party credentials are environment-dependent, slow to debug, and prone to flaking from the external service rather than your code. Fake the integration at the API boundary; test the business logic separately.
Visual regression at the component level belongs in Storybook, not Playwright. Running full-page screenshot comparisons in Playwright for component-level visual changes is the wrong layer. Storybook with Chromatic or Percy handles component-level visual regression faster, with better baseline management, and without the setup cost of a full browser session per story.
The strongest signal that a Playwright test should be replaced: if the test can be fully rewritten using Playwright's request fixture against your API, with no page interaction, replace it. You'll get the same coverage in a fraction of the time.
// If your E2E test only checks API-level behavior, this is faster and more reliable
import { test, expect } from '@playwright/test';

test('order creation returns 201 with correct payload', async ({ request }) => {
  const response = await request.post('/api/orders', {
    data: { productId: 'prod_123', quantity: 2 },
    headers: { Authorization: `Bearer ${process.env.API_TOKEN}` },
  });
  expect(response.status()).toBe(201);
  const body = await response.json();
  expect(body).toMatchObject({ status: 'pending', productId: 'prod_123' });
});
Keep E2E for flows that cross subsystem boundaries and span multiple user sessions — the paths where the integration of your frontend, API, auth layer, and external services either works end to end or doesn't. That's where Playwright earns its cost.
What happens when you fix the architecture, not just the tests
Go back to the inherited suite. 500 tests, 11% flaky rate, 55-minute CI, auth duplicated across 60 spec files.
The senior engineer's first diagnosis: fixture architecture. Before any individual test gets touched, the auth setup moves to auth.setup.ts with named role projects. That alone eliminates the copy-paste problem and cuts auth-related flakiness by removing the duplication vectors.
The second fix: CI pipeline. fullyParallel: true is already set, but the suite runs on a single runner. Moving to 4-runner sharding with merge-reports brings the runtime from 55 minutes to under 14. No tests are rewritten. The pipeline just processes them in parallel.
The third fix: eslint-plugin-playwright with no-wait-for-timeout: error. The static analysis run surfaces 23 hard-coded wait calls across the suite. Those are the primary source of the 11% flaky rate. Each gets replaced with assertion-level waits or API synchronization. Flakiness drops to under 2%.
None of this requires rebuilding the suite. It requires the architectural decisions that should have been made at the start.
Conclusion
Playwright doesn't make test suites stable — architecture does. The tool is excellent; the defaults are sensible. But a poorly structured suite in Playwright degrades in exactly the same ways as one in Selenium. The flakiness accumulates, the CI grows, and eventually the suite becomes an obstacle rather than a safety net.
The decisions that matter most aren't in the test files. They're in how you scope fixtures, how you structure auth, how you design the CI pipeline, and how you define what the suite should and shouldn't cover.
If your current suite is showing the early signs — flaky tests nobody owns, CI times that make PRs a waiting game, auth logic that lives in every spec file — that's not a test-writing problem. It's an architecture problem with a known set of solutions.
We audit Playwright suites and fix the architecture, not just the tests.

About the author
Chief Operating Officer
Anastasiia Sokolinska is the Chief Operating Officer at DeviQA, responsible for operational strategy, delivery performance, and scaling QA services for complex software products.