A/B Test Sample Size Calculator

Most stores end A/B tests too early. Find out the real sample size you need for statistically significant results.

Test Parameters


How big an improvement do you want to detect?

Total Visitors Needed

128,430

across both variations

Per Variation

64,215

visitors each

Days to Run

257

calendar days

Test End Date

Nov 27, 2026

if started today

Test Timeline

Today → Week 1 → Week 2 → Week 3 → Week 4 → Day 257

Week 1: 3,500 visitors (3% complete)

Week 2: 7,000 visitors (5% complete)

Week 3: 10,500 visitors (8% complete)

Week 4: 14,000 visitors (11% complete)

Long test duration

This test would take over 37 weeks. Consider testing larger changes (higher MDE) to reduce test duration. A test running longer than 4 weeks is exposed to seasonal shifts, cookie expiration, and other external factors that can pollute your results.

Test Configuration Summary

Baseline conversion rate: 2.5%
Expected conversion rate (with improvement): 2.75%
Absolute difference to detect: 0.25pp
Confidence level: 95%
Statistical power: 80%
Variations: 2

Why Sample Size Matters in A/B Testing

The number one mistake in A/B testing is ending the test too early. You see a 15% lift after two days, get excited, and call it. Then you implement the change and the lift disappears. That is not bad luck — it is statistics punishing you for insufficient sample size.

Small samples produce noisy data. A coin flipped 10 times might show 70% heads. Flip it 10,000 times and you will land very close to 50%. A/B tests work the same way. You need enough visitors in each variation to separate real differences from random noise.

This calculator uses standard power analysis to tell you exactly how many visitors you need — and how long your test must run — before the results mean anything. Enter your numbers above and stop guessing.

How the Sample Size Formula Works

The calculation depends on four factors: your baseline conversion rate, the minimum detectable effect (how small an improvement you want to catch), your desired statistical significance (confidence level), and statistical power (the probability of detecting a real effect).

Higher confidence and power require more visitors. Smaller effects require more visitors. Lower baseline conversion rates require more visitors. This is why a store converting at 1% needs a much larger sample than one converting at 10% — the signal is weaker and harder to detect.

The standard approach uses Z-scores for your chosen significance and power levels, combined with the pooled variance of your baseline and expected conversion rates. The result tells you the minimum visitors per variation needed for a valid conclusion.
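The Z-score approach above can be sketched in a few lines of Python. This is a minimal illustration using the standard normal-approximation formula with unpooled variance; the calculator itself may use a pooled-variance variant and round slightly differently, so treat the function name and exact output as an assumption, not this tool's internals:

```python
import math
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, confidence=0.95, power=0.80):
    """Minimum visitors per variation for a two-sided two-proportion test.

    Normal-approximation formula (unpooled variance); real calculators
    may use a pooled-variance variant and differ by a few dozen visitors.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)             # expected rate with the lift
    alpha = 1 - confidence
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# 2.5% baseline, 10% relative MDE, 95% confidence, 80% power
n = sample_size_per_variation(0.025, 0.10)
```

With the configuration shown above (2.5% baseline, 2.75% expected), this yields roughly 64,000 visitors per variation, in line with the figure the calculator reports.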

What to Do When You Need Too Many Visitors

If the calculator tells you the test would take three months, you have options. First, test bigger changes. Because required sample size scales with the inverse square of the effect size, a 20% MDE needs roughly a quarter of the visitors that a 10% MDE needs. Stop testing button color changes and test fundamentally different page layouts, value propositions, or pricing structures.
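That inverse-square relationship is easy to check numerically. A quick sketch, assuming a 2.5% baseline and the usual 95%/80% settings (the `approx_n` helper is hypothetical, using the standard normal-approximation formula):

```python
from statistics import NormalDist

def approx_n(p, rel_mde, alpha=0.05, power=0.80):
    # Approximate per-variation sample size (unpooled variance).
    p2 = p * (1 + rel_mde)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z ** 2 * (p * (1 - p) + p2 * (1 - p2)) / (p2 - p) ** 2

n10 = approx_n(0.025, 0.10)  # 10% relative MDE
n20 = approx_n(0.025, 0.20)  # 20% relative MDE
ratio = n10 / n20            # close to 4x
```

The ratio comes out near 3.8 rather than exactly 4 because the variance term also shifts slightly with the expected conversion rate, but the point stands: doubling the MDE cuts the required sample to roughly a quarter.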

Second, test higher in the funnel. Landing pages get more traffic than checkout pages. Test where the volume is, then validate downstream impact.

Third, if your traffic is genuinely too low for quantitative A/B testing, switch to qualitative methods. User session recordings, heatmaps, customer interviews, and usability testing can surface conversion issues without needing 50,000 visitors. Use our CRO Diagnostic to find your biggest conversion leaks — no traffic threshold required.

The worst option is running an underpowered test and acting on the results. That is worse than not testing at all, because it gives you false confidence in a decision that may actually be hurting your revenue growth.

Frequently Asked Questions

Why can't I just stop a test when I see a clear winner?

Early results are dominated by random noise. A test that shows a 20% lift after 200 visitors could easily show 0% lift — or a negative result — after 2,000 visitors. Statistical significance means your result is unlikely to be caused by chance. Without sufficient sample size, you are making decisions based on randomness, not real user behavior.

What does Minimum Detectable Effect (MDE) mean?

MDE is the smallest improvement you want your test to reliably detect. A 10% MDE on a 2% conversion rate means you want to catch a lift to 2.2%. Smaller MDEs require more visitors because the signal is weaker. If you only have moderate traffic, increase your MDE by testing bolder changes — a headline rewrite instead of a button color tweak.

What is the difference between statistical significance and statistical power?

Statistical significance (confidence level) is the probability of avoiding a false positive — declaring a winner when there is no real difference. Statistical power is the probability of detecting a real difference when one exists. The standard is 95% significance and 80% power. Increasing either requires more visitors.

My traffic is too low for A/B testing. What should I do?

If you need more than 60 days to reach the required sample size, formal A/B testing is impractical. Use qualitative methods instead: session recordings (Hotjar, FullStory), heatmaps, customer surveys, and usability testing. These reveal conversion problems without requiring statistical sample sizes. Fix the obvious issues first, then A/B test when your traffic supports it.

Should I use 90%, 95%, or 99% confidence?

For most ecommerce tests, 95% is the standard. Use 90% when the cost of a wrong decision is low (e.g., minor copy changes). Use 99% when the stakes are high (e.g., a complete checkout redesign or pricing change). Higher confidence requires more visitors, so match the rigor to the risk.

More Free Tools

Find out exactly where your store is leaking revenue.

Answer a quick set of multiple-choice questions and we'll pinpoint your biggest revenue leaks — and whether we can help plug them.

Find Your Revenue Leaks

Free · No obligation · 2 minutes