Most stores end A/B tests too early. Find out the real sample size you need for statistically significant results.
How big an improvement do you want to detect?
Total Visitors Needed
128,430
across both variations
Per Variation
64,215
visitors each
Days to Run
257
calendar days
Test End Date
Nov 27, 2026
if started today
Week 1
3,500 visitors
3% complete
Week 2
7,000 visitors
5% complete
Week 3
10,500 visitors
8% complete
Week 4
14,000 visitors
11% complete
This test would take roughly 37 weeks. Consider testing larger changes (higher MDE) to reduce test duration. A test running longer than 4 weeks is exposed to seasonal shifts, cookie expiration, and other external factors that can pollute your results.
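For reference, the duration and weekly progress figures above follow from a simple projection. Here is a minimal sketch, assuming the ~3,500 test-eligible visitors per week shown in the weekly breakdown (your own traffic will differ):

```python
# Sketch of how test duration follows from traffic volume (illustrative only).
from math import ceil

total_needed = 128_430     # total visitors needed across both variations
weekly_visitors = 3_500    # visitors entering the test each week (example traffic)

print(ceil(total_needed / (weekly_visitors / 7)), "calendar days")   # 257
for week in range(1, 5):
    seen = week * weekly_visitors
    print(f"Week {week}: {seen:,} visitors, {seen / total_needed:.0%} complete")
```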
The number one mistake in A/B testing is ending the test too early. You see a 15% lift after two days, get excited, and call it. Then you implement the change and the lift disappears. That is not bad luck — it is statistics punishing you for insufficient sample size.
Small samples produce noisy data. A coin flipped 10 times might show 70% heads. Flip it 10,000 times and you will land very close to 50%. A/B tests work the same way. You need enough visitors in each variation to separate real differences from random noise.
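If you want to see that noise for yourself, a tiny simulation makes the point (purely illustrative; the exact percentages change on every run):

```python
# Sampling noise shrinks as the sample grows: a fair coin simulated at
# three sample sizes (results vary run to run).
import random

for flips in (10, 100, 10_000):
    heads = sum(random.random() < 0.5 for _ in range(flips))
    print(f"{flips:>6} flips: {heads / flips:.1%} heads")
```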
This calculator uses standard power analysis to tell you exactly how many visitors you need — and how long your test must run — before the results mean anything. Enter your numbers above and stop guessing.
The calculation depends on four factors: your baseline conversion rate, the minimum detectable effect (how small an improvement you want to catch), your desired statistical significance (confidence level), and statistical power (the probability of detecting a real effect).
Higher confidence and power require more visitors. Smaller effects require more visitors. Lower baseline conversion rates require more visitors. This is why a store converting at 1% needs a much larger sample than one converting at 10% — the signal is weaker and harder to detect.
The standard approach uses Z-scores for your chosen significance and power levels, combined with the pooled variance of your baseline and expected conversion rates. The result tells you the minimum visitors per variation needed for a valid conclusion.
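As a rough illustration, here is a minimal sketch of that calculation in Python. It uses one common normal-approximation form (combining the variance of the two rates; different tools vary slightly in this detail), and the function name, 2.5% baseline, and 10% MDE are illustrative assumptions, not this calculator's exact inputs or implementation:

```python
# Minimal sketch of the standard two-proportion sample size formula
# (normal approximation). Illustrative only -- not this calculator's code.
from math import ceil
from statistics import NormalDist

def sample_size_per_variation(baseline, mde_relative, significance=0.95, power=0.80):
    p1 = baseline                                           # control conversion rate
    p2 = baseline * (1 + mde_relative)                      # variant rate you want to detect
    z_alpha = NormalDist().inv_cdf(1 - (1 - significance) / 2)  # Z-score for two-sided significance
    z_beta = NormalDist().inv_cdf(power)                        # Z-score for statistical power
    variance = p1 * (1 - p1) + p2 * (1 - p2)                # combined variance of the two rates
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical inputs: 2.5% baseline, 10% relative MDE, 95% significance, 80% power
print(sample_size_per_variation(0.025, 0.10))   # ~64,200 per variation
# Doubling the MDE to 20% cuts the requirement to roughly a quarter
print(sample_size_per_variation(0.025, 0.20))   # ~16,800 per variation
```

Multiply the per-variation figure by the number of variations to get the total, then divide by your daily test traffic to estimate how long the test must run.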
If the calculator tells you the test would take three months, you have options. First, test bigger changes. A 20% MDE needs roughly a quarter of the sample that a 10% MDE does. Stop testing button color changes and test fundamentally different page layouts, value propositions, or pricing structures.
Second, test higher in the funnel. Landing pages get more traffic than checkout pages. Test where the volume is, then validate downstream impact.
Third, if your traffic is genuinely too low for quantitative A/B testing, switch to qualitative methods. User session recordings, heatmaps, customer interviews, and usability testing can surface conversion issues without needing 50,000 visitors. Use our CRO Diagnostic to find your biggest conversion leaks — no traffic threshold required.
The worst option is running an underpowered test and acting on the results. That is worse than not testing at all, because it gives you false confidence in a decision that may actually be hurting your revenue growth.
Early results are dominated by random noise. A test that shows a 20% lift after 200 visitors could easily show 0% lift — or a negative result — after 2,000 visitors. Statistical significance means your result is unlikely to be caused by chance. Without sufficient sample size, you are making decisions based on randomness, not real user behavior.
MDE is the smallest improvement you want your test to reliably detect. A 10% MDE on a 2% conversion rate means you want to catch a lift to 2.2%. Smaller MDEs require more visitors because the signal is weaker. If you only have moderate traffic, increase your MDE by testing bolder changes — a headline rewrite instead of a button color tweak.
Statistical significance (confidence level) is the probability of avoiding a false positive — declaring a winner when there is no real difference. Statistical power is the probability of detecting a real difference when one exists. The standard is 95% significance and 80% power. Increasing either requires more visitors.
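To make "increasing either requires more visitors" concrete, here is a quick comparison using the same normal-approximation sketch as above (the 2.5% baseline and 10% MDE are again just illustrative inputs):

```python
# How confidence and power choices move the required sample size
# (same illustrative formula and inputs as the sketch above).
from statistics import NormalDist

def n_per_variation(significance, power, baseline=0.025, mde=0.10):
    p2 = baseline * (1 + mde)
    z = NormalDist().inv_cdf(1 - (1 - significance) / 2) + NormalDist().inv_cdf(power)
    return round(z ** 2 * (baseline * (1 - baseline) + p2 * (1 - p2)) / (p2 - baseline) ** 2)

print(n_per_variation(0.95, 0.80))   # ~64,200  (the standard 95% / 80% setup)
print(n_per_variation(0.99, 0.80))   # ~95,500  (higher confidence costs visitors)
print(n_per_variation(0.95, 0.90))   # ~85,900  (so does higher power)
```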
If you need more than 60 days to reach the required sample size, formal A/B testing is impractical. Use qualitative methods instead: session recordings (Hotjar, FullStory), heatmaps, customer surveys, and usability testing. These reveal conversion problems without requiring statistical sample sizes. Fix the obvious issues first, then A/B test when your traffic supports it.
For most ecommerce tests, 95% is the standard. Use 90% when the cost of a wrong decision is low (e.g., minor copy changes). Use 99% when the stakes are high (e.g., a complete checkout redesign or pricing change). Higher confidence requires more visitors, so match the rigor to the risk.
Answer a quick set of multiple-choice questions and we'll pinpoint your biggest revenue leaks — and whether we can help plug them.
Find Your Revenue Leaks
Free · No obligation · 2 minutes