When to use a statistical significance calculator
Any time you compare two percentages and want to act on the difference, you're making a bet: is the gap real, or did it just happen by chance? This calculator settles that bet for the most common cases marketers and researchers run into:
- Two segments in a survey. 31% of your enterprise customers picked "pricing" as their top frustration, but only 22% of small businesses did. Is that a genuine difference between segments, or sampling luck?
- Two variants of a form or question. You ran two versions of a signup form or reworded a question, and version B converted better. Before you declare a winner, check the math — our guide to form conversion optimization covers what to test in the first place.
- Before and after a change. Satisfaction was 68% last quarter and 74% this quarter. Significant improvement, or a wobble you'll be embarrassed to have celebrated?
In every case the inputs are the same: how many people were in each group, and how many of them gave the answer (or took the action) you're measuring.
How to read the result
The calculator runs a two-proportion z-test and returns a p-value. In plain language, the p-value answers one question: if there were truly no difference between the two groups, how often would random chance alone produce a gap at least this big?
A p-value of 0.20 means a gap like yours would show up about 20% of the time by pure luck, far too often to trust. A p-value of 0.01 means a gap that size would appear only 1% of the time if the groups were truly the same, which is strong evidence that the difference is real.
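The conventional cutoff is 0.05: below it, the result is called "statistically significant," meaning a gap this large would show up less than 5% of the time if there were truly no difference between the groups. (That is not quite the same as "a 5% chance the result is a fluke": the p-value describes how surprising your data would be if nothing were going on, not the odds that nothing is going on.) The threshold is a convention, not a law of nature: a p-value of 0.06 isn't meaningless and 0.04 isn't gospel. But 0.05 is the standard your stakeholders will recognize, and it's what this calculator uses for its verdict.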
One thing "significant" does not mean: important. It only means the difference is unlikely to be random. Whether a real 2-point lift matters to your business is a judgment call the math can't make for you.
A worked example
Say you tested two versions of a lead form. Group A saw the original: 200 visitors, 48 completed it — a 24% conversion rate. Group B saw the redesign: 210 visitors, 69 completed it — about 33%.
The pooled rate across both groups is 117 out of 410, or 28.5%. The z-test compares the 9-point gap against the variation you'd expect from samples this size and returns a z-score of roughly 1.99, which works out to a two-sided p-value of about 0.047. That's under 0.05, so the redesign's win is statistically significant, but only barely. If the two forms truly performed the same, a difference this large would appear only about 5 times in 100, so you can roll out version B with reasonable confidence. If the same rates had come from 50 visitors per group, the p-value would be around 0.3 and the honest answer would be "we can't tell yet."
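If you'd like to check the arithmetic yourself, here is a minimal Python sketch of a pooled two-proportion z-test using only the standard library. The function name `two_proportion_z_test` is ours, chosen for illustration, and the sketch assumes a two-sided test; it is not necessarily how the calculator is implemented.

```python
from math import erfc, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    # Pooled rate: everyone who converted, divided by everyone who visited.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    # Standard error of the gap, assuming both groups share the pooled rate.
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of a standard normal.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# The lead form example: 48 of 200 versus 69 of 210.
z, p = two_proportion_z_test(48, 200, 69, 210)
print(f"z = {z:.2f}, p = {p:.3f}")  # prints: z = 1.99, p = 0.047
```

The inputs are the same four numbers the calculator asks for: the size of each group and the count in each group that took the action.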
Why sample size matters more than the gap
That last point trips people up constantly: the same percentage gap can be rock solid or meaningless depending on how many people are behind it. A 10-point difference between two groups of 1,000 is overwhelming evidence. The same 10-point difference between two groups of 30 could easily be noise: small samples bounce around so much that big gaps appear and vanish on their own.
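To see the effect in numbers, the sketch below holds a 10-point gap fixed and changes only the group size. The 50% and 60% rates are an assumption picked for illustration (the exact p-values shift with the underlying rates), and `p_value_for_gap` is a hypothetical helper that applies the same pooled z-test to two equal-sized groups.

```python
from math import erfc, sqrt

def p_value_for_gap(rate_a, rate_b, n_per_group):
    """Two-sided p-value for two equal-sized groups with the given rates."""
    # With equal group sizes, the pooled rate is the average of the two rates.
    pooled = (rate_a + rate_b) / 2
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n_per_group)
    z = abs(rate_b - rate_a) / standard_error
    return erfc(z / sqrt(2))

# Same 10-point gap (assumed 50% vs 60%), three different sample sizes.
for n in (30, 100, 1000):
    print(f"{n:>5} per group: p = {p_value_for_gap(0.50, 0.60, n):.4f}")
```

Under these assumed rates, the p-value is about 0.44 with 30 people per group, about 0.16 with 100, and far below 0.001 with 1,000.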
This is the same phenomenon as the margin of error on a poll: fewer respondents means each percentage comes with a wider band of uncertainty, and two wide bands overlap easily. You can see exactly how wide your bands are with our margin of error calculator. And if you're planning a test rather than analyzing one, work backwards: decide the smallest difference you'd care about, then use our sample size calculator to figure out how many respondents you need before you start. Collecting data first and hoping for significance later is how tests end in frustration.
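For a rough sense of how wide those bands are, the usual approximation for a single percentage at 95% confidence can be sketched as follows. `margin_of_error` is a hypothetical helper, and the 1.96 multiplier assumes a 95% confidence level; our margin of error calculator may handle edge cases differently.

```python
from math import sqrt

def margin_of_error(rate, n, z_critical=1.96):
    """Approximate margin of error for one proportion (normal approximation)."""
    return z_critical * sqrt(rate * (1 - rate) / n)

# The same 24% rate measured on 200 people and on 50 people.
for n in (200, 50):
    print(f"n = {n}: plus or minus {margin_of_error(0.24, n) * 100:.1f} points")
```

For a 24% rate, that comes to roughly 6 points either way with 200 respondents and roughly 12 points with 50, which is why two small groups overlap so easily.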
Common pitfalls
- Peeking early and stopping when you hit significance. If you check the p-value every day and stop the moment it dips under 0.05, you'll "win" far more often than you should — random wobbles cross the line all the time on their way to nowhere. Decide your sample size up front and judge the result once, when you reach it.
- Testing many segments and reporting the one that won. Slice your survey by age, plan, industry, and region, and one slice will probably look significant by luck alone: run twenty comparisons at the 0.05 level and you should expect about one false positive on average, even if no real differences exist. Treat surprise wins in sub-segments as hypotheses to re-test, not findings to announce.
- Confusing statistical significance with practical importance. With a huge sample, a 0.4-point difference can be highly significant and still not worth a single meeting. Always ask two questions: is the difference real, and is it big enough to act on?
- Comparing groups that differ in more than one way. If version B ran a week later, to a different audience, during a promotion, the test can't tell you which change caused the gap. Keep everything but the thing you're testing constant.
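The arithmetic behind the many-segments pitfall is short enough to sketch. Both helpers below are hypothetical, and both assume the comparisons are independent and that no real differences exist anywhere, which is a simplification of a real survey.

```python
def expected_false_positives(comparisons, alpha=0.05):
    """Average number of comparisons that cross the cutoff by luck alone."""
    return comparisons * alpha

def chance_of_any_false_positive(comparisons, alpha=0.05):
    """Probability that at least one comparison crosses the cutoff by luck."""
    return 1 - (1 - alpha) ** comparisons

for k in (1, 5, 20):
    print(
        f"{k:>2} comparisons: expect {expected_false_positives(k):.2f} false positives, "
        f"{chance_of_any_false_positive(k):.0%} chance of at least one"
    )
```

With twenty comparisons, the chance of at least one lucky "winner" is roughly 64% under these assumptions, which is why a surprise win in one slice deserves a re-test.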
The honest fine print
This tool runs a two-proportion z-test with pooled variance — the standard method for comparing two percentages. It's an approximation, and it's least reliable when samples are small (roughly under 30 respondents per group) or when rates sit very close to 0% or 100%. In those edge cases treat the verdict as a rough guide, and when the decision is high-stakes, collect more responses rather than squinting at a borderline p-value.
Significance testing also can't rescue a biased sample: if only your happiest customers answered, no p-value will fix that. Getting trustworthy inputs — enough responses, from the right people — is the real work, and our form analytics guide covers how to measure and improve it. When you're ready to run the test itself, build both variants as free Fomr surveys with unlimited responses, split your audience, and bring the counts back here.