Understanding statistical significance is critical for making sound decisions from your A/B tests. Acting on unreliable data can harm your app's performance, while being overly cautious delays valuable improvements. This guide explains statistical significance in practical terms and provides clear frameworks for decision-making.
Statistical significance answers one critical question: "Is the difference I'm seeing between variants real, or could it just be random chance?"
When you run an A/B test, you observe different performance between variants. But these differences might be:
Real differences: One variant genuinely performs better
Random variation: Normal fluctuations that would disappear with more data
Statistical significance quantifies the probability that observed differences are real, not random.
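PressPlay computes this for you, but if you want to see the idea behind it, here is a minimal Python sketch of a two-proportion z-test. The function name and the impression/install numbers are hypothetical, and treating "1 minus the p-value" as a confidence level mirrors the simplified framing used throughout this guide.

```python
from math import sqrt
from scipy.stats import norm

def two_proportion_confidence(conv_a, n_a, conv_b, n_b):
    """Rough confidence that variants A and B truly differ,
    based on a two-sided two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)                 # pooled conversion rate
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))   # standard error of the difference
    z = (p_b - p_a) / se                                     # z-score of the observed difference
    p_value = 2 * (1 - norm.cdf(abs(z)))                     # two-sided p-value
    return 1 - p_value                                       # e.g. 0.95 reads as "95% confidence"

# Hypothetical numbers: 5,000 impressions each, 150 vs. 180 installs
print(round(two_proportion_confidence(150, 5000, 180, 5000), 3))  # ~0.91: suggestive, but below the 95% bar
```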
PressPlay reports confidence levels for your experiments. Here's what they mean:
Meaning: 90% probability the observed difference is real
Risk: 10% chance (1 in 10) you're wrong
When to use: Minimum threshold for most decisions
Meaning: 95% probability the observed difference is real
Risk: 5% chance (1 in 20) you're wrong
When to use: When you want more assurance than the 90% minimum without significantly slowing down decisions
Meaning: 99% probability the observed difference is real
Risk: 1% chance (1 in 100) you're wrong
When to use: High-stakes decisions or major changes (e.g., an icon change that has implications across your organization)
Statistical significance depends heavily on sample size:
Imagine flipping a coin to determine if it's fair:
10 flips: 7 heads, 3 tails. Is the coin biased? Hard to say (small sample)
1,000 flips: 700 heads, 300 tails. The coin is almost certainly biased (large sample)
The same principle applies to A/B testing. Small differences require large samples to confirm they're real.
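If you want to see this concretely, here is a quick Python sketch using SciPy's exact binomial test, with numbers that simply mirror the coin example above:

```python
from scipy.stats import binomtest

# Same 70/30 split, two very different sample sizes
small = binomtest(7, 10, p=0.5)      # 7 heads in 10 flips
large = binomtest(700, 1000, p=0.5)  # 700 heads in 1,000 flips

print(f"10 flips:    p-value = {small.pvalue:.3f}")  # ~0.344: could easily be chance
print(f"1,000 flips: p-value = {large.pvalue:.3g}")  # vanishingly small: the coin is biased
```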
For reliable results, aim for these minimums per variant:
Absolute minimum: 100 impressions per variant
Recommended minimum: 1,000 impressions per variant
Ideal: 5,000+ impressions per variant
Additionally, you need sufficient conversions:
Minimum conversions: 20-30 installs per variant to detect meaningful differences
Recommended: 100+ installs per variant for reliable results
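For a rough sense of how these minimums relate to the size of the effect you are trying to detect, here is a Python sketch of a standard sample-size approximation for comparing two conversion rates. The 3% baseline install rate and 30% relative lift are hypothetical inputs, not PressPlay defaults.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_variant(base_rate, lift, alpha=0.05, power=0.80):
    """Approximate impressions needed per variant to detect a relative
    lift in conversion rate (two-sided test, normal approximation)."""
    p1 = base_rate
    p2 = base_rate * (1 + lift)
    z_alpha = norm.ppf(1 - alpha / 2)   # 1.96 for 95% confidence
    z_beta = norm.ppf(power)            # 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

# Hypothetical baseline: 3% install rate, trying to detect a 30% relative lift
print(sample_size_per_variant(0.03, 0.30))  # ~6,450 impressions per variant
```

As the example shows, small lifts on low baseline rates need far more traffic than large lifts, which is why the minimums above are floors rather than targets.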
Statistical significance tells you if a difference is real, but effect size tells you if it matters:
Large effect (6%+ improvement): Substantial, meaningful impact—definitely implement
Medium effect (3-5% improvement): Worthwhile improvement—implement
Small effect (1-2% improvement): Modest but valuable over time—implement if confident
You can have:
Statistically significant but small: Real difference, but may not matter much practically
Large but not significant: Promising difference, but not enough data to be certain
Large and significant: Clear winner—implement immediately
Always consider both statistical significance and effect size together.
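As an illustration, here is a small Python sketch that combines the two signals into the categories above; the function name and thresholds are illustrative, not part of PressPlay.

```python
def classify_result(confidence, relative_lift):
    """Combine statistical significance and effect size into one read,
    mirroring the categories above (thresholds are illustrative)."""
    significant = confidence >= 0.95
    large = abs(relative_lift) >= 0.06
    if significant and large:
        return "Large and significant: clear winner"
    if significant:
        return "Significant but small: real, but may not matter much"
    if large:
        return "Large but not significant: promising, needs more data"
    return "Neither: no actionable difference yet"

print(classify_result(0.97, 0.12))   # Large and significant: clear winner
print(classify_result(0.99, 0.015))  # Significant but small
print(classify_result(0.85, 0.09))   # Large but not significant
```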
Use this framework to decide whether to implement test results:
Scenario 1: Clear Winner (Large Effect)
Conditions: 95%+ confidence, 10%+ improvement
Decision: Implement immediately
Action: Roll out winning variant, document learning
Scenario 2: Clear Winner (Moderate Effect)
Conditions: 95%+ confidence, 5-10% improvement
Decision: Implement
Action: Roll out winning variant, monitor performance
Scenario 3: Significant but Small Effect
Conditions: 95%+ confidence, 2-5% improvement
Decision: Usually implement, but consider effort vs. impact
Action: If easy to implement, do it; if complex, weigh opportunity cost
Scenario 4: Promising but Not Yet Significant
Conditions: 90-95% confidence, 10%+ improvement
Decision: Consider running test longer or implement with monitoring
Action: Extend test if possible, or implement with plan to revert if performance drops
Scenario 5: Inconclusive Results
Conditions: Below 90% confidence, or 90-95% confidence without a large improvement
Decision: Don't implement based on this test
Action: Continue test, start new test, or accept no significant difference
Scenario 6: Negative Results (High Confidence)
Conditions: 95%+ confidence that new variant performs worse
Decision: Don't implement; keep current version
Action: Document what didn't work, extract insights for future tests
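If it helps to see the framework in one place, here is a Python sketch that maps confidence and relative improvement to the scenarios above. The function name is hypothetical and the thresholds simply restate the framework.

```python
def recommend(confidence, relative_lift):
    """Map a test result to the decision framework above.
    confidence is in [0, 1]; relative_lift is a fraction (0.10 = 10% improvement)."""
    if confidence >= 0.95 and relative_lift <= 0:
        return "Keep current version; document what didn't work"     # Scenario 6
    if confidence >= 0.95 and relative_lift >= 0.10:
        return "Implement immediately"                                # Scenario 1
    if confidence >= 0.95 and relative_lift >= 0.05:
        return "Implement and monitor performance"                    # Scenario 2
    if confidence >= 0.95 and relative_lift >= 0.02:
        return "Usually implement; weigh effort vs. impact"           # Scenario 3
    if confidence >= 0.90 and relative_lift >= 0.10:
        return "Extend the test, or implement with a plan to revert"  # Scenario 4
    return "Don't implement based on this test"                       # Scenario 5

print(recommend(0.97, 0.12))  # Implement immediately
print(recommend(0.92, 0.15))  # Extend the test, or implement with a plan to revert
print(recommend(0.80, 0.04))  # Don't implement based on this test
```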
Myth: "One variant is ahead after a few days, so it's the winner."
Reality: Early leads often disappear with more data. Wait for significance and minimum duration.
Myth: "95% confidence means the result is certain."
Reality: 95% confidence still means a 5% chance you're wrong. It's very likely, not guaranteed.
Myth: "No statistical significance means the variants perform the same."
Reality: Absence of significance isn't proof of no difference; you may just need more data.
Myth: "You should always wait for 99% confidence."
Reality: 99% confidence requires much larger samples. 95% is appropriate for most decisions.
Myth: "Just keep the test running until it reaches significance."
Reality: If variants truly perform the same, you'll never reach significance. Set a maximum duration.
When testing multiple variants simultaneously, significance thresholds should be adjusted:
2 variants (A/B): 5% false positive rate (as expected)
3 variants (A/B/C): Higher false positive rate due to multiple comparisons
4+ variants: Even higher false positive rate
2 variants: Use standard 95% confidence threshold
3 variants: Look for 96-97% confidence or larger effect sizes
4+ variants: Require 98%+ confidence or very large effects
Or stick to A/B tests (2 variants only) to avoid this complexity.
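For reference, here is a Python sketch of a Bonferroni-style correction, one conservative way to arrive at thresholds like those above. It assumes each treatment variant is compared against a single control, and the exact values it produces are slightly stricter than the rough guidance above.

```python
def adjusted_confidence_threshold(num_variants, base_alpha=0.05):
    """Bonferroni-style correction: split the allowed false positive rate
    across every comparison against the control."""
    comparisons = num_variants - 1          # each treatment vs. the control
    if comparisons <= 1:
        return 1 - base_alpha               # plain A/B test: 95%
    return 1 - base_alpha / comparisons     # stricter threshold per comparison

for variants in (2, 3, 4, 5):
    print(variants, "variants ->", f"{adjusted_confidence_threshold(variants):.1%} confidence required")
# 2 -> 95.0%, 3 -> 97.5%, 4 -> 98.3%, 5 -> 98.8%
```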
Checking results mid-test is natural, but be careful:
Checking results daily and stopping when you see significance inflates false positive rates. You might catch a temporary random fluctuation and mistake it for a real effect.
Check weekly: Review progress once per week, not daily
Note trends: Observe which direction results are trending, but don't decide yet
Wait for milestones: Make decisions only at predetermined checkpoints (7 days, 14 days, etc.)
Avoid confirmation bias: Don't stop just because results match your expectations
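A minimal Python sketch of the milestone approach described above; the checkpoint days and thresholds come from this guide, and the function name is hypothetical.

```python
CHECKPOINT_DAYS = {7, 14, 21}   # predetermined decision points

def ready_to_decide(days_elapsed, impressions_per_variant, confidence):
    """Only allow a decision at a planned checkpoint, and only once the
    basic sample size and confidence thresholds from this guide are met."""
    at_checkpoint = days_elapsed in CHECKPOINT_DAYS
    enough_data = impressions_per_variant >= 1000
    significant = confidence >= 0.95
    return at_checkpoint and enough_data and significant

print(ready_to_decide(4, 800, 0.96))    # False: too early and too little data
print(ready_to_decide(14, 3200, 0.96))  # True: checkpoint reached with solid data
```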
If results surprise you:
Verify the test setup: Confirm variants were implemented correctly
Check for external factors: Were there holidays, platform changes, or unusual events?
Review sample size: Is it large enough for reliable results?
Consider segment effects: Did different user segments respond differently?
Accept the data: If everything checks out, trust the results even if unexpected
Surprising results are often the most valuable—they reveal insights you didn't anticipate.
You might want to analyze results by user segment (geography, device type, etc.). Be cautious:
Reduces sample size: Each segment has fewer data points
Multiple comparisons: Testing significance across many segments inflates false positives
Exploratory only: Use segment analysis to generate hypotheses, not make primary decisions
If you notice segment differences, design a follow-up test specifically for that segment.
When sharing test results with your team or executives:
Confidence level: "95% confident this is a real difference"
Effect size: "12% improvement in install rate"
Sample size: "Based on 5,000 impressions per variant"
Recommendation: "Implement variant B immediately"
"We tested a benefit-focused hero screenshot against our current feature-focused design. After 14 days and 4,500 impressions per variant, the benefit-focused variant showed a 15% improvement in install rate with 99% confidence. I recommend implementing this variant immediately."
Before acting on test results, verify:
✓ Confidence level is 90% or higher
✓ Minimum 7 days elapsed
✓ At least 1,000 impressions per variant
✓ At least 20-30 conversions per variant
✓ Effect size is meaningful (ideally 5%+)
✓ No major external factors distorting results
✓ Test setup verified as correct
✓ Results align with your hypothesis, or you have a plausible explanation for why they don't
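If you prefer to encode this checklist, here is a minimal Python sketch; the parameter names are hypothetical, the thresholds come from the list above, and the subjective items are reduced to simple booleans.

```python
def passes_checklist(confidence, days, impressions, conversions, relative_lift,
                     setup_verified, external_factors_ok):
    """Pre-decision checklist from above as a single boolean check."""
    return all([
        confidence >= 0.90,            # 90% or higher
        days >= 7,                     # minimum 7 days elapsed
        impressions >= 1000,           # per variant
        conversions >= 20,             # per variant
        abs(relative_lift) >= 0.05,    # effect size "ideally 5%+"
        setup_verified,                # test setup confirmed correct
        external_factors_ok,           # no holidays, outages, etc. distorting results
    ])

print(passes_checklist(0.96, 14, 4500, 140, 0.15, True, True))  # True
```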
Most A/B tests don't require statistical expertise beyond what PressPlay provides. However, consider expert consultation for:
Very low traffic situations: Need advanced methods to handle small samples
Complex multi-variant tests: Testing 5+ variants simultaneously
High-stakes decisions: Results will drive major business investments
Unusual patterns: Results don't make intuitive sense
Segment-specific optimization: Need to properly handle multiple comparisons
Use 90% confidence as your minimum decision threshold, and 95% for most implementation decisions
Ensure minimum 7 days and 1,000 impressions per variant before deciding
Consider both significance and effect size—need both to make good decisions
Don't stop tests early just because one variant is leading
Trust the data even when results surprise you (after verifying test setup)
Remember: 95% confidence means you'll be wrong about 1 in 20 times—that's okay
Understanding statistical significance transforms A/B testing from guesswork into science, giving you confidence to make data-driven decisions that consistently improve your app store performance.