SEO A/B Test Analyzer

Analyze and validate your SEO changes with SEO A/B testing based on the Causal Impact approach.

  1. Upload data
  2. Run analysis
  3. Review results

Input data

Upload a CSV with columns: date, variant, control.

Q&A

What does this tool do?

It estimates the causal impact of an SEO change by comparing observed variant performance to a modelled counterfactual built from historical variant and control data.
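As a rough illustration of the counterfactual idea (not the tool's actual model, which follows the Causal Impact approach), one could fit the pre-period relationship between control and variant and project it into the post-period; the function name and data below are hypothetical:

```python
import numpy as np

def counterfactual(control_pre, variant_pre, control_post):
    """Fit variant ~ a*control + b on pre-period data (ordinary least
    squares), then predict what the variant would have looked like in
    the post-period had no change been made."""
    a, b = np.polyfit(control_pre, variant_pre, 1)
    return a * np.array(control_post) + b

# Pre-period: variant tracks control closely (roughly 2x control).
control_pre = [100, 110, 105, 120, 115]
variant_pre = [200, 221, 209, 241, 229]
predicted_post = counterfactual(control_pre, variant_pre, [118, 125])
# Compare predicted_post against the observed post-period variant
# values to estimate the effect of the change.
```

The actual model also quantifies uncertainty around this baseline, which a plain regression sketch does not.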

What input file do I need?

Use either CSV upload or Google Search Console mode. Both produce daily time-series data in the same structure: date, variant, control.

How does Google Search Console mode work?

Connect GSC, choose a property and metric (clicks or impressions), set the fetch date range, and paste variant/control URL lists. The tool fetches and aggregates daily totals automatically.

Can I run the analyzer with CSV only?

Yes. Upload a CSV with columns date, variant, and control. Values should be daily totals for clicks, sessions, or impressions.
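A minimal sanity check of the expected structure before uploading (a sketch assuming pandas; the example values are illustrative):

```python
import io
import pandas as pd

# Example of the expected CSV layout: one row per day, daily totals.
csv_text = """date,variant,control
2024-01-01,1200,1150
2024-01-02,1310,1240
2024-01-03,1180,1100
"""

df = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Validate the structure: required columns, ordered dates, no
# duplicate days, non-negative daily totals.
assert list(df.columns) == ["date", "variant", "control"]
assert df["date"].is_monotonic_increasing
assert not df["date"].duplicated().any()
assert (df[["variant", "control"]] >= 0).all().all()
```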

Can I use impressions or sessions instead of clicks?

Yes. Keep one metric type consistent across variant and control for all dates within the same run.

How much data is recommended?

At least 100 pre-intervention days is recommended. The tool can run with less, but confidence and robustness are usually weaker.

How should I set the start date?

Choose the first day after the SEO change was launched on the variant group. Everything before the start date is the pre-period; the start date onward is the post-period.

What is Impact on Clicks?

The average relative effect during the post-period. Positive values indicate uplift versus the modelled baseline; negative values indicate decline.
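For example, the average relative effect can be computed from daily observed and predicted values (a sketch with made-up numbers, not the tool's exact estimator):

```python
observed  = [120, 130, 125, 140]   # post-period variant values
predicted = [100, 110, 105, 115]   # modelled counterfactual baseline

# Relative effect per day, then averaged over the post-period.
daily_relative = [(o - p) / p for o, p in zip(observed, predicted)]
impact = sum(daily_relative) / len(daily_relative)
# impact is about +19.7%: observed clicks ran ~20% above baseline
```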

What is Confidence Level?

Confidence Level is calculated as 1 - p_value. Higher confidence means the observed impact is less likely to be random variation.
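For example, a p-value of 0.03 corresponds to 97% confidence:

```python
p_value = 0.03                 # posterior tail-area probability
confidence = 1 - p_value       # 0.97, i.e. 97% confidence
```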

What is P value?

P value is the posterior tail-area probability from the model. Lower values are stronger evidence against a no-effect outcome.

What is Daily Effect?

The average absolute day-level difference between observed variant performance and the modelled counterfactual.

What is Cumulative Effect?

The total absolute impact over the post-period, built by summing the estimated day-level effects.
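Both Daily Effect and Cumulative Effect can be sketched from the same day-level differences (illustrative numbers only):

```python
observed  = [120, 130, 125, 140]   # post-period variant values
predicted = [100, 110, 105, 115]   # modelled counterfactual baseline

# Absolute difference per day: 20, 20, 20, 25
daily_diffs = [o - p for o, p in zip(observed, predicted)]

daily_effect      = sum(daily_diffs) / len(daily_diffs)  # average per-day lift
cumulative_effect = sum(daily_diffs)                     # total over post-period
# daily_effect -> 21.25, cumulative_effect -> 85
```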

How do I read chart 1 (actual vs predicted)?

If actual and predicted lines separate after launch and remain separated, that suggests sustained impact. The confidence band shows forecast uncertainty.

How do I read chart 2 (cumulative effect)?

An upward cumulative curve indicates accumulating positive impact; a downward curve indicates accumulating loss over time.

What does Analysis quality summarize?

It combines effect size, confidence, interval behavior, fit quality, and robustness checks into one decision-oriented quality view.

How is Reliability Score calculated?

It is a composite score from three components: Statistical evidence, Precision and stability, and Robustness.

What is Statistical evidence?

This reflects the strength of the causal signal using confidence, whether the effect interval clearly excludes zero, and interval tightness.

What is Precision and stability?

This measures how tight and consistent estimates are, using pre-period fit quality and post-period uncertainty behavior.

What is Robustness?

This checks whether conclusions remain stable under placebo-date runs and sensitivity runs with different model assumptions.

What is 95% interval width?

This is the width of the relative-effect uncertainty interval. Narrower intervals generally indicate more precise estimates.

What is Pre-period coverage?

Pre-period coverage shows how often actual pre-period values fall inside the model's uncertainty bounds. Values around 95% are typically healthy.
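Coverage is just the fraction of pre-period points that land inside their bounds (a sketch with illustrative numbers):

```python
# Pre-period actuals and the model's 95% uncertainty bounds.
actual = [100, 105,  98, 110, 103]
lower  = [ 95, 100,  99, 104,  98]
upper  = [108, 112, 106, 115, 110]

inside = [lo <= a <= hi for a, lo, hi in zip(actual, lower, upper)]
coverage = sum(inside) / len(inside)
# The third point (98) falls below its lower bound -> coverage = 0.8
```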

What is in Advanced diagnostics?

Advanced diagnostics is the technical validation layer behind Reliability Score. It includes:

  Residual autocorr (lag1): leftover error pattern
  Placebo false-positive rate: how often fake launch dates still look significant
  Sensitivity spread: effect variation across model settings
  Sign consistency: whether uplift/decline direction stays stable

Use this section to confirm the result is not fragile.
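As an example of one such diagnostic, lag-1 residual autocorrelation can be computed like this (a sketch; the tool's exact implementation may differ):

```python
def lag1_autocorr(residuals):
    """Correlation between each residual and the previous day's
    residual. Values near 0 suggest the model captured the series'
    structure; large values mean leftover pattern in the errors."""
    n = len(residuals)
    mean = sum(residuals) / n
    num = sum((residuals[t] - mean) * (residuals[t - 1] - mean)
              for t in range(1, n))
    den = sum((r - mean) ** 2 for r in residuals)
    return num / den

# Alternating residuals show strong negative autocorrelation (-5/6),
# a sign of systematic leftover pattern rather than random noise.
score = lag1_autocorr([1, -1, 1, -1, 1, -1])
```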

What do GO / HOLD / NO-GO mean?

GO: strong positive and reliable signal. HOLD: mixed or fragile evidence. NO-GO: reliably negative impact.

Common reasons for unreliable output?

Short pre-period, unstable control series, major concurrent external events, missing dates, or inconsistent aggregation logic between variant and control.

Legacy version

Prefer the original Streamlit experience? Use the previous analyzer app.

Open legacy SEO A/B Test Analyzer