After you run an analysis, each method shows a point estimate — one number that is the best single guess for the average causal effect on your outcome, in the outcome's units. It is not a correlation coefficient (those stay between −1 and +1).

Novice readers often focus on which method reports the largest number, and that can mislead. The strength of the evidence depends on how uncertain the estimate is, which is summarized by the 95% confidence interval (CI) and the p-value.

Point estimate vs confidence interval

The point estimate answers: if we had to pick one number, what would it be? The confidence interval answers: given this sample, what range of true effects is still plausible?

If the 95% CI is wide and includes zero, the data are compatible with no effect — even when the point estimate is +600. If the CI is narrow and excludes zero, a smaller estimate like +200 can be more convincing.

CI includes zero → a null effect (no change) remains plausible.
CI excludes zero → the sample distinguishes the effect from zero at the 5% level (the method's assumptions still have to hold).
p-value above 0.05 → same practical message as a 95% CI that includes zero: we cannot rule out no effect.
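
The reading rules above can be sketched as a small helper. This is only an illustration: the function name describe_interval and the interval bounds in the usage lines are hypothetical, not output from any particular tool.

```python
def describe_interval(estimate, ci_low, ci_high):
    """Return a plain-language reading of a point estimate and its 95% CI."""
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    if ci_low <= 0 <= ci_high:
        return f"{estimate:+g}: CI includes zero, a null effect remains plausible"
    return f"{estimate:+g}: CI excludes zero, the sample distinguishes the effect from zero"


# Hypothetical intervals: a large but noisy estimate and a smaller, tighter one.
print(describe_interval(600, -150, 1350))
print(describe_interval(200, 120, 280))
```

The larger estimate is the less convincing of the two here, because only the second interval excludes zero.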

When methods disagree

Different estimators use different formulas (matching, weighting, machine learning). Headline numbers often differ in size but should usually agree on direction.

Example: with only a handful of treated and control units, propensity score matching might show +637 with a wide CI [−497, 1772] and p = 0.27, which is compatible with no effect. A causal forest might show +209 with a CI that looks like [209, 209]. An interval that narrow is not stronger evidence. A width of zero usually means the model could not estimate its own uncertainty on so few units, so the interval should be treated as missing rather than precise.
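
As a rough consistency check on numbers like these, the standard error and p-value can be recovered from a reported 95% CI. The sketch below assumes a symmetric interval built from a normal approximation, which may not be how a given estimator constructs its interval. The function name p_value_from_ci is hypothetical.

```python
from statistics import NormalDist


def p_value_from_ci(estimate, ci_low, ci_high, level=0.95):
    """Recover (standard error, two-sided p-value) from a symmetric normal CI.

    Returns (0.0, None) for a zero-width interval, because no p-value
    can be computed when the reported uncertainty is zero.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    se = (ci_high - ci_low) / (2 * z_crit)
    if se == 0:
        return se, None
    z = estimate / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return se, p


# Matching example from the text: +637 with CI [-497, 1772].
se, p = p_value_from_ci(637, -497, 1772)
print(round(se), round(p, 2))  # prints: 579 0.27

# Causal forest example: a zero-width interval yields no usable p-value.
print(p_value_from_ci(209, 209, 209))
```

Under that assumption the matching numbers are internally consistent, while the [209, 209] interval is flagged as carrying no uncertainty information.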

Compare CIs, p-values, balance diagnostics, and sample size — not just the biggest point estimate.

Sample size matters

Small studies produce wide confidence intervals. A power analysis, often summarized as the minimum detectable effect (MDE), tells you whether your sample could reliably detect an effect of the size you care about.
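
To make the idea concrete, here is a minimal sketch of a minimum detectable effect calculation. It assumes a simple difference in means between two groups, a common outcome standard deviation, and a normal approximation. A given analysis tool may compute power differently, and the function name and example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(sd, n_treated, n_control, alpha=0.05, power=0.80):
    """Smallest true effect detectable with the given power (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the target power
    se = sd * sqrt(1 / n_treated + 1 / n_control)  # SE of a difference in means
    return (z_alpha + z_power) * se


# Hypothetical outcome SD of 1000, with 10 versus 500 units per group.
small = minimum_detectable_effect(1000, 10, 10)
large = minimum_detectable_effect(1000, 500, 500)
print(round(small), round(large))
```

With these inputs, the 10-per-group study can only reliably detect effects larger than the outcome's own standard deviation, while the 500-per-group study can detect effects a fraction of that size. Quadrupling both groups halves the MDE, since the standard error shrinks with the square root of the sample size.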

A statistically non-significant result does not show that there is no effect; the study may simply lack the power to detect it. A significant result on a tiny sample still needs careful scrutiny of assumptions.