Revisiting Filtered ANN Benchmarks: A Hardness-Controlled Benchmark Generator for Realistic Evaluation

VLDB, 2026

1Seoul National University   *Corresponding author
Figure: Overview of HCBGen, the hardness-controlled benchmark generator. HCBGen supports Load and Generate modes for base data and controls query-level hardness using either a target hardness profile or coarse bias modes (High, Low, Random).

Abstract

Filtered approximate nearest neighbor (FANN) search must satisfy both vector similarity and structured predicates, yet evaluations remain brittle because real hybrid workloads are rarely shareable and existing benchmarks rely on ad-hoc constructions. We propose α-Hardness, an execution-driven query-level hardness metric that models the conditional execution chain via the over-fetch factor and extends to strategy-conditioned settings. It aligns monotonically with empirical performance, unlike proxies such as selectivity or correlation.

We further introduce HCBGen, a hardness-controlled benchmark generator that uses α-Hardness as a control signal to synthesize workloads under coarse bias modes or to match a target hardness profile. Our experiments show that widely used benchmarks occupy a narrow, easy portion of the hardness spectrum, and that matching hardness distributions yields privacy-preserving proxy workloads that closely reproduce performance trends.

Contributions

  • α-Hardness — an execution-driven metric capturing query-level FANN difficulty via the over-fetch factor, closely aligned with empirical performance.
  • Revisiting benchmarks — we show widely used evaluations are biased toward easy queries, masking tail behavior.
  • HCBGen — a hardness-controlled generator for controlled workload construction, stress testing, and privacy-preserving approximation of real workloads.

Method

α-Hardness Metric

We model FANN execution as a post-filtering conditional chain: fetch top-ranked vector candidates, then scan them until $K$ filter-satisfying results are found. The over-fetch factor $\alpha(q;K)$ is the rank, in the similarity ordering, of the $K$-th valid candidate, and hardness is the product of the costs of the two stages:

(1) \[ H_{\alpha}(q \mid K) \;\triangleq\; H_{\text{fetch}}\big(q_v \mid \alpha(q;K)\big) \,\times\, H_{\text{scan}}\big(q_f \mid q_v, \alpha(q;K), K\big) \]

Scanning $m$ fetched candidates costs $H_{\text{scan}} \propto C_{\text{scan}} \cdot m$, with $C_{\text{scan}}$ the per-candidate scan cost. Substituting $m = \alpha(q;K)$ gives:

(2) \[ H_{\alpha}(q \mid K) \;\propto\; H_{\text{fetch}}\big(q_v \mid \alpha(q;K)\big) \cdot \alpha(q;K) \]
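The over-fetch factor that drives Eq. (2) can be measured exactly by brute force when the base data is small. The sketch below is only an illustration of the definition, not the paper's implementation: the function name `over_fetch_factor`, the Euclidean distance, and the toy data are all assumptions.

```python
import math

def over_fetch_factor(query, vectors, valid, k):
    """Return the 1-based rank of the k-th filter-satisfying vector when all
    vectors are ordered by distance to `query`, or None if fewer than k
    vectors satisfy the filter. Euclidean distance is an assumption here."""
    order = sorted(range(len(vectors)),
                   key=lambda i: math.dist(query, vectors[i]))
    found = 0
    for rank, i in enumerate(order, start=1):
        if valid[i]:
            found += 1
            if found == k:
                return rank
    return None

# Toy data: six points on a line; `valid` marks which ones pass the filter.
vectors = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (5.0, 0.0)]
valid = [False, True, False, False, True, True]

alpha = over_fetch_factor((0.0, 0.0), vectors, valid, k=2)
print(alpha)  # 5: the second valid point is the fifth-nearest overall
```

Here a post-filtering search must fetch five candidates to return two valid results, so the scan term in Eq. (2) grows by a factor of 2.5 relative to an unfiltered top-2 query.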

Since $\alpha$ is unknown a priori, we estimate it index-free from global selectivity $s$ and a local-density correction $\rho(q)$:

(3) \[ \hat{\alpha}(q;K) \;\triangleq\; \frac{1}{s} \cdot \frac{1}{\rho(q)} \cdot K \;=\; \frac{|V_0|}{|V_f|} \cdot \frac{d\big(q_v,\, K\text{-NN} \mid V_0\big)}{d\big(q_v,\, K\text{-NN} \mid V_f\big)} \cdot K \]

so both a globally selective filter (small $|V_f|$) and a locally sparse one raise $\hat{\alpha}$. Finally, the score is conditioned on the strategy through a monotone inversion, since vector-centric strategies get harder as $\alpha$ grows while filter-centric ones get easier:

(4) \[ H(q \mid \text{strategy}) \;=\; \begin{cases} H_{\alpha}(q), & \text{if vector-centric}\\[4pt] 1 / H_{\alpha}(q), & \text{if filter-centric} \end{cases} \]
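As a small worked example of Eqs. (3) and (4), the sketch below evaluates the first form of the estimate, $\hat{\alpha} = K / (s \cdot \rho)$, and then applies the strategy-conditioned inversion. It takes $s$ and $\rho(q)$ as given inputs rather than computing them from data; the function names and the strategy labels passed as strings are hypothetical.

```python
def alpha_hat(selectivity, rho, k):
    """First form of Eq. (3): (1 / s) * (1 / rho) * K."""
    if not 0.0 < selectivity <= 1.0:
        raise ValueError("selectivity must lie in (0, 1]")
    if rho <= 0.0 or k < 1:
        raise ValueError("rho must be positive and k at least 1")
    return k / (selectivity * rho)

def strategy_hardness(h_alpha, strategy):
    """Eq. (4): keep H_alpha for vector-centric strategies and invert it
    for filter-centric ones. The string labels are illustrative."""
    if h_alpha <= 0.0:
        raise ValueError("hardness must be positive")
    if strategy == "vector-centric":
        return h_alpha
    if strategy == "filter-centric":
        return 1.0 / h_alpha
    raise ValueError(f"unknown strategy: {strategy!r}")

# A filter matching 1% of the base data, with rho(q) = 0.5, for K = 10.
estimate = alpha_hat(selectivity=0.01, rho=0.5, k=10)   # roughly 2000
uniform = alpha_hat(selectivity=0.01, rho=1.0, k=10)    # roughly 1000
```

With $\rho(q) = 1$ the estimate reduces to the selectivity-only value $K/s$, so the correction term is what separates two queries whose filters have the same global selectivity.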

HCBGen Generator

HCBGen separates a Label Generator (proposes candidate queries; supports Load and Generate base-data modes) from a Hardness Estimator (classifies the strategy's pruning family, scores candidates, and enforces acceptance through a regeneration loop). Two control families share this loop:

  Coarse bias modes

High / Low / Random skew the workload toward hard or easy queries via an adaptive cutoff.
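One way to picture a coarse bias mode is a regeneration loop that keeps a candidate only when its hardness clears a cutoff that adapts to the scores seen so far. The sketch below assumes a quantile-based cutoff with a warm-up period; that rule, the parameter names, and the `propose` and `score` callables are illustrative assumptions, not HCBGen's actual acceptance policy.

```python
import bisect
import random

def generate_biased(propose, score, n, mode, quantile=0.8, warmup=50,
                    max_tries=100_000, seed=0):
    """Regeneration loop with a quantile-based adaptive cutoff (assumed rule).

    propose(rng) returns a candidate query and score(query) returns its
    hardness; both are hypothetical callables. mode is "high", "low", or
    "random". Returns a list of (query, hardness) pairs.
    """
    if mode not in ("high", "low", "random"):
        raise ValueError(f"unknown mode: {mode!r}")
    rng = random.Random(seed)
    seen = []       # sorted hardness scores of every proposal so far
    accepted = []
    for _ in range(max_tries):
        if len(accepted) == n:
            break
        query = propose(rng)
        h = score(query)
        bisect.insort(seen, h)
        if mode == "random":
            accepted.append((query, h))   # no hardness bias
            continue
        if len(seen) < warmup:
            continue                      # too few scores for a stable cutoff
        if mode == "high":
            cutoff = seen[int(quantile * (len(seen) - 1))]
            keep = h >= cutoff
        else:
            cutoff = seen[int((1.0 - quantile) * (len(seen) - 1))]
            keep = h <= cutoff
        if keep:
            accepted.append((query, h))
    return accepted

# Toy usage: a "query" is a number in [0, 1) and its hardness is the number.
high = generate_biased(lambda rng: rng.random(), lambda q: q, n=100, mode="high")
low = generate_biased(lambda rng: rng.random(), lambda q: q, n=100, mode="low")
```

Because the cutoff is recomputed from all proposals rather than fixed in advance, the same loop works without knowing the hardness range of a dataset beforehand.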

  Profile matching

Match-PDF reproduces a target hardness distribution by binning difficulty and filling target counts.
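The binning-and-filling idea behind Match-PDF can be sketched as a rejection loop: each candidate is scored, mapped to a hardness bin, and kept only if that bin still has unfilled quota. The code below is a minimal illustration under assumed names (`match_pdf`, `propose`, `score`) and an assumed fixed set of bin edges; it is not the generator's actual code.

```python
import bisect
import random

def match_pdf(propose, score, edges, target_counts, max_tries=100_000, seed=0):
    """Fill per-bin quotas by rejection sampling.

    edges: sorted bin boundaries (length B + 1); target_counts: length B.
    Returns the accepted (query, hardness) pairs and the unfilled quota.
    """
    if len(edges) != len(target_counts) + 1:
        raise ValueError("need exactly one more edge than bins")
    rng = random.Random(seed)
    remaining = list(target_counts)
    accepted = []
    for _ in range(max_tries):
        if not any(remaining):
            break                          # every bin has reached its target
        query = propose(rng)
        h = score(query)
        if not edges[0] <= h <= edges[-1]:
            continue                       # outside the profile's range
        # Bin i covers [edges[i], edges[i+1]); the last bin includes its right edge.
        b = min(bisect.bisect_right(edges, h) - 1, len(remaining) - 1)
        if remaining[b] > 0:
            remaining[b] -= 1
            accepted.append((query, h))
    return accepted, remaining

# Toy usage: uniform proposals reshaped into a profile skewed toward harder bins.
edges = [0.0, 0.25, 0.5, 0.75, 1.0]
targets = [5, 10, 20, 15]
workload, unfilled = match_pdf(lambda rng: rng.random(), lambda q: q, edges, targets)
```

Returning the unfilled quota makes it visible when the proposal distribution rarely reaches some bins, which is the case where a smarter proposal step would be needed.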

Results

1. Hardness Alignment (Figure 3)
Figure 3: Spearman rank correlation between hardness estimates and search performance across 25 datasets and six strategies.

α-Hardness aligns monotonically with empirical performance (ρ closer to −1) across 25 datasets and six strategies, while selectivity and correlation are unstable or strategy-inconsistent.

Finding: α-Hardness reaches ρ ≤ −0.7 with no alignment failures in the [−0.3, 0.3] region, whereas selectivity flips sign between vector- and filter-centric strategies and correlation is informative only when strong data–label correlation is explicitly enforced.
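The ρ reported here is Spearman's rank correlation, not the local-density term $\rho(q)$ from Eq. (3). For readers who want to reproduce this kind of alignment check, the sketch below computes it in plain Python as the Pearson correlation of average ranks; the hardness and throughput numbers are made up for illustration and are not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1        # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long samples with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        raise ValueError("rank correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5

# Hypothetical per-query hardness scores and measured throughput (QPS).
hardness = [1.0, 5.0, 20.0, 80.0, 300.0]
qps = [900.0, 700.0, 400.0, 150.0, 40.0]
rho = spearman(hardness, qps)
print(rho)  # -1.0: throughput falls monotonically as hardness rises
```

Since only ranks matter, the check is insensitive to the scale of the hardness score, which is why a monotone metric can be compared fairly across strategies.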

2. Robustness Across Workloads (Figure 6)
Figure 6: Recall-QPS trade-off curves for five strategies across synthetic, semi-real, and hardness-controlled workloads.

Strategy rankings and trade-off shapes vary substantially across workloads. Results reported in original papers (red) align with easier regions, and High workloads expose sharp degradation.

Finding: Under High workloads, recall drops below 0.2 for most indices, consistent with the sub-0.1% selectivity regime, revealing robustness gaps that the easy slices used in prior benchmarks hide.

3. Hardness as a Bridge (Figure 8)
Figure 8: Recall-QPS trade-offs on four semi-real datasets comparing original workloads with Match-PDF proxy workloads.

Match-PDF proxy workloads reproduce the performance trends of the original workloads — even without access to the base data — enabling privacy-preserving benchmarking.

Finding: Proxy workloads match the target hardness distribution (low KS and 1-Wasserstein distances) and track the original Recall-QPS curves closely on all four semi-real datasets, so a shareable hardness profile alone suffices to emulate an unseen workload.
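The two distances named above can be computed directly from the per-query hardness scores of the original and proxy workloads. The sketch below is a plain-Python illustration using empirical CDFs: the KS statistic is the largest gap between the two CDFs and the 1-Wasserstein distance is the area between them. The function names and sample values are hypothetical.

```python
import bisect

def ecdf(sorted_sample, x):
    """Empirical CDF of an already sorted sample, evaluated at x."""
    return bisect.bisect_right(sorted_sample, x) / len(sorted_sample)

def ks_and_w1(a, b):
    """Two-sample KS statistic and 1-Wasserstein distance.

    Both empirical CDFs are step functions that change only at sample
    values, so it is enough to evaluate them on the merged support.
    """
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(a), sorted(b)
    support = sorted(set(a) | set(b))
    ks, w1 = 0.0, 0.0
    for i, x in enumerate(support):
        gap = abs(ecdf(a, x) - ecdf(b, x))
        ks = max(ks, gap)
        if i + 1 < len(support):
            w1 += gap * (support[i + 1] - x)   # the gap is constant up to the next point
    return ks, w1

# Hypothetical hardness scores: the proxy is the original shifted by one unit.
original = [1.0, 2.0, 3.0, 4.0]
proxy = [2.0, 3.0, 4.0, 5.0]
ks, w1 = ks_and_w1(original, proxy)
print(ks, w1)  # 0.25 1.0
```

The two numbers are complementary: KS reports the worst-case mismatch at any single hardness level, while 1-Wasserstein reports how far, on average, hardness mass would have to move to make the distributions agree.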

For more detailed experiments and findings, please refer to the full paper.

BibTeX

@article{lim2026fann,
  author  = {Lim, Mintaek and Kim, Dogeun and Kim, Minwoo and Do, Jaeyoung},
  title   = {Revisiting Filtered ANN Benchmarks: A Hardness-Controlled Benchmark
             Generator for Realistic Evaluation},
  journal = {Proceedings of the VLDB Endowment (PVLDB)},
  volume  = {14},
  number  = {1},
  year    = {2026},
}