The 3rd School of Statistical Thinking Seminar
- [Date and Time]
- Friday, September 12, 2025, from 13:00
No registration required; admission free
- [Venue]
- The Institute of Statistical Mathematics, Seminar Rooms D313/D314
- [Speaker]
- Yuan-chin Ivan Chang (Academia Sinica)
- [Title]
- Preserving Data Structure in Large-Scale Subsampling by PCA-Guided Quantile Sampling Method
- [Abstract]
- In this talk, we introduce Principal Component Analysis-guided Quantile Sampling (PCA-QS), a novel sampling framework designed to preserve both the statistical and geometric structure of large-scale datasets. Unlike conventional PCA, which reduces dimensionality at the cost of interpretability, PCA-QS retains the original feature space while using leading principal components solely to guide a quantile-based stratification scheme. This principled design ensures that sampling remains representative without distorting the underlying data semantics. We establish rigorous theoretical guarantees, deriving convergence rates for empirical quantiles, Kullback–Leibler divergence, and Wasserstein distance, thus quantifying the distributional fidelity of PCA-QS samples. Practical guidelines for selecting the number of principal components, quantile bins, and sampling rates are provided based on these results. Extensive empirical studies on both synthetic and real-world datasets demonstrate that PCA-QS consistently outperforms not only simple random sampling (SRS) but also recent state-of-the-art methods such as coreset and leverage score sampling, yielding better structure preservation and improved downstream model performance. Together, these contributions position PCA-QS as a scalable, interpretable, and theoretically grounded solution for efficient data summarization in modern machine learning workflows.
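The core idea described in the abstract — using leading principal-component scores only to guide a quantile-based stratification, then sampling rows in the original feature space — can be sketched as follows. This is a minimal illustrative sketch under stated assumptions, not the speaker's implementation: the function name `pca_qs_sample`, its parameters, and the per-bin proportional allocation are hypothetical choices for exposition.

```python
import numpy as np

def pca_qs_sample(X, n_components=1, n_bins=10, rate=0.1, seed=None):
    """Illustrative sketch of PCA-guided quantile sampling (PCA-QS).

    Leading PC scores serve ONLY as a stratification guide; the
    returned indices select rows of X in its original feature space.
    NOTE: hypothetical signature, not the authors' actual code.
    """
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)
    # Leading principal directions via SVD of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = Xc @ Vt[:n_components].T      # PC scores (guide only)
    guide = scores[:, 0]                   # stratify on the first PC here
    # Quantile bin edges give roughly equal-sized strata along the guide.
    edges = np.quantile(guide, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.clip(np.searchsorted(edges, guide, side="right") - 1,
                   0, n_bins - 1)
    keep = []
    for b in range(n_bins):
        idx = np.flatnonzero(bins == b)
        if idx.size == 0:
            continue
        # Proportional allocation within each stratum (at least one point).
        k = max(1, int(round(rate * idx.size)))
        keep.append(rng.choice(idx, size=k, replace=False))
    return np.sort(np.concatenate(keep))
```

Because each stratum contributes points in proportion to its size, the subsample tracks the empirical quantiles of the guiding component rather than drifting toward dense regions, which is the structure-preservation property the abstract contrasts with simple random sampling.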