
Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets

Duarte, Belmiro P.M., Atkinson, Anthony C. and Oliveira, Nuno M.C. (2024) Using hierarchical information-theoretic criteria to optimize subsampling of extensive datasets. Chemometrics and Intelligent Laboratory Systems, 245. ISSN 0169-7439

Text: Published Version
Available under License Creative Commons Attribution Non-commercial.

Identification Number: 10.1016/j.chemolab.2024.105067

Abstract

This paper addresses the challenge of subsampling large datasets, aiming to generate a smaller dataset that retains a significant portion of the original information. To achieve this objective, we present a subsampling algorithm that integrates hierarchical data partitioning with a specialized tool tailored to identify the most informative observations within a dataset for a specified underlying linear model, not necessarily first-order, relating responses and inputs. The hierarchical data partitioning procedure systematically and incrementally aggregates information from smaller-sized samples into new samples. Simultaneously, our selection tool employs Semidefinite Programming for numerical optimization to maximize the information content of the chosen observations. We validate the effectiveness of our algorithm through extensive testing, using both benchmark and real-world datasets. The real-world dataset is related to the physicochemical characterization of white variants of Portuguese Vinho Verde. Our results are highly promising, demonstrating the algorithm's capability to efficiently identify and select the most informative observations while keeping computational requirements at a manageable level.
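
The two ingredients named in the abstract, hierarchical partitioning and selection of the most informative observations, can be illustrated with a minimal sketch. It is not the authors' implementation: the paper's selection tool relies on Semidefinite Programming, whereas this sketch swaps in a simple greedy D-optimality heuristic, and it assumes a first-order model with a single input, y = b0 + b1*x. All function names and parameters (`info_key`, `greedy_select`, `hierarchical_subsample`, `k`, `block_size`) are hypothetical.

```python
import random


def info_key(xs):
    """Return (determinant, trace) of the information matrix for f(x) = (1, x).

    The trace only breaks ties, e.g. for a single point, where the
    determinant is exactly zero.
    """
    n = len(xs)
    s1 = sum(xs)
    s2 = sum(x * x for x in xs)
    return (n * s2 - s1 * s1, n + s2)


def greedy_select(candidates, k):
    """Greedily pick up to k points, each maximizing the D-criterion.

    This is a heuristic stand-in for the SDP-based selection in the paper.
    """
    pool = list(candidates)
    chosen = []
    for _ in range(min(k, len(pool))):
        best = max(range(len(pool)), key=lambda i: info_key(chosen + [pool[i]]))
        chosen.append(pool.pop(best))
    return chosen


def hierarchical_subsample(data, k, block_size):
    """Select k points by repeatedly selecting within blocks and merging."""
    assert 2 <= k < block_size, "each level must shrink the candidate set"
    level = list(data)
    while len(level) > block_size:
        blocks = [level[i:i + block_size] for i in range(0, len(level), block_size)]
        # Keep the k most informative points of each block, then merge them.
        level = [x for block in blocks for x in greedy_select(block, k)]
    return greedy_select(level, k)


# Usage: inputs on a grid over [-1, 1], presented in shuffled order.
data = [i / 100 for i in range(-100, 101)]
random.Random(0).shuffle(data)
subsample = hierarchical_subsample(data, k=4, block_size=20)
```

For a straight-line model, D-optimal designs concentrate on the ends of the input range, so a selection of this kind is expected to retain the extreme inputs at every level of the hierarchy. Because each block is processed independently, the cost of a single selection depends on `block_size` rather than on the size of the full dataset.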

Item Type: Article
Additional Information: © 2024 The Author(s)
Divisions: Statistics
LSE
Subjects: H Social Sciences > HA Statistics
Q Science > QA Mathematics > QA76 Computer software
Date Deposited: 02 Feb 2024 12:24
Last Modified: 12 Dec 2024 04:02
URI: http://eprints.lse.ac.uk/id/eprint/121641
