Applications
- [1] arXiv:2405.20352 [pdf, ps, html, other]
Title: Adapting Quantile Mapping to Bias Correct Solar Radiation Data
Comments: 28 pages, 15 figures
Subjects: Applications (stat.AP)
Bias correction is a common pre-processing step applied to climate model data before it is used for further analysis. This article introduces an efficient adaptation of a well-established bias-correction method, quantile mapping, for global horizontal irradiance (GHI) that ensures corrected data is physically plausible by incorporating measurements of clear-sky GHI. The proposed quantile mapping method is fit on reanalysis data to first bias correct for regional climate models (RCMs) and is tested on RCMs forced by general circulation models (GCMs) to understand existing biases directly from GCMs. Additionally, we adapt a functional analysis of variance methodology that analyzes sources of remaining biases after implementing the proposed quantile mapping method and considers biases by climate region. This analysis is applied to four sets of climate model output from NA-CORDEX and compared against data from the National Solar Radiation Database produced by the National Renewable Energy Lab.
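As a rough illustration of the general idea only, the sketch below applies plain empirical quantile mapping to a clear-sky index (GHI divided by clear-sky GHI) and then caps the corrected value at the clear-sky value. The abstract does not specify the adaptation, so the use of the clear-sky index, the cap, and all function names (`empirical_quantiles`, `quantile_map`, `correct_ghi`) are assumptions made here and should not be read as the authors' method.

```python
from bisect import bisect_right


def empirical_quantiles(sample, n_quantiles=101):
    """Return evenly spaced empirical quantiles (linear interpolation)."""
    xs = sorted(sample)
    last = len(xs) - 1
    quantiles = []
    for i in range(n_quantiles):
        pos = last * i / (n_quantiles - 1)  # fractional index into xs
        lo = int(pos)
        hi = min(lo + 1, last)
        quantiles.append(xs[lo] + (xs[hi] - xs[lo]) * (pos - lo))
    return quantiles


def quantile_map(value, model_q, obs_q):
    """Map a model value to the observed value at the same quantile."""
    if value <= model_q[0]:
        return obs_q[0]
    if value >= model_q[-1]:
        return obs_q[-1]
    # model_q[j - 1] <= value < model_q[j], so the span below is positive.
    j = bisect_right(model_q, value)
    frac = (value - model_q[j - 1]) / (model_q[j] - model_q[j - 1])
    return obs_q[j - 1] + frac * (obs_q[j] - obs_q[j - 1])


def correct_ghi(model_ghi, clearsky_ghi, model_q, obs_q):
    """Bias correct one GHI value via its clear-sky index (assumed scheme)."""
    if clearsky_ghi <= 0:  # night: no irradiance is physically possible
        return 0.0
    k_corrected = quantile_map(model_ghi / clearsky_ghi, model_q, obs_q)
    # Assumed plausibility constraint: 0 <= corrected GHI <= clear-sky GHI.
    return min(max(k_corrected * clearsky_ghi, 0.0), clearsky_ghi)
```

Here `model_q` and `obs_q` would be fitted once, with `empirical_quantiles`, on clear-sky indices from the model and from the reference data over a common period, and `correct_ghi` would then be applied value by value to new model output.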
- [2] arXiv:2405.20418 [pdf, ps, html, other]
Title: A Bayesian joint model of multiple nonlinear longitudinal and competing risks outcomes for dynamic prediction in multiple myeloma: joint estimation and corrected two-stage approaches
Authors: Danilo Alvares, Jessica K. Barrett, François Mercier, Spyros Roumpanis, Sean Yiu, Felipe Castro, Jochen Schulze, Yajing Zhu
Comments: 38 pages, 13 figures
Subjects: Applications (stat.AP); Methodology (stat.ME)
Predicting cancer-associated clinical events is challenging in oncology. In Multiple Myeloma (MM), a cancer of plasma cells, disease progression is determined by changes in biomarkers, such as serum concentration of the paraprotein secreted by plasma cells (M-protein). Therefore, the time-dependent behaviour of M-protein and the transition across lines of therapy (LoT) that may be a consequence of disease progression should be accounted for in statistical models to predict relevant clinical outcomes. Furthermore, it is important to understand the contribution of the patterns of longitudinal biomarkers, upon each LoT initiation, to time-to-death or time-to-next-LoT. Motivated by these challenges, we propose a Bayesian joint model for trajectories of multiple longitudinal biomarkers, such as M-protein, and the competing risks of death and transition to next LoT. Additionally, we explore two estimation approaches for our joint model: simultaneous estimation of all parameters (joint estimation) and sequential estimation of parameters using a corrected two-stage strategy aiming to reduce computational time. Our proposed model and estimation methods are applied to a retrospective cohort study from a real-world database of patients diagnosed with MM in the US from January 2015 to February 2022. We split the data into training and test sets in order to validate the joint model using both estimation approaches and make dynamic predictions of times until clinical events of interest, informed by longitudinally measured biomarkers and baseline variables available up to the time of prediction.
- [3] arXiv:2405.20992 [pdf, ps, html, other]
Title: A Novel Two-stage Deming Regression Framework with Applications to Association Analysis between Clinical Risks
Subjects: Applications (stat.AP)
In healthcare, clinical risks are crucial for treatment decisions, yet the analysis of their associations is often overlooked. This gap is particularly significant when balancing risks that are weighed against each other, as in the case of atrial fibrillation (AF) patients facing stroke and bleeding risks with anticoagulant medication. While traditional regression models are ill-suited for this task due to standard errors in risk estimation, a novel two-stage Deming regression framework is proposed to address this issue, offering a more accurate tool for analyzing associations between variables observed with errors of known or estimated variances. The first stage is to obtain the variable values with variances of errors either by estimation or observation, followed by the second stage that fits a Deming regression model potentially subject to a transformation. The second stage accounts for the uncertainties associated with both independent and response variables, including known or estimated variances and additional unknown variances from the model. The complexity arising from different scenarios of uncertainty is handled by existing and advanced variations of Deming regression models. An important practical application is to support personalized treatment recommendations based on clinical risk associations that were identified by the proposed framework. The model's effectiveness is demonstrated by applying it to a real-world dataset of AF-diagnosed patients to explore the relationship between stroke and bleeding risks, providing crucial guidance for making informed decisions regarding anticoagulant medication. Furthermore, the model's versatility in addressing data containing multiple sources of uncertainty such as privacy-protected data suggests promising avenues for future research in regression analysis.
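For orientation, the classical Deming regression that the second stage builds on has a closed-form solution when the ratio of the two error variances is known. The sketch below implements only that textbook estimator; it is not the two-stage framework or the advanced variations described in the abstract, and the function name `deming_fit` and its `delta` parameter (variance of the errors in y divided by variance of the errors in x) are naming choices made here.

```python
from math import sqrt


def deming_fit(x, y, delta=1.0):
    """Classical Deming regression with a known error-variance ratio.

    delta = var(errors in y) / var(errors in x); delta = 1 gives
    orthogonal regression. Returns (slope, intercept).
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample variances and covariance.
    s_xx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    s_yy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    if s_xy == 0:
        raise ValueError("slope is undefined when the sample covariance is zero")
    diff = s_yy - delta * s_xx
    slope = (diff + sqrt(diff ** 2 + 4 * delta * s_xy ** 2)) / (2 * s_xy)
    intercept = mean_y - slope * mean_x
    return slope, intercept
```

Unlike ordinary least squares, which attributes all error to the response, this estimator attributes error to both variables in the proportion given by `delta`, which is the property that matters when both clinical risks are themselves estimated.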
- [4] arXiv:2405.21037 [pdf, ps, html, other]
Title: Introducing sgboost: A Practical Guide and Implementation of sparse-group boosting in R
Subjects: Applications (stat.AP); Computation (stat.CO); Machine Learning (stat.ML)
This paper introduces the sgboost package in R, which implements sparse-group boosting for modeling high-dimensional data with natural groupings in covariates. Sparse-group boosting offers a flexible approach for both group and individual variable selection, reducing overfitting and enhancing model interpretability. The package uses regularization techniques based on the degrees of freedom of individual and group base-learners, and is designed to be used in conjunction with the mboost package. Through comparisons with existing methods and demonstration of its unique functionalities, this paper provides a practical guide on utilizing sparse-group boosting in R, accompanied by code examples to facilitate its application in various research domains. Overall, this paper serves as a valuable resource for researchers and practitioners seeking to use sparse-group boosting for efficient and interpretable high-dimensional data analysis.
New submissions for Monday, 3 June 2024 (showing 4 of 4 entries)
- [5] arXiv:2405.20415 (cross-list from stat.ME) [pdf, ps, html, other]
Title: Differentially Private Boxplots
Subjects: Methodology (stat.ME); Applications (stat.AP); Other Statistics (stat.OT)
Despite the potential of differentially private data visualization to harmonize data analysis and privacy, research in this area remains relatively underdeveloped. Boxplots are a widely popular visualization used for summarizing a dataset and for comparison of multiple datasets. Consequently, we introduce a differentially private boxplot. We evaluate its effectiveness for displaying location, scale, skewness and tails of a given empirical distribution. In our theoretical exposition, we show that the location and scale of the boxplot are estimated with optimal sample complexity, and the skewness and tails are estimated consistently. In simulations, we show that this boxplot performs similarly to a non-private boxplot, and it outperforms a boxplot naively constructed from existing differentially private quantile algorithms. Additionally, we conduct a real data analysis of Airbnb listings, which shows that comparable analysis can be achieved through differentially private boxplot visualization.
- [6] arXiv:2405.20957 (cross-list from stat.ME) [pdf, ps, html, other]
Title: Data Fusion for Heterogeneous Treatment Effect Estimation with Multi-Task Gaussian Processes
Subjects: Methodology (stat.ME); Applications (stat.AP)
Bridging the gap between internal and external validity is crucial for heterogeneous treatment effect estimation. Randomised controlled trials (RCTs), favoured for their internal validity due to randomisation, often encounter challenges in generalising findings due to strict eligibility criteria. Observational studies, on the other hand, provide external validity advantages through larger and more representative samples but suffer from compromised internal validity due to unmeasured confounding. Motivated by these complementary characteristics, we propose a novel Bayesian nonparametric approach leveraging multi-task Gaussian processes to integrate data from both RCTs and observational studies. In particular, we introduce a parameter which controls the degree of borrowing between the datasets and prevents the observational dataset from dominating the estimation. The value of the parameter can be either user-set or chosen through a data-adaptive procedure. Our approach outperforms other methods in point predictions across the covariate support of the observational study, and furthermore provides a calibrated measure of uncertainty for the estimated treatment effects, which is crucial when extrapolating. We demonstrate the robust performance of our approach in diverse scenarios through multiple simulation studies and a real-world education randomised trial.
Cross submissions for Monday, 3 June 2024 (showing 2 of 2 entries)
- [7] arXiv:2104.11702 (replaced) [pdf, ps, html, other]
Title: Correlated Dynamics in Marketing Sensitivities
Subjects: Applications (stat.AP); Econometrics (econ.EM); Machine Learning (stat.ML)
Understanding individual customers' sensitivities to prices, promotions, brands, and other marketing mix elements is fundamental to a wide swath of marketing problems. An important but understudied aspect of this problem is the dynamic nature of these sensitivities, which change over time and vary across individuals. Prior work has developed methods for capturing such dynamic heterogeneity within product categories, but neglected the possibility of correlated dynamics across categories. In this work, we introduce a framework to capture such correlated dynamics using a hierarchical dynamic factor model, where individual preference parameters are influenced by common cross-category dynamic latent factors, estimated through Bayesian nonparametric Gaussian processes. We apply our model to grocery purchase data, and find that a surprising degree of dynamic heterogeneity can be accounted for by only a few global trends. We also characterize the patterns in how consumers' sensitivities evolve across categories. Managerially, the proposed framework not only enhances predictive accuracy by leveraging cross-category data, but also enables more precise estimation of quantities of interest, like price elasticity.
- [8] arXiv:2404.04455 (replaced) [pdf, ps, html, other]
Title: Tomographic reconstruction of a disease transmission landscape via GPS recorded random paths
Authors: Jairo Diaz-Rodriguez, Juan Pablo Gomez, Jeremy P. Orange, Nathan D. Burkett-Cadena, Samantha M. Wisely, Jason K. Blackburn, Sylvain Sardy
Subjects: Applications (stat.AP)
Identifying areas in a landscape where individuals have higher probability of becoming infected with a pathogen is a crucial step towards disease management. We perform a novel epidemiological tomography for the estimation of landscape propensity to disease infection, using GPS animal tracks in a manner analogous to tomographic techniques in Positron Emission Tomography. Our study data consists of individual tracks of white-tailed deer (Odocoileus virginianus) and three exotic Cervidae species moving freely in a high-fenced game preserve over given time periods. A serological test was performed on each individual to measure antibody concentration of epizootic hemorrhagic disease viruses (EHDV) at the beginning and at the end of each tracking period. EHDV is a vector-borne viral disease indirectly transmitted between ruminant hosts by biting midges. We model the data as a binomial linear inverse problem, where spatial coherence is enforced with a total variation regularization. The smoothness of the reconstructed propensity map is selected by the quantile universal threshold, which can also test the null hypothesis that the propensity map is spatially constant. We apply our method to simulated and real data, showing good statistical properties during simulations and consistent results and interpretations compared to intensive field estimations.
- [9] arXiv:2303.01887 (replaced) [pdf, ps, html, other]
Title: Fast Forecasting of Unstable Data Streams for On-Demand Service Platforms
Subjects: Econometrics (econ.EM); Applications (stat.AP)
On-demand service platforms face a challenging problem of forecasting a large collection of high-frequency regional demand data streams that exhibit instabilities. This paper develops a novel forecast framework that is fast and scalable, and automatically assesses changing environments without human intervention. We empirically test our framework on a large-scale demand data set from a leading on-demand delivery platform in Europe, and find strong performance gains from using our framework against several industry benchmarks, across all geographical regions, loss functions, and both pre- and post-Covid periods. We translate forecast gains to economic impacts for this on-demand service platform by computing financial gains and reductions in computing costs.
- [10] arXiv:2308.14143 (replaced) [pdf, ps, html, other]
Title: Ensemble-localized Kernel Density Estimation with Applications to the Ensemble Gaussian Mixture Filter
Subjects: Optimization and Control (math.OC); Numerical Analysis (math.NA); Applications (stat.AP)
The ensemble Gaussian mixture filter (EnGMF) is a non-linear filter suited to data assimilation of highly non-Gaussian and non-linear models that has practical utility in the case of a small number of samples, and theoretical convergence to full Bayesian inference in the ensemble limit. We aim to increase the utility of the EnGMF by introducing an ensemble-local notion of covariance into the kernel density estimation (KDE) step for the prior distribution. We prove that in the Gaussian case, our new ensemble-localized KDE technique is exactly the same as more traditional KDE techniques. We also show an example of a non-Gaussian distribution that can fail to be approximated by canonical KDE methods, but can be approximated well by our new KDE technique. We showcase our new KDE technique on a simple bivariate problem, showing that it has nice qualitative and quantitative properties, and significantly improves the estimate of the prior and posterior distributions for all ensemble sizes tested. We additionally show the utility of the proposed methodology for sequential filtering for the Lorenz '63 equations, achieving a significant reduction in error, and less conservative behavior in the uncertainty estimate with respect to traditional techniques.
- [11] arXiv:2402.09033 (replaced) [pdf, ps, html, other]
Title: Cross-Temporal Forecast Reconciliation at Digital Platforms with Machine Learning
Subjects: Econometrics (econ.EM); Applications (stat.AP); Methodology (stat.ME); Machine Learning (stat.ML)
Platform businesses operate on a digital core and their decision making requires high-dimensional accurate forecast streams at different levels of cross-sectional (e.g., geographical regions) and temporal aggregation (e.g., minutes to days). It also necessitates coherent forecasts across all levels of the hierarchy to ensure aligned decision making across different planning units such as pricing, product, controlling and strategy. Given that platform data streams feature complex characteristics and interdependencies, we introduce a non-linear hierarchical forecast reconciliation method that produces cross-temporal reconciled forecasts in a direct and automated way through the use of popular machine learning methods. The method is sufficiently fast to allow forecast-based high-frequency decision making that platforms require. We empirically test our framework on unique, large-scale streaming datasets from a leading on-demand delivery platform in Europe and a bicycle sharing system in New York City.