Statistics
- [1] arXiv:2405.10329 [pdf, ps, other]
-
Title: Causal inference approach to appraise long-term effects of maintenance policy on functional performance of asphalt pavementsSubjects: Applications (stat.AP); Artificial Intelligence (cs.AI)
Asphalt pavements as the most prevalent transportation infrastructure, are prone to serious traffic safety problems due to functional or structural damage caused by stresses or strains imposed through repeated traffic loads and continuous climatic cycles. The good quality or high serviceability of infrastructure networks is vital to the urbanization and industrial development of nations. In order to maintain good functional pavement performance and extend the service life of asphalt pavements, the long-term performance of pavements under maintenance policies needs to be evaluated and favorable options selected based on the condition of the pavement. A major challenge in evaluating maintenance policies is to produce valid treatments for the outcome assessment under the control of uncertainty of vehicle loads and the disturbance of freeze-thaw cycles in the climatic environment. In this study, a novel causal inference approach combining a classical causal structural model and a potential outcome model framework is proposed to appraise the long-term effects of four preventive maintenance treatments for longitudinal cracking over a 5-year period of upkeep. Three fundamental issues were brought to our attention: 1) detection of causal relationships prior to variables under environmental loading (identification of causal structure); 2) obtaining direct causal effects of treatment on outcomes excluding covariates (identification of causal effects); and 3) sensitivity analysis of causal relationships. The results show that the method can accurately evaluate the effect of preventive maintenance treatments and assess the maintenance time to cater well for the functional performance of different preventive maintenance approaches. This framework could help policymakers to develop appropriate maintenance strategies for pavements.
- [2] arXiv:2405.10371 [pdf, ps, html, other]
-
Title: Causal Discovery in Multivariate Extremes with a Hydrological Analysis of Swiss River DischargesSubjects: Methodology (stat.ME); Applications (stat.AP)
Causal asymmetry is based on the principle that an event is a cause only if its absence would not have been a cause. From there, uncovering causal effects becomes a matter of comparing a well-defined score in both directions. Motivated by studying causal effects at extreme levels of a multivariate random vector, we propose to construct a model-agnostic causal score relying solely on the assumption of the existence of a max-domain of attraction. Based on a representation of a Generalized Pareto random vector, we construct the causal score as the Wasserstein distance between the margins and a well-specified random variable. The proposed methodology is illustrated on a hydrologically simulated dataset of different characteristics of catchments in Switzerland: discharge, precipitation, and snowmelt.
- [3] arXiv:2405.10399 [pdf, ps, html, other]
-
Title: A note on continuous-time online learningSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Numerical Analysis (math.NA); Optimization and Control (math.OC)
In online learning, the data is provided in a sequential order, and the goal of the learner is to make online decisions to minimize overall regrets. This note is concerned with continuous-time models and algorithms for several online learning problems: online linear optimization, adversarial bandit, and adversarial linear bandit. For each problem, we extend the discrete-time algorithm to the continuous-time setting and provide a concise proof of the optimal regret bound.
- [4] arXiv:2405.10412 [pdf, ps, html, other]
-
Title: Property testing in graphical models: testing small separation numbersSubjects: Statistics Theory (math.ST)
In many statistical applications, the dimension is too large to handle for standard high-dimensional machine learning procedures. This is particularly true for graphical models, where the interpretation of a large graph is difficult and learning its structure is often computationally impossible either because the underlying graph is not sufficiently sparse or the number of vertices is too large. To address this issue, we develop a procedure to test a property of a graph underlying a graphical model that requires only a subquadratic number of correlation queries (i.e., we require that the algorithm only can access a tiny fraction of the covariance matrix). This provides a conceptually simple test to determine whether the underlying graph is a tree or, more generally, if it has a small separation number, a quantity closely related to the treewidth of the graph. The proposed method is a divide-and-conquer algorithm that can be applied to quite general graphical models.
- [5] arXiv:2405.10453 [pdf, ps, html, other]
-
Title: Expected Points Above Average: A Novel NBA Player Metric Based on Bayesian Hierarchical ModelingSubjects: Other Statistics (stat.OT)
Team and player evaluation in professional sport is extremely important given the financial implications of success/failure. It is especially critical to identify and retain elite shooters in the National Basketball Association (NBA), one of the premier basketball leagues worldwide because the ultimate goal of the game is to score more points than one's opponent. To this end we propose two novel basketball metrics: "expected points" for team-based comparisons and "expected points above average (EPAA)" as a player-evaluation tool. Both metrics leverage posterior samples from Bayesian hierarchical modeling framework to cluster teams and players based on their shooting propensities and abilities. We illustrate the concepts for the top 100 shot takers over the last decade and offer our metric as an additional metric for evaluating players.
- [6] arXiv:2405.10458 [pdf, ps, html, other]
-
Title: Decision theory via model-free generalized fiducial inferenceSubjects: Statistics Theory (math.ST)
Building on the recent development of the model-free generalized fiducial (MFGF) paradigm (Williams, 2023) for predictive inference with finite-sample frequentist validity guarantees, in this paper, we develop an MFGF-based approach to decision theory. Beyond the utility of the new tools we contribute to the field of decision theory, our work establishes a formal connection between decision theories from the perspectives of fiducial inference, conformal prediction, and imprecise probability theory. In our paper, we establish pointwise and uniform consistency of an {\em MFGF upper risk function} as an approximation to the true risk function via the derivation of nonasymptotic concentration bounds, and our work serves as the foundation for future investigations of the properties of the MFGF upper risk from the perspective of new decision-theoretic, finite-sample validity criterion, as in Martin (2021).
- [7] arXiv:2405.10461 [pdf, ps, html, other]
-
Title: Prediction in Measurement Error ModelsSubjects: Methodology (stat.ME)
We study the well known difficult problem of prediction in measurement error models. By targeting directly at the prediction interval instead of the point prediction, we construct a prediction interval by providing estimators of both the center and the length of the interval which achieves a pre-determined prediction level. The constructing procedure requires a working model for the distribution of the variable prone to error. If the working model is correct, the prediction interval estimator obtains the smallest variability in terms of assessing the true center and length. If the working model is incorrect, the prediction interval estimation is still consistent. We further study how the length of the prediction interval depends on the choice of the true prediction interval center and provide guidance on obtaining minimal prediction interval length. Numerical experiments are conducted to illustrate the performance and we apply our method to predict concentration of Abeta1-12 in cerebrospinal fluid in an Alzheimer's disease data.
- [8] arXiv:2405.10490 [pdf, ps, other]
-
Title: Neural Optimization with Adaptive Heuristics for Intelligent Marketing SystemChangshuai Wei, Benjamin Zelditch, Joyce Chen, Andre Assuncao Silva T Ribeiro, Jingyi Kenneth Tay, Borja Ocejo Elizondo, Keerthi Selvaraj, Aman Gupta, Licurgo Benemann De AlmeidaComments: KDD 2024Subjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR); Machine Learning (cs.LG); Optimization and Control (math.OC)
Computational marketing has become increasingly important in today's digital world, facing challenges such as massive heterogeneous data, multi-channel customer journeys, and limited marketing budgets. In this paper, we propose a general framework for marketing AI systems, the Neural Optimization with Adaptive Heuristics (NOAH) framework. NOAH is the first general framework for marketing optimization that considers both to-business (2B) and to-consumer (2C) products, as well as both owned and paid channels. We describe key modules of the NOAH framework, including prediction, optimization, and adaptive heuristics, providing examples for bidding and content optimization. We then detail the successful application of NOAH to LinkedIn's email marketing system, showcasing significant wins over the legacy ranking system. Additionally, we share details and insights that are broadly useful, particularly on: (i) addressing delayed feedback with lifetime value, (ii) performing large-scale linear programming with randomization, (iii) improving retrieval with audience expansion, (iv) reducing signal dilution in targeting tests, and (v) handling zero-inflated heavy-tail metrics in statistical testing.
- [9] arXiv:2405.10527 [pdf, ps, html, other]
-
Title: Hawkes Models And Their ApplicationsSubjects: Methodology (stat.ME); Probability (math.PR); Applications (stat.AP)
The Hawkes process is a model for counting the number of arrivals to a system which exhibits the self-exciting property - that one arrival creates a heightened chance of further arrivals in the near future. The model, and its generalizations, have been applied in a plethora of disparate domains, though two particularly developed applications are in seismology and in finance. As the original model is elegantly simple, generalizations have been proposed which: track marks for each arrival, are multivariate, have a spatial component, are driven by renewal processes, treat time as discrete, and so on. This paper creates a cohesive review of the traditional Hawkes model and the modern generalizations, providing details on their construction, simulation algorithms, and giving key references to the appropriate literature for a detailed treatment.
- [10] arXiv:2405.10552 [pdf, ps, html, other]
-
Title: Data Science Principles for Interpretable and Explainable AISubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Society's capacity for algorithmic problem-solving has never been greater. Artificial Intelligence is now applied across more domains than ever, a consequence of powerful abstractions, abundant data, and accessible software. As capabilities have expanded, so have risks, with models often deployed without fully understanding their potential impacts. Interpretable and interactive machine learning aims to make complex models more transparent and controllable, enhancing user agency. This review synthesizes key principles from the growing literature in this field.
We first introduce precise vocabulary for discussing interpretability, like the distinction between glass box and explainable algorithms. We then explore connections to classical statistical and design principles, like parsimony and the gulfs of interaction. Basic explainability techniques -- including learned embeddings, integrated gradients, and concept bottlenecks -- are illustrated with a simple case study. We also review criteria for objectively evaluating interpretability approaches. Throughout, we underscore the importance of considering audience goals when designing interactive algorithmic systems. Finally, we outline open challenges and discuss the potential role of data science in addressing them. Code to reproduce all examples can be found at this https URL. - [11] arXiv:2405.10582 [pdf, ps, other]
-
Title: General oracle inequalities for a penalized log-likelihood criterion based on non-stationary dataJulien Aubert (UniCA, LJAD, CNRS), Luc Lehéricy (LJAD, UniCA, CNRS), Patricia Reynaud-Bouret (LJAD, UniCA, CNRS)Subjects: Statistics Theory (math.ST)
We prove oracle inequalities for a penalized log-likelihood criterion that hold even if the data are not independent and not stationary, based on a martingale approach. The assumptions are checked for various contexts: density estimation with independent and identically distributed (i.i.d) data, hidden Markov models, spiking neural networks, adversarial bandits. In each case, we compare our results to the literature, showing that, although we lose some logarithmic factors in the most classical case (i.i.d.), these results are comparable or more general than the existing results in the more dependent cases.
- [12] arXiv:2405.10588 [pdf, ps, other]
-
Title: Decompounding with unknown noise through several independents channelsGuillaume Garnier (LJLL, MERGE)Subjects: Statistics Theory (math.ST)
In this article, we consider two different statistical models. First, we focus on the estimation of the jump intensity of a compound Poisson process in the presence of unknown noise. This problem combines both the deconvolution problem and the decompounding problem. More specifically, we observe several independent compound Poisson processes but we assume that all these observations are noisy due to measurement noise. We construct an Fourier estimator of the jump density and we study its mean integrated squared error. Then, we propose an adaptive method to correctly select the cutoff of the estimator and we illustrate the efficiency of the method with numerical results. Secondly, we introduce in this paper the multiplicative decompounding problem. We study this problem with Mellin density estimators. We develop an adaptive procedure to select the optimal cutoff parameter.
- [13] arXiv:2405.10712 [pdf, ps, html, other]
-
Title: Comparative evaluation of earthquake forecasting models: An application to ItalySubjects: Applications (stat.AP)
Testing earthquake forecasts is essential to obtain scientific information on forecasting models and sufficient credibility for societal usage. We aim at enhancing the testing phase proposed by the Collaboratory for the Study of Earthquake Predictability (CSEP, Schorlemmer et al., 2018) with new statistical methods supported by mathematical theory. To demonstrate their applicability, we evaluate three short-term forecasting models that were submitted to the CSEP Italy experiment, and two ensemble models thereof. The models produce weekly overlapping forecasts for the expected number of M4+ earthquakes in a collection of grid cells. We compare the models' forecasts using consistent scoring functions for means or expectations, which are widely used and theoretically principled tools for forecast evaluation. We further discuss and demonstrate their connection to CSEP-style earthquake likelihood model testing. Then, using tools from isotonic regression, we investigate forecast reliability and apply score decompositions in terms of calibration and discrimination. Our results show where and how models outperform their competitors and reveal a substantial lack of calibration for various models. The proposed methods also apply to full-distribution (e.g., catalog-based) forecasts, without requiring Poisson distributions or making any other type of parametric assumption.
- [14] arXiv:2405.10719 [pdf, ps, html, other]
-
Title: $\ell_1$-Regularized Generalized Least SquaresComments: 13 pages, 6 figuresSubjects: Methodology (stat.ME); Statistics Theory (math.ST); Machine Learning (stat.ML)
In this paper we propose an $\ell_1$-regularized GLS estimator for high-dimensional regressions with potentially autocorrelated errors. We establish non-asymptotic oracle inequalities for estimation accuracy in a framework that allows for highly persistent autoregressive errors. In practice, the Whitening matrix required to implement the GLS is unkown, we present a feasible estimator for this matrix, derive consistency results and ultimately show how our proposed feasible GLS can recover closely the optimal performance (as if the errors were a white noise) of the LASSO. A simulation study verifies the performance of the proposed method, demonstrating that the penalized (feasible) GLS-LASSO estimator performs on par with the LASSO in the case of white noise errors, whilst outperforming it in terms of sign-recovery and estimation error when the errors exhibit significant correlation.
- [15] arXiv:2405.10742 [pdf, ps, html, other]
-
Title: Efficient Sampling in Disease Surveillance through Subpopulations: Sampling Canaries in the Coal MineComments: 15 pages, 1 figureSubjects: Methodology (stat.ME); Applications (stat.AP)
We consider disease outbreak detection settings where the population under study consists of various subpopulations available for stratified surveillance. These subpopulations can for example be based on age cohorts, but may also correspond to other subgroups of the population under study such as international travellers. Rather than sampling uniformly over the entire population, one may elevate the effectiveness of the detection methodology by optimally choosing a subpopulation for sampling. We show (under some assumptions) the relative sampling efficiency between two subpopulations is inversely proportional to the ratio of their respective baseline disease risks. This leads to a considerable potential increase in sampling efficiency when sampling from the subpopulation with higher baseline disease risk, if the two subpopulation baseline risks differ strongly. Our mathematical results require a careful treatment of the power curves of exact binomial tests as a function of their sample size, which are erratic and non-monotonic due to the discreteness of the underlying distribution. Subpopulations with comparatively high baseline disease risk are typically in greater contact with health professionals, and thus when sampled for surveillance purposes this is typically motivated merely through a convenience argument. With this study, we aim to elevate the status of such "convenience surveillance" to optimal subpopulation surveillance.
- [16] arXiv:2405.10769 [pdf, ps, other]
-
Title: Efficient estimation of target population treatment effect from multiple source trials under effect-measure transportabilitySubjects: Methodology (stat.ME)
When the marginal causal effect comparing the same treatment pair is available from multiple trials, we wish to transport all results to make inference on the target population effect. To account for the differences between populations, statistical analysis is often performed controlling for relevant variables. However, when transportability assumptions are placed on conditional causal effects, rather than the distribution of potential outcomes, we need to carefully choose these effect measures. In particular, we present identifiability results in two cases: target population average treatment effect for a continuous outcome and causal mean ratio for a positive outcome. We characterize the semiparametric efficiency bounds of the causal effects under the respective transportability assumptions and propose estimators that are doubly robust against model misspecifications. We highlight an important discussion on the tension between the non-collapsibility of conditional effects and the variational independence induced by transportability in the case of multiple source trials.
- [17] arXiv:2405.10773 [pdf, ps, other]
-
Title: Proximal indirect comparisonSubjects: Methodology (stat.ME)
We consider the problem of indirect comparison, where a treatment arm of interest is absent by design in the target randomized control trial (RCT) but available in a source RCT. The identifiability of the target population average treatment effect often relies on conditional transportability assumptions. However, it is a common concern whether all relevant effect modifiers are measured and controlled for. We highlight a new proximal identification result in the presence of shifted, unobserved effect modifiers based on proxies: an adjustment proxy in both RCTs and an additional reweighting proxy in the source RCT. We propose an estimator which is doubly-robust against misspecifications of the so-called bridge functions and asymptotically normal under mild consistency of the nuisance models. An alternative estimator is presented to accommodate missing outcomes in the source RCT, which we then apply to conduct a proximal indirect comparison analysis using two weight management trials.
- [18] arXiv:2405.10795 [pdf, ps, html, other]
-
Title: Non trivial optimal sampling rate for estimating a Lipschitz-continuous function in presence of mean-reverting Ornstein-Uhlenbeck noiseComments: 14 pages, 5 figuresSubjects: Statistics Theory (math.ST); Probability (math.PR); Methodology (stat.ME)
We examine a mean-reverting Ornstein-Uhlenbeck process that perturbs an unknown Lipschitz-continuous drift and aim to estimate the drift's value at a predetermined time horizon by sampling the path of the process. Due to the time varying nature of the drift we propose an estimation procedure that involves an online, time-varying optimization scheme implemented using a stochastic gradient ascent algorithm to maximize the log-likelihood of our observations. The objective of the paper is to investigate the optimal sample size/rate for achieving the minimum mean square distance between our estimator and the true value of the drift. In this setting we uncover a trade-off between the correlation of the observations, which increases with the sample size, and the dynamic nature of the unknown drift, which is weakened by increasing the frequency of observation. The mean square error is shown to be non monotonic in the sample size, attaining a global minimum whose precise description depends on the parameters that govern the model. In the static case, i.e. when the unknown drift is constant, our method outperforms the arithmetic mean of the observations in highly correlated regimes, despite the latter being a natural candidate estimator. We then compare our online estimator with the global maximum likelihood estimator.
- [19] arXiv:2405.10817 [pdf, ps, html, other]
-
Title: Restless Linear BanditsSubjects: Machine Learning (stat.ML); Information Theory (cs.IT); Machine Learning (cs.LG)
A more general formulation of the linear bandit problem is considered to allow for dependencies over time. Specifically, it is assumed that there exists an unknown $\mathbb{R}^d$-valued stationary $\varphi$-mixing sequence of parameters $(\theta_t,~t \in \mathbb{N})$ which gives rise to pay-offs. This instance of the problem can be viewed as a generalization of both the classical linear bandits with iid noise, and the finite-armed restless bandits. In light of the well-known computational hardness of optimal policies for restless bandits, an approximation is proposed whose error is shown to be controlled by the $\varphi$-dependence between consecutive $\theta_t$. An optimistic algorithm, called LinMix-UCB, is proposed for the case where $\theta_t$ has an exponential mixing rate. The proposed algorithm is shown to incur a sub-linear regret of $\mathcal{O}\left(\sqrt{d n\mathrm{polylog}(n) }\right)$ with respect to an oracle that always plays a multiple of $\mathbb{E}\theta_t$. The main challenge in this setting is to ensure that the exploration-exploitation strategy is robust against long-range dependencies. The proposed method relies on Berbee's coupling lemma to carefully select near-independent samples and construct confidence ellipsoids around empirical estimates of $\mathbb{E}\theta_t$.
- [20] arXiv:2405.10925 [pdf, ps, other]
-
Title: High-dimensional multiple imputation (HDMI) for partially observed confounders including natural language processing-derived auxiliary covariatesJanick Weberpals, Pamela A. Shaw, Kueiyu Joshua Lin, Richard Wyss, Joseph M Plasek, Li Zhou, Kerry Ngan, Thomas DeRamus, Sudha R. Raman, Bradley G. Hammill, Hana Lee, Sengwee Toh, John G. Connolly, Kimberly J. Dandreo, Fang Tian, Wei Liu, Jie Li, José J. Hernández-Muñoz, Sebastian Schneeweiss, Rishi J. DesaiSubjects: Methodology (stat.ME); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Multiple imputation (MI) models can be improved by including auxiliary covariates (AC), but their performance in high-dimensional data is not well understood. We aimed to develop and compare high-dimensional MI (HDMI) approaches using structured and natural language processing (NLP)-derived AC in studies with partially observed confounders. We conducted a plasmode simulation study using data from opioid vs. non-steroidal anti-inflammatory drug (NSAID) initiators (X) with observed serum creatinine labs (Z2) and time-to-acute kidney injury as outcome. We simulated 100 cohorts with a null treatment effect, including X, Z2, atrial fibrillation (U), and 13 other investigator-derived confounders (Z1) in the outcome generation. We then imposed missingness (MZ2) on 50% of Z2 measurements as a function of Z2 and U and created different HDMI candidate AC using structured and NLP-derived features. We mimicked scenarios where U was unobserved by omitting it from all AC candidate sets. Using LASSO, we data-adaptively selected HDMI covariates associated with Z2 and MZ2 for MI, and with U to include in propensity score models. The treatment effect was estimated following propensity score matching in MI datasets and we benchmarked HDMI approaches against a baseline imputation and complete case analysis with Z1 only. HDMI using claims data showed the lowest bias (0.072). Combining claims and sentence embeddings led to an improvement in the efficiency displaying the lowest root-mean-squared-error (0.173) and coverage (94%). NLP-derived AC alone did not perform better than baseline MI. HDMI approaches may decrease bias in studies with partially observed confounders where missingness depends on unobserved factors.
- [21] arXiv:2405.10930 [pdf, ps, other]
-
Title: Submodular Information Selection for Hypothesis Testing with Misclassification PenaltiesComments: 23 pages, 4 figuresSubjects: Machine Learning (stat.ML); Computational Complexity (cs.CC); Information Theory (cs.IT); Machine Learning (cs.LG); Optimization and Control (math.OC)
We consider the problem of selecting an optimal subset of information sources for a hypothesis testing/classification task where the goal is to identify the true state of the world from a finite set of hypotheses, based on finite observation samples from the sources. In order to characterize the learning performance, we propose a misclassification penalty framework, which enables non-uniform treatment of different misclassification errors. In a centralized Bayesian learning setting, we study two variants of the subset selection problem: (i) selecting a minimum cost information set to ensure that the maximum penalty of misclassifying the true hypothesis remains bounded and (ii) selecting an optimal information set under a limited budget to minimize the maximum penalty of misclassifying the true hypothesis. Under mild assumptions, we prove that the objective (or constraints) of these combinatorial optimization problems are weak (or approximate) submodular, and establish high-probability performance guarantees for greedy algorithms. Further, we propose an alternate metric for information set selection which is based on the total penalty of misclassification. We prove that this metric is submodular and establish near-optimal guarantees for the greedy algorithms for both the information set selection problems. Finally, we present numerical simulations to validate our theoretical results over several randomly generated instances.
New submissions for Monday, 20 May 2024 (showing 21 of 21 entries )
- [22] arXiv:2405.09843 (cross-list from econ.TH) [pdf, ps, other]
-
Title: Organizational Selection of InnovationComments: 40 pages, 13 figures, 2 tablesSubjects: Theoretical Economics (econ.TH); Multiagent Systems (cs.MA); Physics and Society (physics.soc-ph); Applications (stat.AP)
Budgetary constraints force organizations to pursue only a subset of possible innovation projects. Identifying which subset is most promising is an error-prone exercise, and involving multiple decision makers may be prudent. This raises the question of how to most effectively aggregate their collective nous. Our model of organizational portfolio selection provides some first answers. We show that portfolio performance can vary widely. Delegating evaluation makes sense when organizations employ the relevant experts and can assign projects to them. In most other settings, aggregating the impressions of multiple agents leads to better performance than delegation. In particular, letting agents rank projects often outperforms alternative aggregation rules -- including averaging agents' project scores as well as counting their approval votes -- especially when organizations have tight budgets and can select only a few project alternatives out of many.
- [23] arXiv:2405.10410 (cross-list from math.NA) [pdf, ps, html, other]
-
Title: The fast committor machine: Interpretable prediction with kernelsComments: 10 pages, 7 figuresSubjects: Numerical Analysis (math.NA); Machine Learning (stat.ML)
In the study of stochastic dynamics, the committor function describes the probability that a process starting from an initial configuration $x$ will reach set $A$ before set $B$. This paper introduces a fast and interpretable method for approximating the committor, called the "fast committor machine" (FCM). The FCM is based on simulated trajectory data, and it uses this data to train a kernel model. The FCM identifies low-dimensional subspaces that optimally describe the $A$ to $B$ transitions, and the subspaces are emphasized in the kernel model. The FCM uses randomized numerical linear algebra to train the model with runtime that scales linearly in the number of data points. This paper applies the FCM to example systems including the alanine dipeptide miniprotein: in these experiments, the FCM is generally more accurate and trains more quickly than a neural network with a similar number of parameters.
- [24] arXiv:2405.10462 (cross-list from astro-ph.IM) [pdf, ps, html, other]
-
Title: Rotation of the Globular Cluster Population of the Dark Matter Deficient Galaxy NGC 1052-DF4: Implication for the Total MassComments: 9 pages 6 figures. Accepted for publication in PASASubjects: Instrumentation and Methods for Astrophysics (astro-ph.IM); Astrophysics of Galaxies (astro-ph.GA); Applications (stat.AP)
We explore the globular cluster population of NGC 1052-DF4, a dark matter deficient galaxy, using Bayesian inference to search for the presence of rotation. The existence of such a rotating component is relevant to the estimation of the mass of the galaxy, and therefore the question of whether NGC 1052-DF4 is truly deficient of dark matter, similar to NGC 1052-DF2 another galaxy in the same group. The rotational characteristics of seven globular clusters in NGC 1052-DF4 were investigated, finding that a non-rotating kinematic model has a higher Bayesian evidence than a rotating model, by a factor of approximately 2.5. In addition, we find that under the assumption of rotation, its amplitude must be small. This distinct lack of rotation strengthens the case that, based on its intrinsic velocity dispersion, NGC 1052-DF4 is a truly dark matter deficient galaxy.
- [25] arXiv:2405.10469 (cross-list from cs.AI) [pdf, ps, html, other]
-
Title: Simulation-Based Benchmarking of Reinforcement Learning Agents for Personalized Retail PromotionsSubjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG); Econometrics (econ.EM); Machine Learning (stat.ML)
The development of open benchmarking platforms could greatly accelerate the adoption of AI agents in retail. This paper presents comprehensive simulations of customer shopping behaviors for the purpose of benchmarking reinforcement learning (RL) agents that optimize coupon targeting. The difficulty of this learning problem is largely driven by the sparsity of customer purchase events. We trained agents using offline batch data comprising summarized customer purchase histories to help mitigate this effect. Our experiments revealed that contextual bandit and deep RL methods that are less prone to over-fitting the sparse reward distributions significantly outperform static policies. This study offers a practical framework for simulating AI agents that optimize the entire retail customer journey. It aims to inspire the further development of simulation tools for retail AI systems.
- [26] arXiv:2405.10618 (cross-list from cs.LG) [pdf, ps, other]
-
Title: Distributed Event-Based Learning via ADMMComments: 29 pages, 12 figuresSubjects: Machine Learning (cs.LG); Optimization and Control (math.OC); Machine Learning (stat.ML)
We consider a distributed learning problem, where agents minimize a global objective function by exchanging information over a network. Our approach has two distinct features: (i) It substantially reduces communication by triggering communication only when necessary, and (ii) it is agnostic to the data-distribution among the different agents. We can therefore guarantee convergence even if the local data-distributions of the agents are arbitrarily distinct. We analyze the convergence rate of the algorithm and derive accelerated convergence rates in a convex setting. We also characterize the effect of communication drops and demonstrate that our algorithm is robust to communication failures. The article concludes by presenting numerical results from a distributed LASSO problem, and distributed learning tasks on MNIST and CIFAR-10 datasets. The experiments underline communication savings of 50% or more due to the event-based communication strategy, show resilience towards heterogeneous data-distributions, and highlight that our approach outperforms common baselines such as FedAvg, FedProx, and FedADMM.
- [27] arXiv:2405.10763 (cross-list from cond-mat.dis-nn) [pdf, ps, html, other]
-
Title: Integer Traffic Assignment Problem: Algorithms and Insights on Random GraphsComments: 37 pages, 15 figuresSubjects: Disordered Systems and Neural Networks (cond-mat.dis-nn); Discrete Mathematics (cs.DM); Optimization and Control (math.OC); Computation (stat.CO)
Path optimization is a fundamental concern across various real-world scenarios, ranging from traffic congestion issues to efficient data routing over the internet. The Traffic Assignment Problem (TAP) is a classic continuous optimization problem in this field. This study considers the Integer Traffic Assignment Problem (ITAP), a discrete variant of TAP. ITAP involves determining optimal routes for commuters in a city represented by a graph, aiming to minimize congestion while adhering to integer flow constraints on paths. This restriction makes ITAP an NP-hard problem. While conventional TAP prioritizes repulsive interactions to minimize congestion, this work also explores the case of attractive interactions, related to minimizing the number of occupied edges. We present and evaluate multiple algorithms to address ITAP, including a message passing algorithm, a greedy approach, simulated annealing, and relaxation of ITAP to TAP. Inspired by studies of random ensembles in the large-size limit in statistical physics, comparisons between these algorithms are conducted on large sparse random regular graphs with a random set of origin-destination pairs. Our results indicate that while the simplest greedy algorithm performs competitively in the repulsive scenario, in the attractive case the message-passing-based algorithm and simulated annealing demonstrate superiority. We then investigate the relationship between TAP and ITAP in the repulsive case. We find that, as the number of paths increases, the solution of TAP converges toward that of ITAP, and we investigate the speed of this convergence. Depending on the number of paths, our analysis leads us to identify two scaling regimes: in one the average flow per edge is of order one, and in another the number of paths scales quadratically with the size of the graph, in which case the continuous relaxation solves the integer problem closely.
- [28] arXiv:2405.10815 (cross-list from math.OC) [pdf, ps, html, other]
-
Title: A Functional Model Method for Nonconvex Nonsmooth Conditional Stochastic OptimizationSubjects: Optimization and Control (math.OC); Machine Learning (cs.LG); Machine Learning (stat.ML)
We consider stochastic optimization problems involving an expected value of a nonlinear function of a base random vector and a conditional expectation of another function depending on the base random vector, a dependent random vector, and the decision variables. We call such problems conditional stochastic optimization problems. They arise in many applications, such as uplift modeling, reinforcement learning, and contextual optimization. We propose a specialized single time-scale stochastic method for nonconvex constrained conditional stochastic optimization problems with a Lipschitz smooth outer function and a generalized differentiable inner function. In the method, we approximate the inner conditional expectation with a rich parametric model whose mean squared error satisfies a stochastic version of a Łojasiewicz condition. The model is used by an inner learning algorithm. The main feature of our approach is that unbiased stochastic estimates of the directions used by the method can be generated with one observation from the joint distribution per iteration, which makes it applicable to real-time learning. The directions, however, are not gradients or subgradients of any overall objective function. We prove the convergence of the method with probability one, using the method of differential inclusions and a specially designed Lyapunov function, involving a stochastic generalization of the Bregman distance. Finally, a numerical illustration demonstrates the viability of our approach.
- [29] arXiv:2405.10839 (cross-list from nucl-th) [pdf, ps, html, other]
-
Title: Model orthogonalization and Bayesian forecast mixing via Principal Component AnalysisComments: 12 pages, 4 figuresSubjects: Nuclear Theory (nucl-th); Data Analysis, Statistics and Probability (physics.data-an); Machine Learning (stat.ML)
One can improve predictability in the unknown domain by combining forecasts of imperfect complex computational models using a Bayesian statistical machine learning framework. In many cases, however, the models used in the mixing process are similar. In addition to contaminating the model space, the existence of such similar, or even redundant, models during the multimodeling process can result in misinterpretation of results and deterioration of predictive performance. In this work we describe a method based on the Principal Component Analysis that eliminates model redundancy. We show that by adding model orthogonalization to the proposed Bayesian Model Combination framework, one can arrive at better prediction accuracy and reach excellent uncertainty quantification performance.
- [30] arXiv:2405.10875 (cross-list from eess.SY) [pdf, ps, html, other]
-
Title: Recursively Feasible Shrinking-Horizon MPC in Dynamic Environments with Conformal Prediction GuaranteesSubjects: Systems and Control (eess.SY); Machine Learning (stat.ML)
In this paper, we focus on the problem of shrinking-horizon Model Predictive Control (MPC) in uncertain dynamic environments. We consider controlling a deterministic autonomous system that interacts with uncontrollable stochastic agents during its mission. Employing tools from conformal prediction, existing works derive high-confidence prediction regions for the unknown agent trajectories, and integrate these regions in the design of suitable safety constraints for MPC. Despite guaranteeing probabilistic safety of the closed-loop trajectories, these constraints do not ensure feasibility of the respective MPC schemes for the entire duration of the mission. We propose a shrinking-horizon MPC that guarantees recursive feasibility via a gradual relaxation of the safety constraints as new prediction regions become available online. This relaxation enforces the safety constraints to hold over the least restrictive prediction region from the set of all available prediction regions. In a comparative case study with the state of the art, we empirically show that our approach results in tighter prediction regions and verify recursive feasibility of our MPC scheme.
- [31] arXiv:2405.10938 (cross-list from cs.LG) [pdf, ps, html, other]
-
Title: Observational Scaling Laws and the Predictability of Language Model PerformanceSubjects: Machine Learning (cs.LG); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (stat.ML)
Understanding how language model performance varies with scale is critical to benchmark and algorithm development. Scaling laws are one approach to building this understanding, but the requirement of training models across many different scales has limited their use. We propose an alternative, observational approach that bypasses model training and instead builds scaling laws from ~80 publically available models. Building a single scaling law from multiple model families is challenging due to large variations in their training compute efficiencies and capabilities. However, we show that these variations are consistent with a simple, generalized scaling law where language model performance is a function of a low-dimensional capability space, and model families only vary in their efficiency in converting training compute to capabilities. Using this approach, we show the surprising predictability of complex scaling phenomena: we show that several emergent phenomena follow a smooth, sigmoidal behavior and are predictable from small models; we show that the agent performance of models such as GPT-4 can be precisely predicted from simpler non-agentic benchmarks; and we show how to predict the impact of post-training interventions like Chain-of-Thought and Self-Consistency as language model capabilities continue to improve.
Cross submissions for Monday, 20 May 2024 (showing 10 of 10 entries )
- [32] arXiv:2003.05492 (replaced) [pdf, ps, other]
-
Title: An asymptotic Peskun ordering and its application to lifted samplersJournal-ref: Bernoulli 30(3), 2301-2325, (August 2024)Subjects: Computation (stat.CO); Methodology (stat.ME)
A Peskun ordering between two samplers, implying a dominance of one over the other, is known among the Markov chain Monte Carlo community for being a remarkably strong result. It is however also known for being a result that is notably difficult to establish. Indeed, one has to prove that the probability to reach a state $\mathbf{y}$ from a state $\mathbf{x}$, using a sampler, is greater than or equal to the probability using the other sampler, and this must hold for all pairs $(\mathbf{x}, \mathbf{y})$ such that $\mathbf{x} \neq \mathbf{y}$. We provide in this paper a weaker version that does not require an inequality between the probabilities for all these states: essentially, the dominance holds asymptotically, as a varying parameter grows without bound, as long as the states for which the probabilities are greater than or equal to belong to a mass-concentrating set. The weak ordering turns out to be useful to compare lifted samplers for partially-ordered discrete state-spaces with their Metropolis--Hastings counterparts. An analysis in great generality yields a qualitative conclusion: they asymptotically perform better in certain situations (and we are able to identify them), but not necessarily in others (and the reasons why are made clear). A quantitative study in a specific context of graphical-model simulation is also conducted.
- [33] arXiv:2110.07051 (replaced) [pdf, ps, html, other]
-
Title: Fast and Scalable Inference for Spatial Extreme Value ModelsSubjects: Methodology (stat.ME); Computation (stat.CO)
The generalized extreme value (GEV) distribution is a popular model for analyzing and forecasting extreme weather data. To increase prediction accuracy, spatial information is often pooled via a latent Gaussian process (GP) on the GEV parameters. Inference for GEV-GP models is typically carried out using Markov chain Monte Carlo (MCMC) methods, or using approximate inference methods such as the integrated nested Laplace approximation (INLA). However, MCMC becomes prohibitively slow as the number of spatial locations increases, whereas INLA is only applicable in practice to a limited subset of GEV-GP models. In this paper, we revisit the original Laplace approximation for fitting spatial GEV models. In combination with a popular sparsity-inducing spatial covariance approximation technique, we show through simulations that our approach accurately estimates the Bayesian predictive distribution of extreme weather events, is scalable to several thousand spatial locations, and is several orders of magnitude faster than MCMC. A case study in forecasting extreme snowfall across Canada is presented.
- [34] arXiv:2210.05983 (replaced) [pdf, ps, html, other]
-
Title: Model-based clustering in simple hypergraphs through a stochastic blockmodelLuca Brusa (UNIMIB), Catherine Matias (LPSM (UMR\_8001))Subjects: Methodology (stat.ME)
We propose a model to address the overlooked problem of node clustering in simple hypergraphs. Simple hypergraphs are suitable when a node may not appear multiple times in the same hyperedge, such as in co-authorship datasets. Our model generalizes the stochastic blockmodel for graphs and assumes the existence of latent node groups and hyperedges are conditionally independent given these groups. We first establish the generic identifiability of the model parameters. We then develop a variational approximation Expectation-Maximization algorithm for parameter inference and node clustering, and derive a statistical criterion for model selection.
To illustrate the performance of our R package HyperSBM, we compare it with other node clustering methods using synthetic data generated from the model, as well as from a line clustering experiment and a co-authorship dataset. - [35] arXiv:2210.09560 (replaced) [pdf, ps, html, other]
-
Title: A Bayesian Convolutional Neural Network-based Generalized Linear ModelComments: 25 pages, 7 figuresSubjects: Methodology (stat.ME)
Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear models (GLMs) framework. We use extracted nodes from the last hidden layer of CNN with Monte Carlo (MC) dropout as informative covariates in GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems, which have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm can be broadly applicable to image regressions or correlated data analysis by enabling accurate Bayesian inference quickly.
- [36] arXiv:2305.12100 (replaced) [pdf, ps, html, other]
-
Title: How Spurious Features Are Memorized: Precise Analysis for Random and NTK FeaturesComments: Revision after ICML2024 acceptance. Motivation of the paper changed from Privacy to Spurious Features. arXiv admin note: text overlap with arXiv:2302.01629Subjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
Deep learning models are known to overfit and memorize spurious features in the training dataset. While numerous empirical studies have aimed at understanding this phenomenon, a rigorous theoretical framework to quantify it is still missing. In this paper, we consider spurious features that are uncorrelated with the learning task, and we provide a precise characterization of how they are memorized via two separate terms: (i) the stability of the model with respect to individual training samples, and (ii) the feature alignment between the spurious feature and the full sample. While the first term is well established in learning theory and it is connected to the generalization error in classical work, the second one is, to the best of our knowledge, novel. Our key technical result gives a precise characterization of the feature alignment for the two prototypical settings of random features (RF) and neural tangent kernel (NTK) regression. We prove that the memorization of spurious features weakens as the generalization capability increases and, through the analysis of the feature alignment, we unveil the role of the model and of its activation function. Numerical experiments show the predictive power of our theory on standard datasets (MNIST, CIFAR-10).
- [37] arXiv:2305.15671 (replaced) [pdf, ps, html, other]
-
Title: Matrix Autoregressive Model with Vector Time Series Covariates for Spatio-Temporal DataSubjects: Methodology (stat.ME)
We develop a new methodology for forecasting matrix-valued time series with historical matrix data and auxiliary vector time series data. We focus on a time series of matrices defined on a static 2-D spatial grid and an auxiliary time series of non-spatial vectors. The proposed model, Matrix AutoRegression with Auxiliary Covariates (MARAC), contains an autoregressive component for the historical matrix predictors and an additive component that maps the auxiliary vector predictors to a matrix response via tensor-vector product. The autoregressive component adopts a bi-linear transformation framework following Chen et al. (2021), significantly reducing the number of parameters. The auxiliary component posits that the tensor coefficient, which maps non-spatial predictors to a spatial response, contains slices of spatially smooth matrix coefficients that are discrete evaluations of smooth functions from a Reproducible Kernel Hilbert Space (RKHS). We propose to estimate the model parameters under a penalized maximum likelihood estimation framework coupled with an alternating minimization algorithm. We establish the joint asymptotics of the autoregressive and tensor parameters under fixed and high-dimensional regimes. Extensive simulations and a geophysical application for forecasting the global Total Electron Content (TEC) are conducted to validate the performance of MARAC.
- [38] arXiv:2306.08485 (replaced) [pdf, ps, html, other]
-
Title: Graph-Aligned Random Partition Model (GARP)Comments: Journal of the American Statistical Association 2024Subjects: Methodology (stat.ME)
Bayesian nonparametric mixtures and random partition models are powerful tools for probabilistic clustering. However, standard independent mixture models can be restrictive in some applications such as inference on cell lineage due to the biological relations of the clusters. The increasing availability of large genomic data requires new statistical tools to perform model-based clustering and infer the relationship between homogeneous subgroups of units. Motivated by single-cell RNA applications we develop a novel dependent mixture model to jointly perform cluster analysis and align the clusters on a graph. Our flexible graph-aligned random partition model (GARP) exploits Gibbs-type priors as building blocks, allowing us to derive analytical results on the graph-aligned random partition's probability mass function (pmf). We derive a generalization of the Chinese restaurant process from the pmf and a related efficient and neat MCMC algorithm to perform Bayesian inference. We perform posterior inference on real single-cell RNA data from mice stem cells. We further investigate the performance of our model in capturing the underlying clustering structure as well as the underlying graph by means of simulation studies.
- [39] arXiv:2306.09555 (replaced) [pdf, ps, html, other]
-
Title: Geometric-Based Pruning Rules For Change Point Detection in Multiple Independent Time SeriesComments: 34 pages, 11 figures, 1 tableSubjects: Methodology (stat.ME); Computation (stat.CO); Machine Learning (stat.ML)
We consider the problem of detecting multiple changes in multiple independent time series. The search for the best segmentation can be expressed as a minimization problem over a given cost function. We focus on dynamic programming algorithms that solve this problem exactly. When the number of changes is proportional to data length, an inequality-based pruning rule encoded in the PELT algorithm leads to a linear time complexity. Another type of pruning, called functional pruning, gives a close-to-linear time complexity whatever the number of changes, but only for the analysis of univariate time series.
We propose a few extensions of functional pruning for multiple independent time series based on the use of simple geometric shapes (balls and hyperrectangles). We focus on the Gaussian case, but some of our rules can be easily extended to the exponential family. In a simulation study we compare the computational efficiency of different geometric-based pruning rules. We show that for small dimensions (2, 3, 4) some of them ran significantly faster than inequality-based approaches in particular when the underlying number of changes is small compared to the data length. - [40] arXiv:2306.15075 (replaced) [pdf, ps, html, other]
-
Title: Differences in academic preparedness do not fully explain Black-White enrollment disparities in advanced high school courseworkSubjects: Methodology (stat.ME); Applications (stat.AP)
Whether racial disparities in enrollment in advanced high school coursework can be attributed to differences in prior academic preparation is a central question in sociological research and education policy. However, previous investigations face methodological limitations, for they compare race-specific enrollment rates of students after adjusting for characteristics only partially related to their academic preparedness for advanced coursework. Informed by a recently-developed statistical technique, we propose and estimate a novel measure of students' academic preparedness and use administrative data from the New York City Department of Education to measure differences in AP mathematics enrollment rates among similarly prepared students of different races. We find that preexisting differences in academic preparation do not fully explain the under-representation of Black students relative to White students in AP mathematics. Our results imply that achieving equal opportunities for AP enrollment not only requires equalizing earlier academic experiences, but also addressing inequities that emerge from coursework placement processes.
- [41] arXiv:2310.12788 (replaced) [pdf, ps, html, other]
-
Title: Continuous Time Locally Stationary Wavelet ProcessesComments: 38 pages, 12 figuresSubjects: Statistics Theory (math.ST)
This article introduces the class of continuous time locally stationary wavelet processes. Continuous time models enable us to properly provide scale-based time series models for irregularly-spaced observations for the first time, while also permitting a spectral representation of the process over a continuous range of scales. We derive results for both the theoretical setting, where we assume access to the entire process sample path, and a more practical one, which develops methods for estimating the quantities of interest from sampled time series. The latter estimates are accurately computable in reasonable time by solving the relevant linear integral equation using the iterative thresholding method due to Daubechies, Defrise and De Mol. Appropriate smoothing techniques are also developed and applied in this new setting. We exemplify our new methods by computing spectral and autocovariance estimates on irregularly-spaced heart rate data obtained from a recent sleep-state study.
- [42] arXiv:2310.14691 (replaced) [pdf, ps, html, other]
-
Title: Identifiability of total effects from abstractions of time series causal graphsComments: Accepted to the 40th Conference on Uncertainty in Artificial Intelligence (UAI) 2024, Barcelona, SpainSubjects: Statistics Theory (math.ST); Artificial Intelligence (cs.AI)
We study the problem of identifiability of the total effect of an intervention from observational time series in the situation, common in practice, where one only has access to abstractions of the true causal graph. We consider here two abstractions: the extended summary causal graph, which conflates all lagged causal relations but distinguishes between lagged and instantaneous relations, and the summary causal graph which does not give any indication about the lag between causal relations. We show that the total effect is always identifiable in extended summary causal graphs and provide sufficient conditions for identifiability in summary causal graphs. We furthermore provide adjustment sets allowing to estimate the total effect whenever it is identifiable.
- [43] arXiv:2310.17582 (replaced) [pdf, ps, html, other]
-
Title: Convergence of flow-based generative models via proximal gradient descent in Wasserstein spaceSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC); Statistics Theory (math.ST)
Flow-based generative models enjoy certain advantages in computing the data generation and the likelihood, and have recently shown competitive empirical performance. Compared to the accumulating theoretical studies on related score-based diffusion models, analysis of flow-based models, which are deterministic in both forward (data-to-noise) and reverse (noise-to-data) directions, remain sparse. In this paper, we provide a theoretical guarantee of generating data distribution by a progressive flow model, the so-called JKO flow model, which implements the Jordan-Kinderleherer-Otto (JKO) scheme in a normalizing flow network. Leveraging the exponential convergence of the proximal gradient descent (GD) in Wasserstein space, we prove the Kullback-Leibler (KL) guarantee of data generation by a JKO flow model to be $O(\varepsilon^2)$ when using $N \lesssim \log (1/\varepsilon)$ many JKO steps ($N$ Residual Blocks in the flow) where $\varepsilon $ is the error in the per-step first-order condition. The assumption on data density is merely a finite second moment, and the theory extends to data distributions without density and when there are inversion errors in the reverse process where we obtain KL-$W_2$ mixed error guarantees. The non-asymptotic convergence rate of the JKO-type $W_2$-proximal GD is proved for a general class of convex objective functionals that includes the KL divergence as a special case, which can be of independent interest. The analysis framework can extend to other first-order Wasserstein optimization schemes applied to flow-based generative models.
- [44] arXiv:2311.00905 (replaced) [pdf, ps, html, other]
-
Title: Data-driven fixed-point tuning for truncated realized variationsSubjects: Statistics Theory (math.ST); Econometrics (econ.EM)
Many methods for estimating integrated volatility and related functionals of semimartingales in the presence of jumps require specification of tuning parameters for their use in practice. In much of the available theory, tuning parameters are assumed to be deterministic and their values are specified only up to asymptotic constraints. However, in empirical work and in simulation studies, they are typically chosen to be random and data-dependent, with explicit choices often relying entirely on heuristics. In this paper, we consider novel data-driven tuning procedures for the truncated realized variations of a semimartingale with jumps based on a type of random fixed-point iteration. Being effectively automated, our approach alleviates the need for delicate decision-making regarding tuning parameters in practice and can be implemented using information regarding sampling frequency alone. We show our methods can lead to asymptotically efficient estimation of integrated volatility and exhibit superior finite-sample performance compared to popular alternatives in the literature.
- [45] arXiv:2311.03644 (replaced) [pdf, ps, html, other]
-
Title: BOB: Bayesian Optimized Bootstrap for Uncertainty Quantification in Gaussian Mixture ModelsComments: 35 pages, 8 figuresSubjects: Methodology (stat.ME); Computation (stat.CO)
A natural way to quantify uncertainties in Gaussian mixture models (GMMs) is through Bayesian methods. That said, sampling from the joint posterior distribution of GMMs via standard Markov chain Monte Carlo (MCMC) imposes several computational challenges, which have prevented a broader full Bayesian implementation of these models. A growing body of literature has introduced the Weighted Likelihood Bootstrap and the Weighted Bayesian Bootstrap as alternatives to MCMC sampling. The core idea of these methods is to repeatedly compute maximum a posteriori (MAP) estimates on many randomly weighted posterior densities. These MAP estimates then can be treated as approximate posterior draws. Nonetheless, a central question remains unanswered: How to select the random weights under arbitrary sample sizes. We, therefore, introduce the Bayesian Optimized Bootstrap (BOB), a computational method to automatically select these random weights by minimizing, through Bayesian Optimization, a black-box and noisy version of the reverse Kullback-Leibler (KL) divergence between the Bayesian posterior and an approximate posterior obtained via random weighting. Our proposed method outperforms competing approaches in recovering the Bayesian posterior, it provides a better uncertainty quantification, and it retains key asymptotic properties from existing methods. BOB's performance is demonstrated through extensive simulations, along with real-world data analyses.
- [46] arXiv:2311.05794 (replaced) [pdf, ps, html, other]
-
Title: An Experimental Design for Anytime-Valid Causal Inference on Multi-Armed BanditsSubjects: Methodology (stat.ME); Machine Learning (cs.LG)
In multi-armed bandit (MAB) experiments, it is often advantageous to continuously produce inference on the average treatment effect (ATE) between arms as new data arrive and determine a data-driven stopping time for the experiment. We develop the Mixture Adaptive Design (MAD), a new experimental design for multi-armed bandit experiments that produces powerful and anytime-valid inference on the ATE for \emph{any} bandit algorithm of the experimenter's choice, even those without probabilistic treatment assignment. Intuitively, the MAD "mixes" any bandit algorithm of the experimenter's choice with a Bernoulli design through a tuning parameter $\delta_t$, where $\delta_t$ is a deterministic sequence that decreases the priority placed on the Bernoulli design as the sample size grows. We prove that for $\delta_t = \omega\left(t^{-1/4}\right)$, the MAD generates anytime-valid asymptotic confidence sequences that are guaranteed to shrink around the true ATE. Hence, the experimenter is guaranteed to detect a true non-zero treatment effect in finite time. Additionally, we prove that the regret of the MAD approaches that of its underlying bandit algorithm over time, and hence, incurs a relatively small loss in regret in return for powerful inferential guarantees. Finally, we conduct an extensive simulation study exhibiting that the MAD achieves finite-sample anytime validity and high power without significant losses in finite-sample reward.
- [47] arXiv:2311.17778 (replaced) [pdf, ps, html, other]
-
Title: Unified Binary and Multiclass Margin-Based ClassificationComments: Accepted for publication in Journal of Machine Learning ResearchSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG)
The notion of margin loss has been central to the development and analysis of algorithms for binary classification. To date, however, there remains no consensus as to the analogue of the margin loss for multiclass classification. In this work, we show that a broad range of multiclass loss functions, including many popular ones, can be expressed in the relative margin form, a generalization of the margin form of binary losses. The relative margin form is broadly useful for understanding and analyzing multiclass losses as shown by our prior work (Wang and Scott, 2020, 2021). To further demonstrate the utility of this way of expressing multiclass losses, we use it to extend the seminal result of Bartlett et al. (2006) on classification-calibration of binary margin losses to multiclass. We then analyze the class of Fenchel-Young losses, and expand the set of these losses that are known to be classification-calibrated.
- [48] arXiv:2312.07792 (replaced) [pdf, ps, html, other]
-
Title: Differentially private projection-depth-based mediansComments: 44 pages, 1 figureSubjects: Statistics Theory (math.ST); Cryptography and Security (cs.CR); Machine Learning (cs.LG); Methodology (stat.ME)
We develop $(\epsilon,\delta)$-differentially private projection-depth-based medians using the propose-test-release (PTR) and exponential mechanisms. Under general conditions on the input parameters and the population measure, (e.g. we do not assume any moment bounds), we quantify the probability the test in PTR fails, as well as the cost of privacy via finite sample deviation bounds. We then present a new definition of the finite sample breakdown point which applies to a mechanism, and present a lower bound on the finite sample breakdown point of the projection-depth-based median. We demonstrate our main results on the canonical projection-depth-based median, as well as on projection-depth-based medians derived from trimmed estimators. In the Gaussian setting, we show that the resulting deviation bound matches the known lower bound for private Gaussian mean estimation. In the Cauchy setting, we show that the "outlier error amplification" effect resulting from the heavy tails outweighs the cost of privacy. This result is then verified via numerical simulations. Additionally, we present results on general PTR mechanisms and a uniform concentration result on the projected spacings of order statistics, which may be of general interest.
- [49] arXiv:2312.10563 (replaced) [pdf, ps, html, other]
-
Title: Mediation Analysis with Mendelian Randomization and Efficient Multiple GWAS IntegrationSubjects: Methodology (stat.ME); Statistics Theory (math.ST)
Mediation analysis is a powerful tool for studying causal pathways between exposure, mediator, and outcome variables of interest. While classical mediation analysis using observational data often requires strong and sometimes unrealistic assumptions, such as unconfoundedness, Mendelian Randomization (MR) avoids unmeasured confounding bias by employing genetic variations as instrumental variables. We develop a novel MR framework for mediation analysis with genome-wide associate study (GWAS) summary data, and provide solid statistical guarantees. Our framework employs carefully crafted estimating equations, allowing for different sets of genetic variations to instrument the exposure and the mediator, to efficiently integrate information stored in three independent GWAS. As part of this endeavor, we demonstrate that in mediation analysis, the challenge raised by instrument selection goes beyond the well-known winner's curse issue, and therefore, addressing it requires special treatment. We then develop bias correction techniques to address the instrument selection issue and commonly encountered measurement error bias issue. Collectively, through our theoretical investigations, we show that our framework provides valid statistical inference for both direct and mediation effects with enhanced statistical efficiency compared to existing methods. We further illustrate the finite-sample performance of our approach through simulation experiments and a case study.
- [50] arXiv:2402.02969 (replaced) [pdf, ps, html, other]
-
Title: Towards Understanding the Word Sensitivity of Attention Layers: A Study via Random FeaturesComments: Revision after ICML2024 reviewsSubjects: Machine Learning (stat.ML); Computation and Language (cs.CL); Machine Learning (cs.LG)
Understanding the reasons behind the exceptional success of transformers requires a better analysis of why attention layers are suitable for NLP tasks. In particular, such tasks require predictive models to capture contextual meaning which often depends on one or few words, even if the sentence is long. Our work studies this key property, dubbed word sensitivity (WS), in the prototypical setting of random features. We show that attention layers enjoy high WS, namely, there exists a vector in the space of embeddings that largely perturbs the random attention features map. The argument critically exploits the role of the softmax in the attention layer, highlighting its benefit compared to other activations (e.g., ReLU). In contrast, the WS of standard random features is of order $1/\sqrt{n}$, $n$ being the number of words in the textual sample, and thus it decays with the length of the context. We then translate these results on the word sensitivity into generalization bounds: due to their low WS, random features provably cannot learn to distinguish between two sentences that differ only in a single word; in contrast, due to their high WS, random attention features have higher generalization capabilities. We validate our theoretical results with experimental evidence over the BERT-Base word embeddings of the imdb review dataset.
- [51] arXiv:2402.14775 (replaced) [pdf, ps, html, other]
-
Title: Localised Natural Causal Learning Algorithms for Weak Consistency ConditionsComments: UAI2024Subjects: Methodology (stat.ME)
By relaxing conditions for natural structure learning algorithms, a family of constraint-based algorithms containing all exact structure learning algorithms under the faithfulness assumption, we define localised natural structure learning algorithms (LoNS). We also provide a set of necessary and sufficient assumptions for consistency of LoNS, which can be thought of as a strict relaxation of the restricted faithfulness assumption. We provide a practical LoNS algorithm that runs in exponential time, which is then compared with related existing structure learning algorithms, namely PC/SGS and the relatively recent Sparsest Permutation algorithm. Simulation studies are also provided.
- [52] arXiv:2403.02058 (replaced) [pdf, ps, html, other]
-
Title: Utility-based optimization of Fujikawa's basket trial design -- Pre-specified protocol of a comparison studyComments: 26 pages, 1 figure; updated content in reaction to anonymous review: new section "Methodology of utility functions in basket trial designs", discussion of literature and four new scenario sets in section "Outcome scenarios", two new algorithms and detailed explanation in section "Optimization algorithms", new section "Discussion", further minor changesSubjects: Methodology (stat.ME); Applications (stat.AP)
Basket trial designs are a type of master protocol in which the same therapy is tested in several strata of the patient cohort. Many basket trial designs implement borrowing mechanisms. These allow sharing information between similar strata with the goal of increasing power in responsive strata while at the same time constraining type-I error inflation to a bearable threshold. These borrowing mechanisms can be tuned using numerical tuning parameters. The optimal choice of these tuning parameters is subject to research. In a comparison study using simulations and numerical calculations, we are planning to investigate the use of utility functions for quantifying the compromise between power and type-I error inflation and the use of numerical optimization algorithms for optimizing these functions. The present document is the protocol of this comparison study, defining each step of the study in accordance with the ADEMP scheme for pre-specification of simulation studies.
- [53] arXiv:2405.08253 (replaced) [pdf, ps, html, other]
-
Title: Thompson Sampling for Infinite-Horizon Discounted Decision ProcessesSubjects: Machine Learning (stat.ML); Machine Learning (cs.LG); Optimization and Control (math.OC)
We model a Markov decision process, parametrized by an unknown parameter, and study the asymptotic behavior of a sampling-based algorithm, called Thompson sampling. The standard definition of regret is not always suitable to evaluate a policy, especially when the underlying chain structure is general. We show that the standard (expected) regret can grow (super-)linearly and fails to capture the notion of learning in realistic settings with non-trivial state evolution. By decomposing the standard (expected) regret, we develop a new metric, called the expected residual regret, which forgets the immutable consequences of past actions. Instead, it measures regret against the optimal reward moving forward from the current period. We show that the expected residual regret of the Thompson sampling algorithm is upper bounded by a term which converges exponentially fast to 0. We present conditions under which the posterior sampling error of Thompson sampling converges to 0 almost surely. We then introduce the probabilistic version of the expected residual regret and present conditions under which it converges to 0 almost surely. Thus, we provide a viable concept of learning for sampling algorithms which will serve useful in broader settings than had been considered previously.
- [54] arXiv:2405.10067 (replaced) [pdf, ps, html, other]
-
Title: Sparse and Orthogonal Low-rank Collective Matrix Factorization (solrCMF): Efficient data integration in flexible layoutsSubjects: Methodology (stat.ME)
Interest in unsupervised methods for joint analysis of heterogeneous data sources has risen in recent years. Low-rank latent factor models have proven to be an effective tool for data integration and have been extended to a large number of data source layouts. Of particular interest is the separation of variation present in data sources into shared and individual subspaces. In addition, interpretability of estimated latent factors is crucial to further understanding.
We present sparse and orthogonal low-rank Collective Matrix Factorization (solrCMF) to estimate low-rank latent factor models for flexible data layouts. These encompass traditional multi-view (one group, multiple data types) and multi-grid (multiple groups, multiple data types) layouts, as well as augmented layouts, which allow the inclusion of side information between data types or groups. In addition, solrCMF allows tensor-like layouts (repeated layers), estimates interpretable factors, and determines variation structure among factors and data sources.
Using a penalized optimization approach, we automatically separate variability into the globally and partially shared as well as individual components and estimate sparse representations of factors. To further increase interpretability of factors, we enforce orthogonality between them. Estimation is performed efficiently in a recent multi-block ADMM framework which we adapted to support embedded manifold constraints.
The performance of solrCMF is demonstrated in simulation studies and compares favorably to existing methods. - [55] arXiv:1810.02071 (replaced) [pdf, ps, html, other]
-
Title: Leave-one-out least squares Monte Carlo algorithm for pricing Bermudan optionsJournal-ref: Journal of Futures Markets (2024)Subjects: Computational Finance (q-fin.CP); Mathematical Finance (q-fin.MF); Machine Learning (stat.ML)
The least squares Monte Carlo (LSM) algorithm proposed by Longstaff and Schwartz (2001) is widely used for pricing Bermudan options. The LSM estimator contains undesirable look-ahead bias, and the conventional technique of avoiding it requires additional simulation paths. We present the leave-one-out LSM (LOOLSM) algorithm to eliminate look-ahead bias without doubling simulations. We also show that look-ahead bias is asymptotically proportional to the regressors-to-paths ratio. Our findings are demonstrated with several option examples in which the LSM algorithm overvalues the options. The LOOLSM method can be extended to other regression-based algorithms that improve the LSM method.
- [56] arXiv:2303.07158 (replaced) [pdf, ps, html, other]
-
Title: Uniform Pessimistic Risk and its Optimal PortfolioSubjects: Portfolio Management (q-fin.PM); Machine Learning (cs.LG); Computation (stat.CO); Machine Learning (stat.ML)
The optimal allocation of assets has been widely discussed with the theoretical analysis of risk measures, and pessimism is one of the most attractive approaches beyond the conventional optimal portfolio model. The $\alpha$-risk plays a crucial role in deriving a broad class of pessimistic optimal portfolios. However, estimating an optimal portfolio assessed by a pessimistic risk is still challenging due to the absence of a computationally tractable model. In this study, we propose an integral of $\alpha$-risk called the \textit{uniform pessimistic risk} and the computational algorithm to obtain an optimal portfolio based on the risk. Further, we investigate the theoretical properties of the proposed risk in view of three different approaches: multiple quantile regression, the proper scoring rule, and distributionally robust optimization. Real data analysis of three stock datasets (S\&P500, CSI500, KOSPI200) demonstrates the usefulness of the proposed risk and portfolio model.
- [57] arXiv:2307.09864 (replaced) [pdf, ps, other]
-
Title: Asymptotic equivalence of Principal Components and Quasi Maximum Likelihood estimators in Large Approximate Factor ModelsComments: arXiv admin note: text overlap with arXiv:2211.01921 which is written by the same author. The two papers do not overlap as they contain different results although they have the same assumptionsSubjects: Econometrics (econ.EM); Methodology (stat.ME)
We provide an alternative derivation of the asymptotic results for the Principal Components estimator of a large approximate factor model. Results are derived under a minimal set of assumptions and, in particular, we require only the existence of 4th order moments. A special focus is given to the time series setting, a case considered in almost all recent econometric applications of factor models. Hence, estimation is based on the classical $n\times n$ sample covariance matrix and not on a $T\times T$ covariance matrix often considered in the literature. Indeed, despite the two approaches being asymptotically equivalent, the former is more coherent with a time series setting and it immediately allows us to write more intuitive asymptotic expansions for the Principal Component estimators showing that they are equivalent to OLS as long as $\sqrt n/T\to 0$ and $\sqrt T/n\to 0$, that is the loadings are estimated in a time series regression as if the factors were known, while the factors are estimated in a cross-sectional regression as if the loadings were known. Finally, we give some alternative sets of primitive sufficient conditions for mean-squared consistency of the sample covariance matrix of the factors, of the idiosyncratic components, and of the observed time series, which is the starting point for Principal Component Analysis.
- [58] arXiv:2310.09319 (replaced) [pdf, ps, html, other]
-
Title: Topological Data Analysis in smart manufacturingComments: Preprint still under reviewSubjects: Machine Learning (cs.LG); Algebraic Topology (math.AT); Applications (stat.AP)
Topological Data Analysis (TDA) is a discipline that applies algebraic topology techniques to analyze complex, multi-dimensional data. Although it is a relatively new field, TDA has been widely and successfully applied across various domains, such as medicine, materials science, and biology. This survey provides an overview of the state of the art of TDA within a dynamic and promising application area: industrial manufacturing and production, particularly within the Industry 4.0 context. We have conducted a rigorous and reproducible literature search focusing on TDA applications in industrial production and manufacturing settings. The identified works are categorized based on their application areas within the manufacturing process and the types of input data. We highlight the principal advantages of TDA tools in this context, address the challenges encountered and the future potential of the field. Furthermore, we identify TDA methods that are currently underexploited in specific industrial areas and discuss how their application could be beneficial, with the aim of stimulating further research in this field. This work seeks to bridge the theoretical advancements in TDA with the practical needs of industrial production. Our goal is to serve as a guide for practitioners and researchers applying TDA in industrial production and manufacturing systems. We advocate for the untapped potential of TDA in this domain and encourage continued exploration and research.
- [59] arXiv:2404.05484 (replaced) [pdf, ps, html, other]
-
Title: On Computational Modeling of Sleep-Wake CycleSubjects: Machine Learning (cs.LG); Machine Learning (stat.ML)
Why do mammals need to sleep? Neuroscience treats sleep and wake as default and perturbation modes of the brain. It is hypothesized that the brain self-organizes neural activities without environmental inputs. This paper presents a new computational model of the sleep-wake cycle (SWC) for learning and memory. During the sleep mode, the memory consolidation by the thalamocortical system is abstracted by a disentangling operator that maps context-dependent representations (CDR) to context-independent representations (CIR) for generalization. Such a disentangling operator can be mathematically formalized by an integral transform that integrates the context variable from CDR. During the wake mode, the memory formation by the hippocampal-neocortical system is abstracted by an entangling operator from CIR to CDR where the context is introduced by physical motion. When designed as inductive bias, entangled CDR linearizes the problem of unsupervised learning for sensory memory by direct-fit. The concatenation of disentangling and entangling operators forms a disentangling-entangling cycle (DEC) as the building block for sensorimotor learning. We also discuss the relationship of DEC and SWC to the perception-action cycle (PAC) for internal model learning and perceptual control theory for the ecological origin of natural languages.
- [60] arXiv:2405.07836 (replaced) [pdf, ps, other]
-
Title: Forecasting with Hyper-TreesComments: Forecasting, Gradient Boosting, Hyper-Networks, LightGBM, Parameter Non-Stationarity, Time Series, XGBoostSubjects: Machine Learning (cs.LG); Methodology (stat.ME)
This paper introduces the concept of Hyper-Trees and offers a new direction in applying tree-based models to time series data. Unlike conventional applications of decision trees that forecast time series directly, Hyper-Trees are designed to learn the parameters of a target time series model. Our framework leverages the gradient-based nature of boosted trees, which allows us to extend the concept of Hyper-Networks to Hyper-Trees and to induce a time-series inductive bias to tree models. By relating the parameters of a target time series model to features, Hyper-Trees address the issue of parameter non-stationarity and enable tree-based forecasts to extend beyond their training range. With our research, we aim to explore the effectiveness of Hyper-Trees across various forecasting scenarios and to extend the application of gradient boosted decision trees outside their conventional use in time series modeling.