Simple Exponential Family Principal Component Analysis

Much practical data lives in non-Euclidean spaces, e.g. binary bits recording the presence or absence of words in documents or the activation status of neurons, and integer scales for customers’ rankings of products. How do we construct representations that are both informative and reliable from observations that are usually noisy and volatile? Without a natural distance metric, how can we tell the trend from the noise, and how can we determine the number of features so that the new representations capture the trend and discard the noise? Principal component analysis (PCA) is a widely applied technique for automatically learning representations from data. For determining the appropriate number of principal components, automatic relevance determination (ARD) is a weight-decay technique that has proven effective for pruning neural networks based on data. For non-Euclidean data, however, few techniques are directly applicable to choosing the number of features for reliable data representation.

“Simple exponential family PCA” is a piece of work addressing the model selection problem for generalised PCA. In particular, our work is the first to apply ARD, an empirical Bayesian approach, to choosing the optimal number of principal components for representing data that lives in non-Euclidean spaces. For such data, exponential family distributions provide the necessary link between the real-valued variables of the new representation (the principal components) and the observed data, and this link is what allows ARD to be applied to the model. We have therefore studied a probabilistic reformulation of generalised PCA for non-Euclidean data based on exponential family distributions, and developed the corresponding inference and learning algorithms. Our analysis shows that alternating between learning the probabilistic model and tuning the hyper-parameters yields a useful technique for determining the effective model complexity. ARD works by pruning variables that are surplus to representing the data, and the contribution of each variable to the target representation is assessed during the learning process. We have shown that the proposed technique helps choose model families of appropriate complexity for various analysis tasks.
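The role of the exponential family as a link can be illustrated for binary data. The following is a minimal sketch, not the authors' implementation: it assumes a Bernoulli likelihood whose natural parameters are a linear function of real-valued latent variables, and the names `W`, `Y`, `b` and both function names are hypothetical choices made for this illustration.

```python
import numpy as np

def bernoulli_log_partition(theta):
    # Log-partition function of the Bernoulli family: A(theta) = log(1 + exp(theta)).
    # np.logaddexp(0, theta) evaluates it without overflow for large |theta|.
    return np.logaddexp(0.0, theta)

def bernoulli_loglik(X, W, Y, b):
    # X: (N, D) binary data, Y: (N, K) real-valued latent variables,
    # W: (D, K) loadings, b: (D,) bias. All names are illustrative assumptions.
    theta = Y @ W.T + b  # natural parameters, shape (N, D)
    # Exponential family form: log p(x | theta) = x * theta - A(theta)
    return np.sum(X * theta - bernoulli_log_partition(theta))

def bernoulli_mean(W, Y, b):
    # Mean parameters (probability of observing 1): the gradient of A(theta),
    # which for the Bernoulli family is the logistic sigmoid.
    theta = Y @ W.T + b
    return 1.0 / (1.0 + np.exp(-theta))
```

Swapping the log-partition function (for example, `exp(theta)` for Poisson counts) changes the data type being modelled while the latent variables stay real-valued, which is the sense in which the distribution acts as a link.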

Besides the methodological contribution to model selection, a theoretical contribution of the work is an in-depth analysis of the proposed ARD-based type-II EM learning process. The conditions under which a latent variable, i.e. a candidate principal component, is trimmed by ARD have been carefully examined, and the implications have been comprehensively discussed. Exploiting the exponential family-based stochastic models, we have systematically studied the contribution of a variable to data representation. A connection has been established between classic ARD and the basic principle of dimension reduction: the additional complexity caused by using an extra variable in the model can only be justified when the variable explains sufficient variance in the data. Based on the analysis, the proposed technique can be understood as a way of expressing one’s belief about the signal-to-noise ratio of the data and making use of that knowledge to choose an appropriate data model. Unlike prior beliefs that must be obtained from domain knowledge or by arbitrary guess, the prior required by our method can be obtained from data, and our experiments have shown that the assessment of the model on training data varies consistently with the model performance on test data.
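The pruning behaviour described above can be sketched with the classic ARD re-estimation rule from Bayesian PCA, in which each candidate component has its own precision hyper-parameter. This is an assumption made for illustration: the exact update and pruning conditions for the exponential family model are derived in the paper and may differ, and the function names and the threshold `alpha_max` are hypothetical.

```python
import numpy as np

def ard_update(W, eps=1e-12):
    # W: (D, K) loading matrix, one column per candidate principal component.
    # Classic ARD re-estimate (Bayesian PCA style): alpha_k = D / ||w_k||^2.
    # A column that shrinks towards zero receives a very large precision.
    D = W.shape[0]
    return D / (np.sum(W ** 2, axis=0) + eps)

def prune_components(W, alpha, alpha_max=1e6):
    # Keep only components whose precision stays below the (hypothetical)
    # threshold; the others contribute too little to justify their complexity.
    keep = alpha < alpha_max
    return W[:, keep], keep
```

In a type-II EM loop, a step like `ard_update` would alternate with re-fitting the loadings under the current precisions, so that components explaining little variance are progressively driven to zero and then removed.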

A demo Matlab implementation reproducing the toy example in our paper can be downloaded [here]. Instructions can be found at the beginning of the program file “toy_demo.m”. If you find our tool useful, interesting, or related to your work, please cite the following article (BibTeX provided for your convenience).

[Li & Tao, 2013] Jun Li and Dacheng Tao, Simple Exponential Family PCA, IEEE Trans. Neural Networks and Learning Systems 24(3): 485–497, 2013 [pdf]

author = {Jun Li and Dacheng Tao},
title = {{Simple Exponential Family PCA}},
journal = {IEEE Transactions on Neural Networks and Learning Systems},
volume = {24},
number = {3},
pages = {485-497},
year = {2013},
month = {3},
ISSN = {2162-237X},
doi = {10.1109/TNNLS.2012.2234134}