Note:write about ICML2016 for me.
http://icml.cc/2016/?page_id=1649
Deep Reinforcement Learning David Silver (Google DeepMind)
A major goal of artificial intelligence is to create general-purpose agents that can perform effectively in a wide range of challenging tasks. To achieve this goal, it is necessary to combine reinforcement learning (RL) agents with powerful and flexible representations. The key idea of deep RL is to use neural networks to provide this representational power. In this tutorial we will present a family of algorithms in which deep neural networks are used for value functions, policies, or environment models. State-of-the-art results will be presented in a variety of domains, including Atari games, 3D navigation tasks, continuous control domains and the game of Go.
Graph Sketching, Streaming, and Space-Efficient Optimization Sudipto Guha (University of Pennsylvania) and Andrew McGregor (University of Massachusetts Amherst)
Graphs are one of the most commonly used data representation tools but existing algorithmicapproaches are typically not appropriate when the graphs of interest are dynamic, stochastic, ordo not fit into the memory of a single machine. Such graphs are often encountered as machinelearning techniques are increasingly deployed to manage graph data and large-scale graph opti-mization problems. Graph sketching is a form of dimensionality reduction for graph data that isbased on using random linear projections and exploiting connections between linear algebra andcombinatorial structure. The technique has been studied extensively over the last five years andcan be applied in many computational settings. It enables small-space online and data streamcomputation where we are permitted only a few passes (ideally only one) over an input sequence ofupdates to a large underlying graph. The technique parallelizes easily and can naturally be appliedin various distributed settings. It can also be used in the context of convex programming to enablemore efficient algorithms for combinatorial optimization problems such as correlation clustering.One of the main goals of the research on graph sketching is understanding and characterizing thetypes of graph structure and features that can be inferred from compressed representations of therelevant graphs.
Abstraction in Reinforcement Learning Daniel Mankowitz, Shie Mannor (Technion Israel Institute of Technology), Timothy Mann (Google Deepmind)
Many real-world domains can be modeled using some form of abstraction. An abstraction is an important tool that enables an agent to focus less on the lower level details of a task and more on solving the task at hand. Temporal abstraction (i.e., options or skills) as well as spatial abstraction (i.e., state space representation) are two important examples. The goal of this workshop is to provide a forum to discuss the current challenges in designing as well as learning abstractions in real-world Reinforcement Learning (RL). http://rlabstraction2016.wix.com/icml
Neural Networks Back To The Future
Léon Bottou (Facebook), David Grangier (Facebook), Tomas Mikolov (Facebook), John Platt (Google)
As research in deep learning is extremely active today, we could take a step back and examine its foundations. We propose to have a critical look at previous work on neural networks, and try to have a better understanding of the differences with today’s work. Previous work can point at promising directions to follow, pitfalls to avoid, ideas and assumptions to revisit. Similarly, today’s progress can allow a critical examination of what should still be investigated, what has been answered… https://sites.google.com/site/nnb2tf
Graph embeddings, Clustering周りで面白そうなの集めた
Revisiting Semi-Supervised Learning with Graph Embeddings ,Zhilin Yang Carnegie Mellon University, Ruslan Salakhudinov U. of Toronto, William Cohen CMU
We present a semi-supervised learning framework based on graph embeddings. Given a graph between instances, we train an embedding for each instance to jointly predict the class label and the neighborhood context in the graph. We develop both transductive and inductive variants of our method. In the transductive variant of our method, the class labels are determined by both the learned embeddings and input feature vectors, while in the inductive variant, the embeddings are defined as a parametric function of the feature vectors, so predictions can be made on instances not seen during training. On a large and diverse set of benchmark tasks, including text classification, distantly supervised entity extraction, and entity classification, we show improved performance over many of the existing models.
Compressive Spectral Clustering ,Nicolas TREMBLAY INRIA Rennes, Gilles Puy INRIA, Remi Gribonval INRIA, Pierre Vandergheynst EPFL
Spectral clustering has become a popular technique due to its high performance in many contexts. It comprises three main steps: create a similarity graph between N objects to cluster, compute the first k eigenvectors of its Laplacian matrix to define a feature vector for each object, and run k-means on these features to separate objects into k classes. Each of these three steps becomes computationally intensive for large N and/or k. We propose to speed up the last two steps based on recent results in the emerging field of graph signal processing: graph filtering of random signals, and random sampling of bandlimited graph signals. We prove that our method, with a gain in computation time that can reach several orders of magnitude, is in fact an approximation of spectral clustering, for which we are able to control the error. We test the performance of our method on artificial and real-world network data.
k-variates++: more pluses in the k-means++
kk-means++ seeding has become a de facto standard for hard clustering algorithms. In this paper, our first contribution is a two-way generalisation of this seeding, kk-variates++, that includes the sampling of general densities rather than just a discrete set of Dirac densities anchored at the point locations, \textit{and} a generalisation of the well known Arthur-Vassilvitskii (AV) approximation guarantee, in the form of a \textit{bias+variance} approximation bound of the \textit{global} optimum. This approximation exhibits a reduced dependency on the “noise” component with respect to the optimal potential — actually approaching the statistical lower bound. We show that kk-variates++ \textit{reduces} to efficient (biased seeding) clustering algorithms tailored to specific frameworks; these include distributed, streaming and on-line clustering, with \textit{direct} approximation results for these algorithms. Finally, we present a novel application of kk-variates++ to differential privacy. For either the specific frameworks considered here, or for the differential privacy setting, there is little to no prior results on the direct application of kk-means++ and its approximation bounds — state of the art contenders appear to be significantly more complex and / or display less favorable (approximation) properties. We stress that our algorithms can still be run in cases where there is \textit{no} closed form solution for the population minimizer. We demonstrate the applicability of our analysis via experimental evaluation on several domains and settings, displaying competitive performances vs state of the art.
CryptoNets: Applying Neural Networks to Encrypted Data with High Throughput and Accuracy ,Nathan Dowlin Princeton, Ran , Kim Laine Microsoft Research, Kristin Lauter Microsoft Research, Michael Naehrig Microsoft Research, John Wernsing Microsoft Research
Applying machine learning to a problem which involves medical, financial, or other types of sensitive data, not only requires accurate predictions but also careful attention to maintaining data privacy and security. Legal and ethical requirements may prevent the use of cloud-based machine learning solutions for such tasks. In this work, we will present a method to convert learned neural networks to CryptoNets, neural networks that can be applied to encrypted data. This allows a data owner to send their data in an encrypted form to a cloud service that hosts the network. The encryption ensures that the data remains confidential since the cloud does not have access to the keys needed to decrypt it. Nevertheless, we will show that the cloud service is capable of applying the neural network to the encrypted data to make encrypted predictions, and also return them in encrypted form. These encrypted predictions can be sent back to the owner of the secret key who can decrypt them. Therefore, the cloud service does not gain any information about the raw data nor about the prediction it made. We demonstrate CryptoNets on the MNIST optical character recognition tasks. CryptoNets achieve 99% accuracy and can make more than 51000 predictions per hour on a single PC. Therefore, they allow high throughput, accurate, and private predictions.
The Variational Nystrom method for large-scale spectral problems ,Max Vladymyrov Yahoo Labs, Miguel Carreira-Perpinan UC Merced
Spectral methods for dimensionality reduction and clustering require solving an eigenproblem defined by a sparse affinity matrix. When this matrix is large, one seeks an approximate solution. The standard way to do this is the Nystrom method, which first solves a small eigenproblem considering only a subset of landmark points, and then applies an out-of-sample formula to extrapolate the solution to the entire dataset. We show that by constraining the original problem to satisfy the Nystrom formula, we obtain an approximation that is computationally simple and efficient, but achieves a lower approximation error using fewer landmarks and less runtime. We also study the role of normalization in the computational cost and quality of the resulting solution.
Fast Stochastic Algorithms for SVD and PCA: Convergence Properties and Convexity ,Ohad Shamir Weizmann Institute of Science
We study the convergence properties of the VR-PCA algorithm introduced by (Shamir, 2015) for fast computation of leading singular vectors. We prove several new results, including a formal analysis of a block version of the algorithm, and convergence from random initialization. We also make a few observations of independent interest, such as how pre-initializing with just a single exact power iteration can significantly improve the runtime of stochastic methods, and what are the convexity and non-convexity properties of the underlying optimization problem.
Convergence of Stochastic Gradient Descent for PCA ,Ohad Shamir Weizmann Institute of Science
We consider the problem of principal component analysis (PCA) in a streaming stochastic setting, where our goal is to find a direction of approximate maximal variance, based on a stream of i.i.d. data points in R^d. A simple and computationally cheap algorithm for this is stochastic gradient descent (SGD), which incrementally updates its estimate based on each new data point. However, due to the non-convex nature of the problem, analyzing its performance has been a challenge. In particular, existing guarantees rely on a non-trivial eigengap assumption on the covariance matrix, which is intuitively unnecessary. In this paper, we provide (to the best of our knowledge) the first eigengap-free convergence guarantees for SGD in the context of PCA. This also partially resolves an open problem posed in \cite{hardt2014noisy}. Moreover, under an eigengap assumption, we show that the same techniques lead to new SGD convergence guarantees with better dependence on the eigengap.
Low-Rank Matrix Approximation with Stability ,Dongsheng Li IBM Research – China, Chao Chen , Qin Lv , Junchi Yan , Li Shang , Stephen Chu
Low-rank matrix approximation has been widely adopted in machine learning applications with sparse data, such as recommender systems. However, the sparsity of the data, incomplete and noisy, introduces challenges to the algorithm stability — small changes in the training data may significantly change the models. As a result, existing low-rank matrix approximation solutions yield low generalization performance, exhibiting high error variance on the training dataset, and minimizing the training error may not guarantee error reduction on the testing dataset. In this paper, we investigate the algorithm stability problem of low-rank matrix approximations. We present a new algorithm design framework, which (1) introduces new optimization objectives to guide stable matrix approximation algorithm design, and (2) solves the optimization problem to obtain stable low-rank approximation solutions with good generalization performance. Experimental results on real-world datasets demonstrate that the proposed work can achieve better prediction accuracy compared with both state-of-the-art low-rank matrix approximation methods and ensemble methods in recommendation task.
Fast Methods for Estimating the Numerical Rank of Large Matrices ,Shashanka Ubaru University of Minnesota, Yousef Saad University of Minnesota
We present two computationally inexpensive techniques for estimating the numerical rank of a matrix, combining powerful tools from computational linear algebra. These techniques exploit three key ingredients. The first is to approximate the projector on the non-null invariant subspace of the matrix by using a polynomial filter. Two types of filters are discussed, one based on Hermite interpolation and the other based on Chebyshev expansions. The second ingredient employs stochastic trace estimators to compute the rank of this wanted eigen-projector, which yields the desired rank of the matrix. In order to obtain a good filter, it is necessary to detect a gap between the eigenvalues that correspond to noise and the relevant eigenvalues that correspond to the non-null invariant subspace. The third ingredient of the proposed approaches exploits the idea of spectral density, popular in physics, and the Lanczos spectroscopic method to locate this gap.
Fast k-Nearest Neighbour Search via Dynamic Continuous Indexing ,Ke Li UC Berkeley, Jitendra Malik
Existing methods for retrieving k-nearest neighbours suffer from the curse of dimensionality. We argue this is caused in part by inherent deficiencies of space partitioning, which is the underlying strategy used by almost all existing methods. We devise a new strategy that avoids partitioning the vector space and present a novel randomized algorithm that runs in time linear in dimensionality and sub-linear in the size of the dataset and takes space constant in dimensionality and linear in the size of the dataset. The proposed algorithm allows fine-grained control over accuracy and speed on a per-query basis, automatically adapts to variations in dataset density, supports dynamic updates to the dataset and is easy-to-implement. We show appealing theoretical properties and demonstrate empirically that the proposed algorithm outperforms locality-sensitivity hashing (LSH) in terms of approximation quality, speed and space efficiency.
Community Recovery in Graphs with Locality ,Yuxin Chen Stanford University, Govinda Kamath Stanford University, Changho Suh KAIST, David Tse Stanford University
Motivated by applications in domains such as social networks and computational biology, we study the problem of community recovery in graphs with locality. In this problem, pairwise noisy measurements of whether two nodes are in the same community or different communities come mainly or exclusively from nearby nodes rather than uniformly sampled between all nodes pairs, as in most existing models. We present an algorithm that runs linearly in the number of measurements and which achieves the information theoretic limit for exact recovery.
Fast k-means with accurate bounds ,James Newling Idiap Research Institute, Francois Fleuret Idiap research institute
Our first contribution is an accelerated exact k-means algorithm, which performs better than the current state-of-the-art low-dimensional algorithm in 18 of 22 experiments, running up to 3x faster. Our second contribution is a general improvement of existing state-of-the-art accelerated exact k-means algorithms, which relies on better estimates of distance bounds used to reduce the required number of distance calculations, and leads to a speedup in 36 of 44 experiments, running up to 1.8x faster. We have conducted experiments with our own implementations of existing state-of-the-art methods to ensure homogeneous evaluation of performance, and we show that these new implementations perform as well or better than existing available implementations. We also propose simplified variants of standard approaches and show that they are faster than their fully-fledged counterparts in 59 of 62 experiments.
KK-Means Clustering with Distributed Dimensions ,Hu Ding State University of New York at Buffalo, Lingxiao Huang , Yu Liu Tsinghua University, Jian Li
Distributed clustering has attracted significant attention in recent years. In this paper, we study the kk-means problem in the distributed dimension setting, where the dimensions of the data are partitioned across multiple machines. We provide new approximation algorithms, which incur low communication costs and achieve constant approximation ratios. The communication complexity of our algorithms significantly improve on existing algorithms. We also provide the first communication lower bound, which nearly matches our upper bound in a certain range of parameter setting. Our experimental results show that our algorithms outperform existing algorithms on real data-sets in the distributed dimension setting.
Non-negative Matrix Factorization under Heavy Noise ,Jagdeep Pani Indian Institute of Science, Ravindran Kannan Microsoft Reseach India, Chiranjib Bhattacharya , Navin Goyal Microsoft Research India
The Noisy Non-negative Matrix factorization (NMF) is: given a data matrix A (d x n), find non-negative matrices B;C (d x k, k x n respy.) so that A = BC +N, where N is a noise matrix. Existing polynomial time algorithms with proven error guarantees require EACH column N_.j to have l1 norm much smaller than ||(BC).j ||_1, which could be very restrictive. In important applications of NMF such as Topic Modeling as well as theoretical noise models (e.g. Gaussian with high sigma), almost EVERY column of N.j violates this condition. We introduce the heavy noise model which only requires the average noise over large subsets of columns to be small. We initiate a study of Noisy NMF under the heavy noise model. We show that our noise model subsumes noise models of theoretical and practical interest (for e.g. Gaussian noise of maximum possible sigma). We then devise an algorithm TSVDNMF which under certain assumptions on B,C, solves the problem under heavy noise. Our error guarantees match those of previous algorithms. Our running time of O(k.(d+n)^2) is substantially better than the O(d.n^3) for the previous best. Our assumption on B is weaker than the “Separability” assumption made by all previous results. We provide empirical justification for our assumptions on C. We also provide the first proof of identifiability (uniqueness of B) for noisy NMF which is not based on separability and does not use hard to check geometric conditions. Our algorithm outperforms earlier polynomial time algorithms both in time and error, particularly in the presence of high noise.
Autoencoding beyond pixels using a learned similarity metric ,Anders B. L. Larsen Technical University of Denmar, Søren Kaae Sønderby University of Copenagen, Hugo Larochelle Twitter, Ole Winther Technical University of Denmark
We present an autoencoder that leverages learned representations to better measure similarities in data space. By combining a variational autoencoder with a generative adversarial network we can use learned feature representations in the GAN discriminator as basis for the VAE reconstruction objective. Thereby, we replace element-wise errors with feature-wise errors to better capture the data distribution while offering invariance towards e.g. translation. We apply our method to images of faces and show that it outperforms VAEs with element-wise similarity measures in terms of visual fidelity. Moreover, we show that the method learns an embedding in which high-level abstract visual features (e.g. wearing glasses) can be modified using simple arithmetic.
Matrix Eigendecomposition via Doubly Stochastic Riemannian Optimization ,Zhiqiang Xu Institute of Infocomm Research, Peilin Zhao I2R
Matrix eigendecomposition plays a significant role in machine learning, which is a non-convex quadratic optimization problem. In this paper, we propose a doubly stochastic Riemannian gradient DSRG method to address this problem. Specifically, the double stochasticity comes from the Riemannian counterparts of stochastic Euclidean gradient ascent and stochastic Euclidean coordinate ascent. With the double stochasticity, DSRG can greatly reduce the complexity, and completely avoid the matrix inversion, so that it is very suitable to large matrices eigendecomposition. We theoretically analyze its convergence properties and empirically validate it on real-world datasets. Encouraging experimental results demonstrate its advantages over the non-stochastic counterpart.
Persistence weighted Gaussian kernel for topological data analysis ,Genki Kusano Tohoku University, Yasuaki Hiraoka , Kenji
Topological data analysis (TDA) is an emerging mathematical concept for characterizing shapes in complex data. In TDA, persistence diagrams are widely recognized as a useful descriptor of data, and can distinguish robust and noisy topological properties. This paper proposes a kernel method on persistence diagrams to develop a statistical framework in TDA. The proposed kernel satisfies the stability property and provides explicit control on the effect of persistence. Furthermore, the method allows a fast approximation technique. The method is applied into practical data on proteins and oxide glasses, and the results show the advantage of our method compared to other relevant methods on persistence diagrams.
Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user’s needs. To address this, several methods incorporate constraints obtained from users into clustering algorithms, but unfortunately do not apply to hierarchical clustering. We design an interactive Bayesian algorithm that incorporates user interaction into hierarchical clustering while still utilizing the geometry of the data by sampling a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.
Interactive Bayesian Hierarchical Clustering ,Sharad Vikram UCSD, Sanjoy Dasgupta UCSD
Clustering is a powerful tool in data analysis, but it is often difficult to find a grouping that aligns with a user’s needs. To address this, several methods incorporate constraints obtained from users into clustering algorithms, but unfortunately do not apply to hierarchical clustering. We design an interactive Bayesian algorithm that incorporates user interaction into hierarchical clustering while still utilizing the geometry of the data by sampling a constrained posterior distribution over hierarchies. We also suggest several ways to intelligently query a user. The algorithm, along with the querying schemes, shows promising results on real data.
Cross-graph Learning of Multi-relational Associations ,Hanxiao Liu , Yiming Yan
Cross-graph Relational Learning (CGRL) refers to the problem of predicting the strengths or labels of multi-relational tuples of heterogeneous object types, through the joint inference over multiple graphs which specify the internal connections among each type of objects. CGRL is an open challenge in machine learning due to the daunting number of all possible tuples to deal with when the numbers of nodes in multiple graphs are large, and because the labeled training instances are extremely sparse as typical. Existing methods such as tensor factorization or tensor-kernel machines do not work well because of the lack of convex formulation for the optimization of CGRL models, the poor scalability of the algorithms in handling combinatorial numbers of tuples, and/or the non-transductive nature of the learning methods which limits their ability to leverage unlabeled data in training. This paper proposes a novel framework which formulates CGRL as a convex optimization problem, enables transductive learning using both labeled and unlabeled tuples, and offers a scalable algorithm that guarantees the optimal solution and enjoys a constant time complexity with respect to the sizes of input graphs. In our experiments with a subset of DBLP publication records and an Enzyme multi-source dataset, the proposed method successfully scaled to the large cross-graph inference problem, and outperformed other representative approaches significantly.
Speeding up k-means by approximating Euclidean distances via block vectors ,Thomas Bottesch Ulm University, Thomas Bühler Avira, Markus Kächele Ulm University
This paper introduces a new method to approximate Euclidean distances between points using block vectors in combination with the Holder inequality. By defining lower bounds based on the proposed approximation, cluster algorithms can be considerably accelerated without loss of quality. In extensive experiments, we show a considerable reduction in terms of computational time in comparison to standard methods and the recently proposed Yinyang k-means. Additionally we show that the memory consumption of the presented clustering algorithm does not depend on the number of clusters, which makes the approach suitable for large scale problems.
Faster Eigenvector Computation via Shift-and-Invert Preconditioning ,Dan Garber TTI Chicago, Elad Hazan Princeton University, Chi Jin UC Berkeley, Sham , Cameron Musco Massachusetts Institute of Technology, Praneeth Netrapalli Microsoft Research, Aaron Sidford Microsoft Research
We give faster algorithms and improved sample complexities for the fundamental problem of estimating the top eigenvector of a given matrix. In particular, given an explicit matrix A∈\Rn×dA∈\Rn×d such that Σ=A⊤AΣ=A⊤A, we show how to compute an ϵϵ approximate top eigenvector in time \otilde([\nnz(A)+d\sr(A)\gap2]⋅log1/ϵ)\otilde([\nnz(A)+d\sr(A)\gap2]⋅log1/ϵ). Here \nnz(A)\nnz(A) is the number of nonzeros in AA, \sr(A)\sr(A) is the stable rank, and \gap\gap is the relative eigengap. We also consider an online setting in which, given a stream of i.i.d. samples from a distribution DD with covariance matrix ΣΣ and a vector x0x0 which is an O(\gap)O(\gap) approximate top eigenvector for ΣΣ, we show how to refine x0x0 to an ϵϵ approximation using O(\var(D)\gap⋅ϵ)O(\var(D)\gap⋅ϵ) samples from DD. Here \var(D)\var(D) is a natural notion of variance. Combining our algorithm with previous work to initialize x0x0, we obtain improved sample complexity and runtime results under a variety of assumptions on \D\D. Notably, we show that, for general distributions, we achieve asymptotically optimal accuracy as a function of sample size as the number of samples grows large. We achieve our results by combining a new robust analysis of the classic method of \emph{shift-and-invert} preconditioning, which reduces eigenvector computations to \emph{approximately} solving a sequence of linear systems, with fast stochastic gradient based system solvers to achieve our claims.
Efficient Algorithms for Large-scale Generalized Eigenvector Computation and CCA ,Rong Ge , Chi Jin UC Berkeley, Sham , Praneeth Netrapalli Microsoft Research, Aaron Sidford Microsoft Research
This paper considers the problem of canonical-correlation analysis (CCA) and, more broadly, the generalized eigenvector problem for a pair of symmetric matrices. These are two fundamental problems in data analysis and scientific computing with numerous applications in machine learning and statistics. We provide simple iterative algorithms, with improved runtimes, for solving these problems that are globally linearly convergent with moderate dependencies on the condition numbers and eigenvalue gaps of the matrices involved. We obtain our results by reducing CCA to the top-kk generalized eigenvector problem. We solve this problem through a general framework that simply requires black box access to an approximate linear system solver. Instantiating this framework with accelerated gradient descent we obtain a running time of \orderzkκ√ρlog(1/ϵ)log(kκ/ρ)\orderzkκρlog(1/ϵ)log(kκ/ρ) where zz is the total number of nonzero entries, κκ is the condition number and ρρ is the relative eigenvalue gap of the appropriate matrices. Our algorithm is linear in the input size and the number of components kk. This is essential for handling large-scale matrices that appear in practice. To the best of our knowledge this is the first such algorithm with global linear convergence. We hope that our results prompt further research and ultimately improve the practical running time for performing these important data analysis procedures on large data sets.