Particle swarm optimization guides a swarm of particles through a multidimensional search space; each particle's velocity is updated iteratively, per dimension, using a pseudo-random number generator (a particle does not know where to go, so it takes a partly arbitrary path). In many cases it works well in terms of precision-recall, also for transient events, when the matrix is relatively sparse.
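The per-dimension velocity update can be sketched as below; the inertia weight `w` and the acceleration coefficients `c1`, `c2` are assumed conventional values, not taken from the note above.

```python
import random

def pso_step(positions, velocities, personal_best, global_best,
             w=0.7, c1=1.5, c2=1.5):
    """One velocity/position update per particle and per dimension.

    The r1, r2 draws are the pseudo-random numbers the note refers to:
    each dimension gets its own random pull toward the particle's
    personal best and the swarm's global best.
    """
    for i, (x, v) in enumerate(zip(positions, velocities)):
        for d in range(len(x)):
            r1, r2 = random.random(), random.random()
            v[d] = (w * v[d]
                    + c1 * r1 * (personal_best[i][d] - x[d])
                    + c2 * r2 * (global_best[d] - x[d]))
            x[d] += v[d]
    return positions, velocities
```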

history, social -> CF, MF, graph mining, NN (these models fit the specifics of social media); demographic -> online stats, association rules; tag -> content-based model, distribution prediction model

shallow model (logistic regression) vs. deep model (decision tree, deep neural network): a deep model can find relationships automatically, whereas with a shallow model we need a lot of engineering to capture feature combinations
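A minimal illustration of the engineering burden the note mentions: for a shallow model we must hand-craft feature combinations ourselves (the `cross_features` helper and the field names are hypothetical):

```python
def cross_features(row, pairs):
    """Manually engineered feature crosses for a shallow model.

    A logistic regression cannot learn that e.g. (country, device)
    interact unless we hand it the crossed feature explicitly; a tree
    or a deep network can discover such interactions on its own.
    """
    crossed = dict(row)
    for a, b in pairs:
        crossed[f"{a}_x_{b}"] = f"{row[a]}|{row[b]}"
    return crossed
```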

LR (logistic regression): training linear, prediction efficient, feature engineering cost high (BAD); decision tree (mostly random forest): training log-linear (BAD), prediction efficient (proportional to tree depth), feature engineering cost low; deep neural network: training linear, prediction costly (BAD), feature engineering cost low

RMSE is a rather limited metric, derived from the Euclidean-space norm calculation
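The note's point in code: RMSE is just the Euclidean (L2) norm of the error vector scaled by 1/sqrt(n), which is why it inherits the limitations of that norm:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: the L2 norm of the error vector,
    divided by sqrt(n)."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```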

if the current model is ineffective: cluster the sparse data, make it dense, re-define the problem

greedy approaches, e.g. the Kullback-Leibler divergence criterion, or the GLIE Boltzmann exploration technique (by decreasing the temperature we move from exploring the search space toward exploiting it); Kullback-Leibler hard k-means avoids exhaustive search, with the rating depending on the KL divergence
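A sketch of Boltzmann (softmax) exploration as described above; the Q-values and the temperature schedule are up to the caller, and GLIE behaviour comes from decreasing the temperature over time:

```python
import math, random

def boltzmann_choice(q_values, temperature):
    """Sample an action with probability proportional to exp(Q/T).

    As T -> 0 the choice becomes greedy; a decreasing temperature
    schedule therefore shifts from exploration to exploitation (GLIE).
    """
    m = max(q_values)  # subtract max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, acc = random.random(), 0.0
    for action, p in enumerate(probs):
        acc += p
        if r <= acc:
            return action
    return len(q_values) - 1
```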

a family of k-order statistic (k-OS) losses: with k-OS 5, mean-max rank improves but p@k and r@k get worse; the opposite holds for k-OS 1, so there is a mean-max vs. precision/recall trade-off in the context of k-OS; it also somewhat increases diversity over the highly ranked items; the MF here optimizes a smoothed min/max rank

stochastic methods may do better thanks to randomness; deterministic ones (e.g. CCD++) avoid the tuning of learning rates

Partial Random Method: fully random updates are faster and stable but discontinuous, sequential updates are continuous but slow; so we apply sequential updating within an arbitrary block and randomness for selecting the block
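The block-selection idea can be sketched as follows; the gradient callback, the learning rate, and the block layout are placeholders, not from the source:

```python
import random

def partial_random_pass(x, grad, blocks, lr=0.1):
    """Partial Random Method sketch: pick one block uniformly at random
    (the randomized part), then update the coordinates inside that
    block in sequential order (the continuous part)."""
    block = random.choice(blocks)
    for i in block:  # sequential sweep within the chosen block
        x[i] -= lr * grad(x, i)
    return x
```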

exploiting asymmetry also involves an asymmetric conditional probability measure: the conditional similarity probability is defined as an asymmetric product of conditionals
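A minimal sketch of such an asymmetric measure, assuming co-occurrence counts as the underlying statistic (the exact definition in the source may differ):

```python
def conditional_similarity(cooccur, count, i, j):
    """Asymmetric conditional probability: P(j | i) = #(i and j) / #(i).

    sim(i, j) != sim(j, i) whenever the items' popularities differ,
    which is exactly the asymmetry being exploited.
    """
    pair = tuple(sorted((i, j)))  # co-occurrence counts are unordered
    return cooccur.get(pair, 0) / count[i] if count[i] else 0.0
```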

matrix factorization and LDA (latent Dirichlet allocation) both project users and items into low-dimensional spaces, so we need a transform that makes the two comparable and lets us combine them; for that we simply pass the vectors through the exponential function and then normalize; the MSE is minimized with gradient descent (using L-BFGS), and entities are sampled via Gibbs sampling
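The exponentiate-and-normalize transform is just a softmax, mapping an unconstrained latent vector onto the probability simplex so it lives in the same space as an LDA topic distribution; a minimal sketch:

```python
import math

def to_simplex(vec):
    """Softmax: exponentiate, then normalize. The max is subtracted
    first for numerical stability; the output sums to 1."""
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]
```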

RF (random forest) is an ensemble of multiple decision trees, used for document/image analysis; relevance is measured with NDCG, and diversity with subtopic recall, α-NDCG (which penalizes redundancy), and per-topic ERR (expected reciprocal rank)
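NDCG, the relevance metric named above, in a few lines (the standard log2 discount is assumed):

```python
import math

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG of the ranked list divided by the DCG of the ideal
    (relevance-sorted) list; 1.0 means a perfect ranking."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0
```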

Banach fixed-point theorem used to prove convergence of a dynamical system; an orthogonal NMTF model preserves user-item similarity and bi-clustering; orthogonal nonnegative matrix factorization, further encouraging sparsity on the associations; multiplicative updating rule; mixed-cost flow design; accuracy measured with MRR (mean reciprocal rank), plus similarity, diversity and long-tail metrics; plain matrix factorization did not perform well compared to the simplex-based variant with encouraged sparsity and the simplex variant with the mixed-cost flow design
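The source uses an orthogonal tri-factorization; as a simpler illustration of the multiplicative updating rule it mentions, here is the classic Lee-Seung update for plain NMF (a simplification, not the paper's exact rule):

```python
import numpy as np

def nmf_multiplicative_step(V, W, H, eps=1e-9):
    """One multiplicative update for V ≈ W @ H under the Frobenius
    objective: every entry is multiplied by a nonnegative ratio, so the
    factors stay nonnegative; eps guards against division by zero."""
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H
```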

truncated SVD: the closest rank-l matrix to the original (in the arg-min sense), representing the latent vectors in an l-dimensional space; removing user vectors from the system (Netflix Prize), the asymmetric model uses only the item vectors (fewer parameters); normalization in hypergraph learning is the key to diversity; turning the graph into a hypergraph: every item becomes a vertex and every user a hyperedge; take the first l eigenvectors of the NHC transformation (Zhou 2007), then the rank-l truncated SVD; hypergraph learning is equivalent to asymmetric SVD, and clusters of items remain clustered in the latent space under asymmetric SVD; normalization improves the diversity of recommendations, so with normalized asymmetric SVD we get more diversity and therefore less popularity bias
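A rank-l truncated SVD sketch using NumPy; by the Eckart-Young theorem this is the closest rank-l matrix to the original in the Frobenius norm, which is the arg-min property the note refers to:

```python
import numpy as np

def truncated_svd(R, l):
    """Rank-l truncated SVD of a rating matrix R.

    Returns user factors P and item factors Q with R ≈ P @ Q.T;
    an asymmetric model would then keep only the item factors Q.
    """
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    P = U[:, :l] * s[:l]   # users in the l-dim latent space
    Q = Vt[:l].T           # items in the l-dim latent space
    return P, Q
```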

random walks: resets and teleporting are allowed in the RW, to calculate the distribution of events; personalized PageRank (PPR): computing the exact PPR vector for each source is slow, so use an approximation; GraphChi, TurboGraph, X-Stream: vertex-centric models; store the walks in memory, run many walks at a time; if a walk is within a specific bucket, it is stored in that bucket; it is assumed that 2,000 walks of 5 hops equal 10,000 walks, which is just an approximation
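Personalized PageRank with the reset/teleport behaviour described above, via plain power iteration on an adjacency-list graph; the teleport probability `alpha` and iteration count are assumed defaults:

```python
def personalized_pagerank(adj, source, alpha=0.15, iters=50):
    """PPR by power iteration: with prob. 1 - alpha move to a uniform
    out-neighbour, with prob. alpha teleport back to the source (the
    'reset'); dangling nodes also reset to the source."""
    n = len(adj)
    p = [0.0] * n
    p[source] = 1.0
    for _ in range(iters):
        nxt = [0.0] * n
        nxt[source] = alpha  # all teleport mass returns to the source
        for u, nbrs in enumerate(adj):
            if not nbrs:
                nxt[source] += (1 - alpha) * p[u]
                continue
            share = (1 - alpha) * p[u] / len(nbrs)
            for v in nbrs:
                nxt[v] += share
        p = nxt
    return p
```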

about learning, from "Bootstrap Learning via Modular Concept Discovery", Dechter et al.: for a non-smooth solution space and an uninformative reward signal, with no prior knowledge, the learner faces a combinatorial search problem; it explores a subset of the domain, guided by a stochastic grammar; tested on symbolic regression and boolean circuit learning, the learner quickly becomes able to solve more complex problems, but because it seeks answers mostly in places that previously contained answers, it may get stuck in a local minimum; see: http://dash.harvard.edu/bitstream/handle/1/11223358/dechter-bootstrap-ijcai-2013.pdf?sequence=1