By theknight - 3 days ago
sadikkapadia1 - 3 days ago
Content-Based: If you can represent your products as vectors, you have a distance between any two products, and with it item-item recommendations. You can use all kinds of embeddings to get there; some techniques we tried are word2vec embeddings of user navigation, autoencoding of features with neural networks, dimensionality reduction with PCA, ALS, etc. There are lots of libs for solving these problems since it's a well-studied field, usually numpy, and for finding the neighbors we use ANN search via scikit-learn, because if you have millions of items, you can't just compute the distance between all the pairs.
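A minimal sketch of that item-item pipeline, assuming each item already has an embedding vector (the embeddings, catalog size, and dimensions here are made up); scikit-learn's NearestNeighbors stands in for a real ANN index:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(1000, 32))  # hypothetical item embeddings

# Index the catalog. For millions of items you'd swap in an
# approximate index; exact search is fine at this toy scale.
index = NearestNeighbors(n_neighbors=6, metric="cosine").fit(item_vecs)

# Neighbors of item 0; the first hit is the item itself (distance 0).
dist, idx = index.kneighbors(item_vecs[:1])
similar_items = idx[0][1:]  # drop the query item -> item-item recs
```

The same lookup works for any embedding source (word2vec, autoencoder, PCA), since all the recommender needs is a vector per item.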
Collaborative Filtering: here you use triples of user behavior, <user, item, rating>. The Surprise lib in Python works well, and you have MLlib from Spark too. These techniques are called matrix factorization; they also give you an embedding of each item and user, so you can apply the content-based techniques to find user-user and item-item recommendations alongside the user-item ones.
Hybrid Models: These are the models that use both behavior and features of the users and items. LightFM is a good lib that works well, but you can also model this with other tools like neural networks ( https://ai.google/research/pubs/pub45530 ).
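A sketch of the hybrid idea in the LightFM style, where a user or item is represented as the sum of its feature embeddings, so even a cold-start item with no behavior still gets a vector (the feature names and dimensions here are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
k = 8
# One learned embedding per feature; hypothetical feature vocabulary.
feature_emb = {f: rng.normal(scale=0.1, size=k)
               for f in ["likes_tech", "age_18_25",
                         "category_news", "brand_acme"]}

def represent(features):
    """An entity's vector is the sum of its feature embeddings."""
    return np.sum([feature_emb[f] for f in features], axis=0)

user_vec = represent(["likes_tech", "age_18_25"])
item_vec = represent(["category_news", "brand_acme"])
score = user_vec @ item_vec  # higher dot product = stronger recommendation
```

Training then adjusts the feature embeddings (e.g. with a BPR or logistic loss) so that observed user-item pairs score higher than unobserved ones.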
The challenges depend on the company: recommending a small number of items to a large number of users is not the same problem as recommending a large number of items to a small number of users.
There is a whole specialization on Coursera that is really good: https://www.coursera.org/specializations/recommender-systems
chudi - 3 days ago
Technical aspects of how you train your models and such are fun, but way, way down the list of things that are likely to matter in the short to medium term. Like, data scientists are nice to have, but you're not really going to be able to fully utilize them until you have the capability to build, deploy, and test a model at scale. If going third party helps you do this, you probably should.
splonk - 3 days ago
At Theneeds we were recommending news, i.e. fresh content, based on users' interests and other features. Because the content is fresh, you can't easily collect enough data for a proper collaborative system.
Our algo was essentially the Reddit algo, where a piece of content gets a rank based on time and the log of its score. In Reddit, the score is upvotes minus downvotes. At Theneeds we had a more complex score including social signals (likes on FB / RTs on TW), so we could compute a meaningful score even without a big community of users. The other difference with respect to Reddit was having different scores and different paces (multipliers) based on categories of content, so for example tech and politics news from newspapers updated faster than travel news from magazines. And by normalizing the ranks, you can merge multiple categories into one -- a feature that I think Reddit also added.
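That ranking can be sketched roughly as follows; `hot_rank` and its parameters are illustrative names, with `pace` playing the role of the per-category multiplier the comment describes:

```python
import math

def hot_rank(score, age_seconds, pace=45000.0):
    """Reddit-style hot rank: log of the score plus a linear time penalty.

    A 10x jump in score is worth one rank unit; `pace` controls how fast
    fresh content overtakes old content (smaller pace = faster decay,
    i.e. the fast-moving tech/politics categories).
    """
    order = math.log10(max(abs(score), 1))
    sign = 1 if score > 0 else (-1 if score < 0 else 0)
    return sign * order - age_seconds / pace
```

Because ranks from different categories live on the same scale, normalizing them lets you interleave several category timelines into one merged feed.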
As for the code/stack, it was custom-written in Python. We used Redis to cache user timelines as sorted sets (including for guest users, i.e. the default top news for each category). In Redis you can merge sorted sets, and we used that as an efficient way to create the timeline when a new user signed up.
[1] https://medium.com/@Pinterest_Engineering/introducing-pixie-...
Edit: added more details about tech.
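The sorted-set merge used for new-user timelines can be imitated in pure Python; this stands in for Redis's ZUNIONSTORE (which sums weighted scores across sets) -- the story IDs, scores, and weights below are made up:

```python
from collections import defaultdict

def zunion(*timelines, weights=None):
    """Pure-Python stand-in for Redis ZUNIONSTORE over sorted sets.

    Each timeline maps story_id -> rank score; overlapping stories get
    their weighted scores summed. Returns stories highest-rank first.
    """
    weights = weights or [1.0] * len(timelines)
    merged = defaultdict(float)
    for w, timeline in zip(weights, timelines):
        for story, score in timeline.items():
            merged[story] += w * score
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Default per-category top news, cached for guest users.
tech = {"story_a": 9.1, "story_b": 7.4}
travel = {"story_c": 8.0, "story_b": 1.0}

# Seed timeline for a freshly signed-up user interested in both.
default_timeline = zunion(tech, travel)
```

In production the same operation is a single server-side Redis command, which is why it was an efficient way to bootstrap a new user's timeline.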
ecesena - 3 days ago
prades - 3 days ago
I can train on multi-GB datasets with only LightFM and multiple CPUs.
Another interesting package is Implicit. Although not as complete as LightFM when it comes to algorithms or APIs, it really shines when it comes to optimizations: besides native CUDA kernels for BPR and ALS, it has an important speedup, the Conjugate Gradient method, which makes it faster than Spark in some benchmarks.
But nowadays my work usually requires more customized hybrid models, which I typically start from a base BPR implementation I have in Keras.
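A base BPR implementation of the kind mentioned can be sketched with plain numpy SGD instead of Keras (toy implicit-feedback data and illustrative hyperparameters; a real version would add negative-sampling strategies and minibatching):

```python
import numpy as np

# BPR maximizes ln sigmoid(score(u,i) - score(u,j)) over triples where
# user u interacted with item i but not with sampled negative item j.
rng = np.random.default_rng(0)
n_users, n_items, k = 5, 10, 8
# Toy implicit feedback: each user interacted with two items.
positives = {u: {u, (u + 1) % n_items} for u in range(n_users)}

U = rng.normal(scale=0.1, size=(n_users, k))
V = rng.normal(scale=0.1, size=(n_items, k))
lr, reg = 0.05, 0.01

for _ in range(3000):
    u = int(rng.integers(n_users))
    i = int(rng.choice(list(positives[u])))   # observed positive item
    j = int(rng.integers(n_items))            # sampled negative item
    if j in positives[u]:
        continue
    x = U[u] @ (V[i] - V[j])                  # pairwise score difference
    g = 1.0 / (1.0 + np.exp(x))               # sigmoid(-x), the BPR gradient weight
    du = g * (V[i] - V[j]) - reg * U[u]
    di = g * U[u] - reg * V[i]
    dj = -g * U[u] - reg * V[j]
    U[u] += lr * du
    V[i] += lr * di
    V[j] += lr * dj
```

After training, `U[u] @ V.T` ranks the catalog for user u, with observed items pushed above the sampled negatives.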
eggie5 - 21 hours ago
seektable - 2 days ago
rwieruch - 2 days ago
I grow my own analysis code but use search APIs for storage and access (Lucene or Algolia)
itronitron - 2 days ago
Topgamer7 - 3 days ago