47. Feature stores in an embedding world
We all know feature stores are important in an MLOps workflow. But how does that change when what you store is embeddings?
Introduction
By this point, we are all familiar with feature stores. They are very helpful to:
Democratize usage: features can be accessed by different teams.
Multi-modality support: batch, real-time, and RPC features with online and offline data parity.
Feature transformers: set up chains of transformations at training/serving time.
Historical and near-real-time data availability.
I have talked about them extensively in previous articles:
#17 Uber's Offline Platform For Optimal Feature Discovery.
#20 Machine Learning Features store challenges from Constructor.io
However, things are changing. It’s becoming more common to “abstract” away the feature engineering step and just create embeddings. What has been observed is that:
Downstream systems require less supervised data and provide a quality lift compared to hand-tuned features.
But that adds a different type of overhead. We can now describe the offline and online systems as follows:
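Here is a minimal sketch of the two paths, assuming a batch job that publishes a versioned embedding table and an online key-value lookup with a fallback for unseen entities (all names, `OnlineEmbeddingStore` included, are illustrative, not a specific product's API):

```python
import numpy as np

# --- Offline path: a batch job embeds all entities and publishes a versioned table ---
def build_embedding_table(entities, embed_fn, version):
    return {(e, version): embed_fn(e) for e in entities}

# --- Online path: low-latency lookup with a fallback for cold-start entities ---
class OnlineEmbeddingStore:
    def __init__(self, table, dim, version):
        self.table = table              # in practice: Redis, DynamoDB, etc.
        self.version = version
        self.fallback = np.zeros(dim)   # default vector for unseen entities

    def lookup(self, entity_id):
        return self.table.get((entity_id, self.version), self.fallback)

# Toy encoder standing in for a real embedding model.
def toy_embed(entity, dim=8):
    rng = np.random.default_rng(abs(hash(entity)) % (2**32))
    return rng.standard_normal(dim)

table = build_embedding_table(["user_1", "user_2"], toy_embed, version="v1")
store = OnlineEmbeddingStore(table, dim=8, version="v1")
vector = store.lookup("user_1")  # served at request time
```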
Let’s look at the challenges.
Challenges
1. Tail scalability challenge
The long tail of entities is a major issue: the majority of entities are rare!
This means a large number of patterns is needed to resolve the tail, making it difficult to scale a system that can learn them all.
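To make the claim concrete, here is a tiny sketch, with made-up numbers, of how you might measure how heavy the tail is from an interaction log:

```python
from collections import Counter

# Hypothetical interaction log: a couple of head entities plus many rare ones.
events = ["e1"] * 500 + ["e2"] * 300 + [f"rare_{i}" for i in range(200)]

counts = Counter(events)
threshold = 5  # entities seen fewer than this many times count as "tail"
tail = [e for e, c in counts.items() if c < threshold]
print(f"{len(tail) / len(counts):.0%} of entities are tail, "
      f"yet they cover only {sum(counts[e] for e in tail) / len(events):.0%} of traffic")
```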
2. Memory usage
Embedding storage grows linearly with the number of entities, and the extra computation affects latency.
It's also much harder to fit everything on a device!
One solution is to keep only the "top-k entity" embeddings, dropping memory consumption while retaining acceptable accuracy.
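A back-of-the-envelope sketch of both the memory math and the pruning idea (mapping pruned entities to the mean of the kept embeddings is just one possible fallback choice, not something prescribed here):

```python
import numpy as np

# Storage grows linearly with entity count: 100M entities x 256 dims x fp32.
n_entities, dim = 100_000_000, 256
print(f"Full table: {n_entities * dim * 4 / 1e9:.0f} GB")  # ~102 GB

def prune_top_k(embeddings, frequencies, k):
    """Keep only the k most frequent entities; everything else maps to a
    shared fallback vector (here: the mean of the kept embeddings)."""
    keep = sorted(frequencies, key=frequencies.get, reverse=True)[:k]
    fallback = np.mean([embeddings[e] for e in keep], axis=0)
    return {e: embeddings[e] for e in keep}, fallback
```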
3. Embeddings in non-English languages
There is a lack of equally abundant training resources in languages other than English.
Memory usage also multiplies: storing per-language embeddings scales the table size by the number of languages.
4. Embedding stability
Updating an embedding model is tricky business.
Since it is one component in a system of downstream tasks that depend on embedding quality, you need to make sure those are retrained as well.
Otherwise, a previously correct downstream prediction might become incorrect!
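One way to guard against this, assuming the versioned store interface from the earlier sketch and a hypothetical `predict_fn` standing in for a downstream model, is to gate the rollout on a held-out "golden set" of entities:

```python
def safe_to_promote(old_store, new_store, predict_fn, golden_ids,
                    max_flip_rate=0.01):
    """Block the new embedding version if too many downstream predictions
    flip on the golden set compared to the current version."""
    flips = sum(
        predict_fn(old_store.lookup(e)) != predict_fn(new_store.lookup(e))
        for e in golden_ids
    )
    return flips / len(golden_ids) <= max_flip_rate
```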
5. Embedding evaluation
There are many questions you can ask yourself about trained embeddings:
Are they biased toward popular entities in a given geolocation?
Are they vulnerable to adversarial attacks?
Are downstream applications affected by updated embeddings?
How do you enable safe and regular model updates?
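For the last two questions, one common stability signal is nearest-neighbour overlap between the old and new versions. A minimal sketch, assuming rows are entities in the same order in both matrices (normalise rows first if you want cosine similarity):

```python
import numpy as np

def neighbor_overlap(emb_old, emb_new, k=10):
    """Average overlap of each entity's k nearest neighbours across two
    embedding versions; 1.0 means the local geometry is unchanged."""
    def top_k(emb):
        sims = emb @ emb.T               # dot-product similarity between all rows
        np.fill_diagonal(sims, -np.inf)  # exclude self-similarity
        return np.argsort(-sims, axis=1)[:, :k]

    nn_old, nn_new = top_k(emb_old), top_k(emb_new)
    return float(np.mean([len(set(a) & set(b)) / k
                          for a, b in zip(nn_old, nn_new)]))
```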
Conclusions
Using embeddings instead of features can massively speed up the development process; however, there are additional tradeoffs to take into account to keep the workflow smooth.
The challenges I have described above are usually business-dependent, and you are the person best equipped to decide whether a given solution is needed: e.g. you might not need to care about retraining models at all if entities are very stable.
Let me know what challenges you have faced in your day-to-day job and how you managed to solve them.
Ludo