Matrix Factorization

Matrix factorization is a mathematical technique widely used in recommender systems, especially collaborative filtering, to predict users' preferences based on their interactions with items. This approach is crucial for streaming platforms, e-commerce and social networks, as it allows for personalized recommendations.
The main idea is to "split" an interaction matrix \(R\), with dimensions \([m,n]\), into two smaller matrices that represent latent characteristics of users and items:
$$\huge R \approx UV^T$$
- \(m\) is the number of users.
- \(n\) is the number of items.
- The matrix \(U\) (of dimensions \([m, k]\)) represents users in a latent feature space.
- The matrix \(V\) (of dimensions \([n, k]\)) represents items in the same latent feature space.
The value \(k\) defines the number of latent factors, which capture implicit patterns in the interactions between users and items.
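To make the shapes concrete, here is a minimal NumPy sketch (the sizes and the value of \(k\) are illustrative only, not taken from the example that follows):

import numpy as np
m, n, k = 4, 12, 5  # users, items, latent factors
U = np.random.normal(size=(m, k))  # user latent factors, shape [m, k]
V = np.random.normal(size=(n, k))  # item latent factors, shape [n, k]
R_hat = U @ V.T  # approximate reconstruction of R, shape [m, n]
print(R_hat.shape)  # (4, 12)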
Example
Imagine a futuristic scenario: a party illuminated by soft neon lights, with pleasant ambient music playing in the background. Robbie, a curious and music-loving robot, finds himself at the center of a rather peculiar conversation circle. Around him, four iconic figures discuss their favorite music styles: Ripley, Darth Vader, Spock and Hermione. Robbie adjusts his hearing sensors and leans forward slightly, eager to learn about the characters' musical tastes.
Robbie builds an interaction matrix of dimensions \([m, n]\) from the ratings given by the characters.
Let's load sample data of user interactions with songs (from now on we'll call the characters users).
Note: all code used in this article is available on GitHub.
import pandas as pd
import numpy as np
import examples as ex
matrix_interactions = ex.get_matrix_interaction()
matrix_interactions
Result of matrix_interactions
User | Another One Bites the Dust | Back in Black | Bohemian Rhapsody | Brandenburg Concerto No. 3 | Clair de Lune | Comfortably Numb | Don't Stop Believin' | Firework | Hedwig’s Theme | O Fortuna | Survivor | The Imperial March |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Darth Vader | 4 | 4 | 2 | 0 | 1 | 2 | 2 | 0 | 0 | 4 | 0 | 5 |
Hermione | 1 | 0 | 5 | 0 | 5 | 2 | 2 | 5 | 5 | 2 | 2 | 1 |
Ripley | 5 | 5 | 2 | 1 | 2 | 3 | 4 | 4 | 0 | 5 | 5 | 4 |
Spock | 2 | 0 | 5 | 5 | 5 | 4 | 3 | 2 | 3 | 4 | 2 | 0 |
Each cell contains a rating from 0 to 5, where 0 indicates that the user has never heard the song. Robbie's goal is to predict which songs to recommend to each user, based on how similar their tastes are. To predict new preferences, he decides to use Matrix Factorization.
Matrix Factorization in Practice
Matrix factorization is the process of "splitting" an original matrix \(R\) into two smaller matrices, \(U\) and \(V\), that share a latent space of dimension \(k\). This "split" transforms the matrix of interactions between users and items (songs, in our context) into a space of latent variables, allowing us to capture hidden patterns in the data and to predict missing interactions efficiently.
To start the implementation, we need to initialize the matrices \(U\) and \(V\) with random values drawn from a normal distribution. To prevent these initial values from being too large, we set the scale parameter to \(1.0/k\). This gives the values a standard deviation inversely proportional to the number of latent dimensions, stabilizing the optimization process and improving the quality of the model's predictions.
ratings = matrix_interactions.values
k = 5  # number of latent dimensions
num_users, num_items = ratings.shape
# Initialize user and item latent feature matrices
user_matrix = np.random.normal(scale=1. / k, size=(num_users, k)) # U matrix
item_matrix = np.random.normal(scale=1. / k, size=(num_items, k)) # V matrix
Error Minimization and Optimization Methods
With the matrices \(U\) and \(V\) initialized, the next step is to optimize the latent factors so that the predictions approximate the actual interactions. The goal is to minimize the error between the known ratings and the model's predictions, which is what later allows the model to fill in the gaps of the original matrix.
For this process, two algorithms are widely used in matrix factorization:
- Stochastic Gradient Descent (SGD) – an iterative method that adjusts latent factors by gradually reducing the error through gradient descent. This will be the focus of this article.
- Alternating Least Squares (ALS) – an approach that solves for the matrices \(U\) and \(V\) alternately using least squares. This method will be explored in another article.
In the next steps, we will see how to implement SGD to optimize the latent factors and improve prediction accuracy.
Implementation of the SGD algorithm
SGD performs three main steps:
- Choose a pair \((i,j)\) where \(R_{ij}\) is not zero.
- Calculate the prediction error for this interaction:
  - \(e_{ij} = R_{ij} - \hat R_{ij} = R_{ij} - U_iV_j^T\)
- Update the values of the matrices \(U\) and \(V\) using the gradient descent update rule:
  - \(U_i \leftarrow U_i + \alpha * (2 * e_{ij} * V_j - \lambda * U_i)\)
  - \(V_j \leftarrow V_j + \alpha * (2 * e_{ij} * U_i - \lambda * V_j)\)
where:
- \(\alpha\) is the learning rate, which controls the size of the adjustment in each iteration.
- \(\lambda\) is the regularization term, which prevents overfitting by penalizing very large values in the latent factors.
This process is repeated for several interactions throughout the training epochs, progressively reducing the error.
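To see where the factor \(2 e_{ij}\) in these updates comes from, note that the gradient of the squared error of a single known interaction with respect to the user factors is

$$\large \frac{\partial}{\partial U_i} \left( R_{ij} - U_i V_j^T \right)^2 = -2 \, e_{ij} \, V_j$$

Moving opposite to this gradient with step size \(\alpha\) gives the \(2 e_{ij} V_j\) term of the update, while the \(-\lambda U_i\) term comes from the gradient of the regularization penalty (with its constant factor absorbed into \(\lambda\)); the update for \(V_j\) is symmetric.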
alpha = 0.01 # learning rate
lambda_ = 0.02 # regularization term
"""
- Returns the predicted rating of user i for item j.
- It does this through the dot product between the user feature vector (users[i, :]) and the item feature vector (items[j, :]).
- The resulting vector indicates the affinity between the user and the item.
"""
def predict(users, items, i, j):
return np.dot(users[i, :], items[j, :].T)
"""
- This function performs training using Stochastic Gradient Descent (SGD).
- ratings.nonzero() returns the indices (i, j) of the non-zero positions in the ratings array, i.e. where there is a known rating.
- ratings[i, j]: actual value of the rating given by user i to item j.
- predict(users, items, i, j): rating prediction calculated by the model.
- eij: diferença entre o valor real e o previsto.
- The term 2 * eij * items[j, :] corrects the direction of the user vector based on the prediction error.
- The term - lambda_ * users[i, :] is the regularization to avoid overfitting.
- After the updates, the function returns the adjusted users and items vectors, which contain the final user and item embeddings.
"""
def sgd(ratings, users, items):
for i, j in zip(*ratings.nonzero()):
eij = ratings[i,j] - predict(users, items, i, j)
users[i, :] += alpha * (2 * eij * items[j, :] - lambda_ * users[i, :])
items[j, :] += alpha * (2 * eij * users[i, :] - lambda_ * items[j, :])
return users, items
Now that our optimization function is ready, let's implement the loss we want to minimize, which combines the squared error over the known ratings with a regularization term.
$$\large \min_{U,V} \sum_{(i,j) \ \in \ \Omega } (R_{ij} - (UV^T)_{ij})^2 + \lambda ( \| U \|^2 + \| V \|^2) $$
MSE (Mean Squared Error)
Let's start our implementation of the loss function by calculating the MSE.
$$\large MSE = \dfrac{1}{|\Omega|} \sum_{(i,j) \ \in \ \Omega } (R_{ij} - (UV^T)_{ij})^2$$
- \((i,j) \in \Omega\) are the indices where \(R_{ij}\) is known, i.e. \(\Omega = \{\ (i,j) \ | \ R_{ij} \text{ is known} \ \}\), and \(|\Omega|\) is the total number of known interactions.
- \((UV^T)_{ij}\) is the value of \(R_{ij}\) estimated by the model.
- \(R_{ij}\) is the known rating from the original ratings matrix.
The first term of the objective above is exactly this squared error summed (rather than averaged) over the known interactions; in the code below we accumulate that sum directly.
The MSE is crucial in SGD for several reasons:
- Provides a clear measure of error
  - It quantifies the mean squared difference between actual values and estimated values, allowing the quality of the model's predictions to be assessed.
- It is smooth and differentiable
  - MSE has continuous derivatives, which makes it easy to optimize with gradient-based methods such as SGD.
- Penalizes large errors
  - Because the error is squared, larger errors have a disproportionately larger impact on the final MSE value (an error of 3 contributes 9 to the loss, while an error of 1 contributes only 1). This encourages the model to reduce large discrepancies, improving prediction accuracy.
- Facilitates interpretation of model performance
  - High MSE values indicate that the model still needs to be optimized, while low values suggest that it is fitting the data well.
"""
- This variable will accumulate the squared error of each pair (i, j) in the ratings matrix.
- ratings.nonzero() returns the indices (i, j) where the ratings are known (nonzero).
- ratings[i, j]: actual rating given by user i to item j.
- (ratings[i, j] - y_pred) ** 2: squared error between actual and predicted value.
"""
def compute_loss(ratings, users, items):
error = 0
for i, j in zip(*ratings.nonzero()):
y_pred = predict(users, items, i, j)
error += (ratings[i, j] - y_pred) ** 2
mse = np.mean(error)
return mse
Let's now implement the second part of the optimization equation.
$$\huge \lambda ( \| U \|^2 + \| V \|^2) $$
- \(\lambda\) is the regularization rate.
- \(\| U \|^2 + \| V \|^2\) are the squared norms of the factor matrices, used as regularization to avoid overfitting by limiting the size of the parameters and ensuring that they do not grow excessively.
"""
- This variable will accumulate the squared error of each pair (i, j) in the ratings matrix.
- ratings.nonzero() returns the indices (i, j) where the ratings are known (nonzero).
- ratings[i, j]: actual rating given by user i to item j.
- (ratings[i, j] - y_pred) ** 2: squared error between actual and predicted value.
"""
def compute_loss(ratings, users, items):
error = 0
for i, j in zip(*ratings.nonzero()):
y_pred = predict(users, items, i, j)
error += (ratings[i, j] - y_pred) ** 2
mse = np.mean(error)
# Regularization term to avoid overfitting
# Penalizes large magnitudes in user and item vectors
regularization = lambda_ * (np.linalg.norm(users)**2 + np.linalg.norm(items)**2)
return mse + regularization
Why use regularization?
- Avoid overfitting
  - High values in the user_matrix and item_matrix matrices can lead to a model that fits the training data perfectly but performs poorly on unseen data.
  - Regularization keeps learned features compact and avoids excessive parameter magnitudes.
- Improves numerical stability
  - Without regularization, factor matrices can grow too large, causing instability and errors in the system.
  - Adding the penalty term controls the scaling of user_matrix and item_matrix, stabilizing the training.
- Encourages generalization
  - It ensures that latent factors capture meaningful patterns rather than noise.
Done. Now that we have the loss function and the optimization routine for matrix factorization, we can start training.
epochs = 1000
for epoch in range(epochs):
    user_matrix, item_matrix = sgd(ratings, user_matrix, item_matrix)
    if epoch % 100 == 0:
        error = compute_loss(ratings, user_matrix, item_matrix)
        print(f'Epoch {epoch + 1}/{epochs} - Error: {error:.4f}')
"""
Train output:
Epoch 1/1000 - Error: 596.5902
Epoch 101/1000 - Error: 1.6179
Epoch 201/1000 - Error: 1.6440
Epoch 301/1000 - Error: 1.6707
Epoch 401/1000 - Error: 1.6952
Epoch 501/1000 - Error: 1.7164
Epoch 601/1000 - Error: 1.7343
Epoch 701/1000 - Error: 1.7491
Epoch 801/1000 - Error: 1.7612
Epoch 901/1000 - Error: 1.7710
"""
At the beginning of training, the matrices are initialized randomly, so the model is practically "guessing" its predictions. That's why the error starts high (in the example, around 596).
As the SGD algorithm adjusts its parameters, it gradually “discovers” the best way to align these user and item vectors to approximate real interactions. This explains why, after a few iterations, the error drops to around 1.6. The model quickly picks up on the most “obvious” patterns in the data.
After this initial steep decline, the error tends to stabilize and even fluctuate a little. It's as if the model has learned most of the users' general preferences but still needs to refine finer details (or deal with data that doesn't quite follow the general trend). As the number of epochs increases (200, 300, 400...), the model makes only small fine adjustments, so the error no longer falls drastically but stays around 1.6 to 1.7.
Why doesn’t the error go to zero? In real applications, there is always some level of noise: users may give inconsistent ratings, certain items may have few ratings, and so on. In addition, we apply “regularization” to prevent the model from simply memorizing every single point in the matrix (so-called overfitting). This regularization “safeguards” the fit a bit and prevents the model from overfitting to noise or incomplete data.
Recommending new music
Now that we have defined all the parts of the process, let's put the code together.
import numpy as np
class MatrixFactorization:
    def __init__(self, n_factors, learning_rate, regularization):
        """
        Matrix Factorization using Stochastic Gradient Descent (SGD) to factorize a matrix R into two matrices U and V.
        :param n_factors: Number of latent factors
        :param learning_rate: Learning rate
        :param regularization: Regularization parameter
        """
        self._k = n_factors
        self._alpha = learning_rate
        self._lambda = regularization
        self._original_matrix = None
        self._user_matrix = None
        self._item_matrix = None
        self._num_users = None
        self._num_items = None

    def fit(self, matrix, epochs=100) -> np.ndarray:
        """
        Fit the matrix factorization model to the data
        :param matrix: the matrix to factorize
        :param epochs: the number of epochs to train the model
        :return: the predicted ratings
        """
        self._initialize(matrix)
        for i in range(epochs):
            self._sgd()
            loss = self._compute_loss()
            print(f'Epoch {i + 1}/{epochs} - Training Loss: {loss:.4f}')
        return self.predict()

    def predict(self) -> np.ndarray:
        """
        Predict the ratings for every user and item.
        U matrix * V matrix^T
        :return: the predicted ratings
        """
        return self._user_matrix.dot(self._item_matrix.T)

    def _predict(self, i, j) -> float:
        """
        Predict the rating of user i for item j
        U[i, :] * V[j, :].T
        :param i: Index of the user
        :param j: Index of the item
        :return: the predicted rating
        """
        return self._user_matrix[i, :].dot(self._item_matrix[j, :].T)

    def _sgd(self):
        """
        Perform Stochastic Gradient Descent to learn the latent factors U and V by minimizing the loss function using the original matrix R.
        :return: None
        """
        for i, j in zip(*self._original_matrix.nonzero()):
            eij = self._original_matrix[i, j] - self._predict(i, j)
            self._user_matrix[i] += self._alpha * (2 * eij * self._item_matrix[j] - self._lambda * self._user_matrix[i])
            self._item_matrix[j] += self._alpha * (2 * eij * self._user_matrix[i] - self._lambda * self._item_matrix[j])

    def _compute_loss(self) -> float:
        """
        Compute the loss of the prediction: squared error over the known ratings plus the regularization term
        :return: the loss of the prediction
        """
        error = 0
        for i, j in zip(*self._original_matrix.nonzero()):
            eij = self._original_matrix[i, j] - self._predict(i, j)
            error += eij ** 2
        regularization = self._lambda * (np.linalg.norm(self._user_matrix) ** 2 + np.linalg.norm(self._item_matrix) ** 2)
        return error + regularization

    def _initialize(self, matrix):
        self._original_matrix = matrix.copy()
        self._num_users, self._num_items = self._original_matrix.shape
        self._user_matrix = np.random.normal(scale=1. / self._k, size=(self._num_users, self._k))  # U matrix
        self._item_matrix = np.random.normal(scale=1. / self._k, size=(self._num_items, self._k))  # V matrix
Training
Now that we have everything together, we can start training the model. But first, it's important to normalize the rating values to keep them in the range \([0, 5]\). This minimizes the impact of outliers and allows the model to learn patterns and relations more efficiently and consistently.
from sklearn.preprocessing import MinMaxScaler
ratings = matrix_interactions.values
scaler = MinMaxScaler(feature_range=(0, 5))
normalized_ratings = scaler.fit_transform(ratings)
model = MatrixFactorization(n_factors=75, learning_rate=0.001, regularization=0.02)
matrix_factorization_result = model.fit(ratings, epochs=600)
Result of matrix_factorization_result
User | Another One Bites the Dust | Back in Black | Bohemian Rhapsody | Brandenburg Concerto No. 3 | Clair de Lune | Comfortably Numb | Don't Stop Believin' | Firework | Hedwig’s Theme | O Fortuna | Survivor | The Imperial March |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Darth Vader | 4.169970 | 4.057246 | 3.667471 | 0.710625 | 1.712582 | 2.772770 | 3.595744 | 3.018872 | 2.391332 | 4.190256 | 4.012947 | 4.421050 |
Hermione | 2.686220 | 3.310009 | 5.080972 | 4.519992 | 5.051042 | 3.952128 | 3.291865 | 4.718939 | 4.862734 | 4.351025 | 2.163611 | 0.979503 |
Ripley | 4.838367 | 4.918483 | 4.106493 | 1.043555 | 2.023779 | 3.012154 | 3.992257 | 4.059402 | 3.200697 | 4.823499 | 4.927110 | 4.451684 |
Spock | 1.994154 | 2.523687 | 4.894493 | 4.974091 | 4.904404 | 3.989944 | 2.998294 | 2.253896 | 3.132301 | 3.994360 | 1.850755 | 0.795050 |
Comparison of actual rating vs. model prediction:
User | Music | Actual rating | Prediction | Review |
---|---|---|---|---|
Darth Vader | Back in Black | 4 | 4.057246 | The model predicted a value very close to the actual rating. |
Hermione | Bohemian Rhapsody | 5 | 5.080972 | The model predicted a value very close to the actual rating. |
Ripley | Don’t Stop Believin’ | 4 | 3.992257 | The model correctly predicted a strong interaction. |
Spock | Bohemian Rhapsody | 5 | 4.894493 | The model predicted a value very close to the actual interaction. |
Darth Vader | The Imperial March | 5 | 4.421050 | The model almost got the score right, but was a bit conservative. |
Gap predictions
While training the model with Matrix Factorization, some cells in the interaction matrix had values equal to 0. This means that these users did not rate these songs or have any recorded interaction with them. So how did the model manage to predict these missing ratings?
The model fills in these gaps by using the similarity between the latent factors of the users and the songs. For example, even though Darth Vader has never listened to "Brandenburg Concerto No. 3" (a score of 0 in the interaction matrix), the model was able to predict a score of 0.71. This was possible because the factorized matrices learned that Darth Vader tends to prefer epic and science fiction music, but has little affinity for classical music.
Let's look at other examples of predictions made by the model:
- Darth Vader and "Brandenburg Concerto No. 3": prediction 0.71. This indicates that the model understood that Darth Vader has little interest in this song. The score was low because Darth Vader's latent factors do not align well with the latent factors of classical music.
- Hermione and "The Imperial March": prediction 0.97. Hermione didn't show much interest in this song (original score 1), and the model predicted a value close to it. This is because Hermione's latent factors indicate a preference for emotional and classical music, but "The Imperial March" is not among her favorites.
- Ripley and "Clair de Lune": prediction 2.02. Even without a prior rating, the model predicted a moderate score. This is because Ripley has shown some interest in soft, emotional music, but "Clair de Lune" is not entirely in line with her core preferences.
- Spock and "Firework": prediction 2.25. Although Spock never interacted with this song, the model predicted limited interest. This is consistent with Spock's latent factors indicating a preference for classical music, while "Firework" is a pop song.
In general, the model makes these predictions by multiplying the latent factors of users and items, adjusted by Stochastic Gradient Descent (SGD). The product of the matrices \(U\) and \(V\) captures the relation between the general preferences of each user and the characteristics of the songs.
As SGD minimizes the error in known predictions, it adjusts the latent factors to correctly reflect individual tastes. This allows the model to generalize and predict gaps in the original matrix, providing personalized music suggestions for each character.
This allows Robbie to suggest new music to each user, personalizing his recommendations.
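As a quick illustration (a minimal sketch, assuming matrix_interactions is indexed by user name and reusing matrix_factorization_result from the training step above), any of these gap predictions can be read directly from the result matrix:

# The rows and columns of the result follow the order of the interaction matrix
i = matrix_interactions.index.get_loc('Darth Vader')
j = matrix_interactions.columns.get_loc('Brandenburg Concerto No. 3')
print(matrix_factorization_result[i, j])  # approximately 0.71 in the run above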
Similarity between users
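Robbie compares users through their vectors of predicted ratings. A standard way to do this is cosine similarity, which measures how much two vectors point in the same direction, regardless of their magnitude; for two users \(u\) and \(v\):

$$\large \text{sim}(u, v) = \dfrac{u \cdot v}{\| u \| \, \| v \|}$$

scikit-learn's cosine_similarity computes this value for every pair of rows of the matrix passed to it.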
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity(matrix_factorization_result)
df_user_similarity = pd.DataFrame(user_similarity, index=ex.users, columns=ex.users)
df_user_similarity
User | Ripley | Darth Vader | Spock | Hermione |
---|---|---|---|---|
Ripley | 1.0000 | 0.836699 | 0.996773 | 0.795352 |
Darth Vader | 0.836699 | 1.0000 | 0.856497 | 0.976032 |
Spock | 0.996773 | 0.856497 | 1.0000 | 0.805946 |
Hermione | 0.795352 | 0.976032 | 0.805946 | 1.0000 |
- Ripley and Spock (0.996773):
- High similarity, as both have strong preferences for songs like “Bohemian Rhapsody” and “Back in Black”.
- Darth Vader and Hermione (0.976032):
- High similarity, which makes sense, since they both like epic and classical music, like “The Imperial March” and “Hedwig’s Theme”.
- Darth Vader and Spock (0.856497):
- Moderate similarity, because although they have some preferences in common, Spock has a greater focus on classical music.
Recommending new songs to Hermione
After training the model, Robbie is ready to make recommendations! He has looked at the songs Hermione hasn't listened to yet and analyzed other characters whose tastes are similar to hers.
def recommend_by_user(user, df_predicted, df_interactions, top_k=3):
    # Calculate the user similarity matrix from the predicted ratings
    user_similarity = cosine_similarity(df_predicted.values)
    user_similarity_df = pd.DataFrame(user_similarity, index=ex.users, columns=ex.users)
    # Get the top similar users (skipping the user itself)
    similarities = user_similarity_df.loc[user].sort_values(ascending=False)[1:top_k + 1]
    print('User similarity score:')
    print(similarities)
    # Initialize recommendations
    recommendations = pd.Series(0, index=df_predicted.columns, dtype=np.float64)
    # Aggregate recommendations based on similar users, weighted by similarity
    for user_similar, similarity in similarities.items():
        recommendations = recommendations.add(
            df_predicted.loc[user_similar].reindex(recommendations.index, fill_value=0) * similarity,
            fill_value=0
        )
    # Filter out already known items
    already_known = df_interactions.loc[user] > 0
    recommendations = recommendations[~already_known.reindex(recommendations.index, fill_value=False)]
    return recommendations.sort_values(ascending=False)

# Predicted ratings as a DataFrame, with the same user and song labels as the interaction matrix
df_predicted_ratings = pd.DataFrame(matrix_factorization_result,
                                    index=matrix_interactions.index,
                                    columns=matrix_interactions.columns)
recommend_by_user('Hermione', df_predicted_ratings, matrix_interactions)
"""
Output:
User similarity score:
Darth Vader 0.976032
Spock 0.805946
Ripley 0.795352
"""
Recommendations:
Music | Score |
---|---|
Back in Black | 9.905884 |
Brandenburg Concerto No. 3 | 5.532434 |
Robbie realized that the characters most similar to Hermione, based on musical tastes, are:
- Darth Vader (high similarity: 0.97)
- Spock (moderate similarity: 0.81)
- Ripley (moderate similarity: 0.79)
Darth Vader has the tastes most similar to Hermione's, so his preferences carried more weight in the recommendations.
Why was "Black in Black" recommended?
- Hermione has never heard "Black in Black" before (rating 0 in the interaction matrix).
- Robbie saw that Darth Vader and Ripley really liked this music:
- Since Darth Vader has a very high similarity to Hermione, his preferences greatly influenced the recommendation.
- Robbie concluded that Hermione would probably also love this music because of the strong influence of Darth Vader and Ripley.
Why was "Brandenburg Concerto No. 3" recommended?
- Hermione hadn't heard that song either (rating 0 in the interaction matrix).
- Robbie realized that Spock liked this song a lot.
- Additionally, Hermione has shown an interest in classical music (such as "Clair de Lune").
- With Spock's influence and Hermione's past preferences, Robbie concluded that this classical piece would be a good recommendation.
In summary
The model works like a friend (Robbie) who knows your tastes well and compares them with those of other people. He realized that Hermione:
- Shares musical tastes with Darth Vader, which led to the recommendation of a rock song like "Back in Black".
- Has an affinity for classical music, as Spock does, which resulted in the recommendation of "Brandenburg Concerto No. 3".
Robbie now knows exactly how to surprise Hermione with new songs! 🎶
Conclusion
Matrix Factorization stands out as an essential tool for creating personalized recommendations at scale, helping platforms such as streaming services and e-commerce to better understand their users. By decomposing the matrix of interactions, we can identify hidden patterns of preferences, even when there is missing data.
With Stochastic Gradient Descent (SGD), we iteratively adjusted the matrices, reducing the error between predictions and real data. Regularization ensured a balanced model, without overfitting, while data normalization brought more stability to the learning.
In the end, we saw how the model goes beyond direct interactions, predicting preferences even when data is limited. The story of Robbie and his iconic friends showed us that these techniques can be applied to a variety of scenarios, providing unique and personalized recommendations. 🎶
References
- Koren, Yehuda; Bell, Robert; Volinsky, Chris. "Matrix Factorization Techniques for Recommender Systems". IEEE Computer Society, 2009. Link to article
- Ricci, Francesco; Rokach, Lior; Shapira, Bracha. "Recommender Systems Handbook". Springer, 2022. Link to the book
- Hug, Nicolas. "Surprise: A Python Scikit for Building and Analyzing Recommender Systems". Official documentation
- Said, Alan; Bellogín, Alejandro. "Comparative Recommender System Evaluation". Springer, 2014. Link to article