Matrix Factorization

Matrix factorization is a mathematical technique widely used in recommender systems, especially collaborative filtering, to predict users' preferences based on their interactions with items. This approach is crucial for streaming platforms, e-commerce and social networks, as it allows for personalized recommendations.
The main idea is to "split" an interaction matrix \(R\), with dimensions \([m,n]\), into two smaller matrices that represent latent characteristics of users and items:
$$\huge R \approx UV^T$$
- \(m\) is the number of users.
- \(n\) is the number of items.
- The matrix \(U\) (of dimensions \([m, k]\)) represents users in a latent feature space.
- The matrix \(V\) (of dimensions \([n, k]\)) represents items in the same latent feature space.
The value \(k\) defines the number of latent factors, which capture implicit patterns in the interactions between users and items.
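To make the shapes concrete, here is a minimal NumPy sketch (the sizes and the value of \(k\) are illustrative only, not taken from the example that follows):

import numpy as np
m, n, k = 4, 12, 5  # users, items, latent factors
U = np.random.normal(size=(m, k))  # user latent factors, shape [m, k]
V = np.random.normal(size=(n, k))  # item latent factors, shape [n, k]
R_hat = U @ V.T  # approximate reconstruction of R, shape [m, n]
print(R_hat.shape)  # (4, 12)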
Example
Imagine a futuristic scenario: a party illuminated by soft neon lights, with pleasant ambient music playing in the background. Robbie, a curious and music-loving robot, finds himself at the center of a rather peculiar conversation circle. Around him, four iconic figures discuss their favorite music styles: Ripley, Darth Vader, Spock and Hermione. Robbie adjusts his hearing sensors and leans forward slightly, eager to learn about the characters' musical tastes.
Robbie builds an interaction matrix of dimensions \([m, n]\) from the ratings given by the characters.
Let's load sample data of user interactions with songs (from now on we'll call the characters users).
Note: all code used in this article is available on GitHub.
import pandas as pd
import numpy as np
import examples as ex
matrix_interactions = ex.get_matrix_interaction()
matrix_interactions
Result of matrix_interactions
User | Another One Bites the Dust | Back in Black | Bohemian Rhapsody | Brandenburg Concerto No. 3 | Clair de Lune | Comfortably Numb | Don't Stop Believin' | Firework | Hedwig’s Theme | O Fortuna | Survivor | The Imperial March |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Darth Vader | 4 | 4 | 2 | 0 | 1 | 2 | 2 | 0 | 0 | 4 | 0 | 5 |
Hermione | 1 | 0 | 5 | 0 | 5 | 2 | 2 | 5 | 5 | 2 | 2 | 1 |
Ripley | 5 | 5 | 2 | 1 | 2 | 3 | 4 | 4 | 0 | 5 | 5 | 4 |
Spock | 2 | 0 | 5 | 5 | 5 | 4 | 3 | 2 | 3 | 4 | 2 | 0 |
Each cell contains a rating from 0 to 5, where 0 indicates that the user has never heard the song. Robbie's goal is to predict which songs to recommend to each user, based on how similar their tastes are. To predict new preferences, he decides to use Matrix Factorization.
Matrix Factorization in Practice
Matrix factorization is the process of "splitting" an original matrix \(R\) into two smaller matrices, \(U\) and \(V\), that share a latent space of dimension \(k\). This "split" transforms the matrix of interactions between users and items (songs, in our context) into a space of latent variables, allowing us to capture hidden patterns in the data and to predict missing interactions efficiently.
To start the implementation, we need to initialize the matrices \(U\) and \(V\) with random values drawn from a normal distribution. To prevent these initial values from being too large, we set the scale parameter to \(1.0/k\). This gives the values a standard deviation inversely proportional to the number of latent dimensions, stabilizing the optimization process and improving the quality of the model's predictions.
ratings = matrix_interactions.values
k = 5  # number of latent dimensions
num_users, num_items = ratings.shape
# Initialize user and item latent feature matrices
user_matrix = np.random.normal(scale=1. / k, size=(num_users, k)) # U matrix
item_matrix = np.random.normal(scale=1. / k, size=(num_items, k)) # V matrix
Error Minimization and Optimization Methods
With the matrices \(U\) and \(V\) initialized, the next step is to optimize the latent factors so that the predictions approximate the actual interactions. The goal is to minimize the error between the known ratings and the model's predictions, which is what later allows the model to fill in the gaps of the original matrix.
For this process, two algorithms are widely used in matrix factorization:
- Stochastic Gradient Descent (SGD) – an iterative method that adjusts latent factors by gradually reducing the error through gradient descent. This will be the focus of this article.
- Alternating Least Squares (ALS) – an approach that solves for the matrices \(U\) and \(V\) alternately using least squares. This method will be explored in another article.
In the next steps, we will see how to implement SGD to optimize the latent factors and improve prediction accuracy.
Implementation of the SGD algorithm
SGD performs three main steps:
- Choose a pair \((i,j)\) where \(R_{ij}\) is not zero.
- Calculate the prediction error for this interaction:
  - \(e_{ij} = R_{ij} - \hat R_{ij} = R_{ij} - U_iV_j^T\)
- Update the values of the matrices \(U\) and \(V\) using the gradient descent update rule:
  - \(U_i \leftarrow U_i + \alpha * (2 * e_{ij} * V_j - \lambda * U_i)\)
  - \(V_j \leftarrow V_j + \alpha * (2 * e_{ij} * U_i - \lambda * V_j)\)
where:
- \(\alpha\) is the learning rate, which controls the size of the adjustment in each iteration.
- \(\lambda\) is the regularization term, which prevents overfitting by penalizing very large values in the latent factors.
This process is repeated for several interactions throughout the training epochs, progressively reducing the error.
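To see where the factor \(2 e_{ij}\) in these updates comes from, note that the gradient of the squared error of a single known interaction with respect to the user factors is

$$\large \frac{\partial}{\partial U_i} \left( R_{ij} - U_i V_j^T \right)^2 = -2 \, e_{ij} \, V_j$$

Moving opposite to this gradient with step size \(\alpha\) gives the \(2 e_{ij} V_j\) term of the update, while the \(-\lambda U_i\) term comes from the gradient of the regularization penalty (with its constant factor absorbed into \(\lambda\)); the update for \(V_j\) is symmetric.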
alpha = 0.01 # learning rate
lambda_ = 0.02 # regularization term
"""
- Returns the predicted rating of user i for item j.
- It does this through the dot product between the user feature vector (users[i, :]) and the item feature vector (items[j, :]).
- The resulting vector indicates the affinity between the user and the item.
"""
def predict(users, items, i, j):
return np.dot(users[i, :], items[j, :].T)
"""
- This function performs training using Stochastic Gradient Descent (SGD).
- ratings.nonzero() returns the indices (i, j) of the non-zero positions in the ratings array, i.e. where there is a known rating.
- ratings[i, j]: actual value of the rating given by user i to item j.
- predict(users, items, i, j): rating prediction calculated by the model.
- eij: diferença entre o valor real e o previsto.
- The term 2 * eij * items[j, :] corrects the direction of the user vector based on the prediction error.
- The term - lambda_ * users[i, :] is the regularization to avoid overfitting.
- After the updates, the function returns the adjusted users and items vectors, which contain the final user and item embeddings.
"""
def sgd(ratings, users, items):
for i, j in zip(*ratings.nonzero()):
eij = ratings[i,j] - predict(users, items, i, j)
users[i, :] += alpha * (2 * eij * items[j, :] - lambda_ * users[i, :])
items[j, :] += alpha * (2 * eij * users[i, :] - lambda_ * items[j, :])
return users, items
Now that our optimization function is ready, let's implement the loss we want to minimize, which combines the squared error over the known ratings with a regularization term.
$$\large \min_{U,V} \sum_{(i,j) \ \in \ \Omega } (R_{ij} - (UV^T)_{ij})^2 + \lambda ( \| U \|^2 + \| V \|^2) $$
MSE (Mean Squared Error)
Let's start our implementation of the loss function by calculating the MSE.
$$\large MSE = \dfrac{1}{|\Omega|} \sum_{(i,j) \ \in \ \Omega } (R_{ij} - (UV^T)_{ij})^2$$
- \((i,j) \in \Omega\) are the indices where \(R_{ij}\) is known, i.e. \(\Omega = \{\ (i,j) \ | \ R_{ij} \text{ is known} \ \}\), and \(|\Omega|\) is the total number of known interactions.
- \((UV^T)_{ij}\) is the value of \(R_{ij}\) estimated by the model.
- \(R_{ij}\) is the known rating from the original ratings matrix.
The first term of the objective above is exactly this squared error summed (rather than averaged) over the known interactions; in the code below we accumulate that sum directly.
The MSE is crucial in SGD for several reasons:
- Provides a clear measure of error
  - It quantifies the mean squared difference between actual values and estimated values, allowing the quality of the model's predictions to be assessed.
- It is smooth and differentiable
  - MSE has continuous derivatives, which makes it easy to optimize with gradient-based methods such as SGD.
- Penalizes large errors
  - Because the error is squared, larger errors have a disproportionately larger impact on the final MSE value (an error of 3 contributes 9 to the loss, while an error of 1 contributes only 1). This encourages the model to reduce large discrepancies, improving prediction accuracy.
- Facilitates interpretation of model performance
  - High MSE values indicate that the model still needs to be optimized, while low values suggest that it is fitting the data well.
"""
- This variable will accumulate the squared error of each pair (i, j) in the ratings matrix.
- ratings.nonzero() returns the indices (i, j) where the ratings are known (nonzero).
- ratings[i, j]: actual rating given by user i to item j.
- (ratings[i, j] - y_pred) ** 2: squared error between actual and predicted value.
"""
def compute_loss(ratings, users, items):
error = 0
for i, j in zip(*ratings.nonzero()):
y_pred = predict(users, items, i, j)
error += (ratings[i, j] - y_pred) ** 2
mse = np.mean(error)
return mse
Let's now implement the second part of the optimization equation.
$$\huge \lambda ( \| U \|^2 + \| V \|^2) $$
- \(\lambda\) is the regularization rate.
- \(\| U \|^2 + \| V \|^2\) are the squared norms of the factor matrices, used as regularization to avoid overfitting by limiting the size of the parameters and ensuring that they do not grow excessively.
"""
- This variable will accumulate the squared error of each pair (i, j) in the ratings matrix.
- ratings.nonzero() returns the indices (i, j) where the ratings are known (nonzero).
- ratings[i, j]: actual rating given by user i to item j.
- (ratings[i, j] - y_pred) ** 2: squared error between actual and predicted value.
"""
def compute_loss(ratings, users, items):
error = 0
for i, j in zip(*ratings.nonzero()):
y_pred = predict(users, items, i, j)
error += (ratings[i, j] - y_pred) ** 2
mse = np.mean(error)
# Regularization term to avoid overfitting
# Penalizes large magnitudes in user and item vectors
regularization = lambda_ * (np.linalg.norm(users)**2 + np.linalg.norm(items)**2)
return mse + regularization
Why use regularization?
- Avoid overfitting
  - High values in the user_matrix and item_matrix matrices can lead to a model that fits the training data perfectly but performs poorly on unseen data.
  - Regularization keeps learned features compact and avoids excessive parameter magnitudes.
- Improves numerical stability
  - Without regularization, factor matrices can grow too large, causing instability and errors in the system.
  - Adding the penalty term controls the scaling of user_matrix and item_matrix, stabilizing the training.
- Encourages generalization
  - It ensures that latent factors capture meaningful patterns rather than noise.
Done. Now that we have the loss function and the optimization routine for matrix factorization, we can start training.
epochs = 1000
for epoch in range(epochs):
    user_matrix, item_matrix = sgd(ratings, user_matrix, item_matrix)
    if epoch % 100 == 0:
        error = compute_loss(ratings, user_matrix, item_matrix)
        print(f'Epoch {epoch + 1}/{epochs} - Error: {error:.4f}')
"""
Train output:
Epoch 1/1000 - Error: 596.5902
Epoch 101/1000 - Error: 1.6179
Epoch 201/1000 - Error: 1.6440
Epoch 301/1000 - Error: 1.6707
Epoch 401/1000 - Error: 1.6952
Epoch 501/1000 - Error: 1.7164
Epoch 601/1000 - Error: 1.7343
Epoch 701/1000 - Error: 1.7491
Epoch 801/1000 - Error: 1.7612
Epoch 901/1000 - Error: 1.7710
"""
At the beginning of training, the matrices are initialized randomly, so the model is practically "guessing" its predictions. That's why the error starts high (in the example, around 596).
As the SGD algorithm adjusts its parameters, it gradually “discovers” the best way to align these user and item vectors to approximate real interactions. This explains why, after a few iterations, the error drops to around 1.6. The model quickly picks up on the most “obvious” patterns in the data.
After this initial steep decline, the error tends to stabilize and even fluctuate a little. It's as if the model has learned most of the users' general preferences but still needs to refine finer details (or deal with data that doesn't quite follow the general trend). As the number of epochs increases (200, 300, 400...), the model makes only small fine adjustments, so the error no longer falls drastically but stays around 1.6 to 1.7.
Why doesn’t the error go to zero? In real applications, there is always some level of noise: users may give inconsistent ratings, certain items may have few ratings, and so on. In addition, we apply “regularization” to prevent the model from simply memorizing every single point in the matrix (so-called overfitting). This regularization “safeguards” the fit a bit and prevents the model from overfitting to noise or incomplete data.
Recommending new music
Now that we have defined all the parts of the process, let's put the code together.
import numpy as np
class MatrixFactorization:
    def __init__(self, n_factors, learning_rate, regularization):
        """
        Matrix Factorization using Stochastic Gradient Descent (SGD) to factorize a matrix R into two matrices U and V.
        :param n_factors: Number of latent factors
        :param learning_rate: Learning rate
        :param regularization: Regularization parameter
        """
        self._k = n_factors
        self._alpha = learning_rate
        self._lambda = regularization
        self._original_matrix = None
        self._user_matrix = None
        self._item_matrix = None
        self._num_users = None
        self._num_items = None

    def fit(self, matrix, epochs=100) -> np.ndarray:
        """
        Fit the matrix factorization model to the data
        :param matrix: the matrix to factorize
        :param epochs: the number of epochs to train the model
        :return: the predicted ratings
        """
        self._initialize(matrix)
        for i in range(epochs):
            self._sgd()
            loss = self._compute_loss()
            print(f'Epoch {i + 1}/{epochs} - Training Loss: {loss:.4f}')
        return self.predict()

    def predict(self) -> np.ndarray:
        """
        Predict the ratings for every user and item.
        U matrix * V matrix^T
        :return: the predicted ratings
        """
        return self._user_matrix.dot(self._item_matrix.T)

    def _predict(self, i, j) -> float:
        """
        Predict the rating of user i for item j
        U[i, :] * V[j, :].T
        :param i: Index of the user
        :param j: Index of the item
        :return: the predicted rating
        """
        return self._user_matrix[i, :].dot(self._item_matrix[j, :].T)

    def _sgd(self):
        """
        Perform Stochastic Gradient Descent to learn the latent factors U and V by minimizing the loss function using the original matrix R.
        :return: None
        """
        for i, j in zip(*self._original_matrix.nonzero()):
            eij = self._original_matrix[i, j] - self._predict(i, j)
            self._user_matrix[i] += self._alpha * (2 * eij * self._item_matrix[j] - self._lambda * self._user_matrix[i])
            self._item_matrix[j] += self._alpha * (2 * eij * self._user_matrix[i] - self._lambda * self._item_matrix[j])

    def _compute_loss(self) -> float:
        """
        Compute the loss of the prediction: squared error over the known ratings plus the regularization term
        :return: the loss of the prediction
        """
        error = 0
        for i, j in zip(*self._original_matrix.nonzero()):
            eij = self._original_matrix[i, j] - self._predict(i, j)
            error += eij ** 2
        regularization = self._lambda * (np.linalg.norm(self._user_matrix) ** 2 + np.linalg.norm(self._item_matrix) ** 2)
        return error + regularization

    def _initialize(self, matrix):
        self._original_matrix = matrix.copy()
        self._num_users, self._num_items = self._original_matrix.shape
        self._user_matrix = np.random.normal(scale=1. / self._k, size=(self._num_users, self._k))  # U matrix
        self._item_matrix = np.random.normal(scale=1. / self._k, size=(self._num_items, self._k))  # V matrix
Training
Now that we have everything together, we can start training the model. But first, it's important to normalize the rating values to keep them in the range \([0, 5]\). This minimizes the impact of outliers and allows the model to learn patterns and relations more efficiently and consistently.
from sklearn.preprocessing import MinMaxScaler
ratings = matrix_interactions.values
scaler = MinMaxScaler(feature_range=(0, 5))
normalized_ratings = scaler.fit_transform(ratings)
model = MatrixFactorization(n_factors=75, learning_rate=0.001, regularization=0.02)
matrix_factorization_result = model.fit(ratings, epochs=600)
Result of matrix_factorization_result
User | Another One Bites the Dust | Back in Black | Bohemian Rhapsody | Brandenburg Concerto No. 3 | Clair de Lune | Comfortably Numb | Don't Stop Believin' | Firework | Hedwig’s Theme | O Fortuna | Survivor | The Imperial March |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Darth Vader | 4.169970 | 4.057246 | 3.667471 | 0.710625 | 1.712582 | 2.772770 | 3.595744 | 3.018872 | 2.391332 | 4.190256 | 4.012947 | 4.421050 |
Hermione | 2.686220 | 3.310009 | 5.080972 | 4.519992 | 5.051042 | 3.952128 | 3.291865 | 4.718939 | 4.862734 | 4.351025 | 2.163611 | 0.979503 |
Ripley | 4.838367 | 4.918483 | 4.106493 | 1.043555 | 2.023779 | 3.012154 | 3.992257 | 4.059402 | 3.200697 | 4.823499 | 4.927110 | 4.451684 |
Spock | 1.994154 | 2.523687 | 4.894493 | 4.974091 | 4.904404 | 3.989944 | 2.998294 | 2.253896 | 3.132301 | 3.994360 | 1.850755 | 0.795050 |
Comparison of actual rating vs. model prediction:
User | Music | Actual rating | Prediction | Review |
---|---|---|---|---|
Darth Vader | Back in Black | 4 | 4.057246 | The model predicted a value very close to the actual rating. |
Hermione | Bohemian Rhapsody | 5 | 5.080972 | The model predicted a value very close to the actual rating. |
Ripley | Don’t Stop Believin’ | 4 | 3.992257 | The model correctly predicted a strong interaction. |
Spock | Bohemian Rhapsody | 5 | 4.894493 | The model predicted a value very close to the actual interaction. |
Darth Vader | The Imperial March | 5 | 4.421050 | The model almost got the score right, but was a bit conservative. |
Gap predictions
While training the model with Matrix Factorization, some cells in the interaction matrix had values equal to 0. This means that these users did not rate these songs or have any recorded interaction with them. So how did the model manage to predict these missing ratings?
The model fills in these gaps by using the similarity between the latent factors of the users and the songs. For example, even though Darth Vader has never listened to "Brandenburg Concerto No. 3" (a score of 0 in the interaction matrix), the model was able to predict a score of 0.71. This was possible because the factorized matrices learned that Darth Vader tends to prefer epic and science fiction music, but has little affinity for classical music.
Let's look at other examples of predictions made by the model:
- Darth Vader and "Brandenburg Concerto No. 3": prediction 0.71. This indicates that the model understood that Darth Vader has little interest in this song. The score was low because Darth Vader's latent factors do not align well with the latent factors of classical music.
- Hermione and "The Imperial March": prediction 0.97. Hermione didn't show much interest in this song (original score 1), and the model predicted a value close to it. This is because Hermione's latent factors indicate a preference for emotional and classical music, but "The Imperial March" is not among her favorites.
- Ripley and "Clair de Lune": prediction 2.02. Even without a prior rating, the model predicted a moderate score. This is because Ripley has shown some interest in soft, emotional music, but "Clair de Lune" is not entirely in line with her core preferences.
- Spock and "Firework": prediction 2.25. Although Spock never interacted with this song, the model predicted limited interest. This is consistent with Spock's latent factors indicating a preference for classical music, while "Firework" is a pop song.
In general, the model makes these predictions by multiplying the latent factors of users and items, adjusted by Stochastic Gradient Descent (SGD). The product of the matrices \(U\) and \(V\) captures the relation between the general preferences of each user and the characteristics of the songs.
As SGD minimizes the error in known predictions, it adjusts the latent factors to correctly reflect individual tastes. This allows the model to generalize and predict gaps in the original matrix, providing personalized music suggestions for each character.
This allows Robbie to suggest new music to each user, personalizing his recommendations.
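As a quick illustration (a minimal sketch, assuming matrix_interactions is indexed by user name and reusing matrix_factorization_result from the training step above), any of these gap predictions can be read directly from the result matrix:

# The rows and columns of the result follow the order of the interaction matrix
i = matrix_interactions.index.get_loc('Darth Vader')
j = matrix_interactions.columns.get_loc('Brandenburg Concerto No. 3')
print(matrix_factorization_result[i, j])  # approximately 0.71 in the run above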
Similarity between users
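Robbie compares users through their vectors of predicted ratings. A standard way to do this is cosine similarity, which measures how much two vectors point in the same direction, regardless of their magnitude; for two users \(u\) and \(v\):

$$\large \text{sim}(u, v) = \dfrac{u \cdot v}{\| u \| \, \| v \|}$$

scikit-learn's cosine_similarity computes this value for every pair of rows of the matrix passed to it.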
from sklearn.metrics.pairwise import cosine_similarity
user_similarity = cosine_similarity(matrix_factorization_result)
df_user_similarity = pd.DataFrame(user_similarity, index=ex.users, columns=ex.users)
df_user_similarity
User | Ripley | Darth Vader | Spock | Hermione |
---|---|---|---|---|
Ripley | 1.0000 | 0.836699 | 0.996773 | 0.795352 |
Darth Vader | 0.836699 | 1.0000 | 0.856497 | 0.976032 |
Spock | 0.996773 | 0.856497 | 1.0000 | 0.805946 |
Hermione | 0.795352 | 0.976032 | 0.805946 | 1.0000 |
- Ripley and Spock (0.996773):
- High similarity, as both have strong preferences for songs like “Bohemian Rhapsody” and “Back in Black”.
- Darth Vader and Hermione (0.976032):
- High similarity, which makes sense, since they both like epic and classical music, like “The Imperial March” and “Hedwig’s Theme”.
- Darth Vader and Spock (0.856497):
- Moderate similarity, because although they have some preferences in common, Spock has a greater focus on classical music.
Recommending new songs to Hermione
After training the model, Robbie is ready to make recommendations! He has looked at the songs Hermione hasn't listened to yet and analyzed other characters whose tastes are similar to hers.
def recommend_by_user(user, df_predicted, df_interactions, top_k=3):
    # Calculate the user similarity matrix from the predicted ratings
    user_similarity = cosine_similarity(df_predicted.values)
    user_similarity_df = pd.DataFrame(user_similarity, index=ex.users, columns=ex.users)
    # Get the top similar users (skipping the user itself)
    similarities = user_similarity_df.loc[user].sort_values(ascending=False)[1:top_k + 1]
    print('User similarity score:')
    print(similarities)
    # Initialize recommendations
    recommendations = pd.Series(0, index=df_predicted.columns, dtype=np.float64)
    # Aggregate recommendations based on similar users, weighted by similarity
    for user_similar, similarity in similarities.items():
        recommendations = recommendations.add(
            df_predicted.loc[user_similar].reindex(recommendations.index, fill_value=0) * similarity,
            fill_value=0
        )
    # Filter out already known items
    already_known = df_interactions.loc[user] > 0
    recommendations = recommendations[~already_known.reindex(recommendations.index, fill_value=False)]
    return recommendations.sort_values(ascending=False)

# Predicted ratings as a DataFrame, with the same user and song labels as the interaction matrix
df_predicted_ratings = pd.DataFrame(matrix_factorization_result,
                                    index=matrix_interactions.index,
                                    columns=matrix_interactions.columns)
recommend_by_user('Hermione', df_predicted_ratings, matrix_interactions)
"""
Output:
User similarity score:
Darth Vader 0.976032
Spock 0.805946
Ripley 0.795352
"""
Recommendations:
Music | Score |
---|---|
Back in Black | 9.905884 |
Brandenburg Concerto No. 3 | 5.532434 |
Robbie realized that the characters most similar to Hermione, based on musical tastes, are:
- Darth Vader (high similarity: 0.97)
- Spock (moderate similarity: 0.81)
- Ripley (moderate similarity: 0.79)
Darth Vader has the tastes most similar to Hermione's, so his preferences carried more weight in the recommendations.
Why was "Black in Black" recommended?
- Hermione has never heard "Black in Black" before (rating 0 in the interaction matrix).
- Robbie saw that Darth Vader and Ripley really liked this music:
- Since Darth Vader has a very high similarity to Hermione, his preferences greatly influenced the recommendation.
- Robbie concluded that Hermione would probably also love this music because of the strong influence of Darth Vader and Ripley.
Why was "Brandenburg Concerto No. 3" recommended?
- Hermione hadn't heard that song either (rating 0 in the interaction matrix).
- Robbie realized that Spock liked this song a lot.
- Additionally, Hermione has shown an interest in classical music (such as "Clair de Lune").
- With Spock's influence and Hermione's past preferences, Robbie concluded that this classical piece would be a good recommendation.
In summary
The model works like a friend (Robbie) who knows your tastes well and compares them with those of other people. He realized that Hermione:
- Shares musical tastes with Darth Vader, which led to the recommendation of a rock song like "Back in Black".
- Has an affinity for classical music, as Spock does, which resulted in the recommendation of "Brandenburg Concerto No. 3".
Robbie now knows exactly how to surprise Hermione with new songs! 🎶
Conclusion
Matrix Factorization stands out as an essential tool for creating personalized recommendations at scale, helping platforms such as streaming services and e-commerce to better understand their users. By decomposing the matrix of interactions, we can identify hidden patterns of preferences, even when there is missing data.
With Stochastic Gradient Descent (SGD), we iteratively adjusted the matrices, reducing the error between predictions and real data. Regularization ensured a balanced model, without overfitting, while data normalization brought more stability to the learning.
In the end, we saw how the model goes beyond direct interactions, predicting preferences even when data is limited. The story of Robbie and his iconic friends showed us that these techniques can be applied to a variety of scenarios, providing unique and personalized recommendations. 🎶
References
- Koren, Yehuda; Bell, Robert; Volinsky, Chris. "Matrix Factorization Techniques for Recommender Systems". IEEE Computer Society, 2009. Link to article
- Ricci, Francesco; Rokach, Lior; Shapira, Bracha. "Recommender Systems Handbook". Springer, 2022. Link to the book
- Hug, Nicolas. "Surprise: A Python Scikit for Building and Analyzing Recommender Systems". Official documentation
- Said, Alan; Bellogín, Alejandro. "Comparative Recommender System Evaluation". Springer, 2014. Link to article