Shopee — Price Match Guarantee

Kaushal Talapady
13 min read · May 31, 2021

Shopee is a Singapore-based e-commerce company that is dominant in the Southeast Asian retail market and has a footprint all over the world.

This blog is about their Kaggle competition, Shopee — Price Match Guarantee, which finished just a few days ago. The main objective of the competition was to build models that help the retailer group the products sold by the various merchants on the website into groups with a common product type and price tag.

The task of grouping products is quite important, since it enhances the user experience of consumers and also helps merchants selling on the platform compete more effectively.

Business Problem

The main problem to solve here is to build a model that performs product matching automatically as soon as a merchant uploads a listing to the website.

There are some constraints on latency and interpretability: the model will be used internally by the organization, and the latency requirement is that it should take at most a few seconds to group the products.

Machine Learning Problem

The machine learning problem here can be interpreted and solved in various ways. First, we can treat it as an ordinary classification problem and use a Siamese network. Another option is similarity learning: instead of classification, we train the model to produce embeddings such that points of the same class are closer to each other than points of different classes.

For the most part I am going to use similarity-learning methods, since my attempts with a Siamese network were troubled by a lack of resources, so I will discuss it only briefly. After getting embeddings through similarity learning, we use a cutoff threshold distance to get the predictions.

For this purpose we can use KNN with metrics like cosine distance and Euclidean distance to compute distances between embeddings, and use a held-out validation set to check the F1-score for each candidate threshold.

Another thing to note is that we have multiple sources of data that can act as good predictors of the output: we have both text data and image data, and each can be used for prediction by employing specific models.

The metric we are going to use for measuring the performance of the model is the row-wise F1-score.
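
As an illustration, here is a minimal sketch of how the row-wise F1 can be computed, assuming that for each row we have the set of true matching posting_ids and the set of predicted posting_ids (the exact competition implementation may differ slightly):

import numpy as np

def row_wise_f1(y_true, y_pred):
    # y_true, y_pred: lists of sets of posting_ids, one set per row
    scores = []
    for true_set, pred_set in zip(y_true, y_pred):
        tp = len(true_set & pred_set)          # true positives for this row
        if tp == 0:
            scores.append(0.0)
            continue
        precision = tp / len(pred_set)
        recall = tp / len(true_set)
        scores.append(2 * precision * recall / (precision + recall))
    return np.mean(scores)                     # mean of the per-row F1 scores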

Exploratory Data Analysis

In this phase we look at the data we have, i.e. the CSV file, the images, etc. We find that the CSV file available to us has 5 columns:

posting_id: the unique id of each product item uploaded to the Shopee website.

image: the image file corresponding to the posted item.

image_phash: the perceptual hash (phash) of the image uploaded for the particular item.

title: the title of the product uploaded by the merchant.

label_group: the group to which each product belongs.

The CSV file has 34K data points, i.e. 34K products uploaded to the website. Along with the CSV file we also have 34K images corresponding to the data points.

Next we look at the label_group column and how the labels are distributed.

The above plot shows the distribution of the various label groups; most of them are repeated only a few times, and the maximum number of times a label is repeated is somewhere around 50.

We also look for duplicates of various types: image only, title only, and both image and title together. We find that there are duplicates in the image column and in the title column separately, but no rows where both the image and the title are duplicated together.

Next we look at the number of times each label is repeated and find that it ranges from 2 to 51.

But the number of labels repeated 51 times is far smaller than the number of labels repeated only twice.

So if we treat this as a classification problem, the number of samples per class is very small. Hence the better approach is to generate embeddings and use thresholding to predict the output.

Lastly we look into the images and find that, unsurprisingly, images with the same label contain similar objects.

Base Models Used

The problem we are solving has multiple data sources, text (the title column) and images, and the types of models used are different for each data source.

Generating Text Embeddings

TFIDF Vectorizer

This is the most basic form of vectorizing text data. First we compute the IDF value of every word present in the text corpus, then create a vector for each document whose dimension is the number of unique words in the corpus. For each word present in the document, we multiply its IDF value by its term frequency, i.e. the ratio of the number of times the word appears in the document to the number of words in the document.

The embedding generated is very sparse and not very handy when you are trying to feed it to a neural network for further training and improvement.

The final step is feeding the TF-IDF vectors to a KNN model.
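
A minimal sketch of this TF-IDF + KNN pipeline with scikit-learn is shown below. The column names come from the competition data; the threshold value is only a placeholder that has to be tuned against the validation F1:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors

df = pd.read_csv("train.csv")

# Sparse TF-IDF vectors for every product title
tfidf = TfidfVectorizer()
text_embeddings = tfidf.fit_transform(df["title"])

# Nearest neighbours in cosine distance
knn = NearestNeighbors(n_neighbors=50, metric="cosine")
knn.fit(text_embeddings)
distances, indices = knn.kneighbors(text_embeddings)

# Keep only the neighbours closer than the chosen threshold
threshold = 0.4  # placeholder, selected by validating the F1-score
predictions = [
    df["posting_id"].values[idx[dist < threshold]]
    for dist, idx in zip(distances, indices)
]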

Word2Vec

Unlike TF-IDF, a W2V embedding is generated for each word, by training a model that uses the context of a word to predict that word. The major advantage is that it retains the semantic meaning of the word in terms of the distances between word vectors.

We can create an embedding for an entire document by using the TF-IDF-weighted W2V technique or by simply averaging all the word vectors in the document.

The major advantage of this technique is that the vectors are much smaller than TF-IDF vectors, and they can finally be fed to a nearest-neighbor algorithm.
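
A rough sketch of the averaged-W2V variant with gensim, reusing the df from the previous sketch (the tokenization and the parameters are deliberately simple and only illustrative):

import numpy as np
from gensim.models import Word2Vec

titles = df["title"].str.lower().str.split().tolist()  # naive whitespace tokenization

# Train Word2Vec on the product titles themselves
w2v = Word2Vec(sentences=titles, vector_size=100, window=5, min_count=1, workers=4)

def title_vector(tokens, model, dim=100):
    # Average the vectors of all tokens present in the vocabulary
    vectors = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

w2v_embeddings = np.vstack([title_vector(tokens, w2v) for tokens in titles])
# w2v_embeddings can now go through the same NearestNeighbors + threshold step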

Multi Layer Perceptron

We can feed the embeddings discussed above into an MLP network and train the model to predict the classes. After training, we remove the top layer of the model; this lets us generate embeddings that are much better for predicting the output with techniques like nearest neighbors.
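
A minimal PyTorch sketch of this idea (the layer sizes and the number of label groups are illustrative, and the training loop is omitted):

import torch.nn as nn

class MLPEmbedder(nn.Module):
    def __init__(self, in_dim, emb_dim=256, n_classes=1000):
        super().__init__()
        # Body that produces the embedding we actually want to keep
        self.body = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim), nn.ReLU(),
        )
        # Top classification layer, used only during training
        self.head = nn.Linear(emb_dim, n_classes)

    def forward(self, x, return_embedding=False):
        emb = self.body(x)
        if return_embedding:      # after training, "remove" the top layer
            return emb
        return self.head(emb)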

BERT

BERT is a transformer-based model that has become the state of the art in the field of Natural Language Processing.

The attention mechanism is the fundamental principle on which BERT is based, and the model is bi-directional, i.e. the output of the model depends on the words at all positions in a document.

We can use models pretrained with masked language modeling (MLM) and fine-tune the weights by feeding the output of BERT to a simple neural network classifier that classifies the output into the various classes, similar to the previous MLP model.
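
As a rough sketch, extracting title embeddings from a pretrained multilingual BERT with the Hugging Face transformers library could look like this (mean pooling over the last hidden state is just one common choice, not necessarily the exact pooling I used):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-uncased")
bert = AutoModel.from_pretrained("bert-base-multilingual-uncased")

def bert_embed(titles, max_len=64):
    # Tokenize a batch of titles and mean-pool the last hidden states
    inputs = tokenizer(titles, padding=True, truncation=True,
                       max_length=max_len, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs).last_hidden_state     # (batch, seq_len, hidden)
    mask = inputs["attention_mask"].unsqueeze(-1)  # (batch, seq_len, 1)
    return (out * mask).sum(1) / mask.sum(1)       # masked mean pooling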

Generating Image Embeddings

To generate image embeddings I have mostly used two types of convolutional neural network.

Efficient Net

EfficientNet is a convolutional neural network whose major innovation is how the model is scaled up from models with a lower number of weights.

In a convolutional neural network, scaling can be done in three ways:

  1. Increasing width of the model
  2. Increasing depth of the model
  3. Increasing resolution of the model

But we cannot arbitrarily increase any one of these without causing problems in terms of performance.

To overcome this problem, a compound scaling coefficient is used, which gives a rough estimate of how much the depth, width and resolution should be changed in order to achieve maximum performance.
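
As a toy illustration, the compound scaling rule from the EfficientNet paper scales all three dimensions with a single coefficient φ, using constants found by grid search (α ≈ 1.2, β ≈ 1.1, γ ≈ 1.15, under the constraint α·β²·γ² ≈ 2):

# Compound scaling: one coefficient phi scales depth, width and resolution together
alpha, beta, gamma = 1.2, 1.1, 1.15  # constants reported in the EfficientNet paper

def scale_factors(phi):
    depth_mult = alpha ** phi        # more layers
    width_mult = beta ** phi         # more channels per layer
    resolution_mult = gamma ** phi   # larger input images
    return depth_mult, width_mult, resolution_mult

print(scale_factors(1))  # roughly how B1 relates to B0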

NFNet

NFNet is a state-of-the-art CNN image classifier whose performance is much better than its nearest competitor, EfficientNet.

Training these networks is also much faster than EfficientNet. NFNet achieves all this by eliminating Batch Normalization, which was used in previous networks and is computationally expensive.

The role played by Batch Normalization is taken over by three new techniques (a simplified sketch of Adaptive Gradient Clipping follows the list):

  • Modified residual branches and convolutions with Scaled Weight Standardization
  • Adaptive Gradient Clipping
  • Architecture optimization for improved accuracy and training speed
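
Below is a simplified, per-tensor sketch of Adaptive Gradient Clipping; the NFNet paper actually applies it unit-wise (per output channel), so this only conveys the idea:

import torch

def adaptive_grad_clip(parameters, clipping=0.01, eps=1e-3):
    # Rescale a gradient whenever its norm exceeds `clipping` times the parameter norm
    for p in parameters:
        if p.grad is None:
            continue
        param_norm = p.detach().norm().clamp(min=eps)
        grad_norm = p.grad.detach().norm()
        max_norm = param_norm * clipping
        if grad_norm > max_norm:
            p.grad.detach().mul_(max_norm / (grad_norm + 1e-6))

# called between loss.backward() and optimizer.step()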

All these CNNs are trained as softmax classifiers on the label group; after training is complete, the softmax layer is removed.
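
With the timm library this train-then-strip-the-head pattern looks roughly like the sketch below (the model name and input size are illustrative; num_classes=0 makes timm return pooled feature vectors instead of class logits):

import timm
import torch

# Backbone that outputs pooled feature vectors instead of class logits
backbone = timm.create_model("efficientnet_b3", pretrained=True, num_classes=0)
backbone.eval()

images = torch.randn(4, 3, 300, 300)     # dummy batch of images
with torch.no_grad():
    image_embeddings = backbone(images)  # shape: (4, feature_dim)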

Model Architectures

Siamese Network

A Siamese neural network consists of two identical base models that generate embeddings, which are fed to a module that computes a distance (like cosine or Euclidean) and then to a sigmoid layer.

The loss used for training the model is called contrastive loss.

Contrastive loss is a popular, widely used loss function. It is a distance-based loss, as opposed to more conventional error-prediction losses, and is used to learn embeddings in which two similar points have a low Euclidean distance and two dissimilar points have a large Euclidean distance.
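
A minimal PyTorch sketch of contrastive loss (the margin value is illustrative):

import torch
import torch.nn.functional as F

def contrastive_loss(emb1, emb2, label, margin=1.0):
    # label = 1 for similar pairs, 0 for dissimilar pairs
    dist = F.pairwise_distance(emb1, emb2)
    loss_similar = label * dist.pow(2)
    loss_dissimilar = (1 - label) * torch.clamp(margin - dist, min=0).pow(2)
    return 0.5 * (loss_similar + loss_dissimilar).mean()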

The major disadvantage of this kind of model is the creation of training examples: which pairs should be used as the dissimilar type? A large number of combinations are available, and selecting randomly is not efficient, since pairs that are too dissimilar are not very useful for the model. Preferably we need to feed pairs of images that are similar but belong to different classes, and creating such training examples is non-trivial.

The prediction-time diagram of the model:

Basemodel and Thresholding

In this technique we train a base model with the ArcFace loss (discussed in the next section) on all the classes available in the data. After training, the top-most layer is removed from the model so that it can generate embeddings.

Similarity values between all the embeddings are then calculated, and the points whose similarity value is higher than a certain threshold are predicted as the output.

Arc Face

The most widely used loss for classification, i.e. the softmax loss, is as follows:
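
L_1 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T}x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T}x_i + b_j}}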

where x_i denotes the feature vector of the i-th sample, and W and b are the weight and bias respectively. The softmax loss does not explicitly optimize the feature embedding to enforce higher similarity for intra-class samples and diversity for inter-class samples, which results in a performance gap for deep face recognition under large intra-class appearance variations (e.g. pose variations, age gaps, etc.).

In the softmax loss above, we fix the bias to 0 for simplicity and then transform the logit as:
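
W_{j}^{T}x_i = \lVert W_j \rVert \, \lVert x_i \rVert \cos\theta_j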

where θ_j is the angle between the weight W_j and the feature x_i. The weight is normalized to 1 using the L2 norm; the feature is also L2-normalized and re-scaled to s. These normalization steps make the predictions depend only on the angle θ between the feature and the weight. The learned embeddings are thus distributed on a hypersphere with radius s:
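
L_2 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos\theta_{y_i}}}{e^{s\cos\theta_{y_i}} + \sum_{j=1, j\ne y_i}^{n} e^{s\cos\theta_j}}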

An additive angular margin penalty m is added between the weight and the feature to enhance intra-class compactness and inter-class discrepancy. Since the proposed additive angular margin penalty is equal to the geodesic distance margin penalty on the normalized hypersphere, it is named ArcFace. The final loss function becomes:
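
L_3 = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i}+m)}}{e^{s\cos(\theta_{y_i}+m)} + \sum_{j=1, j\ne y_i}^{n} e^{s\cos\theta_j}}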

Modeling Process

I first began by building models on the text data available to me, i.e. the title column of the CSV file. I started with basic text embedding techniques like the TF-IDF vectorizer; along with it, a K-Nearest Neighbors model was used to compute distances between all the points, and the threshold was selected by validating the F1-score for various threshold values. After selecting the threshold, the predictions were submitted and got a score of 0.616.

Next I tried W2V embeddings. The process was similar to the one discussed previously, with both TF-IDF-weighted W2V and averaged W2V, but the score was not able to beat the previous one. The main reason for the failure was that the text data contained both English and Indonesian words.

The next step was to build an MLP trained on TF-IDF-weighted W2V, with KNN for computing similarity. The model completely failed to generate any useful predictions; the scores were as low as 0.3.

Next I tried building a Siamese network, but was unable to do so due to my resource constraints, since a Siamese network needs a large set of pairs and takes a long time for even one epoch.

Finally I tried building a BERT model on the text data, but was not able to improve the results. Then I came across the ArcMargin layer in the discussion section of the competition, along with pretrained BERT models like Multilang-BERT and Indonesian-BERT.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginProduct(nn.Module):
    def __init__(self, in_features, out_features, scale=30.0, margin=0.50,
                 easy_margin=False, ls_eps=0.0):
        super(ArcMarginProduct, self).__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.scale = scale
        self.margin = margin
        self.ls_eps = ls_eps  # label smoothing
        self.weight = nn.Parameter(torch.FloatTensor(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

        self.easy_margin = easy_margin
        self.cos_m = math.cos(margin)
        self.sin_m = math.sin(margin)
        self.th = math.cos(math.pi - margin)
        self.mm = math.sin(math.pi - margin) * margin

        self.criterion = nn.CrossEntropyLoss()

    def forward(self, input, label):
        # --------------------------- cos(theta) & phi(theta) ---------------------------
        cosine = F.linear(F.normalize(input), F.normalize(self.weight))
        sine = torch.sqrt(1.0 - torch.pow(cosine, 2))
        phi = cosine * self.cos_m - sine * self.sin_m  # cos(theta + m)
        if self.easy_margin:
            phi = torch.where(cosine > 0, phi, cosine)
        else:
            phi = torch.where(cosine > self.th, phi, cosine - self.mm)
        # --------------------------- convert label to one-hot ---------------------------
        one_hot = torch.zeros(cosine.size(), device=CFG.device)  # CFG is the notebook's global config
        one_hot.scatter_(1, label.view(-1, 1).long(), 1)
        if self.ls_eps > 0:
            one_hot = (1 - self.ls_eps) * one_hot + self.ls_eps / self.out_features

        # apply the margin only to the target class, then re-scale the logits
        output = (one_hot * phi) + ((1.0 - one_hot) * cosine)
        output *= self.scale
        return output, self.criterion(output, label)

The above code shows the ArcMargin layer which I used for training the model. The score improved from 0.616 to 0.661.

After this I started building models on the image data. The first model I built used the EfficientNet architecture and ArcFace for training; as usual, KNN was used for computing similarity. In my first attempt I froze the EfficientNet model and added a couple of layers on top of it, which did not yield a very good result. So next I fine-tuned the complete model on the data, which yielded a better result than the previous attempt: the score improved to 0.709.

Next I came across NFNet, which was faster to train and more accurate than EfficientNet, but the score did not change much from the previous one. Then I found in the discussions that computing similarity with a matrix multiplication (matmul) of the embeddings yields much better results than just using KNN.

# Similarities between all embeddings and the current chunk [a:b]
cnn_cts = torch.matmul(cnn_embeddings_mean_half, cnn_embeddings_mean_half[a:b].T).T
bert_cts = torch.matmul(bert_embeddings_half, bert_embeddings_half[a:b].T).T

for k in range(b - a):
    # Combine CNN and BERT similarities, each scaled by its own threshold
    sim = (cnn_cts[k] / cnn_threshold) ** 6 + (bert_cts[k] / bert_threshold) ** 6
    sim_desc = torch.sort(sim, descending=True)

    # Keep the most similar postings whose combined score exceeds 1
    IDX = sim_desc[1][sim_desc[0] > 1][:max_preds].cpu().detach().numpy()
    o = df.iloc[IDX].posting_id.values

    # If only the item itself matched, optionally fall back to its single nearest neighbour
    if (len(IDX) == 1) and nearlest_one:
        IDX = sim_desc[1][:2].cpu().detach().numpy()
        o = df.iloc[IDX].posting_id.values

I tried using both NFNet and EfficientNet by blending the embeddings generated by them, but this did not produce noticeable changes in the output. However, blending (adding) the embeddings generated from various augmentations of an image did improve the result.

Blending the Indonesian and Multilang BERT embeddings also improved the results.
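
The blending here is nothing more than normalizing and adding the embedding matrices, roughly like the sketch below (the variable names are just placeholders):

import torch.nn.functional as F

def blend(embedding_list):
    # L2-normalize each embedding matrix and add them element-wise
    blended = sum(F.normalize(e, dim=1) for e in embedding_list)
    return F.normalize(blended, dim=1)

# e.g. blend([indonesian_bert_embeddings, multilang_bert_embeddings])
# or   blend([embeddings_aug1, embeddings_aug2, embeddings_aug3])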

The final architecture

The final score I achieved was 0.764, which is in the top 5% of all participants.

Results

The scores I received for the various models I worked on:

Only TFIDF + KNN — 0.616

Only W2V + KNN — 0.4

W2V + MLP + KNN — 0.3

Indonesian BERT + KNN — 0.61

Multilang BERT + KNN — 0.61

Multilang BERT + custom sim — 0.65

EfficientNet + KNN — 0.68

EfficientNet + Multilang BERT + TFIDF + custom sim — 0.71

NFNet + Multilang BERT + TFIDF + custom sim — 0.73

NFNet + Multilang BERT + Indonesian BERT + custom sim — 0.764

The video demo of the deployed model is given below.

Conclusion and Future Work

The model I have deployed is quite decent, but a lot more can be done to improve it, for example by using techniques like Iterative Neighborhood Blending. I also haven't spent much time tuning the thresholds, which is another area to improve.

The GitHub link to the code for this project is given here:

https://github.com/kaushaltalapady/Shoppee_price_garentee/tree/main and also check my LinkedIn account at:

https://www.linkedin.com/in/kaushal-t-10102311b/
