Catalogue deduplication
Application Track:
Proposed by:

Summary of the entity:
Sonae MC, the market leader in food retail in Portugal, is owned by SONAE, a large conglomerate with significant operations in multiple sectors, including Retail (food and non-food), Financial Services, Shopping Centers, Telecom and Retail Properties. Dott is a 100% Portuguese digital marketplace, also part of the SONAE group.
Summary of the challenge:
The objective is to significantly improve customer experience by making website navigation more consistent and helping customers find better deals, while working under real-world constraints.
Description of the global challenge:
COVID-19 has changed shopping behaviour, increasing demand for e-commerce and digital solutions. Dott is the largest marketplace in Portugal, offering a diverse range of products (food, non-food, fashion, etc.). Following the growth in eCommerce demand, it is constantly expanding and further developing its product line. In little over a year, the product range has grown from 0 to >2M products.
A major challenge of this growth is the diversity of our suppliers' input, which makes product aggregation and deduplication difficult. There is room to improve the catalogue import process so that it determines whether an incoming product is in fact 'new' to the catalogue or should be grouped with an already existing product.
This challenge gives you the opportunity to significantly improve customer experience, as website navigation will be more consistent and better deals will be found, while working under real-world constraints.
Sub-challenges composing this experiment:
This challenge is composed of 2 sub-challenges:
- Offline Product Labelling (REACH-2020-READYMADE-SONAE_2.1)
- Online Product Labelling (REACH-2020-READYMADE-SONAE_2.2)
Expected global results:
At the end of the project, SONAE expects the following (AI and Data Science) deliverables and results:
- Increase the coverage rate (the share of products that have siblings) by 20 p.p., subject to approval by an internal jury.
- Performance: any solution to the challenge must be able to label in real time at a rate no less than 1000 products per minute.
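As a minimal sketch of how the coverage target could be tracked, the snippet below computes the share of products that have at least one sibling and the change between two catalogue snapshots in percentage points. It assumes each product carries a group identifier after aggregation; the function names and the toy data are hypothetical and not part of the challenge specification.

```python
from collections import Counter


def coverage_rate(group_ids):
    """Return the share of products whose group holds at least one other product."""
    if not group_ids:
        return 0.0
    sizes = Counter(group_ids)
    with_siblings = sum(1 for g in group_ids if sizes[g] > 1)
    return with_siblings / len(group_ids)


def gain_in_pp(before, after):
    """Difference between two rates, expressed in percentage points."""
    return (after - before) * 100


# Toy snapshots: one group id per product (illustrative values only).
before = ["a", "a", "b", "c", "d"]  # 2 of 5 products have a sibling
after = ["a", "a", "b", "b", "d"]   # 4 of 5 products have a sibling

print(f"coverage before: {coverage_rate(before):.0%}")
print(f"coverage after:  {coverage_rate(after):.0%}")
print(f"gain: {gain_in_pp(coverage_rate(before), coverage_rate(after)):.1f} p.p.")
```

Note that a gain in percentage points is an absolute difference between two rates, not a relative increase: moving from 30% to 50% coverage is a gain of 20 p.p.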
The following business goals should be measured:
- Average ‘purchase time’ and ‘search time’ should be reduced by at least 10%.
- Increase conversion rate by 5%.
Summary of the sub-challenge:
The objective is to find existing duplicates in the database and propose adequate product aggregations, which will be judged by a jury, in order to increase the share of products that have a sibling by 20 p.p.
Description of the challenge:
The growth in eCommerce demand has brought a number of challenges to Dott, which has been expanding and further developing its product line. In little over a year, the product range has grown from 0 to >2M products, supplied by around 2,000 different sellers.
Many suppliers import products that others have already imported and that are available in Dott’s catalogue. In such cases, new products should not be created; instead, the imported products should be aggregated into the existing one. This is particularly challenging when a product has variants (e.g. one shirt is white and the other blue). Since these are the same product in different colours, they should be aggregated under the same product.
The goal is to find existing duplicates (duplicate products or the same product with variants) in the database and propose adequate product aggregations.
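One possible way to approach this, shown purely as an illustrative sketch, is to normalise product titles, drop words that only describe a variant, compare the remaining token sets with Jaccard similarity, and merge matching products with a union-find structure. Everything here is an assumption: the `VARIANT_WORDS` list, the `threshold` value and the use of titles alone are hypothetical choices, not Dott's method or data model.

```python
import re
from itertools import combinations

# Hypothetical variant vocabulary; a real catalogue would need a richer attribute model.
VARIANT_WORDS = {"white", "blue", "black", "red", "s", "m", "l", "xl"}


def tokens(title):
    """Lowercase alphanumeric tokens of a title, without variant-only words."""
    words = re.findall(r"[a-z0-9]+", title.lower())
    return frozenset(w for w in words if w not in VARIANT_WORDS)


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def propose_groups(titles, threshold=0.8):
    """Return lists of indices of titles proposed as the same product."""
    toks = [tokens(t) for t in titles]
    parent = list(range(len(titles)))
    for i, j in combinations(range(len(titles)), 2):
        if jaccard(toks[i], toks[j]) >= threshold:
            parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(titles)):
        groups.setdefault(find(parent, i), []).append(i)
    # Only groups with at least two members are aggregation proposals.
    return [members for members in groups.values() if len(members) > 1]


titles = [
    "Acme Cotton Shirt White",
    "ACME cotton shirt - blue",
    "Acme Cotton Shirt",
    "Bolt Steel Kettle 1.7L",
]
print(propose_groups(titles))
```

The pairwise comparison is quadratic in the number of products, so on a catalogue of more than 2M products it would need a blocking step (for example, comparing only products that share a brand or category) before any similarity is computed.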
Data to be used:
NOTE: The applicant could also use external catalogues such as Icecat free tier.
Expected outcomes:
- Increase the coverage rate (the share of products that have siblings) by 20 p.p.
- Average ‘purchase time’ and ‘search time’ should be reduced by at least 10%
- Increase conversion rate by 5%
Summary of the sub-challenge:
The goal is to develop an architecture that can label products at a throughput of 1000 products per minute.
Description of the challenge:
Dott, like any marketplace, is a fixed-cost type of business, meaning that it benefits from increased volume and scalability. The more it can do with the existing structure the better, especially if it is to grow at three-digit rates every year. If Dott continues to grow and expand the number of sellers, imported catalogues and products as expected, it will need a proactive, online, automatic process to label products and avoid the duplication of products.
The goal is to develop an architecture that allows products to be automatically labelled.
Data to be used:
NOTE: The applicant could also use external catalogues such as Icecat free tier.
Expected outcomes:
- To get a solution that labels an incoming catalogue in a reasonable amount of time.
- To handle a throughput of 1000 products per minute to be labelled.
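To make the throughput target concrete: 1000 products per minute leaves a single sequential worker an average budget of 60 ms per product. The sketch below, under stated assumptions, labels each incoming product either as a member of an existing group or as a new product, using an in-memory dictionary keyed by a normalised title, and measures the achieved rate. The `OnlineLabeller` class, the `blocking_key` function and the exact-key matching rule are hypothetical simplifications; a real solution would need fuzzy matching and persistent storage.

```python
import re
import time

TARGET_PER_MINUTE = 1000
# Average time budget per product for one sequential worker, in milliseconds.
BUDGET_MS = 60_000 / TARGET_PER_MINUTE


def blocking_key(title):
    """Hypothetical key: the sorted, de-duplicated lowercase tokens of the title."""
    return " ".join(sorted(set(re.findall(r"[a-z0-9]+", title.lower()))))


class OnlineLabeller:
    """Assigns each incoming title to an existing group or opens a new one."""

    def __init__(self):
        self._groups = {}  # blocking key -> group id

    def label(self, title):
        """Return (group_id, is_new) for one incoming product title."""
        key = blocking_key(title)
        is_new = key not in self._groups
        if is_new:
            self._groups[key] = len(self._groups)
        return self._groups[key], is_new


def products_per_minute(labeller, titles):
    """Label every title once and return the measured throughput."""
    start = time.perf_counter()
    for title in titles:
        labeller.label(title)
    elapsed = time.perf_counter() - start
    return len(titles) / elapsed * 60 if elapsed > 0 else float("inf")


labeller = OnlineLabeller()
print(labeller.label("Acme Shirt"))   # first time this key is seen
print(labeller.label("shirt ACME"))   # same tokens, so same group
print(labeller.label("Bolt Kettle"))  # different tokens, so a new group

rate = products_per_minute(OnlineLabeller(), [f"Product {i}" for i in range(3000)])
print(f"budget per product: {BUDGET_MS:.0f} ms, measured rate: {rate:.0f} products/min")
```

A dictionary lookup is far cheaper than the budget allows, so in practice the per-product time would be dominated by whatever similarity model and storage layer replace the exact-key match.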
How do we apply?
Read the Guidelines for Applicants
Doubts or questions? Read more about REACH on the About Us page,
have a look at our FAQ section or drop us an email at