[Contribution] How BlueBoard Identifies Clients’ Product Pages
Geoffroy is in charge of BlueBoard’s operations. In this article, he explains what steps BlueBoard is taking to ensure the efficiency and reliability of our product matching process.
BlueBoard is an E-Retail Intelligence solution that enables brands and manufacturers to manage the distribution of their products through an online SaaS interface.
If a brand wants to know who sells its products and what their prices are, or check customer reviews and inventories, we can find out by reading the information displayed on the product pages of their various online dealers.
But before we can do that, we must first find the addresses of these pages. That means finding all of our customers’ product pages and being able to say which product is sold on each of them. If a product page is missing, or if we have linked a page to a different product than the one it actually sells, then we haven’t done our job right.
Match products with suggested product pages
Universal product identifiers
After researching various sites, we collect “proposals” (pages that could potentially be the one we are looking for). For each one, we must determine whether it corresponds to the product we’re looking for. The first data that let us assess a product page are universal identifiers, such as the GTIN (EAN, UPC) or SKU. These values are unique and let us confidently link a product and a product page. However, universal product identifiers are present on only 70% of the product pages we analyze.
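As an illustration of how GTIN candidates might be picked out of a page, here is a minimal sketch. It relies on the standard GS1 check-digit rule; the function names `gtin_check_digit_ok` and `find_gtins` are hypothetical and are not BlueBoard's actual code.

```python
import re

def gtin_check_digit_ok(code: str) -> bool:
    """Return True if `code` is an 8-, 12-, 13- or 14-digit GTIN with a valid check digit."""
    if not (code.isascii() and code.isdigit()) or len(code) not in (8, 12, 13, 14):
        return False
    *body, check = (int(ch) for ch in code)
    # GS1 rule: starting from the digit next to the check digit and moving left,
    # the weights alternate 3, 1, 3, 1, ...
    total = sum(d * (3 if i % 2 == 0 else 1) for i, d in enumerate(reversed(body)))
    return (10 - total % 10) % 10 == check

def find_gtins(page_text: str) -> list[str]:
    """Extract standalone digit runs that have a GTIN length and a valid check digit."""
    candidates = re.findall(r"(?<![0-9])[0-9]{8,14}(?![0-9])", page_text)
    return [c for c in candidates if gtin_check_digit_ok(c)]
```

Filtering on the check digit discards most random digit runs (order numbers, phone numbers) before any comparison with the product catalog takes place.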
Contextual product identifiers
For the remaining 30%, we have to rely on contextual identifiers like product name, price, image, and attributes (size, color, etc.). The more product information is available, the easier it is for us to make an identification.
While identification based on universal identifiers works in the majority of cases, issues can still arise:
- A page presents several product references, and we cannot decide between them.
- The e-tailers themselves sometimes make mistakes: we have seen many cases where the SKU and the GTIN indicated by the reseller do not correspond to the same product. In this case, we give preference to the GTIN, which is the more tightly controlled and more discriminating product reference.
- The product page lacks information. On reading it, it appears to correspond to the product being sought, but it could just as easily match other products we have in the database. We can therefore neither confirm the match nor rule it out completely. In this case, we prefer to accept the proposal in case it turns out to be correct.
- A large difference in a product’s price or region (a European SKU detected in Japan, for example) is generally a negative sign. However, this is not enough information to completely ignore the proposal: similar products can be sold at significantly different prices and in unusual areas, like on the gray market.
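The preference for the GTIN when a reseller's SKU and GTIN disagree can be sketched as follows. This is a simplified illustration: `resolve_reference_conflict` is a hypothetical helper, and it assumes each reference has already been looked up in the catalog.

```python
from typing import Optional

def resolve_reference_conflict(gtin_product: Optional[str],
                               sku_product: Optional[str]) -> Optional[str]:
    """Pick the catalog product a page points to when its GTIN and SKU may disagree.

    Each argument is the catalog product the reference resolves to,
    or None if the reference is absent or unknown.
    """
    if gtin_product is not None:
        # The GTIN wins whenever it resolves to a product, even if the SKU disagrees.
        return gtin_product
    # Otherwise fall back on the SKU (which may itself be None).
    return sku_product
```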
Our approach uses product information – going from the most reliable (universal identifiers) to the least reliable (contextual identifiers) – to define the correspondence between a proposal and a product.
- Product reference similarity: we quantify the degree of similarity between the references in the proposal and the product information.
- Title similarity: we look for similarities between the title of the proposal and the titles of the pages we have already validated. We also rely on the product information that the brand gave us.
- Image similarity: we compare the proposal’s product image with the images of the pages we already validated.
Conversely, outlying values for the GTIN, country, price, product name, brand name, etc. can lead us to reject a proposal with a good degree of confidence.
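The article does not say how the three similarity signals are combined. One plausible way, shown purely as a sketch, is a weighted mean in which the more reliable signals count more. The weights and the name `combined_score` are assumptions made for illustration.

```python
from typing import Optional

# Hypothetical weights: the more reliable the signal, the more it counts.
WEIGHTS = {"reference": 0.6, "title": 0.3, "image": 0.1}

def combined_score(signals: dict[str, Optional[float]]) -> float:
    """Weighted mean of the available signals, each expected in [0, 1].

    Signals that are missing (None) are skipped and the remaining
    weights are renormalized.
    """
    available = {k: v for k, v in signals.items() if v is not None and k in WEIGHTS}
    if not available:
        return 0.5  # no evidence either way
    total_weight = sum(WEIGHTS[k] for k in available)
    return sum(WEIGHTS[k] * v for k, v in available.items()) / total_weight
```

Renormalizing over the available signals means a page without an image is judged on its references and title alone, rather than being penalized for the missing data.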
Product reference similarity
BlueBoard’s algorithms analyze all the visible and hidden GTINs present in these proposals. For example, the presence of the product’s GTIN in a place where the user can see it on the page greatly increases the probability that we’re on the page we’re looking for. Conversely, a page that contains a large number of GTINs, or GTINs that appear only in hidden code, does not allow us to decide.
We’re able to assess the presence of the product’s SKU on the page in the same manner.
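The reasoning about visible and hidden references can be sketched as a small classifier. This is an illustration only: the three labels, the `max_gtins` threshold, and the function name `gtin_evidence` are assumptions, not values taken from BlueBoard's system.

```python
def gtin_evidence(target: str, visible: set[str], hidden: set[str],
                  max_gtins: int = 3) -> str:
    """Classify the GTIN evidence on a proposal as 'strong', 'undecided' or 'none'.

    `visible` holds GTINs shown to the user, `hidden` those found only in the code.
    """
    all_found = visible | hidden
    if target not in all_found:
        return "none"
    if len(all_found) > max_gtins:
        # Too many references on the page (a listing or comparison page, say).
        return "undecided"
    if target in visible:
        return "strong"
    # The GTIN appears only in hidden code.
    return "undecided"
```

The same function could be reused for SKUs by passing the SKUs found on the page instead of GTINs.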
Title similarity
As an example, consider a product for which three product pages have already been matched in the Netherlands:
JBL Flip 5 Blauw on Artencraft.nl
JBL portable speaker FLIP 5 (Blauw) on Bcc.nl
JBL Flip 5 – Draadloze Bluetooth Speaker – Blauw on Bol.com
The SKU of the product we’re looking for is JBLFLIP5BLU. It’s a universal ID for our product, and its presence in the title of the first proposal’s product page lets us give that proposal a very high score.
For the second proposal, which is Dutch, the title will score high on the comparison with the product name because we find all of its tokens there: [JBL, Flip, 5]. In addition, its score will increase further thanks to the complementary similarities with the product pages already accepted in the same country. Here, that is the token [Blauw], which is present in all the titles of the Dutch product pages already matched.
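The token comparison described above can be sketched as follows. The 80/20 split between product-name tokens and country-specific tokens, the tokenizer, and the proposal titles in the usage example are all assumptions made for illustration.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase alphanumeric tokens of a title."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def title_score(proposal_title: str, product_name: str,
                matched_titles: list[str], bonus: float = 0.2) -> float:
    """Score in [0, 1]: coverage of the product-name tokens, plus a bonus for
    tokens shared by every already-matched title in the same country."""
    proposal = tokens(proposal_title)
    name = tokens(product_name)
    base = len(name & proposal) / len(name) if name else 0.0
    # Tokens common to all matched titles that are not part of the product name.
    shared = set.intersection(*(tokens(t) for t in matched_titles)) - name if matched_titles else set()
    extra = len(shared & proposal) / len(shared) if shared else 0.0
    return (1 - bonus) * base + bonus * extra

DUTCH_MATCHES = [
    "JBL Flip 5 Blauw",
    "JBL portable speaker FLIP 5 (Blauw)",
    "JBL Flip 5 – Draadloze Bluetooth Speaker – Blauw",
]
```

With these three matched titles, the only extra token shared by all of them is `blauw`, so a hypothetical proposal titled "JBL Flip 5 speaker Blauw" gets the full score, while "JBL Flip 5 Rood" only gets the product-name part.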
Image similarity
Since the images used on the sites may differ, we’ve chosen an indirect method of assessing image similarity. For each product, we take the images from the product pages that have already been matched and compare them with the images from the proposals. The more similar a proposal’s image is to one of the images on those product pages, the higher the score.
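The article does not name the image comparison technique. As one possible illustration, the sketch below uses a simple average hash on small grayscale grids and keeps the best similarity against any already-matched image. It assumes the images have been downscaled to the same size beforehand; all function names are hypothetical.

```python
def average_hash(pixels: list[list[int]]) -> list[bool]:
    """One bit per pixel: True where the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [p > mean for p in flat]

def hash_similarity(a: list[bool], b: list[bool]) -> float:
    """Fraction of matching bits between two hashes of equal length."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def image_score(proposal_pixels: list[list[int]],
                matched_images: list[list[list[int]]]) -> float:
    """Best similarity between the proposal image and any already-matched image."""
    if not matched_images:
        return 0.0
    proposal_hash = average_hash(proposal_pixels)
    return max(hash_similarity(proposal_hash, average_hash(m)) for m in matched_images)
```

Taking the maximum rather than the mean reflects the idea that a proposal only needs to resemble one of the validated images, since resellers often use different shots of the same product.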
Identification of negative indicators
For this final step, the primary objective is to reject proposals. Just as our job is to accept correct proposals as fast as possible, we must reject outliers just as quickly so that we can focus our efforts on ambiguous proposals.
The negative indicators that can lead to proposal rejection are the following:
- Other GTINs found (especially if they are visible on the page and not hidden in the code).
- Other SKUs of the same brand found in the product page title.
- A significant deviation from the prices charged on other sites for the same product.
- The proposal was found in a geographical area outside the scope of the product (in the case of regional products).
The last two indicators are strong, but cannot be considered decisive. They will weigh more than other discriminating indicators.
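The four negative indicators can be collected with a helper like the one below. It is a sketch under assumptions: the price gap is measured against the median price seen elsewhere, the 50% threshold is arbitrary, and the names are hypothetical. How each indicator is then weighted is not specified in the article and is left out.

```python
from statistics import median

def price_deviation(price: float, reference_prices: list[float]) -> float:
    """Relative gap between a proposal's price and the median price seen elsewhere."""
    ref = median(reference_prices)
    return abs(price - ref) / ref

def negative_indicators(*, other_gtins_visible: int, other_brand_skus_in_title: int,
                        price: float, reference_prices: list[float],
                        country: str, allowed_countries: set[str],
                        max_price_deviation: float = 0.5) -> list[str]:
    """Return the names of the negative indicators triggered by a proposal."""
    found = []
    if other_gtins_visible > 0:
        found.append("other_gtin")
    if other_brand_skus_in_title > 0:
        found.append("other_sku_in_title")
    # Skip the price check when no reference price is known yet.
    if reference_prices and price_deviation(price, reference_prices) > max_price_deviation:
        found.append("price_deviation")
    # An empty set means the product has no regional restriction.
    if allowed_countries and country not in allowed_countries:
        found.append("out_of_region")
    return found
```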
The results are expressed as a percentage, on a scale from Good to Bad. From there, we decide between a match, an incompatibility, or manual validation:
- If a match has more than 90% compatibility, it will be accepted. False positives can be identified later by our quality assurance system.
- If a match is more than 90% incompatible, it will be rejected. Rejected proposals can be reassessed with each change in the comparison factors, which lets us catch false negatives.
- The rest are accepted or rejected by human judgment.
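These three rules can be written as a small decision function. The sketch assumes a single compatibility score between 0 and 100, with incompatibility taken as its complement; the function name `decide` is hypothetical.

```python
def decide(compatibility: float) -> str:
    """Route a proposal based on its compatibility score in [0, 100].

    Assumption: incompatibility is 100 - compatibility.
    """
    if compatibility > 90:
        return "accept"   # automatic acceptance, later checked by quality assurance
    if 100 - compatibility > 90:
        return "reject"   # automatic rejection, may be reassessed later
    return "manual"       # ambiguous: left to human judgment
```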
Of every 100 proposals today, approximately 10 are accepted automatically, 30 are processed manually and 60 are rejected automatically.
Thanks to this system, we have significantly increased both the quality of data and its processing speed.