Website matching is a pivotal part of our data and product. When it goes wrong, it can cause cascading levels of data inaccuracy.
Our process, the short version:
We pool a list of potential websites from various sources, which are then scraped for up to 75 pages of website text per domain. We then use a logical scoring process to match a company to a website. The score must exceed a minimum threshold for a match to be made.
We estimate our website matching to be 95% correct.
The long version:
For each company in Companies House, we pool together a list of potential websites from a range of sources. We scrape the potential websites and then run an algorithmic matching process that combines all our sources of data into a best guess website for every company. The best guess website is decided through a logical scoring process which looks for company information within the scraped text. The score must exceed a specific threshold for a match to be made, so the best guess for a company may be no match. Our custom-built web scraper handles redirects elegantly and can crawl a single website in 0.2s.
Back in 2018, our scraper crawled a maximum of 7 pages per domain. This limited our ability not only to match companies accurately but also to classify them correctly. Today, our scraper crawls up to 75 pages per domain. In 2019, we improved the information we collect on companies to support our matching and therefore increase our confidence in each match. Today, we have over 13 reasons why a company may be matched to a website, each with its own weighted score depending on the field.
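As an illustration only, the scoring step described above can be sketched as follows. The factor names, weights and threshold are hypothetical placeholders, not the values used in the platform, and simple substring checks stand in for whatever text matching the real process uses.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical factors and weights (assumptions for this sketch only).
FACTOR_WEIGHTS = {
    "company_number": 10.0,
    "company_name": 5.0,
    "postcode": 3.0,
}
MIN_SCORE = 8.0  # hypothetical minimum threshold


@dataclass
class Company:
    name: str
    number: str
    postcode: str


def score_site(company: Company, site_text: str) -> float:
    """Sum the weights of every factor found in the scraped text."""
    text = site_text.lower()
    evidence = {
        "company_number": company.number.lower() in text,
        "company_name": company.name.lower() in text,
        "postcode": company.postcode.lower() in text,
    }
    return sum(FACTOR_WEIGHTS[f] for f, found in evidence.items() if found)


def best_guess_website(company: Company, scraped: Dict[str, str]) -> Optional[str]:
    """Return the top-scoring candidate domain, or None if nothing exceeds the threshold."""
    best_domain, best_score = None, 0.0
    for domain, text in scraped.items():
        score = score_site(company, text)
        if score > best_score:
            best_domain, best_score = domain, score
    # The score must exceed the threshold, otherwise the best guess is "no match".
    return best_domain if best_score > MIN_SCORE else None


company = Company(name="Acme Widgets Ltd", number="01234567", postcode="LS1 4AP")
candidates = {
    "acme-widgets.example": "Acme Widgets Ltd, registered in England No. 01234567, Leeds LS1 4AP",
    "directory.example": "List of firms: Acme Widgets Ltd, Other Co",
}
match = best_guess_website(company, candidates)
```

In this toy example the company's own site scores on all three factors and is selected, while a directory page that only mentions the name stays below the threshold and would produce no match on its own.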
Throughout the years, we’ve been:
- Improving the quality of the text scraped
- Adding input sources for potential matches
- Improving the methodology
- Increasing the speed for matching
- Increasing the speed for scraping
- Harnessing more information on incorrect matches and/or missing matches
- Adding websites to a blacklist
We constantly check websites in the product, report fixes, and make improvements. We have performed formal QA at various scales.
In 2018, we assessed the false positive rate for a large number of companies during list creation for a specific project. This used a pre-release (v. < 1) version of The Data City platform.
In 2020, we performed a rigorous analysis of website matching quality for 99 companies in a gold-standard AI sector list. In v1.7 we achieved an 85% true positive rate. This increased to 91% in v2.0.
These two previous evaluations differ in important ways.
- Evaluation 1 assessed the quality of website matching in a list produced by The Data City.
- Evaluation 2 assessed the accuracy and coverage of The Data City platform to report websites for known companies.
They were also performed on versions of The Data City database with very different numbers of companies matched to websites.
In the latest version of the product, we carried out our largest ever QA process on website matching quality, using a list produced by The Data City platform. We investigated roughly 7,000 companies in 24 industry verticals of the FinTech sector. We chose this sector as it is particularly volatile.
For ~5,000 companies (72%), we are >99% confident they are correctly matched.
For ~1,100 companies (16%), we are >95% confident they are correctly matched.
For ~900 companies (12%), we are >90% confident they are correctly matched.
We estimate our website matching to be 95% correct.
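As a rough cross-check of the figures above, the three confidence bands can be combined into a single weighted figure. This is a sketch under an assumption that is not stated in the QA itself: each confidence level is treated as a lower bound on the share of correct matches within its band, and the approximate company counts are used as weights.

```python
# (approximate company count, confidence treated as a lower bound) per band
bands = [(5000, 0.99), (1100, 0.95), (900, 0.90)]


def implied_lower_bound(bands):
    """Weighted average of the per-band confidence levels."""
    total = sum(n for n, _ in bands)
    expected_correct = sum(n * p for n, p in bands)
    return expected_correct / total


overall = implied_lower_bound(bands)
print(f"{overall:.1%}")
```

Under that reading the bands imply roughly 97% of the sampled matches are correct, which sits above the 95% headline estimate.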
Aspects to further explore
Some of the elements of the algorithm could do with a little further consideration.
- The scores are somewhat arbitrarily assigned to the reasons (“factors”) for matching. Could/should the scores be adjusted to better represent the factors which are more likely to drive a good match?
- Is the threshold appropriate, or is there a more appropriate or optimum threshold?
- How many companies have an overridden score? Is this the right thing to do?
This could be achieved by exploring whether the data generated as part of the above processes could be employed as training sets to:
- Identify which feature points are the most useful in determining a match (or alternatively in determining a false match), to support a weighting system.
- Explore whether updated matches appeared further down in the list of candidate website matches, and whether an improved weighting system would have selected them as the top match.
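One simple way to start on the first point is to measure, for each factor, how often a candidate match was judged correct when that factor was present. The sketch below assumes a hypothetical input format (the set of factors that fired for a candidate, plus a QA label); the factor names and records are invented for illustration and are not taken from the QA data.

```python
from collections import defaultdict


def factor_precision(labelled):
    """For each factor, the share of QA-reviewed candidates that were correct when it fired.

    labelled: iterable of (set_of_fired_factors, is_correct) pairs.
    """
    fired = defaultdict(int)
    correct = defaultdict(int)
    for factors, is_correct in labelled:
        for factor in factors:
            fired[factor] += 1
            if is_correct:
                correct[factor] += 1
    return {factor: correct[factor] / fired[factor] for factor in fired}


# Invented example records: (factors that fired, was the match correct?)
reviewed = [
    ({"company_number", "company_name"}, True),
    ({"company_name"}, False),
    ({"company_name", "postcode"}, True),
    ({"postcode"}, False),
]
precision = factor_precision(reviewed)
```

Factors with higher precision are candidates for a higher weight. A fuller analysis would also need to account for how often each factor fires and how factors interact, which this sketch ignores.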
Finally, there is also interest in dropping some of the matching sources, if possible. The scope for this will also be explored as part of this research, specifically:
- Quantifying the value of each website match source
Estimating coverage
The UK government releases some key statistics which help us to estimate how many businesses can actually be matched to a website.
We've written a blog about our coverage here.