How do we build RTICs?

RTICs are a pivotal part of our platform. Find out more about the build process carried out by our in-house team of experts

Table of contents

  1. The short version
  2. The long version
    1. Taxonomies
    2. Production
    3. Quality Assurance
    4. Caveats

The Short Version

Our Real-Time Industrial Classifications (RTICs) are built by our in-house data experts, often in collaboration with industry experts. 

This build process follows three key stages which include taxonomies, production and quality assurance. 

Taxonomies

RTICs are based on taxonomies, which define industrial verticals and include relevant keywords.

They are typically developed in collaboration with industry experts.

For example, the 'Space' RTICs were created with input from the Satellite Applications Catapult

Here is an example taxonomy from our Artificial Intelligence RTIC:

rtic-taxonomy-1

Keyword structure: The keywords generally follow Boolean logic. When developing keywords users can incorporate "" , AND , OR operators for optimal results. 

Production

We then take this taxonomy and build lists for each of the verticals. This is an iterative process that involves continually adding to the training set.

During this iterative process, the lists are reviewed by experts throughout until we reach a point where we're both happy with their accuracy.

Training set selection: Caution must be taken when selecting large companies with diverse activities in the training set to avoid bias in the algorithm from irrelevant website data.

Quality Assurance

Lastly, we have rigorous quality assurance. We not only check the accuracy of the URLs, but also the machine learning. We do this at both the RTIC and vertical level.

RTICs then undergo an annual review (often biannual) in which we review the training set and cleanse the lists.

Reporting errors / feedback: Users also have the option to provide their feedback or flag companies in an RTIC using the callout available in the company information next to the RTIC details.

The Long Version

Real-Time Industrial Classifications, or RTICs, are datasets that represent the emergent economy.

They are built with TDC’s machine learning and website scraping technology in the Data Explorer platform. In short, RTICs group companies that describe their activity similarly in their website text, allowing the user to make custom sectoral lists.  

Taxonomies

The first thing needed to build an RTIC is a sector taxonomy. This is a key resource that informs the machine learning classification of companies. In this context, a taxonomy is a framework that identifies industry verticals within a sector. The aggregated view of all companies captured in each industry vertical builds the resulting RTIC.  

Industry verticals are defined by keywords and key phrases that represent their activity, specialised technologies or role in a supply chain. The resulting framework and language guide the selection of company websites used to train the machine learning algorithm and subsequent iterations of the exercise.  

There are four steps in the process of taxonomy creation. 

  1. Desk research: existing literature regularly gives a good insight into current developments and major verticals of activity in industrial sectors. It also contributes to the identification of key technical language. 
  2. Website discourse analysis: diving into a few company websites is important. Usually, academic literature or expert reports describe processes in a different way than companies when addressing a wide audience. Understanding how companies use sector-specific language is crucial to identify the most representative keywords for an economic sector.  
  3. Choosing an approach: When it comes to categorizing sectors, there are various methods to choose from. Here are four main approaches we have identified.
    1. Technology Approach : the verticals are defined according to the enabling technologies companies use. 
    2. Applications Approach : Classification of verticals according to the market applications of products, technology, and/or services.
    3. Functional Approach: Define verticals based on a company's activity within a sector's supply chain.
    4. Services Approach: the verticals are defined according to the services companies provide to the sector.

  4. Selecting company websites for the training sets :this stage involves the selection of the first 10/20 company websites to train the algorithm. These can be provided by experts, existing databases and/or found through a combination of keyword searches. 

Production 

We create one machine learning list for each vertical identified in the taxonomy. This is an iterative process that requires: 

  • Updating the training data: the algorithm needs different training data at different stages. Mostly, this refers to the addition of company websites that do not represent the targeted industry vertical of activity to remove false positives. 
  • Expert review: the different iterations’ outputs are reviewed by an industry expert to confirm the data is representative. The expert review can lead to the application of changes to the training data. 
  • Publication of the data: once the machine learning lists have been produced and signed off, they are published on the platform as an RTIC. The raw data is downloadable in .xlsx, JSON and .csv formats. In addition, the platform also provides summary results and processed insights of the raw data, also downloadable in .csv format. 

CICs: The lists can also be published as a Custom Industrial Classification (CIC).

Quality Assurance

The Data City’s quality assurance processes are applied to ensure data integrity and robustness. TDC implements a manual quality assurance process before the publication of RTICs.

The process consists in manually checking 30 random companies from each subcategory identified in the taxonomy. The manual check includes a review of: 

  • Accuracy of the URL-company match: The Data City manually checks whether the URL assigned to a particular company is correct. This is done by comparing the company name, company postcodes and director details.  
  • Accuracy of the machine learning classification method at the RTIC level: The Data City manually checks whether the company that has been classified in an RTIC provides products and/or services for the sector. In short, it implies checking whether the captured companies work in the sector. This is corroborated by the companies’ website information.  
  • Accuracy of the machine learning classification method at the vertical level: manual check of the activity of the companies captured in each vertical. 

Through this process, The Data City can estimate the accuracy of its datasets. The Data City’s published data has a minimum confidence level of 90%. Updates to RTICs are done annually, with certain key ones refreshed bi-annually. 

RTIC updates: Learn more about the RTIC update process here.

Caveats of RTIC methodology

RTICs are a novel data product enabled by supervised machine learning technology. Several aspects of the RTIC methodology set limitations as to which companies are captured and how they are classified. The main elements to consider regarding RTIC caveats are:

  • RTICs classify only companies with a URL match. The algorithm needs website text data to classify companies because language is the main criterion for classification. Therefore, companies with no URL match cannot be captured in an RTIC. However, it is important to note that RTICs tend to focus on new sectors where websites are a market need for companies.
  • There may be up to 10% of false positives in the data. Supervised machine learning technology creates sophisticated classifications that consider large numbers of keywords and key phrases. It enables finding companies in the same field even if they do not use the exact same language, overcoming the limitations of simple keyword filtering. This sometimes leads to the inclusion of edge cases in the final list, which could be referred to as false positives. The QA process limits the inclusion of this type of company in the list and user feedback helps to improve the quality of the lists published on the platform over time.
  • Companies may change their website text. The Data City re-scrapes companies' websites quarterly. The classification of companies in one RTIC could potentially be affected by changing website text.