The Data City's classification technique relies on the website text of a training set of companies. Companies like those we want to identify more of are selected, added to the training set, and labelled as includes; companies unlike those we want to identify more of are labelled as excludes. The classification process takes all words and word pairs (tokens) from the websites of the companies in this training set and represents each company's website as a normalised vector (length 1) of the frequencies of these tokens. Each company's vector is multiplied by a classifier vector made up of all the tokens, with each token carrying a variable score. These word scores are shown in the product UI.
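The vector construction and scoring described above can be sketched as follows. This is an illustrative reconstruction, not The Data City's actual implementation; the function names and the whitespace tokeniser are assumptions for the example.

```python
import math
from collections import Counter

def tokens(text):
    """Illustrative tokeniser: all words and word pairs from website text."""
    words = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams

def unit_frequency_vector(text):
    """Represent a website as a length-1 (L2-normalised) token-frequency vector."""
    counts = Counter(tokens(text))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def score(company_vector, classifier_vector):
    """Multiply a company's unit vector by the classifier vector (dot product)."""
    return sum(w * classifier_vector.get(t, 0.0) for t, w in company_vector.items())
```

A company whose site frequently uses tokens with positive classifier scores ends up with a positive overall score, and vice versa, which is what makes the per-token scores in the UI directly interpretable.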
The token weights in the classifier vector are varied until the two sets of companies in the training set (includes and excludes) are most highly separated, with as many of the includes as possible scoring above 0 and as many of the excludes as possible scoring below 0. This classifier vector is then used to score all website-matched companies for which we have website text. Because of the huge search space, this optimisation step can never be performed exactly, but it can be performed efficiently, and with reproducible results in almost all cases, using a number of well-developed and tested algorithms.
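One well-known way to vary the weights until the two sets are separated is the perceptron update rule; the sketch below uses it purely for illustration and is not the custom algorithm described in this document. The function name and the learning-rate and epoch parameters are assumptions.

```python
def fit_classifier(include_vecs, exclude_vecs, epochs=100, lr=0.1):
    """Perceptron-style sketch: nudge token weights until the includes
    score above 0 and the excludes below 0 (when the data allows it)."""
    weights = {}
    labelled = [(v, 1) for v in include_vecs] + [(v, -1) for v in exclude_vecs]
    for _ in range(epochs):
        updated = False
        for vec, label in labelled:
            s = sum(w * weights.get(t, 0.0) for t, w in vec.items())
            if label * s <= 0:  # company on the wrong side of 0 (or on it)
                for t, w in vec.items():
                    weights[t] = weights.get(t, 0.0) + lr * label * w
                updated = True
        if not updated:  # every training company correctly separated
            break
    return weights
```

On separable training sets this converges to a weight vector that puts all includes above 0 and all excludes below it; in practice, regularised optimisers are preferred because perfect separation is rarely possible and raw perceptron weights are sensitive to presentation order.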
To achieve the speed and explainability of our AI system, which are essential for allowing sector experts to iteratively improve lists, we have developed our own custom algorithms for this purpose. Our ongoing R&D work focuses on creating more detailed automated quality assessments of our lists, broadening the results so we miss fewer companies in every sector, and reporting a confidence score for every company's inclusion in a given sector.