Using our data safely

There are many things that you should bear in mind when using our data, including some limitations.

Summary

Some of the issues here relate to incorrect or limited primary data. For example, employee counts may be reported incorrectly from source, or not provided. These issues and errors are compounded when this data is used to generate secondary metrics. For instance, the range of issues associated with employee data will cause subsequent issues with estimated turnover data and GVA data.

Some of the issues relate to other core data points, such as URL matches, or company locations. These issues and errors are compounded when these data points are used to filter, assign, or generate secondary data points. For example, mismatched URLs can lead to attribute data being incorrectly linked to the wrong company, including location information and webtext. This can subsequently lead to the company being incorrectly selected in location filters and/or the company being incorrectly classified in RTICs and/or allocated an incorrect Innovation Score.

Some of the issues are linked to our inability to attribute data across locations and/or sector activity in a way which reflects reality. This includes our inability to differentiate between UK and non-UK employees, which is linked to the general issues with employee counts. This can lead to incorrect or misleading representations of the data both when a split has been applied (usually an even split, e.g. where employee counts or turnover are attributed evenly to each of a company’s locations) and when no split is applied (e.g. when all employees are assigned to every RTIC that the company is associated with). The latter can also lead to data duplication, depending on how the data is used, especially when generating comparisons between RTICs.

We also inherit some other uncertainties and errors from our data providers. For example, location information provided by CreditSafe is of unknown accuracy; Dealroom’s ability to identify investments has improved over time, which introduces uncertainty when taking a timeseries view; and Lightcast data might contain duplicates. Matching third-party data to our own data using our core attribute, the Companies House ID, can also introduce uncertainty or data duplication, e.g. when matching Lightcast IDs to Companies House IDs.

Some issues are generated by the way that we aggregate, summarise, or simply display our data in the platform. These could lead to misunderstanding or misuse of our data. Some examples:

  1. Not all data is treated the same when filtering by location. Some data is split evenly across a company's locations. For other data, the company's total is taken if the company has at least one location in the selected geographic filter, or if its registered address is in the selected geographic filter.
  2. In the RTICs and Sectors section of ANALYSE, total company turnover is allocated to every Vertical that the company is associated with. This leads to duplication:
    1. Within an RTIC total, and/or
    2. Across RTIC totals, where a company is classified in multiple Verticals/RTICs.
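
The duplication across RTIC totals described in point 2 can be illustrated with a minimal Python sketch. The records, field names, and figures are hypothetical and are not drawn from the platform; the sketch only assumes that each company's full turnover is allocated to every RTIC it is classified in, as described above.

```python
# Hypothetical records: each company carries its full turnover and the RTICs it is classified in.
companies = [
    {"id": "A", "turnover": 100.0, "rtics": ["AI", "FinTech"]},
    {"id": "B", "turnover": 50.0, "rtics": ["AI"]},
    {"id": "C", "turnover": 30.0, "rtics": ["FinTech"]},
]


def rtic_totals(records):
    """Total turnover per RTIC, giving each company's full turnover to every RTIC it is in."""
    totals = {}
    for rec in records:
        for rtic in rec["rtics"]:
            totals[rtic] = totals.get(rtic, 0.0) + rec["turnover"]
    return totals


def deduplicated_total(records, rtics):
    """Turnover of companies in any of the given RTICs, counting each company once."""
    wanted = set(rtics)
    return sum(rec["turnover"] for rec in records if wanted & set(rec["rtics"]))


totals = rtic_totals(companies)
naive_sum = sum(totals.values())  # company A is counted twice
unique_sum = deduplicated_total(companies, ["AI", "FinTech"])  # company A is counted once
print(totals, naive_sum, unique_sum)
```

In this toy data, adding the two RTIC totals gives 280, while counting each company once gives 180. The difference is company A's turnover, which appears in both RTIC totals.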

Creditsafe Data

Employee data can be incorrect

Some employee data is entered incorrectly at source. For example, GILLARDS FARMS LIMITED (00981261) reports 1,299,450 employees in its 2021 accounts filed with Companies House. This is incorrect and is the same value as it submits as its ‘net worth’.

Other data is misprocessed by CreditSafe. For example, the 2019 financial report that CHOUDHURYS VENTURES LTD (08238569) filed with Companies House reports 2 employees, but CreditSafe reports 61k.

Guidance: We recommend manually checking for outliers like this and removing them prior to analysis.

A great way to do this is to view the list in EXPLORE and sort by turnover or employees highest to lowest. Errors in larger companies will have a larger effect on analysis.
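
For a downloaded list, the same check can be sketched in a few lines of Python. This is an illustration only: the record layout, company names, and the `largest_first` helper are hypothetical, and the sketch simply surfaces the largest values for manual review rather than deciding which ones are wrong.

```python
def largest_first(records, field, top_n=5):
    """Return the top_n records with the highest value of `field`, skipping missing values."""
    present = [r for r in records if r.get(field) is not None]
    return sorted(present, key=lambda r: r[field], reverse=True)[:top_n]


# Hypothetical download; the second record mimics an implausible, mis-entered headcount.
companies = [
    {"name": "Example Co A", "employees": 12},
    {"name": "Example Co B", "employees": 1_299_450},
    {"name": "Example Co C", "employees": None},
    {"name": "Example Co D", "employees": 340},
]

# Print the largest reported headcounts so they can be checked against the filed accounts.
for rec in largest_first(companies, "employees", top_n=3):
    print(rec["name"], rec["employees"])
```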

Employee data is sometimes global

International businesses often report their global headcount as their employee count to Companies House. Whilst the number may be correct, it is important to understand that many, perhaps most, of these employees may be based outside of the UK.

This can be seen by looking at the employee count for companies within the SIC code: Activities of head offices (70100).

We are not able to identify where company reported employees are based.

Guidance: We recommend manually checking for outliers like this and removing them prior to analysis.

A great way to do this is to view the list in EXPLORE and sort by turnover or employees highest to lowest. Larger companies are more likely to have global operations, and errors in larger companies will have a larger effect on analysis.

You may also want to be precise in your wording of what is being reported, e.g. "global employees of UK companies" rather than "UK employees". This will require investigation.

ANALYSE reports more employees than the working age population

When analysing all UK companies, our product reports more employees than the UK working age population. This is mostly because some companies report global employees. Other possible causes include companies within a group structure double counting a single employee, and inconsistent reporting of full-time equivalent and part-time workers.

Guidance: Manually checking the largest companies will reduce distortion.

Some company operating addresses are incorrect or missing

A UK company is only required to register one address with Companies House. This address does not have to be an operating address, but we assume that it is. We estimate additional operating addresses for companies from their website text and other sources, e.g. CreditSafe.

We can never be certain we have identified all operating addresses. We can never be certain the identified operating addresses are correct.

Guidance: Be aware of potentially missing or incorrect addresses when analysing the data. Include a caveat in your report about the accuracy of our addresses. If reported values seem unusual, consider this to be a possible cause.


Lightcast Data

Multiple companies, one Lightcast ID

One Lightcast ID can match to multiple companies of ours. Multiple formal entities in our company database could be assigned to the same informal company recorded from Lightcast. Crucially, this would lead to duplicate counting of the same job posting.

You can read more about the implementation of Lightcast data here.

At this stage, we do not split the job postings across all companies which are allocated to the same Lightcast ID.

Guidance: Caveat the data appropriately to take this into account.
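
To get a feel for how large this effect is in a particular list, one option is to compare a naive total with a total that counts each Lightcast ID once. The sketch below is hypothetical: the field names and figures are invented, and it assumes that every company sharing a Lightcast ID carries that ID's full posting count, which follows from postings not being split.

```python
# Hypothetical records: two formal entities share one Lightcast ID and its full posting count.
companies = [
    {"company_number": "00000001", "lightcast_id": "LC-1", "job_postings": 40},
    {"company_number": "00000002", "lightcast_id": "LC-1", "job_postings": 40},
    {"company_number": "00000003", "lightcast_id": "LC-2", "job_postings": 10},
]


def naive_postings(records):
    """Sum postings over companies, double counting any shared Lightcast ID."""
    return sum(r["job_postings"] for r in records)


def postings_by_lightcast_id(records):
    """Keep one posting count per Lightcast ID (the first one seen)."""
    per_id = {}
    for r in records:
        per_id.setdefault(r["lightcast_id"], r["job_postings"])
    return per_id


naive_total = naive_postings(companies)
unique_total = sum(postings_by_lightcast_id(companies).values())
print(naive_total, unique_total)
```

Here the naive total is 90 and the total per Lightcast ID is 50; the gap of 40 is the posting count repeated on the second entity.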

These are job postings, not positions filled

This is the advertisement of jobs. These jobs are not necessarily filled.

Guidance: Be clear in your analysis that these are job postings. You may want to add the caveat that they are not necessarily filled.

Job postings are not assigned to specific locations

For any location filter applied on the platform, the job postings returned will be all job postings assigned to the companies with at least one address (either operating or registered) in the filtered geography.

Job postings are not split across locations. This leads to duplicates.

Guidance: If comparing geographies, be conscious that job postings will be duplicated (assigned to both geographies) where a company has a location in both geographies.

The Data City does not have job postings by region at company level on the platform.


Dealroom Data

Timeseries limitations

Dealroom is the source of our investment data. They were founded in 2013. Although they continually improve their historic data, their ability to track deals has improved over time. 

Guidance: Use caution when doing timeseries analysis of Dealroom data before 2015. Dealroom has made efforts to backdate their tracking; however, coverage has increased as Dealroom has scaled, so apparent growth in earlier years may partly reflect better tracking.

Consider outliers carefully. The total investment figure presented includes everything from seed to IPOs. ANALYSE does not provide a breakdown.

Outliers

Dealroom data is susceptible to outliers. Often an RTIC’s total investment can be dominated by one or two large companies. 

Guidance: Errors in larger companies will have a larger effect on analysis. We recommend manually checking for outliers and removing them prior to analysis if required. A great way to do this is to download the data and sort by Total Dealroom Funding highest to lowest.

Venture capital funding may be incorrectly assigned to a company

Dealroom match some of their companies to a Companies House company. The Data City also match some of Dealroom's companies to a Companies House company. The overall match is highly accurate (>98%), but sometimes a match is incorrect. Sometimes a match from Dealroom to a Companies House company does not meet the threshold for an accurate match.

An incorrect and/or missing match may misrepresent the total investment into a given sector, location, etc.

Guidance: Caveat the data appropriately to take this into account.


Proprietary Metrics

Company GVA estimate will be wrong if employee numbers are wrong

GVA for a company is estimated using its estimated number of employees, registered SIC codes, and published national statistics on average GVA per employee for those SIC codes. The known issues with employee data, such as companies reporting foreign employees, will cause GVA to be overestimated.

Guidance: Problems with GVA over-estimates can be reduced by manually reducing GVA estimates for companies known to have reported a large number of foreign employees in their UK accounts. Since large companies contribute the most to GVA it is best to start checking large companies first by using the Sort by Employees: High to Low function in Explore. Companies that do not have significant operations in the UK should be removed from analysis.
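
The sensitivity of GVA to headcount can be shown with a rough Python sketch. It is not the platform's method: the per-employee figures are invented, and taking a simple average across a company's SIC codes is an assumption made only for illustration.

```python
# Hypothetical GVA-per-employee figures by SIC code (not real national statistics).
GVA_PER_EMPLOYEE = {"62012": 80_000.0, "70100": 60_000.0}


def estimate_gva(employees, sic_codes, gva_per_employee=GVA_PER_EMPLOYEE):
    """Employees multiplied by the mean GVA per employee across the company's known SIC codes.

    Averaging across SIC codes is an assumption for this sketch.
    Returns None when no rate is available for any of the SIC codes.
    """
    rates = [gva_per_employee[code] for code in sic_codes if code in gva_per_employee]
    if not rates:
        return None
    return employees * sum(rates) / len(rates)


# The same company, using its reported global headcount versus an assumed UK-only headcount.
gva_global = estimate_gva(10_000, ["62012", "70100"])
gva_uk_only = estimate_gva(1_500, ["62012", "70100"])
print(gva_global, gva_uk_only)
```

Because the estimate scales linearly with headcount, a company that reports 10,000 global employees but has 1,500 in the UK would have its GVA overstated by the same ratio in this sketch.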

Company GVA estimates do not estimate split of function within a company

Where a company's estimated employee count is correct, we are able to make a good estimate of the company's GVA based on the average GVA per employee in its sectors of operation. Users should be careful, however: a company's GVA cannot be fully assigned to any one of its RTICs. For example, many companies use artificial intelligence as part of their work, but it would probably be wrong to assign a company's full GVA as being produced in the AI sector.

Guidance: When writing up GVA analysis, phrasing such as "companies using AI contribute £Xbn to the UK economy" would be more accurate than "AI contributed £Xbn to the UK economy".

Incorrect data points based on mismatched URLs

There will be instances of mismatched URLs in our data. We have recently improved our match rate to 95%, but mismatches remain.

If a URL mismatch has occurred, several datapoints which are either scraped from the URL directly, or derived from the webtext, including (but not limited to) operating locations, company description, RTICs, and Innovation Score, will be incorrectly assigned to the company with the mismatched URL.

Guidance: Caveat the data appropriately. Report URLs that are incorrect/missing. You may also want to exclude/include companies from your analysis which you know are mismatched/missing.

Turnover and Employee counts are estimates, and there are known limitations of these estimates

Not all companies report turnover/employee count and/or there is a lag in report date. The Data City attempt to fill turnover/employee data where missing.

Guidance: For reports you will want to do a QA of the list. Errors in larger companies will have a larger effect on analysis. View the list and sort by turnover and employees high to low to identify outliers and remove where appropriate.

There might be no turnover or employee data for a company

The Data City attempt to fill turnover/employee data where missing. We cannot provide an estimate without appropriate data to do so. This could lead to underestimating the size of the sector/economy in a location. This usually only affects small companies, so the overall impact is likely to be small.

Guidance: In your analysis make sure to refer to these datapoints as ‘estimated turnover’ or ‘estimated employment’. For clarity, you may want to report the number of companies which we do provide estimates for. The largest companies have the highest influence on summary statistics and these are less likely to be impacted by missing data.

Estimated turnover and employees are evenly split by location

Companies are not required to declare their financial accounts for each operating address that they have. Therefore we do not know how many employees work in each operating address. This is the same for turnover.

The Data City provide an estimate of the number of employees at each operating address. This estimate equally splits the number of employees or amount of turnover across the number of operating addresses.

Guidance: Caveat the data appropriately. If reported values seem unusual, consider this to be a possible cause. 
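
The even split can be sketched as follows. The function name, place labels, and figures are hypothetical; the only behaviour taken from the description above is that a company's total is divided equally by its number of operating addresses.

```python
def split_evenly(total, addresses):
    """Divide a company's total (employees or turnover) equally across its operating addresses."""
    if not addresses:
        return {}
    share = total / len(addresses)
    return {address: share for address in addresses}


def total_in_region(allocation, region_addresses):
    """Sum the shares of the addresses that fall inside a selected geography."""
    return sum(share for address, share in allocation.items() if address in region_addresses)


# A hypothetical company with 120 employees and three operating addresses.
allocation = split_evenly(120, ["Leeds", "London", "Manchester"])
north = total_in_region(allocation, {"Leeds", "Manchester"})
print(allocation, north)
```

Each address receives 40 employees here, regardless of where staff actually work, so a regional figure of 80 for Leeds and Manchester could be well above or below the true headcount there.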

Assumed split of employees across company locations can lead to potentially incorrect local growth

Employees incorrectly linked to either an address, or a specific activity in a region, will be included in the calculation for the growth of that sector in the region. The Data City works hard to source location data for each company and has to make subjective decisions as to whether an address should be included as a company location.

Example 1: Should all of BP's petrol stations be included in the location data or just the office locations? We know revenue is generated at each petrol station but we may choose to attribute this revenue to only the offices.

This varies across companies and this may have an impact on summary statistics. 

Example 2: If you were choosing to analyse AdTech, you would likely include Amazon in your list. Amazon has warehouse locations across the UK. We track these locations, but we do not know which are offices and which are warehouses without further research. We would therefore falsely assign employees working in AdTech to warehouses.

Guidance: Be mindful of this issue and explore the location data of the largest companies in your list. Make a sensible decision on how to avoid assigning employees to incorrect locations.

Location Quotients could be distorted

We generate location quotients based on all addresses associated with companies (both registered and operating addresses).

This could introduce known location issues, such as including locations of properties used as default registered addresses, e.g. accountants, where activity relating to the company might not actually take place.

Location quotients are also subject to the registered office effect. London is the registered office location (and perhaps the head office location) for many multinational companies. This can create distortions. For example, when using location quotients, some of the top specialisations for London are tobacco or coal mining. This is because many tobacco and mining firms have their registered offices in London.

Therefore there are several reasons why the resulting location quotient value may misrepresent the location.

Guidance: Take care when applying location filters and interpreting the location quotients.  Be aware of these caveats.
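
A location quotient is conventionally a sector's share of activity in a local area divided by its share nationally. The sketch below uses that conventional definition with counts of company addresses as the measure; the exact measure used on the platform is not stated here, so the function, its inputs, and the figures are assumptions for illustration only.

```python
def location_quotient(sector_local, total_local, sector_national, total_national):
    """Conventional location quotient.

    Equivalent to (sector_local / total_local) / (sector_national / total_national),
    rearranged into a single division.
    """
    if total_local <= 0 or sector_national <= 0 or total_national <= 0:
        raise ValueError("totals and the national sector count must be positive")
    return (sector_local * total_national) / (total_local * sector_national)


# Hypothetical counts of addresses in one sector, with national figures held fixed for simplicity.
# 30 local addresses when registered offices are included, 2 when only operating sites are counted.
lq_with_registered = location_quotient(30, 1_000, 40, 10_000)
lq_operating_only = location_quotient(2, 1_000, 40, 10_000)
print(lq_with_registered, lq_operating_only)
```

With these invented counts the quotient falls from 7.5 to 0.5 once registered offices are excluded, which is the kind of distortion the registered office effect can produce.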

Company size definitions should be well understood

The company size definitions are based on the Companies Act 2006 definition. Other definitions are available. You can read about the implementation of this definition here.

A misunderstanding of these definitions could lead to incorrect conclusions made from the data.

Guidance: Clearly state the definitions used in your analysis.

Alternatively, you can generate custom definitions yourself using the filter options. 

A company might be assigned an incorrect company size

This would arise due to incorrect turnover or employment data, which could occur as a result of known issues with employee and turnover data.

This could be a declaration error, e.g. an error in a company’s financial accounts. Alternatively, it could be an error in estimated values for a company’s turnover or employment.

Guidance: Company sizes are based on The Data City’s estimated values of turnover and employment. The Data City have robustly estimated turnover and employment, but a company can be miscategorised if errors do occur. Users should be aware of this.

RTIC time series analysis

The RTIC analysis within ANALYSE provides insight into the historical financial performance of the companies which are currently within the RTIC.

It is important to note that this is not a time-series analysis of the industry as a whole, but specifically of these companies. Currently, our approach does not include historical classifications or track the inception and closure of companies (births and deaths).

Guidance: Caveat the data appropriately to take this into account. Be mindful that this is the case when interpreting the results.

The innovation indicator should be properly understood and used appropriately

Our innovation indicator uses a proprietary machine learning model to estimate how innovative every company in our database is based on their website text, where available. R&D intensity is used as a proxy for innovation.

You can read more about our innovation indicator here.

Misunderstanding or misinterpreting our innovation indicator can lead to misleading results.

Guidance: We urge users to exercise caution whilst using our innovation indicator, viewing it as a valuable tool when analysing lists, rather than a definitive measure.

It should also not be used to build company lists; it is more appropriate to use it as a filter in analysis.

We are not able to identify founders for all companies

We are able to identify founders for approximately 80% of companies.

It is worth noting that our ability to identify founders is very dependent on the age of a company.

The Data City tracks the active companies on Companies House and these tend to be younger companies.

Misrepresenting founder information can lead to errors in your analysis.

Guidance: Be careful when comparing companies with founder information against the baseline.

For example, comparing women-founded businesses to the UK business base will be tricky.

Since we are not able to identify founders for older companies, women-founded businesses will be younger than the business base, so a national comparison might not be suitable.

Instead, we advise comparing women-founded businesses to non-women-founded businesses.