How we built Categories

What if we could instantly map the accounting information of any small business in the entire world to a standard chart of accounts?

Summary 👍

  • Codat has launched Categories – a feature that maps accounting information to a standard chart of accounts. 
  • In the absence of one standard chart of accounts, features are blocked for a wide range of products.
  • Likewise, any business providing financial services to SMBs is forced to do a lot of time-consuming and error-prone manual data processing, such as financial statement spreading.
  • Categories uses machine learning techniques to solve this genuinely complicated problem at scale and massively expands the scope of what the financial technology world can build.

Context 👀

Codat standardises business data APIs. Some of the most complex are accounting APIs. This is unsurprising. Accounting is complicated. Four-year degrees in Accounting don’t exist for no reason. Something as seemingly innocuous as a gift card can create lengthy scholarly debate.

As Codat maps accounting APIs to a single data model, our clients don’t need to worry about a litany of little differences like Xero’s API representing Bills as a type of Invoice while QuickBooks Online’s API treats them as separate entities. This drastically reduces the complexity of building and maintaining accounting integrations.
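As a rough illustration of that Bills example, the sketch below folds both representations into one shape. The `Document` class and the two adapter functions are hypothetical names for this illustration and are not Codat's data model. The payloads are heavily simplified, using only the `Type`, `InvoiceID`, and `Total` fields from Xero and the `Id` and `TotalAmt` fields from QuickBooks Online.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical unified shape: one record type with an explicit kind."""
    kind: str   # "bill" or "invoice"
    id: str
    total: float


def from_xero(record: dict) -> Document:
    # Xero keeps bills and sales invoices in one Invoice entity and
    # distinguishes them by Type: "ACCPAY" is a bill, "ACCREC" a sales invoice.
    kind = "bill" if record["Type"] == "ACCPAY" else "invoice"
    return Document(kind, record["InvoiceID"], float(record["Total"]))


def from_quickbooks(entity: str, record: dict) -> Document:
    # QuickBooks Online exposes Bill and Invoice as separate entities,
    # so the entity name itself tells us the kind.
    kind = "bill" if entity == "Bill" else "invoice"
    return Document(kind, record["Id"], float(record["TotalAmt"]))
```

Downstream code can then filter on `kind` without knowing which platform a record came from.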

While this standardisation across accounting platforms is extremely powerful and saves a lot of developers and finance professionals a lot of time, there is a deeper level to accounting data that poses a different challenge.

The problem ❌

There are hundreds of millions of SMBs in the world. They all do their accounts in their own bespoke way. Not only will a high-growth SaaS business have a very different set of financial statements to a century-old vineyard, even similar businesses may report certain accounts in subtly different ways.

This situation is fine for small businesses themselves. Where it creates problems is for those who provide SMBs with software and financial services. For example, here are excerpts from the financials of three fictional businesses. You need to know how much they spend on advertising. 

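Company A

| Expenses                      | Sep 2021 | Aug 2021 | Jul 2021 |
|-------------------------------|----------|----------|----------|
| Advertising & Marketing       | 2,083.33 | 6,628.13 | 0.00     |
| Light, Power, and Heating     | 103.42   | 103.42   | 129.38   |
| Motor Vehicle Expense         | 342.79   | 123.75   | 123.75   |
| Postage, Freight, and Courier | 94.19    | 0.00     | 0.00     |
| Printing & Stationery         | 65.58    | 0.00     | 0.00     |
| Total Expenses                | 2,689.31 | 6,855.30 | 253.13   |

Company B

| Expenses               | Sep 2021 | Aug 2021 | Jul 2021 |
|------------------------|----------|----------|----------|
| Commissions & fees     | 907.12   | 812.40   | 875.80   |
| Facebook               | 1200.96  | 975.13   | 123.45   |
| Disposal fees          | 0.00     | 0.00     | 400.00   |
| Dues and subscriptions | 12.18    | 12.18    | 12.18    |
| Equipment rental       | 44.40    | 82.20    | 99.45    |
| Total Expenses         | 2164.66  | 1881.91  | 1510.88  |

Company C

| Expenses                      | Sep 2021 | Aug 2021 | Jul 2021 |
|-------------------------------|----------|----------|----------|
| Amortisation and depreciation | 33.30    | 0.00     | 0.00     |
| Rent or lease payments        | 1250.00  | 0.00     | 0.00     |
| Shipping and delivery         | 199.77   | 66.59    | 0.00     |
| Insurance – general           | 193.99   | 0.00     | 0.00     |
| Total Expenses                | 1677.06  | 66.59    | 0.00     |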

It is pretty straightforward to spot the relevant accounts. Company A reports “Advertising & Marketing” expenses. Company B appears to buy ads only on Facebook, so they report “Facebook” expenses. Company C does not seem to spend any money on advertising whatsoever. Maybe they just have great word-of-mouth.

While you or I can do that, a computer programme has a very hard time. Accounts are reported by human beings in terms that make sense to them and, usually, other human beings. A person can see that someone has entered “Facebook” under Expenses and understand that this almost definitely means advertising spend. A machine does not know that. But what if there was some way for the machine to learn? 
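To make the gap concrete, here is a minimal sketch of the naive approach: flag an account as advertising if its name contains an obvious keyword. The keyword list and function name are invented for this illustration.

```python
# Hypothetical keyword list; any fixed list has the same weakness.
ADVERTISING_KEYWORDS = ("advertising", "marketing", "ads")


def looks_like_advertising(account_name: str) -> bool:
    """Return True if the account name contains an advertising keyword."""
    name = account_name.lower()
    return any(keyword in name for keyword in ADVERTISING_KEYWORDS)


# One account name from each of the three example companies.
accounts = ["Advertising & Marketing", "Facebook", "Shipping and delivery"]
flags = {name: looks_like_advertising(name) for name in accounts}
```

Company A's account is flagged, but Company B's “Facebook” account is not, even though a person would read it as advertising spend straight away.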

A job for machine learning 🤖

We needed to define a single, standard chart of accounts that could represent the accounts of every small business globally regardless of their accounting method, size, or industry. This chart of accounts also had to be detailed enough to satisfy the requirements of any and every financial service or software provider that may interact with a small business throughout its lifetime.

Once we had solved this little problem, we would need a model which could take any business’s bespoke chart of accounts and accurately map every single account to our single standard, with a 100% success rate.

We knew this would require a few things:

Lots of data

We’ve got a model to train. We estimated we would need at least 10,000 sets of SMB accounting data from different companies in different sectors using different accounting platforms to have anything like a useful sample. Codat, by nature of our business, is uniquely placed to process such a sample.

Deep understanding of accounting data

To help our model out, we needed to give it the best possible starting point. This meant identifying as many as possible of the subtle differences in accounting data that are vitally important but easy to miss.

Deep understanding of possible use cases of Categories

Over-optimising for Categories’ immediate utility would hamper us down the road. We consulted with lots of different friendly businesses who were interested in doing lots of different things with Categories, from alternative finance providers to forecasting and planning SaaS.

Deep understanding of data science and relevant machine learning techniques

We knew a truly universal Categories model would not be an easy thing to build. We put together a great team and hired some new developers with specific knowledge of certain parts of the problem.

Defining a single chart of accounts 🧾

To start, we looked at the default categories offered by two leading cloud accounting platforms – QuickBooks Online and Xero.

  • QuickBooks Online’s default chart of accounts contains 280 possible categories. Users must enter accounts with detailed categorisation, e.g. if entering an Expense, they must declare what type of Expense the account is, such as “Equipment Rental” or “Automobile – Fuel.”
  • Xero’s default chart of accounts has far fewer categories at only 22. In Xero, unlike in QuickBooks Online, accounts can be left uncategorised beyond the most general level of “Expenses,” “Income,” “Assets,” or similar.

While the default charts of accounts are different enough already, they are also configurable. This is great for QuickBooks and Xero users because they can create, delete, edit, and merge categories to suit their particular needs. For our purposes, we needed a rough idea of how often they do this. Just how different is each user’s chart of accounts?

As it happens, very. We analysed the charts of accounts from over 10,000 businesses across QuickBooks Online and Xero, spanning a wide range of sectors and geographies. Out of 300,000 accounts in Xero, we found that only 4% used the default categories. For 96% of accounts in Xero, users had changed the default name or created a completely new, bespoke account name.
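For a sense of scale, 4% of 300,000 accounts is about 12,000. A measurement of this kind can be sketched as below, assuming it amounts to comparing each account name against the platform's default names. The function name and sample names are made up for illustration and are not real data from the analysis.

```python
def share_using_defaults(account_names, default_names):
    """Fraction of accounts whose name still matches a platform default."""
    defaults = {name.strip().lower() for name in default_names}
    if not account_names:
        return 0.0
    unchanged = sum(1 for name in account_names if name.strip().lower() in defaults)
    return unchanged / len(account_names)


# The headline figure from the post, in absolute terms.
accounts_on_defaults = 300_000 * 4 // 100  # 12,000 accounts

# Toy sample, purely illustrative.
sample_share = share_using_defaults(
    ["Sales", "Facebook", "Rent", "Director's loan", "Advertising"],
    ["Sales", "Rent", "Advertising"],
)
```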


After analysing a range of businesses across sectors and diving deep into different accounting methods (such as UK and US GAAP), we built an MVP with 68 categories. Eventually, after testing in beta, we settled on a single, standard chart of accounts that includes 162 account categories. These 162 strike the right balance between detail and ease of use across a wide range of possible applications.

Now we just needed to find a way to map any accounting data to our standard model, instantly and accurately. Easy, right?

Designing and building the model 🔨

The current build of Categories has three elements: mapping defaults, natural language processing, and user control.

Mapping defaults

This is a nice, straightforward first step. Although accounts don’t usually conform to default categorisation, in the 4% of cases (in Xero at least) where they do, we can simply define how they should map to our single chart of accounts. Now for the other 96%.
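A minimal sketch of this step, assuming it boils down to a lookup table keyed on platform and default account name. The `Category` fields mirror the `type`, `subtype`, and `detailType` fields in the API response; the table entries and category values are hypothetical and are not Codat's actual mappings.

```python
from typing import NamedTuple, Optional


class Category(NamedTuple):
    type: str
    subtype: str
    detail_type: str


# Hypothetical entries: (platform, default account name) -> standard category.
DEFAULT_MAPPINGS = {
    ("xero", "advertising & marketing"): Category("Expense", "Operating", "Marketing"),
    ("quickbooks online", "equipment rental"): Category("Expense", "Operating", "Rent"),
}


def map_default(platform: str, account_name: str) -> Optional[Category]:
    """Return the predefined category for a default account name, if any."""
    key = (platform.strip().lower(), account_name.strip().lower())
    return DEFAULT_MAPPINGS.get(key)
```

Anything that returns `None` here falls through to the next step.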

Natural language processing

Where we need to categorise an account with a name we have not already mapped to our model, we use natural language processing. This is crucial because we could never pre-empt every single name that someone might give an account.

First, Categories removes stopwords (“the,” “is,” “at” and so on) and non-alphanumeric characters. Then it lemmatizes the account name and description. This means it groups words that are closely related, whether as forms of the same word or as synonyms. In a simple example, Categories could take two different Cost of Sales accounts, such as “freighting costs-of-sale” and “freight cost,” and understand them as the same thing.
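The preprocessing can be sketched with the standard library alone. This is an illustration under stated assumptions: the stopword set is a small sample, and the `LEMMAS` table is a toy stand-in for a real lemmatizer, which would normally come from an NLP library.

```python
import re

# Small sample stopword set; a real list would be much longer.
STOPWORDS = {"the", "is", "at", "of", "and", "a", "an", "for", "to", "in"}

# Toy lemma table standing in for a real lemmatizer (assumption).
LEMMAS = {"freighting": "freight", "costs": "cost", "sales": "sale"}


def normalise(account_name: str) -> list[str]:
    """Lowercase, strip punctuation, drop stopwords, then map words to lemmas."""
    # Replace every run of non-alphanumeric characters with a single space.
    cleaned = re.sub(r"[^a-z0-9]+", " ", account_name.lower())
    tokens = [token for token in cleaned.split() if token not in STOPWORDS]
    return [LEMMAS.get(token, token) for token in tokens]


first = normalise("freighting costs-of-sale")   # ["freight", "cost", "sale"]
second = normalise("freight cost")              # ["freight", "cost"]
```

After normalisation the two names share their leading tokens, which is what makes a later distance comparison meaningful.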

Categories then evaluates this output using text distance metrics. If the distance between the output and a category in our chart of accounts is below a certain threshold, Categories can confidently suggest a mapping for the account.
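The post does not say which distance metrics or threshold are used, so the sketch below swaps in Python's `difflib.SequenceMatcher` similarity ratio as one possible metric. The category list and the `MAX_DISTANCE` value are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical stand-ins for names in the standard chart of accounts.
STANDARD_CATEGORIES = ["freight cost", "advertising", "rent", "insurance"]

# Hypothetical threshold: suggest only when the best match is at least this close.
MAX_DISTANCE = 0.35


def text_distance(a: str, b: str) -> float:
    """Distance in [0, 1]: 0 means identical strings, 1 means nothing in common."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def suggest(normalised_name: str) -> Optional[str]:
    """Return the closest standard category, or None if nothing is close enough."""
    best = min(STANDARD_CATEGORIES, key=lambda c: text_distance(normalised_name, c))
    if text_distance(normalised_name, best) <= MAX_DISTANCE:
        return best
    return None
```

With these assumptions, `suggest("freight cost sale")` returns `"freight cost"`, while `suggest("facebook")` returns `None` because no category name is textually close. Pure string distance cannot tell that “Facebook” means advertising; that case relies on the model learning from confirmed categorisations.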

User control

Ultimately, Codat’s user stays in control of how accounts are categorised. Wherever our model can suggest a category, it does. Where it is not sufficiently confident, it makes no suggestion.

Everything can be recategorised by a user after the model has run. For our users, this provides the flexibility and control they need. For us, it helps the model learn faster, as if it were crowdsourcing the training process. 

For similar reasons, Categories has a PATCH endpoint in our API, which partially updates a record, rather than a PUT, which replaces the whole thing. Categories isn’t arrogant about how well it understands data. It never overwrites anything. It just adds useful new information. Our users never lose access to information that could supply necessary context.
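The difference between the two verbs can be shown with plain dictionaries. This is an in-memory illustration of PATCH and PUT semantics only, not Codat's server logic; the record values and function names are hypothetical.

```python
# Hypothetical stored record for one account.
RECORD = {
    "accountRef": {"id": "42", "name": "Facebook"},
    "suggested": {"type": "Expense", "subtype": "Operating", "detailType": "Advertising"},
    "confirmed": None,
}


def apply_put(stored: dict, payload: dict) -> dict:
    """PUT-style update: the payload replaces the whole record."""
    return dict(payload)


def apply_patch(stored: dict, payload: dict) -> dict:
    """PATCH-style update: only the supplied fields change; the rest survive."""
    return {**stored, **payload}


update = {"confirmed": {"type": "Expense", "subtype": "Operating", "detailType": "Advertising"}}
patched = apply_patch(RECORD, update)   # keeps accountRef and suggested
replaced = apply_put(RECORD, update)    # accountRef and suggested are gone
```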

To illustrate, here is the format the data takes when Categories returns results. 

{
  "results": [
    {
      "accountRef": {
        "id": "string",
        "name": "string"
      },
      "suggested": {
        "type": "string",
        "subtype": "string",
        "detailType": "string"
      },
      "confirmed": {
        "type": "string",
        "subtype": "string",
        "detailType": "string"
      }
    }
  ]
}

There are three main objects:

  • accountRef: This is the account as it appears in the SMB’s accounting platform. This information is never lost.
  • suggested: This is Categories’ suggestion, made by mapping the account to our model.
  • confirmed: This is filled by Codat’s user, either when the user accepts Categories’ suggestion or when the user recategorises the account themselves. Every time someone does the latter, Categories learns. It won’t make the same mistake again.
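A consumer of this response might prefer the confirmed category and fall back to the suggestion. The sketch below assumes that an unfilled `confirmed` or `suggested` object arrives as `null`, as a missing key, or with an empty `type`; the post does not specify this, and all sample values are invented.

```python
import json
from typing import Optional

# Invented sample payload in the shape shown above.
SAMPLE_RESPONSE = """
{
  "results": [
    {
      "accountRef": {"id": "1", "name": "Facebook"},
      "suggested": {"type": "Expense", "subtype": "Operating", "detailType": "Advertising"},
      "confirmed": null
    },
    {
      "accountRef": {"id": "2", "name": "Owner drawings"},
      "suggested": {"type": "Liability", "subtype": "Current", "detailType": "Loans"},
      "confirmed": {"type": "Equity", "subtype": "Owners", "detailType": "Drawings"}
    }
  ]
}
"""


def effective_category(result: dict) -> Optional[dict]:
    """Prefer the user's confirmed category, then the model's suggestion."""
    for key in ("confirmed", "suggested"):
        category = result.get(key)
        if category and category.get("type"):
            return category
    return None


def summarise(response_text: str) -> dict:
    """Map each original account name to the category a consumer should use."""
    results = json.loads(response_text)["results"]
    return {r["accountRef"]["name"]: effective_category(r) for r in results}
```

Here “Facebook” resolves to the suggestion, while “Owner drawings” resolves to the user's correction.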

Users can also go through this process without ever seeing a line of code. A “Categorise Accounts” button is available for every linked company visible in Codat. Users can accept or modify Categories’ suggestions.

For example, here is what it looks like when a user modifies the categorisation of different Equity accounts.

Conclusion ✨

Categories uses machine learning techniques to solve a genuinely complicated problem at scale. With a new level of normalisation in the accounting information accessible via Codat, financial services and software providers are building great things.

Digital lending

SMB lending is undergoing serious transformation, from global banks to challengers like ClearCo. Categories is helping to drastically reduce the time that lenders spend manually rekeying data when spreading financial statements, a core part of the decisioning process.

Faster processes mean SMBs access cash sooner. Time-to-cash consistently comes up as the thing customers care about most. Lenders using tools like Codat’s Categories feature are therefore winning market share with a superior experience.

Alternative finance

Many businesses do not know they are eligible for R&D tax credits. Businesses like Mainstreet help them out by identifying opportunities to claim credit and, in many cases, paying out the cash immediately while they wait for the wheels of bureaucracy to turn. 

Custom-built integrations or OCR solutions cannot access the level of detail required for these use cases, such as directly accessing “R&D Expenses” rather than simply “Expenses” in general. With Codat Categories, “R&D Expenses” simply shows up in the same way every time for every business, regardless of the accounting platform they use.

Business dashboards

SMBs are receiving better and better products to help them run their business, from full-featured business financial management dashboards to cash flow forecasting tools like Fluidly. A single chart of accounts is a prerequisite for providing some of the most valuable features of a business dashboard. Without one, product teams are forced to choose between improving the product itself and expanding its addressable market by adding more integrations.

With Categories, teams can build high-value features for the whole market, without worrying about the subtle differences among their customers’ accounts.
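Returning to the three fictional companies from earlier, the advertising question becomes a simple filter once every account carries a standard category. The category labels below are hypothetical, and only a few Sep 2021 rows are included.

```python
from collections import defaultdict

# (company, original account name, hypothetical standard category, Sep 2021 amount)
CATEGORISED = [
    ("Company A", "Advertising & Marketing", "Advertising", 2083.33),
    ("Company A", "Motor Vehicle Expense", "Vehicle", 342.79),
    ("Company B", "Facebook", "Advertising", 1200.96),
    ("Company B", "Equipment rental", "Equipment rental", 44.40),
    ("Company C", "Rent or lease payments", "Rent", 1250.00),
]


def spend_by_company(rows, category: str) -> dict:
    """Total the amounts in one standard category for every company."""
    totals = defaultdict(float)
    for company, _name, row_category, amount in rows:
        # Adding 0.0 keeps companies with no matching accounts in the result.
        totals[company] += amount if row_category == category else 0.0
    return dict(totals)


advertising = spend_by_company(CATEGORISED, "Advertising")
```

The same function works for any category and any number of companies, with no per-company handling of account names.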


Eimear Donnelly, Product Manager – Data

