Using machine learning to predict CPV codes from tender texts

When publishing a new tender, institutions within the European Union need to assign it a CPV code. To ease the daunting task of choosing the right code from thousands of options, a machine learning model trained on previously published tenders can suggest the five most likely codes.


The challenge

Finding the most appropriate CPV code from thousands of options

Within the European Union it is mandatory to assign Common Procurement Vocabulary (CPV) codes to new tenders before making them public. But that’s not always a simple task.

There are thousands of CPV codes to choose from – each providing a different description of the type of supplies, works, or services that make up a contract. Ideally, you want to choose the code that most accurately represents your contract.

So, how do you do this without going through every code definition?

The solution

Use machine learning to narrow your options

We wanted to see if machine learning could provide a solution by reducing the thousands of CPV codes to a short selection of the most appropriate ones for a user to pick from.

We created a proof of concept, taking advantage of tens of thousands of published European tenders to train a supervised machine learning model to classify which CPV code belongs to a given tender text.
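As a rough illustration of what supervised text classification means here, the sketch below trains a tiny multinomial naive Bayes classifier on a handful of labelled texts. It is not the proof-of-concept code: the algorithm, the class name NaiveBayesCpvClassifier, and the six sample tenders are all assumptions chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesCpvClassifier:
    """Hypothetical minimal multinomial naive Bayes over word counts."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict_proba(self, text):
        """Return a {label: probability} mapping for one tender text."""
        # Words never seen during training carry no information, so skip them.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        log_scores = {}
        for label, doc_count in self.label_counts.items():
            total_words = sum(self.word_counts[label].values())
            score = math.log(doc_count / total_docs)
            for token in tokens:
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = self.word_counts[label][token]
                score += math.log((count + 1) / (total_words + vocab_size))
            log_scores[label] = score
        # Turn log scores into probabilities that sum to one.
        highest = max(log_scores.values())
        weights = {l: math.exp(s - highest) for l, s in log_scores.items()}
        norm = sum(weights.values())
        return {label: w / norm for label, w in weights.items()}


# Made-up training data: (tender text, two-digit CPV division).
training = [
    ("Development and maintenance of a web application", "72"),
    ("IT consultancy and software development services", "72"),
    ("Software licences for office package", "48"),
    ("Supply of accounting software package", "48"),
    ("Construction of a school building", "45"),
    ("Renovation works for a road bridge", "45"),
]
texts, labels = zip(*training)
model = NaiveBayesCpvClassifier().fit(texts, labels)
probabilities = model.predict_proba("Construction of a new bridge")
best_label = max(probabilities, key=probabilities.get)
```

A real model would need far more data and probably a stronger text representation, but the shape stays the same: labelled tender texts go in, and a probability per CPV code comes out.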

To achieve this, we trained separate models for each level of the CPV code hierarchy. We noticed that some of the categories overlap considerably. For example, either “IT services” or “Software” might be a reasonable description in some cases, so we made sure to take these ambiguities into account.
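One way to obtain a training label for each level is to truncate the CPV code. The sketch below assumes the levels correspond to the standard CPV structure, where the first two digits identify the division, the first three the group, the first four the class, and the first five the category. The helper names cpv_label and labels_per_level are hypothetical.

```python
# Number of leading digits that identify each CPV hierarchy level.
CPV_LEVEL_DIGITS = {"division": 2, "group": 3, "class": 4, "category": 5}


def cpv_label(code, level):
    """Return the label of `code` at `level`, e.g. '72' for the division."""
    digits = code.split("-")[0]  # drop the check digit: '72000000-5' -> '72000000'
    if len(digits) != 8 or not digits.isdigit():
        raise ValueError(f"not a CPV code: {code!r}")
    return digits[: CPV_LEVEL_DIGITS[level]]


def labels_per_level(codes):
    """Build one list of training labels per hierarchy level."""
    return {
        level: [cpv_label(code, level) for code in codes]
        for level in CPV_LEVEL_DIGITS
    }


# Sample codes for illustration only.
targets = labels_per_level(["72000000-5", "45233120-6"])
```

Each list in targets could then serve as the label column for the model trained at that level.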

At the highest level of grouping, 44 out of 45 CPV code categories had at least 50 observations and the model’s classification accuracy was around 76%. This meant that although the algorithm couldn’t always provide a single most appropriate code, that code was often present in the top few options.
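The difference between plain accuracy and "present in the top few options" can be made concrete with a top-k accuracy measure. The following sketch uses made-up ranked predictions for four tenders. The function name top_k_accuracy and all values are assumptions, not results from the proof of concept.

```python
def top_k_accuracy(ranked_predictions, true_labels, k):
    """Share of tenders whose true code is among the first k ranked guesses."""
    hits = sum(
        true in ranked[:k]
        for ranked, true in zip(ranked_predictions, true_labels)
    )
    return hits / len(true_labels)


# Hypothetical ranked outputs for four tenders, best guess first.
ranked = [
    ["72", "48", "79"],
    ["48", "72", "30"],
    ["45", "71", "44"],
    ["71", "45", "90"],
]
truth = ["72", "72", "45", "45"]

top1 = top_k_accuracy(ranked, truth, k=1)  # only the best guess counts
top3 = top_k_accuracy(ranked, truth, k=3)  # any of the first three counts
```

In this toy data the best guess is right for half of the tenders, while the correct code always appears among the first three.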

To give users the most helpful information, we always provided the top five CPV code predictions, including their respective likelihood. That way, instead of choosing from thousands of codes, the user can choose from just five.
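Selecting the five suggestions is a small step once a model outputs a probability per code. A minimal sketch, assuming the probabilities arrive as a dictionary; the function name top_predictions and the probability values are invented for illustration.

```python
import heapq


def top_predictions(probabilities, k=5):
    """Return the k most likely (code, probability) pairs, most likely first."""
    return heapq.nlargest(k, probabilities.items(), key=lambda item: item[1])


# Hypothetical model output for one tender text.
probabilities = {
    "72": 0.41, "48": 0.33, "79": 0.09, "30": 0.07,
    "64": 0.05, "45": 0.03, "71": 0.02,
}
suggestions = top_predictions(probabilities)
for code, probability in suggestions:
    print(f"{code}: {probability:.0%}")
```

Showing the likelihood next to each code lets the user see when two options, such as the overlapping IT and software categories, are nearly tied.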

The result

A strong first step towards a beneficial tool

Although not quite good enough for practical use just yet, our proof-of-concept CPV code recommendation algorithm was generally well received by those who tested it. To improve the algorithm, we need to increase its performance for the more detailed categories.

We’re exploring two promising options to achieve this. The first is to collect more training data from all European countries. While this would slightly improve performance, it would take a considerable amount of work – particularly considering the multiple languages of European countries.

The second option is to take advantage of the impressive abilities of large language models (LLMs). By leveraging LLMs, we could create a virtual assistant that considers the definitions of the top 20 or so predicted CPV codes, and then assigns the most likely code for the given tender text.
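To make the idea tangible, the sketch below builds the kind of prompt such an assistant might receive: the tender text plus the definitions of the shortlisted codes. The prompt wording and the function name build_reranking_prompt are assumptions, and the call to an actual LLM is deliberately left out because no specific model or API has been chosen.

```python
def build_reranking_prompt(tender_text, candidates):
    """Combine a tender text and (code, definition) pairs into one prompt."""
    lines = [
        "You are helping to assign a CPV code to a public tender.",
        "Choose the single most appropriate code from the candidates below.",
        "",
        "Tender text:",
        tender_text.strip(),
        "",
        "Candidate CPV codes:",
    ]
    for code, definition in candidates:
        lines.append(f"- {code}: {definition}")
    lines.append("")
    lines.append("Answer with the chosen code only.")
    return "\n".join(lines)


# Two candidates shown for brevity; in practice the list would hold the
# top 20 or so codes predicted by the classification model.
candidates = [
    ("72000000-5", "IT services: consulting, software development, Internet and support"),
    ("48000000-8", "Software package and information systems"),
]
prompt = build_reranking_prompt(
    "Development and hosting of a citizen service portal", candidates
)
```

Restricting the LLM to a shortlist keeps the prompt small and means it can only return a code the classification model already considers plausible.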