Extraction Engines in Rossum

Feature available for early adopters only

The feature described below is currently available exclusively to early adopters. We will provide detailed information to all clients ahead of the global release.

What are extraction engines

With the Aurora implementation, we are introducing a new AI engine management system that is available directly in the Rossum application.

How does it work?
When you create a new queue, you can also create an extraction engine. This AI engine will handle predictions for the data in that queue. Depending on your needs, you can use:

  • One engine per queue: if you extract a different set of information in each queue (different fields are necessary in each queue)

  • Shared engine across a few queues: if multiple queues require the same information (same fields are necessary in each queue)
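The choice between the two options above can be sketched as grouping queues by the set of fields they need. This is only an illustrative sketch: the function name, the queue names, and the field IDs `amount_total` and `delivery_date` are hypothetical, and the grouping rule (identical field sets) is an assumption based on the description above, not a Rossum feature.

```python
from collections import defaultdict


def suggest_engine_groups(queue_fields):
    """Group queues whose required field sets are identical.

    `queue_fields` maps a queue name to the field IDs it needs extracted.
    Each returned group is a candidate for one shared engine; a group with
    a single queue would get its own engine.
    """
    groups = defaultdict(list)
    for queue_name, field_ids in queue_fields.items():
        # frozenset ignores field order, so only the set of IDs matters.
        groups[frozenset(field_ids)].append(queue_name)
    return sorted(sorted(names) for names in groups.values())


# Hypothetical queues and field IDs, for illustration only.
queues = {
    "Invoices DE": ["document_id", "amount_total"],
    "Invoices CZ": ["amount_total", "document_id"],
    "Delivery notes": ["document_id", "delivery_date"],
}
groups = suggest_engine_groups(queues)
print(groups)
```

Here the two invoice queues need the same fields and could share one engine, while the delivery notes queue would use an engine of its own.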

Understanding queue schema and engine schema

You are already familiar with the queue schema (also called the extraction schema), which is a list of fields that specifies what data you want to extract in each queue. It is located in the “Fields” tab in queue settings.

Each extraction engine also has its own separate schema, called the engine schema. This includes all the fields the AI should identify in the documents. The engine schema is assigned to an engine, not to a queue, and should not be confused with the queue schema.

Here’s how queue schema and engine schema work together:

  • If you want the AI to predict a field’s value, it must be in both the queue schema and the engine schema

    • Queue schema: makes the field available for annotations in each queue

    • Engine schema: informs the AI to predict this field’s value

  • The connection between these two schemas is made through the field ID, so it’s crucial that a field uses the same ID in both schemas

  • Fields where values are entered manually, or added through extensions or integrations, should only be included in the queue schema
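The rules above can be read as simple set operations on field IDs. The following is a minimal sketch, not Rossum's implementation: the `QueueField` class, the `value_source` values, and the function name are hypothetical and only illustrate how matching by ID decides which fields the AI predicts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueField:
    """Hypothetical stand-in for one field in a queue schema."""
    field_id: str
    value_source: str  # assumed values: "captured" or "manual"


def split_by_prediction(queue_fields, engine_field_ids):
    """Classify fields by comparing queue-schema IDs with engine-schema IDs."""
    engine_ids = set(engine_field_ids)
    captured_ids = {f.field_id for f in queue_fields if f.value_source == "captured"}
    return {
        # Present in both schemas: the AI can predict these values.
        "predicted": sorted(captured_ids & engine_ids),
        # Captured in the queue but unknown to the engine: no prediction.
        "missing_in_engine": sorted(captured_ids - engine_ids),
        # Known to the engine but not captured by this queue.
        "not_connected": sorted(engine_ids - captured_ids),
    }


# Illustrative schemas; the IDs other than document_id are made up.
queue_schema = [
    QueueField("document_id", "captured"),
    QueueField("discount_amount", "captured"),
    QueueField("approver_note", "manual"),  # entered manually: queue schema only
]
engine_schema_ids = ["document_id", "amount_total"]

result = split_by_prediction(queue_schema, engine_schema_ids)
print(result)
```

In this example only `document_id` would be predicted, because it is the only captured field whose ID appears in both schemas.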

Managing extraction engines

Create a new queue with a new extraction engine

New extraction engines can be created when creating a new queue.

Follow the usual steps for queue creation (see Manage queues in Rossum) and select “New extraction engine” in the third step.

Brand-new queues running on an extraction engine come with a list of standard fields to capture, which you can adjust freely. These fields are used to capture data from documents and are added by default to both the queue schema and the engine schema of the new engine, so when you create the queue, both schemas are aligned and contain the same set of fields.

Additional custom fields can be added to the queue extraction schema, too. Follow the tutorial here to learn how: Extraction Schema Editor in Rossum. Please keep in mind that if you want to enable the engine to learn from the data extracted from your documents, every field captured by AI should have its counterpart in the engine schema. Jump to Add new engine fields to an extraction engine.

Create a new extraction engine from the AI engines overview

From the "Automation" tab, select "AI engines". You will see an overview of all available AI engines on your account. Click the “Add engine” button to add a new extraction engine.

Fill in the name of your new engine. We recommend using a name corresponding to the document type it will extract.

Engine name

The new extraction engine you have just created is not yet connected to any queues. Jump to Assign an extraction engine to an existing queue.

Once the engine is connected to a queue, if you want it to learn from the data extracted from the documents you process, every captured field should be related to an engine field. Jump to Add new engine fields to an extraction engine.

Managing engine fields

Add new engine fields to an extraction engine

Once created, extraction engines run with the help of engine fields. Engine fields bridge the AI and the captured fields in your queues, enabling Aurora to learn from user annotations.

Click the “Add field” button in the "Engine fields" section to add a new engine field.

Add engine field

On the next screen, you can define the engine field. By defining an engine field, you instruct the AI on what it should learn.

Engine field

Name the field and assign it an ID unique to this extraction engine. We recommend giving the engine field a name and an ID that clearly describe the kind of values you want extracted and predicted from documents. For instance, naming a field discount_amount is preferable to naming it field_a.

You may choose from more than 50 pre-trained fields (Rossum API Reference), including barcodes. Alternatively, you may create a custom engine field that Aurora will learn to recognise as soon as you confirm your first documents.

Then, assign a data type (Extraction Schema Editor in Rossum):

  • string

  • number

  • date

  • enum

In addition to the regular data types, engine fields can be assigned more granular types that further constrain the data extracted from the document.

| Data type | Subtype | Label | Description |
|---|---|---|---|
| string | none | String | Plain text without any constraints. |
| string | alphanumeric | Alphanumeric string | A string containing only characters a-z, A-Z, and 0-9. White space is stripped. |
| string | numeric | Numeric string | A string containing only digits. White space is stripped. It is useful when leading zeros are important or long numbers are expected. |
| string | country_code | Country code | Capitalised two- or three-letter country code as specified in ISO 3166. |
| string | currency_code | Currency code | Capitalised three-letter currency code as specified in ISO 4217. |
| string | iban | IBAN | International Bank Account Number consisting of up to 34 alphanumeric characters. |
| string | vat_number | VAT number | VAT identification number. It starts with a two-letter country code followed by 2-13 characters, usually numeric. |
| number | none | Number | Number |
| number | integer | Integer | Whole number |
| number | rate | Rate | Typically in the range of 0 - 100%. |
| number | amount | Amount | A number that respects common financial notation, e.g. parentheses denote negative values. |
| date | none | Date | Date |
| date | period_begin | Period begin date | It represents the beginning of a date period and falls back to the first day of the month if not specified. |
| date | period_end | Period end date | It represents the end of a date period and falls back to the last day of the month if not specified. |
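To make some of the subtype descriptions above more concrete, here is a minimal Python sketch of the constraints they describe. It is an illustration only and does not reflect how Rossum implements these subtypes: all function names are hypothetical, and `parse_amount` assumes that a comma is a thousands separator.

```python
import calendar
import re
from datetime import date
from decimal import Decimal


def normalize_alphanumeric(text):
    """alphanumeric: strip white space, then allow only a-z, A-Z and 0-9."""
    value = re.sub(r"\s+", "", text)
    return value if re.fullmatch(r"[A-Za-z0-9]+", value) else None


def normalize_numeric_string(text):
    """numeric: digits only, kept as a string so leading zeros survive."""
    value = re.sub(r"\s+", "", text)
    return value if re.fullmatch(r"[0-9]+", value) else None


def parse_amount(text):
    """amount: treat a value wrapped in parentheses as negative."""
    value = text.strip()
    negative = value.startswith("(") and value.endswith(")")
    if negative:
        value = value[1:-1]
    # Assumption: "," is a thousands separator and "." the decimal point.
    number = Decimal(value.replace(",", ""))
    return -number if negative else number


def period_begin(year, month, day=None):
    """period_begin: fall back to the first day of the month."""
    return date(year, month, day if day is not None else 1)


def period_end(year, month, day=None):
    """period_end: fall back to the last day of the month."""
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, day if day is not None else last_day)


print(normalize_alphanumeric("AB 12 cd"))
print(normalize_numeric_string("00 123"))
print(parse_amount("(1,234.50)"))
print(period_begin(2024, 2), period_end(2024, 2))
```

Keeping the numeric subtype as a string is what preserves the leading zeros in `00 123`, which a number type would drop.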

Finally, you can select whether the engine should look for values inside or outside a line items table.

Some engine fields may come with a warning that the engine field is not connected to any captured field. This indicates that no captured field from any queue has been related to that engine field.

Schema field not connected

Relating captured fields to engine fields

Add a new captured field to an existing queue

Navigate to the queue where you want to add a new captured field. From the queue settings, go to “Fields”. Find the section where you want to add a field and click “Add field.” Create a captured field in the queue schema with the correct data type.

If an extraction engine is linked to this queue and contains a field with the same ID in its schema, the new schema field will automatically map to the corresponding engine field.

Schema field - engine field

If the ID of a field does not match the ID of an existing engine field, you have the option to create a new engine field.

Create engine field

When you click the "Create field" button, a dialog window will appear, allowing you to create a new engine field.

New engine field

Turn a non-captured field into a captured field on a queue

A field can exist in the queue schema as non-captured, but you may sometimes need to turn such a field into a captured field.

To do so, open the field editor and change the value source of the field to "Captured".

Change value source

Once you do, you will see the “Create field” button, allowing you to create a new engine field. To save the changes, you must create a new engine field to which the captured field will connect.

Assigning an extraction engine to a queue

Assign an extraction engine to an existing queue

You can assign an existing extraction engine to any non-legacy queue. Doing so lets a brand-new queue get predictions based on the knowledge the engine has already accumulated from annotations in the other queues connected to it.

Extraction engines can be assigned to an existing queue from the queue settings. Under the “Basic settings” section, find the “AI Engine” setting at the bottom of the screen.

From the dropdown menu, select the extraction engine you would like to assign to that queue.

Next, choose whether you would like that queue to contribute to engine training or not. When queues do not serve as a source of knowledge for an engine, they can still benefit from engine predictions without impacting the quality of predictions.

If the quality of annotations varies across queues, it may be useful to have only queues with high-quality annotations contribute to engine training. Otherwise, the quality of predictions may be lower.

Similarly, if a queue is only intended for testing, you may also want to skip training from it.

Contribute to engine training

Queues connected to legacy engines require changes to their queue schemas before they can be migrated to an extraction engine. Currently, only the Rossum team can migrate legacy queues to an extraction engine.

Assign an extraction engine to multiple queues

Extraction engines can be shared across multiple queues. This allows you to expand the sources of knowledge for the engine, making learning possible from not one but several of your queues.

The more data an engine has to learn from, the more accurate its predictions can become. Check out this article, which will teach you how to annotate documents to contribute to AI learning effectively.

Knowledge-sharing across multiple queues relies on having good and consistent queue schemas (queues linked to the same extraction engine must share an identical schema). As captured fields are the only types of fields the AI learns from, it is crucial to ensure that they are configured correctly across all queues shared by an extraction engine.

Compatibility issues may arise when connecting an extraction engine to queues with incompatible schemas, causing the engine to predict data poorly. To avoid this, pay attention to the configuration of all captured fields that an extraction engine will use. Make sure that all fields capturing the same kind of value from different queues are configured using the same logic:

  • Equivalent fields must have the same schema ID across all queues an engine shares. For example, the field document_id is a document identifier, which may be used when extracting data labelled as “Document ID”, “Invoice Number”, “Číslo faktury”, “Rechnungsnummer”, etc. across different queues. In all queues, the field should have the schema ID document_id.

  • Equivalent fields must use the same data type across all queues an engine shares. For example, a custom field like discount_amount should not be configured as a number in one queue but as a string in another.

  • Equivalent fields must have the same field type across all queues an engine shares. A field can either be set as single-value or multi-value (Multivalue fields in Rossum). Avoid configuring a field as single-value in one queue, but as multi-value in another.

  • All equivalent fields must be connected to a single engine field.

  • If a field is configured with the value source “Captured” in one queue, but its equivalent in another queue has a different value source, the latter queue cannot be used for learning and must be set to not contribute to engine training.

  • When switching a queue between extraction engines, all engine fields from one extraction engine must be present in the other extraction engine.
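The checklist above can be sketched as a small consistency check over the queues that share one engine. This is a hedged illustration rather than a Rossum tool: the `CapturedField` class, its attributes, the value source names, and both function names are hypothetical. `can_switch_engine` also assumes one reading of the last rule, namely that the engine you switch to must contain every field of the engine you switch from.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CapturedField:
    """Hypothetical description of one field in a queue schema."""
    schema_id: str
    data_type: str     # "string", "number", "date" or "enum"
    multivalue: bool   # False = single-value, True = multi-value
    value_source: str  # assumed: "captured" or any other source name


def find_conflicts(queues):
    """Return readable conflicts between queues that share an engine.

    `queues` maps a queue name to its list of CapturedField objects.
    """
    seen = defaultdict(list)  # schema_id -> [(queue_name, field), ...]
    for queue_name, fields in queues.items():
        for field in fields:
            seen[field.schema_id].append((queue_name, field))

    conflicts = []
    for schema_id, entries in seen.items():
        if len({f.data_type for _, f in entries}) > 1:
            conflicts.append(f"{schema_id}: data type differs across queues")
        if len({f.multivalue for _, f in entries}) > 1:
            conflicts.append(f"{schema_id}: single-value vs multi-value mismatch")
        sources = {f.value_source for _, f in entries}
        if "captured" in sources and len(sources) > 1:
            # Queues where the field is not captured should not train the engine.
            excluded = sorted(q for q, f in entries if f.value_source != "captured")
            conflicts.append(f"{schema_id}: exclude from training: {', '.join(excluded)}")
    return conflicts


def can_switch_engine(current_engine_ids, target_engine_ids):
    """Assumed rule: the target engine must contain every current engine field."""
    return set(current_engine_ids) <= set(target_engine_ids)


# Hypothetical queues, for illustration only.
queues = {
    "Invoices EU": [
        CapturedField("document_id", "string", False, "captured"),
        CapturedField("discount_amount", "number", False, "captured"),
    ],
    "Invoices US": [
        CapturedField("document_id", "string", False, "captured"),
        CapturedField("discount_amount", "string", False, "manual"),
    ],
}
conflicts = find_conflicts(queues)
for line in conflicts:
    print(line)
```

In this example `document_id` is configured consistently, while `discount_amount` is flagged twice: once for the data type mismatch and once because the second queue does not capture it.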