Introduction
This methodology guide covers two types of projects: text classification and entity detection.
First of all, define the type of project:
- Text classification (what is text classification?)
- Entity detection (what is entity detection?)
- Both text classification and entity detection
In the latter case, create two separate projects and combine their outputs in a pipeline: one project can use a model produced by the other.
Although it is possible to create multilingual projects, it is recommended to create one project for each language to obtain the best possible results. In this case, the language of the documents needs to be defined.
Text classification
Step 1: Initiate the project
- Create the project (type=Text classification)
- Upload documents
- Go to the Documents view
- Inspect documents to see what they look like, and explore differences

Step 2: Define labels
- Go to the Labels view
- Create labels
- One label = one category
- Consider whether a document can belong to one or to several categories
- It is possible to create a label to obtain better results even if the label is not used to create a classification model
- Write annotation guidelines for each label (recommended)

Step 3: Pre-annotate documents (optional)
- Why use automatic pre-annotation?
- Pre-annotation with an off-the-shelf model or NLP pipeline saves time because a first version of the training dataset is created automatically (a minimal illustration follows this step)
- Pre-annotate documents
- At least some of the labels (categories) of the existing model/NLP pipeline should perfectly match the labels you want to create
- Pre-annotate a small number of documents (say 50) to start with, because all annotations need to be reviewed individually in order to create a high-quality dataset
- Select “Labelled” in the “Status” filter to access the dataset
- Please note:
- The dataset consists of all labelled documents
- Useless labels (categories) can be deleted together with their annotations in the Labels view
- When creating new labels, review all the annotated documents to complete any missing annotations (only for multi-category projects)
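To make the idea of pre-annotation concrete, here is a minimal, purely illustrative sketch in Python: a keyword-rule classifier that proposes provisional categories. The label names and keywords are hypothetical, and every proposed category still has to be reviewed by an annotator.

```python
# Purely illustrative pre-annotation sketch: propose provisional categories
# for each document from hand-picked keywords. Labels and keywords below are
# hypothetical; every proposed category still has to be reviewed.
KEYWORD_RULES = {
    "invoice": ["invoice", "amount due", "payment terms"],
    "complaint": ["refund", "dissatisfied", "not working"],
}

def pre_annotate(text):
    """Return the candidate categories whose keywords appear in the text."""
    lowered = text.lower()
    return [label for label, keywords in KEYWORD_RULES.items()
            if any(kw in lowered for kw in keywords)]

documents = [
    "The invoice lists the amount due for March.",
    "I am dissatisfied and would like a refund.",
]
for doc in documents:
    print(doc, "->", pre_annotate(doc))  # e.g. ['invoice'], ['complaint']
```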

Step 4: Annotate documents
- Annotate at the document level
- Single or multiple categories
- At least 10 to 15 annotations per label (category), following the annotation guidelines
- Continue even after the first appearance of the blue pop-up announcing that suggestions are available

Step 5: Use the suggestion engine
- Why use the suggestion engine?
- To speed up the dataset creation
- To quickly assess the machine’s ability to learn
- Go to the Suggestions view
- Accept/correct the suggested categories, then validate the document. It will be added to the dataset with its category (or categories).
- Manage suggestions
- Sort them according to their confidence level score
- Filter the list on the label (category) you want to work on
- Please note:
- The suggestion engine is updated after a few validated suggestions
- The suggestion engine is based on a machine learning algorithm with a fast training time (but which will not necessarily provide the best results)
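As an illustration of what such a fast suggestion model can look like, the sketch below trains a TF-IDF plus logistic regression pipeline and ranks an unlabelled document's possible categories by confidence. This is a generic stand-in with a toy, hypothetical dataset, not the platform's actual engine.

```python
# Generic stand-in for a fast suggestion model: TF-IDF features plus logistic
# regression, cheap enough to retrain after each batch of validated suggestions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labelled_texts = ["amount due on this invoice", "please refund my order"]
labels = ["invoice", "complaint"]  # toy, hypothetical dataset

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(labelled_texts, labels)

unlabelled = ["the invoice total seems wrong"]
probas = model.predict_proba(unlabelled)[0]
ranked = sorted(zip(model.classes_, probas), key=lambda pair: -pair[1])
print(ranked)  # categories ranked by confidence, highest first
```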

Step 6: Review the dataset
- Why review the dataset?
- Dataset quality is essential to create the best possible model
- Go to the Labels view
- Make sure the annotations are distributed as evenly as possible over the labels

- Go to the Documents view
- Select “Labelled” in the “Status” filter to access the dataset
- The dataset must be as accurate as possible: no false or missing categories, and no inconsistencies between categories

Step 7: Split the dataset
- Why split the dataset?
- To make sure the same training and test sets are used when comparing different model experiments (a conceptual sketch follows this step)
- Go to the Model experiments view
- Split the dataset by generating train/test metadata on the dataset
- Note:
- If new annotations are added to the dataset, the split will be automatically updated when launching a new experiment
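The split itself is handled by the platform when it generates the train/test metadata; conceptually it corresponds to something like the scikit-learn sketch below (hypothetical document ids and categories), where stratification keeps every category represented in both sets.

```python
# Conceptual sketch of a stratified train/test split over the dataset.
# Document ids and categories are hypothetical placeholders.
from sklearn.model_selection import train_test_split

doc_ids = ["doc1", "doc2", "doc3", "doc4", "doc5", "doc6"]
categories = ["invoice", "invoice", "invoice",
              "complaint", "complaint", "complaint"]

train_ids, test_ids = train_test_split(
    doc_ids, test_size=0.2, stratify=categories, random_state=42)

# The equivalent of the "train_on"/"test_on" metadata on each document.
split_metadata = {doc_id: ("test" if doc_id in test_ids else "train")
                  for doc_id in doc_ids}
print(split_metadata)
```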

Step 8: Train models
- Go to the Model Experiments view
- Edit each predefined experiment and check the training options so that the train & test metadata are used in the “train_on” and “test_on” parameters

- Launch the predefined experiments
- Check the quality (F-measure) of each experiment and identify the best model (the F-measure is illustrated after this step)
- Note:
- If the F-measure is below 60%, enrich and improve the dataset iteratively (see the next steps below)
- Do not create new experiments to test different algorithms while the F-measure is below 60%; it is not useful at this stage
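For reference, the F-measure reported for an experiment is the standard F1 score, which combines precision and recall on the held-out test set. Here is a minimal illustration with scikit-learn and hypothetical categories (the 60% threshold above corresponds to a score of 0.6):

```python
# Minimal illustration of the F-measure (F1 score) on a held-out test set.
# True and predicted categories are hypothetical.
from sklearn.metrics import f1_score, classification_report

y_true = ["invoice", "complaint", "invoice", "complaint", "invoice"]
y_pred = ["invoice", "invoice", "invoice", "complaint", "complaint"]

print(f1_score(y_true, y_pred, average="macro"))  # overall score between 0 and 1
print(classification_report(y_true, y_pred))      # per-label precision, recall, F1
```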

Step 9: Iterate steps 4-5-6 above to achieve 60% accuracy
- In the Model experiments view
- Identify labels with low quality by ticking the quality box
- Enrich the dataset on these labels either:
- with new manually annotated documents (see step 4 above: Annotate documents)
- or by using the suggestions (see step 5 above: Use the suggestion engine)
- In the Model experiments view
- Run the experiment again and see if the accuracy of the model has improved for each label
- Iterate… until achieving at least a 60% accuracy for each label

Step 10: Annotate the dataset automatically
- Why annotate the dataset?
- To test the dataset and model quality
- To detect possible discrepancies
- It is only useful if model accuracy is above 60%
- Go to the Documents view
- Run an automatic annotation of the dataset with the model


Step 11: Identify discrepancies
- Go to the Documents view
- Open the “Agreement: automatic-other” filter and select “Disagreement” (a small sketch of this comparison follows this step)

- Check the origin of the annotation with the letter or the tooltip on the chips
- If the model is right after all, correct the dataset manually.
- When the corrections are made, remove the automatic annotations produced by the model.


- Re-train the model. You will improve the model’s precision.
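Conceptually, the “Disagreement” filter surfaces the documents whose automatic categories differ from the manually assigned ones, along the lines of this small sketch (document ids and categories are hypothetical):

```python
# Sketch of a disagreement check between manual and automatic categories.
manual = {"doc1": {"invoice"}, "doc2": {"complaint"}, "doc3": {"invoice"}}
automatic = {"doc1": {"invoice"}, "doc2": {"invoice"}, "doc3": {"invoice"}}

disagreements = [doc for doc in manual if manual[doc] != automatic[doc]]
print(disagreements)  # ['doc2'] -> review: is the model or the dataset wrong?
```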
Step 12: Train the final model
- Why select a final model?
- To compare different algorithms and judge their accuracy
- Neither the suggestion model nor the pre-packaged experiments will necessarily produce the best model. In that case, further experiments are needed to select the final model.
- Go to the Model experiments view
- Create additional model experiments, launch models and compare the quality (F-measure)
- Note:
- The goal is to achieve an accuracy between 80% and 95% (F-measure)
- Don’t expect to achieve a 100% accuracy… but you might reach this in some simple cases
- Performance can be as important as accuracy; in that case, the highest-quality model is not necessarily the one selected as the final model
Entity detection
Step 1: Initiate the project
- Create the project (type=Entity detection)
- Upload documents.
- If documents are short (a few sentences), it is not necessary to create segments.
- Otherwise, use the default segmentation engine to start with.
- Inspect documents and segments
- Go to the Documents view and read several documents to see what they look like and how they differ
- Go to the Segments view and check whether the segmentation is good and appropriate. A different, better segmentation may be necessary; in that case, use either an off-the-shelf segmenter or build a custom segmentation pipeline (a minimal sketch follows this step).
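If a custom segmentation pipeline is needed, sentence-level segmentation is a common starting point. The sketch below uses spaCy's rule-based sentencizer purely as an illustration (spaCy is assumed to be installed; the platform's own segmenters may behave differently).

```python
# Illustrative custom segmentation: split a document into sentence-level
# segments with spaCy's rule-based sentencizer.
import spacy

nlp = spacy.blank("en")        # blank pipeline, no trained model required
nlp.add_pipe("sentencizer")    # rule-based sentence boundary detection

text = "The contract starts on 1 March. It is renewed yearly. Notice is 30 days."
segments = [sent.text for sent in nlp(text).sents]
print(segments)  # one segment per sentence
```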


Step 2: Define labels
- What is a label?
- A label describes a concept (or an entity type)
- Creating a label means that text will be annotated with that label (think about positive and negative examples of the concept to be added as annotation guidelines).
- Go to the Labels view
- Create labels
- It is possible to create a label to obtain better results even if the label is not used to create a model
- Write annotation guidelines for each label (recommended)

Step 3: Pre-annotate documents (optional)
- Why use automatic pre-annotation?
- Pre-annotation using an off-the-shelf model or NLP pipeline can save time in creating a dataset (a minimal illustration follows this step)
- Pre-annotate documents
- At least some of the labels of the model/NLP pipeline that is used to pre-annotate should perfectly match the labels of the project
- Pre-annotate a small number of documents (say 20 to 50) to start with, because all annotations need to be reviewed to create and optimize the dataset
- Please note:
- The dataset consists of all annotated (labelled) segments
- Useless labels and associated annotations can be deleted by using the Labels view
- When creating new labels, all segments that are already annotated need to be reviewed
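As an illustration of pre-annotation with an off-the-shelf model, the sketch below runs spaCy's small English NER model over a segment and keeps only the entity types assumed to match the project's labels. The segment, the label mapping and the use of spaCy are assumptions for the sake of the example, and every pre-annotation still has to be reviewed.

```python
# Illustrative pre-annotation with an off-the-shelf NER model.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
segment = "Marie Curie joined the University of Paris in 1906."

# Keep only entity types that match the project's labels (hypothetical mapping).
wanted = {"PERSON", "ORG", "DATE"}
pre_annotations = [(ent.text, ent.label_, ent.start_char, ent.end_char)
                   for ent in nlp(segment).ents if ent.label_ in wanted]
print(pre_annotations)  # each pre-annotation still has to be reviewed
```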

Step 4: Annotate text
- Go to the Segments view
- Annotate text
- At least 10 to 15 annotations per label, following the annotation guidelines
- Continue even after the first appearance of the blue pop-up announcing that suggestions are available
- If context is lacking
- Switch to the Documents view
- Possibly reconsider the segmentation of the document by using an off-the-shelf segmenter or by creating a custom segmentation pipeline
- Notes:
- Carefully review the annotated segments to avoid false, inconsistent or missing annotations
- It is better to have few annotations without errors and inconsistencies than many annotations with possible errors and inconsistencies
- Segments must be annotated consistently
- If an entity is not annotated when it should be, it will be treated as a counter-example, which will confuse the algorithm and consequently lower its quality
- The dataset consists of all annotated (labelled) segments

Step 5: Use the suggestion engine
- Why use the suggestion engine?
- To speed up dataset creation
- To quickly assess the machine’s ability to learn
- Go to the Suggestions view
- Accept/correct/reject the suggested annotations then validate the segment
- Each validated segment will be added to the dataset together with its annotations
- Manage suggestions
- Sort suggestions according to their confidence level score
- Use “high confidence” score to assess the machine’s ability to learn
- Use the “margin sampling” or “low confidence” score to handle the segments where the machine has the most difficulty
- Filter the suggestions on the labels you want to work on
- If the context of the segment is insufficient to validate a suggestion, increase the context or click on the title to access the document (possibly reconsider the segmentation of documents)
- If you reject all the suggestions and finally validate the segment, it will be added to the dataset, and thus be considered as a counter example. It can be very effective to add counter examples to a dataset to improve the accuracy of the final model.
- Note:
- The suggestion engine is updated after a few validations
- The suggestion engine is based on a machine learning algorithm with a fast training time (but which will not necessarily provide the best results)

Step 6: Review the Dataset
- Why review the dataset?
- Dataset quality is essential to create the best possible model
- Go to the Labels view
- Make sure the annotations are distributed as evenly as possible over the labels

- Go to the Segments view
- Filter the segments on Status = “Labelled”
- You will see the dataset, which consists of all annotated (labelled) segments

- In the Segments view
- Select the label within the filter “Label name“
- Then detect possible false annotations on this label

- In the Segments view
- Apply the “exclusive” mode on the filter displayed at the top by selecting the red icon next to the label.
- Then detect possible missing annotations on this label

- Note:
- The dataset must be as accurate as possible: without false or missing annotations, and without inconsistencies!
Step 7: Split the dataset
- Why split the dataset?
- To make sure the same training and test sets are used when comparing different model experiments
- Go to the Model experiments view
- Split the dataset by generating train/test metadata on the dataset
- Note
- If you add new annotations to the dataset, the split will be automatically updated when launching a new experiment

Step 8: Train first models
- Go to the Model experiments view
- Edit each predefined experiment and check the training options so that the train & test metadata are used in the “train_on” and “test_on” parameters

- Launch the predefined model experiments
- Check the quality (F-measure) of each experiment and identify the best model (a per-label illustration follows this step)
- Notes:
- If the F-measure is below 60%, enrich and improve the dataset iteratively (see the next steps below)
- Do not create new experiments to test different algorithms while the F-measure is below 60%; it is not useful at this stage
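The per-label quality behind the overall F-measure is the usual precision/recall/F1 computed per entity type. Assuming the annotations can be exported as BIO tag sequences (an assumption, not a documented feature), the seqeval library produces the same kind of per-label report:

```python
# Illustrative per-label evaluation of entity annotations with seqeval
# (pip install seqeval). The BIO tag sequences below are hypothetical.
from seqeval.metrics import classification_report

y_true = [["B-PER", "I-PER", "O", "B-ORG", "O"],
          ["O", "B-DATE", "O"]]
y_pred = [["B-PER", "I-PER", "O", "O", "O"],
          ["O", "B-DATE", "O"]]

print(classification_report(y_true, y_pred))  # precision, recall, F1 per label
```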

Step 9: Iterate steps 4-5-6 above to achieve 60% accuracy
- In the Model experiments view
- Identify the labels with low quality in the quality report
- Enrich the dataset on these labels either:
- with new manually annotated segments (see above 4 – Annotate text)
- or using the Suggestions view (see above 5 – Use the suggestion engine)
- In the Model experiments view
- Run the experiment again and see if the accuracy of the model has improved for each label
- Iterate… until achieving a 60% accuracy per label

Step 10: Annotate the dataset automatically
- Why annotate the dataset?
- To test the model & dataset quality
- To detect possible discrepancies
- Go to the Documents view
- Run an automatic annotation of the dataset with the model
- Note
- This is only useful if the model accuracy is above 60%


Step 11: Identify discrepancies
- Go to the Segments view
- Select “Disagreement” in the filter “Agreement: automatic-other“

- Look at each segment to understand the reason why there are discrepancies:
- If there is a mistake, delete the annotation
- If there is a pattern with no or very few examples in the dataset, use the similarity search on the segment to enrich the dataset (a minimal illustration follows this step).
- If the quality of the text is bad (especially with PDF files converted into text), a solution could be to improve the quality of the converter.
- Sometimes there is no real explanation
- When you have finished the corrections, remove the automatic annotations produced by the model.
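A similarity search of this kind can be pictured as follows: given a problematic segment, rank the other segments by textual similarity and annotate the closest ones. The sketch below uses TF-IDF and cosine similarity as a generic stand-in for the platform's similarity search; the segments are hypothetical.

```python
# Generic stand-in for a similarity search over segments: TF-IDF vectors plus
# cosine similarity. Query and candidate segments are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

query = "Payment is due within 30 days of the invoice date."
candidates = ["Invoices are payable within 45 days.",
              "The meeting is scheduled for Monday.",
              "Late payment incurs a 2% monthly fee."]

vectorizer = TfidfVectorizer().fit([query] + candidates)
scores = cosine_similarity(vectorizer.transform([query]),
                           vectorizer.transform(candidates))[0]
ranked = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
print(ranked)  # most similar segments first: good candidates to annotate next
```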


Step 12: Train the final model
- Why create a final model?
- You may want to compare different algorithms in terms of accuracy
- Neither the suggestion model nor the pre-packaged experiments will necessarily produce the best model. In that case, further experiments are needed to find the best final model.
- Go to the Model experiments view
- Create new experiments and launch them to test different algorithms
- Compare quality (F-measure) between models
- Note:
- The goal is to achieve an accuracy between 80% and 95% (F-measure)
- Don’t expect to achieve 100% accuracy… but you might achieve this in some simple cases
- Performance can be as important as accuracy; in that case, you might not select the best model in terms of quality