Transformers get named entity prediction for words instead of tokens

There are two questions here.

Annotating Token Classification

A common sequence-tagging scheme, especially in Named Entity Recognition, is BIO/IOB tagging: the first token of an entity with tag X gets B-X, and every following token of that entity gets I-X. The problem is that most annotated datasets are tokenized by whitespace, i.e. at the word level! For example:

[CLS]   O
Damien  B-ARTIST
Hirst   I-ARTIST
oil     B-MEDIUM
in      I-MEDIUM
canvas  I-MEDIUM
[SEP]   O

where O indicates that the token is not part of a named entity, B-ARTIST marks the beginning of a token sequence labelled ARTIST, and I-ARTIST marks a token inside that sequence – the same pattern applies to MEDIUM.
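As an illustration, this small sketch converts word-level entity spans into that scheme (the `bio_tags` helper is hypothetical, written here just to show the rule, not part of any library):

```python
def bio_tags(words, spans):
    """Assign B-X to the first word of each entity span and I-X to the rest.

    spans: list of (start, end, entity_type) with end exclusive.
    """
    tags = ["O"] * len(words)
    for start, end, entity in spans:
        tags[start] = f"B-{entity}"
        for i in range(start + 1, end):
            tags[i] = f"I-{entity}"
    return tags

words = ["Damien", "Hirst", "oil", "in", "canvas"]
spans = [(0, 2, "ARTIST"), (2, 5, "MEDIUM")]
print(bio_tags(words, spans))
# ['B-ARTIST', 'I-ARTIST', 'B-MEDIUM', 'I-MEDIUM', 'I-MEDIUM']
```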

At the time I posted this answer, there was an example of NER in the Hugging Face documentation.

The example doesn’t answer this question exactly, but it adds some clarification. Named-entity labels in the style of that example would look as follows:

label_list = [
    "O",         # not a named entity
    "B-ARTIST",  # beginning of an artist name
    "I-ARTIST",  # inside an artist name
    "B-MEDIUM",  # beginning of a medium name
    "I-MEDIUM",  # inside a medium name
]

Adapt Tokenizations

With all that said about the annotation scheme, BERT and several other models use a different, subword-level tokenization. So we have to align these two tokenizations. With bert-base-uncased, the expected outcome looks like this:

damien  B-ARTIST
hi      I-ARTIST
##rst   I-ARTIST
oil     B-MEDIUM
in      I-MEDIUM
canvas  I-MEDIUM

To get this done, you can go through each token in the original annotation, tokenize it with the model's tokenizer, and repeat its label for every resulting piece:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

tokens_old = ['Damien', 'Hirst', 'oil', 'in', 'canvas']
labels_old = ["B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]
label2id = {label: idx for idx, label in enumerate(label_list)}

tokens, labels = zip(*[
    (token, label)
    for token_old, label in zip(tokens_old, labels_old)
    for token in tokenizer.tokenize(token_old)
])

When you add [CLS] and [SEP] to the tokens, the corresponding label "O" must be added to labels as well.
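That step can be sketched as follows (the tokens and labels here are example values standing in for the aligned output above):

```python
# Aligned subword tokens and labels, as produced by the snippet above.
tokens = ["damien", "hi", "##rst", "oil", "in", "canvas"]
labels = ["B-ARTIST", "I-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]

# Wrap the sequence with the special tokens and give each the "O" label.
tokens = ["[CLS]"] + tokens + ["[SEP]"]
labels = ["O"] + labels + ["O"]
print(tokens)
# ['[CLS]', 'damien', 'hi', '##rst', 'oil', 'in', 'canvas', '[SEP]']
```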

With the code above, you can end up with a beginning tag like B-ARTIST repeated when the first word of an entity is split into pieces. Following the Hugging Face documentation, you can encode the labels of these extra pieces as -100 so that they are ignored by the loss function:

Something like this should work:

tokens, labels = zip(*[
    (token, label2id[label] if (label[:2] != "B-" or i == 0) else -100)
    for token_old, label in zip(tokens_old, labels_old)
    for i, token in enumerate(tokenizer.tokenize(token_old))
])
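To see the masking end to end without downloading a model, here is a self-contained sketch that swaps in a toy subword splitter for `tokenizer.tokenize` (the `toy_tokenize` function and its particular splits are assumptions for illustration only):

```python
def toy_tokenize(word):
    # Toy stand-in for tokenizer.tokenize: splits a couple of words into
    # pieces so the -100 masking is visible; everything else passes through.
    pieces = {"Damien": ["da", "##mien"], "Hirst": ["hi", "##rst"]}
    return pieces.get(word, [word.lower()])

label_list = ["O", "B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM"]
label2id = {label: idx for idx, label in enumerate(label_list)}

tokens_old = ["Damien", "Hirst", "oil", "in", "canvas"]
labels_old = ["B-ARTIST", "I-ARTIST", "B-MEDIUM", "I-MEDIUM", "I-MEDIUM"]

tokens, labels = zip(*[
    (token, label2id[label] if (label[:2] != "B-" or i == 0) else -100)
    for token_old, label in zip(tokens_old, labels_old)
    for i, token in enumerate(toy_tokenize(token_old))
])
print(list(tokens))  # ['da', '##mien', 'hi', '##rst', 'oil', 'in', 'canvas']
print(list(labels))  # [1, -100, 2, 2, 3, 4, 4]
```

Only the second piece of "Damien" is masked with -100, because its word-level label starts with "B-"; the pieces of "Hirst" keep the I-ARTIST id, matching the description above.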

