Regex match only when capture group occurs last in string

You can use

df = df[df['Text'].str.contains(r'\|g__[^|]*$')]

The \|g__[^|]*$ regex matches |g__ and then zero or more chars other than | till the end of the string.

See the regex demo.

Pandas test:

import pandas as pd
df = pd.DataFrame({'Text':['k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium', 
'k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_pseudolongum',
'k__Bacteria|p__Actinobacteria|c__Actinobacteria|o__Bifidobacteriales|f__Bifidobacteriaceae|g__Bifidobacterium|s__Bifidobacterium_pseudolongum|t__GCF_000421365']})
df = df[df['Text'].str.contains(r'\|g__[^|]*$')]

CLICK HERE to find out more related problems solutions.

Leave a Comment

Your email address will not be published.

Scroll to Top