From this line:
txt.replace(rf"\b({'|'.join(words)})\b", '', regex=True)
This call uses the signature of pd.Series.replace, so your function expects a Series as input. On the other hand, df['old_text'].apply(removeWords) applies the function to each cell of df['old_text']. That means txt is just a string inside the function, and str.replace does not accept the keyword argument regex=True, so the call raises a TypeError.
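To make the difference concrete, here is a minimal sketch. The two-row frame and the helper name probe are made up for illustration; only the column name old_text comes from the question.

```python
import pandas as pd

# Tiny illustrative frame (not the full data from the question).
df = pd.DataFrame({"old_text": ["you have a dog", "i will visit you"]})

seen_types = []

def probe(txt):
    # Record what .apply actually hands to the function.
    seen_types.append(type(txt).__name__)
    return txt

df["old_text"].apply(probe)
print(seen_types)  # each cell arrives as a plain Python string

# A plain string's replace has no regex keyword, so this fails.
try:
    "you have a dog".replace("you", "", regex=True)
    rejected = False
except TypeError as exc:
    rejected = True
    print("str.replace rejected regex=True:", exc)
```

So inside apply the function never sees a Series, which is why the pandas-style call breaks.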
TL;DR: pass the whole column to the function instead of using apply:
df['new_text'] = removeWords(df['old_text'])
Output:
   id  old_text                      new_text
0   0  my favorite color is blue     favorte color s blue
1   1  you have a dog                have a dog
2   2  we built the house ourselves  bult the house selves
3   3  i will visit you              wll vst
But as you can see, your function also removes the i (and the our) inside other words. You may want to modify the pattern so that it only replaces whole words, using the word-boundary indicator \b:
def removeWords(txt):
    words = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', 'your', 'yours', 'yourself']
    # note the `\b` word boundaries on both sides of the group
    return txt.replace(rf"\b({'|'.join(words)})\b", '', regex=True)
Output:
   id  old_text                      new_text
0   0  my favorite color is blue     favorite color is blue
1   1  you have a dog                have a dog
2   2  we built the house ourselves  built the house
3   3  i will visit you              will visit
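For reference, a self-contained sketch that rebuilds the sample frame from the output above and runs the fixed function. It also shows one possible way to keep apply: a per-string variant built on re.sub. The name remove_words_str and the constants WORDS and PATTERN are hypothetical, not from the question.

```python
import re
import pandas as pd

WORDS = ['i', 'my', 'myself', 'we', 'our', 'ours', 'ourselves',
         'you', 'your', 'yours', 'yourself']
# \b on both sides restricts matches to whole words.
PATTERN = rf"\b({'|'.join(WORDS)})\b"

def removeWords(txt):
    # txt is a whole Series here, so this is pd.Series.replace.
    return txt.replace(PATTERN, '', regex=True)

def remove_words_str(text):
    # Hypothetical per-string variant, suitable for .apply.
    return re.sub(PATTERN, '', text)

df = pd.DataFrame({
    'id': [0, 1, 2, 3],
    'old_text': ['my favorite color is blue',
                 'you have a dog',
                 'we built the house ourselves',
                 'i will visit you'],
})

# Vectorized call on the whole column.
df['new_text'] = removeWords(df['old_text'])

# Same result cell by cell, if you prefer to keep apply.
df['new_text_apply'] = df['old_text'].apply(remove_words_str)

print(df[['old_text', 'new_text']])
print(df['new_text'].equals(df['new_text_apply']))
```

Note that the removed words leave their surrounding spaces behind (for example ' have a dog'); if that matters, df['new_text'].str.strip() trims the ends.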