You can use spaCy's similarity method, which computes the cosine similarity between token vectors for you. To get meaningful vectors, load a model that ships with word vectors:
import spacy
nlp = spacy.load("en_core_web_md")
text = "I have a text file that contains the content of a web page that I have extracted using BeautifulSoup. I need to find N similar words from the text file based on a given word. The process is as follows"
doc = nlp(text)
words = ['goal', 'soccer']
# compute similarity
similarities = {}
for word in words:
    tok = nlp(word)  # nlp(word) returns a Doc; its vector is used for the comparison
    similarities[tok.text] = {}
    for tok_ in doc:
        similarities[tok.text].update({tok_.text: tok.similarity(tok_)})
# sort
top10 = lambda x: {k: v for k, v in sorted(similarities[x].items(), key=lambda item: item[1], reverse=True)[:10]}
# desired output
top10("goal")
{'need': 0.41729581641359625,
'that': 0.4156277030017712,
'to': 0.40102258054859163,
'is': 0.3742535591719576,
'the': 0.3735002888862756,
'The': 0.3735002888862756,
'given': 0.3595024941701789,
'process': 0.35218102758578645,
'have': 0.34597281472837316,
'as': 0.34433650293640194}
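Since many of the top matches above are stop words, you may also want to restrict the comparison to content tokens. A minimal sketch of that variant, using spaCy's has_vector, is_stop and is_punct token attributes (this filtering step is my own addition, not part of the original snippet):
# Variant: skip stop words, punctuation and tokens without a vector
similarities = {}
for word in words:
    tok = nlp(word)
    similarities[tok.text] = {
        tok_.text: tok.similarity(tok_)
        for tok_ in doc
        if tok_.has_vector and not tok_.is_stop and not tok_.is_punct
    }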
Note that (1) if you're comfortable with gensim, and/or (2) you have a word2vec model trained on your text, you can do this directly:
word2Vec.most_similar(positive=['goal'], topn=10)
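If you don't have such a model yet, a minimal training sketch with gensim could look like the following (the sentence tokenisation and all parameters here are assumptions, and the query word must actually occur in the training corpus for most_similar to work):
from gensim.models import Word2Vec

# Assumed: `doc` is the spaCy Doc built above; its sentences serve as the training corpus
sentences = [[t.text.lower() for t in sent] for sent in doc.sents]
w2v = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=1, workers=1)

# Only works if 'goal' appears in the training vocabulary
w2v.wv.most_similar(positive=['goal'], topn=10)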