Python | Word Embedding using Word2Vec

Word embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated by various methods, such as neural networks, co-occurrence matrices, and probabilistic models.

Word2Vec consists of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec uses two architectures:

  1. CBOW (Continuous Bag of Words): The CBOW model predicts the current word from the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer contains the number of dimensions in which we want to represent the current word present at the output layer.

  2. Skip Gram: Skip gram predicts the surrounding context words within a specific window, given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer. The difference between the two architectures is contrasted in the sketch after this list.
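The following is a minimal sketch (not gensim code; the sentence, window size, and pair-printing format are illustrative choices) contrasting the training pairs the two architectures derive from the same sentence:

# Sketch: training pairs produced by CBOW vs. Skip Gram for one sentence
sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
window = 2

for pos, target in enumerate(sentence):
    # context words within the window on either side of the target
    context = [sentence[i]
               for i in range(max(0, pos - window),
                              min(len(sentence), pos + window + 1))
               if i != pos]
    # CBOW: predict the target word from its surrounding context
    print("CBOW      :", context, "->", target)
    # Skip Gram: predict each context word from the target word
    for c in context:
        print("Skip Gram :", target, "->", c)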

The basic idea behind word embedding is that words occurring in similar contexts tend to lie closer to each other in the vector space (the sketch below illustrates the cosine similarity measure used to quantify this closeness). To generate word vectors in Python, the modules nltk and gensim are required.
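As a quick illustration of that measure, here is how cosine similarity between two word vectors is computed; the 4-dimensional vectors below are made-up values for illustration only:

import numpy as np

# Hypothetical 4-dimensional word vectors (values made up for illustration)
v_alice = np.array([0.8, 0.1, 0.4, 0.3])
v_wonderland = np.array([0.7, 0.2, 0.5, 0.2])

# cosine similarity = dot(a, b) / (|a| * |b|)
cos_sim = np.dot(v_alice, v_wonderland) / (
    np.linalg.norm(v_alice) * np.linalg.norm(v_wonderland))
print(cos_sim)  # close to 1.0 when the vectors point in similar directions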

Run these commands in your terminal to install nltk and gensim:

pip install nltk
pip install gensim
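Note that nltk's tokenizers rely on tokenizer data that is downloaded separately; depending on your nltk version this is the 'punkt' (or, on recent versions, 'punkt_tab') package:

import nltk
nltk.download('punkt')  # tokenizer data required by sent_tokenize / word_tokenize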

Download the text file used for generating the word vectors here.

Below is the implementation:

# Python program to generate word vectors using Word2Vec

# importing all necessary modules
import warnings
warnings.filterwarnings(action='ignore')

import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

# Reads 'alice.txt' file; a raw string keeps the backslashes in the
# Windows path from being interpreted as escape sequences
sample = open(r"C:\Users\Admin\Desktop\alice.txt", "r")
s = sample.read()

# Replaces newline characters with spaces
f = s.replace("\n", " ")

data = []

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (gensim >= 4.0 names the dimensionality
# parameter vector_size; older releases called it size)
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg = 1 selects the skip-gram architecture)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))


Output:

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW :  0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.856892599521

The output shows the cosine similarities between the word vectors for 'alice', 'wonderland', and 'machines' under the two models. An interesting exercise is to change the parameter values of 'vector_size' and 'window' and observe how the cosine similarities vary.
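Beyond pairwise similarity, the trained models can be queried directly; a short sketch, assuming model1 from the listing above:

# Look up the learned 100-dimensional vector for a word
vec = model1.wv['alice']
print(vec.shape)  # (100,)

# Words whose vectors lie closest to 'alice' in the embedding space
print(model1.wv.most_similar('alice', topn=5))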

Applications of Word Embedding:

>> Sentiment Analysis
>> Speech Recognition
>> Information Retrieval
>> Question Answering
