Word embedding is a language modeling technique used to map words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated using various methods such as neural networks, co-occurrence matrices, probabilistic models, etc.
Word2Vec consists of models for generating word embeddings. These models are shallow, two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec utilizes two architectures (a toy sketch of how each builds its training pairs follows this list):
- CBOW (Continuous Bag of Words): The CBOW model predicts the current word given the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer contains the number of dimensions in which we want to represent the current word present at the output layer.
- Skip-gram: Skip-gram predicts the surrounding context words within a specific window given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer.
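To make the difference concrete, here is a minimal dependency-free sketch of how the two architectures pair inputs with prediction targets; the sentence and window radius are made-up values for illustration only:
# Toy sentence and window radius (illustrative values only)
tokens = ["alice", "was", "beginning", "to", "get", "very", "tired"]
window = 2

for i, target in enumerate(tokens):
    # Context = words within `window` positions on either side of the target
    context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
    # CBOW training pair: predict the target from its context
    print("CBOW:     ", context, "->", target)
    # Skip-gram training pairs: predict each context word from the target
    for c in context:
        print("Skip-gram:", target, "->", c)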
The underlying idea of word embeddings is that words occurring in similar contexts tend to be closer to each other in vector space. To generate word vectors in Python, the modules nltk and gensim are needed.
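Closeness here is typically measured by cosine similarity, which is also the score the gensim code below prints. A minimal sketch of the formula cos(a, b) = (a · b) / (|a| |b|), using made-up vectors:
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative 3-dimensional "word vectors" (not real embeddings)
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # 1.0 (same direction)
print(cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]))  # 0.0 (orthogonal)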
Run these commands in your terminal to install nltk and gensim:
pip install nltk
pip install gensim
Download the text file used for generating the word vectors here.
Below is the implementation:
# Python program to generate word vectors using Word2Vec

# importing all necessary modules
from nltk.tokenize import sent_tokenize, word_tokenize
import warnings

warnings.filterwarnings(action='ignore')

import gensim
from gensim.models import Word2Vec

import nltk
# sent_tokenize/word_tokenize need the 'punkt' tokenizer data; fetch it once if missing
nltk.download('punkt', quiet=True)

# Reads 'alice.txt' file (a raw string keeps the backslashes in the
# Windows path from being treated as escape sequences)
sample = open(r"C:\Users\Admin\Desktop\alice.txt", "r")
s = sample.read()

# Replaces escape character with space
f = s.replace("\n", " ")

data = []

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (in gensim >= 4.0 the dimensionality parameter is
# called vector_size; older versions called it size)
model1 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg = 1 selects the skip-gram architecture)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100, window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))
print("Cosine similarity between 'alice' and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))
Output:
Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW :  0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.856892599521
The output shows the cosine similarity between the word vectors of 'alice', 'wonderland', and 'machines' for the different models. One interesting task is to change the parameter values of 'vector_size' and 'window' and observe how the cosine similarity varies, as sketched below.
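A minimal sketch of such an experiment, assuming the `data` list built in the script above is still in scope (the grid of parameter values is arbitrary):
from gensim.models import Word2Vec

# Try a few (vector_size, window) combinations and compare similarities;
# `data` is the tokenized corpus prepared in the script above
for vector_size in (50, 100, 200):
    for window in (2, 5, 10):
        model = Word2Vec(data, min_count=1, vector_size=vector_size, window=window)
        sim = model.wv.similarity('alice', 'wonderland')
        print(f"vector_size={vector_size}, window={window}: "
              f"similarity('alice', 'wonderland') = {sim:.4f}")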
Applications of Word Embedding:
- Sentiment Analysis
- Speech Recognition
- Information Retrieval
- Question Answering