Python | Word Embedding using Word2Vec

Word embedding is a language modeling technique for mapping words to vectors of real numbers. It represents words or phrases in a vector space with several dimensions. Word embeddings can be generated by various methods, such as neural networks, co-occurrence matrices, and probabilistic models.

Word2Vec consists of models for generating word embeddings. These models are shallow two-layer neural networks with one input layer, one hidden layer, and one output layer. Word2Vec uses two architectures:

  1. CBOW (Continuous Bag of Words): The CBOW model predicts the current word from the context words within a specific window. The input layer contains the context words and the output layer contains the current word. The hidden layer contains the number of dimensions in which we want to represent the current word present at the output layer.

  2. Skip Gram: Skip gram predicts the surrounding context words within a specific window, given the current word. The input layer contains the current word and the output layer contains the context words. The hidden layer contains the number of dimensions in which we want to represent the current word present at the input layer. The difference between the two architectures is contrasted in the sketch after this list.
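The following is a minimal sketch (not gensim code; the sentence, window size, and pair-printing format are illustrative choices) contrasting the training pairs the two architectures derive from the same sentence:

# Sketch: training pairs produced by CBOW vs. Skip Gram for one sentence
sentence = ["alice", "was", "beginning", "to", "get", "very", "tired"]
window = 2

for pos, target in enumerate(sentence):
    # context words within the window on either side of the target
    context = [sentence[i]
               for i in range(max(0, pos - window),
                              min(len(sentence), pos + window + 1))
               if i != pos]
    # CBOW: predict the target word from its surrounding context
    print("CBOW      :", context, "->", target)
    # Skip Gram: predict each context word from the target word
    for c in context:
        print("Skip Gram :", target, "->", c)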

The basic idea behind word embedding is that words occurring in similar contexts tend to lie closer to each other in the vector space (the sketch below illustrates the cosine similarity measure used to quantify this closeness). To generate word vectors in Python, the modules nltk and gensim are required.
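As a quick illustration of that measure, here is how cosine similarity between two word vectors is computed; the 4-dimensional vectors below are made-up values for illustration only:

import numpy as np

# Hypothetical 4-dimensional word vectors (values made up for illustration)
v_alice = np.array([0.8, 0.1, 0.4, 0.3])
v_wonderland = np.array([0.7, 0.2, 0.5, 0.2])

# cosine similarity = dot(a, b) / (|a| * |b|)
cos_sim = np.dot(v_alice, v_wonderland) / (
    np.linalg.norm(v_alice) * np.linalg.norm(v_wonderland))
print(cos_sim)  # close to 1.0 when the vectors point in similar directions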

Run these commands in your terminal to install nltk and gensim:

pip install nltk
pip install gensim
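Note that nltk's tokenizers rely on tokenizer data that is downloaded separately; depending on your nltk version this is the 'punkt' (or, on recent versions, 'punkt_tab') package:

import nltk
nltk.download('punkt')  # tokenizer data required by sent_tokenize / word_tokenize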

Download the text file used for generating the word vectors here.

Below is the implementation:

# Python program to generate word vectors using Word2Vec

# importing all necessary modules
import warnings
warnings.filterwarnings(action='ignore')

import gensim
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize

# Reads 'alice.txt' file; a raw string keeps the backslashes in the
# Windows path from being interpreted as escape sequences
sample = open(r"C:\Users\Admin\Desktop\alice.txt", "r")
s = sample.read()

# Replaces newline characters with spaces
f = s.replace("\n", " ")

data = []

# iterate through each sentence in the file
for i in sent_tokenize(f):
    temp = []

    # tokenize the sentence into words
    for j in word_tokenize(i):
        temp.append(j.lower())

    data.append(temp)

# Create CBOW model (gensim >= 4.0 names the dimensionality
# parameter vector_size; older releases called it size)
model1 = gensim.models.Word2Vec(data, min_count=1,
                                vector_size=100, window=5)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - CBOW : ",
      model1.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - CBOW : ",
      model1.wv.similarity('alice', 'machines'))

# Create Skip Gram model (sg = 1 selects the skip-gram architecture)
model2 = gensim.models.Word2Vec(data, min_count=1, vector_size=100,
                                window=5, sg=1)

# Print results
print("Cosine similarity between 'alice' " +
      "and 'wonderland' - Skip Gram : ",
      model2.wv.similarity('alice', 'wonderland'))

print("Cosine similarity between 'alice' " +
      "and 'machines' - Skip Gram : ",
      model2.wv.similarity('alice', 'machines'))


Output:

Cosine similarity between 'alice' and 'wonderland' - CBOW :  0.999249298413
Cosine similarity between 'alice' and 'machines' - CBOW :  0.974911910445
Cosine similarity between 'alice' and 'wonderland' - Skip Gram :  0.885471373104
Cosine similarity between 'alice' and 'machines' - Skip Gram :  0.856892599521

The output shows the cosine similarities between the word vectors for 'alice', 'wonderland', and 'machines' under the two models. An interesting exercise is to change the parameter values of 'vector_size' and 'window' and observe how the cosine similarities vary.
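Beyond pairwise similarity, the trained models can be queried directly; a short sketch, assuming model1 from the listing above:

# Look up the learned 100-dimensional vector for a word
vec = model1.wv['alice']
print(vec.shape)  # (100,)

# Words whose vectors lie closest to 'alice' in the embedding space
print(model1.wv.most_similar('alice', topn=5))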

Applications of Word Embedding:

>> Sentiment Analysis
>> Speech Recognition
>> Information Retrieval
>> Question Answering
