tf-idf Model for Page Ranking

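tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is commonly used in information retrieval and text mining. Search engines often use variants of the tf-idf weighting scheme to score and rank the relevance of documents for a given query. The weight is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases in proportion to the number of times the word appears in the document, but is offset by the frequency of the word in the corpus (the data set).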


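How to compute: tf-idf is a weighting scheme that assigns each term in a document a weight based on its term frequency (tf) and its inverse document frequency (idf). Terms with higher weight scores are considered more important.

Typically, the tf-idf weight is composed of two terms:

Normalized term frequency (tf)
Inverse document frequency (idf)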

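Let us take 3 documents to show how this works.

Document 1: Ben studies about computers in Computer Lab.
Document 2: Steve teaches at Brown University.
Document 3: Data Scientists work on large datasets.

Suppose we search these documents with the following query: Data Scientists

The query is a free-text query: its terms are typed freeform into the search interface, without any connecting search operators.

Step 1: Compute term frequency (tf)

Term frequency is the number of times a particular term t occurs in a document d. Therefore,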

tf(t, d) = N(t, d), wherein tf(t, d) = term frequency for a term t in document d.

N(t, d)  = number of times a term t occurs in document d

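The more often a term appears in a document, the more important it becomes, which is logical. Because the order of terms does not matter, we can represent a document as a vector in the bag-of-words model. The vector has one entry for each unique term in the document, and the value of that entry is the term frequency.

Given below are the terms and their frequencies in each document, N(t, d). The tables list content terms only ("about", "in", "at", and "on" are left out), and "computers" and "Computer" are counted as the same term.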

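tf for Document 1:

Document 1	Ben	studies	computer	lab
tf	1	1	2	1

Vector space representation for Document 1: [1, 1, 2, 1]

tf for Document 2:

Document 2	Steve	teaches	Brown	University
tf	1	1	1	1

Vector space representation for Document 2: [1, 1, 1, 1]

tf for Document 3:

Document 3	data	scientists	work	large	datasets
tf	1	1	1	1	1

Vector space representation for Document 3: [1, 1, 1, 1, 1]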

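Representing documents as vectors in a common vector space is known as the vector space model, and it is fundamental to information retrieval.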
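The counting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the documents are tokenized by hand and lower-cased, "computers" is reduced to "computer" to mirror the Document 1 table, and the names `docs`, `raw_term_counts`, and `doc1_terms` are made up for this example.

```python
from collections import Counter

# Hand-tokenized, lower-cased documents (an assumption of this sketch).
# "computers" is written as "computer" so both mentions count as one term.
docs = {
    "doc1": ["ben", "studies", "about", "computer", "in", "computer", "lab"],
    "doc2": ["steve", "teaches", "at", "brown", "university"],
    "doc3": ["data", "scientists", "work", "on", "large", "datasets"],
}


def raw_term_counts(tokens):
    """Return a mapping term -> N(t, d) for one tokenized document."""
    return Counter(tokens)


counts = {name: raw_term_counts(tokens) for name, tokens in docs.items()}

# Build the vector over the terms listed in the Document 1 table.
doc1_terms = ["ben", "studies", "computer", "lab"]
doc1_vector = [counts["doc1"][term] for term in doc1_terms]
print(doc1_vector)  # [1, 1, 2, 1]
```

A `Counter` returns 0 for terms that never occur, which is convenient when the same term list is used to build vectors for several documents.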

Since raw term counts depend on the number of occurrences, longer documents are favored. To avoid this, normalize the term frequency:

tf(t, d) = N(t, d) / ||D||
where ||D|| = total number of terms in the document

||D|| for each document:

Document	||D||
1	7
2	5
3	6

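Given below are the normalized term frequencies for all the documents, i.e. N(t, d) / ||D||:

Normalized tf for Document 1:

Document 1	Ben	studies	computer	lab
Normalized tf	0.143	0.143	0.286	0.143

Vector space representation for Document 1: [0.143, 0.143, 0.286, 0.143]

Normalized tf for Document 2:

Document 2	Steve	teaches	Brown	University
Normalized tf	0.2	0.2	0.2	0.2

Vector space representation for Document 2: [0.2, 0.2, 0.2, 0.2]

Normalized tf for Document 3:

Document 3	data	scientists	work	large	datasets
Normalized tf	0.167	0.167	0.167	0.167	0.167

Vector space representation for Document 3: [0.167, 0.167, 0.167, 0.167, 0.167]

The following Python function performs the normalized tf calculation: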

    def termFrequency(term, doc):
        """
        Input: term: Term in the Document, doc: Document
        Return: Normalized tf: Number of times term occurs
        in document/Total number of terms in the document
        """
        # Splitting the document into individual terms
        normalizeTermFreq = doc.lower().split()

        # Number of times the term occurs in the document
        term_in_document = normalizeTermFreq.count(term.lower())

        # Total number of terms in the document
        len_of_document = float(len(normalizeTermFreq))

        # Normalized Term Frequency
        normalized_tf = term_in_document / len_of_document

        return normalized_tf
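To reproduce a whole row of the normalized tables at once, the same calculation can be applied to every term of a document. The sketch below is one possible way to do it, not the article's own code: it assumes a hand-tokenized Document 1 with "computers" reduced to "computer", and `normalized_tf` is a hypothetical helper name.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return term -> N(t, d) / ||D|| for one tokenized document."""
    total = len(tokens)
    if total == 0:
        # An empty document has no terms, so there is nothing to normalize.
        return {}
    return {term: count / total for term, count in Counter(tokens).items()}


# Hand-tokenized Document 1 (assumption: "computers" reduced to "computer").
doc1 = ["ben", "studies", "about", "computer", "in", "computer", "lab"]
tf_doc1 = normalized_tf(doc1)

print(round(tf_doc1["computer"], 3))  # 0.286
print(round(tf_doc1["ben"], 3))       # 0.143
```

Because every count is divided by the same document length, the values for one document always sum to 1.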

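Step 2: Compute the inverse document frequency (idf)

idf measures how important a term is across the corpus. The main purpose of a search is to find relevant documents that match the query. Since tf treats all terms as equally important, we cannot use term frequency alone to weight a term in a document. Certain terms, such as "is", "of", and "that", may appear many times yet carry little importance. We therefore need to weigh down the frequent terms while scaling up the rare ones. Logarithms help us do this.

First, find the document frequency of a term t by counting the number of documents that contain it: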

df(t) = N(t)

where-
df(t) = Document frequency of a term t
N(t) = Number of documents containing the term t
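As a small illustration of this definition, document frequency can be counted as follows. The function name `document_frequency` and the hand-tokenized corpus are assumptions made for this sketch.

```python
def document_frequency(term, tokenized_docs):
    """Return df(t): the number of documents that contain the term."""
    return sum(1 for tokens in tokenized_docs if term in tokens)


# Hand-tokenized, lower-cased corpus (an assumption of this sketch).
corpus = [
    ["ben", "studies", "about", "computer", "in", "computer", "lab"],
    ["steve", "teaches", "at", "brown", "university"],
    ["data", "scientists", "work", "on", "large", "datasets"],
]

# "computer" occurs twice in the first document but still counts once.
print(document_frequency("computer", corpus))  # 1
```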

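Term frequency is the number of occurrences of a term in one particular document, whereas document frequency is the number of different documents the term appears in, so it depends on the whole corpus. Now let us look at the definition of inverse document frequency. The idf of a term is the number of documents in the corpus divided by the document frequency of the term.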

idf(t) = N/ df(t) = N/N(t)

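A term with a higher document frequency is expected to be less important, but the raw factor (most likely an integer) seems too harsh. Therefore, we take the logarithm (base 2) of the inverse document frequency. The idf of a term t becomes: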

idf(t) = log(N/ df(t))

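This is better, and since the logarithm is a monotonically increasing function, we can safely use it. Let us compute the idf for the term computer:

idf(computer) = log(Total number of documents / Number of documents containing the term computer)

There are 3 documents in total: Document 1, Document 2, and Document 3.

The term computer appears only in Document 1: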

idf(computer) = log(3 / 1)
          = 1.5849

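Given below is the idf for the terms occurring in all the documents:

Term	Number of documents containing the term, N(t)	idf = log(N / N(t))
Ben	1	log(3/1) = 1.5849
studies	1	log(3/1) = 1.5849
computer	1	log(3/1) = 1.5849
lab	1	log(3/1) = 1.5849
Steve	1	log(3/1) = 1.5849
teaches	1	log(3/1) = 1.5849
Brown	1	log(3/1) = 1.5849
University	1	log(3/1) = 1.5849
data	1	log(3/1) = 1.5849
scientists	1	log(3/1) = 1.5849
work	1	log(3/1) = 1.5849
large	1	log(3/1) = 1.5849
datasets	1	log(3/1) = 1.5849

Given below is a Python function to calculate idf: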

    from math import log2

    def inverseDocumentFrequency(term, allDocs):
        """
        Input: term: Term in the Document,
        allDocs: List of all documents
        Return: Inverse Document Frequency (idf) for term
        = Logarithm base 2 ((Total Number of Documents) /
        (Number of documents containing the term))
        """
        num_docs_with_given_term = 0

        # Iterate through all the documents
        for doc in allDocs:
            # If the term is present in the document, then
            # increment "num_docs_with_given_term"
            if term.lower() in doc.lower().split():
                num_docs_with_given_term += 1

        if num_docs_with_given_term > 0:
            # Total number of documents
            total_num_docs = len(allDocs)

            # Calculating the IDF (base 2, matching the worked example)
            idf_val = log2(float(total_num_docs) / num_docs_with_given_term)
            return idf_val
        else:
            # The term occurs in no document
            return 0
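As a quick check of the idf values in the table, here is a compact, self-contained sketch that works on hand-tokenized documents. The helper name `idf` and the corpus layout are assumptions for this example. It also shows the other extreme: a term that occurs in every document gets an idf of 0.

```python
import math


def idf(term, tokenized_docs):
    """Return log2(N / df(t)), or 0.0 when the term occurs in no document."""
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    if df == 0:
        return 0.0
    return math.log2(len(tokenized_docs) / df)


# Hand-tokenized, lower-cased corpus (an assumption of this sketch).
corpus = [
    ["ben", "studies", "about", "computer", "in", "computer", "lab"],
    ["steve", "teaches", "at", "brown", "university"],
    ["data", "scientists", "work", "on", "large", "datasets"],
]

idf_computer = idf("computer", corpus)  # log2(3 / 1), roughly 1.585

# Hypothetical corpus in which "the" appears in every document.
common_corpus = [["the", "cat"], ["the", "dog"], ["the", "bird"]]
idf_the = idf("the", common_corpus)     # log2(3 / 3) = 0.0
```

Returning 0.0 for an unseen term mirrors the `else` branch of the function above; smoothed variants of idf exist, but they are outside the scope of this walkthrough.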

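Step 3: tf-idf scoring

Now that we have defined both tf and idf, we can combine them to produce the final score of a term t in document d:

tf-idf(t, d) = tf(t, d) * idf(t)

For each term in the query, multiply its normalized term frequency in each document by its idf. In Document 3, the normalized term frequency of the term data is 0.167 and its idf is 1.5849. Multiplying them gives about 0.2646. Given below are the tf * idf calculations for data and scientists in all the documents.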

Term	Document 1	Document 2	Document 3
data	0	0	0.2646
scientists	0	0	0.2646
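The scoring step can be sketched end to end as follows. This is a minimal illustration, assuming hand-tokenized documents and the hypothetical helpers `tf`, `idf`, and `tf_idf`. Because the code keeps full precision (tf = 1/6 rather than the rounded 0.167), it yields about 0.2642 for Document 3 instead of the 0.2646 obtained from the rounded values.

```python
import math

# Hand-tokenized, lower-cased documents (an assumption of this sketch).
docs = {
    "Document 1": ["ben", "studies", "about", "computer", "in", "computer", "lab"],
    "Document 2": ["steve", "teaches", "at", "brown", "university"],
    "Document 3": ["data", "scientists", "work", "on", "large", "datasets"],
}
all_docs = list(docs.values())


def tf(term, tokens):
    """Normalized term frequency N(t, d) / ||D||."""
    return tokens.count(term) / len(tokens)


def idf(term, tokenized_docs):
    """log2(N / df(t)), or 0.0 when the term occurs in no document."""
    df = sum(1 for tokens in tokenized_docs if term in tokens)
    return math.log2(len(tokenized_docs) / df) if df else 0.0


def tf_idf(term, tokens, tokenized_docs):
    """tf-idf(t, d) = tf(t, d) * idf(t)."""
    return tf(term, tokens) * idf(term, tokenized_docs)


query = ["data", "scientists"]
scores = {
    term: [tf_idf(term, tokens, all_docs) for tokens in all_docs]
    for term in query
}

for term, row in scores.items():
    print(term, [round(value, 4) for value in row])
```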

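We can use any similarity measure (for example, cosine similarity) to find the similarity between the query and each document. With cosine similarity, the smaller the angle between two vectors, the greater the similarity.

Using the formula given below, we can find the similarity between any two documents, say d1 and d2.

Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)

Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + ... + d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)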
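The formula translates directly into code. The sketch below is illustrative only: it assumes each vector holds tf-idf weights for the two query terms (data, scientists), it gives the query itself a normalized tf of 0.5 per term, and it treats similarity with an all-zero vector as 0 to avoid dividing by zero. The function names are made up for this example.

```python
import math


def dot_product(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    denominator = math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    if denominator == 0:
        # Convention for this sketch: a zero vector matches nothing.
        return 0.0
    return dot_product(a, b) / denominator


idf_value = math.log2(3 / 1)  # idf of both query terms in the 3-document corpus

# Vectors over the terms (data, scientists).
query_vector = [0.5 * idf_value, 0.5 * idf_value]
doc_vectors = {
    "Document 1": [0.0, 0.0],
    "Document 2": [0.0, 0.0],
    "Document 3": [idf_value / 6, idf_value / 6],
}

ranking = sorted(
    doc_vectors,
    key=lambda name: cosine_similarity(query_vector, doc_vectors[name]),
    reverse=True,
)
print(ranking[0])  # Document 3
```

Document 3 points in the same direction as the query, so its cosine similarity is 1 (up to floating-point error), while the other two documents share no terms with the query and score 0.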
