从文件中找出k个最常用的单词

给我一本文字书。假设你有足够的内存来容纳所有单词。设计一个数据结构来查找前K个最大出现的单词。数据结构应该是动态的,以便可以添加新词。 一个简单的解决办法是 使用散列 .在哈希表中逐个哈希所有单词。如果一个单词已经存在,则增加其计数。最后,遍历哈希表并返回计数最大的k个单词。 我们可以 使用Trie和Min-Heap 以有效地获取最频繁的单词。其想法是使用Trie搜索现有单词,并高效地添加新词。Trie还存储单词出现的次数。大小为k的Min Heap用于在任何时间点跟踪k个最频繁的单词(Min Heap的用法与我们用于在中查找k个最大元素的用法相同) 帖子)。 Trie和Min-Heap通过在Trie’indexMinHeap’中存储一个附加字段和在Min-Heap中存储一个指针’trNode’来相互链接。对于当前不在Min Heap中(或当前不在前k个频繁字中)的字,“indexMinHeap”的值保持为-1。对于Min Heap中存在的单词,“indexMinHeap”包含Min Heap中单词的索引。Min Heap中的指针“trNode”指向与Trie中的单词对应的叶节点。 以下是从文件中打印k个最常用单词的完整过程。 逐字朗读。对于每个单词,将其插入Trie。增加单词的计数器(如果已经存在)。现在,我们也需要在min heap中插入这个词。对于最小堆中的插入,会出现3种情况: 1. 这个词已经出现了。我们只需在min heap中增加相应的频率值,并为Trie中的“indexMinHeap”字段获得的索引调用minHeapify()。当交换最小堆节点时,我们在Trie中更改相应的minHeapIndex。记住,min堆的每个节点都有指向Trie叶节点的指针。 2. 肉馅堆还没满。我们将把新词插入min heap并更新min heap节点中的根节点,以及Trie leaf节点中的min heap索引。现在,调用buildMinHeap()。 3. 最小堆已满。出现了两个子案例。 …. 3.1 插入新词的频率小于存储在min heap头部的词的频率。什么都不做。 …. 3.2 插入新词的频率大于存储在min heap头部的词的频率。替换并更新字段。确保将Trie中“要替换的单词”对应的最小堆索引更新为-1,因为该单词不再在最小堆中。 4. 最后,Min Heap将拥有给定文件中所有单词中最常见的k个单词。所以我们只需要打印Min Heap中的所有单词。

null

CPP

// A program to find k most frequent words in a file
#include <stdio.h>
#include <string.h>
#include <ctype.h>
# define MAX_CHARS 26
# define MAX_WORD_SIZE 30
// A Trie node
struct TrieNode
{
bool isEnd; // indicates end of word
unsigned frequency; // the number of occurrences of a word
int indexMinHeap; // the index of the word in minHeap
TrieNode* child[MAX_CHARS]; // represents 26 slots each for 'a' to 'z'.
};
// A Min Heap node
struct MinHeapNode
{
TrieNode* root; // indicates the leaf node of TRIE
unsigned frequency; //  number of occurrences
char * word; // the actual word stored
};
// A Min Heap
struct MinHeap
{
unsigned capacity; // the total size a min heap
int count; // indicates the number of slots filled.
MinHeapNode* array; //  represents the collection of minHeapNodes
};
// A utility function to create a new Trie node
TrieNode* newTrieNode()
{
// Allocate memory for Trie Node
TrieNode* trieNode = new TrieNode;
// Initialize values for new node
trieNode->isEnd = 0;
trieNode->frequency = 0;
trieNode->indexMinHeap = -1;
for ( int i = 0; i < MAX_CHARS; ++i )
trieNode->child[i] = NULL;
return trieNode;
}
// A utility function to create a Min Heap of given capacity
MinHeap* createMinHeap( int capacity )
{
MinHeap* minHeap = new MinHeap;
minHeap->capacity = capacity;
minHeap->count  = 0;
// Allocate memory for array of min heap nodes
minHeap->array = new MinHeapNode [ minHeap->capacity ];
return minHeap;
}
// A utility function to swap two min heap nodes. This function
// is needed in minHeapify
void swapMinHeapNodes ( MinHeapNode* a, MinHeapNode* b )
{
MinHeapNode temp = *a;
*a = *b;
*b = temp;
}
// This is the standard minHeapify function. It does one thing extra.
// It updates the minHapIndex in Trie when two nodes are swapped in
// in min heap
void minHeapify( MinHeap* minHeap, int idx )
{
int left, right, smallest;
left = 2 * idx + 1;
right = 2 * idx + 2;
smallest = idx;
if ( left < minHeap->count &&
minHeap->array[ left ]. frequency <
minHeap->array[ smallest ]. frequency
)
smallest = left;
if ( right < minHeap->count &&
minHeap->array[ right ]. frequency <
minHeap->array[ smallest ]. frequency
)
smallest = right;
if ( smallest != idx )
{
// Update the corresponding index in Trie node.
minHeap->array[ smallest ]. root->indexMinHeap = idx;
minHeap->array[ idx ]. root->indexMinHeap = smallest;
// Swap nodes in min heap
swapMinHeapNodes (&minHeap->array[ smallest ], &minHeap->array[ idx ]);
minHeapify( minHeap, smallest );
}
}
// A standard function to build a heap
void buildMinHeap( MinHeap* minHeap )
{
int n, i;
n = minHeap->count - 1;
for ( i = ( n - 1 ) / 2; i >= 0; --i )
minHeapify( minHeap, i );
}
// Inserts a word to heap, the function handles the 3 cases explained above
void insertInMinHeap( MinHeap* minHeap, TrieNode** root, const char * word )
{
// Case 1: the word is already present in minHeap
if ( (*root)->indexMinHeap != -1 )
{
++( minHeap->array[ (*root)->indexMinHeap ]. frequency );
// percolate down
minHeapify( minHeap, (*root)->indexMinHeap );
}
// Case 2: Word is not present and heap is not full
else if ( minHeap->count < minHeap->capacity )
{
int count = minHeap->count;
minHeap->array[ count ]. frequency = (*root)->frequency;
minHeap->array[ count ]. word = new char [ strlen ( word ) + 1];
strcpy ( minHeap->array[ count ]. word, word );
minHeap->array[ count ]. root = *root;
(*root)->indexMinHeap = minHeap->count;
++( minHeap->count );
buildMinHeap( minHeap );
}
// Case 3: Word is not present and heap is full. And frequency of word
// is more than root. The root is the least frequent word in heap,
// replace root with new word
else if ( (*root)->frequency > minHeap->array[0]. frequency )
{
minHeap->array[ 0 ]. root->indexMinHeap = -1;
minHeap->array[ 0 ]. root = *root;
minHeap->array[ 0 ]. root->indexMinHeap = 0;
minHeap->array[ 0 ]. frequency = (*root)->frequency;
// delete previously allocated memory and
delete [] minHeap->array[ 0 ]. word;
minHeap->array[ 0 ]. word = new char [ strlen ( word ) + 1];
strcpy ( minHeap->array[ 0 ]. word, word );
minHeapify ( minHeap, 0 );
}
}
// Inserts a new word to both Trie and Heap
void insertUtil ( TrieNode** root, MinHeap* minHeap,
const char * word, const char * dupWord )
{
// Base Case
if ( *root == NULL )
*root = newTrieNode();
//  There are still more characters in word
if ( *word != ' ' )
insertUtil ( &((*root)->child[ tolower ( *word ) - 97 ]),
minHeap, word + 1, dupWord );
else // The complete word is processed
{
// word is already present, increase the frequency
if ( (*root)->isEnd )
++( (*root)->frequency );
else
{
(*root)->isEnd = 1;
(*root)->frequency = 1;
}
// Insert in min heap also
insertInMinHeap( minHeap, root, dupWord );
}
}
// add a word to Trie & min heap.  A wrapper over the insertUtil
void insertTrieAndHeap( const char *word, TrieNode** root, MinHeap* minHeap)
{
insertUtil( root, minHeap, word, word );
}
// A utility function to show results, The min heap
// contains k most frequent words so far, at any time
void displayMinHeap( MinHeap* minHeap )
{
int i;
// print top K word with frequency
for ( i = 0; i < minHeap->count; ++i )
{
printf ( "%s : %d", minHeap->array[i].word,
minHeap->array[i].frequency );
}
}
// The main function that takes a file as input, add words to heap
// and Trie, finally shows result from heap
void printKMostFreq( FILE * fp, int k )
{
// Create a Min Heap of Size k
MinHeap* minHeap = createMinHeap( k );
// Create an empty Trie
TrieNode* root = NULL;
// A buffer to store one word at a time
char buffer[MAX_WORD_SIZE];
// Read words one by one from file.  Insert the word in Trie and Min Heap
while ( fscanf ( fp, "%s", buffer ) != EOF )
insertTrieAndHeap(buffer, &root, minHeap);
// The Min Heap will have the k most frequent words, so print Min Heap nodes
displayMinHeap( minHeap );
}
// Driver program to test above functions
int main()
{
int k = 5;
FILE *fp = fopen ("file.txt", "r");
if (fp == NULL)
printf ("File doesn't exist ");
else
printKMostFreq (fp, k);
return 0;
}


输出:

your : 3well : 3and : 4to : 4Geeks : 6

以上输出适用于包含以下内容的文件。

Welcome to the world of Geeks This portal has been created to provide well written well thought and well explained solutions for selected questions If you like Geeks for Geeks and would like to contribute here is your chance You can write article and mail your article to contribute at geeksforgeeks org See your article appearing on the Geeks for Geeks main page and help thousands of other Geeks

如果您发现任何不正确的地方,或者您想分享有关上述主题的更多信息,请写下评论。

© 版权声明
THE END
喜欢就支持一下吧
点赞14 分享