Readability Index in Python (NLP)

Readability is the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content (the complexity of its vocabulary and syntax). It concerns the words we choose and how we arrange them into sentences and paragraphs so that readers can comprehend them.


The main purpose of writing is to convey information that both the author and the reader consider valuable. If we fail to convey that information, our effort is wasted. To engage readers, it is important to present information in a way that they enjoy reading and can clearly understand. Content should therefore be as easy to read as the subject allows. Various difficulty scales exist, each with its own formula for determining difficulty.

This article illustrates the various traditional readability formulas available for evaluating readability scores. In Natural Language Processing it is sometimes necessary to analyse words and sentences to determine the difficulty of a text. A readability score is usually a grade-level score that rates a text according to its difficulty. It helps a writer improve a text so that more readers can understand it, which in turn makes the content more engaging.

Various methods/formulas available for determining readability scores:

  1. The Dale–Chall formula
  2. The Gunning fog formula
  3. The Fry readability graph
  4. McLaughlin's SMOG formula
  5. The FORCAST formula
  6. Readability and newspaper readership
  7. Flesch scores

Read more about readability formulas here.

Implementations of the readability formulas are shown below.

The Dale–Chall formula:

To apply the formula:

  1. Select several 100-word samples throughout the text.
  2. Compute the average sentence length in words (divide the number of words by the number of sentences).
  3. Compute the percentage of words NOT on the Dale–Chall word list of 3,000 easy words.
  4. Compute the equation below.

 Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
 Here,
 PDW = Percentage of difficult words not on the Dale–Chall word list
 ASL = Average sentence length
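As a quick sanity check, the raw score can be computed directly from the two inputs. The helper name and the sample values below are hypothetical, not taken from any real text:

```python
def dale_chall_raw_score(pdw, asl):
    # Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    return 0.1579 * pdw + 0.0496 * asl + 3.6365

# Hypothetical sample: 6% difficult words, 10 words per sentence on average
print(round(dale_chall_raw_score(6.0, 10.0), 2))  # → 5.08
```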

The Gunning fog formula

 Grade level = 0.4 * ( (average sentence length) + (percentage of hard words) )
 Here,
 Hard words = words with more than two syllables.
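Plugging hypothetical sample values into the formula (the function name and numbers below are illustrative only):

```python
def gunning_fog_grade(avg_sentence_len, pct_hard_words):
    # Grade level = 0.4 * (average sentence length + percentage of hard words)
    return 0.4 * (avg_sentence_len + pct_hard_words)

# Hypothetical sample: 14-word average sentences, 6% hard words
print(gunning_fog_grade(14.0, 6.0))  # → 8.0
```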

The SMOG formula

 SMOG grading = 3 + √(polysyllable count)
 Here,
 polysyllable count = number of words of more than two syllables in a sample of 30 sentences.
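The simple SMOG grading above can be sketched as follows; the polysyllable count used here is a made-up sample value:

```python
import math

def smog_grade(polysyllable_count):
    # SMOG grading = 3 + sqrt(polysyllable count)
    return 3 + math.sqrt(polysyllable_count)

# Hypothetical sample: 30 polysyllabic words in 30 sentences
print(round(smog_grade(30), 2))  # → 8.48
```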

The Flesch formula

 Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
 Here,
 ASL = average sentence length (number of words divided by number of sentences)
 ASW = average word length in syllables (number of syllables divided by number of words)
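A worked example of the Reading Ease computation, using hypothetical ASL and ASW values (higher scores mean easier text):

```python
def flesch_reading_ease_score(asl, asw):
    # Reading Ease = 206.835 - (1.015 * ASL) - (84.6 * ASW)
    return 206.835 - (1.015 * asl) - (84.6 * asw)

# Hypothetical sample: 15 words per sentence, 1.5 syllables per word
print(round(flesch_reading_ease_score(15.0, 1.5), 2))  # → 64.71
```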

Advantages of readability formulas:

  1. Readability formulas measure the grade level a reader must have to read a given text, providing the author with the information needed to reach the target audience.
  2. You can know in advance whether your target audience will be able to understand your content.
  3. They are easy to use.
  4. Readable text attracts a larger audience.

Drawbacks of readability formulas:

  1. Because there are many readability formulas, the chance of wide variation in results for the same text increases.
  2. They apply mathematics to literature, which is not always a good idea.
  3. They cannot measure the complexity of a word or phrase to pinpoint where corrections are needed.

Python implementation (requires the spacy and textstat packages, with the en_core_web_sm model installed):

import spacy
from textstat.textstat import textstatistics, legacy_round


def break_sentences(text):
    # Splits the text into sentences, using
    # spaCy's sentence segmentation
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)


def word_count(text):
    # Returns the number of words in the text
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words


def sentence_count(text):
    # Returns the number of sentences in the text
    sentences = break_sentences(text)
    return len(sentences)


def avg_sentence_length(text):
    # Returns the average sentence length in words
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length


def syllables_count(word):
    # Textstat is a Python package used to calculate statistics from
    # text to determine readability, complexity and grade level
    # of a particular corpus.
    # Package can be found at https://pypi.python.org/pypi/textstat
    return textstatistics().syllable_count(word)


def avg_syllables_per_word(text):
    # Returns the average number of syllables per word in the text
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)


def difficult_words(text):
    # Returns the number of distinct difficult words in the text.
    # Difficult words are those with two or more syllables
    # that are not in spaCy's stop-word list.
    nlp = spacy.load('en_core_web_sm')
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in nlp.Defaults.stop_words and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)


def poly_syllable_count(text):
    # A word is polysyllabic if it has three or more syllables;
    # this function returns the number of all such words
    # present in the text
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            count += 1
    return count


def flesch_reading_ease(text):
    """
    Implements the Flesch formula:
    Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
    Here,
    ASL = average sentence length (number of words
          divided by number of sentences)
    ASW = average word length in syllables (number of syllables
          divided by number of words)
    """
    FRE = 206.835 - float(1.015 * avg_sentence_length(text)) - \
          float(84.6 * avg_syllables_per_word(text))
    return legacy_round(FRE, 2)


def gunning_fog(text):
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade


def smog_index(text):
    """
    Implements the SMOG formula / grading:
    SMOG grading = 3 + √(polysyllable count)
    Here,
    polysyllable count = number of words of more
    than two syllables in a sample of 30 sentences.
    """
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5) \
               + 3.1291
        return legacy_round(SMOG, 1)
    else:
        return 0


def dale_chall_readability_score(text):
    """
    Implements the Dale–Chall formula:
    Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    Here,
    PDW = Percentage of difficult words
    ASL = Average sentence length
    """
    words = word_count(text)
    # Number of words not classed as difficult
    count = words - difficult_words(text)
    if words > 0:
        # Percentage of words not on the difficult-word list
        per = float(count) / float(words) * 100
        # diff_words stores the percentage of difficult words
        diff_words = 100 - per
        raw_score = (0.1579 * diff_words) + \
                    (0.0496 * avg_sentence_length(text))
        # If the percentage of difficult words is greater than 5%, then
        # Adjusted score = Raw score + 3.6365;
        # otherwise Adjusted score = Raw score
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)


Source: https://en.wikipedia.org/wiki/Readability
