Readability is the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content (the complexity of its vocabulary and syntax). It concerns the words we choose and how we put them into sentences and paragraphs so that readers can comprehend them.
The main purpose of writing is to convey information that both the author and the reader find valuable. If we cannot communicate that information, our effort is wasted. To engage readers, it is important to present information in a way that makes them want to keep reading and lets them understand it clearly. The content should therefore be as easy to read as possible. Various difficulty scales are available, each with its own formula for determining difficulty.
This article describes various traditional readability formulas that can be used to assess readability scores. In natural language processing, it is sometimes necessary to analyze words and sentences to determine the difficulty of a text. A readability score is usually a grade-level score that rates a text according to its difficulty. It helps authors improve a text so that more readers can understand it, making the content more engaging.
Various methods/formulas available for determining readability scores:
- The Dale-Chall formula
- The Gunning Fog formula
- The Fry readability graph
- McLaughlin's SMOG formula
- The FORCAST formula
- Readability and newspaper readership
- The Flesch score
Implementations of the readability formulas are shown below.
The Dale-Chall formula:
To apply the formula:
- Select several 100-word samples throughout the text.
- Compute the average sentence length in words (number of words divided by number of sentences).
- Compute the percentage of words NOT on the Dale-Chall list of 3,000 easy words.
- Compute the equation below.
Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
Here,
PDW = percentage of difficult words not on the Dale-Chall word list
ASL = average sentence length
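As an illustration with made-up numbers: a sample with PDW = 8 and ASL = 14 gives a raw score of 0.1579×8 + 0.0496×14 + 3.6365 ≈ 5.59.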
The Gunning Fog formula
Grade level = 0.4 * ((average sentence length) + (percentage of hard words))
Here, hard words = words with more than two syllables.
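As an illustration with made-up numbers: a sample with an average sentence length of 14 words, of which 10% are hard words, gives Grade level = 0.4 × (14 + 10) = 9.6, i.e. roughly a tenth-grade reading level.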
The SMOG formula
SMOG grading = 3 + √(polysyllable count)
Here, polysyllable count = number of words of more than two syllables in a sample of 30 sentences.
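As an illustration with made-up numbers: a 30-sentence sample containing 25 polysyllabic words gives SMOG grading = 3 + √25 = 8.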
The Flesch formula
Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
Here,
ASL = average sentence length (number of words divided by number of sentences)
ASW = average word length in syllables (number of syllables divided by number of words)
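As an illustration with made-up numbers: a text with ASL = 14 and ASW = 1.5 gives Reading Ease score = 206.835 - 1.015×14 - 84.6×1.5 ≈ 65.7, which falls in the 60-70 band usually read as plain English.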
Advantages of readability formulas:
1. Readability formulas measure the grade level a reader must have to be able to read a given text, and so provide the author with the information needed to reach the target audience.
2. You know in advance whether your target audience can understand your content.
3. They are easy to use.
4. Readable text attracts a larger audience.
Disadvantages of readability formulas:
1. Because there are many readability formulas, there is a growing chance of wide discrepancies between the results for the same text.
2. They apply mathematics to literature, which is not always a good idea.
3. They cannot measure the complexity of a word or phrase to pinpoint where a correction is needed.
```python
import spacy
from textstat.textstat import textstatistics, legacy_round


# Splits the text into sentences, using spaCy's sentence
# segmentation, which is described at https://spacy.io/usage/spacy-101
def break_sentences(text):
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)


# Returns the number of words in the text
def word_count(text):
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words


# Returns the number of sentences in the text
def sentence_count(text):
    sentences = break_sentences(text)
    return len(sentences)


# Returns the average sentence length
def avg_sentence_length(text):
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length


# Textstat is a Python package used to calculate statistics from
# text in order to determine readability, complexity and grade
# level of a particular corpus.
# The package can be found at https://pypi.python.org/pypi/textstat
def syllables_count(word):
    return textstatistics().syllable_count(word)


# Returns the average number of syllables per word in the text
def avg_syllables_per_word(text):
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)


# Returns the total number of difficult words in a text
def difficult_words(text):
    nlp = spacy.load('en_core_web_sm')

    # Find all words in the text
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]

    # Difficult words are those with 2 or more syllables
    # that are not common stop words
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in nlp.Defaults.stop_words and syllable_count >= 2:
            diff_words_set.add(word)

    return len(diff_words_set)


# A word is polysyllabic if it has 3 or more syllables;
# this function returns the number of all such words
# present in the text
def poly_syllable_count(text):
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]

    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            count += 1
    return count


def flesch_reading_ease(text):
    """
    Implements the Flesch formula:
    Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
    Here,
      ASL = average sentence length (number of words
            divided by number of sentences)
      ASW = average word length in syllables (number of
            syllables divided by number of words)
    """
    FRE = (206.835 - float(1.015 * avg_sentence_length(text))
           - float(84.6 * avg_syllables_per_word(text)))
    return legacy_round(FRE, 2)


def gunning_fog(text):
    # Percentage of difficult words (offset by a constant 5,
    # as in the original implementation)
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade


def smog_index(text):
    """
    Implements the SMOG formula/grading:
    SMOG grading = 3 + √(polysyllable count)
    Here, polysyllable count = number of words of more than
    two syllables in a sample of 30 sentences.
    """
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5
                + 3.1291)
        return legacy_round(SMOG, 1)
    else:
        return 0


def dale_chall_readability_score(text):
    """
    Implements the Dale-Chall formula:
    Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    Here,
      PDW = percentage of difficult words
      ASL = average sentence length
    """
    words = word_count(text)
    # Number of words not termed as difficult words
    count = words - difficult_words(text)
    if words > 0:
        # Percentage of words not on the difficult word list
        per = float(count) / float(words) * 100

        # diff_words stores the percentage of difficult words
        diff_words = 100 - per

        raw_score = (0.1579 * diff_words) + (0.0496 * avg_sentence_length(text))

        # If the percentage of difficult words is greater than 5%, then:
        # Adjusted Score = Raw Score + 3.6365,
        # otherwise Adjusted Score = Raw Score
        if diff_words > 5:
            raw_score += 3.6365

        return legacy_round(raw_score, 2)
    # Guard against empty input
    return 0
```
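A minimal usage sketch for the functions above (the sample text is made up for illustration; exact scores depend on the installed spaCy model and textstat version):

```python
# Illustrative input; any English text with a few sentences works.
text = ("Readability is the ease with which a reader can understand "
        "a written text. It depends on the words we choose and on how "
        "we arrange them into sentences. Shorter sentences built from "
        "common words are generally easier to read.")

print("Flesch Reading Ease:", flesch_reading_ease(text))
print("Gunning Fog Index:", gunning_fog(text))
print("SMOG Index:", smog_index(text))
print("Dale-Chall Score:", dale_chall_readability_score(text))
```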
Source: https://en.wikipedia.org/wiki/Readability