Readability Index in Python (NLP)

Readability is the ease with which a reader can understand a written text. In natural language, the readability of a text depends on its content (the complexity of its vocabulary and syntax). It concerns the words we choose and how we arrange them into sentences and paragraphs so that readers can comprehend them.


The main purpose of writing is to convey information that both the author and the reader consider valuable. If we fail to convey that information, our effort is wasted. To engage readers, it is important to present information in a way that they enjoy reading and can clearly understand. Content should therefore be as easy to read as the subject allows. Various difficulty scales exist, each with its own formula for determining difficulty.

This article illustrates the various traditional readability formulas available for evaluating readability scores. In Natural Language Processing it is sometimes necessary to analyse words and sentences to determine the difficulty of a text. A readability score is usually a grade-level score that rates a text according to its difficulty. It helps a writer improve a text so that more readers can understand it, which in turn makes the content more engaging.

Various methods/formulas available for determining readability scores:

  1. The Dale–Chall formula
  2. The Gunning fog formula
  3. The Fry readability graph
  4. McLaughlin's SMOG formula
  5. The FORCAST formula
  6. Readability and newspaper readership
  7. Flesch scores

Read more about readability formulas here.

Implementations of the readability formulas are shown below.

The Dale–Chall formula:

To apply the formula:

  1. Select several 100-word samples throughout the text.
  2. Compute the average sentence length in words (divide the number of words by the number of sentences).
  3. Compute the percentage of words NOT on the Dale–Chall word list of 3,000 easy words.
  4. Compute the equation below.

 Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
 Here,
 PDW = Percentage of difficult words not on the Dale–Chall word list
 ASL = Average sentence length
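As a quick sanity check, the raw score can be computed directly from the two inputs. The helper name and the sample values below are hypothetical, not taken from any real text:

```python
def dale_chall_raw_score(pdw, asl):
    # Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    return 0.1579 * pdw + 0.0496 * asl + 3.6365

# Hypothetical sample: 6% difficult words, 10 words per sentence on average
print(round(dale_chall_raw_score(6.0, 10.0), 2))  # → 5.08
```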

The Gunning fog formula

 Grade level = 0.4 * ( (average sentence length) + (percentage of hard words) )
 Here,
 Hard words = words with more than two syllables.
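Plugging hypothetical sample values into the formula (the function name and numbers below are illustrative only):

```python
def gunning_fog_grade(avg_sentence_len, pct_hard_words):
    # Grade level = 0.4 * (average sentence length + percentage of hard words)
    return 0.4 * (avg_sentence_len + pct_hard_words)

# Hypothetical sample: 14-word average sentences, 6% hard words
print(gunning_fog_grade(14.0, 6.0))  # → 8.0
```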

The SMOG formula

 SMOG grading = 3 + √(polysyllable count)
 Here,
 polysyllable count = number of words of more than two syllables in a sample of 30 sentences.
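The simple SMOG grading above can be sketched as follows; the polysyllable count used here is a made-up sample value:

```python
import math

def smog_grade(polysyllable_count):
    # SMOG grading = 3 + sqrt(polysyllable count)
    return 3 + math.sqrt(polysyllable_count)

# Hypothetical sample: 30 polysyllabic words in 30 sentences
print(round(smog_grade(30), 2))  # → 8.48
```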

The Flesch formula

 Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
 Here,
 ASL = average sentence length (number of words divided by number of sentences)
 ASW = average word length in syllables (number of syllables divided by number of words)
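A worked example of the Reading Ease computation, using hypothetical ASL and ASW values (higher scores mean easier text):

```python
def flesch_reading_ease_score(asl, asw):
    # Reading Ease = 206.835 - (1.015 * ASL) - (84.6 * ASW)
    return 206.835 - (1.015 * asl) - (84.6 * asw)

# Hypothetical sample: 15 words per sentence, 1.5 syllables per word
print(round(flesch_reading_ease_score(15.0, 1.5), 2))  # → 64.71
```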

Advantages of readability formulas:

  1. Readability formulas measure the grade level a reader must have to read a given text, providing the author with the information needed to reach the target audience.
  2. You can know in advance whether your target audience will be able to understand your content.
  3. They are easy to use.
  4. Readable text attracts a larger audience.

Drawbacks of readability formulas:

  1. Because there are many readability formulas, the chance of wide variation in results for the same text increases.
  2. They apply mathematics to literature, which is not always a good idea.
  3. They cannot measure the complexity of a word or phrase to pinpoint where corrections are needed.

Python implementation (requires the spacy and textstat packages, with the en_core_web_sm model installed):

import spacy
from textstat.textstat import textstatistics, legacy_round


def break_sentences(text):
    # Splits the text into sentences, using
    # spaCy's sentence segmentation
    nlp = spacy.load('en_core_web_sm')
    doc = nlp(text)
    return list(doc.sents)


def word_count(text):
    # Returns the number of words in the text
    sentences = break_sentences(text)
    words = 0
    for sentence in sentences:
        words += len([token for token in sentence])
    return words


def sentence_count(text):
    # Returns the number of sentences in the text
    sentences = break_sentences(text)
    return len(sentences)


def avg_sentence_length(text):
    # Returns the average sentence length in words
    words = word_count(text)
    sentences = sentence_count(text)
    average_sentence_length = float(words / sentences)
    return average_sentence_length


def syllables_count(word):
    # Textstat is a Python package used to calculate statistics from
    # text to determine readability, complexity and grade level
    # of a particular corpus.
    # Package can be found at https://pypi.python.org/pypi/textstat
    return textstatistics().syllable_count(word)


def avg_syllables_per_word(text):
    # Returns the average number of syllables per word in the text
    syllable = syllables_count(text)
    words = word_count(text)
    ASPW = float(syllable) / float(words)
    return legacy_round(ASPW, 1)


def difficult_words(text):
    # Returns the number of distinct difficult words in the text.
    # Difficult words are those with two or more syllables
    # that are not in spaCy's stop-word list.
    nlp = spacy.load('en_core_web_sm')
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [str(token) for token in sentence]
    diff_words_set = set()
    for word in words:
        syllable_count = syllables_count(word)
        if word not in nlp.Defaults.stop_words and syllable_count >= 2:
            diff_words_set.add(word)
    return len(diff_words_set)


def poly_syllable_count(text):
    # A word is polysyllabic if it has three or more syllables;
    # this function returns the number of all such words
    # present in the text
    count = 0
    words = []
    sentences = break_sentences(text)
    for sentence in sentences:
        words += [token for token in sentence]
    for word in words:
        syllable_count = syllables_count(word)
        if syllable_count >= 3:
            count += 1
    return count


def flesch_reading_ease(text):
    """
    Implements the Flesch formula:
    Reading Ease score = 206.835 - (1.015 × ASL) - (84.6 × ASW)
    Here,
    ASL = average sentence length (number of words
          divided by number of sentences)
    ASW = average word length in syllables (number of syllables
          divided by number of words)
    """
    FRE = 206.835 - float(1.015 * avg_sentence_length(text)) - \
          float(84.6 * avg_syllables_per_word(text))
    return legacy_round(FRE, 2)


def gunning_fog(text):
    per_diff_words = (difficult_words(text) / word_count(text) * 100) + 5
    grade = 0.4 * (avg_sentence_length(text) + per_diff_words)
    return grade


def smog_index(text):
    """
    Implements the SMOG formula / grading:
    SMOG grading = 3 + √(polysyllable count)
    Here,
    polysyllable count = number of words of more
    than two syllables in a sample of 30 sentences.
    """
    if sentence_count(text) >= 3:
        poly_syllab = poly_syllable_count(text)
        SMOG = (1.043 * (30 * (poly_syllab / sentence_count(text))) ** 0.5) \
               + 3.1291
        return legacy_round(SMOG, 1)
    else:
        return 0


def dale_chall_readability_score(text):
    """
    Implements the Dale–Chall formula:
    Raw score = 0.1579*(PDW) + 0.0496*(ASL) + 3.6365
    Here,
    PDW = Percentage of difficult words
    ASL = Average sentence length
    """
    words = word_count(text)
    # Number of words not classed as difficult
    count = words - difficult_words(text)
    if words > 0:
        # Percentage of words not on the difficult-word list
        per = float(count) / float(words) * 100
        # diff_words stores the percentage of difficult words
        diff_words = 100 - per
        raw_score = (0.1579 * diff_words) + \
                    (0.0496 * avg_sentence_length(text))
        # If the percentage of difficult words is greater than 5%, then
        # Adjusted score = Raw score + 3.6365;
        # otherwise Adjusted score = Raw score
        if diff_words > 5:
            raw_score += 3.6365
        return legacy_round(raw_score, 2)


Source: https://en.wikipedia.org/wiki/Readability
