Implementing English Spelling Correction in Python


Spelling Correction

  When using Word or other text editors, we often run into a spell-checking feature. For example, in Word:

[Figure: A misspelled word flagged in Word]

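The word "Chinab" has a red underline, which means it is misspelled, and the "Spelling" panel offers several candidate words to help the user fix it. So, can we implement this feature ourselves?
Why not?
For the ideas behind spelling correction, see Peter Norvig's well-known essay: http://norvig.com/spell-correct.html . The main concept involved is the edit distance between strings; readers can refer to the article "用动态规划算法计算字符串的编辑距离" (Computing the Edit Distance Between Strings with Dynamic Programming).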
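Since edit distance underlies everything that follows, here is a minimal dynamic-programming sketch of it. It is not taken from the article's repository. The function name `edit_distance` is an illustrative choice, and the variant shown is the "optimal string alignment" distance, which counts insertions, deletions, replacements, and swaps of adjacent letters as one edit each. In rare cases it can differ from the distance implied by applying single edits twice in a row.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b.

    One edit is an insertion, a deletion, a replacement,
    or a swap of two adjacent characters.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement or match
            # adjacent transposition, e.g. "ia" <-> "ai"
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


print(edit_distance("chinab", "china"))
print(edit_distance("monutaiyn", "mountain"))
```

Under this definition "chinab" is one edit away from "china" (delete the trailing "b"), and "monutaiyn" is two edits away from "mountain" (swap "nu", then delete "y").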

The Spelling Correction Algorithm

  First, we need a corpus, as almost every NLP task does. The corpus used for spelling correction is big.txt, which contains the following:

  • text from the Gutenberg corpus;
  • Wiktionary;
  • the list of most frequent words from the British National Corpus.

It can be downloaded from: https://github.com/percent4/-word-
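Next, we extract every English word from the corpus and count how often each one occurs. For a given English word (whether or not it is misspelled), we generate, in turn, the words at edit distance 0, 1, and 2 from it. These candidates are prioritized as follows: distance 0 (the word itself) > distance 1 > distance 2. The candidates are then judged on three criteria, applied in this order: whether the candidate appears in the corpus, its edit-distance priority, and how often it occurs in the corpus. The correction is the candidate that appears in the corpus, has the highest priority, and occurs most often. The result may of course be the input word itself, which means the word was spelled correctly.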
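The selection rule can be illustrated with a small sketch before looking at the full program. Everything in it is hypothetical: the counts in `toy_counts` are made up, the candidate tiers for the typo "speling" are a hand-written subset rather than the full candidate sets, and `pick` is an illustrative helper, not a function from the article's code.

```python
from collections import Counter

# Made-up word counts standing in for the real corpus statistics.
toy_counts = Counter({"spelling": 40, "spewing": 2,
                      "sealing": 30, "peeling": 300})


def pick(word, tiers, counts):
    """Return the most frequent known word from the nearest non-empty tier.

    tiers is a list of candidate sets ordered by edit distance (0, 1, 2).
    If no tier contains a known word, the input word is returned unchanged.
    """
    for tier in tiers:
        known = [w for w in tier if w in counts]
        if known:
            return max(known, key=counts.get)
    return word


# Hand-picked candidates for the typo "speling" (not exhaustive).
tiers = [
    {"speling"},              # distance 0: the word itself
    {"spelling", "spewing"},  # distance 1
    {"sealing", "peeling"},   # distance 2
]

print(pick("speling", tiers, toy_counts))
```

In this toy data "peeling" is the most frequent word overall, yet the sketch returns "spelling": a known word at distance 1 always beats any word at distance 2, and frequency only breaks ties within a tier.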

Python Implementation

  The complete Python code for spelling correction (spelling_correcter.py) is as follows:


        
# -*- coding: utf-8 -*-  
import re, collections  
  
def tokens(text):  
    """  
    Get all words from the corpus  
    """  
    return re.findall('[a-z]+', text.lower())  
  
with open('E://big.txt', 'r') as f:  
    WORDS = tokens(f.read())  
WORD_COUNTS = collections.Counter(WORDS)  
  
def known(words):  
    """  
    Return the subset of words that are actually  
    in our WORD_COUNTS dictionary.  
    """  
    return {w for w in words if w in WORD_COUNTS}  
  
  
def edits0(word):  
    """  
    Return all strings that are zero edits away  
    from the input word (i.e., the word itself).  
    """  
    return {word}  
  
  
def edits1(word):  
    """  
    Return all strings that are one edit away  
    from the input word.  
    """  
    alphabet = 'abcdefghijklmnopqrstuvwxyz'  
  
    def splits(word):  
        """  
        Return a list of all possible (first, rest) pairs  
        that the input word is made of.  
        """  
        return [(word[:i], word[i:]) for i in range(len(word) + 1)]  
  
    pairs = splits(word)  
    deletes = [a + b[1:] for (a, b) in pairs if b]  
    transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]  
    replaces = [a + c + b[1:] for (a, b) in pairs for c in alphabet if b]  
    inserts = [a + c + b for (a, b) in pairs for c in alphabet]  
    return set(deletes + transposes + replaces + inserts)  
  
  
def edits2(word):  
    """  
    Return all strings that are two edits away  
    from the input word.  
    """  
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}  
  
  
def correct(word):  
    """  
    Get the best correct spelling for the input word  
    """  
    # Priority is for edit distance 0, then 1, then 2  
    # else defaults to the input word itself.  
    candidates = (known(edits0(word)) or  
                  known(edits1(word)) or  
                  known(edits2(word)) or  
                  [word])  
    return max(candidates, key=WORD_COUNTS.get)  
  
  
def correct_match(match):  
    """  
    Spell-correct word in match,  
    and preserve proper upper/lower/title case.  
    """  
  
    word = match.group()  
  
    def case_of(text):  
        """  
        Return the case-function appropriate  
        for text: upper, lower, title, or just str.  
        """  
        return (str.upper if text.isupper() else  
                str.lower if text.islower() else  
                str.title if text.istitle() else  
                str)  
  
    return case_of(word)(correct(word.lower()))  
  
  
def correct_text_generic(text):  
    """  
    Correct all the words within a text,  
    returning the corrected text.  
    """  
    return re.sub('[a-zA-Z]+', correct_match, text)  
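To get a feel for how many candidates a single edit generates, the sketch below repeats the four list comprehensions from `edits1` in a standalone helper and counts each group. The helper name `edit1_groups` is an illustrative assumption and is not part of spelling_correcter.py. For a word of length n there are n deletions, n - 1 transpositions, 26n replacements, and 26(n + 1) insertions, or 54n + 25 strings before duplicates are removed.

```python
import string


def edit1_groups(word):
    """Return the four lists of one-edit strings, duplicates included."""
    letters = string.ascii_lowercase
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in pairs if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in pairs if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in pairs if b for c in letters]
    inserts = [a + c + b for a, b in pairs for c in letters]
    return deletes, transposes, replaces, inserts


word = "somthing"
n = len(word)
groups = edit1_groups(word)
sizes = [len(g) for g in groups]
unique = set().union(*groups)   # what edits1 would return

print(sizes)                    # n, n - 1, 26 * n, 26 * (n + 1)
print(sum(sizes) == 54 * n + 25)
print(len(unique) <= sum(sizes))
print("something" in unique)
```

Because `edits2` applies the same expansion to every string in that set, the distance-2 candidate set is far larger, which is why `known()` filters each tier down to words that actually occur in the corpus.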

    

Testing

  With the spelling corrector above in place, we can now test it on some words and sentences:


        
original_word_list = ['fianlly', 'castel', 'case', 'monutaiyn', 'foresta',
                      'helloa', 'forteen', 'persreve', 'kisss', 'forteen helloa',
                      'phons forteen Doora. This is from Chinab.']
  
for original_word in original_word_list:  
    correct_word = correct_text_generic(original_word)  
    print('Original word: %s\nCorrect word: %s'%(original_word, correct_word))  

    

The output is as follows:

Original word: fianlly
Correct word: finally
Original word: castel
Correct word: castle
Original word: case
Correct word: case
Original word: monutaiyn
Correct word: mountain
Original word: foresta
Correct word: forest
Original word: helloa
Correct word: hello
Original word: forteen
Correct word: fourteen
Original word: persreve
Correct word: preserve
Original word: kisss
Correct word: kiss
Original word: forteen helloa
Correct word: fourteen hello
Original word: phons forteen Doora. This is from Chinab.
Correct word: peons fourteen Door. This is from China.

  Next, we test on the following Word document (Spelling Error.docx), which can be downloaded from https://github.com/percent4/-word- :

[Figure: The Word document containing spelling errors]

The Python code that corrects the spelling in this document is as follows:


        
from docx import Document  
from nltk import sent_tokenize, word_tokenize  
from spelling_correcter import correct_text_generic  
from docx.shared import RGBColor  
  
# Number of words changed in the document  
COUNT_CORRECT = 0  
# Open the source document  
file = Document("E://Spelling Error.docx")  
  
#print("Number of paragraphs: "+str(len(file.paragraphs)))  
  
punkt_list = r",.?\"'!()/\\-<>:@#$%^&*~"  
  
document = Document()   # handle for the output Word document  
  
def write_correct_paragraph(i):  
    global COUNT_CORRECT  
  
    # Text of the current paragraph  
    paragraph = file.paragraphs[i].text.strip()  
    # Split the paragraph into sentences  
    sentences = sent_tokenize(text=paragraph)  
    # Split each sentence into word tokens  
    words_list = [word_tokenize(sentence) for sentence in sentences]  
  
    p = document.add_paragraph(' '*7)  # handle for the new paragraph  
  
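    for sent_idx, word_list in enumerate(words_list):  
        for word_idx, word in enumerate(word_list):  
            # First word of the paragraph: capitalize its first letter and prepend a space  
            # (positions come from enumerate, so a repeated word is not mistaken for the first one)  
            if word_idx == 0 and sent_idx == 0:  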
                if word not in punkt_list:  
                    p.add_run(' ')  
                    # Correct the word; a correctly spelled word comes back unchanged  
                    correct_word = correct_text_generic(word)  
                    # Color the word red if it was changed  
                    if correct_word != word:  
                        colored_word = p.add_run(correct_word[0].upper()+correct_word[1:])  
                        font = colored_word.font  
                        font.color.rgb = RGBColor(0xFF, 0x00, 0x00)  
                        COUNT_CORRECT += 1  
                    else:  
                        p.add_run(correct_word[0].upper() + correct_word[1:])  
                else:  
                    p.add_run(word)  
            else:  
                p.add_run(' ')  
                # Correct the word; a correctly spelled word comes back unchanged  
                correct_word = correct_text_generic(word)  
                if word not in punkt_list:  
                    # Color the word red if it was changed  
                    if correct_word != word:  
                        colored_word = p.add_run(correct_word)  
                        font = colored_word.font  
                        font.color.rgb = RGBColor(0xFF, 0x00, 0x00)  
                        COUNT_CORRECT += 1  
                    else:  
                        p.add_run(correct_word)  
                else:  
                    p.add_run(word)  
  
for i in range(len(file.paragraphs)):  
    write_correct_paragraph(i)  
  
document.save('E://correct_document.docx')  
  
print('Document corrected and saved!')  
print('%d corrections made in total.'%COUNT_CORRECT)  

    

The output is as follows:

Document corrected and saved!
19 corrections made in total.

The corrected Word document looks like this:

[Figure: The Word document after spelling correction]

The words in red are those that were misspelled in the original document and have been corrected; 19 corrections were made in total.
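The idea of tracking which words were changed can also be tried without Word or NLTK. The sketch below is an alternative, not the article's method. It splits text with a regular expression so that punctuation and spacing are preserved, and records a flag for every segment that the corrector altered. The helper `mark_corrections` and the lookup table `demo_fixes` are hypothetical stand-ins. In practice `correct_text_generic` from spelling_correcter.py would be passed in as the corrector.

```python
import re

WORD = re.compile(r'[A-Za-z]+')


def mark_corrections(text, corrector):
    """Return (segment, changed) pairs covering the whole text.

    Words are passed through corrector; everything between words
    (spaces, punctuation) is kept as-is and never marked as changed.
    """
    segments = []
    # A capturing group makes re.split keep the words as well as the gaps.
    for piece in re.split(r'([A-Za-z]+)', text):
        if not piece:
            continue
        if WORD.fullmatch(piece):
            fixed = corrector(piece)
            segments.append((fixed, fixed != piece))
        else:
            segments.append((piece, False))
    return segments


# Hypothetical corrector for the demo: a fixed lookup table.
demo_fixes = {"Chinab": "China", "forteen": "fourteen"}
segments = mark_corrections("This is from Chinab, forteen days.",
                            lambda w: demo_fixes.get(w, w))

changed = [seg for seg, flag in segments if flag]
print(changed)
print("".join(seg for seg, _ in segments))
```

Each flagged segment could then be written as a red run and every other segment as a plain run, with the number of flagged segments serving as the correction count.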

Summary

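  Spelling correction is not as hard to implement as one might imagine, but it is not trivial either.
The code and documents for this article have been uploaded to GitHub: https://github.com/percent4/-word-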
