2008/06/04: O. Yuanying

Writing a Spam Filter in Python / Implementing the Tokenizer

Implementation

Generating the regular expression for token components

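The plan is to write a regular expression that extracts runs of characters of the same script, based on Blocks.txt and LineBreak.txt from Unicode.org. Doing that by hand is tedious, and although I could write a script to automate it, that feels like a detour from the topic of this series, so I'm simply lifting the regular expression generated by wakeru (from RAA) as-is.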

The license is listed as Ruby's, so that should be fine!

So the pgspam.blocks module ends up looking like this.

It was generated with a copy of wakeru's make_block_src.rb, modified to emit a Python regular expression module.
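As a rough idea of the shape such a module could take, here is a hand-written sketch. It is not wakeru's output: the handful of character classes below are my own picks (the hiragana, katakana and CJK ranges are the block ranges listed in Blocks.txt), and treating ASCII letters and digits as separate classes is an assumption. The real generated pattern covers far more blocks.

```python
import re

# Hand-picked character classes (an illustrative subset, not the generated one).
_CLASSES = [
    u'[A-Za-z]+',          # ASCII letters
    u'[0-9]+',             # ASCII digits (kept separate here; an assumption)
    u'[\u3040-\u309F]+',   # Hiragana block
    u'[\u30A0-\u30FF]+',   # Katakana block
    u'[\u4E00-\u9FFF]+',   # CJK Unified Ideographs block
]

# One big alternation: each match is a run of characters from a single class.
re_block = re.compile(u'|'.join(_CLASSES))

tokens = re_block.findall(u'spam スパム すぱむ')
# tokens == [u'spam', u'スパム', u'すぱむ']
```

Because every alternative is a single character class followed by `+`, a match can never straddle two scripts, which is the property the tokenizer relies on.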

The tests for pgspam.blocks look like this.


#!/usr/bin/env python
# encoding: utf-8
"""
blocksTest.py

Created by Yuanying on 2008-06-04.
Copyright (c) 2008 fraction.jp. All rights reserved.
"""
import os
import sys

# Make the project root importable when this test file is run directly.
script_dir = os.path.dirname(os.path.abspath(__file__))
base_dir = os.path.join(script_dir, '..')
if base_dir not in sys.path:
    sys.path.insert(0, base_dir)

import unittest
import pgspam.blocks

class BlocksTest(unittest.TestCase):
    def test_alphabet_search(self):
        """Finds the first run of letters after leading whitespace."""
        m = pgspam.blocks.re_block.search(u' aaa wewe 21')
        self.assertEqual(u'aaa', m.group())

    def test_alphabet_search2(self):
        """Skips symbols and finds the run of letters inside the brackets."""
        m = pgspam.blocks.re_block.search(u'*&[aaa]')
        self.assertEqual(u'aaa', m.group())

    def test_hiragana_search2(self):
        """Skips symbols and finds the run of hiragana inside the brackets."""
        m = pgspam.blocks.re_block.search(u'*&[あああ]')
        self.assertEqual(u'あああ', m.group())

    
if __name__ == '__main__':
    unittest.main()

Let's run it.

$ python pgspam-project/test/blocksTest.py 
...
----------------------------------------------------------------------
Ran 3 tests in 0.000s

OK

It picks out the Japanese properly too! Excellent.

Implementing the tokenizer

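Using the generated regular expression, I'll implement a function that tokenizes a string by treating each run of consecutive same-script characters as one token.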

Python's regular expression objects have a findall method, so maybe I can just use that as-is?


import pgspam.blocks

def execute(words):
    """Return every run of same-script characters found in words."""
    return pgspam.blocks.re_block.findall(words)
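For reference, when a pattern contains no capturing groups, findall returns the matched strings as a plain list. The small sketch below uses a stand-in pattern, since the generated pgspam.blocks.re_block is not shown here; the stand-in (runs of ASCII letters or runs of hiragana) is my assumption.

```python
import re

# Stand-in for pgspam.blocks.re_block (an assumption, not the generated pattern):
# runs of ASCII letters, or runs of hiragana.
re_block = re.compile(u'[A-Za-z]+|[\u3040-\u309F]+')

tokens = re_block.findall(u'a word, あああ')
# tokens == [u'a', u'word', u'あああ']
```

Note that the one-character match `a` is returned along with everything else; findall offers no hook for filtering or modifying individual matches.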

Let's run the tests.

[yuanying@Magnus] ~/Projects/python
$ python pgspam-project/test/tokenizerTest.py
E.F..
======================================================================
ERROR: docstring for test_append_prefix_to_token
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pgspam-project/test/tokenizerTest.py", line 36, in test_append_prefix_to_token
    rtn = pgspam.tokenizer.execute(u'/', u'url*')
TypeError: execute() takes exactly 1 argument (2 given)

======================================================================
FAIL: docstring for test_ignore_only_one_character
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pgspam-project/test/tokenizerTest.py", line 32, in test_ignore_only_one_character
    assert [u'word', u'word4'] == rtn
AssertionError

----------------------------------------------------------------------
Ran 5 tests in 0.001s

FAILED (failures=1, errors=1)

Ah... I forgot two parts of the spec: single-character tokens are ignored, and a prefix gets prepended.

Each token matched by the regular expression now has to be checked and modified, so I switched from findall to an iterator.
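The practical difference is that finditer yields match objects rather than strings, so each match can be inspected before deciding what to do with it. A minimal sketch, again with a stand-in pattern that is my assumption rather than the generated one:

```python
import re

# Stand-in for pgspam.blocks.re_block (an assumption): runs of ASCII letters.
re_block = re.compile(u'[A-Za-z]+')

# Each match object exposes its position and its text.
spans = [(m.start(), m.end(), m.group(0)) for m in re_block.finditer(u'a word')]
# spans == [(0, 1, u'a'), (2, 6, u'word')]
```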

My first iterator. Exciting.


import pgspam.blocks

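def execute(words, prefix=''):
    """Return the tokens in words longer than one character, each with prefix prepended."""
    rtn = []
    for m in pgspam.blocks.re_block.finditer(words):
        token = m.group(0)
        # Single-character tokens are ignored.
        if len(token) > 1:
            rtn.append(prefix + token)
    return rtn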

That's the fix.

Side note: having to write append just to add to a list is a hassle.
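One way around the explicit append would be a list comprehension. The sketch below is an alternative spelling of the same logic, not the code used in the project, and it substitutes a stand-in pattern (my assumption) for pgspam.blocks.re_block so that it runs on its own.

```python
import re

# Stand-in for pgspam.blocks.re_block (an assumption, not the generated pattern):
# runs of ASCII letters, or runs of hiragana.
re_block = re.compile(u'[A-Za-z]+|[\u3040-\u309F]+')

def execute(words, prefix=u''):
    """Return tokens longer than one character, each with prefix prepended."""
    return [prefix + m.group(0)
            for m in re_block.finditer(words)
            if len(m.group(0)) > 1]

tokens = execute(u'a word, あああ', u'url*')
# tokens == [u'url*word', u'url*あああ']
```

The filter and the prefixing both fit in one expression, at the cost of calling m.group(0) twice per match.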

[yuanying@Magnus] ~/Projects/python
$ python pgspam-project/test/tokenizerTest.py
.....
----------------------------------------------------------------------
Ran 5 tests in 0.001s

OK

The tests pass with flying colors! (Never mind that the tests themselves had a bug in them too.)

The project source code so far.