본문 바로가기
정규표현식

[ python ] 정규표현식[5]

by fiasco 2022. 10. 31.
Mastering Python Regular Expressions

http://www.packtpub.com/


Authors : Félix López 

제5장  Performance of Regular Expressions

 

Benchmarking regular expressions with Python 

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

# alternation함수
def alternation(text):
    pat = re.compile('spa(in|niard)')
    pat.search(text)

if __name__ == "__main__":
    test(alternation, "spain")  # The function alternation lasted: 0.000000000000 ms
    cProfile.run("alternation('spaniard')")

The RegexBuddy tool 

RegexBuddy Home : https://www.regexbuddy.com/buynow.html

Regular-Expressions.info : https://www.regular-expressions.info/

무료 다운로드 : https://download.cnet.com/RegexBuddy/3000-2352_4-10309844.html

 

Understanding the Python regex engine

Backtracking  구현 가능

※ Backtracking : 필요시 뒤로 돌아가기 위해 마지막 성공 위치를 기억하는 기능

                              ( alternation, quantifiers 적용 가능 )

 

Catastrophic backtracking 문제 : 

경로, 대상string이 증가함에 따라 backtracking은 stack overflow 문제가 발생한다

특히, greedy character로 구성되어 마지막 character matching에 실패하는 regex

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

def catastrophic(n):                           # catastrophic backtracking 발생 - text에 c 부재
    print ("Testing with %d characters" %n)
    pat = re.compile('(a+)+c')
    text = "%s" %('a' * n)
    pat.search(text)

def catastrophic1(n):                         
    print ("Testing with %d characters" %n)
    pat = re.compile('(a+)+(b+)+c')
    text = 'a'*n
    text += 'b'*n
    text += 'c'
    pat.search(text)

if __name__ == "__main__":
    for n in range(20, 30):
        test(catastrophic, n)           # 46sec
        test(catastrophic1, n)          # 1e-12msec 이하

 

Optimization recommendations

1) Reuse compiled patterns 

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

def dontreuse():
    pattern = re.compile(r'\bfoo\b')
    pattern.match("foo bar")

def callonethousandtimes():
    for _ in range(1000):
        dontreuse()

if __name__ == "__main__":
    test(callonethousandtimes)           # 0.998735427856ms

 

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

pattern = re.compile(r'\bfoo\b')
def reuse():
    pattern.match("foo bar")

def callonethousandtimes():
    for _ in range(1000):
        reuse()

if __name__ == "__main__":
    test(callonethousandtimes)           # e-12ms 이하

2) Extract common parts in alternation

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

pattern = re.compile(r'/(Hello\sWorld|Hello\sContinent|Hello\sCountry)')
def nonoptimized():
    pattern.match("Hello\sCountry")

def callonethousandtimes():
    for _ in range(1000):
        nonoptimized()

if __name__ == "__main__":
    test(callonethousandtimes)     

 

import re
from time import time as now
import cProfile                         # 함수 처리 과정 및 속도 측정

# 실행 속도 촉정 (ms)
def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print("The function %s lasted: %.12fms" % (f.__name__, (now() - start)*10**3))

pattern = re.compile(r'/Hello\s(World|Continent|Country)')
def optimized():
    pattern.match("Hello\sCountry")

def callonethousandtimes():
    for _ in range(1000):
        optimized()

if __name__ == "__main__":
    test(callonethousandtimes)

3) Shortcut to alternation              

앞쪽에 가능성이 더 높은 옵션을 배치하며, 나머지 요소에 대해 나타날 확률이 같으면 길이가 짧은 요소를 긴 요소보다 먼저 배치한다.

import re

from time import time as now

def test(f, *args, **kargs):
    start = now()
    f(*args, **kargs)
    print ("The function %s lasted: %f" %(f.__name__, now() - start))

pattern = re.compile(r'(white|black|red|blue|green)')
def optimized():
    pattern.match("white")

def callonethousandtimes():
    for _ in range(1000):
        optimized()

test(callonethousandtimes)

4) Use non-capturing groups when appropriate   

5) Be specific    /\w{15}/

6) Don't be greedy  

 

 

'정규표현식' 카테고리의 다른 글

[ Python ] 정규표현식 Table 및 우선순위  (0) 2022.11.01
[ python ] 정규표현식[4]  (0) 2022.10.30
[ Python ] 정규표현식[3]  (0) 2022.10.27
[ python ] 정규표현식[2]  (0) 2022.10.27
[ python ] 정규표현식[1]  (0) 2022.10.27