Parsing Strings with Regular Expressions Using Python's re Module

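Before we start, let me recommend a great website: RegExr: Learn, Build, & Test RegEx. Besides letting you verify a pattern quickly, it shows a detailed explanation underneath, color-codes each part of the expression when you hover over it, and has a sidebar on the left that summarizes the syntax rules. It is handy whether you are learning, looking up a rule, or testing. Python's re syntax is fairly close to PCRE, although the two are not identical.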

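Regular expressions show up in many places, such as validating data or extracting values from strings. This post does not cover regular expression syntax or how to write patterns; it only covers how to use them in Python.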

Import Module

import re

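Finding Matches

Match from the Start

Use re.match to match a pattern starting at the beginning of the string. The pattern does not have to cover the entire string; re.fullmatch is the function that requires that.

Basic usage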

text = 'Hello world. This is an apple.'
m = re.match(r'.+', text)
print(m.group(0))
# group(0) is always the entire text matched by the pattern

If nothing matches, m is None.
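
As a minimal sketch of the points above, the following shows re.match returning None when the pattern is not at the start of the string, a None check before calling group, and re.fullmatch for matching the whole string. The variable names are only for illustration.

```python
import re

text = 'Hello world. This is an apple.'

# re.match only tries at position 0, so 'world' is not found.
no_match = re.match(r'world', text)
print(no_match)  # None

# Check for None before calling group() to avoid an AttributeError.
start_match = re.match(r'Hello', text)
if start_match is not None:
    print(start_match.group(0))  # 'Hello'

# re.fullmatch succeeds only if the pattern covers the entire string.
full_fail = re.fullmatch(r'Hello', text)
full_ok = re.fullmatch(r'.+', text)
print(full_fail)            # None
print(full_ok is not None)  # True
```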

Naming groups

If the pattern uses parentheses to create groups, the text captured by the N-th group can be retrieved with the group method, for example:

text = 'Hello world. This is an apple.'
m = re.match(r'.+(world).+', text)
print(m.group(1))
# 'world'

Counting group positions every time is tedious, though. You can name the groups instead by using the (?P<NAME> ... ) syntax.

text = 'Hello world. This is an apple.'
m = re.match(r'.+(?P<TXT1>world).+', text)
print(m.group('TXT1'))
# 'world'
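
Named groups can also be read all at once. The sketch below uses two group names chosen only for this illustration (GREETING and TARGET) and shows group, groupdict, and index-style access on the Match object (available since Python 3.6).

```python
import re

text = 'Hello world. This is an apple.'
m = re.match(r'(?P<GREETING>\w+) (?P<TARGET>\w+)', text)

by_name = m.group('GREETING')   # access by name
by_index = m.group(2)           # named groups are still numbered
as_dict = m.groupdict()         # all named groups as a dict
by_item = m['TARGET']           # shorthand for m.group('TARGET')

print(by_name)   # 'Hello'
print(by_index)  # 'world'
print(as_dict)   # {'GREETING': 'Hello', 'TARGET': 'world'}
print(by_item)   # 'world'
```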

Search Anywhere in the String

Use re.search to find the first substring that matches the pattern, at any position in the string.

text = 'Hello world. This is an apple.'
m = re.search(r'\w+', text)
print(m.group(0)) # 'Hello'
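
To make the difference from re.match concrete, here is a small sketch that looks for a word in the middle of the string with both functions and reads the position of the hit with span.

```python
import re

text = 'Hello world. This is an apple.'

# 'apple' is not at the start, so match fails but search succeeds.
m_match = re.match(r'apple', text)
m_search = re.search(r'apple', text)

print(m_match)             # None
print(m_search.group(0))   # 'apple'
print(m_search.span())     # (24, 29): start and end index of the match
```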

Find All Matches

Use re.findall to get every non-overlapping substring that matches the pattern.

text = 'Hello world. This is an apple.'
substrs = re.findall(r'\w+', text)
print(substrs)
# ['Hello', 'world', 'This', 'is', 'an', 'apple']

Alternatively, use re.finditer to get a Match object for each match.

text = 'Hello world. This is an apple.'
for m in re.finditer(r'\w+', text):
    print(m.group(0))
    # prints 'Hello', 'world', 'This', 'is', 'an', 'apple' in order
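
One detail worth knowing is that re.findall changes its return value when the pattern contains groups: it returns the group contents instead of the whole match. The sketch below illustrates that, and uses re.finditer when the position of each match is needed. The patterns here are just examples.

```python
import re

text = 'Hello world. This is an apple.'

# One group: findall returns the text of that group for each match.
first_letters = re.findall(r'\b(\w)\w*', text)
print(first_letters)  # ['H', 'w', 'T', 'i', 'a', 'a']

# Several groups: findall returns one tuple per match.
pairs = re.findall(r'(\w)(\w*)', 'an apple')
print(pairs)  # [('a', 'n'), ('a', 'pple')]

# finditer keeps the Match objects, so positions are available too.
positions = [(m.group(0), m.start()) for m in re.finditer(r'\w+', text)]
print(positions)
# [('Hello', 0), ('world', 6), ('This', 13), ('is', 18), ('an', 21), ('apple', 24)]
```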

Substitute

Use re.sub to replace every match with a new string.

text = 'Hello world. This is an apple.'
new_text = re.sub(r'\w+', ':)', text)
print(new_text)
# ':) :). :) :) :) :).'

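re.subn works like re.sub but also returns the number of substitutions it made. To replace only the first few matches, pass count, which both functions accept:

text = 'Hello world. This is an apple.'
new_text, sub_count = re.subn(r'\w+', ':)', text, count=20)
print(new_text, sub_count)
# ':) :). :) :) :) :).', 6  (only 6 matches exist, so the limit of 20 is never reached)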
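
The following sketch shows count actually limiting the replacements, plus two other forms of the replacement argument: a template that refers to named groups with \g<NAME>, and a function that receives each Match object. The group names A and B are made up for this example.

```python
import re

text = 'Hello world. This is an apple.'

# Replace only the first two matches.
first_two = re.sub(r'\w+', ':)', text, count=2)
print(first_two)  # ':) :). This is an apple.'

# Refer to named groups in the replacement with \g<NAME>.
swapped = re.sub(r'(?P<A>\w+) (?P<B>\w+)', r'\g<B> \g<A>', text, count=1)
print(swapped)  # 'world Hello. This is an apple.'

# The replacement can also be a function of the Match object.
shouted = re.sub(r'\w+', lambda m: m.group(0).upper(), text)
print(shouted)  # 'HELLO WORLD. THIS IS AN APPLE.'
```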

Writing Readable Patterns

A complete regular expression easily grows long, and even with named groups it becomes hard to read, as in this example:

ptn = r'#define\s+\b(?P<NAME>[A-Z_][A-Z0-9_\[\]]+)\b(?P<HAS_PAREN>\((?P<PARAMS>[\w, ]*)\))*\s*(?P<TOKEN>[\.\"\'\w\d_, +*!=<>&|\?\:\/\-\(\)\[\]]+)*'

It matches C #define lines such as:

#define NVME_CMD_SIZE (64)

But a pattern this long is hard to read. The following two techniques help.

Verbose Mode: ignoring line breaks in the pattern

ptn = r'''
    \#define\s+\b                                        # #define
    (?P<NAME>[A-Z_][A-Z0-9_\[\]]+)\b                     # MACRO_NAME
    (?P<HAS_PAREN>\((?P<PARAMS>[\w, ]*)\))*              # (a, b)
    \s*
    (?P<TOKEN>[\.\"\'\w\d_, +*!=<>&|\?\:\/\-\(\)\[\]]+)* # MACRO_BODY
'''

text = '#define NVME_CMD_SIZE (64)'
m = re.match(ptn, text, flags=re.VERBOSE)
# or
define_ptn = re.compile(ptn, flags=re.VERBOSE)
m = define_ptn.match(text)

print(m.group('TOKEN')) # '(64)'

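Every re function introduced above (match, search, findall, finditer) accepts a flags argument, and passing re.VERBOSE makes it ignore line breaks in the pattern. Note that this flag also ignores other whitespace in the pattern, except inside a character class or when the whitespace is escaped with a backslash.

In verbose mode, # becomes a special character that starts a comment running to the end of the line, so to match a literal # character you need to escape it with a backslash (\#).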
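
The whitespace and # rules are easy to trip over, so here is a small sketch of each case. The patterns and variable names are only illustrations.

```python
import re

# An unescaped space is dropped, so this pattern is really 'Helloworld'.
dropped = re.match(r'Hello world', 'Hello world', flags=re.VERBOSE)
print(dropped)  # None

# Keep a literal space by escaping it or putting it in a character class.
escaped = re.match(r'Hello\ world', 'Hello world', flags=re.VERBOSE)
in_class = re.match(r'Hello[ ]world', 'Hello world', flags=re.VERBOSE)
print(escaped.group(0))   # 'Hello world'
print(in_class.group(0))  # 'Hello world'

# An unescaped # starts a comment, so only 'a' is left in the pattern.
commented = re.match(r'a#b', 'a#b', flags=re.VERBOSE)
literal_hash = re.match(r'a\#b', 'a#b', flags=re.VERBOSE)
print(commented.group(0))     # 'a'
print(literal_hash.group(0))  # 'a#b'
```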

String literal concatenation

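Alternatively, split the long string into several short string literals and wrap them in parentheses; adjacent literals are concatenated automatically (see Lexical analysis in the Python documentation).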

ptn = (
    r"\#define\s+\b" # #define
    r"(?P<NAME>[A-Z_][A-Z0-9_\[\]]+)\b" # MACRO_NAME
    r"(?P<HAS_PAREN>\((?P<PARAMS>[\w, ]*)\))*" # (a, b)
    r"\s*"
    r"(?P<TOKEN>[\.\"\'\w\d_, +*!=<>&|\?\:\/\-\(\)\[\]]+)*" # MACRO_BODY
)
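
As a usage sketch, the concatenated pattern can be compiled once and applied like any other pattern, with no flag needed because the comments are ordinary Python comments. The name DEFINE_PTN and the MAX macro input are assumptions made for this example; the first input is the #define line used earlier.

```python
import re

# Adjacent raw string literals are joined into one pattern string.
DEFINE_PTN = re.compile(
    r"\#define\s+\b"                                          # #define
    r"(?P<NAME>[A-Z_][A-Z0-9_\[\]]+)\b"                       # MACRO_NAME
    r"(?P<HAS_PAREN>\((?P<PARAMS>[\w, ]*)\))*"                # (a, b)
    r"\s*"
    r"(?P<TOKEN>[\.\"\'\w\d_, +*!=<>&|\?\:\/\-\(\)\[\]]+)*"   # MACRO_BODY
)

# Object-like macro: no parameter list, so PARAMS is None.
simple = DEFINE_PTN.match('#define NVME_CMD_SIZE (64)')
print(simple.group('NAME'))    # 'NVME_CMD_SIZE'
print(simple.group('TOKEN'))   # '(64)'
print(simple.group('PARAMS'))  # None

# Function-like macro (hypothetical input).
func_like = DEFINE_PTN.match('#define MAX(a, b) ((a) > (b) ? (a) : (b))')
print(func_like.group('PARAMS'))  # 'a, b'
print(func_like.group('TOKEN'))   # '((a) > (b) ? (a) : (b))'
```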
