Regular Expression

Posted on 2022-09-10 Edited on 2026-03-15 In Computer Science


Intro

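Every regular expression $r$ describes a language $L(r)$, namely the regular set it defines.
- For example, the language of C identifiers can be written as the following regular expression, where $\mathrm{letter} \_$ stands for any letter or the underscore: \[ \mathrm{letter} \_ (\mathrm{letter} \_ |\mathrm{digit})^* \]
Regular expressions are not only a mathematical tool; they are also supported by many programming languages, and the regex syntax of most languages is largely the same.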

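Given an alphabet $\Sigma$, the regular expressions over $\Sigma$ are defined by the following rules, and only by them:

- $\epsilon$ is a regular expression, and $L(\epsilon) = \{\epsilon\}$
- For every $a \in \Sigma$, $a$ is a regular expression, and $L(a) = \{a\}$
- If $r$ and $s$ are regular expressions, then so are:
  - $r|s$, with $L(r|s) = L(r) \cup L(s)$
  - $rs$, with $L(rs) = L(r)L(s)$
  - $r^*$, with $L(r^*) = (L(r))^*$
  - $(r)$, with $L((r)) = L(r)$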

Operator precedence: $*$ > concatenation > $|$
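
A small sketch of what this precedence means in practice, assuming Python's `re` module, which uses the same ordering for these three operators. The helper name `matches` is only illustrative. The pattern `ab*|c` is read as `(a(b*))|c`, not as `(ab)*|c` or `a(b*|c)`:

```python
import re

def matches(pattern, text):
    """Return True when the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None

implicit = r"ab*|c"        # relies on precedence: * > concatenation > |
explicit = r"(a(b*))|c"    # the grouping that the precedence rule implies
star_on_ab = r"(ab)*|c"    # a different language: the star covers "ab"

for text in ["a", "abbb", "c", "abab", "ac", ""]:
    print(repr(text),
          matches(implicit, text),
          matches(explicit, text),
          matches(star_on_ab, text))
```

The first two columns always agree, while the third differs on inputs such as `abbb` and `abab`.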

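The set of C identifiers:

$\mathrm{letter} \_$: $A|B|\dots|Z|a|b|\dots|z|\_$
$\mathrm{digit}$: $0|1|\dots|9$
$\mathrm{id}$: $\mathrm{letter} \_ (\mathrm{letter} \_ |\mathrm{digit})^*$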

The set of Pascal unsigned numbers, for example: 1946, 11.28, 63.6E8, 1.99E-6

$\mathrm{digit}$: $0|1|\dots|9$
$\mathrm{digits}$: $\mathrm{digit} \ \mathrm{digit}^*$
$\mathrm{optional \_ fraction}$: $. \mathrm{digits} | \epsilon$
$\mathrm{optional \_ exponent}$: $(\mathrm{E} ( + | - | \epsilon ) \ \mathrm{digits} ) \ | \ \epsilon$
$\mathrm{num}$: $\mathrm{digits} \ \mathrm{optional \_ fraction}\ \mathrm{optional \_ exponent}$
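
The named definitions can be composed mechanically. Below is a minimal sketch in Python that uses only alternation, concatenation and `*`, with an empty alternative standing in for $\epsilon$. It assumes that $\mathrm{num}$ is $\mathrm{digits}$ followed by the optional fraction and the optional exponent, which is what the sample values require. The function name `is_unsigned_number` is made up for illustration.

```python
import re

# Each variable mirrors one named definition; only |, concatenation and * are used.
digit = "(" + "|".join("0123456789") + ")"        # 0|1|...|9
digits = rf"{digit}{digit}*"                      # digit digit*
optional_fraction = rf"(\.{digits}|)"             # . digits | epsilon
optional_exponent = rf"(E(\+|-|){digits}|)"       # (E (+|-|epsilon) digits) | epsilon
num = rf"{digits}{optional_fraction}{optional_exponent}"

NUM = re.compile(num)

def is_unsigned_number(text):
    """True when the whole of text is an unsigned number in the sense above."""
    return NUM.fullmatch(text) is not None

for sample in ["1946", "11.28", "63.6E8", "1.99E-6", ".5", "1.", "E8"]:
    print(sample, is_unsigned_number(sample))
```

The last three samples are rejected: a fraction or an exponent cannot appear without leading digits, and a decimal point must be followed by at least one digit.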

For convenience, a few shorthand notations, all definable with the operators above, cover common patterns:

One or more: $r^+$, equivalent to $rr^*$
Zero or one: $r?$, equivalent to $\epsilon | r$
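Character classes:
- The literal character c: \c
  - Writing c alone is read as a regular expression, which matters when c is an operator symbol such as * or |
- $[abc]$ is equivalent to $a|b|c$, i.e. any one character of the string $abc$
- $[a-z]$ is equivalent to $a|b|\dots|z$
  - [0-9a-zA-Z\_]: matches one digit, letter, or underscore
  - [0-9a-zA-Z\_]+: matches a string of at least one digit, letter, or underscore, such as 'a100', '0_Z', 'Py3000'
  - [a-zA-Z\_][0-9a-zA-Z\_]*: matches a letter or underscore followed by any number of digits, letters, or underscores, i.e. a valid (ASCII) Python variable name
- [^s]: any one character that does not occur in the string $s$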
$r\{n\}$: exactly n copies of $r$
- \d{3} matches exactly 3 digits, e.g. '010'
$r\{m,n\}$: the concatenation of at least m and at most n copies of $r$
- \d{3,8}: matches 3 to 8 digits
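^: start of a line
- ^\d means the line must start with a digit.
$: end of a line
- \d$ means the line must end with a digit.
- You may have noticed that py also matches inside 'python'; adding the anchors, ^py$ has to match the whole line, so it only matches 'py'.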
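
The shorthand above can be tried out directly. This is a sketch assuming Python's `re` module; the helpers `full` and `found` are illustrative names, with `full` testing whether the whole string matches and `found` testing whether the pattern matches somewhere inside it.

```python
import re

def full(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

def found(pattern, text):
    """True when pattern matches somewhere inside text."""
    return re.search(pattern, text) is not None

identifier = r"[a-zA-Z_][0-9a-zA-Z_]*"

checks = {
    "r+ needs at least one r": full(r"a+", "aaa") and not full(r"a+", ""),
    "r? allows zero or one r": full(r"ab?", "a") and full(r"ab?", "ab"),
    "backslash gives a literal": full(r"\*", "*"),
    "[abc] is a|b|c": full(r"[abc]", "b") and not full(r"[abc]", "d"),
    "[^abc] excludes a, b, c": full(r"[^abc]", "d") and not full(r"[^abc]", "a"),
    "word characters": all(full(r"[0-9a-zA-Z_]+", s) for s in ("a100", "0_Z", "Py3000")),
    "identifier shape": full(identifier, "_tmp1") and not full(identifier, "1abc"),
    "exactly three digits": full(r"\d{3}", "010") and not full(r"\d{3}", "01"),
    "three to eight digits": full(r"\d{3,8}", "12345678") and not full(r"\d{3,8}", "123456789"),
    "starts with a digit": found(r"^\d", "3 apples") and not found(r"^\d", "apples 3"),
    "ends with a digit": found(r"\d$", "apples 3") and not found(r"\d$", "3 apples"),
    "py inside python": found(r"py", "python"),
    "anchored py": found(r"^py$", "py") and not found(r"^py$", "python"),
}

for description, ok in checks.items():
    print(f"{description}: {ok}")
```

Every check prints `True`. One caveat: in Python 3, `\d` on `str` patterns also accepts non-ASCII decimal digits, so `[0-9]` is the stricter choice when only ASCII digits are wanted.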

The earlier examples, rewritten with the shorthand:

$\mathrm{letter} \_$: $[A-Za-z\_]$
$\mathrm{digit}$: $[0-9]$
$\mathrm{id}$: $\mathrm{letter} \_ (\mathrm{letter} \_ |\mathrm{digit})^*$
$\mathrm{digit}$: $[0-9]$
$\mathrm{digits}$: $\mathrm{digit}^+$
$\mathrm{num}$: $\mathrm{digits} \ (. \mathrm{digits})? \ (\mathrm{E}[+-]? \ \mathrm{digits})?$
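
As a rough illustration, the two shorthand definitions translate almost character for character into Python patterns. The sketch below assumes ASCII letters only, and the `classify` helper is a made-up name for this example rather than part of any lexer.

```python
import re

# id: letter_ (letter_ | digit)*
ID = re.compile(r"[A-Za-z_][A-Za-z_0-9]*")

# num: digits (. digits)? (E [+-]? digits)?
NUM = re.compile(r"[0-9]+(\.[0-9]+)?(E[+-]?[0-9]+)?")

def classify(token):
    """Label a token as 'id', 'num', or 'unknown' using whole-string matches."""
    if ID.fullmatch(token):
        return "id"
    if NUM.fullmatch(token):
        return "num"
    return "unknown"

for token in ["count_1", "_x", "1946", "11.28", "63.6E8", "1.99E-6", "9lives", "11."]:
    print(token, classify(token))
```

`fullmatch` is used so that a token such as `9lives` is rejected outright instead of being accepted on the strength of its leading `9`.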