CXXII. Regular Expression Functions (Perl-Compatible)

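Introduction

The syntax for patterns used in these functions closely resembles Perl. The expression should be enclosed in delimiters, for example a forward slash (/). Any character that is not a letter, a digit, or a backslash (\) can be used as the delimiter. If the delimiter character has to appear in the expression itself, it must be escaped with a backslash. Since PHP 4.0.4, the Perl-style matching delimiters (), {}, [] and <> can also be used. See Pattern Syntax for a detailed explanation.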
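As a minimal sketch of the delimiter rules, the following runs the same pattern with three different delimiters. The sample string and the variable names are made up for illustration only.

```php
<?php
// Hypothetical sample text, used only for illustration.
$subject = 'See http://www.pcre.org/pcre.txt for details';

// Slash delimiters: every literal slash inside the pattern must be escaped.
$withSlashes = preg_match('/http:\/\/([\w.]+)\//', $subject, $m1);

// Hash delimiters: slashes in the pattern need no escaping.
$withHashes = preg_match('#http://([\w.]+)/#', $subject, $m2);

// Matching bracket-style delimiters (PHP 4.0.4 and later).
$withBraces = preg_match('{http://([\w.]+)/}', $subject, $m3);

echo $m1[1], "\n"; // www.pcre.org
```

All three calls capture the same host name; choosing a delimiter that does not occur in the pattern simply avoids the extra backslashes.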

The ending delimiter may be followed by various modifiers that affect how matching is performed. See Pattern Modifiers.
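A short sketch of what modifiers change, using the i (case-insensitive) and m (multiline) modifiers. The input text is an assumption chosen for illustration.

```php
<?php
// Hypothetical input, for illustration only.
$text = "PHP3 was first\nphp4 came later";

// No modifiers: matching is case-sensitive and ^ anchors only at the
// very start of the subject.
$plain = preg_match_all('/^php[34]/', $text, $a);

// i: ignore case; m: let ^ also match right after each newline.
$modified = preg_match_all('/^php[34]/im', $text, $b);

echo $plain, ' ', $modified, "\n"; // 0 2
```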

PHP also supports regular expressions with POSIX extended syntax; see the POSIX extended regular expression functions.

Warning

Be aware of some limitations of PCRE. See http://www.pcre.org/pcre.txt for more information.

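Requirements

Regular expression support is provided by the PCRE (Perl Compatible Regular Expressions) library, which is open source software written by Philip Hazel, with copyright held by the University of Cambridge, England. It is available at ftp://ftp.csx.cam.ac.uk/pub/software/programming/pcre/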

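Installation

Since PHP 4.2.0 these functions are enabled by default. The PCRE functions can be disabled with --without-pcre-regex. If you are not using the bundled library, use --with-pcre-regex=DIR to specify the path to the PCRE library and header files. For earlier versions, PHP must be compiled with --with-pcre-regex[=DIR] in order to use these functions.

The Windows version of PHP has built-in support for this extension. No additional extension needs to be loaded in order to use these functions.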

Runtime Configuration

This extension defines no configuration directives in php.ini.

Resource Types

This extension defines no resource types.

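Predefined Constants

The constants below are defined by this extension, and are only available when the extension has either been compiled into PHP or dynamically loaded at runtime.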

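Table 1. PREG constants

Constant -- Description
PREG_PATTERN_ORDER -- Orders results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on. This flag is only used with preg_match_all().
PREG_SET_ORDER -- Orders results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on. This flag is only used with preg_match_all().
PREG_OFFSET_CAPTURE -- See the description of PREG_SPLIT_OFFSET_CAPTURE. This flag is available since PHP 4.3.0.
PREG_SPLIT_NO_EMPTY -- This flag makes preg_split() return only non-empty pieces.
PREG_SPLIT_DELIM_CAPTURE -- This flag makes preg_split() also capture parenthesized expressions in the delimiter pattern. This flag is available since PHP 4.0.5.
PREG_SPLIT_OFFSET_CAPTURE -- If this flag is set, the string offset of each match is returned along with the match. Note that this changes the returned array so that every element is itself an array, with the matched string as its first item and the offset as its second. This flag is available since PHP 4.3.0 and is only used with preg_split().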
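The effect of these flags may be easier to see on a small example. The sketch below uses made-up input strings and variable names; the comments show the array shapes each flag produces.

```php
<?php
$html = '<b>bold</b> and <i>italic</i>';   // hypothetical input
$pattern = '/<(\w+)>(.*?)<\/\1>/';

// One sub-array per group: [0] full matches, [1] first group, [2] second group.
preg_match_all($pattern, $html, $byPattern, PREG_PATTERN_ORDER);
// array(array('<b>bold</b>', '<i>italic</i>'), array('b', 'i'), array('bold', 'italic'))

// One sub-array per match: each holds the full match followed by its groups.
preg_match_all($pattern, $html, $bySet, PREG_SET_ORDER);
// array(array('<b>bold</b>', 'b', 'bold'), array('<i>italic</i>', 'i', 'italic'))

// Drop the empty pieces produced by leading and trailing separators.
$words = preg_split('/[\s,]+/', ' a, b ,c ', -1, PREG_SPLIT_NO_EMPTY);
// array('a', 'b', 'c')

// Keep the parenthesized part of the delimiter pattern in the result.
$withDelims = preg_split('/(,)/', 'a,b', -1, PREG_SPLIT_DELIM_CAPTURE);
// array('a', ',', 'b')

// Return each piece together with its offset in the subject.
$withOffsets = preg_split('/,/', 'a,b', -1, PREG_SPLIT_OFFSET_CAPTURE);
// array(array('a', 0), array('b', 2))
```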

Examples

Example 1. Examples of valid patterns

  • /<\/\w+>/

  • |(\d{3})-\d+|Sm

  • /^(?i)php[34]/

  • {^\s+(\s+)?$}

Example 2. Examples of invalid patterns

  • /href='(.*)' - missing ending delimiter

  • /\w+\s*\w+/J - unknown modifier 'J'

  • 1-\d3-\d3-\d4| - missing starting delimiter
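A minimal sketch of how such patterns can be detected at run time, relying on preg_match() returning false (and raising a warning) when a pattern cannot be compiled. The helper name pattern_compiles is hypothetical and not part of the extension.

```php
<?php
// Hypothetical helper: reports whether PCRE accepts a pattern string.
function pattern_compiles($pattern)
{
    // @ suppresses the compile warning; preg_match() returns false on error
    // and 0 or 1 when the pattern compiled and was tried against ''.
    return @preg_match($pattern, '') !== false;
}

var_dump(pattern_compiles('/<\/\w+>/'));       // bool(true)
var_dump(pattern_compiles("/href='(.*)'"));    // bool(false), no ending delimiter
var_dump(pattern_compiles('1-\d3-\d3-\d4|'));  // bool(false), no starting delimiter
```

Which modifiers count as unknown depends on the PHP version in use, so the modifier case is left out of this sketch.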

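Table of Contents
Pattern Modifiers -- Describes the modifiers available in regular expression patterns
Pattern Syntax -- Describes the syntax of Perl-compatible regular expressions
preg_grep -- Return array entries that match the pattern
preg_match_all -- Perform a global regular expression match
preg_match -- Perform a regular expression match
preg_quote -- Quote regular expression characters
preg_replace_callback -- Perform a regular expression search and replace using a callback
preg_replace -- Perform a regular expression search and replace
preg_split -- Split a string by a regular expression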

User Contributed Notes
nickspring at mail dot ru
14-Oct-2006 10:47
A regular expressions tutorial in Russian is available at http://www.pcre.ru
toomuchphp-phpman at yahoo dot com
03-Aug-2006 06:33
RE: filippo dot toso at humanprofile dot biz

You should use preg_quote() to escape PCRE special characters.
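A minimal sketch of that approach, with made-up input strings. The second argument to preg_quote() names the delimiter so that it gets escaped as well.

```php
<?php
// Hypothetical text containing regex metacharacters and the delimiter itself.
$needle   = 'price (USD): $5.00/unit';
$haystack = 'Listed price (USD): $5.00/unit today';

// Escape the metacharacters, plus the "/" used as the delimiter.
$pattern = '/' . preg_quote($needle, '/') . '/';

var_dump(preg_match($pattern, $haystack)); // int(1)
```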
filippo dot toso at humanprofile dot biz
03-Aug-2006 06:11
Here is a simple PHP function that escapes a string so it can be used safely in a regular expression. It is very useful for including unknown strings in PCRE patterns:

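<?php
function safe_preg($content) {
    // The backslash comes first so that the backslashes added for the other
    // characters are not escaped a second time.
    // Note: the pattern delimiter is not escaped by this function.
    $search  = array('\\', '^', '$', '.', '[', ']', '|', '(', ')', '?', '*', '+', '{', '}');
    $replace = array('\\\\', '\\^', '\\$', '\\.', '\\[', '\\]', '\\|', '\\(', '\\)', '\\?', '\\*', '\\+', '\\{', '\\}');
    return str_replace($search, $replace, $content);
}
?>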
lgandras at hotmail dot com
20-Feb-2006 06:19
I read this part, but I couldn't understand a single word, because first I need to know basic regular expressions. Somebody posted a link for Perl, which is almost like PHP, but here is one totally dedicated to PHP:

http://weblogtoolscollection.com/regex/regex.php
Gokul
06-Feb-2006 04:59
I came across this nice tutorial for regular expressions in Perl:
http://perldoc.perl.org/perlretut.html
alexbodn at 012 dot n@t dot il
09-Jan-2006 10:45
Here is an annotation to my note from 28-Apr-2005 03:52, thanks to the welcome contribution of dipesh khakhkhar:

Here is a small function to determine whether a string is a preg expression.
Please note, that a dot '.' character in a regexp may match any character, including a dot, thus a string containing a dot may well be interpreted as an ordinary string, or a regexp.

function preg_ispreg($str)
{
   $prefix = "";
   $sufix = "";
   if ($str[0] != '^')
       $prefix = '^';
   if ($str[strlen($str) - 1] != '$')
       $sufix = '$';
   $estr = preg_replace("'^/'", "\\/", preg_replace("'([^/])/'", "\\1\\/", $str));
   if (@preg_match("/".$prefix.$estr.$sufix."/", $str, $matches))
       return strcmp($str, $matches[0]) != 0;
   return true;
}
richardh at phpguru dot org
23-Sep-2005 02:50
There's a printable PDF PCRE cheat sheet available here:

http://www.phpguru.org/article.php?ne_id=67

Has the common metacharacters, quantifiers, pattern modifiers, character classes and assertions with short explanations.
hfuecks at nospam dot org
04-Jul-2005 05:21
Good PCRE tutorial at http://www.tote-taste.de/X-Project/regex/ - well explained but still in depth
alexbodn at 012 dot n@t dot il
28-Apr-2005 09:52
Here is a small function to determine whether a string is a [valid] preg expression.

function preg_ispreg($str)
{
   $prefix = "";
   $sufix = "";
   if ($str[0] != '^')
       $prefix = '^';
   if ($str[strlen($str) - 1] != '$')
       $sufix = '$';
   $estr = preg_replace("'^/'", "\\/", preg_replace("'([^/])/'", "\\1\\/", $str));
   if (@preg_match("/".$prefix.$estr.$sufix."/", $str, $matches))
       return strcmp($str, $matches[0]) != 0;
   return false;
}
Ned Baldessin
24-Oct-2004 09:08
If you want to perform regular expressions on Unicode strings, the PCRE functions will NOT be of any help. You need to use the Multibyte extension: mb_ereg(), mb_eregi(), mb_ereg_replace() and so on. When doing so, be careful to set the default text encoding to the same encoding used by the text you are searching and replacing in. You can do that with the mb_regex_encoding() function. You will probably also want to set the default encoding for the other mb_* string functions with mb_internal_encoding().

So when dealing with, say, French text, I start with these:
<?php
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');
setlocale(LC_ALL, 'fr-fr');
?>
steve at stevedix dot de
20-Jul-2004 08:17
Something to bear in mind is that regex is actually a declarative programming language like Prolog: your regex is a set of rules which the regex interpreter tries to match against a string. During this matching, the interpreter will assume certain things, and continue assuming them until it comes up against a failure to match, which then causes it to backtrack. Regex assumes "greedy matching" unless explicitly told not to, which can cause a lot of backtracking. A general rule of thumb is that the more backtracking, the slower the matching process.

It is therefore vital, if you are trying to optimise your program to run quickly (and if you can't do without regex), to optimise your regexes to match quickly.
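To make the greedy behaviour concrete, here is a small sketch with a made-up input string. It compares a greedy quantifier, a lazy one, and a negated character class, which is one common way to reduce backtracking (an illustration, not a benchmark).

```php
<?php
$html = '<b>one</b> and <b>two</b>'; // hypothetical input

// Greedy: .* first runs to the end of the string, then backtracks to the
// last "</b>" it can find.
preg_match('/<b>.*<\/b>/', $html, $greedy);
echo $greedy[0], "\n";   // <b>one</b> and <b>two</b>

// Lazy: .*? grows one character at a time and stops at the first "</b>".
preg_match('/<b>.*?<\/b>/', $html, $lazy);
echo $lazy[0], "\n";     // <b>one</b>

// Negated class: [^<]* can never run past a "<", so there is little to undo.
preg_match('/<b>[^<]*<\/b>/', $html, $negated);
echo $negated[0], "\n";  // <b>one</b>
```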

I recommend the use of a tool such as "The Regex Coach" to debug your regex strings.

http://weitz.de/files/regex-coach.exe (Windows installer) http://weitz.de/files/regex-coach.tgz (Linux tar archive)
hrz at geodata dot soton dot ac dot uk
07-Mar-2002 03:33
If you're venturing into new regular expression territory with a lack of useful examples then it would pay to get familiar with this page:

http://www.pcre.org/man.txt