preg_match_all

(PHP 3 >= 3.0.9, PHP 4, PHP 5)

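preg_match_all -- Perform a global regular expression match

Description

int preg_match_all ( string pattern, string subject, array matches [, int flags] )

Searches subject for all matches to the regular expression given in pattern and puts them in matches, in the order specified by flags.

After the first match is found, each subsequent search continues from the end of the previous match.

flags can be a combination of the following flags (note that it makes no sense to combine PREG_PATTERN_ORDER with PREG_SET_ORDER):

PREG_PATTERN_ORDER

Orders the results so that $matches[0] is an array of full pattern matches, $matches[1] is an array of strings matched by the first parenthesized subpattern, and so on.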

<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out, PREG_PATTERN_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
?>

This example will output:

<b>example: </b>, <div align=left>this is a test</div>
example: , this is a test

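So $out[0] contains the strings that matched the full pattern, and $out[1] contains the strings enclosed by a pair of HTML tags.

PREG_SET_ORDER

Orders the results so that $matches[0] is an array of the first set of matches, $matches[1] is an array of the second set of matches, and so on.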

<?php
preg_match_all("|<[^>]+>(.*)</[^>]+>|U",
    "<b>example: </b><div align=left>this is a test</div>",
    $out, PREG_SET_ORDER);
print $out[0][0].", ".$out[0][1]."\n";
print $out[1][0].", ".$out[1][1]."\n";
?>

This example will output:

<b>example: </b>, example:
<div align=left>this is a test</div>, this is a test

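In this case, $matches[0] is the first set of matches: $matches[0][0] holds the text matched by the full pattern, $matches[0][1] holds the text matched by the first subpattern, and so on. Similarly, $matches[1] is the second set of matches, etc.

PREG_OFFSET_CAPTURE

If this flag is set, the string offset of every match is returned along with it. Note that this changes the returned array so that every element is itself an array, with the matched string at index 0 and its offset into subject at index 1. This flag is available since PHP 4.3.0.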
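A minimal sketch of what PREG_OFFSET_CAPTURE does to the result array. The subject string and pattern are made up for illustration; the short array syntax used elsewhere on this page assumes PHP 5.4 or later.

```php
<?php
// Each entry becomes a two-element array: [matched text, offset in the subject].
$subject = 'a1 b22 c333';
preg_match_all('/\d+/', $subject, $matches, PREG_OFFSET_CAPTURE);

foreach ($matches[0] as $hit) {
    list($text, $offset) = $hit;
    echo "$text at offset $offset\n";
}
```

The same nesting applies to every subpattern array, so $matches[1][0] would also be a text/offset pair if the pattern had a capturing group.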

If no order flag is given, PREG_PATTERN_ORDER is assumed.

Returns the number of full pattern matches (which might be zero), or FALSE if an error occurred.
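Because zero matches and FALSE are different outcomes, the return value is best compared with ===. The sketch below wraps that check in a hypothetical helper, count_matches(), which is not part of PHP; the choice of RuntimeException is likewise only an assumption.

```php
<?php
// Hypothetical helper: returns the match count, or throws when
// preg_match_all() reports an error by returning FALSE.
function count_matches($pattern, $subject)
{
    $count = preg_match_all($pattern, $subject, $matches);
    if ($count === false) {
        // FALSE means the call itself failed; it is not the same as "no matches".
        throw new RuntimeException('preg_match_all() failed for pattern ' . $pattern);
    }
    return $count; // may legitimately be 0
}

echo count_matches('/\d+/', 'Call 555-1212 or 1-800-555-1212'), "\n";
echo count_matches('/xyz/', 'Call 555-1212'), "\n";
```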

Example 1. Getting all phone numbers out of some text

<?php
preg_match_all("/\(?  (\d{3})?  \)?  (?(1)  [\-\s] ) \d{3}-\d{4}/x",
               "Call 555-1212 or 1-800-555-1212", $phones);
?>

Example 2. Find matching HTML tags (greedy)

<?php
// \\2 is an example of a backreference. In PCRE it means that the text
// matched by the second set of parentheses in the regular expression
// must be matched again here; in this case that is ([\w]+). The extra
// backslash is needed because the string is in double quotes.
$html = "<b>bold text</b><a href=howdy.html>click me</a>";

preg_match_all("/(<([\w]+)[^>]*>)(.*)(<\/\\2>)/", $html, $matches);

for ($i = 0; $i < count($matches[0]); $i++) {
  echo "matched: ".$matches[0][$i]."\n";
  echo "part 1: ".$matches[1][$i]."\n";
  echo "part 2: ".$matches[3][$i]."\n";
  echo "part 3: ".$matches[4][$i]."\n\n";
}
?>

This example will output:

matched: <b>bold text</b>
part 1: <b>
part 2: bold text
part 3: </b>

matched: <a href=howdy.html>click me</a>
part 1: <a href=howdy.html>
part 2: click me
part 3: </a>

See also preg_match(), preg_replace(), and preg_split().


User Contributed Notes
sam at NOSPAM dot aigc dot net
04-Nov-2006 05:10
Here's something I made awhile ago to colorize long regular expressions. I can't guarantee it'll work for everything/everyone, but it helps me a lot and might help someone else.

Usage:
<?php echo highlight_regexp("/^[0-9]{2}:[0-9]{2}[apAP]$/"); ?>

<?php
function highlight_regexp($pattern) {
   $colors = array(
       "/" => "red",
       "(" => "green",
       ")" => "green",
       "[" => "blue",
       "]" => "blue",
       "{" => "orange",
       "}" => "orange"
   );
   $specialchars = array("?","+","*",".","|");
   $space = "&nbsp; &nbsp; ";
   // state carried across characters
   $skip = 0;
   $show = 0;
   $tier = 0;
   $inbrackets = 0;
   $spaceover = "";
   $skipspaceover = "";
   $return = "";
   for ($i = 0; $i < strlen($pattern); $i++) {
       $spacing = "";
       if ($skip) {
           $show = 1;
           $skip = 0;
       } else
           switch ($pattern[$i]) {
               case "/":
               case "(":
               case "[":
               case "{":
                   if ($skip) {
                       $show = 1;
                       $skip = 0;
                   } else {
                       $tier++;
                       if ($pattern[$i] == "/")
                           $tier = 0;
                       for ($j = 0; $j < $tier; $j++)
                           $spacing .= $space;
                       $pattern[$i] == "{" or $return .= "<br>$spacing";
                       $return .= "<font color=".$colors[$pattern[$i]]."><b>".$pattern[$i]."</b></font>";
                       if ($pattern[$i] == "(")
                           $spaceover = "<br>$spacing$space";
                       else {
                           if ($pattern[$i] == "[")
                               $inbrackets = 1;
                           $spaceover = "";
                       }
                   }
                   $show = 0;
                   break;
               case ")":
               case "]":
               case "}":
                   if ($skip) {
                       $show = 1;
                       $skip = 0;
                   } else {
                       for ($j = 0; $j < $tier; $j++)
                           $spacing .= $space;
                       if ($pattern[$i] == ")")
                           $return .= "<br>$spacing";
                       elseif ($pattern[$i] == "]")
                           $inbrackets = 0;
                       $return .= "<font color=".$colors[$pattern[$i]]."><b>".$pattern[$i]."</b></font>\n";
                       $spaceover = "<br>$spacing";
                       $tier--;
                   }
                   $show = 0;
                   break;
               default:
                   $show = 1;
                   break;
           }
           if ($show) {
               if (!$inbrackets && in_array($pattern[$i],$specialchars)) {
                   $skipspaceover = 1;
                   $preextra = "<font style='font-weight:bold;color:red'>";
                   $postextra = "</font>";
                   $replace = "";
               } elseif ($pattern[$i] == " ") {
                   $preextra = "<i style='font-size:10px'>";
                   $replace = "(space)";
                   $postextra = "</i>";
               } else
                   $preextra = $postextra = $replace = $skipspaceover = "";
               if ($spaceover && !$skipspaceover) {
                   $return .= $spaceover;
                   $spaceover = "";
               }
               $return .= $preextra.($replace ? $replace : $pattern[$i]).$postextra;
           }
   }
   return $return;
}
?>
18-Oct-2006 08:09
While reading these notes I noticed many IP-matching patterns that seemed to be missing a detail. Most use this pattern...

25[0-5] | 2[0-4]\d | [01]\d{2} | \d{1,2}

... for matching each digit group to make sure not to match anything above 255. If we have something looking like an IP starting with 260, it can however match the 60.

So, in the beginning, if matching less than 3 digits, make sure there is no digit before: (?<=[^\d]|^)
And at the end, make sure there is no digit following: (?=[^\d]|$)

<?php
$ip_pattern = "/((25[0-5]|2[0-4]\\d|[01]\\d{2}|(?<=[^\\d]|^)\\d{1,2})\\.){3}".
              "(25[0-5]|2[0-4]\\d|[01]\\d{2}|\\d{1,2})(?=[^\\d]|$)/";
?>

But generally it takes less time (coding and executing) to just capture anything looking like an IP and weed out the invalid ones with some simple function using array_walk(). A lot more flexible as well.
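The capture-then-filter approach might look like the sketch below. It is only an illustration: extract_ipv4() is a hypothetical name, array_filter() with filter_var() stands in for the array_walk() callback mentioned above, and it assumes PHP 5.3 or later with the filter extension enabled (the default).

```php
<?php
// Step 1: grab anything shaped like four dot-separated groups of 1-3 digits.
// Step 2: let filter_var() decide which candidates are valid IPv4 addresses.
function extract_ipv4($text)
{
    preg_match_all('/(?<![\d.])\d{1,3}(?:\.\d{1,3}){3}(?![\d.])/', $text, $found);
    $valid = array_filter($found[0], function ($candidate) {
        return filter_var($candidate, FILTER_VALIDATE_IP, FILTER_FLAG_IPV4) !== false;
    });
    return array_values($valid); // reindex from 0
}

print_r(extract_ipv4('ok 192.168.0.1, bad 260.1.2.3, ok 10.0.0.255'));
```

Keeping the pattern loose and the validation in ordinary code means the range rules can change without touching the regular expression.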
cp at ltur dot de
14-Aug-2006 08:56
To Stabby at somewhere dot invalid,

I guess you're not very familiar with regexps. Here is the working version:

<?php
$pattern='/fred(.+)bloggs/isU';
$data="fred hello\nthere bloggs fred goodbye bloggs";
preg_match_all($pattern,$data,$repTxt,PREG_PATTERN_ORDER);

print_r($repTxt);
?>

1. You have to set the 's' modifier to make the '.' match newlines (see documentation).

2. If you write the capturing phrase like you did, (.)+, it will capture a lot of single characters and keep only the last one. If you change it to (.+), it will capture all allowed characters.
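The difference between the two groupings can be checked side by side. This is a small sketch using the same test string; the variable names $single and $whole are chosen only for illustration.

```php
<?php
$data = "fred hello\nthere bloggs fred goodbye bloggs";

// (.)+ repeats a one-character group, so only the LAST repetition is kept.
preg_match_all('/fred(.)+bloggs/isU', $data, $single);

// (.+) puts the repetition inside the group, so the whole run is captured.
preg_match_all('/fred(.+)bloggs/isU', $data, $whole);

var_dump($single[1], $whole[1]);
```

With the 's' modifier both patterns find two matches; only the content of the first subpattern differs.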

HTH
Claus
Stabby at somewhere dot invalid
03-Aug-2006 09:30
Please note that multi-line matches do not work, regardless of the modifiers in php 4.3.x. example:

<?php
$pattern='/fred(.)+bloggs/iU';
$data="fred hello\nthere bloggs fred goodbye bloggs";
preg_match_all($pattern,$data,$repTxt,PREG_PATTERN_ORDER);

print_r($repTxt);
?>

Will print out:

Array ( [0] => Array ( [0] => fred goodbye bloggs ) [1] => Array ( [0] => ) )

(the \n newline breaks the first match). I just use a loop with strpos to get around this.
krzysztof at uno dot pl
03-Aug-2006 06:24
<?PHP
// GET all links from URL

function remove_html(&$item, $key)
{
   $item = trim(strip_tags($item));
}

function get_links($url) {
   $preg = "/a[\s]+[^>]*?href[\s]?=[\s\"\']+(.*?)[\"\']+.*?>"
         . "([^<]+|.*?)?<\/a>/";
   preg_match_all(trim($preg),
       file_get_contents($url), $out, PREG_PATTERN_ORDER);
   $keys = $out[1];
   $values = $out[2];
   array_walk($values, 'remove_html');
   return (array_combine($keys, $values));
}

print_r(get_links("http://www.uno.pl"));

/*
Result:

array
(
   [/] =>
   [/downloads.php] => PHP 5.1.2
   [http://www.php.net/docs.php] => manual
   [http://www.uno.pl] => My link 1
   ...
)

*/

?>
mail at SPAMBUSTER at milianw dot de
17-Jul-2006 10:11
I refurbished the autoCloseTags function by connum at DONOTSPAMME dot googlemail dot com:
<?php
/**
 * close all open xhtml tags at the end of the string
 *
 * @author Milian Wolff <http://milianw.de>
 * @param string $html
 * @return string
 */
function closetags($html){
  #put all opened tags into an array
  preg_match_all("#<([a-z]+)( .*)?(?!/)>#iU",$html,$result);
  $openedtags=$result[1];

  #put all closed tags into an array
  preg_match_all("#</([a-z]+)>#iU",$html,$result);
  $closedtags=$result[1];
  $len_opened = count($openedtags);
  # all tags are closed
  if(count($closedtags) == $len_opened){
    return $html;
  }
  $openedtags = array_reverse($openedtags);
  # close tags
  for($i=0;$i<$len_opened;$i++) {
    if (!in_array($openedtags[$i],$closedtags)){
      $html .= '</'.$openedtags[$i].'>';
    } else {
      unset($closedtags[array_search($openedtags[$i],$closedtags)]);
    }
  }
  return $html;
}
?>
volkank at developera dot com
07-Jul-2006 10:04
A few notes to add to my last post.

Leading zeros in IP addresses can cause problems on both Windows and Linux, because it is unclear whether an octet is meant as decimal or octal (especially if the octal form is not written properly).

"66.163.161.117" is in decimal syntax, but in "066.163.161.117" the first octet 066 is in octal syntax.
So "066.163.161.117" is recognized as decimal "54.163.161.117" by the operating system.
BTW, octal is a fairly rare syntax, so you may not want or need to match it.

***
Unless you specifically want to match IP addresses in both decimal and octal syntax, you can use Chortos-2's pattern, which is suitable for most cases.

<?php
//DECIMAL syntax IP match

//$num="(\\d|[1-9]\\d|1\\d\\d|2[0-4]\\d|25[0-5])";
$num='(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])';

if (!preg_match("/^$num\\.$num\\.$num\\.$num$/", $ip_addr,$match)) //validate IP
...

preg_match_all("/$num\\.$num\\.$num\\.$num/",$test,$match); //collect IP addresses from a text (notice that ^ and $ are not present in the pattern)
...

?>
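One caveat when collecting addresses without the ^ and $ anchors: because \d is the first alternative, the last octet can stop after a single digit, and a match can begin in the middle of a longer number. The sketch below compares the unguarded pattern with one guarded by lookarounds. The (?:...) non-capturing group, the test string and the variable names are assumptions made for the illustration.

```php
<?php
// Same decimal octet alternatives as above, made non-capturing for brevity.
$num = '(?:\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])';
$test = 'hosts: 127.0.0.112 10.0.0.2 999.1.1.1';

// Without guards, "\d" wins for the last octet and the match is cut short.
preg_match_all("/$num\\.$num\\.$num\\.$num/", $test, $loose);

// (?<!\d) and (?!\d) forbid a digit directly before or after the address.
preg_match_all("/(?<!\\d)$num\\.$num\\.$num\\.$num(?!\\d)/", $test, $strict);

print_r($loose[0]);
print_r($strict[0]);
```

The guarded version keeps 127.0.0.112 whole and no longer picks 99.1.1.1 out of 999.1.1.1.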

***
Also, my previous pattern still has a bug and needs some changes to correctly match both decimal and octal syntax.
connum at DONOTSPAMME dot googlemail dot com
04-Jun-2006 05:41
<?php
function autoCloseTags($string) {
// automatically close HTML-Tags
// (useful e.g. if you want to extract part of a blog entry or news as preview/teaser)
// coded by Constantin Gross <connum at googlemail dot com> / 3rd of June, 2006
// feel free to leave comments or to improve this function!

$donotclose=array('br','img','input'); //Tags that are not to be closed

//prepare vars and arrays
$tagstoclose='';
$tags=array();

//put all opened tags into an array
preg_match_all("/<(([A-Z]|[a-z]).*)(( )|(>))/isU",$string,$result);
$openedtags=$result[1];
$openedtags=array_reverse($openedtags); //this is just done so that the order of the closed tags in the end will be better

//put all closed tags into an array
preg_match_all("/<\/(([A-Z]|[a-z]).*)(( )|(>))/isU",$string,$result2);
$closedtags=$result2[1];

//look up which tags still have to be closed and put them in an array
for ($i=0;$i<count($openedtags);$i++) {
   if (in_array($openedtags[$i],$closedtags)) {
       unset($closedtags[array_search($openedtags[$i],$closedtags)]);
   } else {
       array_push($tags, $openedtags[$i]);
   }
}

$tags=array_reverse($tags); //now this reversion is done again for a better order of close-tags

//prepare the close-tags for output
for($x=0;$x<count($tags);$x++) {
   $add=strtolower(trim($tags[$x]));
   if(!in_array($add,$donotclose)) $tagstoclose.='</'.$add.'>';
}

//and finally
return $tagstoclose;
}
?>
slavomir dot hustaty at gmail dot com
28-Mar-2006 11:10
//<h1>some text</h1><b>bold</b><h1>some further text</h1>
//use this if you need what's between the tags :-)

class find_regex
{

   var $search_tag;
   var $result;
   //preg_match_all("/(<h1[^>]*>)([^<]*)(<\/h1>)/", $html, $matches);

   function __construct($tag = "h1")
   {
       $this->search_tag = $tag;
   }

   function parse($text_to_parse = "")
   {

       $regex = "/(<" . $this->search_tag . "[^>]*>)([^<]*)(<\/" . $this->search_tag . ">)/";

       preg_match_all( $regex , $text_to_parse , $matches );

       $this->result = $matches;

       return $matches[2];

   }

}
dave at mixd dot net
23-Mar-2006 12:18
Use this to capture all JavaScript code that is between <script> tags.

Takes into account javascript that generates HTML. This one took a while, so I thought I'd share it.

$delimeter = '/<script[^>]*>((?:[^<>"\']+(?:"[^"]*"|\'[^\']*\')*)+)<\/script>/i';

Note: For some reason php.net is filtering out my escape characters... If it doesn't work make sure you escape all single quotes and the forward slash.
phpnet at sinful-music dot com
20-Feb-2006 04:53
Here's some fleecy code to 1. validate RFC2822 conformity of address lists and 2. extract the address specification (the part commonly known as 'email'). I wouldn't suggest using it for input form email checking, but it might be just what you want for other email applications. I know it can be optimized further, but that part I'll leave up to you nutcrackers. The total length of the resulting regex is about 30000 bytes. That's because it accepts comments. You can remove that by setting $cfws to $fws, and it shrinks to about 6000 bytes. Conformity checking strictly follows RFC2822. Have fun and email me if you have any enhancements!

<?php
function mime_extract_rfc2822_address($string)
{
        //rfc2822 token setup
        $crlf           = "(?:\r\n)";
        $wsp            = "[\t ]";
        $text           = "[\\x01-\\x09\\x0B\\x0C\\x0E-\\x7F]";
        $quoted_pair    = "(?:\\\\$text)";
        $fws            = "(?:(?:$wsp*$crlf)?$wsp+)";
        $ctext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F" .
                          "!-'*-[\\]-\\x7F]";
        $comment        = "(\\((?:$fws?(?:$ctext|$quoted_pair|(?1)))*" .
                          "$fws?\\))";
        $cfws           = "(?:(?:$fws?$comment)*(?:(?:$fws?$comment)|$fws))";
        //$cfws           = $fws; //an alternative to comments
        $atext          = "[!#-'*+\\-\\/0-9=?A-Z\\^-~]";
        $atom           = "(?:$cfws?$atext+$cfws?)";
        $dot_atom_text  = "(?:$atext+(?:\\.$atext+)*)";
        $dot_atom       = "(?:$cfws?$dot_atom_text$cfws?)";
        $qtext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!#-[\\]-\\x7F]";
        $qcontent       = "(?:$qtext|$quoted_pair)";
        $quoted_string  = "(?:$cfws?\"(?:$fws?$qcontent)*$fws?\"$cfws?)";
        $dtext          = "[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F!-Z\\^-\\x7F]";
        $dcontent       = "(?:$dtext|$quoted_pair)";
        $domain_literal = "(?:$cfws?\\[(?:$fws?$dcontent)*$fws?]$cfws?)";
        $domain         = "(?:$dot_atom|$domain_literal)";
        $local_part     = "(?:$dot_atom|$quoted_string)";
        $addr_spec      = "($local_part@$domain)";
        $display_name   = "(?:(?:$atom|$quoted_string)+)";
        $angle_addr     = "(?:$cfws?<$addr_spec>$cfws?)";
        $name_addr      = "(?:$display_name?$angle_addr)";
        $mailbox        = "(?:$name_addr|$addr_spec)";
        $mailbox_list   = "(?:(?:(?:(?<=:)|,)$mailbox)+)";
        $group          = "(?:$display_name:(?:$mailbox_list|$cfws)?;$cfws?)";
        $address        = "(?:$mailbox|$group)";
        $address_list   = "(?:(?:^|,)$address)+";

        //output length of string (just so you see how f**king long it is)
        echo(strlen($address_list) . " ");

        //apply expression
        preg_match_all("/^$address_list$/", $string, $array, PREG_SET_ORDER);

        return $array;
};
?>
volkank at developera dot com
17-Feb-2006 03:23
Correct IP matching Pattern:

This is my new IP octet pattern, which seems to be correct:
$num="(25[0-5]|2[0-4]\d|[01]?\d\d|\d)";

/*
25[0-5]    => 250-255
2[0-4]\d  => 200-249
[01]?\d\d  => 00-99,000-199
\d        => 0-9
*/

GRABBING multiple Valid IP addresses from string

<?php
$num="(25[0-5]|2[0-4]\d|[01]?\d\d|\d)";
$test="127.0.0.112 10.0.0.2";
preg_match_all("/$num\\.$num\\.$num\\.$num/",$test,$match);
print_r($match);
?>

Single IP validation
<?php
$num="(25[0-5]|2[0-4]\d|[01]?\d\d|\d)";
$ip_addr='009.111.111.100';
if (!preg_match("/^$num\\.$num\\.$num\\.$num$/", $ip_addr,$match)) {
   echo "Wrong IP Address\n";
} else {
   echo $match[0];
}
?>
bgamrat at wirehopper dot com
13-Feb-2006 03:19
The double backslashes in the following post should be replaced by single backslashes.
bgamrat at wirehopper dot com
07-Feb-2006 11:54
I used these regular expressions to get the references from a page.  The function run_preg lists the references found.

$url = "http://test.com";
$text=@file_get_contents($url);
if ($text)
{
  $src_href_url=run_preg($text,
   "/(?:(?:src|href|url)\\s*[=\\(]\\s*[\"'`])".
   "([\\+\\w:?=@&\\/#._;-]+)(?:[\\s\"'`])/i");
  $windows=run_preg($text,
   "/(?:window.open\\s*\\(\\s*[\\w-]*\\s*[,]\\s*[\"`'])".
   "([\\+\\w:?=@&\\/#._;-]*)(?:[\"'`]\\s*)/i");
}

function run_preg($text,$pattern) {

   preg_match_all ($pattern, $text, $matches);

   if (count($matches)>0)
       if (count($matches[1])>0)
               foreach ($matches[1] as $k => $v)
                       echo "$k: $v\n";

   return (is_array($matches)) ? $matches[1]:FALSE;
}

Thanks to http://us2.php.net/manual/en/function.preg-match.php#58505
for giving me a good starting point.

Hope others find this useful.  :)
mnc at u dot nu
03-Feb-2006 02:05
PREG_OFFSET_CAPTURE always seems to provide byte offsets, rather than character position offsets, even when you are using the unicode /u modifier.
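A short sketch of the byte-versus-character difference, and one way to convert between them. It assumes the mbstring extension is available for mb_strlen(); the sample string and variable names are made up for illustration.

```php
<?php
// "héllo wörld" written with explicit UTF-8 bytes so the file encoding does not matter.
$subject = "h\xC3\xA9llo w\xC3\xB6rld";

preg_match_all('/w\S+/u', $subject, $m, PREG_OFFSET_CAPTURE);
list($text, $byteOffset) = $m[0][0];

// Count the characters contained in the bytes that precede the match.
$charOffset = mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');

echo "byte offset: $byteOffset, character offset: $charOffset\n";
```

The byte offset is the one to pass to substr() or to the offset parameter of preg_match(); the character offset is only needed for display or for mb_* functions.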
egingell at sisna dot com
01-Feb-2006 11:31
Try this for a preg_match_all that takes an array of regular expressions.

<?php
// Emulates preg_match_all() but takes an array instead of a string.
// Returns an array containing all of the matches.
// The return array is an array containing the arrays normally returned by
//    preg_match_all() with the optional third parameter supplied.
function preg_search($ary, $subj) {
   $matched = array();
   if (is_array($ary)) {
       foreach ($ary as $v) {
           preg_match_all($v, $subj, $matched[]);
       }
   } else {
       preg_match_all($ary, $subj, $matched[]);
   }
   return $matched;
}
?>
18-Dec-2005 10:16
To match all occurrences between and including any two HTML tags, here <tr> and </tr>:

preg_match_all("/(\<[ \\n\\r\\t]{0,}tr[^>]*\>|\<[^>]*[\\n\\r\\t]{1,}tr[^>]*\>){1}
([^<]*<([^(\/>)]*(\/[^(t>)]){0,1}(\/t[^(r>)]){0,1})*>)*
(\<[ \\n\\r\\t]{0,}\/tr[^>]*\>|\<[^>]*[\\n\\r\\t]{1,}\/tr[^>]*\>){1}/i", $string, $Matches);
php at projectjj dot com
09-Dec-2005 04:43
Re: webmaster at swirldrop dot com

If you want to get a string with all the 'normal' characters, this may be better:

$clean = preg_replace('/\W+/', '', $dirty);

\W is the opposite of \w and will match any character that is not a letter or digit or the underscore character, plus it respects the current locale. Use [^0-9a-zA-Z]+ instead of \W if you need ASCII-only.
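The two character classes can be compared on a small input. This is only a sketch; the sample string is made up.

```php
<?php
$dirty = 'Hello, World_42!';

// \W+ strips everything except letters, digits and the underscore.
$word_chars = preg_replace('/\W+/', '', $dirty);

// The explicit class also drops the underscore and never depends on the locale.
$ascii_only = preg_replace('/[^0-9a-zA-Z]+/', '', $dirty);

echo "$word_chars\n$ascii_only\n";
```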
htp
08-Dec-2005 05:29
Just a quick note regarding the post by webmaster at swirldrop dot com. The regex doesn't match alpha-numerics: it doesn't actually match numerics, just alphas. You might want to add 0-9 if that was the intent.
pablo dot seb at gmail dot com
16-Jun-2005 09:48
By assigning a name to a capturing group, you can easily reference it by name. (?P<name>group) captures the match of group into the backreference "name". You can reference the contents of the group with the numbered backreference or the named backreference

<?php

preg_match_all('|(a)(?P<x>b)(?P<y>c)(d)|','abcdefgabcdefg',$sub);

echo $sub[2][0]; //b

echo '<br />';

echo $sub['y'][0]; //c

?>

Pablo from Salto, Uruguay
webmaster at m-bread dot com
07-Jun-2005 09:45
Looking at the function from rickyale at ig dot com dot br below getting URLs from an html file, I think this is slightly better:

function get_urls($string, $strict=true) {

   $types = array("href", "src", "url");
   $ret = array();
   foreach ($types as $type) {
       $innerT = $strict?'[a-z0-9:?=&@/._-]+?':'.+?';
       preg_match_all ("|$type\=([\"'`])(".$innerT.")\\1|i", $string, $matches);
       $ret[$type] = $matches[2];
   }

return $ret;
};

This only gets urls in quotes "...", `...` or '...', but not mixed quotes like `..." (thanks to w w w's note on the 'pattern syntax' page). If you set the second parameter to false, then the function will give you any contents of attribute (so the function can be used to get other attributes, such as alt). To make it more strict, the '[a-z0-9:?=&@/._-]+?' can be replaced with a regular expression for a url.
webmaster at swirldrop dot com
07-Jun-2005 08:40
If you want to get all the text characters from a string, possibly entered by a user, and filter out all the non alpha-numeric characters (perhaps to make an ID to enter user-submitted details into a database record), then you can use the function below. It returns a string of only the alpha-numeric characters from the input string (all in lower case), with all other characters removed:

<?php
function getText($string){
preg_match_all('/(?:([a-z]+)|.)/i', $string, $matches);
return strtolower(implode('', $matches[1]));
};
//EoFn getText
?>

It took me quite a while to come up with this regular expression. I hope it saves someone else that time.
20-Apr-2005 11:35
A little correction to my function below:

<?php
function urlhighlight($str) {
   preg_match_all("/http:\/\/?[^ ][^<]+/i",$str,$lnk);
   $size = sizeof($lnk[0]);
   $i = 0;
   while ($i < $size) {
       $len = strlen($lnk[0][$i]);
       if($len > 30) {
           $lnk_txt = substr($lnk[0][$i],0,30)."...";
       } else {
           $lnk_txt = $lnk[0][$i];
       }
       $ahref = $lnk[0][$i];
       $str = str_replace($ahref,"<a href='$ahref' target='_blank'>$lnk_txt</a>",$str);
       $i++;
   }
   return $str;
}
?>

The error is in the preg_match_all("/http:\/\/?[^ ][^<]+/i",$str,$lnk); the [^<] was missing.
Dan Madsen
20-Apr-2005 09:25
I wrote a function which takes URLs from a string or database output, turns them into links, and shortens the link text if it is longer than 30 characters.

Note: You'll have to use the nl2br() function on the string before using it, because I didn't know how to check for line feed or carriage return in preg style.

<?php
function urlhighlight($str) {
   preg_match_all("/http:\/\/?[^ ]+/i",$str,$lnk);
   $size = sizeof($lnk[0]);
   $i = 0;
   while ($i < $size) {
       $len = strlen($lnk[0][$i]);
       if($len > 30) {
           $lnk_txt = substr($lnk[0][$i],0,30)."...";
       } else {
           $lnk_txt = $lnk[0][$i];
       }
       $ahref = $lnk[0][$i];
       $str = str_replace($ahref,"<a href='$ahref'>$lnk_txt</a>",$str);
       $i++;
   }
   return $str;
}
?>
Ex:
<?php
$str = "a lot of text with urls in it and alot of linebreaks";
$str = urlhighlight(nl2br($str));
?>
b2sing4u at naver dot com
09-Apr-2005 06:42
This function converts all HTML style decimal character code to hexadecimal code.
ex) Hi &#959; &#9674; Dec  ->  Hi &#x03BF; &#x25CA; Dec

function d2h($word) {
  $n = preg_match_all("/&#(\d+?);/", $word, $match, PREG_PATTERN_ORDER);
  for ($j = 0; $j < $n; $j++) {
   $word = str_replace($match[0][$j], sprintf("&#x%04X;", $match[1][$j]), $word);
  }
  return($word);
}

And this function converts all HTML style hexadecimal character codes to decimal codes.
ex) Hello &#x03BF; &#x25CA; Hex  ->  Hello &#959; &#9674; Hex

function h2d($word) {
  $n = preg_match_all("/&#x([0-9a-fA-F]+?);/", $word, $match, PREG_PATTERN_ORDER);
  for ($j = 0; $j < $n; $j++) {
   $word = str_replace($match[0][$j], sprintf("&#%u;", hexdec($match[1][$j])), $word);
  }
  return($word);
}
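The same two conversions can also be done in a single pass with preg_replace_callback() instead of a match-then-str_replace loop. The sketch below is an alternative, not the functions above: the names d2h_callback() and h2d_callback() are hypothetical, and the anonymous functions assume PHP 5.3 or later.

```php
<?php
// Hypothetical single-pass variants of d2h()/h2d() built on preg_replace_callback().
function d2h_callback($word)
{
    return preg_replace_callback('/&#(\d+);/', function ($m) {
        return sprintf('&#x%04X;', (int) $m[1]);
    }, $word);
}

function h2d_callback($word)
{
    return preg_replace_callback('/&#x([0-9a-fA-F]+);/', function ($m) {
        return sprintf('&#%u;', hexdec($m[1]));
    }, $word);
}

echo d2h_callback('Hi &#959; &#9674; Dec'), "\n";
echo h2d_callback('Hello &#x03BF; &#x25CA; Hex'), "\n";
```

Each entity is rewritten where it is found, so the subject is scanned once regardless of how many entities it contains.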
b2sing4u
07-Apr-2005 05:24
Character Code Conversion Example.

You can use following example to convert character code in HTML file.

First example converts Hexadecimal code to Decimal code.
  ex) Hello &#xFF; Hex -> Hello &#255; Hex

Second example converts Decimal code to Hexadecimal code.
  ex) Hi &#16; Dec -> Hi &#x0010; Dec

<?php

$h2d_get = fopen("h2d_get.htm", 'r');
$h2d_out = fopen("h2d_out.htm", 'w');

for ($i = 1; $i <= 1000; $i++)
{
  if (feof($h2d_get)) { break; }

  $line = fgets($h2d_get, 409600);
  $line = trim($line);
  if ($line == "99999999") { break; }

  $n = preg_match_all("/&#x([0-9a-fA-F]+?);/", $line, $match, PREG_PATTERN_ORDER);

  for ($j = 0; $j < $n; $j++)
  {
    $find = $match[0][$j];
    $code = hexdec($match[1][$j]);
    $push = sprintf("&#%u;", $code);
    $line = str_ireplace($find, $push, $line); // literal, case-insensitive replacement
  }

  fwrite($h2d_out, $line);
  fwrite($h2d_out, "\r\n");
}

fclose($h2d_get);
fclose($h2d_out);

?>

<?php

$d2h_get = fopen("d2h_get.htm", 'r');
$d2h_out = fopen("d2h_out.htm", 'w');

for ($i = 1; $i <= 1000; $i++)
{
  if (feof($d2h_get)) { break; }

  $line = fgets($d2h_get, 409600);
  $line = trim($line);
  if ($line == "99999999") { break; }

  $n = preg_match_all("/&#(\d+?);/", $line, $match, PREG_PATTERN_ORDER);

  for ($j = 0; $j < $n; $j++)
  {
    $find = $match[0][$j];
    $code = $match[1][$j];
    $push = sprintf("&#x%04X;", $code);
    $line = str_ireplace($find, $push, $line); // literal, case-insensitive replacement
  }

  fwrite($d2h_out, $line);
  fwrite($d2h_out, "\r\n");
}

fclose($d2h_get);
fclose($d2h_out);

?>
arias at elleondeoro dot com
15-Feb-2005 08:27
If you want to find the positions and lengths of all matches, you can use the following function:

<?php
function preg_match_all_positions($pattern, $subject, &$count=null, $flags=0, $offset=0) {
  $positions = array();
  for ($count=0; preg_match($pattern, $subject, $match, $flags, $offset); $count++) {
    $positions[0][] = $pos = strpos($subject, $match[0], $offset);
    $positions[1][] = $len = strlen($match[0]);
    $offset = $pos+$len;
  }
  return $positions;
}
?>
mpbweb at mbourque dot com
03-Feb-2005 02:41
Here is a handy function I wrote that will check for broken links on the supplied url.

function dead_links($url) {

// mixed dead_links( $url )
// Returns:
//    FALSE if no broken links are found.
//    ARRAY containing broken links if any are found.

   ob_start();
     if( !readfile($url) ) return FALSE;
     $body = ob_get_contents();
   ob_end_clean();

   $pathparts = pathinfo($url);

   $urlpattern = "/<a[^>]+href=\"([^\"]+)/i";
   preg_match_all($urlpattern,$body,$matches);

   $linkArray = array();
   foreach( $matches[1] as $link) {

     if( strpos($link,"http://") === FALSE ) { // Deal with relative paths
         $link = $pathparts['dirname'] . "/" . $link;
     }

     $fp = @fopen("$link", "r");
     if (!$fp) {
         $linkArray[] = $link; // could not be opened: treat as broken
     } else {
         fclose($fp);
     }

   }

   return ( count($linkArray) > 0 ) ? $linkArray : FALSE;
}

Regards,

Michael Bourque
MCLD
20-Jan-2005 06:35
Here's a nice easy use for preg_match_all. I have data files in comma-separated-values format, with all the data enclosed in quote marks. To convert one line of such a data file into an array:

function quotedCsvLineToArray($l)
{
  preg_match_all('/(?<=,|\A)("(.*?)")?(?=,|\Z)/',$l, $matches, PREG_PATTERN_ORDER);
  return $matches[2];
}
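A quick usage sketch follows. The function is repeated so the snippet runs on its own, and the sample line is made up. Fields that themselves contain quote marks are outside what this pattern was written for; PHP's built-in str_getcsv() (PHP 5.3 and later) may be a better fit for those.

```php
<?php
// Same function as above, repeated so the snippet is self-contained.
function quotedCsvLineToArray($l)
{
  preg_match_all('/(?<=,|\A)("(.*?)")?(?=,|\Z)/', $l, $matches, PREG_PATTERN_ORDER);
  return $matches[2];
}

$fields = quotedCsvLineToArray('"apple","banana split","cherry"');
print_r($fields);
```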

hope it helps
dan
hex6ng at yahoo dot com
03-Jul-2004 06:04
This is a much more efficient version of the same function posted in ereg_replace() discussion by hdn, who is the same person as hex6ng.  I didn't include activating urls without http:// protocol identifier because there are many xxx.xxx patterns that are not urls.

function html_activate_urls($str)
{
   // lift all links, images and image maps
   $url_tags = array (
                     "'<a[^>]*>.*?</a>'si",
                     "'<map[^>]*>.*?</map>'si",
                     "'<script[^>]*>.*?</script>'si",
                     "'<style[^>]*>.*?</style>'si",
                     "'<[^>]+>'si"
                     );
   foreach($url_tags as $url_tag)
   {
       preg_match_all($url_tag, $str, $matches, PREG_SET_ORDER);
       foreach($matches as $match)
       {
           $key = "<" . md5($match[0]) . ">";
           $search[] = $key;
           $replace[] = $match[0];
       }
   }

   $str = str_replace($replace, $search, $str);

   // indicate where urls end if they have these trailing special chars
   $sentinals = array("/&(quot|#34);/i",        // Replace html entities
                       "/&(lt|#60);/i",
                       "/&(gt|#62);/i",
                       "/&(nbsp|#160);/i",
                       "/&(iexcl|#161);/i",
                       "/&(cent|#162);/i",
                       "/&(pound|#163);/i",
                       "/&(copy|#169);/i");

   $str = preg_replace($sentinals, "<marker>\\0", $str);

   // URL into links
   $str = preg_replace( "|\w{3,10}://[\w\.\-_]+(:\d+)?[^\s\"\'<>\(\)\{\}]*|",
                    "<a href=\"\\0\">\\0</a>", $str );

   $str = str_replace("<marker>", '', $str);
   return str_replace($search, $replace, $str);
}

-hdn
vb_user at yahoo dot com
22-Apr-2004 12:00
If you want to extract the list of PHP functions in one of your libraries (i.e. includes), for documentation or any other purpose, use the code below:

$filename = 'library.php';
$fp = fopen($filename,'r');
if ($fp !== false) {
   $str = fread($fp, filesize ($filename));
   $count = preg_match_all ("|function[ ]+(.*)[\(](.*)[\)]|U", $str, $out, PREG_PATTERN_ORDER);

   for ($i=0; $i<$count; $i++) {
       if (stripos($out[1][$i], 'array') === false) {
           echo '#T='.$out[1][$i]."\n";
           echo $out[1][$i].'('.$out[2][$i].')'."\n\n";
       }
   }
}
fabriceb at gmx dot net
05-Mar-2004 10:55
If you just want to find out how many times a string contains another simple string, don't use preg_match_all like I did before I found the substr_count function.

Use
<?php
$nrMatches = substr_count('foobarbar', 'bar');
?>
instead. Hope this helps some other people like me who like to think too complicated :-)
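For comparison, here are the two approaches side by side. This is a small sketch with made-up variable names; preg_quote() is used so that the needle is treated literally by the regex engine.

```php
<?php
$haystack = 'foobarbar';
$needle   = 'bar';

// Plain substring counting, no regex engine involved.
$plain = substr_count($haystack, $needle);

// The regex equivalent; preg_quote() keeps the needle literal.
$regex = preg_match_all('/' . preg_quote($needle, '/') . '/', $haystack, $unused);

echo "$plain $regex\n";
```

Both count non-overlapping occurrences, so for a fixed string the simpler function gives the same answer.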