get_meta_tags

(PHP 3 >= 3.0.4, PHP 4, PHP 5)

get_meta_tags --  从一个文件中提取所有的 meta 标签 content 属性,返回一个数组

描述

array get_meta_tags ( string filename [, int use_include_path] )

打开 filename 逐行解析文件中的 <meta> 标签。此参数可以是本地文件也可以是一个 URL。解析工作将在 </head> 处停止。

use_include_path 设置为 1 将促使 PHP 尝试按照 include_path 标准包含路径中的每个指向去打开文件。这只用于本地文件,不适用于 URL。

例子 1. get_meta_tags() 解析些什么

<meta name="author" content="name">
<meta name="keywords" content="php documentation">
<meta name="DESCRIPTION" content="a php manual">
<meta name="geo.position" content="49.33;-86.59">
</head> <!-- 解析工作在此处停止 -->
(注意回车换行 - PHP 使用一个本地函数来解析输入,所以 Mac 上的文件将不能在 Unix 上正常工作)。

返回的关联数组以属性 name 的值作为键,属性 content 的值作为值,所以你可以很容易地使用标准数组函数遍历此关联数组或访问某个值。属性 name 中的特殊字符将使用‘_’替换,而其它字符则转换成小写。如果有两个 meta 标签拥有相同的 name,则只返回最后出现的那一个。

例子 2. get_meta_tags() 的返回值

<?php
// 假设上边的标签是在 www.example.com 中
$tags = get_meta_tags('http://www.example.com/');

// 注意所有的键(key)均为小写,而键中的‘.’则转换成了‘_’。
print $tags['author'];       // name
print $tags['keywords'];     // php documentation
print $tags['description'];  // a php manual
print $tags['geo_position']; // 49.33;-86.59
?>

注: 从 PHP 4.0.5 开始,get_meta_tags() 支持没有使用引号括起来的 HTML 属性。

参见 htmlentities()urlencode()


add a note add a note User Contributed Notes
roganty at gmail dot com
19-Aug-2006 06:33
This is a slight amendment to jimmyxx at gmail dot com function

I tried using the regex displayed in his code, and php threw up a couple of errors

Below is the correct regular expression that works
(Please note that I had to split the regex into strings because php.net was complaining about the line being to long)
<?php
preg_match_all
(
  
"|<meta[^>]+name=\"([^\"]*)\"[^>]" . "+content=\"([^\"]*)\"[^>]+>|i",
  
$html, $out,PREG_PATTERN_ORDER);
?>

The problem was due to the quotes being incorrectly escaped.
I hope this helps anyone who has been having problems with his code
jimmyxx at gmail dot com
21-Sep-2005 03:37
I used this as part of my mini php search based search engine - it really slowed the whole thing down. I wrote this function to read HTML (just fetch the file or use something like snoopy) and extract the meta data via a simple regex, works a treat and made my crawler much faster:

<?php

function get_meta_data($html) {

  
preg_match_all(
  
"|<meta[^>]+name=\\"([^"]*)\\"[^>]+content="([^\\"]*)"[^>]+>|i"$html, $out,PREG_PATTERN_ORDER);

   for (
$i=0;$i < count($out[1]);$i++) {
      
// loop through the meta data - add your own tags here if you need
      
if (strtolower($out[1][$i]) == "keywords") $meta['keywords'] = $out[2][$i];
       if (
strtolower($out[1][$i]) == "description") $meta['description'] = $out[2][$i];
   }

return
$meta;   
}

?>
mariano at cricava dot com
13-Sep-2005 08:18
Based on Michael Knapp's code, and adding some regex, here's a function that will get all meta tags and the title based on a URL. If there's an error, it will return false. Using the function getUrlContents(), also included, it takes care of META REFRESH re-directions, following up to the specified number of redirections. Please note that the regular expressions included were split into strings because php.net was complaining about the line being to long ;)

<?php
function getUrlData($url)
{
  
$result = false;
  
  
$contents = getUrlContents($url);

   if (isset(
$contents) && is_string($contents))
   {
      
$title = null;
      
$metaTags = null;
      
      
preg_match('/<title>([^>]*)<\/title>/si', $contents, $match );

       if (isset(
$match) && is_array($match) && count($match) > 0)
       {
          
$title = strip_tags($match[1]);
       }
      
      
preg_match_all('/<[\s]*meta[\s]*name="?' . '([^>"]*)"?[\s]*' . 'content="?([^>"]*)"?[\s]*[\/]?[\s]*>/si', $contents, $match);
      
       if (isset(
$match) && is_array($match) && count($match) == 3)
       {
          
$originals = $match[0];
          
$names = $match[1];
          
$values = $match[2];
          
           if (
count($originals) == count($names) && count($names) == count($values))
           {
              
$metaTags = array();
              
               for (
$i=0, $limiti=count($names); $i < $limiti; $i++)
               {
                  
$metaTags[$names[$i]] = array (
                      
'html' => htmlentities($originals[$i]),
                      
'value' => $values[$i]
                   );
               }
           }
       }
      
      
$result = array (
          
'title' => $title,
          
'metaTags' => $metaTags
      
);
   }
  
   return
$result;
}

function
getUrlContents($url, $maximumRedirections = null, $currentRedirection = 0)
{
  
$result = false;
  
  
$contents = @file_get_contents($url);
  
  
// Check if we need to go somewhere else
  
  
if (isset($contents) && is_string($contents))
   {
      
preg_match_all('/<[\s]*meta[\s]*http-equiv="?REFRESH"?' . '[\s]*content="?[0-9]*;[\s]*URL[\s]*=[\s]*([^>"]*)"?' . '[\s]*[\/]?[\s]*>/si', $contents, $match);
      
       if (isset(
$match) && is_array($match) && count($match) == 2 && count($match[1]) == 1)
       {
           if (!isset(
$maximumRedirections) || $currentRedirection < $maximumRedirections)
           {
               return
getUrlContents($match[1][0], $maximumRedirections, ++$currentRedirection);
           }
          
          
$result = false;
       }
       else
       {
          
$result = $contents;
       }
   }
  
   return
$contents;
}
?>

Here's an example of its usage. Check that the included URL has a META REFRESH redirection:

<?php
$result
= getUrlData('http://www.marianoiglesias.com.ar/');

echo
'<pre>'; print_r($result); echo '</pre>';

?>

For the above code the output would be:

<?php
Array
(
   [
title] => Mariano Iglesias: El Eternauta   
  
[metaTags] => Array
       (
           [
description] => Array
               (
                   [
html] => <meta name="description" content="Java, PHP, and some other technological mumble jumble. Also, some real-life stuff as well." />
                   [
value] => Java, PHP, and some other technological mumble jumble. Also, some real-life stuff as well.
               )

           [
DC.title] => Array
               (
                   [
html] => <meta name="DC.title" content="Mariano Iglesias - Weblog" />
                   [
value] => Mariano Iglesias - Weblog
              
)

           [
ICBM] => Array
               (
                   [
html] => <meta name="ICBM" content="-34.6017, -58.3956" />
                   [
value] => -34.6017, -58.3956
              
)

           [
geo.position] => Array
               (
                   [
html] => <meta name="geo.position" content="-34.6017;-58.3956" />
                   [
value] => -34.6017;-58.3956
              
)

           [
geo.region] => Array
               (
                   [
html] => <meta name="geo.region" content="AR-BA">
                   [
value] => AR-BA
              
)

           [
geo.placename] => Array
               (
                   [
html] => <meta name="geo.placename" content="Buenos Aires">
                   [
value] => Buenos Aires
              
)

       )

)
?>
Michael Knapp
12-Mar-2005 04:47
Tim's code is good (thanks Tim), except it won't work very well if the tag is part of a long non-breaking string.

E.g. try getting the title from Google Maps (http://www.google.com/maps).

A better solution is:

<?php
$title
= "";
  
if (
$fp = @fopen( $_POST['url'], 'r' )) {

  
$cont = "";
  
  
// read the contents
  
while( !feof( $fp ) ) {
      
$buf = trim(fgets( $fp, 4096 )) ;
      
$cont .= $buf;
   }

  
// get tag contents
  
@preg_match( "/<title>([a-z 0-9]*)<\/title>/si", $cont, $match );
  
  
// tag contents
  
$title = strip_tags(@$match[ 1 ]);
}

?>

Note the strip_tags. Another thing to be careful of is to check for ", <, and >. You will need to strip those out if you are posting the output to a form.

Also, it is probably best to use the /i modifier, because some people might code <TITLE> etc...
rehfeld
06-Feb-2005 10:45
in response to
jp at webgraphe dot com

this function grabs meta tags, not http headers

if you need the headers

<?php

$fp
= fopen('http://example.org/somepage.html', 'r');

// the variable $http_response_header magically appears
print_r($http_response_header);

// or
$meta_data = stream_get_meta_data($fp);
print_r($meta_data);

?>
tim dot bennett at haveaniceplay dot com
01-Feb-2005 11:43
If you want to get the contents of tags other than meta you can use:

<?php

$page
= "http://www.mysite.com/apage.php";

  
// tags
  
$start = '<atag>';
  
$end = '<\/atag>';

  
// open the file
  
$fp = fopen( $page, 'r' );

  
$cont = "";

  
// read the contents
  
while( !feof( $fp ) ) {
      
$buf = trim( fgets( $fp, 4096 ) );
      
$cont .= $buf;
   }
  
  
// get tag contents
  
preg_match( "/$start(.*)$end/s", $cont, $match );

  
// tag contents
  
$contents = $match[ 1 ];

?>
jp at webgraphe dot com
13-Dec-2003 09:37
If the URL is doing a redirection using the headers (like you would do with PHP function header("Location: URL");), the page has no content (in general). It appears get_meta_tags() doesn't catch that kind of redirection (like cURL would do) and it lead me to a timeout of my script.

I experienced this in a spider I wrote in order to feed my database of all available pages on my site and one link was linking to a page that simply has the following code:

<?php
  header
("Location: sections.php?section=home");
  exit();
?>

That made my script hang for a moment and apparently, get_meta_tags() wasn't even able to return me an error.

JP.
bill.neumann at hatworld.com
13-Mar-2003 11:50
The get_meta_tags function does not seem to be able to grab values if there are spaces between the attribute, the equal sign, and the opening quote marks.
21-Dec-2001 03:01
Tested PHP 4.0.6<br>
get_meta_tags() seems to look only in the beginning of a file, meaning that e.g. if there is a lot of PHP code before the HTML header it will return nothing ...<br>
Tested using get_meta_tags() on local files with about 9000 characters of PHP code before HTML HEADER.<br>
Workaround: if possible move code after header or if not: include a file.
Ben dot Davis at furman dot edu
31-Jul-2001 06:29
I have found that for large searches, get_meta_tags is very slow.  I created a large search engine for a website that couldnt use a database and I first tried pulling out the meta tags. 
I have found that it is actually much faster to use eregi to pull out the meta tags.  This code below pulls out the description:

if (eregi ("<meta name=\"description\" content=[^>]*", $contents, $descresult))
                               {
                                   $description = explode("<meta name=\"description\" content=", $descresult[0]);
                                   echo "<font face=\"Arial\" size=2>$description[1]</font>";
                                  
                               }
richard at pifmagazine dot com
22-Apr-2000 08:17
Something that is not mentioned above and should be : When using get_meta_tags on a remote PHP page the page will be parsed before the meta tags are returned - so you can capture meta tags generated dynamically (by PHP??) on the remote end.
<p>
This DOES NOT work the same way when getting meta tags on local file systems.  Local files are not parsed through the web server before returning to get_meta_tags().  If the META tag is hard-coded into the page, you'll be fine - but if it dynamically generated you will not be able to capture it unless you use the full URL when calling your local files.
richard at pifmagazine dot com
22-Apr-2000 08:00
An Important Note about META tags and this function :  if your META tag contains newline "\n"  characters, get_meta_tags() will return a NULL value for that name property.  Removing the newlines from the source META tag corrects the problem.