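CLVI. XML Parser Functions

Introduction

XML (eXtensible Markup Language) is a data format for structured document interchange on the Web. It is a standard defined by the W3C (World Wide Web Consortium). More information about XML and related technologies is available at http://www.w3.org/XML/.

This extension provides support for James Clark's expat. That toolkit lets you parse XML documents, but not validate them. It supports three source character encodings, which are also supported by PHP itself: US-ASCII, ISO-8859-1, and UTF-8. UTF-16 is not supported.

This extension lets you create XML parsers and define handlers for different XML events. Each XML parser also has a few parameters that you can adjust as needed.

Requirements

This extension uses an expat compat layer by default. It can also use expat itself, which is available at http://www.jclark.com/xml/expat.html. The Makefile that ships with expat does not build a library by default; the following make rule can be used for that: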
libexpat.a: $(OBJS)
    ar -rc $@ $(OBJS)
    ranlib $@

A source RPM package of expat is available from http://sourceforge.net/projects/expat/.

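Installation

These functions are enabled by default, using the bundled expat library. You can disable XML support with --disable-xml. If you compile PHP as a module for Apache 1.3.9 or later, PHP will automatically use the expat library bundled with Apache. If you do not want to use the bundled expat library, run PHP's configure script with --with-expat-dir=DIR, where DIR should point to the base installation directory of expat.

The Windows version of PHP has built-in support for this extension. You do not need to load any additional extension in order to use these functions.

Runtime Configuration

This extension has no configuration directives defined in php.ini.

Resource Types

xml

The xml resource returned by xml_parser_create() and xml_parser_create_ns() references an XML parser instance, to be used with the functions provided by this extension.

Predefined Constants

The constants below are defined by this extension, and will only be available when the extension has either been compiled into PHP or dynamically loaded at runtime.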

XML_ERROR_NONE (integer)

XML_ERROR_NO_MEMORY (integer)

XML_ERROR_SYNTAX (integer)

XML_ERROR_NO_ELEMENTS (integer)

XML_ERROR_INVALID_TOKEN (integer)

XML_ERROR_UNCLOSED_TOKEN (integer)

XML_ERROR_PARTIAL_CHAR (integer)

XML_ERROR_TAG_MISMATCH (integer)

XML_ERROR_DUPLICATE_ATTRIBUTE (integer)

XML_ERROR_JUNK_AFTER_DOC_ELEMENT (integer)

XML_ERROR_PARAM_ENTITY_REF (integer)

XML_ERROR_UNDEFINED_ENTITY (integer)

XML_ERROR_RECURSIVE_ENTITY_REF (integer)

XML_ERROR_ASYNC_ENTITY (integer)

XML_ERROR_BAD_CHAR_REF (integer)

XML_ERROR_BINARY_ENTITY_REF (integer)

XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF (integer)

XML_ERROR_MISPLACED_XML_PI (integer)

XML_ERROR_UNKNOWN_ENCODING (integer)

XML_ERROR_INCORRECT_ENCODING (integer)

XML_ERROR_UNCLOSED_CDATA_SECTION (integer)

XML_ERROR_EXTERNAL_ENTITY_HANDLING (integer)

XML_OPTION_CASE_FOLDING (integer)

XML_OPTION_TARGET_ENCODING (integer)

XML_OPTION_SKIP_TAGSTART (integer)

XML_OPTION_SKIP_WHITE (integer)

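Event Handlers

The XML event handlers are defined as follows:

Table 1. Supported XML handlers

PHP function to set handler: event description
xml_set_element_handler(): Element events are issued whenever the XML parser encounters a start tag or an end tag. There are separate handlers for start tags and end tags.
xml_set_character_data_handler(): Character data is, roughly, all the non-markup content of an XML document, including whitespace between tags. Note that the XML parser does not add or remove any whitespace; it is up to the application (you) to decide whether whitespace is significant.
xml_set_processing_instruction_handler(): PHP programmers should already be familiar with processing instructions (PIs). <?php ?> is a processing instruction, where php is called the "PI target". The handling of PIs is application-specific, except that all PI targets starting with "XML" are reserved.
xml_set_default_handler(): Whatever is not passed to another handler goes to the default handler. This includes things like the XML declaration and the document type declaration.
xml_set_unparsed_entity_decl_handler(): This handler is called for the declaration of an unparsed (NDATA) entity.
xml_set_notation_decl_handler(): This handler is called for the declaration of a notation.
xml_set_external_entity_ref_handler(): This handler is called when the XML parser finds a reference to an external parsed general entity. The reference can point to a file or a URL, for example. See the external entity example for a demonstration.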
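
As a rough illustration of how several of these handlers are registered together, the following sketch parses a small in-memory document and records which handler fired. The document string, the $events array, and the use of closures instead of named functions are assumptions made for this illustration only.

```php
<?php
// Sketch: record which handlers fire while parsing a small in-memory document.
// The document string and the $events array are illustrative assumptions.
$events = [];

$parser = xml_parser_create();

// One call registers both the start-tag and the end-tag handler.
xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$events) { $events[] = "start:$name"; },
    function ($p, $name) use (&$events) { $events[] = "end:$name"; }
);

// Character data may arrive in several pieces; whitespace-only pieces are skipped here.
xml_set_character_data_handler($parser, function ($p, $data) use (&$events) {
    if (trim($data) !== '') {
        $events[] = "text:" . trim($data);
    }
});

// Processing instructions are passed as a target plus the raw data.
xml_set_processing_instruction_handler($parser, function ($p, $target, $data) use (&$events) {
    $events[] = "pi:$target";
});

// The third argument marks this string as the final (here: only) piece of input.
$ok = xml_parse($parser, '<doc><?demo some data?><item>hello</item></doc>', true);

print_r($events);
```

Element names show up in upper case in the recorded events because case folding is enabled by default; the PI target is passed through unchanged.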

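Case Folding

The element handler functions may get their element names case-folded. Case folding is defined by the XML standard as "a process applied to a sequence of characters, in which those identified as non-uppercase are replaced by their uppercase equivalents". In other words, when it comes to XML, case folding simply means uppercasing.

By default, all element names passed to the handler functions are case-folded. This behaviour can be queried and controlled per XML parser with the xml_parser_get_option() and xml_parser_set_option() functions, respectively.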
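
The effect of the option can be seen by parsing the same document twice. This is a minimal sketch; the helper name elementNames() and the sample document are hypothetical and only serve the illustration.

```php
<?php
// Sketch: collect element names with case folding switched on or off.
// elementNames() is a hypothetical helper, not part of the extension.
function elementNames($xml, $fold)
{
    $names = [];
    $parser = xml_parser_create();
    // 1 enables case folding (the default), 0 disables it.
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, $fold ? 1 : 0);
    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$names) { $names[] = $name; },
        function ($p, $name) { /* end tags are ignored in this sketch */ }
    );
    xml_parse($parser, $xml, true);
    return $names;
}

$folded = elementNames('<Chapter><para/></Chapter>', true);   // names are upper-cased
$asIs   = elementNames('<Chapter><para/></Chapter>', false);  // names keep their original case

print_r($folded);
print_r($asIs);
```

Disabling case folding is usually the safer choice when element names are compared against names from a case-sensitive vocabulary.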

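Error Codes

The following constants are defined for XML error codes. When xml_parse() reports a failure, the code can be retrieved with xml_get_error_code():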

XML_ERROR_NONE
XML_ERROR_NO_MEMORY
XML_ERROR_SYNTAX
XML_ERROR_NO_ELEMENTS
XML_ERROR_INVALID_TOKEN
XML_ERROR_UNCLOSED_TOKEN
XML_ERROR_PARTIAL_CHAR
XML_ERROR_TAG_MISMATCH
XML_ERROR_DUPLICATE_ATTRIBUTE
XML_ERROR_JUNK_AFTER_DOC_ELEMENT
XML_ERROR_PARAM_ENTITY_REF
XML_ERROR_UNDEFINED_ENTITY
XML_ERROR_RECURSIVE_ENTITY_REF
XML_ERROR_ASYNC_ENTITY
XML_ERROR_BAD_CHAR_REF
XML_ERROR_BINARY_ENTITY_REF
XML_ERROR_ATTRIBUTE_EXTERNAL_ENTITY_REF
XML_ERROR_MISPLACED_XML_PI
XML_ERROR_UNKNOWN_ENCODING
XML_ERROR_INCORRECT_ENCODING
XML_ERROR_UNCLOSED_CDATA_SECTION
XML_ERROR_EXTERNAL_ENTITY_HANDLING
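
A minimal sketch of how these codes are typically inspected follows. The malformed input string is an assumption for illustration; the exact code and message reported for it can differ depending on the XML library PHP was built against.

```php
<?php
// Sketch: feed a deliberately malformed document and inspect the error.
$parser = xml_parser_create();

// <b> is never closed before </a>, so parsing is expected to fail.
$ok = xml_parse($parser, '<a><b></a>', true);

$code = xml_get_error_code($parser);
if (!$ok) {
    printf(
        "XML error %d (%s) at line %d, column %d\n",
        $code,
        xml_error_string($code),
        xml_get_current_line_number($parser),
        xml_get_current_column_number($parser)
    );
}
```

Comparing $code against XML_ERROR_NONE is a simple way to tell success from failure without relying on a specific error constant.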

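Character Encoding

PHP's XML extension supports the Unicode character set through different character encodings. There are two types of character encoding: source encoding and target encoding. PHP's internal representation of the document is always encoded in UTF-8.

Source encoding is applied when an XML document is parsed. A source encoding can be specified when an XML parser is created (it cannot be changed later in the parser's lifetime). The supported source encodings are ISO-8859-1, US-ASCII, and UTF-8. The first two are single-byte encodings, which means that each character is represented by a single byte. UTF-8 can encode characters composed of a variable number of bits (up to 21) in one to four bytes. The default source encoding used by PHP is ISO-8859-1.

Target encoding is applied when PHP passes data to the XML handler functions. When an XML parser is created, the target encoding is set to the same encoding as the source encoding, but it may be changed at any point. The target encoding affects character data as well as tag names and processing instruction targets.

If the XML parser encounters characters outside the range that its source encoding is capable of representing, it returns an error.

If PHP encounters characters in the parsed XML document that cannot be represented in the chosen target encoding, the problem characters are "demoted". Currently, this means that such characters are replaced by a question mark.

Examples

Here are some example PHP scripts that parse XML documents.

XML Element Structure Example

This first example displays the structure of the start elements in a document with indentation.

Example 1. Show XML Element Structure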
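
The difference between source and target encoding can be sketched as follows. The helper name parseText() and the sample document are hypothetical; the source encoding is set explicitly to UTF-8 so that the sketch does not depend on the default.

```php
<?php
// Sketch: parse the same UTF-8 document with two different target encodings.
// parseText() is a hypothetical helper, not part of the extension.
function parseText($xml, $targetEncoding)
{
    $text = '';
    // The source encoding is fixed when the parser is created.
    $parser = xml_parser_create('UTF-8');
    // The target encoding decides how data is handed to the handlers.
    xml_parser_set_option($parser, XML_OPTION_TARGET_ENCODING, $targetEncoding);
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data;
    });
    xml_parse($parser, $xml, true);
    return $text;
}

// "café" with the e-acute stored as the two UTF-8 bytes C3 A9.
$xml = "<a>caf\xC3\xA9</a>";

$utf8   = parseText($xml, 'UTF-8');       // handler receives the two-byte sequence
$latin1 = parseText($xml, 'ISO-8859-1');  // handler receives the single byte E9

echo bin2hex($utf8), "\n";
echo bin2hex($latin1), "\n";
```

Printing the hex form avoids depending on the encoding of the terminal or browser that displays the result.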

<?php
$file = "data.xml";
$depth = array();

function startElement($parser, $name, $attrs)
{
    global $depth;
    for ($i = 0; $i < $depth[$parser]; $i++) {
        echo "  ";
    }
    echo "$name\n";
    $depth[$parser]++;
}

function endElement($parser, $name)
{
    global $depth;
    $depth[$parser]--;
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>

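XML Tag Mapping Example

Example 2. Map XML to HTML

This example maps tags in an XML document directly to HTML tags. Elements not found in the "map array" are ignored. Of course, this example will only work with a specific XML document type.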

<?php
$file = "data.xml";
$map_array = array(
    "BOLD"     => "B",
    "EMPHASIS" => "I",
    "LITERAL"  => "TT"
);

function startElement($parser, $name, $attrs)
{
    global $map_array;
    if (isset($map_array[$name])) {
        echo "<$map_array[$name]>";
    }
}

function endElement($parser, $name)
{
    global $map_array;
    if (isset($map_array[$name])) {
        echo "</$map_array[$name]>";
    }
}

function characterData($parser, $data)
{
    echo $data;
}

$xml_parser = xml_parser_create();
// use case folding so we are sure to find the tag in $map_array
xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, true);
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($file, "r"))) {
    die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
xml_parser_free($xml_parser);
?>

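XML External Entity Example

This example highlights XML code. It illustrates how to use an external entity reference handler to include and parse other documents, how PIs can be processed, and one way of determining "trust" for PIs that contain code.

The XML documents that can be used with this example (xmltest.xml and xmltest2.xml) are listed after the example.

Example 3. External Entity Example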

<?php
$file = "xmltest.xml";

function trustedFile($file)
{
    // only trust local files owned by ourselves
    // (preg_match() replaces eregi(), which no longer exists in current PHP)
    if (!preg_match("~^([a-z]+)://~i", $file)
        && fileowner($file) == getmyuid()) {
            return true;
    }
    return false;
}

function startElement($parser, $name, $attribs)
{
    echo "&lt;<font color=\"#0000cc\">$name</font>";
    if (count($attribs)) {
        foreach ($attribs as $k => $v) {
            echo " <font color=\"#009900\">$k</font>=\"<font color=\"#990000\">$v</font>\"";
        }
    }
    echo "&gt;";
}

function endElement($parser, $name)
{
    echo "&lt;/<font color=\"#0000cc\">$name</font>&gt;";
}

function characterData($parser, $data)
{
    echo "<b>$data</b>";
}

function PIHandler($parser, $target, $data)
{
    switch (strtolower($target)) {
        case "php":
            global $parser_file;
            // If the parsed document is "trusted", we say it is safe
            // to execute PHP code inside it.  If not, display the code
            // instead.
            if (trustedFile($parser_file[$parser])) {
                eval($data);
            } else {
                printf("Untrusted PHP code: <i>%s</i>",
                        htmlspecialchars($data));
            }
            break;
    }
}

function defaultHandler($parser, $data)
{
    if (substr($data, 0, 1) == "&" && substr($data, -1, 1) == ";") {
        printf('<font color="#aa00aa">%s</font>',
                htmlspecialchars($data));
    } else {
        printf('<font size="-1">%s</font>',
                htmlspecialchars($data));
    }
}

function externalEntityRefHandler($parser, $openEntityNames, $base, $systemId,
                                  $publicId) {
    if ($systemId) {
        if (!list($parser, $fp) = new_xml_parser($systemId)) {
            printf("Could not open entity %s at %s\n", $openEntityNames,
                   $systemId);
            return false;
        }
        while ($data = fread($fp, 4096)) {
            if (!xml_parse($parser, $data, feof($fp))) {
                printf("XML error: %s at line %d while parsing entity %s\n",
                       xml_error_string(xml_get_error_code($parser)),
                       xml_get_current_line_number($parser), $openEntityNames);
                xml_parser_free($parser);
                return false;
            }
        }
        xml_parser_free($parser);
        return true;
    }
    return false;
}

function new_xml_parser($file)
{
    global $parser_file;

    $xml_parser = xml_parser_create();
    xml_parser_set_option($xml_parser, XML_OPTION_CASE_FOLDING, 1);
    xml_set_element_handler($xml_parser, "startElement", "endElement");
    xml_set_character_data_handler($xml_parser, "characterData");
    xml_set_processing_instruction_handler($xml_parser, "PIHandler");
    xml_set_default_handler($xml_parser, "defaultHandler");
    xml_set_external_entity_ref_handler($xml_parser, "externalEntityRefHandler");

    if (!($fp = @fopen($file, "r"))) {
        return false;
    }
    if (!is_array($parser_file)) {
        settype($parser_file, "array");
    }
    $parser_file[$xml_parser] = $file;
    return array($xml_parser, $fp);
}

if (!(list($xml_parser, $fp) = new_xml_parser($file))) {
    die("could not open XML input");
}

echo "<pre>";
while ($data = fread($fp, 4096)) {
    if (!xml_parse($xml_parser, $data, feof($fp))) {
        die(sprintf("XML error: %s at line %d\n",
                    xml_error_string(xml_get_error_code($xml_parser)),
                    xml_get_current_line_number($xml_parser)));
    }
}
echo "</pre>";
echo "parse complete\n";
xml_parser_free($xml_parser);

?>

Example 4. xmltest.xml

<?xml version='1.0'?>
<!DOCTYPE chapter SYSTEM "/just/a/test.dtd" [
<!ENTITY plainEntity "FOO entity">
<!ENTITY systemEntity SYSTEM "xmltest2.xml">
]>
<chapter>
 <TITLE>Title &plainEntity;</TITLE>
 <para>
  <informaltable>
   <tgroup cols="3">
    <tbody>
     <row><entry>a1</entry><entry morerows="1">b1</entry><entry>c1</entry></row>
     <row><entry>a2</entry><entry>c2</entry></row>
     <row><entry>a3</entry><entry>b3</entry><entry>c3</entry></row>
    </tbody>
   </tgroup>
  </informaltable>
 </para>
 &systemEntity;
 <section id="about">
  <title>About this Document</title>
  <para>
   <!-- this is a comment -->
   <?php echo 'Hi!  This is PHP version ' . phpversion(); ?>
  </para>
 </section>
</chapter>

The following document is included by xmltest.xml:

Example 5. xmltest2.xml

<?xml version="1.0"?>
<!DOCTYPE foo [
<!ENTITY testEnt "test entity">
]>
<foo>
   <element attrib="value"/>
   &testEnt;
   <?php echo "This is some more PHP code being executed."; ?>
</foo>

Table of Contents
utf8_decode -- Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1
utf8_encode -- Encodes an ISO-8859-1 string to UTF-8
xml_error_string -- Get XML parser error string
xml_get_current_byte_index -- Get current byte index for an XML parser
xml_get_current_column_number -- Get current column number for an XML parser
xml_get_current_line_number -- Get current line number for an XML parser
xml_get_error_code -- Get XML parser error code
xml_parse_into_struct -- Parse XML data into an array structure
xml_parse -- Start parsing an XML document
xml_parser_create_ns -- Create an XML parser with namespace support
xml_parser_create -- Create an XML parser
xml_parser_free -- Free an XML parser
xml_parser_get_option -- Get options from an XML parser
xml_parser_set_option -- Set options in an XML parser
xml_set_character_data_handler -- Set up character data handler
xml_set_default_handler -- Set up default handler
xml_set_element_handler -- Set up start and end element handlers
xml_set_end_namespace_decl_handler -- Set up end namespace declaration handler
xml_set_external_entity_ref_handler -- Set up external entity reference handler
xml_set_notation_decl_handler -- Set up notation declaration handler
xml_set_object -- Use XML parser within an object
xml_set_processing_instruction_handler -- Set up processing instruction (PI) handler
xml_set_start_namespace_decl_handler -- Set up start namespace declaration handler
xml_set_unparsed_entity_decl_handler -- Set up unparsed entity declaration handler

User Contributed Notes
hutch at midwales dot com
02-Oct-2006 12:26
First off, I'd like to thank all and sundry for providing this excellent resource, it has been very helpful in getting my head around xml parsing.

I was recently handed the task of collecting a variety of xml streams, from many different sources and of widely varying quality.

I have found that the following function helped parsing the input by cleaning it up. It removes all leading and trailing whitespace and removes carriage returns and linefeeds.

Using this function before using xml_parser_create() has helped reduce a number of otherwise unexplainable anomalies, such as arbitrary cutoff of data or the data being divided into two, requiring concatenation. Data longer than 1024 characters still has to be concatenated, but I can live with that.

<?php
// remove whitespace and linefeeds and returns the name of a temporary file
// takes the name of an existing file as a parameter
function cleanxmlfile($file, $tmpdir="/tmp", $prefix="xxx_") {
   $tmp = file_get_contents ($file);
   $tmp = preg_replace("/^\s+/m","",$tmp);
   $tmp = preg_replace("/\s+$/m","",$tmp);
   $tmp = preg_replace("/\r/","",$tmp);
   $tmp = preg_replace("/\n/","",$tmp);
   $tmpfname = tempnam($tmpdir, $prefix);
   $handle = fopen($tmpfname, "w");
   fwrite($handle, "$tmp");
   fclose($handle);
   return($tmpfname);
}
?>

HTH
forquan
29-Jan-2006 07:45
Here's code that will create an associative array from an xml file.  Keys are the tag data, and subarrays are formed from attributes and child tags.

<?php
// plain assignment; the old "=& new" form is a parse error in current PHP
$p = new xmlParser();
$p->parse('/*xml file*/');
print_r($p->output);
?>

<?php
class xmlParser{
   var $xml_obj = null;
   var $output = array();
   var $attrs;

   function xmlParser(){
       $this->xml_obj = xml_parser_create();
       xml_set_object($this->xml_obj,$this);
       xml_set_character_data_handler($this->xml_obj, 'dataHandler');
       xml_set_element_handler($this->xml_obj, "startHandler", "endHandler");
   }

   function parse($path){
       if (!($fp = fopen($path, "r"))) {
           die("Cannot open XML data file: $path");
           return false;
       }

       while ($data = fread($fp, 4096)) {
           if (!xml_parse($this->xml_obj, $data, feof($fp))) {
               die(sprintf("XML error: %s at line %d",
                   xml_error_string(xml_get_error_code($this->xml_obj)),
                   xml_get_current_line_number($this->xml_obj)));
               xml_parser_free($this->xml_obj);
           }
       }

       return true;
   }

   function startHandler($parser, $name, $attribs){
       $_content = array();
       if(!empty($attribs))
           $_content['attrs'] = $attribs;
       array_push($this->output, $_content);
   }

   function dataHandler($parser, $data){
       if(!empty($data) && $data!="\n") {
           $_output_idx = count($this->output) - 1;
           $this->output[$_output_idx]['content'] .= $data;
       }
   }

   function endHandler($parser, $name){
       if(count($this->output) > 1) {
           $_data = array_pop($this->output);
           $_output_idx = count($this->output) - 1;
           $add = array();
           if ($_data['attrs'])
               $add['attrs'] = $_data['attrs'];
           if ($_data['child'])
               $add['child'] = $_data['child'];
           $this->output[$_output_idx]['child'][$_data['content']] = $add;
       }
   }
}
?>
Greg S
18-Nov-2005 12:56
If you need utf8_encode support and configure PHP with --disable-all you will have some trouble. Unfortunately the configure options aren't completely documented. If you need utf8 functions and have everything disabled just recompile PHP with --enable-xml and you should be good to go.
simonguada at yahoo dot fr
06-Apr-2005 05:31
to import xml into mysql

$file = "article_2_3032005467.xml";
$feed = array();
$key = "";
$info = "";

function startElement($xml_parser,  $attrs ) {
  global $feed;
   }

function endElement($xml_parser, $name) {
  global $feed,  $info;
   $key = $name;
  $feed[$key] = $info;
  $info = ""; }

function charData($xml_parser, $data ) {
  global $info;
  $info .= $data; }

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "charData" );
$fp = fopen($file, "r");
while ($data = fread($fp, 8192))
!xml_parse($xml_parser, $data, feof($fp));
xml_parser_free($xml_parser);

$sql= "INSERT INTO `article` ( `";
$j=0;
$i=count($feed);
foreach( $feed as $assoc_index => $value )
  {
  $j++;
  $sql.= strtolower($assoc_index);
  if($i>$j) $sql.= "` , `";
  if($i<=$j) {$sql.= "` ) VALUES ('";}
  }
 $h=0;
foreach( $feed as $assoc_index => $value )
  {
  $h++;
  $sql.= utf8_decode(trim(addslashes($value)));
  if($i-1>$h) $sql.= "', '";
  if($i<=$h) $sql.= "','')";
  }
  $sql=trim($sql);
  echo $sql;
kerxen at caramail dot com
28-Mar-2005 10:04
To use XML with objects, we can use xml_set_object():

class xml  {
   var $parser;

   function xml()  // constructor
   {
       $this->parser = xml_parser_create();

       xml_set_object($this->parser, $this);
       xml_set_element_handler($this->parser, "tag_open", "tag_close");
       xml_set_character_data_handler($this->parser, "cdata");
   }

   function tag_open($parser, $tag, $attributes)
   {
       var_dump($parser, $tag, $attributes);
   }

   function tag_close($parser, $tag)
   {
       var_dump($parser, $tag);
   }

   function cdata($parser, $cdata)
   {
       var_dump($parser, $cdata);
   }

   function parse($data)
   {
       xml_parse($this->parser, $data);
   }

} // end of class xml

$xml_parser = new xml();  // creation of the object
$xml_parser->parse("<a id='hello World'>PHP</a>");

Hope you find this piece of code useful.

This can be used to share XML files with other sites.

Eddy
http://www.djfrance.net
compu_global_hyper_mega_net_2 at yahoo dot com
20-Sep-2004 04:35
The documentation regarding white space was never complete I think.

The XML_OPTION_SKIP_WHITE doesn't appear to do anything.  I want to preserve the newlines in a cdata section.  Setting XML_OPTION_SKIP_WHITE to 0 or false doesn't appear to help.  My character_data_handler is getting called once for each line.  This obviously should be reflected in the documentation as well.  When/how often does the handler get called exactly?  Having to build separate test cases is very time consuming.

Inserting newlines myself in my cdata handler is no good either.  For non actual CDATA sections that cause my handler to get called, long lines are split up in multiple calls.  My handler would not be able to tell the difference whether or not the subsequent calls would be due to the fact that the data is coming from the next line or the fact that some internal buffer is long enough for it to 'flush' out and call the handler.
This behaviour also needs to be properly documented.
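
One common way to cope with the handler being called several times for a single run of text is to accumulate the pieces and only use them once the end tag arrives. The following is a minimal sketch under that assumption; the variable names and the sample document are hypothetical, and the buffer handling is only suitable for elements that contain plain text.

```php
<?php
// Sketch: buffer character data and flush it when the element ends, so it
// does not matter how many times the character data handler is called.
$buffer = '';
$values = [];

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    // A new element starts: begin with an empty buffer.
    function ($p, $name, $attrs) use (&$buffer) { $buffer = ''; },
    // The element ends: everything collected so far belongs to it.
    function ($p, $name) use (&$buffer, &$values) {
        if ($name === 'node') {
            $values[] = $buffer;
        }
        $buffer = '';
    }
);

// Append only; the parser passes newlines through as part of the data.
xml_set_character_data_handler($parser, function ($p, $data) use (&$buffer) {
    $buffer .= $data;
});

xml_parse($parser, "<root><node>line one\nline two &amp; more</node></root>", true);

print_r($values);
```

Because the text is only read in the end-tag handler, newlines inside the element are preserved without the handler having to guess why it was called.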
talraith at withouthonor dot com
29-Jun-2004 09:11
If you are looking for some heavy duty code to parse or create XML documents, then may I suggest taking a look at a class module I am working on.  The module is complete except for support of namespaces and XPath.

The class takes a string of XML code and creates a TRUE object tree.  Likewise, you can create a tree in your code and generate an XML document.  There are no eval() statements used at all unlike some of the other examples shown here.

I posted this a while ago, but it has since been buried by a number of posts and I believe it to be beneficial to anyone looking to use XML / PHP to see this information.

http://www.withouthonor.com/obj_xml.phps for the source code.  Sample usage can be found in my post below.
odders
19-Mar-2004 02:36
I wrote a simple xml parser mainly to deal with rss version 2. I found lots of examples on the net, but they were all massive and bloated and hard to manipulate.

Output is sent to an array, which holds arrays containing data for each item.

Obviously, you will have to make modifications to the code to suit your needs, but there isn't a lot of code there, so that shouldn't be a problem.

<?php

   $currentElements = array();
   $newsArray = array();

   readXml("./news.xml");

   echo("<pre>");
   print_r($newsArray);
   echo("</pre>");

   // Reads XML file into formatted html
   function readXML($xmlFile)
   {
     $xmlParser = xml_parser_create();

     xml_parser_set_option($xmlParser, XML_OPTION_CASE_FOLDING, false);
     // handler names are passed as strings
     xml_set_element_handler($xmlParser, "startElement", "endElement");
     xml_set_character_data_handler($xmlParser, "characterData");

     $fp = fopen($xmlFile, "r");

     while ($data = fread($fp, filesize($xmlFile))) {
         xml_parse($xmlParser, $data, feof($fp));
     }

     xml_parser_free($xmlParser);
   }

   // Sets the current XML element, and pushes itself onto the element hierarchy
   function startElement($parser, $name, $attrs)
   {
     global $currentElements, $itemCount;

     array_push($currentElements, $name);

     if ($name == "item") { $itemCount += 1; }
   }

   // Prints XML data; finds highlights and links
   function characterData($parser, $data)
   {
     global $currentElements, $newsArray, $itemCount;

     $currentCount = count($currentElements);
     $parentElement = $currentElements[$currentCount-2];
     $thisElement = $currentElements[$currentCount-1];

     if ($parentElement == "item") {
         $newsArray[$itemCount-1][$thisElement] = $data;
     } else {
         // placeholder for channel-level elements; switches on the current element
         switch ($thisElement) {
           case "title":
               break;
           case "link":
               break;
           case "description":
               break;
           case "language":
               break;
           case "item":
               break;
         }
     }
   }

   // If the XML element has ended, it is popped off the hierarchy
   function endElement($parser, $name)
   {
     global $currentElements;

     $currentCount = count($currentElements);
     if ($currentElements[$currentCount-1] == $name) {
         array_pop($currentElements);
     }
   }

?>
talraith at withouthonor dot com
03-Feb-2004 06:27
I have created a class set that both parses XML into an object structure and from that structure creates XML code.  It is mostly finished but I thought I would post here as it may help someone out or if someone wants to use it as a base for their own parser.  The method for creating the object is original compared to the posts before this one.

The object tree is created by creating separate tag objects for each tag inside the main document object and associating them together by way of object references.  An index table is created so that each tag is assigned an ID number (in numerical order from 0) and can be accessed directly using that ID number.  Each tag has object references to its children.  There are no uses of eval() in this code.

The code is too long to post here, so I have made a HTML page that has it:  http://www.withouthonor.com/obj_xml.html

Sample code would look something like this:

<?

$xml = new xml_doc($my_xml_code);
$xml->parse();

$root_tag =& $xml->xml_index[0];
$children =& $root_tag->children;

// and so forth

// To create XML code using the object, would be similar to this:

$my_xml = new xml_doc();

$root_tag = $my_xml->CreateTag('ROOTTAG');
$my_xml->CreateTag('CHILDTAG',array(),'',$root_tag);

// The following is used for the CreateTag() method
// string Name (The name of the child tag)
// array Attributes (associative array of attributes for tag)
// string Content (textual data for the child tag)
// int ParentID (Index number for parent tag)

// To generate the XML, use the following method

$out_xml = $my_xml->generate();

?>
bradparks at bradparks dot com
18-Dec-2003 06:38
Hey;

If you need to parse XML on an older version of PHP (e.g. 4.0) or if you can't get the expat extension enabled on your server, you might want to check out the Saxy and DOMIT! xml parsers from Engage Interactive. They're opensource and pure php, so no extensions or changes to your server are required. I've been using them for over a month on some projects with no problems whatsoever!

Check em out at:

DOMIT!, a DOM based xml parser, uses Saxy (included)
http://www.engageinteractive.com/redir.php?resource=1&target=domit

or

Saxy, a sax based xml parser
http://www.engageinteractive.com/redir.php?resource=2&target=saxy

Brad
chris at hitcatcher dot com
08-Nov-2003 06:48
In regards to jon at gettys dot org's XML object, The data should be TRIM()ed to remove any whitespace that could appear in CDATA entered as :

<xml_tag>
   cdata here. cdata here. cdata here. cdata here.
</xml_tag>

So, after applying fred at barron dot com's suggested change to the characterData function, the function should appear as:

function characterData($parser, $data)
{
   global $obj;
   $data = addslashes($data);
   eval($obj->tree."->data.='".trim($data)."';");
}

SIDE NOTE: I'm fairly new to XML so perhaps it is considered bad form to enter CDATA as I did in my example. Is this true, or is the extra whitespace for the sake of readability acceptable?
ml at csite dot com
02-Jul-2003 11:29
A fix for the fread breaking thing:

while ($data = fread($fp, 4096)) {

   $data = $cache . $data;

   if (!feof($fp)) {
       if (preg_match_all("(</?[a-z0-9A-Z]+>)", $data, $regs)) {
           $lastTagname = $regs[0][count($regs[0])-1];
           $split = false;
           for ($i=strlen($data)-strlen($lastTagname); $i>=strlen($lastTagname); $i--) {
               if ($lastTagname == substr($data, $i, strlen($lastTagname))) {
                   $cache = substr($data, $i, strlen($data));
                   $data = substr($data, 0, $i);
                   $split = true;
                   break;
               }
           }
       }
       if (!$split) {
           $cache = $data;
       }
   }

   if (!xml_parse($xml_parser, $data, feof($fp))) {
       die(sprintf("XML error: %s at line %d", xml_error_string(xml_get_error_code($xml_parser)), xml_get_current_line_number($xml_parser)));
   }
}
panania at 3ringwebs dot com
21-May-2003 06:12
The above example doesn't work when you're parsing a string being returned from a curl operation (why I don't know!) I kept getting undefined offsets at the highest element number in both the start and end element functions. It wasn't the string itself I know, because I substringed it to death with the same results. But I fixed the problem by adding these lines of code...

function defaultHandler($parser, $name) {
   global $depth;
    @$depth[$parser]--;
}

xml_set_default_handler($xml_parser, "defaultHandler");

Hope this helps 8-}
fred at barron dot com
23-Apr-2003 08:28
regarding jon at gettys dot org's nice XML to Object code, I've made some useful changes (IMHO) to the characterData function... my minor modifications allow multiple lines of data and it escapes quotes so errors don't occur in the eval...

function characterData($parser, $data)
{
   global $obj;
   $data = addslashes($data);
   eval($obj->tree."->data.='".$data."';");
}
software at serv-a-com dot com
18-Feb-2003 01:10
2. Pre Parser Strings and New Line Delimited Data
One important thing to note at this point is that the xml_parse function requires a string variable. You can manipulate the content of any string variable easily as we all know.

A better approach to removing newlines than:
while ($data = fread($fp, 4096)) {
$data = preg_replace("/\n|\r/","",$data); //flarp
if (!xml_parse($xml_parser, $data, feof($fp))) {...

The above works across all 3 kinds of line-delimited text files (\n, \r, \r\n). But this could potentially (or will most likely) damage or scramble data contained in, for example, CDATA areas. As far as I am concerned, end of line characters should not be used _within_ XML tags. What seems to be the ultimate solution is to pre-parse the loaded data. This would require checking the position within the XML document and adding or subtracting data (using a temporary variable in between freads) based on conditions like "is within tag", "is within CDATA" etc. before feeding it to the parser. This of course opens up a new can of worms (as in: parse data for the parser...). (The above procedure would take place between the fread and xml_parse calls; this method would be compatible with the general usage examples at the top of the page.)

3. The Answer to parsing arbitrary XML and Preprocessor Revisited
You can't just feed any XML document to the parser you constructed and assume that it will work! You have to know what kind of methods for storing data are used: for example, is there end-of-line delimited data in the file? Are there any carriage returns in the tags? etc... XML files come formatted in different ways. Some are just one long string of characters without any end of line markers, others have newlines, carriage returns or both (Microsloth Windows). They may or may not contain spaces and other whitespace between tags. For this reason it is important to do what I call normalizing the data before feeding it to the parser. You can perform this with regular expressions or plain old str_replace and concatenation. In many cases this can be done to the file itself, sometimes to string data on the fly (as shown in the example above). But I feel it is important to normalize the data before even calling the function that calls xml_parse. If you have the ability to access all data before that call, you can convert it to what you feel the data should have been in the first place and avoid many surprises and expensive regular expression substitution (in a tight spot) while fread'ing the data.
software at serv-a-com dot com
18-Feb-2003 01:09
My previous XML post (software at serv-a-com dot com/22-Jan-2003 03:08) resulted in some of the visitors e-mailing me with questions on the carriage return stripping issue. I'll try to make the following mumble as brief and easy to understand as possible.

1. Overview of the 4096 fragmentation issue
As you know, the following freads the file 4096 bytes at a time (that is 4KB). This is perhaps ok for testing expat and figuring out how things work, but it is rather dangerous in a production environment. Data may not be fully understandable due to fread fragmentation, and improperly formatted due to the numerous sources (formats) of data contained within (i.e. end of line delimited CDATA).

while ($data = fread($fp, 4096)) {
if (!xml_parse($xml_parser, $data, feof($fp))) {

Sometimes to save time one may want to load it all up into a one big variable and leave all the worries to expat. I think anything under 500 KB is ok (as long as nobody knows about it). Some may argue that larger variables are acceptable or even necessary because of the magic that take place while parsing using xml_parse. Our XML parser(expat) works and can be successfully implemented only when we know what type of XML data we are dealing with, it's average size and structure of general layout and data contained within tags. For example if the tags are followed by a line delimiter like a new line we can read it with fgets in and with minimal effort make sure that no data will be sent to the function that does not end with a end tag. But this require a fair knowledge of the file's preference for storing XML data and tags (and a bit of code between reading data and xml_parse'ing it).
software at serv-a-com dot com
23-Jan-2003 06:08
use:
while ($data = str_replace("\n","",fread($fp, 4096))){

instead of:
while ($data = fread($fp, 4096)) {
It will save you a headache.

and in response to (simen at bleed dot no 11-Jan-2003 04:27) "If the 4096 byte buffer fills up..."
Please take better care of your data. Don't just shove it into xml_parse(); check and make sure that the tags are not sliced in the middle, and use a temporary variable between fread and xml_parse.
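
The temporary-variable idea could look roughly like the sketch below. It is only an illustration, not the poster's actual code: feed_on_tag_boundaries() is a made-up helper name, the chunks array stands in for successive fread() results, and cutting at the last '>' is a heuristic (a '>' can also appear inside CDATA or a comment).

```php
<?php
// Sketch: keep a temporary buffer between reading and xml_parse() so that
// every piece handed to the parser ends on a '>' character.
// feed_on_tag_boundaries() is a hypothetical helper, not part of the extension.
function feed_on_tag_boundaries($parser, array $chunks)
{
    $buffer = '';
    foreach ($chunks as $chunk) {
        $buffer .= $chunk;
        $cut = strrpos($buffer, '>');
        if ($cut === false) {
            continue; // no complete tag yet, keep buffering
        }
        // Parse everything up to and including the last '>'.
        if (!xml_parse($parser, substr($buffer, 0, $cut + 1), false)) {
            return false;
        }
        // Keep the unfinished tail for the next round.
        $buffer = (string) substr($buffer, $cut + 1);
    }
    // Hand over whatever is left and signal the end of the document.
    return (bool) xml_parse($parser, $buffer, true);
}

$text = '';
$parser = xml_parser_create();
xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
    $text .= $data;
});

// The split between the two chunks falls in the middle of the text node.
$ok = feed_on_tag_boundaries($parser, array('<node>this is my', ' value</node>'));
echo $text, "\n";
```

With real files the loop would take its chunks from fread() instead of an array.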
simen at bleed dot no
12-Jan-2003 07:27
I was experiencing really weird behaviour loading a large XML document (91k), since reading the file through a 4096-byte buffer doesn't take the following into consideration:

<node>this is my value</node>

If the 4096 byte buffer fills up at "my", you will get a split string into your xml_set_character_data_handler().

The only solution I've found so far is to read the whole document into a variable and then parse.
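
A minimal sketch of that approach, assuming the file is small enough to hold in memory. parse_whole_document() is a hypothetical name, and the temporary file only exists to make the example self-contained.

```php
<?php
// Sketch: load the whole document into one variable, then parse it with a
// single xml_parse() call. parse_whole_document() is a hypothetical name.
function parse_whole_document($path)
{
    $xml = file_get_contents($path);
    if ($xml === false) {
        return false;
    }

    $text = '';
    $parser = xml_parser_create();
    xml_set_character_data_handler($parser, function ($p, $data) use (&$text) {
        $text .= $data; // append, since the handler may still fire more than once
    });

    // Third argument true: this is the final (and only) piece of data.
    return xml_parse($parser, $xml, true) ? $text : false;
}

// Usage with a throwaway file.
$path = tempnam(sys_get_temp_dir(), 'xml');
file_put_contents($path, '<node>this is my value</node>');
$result = parse_whole_document($path);
unlink($path);
echo $result, "\n";
```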
sfaulkner at hoovers dot com
04-Nov-2002 04:29
Building on... This allows you to return the value of an element using an XPath-style reference.  This code would of course need error handling added :-)

 function GetElementByName ($xml, $start, $end) {
   $startpos = strpos($xml, $start);
   if ($startpos === false) {
     return false;
   }
   // The content begins right after the opening tag.
   $contentpos = $startpos + strlen($start);
   // Look for the closing tag only after the opening tag.
   $endpos = strpos($xml, $end, $contentpos);
   if ($endpos === false) {
     return false;
   }
   return substr($xml, $contentpos, $endpos - $contentpos);
 }
 
 function XPathValue($XPath,$XML) {
   $XPathArray = explode("/",$XPath);
   $node = $XML;
   foreach ($XPathArray as $value) {
     $node = GetElementByName($node, "<$value>", "</$value>");
   }
  
   return $node;
 }
 
  print XPathValue("Response/Shipment/TotalCharges/Value",$xml);
guy at bhaktiandvedanta dot com
28-Sep-2002 03:01
For a simple XML parser you can use this function. It doesn't require any extensions to run.

<?php
// Extracts content from XML tag.
// Also stores the offset just past the closing tag in the global $pos.
function GetElementByName ($xml, $start, $end) {
   global $pos;

   $startpos = strpos($xml, $start);
   if ($startpos === false) {
       return false;
   }
   $contentpos = $startpos + strlen($start);
   // Search for the closing tag only after the opening tag.
   $endpos = strpos($xml, $end, $contentpos);
   if ($endpos === false) {
       return false;
   }
   $pos = $endpos + strlen($end);

   return substr($xml, $contentpos, $endpos - $contentpos);
}

// Open and read xml file. You can replace this with your xml data.
$file = "data.xml";
$data = "";
$Nodes = array();

if (!($fp = fopen($file, "r"))) {
   die("could not open XML input");
}
while ($getline = fread($fp, 4096)) {
   $data = $data . $getline;
}
fclose($fp);

$count = 0;
$pos = 0;

// Goes through XML file and creates an array of all <XML_TAG> tags.
while (($node = GetElementByName($data, "<XML_TAG>", "</XML_TAG>")) !== false) {
   $Nodes[$count] = $node;
   $count++;
   $data = substr($data, $pos);
}

// Gets information from tag siblings.
for ($i=0; $i<$count; $i++) {
   $code = GetElementByName($Nodes[$i], "<Code>", "</Code>");
   $desc = GetElementByName($Nodes[$i], "<Description>", "</Description>");
   $price = GetElementByName($Nodes[$i], "<BasePrice>", "</BasePrice>");
}
?>

Hope this helps! :)
Guy Laor
dmarsh dot NO dot SPAM dot PLEASE at spscc dot ctc dot edu
19-Sep-2002 03:27
Some reference code I am working on as an "XML Library", which I am folding into an object. Notice the use of the DEFINE:

Mainly Example 1 and parts of 2 & 3 re-written as an object:
--- MyXMLWalk.lib.php ---
<?php

if (!defined("PHPXMLWalk")) {
define("PHPXMLWalk", TRUE);

class XMLWalk {
  var $p; // short for xml parser
  var $e; // current indent, raised by 2 for every open element

  // Returns print_r() output with continuation lines indented by $i spaces.
  function prl($x, $i = 0) {
    $buf = print_r($x, true);
    return join("\n" . str_repeat(" ", $i), explode("\n", $buf));
  }

  function __construct() {
    $this->p = xml_parser_create();
    $this->e = 0;
    xml_parser_set_option($this->p, XML_OPTION_CASE_FOLDING, true);
    xml_set_element_handler($this->p, array($this, "startElement"), array($this, "endElement"));
    xml_set_character_data_handler($this->p, array($this, "dataElement"));
    register_shutdown_function(array($this, "free")); // make a destructor
  }

  function startElement($parser, $name, $attrs) {
    if (count($attrs) >= 1) {
      $x = $this->prl($attrs, $this->e + 6);
    } else {
      $x = "";
    }

    print str_repeat(" ", $this->e) . "$name $x\n";
    $this->e += 2;
  }

  function dataElement($parser, $data) {
    print str_repeat(" ", $this->e) . htmlspecialchars($data, ENT_QUOTES) . "\n";
  }

  function endElement($parser, $name) {
    $this->e -= 2;
  }

  function parse($data, $fp) {
    if (!xml_parse($this->p, $data, feof($fp))) {
      die(sprintf("XML error: %s at line %d",
                  xml_error_string(xml_get_error_code($this->p)),
                  xml_get_current_line_number($this->p)));
    }
  }

  function free() {
    xml_parser_free($this->p);
  }

}
// end of class

} // end of define

?>

--- end of file ---

Calling code:
<?php

...

require("MyXMLWalk.lib.php");

$file = "x.xml";

$xme = new XMLWalk;

if (!($fp = fopen($file, "r"))) {
   die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
   $xme->parse($data, $fp);
}

...
?>
jon at gettys dot org
15-Aug-2002 04:59
[Editor's note: see also xml_parse_into_struct().]

Very simple routine to convert an XML file into a PHP structure. $obj->xml contains the resulting PHP structure. I would be interested if someone could suggest a cleaner method than the evals I am using.

<?php
$filename = 'sample.xml';
$obj = new stdClass;
$obj->tree = '$obj->xml';
$obj->xml = '';

function startElement($parser, $name, $attrs) {
   global $obj;

   // If var already defined, make array
   eval('$test=isset('.$obj->tree.'->'.$name.');');
   if ($test) {
     eval('$tmp='.$obj->tree.'->'.$name.';');
     eval('$arr=is_array('.$obj->tree.'->'.$name.');');
     if (!$arr) {
       eval('unset('.$obj->tree.'->'.$name.');');
       eval($obj->tree.'->'.$name.'[0]=$tmp;');
       $cnt = 1;
     }
     else {
       eval('$cnt=count('.$obj->tree.'->'.$name.');');
     }

     $obj->tree .= '->'.$name."[$cnt]";
   }
   else {
     $obj->tree .= '->'.$name;
   }
   if (count($attrs)) {
       eval($obj->tree.'->attr=$attrs;');
   }
}

function endElement($parser, $name) {
   global $obj;

   // Strip off last ->
   $cut = strrpos($obj->tree, '->');
   if ($cut !== false) {
       $obj->tree = substr($obj->tree, 0, $cut);
   }
}

function characterData($parser, $data) {
   global $obj;

   // var_export() emits a quoted PHP string literal, so quotes in $data are safe
   eval($obj->tree.'->data='.var_export($data, true).';');
}

$xml_parser = xml_parser_create();
xml_set_element_handler($xml_parser, "startElement", "endElement");
xml_set_character_data_handler($xml_parser, "characterData");
if (!($fp = fopen($filename, "r"))) {
   die("could not open XML input");
}

while ($data = fread($fp, 4096)) {
   if (!xml_parse($xml_parser, $data, feof($fp))) {
       die(sprintf("XML error: %s at line %d",
                   xml_error_string(xml_get_error_code($xml_parser)),
                   xml_get_current_line_number($xml_parser)));
   }
}
xml_parser_free($xml_parser);
print_r($obj->xml);
return 0;

?>
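
One possible eval-free variant, offered only as a sketch: it keeps a stack of open elements and attaches each finished element to its parent when the end tag arrives. The function name xml_to_tree() and the array keys 'name', 'attr', 'data' and 'children' are made up for this example, and the result is a nested array instead of the object layout used above.

```php
<?php
// Sketch: build a nested array with a stack instead of eval().
// xml_to_tree() and the array keys are hypothetical, chosen for illustration.
function xml_to_tree($xml)
{
    // The first entry is a sentinel that collects the document element.
    $stack = array(array('children' => array()));

    $parser = xml_parser_create();
    xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

    xml_set_element_handler(
        $parser,
        function ($p, $name, $attrs) use (&$stack) {
            // Opening tag: push a fresh node for this element.
            $stack[] = array('name' => $name, 'attr' => $attrs,
                             'data' => '', 'children' => array());
        },
        function ($p, $name) use (&$stack) {
            // Closing tag: the node is complete, attach it to its parent.
            $node = array_pop($stack);
            $stack[count($stack) - 1]['children'][] = $node;
        }
    );
    xml_set_character_data_handler($parser, function ($p, $data) use (&$stack) {
        if (count($stack) > 1) { // ignore anything outside the root element
            $stack[count($stack) - 1]['data'] .= $data;
        }
    });

    if (!xml_parse($parser, $xml, true)) {
        return false;
    }
    return $stack[0]['children'][0];
}

$tree = xml_to_tree('<list><item id="1">a</item><item id="2">b</item></list>');
echo $tree['children'][1]['attr']['id'], ' ', $tree['children'][1]['data'], "\n";
```

Repeated elements simply become consecutive entries in 'children', so no special case is needed to turn a single value into an array.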
danielc at analysisandsolutions dot com
16-Apr-2002 05:23
I put up a good, simple, real world example of how to parse XML documents. While the sample grabs stock quotes off of the web, you can tweak it to do whatever you need.

http://www.analysisandsolutions.com/code/phpxml.htm
jason at N0SPAM dot projectexpanse dot com
23-Mar-2002 05:16
In reference to the note made by sam@cwa.co.nz about parsing entities:

I could be wrong, but since it is possible to define your own entities within an XML DTD, the cdata handler function parses these individually to allow for your own implementation of those entities within your cdata handler.
jason at NOSPAM_projectexpanse_NOSPAM dot com
27-Feb-2002 08:11
Newbies wanting a good tutorial on how to actually get started, and where to go from this listing of functions, should visit:
http://www.wirelessdevnet.com/channels/wap/features/xmlcast_php.html

It shows an excellent example of how to read the XML data into a class file so you can actually process it, not just display it all pretty-like, like many tutorials on PHP/XML seem to be doing.
hans dot schneider at bbdo-interone dot de
25-Jan-2002 12:43
I had to trim() the data when I passed one large string containing a well-formed XML file to xml_parse. The string was read by cURL, which apparently put a blank at the end of the string. This blank produced an "XML not well-formed" error in xml_parse!
morgan_rogers at yahoo dot com
06-Oct-2000 04:37
There's a really good article on XML parsing with PHP at http://www.zend.com/zend/art/parsing.php
sam at cwa dot co dot nz
28-Sep-2000 10:39
I've discovered some unusual behaviour in this API when ampersand entities are parsed in cdata; for some reason the parser breaks up the section around the entities and calls the handler repeatedly, once for each of the sections. If you don't allow for this oddity and you are trying to put the cdata into a variable, only the last part will be stored.

You can get around this with a line like:

$foo .= $cdata;

If the handler is called several times from the same tag, it will append them, rather than rewriting the variable each time. If the entire cdata section is returned, it doesn't matter.

May happen for other entities, but I haven't investigated.

Took me a while to figure out what was happening; hope this saves someone else the trouble.
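
A small sketch of the append approach in context. The variable names and the sample document are invented for the example; the buffer is cleared whenever an element starts and stored when it ends.

```php
<?php
// Sketch: collect character data per element by appending in the handler.
// $current, $values and the sample XML are illustrative assumptions.
$current = '';      // text gathered for the element being read
$values  = array(); // finished element name => text

$parser = xml_parser_create();
xml_parser_set_option($parser, XML_OPTION_CASE_FOLDING, 0);

xml_set_element_handler(
    $parser,
    function ($p, $name, $attrs) use (&$current) {
        $current = ''; // new element: start with an empty buffer
    },
    function ($p, $name) use (&$current, &$values) {
        $values[$name] = $current; // element closed: the text is complete
        $current = '';
    }
);
xml_set_character_data_handler($parser, function ($p, $cdata) use (&$current) {
    $current .= $cdata; // append instead of overwrite
});

$ok = xml_parse($parser, '<item><name>Fish &amp; Chips</name></item>', true);
echo $values['name'], "\n"; // prints "Fish & Chips"
```

The result does not depend on how many times the handler is called for one text node.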
Daniel dot Rendall at btinternet dot com
08-Jul-1999 01:21
When using the XML parser, make sure you're not using the magic quotes option (e.g. use set_magic_quotes_runtime(0) if it's not the compiled default), otherwise you'll get 'not well-formed' errors when dealing with tags with attributes set in them.