mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5)

mb_detect_encoding -- Detect character encoding

Description

string mb_detect_encoding ( string str [, mixed encoding_list [, bool strict]] )

mb_detect_encoding() detects the character encoding of the string str and returns the name of the detected encoding.

encoding_list is a list of candidate character encodings. The order may be given as an array or as a comma-separated string.

If encoding_list is omitted, detect_order is used.

Example 1. mb_detect_encoding() example

<?php
/* Detect character encoding with current detect_order */
echo mb_detect_encoding($str);

/* "auto" is expanded to "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");

/* Specify encoding_list as a comma-separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Use an array to specify encoding_list */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>

See also mb_detect_order().


User Contributed Notes
sunggsun
15-Aug-2006 03:26
from PHPDIG

   // A string is treated as UTF-8 if it survives a round trip through UTF-32 unchanged.
   function isUTF8($str) {
       return $str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32");
   }
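
The same question can probably be asked more directly. This is a minimal sketch, assuming the mbstring extension is loaded and that your PHP version provides mb_check_encoding(); the wrapper name is_utf8_checked() is hypothetical and not part of PHPDIG. The test bytes are written as escapes so the result does not depend on how the script file itself is encoded.

```php
<?php
// Hypothetical wrapper: asks mbstring whether the bytes form valid UTF-8.
function is_utf8_checked($str)
{
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(is_utf8_checked("caf\xC3\xA9")); // "café" encoded as UTF-8
var_dump(is_utf8_checked("caf\xE9"));     // "café" encoded as ISO-8859-1
```

The first call should report true and the second false, because a lone 0xE9 byte is not a complete UTF-8 sequence.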
chris AT w3style.co DOT uk
03-Aug-2006 05:22
Based on the preg_match() snippet below, I needed something faster and less specific. That function works well, but it scans the entire string and checks that all of it conforms to UTF-8. I only wanted to check whether a string contains UTF-8 multibyte characters, so that I could switch the character encoding from ISO-8859-1 to UTF-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range, and to stop as soon as it has found at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]               # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]          # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]          # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}       # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}           # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}       # plane 16
        )+%xs', $string);
}

?>
telemach
28-Jul-2005 09:48
Beware: even if you need to distinguish between UTF-8 and ISO-8859-1, and you use the following detection order (as Chrigu suggests),

mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding().
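
The function signature at the top of this page lists a third parameter, strict, which may help with this case. This is a minimal sketch, assuming the mbstring extension is loaded and that the input is ISO-8859-1 text whose last byte is an accented character. The variable names are illustrative, and the bytes are written as escapes so the script file's own encoding does not affect the result.

```php
<?php
// "accentué" as ISO-8859-1 bytes: the final é is the single byte 0xE9.
$latin1 = "accentu\xE9";

// With strict set to true, a string that ends in the middle of a
// multibyte sequence is not accepted as UTF-8, so the next candidate
// in the list is reported instead.
$detected = mb_detect_encoding($latin1, array('UTF-8', 'ISO-8859-1'), true);

echo $detected, "\n";
```

In UTF-8, 0xE9 opens a three-byte sequence, so a string that ends right after it is incomplete rather than valid. In strict mode this should print ISO-8859-1.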
Chrigu
29-Mar-2005 11:32
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
php-note-2005 at ryandesign dot com
17-Feb-2005 11:57
A much simpler UTF-8 validity checker, using a regular expression published by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {
    // From http://w3.org/International/questions/qa-forms-utf-8.html
    // preg_match() returns an integer, so compare against 1 to get a boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string) === 1;
}

?>
jaaks at playtech dot com
14-Jan-2005 04:27
The valid_utf8() example in the note below has one small bug. If a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte character, it is accepted, although this violates the UTF-8 rules. Make the following replacement to fix it.

Replace
         } // goto next char
with
         } else {
           return false; // 10xxxxxx occurring alone
         } // goto next char
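
For illustration, here is the same lead-byte and continuation-byte check with this fix folded in. It is a condensed sketch, not the original author's code, and the function name valid_utf8_fixed() is hypothetical. Like the original, it checks only the byte structure and does not reject overlong forms or surrogates.

```php
<?php
// Hypothetical condensed variant of valid_utf8() with the fix applied.
function valid_utf8_fixed($string)
{
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $byte = ord($string[$i++]);
        if (($byte & 0x80) === 0x00) {
            $trail = 0;        // 0xxxxxxx: single-byte character
        } elseif (($byte & 0xE0) === 0xC0) {
            $trail = 1;        // 110xxxxx: one continuation byte follows
        } elseif (($byte & 0xF0) === 0xE0) {
            $trail = 2;        // 1110xxxx: two continuation bytes follow
        } elseif (($byte & 0xF8) === 0xF0) {
            $trail = 3;        // 11110xxx: three continuation bytes follow
        } else {
            return false;      // 10xxxxxx occurring alone, or 0xF8-0xFF
        }
        // Every expected continuation byte must exist and match 10xxxxxx.
        while ($trail-- > 0) {
            if ($i >= $len || (ord($string[$i++]) & 0xC0) !== 0x80) {
                return false;
            }
        }
    }
    return true;
}

var_dump(valid_utf8_fixed("caf\xC3\xA9")); // well-formed 2-byte sequence
var_dump(valid_utf8_fixed("\x80"));        // lone continuation byte
```

The first call should report true and the second false. An overlong sequence such as "\xC0\x80" still passes, which is why the regular-expression notes above exclude those ranges explicitly.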
maarten
13-Jan-2005 07:55
Sometimes mb_detect_encoding() is not what you need. When using PDFlib, for example, you want to VERIFY that the text is correct UTF-8, and mb_detect_encoding() reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
   function valid_1byte($char) {
       if(!is_int($char)) return false;
       return ($char & 0x80) == 0x00;
   }
  
   function valid_2byte($char) {
       if(!is_int($char)) return false;
       return ($char & 0xE0) == 0xC0;
   }

   function valid_3byte($char) {
       if(!is_int($char)) return false;
       return ($char & 0xF0) == 0xE0;
   }

   function valid_4byte($char) {
       if(!is_int($char)) return false;
       return ($char & 0xF8) == 0xF0;
   }
  
   function valid_nextbyte($char) {
       if(!is_int($char)) return false;
       return ($char & 0xC0) == 0x80;
   }
  
   function valid_utf8($string) {
       $len = strlen($string);
       $i = 0;   
       while( $i < $len ) {
           $char = ord(substr($string, $i++, 1));
           if(valid_1byte($char)) {    // continue
               continue;
           } else if(valid_2byte($char)) { // check 1 byte
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
           } else if(valid_3byte($char)) { // check 2 bytes
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
           } else if(valid_4byte($char)) { // check 3 bytes
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
               if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                   return false;
           } // goto next char
       }
       return true; // done
   }

For a drawing of the state machine, see: http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png