PHP
Last updated: Fri, 10 Oct 2008


mb_detect_encoding

(PHP 4 >= 4.0.6, PHP 5)

mb_detect_encoding - Detects an encoding

Description

string mb_detect_encoding ( string $str [, mixed $encoding_list [, bool $strict ]] )

Detects the encoding used by the string str.

Parameters

str

The string to examine.

encoding_list

encoding_list is a list of encodings, given either as an array or as a comma-separated string.

If encoding_list is omitted, the order set by mb_detect_order() is used.

strict

strict specifies whether to use strict encoding detection. Defaults to FALSE.

Return Values

The detected encoding.

Examples

Example #1 mb_detect_encoding() example

<?php
/* Detect the encoding using the default settings */
echo mb_detect_encoding($str);

/* "auto" means "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
echo mb_detect_encoding($str, "auto");

/* Specify the candidate encodings as a comma-separated list */
echo mb_detect_encoding($str, "JIS, eucjp-win, sjis-win");

/* Specify the candidate encodings as an array */
$ary[] = "ASCII";
$ary[] = "JIS";
$ary[] = "EUC-JP";
echo mb_detect_encoding($str, $ary);
?>
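
The fallback to mb_detect_order() described for encoding_list can be illustrated with a minimal sketch. It assumes the mbstring extension is loaded; the sample string and the variable names are only illustrative.

```php
<?php
// Set the detection order that is used when encoding_list is omitted.
mb_detect_order(array('ASCII', 'UTF-8'));

// No encoding_list: the order set above is consulted.
$detected = mb_detect_encoding('plain text');

// Passing the current order explicitly should give the same answer.
$explicit = mb_detect_encoding('plain text', mb_detect_order());

var_dump($detected === $explicit);
```

Called without arguments, mb_detect_order() returns the current order as an array, which is why it can be passed straight back in as encoding_list.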

See Also



User Contributed Notes
mb_detect_encoding
dennis at nikolaenko dot ru
06-Oct-2008 06:18
Beware of a bug when detecting Russian encodings:
http://bugs.php.net/bug.php?id=38138
hmdker at gmail dot com
24-Aug-2008 06:58
A function to detect UTF-8. It may be useful when mb_detect_encoding is not available.

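<?php
function is_utf8($str) {
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $c = ord($str[$i]);
        // >= 128 (not > 128) so that a lone 0x80 byte is rejected too
        if ($c >= 128) {
            // The lead byte determines the total length of the sequence
            if ($c >= 254) return false;
            elseif ($c >= 252) $bits = 6;
            elseif ($c >= 248) $bits = 5;
            elseif ($c >= 240) $bits = 4;
            elseif ($c >= 224) $bits = 3;
            elseif ($c >= 192) $bits = 2;
            else return false; // continuation byte without a lead byte
            // The sequence must not run past the end of the string
            if (($i + $bits) > $len) return false;
            // Every following byte must be a continuation byte (10xxxxxx)
            while ($bits > 1) {
                $i++;
                $b = ord($str[$i]);
                if ($b < 128 || $b > 191) return false;
                $bits--;
            }
        }
    }
    return true;
}
?>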
yaqy at qq dot com
21-Jul-2008 07:14
<?php
/*
 * QQ: 290359552
 * Convert $str to UTF-8 if it is not detected as 'UTF-8'
 */
function convToUtf8($str)
{
    if (mb_detect_encoding($str, "UTF-8, ISO-8859-1, GBK") != "UTF-8") {
        // Anything not detected as UTF-8 is assumed to be GBK
        return iconv("gbk", "utf-8", $str);
    } else {
        return $str;
    }
}
?>
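
A variation on the idea above, as a rough sketch only: convert from whichever encoding was actually detected instead of a fixed source encoding. The function name to_utf8 and the default candidate list are hypothetical, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper: returns $str as UTF-8, or false if no candidate matches.
function to_utf8($str, $candidates = array('UTF-8', 'ISO-8859-1'))
{
    // Strict detection, so invalid byte sequences rule a candidate out.
    $detected = mb_detect_encoding($str, $candidates, true);
    if ($detected === false) {
        return false;
    }
    if ($detected === 'UTF-8') {
        return $str; // already UTF-8, nothing to convert
    }
    return mb_convert_encoding($str, 'UTF-8', $detected);
}

// "\xE9" is e-acute in ISO-8859-1; it is not valid UTF-8 on its own.
echo bin2hex(to_utf8("\xE9")), "\n";
```

Since ISO-8859-1 assigns a meaning to every byte value, it tends to match any input, so candidates listed after it are unlikely to be reached.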
hoermann dot j at gmail dot com
20-Mar-2008 02:35
Referring to the bug in mb_detect_encoding described by telemach (http://de2.php.net/manual/de/function.mb-detect-encoding.php#55228), I want to give a simple workaround.

Because
<?php
mb_detect_encoding('accentué', 'UTF-8, ISO-8859-1');
?>
leads to a wrong result (UTF-8) but
<?php
mb_detect_encoding('accentuée', 'UTF-8, ISO-8859-1');
?>
does not, you should always append an ISO-8859-1 character to your string for this check.

Do this:
<?php
mb_detect_encoding($myVal . 'a', 'UTF-8, ISO-8859-1');
?>
This avoids the situation in which the error occurs and does not modify your variable. It will still work if the bug in the function is fixed one day.
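
Another option that may be worth trying is the strict parameter from the function signature. The sketch below assumes the mbstring extension is loaded; the input is written with an explicit byte escape so that it is ISO-8859-1 regardless of how this file is saved.

```php
<?php
// "accentu\xE9" is the ISO-8859-1 byte sequence for a word ending in e-acute.
$latin1 = "accentu\xE9";

// In strict mode the truncated UTF-8 sequence at the end of the string
// disqualifies UTF-8, so detection falls through to ISO-8859-1.
$detected = mb_detect_encoding($latin1, 'UTF-8, ISO-8859-1', true);

var_dump($detected);
```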
mark at kinoko dot fr
12-Oct-2007 04:56
For: rl at itfigures dot nl

Just note that your Euro symbol being \x80 is NOT standard for ISO-8859-1 or ISO-8859-15 as \x80 is a reserved character.

It is however "common practice" for Windows developers to mix up windows-1252 and ISO-8859-1. Just convert to windows-1252 instead of ISO-8859-1 and you'll get your € symbol in the right place.
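
A minimal sketch of that suggestion, assuming the iconv extension is available and that the underlying iconv library knows the charset name "CP1252" (a common name for windows-1252):

```php
<?php
$euro_utf8 = "\xE2\x82\xAC"; // U+20AC EURO SIGN encoded as UTF-8

// Convert to windows-1252 rather than ISO-8859-1.
$euro_cp1252 = iconv('UTF-8', 'CP1252', $euro_utf8);

// Show the resulting byte in hex; windows-1252 places the euro sign at 0x80.
echo bin2hex($euro_cp1252), "\n";
```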
rl at itfigures dot nl
04-Sep-2007 11:00
I used Chris's function "detectUTF8" to detect the need for conversion from UTF-8 to 8859-1, which works fine. I did have a problem with the subsequent iconv conversion.

The problem is that the iconv conversion to 8859-1 (with //TRANSLIT) replaces the euro sign with EUR, although it is common practice that \x80 is used as the euro sign in the 8859-1 charset.

I could not use 8859-15 since that mangled some other characters, so I added two str_replace calls:

if(detectUTF8($str)){
  $str=str_replace("\xE2\x82\xAC","&euro;",$str);
  $str=iconv("UTF-8","ISO-8859-1//TRANSLIT",$str);
  $str=str_replace("&euro;","\x80",$str);
}

If html-output is needed the last line is not necessary (and even unwanted).
sunggsun
15-Aug-2006 09:26
from PHPDIG

    function isUTF8($str) {
        if ($str === mb_convert_encoding(mb_convert_encoding($str, "UTF-32", "UTF-8"), "UTF-8", "UTF-32")) {
            return true;
        } else {
            return false;
        }
    }
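
Where the mbstring extension provides mb_check_encoding(), a similar validity check can be sketched more directly. The wrapper name is_utf8_mb is hypothetical.

```php
<?php
// Hypothetical wrapper: true if $str is a valid UTF-8 byte sequence.
function is_utf8_mb($str)
{
    return mb_check_encoding($str, 'UTF-8');
}

var_dump(is_utf8_mb("caf\xC3\xA9")); // valid two-byte sequence for e-acute
var_dump(is_utf8_mb("caf\xE9"));     // lone ISO-8859-1 byte, not valid UTF-8
```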
chris AT w3style.co DOT uk
03-Aug-2006 11:22
Based upon the snippet below that uses preg_match(), I needed something faster and less specific. That function works and is brilliant, but it scans the entire string and checks that it conforms to UTF-8. I wanted something purely to check whether a string contains UTF-8 characters, so that I could switch the character encoding from iso-8859-1 to utf-8.

I modified the pattern to look only for non-ASCII multibyte sequences in the UTF-8 range and to stop once it finds at least one multibyte sequence. This is quite a lot faster.

<?php

function detectUTF8($string)
{
    return preg_match('%(?:
        [\xC2-\xDF][\x80-\xBF]              # non-overlong 2-byte
        |\xE0[\xA0-\xBF][\x80-\xBF]         # excluding overlongs
        |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |\xED[\x80-\x9F][\x80-\xBF]         # excluding surrogates
        |\xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        |[\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |\xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
        )+%xs', $string);
}

?>
telemach
28-Jul-2005 03:48
Beware: even if you use the following detection order (as Chrigu suggests) to distinguish between UTF-8 and ISO-8859-1, the result depends on the last character:

mb_detect_encoding('accentuée' , 'UTF-8, ISO-8859-1')

returns ISO-8859-1, while

mb_detect_encoding('accentué' , 'UTF-8, ISO-8859-1')

returns UTF-8.

Bottom line: a trailing 'é' (and probably other accented characters) misleads mb_detect_encoding.
Chrigu
29-Mar-2005 05:32
If you need to distinguish between UTF-8 and ISO-8859-1 encoding, list UTF-8 first in your encoding_list:
mb_detect_encoding($string, 'UTF-8, ISO-8859-1');

If you list ISO-8859-1 first, mb_detect_encoding() will always return ISO-8859-1.
php-note-2005 at ryandesign dot com
17-Feb-2005 04:57
Much simpler UTF-8-ness checker using a regular expression created by the W3C:

<?php

// Returns true if $string is valid UTF-8 and false otherwise.
function is_utf8($string) {

    // From http://w3.org/International/questions/qa-forms-utf-8.html
    // preg_match() returns 1 or 0, so compare to get a real boolean.
    return preg_match('%^(?:
          [\x09\x0A\x0D\x20-\x7E]            # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
        |  \xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
        |  \xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
        |  \xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        |  \xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
    )*$%xs', $string) === 1;

} // function is_utf8

?>
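
A shorter variant that may be worth knowing relies on PCRE's own UTF-8 validation instead of a hand-written pattern. This is a sketch with a hypothetical function name; it assumes PCRE was built with UTF-8 support, so that the u modifier makes preg_match() fail on a subject that is not valid UTF-8.

```php
<?php
// Hypothetical helper: true if PCRE accepts $string as valid UTF-8.
function is_utf8_pcre($string)
{
    // The empty pattern matches any valid subject; with the u modifier,
    // an invalid UTF-8 subject makes preg_match() return false instead of 1.
    return preg_match('//u', $string) === 1;
}

var_dump(is_utf8_pcre("caf\xC3\xA9"));
var_dump(is_utf8_pcre("caf\xE9"));
```

Unlike the W3C pattern, this check does not reject ASCII control characters, so the two are not interchangeable for form validation.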
jaaks at playtech dot com
14-Jan-2005 09:27
The last example for verifying UTF-8 has one small bug. If a 10xxxxxx byte occurs alone, i.e. not as part of a multibyte character, it is accepted even though that violates the UTF-8 rules. Make the following replacement to fix it.

Replace
         } // goto next char
with
         } else {
           return false; // 10xxxxxx occurring alone
         } // goto next char
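
The effect of rejecting a stray 10xxxxxx byte can be seen in a compact, table-free sketch of the same kind of structural check. The function names are hypothetical and this is not the original author's code. It checks only lead and continuation byte patterns, so overlong forms and surrogates are not rejected.

```php
<?php
// Hypothetical helper: number of continuation bytes expected after a lead
// byte, or -1 if the byte cannot start a character.
function expected_continuations($byte)
{
    if (($byte & 0x80) === 0x00) return 0; // 0xxxxxxx: single byte
    if (($byte & 0xE0) === 0xC0) return 1; // 110xxxxx
    if (($byte & 0xF0) === 0xE0) return 2; // 1110xxxx
    if (($byte & 0xF8) === 0xF0) return 3; // 11110xxx
    return -1;                             // 10xxxxxx alone, or 11111xxx
}

function valid_utf8_structure($string)
{
    $len = strlen($string);
    $i = 0;
    while ($i < $len) {
        $need = expected_continuations(ord($string[$i++]));
        // Reject invalid lead bytes and sequences cut off by the end of input.
        if ($need < 0 || $i + $need > $len) return false;
        for (; $need > 0; $need--) {
            // Each continuation byte must look like 10xxxxxx.
            if ((ord($string[$i++]) & 0xC0) !== 0x80) return false;
        }
    }
    return true;
}

var_dump(valid_utf8_structure("\x80"));     // stray continuation byte
var_dump(valid_utf8_structure("\xC3\xA9")); // complete two-byte sequence
```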
maarten
13-Jan-2005 12:55
Sometimes mb_detect_encoding is not what you need. When using pdflib, for example, you want to VERIFY the correctness of UTF-8. mb_detect_encoding reports some ISO-8859-1 encoded text as UTF-8.
To verify UTF-8, use the following:

//
//    utf8 encoding validation developed based on Wikipedia entry at:
//    http://en.wikipedia.org/wiki/UTF-8
//
//    Implemented as a recursive descent parser based on a simple state machine
//    copyright 2005 Maarten Meijer
//
//    This cries out for a C-implementation to be included in PHP core
//
    function valid_1byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0x80) == 0x00;
    }
   
    function valid_2byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xE0) == 0xC0;
    }

    function valid_3byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF0) == 0xE0;
    }

    function valid_4byte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xF8) == 0xF0;
    }
   
    function valid_nextbyte($char) {
        if(!is_int($char)) return false;
        return ($char & 0xC0) == 0x80;
    }
   
    function valid_utf8($string) {
        $len = strlen($string);
        $i = 0;   
        while( $i < $len ) {
            $char = ord(substr($string, $i++, 1));
            if(valid_1byte($char)) {    // continue
                continue;
            } else if(valid_2byte($char)) { // check 1 byte
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_3byte($char)) { // check 2 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } else if(valid_4byte($char)) { // check 3 bytes
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
                if(!valid_nextbyte(ord(substr($string, $i++, 1))))
                    return false;
            } // goto next char
        }
        return true; // done
    }

For a drawing of the state machine, see http://www.xs4all.nl/~mjmeijer/unicode.png and http://www.xs4all.nl/~mjmeijer/unicode2.png
