我想拥有一个从Unicode字符串创建子段的函数,例如gen_slug('Andrés Cortez')
should return andres-cortez
。我该怎么办?
andres
。您确定输入的内容完全是“andrés”吗?
我想拥有一个从Unicode字符串创建子段的函数,例如gen_slug('Andrés Cortez')
should return andres-cortez
。我该怎么办?
andres
。您确定输入的内容完全是“andrés”吗?
Answers:
代替冗长的替换,请尝试以下方法:
public static function slugify($text)
{
// replace non letter or digits by -
$text = preg_replace('~[^\pL\d]+~u', '-', $text);
// transliterate
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
// remove unwanted characters
$text = preg_replace('~[^-\w]+~', '', $text);
// trim
$text = trim($text, '-');
// remove duplicate -
$text = preg_replace('~-+~', '-', $text);
// lowercase
$text = strtolower($text);
if (empty($text)) {
return 'n-a';
}
return $text;
}
这是基于Symfony的Jobeet教程中的内容。
^
开括号后的位置-它会使比赛逆转。
iconv
如果$text
包含不具有ascii等效项的字符,将无法正确转换。例如,iconv('utf-8', 'us-ascii//TRANSLIT', "EFI收购Cretaprint")
将返回"EFI"
并泄漏警告。
$text = trim($text, '-');
应该在末尾,否则Foo 收
变为foo-
。而且,Foo 收 Bar
变得foo--bar
(重复-
似乎多余)。
由于此答案引起了人们的关注,因此我添加了一些解释。
所提供的解决方案基本上将除AZ,az,0-9和&-(连字符)之外的所有内容替换为-(连字符)。因此,它不能与其他unicode字符(对于URL段/字符串有效的字符)一起正常工作。常见的情况是输入字符串包含非英语字符。
仅在确信输入字符串不会包含Unicode字符(您可能希望将其作为输出/子段的一部分)时,才使用此解决方案。
例如。“नारीशक््ि”将变为“ ----------”(所有连字符),而不是“नारी-श््ि”(有效的URL标记)。
怎么样...
$slug = strtolower(trim(preg_replace('/[^A-Za-z0-9-]+/', '-', $string)));
?
strtolower(trim(preg_replace('/[^A-Za-z0-9-]+/', '-', "Étienne")))
返回"-tienne"
而不是"etienne"
,因此它不适用于带重音符号的字符。
如果您安装了intl扩展名,则可以使用Transliterator :: transliterate函数轻松创建一个插件。
<?php
$string = 'Namnet på bildtävlingen';
$slug = \Transliterator::createFromRules(
':: Any-Latin;'
. ':: NFD;'
. ':: [:Nonspacing Mark:] Remove;'
. ':: NFC;'
. ':: [:Punctuation:] Remove;'
. ':: Lower();'
. '[:Separator:] > \'-\''
)
->transliterate( $string );
echo $slug; // namnet-pa-bildtavlingen
?>
注意:我是从wordpress上摘下来的,它有效!!
像这样使用它:
echo sanitize('testing this link');
码
//taken from wordpress
function utf8_uri_encode( $utf8_string, $length = 0 ) {
$unicode = '';
$values = array();
$num_octets = 1;
$unicode_length = 0;
$string_length = strlen( $utf8_string );
for ($i = 0; $i < $string_length; $i++ ) {
$value = ord( $utf8_string[ $i ] );
if ( $value < 128 ) {
if ( $length && ( $unicode_length >= $length ) )
break;
$unicode .= chr($value);
$unicode_length++;
} else {
if ( count( $values ) == 0 ) $num_octets = ( $value < 224 ) ? 2 : 3;
$values[] = $value;
if ( $length && ( $unicode_length + ($num_octets * 3) ) > $length )
break;
if ( count( $values ) == $num_octets ) {
if ($num_octets == 3) {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
$unicode_length += 9;
} else {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
$unicode_length += 6;
}
$values = array();
$num_octets = 1;
}
}
}
return $unicode;
}
//taken from wordpress
function seems_utf8($str) {
$length = strlen($str);
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; # 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
//function sanitize_title_with_dashes taken from wordpress
function sanitize($title) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = str_replace('.', '-', $title);
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
sanitize_title_with_dashes($string, null, 'save')
(注意额外的参数),否则您会得到一些混乱的字符代码,例如telstra%e2%80%99s-%e2%80%98all-roles-flex%e2%80%99
。不太漂亮 :-(
sanitize
是一个奇怪的,容易忘记的函数名称,它会生成一个段。
使用许多高级开发人员支持的现有解决方案总是一个好主意。最受欢迎的一种是https://github.com/cocur/slugify。首先,它支持多种语言,并且正在更新中。
如果您不想使用整个程序包,则可以复制所需的部分。
这是另一种,例如,“带有奇怪字符的标题éééAX Z”变为“ title-with-strange-characters-eee-axz”。
/**
* Function used to create a slug associated to an "ugly" string.
*
* @param string $string the string to transform.
*
* @return string the resulting slug.
*/
public static function createSlug($string) {
$table = array(
'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r', '/' => '-', ' ' => '-'
);
// -- Remove duplicated spaces
$stripped = preg_replace(array('/\s{2,}/', '/[\t\n]/'), ' ', $string);
// -- Returns the slug
return strtolower(strtr($string, $table));
}
@Imran Omar Bukhsh代码的更新版本(来自最新的Wordpress(4.0)分支):
<?php
// Add methods to slugify taken from Wordpress:
// - https://github.com/WordPress/WordPress/blob/master/wp-includes/formatting.php
// - https://github.com/WordPress/WordPress/blob/master/wp-includes/functions.php
/**
* Set the mbstring internal encoding to a binary safe encoding when func_overload
* is enabled.
*
* When mbstring.func_overload is in use for multi-byte encodings, the results from
* strlen() and similar functions respect the utf8 characters, causing binary data
* to return incorrect lengths.
*
* This function overrides the mbstring encoding to a binary-safe encoding, and
* resets it to the users expected encoding afterwards through the
* `reset_mbstring_encoding` function.
*
* It is safe to recursively call this function, however each
* `mbstring_binary_safe_encoding()` call must be followed up with an equal number
* of `reset_mbstring_encoding()` calls.
*
* @since 3.7.0
*
* @see reset_mbstring_encoding()
*
* @param bool $reset Optional. Whether to reset the encoding back to a previously-set encoding.
* Default false.
*/
function mbstring_binary_safe_encoding( $reset = false ) {
static $encodings = array();
static $overloaded = null;
if ( is_null( $overloaded ) )
$overloaded = function_exists( 'mb_internal_encoding' ) && ( ini_get( 'mbstring.func_overload' ) & 2 );
if ( false === $overloaded )
return;
if ( ! $reset ) {
$encoding = mb_internal_encoding();
array_push( $encodings, $encoding );
mb_internal_encoding( 'ISO-8859-1' );
}
if ( $reset && $encodings ) {
$encoding = array_pop( $encodings );
mb_internal_encoding( $encoding );
}
}
/**
* Reset the mbstring internal encoding to a users previously set encoding.
*
* @see mbstring_binary_safe_encoding()
*
* @since 3.7.0
*/
function reset_mbstring_encoding() {
mbstring_binary_safe_encoding( true );
}
/**
* Checks to see if a string is utf8 encoded.
*
* NOTE: This function checks for 5-Byte sequences, UTF8
* has Bytes Sequences with a maximum length of 4.
*
* @author bmorel at ssi dot fr (modified)
* @since 1.2.1
*
* @param string $str The string to be checked
* @return bool True if $str fits a UTF-8 model, false otherwise.
*/
function seems_utf8($str) {
mbstring_binary_safe_encoding();
$length = strlen($str);
reset_mbstring_encoding();
for ($i=0; $i < $length; $i++) {
$c = ord($str[$i]);
if ($c < 0x80) $n = 0; # 0bbbbbbb
elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
else return false; # Does not match any model
for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
return false;
}
}
return true;
}
/**
* Encode the Unicode values to be used in the URI.
*
* @since 1.5.0
*
* @param string $utf8_string
* @param int $length Max length of the string
* @return string String with Unicode encoded for URI.
*/
function utf8_uri_encode( $utf8_string, $length = 0 ) {
$unicode = '';
$values = array();
$num_octets = 1;
$unicode_length = 0;
mbstring_binary_safe_encoding();
$string_length = strlen( $utf8_string );
reset_mbstring_encoding();
for ($i = 0; $i < $string_length; $i++ ) {
$value = ord( $utf8_string[ $i ] );
if ( $value < 128 ) {
if ( $length && ( $unicode_length >= $length ) )
break;
$unicode .= chr($value);
$unicode_length++;
} else {
if ( count( $values ) == 0 ) $num_octets = ( $value < 224 ) ? 2 : 3;
$values[] = $value;
if ( $length && ( $unicode_length + ($num_octets * 3) ) > $length )
break;
if ( count( $values ) == $num_octets ) {
if ($num_octets == 3) {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]) . '%' . dechex($values[2]);
$unicode_length += 9;
} else {
$unicode .= '%' . dechex($values[0]) . '%' . dechex($values[1]);
$unicode_length += 6;
}
$values = array();
$num_octets = 1;
}
}
}
return $unicode;
}
/**
* Sanitizes a title, replacing whitespace and a few other characters with dashes.
*
* Limits the output to alphanumeric characters, underscore (_) and dash (-).
* Whitespace becomes a dash.
*
* @since 1.2.0
*
* @param string $title The title to be sanitized.
* @param string $raw_title Optional. Not used.
* @param string $context Optional. The operation for which the string is sanitized.
* @return string The sanitized title.
*/
function sanitize_title_with_dashes( $title, $raw_title = '', $context = 'display' ) {
$title = strip_tags($title);
// Preserve escaped octets.
$title = preg_replace('|%([a-fA-F0-9][a-fA-F0-9])|', '---$1---', $title);
// Remove percent signs that are not part of an octet.
$title = str_replace('%', '', $title);
// Restore octets.
$title = preg_replace('|---([a-fA-F0-9][a-fA-F0-9])---|', '%$1', $title);
if (seems_utf8($title)) {
if (function_exists('mb_strtolower')) {
$title = mb_strtolower($title, 'UTF-8');
}
$title = utf8_uri_encode($title, 200);
}
$title = strtolower($title);
$title = preg_replace('/&.+?;/', '', $title); // kill entities
$title = str_replace('.', '-', $title);
if ( 'save' == $context ) {
// Convert nbsp, ndash and mdash to hyphens
$title = str_replace( array( '%c2%a0', '%e2%80%93', '%e2%80%94' ), '-', $title );
// Strip these characters entirely
$title = str_replace( array(
// iexcl and iquest
'%c2%a1', '%c2%bf',
// angle quotes
'%c2%ab', '%c2%bb', '%e2%80%b9', '%e2%80%ba',
// curly quotes
'%e2%80%98', '%e2%80%99', '%e2%80%9c', '%e2%80%9d',
'%e2%80%9a', '%e2%80%9b', '%e2%80%9e', '%e2%80%9f',
// copy, reg, deg, hellip and trade
'%c2%a9', '%c2%ae', '%c2%b0', '%e2%80%a6', '%e2%84%a2',
// acute accents
'%c2%b4', '%cb%8a', '%cc%81', '%cd%81',
// grave accent, macron, caron
'%cc%80', '%cc%84', '%cc%8c',
), '', $title );
// Convert times to x
$title = str_replace( '%c3%97', 'x', $title );
}
$title = preg_replace('/[^%a-z0-9 _-]/', '', $title);
$title = preg_replace('/\s+/', '-', $title);
$title = preg_replace('|-+|', '-', $title);
$title = trim($title, '-');
return $title;
}
$title = '#PFW Alexander McQueen Spring/Summer 2015';
echo "title -> slug: \n". $title ." -> ". sanitize_title_with_dashes($title);
echo "\n\n";
$title = '«GQ»: Elyas M\'Barek gehört zu Männern des Jahres';
echo "title -> slug: \n". $title ." -> ". sanitize_title_with_dashes($title);
查看在线示例。
%c2%abgq%c2%bb-elyas-mbarek-geh%c3%b6rt-zu-m%c3%a4nnern-des-jahres
public static function slugify ($text) {
$replace = [
'<' => '', '>' => '', ''' => '', '&' => '',
'"' => '', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä'=> 'Ae',
'Ä' => 'A', 'Å' => 'A', 'Ā' => 'A', 'Ą' => 'A', 'Ă' => 'A', 'Æ' => 'Ae',
'Ç' => 'C', 'Ć' => 'C', 'Č' => 'C', 'Ĉ' => 'C', 'Ċ' => 'C', 'Ď' => 'D', 'Đ' => 'D',
'Ð' => 'D', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ē' => 'E',
'Ę' => 'E', 'Ě' => 'E', 'Ĕ' => 'E', 'Ė' => 'E', 'Ĝ' => 'G', 'Ğ' => 'G',
'Ġ' => 'G', 'Ģ' => 'G', 'Ĥ' => 'H', 'Ħ' => 'H', 'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I', 'Ī' => 'I', 'Ĩ' => 'I', 'Ĭ' => 'I', 'Į' => 'I',
'İ' => 'I', 'IJ' => 'IJ', 'Ĵ' => 'J', 'Ķ' => 'K', 'Ł' => 'K', 'Ľ' => 'K',
'Ĺ' => 'K', 'Ļ' => 'K', 'Ŀ' => 'K', 'Ñ' => 'N', 'Ń' => 'N', 'Ň' => 'N',
'Ņ' => 'N', 'Ŋ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'Oe', 'Ö' => 'Oe', 'Ø' => 'O', 'Ō' => 'O', 'Ő' => 'O', 'Ŏ' => 'O',
'Œ' => 'OE', 'Ŕ' => 'R', 'Ř' => 'R', 'Ŗ' => 'R', 'Ś' => 'S', 'Š' => 'S',
'Ş' => 'S', 'Ŝ' => 'S', 'Ș' => 'S', 'Ť' => 'T', 'Ţ' => 'T', 'Ŧ' => 'T',
'Ț' => 'T', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'Ue', 'Ū' => 'U',
'Ü' => 'Ue', 'Ů' => 'U', 'Ű' => 'U', 'Ŭ' => 'U', 'Ũ' => 'U', 'Ų' => 'U',
'Ŵ' => 'W', 'Ý' => 'Y', 'Ŷ' => 'Y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'Ž' => 'Z',
'Ż' => 'Z', 'Þ' => 'T', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a',
'ä' => 'ae', 'ä' => 'ae', 'å' => 'a', 'ā' => 'a', 'ą' => 'a', 'ă' => 'a',
'æ' => 'ae', 'ç' => 'c', 'ć' => 'c', 'č' => 'c', 'ĉ' => 'c', 'ċ' => 'c',
'ď' => 'd', 'đ' => 'd', 'ð' => 'd', 'è' => 'e', 'é' => 'e', 'ê' => 'e',
'ë' => 'e', 'ē' => 'e', 'ę' => 'e', 'ě' => 'e', 'ĕ' => 'e', 'ė' => 'e',
'ƒ' => 'f', 'ĝ' => 'g', 'ğ' => 'g', 'ġ' => 'g', 'ģ' => 'g', 'ĥ' => 'h',
'ħ' => 'h', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ī' => 'i',
'ĩ' => 'i', 'ĭ' => 'i', 'į' => 'i', 'ı' => 'i', 'ij' => 'ij', 'ĵ' => 'j',
'ķ' => 'k', 'ĸ' => 'k', 'ł' => 'l', 'ľ' => 'l', 'ĺ' => 'l', 'ļ' => 'l',
'ŀ' => 'l', 'ñ' => 'n', 'ń' => 'n', 'ň' => 'n', 'ņ' => 'n', 'ʼn' => 'n',
'ŋ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'oe',
'ö' => 'oe', 'ø' => 'o', 'ō' => 'o', 'ő' => 'o', 'ŏ' => 'o', 'œ' => 'oe',
'ŕ' => 'r', 'ř' => 'r', 'ŗ' => 'r', 'š' => 's', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'ue', 'ū' => 'u', 'ü' => 'ue', 'ů' => 'u', 'ű' => 'u',
'ŭ' => 'u', 'ũ' => 'u', 'ų' => 'u', 'ŵ' => 'w', 'ý' => 'y', 'ÿ' => 'y',
'ŷ' => 'y', 'ž' => 'z', 'ż' => 'z', 'ź' => 'z', 'þ' => 't', 'ß' => 'ss',
'ſ' => 'ss', 'ый' => 'iy', 'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G',
'Д' => 'D', 'Е' => 'E', 'Ё' => 'YO', 'Ж' => 'ZH', 'З' => 'Z', 'И' => 'I',
'Й' => 'Y', 'К' => 'K', 'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O',
'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F',
'Х' => 'H', 'Ц' => 'C', 'Ч' => 'CH', 'Ш' => 'SH', 'Щ' => 'SCH', 'Ъ' => '',
'Ы' => 'Y', 'Ь' => '', 'Э' => 'E', 'Ю' => 'YU', 'Я' => 'YA', 'а' => 'a',
'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ё' => 'yo',
'ж' => 'zh', 'з' => 'z', 'и' => 'i', 'й' => 'y', 'к' => 'k', 'л' => 'l',
'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
'ш' => 'sh', 'щ' => 'sch', 'ъ' => '', 'ы' => 'y', 'ь' => '', 'э' => 'e',
'ю' => 'yu', 'я' => 'ya'
];
// make a human readable string
$text = strtr($text, $replace);
// replace non letter or digits by -
$text = preg_replace('~[^\\pL\d.]+~u', '-', $text);
// trim
$text = trim($text, '-');
// remove unwanted characters
$text = preg_replace('~[^-\w.]+~', '', $text);
$text = strtolower($text);
return $text;
}
不要为此使用preg_replace。有一个仅用于该任务的php函数:strtr() http://php.net/manual/en/function.strtr.php
摘自上面链接的评论(我自己进行了测试;它可以正常工作:
function normalize ($string) {
$table = array(
'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r',
);
return strtr($string, $table);
}
我在用:
function slugify($text)
{
$text = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
return strtolower(preg_replace('/[^A-Za-z0-9-]+/', '-', $text));
}
唯一的后退是不会转换西里尔字母,我现在正在寻找对于每个西里尔字母都不长str_replace的解决方案。
set_locale('cyrillic.UTF-8')
。确切值取决于您安装的语言环境。
我不知道该使用哪个,所以我在phptester.net上做了一个快速实验
<?php
// First test
// https://stackoverflow.com/a/42740874/10232729
function slugify(STRING $string, STRING $separator = '-'){
$accents_regex = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i';
$special_cases = [ '&' => 'and', "'" => ''];
$string = mb_strtolower( trim( $string ), 'UTF-8' );
$string = str_replace( array_keys($special_cases), array_values( $special_cases), $string );
$string = preg_replace( $accents_regex, '$1', htmlentities( $string, ENT_QUOTES, 'UTF-8' ) );
$string = preg_replace('/[^a-z0-9]/u', $separator, $string);
return preg_replace('/['.$separator.']+/u', $separator, $string);
}
// Second test
// https://stackoverflow.com/a/13331948/10232729
function slug(STRING $string, STRING $separator = '-'){
$string = transliterator_transliterate('Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC; [:Punctuation:] Remove; Lower();', $string);
return str_replace(' ', $separator, $string);;
}
// Third test - My choice
// https://stackoverflow.com/a/38066136/10232729
function slugbis($text){
$replace = [
'<' => '', '>' => '', '-' => ' ', '&' => '',
'"' => '', 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä'=> 'Ae',
'Ä' => 'A', 'Å' => 'A', 'Ā' => 'A', 'Ą' => 'A', 'Ă' => 'A', 'Æ' => 'Ae',
'Ç' => 'C', 'Ć' => 'C', 'Č' => 'C', 'Ĉ' => 'C', 'Ċ' => 'C', 'Ď' => 'D', 'Đ' => 'D',
'Ð' => 'D', 'È' => 'E', 'É' => 'E', 'Ê' => 'E', 'Ë' => 'E', 'Ē' => 'E',
'Ę' => 'E', 'Ě' => 'E', 'Ĕ' => 'E', 'Ė' => 'E', 'Ĝ' => 'G', 'Ğ' => 'G',
'Ġ' => 'G', 'Ģ' => 'G', 'Ĥ' => 'H', 'Ħ' => 'H', 'Ì' => 'I', 'Í' => 'I',
'Î' => 'I', 'Ï' => 'I', 'Ī' => 'I', 'Ĩ' => 'I', 'Ĭ' => 'I', 'Į' => 'I',
'İ' => 'I', 'IJ' => 'IJ', 'Ĵ' => 'J', 'Ķ' => 'K', 'Ł' => 'K', 'Ľ' => 'K',
'Ĺ' => 'K', 'Ļ' => 'K', 'Ŀ' => 'K', 'Ñ' => 'N', 'Ń' => 'N', 'Ň' => 'N',
'Ņ' => 'N', 'Ŋ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O', 'Õ' => 'O',
'Ö' => 'Oe', 'Ö' => 'Oe', 'Ø' => 'O', 'Ō' => 'O', 'Ő' => 'O', 'Ŏ' => 'O',
'Œ' => 'OE', 'Ŕ' => 'R', 'Ř' => 'R', 'Ŗ' => 'R', 'Ś' => 'S', 'Š' => 'S',
'Ş' => 'S', 'Ŝ' => 'S', 'Ș' => 'S', 'Ť' => 'T', 'Ţ' => 'T', 'Ŧ' => 'T',
'Ț' => 'T', 'Ù' => 'U', 'Ú' => 'U', 'Û' => 'U', 'Ü' => 'Ue', 'Ū' => 'U',
'Ü' => 'Ue', 'Ů' => 'U', 'Ű' => 'U', 'Ŭ' => 'U', 'Ũ' => 'U', 'Ų' => 'U',
'Ŵ' => 'W', 'Ý' => 'Y', 'Ŷ' => 'Y', 'Ÿ' => 'Y', 'Ź' => 'Z', 'Ž' => 'Z',
'Ż' => 'Z', 'Þ' => 'T', 'à' => 'a', 'á' => 'a', 'â' => 'a', 'ã' => 'a',
'ä' => 'ae', 'ä' => 'ae', 'å' => 'a', 'ā' => 'a', 'ą' => 'a', 'ă' => 'a',
'æ' => 'ae', 'ç' => 'c', 'ć' => 'c', 'č' => 'c', 'ĉ' => 'c', 'ċ' => 'c',
'ď' => 'd', 'đ' => 'd', 'ð' => 'd', 'è' => 'e', 'é' => 'e', 'ê' => 'e',
'ë' => 'e', 'ē' => 'e', 'ę' => 'e', 'ě' => 'e', 'ĕ' => 'e', 'ė' => 'e',
'ƒ' => 'f', 'ĝ' => 'g', 'ğ' => 'g', 'ġ' => 'g', 'ģ' => 'g', 'ĥ' => 'h',
'ħ' => 'h', 'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ī' => 'i',
'ĩ' => 'i', 'ĭ' => 'i', 'į' => 'i', 'ı' => 'i', 'ij' => 'ij', 'ĵ' => 'j',
'ķ' => 'k', 'ĸ' => 'k', 'ł' => 'l', 'ľ' => 'l', 'ĺ' => 'l', 'ļ' => 'l',
'ŀ' => 'l', 'ñ' => 'n', 'ń' => 'n', 'ň' => 'n', 'ņ' => 'n', 'ʼn' => 'n',
'ŋ' => 'n', 'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'oe',
'ö' => 'oe', 'ø' => 'o', 'ō' => 'o', 'ő' => 'o', 'ŏ' => 'o', 'œ' => 'oe',
'ŕ' => 'r', 'ř' => 'r', 'ŗ' => 'r', 'š' => 's', 'ù' => 'u', 'ú' => 'u',
'û' => 'u', 'ü' => 'ue', 'ū' => 'u', 'ü' => 'ue', 'ů' => 'u', 'ű' => 'u',
'ŭ' => 'u', 'ũ' => 'u', 'ų' => 'u', 'ŵ' => 'w', 'ý' => 'y', 'ÿ' => 'y',
'ŷ' => 'y', 'ž' => 'z', 'ż' => 'z', 'ź' => 'z', 'þ' => 't', 'ß' => 'ss',
'ſ' => 'ss', 'ый' => 'iy', 'А' => 'A', 'Б' => 'B', 'В' => 'V', 'Г' => 'G',
'Д' => 'D', 'Е' => 'E', 'Ё' => 'YO', 'Ж' => 'ZH', 'З' => 'Z', 'И' => 'I',
'Й' => 'Y', 'К' => 'K', 'Л' => 'L', 'М' => 'M', 'Н' => 'N', 'О' => 'O',
'П' => 'P', 'Р' => 'R', 'С' => 'S', 'Т' => 'T', 'У' => 'U', 'Ф' => 'F',
'Х' => 'H', 'Ц' => 'C', 'Ч' => 'CH', 'Ш' => 'SH', 'Щ' => 'SCH', 'Ъ' => '',
'Ы' => 'Y', 'Ь' => '', 'Э' => 'E', 'Ю' => 'YU', 'Я' => 'YA', 'а' => 'a',
'б' => 'b', 'в' => 'v', 'г' => 'g', 'д' => 'd', 'е' => 'e', 'ё' => 'yo',
'ж' => 'zh', 'з' => 'z', 'и' => 'i', 'й' => 'y', 'к' => 'k', 'л' => 'l',
'м' => 'm', 'н' => 'n', 'о' => 'o', 'п' => 'p', 'р' => 'r', 'с' => 's',
'т' => 't', 'у' => 'u', 'ф' => 'f', 'х' => 'h', 'ц' => 'c', 'ч' => 'ch',
'ш' => 'sh', 'щ' => 'sch', 'ъ' => '', 'ы' => 'y', 'ь' => '', 'э' => 'e',
'ю' => 'yu', 'я' => 'ya'
];
// make a human readable string
$text = strtr($text, $replace);
// replace non letter or digits by -
$text = preg_replace('~[^\pL\d.]+~u', '-', $text);
// trim
$text = trim($text, '-');
// remove unwanted characters
$text = preg_replace('~[^-\w.]+~', '', $text);
return strtolower($text);
}
// Fourth test
// https://stackoverflow.com/a/2955521/10232729
function slugagain($string){
$table = [
'Š'=>'S', 'š'=>'s', 'Đ'=>'Dj', 'đ'=>'dj', 'Ž'=>'Z', 'ž'=>'z', 'Č'=>'C', 'č'=>'c', 'Ć'=>'C', 'ć'=>'c',
'À'=>'A', 'Á'=>'A', 'Â'=>'A', 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'A', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N', 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O',
'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U', 'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'Ss',
'à'=>'a', 'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'a', 'ç'=>'c', 'è'=>'e', 'é'=>'e',
'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i', 'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o',
'ô'=>'o', 'õ'=>'o', 'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'ý'=>'y', 'þ'=>'b',
'ÿ'=>'y', 'Ŕ'=>'R', 'ŕ'=>'r', ' '=>'-'
];
return strtr($string, $table);
}
// Fifth test
// https://stackoverflow.com/a/27396804/10232729
function slugifybis($url){
$url = trim($url);
$url = str_replace(' ', '-', $url);
$url = str_replace('/', '-slash-', $url);
return rawurlencode($url);
}
// Sixth and last test
// https://stackoverflow.com/a/39442034/10232729
setlocale( LC_ALL, "en_US.UTF8" );
function slugifyagain($string){
$string = iconv('utf-8', 'us-ascii//translit//ignore', $string); // transliterate
$string = str_replace("'", '', $string);
$string = preg_replace('~[^\pL\d]+~u', '-', $string); // replace non letter or non digits by "-"
$string = preg_replace('~[^-\w]+~', '', $string); // remove unwanted characters
$string = preg_replace('~-+~', '-', $string); // remove duplicate "-"
$string = trim($string, '-'); // trim "-"
$string = trim($string); // trim
$string = mb_strtolower($string, 'utf-8'); // lowercase
return urlencode($string); // safe;
};
$string = $newString = "¿ Àñdréß l'affreux ğarçon & nøël en forêt !";
$max = 10000;
echo '<pre>';
echo 'Beginning :';
echo '<br />';
echo '<br />';
echo '> Slugging '.$max.' iterations of following :';
echo '<br />';
echo '>> ' . $string;
echo '<br />';
echo '<br />';
echo 'Output results :';
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugify($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> First test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slug($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Second test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugbis($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Third test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugagain($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Fourth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugifybis($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Fifth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '<br />';
echo '<br />';
$start = microtime(true);
for($i = 0 ; $i < $max ; $i++){
$newString = slugifyagain($string);
}
$time = (microtime(true) - $start) * 1000;
echo '> Sixth test passed in **' . round($time, 2) . 'ms**';
echo '<br />';
echo '>> Result : ' . $newString;
echo '</pre>';
开始:
进行以下10000次迭代:
¿Àñdréßl'affreuxğarçon&nøëlenforêt!
输出结果:
第一次测试通过了120.78ms
结果:-iquest-andresz-laffreux-arcon和noel-en-Fort-
第二次测试通过3883.82ms
结果:-andreß-laffreux-garcon--nøel-en-foret-
第三次测试通过了56.83ms
结果:安德雷斯·l·阿夫勒·加尔康·诺埃尔·昂·福特
第四次测试通过了18.93ms
结果:¿-AndreSs-l'affreux-ğarcon-&-noel-en-foret-!
第五次测试在6.45毫秒内通过
结果:%C2%BF-%C3%80%C3%B1dr%C3%A9%C3%9F-1%27affreux-%C4%9Far%C3%A7on-%26-n%C3%B8%C3%AB1- en-for%C3%AAt-%21
第六次测试通过112.42ms
结果:andress-laffreux-garcon-n-el-en-foret
需要进一步测试。
编辑:更少的迭代测试
开始:
进行以下迭代100次:
¿Àñdréßl'affreuxğarçon&nøëlenforêt!
输出结果:
第一次测试通过了1.72ms
结果:-iquest-andresz-laffreux-arcon和noel-en-Fort-
第二次测试通过48.59ms
结果:-andreß-laffreux-garcon--nøel-en-foret-
第三次测试通过了0.91ms
结果:安德雷斯·l·阿夫勒·加尔康·诺埃尔·昂·福特
第四次测试通过0.3ms
结果:¿-AndreSs-l'affreux-ğarcon-&-noel-en-foret-!
第五次测试通过0.14ms
结果:%C2%BF-%C3%80%C3%B1dr%C3%A9%C3%9F-1%27affreux-%C4%9Far%C3%A7on-%26-n%C3%B8%C3%AB1- en-for%C3%AAt-%21
第六次测试通过1.4ms
结果:andress-laffreux-garcon-n-el-en-foret
使用Core中已实现的功能怎么样?
//Clean non UTF-8 characters
Mage::getHelper('core/string')->cleanString($str)
或核心url / url重写方法之一。
由于通用顶级域名(gTLD)和国际化域名(IDN)越来越多,我无法理解为什么URL不应该包含Andrés。
而是您想要的rawurlencode $ URL。大多数浏览器在URL中显示UTF-8字符(也许不是某些古老的IE6),如果需要俄语或阿拉伯语(例如出于广告目的)或仅将它们写在广告中,可以使用bit.ly / goo.gl来使其简短就像用户将它们写在浏览器URL上一样。
唯一的区别是空格“”,如果您不希望使用空格,最好将其替换为“-”和“ /”。
<?php
function slugify($url)
{
$url = trim($url);
$url = str_replace(" ","-",$url);
$url = str_replace("/","-slash-",$url);
$url = rawurlencode($url);
}
?>
编码为 网址的网址http://www.hurtta.com/RU/%D0%9F%D1%80%D0%BE%D0%B4%D1%83%D0%BA%D1%82%D1%8B/
我是根据Maerlyn的回复写的。无论页面上的字符编码如何,此功能都将起作用。它也不会将单引号变成破折号:)
function slugify ($string) {
$string = utf8_encode($string);
$string = iconv('UTF-8', 'ASCII//TRANSLIT', $string);
$string = preg_replace('/[^a-z0-9- ]/i', '', $string);
$string = str_replace(' ', '-', $string);
$string = trim($string, '-');
$string = strtolower($string);
if (empty($string)) {
return 'n-a';
}
return $string;
}
在我的本地主机上一切正常,但在服务器上,它帮助我在“ mb_strtolower”处使用“ set_locale”和“ utf-8”。
<?
setlocale( LC_ALL, "en_US.UTF8" );
function slug( $string )
{
$string = iconv( "utf-8", "us-ascii//translit//ignore", $string ); // transliterate
$string = str_replace( "'", "", $string );
$string = preg_replace( "~[^\pL\d]+~u", "-", $string ); // replace non letter or non digits by "-"
$string = preg_replace( "~[^-\w]+~", "", $string ); // remove unwanted characters
$string = preg_replace( "~-+~", "-", $string ); // remove duplicate "-"
$string = trim( $string, "-" ); // trim "-"
$string = trim( $string ); // trim
$string = mb_strtolower( $string, "utf-8" ); // lowercase
$string = urlencode( $string ); // safe
return $string;
};
?>
我认为最优雅的方式是使用Behat \ Transliterator \ Transliterator。
我需要按您的类扩展该类,因为它是一个Abstract,如下所示:
<?php
use Behat\Transliterator\Transliterator;
class Urlizer extends Transliterator
{
}
然后,只需使用它:
$text = "Master Ápiu";
$urlizer = new Urlizer();
$slug = $urlizer->transliterate($slug, "-");
echo $slug; // master-apiu
当然,您也应该将这些东西放入您的作曲家。
composer require behat/transliterator
这里有一个很好的解决方案,也可以处理特殊字符。
TextoFantástico=> texto-fantastico
function slugify( $string, $separator = '-' ) {
$accents_regex = '~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i';
$special_cases = array( '&' => 'and', "'" => '');
$string = mb_strtolower( trim( $string ), 'UTF-8' );
$string = str_replace( array_keys($special_cases), array_values( $special_cases), $string );
$string = preg_replace( $accents_regex, '$1', htmlentities( $string, ENT_QUOTES, 'UTF-8' ) );
$string = preg_replace("/[^a-z0-9]/u", "$separator", $string);
$string = preg_replace("/[$separator]+/u", "$separator", $string);
return $string;
}
作者:Natxet
这也可能是一种方法。从这些链接中得到启发专家交流和alinalexander
function slugifier($txt){
/* Get rid of accented characters */
$search = explode(",","ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,e,i,ø,u");
$replace = explode(",","c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,e,i,o,u");
$txt = str_replace($search, $replace, $txt);
/* Lowercase all the characters */
$txt = strtolower($txt);
/* Avoid whitespace at the beginning and the ending */
$txt = trim($txt);
/* Replace all the characters that are not in a-z or 0-9 by a hyphen */
$txt = preg_replace("/[^a-z0-9]/", "-", $txt);
/* Remove hyphen anywhere it's more than one */
$txt = preg_replace("/[\-]+/", '-', $txt);
return $txt;
}
既然我在这里看到了很多方法,但是我为自己找到了一个最简单的方法,也许会对别人有所帮助。
$slug = strtolower(preg_replace('/[^a-zA-Z0-9\-]/', '',preg_replace('/\s+/', '-', $string) ));
对我来说,此变体是完美的,也&
变为and
。这是代码:
function dSlug($string) {
return strtolower(trim(preg_replace('~[^0-9a-z]+~i', '-', html_entity_decode(preg_replace('~&([a-z]{1,2})(?:acute|cedil|circ|grave|lig|orn|ring|slash|th|tilde|uml);~i', '$1',htmlentities(preg_replace('/[&]/', ' and ', $title), ENT_QUOTES, 'UTF-8')), ENT_QUOTES, 'UTF-8')), '-'));
}`