MySQL中的自然排序


80

有没有一种优雅的方法可以在MySQL数据库中进行高性能的自然排序?

例如,如果我有此数据集:

  • 最终幻想
  • 最终幻想4
  • 最终幻想10
  • 最终幻想12
  • 最终幻想12:爱情之链
  • 最终幻想冒险
  • 最终幻想起源
  • 最终幻想战术

除了将游戏名称拆分成各个组成部分以外,其他任何优雅的解决方案

  • 标题:《最终幻想》
  • 编号:“ 12”
  • 副标题:“ Promathia的链条”

确保它们以正确的顺序出现?(10个在4之后,而不是2之前)。

这样做是一件令人痛苦的事情,因为不时有另一款游戏破坏了解析游戏标题的机制(例如“战锤40,000”,“詹姆斯·邦德007”)


27
Promathia链与11相关。–
Flame

MySQL'Order By'的
Christian

Answers:


20

我认为这就是为什么很多东西都按发布日期排序的原因。

一种解决方案是在表中为“ SortKey”创建另一列。这可能是标题的经过净化处理的版本,与您创建的易于排序的图案或计数器相符。



2
这绝对是正确的方法,但仅靠它本身并不是一个答案!
Doin

90

这是一个快速的解决方案:

SELECT alphanumeric, 
       integer
FROM sorting_test
ORDER BY LENGTH(alphanumeric), alphanumeric

48
如果一切都是“最终幻想”,那很好,但它将“高飞”放在FF套件之前。
fortboise 2011年

4
此解决方案并非始终有效。有时会破裂。您应该宁愿使用这一地址:stackoverflow.com/a/12257917/384864
Borut Tomazin 2013年

6
打桩杂牌在杂牌:SELECT alphanumeric, integer FROM sorting_test ORDER BY SOUNDEX(alphanumeric), LENGTH(alphanumeric), alphanumeric。如果这完全可行,那是因为SOUNDEX方便地丢弃了这些数字,从而确保了例如apple1before z1
2014年

很好的解决方案,谢谢,尽管我不得不做个切换alphanmuriclength(alphanumeric)以避免在“最终幻想”之前避免使用“高飞”
2014年

1
@ offby1建议仅在文本是100%用英语书写时才有效,因为该SOUNDEX()设计仅适用于英语单词。
雷蒙德·尼兰

56

刚刚发现:

SELECT names FROM your_table ORDER BY games + 0 ASC

当数字位于最前面时,进行自然排序,也可能适用于中间。


2
我没有尝试过,但我对此表示严重怀疑。它与前面的数字一起使用的原因是因为games它在数字上下文中使用,因此在比较之前已转换为数字。如果在中间,它将始终转换为0,并且排序将变为伪随机。
manixrock 2011年

1
这不是自然的排序。而是看一下这个可行的解决方案:stackoverflow.com/a/12257917/384864
可行的 Borut Tomazin 2013年

@fedir这对我也很好。我什至不完全确定为什么会这样。有机会解释markletp吗?
BizNuge 2014年

刚刚对此进行了快速调查,我明白了。我什至没有意识到MySQL仅通过在字符串上使用数学运算符就可以进行这种类型的转换!很酷的事情是,在要“ cast”的字符串的前面没有整数的情况下,它仅返回zer0。谢谢你!--->选择地址,(ADDRESS * 1)as _cast from场所,其中邮政编码按“ NE1%”顺序按地址排序* 1 ASC,地址限制为100000;
BizNuge 2014年

1
当数字位于中间时,例如“最终幻想100”或“最终幻想2”,这实际上不起作用。《最终幻想100》将首次亮相。但是,当整数是第一个“ 100 Final Fantasy”时,它确实起作用
dwenaus

51

与@plalx发布的功能相同,但重写为MySQL:

DROP FUNCTION IF EXISTS `udf_FirstNumberPos`;
DELIMITER ;;
CREATE FUNCTION `udf_FirstNumberPos` (`instring` varchar(4000)) 
RETURNS int
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE position int;
    DECLARE tmp_position int;
    SET position = 5000;
    SET tmp_position = LOCATE('0', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF; 
    SET tmp_position = LOCATE('1', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('2', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('3', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('4', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('5', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('6', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('7', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('8', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;
    SET tmp_position = LOCATE('9', instring); IF (tmp_position > 0 AND tmp_position < position) THEN SET position = tmp_position; END IF;

    IF (position = 5000) THEN RETURN 0; END IF;
    RETURN position;
END
;;

DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE sortString varchar(4000);
    DECLARE numStartIndex int;
    DECLARE numEndIndex int;
    DECLARE padLength int;
    DECLARE totalPadLength int;
    DECLARE i int;
    DECLARE sameOrderCharsLen int;

    SET totalPadLength = 0;
    SET instring = TRIM(instring);
    SET sortString = instring;
    SET numStartIndex = udf_FirstNumberPos(instring);
    SET numEndIndex = 0;
    SET i = 1;
    SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);

    WHILE (i <= sameOrderCharsLen) DO
        SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
        SET i = i + 1;
    END WHILE;

    WHILE (numStartIndex <> 0) DO
        SET numStartIndex = numStartIndex + numEndIndex;
        SET numEndIndex = numStartIndex;

        WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
            SET numEndIndex = numEndIndex + 1;
        END WHILE;

        SET numEndIndex = numEndIndex - 1;

        SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

        IF padLength < 0 THEN
            SET padLength = 0;
        END IF;

        SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

        SET totalPadLength = totalPadLength + padLength;
        SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
    END WHILE;

    RETURN sortString;
END
;;

用法:

SELECT name FROM products ORDER BY udf_NaturalSortFormat(name, 10, ".")

4
这是唯一有效的解决方案。我还测试了drupals代码,但有时会失败。谢啦!
Borut Tomazin

有人在10万以上的大桌子上使用它吗?
Mark Steudel '17

2
@MarkSteudel我们使用与该函数类似的函数(尽管不是确切的函数)对多个表进行自然排序,其中最大表为500万行。但是,我们不会在查询中直接调用它,而是使用它来设置nat_name列的值。每当行更新时,我们都会使用触发器来运行该函数。这种方法使您可以自然排序,而没有实际的性能成本,但会增加额外的列。
雅各布

这行之有效,在字母前对数字进行排序,并且可以使用hook_views_query_alter来实现,在Drupal中可以使用类似于此的东西if ($query->orderby[0]["field"] === "node_field_data.title") { $orderBySql = " udf_NaturalSortFormat(node_field_data.title, 10, '.') "; $query->orderby = []; $query->addOrderBy(NULL, $orderBySql, $query->orderby[0]["direction"], 'title_natural'); array_unshift($query->orderby, end($query->orderby)); }
realgt

16

我之前为MSSQL 2000编写了此函数:

/**
 * Returns a string formatted for natural sorting. This function is very useful when having to sort alpha-numeric strings.
 *
 * @author Alexandre Potvin Latreille (plalx)
 * @param {nvarchar(4000)} string The formatted string.
 * @param {int} numberLength The length each number should have (including padding). This should be the length of the longest number. Defaults to 10.
 * @param {char(50)} sameOrderChars A list of characters that should have the same order. Ex: '.-/'. Defaults to empty string.
 *
 * @return {nvarchar(4000)} A string for natural sorting.
 * Example of use: 
 * 
 *      SELECT Name FROM TableA ORDER BY Name
 *  TableA (unordered)              TableA (ordered)
 *  ------------                    ------------
 *  ID  Name                    ID  Name
 *  1.  A1.                 1.  A1-1.       
 *  2.  A1-1.                   2.  A1.
 *  3.  R1      -->         3.  R1
 *  4.  R11                 4.  R11
 *  5.  R2                  5.  R2
 *
 *  
 *  As we can see, humans would expect A1., A1-1., R1, R2, R11 but that's not how SQL is sorting it.
 *  We can use this function to fix this.
 *
 *      SELECT Name FROM TableA ORDER BY dbo.udf_NaturalSortFormat(Name, default, '.-')
 *  TableA (unordered)              TableA (ordered)
 *  ------------                    ------------
 *  ID  Name                    ID  Name
 *  1.  A1.                 1.  A1.     
 *  2.  A1-1.                   2.  A1-1.
 *  3.  R1      -->         3.  R1
 *  4.  R11                 4.  R2
 *  5.  R2                  5.  R11
 */
CREATE FUNCTION dbo.udf_NaturalSortFormat(
    @string nvarchar(4000),
    @numberLength int = 10,
    @sameOrderChars char(50) = ''
)
RETURNS varchar(4000)
AS
BEGIN
    DECLARE @sortString varchar(4000),
        @numStartIndex int,
        @numEndIndex int,
        @padLength int,
        @totalPadLength int,
        @i int,
        @sameOrderCharsLen int;

    SELECT 
        @totalPadLength = 0,
        @string = RTRIM(LTRIM(@string)),
        @sortString = @string,
        @numStartIndex = PATINDEX('%[0-9]%', @string),
        @numEndIndex = 0,
        @i = 1,
        @sameOrderCharsLen = LEN(@sameOrderChars);

    -- Replace all char that has to have the same order by a space.
    WHILE (@i <= @sameOrderCharsLen)
    BEGIN
        SET @sortString = REPLACE(@sortString, SUBSTRING(@sameOrderChars, @i, 1), ' ');
        SET @i = @i + 1;
    END

    -- Pad numbers with zeros.
    WHILE (@numStartIndex <> 0)
    BEGIN
        SET @numStartIndex = @numStartIndex + @numEndIndex;
        SET @numEndIndex = @numStartIndex;

        WHILE(PATINDEX('[0-9]', SUBSTRING(@string, @numEndIndex, 1)) = 1)
        BEGIN
            SET @numEndIndex = @numEndIndex + 1;
        END

        SET @numEndIndex = @numEndIndex - 1;

        SET @padLength = @numberLength - (@numEndIndex + 1 - @numStartIndex);

        IF @padLength < 0
        BEGIN
            SET @padLength = 0;
        END

        SET @sortString = STUFF(
            @sortString,
            @numStartIndex + @totalPadLength,
            0,
            REPLICATE('0', @padLength)
        );

        SET @totalPadLength = @totalPadLength + @padLength;
        SET @numStartIndex = PATINDEX('%[0-9]%', RIGHT(@string, LEN(@string) - @numEndIndex));
    END

    RETURN @sortString;
END

GO

@MarkSteudel您必须试一试并自己进行测试。更糟糕的是,您总是可以缓存格式化的值。那可能就是我要处理大型表的原因,因为您也可以为该字段建立索引。
plalx

15

MySQL不允许这种“自然排序”,因此,获得所需信息的最好方法似乎是按照上述方法拆分数据集(单独的id字段等),否则将失败即,基于非标题元素,数据库中的索引元素(日期,数据库中插入的ID等)执行排序。

让数据库为您进行排序几乎总是比将大型数据集读入您选择的编程语言并在那里进行排序要快得多,因此,如果您对此处的数据库架构有任何控制权,那么看看添加如上所述,这些字段易于排序,从长远来看,它将为您节省很多麻烦和维护工作。

MySQL错误讨论论坛上,有时会提出添加“自然排序”的请求,许多解决方案都围绕着剥离数据的特定部分并将其转换为ORDER BY查询的一部分,例如

SELECT * FROM table ORDER BY CAST(mid(name, 6, LENGTH(c) -5) AS unsigned) 

可以在上面的“最终幻想”示例中使用这种解决方案,但是这种解决方案不是特别灵活,并且恐怕无法完全扩展到包括“战锤40,000”和“詹姆斯·邦德007”在内的数据集。


9

因此,虽然我知道您已经找到了满意的答案,但是我已经为这个问题苦苦挣扎了一段时间,并且我们先前确定它在SQL中不能做得很好,我们将不得不在JSON上使用javascript数组。

这就是我仅使用SQL即可解决的方法。希望这对其他人有帮助:

我有如下数据:

场景1
场景1A
场景1B
场景2A
场景3
...
场景101
场景XXA1
场景XXA2

尽管我认为这也可能起作用,但实际上我没有“投射”东西。

我首先替换了数据中不变的部分,在本例中为“场景”,然后进行了LPAD整理。这似乎可以很好地使alpha字符串以及带编号的字符串正确排序。

我的ORDER BY子句如下:

ORDER BY LPAD(REPLACE(`table`.`column`,'Scene ',''),10,'0')

显然,这并不能解决原本不太统一的问题,但我想这可能会解决许多其他相关问题,因此将其解决。


这个LPAD()提示非常有帮助。我可以对单词和数字进行排序,LPAD可以自然地对数字进行排序。并且使用CONCAT我忽略非数字。我的查询看起来像这样(别名是列进行排序):IF(CONCAT("",alias*1)=alias, LPAD(alias,5,"0"), alias) ASC;👍
凯诺亚克

6
  1. 在表中添加一个排序键(等级)。 ORDER BY rank

  2. 使用“发布日期”列。 ORDER BY release_date

  3. 从SQL提取数据时,请让您的对象进行排序,例如,如果将其提取到Set中,使其成为TreeSet,并让您的数据模型实现Comparable并在此处制定自然排序算法(如果使用的话,插入排序就足够了)一种没有集合的语言),因为您将在创建模型并将其插入集合时从SQL中逐行读取行)


5

关于Richard Toth的最佳回应https://stackoverflow.com/a/12257917/4052357

注意包含2字节(或更多)字符和数字的UTF8编码的字符串,例如

12 南新宿

使用MySQL的LENGTH()inudf_NaturalSortFormat函数将返回字符串的字节长度并且不正确,而是使用CHAR_LENGTH()它将返回正确的字符长度。

就我而言,使用LENGTH()导致查询无法完成并导致MySQL的CPU使用率为100%

DROP FUNCTION IF EXISTS `udf_NaturalSortFormat`;
DELIMITER ;;
CREATE FUNCTION `udf_NaturalSortFormat` (`instring` varchar(4000), `numberLength` int, `sameOrderChars` char(50)) 
RETURNS varchar(4000)
LANGUAGE SQL
DETERMINISTIC
NO SQL
SQL SECURITY INVOKER
BEGIN
    DECLARE sortString varchar(4000);
    DECLARE numStartIndex int;
    DECLARE numEndIndex int;
    DECLARE padLength int;
    DECLARE totalPadLength int;
    DECLARE i int;
    DECLARE sameOrderCharsLen int;

    SET totalPadLength = 0;
    SET instring = TRIM(instring);
    SET sortString = instring;
    SET numStartIndex = udf_FirstNumberPos(instring);
    SET numEndIndex = 0;
    SET i = 1;
    SET sameOrderCharsLen = CHAR_LENGTH(sameOrderChars);

    WHILE (i <= sameOrderCharsLen) DO
        SET sortString = REPLACE(sortString, SUBSTRING(sameOrderChars, i, 1), ' ');
        SET i = i + 1;
    END WHILE;

    WHILE (numStartIndex <> 0) DO
        SET numStartIndex = numStartIndex + numEndIndex;
        SET numEndIndex = numStartIndex;

        WHILE (udf_FirstNumberPos(SUBSTRING(instring, numEndIndex, 1)) = 1) DO
            SET numEndIndex = numEndIndex + 1;
        END WHILE;

        SET numEndIndex = numEndIndex - 1;

        SET padLength = numberLength - (numEndIndex + 1 - numStartIndex);

        IF padLength < 0 THEN
            SET padLength = 0;
        END IF;

        SET sortString = INSERT(sortString, numStartIndex + totalPadLength, 0, REPEAT('0', padLength));

        SET totalPadLength = totalPadLength + padLength;
        SET numStartIndex = udf_FirstNumberPos(RIGHT(instring, CHAR_LENGTH(instring) - numEndIndex));
    END WHILE;

    RETURN sortString;
END
;;

ps:我会将此添加为原始文档的注释,但是我没有足够的声誉(尚未)


4

为“排序关键字”添加一个字段,该字段将所有数字字符串都零填充到固定长度,然后在该字段上进行排序。

如果您的数字字符串可能很长,则另一种方法是在每个数字字符串前添加数字位数(固定宽度,零填充)。例如,如果连续的位数不超过99位,则“ Super Blast 10 Ultra”的排序键将为“ Super Blast 0210 Ultra”。


4

要订购:
0
1
2
10
23
101
205
1000
a
aac
b
casdsadsa
css

使用此查询:

选择 
    column_name 
从 
    table_name 
订购
    column_name REGEXP'^ \ d * [^ \ da-z&\。\'\-\“ \!\ @ \#\ $ \%\ ^ \ * \(\)\; \:\\,\?\ / \〜\`\ | \ __ \-]'DESC, 
    column_name + 0, 
    column_name;

不幸的是这打破了,如果你在诸如增加值下降a1a2a11等...
random_user_name

4

如果您不想重新发明轮子或者对很多无法使用的代码感到头疼,只需使用Drupal Natural Sort ...即可运行压缩的SQL(MySQL或Postgre),仅此而已。进行查询时,只需使用以下命令进行排序:

... ORDER BY natsort_canon(column_name, 'natural')

为此,我一直在尝试各种解决方案(哈哈,看看我在这里做了什么?),但是这些解决方案都无法真正解决我拥有的所有数据。drupal功能就像一个魅力。感谢您的发布。
本·希区柯克

这有效,但在末尾对数字进行排序(AZ然后是0-9)
realgt

4

另一种选择是从mysql提取数据后在内存中进行排序。从性能的角度来看,这并不是最佳选择,但如果您不对庞大的列表进行排序,则应该没问题。

如果您查看Jeff的文章,则可以找到许多与您可能使用的语言有关的算法。 人选:自然排序


2

您还可以动态创建“排序列”:

SELECT name, (name = '-') boolDash, (name = '0') boolZero, (name+0 > 0) boolNum 
FROM table 
ORDER BY boolDash DESC, boolZero DESC, boolNum DESC, (name+0), name

这样,您可以创建要排序的组。

在查询中,我想在所有内容前面加上“-”,然后是数字,然后是文本。这可能会导致类似:

-
0    
1
2
3
4
5
10
13
19
99
102
Chair
Dog
Table
Windows

这样,您在添加数据时不必按正确的顺序维护排序列。您还可以根据需要更改排序顺序。


我不知道这样的表现如何。我一直在使用它,没有任何不便。我的数据库不大。
antoine 2013年

1

如果您使用的是PHP,则可以在php中进行自然排序。

$keys = array();
$values = array();
foreach ($results as $index => $row) {
   $key = $row['name'].'__'.$index; // Add the index to create an unique key.
   $keys[] = $key;
   $values[$key] = $row; 
}
natsort($keys);
$sortedValues = array(); 
foreach($keys as $index) {
  $sortedValues[] = $values[$index]; 
}

我希望MySQL在将来的版本中实现自然排序,但是功能请求(#1588)自2003年以来一直处于开放状态,所以我不会屏息。


从理论上讲这是可能的,但是我需要先将所有数据库记录读取到我的Web服务器上。
BlaM 2011年

或者,考虑:(usort($mydata, function ($item1, $item2) { return strnatcmp($item1['key'], $item2['key']); });我有一个关联数组,并按键排序。)Ref:stackoverflow.com/q/12426825/1066234
Kai Noack

1

@ plaix / Richard Toth / Luke Hoggett的最佳响应的简化非udf版本仅适用于该字段中的第一个整数,它是

SELECT name,
LEAST(
    IFNULL(NULLIF(LOCATE('0', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('1', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('2', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('3', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('4', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('5', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('6', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('7', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('8', name), 0), ~0),
    IFNULL(NULLIF(LOCATE('9', name), 0), ~0)
) AS first_int
FROM table
ORDER BY IF(first_int = ~0, name, CONCAT(
    SUBSTR(name, 1, first_int - 1),
    LPAD(CAST(SUBSTR(name, first_int) AS UNSIGNED), LENGTH(~0), '0'),
    SUBSTR(name, first_int + LENGTH(CAST(SUBSTR(name, first_int) AS UNSIGNED)))
)) ASC

1

我尝试了几种解决方案,但实际上非常简单:

SELECT test_column FROM test_table ORDER BY LENGTH(test_column) DESC, test_column DESC

/* 
Result 
--------
value_1
value_2
value_3
value_4
value_5
value_6
value_7
value_8
value_9
value_10
value_11
value_12
value_13
value_14
value_15
...
*/

1
非常适合对格式进行数字排序23-4244。谢谢:)
Pyton '16

1
仅适用于此测试数据,因为数字前面的字符串都相同。尝试在其中保留一个值,该值z_99将放在顶部,但排z在后面v
塞缪尔·内夫

@SamuelNeff,请参见SQL:ORDER BY LENGTH(test_column)DESC,test_column DESC,所以是的,因为它将首先按列的长度排序。这可以很好地对表的前缀组进行排序,否则您将无法仅使用“ test_column DESC”进行排序
Tarik

1

我在这里(以及重复出现的问题)中看到的许多其他答案基本上仅适用于格式非常特殊的数据,例如,字符串完全是数字,或具有固定长度的字母前缀。在一般情况下,这是行不通的。

的确,在MySQL中实际上没有任何方法可以实现100%通用的nat-sort,因为要做到这一点,您真正需要的是一个经过修改的比较函数,当遇到时,该函数可以在字符串的字典排序排序和数字排序之间切换一个号码。这样的代码可以实现您希望用于识别和比较两个字符串中的数字部分的任何算法。但是,不幸的是,MySQL中的比较功能是其代码的内部功能,用户无法更改。

这留下了某种骇客,您尝试在其中为字符串创建一个排序键,在该中重新格式化数字部分,以便标准词典编目排序实际上按照您想要的方式对其进行排序

对于最大位数不超过0的纯整数,显而易见的解决方案是简单地用零左键填充它们,以使它们均为固定宽度。这是Drupal插件采用的方法,也是@plalx / @RichardToth的解决方案。(@Christian有一个不同且复杂得多的解决方案,但我看不到任何优势)。

正如@tye指出的那样,您可以通过在每个数字前添加固定数字长度而不是简单地将其左填充来改进此功能。尽管有本质上笨拙的hack的局限性,但是您还有很多可以改进的地方。但是,似乎没有任何预先构建的解决方案!

例如,关于:

  • 加号和减号?+10 vs 10 vs -10
  • 小数点?8.2、8.5、1.006,.75
  • 前导零?020、030、00000922
  • 千位分隔符?“ 1,001达姆斯”和“ 1001达姆斯”
  • 版本号?MariaDB v10.3.18和MariaDB v10.3.3
  • 数字很​​长?103,768,276,592,092,364,859,236,487,687,870,234,598.55

扩展@tye的方法,我创建了一个相当紧凑的NatSortKey()存储函数,该函数将任意字符串转换为nat-sort键,并且可以处理上述所有情况,效率相当高,并且可以保留总排序-顺序(没有两个不同的字符串具有比较相等的排序键)。第二个参数可用于限制每个字符串中处理的数字数量(例如,限制为前10个数字),这可用于确保输出适合给定长度。

注意:使用此第二个参数的给定值生成的排序键字符串只能与使用相同参数值生成的其他字符串进行排序,否则它们可能无法正确排序!

您可以直接在订购中使用它,例如

SELECT myString FROM myTable ORDER BY NatSortKey(myString,0);  ### 0 means process all numbers - resulting sort key might be quite long for certain inputs

但是为了有效地对大型表进行排序,最好将排序键预先存储在另一列中(可能带有索引):

INSERT INTO myTable (myString,myStringNSK) VALUES (@theStringValue,NatSortKey(@theStringValue,10)), ...
...
SELECT myString FROM myTable ORDER BY myStringNSK;

[理想情况下,您可以通过以下方式将键列创建为计算所得的存储列来自动实现:

CREATE TABLE myTable (
...
myString varchar(100),
myStringNSK varchar(150) AS (NatSortKey(myString,10)) STORED,
...
KEY (myStringNSK),
...);

但是目前,MySQL和MariaDB都不允许在计算列中存储函数,因此很遗憾,您还不能这样做。]


我的功能仅影响数字排序。如果要执行其他排序规范化操作,例如删除所有标点符号,修剪两端的空白或将多个空白序列替换为单个空格,则可以扩展该函数,也可以在此函数之前或之后NatSortKey()进行应用于您的数据。(我建议REGEXP_REPLACE()为此目的使用)。

我认为“。”在某种程度上也是以英语为中心的。代表小数点,“,”代表千位分隔符,但是如果您想取反或希望将其作为参数进行切换,应该足够容易地进行修改。

也许可以通过其他方式进一步改进;例如,它当前按绝对值对负数进行排序,因此-1位于-2之前,而不是相反。也没有办法为数字指定DESC排序顺序,同时保留文本的ASC字典顺序。这两个问题可以通过更多的工作来解决。如果有时间,我将更新代码。

还有很多其他细节需要注意-包括对所使用的字符集和排序规则的一些关键依赖性-但我已经将它们全部放入SQL代码的注释块中。在自己使用该功能之前,请仔细阅读该内容!

所以,这是代码。如果您发现错误,或者有我未提及的改进,请在评论中让我知道!


delimiter $$
CREATE DEFINER=CURRENT_USER FUNCTION NatSortKey (s varchar(100), n int) RETURNS varchar(350) DETERMINISTIC
BEGIN
/****
  Converts numbers in the input string s into a format such that sorting results in a nat-sort.
  Numbers of up to 359 digits (before the decimal point, if one is present) are supported.  Sort results are undefined if the input string contains numbers longer than this.
  For n>0, only the first n numbers in the input string will be converted for nat-sort (so strings that differ only after the first n numbers will not nat-sort amongst themselves).
  Total sort-ordering is preserved, i.e. if s1!=s2, then NatSortKey(s1,n)!=NatSortKey(s2,n), for any given n.
  Numbers may contain ',' as a thousands separator, and '.' as a decimal point.  To reverse these (as appropriate for some European locales), the code would require modification.
  Numbers preceded by '+' sort with numbers not preceded with either a '+' or '-' sign.
  Negative numbers (preceded with '-') sort before positive numbers, but are sorted in order of ascending absolute value (so -7 sorts BEFORE -1001).
  Numbers with leading zeros sort after the same number with no (or fewer) leading zeros.
  Decimal-part-only numbers (like .75) are recognised, provided the decimal point is not immediately preceded by either another '.', or by a letter-type character.
  Numbers with thousand separators sort after the same number without them.
  Thousand separators are only recognised in numbers with no leading zeros that don't immediately follow a ',', and when they format the number correctly.
  (When not recognised as a thousand separator, a ',' will instead be treated as separating two distinct numbers).
  Version-number-like sequences consisting of 3 or more numbers separated by '.' are treated as distinct entities, and each component number will be nat-sorted.
  The entire entity will sort after any number beginning with the first component (so e.g. 10.2.1 sorts after both 10 and 10.995, but before 11)
  Note that The first number component in an entity like this is also permitted to contain thousand separators.

  To achieve this, numbers within the input string are prefixed and suffixed according to the following format:
  - The number is prefixed by a 2-digit base-36 number representing its length, excluding leading zeros.  If there is a decimal point, this length only includes the integer part of the number.
  - A 3-character suffix is appended after the number (after the decimals if present).
    - The first character is a space, or a '+' sign if the number was preceded by '+'.  Any preceding '+' sign is also removed from the front of the number.
    - This is followed by a 2-digit base-36 number that encodes the number of leading zeros and whether the number was expressed in comma-separated form (e.g. 1,000,000.25 vs 1000000.25)
    - The value of this 2-digit number is: (number of leading zeros)*2 + (1 if comma-separated, 0 otherwise)
  - For version number sequences, each component number has the prefix in front of it, and the separating dots are removed.
    Then there is a single suffix that consists of a ' ' or '+' character, followed by a pair base-36 digits for each number component in the sequence.

  e.g. here is how some simple sample strings get converted:
  'Foo055' --> 'Foo0255 02'
  'Absolute zero is around -273 centigrade' --> 'Absolute zero is around -03273 00 centigrade'
  'The $1,000,000 prize' --> 'The $071000000 01 prize'
  '+99.74 degrees' --> '0299.74+00 degrees'
  'I have 0 apples' --> 'I have 00 02 apples'
  '.5 is the same value as 0000.5000' --> '00.5 00 is the same value as 00.5000 08'
  'MariaDB v10.3.0018' --> 'MariaDB v02100130218 000004'

  The restriction to numbers of up to 359 digits comes from the fact that the first character of the base-36 prefix MUST be a decimal digit, and so the highest permitted prefix value is '9Z' or 359 decimal.
  The code could be modified to handle longer numbers by increasing the size of (both) the prefix and suffix.
  A higher base could also be used (by replacing CONV() with a custom function), provided that the collation you are using sorts the "digits" of the base in the correct order, starting with 0123456789.
  However, while the maximum number length may be increased this way, note that the technique this function uses is NOT applicable where strings may contain numbers of unlimited length.

  The function definition does not specify the charset or collation to be used for string-type parameters or variables:  The default database charset & collation at the time the function is defined will be used.
  This is to make the function code more portable.  However, there are some important restrictions:

  - Collation is important here only when comparing (or storing) the output value from this function, but it MUST order the characters " +0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ" in that order for the natural sort to work.
    This is true for most collations, but not all of them, e.g. in Lithuanian 'Y' comes before 'J' (according to Wikipedia).
    To adapt the function to work with such collations, replace CONV() in the function code with a custom function that emits "digits" above 9 that are characters ordered according to the collation in use.

  - For efficiency, the function code uses LENGTH() rather than CHAR_LENGTH() to measure the length of strings that consist only of digits 0-9, '.', and ',' characters.
    This works for any single-byte charset, as well as any charset that maps standard ASCII characters to single bytes (such as utf8 or utf8mb4).
    If using a charset that maps these characters to multiple bytes (such as, e.g. utf16 or utf32), you MUST replace all instances of LENGTH() in the function definition with CHAR_LENGTH()

  Length of the output:

  Each number converted adds 5 characters (2 prefix + 3 suffix) to the length of the string. n is the maximum count of numbers to convert;
  This parameter is provided as a means to limit the maximum output length (to input length + 5*n).
  If you do not require the total-ordering property, you could edit the code to use suffixes of 1 character (space or plus) only; this would reduce the maximum output length for any given n.
  Since a string of length L has at most ((L+1) DIV 2) individual numbers in it (every 2nd character a digit), for n<=0 the maximum output length is (inputlength + 5*((inputlength+1) DIV 2))
  So for the current input length of 100, the maximum output length is 350.
  If changing the input length, the output length must be modified according to the above formula.  The DECLARE statements for x,y,r, and suf must also be modified, as the code comments indicate.
****/
  DECLARE x,y varchar(100);            # need to be same length as input s
  DECLARE r varchar(350) DEFAULT '';   # return value:  needs to be same length as return type
  DECLARE suf varchar(101);   # suffix for a number or version string. Must be (((inputlength+1) DIV 2)*2 + 1) chars to support version strings (e.g. '1.2.33.5'), though it's usually just 3 chars. (Max version string e.g. 1.2. ... .5 has ((length of input + 1) DIV 2) numeric components)
  DECLARE i,j,k int UNSIGNED;
  IF n<=0 THEN SET n := -1; END IF;   # n<=0 means "process all numbers"
  LOOP
    SET i := REGEXP_INSTR(s,'\\d');   # find position of next digit
    IF i=0 OR n=0 THEN RETURN CONCAT(r,s); END IF;   # no more numbers to process -> we're done
    SET n := n-1, suf := ' ';
    IF i>1 THEN
      IF SUBSTRING(s,i-1,1)='.' AND (i=2 OR SUBSTRING(s,i-2,1) RLIKE '[^.\\p{L}\\p{N}\\p{M}\\x{608}\\x{200C}\\x{200D}\\x{2100}-\\x{214F}\\x{24B6}-\\x{24E9}\\x{1F130}-\\x{1F149}\\x{1F150}-\\x{1F169}\\x{1F170}-\\x{1F189}]') AND (SUBSTRING(s,i) NOT RLIKE '^\\d++\\.\\d') THEN SET i:=i-1; END IF;   # Allow decimal number (but not version string) to begin with a '.', provided preceding char is neither another '.', nor a member of the unicode character classes: "Alphabetic", "Letter", "Block=Letterlike Symbols" "Number", "Mark", "Join_Control"
      IF i>1 AND SUBSTRING(s,i-1,1)='+' THEN SET suf := '+', j := i-1; ELSE SET j := i; END IF;   # move any preceding '+' into the suffix, so equal numbers with and without preceding "+" signs sort together
      SET r := CONCAT(r,SUBSTRING(s,1,j-1)); SET s = SUBSTRING(s,i);   # add everything before the number to r and strip it from the start of s; preceding '+' is dropped (not included in either r or s)
    END IF;
    SET x := REGEXP_SUBSTR(s,IF(SUBSTRING(s,1,1) IN ('0','.') OR (SUBSTRING(r,-1)=',' AND suf=' '),'^\\d*+(?:\\.\\d++)*','^(?:[1-9]\\d{0,2}(?:,\\d{3}(?!\\d))++|\\d++)(?:\\.\\d++)*+'));   # capture the number + following decimals (including multiple consecutive '.<digits>' sequences)
    SET s := SUBSTRING(s,LENGTH(x)+1);   # NOTE: LENGTH() can be safely used instead of CHAR_LENGTH() here & below PROVIDED we're using a charset that represents digits, ',' and '.' characters using single bytes (e.g. latin1, utf8)
    SET i := INSTR(x,'.');
    IF i=0 THEN SET y := ''; ELSE SET y := SUBSTRING(x,i); SET x := SUBSTRING(x,1,i-1); END IF;   # move any following decimals into y
    SET i := LENGTH(x);
    SET x := REPLACE(x,',','');
    SET j := LENGTH(x);
    SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
    SET k := LENGTH(x);
    SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294) + IF(i=j,0,1),10,36),2,'0'));   # (j-k)*2 + IF(i=j,0,1) = (count of leading zeros)*2 + (1 if there are thousands-separators, 0 otherwise)  Note the first term is bounded to <= base-36 'ZY' as it must fit within 2 characters
    SET i := LOCATE('.',y,2);
    IF i=0 THEN
      SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x,y,suf);   # k = count of digits in number, bounded to be <= '9Z' base-36
    ELSE   # encode a version number (like 3.12.707, etc)
      SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
      WHILE LENGTH(y)>0 AND n!=0 DO
        IF i=0 THEN SET x := SUBSTRING(y,2); SET y := ''; ELSE SET x := SUBSTRING(y,2,i-2); SET y := SUBSTRING(y,i); SET i := LOCATE('.',y,2); END IF;
        SET j := LENGTH(x);
        SET x := TRIM(LEADING '0' FROM x);   # strip leading zeros
        SET k := LENGTH(x);
        SET r := CONCAT(r,LPAD(CONV(LEAST(k,359),10,36),2,'0'),x);   # k = count of digits in number, bounded to be <= '9Z' base-36
        SET suf := CONCAT(suf,LPAD(CONV(LEAST((j-k)*2,1294),10,36),2,'0'));   # (j-k)*2 = (count of leading zeros)*2, bounded to fit within 2 base-36 digits
        SET n := n-1;
      END WHILE;
      SET r := CONCAT(r,y,suf);
    END IF;
  END LOOP;
END
$$
delimiter ;

我是MySQL的初学者,并尝试过此方法。收到此错误:“#1305-FUNCTION mydatabase.REGEXP_INSTR不存在”。任何的想法?
约翰T

对于那里的任何其他新手。我没有安装MySQL 8.0。REGEXP_INSTR(和其他REGEXP东西)需要它。
约翰·T

刚刚修复了NatSortKey中的一个严重错误:存在不正确的正则表达式字符。如果您亲自使用了此功能,请更新您的代码!
杜安


0

如果标题的版本号为数字,这是一个简单的例子:

ORDER BY CAST(REGEXP_REPLACE(title, "[a-zA-Z]+", "") AS INT)';

否则,如果使用模式,则可以使用简单的SQL(此模式在版本之前使用#):

create table titles(title);

insert into titles (title) values 
('Final Fantasy'),
('Final Fantasy #03'),
('Final Fantasy #11'),
('Final Fantasy #10'),
('Final Fantasy #2'),
('Bond 007 ##2'),
('Final Fantasy #01'),
('Bond 007'),
('Final Fantasy #11}');

select REGEXP_REPLACE(title, "#([0-9]+)", "\\1") as title from titles
ORDER BY REGEXP_REPLACE(title, "#[0-9]+", ""),
CAST(REGEXP_REPLACE(title, ".*#([0-9]+).*", "\\1") AS INT);     
+-------------------+
| title             |
+-------------------+
| Bond 007          |
| Bond 007 #2       |
| Final Fantasy     |
| Final Fantasy 01  |
| Final Fantasy 2   |
| Final Fantasy 03  |
| Final Fantasy 10  |
| Final Fantasy 11  |
| Final Fantasy 11} |
+-------------------+
8 rows in set, 2 warnings (0.001 sec)

如果需要,您可以使用其他模式。例如,如果您有一部电影“ I'm#1”和“ I'm#1 part 2”,那么可以包装诸如“ Final Fantasy {11}”的版本


-4

我知道这个话题很古老,但我想我已经找到了一种方法:

SELECT * FROM `table` ORDER BY 
CONCAT(
  GREATEST(
    LOCATE('1', name),
    LOCATE('2', name),
    LOCATE('3', name),
    LOCATE('4', name),
    LOCATE('5', name),
    LOCATE('6', name),
    LOCATE('7', name),
    LOCATE('8', name),
    LOCATE('9', name)
   ),
   name
) ASC

报废,它对以下集合进行了不正确的排序(这是无用的笑声):

最终幻想1最终幻想2最终幻想5最终幻想7最终幻想7:降临儿童最终幻想12最终幻想112 FF1 FF2


3
为什么不删除此答案?您将为此获得徽章
m47730
By using our site, you acknowledge that you have read and understand our Cookie Policy and Privacy Policy.
Licensed under cc by-sa 3.0 with attribution required.