圣经希伯来语应使用哪种SQL Server排序规则?所考虑的数据库需要容纳变音符号(即元音,重音,音调等)。
圣经希伯来语应使用哪种SQL Server排序规则?所考虑的数据库需要容纳变音符号(即元音,重音,音调等)。
Answers:
第一:在整理方面,圣经的希伯来语和现代希伯来语之间没有区别。我们只是在处理希伯来语。
第二: 无论使用任何其他方法,您都希望使用最新的排序规则集,通常,我强烈建议您使用所需排序规则的最新版本,但是至少在这种情况下,有充分的理由使用名称中没有版本号的版本。版本100(或更新版本)的排序规则要完整得多,并且可以区分辅助字符(如果使用_100_
因为它们比没有名称名称的旧版本(在技术上是版本80
)的旧系列具有更新/更完整的排序权重和语言规则。SC
或140
排序规则,甚至可以完全支持它们),但是假设您不处理辅助字符,则版本80(无版本)排序规则)在处理希伯来语方面做得更好(请参见下面的“第六”)。
第三:希伯来语中没有“假名”(或假名类型)的概念,因此您可以忽略_KS
其名称中的任何排序规则变体(因为这是永远不会使用的敏感性)。
第四:归类以_SC
支持补充字符(即完整的UTF-16)结尾,因此通常最好选择其中的一个(如果有)(意味着:如果您使用的是SQL Server 2012或更高版本)。
第五:您不希望使用二进制排序规则(_BIN
或_BIN2
),因为它们无法区分具有相同元音和斜线标记但具有不同顺序组合字符的希伯来字母,也不能忽略元音和其他标记等同于א
和אֽ
。
例如(元音和倾斜标记以相反的顺序组合字符):
SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_CS_AS_SC;
-- אָ֜ אָ֜
SELECT NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8),
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C)
WHERE NCHAR(0x05D0) + NCHAR(0x059C) + NCHAR(0x05B8) =
NCHAR(0x05D0) + NCHAR(0x05B8) + NCHAR(0x059C) COLLATE Hebrew_100_BIN2;
-- no rows
第六:这取决于您将如何与字符串值进行交互。希伯来语没有大写/小写字母,但是有些代码点受大小写敏感性的影响。甚至有几个对宽度敏感的代码点。口音敏感/不敏感会影响用于元音,发音和悬臂标记(即音高音)的变音符号。
您是否需要区分同一封信的最终形式和非最终形式?希伯来语中有五个字母,当用作单词的最后一个字母时,它们看起来有所不同。SQL Server通过区分大小写/ _CS
排序规则来处理此问题(但是,不幸的是,它似乎在更新(通常更好)的版本100和更新的排序规则中被破坏):
SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CI_AS_KS_WS;
-- 1 row (expected; all sensitive except case)
-- Mem Final Mem
-- מ ם
SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_CS_AI;
-- no rows (expected; all insensitive except case)
SELECT NCHAR(0x05DE) AS [Mem],
NCHAR(0x05DD) AS [Final Mem]
WHERE NCHAR(0x05DE) = NCHAR(0x05DD) COLLATE Hebrew_100_CI_AI;
-- no rows (expected 1 row; all insensitive)
您需要区分发音标记,元音和悬念标记吗?SQL Server通过重音/ _AS
排序规则来处理此问题(但是,不幸的是,它似乎在更新(通常更好)的版本100和更新的排序规则中被破坏)。请注意,所有这三个元素都在重音符号敏感性下分组在一起,并且无法单独控制(即,您对元音不敏感,对衔接标记不敏感)。
发音标记
有几个字母有两种不同的声音。有时,唯一要使用声音的指标是字母所用单词的上下文(有时甚至是周围的单词),例如实际的律法书(没有发音标记或元音)。但是,其他形式的同一文本以及其他文本将在字母内部或字母Shin上方放置点。字母Shin可以具有“ sh”或“ s”声音。为了指示“ sh”声音(即字母“ shin”),在右侧上方有一个点,而在左侧上方有一个点表示“ s”声音(字母“ sin”):
SELECT NCHAR(0x05E9) AS [Shin], -- ש
NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot] -- שׂ
WHERE NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CS_AI_KS_WS
AND NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_CS_AI_KS_WS;
-- 1 row (expected; all sensitive except accent)
SELECT NCHAR(0x05E9) AS [Shin], -- ש
NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot] -- שׂ
WHERE NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_CI_AS
OR NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_CI_AS;
-- no rows (expected; all insensitive except accent)
SELECT NCHAR(0x05E9) AS [Shin], -- ש
NCHAR(0x05E9) + NCHAR(0x05C1) AS [Shin + Shin Dot], -- שׁ
NCHAR(0x05E9) + NCHAR(0x05C2) AS [Shin + Sin Dot] -- שׂ
WHERE NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C1) COLLATE Hebrew_100_CI_AI_SC
OR NCHAR(0x05E9) = NCHAR(0x05E9) + NCHAR(0x05C2) COLLATE Hebrew_100_CI_AI_SC;
-- no rows (expected 1 row; all insensitive)
元音
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_CS_AI_KS_WS;
-- 1 row (expected; all sensitive except accent)
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_CI_AS;
-- no rows (expected; all insensitive except accent)
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05B8) AS [Aleph with vowel] -- אָ
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05B8) COLLATE Hebrew_100_CI_AI_SC;
-- no rows (expected 1 row; all insensitive)
悬垂痕
从技术上讲,根据官方Unicode数据,希伯来语的可插拔标记是可忽略的,并且在使用二进制归类时仅应在此处注册为差异。但是,SQL Server将它们与重音一样对待(很遗憾),并且不能将它们与发音标记或元音分开忽略。
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_CS_AI_KS_WS;
-- 1 row (expected; all sensitive except accent)
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_CI_AS;
-- no rows (expected; all insensitive except accent)
SELECT NCHAR(0x05D0) AS [Aleph], -- א
NCHAR(0x05D0) + NCHAR(0x05A8) AS [Aleph with cantillation mark] -- א֨
WHERE NCHAR(0x05D0) =
NCHAR(0x05D0) + NCHAR(0x05A8) COLLATE Hebrew_100_CI_AI_SC;
-- no rows (expected 1 row; all insensitive)
您是否需要区分同一字母的宽和非宽形式?希伯来语中有八个字母被拉伸(宽),但仅用于在Torah卷轴中使用(手写/真实或印刷)以保持完全合理的列格式(这实际上是在Torah卷轴中的显示方式) )。SQL Server通过宽度敏感度/ _WS
排序规则来处理此问题(有趣的是,它似乎是在新版本100和更高版本的排序规则中可以正常工作的唯一敏感度,尽管可悲的是,它最不可能被使用):
SELECT NCHAR(0x05DC) AS [Lamed],
NCHAR(0xFB25) AS [Wide Lamed]
WHERE NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_CI_AI;
-- no rows (expected 1 row; all insensitive)
SELECT NCHAR(0x05DC) AS [Lamed],
NCHAR(0xFB25) AS [Wide Lamed]
WHERE NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_100_CS_AS_KS_SC;
-- 1 row (expected; all sensitive except width)
-- Lamed Wide Lamed
-- ל ﬥ
SELECT NCHAR(0x05DC) AS [Lamed],
NCHAR(0xFB25) AS [Wide Lamed]
WHERE NCHAR(0x05DC) = NCHAR(0xFB25) COLLATE Hebrew_100_CI_AI_WS_SC;
-- no rows (expected; all insensitive except width)
所以,也许Hebrew_CI_AI
在列,并且可以为每个表达式/谓词通过覆盖COLLATE
声明,如果你需要使用的变化,如COLLATE Hebrew_CS_AI
或Hebrew_CI_AS
或Hebrew_CS_AS
。
补充说明
您将需要将数据存储在NVARCHAR
列/变量中。您可以使用Windows-1255代码页(所有归类使用的代码页)以常规的8位代码完成大部分操作,包括组合元音和发音点的字符:VARCHAR
Hebrew_*
;WITH Hebrew AS
(
SELECT NCHAR(0x05E9) + NCHAR(0x05C1) + NCHAR(0x05B8)
COLLATE Hebrew_100_CS_AS AS [Shin]
)
SELECT
Hebrew.[Shin] AS [Unicode],
CONVERT(VARCHAR(20), Hebrew.[Shin]) AS [CodePage1255],
CONVERT(VARBINARY(10), CONVERT(VARCHAR(20), Hebrew.[Shin])) AS [CodePage1255_bytes]
FROM Hebrew;
-- Unicode CodePage1255 CodePage1255_bytes
-- שָׁ שָׁ F9D1C8
但是,只有Unicode 希伯来语块包含悬停标记(即trope;代码点U + 0591至U + 05AF)加上一些附加字符(代码点U + 05C4至U + 05C7),而字母表示形式块包含宽几个字母加上一些其他东西的变体。
根据针对希伯来语(“ he”和“ he-IL”)文化的官方Unicode CLDR(特定于语言环境的剪裁)规则,U + 05F3希伯来语标点语言应该匹配或先于 U + 0027 APOSTROPHE。通常,U + 05F3 在撇号后排序。当使用ICU整理演示程序并在“根” /标准排序顺序(由美国英语/“ en-US”使用)和“他”之间切换时,确实可以看到此行为。但是,.NET或SQL Server中似乎都没有此行为:
SELECT NCHAR(0x05F3)
WHERE NCHAR(0x05F3) <= N'''' COLLATE Hebrew_100_CS_AS_KS_WS;
-- no rows
SELECT NCHAR(0x05F3)
WHERE NCHAR(0x05F3) <= N'''' COLLATE Hebrew_CS_AS_KS_WS;
-- no rows
不幸的是,尽管如此,但鉴于Windows Sorting Weight Table文件中没有任何“ he”或“ he-IL”特定的剪裁,这确实是有道理的。这很可能意味着关联代码页之外的Hebrew_*
和Latin1_General_*
归类之间没有实际区别,该区别仅用于VARCHAR
数据,而在此处不适用。
OP回复:
是的,我需要区分:1)同一字母的最终形式和非最终形式2)发音标记3)元音,以及4)悬空标记。
在这种情况下,由于您无需忽略这些属性之间的差异,因此可以使用100个级别的排序规则。下面的示例显示带有发音标记,悬念标记和元音的希伯来字母(Sin)。有六个版本,因此可以表示组合字符排序的每种可能组合。第七个条目使用另一个点来创建具有相同基本字母,元音和倾斜标记的字母Shin。该查询显示只有六个“ Sin”条目彼此匹配(即使具有不同的字节顺序),但没有匹配。
我包括Latin1_General
和Japanese_XJIS_140
归类的使用,以表明规则在需要使用它们的情况下也适用于这些规则(140
归类(仅日语)比旧版本具有更多的大写/小写映射)。但是总的来说,100
如果您需要忽略元音,标记,点以及最终形式与非最终形式之间的差异,则最好坚持使用希伯来语校对,并使用非版本。
DECLARE @Shin NVARCHAR(5) = NCHAR(0x05E9), -- base Hebrew letter
@Dot NVARCHAR(5) = NCHAR(0x05C2), -- Sin Dot
@Mark NVARCHAR(5) = NCHAR(0x05A8), -- Cantillation Mark (i.e. trope)
@Vowel NVARCHAR(5) = NCHAR(0x05B8); -- Vowel
DECLARE @Dot_Mark_Vowel NVARCHAR(20) = @Shin + @Dot + @Mark + @Vowel,
@Dot_Vowel_Mark NVARCHAR(20) = @Shin + @Dot + @Vowel + @Mark,
@Vowel_Dot_Mark NVARCHAR(20) = @Shin + @Vowel + @Dot + @Mark,
@Vowel_Mark_Dot NVARCHAR(20) = @Shin + @Vowel + @Mark + @Dot,
@Mark_Vowel_Dot NVARCHAR(20) = @Shin + @Mark + @Vowel + @Dot,
@Mark_Dot_Vowel NVARCHAR(20) = @Shin + @Mark + @Dot + @Vowel,
@ShinDot_Mark_Vowel NVARCHAR(20) = @Shin + NCHAR(0x05C1) + @Mark + @Vowel;
SELECT @Dot_Mark_Vowel AS [Sin], @ShinDot_Mark_Vowel AS [Shin];
;WITH chr AS
(
SELECT *
FROM (VALUES
(@Dot_Mark_Vowel, 'Dot + Mark + Vowel'),
(@Dot_Vowel_Mark, 'Dot + Vowel + Mark'),
(@Vowel_Dot_Mark, 'Vowel + Dot + Mark'),
(@Vowel_Mark_Dot, 'Vowel + Mark + Dot'),
(@Mark_Vowel_Dot, 'Mark + Vowel + Dot'),
(@Mark_Dot_Vowel, 'Mark + Dot + Vowel'),
(@ShinDot_Mark_Vowel, 'ShinDot + Mark + Vowel')
) tmp([Hebrew], [Description])
) SELECT chr1.[Hebrew],
'--' AS [---],
chr1.[Description] AS [Description_1],
CONVERT(VARBINARY(20), RIGHT(chr1.[Hebrew], 3)) AS [Bytes_1],
'--' AS [---],
chr2.[Description] AS [Description_2],
CONVERT(VARBINARY(20), RIGHT(chr2.[Hebrew], 3)) AS [Bytes_2]
FROM chr chr1
CROSS JOIN chr chr2
WHERE chr1.[Description] <> chr2.[Description] -- do not compare item to itself
AND chr1.[Hebrew] = chr2.[Hebrew] COLLATE Hebrew_100_CS_AS_SC
AND chr1.[Hebrew] = chr2.[Hebrew] COLLATE Latin1_General_100_CS_AS_SC
AND chr1.[Hebrew] = chr2.[Hebrew] COLLATE Japanese_XJIS_140_CS_AS;
-- this query returns 30 rows
这取决于很多事情。排序规则是排序,比较和非Unicode代码页。
这个仓库在希伯来语周围有不错的选择。
+---------------------------+---------------------------------------------------------------------------------------------------------------------+
| Hebrew_BIN | Hebrew, binary sort |
| Hebrew_BIN2 | Hebrew, binary code point comparison sort |
| Hebrew_CI_AI | Hebrew, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive |
| Hebrew_CI_AI_WS | Hebrew, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
| Hebrew_CI_AI_KS | Hebrew, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive |
| Hebrew_CI_AI_KS_WS | Hebrew, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive |
| Hebrew_CI_AS | Hebrew, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive |
| Hebrew_CI_AS_WS | Hebrew, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive |
| Hebrew_CI_AS_KS | Hebrew, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive |
| Hebrew_CI_AS_KS_WS | Hebrew, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive |
| Hebrew_CS_AI | Hebrew, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive |
| Hebrew_CS_AI_WS | Hebrew, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
| Hebrew_CS_AI_KS | Hebrew, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive |
| Hebrew_CS_AI_KS_WS | Hebrew, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive |
| Hebrew_CS_AS | Hebrew, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive |
| Hebrew_CS_AS_WS | Hebrew, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive |
| Hebrew_CS_AS_KS | Hebrew, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive |
| Hebrew_CS_AS_KS_WS | Hebrew, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive |
| Hebrew_100_BIN | Hebrew-100, binary sort |
| Hebrew_100_BIN2 | Hebrew-100, binary code point comparison sort |
| Hebrew_100_CI_AI | Hebrew-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive |
| Hebrew_100_CI_AI_WS | Hebrew-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
| Hebrew_100_CI_AI_KS | Hebrew-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive |
| Hebrew_100_CI_AI_KS_WS | Hebrew-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive |
| Hebrew_100_CI_AS | Hebrew-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive |
| Hebrew_100_CI_AS_WS | Hebrew-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive |
| Hebrew_100_CI_AS_KS | Hebrew-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive |
| Hebrew_100_CI_AS_KS_WS | Hebrew-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive |
| Hebrew_100_CS_AI | Hebrew-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive |
| Hebrew_100_CS_AI_WS | Hebrew-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive |
| Hebrew_100_CS_AI_KS | Hebrew-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive |
| Hebrew_100_CS_AI_KS_WS | Hebrew-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive |
| Hebrew_100_CS_AS | Hebrew-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive |
| Hebrew_100_CS_AS_WS | Hebrew-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive |
| Hebrew_100_CS_AS_KS | Hebrew-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive |
| Hebrew_100_CS_AS_KS_WS | Hebrew-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive |
| Hebrew_100_CI_AI_SC | Hebrew-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters |
| Hebrew_100_CI_AI_WS_SC | Hebrew-100, case-insensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters |
| Hebrew_100_CI_AI_KS_SC | Hebrew-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters |
| Hebrew_100_CI_AI_KS_WS_SC | Hebrew-100, case-insensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters |
| Hebrew_100_CI_AS_SC | Hebrew-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-insensitive, supplementary characters |
| Hebrew_100_CI_AS_WS_SC | Hebrew-100, case-insensitive, accent-sensitive, kanatype-insensitive, width-sensitive, supplementary characters |
| Hebrew_100_CI_AS_KS_SC | Hebrew-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-insensitive, supplementary characters |
| Hebrew_100_CI_AS_KS_WS_SC | Hebrew-100, case-insensitive, accent-sensitive, kanatype-sensitive, width-sensitive, supplementary characters |
| Hebrew_100_CS_AI_SC | Hebrew-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-insensitive, supplementary characters |
| Hebrew_100_CS_AI_WS_SC | Hebrew-100, case-sensitive, accent-insensitive, kanatype-insensitive, width-sensitive, supplementary characters |
| Hebrew_100_CS_AI_KS_SC | Hebrew-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-insensitive, supplementary characters |
| Hebrew_100_CS_AI_KS_WS_SC | Hebrew-100, case-sensitive, accent-insensitive, kanatype-sensitive, width-sensitive, supplementary characters |
| Hebrew_100_CS_AS_SC | Hebrew-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-insensitive, supplementary characters |
| Hebrew_100_CS_AS_WS_SC | Hebrew-100, case-sensitive, accent-sensitive, kanatype-insensitive, width-sensitive, supplementary characters |
| Hebrew_100_CS_AS_KS_SC | Hebrew-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-insensitive, supplementary characters |
| Hebrew_100_CS_AS_KS_WS_SC | Hebrew-100, case-sensitive, accent-sensitive, kanatype-sensitive, width-sensitive, supplementary characters |
+---------------------------+---------------------------------------------------------------------------------------------------------------------+