How do I strip vowels and punctuation from Hebrew text in AppleScript?



Take the first few verses of Genesis in Hebrew as an example:

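בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)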

(For some reason the formatting does not come out right in the blockquote, but it does in my text file.)

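Now I want to strip every character from this text except the standard 27-letter Hebrew alphabet אבגדהוזחטיכךלמםנןסעפףצץקרשת, plus line breaks (which Script Editor automatically parses as \n) and the verse-end and paragraph markers (׃ (פ) (ס)). You will notice that a few lines contain the hyphen ־, which should be replaced with a space. Some lines also contain ׀; those should be replaced with a single space. When finished, it should look like this: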

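בראשית ברא אלהים את השמים ואת הארץ׃

והארץ היתה תהו ובהו וחשך על פני תהום ורוח אלהים מרחפת על פני המים׃

ויאמר אלהים יהי אור ויהי אור׃

וירא אלהים את האור כי טוב ויבדל אלהים בין האור ובין החשך׃

ויקרא אלהים לאור יום ולחשך קרא לילה ויהי ערב ויהי בקר יום אחד׃ (פ)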

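I thought I would try something simple first: set up a list of the Hebrew letters plus ( and ), set x to the length of the input string, then repeat over each character of the string. If the character is in the list, append it to the output; if it is a ־, append a space to the output; if it is a \ and the next character is an n, append \n to the output; and if there are two spaces in a row, delete the second.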
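For comparison outside AppleScript, here is a minimal Python sketch of the allowlist loop just described. It is not the asker's script: the names `KEEP`, `TO_SPACE` and `filter_by_allowlist` are invented for illustration, and keeping the verse-end mark ׃ is an assumption based on the expected output. Python iterates a string one code point at a time, so each vowel point is seen as its own character and simply fails the allowlist test.

```python
# Hypothetical names; a sketch of the allowlist idea, not the asker's code.
# U+05D0..U+05EA are the 27 Hebrew letters (22 letters plus 5 final forms).
HEBREW_LETTERS = {chr(cp) for cp in range(0x05D0, 0x05EB)}

# Assumption: spaces, newlines, parentheses and sof pasuq (U+05C3) are kept.
KEEP = HEBREW_LETTERS | set(" \n()") | {"\u05c3"}

# Maqaf (U+05BE) and paseq (U+05C0) become a space.
TO_SPACE = {"\u05be", "\u05c0"}


def filter_by_allowlist(text):
    out = []
    for ch in text:  # one code point per iteration
        if ch in TO_SPACE:
            ch = " "
        elif ch not in KEEP:
            continue  # vowel points, cantillation marks, anything else
        if ch == " " and out and out[-1] == " ":
            continue  # drop the second of two spaces in a row
        out.append(ch)
    return "".join(out)


# bet + sheva + dagesh, resh + tsere, alef -> bet, resh, alef
print(filter_by_allowlist("\u05d1\u05b0\u05bc\u05e8\u05b5\u05d0"))
```

The loop only behaves this way when the language hands over individual code points; the log output below suggests AppleScript did not do that here.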

I logged the output and found some gibberish:

(*אאית   א    ים  ת     ם   ת    ץץץץץץץץ    ה  הה   הה       ללללי    ם         ים     ת  ללללי    םםםםםאאר    ים   י   ר    ייייררררררא    ים  תתתתתר  ייייב     ל    ים  ין    ר   ין           א    ים    אאא   ם         א    ה    ייייב    ייייר   ם   דד (פ)*)

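It seems to be every letter in the passage that has no vowel, with the letter repeated when the letters that follow do have one. (The repetition is my mistake; the repeat loop was badly written.) However, it skipped the consonants that carry vowels, which puzzled me.

So I ran a test: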

set charNum to ASCII number "בְּ"
log charNum
set charNum to ASCII number "ב"
log charNum
-->result: (*63*) (*63*)

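Although in a text editor the vowels and other marks are separate characters superimposed on the preceding character, Script Editor does not see it that way and treats בְּ and ב as the same letter. Yet when it compared the character against my list, it did not recognize it and skipped it.

So how can I strip the vowels and other marks from the letters without writing an if test for every possible combination of letter and vowel?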

Answers:



ASCII number is deprecated and does not work properly with Unicode text; use id of someCharacter instead.

set charNum to id of "בְּ" -- this returns three ids because "בְּ" is a composed character
log charNum
set charNum to id of "ב"
log charNum
-->result: 
(*1489, 1456, 1468*)
(*1489*)
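As a cross-check of those three ids outside AppleScript, the short Python sketch below lists the code point and Unicode general category of each character in the same cluster. The helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(text):
    """Return (code point, general category) for each code point in text."""
    return [(ord(ch), unicodedata.category(ch)) for ch in text]


# bet (U+05D1) followed by sheva (U+05B0) and dagesh (U+05BC)
cluster = "\u05d1\u05b0\u05bc"
for code_point, category in describe(cluster):
    print(code_point, f"U+{code_point:04X}", category)
```

The letter itself has category Lo, while both points have category Mn (nonspacing mark), which is the property the perl command further down matches on.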

So, I don't know how to do this in pure AppleScript.


However, you can use a perl command in a do shell script:

-- The text does not look right in this code block, but it will be correct once the script is compiled
set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"


return do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' <<< " & quoted form of theString

Here is a brief explanation of the perl script:

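  • The -CSD option: standard output and standard error are written as UTF-8, and input is assumed to be UTF-8
  • s~\\p{NonspacingMark}~~og : removes every nonspacing mark (the backslash is doubled because the command sits inside an AppleScript string)
  • s~־|׀~ ~g: replaces every ־ and ׀ with a space
  • s~ +~ ~g : replaces each run of consecutive spaces with a single space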
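The three substitutions above can be sketched in Python to check the expected result independently of perl. This is an approximation under stated assumptions, not a port of the answer: the function name `strip_hebrew_marks` is hypothetical, and the general category Mn reported by `unicodedata` stands in for perl's `\p{NonspacingMark}`.

```python
import re
import unicodedata


def strip_hebrew_marks(text):
    # Step 1: drop nonspacing marks (vowel points, cantillation, meteg).
    no_marks = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Step 2: replace maqaf (U+05BE) and paseq (U+05C0) with a space.
    spaced = re.sub("[\u05be\u05c0]", " ", no_marks)
    # Step 3: collapse each run of spaces into a single space.
    return re.sub(" +", " ", spaced)


# The last two words of the third verse, written with escapes:
# vav, patah, meteg, yod, sheva, he, hiriq, yod, maqaf, alef, meteg, vav, holam, resh, sof pasuq
sample = "\u05d5\u05b7\u05bd\u05d9\u05b0\u05d4\u05b4\u05d9\u05be\u05d0\u05bd\u05d5\u05b9\u05e8\u05c3"
print(strip_hebrew_marks(sample))
```

The order of the steps matters: the paseq sits between two spaces in the source text, so replacing it first and collapsing spaces last leaves exactly one space.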

If the AppleScript reads the text from a file, perl can read the file directly:

do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' < " & quoted form of posix path of pathOfTheTextFile

The file must be encoded as UTF-8.


Another solution is to use Cocoa-AppleScript:

        use framework "Foundation"
        use scripting additions
        -- The text does not look right in this code block, but it will be correct once the script is compiled
        set theString to "בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃

וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֙הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃

וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃

וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃

וַיִּקְרָ֨א אֱלֹהִ֤ים ׀ לָאוֹר֙ י֔וֹם וְלַחֹ֖שֶׁךְ קָ֣רָא לָ֑יְלָה וַֽיְהִי־עֶ֥רֶב וַֽיְהִי־בֹ֖קֶר י֥וֹם אֶחָֽד׃ (פ)"

        return stripString(theString)

        on stripString(t)
            set sourceString to current application's NSMutableString's stringWithString:t
            set myOpt to current application's NSRegularExpressionSearch
            set theSuccess to sourceString's applyTransform:(current application's NSStringTransformStripCombiningMarks) |reverse|:false range:(current application's NSMakeRange(0, (sourceString's |length|))) updatedRange:(missing value)
            if theSuccess then
                -- *** Replace all "־" and "׀" by a space, each character must be separated by a vertical bar character, e.g. "a|d|z"
                sourceString's replaceOccurrencesOfString:"־|׀" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))

                -- **** Replace multiple spaces in a row by one space
                sourceString's replaceOccurrencesOfString:" +" withString:" " options:myOpt range:(current application's NSMakeRange(0, (sourceString's |length|)))
                return sourceString as string -- convert the NSString object to an AppleScript's string
            end if
            return "" -- else, the transform was not applied
        end stripString

Following up on the comments:

For a droplet, the script needs an on open handler, like this:

on open theseFiles
    repeat with f in theseFiles
        set cleanText to do shell script "perl -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
        -- do something with that cleanText
    end repeat
end open

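If you want to edit the files in place (the perl command needs the -i option followed by a backup extension, written as '.some name extension'):

This creates a backup of each file (by appending ".bak" to its name).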

on open theseFiles
    repeat with f in theseFiles -- ***  create a backup and edit the file in-place ***
        do shell script "perl -i'.bak' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
    end repeat
end open

If you do not want a backup of each file (the perl command needs the -i option followed by ''), do it like this:

-- ***  edit the file in-place without backup***
do shell script "perl -i'' -CSD -pe  'use utf8; s~\\p{NonspacingMark}~~og; s~־|׀~ ~g;  s~ +~ ~g;' " & quoted form of POSIX path of f
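To make the backup behaviour concrete, here is a rough Python sketch of an in-place clean-up with an optional backup. It is an illustration only: `clean_file_in_place` and `backup_suffix` are hypothetical names, and unlike perl -i, which renames the original file to the backup name, this sketch copies the file first and then overwrites it.

```python
import re
import shutil
import unicodedata
from pathlib import Path


def strip_hebrew_marks(text):
    # Same three steps as the perl command: drop Mn, map maqaf/paseq to a space, squeeze spaces.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = re.sub("[\u05be\u05c0]", " ", text)
    return re.sub(" +", " ", text)


def clean_file_in_place(path, backup_suffix=".bak"):
    """Rewrite a UTF-8 text file without marks; pass backup_suffix=None to skip the backup."""
    path = Path(path)
    if backup_suffix:
        # Assumption: a plain copy next to the original is an acceptable backup.
        shutil.copy2(path, path.with_name(path.name + backup_suffix))
    original = path.read_text(encoding="utf-8")
    path.write_text(strip_hebrew_marks(original), encoding="utf-8")
```

Calling `clean_file_in_place("genesis.txt")` on a file of that (made-up) name would leave the cleaned text in genesis.txt and the untouched original in genesis.txt.bak.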

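For some reason Cocoa does not always run on my machine, which is odd considering it is a new MacBook Pro. Do I have to download some kind of extension to make it work? In any case, having AppleScript get Bash to call Perl is a bit convoluted, but it works well, so I have given you the check mark. Thanks! One last question: if I save the second perl script as a droplet, can I drop a .txt file onto it and have it parse that file for me? Does the return statement allow the file to be edited?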
DonielF '17

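This code should work on macOS 10.11.x and later (nothing extra needs to be installed). If it does not work, that may depend on one or more factors, and it would take some debugging to find out which. However, if the perl script works, there is no need to pursue Cocoa-AppleScript. I have updated my answer for your other question.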
jackjr300
Licensed under cc by-sa 3.0 with attribution required.