What is the best way to store Google ngrams in a database?


9

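I downloaded the Google 1-grams a few days ago, and that alone is a huge amount of data. I inserted the first of the 10 packages into MySQL and now have a database with 47 million records.

I wonder how best to store Google ngrams in a database. If you go beyond 1-grams to 2-grams or 3-grams, the volume gets much larger. Can I store 500 million records in one database and still work with it, or should I split it across different tables?

After how many records is it best to split, and how should the data be split (the 2-grams come in 100 files, so roughly 5 billion records)? Would you recommend MySQL horizontal partitioning, or building my own partitioning (e.g., by the first character of the word => twograms_a)?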

Answers:


4

I had so many changes to make to my first answer that I started this one instead!

USE test
DROP TABLE IF EXISTS ngram_key;
DROP TABLE IF EXISTS ngram_rec;
DROP TABLE IF EXISTS ngram_blk;
CREATE TABLE ngram_key
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    NGRAM VARCHAR(64) NOT NULL,
    PRIMARY KEY (NGRAM),
    KEY (NGRAM_ID)
) ENGINE=MyISAM ROW_FORMAT=FIXED PARTITION BY KEY(NGRAM) PARTITIONS 256;
CREATE TABLE ngram_rec
(
    NGRAM_ID BIGINT UNSIGNED NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL,
    PRIMARY KEY (NGRAM_ID,YR)
) ENGINE=MyISAM ROW_FORMAT=FIXED;
CREATE TABLE ngram_blk
(
    NGRAM VARCHAR(64) NOT NULL,
    YR SMALLINT NOT NULL,
    MC SMALLINT NOT NULL,
    PC SMALLINT NOT NULL,
    VC SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $$
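-- Rows inserted into the BLACKHOLE table are discarded, but the trigger still fires.
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blk FOR EACH ROW
BEGIN
    DECLARE NEW_ID BIGINT UNSIGNED;

    -- Register the ngram once; duplicates are silently skipped.
    INSERT IGNORE INTO ngram_key (NGRAM) VALUES (NEW.NGRAM);
    -- Look up the surrogate key, whether it is new or already existed.
    SELECT NGRAM_ID INTO NEW_ID FROM ngram_key WHERE NGRAM=NEW.NGRAM;
    -- Store the per-year counts under that key.
    INSERT IGNORE INTO ngram_rec VALUES (NEW_ID,NEW.YR,NEW.MC,NEW.PC,NEW.VC);
END; $$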
DELIMITER ;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30,vc=vc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30,pc=pc+30;
INSERT INTO ngram_blk VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
UPDATE ngram_rec SET yr=yr+1,mc=mc+30;
SELECT * FROM ngram_key;
SELECT * FROM ngram_rec;
SELECT A.ngram NGram,B.yr Year,B.mc Matches,B.pc Pages,B.vc Volumes FROM 
ngram_key A,ngram_rec B
WHERE A.ngram='rolando angel edwards'
AND A.ngram_id=B.ngram_id;

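The table holding the per-year information is much smaller, but the key table that preserves the original ngram is much larger. I also increased the amount of test data. You can cut and paste this straight into MySQL.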
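For anyone who wants to see the trigger's logic without a MySQL server, here is a minimal sketch in Python. It is only a model: the two dicts stand in for the ngram_key and ngram_rec tables, and the function name insert_blk is made up for illustration.

```python
# Sketch of what the populate_ngram trigger does, using plain dicts
# instead of MySQL tables. All Python names here are hypothetical.

ngram_key = {}   # ngram text -> surrogate id (stands in for the ngram_key table)
ngram_rec = {}   # (ngram id, year) -> (matches, pages, volumes)


def insert_blk(ngram, yr, mc, pc, vc):
    """Mimic one row arriving in ngram_blk and firing the trigger."""
    # INSERT IGNORE INTO ngram_key: allocate an id only for unseen ngrams.
    if ngram not in ngram_key:
        ngram_key[ngram] = len(ngram_key) + 1
    new_id = ngram_key[ngram]
    # INSERT IGNORE INTO ngram_rec: keep the first row per (id, year).
    ngram_rec.setdefault((new_id, yr), (mc, pc, vc))


insert_blk("rolando", 1965, 31, 29, 85)
insert_blk("rolando", 1966, 61, 59, 115)
insert_blk("rolando", 1965, 31, 29, 85)   # duplicate (ngram, year): ignored

print(len(ngram_key), len(ngram_rec))     # one key row, two year rows
```

The point of the split is that the long ngram string is stored once, while each additional year costs only a small fixed-width row.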

CAVEAT

Simply remove ROW_FORMAT=FIXED and the rows become dynamic, which makes the ngram_key table much smaller.


DiskSpace Metrics

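ngram_rec has 17 bytes per row:
8 bytes for ngram_id (maximum unsigned value 18446744073709551615 [2^64 - 1])
8 bytes for the 4 SMALLINTs (2 bytes each)
1 byte for the MyISAM internal delete flag

Index entry for ngram_rec = 10 bytes (8 (ngram_id) + 2 (yr))

47 million rows X 17 bytes per row = 799 million bytes = 761.9858 MB
47 million rows X 10 bytes per index entry = 470 million bytes = 448.2269 MB
47 million rows X 27 bytes per row = 1,269 million bytes = 1.1818 GB

5 billion rows X 17 bytes per row = 85 billion bytes = 79.1624 GB
5 billion rows X 10 bytes per index entry = 50 billion bytes = 46.5661 GB
5 billion rows X 27 bytes per row = 135 billion bytes = 125.7285 GB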


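ngram_key has 73 bytes per row:
64 bytes for ngram (ROW_FORMAT=FIXED stores the VARCHAR as a CHAR)
8 bytes for ngram_id
1 byte for the MyISAM internal delete flag

2 index entries for ngram_key = 64 bytes + 8 bytes = 72 bytes

47 million rows X 73 bytes per row = 3,431 million bytes = 3.1954 GB
47 million rows X 72 bytes per index entry = 3,384 million bytes = 3.1516 GB
47 million rows X 145 bytes per row = 6,815 million bytes = 6.3470 GB

5 billion rows X 73 bytes per row = 365 billion bytes = 339.9327 GB
5 billion rows X 72 bytes per index entry = 360 billion bytes = 335.2761 GB
5 billion rows X 145 bytes per row = 725 billion bytes = 675.2088 GB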
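The per-row widths quoted above can be re-derived from the column definitions. The sketch below assumes fixed-format MyISAM rows and the byte widths used in this answer; TYPE_BYTES and row_bytes are illustrative names, not anything MySQL provides.

```python
# Byte widths assumed for fixed-format MyISAM rows (hypothetical helper names).
TYPE_BYTES = {"BIGINT": 8, "SMALLINT": 2, "CHAR(64)": 64}
DELETE_FLAG = 1  # MyISAM's internal per-row delete flag, as counted above


def row_bytes(columns):
    """Sum the column widths and add the one-byte delete flag."""
    return sum(TYPE_BYTES[c] for c in columns) + DELETE_FLAG


# ngram_key: NGRAM_ID + NGRAM (the VARCHAR(64) is stored as CHAR(64))
ngram_key_row = row_bytes(["BIGINT", "CHAR(64)"])
# ngram_rec: NGRAM_ID + YR, MC, PC, VC
ngram_rec_row = row_bytes(["BIGINT"] + ["SMALLINT"] * 4)
# ngram_key carries two indexes: one on NGRAM, one on NGRAM_ID
ngram_key_index = TYPE_BYTES["CHAR(64)"] + TYPE_BYTES["BIGINT"]

print(ngram_key_row, ngram_rec_row, ngram_key_index)  # 73 17 72
```

If you change a column type (say, a wider count column), adjust TYPE_BYTES and the estimates follow.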


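Thanks for the two great answers. I'm curious, what is the reasoning behind using this blackhole + trigger approach to populate the tables?
Dolan Antenucci

The blackhole table accepts the original ngram. The trigger provides a clean INSERT IGNORE mechanism for splitting the ngram away from its auto_increment value.
RolandoMySQLDBA, 2011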

3

Here is a crazy suggestion

Convert all ngrams to 32-character MD5 keys

This table will hold ngrams of every size (up to 255 characters): 1-grams, 2-grams, and so on.

use test
DROP TABLE ngram_node;
DROP TABLE ngram_blackhole;
CREATE TABLE ngram_node
(
  NGRAM_KEY  CHAR(32) NOT NULL,
  NGRAM_YEAR SMALLINT NOT NULL,
  M_COUNT    SMALLINT NOT NULL,
  P_COUNT    SMALLINT NOT NULL,
  V_COUNT    SMALLINT NOT NULL,
  PRIMARY KEY   (NGRAM_KEY,NGRAM_YEAR)
) ENGINE=MyISAM
PARTITION BY KEY(NGRAM_KEY)
PARTITIONS 256;
CREATE TABLE ngram_blackhole
(
  NGRAM      VARCHAR(255) NOT NULL,
  NGRAM_YEAR SMALLINT NOT NULL,
  M_COUNT    SMALLINT NOT NULL,
  P_COUNT    SMALLINT NOT NULL,
  V_COUNT    SMALLINT NOT NULL
) ENGINE=BLACKHOLE;
DELIMITER $$
CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blackhole FOR EACH ROW
BEGIN
    INSERT INTO ngram_node VALUES (MD5(NEW.NGRAM),NEW.NGRAM_YEAR,NEW.M_COUNT,NEW.P_COUNT,NEW.V_COUNT);
END; $$
DELIMITER ;
INSERT INTO ngram_blackhole VALUES
('rolando',1965,31,29,85),
('pamela',1971,33,21,86),
('dominique',1996,30,18,87),
('diamond',1998,13,28,88),
('rolando edwards',1965,31,29,85),
('pamela edwards',1971,33,21,86),
('dominique edwards',1996,30,18,87),
('diamond edwards',1998,13,28,88),
('rolando angel edwards',1965,31,29,85),
('pamela claricia edwards',1971,33,21,86),
('dominique sharlisee edwards',1996,30,18,87),
('diamond ashley edwards',1998,13,28,88);
SELECT * FROM ngram_node;

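I chose 256 partitions because the MD5 function returns strings made of 16 distinct characters (the hexadecimal digits), so the first two characters have 16 X 16 = 256 possible values.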
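To make the 256 figure concrete, here is a small Python sketch that hashes an ngram and looks at the first two hex digits. The helper name md5_key is made up, and it assumes the ngram is UTF-8 encoded. Bear in mind this only illustrates the counting argument: PARTITION BY KEY uses MySQL's own internal hashing function, so the partition a row actually lands in is not simply these two characters.

```python
import hashlib


def md5_key(ngram: str) -> str:
    """Return the 32-character hex MD5 digest of an ngram (UTF-8 assumed)."""
    return hashlib.md5(ngram.encode("utf-8")).hexdigest()


key = md5_key("rolando edwards")
prefix = key[:2]             # first two hex digits of the digest
bucket = int(prefix, 16)     # value between 0 and 255 inclusive

# Every possible two-digit hex prefix: 16 choices for each digit.
all_prefixes = {f"{i:02x}" for i in range(256)}

print(len(key), prefix, bucket)
print(len(all_prefixes))     # 256
```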

Here are the results from MySQL 5.5.11 on a Windows 7 desktop

mysql> use test
Database changed
mysql> DROP TABLE ngram_node;
Query OK, 0 rows affected (0.22 sec)

mysql> DROP TABLE ngram_blackhole;
Query OK, 0 rows affected (0.11 sec)

mysql> CREATE TABLE ngram_node
    -> (
    ->   NGRAM_KEY  CHAR(32) NOT NULL,
    ->   NGRAM_YEAR SMALLINT NOT NULL,
    ->   M_COUNT    SMALLINT NOT NULL,
    ->   P_COUNT    SMALLINT NOT NULL,
    ->   V_COUNT    SMALLINT NOT NULL,
    ->   PRIMARY KEY    (NGRAM_KEY,NGRAM_YEAR)
    -> ) ENGINE=MyISAM
    -> PARTITION BY KEY(NGRAM_KEY)
    -> PARTITIONS 256;
Query OK, 0 rows affected (0.36 sec)

mysql> CREATE TABLE ngram_blackhole
    -> (
    ->   NGRAM      VARCHAR(255) NOT NULL,
    ->   NGRAM_YEAR SMALLINT NOT NULL,
    ->   M_COUNT    SMALLINT NOT NULL,
    ->   P_COUNT    SMALLINT NOT NULL,
    ->   V_COUNT    SMALLINT NOT NULL
    -> ) ENGINE=BLACKHOLE;
Query OK, 0 rows affected (0.11 sec)

mysql> DELIMITER $$
mysql> CREATE TRIGGER populate_ngram AFTER INSERT ON ngram_blackhole FOR EACH ROW
    -> BEGIN
    ->  INSERT INTO ngram_node VALUES (MD5(NEW.NGRAM),NEW.NGRAM_YEAR,NEW.M_COUNT,NEW.P_COUNT,NEW.V_COUNT);
    -> END; $$
Query OK, 0 rows affected (0.05 sec)

mysql> DELIMITER ;
mysql> INSERT INTO ngram_blackhole VALUES
    -> ('rolando',1965,31,29,85),
    -> ('pamela',1971,33,21,86),
    -> ('dominique',1996,30,18,87),
    -> ('diamond',1998,13,28,88),
    -> ('rolando edwards',1965,31,29,85),
    -> ('pamela edwards',1971,33,21,86),
    -> ('dominique edwards',1996,30,18,87),
    -> ('diamond edwards',1998,13,28,88),
    -> ('rolando angel edwards',1965,31,29,85),
    -> ('pamela claricia edwards',1971,33,21,86),
    -> ('dominique sharlisee edwards',1996,30,18,87),
    -> ('diamond ashley edwards',1998,13,28,88);
Query OK, 12 rows affected (0.18 sec)
Records: 12  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM ngram_node;
+----------------------------------+------------+---------+---------+---------+
| NGRAM_KEY                        | NGRAM_YEAR | M_COUNT | P_COUNT | V_COUNT |
+----------------------------------+------------+---------+---------+---------+
| 2ca237192aaac3b3a20ce0649351b395 |       1996 |      30 |      18 |      87 |
| 6f7fd3368170c562604f62fb4e92056d |       1965 |      31 |      29 |      85 |
| fb201333fef377917be714dabd3776d9 |       1971 |      33 |      21 |      86 |
| 4f79e21800ed6e30be4d1cb597f910c6 |       1971 |      33 |      21 |      86 |
| 9068e0de9f3fd674d4fa7cbc626e5888 |       1998 |      13 |      28 |      88 |
| 8a18abe90f2612827dc3a215fd1905d3 |       1965 |      31 |      29 |      85 |
| be60b431a46fcc7bf5ee4f7712993e3b |       1996 |      30 |      18 |      87 |
| c8adc38aa00759488b1d759aa8f91725 |       1996 |      30 |      18 |      87 |
| e80d4ab77eb18a4ca350157fd487d7e2 |       1965 |      31 |      29 |      85 |
| 669ffc150d1f875819183addfc842cab |       1971 |      33 |      21 |      86 |
| b685323e9de65080f733b53b2305da6e |       1998 |      13 |      28 |      88 |
| 75c6f03161d020201000414cd1501f9f |       1998 |      13 |      28 |      88 |
+----------------------------------+------------+---------+---------+---------+
12 rows in set (0.00 sec)

mysql>

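Note that I loaded 1-grams, 2-grams, and 3-grams into the same table, and you cannot tell which MD5 belongs to which ngram. Thus, ngrams of every size can go into this one table. Just remember to insert into the ngram_blackhole table, and the rest is done for you.

Whatever the ngram, you must query the ngram_node table using the MD5() of that ngram.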

mysql> select * from ngram_node where ngram_key=MD5('rolando edwards');
+----------------------------------+------------+---------+---------+---------+
| NGRAM_KEY                        | NGRAM_YEAR | M_COUNT | P_COUNT | V_COUNT |
+----------------------------------+------------+---------+---------+---------+
| 6f7fd3368170c562604f62fb4e92056d |       1965 |      31 |      29 |      85 |
+----------------------------------+------------+---------+---------+---------+
1 row in set (0.05 sec)

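If you want the 1-grams, 2-grams, and 3-grams separated into their own repositories, just create another table, another blackhole table, and another trigger on that blackhole table that inserts into the other table.

Also, if your ngrams are longer than 255 characters (say you are working with 7-grams or 8-grams), just increase the VARCHAR size of the NGRAM column in the ngram_blackhole table.

Give it a try!!!

UPDATE

The question says 47 million rows were loaded into MySQL. For my suggested table layout, please note the following: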

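ngram_node is 41 bytes per row:
32 for NGRAM_KEY
8 for the numbers (2 for each SMALLINT)
1 for the internal MyISAM DELETED flag

Each primary key index entry would be 34 bytes:
32 for NGRAM_KEY
2 for NGRAM_YEAR

47 million rows X 41 bytes per row = 1.927 billion bytes, about 1.79466 GB.
47 million rows X 34 bytes per index entry = 1.598 billion bytes, about 1.48825 GB.
The MyISAM table's consumption should total about 3.28291 GB.

The question also mentions loading 5 billion rows.

5 billion rows X 41 bytes per row = 205 billion bytes, about 190.9211 GB.
5 billion rows X 34 bytes per index entry = 170 billion bytes, about 158.3248 GB.
The MyISAM table's consumption should total about 349.2459 GB.

Note that, because the primary key has a constant size, the space used by the MyISAM table grows linearly with the row count. You can now plan disk space on that basis.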
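If you want to plug in your own row counts, the estimate is a one-liner per table. Here is a rough Python sketch using the 41-byte row and 34-byte index entry from this answer; the function name is made up, and "GB" here means 1024^3 bytes, which is how the figures above were computed. Treat the result as a lower-bound estimate, since it ignores any per-page or per-file overhead.

```python
GIB = 1024 ** 3        # the "GB" figures in this answer are 1024^3 bytes
ROW_BYTES = 41         # ngram_node data row, from the breakdown above
INDEX_BYTES = 34       # primary key entry: NGRAM_KEY (32) + NGRAM_YEAR (2)


def myisam_estimate_gib(rows):
    """Return (data, index, total) size estimates for ngram_node."""
    data = rows * ROW_BYTES / GIB
    index = rows * INDEX_BYTES / GIB
    return data, index, data + index


for rows in (47_000_000, 5_000_000_000):
    data, index, total = myisam_estimate_gib(rows)
    print(f"{rows:>13,} rows: data {data:.4f} GB, "
          f"index {index:.4f} GB, total {total:.4f} GB")
```

Because every term is rows times a constant, doubling the row count doubles the estimate.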


1
I thought about my answer and came up with another suggestion that would use less disk space. I will tackle it on Monday! Have a good weekend.
RolandoMySQLDBA, 2011
Licensed under cc by-sa 3.0 with attribution required.