Multipathing to improve SAS JBOD performance on Linux



I'm trying to optimize a storage setup on some Sun hardware with Linux. Any ideas would be greatly appreciated.

We have the following hardware:

  • Sun Blade X6270
  • 2 LSISAS1068E SAS controllers
  • 2 Sun J4400 JBODs with 1 TB disks (24 disks per JBOD)
  • Fedora Core 12
  • 2.6.33 release kernel from FC13 (also tried FC12's latest 2.6.31 kernel; same result)

Here's the datasheet for the SAS hardware:

http://www.sun.com/storage/storage_networking/hba/sas/PCIe.pdf

It uses 8 lanes of PCI Express 1.0a. At 250 MB/sec per lane, each SAS controller should be able to do 2000 MB/sec.
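One way to sanity-check the PCIe side (not from the original post; 1000 is simply the LSI PCI vendor ID) is to ask lspci for the negotiated link width and speed of the two controllers:

# Dump the PCIe link capability and status for all LSI devices.
# LnkCap is what the device can do, LnkSta is what was actually negotiated.
lspci -vv -d 1000: | grep -E 'LnkCap|LnkSta'

For an x8 PCIe 1.0a link the LnkSta line should report "Speed 2.5GT/s, Width x8"; a link that trained at x4 would immediately halve the 2000 MB/sec ceiling.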

Each controller can do 3 Gb/sec per port and has two 4-port PHYs. We connect both PHYs from a controller to its JBOD, so between the JBOD and the controller we have 2 PHYs * 4 SAS ports * 3 Gb/sec = 24 Gb/sec of bandwidth, which is more than the PCI Express bandwidth.
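To confirm that every lane really negotiated 3 Gb/sec, the sas_phy entries in sysfs can be checked (a sketch; the exact phy names depend on the topology):

# Print the negotiated link rate of every SAS PHY the kernel knows about.
for p in /sys/class/sas_phy/*/negotiated_linkrate; do
    echo "$p: $(cat "$p")"
done

Every active PHY should report "3.0 Gbit".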

With write caching enabled and doing large writes, each disk can sustain about 80 MB/sec (near the start of the disk). With 24 disks, that means we should be able to hit 1920 MB/sec per JBOD.
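The per-disk figure can be verified directly against a single raw disk before multipath gets involved (a destructive test; /dev/sdb is just a placeholder for one of the JBOD disks):

# Sequential write to one disk, bypassing the page cache.
# WARNING: this overwrites whatever is on the target device.
dd if=/dev/zero of=/dev/sdb oflag=direct bs=128M count=16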

The multipath configuration we use for each device looks like this:

multipath {
  rr_min_io 100
  uid 0
  path_grouping_policy multibus
  failback manual
  path_selector "round-robin 0"
  rr_weight priorities
  alias somealias
  no_path_retry queue
  mode 0644
  gid 0
  wwid somewwid
}
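After reloading the maps it's worth confirming that both paths to each disk show up in a single multibus path group (somealias is the alias from the config above):

# Reload the multipath maps and show the resulting topology for one device.
multipath -r
multipath -ll somealias

Both paths should be listed in one path group with an active, ready status; if one path is missing, half of the 24 Gb/sec simply isn't there.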

I tried values of 50, 100, and 1000 for rr_min_io, but it didn't seem to make much difference.

Along with varying rr_min_io I tried adding some delay between starting the dd's to keep them from all writing over the same PHY at the same time, but that made no difference either, so I think the I/Os are getting spread out properly.

According to /proc/interrupts, the SAS controllers are using an "IR-IO-APIC-fasteoi" interrupt scheme. For some reason only core #0 in the machine is handling these interrupts. I can improve performance slightly by assigning a separate core to handle the interrupts for each SAS controller:

echo 2 > /proc/irq/24/smp_affinity
echo 4 > /proc/irq/26/smp_affinity
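smp_affinity takes a hexadecimal CPU bitmask, so 2 pins IRQ 24 to core 1 and 4 pins IRQ 26 to core 2. Whether the change took effect can be seen from the per-CPU counters (the IRQ numbers are of course specific to this box):

# Show the affinity masks and where the SAS controller interrupts are landing.
cat /proc/irq/24/smp_affinity /proc/irq/26/smp_affinity
grep -E 'CPU| 24:| 26:' /proc/interrupts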

Writing to the disks with dd generates "Function call interrupts" (no idea what these are), which are handled by core #4, so I keep other processes off that core as well.
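For what it's worth, they show up as the "CAL" row in /proc/interrupts, which at least makes it easy to see which cores are absorbing them:

# "Function call interrupts" are the CAL line; the per-CPU columns
# show which core is fielding them.
grep -E 'CPU|CAL' /proc/interrupts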

I run 48 dd's (one per disk), assigning them to cores that are not dealing with interrupts, like so:

taskset -c somecore dd if=/dev/zero of=/dev/mapper/mpathx oflag=direct bs=128M

oflag=direct keeps any kind of buffer cache from getting involved.
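Spelled out, launching all 48 writers pinned to the non-interrupt cores looks roughly like this (a sketch; the core list and mpath device names are placeholders, and cores 1, 2 and 4 are skipped because they handle interrupts here):

# Start one dd per multipath device, pinned round-robin onto the
# cores that are not servicing SAS or function-call interrupts.
CORES=(0 3 5 6 7 8 9 10 11 12 13 14 15)
i=0
for dev in /dev/mapper/mpath*; do
    taskset -c "${CORES[i % ${#CORES[@]}]}" \
        dd if=/dev/zero of="$dev" oflag=direct bs=128M &
    i=$((i + 1))
done
wait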

None of my cores seem maxed out. The cores dealing with interrupts are mostly idle, and all the other cores are waiting on I/O, as one would expect:

Cpu0  : 0.0%us, 1.0%sy, 0.0%ni, 91.2%id, 7.5%wa, 0.0%hi, 0.2%si, 0.0%st
Cpu1  : 0.0%us, 0.8%sy, 0.0%ni, 93.0%id, 0.2%wa, 0.0%hi, 6.0%si, 0.0%st
Cpu2  : 0.0%us, 0.6%sy, 0.0%ni, 94.4%id, 0.1%wa, 0.0%hi, 4.8%si, 0.0%st
Cpu3  : 0.0%us, 7.5%sy, 0.0%ni, 36.3%id, 56.1%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu4  : 0.0%us, 1.3%sy, 0.0%ni, 85.7%id, 4.9%wa, 0.0%hi, 8.1%si, 0.0%st
Cpu5  : 0.1%us, 5.5%sy, 0.0%ni, 36.2%id, 58.3%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu6  : 0.0%us, 5.0%sy, 0.0%ni, 36.3%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu7  : 0.0%us, 5.1%sy, 0.0%ni, 36.3%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu8  : 0.1%us, 8.3%sy, 0.0%ni, 27.2%id, 64.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu9  : 0.1%us, 7.9%sy, 0.0%ni, 36.2%id, 55.8%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu10 : 0.0%us, 7.8%sy, 0.0%ni, 36.2%id, 56.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu11 : 0.0%us, 7.3%sy, 0.0%ni, 36.3%id, 56.4%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu12 : 0.0%us, 5.6%sy, 0.0%ni, 33.1%id, 61.2%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu13 : 0.1%us, 5.3%sy, 0.0%ni, 36.1%id, 58.5%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu14 : 0.0%us, 4.9%sy, 0.0%ni, 36.4%id, 58.7%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu15 : 0.1%us, 5.4%sy, 0.0%ni, 36.5%id, 58.1%wa, 0.0%hi, 0.0%si, 0.0%st

With all of that in place, the throughput reported by running "dstat 10" is in the 2200-2300 MB/sec range.
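One way to narrow down where the shortfall is (not something from the original post) is to watch per-device throughput alongside the aggregate number, for example with iostat from sysstat:

# Extended per-device statistics in MB/s, refreshed every 10 seconds.
# If some mpath devices run well below 80 MB/sec while others don't,
# that points at a saturated PHY or controller rather than the disks.
iostat -xm 10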

Given the math above, I would have expected something in the range of 2 * 1920 ~= 3600+ MB/sec.

Does anybody know where my missing bandwidth went?

Thanks!


Is the cache on the LSI SAS controllers set to write-through? (Write-back will be slower for large sequential workloads.) You might also want to test dd with a smaller bs, e.g. bs=1M.
- Brian
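For what it's worth, the drives' volatile write cache setting can be checked per device along those lines (/dev/sdb is just a placeholder):

# Report whether the write cache is enabled on one of the JBOD disks.
sdparm --get=WCE /dev/sdb    # SCSI/SAS-attached disks
hdparm -W /dev/sdb           # ATA disks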

Answers:



Nice, well-prepared question :)

I'm a bit of an interloper on this one myself, but honestly I think you're on the money. I half expected to see your throughput come in lower than it does, but what I think you have is a build-up of minor, expected inefficiencies. For example, a PCIe bus is hard to keep at 100% all the time; it's better to assume an overall rate down around 90%. Given the jitter that causes, the PHYs won't be "fed" all the time either, so you lose a little there, and the same goes for the cache, the disks, un-coalesced interrupts, I/O scheduling, and so on. Basically it's minor inefficiency times minor inefficiency times... and so on, and the end result is more than the 5-10% inefficiency you'd expect from any one of them alone. I've seen this sort of thing with HP DL servers talking to MSA SAS boxes under W2K3 and then being NLB'ed across multiple NICs - frustrating but understandable. Anyway, that's my 2c; sorry it's not more positive.
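To put a rough number on that "inefficiency times inefficiency" effect (the figures are purely illustrative, not measurements): stacking just four independent losses of about 10% each compounds to 0.9^4 ≈ 0.66, and 3600 MB/sec × 0.66 ≈ 2360 MB/sec, which already lands in the neighbourhood of the observed 2200-2300 MB/sec.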
