在依赖项失败时重新启动systemd服务

26

如果某个服务的依赖项之一在启动时失败（但在重试后成功），那么处理重启服务的正确方法是什么。

这是人为设计的副本，可以使问题更清楚。

a.service（模拟第一次尝试失败和第二次尝试成功）

[Unit]
Description=A

[Service]
ExecStartPre=/bin/sh -x -c "[ -f /tmp/success ] || (touch /tmp/success && sleep 10)"
ExecStart=/bin/true
TimeoutStartSec=5
Restart=on-failure
RestartSec=5
RemainAfterExit=yes

b.service（在A启动后通常会成功）

[Unit]
Description=B
After=a.service
Requires=a.service

[Service]
ExecStart=/bin/true
RemainAfterExit=yes
Restart=on-failure
RestartSec=5

让我们开始吧：

# systemctl start b
A dependency job for b.service failed. See 'journalctl -xe' for details.

日志：

Jun 30 21:34:54 debug systemd[1]: Starting A...
Jun 30 21:34:54 debug sh[1308]: + '[' -f /tmp/success ']'
Jun 30 21:34:54 debug sh[1308]: + touch /tmp/success
Jun 30 21:34:54 debug sh[1308]: + sleep 10
Jun 30 21:34:59 debug systemd[1]: a.service start-pre operation timed out. Terminating.
Jun 30 21:34:59 debug systemd[1]: Failed to start A.
Jun 30 21:34:59 debug systemd[1]: Dependency failed for B.
Jun 30 21:34:59 debug systemd[1]: Job b.service/start failed with result 'dependency'.
Jun 30 21:34:59 debug systemd[1]: Unit a.service entered failed state.
Jun 30 21:34:59 debug systemd[1]: a.service failed.
Jun 30 21:35:04 debug systemd[1]: a.service holdoff time over, scheduling restart.
Jun 30 21:35:04 debug systemd[1]: Starting A...
Jun 30 21:35:04 debug systemd[1]: Started A.
Jun 30 21:35:04 debug sh[1314]: + '[' -f /tmp/success ']'

A已成功启动，但B处于失败状态，不会重试。

编辑

我在两个服务中都添加了以下内容，现在当A启动时B成功启动了，但是我无法解释原因。

[Install]
WantedBy=multi-user.target

为什么这会影响A和B之间的关系？

编辑2

上面的“修复”在systemd 220中不起作用。

systemd 219调试日志

systemd219 systemd[1]: Trying to enqueue job b.service/start/replace
systemd219 systemd[1]: Installed new job b.service/start as 3454
systemd219 systemd[1]: Installed new job a.service/start as 3455
systemd219 systemd[1]: Enqueued job b.service/start as 3454
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1502
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1502]: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmpoldcoreos
systemd219 sh[1502]: + '[' -f /tmp/success ']'
systemd219 sh[1502]: + touch /tmp/success
systemd219 sh[1502]: + sleep 10
systemd219 systemd[1]: a.service start-pre operation timed out. Terminating.
systemd219 systemd[1]: a.service changed start-pre -> final-sigterm
systemd219 systemd[1]: Child 1502 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=killed status=15
systemd219 systemd[1]: a.service got final SIGCHLD for state final-sigterm
systemd219 systemd[1]: a.service changed final-sigterm -> failed
systemd219 systemd[1]: Job a.service/start finished, result=failed
systemd219 systemd[1]: Failed to start A.
systemd219 systemd[1]: Job b.service/start finished, result=dependency
systemd219 systemd[1]: Dependency failed for B.
systemd219 systemd[1]: Job b.service/start failed with result 'dependency'.
systemd219 systemd[1]: Unit a.service entered failed state.
systemd219 systemd[1]: a.service failed.
systemd219 systemd[1]: a.service changed failed -> auto-restart
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: a.service holdoff time over, scheduling restart.
systemd219 systemd[1]: Trying to enqueue job a.service/restart/fail
systemd219 systemd[1]: Installed new job a.service/restart as 3718
systemd219 systemd[1]: Installed new job b.service/restart as 3803
systemd219 systemd[1]: Enqueued job a.service/restart as 3718
systemd219 systemd[1]: a.service scheduled restart job.
systemd219 systemd[1]: Job b.service/restart finished, result=done
systemd219 systemd[1]: Converting job b.service/restart -> b.service/start
systemd219 systemd[1]: a.service changed auto-restart -> dead
systemd219 systemd[1]: Job a.service/restart finished, result=done
systemd219 systemd[1]: Converting job a.service/restart -> a.service/start
systemd219 systemd[1]: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch oldcoreos
systemd219 systemd[1]: Forked /bin/sh as 1558
systemd219 systemd[1]: a.service changed dead -> start-pre
systemd219 systemd[1]: Starting A...
systemd219 systemd[1]: Child 1558 belongs to a.service
systemd219 systemd[1]: a.service: control process exited, code=exited status=0
systemd219 systemd[1]: a.service got final SIGCHLD for state start-pre
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1561
systemd219 systemd[1]: a.service changed start-pre -> running
systemd219 systemd[1]: Job a.service/start finished, result=done
systemd219 systemd[1]: Started A.
systemd219 systemd[1]: Child 1561 belongs to a.service
systemd219 systemd[1]: a.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: a.service changed running -> exited
systemd219 systemd[1]: a.service: cgroup is empty
systemd219 systemd[1]: About to execute: /bin/true
systemd219 systemd[1]: Forked /bin/true as 1563
systemd219 systemd[1]: b.service changed dead -> running
systemd219 systemd[1]: Job b.service/start finished, result=done
systemd219 systemd[1]: Started B.
systemd219 systemd[1]: Starting B...
systemd219 systemd[1]: Child 1563 belongs to b.service
systemd219 systemd[1]: b.service: main process exited, code=exited, status=0/SUCCESS
systemd219 systemd[1]: b.service changed running -> exited
systemd219 systemd[1]: b.service: cgroup is empty
systemd219 sh[1558]: + '[' -f /tmp/success ']'

systemd 220调试日志

systemd220 systemd[1]: b.service: Trying to enqueue job b.service/start/replace
systemd220 systemd[1]: a.service: Installed new job a.service/start as 4846
systemd220 systemd[1]: b.service: Installed new job b.service/start as 4761
systemd220 systemd[1]: b.service: Enqueued job b.service/start as 4761
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2032
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[2032]: a.service: Executing: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 sh[2032]: + '[' -f /tmp/success ']'
systemd220 sh[2032]: + touch /tmp/success
systemd220 sh[2032]: + sleep 10
systemd220 systemd[1]: a.service: Start-pre operation timed out. Terminating.
systemd220 systemd[1]: a.service: Changed start-pre -> final-sigterm
systemd220 systemd[1]: a.service: Child 2032 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=killed status=15
systemd220 systemd[1]: a.service: Got final SIGCHLD for state final-sigterm.
systemd220 systemd[1]: a.service: Changed final-sigterm -> failed
systemd220 systemd[1]: a.service: Job a.service/start finished, result=failed
systemd220 systemd[1]: Failed to start A.
systemd220 systemd[1]: b.service: Job b.service/start finished, result=dependency
systemd220 systemd[1]: Dependency failed for B.
systemd220 systemd[1]: b.service: Job b.service/start failed with result 'dependency'.
systemd220 systemd[1]: a.service: Unit entered failed state.
systemd220 systemd[1]: a.service: Failed with result 'timeout'.
systemd220 systemd[1]: a.service: Changed failed -> auto-restart
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: Failed to send unit change signal for a.service: Transport endpoint is not connected
systemd220 systemd[1]: a.service: Service hold-off time over, scheduling restart.
systemd220 systemd[1]: a.service: Trying to enqueue job a.service/restart/fail
systemd220 systemd[1]: a.service: Installed new job a.service/restart as 5190
systemd220 systemd[1]: a.service: Enqueued job a.service/restart as 5190
systemd220 systemd[1]: a.service: Scheduled restart job.
systemd220 systemd[1]: a.service: Changed auto-restart -> dead
systemd220 systemd[1]: a.service: Job a.service/restart finished, result=done
systemd220 systemd[1]: a.service: Converting job a.service/restart -> a.service/start
systemd220 systemd[1]: a.service: About to execute: /bin/sh -x -c '[ -f /tmp/success ] || (touch /tmp/success && sleep 10)'
systemd220 systemd[1]: a.service: Forked /bin/sh as 2132
systemd220 systemd[1]: a.service: Changed dead -> start-pre
systemd220 systemd[1]: Starting A...
systemd220 systemd[1]: a.service: Child 2132 belongs to a.service
systemd220 systemd[1]: a.service: Control process exited, code=exited status=0
systemd220 systemd[1]: a.service: Got final SIGCHLD for state start-pre.
systemd220 systemd[1]: a.service: About to execute: /bin/true
systemd220 systemd[1]: a.service: Forked /bin/true as 2136
systemd220 systemd[1]: a.service: Changed start-pre -> running
systemd220 systemd[1]: a.service: Job a.service/start finished, result=done
systemd220 systemd[1]: Started A.
systemd220 systemd[1]: a.service: Child 2136 belongs to a.service
systemd220 systemd[1]: a.service: Main process exited, code=exited, status=0/SUCCESS
systemd220 systemd[1]: a.service: Changed running -> exited
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 systemd[1]: a.service: cgroup is empty
systemd220 sh[2132]: + '[' -f /tmp/success ']'

systemd

— 瓦迪姆
source

1

有一个上游systemd问题跟踪此问题：github.com/systemd/systemd/issues/1312

— JKnight '16

31

如果有人因缺少有关此主题的信息而遇到此问题，我将尝试总结我对该问题的发现。

Restart=on-failure 仅适用于流程失败（不适用于因依赖项失败而导致的失败）
当依赖项成功重新启动时，在某些情况下依赖失败的单元会在某些条件下重新启动的事实是systemd <220中的错误：http : //lists.freedesktop.org/archives/systemd-devel/2015-July/033513.html
如果某个依赖项在启动时可能失败的可能性很小，并且您在乎弹性，请不要使用Before/ After，而是对依赖项产生的某些工件进行检查

例如

ExecStartPre=/usr/bin/test -f /some/thing
Restart=on-failure
RestartSec=5s

您甚至可以使用systemctl is-active <dependecy>。

非常骇客，但我找不到更好的选择。

我认为，没有办法处理依赖项失败是systemd中的一个缺陷。

— 瓦迪姆
source

是的，更不用说没有重试伦纳德·波特林不想实现的挂载点了：github.com/systemd/systemd/issues/4468

— Hvisage

0

似乎可以轻松编写脚本并将其放入cronjob中。基本逻辑是这样的

检查服务a和b以及依赖项是否都在运行/处于有效状态。您将知道检查一切是否正常的最佳方法
如果一切正常，则什么也不做，或记录一切正常。日志记录的优点是允许您查找先前的日志条目。
如果发生故障，请重新启动服务，然后跳回到执行服务和依赖项状态检查的脚本的开头。仅当您对重新启动服务有信心并且依赖项很有可能正常工作时，才应该发生跳转，否则就有可能发生循环。
让cron过一会儿再次运行脚本

一旦设置了脚本，cron就是测试它的好地方，如果cron效率低下，那么该脚本将是尝试编写可检查某些其他服务的状态并根据需要重新启动它们的低级系统服务的良好起点。根据您要投入多少精力，甚至可以根据结果将脚本设置为通过电子邮件发送给您（当然，所涉及的服务是网络服务）。

— 马特
source

这项工作应该在流程/服务管理器中完成，否则，您将回到SVR4方法，systemd尝试不这样做...

— Hvisage

0

After并且Before仅设置启动服务的顺序，您的服务文件会显示“如果将启动A和B，则必须在B之前启动A”。

Requires 意味着如果要启动此服务，则必须先启动该服务，在您的示例中“如果B已启动而A没有运行，请启动A”

当您添加时，WantedBy=multi-user.target您现在告诉系统在系统初始化时必须启动服务multi-user.target，大概这意味着一旦添加，便是让系统启动服务而不是手动启动它们？

我不确定为什么在220版本中不起作用，可能值得尝试222。我将挖掘一个VM，并在有机会的情况下尝试您的服务。

— 邵逸夫
source

1

我在systemd-devel上询问，它在219中正常工作是一个错误。预期的行为是失败的依赖项不会重新启动。

— 瓦迪姆

0

我花了几天的时间，试图让它以“系统化”的方式工作，但是却沮丧地放弃了，写了一个包装器脚本来管理依赖性和失败。每个子服务都是普通的systemd服务，没有“ Requires”或“ PartOf”或与其他服务的任何挂钩。

我的顶级服务文件如下所示：

[Service]
Type=simple
Environment=REQUIRES=foo.service bar.service
ExecStartPre=/usr/bin/systemctl start $REQUIRES
ExecStart=@PREFIX@/bin/top-service.sh $REQUIRES
ExecStop=/usr/bin/systemctl      stop $REQUIRES

到目前为止，一切都很好。该top.service文件控制foo.service和bar.service。起动top开始foo和bar停止top停止foo和bar。最后一个成分是我的top-service.sh脚本，该脚本监视服务的失败情况：

#!/bin/bash

# This monitors REQUIRES services. If any service stops, all of the services are stopped and this script ends.

REQUIRES="$@"

if [ "$REQUIRES" == "" ]
then
  echo "ERROR: no services listed"
  exit 1
fi

echo "INFO: watching services: ${REQUIRES}"

end=0
while [[ $end == 0 ]]
do
  s=$(systemctl is-active ${REQUIRES} )
  if echo $s | egrep '^(active ?)+$' > /dev/null
  then
    # $s has embedded newlines, but echo $s seems to get rid of them, while echo "$s" keeps them.
    # echo INFO: All active, $s
    end=0
  else
    echo "WARN: ${REQUIRES}"
    echo WARN: $s
  fi

  if [[ $s == *"failed"* ]] || [[ $s == *"unknown"* ]]
  then
    echo "WARN: At least one service is failed or unknown, ending service"
    end=1
  else
    sleep 1
  fi
done

echo "INFO: done watching services, stopping: ${REQUIRES}"
systemctl stop ${REQUIRES}
echo "INFO: stopped: ${REQUIRES}"
exit 1

— 马克·拉卡塔
source

REQUIRES="$@"是天生的错误代码-您将数组折叠成字符串，放弃了项目之间的原始边界，因此创建了参数，即。set -- "argument one" "argument two"与相同set -- "argument" "one" "argument" "two"。requires=( "$@" )会保留原始数据，因此可以安全地扩展为systemctl is-active "${requires[@]}"。

— 查尔斯·达菲

-1

没有答案。但是有人可能需要这个（因为此页面显示在搜索中）：

应该

[Service]
 Restart=always
 RestartSec=3

https://jonarcher.info/2015/08/ensure-systemd-services-restart-on-failure/

— 西蒙·道金（Shimon Doodkin）
source

请更仔细地阅读问题。这与重新启动单个运行不正常的服务无关，而与被告服务失败时systemd的行为有关。

— Vadim