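systemd.timer stuck in the activating state

Symptom

The service started by systemd.timer stayed in the activating state, and the periodic runs stopped happening. I want to monitor the periodic execution of the batch.

What states does #systemd have?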

linux - All systemd states - Super User

active

inactive

activating

deactivating

failed
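
To monitor a batch, these states can be read from a script. A minimal sketch, assuming `systemctl show <unit> --property=ActiveState` prints a line of the form `ActiveState=...`; the function names are made up for illustration.

```python
import subprocess


def parse_active_state(show_output):
    """Extract the ActiveState value from `systemctl show` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition("=")
        if sep and key == "ActiveState":
            return value
    return None


def unit_active_state(unit):
    """Ask systemctl for the current ActiveState of a unit."""
    result = subprocess.run(
        ["systemctl", "show", unit, "--property=ActiveState"],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return parse_active_state(result.stdout)
```

Calling `unit_active_state("batch.service")` on a machine running systemd should return one of the five strings above.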

What kind of state is activating?

systemd

Units may be "active" (meaning started, bound, plugged in, …, depending on the unit type, see below), or "inactive" (meaning stopped, unbound, unplugged, …), as well as in the process of being activated or deactivated, i.e. between the two states (these states are called "activating", "deactivating").

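In other words: inactive -> activating -> active (for Type=simple, that is)

Service type simple

ExecStart puts the unit into activating

Once the process given in ExecStart has been started, the unit becomes active (systemd does not wait for the program to finish initializing)

Service type oneshot

ExecStart puts the unit into activating

When the command finishes, the unit becomes inactive (unless RemainAfterExit=yes is set)

Why does it stay in activating?

Because the service type is oneshot and the process (the batch) is still running

Something may have gotten stuck while the batch was running

For example, if the batch talks to a mail server, the server may never respond, or may respond very slowly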
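
One way to keep a silent server from blocking the batch forever is to put a timeout on the network call itself. A minimal sketch using only the standard `socket` module; `read_greeting` and its arguments are hypothetical, not taken from the actual batch.

```python
import socket


def read_greeting(host, port, timeout_sec=10.0):
    """Connect and read the server's first reply, giving up after timeout_sec."""
    with socket.create_connection((host, port), timeout=timeout_sec) as conn:
        # The timeout passed to create_connection() stays set on the socket,
        # so recv() raises socket.timeout instead of blocking indefinitely.
        return conn.recv(1024)
```

With this, a server that accepts the connection but never answers leads to a `socket.timeout` exception that the batch can log and exit on, instead of leaving the unit in activating.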

Can a timeout be configured?

systemd-devel Timeout for 'Activating (start)' status

>> 2) What is systemd's timeout by default for service activation

>> (timer-activated, socket-activated)? If it is documentened, please, give

>> me a hint.

> By default, Type=oneshot services don't have a start timeout.

>> 3) If systemd's timeout from 2) is present, how can it be managed by

>> user\admin? E.g. after 10 minutes of 'Activating (start)', service gets

>> FAILED state with putting this info to systemd log, of course (something

>> like "systemd[1] fails to start foobar.service[PID] by timeout.

>> ExitCode:<number>").

> If you do add a TimeoutStartSec= to a Type=oneshot unit, this will

> force the unit to be stopped (by killing the process) after that time.

> That's probably not what you want.

In general way -- why not? For more twisted cases I can use OnFailure=

directive here, I suppose. So, will try to play with TimeoutStartSec=

directive.

> For a Type=simple service, the default TimeoutStartSec= is set in

> /etc/systemd/system.conf. It will be 90 seconds unless you've changed it.

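Apparently a timeout can be set

Summary so far

Set TimeoutStartSec=240 (4 minutes) in the [Service] section of the service that the timer starts (not in the .timer unit), so that a stuck batch times out

With Type=oneshot, being in activating while the batch runs is normal

Going further

What if I want to know the batch's status in more detail?

Writing logs would help

If the batch sends notifications with the systemd-notify command, or through the equivalent socket, systemctl shows fine-grained status

What if the batch starts taking longer and no longer finishes within 4 minutes?

Instead of a timeout, could a watchdog be used to KILL the process when it stops responding?

Could the batch itself exit after running for about 4 minutes and pick up where it left off on the next run?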
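
The last idea could look roughly like this. A sketch only: the state file format, `run_with_budget`, and its parameters are all assumptions made for illustration.

```python
import json
import os
import time


def load_position(path):
    """Return the index of the next item to process (0 on the first run)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["next"]


def save_position(path, index):
    """Remember where the next run should continue."""
    with open(path, "w") as f:
        json.dump({"next": index}, f)


def run_with_budget(items, handle, state_path, budget_sec=240, clock=time.monotonic):
    """Process items until done or until budget_sec has elapsed.

    Returns True when every item has been processed.
    """
    deadline = clock() + budget_sec
    index = load_position(state_path)
    while index < len(items):
        if clock() >= deadline:
            return False  # out of time; the next run resumes from index
        handle(items[index])
        index += 1
        save_position(state_path, index)
    save_position(state_path, 0)  # all done; start from the beginning next time
    return True
```

The deadline is only checked between items, so a single item that hangs is not interrupted. This complements a timeout or watchdog rather than replacing it.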

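Watchdog

WatchdogSec=

Set the time in seconds. If no notification arrives within that time, the watchdog fires

With Restart=always, the service is restarted after the watchdog fires

The watchdog is armed by sending WATCHDOG=1 over notify. Once armed, the watchdog timer starts running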
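
When WatchdogSec= is set, systemd passes the timeout to the service in the WATCHDOG_USEC environment variable (microseconds), along with WATCHDOG_PID. The sd_watchdog_enabled documentation recommends pinging at half that interval. A small sketch of deriving the ping interval from those variables; the function name is made up.

```python
import os


def watchdog_ping_interval(environ=os.environ):
    """Return a suggested number of seconds between WATCHDOG=1 pings.

    Returns None when no watchdog is configured for this process.
    """
    usec = environ.get("WATCHDOG_USEC")
    if not usec:
        return None
    pid = environ.get("WATCHDOG_PID")
    if pid and int(pid) != os.getpid():
        return None  # the watchdog is meant for a different process
    # Ping at half the timeout so one delayed ping does not trip the watchdog.
    return int(usec) / 1000000.0 / 2
```

With WatchdogSec=7 this gives 3.5, so a batch step that may run longer than that needs to ping from inside the step or from a helper thread.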

Notify

Type=notify

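simple: the unit enters the active state once the process has been started

forking: same as simple, except that systemd tracks the process forking

oneshot: used for batch jobs and the like; the unit is activating while the batch runs and becomes inactive when it finishes

notify: systemd receives notifications from the running program (over a UNIX domain socket)

Sample code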

code:batch.service

# /etc/systemd/system/batch.service
[Service]
User=ubuntu
Group=ubuntu
ExecStart=/usr/bin/python3 -u /home/ubuntu/systemdtest/batch.py
ExecStop=/bin/kill -s TERM $MAINPID
# batch.py controls the systemd status
# (comments go on their own lines; unit files do not support trailing comments)
Type=notify
# KILL if batch.py sends no WATCHDOG=1 for 7 seconds
WatchdogSec=7
# If batch.py stops with an error, leave it stopped until the next timer run
# (use always to restart immediately)
Restart=no

[Unit]
Description=batch.py
After=multi-user.target

code:batch.timer

# /etc/systemd/system/batch.timer
[Unit]
Description=batch

[Timer]
OnBootSec=1min
OnUnitActiveSec=1min

[Install]
WantedBy=timers.target

code:batch.py

import os
import random
import signal
import socket
import sys
import time


def notify(msg):
    # https://www.freedesktop.org/software/systemd/man/sd_notify.html
    socket_path = os.environ.get('NOTIFY_SOCKET')
    if not socket_path:
        raise RuntimeError('No notify socket')
    if socket_path.startswith('@'):
        # a leading "@" denotes an abstract-namespace socket
        socket_path = '\0' + socket_path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.connect(socket_path)
        s.sendall(msg.encode())


def loop():
    for i in range(1, 11):
        # usually 1 second, occasionally 10 seconds (longer than WatchdogSec=7)
        sec = random.choice([1] * 97 + [10] * 7)
        print('Do #{} that takes {} seconds. doing...'.format(i, sec))
        time.sleep(sec)  # DO SOMETHING
        print('... done')
        notify('WATCHDOG=1')  # update watchdog timestamp
        notify('STATUS={} tasks are finished'.format(i))  # update systemd status


def handle_watchdog(signum, frame):
    notify('STATUS=KILLED by watchdog')  # update systemd status
    # do some finalize process
    sys.exit(-1)  # exit status 255; the unit ends up in the failed state


def initialize():
    notify('STATUS=Process started. Setting up...')  # update systemd status
    signal.signal(signal.SIGABRT, handle_watchdog)  # handle watchdog time limit
    time.sleep(3)
    notify('READY=1')  # activating -> active
    notify('STATUS=Ready.')  # update systemd status


def finalize():
    notify('STATUS=Process closing...')  # update systemd status
    notify('STOPPING=1')  # active -> deactivating
    time.sleep(3)  # do some finalize process
    notify('STATUS=Finished.')  # update systemd status
    sys.exit(0)  # inactive


def main():
    initialize()
    loop()
    finalize()


if __name__ == '__main__':
    main()
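
To try the notify messages without systemd, one option is to bind a throwaway UNIX datagram socket and point NOTIFY_SOCKET at it. A sketch under that assumption; `capture_notifications` is a made-up helper, and `notify` here is a trimmed copy of the one in batch.py.

```python
import os
import socket
import tempfile


def notify(msg):
    """Trimmed copy of notify() from batch.py: send one datagram to NOTIFY_SOCKET."""
    socket_path = os.environ["NOTIFY_SOCKET"]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as s:
        s.connect(socket_path)
        s.sendall(msg.encode())


def capture_notifications(send, count):
    """Run send() against a temporary socket and return count received messages."""
    previous = os.environ.get("NOTIFY_SOCKET")
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "notify.sock")
        with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as receiver:
            receiver.bind(path)
            receiver.settimeout(1.0)
            os.environ["NOTIFY_SOCKET"] = path  # stand in for systemd
            try:
                send()
                return [receiver.recv(4096).decode() for _ in range(count)]
            finally:
                # restore whatever NOTIFY_SOCKET was set to before
                if previous is None:
                    del os.environ["NOTIFY_SOCKET"]
                else:
                    os.environ["NOTIFY_SOCKET"] = previous
```

This only checks what the program sends. State transitions such as activating -> active still need a real systemd to observe.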

Log when the process exits with an error while in the activating state

code:killed-in-activating.log

Sep 01 15:25:48 ubuntu-xenial systemd[1]: Starting batch.py...
Sep 01 15:26:06 ubuntu-xenial systemd[1]: batch.service: Watchdog timeout (limit 7s)!
Sep 01 15:26:06 ubuntu-xenial systemd[1]: batch.service: Main process exited, code=exited, status=255/n/a
Sep 01 15:26:06 ubuntu-xenial systemd[1]: Failed to start batch.py.
Sep 01 15:26:06 ubuntu-xenial systemd[1]: batch.service: Unit entered failed state.
Sep 01 15:26:06 ubuntu-xenial systemd[1]: batch.service: Failed with result 'exit-code'.

The message reads "Failed to start batch.py."

Log when the process exits with an error while in the active state

code:killed-in-active.log

Sep 01 15:30:52 ubuntu-xenial systemd[1]: Starting batch.py...
Sep 01 15:30:55 ubuntu-xenial systemd[1]: Started batch.py.
Sep 01 15:31:02 ubuntu-xenial systemd[1]: batch.service: Watchdog timeout (limit 7s)!
Sep 01 15:31:02 ubuntu-xenial systemd[1]: batch.service: Main process exited, code=exited, status=255/n/a
Sep 01 15:31:02 ubuntu-xenial systemd[1]: batch.service: Unit entered failed state.
Sep 01 15:31:02 ubuntu-xenial systemd[1]: batch.service: Failed with result 'exit-code'.

"Started batch.py." has been added; this is the point where the unit enters the active state.

"Failed to start batch.py." no longer appears (because the error no longer happens during start-up).
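
The difference between the two logs can be checked mechanically. A rough sketch that relies only on the message texts seen in these logs; other systemd versions may word them differently, and `failure_phase` is a made-up name.

```python
def failure_phase(journal_lines):
    """Return 'start-up', 'running', or None based on the messages shown above."""
    lines = list(journal_lines)
    if not any("Failed with result" in line for line in lines):
        return None  # no failure recorded
    if any("Failed to start" in line for line in lines):
        return "start-up"  # died while still activating
    return "running"  # died after reaching active
```

For monitoring, "start-up" suggests the batch never got as far as READY=1, while "running" suggests it stalled in the main loop.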

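Summary so far

Setting Type=notify and controlling the systemd status from batch.py makes things easier to follow

WatchdogSec=7: KILL if no WATCHDOG=1 arrives from batch.py for 7 seconds (prevents a stuck batch)

Restart=no: if batch.py stops with an error, leave it stopped until the next timer run (specify always to restart immediately)

By having batch.py report its progress to systemd, I was able to prevent stalls and make the state easier to observe.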