Both the host OS and the guest OS are Debian 8 amd64. The guest is a raw disk image stored at /srv/machines/web.img
. The host has a backup script scheduled each night at 1:55am. Here is my backup script:
#!/bin/bash
BACKDIR='/srv/backups'
BACKFILENAME='web.img'
# Let's rotate the backups and delete the oldest one
for i in `seq 8 -1 0` ; do
if [ -f "$BACKDIR"/"$BACKFILENAME"-$i ] ; then
mv "$BACKDIR"/"$BACKFILENAME"-$i "$BACKDIR"/"$BACKFILENAME"-$(($i + 1))
fi
done
rm -f "$BACKDIR"/"$BACKFILENAME"-9
# Hold guest writes in a temp COW file
virsh snapshot-create-as --domain web bsnap --diskspec vda,file=/srv/machines/bsnap.qcow2 --disk-only --atomic --quiesce
# Make backup of the guest RAW image
cp -a /srv/machines/"$BACKFILENAME" "$BACKDIR"/"$BACKFILENAME"-0
# commit guest writes to the usual RAW file
virsh blockcommit web vda --active --verbose --pivot
Zabbix CPU I/O wait graph shows the backup takes about 1 hour. Every night during this time period, but at different times, I get kernel oops (not sure they are actually oops) int the guest vm, like this one:
Jul 4 02:13:59 web kernel: [83400.216115] INFO: task jbd2/dm-0-8:200 blocked for more than 120 seconds.
Jul 4 02:13:59 web kernel: [83400.216485] Not tainted 3.16.0-4-amd64 #1
Jul 4 02:13:59 web kernel: [83400.216817] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 4 02:13:59 web kernel: [83400.217427] jbd2/dm-0-8 D ffff880efef1dac8 0 200 2 0x00000000
Jul 4 02:13:59 web kernel: [83400.217433] ffff880efef1d670 0000000000000046 0000000000012f00 ffff880efef87fd8
Jul 4 02:13:59 web kernel: [83400.217437] 0000000000012f00 ffff880efef1d670 ffff880f3fcb37b0 ffff880f3ff7a5e8
Jul 4 02:13:59 web kernel: [83400.217441] 0000000000000002 ffffffff811d7620 ffff880efef87c80 ffff8800bbba8b98
Jul 4 02:13:59 web kernel: [83400.217444] Call Trace:
Jul 4 02:13:59 web kernel: [83400.217454] [<ffffffff811d7620>] ? generic_block_bmap+0x50/0x50
Jul 4 02:13:59 web kernel: [83400.217460] [<ffffffff815114a9>] ? io_schedule+0x99/0x120
Jul 4 02:13:59 web kernel: [83400.217464] [<ffffffff811d762a>] ? sleep_on_buffer+0xa/0x10
Jul 4 02:13:59 web kernel: [83400.217468] [<ffffffff8151182c>] ? __wait_on_bit+0x5c/0x90
Jul 4 02:13:59 web kernel: [83400.217471] [<ffffffff811d7620>] ? generic_block_bmap+0x50/0x50
Jul 4 02:13:59 web kernel: [83400.217474] [<ffffffff815118d7>] ? out_of_line_wait_on_bit+0x77/0x90
Jul 4 02:13:59 web kernel: [83400.217480] [<ffffffff810a7e90>] ? autoremove_wake_function+0x30/0x30
Jul 4 02:13:59 web kernel: [83400.217498] [<ffffffffa00c542e>] ? jbd2_journal_commit_transaction+0x175e/0x1950 [jbd2]
Jul 4 02:13:59 web kernel: [83400.217507] [<ffffffff810a2f21>] ? pick_next_task_fair+0x6e1/0x820
Jul 4 02:13:59 web kernel: [83400.217514] [<ffffffffa00c8be2>] ? kjournald2+0xb2/0x240 [jbd2]
Jul 4 02:13:59 web kernel: [83400.217518] [<ffffffff810a7e60>] ? prepare_to_wait_event+0xf0/0xf0
Jul 4 02:13:59 web kernel: [83400.217524] [<ffffffffa00c8b30>] ? commit_timeout+0x10/0x10 [jbd2]
Jul 4 02:13:59 web kernel: [83400.217530] [<ffffffff8108809d>] ? kthread+0xbd/0xe0
Jul 4 02:13:59 web kernel: [83400.217534] [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
Jul 4 02:13:59 web kernel: [83400.217539] [<ffffffff81514958>] ? ret_from_fork+0x58/0x90
Jul 4 02:13:59 web kernel: [83400.217543] [<ffffffff81087fe0>] ? kthread_create_on_node+0x180/0x180
As a consequence, some random process gets killed (last night was the database engine...). I'm not skilled enough to decode what actually happened to the guest kernel here, but that 120 seconds
with jbd2
in the same line makes me suspect some sort of filesystem timeout is the culprit.
Now, assuming my suspect goes in the right direction, I suppose that the host backup script interferes badly with the I/O performance in the guest.
Please note that the host machine has 256GB of RAM memory, the guest image is 180GB in size and nearly the whole guest image is kept into RAM (vmtouch shows figures between 90 and 98% of it cached). I can't tell about the temporary COW file, but, even if it were kept on disk, seems to me 120 seconds are way too much to be justified by software RAID1 6GB/s Enterprise level SATA disks. Moreover, the host OS logs do not show anything that could make me suspect of disk I/O performance problems. Server load, at that time, is what the backup script does only and nothing else.
Is there anything in my backup script that can cause such problems?
User contributions licensed under CC BY-SA 3.0