wget hangs/freezes when downloading a file to NFS


I'm running some experiments with Amazon EFS (General Purpose) and EC2, and EFS seems to behave unstably: commands that touch the mounted file system hang or freeze.

For example, wget (downloading an 8 GB file) proceeds at 10 MB/s for a short while (perhaps 30 seconds), then freezes for several minutes (sometimes as long as 30 minutes). After every freeze, it retries the download.

This is the output of wget:

# wget http://mirror.math.princeton.edu/pub/ubuntu-iso/artful/ubuntu-17.10-desktop-amd64.iso
--2017-12-12 00:01:02--  http://mirror.math.princeton.edu/pub/ubuntu-iso/artful/ubuntu-17.10-desktop-amd64.iso
Resolving mirror.math.princeton.edu (mirror.math.princeton.edu)... 128.112.18.21
Connecting to mirror.math.princeton.edu (mirror.math.princeton.edu)|128.112.18.21|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1501102080 (1.4G) [application/octet-stream]
Saving to: ‘ubuntu-17.10-desktop-amd64.iso’

ubuntu-17.10-desktop-amd64.iso  14%[=====>                                            ] 200.57M    116KB/s    in 3m 9s

2017-12-12 00:04:26 (1.06 MB/s) - Connection closed at byte 210310075. Retrying.

--2017-12-12 00:04:27--  (try: 2)  http://mirror.math.princeton.edu/pub/ubuntu-iso/artful/ubuntu-17.10-desktop-amd64.iso
Connecting to mirror.math.princeton.edu (mirror.math.princeton.edu)|128.112.18.21|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1501102080 (1.4G), 1290792005 (1.2G) remaining [application/octet-stream]
Saving to: ‘ubuntu-17.10-desktop-amd64.iso’

ubuntu-17.10-desktop-amd64.iso                               24%[+++++++++++++++++++=============>                                                                                                        ] 348.76M   199KB/s    in 2m 56s

2017-12-12 00:07:34 (864 KB/s) - Connection closed at byte 365704683. Retrying.

--2017-12-12 00:07:36--  (try: 3)  http://mirror.math.princeton.edu/pub/ubuntu-iso/artful/ubuntu-17.10-desktop-amd64.iso
Connecting to mirror.math.princeton.edu (mirror.math.princeton.edu)|128.112.18.21|:80... connected.
HTTP request sent, awaiting response... 206 Partial Content
Length: 1501102080 (1.4G), 1135397397 (1.1G) remaining [application/octet-stream]
Saving to: ‘ubuntu-17.10-desktop-amd64.iso’

ubuntu-17.10-desktop-amd64.iso                               39%[+++++++++++++++++++++++++++++++++====================>                                                                                   ] 572.22M  18.0MB/s    eta 57s

The setup is the following:

Mount command:

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 fs-xxxxxxxxx.efs.eu-west-1.amazonaws.com:/ efs

This is on an Ubuntu 16.04 machine with Linux 4.4.0-1043-aws.

These are my kernel logs:

[  960.336118] INFO: task wget:1430 blocked for more than 120 seconds.
[  960.339660]       Not tainted 4.4.0-1043-aws #52-Ubuntu
[  960.342578] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  960.346890] wget            D ffff8803f9eaf9f8     0  1430   1419 0x00000000
[  960.346895]  ffff8803f9eaf9f8 0000000000000005 ffff8803fd2c1b80 ffff8803fb376040
[  960.346898]  ffff8803f9eb0000 ffff8803ff296c40 7fffffffffffffff ffffffff81813640
[  960.346900]  ffff8803f9eafb58 ffff8803f9eafa10 ffffffff81812e45 0000000000000000
[  960.346903] Call Trace:
[  960.346912]  [<ffffffff81813640>] ? bit_wait+0x60/0x60
[  960.346918]  [<ffffffff81812e45>] schedule+0x35/0x80
[  960.346922]  [<ffffffff81815f95>] schedule_timeout+0x1b5/0x270
[  960.346926]  [<ffffffff811ebb2b>] ? __slab_free+0xcb/0x2c0
[  960.346943]  [<ffffffffc03d4ef0>] ? nfs4_update_server+0x2f0/0x2f0 [nfsv4]
[  960.346949]  [<ffffffff81023765>] ? xen_clocksource_get_cycles+0x15/0x20
[  960.346951]  [<ffffffff81813640>] ? bit_wait+0x60/0x60
[  960.346956]  [<ffffffff81812374>] io_schedule_timeout+0xa4/0x110
[  960.346959]  [<ffffffff8181365b>] bit_wait_io+0x1b/0x70
[  960.346961]  [<ffffffff818131ed>] __wait_on_bit+0x5d/0x90
[  960.346963]  [<ffffffff81813640>] ? bit_wait+0x60/0x60
[  960.346966]  [<ffffffff818132a2>] out_of_line_wait_on_bit+0x82/0xb0
[  960.346969]  [<ffffffff810c3130>] ? autoremove_wake_function+0x40/0x40
[  960.346979]  [<ffffffffc0367cd7>] nfs_wait_on_request+0x37/0x40 [nfs]
[  960.346987]  [<ffffffffc036cb13>] nfs_writepage_setup+0x103/0x600 [nfs]
[  960.346993]  [<ffffffffc036d0ea>] nfs_updatepage+0xda/0x370 [nfs]
[  960.346999]  [<ffffffffc035cdbd>] nfs_write_end+0x13d/0x4b0 [nfs]
[  960.347003]  [<ffffffff8140ae0d>] ? iov_iter_copy_from_user_atomic+0x8d/0x220
[  960.347005]  [<ffffffff8118c434>] generic_perform_write+0x114/0x1c0
[  960.347008]  [<ffffffff81812796>] ? __schedule+0x3b6/0xa30
[  960.347010]  [<ffffffff81812796>] ? __schedule+0x3b6/0xa30
[  960.347012]  [<ffffffff8118e122>] __generic_file_write_iter+0x1a2/0x1e0
[  960.347014]  [<ffffffff81813048>] ? preempt_schedule_common+0x18/0x30
[  960.347016]  [<ffffffff8118e245>] generic_file_write_iter+0xe5/0x1e0
[  960.347022]  [<ffffffffc035c48a>] nfs_file_write+0x9a/0x170 [nfs]
[  960.347024]  [<ffffffff8120c00b>] new_sync_write+0x9b/0xe0
[  960.347026]  [<ffffffff8120c076>] __vfs_write+0x26/0x40
[  960.347028]  [<ffffffff8120c9f9>] vfs_write+0xa9/0x1a0
[  960.347030]  [<ffffffff8120d6b5>] SyS_write+0x55/0xc0
[  960.347032]  [<ffffffff810ee701>] ? posix_ktime_get_ts+0x11/0x20
[  960.347034]  [<ffffffff81816f72>] entry_SYSCALL_64_fastpath+0x16/0x71
Tags: ubuntu, amazon-ec2, nfs, wget, amazon-efs
asked on Server Fault Dec 12, 2017 by mitchkman

1 Answer


I'm going to speculate that your EFS filesystem is very small.

If so, you are overrunning the allowed throughput so aggressively that you exhaust your burst credits, and I/O stalls long enough to trigger timeouts.
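One quick way to check is the BurstCreditBalance CloudWatch metric for the filesystem; if it hits zero right when wget stalls, throttling is confirmed. A sketch (the filesystem ID, region, and time window are placeholders):

aws cloudwatch get-metric-statistics \
  --region eu-west-1 \
  --namespace AWS/EFS \
  --metric-name BurstCreditBalance \
  --dimensions Name=FileSystemId,Value=fs-xxxxxxxxx \
  --start-time 2017-12-11T23:00:00Z \
  --end-time 2017-12-12T01:00:00Z \
  --period 300 \
  --statistics Average

The metric is reported in bytes; a balance flatlining at zero during the transfer means every write is being throttled down to the baseline rate.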

Size and performance are tightly coupled in EFS: larger filesystems accrue burst credits faster and can therefore sustain more I/O.

The only way to make a small EFS filesystem faster is to create large files on it, temporarily increasing the stored size and with it the allowed throughput. Later, when your real data is larger, you can delete those files.
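As a rough illustration (the mount point and file name here are placeholders), a single 100 GiB file of zeros adds 5 MiB/s of baseline throughput:

# one 100 GiB padding file on the EFS mount
sudo dd if=/dev/zero of=/efs/pad-0 bs=1M count=102400

Bear in mind that writing the padding is itself subject to the same throttling, so populating it can take a while.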

Over the course of 24 hours, an EFS filesystem can sustain 5 MiB/s per 100 GiB stored (50 MiB/s per TiB), and any filesystem can burst to 100 MiB/s; once the stored size exceeds 1 TiB, the burst rate climbs above that 100 MiB/s floor as well, scaling with size.
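A quick back-of-envelope, using the figures above, shows how starved a small filesystem is:

# baseline MiB/s = stored GiB * 5 / 100
for gib in 2 100 1024; do
  awk -v g="$gib" 'BEGIN { printf "%5d GiB stored -> baseline %6.2f MiB/s\n", g, g * 5 / 100 }'
done

A filesystem holding only a couple of GiB earns roughly 0.1 MiB/s of baseline; once its burst credits run out, that is about all wget gets, which matches the ~116 KB/s stalls in your log.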

See Amazon EFS Performance.

If you only occasionally need a large workspace for big files, you might find st1 or sc1 EBS volumes a better solution, but of course this is strongly dependent on your use case.

answered on Server Fault Dec 12, 2017 by Michael - sqlbot

User contributions licensed under CC BY-SA 3.0