dhclient lease renewal occasionally breaks DNS resolution

1

I have a set of EC2 instances (Ubuntu Trusty 14.04) on which I have never done any special DHCP configuration. They run in a VPC with the default DHCP options.

For some reason, roughly every 25 minutes, I see this in my logs:

(IPs and xid are scrubbed)

DHCPREQUEST of 172.16.1.111 on eth0 to 172.16.0.1 port 67 (xid=0x0000000c)
DHCPACK of 172.16.1.111 from 172.16.0.1
bound to 172.16.1.111 -- renewal in 1693 seconds.

(The exact number of seconds changes between 1300 and 1700.)

Occasionally, about once every 10 days, this renewal breaks DNS, and my running application starts giving errors like getaddrinfo: Name or service not known. Once the renewal runs again about 25 minutes later, the problem is resolved. I have tested this by waiting for a failure and manually renewing the dhclient lease (sudo dhclient -v -r eth0, then sudo dhclient -v eth0), which fixes the issue instantly.

I have 2 questions:

  1. Why is the renewal time this strange ~25 minute number? I know that I can set this through a conf file, but this seems like it's a strange default.

  2. Why does it sometimes break DNS resolution? This is the main issue here. My other sets of EC2 instances also have this short DHCP renewal time, but only this one set of instances has the issue where, occasionally, DNS breaks when the DHCP lease is renewed.

dns
dhcp
amazon-web-services
amazon-ec2
asked on Super User Feb 22, 2019 by swagrov • edited Feb 27, 2019 by swagrov

1 Answer

2

My guess is you're receiving a DHCP renewal with a bad DNS server IP. Have you checked the contents of /etc/resolv.conf during the outage and compared them to the contents when things are working?
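If catching the outage by hand is awkward, the comparison can be left running in the background. The following is only a rough sketch: the function names, the 10-second polling interval, and the idea of printing to stdout are my own choices, and it assumes the nameserver list lives in /etc/resolv.conf in the usual "nameserver ADDRESS" form.

```python
import time
from datetime import datetime, timezone


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines (which start with '#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def watch(path="/etc/resolv.conf", interval_seconds=10):
    """Print a timestamped line whenever the nameserver list changes."""
    previous = None
    while True:
        with open(path) as handle:
            current = parse_nameservers(handle.read())
        if current != previous:
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} nameservers: {current}", flush=True)
            previous = current
        time.sleep(interval_seconds)


if __name__ == "__main__":
    watch()
```

Redirecting the output to a file gives a timeline you can set against the DHCPACK entries in syslog. If the nameserver list never changes around an outage, the bad-DNS-server guess is probably wrong.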

But it's better not to guess at all when you can gather a little more data to see exactly what's happening. Try capturing your DHCP traffic with:

tcpdump -c 10000 -w /var/tmp/dhcpdump.tcp -i INTERFACE port bootpc or port bootps

Where "INTERFACE" is eth0 or whatever your primary interface is named. This will capture DHCP traffic on the server, exiting automatically after 10,000 packets so it won't fill up your disk if you forget about the running task. After you've experienced the problem again, review the capture file with "tcpdump -v -r FILE" or Wireshark. That should show you what's different about the DHCP renewals that cause the problem.

If you can see a definite pattern in the DHCP renewals that cause the problem, contact Amazon support and send them the capture file or the decoded output showing a good and a bad renewal.
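To line up the capture with the moments resolution actually fails, it helps to have precise failure timestamps rather than whatever the application happens to log. Here is a minimal probe, offered as a sketch only: the function names, the 30-second interval, and the example.com hostname are placeholders I picked, not anything from your setup.

```python
import socket
import time
from datetime import datetime, timezone


def can_resolve(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves, False on a resolver error."""
    try:
        resolver(hostname, None)
        return True
    except socket.gaierror:
        return False


def probe(hostname, interval_seconds=30):
    """Log a UTC timestamp every time resolution of hostname fails."""
    while True:
        if not can_resolve(hostname):
            stamp = datetime.now(timezone.utc).isoformat()
            print(f"{stamp} resolution of {hostname} failed", flush=True)
        time.sleep(interval_seconds)


if __name__ == "__main__":
    # Placeholder hostname; use a name your application really looks up.
    probe("example.com")
```

A long-lived process may keep using the resolver configuration it read at startup, so this probe shows what a running application sees, which is not necessarily what a freshly started process would see.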

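Regarding the lease time, there's nothing unusual about it. Note that the number in your log is the renewal timer, not the lease length: DHCP clients normally try to renew about halfway through the lease, so a renewal roughly every 25 minutes suggests a lease of about an hour. The folks managing the DHCP service decided they wanted short leases. Perhaps other customers are creating and destroying instances every 15 minutes, so they want to recover an IP for another customer once it is no longer in use.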
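The spread between 1300 and 1700 seconds also fits how dhclient picks its renewal time. The sketch below reflects my reading of the ISC dhclient source, so treat it as an approximation: when the server sends no explicit renewal time, the client takes half the lease plus one second and then randomizes the result to somewhere between three quarters and all of that value. The 3600-second lease is an assumption on my part; the real value is in the DHCPACK in your capture.

```python
def renewal_bounds(lease_seconds):
    """Return the (shortest, longest) renewal delay for a given lease length.

    Mirrors the integer arithmetic dhclient appears to use when the server
    supplies no renewal time; the jitter term is random() % t1.
    """
    t1 = lease_seconds // 2 + 1          # default renewal time: half the lease
    base = (t1 * 3 + 3) // 4             # fixed part: about 3/4 of t1
    shortest = base + (0 + 3) // 4       # jitter term at its minimum (0)
    longest = base + ((t1 - 1) + 3) // 4 # jitter term at its maximum (t1 - 1)
    return shortest, longest


if __name__ == "__main__":
    # Assumed lease length of one hour.
    print(renewal_bounds(3600))  # prints (1351, 1801)
```

The 1693 seconds in your log falls inside that window, so the "strange" number is just half of an hour-long lease with some client-side jitter applied.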

answered on Super User Mar 6, 2019 by Velo Traveler

User contributions licensed under CC BY-SA 3.0