Warning (possibly) about Watchdog under Linux

LEGOManiac
Posts: 15
Joined: Sun Mar 28, 2010 4:21 am


Post by LEGOManiac

Let me start with a caveat: I'm not really in a position to confirm my theory definitively. It's based on my own observations and past experience with micro-controllers.

I've been using an eBox (now called Zentyal) server for over a year now, running on a FitPC2i.

It has mostly worked, but occasionally needs a reboot because the WAN port seems to hang from time to time.

Recently, I applied a round of system patches via the Zentyal interface. I didn't pay too much attention after that, but my son (it's a home server) came to me a few days ago complaining that he was losing internet access.

When I had time, I looked into it and discovered that the FitPC was rebooting itself spontaneously. At first the reboot seemed to happen at about the point where the DHCP server starts, but I found that if I put some load on the server, it would run for several minutes before rebooting.

I couldn't see anything in the logs, but I still blamed Zentyal. After trying various things to repair the server, I finally tried it on my desktop machine, and there it booted and ran OK (i.e., it didn't restart).

I tried ClearOS on the FitPC and it did exactly the same thing. A third server distribution, whose name I've forgotten, also behaved this way.

This morning, I had a sudden flashback to my days in the electronics industry 25 years ago: micro-controllers have a watchdog timer that reboots the system if it hangs. To prevent that reboot, the software has to reset the watchdog timer before it counts down to zero, so the reset code has to live in a routine that runs regularly.
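To make the mechanism concrete, here is a toy Python model. It is not the FitPC's actual firmware or any real driver; the tick units and the numbers are made up for illustration. The watchdog is just a counter that forces a reset when it reaches zero, unless the software "kicks" it (restarts the countdown) first.

```python
from dataclasses import dataclass


@dataclass
class WatchdogTimer:
    """Toy watchdog: counts down every tick and forces a reset at zero."""
    timeout: int        # ticks allowed between kicks (hypothetical units)
    remaining: int = 0
    resets: int = 0

    def __post_init__(self) -> None:
        self.remaining = self.timeout

    def kick(self) -> None:
        # The software "feeds" the watchdog, restarting the countdown.
        self.remaining = self.timeout

    def tick(self) -> None:
        # One tick of the hardware clock.
        self.remaining -= 1
        if self.remaining <= 0:
            # Countdown expired: the watchdog reboots the system.
            self.resets += 1
            self.remaining = self.timeout


def simulate(timeout: int, kick_every: int, ticks: int) -> int:
    """Return how many resets happen when software kicks every `kick_every` ticks."""
    wd = WatchdogTimer(timeout)
    for t in range(1, ticks + 1):
        wd.tick()
        if t % kick_every == 0:
            wd.kick()
    return wd.resets


print(simulate(timeout=10, kick_every=5, ticks=100))   # kicked often enough
print(simulate(timeout=10, kick_every=20, ticks=100))  # kicked too rarely
```

With a 10-tick timeout, kicking every 5 ticks never triggers a reset, while kicking every 20 ticks lets the counter expire and resets the system every 10 ticks. The second case is the kind of failure I suspect: the keepalive still runs, just not often enough.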

I checked it out, and sure enough, the FitPC has a watchdog timer and it was enabled. I disabled it, and the system has been stable ever since.

I'm guessing that a Linux update either broke the watchdog routine or changed or relocated it so that it no longer runs as often as it should. The latter seems more likely, since putting load on the server lengthened the time it would run without rebooting.

Again, I can't confirm that this is definitively what is happening, but I thought I would point it out before someone else spends several nights banging their head against the wall trying to fix the wrong problem.

I'm also not sure whether the three servers use the same core Linux distribution or the same watchdog code.

Once I get ClearOS set up the way I want it, I'll switch the watchdog timer back on and experiment with the timeout duration to see whether I can make it stable again.
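For that experiment, here is a minimal Python sketch, assuming the board's watchdog is exposed through the standard Linux watchdog driver interface as /dev/watchdog (I haven't checked which driver, if any, covers the FitPC2's timer). Under that interface, opening the device arms the watchdog, any write to it counts as a keepalive, and the WDIOC_SETTIMEOUT ioctl requests a new timeout. The ioctl number is computed with the x86 encoding, which should match the FitPC2's Atom CPU but differs on some other architectures. The `iterations` parameter is hypothetical and exists only so the loop can be stopped for testing.

```python
import fcntl
import os
import struct
import time
from typing import Optional

# WDIOC_SETTIMEOUT is _IOWR('W', 6, int) in <linux/watchdog.h>. This encoding
# (direction << 30 | size << 16 | type << 8 | nr) matches x86/x86-64; other
# architectures may encode ioctl numbers differently.
WDIOC_SETTIMEOUT = (3 << 30) | (4 << 16) | (ord("W") << 8) | 6


def set_timeout(fd: int, seconds: int) -> int:
    """Request a new watchdog timeout; return the value the driver actually applied."""
    result = fcntl.ioctl(fd, WDIOC_SETTIMEOUT, struct.pack("i", seconds))
    return struct.unpack("i", result)[0]


def feed_watchdog(path: str = "/dev/watchdog",
                  timeout: Optional[int] = None,
                  interval: float = 10.0,
                  iterations: Optional[int] = None) -> None:
    """Keep the watchdog alive by writing to it every `interval` seconds.

    `interval` must be comfortably shorter than the timeout, or the board resets.
    `iterations=None` loops forever.
    """
    fd = os.open(path, os.O_WRONLY)  # opening the device arms the watchdog
    try:
        if timeout is not None:
            print(f"watchdog timeout set to {set_timeout(fd, timeout)} s")
        count = 0
        while iterations is None or count < iterations:
            os.write(fd, b"\n")  # any write counts as a keepalive ("kick")
            count += 1
            time.sleep(interval)
    finally:
        # "Magic close": drivers that support it disarm the watchdog on close only
        # if 'V' was written just before; otherwise the timer keeps running and
        # the board resets once it expires.
        os.write(fd, b"V")
        os.close(fd)
```

For example, `feed_watchdog(timeout=60, interval=10)` would ask for a 60-second timeout and kick every 10 seconds, leaving plenty of margin. On a real system this job is normally handled by a dedicated watchdog daemon rather than a hand-written script, so treat this as a way to understand the timing, not a replacement.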


Return to “Linux on fit-PC2”