ER605 Needs weekly Reboot Issue - does it have a memory leak?
So I have had my fleet of ER605s on a weekly reboot diet for over a year now, because if I don't do that, they eventually lose contact with the Controller (hardware or software) permanently. I recently started experimenting with LibreNMS and found a few things, the most interesting being a continuous ramp in memory usage over time. I only have about 4 days of data since I set this system up (running in a Docker container on my Synology), but here's what I see. I am curious whether this is normal, or whether the platform has a memory leak.
Left graph is last 24h, Right graph is last week.
You can also see the resetting impact of my 3AM reboot
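A rough way to put a number on a ramp like this is to fit a straight line to the polled memory readings and project when it would cross a chosen threshold. The sketch below is only an illustration: the sample readings are made up (they are not taken from the graphs), and `fit_line` and `days_until` are hypothetical helper names. A steady rise alone does not prove a leak, since Linux also uses free memory for caches.

```python
def fit_line(days, used_pct):
    """Least-squares fit of used_pct = slope * day + intercept."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(used_pct) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, used_pct))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until(threshold_pct, slope, intercept):
    """Day (counted from day 0) at which the fitted line reaches the threshold."""
    if slope <= 0:
        return None  # usage is flat or falling, so no crossing is projected
    return (threshold_pct - intercept) / slope


# Hypothetical readings: days since reboot and percent memory used.
sample_days = [0, 1, 2, 3]
sample_used = [40, 42, 44, 46]

slope, intercept = fit_line(sample_days, sample_used)
print(slope, days_until(90, slope, intercept))
```

With real data you would feed in the values LibreNMS has polled instead of the sample lists.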
@d0ugmac1 I don't see this, or at least not to the same degree, probably because my number of clients and workload are different. My ER605v1 is still sitting at 39% memory after 40 days of uptime.
I use standalone mode, which has an option to "export diagnostic information" that includes the output of "top", so you can see per-process memory usage. Does the controller include a similar option, so that you could export the diagnostic info immediately after a reboot and then again a few days later to see the difference?
Here's a snippet from my diagnostic export today:
PID USER VSZ STAT COMMAND
8098 root 11772 S nginx: master process /usr/sbin/nginx
8099 root 13636 S nginx: worker process
8999 root 16292 S /usr/sbin/omadad
The largest memory consumers are usually the nginx processes (nginx) and the omada daemon process (omadad) so I suspect that's where the difference would be observed.
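To make that before/after comparison easier, the VSZ column of two exports could be diffed per process. This is a minimal Python sketch, assuming the five-column layout shown in the snippet above (PID USER VSZ STAT COMMAND) and plain numeric VSZ values; `parse_top` and `vsz_growth` are illustrative names, not part of any TP-Link tooling.

```python
SAMPLE = """PID USER VSZ STAT COMMAND
8098 root 11772 S nginx: master process /usr/sbin/nginx
8099 root 13636 S nginx: worker process
8999 root 16292 S /usr/sbin/omadad
"""


def parse_top(text):
    """Map each command to its total VSZ as reported in a top snapshot."""
    usage = {}
    for line in text.strip().splitlines():
        parts = line.split(None, 4)  # PID USER VSZ STAT COMMAND
        if len(parts) < 5 or not parts[0].isdigit() or not parts[2].isdigit():
            continue  # skip the header and any line that does not fit the layout
        command = parts[4]
        usage[command] = usage.get(command, 0) + int(parts[2])
    return usage


def vsz_growth(before, after):
    """Return (command, delta) pairs, largest growth first."""
    commands = set(before) | set(after)
    deltas = [(c, after.get(c, 0) - before.get(c, 0)) for c in commands]
    return sorted(deltas, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    baseline = parse_top(SAMPLE)
    # A second export taken a few days later would replace SAMPLE here.
    later = parse_top(SAMPLE)
    for command, delta in vsz_growth(baseline, later):
        print(f"{delta:+d}  {command}")
```

Processes whose delta keeps growing between exports would be the first candidates to look at.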
Reported Physical and Virtual Memory continue to rise:
and the buffers are behaving weirdly
I believe I am now in 'end stage' disconnect-itis (I am at 21 days 11h since last reboot)
The Controller seems to have lost the ability to determine/report latency as of 3AM today (exactly 21 days since the last reboot)
The Controller is now also starting to report disconnects:
I am now also seeing regular drops of the L2TP/IPsec tunnel to the remote site (logs too sensitive to post)
yet Controller logs don't look alarming at all...
Router still shows as Connected, internet access still working....nothing alarming looking
More updates very soon I'm sure.
The router has started to disconnect: ~8-10 events in the last 24h, the longest being about 1h around midnight last night. It has come back online after each outage so far. While disconnected, it still responds to pings, forwards traffic and collects its own stats, but the Controller connection, HTTP and SSH are all offline.
I pulled an SNMP walk and found some messages about low swap and high loads. That is corroborated by some of the SNMP polled data shown below:
Some of the spikes may be related to the L2TP/IPsec tunnel being re-established each time the router reconnects after a disconnect. The duration of the initial disconnections is hard to determine, but they were typically < 5min.
Looks like something snapped after the last 1hr disconnect...router is showing as Disconnected, and I don't expect it to come back
...though SNMP is still active.
Conclusion: OpenVPN logging resulting in memory starvation. I spent some quality time online with the TP-Link support team last night, and it turned out that I had amassed ~30MB of OpenVPN logs in just 3 weeks (a problem on a device with only 128MB of RAM), despite not a single VPN connection during that time. Ironic, if you think how many people would have liked to see what was in those logs, but regardless, it looks to be the culprit. Interestingly, it was only HTTP and the Controller interface that went offline; SNMP and SSH were working fine. I have since deleted my OpenVPN settings from the router and rebooted. TP-Link should be issuing a beta firmware version with the verbosity dialed back and hopefully some improvement to log rotation as well. I'll monitor for another few weeks just to be sure, but I think we can close the book on this 'colourful' episode.
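For a sense of scale, the figures above work out as follows. This is back-of-the-envelope arithmetic only, and it assumes the logs sit in RAM-backed storage (which is what the starvation diagnosis implies) and that "3 weeks" means 21 days.

```python
LOG_MB = 30   # approximate OpenVPN log size reported above
DAYS = 21     # "just 3 weeks", taken as 21 days (assumption)
RAM_MB = 128  # RAM on the ER605, per the post

rate_mb_per_day = LOG_MB / DAYS
share_of_ram = LOG_MB / RAM_MB

print(f"{rate_mb_per_day:.2f} MB/day, {share_of_ram:.0%} of RAM")
# prints: 1.43 MB/day, 23% of RAM
```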
Flagging that the memory behaviour is getting worrisome again. I will continue to monitor and will open a ticket if it hits starvation levels.
Memory buffers are already basically back to nothing again, which was the precursor to the last Disconnect event.
(Note: I installed the latest 1.2.3 beta software for the ER605v1 some weeks ago.)
Here's where we are at today for overall memory levels...which creep up inexorably:
More evidence that my ER605 is leaning towards disk/swap/cache starvation and eventual disconnect...I/O waits have really started to tick upwards now...
Hello @d0ugmac1
Are you using the 1.2.3 Build 20230413 Beta firmware?
ER605 V1_1.2.3_Build 20230413 Beta Firmware For Trial (Released on Apr 14th, 2023)