TL-SG108E arp flood problem

2016-12-16 07:28:15
Model : TL-SG108E

Hardware Version : TL-SG108E 2.0 (also had the problem on hardware version 1.0, the switch was swapped at the store since the issue could not be found)

Firmware Version : 1.0.1 Build 20160108 Rel. 57851 (also tested multiple firmware releases on the hardware 1.0)

ISP :

It seems that when the switch receives too many ARP requests, it stops forwarding them.

To explain in detail, here is my setup at home:
ISP Modem --- (eth0) Server (eth1) --- TP-Link --- Wireless bridge / Digibox
The server, which acts as a NAT router, NAS and media server, has 2 ethernet interfaces: eth0 is connected to my cable (coax) Internet provider's modem.
The 2nd interface of the server (eth1) is connected to port 1 on the TL-SG108E.
The default VLAN (ID 1) carries the LAN, which is a private subnet behind NAT on the server.
On port 2, a wireless bridge is connected, which shares this VLAN with wireless devices.
Port 1 also carries a tagged VLAN (ID 2), which bridges the WAN from the ISP modem (as seen on eth0 at the server) to port 8 (untagged), where the Digibox is connected.
This Digibox is provided by the Internet provider and requires an ethernet connection for on-demand streaming.

The modem provides 2 networks:
- All Digiboxes are registered with the provider, so if the MAC address of a Digibox is seen, it receives an IP in the 10.0.0.0/8 range
- Any other device gets a public internet IP (real public IP, no NAT)

For a few months I noticed that my local devices kept losing their network connection. A tcpdump quickly showed that a lot of ARP requests were being sent out to re-establish communication, but I couldn't find the root cause. As an interim solution I replaced the TL-SG108E a few times with another switch, because the frequent disconnects, which last about 30 seconds, were really annoying.
Because I was looking at the problem from the LAN side, I didn't find the root cause. During some troubleshooting today, however, I noticed a lot of ARP traffic coming from my modem, which is bridged to the switch on VLAN 2.
Almost all of this ARP traffic consists of broadcasts from the ISP headend to map IPs to MAC addresses. All of them have the same source MAC address and are sent as broadcasts. I see about 30 ARP requests per second passing through the interface.
When I added an arptables rule on the server to filter out all ARP traffic from VLAN 2 that is not destined for my Digibox, all the problems on the switch went away.
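
For anyone who wants to check the ARP request rate on their own link, here is a minimal sketch of how it could be measured. The function name arp_rate and the interface eth0 are assumptions for illustration; it expects tcpdump's default text output, where the first field is a timestamp of the form HH:MM:SS.ffffff.

```shell
# Count ARP requests per second from tcpdump text output read on stdin.
# Assumes the default timestamp format HH:MM:SS.ffffff in the first field.
arp_rate() {
    # Keep only the HH:MM:SS part of each request line, then count
    # how many consecutive lines share the same second.
    awk '/who-has/ { print substr($1, 1, 8); fflush() }' | uniq -c
}

# Live use (interface name is an assumption; tcpdump usually needs root):
#   tcpdump -l -n -i eth0 arp | arp_rate
```

Each output line is a count followed by the second it belongs to, so a steady stream of around 30 per line would match the rate described above.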

This seems to suggest there's some rate limiting of ARP requests going on in the switch, although I don't see any settings for that (and in the past I've also reproduced the issue with the switch reset to factory defaults).
I'm guessing that my problem started after the ISP decided to put more subnets on the same layer 2 segment, causing an increase in the amount of arp requests.

As the arp filtering is the only thing I've changed to fix the issue (the wan is still on tagged vlan 2 and except for the arp all traffic still gets through) I'm rather sure that's the root problem with the switch. I was wondering if someone could provide more information on this.
Re:TL-SG108E arp flood problem
2016-12-18 01:13:23
I've had the same problem since I installed the TL-SG108E.
I have a TL-SG105E with a NAS on port 1 (VLAN1), a cable modem on port 4 (VLAN10), and a Dot1Q link to the TL-SG108E (VLAN1, VLAN10 tagged).
On the TL-SG108E I have my router on port 1 (VLAN10) and port 2 (VLAN1), a Digibox (VLAN10) and a Sonos soundbar (VLAN1).

On a regular basis, my music stream stops or browsing on a PC freezes. When I start Wireshark, I see a lot of ARP requests but no replies.
So it seems the switch is blocking the ARP requests/replies.

Does anyone know how to fix this?
Re:TL-SG108E arp flood problem
2016-12-19 09:19:40
I even managed to reproduce the issue:

On the server I launch the following script:
arpflood.sh:
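[CODE]#!/bin/sh
# Send one ARP request every 0.1 s, each for a different target IP.
# First pass: 10.99.0.1 to 10.99.0.99 (keeps the counter 1-based).
for d in $(seq 1 99); do
    arping -c 1 -w 0.001 -I eth0 "10.99.0.$d" >/dev/null 2>&1 &
    sleep 0.1
done

# Second pass: request number c*100 + d goes to 10.99.c.d.
for c in $(seq 1 99); do
    for d in $(seq 0 99); do
        arping -c 1 -w 0.001 -I eth0 "10.99.$c.$d" >/dev/null 2>&1 &
        sleep 0.1
    done
done[/CODE]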

The 10.99.0.0/16 network could be any network, I just used that one to test. Also replace eth0 with the appropriate interface. The reason for the strange numbering is that it gives a nice output where 10.99 is the prefix, followed by a decimal number (10.99.c.d --> c*100 + d). The script sends out about 10 ARP requests per second.
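
The numbering scheme can be checked with a small sketch like the one below. The helper name ip_for_request is made up for illustration; it only mirrors the 10.99.c.d --> c*100 + d mapping and the loop bounds of arpflood.sh, and does not send any traffic.

```shell
# Map a 1-based request number to the address arpflood.sh targets,
# following the 10.99.c.d --> c*100 + d scheme.
ip_for_request() {
    echo "10.99.$(( $1 / 100 )).$(( $1 % 100 ))"
}

ip_for_request 1       # first request of the first loop
ip_for_request 1006    # a request from the second loop

# Total requests across both loops, and run time at 0.1 s per request
# (integer arithmetic, so the seconds value is rounded down).
total=$(( 99 + 99 * 100 ))
echo "$total requests, roughly $(( total / 10 )) seconds"
```

Going the other way, the line number printed by the monitoring command should equal c*100 + d for the address on that line as long as nothing is dropped.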
You can then easily monitor on another host in the same subnet using the following command (again replacing wlan0 with the appropriate interface):
[CODE]tcpdump -l -i wlan0 'arp' | grep --line-buffered 'who-has 10.99' | grep --line-buffered -n 'who-has'[/CODE]
This will prefix the lines with the same number that should appear after the 10.99 (that's the reason for the first loop in arpflood.sh: it makes the counter 1-based instead of 0-based). As I'm testing on a wireless connection, I'm OK with a few dropped packets. However, I get up to about 1000 (10.99.10.06) and then I stop receiving ARP requests completely. As the network is active, there might be some other ARP requests from active devices pushing the total a little higher. You can press Ctrl-C to stop the arpflood.sh script, as it will otherwise run for about 17 minutes.
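
If you save the tcpdump output to a file, a rough summary of where the requests stop could be produced with something like the sketch below. The function name summarise_arp is an assumption, and the missing count is simply the highest sequence number minus the number of requests seen, so it assumes no duplicates in the capture.

```shell
# Summarise captured ARP requests for the 10.99.c.d test range.
# Reads tcpdump text lines on stdin; any "N:" prefix from grep -n is ignored.
summarise_arp() {
    awk '
        match($0, /who-has 10\.99\.[0-9]+\.[0-9]+/) {
            # Strip the 8-character "who-has " prefix to get the address.
            split(substr($0, RSTART + 8, RLENGTH - 8), o, "[.]")
            n = o[3] * 100 + o[4]      # sequence number c*100 + d
            seen++
            if (n > last) last = n
        }
        END { printf "seen=%d last=%d missing=%d\n", seen, last, last - seen }
    '
}

# Example use (capture.txt is a hypothetical file with saved tcpdump output):
#   summarise_arp < capture.txt
```

A value of last close to 1000 on a run that was left going for several minutes would match the cut-off described above.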

As this switch doesn't have layer 3 capabilities, I'm wondering why it's looking at the ARP requests at all (on layer 2 these requests only use a single source MAC address and the broadcast address).
On a network with fewer than 1000 hosts this will probably work fine; however, overnight I see the cable modem resolving about 5500 distinct IPs.

The specifications of this switch list a 4K MAC table, but apparently the ARP handling will kill the traffic long before the switch runs out of MAC addresses. Also keep in mind that, as I only see the outgoing traffic from the cable modem but not the replies, it only takes up 1 address in the MAC address table, even when trying to resolve 5500 hosts.
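
To compare the number of source MAC addresses with the number of IPs being resolved, a sketch along these lines could be run on a capture taken with tcpdump -e -n. The function name mac_vs_ip is an assumption, and so is the field layout (timestamp first, source MAC second), which is what tcpdump normally prints with -e.

```shell
# Compare distinct source MACs with distinct ARP target IPs.
# Reads "tcpdump -e -n" text output on stdin; assumes the usual layout
# TIME SRC_MAC > DST_MAC, ethertype ARP ...: Request who-has IP tell IP
mac_vs_ip() {
    awk '
        /who-has/ {
            macs[$2] = 1                      # second field: source MAC
            for (i = 1; i < NF; i++)
                if ($i == "who-has") ips[$(i + 1)] = 1
        }
        END {
            for (m in macs) nm++
            for (t in ips) ni++
            printf "source_macs=%d target_ips=%d\n", nm, ni
        }
    '
}

# Live use (interface name is an assumption; stop with Ctrl-C):
#   tcpdump -l -e -n -i eth0 arp | mac_vs_ip
```

On a link like the one described here, the expectation would be a single source MAC against thousands of target IPs.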
Re:TL-SG108E arp flood problem
2016-12-22 02:58:48
Good news.
I sent an e-mail to support on Sunday and received a beta release of firmware 1.0.3 today, which seems to solve the problem.
Re:TL-SG108E arp flood problem
2016-12-22 08:42:25
I got a beta firmware from the TP-Link support team, and it seems to limit the rate of ARP packets. Here is the download link: https://www.dropbox.com/s/cw4ju2i3ykdyfyk/TL-SG108Ev2_en_1.0.3_[20161024-rel36373]_up_beta.zip?dl=0
Re:TL-SG108E arp flood problem
2016-12-28 00:36:13

johnson wrote

I got a beta firmware from the TP-Link support team, and it seems to limit the rate of ARP packets. Here is the download link: https://www.dropbox.com/s/cw4ju2i3ykdyfyk/TL-SG108Ev2_en_1.0.3_[20161024-rel36373]_up_beta.zip?dl=0

I've tried it. It seems like it might have a longer timeout for ARP entries (I didn't write down the numbers, so it could be the same as well). Unfortunately, even if that is the case, it can be both better and worse. If all your devices are always on the network and online, they won't lose their ARP entry as fast. However, if they're offline and the ARP table gets filled up by other devices, it'll also take longer to get a free spot. It definitely doesn't store more entries than the previous release. A device that's advertised with a 4K MAC table should at least match that number of entries in the ARP table, or have an option not to inspect ARP. As it's a layer-2 device, I'm still wondering why it's doing any ARP filtering at all.
Re:TL-SG108E arp flood problem
2017-01-03 06:07:41
I have spent the better part of 2 weeks troubleshooting an SG108EV2 and a 24-port TL-SG1024DEV1, with some success!

These cheap switches are good, but they are easy to CRASH depending on VLAN definitions and other options.

When the switch crashes it goes crazy with ARP traffic and the normal switched connections become seriously slow (~500Kb/s).

HOW NOT TO CRASH IT and keep the Smart Switch stable:
- Make sure none of your VLAN (group/segment/subnet) definitions overlap
- Do not use the "Flow Control: On/Off" option, as it can corrupt VLAN stacks by dropping frames
- Do not use any "STORM CONTROL" option, as it can corrupt VLAN stacks by dropping frames

The parts that are extremely sensitive are the Tagged VLAN stacks, i.e. the managed traffic queues implementing your VLAN definitions. Once they are corrupted you'll need to *Delete* and recreate the VLAN definitions, because a simple Reboot will NOT clear stack corruption.
Use as few VLAN definitions as possible, i.e. do not slice-and-dice your subnets too much (instead use faster MTU, or Port based instead of Tag based VLAN)

HOW TO TROUBLESHOOT:
- To revert to a working state (without doing a complete Reset): delete all your VLANs (all PVIDs automatically return to 1)
- Build and test a minimal VLAN config before adding more complexity like Tagged-Out shared VLAN links
- When the switch has gone crazy (crashed) and is maxing out with junk traffic, it will light up like a Christmas tree!
- Run Speedtest traffic through it to confirm you get ballpark numbers. If it has crashed, the I/O speeds will be dismal

THE LIMITATIONS:
These switches are affordable and far from perfect, but they are far cheaper than Cisco!
TP-Link cannot provide fixes for what the silicon chip maker never figured out on its shoestring budget.
They are seriously fast and indeed smart. The challenge is to make use of them for what they are worth by staying out of their minefields.

Cheers: more power to the wise.
Re:TL-SG108E arp flood problem
2017-01-03 18:50:28
Note that my problem has nothing to do with a misconfiguration of the switch, but with a limitation of its internal ARP table, functionality that's not expected in a layer-2 switch. If it does filter ARP traffic, I would expect it to handle at least as many entries as it has MAC address space (actually it should be more, as there are a lot of valid cases for a MAC address to be used by multiple IPs).
I've also presented a script that can be used to reproduce the issue at will, demonstrating that it's the number of ARP requests that causes the issue, not the VLAN configuration.