[Pool] 8-10k pps in Brazil

Matt Wagner mwaggy at gmail.com
Thu May 28 04:43:58 UTC 2015


I'm still seeing 10,000 queries/second, even after reducing my bandwidth in
the pool by 90%. I'm seeing some weird things that are probably clues, but
that I don't know how to interpret.

My first hypothesis was that reducing my bandwidth in the pool would pretty
quickly lead to a proportional drop in queries, because it was probably
badly-behaved clients that were constantly trying to sync with whatever was
in the pool. I dropped my bandwidth setting from 100 Mbps to 50 Mbps, and
saw no decline over a few days. I dropped from 50 to 10 Mbps today, and am
still doing 10k qps. (10,967/second over the past 5 minutes or so.) I was
also seeing this insane query volume (though I wasn't quantifying it) even
when I had fallen out of the pool for a low score. This isn't unreasonable,
of course; another server I run (not in Brazil) is still getting queries
months after I took it out of the pool. But if it was simply a giant mass
of clients, I'd expect to start seeing _some_ change pretty quickly when I
dropped my overall bandwidth by 90%. I did not.

Also very interesting to me -- a bit over 98% of incoming queries are
NTPv3. On other servers that are or have been in the pool, that number is
below 50%. Here's the latest sysstat:

$ /usr/local/bin/ntpq -c sysstat
uptime:                 545693
sysstats reset:         545693
packets received:       5215604712
current version:        85369465
older version:          5130135559
bad length or format:   98762
authentication failed:  64726
declined:               19
restricted:             1248
rate limited:           464893288
KoD responses:          77028191
processed for time:     2772

(sysstat only breaks queries down into 'current version' and 'older
version', but tcpdump and mrulist show that it's pretty much all v3 and v4.)
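
In case it's useful to anyone checking their own version split: the version
number is just bits 3-5 of the first byte of the NTP packet, so tcpdump can
match it directly (udp[8] is the first byte of the UDP payload). A tiny,
untested sketch of the bit math, with the capture filters I'd use in the
comments:

# NTP version is the VN field: bits 3-5 of the first packet byte (RFC 5905).
# With tcpdump, that byte is udp[8], so for example:
#   tcpdump -n 'udp dst port 123 and (udp[8] & 0x38) = 0x18'   # v3 queries only
#   tcpdump -n 'udp dst port 123 and (udp[8] & 0x38) = 0x20'   # v4 queries only
def ntp_version(first_byte: int) -> int:
    """Extract the version number from the first byte of an NTP packet."""
    return (first_byte >> 3) & 0x07

# The two common client bytes: 0x1b = LI 0, VN 3, Mode 3; 0x23 = LI 0, VN 4, Mode 3.
assert ntp_version(0x1B) == 3
assert ntp_version(0x23) == 4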

My theory, and one that looking at "mrulist limited" seems to support, is
that the giant increase in traffic is all v3. If I look at just 'current
version', there are 85,369,465 queries over 545,693 seconds -- roughly 156
per second, in the same ballpark as the 250-400 qps I used to see.
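
For anyone who wants to repeat that arithmetic on their own server, here's
a rough, untested sketch that shells out to ntpq and divides each counter
by the sysstats interval (the ntpq path is just where mine happens to live):

import subprocess

def sysstat_rates(ntpq="/usr/local/bin/ntpq"):
    """Turn 'ntpq -c sysstat' counters into per-second rates."""
    out = subprocess.run([ntpq, "-c", "sysstat"],
                         capture_output=True, text=True, check=True).stdout
    stats = {}
    for line in out.splitlines():
        name, _, value = line.rpartition(":")
        if name and value.strip().isdigit():
            stats[name.strip()] = int(value)
    # Counters accumulate since the last sysstats reset (uptime if never reset).
    interval = stats.get("sysstats reset") or stats["uptime"]
    return {name: count / interval for name, count in stats.items()
            if name not in ("uptime", "sysstats reset")}

if __name__ == "__main__":
    for name, rate in sysstat_rates().items():
        print(f"{name:24s} {rate:12.1f}/s")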

The other thing I can't figure out -- if I run something like "mrulist
limited", the overwhelming majority of clients that are listed are of the
form "191-247-nnn-nn.3g.claro.net.br" (with the 'nnn' representing IP
octets, of course). I tried adding "sortorder=count" to rule out the list
simply being sorted by IP range or something. So on the surface, it seemed
like that
netblock was the culprit.

I added iptables rules to try to prove it -- I accept traffic on a bunch of
subnets I saw coming up often, and then I default-accept the rest of the
traffic. Of the enumerated IP ranges, 191.244.0.0/14 is indeed the busiest,
with 17GB passed since I last restarted iptables. The next-closest subnet
is about 2GB. BUT, the default accept has matched 311GB, so it seems that
they're not the source of the bulk of my traffic. I think I need to dig
into that 311GB and try to break it down further.
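
Reading the counters back out is just a matter of parsing "iptables -L
INPUT -v -n -x" output (-x gives exact byte counts). A rough, untested
sketch of the tally, assuming the per-subnet ACCEPT rules live in the INPUT
chain and that it runs as root:

import subprocess

def subnet_byte_counts(chain="INPUT"):
    """Tally bytes per source subnet from iptables rule counters."""
    out = subprocess.run(["iptables", "-L", chain, "-v", "-n", "-x"],
                         capture_output=True, text=True, check=True).stdout
    counts = {}
    # Skip the chain header and column-name line; columns are:
    # pkts bytes target prot opt in out source destination ...
    for line in out.splitlines()[2:]:
        fields = line.split()
        if len(fields) >= 8 and fields[2] == "ACCEPT":
            counts[fields[7]] = counts.get(fields[7], 0) + int(fields[1])
    return counts

if __name__ == "__main__":
    for source, nbytes in sorted(subnet_byte_counts().items(),
                                 key=lambda kv: -kv[1]):
        print(f"{source:20s} {nbytes / 2**30:8.1f} GiB")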

The "3g" bit of the hostnames piques my interest as well -- could they be
cell phones? Cellular modems? Maybe they have a faulty NTP client? (But I
thought GSM had its own way of syncing time?)

People have suggested two theories that seem very reasonable, and that fit
Occam's Razor, but that I'm not sure quite explain what I'm seeing. The
first was that it's a DDoS attack, and the second was that this is caused
by the drop in the number of servers in the Brazil zone. (Although both
might be contributing in some small part.)

According to the MRU list, the most abusive client (whether it's an actual IP or
a spoofed one that attackers want to have me attack) has queried me about
750k times -- certainly badly-behaved, but less than 2pps on average. And
only 9 total IPs have sent me more than 15k queries. (Though is there a way
for me to see how often this list has rolled over? It's probably being
purged a lot at this scale?) It doesn't seem like attackers would be
accomplishing anything more than sending a few kbps of traffic to any given
IP.

I also don't know if the decline in the number of servers can entirely
explain this. Previously, I was doing a few hundred queries per second;
10k/second is an enormous leap that's hard to explain even if half the
servers left the zone. And if I look at only NTPv4 traffic, my count is pretty
similar to what it used to be. (Though I don't have those metrics from
before this started.)

I'm thinking of writing a small tool to stream tcpdump output and keep a
per-subnet counter, unless there's already something doing this?
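
Roughly what I have in mind is below -- an untested sketch that follows
tcpdump and keeps a packet counter per /14 (the interface name is a
placeholder, and it needs to run as root):

import collections
import ipaddress
import re
import subprocess
import time

def stream_subnet_counts(prefix_len=14, interface="eth0", report_every=10.0):
    """Follow tcpdump and count inbound NTP packets per source subnet."""
    # Typical '-n' line: "... IP 191.247.1.2.51413 > 10.0.0.1.123: NTPv3, Client, length 48"
    src_re = re.compile(r" IP (\d+\.\d+\.\d+\.\d+)\.\d+ > ")
    counts = collections.Counter()
    proc = subprocess.Popen(
        ["tcpdump", "-n", "-l", "-i", interface, "udp", "dst", "port", "123"],
        stdout=subprocess.PIPE, text=True)
    last_report = time.monotonic()
    for line in proc.stdout:
        m = src_re.search(line)
        if not m:
            continue
        net = ipaddress.ip_network(f"{m.group(1)}/{prefix_len}", strict=False)
        counts[str(net)] += 1
        if time.monotonic() - last_report >= report_every:
            print("---- top subnets ----")
            for subnet, n in counts.most_common(10):
                print(f"{subnet:20s} {n:10d}")
            last_report = time.monotonic()

if __name__ == "__main__":
    stream_subnet_counts()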

I've put my server in Canada, which sits on a 100Mbps unmetered connection,
in the pool and asked if it might be moved into the Brazil pool to try to
absorb some of the load. I can't keep paying for this level of bandwidth
and for the larger instance I moved to in order to handle the load. But I also
suspect that something, somewhere has broken to cause this, and I can't
figure out what it is.

-- Matt

On Fri, May 22, 2015 at 2:28 PM, Matt Wagner <mwaggy at gmail.com> wrote:
>
> Does anyone else here run an NTP server in Brazil? I'm wondering if you
> are seeing the same crazy load I am.
>
> For a long time I saw maybe 400 queries/second, but I got email last
> weekend that I had fallen out of the pool for being unreachable. Indeed, I
> couldn't even SSH in. It turns out that it's because my server (a t1.micro
> instance) was dying under the load, which is close to 10,000 queries per
> second right now. For giggles, I upsized to a larger instance and moved the
> IP to watch what was happening on a machine that could handle the load.
>
> Yes, I'm patched against the old monlist exploit.
>
> $ /usr/local/bin/ntpq -c sysstat
> uptime:                 77729
> sysstats reset:         77729
> packets received:       670434339
> current version:        10573419
> older version:          659857017
> bad length or format:   3276
> authentication failed:  7916
> declined:               3
> restricted:             126
> rate limited:           60293937
> KoD responses:          10096867
> processed for time:     636
>
> There are definitely some abusive clients, but it's not a crazy DoS from
> one IP or anything. Less than 10% of requests hit rate limits, and if I
> watch tcpdump or something, it's from a huge range of IPs. Only a handful
> of clients have made more than 50,000 requests (over the ~77000 second
> uptime), and none are way over that. Trying to profile random IPs from
> tcpdump, none seem to be behaving too wildly. It seems like I'm just
> serving a huge number of clients.
>
> My bandwidth is set at 100 Mbps, which it has been at for a while. The
> jump from a few hundred queries/second to 10,000 queries/second seems to
> have come out of nowhere.
>
> Is anyone else seeing this? I'm happy to keep soaking up some of the
> load, but I'm not eager to pay for 50GB of NTP traffic a day for too long.

