Wednesday, June 29, 2016

Detecting DNS Data Exfiltration

This blog was co-authored by Martin Lee and Jaeson Schultz with contributions from Warren Mercer.

The recent discovery of Wekby and Point of Sale malware using DNS requests as a command and control channel highlights the need to consider DNS as a potentially malicious channel. Although a skilled analyst may be able to quickly spot unusual activity because they are familiar with their organisation’s normal DNS activity, manually reviewing DNS logs is typically time-consuming and tedious. In an environment where it might be unclear what malicious DNS traffic looks like, how can we identify malicious DNS requests?

We all have subconscious mental models that shape our perceptions of the environment and help us to identify the unusual. An outlandish or unusual happening in the local neighbourhood piques our curiosity and makes us want to find out what is going on. We compare our expectations of normality with our observations; if the two don’t match, we want to know why. A similar approach can be applied to DNS logs. If we can construct a baseline or model of ‘normality’, we can compare our observations to the model and spot whether reality, as we see it, is wildly different from what we would expect.

We are familiar with common DNS requests such as requesting the IP address of ‘’, but what kind of request would be so unusual as to require investigation? Malware could encode stolen data as the subdomain part of a DNS lookup for a domain where the nameserver is under control of an attacker. A DNS lookup for ‘’ would be forwarded to the nameserver of, which would record ‘long-string-of-exfiltrated-data’ and reply back to the malware with a coded response.

Naively, we would expect the subdomain part of such requests to be much longer than usual requests. We can use the distribution of the lengths of subdomains within DNS requests to construct a mathematical model that describes normality, and use this to compare our observations to identify the outlandish.
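As a rough sketch of this step, the following Python computes the relative frequency of each subdomain length in a list of queried names. The function names are ours, and the rule that the registered domain is simply the last two labels is an assumption made to keep the example short; a real implementation would consult the Public Suffix List so that names under extensions such as ‘co.uk’ are split correctly.

```python
from collections import Counter


def subdomain_of(qname, registered_labels=2):
    """Return the part of a query name to the left of the registered domain.

    ASSUMPTION: the registered domain is the last `registered_labels` labels.
    This is wrong for multi-label extensions such as 'co.uk'.
    """
    labels = qname.rstrip(".").lower().split(".")
    return ".".join(labels[:-registered_labels])


def length_frequencies(qnames):
    """Map each subdomain length to the fraction of queries having that length.

    Queries with no subdomain at all (for example 'example.com') are ignored.
    """
    lengths = [len(subdomain_of(q)) for q in qnames]
    lengths = [n for n in lengths if n > 0]
    counts = Counter(lengths)
    total = len(lengths)
    return {n: c / total for n, c in sorted(counts.items())}


# Illustrative input only; real input would come from DNS query logs.
sample = [
    "www.example.com",
    "mail.example.com",
    "www.example.org",
    "example.com",
    "a.b.example.net.",
]
frequencies = length_frequencies(sample)
print(frequencies)
```

Dots between labels are counted as part of the subdomain length here; whether to count them is a design choice that only needs to be applied consistently.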

Calculating the frequency of occurrence of subdomain length by removing the domain name and domain extension of a sample of DNS requests gives the following graph:

The orange line shows the distribution of subdomain lengths from one to sixty-five characters. Although it is obviously not an exact fit, this distribution approximates the smooth exponential curve shown in yellow. We can use this curve as our model of normality and compare our observed values to this curve in order to spot anomalies.
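One simple way to obtain such a curve, and not necessarily the way the curve in the graph was produced, is a least-squares fit on the logarithm of the frequencies, so that the model takes the form f(n) = a·exp(b·n). The sketch below uses only the standard library; the names `fit_exponential` and `expected_frequency` are ours, and the input data is synthetic.

```python
import math


def fit_exponential(freqs):
    """Fit f(n) = a * exp(b * n) by linear least squares on ln(f).

    `freqs` maps subdomain length -> observed frequency (all values > 0).
    At least two distinct lengths are required. Returns the pair (a, b).
    """
    xs = list(freqs)
    ys = [math.log(freqs[n]) for n in xs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def expected_frequency(a, b, n):
    """Frequency the fitted curve predicts for subdomain length n."""
    return a * math.exp(b * n)


# Synthetic data that follows an exact exponential, to show the fit recovers it.
synthetic = {n: 0.5 * math.exp(-0.2 * n) for n in range(1, 11)}
a, b = fit_exponential(synthetic)
print(a, b)
```

Because the fit is done in log space, outliers such as an unusually popular length pull the curve less than they would in a direct fit, which suits a model that is meant to describe ‘normal’ traffic.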

Immediately we can see that subdomains of three characters in length are far more common than we would expect. Understandably, this corresponds to the length of ‘www’, a very common subdomain string. To measure how much more frequent this observation is than we would expect, we can divide the observed value by the value predicted by our curve, giving a metric of how unusual the observation is.

Continuing this calculation for all the length values and plotting this gives a graph showing how much reality diverges from our expectations of normality for each subdomain length:

Clearly, a handful of subdomain lengths are occurring at much higher frequencies than we would expect. Indeed, the divergence from the expected values is so great that a few values had to be truncated at 1000.
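The observed-to-expected ratio, including the truncation at 1000, can be sketched as follows. The model parameters, the sample frequencies and the flagging threshold of 10 are hypothetical values chosen for illustration, not figures from our data.

```python
import math


def divergence(observed, model, cap=1000.0):
    """Return observed / expected for each subdomain length, truncated at `cap`.

    `observed` maps length -> observed frequency.
    `model` is a function returning the expected frequency for a length.
    """
    ratios = {}
    for n, freq in observed.items():
        expected = model(n)
        # If the model underflows to zero, treat the ratio as maximal.
        ratios[n] = cap if expected <= 0 else min(freq / expected, cap)
    return ratios


def flag_lengths(ratios, threshold=10.0):
    """Lengths whose ratio meets or exceeds `threshold` (a hypothetical cut-off)."""
    return sorted(n for n, r in ratios.items() if r >= threshold)


def example_model(n):
    """Hypothetical exponential model; a and b would come from a fitted curve."""
    return 0.3 * math.exp(-0.25 * n)


observed = {3: 0.5, 10: 0.02, 231: 0.001}  # illustrative frequencies
ratios = divergence(observed, example_model)
suspicious = flag_lengths(ratios)
print(ratios, suspicious)
```

With these made-up numbers, length 3 is a few times more common than predicted, while length 231 is so far beyond the curve that its ratio is capped.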

Concentrating on these outlandish values reduces the manual work necessary to review the set of logs. Many of these particularly long subdomains turn out to be legitimate cloud services or content distribution networks. However, a handful of domains with subdomains of 231 and 233 characters in length seemed particularly interesting.

Despite the name server for each domain being hosted on different networks, the domains share a number of unusual features. There are hundreds of subdomains for each domain, but each unique subdomain is only ever accessed once. Although not necessarily uncommon, each DNS lookup resulted in ‘’ being returned. is a known malicious domain by which the Multigrain malware exfiltrates stolen credit card numbers. The clear similarities between the three domains suggest that the previously unknown and domains are related to that of
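The ‘many subdomains, each looked up only once’ pattern is itself something that can be searched for. The sketch below is one possible way to do so; the function name, the `min_unique` threshold of 100, the two-label rule for the registered domain and the `.example` names are all assumptions for illustration.

```python
from collections import Counter, defaultdict


def single_use_domains(qnames, min_unique=100, registered_labels=2):
    """Domains with at least `min_unique` subdomains, each queried exactly once.

    ASSUMPTION: the registered domain is the last `registered_labels` labels.
    """
    per_domain = defaultdict(Counter)
    for qname in qnames:
        labels = qname.rstrip(".").lower().split(".")
        if len(labels) <= registered_labels:
            continue  # no subdomain part to count
        domain = ".".join(labels[-registered_labels:])
        subdomain = ".".join(labels[:-registered_labels])
        per_domain[domain][subdomain] += 1

    flagged = []
    for domain, subdomains in per_domain.items():
        every_name_seen_once = all(c == 1 for c in subdomains.values())
        if len(subdomains) >= min_unique and every_name_seen_once:
            flagged.append(domain)
    return sorted(flagged)


# Synthetic log: one domain with 150 one-off subdomains, one ordinary site.
queries = [f"data{i}.exfil.example" for i in range(150)]
queries += ["www.shop.example"] * 200
queries += [f"cdn{i}.shop.example" for i in range(150)]
print(single_use_domains(queries))
```

As noted above, legitimate cloud services and content distribution networks can show similar behaviour, so the output is a list of candidates for review rather than a verdict.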

The Multigrain malware uses base32 encoding to exfiltrate data from infected machines. Base32 is less space-efficient than the more commonly known base64 encoding, but its alphabet consists only of the letters a-z and the digits 2-7. The digits '0' and '1' are omitted from the base32 alphabet due to their similarity to the letters 'O' and 'I'. The encoding has the advantage that every character it produces can be used in a DNS lookup, and that capitalisation does not need to be maintained.

The major part of the Multigrain DNS request is encrypted, but the first section, which encodes an identifier of the infected machine, is readable. For example, the section beginning nu6t in the following:
decodes to: m=3753560948
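A minimal sketch of this decoding step using Python's standard `base64` module is shown below. The helper name `b32decode_label` is ours. Rather than reproducing the original request, the example re-encodes the decoded text given above and checks that it round-trips; DNS labels carry no ‘=’ padding, so the helper restores it before decoding.

```python
import base64


def b32decode_label(label):
    """Decode a lower-case, unpadded base32 string taken from a DNS label."""
    # base32 works in blocks of 8 characters; restore any stripped padding.
    padded = label + "=" * (-len(label) % 8)
    # casefold=True accepts lower-case input, as DNS names are case-insensitive.
    return base64.b32decode(padded, casefold=True)


# Re-encode the readable section to show what it looks like inside a lookup.
plaintext = b"m=3753560948"
encoded = base64.b32encode(plaintext).decode("ascii").rstrip("=").lower()

print(encoded[:4])               # the encoded form starts with 'nu6t'
print(b32decode_label(encoded))  # recovers the original bytes
```

Running a decoder like this over the leading label of suspicious lookups is a quick way to check whether readable key=value data is hiding in the subdomain.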

Any feature of DNS requests, such as the length of the domain name or the number of subdomains, can be used to construct models of expected behaviour to which observed values can be compared.
These identify domains with similar patterns such as:
which are known to be associated with the W32/AutoRun worm. And the hilarious:
associated with phishing scams.
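The same frequency calculation generalises to any numeric feature of a request. The sketch below treats the feature as a pluggable function; `label_count` and `name_length` are two example features we chose for illustration, and the resulting distribution could be compared against a fitted model in the same way as subdomain length.

```python
from collections import Counter


def label_count(qname):
    """Number of dot-separated labels in the query name."""
    return len(qname.rstrip(".").split("."))


def name_length(qname):
    """Total length of the query name, ignoring any trailing dot."""
    return len(qname.rstrip("."))


def feature_distribution(qnames, feature):
    """Map each value of `feature` to the fraction of queries showing it."""
    counts = Counter(feature(q) for q in qnames)
    total = sum(counts.values())
    return {value: c / total for value, c in sorted(counts.items())}


# Illustrative input only.
names = ["www.example.com", "example.com", "a.b.c.example.com."]
print(feature_distribution(names, label_count))
print(feature_distribution(names, name_length))
```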

Monitoring logs, and DNS logs in particular, is an excellent technique for spotting attacks. When you have more data than you can eyeball, using simple techniques to model the data can help identify those entries that require a second glance. It’s these second glances that often make the difference between well-defended and compromised networks.


  1. This comment has been removed by the author.

    1. Thanks for the comment. There are many different techniques out there. This work started from wondering how could you exfiltrate over DNS and what would such attempts look like in the logs? From that point it was a matter of finding a method which would reliably identify such attempts.

    2. I believe that DNS entropy is a terrible way to identify bad domains. After I performed a DNS tunneling experiment, the domain names had a Shannon entropy of no more than 2.6, which made them indistinguishable from other domains with an entropy of 2.6.

      I think the idea is to look at frequency rather than the actual name.

  2. What about the resolution of those DNS requests? I am not sure how these DNS requests reach the attacker. Do all of them resolve to the same IP, or something similar?

  3. Splunk can really help with this, or some other SIEM if you have your DNS logs centrally logged or at least have Bro.

    1. I agree entirely. You can implement the technique using whatever platform that you have.
      Of course, you could just use OpenDNS and have all this work going on behind the scenes without needing to worry about it ;-)

  4. What PoS malware is using this technique?

    1. Multigrain, which is thought to be a variant of NewPosThings.

