We've received a number of queries recently about the source of the data in the blocklist.rules category. I'm posting the answer here, since it will be of broad interest to the Sourcefire/Snort user base.

One of the side effects of our 2007 acquisition of the ClamAV project was the VRT gaining access to the ClamAV database. This massive collection of malware, augmented by tens of thousands of unique samples per day from a variety of sources, is a treasure trove of information - assuming you can find a useful way to sift through it all. Thanks to the magic of the VMware API, I've developed a system that does precisely that, with a focus on network traffic instead of the traditional anti-virus interest in the malicious files themselves.

The setup starts with some big, beefy chunks of hardware running VMware ESXi. On each of these machines, there is a single Ubuntu-based VM that serves as a NAT/controller box, along with as many freshly installed Windows XP SP2 (unpatched) systems as the hardware can support. The controller systems each run scripts that automatically pull the latest executable samples from the ClamAV database and then farm them out in a systematic way to the XP systems, following this simple procedure (sketched in code below):

  1. Revert XP VM to clean snapshot
  2. Copy malware sample to XP VM
  3. Fork off tcpdump in the background on the controller, with a BPF specific to the XP VM in question
  4. Execute the malware sample on the XP VM
  5. Wait 150 seconds
  6. Repeat from step 1

Simple as the process seems, it's taken some time to get it running smoothly. At first, we thought RAM would be our bottleneck; as it turns out, disk access time was a considerably more important factor, since constantly reverting machines to a clean snapshot is very I/O-intensive. We've had to fine-tune our queue management process, since the rate of growth in new malware samples is outstripping our hardware's ability to process them. Parsing through all of the PCAPs generated by the system required learning, and eventually patching, tshark, the command-line PCAP processing tool from the good folks at Wireshark.org (yes, we submitted the patch back). After a not-inconsiderable amount of time getting everything set up, this system has been happily churning through malicious executables from ClamAV for several months now.
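For the curious, here's a rough sketch in Python of what one iteration of that loop might look like. This is not our production code - the VM path, credentials, addresses, and interface name below are all placeholders - but it illustrates the mechanics, assuming VMware's vmrun utility and tcpdump are available on the controller:

    import subprocess, time

    # Placeholder values, not our real infrastructure
    VMX = "[datastore1] xp-sandbox-01/xp.vmx"
    GUEST_IP = "192.168.56.101"   # NAT address assigned to this XP VM
    AUTH = ["-T", "esx", "-h", "https://esx-host/sdk",
            "-u", "root", "-p", "hostpass"]
    GUEST = ["-gu", "Administrator", "-gp", "guestpass"]

    def vmrun(*args):
        subprocess.check_call(["vmrun"] + AUTH + list(args))

    def detonate(sample_path, pcap_path):
        # Step 1: revert the XP VM to its clean snapshot (whether a
        # separate start is needed depends on whether the snapshot was
        # taken while the VM was powered on)
        vmrun("revertToSnapshot", VMX, "clean")
        vmrun("start", VMX)
        # Step 2: copy the malware sample into the guest
        vmrun(*(GUEST + ["copyFileFromHostToGuest", VMX,
                         sample_path, "C:\\sample.exe"]))
        # Step 3: fork tcpdump with a BPF limited to this VM's address
        capture = subprocess.Popen(["tcpdump", "-i", "eth1",
                                    "-w", pcap_path, "host", GUEST_IP])
        # Step 4: execute the sample without waiting for it to exit
        vmrun(*(GUEST + ["runProgramInGuest", VMX,
                         "-noWait", "C:\\sample.exe"]))
        # Step 5: give it 150 seconds to talk, then stop the capture
        time.sleep(150)
        capture.terminate()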

As the data came rolling in, we knew we'd need to whitelist legitimate traffic before creating rules from what we saw. Given that all of the network traffic generated by this system comes directly from infected machines, with no human interaction, we figured the whitelisting process would be pretty straightforward: aside from the occasional ping to Google.com to verify connectivity to the Internet, we initially expected most of the traffic to be command-and-control, or at least conversations with clearly illegitimate systems. Surprisingly enough, however, a huge portion of the HTTP traffic in these PCAPs - which in turn represents the vast bulk of the traffic captured - went to legitimate domains, including thousands of relatively obscure ad servers around the world. Separating these domains from truly malicious sites has been one of the more interesting ongoing challenges in running this system.
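To give a sense of the sifting involved, here's a simplified illustration - again, not our actual tooling - of pulling requested hostnames out of a directory of PCAPs with tshark and ranking them by frequency, with a placeholder whitelist file filtered out:

    import collections, glob, subprocess

    counts = collections.Counter()
    for pcap in glob.glob("captures/*.pcap"):
        # -Y is the display filter flag on current tshark builds;
        # older releases used -R for the same purpose
        output = subprocess.check_output(
            ["tshark", "-r", pcap, "-Y", "http.request",
             "-T", "fields", "-e", "http.host"])
        counts.update(h for h in output.decode().splitlines() if h)

    # "known_good_domains.txt" stands in for whatever known-good
    # domain list you maintain
    whitelist = set(open("known_good_domains.txt").read().split())
    for host, hits in counts.most_common(50):
        if host not in whitelist:
            print("%6d  %s" % (hits, host))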

Since the goal is to generate data that's useful for the largest possible number of users, the domains and URLs that make up the end-product rules are those accessed most frequently by our system's infected machines. Looking for the most commonly accessed places has a side benefit of helping to filter out highly transient endpoints and behaviors - the domains and URLs in question are often being accessed thousands of times over the course of weeks or months by our victim machines, and rules that we released last year are still helping users identify and clean out infected machines on their network. As of the Feb. 8, 2011 rule release, we're also including rules for abnormal User-Agent strings generated by our systems, taking advantage of malware authors dumb enough to set the HTTP equivalent of the Evil Bit as they talk to systems around the Internet.
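As an illustration of what a User-Agent rule of this type looks like - the SID and the string itself are invented for this example, not taken from an actual release - the structure is roughly:

    alert tcp $HOME_NET any -> $EXTERNAL_NET $HTTP_PORTS \
        (msg:"BLOCKLIST User-Agent known malicious user-agent string (example)"; \
        flow:to_server,established; \
        content:"User-Agent|3A| EvilBot/1.0"; http_header; \
        classtype:trojan-activity; sid:1000001; rev:1;)

Since the User-Agent header is set entirely by the client, a string like this only ever appears on the wire because the malware author chose to announce themselves - hence the Evil Bit comparison.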

In addition to the rules we publish, we're also automatically publishing chunks of raw data from this malware system on the VRT Labs site on a daily basis. As a word of caution to anyone considering simply pulling those lists of IPs, URLs, and domains down and adding them to an internal blocklist: your mileage may vary rather drastically. While we do filter those lists, they don't receive the level of human attention and verification that the data going into the rules gets, and consequently they are much, much more likely to contain false positives. As such, we would suggest cross-referencing them with other data sets, or applying other filtering techniques your organization may have, before using them to block traffic in your enterprise.
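For those who do want to consume the raw lists, one simple hedge is to require corroboration before blocking anything. A toy example of that kind of cross-referencing, where the file names are placeholders for lists you'd fetch and maintain yourself:

    def load(path):
        # one entry per line; skip blanks and comments
        return set(line.strip() for line in open(path)
                   if line.strip() and not line.startswith("#"))

    vrt_ips   = load("vrt_labs_ip_list.txt")       # the daily raw list
    other     = load("other_reputation_feed.txt")  # a second feed you trust
    whitelist = load("corporate_whitelist.txt")    # things you never block

    # Only block addresses that two independent sources agree on
    candidates = (vrt_ips & other) - whitelist
    with open("blocklist_candidates.txt", "w") as out:
        out.write("\n".join(sorted(candidates)) + "\n")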

Obviously, there is room for improvement in this system; we know, for example, that a lot of malware will detect virtual machines and refuse to run, and that there is additional data we could be pulling from the PCAPs the infected systems generate. That said, we feel that it provides considerable value to our users as-is, and we will continue working to improve it as time goes on. In the meantime, if anyone has suggestions for us, please don't hesitate to contact the VRT - your feedback is always valuable!