Users of prior versions of ClamAV may have noticed a drastic increase in the size of the tarball with the introduction of 0.96. This is due to the addition of a bytecode interpreter and a JIT compiler based on the Low Level Virtual Machine (LLVM). Together they greatly extend ClamAV's detection capabilities by letting it interpret or execute bytecode. Not much documentation exists yet on how to write bytecode for ClamAV and take advantage of the tremendous flexibility it offers (I will try to fix that). If you want to write your own bytecode for ClamAV, you will need to configure ClamAV to allow it to load unsigned bytecode (bytecode shipped by ClamAV is digitally signed, and by default only signed bytecode is loaded).

If you already have ClamAV installed, even the latest version, you will have to remove it:

sudo make uninstall

(Alternatively you can keep your existing ClamAV installed, and just build a new ClamAV without installing it.)

Get the latest version of ClamAV here. Untar the archive and run the following commands:

./configure --enable-unsigned-bytecode && make && sudo make install

Note the configure option --enable-unsigned-bytecode. Without it, ClamAV will refuse to load your custom bytecode and produce this warning:

LibClamAV Warning: Only loading signed bytecode, skipping load of unsigned bytecode!

Now get the bytecode compiler by running the following command:

git clone git://git.clamav.net/git/clamav-bytecode-compiler

This will create a folder called clamav-bytecode-compiler that contains everything needed to compile ClamAV bytecode, including documentation in the subfolder doc (the latest compiler documentation can always be accessed here). Make sure to follow the instructions in the README file to build the compiler.

Here's a case study to show how ClamAV bytecode can come in handy: an integer overflow vulnerability in an old version of OpenOffice (CVE-2008-2238). The vulnerability came about due to the way OpenOffice used to parse Enhanced Metafiles (EMF). The specification for the EMF file format is available here. An EMF metafile is composed of a series of variable-length records called EMF records. An EMF record has the following format:

Offset  Size  Description
------------------------------------
0x0000  4     Record Type
0x0004  4     Record Size
0x0008  N     Type-Specific Data

There is a record called EMR_EXTTEXTOUTW which has the following format:

Offset  Size  Description
------------------------------------
0x0000  4     Record Type: EMR_EXTTEXTOUTW <0x00000054>
0x0004  4     Record Size
0x0008  16    Bounds
0x0018  4     iGraphicsMode
0x001c  4     exScale
0x0020  4     eyScale
0x0024  N     EmrText (variable)

The EmrText block has the following format:

Offset  Size  Description
-----------------------------
0x0000  8     Reference
0x0008  4     Chars OR nLen
0x000C  4     OffString
......
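The two layouts above fix where Chars sits relative to the start of an EMR_EXTTEXTOUTW record. As a small sketch of that arithmetic (the constant and function names are hypothetical and not part of any ClamAV or EMF API):

```c
#include <stdint.h>

/* Hypothetical names for illustration; the values come from the tables above. */
enum {
    EXTTEXTOUTW_EMRTEXT_OFF = 0x24, /* EmrText block within the record */
    EMRTEXT_CHARS_OFF       = 0x08  /* Chars field within the EmrText block */
};

/* Offset of Chars relative to the first byte of an EMR_EXTTEXTOUTW record. */
static uint32_t chars_offset_in_record(void)
{
    return EXTTEXTOUTW_EMRTEXT_OFF + EMRTEXT_CHARS_OFF; /* 0x24 + 0x08 = 0x2C */
}
```

0x2C is 44 in decimal, which is the distance the detection code needs to skip from the start of the record to reach Chars.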

Without getting into the details of why, I'll just say that there is an integer overflow condition if the value of Chars is greater than or equal to 0x80000000.
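For readers who want some intuition, here is a generic illustration of why a 32-bit count at or above 0x80000000 tends to be dangerous. It is not a reconstruction of the OpenOffice code. It assumes, purely for illustration, that a parser either treats the count as a signed 32-bit integer or multiplies it by a 2-byte character size in 32-bit arithmetic. The function names are hypothetical.

```c
#include <stdint.h>

/* Returns 1 if the count has its top bit set, i.e. it would be negative
   when reinterpreted as a signed 32-bit integer. */
static int count_is_negative_as_int32(uint32_t chars)
{
    return (chars & 0x80000000u) != 0;
}

/* Byte size of `chars` 2-byte characters computed in 32-bit arithmetic
   (the 2-byte size is an assumption). Unsigned arithmetic wraps modulo
   2^32, so a huge count can produce a tiny size. */
static uint32_t byte_size_32bit(uint32_t chars)
{
    return chars * 2u;
}
```

Under those assumptions, a count of 0x80000000 yields a computed size of 0, so a buffer sized from it would be far smaller than the data the count claims to describe.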

Fire up your favorite text editor and create a file called emf_CVE-2008-2238.c.

Start off by specifying the type of file you are targeting (more information about target types here):

TARGET(0)

Next we declare the .ndb style pattern we will look for in EMF files as we attempt to identify the ones that may be trying to leverage the vulnerability. Based on the specification for the EMF format, the first record in the metafile is always an EMF header record (type 0x01), and 40 bytes into the record is a signature field that must contain EMF. (Strictly, the four signature bytes are a space followed by EMF, which is why the pattern below skips 37 bytes after the 4-byte record type and matches EMF at offset 41.) Let's declare this signature and delimit the declaration with the macros SIGNATURES_DECL_BEGIN and SIGNATURES_DECL_END:

SIGNATURES_DECL_BEGIN
DECLARE_SIGNATURE(emr_header)
SIGNATURES_DECL_END

The definitions are delimited by the macros SIGNATURES_DEF_BEGIN and SIGNATURES_END:

SIGNATURES_DEF_BEGIN
DEFINE_SIGNATURE(emr_header, "0:01000000{37}454d46")
SIGNATURES_END
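To make the pattern concrete, here is a host-side sketch in plain C of what "0:01000000{37}454d46" asks for: the bytes 01 00 00 00 at offset 0, then any 37 bytes, then 45 4d 46 (the ASCII text EMF). The function name is hypothetical and this is not how ClamAV implements matching; ClamAV's own pattern matcher does this work for you.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns 1 if buf satisfies 0:01000000{37}454d46. */
static int matches_emr_header(const uint8_t *buf, size_t len)
{
    static const uint8_t record_type[4] = { 0x01, 0x00, 0x00, 0x00 };
    static const uint8_t emf[3] = { 0x45, 0x4d, 0x46 }; /* "EMF" */
    const size_t emf_off = sizeof(record_type) + 37;    /* offset 41 */

    if (buf == NULL || len < emf_off + sizeof(emf))
        return 0;
    return memcmp(buf, record_type, sizeof(record_type)) == 0 &&
           memcmp(buf + emf_off, emf, sizeof(emf)) == 0;
}
```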

We then define a function called logical_trigger() which is a must for bytecode that is triggered by a logical signature:

bool logical_trigger()
{
  return matches(Signatures.emr_header);
}

If needed, you can combine multiple signatures here with boolean and comparison operators. See the format of .ldb signatures, or the compiler's documentation, for more details. In this case the function simply returns true if the emr_header signature is matched. If logical_trigger returns true, the function entrypoint is called; it returns an int. I have attempted to explain the detection logic of the function through the embedded comments below:

/* This is the bytecode function that is actually executed when the logical signature is matched */
int entrypoint(void)
{
  uint8_t emf_exttextoutw[4] = "\x54\x00\x00\x00"; /* Header for EMF record EMR_EXTTEXTOUTW */
  int pos = 0;               /* Cursor position in file */
  uint32_t Chars_value = 0;  /* Value of the attribute Chars */
  uint8_t Chars[4];          /* Chars attribute. See format for EmrText block */

  while (1)
  {
    /* Find an EMF record EMR_EXTTEXTOUTW */
    pos = file_find(emf_exttextoutw, 4);

    /* If EMF record EMR_EXTTEXTOUTW cannot be found */
    if (pos == -1)
      break;

    /* Move the cursor 44 bytes forward, to the start of Chars */
    seek(pos + 44, SEEK_SET);

    /* Read Chars, which is 4 bytes long, little endian.
       Stop if fewer than 4 bytes are left in the file. */
    if (read(Chars, sizeof(Chars)) != (int)sizeof(Chars))
      break;

    /* Convert to the host system's endianness. cli_readint32 is part of the
       ClamAV API. If your system is already little endian it does nothing
       (just reads the value), and if your system is big endian it swaps the
       bytes. See the definition of cli_readint32 in other.h in the libclamav
       folder of your ClamAV installation. */
    Chars_value = cli_readint32(Chars);

    if (Chars_value >= 0x80000000)
    {
      foundVirus("CVE-2008-2238");
      break;
    }

    /* Advance by 1 position in the file and search again */
    seek(pos + 1, SEEK_SET);
  }
  return 0;
}
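If you would like to experiment with the detection logic without the bytecode compiler, the same scan can be approximated on an in-memory buffer in ordinary C. This is only a host-side sketch: the function names are hypothetical, it does not use the ClamAV bytecode API, and it assumes the whole file is already loaded into buf.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 1 if any 54 00 00 00 sequence is followed, 44 bytes after its
   start, by a little-endian value >= 0x80000000; returns 0 otherwise. */
static int looks_like_cve_2008_2238(const uint8_t *buf, size_t len)
{
    static const uint8_t rec_type[4] = { 0x54, 0x00, 0x00, 0x00 };
    size_t pos;

    if (buf == NULL)
        return 0;
    /* Chars occupies bytes pos+44 .. pos+47, so 48 bytes must be available. */
    for (pos = 0; pos + 48 <= len; pos++) {
        if (memcmp(buf + pos, rec_type, sizeof(rec_type)) != 0)
            continue;
        if (read_le32(buf + pos + 44) >= 0x80000000u)
            return 1;
    }
    return 0;
}
```

Like the bytecode, this looks for the byte sequence 54 00 00 00 anywhere in the data rather than walking the records one by one, so it can also match bytes that merely look like an EMR_EXTTEXTOUTW header.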

Here's the code in its entirety. Use it as a template to write your own bytecode or, as an exercise, compile it and use a hex editor to create a file that will trigger this bytecode signature.
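If you prefer generating the test bytes to typing them into a hex editor, a sketch along these lines may help. It lays out only the bytes the signature and the scan logic described in this post look at; the result is not a valid EMF file, and I have not verified that ClamAV's file typing will let the signature run on something this minimal. The function name, buffer size and record position are arbitrary choices for the sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TRIGGER_SIZE  128u  /* arbitrary total size chosen for this sketch */
#define RECORD_OFFSET 64u   /* arbitrary position of the fake record */

/* Fills out[0..TRIGGER_SIZE) and returns the number of bytes written,
   or 0 if the buffer is missing or too small. */
static size_t build_trigger(uint8_t *out, size_t cap)
{
    if (out == NULL || cap < TRIGGER_SIZE)
        return 0;
    memset(out, 0, TRIGGER_SIZE);

    out[0] = 0x01;                       /* header record type 01 00 00 00 */
    memcpy(out + 41, "EMF", 3);          /* satisfies {37}454d46 */

    out[RECORD_OFFSET] = 0x54;           /* 54 00 00 00: EMR_EXTTEXTOUTW */
    out[RECORD_OFFSET + 44 + 3] = 0x80;  /* Chars = 0x80000000, little endian */
    return TRIGGER_SIZE;
}
```

Write the buffer to a file with fwrite and scan that file with your unsigned-bytecode build of ClamAV to see whether the signature fires.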

Finally, before you run off and start writing your own code, keep in mind that you are writing code in C. That means you can introduce buffer overflows, infinite loops and so on. Check, double check, heck, triple check your code before you start using it in a production environment. That said, ClamAV does have some measures in place to keep bytecode from running out of control: memory accesses are bounds checked, bytecode execution has timeouts, and bytecode runs with stack smashing protection. When any of these conditions is detected at runtime, bytecode execution is stopped and ClamAV continues to execute normally. These protections are not guaranteed to be perfect, though, so you should still check your code!