You've whipped the winevent streams, deftly dealt with daunting data inputs, and perhaps even
concocted clever code for scripts. But while most logs are encoded in ASCII (or at least half-ASCII), there are still devices and applications out there which produce binary-encoded logfiles.
What if you have to Splunk such a beast? Getting this data into Splunk requires a little extra
work, but is a straightforward process. It will require some scripting skills (in your favorite
language, such as Perl, Python or Java), access to vendor reference manuals and hexadecimal
conversions, perseverance, and a ready supply of your favorite code-slinging beverage.

Bit Wise and Byte Foolish

Binary-encoded logfiles seemed like a good idea when they were introduced during the
Eisenhower administration: they are compact, taking up minimal space on your punchcard-based
storage; they are very orderly, which appealed to the pre-open-source mindset of the day; and they
probably require special utilities to even read the data, which appeals to vendors wanting to
sell more software. Voice switch systems from Huawei, Ericsson, Lucent and others still
produce binary-encoded logfiles.
The log data will be output in fixed-length records, composed of various fields of data. Just to
keep it interesting, the fields will probably be encoded in a variety of different formats, such as:

Binary Coded Decimal: Each decimal digit occupies four bits, and only hex
values 0-9 are used. Thus, a two-byte value which looks like this:

            Bit 7  Bit 6  Bit 5  Bit 4  Bit 3  Bit 2  Bit 1  Bit 0
1st byte      0      0      0      1      0      1      0      0
2nd byte      0      0      0      0      1      0      0      1

translates to a decimal value of 1409. Get it?

Little Endian: The least-significant byte comes first, which is backwards
from the conventional view of how things should be laid out. A four-byte
little-endian value which looks like this:

1st byte  2nd byte  3rd byte  4th byte
   04        00        00        00

translates to a hex value of 00000004, or decimal 4.

Big Endian: No, not Tonto. The most-significant byte comes
first. A four-byte big-endian value which looks like this:

1st byte  2nd byte  3rd byte  4th byte
   04        00        00        00

translates to a hex value of 04000000, or decimal 67108864.

ASCII: Occasionally the vendor will slip up and encode fields in plain-ole ASCII. But not often.
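The first three encodings can be checked with a few lines of code. Here is a minimal Python sketch (one of the languages suggested earlier, not the Perl used for the conversion script). The helper name `bcd_to_digits` is made up for illustration, and it assumes the high nibble of each byte holds the earlier digit.

```python
def bcd_to_digits(data: bytes) -> str:
    """Decode packed BCD: two decimal digits per byte, high nibble first (assumed)."""
    digits = []
    for byte in data:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            # Nibbles A-F are not legal BCD; better to fail loudly than guess.
            raise ValueError(f"not a BCD byte: 0x{byte:02X}")
        digits.append(f"{high}{low}")
    return "".join(digits)


four_bytes = bytes([0x04, 0x00, 0x00, 0x00])

print(bcd_to_digits(bytes([0x14, 0x09])))    # 1409
print(int.from_bytes(four_bytes, "little"))  # 4
print(int.from_bytes(four_bytes, "big"))     # 67108864
```

The same four bytes give wildly different numbers depending on byte order, which is why the vendor manual has to be consulted for every field.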

The Plan of Attack

Since we will be creating a script to convert binary log records into Splunk-ready ASCII, we
might as well make it as Splunk-friendly as possible. Thus, the script should:

Take input data from STDIN, and output converted data directly to STDOUT.
This will allow Splunk to stream the data inputs, in much the same way that it
handles compressed (e.g., gzip'd) files.

Output the data in key-value pairs, so that Splunk will auto-magically extract
the field names and corresponding values. Use a format like this:

FIELDNAME1=Field value 1,FIELDNAME2=Field value 2,
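As a rough sketch of that contract in Python (not the author's Perl): read fixed-length records from a binary stream and write one key-value line per record. The names `format_kv`, `convert`, and `main`, the pluggable `decode` callback, and the 17-byte record size are all assumptions made for illustration.

```python
import sys
from typing import BinaryIO, Callable, Dict, TextIO

RECORD_SIZE = 17  # assumed size; use the record length from your vendor manual


def format_kv(fields: Dict[str, object]) -> str:
    """Render one record as NAME=value pairs, each followed by a comma."""
    return "".join(f"{name}={value}," for name, value in fields.items())


def convert(src: BinaryIO, dst: TextIO, record_size: int,
            decode: Callable[[bytes], Dict[str, object]]) -> int:
    """Read fixed-length records from src and write one text line per record to dst."""
    count = 0
    while True:
        chunk = src.read(record_size)
        if len(chunk) < record_size:  # end of input, or a truncated final record
            break
        dst.write(format_kv(decode(chunk)) + "\n")
        count += 1
    return count


def main() -> None:
    # Placeholder decoder: dumps each record as hex until real field logic exists.
    convert(sys.stdin.buffer, sys.stdout, RECORD_SIZE,
            lambda record: {"RAW_HEX": record.hex()})
```

Calling `main()` at the bottom of the file turns this into a filter that reads STDIN and writes STDOUT. Note the use of `sys.stdin.buffer`: the text-mode `sys.stdin` would try to decode the bytes as characters and mangle them.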

After creating a conversion script, make the appropriate entries in inputs.conf and props.conf,
then start Splunking!

Here are some potential problem areas and recommendations:

Build in a debug option for your script to output in a more human-friendly
format, for initial development and later troubleshooting. Consider adding
extra linefeeds, record counter numbers and similar niceties.

Make sure to build in a way to sanity-check whether your utility has gotten off of record
boundaries. Since we are processing a sea of bits, it is not intuitively
obvious when our script has derailed and started processing the wrong data from
the wrong places in a record. Choose a field which has a predictable value,
such as sequential record numbers, the year portion of a date stamp, or
something similar. Add a valid_record field or some other means of
detecting that the conversion has derailed and that the field values may be
bogus for this record.

Whenever possible, go ahead and perform the lookup for enumerated
values. Thus, instead of showing User_Type=1, show
User_Type=PREPAID_SUBSCRIBER. This will simplify the Splunk
configuration, and improve the overall speed of processing records.

I prefer code which requires a minimum of external modules and
dependencies. This will make it more portable, though it may require some
extra coding. And the extra suffering will build character.
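The first three recommendations might look something like this in Python, using only the standard library. This is a sketch under loud assumptions: the `USER_TYPES` table (apart from the PREPAID_SUBSCRIBER example above), the plausible-year range, and all function names are invented for illustration.

```python
# Hypothetical lookup table: the real codes come from the vendor's reference manual.
USER_TYPES = {1: "PREPAID_SUBSCRIBER", 2: "POSTPAID_SUBSCRIBER"}


def lookup(table: dict, code: int) -> str:
    """Translate an enumerated code, keeping unknown codes visible instead of failing."""
    return table.get(code, f"UNKNOWN_{code}")


def is_valid_record(year: int, low: int = 1990, high: int = 2100) -> bool:
    """Sanity check on a predictable field: a plausible year suggests we are still aligned."""
    return low <= year <= high


def render(fields: dict, record_number: int, debug: bool = False) -> str:
    """Normal mode: one key=value line. Debug mode: numbered, one field per line."""
    if debug:
        lines = [f"--- record {record_number} ---"]
        lines += [f"  {name} = {value}" for name, value in fields.items()]
        return "\n".join(lines) + "\n"
    return "".join(f"{name}={value}," for name, value in fields.items())


year, user_type_code = 1998, 1  # stand-ins for values decoded from one record
fields = {
    "User_Type": lookup(USER_TYPES, user_type_code),
    "valid_record": is_valid_record(year),
}
print(render(fields, record_number=1))
print(render(fields, record_number=1, debug=True))
```

If the script slips off a record boundary, the "year" it decodes is usually nonsense, so `valid_record` goes False and the bad records are easy to find in a Splunk search.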

Let's look at an example, based on binary-encoded Call Data Records from a Well Known
Switch Vendor, and a Perl script to convert it. For the sake of brevity, we will assume a record
which contains only four fields, rather than the actual record which contains 100 fields (you're
welcome). I have written the conversion script in Perl, my hack-ware of choice. I have done just
enough software development to appreciate Real Software Developers, in the same way that I
have sweated just enough copper pipes to appreciate competent plumbers. Thus, I make no
claims that my code is the most elegant or the most efficient.

Let us assume that each binary record is 17 bytes long, and contains the following fields:

Serial_Number: Sequential record serial number, 4 bytes in little-endian
format. Example:

1st byte  2nd byte  3rd byte  4th byte
   01        02        03        04

This translates to a Serial_Number of 04030201 (hex), or 67305985 (decimal).

CDR_type: Enumerated value, 1 byte. Example:

CDR Type        Decimal Value
anything else

Charge_start_time: Indicates year, month, day, hour, minute, and second,
in sequence. The year is in little-endian format and occupies two bytes:
the lower byte is in front and the higher byte is in back. The month,
day, hour, minute, and second occupy one byte each. Example (hex):

1st byte  2nd byte  3rd byte  4th byte  5th byte  6th byte  7th byte
   CE        07        0A        0B        08        16        1A

This translates to a Charge_start_time of 1998/10/11 08:22:26.

Caller_party_number: 10-digit phone number, in Binary Coded Decimal
(BCD). 5 bytes, with 2 decimal digits encoded in each byte, in big-endian
order. Example (hex):

1st byte  2nd byte  3rd byte  4th byte  5th byte
   30        32        92        37        76

This translates to a Caller_party_number of 3032923776.
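One way to double-check a reading of this layout is to assemble the four example fields into a single test record and confirm that it comes out at 17 bytes. The following Python sketch does that with the standard `struct` module. The CDR_type value of 1 is a made-up example, since the layout above only says the field is one byte.

```python
import struct

serial_number = struct.pack("<I", 0x04030201)                 # 4 bytes, little-endian
cdr_type = bytes([1])                                         # 1 byte; the value 1 is hypothetical
charge_start = struct.pack("<H5B", 1998, 10, 11, 8, 22, 26)   # little-endian year + 5 single bytes
caller = bytes.fromhex("3032923776")                          # 5 bytes of packed BCD

record = serial_number + cdr_type + charge_start + caller

print(len(record))      # 17
print(record.hex(" "))  # 01 02 03 04 01 ce 07 0a 0b 08 16 1a 30 32 92 37 76
```

A record built this way also makes a handy fixture for testing a conversion script without access to a live switch. The separator argument to `bytes.hex` needs Python 3.8 or later.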

My Perl script processes the data in three steps:

1. Grab data in record-sized 17-byte chunks, assigning the data into raw
variables.

2. Rearrange/tweak the raw variables as necessary to derive the correct,
translated, ready-to-output values.

3. Output the translated records in pretty, Splunk-ready format.

The following script makes extensive use of the pack and unpack functions; we leave it as an
Exercise For The Student to learn the nuances of these Perl functions.
Here is a sample Perl script to process the binary-coded records described above.
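For readers who would rather not work in Perl, the same three steps can be sketched in Python, where the `struct` module plays the role of pack and unpack. Treat this as an illustration under stated assumptions rather than a port of the author's script: the `CDR_TYPES` names, the year range used for `valid_record`, the function names, and the choice to lead each line with the timestamp are all hypothetical.

```python
import struct
import sys

RECORD_SIZE = 17
# Hypothetical CDR_type names: substitute the table from the vendor manual.
CDR_TYPES = {1: "EXAMPLE_TYPE_A", 2: "EXAMPLE_TYPE_B"}
# "<" little-endian, no padding | I Serial_Number | B CDR_type
# H year | 5B month..second     | 5s Caller_party_number (raw BCD bytes)
LAYOUT = struct.Struct("<IBH5B5s")


def read_records(stream):
    """Step 1: yield raw field tuples, one per complete 17-byte record."""
    while True:
        chunk = stream.read(RECORD_SIZE)
        if len(chunk) < RECORD_SIZE:  # end of input or a truncated record
            return
        yield LAYOUT.unpack(chunk)


def translate(raw):
    """Step 2: turn raw values into output-ready fields."""
    serial, cdr_code, year, month, day, hour, minute, second, caller_bcd = raw
    return {
        "Serial_Number": serial,
        "CDR_type": CDR_TYPES.get(cdr_code, "UNKNOWN"),
        "Charge_start_time": f"{year:04d}/{month:02d}/{day:02d} "
                             f"{hour:02d}:{minute:02d}:{second:02d}",
        # Packed BCD reads as its own hex digits, so hex() is enough here.
        "Caller_party_number": caller_bcd.hex(),
        # Sanity check on predictable fields (the year range is an assumption).
        "valid_record": 1990 <= year <= 2100 and 1 <= month <= 12,
    }


def emit(fields, out):
    """Step 3: write one line per record: timestamp first, then key=value pairs."""
    pairs = ",".join(f"{k}={v}" for k, v in fields.items()
                     if k != "Charge_start_time")
    out.write(f"{fields['Charge_start_time']} {pairs}\n")


def main():
    for raw in read_records(sys.stdin.buffer):
        emit(translate(raw), sys.stdout)
```

Add a call to `main()` at the end of the file to run it as a STDIN-to-STDOUT filter. The format string `"<IBH5B5s"` adds up to exactly 17 bytes, which is a quick check that the layout matches the record length.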

Test It
From the command line, test the script in this way:
# cat binary_logfile |

It should produce output like this:

2010/12/02 04:07:39
2010/12/02 04:07:53
2010/12/02 03:59:18
. . .

Looking good!

Splunk It
Add a stanza to inputs.conf, similar to this:
disabled = 0
followTail = 0
host = voice_switch
index = cdr
sourcetype = cdr_binary

Add a stanza to props.conf to pre-process the binary logs. Make sure to use the same sourcetype
as the inputs.conf entry:
invalid_cause = archive
unarchive_cmd = /usr/local/bin/

And the rest is history.

Although the majority of logged event data is text/ASCII, there are still systems which generate
binary-encoded log data, including a number of voice switch products. Pulling this data into
Splunk can yield extremely valuable insights, such as call volumes, per-user trends and
fraud/abuse analysis. With a little scripting, such data can be readily converted and streamed into
Splunk. Amaze your coworkers, and possibly even your boss!

