Find Sensitive Data with Bulk Extractor

Bulk Extractor is a great tool for searching a disk image or file system for sensitive data. It ignores the file system structure and scans the raw data linearly. This, in combination with parallel processing, makes the tool very fast. It can miss data in fragmented files, but most files typically aren’t fragmented.

Follow the directions here for installation.

Using BEViewer, the Bulk Extractor GUI

While you may prefer the command line, in my opinion it is easier to get a base understanding of the tool starting with the GUI. The layout gives you a better idea of default settings and how everything works. Plus it generates the command line so that you can get a feel for the syntax.

Click on the Tools option and then run bulk_extractor like below…

…and you will be presented with a large selection of options!

Image

In the required parameters section, we can see that there are three options for the type of input that can be targeted: image files (e.g., E01), raw devices, and specific directories.

Next, select what you would like Bulk Extractor to search (this field changes depending on the input type you just chose). The output feature directory is where the results will be written.

Scanners

Scanners are another very important option. When you first open this view some are selected by default while others are not.

The lists below show which scanners are enabled and disabled by default. Most scanners write to a file that matches their name (e.g., the elf scanner writes to elf.txt). A description of each scanner follows.

Enabled:

Accts searches for credit card numbers, track data, phone numbers, and other numbers.
AES finds AES keys.
Base64 searches for Base64-encoded text.
Elf searches for ELF files.
Email searches for headers, cookies, hostnames, IPs, emails, and URLs.
Exif finds images and their metadata.
Find searches for specific regular expressions.
GPS finds Garmin-formatted XML containing GPS coordinates.
Gzip finds gzip-compressed files.
Hiberfile finds the Windows hibernation file.
Httplogs finds HTTP log files.
Json searches for JSON files.
Kml finds KML files.
Msxml searches for Microsoft XML Core Services.
Net finds packets in memory.
Pdf searches for text from PDF files.
Rar searches for RAR-compressed files.
Sqlite finds SQLite3 database files.
Vcard finds vCard files.
Windirs searches for Windows directories.
Winlnk finds Windows LNK files.
Winpe searches for Windows executables and DLLs.
Winprefetch searches for prefetch files.
Zip searches for ZIP-compressed files.

Disabled:

Base16 searches for hex-encoded data.
Facebook finds Facebook HTML.
Outlook finds Outlook Compressible Encryption.
Sceadan stands for Systematic Classification Engine for Advanced Data ANalysis. I am unsure what this scanner does.
Wordlist finds words. Potentially useful for building password lists.
Xor searches for data hidden by XOR encoding.
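
On the command line, the same scanner selection is expressed with -e (enable a scanner) and -x (disable a scanner). The following is a minimal Python sketch that builds those arguments, assuming the lowercase scanner names used for the output files; the helper name scanner_flags and the file names image.raw and out_dir are hypothetical.

```python
def scanner_flags(enable=(), disable=()):
    """Build bulk_extractor arguments that adjust the default scanner set.

    -e <scanner> enables a scanner and -x <scanner> disables one.
    """
    args = []
    for name in disable:
        args += ["-x", name]
    for name in enable:
        args += ["-e", name]
    return args


# Hypothetical example: turn on wordlist, turn off zip and rar.
cmd = ["bulk_extractor",
       *scanner_flags(enable=["wordlist"], disable=["zip", "rar"]),
       "-o", "out_dir", "image.raw"]
print(" ".join(cmd))
# prints: bulk_extractor -x zip -x rar -e wordlist -o out_dir image.raw
```

The flags are applied in the order given, so it is worth checking the generated command against the one BEViewer produces for the same selection.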

General Options

The banner file puts a banner at the beginning of each output file.
Alert list flags specific terms: when a listed term is found, it is written to an alert file.
Stop list specifies a whitelist of features; matches are written to a separate file instead of the main output files.
Regex text and Regex text file search for the regular expressions you specify, either typed in directly or read from a file.
Random sample will ostensibly scan only a random sample of the data.
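
These general options map to short command-line flags. The sketch below builds them in Python, assuming the flags listed in the bulk_extractor 1.x help output (-b, -r, -w, -f, -F, -s); check them against `bulk_extractor -h` for your version. The helper name general_flags and all file names are hypothetical.

```python
def general_flags(banner=None, alert_list=None, stop_list=None,
                  regexes=(), regex_file=None, sample_fraction=None):
    """Translate the general options into bulk_extractor arguments."""
    args = []
    if banner:
        args += ["-b", banner]        # banner added to every output file
    if alert_list:
        args += ["-r", alert_list]    # terms to alert on
    if stop_list:
        args += ["-w", stop_list]     # whitelist of features
    for rx in regexes:
        args += ["-f", rx]            # one -f per regular expression
    if regex_file:
        args += ["-F", regex_file]    # file containing regular expressions
    if sample_fraction is not None:
        args += ["-s", str(sample_fraction)]  # random sampling fraction
    return args


print(general_flags(banner="banner.txt", regexes=["secret"], sample_fraction=0.1))
# prints: ['-b', 'banner.txt', '-f', 'secret', '-s', '0.1']
```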

Tuning Parameters

These relate primarily to how the scanners perform their scan.

Context window size specifies how much surrounding context is recorded with each feature a scanner finds.
Page size is how many bytes bulk_extractor processes at a time.
Margin size determines how much overlap there is between consecutive pages (to avoid missing data that spans a page boundary).
Block size
Number of threads determines how many threads bulk_extractor uses; it defaults to the number of processors on the computer.
Maximum recursion depth is how deep it will search through nested data (for example, zipped files inside zipped files).
Wait time is how long bulk extractor will wait for scanners to finish after all the data has been read.
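
To make the page and margin idea concrete, here is a small Python sketch of how a linear scan could be split into pages that each read a little past their own end. This illustrates the concept only and is not bulk_extractor's actual implementation; the function name page_windows is hypothetical, and the default values are assumptions (16 MiB pages with a 4 MiB margin).

```python
def page_windows(total_size, page_size=16 * 1024 * 1024, margin=4 * 1024 * 1024):
    """Yield (page_start, page_end, buffer_end) for a linear scan.

    Each page covers [page_start, page_end). The scanner may read on
    to buffer_end, so data that straddles a page boundary is still seen.
    """
    offset = 0
    while offset < total_size:
        page_end = min(offset + page_size, total_size)
        buffer_end = min(page_end + margin, total_size)
        yield offset, page_end, buffer_end
        offset += page_size


# Tiny numbers so the overlap is easy to see.
for start, page_end, buffer_end in page_windows(100, page_size=40, margin=10):
    print(start, page_end, buffer_end)
# prints:
# 0 40 50
# 40 80 90
# 80 100 100
```

On the command line, page size and margin are set with -G and -g respectively.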

Parallelizing

Start processing at a specific offset.
Process only the data between two offsets in the file/directory.
Add a value to all reported feature offsets.
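
One way to use these options is to split an image into offset ranges and run one bulk_extractor process per range with -Y. The Python sketch below only builds the commands and does not run them. It rests on several assumptions: ranges are aligned to the page size as a precaution (on the assumption that bulk_extractor expects page-aligned offsets), the last range uses the open-ended form of -Y, and each run gets its own output directory. The helper names, out_part prefix, and image.raw are hypothetical.

```python
PAGE_SIZE = 16 * 1024 * 1024  # assumed default page size in bytes


def offset_ranges(total_size, workers, page_size=PAGE_SIZE):
    """Split [0, total_size) into page-aligned (start, end) ranges.

    The final range has end=None, meaning "from start to the end".
    """
    if total_size <= 0 or workers < 1 or page_size < 1:
        raise ValueError("total_size, workers and page_size must be positive")
    pages = -(-total_size // page_size)       # ceiling division
    pages_per_worker = -(-pages // workers)
    step = pages_per_worker * page_size
    ranges = []
    for start in range(0, total_size, step):
        end = start + step
        ranges.append((start, end if end < total_size else None))
    return ranges


def parallel_commands(image, total_size, workers, page_size=PAGE_SIZE,
                      out_prefix="out_part"):
    """Build one bulk_extractor command per offset range."""
    cmds = []
    for i, (start, end) in enumerate(offset_ranges(total_size, workers, page_size)):
        span = str(start) if end is None else f"{start}-{end}"
        cmds.append(["bulk_extractor", "-Y", span,
                     "-o", f"{out_prefix}{i}", image])
    return cmds


for cmd in parallel_commands("image.raw", 100, 3, page_size=10):
    print(" ".join(cmd))
# prints:
# bulk_extractor -Y 0-40 -o out_part0 image.raw
# bulk_extractor -Y 40-80 -o out_part1 image.raw
# bulk_extractor -Y 80 -o out_part2 image.raw
```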

Debugging Options

Start at a specific page.
Set a debug mode (see the source code for the different debugging options).
Erase the output directory.

Scanner Controls

Plugin directories specifies a directory for plugins (default /usr/local/lib/bulk_extractor and /usr/lib/bulk_extractor)
Use settable options allows you to set options that can be found on page 24 of the user manual.

Command Line

The basic syntax for using bulk_extractor from the command line is as follows:

bulk_extractor -o <out_dir> <image>

To scan a directory of files recursively instead of a single image, use -R:

bulk_extractor -o <out_dir> -R <dir>
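
If you want to drive these commands from a script, the following Python sketch builds either form of the command, runs it, and reads back one of the resulting feature files. It assumes feature files are tab-separated text (offset, feature, context) with comment lines starting with #; the function names and the choice of email.txt are hypothetical.

```python
import subprocess
from pathlib import Path


def build_command(out_dir, target, recurse=False):
    """Return the bulk_extractor command for an image or, with -R, a directory."""
    cmd = ["bulk_extractor", "-o", str(out_dir)]
    if recurse:
        cmd.append("-R")
    cmd.append(str(target))
    return cmd


def parse_feature_lines(lines):
    """Yield (offset, feature, context) tuples, skipping comments and blanks."""
    for line in lines:
        line = line.lstrip("\ufeff").rstrip("\r\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        offset = parts[0]  # kept as text, since it may not be a plain number
        feature = parts[1] if len(parts) > 1 else ""
        context = parts[2] if len(parts) > 2 else ""
        yield offset, feature, context


def run_and_read(out_dir, target, feature_file="email.txt", recurse=False):
    """Run bulk_extractor, then parse one feature file from its output."""
    subprocess.run(build_command(out_dir, target, recurse), check=True)
    path = Path(out_dir) / feature_file
    with open(path, encoding="utf-8-sig", errors="replace") as fh:
        return list(parse_feature_lines(fh))
```

Calling run_and_read("out_dir", "image.raw") requires bulk_extractor on the PATH; the two helper functions can be used on their own without it.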