Bulk Extractor is a great tool for searching a disk image for sensitive data. It ignores the file system structure and scans the raw data linearly. This, in combination with parallel processing, makes the tool very fast. It can miss data in fragmented files, but typically, files aren’t fragmented.
Follow the directions here for installation.
Using BEViewer, the Bulk Extractor GUI
While you may prefer the command line, in my opinion it is easier to get a basic understanding of the tool by starting with the GUI. The layout gives you a better idea of the default settings and how everything works. Plus, it generates the command line for you, so you can get a feel for the syntax.
Click on the Tools option, then run bulk_extractor, and you will be presented with a large selection of options!
In the required parameters section, there are three options for the type of input that can be targeted: image files (e.g. E01), raw devices, and specific directories.
Next, select what you would like bulk_extractor to search (this field changes based on the previous choice). The output feature directory is where the results will be written.
Scanners are another very important option. When you first open this view some are selected by default while others are not.
Most scanners output to files that match their names (e.g. the elf scanner will output to elf.txt). Below is a description of the different scanners, first those enabled by default and then those disabled by default.
- Accts searches for credit card numbers, track data, phone numbers, and other numbers
- AES finds AES keys
- Base64 searches for Base64 encoded text
- Elf searches for ELF type files
- Email searches for headers, cookies, hostnames, IPs, emails, and URLs
- Exif finds images and their metadata
- Find is used for finding specific regular expressions
- GPS finds Garmin-formatted XML containing GPS coordinates
- Gzip finds gzip compressed files
- Hiberfile finds the Windows hibernation file
- Httplogs finds HTTP log files
- Json searches for JSON type files
- Kml finds KML type files
- Msxml searches for Microsoft XML Core Services
- Net finds packets in memory
- Pdf searches for text from PDF files
- Rar searches for RAR compressed files
- Sqlite finds SQLite3 database files
- Vcard finds vCard type files
- Windirs searches for Windows directories
- Winlnk finds Windows LNK files
- Winpe searches for Windows executables and DLLs
- Winprefetch searches for prefetch files
- Zip searches for ZIP compressed files
- Base16 searches for hex code
- Facebook finds Facebook HTML
- Outlook finds Outlook Compressible Encryption
- Sceadan stands for Systematic Classification Engine for Advanced Data ANalysis. Unsure what this scanner does.
- Wordlist finds words, which is potentially useful for building password lists
- Xor searches for data hidden by XOR encoding
- The banner file will put a banner at the beginning of each output file.
- Alert list will create an alert file for specific terms when found.
- Stop list specifies a whitelist; features that match it are put into a special file instead of the regular output.
- Regex text and Regex text file will search for specified regular expressions.
- Random sample will ostensibly take a random sample of the data to search through.
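To make the regex text file and stop list options a little more concrete, here is a rough sketch of preparing both files from a shell. Everything named here is hypothetical (the file names, the patterns, the email addresses, image.raw), and I am assuming both files hold one entry per line. The -F and -w flags are the command line equivalents of these GUI fields. The command is only printed, not run, so you can review it first.

```shell
#!/bin/sh
# Hypothetical file names throughout; adjust for your own case.
work_dir=$(mktemp -d)

# Regex text file: assumed to hold one regular expression per line.
# Which regex syntax is accepted depends on how bulk_extractor was built.
cat > "$work_dir/patterns.txt" <<'EOF'
[Pp]assword[:=]
[0-9]{3}-[0-9]{2}-[0-9]{4}
EOF

# Stop list: assumed to hold one known-benign feature per line.
cat > "$work_dir/stop_list.txt" <<'EOF'
support@example.com
noreply@example.com
EOF

# Build and print the command instead of executing it.
be_cmd="bulk_extractor -o be_output -F $work_dir/patterns.txt -w $work_dir/stop_list.txt image.raw"
echo "$be_cmd"
```

Keeping the patterns in a file rather than typing them into the Regex text field makes it easier to reuse the same search across several images.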
The next options relate primarily to how the scanners will perform their scans.
- The context window size specifies how much surrounding context the scanners record for each feature (default 16).
- The page size is how many bytes bulk_extractor processes at a time.
- The margin size determines how much overlap there is between pages (to avoid missing data that crosses a page boundary).
- Block size
- Number of threads determines how many threads bulk_extractor will use. It defaults to the number of processors on the computer.
- Maximum recursion depth is how deep it will search through nested data (for example, zipped files inside zipped files).
- Wait time is how long bulk extractor will wait for scanners to finish after all the data has been read.
- Start processing at a specific offset
- Process only the range between two offsets in the file/directory
- Add a value to the reported offsets
- Start at a specific page
- Debug mode (see the source code for the different debugging options)
- Erase the output directory
- Plugin directories specifies a directory for plugins (default /usr/local/lib/bulk_extractor and /usr/lib/bulk_extractor)
- Use settable options allows you to set options that can be found on page 24 of the user manual.
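To show how the page size and margin settings above interact, here is a small arithmetic sketch. It assumes a page size of 16777216 bytes (16 MiB) and a margin of 4194304 bytes (4 MiB), which I believe are the defaults in the 1.x releases, so check your own version. It also assumes that each page is read together with the margin that follows it. The function names are my own and not part of bulk_extractor.

```shell
#!/bin/sh
# Assumed defaults; verify against your bulk_extractor version.
page_size=16777216   # 16 MiB
margin=4194304       # 4 MiB

# Number of pages needed to cover an image of $1 bytes (rounded up).
page_count() {
    echo $(( ($1 + page_size - 1) / page_size ))
}

# Byte range read for zero-based page $1, printed as "start end"
# (end is exclusive and includes the margin after the page).
page_window() {
    start=$(( $1 * page_size ))
    end=$(( start + page_size + margin ))
    echo "$start $end"
}

page_count 1073741824   # a 1 GiB image -> 64
page_window 1           # -> 16777216 37748736
```

A larger margin lets scanners follow a feature further past the end of a page, at the cost of reading more overlapping data.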
The basic syntax for using bulk_extractor from the command line is as follows:
bulk_extractor -o <out_dir> <image>
bulk_extractor -o <out_dir> -R <dir>
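As a minimal sketch of the two forms above, the helper below prints the command it would run and picks -R when the target is a directory. The name be_command and the paths are hypothetical. As far as I know, bulk_extractor wants to create the output directory itself, so it should not exist beforehand.

```shell
#!/bin/sh
# be_command is a hypothetical helper: it prints (does not run) the
# bulk_extractor command for a target, using -R when the target is a directory.
be_command() {
    out_dir=$1
    target=$2
    if [ -d "$target" ]; then
        printf 'bulk_extractor -o %s -R %s\n' "$out_dir" "$target"
    else
        printf 'bulk_extractor -o %s %s\n' "$out_dir" "$target"
    fi
}

# Example with placeholder names; paths containing spaces would need quoting.
be_command be_output evidence.E01
```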
- -o <dir> – puts the results in the <dir> directory.
- -R <dir> – scans a directory recursively.
- -E <scanner> – disables all scanners except <scanner>.
- -e <scanner> – enables <scanner> (typically for disabled scanners).
- -x <scanner> – disables <scanner>.
- -b <file> – sets banner file to <file>.
- -r <file> – sets alert list to <file>.
- -w <file> – sets stop list to <file>.
- -f <regex> – searches for <regex>.
- -F <file> – searches for the regexes listed in <file>.
- -W<num1>:<num2> – only extracts words between <num1> and <num2> in length.
- -s frac[:<num>] – sets random sampling values.
- -C <num> – sets the context window to <num> (default 16).
- -S fr:<name>:[window=<num>|window_before=<num>|window_after=<num>] – sets the context window (overall, before, or after the feature) to <num> for recorder <name>.
- -G <num> – sets the page size to <num>.
- -g <num> – sets the margin to <num>.
- -j <num> – sets number of threads to <num>.
- -M <num> – sets max recursion depth to <num>.
- -m <num> – sets max number of minutes to wait to <num>.
- -Y <offset1>[-<offset2>] – starts at <offset1> and goes to <offset2> if specified.
- -A <num> – adds <num> to the reported offset.
- -V – prints the version.
- -H – prints detailed info on the scanners.
- -z <num> – starts on page <num>.
- -d<num> – uses debug mode <num> (note the lack of space).
- -Z – deletes the output directory.
- -P <dir> – specifies the plugin directory.
- -S <option>=<value> – method for setting settable options (e.g. word_min=6 for minimum size of words to report).
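One place where a little scripting helps is the -Y option. The sketch below splits an image of a given size into equal byte ranges and prints one bulk_extractor command per range, each with its own output directory. The function name split_ranges, the out_part directories, and image.raw are all hypothetical. I have not verified whether the end offset is inclusive or whether ranges should be aligned to the page size, so treat the boundaries as an assumption and check them on a small test image.

```shell
#!/bin/sh
# split_ranges SIZE PARTS
# Prints (does not run) one command per byte range of an image of SIZE bytes.
split_ranges() {
    size=$1
    parts=$2
    chunk=$(( (size + parts - 1) / parts ))   # range length, rounded up
    start=0
    i=0
    while [ "$start" -lt "$size" ]; do
        end=$(( start + chunk ))
        if [ "$end" -gt "$size" ]; then
            end=$size
        fi
        echo "bulk_extractor -o out_part$i -Y $start-$end image.raw"
        start=$end
        i=$(( i + 1 ))
    done
}

# Example: a 1000-byte "image" split four ways.
split_ranges 1000 4
```

Each printed command could then be run separately, for instance on different machines, with the results merged afterwards.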
And that is Bulk Extractor. It’s quick and quite useful. Hopefully you agree! As always, keep hacking.