File Carving

File carving is mainly used in computer forensics to extract files from a body of raw data without needing to know which file system created them.
Generally, files of the same type share common characteristics. For example, if we look at their structure, all JPG/JFIF files start with the bytes FF D8 FF E0 and end with FF D9, as the following image shows:
Blocks of data that correspond to JPG files can therefore be identified from the markers at the beginning and end of their structure.
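The idea can be illustrated with a minimal sketch in Python. It assumes the raw data fits in memory and that each image is stored contiguously; the function name `carve_jpegs` and the 10 MB size cap are illustrative choices, not part of any tool mentioned in this article. Real carvers also have to cope with fragmentation and with FF D9 bytes that occur inside an image (for example in an embedded thumbnail).

```python
from typing import List

JPEG_HEADER = b"\xff\xd8\xff\xe0"  # start of a JPG/JFIF file
JPEG_FOOTER = b"\xff\xd9"          # end-of-image marker


def carve_jpegs(data: bytes, max_size: int = 10 * 1024 * 1024) -> List[bytes]:
    """Return every header..footer span found in data (naive, contiguous files only)."""
    carved = []
    pos = 0
    while True:
        start = data.find(JPEG_HEADER, pos)
        if start == -1:
            break  # no more headers
        end = data.find(JPEG_FOOTER, start + len(JPEG_HEADER))
        if end == -1:
            break  # header without any footer after it
        end += len(JPEG_FOOTER)
        if end - start <= max_size:
            carved.append(data[start:end])
            pos = end
        else:
            # Candidate is implausibly large: skip this header and keep looking.
            pos = start + len(JPEG_HEADER)
    return carved
```

Each element of the returned list can be written to its own file and opened in an image viewer to check whether it is a genuine image.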
This technique is very useful in a number of situations, for instance when storage devices have become corrupted or damaged or, as mentioned above, when forensic analysis is being carried out as part of an incident investigation. A practical example of its use is the analysis carried out on the hard disks and removable drives seized by U.S. Navy SEALs from Osama Bin Laden’s compound.
Many file types share a similar structure: they begin with a constant known as the magic number, which enables the corresponding file type to be identified. The table below includes a series of example magic numbers along with the file types associated with them.
A more extensive list is available here.
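As a rough illustration of how magic numbers are used, the sketch below checks the first bytes of a buffer against a handful of widely documented signatures. The selection of signatures and the helper name `identify` are chosen here for illustration and do not necessarily match the table referred to above.

```python
from typing import Optional

# A few well-known magic numbers (illustrative selection only).
MAGIC_NUMBERS = {
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"%PDF-": "PDF document",
    b"PK\x03\x04": "ZIP archive",
    b"\x7fELF": "ELF executable",
}


def identify(data: bytes) -> Optional[str]:
    """Return the file type whose magic number prefixes data, or None."""
    # Try longer signatures first so the most specific match wins.
    for magic in sorted(MAGIC_NUMBERS, key=len, reverse=True):
        if data.startswith(magic):
            return MAGIC_NUMBERS[magic]
    return None
```

A match on the magic number alone is only a hint: short signatures in particular can occur by chance in unrelated data.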
There are a number of different “file carving” techniques:
- Those based on the header and footer of a file or, if the footer is not known, on the maximum file size (available in the header).
- Those based on the structure of a given file: header, footer, significant strings, size, etc.
- Those based on the content of the file: entropy, language recognition, static attributes, etc. For example: HTML, XML, etc.
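As a sketch of the content-based approach, the following function computes the Shannon entropy of a block of bytes. Plain text such as HTML or XML tends to score noticeably lower than compressed or encrypted data, which approaches 8 bits per byte. The function name is illustrative, and any threshold used to separate the two would be an assumption to tune against real data.

```python
import math
from collections import Counter


def shannon_entropy(block: bytes) -> float:
    """Entropy of block in bits per byte, from 0.0 (constant) to 8.0 (uniform)."""
    if not block:
        return 0.0
    total = len(block)
    counts = Counter(block)  # occurrences of each byte value
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A carver can compute this value block by block and use sharp changes in entropy as hints about where one kind of content ends and another begins.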
There is also a host of tools, both free and proprietary, which perform “file carving” with varying levels of effectiveness and usefulness. A few of these tools are described below:
- Foremost - http://foremost.sourceforge.net is an open source tool developed by the U.S. Air Force Office of Special Investigations. It can work both with device images (dd, EnCase, etc.) and directly on the device. It is designed to recover information from Linux environments. Its limitation is that it can only process files of up to 2 GB.
- Scalpel - https://github.com/sleuthkit/scalpel is an open source tool based on Foremost, but much more efficient. It is hosted under The Sleuth Kit's GitHub organisation. It is designed to recover information from Linux, OS X and Windows environments.
- AccessData Forensic Toolkit (FTK) - https://accessdata.com/products-services/forensic-toolkit-ftk is a particularly comprehensive suite for conducting forensic analysis. One of its many functions is advanced “carving”, which allows search criteria such as file size, data type and pixel size to be specified in order to reduce the amount of irrelevant data extracted.
- X-Ways Forensics (WinHex) - http://www.x-ways.net/forensics/, like FTK, is a comprehensive forensic analysis suite. Its “carving” features are not as easily configured, but they are just as powerful.
Practical example
PCAP (“packet capture”) is the file format used to store network traffic captures. There is a host of both open source and commercial tools that handle this file format: protocol analysers, network monitors, intrusion detection systems, traffic capture programs (packet sniffers), traffic generators, etc.
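PCAP files have a magic number of their own, so the same identification idea applies to them. The sketch below parses the 24-byte global header of a classic PCAP file (not the newer pcapng format) using only the standard library; the function name and the keys of the returned dictionary are illustrative.

```python
import struct

# Magic number as stored on disk -> (struct byte-order prefix, timestamp resolution)
PCAP_MAGICS = {
    b"\xd4\xc3\xb2\xa1": ("<", "microseconds"),
    b"\xa1\xb2\xc3\xd4": (">", "microseconds"),
    b"\x4d\x3c\xb2\xa1": ("<", "nanoseconds"),
    b"\xa1\xb2\x3c\x4d": (">", "nanoseconds"),
}


def parse_pcap_header(header: bytes) -> dict:
    """Decode the global header found at the start of a classic PCAP file."""
    if len(header) < 24:
        raise ValueError("a PCAP global header is 24 bytes long")
    magic = header[:4]
    if magic not in PCAP_MAGICS:
        raise ValueError("not a classic PCAP file")
    endian, resolution = PCAP_MAGICS[magic]
    # version major/minor, timezone offset, timestamp accuracy, snaplen, link type
    major, minor, _zone, _sigfigs, snaplen, linktype = struct.unpack(
        endian + "HHiIII", header[4:24]
    )
    return {
        "version": (major, minor),
        "resolution": resolution,
        "snaplen": snaplen,
        "linktype": linktype,  # 1 means Ethernet
    }
```

Reading the first 24 bytes of a capture and passing them to `parse_pcap_header` is a quick sanity check before handing the file to heavier tooling.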
There are also free online repositories from which traffic captures can be downloaded for testing:
- Contagio Dump - http://contagiodump.blogspot.com.es/2013/04/collection-of-pcap-files-from-malware.html
- Wireshark SampleCaptures - https://wiki.wireshark.org/SampleCaptures#Sample_Captures
- NETRESEC Publicly available PCAP files - http://www.netresec.com/?page=PcapFiles
- Chris Sanders’ blog - http://chrissanders.org/packet-captures/
The situation in our example is that a computer has been identified, through analysis using an anti-botnet service, as being involved in botnet activity. The detection systems installed on the computer cannot identify any sort of malware, although the slow connection speed suggests that it could be compromised, so a small network traffic capture has been obtained.
The resulting PCAP file, which is available here, has been analysed and certain anomalies have been found, such as SMTP traffic that should not be there. It has therefore been decided to identify and extract the emails contained in the capture, so that the type of email (phishing, malware, etc.) can be determined and information about the threat can be obtained.
Python has been chosen for this as it is a flexible, powerful and simple language. A number of libraries are available for it, the most noteworthy here being Scapy, which enables PCAP files to be read and manipulated. The code used is as follows:
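A minimal sketch of this kind of extraction, which is not necessarily the original script, might look like the following. It assumes unencrypted SMTP on port 25 over IPv4, and the function names `extract_messages` and `client_streams` are illustrative. Scapy is a third-party package and is imported only inside the function that needs it; the calls used (`rdpcap`, `haslayer`, and the `IP`, `TCP` and `Raw` layers) are part of its public API. Stream reassembly is deliberately simplistic: segments are ordered by TCP sequence number, with no handling of sequence wrap-around or gaps.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

SMTP_PORT = 25  # assumption: plain SMTP on the standard port

# One message = everything between the DATA command and the line holding a lone "."
_MESSAGE = re.compile(rb"^DATA\r\n(.*?)\r\n\.\r\n", re.I | re.M | re.S)


def extract_messages(client_stream: bytes) -> List[bytes]:
    """Return the raw emails found in the client side of an SMTP conversation."""
    messages = []
    for match in _MESSAGE.finditer(client_stream):
        # Undo SMTP dot-stuffing: a leading "." on a line is doubled in transit.
        messages.append(match.group(1).replace(b"\r\n..", b"\r\n."))
    return messages


def client_streams(pcap_path: str) -> Dict[Tuple[str, int, str], bytes]:
    """Rebuild the client-to-server byte stream of each SMTP connection."""
    from scapy.all import IP, TCP, Raw, rdpcap  # third-party: pip install scapy

    segments = defaultdict(dict)
    for pkt in rdpcap(pcap_path):
        if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt.haslayer(Raw):
            if pkt[TCP].dport == SMTP_PORT:
                key = (pkt[IP].src, pkt[TCP].sport, pkt[IP].dst)
                # Keying on the sequence number discards retransmitted segments.
                segments[key][pkt[TCP].seq] = bytes(pkt[Raw].load)
    return {
        key: b"".join(payload for _, payload in sorted(by_seq.items()))
        for key, by_seq in segments.items()
    }
```

Calling `extract_messages` on each stream returned by `client_streams` yields the raw messages, which can then be saved as `.eml` files and opened in a mail client or parsed with Python's standard `email` module to pull out attachments.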
And this email is the result:
The evidence shows that the computer has been sending emails impersonating an airline, with a malicious file attached. Analysis of the file confirms that it contained malware related to the botnet known as Asprox, so the necessary measures to disinfect the computer are then taken.
It is important to bear in mind certain aspects of “file carving” that make it a tricky process:
- It usually works on a very high volume of data.
- The process takes a good deal of time.
- In a number of cases, results are partial or incomplete and there is a lot of information that cannot be recovered.
- There is a high number of false positives.
- The files’ content can be recovered but their metadata and directory structure cannot be.
- This type of tool can be deceived by anti-forensic techniques.
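One common way to reduce false positives is to validate a candidate against the internal structure of its format before accepting it. The sketch below does this for PNG, chosen here purely for illustration because every PNG chunk carries its own CRC-32 checksum. The function names are hypothetical, and `build_chunk` exists only to create a synthetic test blob.

```python
import struct
import zlib
from typing import Optional

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def build_chunk(ctype: bytes, body: bytes = b"") -> bytes:
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(ctype + body) & 0xFFFFFFFF
    return struct.pack(">I", len(body)) + ctype + body + struct.pack(">I", crc)


def validated_png_length(data: bytes, start: int = 0) -> Optional[int]:
    """Return the PNG's length if every chunk up to IEND has a valid CRC, else None."""
    if data[start:start + 8] != PNG_SIGNATURE:
        return None
    pos = start + 8
    while pos + 12 <= len(data):  # 12 bytes = smallest possible chunk
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        ctype = data[pos + 4:pos + 8]
        end = pos + 8 + length + 4
        if end > len(data):
            return None  # chunk claims more data than is available
        body = data[pos + 8:pos + 8 + length]
        (stored_crc,) = struct.unpack(">I", data[end - 4:end])
        if zlib.crc32(ctype + body) & 0xFFFFFFFF != stored_crc:
            return None  # corrupted chunk or a chance signature match
        pos = end
        if ctype == b"IEND":
            return pos - start
    return None
```

A candidate that matches the signature only by chance will almost always fail the first CRC check, so it can be discarded without manual inspection. A side benefit is that the function returns the exact file length, which removes the need to search for a footer.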