Every type of file which exists on standard computers typically is accompanied by a file signature, often referred to as a ‘magic number’. A file signature is typically 1-4 bytes in length and located at offset 0 in the file when inspecting raw data but there are many exceptions to this. Certain files such as a ‘Canon RAW’ formatted image or ‘GIF’ files have signatures larger than 4 bytes and others such as a ISO9660 CD/DVD ISO image file have signatures located at separate offsets other than 0. A comprehensive list of file signatures in HEX format, the commonly associated file extension and a brief description of the file may be found at https://www.garykessler.net/library/file_sigs.html, courtesy of Gary Kessler. Unfortunately there exists no penultimate compendium of magic numbers and it is possible for malicious software to disguise its magic number, potentially masquerading as another file type. Typically, detecting a certain magic number will indicate the file type but the specific file type may not always have the correct magic number. Analyzing files to look at their current file signature and compare it to the existing extension is a core feature of certain forensics software such as FTK or EnCase but it can be done in a simpler fashion through basic Python scripting which doesn’t require the usage of external utilities.
First, a list of known HEX signatures, the off-set they exist at and a brief description along with the associated extensions is established in a space-delimited format in order to have a reference for future analysis and comparison purposes. A sample of the created list is shown below.
The list created is not by any means comprehensive but it is easily modular by simply addition additional file signatures, offsets and associated extensions wherever one would like to. The script first loads these signatures into memory via an appended list as shown in the code snippet below.
‘loadSigs()’ functions to append the HEX signature, expected offset and description/extension to ‘siglist’ for usage later in the script. Immediately after loading the known signatures, the user is able to select a path from which to begin recursive scanning of detected files, with the code snippet below demonstrating path detection existence capabilities.
In the above screen, we can observe that the user must enter a path rather than a specific file and the path must exist before the script will continue. Additionally, the user can select the maximum file size to scan, allowing for the exclusion of files over a particular size. This is useful since most malware will not exceed 25-100 MegaBytes and even malware on the scale of greater than 5-10 MegaBytes are extremely uncommon. A snippet of the code for this functionality is shown below.
The next called function, ‘scanforPE()’, allows the user to specify whether they would like to scan for a specific extension type or simply scan all detected extensions. This is useful if the user is looking to scan, for example, all JPEG files in a particular directory for hidden EXE but does not wish to scan other file types. An example of this functionality is shown below.
In recursively scanning through OS directories, the script hands each file off as a parameter argument to ‘isPE()’ which in turn makes sure the file is open-able and then passes it as parameter argument to ‘scanTmp()’. The overall goal of the ‘scanTmp’ function is to check the current file-size against the max size, skipping if greater and then to read the binary into a raw binary dump which is in turn converted to upper-case HEX via ‘hexlify’, as shown in the image below.
As shown above, after the raw binary data is dumped into upper-case HEX format the temporary object is passed to another function labelled ‘checkSig()’. ‘checkSig’ consists of the main business logic for the script and performs a variety of functions which in all likelihood should probably be split up further. Essentially, it takes in the previously dumped temporary file, examines the signature list and puts the file-signature and offset into appropriate formats and then it calls another function, ‘getsubstring’, which takes a slice of the file at the location where a signature is expected for the associated file extension. It then cuts the original file down to the same location slice and tests to see whether or not the original file slice is found within the sliced signature string, which would indicate a potential signature detection. If this occurs, the extension type is compared to the expected type in order to determine whether a mis-match has been detected which may indicate a potentially malicious file masquerading as another extension type. The function is relatively inelegant and displaying it here would not provide much benefit but it may be studied at the source GitHub link given at the end of this post. Some additional screenshots of the script in action are shown below.
Once this operation is complete for all signatures and all detected files, a report is written detailing all possible detections, mismatches and files which were skipped due to their size or for permission reasons and it may be reviewed at the investigator’s leisure. This is a basic and naive attempt at file signature analysis but it helps to demonstrate how it may be achieved without the usage of expensive utilities such as EnCase.