In our previous posts, we covered content type validation and file size validation as the first two layers of defense in our file upload security pipeline. Today, we're diving into what I consider the most critical validation step: file signature validation, also known as "magic number" validation. This is where we stop trusting what files claim to be and start verifying what they actually are.
The Problem: files that lie
Here's a sobering truth: both content type headers and file extensions are trivially easy to manipulate. An attacker can:
- Rename malicious.php to harmless.jpg
- Upload a PHP web shell with the content type set to image/jpeg
- Disguise an executable as a PDF by simply changing the extension
- Bypass your content type validation while still delivering malicious payloads
Consider this scenario: Your application accepts image uploads for user profiles. You've implemented content type validation that only allows image/jpeg, image/png, and image/gif. An attacker uploads a file with:
- Filename: profile.jpg
- Content-Type header: image/jpeg
- Actual content: A PHP web shell
Your content type validator sees image/jpeg and happily accepts it. The file gets stored in your uploads directory. If that directory is served by your web server and configured to execute PHP files, the attacker can now execute arbitrary code on your server by simply navigating to the uploaded file.
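To make the scenario concrete, here is a minimal sketch of a validator that trusts only the declared content type. The class and method names are hypothetical and are not taken from the earlier posts.

```csharp
using System;
using System.Collections.Generic;

public static class HeaderOnlyValidator
{
    private static readonly HashSet<string> Allowed =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "image/jpeg", "image/png", "image/gif"
        };

    // Trusts the client-supplied Content-Type value and never looks at the
    // bytes, so the content parameter is deliberately unused.
    public static bool IsAllowed(string declaredContentType, byte[] content)
        => Allowed.Contains(declaredContentType);
}
```

Passing image/jpeg together with the bytes of a PHP script returns true, which is exactly the gap that signature validation is meant to close.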
File Signatures
Most binary file formats have a "signature" or "magic number": a specific sequence of bytes at the beginning of the file that identifies its true format. For example:
- JPEG images start with FF D8 FF
- PNG images start with 89 50 4E 47 0D 0A 1A 0A
- PDF documents start with 25 50 44 46 (which is %PDF in ASCII)
- ZIP archives start with 50 4B 03 04 or 50 4B 05 06
These signatures are built into the file format specifications and cannot be faked by simply renaming the file or changing HTTP headers. To pass file signature validation, a file must actually begin with the bytes its claimed format requires.
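As an illustration of the idea only, a hand-rolled check against the signatures listed above could look like the sketch below. The MagicNumbers class and its Detect method are hypothetical; the rest of this post uses a library instead of code like this.

```csharp
using System.IO;

public static class MagicNumbers
{
    private static readonly byte[] Jpeg = { 0xFF, 0xD8, 0xFF };
    private static readonly byte[] Png = { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };
    private static readonly byte[] Pdf = { 0x25, 0x50, 0x44, 0x46 };
    private static readonly byte[] Zip = { 0x50, 0x4B, 0x03, 0x04 };

    // Returns a media type for the detected signature, or null if none matches.
    public static string? Detect(Stream stream)
    {
        var header = new byte[8];
        int read = 0;

        // Stream.Read may return fewer bytes than requested, so keep reading
        // until the buffer is full or the stream ends.
        while (read < header.Length)
        {
            int n = stream.Read(header, read, header.Length - read);
            if (n == 0) break;
            read += n;
        }

        if (StartsWith(header, read, Png)) return "image/png";
        if (StartsWith(header, read, Jpeg)) return "image/jpeg";
        if (StartsWith(header, read, Pdf)) return "application/pdf";
        if (StartsWith(header, read, Zip)) return "application/zip";
        return null;
    }

    private static bool StartsWith(byte[] data, int length, byte[] signature)
    {
        if (length < signature.Length) return false;
        for (int i = 0; i < signature.Length; i++)
        {
            if (data[i] != signature[i]) return false;
        }
        return true;
    }
}
```

A PHP script renamed to profile.jpg begins with the bytes of "<?php", matches none of these prefixes, and so comes back as null regardless of its name or declared content type.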
Our implementation: Using the FileSignatures library
Rather than maintaining our own database of file signatures (which would be error-prone and require constant updates), we leverage the excellent FileSignatures library by Neil Harvey. This library provides a comprehensive, well-maintained collection of file format signatures.
Here's how we implemented it:
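A minimal sketch of the validator's constructor, assuming the FileSignatures NuGet package is referenced. The field name _inspector is an assumption; FileFormatLocator.GetFormats, FileFormatInspector, and IFileFormatInspector are the library types the steps below refer to.

```csharp
using System.Collections.Generic;
using FileSignatures;

public class FileSignatureValidator
{
    // Built once and reused for every validation call.
    private readonly IFileFormatInspector _inspector;

    public FileSignatureValidator()
    {
        // Custom formats defined in this assembly plus the library defaults.
        IEnumerable<FileFormat> formats =
            FileFormatLocator.GetFormats(GetType().Assembly, true);

        _inspector = new FileFormatInspector(formats);
    }
}
```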
How it works
1. Format Discovery
In the constructor, we use FileFormatLocator.GetFormats() to discover all available file format definitions:
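For comparison, the sketch below puts the three ways of calling GetFormats side by side. The wrapper class and its method names are hypothetical and exist only to label each call.

```csharp
using System.Collections.Generic;
using System.Reflection;
using FileSignatures;

public static class FormatDiscovery
{
    // Only the formats that ship with the library.
    public static IEnumerable<FileFormat> BuiltInOnly()
        => FileFormatLocator.GetFormats();

    // Only the formats defined in the given assembly.
    public static IEnumerable<FileFormat> CustomOnly(Assembly assembly)
        => FileFormatLocator.GetFormats(assembly);

    // Formats from the given assembly plus the library defaults.
    public static IEnumerable<FileFormat> CustomAndBuiltIn(Assembly assembly)
        => FileFormatLocator.GetFormats(assembly, true);
}
```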
The parameters here are important:
- this.GetType().Assembly: Looks for custom format definitions in your own assembly
- true: Includes all the default format definitions from the FileSignatures library
This approach allows you to extend the library with custom format definitions if needed while still getting all the built-in formats.
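A custom definition is a class deriving from FileFormat. The sketch below is a made-up example: the AcmeReport format, its "ACME" signature, media type, and extension are all invented for illustration, and the base constructor is assumed to take the arguments in the order signature, media type, extension.

```csharp
using FileSignatures;

// Hypothetical format: files that begin with the ASCII bytes "ACME".
public class AcmeReport : FileFormat
{
    public AcmeReport()
        : base(new byte[] { 0x41, 0x43, 0x4D, 0x45 }, "application/x-acme-report", "acme")
    {
    }
}
```

Because the class lives in your own assembly, the GetFormats call with the assembly argument should pick it up alongside the built-in formats.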
2. Inspection
We create a FileFormatInspector with all discovered formats. This inspector is reused across all validation requests, which is important for performance: we don't want to reconstruct the format database for every uploaded file.
3. Validation
The actual validation happens in our custom FileSignatureValidator.IsValid() method (we'll explore this in detail below). This method:
- Opens a stream to read the file's bytes
- Inspects the file signature using the FileSignatures library
- Compares the detected format against our allow-list of supported content types
- Logs any mismatches for security monitoring
The validation logic in more detail
Let's examine what a typical FileSignatureValidator.IsValid() implementation might look like:
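The following is one possible shape, not a definitive implementation. It follows the four steps listed earlier under some assumptions: the file is identified by a path, the allow-list is a set of media types, logging goes through a plain Action<string> callback, and a mismatch between the declared and detected type is treated as a rejection. The inspector is passed in here so the sketch stands on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using FileSignatures;

public class FileSignatureValidator
{
    private readonly IFileFormatInspector _inspector;
    private readonly HashSet<string> _allowedMediaTypes;
    private readonly Action<string> _log; // hypothetical logging hook

    public FileSignatureValidator(
        IFileFormatInspector inspector,
        IEnumerable<string> allowedMediaTypes,
        Action<string> log)
    {
        _inspector = inspector;
        _allowedMediaTypes = new HashSet<string>(allowedMediaTypes, StringComparer.OrdinalIgnoreCase);
        _log = log;
    }

    public bool IsValid(string path, string declaredContentType)
    {
        // 1. Open a stream to read the file's bytes.
        using FileStream stream = File.OpenRead(path);

        // 2. Inspect the signature; null means no known format matched.
        FileFormat? detected = _inspector.DetermineFileFormat(stream);
        if (detected is null)
        {
            _log($"Rejected '{path}': unrecognised file signature.");
            return false;
        }

        // 3. Compare the detected format against the allow-list.
        if (!_allowedMediaTypes.Contains(detected.MediaType))
        {
            _log($"Rejected '{path}': detected type {detected.MediaType} is not allowed.");
            return false;
        }

        // 4. Log mismatches between the claimed and the detected type.
        if (!string.Equals(detected.MediaType, declaredContentType, StringComparison.OrdinalIgnoreCase))
        {
            _log($"Rejected '{path}': declared {declaredContentType}, detected {detected.MediaType}.");
            return false;
        }

        return true;
    }
}
```

Whether a declared/detected mismatch should reject the upload or only be logged is a policy decision; this sketch takes the stricter option.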
Performance considerations
Stream Positioning
An important implementation detail: after the FileSignatures library reads the file stream to check the signature, you need to reset the stream position:
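One way to make the reset hard to forget is to wrap the inspection in a helper that always rewinds afterwards. This is a sketch using only System.IO; the StreamInspection class and InspectAndRewind method are hypothetical names.

```csharp
using System;
using System.IO;

public static class StreamInspection
{
    // Runs an inspection callback, then rewinds the stream so that later
    // consumers start again from the first byte.
    public static T InspectAndRewind<T>(Stream stream, Func<Stream, T> inspect)
    {
        if (!stream.CanSeek)
        {
            throw new NotSupportedException("The stream must be seekable to be rewound.");
        }

        try
        {
            return inspect(stream);
        }
        finally
        {
            // Reset even if the inspection throws.
            stream.Position = 0;
        }
    }
}
```

With the library, the callback would be something like s => inspector.DetermineFileFormat(s), where inspector is the shared FileFormatInspector.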
The FileSignatures library reads from the beginning of the stream, so if you need to process the file after validation, make sure to reset the position.
Caching the Inspector
We create the FileFormatInspector once in the constructor and reuse it for all validations. This is efficient because:
- Format definitions are loaded only once
- No repeated assembly scanning
- Reduced memory allocation
For high-traffic applications processing thousands of uploads, this optimization matters.
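If the validator itself is created more than once, the same effect can be had by sharing one inspector for the whole process. The sketch below uses Lazy<T> for that; the SharedInspector class is hypothetical, and it assumes the inspector is safe to share across concurrent requests, as the reuse described above implies.

```csharp
using System;
using FileSignatures;

public static class SharedInspector
{
    // Lazy<T> defaults to thread-safe initialisation, so the format scan
    // runs at most once even under concurrent first use.
    private static readonly Lazy<IFileFormatInspector> Holder =
        new Lazy<IFileFormatInspector>(() =>
            new FileFormatInspector(
                FileFormatLocator.GetFormats(typeof(SharedInspector).Assembly, true)));

    public static IFileFormatInspector Instance => Holder.Value;
}
```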
Read-only Operations
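For most formats, file signature validation only reads the first few bytes of a file (typically 2-16 bytes depending on the format). This is extremely fast: even for large files, we're only examining a tiny portion of the content. Container-based formats such as Office documents may require reading further into the file to tell them apart.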
Limitations and Edge Cases
File signature validation is powerful, but not perfect:
Polyglot Files
Sophisticated attackers can create "polyglot" files that are valid in multiple formats simultaneously. For example, a single file can be both a valid JPEG and valid JavaScript. These are rare and difficult to create, but they exist.
Container Formats
Some formats are containers that can hold various content types:
- ZIP files can contain anything
- Office documents (DOCX, XLSX) are actually ZIP archives containing XML
- PDF files can embed JavaScript and other executable content
File signature validation confirms the container format is legitimate, but doesn't analyze the contents. This is why we also need malware scanning (which we'll cover in the next post).
Integration with our validation pipeline
Remember our validation pipeline from the previous posts? File signature validation is the third step, after size and content type:
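A rough sketch of that ordering, using only the standard library. Every type here (UploadedFile, ValidationStep, ValidationPipeline, UploadPipelines) is a hypothetical stand-in rather than the series' actual code, and the 5 MB limit is an assumed value. The signature check is passed in as a delegate so that any implementation can be plugged in.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical upload model and step type.
public sealed record UploadedFile(string FileName, string ContentType, byte[] Content);

public sealed record ValidationStep(string Name, Func<UploadedFile, bool> Check);

public sealed class ValidationPipeline
{
    private readonly IReadOnlyList<ValidationStep> _steps;

    public ValidationPipeline(params ValidationStep[] steps) => _steps = steps;

    // Runs the steps in order and returns the name of the first one that
    // fails, or null when every step passes.
    public string? FirstFailure(UploadedFile file)
    {
        foreach (ValidationStep step in _steps)
        {
            if (!step.Check(file)) return step.Name;
        }
        return null;
    }
}

public static class UploadPipelines
{
    private const int MaxBytes = 5 * 1024 * 1024; // assumed limit

    // Order matters: cheap checks first, signature inspection third.
    public static ValidationPipeline Create(Func<UploadedFile, bool> signatureCheck) =>
        new ValidationPipeline(
            new ValidationStep("size", f => f.Content.Length <= MaxBytes),
            new ValidationStep("content-type",
                f => f.ContentType is "image/jpeg" or "image/png" or "image/gif"),
            new ValidationStep("signature", signatureCheck));
}
```

With a signature check in place, the spoofed profile.jpg from the opening scenario passes the first two steps and is stopped at the third.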
What's next
File signature validation ensures files are what they claim to be, but even legitimate files can contain malicious payloads. A valid PDF can have embedded JavaScript that exploits vulnerabilities. A legitimate Office document can contain malicious macros. An image can exploit processing library vulnerabilities.
In our final post, we'll explore the last line of defense: malware scanning with integration into antivirus engines.