Securing File Uploads Part 3: File Signature Validation

In our previous posts, we covered content type validation and file size validation as the first two layers of defense in our file upload security pipeline. Today, we're diving into what I consider the most critical validation step: file signature validation, also known as "magic number" validation. This is where we stop trusting what files claim to be and start verifying what they actually are.

The Problem: files that lie

Here's a sobering truth: both content type headers and file extensions are trivially easy to manipulate. An attacker can:

Rename malicious.php to harmless.jpg
Upload a PHP web shell with the content type set to image/jpeg
Disguise an executable as a PDF by simply changing the extension
Bypass your content type validation while still delivering malicious payloads

Consider this scenario: Your application accepts image uploads for user profiles. You've implemented content type validation that only allows image/jpeg, image/png, and image/gif. An attacker uploads a file with:

Filename: profile.jpg
Content-Type header: image/jpeg
Actual content: A PHP web shell

Your content type validator sees image/jpeg and happily accepts it. The file gets stored in your uploads directory. If that directory is served by your web server and configured to execute PHP files, the attacker can now execute arbitrary code on your server by simply navigating to the uploaded file.

File Signatures

Most binary file formats have a "signature" or "magic number": a specific sequence of bytes, usually at the very beginning of the file, that identifies its true format. For example:

JPEG images start with FF D8 FF
PNG images start with 89 50 4E 47 0D 0A 1A 0A
PDF documents start with 25 50 44 46 (which is %PDF in ASCII)
ZIP archives start with 50 4B 03 04, or 50 4B 05 06 for an empty archive

These signatures are defined by the file format specifications, so renaming a file or changing HTTP headers does not alter them. To pass file signature validation, the leading bytes of a file must actually match the format it claims to be.
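To make the idea concrete, here is a minimal hand-rolled sketch of a signature check for the four formats listed above. The MagicNumbers class and its Detect method are hypothetical names for illustration only; they are not part of any library, and a real application should prefer a maintained signature database.

```csharp
#nullable enable
using System;

// Hypothetical helper for illustration; not part of the FileSignatures library.
public static class MagicNumbers
{
    // The signatures listed above, paired with their media types.
    private static readonly (string MediaType, byte[] Signature)[] Known =
    {
        ("image/jpeg",      new byte[] { 0xFF, 0xD8, 0xFF }),
        ("image/png",       new byte[] { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A }),
        ("application/pdf", new byte[] { 0x25, 0x50, 0x44, 0x46 }),
        ("application/zip", new byte[] { 0x50, 0x4B, 0x03, 0x04 }),
        ("application/zip", new byte[] { 0x50, 0x4B, 0x05, 0x06 }),
    };

    // Returns the media type of the first signature that prefixes the header,
    // or null when none of the known signatures match.
    public static string? Detect(ReadOnlySpan<byte> header)
    {
        foreach (var (mediaType, signature) in Known)
        {
            if (header.StartsWith(new ReadOnlySpan<byte>(signature)))
            {
                return mediaType;
            }
        }

        return null;
    }
}
```

With this sketch, the profile.jpg from the scenario above would be rejected: its first bytes are the ASCII characters of a PHP opening tag, which match none of the known signatures, whatever the Content-Type header says.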

Our implementation: Using the FileSignatures library

Rather than maintaining our own database of file signatures (which would be error-prone and require constant updates), we leverage the excellent FileSignatures library by Neil Harvey. This library provides a comprehensive, well-maintained collection of file format signatures.

Here's how we implemented it:

How it works

1. Format Discovery

In the constructor, we use FileFormatLocator.GetFormats() to discover all available file format definitions:

The parameters here are important:

this.GetType().Assembly: Looks for custom format definitions in your own assembly
true: Includes all the default format definitions from the FileSignatures library

This approach allows you to extend the library with custom format definitions if needed while still getting all the built-in formats.
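As a sketch of what such an extension could look like, the snippet below declares a custom format and then discovers it together with the built-in ones. The AcmeReport format, its signature bytes, media type and extension are all invented for illustration, and the FormatDiscovery wrapper is a hypothetical name.

```csharp
using System.Collections.Generic;
using FileSignatures;

// Hypothetical custom format: the signature bytes ("ACME" in ASCII),
// media type and extension are invented for this example.
public class AcmeReport : FileFormat
{
    public AcmeReport()
        : base(new byte[] { 0x41, 0x43, 0x4D, 0x45 }, "application/x-acme-report", "acme")
    {
    }
}

public static class FormatDiscovery
{
    public static IEnumerable<FileFormat> Discover()
    {
        // First argument: the assembly to scan for custom formats.
        // Second argument (true): also include the library's default formats.
        return FileFormatLocator.GetFormats(typeof(AcmeReport).Assembly, true);
    }
}
```

Passing typeof(AcmeReport).Assembly has the same effect as this.GetType().Assembly when the custom format lives in the same assembly as the calling class.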

2. Inspection

We create a FileFormatInspector with all discovered formats. This inspector is reused across all validation requests, which is important for performance—we don't want to reconstruct the format database for every uploaded file.
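Steps 1 and 2 together might look roughly like the constructor below. This is a sketch based on the description above rather than a copy of our production class; the field name and the Inspector property are assumptions added for illustration.

```csharp
using FileSignatures;

public class FileSignatureValidator
{
    // Built once and reused for every upload that is validated.
    private readonly FileFormatInspector _inspector;

    public FileSignatureValidator()
    {
        // Step 1: discover custom formats in this assembly plus the defaults.
        var formats = FileFormatLocator.GetFormats(this.GetType().Assembly, true);

        // Step 2: create a single inspector that knows all discovered formats.
        _inspector = new FileFormatInspector(formats);
    }

    // Exposed here only so the sketch can be exercised on its own.
    public FileFormatInspector Inspector => _inspector;
}
```

For the reuse to pay off, the validator itself has to be long-lived, for example registered as a singleton in your dependency injection container.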

3. Validation

The actual validation happens in our custom FileSignatureValidator.IsValid() method (we'll explore this in detail below). This method:

Opens a stream to read the file's bytes
Inspects the file signature using the FileSignatures library
Compares the detected format against our allow-list of supported content types
Logs any mismatches for security monitoring

The validation logic in more detail

Let's examine what a typical FileSignatureValidator.IsValid() implementation might look like:
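The sketch below is one possible shape for such a method, not a definitive implementation. To keep it self-contained it makes a few assumptions: the inspector is passed in through the constructor, the upload arrives as a seekable Stream, the allow-list is a set of media type strings, and logging is a plain Action<string> standing in for a real logger.

```csharp
#nullable enable
using System;
using System.Collections.Generic;
using System.IO;
using FileSignatures;

public class FileSignatureValidator
{
    private readonly IFileFormatInspector _inspector;
    private readonly HashSet<string> _allowedMediaTypes;
    private readonly Action<string> _log; // stand-in for a real logger

    public FileSignatureValidator(
        IFileFormatInspector inspector,
        IEnumerable<string> allowedMediaTypes,
        Action<string> log)
    {
        _inspector = inspector;
        _allowedMediaTypes = new HashSet<string>(allowedMediaTypes, StringComparer.OrdinalIgnoreCase);
        _log = log;
    }

    public bool IsValid(Stream content, string claimedContentType)
    {
        // Returns null when no known signature matches the stream.
        FileFormat? detected = _inspector.DetermineFileFormat(content);

        // Rewind so later pipeline steps read the file from the start.
        if (content.CanSeek)
        {
            content.Position = 0;
        }

        if (detected is null)
        {
            _log($"Rejected upload: no known signature (claimed '{claimedContentType}').");
            return false;
        }

        if (!_allowedMediaTypes.Contains(detected.MediaType))
        {
            _log($"Rejected upload: detected '{detected.MediaType}', claimed '{claimedContentType}'.");
            return false;
        }

        return true;
    }
}
```

Note that the decision is based on the detected media type only. The claimed content type is used purely for the log message, so a mismatch shows up in security monitoring.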

Performance considerations

Stream Positioning

An important implementation detail: after the FileSignatures library reads the file stream to check the signature, you need to reset the stream position:

The FileSignatures library reads from the beginning of the stream, so if you need to process the file after validation, make sure to reset the position.
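One way to make the reset hard to forget is to wrap the inspection in a small helper. The StreamRewind class and InspectThenRewind method below are hypothetical names; the sketch only relies on standard System.IO members.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration.
public static class StreamRewind
{
    // Runs an inspection that reads from the stream, then moves the stream
    // back to the beginning, even if the inspection throws.
    public static T InspectThenRewind<T>(Stream stream, Func<Stream, T> inspect)
    {
        if (!stream.CanSeek)
        {
            throw new NotSupportedException("The stream must be seekable to be rewound.");
        }

        try
        {
            return inspect(stream);
        }
        finally
        {
            stream.Seek(0, SeekOrigin.Begin);
        }
    }
}
```

The inspection is passed in as a delegate, so the same helper works for the signature check and for any other step that consumes part of the stream.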

Caching the Inspector

We create the FileFormatInspector once in the constructor and reuse it for all validations. This is efficient because:

Format definitions are loaded only once
No repeated assembly scanning
Reduced memory allocation

For high-traffic applications processing thousands of uploads, this optimization matters.

Read-only Operations

For most formats, file signature validation only needs to read the first few bytes of a file (typically 2-16 bytes depending on the format). This is extremely fast: even for large files, we're only examining a tiny portion of the content.
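The sketch below shows why the cost does not grow with file size: it reads at most a fixed number of header bytes and leaves the rest of the stream untouched. HeaderReader and ReadHeader are hypothetical names, and the default of 16 bytes simply mirrors the range mentioned above.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration.
public static class HeaderReader
{
    // Reads up to maxBytes from the current position and returns them.
    // Fewer bytes are returned when the stream ends early.
    public static byte[] ReadHeader(Stream stream, int maxBytes = 16)
    {
        var buffer = new byte[maxBytes];
        int total = 0;

        // Stream.Read may return fewer bytes than requested, so loop.
        while (total < maxBytes)
        {
            int read = stream.Read(buffer, total, maxBytes - total);
            if (read == 0)
            {
                break; // end of stream
            }

            total += read;
        }

        Array.Resize(ref buffer, total);
        return buffer;
    }
}
```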

Limitations and Edge Cases

File signature validation is powerful, but not perfect:

Polyglot Files

Sophisticated attackers can create "polyglot" files that are valid in multiple formats simultaneously, for example a file that is both a valid JPEG and valid JavaScript. These take more effort to create than a renamed file, but they exist, and a signature check alone will not catch them.
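A simplified sketch of the underlying weakness: a check that only looks at the leading bytes says nothing about what follows them. The sample built below is not a true polyglot (it is not a decodable JPEG), and PolyglotDemo with its methods is a hypothetical name, but it shows how a payload can hide behind a correct signature.

```csharp
using System;
using System.Linq;
using System.Text;

// Hypothetical demo class for illustration.
public static class PolyglotDemo
{
    private static readonly byte[] JpegSignature = { 0xFF, 0xD8, 0xFF };
    private static readonly byte[] PhpTag = Encoding.ASCII.GetBytes("<?php");

    // A prefix-only check: true when the data starts with the JPEG signature.
    public static bool LooksLikeJpeg(byte[] data) =>
        new ReadOnlySpan<byte>(data).StartsWith(new ReadOnlySpan<byte>(JpegSignature));

    // True when a PHP opening tag appears anywhere in the data.
    public static bool ContainsPhpTag(byte[] data) =>
        new ReadOnlySpan<byte>(data).IndexOf(new ReadOnlySpan<byte>(PhpTag)) >= 0;

    // Builds a buffer that starts with JPEG bytes and carries a script after them.
    public static byte[] BuildSample()
    {
        byte[] payload = Encoding.ASCII.GetBytes("<?php echo 'hello'; ?>");
        return JpegSignature
            .Concat(new byte[] { 0xE0 })
            .Concat(payload)
            .ToArray();
    }
}
```

For the sample, LooksLikeJpeg and ContainsPhpTag both return true. This is one reason to combine signature validation with the other layers rather than rely on it alone.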

Container Formats

Some formats are containers that can hold various content types:

ZIP files can contain anything
Office documents (DOCX, XLSX) are actually ZIP archives containing XML
PDF files can embed JavaScript and other executable content

File signature validation confirms the container format is legitimate, but doesn't analyze the contents. This is why we also need malware scanning (which we'll cover in the next post).
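The following sketch illustrates the point with a ZIP archive built in memory. The signature check passes regardless of what the archive holds, and only a second step that opens the container reveals its entries. ContainerDemo and its methods are hypothetical names; the code uses only System.IO.Compression from the base class library.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;
using System.Linq;
using System.Text;

// Hypothetical demo class for illustration.
public static class ContainerDemo
{
    // Creates a ZIP archive in memory containing a single text entry.
    public static byte[] BuildZipWithEntry(string entryName, string text)
    {
        using var buffer = new MemoryStream();
        using (var archive = new ZipArchive(buffer, ZipArchiveMode.Create, leaveOpen: true))
        {
            ZipArchiveEntry entry = archive.CreateEntry(entryName);
            using var writer = new StreamWriter(entry.Open(), Encoding.UTF8);
            writer.Write(text);
        }

        return buffer.ToArray();
    }

    // Signature check only: true when the data starts with 50 4B 03 04.
    public static bool HasZipSignature(byte[] data) =>
        data.Length >= 4 &&
        data[0] == 0x50 && data[1] == 0x4B && data[2] == 0x03 && data[3] == 0x04;

    // Content inspection: lists the names of the entries inside the archive.
    public static List<string> ListEntryNames(byte[] zip)
    {
        using var archive = new ZipArchive(new MemoryStream(zip), ZipArchiveMode.Read);
        return archive.Entries.Select(e => e.FullName).ToList();
    }
}
```

An archive built with an entry named shell.php passes HasZipSignature just like a harmless one. Only ListEntryNames, or a scanner that looks inside, shows the difference.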

Integration with our validation pipeline

Remember our validation pipeline from the previous posts? File signature validation is the third step, after size and content type:
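As a reminder of the overall shape, here is a minimal sketch of an ordered pipeline that stops at the first failing step. Everything in it is an assumption made for illustration: the Upload record, the UploadPipeline class, the 5 MB limit, the PNG-only allow-list and the inline checks are placeholders, not the validators from the earlier posts.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical upload description for this sketch.
public sealed record Upload(string ContentType, long Length, byte[] Header);

public static class UploadPipeline
{
    private static readonly byte[] PngSignature =
        { 0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A };

    // Placeholder steps, in the order described above.
    public static IReadOnlyList<(string Name, Func<Upload, bool> Check)> DefaultSteps() =>
        new (string Name, Func<Upload, bool> Check)[]
        {
            ("size", u => u.Length > 0 && u.Length <= 5 * 1024 * 1024),
            ("content-type", u => u.ContentType == "image/png"),
            ("signature", u => new ReadOnlySpan<byte>(u.Header)
                .StartsWith(new ReadOnlySpan<byte>(PngSignature))),
        };

    // Runs the steps in order and returns the name of the first step
    // that fails, or null when every step passes.
    public static string? FirstFailure(
        Upload upload,
        IEnumerable<(string Name, Func<Upload, bool> Check)> steps)
    {
        foreach (var (name, check) in steps)
        {
            if (!check(upload))
            {
                return name;
            }
        }

        return null;
    }
}
```

Ordering the cheap checks first means an oversized or obviously mislabeled upload is rejected before any bytes are inspected.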

What's next

File signature validation ensures files are what they claim to be, but even legitimate files can contain malicious payloads. A valid PDF can have embedded JavaScript that exploits vulnerabilities. A legitimate Office document can contain malicious macros. An image can exploit processing library vulnerabilities.

In our final post, we'll explore the last line of defense: malware scanning with integration into antivirus engines.

More information

neilharvey/FileSignatures: A small library for detecting the type of a file based on header signature (also known as magic number).
