Large Files and APIs: 9 Lessons Learnt

26 Oct 2023 by Marty Rowe

APIs are great for lightweight transactions, but what do you do when you need to bring larger files into or out of your service?

There are multiple design patterns that can be used, but let’s focus on avoiding the common pitfalls and applying some practical tips to ensure that your implementation goes smoothly.

1. Understand your threat model

To design a secure system, you need to know what information you’re protecting from what kind of threat. For example:

  • Are you making sensitive medical records available to trusted health providers, or providing users with bespoke cat memes?
  • Are you receiving bulk transactions in CSV files, or accepting large software installation files?
  • Is Denial of Service a concern for your limited bandwidth?
  • Could a user cause a spike in your cloud infrastructure costs through excessive file transfers?
  • How much usage is reasonable, and at what point does it become abuse?

Some files contain sensitive information; others are more likely to conceal malicious code. How you protect against abuse and data loss depends on how you design your system, and you won’t be able to make smart design decisions without this understanding up front.

2. Use two steps

Almost no one wants to build a file upload system that is publicly writable. Likewise, if you’re building a publicly readable file download system then you’re overcomplicating it by implementing an API (you’re looking for “serving static content” instead).

For all other file transfer systems the best pattern is usually to have a REST API that the client can use to initiate the transfer, and then a second step to handle the actual transfer. These two steps can (and should) be handled by different services; the API service/gateway can handle the authn/authz and the business logic involved, while the file transfer service can optimise for bandwidth. Trying to do both at once or even with the same service will lead to compromises that your users and/or security team will notice.
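As a rough sketch of the two-step shape, the API service might expose an “initiate transfer” endpoint that performs the authn/authz and business logic, then hands back a URL pointing at the separate transfer service. The example below uses Flask purely for illustration, and the helpers are hypothetical stand-ins for your real logic:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def authenticate(req):
    # Hypothetical stand-in for your real authn (e.g. validating a bearer token).
    return {"user_id": "demo"}

def create_transfer_url(user, payload):
    # Hypothetical stand-in: in practice this would be a signed URL issued by
    # the separate transfer service or cloud storage (see lesson 5).
    return "https://transfer.example.com/uploads/abc123?signature=placeholder"

@app.post("/files/uploads")
def initiate_upload():
    # Step 1: the API service handles authn/authz and business logic here.
    user = authenticate(request)
    payload = request.get_json(force=True)

    # Step 2 is delegated: return a URL that points at the separate,
    # bandwidth-optimised transfer service; the client uploads there directly.
    return jsonify({"upload_url": create_transfer_url(user, payload), "expires_in": 300})
```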

3. Don’t host the files yourself

If you’re storing the files on your own bare-metal servers, then one of your highest concerns will be managing your finite bandwidth. A two-step API helps because you can control the flow of traffic before users start consuming that bandwidth, and the metering logic can dynamically account for expected file sizes and simultaneous user sessions. It would be better, though, to remove the problem entirely and use the flexible bandwidth offered by the cloud.

4. Stay off the transfer critical path

Don’t put your own services on the critical file transfer path between the client and the cloud! Any service on that path will need to scale quickly and almost certainly adds no value over what can be obtained from the paired API request. All you’re going to do is increase latency, decrease bandwidth, and decrease reliability. One of the features you’re paying for with cloud is “global presence”, so make sure you’re getting the most out of it by letting your users transfer directly with their closest point-of-presence.

5. Use a signed URL

Now you need to figure out how to control access to your file storage service in the cloud. The best way to implement access control is to have your API return a “signed URL” to the client. A signed URL is a cryptographic token; someone with credentials to access the file can generate such a token and give it to a third party to access the file. The storage service can verify that the token is valid using hashes and signatures included in the token. If the third party modifies the token to access a different file, then the signature check fails and access is denied. Signed URLs usually expire after a short period of time and are supported by all the major storage providers.
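With AWS S3 and boto3, for example, a short-lived presigned GET URL looks roughly like this (the bucket and key names are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Presigned GET URL the API can hand back to the client.
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "example-bucket", "Key": "reports/2023-10.csv"},
    ExpiresIn=300,  # expires after 5 minutes
)
```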

If you aren’t using a cloud storage service it is still useful to use a signed URL pattern because it removes any requirement for your file processing service to communicate with your authn/authz services. Not only does this avoid duplicating validation, but it also means that your file processing service, which may be handling malicious user content, can be strongly isolated from the rest of your network. Validating the transfer request requires only the public part of the key that was used to sign the token. A good starting point for rolling your own signature pattern is to look at the way AWS implements it.
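If you do roll your own, the core idea can be sketched with an asymmetric signature so that the file service only ever needs the public key. This is a simplified illustration using Ed25519 from Python’s cryptography package, not AWS’s SigV4 scheme:

```python
import base64
import time
from urllib.parse import parse_qs, urlencode, urlparse

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

private_key = Ed25519PrivateKey.generate()  # held by the API service
public_key = private_key.public_key()       # the only thing the file service needs

def sign_url(path: str, ttl_seconds: int = 300) -> str:
    # Sign the path plus an expiry time; changing either invalidates the signature.
    expires = int(time.time()) + ttl_seconds
    signature = private_key.sign(f"{path}?expires={expires}".encode())
    params = {"expires": expires, "signature": base64.urlsafe_b64encode(signature).decode()}
    return f"{path}?{urlencode(params)}"

def verify_url(url: str) -> bool:
    # The file service re-derives the signed message and checks expiry and signature.
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    expires = int(params["expires"][0])
    if expires < time.time():
        return False
    try:
        public_key.verify(
            base64.urlsafe_b64decode(params["signature"][0]),
            f"{parsed.path}?expires={expires}".encode(),
        )
        return True
    except InvalidSignature:
        return False
```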

6. Require the client to send metadata for uploads

The exact access a signed URL provides depends on the parameters you set when you create it. It is a lot easier to prevent abuse if you require the client to tell you about the file they want to upload. Specifically, you should at least require that the file checksum be provided, and use a strong algorithm like SHA256. Most cloud storage providers can verify a checksum on upload and can include a checksum as a signed parameter, so the client can only upload the exact file they specified. Also, if the client loses control of the token to an attacker before it expires, the attacker can still only upload the exact file the client intended. There is the possibility of hash collisions, but that is mitigated by using a strong algorithm.
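With boto3, for instance, binding the checksum into the presigned upload looks roughly like the sketch below; treat the exact parameter and header handling as an assumption to verify against your provider’s documentation, since it varies between SDKs and storage services:

```python
import boto3

s3 = boto3.client("s3")

# The client tells the API the SHA-256 of the file it intends to upload; the
# checksum becomes part of the presigned request, so uploading anything else
# fails. The bucket, key, and checksum values here are placeholders.
url = s3.generate_presigned_url(
    "put_object",
    Params={
        "Bucket": "example-bucket",
        "Key": "uploads/invoice.csv",
        "ChecksumSHA256": "47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=",  # placeholder value
    },
    ExpiresIn=300,
)
```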

Beware that it can be difficult to validate file size and content type using this approach. One method is to validate them as business logic in the initial API call, but whether those limits can be enforced during the subsequent file transfer step depends on the capabilities of the signed URL implementation that is used. Ensure that any parameters that are not enforced during the file transfer step are subsequently validated, otherwise you will create a TOCTOU (time-of-check to time-of-use) vulnerability in your system.
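One option on AWS, for example, is a presigned POST, which can enforce a size range and content type as conditions at upload time (the names and limits below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

post = s3.generate_presigned_post(
    Bucket="example-bucket",
    Key="uploads/invoice.csv",
    Fields={"Content-Type": "text/csv"},
    Conditions=[
        {"Content-Type": "text/csv"},
        ["content-length-range", 1, 10 * 1024 * 1024],  # 1 byte to 10 MiB
    ],
    ExpiresIn=300,
)
# The client uploads with a multipart/form-data POST to post["url"],
# sending post["fields"] alongside the file itself.
```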

7. Scan uploaded user content

Before any uploaded content is used, it must be scanned for malicious content. There are many ways to build an automated pipeline to virus-scan files, but you also need to verify all of the business rules that couldn’t be enforced during the file transfer process. Only after these scans are complete and the file is definitely what was expected should it be made available to your services to consume. A good way to ensure that files are not accessed before they are checked is to use Attribute-Based Access Control (ABAC) and tag files once they are verified clean; your services should only have access to files that carry the tag. If a malicious file is detected, that should be fed back to the API so that business logic can be used to prevent further abuse.
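On AWS, for example, the “tag on clean” step might look like the sketch below; the tag name is my own placeholder, and the scanning itself is out of scope here. Downstream services are then granted access conditioned on that tag, e.g. via the s3:ExistingObjectTag condition key in their IAM policies:

```python
import boto3

s3 = boto3.client("s3")

def mark_clean(bucket: str, key: str) -> None:
    # Called only after the virus scan and remaining business-rule checks pass.
    # Consumers' policies should require scan-status=clean on the object.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "scan-status", "Value": "clean"}]},
    )
```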

8. Consider file encryption for downloads

Signed URLs allow anyone with the URL to access the file, which is a problem if a Man-in-the-Middle can intercept the API response. Using TLS for the API is assumed, but it doesn’t fully mitigate the vulnerability, particularly if the client is on an untrustworthy network; with the prevalence of Work From Home, shared office spaces, commercial VPNs, and free public Wi-Fi, this exposure is only getting worse. Who is to blame if the client loses control of the signed URL? It’s probably better to avoid the reputational damage entirely by designing the vulnerability out.

Malicious access to the file can be mitigated if the file is encrypted at rest. To do this you:

  1. Ensure that the client provides a public key in their API request
  2. Generate a fresh symmetric key for each request
  3. Encrypt the symmetric key with the client’s public key
  4. Return the encrypted key in the API response
  5. Use the symmetric key to encrypt the file

Now the client can decrypt the encrypted key and use it to decrypt the file. Why not just use the public key to encrypt the file and save an encryption/decryption step? Public-key cryptography is not well suited to encrypting large amounts of data: it is slow, and RSA in particular can only encrypt a payload smaller than its key size. A file is almost certainly too big, but a symmetric key is just right, which is why this hybrid approach is the standard pattern.
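Here is a minimal sketch of the server side of this flow, using Python’s cryptography package. Fernet for the symmetric layer and RSA-OAEP for wrapping the key are my choice of primitives for illustration; the pattern itself doesn’t prescribe them. The numbered comments map to the steps in the list above:

```python
from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def encrypt_for_client(client_public_key_pem: bytes, file_bytes: bytes):
    # 1. The client supplied this public key in its API request.
    client_public_key = serialization.load_pem_public_key(client_public_key_pem)

    # 2. Generate a fresh symmetric key for this request.
    sym_key = Fernet.generate_key()

    # 3. Encrypt (wrap) the symmetric key with the client's public key.
    encrypted_key = client_public_key.encrypt(
        sym_key,
        padding.OAEP(
            mgf=padding.MGF1(algorithm=hashes.SHA256()),
            algorithm=hashes.SHA256(),
            label=None,
        ),
    )

    # 5. Encrypt the file itself with the symmetric key before storing it.
    encrypted_file = Fernet(sym_key).encrypt(file_bytes)

    # 4. The encrypted key goes back in the API response; the encrypted file
    #    is what the signed URL ultimately serves.
    return encrypted_key, encrypted_file
```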

Even if an attacker can intercept the API response, they can only download the encrypted file and see the encrypted symmetric key. They would need to compromise the client’s private key or the API server to get the information required to decrypt the file.

9. Consider using a CDN for downloads

So far we have assumed that the files requested are sensitive or unique to the client, but there is a case where multiple clients need access to the same file and nothing sensitive is exposed if the signed URL leaks. If that is the case, then consider using Google Cloud CDN, AWS CloudFront, or your provider’s equivalent (if it supports signed URLs) instead of the raw storage service.
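As one example, CloudFront supports its own signed URLs via an RSA key pair registered with the distribution. With boto3 that looks roughly like the sketch below; the key pair ID, key file path, and distribution domain are placeholders:

```python
from datetime import datetime, timedelta, timezone

from botocore.signers import CloudFrontSigner
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

def rsa_signer(message: bytes) -> bytes:
    # CloudFront signed URLs are signed with the RSA private key whose public
    # half is registered with the distribution.
    with open("cloudfront_private_key.pem", "rb") as key_file:
        private_key = serialization.load_pem_private_key(key_file.read(), password=None)
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())

signer = CloudFrontSigner("KEY_PAIR_ID", rsa_signer)
url = signer.generate_presigned_url(
    "https://d111111abcdef8.cloudfront.net/shared/report.pdf",
    date_less_than=datetime.now(timezone.utc) + timedelta(minutes=10),
)
```

This keeps the same two-step pattern while adding the CDN’s caching and global distribution for shared content.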
