Webserver logs contain information classified as personal data by default under the European Union’s General Data Protection Regulation (GDPR). The new privacy regulation comes into effect on 25 May 2018. Just about everyone needs to take action now to become compliant.
Disclaimer: I’m not a lawyer and I’m not providing you legal advice. Contact your legal counsel for help interpreting and implementing the GDPR. This article is provided for entertainment purposes, and amounts to nothing but my interpretation of the GDPR.
The General Data Protection Regulation shifts the default operating mode for personal data collection from “collect and store as much information about everyone as possible for all eternity” to “don’t collect any information about anyone unless there’s documented and informed consent for the collection, and don’t use that information for anything but the specific purposes consent was given for.” The GDPR turns big-data collection of personal data on the web from an asset into a liability, with fines as high as 20 000 000 Euro or 4 % of global revenue (whichever is greater).
I’ve limited the scope of this article to some of the technical requirements surrounding personal data collected by default in the logs generated by popular webserver software. I won’t go through the entire GDPR and all its requirements, but will focus on some actionable points.
Personal data in server logs
The default configurations of popular webservers, including Apache HTTP Server and Nginx, collect and store at least two of the following three types of logs:
- Access logs
- Error logs (including processing-language logs like PHP)
- Security audit logs (e.g. ModSecurity)
All of these logs contain personal information by default under the new regulation. IP addresses are specifically defined as personal data per Article 4, Point 1, and Recital 49. The logs can also contain usernames if your web service uses them as part of its URL structure, and even the referral information that’s logged by default can contain personal information (e.g. unintended collection of sensitive data, like being referred from a sensitive-subject website).
If you don’t have a legitimate need to store these logs, you should disable logging in your webserver. You’re not even allowed to store this type of information unless the persons it concerns have given direct consent for the purposes you intend to store it for. The less customer information you store, the lower the risk to your organization.
Legal basis for collecting and storing logs without consent
As a general rule, you can’t collect and store any personal data without having obtained, and being able to document that you obtained, consent from the persons you’re collecting data from. You can, however, still collect and store personal data in your server logs for the limited and legitimate purpose of detecting and preventing fraud and unauthorized system access, and ensuring the security of your systems.
Here are the relevant excerpts from the GDPR that allow data collection for these types of purposes:
Notably, this doesn’t exempt such collection from the strict requirements of the GDPR. Gandalf the Grey offers some great advice for how you should treat personal data to achieve GDPR compliance in your organization:
Encryption, access restriction, and timely erasure
The specific requirements under the GDPR that apply to your organization depend on the scope and type of data you collect, weighed against your need to store that data. The regulation with all its recitals is 54 800 words long, but I’ll try to summarize some practical implementation requirements from the regulation:
There are no specific technical details offered regarding how these requirements should be implemented besides the suggestion to use “encryption” in Article 6, Paragraph 4, Point E and Recital 83. The take-away is still clear: data should be secured, access should be limited even within your organization, and data should be deleted (including from backups) when there’s no longer a need to retain it.
Utility of the day: logrotate (+ gnupg)
logrotate is a very useful tool that can encrypt logs in storage, even on edge servers, and can help automate the deletion of old log files. When combined with organizational measures, encrypting the stored log files lets you limit who is able to decrypt them. Unless you encrypt your logs, an unauthorized third party who gained access to your servers could extract a lot of data about your users from your logs. Depending on how much private information is stored in your log files and how sensitive your business is, you shouldn’t store log files for more than a few hours or days unless you take measures to protect them.
Managing PGP keys in GnuPG is beyond the scope of this article. In short, you would create a key pair on a secure machine, and then import the public key into the GnuPG keyring on your servers while storing the private key on a secure medium with access limited to authorized employees (e.g. printed on paper or kept on a removable storage medium). The server can then use the public key to encrypt its log files. With public-key cryptography, the server is able to encrypt the data but isn’t able to decrypt it without the private key. The log files could even be transferred to a centralized log-storage server for cold storage.
I believe such a setup could be used to achieve GDPR compliance while still maintaining auditable logs in the event of a breach of server security or other incidents that would require a log trail.
The following logrotate configuration example demonstrates encrypted storage (using GnuPG) and secure erasure after time intervals (rotate, in days) appropriate to how important it is to keep the various log files following a security incident.
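As a minimal sketch of what such a configuration could look like, I’m assuming Nginx with Debian-style log paths, a ModSecurity audit log at /var/log/modsec_audit.log, a PID file at /run/nginx.pid, GnuPG installed as /usr/bin/gpg, and a hypothetical public key for logs@example.com already imported into the keyring of the user logrotate runs as (normally root). Adjust all of these to your own setup.

```
# Sketch only. Assumed values: log paths, PID file, gpg path, and the
# recipient logs@example.com. None of them come from a real system.

/var/log/nginx/access.log {
    daily
    rotate 100
    missingok
    notifempty
    # Use GnuPG in place of gzip: logrotate feeds the rotated log to the
    # command on stdin and stores whatever it writes to stdout.
    compress
    compresscmd /usr/bin/gpg
    compressoptions --batch --trust-model always --encrypt --recipient logs@example.com
    compressext .gpg
    # Delete files with "shred -u" instead of a plain unlink.
    shred
    postrotate
        # USR1 asks Nginx to reopen its log files.
        if [ -f /run/nginx.pid ]; then kill -USR1 "$(cat /run/nginx.pid)"; fi
    endscript
}

/var/log/nginx/error.log {
    daily
    rotate 200
    missingok
    notifempty
    compress
    compresscmd /usr/bin/gpg
    compressoptions --batch --trust-model always --encrypt --recipient logs@example.com
    compressext .gpg
    shred
    postrotate
        if [ -f /run/nginx.pid ]; then kill -USR1 "$(cat /run/nginx.pid)"; fi
    endscript
}

/var/log/modsec_audit.log {
    daily
    rotate 400
    missingok
    notifempty
    compress
    compresscmd /usr/bin/gpg
    compressoptions --batch --trust-model always --encrypt --recipient logs@example.com
    compressext .gpg
    shred
    postrotate
        if [ -f /run/nginx.pid ]; then kill -USR1 "$(cat /run/nginx.pid)"; fi
    endscript
}
```

The `--trust-model always` option is there because a freshly imported public key isn’t marked as trusted, and GnuPG would otherwise refuse to encrypt to it in batch mode. You can test a configuration like this without touching any files by running logrotate with its debug flag (`logrotate -d`) against the file.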
In the above example, access logs are deleted after 100 days, error logs after 200 days, and ModSecurity logs (which would only contain suspicious activity) are retained for 400 days. After this time, the logs would be securely erased using the shred utility.
The logs are still kept unencrypted for up to 24 hours after they’re first recorded. This is a small time window when the data isn’t stored encrypted, but it’s required to allow human technicians and automated log-analysis tools (like SSHGuard or Fail2Ban) to process the data and act upon it to help detect and prevent unauthorized or unlawful system access.
You can reduce the time window when data is kept unencrypted by rotating logs hourly instead of daily, or by piping logs directly from your webserver into encrypted storage. Either approach may have a serious performance impact on your server and complicate the configuration of automated security monitoring tools.
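For the hourly option, here’s a short sketch using a hypothetical Nginx access log path, with the encryption and erasure directives left out for brevity. Two details are easy to miss: rotate counts rotations rather than days, and logrotate itself has to be invoked at least hourly, which many distributions don’t do by default.

```
# Sketch only: the log path is an assumption.
/var/log/nginx/access.log {
    # Rotate every hour. This only takes effect if logrotate itself is
    # run at least hourly (e.g. from an hourly cron job or timer).
    hourly
    # rotate counts rotated files, not days: 100 days x 24 hours = 2400.
    rotate 2400
    # ... encryption, shred, and postrotate directives go here ...
}
```

With hourly rotation, the unencrypted window shrinks from up to 24 hours to up to one hour, at the cost of many more files per log.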
Any identifier, including network or equipment identifiers like an IP address, is considered personal data. Don’t store server logs if you don’t have to. Encrypt logs in storage and limit access to decryption credentials. Delete logs as early as possible, including from any backups. Document what steps you’ve taken to secure data and limit the impact in the case of a server breach.
This article has focused on server logs as they’re something every organization with a website or an online service will have to deal with. However, the same principles and even stricter requirements apply to other types of data that your organization keeps on people. The deadline for GDPR compliance is 25 May 2018, and that’s barely enough time to read through the 54 800-word regulation. With fines in the 20 million Euro range, you’d better get started auditing personal information collection and storage in your organization right away. This is the perfect time to rethink old decisions regarding what data your organization needs to keep and for how long.