Network Working Group | S. Rao |
Internet-Draft | Grab |
Intended status: Experimental | S. Sahib |
Expires: May 7, 2020 | R. Guest |
 | Salesforce |
 | November 4, 2019 |
Personal Information Tagging for Logs (PITFoL)
draft-rao-pitfol-00
Software applications typically generate a large amount of log data in the course of their operation to help with monitoring, troubleshooting, and similar tasks. However, like all data generated and operated upon by software systems, logs can contain information that is sensitive to users. Identifying and anonymizing personal data in logs is thus crucial to ensure that no personal data is inadvertently logged and retained, which would make the logging application run afoul of laws around storing private information. This document explores mechanisms for marking personal or sensitive data in logs, so that any server collecting, processing, or analyzing logs can identify personal data and then, potentially, enforce redaction.
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119.
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."
This Internet-Draft will expire on May 7, 2020.
Copyright (c) 2019 IETF Trust and the persons identified as the document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.
Personal data identification and redaction is crucial to make sure that a logging application is not storing and potentially leaking users’ private information. There are known techniques that help discover and extract sensitive data. For example, we can define regular expressions or lookup rules that match a person's name, credit card number, email address, and so on. There are also data dictionaries and models trained on datasets that can predict the presence of sensitive data. In most cases, however, what data is considered personal and sensitive is subjective, provisional, and contextual to the data source or the application processing the data, which makes it hard to use automated techniques to identify personal data. The challenges are summarized as follows:
- What comprises personal data is often subjective and use case specific.
- There are many disparate types of personal data, and detecting them often requires a multitude of approaches.
- There are no standards governing the formats of sensitive data, which makes automation difficult for most common use cases.
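The regular-expression approach mentioned above can be sketched as follows. This is a minimal illustration, not part of the proposal: the pattern set and the function name find_candidates are assumptions made for this example, and the patterns are deliberately simplistic.

```python
import re

# Illustrative patterns only (assumptions for this sketch); real deployments
# need broader, carefully tested rules.
PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    "card_number": re.compile(r"\b\d(?:[ -]?\d){12,15}\b"),
}


def find_candidates(text):
    """Return (kind, matched_text) pairs for every pattern hit in text."""
    hits = []
    for kind, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((kind, match.group(0)))
    return hits
```

Note that such rules would find the IP address in the example log used later in this document but not a value such as user_name="XYZZY", which has no recognizable format. This is the subjectivity problem described in the list above.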
Once the personal information is identified, it has to be appropriately tagged. Personal data tagging is especially important in cases where log data is flowing in from disparate sources. In cases where tagging at the source is not possible (e.g., log data generated by a legacy IoT device, web server, or firewall), a centralized logging server can be tasked with making sure the log data is tagged before passing it downstream. Once the logs are tagged, the logging application can use anonymization techniques to redact the fields appropriately. This document focuses on the tagging aspect of log redaction.
Personal data: RFC 6973 defines personal data as "any information relating to an individual who can be identified, directly or indirectly." This typically includes information such as IP addresses and usernames. However, what counts as personal data varies heavily with what other information is available, the jurisdiction of operation, and other such factors. Hence, this document does not prescriptively list which log fields contain personal data; it focuses instead on what a tagging mechanism would look like once a logging application has determined which fields it considers to hold personal data.
Most systems, such as network devices, web servers, and application services, record information about user activity, transactions, network flows, etc., as log data. Logs are useful for purposes such as security monitoring, application debugging, and operational maintenance. In addition, organizations export or share logs with third-party log analyzers for security incident response, monitoring, and business analytics, where logs can be a valuable source of information. In such cases, there are concerns about potential exposure of personal data to unintended systems or recipients. This document explores techniques for tagging logs to aid identification of personal data.
Once personal data is identified, whether by manual detection, dictionary lookup, or models trained on datasets, the log is annotated with tag information at either the field level or the log level.
This is an example of a log message in RFC 3164 format. We can imagine that a logging application determines that user_name, err_user and ip_addr are fields that can contain sensitive personal data.
<120> Nov 16 16:00:00 10.0.1.11 ABCDEFG: [AF@0 event="AF-Authority failure" violation="A-Not authorized to object" actual_type="AF-A" jrn_seq="1001363" timestamp="20120418163258988000" job_name="QPADEV000B" user_name="XYZZY" job_number="256937" err_user="TESTFORAF" ip_addr="10.0.1.21" port="55875" action="Undefined(x00)" val_job="QPADEV000B" val_user="XYZZY" val_jobno="256937" object="TEST" object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]
In the field-level tagging method, each identified <attribute, value> field is tagged with a "pii_data=true" attribute marking the field as sensitive or personal. In the log-level tagging approach, the personal fields are instead listed in a single metadata attribute (named "pii" in the basic example below, or category-specific names such as "pii_name" in the more granular one) whose value is a list of one or more fields deemed sensitive or personal.
Using the field-level tagging technique, this log can be transformed as follows:
<120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority failure" violation="A-Not authorized to object" actual_type="AF-A" jrn_seq="1001363" timestamp="20120418163258988000" job_name="QPADEV000B" {user_name="XYZZY" pii_data="true"} job_number="256937" {err_user="XYZZY" pii_data="true"} {ip_addr="10.0.1.21" pii_data="true"} port="55875" action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937" object="TEST" object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr="" workstation=""]
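A minimal sketch of field-level tagging is shown here, assuming that fields are written as name="value" pairs, that values contain no embedded double quotes, and that a tagged field is wrapped in curly braces as in the example above. The function name tag_fields and the PAIR pattern are hypothetical and used only for illustration.

```python
import re

# Matches name="value" pairs; values are assumed not to contain double quotes.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def tag_fields(message, sensitive):
    """Wrap each sensitive name="value" pair as {name="value" pii_data="true"}."""
    def wrap(match):
        if match.group(1) in sensitive:
            return "{" + match.group(0) + ' pii_data="true"}'
        return match.group(0)

    return PAIR.sub(wrap, message)
```

For example, calling tag_fields(msg, {"user_name", "err_user", "ip_addr"}) on the untagged message leaves every other field unchanged. The sketch is not idempotent: running it twice on the same message would wrap the same field twice.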
Using the log-level tagging technique, the same log can be transformed as follows:
<120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority failure" violation="A-Not authorized to object" actual_type="AF-A" jrn_seq="1001363" timestamp="20120418163258988000" job_name="QPADEV000B" user_name="XYZZY" job_number="256937" err_user="XYZZY" ip_addr="10.0.1.21" port="55875" action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937" object="TEST" object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr="" workstation="", pii="user_name,err_user,ip_addr"]
A new (metadata) "pii" field was added to the MSG part of the syslog log message.
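Log-level tagging could be sketched along the following lines. The function name tag_log is hypothetical, and the sketch assumes name="value" fields without embedded double quotes. It separates the new "pii" field from the preceding field with a single space, which is an assumption of this example rather than something this document specifies.

```python
import re

# Matches name="value" pairs; values are assumed not to contain double quotes.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def tag_log(message, sensitive):
    """Append pii="a,b,c" naming the sensitive fields present in message."""
    names = [m.group(1) for m in PAIR.finditer(message)
             if m.group(1) in sensitive]
    if not names:
        return message
    tag = ' pii="' + ",".join(names) + '"'
    # Keep the tag inside the closing bracket of the structured part, if any.
    if message.endswith("]"):
        return message[:-1] + tag + "]"
    return message + tag
```

Because the field names are collected from the message itself, the "pii" list only names fields that actually occur in that log entry, in the order in which they occur.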
A more complicated example can support redacting different fields in different ways, as required by a privacy preservation policy:
<120> Apr 18 16:32:58 10.0.1.11 QAUDJRN: [AF@0 event="AF-Authority failure" violation="A-Not authorized to object" actual_type="AF-A" jrn_seq="1001363" timestamp="20120418163258988000" job_name="QPADEV000B" user_name="XYZZY" job_number="256937" err_user="XYZZY" ip_addr="10.0.1.21" port="55875" action="Undefined(x00)" val_job="QPADEV000B" val_jobno="256937" object="TEST" object_library="CUS9242" object_type="*FILE" pgm_name="" pgm_libr="" workstation="", pii_name="user_name,err_user", pii_ipaddr="ip_addr"]
Here the log data is tagged with "pii_name" and "pii_ipaddr" attributes that specify the sensitive data in the log at a more granular level.
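To illustrate how a downstream consumer might use such category-specific tags, the sketch below applies a different redaction to each category. The policy shown (replacing names with a fixed token and zeroing the last octet of IPv4 addresses) and the names POLICY, mask_name, mask_ipv4, and redact are assumptions chosen for this example; this document does not prescribe any particular redaction. The sketch also assumes that tag values use plain ASCII double quotes.

```python
import re

# Matches name="value" pairs; values are assumed not to contain double quotes.
PAIR = re.compile(r'(\w+)="([^"]*)"')


def mask_name(value):
    """Replace a name-like value with a fixed token."""
    return "REDACTED"


def mask_ipv4(value):
    """Zero the last octet of a dotted-quad address; otherwise redact fully."""
    parts = value.split(".")
    if len(parts) != 4:
        return "REDACTED"
    return ".".join(parts[:3] + ["0"])


# Hypothetical policy: tag attribute name -> redaction function.
POLICY = {"pii_name": mask_name, "pii_ipaddr": mask_ipv4}


def redact(message):
    """Redact every field listed in a pii_* tag using that tag's policy."""
    fields = dict(PAIR.findall(message))
    actions = {}
    for tag, action in POLICY.items():
        for name in fields.get(tag, "").split(","):
            if name.strip():
                actions[name.strip()] = action

    def apply(match):
        name, value = match.group(1), match.group(2)
        if name in actions:
            return f'{name}="{actions[name](value)}"'
        return match.group(0)

    return PAIR.sub(apply, message)
```

The tag attributes themselves are left in place, so a later processing stage can still see which fields were considered personal.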
We can consider defining a Structured Data ID for PII to specify various structured parameters.
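As one possible shape for such a Structured Data element, the sketch below builds an SD-ELEMENT in the syntax of RFC 5424, where the characters '"', '\' and ']' inside a parameter value are escaped with a backslash. The SD-ID "pii@32473" and the parameter names used in the usage note are purely hypothetical placeholders (32473 is the enterprise number reserved for documentation examples); this document does not define them.

```python
def escape_param_value(value):
    """Backslash-escape the characters that RFC 5424 requires in PARAM-VALUE."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def build_pii_element(params, sd_id="pii@32473"):
    """Build an SD-ELEMENT string from a dict of parameter names to values.

    The default sd_id is a hypothetical placeholder, not a registered SD-ID.
    """
    if not params:
        return f"[{sd_id}]"
    body = " ".join(f'{name}="{escape_param_value(value)}"'
                    for name, value in params.items())
    return f"[{sd_id} {body}]"
```

For instance, build_pii_element({"name": "user_name,err_user", "ipaddr": "ip_addr"}) yields an element listing name-like fields and IP address fields separately, mirroring the "pii_name" and "pii_ipaddr" attributes used earlier.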
TBD
TBD
[RFC2119] | Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997. |
[RFC3164] | Lonvick, C., "The BSD Syslog Protocol", RFC 3164, DOI 10.17487/RFC3164, August 2001. |
[RFC6973] | Cooper, A., Tschofenig, H., Aboba, B., Peterson, J., Morris, J., Hansen, M. and R. Smith, "Privacy Considerations for Internet Protocols", RFC 6973, DOI 10.17487/RFC6973, July 2013. |