HK40102319A - Configurable annotations for privacy-sensitive user content - Google Patents

Configurable annotations for privacy-sensitive user content

Info

Publication number: HK40102319A
Authority: HK; Hong Kong
Prior art keywords: content; user; data; sensitive; annotation
Prior art date: 2017-03-23

Application number

HK42024088258.9A

Other languages

Chinese (zh)

Inventor

P·D·艾伦

Original Assignee

微软技术许可有限责任公司

Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)

2017-03-23

Filing date

2024-03-05

Publication date

2024-06-07

2024-03-05 Application filed by 微软技术许可有限责任公司

2024-06-07 Publication of HK40102319A

Description

Configurable annotations for privacy-sensitive user content

This application is a divisional of the patent application of the same title, application number 201880020423.6, filed on March 14, 2018.

Background

Various user productivity applications allow for data entry and analysis of user content. These applications can provide content creation, editing, and analysis using spreadsheets, presentations, text documents, mixed-media documents, messaging formats, or other user content formats. Within this user content, various textual, alphanumeric, or other character-based information may include sensitive data that users or organizations might not want included in published or distributed works. For example, a spreadsheet might include social security numbers (SSNs), credit card information, healthcare identifiers, or other information. Although the user entering this data or user content might have authorization to view the sensitive data, other entities or distribution endpoints might not have such authorization.

Information protection and management techniques can be referred to as data loss protection (DLP), which attempts to avoid misappropriation and misallocation of this sensitive data. In certain content formats or content types (e.g., those included in spreadsheets, slide-based presentations, and graphical diagramming applications), user content can be included in various cells, objects, or other structured or semi-structured data entities. Moreover, sensitive data might be split among more than one data entity. Difficulties can arise when attempting to identify sensitive data and protect against its loss when such documents include sensitive data.

Summary of the Invention

Systems, methods, and software for a data privacy annotation framework for user applications are provided herein. An exemplary method includes identifying at least a first threshold quantity, an elasticity factor for modifying the first threshold quantity to a second threshold quantity, and an indication of a threshold resiliency property indicating when the second threshold quantity overrides the first threshold quantity. The method includes monitoring a content editing process of user content to identify a quantity of content elements of the user content that contain sensitive data corresponding to one or more predetermined data schemes, and, during the content editing process, enabling and disabling presentation of annotation indicators for the content elements based at least on: a current quantity of the content elements with regard to the first threshold quantity, the elasticity factor for the first threshold quantity when enabled, and the indication of the threshold resiliency property.
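The interplay of the two thresholds can be pictured as a form of hysteresis. The following is a minimal sketch under one plausible reading of the summary, not the claimed method itself: the class name `AnnotationGate`, the rule that the second threshold equals the first multiplied by the elasticity factor, the boolean form of the resiliency property, and the use of `>=` comparisons are all assumptions made for illustration.

```python
class AnnotationGate:
    """Illustrative hysteresis gate for showing or hiding annotation indicators."""

    def __init__(self, first_threshold: int, elasticity: float, resiliency: bool):
        self.first = first_threshold
        # Assumption: the second threshold is the first scaled by the elasticity factor.
        self.second = int(first_threshold * elasticity)
        # Assumption: the resiliency property is a flag saying whether the second
        # threshold overrides the first once annotations are showing.
        self.resiliency = resiliency
        self.enabled = False

    def update(self, count: int) -> bool:
        """Re-evaluate after an edit; `count` is the current number of sensitive elements."""
        if not self.enabled:
            # Annotations switch on only when the first threshold is reached.
            self.enabled = count >= self.first
        else:
            # Once on, they stay on until the count drops below the active floor.
            floor = self.second if self.resiliency else self.first
            self.enabled = count >= floor
        return self.enabled
```

With a first threshold of 100 and an elasticity factor of 0.75, this sketch keeps annotations visible while the count stays at or above 75, which avoids indicators flickering when an editing session hovers around the first threshold.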

This Summary is provided to introduce, in a simplified form, a selection of concepts that are further described below in the Detailed Description. It should be understood that this Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

Brief Description of the Drawings

Many aspects of the disclosure can be better understood with reference to the following drawings. Although several implementations are described in connection with these drawings, the disclosure is not limited to the implementations disclosed herein. On the contrary, the intent is to cover all alternatives, modifications, and equivalents.

Figure 1 illustrates a data loss protection environment in an example.

Figure 2 illustrates elements of a data loss protection environment in an example.

Figure 3 illustrates elements of a data loss protection environment in an example.

Figure 4 illustrates operations of a data loss protection environment in an example.

Figure 5 illustrates operations of a data loss protection environment in an example.

Figure 6 illustrates operations of a data loss protection environment in an example.

Figure 7 illustrates operations of a data loss protection environment in an example.

Figure 8 illustrates data threshold operations of a data loss protection environment in an example.

Figure 9 illustrates a computing system suitable for implementing any of the architectures, processes, platforms, services, and operational scenarios disclosed herein.

Detailed Description

User productivity applications provide for user data and content creation, editing, and analysis using spreadsheets, slides, vector graphics elements, documents, emails, messaging content, databases, or other application data formats and types. Various textual, alphanumeric, or other character-based information can be included in the user content. For example, a spreadsheet might include social security numbers (SSNs), credit card information, healthcare identifiers, passport numbers, or other information. Although the user entering this data or user content might have authorization to view the sensitive data, other entities or distribution endpoints might not have such authorization. Various privacy policies or data privacy rules can be established that indicate which types of data or user content are sensitive in nature. The enhanced data loss protection (DLP) measures discussed herein can be incorporated to attempt to avoid misappropriation and misallocation of this sensitive data.

In certain content formats or content types (e.g., those included in spreadsheets, slide-based presentations, and graphical diagramming applications), user content can be included in various cells, objects, or other structured or semi-structured data entities. Moreover, sensitive data might be split among more than one data element or entry. The examples herein provide for enhanced identification of sensitive data in user data files that include structured data elements. In addition, the examples herein provide enhanced user interfaces for alerting users to sensitive data. These user interface elements can include markers on individual data elements that contain sensitive data, as well as thresholds for alerting during the editing of content.

In one example application that uses structured data elements (e.g., a spreadsheet application), data can be entered into cells arranged in columns and rows. Each cell can contain user data or user content, and can also include one or more expressions used to perform calculations, which can reference user-entered data in one or more other cells. Other user applications, such as slideshow presentation applications, can include user content on more than one slide as well as within objects included on those slides.

Advantageously, the examples and implementations herein provide enhanced operations and structures for data loss protection services. These enhanced operations and structures have the technical effect of faster identification of sensitive content within documents, and especially within structured documents (e.g., spreadsheets, presentations, graphical drawings). Moreover, multiple applications can share a single classification service that provides detection and identification of sensitive content in user data files across many different applications and end-user platforms. End-user-level annotation and obfuscation processes also provide significant advantages and technical effects in the user interfaces of applications. For example, users can be presented with graphical annotations of sensitive content, along with pop-up dialog boxes that present various obfuscation or masking options. Various enhanced annotation thresholds can also be established to dynamically indicate sensitive content to users, making user content editing and sensitive data obfuscation more efficient and compliant with various data loss protection policies and rules.

Figure 1 is provided as a first example of a data loss protection environment for user applications. Figure 1 illustrates data loss protection environment 100 in an example. Environment 100 includes user platform 110 and data loss protection platform 120. The elements of Figure 1 can communicate over one or more physical or logical communication links. In Figure 1, links 160-161 are shown. However, it should be understood that these links are merely exemplary, and one or more further links can be included, which might include wireless, wired, optical, or logical portions.

The data loss protection framework can include a portion local to a specific user application and a shared portion employed across many applications. User platform 110 provides an application environment in which a user interacts with elements of user application 111 via user interface 112. During user interaction with application 111, content entry and content manipulation can be performed. Application data loss protection (DLP) module 113 can provide portions of the functionality for sensitive data annotation and replacement within application 111. In this example, application DLP module 113 is local to user platform 110, but can instead be separate from, or integrated into, application 111. Application DLP module 113 can provide sensitive data annotation and replacement for the user and for application 111. Data loss protection platform 120 provides the shared portion of the data loss protection framework, and provides shared DLP service 121 for sharing by many applications, such as applications 190 with associated location DLP portion 193.

In operation, application 111 provides user interface 112 through which a user can interact with application 111, such as to enter, edit, and otherwise manipulate user content that can be loaded via one or more data files or entered via user interface 112. In Figure 1, a spreadsheet workbook is shown with cells arranged in rows and columns. As a part of application 111, a data loss protection service is provided that identifies sensitive user content and allows the user to replace the sensitive user content with safe text or data. Sensitive content includes content that might have privacy concerns, privacy policies or rules, or other properties for which dissemination is undesired or unwanted. Data loss in this context refers to the dissemination of private or sensitive data to unauthorized users or endpoints.

To identify sensitive content, application 111 apportions the user content into pieces or chunks and provides them to the data loss protection service. In Figure 1, content portions 140 are shown, with individual content portions 141-145 provided to DLP service 121 over time. Typically, application 111 can process the user content to apportion it into the portions during idle periods (e.g., when one or more processing threads related to application 111 are idle or below an activity threshold). As will be discussed herein, structured user content is transformed into a "flattened" or unstructured arrangement during the apportionment process. This unstructured arrangement has several advantages for processing by DLP service 121.

DLP service 121 then processes each portion or "chunk" of the user content individually to determine whether the portion contains sensitive content. Various classification rules 125 (e.g., data schemes, data patterns, or privacy policies or rules) can be introduced to DLP service 121 for identifying sensitive data. After DLP service 121 parses through each individual chunk of the user content, location offsets of the sensitive data in the user data file are determined for indication to application DLP service 113. A mapper function in application DLP service 113 determines the structural relationship between the chunk offsets and the structure of the document. Indications of the location offsets, sensitive data lengths, and sensitive data types can be provided to application 111, as seen, for example, in sensitive data indications 150. The location offsets indicated by DLP service 121 might not yield an exact or specific location of the sensitive content among the structural elements of the user data file. In these instances, application DLP service 113 of application 111 can employ a mapping process to determine the specific structural elements that contain the sensitive data.

Once the specific locations are determined, application 111 can annotate the sensitive data within user interface 112. This annotation can include global or individual flagging or marking of the sensitive data. The annotations can include "policy tips" presented in the user interface. The user can then be presented with one or more options to obfuscate the user content or otherwise render it unrecognizable as the original sensitive content. Various thresholds for notification of sensitive content can be established, which trigger based on a count or quantity of the sensitive data present in the user data file.

In one example, user data file 114 includes content 115, 116, and 117 in particular cells of user data file 114, which can be associated with a particular worksheet or page of a spreadsheet workbook. Various content can be included in the associated cells, and this content can include potentially sensitive data, such as the examples seen in Figure 1 for an SSN, a phone number, and an address. Some of this content can cross structural boundaries in the user data file, such as spanning multiple cells or multiple graphical objects. If the "chunks" apportion the data into rows or row groupings, then the flattened representation (i.e., stripped of any structural content) can still be used to identify sensitive data within one or more cells.

Elements of each of user platform 110 and DLP platform 120 can include communication interfaces, network interfaces, processing systems, computer systems, microprocessors, storage systems, storage media, or some other processing devices or software systems, and can be distributed among multiple devices or across multiple geographic locations. Examples of elements of each of user platform 110 and DLP platform 120 can include software such as operating systems, applications, logs, interfaces, databases, utilities, drivers, networking software, and other software stored on computer-readable media. Elements of each of user platform 110 and DLP platform 120 can comprise one or more platforms hosted by a distributed computing system or cloud computing service. Elements of each of user platform 110 and DLP platform 120 can comprise logical interface elements, such as software-defined interfaces and application programming interfaces (APIs).

Elements of user platform 110 include application 111, user interface 112, and application DLP module 113. In this example, application 111 comprises a spreadsheet application. It should be understood that user application 111 can comprise any user application, such as productivity applications, communication applications, social media applications, gaming applications, mobile applications, or other applications. User interface 112 comprises graphical user interface elements that can produce output for display to a user and receive input from a user. User interface 112 can comprise the elements discussed below in Figure 9 for user interface system 908. Application DLP module 113 comprises one or more software elements configured to apportion content for delivery to a classification service, annotate data indicated as sensitive, and obfuscate sensitive data, among other operations.

Elements of DLP platform 120 include DLP service 121. DLP service 121 includes an external interface in the form of application programming interface (API) 122, although other interfaces can be employed. DLP service 121 also includes tracker 123 and classification service 124, which are discussed in more detail below. API 122 can include one or more user interfaces, such as web interfaces, APIs, terminal interfaces, console interfaces, command-line shell interfaces, and extensible markup language (XML) interfaces. Tracker 123 maintains counts or quantities of the sensitive data found for a particular document within flattened portions of structured user content, and also maintains a record of the location offsets within the flattened portions that correspond to locations of the sensitive data within the structured user content. Tracker 123 can also perform threshold analysis to determine when a threshold quantity of sensitive data has been found and should be annotated by application DLP module 113. However, in other examples, the threshold and count portions of DLP service 121 can be included in DLP module 113. Classification service 124 parses through flattened user content to determine the presence of sensitive data, and can employ various inputs that define the rules and policies for identifying the sensitive data. Elements of application DLP module 113 and shared DLP service 121 can be configured in arrangements or distributions different from those shown in Figure 1, such as when portions of shared DLP service 121 are included in application DLP module 113 or application 111, among other configurations. In one example, portions of shared DLP service 121 comprise a dynamic link library (DLL) included on user platform 110 for use by application 111 and application DLP module 113.
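The bookkeeping attributed to tracker 123 can be sketched as follows. This is only an illustration under stated assumptions: the class name `Tracker`, the method names, the use of a document identifier as a key, and the `>=` comparison against the threshold are all hypothetical and are not taken from the source.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class Tracker:
    """Illustrative per-document record of where sensitive data was found."""

    def __init__(self, threshold: int):
        self.threshold = threshold
        # Maps a (hypothetical) document id to the offsets, within the
        # flattened content, at which sensitive data was reported.
        self._offsets: DefaultDict[str, Set[int]] = defaultdict(set)

    def record(self, doc_id: str, offset: int) -> None:
        """Remember one finding; repeated reports of the same offset count once."""
        self._offsets[doc_id].add(offset)

    def count(self, doc_id: str) -> int:
        """Number of distinct findings recorded for the document."""
        return len(self._offsets[doc_id])

    def should_annotate(self, doc_id: str) -> bool:
        """Assumption: annotation is warranted once the count reaches the threshold."""
        return self.count(doc_id) >= self.threshold
```

Storing offsets in a set keeps the count stable when the same chunk is re-scanned during an editing session, which is one simple way to avoid double counting.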

Links 160-161, along with other links among the elements of Figure 1 that are not shown for clarity, can each comprise one or more communication links, such as one or more network links comprising wireless or wired network links. The links can comprise various logical interfaces, physical interfaces, or application programming interfaces. Example communication links can use metal, glass, optical, air, space, or some other material as the transport medium. The links can use various communication protocols, such as Internet Protocol (IP), Ethernet, hybrid fiber-coax (HFC), synchronous optical networking (SONET), asynchronous transfer mode (ATM), time-division multiplexing (TDM), circuit-switched, communication signaling, wireless communications, or some other communication format, including combinations, improvements, or variations thereof. The links can be direct links or can include intermediate networks, systems, or devices, and can include a logical network link transported over multiple physical links.

Figure 2 is presented for further discussion of the elements and operation of environment 100. Figure 2 is a block diagram illustrating example configuration 200 of application DLP module 113, highlighting example operations of application DLP module 113, among other elements. In Figure 2, application DLP module 113 includes content apportioner 211, annotator 212, mapper 213, and obfuscator 214. Each of elements 211-214 can comprise software modules employed by application DLP module 113 to operate as discussed below.

In operation, user content is provided to application DLP module 113, such as a spreadsheet file or workbook as seen in Figure 1 for user data file 114. This user data file can be organized in a structured or semi-structured format, such as cells organized by rows and columns in the spreadsheet example. Other data formats can instead be employed, such as slideshow presentations having pages or slides and many individual graphical objects, vector drawing programs with various objects on various pages, word processing documents with various objects (tables, text boxes, pictures), databases, web page content, or other formats, including combinations thereof. The user data file might contain sensitive content or sensitive data. This sensitive data can include any user content that fits one or more patterns or data schemes. Example sensitive data types include social security numbers, credit card numbers, passport numbers, addresses, phone numbers, or other information.

In parallel with editing or viewing of the user data file, content apportioner 211 subdivides the user content into one or more portions or "chunks," which are in a form flattened from the original or native structured or hierarchical form. Content apportioner 211 can then provide these content chunks, along with chunk metadata for each chunk, to shared DLP service 121. The chunk metadata can indicate various chunk properties, such as the location offset of the chunk within the total content and the length of the chunk. The location offset corresponds to the location of the chunk relative to the overall user document or file, and the chunk length corresponds to the size of the chunk.
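The apportionment step can be sketched for the spreadsheet case as follows. This is an illustrative sketch only: the `Chunk` type, the `apportion_rows` function, the choice to group whole rows, and the space and newline separators used for flattening are assumptions, not details given in the source.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    offset: int  # location of the chunk within the flattened document text
    length: int  # size of the chunk in characters
    text: str    # flattened content with cell structure stripped away


def apportion_rows(rows: List[List[str]], rows_per_chunk: int = 2) -> List[Chunk]:
    """Flatten groups of rows into chunks and record offset/length metadata."""
    chunks: List[Chunk] = []
    offset = 0
    for start in range(0, len(rows), rows_per_chunk):
        group = rows[start:start + rows_per_chunk]
        # Cells are joined with spaces and rows with newlines (an assumption).
        text = "\n".join(" ".join(cells) for cells in group)
        chunks.append(Chunk(offset, len(text), text))
        # Skip the newline that separates this chunk from the next one.
        offset += len(text) + 1
    return chunks
```

Because each offset is measured against the flattened text of the whole document, a consumer that only sees chunks can still report findings in terms that the application can later map back to the document.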

Shared DLP service 121 parses the content chunks individually to identify sensitive data within the flattened user content of the chunks, and provides indications of the sensitive data back to application DLP module 113. In some examples discussed below, various thresholds are applied to the counts or quantities of sensitive data before the indications are provided to application DLP module 113. The indications comprise, for each of the chunks that contain sensitive data, an offset, a length of the chunk, and optionally an indicator of the data type or data scheme associated with the sensitive data. The sensitive data indications can be used to determine the actual or specific locations of the sensitive content within the structured data of the user data file. The indicator of the data type can be a symbolically or numerically encoded indicator, such as an integer value, that refers to a list of indicators which mapper 213 can use to identify the data type for annotation.
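A chunk-level classification pass of this kind might look like the sketch below. The `Indication` tuple, the `DATA_TYPES` list, the two regular expressions, and the `classify_chunk` function are hypothetical stand-ins for classification rules 125; real rules would be far more thorough than these simple patterns.

```python
import re
from typing import List, NamedTuple


class Indication(NamedTuple):
    offset: int     # offset of the chunk that contains sensitive data
    length: int     # length of that chunk
    type_code: int  # integer index into DATA_TYPES


# Hypothetical list of data types that the integer indicator refers to.
DATA_TYPES = ["ssn", "phone"]

# Simplified illustrative patterns, one per entry in DATA_TYPES.
PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # e.g. 123-45-6789
    re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),  # e.g. 555-123-4567
]


def classify_chunk(text: str, chunk_offset: int) -> List[Indication]:
    """Return one indication per data type found in the flattened chunk."""
    found = []
    for code, pattern in enumerate(PATTERNS):
        if pattern.search(text):
            found.append(Indication(chunk_offset, len(text), code))
    return found
```

Returning a small integer rather than a type name keeps the response compact; the receiving side looks the code up in its own copy of the list before choosing an annotation.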

Mapper 213 can be employed to convert the offset and length into a specific location within the document or user file. The offset and length correspond to specific chunk identities that are retained by mapper 213 and stored in association with a session identifier. The session identifier can be a unique identifier that persists at least as long as the session during which the user has the document open or viewed. Mapper 213 can be provided with chunk metadata from content apportioner 211 to form a mapped relationship among the chunk offsets, lengths, and session identifiers. Responsive to receiving an indication of sensitive data, mapper 213 can employ the mapped relationship to identify a coarse location within the document that corresponds to the chunk offset and length for the sensitive data indication. Since a chunk can encompass more than one structural or hierarchical element of the user data file, mapper 213 can perform an additional location process to find the specific location of the sensitive data within the user data file.

For example, the offset can indicate a coarse location, such as a particular row or a particular column in a spreadsheet. To determine the specific location (e.g., a cell within the indicated row or column), mapper 213 can use the offset and length, along with local knowledge of the structured data and the user data file itself, to locate the sensitive content within the structured data. Mapper 213 determines where in the user data file the chunk was provided from, such as the associated rows, columns, and worksheets for the spreadsheet example, and the associated slides or pages and objects for the slideshow example. Other examples (e.g., word processing examples) might not have much structure, so the content is more easily flattened, and offsets can be based on document word counts or similar positioning.

In some examples, the specific location is determined by searching for the sensitive content within a particular coarse location. When a particular offset involves multiple structural or hierarchical elements, mapper 213 can iteratively search or walk through each of the elements to locate the sensitive data. For example, if there are "n" levels of structure or hierarchy in a document, mapper 213 can navigate the upper levels first and then the lower levels. In the spreadsheet example, the hierarchy or structure can comprise worksheets having associated rows and columns. In the presentation document example, the hierarchy or structure can comprise slides or pages having associated shapes or objects. Each worksheet and slide indicated by the offset can be stepped through to find the exact cells or objects that contain the sensitive content. In further examples, locating the sensitive data can be accomplished by recreating one or more chunks associated with the coarse location and finding the sensitive data within those recreated chunks to determine the specific location of the sensitive data.
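The two-stage lookup described for mapper 213 can be sketched for a single worksheet as follows. The function `locate_cells`, the `chunk_rows` table that records which rows each chunk offset was built from, and the SSN pattern are assumptions made for illustration; session identifiers and multi-level hierarchies are omitted for brevity.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Simplified illustrative pattern for the data scheme being located.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def locate_cells(
    sheet: List[List[str]],
    chunk_rows: Dict[int, range],
    chunk_offset: int,
    pattern: Pattern[str] = SSN,
) -> List[Tuple[int, int]]:
    """Map a chunk offset to the (row, column) cells holding sensitive data."""
    hits = []
    # Stage 1: the retained chunk metadata gives the coarse location (rows).
    for row in chunk_rows[chunk_offset]:
        # Stage 2: walk each cell in the coarse location to find exact matches.
        for col, value in enumerate(sheet[row]):
            if pattern.search(value):
                hits.append((row, col))
    return hits
```

For example, if the chunk at offset 25 was built from row 2 only, the search touches just that row instead of the whole worksheet, which is the practical benefit of keeping the offset-to-structure mapping on the application side.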

Once the specific location of the sensitive data is determined, annotator 212 can be employed to flag or otherwise mark the sensitive data for the user. The annotation can take the form of a global flag or banner that indicates to the user that sensitive content is present in the user data file. The annotation can also take the form of individual flags that place markers near the sensitive data. In one example, Figure 2 shows configuration 201 with a view of a spreadsheet user interface that has a workbook currently open for viewing or editing. A banner annotation 220 and individual cell annotations 221 are shown. The individual cell annotations 221 include graphical indications that annotate one or more portions of the user content, and include indicators positioned near the one or more portions that are selectable in user interface 112 to present obfuscation options.

When a particular annotation is selected, one or more options can be presented to the user. A pop-up menu 202 may be presented that includes various viewing/editing options, such as cut, copy, and paste. Pop-up menu 202 may also include obfuscation options. Selecting one of the obfuscation options can produce obfuscated content that preserves the data scheme of the associated user content and that consists of symbols chosen to prevent identification of the associated user content while preserving that data scheme. In some examples, the symbols are selected based in part on the data scheme of the associated user content, among other factors. For example, if the data scheme is numeric, letters can be used as the obfuscation symbols. Likewise, if the data scheme is alphabetic, numbers can be used as the obfuscation symbols. For alphanumeric content, a combination of letters and numbers, or other symbols, can be selected as the obfuscation symbols.

In Figure 2, the first obfuscation option replaces the sensitive content with masked or otherwise obfuscated text, while the second obfuscation option replaces all content that has a pattern or data scheme similar to the content of the currently selected annotation. For example, if a cell contains an SSN, the user can be presented with an option to replace the digits of the SSN with "X" characters while leaving the SSN's data scheme intact, that is, leaving the familiar "3-2-4" character arrangement separated by dash characters. A further obfuscation option can replace, with "X" characters, every SSN that fits the pattern of the selected SSN. It should be understood that different example obfuscation options can be presented and that different characters can be used in the replacement process. Whatever obfuscation characters are used, however, the sensitive data is rendered anonymized, sanitized, "cleaned," or unrecognizable as the original content.
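The two obfuscation options described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the SSN regular expression, and the `keep_last` parameter are assumptions introduced here. Digits are replaced with the letter "X" and letters with the digit "0", following the idea that the masking symbol is chosen from outside the data scheme, while delimiters stay in place.

```python
import re

def mask_preserving_scheme(value, keep_last=0):
    """Mask letters and digits but keep delimiters, so the data scheme survives.

    keep_last (hypothetical parameter) leaves the final N alphanumeric
    characters readable, e.g. the last four digits of a card number.
    """
    total = sum(ch.isalnum() for ch in value)
    seen = 0
    out = []
    for ch in value:
        if not ch.isalnum():
            out.append(ch)          # delimiter: keep as-is
            continue
        seen += 1
        if seen > total - keep_last:
            out.append(ch)          # inside the readable tail
        else:
            # digits are masked with a letter, letters with a digit
            out.append("X" if ch.isdigit() else "0")
    return "".join(out)

# Illustrative SSN scheme: 3-2-4 digits separated by dashes.
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

def mask_all_matching(text):
    """Second option: mask every value that fits the selected pattern."""
    return SSN.sub(lambda m: mask_preserving_scheme(m.group()), text)

print(mask_preserving_scheme("123-45-6789"))
print(mask_all_matching("a 123-45-6789 b 111-22-3333"))
```

The first function corresponds to masking a single selected value; the second applies the same masking to every value that fits the pattern.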

Turning now to Figure 3, example configuration 300 is shown to focus on aspects of DLP service 121. In Figure 3, DLP service 121 receives portions of flattened user content provided by content dispatcher 211 in one or more content blocks, along with block metadata that includes at least each block's offset into the total content and the block's length. Two example types of structured user content are shown in Figure 3: spreadsheet content 301 and slideshow/presentation content 302. Spreadsheet content 301 has a structure of rows 321 and columns 322 that define individual cells. Spreadsheet content 301 may also have more than one worksheet 320, delimited by tabs below the worksheet, and each worksheet may have its own set of rows/columns. Each cell may hold user content, such as characters, alphanumeric content, text content, numeric content, or other content. Slideshow content 302 may have one or more slides or pages 323 that include multiple objects 324. Each object may hold user content, such as characters, alphanumeric content, text content, numeric content, or other content.

Content dispatcher 211 subdivides the user content into segments and removes any associated structure, for example by extracting any user content (e.g., text or alphanumeric content) from the cells or objects and then arranging the extracted content into flattened or linear blocks for delivery to DLP service 121. These blocks and the block metadata are provided to DLP service 121 for discovery of potentially sensitive data.
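A minimal sketch of this flattening step is shown below. The `Chunk` type, the `flatten_to_chunks` function, the separator, and the fixed chunk size are all assumptions made for illustration; the description does not specify how blocks are sized or delimited.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    offset: int  # position of the block's first character in the total flattened content
    length: int  # number of characters in the block
    text: str    # extracted content with the sheet/row/column structure removed

def flatten_to_chunks(sheets, chunk_size=32, sep=" "):
    """Extract cell text from sheets -> rows -> cells and emit linear blocks."""
    flat = sep.join(
        cell
        for rows in sheets.values()
        for row in rows
        for cell in row
        if cell
    )
    chunks = []
    for start in range(0, len(flat), chunk_size):
        piece = flat[start:start + chunk_size]
        chunks.append(Chunk(offset=start, length=len(piece), text=piece))
    return chunks

# Hypothetical workbook: one worksheet with two rows of two cells.
workbook = {"Sheet1": [["Name", "SSN"], ["Ann", "123-45-6789"]]}
chunks = flatten_to_chunks(workbook, chunk_size=10)
for c in chunks:
    print(c.offset, c.length, repr(c.text))
```

Because each block carries only an offset and a length, the receiving service needs no knowledge of worksheets, rows, or columns.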

Once DLP service 121 receives individual blocks of user content, classification service 124 performs various processing on the blocks. In addition, tracker 123 maintains data record 332, which includes one or more data structures that associate offsets/lengths and session identifiers with counts of sensitive data found. Data record 332 is stored so that DLP service 121 can provide the offsets/lengths of blocks containing sensitive data back to the requesting application, which can then further locate and annotate any sensitive content found in them.

Classification service 124 parses each of the blocks against various classification rules 331 to identify sensitive data or sensitive content. Classification rules 331 can establish one or more predetermined data schemes defined by one or more expressions, which are used to parse the flattened block/data representations and identify portions of the blocks as indicating one or more predetermined content patterns or one or more predetermined content types.

Sensitive content is typically identified based on data structure patterns or data "schemes" associated with it. These patterns or schemes identify data that fits a pattern or arrangement reflecting a sensitive data type, even though the exact content of the blocks may differ. For example, an SSN can have a data arrangement consisting of a predetermined number of digits interspersed with and separated by a predetermined number of dashes. Classification rules 331 can include various definitions and policies used in identifying sensitive data. These classification rules can include privacy policies, data patterns, data schemes, and threshold policies. A privacy policy can indicate that, because of company, organization, or user policy, among other considerations, certain potentially sensitive data is not to be reported to the application as sensitive. A threshold policy can establish a minimum threshold for sensitive data found across the blocks before the presence of sensitive data is reported to the application. Classification rules 331 can be established by users or by policy makers (e.g., administrators).
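The interplay between a data scheme expressed as a regular expression and a threshold policy can be sketched as below. The rule table, the SSN expression, and the `min_count` threshold are illustrative assumptions, not values taken from the description.

```python
import re

# Hypothetical classification rules: data scheme name -> regular expression.
CLASSIFICATION_RULES = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def classify_chunk(text, base_offset, rules=CLASSIFICATION_RULES):
    """Return (scheme, offset, length) for each match in one flattened block.

    base_offset is the block's offset into the total flattened content, so
    the reported offsets are relative to the whole document, not the block.
    """
    return [
        (name, base_offset + m.start(), m.end() - m.start())
        for name, rx in rules.items()
        for m in rx.finditer(text)
    ]

def should_report(findings, min_count=2):
    """Threshold policy: report only when enough sensitive items were found."""
    return len(findings) >= min_count

findings = classify_chunk("Ann 123-45-6789 Bob 987-65-4321", base_offset=100)
print(findings, should_report(findings))
```

Only offsets and lengths leave the classifier, which is what allows the block text itself to be discarded after processing.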

Additionally, classification service 124 can process the data content through one or more regular expressions handled by regular expression (regex) service 333. Regex service 333 can include regular expression matching and processing services, along with the various regular expressions that users or policy makers can deploy to identify sensitive data. Further examples of regex service 333 are discussed below with respect to Figure 7.

As a specific example, classification process 341 shows several content blocks C₁-C₈, which are linearized versions of content originally held in a structural or hierarchical arrangement in a document or user data file. Classification service 124 processes these blocks to identify which of them contain sensitive data. If any sensitive data is found, an indication can be provided to the application. The indication can include the offset and length of the sensitive data, and is provided to mapper 213 to locate the sensitive data within the structure of the user data file. After processing each block for sensitive data, classification service 124 can discard the block itself. Because the offset and length allow the sensitive data to be found within the original data file, and the original content remains in the data file (unless intervening edits have occurred), the actual blocks need not be retained once they have been processed.

To form the blocks, content dispatcher 211 bundles alphanumeric content (e.g., text) into one or more linear data structures, such as strings or BSTRs (basic strings or binary strings). Classification service 124 processes the linear data structures and determines a list of results. The blocks are checked for sensitive data, and portions of the linear data structures can be determined to have sensitive content. Classification service 124, in conjunction with tracker 123, determines the offsets/lengths corresponding to the blocks in the linear data structures that contain sensitive data. These offsets can indicate coarse locations, which can be translated back into specific locations in the original document (e.g., the user data file) that contains the user content. As blocks are received, tracker 123 can associate each block with the offset/length information indicated in the block metadata. Mapper 213 can use this offset/length information to map back to the structure or hierarchy of the original document.

However, DLP service 121 typically has only partial context back to the original document or user data file, such as that indicated by offsets into the originally generated linear data structures. Furthermore, the linear data structures and the user content itself can be released/deleted by classification service 124 at the end of the classification process. This means that classification service 124 may be unable to search for the sensitive content directly in order to locate it within the original document. Even if classification service 124 could search for the exact sensitive content, it might fail to find it, because the "chunking" algorithm can cross boundaries of hierarchical structures or constructs in the original document or data file. As a specific example, worksheet 320 in a spreadsheet document can have the text "SSN 12345 6789" spanning four adjacent cells. Advantageously, classification service 124 can find this text to be sensitive content. However, because of this boundary-crossing analysis, classification service 124 typically does not have enough data at the end of policy rule evaluation to find the sensitive content in the original document for presentation to the user. Without a mapping back to the document, the user could be left with the false impression that no sensitive content is present.
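The boundary-crossing effect can be illustrated with a short sketch. The way the text is split across the four cells and the SSN pattern are assumptions chosen for the illustration; the description states only that the text spans four adjacent cells.

```python
import re

# Illustrative SSN pattern that tolerates spaces or dashes between groups.
SSN = re.compile(r"\b\d{3}[- ]?\d{2}[- ]?\d{4}\b")

# Assumed split of the text across four adjacent cells.
cells = ["SSN", "123", "45", "6789"]

# Searching cell by cell finds nothing: no single cell holds a full SSN.
per_cell_hits = [bool(SSN.search(c)) for c in cells]

# Searching the flattened content finds the SSN across cell boundaries.
flat = " ".join(cells)
match = SSN.search(flat)

# Map the flat match back to the cells it overlaps, using each cell's
# (start, end) span in the flattened string.
spans, pos = [], 0
for c in cells:
    spans.append((pos, pos + len(c)))
    pos += len(c) + 1  # +1 for the separator
covered = [i for i, (s, e) in enumerate(spans)
           if s < match.end() and e > match.start()]

print(per_cell_hits, match.span(), covered)
```

The match exists only in the flattened view, so annotating the right cells requires the span bookkeeping shown at the end, which is the role the description assigns to the mapper.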

To scan user content for sensitive content efficiently, classification service 124 reads the user content in blocks while the application is idle, performs a partial analysis, and continues the process. When classification service 124 has finished reading all of the content, it has only coarse locations for the sensitive content in the original content, such as only a start/offset and a length. To map back efficiently into a structured or semi-structured document, mapper 213 can employ a combination of techniques. Note that these techniques differ from how spell checking or grammar checking works, in part because the total content, rather than only a word, sentence, or paragraph, may be needed to know whether the content exceeds a threshold.

For each level of physical hierarchy or structure present in the original document (e.g., worksheets in a workbook, or slides in a presentation), mapper 213 uses an identifier to record its presence in a mapping data structure, and further subdivides the content by a reasonable number of hierarchy levels (e.g., rows in a worksheet, shapes in a slide). As each piece of content is processed, mapper 213 tracks the length of the original content and, from the order of insertion into the map, the implicit start of that element. The identifiers can be persistent identifiers that survive across open instances of a particular document, or can differ in each instance of the document. In some examples, the computation that consolidates the presence/absence of sensitive content is deferred until no unprocessed content remains and no pending edits would further change the content.

Assuming sensitive content is present, mapper 213 receives the start and length of each piece of sensitive content from DLP service 121, and looks up the mapping data structure of identifiers and inserted content to find the most precisely mapped region containing the sensitive content, and from that the exact location. For performance reasons, only a certain number of hierarchy levels may be tracked, so a table inside a shape inside a slide, or a cell inside a row inside a worksheet, may not be tracked individually. A partial re-traversal can therefore be performed after the reverse mapping to find the precise location.
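One way to picture this reverse mapping is a row-level map that stores only an identifier and a length per element, derives each start from insertion order, and leaves the cell-level search to a partial re-traversal. The class and function names below are hypothetical, and the sketch assumes each cell is followed by a one-character separator in the flattened content.

```python
from bisect import bisect_right
from itertools import accumulate

class CoarseMap:
    """Stores (identifier, length) per row in insertion order; starts are implicit."""

    def __init__(self):
        self.ids = []
        self.lengths = []

    def add(self, identifier, length):
        self.ids.append(identifier)
        self.lengths.append(length)

    def locate(self, offset):
        """Return (identifier, offset within that element) for a flat offset."""
        # The start of element i is the sum of the lengths inserted before it.
        starts = [0] + list(accumulate(self.lengths))[:-1]
        i = bisect_right(starts, offset) - 1
        return self.ids[i], offset - starts[i]

def find_cell(row_cells, offset_in_row, sep_len=1):
    """Partial re-traversal: walk one row's cells to find the exact column."""
    pos = 0
    for col, cell in enumerate(row_cells):
        if pos <= offset_in_row < pos + len(cell):
            return col
        pos += len(cell) + sep_len
    return None

# Hypothetical worksheet with two rows; each cell contributes len(cell) + 1.
rows = {("Sheet1", 0): ["Name", "SSN"], ("Sheet1", 1): ["Ann", "123-45-6789"]}
cmap = CoarseMap()
for ident, cells in rows.items():
    cmap.add(ident, sum(len(c) + 1 for c in cells))

ident, inner = cmap.locate(13)       # flat offset reported by the classifier
column = find_cell(rows[ident], inner)
print(ident, inner, column)
```

Tracking rows rather than cells keeps the map small; the column is recovered only for rows that actually contain a reported offset.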

In a specific example, a workbook might have 20 worksheets, each with a million rows, and each of those rows might have 50 columns of user data. For a relatively small amount of sensitive data (e.g., only one column in one worksheet holds sensitive data), the classification process could become very memory-intensive, because it would have to remember 20 × 1 million × 50 = 1 billion "length + offset" entries. Dropping the last dimension leaves 20 million entries, a 50-fold memory saving, and this is worthwhile because the computation needed to actually pinpoint the sensitive data in the original document is cheap. Advantageously, a small memory footprint can be maintained for mapping starts/lengths back to the original content.

To further illustrate the operation of the elements of Figures 1-3, a flowchart is presented in Figure 4. Figure 4 presents two main flows: a first flow 400 for identifying sensitive data, and a second flow 401 for annotating and obfuscating sensitive data. First flow 400 can feed into second flow 401, although other configurations are possible.

In Figure 4, DLP service 121 receives (410) subsets of structured user content consolidated into associated flattened representations, each flattened representation having a mapping to the corresponding subset of the structured user content. As mentioned above, the structured content can include spreadsheet content organized into sheets/rows/columns, or can instead include other structures, such as slideshow content organized into slides/objects, drawing program content organized into pages/objects, or text content organized into pages. These subsets of structured user content can include the "blocks" 141-146 shown in Figure 1 or the blocks C₁-C₈ in Figure 3, among others. The structure of the underlying user content is flattened or removed in these subsets to form the blocks, and each subset can be mapped back to the original structure by reference to structural identifiers or locators (e.g., sheet/row/column or slide/object).

DLP service 121 receives these blocks and the block metadata, for example over link 160 or API 122 in Figure 1, and individually parses (411) the flattened representations to classify portions as containing sensitive content that corresponds to one or more predetermined data schemes. Classification rules 125 can establish one or more predetermined data schemes defined by one or more expressions, which are used to parse the flattened block/data representations and identify portions of the blocks as indicating one or more predetermined content patterns or one or more predetermined content types.

If sensitive data is found (412), then for each of the portions DLP service 121 determines (413) an associated offset/length relative to the structured user content, as indicated by tracker 123 and maintained in data record 332. DLP service 121 then indicates (414) at least the associated offsets/lengths of the portions to user application 111, so that user application 111 can mark the sensitive content in user interface 112. If no sensitive data is found, or if an associated threshold is not met, processing can continue with further blocks, or the service can continue monitoring for additional blocks as user application 111 provides them. In addition, edits or changes to the user content can prompt additional or repeated classification of the changed or edited content.

Application DLP module 113 receives (415), from the classification service of DLP service 121, indications of one or more portions of the user content that contain sensitive content, where the indications include the offsets/lengths associated with the sensitive content. Application DLP module 113 presents (416) graphical indications in user interface 112 of user application 111 that annotate the one or more portions of the user content as containing sensitive content. Application DLP module 113 can then present (417) obfuscation options in user interface 112 for masking the sensitive content within at least a selected portion among the one or more portions of the user content. In response to a user selection of at least one of the obfuscation options, application DLP module 113 replaces (418) the associated user content with obfuscated content that preserves the data scheme of the associated user content.

Figure 5 shows sequence diagram 500 to further illustrate the operation of the elements of Figures 1-3. Figure 5 also includes a detailed example structure 510 for some of the processing steps in Figure 5. In Figure 5, application 111 can open a document for viewing or editing by a user. The opening of the document can be detected by application DLP module 113. Any associated policies or classification rules can be pushed to DLP service 121 to define the classification policy. DLP service 121 can then maintain a processing instance for the open document in record 332, which can include a list of several open documents. When DLP module 113 detects an idle processing time frame of application 111, it can present an idle indicator to DLP service 121, which responsively requests blocks of user content for classification. Alternatively, DLP module 113 can push blocks of user content to DLP service 121 during idle periods of application 111. DLP module 113 apportions the user content into blocks, and can determine the blocks based on the text or other content included in the structural or hierarchical objects of the document. Once the blocks are determined, DLP module 113 transfers them to DLP service 121 for classification. DLP service 121 classifies each block individually, applying the classification rules to identify potentially sensitive user content in the blocks. This classification can be an iterative process, to ensure that all blocks transferred by DLP module 113 have been processed. If sensitive data or content is found among the blocks, DLP service 121 indicates the presence of the sensitive data to DLP module 113 for further handling. As mentioned herein, the sensitive data can be indicated by an offset, coarse location, or other location information, together with length information. DLP module 113 can then perform one or more annotation and obfuscation processes on the sensitive data in the document.

Classification rules can be established before the classification process, for example by users, administrators, policy personnel, or other entities. As seen in structure 510, the various rules 511 and 512 can be based on one or more predicates. The predicates are shown in Figure 5 in two categories: content-related predicates 511 and access-related predicates 512. Content-related predicates 511 can include data schemes that indicate sensitive data, such as data patterns, data structure information, or regular expressions that define a data scheme. Access-related predicates 512 include user-level rules, organization-level rules, or other access-based rules, such as content sharing rules that define when sensitive data should not be disseminated or released by particular users, organizations, or other factors.

Policy rules 513 can be established that combine one or more of the content-related predicates and access-related predicates into policies 551-554. Each policy rule also has a priority and an associated action. In general, the priority matches the severity of the action. For example, a policy rule can specify that the application's "save" feature is to be blocked. In another example policy rule, the user content may contain SSNs as defined by a content-related predicate, but according to an access-related predicate those SSNs may be acceptable to disseminate. Most policy rules contain at least one classification predicate among predicates 511-512. These policies can result in one or more actions 514. The actions can include the various annotation operations that an application can take in response to identifying sensitive content, such as notifying the user, notifying the user but allowing an override, blocking a feature/function (e.g., the "save" or "copy" feature), and justified overrides.
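A policy rule that combines predicates with a priority and an action might be modeled as in the sketch below. The rule names, priorities, action strings, and the shape of the context dictionary are all hypothetical; the description does not define a concrete data model.

```python
import re
from dataclasses import dataclass, field

@dataclass
class PolicyRule:
    name: str
    priority: int          # higher priority pairs with a more severe action
    action: str
    predicates: list = field(default_factory=list)  # callables: ctx -> bool

    def matches(self, ctx):
        # A rule applies only when every one of its predicates holds.
        return all(pred(ctx) for pred in self.predicates)

# Content-related predicate: the text contains something shaped like an SSN.
def has_ssn(ctx):
    return re.search(r"\b\d{3}-\d{2}-\d{4}\b", ctx["text"]) is not None

# Access-related predicate: the content is about to leave the organization.
def shared_externally(ctx):
    return ctx["audience"] == "external"

POLICY_RULES = [
    PolicyRule("notify-on-ssn", 1, "notify", [has_ssn]),
    PolicyRule("block-external-ssn", 10, "block_save", [has_ssn, shared_externally]),
]

def select_action(ctx, rules=POLICY_RULES):
    """Pick the action of the highest-priority matching rule, if any."""
    hits = [r for r in rules if r.matches(ctx)]
    return max(hits, key=lambda r: r.priority).action if hits else None

print(select_action({"text": "SSN 123-45-6789", "audience": "external"}))
```

Here the same content-related predicate leads to a notification for internal use and a blocked "save" for external sharing, mirroring how access-related predicates can change the outcome for identical content.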

Figure 6 shows flowchart 600 to further illustrate the operation of the elements of Figures 1-3. Figure 6 focuses on one example overall process for sensitive data identification, annotation, and obfuscation. Subprocess 601 covers policy and rule establishment, storage, and retrieval. These policies and rules can be annotation rules, classification rules, regular expressions, organization/user policies, and other information discussed herein. In operation 611 of Figure 6, various detection rules 630 and replacement rules 631 can be introduced through a user interface or API to configure detection policies. Detection rules 630 and replacement rules 631 can include the various predicates and rules found in Figure 5, among others. Users, administrators, policy personnel, or other entities can introduce detection rules 630 and replacement rules 631, for example by establishing policies for users, organizations, or application usage, among other entities and activities. In operation 612, detection rules 630 and replacement rules 631 can be stored on one or more storage systems for later use. When one or more clients wish to use the policies established by detection rules 630 and replacement rules 631, the policies can be downloaded or retrieved in operation 613. For example, annotation rules can be downloaded by an application for annotating sensitive content in a user interface, while classification rules can be downloaded by a shared DLP service for classifying user content as sensitive.

Subprocess 602 covers client-side application activities, such as loading documents for editing or viewing in a user interface and providing blocks of those documents for classification. In operation 614, a client application can provide one or more end-user experiences for processing, editing, or viewing user content, among other operations. Operation 614 can also provide the annotation and obfuscation processes discussed later. Operation 615 provides portions of the user content to the shared DLP service for classification. In some examples, the portions comprise flattened blocks of user content from which the associated structure or hierarchy of the original document has been stripped.

Subprocess 603 covers classifying the user content to detect sensitive data in it, and annotating that sensitive data for the user. In operation 616, various detection rules are applied, such as the regular expressions discussed below with respect to Figure 7, among other detection rules and processes. If sensitive data is found, operation 617 determines whether the user should be notified. If the quantity of sensitive data is below an alert threshold quantity, notification may not occur. If the user is to be alerted, however, operation 619 can compute the locations of the sensitive data within the detected regions of the structured data. As discussed herein, a mapping process can be employed to determine the specific locations of the sensitive data within structural or hierarchical elements from the flattened-data offsets and lengths of the sensitive data strings or portions. Once these specific locations are determined, operation 618 can display them to the user. Annotations or other highlighting user interface elements are used to signal to the user that sensitive data is present in the user content.

Subprocess 604 covers obfuscating sensitive data within user content that includes structural or hierarchical elements. In operation 621, user input can be received to replace at least one instance of sensitive data with "safe" or obfuscated data/text. When the user is shown a highlighted region indicating the piece of sensitive data that caused an annotation or "policy tip" to appear, the user can be presented with an option to replace the sensitive data with "safe text" that obfuscates it. Depending on the choices made by the entity that initially set the policy in operation 611, operations 622 and 624 determine and generate one or more replacement or obfuscation rules. The obfuscation rules can be used to replace internal code names with marketing-approved names, to obfuscate personally identifiable information (PII) with boilerplate names, and to replace numeric sensitive data with a set of characters that indicates the type of sensitive data (e.g., credit card number, Social Security number, vehicle identification number) to future viewers of the document without revealing the actual sensitive data. Operation 623 replaces the sensitive data with the obfuscated data. The obfuscated data can replace numeric sensitive data with a set of characters that confirms the data scheme or content type but is insufficient, even for a determined individual, to derive the original data (e.g., it shows that a piece of content is an SSN without revealing the actual SSN). The user can replace sensitive content with obfuscated text one instance at a time, or can perform a bulk replacement from a user interface that shows multiple instances of sensitive content.

Replacement of sensitive content, such as text or alphanumeric content, can be accomplished with regular expressions, or alternatively via nondeterministic finite automata (NFA), deterministic finite automata (DFA), pushdown automata (PDA), Turing machines, arbitrary functional code, or other processes. Replacement of sensitive content typically involves pattern matching within text or content. By considering whether the target pattern allows multiple possible characters at a given position in the string, the pattern matching can leave unmasked those characters that do not need to be masked, such as delimiter characters. For example, the string "123-12-1234" might become "xxx-xx-xxxx", and the string "123 12 1234" might become "xxx xx xxxx" after the masking process. The pattern matching can also keep certain portions discernible for uniqueness purposes, such as the last predetermined number of digits of a credit card number or SSN. For example, "1234-1234-1234-1234" might become "xxxx-xxxx-xxxx-1234" after the masking process. For code name masking/replacement, not everything is a pattern; the targets may instead be actual internal code names or other keywords. For example, the code name "Whistler" might become "Windows XP" after the masking process. Furthermore, patterns that replace a varying number of characters with safe text can be permitted, either to keep the length consistent or to set the length to a known constant. For example, the same rule might turn "1234-1234-1234-1234" into "xxxx-xxxx-xxxx-1234" and into "xxxxx-xxxxx-x1234" after the masking process. This may require patterns that contain enough data to handle any of these cases. Regular expressions can handle such scenarios by augmenting the regular expression, enclosing each atomic match expression in parentheses, and keeping track of which augmented "match" statement is paired with which "replace" statement. Further examples of regular expression matching are seen in Figure 7 below.
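The masking examples above can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: the helper names `mask_digits` and `replace_code_names`, the `keep_last` parameter, and the `CODE_NAMES` table are hypothetical and do not come from the specification.

```python
import re

def mask_digits(text, keep_last=0):
    """Replace digits with 'x', leaving delimiters and the last `keep_last` digits intact."""
    total = sum(ch.isdigit() for ch in text)
    seen = 0
    out = []
    for ch in text:
        if ch.isdigit():
            seen += 1
            # Keep only the trailing `keep_last` digits readable.
            out.append(ch if seen > total - keep_last else "x")
        else:
            out.append(ch)  # delimiters such as '-' or ' ' pass through unmasked
    return "".join(out)

# Hypothetical keyword table: code names are keywords, not patterns.
CODE_NAMES = {"Whistler": "Windows XP"}

def replace_code_names(text, names=CODE_NAMES):
    """Swap each internal code name for its public replacement."""
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: names[m.group(0)], text)

masked_ssn = mask_digits("123-12-1234")
masked_card = mask_digits("1234-1234-1234-1234", keep_last=4)
public_text = replace_code_names("Ship Whistler soon")
```

Masking character by character keeps the output the same length as the input; a rule that sets the length to a known constant would need a separate replacement template.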

To maintain the integrity of the annotation and classification processes across more than one document or file, various processes can be established. Detection/classification, annotation, and obfuscation rules and policies are typically not included in the document file. This allows policies to change and prevents reverse engineering of the obfuscation techniques. For example, if a user saves a document, then closes and reloads the same document, the rules that determine which portions of the document contain enough sensitive data for its presence to be considered a policy issue may have changed. In addition, annotation flags should not be included in clipboard operations such as cut, copy, or paste. If a user copies content from one document and pastes it into another, the second document might apply different detection/classification, annotation, and obfuscation rules. If a user copies text content from a first document and pastes it into a second document, the annotations from the first document should be treated as irrelevant until the content is reclassified. Even if a user copies content from one place in a document to another place in the same document, the count of sensitive content may change, and what needs to be highlighted throughout the document may change.

Figure 7 illustrates flow diagram 700 to further illustrate the operation of the elements of Figures 1-3. Figure 7 focuses on regular expression operations in the sensitive data obfuscation process. In Figure 7, given a regular expression (regex), such as the fictitious driver's license example regex 730, and a string that matches it, a full match can be generated by first augmenting the regex, at least by enclosing each separable character-matching expression (e.g., each atom) in parentheses, as indicated in operation 711. The augmented regex can then be reapplied or executed in operation 712 to perform the obfuscation or masking process. For each match, operations 713-714 determine the broadest and narrowest sets of characters actually matched. For example, when the matched character is "-", the match is narrow because it is a single character. When the matched characters belong to the set of all alphabetic characters, the match is broad. The absolute count of characters that could appear in any region is the key determinant. The obfuscation in operation 715 can replace characters according to the breadth of the match. For characters matched as a single character, the obfuscation process makes no change. For characters matched within a broad group, the obfuscation process replaces them with a "safe" character that is not a member of that set. For example, the set of all letters becomes "0", the set of all digits becomes "X", and mixed alphanumeric content becomes "?", with a fallback list of characters used until exhausted. Once the text or content has passed through the obfuscation or masking process, operation 716 confirms that the text or content has been successfully rendered obfuscated when the new text/content string no longer matches the original regex.
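The flow of operations 711 through 716 can be approximated in Python as follows. This is a simplified sketch under several assumptions: the regex is supplied already split into atoms (the specification does not say how the split is done), atoms contain no capturing groups of their own, breadth is judged from the matched text rather than from the atom's full character set, and the fallback list of safe characters is reduced to three fixed choices. The function name `obfuscate` and the sample pattern are hypothetical.

```python
import re

def obfuscate(atoms, text):
    """Mask `text` that matches the regex formed by concatenating `atoms`."""
    original = "".join(atoms)
    # Operation 711: augment the regex by wrapping each atom in parentheses.
    augmented = "".join(f"({atom})" for atom in atoms)
    # Operation 712: re-apply the augmented regex.
    match = re.fullmatch(augmented, text)
    if match is None:
        return text  # nothing to obfuscate
    out = []
    for atom, piece in zip(atoms, match.groups()):
        # Operations 713-715: narrow (literal) matches stay, broad matches are replaced.
        if piece == atom or re.escape(piece) == atom:
            out.append(piece)
        elif piece.isalpha():
            out.append("0" * len(piece))   # letters -> a non-letter
        elif piece.isdigit():
            out.append("X" * len(piece))   # digits -> a non-digit
        else:
            out.append("?" * len(piece))   # mixed alphanumeric content
    result = "".join(out)
    # Operation 716: the result must no longer match the original regex.
    if re.fullmatch(original, result) is not None:
        raise ValueError("obfuscated text still matches the original pattern")
    return result

# Hypothetical stand-in for a fictitious license format: two letters, a dash, six digits.
license_atoms = [r"[A-Z]{2}", "-", r"\d{6}"]
masked = obfuscate(license_atoms, "WA-123456")
```

Replacing letters with a digit and digits with a letter is what makes the final check succeed: each safe character lies outside the set its atom accepts.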

Figure 8 illustrates graph 800 to further illustrate the operation of the elements of Figures 1-3. Figure 8 focuses on the enhanced threshold process used when annotating sensitive data in a user interface. The operations of Figure 8 can include enhanced hysteresis operations for annotating sensitive data, and various thresholds or annotation rules can be established by policy administrators or users, among other entities.

Figure 8 includes graph 800, which has a vertical axis indicating the number of sensitive data/content items present in a document and a horizontal axis indicating time. A first threshold 820 is established, which can initiate the presentation or removal of annotations of sensitive content in the user interface. A second threshold 822 can be established, which can also initiate the presentation or removal of annotations of sensitive content. An elasticity factor 821 and a resiliency attribute 823 can be established to modify the behavior of the first and second thresholds.

When sensitive data is annotated in the user interface, such as by flags, markings, or highlighting, the user may edit the sensitive content to fix sensitive content issues (for example, by selecting one or more obfuscation options). However, once a threshold number of sensitive content issues have been resolved, there may not be enough remaining instances to warrant annotating the document as being, overall, in violation of the sensitive content rules for the organization or save location. Likewise, when new sensitive content is introduced into the document, there may be enough instances to warrant annotating the document to indicate the sensitive content to the user.

During a user's content editing process, enabling and disabling annotation indicators for one or more content elements can be based at least in part on the current number of content elements relative to the annotation rules. The annotation rules can include at least a first threshold number 820, an elasticity factor 821 that, when enabled, modifies the first threshold number 820 into a second threshold number 822, and an indication of a threshold resiliency or "stickiness" attribute 823 that indicates when the second threshold number 822 overrides the first threshold number 820. An annotation service, such as annotator 212, can determine or identify annotation rules, such as policy rules 513 and actions 514 discussed for Figure 5, that are established for target entities associated with the content editing. The target entities can include the user performing the content editing, an organization that includes the user performing the content editing, or an application type of the user application, among others. While the user edits a document that contains or potentially contains sensitive content, annotator 212 monitors the user content in the associated user data file, which is presented for content editing in the user interface of the user application. Annotator 212 identifies the number of content elements in the user content that contain sensitive content corresponding to one or more predetermined data schemes discussed herein. The content elements can include cells, objects, shapes, words, or other data structures or data hierarchy elements.

During editing, and based at least on the number of content elements exceeding the first threshold number, annotator 212 initiates presentation in the user interface of at least one annotation indicator that marks the user content in the user interface as containing at least first sensitive content. In Figure 8, which starts with annotations in the "off" state, the first threshold 820 indicates an example number of "8" at transition point 830 as triggering presentation of the annotation indicator in the user interface. The number of content elements with sensitive content can increase, for example through user editing, and may then decrease after the user sees that sensitive content is present and begins selecting obfuscation options to mask it.

Based at least on the number of content elements initially exceeding the first threshold number 820 and subsequently falling below the first threshold number 820 while the elasticity factor 821 is applied to the first threshold number 820, annotator 212 establishes the second threshold number 822 based at least on the elasticity factor. When the second threshold number 822 is active (that is, when the elasticity factor 821 is applied to the first threshold number 820), the second threshold number 822 is used to initiate removal of the presentation of the at least one annotation indicator when the number falls below the second threshold number 822, as seen at transition point 832. However, based at least on the number of content elements initially exceeding the first threshold number 820 and subsequently falling below the first threshold number 820 while the elasticity factor is not applied to the first threshold number 820, the presentation of the at least one annotation indicator is removed, as indicated by transition point 831.

The elasticity factor 821 can be a percentage from 0 to 100%, or another metric. In a specific example, an annotation rule can be established that defines the inclusion of more than 100 SSNs in a document as a violation of company policy. During editing of a document with more than 100 SSNs, the annotation rule for the first threshold number can prompt highlighting of all SSNs in the document. As the user begins obfuscating the SSNs, the number of remaining unobfuscated SSNs decreases. The elasticity factor can maintain the annotation or highlighting of the SSNs even when the first threshold number 820 that triggered the annotation is no longer met, such as when 99 SSNs remain unobfuscated. An elasticity factor of 100 would correspond to the unmodified first threshold number, and an elasticity factor of 0 would correspond to not removing the annotations until all SSNs have been obfuscated. An intermediate elasticity factor of 50 would correspond to removing the annotations once the 50th entry has been fixed after the annotations were initially triggered for presentation. Thus, in the example of Figure 8, the elasticity factor establishes a second threshold number for removing annotations once the annotations have been presented to the user. In this example, the second threshold number 822 is at "2", so the annotations are removed when the number of remaining sensitive content issues falls below "2", as indicated by transition point 832.
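One way to read the elasticity examples above is that the second threshold is the first threshold scaled by the elasticity percentage. The Python sketch below encodes that reading; the function name `second_threshold`, the integer rounding, and the 25% figure used to reproduce the Figure 8 values are assumptions for illustration, not statements from the specification.

```python
def second_threshold(first_threshold, elasticity_percent):
    """Scale the first threshold by an elasticity percentage in the range 0 to 100."""
    if not 0 <= elasticity_percent <= 100:
        raise ValueError("elasticity_percent must be between 0 and 100")
    # Integer floor division is an assumed rounding choice.
    return first_threshold * elasticity_percent // 100

unmodified = second_threshold(100, 100)   # same as the first threshold
until_clean = second_threshold(100, 0)    # annotations stay until every item is fixed
halfway = second_threshold(100, 50)
figure_8 = second_threshold(8, 25)        # assumed 25% yields the "2" shown in Figure 8
```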

If the number has fallen below the second threshold number 822 and additional sensitive content issues then arise during content editing, annotator 212 must decide when to alert the user by presenting the annotations again. Based at least on the number of content elements initially being below the second threshold number 822 and subsequently exceeding the second threshold number 822 while the threshold resiliency attribute 823 is applied to the second threshold number 822, annotator 212 initiates presentation in the user interface of further annotations that mark the user content in the user interface as containing sensitive content, as indicated by transition point 833.

The resiliency attribute 823 is a "stickiness" property of the second threshold number 822 and is defined by an on/off or Boolean condition. When disabled, the second threshold number 822 is not used to re-present annotations when exceeded. When enabled, the second threshold number 822 is used to re-present annotations when exceeded. Thus, based at least on the number of content elements initially being below the second threshold number 822 and subsequently exceeding the second threshold number 822 while the resiliency attribute is not applied to the second threshold number 822, annotator 212 withholds presentation of annotations that mark the user content in the user interface as containing at least sensitive content until the number of content elements again exceeds the first threshold number 820.
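The threshold behavior at transition points 830 through 833 can be summarized as a small hysteresis state machine. The Python sketch below is one possible reading, not the specified implementation: the class and field names are hypothetical, comparisons are strict ("exceeds" and "falls below"), the second threshold is assumed to be the first threshold scaled by the elasticity percentage, and the second threshold is assumed to become active only after annotations have been shown once.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationRule:
    first_threshold: int
    elasticity_percent: Optional[int] = None  # None means the elasticity factor is not applied
    resilient: bool = False                    # the "stickiness" attribute

class AnnotationState:
    def __init__(self, rule: AnnotationRule):
        self.rule = rule
        self.shown = False          # annotations start in the "off" state
        self.second_active = False  # the second threshold exists only after a first trigger

    def update(self, count: int) -> bool:
        """Feed the current number of sensitive items; return whether annotations are shown."""
        rule = self.rule
        elastic = rule.elasticity_percent is not None
        removal = (rule.first_threshold * rule.elasticity_percent // 100
                   if elastic else rule.first_threshold)
        if self.shown:
            # Transition 832 (elastic) or 831 (not elastic): remove annotations.
            if count < removal:
                self.shown = False
        else:
            # Transition 833 uses the second threshold only when resiliency is enabled.
            trigger = removal if (self.second_active and rule.resilient) else rule.first_threshold
            if count > trigger:
                self.shown = True
                self.second_active = elastic
        return self.shown

sticky = AnnotationState(AnnotationRule(first_threshold=8, elasticity_percent=25, resilient=True))
sticky_trace = [sticky.update(n) for n in [5, 9, 6, 1, 3]]

plain = AnnotationState(AnnotationRule(first_threshold=8, elasticity_percent=25, resilient=False))
plain_trace = [plain.update(n) for n in [5, 9, 6, 1, 3, 9]]
```

With resiliency enabled, a count of 3 is enough to bring the annotations back after they were removed at 1; with it disabled, the count has to exceed the first threshold of 8 again.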

Turning now to Figure 9, computing system 901 is presented. Computing system 901 is representative of any system or collection of systems in which the various operational architectures, scenarios, and processes disclosed herein may be implemented. For example, computing system 901 can be used to implement either user platform 110 or DLP platform 120 of Figure 1. Examples of computing system 901 include, but are not limited to, server computers, cloud computing systems, distributed computing systems, software-defined networking systems, computers, desktop computers, hybrid computers, rack servers, web servers, cloud computing platforms, and data center equipment, as well as any other type of physical or virtual server machine, other computing systems and devices, and any variation or combination thereof. When portions of computing system 901 are implemented on user devices, example devices include smartphones, laptop computers, tablet computers, desktop computers, gaming systems, entertainment systems, and the like.

Computing system 901 may be implemented as a single apparatus, system, or device, or may be implemented in a distributed manner as multiple apparatuses, systems, or devices. Computing system 901 includes, but is not limited to, processing system 902, storage system 903, software 905, communication interface system 907, and user interface system 908. Processing system 902 is operatively coupled with storage system 903, communication interface system 907, and user interface system 908.

Processing system 902 loads and executes software 905 from storage system 903. Software 905 includes application DLP environment 906 and/or shared DLP environment 909, which are representative of the processes discussed with respect to the preceding figures. When executed by processing system 902 to process user content for identification, annotation, and obfuscation of sensitive content, software 905 directs processing system 902 to operate as described herein for at least the various processes, operational scenarios, and environments discussed in the foregoing implementations. Computing system 901 may optionally include additional devices, features, or functionality not discussed for purposes of brevity.

Referring still to Figure 9, processing system 902 may comprise a microprocessor and other circuitry that retrieves software 905 from storage system 903 and executes it. Processing system 902 may be implemented within a single processing device, but may also be distributed across multiple processing devices or subsystems that cooperate in executing program instructions. Examples of processing system 902 include general-purpose central processing units, application-specific processors, and logic devices, as well as any other type of processing device, combinations, or variations thereof.

Storage system 903 may comprise any computer-readable storage media readable by processing system 902 and capable of storing software 905. Storage system 903 may include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information, such as computer-readable instructions, data structures, program modules, or other data. Examples of storage media include random access memory, read-only memory, magnetic disks, optical disks, flash memory, virtual memory and non-virtual memory, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other suitable storage media. In no case is the computer-readable storage media a propagated signal.

In addition to computer-readable storage media, in some implementations storage system 903 may also include computer-readable communication media over which at least some of software 905 may be communicated internally or externally. Storage system 903 may be implemented as a single storage device, but may also be implemented across multiple storage devices or subsystems that are co-located or distributed relative to each other. Storage system 903 may comprise additional elements, such as a controller, capable of communicating with processing system 902 or possibly other systems.

Software 905 may be implemented in program instructions and, among other functions, may, when executed by processing system 902, direct processing system 902 to operate as described with respect to the various operational scenarios, sequences, and processes illustrated herein. For example, software 905 may include program instructions for implementing the dataset processing environments and platforms discussed herein.

In particular, the program instructions may include various components or modules that cooperate or otherwise interact to carry out the various processes and operational scenarios described herein. The various components or modules may be embodied in compiled or interpreted instructions, or in some other variation or combination of instructions. They may be executed synchronously or asynchronously, serially or in parallel, in a single-threaded or multi-threaded environment, or in accordance with any other suitable execution paradigm, variation, or combination thereof. Software 905 may include additional processes, programs, or components, such as operating system software, virtual machine software, or other application software, in addition to or including application DLP environment 906 or shared DLP environment 909. Software 905 may also comprise firmware or some other form of machine-readable processing instructions executable by processing system 902.

In general, software 905 may, when loaded into processing system 902 and executed, transform a suitable apparatus, system, or device (of which computing system 901 is representative) overall from a general-purpose computing system into a special-purpose computing system customized to facilitate enhanced application collaboration. Indeed, encoding software 905 on storage system 903 may transform the physical structure of storage system 903. The specific transformation of the physical structure may depend on various factors in different implementations of this description. Examples of such factors include, but are not limited to, the technology used to implement the storage media of storage system 903 and whether the computer storage media are characterized as primary or secondary storage, as well as other factors.

For example, if the computer-readable storage media are implemented as semiconductor-based memory, software 905 may transform the physical state of the semiconductor memory when the program instructions are encoded therein, such as by transforming the state of transistors, capacitors, or other discrete circuit elements constituting the semiconductor memory. A similar transformation may occur with respect to magnetic or optical media. Other transformations of physical media are possible without departing from the scope of the present description; the foregoing examples are provided only to facilitate this discussion.

Application DLP environment 906 and shared DLP environment 909 each include one or more software elements, such as OS 921/931 and applications 922/932. These elements can describe various portions of computing system 901 with which users, data sources, data services, or other elements interact. For example, OS 921/931 can provide a software platform on which applications 922/932 execute, and applications 922/932 allow user content to be processed for identification, annotation, and obfuscation of sensitive content, among other functions.

In one example, DLP service 932 includes content dispatcher 924, annotator 925, mapper 926, and obfuscator 927. Content dispatcher 924 flattens structured or hierarchical user content elements into linear chunks for processing by a classification service. Annotator 925 graphically highlights sensitive data or content in the user interface so that the user can be alerted to the presence of a threshold quantity of sensitive data. Mapper 926 can derive the specific locations in the document for sensitive data annotations, such as when the classification service provides only offsets, lengths, and IDs to locate sensitive data within the various structured or hierarchical elements of the document. Obfuscator 927 presents obfuscation options for masking or replacing user content that has been identified as sensitive data. Obfuscator 927 also replaces the sensitive content in response to user selection of an obfuscation option.

In another example, DLP service 933 includes classification service 934, tracker 935, policy/rule module 936, and regex service 937. Classification service 934 parses linear chunks of data or content to identify sensitive data. Tracker 935 maintains a count or quantity of the sensitive data items found by classification service 934, and indicates the sensitive data offsets and lengths to the elements used for annotation in the document (for example, mapper 926 and annotator 925). Policy/rule module 936 can receive and maintain various policies and rules for annotating, classifying, detecting, obfuscating, or otherwise operating on user content. Regex service 937 comprises one example classification technique, which uses regular expression matching to identify sensitive data by data pattern or data scheme and to replace the text of matched content with obfuscated content.

Communication interface system 907 may include communication connections and devices that allow communication with other computing systems (not shown) over communication networks (not shown). Examples of connections and devices that together allow inter-system communication include network interface cards, antennas, power amplifiers, RF circuitry, transceivers, and other communication circuitry. The connections and devices may communicate over communication media, such as metal, glass, air, or any other suitable communication media, to exchange communications with other computing systems or networks of systems. Physical or logical elements of communication interface system 907 can receive datasets from telemetry sources, transfer datasets and control information between one or more distributed data storage elements, and interface with a user to receive data selections and provide visualized datasets, among other features.

User interface system 908 is optional and may include a keyboard, a mouse, a voice input device, and a touch input device for receiving input from a user. Output devices such as a display, speakers, web interfaces, terminal interfaces, and other types of output devices may also be included in user interface system 908. User interface system 908 can provide output and receive input over a network interface, such as communication interface system 907. In network examples, user interface system 908 might packetize display or graphics data for remote display by a display system or computing system coupled over one or more network interfaces. Physical or logical elements of user interface system 908 can receive classification rules or policies from users or policy personnel, receive data editing activity from users, present sensitive content annotations to users, provide obfuscation options to users, and present obfuscated user content to users, among other operations. User interface system 908 may also include associated user interface software executable by processing system 902 in support of the various user input and output devices discussed above. Separately or in conjunction with each other and with other hardware and software elements, the user interface software and user interface devices may support a graphical user interface, a natural user interface, or any other type of user interface.

Communication between computing system 901 and any other computing system (not shown) may occur over a communication network or multiple communication networks and in accordance with various communication protocols, combinations of protocols, or variations thereof. Examples include intranets, the Internet, local area networks, wide area networks, wireless networks, wired networks, virtual networks, software-defined networks, data center buses, computing backplanes, or any other type of network, combination of networks, or variation thereof. The aforementioned communication networks and protocols are well known and need not be discussed at length here. However, some communication protocols that may be used include, but are not limited to, the Internet Protocol (IP, IPv4, IPv6, etc.), the Transmission Control Protocol (TCP), and the User Datagram Protocol (UDP), as well as any other suitable communication protocol, variation, or combination thereof.

Certain inventive aspects may be appreciated from the foregoing disclosure, of which the following are various examples.

Example 1: A method of operating a user application, the method comprising: identifying at least a first threshold quantity, an elasticity factor that, when enabled, modifies the first threshold quantity to a second threshold quantity, and an indication of a threshold bounce attribute indicating when the second threshold quantity overrides the first threshold quantity; and monitoring a content editing process of user content in a user data file to identify a quantity of content elements among the user content that contain sensitive data corresponding to one or more predetermined data schemes. The method includes, during the content editing process, enabling and disabling presentation of annotation indicators for one or more of the content elements based at least in part on: a current quantity of the content elements relative to the first threshold quantity, the elasticity factor for the first threshold quantity when enabled, and the indication of the threshold bounce attribute.

Example 2: The method of Example 1, wherein the annotation indicators comprise one or more of: a global indicator presented in a user interface of the user application, the global indicator applying to the user data file; and individual indicators presented in the user interface, each positioned near an individual content element that contains the sensitive data.

Example 3: The method of Example 1, further comprising, during the content editing process: based at least on the current quantity of content elements exceeding the first threshold quantity, initiating presentation of at least one annotation indicator in the user interface, the at least one annotation indicator flagging the user content in the user interface as containing at least first sensitive data. The method further includes, during the content editing process, based at least on the current quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is applied to the first threshold quantity, establishing the second threshold quantity based at least on the elasticity factor for use in removing the presentation of the at least one annotation indicator. The method further includes, during the content editing process, based at least on the current quantity of content elements falling below the second threshold quantity when the elasticity factor is applied to the first threshold quantity, initiating removal of the presentation of the at least one annotation indicator. The method further includes, during the content editing process, based at least on the current quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the threshold bounce attribute is applied to the second threshold quantity, initiating presentation of at least one additional annotation indicator in the user interface, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.

Example 4: The method of Example 3, further comprising: during the content editing process, based at least on the current quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is not applied to the first threshold quantity, removing the presentation of the at least one annotation indicator. The method further includes, during the content editing process, based at least on the current quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the bounce attribute is not applied to the second threshold quantity, withholding presentation of at least one additional annotation indicator until the quantity of content elements exceeds the first threshold quantity, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.
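The threshold behavior recited in Examples 1, 3, and 4 amounts to a hysteresis rule, which may be easier to follow as code. The following Python sketch is illustrative only and is not taken from the disclosure: the names `AnnotationPolicy` and `AnnotationState` are hypothetical, and it assumes the elasticity factor is a fraction between 0 and 1 that proportionally lowers the first threshold to obtain the second threshold.

```python
from dataclasses import dataclass


@dataclass
class AnnotationPolicy:
    """Hypothetical container for the three settings named in Example 1."""
    first_threshold: int              # count above which indicators appear
    elasticity: float = 0.0           # assumed to be a fraction in [0, 1]
    elasticity_enabled: bool = False  # whether the elasticity factor is applied
    bounce_enabled: bool = False      # whether the bounce attribute is applied

    @property
    def second_threshold(self) -> float:
        # Assumption: elasticity lowers the first threshold proportionally.
        return self.first_threshold * (1.0 - self.elasticity)


class AnnotationState:
    """Tracks whether annotation indicators are shown while content is edited."""

    def __init__(self, policy: AnnotationPolicy) -> None:
        self.policy = policy
        self.showing = False
        self.fell_below_second = False

    def update(self, count: int) -> bool:
        """Feed the current count of sensitive elements; return visibility."""
        p = self.policy
        if self.showing:
            # With elasticity, indicators persist down to the second threshold.
            removal = p.second_threshold if p.elasticity_enabled else p.first_threshold
            if count < removal:
                self.showing = False
                self.fell_below_second = p.elasticity_enabled
        else:
            # With bounce, the second threshold re-triggers the indicators;
            # otherwise they stay hidden until the first threshold is exceeded.
            use_second = p.bounce_enabled and self.fell_below_second
            trigger = p.second_threshold if use_second else p.first_threshold
            if count > trigger:
                self.showing = True
        return self.showing
```

With a first threshold of 100, an elasticity of 0.25, and both options enabled, feeding the counts 90, 101, 80, 74, 76 yields hidden, shown, shown, hidden, shown. With the bounce attribute disabled, the final count of 76 leaves the indicators hidden until the count again exceeds 100.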

Example 5: A data privacy annotation framework for a data application, comprising: one or more computer-readable storage media; a processing system operatively coupled with the one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media. Based at least on being read and executed by the processing system, the program instructions direct the processing system to at least: identify a first threshold quantity, an elasticity factor for the first threshold quantity, and an indication of a threshold bounce attribute; and monitor user content in a user data file presented for content editing in a user interface of the user application to identify a quantity of content elements among the user content that contain sensitive data corresponding to one or more predetermined data schemes. The program instructions further direct the processing system to: during the content editing, and based at least on the quantity of content elements exceeding the first threshold quantity, initiate presentation of at least one annotation indicator in the user interface, the at least one annotation indicator flagging the user content in the user interface as containing at least first sensitive data. The program instructions further direct the processing system to: during the content editing, and based at least on the quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is applied to the first threshold quantity, establish a second threshold quantity based at least on the elasticity factor for use in removing the presentation of the at least one annotation indicator. The program instructions further direct the processing system to: during the content editing, and based at least on the quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the threshold bounce attribute is applied to the second threshold quantity, initiate presentation of at least one additional annotation indicator in the user interface, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.

Example 6: The data privacy annotation framework of Example 5, comprising further program instructions that, based at least on being read and executed by the processing system, direct the processing system to at least: during the content editing, based at least on the quantity of content elements falling below the second threshold quantity when the elasticity factor is applied to the first threshold quantity, initiate removal of the presentation of the at least one annotation indicator.

Example 7: The data privacy annotation framework of Example 5, comprising further program instructions that, based at least on being read and executed by the processing system, direct the processing system to at least: during the content editing, based at least on the quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is not applied to the first threshold quantity, remove the presentation of the at least one annotation indicator.

Example 8: The data privacy annotation framework of Example 5, comprising further program instructions that, based at least on being read and executed by the processing system, direct the processing system to at least: during the content editing, based at least on the quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the bounce attribute is not applied to the second threshold quantity, withhold presentation of at least one additional annotation indicator until the quantity of content elements exceeds the first threshold quantity, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.

Example 9: The data privacy annotation framework of Example 5, wherein identifying one or more of the first threshold quantity, the elasticity factor for the first threshold quantity, and the indication of the threshold bounce attribute comprises determining an annotation policy established for a target entity associated with the content editing, the annotation policy comprising one or more of: the first threshold quantity, the elasticity factor for the first threshold quantity, and the indication of the threshold bounce attribute.

Example 10: The data privacy annotation framework of Example 9, wherein the target entity comprises at least one of: a user performing the content editing, an organization that includes the user performing the content editing, and an application type of the user application.
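Examples 9 and 10 describe looking up annotation settings for a target entity, which may be a user, an organization, or an application type. A minimal Python sketch of such a lookup follows. The function name `resolve_policy`, the dictionary layout, the default values, and the precedence order (user first, then organization, then application type) are all assumptions made for illustration; the disclosure does not specify how competing entities are prioritized.

```python
# Hypothetical fallback used when no entity-specific policy is registered.
DEFAULT_POLICY = {"first_threshold": 100, "elasticity": 0.0, "bounce": False}


def resolve_policy(policies, user=None, organization=None, app_type=None):
    """Return the annotation policy for the most specific matching entity.

    `policies` maps an entity kind ("user", "organization", "app_type")
    to a dict of {entity name: policy dict}.
    """
    # Assumed precedence: user overrides organization, which overrides app type.
    for kind, key in (("user", user),
                      ("organization", organization),
                      ("app_type", app_type)):
        policy = policies.get(kind, {}).get(key)
        if policy is not None:
            return policy
    return DEFAULT_POLICY
```

A caller would pass whichever identifiers are known for the current editing session and receive a single policy dictionary containing the threshold, the elasticity factor, and the bounce setting.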

Example 11: The data privacy annotation framework of Example 5, wherein each of the at least one annotation indicator and the at least one additional annotation indicator comprises one or more of: a global indicator presented in the user interface, the global indicator applying to the user data file; and individual indicators presented in the user interface, each positioned near an individual content element that contains the sensitive data.

Example 12: The data privacy annotation framework of Example 5, wherein the one or more predetermined data schemes are defined by one or more expressions used by a classification service to parse the user content and identify those content elements that contain data indicative of one or more predetermined content patterns or one or more predetermined content types.
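Example 12 refers to data schemes defined by expressions that a classification service uses to parse user content. One plausible realization is a set of regular expressions applied to each content element, sketched below in Python. The two patterns (a hyphenated nine-digit identifier in Social Security number format and a sixteen-digit card-like number) and the names `DATA_SCHEMES` and `classify` are illustrative assumptions rather than the schemes used by any actual classification service, and the patterns perform no checksum or range validation.

```python
import re

# Hypothetical data schemes; real deployments would likely use stricter rules.
DATA_SCHEMES = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "card16": re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b"),
}


def classify(elements):
    """Map the index of each sensitive element to the schemes it matches."""
    hits = {}
    for index, text in enumerate(elements):
        matched = [name for name, pattern in DATA_SCHEMES.items()
                   if pattern.search(text)]
        if matched:
            hits[index] = matched
    return hits
```

The number of keys in the returned dictionary is the count of content elements containing sensitive data, which is the quantity compared against the thresholds in the examples above.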

Example 13: A method of providing a data privacy annotation framework for a user application, the method comprising: identifying one or more of a first threshold quantity, an elasticity factor for the first threshold quantity, and an indication of a threshold bounce attribute; and monitoring user content in a user data file presented for content editing in a user interface of the user application to identify a quantity of content elements among the user content that contain sensitive data corresponding to one or more predetermined data schemes. The method includes, during the content editing, based at least on the quantity of content elements exceeding the first threshold quantity, initiating presentation of at least one annotation indicator in the user interface, the at least one annotation indicator flagging the user content in the user interface as containing at least first sensitive data. The method includes, during the content editing, based at least on the quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is applied to the first threshold quantity, establishing a second threshold quantity based at least on the elasticity factor for use in removing the presentation of the at least one annotation indicator. The method includes, during the content editing, based at least on the quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the threshold bounce attribute is applied to the second threshold quantity, initiating presentation of at least one additional annotation indicator in the user interface, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.

Example 14: The method of Example 13, further comprising: during the content editing, based at least on the quantity of content elements falling below the second threshold quantity when the elasticity factor is applied to the first threshold quantity, initiating removal of the presentation of the at least one annotation indicator.

Example 15: The method of Example 13, further comprising: during the content editing, based at least on the quantity of content elements initially exceeding the first threshold quantity and subsequently falling below the first threshold quantity when the elasticity factor is not applied to the first threshold quantity, removing the presentation of the at least one annotation indicator.

Example 16: The method of Example 13, further comprising: during the content editing, based at least on the quantity of content elements initially falling below the second threshold quantity and subsequently exceeding the second threshold quantity when the bounce attribute is not applied to the second threshold quantity, withholding presentation of at least one additional annotation indicator until the quantity of content elements exceeds the first threshold quantity, the at least one additional annotation indicator flagging the user content in the user interface as containing at least second sensitive data.

Example 17: The method of Example 13, wherein identifying one or more of the first threshold quantity, the elasticity factor for the first threshold quantity, and the indication of the threshold bounce attribute comprises determining an annotation policy established for a target entity associated with the content editing, the annotation policy comprising one or more of: the first threshold quantity, the elasticity factor for the first threshold quantity, and the indication of the threshold bounce attribute.

Example 18: The method of Example 17, wherein the target entity comprises at least one of: a user performing the content editing, an organization that includes the user performing the content editing, and an application type of the user application.

Example 19: The method of Example 13, wherein each of the at least one annotation indicator and the at least one additional annotation indicator comprises one or more of: a global indicator presented in the user interface, the global indicator applying to the user data file; and individual indicators presented in the user interface, each positioned near an individual content element that contains the sensitive data.

Example 20: The method of Example 13, wherein the one or more predetermined data schemes are defined by one or more expressions used by a classification service to parse the user content and identify those content elements that contain data indicative of one or more predetermined content patterns or one or more predetermined content types.

The functional block diagrams, operational scenarios and sequences, and flow diagrams provided in the Figures are representative of exemplary systems, environments, and methodologies for performing novel aspects of the disclosure. While, for purposes of simplicity of explanation, methods included herein may be in the form of a functional diagram, operational scenario or sequence, or flow diagram, and may be described as a series of acts, it is to be understood and appreciated that the methods are not limited by the order of acts, as some acts may, in accordance therewith, occur in a different order and/or concurrently with other acts from that shown and described herein. For example, those skilled in the art will understand and appreciate that a method could alternatively be represented as a series of interrelated states or events, such as in a state diagram. Moreover, not all acts illustrated in a methodology may be required for a novel implementation.

The included descriptions and figures depict specific implementations to teach those skilled in the art how to make and use the best option. For the purpose of teaching inventive principles, some conventional aspects have been simplified or omitted. Those skilled in the art will appreciate variations from these implementations that fall within the scope of the invention. Those skilled in the art will also appreciate that the features described above can be combined in various ways to form multiple implementations. As a result, the invention is not limited to the specific implementations described above, but only by the examples and their equivalents.

HK42024088258.9A 2017-03-23 2024-03-05 Configurable annotations for privacy-sensitive user content HK40102319A (en)

Applications Claiming Priority (1)

Application Number	Priority Date	Filing Date	Title
US15/466,988		2017-03-23

Publications (1)

Publication Number	Publication Date
HK40102319A (en)	2024-06-07

Publication	Publication Date	Title
JP7398537B2 (en)	2023-12-14	Obfuscation of user content in structured user data files
CN110506271B (en)	2023-09-29	Configurable annotations for privacy-sensitive user content
US10671753B2 (en)	2020-06-02	Sensitive data loss protection for structured user content viewed in user applications
US8949371B1 (en)	2015-02-03	Time and space efficient method and system for detecting structured data in free text
CA2786058C (en)	2017-03-28	System, apparatus and method for encryption and decryption of data transmitted over a network
HK40102319A (en)	2024-06-07	Configurable annotations for privacy-sensitive user content
HK40017079B (en)	2024-03-08	Configurable annotations for privacy-sensitive user content
HK40016404B (en)	2024-03-01	Obfuscation of user content in structured user data files
RU2772300C2 (en)	2022-05-18	Obfuscation of user content in structured user data files
CA3054035C (en)	2024-11-12	Configurable annotations for privacy-sensitive user content
HK40016404A (en)	2020-09-11	Obfuscation of user content in structured user data files
HK40017079A (en)	2020-09-18	Configurable annotations for privacy-sensitive user content
