When it comes to data protection, should you crowdsource your security?
by Connie McFarland
People often seek a machine learning solution that installs and starts learning dynamically, believing that at some point the system will be trained and effective going forward. This active learning approach is based on the assumption that the crowd knows best. And in many cases, such as movie reviews, this is true: people know what they like.
But is this true for information security? How can you know that individuals in your organization have the right expertise and can be trusted to impart the learning the system needs? If we use active machine learning to protect digital assets, at what point in learning would we feel sufficiently confident that data leaks will not occur?
The application of machine learning to data security comes with high expectations of accuracy and consistency. Trust is always important in machine learning, but when the cost of a leaked email is millions of dollars in fines, the bar is higher than giving a bad movie recommendation or mistaking a bobcat for a housecat in a photo search.
In The Wisdom of Crowds, James Surowiecki identified four criteria of a wise crowd:
• a diversity of opinion,
• independence from the herd,
• ability to draw on specialized and local knowledge, and
• a mechanism for turning diverse opinions into a collective decision.
How does this relate to machine learning?
Well, it’s important to recognize what is being trained when we talk about using machine learning to protect digital assets: a model that is essentially the organization’s standard for sensitive content. That’s a tall order, and some organizations may require a governance body to take on this important task. But it’s also possible to lean on the organizational community, and TITUS Intelligent Protection, powered by machine learning, provides tools suitable for both approaches. Our solution uses supervised learning, which means someone must provide labeled examples: either the “wise crowd” or the data steward.
The data steward approach to machine learning categorization
Organizations may prefer to entrust the training of machine-learned models to an individual or small group of experts who are familiar with corporate assets and governance policies. The process of building models that are deployed throughout the organization begins with identifying the categories of data to be protected, then assembling a representative set of documents or emails for each category. That set is used to train, evaluate and refine a model.
Building this dataset properly takes time, which is why some organizations look for easier approaches that share the burden of collecting it.
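To make the train-and-evaluate step concrete, here is a minimal sketch of how a data steward might hold out part of each category for evaluation. It uses only the Python standard library; the function name `stratified_split`, the category names and the placeholder documents are hypothetical and are not part of any TITUS product.

```python
import random
from collections import defaultdict

def stratified_split(examples, eval_fraction=0.2, seed=0):
    """Split (text, label) pairs into training and evaluation sets, per category."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, evaluation = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        # Hold out at least one example per category when it has two or more.
        n_eval = max(1, round(len(items) * eval_fraction)) if len(items) > 1 else 0
        evaluation.extend(items[:n_eval])
        train.extend(items[n_eval:])
    return train, evaluation

# Hypothetical categories and placeholder documents, for illustration only.
examples = (
    [(f"contract draft {i}", "legal") for i in range(10)]
    + [(f"quarterly forecast {i}", "financial") for i in range(10)]
    + [(f"lunch menu {i}", "public") for i in range(5)]
)
train, evaluation = stratified_split(examples)
```

Splitting per category, rather than across the whole dataset at once, keeps a rare category from ending up with no evaluation examples at all.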
Crowdsourcing the collection of a dataset with classification
Organizations with knowledgeable staff can rely on their staff to classify based on a defined label set provided by the data steward. Crowd members can use tools like TITUS Classification Suite to label documents and emails as they go about their work. The process of labeling can be done for a period of time to accumulate a dataset for training and evaluation. But this data does not dynamically update any models.
Given the high bar for data governance, we believe it’s important to do a quality review of trained models. It’s possible to automate this review, but fundamentally, models are still statically trained and evaluated. This safeguards against inadvertent bias creeping in and against concerted attempts to skew the training.
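One way such a quality review could be automated is a gate that scores a trained model against a steward-reviewed evaluation set and approves it only if every category clears a threshold. The sketch below is an assumption about how that might look, not a description of how TITUS Intelligent Protection works; the function names and the 0.9 thresholds are hypothetical.

```python
from collections import Counter

def per_category_scores(true_labels, predicted_labels):
    """Return {category: (precision, recall)} for a fixed evaluation set."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1   # predicted category gained a false positive
            fn[truth] += 1   # true category was missed
    scores = {}
    for cat in set(true_labels) | set(predicted_labels):
        precision = tp[cat] / (tp[cat] + fp[cat]) if tp[cat] + fp[cat] else 0.0
        recall = tp[cat] / (tp[cat] + fn[cat]) if tp[cat] + fn[cat] else 0.0
        scores[cat] = (precision, recall)
    return scores

def passes_review(scores, min_precision=0.9, min_recall=0.9):
    """Approve a model only if every category meets both thresholds (assumed values)."""
    return all(p >= min_precision and r >= min_recall for p, r in scores.values())

# Hypothetical labels: the "leaky" model lets one sensitive item through as public.
truth = ["sensitive", "sensitive", "public", "public"]
good = ["sensitive", "sensitive", "public", "public"]
leaky = ["sensitive", "public", "public", "public"]

good_ok = passes_review(per_category_scores(truth, good))
leaky_ok = passes_review(per_category_scores(truth, leaky))
```

Checking each category separately matters here: a model can look accurate overall while still missing half of the sensitive items.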
Turning diverse opinions into a collective decision
TITUS Intelligent Protection is focused on discovery, trying to recognize what category a document or email is so that proper handling policies can be applied. Experts and the crowd may be able to participate in that process. But the final step belongs to the data steward (or governance body) that’s responsible for ensuring models are reliable and defining the policies for proper information handling.
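As an illustration of a mechanism for turning diverse opinions into a collective decision, the sketch below accepts a crowd label only when enough people voted and enough of them agree, and otherwise hands the item to the data steward. The function name, the vote counts and the agreement threshold are hypothetical choices for this example, not TITUS features.

```python
from collections import Counter

def aggregate_labels(votes, min_votes=3, min_agreement=0.8):
    """Return (label, None) on consensus, or (None, reason) to escalate to the steward."""
    if len(votes) < min_votes:
        return None, "too few votes"
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) < min_agreement:
        return None, "insufficient agreement"
    return label, None

# Hypothetical votes from crowd members on three documents.
clear = aggregate_labels(["confidential"] * 5)
split = aggregate_labels(["confidential", "confidential", "public"])
sparse = aggregate_labels(["public"])
```

Items that come back with a reason instead of a label go to the data steward's review queue, so the crowd contributes examples but the steward keeps the final say.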
Once the deployable model and policies are ready, they can be deployed using TITUS active configuration updates. Using the TITUS justification feature, it’s possible to gather feedback on the efficacy of models and policies in production. Although TITUS Intelligent Protection uses static learning, the continuous deployment and feedback process for model refinement that’s a natural part of the TITUS family of products ensures that the models don’t become stale.
Garbage in, garbage out, or “Lies, damned lies and statistics”
When Francis Galton conducted his weight-of-the-ox experiment at an English fair in 1906, he made us aware that the averaged “guesses” of a large crowd could be more accurate than those of the experts. This is great when you can solicit a crowd for opinions on things like movies, the weight of an ox, or the number of jelly beans in a jar.
When I think about how autonomous vehicles are trained, and given how most people drive, I hope their developers called upon some expert drivers. And when I think about who I would trust with the proverbial crown jewels of my organization, I want experts at the helm who know what those jewels look like and what their loss means.
Connie McFarland is a senior software architect and engineering manager responsible for machine learning at TITUS. She enjoys the challenge of combining software engineering with applied machine learning to address real-world problems in data protection and corporate compliance.