Designing for critical privacy and data protection systems (I)

Mind the factors

Lately, I’ve been doing some work in the area of cryptography and enterprise-scale data protection (DP) and privacy. And so, it hit me: things are a lot different than they used to be, and they are changing fast. Things seem to be moving towards a more secure environment, with stronger DP and privacy requirements, and these changes seem to be widely adopted. Somehow, I am happy about it. Somehow, I am worried.

Before I go a little deeper into the topic of how to design for critical privacy and DP systems, let me just enumerate three of the factors that are responsible for generating the changes that we are witnessing:

  • The evolving worldwide regulation and technology adoption started by Regulation (EU) 2016/679 (a.k.a. GDPR)
  • The unimaginable progress we are making in big data analysis and ML
  • The digitalization of the citizen-facing services of nation-states (stuff like e-voting, which I really advocate against)

I don’t want to cover in-depth the way I see each factor influencing the privacy and DP landscape, but, as we go on, I just want you to have these three factors in mind. Mind the factors.

Emerging technologies

Talking about each concept and technology that is gaining momentum in this context is absolutely impossible. So, I chose to talk about two of the most challenging ones. Or, at least, the ones that I perceive as being the most challenging: this is going to be a two-episode series about Differential Privacy and Homomorphic Encryption.

Differential privacy. ε-Differential Privacy.

Differential Privacy, in a nutshell, from a space-station view, is a mathematical way of ensuring that reconstruction attacks are not possible at the present or future time.

Mathematical what? Reconstruct what? Time what? Let me give you a textbook example:

Assume we know the following about a group of people:

There are 7 people with the median age of 30 and the mean of 38.

4 are females with the median of 30 and the mean of 33.5

4 love sugar with the median of 51 and a mean of 48.5

3 sugar lovers are females with the median 36 and the mean 36.7

Challenge: give me the age, sex, sugar preference and marital status of each individual.

Solution:

1. 8, female, sugar, not married

2. 18, male, no sugar, not married

3. 24, female, no sugar, not married

4. 30, male, no sugar, married

5. 36, female, sugar, married

6. 66, female, sugar, married

7. 84, male, sugar, married
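If you want to convince yourself that the solution above really matches the published statistics, a quick check looks something like this. It is just a sketch: the tuple layout and the names (`people`, `stats`) are mine, not part of any published method.

```python
from statistics import mean, median

# The proposed solution from the text: (age, sex, loves_sugar, married).
people = [
    (8, "F", True, False),
    (18, "M", False, False),
    (24, "F", False, False),
    (30, "M", False, True),
    (36, "F", True, True),
    (66, "F", True, True),
    (84, "M", True, True),
]


def stats(rows):
    """Return (count, median age, mean age) for a subset of people."""
    ages = [row[0] for row in rows]
    return len(ages), median(ages), mean(ages)


everyone = stats(people)
females = stats([p for p in people if p[1] == "F"])
sugar = stats([p for p in people if p[2]])
female_sugar = stats([p for p in people if p[1] == "F" and p[2]])

print(everyone)      # 7 people, median 30, mean 38
print(females)       # 4 females, median 30, mean 33.5
print(sugar)         # 4 sugar lovers, median 51, mean 48.5
print(female_sugar)  # 3 female sugar lovers, median 36, mean 110/3 (about 36.67)
```

Note that marital status does not appear in any of the published statistics, so that part of the answer can only come from plausibility, not from arithmetic.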

Basically, a reconstruction attack for such a scenario involves finding peaks of plausibility in a plausibility versus possibility plot. It goes something like this:

You can start brute forcing all the combinations of the seven participants. Considering all the features except age (so, gender, sugar preference, marital status) as yes/no values, each person has 2^3 = 8 possible combinations, so the seven of them give you 8^7 = 2,097,152 possibilities, but all have roughly the same plausibility. So a possibility / plausibility plot looks something like this:

See, there do not seem to be any peaks in plausibility. But, once we factor in the age, well, things change. For example, although it is possible to have a 150-year-old person, it is very implausible. Furthermore, it is more plausible for an older individual to be married than a younger one, and so on. So, if we factor in age plausibility, a graph looks more like this:

See, there’s a peak of plausibility. That is most likely our solution. Now, suppose our published statistics are a little skewed: say we introduce just enough noise into them that the impact on science is minimal, and we eliminate the unnecessary ones (if this can be done). Then a reconstruction attack is almost impossible. The purpose is to flatten the graph above as much as possible.
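To make "just enough noise" a bit more concrete, here is a minimal sketch of one common way to do it: the Laplace mechanism applied to a counting query. The text above does not prescribe a specific mechanism, so treat the choice of Laplace noise, the function names and the epsilon value as my assumptions. It is not hardened for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise.

    A counting query changes by at most 1 when one person is added or
    removed, so its sensitivity is 1 and the noise scale is 1 / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)  # fixed seed only to make the sketch repeatable
# "4 love sugar" published five times, each time with fresh noise.
released = [noisy_count(4, epsilon=0.5, rng=rng) for _ in range(5)]
print(released)
```

A smaller epsilon means more noise and a flatter plausibility graph; a larger epsilon means more accurate statistics and a sharper peak.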

Now, to be fair, in our stretched-out textbook example, there’s no need to do the brute-force-assumption plausibility plot. Because the Mean and Median are published for each subset of results, you can simply write a deterministic equation system and solve for the actual solution.

Now imagine that you, as an attacker, possess some knowledge about your target from an external source. This external source may be a historical publication over the same set of data, or a different data source altogether. This makes your reconstruction job easier.

ε-Differential Privacy systems have a way of defining a privacy loss (i.e. a quantitative measure of the increase in the plausibility plot). These systems also define a privacy budget, and this is one of the real treasures of this math: you can make sure that, over time, you are not making reconstruction attacks easier.
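Here is roughly how a privacy budget can be tracked in code. This sketch assumes the simplest accounting rule (basic sequential composition, where the privacy losses of successive releases simply add up); the class and exception names are hypothetical, and your math people may well ask for a tighter accounting method.

```python
class BudgetExhausted(Exception):
    """Raised when a release would push the total loss over the budget."""


class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition."""

    def __init__(self, total_epsilon: float) -> None:
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def charge(self, epsilon: float) -> None:
        """Record the privacy loss of one release, or refuse it."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExhausted(
                f"asked for {epsilon}, only {self.remaining} left"
            )
        self.spent += epsilon


budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.4)  # first publication
budget.charge(0.4)  # second publication over the same data
try:
    budget.charge(0.4)  # a third one would exceed the budget
except BudgetExhausted as exc:
    print("refused:", exc)
```

The point is that the refusal happens no matter how much time passed between publications: the budget belongs to the data, not to the query.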

This stuff gained momentum as the US Census Bureau got the word out that they are using it, and also encouraged people to ask the enterprises that own their data to use it.

So, as a software architect, how do I get ready for this?

First, at the moment, there are no out-of-the-box solutions that can give you ε-Differential Privacy for your data. If this is a requirement for you, you are most probably going to work with some data scientists / people with math degrees who are going to tell you exactly what the measure of privacy loss will be for the features in your data. At least that is what I did 😊 Once those are defined, you have to be ready to implement them.

There is a common pattern you can adopt. A proxy, a privacy guard:

You are smart enough to realize that CLEAN data means that some acceptable noise is introduced, such that the privacy budget is not greatly, if at all, impacted.
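A minimal sketch of such a privacy guard, assuming it only answers counting queries, uses Laplace noise and charges a simple additive budget. All names here (`PrivacyGuard`, `count`, the record fields) are hypothetical, and a real guard would sit behind a lot more infrastructure.

```python
import random
from typing import Callable, Dict, Iterable, Optional

Record = Dict[str, object]


class PrivacyGuard:
    """Hypothetical proxy: the only component allowed to read raw records."""

    def __init__(
        self,
        records: Iterable[Record],
        total_epsilon: float,
        rng: Optional[random.Random] = None,
    ) -> None:
        self._records = list(records)  # raw data never leaves the guard
        self._remaining = total_epsilon
        self._rng = rng or random.Random()

    def count(self, predicate: Callable[[Record], bool], epsilon: float) -> float:
        """Answer 'how many records match?' with noise, charging the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        rate = epsilon  # Laplace scale is 1 / epsilon for a counting query
        noise = self._rng.expovariate(rate) - self._rng.expovariate(rate)
        return true_count + noise  # this is the CLEAN value the caller sees


records = [
    {"age": 8, "sugar": True},
    {"age": 18, "sugar": False},
    {"age": 36, "sugar": True},
]
guard = PrivacyGuard(records, total_epsilon=1.0, rng=random.Random(7))
print(guard.count(lambda r: r["sugar"], epsilon=0.5))
```

Callers never get a reference to the raw records, only to noisy answers, and every answer costs budget.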

Challenges

If it was easy, everyone would do it, but it’s not, so suck it.

First, you and your team must be ready to understand what a highly trained math scientist is talking about. Get resources for that.

Second, you have to be careful, as an architect, to have formal definitions throughout your applications for the two concepts enumerated above: privacy budget, and privacy loss.

Third, both in my experience and in the ongoing textbook research, the database must contain the absolute raw data, including historic data if needed. This poses another security challenge: you don’t want to be fancy about using complicated math to protect your data while being vulnerable to a direct database attack. Something stupid like an injection attack has no place here. You can see now that the diagram above is oversimplified. It lacks a ton of proxies, security controls, DMZs and whatnot. Don’t make the same mistake I did and try to hide some data from the privacy guard; your life will be a misery.

Fourth, be extremely careful about documenting this. It is not rare for software ecosystems to change purpose, and they tend to be used where they are not supposed to be. It may happen that such an ecosystem, with time, gets to be directly used for scientific research, from behind the privacy guard. That might not be acceptable. You know, scientists don’t like to have noisy data. So I’ve heard, I’m not a scientist.

That’s all for now.

In the second part we’re going to talk a little bit about the time I used Homomorphic Encryption. A mother****ing monster for me.

Stay safe!

Avoidable privacy happenings

Last time, I tried to briefly cover some of the steps you need to take before starting to choose tools that will help you achieve compliance. Let’s dig a little deeper into that by using some real-life negative examples that I ran into.

Case: The insufficiently authenticated channel.

Disclosure disclaimer: the following examples are real. I have chosen to anonymize the data about the bank in this article, although I have no obligation whatsoever to do so. I could disclose the full information to you on request.

At one point, I received an e-mail from a bank in my inbox. I was not, am not, and hopefully will not be a client of that particular bank. Ever. The e-mail seemed (from the subject line) to inform me about some new prices of the services the bank provided. It was not marked as spam, and so it intrigued me. I ran some checks (traces, headers, signatures, specific backtracking magic), came to the conclusion that it was not spam, and so I opened it. Surprise: it was directly addressed to me, and my full name appeared somewhere inside. Oh, and of course it thanked ME for choosing to be their client. Well. Here’s a snippet (it is in Romanian, but you’ll get it):

Of course I complained to the bank. I was asking them to inform me how they had got my personal data, asking them to delete it, and so on. Boring.

About four+ months later (not even close to a compliant time) a response popped up:

Let me sum it up for you: it said that I am a client of the bank, that I have a current account opened, and where the account was opened. Oh, but that is not all. They also gave me a copy of the original contract I supposedly signed. And a copy of the personal data processing document that I also supposedly signed and provided to them. With the full-blown personal data. I mean full-blown: name, national ID numbers, personal address etc. One problem though: that data was not mine, it was some other guy’s data, a guy who had one additional middle name. And thus, a miracle data leak was born. It is small, but it can grow if you nurture it right…

What went wrong?

Well, in short, the guy filled in my e-mail address and nobody checked it: not him, not the bank, nobody. You can imagine the rest.

Here’s what I am wondering.

  1. Now, in the 21st century, is it so hard to authenticate a channel of communication with a person? Is it difficult to implement a solution for e-mail confirmation based on some contract id? Is it really? We could do it for you, bank. Really. We’ll make it integrated with whatever systems you have. Just please, do it yourselves or ask for some help.
  2. Obviously privacy was 100% absent from the process of answering my complaint. Even though I made a privacy complaint 🙂 Is privacy totally absent from all your processes?
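To show how little it takes, here is a minimal sketch of an e-mail confirmation tied to a contract id. It assumes the bank keeps a server-side secret and only treats the address as verified after the customer opens a link carrying the token; `SECRET_KEY`, `make_token`, `confirm` and the sample contract id are all hypothetical, and a real deployment would also want token expiry and rate limiting.

```python
import hashlib
import hmac
import secrets

SECRET_KEY = secrets.token_bytes(32)  # hypothetical server-side secret


def make_token(contract_id: str, email: str, key: bytes = SECRET_KEY) -> str:
    """Derive a confirmation token bound to one contract and one address."""
    message = f"{contract_id}|{email.strip().lower()}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def confirm(contract_id: str, email: str, token: str, key: bytes = SECRET_KEY) -> bool:
    """Check a token that came back from the confirmation link."""
    expected = make_token(contract_id, email, key)
    return hmac.compare_digest(expected, token)


# The token is e-mailed to the address written on the contract.
token = make_token("C-1001", "client@example.com")

# Only the person who controls that inbox can send it back.
print(confirm("C-1001", "client@example.com", token))  # True
print(confirm("C-1001", "someone.else@example.com", token))  # False
```

Until `confirm` returns True for a given contract, nothing personal should ever be sent to that address.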

In the end, this is a great example of poor legislative compliance, with zero security involved, I mean ZERO security. They have some poor legal compliance: there is a separate document asking for personal data and asking for permission to process it. The document was held, and it was accessible (ok, it was too accessible). They answered my complaint, even though not in a timely, compliant manner, and I did not receive any justification for the delay.

Conclusions?

  1. Have a good privacy program. A global one.
  2. Have exquisite security. OK, not exquisite, but have some information security in place.
  3. When you choose tools, make sure they can support your privacy program.
  4. Don’t be afraid to customize the process, or the tools. I (and, to be honest, anybody in the business) could easily give you a quote for an authentication / authorization solution for your communication channels with any type of client.

I am sure you can already see for yourself how this is useful in the context of choosing tools that will help you organize your conference event, and still maintain its privacy compliance.

Is your conference event GDPR compliant? – Part 2

Last time, I briefly covered some of the main points that need review before thinking about making your event GDPR compliant, and also mentioned that in doing so, you will obtain, as a happy byproduct, a nice fingerprint of your event.

Now, as a side note, and as you probably have already figured out, this series of articles is not necessarily addressing those environments that already have a data governance framework in place. If this is your case, I am sure you already have the procedure and tools in place. This series may become interesting for you when we get to talk about some specific tools, information security topics and some disaster scenarios.

There are still some grounds to cover regarding this topic, so let’s go!

Most probably, your main focus in the beginning is: let’s cover some of the costs using sponsors, and let’s fire up those registration & call for content procedures right away. Now, let’s not just rush into that. In order for you to collect data from participants and speakers (in short), you must have a legal basis for doing that. The legal basis for doing the processing (in this case, just collecting the data) may not be much of a choice, even though it seems so. In our experience, given the specifics of our activity, your choices may be consent and fulfillment of a contract. You will probably want to have a homogeneous legal basis for all of your participants. Let’s assume consent as the legal basis for processing.

Consent

In order to be provided with consent, you are obligated to give the person offering consent several pieces of information:

[…]

Recipients of the personal data
Intention to transfer data to a third country or international organization
Storage Period, or criteria used to determine it.
How is automated decision making present in processing?

[…]

Just to name a few. I will not detail the full challenges of what consent should be here, because this may become boring to you. You may know all this already. After all, you are already in this business 🙂

Several of these topics are easy to pinpoint if you went through the process detailed in the first article of the series (e.g. identifying the recipients of the personal data). Still, some of the topics do not derive from that first process.
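If it helps, you can keep track of which pieces of the notice you can already fill in. The sketch below covers only the four items named above (the real list is longer), and the class, field and recipient names are hypothetical. A value of None means "we have not decided this yet".

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ConsentNotice:
    """Only the four items named above; None means 'not decided yet'."""

    recipients: Optional[List[str]] = None
    third_country_transfers: Optional[List[str]] = None  # [] = none planned
    storage_period: Optional[str] = None
    automated_decision_making: Optional[str] = None


def missing_items(notice: ConsentNotice) -> List[str]:
    """List the notice items that still have no answer."""
    return [f.name for f in fields(notice) if getattr(notice, f.name) is None]


draft = ConsentNotice(
    recipients=["venue hotel", "badge printing service"],
    third_country_transfers=[],
)
print(missing_items(draft))  # ['storage_period', 'automated_decision_making']
```

As long as that list is not empty, you are not ready to ask anybody for consent.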

Establishing Data-Flow and assessing the tools

In order for you to be able to answer some questions like:

Are these data going to travel outside EU? Where exactly?

Are we going to profile anybody, or do some automated decision making?

you first need to define a data-flow associated with personal data, and even more, start thinking about the tools you are going to use.

Remember how, in my first article, I talked about the need to think about some third-party software that may help you with some of your activities? Where does this software maintain its data? Is it outside the EU? Can you control this?

You see where I am going with this: formalizing the data-flow and knowing what tools touch your data is of utmost importance before even asking anybody for consent.
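A data-flow does not have to start as a fancy diagram. A minimal sketch, assuming a simple inventory of tools where you write down what personal data each one touches and where it stores it, could look like this. The tool names, fields and the `needs_review` helper are all made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Tool:
    name: str
    purpose: str
    personal_data: List[str]         # categories of personal data it touches
    storage_location: Optional[str]  # None = we do not know yet
    inside_eu: Optional[bool]        # None = we do not know yet


def needs_review(tools: List[Tool]) -> List[str]:
    """Tools that touch personal data and are not confirmed to stay in the EU."""
    return [t.name for t in tools if t.personal_data and t.inside_eu is not True]


inventory = [
    Tool("registration-site", "attendee sign-up", ["name", "e-mail"], "Germany", True),
    Tool("mailing-service", "newsletters", ["e-mail"], "United States", False),
    Tool("badge-printer", "name badges", ["name"], None, None),
    Tool("venue-map", "floor plan", [], None, None),
]
print(needs_review(inventory))  # ['mailing-service', 'badge-printer']
```

Everything that shows up in that list is a question you need to answer before the consent text can be written.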

Don’t panic! These are things you needed to do for your event anyway; now you just need to do them earlier. And if you ask me, at just the proper moment to get the maximum benefit from them. You do not want to start thinking about what tools you need when you already have 300 attendees registered by phone. That would be a bummer.

Next time, I am going to take a deeper look into tools and some basic security requirements that we recommend! Be safe!

Is your conference event GDPR compliant? – Part 1

I’m starting a series of articles in which I will try to cover my experience in managing privacy and GDPR compliance for several IT related conference events that are handled by “Avaelgo”. During this journey, I will also touch some in-depth security aspects, so stay tuned for that.

As I am sure you know already, a conference is a place where people gather, get informed, do networking (business or personal), have fun, and who knows what other stuff they may be doing. The key aspect here is that for such a conference to be successful, you need to have a fair amount of people being part of it. And since people are persons, well, that also means a fair amount of personal data.

There’s a lot to cover, but we’ll start with the basics. If this is the first time you are organizing such a conference, then you already have a head start: you don’t have to change anything. If not, then you must start by reviewing the processes that you already have in place.

In this first article I’m just going to cover the key points that you should review. Let’s go:

  1. How do people get to know about your event?

It is very important to know exactly how you are going to market your event. The marketing step itself must be compliant with the regulation. This is a slightly separate topic, but it cannot be overlooked.

It does not matter whether you market yourself to participants, speakers, or companies. Personal data is still going to be involved.

  2. How are people going to register for your event?

This means: how are you going to collect data regarding the participants? Is there going to be a website that allows registration? Do you allow registration by phone? There are still more questions to answer, but you have an idea about the baseline. These decisions will later have an impact on the security measures you need to take in order to secure those channels.

  3. How are speakers going to onboard your event?

Same situation as above, but it may be that there is a different set of tools for a different workflow.

  4. How are you going to verify the identity of the participants?

Is someone going to manually verify attendance and compare ID card names with a list? Is there going to be a tool? Is there a backup plan?

  5. Do you handle housing / travelling for speakers / participants?

If yes, you will probably need to transfer some data to some hotels / airlines / taxis, etc…

  6. Do you have sponsors? Do they require some privilege regarding the data of the participants?

This is a big one. As I am sure you know, some or all of the entities that collaborate on your conference will require some perks back from your conference. It may be that they are interested in recruitment activities, or marketing activities, or some other kind of activities on the personal data of your participants. Tread carefully, everything must be transparent.

  7. Will you get external help?

Companies / volunteers / software tools and services that will help you with different aspects of organizing the event? What are they going to do for you? If they touch personal data, it is kind of important to know before you give it away to them.

  8. Are there going to be promotions / contests?

Usually, these will be treated separately, and onboarding to this kind of activity will be handled separately, but still, it is a good idea to know beforehand if you intend to do this.

As you can already imagine, this is not all, but we will cover each of these topics in future articles anyway, and then, probably, extend the list with some more.

This may look freaky and like a lot of work, but it really is not. Anyway, by trying to tackle personal privacy beforehand, you also get, as a happy byproduct, a cool fingerprint of what you need to do in order to have a successful event. Cheers to that!

A future article will come soon, covering the next steps. I am sure you already have an intuition of what those are.

See you soon!