A lot has been going on lately. So much so that I do not even know where to start reviewing it.
I’ll just go ahead and talk about some technical projects and topics that I’ve been briefly involved in and that give me a fair amount of concern.
Issue number x: Citizen-facing services of nation-states
A while back, I made a “prediction”: the digitalization of citizen-facing services would become more prevalent, especially as the pandemic situation panned out (here) and (here). I was right. Well, to be completely honest, it was not really a prediction, as I had two side projects (as a freelancer) that involved exactly this. So I had a small and limited view from the inside.
Those projects ended, successfully delivered, and then came the opportunity for more. I kindly declined. Partly because I’m trying to raise a child with my wife, and there’s only so much time in the universe, and partly because I have deep ethical issues with what is happening.
I am not allowed to even mention anything remotely linked to the projects I’ve been involved in, but I will give you a parallel and thus unrelated example, hoping you connect the dots. Unrelated in the sense that I was not even remotely involved in the implementation of the example I’m bringing forward.
The example: the Romanian STS (Special Telecommunications Service) introduced blockchain technology into the process of centralizing and counting citizen votes in Romania’s national and regional elections. You can read more about it here, and connect the dots for yourselves. You’ll also need to know a fair amount about Romanian election law, but you’re smart people.
The Issue?
Flinging the blockchain concept at the people in a way that guarantees they misunderstand it. Projecting a more secure image than is warranted. Creating a security illusion. Creating the illusion of decentralized control, while implementing the EXACT opposite. I’m not saying this is intentional, oh, no, it is just opportunistic: it happened because of the fast adoption. Why? Blockchain is supposed to bring decentralization, and what the STS implementation does is the EXACT opposite: consolidate centralization.
While I have no link with what happened in Romania, I know for a fact that similar things have happened elsewhere. This is bad.
I do not think that this is happening with any intention. I simply think there is A HUGE AMOUNT of opportunistic implementation going on, SIMPLY because of the political pressure to satisfy PR needs and, maybe, just maybe, give people the opportunity to simplify their lives. But the implementations are opportunistic, and from a security perspective, this is unacceptable!
Ethically
I think that while we, as a society, tend to focus on the ethics of using AI and whatnot, we are completely forgetting about ethics in terms of our increased dependency on IT&C in general. I strongly believe we have missed a link here. In the security landscape, this is going to cost us. Big time.
Last time, I talked about some of the factors that influenced the evolution of privacy-preserving technologies. I wanted to touch base with some of the technologies emerging from the impact of these factors and talk about some of the challenges they come with.
After a discussion about ε-differential privacy, I promised you a little discussion about homomorphic encryption. There is a small detour that I find myself obligated to take, due to the latest circumstances of the SARS-CoV-2 outbreak: I want to split this discussion in two parts, and start with a little discussion about homomorphic secret sharing before I go into sharing my experience with adopting homomorphic encryption.
What?! Why?
In the last article, I argued that one of the drivers for adopting new privacy mechanisms is: “The digitalization of the citizen-facing services of nation-states (stuff like e-voting, that I really advocate against)”.
Well, sometime after SARS-CoV-2 is gone (a long time from today), I foresee a future where these kinds of services will be more and more widely adopted. One of the areas where citizen-facing services of nation-states will be digitalized is e-voting – e-voting within the parliament, for democratic elections, etc. I briefly mentioned last time that I am really against this. At least for now, given the status quo of the research in this area.
Let me explain a little the trouble with e-voting
Starting with a question: Why do you trust the people counting your vote?
[…annoying pause…]
A good answer here could be:
Because all the parties having a stake in the elections have people counting. The counting is not done by a single ‘neutral’ authority.
Because, given the above, I can see my vote from the moment I printed it to the moment I cast it.
Because your vote must be a secret, so that you cannot be blackmailed or paid to vote in a certain way – and there are mechanisms in place for that.
You can see that in an electronic environment, this is hardly the case. Here, in an electronic environment, if you have a Dragnea, you are shot and buried. Here, in an electronic environment, you:
Cannot see your vote from the moment you printed it (or pushed the button) to the moment of casting – anyone could see it.
Cannot easily make sure that your vote stays a secret. Once you cast your vote and it is encrypted somehow, you have no way of knowing what you voted – it became a secret. So there is the trouble with that. Furthermore, assuming conventional encryption, there are master keys that can easily be compromised by an evil Dragnea.
Auditing such a system involves an extremely high and particular level of expertise, and any of the parties having a stake in the election would really have trouble finding people willing to take the risk of doing that for them. This is an extremely sensitive matter.
There is a research area concerned with tackling these issues. It is called “End-To-End Verifiable Voting Systems”.
End-To-End Verifiable Voting Systems
Basically, tackling these problems for e-voting systems means transforming an electronic voting environment in such a manner that it can at least meet the standards of non-e-voting systems, then adding some specific electronic mumbo-jumbo to it, and making it available in a ‘pandemic environment’. [Oh my God, I’ve just said that, pandemic environment…]
The main transformation is: I, as a voter, must be able to keep my vote secret up to the moment of casting it, and make sure my vote is accounted for properly.
Homomorphic secret sharing
It would be wonderful if, while addressing the trust in the counting of the votes, we had a way of casting an encrypted vote but still being able to count it even while it is encrypted. Well, this can be done.
To my knowledge, the most effective and advanced technology that can be used here today is homomorphic encryption – more precisely, a small subset of HE called homomorphic secret sharing.
Homomorphic secret sharing is a secret sharing scheme where the secret is encrypted using homomorphic encryption. In a nutshell, homomorphic encryption is a type of encryption where you can do computations on the ciphertext – that is, compute stuff directly on encrypted data, with no prior decryption. For example: in some HE schemes, an encryption of a 5 plus an encryption of a 2 is an encryption of a 7. Hooray.
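If you want to see that additive trick with your own eyes, here is a minimal toy sketch in Python of the Paillier cryptosystem, a classic additively homomorphic scheme (my choice for illustration only – it is not the scheme SEAL uses, and the parameters are laughably insecure):

```python
# Toy Paillier cryptosystem: additively homomorphic encryption.
# Illustration only - tiny primes, NOT secure.
import math
import random

p, q = 293, 433                 # toy primes; real keys use ~1024-bit primes
n = p * q
n2 = n * n
lam = math.lcm(p - 1, q - 1)    # Carmichael's lambda for n = p*q
g = n + 1                       # standard choice of generator
mu = pow(lam, -1, n)            # modular inverse of lambda mod n

def encrypt(m: int) -> int:
    r = random.randrange(1, n)  # fresh randomness per encryption
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c: int) -> int:
    # L(x) = (x - 1) // n, then multiply by mu (mod n)
    return ((pow(c, lam, n2) - 1) // n) * mu % n

c5, c2 = encrypt(5), encrypt(2)
c7 = (c5 * c2) % n2             # multiplying ciphertexts adds the plaintexts
print(decrypt(c7))              # prints 7. Hooray.
```

Note how the “addition” on ciphertexts is actually a modular multiplication: the homomorphic property lives in the math of the scheme, not in any magic.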
Bear in mind, the mathematics behind all this is pretty complex. I would not call it scary, but close enough. However, there are smart people working on, and providing, out-of-the-box libraries that software developers can use to embed HE in their products. I would like to mention just two here: Microsoft SEAL and PALISADE (backed by DARPA). Don’t get me wrong: today, you still have to know some mathematical tricks if you want to embed HE in your software, but the really heavy part is done by these heroes who provide these libraries.
Decentralized voting protocols using homomorphic secret sharing
In the next article I will talk about the challenges you will face if you try to embed HE in your product, but until then, if you want a glimpse of the complexity, I will just go ahead and detail a decentralized voting protocol that uses homomorphic secret sharing.
Assume you have a simple vote (yes/no) – no overkill for now
Assume you have some authorities that will ‘count’ the votes – the number of authorities is noted A
Assume you have N voters
Each authority generates a public key: a number, Xa.
Each voter encodes his vote in a polynomial Pn of degree A-1 (number of authorities minus 1), with the constant term an encoding of the vote (in this case, +1 for yes and -1 for no); all other coefficients are random.
Each voter computes the value of his polynomial Pn – and thus his vote – at each authority’s public key: Pn(Xa).
A points are produced; they are pieces of the vote.
Only if you know all the points can you figure out Pn, and thus the vote. This is the decentralization part.
Each voter sends each authority only the value computed at that authority’s key.
Thus, each authority finds it impossible to figure out how each voter voted, as it does not have enough computed values – it only has one.
After all votes have been cast, each authority computes and publishes the sum (Sa) of the values it received.
Thus, a new polynomial is defined – the published sums Sa are its values at the points Xa – and its constant term is the sum of all votes. If it is positive, the result is yes; if negative, no.
If you had trouble following the secret sharing algorithm, don’t worry, you’re not alone. Here’s a helper illustration, and a small code sketch right after it:
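The sketch below is my own illustration of the protocol, not production code: it uses plain integers and exact rationals instead of a finite field, and it does nothing against the malformed-vote problem discussed right after.

```python
# Toy demo of the decentralized vote-counting protocol described above.
# Illustration only: real systems work over finite fields and add
# zero-knowledge proofs that each encoded vote is well-formed (+1 or -1).
import random
from fractions import Fraction

A = 3                  # number of counting authorities
X = [1, 2, 3]          # each authority's public point Xa (distinct, nonzero)

def share_vote(vote: int) -> list:
    """Split a vote (+1 = yes, -1 = no) into A shares, one per authority."""
    # Random polynomial of degree A-1 whose constant term is the vote.
    coeffs = [vote] + [random.randint(-10**6, 10**6) for _ in range(A - 1)]
    return [sum(c * x**i for i, c in enumerate(coeffs)) for x in X]

def interpolate_at_zero(points: list) -> Fraction:
    """Lagrange interpolation: the polynomial's value at x = 0."""
    total = Fraction(0)
    for xi, yi in points:
        term = Fraction(yi)
        for xj, _ in points:
            if xj != xi:
                term *= Fraction(-xj, xi - xj)
        total += term
    return total

votes = [+1, +1, -1, +1, -1]     # the electorate
shares = [share_vote(v) for v in votes]

# Authority a sees only one share per voter, and publishes the sum Sa...
sums = [sum(s[a] for s in shares) for a in range(A)]

# ...and the published (Xa, Sa) points interpolate to the vote total.
tally = interpolate_at_zero(list(zip(X, sums)))
print("tally:", tally)           # 1 -> 'yes' wins by one vote
```

Each authority only ever learns one share per voter, yet the published sums are enough to recover the tally at x = 0.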
However, there are still problems:
Still, the voter cannot be sure that his/her vote was properly cast.
The authorities cannot be sure that a malicious voter did not compute his polynomial with a -100 constant term, such that a single cast would count as 100 negative votes.
Homomorphic secret sharing does not even touch the other problems of voting systems; only secrecy and trust are tackled.
The challenges
See, you still have to know a little bit about polynomials and interpolation to be able to use this in your software.
The crazy part is that, in homomorphic encryption terms, homomorphic secret sharing is one of the simplest challenges.
Don’t worry though: in my next article I will show you a neat library (Microsoft SEAL), share my experience with you, and give you some tips and tricks for the moment when you decide to adopt this.
Until next time, remember: don’t take anything for granted.
Lately, I’ve been doing some work in the area of cryptography and enterprise-scale data protection and privacy. And so, it hit me: things are a lot different than they used to be, and they are changing fast. It seems that things are changing towards a more secure environment, with stronger data protection and privacy requirements, and it also seems that these changes are being widely adopted. Somehow, I am happy about it. Somehow, I am worried.
Before I go a little deeper into the topic of how to design for critical privacy and data protection systems, let me just enumerate three of the factors responsible for generating the changes we are witnessing:
The evolving worldwide regulation and technology adoption started by the EU 2016/679 regulation (a.k.a. GDPR)
The unimaginable progress we are making in terms of big data analysis and ML
The digitalization of the citizen-facing services of nation-states (stuff like e-voting, that I really advocate against)
I don’t want to cover in depth the way I see each factor influencing the privacy and data protection landscape but, as we go on, I just want you to keep these three factors in mind. Mind the factors.
Emerging technologies
Talking about every concept and technology gaining momentum in this context is absolutely impossible. So, I choose to talk about two of the most challenging ones. Or, at least, the ones I perceive as the most challenging: this is going to be a two-episode series about Differential Privacy and Homomorphic Encryption.
Differential privacy. ε-Differential Privacy.
Differential Privacy, in a nutshell, from a space-station view, is a mathematical way of ensuring that reconstruction attacks are not possible, now or in the future.
Mathematical what? Reconstruct what? Time what? Let me give you a textbook example:
Assume we know the following about a group of people:
There are 7 people with a median age of 30 and a mean of 38.
4 are females, with a median of 30 and a mean of 33.5
4 love sugar, with a median of 51 and a mean of 48.5
3 sugar lovers are females, with a median of 36 and a mean of 36.7
Challenge: give me the age, sex, sugar preference and marital status of each individual.
Solution:
1. 8, female, sugar, not married
2. 18, male, no sugar, not married
3. 24, female, no sugar, not married
4. 30, male, no sugar, married
5. 36, female, sugar, married
6. 66, female, sugar, married
7. 84, male, sugar, married
Basically, a reconstruction attack for such a scenario involves finding peaks of plausibility in a plausibility-versus-probability plot. It goes something like this:
You can start brute-forcing all the combinations of the seven participants. Considering all the features except age (so: gender, sugar preference, marital status), each person has 2^3 = 8 possible combinations, which gives 8^7 = 2097152 possibilities for the group – but all have roughly the same plausibility. So a possibility/plausibility plot looks something like this:
See, there does not seem to be any peak in plausibility. But once we factor in the age, well, things change. For example, although it is possible to have a 150-year-old person, it is very implausible. Furthermore, it is more plausible for an older individual to be married than a younger one, and so on. So, if we factor in age plausibility, the graph looks more like this:
See, there’s a peak of plausibility. That is most likely our solution. Now, what if our published statistics are a little skewed? Say we introduce just enough noise into them that the impact on science is minimal, and we eliminate the unnecessary ones (if this can be done); then a reconstruction attack is almost impossible. The purpose is to flatten, as much as possible, the graph above.
Now, to be fair, in our stretched-out textbook example, there’s no need to do the brute-force-assumption plausibility plot. Because the mean and median are published for each subset of results, you can simply write a deterministic equation system and solve for the actual solution.
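And you can at least verify the claim yourself: a few lines of Python (just re-checking the numbers above, nothing more) confirm that the reconstructed individuals reproduce every published statistic exactly:

```python
# Verify that the reconstructed individuals reproduce the published stats.
from statistics import mean, median

# (age, sex, loves_sugar, married) - the reconstruction from above
people = [
    (8,  "F", True,  False),
    (18, "M", False, False),
    (24, "F", False, False),
    (30, "M", False, True),
    (36, "F", True,  True),
    (66, "F", True,  True),
    (84, "M", True,  True),
]

def describe(rows):
    ages = sorted(age for age, *_ in rows)
    return len(rows), median(ages), round(mean(ages), 1)

print(describe(people))                                      # 7, median 30, mean 38
print(describe([p for p in people if p[1] == "F"]))          # 4, median 30, mean 33.5
print(describe([p for p in people if p[2]]))                 # 4, median 51, mean 48.5
print(describe([p for p in people if p[1] == "F" and p[2]])) # 3, median 36, mean ~36.7
```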
Imagine that you, as an attacker, possess some external knowledge about your target. This external source may be a historical publication over the same set of data, or a different data source altogether. This makes your reconstruction job easier.
ε-Differential privacy systems have a way of defining a privacy loss (i.e., a quantitative measure of the increase in the plausibility plot). These systems also define a privacy budget, and this is one of the real treasures of this math: you can make sure that, over time, you are not making reconstruction attacks easier.
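To make “privacy loss” slightly less abstract, here is a sketch of the Laplace mechanism, the workhorse of ε-differential privacy (a simplified illustration under my own assumptions; real deployments need a careful sensitivity analysis):

```python
# The Laplace mechanism: release a statistic while spending epsilon
# of the privacy budget. Simplified sketch, not production-ready.
import random

def laplace_noise(scale: float) -> float:
    # The difference of two i.i.d. exponentials is Laplace-distributed.
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

def dp_release(true_value: float, sensitivity: float, epsilon: float) -> float:
    """sensitivity = the max change one individual can cause in the statistic;
    epsilon = the privacy loss you are willing to spend on this release."""
    return true_value + laplace_noise(sensitivity / epsilon)

ages = [8, 18, 24, 30, 36, 66, 84]
# A counting query has sensitivity 1; spend epsilon = 0.5 on it.
print(dp_release(len(ages), sensitivity=1, epsilon=0.5))
```

The smaller the ε you spend per answer, the noisier the answer – and the flatter the plausibility plot stays.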
This stuff gained momentum as the US Census Bureau got the word out that they are using it, and also encouraged people to ask the enterprises that own their data to use it.
So, as a software architect, how do I get ready for this?
First, at the moment there are no out-of-the-box solutions that can give you ε-differential privacy for your data. If this is a requirement for you, you are most probably going to work with some data scientists / math experts who are going to tell you exactly what the measure of privacy loss will be for the features in your data. At least that is what I did 😊 Once those measures are defined, you have to be ready to implement them.
There is a common pattern you can adopt. A proxy, a privacy guard:
You are smart enough to realize that CLEAN data means that some acceptable noise has been introduced, such that the privacy budget is not greatly, if at all, impacted.
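Here is a minimal sketch of what such a privacy guard could look like (hypothetical names and API, building on the Laplace mechanism above): it is the only component that sees raw data, and it stops answering once the budget is spent.

```python
# Hypothetical privacy-guard proxy: raw data never leaves the guard,
# every answer is noised, and every answer spends privacy budget.
import random

def laplace_noise(scale: float) -> float:
    return random.expovariate(1 / scale) - random.expovariate(1 / scale)

class BudgetExhausted(Exception):
    pass

class PrivacyGuard:
    def __init__(self, raw_data, total_budget: float):
        self._data = raw_data        # the absolute raw data, guarded
        self._budget = total_budget  # total epsilon the ecosystem may spend

    def query(self, statistic, sensitivity: float, epsilon: float) -> float:
        if epsilon > self._budget:
            raise BudgetExhausted("privacy budget spent - no more answers")
        self._budget -= epsilon
        return statistic(self._data) + laplace_noise(sensitivity / epsilon)

guard = PrivacyGuard(raw_data=[8, 18, 24, 30, 36, 66, 84], total_budget=1.0)
print(guard.query(len, sensitivity=1, epsilon=0.5))  # noisy count; 0.5 left
print(guard.query(len, sensitivity=1, epsilon=0.5))  # noisy count; budget spent
# A third query would raise BudgetExhausted - by design.
```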
Challenges
If it was easy, everyone would do it, but it’s not, so suck it.
First, you and your team must be ready to understand what a highly trained math scientist is talking about. Get resources for that.
Second, as an architect, you have to be careful to have formal definitions throughout your applications for the two concepts enumerated above: privacy budget and privacy loss.
Third, in both my experience and the ongoing textbook research, the database must contain the absolute raw data, including historic data if needed. This poses another security challenge: you don’t want to be fancy about using complicated math to protect your data while being vulnerable to a direct database attack. Something stupid like an injection attack has no place here. You can see now that the diagram above is oversimplified. It lacks a ton of proxies, security controls, DMZs and whatnot. Don’t make the same mistake I did and try to hide some data from the privacy guard; your life will be a misery.
Fourth, be extremely careful about documenting this. It is not rare for software ecosystems to change purpose, and they tend to be used where they are not supposed to be. It may happen that such an ecosystem, with time, gets to be used directly for scientific research, from behind the privacy guard. That might not be acceptable. You know, scientists don’t like noisy data. So I’ve heard; I’m not a scientist.
That’s all for now.
In the second part we’re going to talk a little bit about the time I used Homomorphic Encryption. A mother****ing monster for me.
Since you’re here, I believe you have a general idea of what homomorphic encryption is. If, however, you are a little confused, here it is in a nutshell: you can do data processing directly on encrypted data. E.g., an encryption of a 5 multiplied by an encryption of a 2 is an encryption of a 10. Tadaaa!
This is pure magic for privacy. Especially with the hype that is happening now, with all the data leaks, and new privacy regulation, and old privacy regulation, and s**t. Essentially, what you can do with this is very close to the holy grail of privacy: complete confidential computing – processing data that is already encrypted, without the decryption key. Assuming data protection in transit is already done. See the picture below:
Quick note here: most of the homomorphic schemes (BFV, CKKS, etc.) use a public/private key scheme for encryption/decryption of data.
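And if you want to demystify the multiplication example, you do not even need a modern scheme: textbook RSA (unpadded, and therefore utterly insecure – a toy, not a recommendation) is already multiplicatively homomorphic:

```python
# Textbook RSA is multiplicatively homomorphic:
# Enc(a) * Enc(b) mod n decrypts to a * b. Toy parameters, NOT secure.
p, q, e = 61, 53, 17
n = p * q                           # 3233
d = pow(e, -1, (p - 1) * (q - 1))   # private exponent

def enc(m: int) -> int:
    return pow(m, e, n)

def dec(c: int) -> int:
    return pow(c, d, n)

c10 = (enc(5) * enc(2)) % n         # multiply the ciphertexts...
print(dec(c10))                     # ...and get 10 back. Tadaaa!
```

Multiplying the ciphertexts multiplied the plaintexts; schemes like BFV and CKKS generalize this idea to both additions and multiplications on encrypted vectors.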
Now, I have been fortunate enough to work, in the past year, on a side project involving a lot of homomorphic encryption. I was using Microsoft SEAL, and it was great. I am not going to talk about the math behind this type of encryption, nor about the Microsoft SEAL library (although I consider it excellent), nor about the noise-propagation problem in this kind of encryption.
I am, however, going to talk about a common pitfall that I have seen, and that is worrying. This pitfall concerns the integrity of the processing result. Or, to be more precise, attacks on the integrity of the expected result of the processing.
An Example
Let me give you an example. Assume you have an IoT solution that is monitoring some oil rigs. The IoT devices encrypt the data they collect, then send it to a central service for statistical analysis. The central service does the processing and provides an API for some other clients used by top management.
(This is just an example. I am not saying I did exactly this. It would be $tupid to break an NDA and be so open about it.)
If I, as an attacker, compromise the service that is doing the statistical analysis, I cannot see the real data sent by the sensors. However, I could mess with it a little. I could, for instance, make sure that the statistical analysis returned by the API is rigged – that it shows whatever I want it to show.
I am not saying that I am able to change the input data. After all, as an attacker I do not have the key used for encryption, so I am not able to encrypt new data into the series. I just go ahead and alter the result.
It seems obvious that you should protect such a system against impersonation/MitM/spoofing attacks. Well. Apparently, it is not that obvious.
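For the record, the missing control is not rocket science. Here is a hypothetical sketch (stdlib only; the key name and provisioning are my assumptions) of authenticating the encrypted payloads so spoofed or altered messages get rejected. It will not save you from a fully compromised compute node, but it closes the impersonation/MitM/spoofing gap:

```python
# Hypothetical sketch: authenticate encrypted IoT payloads so spoofed or
# altered messages are rejected before they reach the statistics service.
import hmac
import hashlib

DEVICE_KEY = b"per-device secret, provisioned at manufacturing"  # assumption

def sign_payload(encrypted_reading: bytes) -> bytes:
    return hmac.new(DEVICE_KEY, encrypted_reading, hashlib.sha256).digest()

def verify_payload(encrypted_reading: bytes, tag: bytes) -> bool:
    expected = hmac.new(DEVICE_KEY, encrypted_reading, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)   # constant-time comparison

reading = b"...ciphertext bytes produced by the HE library..."
tag = sign_payload(reading)
assert verify_payload(reading, tag)             # genuine -> accepted
assert not verify_payload(reading + b"x", tag)  # tampered -> rejected
```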
The Trouble
While implementing this project, I got in touch with various teams that were working with homomorphic encryption, and it seems there is a recurring issue. The team implementing such a solution is usually made up of (at least) experienced developers with solid knowledge of math and cryptography. But it is not their role to handle the overall security of the system.
The team that is responsible for the overall security of the system is, unfortunately, often decoupled from the details of a project that is under development. What do they “know” about the project? Homomorphic encryption? Well, that is cool, data integrity is handled by encryption, so why put any extra effort into that?
Please, please, do not overlook basic security just because some pretty neat researchers made a breakthrough regarding the efficiency of implementing a revolutionary encryption scheme. Revolutionary does not mean lazy. And FYI, fully homomorphic encryption schemes have been theorized since 1978.
To be fair, I want to mention another library that is good at homomorphic encryption: PALISADE. I only have production experience with Microsoft SEAL, and thus I prefer it 😊
Last time, I briefed you on some of the steps you need to cover before starting to choose tools that will help you achieve compliance. Let’s dig a little deeper into that by using some real-life negative examples that I ran into.
Case: The insufficiently authenticated channel.
Disclosure disclaimer: the following examples are real. I have chosen to anonymize the data about the bank in this article, although I have no obligation whatsoever to do so. I could disclose the full information upon request.
At one point, I received an e-mail from a bank in my inbox. I was not, am not, and hopefully will not ever be a client of that particular bank. The e-mail seemed (from the subject line) to inform me about some new prices for the services the bank provided. It was not marked as spam, and so it intrigued me. I ran some checks (traces, headers, signatures, specific backtracking magic) and came to the conclusion that it was not spam, so I opened it. Surprise: it was directly addressed to me; my full name appeared somewhere inside. Oh, and of course it thanked ME for choosing to be their client. Well. Here’s a snippet (it is in Romanian, but you’ll get it):
Of course I complained to the bank. I asked them to inform me how they got my personal data, asked them to delete it, and so on. Boring.
About four-plus months later (not even close to a compliant response time), a response popped up:
Let me brief it for you: it said that I am a client of the bank, that I have a current account opened, and where the account was opened. Oh, but that is not all. They also gave me a copy of the original contract I supposedly signed, and a copy of the personal data processing document that I also supposedly signed and provided to them. With the full-blown personal data. I mean full-blown: name, national ID numbers, personal address, etc. One problem though: that data was not mine; it belonged to some other guy who had one additional middle name. And thus, a miracle data leak was born. It is small, but it can grow if you nurture it right…
What went wrong?
Well, in short, the guy filled in my e-mail address and nobody checked it – not him, not the bank, nobody. You can imagine the rest.
Here’s what I am wondering.
Now, in the 21st century, is it so hard to authenticate a channel of communication with a person? Is it difficult to implement a solution for e-mail confirmation based on some contract ID? Is it really? We could do it for you, bank. Really. We’ll make it integrate with whatever systems you have. Just, please, do it yourselves or ask for some help.
Obviously, privacy was 100% absent from the process of answering my complaint. Even though I made a privacy complaint 🙂 Is privacy totally absent from all your processes?
In the end, this is a great example of poor legislative compliance with zero security involved – I mean ZERO security. They have some minimal legal compliance: there is a separate document asking for personal data and asking for permission to process it. The document was retained, and it was accessible (OK, it was too accessible). They answered my complaint, even though not in a timely, compliant manner, and I received no justification for the delay.
Conclusions?
Have a good privacy program. A global one.
Have exquisite security. OK, not exquisite, but have some information security in place.
When you choose tools, make sure they can support your privacy program.
Don’t be afraid to customize the process, or the tools. I (and, to be honest, anybody in the business) could easily give you a quote for an authentication/authorization solution for your communication channels with any type of client.
I am sure you can already see for yourself how this is useful in the context of choosing tools that will help you organize your conference event, and still maintain its privacy compliance.
Last time, I briefed some of the main points that need review before thinking about making your event GDPR compliant, and also mentioned that in doing so, you will obtain, as a happy byproduct, a nice fingerprint of your event.
Now, as a side note, and as you have probably already figured out, this series of articles is not necessarily addressing those environments that already have a data governance framework in place. If this is your case, I am sure you already have the procedures and tools in place. This series may become interesting for you when we get to talk about some specific tools, information security topics and some disaster scenarios.
There are still some grounds to cover regarding this topic, so let’s go!
Most probably, your main focus in the beginning is: let’s cover some of the costs using sponsors, and let’s fire up the registration & call-for-content procedures right away. Now, let’s not just rush into that. In order for you to collect data from participants and speakers (in short), you must have a legal basis for doing so. The legal basis for the processing – in this case, just collecting the data – may not offer much of a choice, even though it seems to. In our experience, given the specifics of our activity, your options are: consent, or the fulfillment of a contract. You will probably want a homogeneous legal basis for all of your participants. Let’s assume consent as the legal basis for processing.
Consent
In order to be provided with consent, you are obligated to give the person offering consent several pieces of information:
[…]
The recipients of the personal data
The intention to transfer data to a third country or international organization
The storage period, or the criteria used to determine it
Whether automated decision-making is present in the processing
[…]
Just to name a few. I will not detail the full challenges of what a consent should be here, because this may become boring to you. You may know all this already. After all, you are already in this business 🙂
Several of these topics are easy to pinpoint if you went through the process detailed in the first article of the series (e.g., identifying the recipients of the personal data). Still, some of the topics do not derive from that first process.
Establishing Data-Flow and assessing the tools
In order for you to be able to answer questions like:
Is this data going to travel outside the EU? Where exactly?
Are we going to profile anybody, or do some automated decision-making?
you first need to define a data flow associated with personal data and, even more, start thinking about the tools you are going to use.
Remember how, in my first article, I talked about the need to think about third-party software that may help you with some of your activities? Where does this software keep its data? Is it outside the EU? Can you control this?
You see where I am going with this: formalizing the data flow and knowing which tools touch your data is of the utmost importance before even asking anybody for consent.
Don’t panic! These are things you needed to do for your event anyway; now, you just need to do them earlier. And if you ask me, at just the proper moment to benefit from them the most. You do not want to start thinking about what tools you need when you already have 300 attendees registered by phone. That would be a bummer.
Next time, I am going to take a deeper look into tools and some basic security requirements that we recommend! Be safe!
I’m starting a series of articles in which I will try to cover my experience in managing privacy and GDPR compliance for several IT-related conference events handled by “Avaelgo”. During this journey, I will also touch on some in-depth security aspects, so stay tuned for that.
As I am sure you already know, a conference is a place where people gather, get informed, do networking (business or personal), have fun, and who knows what other stuff they may be doing. The key aspect here is that for such a conference to be successful, you need a fair amount of people taking part in it. And since people are persons, well, that also means a fair amount of personal data.
There’s a lot to cover, but we’ll start with the basics. If this is the first time you are organizing such a conference, then you already have a head start: you don’t have to change anything. If not, then you must start by reviewing the processes that you already have in place.
In this first article I’m just going to cover the key points that you should review. Let’s go:
How do people get to know about your event?
It is very important to know exactly how you are going to market your event. The marketing step is very important and must itself be compliant with the regulation. This is a slightly separate topic, but it cannot be overlooked.
It does not matter whether you market to participants, speakers, or companies: personal data is still going to be involved.
How are people going to register for your event?
This means: how are you going to collect data about the participants? Is there going to be a website that allows registration? Do you allow registration by phone? There are still more questions to answer, but you get an idea of the baseline. These decisions will have a later impact on the security measures you need to take in order to secure those channels.
How are speakers going to onboard your event?
Same situation as above, but it may be that there is a different set of tools for a different workflow.
How are you going to verify the identity of the participants?
Is someone going to manually verify attendance and compare ID card names with a list? Is there going to be a tool? Is there a backup plan?
Do you handle housing / travelling for speakers / participants?
If yes, you will probably need to transfer some data to some hotels / airlines / taxis, etc.
Do you have sponsors? Do they require some privilege regarding the data of the participants?
This is a big one. As I am sure you know, some or all of the entities that collaborate on your conference will require some perks back from it. It may be that they are interested in recruitment activities, or marketing activities, or some other kind of activities involving the personal data of your participants. Tread carefully; everything must be transparent.
Will you get external help?
Companies / volunteers / software tools and services that will help you with different aspects of organizing the event? What are they going to do for you? If they touch personal data, it is kind of important to know before you give it away to them.
Are there going to be promotions / contests?
Usually, these will be treated separately, and onboarding to this kind of activity will be handled separately; but still, it is a good idea to know beforehand if you intend to do this.
As you can already imagine, this is not all, but we will anyway cover each topic from here in future articles, and then, probably, extend with some more.
This may look freaky and like a lot of work, but it really is not. Anyway, by trying to tackle personal privacy beforehand, you also get, as a happy byproduct, a cool fingerprint of what you need to do in order to have a successful event. Cheers to that!
A future article will come soon, covering the next steps. I am sure you already have an intuition of what those are.