Can data overload protect privacy?

Privacy advocates are probably foaming at the mouth over the shocking revelation that in June 2006 every conversation on the MSN instant messaging system was collected and passed to researchers Eric Horvitz and Jure Leskovec at Microsoft Research.
They claim they weren't interested in the content of the messages but were simply investigating the behaviour of a 'planetary scale system'.
There is nothing earth-shattering about the results: they show that people are more likely to chat with others in the same geographical location, in the same age group and of the same sex. But as the arXiv blogger points out, the most interesting aspect of the research is that the researchers struggled to cope with the size of their dataset.
"The dataset consisted of 30 billion conversations generated by 240 million distinct users over one month. We found that approximately 90 million distinct Messenger accounts were accessed each day and that these users produced about 1 billion conversations, with approximately 7 billion exchanged messages per day."
"The sheer size of the data limits the kinds of analyses one can perform," the researchers write.
For years now, various security services around the world have made moves to assemble databases of online communication. They want to watch over phone calls, social networking sites and emails. But extracting useful information, rather than just generalities like those in the study mentioned, is going to require massive amounts of storage and processing power.
"Each day yielded about 150 gigabytes of compressed text logs (4.5 terabytes in total). Copying the data to a dedicated eight-processor server with 32 gigabytes of memory took 12 hours. Our log-parsing system employed a pipeline of four threads that parse the data in parallel, collapse the session join/leave events into sets of conversations, and save the data in a compact compressed binary format. This process compressed the data down to 45 gigabytes per day. Processing the data took an additional 4 to 5 hours per day."
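To get a feel for the scale, the quoted figures can be combined into a few back-of-the-envelope numbers. The short Python sketch below does only that arithmetic. The constant and function names are my own, and it assumes decimal units (1 terabyte = 1,000 gigabytes) and a 30-day month; the derived ratios are rough averages, not results reported by the researchers.

```python
# Back-of-the-envelope arithmetic on the figures quoted above.
# Assumptions (not from the paper): decimal units, a 30-day month.

DAYS = 30                        # June 2006
RAW_GB_PER_DAY = 150             # compressed text logs per day
BINARY_GB_PER_DAY = 45           # per day, after conversion to the binary format
ACTIVE_ACCOUNTS_PER_DAY = 90e6   # distinct accounts accessed each day
CONVERSATIONS_PER_DAY = 1e9
MESSAGES_PER_DAY = 7e9


def dataset_summary():
    """Return derived, approximate figures as a dict."""
    return {
        "raw_total_tb": RAW_GB_PER_DAY * DAYS / 1000,
        "binary_total_tb": BINARY_GB_PER_DAY * DAYS / 1000,
        "size_reduction_factor": RAW_GB_PER_DAY / BINARY_GB_PER_DAY,
        "messages_per_conversation": MESSAGES_PER_DAY / CONVERSATIONS_PER_DAY,
        "conversations_per_active_account": CONVERSATIONS_PER_DAY / ACTIVE_ACCOUNTS_PER_DAY,
        # Compressed log bytes per message, metadata included (not message text).
        "log_bytes_per_message": RAW_GB_PER_DAY * 1e9 / MESSAGES_PER_DAY,
    }


if __name__ == "__main__":
    for name, value in dataset_summary().items():
        print(f"{name}: {value:.2f}")
```

The first figure reproduces the 4.5 terabytes given in the quote, which suggests the per-day numbers are consistent with the monthly total.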
But to quote the arXiv blogger:
So will data overload always protect us from Big Brother’s prying eyes? Perhaps in some circumstances like these but otherwise I wouldn’t count on it. It’s straightforward to sample big datasets like this (although that can introduce problems of its own).
I wouldn’t mind betting that with a little more effort, it would be possible to identify individuals from their travel and chatting patterns, perhaps by correlating the data with local telephone and business directories much in the same way this has been done with search data. However, it looks as if Horvitz and Leskovec have steered carefully around this issue.
Of course, Microsoft doesn’t need to do this since it can store a much fuller set of data anyway including the full text of the conversations and whatever data it has on the identity of the owners.
And you can be sure that more shadowy organisations with access to much greater computing resources will also have this full data set and be happily chewing through it as you read this.
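The remark above that sampling a big dataset is straightforward is easy to illustrate. One standard technique is reservoir sampling (Algorithm R), which draws a uniform random sample of fixed size from a stream in a single pass, without ever holding the whole stream in memory. The sketch below is a generic illustration in Python; the function name and the `seed` parameter are my own, and nothing here is claimed to be what Horvitz and Leskovec actually did.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return up to k items drawn uniformly at random from an iterable.

    Makes one pass over `stream` and keeps at most k items in memory.
    """
    rng = random.Random(seed)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Pick a slot in 0..i inclusive; the new item is kept
            # with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    # Hypothetical stand-in for a log of conversation records.
    conversations = (f"conversation-{n}" for n in range(1_000_000))
    print(reservoir_sample(conversations, 5, seed=1))
```

A uniform sample of conversations like this is cheap to take, but it would likely thin out each person's contact list unevenly, which is one example of the "problems of its own" that sampling can introduce when the question is about who talks to whom.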



