Digital Breadcrumbs


In April 2013 the world population was 7,057,065,162 (Hunt 2013). This population increasingly accesses and uses communications and digital media, creating huge quantities of real-time and archived data, although it remains divided in its access to digital technology (Berry 2011). We often talk about the vast increase in data creation and transmission, but it is sometimes difficult to find recent and useful quantitative measures of the current contours of digital media. Indeed, the internet, as we tend to think of it, has become increasingly colonised by massive corporate technology stacks. These companies – Google, Apple, Facebook, Amazon and Microsoft – are collectively called "the Stacks" (Berry 2013). Helpfully, the CIA's chief technology officer, Ira Hunt (2013), has listed the general data numbers for the "stacks" and given some useful comparative numbers in relation to telecoms and SMS messaging (see figure 1).



Data Provider                        Quantitative Measures

Google (2009 stats from SEC filing)
    More than 100 petabytes of data.
    One trillion indexed URLs.
    Three million servers.
    7.2 billion page-views per day.

Facebook (August 2012)
    More than one billion users.
    More than 300 petabytes of data, growing by more than 500 terabytes per day.
    Holds 35% of the world's photographs.

Youtube (2013)
    More than 1,000 petabytes of data (1 exabyte).
    More than 72 hours of video uploaded per minute, or 37 million hours per year.
    4 billion views per day.

Twitter (2013)
    More than 124 billion tweets per year.
    390 million tweets per day, or ~4,500 tweets per second.

Global Text Messaging (2013)
    More than 6.1 trillion text messages per year.
    193,000 messages sent per second, or 876 per person per year.

US Cell Calls (2013)
    More than 2.2 trillion minutes per year, or 19 minutes per person per day.
    A year of uncompressed telephone audio is smaller in size than a year of Youtube data.

figure 1: Growth in Data Collections and Archives (adapted from Hunt 2013)
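
As a quick sanity check, most of the derived rates in figure 1 follow from its headline numbers. A minimal Python sketch (my own arithmetic, not part of Hunt's presentation, assuming a 365-day year):

    # Checking figure 1's derived rates against its headline numbers.
    SECONDS_PER_DAY = 24 * 60 * 60            # 86,400
    SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # 31,536,000

    # Twitter: 390 million tweets per day -> "~4,500 tweets per second".
    print(390e6 / SECONDS_PER_DAY)            # ~4,514

    # Text messaging: 6.1 trillion per year -> "193,000 per second".
    print(6.1e12 / SECONDS_PER_YEAR)          # ~193,430

    # Youtube: 72 hours uploaded per minute -> "37 million hours per year".
    print(72 * 60 * 24 * 365)                 # 37,843,200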


The CIA have a particular interest in big data and in the growth of the "digital breadcrumbs" left by digital devices. They are, of course, tasked with the security of the United States and have always had an interest in data collection and analysis, but it is fascinating to see how the value of data increasingly shapes the collection of SIGINT that is digital and subject to computational analysis. As Hunt argued,
"The value of any piece of information is only known when you can connect it with something else that arrives at a future point in time... Since you can't connect dots you don't have, it drives us into a mode of, we fundamentally try to collect everything and hang on to it forever" (Sledge 2013)
It is also interesting to note the implicit computationality that shapes and frames the way in which intelligence gathering is expected to develop, given these trends in data and information growth. Nevertheless, these desires shape not just the CIA and other security services, but any organisation interested in using archival and real-time data to undertake analysis and prediction – which, in a computational age, is increasingly all organisations.

Information has time value and can soon lose its potency. This drives the growth not just of big data but of real-time analysis, particularly where real-time streams and archival databases can be compared and processed together in real time. Real-time processing remains a huge challenge, pushing at the limits of current computal systems and data-analytic tools. Unsurprisingly, new levels of expertise are called for, usually grouped under the notion of "data science", a thoroughly interdisciplinary approach sometimes understood as the movement from "search" to "correlation". Indeed, as Sledge (2013) reports,
"It is really very nearly within our grasp to be able to compute on all human generated information," Hunt said. After that mark is reached, Hunt said, the [CIA] agency would also like to be able to save and analyze all of the digital breadcrumbs people don't even know they are creating (Sledge 2013).
In a technical sense, the desire in these "really big data" applications is to move from what is called "batch map/reduce", as represented by Hadoop and related computational systems, to "real-time map/reduce", whereby real-time analytics are made possible, currently represented by technologies like Google's Dremel (Melnik et al 2010), Caffeine (Higgenbotham 2010), Impala (Brust 2012), Apache Drill (Vaughan-Nichols 2013), Spanner (Iqbal 2013), etc. This is the use of real-time stream processing combined with complex analytics and the ability to manage large historical data sets. The challenges for the hardware are considerable, requiring peta-scale RAM architectures so that the data can be held in memory, but also the construction of huge distributed memory systems enabling in-memory analytics (Hunt 2013).
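
The distinction can be made concrete with a toy example. The sketch below (my own illustration in Python, not Hadoop or Dremel code) contrasts a batch map/reduce word count over a complete, stored dataset with a streaming version whose answer stays current as each record arrives:

    from collections import Counter
    from functools import reduce

    documents = ["signals intelligence", "signals analysis", "data analysis"]

    # Batch map/reduce: map each stored document to partial counts,
    # then reduce (merge) the partial results into a single answer.
    mapped = [Counter(doc.split()) for doc in documents]          # map phase
    batch_counts = reduce(lambda a, b: a + b, mapped, Counter())  # reduce phase

    # Real-time/streaming: keep a running result, updated per arriving record,
    # so an answer is available at any moment rather than after a batch run.
    stream_counts = Counter()
    for doc in documents:  # stand-in for records arriving over time
        stream_counts.update(doc.split())

    assert batch_counts == stream_counts  # same answer, different temporality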


Traditional Computer Processing           Real-Time Analytics/Big Data

Data on storage area network (SAN)        Data at processor
Move data to question                     Move question to data
Backup                                    Replication Management
Vertical Scaling                          Horizontal Scaling
Capacity after demand                     Capacity ahead of demand
Disaster recovery                         Continuity of Operations Plan (COOP)
Size to peak load                         Dynamic/elastic provisioning
Tape                                      Storage area network (SAN)
Storage area network (SAN)                Disk
Disk                                      Solid-state disk
Limited RAM                               Peta-scale RAM

figure 2: Tectonic Technology Shifts (adapted from Hunt 2013)

These institutional demands are driving the development of new computing architectures, organised around principles such as: data close to compute; power at the edge; optical computing and optical buses; the end of the motherboard; shared pools of everything; and new softwarized hardware systems that allow compute, storage, networking and even the entire data centre to be subject to software control and management (Hunt 2013). This is the final realisation of the importance of the network, and it exposes the limitations of current network technologies, which become one of the constraints on the growth of future softwarized systems.

This continues the move towards context as the key technical imaginary shaping the new real-time streaming digital environment (see Berry 2012). Its principles include "schema on read", which enables the data returned to be shaped in relation to the context of the question asked; "user assembled analytics", whereby analyses are assembled in response to a particular set of research questions; and elastic computing, which enables computing power to be drawn on in proportion to the demands of a query or processing task in real time, much as electricity is drawn from the mains in greater quantities as it is required.
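
To make "schema on read" concrete, consider a minimal sketch (my illustration with invented field names, not any particular vendor's system): records are stored raw, and each question projects its own schema onto them only at query time:

    import json

    # Heterogeneous raw records, stored as-is with no enforced schema.
    raw_store = [
        '{"device": "phone-17", "lat": 50.82, "lon": -0.13, "ts": 1381756800}',
        '{"device": "phone-17", "battery": 0.41, "ts": 1381756860}',
    ]

    def query(raw, schema):
        """Apply a schema at read time, shaped by the question being asked."""
        for line in raw:
            record = json.loads(line)
            yield {field: record.get(field) for field in schema}

    # Two different questions impose two different schemas on the same store.
    locations = list(query(raw_store, ["device", "lat", "lon"]))
    battery   = list(query(raw_store, ["device", "battery"]))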

These forces are combining in ways that accelerate the pace of data collection, whether from the data exhausts left by users or through open-source intelligence that literally vacuums data from the fibre-optic cables that straddle the globe. As such, they raise important questions about the forms of critical technical practice that are relevant to them, and about how we can ensure that citizens remain informed. To take one small example, the mobile phone is now packed with real-time sensors that constantly monitor and process contextual information about its location, its use and the activities of its user. This data is not always under the control of the user, and in many cases is easily leaked, hacked or collected by third parties without the user's understanding or consent (Berry 2012).

The notion that we leave behind "digital breadcrumbs", not just on the internet but across the whole of society, the economy, culture and even everyday life, is an issue that societies are only now coming to terms with. Notwithstanding the recent Snowden revelations (see Poitras et al 2013), the new computational techniques outlined in this article demonstrate the disconnect between people's everyday understanding of technology and the reality of its penetration of life and of total surveillance. Not just the lives of others are at stake here, but the very shape of public culture and the ability of individuals to make a "public use of reason" (Kant 1784) without being subject to the chilling effects of state and corporate monitoring of their public activities. Indeed, computal technologies such as those described here have little respect for the public/private distinction that our political systems have naturalised as a condition of possibility for political life at all. This makes it ever more imperative that we provide citizens with the ability to undertake critical technical practices, both to choose how to manage the digital breadcrumbs they leave as trails in public spaces, and to pull down the blinds on the post-digital gaze of state and corporate interests through the use of cryptography and critical encryption practices.
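
By way of a small illustration of such a practice (a sketch using the third-party Python cryptography package, with deliberately simplified key handling rather than operational guidance):

    from cryptography.fernet import Fernet  # pip install cryptography

    key = Fernet.generate_key()   # in practice, generated and kept by the user
    cipher = Fernet(key)

    message = b"a private note, unreadable to third parties without the key"
    token = cipher.encrypt(message)          # what a network observer would see
    assert cipher.decrypt(token) == message  # only the key-holder can read it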



Bibliography

Berry, D. M. (2011) The Philosophy of Software: Code and Mediation in the Digital Age, London: Palgrave.

Berry, D. M. (2012) The social epistemologies of software, Social Epistemology, 26 (3-4), pp. 379-398. ISSN 0269-1728

Berry, D. M. (2013) Signposts for the Future of Computal Media, Stunlaw, accessed 14/10/2013, http://stunlaw.blogspot.co.uk/2013/08/signposts-for-future-of-computal-media.html

Brust, A. (2012) Cloudera's Impala brings Hadoop to SQL and BI, accessed 14/10/2013, http://www.zdnet.com/clouderas-impala-brings-hadoop-to-sql-and-bi-7000006413/

Higgenbotham, S. (2010) How Caffeine Is Giving Google a Turbo Boost, accessed 14/10/2013, http://gigaom.com/2010/06/11/behind-caffeine-may-be-software-to-inspire-hadoop-2-0/

Hunt, I. (2013) The CIA's "Grand Challenges" with Big Data, accessed 14/10/2013,  http://new.livestream.com/gigaom/structuredata/videos/14306067

Iqbal, M. T. (2013) Google Spanner : The Future Of NoSQL, accessed 14/10/2013,  http://www.datasciencecentral.com/profiles/blogs/google-spanner-the-future-of-nosql

Kant, I. (1784) What Is Enlightenment?, accessed 14/10/2013, http://www.columbia.edu/acis/ets/CCREAD/etscc/kant.html

Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M. and Vassilakis, T. (2010) Dremel: Interactive Analysis of Web-Scale Datasets, Proc. of the 36th Int'l Conf on Very Large Data Bases, pp. 330-339.

Poitras, L., Rosenbach, M., Schmid, F., Stark, H. and Stock, J. (2013) How the NSA Targets Germany and Europe, Spiegel, accessed 02/07/2013, http://www.spiegel.de/international/world/secret-documents-nsa-targeted-germany-and-eu-buildings-a-908609.html

Sledge, M. (2013) CIA's Gus Hunt On Big Data: We 'Try To Collect Everything And Hang On To It Forever', accessed 14/10/2013, http://www.huffingtonpost.com/2013/03/20/cia-gus-hunt-big-data_n_2917842.html

Vaughan-Nichols, S. J. (2013) Drilling into Big Data with Apache Drill, accessed 14/10/2013, http://blog.smartbear.com/open-source/drilling-into-big-data-with-apache-drill/
