The
age of Big Data is upon us. Fuelled by an incendiary mix of overblown
claims and dire warnings, the public debate over the handling and
exploitation of digital information on an astronomically large scale has
been framed in stark terms: on one side are transformative forces that
could immeasurably improve the human condition; on the other, powers so
subversive and toxic that a catastrophic erosion of fundamental
liberties looks inevitable.
The tension between
these opposites has marooned the discussion of Big Data. It is stuck
somewhere between Bletchley Park — the former Government Communications
Headquarters (GCHQ) location where the godfather of the computational
universe, Alan Turing, primed today's Big Data explosion during the
Second World War — and the satirical tomfoolery of South Park, which
recently portrayed the living core of all data as an incarcerated Father
Christmas cruelly wired up to a machine by the US's National Security
Agency (NSA).
We know from Edward Snowden's
widely publicized whistle-blowing revelations that the NSA — in
collusion with GCHQ — lifted vast amounts of data from Google and Yahoo,
under the once-top-secret codename Muscular. At the same time, we're
told that the potential for beneficial insights mined from anonymous,
adequately protected data is enormous.
Big Data
helps us find things we "might like" to buy on Amazon, for example, but
it has also left us vulnerable to surveillance by state and other
agencies. Companies such as Google and Facebook are essentially Big Data
businesses, whose staggering profitability stems from the application
of data analysis to advertising: these "free" services are paid for by
personal data surrendered automatically with every click.
In
finance, meanwhile, optimists foresee a theoretical end to all
stock-market crashes, thanks to insights derived from huge-scale
data-crunching, while others predict an automated, algorithmic road to
ruin. Similarly, the cost and efficiency of healthcare provision are set
to be radically transformed for the better with access to massive
amounts of data — likewise the development of new drugs and treatments.
But what about the mining of medical data without patient consent? So
the debate goes on.
One aspect of Big Data,
however, is beyond question: it is indeed very big, and it's getting
bigger by the millisecond. An IBM report in September estimated that 2.5
quintillion bytes of data are created every day (that's 25 followed by 17 zeros, or enough to fill roughly 10 million 250GB laptop hard drives) and that 90 per
cent of the world's data has been generated in the past two years:
everything from geo-tagged phone texts and tweets to credit-card
transactions and uploaded videos. By 2020, it's thought that the number of bytes will be 57 times the number of grains of sand on all the world's beaches.
So what's actually going on at
the coalface of Big Data, a code-centric world of striping,
load-balancing, clustering and massively parallel processing? What do
the analysts working with Big Data say it's going to do for us?
"You
get a fuller picture of the phenomenon you're interested in, with more
dimensions, and that lets you derive greater insights," says Big Data
pioneer Doug Cutting, chief architect at enterprise software company
Cloudera and founder of the popular open-source Big Data tool Hadoop.
Cutting's work on internet search technology for Yahoo during the
mid-2000s provided the ideal proving ground for combining vastly
increased computing power with huge and diverse datasets. "And from that
we've seen a new style of computing emerge."
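That new style of computing is exemplified by the MapReduce pattern Hadoop popularised: a "map" step turns raw records into key-value pairs, the framework groups them by key, and a "reduce" step aggregates each group, so the same job can be spread across thousands of machines. The sketch below compresses the idea into pure Python for illustration; it shows the pattern only, not Hadoop's actual Java API.

```python
# A minimal sketch of the map/shuffle/reduce pattern behind
# Hadoop-style computing: counting words across many "documents".
# Pure Python for illustration, not Hadoop's actual API.
from collections import defaultdict

def map_phase(document):
    # Emit a (key, value) pair for every word in one input split.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    # Group values by key, as the framework would between phases.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

documents = ["big data is big", "data about data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```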
The
revolutionary effects of this new approach cannot be overstated,
especially within the scientific community. For Brad Voytek, professor
of computational cognitive science and neuroscience at the University of
California San Diego, and "data evangelist" for app-based taxi service
Uber, Big Data has had a profound effect on the traditional scientific
method. "You can sweep through huge amounts of data and come up with new
observations," he says. "That's where the power of Big Data comes in.
It's automating the observation process. It's making everything easier
but in a way that few people yet understand. It's going to dramatically
speed up the scientific process and people have been doing some really
cool stuff with it."
Michael Schmidt, founder
and chief executive of American "machine-learning" start-up Nutonian,
established a Big Data landmark when, in partnership with robotics
engineer Hod Lipson at Cornell University, New York, he created Eureqa —
a piece of software that deduced Newton's Second Law of Motion by
analysing data from the chaotic movements of a double pendulum. What
took Newton years, the Eureqa algorithm accomplished in a matter of
hours. With Nutonian, Schmidt is now opening up that Big Data technology
beyond the college lab.
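Eureqa's approach is known as symbolic regression: propose candidate formulas, score each against the observed data, and keep the one that explains it best. The toy sketch below, which is purely illustrative and not Eureqa's far more general evolutionary search, scores three hand-written candidate laws against noisy synthetic force measurements and recovers F = ma.

```python
# A toy sketch of the idea behind symbolic regression: propose
# candidate formulas and keep the one that best explains the data.
# The hand-coded candidate list is purely illustrative; Eureqa's
# real search evolves formulas rather than picking from a fixed set.
import random

# Synthetic observations obeying F = m * a, with a little noise.
data = [(m, a, m * a + random.gauss(0, 0.01))
        for m in (1.0, 2.0, 3.0) for a in (0.5, 1.0, 2.0)]

candidates = {
    "F = m + a":    lambda m, a: m + a,
    "F = m * a":    lambda m, a: m * a,
    "F = m * a**2": lambda m, a: m * a ** 2,
}

def error(formula):
    # Mean squared error of a candidate formula over all observations.
    return sum((formula(m, a) - f) ** 2 for m, a, f in data) / len(data)

best = min(candidates, key=lambda name: error(candidates[name]))
print(best)  # expected: F = m * a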
"We want to accelerate
the process that scientists go through, to help you discover very deep
principles from data," he says. "We want to explain how things work."
The range of Eureqa's uses couldn't be more striking, from the
construction of better warplanes to helping save the lives of infants.
Schmidt is currently working with the United States Air Force, analysing
the strength of advanced super-alloys used in engine components. "They
are really interested in anticipating failures — knowing when things are
going to break, explode or stop working. We were able to show them the
most important things that go into a failure of a particular engine
part, at a finer resolution than ever before."
Eureqa
has also been used to help discover the optimal moment to remove
breathing tubes from prematurely born babies. "It's really critical when
you remove that tube, and allow the child to start breathing on its
own," says Schmidt. "Premature babies are hooked up to every monitoring
device you can imagine and we were able to take that data and winnow it
down to a few of these key metrics that drive the future health of the
babies. Which is pretty neat."
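Schmidt doesn't spell out the technique, but one common way to winnow many monitored signals down to a few key metrics is to fit a predictive model and rank the inputs by how much each one contributes. The sketch below does this with scikit-learn on synthetic data; the vital-sign names and the outcome are entirely hypothetical.

```python
# One common way to "winnow" many monitored signals down to a few
# key metrics: rank features by importance in a fitted model.
# Synthetic data and hypothetical feature names; not Schmidt's method.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
features = ["heart_rate", "oxygen_sat", "resp_rate", "temperature", "weight"]
X = rng.normal(size=(500, len(features)))
# Outcome driven mostly by two of the five signals, plus noise.
y = 3.0 * X[:, 1] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=500)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
ranked = sorted(zip(model.feature_importances_, features), reverse=True)
for importance, name in ranked:
    print(f"{name}: {importance:.2f}")  # oxygen_sat and resp_rate dominate
```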
Harnessed to Big
Data, this kind of analysis becomes the work of hours and minutes.
"Traditionally you could spend years before you could conclude on a
result. What's changed is that we have these huge datasets. You can
rapidly accelerate the entire discovery process."
While
the benefits of this revolutionary increase in analytical speed are
clear, Big Data is often inseparable from its source and context,
especially in the public realm, where ethical concerns are paramount.
Justin Keen, professor of health politics at the University of Leeds,
co-authored a June 2013 paper published in Policy & Internet, the
journal of the Oxford Internet Institute. In it, he addressed issues of
privacy and access in relation to Big Health Data. "The potential for
much greater exploitation of data held by government departments in
England and all around the world is real," he says. "We just haven't got
proper governance arrangements at the moment: we don't know what rules
should govern what NHS data gets published, and in what sort of
format."
Early in 2013, Health Secretary Jeremy
Hunt set the goal of a paperless NHS by April 2018, in line with
programmes including care.data, which links patient data across
different parts of the NHS. It is hoped that the resulting increase in
preventative treatments, coupled with improvements in health management,
will save billions and improve the quality of healthcare. The sticking
point is patient confidentiality.
"I'm very
happy to see that in the past month or two, senior civil servants have
actually put the brakes on," says Keen. "Releases of data through
care.data and other channels are actually going to be slowed down until
we've got these governance arrangements right. But advocates are not going to get the data releases they're hoping for as early as they might have hoped."
Despite this
slowdown, the Big Data community appears to be echoing Keen's note of
caution. "From my perspective as a person who works in data, of course I
want as much as I can get, because the more data you've got, the more
interesting things you can do with it," says Francine Bennett, chief
executive and co-founder of London-based Big Data specialists, Mastodon
C, which mined available data to co-create the CDEC Open Health Data
Platform, a showcase for insights generated by Big Health Data.
"However, as a person who's knowledgeable about data - and as a citizen
of the UK k whose health data is in these systems — I know that it could
be enormously damaging to privacy to release things which shouldn't be
released. It's hard to put the genie back in the bottle. I'm keen for it
to be done in a measured way."
Gil Elbaz,
founder and chief executive of open-data platform Factual, began his
career as a database engineer in Silicon Valley in the 1990s before
co-founding Applied Semantics, acquired by Google in 2003 for $102m.
Applied Semantics developed AdSense, the technology that matches online
advertisements to the pages being browsed and the person browsing them.
"The approach we took to the contextual targeting of ads was all rooted
in processing huge amounts of data," says Elbaz, whose Factual company
website affirms his core belief in "making data accessible".
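A classic way to do contextual matching of that kind is to represent both the page and each candidate advert as text vectors and score them by similarity. The toy sketch below uses TF-IDF and cosine similarity to show the principle; Google's production systems are vastly more sophisticated, and the page and adverts here are invented.

```python
# A toy illustration of contextual ad matching: score each ad against
# a page by cosine similarity of TF-IDF vectors. This shows only the
# principle, not AdSense's actual system.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

page = "reviews of lightweight hiking boots and waterproof jackets"
ads = {
    "boots_ad":  "durable hiking boots on sale",
    "phone_ad":  "new smartphone with great camera",
    "jacket_ad": "waterproof jackets for outdoor walking",
}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([page] + list(ads.values()))
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
for (name, _), score in zip(ads.items(), scores):
    print(f"{name}: {score:.2f}")  # hiking/outdoor ads score highest
```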
"We
take data privacy very seriously, and if somebody's data is theirs,
they should have the right to keep it private. That being said, there
are significant opportunities where data shouldn't be kept fully
private, because it's to society's benefit for it to be open," he says,
citing David Cameron's October 2013 announcement, at the Open Government
Partnership summit, of a public register of business ownership. "Data
at Factual is primarily business data," says Elbaz. "These businesses
want it to be available."
Even where the
privacy question is not an issue, Elbaz is concerned that information
can get trapped in hard-to-reach databases. "Too often today data is not
accessible. For example, why is it that software can't automatically check, given the age of a patient and any drug, whether a dosage is safe or lethal? Why can't it be flagged? The reason is that there is no open API (Application Programming Interface, a standard way for programs to query a service)
that has drugs and dosage ranges. It does not exist. Is there a
database? Yes. But it'll take a long time even to find the right person
to buy that data from. To me, this is insane."
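As Elbaz says, no such open API exists, so the sketch below is entirely hypothetical: a short Python illustration of what a dosage check against an open drug database could look like, with invented reference limits.

```python
# A sketch of the dosage check Elbaz imagines. No such open API exists
# (his point exactly), so the dataset and function here are entirely
# hypothetical illustrations of what one could look like.

# Hypothetical reference data: drug -> maximum safe daily dose in mg.
SAFE_DAILY_MG = {
    "paracetamol": {"adult": 4000, "child": 2000},
    "ibuprofen":   {"adult": 2400, "child": 1200},
}

def check_dosage(drug, age_years, daily_mg):
    """Flag a prescribed daily dose against a hypothetical safe range."""
    ranges = SAFE_DAILY_MG.get(drug.lower())
    if ranges is None:
        return "unknown drug: no data to check against"
    limit = ranges["child"] if age_years < 12 else ranges["adult"]
    return "within range" if daily_mg <= limit else "FLAG: exceeds limit"

print(check_dosage("Paracetamol", age_years=8, daily_mg=3000))
# -> FLAG: exceeds limit
```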
So
where's it all leading us? For some, the ultimate goal of Big Data has
been defined as a kind of supreme foresight: an ability to predict what
people want before they know they want it. Elbaz takes a more functional
view. "My holy grail is that if any piece of software needs access to
information, it can find that access at a reasonable cost," he says. "To
me, it is not crazy rocket science — it's the basic fabric of how a
global information system should work."
For
Schmidt, the quest for enlightenment has only just begun. "A lot of
promises have been made for Big Data in the hope that it has this
enormous value, and we're starting to chip away a little at that, but
there's still so much to be done."
Doug
Cutting, however, has little interest in the notion that Big Data will
supply some kind of predictive super-power. "I'm an engineer. I focus
down on the plumbing. I think I have a more concrete imagination about
what is possible. I don't believe it's possible to have an oracle that
can predict what I'll be interested in doing tomorrow. Moreover, I find
surprises invigorating; I'd hate to lose spontaneity in the world."
However,
he adds, certain kinds of things can be done better. "To me, the holy
grail is removing limitations and being able to achieve the
interconnectedness that we want; to be able to take advantage of all the
data and do all the things we imagine are possible. I don't think we
want to get there overnight as a society. We need to embrace these
things and understand what we want to happen and what we don't want to
happen, and build the right societal, legal and business structures. We
need to evolve."
Three eye-catching Big Data ventures
Open Data Institute
Aim: free data for all
Co-founded
by Sir Tim Berners-Lee, the inventor of the World Wide Web, to
encourage the exploitation of freely available data — aka "open data" —
the not-for-profit Open Data Institute has positioned itself as both a
catalyst for data innovation and a global hub for data expertise. Based
in Shoreditch, east London, the ODI oversees a network of collaborative
international "nodes", including Dubai and Buenos Aires, and has
incubated a growing number of Big Data start-ups — for example, Mastodon C
(see main feature), which identified potential NHS savings of about
£200m by crunching data relating to branded and generic drugs; and
Placr, which analyses real-time transportation and timetable information
to improve daily travel. theodi.org
The Human Brain Project
Aim: to reveal the workings of human consciousness
Flush
with €1bn in funding, the Human Brain Project is a 10-year quest to
reveal the hidden workings of consciousness. The scale of this task is
so immense — the brain has around 100 trillion neural connections — that
many still doubt it can be achieved, but Switzerland-based project
leader Henry Markram believes his collaborative Big Data approach, using
statistical simulations and vast supercomputing power across "swarms"
of researchers, might do the trick. One aspect of the plan involves
mining a huge amount of available data on mental disorders from public
hospitals as well as pharmaceutical company databases; algorithms will
then isolate revealing patterns and connections. In a decade's time, the
neural picture should be much clearer. humanbrainproject.eu
IBM's Computational Creativity
Aim: to make computers 'creative'
Following
a line of computer evolution that runs from Deep Blue (which beat Garry
Kasparov at chess in 1997) through Watson (which beat human opponents on
the US quiz show Jeopardy! in 2011), IBM has continued its ingenious
manipulation of huge datasets with a system designed to generate
creativity. Big-data analytics techniques have been deployed by IBM's
Thomas J Watson Research Center to create new food recipes — what you
might call technouvelle cuisine — mined from sources including Wikipedia
and Fenaroli's Handbook of Flavor Ingredients, then tweaked with an
algorithm designed to add creativity to matched ingredients. The results
(from Vietnamese apple kebab to Cuban lobster bouillabaisse) have
impressed human chefs. research.ibm.com