A revelatory exploration of the hottest trend in technology
and the dramatic impact it will have on the economy, science, and
society at large.
Which paint color is most likely to tell you that a used car is
in good shape? How can officials identify the most dangerous New
York City manholes before they explode? And how did Google searches
predict the spread of the H1N1 flu outbreak?
The key to answering these questions, and many more, is big
data. “Big data” refers to our burgeoning ability to crunch vast
collections of information, analyze it instantly, and draw
sometimes profoundly surprising conclusions from it. This emerging
science can translate myriad phenomena—from the price of airline
tickets to the text of millions of books—into searchable form, and
uses our increasing computing power to unearth epiphanies that we
never could have seen before. A revolution on par with the Internet
or perhaps even the printing press, big data will change the way we
think about business, health, politics, education, and innovation
in the years to come. It also poses fresh threats, from the
inevitable end of privacy as we know it to the prospect of being
penalized for things we haven’t even done yet, based on big data’s
ability to predict our future behavior.
In this brilliantly clear, often surprising work, two leading
experts explain what big data is, how it will change our lives, and
what we can do to protect ourselves from its hazards. Big Data is
the first big book about the next big thing.
About the Authors:
VIKTOR MAYER-SCHÖNBERGER is Professor of Internet Governance
and Regulation at the Oxford Internet Institute, Oxford University.
A widely recognized authority on big data, he is the author of over
a hundred articles and eight books, of which the most recent is
Delete: The Virtue of Forgetting in the Digital Age. He is on the
advisory boards of corporations and organizations around the world,
including Microsoft and the World Economic Forum.
KENNETH CUKIER is the Data Editor of the Economist and a
prominent commentator on developments in big data. His writings on
business and economics have appeared in Foreign Affairs, the New
York Times, the Financial Times, and elsewhere.
Excerpt:
1
NOW
IN 2009 A NEW FLU virus was discovered. Combining elements of the
viruses that cause bird flu and swine flu, this new strain, dubbed
H1N1, spread quickly. Within weeks, public health agencies around
the world feared a terrible pandemic was under way. Some
commentators warned of an outbreak on the scale of the 1918 Spanish
flu that had infected half a billion people and killed tens of
millions. Worse, no vaccine against the new virus was readily
available. The only hope public health authorities had was to slow
its spread. But to do that, they needed to know where it already
was.
In the United States, the Centers for Disease
Control and Prevention (CDC) requested that doctors inform them of
new flu cases. Yet the picture of the pandemic that emerged was
always a week or two out of date. People might feel sick for days
but wait before consulting a doctor. Relaying the information back
to the central organizations took time, and the CDC only tabulated
the numbers once a week. With a rapidly spreading disease, a
two-week lag is an eternity. This delay completely blinded public
health agencies at the most crucial moments.
As it happened, a few weeks before the H1N1 virus
made headlines, engineers at the Internet giant Google published a
remarkable paper in the scientific journal Nature. It created a
splash among health officials and computer scientists but was
otherwise overlooked. The authors explained how Google could
“predict” the spread of the winter flu in the United States, not
just nationally, but down to specific regions and even states. The
company could achieve this by looking at what people were searching
for on the Internet. Since Google receives more than three billion
search queries every day and saves them all, it had plenty of data
to work with.
Google took the 50 million most common search terms
that Americans type and compared the list with CDC data on the
spread of seasonal flu between 2003 and 2008. The idea was to
identify people infected by the flu virus by what they searched for
on the Internet. Others had tried to do this with Internet search
terms, but no one else had as much data, processing power, and
statistical know-how as Google.
While the Googlers guessed that the searches might
be aimed at getting flu information (typing phrases like “medicine
for cough and fever”), that wasn’t the point: they didn’t know, and
they designed a system that didn’t care. All their system did was
look for correlations between the frequency of certain search
queries and the spread of the flu over time and space. In total,
they processed a staggering 450 million different mathematical
models in order to test the search terms, comparing their predictions
against actual flu cases from the CDC in 2007 and 2008. And they
struck gold: their software found a combination of 45 search terms
that, when used together in a mathematical model, had a strong
correlation between their prediction and the official figures
nationwide. Like the CDC, they could tell where the flu had spread,
but unlike the CDC they could tell it in near real-time, not a week
or two after the fact.
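To make the mechanics concrete, here is a minimal sketch in Python of that kind of correlation hunt. It is not Google's actual system; the file names, column names, and the simple linear fit are assumptions made purely for illustration.

    # A minimal sketch of the idea (not Google's actual system): score each
    # candidate search term by how well its weekly query frequency tracks
    # official flu counts, then fit a simple model on the best-scoring terms.
    # File names, column names, and the cutoff of 45 terms are illustrative.
    import pandas as pd
    import numpy as np
    from numpy.linalg import lstsq

    # Hypothetical inputs: one row per week.
    # queries.csv -> columns: week, <one column per candidate search term>
    # cdc_flu.csv -> columns: week, ili_cases (reported influenza-like illness)
    queries = pd.read_csv("queries.csv", index_col="week")
    cdc = pd.read_csv("cdc_flu.csv", index_col="week")["ili_cases"]
    cdc = cdc.reindex(queries.index)  # align both series week by week

    # Rank every candidate term by its correlation with the CDC series.
    correlations = queries.corrwith(cdc).sort_values(ascending=False)
    top_terms = correlations.head(45).index  # keep the 45 best-tracking terms

    # Fit a simple linear model: flu cases ~ weighted sum of term frequencies.
    X = np.column_stack([queries[top_terms].values, np.ones(len(queries))])
    weights, *_ = lstsq(X, cdc.values, rcond=None)

    # "Nowcast" this week's flu activity from this week's search volumes alone,
    # rather than waiting a week or two for doctors' reports to be tabulated.
    this_week = np.append(queries[top_terms].iloc[-1].values, 1.0)
    print("estimated flu activity:", this_week @ weights)

The design choice mirrors the passage above: the model never asks why a search term tracks the flu, only whether it does.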
Thus when the H1N1 crisis struck in 2009, Google’s
system proved to be a more useful and timely indicator than
government statistics with their natural reporting lags. Public
health officials were armed with valuable information.
Strikingly, Google’s method does not involve
distributing mouth swabs or contacting physicians’ offices.
Instead, it is built on “big data”: the ability of society to
harness information in novel ways to produce useful insights or
goods and services of significant value. With it, by the time the
next pandemic comes around, the world will have a better tool at
its disposal to predict and thus prevent its spread.
Public health is only one area where big data is making a big
difference. Entire business sectors are being reshaped by big data
as well. Buying airplane tickets is a good example.
In 2003 Oren Etzioni needed to fly from Seattle to
Los Angeles for his younger brother’s wedding. Months before the
big day, he went online and bought a plane ticket, believing that
the earlier you book, the less you pay. On the flight, curiosity
got the better of him and he asked the fellow in the next seat how
much his ticket had cost and when he had bought it. The man turned
out to have paid considerably less than Etzioni, even though he had
purchased the ticket much more recently. Infuriated, Etzioni asked
another passenger and then another. Most had paid less.
For most of us, the sense of economic betrayal
would have dissipated by the time we closed our tray tables and put
our seats in the full, upright, and locked position. But Etzioni is
one of America’s foremost computer scientists. He sees the world as
a series of big-data problems, ones that he can solve. And he has
been mastering them since he graduated from Harvard in 1986 as its
first undergrad to major in computer science.
From his perch at the University of Washington, he
started a slew of big-data companies before the term “big data”
became known. He helped build one of the Web’s first search
engines, MetaCrawler, which was launched in 1994 and snapped up by
InfoSpace, then a major online property. He co-founded Netbot, the
first major comparison-shopping website, which he sold to Excite.
His startup for extracting meaning from text documents, called
ClearForest, was later acquired by Reuters.
Back on terra firma, Etzioni was determined to
figure out a way for people to know if a ticket price they see
online is a good deal or not. An airplane seat is a commodity: each
one is basically indistinguishable from others on the same flight.
Yet the prices vary wildly, being based on a myriad of factors that
are mostly known only by the airlines themselves.
Etzioni concluded that he didn’t need to decrypt
the rhyme or reason for the price differences. Instead, he simply
had to predict whether the price being shown was likely to increase
or decrease in the future. That is possible, if not easy, to do.
All it requires is analyzing all the ticket sales for a given route
and examining the prices paid relative to the number of days before
the departure.
If the average price of a ticket tended to
decrease, it would make sense to wait and buy the ticket later. If
the average price usually increased, the system would recommend
buying the ticket right away at the price shown. In other words,
what was needed was a souped-up version of the informal survey
Etzioni conducted at 30,000 feet. To be sure, it was yet another
massive computer science problem. But again, it was one he could
solve. So he set to work.
Using a sample of 12,000 price observations that
was obtained by “scraping” information from a travel website over a
41-day period, Etzioni created a predictive model that handed its
simulated passengers a tidy savings. The model had no understanding
of why, only what. That is, it didn’t know any of the variables
that go into airline pricing decisions, such as number of seats
that remained unsold, seasonality, or whether some sort of magical
Saturday-night-stay might reduce the fare. It based its prediction
on what it did know: probabilities gleaned from the data about
other flights. “To buy or not to buy, that is the question,”
Etzioni mused. Fittingly, he named the research project Hamlet.
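As a rough illustration of that buy-or-wait logic, here is a minimal sketch in Python. It is not Hamlet or Farecast itself; the data file, column names, and the simple comparison of average fares are assumptions made for illustration.

    # A minimal sketch of the buy-or-wait idea (not the actual Hamlet/Farecast
    # system): from historical fare observations for one route, estimate
    # whether prices tend to fall or rise after the current point in time,
    # and recommend waiting or buying accordingly.
    import pandas as pd

    # Hypothetical scraped data: one row per observed fare.
    # fares.csv -> columns: route, days_before_departure, price
    fares = pd.read_csv("fares.csv")

    def recommend(route: str, days_out: int) -> str:
        history = fares[fares["route"] == route]
        # Average fare historically seen at this many days out...
        now_avg = history[history["days_before_departure"] == days_out]["price"].mean()
        # ...versus the average fare seen later, closer to departure.
        later_avg = history[history["days_before_departure"] < days_out]["price"].mean()
        # If fares on this route historically dropped after this point, wait;
        # if they historically climbed, buy at the price shown today.
        return "WAIT" if later_avg < now_avg else "BUY"

    print(recommend("SEA-LAX", days_out=30))

Like the flu example, the sketch only needs to know what prices did, not why the airlines set them that way.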
The little project evolved into a venture
capital-backed startup called Farecast. By predicting whether the
price of an airline ticket was likely to go up or down, and by how
much, Farecast empowered consumers to choose when to click the
“buy” button. It armed them with information to which they had
never had access before. Upholding the virtue of transparency
against itself, Farecast even scored the degree of confidence it
had in its own predictions and presented that information to users
too.
To work, the system needed lots of data. To improve
its performance, Etzioni got his hands on one of the industry’s
flight reservation databases. With that information, the system
could make predictions based on every seat on every flight for most
routes in American commercial aviation over the course of a year.
Farecast was now crunching nearly 200 billion flight-price records
to make its predictions. In so doing, it was saving consumers a
bundle.
With his sandy brown hair, toothy grin, and
cherubic good looks, Etzioni hardly seemed like the sort of person
who would deny the airline industry millions of dollars of
potential revenue. In fact, he set his sights on doing even more
than that. By 2008 he was planning to apply the method to other
goods like hotel rooms, concert tickets, and used cars: anything
with little product differentiation, a high degree of price
variation, and tons of data. But before he could hatch his plans,
Microsoft came knocking on his door, snapped up Farecast for around
$110 million, and integrated it into the Bing search engine. By
2012 the system was making the correct call 75 percent of the time
and saving travelers, on average, $50 per ticket.
Farecast is the epitome of a big-data company and
an example of where the world is headed. Etzioni couldn’t have
built the company five or ten years earlier. “It would have been
impossible,” he says. The amount of computing power and storage he
needed was too expensive. But although changes in technology have
been a critical factor making it possible, something more important
changed too, something subtle. There was a shift in mindset about
how data could be used.