(Part 2 of a two-part article.)
Google’s Analytics Evangelist Avinash Kaushik, in his book Web Analytics 2.0 (Sybex 2010), noted that Web Analytics today is “a framework that redefines what data means online. Web Analytics 2.0 is not simply about the clicks that you collect from your website…[but it is] about pouring your heart into understanding the impact and economic value of your website by doing rigorous outcomes analysis. It is about expressing your love for the principles of customer centricity by embracing voice-of-customer initiative and, my absolute favorite, learning to fail faster by leveraging the power of experimentation.” Today research institutions are taking his advice to heart.
Already well established among business and economics researchers throughout North America, analytics is leading to new types of assessments. At the Massachusetts Institute of Technology, an initiative dubbed the “Billion Prices Project” is measuring inflation using real-time data gleaned from online purchases tracked on retailers’ websites. A Google Consumer Price Index and other search engine tools offer similar retail detail. Real-time data is also allowing researchers to mine social media sites like Twitter for other leading economic indicators. The Twitter hashtag #NFPGuesses, for example, provides weekly aggregations of estimates of non-farm payroll fluctuations.
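Aggregating such crowd-sourced guesses into a consensus figure is simple in principle. The sketch below is purely illustrative; the guess values are invented, not real #NFPGuesses data:

```python
# Toy sketch of turning crowd guesses, like those posted under
# #NFPGuesses, into a weekly consensus estimate. Numbers are invented.
from statistics import mean, median

# hypothetical guesses for non-farm payroll change (thousands of jobs)
guesses = [180, 205, 195, 250, 190, 210, 175]

consensus_mean = mean(guesses)      # sensitive to outliers like 250
consensus_median = median(guesses)  # a more robust central estimate

print(f"mean guess:   {consensus_mean:.1f}k jobs")
print(f"median guess: {consensus_median:.0f}k jobs")
```

A real pipeline would first have to scrape and parse the tweets; the aggregation step itself is as simple as shown.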
Seattle-based online real estate service Zillow tracks home sales and mortgage lending, providing real-time data on the fast-moving housing and mortgage sectors. Job data is said to be readily available from Internet job searches and postings. George Washington University economics professor Tara Sinclair suggests that this data “could be used to predict employment for the following month.”
Data science itself is an emerging field. The UC Berkeley School of Information’s Master of Information and Data Science program graduated its first class in June 2015, marking another strong move toward analytical assessment in the public sector. The school notes that “The field of data science is emerging at the intersection of the fields of social science and statistics, information and computer science, and design. The UC Berkeley School of Information is ideally positioned to bring these disciplines together and to provide students with the research and professional skills to succeed in leading edge organizations.”
Libraries and other groups, such as Google Books Library Project, Internet Archive, and HathiTrust, are now digitizing analog content—printed maps, books, journals and other historic materials—that was created over centuries as well as integrating data from governmental sources, web logs, mobile devices, sensors, instruments, and transactions. This is allowing for the development of whole new types of scholarly methodology and research fields such as Digital Humanities, which hopes to glean new insights and findings using data science methods. Digital Humanities itself “is now backed on a growing number of campuses by a level of funding, infrastructure, and administrative commitments that would have been unthinkable even a decade ago,” and has been defined by UCLA’s program as a field that “interprets the cultural and social impact of new media and information technologies, as well as creates and applies these technologies to interrogate cultural, social, historical, and philological questions.”
Analytics are being applied in other areas as well. Historians are working to digitize known information on World War II in order to connect the dots on documented events during the era—26,500 and counting, so far—allowing scholars a new way to examine events, causation, and relationships not possible before the advent of Big Data. The foundation behind the effort, Envisioning History, is using geo-locating software from Palantir, based on methods first used by PayPal to track down fraud. Its goal is to “create a 4-dimensional visual framework for history with links to all relevant media.”
New Skills and Expertise Are Evolving
In order to extract useful social and economic value from big data, expertise is needed. iSchools by their very nature teach the “diverse areas of expertise relating to the study of information and its use by people and organizations,” as the University of Washington’s program describes itself. That school offers a Bachelor of Science in Informatics, and its Data Lab is described as “the nexus for research on Data Science and Analytics.” Moving big data tools and expertise into other professional areas will be essential in the future.
Big data is clearly too large to fit on a single computer or to manipulate with existing desktop tools. It is also more heterogeneous (and often messy, incomplete, and highly unstructured) than the highly curated data we are used to seeing. Working with big data sets inevitably raises issues of privacy, security, and ethics. However, academics are finding ways to integrate big data analytics into their research work.
In her Project Syndicate blog article, Tara Sinclair noted that, “properly used, new data sources have the potential to revolutionize economic forecasts. In the past, predictions have had to extrapolate from a few unreliable data points. In the age of big data, the challenge will lie in carefully filtering and analyzing large amounts of information. It will not be enough simply to gather data; in order to yield meaningful predictions, the data must be placed in an analytical framework.”
Social scientists’ use of data often overlaps with that of the sciences and business: “[Some collect] their own observations, whether opinion polls, surveys, interviews, or field studies; build models of human behavior; and conduct experiments in the laboratory or field. Other social scientists rely on records collected by others, such as economic indicators or demographic data from the census. Government and corporate records are often of interest, as are the mass media. A number of important data repositories exist, especially for large social surveys.”
In the humanities, by contrast, ‘data’ is often very different: “Humanities scholars rely most heavily on records, whether newspapers, photographs, letters, diaries, books, articles; records of birth, death, marriage; records found in churches, courts, schools, and colleges; or maps. Any record of human experience can be a data source to a humanities scholar. Many of those sources are public while others are private. Cultural records may be found in libraries, archives, museums, or government agencies, under a complex mix of access rules. Some records are embargoed for a century or more. Some may be viewable only on site, whether in print or digital form. Data sources for humanities scholarship are growing in number and in variety, especially as more records are digitized and made available to the public… It is the nature of the humanities that sources are reinterpreted continually; what is new is the necessity of making explicit decisions about what survives for migration to new systems and formats. Second is the implication for control of intellectual property. Generally speaking, humanities scholars have far less control over the intellectual property rights of their sources—these raw materials—than do scientists, whose data usually are original observations or specimens.” Access to this textual data is generally through text and data mining contracts—an area that is still evolving and can be difficult to establish.
Still, Digital Humanities as a field is in many ways more established than other areas. The journal Digital Humanities Quarterly, for example, has been in publication for ten years, defining the field as “a diverse and still emerging field that encompasses the practice of humanities research in and through information technology, and the exploration of how the humanities may evolve through their engagement with technology, media, and computational methods.”
Professional Schools Respond
Although the majority of graduate programs today are housed specifically in business/management schools, a number sit in departments of Professional Programs, Graduate Studies, or Arts & Sciences. The University of San Francisco’s Master of Science in Analytics, for example, is housed in its College of Arts & Sciences. Brandeis and Bowling Green State both offer their master’s programs in analytics through their Graduate colleges.
Johns Hopkins University has established a formal “Master of Science in Government Analytics [that] prepares students to become leaders in the data revolution. Students will develop expertise in analytical methods that are increasingly relied upon by government agencies, non-profit organizations and the private sector. Through the use of cutting-edge tools and skills, students will be able to address contemporary political, policy and governance challenges.” Given the huge stores of information available through governments and NGOs today, freely over the Web, it is no surprise that public affairs disciplines would work to establish best practices in this area. “All areas of politics and policy can benefit from a greater reliance on data,” notes the school’s Jill Rosen. “The new programs will prepare students to take on such challenges as predicting where crime is likely to occur, evaluating the effectiveness of new health care initiatives, identifying the best placement of a proposed public transportation route, and developing voter mobilization strategies.” Given the strong movement for Open Government Data, programs such as this are bound to grow.
For the past year, a special United Nations-sponsored Global Working Group (GWG) on Big Data for Official Statistics has met to move forward “the obligation of exploring the use of new data sources, such as Big Data, to meet the expectation of the society for enhanced products and improved and more efficient ways of working.” This group includes 18 member nations (including the U.S. and China) and a variety of NGOs, who all agree that “using Big Data for official statistics is an obligation for the statistical community based on the Fundamental Principle to meet the expectation of society for enhanced products and improved and more efficient ways of working.” Reports from its working groups are expected by the end of this year.
The European Union’s Open Data Portal, available to anyone over the web, provides a “single point of access to a growing range of data from the institutions and other bodies of the European Union (EU)” and now contains nearly 9,000 datasets going back to 2012 from all EU countries, searchable in 23 languages. The Data Portals site currently lists more than 400 open data sites from cities, states, countries, and NGOs. Japan has a Japanese-language open data site going back to 2013. The United Kingdom’s portal intends to increase transparency by “releasing public data to help people understand how government works and how policies are made. Some of this data is already available, but data.gov.uk brings it together in one searchable website.” Open Canada’s Open Data Portal provides access to nearly a quarter million datasets.
In February 2015, the Obama administration announced the appointment of D.J. Patil as the country’s first Chief Data Officer/Scientist and Deputy U.S. CTO for Data Policy. “While there is a rich history of companies using data to their competitive advantage, the disproportionate beneficiaries of big data and data science have been Internet technologies like social media, search, and e-commerce,” Patil noted. “Yet transformative uses of data in other spheres are just around the corner. Precision medicine and other forms of smarter health care delivery, individualized education, and the ‘Internet of Things’ (which refers to devices like cars or thermostats communicating with each other using embedded sensors linked through wired and wireless networks) are just a few of the ways in which innovative data science applications will transform our future.”
“In 2013,” Patil continued, “researchers estimated that there were about 4 zettabytes of data worldwide: That’s approximately the total volume of information that would be created if every person in the United States took a digital photo every second of every day for over four months! The vast majority of existing data has been generated in the past few years, and today’s explosive pace of data growth is set to continue. In this setting, data science—the ability to extract knowledge and insights from large and complex data sets—is fundamentally important….Given the importance this Administration has placed on data, along with the momentum that has been created, now is a unique time to establish a legacy of data supporting the public good.”
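Patil’s digital-photo illustration can be sanity-checked with simple arithmetic. The figures below for population and photo size are assumptions made here for the check, not numbers from his statement (roughly 316 million U.S. residents in 2013 and about 1 MB per photo):

```python
# Sanity check of the 4-zettabyte illustration quoted above.
# Assumed inputs: ~316 million people (2013 U.S. population)
# and ~1 MB per digital photo.

ZETTABYTE = 10**21          # bytes
population = 316_000_000    # approx. U.S. population in 2013
photo_bytes = 1_000_000     # assumed average photo size: 1 MB
seconds_per_day = 86_400

total_bytes = 4 * ZETTABYTE
photos_needed = total_bytes / photo_bytes       # photos to fill 4 ZB
seconds_needed = photos_needed / population     # one photo per person per second
days_needed = seconds_needed / seconds_per_day

print(f"Days of one photo per person per second: {days_needed:.0f}")
```

Under these assumptions the answer comes out to roughly 146 days, a little under five months, which is consistent with Patil’s “over four months.”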
The Challenges Ahead
Virtually every sector of the economy now has access to more data than would have been imaginable even a decade ago. Businesses today are accumulating new data at a rate that exceeds their capacity to extract value from it. The question facing every organization is how to use data effectively—not just their own data, but all of the data that’s available and relevant.
Big Data’s ability to combine the minute details of individual behavior (taken from clicks, views, purchases, movements, gaming, working, and so on) allows not only for more specific, targeted advertising and planning but also for profiteering from the plundering of individuals’ reputations, bank accounts, or other assets. Today, it’s hard to see this as a win-win situation, especially for consumers. The risks are all too real and often clearly evident from the front pages of newspapers. Examples include the hacking of the Ashley Madison website and the breach of the U.S. Office of Personnel Management, attributed by some reports to a Chinese source, which involved personal information from an estimated 20 million people.
Eric Siegel, author of Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die (Wiley, 2013) tells ATG that data security “is improving in mainstream business. Just how professional and how quickly it is becoming so is up for debate. Smaller or fringe companies are more likely to do things recklessly. Many organizations collected data with permission from the submitter to share or sell it with partners. It is common and not illegal, although often considered a ‘breach’ by some. A true ‘breach,’ however, is when an organization is hacked and the data is then used by criminals or posted publicly—as you know there are many such occurrences, most notably in recent times by the extramarital dating site Ashley Madison. Other incidents take place when a company releases consumer data that is meant to be ‘anonymized,’ but it then turns out to be possible to determine the individual to whom some records pertain—AOL and Netflix have endured such stories.”
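The re-identification risk Siegel describes in the AOL and Netflix incidents can be illustrated with a toy linkage attack: joining an “anonymized” release with a public record on shared quasi-identifiers such as ZIP code and birth year. Every record, field name, and value below is synthetic, invented purely for illustration:

```python
# Toy illustration of a linkage (re-identification) attack: an
# "anonymized" release is joined with a public list on shared
# quasi-identifiers. All records below are synthetic.

anonymized_release = [  # names removed, but quasi-identifiers kept
    {"zip": "55455", "birth_year": 1970, "diagnosis": "asthma"},
    {"zip": "55455", "birth_year": 1982, "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1970, "diagnosis": "flu"},
]

public_voter_list = [  # names appear alongside the same fields
    {"name": "A. Smith", "zip": "55455", "birth_year": 1970},
    {"name": "B. Jones", "zip": "02139", "birth_year": 1970},
]

def reidentify(release, public):
    """Match 'anonymous' records to names via quasi-identifiers."""
    matches = []
    for anon in release:
        hits = [p for p in public
                if (p["zip"], p["birth_year"])
                == (anon["zip"], anon["birth_year"])]
        if len(hits) == 1:  # a unique match re-identifies the record
            matches.append((hits[0]["name"], anon["diagnosis"]))
    return matches

print(reidentify(anonymized_release, public_voter_list))
```

Each unique (ZIP, birth year) pair links a name to a supposedly anonymous record, which is why removing names alone is rarely enough.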
The legal environment for big data needs to evolve to balance privacy, security, and the interests of the companies or organizations that gather and maintain this information. The EU’s Right to be Forgotten is perhaps just the beginning of efforts to force some type of accountability and discretion on the industry. Straightening out issues of data ownership is another area of concern for many today.
In their book Data Science for Business (O’Reilly, 2013), Foster Provost and Tom Fawcett note that “a study by researchers at the University of Toronto and MIT showed that after particularly stringent privacy protection was enacted in Europe, online advertising became significantly less effective. The tension between privacy and improving business decisions is intense because there seems to be a direct relationship between increased use of personal data and increased effectiveness of the associated business decisions.”
However, companies and other data-driven organizations have much to be concerned about as well. Big Data is very new, and many of the analytics are still being developed; today no one can assume infallibility. Mistakes do happen, whether the cause lies in the analysis or in fundamental assumptions. The existence of ‘solid data’ alone doesn’t guarantee that predictions, models, or conclusions are valid. We all have much to gain or lose as we move into this new era.
“Data collection is rampant right now,” Michele Ufford notes, “as seen in the abundance of ‘games’ and ‘quizzes’ on sites such as Facebook, which are often created with the sole goal of gaining useful demographic and preference information about users. In addition, the application of data and analytics is increasing at a rapid pace, and not all organizations give appropriate consideration to topics such as security and data privacy. In general, people should be aware of the data collected about them, take the time to review data collection and privacy policies, and carefully weigh the benefits and consequences of sharing that data. Permitting the collection of data or the sharing of that data is not necessarily bad, but it should be an informed decision.”
Famed theoretical physicist Geoffrey West has postulated that “the paradigm of physics—with its interplay of data, theory, and prediction—is the most powerful in science.” Today, clearly, others might argue that Big Data has opened every form of scholarly work to new levels of potential power and precision. However, we are best served by also remembering economist Milton Friedman’s caution that “the only relevant test of the validity of a hypothesis is comparison of prediction with experience.” As we move into the 21st century with these new opportunities and tools, it will be interesting to watch this new renaissance in research develop.
Nancy K. Herther is Librarian for American Studies, Anthropology, Asian American Studies & Sociology at the University of Minnesota, Twin Cities campus. firstname.lastname@example.org