Big Data Science: Establishing Data-Driven Institutions through Advanced Analytics

Data analytics can drive decision-making, but to optimize those decisions, stakeholders must couple effective methods with a shared understanding of both the domain and the institutional goals.

From improving student success to forming optimal strategies that can maximize corporate and foundation relationships, data analytics is now higher education's divining rod. Faculty and administration alike make daily decisions that impact the future of our institutions and our students. Departments establish curricula; labs invest in new technologies; we admit students, hire faculty, monitor meal plans, and define security protocols. How can we optimize those decisions over the coming years? How do we know if we are meeting our goals? Can we use our data to make better decisions?

I think the answer is yes, but only if we couple the use of state-of-the-art analytical methods with a focused approach to how and when we engage our data to make decisions. Our data strategy must reflect not only our institutional goals, but also the novel ways in which we can now collect and analyze data to attain those goals. Part of my role as a data scientist at Cornell University is to help guide this strategy by establishing a common understanding of, and vocabulary around, the data-driven decision-making process.

A Team Effort

Simply hiring a data scientist does not create a data-driven organization. Identifying and realizing relevant and measurable goals through a well-thought-out data strategy does, and this requires collaboration. It is essential that data scientists partner with four types of stakeholders:

  • Visionaries. These are the leaders with a vision of our organization's future. They can identify the areas in which informed decision-making would have the greatest impact on achieving our institutional goals. Bottom line: they know what our "big questions" really should be.
  • Subject experts. Members of our community who deeply understand the area chosen for analysis. We rely on them to identify which variables are important. Subject experts can help guide the analysis because they understand the types of change that are truly possible. If we offer a "solution" that cannot be implemented, it is the wrong solution.
  • Data experts and archivists. These individuals know where the relevant data are stored and how they can be accessed. This group also includes experts on data quality and how the data have been collected.
  • Technology experts. Setting up a secure data ecosystem requires substantial computer expertise and resources. Many data scientists do not have this expertise and need support from those who do.

While this list is not exhaustive, it makes a key point clear: data-driven decision-making is a team effort. To have the desired impact, data solutions in higher education depend on the collective knowledge of visionaries and experts in subject matter, technology, data, and data science who work collaboratively to ensure that our goals are well defined and that our approach is practical. To this end, all engaged team members must have a common understanding of our framing themes, terms, and processes.

The Words of the Trade

As the world of data science has expanded, new language has emerged to describe (and sell) the tools of the trade. Although such new terminology should clarify what data science can offer, stakeholders often lack a common understanding of what a given term means, which leads to confusion and misaligned expectations. To address this, Cornell's central IT support organization has established a common vocabulary across the university with respect to our advanced data analytics initiative.

Data Scientist

Glassdoor recently ranked data scientist as the best job in the United States,1 yet there is much debate about the skill set and role of these individuals. Minimally, a data scientist acts as a traditional business intelligence analyst. At the other end of the spectrum, a data scientist's role encompasses big data wrangling and munging, machine learning and artificial intelligence, advanced statistical analysis, and dynamic data visualization—all executed within big data ecosystems. We embrace this broader definition of a data scientist, while also acknowledging that few data scientists actually possess this breadth of expertise. Indeed, in many organizations, data analytics needs may be best met by forming well-coordinated teams.

Big Data Ecosystems

Big data ecosystems are platforms that allow users to import, store, and efficiently process and analyze large volumes of data. Such an ecosystem comprises tools for

  • transforming and storing large amounts of data in parallel;
  • loading data into the environment, either as a batch process or in real time;
  • real-time querying;
  • applying machine learning algorithms at scale; and
  • coordinating and monitoring the ecosystem as a whole.

In theory, a big data ecosystem can be completely open source; however, setting up a secure and operational open-source ecosystem may be outside a data scientist's expertise. Fortunately, commercial versions of big data ecosystems are now available that offer quick setup and user-friendly interfaces. Examples include Cloudera, Hortonworks [https://hortonworks.com/], and Databricks; Google and Amazon also offer equivalent service suites.2
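
To make these components concrete, the following minimal sketch uses PySpark (the Python interface to Apache Spark, a common foundation for such ecosystems). The file path, table, and column names are hypothetical; this is an illustration of the pattern, not a recommended configuration.

    # Batch-load a large dataset, query it, and fit a model at scale.
    from pyspark.sql import SparkSession
    from pyspark.ml.clustering import KMeans
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("ecosystem-sketch").getOrCreate()

    # Loading: read a (possibly very large) CSV in parallel across the cluster.
    df = spark.read.csv("hdfs:///data/enrollments.csv", header=True, inferSchema=True)

    # Querying: Spark SQL runs the same query whether the data fit on a
    # laptop or span hundreds of machines.
    df.createOrReplaceTempView("enrollments")
    spark.sql("SELECT college, COUNT(*) AS n FROM enrollments GROUP BY college").show()

    # Machine learning at scale: cluster records on two numeric columns.
    vectors = VectorAssembler(inputCols=["credits", "gpa"], outputCol="features")
    model = KMeans(k=3, seed=1).fit(vectors.transform(df))

    spark.stop()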

Big Data

Big data descriptions often outline the three V's:

  • Volume refers to the size of the data. What's big? Generally, a big data ecosystem is necessary when you have petabytes of data, but such a system can also be useful in processing smaller datasets, especially when the analysis is complex or computationally intensive.
  • Velocity refers to how fast the data arrive. Machine telemetry often has high velocity; it may be continuously created and may need to be processed and analyzed in real time.
  • Variety refers to the different forms in which the raw data are stored; these forms are often referred to as the data structures. Datasets with a variety of structures include text documents, videos, pictures, logs, and other nontraditional data sources.

At Cornell, we consider data "big" if it can be characterized by at least one of the three V's.

Structured Versus Unstructured Data

Data are often traditionally classified as being in one of two formats:

  • Structured data are in the familiar table format, in which every row is an observation and every column is a well-defined categorical or continuous variable.
  • Unstructured data are any data that require either a data scientist to define how the data should be structured, or a data engineer to transform the data into a format that can be easily used for analysis.3

Common examples of unstructured data include pictures, videos, and documents, all of which have the following attributes:

  • A structure must be imposed on the data for them to be informative.
  • The best choice of that structure is selected from many options.
  • If the data are used for a different analysis, the choice of structure may change.
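
For example, the same unstructured text can be given at least two different structures, depending on the analysis at hand. A minimal sketch in Python, using scikit-learn (the documents are hypothetical):

    # Two different structures imposed on the same unstructured text.
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["The course was excellent", "The lab equipment needs repair"]

    # Structure 1: a word-count table (one row per document, one column
    # per word), useful for topic or sentiment analysis.
    counts = CountVectorizer().fit_transform(docs)
    print(counts.toarray())

    # Structure 2: a single length variable per document, which may be
    # all a response-rate analysis needs.
    print([len(d.split()) for d in docs])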

More recently, a third classification has been introduced: Semi-structured data are data that are not tabular, but could be easily put in table format without loss of information. An example here is log data. The information leveraged from these data does not change between analyses, but the raw data undergo a one-time standard transformation prior to any analysis.
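
Because the transformation is standard, it can be written once and reused. A sketch of such a one-time transformation in Python (the log format is hypothetical):

    # Turn semi-structured log lines into a table, without loss of information.
    import re
    import pandas as pd

    log_lines = [
        "2019-03-01 09:15:02 login user=ab123",
        "2019-03-01 09:16:47 logout user=ab123",
    ]

    pattern = re.compile(r"(\S+) (\S+) (\S+) user=(\S+)")
    rows = [m.groups() for m in map(pattern.match, log_lines) if m]
    print(pd.DataFrame(rows, columns=["date", "time", "event", "user"]))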

Data Storage

Around campus, databases are commonly described as data warehouses, data marts, or data lakes. The terms warehouse and mart are often used interchangeably, although (by strict definition) there are subtle differences in the size or scope of each of these database types. In our view, however, whether you call it a data warehouse or data mart, you are referring to a place in which structured data are stored. At Cornell, we use the term data lake only to refer to a digital repository that contains unstructured data or a combination of structured and unstructured data.

The Data-Driven Decision-Making Process

Establishing clear goals is a crucial first step in any proposed analysis. At the Open Data Science Conference in San Francisco this past fall, Google's Chief Decision Scientist Cassie Kozyrkov asserted that there is a difference between an organization that does data science and a data-driven organization; while many institutions are investing in data science, most fail to be truly data driven.

In her talk, Kozyrkov proposed that, before an organization performs any analysis, it should address three key questions:

  • What would the decision makers choose to do without any information at all? In other words, what is the default position?
  • How important is this decision for the institution? A disciplined, carefully considered analysis requires a significant investment in both time and money. If the decision is not important, it is pragmatic and economical to forego (or cease) analysis.
  • What must the outcome of analysis be for the decision makers to choose an alternative action? Often, we are so attached to our default action that no amount of data science will change our decision. If this is the case, why bother to do the analysis at all?

Assuming the analysis is worth doing, the decision makers and data scientists together must agree on:

  • The goal of the analysis
  • How information gleaned from it will be used to make an informed decision

For example, "We want our students to be more successful" is not a well-defined goal. If an incoming freshman graduates in four years, is that success? Is success defined by the student's GPA? Or should we define success as income after graduation? These clarifying questions all embody a crucial element: they explicitly define what is being measured in such a way that the question can be explored using both machine learning algorithms and statistical analysis.
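
Once the definition is settled, it becomes a concrete, computable outcome. For instance, if the team agrees that success means graduating within four years, a sketch in Python (the records and column names are hypothetical):

    # "Student success" as a measurable, binary outcome.
    import pandas as pd

    students = pd.DataFrame({
        "entry_year": [2012, 2012, 2013],
        "grad_year": [2016, 2018, 2017],
    })

    # A precise definition that machine learning and statistical
    # analysis can both work with.
    students["success"] = (students["grad_year"] - students["entry_year"]) <= 4
    print(students)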

The Analysis: Machine Learning and Statistical Inference

At their core, supervised and unsupervised machine learning and statistical analysis are simply sets of algorithms used to extract useful information from data. While you can expect your data scientist to choose which algorithm to use, everyone on the team should have a basic understanding of what these algorithms do.

Both supervised machine learning algorithms and traditional statistical inference depend on historical data for either:

  • prediction—accurately estimating future outcomes; or
  • estimation—determining which variables are related to the outcome, and how and to what degree.

For a given question, decision makers may be interested in prediction, estimation, or both; in any case, this interest must be established prior to the analysis.
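
The distinction shows up directly in code. In the sketch below, the same simulated historical data are used once for prediction and once for estimation (the variables are invented for illustration):

    # Prediction versus estimation on the same historical data.
    import numpy as np
    import statsmodels.api as sm
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    hours = rng.uniform(0, 20, 200)                             # weekly study hours
    passed = ((hours + rng.normal(0, 4, 200)) > 8).astype(int)  # known outcomes

    # Prediction: estimate future outcomes as accurately as possible.
    clf = LogisticRegression().fit(hours.reshape(-1, 1), passed)
    print(clf.predict([[12.0]]))  # predicted outcome for a new student

    # Estimation: how, and how strongly, is the variable related to the outcome?
    fit = sm.Logit(passed, sm.add_constant(hours)).fit()
    print(fit.params, fit.pvalues)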

Quantitative and Qualitative Measurement

For analysis, both the outcomes and variables must be measurable. Specifically, both the response and the variables must be either a quantitative or a qualitative measure.

  • A quantitative measure is a number, such as age, income, or GPA.
  • A qualitative measure is a group indicator, such as race, gender, or any binary variable.

With unstructured data, it is not initially evident how to put your variables and response into either measurable format. Your data scientist and subject expert will need to work together to decide how quantitative and qualitative information is best extracted from the raw data for analysis.
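
In practice, qualitative measures usually enter a model as indicator (0/1) columns, one per group. A sketch with pandas (the values are hypothetical):

    # A quantitative measure (GPA) alongside a qualitative measure (college).
    import pandas as pd

    df = pd.DataFrame({
        "gpa": [3.2, 3.8, 2.9],
        "college": ["Arts", "Engineering", "Arts"],
    })

    # Encode the qualitative variable as one indicator column per group.
    print(pd.get_dummies(df, columns=["college"]))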

Supervised and Unsupervised Learning

Supervised machine learning algorithms fit models that, by some criteria, minimize the difference between outcomes estimated under the model and observed outcomes. These algorithms are supervised in that they are optimized using a dataset that includes known outcomes.

Unsupervised machine learning algorithms are designed to find natural groupings of your data. The underlying assumption is that your data belong to several different groups, but the group label (the outcome) is not in your dataset.

Unsupervised machine learning is often used in the context of visual AI. For example, suppose that oceanographers are interested in identifying flora and fauna in the sea at a depth that is impractical to explore directly, so they instead send down a robot to take thousands of pictures. They can then use an unsupervised learning algorithm to efficiently group similar pictures together, with each grouping (ideally) representing a particular species of plant or animal.
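
A sketch of that grouping step in Python, assuming each picture has already been reduced to a numeric feature vector (the features below are random stand-ins):

    # Unsupervised grouping: no outcome labels are supplied.
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    features = rng.normal(size=(1000, 64))  # stand-ins for 1,000 image feature vectors

    # The algorithm finds five natural groupings on its own.
    groups = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(features)
    print(groups[:10])  # cluster labels assigned to the first ten pictures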

Special Considerations

In industry, big data analytics serve the bottom line: models are built to cut costs and maximize profits, for the sole benefit of the company and its shareholders, with little to no regard for the customer base. In higher education, the paradigm is reversed. Every decision we make must balance the good of the institution with the good of the student, weighting the second more heavily than the first. This distinction drives a more conscientious approach to data collection, use, and analysis.

A Focus on Relationships

In higher education, important questions are often relational. Does a new teaching method increase class averages? Does the improvement depend on gender or race? How much of an improvement can we expect to see? We can never know the answers to these questions with absolute certainty, and to say anything with confidence requires a disciplined statistical analysis.
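
For the first question, such a disciplined analysis might begin with a simple two-sample test; a sketch in Python with simulated scores (the effect size is invented):

    # Do scores under a new teaching method differ from the old one?
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    old_scores = rng.normal(75, 10, 120)  # simulated class scores, old method
    new_scores = rng.normal(78, 10, 115)  # simulated class scores, new method

    # Welch's two-sample t-test: the p-value quantifies how confidently
    # we can claim the difference is not due to chance.
    t, p = stats.ttest_ind(new_scores, old_scores, equal_var=False)
    print(f"t = {t:.2f}, p = {p:.4f}")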

Yet even disciplined analyses can lead to poor decision-making. So, while advanced analytics are a powerful tool, it's often prudent to consider the following questions:

  • How unexpected was the result? If the analysis radically alters your current course of action, it should be closely scrutinized. Results that are "too good to be true" generally are.
  • Is there an alternative explanation to the implied relationship? While we shouldn't ignore strong associations, causal relationships are much harder to establish. Before making a decision, consider alternative explanations for the relationships you are seeing in the data.
  • What other impact might this decision have? Decisions are not made in isolation. For example, a new teaching method might dramatically improve student engagement and creativity, yet fail in teaching key fundamentals.

Privacy and the Need for Governance

Technological advances have made it possible to capture and analyze massive amounts of data in near real time, but we must balance the potential benefit of the analysis with the individual's right to privacy.

Although institutional governance and laws such as the Family Educational Rights and Privacy Act (FERPA), Health Insurance Portability and Accountability Act (HIPAA), and General Data Protection Regulation (GDPR) [https://eugdpr.org/] can serve as our guideposts, we must also uphold our institutional standards for protecting the privacy of our community members with technical vigilance. Among the key questions to ask are the following:

  • Are the data really de-identified?4
  • Can identities be determined from multiple de-identified datasets?
  • Are the data encrypted at rest?
  • Do all of those who have access to the data understand the obligations under governance and the law?
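
The first two questions can be probed empirically. One common check is whether combinations of quasi-identifiers (such as zip code, birth year, and gender) are unique, and therefore potentially re-identifiable even without names; a sketch in Python (the columns are hypothetical):

    # How many records are unique on their quasi-identifiers?
    import pandas as pd

    df = pd.DataFrame({
        "zip": ["14850", "14850", "14853"],
        "birth_year": [1999, 1999, 2000],
        "gender": ["F", "F", "M"],
    })

    # A record whose quasi-identifier combination appears only once is a
    # candidate for re-identification, even in a "de-identified" dataset.
    sizes = df.groupby(["zip", "birth_year", "gender"])["zip"].transform("size")
    print((sizes == 1).sum(), "potentially re-identifiable records")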

Conclusion

Big data science is gaining purchase in higher education, and our diverse institutions provide an exceptionally fertile ground for impactful data-driven decision-making. We are not corporations; we are small, vibrant communities that make decisions every day regarding critical issues such as safety, facilities management, risk management, housing, recruitment, admissions, research support, academic freedom, instruction, campus life, alumni relations, athletics, career services, support services, and healthcare. Each of these components creates independent data stores that, when analyzed collectively, can offer valuable insights for the institution as a whole.

To realize this potential, however, requires that the entire community of decision makers, data and subject experts, technological experts, and analysts work collaboratively and communicate effectively. All too often, we find team members hesitant to admit that they don't understand a particular topic or the potential value of advanced analytics.

A common understanding of the terms used and the role of data analytics within your organization will provide a solid foundation for establishing an advanced analytics strategy. Such a strategy can help our institutions move from simply "doing data science" to becoming institutions that are truly data-driven.

Notes

  1. Paul Schrodt, "The 50 Best Jobs in America—and How Much They Pay," Money (January 24, 2018).
  2. Jeff Kelly (originating author), "Big Data: Hadoop, Business Analytics and Beyond," Wikibon (last updated February 5, 2014).
  3. Unstructured data should not be confused with uninformative data. Data are uninformative if they are either completely unrelated to the purpose of the analysis or they are related to its purpose, but do not reflect the population you intended to study. Uninformative data can be structured or unstructured. In contrast, unstructured data can be highly informative, but the type and level of information extracted might vary from analysis to analysis.
  4. Scott Berinato, "There's No Such Thing as Anonymous Data," Harvard Business Review (February 9, 2015).

Cecilia Earls is a Data Scientist in Information Technologies at Cornell University.

© 2019 Cecilia Earls.