by John D. McDonald (EBSCO Information Services)
and Michael Levine-Clark (Dean, University of Denver Libraries)
Libraries have always collected a wealth of data about how their users have engaged with the collections, services, and systems that they offer. While much of that information was collected for logistical or operational reasons, such as keeping track of who checked out a book to ensure its later return, the existence of that data often has the added benefit of enabling librarians to understand how users are, or are not, interacting with the library so that efforts can be made to optimize library collections and services. From the early days of bibliometric research up through famous data-driven studies such as the University of Pittsburgh book circulation study (Kent, 1979), librarians have leveraged data to understand their users as they progress through their interactions with the library. In turn, those data have helped libraries continually evolve to meet user needs.
Library analytics as a field has focused primarily on research into collections and how they are used. Studies on book circulation, eBook usage, online journal downloads, and demand-driven acquisition have proliferated over the course of the past two decades (Way 2011; Corlett-Rivera and Hackman 2014; McDonald 2006; Levine-Clark, McDonald, and Price 2014). In addition, librarians in public services have made strides in using data to analyze how users interact with information literacy programs and library instruction courses (Nackerud et al. 2015; Croxton and Moore 2018). Bibliometric studies have evolved from mostly examining how authors cite prior research into examining the usage of research artifacts by format type and access type, including the usage of open access literature (Eysenbach, 2006).
Parallel to the evolution of library analytics has been an emphasis in higher education on “learning analytics” over the past ten years (SoLAR, 2020), as campuses have begun to analyze all types of student interaction data to understand student performance and the institution’s success (or lack thereof) in meeting its educational mission. The Horizon Report has for the past three years listed analytics technologies as one of the most important developments in educational technology, and for the past two years has placed them in the category of “widespread adoption within one year or less” (Horizon Report 2017, 2018, 2019). Many campuses are keen to study the engagement of their students and faculty with campus services and to use that data to better ensure their educational success. Perhaps more importantly, funders at all levels are looking for hard data as evidence that the university is meeting its educational mission, often leaving future funding levels hanging in the balance.
At the library level, we may feel we’re a bit behind in our adoption of analytics technologies. While all libraries struggle to collect, organize, and preserve data about their users and systems and services, new products are consistently being developed that aim to make the process of data collection and analysis more efficient and more robust. They evolve to meet our needs, and also highlight what more needs to be done as we strive to use data as evidence in our decision-making processes.
Recently, there has been growing interest in linking library analytics to learning analytics in a structured fashion. Megan Oakleaf, in her role as the principal investigator for the Library Integration in Institutional Learning Analytics (LIILA) grant, has noted that it is imperative for libraries to become engaged in overall campus efforts around learning analytics, both to ensure they have a seat at the table when the evaluation of data occurs and to maintain an influential role in the development of governance and policy standards around data privacy, security, and ethics (Oakleaf, 2018). While some institutions have made strides in this direction, most notably the University of Minnesota and UNC Charlotte (Nackerud et al. 2015; Croxton and Moore 2018), the vast majority of libraries have not, and likely cannot without a stronger and more stable infrastructure that enables data collection and analysis.
In the summer of 2018, EBSCO approached the University of Denver (DU) about engaging in a proof of concept study to try to determine the feasibility of connecting library data, stored in separate systems, with campus level data representing student outcomes. The overall goal was to determine if it was possible to connect a range of datasets using a common identifier, and if that was possible, to determine the effect of library engagement on student learning outcomes. With successful answers to these questions, EBSCO would understand better the potential to build a robust data and analytics platform for the ingestion, storage, and analysis of data that could be brought to market to help libraries of all sizes and types to have an ongoing and near real-time library data repository with the potential to begin to connect that data to campus-level data of all types.
Librarians and campus administrators at the University of Denver provided files of data extracts from multiple systems that represented a variety of library engagement metrics and one student outcome metric. These came in multiple formats, including .xls, .csv, and .txt, and were often partitioned by month or week to work around the export limitations of the source systems. The data were scrubbed of any personally identifiable information (PII). EBSCO inspected the more than one hundred files and merged them for analysis. It is important to note that connecting to APIs was not in scope for this proof of concept, so this project did not determine whether data acquisition via APIs is possible.
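The core requirements described above, reading partitioned extracts, removing PII, and preserving a join key common to all systems, can be illustrated with a short sketch. This is a minimal illustration, not the actual EBSCO pipeline: the column names, salt, and monthly partitions are hypothetical, and a one-way salted hash stands in for whatever de-identification method was actually used.

```python
import hashlib
import pandas as pd

def pseudonymize(user_id: str, salt: str = "campus-secret") -> str:
    """Replace a raw campus ID with a one-way salted hash so that records
    from different systems can still be joined without exposing PII."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()[:16]

# Hypothetical monthly partitions; the real extracts arrived as
# .xls, .csv, and .txt files, often split by month or week.
frames = []
for month in ("2018-09", "2018-10"):
    df = pd.DataFrame({"user_id": ["A123", "B456"], "loans": [2, 1]})
    df["user_id"] = df["user_id"].map(pseudonymize)  # scrub PII
    df["month"] = month
    frames.append(df)

# Merge the partitions into a single table for analysis.
merged = pd.concat(frames, ignore_index=True)
```

Because the hash is deterministic, the same user produces the same pseudonym in every file, so datasets can still be linked on the common identifier after scrubbing.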
The first dataset analyzed was DU’s EZProxy authentication logs, representing use of online library resources. It is important to note that DU uses EZProxy to authenticate all users into online library resources, whether on the local network or not, so all online resource usage was captured. This is certainly not true for all customers using EZProxy, many (or most) of whom require authentication only for off-campus access. The EZProxy logs contained a unique user ID, common across all datasets, which made it possible to associate sessions with a particular user. The second dataset was an extract from DU’s Alma ILS (integrated library system), which provided two sets of metrics: partial patron demographic information and book circulations by patron. The final two datasets were a record of librarian consultations by student for the two quarters of instruction, which was maintained manually by librarians and stored in an offline Excel spreadsheet, and an extract from the campus’ Banner Student Information System (SIS), from which the following fields were available for analysis: DOB, Gender, Race, Program (Major), and End of Term cumulative GPA.
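Extracting per-user activity from EZProxy logs can be sketched as below. EZProxy log layouts vary by configuration, so the Apache-style line format and the position of the user ID field are assumptions; counting log entries per user is also only a rough proxy for sessions, which would normally be delimited by session IDs or time gaps.

```python
import re
from collections import Counter

# Assumed Apache-style EZProxy log layout: ip, ident, user ID, timestamp, request.
# Real deployments vary; the user ID position here is an assumption.
LOG_LINE = re.compile(r'^(\S+) \S+ (\S+) \[([^\]]+)\]')

sample_logs = [
    '10.0.0.1 - hash1 [01/Oct/2018:09:15:02 -0600] "GET /login HTTP/1.1" 200 512',
    '10.0.0.2 - hash2 [01/Oct/2018:09:16:40 -0600] "GET /db HTTP/1.1" 200 2048',
    '10.0.0.1 - hash1 [01/Oct/2018:10:02:11 -0600] "GET /db HTTP/1.1" 200 1024',
]

# Tally log entries per (de-identified) user ID as a simple activity proxy.
entries_per_user = Counter()
for line in sample_logs:
    m = LOG_LINE.match(line)
    if m:
        entries_per_user[m.group(2)] += 1
```

Because the logged ID is the same identifier used in the other extracts, these tallies can later be joined to circulation, consultation, and outcome data.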
Other data provided included the full bibliographic data from Alma, all online holdings from Alma, and COUNTER usage reports. These were not included in the analysis, given the limited scope and objective of the proof of concept, but could be analyzed in a full production analytics system. In addition, three EBSCO-supplied datasets were considered but excluded as out of scope: EBSCOhost database usage metrics, GOBI book bibliographic and acquisitions information, and EBSCONET serials bibliographic and acquisitions data for DU.
We ingested data via Alteryx with two outputs in mind: a combined usage table and a user stats table. The combined usage table was designed to provide a complete, high-level picture of library usage during the period, while the user stats table was designed to allow analysis of the relationship between library use and student outcomes. We used Tableau to visualize both datasets and produced three comprehensive dashboards to better understand the relationship between library engagement and student success.
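The shape of these two outputs can be sketched in a few lines. The actual work was done in Alteryx; this pandas equivalent uses hypothetical toy data and column names, but shows the same pattern: stack event-level records into one combined usage table, then aggregate per user and join the outcome metric to form the user stats table.

```python
import pandas as pd

# Hypothetical event-level extracts keyed by a shared, de-identified user ID.
loans = pd.DataFrame({"user_id": ["u1", "u2"], "date": ["2018-10-01", "2018-10-03"]})
sessions = pd.DataFrame({"user_id": ["u1", "u1", "u3"],
                         "date": ["2018-10-01", "2018-10-02", "2018-10-02"]})
gpa = pd.DataFrame({"user_id": ["u1", "u2", "u3"], "gpa": [3.6, 3.1, 2.9]})

# Combined usage table: one row per event, tagged by metric type.
loans["metric"], sessions["metric"] = "loan", "session"
combined_usage = pd.concat([loans, sessions], ignore_index=True)

# User stats table: per-user counts of each metric joined to the outcome.
user_stats = (
    combined_usage.groupby(["user_id", "metric"]).size().unstack(fill_value=0)
    .reindex(columns=["loan", "session"], fill_value=0)
    .reset_index()
    .merge(gpa, on="user_id", how="right")
    .fillna(0)
)
```

The right join keeps students who have an outcome but no recorded library activity, which matters when comparing users with and without engagement.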
The combined usage dashboard provided a data discovery experience and a descriptive analysis of library use, merging usage data from the ILS (loans), EZProxy logs (sessions), and consultations (appointments) into a single source. Of the 10,965 distinct users during the period, ninety-eight percent used online resources, twelve percent had a loan for a physical item, and five percent had an appointment with a librarian. Usage over time showed the typical seasonality of an academic institution: a lull over holiday breaks and increases in the last two weeks of the winter quarter, with graduate students, Ph.D. students, and faculty making up seventy percent of loans. An interesting side note: a single user account made up one third (52,000) of all EZProxy sessions during the fall quarter, indicating likely unauthorized use of DU credentials (i.e., hacking). Since no method of real-time monitoring is available for EZProxy, an analytics platform such as this proof of concept could have alerted librarians to the compromise. (See Figure 1: Combined Usage Dashboard.)
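A platform with ongoing data ingestion could surface an anomaly like the 52,000-session account automatically. The sketch below flags any account responsible for an outsized share of total sessions; the counts and the five percent threshold are hypothetical, chosen only to illustrate the idea.

```python
from collections import Counter

# Hypothetical per-user session counts; one account dwarfs the rest,
# mirroring the compromised-credential pattern described above.
session_counts = Counter({"u1": 52000, "u2": 140, "u3": 95, "u4": 80})
total = sum(session_counts.values())

# Flag any single account responsible for more than 5% of all sessions.
suspicious = {user for user, n in session_counts.items() if n / total > 0.05}
```

In a production system this check would run on each ingestion cycle and trigger an alert, rather than being discovered after the fact in a quarterly review.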
The usage versus outcomes dashboard (Figure 2) allowed us to determine that a statistically significant relationship exists between library use and student outcomes. This analysis demonstrates the potential value of combining library usage data with student outcomes; a full production system would allow librarians to track this relationship over time and add in additional data to build a more robust model of library engagement and student success. For undergraduates, there was a statistically significant relationship between GPA and number of sessions for both quarters (p < 0.0001): students with more online resource sessions tended to have better grades. The same was not true for book circulation or librarian appointments, likely because of the lower number of observations, reflecting less engagement with the library for these two metrics over this short time period. The usage versus outcomes (variance) dashboard indicated that for undergraduates there was a statistically significant relationship between GPA variance and metric variance for EZProxy sessions (p = 0.045), evidence to reject the null hypothesis; the same was not true for loans or appointments (p > 0.05). A chi-square test for independence of the categorical variables showed weak evidence of dependence (p = 0.048). (See Figure 2: Usage vs. Outcomes Dashboard and Figure 3: Usage vs. Outcomes Variance Dashboard.)
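The two kinds of tests described above can be sketched with standard tools. The data below are invented for illustration and do not reproduce the study's results: a linear regression relates per-student session counts to GPA, and a chi-square test checks whether a categorical usage level is independent of a GPA band.

```python
from scipy import stats

# Hypothetical per-student data: session counts and end-of-term GPA.
sessions = [2, 5, 8, 12, 15, 20, 25, 30]
gpas = [2.6, 2.8, 3.0, 3.1, 3.3, 3.4, 3.6, 3.8]

# Linear relationship between session counts and GPA (slope, r, p-value).
result = stats.linregress(sessions, gpas)

# Chi-square test of independence on a hypothetical 2x2 contingency
# table: usage level (rows) vs. GPA below/above 3.0 (columns).
table = [[30, 20],   # low-usage students
         [15, 35]]   # high-usage students
chi2, p, dof, expected = stats.chi2_contingency(table)
```

A small p-value in either test argues against the null hypothesis of no relationship, which is the same logic behind the significance claims reported for the dashboards.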
Obviously, determining which library usage is directly related to course work was not possible in this analysis, and a full answer to “What library usage is relevant to student outcomes?” remains an open question. Additional data loaded into the system, such as metrics on building usage, study room reservations, wifi access logs, information literacy/instruction engagement, and other typical library metrics, could help librarians build a more robust model. While there is weak evidence for dependence, more research is required, with more variables, more observations, and a stronger model, to build a compelling case for library engagement and student success. What this proof of concept did prove is that, under the right conditions and with the future prospect of automated data harvesting and ingestion, a full production analytics platform can provide real-time (or near real-time) analysis of user behavior, library usage, and success metrics, however a library or institution wishes to define that success.
Corlett-Rivera, K. and Hackman, T. “Ebook Use and Attitudes in the Humanities, Social Sciences, and Education.” portal: Libraries and the Academy, 14:2 (April 2014): 255-286. DOI: 10.1353/pla.2014.0008
Croxton, R. and Moore, A. C. (2018). Quantifying the Value of the Academic Library: University of North Carolina at Charlotte. Proceedings of the 2018 Library Assessment Conference, Houston, Texas. Edited by Baughman et al. Available: https://www.libraryassessment.org/wp-content/uploads/2019/10/Proceedings-2018-rs.pdf.
Eysenbach, G. (2006). Citation Advantage of Open Access Articles. PLOS Biology, 4(5).
Horizon Report. (2019). https://library.educause.edu/resources/2019/4/2019-horizon-report
Horizon Report. (2018). https://library.educause.edu/resources/2018/8/2018-nmc-horizon-report
Horizon Report. (2017). https://library.educause.edu/resources/2017/2/2017-horizon-report
Kent, Allen, et al. (1979). Use of Library Materials: The University of Pittsburgh Study. New York: M. Dekker.
Levine-Clark, Michael, John McDonald, and Jason Price (2014). The Effect of Discovery Systems on Online Journal Usage: A Longitudinal Study. Available: https://digitalcommons.du.edu/libraries_facpub/2/.
McDonald, John D. (2006). Understanding Online Journal Usage: A Statistical Analysis of Citation and Use. Journal of the American Society for Information Science and Technology, 58 (1). pp. 39-50. ISSN 1532-2882. Available: https://authors.library.caltech.edu/25916/.
Nackerud, S., Fransen, J., Peterson, K., and Mastel, K. L. (2015). Retention, Student Success, and Academic Engagement: A University of Minnesota Case Study. In B. Showers (Ed.), Library analytics and metrics: Using data to drive decisions and services (pp. 58-66). London: Facet Publishing. Available: https://experts.umn.edu/en/publications/retention-student-success-and-academic-engagement-a-university-of.
Oakleaf, M. (2018). Library Integration in Institutional Learning Analytics. Available: https://er.educause.edu/-/media/files/library/2018/11/liila.pdf.
Society for Learning Analytics Research (SoLAR). (2020). What is learning analytics? Available: https://www.solaresearch.org/about/what-is-learning-analytics/.
Way, Doug. (2011) Patron-Driven Acquisitions: Transforming Library Collections in the Virtual Environment. Available: https://scholarworks.gvsu.edu/library_presentations/23/.