by Gail McMillan  (Professor, University Libraries, Virginia Tech) 

Author’s Note:  Gail McMillan is a professor on the faculty of Virginia Tech Libraries and Director of Scholarly Communication.  Correspondence concerning this article should be emailed to <[email protected]>. — GM


The IR gives the university both a digital library and a showcase, so the IR should accurately reflect its home institution.  Assessing IRs from the perspective of their resources, however, is an as yet unused frame of reference.  The goal of this initial study was to investigate whether the IR represents the scholarship and activities of its home institution by comparing a microcosm of the IR to the same microcosm at the institution.  The IR can be correlated with the university by using a controlled vocabulary to search each source and comparing the percentage of hits.  This study looked at VTechWorks, the IR at Virginia Tech, as a whole and through three lenses: graduate students’ ETDs, the faculty’s scholarly publications, and the academic units’ web-accessible publications.  Using the LGBTQ microcosm, the percentage of hits for a controlled vocabulary showed a good correlation, demonstrating that the IR is representative of the university for this microcosm.  Can we extrapolate and say this IR accurately represents its university?

Does the Repository Reflect the Institution?

Since institutional repositories (IRs)1 have been in use for about 20 years, it’s time to address how well they reflect their home institutions.  Within the wealth of articles about IRs, there is little attempt to assess the relationship of IR content to the scholarship and activities of its institution as an indicator of the value of the IR.  We do not know if IRs have attained Clifford Lynch’s vision of hosting “the intellectual works of faculty and students — both research and teaching materials and also documentation of the activities of the institution itself in the form of records of events and performance and of the ongoing intellectual life of the institution.” (2003, p.2)  Assessing IR content now is also appropriate in light of the COAR (Confederation of Open Access Repositories, 2017, p.4) recommendation that “The next generation repository… is resource-centric, making resources the focus of its services and infrastructure.”

Among my responsibilities at Virginia Tech Libraries, I oversee the IR, VTechWorks, established in 2012.  VTechWorks had about 70,000 items at the time of this study (April/May 2019).  About 96% of those items were publicly available and about 85% were textual.  Members of the university community largely created these works, but about 10% were created by others about Virginia Tech (e.g., Condolence Archives) or related to university interests (e.g., New River Symposium).  VTechWorks is highly focused on research and scholarship, but also hosts academic unit publications, governance and historical documents, etc.  Most items were born digital, but many items have been scanned and OCR’d.

IRs have not been developed like library collections by subject experts.  IRs are not dependent on money to purchase items, but on people’s time to locate, deposit, and describe  items.  VTechWorks has been populated in a variety of ways.  For example, some faculty voluntarily deposit directly or through integrated systems (e.g., Elements).  Mandatory ETD (electronic theses and dissertations) deposits come through the online graduate school system, and some courses require students to deposit final projects.  VTechWorks staff deposit through casual (e.g., reading VT news) and organized systems (e.g., OA Subvention Fund and SWORD protocol-captured articles).

To determine whether VTechWorks is representative of Virginia Tech, I chose to study a microcosm2 of VTechWorks, anticipating that it might encapsulate the characteristics of the repository as a whole.  The microcosm I chose to study was influenced in part by articles and presentations I read and heard that addressed questions about diversity within the academy.  At the 2017 CNI (Coalition for Networked Information) fall membership meeting, Amanda Rust, from Northeastern University Library, presented “Design for Diversity,” a grant-funded project that focused on ways in which information systems embody and reinforce cultural norms (e.g., data models that enforce strict gender binaries) and addressed designing systems that account for diverse cultural materials.

In “The Hubris of Neutrality in Archives,” Sam Winn (2017, p.2), at Virginia Tech Libraries, made several salient points, including “Archivists contribute to the omission or erasure of historically marginalized groups in the archives.”  Winn also noted that a “radically inclusive historical record” will not happen by accident.

Rebekah Scoggins (2018), a librarian at Leander University, authored “Broadening Your Library’s Collection: Implementing a LGBTQIA Collection Development Project.”  She determined that her library was not meeting the needs of its users because the LGBTQIA collection was out-of-date and incomplete.  It struck me as a well-aimed study, but one that was limited because it considered only the traditional library collection (that is, purchased books, serials, multimedia, etc.) and not the content of the IR.3

Because of these works and the dearth of articles about IR content assessment, I chose to conduct a study that might also help me learn whether VTechWorks was contributing to the omission of works of marginalized people or providing an inclusive record.  I analyzed the microcosm of LGBTQ works and compared the IR findings to the output of the university as indicated by its website. 

I created a list of search terms by compiling terms and phrases from academic and community resources.  [Appendix A — see http://hdl.handle.net/10919/97085]  I eliminated some terms that historically had different meanings (e.g., gay and queer) or that were too broad (e.g., discrimination).  However, I did not discard biological terms because, though they sometimes refer to plants or animals, they appeared in each studied collection.  The resulting list had 155 terms.4  [Appendix B — see http://hdl.handle.net/10919/97085]

To refine my investigation, to help understand who was doing the scholarship and research in the LGBTQ microcosm, and to help put the data in context, I searched the terms across the university, the IR and within three of the IR’s actual and virtual collections: graduate students’ ETDs, peer-reviewed faculty publications,5 and academic units’ (called “colleges” at VT) web publications.  These collections targeted the scholarship of graduate students and faculty as well as information often aimed at the general public or alumni from the colleges and the university.

129 of the 155 terms searched got 21,455 hits in the 71,734 items in VTechWorks (VTW).  To search the university website, I entered the terms directly in Google (i.e., www.google.com) by using this search strategy:  [term]  site:vt.edu -site:vtechworks.lib -site:theses.lib.vt.edu.  In what I’m calling the “VT collection” (VTC), 109 of the 155 terms got 84,793 hits.
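The search strategy above can be pictured as a small helper that builds one query string per term.  This is only a sketch under stated assumptions: the function name build_site_query is hypothetical, quoting multi-word phrases is an assumption rather than a documented step of the study, and the excluded hosts are copied exactly as written above.  The code only assembles the text to paste into the search box; it does not contact Google.

```python
def build_site_query(term, site="vt.edu",
                     excluded=("vtechworks.lib", "theses.lib.vt.edu")):
    """Return a query string limited to one site, minus the excluded hosts."""
    # Assumption: multi-word phrases are quoted so they are matched as a phrase.
    quoted = f'"{term}"' if " " in term else term
    exclusions = " ".join(f"-site:{host}" for host in excluded)
    return f"{quoted} site:{site} {exclusions}"


# Two terms mentioned in this article, used here purely as examples.
for example_term in ["gender", "sexual orientation"]:
    print(build_site_query(example_term))
```

Keeping the excluded hosts as a parameter makes it easy to see that the IR and the older theses site were left out of the “VT collection” so that the two sources being compared do not overlap.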

I did not compare the number of hits per se because of the radically different sizes and ages of the collections.  For example, the ETD collection had 32,557 works with LGBTQ terms dating from 1910.  In the virtual faculty research collection (FRC) of 3,870 items, these terms dated from 1989, and in the virtual college collection (CC) of 14,590 items, these terms dated from 1972.  Because of these discrepancies and for comparison purposes, I calculated the percentage of hits for each term within each collection. 
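The percentage calculation can be sketched as follows, assuming that “percentage of hits” means each term’s share of all hits recorded within one collection (which is how the later figures, such as “gender” getting nearly 50% of the hits in ETDs, appear to be used).  The function name hit_percentages and the counts below are made up for illustration; they are not data from the study.

```python
def hit_percentages(hits_by_term):
    """Convert raw hit counts per term into each term's share of all hits (%)."""
    total = sum(hits_by_term.values())
    if total == 0:
        # A collection with no hits at all: every term has a 0% share.
        return {term: 0.0 for term in hits_by_term}
    return {term: 100.0 * count / total for term, count in hits_by_term.items()}


# Hypothetical counts for one collection (illustration only).
example_hits = {"term_a": 50, "term_b": 30, "term_c": 20}
shares = hit_percentages(example_hits)
print(shares)
```

Because each collection is normalized by its own total, a small collection such as FRC and a large one such as the ETDs can be compared on the same scale.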

An example covering terms from the beginning of the alphabet, showing the percentage of hits in VTC and in the three targeted VTW collections where the term was found, is available at http://hdl.handle.net/10919/97085 (Table 1).

In FRC, 40 of the 155 terms got hits, with the top 10 terms getting 86% of the hits.  There were two outliers in FRC.  “Gender bias” was used much more in FRC (9.3%) than in any other collection (0.6% and 1.9%).  FRC used “sexual orientation” about twice as often as ETDs (5.9% vs. 2.9%).  However, CC and VTC, the most public-focused collections, used it much more (13.8% each).

In CC, 89 of the terms got hits, with the top 10 terms getting 81% of the hits.  “Sexual orientation” got more than twice the share of hits in CC that it got in FRC and more than four times its share in ETDs.  VTC, however, used “gender identity” and “gender expression” about twice as often as they were used in CC, FRC, and ETDs.

The term “gender” got nearly 50% of the hits in ETDs, leaving the remaining 114 terms with between 3.5% and 0.01% of the hits.  “Gender” also got about 50% of the hits in FRC and CC, though only 39% in VTC and VTW overall.

The same five terms got the most hits in VTW and VTC.  Only eight terms got more than 2% of the hits in VTC.  The top four terms (gender, sexual orientation, gender identity, and gender expression) were the only terms in VTW that got more than 2% of the hits.  The top 20 terms in VTC differed from VTW by <2%, except for “sexual orientation,” which got 4.7% more hits in VTW.  See http://hdl.handle.net/10919/97085 for Table 2 comparing the 20 most used terms in VTW and VTC.

VTC had three terms that did not appear in VTW:  gender expansive, homonormative, and gender creative.  See http://hdl.handle.net/10919/97085 for the list of the 23 terms used in VTW but not in VTC.  Ten of these terms appear only in ETDs.  Two terms, cisnormative and diverse sexualities and genders, appeared only in CC.  Analyzing the terms that did not appear in a collection was not necessarily meaningful due to the very low number of hits (1-2).

Using VTC as the measure of scholarship and activities at the university, and comparing the percentage of hits in VTW with the percentage of hits in VTC, the data provide some evidence that there is a positive correlation between the IR and the university, at least when studying the LGBTQ microcosm.

To speculate how well VTechWorks represents the scholarship and activities of Virginia Tech, I considered, first, a difference in frequency of <1% to indicate that the works in the IR’s LGBTQ microcosm appropriately represent the university for this microcosm.  VTechWorks and VTC had 109 terms in common.  Only four terms appeared slightly more frequently in VTC: lesbian (+1.2%), gender identity (+1.3%), gender expression (+1.5%), and LGBTQ (+1.6%).  95% of the terms appeared in both collections with about the same frequency (i.e., <1% difference in hits), which may indicate that the IR’s LGBTQ microcosm adequately represents the university’s scholarship and activities in this microcosm during this study.

• ETDs and VTC had 99 terms in common.  89% of the terms appeared with about equal frequency.

• FRC and VTC had only 40 terms in common, with 73% of the terms appearing with about equal frequency. 

• CC and VTC, the two most public-oriented collections, had 85 terms in common.  87% of the terms appeared with about equal frequency.

If, instead of a <1% difference, we consider a <2% difference to be about the same frequency of appearance, no terms appeared more frequently in VTC than in VTW.  One term appeared more frequently in VTW.  Therefore, 99% of the terms appeared with about the same frequency, so the IR’s LGBTQ microcosm is representative of the university’s scholarship and activities in this microcosm during this study.

• ETDs and VTC: 97% of the terms appeared with about the same frequency. 

• FRC and VTC: 90% of the terms appeared with about the same frequency. 

• CC and VTC: 95% of the terms appeared with about the same frequency.
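The threshold comparison used in both passes (<1% and <2%) can be sketched as a single function.  This is an illustrative reading of the method, not the procedure actually used: the function name share_in_agreement and the sample percentages are hypothetical.  As a check on the reported figures, if five of the 109 common terms differ by 1% or more (the four terms higher in VTC plus “sexual orientation,” which is higher in VTW), then 104/109 is about 95%; if only one term differs by 2% or more, 108/109 is about 99%.

```python
def share_in_agreement(pct_a, pct_b, threshold):
    """Percentage of terms common to both collections whose hit shares
    differ by less than `threshold` percentage points."""
    common = set(pct_a) & set(pct_b)
    if not common:
        return 0.0
    close = sum(1 for term in common if abs(pct_a[term] - pct_b[term]) < threshold)
    return 100.0 * close / len(common)


# Hypothetical per-term percentages for two collections (illustration only).
collection_a = {"term_a": 39.0, "term_b": 18.5, "term_c": 1.0, "term_d": 0.5}
collection_b = {"term_a": 39.4, "term_b": 13.8, "term_c": 2.2, "term_d": 0.4}

print(share_in_agreement(collection_a, collection_b, 1.0))  # 2 of 4 terms agree
print(share_in_agreement(collection_a, collection_b, 2.0))  # 3 of 4 terms agree
```

Passing the threshold as a parameter makes the sensitivity of the result explicit: the same data yield different agreement figures depending on what difference is accepted as “about the same frequency.”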

As a digital library and a showcase for the university, the IR should accurately reflect the scholarship and activities of its home institution.  This study was a preliminary investigation into whether the resources available from VTechWorks are aligned with scholarship and activities at Virginia Tech.  Not finding any guidance in the literature for assessing the contents of institutional repositories, I chose to investigate whether comparing the percentage of hits on a common list of terms used by authors at the university website and the IR would indicate a correlation and, therefore, a true reflection of the institution by its IR.  Looking into the LGBTQ microcosm also gave me a chance to see whether an unconscious bias had crept in.  With a 95% – 99% correlation, I feel confident saying that in the LGBTQ microcosm, VTechWorks accurately reflects Virginia Tech.

This preliminary investigation should be followed by studies of other microcosms, in VTechWorks as well as in other IRs and universities, before speculating that the IR truly reflects the university.  The information community will need to agree on what percentage of similarity indicates a high enough correlation to consider the IR representative of its university.  Readers’ feedback on the research methods would be very welcome, as would hearing from potential collaborators who would consider conducting similar studies at their institutions and comparing results among institutions.


Please Note:  This article was originally intended to be part of Against the Grain’s IR themed issue “IRs R Cool Again,” ATG v.31#5, November 2019.


1.  I prefer Clifford Lynch’s broad definition: “a set of services that a university offers to the members of its community for the management and dissemination of digital materials created by the institution and its community members.” (p2) to Raym Crow’s: “digital collections capturing and preserving the intellectual output of a … university community.”  (brief, p1)

2.  “a community, place, or situation regarded as encapsulating in miniature the characteristic qualities or features of something much larger” from Dictionary, an Apple Inc. application for macOS.

3.  I later learned that Leander University does not have an IR.

4.  In an attempt to reduce wordiness in this article, when I use “terms,” I mean both terms and phrases.

5.  Articles are from Elements, SWORD and those supported by our Open Access Subvention Fund.

