Why Big Data Matters: Perspectives from the Libraries
20 October 2022
World Statistics Day is celebrated around the globe every five years on 20 October and was first celebrated in 2010. The theme of the third World Statistics Day was “Connecting the world with data we can trust”, reflecting the importance of trust, authoritative data, innovation and public good in national statistical systems. The celebration of World Statistics Day is a global collaborative endeavour, organised under the guidance of the United Nations Statistical Commission.
Oracle defines the term Big Data as datasets that contain greater variety, arriving in increasing volumes and with more velocity. According to Statista, about 64.2 zettabytes of data were created, captured, copied and consumed across the digital universe in 2020, and this figure is forecast to triple by 2025.
This unprecedented growth could be attributed to business adaptation in response to COVID-19. According to the survey published in the World Economic Forum (WEF)’s 2020 Future of Jobs report, employers are accelerating the digitalisation of work processes as well as the automation of tasks, generating more data in return. Companies in various sectors surveyed in the report also indicated that Big Data analytics, Artificial Intelligence, the Internet of Things and connected devices are some of the technologies likely to be adopted by 2025. At the heart of these technologies is data.
New uses and applications of data change our economic and social lives, making it increasingly essential for everyone to be data literate. With data becoming a core component in our lives and in service delivery, it is important for all to understand data, where AI is being applied, and what it may mean.
Big Data SIG: A Forum for Data-Pertinent Discussions
The Big Data Special Interest Group (SIG) was set up in response to the and is sponsored by the Information Technology Section. The Big Data SIG helps libraries find out how they can be part of the big data movement and leverage the rich streams of data generated as a possible source of information. The SIG also investigates how libraries can act as a data intermediary between data producers and data consumers for cultural heritage datasets as well as newly generated research and scholarly datasets.
To mark World Statistics Day, we highlight three library projects involving the use of Big Data. We are grateful to Dr Raymond Uzwyshyn at Texas State University Libraries, United States, Dr Fenella France at the Library of Congress, United States, and Tan Wen Sze at the National Library Board, Singapore, for sharing their projects at past World Library and Information Congresses (WLIC).
Data Ecosystem for Open Scholarly Research
Texas State University (TXST)’s Data Research Repository Ecosystem project is a stellar example of an online networked data research infrastructure that enables the sharing and archiving of research data for open scholarly research. The ecosystem supports an end-to-end academic research cycle, from search and retrieval of data and content, to gathering and analysis, to writing and publishing online.
The unified digital scholarship ecosystem consists of the following components:
Primary components | 1. Online research data repository (utilising Harvard’s open-source Dataverse software) 2. Online institutional digital collections repository (utilising open-source DSpace software) |
Tertiary components | 1. Online electronic theses and dissertation management system 2. Identity management system 3. Open academic journal system software 4. User interface content management software |
Most research datasets are still within the 1GB range. Recognising the future need to store larger datasets, preparations have begun for the next build. The storage options being explored for large datasets range from Amazon Web Services S3 storage and the Texas Advanced Computing Center’s separated metadata/storage pointer systems to Data Dryad’s fee-based institutional models for datasets up to 300GB.
Recent years have seen a clear acceleration in the adoption of Artificial Intelligence. Scientific breakthroughs in machine learning, deep learning and neural networks are propelled by greater computing power, the availability of well-labelled, metadata-enabled datasets, and open-source research data repositories and ecosystems.
Data repositories and research ecosystems enable academic research and the progress of knowledge and discovery. These can present interesting possibilities for open science, new knowledge discovery and innovation, even for globally dispersed research teams.
This project was presented at the WLIC 2022 conference in Dublin. The full paper and slides are available in the IFLA Repository.
Data-driven Approach to Assess the Condition of Collections
The Library of Congress presented an exceptional use case in which physical, chemical and optical data were collected and analysed to make informed decisions on the withdrawal or retention of collection items and to ensure the overall robustness of cultural heritage collections.
A data analytics approach was required to fill knowledge gaps and perform an objective assessment of collections to identify materials that are at risk and to differentiate between good and bad quality copies of books.
The research examined the physical, chemical and optical characteristics of over 500 identical books, published between 1840 and 1940, from five large research libraries in different parts of the United States. A data platform was set up to store sampling data gathered by various instruments, such as measurements of tensile strength and acidity. These data are stored in CouchDB, a JSON document store, along with non-scientific data for each book.
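As a rough illustration of what storing each book as a JSON document can look like, the sketch below models two per-book records and flags copies that may be at risk. All field names, values and thresholds are hypothetical; they are not taken from the Library of Congress project and are not conservation guidance.

```python
import json

# Hypothetical per-book documents of the kind a JSON document store could hold.
# Field names and values are illustrative assumptions only.
books = [
    {
        "_id": "book-001",
        "library": "Library A",
        "title": "Example Title",
        "year_published": 1885,
        "measurements": {"ph": 4.2, "tensile_strength_n": 18.5},
    },
    {
        "_id": "book-002",
        "library": "Library B",
        "title": "Example Title",
        "year_published": 1885,
        "measurements": {"ph": 6.8, "tensile_strength_n": 41.0},
    },
]


def at_risk(doc, max_ph=5.0, min_tensile_n=25.0):
    """Flag a copy if its paper is acidic or mechanically weak.

    Both thresholds are placeholders chosen for the example.
    """
    m = doc["measurements"]
    return m["ph"] < max_ph or m["tensile_strength_n"] < min_tensile_n


# Collect the identifiers of copies that fail either check.
flagged = [doc["_id"] for doc in books if at_risk(doc)]
print(json.dumps(flagged))
```

Keeping scientific measurements and descriptive fields in one document per book is what makes it straightforward to compare copies of the same title held by different libraries.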
To analyse the data, the team had a query tool to assess the discrete data points in real time, in addition to a compare tool which leveraged an open-source International Image Interoperability Framework (IIIF) viewer to evaluate photo documentation of different books close up.
The main factors to consider and assess for preservation are the impact of material, environment and usage. Through the research, the team discovered that the inherent properties of the paper at the time the books were produced were the most critical factor for predicting condition.
This project demonstrated that having the capacity to collect, store and interrogate datasets can greatly advance an organisation’s ability to make objective decisions.
This project was presented at the IFLA WLIC 2022 conference in Dublin.
Book Recommendations through Machine Learning
The National Library Board of Singapore (NLB)’s project to develop a recommendation engine for its collection of books and eBooks shows how libraries can evolve their systems to be Big Data compatible and leverage cloud technologies so that they can focus on higher-value tasks such as training the model and designing the user experience.
In 2019, NLB adopted Amazon Personalize, a cloud-based machine learning recommendation service, for its new recommender service. The service was subsequently deployed on its website and mobile app to make personalised recommendations from its books and eBooks collections.
Using a fully managed cloud-based service freed NLB from maintaining the infrastructure and machine learning pipeline, allowed it to enjoy product enhancements and new algorithms without further investment, and reduced upfront capital expenditure.
Since NLB had an existing data warehouse containing transactional data, the team was able to use that infrastructure to create a unified view of patrons and understand their borrowing patterns. The team then focused on training the model to make better recommendations, for example by reducing duplicate recommendations of the same title in different formats and by making subject-based recommendations.
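One simple way to picture the duplicate-title problem is as a post-processing step over a ranked list. The sketch below is an assumption for illustration, not NLB's actual method (which the source describes as part of model training): it keeps only the highest-ranked entry per title so that a book and its eBook edition are not both shown. The record fields are hypothetical.

```python
def dedupe_by_title(recommendations):
    """Keep the first (highest-ranked) item per title, preserving rank order."""
    seen = set()
    unique = []
    for item in recommendations:
        # Normalise the title so trivial differences do not defeat the check.
        key = item["title"].strip().casefold()
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


# Hypothetical ranked output in which one title appears in two formats.
ranked = [
    {"item_id": "b1", "title": "Example Novel", "format": "book"},
    {"item_id": "e1", "title": "Example Novel", "format": "ebook"},
    {"item_id": "b2", "title": "Another Story", "format": "book"},
]

result = dedupe_by_title(ranked)
print([r["item_id"] for r in result])
```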
NLB also redesigned the user experience of interacting with book recommendations on its digital touchpoints. Instead of integrating Personalize directly, NLB developed its own web service layer to add further parameters for a better user experience. One example was passing the patron’s age from the digital touchpoint to the recommender service to retrieve age-appropriate recommendations.
Personalised recommendations provide an effective way to introduce new books to patrons and to continually engage them. The availability of fully managed recommendation services such as Amazon Personalize has now made recommendation services accessible to libraries.
This project was first featured at the IFLA WLIC 2021 virtual conference and subsequently in the January 2022 issue of the Trends and Issues in Library Technology newsletter.
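A minimal sketch of such a layer might look like the following. It assumes a client object exposing a `get_recommendations` method with keyword arguments modelled on the Amazon Personalize runtime call of that name; the age bands, the `AUDIENCE` context field, the placeholder ARN and the `FakeClient` stand-in are all hypothetical, and nothing here describes how NLB actually built its service.

```python
def age_band(age):
    """Map a patron's age to a coarse audience band (bands are hypothetical)."""
    if age < 13:
        return "children"
    if age < 18:
        return "teens"
    return "adults"


def recommend_for_patron(client, campaign_arn, patron_id, age, num_results=10):
    """Ask the recommender for items, adding an age-derived context value."""
    response = client.get_recommendations(
        campaignArn=campaign_arn,
        userId=patron_id,
        numResults=num_results,
        context={"AUDIENCE": age_band(age)},  # hypothetical context field
    )
    return [item["itemId"] for item in response["itemList"]]


class FakeClient:
    """Stand-in for a real recommender client so the sketch runs offline."""

    def __init__(self):
        self.last_call = None

    def get_recommendations(self, **kwargs):
        # Record the request so the caller can inspect what was sent.
        self.last_call = kwargs
        return {"itemList": [{"itemId": "b1"}, {"itemId": "b2"}]}


client = FakeClient()
items = recommend_for_patron(client, "arn:example", "patron-42", age=9)
print(items, client.last_call["context"])
```

Placing this logic in a separate layer means the digital touchpoints only need to send the patron identifier and age, while the mapping to recommender parameters can change without touching the website or mobile app.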
Final Notes
These three use cases demonstrate the use, re-use and augmentation of library data within Big Data settings. Data is expected to be omnipresent in our lives and industries. It is important that libraries play an active role in developing data ecosystems to support open scholarly research and evolve their existing library technologies to become Big Data compatible, so that they can make better use of data, be it for making informed decisions on what to preserve or on what books to recommend to patrons. Lastly, because libraries have a long history of helping patrons navigate information resources, we have a key role in promoting data literacy so that our patrons have the skills to find, assess, use and cite data.
I also strongly urge all of you to visit, join and contribute to the IT Section Facebook Group.
Happy World Statistics Day!
Written by: Patrick Cher, Big Data SIG Convenor