Domain Archiving Experience at the National Library, Singapore

This week’s post was written by Shereen Tay, Librarian at the National Library, Singapore.

The National Library, Singapore (NLS) is a knowledge institution under the National Library Board, which also manages 26 public libraries, the National Archives of Singapore, and the Asian Film Archive. At NLS, we have a mandate to preserve the published heritage of our nation through the legal deposit of works published in Singapore, as well as the web archiving of Singapore websites.

NLS started archiving Singapore websites in 2006 in response to the growing use and popularity of the Internet. However, we found the process administratively cumbersome, as we were required to seek the written consent of website owners first. Difficulty in identifying website owners and a low response rate hampered our ability to build a comprehensive national collection of Singapore websites. To scale up our collecting efforts, we updated our legislation to empower NLS to archive websites ending in “.sg” without the need for written permission. The new law came into effect on 31 Jan 2019.

We conducted our very first domain crawl in 2019. Domain archiving is done mostly in-house and covers pre-archiving checks, crawling, indexing, quality assessment, and providing access. Before this first crawl, we also had to establish new workflows, automate processes, and enhance our Web Archive Singapore (WAS) portal to cope with the enormous volume of websites we were about to harvest; some of these changes are detailed below.

The Web Archive Singapore portal (https://eresources.nlb.gov.sg/webarchives)

Each year, we receive about 180,000 registered .sg domain names through our Memorandum of Understanding with the Singapore Network Information Centre, which is the national registry of .sg domain names. To handle this large volume of websites, we leveraged Amazon Web Services to run the crawls instead of our own servers, which we had been using for our thematic crawls. This helped reduce the time taken to archive from more than six months to three months.
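Purely as an illustration of the scale involved, a yearly list of roughly 180,000 domains has to be broken into many smaller seed lists before it can be crawled in parallel. The Python sketch below shows one simple way to do that; the file names and batch size are assumptions for the example, not NLS’s actual crawl setup.

```python
from pathlib import Path

# Illustrative only: split a yearly list of .sg domains into seed lists so that
# crawl jobs can run in parallel (e.g. on separate cloud instances).
BATCH_SIZE = 5_000  # domains per crawl job (an assumed figure)

domains = [
    line.strip()
    for line in Path("sg_domains.txt").read_text().splitlines()
    if line.strip()
]

Path("seeds").mkdir(exist_ok=True)
for i in range(0, len(domains), BATCH_SIZE):
    batch = domains[i : i + BATCH_SIZE]
    seed_file = Path("seeds") / f"batch_{i // BATCH_SIZE:03d}.txt"
    seed_file.write_text("\n".join(f"http://{d}" for d in batch) + "\n")
```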

Another process we instituted was a two-step quality assessment (QA). Before the legislative changes, our team had carried out manual QA for all our thematic crawls, but this became impractical with the increased volume of websites harvested under domain archiving. To address this, the team developed an automated QA script to sieve out archived websites that are unlikely to contain substantial content, such as domain-for-sale pages, blank pages, and sites under construction. Archived websites that do not pass the script are then sent for manual checking. As manual checking is an equally intensive process, we created a simple interface that displays screenshots of the archived websites, so that staff can assess their look and feel at a glance and speed up the review. With this in place, we were able to complete the entire QA process within three months.

Screenshot of the web archiving manual QA system.
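As an illustration of the first, automated step, the kind of heuristic such a QA script applies might look like the Python sketch below. The phrases, threshold, and function names are assumptions for the example, not NLS’s actual implementation.

```python
import re

# Illustrative only: phrases that often indicate an archived page with no
# substantial content. NLS's actual rules and thresholds are not published.
LOW_CONTENT_PHRASES = [
    "domain for sale",
    "this domain is for sale",
    "under construction",
    "coming soon",
    "account suspended",
]

MIN_TEXT_LENGTH = 200  # characters of visible text below which a page is treated as blank


def strip_html(html: str) -> str:
    """Crudely remove scripts, styles, and tags to approximate the visible text."""
    html = re.sub(r"(?is)<(script|style)\b.*?>.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip()


def passes_auto_qa(html: str) -> bool:
    """Return False if the archived page looks like it lacks substantial content."""
    text = strip_html(html).lower()
    if len(text) < MIN_TEXT_LENGTH:
        return False  # effectively a blank page
    return not any(phrase in text for phrase in LOW_CONTENT_PHRASES)


if __name__ == "__main__":
    sample = "<html><body><h1>This domain is for sale</h1></body></html>"
    print(passes_auto_qa(sample))  # False -> route to manual checking
```

Pages flagged this way would then be queued for the screenshot-based manual review described above.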

Our efforts would not be meaningful without providing public access. Ahead of the 2019 domain crawl, we paved the way for access by giving the WAS portal a major makeover. Key enhancements included Solr full-text search, curation, public nomination of websites, and rights management.

Within our second year of domain archiving, we found that the sheer size of the collection had become a strain on the WAS portal’s Solr indexing. We therefore implemented distributed search with index sharding in 2020. This allowed the portal to scale, querying and fetching results within acceptable response times. Distributing Solr also improved the indexing speed of our collection, which we estimate will grow by about 15% annually due to domain archiving.
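For readers less familiar with Solr, distributed search lets a single query fan out across several index shards via the `shards` request parameter (SolrCloud automates this routing). The sketch below shows the general shape of such a request; the host names, core names, and query are hypothetical and not taken from the WAS portal.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical shard locations; the WAS portal's actual Solr topology is not published.
SHARDS = ",".join([
    "solr1.example.org:8983/solr/webarchive_shard1",
    "solr2.example.org:8983/solr/webarchive_shard2",
])

params = {
    "q": "national day parade",  # full-text query across the archived collection
    "shards": SHARDS,            # fan the query out to every shard and merge the results
    "rows": 10,
    "wt": "json",
}

url = (
    "http://solr1.example.org:8983/solr/webarchive_shard1/select?"
    + urllib.parse.urlencode(params)
)
with urllib.request.urlopen(url) as response:
    results = json.load(response)

print(results["response"]["numFound"])  # total matches across all shards
```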

These are just some of the major pieces of work we have done as part of our domain archiving journey. As of 2022, our collection (including thematic crawls) contains over 317,000 archived websites, amounting to more than 200 TB. As we continue to carry out our mandate of archiving Singapore websites, our team is looking into migrating the collection to government cloud infrastructure, possibly with SolrCloud, and into providing a web archiving dataset for research. We have also written a short blog post on our observations of the .sg web, drawing on the data collected over the past four years. We hope that, in time, this collection will grow into a valuable resource for researchers and Singaporeans.

Shereen Tay is a Librarian with the National Library, Singapore. She is part of the team that oversees the statutory functions of the National Library, in particular web archiving.