Forum on the archiving of the Polish Internet

Forum on the archiving of Polish Internet resources, 16-17 November 2021

The Forum on the archiving of the Polish Internet comprises two days of lectures and discussions on this issue, which is important, among other reasons, for securing this part of our documentary heritage. We hope that these talks, held among a wide range of experts and open to everyone interested in the subject, will produce key conclusions on the legal, methodological and technical conditions necessary to launch an archive of Polish Internet resources, in particular those produced by the public sector.

We have invited excellent guests from Poland and abroad to participate in the Forum, including representatives of archives, libraries and institutions that have already successfully implemented national Web archives.

Internet resources have been archived in many countries in Europe and around the world since the mid-1990s, and Web archival science is a dynamically developing field of knowledge. So far in Poland, Web archiving has been undertaken on a relatively small scale, and the discussion of it has been mainly academic. While recognising the achievements of many people and institutions, we would like to create a good atmosphere for broader reflection and consultation which, with your participation and involvement, will lead to the identification of specific solutions for building an optimal archive of Polish Internet resources: solutions which, while ensuring the quality of the sets and the legality of their collection, will also guarantee adequate protection of copyright and civil rights, including respect for users’ privacy.

The lectures given by our foreign partners on the first day of the Forum, each followed by an open question session, will provide an opportunity to learn about their positive and negative experiences in building a Web archive and about the challenge of developing rules for collecting and sharing the resources gathered in it. By gaining the best know-how and sharing it with you, we want to strengthen our future collective effectiveness in this field.

The second day of the Forum will be built around three discussion panels. The legal panel will focus on the most important foundations and the necessary legislative changes that should ensure lawful, socially accepted collection and provision of selected Web resources. The aim of the archival methodology panel is to help develop the Polish terminology needed for Web archiving. Its participants will also consider how to apply the current rules for developing archive sets to re-born digital objects collected in Web archives. The technology panel will focus on the necessary infrastructure solutions. During this part, we will also look for the best ideas for software and systems that would enable the effective launch and maintenance of the Polish Web archive.

We warmly invite you to participate actively in the Forum. Participation is free of charge, but registration is required. The number of participants is limited.

Sign up here:

https://eveningmedia.pl/webinars/forum-on-the-archiving-of-the-polish-internet/

DAY 1 | 16.11.2021 | 13.00 CET – 18.00 CET

13.00-13.10 Greeting by Paweł Pietrzyk, General Director of the State Archives

13.10-13.15 Józef Orzeł – Chairman of the Digitization Council

13.15-13.20 Kristinn Sigurðsson – Vice-Chair of the International Internet Preservation Consortium

13.20-13.40 Silvia Sevilla – Publications Office of the European Union

13.40-14.00 Q&A

14.00-14.20 Márton Németh – National Széchényi Library (Hungary)

14.20-14.40 Q&A

14.40-15.20 Break

15.20-15.40 Daniel Gomes – Scientific Computing Unit of the Foundation for Science and Technology (Portugal)

15.40-15.55 Q&A

15.55-16.15 Tom Storrar – National Archives (UK)

16.15-16.30 Q&A

16.30-16.50 Break

16.50-17.10 Mar Pérez Morillo – National Library of Spain

17.10-17.25 Q&A

17.25-17.45 Antal Posthumus – National Archives of the Netherlands

17.45-18.00 Q&A

18.00 End of Day 1

DAY 2 | 17.11.2021 | 9.30 CET – 17.00 CET

9.30-9.40 Start of Day 2, Introduction

9.40-11.10 Methodology Panel

dr hab. Lucyna Harc (State Archives Head Office)

Dominik Cieszkowski (National Library of Poland)

dr hab. Marlena Jabłońska (Nicolaus Copernicus University in Toruń)

dr Aneta Januszko-Szakiel (Jagiellonian University in Cracow)

Michał Knitter (State Archives in Szczecin, Central Methodical Committee)

Krzysztof Kołodziejczyk (State Archives in Lublin, Central Methodical Committee)

dr hab. Wiesław Nowosad (Association of Polish Archivists, Nicolaus Copernicus University in Toruń)

Maciej Zdunek (National Digital Archives in Warsaw)

dr hab. Wiktor Werner (Adam Mickiewicz University in Poznań)

11.10-11.30 Q&A

11.30-12.00 Break

12.00-13.30 Technology Panel

dr inż. Mateusz Tykierko (Wrocław University of Science and Technology, Digitization Council)

dr Adam Jatowt (University of Innsbruck)

Filip Kłębczyk (expert)

Grzegorz Kolecki (State Archives Head Office)

Grzegorz Zajączkowski (Digital Champion of the European Commission for Poland)

dr hab. inż. Maciej Piasecki (Wrocław University of Science and Technology, CLARIN-PL scientific consortium)

Maciej Stankiewicz (National Digital Archives in Warsaw)

Wojciech Goliński (National Digital Archives in Warsaw)

13.30-13.40 Q&A

13.40-14.40 Break

14.40-15.40 Legislation Panel

Dominik Cieszkowski (National Library of Poland, vice director)

Katarzyna Gaczyńska (University of Warsaw)

prof. dr hab. Katarzyna Grzybczyk (University of Silesia in Katowice)

dr hab. Marek Konstankiewicz, prof. UMCS (Maria Curie-Skłodowska University in Lublin)

Stefan Cimaszewski (National Library of Poland)

dr Katarzyna Pepłowska (Nicolaus Copernicus University in Toruń)

Mateusz Adamkowski (Ministry of Culture and National Heritage)

Łukasz Ciołko (Ministry of Culture and National Heritage)

15.40-16.00 Q&A

16.00-16.15 Break

16.15-17.00 End of Day 2, Summary

Portugal

Daniel Gomes – Scientific Computing Unit of the Foundation for Science and Technology

Daniel Gomes started Arquivo.pt, the Portuguese Web-Archive, in 2007 and currently leads this public service. He obtained his PhD in Computer Science in 2007. His thesis focused on the design of large-scale systems for processing web data. He has been a researcher in web archiving and web-based information systems since 2001.

Abstract

Search the Past Web with Arquivo.pt

Arquivo.pt, the Portuguese Web-Archive, is a research infrastructure that preserves millions of files archived from the web since the 1990s, containing information in several languages. It provides a public search service over this information. This presentation will briefly introduce the services publicly provided by Arquivo.pt and describe how it works.

The Netherlands

Antal Posthumus – The National Archives of the Netherlands

Abstract

The National Archives of the Netherlands, as a permanent government agency and official archive for the Central Government (Ministries and their agencies), has the legal duty, laid down in the Archiefwet, to secure the future of the government record. Within this context our role isn’t one of actively forming a collection of archived websites by selecting and harvesting these ourselves. This is a key difference from other national archives, national libraries and other (inter-)national heritage institutions.

Therefore we put a fair amount of effort into advising our producers, the Ministries and their agencies, on how they create and eventually transfer this specific form of government record: archived public websites. One example of the support we’ve offered was issuing a very well received guideline on archiving websites (2018).

This guideline was also used as part of the requirements of a public European tender (2021). The objective of the tender was to set up a central harvesting platform to harvest circa 1500 public websites of the Central Government.

In 2019 we started an implementation project to be able to receive this archived material. Our aim was to formulate and implement requirements for the different aspects of the OAIS model. In other words, we had to integrate into the existing infrastructure and workflows of our trusted digital repository (e-depot for short) the processes relating to the ingestion, storage, management and preservation of, and the provision of access to, archived public websites of the Dutch Central Government.

These broad subjects are central to my presentation, alongside more specific questions and challenges we’ve encountered, such as:

Which search options will we be able to implement?
How do we present and communicate to our users that content wasn’t harvested, or was only partly harvested, due to (known) technical limitations of harvesting agents in handling dynamic content? This is something we call non-harvestable content in our guideline.
Will our national metadata schema, our e-depot’s data model and the EAD as used in our collection management system suffice to provide adequate administrative, descriptive and technical metadata for archived websites? Or do we need to combine our national schema with another international schema?
Will the off-the-shelf viewer, OpenWayback, as it is installed in our Preservica solution, do the trick? Do we need to install a separate viewer in our infrastructure?
What is WARC validation, and which tools should we validate with?

European Union

Silvia Sevilla – Publications Office of the EU

Silvia SEVILLA is coordinator of the web preservation service at the Publications Office of the European Union (OP), an inter-institutional body based in Luxembourg that centralises all publications of the European institutions. Silvia has a degree in law and postgraduate studies in business administration and publishing. She joined the Publications Office in 2005 and since then has held various positions in publishing and archiving services. She has been in charge of the web archive since the creation of the service that manages it within the Publications Office in 2018.

Abstract

The European Union (EU) is a political and economic union of 27 Member States. On behalf of the EU institutions, the Publications Office of the European Union (OP) creates an EU web archive. It aims to archive web content concerning the EU project to preserve it for the long term and to keep it accessible for the public. The archive covers the various websites of the EU institutions, bodies and agencies. The majority of these are hosted on europa.eu, the domain that spans the entire institutional framework of EU powers and regulatory bodies.

The websites included in the seed list are archived regularly, at least four times per year. Captures can be scheduled as often as appropriate. It is also possible to create ‘ad hoc’ archives if a website owner makes a justified request. The most common reason for requesting an exceptional capture of a page or document is that the content is about to be taken offline or changed significantly.

Our web archive is created in partnership with Archive-It (Internet Archive). This platform offers automatic tools for managing seeds, scoping and cataloguing, with the possibility of ingesting metadata at the seed or collection level. It supports automatic crawling that can be scheduled, full-text search and hosting. There is a quality control system and it is possible to download WARC files.

The presentation will outline the main characteristics, content and scope of the EU web archive as well as the archiving process and the tool used. It will cover some of the challenges and strategies developed to respond to them.

Hungary

Márton Németh – National Széchényi Library

Márton Németh is currently working as a web librarian in the Web Archiving Department at the National Széchényi Library in Budapest. He has master’s degrees in History, Library and Information Science from Szeged University, Hungary, and in European Studies from Aalborg University, Denmark, as well as a Digital Library Learning international degree from Oslo University College, Tallinn University and Parma University. He has just defended his PhD thesis on web archiving in general and the evolution of the Hungarian Web archiving project at the Doctoral School of Informatics, University of Debrecen, Hungary.

Abstract

In my presentation I will offer a brief introduction to the web archiving activities of the National Széchényi Library. I will summarize the organisational framework and the legislative environment and give a quick overview of the history of born-digital archiving in our library. The presentation will also cover the types of archiving, the different kinds of collections, the IT background of web archiving, metadata issues, dissemination and collaboration.

Spain

Mar Pérez Morillo – National Library of Spain

Mar Pérez Morillo holds a PhD in Latin Language and Literature. She started working at the National Library of Spain (BNE) in 2004. Since then, she has been in charge of the institutional website, the Library’s social media, Web Archiving and the Non-Print Legal Deposit. She is currently the Director of Digital Processes and Services at the Library, where her main tasks are:

the coordination of the re-use projects,
the online portals of the Library,
the online library catalogue,
the Digital Library,
the Non-Print Legal Deposit (including the Web Archive) and
the long-term digital preservation program.

Abstract

The non-print legal deposit and the preservation of born-digital heritage

The National Library of Spain started crawling the Spanish Web in 2009, with the help of the Internet Archive. This was undertaken as one of the Library’s main duties: preserving the documentary heritage, as publications were increasingly moving from physical media to online. Since then a new Legal Deposit Law was enacted (in 2011) to bring online publications within the scope of the Legal Deposit. As Spain has a regional structure with autonomous governments, the autonomous regions have competencies in legal deposit. The non-print legal deposit in Spain, including the Web Archive, is therefore a collaborative project in which the regional libraries select and manage their own online collections in cooperation with the National Library of Spain. So far, the volume of information archived is 1 PB, and the collection consists of a combination of annual crawls of the .es domain and selective crawls on different topics that require specific management.

Luxembourg

Ben Els and Yves Maurer – National Library of Luxembourg

Ben Els has been the digital curator of the Luxembourg Web Archive at the National Library of Luxembourg since 2017. He has previously worked in the cultural sector, as a project coordinator for the Mierscher Kulturhaus and the Séibühn Ënsber asbl. Ben completed his Bachelor’s degree in European cultures at the University of Luxembourg, which he followed up with a Master’s degree in comparative literary and art studies at the University of Potsdam.

Yves Maurer is the deputy head of the IT and digital innovation division at the National Library of Luxembourg and has been the technical lead on the Luxembourg Web Archive since 2016. He has an active role in all things digital happening at the library, from digital preservation and digital legal deposit to AI methods for enhancing the usability of digitised materials, open data and the transition to a new ILS. From 2007 onwards he was responsible for the BnL’s digitisation program and for setting up the portal of Luxembourg newspapers at eluxemburgensia.lu. In that period, he was a member of professional boards relating to digitisation at IFLA and IGeLU. Before that, he was Vice-President of Development at Atril Language Engineering in Madrid and responsible for the flagship DéjàVu Computer-Assisted Translation software. He holds an MSci in Mathematics and Computer Science from Imperial College London.

Abstract

In this presentation, Yves Maurer (deputy head of the IT department) and Ben Els (digital curator, Luxembourg Web Archive) will go back to the beginning of the web archiving program at the National Library of Luxembourg, covering the different aspects of the Luxembourg legal deposit legislation, insights from .lu domain crawls, as well as event and thematic collections since 2016. We will also talk about efforts in public outreach, community participation and research projects related to the Luxembourg Web Archive.

Denmark

Anders Klindt Myrvoll – The Royal Library

Anders Klindt Myrvoll is the Programme Manager at Netarkivet. Together with colleagues he aims to collect, preserve and provide access to the Danish web in the best way possible. He’s also managing specific projects, taking care of relations with researchers and other user groups, and representing the web archive in international forums and collaborative projects.

He has a versatile education in philosophy, IT and leadership. Prior to joining Netarkivet he had 13 years of management experience in the localization of animated film and live action for broadcast, cinema and streaming services for local, regional and global clients, as well as in the creation of original productions.

https://www.linkedin.com/in/andersklindt/

https://twitter.com/AndersKlindt

Abstract

Netarkivet – the national Danish web archive at the Royal Danish Library

This presentation will outline the history, legal background and current state of Netarkivet, the national Danish web archive at the Royal Danish Library. We’ll look at numbers, stats, the technology used, the updated legal deposit laws that came into effect in 2005 and made the web archive possible, collection practices and challenges, and exciting new possibilities such as our new open source discovery and playback platform, SolrWayback, and the data dumps of archived content for researchers that have been possible since 2018.

Croatia

Karolina Holub – National and University Library in Zagreb

Karolina Holub is a library adviser and Coordinator of the Croatian Digital Library Development Centre at the National and University Library in Zagreb. Her field of work includes the development, implementation and maintenance of digital library systems (Croatian Web Archive, Digital Collections of the National and University Library in Zagreb, Croatian electronic theses and dissertations repositories, etc.) as well as taking care of interoperability with other systems for all types of resources. She has been involved in all stages of web archiving since 2005 and has coordinated the development of the Croatian Web Archive since 2016. She manages and participates in the Library’s digitization projects and the development of thematic portals, and is involved in several national and international projects.

Abstract

Archiving the Croatian Web

Web resources differ in many ways from other types of resources that libraries curate: frequent changes of URL, content and size, a short and unpredictable life cycle, etc. In Croatia, the task of saving these types of resources for future generations belongs to the National and University Library in Zagreb. In 2004 the Library, in cooperation with the University of Zagreb University Computing Centre (Srce), created a system for archiving web content: the Croatian Web Archive. The presentation will provide an overview of seventeen years of experience in archiving the Croatian web, with an emphasis on the existing workflows.

Iceland

Kristinn Sigurðsson – National and University Library of Iceland

Kristinn Sigurðsson has a master’s degree in computer science from the University of Iceland. He has worked for the National and University Library since 2003 and is currently Head of Digital Projects and Development. A notable part of Kristinn’s work for the library over the years has been oriented towards web archiving. This has ranged from software development, including work on the Heritrix crawler, to standards development and community building. Kristinn has represented the library on the steering committee of the International Internet Preservation Consortium since 2010 and currently serves as its Vice-Chair.

Abstract

Building the Icelandic Web Archive.
A retrospective on the process and challenges of building a national web archive and doing so on a budget.

In this presentation I will discuss what motivated the National Library of Iceland to undertake this task back in the 90s as the dot-com bubble was inflating and the important role that the revised Icelandic legal deposit legislation of 2002 played in the effort, and describe the practical work that has produced a comprehensive national web archive: an archive that spans over two decades, 5 billion documents and almost 150 terabytes of data, and is open to the world.

Czech Republic

Luboš Svoboda – The National Library of the Czech Republic

The Czech web archive of the National Library of the Czech Republic

Luboš Svoboda, Zdenko Vozár, Petra Habětínová

A short introduction to Webarchiv’s overall archiving practice from the point of view of web curators and technical support. We will walk you through our in-house software Seeder and give you a basic insight into our collection policy, legal issues, manual harvests, challenges and the open-source software (Heritrix/OpenWayback) we use.

We invite you to watch video recordings in which representatives of Web archiving institutions in Croatia, Luxembourg, Denmark, Iceland and the Czech Republic share their experiences. Their Forum talks are available below.
