Note: I’m posting this rather late in an effort to publish all the half-written blog posts I have collected over the last year or so… please forgive me if it’s all a bit dated.
I had the opportunity to attend another Archives Unleashed event in February this year (AU3.0), this time hosted at the Internet Archive where it coincided with a symposium on the WASAPI Project (more on that later). The event began much like earlier iterations, with some opening slides from the hosts and organisers about the group and some general overviews on web archiving tools, available datasets, data formats and the like. As in previous years, the Internet Archive’s Jefferson Bailey and Vinay Goel gave a talk introducing the available APIs and datasets hosted by the Archive, but also included some information on the collaborative projects the Archive has undertaken with researchers over the years. Personally I found this presentation super interesting, not only because it gave a sense of the types of projects that the Archive has supported (through access to web archives), but also for more self-serving reasons related to my PhD research…
As I am closing in on the preliminary findings of my first case study, I’m increasingly interested in the undocumented work that surrounds web archiving activities. There has been a bit said over the years about the gap in web archiving between the needs of various stakeholders in the field (see Dougherty and Meyers 2014 for more) and the push to support more scholarly research using existing web archival data (a goal which Archives Unleashed is actively promoting). But the presentation got me thinking about the undocumented ways in which people, tools and infrastructures should (or already do) support these types of scholarly activities.
This is very much related to the WASAPI Project Symposium – as the project seems to be focused on developing the technical tools (using existing protocols) for negotiating things like data transfer, filtering and querying tasks for web archives. I do think it’s fruitful, however, not to lose sight of the ‘people-work’ that goes into supporting research of this type – or indeed other use cases for web archives beyond scholarly interventions. And indeed, what skill sets and expertise are required to leverage web archives as cultural/information/data resources? Some of these questions have certainly been touched on by others such as the BUDDAH Project, but I do wonder how they may evolve over time in the form of research services – or at the very least, APIs that are built around these sorts of activities.
Perhaps I’m digressing. Back to AU3.0.
Pondering End of Term
After the introductory remarks we all set about the task of figuring out a project to work on. I’m told there were around 20 participants for this event, mostly composed of postgraduate researchers in related fields, with a mix of previous attendees and newcomers. As with AU2.0, each participant wrote a few topics of interest on post-its and the group then tried to co-locate similar topics in order to formulate groups and projects around particular interests. I’ll admit now that unlike the previous iteration, I came to the event with a vague idea of what I wanted to work on, although without having done any groundwork towards assessing the project’s feasibility.
As Nicholas Taylor recently noted in the Introduction to the special issue on Web Archiving: ‘there has never been as much web archiving as there is now’ (Taylor 2017). The recent upsurge in institutions, communities and public events committing resources to web archiving, combined with the current political climate and (arguably) increased media coverage of web archiving activities, points towards a potential shift in public awareness of web archiving initiatives. After the Trump inauguration and in the lead-up to AU3.0, for example, it became difficult for me not to notice the public outcry on social media and email lists over the ‘rapid’ removal of various webpages and content from the Obama-era Whitehouse.gov. These were often followed by public statements and tweets from various members of the End of Term project, notifying folks that this is something that happens with every administration changeover, and that there have been longstanding collaborative efforts towards archiving the federal government web presence.
Given all of this, I was curious whether we could answer a few questions empirically:
- Can the kinds of ‘changes’ in the federal government’s web presence at this End of Term transition be classified as different in any way?
- Are the Trump administration changes happening more quickly relative to other administration hand-overs?
- More generally, are there particular topics or content areas that are more or less vulnerable at times of transition?
After some discussion and milling about, our group formed around this topic, made up of myself, Mohamed Aturban (Old Dominion University), Justin Littman (George Washington University), Yu Xu (University of Southern California) and Shawn Walker (University of Washington). We didn’t get as far as we would have liked with the project, but we did start a GitHub repository which contains some preliminary results and code. We put a lot of energy into experimenting with ways of plotting ‘change’, both statistically through regression analysis and visually using Google Charts that linked directly to the specific mementos exhibiting the greatest differences in simhash distance. Sadly, I have no conclusive answers to our questions, but I think there’s a lot of room for exploring them given the time and resources.
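For anyone unfamiliar with simhash, here is a rough idea of what a ‘simhash distance’ between two captures of a page looks like. This is only a minimal sketch under my own assumptions (word-level tokens, MD5 as the per-token hash, a 64-bit fingerprint, and made-up example text); it is not a copy of the code in the repository, which may well differ in tokenisation and hashing choices.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a simhash fingerprint of `text` as a `bits`-bit integer.

    Assumption: tokens are lowercase word characters and each token is
    hashed with MD5 (truncated to `bits` bits). Other choices are possible.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# Hypothetical text extracted from two mementos of the same URL.
before = "Climate change is a priority for this administration."
after = "An America First energy plan is a priority for this administration."

distance = hamming_distance(simhash(before), simhash(after))
print(f"simhash distance: {distance} of 64 bits")
```

Similar texts tend to produce fingerprints that differ in only a few bits, so a larger distance between consecutive mementos is a cheap signal that a page has changed substantially and is worth a closer look.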
I’m thankful to add that we ended up winning Best Project, earning ourselves a lovely Internet Archive pint glass and those fetching caps we are sporting in the featured photo. Thanks to everyone on the team, the Archives Unleashed organisers and the Internet Archive for supporting the project. I’m looking forward to what’s next!