History in the Age of Abundance?: How the Web Is Transforming Historical Research

Author: Ian Milligan
Finished: February 12, 2020
Rating: 3.5 / 5



Overall: Milligan raises a lot of important questions about what an Internet archive, and more broadly an archive of our present "digital age," looks like, especially for historians of the future. I think this was a good place for me to start thinking about archives in relation to historical research, especially in the "digital age". I was a little disappointed that there was almost no mention of Asia; everything seemed quite Euro-centric.

Below are my extensive notes on the book. Feel free to take from them whatever you'd like.

Introduction

  • Without web archives, a lot of websites would have been lost
  • The problems of having TOO much data - "digital traces are now so ubiquitous that finding something specific among the available information is the real challenge. Rather than being scarce yet valuable, these posts are now so commonplace as to be a nuisance. Scarcity was frustrating, but super-abundance brings its own challenges" (9)
    • My opinion: maybe historical research was just constrained in the past because we lost a lot of materials over time (due to war, neglect, natural disasters, etc.). Analogous to how life expectancy is much higher now because far fewer babies die in infancy and we have modern medicine, we might think of these better-preserved materials as a symptom of what happens when we have higher "expectations"
    • Maybe this is just a new way of doing things because we just have better ways of preserving peoples' voices in a way that was not technically or technologically possible in the past
  • "The ability to capture and archive information before it disappears will allow historians to construct a more complete, if complex and even confusing, historical record" (12)
  • "I argue that web archives will force the reshaping of how we train, write, and disseminate our histories. Historians and our society more generally need to address this shift. All of this is coming sooner than we may think." (12)
    • Though: SNU paper trails and paper work instead of online... when is it obsolete, when is it paranoid, and when is it cautious to do things on paper?
  • Note about privacy and consent ... But does this apply to situations where we read people's diaries from the 1900s? No privacy for Anne Frank
    • I am reminded of the ending of Handmaid's Tale, with the interview where they analyze June's recording and wish to have had more sources, at the same time coarsely reading into her very personal thoughts
  • Biases in Who Gets To Be Archived:
    • Representation bias: people with college degrees and higher incomes are more likely to use the internet, while people of color are underrepresented
    • Selection bias: Pages with more links are ranked higher and selected at a higher rate than random websites... People with smaller personal blogs will be less likely to be archived, bigger corporations skew the website archives, etc
    • Algorithmic bias: "If I designed the algorithm, my subjectivity is embedded in it" (59)
      • Historians are necessary for that extra step of interpretation

"No archive is a true reflection of the world" (17)

"When does the past become history?" (20)

Chapter 1 - Exploding the Library

  • NOTE: Need for knowledge management
  • History of internet... irrelevant and boring ... felt really unnecessary and just felt like Milligan just showing off the random shit he knows
  • How to determine a source's potential historical significance?
    • Milligan argues that "selection criteria are largely algorithmically driven, rather than being determined according to a source's potential historical significance"
    • But when humans determine that significance, they are also using some sort of criteria or algorithm as well SOOOOOOOOOO I don't get it
  • Cliometrics (1960-1970s) but it fell out of style as historians felt "uncomfortable reducing human experiences to numbers" (54)
    • Braudel, Annales School of history
    • Cliodynamics (Peter Turchin, 2008) - argued for a study of patterns that "cut across periods and regions" to "collect quantitative data, construct general explanations and test them empirically on all the data ... To truly learn from history, we must transform it into a science" (58)
  • "Distant reading is a necessity when working with web archives" (57)

Chapter 2 - Web Archives and Their Collectors

  • "At first glance, it seems odd to be consulting print books to reconstruct digital objects. But paper is a surprisingly durable material" (64)

    • In the process of reconstructing the "first website" (the CERN website)
    • "Can an archived representation ever capture the original? What does it truly mean to be able to access a backed-up version of a site from 1992?" (66)
    • **"**The process of recovery and reconstruction of the original CERN page is a fascinating story of media archaeology" (66)
    • Needed to reconstruct an old browser to simulate the full experience because modern browsers are too FAST
    • Well... It's like trying to recreate Bach using Baroque instruments ONLY
    • I guess seeing an old website on a modern browser is like playing Beethoven using modern instruments
    • Or like painting the Greek columns and sculptures to not be white but be colored
    • Or coloring in black and white photos
  • Physical archive - SAA's three core principles for archiving: provenance (history of an object or collection), original order, and collective control (68)

    • Genealogy emphasized over similarity
    • Feels so fabricated, stiff, fake, artificial
  • What counts as a web page?

    • "Should websites be described as individual pages, or as collective sites?" (71)
  • THOUGHTS:

    • It is cheap to store materials digitally. More expensive to leave physical traces, and more so on expensive paper that will last longer. Again, the rich and powerful are more likely to leave their traces
  • 4 issues for producing web archives:

    • 404 not found - websites are short lived

      • "Think of all the broken endnotes and footnotes" (78)
      • PROBLEM = "Archived pages do not resemble a random probability sample" (80)
      • "This is a clear bias toward prominent, well-known and highly-rated webpages. Smaller, less well-known and lower-rated webpages are less likely to be archived"
      • Digital sources are fragile and also preserved unevenly

      "It is a race against time" (81)

      • "The statistics on who produces content can obscure the type of content being produced" (83)
    • walled gardens

      • "Think of all the potential historical data that Google has" (91)
      • Google servers as historical source
    • robots.txt

    • neglect of corporations - APATHY

    • "We do not know what will be considered valuable by a historian decades or centuries from now, and taking a cavalier approach to any historical source is a foolhardy route" (97)

  • THOUGHT: is content only scraped and archived client-side? What about server side? Can we start with the web hosts and have them send snapshots to the archives rather than the archives doing the manual crawls?

    • Representativeness & selection bias
    • Selection criteria for special collections are often opaque and undocumented
    • First step - be cognizant of the biases within a collection (recognizing that some voices are privileged over others) (85)
    • Second step - reassess records that are being collected

    "Even if we do not all use the web equally, and what we do put on the web is not captured systematically, without bias, we still have the potential to know far more about human culture and activity than ever before. Traditional archives are not free from these problems, either: their collections skew towards those who had their hands on the levers of power or who rubbed shoulders with those who did; they were literate, privileged, connected, usually men, and usually white" (88)

    • Although all archives have bias, how can we avoid it MORE
  • How to deal with retroactive removals (94)

  • Using Tor for P2P (100), and VMs to "save" GeoCities

  • "The process of preserving and storage digital cultural heritage is a complicated one. This complexity includes the very definition of a web archive itself, as we see that this brings together the historian's archive, the technologist's archive, the archivist's archive, and the digital humanists' idiosyncratic combination of these" (105)

Chapter 3 - Accessing the Records of Our Lives

  • Historians have largely been absent from the decisions of archiving web sources
  • "The disquieting conclusion that the majority of archived pages might have the potential to be inventions of a web crawler is arresting, requiring a rethinking of how we approach these sources" (111)
    • For example, crawling a single domain might take multiple weeks, so an archived news page might combine an article from one day with a weather image captured on a different day
  • Ironic that we need to use magazines and print sources to see what these old websites looked like (bc of browser incompatibility, etc)

"we need to make these decisions now, creating protocols for how we save, preserve, and provide access to this material" (120)

  • "How a historian can leverage developments n the field of computer science and information retrieval to find new insights in large, vast arrays of text" (124)
  • IMPORTANT - Named entity recognition (NER) as an approach to explore web archives (to mark up and identify people, organizations, locations, and other categories) (126)
  • Metadata matters! Maybe more than content
    • Metadata for historical research
    • Metadata as a proxy for something
    • Makes data more portable (131)
  • "Web popularity or importance does not necessarily correspond with real-world popularity or importance - something any historian has to keep in mind as they parse their data" (134)
  • "Such numbers need to be used with caution, as they are not inherently meaningful" (136)
    • Milligan gives an example of how we can use the number of inbound and outbound links to draw conclusions about how important a website was ... HOWEVER I feel like this is not very rigorously done, or following a rigorous chain of influence. In fact, it feels very handwavy in the midst of a rigorous network-like structure. I think the chain of influence and Bayesian statistics needs to be more rigorously enforced
  • Combination of metadata and content analysis
  • QUESTION: Who builds this technology (the web archives), and should we just blindly trust them? Because that will probably happen
  • "We need to consciously avoid letting technology dictate the historical research agenda" (142)

Chapter Four - Unexpected Needles in Big Haystacks

  • "The Wayback Machine's seeming simplicity helps to conceal some very important decisions and assumptions made behind the scenes. Reconstructing old websites is not a straightforward process" (144)
    • What would help? More documentation?
    • "Imperfect facsimiles of the original" (146) (due to different hardware, HTML standards, browsers)

NEED - "We need to start thinking about accessible research portals" (147)

  • Who is building these?
    • What kind of UI do we need?
    • In a sense, this is a knowledge management problem - a UI easy to search for what you need as a historian
    • Historians need to be the USERS
    • Blockchain version control for each site?
  • National legal deposit = legal obligation to deposit published materials with a national library or cultural organization
    • depends on power of national library
    • e.g., the British Library: all websites registered in the UK must have their data preserved by the British Library
  • Lots of legal restrictions at the national archives, such as privacy issues (e.g., no photos)
    • "Paradoxically, their work is far slower" (162)
    • Or two users cannot access the same content at the same time
    • "Access restrictions" make pages less faithful (browsers render pages statically so dynamic content is eliminated) - Pages are generated as images, with only hyperlinks remaining (163). THIS SOUNDS HORRIBLE
  • Other Countries Outside Europe
    • Examples of legal deposit: Austria, Croatia, Estonia, Finland, Germany, Portugal, New Zealand, Norway, Spain, Slovenia, Sweden
    • Africa: Bibliotheca Alexandrina in Alexandria, Egypt
    • Asia: Singapore (national domain crawls), Japanese Web Archiving Project
      • "China, while devoted to extensive preservation of print culture, also does not have a web archive" (167)
      • "South Korea carries out selective web archiving" THAT'S IT?
        • The footnote leads to: National Library of Korea, "Web Archiving System of the National Library of Korea"
        • CHECK THIS OUT
  • Tools like Archives Unleashed Toolkit, Shine, etc, try to "democratize this process and bring historians into the fold" (170)
  • "The true barrier ... is one of culture. Libraries have recognized the importance of web archives, with legal deposit and other access changes. Now historians must do their part." (170)

Chapter Five - Welcome to Geocities, Population Seven Million

  • FAAV Cycle: Filter - aggregate - analyze - visualize

    • Using this for web archives (183)
  • Webring (like early blockchain?)

    • Expanding Unidirectional Ring of Pages (EUROPa)
  • A PROBLEM: "lack of communication with the authors of these archived websites complicates ideas about informed consent" (197)

    • Ethical conversation

    • 1979 Belmont Report's framework for working with human subjects as basis for Matthew Salganik's approach for digital social scientists when working with online research questions (198)

      • Respect for persons (informed consent when possible)
      • beneficence (maximize benefits and minimize harms)
      • justice (distribute risk and benefits across social groups)
      • respect for law and public interest (transparency to society)
    • Milligan has ethical problems with collecting these stories bc they are "fascinating" but also they "contain personal information that authors may not want associated with their names some twenty years after writing them"

      "Given the importance of the histories of everyday people, of course, it is not ethical to NOT collect these stories; ... they are important counterbalances to stories of the powerful and dominant" (199)

      • Legal vs. ethical
      • "Public information is not always ethical" (201)
    • "The crucial question for researchers working with these archived websites is whether they were created with an expectation of privacy" (202)

      • But..... sources in archives rarely are
    • "Emergent communications technologies are often first used informally, by early adopters who treat them as a sort of extended private space ... we should show some mercy to those early adopters who didn't realize that their blogs or social media would end up being more like a newspaper than a barstool" (205)

    "Historians need to understand context, cultural protocols, and expectations of privacy when using these resources" (205)

    • Milligan struggles with decision between not being able to get informed consent ... and feeling "uncomfortable with leaving the voices of everyday people completely outside the historical record when there is ample opportunity to include them" (206)
      • Ex: GeoCities does "not fit neatly into the public or private sphere" (207)
      • He feels weird reading personal blog posts from 2 decades earlier
      • But QUESTION: does this make a difference if he reads it 5 decades later? or 10 decades later? How much time is enough time for it to "not feel weird or invasive"??????
      • "The sheer scale at work ... There are now millions of diaries, on both the live and archived web, representing an impressive assemblage of everyday voices. But just because it is on the web does not mean that it waives the default expectation of human subjects and participants to privacy" (208)
      • "Historians now have even more power, because we can access the blogs, ruminations, and personal mementos of millions of people" (208)

    "Who is chosen to set the standards for what is ethical and what is unethical" (209)

    • "Ultimately and imperfectly, the onus will thus need to fall on individual researchers to carry out a risk assessment and to consider the context in which the material they are reading was published in" (210)
      • THIS IS TOO SUBJECTIVE
      • We need something more systematic
      • "The centric metric should be 'expectation of privacy'"
        • This is too vague
      • "At the very least, we need to pause and think about our obligations as historians" (210)
        • Which is.... WHAT?
    • THOUGHT: Developing countries only recently being introduced to the internet ... Can we start archiving their stuff EARLY now??? Set up the systems in place so that archiving can be done easier and is built in? Now that we know how short lived so many parts of the web are

Chapter Six - The (Practical) Historian in the Age of Big Data

  • Web archives and accessibility - making it easier to collect data, and making it easier to analyze that data
  • webrecorder.io
  • http://netpreserve.org/web-archiving/ - has history of archiving
  • The point of this book is that web history is not "scientific history" ... human needs to figure out what the results of the algorithms that explore the text at scale MEANS for understanding the human condition (234)
  • Historians can and should learn basic computational skills, like command line
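As a taste of the kind of basic computational skill Milligan has in mind, here is the sort of short script a historian might write to summarize a list of archived URLs by domain (the URLs are made up).

```python
from collections import Counter
from urllib.parse import urlparse

# Hypothetical list of archived URLs, e.g. pulled from a crawl log.
urls = [
    "http://www.geocities.com/area51/page1.html",
    "http://www.geocities.com/hollywood/fan.html",
    "http://home.cern/hypertext/WWW/TheProject.html",
]

# Tally how many captures each domain contributed to the collection.
domains = Counter(urlparse(u).netloc for u in urls)
for domain, n in domains.most_common():
    print(domain, n)
```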

Conclusion

  • History academics mostly still focus on geographic location as specialization ... we need to focus more on method
  • "Arguing with Digital History" paper
  • Historians were not meaningfully involved in the Google Books project, although they should have been; they held only advisory positions. Historians are not good at doing multidisciplinary projects with other disciplines, coauthoring, etc.
    • "Historians need to learn how to work better as part of interdisciplinary teams and must gain greater technical ability and competence themselves" (24)
    • Programming Historian website
  • NOTE: What if archive snapshots could be "locked" for 50-100 years so that by the time someone looks at it no one in it is alive? This way, no one will ask to have their data removed.