Let’s Preserve Government Data Before It’s Too Late!

This has been one hell of a bumpy month, and I have a lot I could scream, er, talk about, but for the moment, let’s talk data.

The US Government has spoiled us in recent years with the amount of public data and information available: NIH studies, wastewater virus shedding data, COVID-19 impacts, climate trends and forecasts, all kinds of things. I used some of it for my COVID reporting over the past four years.

And in recent weeks, the US Government has demanded that some of this data be purged or altered to fit the whims of the new White House.

A bit too 1984 for my tastes. A bit too Fahrenheit 451.

This is not the first time data has disappeared under a new administration, and it isn’t limited to one party, though what’s happening now is scary.

But as they say, look to the helpers. There are several massive efforts underway to archive as much data as possible so it’s not lost forever.

Today, I set up Archive Team’s Warrior, which automates the collaboration around spidering, downloading, processing, and uploading data from governmental sites (and others) to archive.org. All it takes is some bandwidth (okay, a fair amount of bandwidth — looking at 1TB/month right now), some hard drive space, and some CPU cycles, and I can help with this archiving project. It’s excellent, easy to set up, fun to watch, and requires virtually no work on the user’s end. They provide VMs and Docker images (I chose Docker), and once installed, it’s self-managing.
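For anyone wondering what roughly 1TB/month means for a home connection, here is a quick back-of-the-envelope sketch. It assumes a decimal terabyte, a 30-day month, and traffic spread evenly over time; real Warrior traffic will be burstier, and the function names are just ones I made up for illustration.

```python
# Rough bandwidth budgeting for a monthly transfer total.
# Assumptions (not from Archive Team): decimal units, a 30-day month,
# and traffic spread evenly across the whole month.

SECONDS_PER_DAY = 24 * 60 * 60
ONE_TB = 1_000_000_000_000  # bytes in a decimal terabyte


def average_rate_mbps(bytes_per_month: float, days_in_month: int = 30) -> float:
    """Average sustained rate in megabits per second."""
    seconds = days_in_month * SECONDS_PER_DAY
    return bytes_per_month * 8 / seconds / 1_000_000


def daily_volume_gb(bytes_per_month: float, days_in_month: int = 30) -> float:
    """Average volume per day in decimal gigabytes."""
    return bytes_per_month / days_in_month / 1_000_000_000


if __name__ == "__main__":
    print(f"Average rate: {average_rate_mbps(ONE_TB):.2f} Mbit/s")
    print(f"Per day:      {daily_volume_gb(ONE_TB):.1f} GB")
```

Under those assumptions, 1TB/month averages out to about 3 Mbit/s, or around 33 GB a day, which is worth checking against any data cap your ISP imposes.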

I’m exploring more of what’s out there for data preservation, and thinking about how I can get involved. There are a few really interesting resources out there, including:

  • GovDiff: See the differences in governmental information and resources before and after this administration began its... work.
  • /r/DataHoarder on Reddit: A group of people working to collect and archive data of all kinds.
  • End of Term Archive Project: Captures US Government sites after presidential terms end.
  • Data Rescue Efforts by Lynda M. Kellam (archived link): A whole collection of sites worth exploring.
  • Archive.org, which hopefully you know about already, and which must be preserved at all costs.

Also of note, CDC Datasets prior to January 28th, 2025 (nearly 100GB worth), which I’ll be archiving myself.

No doubt, lots of data will be lost to time. I can only hope there are enough people at these agencies who quietly, discreetly backed up and sent off what they could before this all went down. Either way, the fact that so many people can join in on the data archiving effort today is incredible, and I hope anyone out there with the resources to spare will take a moment to set up Archive Team’s Warrior and contribute to the effort.