As an astrophysicist, I deal with large datasets that can grow by tens or hundreds of GB every day; the next generation of projects will create TBs of data daily. We need archives in astronomy, and are constantly searching ("querying") them in order to select data of interest. Some datasets go public immediately; others are always private; and still others have a proprietary period (such as the standard NASA 1-year) after which the data become public to anyone. And we store lots of additional parameters in the archives, not just the data themselves, which allow for robust searches for unusual astrophysical objects, date ranges, keywords, etc.
Archiving at city hall shouldn't be much different -- it's just a lower bandwidth problem.
Michael Kineavy, the chief of policy and planning in the administration of Mayor Thomas Menino, daily deleted all of the emails in his inbox and then deleted them again from his trash folder before the automated computer backups would run at midnight every day. The problem with his deletions is that the emails are public records which are illegal to destroy under the state's Public Records Law; Kineavy has taken an indefinite, unpaid leave of absence until the matter is resolved by Secretary of the Commonwealth William Galvin and Attorney General Martha Coakley.
On his Media Nation blog, Dan Kennedy pointed to some archival ideas in an article in the lifestyle and tech magazine Blast. The city has already announced that they are no longer solely relying on nightly backups of individual employee desktop computers in order to save copies of all the email; instead, they have started an archival system where email that is sent or received on the city's servers is instantly saved in a centralized database:
"The city implemented a comprehensive program that immediately creates a record of all user messages sent and received no matter what each individual does with them. All journaled messages are kept in the archive for three years."
While I don't know the details of the city's newly deployed archival system, it sounds relatively similar to the Blast article's suggestion.
Lost in this technological solution, however, is how the storage and retrieval system can be designed in a manner that serves the public interest while reducing governmental waste.
The design of the archive is dictated by the "requirements document," i.e., the text of the state's Public Records Law. The whole current system is arcane, to put it mildly. In this 21st century, we could easily require through state law that electronic public records are archived and served to the public in a seamless, cost-effective, and transparent manner.
An Archive Design to Serve Public Documents in Real Time
A few points about the state's Public Records Law, and how it might be addressed in designing a transparent electronic archive:
- Some public records can be withheld from release according to 16 exemptions listed in the PRL. These exemptions range from personnel matters to personal information (home address of a public employee) to pending litigation. Any archival system should include a classification tag as to whether the document is fully public, or if it is exempt (along with the particular classification).
- Documents (letters, emails, etc.) become public records the moment they are created or received. The logical time to classify those documents as either public or exempt is when they are first read by a person. The email application (MS Outlook, Thunderbird, Evolution, etc.) could be configured so that a user must always classify the email as public or exempt before moving on to the next email in the inbox; that classification then gets stored in the archive.
- Once a document has been made public it is always public. If someone accidentally released a document, then the person must continue to release it. (No Uruguay Round of copyright law where works in the public domain can be reverted back into copyright.) Once a document is released, it can be tagged in the archive as such, freeing it from ever needing another review.
- State law does not require electronic documents to be provided in electronic form. The city has delivered Kineavy's 5,018 emails in paper form, then scanned the paper versions, and finally put the PDF scans on the internet. This is time-consuming, wasteful of paper, and annoying since the PDF documents are no longer electronically searchable -- a 20th century way that government offers the public its middle finger.
- The cost of searching and reviewing documents in response to a public records request can be charged to the requester. This is the most frustrating part of the PRL, since some records requests generate estimated costs of thousands -- or hundreds of thousands -- of dollars mostly in the charge to review all those documents one-by-one. If the search could be done using automated tools available to anyone -- inside or outside government -- then many man-hours could be saved; the member of the public now wastes his or her own time using the search tool, while the government employees are left to go get a cup of coffee instead. Or do their other job duties. The search tools are easily designed so that they only release documents that are public, and can generate summary lists of emails that are being withheld because they are exempt.
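The tagging rules in the points above can be sketched as a small record type. This is only an illustration, not the city's system: the names `Status`, `ArchivedRecord`, `classify`, `mark_released`, and `needs_review` are hypothetical, and the exemption is stored as a free-text label rather than a reference to the actual list in the PRL.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    UNCLASSIFIED = "unclassified"
    PUBLIC = "public"
    EXEMPT = "exempt"


@dataclass
class ArchivedRecord:
    record_id: str
    status: Status = Status.UNCLASSIFIED
    exemption: Optional[str] = None  # hypothetical label for the exemption claimed
    released: bool = False           # True once the record has ever been made public

    def classify(self, status: Status, exemption: Optional[str] = None) -> None:
        """Tag the record as public or exempt."""
        if status is Status.UNCLASSIFIED:
            raise ValueError("a record cannot be returned to unclassified")
        if self.released and status is not Status.PUBLIC:
            raise ValueError("a released record stays public")
        if status is Status.EXEMPT and not exemption:
            raise ValueError("an exempt record must name its exemption")
        self.status = status
        self.exemption = exemption if status is Status.EXEMPT else None

    def mark_released(self) -> None:
        """Record that the document was released, deliberately or by accident."""
        self.status = Status.PUBLIC
        self.exemption = None
        self.released = True

    def needs_review(self) -> bool:
        """A released record never needs another review."""
        return not self.released
```

Keeping `released` as a separate flag from `status` is one way to make "once public, always public" a rule the archive enforces rather than a convention people have to remember.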
A government employee: receives an email; clicks on it to read it; when he clicks on the next email, a dialog window asks him to classify the previous one first as either public or exempt; and so on. (The same goes for outgoing email: the employee has to classify it before the mail system will send it out to its recipients.) When the email is received by the city hall server, it is immediately archived; when the employee classifies it, the classification is immediately entered into the city hall archive. If the email is a public document, then it is served up on the internet immediately.
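The employee's side of this could look roughly like the following sketch. Everything here is an assumption made for illustration: `Archive` and `Inbox` are hypothetical in-memory stand-ins, not a real mail server or mail-client API, and "serving on the internet" is reduced to a list of published message ids.

```python
class Archive:
    """In-memory stand-in for the city hall archive (hypothetical)."""

    def __init__(self):
        self.messages = {}   # message id -> body, stored on arrival
        self.labels = {}     # message id -> "public" or "exempt"
        self.published = []  # ids served on the public website

    def journal(self, msg_id, body):
        # Archived the moment the server receives it, before anyone reads it.
        self.messages[msg_id] = body

    def classify(self, msg_id, label):
        if label not in ("public", "exempt"):
            raise ValueError("label must be 'public' or 'exempt'")
        self.labels[msg_id] = label
        if label == "public" and msg_id not in self.published:
            self.published.append(msg_id)


class Inbox:
    """Mail client that blocks the next message until the last one is classified."""

    def __init__(self, archive):
        self.archive = archive
        self.pending = None  # id of the message read but not yet classified

    def open(self, msg_id):
        if self.pending is not None:
            raise RuntimeError(f"classify {self.pending} before moving on")
        body = self.archive.messages[msg_id]
        self.pending = msg_id
        return body

    def classify_current(self, label):
        if self.pending is None:
            raise RuntimeError("no message is open")
        self.archive.classify(self.pending, label)
        self.pending = None
```

Note that `journal` runs before the employee ever sees the message, so deleting it from the inbox afterward has no effect on the archived copy.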
A member of the public: wants to obtain some public documents; goes to the city's website; runs a search using a set of criteria (date range; email to/from/cc name[s]; keyword search; etc.); inspects the two sets of documents returned by the search engine (the public records resulting from the query and a summary table of the exempt documents); and downloads the documents he wants or starts a new query. No government employee has to waste a single moment of time responding to the public records request. Now that is efficiency in government!
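A minimal sketch of such a search, under stated assumptions: the `Email` fields, the `search` function, and the contents of the withheld summary (id, date, and exemption only) are hypothetical choices for illustration, and a real archive would query a database rather than loop over a list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Email:
    msg_id: str
    sent: date
    sender: str
    subject: str
    body: str
    status: str                      # "public", "exempt", or "unclassified"
    exemption: Optional[str] = None  # set only when status is "exempt"


def search(emails, keyword=None, sender=None, start=None, end=None):
    """Return (public_matches, withheld_summary) for the given criteria."""
    public, withheld = [], []
    for e in emails:
        if sender is not None and e.sender != sender:
            continue
        if start is not None and e.sent < start:
            continue
        if end is not None and e.sent > end:
            continue
        text = (e.subject + " " + e.body).lower()
        if keyword is not None and keyword.lower() not in text:
            continue
        if e.status == "public":
            public.append(e)
        elif e.status == "exempt":
            # Only metadata leaves the archive for a withheld message.
            withheld.append(
                {"msg_id": e.msg_id, "sent": e.sent, "exemption": e.exemption}
            )
        # Unclassified messages are skipped here; how to report them is left open.
    return public, withheld
```

The first return value holds the full public emails and the second is the summary table of withheld ones. Matching keywords against the text of exempt emails does reveal that a withheld message mentions the term, so a real design would have to decide whether that is acceptable.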
What are the obvious holes in such a 21st century system? It has to be designed with the idiot government employee in mind. I don't mean to imply that all or most government employees are idiots, but that any system has to be designed in a way that the least clever employee can't release everyone's health records onto the internet. I can also imagine that some employees might think twice about sending so many three-word emails when they have to classify each one before it is sent.
All the employees will complain at first about how much time it takes for them to read their email and classify each one when they're done -- although it will become quick and easy-to-do in no time. Nearly all the emails are public, not exempt, so it's usually pretty routine. (97.5% of Kineavy's emails that were found were released, with only 2.5% being withheld due to one or more exemptions listed in the PRL.)
And employees who like to make rude comments about members of the public will think twice before doing so in an email. Actually, that's a good thing.
I am under no illusion that any politician would be so crazy as to vote yes for a law that would require government electronic records to be served up in real time on government web servers. Remember: the state legislature has already exempted itself from the state's Public Records Law. But it is a neat thing to think about.
In this Year of Astronomy celebrating the 400th anniversary of Galileo pointing his telescope at the heavens -- and President Barack Obama gazing at Jupiter from the White House lawn -- city hall might strive to reach the heavens by creating a transparent government whose electronic public records are served seamlessly to the citizenry.
Image of a desk inside the National Archives in the 1950s by C. Getts provided by The U.S. National Archives, and image of the National Archives by marttj, both provided through a Creative Commons license.