Apr 13, 2014 @ 5:47 pm

In 2001, The Washington Post broke a big story. Dozens of children in the District of Columbia’s foster care system had died in cases where government agencies and workers were at fault, either through failing to take preventative action or by placing the children in unsafe homes. The story, “The District’s Lost Children,” won a Pulitzer Prize. More importantly, it drew necessary attention to a flaw in the way D.C. handled foster care cases.

Sarah Cohen was on the Washington Post team that spent a year investigating and sifting through the records of 180 children who died after somehow coming to the attention of the foster care system. Cohen recalls the massive amount of time spent deciphering documents. It would have been helpful to have a computer read the information, but that simply wasn’t possible. The documents were scanned PDFs of forms filled out by hand. The handwriting was at times hard to read. In other instances, the writing would extend sideways up the margin of the paper or the response wouldn’t logically make sense.

Now, thirteen years later, technology has advanced. Optical character recognition (OCR) enables computers to transcribe and record many documents, some of them even handwritten. However, OCR technologies still fail on the types of documents Cohen’s team used in the Lost Children investigation. With investigative bureaus shrinking across media organizations, reporters have less time to spend looking through documents.

We first learned about the Lost Children article while pursuing our master’s degrees at the Columbia Graduate School of Journalism. A professor, Susan McGregor, recounted Cohen’s dilemma with the documents and posed to us a challenge: create a platform that would help investigative journalists unlock the data trapped in these difficult PDF documents. We saw crowdsourcing as a potential solution. By leveraging the time of others who were not journalists, but were invested in the stories waiting to be told, we could help investigative journalists decipher that data. At the same time, we could increase engagement between citizens and professional journalists.

Thanks to the Freedom of Information Act (FOIA), investigative journalists are able to access documents from a wide variety of government agencies and sources. However, these documents are often provided in inconvenient formats. There are stories waiting to be told across the globe that would benefit from easier, quicker access to the data in PDFs. In real-life investigative stories, documents can include handwriting, poorly scanned areas and redactions. Any of these quirks can make it impossible for OCR technologies to extract textual data. What this means is that reporters have to painstakingly go through each document one by one. The overall format of the documents in a set can differ as well, or be similar but just different enough to confuse a computer. For example, one journalist shared the horror story of receiving documents in the form of spreadsheets that were printed out and then scanned. This effectively transformed the easiest formats a machine can read (csv, xls) into a mess.

*
There are a number of past and ongoing projects that pull out text from image-based documents:

DocHive by Charles and Edward Duncan of Raleigh Public Record extracts structured textual information from documents with a consistent format across pages, specifically PDFs of forms that were digitally produced, printed and scanned. The application processes page images through ImageMagick, then uses OCR to automatically read in content from user-designated areas of each page. Currently it does not handle handwriting.
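
For a rough sense of how this kind of region-based extraction works, here is a minimal Python sketch (this is not DocHive’s code; it assumes Pillow and pytesseract are installed, and the field coordinates are made up):

```python
# Minimal sketch of region-based OCR in the spirit of form-extraction tools
# like DocHive (not their actual code). Assumes Pillow and pytesseract are
# installed and Tesseract is on the PATH; field coordinates are hypothetical.
from PIL import Image
import pytesseract

# Hypothetical (left, upper, right, lower) pixel boxes for fields on a form
# whose layout is the same on every page.
FIELDS = {
    "case_number": (100, 80, 400, 120),
    "date_filed": (450, 80, 700, 120),
}

def extract_fields(page_image_path):
    """OCR each designated region of a scanned form page."""
    page = Image.open(page_image_path)
    values = {}
    for name, box in FIELDS.items():
        region = page.crop(box)
        values[name] = pytesseract.image_to_string(region).strip()
    return values

print(extract_fields("page_001.png"))
```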

Zooniverse is an online citizen science project portal that invites the public to annotate, filter, rank, or transcribe scientific records. Among their transcription projects are Old Weather, which digitized weather observations from ship logs dating from the mid-19th century, and more recently Notes from Nature, which transcribed biodiversity data from natural history museum records. They have an active blog that explains, among other things, the technological background of their projects, and they have also released their Scribe transcription framework on GitHub.

Similarly, The New York Public Library Labs carries out digital library projects that frequently enlist members of the public to transcribe or verify information in library collections. Notable examples include What’s on the Menu and Ensemble, which transcribe historical restaurant menu collections and performing arts programs, respectively. These projects are great examples of crowdsourcing transcriptions for documents with very loose formats that are not necessarily tabular.

While projects by Zooniverse and NYPL Labs are not journalistic in the conventional sense, Free the Files is a prime example of how crowdsourcing document transcriptions can be used in investigative journalism. Records of spending in the 2012 presidential and congressional elections by outside groups like super PACs and nonprofits with secret donors were filed at TV stations across the country; ProPublica turned to its readers to extract and structure the records’ content. Many features of Free the Files were specific to this particular project, including its attention to geolocation data. ProPublica later open-sourced the core functionality as Transcribable.

The Reporters’ Lab also coaches journalists on how to leverage crowdsourcing services such as Amazon’s Mechanical Turk and FromThePage to transcribe contents of documents and audio recordings.

Building on the wisdom of these projects, InfoScribe seeks to be a crowdsourcing transcription platform specifically for journalists but general enough to handle different projects. Even from a purely functional standpoint, it is necessary to have human eyeballs on pages to account for handwritten information and possible format variation. But, as was the case for Free the Files, perhaps the greater benefit of crowdsourcing is the community that forms around the content of the documents. We want a single web service where users can engage in a continuous dialog about investigative stories.

InfoScribe’s journalistic bent also necessitates certain features: Many existing non-journalistic projects do not appear to perform automatic validation, which is crucial for time-constrained journalistic investigations. Also important is an interface to monitor the transcription’s progress, as stories can arise even as the documents are being transcribed.
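
To give a sense of what we mean by automatic validation, here is a simplified sketch (not our final algorithm): a field counts as verified once enough independent transcribers agree on a normalized value.

```python
# Simplified sketch of validation by agreement (not InfoScribe's final
# algorithm): accept a field once a minimum number of independent
# transcribers submit the same normalized value.
from collections import Counter

def validate_field(entries, min_agreement=2):
    """Return the agreed-upon value, or None if the field needs more work."""
    normalized = [e.strip().lower() for e in entries if e and e.strip()]
    if not normalized:
        return None
    value, count = Counter(normalized).most_common(1)[0]
    return value if count >= min_agreement else None

# Two of three transcribers agree, so the field is considered verified.
print(validate_field(["Smith, John", "smith, john ", "Smth, John"]))
```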

**
We are a team of two graduate students with complementary skills. As we build InfoScribe, we have assigned our roles accordingly:

Madeline’s role is to develop a crowdsourcing strategy that builds a community around InfoScribe, creating an experience that benefits both the media organizations and the transcribers. What makes a crowdsourcing platform “sticky”? How can we make the process seamless and fun for our transcribers? These are the questions Madeline seeks to answer, through case studies of and interviews with other crowdsourcing platforms that have been successful. Additionally, Madeline is leading the user studies and collecting data on user experience that informs the design of the platform.

Aram’s role is to build the basic uploading and transcribing interface as well as the back-end functionality of assigning documents and validating transcriptions. In particular, assigning a document to a user must take into account both user and document information, selecting only documents the user hasn’t seen yet and that are still under-transcribed. What decides which documents are under-transcribed? This is where automatic validation comes in: it not only reduces work for the uploader but also determines which documents have entries that need more transcribing.
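
As a rough illustration of that assignment logic, here is a sketch (the names, redundancy threshold and in-memory filtering are placeholders, not the eventual Datastore implementation):

```python
# Illustrative sketch of document assignment (placeholder names and
# thresholds, not the eventual Datastore code): prefer documents the user
# hasn't seen yet that still lack enough transcriptions.
TARGET_TRANSCRIPTIONS = 3  # hypothetical redundancy target per document

def assign_documents(seen_document_ids, documents, batch_size=20):
    """Pick under-transcribed documents the user hasn't worked on yet."""
    candidates = [
        d for d in documents
        if d["id"] not in seen_document_ids
        and d["transcription_count"] < TARGET_TRANSCRIPTIONS
    ]
    # Serve the most under-transcribed documents first.
    candidates.sort(key=lambda d: d["transcription_count"])
    return candidates[:batch_size]
```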

***

As mentioned, InfoScribe is a project that stands on the shoulders of those who have come before us. Crowdsourcing has been harnessed effectively to help with everything from digitizing library collections to mapping disaster zones to funding new and innovative projects. While designing the platform for InfoScribe, we are learning from established crowdsourcing platforms, like the aforementioned NYPL projects, Kickstarter and others. In addition, we are conducting extensive user studies to figure out how to make InfoScribe a satisfying experience for both journalists and the transcribers who are helping them. We are continually incorporating these findings into our design. Through cheap paper prototyping and revision, we are able to save future time and energy by avoiding costly code revisions.

We made some unexpected discoveries through this process: We had originally thought crediting transcribers would serve as a great motivation for them to participate, but interviews uncovered that in similar past projects, some power users did not want their names published. Now we will be sure to communicate that users don’t have to be credited if they don’t want to.

Our next step is to finalize the structure of our user and document information within the confines of Google App Engine’s Datastore. The main consideration is to find the best structure that allows non-costly document assignment and transcription validation. For the time being, the front-end interface will focus more on function than form, incorporating knowledge from user studies of the prototype. These user studies are ongoing and will continue to help inform our design.
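
One direction we are considering, sketched in ndb (the kinds and property names below are provisional and likely to change):

```python
# Provisional sketch of the main Datastore kinds in ndb; property names
# are placeholders and will change as the design settles.
from google.appengine.ext import ndb

class Project(ndb.Model):
    title = ndb.StringProperty(required=True)
    owner = ndb.KeyProperty(kind="Journalist")
    finished = ndb.BooleanProperty(default=False)
    progress = ndb.FloatProperty(default=0.0)

class Document(ndb.Model):
    project = ndb.KeyProperty(kind=Project)
    page_number = ndb.IntegerProperty()
    transcription_count = ndb.IntegerProperty(default=0)
    verified = ndb.BooleanProperty(default=False)

class Transcription(ndb.Model):
    document = ndb.KeyProperty(kind=Document)
    scriber = ndb.KeyProperty(kind="Scriber")
    values = ndb.JsonProperty()  # field name -> transcribed text
    created = ndb.DateTimeProperty(auto_now_add=True)
```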

We are also continuing to work on case studies of other crowdsourcing platforms, such as Zooniverse and NYPL. We seek to publish these case studies so that others can benefit from the best practices and information we have gathered. As we gear up for a pilot run at The New York World, we will begin to work with the document set that they need transcribed and gather an initial community of Columbia graduate students to test run the platform as transcribers.


Mar 25, 2014 @ 4:04 pm

Crowdsourcing - Insights from Trevor Owens

While Aram builds our platform, I’ve been tackling another difficult, though far less technical, challenge: how to build a crowdsourcing system that is sticky, effective and satisfying for both the InfoScribers and our journalist partners.

The ongoing user studies play a key role in our research - and I’ll be continuing to post on those as we progress. But we’re also examining the lessons that have already been learned in the field. We are by no means the first to launch a crowdsourcing platform with the goal of improving society and there’s a lot we can learn from those who have paved the way. Unfortunately, there’s no GitHub for best practices for crowdsourcing (at least not that I have found - if you know of one please email and let us know!). So I’ve resorted to the primary research we rely on so often in journalism: interviews.

I’m building a series of case studies of successful “altruistic crowdsourcing” campaigns. I wanted to talk with the man whose name kept surfacing in my research: Trevor Owens.

Trevor Owens (www.trevorowens.org) is officially at the Library of Congress, where he is a digital archivist and strategist. In his free time, Trevor also follows the developments surrounding online communities and cultural institutions. We see parallels between cultural institutions and organizations conducting investigative journalism - both suffer from shrinking resources and both have a mission to fulfill for the public good.

In one of his papers, Trevor talks about how “crowdsourcing” should really be called “community-sourcing.” In project after project, we see a repeated pattern. A small, galvanized group of followers do the vast majority of the work. I was interested in exploring this idea further with him. How do you find those die-hard contributors? Are there any patterns in how they interact or what they want from a platform?

In the course of our interview, Trevor shared a few observations that seemed especially relevant for InfoScribe:

1. It’s critical to reduce friction, especially for first-time users. This resonated with me because several of the users in our user studies have been overwhelmed or confused by our current design for the landing page.

2. Power users are the key. You want to cast a wide-enough net to attract and find your power users, then you want to do everything possible to keep them engaged.

3. For a truly successful crowdsourcing project, the completed project (in our case, the transcribed documents) should be just one of the objectives. Trevor pointed out that many of the most successful crowdsourcing projects gave participants new ways to engage with the material. These projects strengthened the bond between participants and the host institution.


Mar 14, 2014 @ 10:41 pm

Querying the Datastore (Part II)

Part II was going to be about how to write Datastore queries and set up indexes, but that was becoming a rather massive and repetitive post no one really needs; the documentation may be murky on what queries are allowed, but it is reasonably accessible on how to write them. Instead, this post covers the queries needed for each component of the application, which in turn determine the properties of each Datastore kind. (A short ndb sketch of a couple of these queries follows the list.)

These are the screens that make up the application (from our paper prototype):

Homepage
*Retrieve the 10 Scribers with the highest counts of transcribed documents [to display as the leaderboard].

Scriber’s Dashboard Screen
*For the current user (a Scriber), retrieve his/her score [to display as badges/warnings].
*For the current user (a Scriber), retrieve Projects that he/she worked on and are finished [to display on “trophies”].
*For the current user (a Scriber), retrieve Projects that he/she worked on and are ongoing.
*For the current user (a Scriber), retrieve Projects that he/she would be interested in and are newly added.
*For the current user (a Scriber), retrieve Projects that he/she would be interested in and are under-transcribed.
*Retrieve all other Projects, with under-transcribed Projects listed first.

Scriber’s Transcribing Screen
*For the selected Project, retrieve 20 documents that have fields that haven’t been verified.
*(If the Scriber opts for a random project, select the Project that is most under-transcribed.)
*Create Transcription.

Scriber’s Congrats Screen
*Perform validation. Update Project’s progress. Update Scriber’s score.

Journalist’s Dashboard Screen
*For the current user (a Journalist), retrieve the Projects he/she set up, ordered ongoing to finished and new to old [searchable].

Journalist’s Project Monitor Screen
*For the current user (a Journalist) and the selected Project, retrieve validated Transcriptions.
*If the Journalist decides to shut down the Project, set finished=True & delete all related Transcriptions.

Journalist’s Upload Screen
Journalist’s Descriptions Screen
Journalist’s Field Selection Screen
Journalist’s Congrats Screen
*Create Project.
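
To make a couple of the queries above concrete, here is a rough ndb sketch of the homepage leaderboard and the transcribing-screen fetch (the kinds and property names are provisional placeholders):

```python
# Rough ndb sketch of two of the queries above; the kinds and property
# names are placeholders, not the final schema.
from google.appengine.ext import ndb

class Scriber(ndb.Model):
    display_name = ndb.StringProperty()
    documents_transcribed = ndb.IntegerProperty(default=0)

class Document(ndb.Model):
    project = ndb.KeyProperty(kind="Project")
    verified = ndb.BooleanProperty(default=False)

def leaderboard():
    """Homepage: the 10 Scribers who have transcribed the most documents."""
    return Scriber.query().order(-Scriber.documents_transcribed).fetch(10)

def documents_to_transcribe(project_key):
    """Transcribing screen: 20 unverified Documents from the selected Project."""
    return Document.query(Document.project == project_key,
                          Document.verified == False).fetch(20)
```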


Mar 12, 2014 @ 5:44 pm

Querying the Datastore (Part I)

If you’re used to traditional relational databases, using Google App Engine’s datastore is a bit like being Alice in Wonderland: The usual rules don’t apply, which isn’t to say it doesn’t follow (somewhat counterintuitive) rules of its own. This is a summary post for later reference, enumerating all possible queries and, in Part II, how to write them and what to expect in return. Note that all possible queries can be enumerated; the datastore supports only a limited number of simple queries for the sake of web speed.

Note 1: Okay, so these aren’t ALL POSSIBLE queries. I left out the ones I can’t imagine why anyone would ever want to do, like queries asking for DESC key order.
Note 2: As far as I can tell, all queries in google.appengine.ext.db also work for google.appengine.ext.ndb, with slightly different syntax.
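
For example, here is the same equality-filter query written against both APIs, on a hypothetical Greeting kind:

```python
# The same equality-filter query in ext.db and ext.ndb, using a
# hypothetical Greeting kind with an author property.
from google.appengine.ext import db, ndb

class GreetingDb(db.Model):
    author = db.StringProperty()

class GreetingNdb(ndb.Model):
    author = ndb.StringProperty()

# ext.db: the property name and operator are passed as a string.
db_results = GreetingDb.all().filter("author =", "alice").fetch(10)

# ext.ndb: the filter is built from the model class's property object.
ndb_results = GreetingNdb.query(GreetingNdb.author == "alice").fetch(10)
```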

I. kindless entity query

1. filter on entity key
  a. 1 equality(=) filter* —Possible, but use get() instead
  b. 1-or-more inequality(>,>=,<,<=) filter(s) [in “key order,” by default]*

II. entity query (on 1 kind)

0. no filter or sort order
  a. all entities of a kind [in “key order,” by default]*

1. filter on entity key
  a. 1 equality(=) filter* —Possible, but use get() instead
  b. 1-or-more inequality(>,>=,<,<=) filter(s) [in “key order,” by default]*

2. filter and/or sort on 1 single-valued property
  a. all entities of a kind in 1 ASC/DESC property value order*
  b. 1 equality(=) filter [in “key order,” by default]*
  c. 1-or-more inequality(>,>=,<,<=) filter(s) [in ASC property value order, by default]*
  d. 1-or-more inequality(>,>=,<,<=) filter(s) in DESC property value order*

3. filter and/or sort on 1 multiple-valued property
  a. all entities of a kind in 1 ASC/DESC property value order
  b. 1-or-more equality(=) filter(s) [in “key order,” by default]
  c. 1-or-more inequality(>,>=,<,<=) filter(s) [in ASC property value order, by default]
  d. 1-or-more inequality(>,>=,<,<=) filter(s) in DESC property value order

4. filter and/or sort on 2-or-more single-valued properties
  a. all entities of a kind in ASC/DESC property value orders on 2-or-more properties**
  b. equality(=) filter(s) on 2-or-more properties [in “key order,” by default]*
  c. equality(=) filter(s) on 1-or-more property(ies) & inequality(>,>=,<,<=) filter(s) on 1 property [in ASC property value order of the property with the inequality, by default]**
  d. inequality(>,>=,<,<=) filter(s) on 1 property in ASC/DESC sort order(s) on 1-or-more property(ies)**

5. filter and/or sort on 2-or-more single-valued/multiple-valued properties
  —Need to check how these work

III. keys-only query (on 1 kind)
  —Similar to II, but specify you only want __key__

IV. projection query (on 1 kind)
  —Similar to II, but specify the names of the properties you want**

*You don’t have to configure index.yaml for these queries; the datastore creates the index for you automatically.
**You have to configure index.yaml for these.
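
For instance, a category 4c query in ndb might look like the sketch below (the kind and properties are hypothetical), along with the rough shape of the index.yaml entry it needs:

```python
# A category 4c query: equality filter on one property plus an inequality
# filter on another, which requires a composite index in index.yaml.
# The Document kind and its properties here are hypothetical.
from google.appengine.ext import ndb

class Document(ndb.Model):
    project_id = ndb.StringProperty()
    transcription_count = ndb.IntegerProperty(default=0)

# Results come back in ascending order of the inequality property by default.
under_transcribed = Document.query(
    Document.project_id == "lost-children",
    Document.transcription_count < 3).fetch(20)

# The matching index.yaml entry would look roughly like:
#
# indexes:
# - kind: Document
#   properties:
#   - name: project_id
#   - name: transcription_count
```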

(…I’m going to read this over to check for mistakes, but after a break.)


Feb 25, 2014 @ 1:25 pm

I feel like I’m always scrambling to learn something new for InfoScribe, so here’s a short post to keep track.

First it was Ruby on Rails. Zooniverse’s Scribe and ProPublica’s Free the Files, notable transcription crowdsourcing applications that we’re expanding on for general investigative journalism projects, are both written in Rails. It seemed only natural to learn it, if only to understand these systems, if not to actually develop InfoScribe using the same framework. I was learning the language (Ruby) and the framework (Rails) at the same time, and it took me a while to get the hang of it. While a lot of people seem to cite scaffolding as a good way to break into Rails, for me it only started to make sense after I learned to create the individual components separately. I practiced by rewriting some of my old projects in Rails.

Then came AngularJS. Not only had I used it for a final team project in my user interface design class and found it very reasonable, but our application for Google’s Computational Journalism grant had also gone through. At a stage when I wasn’t sure what changes that would bring, a JavaScript framework seemed like a safe choice that would be supported on any platform. I was adamant that I should have a thorough grasp of AngularJS before I started applying it. Back when my team was working on our final project, our philosophy, if we can even call it that, was to try something and if it works, move on. Very soon, we were going around in circles fixing one bug and watching everything else break. We managed to patch our way through the project, but the lesson was learned.

The latest is Google App Engine and its Datastore. It turned out Google Cloud Services credits are indeed part of our grant, and App Engine, although not exactly intuitive, comes with enough support and flexibility to make it enjoyable to play around with. On the other hand, I’ve been somewhat reluctant to pick up Datastore, which I knew to be somewhat different from traditional databases. I was briefly excited about the graph database Neo4j, which boasts an intuitive node and edge structure and easy migration, but despite scattered success stories online, I found it more trouble than it was worth to make it work with App Engine. I resigned myself to learning Datastore, only to find that its completely indexed queries might actually be very fast.


Feb 20, 2014 @ 2:31 pm

InfoScribe Awarded Google Computational Journalism Grant!

We’ve got big news at InfoScribe. We’ve been wanting to share it for quite a while but had to wait until it was official. InfoScribe, with Professor Susan McGregor, is the proud recipient of a Google Computational Journalism Award.

What does this mean for the project? We get $20,000 of credits for Google Cloud Services, giving us plenty of room to grow as our platform develops and we attract users. We also get $60,000 of funding to continue our user research, platform development and partnership-building. 

Receiving this vote of confidence from Google means a lot to us. Stay tuned to see how we turn the dollars into document liberation.

-MR


Jan 7, 2014 @ 9:01 pm

User Testing

If there’s anything we learned, it’s that it’s really hard to get people to user test our stuff.

Madeline and I met a little early to add finishing touches to our paper prototypes for InfoScribe. We also had our use scenarios ready. In a user testing session, each tester is assigned a specific use scenario describing their relationship to the application and the task to be performed. Obviously with InfoScribe, you could be one of three things:

  • You’re a journalist, with documents in need of transcribing.
  • You’re an InfoScriber, ready to transcribe documents for the betterment of society.
  • You’re a random web surfer, who’s stumbled upon the URL and wants to explore.

Your familiarity with the app would also vary:

  • You already have an account.
  • You know vaguely what it’s about, although you’ve never used it.
  • You have no idea what it is.

Excluding the one impossible scenario—someone who’s just stumbled onto InfoScribe would not already have an account—we had eight to choose from.

Everything was going well at first. We posted on the Journalism School’s Class of ’14 Facebook group. We successfully ordered pizza after a brief struggle with an online ordering service (and vowed InfoScribe’s sign up process would be less annoying). We put up a sign that read “User Testing: Get Pizza & Good Karma.” I might add this was also when we decided on the term “InfoScribers.” (It seems so obvious now, but for a long time, we were at a loss for a friendly term to call our transcribers. “Transcribers” sounds strangely functional and detached; “citizens” invokes “citizen journalism” but has odd legal connotations: “Are you a citizen?” “No, I’m a legal alien.”)

We quickly realized we should have recruited at least some people beforehand. Where we were, the Stabile Center on the Journalism School’s ground floor, was hardly empty, but everyone was there for something else. It didn’t help that we both took our reporting classes with the class of ’13 (we’re both dual degree students), so neither of us is very familiar with the current batch of students. The rare acquaintances we did try to grab were sadly rushing off to other engagements.

In the end, we managed to get hold of Jonathan Stray, Tow Fellow and computational journalism professor, as he was leaving. After a brief rundown, he went through the two use scenarios we decided were the most likely cases for our initial users: a journalist and an InfoScriber, each of whom has heard of InfoScribe but never used it before. We were lucky to get Jonathan, as he’s no stranger to journalism app UIs, and he provided ample feedback about where the interface was unclear or misleading or what he thought was good.

The two major points I got out of his testing session were that it should be made clearer that the comments provided by our InfoScribers will be read directly by journalists, and that we should, in the long run, expect to handle a far greater variety of PDF page formats than we had anticipated, sometimes within a single file. One of the monstrous PDF examples he showed us over pizza was the Department of Defense’s budget submission file, a single PDF of more than 600 pages with no consistent page format.

It turned out to be a fruitful session in the end. We look forward to trapping more unsuspecting souls into user testing for us.


Jan 5, 2014 @ 5:34 am

Paper Prototyping

During the semester, Madeline and I got together to make paper prototypes for InfoScribe. 

[Photo: Hard at work (Sort of. I’m taking the picture with my other hand.)]

Paper prototyping is exactly what it says on the tin: It’s a user interface design technique of using paper mockups of the expected final product to test user behavior. Transitions in the interface are performed by hand, sliding bits of paper in and out (or, when that’s physically difficult to do, explained verbally to the tester). It’s quick, low-cost and easy to modify, and prevents the awkward situation where hours of coding go to waste or become subject to patchy revisions because the designers failed to predict certain user behavior.

I was awakened to the wisdom of this process through Prof. Steve Feiner’s User Interface Design class, although I must say it’s much more fun doing paper prototyping for a real project than for class. For class, your user base consists of the professor and the TAs, as opposed to, you know, real people.

Our prototypes were based on the virtual sticky notes we had previously put up on RealTimeBoard.com, describing the functionality we need for each scenario. The overarching objective is, of course, to build an online community interested enough in investigative journalism to help transcribe data trapped in documents.

Prototyping the interface raised some interesting questions we hadn’t yet worked out: For example, what’s the right way to display InfoScribers’ achievements? We want users to have a sense of pride at how much they contributed to investigative journalism, but not so much that they’re distracted from transcribing.

Also, how do we deal with “field sets,” related fields of data that may occur multiple times, even an unspecified number of times, on a page? A specific case we considered was when the document provided to a journalist consists of spreadsheets that were printed out and scanned in as an image-based PDF (yes, sometimes that’s what you get after a FOIA request). We don’t want to tire out our InfoScribers with a single page. Do we let the journalist cut up their pages?

Some of our questions were resolved while making the prototypes, while others were left for further consideration and, of course, the user testing itself.


Nov 14, 2013 @ 5:52 am

InfoScribe: An Introduction

We’re Aram Chung and Madeline Ross, two nerdy students from the Columbia Graduate School of Journalism building a document digitization platform for investigative journalism.

+
Despite the exponential increase in digital data today, newsrooms aren’t getting any larger and OCR technology isn’t advancing fast enough. Though on its surface greater availability of digital public records should be a boon to investigative journalism, the reality is that these records are often published as unstructured, image-based documents, or without essential metadata.

Enter InfoScribe: a generalized, web-based crowdsourcing document transcription platform that invites the public to participate in the journalistic process by transcribing specified data fields from documents. While providing journalists with access to data sources that would otherwise be beyond their reach, InfoScribe seeks to cultivate meaningful, long-term personal investment in the journalistic process by giving transcribers access to the journalists who are doing work they care about, as well as publication credit for their contribution. We want to invite community participation to increase the transparency of, and the public’s confidence in, the journalistic process.

++
Leveraging features of several enormously successful social and technical systems, we will have completed a comprehensive functional specification by January 2014, consisting of basic interface wireframes and technical design approaches. In partnership with the New York World, the alpha and beta prototypes will be developed from January to August of 2014. The New York World is an ideal first partner for this project: it is collocated with the Columbia Graduate School of Journalism, and is a genuine newsroom focused on the kinds of accountability issues that often require document-based reporting. In August of 2014 we will deploy an instance of InfoScribe to the New York Times’ Computer-Assisted Reporting department, which has already expressed interest in the platform. Over the subsequent months, we will begin to publicize the beta and invite document contributions more widely.


Nov 2, 2013 @ 4:11 pm

Blog created!

More to come.