
exploring and collecting history online — science, technology, and industry


ECHO Blogging Central

Can You Tell a Book By Its Cover?

edwired - Tue, 02/26/2013 - 20:56

It won’t be long (one month, actually) before Teaching History in the Digital Age is available. But the cover has now appeared on the Michigan Press website and I’m very pleased with the result.

Exploring the Significance of Digital Humanities for Philosophy

Digital Scholarship in the Humanities - Tue, 02/26/2013 - 14:34

On February 23, I was honored to speak at an Invited Symposium on Digital Humanities at the American Philosophical Association’s Central Division Meeting in New Orleans. Organized by Cameron Buckner, who is a Founding Project Member of InPhO and one of the leaders of the University of Houston’s Digital Humanities Initiative, the session also featured great talks by Tony Beavers on computational philosophy and David Bourget on PhilPapers.

“Join in,” by G A R N E T

One of the central questions that we explored was why philosophy seems to be less visibly engaged in digital humanities; as Peter Bradley once wondered, “Where Are the Philosophers?” As I noted in my talk, the NEH’s Office of Digital Humanities has only awarded 5 grants in philosophy (4 out of 5 to Colin Allen and colleagues on the InPhO project). Although the APA conference was much smaller than MLA or AHA, I was still surprised that there seemed to be only two sessions on DH, compared to 66 at MLA 2013 and 43 at AHA 2013.

Yet there are some important intersections between DH and philosophy. Beavers pointed to a rich history of scholarship in computational philosophy. With PhilPapers, philosophy is ahead of most other humanities disciplines in having an excellent online index to and growing repository of research. Most of the challenges faced by philosophers with an interest in DH also apply to other domains, such as figuring out how to acquire appropriate training (particularly for graduate students) and recognizing and rewarding collaborative work.

My talk was a remix and updating of my presentation “Why Digital Humanities?” In exploring the rationale for DH, I tried to cite examples relevant to philosophy. For example, the Stanford Encyclopedia of Philosophy, a dynamic online encyclopedia that predates Wikipedia, has had a significant impact, with an average of nearly a million weekly accesses during the academic year. With CT2.0, Peter Bradley aims to create a dynamic, modular, multimedia, interactive, community-driven textbook on critical thinking. Openness and collaboration also inform the design of Chris Long and Mark Fisher’s planned Public Philosophy Journal, which seeks to put public philosophy into practice by curating conversations, facilitating open review, encouraging collaborative writing, and fostering open dialogue. Likewise, I described how Transcribe Bentham is enabling the public to help create a core scholarly resource.  I also discussed recent critiques of DH, including Stephen Marche’s “literature is not data,” the 2013 MLA session on the “dark side” of DH, and concerns that DH risks being elitist. I closed by pointing to some useful resources in DH and calling for open conversation among the DH and philosophy communities. With that call in mind, I wonder: Is it the case that philosophy is less actively engaged in digital humanities?  If so, why, and what might be done to address that gap?


Detecting Handwriting in OCR Text

Collaborative Manuscript Transcription - Mon, 02/25/2013 - 14:56
This is my fourth and final post about the iDigBio Augmenting OCR Hackathon. Prior posts covered the hackathon itself, my presentation on preliminary results, and my results improving the OCR on entomology specimens. The other participants are slowly adding their results to the hackathon wiki, which I recommend checking again later (their efforts were much more impressive than mine).

Clearly handwritten: T=8, N=78% from terse and noisy OCR files
Let's say you have scanned a large number of cards and want to convert them from pixels into data.  The cards--which may be bibliography cards, crime reports, or (in this case) labels for lichen specimens--have these important attributes:
  1. They contain structured data (e.g. title of book, author, call number, etc. for bibliographies) you want to extract, and
  2. They were part of a living database built over decades, so some cards are printed, some typewritten, some handwritten, and some with a mix of handwriting and type.
The structured aspect of the data makes it quite easy to build a web form that asks humans to transcribe what they see on the card images.  It also allows for sophisticated techniques for parsing and cleaning OCR (which was the point of the hackathon).  The actual keying-in of the images is time consuming and expensive, however, so you don't want to waste human effort on cards which could be processed via OCR.

Since OCR doesn't work on handwriting, how do you know which images to route to the humans and which to process algorithmically?  It's simple: any images that contain handwriting should go to the humans.  Detecting the handwriting on the images is unfortunately not so simple.

I adopted a quick-and-dirty approach for the hackathon: if OCR of handwriting produces gibberish, why not send all the images through a simple pass of OCR and look in the resulting text files for representative gibberish?  In my preliminary work, I pulled 1% of our sample dataset (all cards ending with "11") and classified them three ways:
  1. Visual inspection of the text files produced by an ABBYY OCR engine,
  2. Visual inspection of the text files produced by the Tesseract OCR engine, and
  3. Looking at the actual images themselves.

To my surprise, I was only able to correctly classify cards from OCR output 80% of the time -- a disappointing finding, since any program I produced to identify handwriting from OCR output could only be less accurate.  More interesting was the difference between the kinds of files that ABBYY and Tesseract produced.  Tesseract produced a lot more gibberish in general--including on card images that were entirely printed.  ABBYY, on the other hand, scrubbed a lot of gibberish out of its results, including that which might be produced when it encountered handwriting.

This suggested an approach: look at both the "terse" results from ABBYY and the "noisy" results from Tesseract to see if I could improve my classification rate.
Easily classified as type-only, despite (non-characteristic) gibberish: T=0,N=0 from terse and noisy OCR files.
But what does it mean to "look" at a file?  I wrote a program to loop through each line of an OCR file and check for the kind of gibberish characteristic of OCR and handwriting.  Inspecting the files reveals some common gibberish patterns, which we can sum up as regular expressions:

    # Patterns of gibberish that OCR tends to emit when it meets handwriting
    GARBAGE_REGEXEN = {
      'Four Dots' => /\.\.\.\./,
      'Five Non-Alphanumerics' => /\W\W\W\W\W/,
      'Isolated Euro Sign' => /\S€\D/,
      'Double "Low-Nine" Quotes' => /„/,
      'Anomalous Pound Sign' => /£\D/,
      'Caret' => /\^/,
      'Guillemets' => /[«»]/,
      'Double Slashes and Pipes' => /(\\\/)|(\/\\)|([\/\\]\||\|[\/\\])/,
      'Bizarre Capitalization' => /([A-Z][A-Z][a-z][a-z])|([a-z][a-z][A-Z][A-Z])|([A-LN-Z][a-z][A-Z])/,
      'Mixed Alphanumerics' => /(\w[^\s\w\.\-]\w).*(\w[^\s\w]\w)/
    }

However, some of these expressions match non-handwriting features like geographic coordinates or bar codes.  Handling these requires a white list of regular expressions for gibberish we know not to be handwriting:

    # Patterns that look like gibberish but are known not to be handwriting
    WHITELIST_REGEXEN = {
      'Four Caps' => /[A-Z]{4,}/,
      'Date' => /Date/,
      'Likely year' => /1[98]\d\d|2[01]\d\d/,
      'N.S.F.' => /N\.S\.F\.|Fund/,
      'Lat Lon' => /Lat|Lon/,
      'Old style Coordinates' => /\d\d°\s?\d\d['’]\s?[NW]/,
      'Old style Minutes' => /\d\d['’]\s?[NW]/,
      'Decimal Coordinates' => /\d\d°\s?[NW]/,
      'Distances' => /\d?\d(\.\d+)?\s?[mkf]/,
      'Caret within heading' => /[NEWS]\^s/,
      'Likely Barcode' => /[l1\|]{5,}/,
      'Blank Line' => /^\s+$/,
      'Guillemets as bad E' => /d«t|pav«aont/
    }

With these on hand, we can calculate a score for each file based on the number of occurrences of gibberish we find per line.  That score can then be compared against a threshold to determine whether a file contains handwriting. Due to the noisiness of the Tesseract files, I found it most useful to calculate their score N as a percentage of non-blank lines, while the score for the terse files T worked best as a simple count of gibberish matches.
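The two scores can be sketched roughly as follows. This is a minimal illustration, not the hackathon code: the shortened pattern lists, the method names, and the rule that a whitelisted line contributes no matches are all assumptions made for the example.

```ruby
# Hypothetical, shortened pattern lists (ASCII-only subset for illustration).
SAMPLE_GARBAGE = {
  'Four Dots' => /\.\.\.\./,
  'Five Non-Alphanumerics' => /\W\W\W\W\W/,
  'Caret' => /\^/
}
SAMPLE_WHITELIST = {
  'Likely year' => /1[98]\d\d|2[01]\d\d/,
  'Lat Lon' => /Lat|Lon/,
  'Likely Barcode' => /[l1\|]{5,}/
}

# Number of garbage patterns a line matches.
# Assumption: any whitelist match exempts the whole line.
def gibberish_matches(line)
  return 0 if SAMPLE_WHITELIST.values.any? { |re| re =~ line }
  SAMPLE_GARBAGE.values.count { |re| re =~ line }
end

# T-style score for terse OCR output: a simple count of gibberish matches.
def terse_score(text)
  text.each_line.sum { |line| gibberish_matches(line.chomp) }
end

# N-style score for noisy OCR output: percentage of non-blank lines
# that contain at least one gibberish match.
def noisy_score(text)
  lines = text.each_line.map(&:chomp).reject { |line| line.strip.empty? }
  return 0.0 if lines.empty?
  hits = lines.count { |line| gibberish_matches(line) > 0 }
  100.0 * hits / lines.size
end

sample = "Lichenes of California\n^^ ~~;;::,, x\n\nColl. 1975 ....\nplain typed text\n"
puts terse_score(sample)  # 2 (the second line matches two patterns)
puts noisy_score(sample)  # 25.0 (1 of 4 non-blank lines)
```

In this sketch the line containing "1975" is skipped even though it also contains four dots, which is one way a whitelist could keep dates and coordinates from inflating the score.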
Threshold         | Correct | False Positives | False Negatives
T > 1 and N > 20% | 82%     | 10 of 45        | 8 of 60
T > 0 and N > 20% | 84%     | 13 of 45        | 4 of 60
T > 1             | 79%     | 10 of 45        | 12 of 60
N > 20%           | 75%     | 8 of 45         | 18 of 60
N > 10%           | 81%     | 14 of 45        | 6 of 60

One interesting thing about this approach is that adjusting the thresholds lets us tune the classifications for resources and desired quality. If our humans doing data entry are particularly expensive or impatient, raising the thresholds should ensure that they are only very rarely sent typed text. On the other hand, lowering the thresholds would increase the human workload while improving quality of the resulting text.
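To make the tuning concrete, here is a small sketch of a threshold-based router. The method name, the keyword arguments, and the `:human` / `:ocr` labels are hypothetical; the defaults mirror the "T > 0 and N > 20%" rule.

```ruby
# Hypothetical router: send a card to human transcribers only when both
# scores exceed their thresholds; otherwise leave it to OCR.
def route(t_score, n_score, t_min: 0, n_min: 20.0)
  if t_score > t_min && n_score > n_min
    :human
  else
    :ocr
  end
end

p route(8, 78.0)             # => :human
p route(0, 0.0)              # => :ocr
p route(0, 10.0)             # => :ocr
p route(8, 78.0, t_min: 10)  # => :ocr (a stricter threshold spares the humans)
```

Raising `t_min` or `n_min` sends fewer cards to volunteers, and lowering them sends more, which is the trade-off between workload and quality.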
One of the  false negatives: T=0, N=10% from parsing terse and noisy text files.
I'm really pleased with this result.  The combined classifications are slightly better than I was able to accomplish by looking at the OCR myself.  The experience of a volunteer presented with 56 images containing handwriting and 13 which don't may necessitate a "send to OCR" button in the user interface, but must be less frustrating than the unclassified ratio of 45 in 105 from the sample set.  With a different distribution of handwriting-to-type in the dataset, the process might be very useful for extracting rare typed material from a mostly-handwritten set, or vice versa.

All of the datasets, code, and scored CSV files are in the iDigBio AOCR Hackathon's HandwritingDetection repository on GitHub.

Are human remains inappropriate for younger audiences?

Biomedicine on Display - Sun, 02/24/2013 - 16:25

A few days ago I received a notice from Youtube about one of our videos. Apparently someone had marked it “inappropriate” and following review by Youtube staff the video was age-restricted.

The video in question is part of a series called “Favourite Things”, in which museum staffers select one of their favourite museum objects and describe it and why it is so special. In this particular video, Collections Manager Ion Meyer is showing and describing three preparations of a so-called ischiopagus, that is, twins conjoined at the pelvis.

Since the video was published in March 2011 it has had almost 220,000 views. In comparison, the second-most watched video on our Youtube channel has had fewer than 10,000 views. The ischiopagus video has also triggered more comments than is usual for our videos. We have tried to respond to all serious comments, but we have also chosen not to respond to some, e.g.

Why would any parent let someone do this to their children! They need a proper burial! Bless there souls! <3

If you look at the Youtube guidelines, reasons for placing an age-restriction on a video include

  • Sexually suggestive content
  • Partial nudity or non-sexual nudity
  • Actual violence or very graphic fictional violence
  • Gory, disturbing imagery in an appropriate context

However, they also highlight notable exceptions for

some educational, artistic, documentary and scientific content (e.g. health education, documenting human rights issues, etc.), but only if this is the sole purpose of the video and it is not gratuitously graphic…

Without proper context and explanations I can see how someone could feel the imagery in the video is disturbing. However, it should be clear that the purpose of this video is exactly as described in the exception.

Is the video inappropriate for young audiences? I don’t think so. However, Youtube provides no means of appealing an age-restriction imposed on a video, so it doesn’t really matter what we think. I wonder if other museums have had similar experiences with videos on Youtube?

You can see the video below and judge for yourself.

Black, White, and Red

Found History - Wed, 02/20/2013 - 15:26

Steering partners and clients toward simpler web designs is one of the greatest services we can render. In consultations and collaborative projects, I often find myself advocating for less, less, less. This is especially true when it comes to color schemes—historians aren’t easily put off their beiges, navy blues, burgundies, and parchment textured backgrounds. I do not have any design training, so I have just as often been frustrated by my lack of appropriate and convincing language to explain that when it comes to color, less is often more. Until now.

Last week I met a design professor who gave me the words. “When we are teaching color to design students,” he said, “we always tell them to start with black, white, and red.” “You don’t have to stay there, but any time you stray from black, white, and red, you should have a good reason.” “It’s no accident Coca-Cola, Marlboro, and Santa Claus are the world’s most recognizable brands.”

To this list he added the highly stylized opening titles of the fashion setting television show, Mad Men. I immediately thought of Nike Air Jordans, and the covers of Time, Life, Newsweek, and The Economist. I’m sure there are many others. Black, white, and red just work. Please feel free to share additional examples in comments.

[Image credit: ididj0emama]

Guest Post: Radical Collaboration - Tools for Partnering with Community Members

Museum 2.0 - Wed, 02/20/2013 - 08:00
This guest post was written by my incredible colleagues, Stacey Marie Garcia and Emily Hope Dobkin, with minimal input from me. It started as a handout for a session that Stacey and I are doing at the California Association of Museums, and then I realized it was so darn useful that it was worth sharing with all of you. Can't wait to hear what you think.

The majority of our public programs at the Santa Cruz Museum of Art & History are created and produced through community collaborations. Each month we work with 50-100 individuals to co-produce our community programs.  It’s not unusual for us to meet with an environmental activist, a balloon artist, a farmer, and the Mayor of Santa Cruz all in one day. Every time we collaborate, we learn new ways to improve our process, organization and communication.

We never received a “how-to-guide” for collaborating with community members here at the MAH, but over time, we have acquired some basic tools that have shaped our approach. We realize collaboration differs greatly for each individual and organization. We offer these tools in the spirit of sharing and look forward to learning about the techniques you use in your own community.

Start with and continuously identify your communities.
  • Who are they?
  • What are their needs?
  • What are their assets?
  • Who is represented in your museum? Who isn’t?
One way we do this is through C3 (Creative Community Committee) meetings. C3 is a group of diverse community members that meets to creatively brainstorm new forms of collaboration with community members. C3 topics have included exhibition development, community needs, outreach programs, our Loyalty Lab project, and family programs.

Reach out to and continuously seek diverse collaborators--not just the usual suspects. We look for partners who have:
  • An understanding of and desire to help meet your community’s needs.
  • Incredible assets, skills and resources to offer to your community but they are in need of more awareness, promotion, visibility and representation.
  • A genuine enthusiasm for sharing their skills, building knowledge and developing relationships in the community even if they haven’t done it before. For example, a few months ago we had a couple approach us to propose a Pop-Up Tea Ceremony.  Their enthusiasm and commitment charmed us and aligned with our social bridging goals. We invited them to set up the day after we met them and they’ve been Friday regulars ever since.
  • Experience working with a wide variety of age groups or teaching in general. 
  • Good communication skills and are kind and friendly.
  • Large and small (or no) followings. When planning programs or events, we involve a combination of these groups to share and bridge audiences, bringing big, diverse crowds to new artists and ideas.

Openly invite collaboration by establishing and maintaining transparency about your partnerships with the public and fellow staff members.
  • On your website: share your programming goals, solicit collaborations in general and for specific events, provide easily accessible staff contact information, clearly state how your collaborations function, and give thanks and acknowledgement to your collaborators through your website and on your Facebook page.
  • At your museum: have your front desk staff aware of upcoming events and collaboration possibilities, always have business cards available for visitors interested in collaborating so they can easily contact staff members.  Be available to talk with people at your events and hand out your contact information to anyone who has an idea they’d like to talk with you about or is interested in helping. Follow up with them later.
  • Don’t pass judgment or make assumptions. Always be open to discussing collaborative possibilities with anyone and everyone and then decide if it’s a good fit.  
  • Mine your colleagues; ask for ideas and suggestions from staff members for resources. You never know who might have connections to some place or another. For our Art That Moves event, our Membership and Development Director suggested the incredibly popular Tarp Surfing activity.

Always meet your collaborators in person. We can’t overstate how important this is to getting everyone moving in the same direction.
  • Clearly explain how your organization collaborates with others before you meet.
  • Meet them at your museum so they begin to become more familiar and comfortable with the space and understand how they will fit into the event or program.
  • Ask them about their goals for this collaboration and share your goals.
  • Find a way, together, to achieve both.
  • Brainstorm together your wildest ideas and then scale back. For our 3rd Friday series, we like to have an initial meeting with all of our collaborators and together go over the community program goals tied to the theme of the event. Incredible projects can arise when you have a poet, a librarian, a printmaker, a bookbinder and a teacher all throwing out ideas together. (Radical Craft Night and Poetry & Book Arts)
  • Allow time to pass for further individual reflection, for them to share their ideas with other members of their organization and for you to give it further thought.
  • Confirm final details with them over phone, email or go to their location this time.

Collaboration is based upon communication. Get ready to talk.
  • Be prepared to spend an enormous amount of time communicating with each individual through email, over the phone and in person.
  • Make time for them. When you give collaborators more of your time, they will feel more confident about their role in the event, their project/workshop/demonstration will inevitably be stronger and your visitors will be happier.
  • When you produce a large event with many individuals, make sure they are all connected through email. This establishes communication across the entire group, collective teamwork, the opportunity to share resources and the possibility of future relationships and connections to develop amongst your collaborators.  Recently, we hosted a PechaKucha night at the MAH, which featured a wide range of community members presenting on eight different topics. These eight people didn't know each other at all before the event. In a pre-event email exchange, one presenter offered up a useful link to help practice giving this kind of talk. That email sparked several messages of appreciation and excitement, creating a sense of comradery.

Even if you can’t financially compensate your collaborators, show your collaborators how much you value them. Many times, we cannot pay our collaborators. For some MAH events, we collaborate with 120 individuals across the spectrum from amateurs to professionals, all of whom have very different expectations about compensation. How do we pay a group of ukulele players, a teenage rock band and a world-renowned musician fairly and on a very limited budget?

Here are some other ways we compensate our collaborators:
  • Give them as much press as possible. Suggest them to press for a feature in the local paper.
  • Acknowledge them on your website and always link to their website.
  • Pay for all their materials.
  • Offer food and drinks for them at the event.
  • Give them a guest pass.
  • Thank them and credit them for their work and volunteered time.
  • Refer them if someone asks you for a recommendation.
  • Help them learn from the experience. We recently had a group of students creating balloon art during our Winterpalooza Family Festival. Because they were new to the art form and the museum, we gave them a gift certificate so they could reflect over milkshakes at a local burger joint after the event.
  • Encourage them to promote themselves/their organization and offer ways for visitors to learn more about their events at your event. It’s a reciprocal appreciation: we are able to showcase and share the amazing talent in our community, and they’re able to share their work with a larger audience, make new connections in the community and learn from their experiences interacting with the public.

Your partners are doing a lot of work. Make it as easy for them as possible.
  • Share your resources and connections that can help make their activity/collaboration stronger. A friendly sheet metal company in Santa Cruz provided scrap metal for our Experience Metal festival last summer; we thanked them by donating back the giant robot visitors partly made from the scrap.
  • Buy, gather, and prep all the materials you can. This might mean cutting thousands of papers of various sizes, wheeling hundreds of library books through downtown, dumpster diving for cardboard boxes and driving up to the mountains to move a 200lb letterpress to the MAH.
  • Set up their tables and materials for them before they arrive.
  • Have volunteers ready to assist them with set up and break down, as well as coverage during breaks.
  • Clearly communicate with them throughout the process, show them exactly where they will be and where everyone else will be, let them know the schedule, where to check in, how and where to find help and assistance and what is expected of them before, during and after the event.

Get collaborators' feedback and give them credit for their contributions.
  • Survey your collaborators extensively to find out: ways to improve for next time, what they appreciated, how or if they benefited from the collaboration, and what changes they’d like to see made. Here's a sample collaborator survey from our recent Poetry and Book Arts event.
  • Read the surveys and make active and immediate changes based upon their feedback.
  • Document the event: Share photographs of the event on social media outlets and always have fully downloadable photographs available for their use.
  • Keep in contact with them. These people are now one of your best and most reliable resources and you can be theirs as well. Stay up to date with them about future collaborations or other potential collaborators they may know. Be helpful to them and they will be helpful to you. 
How do you collaborate with your community? What tools and methods have you found beneficial?

Rebuilding a Course Around Prior Knowledge

edwired - Mon, 02/18/2013 - 20:04

Of the many different courses I teach, the one I’ve made the fewest changes in over the past decade is my survey of modern Eastern Europe. Every other course I teach has been reconfigured in various ways as a result of my research into the scholarship of teaching and learning, but for some reason, I’ve never gotten around to altering this course. I’m ashamed to say that when I taught it last semester, it was really not that much different from the way I taught it for the first time way back in 1999.

I could offer various excuses for why that course seems so similar to its original incarnation, but really the only reason is inertia. I’ve rewritten four other courses and have created five others from scratch in the past six or seven years and because my East European survey worked reasonably well, it was last in line for renovation.

The good news for future students is that I’ve taught it that way for the last time.

Like all upper division survey courses, HIST 312 poses a particular set of challenges. Because we have no meaningful prerequisites in our department (except for the Senior Seminar, which requires students to pass Historical Methods), students can show up in my class having taken no history courses at the college level. And even if they have, the coverage of the region we used to call Eastern Europe is so thin in other courses that it is as though they had never taken another course anyway. That means I have always spent a fair amount of time explaining just where we are talking about, who the people are who live there, and so on, before we get to the real meat and potatoes of the semester.

And then there is the fact that this course spans a century and eight countries (and then five more once Yugoslavia breaks up); it’s a pretty complex story.

To help students make sense of that complexity, over the years I’ve narrowed the focus of the course substantially, following Randy Bass’s advice to me many years ago: “The less you teach, the more they learn.” We focus on three main themes across all this complexity and by the end of the semester, most of the students seem to have a pretty good grasp of the main points I wanted to make. Or at least they reiterated those points to me on exams and final papers. And it’s worth noting that they like the course. I just got my end of semester evaluations from last semester and the students in that class rated it a 5.0 on a 5 point scale, while rating my teaching 4.94.

What I don’t know is whether they actually learned anything.

This semester I’m part of a reading group that is working its way through How Learning Works and this past week we discussed the research on how students’ prior knowledge influences their thinking about whatever they encounter in their courses. This chapter reminded me a lot of an essay by Sam Wineburg on how the film Forrest Gump has played such a large role in students’ learning about the Viet Nam wars. Drawing on the work of cognitive psychologists and their own research, Ambrose et al and Wineburg come to the same conclusion, namely, that it is really, really difficult for students (or us) to let go of prior knowledge, no matter how idiosyncratically acquired, when trying to make sense of the past (or any other intellectual problem).

The research they describe seems pretty compelling to me, especially because much of it comes from lab studies rather than water cooler anecdotes about student learning. Because it’s so compelling, I’ve decided to rewrite my course around the notion of working from my students’ prior knowledge. Getting from where they are when they walk in the room on the first day of the semester to where I want them to be at the final exam is the challenge that will animate me throughout the term.

My plan right now (and it’s a tentative plan because I won’t teach the course again for a couple of semesters) is to begin the semester with three short in-class writing assignments on the three big questions/themes that run through the course. I want to know where my students are with those three before I try to teach them anything. Once I know where they are, then I can rejigger my plans for the semester to meet them where they are rather than where I might like them to be. And then as we complete various segments of the course I’ll have them repeat this exercise so I can see whether they are, as I hope, building some sort of sequential understanding of the material. By the end of the semester I ought to be able to track progress in learning (at least I hope I will), which is an altogether different thing than hoping to see evidence of the correct answer compromise.

Results of the "Ocrocrop" Approach to Improving OCR

Collaborative Manuscript Transcription - Fri, 02/15/2013 - 22:21
This project attempted to improve the quality of OCR applied to difficult entomology images[*] by cropping labels from the images to run through OCR separately. In order to identify labels on the image to crop, an initial, 'naive' pass of OCR was made over the whole image, generating both
  • A) a set of rectangles on the image defined as word bounding boxes by the OCR engine, and 
  • B) a control OCR text file to be used for comparing the 'naive' model with the methodology.
Those word rectangles were then filtered, consolidated, and filtered again to identify the labels on the image, which were then extracted and run through the OCR engine separately. The resulting OCR output files were then concatenated into a single text file, which was compared against the 'naive' output described in B (above).
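The filter, consolidate, filter sequence might look something like the sketch below. It is only an illustration under stated assumptions: the `Box` struct, the pixel thresholds (`min_word`, `gap`, `min_label`), and the merge-by-proximity rule are hypothetical and are not taken from the project's code.

```ruby
# Hypothetical word bounding box with corners (x1, y1) and (x2, y2).
Box = Struct.new(:x1, :y1, :x2, :y2) do
  def width
    x2 - x1
  end

  def height
    y2 - y1
  end

  # True when the boxes overlap or lie within `gap` pixels of each other.
  def near?(other, gap)
    x1 - gap <= other.x2 && other.x1 <= x2 + gap &&
      y1 - gap <= other.y2 && other.y1 <= y2 + gap
  end

  # Smallest box that covers both boxes.
  def union(other)
    self.class.new([x1, other.x1].min, [y1, other.y1].min,
                   [x2, other.x2].max, [y2, other.y2].max)
  end
end

# Merge boxes until no two remaining boxes are near each other.
def consolidate(boxes, gap)
  merged = []
  boxes.each do |box|
    current = box
    loop do
      i = merged.index { |m| m.near?(current, gap) }
      break unless i
      current = current.union(merged.delete_at(i))
    end
    merged << current
  end
  merged
end

# Filter out specks, consolidate words into regions, then drop regions
# too narrow to be a label. All thresholds are assumed values.
def label_regions(word_boxes, min_word: 5, gap: 20, min_label: 50)
  words = word_boxes.select { |b| b.width >= min_word && b.height >= min_word }
  consolidate(words, gap).select { |b| b.width >= min_label }
end

words = [Box.new(10, 10, 40, 20), Box.new(45, 10, 90, 20),
         Box.new(12, 25, 70, 35), Box.new(300, 300, 302, 302),
         Box.new(400, 100, 420, 110)]
p label_regions(words).map(&:to_a)  # => [[10, 10, 90, 35]]
```

Here three neighboring word boxes collapse into one candidate label, while a 2-pixel speck and an isolated narrow box are discarded.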

I'll call this method "ocrocrop". (For more detail on method, see the transcript of my preliminary presentation.)

The results were encouraging. (See CSV file listing results for each file, and the directory containing "naive" output, annotated JPGs, and cleaned output files for each test.)

Of 80 files tested, 20 experienced a decrease in score (see Alex Thomson's scoring service), but most (14/20) of those were on OCR output below 10% accuracy in the first place, and the remainder were at or below 20% accuracy. So it is reasonable to say that the ocrocrop method only degraded the quality of texts that were unusable in the first place.

40 of the 80 files tested showed more promising results, showing improvements from one to twenty percentage points -- in some cases only marginally improving unusable (below 10% accurate) outputs, but in many cases improving the scores more substantially (say from 25% to 35% in the case of EMEC609908_Stigmus_sp).

Most of the top quartile of results saw improvements on texts that were already scoring above 10% accuracy rates (16 of 20), so it appears that the effectiveness of the ocrocrop method is correlated to the quality of the naive input data -- garbage is degraded or only minimally improved, while OCR that is merely bad under the naive approach can be significantly improved.


The ocrocrop method saw the greatest improvement in cases where the naive OCR pass was effective at identifying word bounding boxes, but ineffective at translating their contents into words. Taking EMEC609928_Stigmus_sp, the case of greatest improvement (naive: 18.9%, ocrocrop: 70.5%), we see that all words on the labels except for the collector name were recognized as words (in purple), making the cropped label images (in blue) good representatives of the actual labels on the image.

The cropped image was more easily processed by our OCR engine. Compare the naive version of the second label:
CALIF:Hunbo1dt Co. ;‘ ~ 3 m1.N' Garbervflle ,::f< '_- ' v—23~75 n.n1e:z.' 9 ._ ’
with the ocrocrop version of the second label:
CALIF:Humboldt Co. 3 mi.N Garberville V-23-76 R.Dietz,'

One of the problems with the OCR-based pre-processing which may be hidden by the scores is that many labels are entirely missed by the ocrocrop method if the first, naive OCR pass failed to identify any words at all on the label. In cases such as EMEC609651_Cerceris_completa, the determination label was not cropped (crops are indicated by blue rectangles) because no words (purple rectangles) were detected by the original pass. As a result, while the ocrocrop OCR is an improvement over the naive OCR (6.6% vs. 6.5%), substantial portions of text on the image are unimproved because they are unattempted.

There are two possible ways to solve this problem. One is to abandon the ocrocrop model entirely, switching back to a computer vision approach -- either by programmatically locating rectangles on the image (as Phuc Nguyen demonstrated) or by asking humans to identify regions of interest for OCR processing (as demonstrated by Jason Best in Apiary and by Paul and Robin Schroeder in ScioTR). The other option is to improve the naive OCR -- perhaps by swapping out the engine (e.g. use ABBYY instead of Tesseract), perhaps by using a different image pre-processor (like ocropus's front-end to Tesseract), perhaps by re-training Tesseract.

I suspect that a computer vision approach to extracting entomology labels (or similar pieces of paper photographed against a noisy background) will provide a more effective eventual solution than the ocrocrop method. Nevertheless, the ocrocrop "bang it with a rock until it works" approach has a lot of potential to take entomology-style OCR to bad from worse.

[*]In addition to the difficulties typical of specimen labels--mix of typefaces, handwritten material, typewritten material, text inventory with few overlaps with a dictionary of literary English--the entomology dataset contained additional challenges. Difficulties included the following:
  • Images containing specimens and rulers as well as labels. 
  • Labels casually arranged for photography, so that text orientation was not necessarily aligned. 
  • Labels photographed against a background of heavily pin-pricked styrofoam rather than a black or neutral background. 
  • 3-d images including what appear to be shadows, which soften the contrast differences around borders.

iDigBio Augmenting OCR Hackathon

Collaborative Manuscript Transcription - Fri, 02/15/2013 - 22:00
I spent the last three days at the iDigBio Augmenting OCR Hackathon working alongside mycologists, botanists, entomologists, herbarium managers, and bioinformaticians to explore ways to improve parsing of digitized specimen labels.  While I'm pleased with the results of my own contribution, I'd like to take a minute to talk about the hackathon process itself before I post them.

This was my first hackathon--a condition which seemed to be the rule among the participants--and I was really impressed with it.  The iDigBio folks defined a clear set of goals (improve OCR parsing of specimen labels) with clear metrics (these datasets, these output formats, this scoring algorithm) a couple of months beforehand, and organized five weekly videoconferences before the event.  Most important of all, the participants were encouraged to prepare a 10-minute lightning talk on their efforts and preliminary results.  (See below for the transcript of my talk, and the notes document for descriptions of all talks.)

In my opinion, these preliminary talks were critical to the success of the project.  The preliminary nature relaxed pressure on participants, so we were able to experiment beyond the target of the hackathon (as I did with my handwriting detection digression, a related, but un-scorable effort).  On the other hand, they did provide enough impetus to get many of us looking at the data, working with the tools, and thinking about approaches.  This meant that even before the hackathon started, many of us were familiar enough with the materials to have a real 'meeting of the minds' experience during the pre-event supper:  "Did you just say 'the contrast difference between the print and the label is higher than the difference between the label and the background'?  We ran into that too, and here's what we did..."

The experience was a real education in OCR for me, and I feel like I picked up techniques I can apply directly to projects I've discussed with clients and potential clients.  In particular, I got a real appreciation for how interrelated image preparation, OCR, and parsing are.  One participant had created separate libraries of regular expressions to clean up each kind of field, having discovered that latitude/longitude coordinates require different error correction than personal names or herbarium catalog numbers do.  Another group had built a touch-screen tool for classifying segments of the image before submitting them to OCR.  My own project required a first pass of OCR to clean images before sending them to a second, 'real' pass of OCR.  A simple 1,2,3 workflow just isn't sufficient!

iDigBio itself is an NSF-funded attempt to advance digitization practices on natural history collections, combining disciplinary "thematic collection networks" and methodologically focused working groups on topics like georeferencing, crowdsourcing, and OCR.  Aware that they're not the only people digitizing things, they have been reaching out beyond the natural sciences to the library and information science community at the iConference this year.  This rejection of "not invented here" siloing was a big part of the hackathon, and I hope that more people from outside the natural sciences will get involved.

Bioartist Oron Catts speaking at Medical Museion

Biomedicine on Display - Fri, 02/15/2013 - 11:40

As part of the upcoming workshop “It’s Not What You Think: Communicating Medical Materialities”, we are delighted to announce that the pioneering bioartist Oron Catts will be giving a public keynote lecture on Friday March 8th at 17.00 in the auditorium at Medical Museion.

Oron Catts is a prominent and defining figure in the emerging field of bioarts, which examines shifting perceptions of life through the lens of the life sciences. Famous for his work with The Tissue Culture and Art Project, he also co-founded the bioart lab SymbioticA at the University of Western Australia.

Here is the title and abstract for the talk, which can also be found on our seminar page:

The Puzzle of Neolifism, the Strange Materiality of Regenerative and Synthetically Biological Things.

In 1906 Jacques Loeb suggested making a living system from dead matter as a way to debunk the vitalists’ ideas and claimed to have demonstrated ‘abiogenesis’. In 2010 Craig Venter announced that he created “the first self-replicating cell we’ve had on the planet whose parent is a computer” the “Mycoplasma laboratorium” which is commonly known as Synthia.  In a sense Venter claimed to bring Loeb’s dream closer to reality. What’s relevant to our story is that one of the main images Venter (or his marketing team) chose for the outing of Synthia was of two round cultures that looked like a blue eyed gaze; a metaphysical image representing the missing eyes of the Golem. These are the first bits of a jigsaw puzzle that will be laid in this talk. Through the notion of Neolifism, this puzzle will explore and Re/De-Contextualise the strange materiality of things and assertions of regenerative and synthetic biology. Other parts of the puzzle include a World War II crash site of a Junkers 88 bomber at the far north of Lapland, the first lab where the Tissue Culture & Art Project started to grow semi-living sculptures, frozen arks and de-extinctions, Alexis Carrel, industrial farms, Charles Lindbergh, worry dolls, rabbits’ eyes, ear-mouse, gas chambers, active biomaterials, in-vitro meat and leather, incubators, freak-shows, museums, ghost organs, drones, crude matter, mud and a small piece of Plexiglas that holds this puzzle together…

About Oron Catts:

Oron Catts is an artist, researcher and curator whose pioneering work with the Tissue Culture and Art Project, which he established in 1996, is considered a leading biological art undertaking. In 2000, Oron founded SymbioticA, an artistic research centre in the School of Anatomy, Physiology and Human Biology at The University of Western Australia. SymbioticA won the Prix Ars Electronica Golden Nica in Hybrid Art in 2007 and a year later became a Centre for Excellence. In 2009, Oron was listed in Thames & Hudson’s ‘60 Innovators Shaping our Creative Future’ and named by Icon Magazine (UK) as one of the ‘Top 20 designers making the future and transforming the way we work’. Oron’s interest is life itself or, more specifically, the shifting relations and perceptions of life in the light of new knowledge and its application. Often developed in collaboration with scientists and other artists, his body of work speaks volumes about the need for a new cultural articulation of evolving concepts of life. Oron has been a Research Fellow at Harvard Medical School and a Visiting Scholar at the Department of Art and Art History, Stanford University. He is currently the Director of SymbioticA, a Visiting Professor of Design Interaction at the Royal College of Arts, London, and a Visiting Professor at Aalto University’s Biofilia- base for Biological Arts, Helsinki. Oron’s work reaches beyond the confines of art, often being cited as an inspiration in areas as diverse as new materials, textiles, design, architecture, ethics, fiction and food.

Image credit – Crude Matter (2012) by The Tissue Culture & Art Project (Oron Catts and Ionat Zurr), installation detail from “SOFT CONTROL: Art, Science and the Technological Unconscious”, Koroška galerija likovnih umetnosti (KGLU), Slovenj Gradec.

Improving OCR Inputs from OCR Outputs?

Collaborative Manuscript Transcription - Thu, 02/14/2013 - 15:32
This is a transcript of my talk at the iDigBio Augmenting OCR Hackathon, presenting preliminary results of my efforts before the event.

For my preliminary work, I tried to improve the inputs to our OCR process through looking at the outputs of a naive OCR.
One of the first things that we can do to improve the quality of our inputs to OCR is to not feed them handwriting.  To quote Homer Simpson, "Remember son, if you don't try, you can't fail."  So let's not try feeding our OCR processes handwritten materials.
To do this, we need to try to detect the presence of handwriting.  When you try to feed handwriting to OCR, you get a lot of gibberish.  If we can detect handwriting, we can route some of our material to "humans in the loop" -- not wasting their time with things we could be OCRing.  So how do we do this?
My approach was to use the outputs of [naive] OCR to detect the gibberish it produces when it sees handwriting, and from that to determine when there was handwriting present in the images.  The first thing I did, before I started programming, was classify OCR output from the lichen samples by visual inspection: whether I thought there was handwriting present or not, based on looking at the OCR outputs.  Step two was to automate the classifications.
I tried this initially on the results that came out of ABBYY and then the results that came out of Tesseract, and I was really surprised by how hard it was for me as a human to spot gibberish.  I could spot it, but ABBYY does a great job of cleaning up its OCR output, so in a lot of cases -- particularly the labels that were all printed with the exception of some species name that was handwritten -- ABBYY generally misses those.  Tesseract, on the other hand, does not produce outputs that are quite as clean.

So the really interesting thing about this to me is that while we were able to get 70-75% accuracy on both ABBYY and Tesseract, if you look at the difference between the false positives that come out of ABBYY and Tesseract and the false negatives, I think there is some real potential here for making a much more sophisticated algorithm.  Maybe the goal is to pump things through ABBYY for OCR, but beforehand look at Tesseract output to determine whether there is handwriting or not.
The next thing I did was try to automate this.  I just used some regular expressions to look for representative gibberish, and then based on the number of matches got results that matched the visual inspection, though you do get some false positives. 
The next thing I want to do with this is to come up with a way to filter the results based on doing a detection on ABBYY [output] and doing a detection on Tesseract [output].
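[A minimal sketch of the regular-expression approach described above, not the code used for the talk. The talk does not list the actual expressions, so the patterns, the threshold, and the function names here are all hypothetical stand-ins chosen for illustration.]

```python
import re

# Hypothetical patterns for "representative gibberish". The real expressions
# are not given in the talk; these are illustrative guesses only.
GIBBERISH_PATTERNS = [
    re.compile(r"[^\w\s]{3,}"),                                # runs of 3+ punctuation marks
    re.compile(r"\b[b-df-hj-np-tv-z]{5,}\b", re.IGNORECASE),   # long "words" with no vowels
    re.compile(r"\b\w*[a-z]\d[a-z]\w*\b", re.IGNORECASE),      # a digit wedged between letters
]


def gibberish_score(text):
    """Count how many gibberish-looking fragments appear in OCR output."""
    return sum(len(pattern.findall(text)) for pattern in GIBBERISH_PATTERNS)


def looks_handwritten(text, threshold=3):
    """Flag OCR output as probably containing handwriting.

    The threshold is an assumed value; it would need tuning against
    labels classified by visual inspection.
    """
    return gibberish_score(text) >= threshold
```

Counting matches rather than testing for a single match leaves room to tune the cutoff separately for each OCR engine, since the two engines produce different amounts of noise on the same label.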
The next thing that I wanted to work on was label extraction.

We're all familiar with the entomology labels and problems associated with them.
So if you pump that image of Cerceris through Tesseract, you end up with a lot of garbage. You end up with a lot of gibberish, a lot of blank lines, some recognizable words.  That "Cerceris compacta" is, I believe, the result of a post-digitization process: it looks like an artifact of somebody using Photoshop or ImageMagick to add labels to the image.  The rest of it is the actual label contents, and it's pretty horrible.  We've all stared at this; we've all seen it.
So how do you sort the labels in these images from rulers, holes in styrofoam, and bugs?  I tried a couple of approaches.  I first tried to traverse the image itself, looking for  contrast differences between the more-or-less white labels and their backgrounds.  The problem I found with that was that the highest contrast regions of the image are the difference between print and the labels behind the print.  So you're looking for a fairly low-contrast difference--and there are shadows involved.  Probably, if I had more math I could do this, but this was too hard.

So my second try was to use the output of OCR that produces these word bounding boxes to determine where labels might be, because labels have words on them. 
If you run Tesseract or Ocropus with an "hocr" option, you get these pseudo-HTML files that have bounding boxes around the text.  Here you see this text element inside a span; the span has these HTML attributes that say "this is an OCR word".  Most importantly, you have the title attribute as the bounding box definition of a rectangle. 
If you extract that and re-apply it to an image, you see that there are a lot of rectangles on the image, but not all the rectangles are words.  You've got bees, you've got rulers; you've got a lot of random trash in the styrofoam.
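[A rough sketch of pulling word bounding boxes out of an hOCR file, using only the Python standard library. The class names `ocrx_word` and `ocr_word` and the `bbox x0 y0 x1 y1` layout of the title attribute are assumptions about what the OCR engine emits, and `HocrWordParser` and `parse_hocr` are hypothetical names, not part of any OCR tool.]

```python
import re
from html.parser import HTMLParser

# The title attribute is assumed to look like "bbox 36 92 210 130; ..."
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every word span in an hOCR file."""

    def __init__(self, word_classes=("ocrx_word", "ocr_word")):
        super().__init__()
        self.word_classes = word_classes  # assumed class names for word spans
        self.words = []
        self._bbox = None                 # bbox of the word span being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and any(c in self.word_classes for c in classes):
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(n) for n in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: assumes word spans do not contain nested spans.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words
```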
So how do we sort good rectangles from bad rectangles?  First I did a pass looking at the OCR text itself.  If the bounding box was around text that looked like a word, I decided that this was a good rectangle.  Next, I did a pass by size.  A lot of the dots in the styrofoam come out looking suspiciously word-like for reasons I don't understand.  So if the area of the rectangle was smaller than .015% of the image, I threw it away.
The result was [above]: you see rectangles marked with green that pass my filter and rectangles marked with red that don't.  So you get rid of the bee, you get rid of part of the ruler -- more important, you get rid of a lot of the trash over here. [Pointing to small red rectangles on styrofoam.] There are some bugs in this--we end up getting rid of "Arizona" for reasons I need to look at--but it does clean the thing up pretty nicely.
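[The two filtering passes can be sketched roughly as follows. The .015% area cutoff comes from the talk; the test for "looks like a word" is not spelled out there, so the three-letters-in-a-row rule below is an assumption, as are the names `WORD_LIKE` and `is_good_rectangle`.]

```python
import re

# Assumed heuristic for "looks like a word": at least three letters in a row.
WORD_LIKE = re.compile(r"[A-Za-z]{3,}")

# .015% of the image area, the cutoff quoted in the talk.
MIN_AREA_FRACTION = 0.00015


def is_good_rectangle(text, bbox, image_width, image_height):
    """Keep a word box only if its text is word-like and it is not tiny."""
    # Pass 1: does the OCR text inside the box look like a word?
    if not WORD_LIKE.search(text):
        return False
    # Pass 2: discard boxes covering less than .015% of the image.
    x0, y0, x1, y1 = bbox
    area = max(0, x1 - x0) * max(0, y1 - y0)
    return area >= MIN_AREA_FRACTION * image_width * image_height
```

On a 4000 x 3000 pixel image the cutoff works out to 1,800 square pixels, so a 20 x 20 speck in the styrofoam is rejected even when its OCR text happens to look word-like.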

Question: A very simple solution to this would be for the guys at Berkeley to take two photographs -- one of the bee and ruler, one of the labels.  I'm just thinking how much simpler that would be.

Me: If the guys in Berkeley had a workflow that took the picture--even with the bee--against a black background, that would trivialize this problem completely! 

Question: If the photos were taken against a background of wallpaper with random letters, it couldn't be much worse than this [styrofoam].  The idea is that you could make this a lot easier if you would go to the museums and say, we'll participate, we'll do your OCRing, but you must take photographs this way.

Me: You're absolutely right.  You could even hand them a piece of cardboard that was a particular color and say, "Use this and we'll do it for you, don't use it and we won't."  I completely agree.  But this is what we're starting with, so this is what I'm working on.
The next thing is to aggregate all those word boxes into the labels [they constitute]. For each rectangle, look at all of the other rectangles in the system, expand them both a little bit, determine if they overlap, and if they do, consolidate them into a new rectangle, and repeat the process until there are no more consolidations to be done. [Thanks to Sara Brumfield for this algorithm.]
If you do that, the blue boxes are the consolidated rectangles.  Here you see a rectangle around the U.C. Berkeley label, a rectangle around the collector, and a pretty glorious rectangle around the determination that does not include the border. 
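[A sketch of the consolidation loop as described: expand each pair of rectangles a little, merge them if they overlap, and repeat until nothing changes. The margin value and the function names are assumptions, and the count of word boxes absorbed into each rectangle is extra bookkeeping added here so that single-box rectangles can be told apart from consolidated ones.]

```python
def expand(rect, margin):
    """Grow a rectangle (x0, y0, x1, y1) by margin pixels on every side."""
    x0, y0, x1, y1 = rect
    return (x0 - margin, y0 - margin, x1 + margin, y1 + margin)


def overlaps(a, b):
    """True if two rectangles share any area or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(a, b):
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def consolidate(word_boxes, margin=10):
    """Merge nearby word boxes into label-sized rectangles.

    Returns a list of (rectangle, number_of_word_boxes_absorbed).
    The margin (in pixels) is an assumed value.
    """
    groups = [(tuple(box), 1) for box in word_boxes]
    merged = True
    while merged:                      # repeat until a full pass finds nothing to merge
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                (a, count_a), (b, count_b) = groups[i], groups[j]
                if overlaps(expand(a, margin), expand(b, margin)):
                    groups[i] = (union(a, b), count_a + count_b)
                    del groups[j]
                    merged = True
                    break
            if merged:
                break                  # restart the scan with the updated list
    return groups
```

Restarting the scan after every merge is slow for large inputs, but with a few dozen word boxes per image the simplicity is probably worth it.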
Having done that, you want to further filter those rectangles.  Labels contain words, so you can reject any rectangles that were "primitives" -- you can get rid of the ruler rectangle, for example, because it was just a single [primitive] rectangle that was pretty large. 
So you make sure that all of your rectangles were created through consolidation, then you crop the results.  And you end up automatically extracting these images from that sample -- some of which are pretty good, some of which are not.  We've got some extra trash here, we cropped the top of "Arizona" here.  But for some of the labels -- I don't think I could do better than that determination label by hand. 
Then you feed the results back into Tesseract one by one, then we combine the text files in Y-axis order to produce a single file for all those images.  (Not something that's a necessary step, but that does allow us to compare the results with the "raw" OCR.)  How did we do?
This is a resulting text file -- we've got a date that's pretty recognizable, we've got a label that's recognizable, and the determination is pretty nice.
Let's compare it to the raw result.  In the cropped results, we somehow missed the "Cerceris compacta", we did a much nicer job on the date, and the determination is actually pretty nice.
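[The recombination step, joining the per-label OCR text in Y-axis order, might look something like this. The input shape, a list of (bounding box, text) pairs, and the name `combine_in_y_order` are assumptions made for the sketch; the OCR call itself is left out.]

```python
def combine_in_y_order(label_results):
    """Join OCR text from cropped labels, topmost label first.

    label_results is a list of ((x0, y0, x1, y1), text) pairs, one per
    cropped label. Ties on the top edge are broken left to right.
    """
    ordered = sorted(label_results, key=lambda item: (item[0][1], item[0][0]))
    return "\n".join(text.strip() for _, text in ordered)
```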
Let's try it on a different specimen image.
We run the same process over this Stigmus image.  We again find labels pretty well.
 When we crop them out, the autocrop pulls them out into these three images.

Running those images through OCR, we get a comparison of the original, which had a whole lot of gibberish. 
The original did a decent job with the specimen number, but the autocrop version does as well.  In particular, for this location [field], the autocrop version is nearly perfect, whereas the original is just a mess.
My conclusion is that we can extract labels fairly effectively by first doing a naive pass of OCR and looking at the results of that, and that the results of OCR over the cropped images are less horrible than running OCR over the raw images -- though still not great. 
[2013-02-15 update: See the results of this approach and my write-up of the iDigBio Augmenting OCR Hackathon itself.]

Social media and disaster management

Biomedicine on Display - Thu, 02/14/2013 - 09:17

Social media and public health is a diverse field, and there is always some new corner to explore! These days I am increasing my knowledge on the use of social media for disaster management and coordination. The reason for this is that next week I will be giving a lecture on the topic to students at the Master of Disaster Management at University of Copenhagen.

It has been exciting to dig into a new field and to experience how social media really presents great new opportunities, but of course also new challenges. Since I haven’t previously worked specifically with disaster management, I chose a few weeks ago to ask my Twitter followers for help on finding good literature and resource people in the field. And once again, Twitter didn’t let me down.

Blogs, website and hashtags

I got a lot of great inputs to blogs, websites, Twitter chats, hashtags and people to follow and hook up with on Twitter (a big thank you to all of you who responded!).

The blogs are a good starting point, especially since most of them offer great links to other resources. The most helpful so far has been the website/blog Social Media 4 Emergency Management. From here there is access to wikis, archives of Twitter chats (#smemchat), videos, blogs etc. on social media and emergency management. The only ‘problem’ with the website is that there is almost too much information.

Another super helpful resource is the blog idisaster2.0 (primarily run by @kim26stephens). It has lots of informative blog posts as well as a good bibliography of selected academic and government resources on social media and emergency management.

Own experiences with disasters and social media?

When I was asked to give the lecture, I hesitated for a moment, because what did I know about emergencies and disasters? Apart from my solid knowledge of social media in public health, including some superficial insight into its role in disasters, I had never had anything to do with disasters or, least of all, experienced one… However, the latter is not true, I quickly realised. I have actually to some extent been in an emergency setting and I have in fact experienced the role of social media in a disaster situation.

Earthquake in Japan in 2011

I was in Japan, when the big earthquake, subsequent tsunami and finally the Fukushima nuclear plant crisis occurred in March 2011. Being relatively far from the epicenter of the disaster (I was based in Kobe in the Kansai region), I wasn’t directly surrounded by flooded buildings, elevated radiation risks or other immediate danger. But I was surrounded by potential danger, by worried friends and family in Denmark and by Japanese friends and colleagues with close relatives in the affected areas.

Looking back on my Facebook timeline, I can now see how social media actually played an important role for me during the emergency. I used Facebook to assure others that I was okay and kept them updated on my situation. I started following the Danish Embassy in Japan’s Facebook page through which they several times daily shared information about risks, advice on how to act and the organisation of potential evacuation. I encouraged the mobilization of emotional and financial support to Japan by sharing links and QR-codes. And I experienced how a Japanese colleague of mine, after days of no contact with her sister living in Sendai where the tsunami hit, finally through Facebook got in contact and found out that her and her were safe…

So yes, I have actually experienced a disaster, and experienced how social media can be used in this kind of situation. I plan to share my experiences as a case with the students next week and hope that this real life experience can contribute to the understanding and some discussions.

Your help

Although I already got great tips from people on Twitter, I am still the happy receiver of inputs on social media and emergencies/disaster management. Suggestions on discussion topics, assignments or any other ideas on how to involve the students are more than welcome as are links to guidelines, scientific articles etc.

The Hacker Way

Found History - Wed, 02/13/2013 - 19:29

On December 21, 2012, Blake Ross—the boy genius behind Firefox and currently Facebook’s Director of Product—posted this to his Facebook page:

Some friends and I built this new iPhone app over the last 12 days. Check it out and let us know what you think!

The new iPhone app was Facebook Poke. One of the friends was Mark Zuckerberg, Facebook’s founder and CEO. The story behind the app’s speedy development and Zuckerberg’s personal involvement holds lessons for the practice of digital humanities in colleges and universities.

Late last year, Facebook apparently entered negotiations with the developers of Snapchat, an app that lets users share pictures and messages that “self-destruct” shortly after opening. Feeding on user worries about Facebook’s privacy policies and use and retention of personal data, in little more than a matter of weeks, Snapchat had taken off among young people. By offering something Facebook didn’t—confidence that your sexts wouldn’t resurface in your job search—Snapchat exploded.

It is often said that Facebook doesn’t understand privacy. I disagree. Facebook understands privacy all too well, and it is willing to manipulate its users’ privacy tolerances for maximum gain. Facebook knows that every privacy setting is its own niche market, and if its privacy settings are complicated, it’s because the tolerances of its users are so varied. Facebook recognized that Snapchat had filled an unmet need in the privacy marketplace, and tried first to buy it. When that failed, it moved to fill the niche itself.

Crucially for our story, Facebook’s negotiations with Snapchat seem to have broken down just weeks before a scheduled holiday moratorium for submissions to Apple’s iTunes App Store. If Facebook wanted to compete over the holiday break (prime time for hooking up, on social media and otherwise) in the niche opened up by Snapchat, it had to move quickly. If Facebook couldn’t buy Snapchat, it had to build it. Less than two weeks later, Facebook Poke hit the iTunes App Store.

Facebook Poke quickly rose to the top of the app rankings, but has since fallen off dramatically in popularity. Snapchat remains among iTunes’ top 25 free apps. Snapchat continues adding users and has recently closed a substantial round of venture capital funding. To me Snapchat’s success in the face of such firepower suggests that Facebook’s users are becoming savvier players in the privacy marketplace. Surely there are lessons in this for those of us involved in digital asset management.

Yet there is another lesson digital humanists and digital librarians should draw from the Poke story. It is a lesson that depends very little on the ultimate outcome of the Poke/Snapchat horse race. It is a lesson about digital labor.

Mark Zuckerberg is CEO of one of the largest and most successful companies in the world. It would not be illegitimate if he decided to spend his time delivering keynote speeches to shareholders and entertaining politicians in Davos. Instead, Zuckerberg spent the weeks between Thanksgiving and Christmas writing code. Zuckerberg identified the Poke app as a strategic necessity for the service he created, and he was not too proud to roll up his sleeves and help build it. Zuckerberg explained the management philosophy behind his “do it yourself” impulse in the letter he wrote to shareholders prior to Facebook’s IPO. In a section of the letter entitled “The Hacker Way,” Zuckerberg wrote:

The Hacker Way is an approach to building that involves continuous improvement and iteration. Hackers believe that something can always be better, and that nothing is ever complete. They just have to go fix it – often in the face of people who say it’s impossible or are content with the status quo….

Hacking is also an inherently hands-on and active discipline. Instead of debating for days whether a new idea is possible or what the best way to build something is, hackers would rather just prototype something and see what works. There’s a hacker mantra that you’ll hear a lot around Facebook offices: “Code wins arguments.”

Hacker culture is also extremely open and meritocratic. Hackers believe that the best idea and implementation should always win – not the person who is best at lobbying for an idea or the person who manages the most people….

To make sure all our engineers share this approach, we require all new engineers – even managers whose primary job will not be to write code – to go through a program called Bootcamp where they learn our codebase, our tools and our approach. There are a lot of folks in the industry who manage engineers and don’t want to code themselves, but the type of hands-on people we’re looking for are willing and able to go through Bootcamp.

Now, listeners to Digital Campus will know that I am no fan of Facebook, which I abandoned years ago, and I’m not so naive as to swallow corporate boilerplate hook, line, and sinker. Nevertheless, it seems to me that in this case Zuckerberg was speaking from the heart and not the wallet. As Business Insider’s Henry Blodget pointed out in the days of Facebook’s share price freefall immediately following its IPO, investors should have read Zuckerberg’s letter as a warning: he really believes this stuff. In the end, however, whether it’s heartfelt or not, or whether it actually reflects the reality of how Facebook operates, I share my colleague Audrey Watters’ sentiment that “as someone who thinks a lot about the necessity for more fearlessness, openness, speed, flexibility and real social value in education (technology) — and wow, I can’t believe I’m typing this — I find this part of Zuckerberg’s letter quite a compelling vision for shaking up a number of institutions (and not just “old media” or Wall Street).”

There is a widely held belief in the academy that the labor of those who think and talk is more valuable than the labor of those who build and do. Professorial contributions to knowledge are considered original research while librarians and educational technologists’ contributions to these endeavors are called service. These are not merely imagined prejudices. They are manifest in human resource classifications and in the terms of contracts that provide tenure to one group and, often, at will employment to the other.

Digital humanities is increasingly in the public eye. The New York Times, the Los Angeles Times, and the Economist all have published feature articles on the subject recently. Some of this coverage has been positive, some of it modestly skeptical, but almost all of it has focused on the kinds of research questions digital humanities can (or maybe cannot) answer. How digital media and methods have changed humanities knowledge is an important question. But practicing digital humanists understand that an equally important aspect of the digital shift is the extent to which digital media and methods have changed humanities work and the traditional labor and power structures of the university. Perhaps most important has been the calling into question of the traditional hierarchy of academic labor which placed librarians “in service” to scholars. Time and again, digital humanities projects have succeeded by flattening distinctions and divisions between faculty, librarians, technicians, managers, and students. Time and again, they have failed by maintaining these divisions, by honoring traditional academic labor hierarchies rather than practicing something like the hacker way.

Blowing up the inherited management structures of the university isn’t an easy business. Even projects that understand and appreciate the tensions between these structures and the hacker way find it difficult to accommodate them. A good example of an attempt at such an accommodation has been the “community source” model of software development advanced by some in the academic technology field. Community source’s successes and failures, and the reasons for them, illustrate just how important it is to make room for the hacker way in digital humanities and academic technology projects.

As Brad Wheeler wrote in EDUCAUSE Review in 2007, a community source project is distinguished from more generic open source models by the fact that “many of the investments of developers’ time, design, and project governance come from institutional contributions by colleges, universities, and some commercial firms rather than from individuals.” Funders of open source software in the academic and cultural heritage fields have often preferred the community source model assuming that, because of high level institutional commitments, the projects it generates will be more sustainable than projects that rely mainly on volunteer developers. In these community source projects, foundations and government funding agencies put up major start-up funding on the condition that recipients commit regular staff time—”FTEs”—to work on the project alongside grant funded staff.

The community source model has proven effective in many cases. Among its success stories are Sakai, an open source learning management system, and Kuali, an open source platform for university administration. Just as often, however, community source projects have failed. As I argued in a grant proposal to the Library of Congress for CHNM’s Omeka + Neatline collaboration with UVa’s Scholars’ Lab, community source projects have usually failed in one of two ways: either they become mired in meetings and disagreements between partner institutions and never really get off the ground in the first place, or they stall after the original source of foundation or government funding runs out. In both cases, community source failures lie in the failure to win the “hearts and minds” of the developers working on the project, in the failure to flatten traditional hierarchies of academic labor, in the failure to do it “the hacker way.”

In the first case—projects that never really get off the ground—developers aren’t engaged early enough in the process. Because they rely on administrative commitments of human resources, conversations about community source projects must begin with administrators rather than developers. These collaborations are born out of meetings between administrators located at institutions that are often geographically distant and culturally very different. The conversations that result can frequently end in disagreement. But even where consensus is reached, it can be a fragile basis for collaboration. We often tend to think of collaboration as shared decision making. But as I have said in this space before, shared work and shared accomplishment are more important. As Zuckerberg has it, digital projects are “inherently hands-on and active”; that “instead of debating for days whether a new idea is possible or what the best way to build something is, hackers would rather just prototype something and see what works”; that “the best idea and implementation should always win—not the person who is best at lobbying for an idea or the person who manages the most people.” That is, the most successful digital work occurs at the level of work, not at the level of discussion, and for this reason hierarchies must be flattened. Everyone has to participate in the building.

In the second case, that of projects that stall after funding runs out, decisions about platforms, programming languages, communication channels, and deadlines are made for developers early in the planning process, and those decisions may deeply affect their work at the level of code, sometimes several months down the road. These decisions can stifle developer creativity or make their work unnecessarily difficult, both of which can lead developers to lose interest. Yet experience both inside and outside of the academy shows us that what sustains an open source project after funding runs out is the personal interest and commitment of developers. In the absence of additional funding, the only thing that will get bugs fixed and forum posts answered is committed developers. Developer interest is often a project’s best sustainability strategy. As Zuckerberg says, “hackers believe that something can always be better, and that nothing is ever complete.” But they have to want to make it better.

When decisions are made for developers (and other “doers” on digital humanities and academic technology projects such as librarians, educational technologists, outreach coordinators, and project managers), they don’t. When they are put in a position of “service,” they don’t. When traditional hierarchies of academic labor are grafted onto digital humanities and academic technology projects that owe their success as much to the culture of the digital age as they do to the culture of the humanities, they don’t.

Facebook understands that the hacker way works best in the digital age. Successful digital humanists and academic technologists do too.

[This post is based on notes for a talk I was scheduled to deliver at a NERCOMP event in Amherst, Massachusetts on Monday, February 11, 2013. The title of that talk was intended to be "'Not My Job': Digital Humanities and the Unhelpful Hierarchies of Academic Labor." Unfortunately, the great Blizzard of 2013 kept me away. Thankfully, I have this blog, so all is not lost.]

[Image credit: Thomas Hawk]

The Diversity Question in the Arts Blogosphere

Museum 2.0 - Wed, 02/13/2013 - 08:00
Every once in a while, I'll get a boring email inviting me to be part of some kind of blog salon on a particular topic, the idea being that all the bloggers who are contacted will write about that topic during the assigned month. This never seems like a good idea.

But this month, it's as if there was a subliminal email sent to a crew of bloggers in the arts suggesting a salon about audience diversity, and how/why to move in that direction. The posts are meaty and the commenting is robust. So this week, I want to honor this conversation with links to a few of the great posts and a couple other sources that inform the way I think about diversity and engagement.

Admittedly, many of these posts exist in a bubble of inter-referencing (which I am only exacerbating with this post):
  • Clay Lord weighs in on the data about audience representation in Bay Area theater, and the ways that a majority culture can impose its own value systems on others. A rare blog post that combines personal narrative with statistical charts.
  • Diane Ragsdale responds with some thoughts on how funders could influence these issues, whether they should, and how organizations might respond. She references my recent post about the Irvine Foundation's new approach to arts funding (which includes, but does not solely focus on, diversifying audience engagement).
  • Barry Hessenius follows up with more thoughts on "coercive philanthropy" and how and whether funders make change possible in the field.
  • And then Ian David Moss pulls it together with an interesting question about whether we're too focused on how to support and shift institutions instead of how to engage and empower individual people/audience members.
In some ways, what's more interesting is the world beyond this bubble. Some events:
  • Aaron Dworkin, a pretty amazing individual in many ways, is putting together SphinxCon, a conference happening this weekend in Detroit with a focus on "empowering ideas for diversity in the arts." You should go and tell us all about it.
  • I truly wish I could have attended Facing Race, which sounded like a completely awesome and transformative event this past fall in Baltimore. My sister attended, and I kicked myself about 87 times for not knowing about it or getting out there.
  • And Carlton Turner runs Alternate Roots, another incredible artists' organization with a focus on social change. It runs an annual conference/camp/experience in North Carolina which I have heard is mind-blowing.
And a couple museum-specific sites and resources:
  • I've become intrigued by the Incluseum blog, which is run by a group of museum folk in Seattle with a mission to encourage social inclusion in museums. Their interests run the gamut from issues of socio-economic inclusion to race, gender, and physical and mental abilities.
  • I recently met Jada Wright-Green, a museum professional who runs a site called Heritage Salon that looks at issues and possibilities in the African-American museum community. Jada is passionate about supporting the future of African-American heritage institutions and working to diversify the museum field as a whole.
  • The Center for the Future of Museums maintains a good list of top ten resources on demographic change as related to museums. While few are prescriptive in offering suggestions on how museums might meet the challenge of a changing population, they provide good research fodder for starting points.
  • And my favorite, unsurprisingly, is Elaine Heumann Gurian, who has written powerfully about the architecture of inclusion and exclusion in museums. Even amidst a sea of new books about museums and social change, I find myself reaching for Elaine's classics above all others.
Where do you fall in this conversation, and what resources have pushed your thinking about diversity?

O Knowledge Graph, Where Art Thou?

Data Mining - Mon, 02/11/2013 - 04:50

The web search community, in recent months and years, has heard quite a bit about the 'knowledge graph'. The basic concept is reasonably straightforward - instead of a graph of pages, we propose a graph of knowledge where the nodes are atoms of information of some form and the links are relationships between those atoms. The knowledge graph concept has become established enough for it to be used as a point of comparison between Bing and Google.

Last night, I went to see a performance of Kodo - regarded internationally as the premier taiko group. A search on Bing for 'kodo' produced the following result:

 

Bing showed good results for the web and images as well as a knowledge-driven portion of the answer from Wikipedia with links to play some of their songs. Not bad - but no mention of the performance.

As Kodo were performing at Meany Hall on the University of Washington campus, I did another search on Bing for the venue:

Here we see something better - the venue is recognized as a venue and consequently joined with the events that are known to Bing, including the concert I was attending. As the event information included a link to the performer (the blue Kodo link in the screen shot) I followed through and found Bing gave me event information.


In these interactions, we can see part of the promise of the knowledge graph, but many areas for improvement. The event node relates the performer to the venue to the event. However, the venue information in this part of the graph is isolated from that used to deliver the result for the query purely about the venue (note that the addresses are different - a common problem with campus and mall-like areas). The above experience, I think, shows the true challenge of the knowledge graph proposition - bringing all the isolated data graphs together correctly when the nodes in the graphs are actually representations of the same real world entities.
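The joining problem described above can be sketched in a few lines. This is only an illustration under loud assumptions: the record ids, name variants, addresses, and the `name_key` normalization are all hypothetical, and nothing here reflects how Bing or Google actually reconcile entities. It simply shows two isolated records being recognized as the same venue even though their addresses differ.

```python
from collections import defaultdict

# Hypothetical records from two isolated data graphs. All field values are
# invented for illustration and are not taken from any search engine.
RECORDS = [
    {"id": "venues:1", "name": "Meany Hall", "address": "address A (hypothetical)"},
    {"id": "events:7", "name": "Meany Hall for the Performing Arts",
     "address": "address B (hypothetical)"},
    {"id": "venues:2", "name": "Another Hall", "address": "address C (hypothetical)"},
]

# Filler words ignored when comparing names (an assumption of this sketch).
STOPWORDS = {"for", "the", "of", "performing", "arts"}


def name_key(name):
    """Crude normalization: lowercase, drop filler words, sort what remains."""
    tokens = [t for t in name.lower().split() if t not in STOPWORDS]
    return " ".join(sorted(tokens))


def merge_by_name(records):
    """Group records whose normalized names match into one entity each."""
    groups = defaultdict(list)
    for record in records:
        groups[name_key(record["name"])].append(record)
    return [
        {
            "ids": [r["id"] for r in group],
            "names": sorted({r["name"] for r in group}),
            # Conflicting addresses are kept side by side, not silently dropped.
            "addresses": sorted({r["address"] for r in group}),
        }
        for group in groups.values()
    ]


entities = merge_by_name(RECORDS)
```

Matching on a normalized name alone is far too weak for real data, since two different halls on two campuses can share a name. It is used here only to make the shape of the problem visible: the merged entity ends up carrying both addresses, which is exactly the conflict a real system would then have to resolve.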

Note that in exploring this particular scenario, Bing appeared to be doing a little better than Google, though Google had partial event information associated with the Kodo entity.


Because these names may come from listings supplied by different sources, the name of the performer is confusingly presented in different forms.

Much of what we see out there in the form of knowledge returned for searches is really isolated pockets of related information (the date and place of birth of a person, for example). The really interesting things start happening when the graphs of information become unified across type, allowing - as suggested by this example - the user to traverse from a performer to a venue to all the performers at that venue, etc. Perhaps 'knowledge engineer' will become a popular resume buzzword in the near future as 'data scientist' has become recently.
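The traversal from a performer to a venue to all the performers at that venue might look something like the sketch below. It assumes the graph has already been unified into typed edges. Only Kodo and Meany Hall come from the example above; the event ids, the second performer, and the second venue are hypothetical, as are the helper function names.

```python
from collections import defaultdict

# Hypothetical typed edges of the form (subject, relation, object).
EDGES = [
    ("event:1", "performer", "Kodo"),
    ("event:1", "venue", "Meany Hall"),
    ("event:2", "performer", "Example Quartet"),
    ("event:2", "venue", "Meany Hall"),
    ("event:3", "performer", "Example Quartet"),
    ("event:3", "venue", "Other Hall"),
]


def build_index(edges):
    """Index edges in both directions so the graph can be walked either way."""
    forward = defaultdict(set)   # (subject, relation) -> objects
    backward = defaultdict(set)  # (relation, object) -> subjects
    for subject, relation, obj in edges:
        forward[(subject, relation)].add(obj)
        backward[(relation, obj)].add(subject)
    return forward, backward


def venues_of(performer, forward, backward):
    """Performer -> events they perform at -> venues of those events."""
    events = backward[("performer", performer)]
    return {v for e in events for v in forward[(e, "venue")]}


def performers_at(venue, forward, backward):
    """Venue -> events held there -> performers at those events."""
    events = backward[("venue", venue)]
    return {p for e in events for p in forward[(e, "performer")]}


forward, backward = build_index(EDGES)
venues = venues_of("Kodo", forward, backward)
co_performers = {p for v in venues for p in performers_at(v, forward, backward)}
```

The event node does the joining work here, just as in the search results: performers and venues are never linked directly, so the walk only succeeds if both refer to the same event node.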

 

remix -- an aspect of all really popular media

if:book (The Institute for the Future of the Book) - Sat, 02/09/2013 - 23:40
Gangnam Style is being remixed and appropriated all over the planet. Reminds me of a wonderful recent piece by Tod Machover in which he talks about his daughter and her friends remixing as the principal way of sharing things they...

brilliant essay about snapchat

if:book (The Institute for the Future of the Book) - Sat, 02/09/2013 - 23:30
"Pix and It Didn't Happen" by Nathan Jurgenson in The New Inquiry. "A photograph is made of time as much as it is of light -- a frozen shutter-speed-size gap of the present captured within a photo border. Despite this,...

snapchat is a clear indication that we're entering the post-print era

if:book (The Institute for the Future of the Book) - Sat, 02/09/2013 - 16:44
The New York Times published an article today about Snapchat -- the service that lets you send photos and texts that quickly self-destruct as soon as the person you've sent them to has seen them. Impermanence is the point. Before...

Speaking Truth to Power -- SocialBook version

if:book (The Institute for the Future of the Book) - Sat, 02/09/2013 - 16:33
Here's a link to a SocialBook version of the Aaron Swartz Reader, Speaking Truth to Power. In addition to SocialBook's conversation layer, this version also includes a number of excerpted video clips....

Participation and Observation in Search

Data Mining - Sat, 02/09/2013 - 01:47

The early days of web search were essentially about observation. The web search engine observed the web (documents, links and user behaviours) and then delivered results based on those observations.

In recent years we have started to see more of a position of participation in web search engines. Examples of participation include:

  • Hosting web sites for businesses - by getting their data on the web, more useful targets are provided for users and a short loop is developed with the source of accurate data, i.e. the business.
  • Providing feed proxy services (like FeedBurner) - by providing a service to bloggers, the search engine gets access to valuable user information.
  • Hosting content - by hosting news articles and blogs directly, the search engine gets real time updates to content first as well as direct access to user behaviour.
  • Exposing data editing tools like map editors - by offering crowdsourcing tools, the search engine benefits the community by improving data and is the first to know about and leverage that fresh information.

Participation looks like a core strategy for search.


Echo is a project of the Center for History and New Media, George Mason University
© Copyright 2008 Center for History and New Media