This is not an existing tools project but rather a proposal for a tools project arising from the Digital Tools Summit at the University of Virginia. For more on the Summit see or notes at

Exploration of Resources

Joanna is interested in notions of "presence" in 18th-century French and English philosophers. She calls up her Scholar’s Aide (Schaide) utility to find the texts she wants to study. By clicking and dragging texts that meet her needs into Gatherings she creates a personal study collection that she can examine. An on-line thesaurus helps her put together a list of words in French and English that indicate presence (such as near and proche), and she searches for texts containing those words. She then launches a Schaide search that only looks in her Gathering, even though the texts are in different formats and at different sites. When she checks in after teaching her Ethics of Play class she finds a concordance has been gathered that she can sort in different ways and begin to study. She saves her concordance as a View to the public area on the Schaide Site so her research assistant can help her eliminate the false leads. Maybe she’ll use the View in her presentation at a conference next week once she’s found a way to visualize the results according to genre.

How can Humanists ask questions of scholarly evidence on the Web? Humanists face a paradox of abundance and scarcity when confronting the digital realm. On the one hand, there has been an incredible growth in the number and types of documents reflecting on our cultural heritage that are now available in digital form. Projects like Google Print will in the coming years dramatically expand that abundance. Tools for discovering, exploring, and analyzing those resources remain limited or primitive, however. Only commercial tools, such as Google, search across multiple repositories and across different formats. Such commercial tools are shaped and defined by the dictates of the commercial market rather than the more complex needs of scholars. The challenges faced by scholars using commercial search tools are:

• It is hard to ask questions across intellectually coherent collections. What the inquirer considers a collection is usually spread across different on-line archives and databases, each of which will have a different search interface.

• Many resources are inaccessible except with local search facilities and many are gated to prevent free access.

• You cannot ask questions that take advantage of the metadata in many electronic texts indexed by commercial tools.

• You cannot ask questions that take advantage of structure within electronic scholarly texts (such as those encoded in TEI XML).

• Where there is structure, it is rarely compatible from one collection to another.

• Collections of evidence are in different formats, from PDF to XML.

What kinds of tools would foster the discovery and exploration of digital resources in the humanities? More specifically, how can we easily locate documents (in multiple formats and multiple media), find specific information and patterns across large numbers of differently formatted documents, and share our results with others in a range of scholarly disciplines and social networks? These tasks are made more difficult by the current state of resources and tools in the humanities. For example, many materials cannot be freely crawled or discovered, either because they sit in databases that conventional search engines do not index or because they are behind subscription-based gates. In addition, the most commonly used interfaces for search and discovery are difficult to build upon. And the current pattern of saving search results (e.g., bookmarks) and annotations (e.g., local databases such as EndNote) on local hard drives inhibits a shared scholarly infrastructure of exploration, discovery, and collaboration.

The tasks are large, and many types of tools are needed to meet these goals. Among other things, our group saw the need for tools and standards that would facilitate:

• Multi-resource access that provides the ability to gather and reassemble resources in diverse formats and to convert and translate across those resources.

• A scholarly gift economy in which no one is a spectator and everyone can readily share the fruits of their discovery efforts.

• Serendipitous discovery and playful exploration.

• Visual forms of search and presentation.

The group nonetheless reached a strong consensus that the most important effort would be one focused on developing sophisticated discovery tools: tools that allow new forms of search, make resources accessible, and open them to the discovery of unexpected patterns and results. We described this as a “Google Aide for Scholars” (or Schaide in the story above), something much broader than the bibliographic tool Google Scholar, that would be built on top of an existing search engine like Google but would allow for much more sophisticated searches than Google does. Our talk of “Google” was not, however, meant to limit us to a particular commercial product but rather to signal that we were interested in building on top of the existing infrastructure created by multibillion-dollar search-industry giants such as Yahoo, MSN, and Google. Some of Schaide’s features would be:

• It would take advantage of commercial search utilities rather than replace them.

• It would allow scholars to create gatherings of resources that fit their research rather than be restricted by the boundaries of existing collections. These gatherings could be shared.

• It would allow scholars to formulate search questions in different ways that could be asked of the gatherings.

• It would allow scholars to ask questions that take advantage of metadata, ontologies and structure.

• It would negotiate across different formats and different forms of structure.

• It would allow researchers to save results for further study or sharing.

• It would allow researchers to view results in different ways.
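The core of these features (a gathering that spans sites, one search over it, and results saved as a sortable concordance) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Text` and `Gathering` classes, the sample sentences, and the `.example` source names are hypothetical, and the texts are assumed to have already been converted to plain text from whatever format they were published in.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Text:
    """One item in a gathering (hypothetical record, not an existing API)."""
    title: str
    source: str  # site or archive the text came from
    body: str    # plain text, assumed already extracted from PDF, XML, etc.


@dataclass
class Gathering:
    """A scholar's personal study collection drawn from several sites."""
    name: str
    texts: list = field(default_factory=list)

    def add(self, text):
        self.texts.append(text)

    def concordance(self, words, width=30):
        """Return keyword-in-context rows for any of the given words."""
        pattern = re.compile(
            r"\b(" + "|".join(re.escape(w) for w in words) + r")\b",
            re.IGNORECASE,
        )
        rows = []
        for text in self.texts:
            for match in pattern.finditer(text.body):
                start, end = match.span()
                rows.append({
                    "title": text.title,
                    "keyword": match.group(0).lower(),
                    "left": text.body[max(0, start - width):start],
                    "right": text.body[end:end + width],
                })
        return rows


# Invented sample texts standing in for documents held at different sites.
gathering = Gathering("presence")
gathering.add(Text("Sample English text", "archive-a.example",
                   "The object is near to the mind, and near to the senses."))
gathering.add(Text("Sample French text", "archive-b.example",
                   "L'objet est proche de l'esprit."))

rows = gathering.concordance(["near", "proche"])
# One possible "view": sort the concordance by keyword, then by title.
rows.sort(key=lambda row: (row["keyword"], row["title"]))
for row in rows:
    print(f'{row["left"]:>30} [{row["keyword"]}] {row["right"]}')
```

Keeping each row as a plain record, rather than formatted text, is what would let the same results be re-sorted, filtered, or handed to a visualization later.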

Just as Google and the other search engine companies have created an essential search infrastructure that a tool-building effort like ours needs to leverage, there are also specific tool-creation efforts underway that we should at least examine closely and perhaps even embrace. Several were mentioned and discussed as part of the brainstorming process: Pandora (a search tool for music); Content Sphere (a personal search engine developed by Michael Jensen); Meldex (another music search tool); Syllabus Finder and H-Bot (tools that make use of the Google API, developed by Dan Cohen at CHNM); Firefox Scholar (a scholarly organization and annotation tool, also from CHNM); I Spheres (middleware that sits on top of digital collections); TAPoR (an online portal and gateway to tools for sophisticated analysis and retrieval based at McMaster University); Antarctica (commercial data mining by Tim Bray); Citeseer; Proximity (a tool for finding patterns in databases developed by Jensen); personal search from commercial search engines (Google personal search and Yahoo Mindset); Amazon’s A9; Clusty; and data-mining packages (NORA, D2K, and T2K from NCSA).

We developed several key specifications for this new Google for Scholars. It would be extensible through web services and, hence, might work as a plug-in to Firefox or some other open client. It would be transparent in the sense that it would let you see how it was working rather than simply hiding its magic behind the scenes. It would also offer customizable utilities like a “query builder” that would allow you to write your own regular expressions and ontologies. Most important, it would be able to plug in any ontology; filter results in complex ways and save those filters; classify and tag results; and display, aggregate, and share search results.
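A minimal sketch of what a query builder with a pluggable ontology and saved filters might look like follows. Everything in it is an assumption made for illustration: the toy `PRESENCE` ontology, the `build_query` and `make_filter` helpers, and the sample result records are all hypothetical, and a real ontology would be far richer than a mapping from one concept to a word list.

```python
import re

# A toy "ontology": concept -> terms. Purely illustrative.
PRESENCE = {"presence": ["near", "proche", "present", "présent"]}


def build_query(ontology, concept):
    """Expand a concept into one case-insensitive regular expression."""
    terms = sorted(ontology[concept], key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(?:" + alternatives + r")\b", re.IGNORECASE)


def make_filter(**criteria):
    """Return a reusable filter over result records (plain dicts)."""
    def keep(record):
        return all(record.get(key) == value for key, value in criteria.items())
    return keep


# Invented result records standing in for search hits.
results = [
    {"title": "Essay A", "genre": "essay", "language": "en", "text": "Nothing is near."},
    {"title": "Letter B", "genre": "letter", "language": "fr", "text": "Il est proche."},
    {"title": "Essay C", "genre": "essay", "language": "fr", "text": "Rien ici."},
]

query = build_query(PRESENCE, "presence")
# Transparency: show the scholar the expression that will actually be run.
print(query.pattern)

# Filters are stored by name so they can be saved and reapplied later.
saved_filters = {"french": make_filter(language="fr")}

hits = [r for r in results if query.search(r["text"])]
french_hits = [r for r in hits if saved_filters["french"](r)]
# Tag the surviving results without modifying the original records.
tagged = [dict(r, tags=["presence"]) for r in french_hits]
```

Because the ontology is passed in as data rather than built into the query code, swapping in a different ontology would not require changing the search logic.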

But the success of such a tool also rests on the formatting of the resources that it seeks to access for the scholar. Scholarly resources, whether commercial aggregations (such as ProQuest Historical Newspapers), digital libraries (such as American Memory and Making of America), gated repositories of scholarly articles (such as JSTOR), or especially the emerging mega-resource promised by Google Print, need to be visible and open. Achieving that goal is more of a social and political problem than a technical challenge. But we can facilitate that goal by offering guidelines for how to make a site visible through existing and emerging standards, such as OAI and the XML approach followed by Google.
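To make the OAI reference concrete, here is a sketch of how a harvesting client might form a request, on the assumption that "OAI" here means the OAI Protocol for Metadata Harvesting (OAI-PMH). The `ListRecords` verb and the `oai_dc` metadata prefix come from that protocol; the repository address and the `list_records_url` helper are hypothetical.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        # Optionally restrict the harvest to one set within the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical repository endpoint, used only for illustration.
url = list_records_url("https://repository.example.org/oai")
print(url)
```

A repository that answers requests of this shape exposes its metadata to any harvester, which is the kind of visibility the guidelines would aim to encourage.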

In general, then, we see on the one hand the need for a lobbying group that will promote making resources openly available and discoverable. On the other hand, we believe that the actual tools development can proceed in an incremental and decentralized fashion through three different development groups: (1) a group developing a client-based tool (perhaps built into the browser) that can access multiple resources while building on Google; (2) a group developing a server-side repository that would aggregate information from searches and annotations; and (3) a decentralized group (or set of groups) that would write widgets, web services, and ontologies that would operate in the extensible client software as well as off the server.

Written by Roy Rosenzweig and Geoffrey Rockwell

