Many thanks to Jabin White, Director of Strategic Content for Wolters Kluwer Health Professional & Education division, for my new “elevator speech” for explaining the Semantic Web (originally conceived by Tim Berners-Lee). My “aha” moment came during Jabin’s presentation at a recent Association of American Publishers Professional and Scholarly (AAP PSP) division seminar on semantic publishing. Here’s the relevant slide:
The Semantic Web will both require and consume metadata of the type shown in the right-hand, XML-labeled column. On today’s HTML-based web, Jabin’s shopping list is tagged so that it can be displayed as a list, but not much else. The future’s semantic shopping list defines the list as relating to a trip to the grocery store, and adds a category for each list item, allowing a computer to do much more with it than simply rendering it in list format. With the right “app,” Jabin’s semantic list could be sorted to match the aisle-by-aisle layout as he walks into a grocery store. Or it could warn him that his ratio of grain to veggie will blow his diet.
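To make the idea concrete, here is a minimal sketch of what an "app" could do with a semantic shopping list. The XML element and attribute names, the aisle numbers, and the function names are all assumptions for illustration; they are not taken from Jabin's slide or from any standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical semantic markup; element and attribute names are illustrative.
SEMANTIC_LIST = """
<shoppingList purpose="grocery">
  <item category="dairy">milk</item>
  <item category="grain">bread</item>
  <item category="vegetable">carrots</item>
  <item category="grain">rice</item>
</shoppingList>
"""

# Assumed store layout: category -> aisle number.
AISLE_BY_CATEGORY = {"vegetable": 1, "grain": 4, "dairy": 9}


def parse_items(xml_text):
    """Return (name, category) pairs from the semantic list."""
    root = ET.fromstring(xml_text)
    return [(item.text.strip(), item.get("category")) for item in root.iter("item")]


def sort_by_aisle(items, aisles):
    """Order items by the aisle in which their category is shelved."""
    return sorted(items, key=lambda pair: aisles[pair[1]])


def grain_to_veggie_ratio(items):
    """Ratio of grain items to vegetable items (None if there are no vegetables)."""
    grains = sum(1 for _, category in items if category == "grain")
    veggies = sum(1 for _, category in items if category == "vegetable")
    return grains / veggies if veggies else None
```

Because each item carries a category, the same list can drive both the aisle ordering and the diet warning. A plain HTML list offers nothing comparable to compute on.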
Of course, still to come to enable a full realization of the Semantic Web is a controlled vocabulary (or taxonomy, or ontology) to serve as the authority file—the “single source of truth”—for these semantic categories and concepts, so that systems can understand that “veggie” is the same thing as “vegetable,” and that “grain” in this context doesn’t refer to sand or bullet weights. You can imagine the challenges of language variation and disambiguation as we move from grocery shopping to more complex domains such as in the scientific, technical, and medical world.
In the medical domain, the National Library of Medicine’s UMLS Metathesaurus serves as just such an authority file. UMLS maps more than 100 terminologies used in health care (for insurance reimbursement and billing codes, disease diagnoses and treatments, drug information, medical literature concepts, and so on) to a single unique concept ID. Silverchair’s Cortex biomedical taxonomy is connected to UMLS through this unique ID, but our team of taxonomy and subject matter experts works continually to make sure that Cortex and its companion Equivalents Server include the new concepts and evolving language that we encounter daily in the medical literature and through user interactions with our applications. We add new concepts and equivalents every working day, but they often don’t make it into the UMLS mapping for months or even years, which is an eternity in a fast-changing field like medicine.
But even with its drawbacks, UMLS gives medicine a significant leg up in terms of semantic infrastructure. At Silverchair we are betting that the promise of the Semantic Web will be realized for medicine well before most other domains.
We’re looking forward to the report that comes from this survey posted by the Gilbane Group, designed “to identify a number of ‘pain points’ or barriers encountered by book publishers when it comes to developing or expanding digital publishing programs…” and encourage you to complete it if you’re on the front lines of digital transformation. We have some ways to help with those obstacles and will comment on the report after it’s published.
The thesaurus supporting our Cortex medical taxonomy is distinguished from other standards by its inclusion of “real-world” equivalents. We generally call these “equivalents” rather than synonyms because we include things that arguably aren’t purely synonyms—jargon or shorthand versions of medical terminology that we run across in the medical literature we tag. More often, though, we learn about these equivalents (and common misspellings, which we also add to our thesaurus) by reviewing search queries submitted by real users to the sites we’ve built. Some examples: “C diff” for “Clostridium difficile,” “FB in foot” for “foreign body in foot,” “P4P” for “pay for performance,” “echo” for “echocardiography.”
Unlike some taxonomies that have a more “academic” (read: stodgy) approach to what is considered a synonym, we put real-world equivalents in our thesaurus because we want it to work for real-world users. Many users of our health care information sites are pressed for time and are looking for an answer to a specific question. They shouldn’t have to think very hard about how to structure a query so that a search engine can understand it. It’s our job to be knowledgeable about both their language and their lingo. At Silverchair, we believe the searcher is never wrong (our version of “the customer is always right”).
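As a rough illustration of how real-world equivalents can help a searcher, the sketch below resolves a query to a concept through a small lookup table. The table, the concept IDs, and the function names are invented for this example; it is not how Cortex or the Equivalents Server is actually implemented.

```python
import re

# Illustrative entries only; the concept IDs are made up for this sketch.
EQUIVALENTS = {
    "c diff": "CONCEPT-0001",
    "clostridium difficile": "CONCEPT-0001",
    "fb in foot": "CONCEPT-0002",
    "foreign body in foot": "CONCEPT-0002",
    "p4p": "CONCEPT-0003",
    "pay for performance": "CONCEPT-0003",
    "echo": "CONCEPT-0004",
    "echocardiography": "CONCEPT-0004",
}


def normalize(query):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", " ", query.lower())
    return " ".join(cleaned.split())


def lookup_concept(query):
    """Return the concept ID for a query, or None if the term is unknown."""
    return EQUIVALENTS.get(normalize(query))
```

With a table like this, a hurried "C. diff" and a carefully typed "Clostridium difficile" land on the same concept, so the searcher never has to guess which form the system prefers.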
Bob Wachter, with whom we’re privileged to work on two sites sponsored by the Agency for Healthcare Research and Quality (PSNet and WebM&M), recently wrote a humorous blog post about the way his hospital colleagues at UCSF (and other hospitals) commonly turn the nouns of their everyday work life into verbs as a shorthand way of communicating. For example, a resident might report that she “heparinized” her patient, or that a patient ready for discharge has been “housed and spoused,” meaning it has been determined that the patient has somewhere to go and someone to care for him. In addition, he reports new terms that grow out of healthcare IT functionality, such as the way buttons are named in an EHR (“I done-ed it”).
That post is a fun reminder of the many ways medical lingo—and language—evolve, and the importance of attentive, systematic approaches to managing and supporting the information needs of those who invent the common parlance of their specialty in the course of doing their work (we hope, while using the sites we develop for them).
Two recently released eBook surveys—the HighWire Press 2009 Librarian eBook Survey and Aptara’s eBooks: Uncovering Their Impact on the Publishing Market—serve as a catalyst to consolidate a set of inchoate notions that have multiplied in this season of bustling enthusiasm for everything “eBook.” Namely, what exactly is an eBook?
I propose that there exists broad confusion about what constitutes an eBook, and that this definitional problem is interfering with decision making by both content producers and consumers.
Any major product introduction by Apple commands intense awareness, and the iPad was no exception. After the Kindle surge over the recent holidays (Atlantic Monthly reported that on Christmas Day 2009, Amazon’s “e-book” [quotation marks mine] sales exceeded hard-copy sales for the first time in Amazon’s history), the lay public probably thinks it has a working definition. Likewise, the term is often invoked in the publishing world as if it were definitive: the recent O’Reilly Tools of Change conference in New York included no fewer than 11 sessions with “eBook” in the title, a list that excludes many more that were substantively about the topic despite not meeting this particular criterion.
But in practical usage, a shared definition remains elusive. I have seen everything from a Kindle download to relational content databases referred to as an eBook. Wikipedia says an eBook is “an e-text that forms the digital media equivalent of a conventional printed book,” while the OED defines an eBook as “an electronic version of a printed book which can be read on a personal computer or hand-held device designed specifically for this purpose.” However, Planet eBook says it “can be anything from the digital version of a paper book, to more interactive content that includes hyperlinks and multimedia. It can even be the electronic reading device such as a Rocket eBook or Pocket PC.”
For purposes of this discussion, let’s oversimplify and identify two possible definitions: a Simple eBook (one title, unconnected, Wikipedia/OED-style) and a Rich eBook (interactive, interconnected, Planet eBook-style).
The HighWire survey, aimed at librarians, contains significant evidence of confusion between the two poles, both in its questions and in the user responses and comments. For example, responses to the final question, “Are there any additional important factors not covered above that publishers should consider when publishing ebooks?” include:
- “eBooks are most useful in medicine, computer science, and other areas where frequent updates are important.”
- “Users use eBooks to answer a particular question, not to find a particular book. Therefore, a critical mass of material within one collection is important.”
- “Promote title‐level and page‐level linking via social software and course management software.”
- “eJournals have shown that it is possible for electronic publishing to transform the delivery and use of information, and the same can happen with eBooks.”
These perspectives strongly imply a Rich eBook model. But in response to the same query, others said:
- “…linking index to actual page numbers (is an) important feature”
- “Content should be available for course reserve”
- “A common user interface for ebooks is needed. This could be an open‐source product”
…responses that suggest an understanding of “eBook” much more like our Simple model.
The Aptara survey, which collects responses from publishers, exclusively (if not explicitly) implies the Simple eBook model. One can quickly discern this focus in the question about device support. Note that the percentages sum to 100, even though certain of these platforms overlap: Kindle has an iPhone app, for instance, and many web-based content applications offer alternative interfaces optimized for smartphone access. (Rich eBooks will generally not function in dedicated hardware environments.)
As I encounter people trying to plan for their future enterprises (be they authors, publishers, service providers, librarians and other licensees, or individual content consumers), I find many are conflating information about this very disparate continuum of information products. If the terms of the research are ambiguous or incomplete, the research risks obscuring market realities. For instance, HighWire’s survey reports that 81% of respondents have a budget for electronic resources above $100,000, yet 91% report that “eBooks” represent less than 10% of their overall collection budget. Not only do these two questions mix numeric scales ($ vs. %), but they fail to compare the same resources, first citing “electronic resources” then switching to “eBooks.” What comprises the former category? Is it electronic journals plus eBooks? If so, what is included in and excluded from the latter? Where do electronic databases fit? What about what we’re temporarily calling “Rich eBooks”?
In professional publishing, the simple concept of a downloadable unit, the content of which is the precise equivalent of the print, has broken down. Sure, a reader can buy professional eBooks that meet this definition. But more important, prevalent, and useful (despite the current eBook hype) are content systems that are addressing the fundamental purposes of a professional book but radically rethinking the means by which we achieve them. This rethink expresses itself in both the creation and the consumption modes of the Resource Formerly Known As The Professional Book.
So, what are the purposes of a professional book? Education, broadly defined: we read a professional book to learn, either in a formal educational context, or in a maintenance/current-awareness mode. The primary audience reads to edify itself, at whatever level of abstraction and practical applicability, about subjects related to its professional activities. This is distinct from (if overlapping at the margins with) reading at leisure/for pleasure, which is the primary purpose of trade books. Authoring a professional book is an exercise in consolidating and reporting the state of the art, proposing new perspectives and ideas that will be added to a foundation of shared understanding, and/or integrating concepts from disparate disciplines. An author (or editor) gathers and organizes preexisting information, synthesizes it, and sometimes posits new or alternative conclusions, attempting to expand on the shared foundation of the domain.
Given the technological tools available, the simple eBook as defined above is insufficient to support the central function of professional books. It confines a reader to a quarantined jurisdiction, administered by an individual author or editor, fixed at the moment of its publication, unconnected to related information resources that are often crucial to the context of the shared domain.
As a result, I suspect the prospects for the Simple eBook model are much more limited in the professional domain than they are in the world of trade, where it is far more closely aligned with the fundamental purpose of reading (and helpfully solves certain problems like portability and access). Instead, the Resource Formerly Known As The Professional Book will, in the service of its fundamental educational purpose, morph into something much more interactive, dynamic, and temporal than a Simple eBook. In fact, resources like this have already developed, resources that make the professional (Simple) eBook obsolete even as the format gains ascendancy in the popular imagination and the trade market. If you think about it, you’re likely already familiar with rudimentary examples, and you probably use some of the tools that will converge to make this professional post-book a reality.
For instance, think of aggregated content applications—in medicine, products like MDConsult (from Elsevier) and AccessMedicine (McGraw-Hill). These systems allow a user to focus on addressing the educational (again, in the broadest sense) objective with multiple sources of information conveniently organized into a single interface. Even this now well-understood innovation renders the education process far more efficient than the use of individual eBooks. How about Wikipedia or YouTube, continuously evolving reference databases that are good places to start many investigations (and could never be a book)? Social networks, from the 800-pound gorilla Facebook to the exclusive and targeted CardioExchange, are increasingly being used to annotate and aggregate content around professional domains. And Macmillan has taken a next step (as have a number of large educational publishers, in various ways) with its launch of Dynamic Books, a platform that will allow wide-scale annotation, reordering, and other modifications of its textbooks.
Speaking of which, consider the content creation/generation side of the equation. Professionals benefit from, and increasingly rely upon, information that is vastly more current than that provided by books (paper or E). Networking technology has made the exchange of what we will broadly call “updates” massively more accessible. Updates—my placeholder term encompassing all sorts of timely (and often time-sensitive) content—can include blog posts, wiki edits, dynamic news links, peer-reviewed syntheses, and so forth. Furthermore, connectivity to and between dynamic data sources (in medicine, for example, clinical trials registries, guideline clearinghouses, drug databases) can likewise enrich and propel vastly more integrated, comprehensive, and effective educational experiences.
Now imagine a future in which relevant information around a theme of user interest is dynamically aggregated across previous boundaries (publisher, format, repository, etc.). Wide deployment of digital publishing infrastructure (syntactic and now semantic XML, ontological standards, proliferation of web services standards, registries like DOI, and so on) is in place to begin delivering user experiences of ever-increasing richness, connectivity, efficiency, and dynamism. This is not your father’s Kindle.
(By the way, I am not a wide-eyed arcadian with respect to content sources. Nothing I’ve said should be taken to suggest that standards, pedigree, quality control, peer review, and other editorial values are either passé or automatic. Rigor in organizing and curating the Resource Formerly Known As The Professional Book will continue to be essential, and these qualities will be required to establish a new value proposition on top (not in lieu) of authors’ and publishers’ domain expertise. As seems to always be the case in publishing, it’s more work on top of the old work.)
So, how does a Simple eBook compete with this vision? Not favorably. I’m hard pressed to envision more than a peripheral role for it in the professional information sphere of the future. Now, if we redefine eBook as something more comprehensive… or maybe we should just give it a new name? What are your suggestions?
|[your term here] (Rich eBook)||Simple eBook|
|Dynamic Content||Static Content|
|Interactive Content||Unconnected Content|
|Multimedia||Text, Limited Graphics|
|Social-media Enabled||Isolated to Individual Utility|
|Continuity (Subscription) Revenue Model||Single-sale Revenue Model|
|Non-proprietary Device & Platform||Proprietary Single-use Device & Platform|
|Addresses Fundamental Educational Process||Replaces a Single Book 1-for-1|
|Lives in the Cloud||Stored on Your Device|
|Transforms Pedagogy||Reduces the Weight of Your Backpack|
As we at Silverchair and Semedica see more and more interest in automated tagging solutions (such as our Tagmaster system), we are more frequently encountering questions about how to evaluate their results. Here are a few ideas on the subject:
Evaluation: Humans Required!
It is hard to get around the fact that you will need human editors (or professional indexers) and your human technology team (who will use the tags to create interesting new features) to verify that an automated system is working correctly and that the tagging is accurate and useful.
Recently, someone asked our CEO Thane Kerner if we had an automated system to verify the accuracy of our automated tagging. Thane replied (rather cheekily, I must say): “If we had an automated review system that could measure tagging accuracy more precisely than the current tagging system, we wouldn’t use it to verify tags, we’d use it to tag the content to begin with!” The lesson: Once you’ve deployed your best automated system to do the tagging, humans are the next logical reviewers.
Here are four factors your humans should consider in their review:
1. Expert/Editorial Accuracy Confidence
One key target for evaluation is to assess how much confidence your key stakeholders (journal boards, editors, etc.) express in the output of the system. But confidence is not a linear equation. I posit the following values:
- Impeccable tag placement: +1
- Debatable tag placement: −1
- Debatable tag omission: −1
- Obvious tag omission: −10
- Obvious irrelevant tag placement: −50
The first thing you’ll notice is how heavily the negative values outweigh the positive. In high-stakes fields (including science and medicine), humans are naturally biased to weigh negative experiences more heavily than positive ones. (Of course, this has served us well in survival: “Don’t eat that type of berry again, it made you sick last time!”) What that means in terms of confidence is that stakeholders will need a disproportionate amount of positive reassurance to get over negative outcomes. And the impact of a particularly egregious negative outcome (resulting from a particularly poorly placed tag) can be devastating to your stakeholder’s impression of a tagging system. (This is why Silverchair’s system defaults to using conservative methods with very little “guessing” to avoid obvious irrelevant tag placement.)
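For a feel of how lopsided these values are, here is a small sketch that tallies a reviewer's findings using the weights posited above. The outcome names and the simple weighted sum are my own simplification for illustration; real stakeholder confidence is certainly messier than a single number.

```python
# Weights from the list above; the key names are invented for this sketch.
WEIGHTS = {
    "impeccable_placement": 1,
    "debatable_placement": -1,
    "debatable_omission": -1,
    "obvious_omission": -10,
    "irrelevant_placement": -50,
}


def confidence_score(review_counts):
    """Sum the weighted outcomes from one human review pass."""
    return sum(WEIGHTS[outcome] * count for outcome, count in review_counts.items())


# A hypothetical review: mostly good tags, plus two serious misses.
sample_review = {
    "impeccable_placement": 60,
    "debatable_placement": 5,
    "obvious_omission": 1,
    "irrelevant_placement": 1,
}
print(confidence_score(sample_review))  # 60 - 5 - 10 - 50 = -5
```

Sixty impeccable tags are not enough to offset one obvious omission and one obviously irrelevant tag, which is the practical argument for conservative tagging.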
2. Usefulness
The next key target for both editorial and technical stakeholders to assess is the usefulness of the tagging applied. Tags should be highly relevant in a domain-specific context and they should drive better discoverability and linking. Primary care, genetics, surgery, and emergency care all take very different approaches to the same topics, and their tagging should reflect their uses.
The tagging system you are evaluating may have added tagged concepts that are tangential or irrelevant to the use model of the content, and such tags would not be capable of driving innovative site features (in many cases, tangential tagging actually inhibits the ability of new systems to work effectively). For example, it is a nice-to-have if your tagging system can recognize place names and person names, but if it misses or miscategorizes important topics like clinical trial names it doesn’t matter how many people or places it can tag. (Clinical trial acronyms can be particularly tricky to tag; see our post about them.)
3. Granularity
Does the system still work with “documents” or can it identify topics down to the section/paragraph/figure/table/equation level? At Silverchair we work with many dense medical chapters that may cover more than 200 distinct topics, so we see it as a necessity for our tagging system to break those documents down into smaller parts in order to deliver precise packets of highly relevant information to our users.
4. Control and Ongoing Improvement
Any system selected is not going to be extremely accurate “out-of-the-box.” (I write that as a realist, not as a pessimist!) So during evaluation you must ask, “How easy is it to make impactful positive changes to the system?” Systems handle this in different ways: some suggest manually selecting training documents for each topic or category (which can get onerous when you have 20,000 topics), some allow your software developers to go in and tinker with the code (you have data classification expert software developers, right?!?), and some allow you to load and use a taxonomy or thesaurus to aid in topic identification and tagging (this assumes a taxonomy/thesaurus exists or can be created for your domain).
At Silverchair, we work primarily in medicine, which is a taxonomy-rich domain with an ever-growing list of topics. For that reason, we’ve chosen the last method as our control and improvement strategy. Our editors update our Cortex medical taxonomy and its related thesaurus every day to keep pace with the topics being written about and searched for.
If you choose a system that 1) is accurate enough to instill confidence in your editorial team, 2) is useful enough to drive meaningful new features and improvements, 3) classifies your data at a granular level, and 4) is flexible enough to allow explicit control and ongoing improvements, you’ve made a wise purchase!
Clinical trials are popular targets of searches in medical journals. To deliver accurate search and browse results for them, semantic tagging and a semantic search engine are essential.
The names of clinical trials are often long and unwieldy, as they try to describe the focus and mission of the trial in their name. For example, a clinical trial studying drug treatment of high cholesterol is “Arterial Biology for the Investigation of the Treatment Effects of Reducing Cholesterol 6–HDL and LDL Treatment Strategies.” Because of these long names, trials are more commonly known by their acronyms (in this case, the “ARBITER 6–HALTS” trial), and no doubt their full names are crafted to yield a catchy, apropos, or hopeful acronym. For example, the acronym for the trial studying the effect of the drug Vytorin on cholesterol levels is “IMPROVE-IT.” (See this blogpost for some humorous trial names and acronyms.)
One of my pet peeves is the incorrect use of the word “acronym” to mean any abbreviation for a term. Actually an abbreviation is also an acronym only when the abbreviation spells a word or is a combination of letters that people can pronounce as a word. So yes—abbreviations of clinical trials are acronyms, and ah, there’s the rub for commonly used full-text nonsemantic search engines. A full-text search engine treats them like any other word.
So yikes—a PubMed search for “JUPITER” (the acronym for the trial “Justification for the Use of Statins in Prevention: an Intervention Trial Evaluating Rosuvastatin”) delivers the first two results correctly, but the third result appears because the name of the institution that issued the paper is in Jupiter, Florida! OK so yes—the PubMed search box tries to help you by suggesting “Jupiter trial” (98 results) … but it also suggests “Jupiter study” (257 results). People—the JUPITER trial and the JUPITER study are exactly the same thing to any searcher wanting to know about JUPITER. The number of results should be the same for both searches. And nobody searching PubMed for JUPITER wants to know more about Jupiter, Florida. Trust me.
We can do better. At Silverchair, our Cortex taxonomy contains a list of clinical trials and the accompanying thesaurus includes their acronyms, so when our tagging and retrieval systems encounter those concepts, we’re able to separate them from their normal English language counterparts and tag them correctly. Yet another benefit of an automated tagging system supported by a robust and up-to-date medical thesaurus. It understands medical information and the health care professionals who depend on it so that we can give them results, not guesses.
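To show the general idea (not our production tagger), here is a toy sketch that treats trial acronyms as concepts rather than ordinary words. The thesaurus entries, the concept IDs, and the case-sensitive matching rule are all assumptions made for this example.

```python
import re

# Illustrative thesaurus: trial acronym -> concept ID (IDs are made up).
TRIAL_THESAURUS = {
    "JUPITER": "TRIAL-JUPITER",
    "IMPROVE-IT": "TRIAL-IMPROVE-IT",
}


def tag_trials(text):
    """Return concept IDs for trial acronyms that appear with exact casing."""
    found = []
    for acronym, concept_id in TRIAL_THESAURUS.items():
        # Require the acronym to stand alone, not inside a longer token.
        pattern = r"(?<![\w-])" + re.escape(acronym) + r"(?![\w-])"
        if re.search(pattern, text):  # case-sensitive on purpose
            found.append(concept_id)
    return found
```

Under this rule, "the JUPITER trial" and "the JUPITER study" resolve to the same concept, while "Jupiter, Florida" produces no trial tag at all. Exact casing is only a crude stand-in for real disambiguation (an all-caps title would defeat it), which is why a curated thesaurus and human review still matter.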
As we were setting up a new external SAN (storage area network) on the Silverchair production web farm recently, the network engineer said something that caught my attention: “The web servers will be able to use the external SAN drives faster than their own internal memory.” At first that defied my expectations of “internal vs. external,” but when I thought about it more, it made perfect sense.
The web servers are designed to execute application logic, store session tracking data, handle user interaction input, and synthesize, parse, and display data from a variety of sources—they are logic processing engines that handle data storage only when necessary. On the other hand, the SAN has one purpose—to store a large amount of data and enable a super-efficient data delivery channel that rapidly responds to content requests from the web servers.
The more I thought about it, the more I realized it was a fitting metaphor for how humans work. We are fantastic logic processing engines. We parse, synthesize, analyze, and use data input from a variety of sources to perform creative problem solving. And most importantly to this metaphor, we only store data internally when absolutely necessary. In the present day, the comprehensiveness and ubiquity of the Internet have allowed us to store an unprecedented amount of collective memory in external sources and access it from wherever we may be.
To be clear, human use of external memory did not arrive with the Internet—it has been around since the beginning of civilization. We are used to storing memory in external sources and freeing up our internal resources. Papyrus eliminated the need to memorize long epic poems. Abaci eliminated the need to memorize multiplication tables. (NB: Don’t try telling that to a 2nd grade teacher.) In modern medicine, drug handbooks store dosage and safety information that is too complex for doctors to memorize in toto. Phone numbers stored in our mobile phones eliminate the need to memorize the phone numbers of friends. We even store memories in our friends and family—I recently asked my wife, “What was the name of that hotel we liked in Chicago?” She knew, and voila, I had accessed my external memory successfully.
Alas, my comparison of human activity to Silverchair’s web farm breaks down at a key point. In many cases, accessing our external memory is not fast and efficient. Currently the external memory sources of humans are not deployed as efficiently as a SAN. Internet content sources can be hard to access, store content in highly variable forms, require a special vocabulary or technique to query, and return data in a way that does not suit our purpose.
This is the fundamental problem that Silverchair’s Semedica division addresses with semantic enrichment of data sources. We’re organizing a specific external memory category (in our case, online medical and health care information) in a way that allows it to be accessed more quickly and to return data in the right form for efficient use by clinicians and researchers. The less data that health care workers need to store internally, the more of their “processing time” can be used toward envisioning creative solutions for preventing and curing diseases. That is something that the Internet cannot do. (Yet.)