What Happens When My Metadata Leaves The House, Part II

In a previous post, I talked about the issues surrounding getting your data ingested by third parties. This week, let’s think: What are some of the issues around sending OUT metadata?


We have systems, which can be outdated, chaotic, and incompatible. We have a functional process – who does what when – which is riddled with exceptions. And we have a single source of truth that holds all the most updated information, but has to export that information in a lot of different ways.

Relationships are what get us through the gaps in tech, staffing, and standards.

This industry was built on relationships: Agents knowing the tastes of editors. Marketers knowing the tastes of bookstore buyers. Editors knowing what works for library acquisitions. Human relationships – our work friends – keep this industry in business in the face of imperfect systems. And that’s not appealing to some folks – which is also why we have a thriving indie scene with relationships of its own – some see these relationships as forming a gate-keeping system, a sense of exclusivity.

But for the most part, the relationships are there to smooth the bumps in the workflow. Sudden price change? Call up your vendor partner at B&N. Want to test out an idea for a project? Reach out to the collections development librarian to see if she thinks it’s viable. Need to recall a book or cancel a publication, and the ONIX feed may or may not get ingested? Call up your trading partners and they’ll do a manual update to their records.

Our relationships are the hack that makes it all work. The relationship between publisher and bookseller existed before there were systems to facilitate information exchange.

So, as with any relationship, communication is the highest priority. Much of that communication is automated with metadata transmissions – but as with anything automated, there are going to be exceptions that have to be dealt with manually. Which means an email or a phone call, or an in-person chat at a conference to gain or give clarification about what’s expected. Maybe the answers aren’t what we want to hear, but at least we know how to solve the problem.

I’d like to say that the one thing we can all count on is that everybody in the business wants it to work. And in most cases that’s true.

You do have the “digital disruptors” – Apple, selling devices; Google, selling ads; Amazon, selling everything – who don’t necessarily have the book industry’s best interests at heart. And that’s where we’re experiencing the greatest friction. It’s not necessarily with computer systems. It’s with market systems – when our world is disrupted not by ebooks, but by (for example) a vendor using our product as a loss leader; another vendor who is uncommunicative and not at any conferences/standards meetings; another vendor whose founder didn’t even want to sell books in the first place because “nobody reads anymore”.

We can’t NOT do business with these folks. But we are the little guys here, and that’s a reality. This is WHY we have to deal with competing vendor requirements – books might be a small fraction of Amazon’s business, but Amazon is a LARGE fraction of OUR business.

So how can we improve our relationships with these partners?

Well, on some level they do want things to work as well, or they wouldn’t be selling our books. And maybe, in addition to trying to school Google in the wacky ways of book publishing, we need to learn from Google about the also wacky ways of tech.

The other thing to note is that these companies DO hire book people. The folks at Google Books are longtime publishing operatives – from HarperCollins, B&N, Scholastic and Harvard University Press. Amazon regularly hires out of the NYC book publishing pool. And the iBookstore is staffed by people from Simon & Schuster, B&N, Oxford University Press, and other traditional publishing companies. So the encouraging thing is…we are infiltrating. There really ARE like-minded people at these companies. They seem (and feel) like faceless organizations, but they are not. That’s a myth that we’ve been telling ourselves for over 10 years.

Best Practices

What we’d consider best practices can vary depending on who’s consuming the metadata.

BISG has published a document on best practices for product metadata in the book supply chain. That’s a good foundation for bookselling – obviously the vendor requirements will differ as they compete with one another. Of course, Amazon doesn’t participate in hammering out these best practices, so they’re not perfect. But it’s a good guideline to begin with.

For non-bookselling purposes – libraries and other institutions – the idea of “best practice” gets a little murky. Resource Description and Access (RDA) was created by an international association of library organizations as a set of guidelines for cataloguing resources (books and other things). RDA can serve as a good foundation upon which to build library metadata.

Some systems vendors also do educational sessions where best practices are reviewed: Firebrand, Klopotek, Iptor and Ingenta all have user conferences or webinars for their clients that go over market needs for metadata and standards. And of course there are trade shows like ALA and BEA which also have conference tracks.

Basically, if you get out there and start talking – and you might not have to go places physically, if budget is an issue – you can get a conversation going. Maybe others are having similar problems to you. Maybe there’s a solution someone knows about. To some degree, we compete, but as an industry we’re also really good at helping each other.

What Happens To My Metadata When It Leaves the House?

We've all seen it. We spend time perfecting the metadata in our feeds, send it out to our trading partners, and then have to take complaints from agents, authors, and editors: "Why is it like that on Amazon?"

The truth is, data ingestion happens on whatever schedule a given organization has decided to adhere to. Proprietary data gets added. Not all the data you send gets used. Data points get mapped. So what appears on any trading partner's system may well differ somewhat from what you’ve sent out.

There are so many different players in the metadata arena that can affect what a book record looks like. When you send your information to Bowker, they add proprietary categories, massage author and series names, add their own descriptions, append reviews from sources they license – and send out THAT information to retailers and libraries. The same thing happens at Ingram, at Baker & Taylor – so what appears on a book product page is a mishmash of data from a wide variety of sources, not just you.

At an online retailer, different data sources get ranked differently. This happens over time, as a result of relationships and familiarity with data quality, and these rankings can change. The data can also get ranked on a field-by-field basis. So a publisher might be the best source of data for title, author, categories, and cover image. But the distributor might be ranked higher for price and availability. And an aggregator might be ranked higher for things like series name – especially if they specify to the retailer that it’s something they’re focusing on standardizing and cleaning up. It’s important to remember that in the eyes of the retailer, not all data feeds are equal. You’d think the publisher would be the best source of data about its own books but I can assure you, having worked with publisher data my entire 30-year career, that isn’t always the case.
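
To make the field-by-field ranking concrete, here is a minimal sketch in Python. The source names, field names, rankings, and sample values are all hypothetical, chosen only to mirror the example above; it is not a description of any particular retailer's actual logic.

```python
# Hypothetical per-field source rankings, best source first.
FIELD_RANKING = {
    "title": ["publisher", "distributor", "aggregator"],
    "price": ["distributor", "publisher", "aggregator"],
    "series": ["aggregator", "publisher", "distributor"],
}


def merge_record(feeds):
    """Build one product record, taking each field from the
    highest-ranked source that actually supplies a value."""
    merged = {}
    for field, ranking in FIELD_RANKING.items():
        for source in ranking:
            value = feeds.get(source, {}).get(field)
            if value:
                # Keep the winning source alongside the value for traceability.
                merged[field] = (value, source)
                break
    return merged


# Made-up feed data for a single book from three sources.
feeds = {
    "publisher": {"title": "The Example Book", "price": "24.99", "series": "example ser."},
    "distributor": {"title": "EXAMPLE BOOK, THE", "price": "26.99"},
    "aggregator": {"series": "The Example Series"},
}

record = merge_record(feeds)
for field, (value, source) in record.items():
    print(f"{field}: {value} (from {source})")
```

In this sketch the publisher wins on title, the distributor on price, and the aggregator on series name, even though the publisher sent values for all three fields. That is the "mishmash" effect in miniature.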

For a publishing house, updating old metadata records is a break from normal workflow, so it doesn’t happen as often as it should for optimal marketing purposes. It’s important to remember, though, that the job doesn’t stop once the book leaves the house – there are reviews, awards, and other events that are worth making stores and readers aware of through your metadata feed.

Just another quick word on terminology when it comes to updates – a “delta file” is what we call these updates – additions, changes, and deletes only, rather than a full file. Most publishers will send an initial full file, and then supplement with delta files for a time, and begin the cycle again just to make sure that their trading partners are in sync.
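
As a rough illustration of what a delta contains, the Python sketch below compares two full snapshots and keeps only the additions, changes, and deletes. The record layout, the `build_delta` name, and the placeholder ISBNs are assumptions made for the example; a real feed would normally be ONIX rather than Python dictionaries.

```python
def build_delta(previous, current):
    """Compare two full snapshots keyed by ISBN and return only the
    additions, changes, and deletes."""
    adds = {isbn: rec for isbn, rec in current.items() if isbn not in previous}
    changes = {
        isbn: rec
        for isbn, rec in current.items()
        if isbn in previous and previous[isbn] != rec
    }
    deletes = sorted(isbn for isbn in previous if isbn not in current)
    return {"add": adds, "change": changes, "delete": deletes}


# Placeholder data: the last full file sent, and today's state of the catalog.
last_full = {
    "9780000000001": {"title": "First Book", "price": "19.99"},
    "9780000000002": {"title": "Second Book", "price": "14.99"},
}
today = {
    "9780000000001": {"title": "First Book", "price": "21.99"},  # price change
    "9780000000003": {"title": "Third Book", "price": "9.99"},   # new title
}

delta = build_delta(last_full, today)
print(delta)
```

Only the changed, new, and removed records travel in the delta; the unchanged bulk of the catalog stays home.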

But on the retailer/aggregator end, there’s no guarantee that your updates will get processed in a timely way (without a phone call). Companies ingest on their own schedule, and if they have a very heavy processing week, they might skip your delta file and wait for the next one, which means there might be gaps in data updates. This is why publishers find themselves occasionally sending a full file – just to be sure all their records are brought up to date.
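
The effect of a skipped delta can be sketched in a few lines of Python. Everything here is hypothetical (the delta layout, the `apply_delta` helper, the placeholder ISBNs); the point is only to show why a periodic full file brings the two sides back together.

```python
def apply_delta(catalog, delta):
    """Apply one delta (adds, changes, deletes) to a catalog in place."""
    for isbn, record in delta.get("add", {}).items():
        catalog[isbn] = dict(record)
    for isbn, fields in delta.get("change", {}).items():
        catalog.setdefault(isbn, {}).update(fields)
    for isbn in delta.get("delete", []):
        catalog.pop(isbn, None)


# Both sides start in sync after an initial full file.
publisher = {"9780000000001": {"title": "First Book", "price": "19.99"}}
retailer = {isbn: dict(rec) for isbn, rec in publisher.items()}

delta_1 = {"change": {"9780000000001": {"price": "21.99"}}}
delta_2 = {"add": {"9780000000002": {"title": "Second Book", "price": "14.99"}}}

# The publisher's own system reflects both updates.
for delta in (delta_1, delta_2):
    apply_delta(publisher, delta)

# Heavy processing week: the retailer skips delta_1 and ingests only delta_2.
apply_delta(retailer, delta_2)
out_of_sync = retailer != publisher

# A full file replaces the retailer's copy wholesale and closes the gap.
retailer = {isbn: dict(rec) for isbn, rec in publisher.items()}
back_in_sync = retailer == publisher

print(out_of_sync, back_in_sync)
```

After the skipped delta, the retailer still shows the old price for the first book; nothing in the later delta corrects it, which is why the occasional full file (or a phone call) is needed.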

Beyond Simple Math & 5th Grade Semantics: The Not-So-Mindless Musings of Two 21st Century Metadata Wonks

By Nannette Naught and Laura Dawson

For the sake of clarity, as wonks, we started at the beginning with simple math and basic sentences. But let’s be honest,

  • The business of knowledge is far from simple. Knowledge acquisition, creation, and communication are complex tasks requiring advanced semantics.
  • The mechanics of the knowledge economy are a far cry from straightforward supply and demand. Knowledge creation, acquisition, and curation are more about performance than parts.
  • The obligations of knowledge stewardship are far and away larger than the web. Knowledge rights, ownership, and access are international in scope and fraught with location- and community-specific considerations.

And, for that matter, let’s get real, as modern day metadata mechanics, most of us are:

  • More concerned about music, streaming digital media, and demand driven acquisition than articles, offsite storage, and comprehensive collections.
  • More affected by the need to demonstrate impact, the pressures of digitization, and concerns related to institutional and service alignment than usage counts, known item fulfillment, and search versus discovery sidebars with our vendors.

And while we’re at it, let’s put a fine point on the problem and explicitly state our operational struggles: the “nickel and diming us to death, but can’t afford to stop long enough to replace” workarounds our behemoth, scarce-resource-guzzling 20th Century information systems require of us in the 21st Century knowledge economy. As trained GenX and Millennial information, data, and content scientists, we are forced to spend way too many hours in “dare we say it” repetitive, menial data massage tasks such as:

  • Recataloging content objects and enhancing their string-based, record trapped metadata.
  • Finding, investigating, and hand correcting the associated master and local holdings record collisions which make our collections invisible and unavailable to our patrons. Over and over again, with each update, renewal, replacement, and reconciliation.
  • Rescuing valuable, expensive researcher and student access from our disconnected ERM systems and KB-focused acquisition and delivery processes.

From this vantage point then, let’s clearly outline our shared interests. As 21st century, knowledge industry leaders, we are united with our management colleagues across the lifecycle by our:

  • Deep interest in the economics of metadata modernization and the ongoing costs (time, money, and opportunity) of our community’s extended re-engineering efforts.
  • Strong commitment to quickly and effectively bridging the gaps in communication, understanding, metrics, and use cases that plague both initiatives. Be those gaps within our niche or with our software development friends who create the applications that deploy our data.

And with all this, finally, off our chests, let’s throw some light on this 21st Century knowledge economy we keep talking about, using the recent ALA MidWinter Conference as our crystal ball.

  • What does it look like? And how is it different than the production and inventory infrastructures of yesterday? In one word: Energy. With over 15 years of attendance under my belt (yes, I, GenX, Nannette, have been attending since 2001 in San Francisco), I can honestly say, perhaps more than any I have attended, the knowledge economy seen through the lens of this conference looks like Diversity, Responsibility, and Transformation!  From the active, visible, and varied president elect candidates to the council and committee meetings where strategic plans, realistic achievable budgets, and forward reaching conference updates are being not just discussed, but enacted, this is a vital industry, on the move.

For me, its spirit was best summed up in a single quote:
“If you can make the 14th Librarian of Congress, a Librarian, you can do anything … Librarians are having a moment, remember your power.” – Carla Hayden, in her address to Council. (And yes, earlier this year as the second link shows, she also visited those PCC meetings, gathering place of many a library metadata wonk.)

  • Who are the players? And how are they different than yesterday’s? Quite frankly, some of the players you already know — LC, Ingram, and TLC or as we noted last week, libraries, library technology companies, publishers, distributors, aggregators, and the like. Others, however, like Biblioboard, DLSG, and BluuBeam are more recent entrants with new landscape shaping technical offerings. As to difference, I go back to the words Energy and Spirit. Old or new, these players are different for their willingness to actively execute real, working software on the cusp of change. They are not simply promising a future feature set, they are actively demonstrating an understanding of 21st Century use cases and a commitment to economical service modernization with tangible results.

For example,

  • Library of Congress’ BIBFRAME Initiative, drawing on real project planning that began in early to mid-2016 (they’ve been reporting on it, in some detail, incrementally since late spring or early summer of that year) and in collaboration with PCC, IndexData, and others, is actively completing the MARC to BIBFRAME converter, processing 19 million MARC records, and generating billions of RDF triples over the next three months to feed its active production pilot and the coming objectification of library metadata. Talk about realism and controlling the cost of re-engineering efforts, this is a whole new level of performance. This is from 1967 to big data, from “dirty by design” to “specifically semantic” in months not years. To my way of thinking at least, these are the foundations that will free re-invention efforts from the bounds of MARC-driven ILSes:
  • Enabling metadata mechanics to connect sales metadata to knowledge across the lifecycle. Need a reference point? Think back to Jean Godby’s ONIX mappings and OCLC Research’s Schema.org work of a few years back. Recast it against Selection and Acquisition in a beyond BIBFRAME 2.0 world, and a whole new era of metadata automation opens up.
  • Allowing metadata curators to establish relationships between people, pieces, collections, and disciplines. Need a guidepost? Think back to 2010 or so and Tom Delsey’s, Barbara Tillett’s, and the RDA JSC’s work on relationships and relationship designators. Recast it against metadata curation after the BIBFRAME 2.0 production pilot, and a whole new world of metadata driven inquiry opens up – inside, and outside, the library system.
  • Empowering lifecycle leaders to drive ROI and assess impact. Need some lane markers? Follow Wayne Schneider’s (IndexData speaking at the BIBFRAME Forum) train of thought. A post conversion, big data world (on and off the web), working with curated metadata resulting from steps 1 and 2 herein. Library metadata which is no longer surrogate and outside content, but expressed in and aggregated against our resources’, digital content’s native languages (e.g., objectified XML). And a whole new world of learning, research, and inquiry facilitation (not to mention library operations cost optimization) opens up!

And on that note, let’s look at a few of those cusp riding, working applications which cast against this upgraded, connected metadata uplift wonk spirits and re-energize us.

  • BiblioBoard with its Amazon Kindle store style interface is helping libraries affordably and intuitively provide access to digital materials outside cumbersome discovery interfaces. Giving librarians and their patrons the ability to create custom curations and a personalized user experience akin to their relationship with their tablet. Talk about metrics, this is a whole new level of embedding. This is ILS information (they have integrated with SirsiDynix so far) powering the digital life of the user.
  • DLSG with its BSCAN Interlibrary Loan & Digital Document Delivery is easily and affordably (for less than the cost of 1 staff member for a month, initial purchase, and a small, in the $100s, optional yearly support fee) placing the power of immediate PDF-based interlibrary loan into the hands of any library. No muss, no fuss, no system understanding or technical knowledge required. Talk about ease of use, this is as simple as taking a book off the shelf and making a photocopy. This is a whole new level of service. This is near immediate, near idiot proof managed access to physical library collections anywhere, at any time, at minimal cost.
  • BluuBeam with its small plastic beacons and location-triggered alert capabilities is affordably and noninvasively allowing libraries to integrate their physical and digital presence through their patrons’ handheld devices. Talk about patron engagement and personalization. Products like this, with low points of entry and creative deployment, open the door to a whole new level of interactivity and performance. This could be the beginning of demand-driven, patron-controlled, library-managed, location-specific, device-delivered service.

Whew! That’s a lot to say in one blog post! It’s too much to chew in a single three-article series! It is, quite frankly, insufficient to the tasks at hand. Don’t worry, we metadata wonks agree with you. Moreover, we believe it is just the start of a conversation that needs to involve many more players and address many more issues, viewpoints, and perspectives. It is a conversation at the crossroads of centuries, services, and business cases. From the feedback we’ve received so far, it is an exchange of ideas that needs a home. Thus, as we close this series on Life, Liberty, and the Pursuit of Knowledge, we open the door to a “coming soon” Crossroad Conversations space at the former www.imteaminc.com address and add the voices of both Kathryn Harnish and a rotating series of virtual coffee participants.
Until then, two metadata wonks, signing off!

Mastering the Math: The Not-So-Mindless Musings of Two 21st Century Metadata Wonks

By Nannette Naught and Laura Dawson

“Slow down, start at the beginning.”

“Decrease the drama, keep it simple, silly.”

“The facts, Ma’am. Just the facts.”

Long lost episode of Dragnet or guiding remarks made by system designers in a focus group at ALA? Honestly, it can be hard to tell the difference at times. And yes, we metadata wonks understand the hard, uncomfortable truth of this statement. Heck, we’ve lived it!

So now what? How do we as a knowledge industry move beyond the battle lines of our individual business cases? How do we as professionals productively collaborate across the divides of discipline and raison d’etre? WHERE DO WE BEGIN?

As we said last week, for our parts, we metadata wonks begin with trusted, experienced human knowledge acquisition and deployment engines with access to vetted, curated collections of knowledge. Trained information and data service professionals backed by institution-branded, collaboratively curated, community-aware resource collections, we access either in person, virtually, or via the ubiquitous, affordable software programs deployed on our phones, tablets, watches, and laptops — aka Libraries and Librarians.

So starting here, bringing Joe Friday with us into the 21st Century...

Fact: Learning is contemplative. Humans do not simply point and click their way to knowledge. Unlike machines, people do not simply additively take in information and automatically get smarter. Learning is an experience that requires active thought.

Volume (even organized volume) ≠ Wisdom.

Fact: Research is contextual and collaborative. Unlike simple mathematics, human learning is not necessarily incremental, but rather, oft nonsequential and unpredictable. Knowledge creation is a shared experience that requires active thought, appropriate access, and imagination.

1s and 0s (even organized, interconnected 1s and 0s) ≠ Thought.

Fact: Inquiry is a dependent and serendipitous process. Unlike nascent, database-driven sort and retrieval algorithms, the human brain natively understands real world objects and complex relationships. The investigative act is a personal, emotion-inclusive, uniquely human experience that requires MORE. MORE than just machines with their 1s and 0s, fledgling algorithms, and sales-driven, mob ruled, short-term, oft insecure, point and click document collections.

THE Web (even the linked, semantic web) ≠ Library.

Which brings us back to last week’s conclusion, to move beyond current stagnation, the knowledge economy requires a library-led programming toolset and the resulting librarian-curated, web-formatted, web-accessible metadata that augments current sales-driven applications with knowledge- and language-driven contextual metadata to power its services and applications.

Which, if we are honest, requires us all (aka not just us metadata wonks) to face yet another hard, uncomfortable truth – said efforts within the Library community seem stalled at best; in crisis at worst. Heck, all of us, leadership and feet on the street alike, are living this truth daily! So why, oh why, can’t we take a clue from Nike and Shia LaBeouf and “JUST DO IT,” already?

Heck, as many correctly point out, library metadata was authoritative, global, and knowledge-based before anyone even conceived of the web, let alone email. Librarians, archivists, and their cadre of associated technical professionals were serving communities long before anyone received a degree in computer science, let alone web design or search engine optimization. And therein lies the rub. The rub or controversy that is holding us back: The epic battle between commercialized computer science and community-based library and information science.

Thankfully, here, the wisdom of Dragnet holds: Strip away the drama (aka commercial versus community) and get down to the facts (aka the science). And voilà, a simple fact, not to mention a very human truth, emerges: Computers are simply the vehicles through which libraries deliver knowledge and information (and for that matter commercial interests deliver their products) to communities (geographical, practice, institutional, academic, public, K-12, and specialized communities). Communities of people who are:

  • Driven by their individual wants, needs, and relationships;
  • Learning, researching, and inquiring in very human ways.

Which brings us metadata wonks to our main point for the week: We believe subtraction, not addition, is required, if we are to successfully escape the current trough of disillusionment.

We believe it’s time to STOP ceding authority, ownership, and, frankly, definition of what is possible/needed, to the programmers.

We believe it’s time to QUIT relying on simple arithmetic, directory-like relational databases, and 5th grade level semantics to discover, deliver, and manage our resources, our patrons, and our businesses.

We believe it’s time to STAND UP for ourselves and the communities we serve by insisting that data, library, and information science aided by computer science move forward together as equals — no ugly step children, no domineering bullies allowed.

Then and only then, can we collaborate successfully across the divides of discipline and motivation to additively:

  • Move innovation forward, grounded in Tradition and Diversity.
  • Reliably extend service levels with Economics and Ethics.
  • Help ensure Peace by relying on shared Governance structures and Negotiation methodologies.
  • Achieve collective and individual Success through Innovation, Service, and Peace.

Join us next week for ideas from Digital Book World about how our lifecycle partners the publishers, distributors, and aggregators see this playing out in 2017 product and service lines. And the following week for ideas from ALA MidWinter about how their lifecycle partners the librarians, researchers, and library technologists see this playing out in their institutions and within their communities of practice.

Of Life, Liberty, and the Pursuit of Knowledge: The Not-So-Mindless Musings of Two 21st Century Metadata Wonks

By Nannette Naught and Laura Dawson

When starting something new, it seems best to start at the beginning, without preconceptions: Library metadata is problematic from a wide variety of perspectives. Just about every interaction with it is an exercise in frustration. Between legacy issues, retro-conversion trip-ups, the introduction of new concepts like Linked Data, and the simple volume of STUFF that needs to be effectively described to a multitude of audiences, it’s safe to say that increasingly efficient technologies (like the Web, like faster computers) mean that we’re looking at a burgeoning crisis.

So. Basic concepts:

  • What is metadata?
  • Why do we need metadata?
  • And while we’re at it, why, oh why, hasn’t someone – anyone – fixed it by now?!

Not exactly the scintillating outline for a “politically-charged, pulse pounding” theatrical thriller. Not exactly the reader-grabbing, panic-inducing tagline of a breaking news alert.  And yet, a cursory search on Google brings up results about “Snowden: The Movie” and the NSA. Even prompting related “People also ask” questions such as: “What is metadata collection?” which give us results like:

  • What is the NSA and what do they do?
  • Is the NSA still spying on us?
  • What is the NSA Surveillance program?

For our parts, we metadata wonks trust knowledgeable, experienced, human search engines with access to vetted, curated collections of knowledge (aka libraries and librarians) more than we trust fickle, opaque, largely sales-motivated and easily manipulated, computer algorithms with access only to properly formatted, mostly recent documents on the Web. Documents like news feeds and advertising or pseudo-advertising in the guise of a PSA that are seldom sourced, let alone vetted and peer-reviewed. Especially when it comes to important decisions such as those related to our jobs, safety, security, privacy, healthcare, and the like! (To say nothing of the problems of “fake news”.)

Why? Because research, inquiry, and learning (or, rather, knowledge acquisition and deployment) are NOT purchasing decisions. Nor are they simple, fact-based, database-driven information sorting-and-retrieval questions. Yes, we wonks believe the wisdom and hearsay of an anonymous (for those who can afford it) web-based crowd – be it the commercial crowd; the free and open, no-copyright or peer-review crowd; or one of the many NIMBY mobs – is an inadequate, and frankly immature resource for the tasks of life, liberty, and the pursuit of knowledge.

So now what? Where do we go from here, if we can’t rely on Google, Amazon, and/or Facebook alone to answer important questions? If these lauded, affordable platforms are really just sort-and-retrieval systems for the easily accessible sales and social media databases of the internet, how do we get access to vetted, curated collections of knowledge and trained, experienced information professionals (as opposed to scripted, offshore customer support staff) we need to learn and grow?

For us, it’s simple – use these software programs to access your library and librarians! Let’s look at our metadata search again. Let’s add the word “library” to it and see what happens. This simple addition totally changes the results, yielding a Journal of Academic Librarianship article (V 21, N 2, pp 160-163) by Karen Coyle in the top 5 hits and no NSA or Snowden references.

Having now used the tool, a sales-driven algorithmic search engine, to access the library realm and the writing of a leading librarian (vetted and trusted by her peers over the course of a long and productive career), in less than 4 of her paragraphs we find concrete, intelligible answers to our initial questions:

  • What is metadata? “Metadata is cataloging done by men.”

This is a quip attributed to two notable librarians and library metadata luminaries, Tom Delsey and Michael Gorman. And though we do not know Michael personally, Nannette knows and has worked with Tom closely. We’re fairly confident both use the term in its classical, inclusive sense, of “mankind,” referring to all humans, regardless of gender and/or biological equipment. But then this is why wonks trust librarians and librarian-guided, vetted programming over anonymous algorithms. Librarians are in the business of knowledge and value diversity. Anonymous, sales-driven algorithms are in the business of advertising and generally value the biases of BOTH commercial interests who have purchased or negotiated their way to preferential sort-and-retrieval placement (think of Ad Words as shelf positioning in a grocery store; you know that name brand is always at eye level, the generic products are at the bottom, and the nutritious stuff is up high) and their equally anonymous programmers.

  • How does metadata work? “…metadata is constructed information, which means that it is of human invention and not found in nature.”

This is an important point that Google and other algorithmic, natural language processing platforms overlook. Metadata is human, it takes a human act to create. Humans and their actions are by definition contextual. And unfortunately for most algorithms, most humans do not explicitly state their present context or that surrounding their desired outcome out loud. Let alone take the time to type it into an interface. Again, ask a reference librarian; that’s why they do introductory interviews in the same fashion that a doctor interviews you about your symptoms and concerns, only then discussing treatment with you. So too, your librarian attends to your very human, very individualized knowledge diagnosis and care needs.

  • Why do we need metadata? “… necessary characteristic of metadata: metadata is developed by people for a purpose or a function.”

This is a critical pivot point for all knowledge acquisition and deployment activities, be they research, inquiry, or learning related. Success or failure, and its accompanying metric (aka relevance), hinge on appropriateness or fitness for the human purpose against which the metadata is being deployed, as measured against the use case (need, model, and definition) under which the metadata was created, collected, and/or enhanced. And therein lies the rub: too much, way too much of our metadata, both that in the world at large and that in libraries specifically is created, collected, aggregated, and deployed without adequate, explicitly documented, appropriately serialized, adequately tested needs, models, and definitions.

  • Why, oh why, hasn’t someone, anyone fixed it by now?!

The answer seems to be the lack of library-led, governed, and administrated, explicitly documented, appropriately serialized, adequately tested knowledge acquisition and deployment use cases, metadata models, and element/term definitions – for our vendors, the web, publishers, and others in the knowledge economy to deploy.

As for our earlier trick of prefacing a search with “library” to access vetted, curated collections of knowledge and trained, experienced information professionals – why can’t we just do that going forward?

Simple. Without the above-mentioned library-led programming toolset and the resulting librarian-curated, web-formatted, web-accessible metadata that augments current sales-driven applications with knowledge- and language-driven contextual metadata, there is nothing (or at least very little) to power the applications.

Which of course leads to the question: “Ok, metadata wonks, you said it. WHERE DO WE BEGIN?”

To which we have to say, “honestly, we don’t pretend to know!” However, over the course of our 20+ year careers, we wonks – acting as consultants, product managers, knowledge product developers, and strategists – have learned a thing or two about metadata and knowledge economics, not to mention listening. Listening to the users and creators of knowledge, as well as their lifecycle partners: the publishers, distributors, librarians, administrators, researchers, aggregators, lawyers, and business people who make the knowledge economy go round. And we have become quite the conversationalists and discussion moderators. Thus, this series of posts – a record, if you will, of not just our musings, but our ongoing conversations with each other and with industry nerds like ourselves. Conversations we hope will help us all pave the way toward answering the public’s need for knowledge-driven, library-led, web-based inquiry.

Next week we'll dive into Mastering the Math:

Tradition + Diversity = Innovation
Economics + Ethics = Service
Negotiation + Governance = Peace
Innovation + Service + Peace = Success

On Taxonomies, Mapping, and Loss

The world of books represents the world of human thought. Concepts articulated, written down, codified, published. But of course, our understanding of these concepts can vary – by nationality, cultural background, experience, philosophy of life. The word “alienation,” for example, can mean different things to different people. It can be expressed differently in different languages – by a single word, or by a phrase rather than a word. And, in fact, in cultures all over the world, many different words can be used to describe a single phenomenon like “snow” or “walking” – think of how we describe colors in the Crayola box, for example, or the Pantone chart.

Words carry nuance that’s not always immediately apparent, which is why non-native speakers of languages tend to struggle, and why translations nearly always lose meaning. And it’s why our systems of categorization are the most subjective and argued-about forms of metadata.

Taxonomies, in particular, are inherently political and authoritarian. They are hierarchical. Taxonomies are, essentially, what we call “controlled vocabularies”. Which raises the question: Who controls them? Do we trust those people to express what we mean? What if we disagree?

As in politics, taxonomies evolve as society evolves. What used to be “Negro history” became “Afro-American history”, which became “African-American history”. What used to be “Occult” became “New Age”, which became “Body/Mind/Spirit”.

Taxonomies reflect our understanding of phenomena. And that understanding is deeply colored by our culture, our experience, our politics, and our vision of the world. It varies from person to person. Taxonomies are a compromise, a consensus.

They’re the result of committee work. Taxonomies are rarely finalized. They shift and change depending on cultural mood, society’s evolution, and market trends. They are living things.

I just want to go over some of the issues that we see in book commerce – where Amazon, B&N, and other booksellers have their own proprietary codes. How are those created? And how do BISACs influence them?

When a publisher communicates information about a book to a distributor or retailer, that publisher will assign a series of BISAC codes. Online retailers, as we know, have their own proprietary codes, based on how their users search and browse for books. Retailers such as Amazon and Barnes & Noble.com tend to look at BISACs as useful suggestions. They map BISAC codes to their own codes, and they each make separate decisions about which BISACs map to which proprietary code. A code like SOCIAL SCIENCE - Media Studies might map to a scholarly sociology code at B&N, and a commercial code (such as "media training") on Amazon.

Mapping data points always results in the loss of some meaning or context. Some categories don’t cleanly line up to others. So mapping one taxonomy to another is yet another compromise – one we have to live with in a taxonomic, hierarchical world.
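
To make that loss concrete, here is a minimal sketch in Python. The two retailer tables and their category names are invented for illustration (they are not Amazon's or B&N's actual codes), and the function names are hypothetical. It only shows how the same BISAC subject can land in different places at different retailers, and how two distinct subjects can collapse into one.

```python
from collections import defaultdict

# Hypothetical mapping tables: the retailer category names are assumptions
# made up for this example, not real proprietary codes.
RETAILER_A = {
    "SOCIAL SCIENCE - Media Studies": "Sociology",
    "SOCIAL SCIENCE - Sociology": "Sociology",
}
RETAILER_B = {
    "SOCIAL SCIENCE - Media Studies": "Media Training",
    "SOCIAL SCIENCE - Sociology": "Sociology",
}


def map_subjects(subjects, table, fallback="Unmapped"):
    """Translate publisher-assigned subjects into one retailer's categories."""
    return [table.get(subject, fallback) for subject in subjects]


def collapsed_categories(table):
    """Return retailer categories that absorb more than one source subject.

    Each entry is a place where a distinction the publisher made
    no longer exists after mapping.
    """
    by_target = defaultdict(list)
    for source, target in table.items():
        by_target[target].append(source)
    return {t: sorted(s) for t, s in by_target.items() if len(s) > 1}


book = ["SOCIAL SCIENCE - Media Studies"]
print(map_subjects(book, RETAILER_A))
print(map_subjects(book, RETAILER_B))
print(collapsed_categories(RETAILER_A))
```

In this toy setup, one retailer files the book under a scholarly heading and the other under a commercial one, and the first retailer can no longer tell media studies apart from general sociology. Nothing is "wrong" with either table; the information is simply gone after the mapping.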

From Systems To Metadata

There are many services out there that handle workflows. Some are comprehensive, like IngentaConnect and Klopotek. These cover every aspect of the publishing process, from title management to warehousing to metadata distribution. Some focus on specific parts of the publishing process – Firebrand focuses on title management and metadata extracts; Iptor includes modules for paper/print/binding and warehouse functionality; MetaComet focuses exclusively on royalty tracking. You may find yourself combining pieces of several – for example, Firebrand and MetaComet are able to integrate. Or you may find yourself using non-publishing-specific tools like SAP, which feed into systems like Firebrand or Klopotek.

Or, you may have your own in-house tools to manage workflows. Smaller publishers have made their businesses work on a series of spreadsheets stored in a central location, to which only a few people have access. Or a SQL database to handle title management with third-party tools for other functionalities.

All of which is to say that, in my experience, workflow management systems are somewhat Rube-Goldberg in nature, and there are usually systems talking to other systems. There’s double-keying – entering the same information in multiple systems. And there’s also a lot of sneaker-net.

These systems’ relationships to one another are complex. I’ve worked at McGraw-Hill, Bowker, Barnes & Noble, and consulted to many, many publishers and aggregators – and I have NEVER seen a smoothly running set of interoperating systems where everything worked elegantly and produced perfect and timely metadata with a minimum of effort. One or two components may be problem-free, but as a whole, there are ghosts in our machines.

And that affects workflow, of course. Certain jobs can only run at night, which means real-time data isn’t available. Your warehouse data runs on an open source platform which isn’t sufficiently supported. Your sales staff keeps entering endorsements in the reviews field because their system doesn’t have an endorsements field. Your company has acquired another company and the systems need to merge. Your digital asset management system is literally a box of CDs.

So there are lots of points where these systems don’t align perfectly, and that is going to affect the quality of output.

One way to begin to tackle this is to structure your system so that there’s a single repository everyone can tap into. Fran Toolan at Firebrand calls it the “Single Source of Truth” – having a central repository means you don’t have competing spreadsheets on people’s hard drives, or questions about whether you’ve got the latest version of information about a book.

Read more about these types of problems in the publishing supply chain in The Book On Metadata.

The Web Came For Books

In 1998, when I began working there, BN.com had 900,000 titles in its database for sale - representing the entire availability of books at that time.

Bowker reports there are over 38 million ISBNs in its database now. There are some caveats to this: some of these ISBNs don’t represent viable products, some are assigned to chapters rather than whole books…however, there are also a sizeable number of books (via Smashwords, Kindle, and other platforms) that never make it into Books in Print. We don’t know if this evens things out, or if 38 million is the minimum number of books available in the US market today.

That’s over a 4000% increase.
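
For anyone who wants to check the arithmetic, here is a quick sketch in Python using the two figures quoted above. The function name is just a label for this example.

```python
def percent_increase(old, new):
    """Percentage growth from old to new: (new - old) / old * 100."""
    return (new - old) / old * 100


titles_1998 = 900_000      # BN.com catalog size in 1998, as quoted above
isbns_now = 38_000_000     # ISBNs Bowker reports in its database

growth = percent_increase(titles_1998, isbns_now)
print(f"{growth:.0f}% increase")   # prints "4122% increase"
```

Put another way, there are roughly 42 records now for every one that existed in 1998.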

Further complicating this scenario, we are living in a world where the content is born digitally. It can be produced and consumed rapidly, which is why there is so much of it, and why there is only going to be more of it. Lots and lots of information and entertainment. Lots and lots of, essentially, data.

Nothing ever goes away anymore.

Another factor is that the internet provides a persistence even to physical objects. With the web, nothing goes away, even physical objects – they only accumulate (on eBay, in vintage shops, and in libraries). They accumulate and accumulate. And books are very much a part of this accumulation. We don’t order books out of paper catalogs anymore. We order books off the web.

And, in many cases, we order books that ARE websites – packaged into…EPUB files.

We now have 38 million books to choose from. We also order music and movies over the web – and we frequently don’t see a physical medium for most of these things. Physical media get scratched, damaged, lost, borrowed and never returned. But digital is forever, and there’s a freaking lot of it.

At some point (and remember, we’re in a world of rapid development, explosion of content, and ever-more-sophisticated ways of consuming it – so “at some point” could actually be sooner than we think it ought to be), search engines and online catalogs go one step further than asking publishers (and other manufacturers) for product metadata in a separate (e.g. ONIX) feed. They are increasingly going to want to derive that metadata (and more detailed metadata) directly from the file representing that product itself. In our case, the EPUB file - a “website in a box”.

This means that publishers are not only going to have to get good at creating and maintaining metadata at a pace that can sustain a more than 4000% increase over 18 years (so vastly more products to keep track of), they are going to have to get good at doing this inside the book file itself, which means not only grappling with markup language, but treating the EPUB file as a (really long) web page.
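
As a rough illustration of what "deriving metadata from the file itself" can look like, here is a minimal Python sketch that uses only the standard library. It relies on two things about the EPUB container format: the file is a ZIP archive, and META-INF/container.xml points to a package document that carries Dublin Core metadata. The function name and the choice of fields are assumptions for this example; a real ingestion pipeline would read far more than this and handle malformed files.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespaces used by the EPUB container and package documents.
NS = {
    "c": "urn:oasis:names:tc:opendocument:xmlns:container",
    "opf": "http://www.idpf.org/2007/opf",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def read_epub_metadata(epub_file):
    """Pull a few Dublin Core fields out of an EPUB's package document.

    epub_file may be a path or a binary file-like object.
    This is a sketch: it assumes a well-formed, unencrypted EPUB.
    """
    with zipfile.ZipFile(epub_file) as archive:
        # container.xml tells us where the package (OPF) document lives.
        container = ET.fromstring(archive.read("META-INF/container.xml"))
        rootfile = container.find("c:rootfiles/c:rootfile", NS)
        package = ET.fromstring(archive.read(rootfile.get("full-path")))

    metadata = package.find("opf:metadata", NS)

    def values(tag):
        found = metadata.findall(f"dc:{tag}", NS)
        return [el.text.strip() for el in found if el.text]

    return {
        "title": values("title"),
        "creator": values("creator"),
        "identifier": values("identifier"),
        "subject": values("subject"),
    }
```

If a retailer or search engine can do this much with a dozen lines of code, whatever sits in those fields is effectively part of your public metadata, whether or not it matches your ONIX feed.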

And at volumes that are unprecedented – because (a) publishing is easier than it has ever been before and (b) no book published now ever goes away.

This kind of rapid development doesn’t just change your workflow – it changes what and how you publish. And the more publishers understand about the web, the more likely they are to survive.

This is a different kind of survival than just holding on through a bad time, waiting out an economic downturn. This is survival that depends on evolution. On change. On new skills and abilities and ways of looking at things – while keeping in mind where we have come from and how we got here.

Let’s go back to the problem of content proliferation. How are we going to manage it, organize it, feed the search engines in ways that they understand so that normal people who think Google is magic can actually find it, discern it, and read it?

We impose a structure on it. We take that mess and organize the hell out of it. And yes, it has to be us - the book industry.

The search engine industry doesn’t really care what results it displays. Books are no more important to a search engine than anything else – it’s all data. If we want to make the search engine work for us, we have to engage it. We have to understand how it searches, what’s most effective on it. Just as the industry worked very hard in the 1990s to understand superstores and how they displayed books and what co-op could get us, so must we understand the storefront of search.

In an age of this much abundance, it’s not enough to simply create a thing and then offer it for sale on the web. We have to understand how the market works. And the market revolves around search.

A Detour Into Some History

ISBN History

Born in Romania, Emery Koltay grew up in Hungary and was active in the Hungarian resistance to Communism. He spent a lot of time in and out of prison camps before eventually fleeing to the US, where he began working…for R. R. Bowker.

Bowker’s Books in Print database has long roots, going back to the 1800s. It was (and still is) intended to be a catalog of all the books being published in the United States. (Later versions include the UK and Australia as well.)

While Koltay was at Bowker, the UK bookseller and wholesaler WH Smith was building a new warehouse that was going to be computerized. This system required every book in the warehouse to be numbered. WH Smith recruited a man named Gordon Foster, who had worked at Bletchley Park during World War II and later worked with Alan Turing at the University of Manchester. In 1965, Foster developed the SBN, a nine-digit number that WH Smith could use to identify editions of books in their new computer system.

Koltay followed the SBN’s progress from the US, and introduced the concept to US publishers as part of his work at Bowker. That set the stage for the number to be refined and ratified by ISO – the International Organization for Standardization. Within 4 years, ISO published the standard for global use – the fastest an ISO standard had ever been approved.

The ISBN remained at 10 digits until 2005, when it evolved into a 13-digit number to align with the EAN barcode, which was becoming the standard in retail.
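
For the curious, the move from 10 to 13 digits is mechanical: put the 978 prefix in front of the first nine digits and recalculate the check digit using the EAN rule (weights alternating 1 and 3, modulo 10). Here is a minimal sketch in Python; the function names are my own, and it does not verify that the incoming ISBN-10 check digit is itself correct.

```python
def isbn13_check_digit(first12):
    """EAN-13 check digit: weights alternate 1, 3, 1, 3... from the left."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 equivalent."""
    chars = isbn10.replace("-", "").replace(" ", "")
    if len(chars) != 10 or not chars[:9].isdigit():
        raise ValueError(f"not an ISBN-10: {isbn10!r}")
    # Drop the old check digit (which may be 'X') and add the 978 prefix.
    body = "978" + chars[:9]
    return body + isbn13_check_digit(body)


print(isbn10_to_isbn13("0-306-40615-2"))   # prints 9780306406157
```

Note that the old check digit is thrown away entirely. The two numbers identify the same book, but you cannot get from one to the other by just sticking three digits on the front.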

It’s important to remember that the ISBN was initially created to solve the problems that digitization was bringing to the book world. Our so-called “digital disruption” is actually a culmination of events that began in the 1960s, when computers began to be more widely used outside of military and academic settings.

BISAC History 

BISAC began as a standalone initiative in 1976, designed to create standards for EDI transmissions in the book supply chain, largely between ordering departments and warehouses. These were early versions of what ONIX would eventually become – a way of communicating among trading partners.

With the rise of book superstores such as B&N, Borders, and Books a Million, it became apparent that an additional standard was needed to determine where in these stores books should be shelved. BISAC took on the responsibility of creating standardized codes that publishers could use to suggest to bookstores which section of a store a book would be a good fit for.

By 1995, there were around 50 general codes, with “sub-codes” under each – forming a 2-level hierarchy. The codes were rather cryptic – 3 letters followed by some numbers – because they were developed for machine-to-machine processing. The actual names of the codes were only used by those doing the assigning and those receiving the books and deciding where to put them. (There were those nerds who had the codes memorized, because there were so few of them.)

But with the emergence of online retailers, BISAC experienced a period of rapid change. It merged with BISG in 1999. BISAC codes were developed with an eye towards discovery on the web as well as in-store placement of books. Whereas bookstores required a single BISAC code, web stores could “shelve” a single book in multiple categories. Most guidelines now recommend 3-5 codes per title.

You might notice, for example, that the “Body, Mind and Spirit” BISAC categories begin with the characters “OCC”.

This is because that category used to be called “Occultism and Parapsychology” back in the 80s and 90s. It was where books about UFOs, spiritual healing, crystals, Wicca, and similar subjects were shelved. The category evolved into “New Age”, keeping the OCC prefix. As “Body, Mind and Spirit”, it has been expanded to include books about mindfulness, meditation, reiki, “inspiration and personal growth”, and feng shui, all of which are fairly mainstream, in addition to continuing with more obscure topics such as astrology and numerology.

So if the BISAC prefix doesn’t match up to the name of the category itself, it probably had a previous life as a category more appropriate for the 80s or 90s cultural landscape. Books reflect our landscape, and their subjects evolve over time.
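
If you handle these codes in software, the practical lesson is to treat the prefix as an opaque key and look the current name up, rather than trying to read meaning into the letters. A minimal Python sketch follows. It assumes the nine-character form of a BISAC code (three letters plus six digits), and the lookup table is a two-entry illustration, not the real BISAC list.

```python
import re

# Assumed shape of a BISAC subject code: three letters, then six digits.
BISAC_PATTERN = re.compile(r"^([A-Z]{3})(\d{6})$")

# Illustrative subset only; the real list is maintained by BISG.
CURRENT_HEADINGS = {
    "OCC": "Body, Mind and Spirit",
    "SOC": "Social Science",
}


def split_bisac(code):
    """Split a code such as 'OCC000000' into its prefix and numeric part."""
    match = BISAC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a BISAC-style code: {code!r}")
    return match.group(1), match.group(2)


def heading_for(code):
    """Look up the current heading name; never infer it from the letters."""
    prefix, _ = split_bisac(code)
    return CURRENT_HEADINGS.get(prefix, "unknown heading")


print(heading_for("OCC000000"))   # prints Body, Mind and Spirit
```

Keeping the prefix stable while the name changes is what lets decades of old records keep working as the category is renamed around them.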