lexicality: two hats, four colors, one word

That’s what it comes down to, isn’t it? Lexicality.

In my forthcoming Information Technology and Libraries paper [pdf] I talk about automated subject indexing. I used Wikipedia articles and category structures in an algorithmic scheme to classify documents, which sometimes worked (A Brief History of Time), sometimes didn’t (Guns, Germs, and Steel). And I poked around a bit at when it did, when it didn’t, but I didn’t really get into why. I didn’t realize it at the time — because I was putting off two weeks’ worth of LIS 419 readings to finish the paper — but the frame I needed was lexicality.

Which was — I forget the technical definition — basically, how easy is it to express a thing in words? Do we have a word, or at least a brief unit, which is more or less coextensive with the concept at hand (“physical cosmology”)? Or are we talking about more slippery concepts, things you maybe examine from different angles so you can get at them refractively because you can’t see them straight, things where it might take a whole chapter, or a whole book, to lay forth the idea? Automated subject indexing — no surprise, in retrospect — worked well for me when the books I fed it dealt with highly lexical concepts. Otherwise, not so much.
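
A toy sketch of what such a scheme can look like (emphatically not the algorithm from the paper, just an illustration of why lexicality matters to it): match article titles that literally occur in the document, then tally the categories those articles belong to. The miniature title-to-category map is invented.

```python
# Toy sketch, NOT the scheme from the paper: match Wikipedia-style article
# titles that appear verbatim in a document, then tally the categories those
# articles belong to. The miniature title -> categories map is invented.
from collections import Counter

ARTICLE_CATEGORIES = {
    "big bang": ["Physical cosmology"],
    "general relativity": ["Physical cosmology", "Theories of gravity"],
    "black hole": ["Physical cosmology", "Astronomical objects"],
}

def suggest_subjects(text, article_categories=ARTICLE_CATEGORIES, top_n=2):
    text = text.lower()
    votes = Counter()
    for title, categories in article_categories.items():
        if title in text:                 # the concept's name must literally occur
            votes.update(categories)
    return [category for category, _ in votes.most_common(top_n)]

doc = "From the Big Bang to black holes, a tour of general relativity..."
print(suggest_subjects(doc))   # ['Physical cosmology', 'Theories of gravity']
```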


I was talking with my software engineer husband earlier tonight about searching Google versus library catalogs. And, although it wasn’t the discussion we were having, it reminded me of a discussion I’ve had repeatedly with technical people, almost all of whom seem convinced that fulltext search is all one will ever need, and who are genuinely baffled that anyone would ever find browsing the stacks to be useful — so baffled, in fact, they sometimes seem outright unable to believe me when I tell them it is the case.

(And don’t get the feeling I’m ragging on programmers here and librarians can gloat, because I have had exactly the same conversation in reverse with librarians, who sometimes have difficulty imagining that people’s habits of interfacing with libraries could be other than those of English or history majors.)

When I put my math-major hat on, I understand the software engineers, because in that guise I never once browsed library stacks, nor can I readily imagine needing to do so. But when I put my classics-major hat on, subject browse (whether by catalog subject headers or by physical shelf browsing) is crucial. It’s how I found most of what I needed to know — not just where it was, but what.

And what it comes down to is lexicality. Math, computer science, the hard sciences in general — they’re highly lexical. If I need to know about tensors or the four color theorem or what-have-you, they are always and only called that, and the terms will appear in the text of the document, and everyone involved is very clear and specific on what these terms mean — and they may convey pages and worlds of subtle meaning, but the meaning is precision-crafted. Give me a good fulltext index; I’ll search for the term I care about and I will find what I need.
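
A minimal sketch of why that works: build an inverted index over the full text, and the precision-crafted term takes you straight to the documents that use it. (Toy documents below, obviously.)

```python
# A minimal inverted index: term -> the set of documents containing it.
# When a field's vocabulary is this precise, exact term lookup is enough.
from collections import defaultdict

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Documents containing every query term (a simple AND query)."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

docs = {
    1: "a proof of the four color theorem by computer",
    2: "tensor calculus for general relativity",
    3: "urban religion and the cult of Mithras",
}
index = build_index(docs)
print(search(index, "four color theorem"))   # {1}
print(search(index, "tensor"))               # {2}
```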

Classics, not so much. The ideas I thought about there were sprawling and ill-defined, things I defined through the process of writing about them, things that mean a range of not wholly agreed-upon things: “cities” and “Mithras” and the like. Sprawling ideas, intersecting with other sprawling ideas, that you somehow have to pare down and wriggle an argument through — an argument that might take a few dozen pages before the idea is properly out — and, with any luck, an idea that no one has ever quite written about before. How do you search for that in a fulltext index? You don’t. You look for things near it; you spiral in on a truth.

It all comes down to lexicality.

8 thoughts on “lexicality: two hats, four colors, one word”

  1. I agree with you to a point, and then suddenly it all seems backwards. Because my intuition about lexicality’s application to the “browse stacks versus full-text-search” question is the opposite of yours.

    Full text search allows you to look in non-catalogued data, which it seems to me is exactly what you want for a less lexical topic. When I’m looking for information I typically don’t know what term a cataloguer will have chosen for what I’m looking for, but I might know, e.g., that words like “graph traversal” and “matrix multiplication” will appear nearby. It is exactly where there is no precise, agreed terminology that one must rely on searching contents rather than browsing someone’s terrifically clever cataloguing scheme.

    It seems to me that “Mithras” is infinitely more likely to appear in a catalogue (and so get relevant things in close browsing-proximity) than “graph traversal algorithms related to matrix multiplication”, precisely because Mithras is a _more_ coherent and lexical concept. He has a name, for example.
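
    One way to read “will appear nearby” is as a proximity query over the full text: do the two phrases occur within some window of each other? A sketch of that idea, assuming nothing about any particular search engine:

    ```python
    # A proximity query: do two phrases occur within `window` words of each
    # other in the full text? A sketch, not any engine's implementation.
    def positions(tokens, phrase):
        """Start positions where the phrase's words occur contiguously."""
        words = phrase.lower().split()
        return [i for i in range(len(tokens) - len(words) + 1)
                if tokens[i:i + len(words)] == words]

    def near(text, phrase_a, phrase_b, window=20):
        tokens = text.lower().split()
        return any(abs(i - j) <= window
                   for i in positions(tokens, phrase_a)
                   for j in positions(tokens, phrase_b))

    text = ("Fast matrix multiplication underlies many graph traversal "
            "algorithms, including transitive closure via boolean products.")
    print(near(text, "matrix multiplication", "graph traversal"))   # True
    ```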


  2. I’m surprised that computer scientists would be so gung-ho about full-text search. There is a huge body of [computer science] research that indicates full-text search by itself fails as the number of documents increases, if you crunch the numbers for recall and precision. It’s usually necessary to have results of both full-text search and selective search (e.g. just abstract or subject headings) integrated into the search algorithm to improve recall and ranking. It’s not an either-or situation.

    Google does this through PageRank and by analyzing certain HTML header tags. Scholarly databases do it by elevating words in abstracts and subject headings.

    It would be interesting to know how these algorithms work across genres, and whether certain search algorithms work better with, say, science materials than with humanities materials. Would the former put more emphasis on full-text returns and the latter on subject-heading returns? Can search engines be designed to tell the difference?

    I imagine if we could get a bunch of computer scientists and digital humanists in a room together, they could hash this out 😉
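
    A rough sketch, in Python, of the “integrate both signals” idea: weight matches in the full text against matches in curated fields like abstracts and subject headings. The field names and weights below are invented for illustration, and tuning them per genre is exactly the open question above.

    ```python
    # Blend signals: weight matches in the full text and in the curated
    # fields. Field names and weights here are invented; a real system
    # would tune them (perhaps differently per genre).
    SCIENCE_WEIGHTS    = {"fulltext": 0.7, "abstract": 0.2, "subjects": 0.1}
    HUMANITIES_WEIGHTS = {"fulltext": 0.3, "abstract": 0.2, "subjects": 0.5}

    def field_match(field_text, query):
        """Fraction of query terms present in this field."""
        terms = query.lower().split()
        field_text = field_text.lower()
        return sum(term in field_text for term in terms) / len(terms)

    def score(record, query, weights):
        return sum(weight * field_match(record.get(field, ""), query)
                   for field, weight in weights.items())

    record = {
        "fulltext": "...we color every planar graph using at most four colors...",
        "abstract": "A computer-assisted proof of the four color theorem.",
        "subjects": "Four-color problem",
    }
    print(score(record, "four color theorem", SCIENCE_WEIGHTS))
    print(score(record, "four color theorem", HUMANITIES_WEIGHTS))
    ```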


    1. Hm. You’re right that I’m not thinking of naive fulltext search, even if that’s technically what I said ;). I’m definitely thinking in terms of a fulltext corpus plus intelligent algorithms on top of that, with keyword searching (e.g., but not simply, Google). (I think both the fulltextness and the intelligence of scholarly databases are, alas, pretty debatable. Also, the addition of subject headings, unless they’re automatically generated, removes that from the realm of “fulltext corpus” — it may (sometimes) contain one but it also contains this other stuff. Part of what I’m thinking here is that subject headings are going to seem useless if your queries are highly lexical; what extra value is there to a subject header of “four color problem” (http://lccn.loc.gov/75008860) if you have fulltext and can search for “four color theorem”? (Actually, I daresay the subject header in this instance is of negative use.) By contrast, I use subject headers a lot for classics and LIS stuff, as a way of reducing the number of keyword searches I would otherwise have to do.)


    2. Also, putting a bunch of CS/digital humanities/library people in a room together for a while, with maybe snacks and alcohol and a project, is a dream/goal of mine. 🙂


    3. Purely flat text search is a bit of a straw man — I’ve not encountered a system that limited itself to that in ages. Cross-field matching and complex relevance ranking are critical as soon as data has any structure, searches can be phrased, etc. The crucial thing is that the full text be indexed for search, not that it be the only thing searched.

      The full text is where I’m going to find many of the more obscure things I look for; catalog metadata and even abstracts will tend to be heavily steeped in domain-specific buzzwords.

      For example, the abstracts of most of the papers relating to the nested relational calculus system “Kleisli” never actually use the phrase “nested relational calculus”, because the authors simply assume that you already knew that was what it was about (the standard academic model: you know the author is Limsoon Wong, and everyone knows that he’s the NRC guy…). Nor have the clever cataloguers at Springer-Verlag or wherever done anything for us; the phrase appears only in the text. The browsing model is worse still: because Kleisli is primarily used for bioinformatics, in all likelihood these papers would end up in that category.
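
      A small sketch of that situation, with invented bibliographic details rather than the real Kleisli papers: a metadata-only search misses the phrase, while a search that also covers the indexed full text finds it.

      ```python
      # The phrase lives only in the body text, so a metadata-only search
      # misses it, while indexing the full text alongside the metadata finds
      # it. The record below is invented, not real Kleisli bibliography data.
      paper = {
          "title":    "Kleisli: a functional query system",
          "abstract": "We describe a query system widely used in bioinformatics.",
          "fulltext": "Kleisli is built on the nested relational calculus (NRC)...",
      }

      def fields_matching(record, phrase, fields):
          phrase = phrase.lower()
          return [f for f in fields if phrase in record[f].lower()]

      print(fields_matching(paper, "nested relational calculus",
                            ["title", "abstract"]))               # []
      print(fields_matching(paper, "nested relational calculus",
                            ["title", "abstract", "fulltext"]))   # ['fulltext']
      ```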


  3. “Math, computer science, the hard sciences in general — they’re highly lexical.”

    Hah! Many people working in sub-fields of math, computer science, and statistics might like to think of “their” field as possessing this property. But it is a self-fulfilling illusion: what starts as the simple ignorance of a novice is too often nurtured into a willful ignorance of prior art in other fields. Whenever the wheel can be re-invented, it can also be given a new name, flounced around a different set of conferences and proceedings as “novel”, and used to propel a new career forwards.

    During my PhD program in EE/CS, I learned a lot of interesting things by wandering around the wrong sections of the library…


  4. Fantastic word–lexicality!

    It seems to me the human indexer faces the same challenges. The traditional stages of subject analysis are (1) figuring out what something is about and (2) translating that into some controlled vocabulary system (LCSH, Dewey, etc.).

    The example I often use to explain this is a pile of videos I once had to catalog. The videos were of lectures that had taken place on campus and were labeled with the speakers’ names and no further information. So I had to listen to parts of them to figure out what they were about.

    On one end of the spectrum was Sally Ride, who spoke about her experiences as a female astronaut. It was easy to figure out what she was talking about and easy to translate it into LCSH.

    On the other hand, I had to listen to quite a bit of Cornel West’s talk before I could write a passable summary and my attempts to translate that into LCSH were not entirely satisfactory.

    Given how much stuff seems to be hard to fit into boxes, it’s amazing that browsing shelves in libraries (or other topical categorizations) works as well as it does.

    Fulltext does give you another angle and more depth, but it only works if the concept is named explicitly in the text. It seems like there needs to be some other approach for these slippery, nameless things to relate them together in ways that would make them more findable and browsable, perhaps even without naming them.


    1. I appreciate the difficulties you outline but I am skeptical that fulltext only works if the concept is named explicitly in the text, seeing as the script I wrote does not depend on this. 😉 It definitely works better for things that are highly lexical and it is probably more likely to work well if the concept is repeatedly and explicitly named, but it does not require it.

      Fulltext *search* relies on explicit naming, but classification algorithms need not.
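
      A minimal sketch of that distinction (invented toy data, and not the script from the paper): a fulltext search for the label’s own words fails on a document that never uses them, while even a trivially simple nearest-centroid classifier, trained on labeled examples, still assigns the label from the words the document does use.

      ```python
      # Invented toy example: searching the text for the label "Roman
      # religion" fails, but a trivially simple nearest-centroid classifier
      # trained on labeled examples still assigns that label.
      from collections import Counter

      TRAINING = {
          "Roman religion":     ["temple cult sacrifice Mithras priest ritual",
                                 "augur libation shrine votive offering"],
          "Physical cosmology": ["big bang expansion redshift galaxies inflation",
                                 "cosmic microwave background dark energy"],
      }

      def centroid(texts):
          counts = Counter(word for text in texts for word in text.lower().split())
          total = sum(counts.values())
          return {word: count / total for word, count in counts.items()}

      CENTROIDS = {label: centroid(texts) for label, texts in TRAINING.items()}

      def classify(text):
          words = text.lower().split()
          scores = {label: sum(cent.get(word, 0.0) for word in words)
                    for label, cent in CENTROIDS.items()}
          return max(scores, key=scores.get)

      doc = "The shrine held votive offerings left for Mithras by the cult priests."
      print("roman religion" in doc.lower())   # False: searching for the label fails
      print(classify(doc))                     # 'Roman religion': classification works
      ```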

