Monday I was at a hackathon for the Harvard LibraryCloud API. So this was great, not only because it’s good for us work-from-home types to leave the house and talk to humans instead of just our cats, but also because I got there and there were a whole lot of other women in the room! This hackathon shared at least one organizer with the prior one where I was the only woman at the table, so I really appreciate that he took my vehement criticism in the best possible way and conducted obvious outreach.
Then he asked who was there for the tutorial, not the hackathon, and literally every other woman in the room raised her hand. *sigh* [ref]Subsequently a few more women showed up, so I ended up being one of 2 or 3. Yay? Also not exactly escaping my attention that all but about two people in the room were white, so if I were a woman of color looking to not feel isolated, game over.[/ref]
What recourse does one have, really, but code as feminist critique? Hence: Intersectional LibraryCloud.
What you’re seeing on that web page are the most commonly used Harvard Library resources that match a given subject search. (The page is seeded on load with a randomly chosen popular topic, but do try your own!) The results are sorted by stackscore and highlighted to show whether their subjects include terms commonly associated with women’s studies, African-American studies, or LGBT studies. (Thank you to Harvard librarian Vernica Downey for the cataloging help.)
With this page, I want to examine the question: when Harvard students and faculty develop their understandings of various topics, are those understandings informed by intersectional perspectives? (Answers are left as an exercise for the reader.)
This was the work of a day (…plus way too long shaving yaks to get it onto Heroku), so there are some issues with the code I’d love to see fixed. For one thing, I don’t handle substrings, so some subjects that should definitely be coded as matches, aren’t. For another, in my initial plan I wanted to look at disability studies too, but my first-pass layout doesn’t accommodate it and I don’t have a suitable set of subjects. This is an exercise for the reader too, though! Because I’ve got code, and you can hack on it: woo yeah intersectional librarycloud repo.
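For anyone eyeing that substring issue, here is a minimal sketch in Python of what the fix might look like. Everything here is hypothetical — the function and the curated term set are stand-ins for illustration, not the actual names in the repo. The point is that a record subject like “Feminism — United States — History” should count as a match for the curated term “Feminism” even though the strings aren’t equal:

```python
# Hypothetical sketch of the substring fix. CURATED stands in for the
# curated subject-term lists; the real repo's names and structure differ.
CURATED = {"feminism", "women's studies"}

def matches(record_subjects, curated_terms=CURATED):
    """Return True if any curated term appears *within* any record subject,
    rather than requiring an exact whole-string match."""
    return any(
        term in subject.lower()
        for subject in record_subjects
        for term in curated_terms
    )

print(matches(["Feminism -- United States -- History"]))  # True
print(matches(["Solar physics"]))                          # False
```

A real fix would probably also want to match on word boundaries so that, say, a term that happens to be a substring of an unrelated word doesn’t produce false positives.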
This is the sort of code that invites human-language commentary as well; would love to hear your thoughts.
15 thoughts on “LibraryCloud hackathon report: or, code as intersectional feminist critique”
I was asked: how does one know it’s working, given that none of the highlightable things are actually highlighted?
…well…so…it turns out that’s how you know it’s working, because that’s how things work.
But you can search for something like “feminism” or “civil rights” if you would like to verify that highlighting actually ever occurs…
I have to confess I do not understand what I’m looking at here. What is stack score? (When I put in a subject I actually know something about, solar physics, the highest stack score is 2?) This doesn’t look like a list of books or journals or library resources in any sense I’m familiar with. What are the terms you’re looking to match?
Stackscore is a measure of popularity (this is where I handwave about the details of how it’s calculated). Solar physics is apparently not very popular 😉
Not quite sure what you mean by the last question as there are several possibilities there. Answers include:
* I match your query (or a randomly chosen one, when you first load the page) against the subject, in a search of Harvard library resources, and list the top 10 items returned (ordered by popularity).
* For the three highlight columns, I check to see whether the returned items ALSO have subject headers that are common in women’s studies, African-American studies, and/or LGBT studies.
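Concretely, those two steps look roughly like the sketch below. This is a hedged illustration, not the actual code: the field names (`title`, `stackscore`, `subjects`) and the tiny term sets are stand-ins for the real LibraryCloud response shape and Vernica’s curated heading lists.

```python
# Illustrative stand-ins for the curated subject-heading lists.
WOMENS = {"feminism", "women"}
AFRICAN_AMERICAN = {"african americans"}
LGBT = {"gays", "lesbians"}

def top_ten_with_highlights(items):
    """Rank matched items by stackscore, keep the top 10, and flag
    whether each item's subjects overlap the three curated lists."""
    ranked = sorted(items, key=lambda it: it["stackscore"], reverse=True)[:10]
    rows = []
    for it in ranked:
        subjects = {s.lower() for s in it["subjects"]}
        rows.append({
            "title": it["title"],
            "stackscore": it["stackscore"],
            "womens_studies": bool(subjects & WOMENS),
            "african_american_studies": bool(subjects & AFRICAN_AMERICAN),
            "lgbt_studies": bool(subjects & LGBT),
        })
    return rows
```

Each row then becomes one line of the table on the page, with the three boolean flags driving the three highlight columns.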
In what way does it not look like a list of books to you? What would make it look more like one?
There’s a story here I’m trying to tell with code, but clearly I’m not doing a good job of it…
One thing that I think would be helpful, and which can’t be produced by someone else right now (thus no pull request) is some commentary on the criteria used to decide on the subject headings used. Even if it’s just “I picked the common ones that made sense to me,” I think that would be valuable information to capture for people who want to contribute.
I’m thinking, if I have time, of submitting a PR that crawls LOC for sub-headings of the hardcoded headings. Assuming that would be useful.
Also, Bobbi and I think we’ve found some bugs in the LibraryCloud handling of subject headings, which might be affecting recall here.
(This is Dave from 90 Mt. Auburn)
I am not a cataloger, so I asked Vernica Downey, who is, to supply good subject headers 🙂 She trawled some relevant libguides to determine headers that are common for these types of works.
Pull requests welcome, even if the LoC idea doesn’t work. What bugs did you find?
Urrrgh. I forgot it costs money to crawl LOC programmatically. Boo-urns.
This is an interesting idea.
It seems to me that the results you get from this kind of code can be interpreted as, “How much crosstalk is there between these three academic communities (women’s studies, black studies, lgbt studies) and other academic fields?” That kind of question is interesting for a lot of reasons — but it should be noted that there are two related questions that it is not:
1) How much do other academic fields talk about the issues that are important in these three fields?
2) How central are women, black people, and lgbt people to various academic fields?
Number 2 up there is probably pretty obvious, since it’s really not what you were aiming to do, and if it isn’t obvious to anyone reading this, you can try “Nursing” as a field and see what I mean. Number 1 might be more surprising, but I’m still pretty confident in it, for the reason that different fields sometimes have different words for similar ideas. This comes up a lot in statistics, which is important enough to many fields that methodologically-focused subdisciplines have developed with their own methods journals, etc.
I imagine intersectionality itself is one such idea, with people in different fields talking about the core conceit (that you can’t understand the effects of one variable without thinking about others) but in different ways. In psychology there is a lot of talk of interaction effects, where the effect of some variable X1 (say, race) depends on the level of some other variable X2 (say, sex). You also hear this referred to as non-additivity. In medical research they use interaction effects sometimes but they don’t always talk about them as such. Sometimes they’ll talk about stratifying a sample by multiple variables at once, which sort of gets at the same idea.

Anyway, the point of all this is that this kind of tool probably can’t capture the places where fields have related concepts but divergent jargon; and jargon, like any other kind of linguistic innovation, is tied to historical accident, who studied with whom, which populations interacted, etc. So to some extent, what you’re seeing here may be less about demographic gaps per se and more about the phylogenetic history of academic thought.
I should note that the “different words for the same thing” problem is exactly what having, and using, subject headers rather than keyword searches ought to be avoiding. I can’t be 100% convinced it does in practice, but my default response to your argument here is intense skepticism; I’d need to see some evidence that cataloging works that way, instead of how it *ought* to work, before I’d buy it.
Huh. Well, I don’t have the domain knowledge to demonstrate it with your tool, but I poked at a few different keyword searches in OCLC to see whether some statistical terms that I know to have overlapping referents also give overlapping subjects. The results — better than I thought, but still not at all perfect. The universe of subjects appears to know that “multilevel model” and “mixed effects model” are the same thing. Other related terms (“hierarchical linear model,” “split-plot,” “nested model”) don’t cluster in this way. Incidentally I also had never heard the term “nested model” until I went to Wikipedia — which does know that these things are related.
How do catalogers do their work? Naively I imagine the incentive should be to err on the side of proliferation-of-terms rather than overgeneralization, but having never met a cataloger that’s really just a guess. There are almost certainly machine learning techniques that could aid in the effort to categorize — is this sort of thing in use in the cataloging world? Or do they do surveys? Or do they just read a lot?
Traditionally, catalogers have only assigned a limited number of subject headings (usually not more than 3). If a specific heading applied, that was used. If not, a more general heading was applied. This came mostly out of the labor-intensive world of card catalogs, but library technical services is still very productivity-focused, and there is a lot of incentive to do less. When I am working in areas outside of my expertise (which is more often than I would like to admit), I sometimes have to choose a broad subject just to get a record completed.
I have been thinking a lot about this project since the hackathon. Subject headings, as they are assigned in library records, are not the best way to get at the content of collections. For example, many academic libraries (mine included) generally do not assign subject headings to fiction or poetry. The intersections abundant in creative works are lost to users in that case.
Stack scores, as they are currently assigned, are also not the best way to rank popularity or use. Non-circulating collections and collections in libraries with separate circulation systems (i.e., most special collections and archives and many area studies collections) have low scores simply because they cannot be quantified at the moment.
I love the possibilities of this project, but the realities and limitations of library cataloging and metrics make things difficult (as usual!). This is why there needs to be more collaboration between library tech and cataloging/technical services and more interest and support for bringing catalogers into the library tech world.
I love the possibilities, too. It seems to me this idea could be the kernel of a very interesting, more general tool that searches for gaps that could be filled and/or crosstalk that should be happening, but isn’t. And I mean, that seems to me to be one of the best case scenarios for an end product of a hackathon. Nobody goes straight from NaNoWriMo to a book contract. Everything could benefit from editing.
Now I am realizing what I totally should have realized earlier — there are definitely machine-learning types who think about this stuff: that’s part of what bibliometrics is about. I still have no idea about the culture of the LIS world, whether catalogers and …bibliometricians? talk at all, whether some people are both. That feels like a very meta thing to wonder, in this context!
AFAIK, there is nothing in the subject cataloging codes that reflects the fact that one potential end-user of cataloging is an algorithm or application. Subject headings are assigned for human readers, and in particular for reference librarians as the interpreters of their meaning, since there is nothing intuitive about the headings even though they use something approximating natural language.
The subject headings are intended to reflect the aggregate of topics in the book as a whole. If a book covers a number of specific topics, one assigns a general heading that covers them all rather than specific headings. For major portions of a work (>20%) one can assign a specific heading. (See LC’s own assignment policies here: http://www.itsmarc.com/crs/mergedprojects/subjhead/whnjs.htm). And it is true that during the card catalog years, 3 headings was the limit, and often only 1-2 were assigned.
I think that subject headings are pretty much useless, and that the addition of tables of contents and terms from indexes would greatly enhance retrieval. For example, “Reading Lolita in Tehran” gets the heading “American literature — Study and teaching — Iran” — would anyone look for it there? It also gets “Women — Books and reading — Iran”. You’d have to already know the book exists to look there. Solnit’s book “Men Explain Things to Me” has the heading “Women — Social conditions — 21st century”. I have to stop myself now because I get so frustrated when I look at these headings and how useless they are. But you can see why researchers are so excited about full text indexing of books.
This is also very interesting. I certainly agree that in principle, more data (such as that provided by an index) is better than less. But it does make me wonder, where do index terms come from? Who decides whether a brief mention of an idea on page 320 warrants inclusion in an index? And will synonyms for that idea show up in the index?
And what about full-text indexing? Is that something like, e.g., Google Books, being able to do full-text searches on an entire work? We probably still have the synonyms problem (right?), but might we not also have what I’ll call, for lack of a better word, the “declensions” problem: different wordforms failing to be cross-indexed to one another?
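To make the wordforms worry concrete, here is a toy illustration (the sentence and the crude stemmer are invented for demonstration; real search systems use proper algorithms like Porter stemming). An exact full-text match for “history” misses every inflected form, while even a crude suffix-stripper collapses them onto a shared stem:

```python
import re

# Exact-match full-text search misses all the inflected forms of "history".
text = "Historians have historically historicized these histories."
tokens = re.findall(r"[a-z]+", text.lower())
print("history" in tokens)  # False: no token is exactly "history"

# A crude suffix-stripping stemmer (a toy; real systems use e.g. Porter
# stemming) reduces the related forms to one shared stem.
def crude_stem(word):
    for suffix in ("ically", "icized", "ians", "ies", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

stems = {crude_stem(t) for t in tokens}
print("histor" in stems)  # True: the four "histor-" forms share a stem
```

Of course, stemming trades the declensions problem for an overgeneralization problem (unrelated words can share a suffix pattern), which is its own can of worms.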
There are probably dozens or even hundreds of experts in this sort of thing, but I am not one of them and now I’m curious.
There are some interesting projects involving using full-text indexing to provide subject access to library resources. It isn’t my area of expertise, but I attended a fascinating presentation by staff working on such a project at the Harvard Medical School. Here’s the link to the project website: https://osc.hul.harvard.edu/liblab/projects/automatic-subject-heading-extraction.
This is a late reply, but that was an intensely cool video, and a very cool project, too. Thanks for sharing it!