Though these be matrices, yet there is method in them.

When I first trained a neural net on 43,331 theses to make HAMLET, one of the things I most wanted to do was visualize them. If Doc2Vec places documents ‘near’ each other in some kind of inferred conceptual space, we should be able to see some kind of map of them, yes? Even if I don’t actually know what I’m doing?

Turns out: yes. And it’s even better than I’d imagined.

43,331 graduate theses, arranged by their conceptual similarity.

Let me take you on a tour!

Region 1 is biochemistry. The red dots are biology; the orange ones, chemistry. Theses here include Positional cloning and characterization of the mouse pudgy locus and Biosynthetic engineering for the assembly of better drugs. If you look closely, you will see a handful of dots in different colors, like a buttery yellow. This color is electrical engineering & computer science, and its dots in this region include Computational regulatory genomics : motifs, networks, and dynamics — that is to say, a computational biology thesis that happens to have been housed in computation rather than biology.

The green south of Region 2 is physics. But you will note a bit of orange here. Yes, that’s chemistry again; for example, Dynamic nuclear polarization of amorphous and crystalline small molecules. If (like me) you almost majored in chemistry and realized only your senior year that the only chemistry classes that interested you were the ones that were secretly physics…this is your happy place. In fact, most of the theses here concern nuclear magnetic resonance applications.

Region 3 has a striking vertical green stripe which turns out to be the nuclear engineering department. But you’ll see some orange streaks curling around it like fingers, almost suggesting three-dimensional depth. I point this out as a reminder that the original neural net embeds these 43,331 documents in a 52-dimensional space; I have projected that down to 2 dimensions because I don’t know about you but I find 52 dimensions somewhat challenging to visualize. However — just as objects may overlap in a 2-dimensional photo even when they are quite distant in 3-dimensional space — dots that are close together in this projection may be quite far apart in reality. Trust the overall structure more than each individual element. The map is not the territory.
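
To make the photo analogy concrete, here is a minimal sketch in Python (my choice for illustration; it is not the code behind the map). It uses the crudest projection there is, dropping the last coordinate of two made-up 3-d points. That is not how the map was projected, but it shows how two points that are far apart can land on exactly the same spot.

```python
import math

def distance(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def drop_last_axis(point):
    """Crude projection: keep every coordinate except the last one."""
    return point[:-1]

# Made-up points: identical x and y, very different depth (z).
near_camera = (1.0, 2.0, 0.0)
far_away = (1.0, 2.0, 50.0)

dist_3d = distance(near_camera, far_away)
dist_2d = distance(drop_last_axis(near_camera), drop_last_axis(far_away))

print(dist_3d)  # 50.0
print(dist_2d)  # 0.0
```

Fifty units apart in the original space, zero apart in the picture. Going from 52 dimensions to 2 leaves a lot more room for this kind of collapse.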

That little yellow thumb by Region 4 is mathematics, now a tiny appendage off of the giant discipline it spawned — our old friend buttery yellow, aka electrical engineering & computer science. If you zoom in enough you find EECS absolutely everywhere, applied to all manner of disciplines (as above with biology), but the bulk of it — including the quintessential parts, like compilers — is right here.

Dramatically red Region 5, clustered together tightly and at the far end, is architecture. This is a renowned department (it graduated I.M. Pei!), but definitely a different sort of creature than most of MIT, so it makes sense that it’s at one extreme of the map. That said, the other two programs in its school — Urban Studies & Planning and Media Arts & Sciences — are just to its north.

Region 6 — tiny, yellow, and pale; you may have missed it at first glance — is linguistics island, housing theses such as Topics in the stress and syntax of words. You see how there are also a handful of red dots on this island? They are Brain & Cognitive Science theses — and in particular, ones that are secretly linguistics, like Intonational phrasing in language production and comprehension. Similarly — although at MIT it is not the department of linguistics, but the department of linguistics & philosophy — the philosophy papers are elsewhere. (A few of the very most abstract ones are hanging out near math.)

And what about Region 7, the stingray swimming vigorously away from everything else? I spent a long time looking at this and not seeing a pattern. You can tell there’s a lot of colors (departments) there, randomly assorted; even looking at individual titles I couldn’t see anything. Only when I looked at the original documents did I realize that this is the island of terrible OCR. Almost everything here is an older thesis, with low-quality printing or even typewriting, often in a regrettable font, maybe with the reverse side of the page showing through. (A randomly chosen example; pdf download.)

A good reminder of the importance of high-quality digitization labor. A heartbreaking example of the things we throw away when we make paper the archival format for born-digital items. And also a technical inspiration: look how much vector space we’ve had to carve out to make room for these! The poor neural net, trying desperately to find signal in the noise, needing all this space to do it. I’m tempted to throw out the entire leftmost quarter of this graph, rerun the 2d projection, and see what I get. Would we be better able to see the structures in the high-quality data if they had room to breathe? And were I to rerun the entire neural net training process, I’d want to include some sort of threshold score for OCR quality. It would be a shame to throw things away (especially since they would be a nonrandom sample, mostly older theses), but I have already had to throw away things I could not OCR at all in an earlier pass, and, again, I suspect the neural net would do a better job organizing the high-quality documents if it could use the whole vector space to spread them out, rather than needing some of it to encode the information “this is terrible OCR and must be kept away from its fellows”.
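
An OCR quality score could be as simple as the Python sketch below. Everything in it is an assumption for illustration: the `ocr_quality` function, the tiny `COMMON_WORDS` set, and the 0.5 cutoff are all hypothetical. A real filter would want a proper dictionary and a cutoff tuned against documents someone has actually looked at.

```python
import re

# Hypothetical stand-in for a real dictionary or word-frequency list.
COMMON_WORDS = {"the", "of", "and", "a", "in", "to", "is",
                "this", "that", "for", "we", "with"}

def ocr_quality(text):
    """Fraction of whitespace-separated tokens that look like real words.

    A token counts as word-like if, after stripping surrounding punctuation,
    it is purely alphabetic and is either a known common word or at least
    three letters long with at least one vowel.
    """
    tokens = text.split()
    if not tokens:
        return 0.0
    wordlike = 0
    for token in tokens:
        core = token.strip(".,;:!?()\"'").lower()
        if not core.isalpha():
            continue
        if core in COMMON_WORDS or (len(core) >= 3 and re.search(r"[aeiouy]", core)):
            wordlike += 1
    return wordlike / len(tokens)

THRESHOLD = 0.5  # assumed cutoff, not a tuned value

clean = "In this thesis we study the dynamics of nuclear polarization."
noisy = "1n th1s the5is w3 st~dy t|e dyn@mics 0f nucl3ar p0larizat1on."

for sample in (clean, noisy):
    score = ocr_quality(sample)
    print(round(score, 2), "keep" if score >= THRESHOLD else "drop")
```

A heuristic like this would also penalize theses that are legitimately full of equations or tables, which is one more reason to check any cutoff by hand before throwing things away.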

Clearly I need to share the technical details of how I did this, but this post is already too long, so maybe next week. tl;dr I reached out to Matt Miller after reading his cool post on vectorizing the DPLA and he tipped me off to UMAP and here we are — thanks, Matt!

And just as clearly you want to play with this too, right? Well, it’s super not ready to be integrated into HAMLET due to any number of usability issues but if you promise to forgive me those — have fun. You see how when you hover over a dot you get a label with the format 1721.1-X.txt? It corresponds to a URL of the format Go play :).

Let’s visualize some HAMLET data! Or, d3 and t-SNE for the lols.

In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).

That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!

There are only two challenges with this:

  1. By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
  2. I also don’t know JavaScript much beyond a copy-paste level. I definitely don’t know d3, or indeed the pros and cons of various visualization libraries. Also art. Or, like, all that stuff in Tufte’s book, which I bounced off of.

(But aside from that, Mr. Lincoln, how was the play?)

I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️
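
For a sense of what “the theses most similar to a given thesis” means computationally, here is a minimal Python sketch. It assumes similarity means cosine similarity between document vectors, which is the usual choice for embeddings like these. The ids and 3-d vectors are invented, and `most_similar` is a hypothetical helper, not HAMLET’s code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def most_similar(target_id, vectors, k=10):
    """Return the k (id, similarity) pairs closest to target_id, best first."""
    target = vectors[target_id]
    scored = [
        (other_id, cosine_similarity(target, vec))
        for other_id, vec in vectors.items()
        if other_id != target_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Invented 3-d vectors; the real ones have many more dimensions.
vectors = {
    "thesis-a": [1.0, 0.0, 0.0],
    "thesis-b": [0.9, 0.1, 0.0],
    "thesis-c": [0.0, 1.0, 0.0],
    "thesis-d": [-1.0, 0.0, 0.0],
}

neighbors = most_similar("thesis-a", vectors, k=2)
print([doc_id for doc_id, _ in neighbors])  # ['thesis-b', 'thesis-c']
```

Those ten-or-so neighbors, plus the target itself, are the points that need to be squashed down to 2d and plotted.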

And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:

  1. It’s hard to copy-paste d3 examples on the internet. d3’s been around for long enough that there’s substantial content about different versions, so you have to double-check. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
  2. If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.)
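
One way to put a number on “preserving the distances I care about” is to ask how many of each point’s nearest neighbors in the original space are still its nearest neighbors after projection. The Python sketch below is one such check under my own assumptions: Euclidean distance, invented 3-d points, and a hypothetical `neighbor_overlap` helper. It is not a named metric from any library.

```python
import math

def nearest(index, points, k):
    """Indices of the k points closest to points[index], excluding itself."""
    others = [i for i in range(len(points)) if i != index]
    others.sort(key=lambda i: math.dist(points[index], points[i]))
    return set(others[:k])

def neighbor_overlap(high, low, k):
    """Average fraction of each point's k nearest neighbors kept by the projection.

    high and low are parallel lists: low[i] is the projection of high[i].
    1.0 means every neighborhood survived; 0.0 means none did.
    """
    total = 0.0
    for i in range(len(high)):
        total += len(nearest(i, high, k) & nearest(i, low, k)) / k
    return total / len(high)

# Two tight pairs of points, separated along the z axis.
high = [(0, 0, 0), (1, 0, 0), (0, 0, 10), (1, 0, 10)]

keeps_z = [(x, z) for x, y, z in high]   # projection that keeps the separation
drops_z = [(x, y) for x, y, z in high]   # projection that flattens it away

print(neighbor_overlap(high, keeps_z, k=1))  # 1.0
print(neighbor_overlap(high, drops_z, k=1))  # 0.0
```

A score like this makes it possible to compare two projections of the same data with something firmer than squinting at them.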

Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.
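
The shape of that encapsulation, sketched in Python rather than whatever is actually running on localhost: one function owns the choice of technique and always returns the same structure, so the plotting code never needs to know which technique ran. The names (`reduce_dimensions`, `first_two_axes`, `random_projection`) and the `{"id", "x", "y"}` record shape are assumptions for illustration. The two reducers are deliberately trivial stand-ins, not t-SNE or UMAP.

```python
import random

def first_two_axes(vectors):
    """Stand-in reducer: keep the first two coordinates, ignore the rest."""
    return [(v[0], v[1]) for v in vectors]

def random_projection(vectors, seed=0):
    """Stand-in reducer: project onto two random directions.

    The fixed seed makes the output identical on every run.
    """
    rng = random.Random(seed)
    dims = len(vectors[0])
    axis_x = [rng.gauss(0, 1) for _ in range(dims)]
    axis_y = [rng.gauss(0, 1) for _ in range(dims)]
    return [
        (sum(a * b for a, b in zip(v, axis_x)),
         sum(a * b for a, b in zip(v, axis_y)))
        for v in vectors
    ]

def reduce_dimensions(docs, reducer=first_two_axes):
    """docs maps id -> vector. Returns [{"id", "x", "y"}, ...] for any reducer."""
    ids = list(docs)
    coords = reducer([docs[doc_id] for doc_id in ids])
    return [{"id": doc_id, "x": x, "y": y} for doc_id, (x, y) in zip(ids, coords)]

# Invented 4-d vectors standing in for real document embeddings.
docs = {"doc-1": [1.0, 2.0, 3.0, 4.0], "doc-2": [0.5, 0.1, 0.9, 0.3]}

points_a = reduce_dimensions(docs)
points_b = reduce_dimensions(docs, reducer=random_projection)
print(points_a)
```

Swapping techniques is then a one-argument change. Pinning the seed inside any reducer that uses randomness is also one plausible way to stop the picture from changing between page loads.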

So that’s my next goal — try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.
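
Of the candidates on that list, PCA is the one simple enough to sketch from scratch, so here is a toy version in plain Python: center the data, build the covariance matrix, find its top two directions by power iteration, and project onto them. The function name `pca_2d` and the sample points are invented, and this is a teaching sketch under those assumptions; a real project would reach for a linear algebra library instead.

```python
import math
import random

def pca_2d(vectors, iterations=200, seed=0):
    """Project vectors onto their top two principal components.

    Uses power iteration with deflation on the covariance matrix, which is
    fine for small examples but much slower than a real library.
    """
    n, dims = len(vectors), len(vectors[0])
    means = [sum(v[d] for v in vectors) / n for d in range(dims)]
    centered = [[v[d] - means[d] for d in range(dims)] for v in vectors]
    cov = [
        [sum(row[i] * row[j] for row in centered) / n for j in range(dims)]
        for i in range(dims)
    ]

    rng = random.Random(seed)
    components = []
    for _ in range(2):
        vec = [rng.gauss(0, 1) for _ in range(dims)]
        for _ in range(iterations):
            # Multiply by the covariance matrix...
            vec = [sum(cov[i][j] * vec[j] for j in range(dims)) for i in range(dims)]
            # ...remove anything along components already found...
            for comp in components:
                dot = sum(a * b for a, b in zip(vec, comp))
                vec = [a - dot * b for a, b in zip(vec, comp)]
            # ...and rescale to unit length.
            norm = math.sqrt(sum(a * a for a in vec))
            if norm == 0:
                break
            vec = [a / norm for a in vec]
        components.append(vec)

    return [
        tuple(sum(a * b for a, b in zip(row, comp)) for comp in components)
        for row in centered
    ]

# Invented points lying almost exactly along one line in 3-d.
vectors = [
    [-2.0, -4.0, 0.1],
    [-1.0, -2.0, -0.1],
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 0.1],
    [2.0, 4.0, -0.1],
]

projected = pca_2d(vectors)
for x, y in projected:
    print(round(x, 2), round(y, 2))
```

The first output coordinate should carry nearly all of the spread, since the sample points were built to lie along a line. The sign of each axis is arbitrary, so a mirrored plot is still the same answer.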

Onward, upward, et cetera. 🎉