the great thing about deferred maintenance is everything catches fire at the same time

Once upon a time in 2017, my colleague Andy Dorner, who is awesome at devops, made a magical deploy script for HAMLET. I was like, ugh, Heroku is not a good solution, I have to AWS, I regret my life choices, and he was like, never fear! I will throw together a script which will make it all test/build/deploy every time you git push, using magic.

It worked great and I basically didn’t have to think about it until July of this year.

And then, inevitably in retrospect, I found a deploy didn’t work because nothing worked.

The presenting issue was that a URL we’d used to download a thing for certbot now 404ed. But why stop there? Travis was no longer a thing, so there goes deployment. The Amazon Linux platform had gone comprehensively obsolete, replaced by AL2 (which has a subtly but importantly different set of instructions for…everything, and they usually aren’t clearly distinguished in the various StackOverflow and Amazon documentation places one naturally ends up; for extra fun Amazon Linux isn’t quite RHEL or Centos and definitely isn’t Ubuntu or Debian, so instructions you find for those platforms almost-but-don’t-quite work, too). The supported versions of Python were different.

A sensible and disciplined person carefully swaps out one variable at a time for ease of troubleshooting. I found myself standing in the wreck of a Jenga tower, with no such option. Here’s how I rebuilt it.

(This is the linear, simplified version. In practice, picture two months of spare time flailing, git pushing to redeploy and find out what broke, leaving an absolute wreck of a git history, which it kills me not to clean up, but which I am leaving so that you feel less alone in your own desperate sallies against AWS.)

Gonna need a web server (Beanstalk/AL2)

I created a new EC2 instance, using the latest available Amazon Linux 2/Python 3.8 platform. (It turns out you can’t swap out EC2 instances under your running environment, so I couldn’t just upgrade in place.) I then created a new Beanstalk environment for that instance.

To my great shock, the sample application that Amazon provides for your brand-new Python instance is a pleasant-looking page that links to AWS Python and Django documentation. It was weirdly hospitable.

I also manually copied over all the application-environment variables from the old instance to the new one in the AWS console (Elastic Beanstalk > Environments > (name of env) > Configuration).

Also, make sure to specify a keypair when you create a new environment. You can do this on instance creation, but it’s harder, and specifying it with the environment will apply it to any instances that get autoscaled. If you don’t do this you won’t be able to ssh to the instances, and then you will be sad during the inevitable debugging slog.

Finally, the default (free) instance type that gets created here is t2.micro, but that doesn’t have enough capacity for memory-hungry neural net dependencies, so I had to destroy my first attempt and recreate an environment where I went into advanced options somewhere and specified t2.small.

A digression, in which we update our application

HAMLET had been on Python 3.6, but the only available instances were Python 3.7 and 3.8. So there’s a moment in here where I create a new 3.8 pipenv on localhost and reinstall all the things, updating as needed. Luckily this is a pretty small change and the tests still pass.

How about that database (RDS)

Then I went to attach my new instance to my old database and discovered that’s not a thing, because why would it be. Instead, you need to snapshot your existing database on the old instance; create a new RDS instance attached to your new beanstalk environment; and attach the snapshot during the creation process. Luckily RDS figured it out automagically from there (I was not looking forward to spending time configuring a database).

And then I had a database that Django couldn’t connect to, because pipenv install psycopg2 fails unless you have pg_config, which means you have to install postgresql-devel in the packages section of an .ebextensions config file, which is an example of why my git history is such a mess.

What if code needs to get from GitHub to AWS (GitHub Actions)

This was the easy part. I blew away my travis config and set up GitHub Actions, which didn’t exist in 2017, but do now. This took, like, fifteen minutes and one config file, and just worked. Great job, GitHub.

Psych, your deploy is still broken (.ebextensions)

Remember how Amazon Linux and AL2 have all sorts of subtle, patchily documented differences? Yeah. It turns out the deployment process is among them. The syntax available in .ebextensions files is a little different! .platform/hooks now exists, and you can put shell scripts there and it’s honestly pretty useful — once you figure out what executes when! I referred frequently to this documentation of AL2 config options. After alarmingly many yakshaving commits: before, after.

Mostly this was removing stuff, which was great. Got rid of stuff which was there only to hack around problems with an old version of WSGI. No longer needed logging hacks, because logs work better on AL2. Got rid of certbot conf temporarily because I needed it to work on port 80 before I complicated things with port 443 (stay tuned!) And…

Everything’s better without Apache (nginx)

Andy had set it up with Apache and it worked so I didn’t touch it. That said, I personally have never gotten Apache to work for anything, fleeing instead to the comparatively smooth and glowing embrace of nginx. As long as everything was broken anyway, and AL2 defaults to nginx and certbot has clear instructions for nginx…why not try that.

This meant destroying the Apache config in .ebextensions and then not writing nginx config, because the existing stuff just worked (modulo one little change to the syntax for the WSGI file location — hamlet.wsgi:application instead of hamlet/ That was pretty great.

What if (when) you need to debug it

The most helpful thing here was connecting directly to the EC2 instance (you can do this with eb ssh but I just used the web console) and hand-running commands to see what happened, rather than making an .ebextensions or .platform/hooks change, deploying it, and then hunting through logs. This was also particularly helpful for dealing with packages issues; instructions for various install processes usually reference apt-get and Debian/Ubuntu, but I have yum and AL2, and packages often have slightly different names, and oy, just logging into the instance and doing some yum list is so much easier than guessing and hoping.

Connecting directly to the EC2 instance also makes it easy to view logs, though the download full logs option in the Beanstalk console is good too. Don’t just get the last 100 lines; this doesn’t include all the log files you will need — notably the cfn-init.log, to which I referred constantly. eb-engine.log was very helpful too, and sometimes the nginx access and error logs.

There was also a hilarious moment (…several moments) (…not initially hilarious) when I found that my captcha wasn’t working and updating it didn’t help, and when I went to view the application logs I discovered we weren’t writing them because reasons. Rather than figure out how to do that, there I was learning strace to grab log writes before they went to the ether, which is how I discovered this psycopg2 bug. captcha is still not working.

Miscellaneous small yaks

I needed to repoint my CNAME record for from my old Beanstalk environment to the new one. The cryptic URL is clearly visible in the AWS web console for Beanstalk, and the hardest part of this was remembering where my CNAME record lived.

Correspondingly, I had to update ALLOWED_HOSTS in the Django config.

There were, at this point, CSRF failures when I submitted forms, but I decided to ignore them until after I’d fixed SSL, on the theory that the CSRF_COOKIE_SECURE = True Django setting might, you know, not work if the “secure” wasn’t available. This was the correct call, as CSRF magically worked once I stood up SSL. Similarly, I didn’t bother to think about whether I was redirecting http to https until after SSL was working, and it turned out there was nothing I needed to do here — my existing conf worked. (Whatever it is. Seriously, I have no idea what Amazon puts in its default nginx file, or how certbot edits it.)

I updated Pillow. captcha is now working.

Speaking of certbot, isn’t that where we started? (SSL)

Remember how my presenting problem was that certbot installation 404ed? And here we are several screens later and I haven’t installed certbot yet? Yes. Well. The prerequisites for the honestly quite helpful certbot instructions include having a working web site on port 80, and we’ve only just gotten there. I used the other Linux + pip instructions variant, because AL2 is extremely other, and no one really seemed to know how to install snapd on it. I filled in some .ebextensions and .platform/hooks specifics with this extremely helpful AL2 + certbot blog post, which (bonus points!) has the clearest diagram I have encountered of the order in which things happen during Beanstalk deployments. In particular, this blog post tipped me off that the certbot certificate installation script needs to run during a postdeploy hook, not during .ebextensions, so that the changes that it makes to nginx configuration will persist across autoscaling events.

On the whole, though, EFF has improved its certbot installation process since 2017, and once I had the entire rest of the web site working, this part was fairly straightforward.

The life-changing magic of tidying up

At this point I had 3 Beanstalk environments — original HAMLET; failed (t2.micro) first attempt at upgrading to AL2; and shiny new working HAMLET. Don’t forget to destroy the old ones when you no longer need them! Terminating them will also terminate associated resources like databases, and then you stop getting charged for them.

Which brings me to the last item on my todo list, PARTY AND BLOG ABOUT IT.

🎉 🥳 🎊 Hi! I did it!

…but it is still broken

Remember that captcha? The form it guards 500s on submission. More time with strace later, I find that the fast version of (my very old) gensim isn’t compiling, so it’s falling back to the slow version, so it doesn’t have neg_labels available, and it needs that to run infer_vector. I have tried a variety of things to install gensim (notably, build-essentials isn’t a thing on this system; I need yum groupinstall "Development Tools"), but it still doesn’t compile. This means parts of the site work — the slow gensim path is still available — but not anything involving processing uploaded text.

I strongly suspect I’m caught in version shear problems (my gensim is extremely outdated), but upgrading that is going to be its own slow and careful thing, outside the scope of getting HAMLET to stand up again at all.

🎉 🥳 🎊 Hi! I sort of did it!

Though these be matrices, yet there is method in them.

When I first trained a neural net on 43,331 theses to make HAMLET, one of the things I most wanted to do is be able to visualize them. If word2vec places documents ‘near’ each other in some kind of inferred conceptual space, we should be able to see some kind of map of them, yes? Even if I don’t actually know what I’m doing?

Turns out: yes. And it’s even better than I’d imagined.

43,331 graduate theses, arranged by their conceptual similarity.

Let me take you on a tour!

Region 1 is biochemistry. The red dots are biology; the orange ones, chemistry. Theses here include Positional cloning and characterization of the mouse pudgy locus and Biosynthetic engineering for the assembly of better drugs. If you look closely, you will see a handful of dots in different colors, like a buttery yellow. This color is electrical engineering & computer science, and its dots in this region include Computational regulatory genomics : motifs, networks, and dynamics — that is to say, a computational biology thesis that happens to have been housed in computation rather than biology.

The green south of Region 2 is physics. But you will note a bit of orange here. Yes, that’s chemistry again; for example, Dynamic nuclear polarization of amorphous and crystalline small molecules. If (like me), you almost majored in chemistry and realized only your senior year that the only chemistry classes that interested you were the ones that were secretly physics…this is your happy place. In fact, most of the theses here concern nuclear magnetic resonance applications.

Region 3 has a striking vertical green stripe which turns out to be the nuclear engineering department. But you’ll see some orange streaks curling around it like fingers, almost suggesting three-dimensional depth. I point this out as a reminder that the original neural net embeds these 43,331 documents in a 52-dimensional space; I have projected that down to 2 dimensions because I don’t know about you but I find 52 dimensions somewhat challenging to visualize. However — just as objects may overlap in a 2-dimensional photo even when they are quite distant in 3-dimensional space — dots that are close together in this projection may be quite far apart in reality. Trust the overall structure more than each individual element. The map is not the territory.

That little yellow thumb by Region 4 is mathematics, now a tiny appendage off of the giant discipline it spawned — our old friend buttery yellow, aka electrical engineering & computer science. If you zoom in enough you find EECS absolutely everywhere, applied to all manner of disciplines (as above with biology), but the bulk of it — including the quintessential parts, like compilers — is right here.

Dramatically red Region 5, clustered together tightly and at the far end, is architecture. This is a renowned department (it graduated I.M. Pei!), but definitely a different sort of creature than most of MIT, so it makes sense that it’s at one extreme of the map. That said, the other two programs in its school — Urban Studies & Planning and Media Arts & Sciences — are just to its north.

Region 6 — tiny, yellow, and pale; you may have missed it at first glance — is linguistics island, housing theses such as Topics in the stress and syntax of words. You see how there are also a handful of red dots on this island? They are Brain & Cognitive Science theses — and in particular, ones that are secretly linguistics, like Intonational phrasing in language production and comprehension. Similarly — although at MIT it is not the department of linguistics, but the department of linguistics & philosophy — the philosophy papers are elsewhere. (A few of the very most abstract ones are hanging out near math.)

And what about Region 7, the stingray swimming vigorously away from everything else? I spent a long time looking at this and not seeing a pattern. You can tell there’s a lot of colors (departments) there, randomly assorted; even looking at individual titles I couldn’t see anything. Only when I looked at the original documents did I realize that this is the island of terrible OCR. Almost everything here is an older thesis, with low-quality printing or even typewriting, often in a regrettable font, maybe with the reverse side of the page showing through. (A randomly chosen example; pdf download.)

A good reminder of the importance of high-quality digitization labor. A heartbreaking example of the things we throw away when we make paper the archival format for born-digital items. And also a technical inspiration — look how much vector space we’ve had to carve out to make room for these! the poor neural net, trying desperately to find signal in the noise, needing all this space to do it. I’m tempted to throw out the entire leftmost quarter of this graph, rerun the 2d projection, and see what I get — would we be better able to see the structures in the high-quality data if they had room to breathe? And were I to rerun the entire neural net training process again, I’d want to include some sort of threshold score for OCR quality. It would be a shame to throw things away — especially since they will be a nonrandom sample, mostly older theses — but I have already had to throw away things I could not OCR at all in an earlier pass, and, again, I suspect the neural net would do a better job organizing the high-quality documents if it could use the whole vector space to spread them out, rather than needing some of it to encode the information “this is terrible OCR and must be kept away from its fellows”.

Clearly I need to share the technical details of how I did this, but this post is already too long, so maybe next week. tl;dr I reached out to Matt Miller after reading his cool post on vectorizing the DPLA and he tipped me off to UMAP and here we are — thanks, Matt!

And just as clearly you want to play with this too, right? Well, it’s super not ready to be integrated into HAMLET due to any number of usability issues but if you promise to forgive me those — have fun. You see how when you hover over a dot you get a label with the format 1721.1-X.txt? It corresponds to a URL of the format Go play :).

Let’s visualize some HAMLET data! Or, d3 and t-SNE for the lols.

In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).

That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!

There are only two challenges with this:

  1. By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
  2. I also don’t know know JavaScript much beyond a copy-paste level. I definitely don’t know d3, or indeed the pros and cons of various visualization libraries. Also art. Or, like, all that stuff in Tufte’s book, which I bounced off of.

(But aside from that, Mr. Lincoln, how was the play?)

I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️

And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:

  1. It’s hard to copy-paste d3 examples on the internet. d3’s been around for long enough there’s substantial content about different versions, so you have to double-check. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
  2. If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.)

Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.

So that’s my next goal — try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.

Onward, upward, et cetera. 🎉