Finished course 4 of the deeplearning.ai sequence. Yay! The facial recognition assignment is kind of buggy and poorly documented and I felt creepy for learning it in the first place, but I’m glad to have finished. Only one more course to go! It’s a 3-week course, so if I’m particularly aggressive I might be able to get it all done by year’s end.
Tried making a 3d version of last week’s visualization — several people had asked — but it turned out to not really add anything. Oh well.
Been thinking about Charlie Harper’s talk at SWiB this year, Generating metadata subject labels with Doc2Vec and DBPedia. This talk really grabbed me because he started with the exact same questions and challenges as HAMLET — seriously, the first seven and a half minutes of this talk could be the first seven and a half minutes of a talk on HAMLET, essentially verbatim — but took it off in a totally different direction (assigning subject labels). I have lots of ideas about where one might go with this but right now they are all sparkling Voronoi diagrams in my head and that’s not a language I can readily communicate.
All done with the second iteration of my AI for librarians course. There were some really good final projects this term. Yay, students!
When I first trained a neural net on 43,331 theses to make HAMLET, one of the things I most wanted to do was visualize them. If Doc2Vec places documents ‘near’ each other in some kind of inferred conceptual space, we should be able to see some kind of map of them, yes? Even if I don’t actually know what I’m doing?
Turns out: yes. And it’s even better than I’d imagined.
The green south of Region 2 is physics. But you will note a bit of orange here. Yes, that’s chemistry again; for example, Dynamic nuclear polarization of amorphous and crystalline small molecules. If (like me), you almost majored in chemistry and realized only your senior year that the only chemistry classes that interested you were the ones that were secretly physics…this is your happy place. In fact, most of the theses here concern nuclear magnetic resonance applications.
Region 3 has a striking vertical green stripe which turns out to be the nuclear engineering department. But you’ll see some orange streaks curling around it like fingers, almost suggesting three-dimensional depth. I point this out as a reminder that the original neural net embeds these 43,331 documents in a 52-dimensional space; I have projected that down to 2 dimensions because I don’t know about you but I find 52 dimensions somewhat challenging to visualize. However — just as objects may overlap in a 2-dimensional photo even when they are quite distant in 3-dimensional space — dots that are close together in this projection may be quite far apart in reality. Trust the overall structure more than each individual element. The map is not the territory.
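To make the photo analogy concrete, here is a toy sketch in Python. It is not the projection used for the map; it just drops one coordinate from some made-up 3-dimensional points (every name and number here is hypothetical) to show how two points can land on top of each other in 2 dimensions while being far apart in the original space.

```python
import math

# Hypothetical 3-d "embeddings"; the real ones have many more dimensions.
points = {
    "thesis_a": (1.0, 2.0, 0.0),
    "thesis_b": (1.0, 2.0, 9.0),  # same x and y as thesis_a, very different z
    "thesis_c": (1.5, 2.5, 0.0),
}


def flatten(point):
    """Crude stand-in for a projection: keep x and y, discard the rest."""
    return point[:2]


# Distances in the original space.
true_ab = math.dist(points["thesis_a"], points["thesis_b"])
true_ac = math.dist(points["thesis_a"], points["thesis_c"])

# Distances after flattening to 2 dimensions.
flat_ab = math.dist(flatten(points["thesis_a"]), flatten(points["thesis_b"]))
flat_ac = math.dist(flatten(points["thesis_a"]), flatten(points["thesis_c"]))

print(f"a to b: {true_ab:.2f} in 3-d, {flat_ab:.2f} in 2-d")
print(f"a to c: {true_ac:.2f} in 3-d, {flat_ac:.2f} in 2-d")
```

In the flattened view thesis_b looks like thesis_a's nearest neighbor, even though thesis_c is much closer in the original space. A real projection is far smarter than dropping a coordinate, but the same caution applies to reading individual dots.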
That little yellow thumb by Region 4 is mathematics, now a tiny appendage off of the giant discipline it spawned — our old friend buttery yellow, aka electrical engineering & computer science. If you zoom in enough you find EECS absolutely everywhere, applied to all manner of disciplines (as above with biology), but the bulk of it — including the quintessential parts, like compilers — is right here.
Dramatically red Region 5, clustered together tightly and at the far end, is architecture. This is a renowned department (it graduated I.M. Pei!), but definitely a different sort of creature than most of MIT, so it makes sense that it’s at one extreme of the map. That said, the other two programs in its school — Urban Studies & Planning and Media Arts & Sciences — are just to its north.
Region 6 — tiny, yellow, and pale; you may have missed it at first glance — is linguistics island, housing theses such as Topics in the stress and syntax of words. You see how there are also a handful of red dots on this island? They are Brain & Cognitive Science theses — and in particular, ones that are secretly linguistics, like Intonational phrasing in language production and comprehension. Similarly — although at MIT it is not the department of linguistics, but the department of linguistics & philosophy — the philosophy papers are elsewhere. (A few of the very most abstract ones are hanging out near math.)
And what about Region 7, the stingray swimming vigorously away from everything else? I spent a long time looking at this and not seeing a pattern. You can tell there are a lot of colors (departments) there, randomly assorted; even looking at individual titles I couldn’t see anything. Only when I looked at the original documents did I realize that this is the island of terrible OCR. Almost everything here is an older thesis, with low-quality printing or even typewriting, often in a regrettable font, maybe with the reverse side of the page showing through. (A randomly chosen example; pdf download.)
A good reminder of the importance of high-quality digitization labor. A heartbreaking example of the things we throw away when we make paper the archival format for born-digital items. And also a technical inspiration — look how much vector space we’ve had to carve out to make room for these! the poor neural net, trying desperately to find signal in the noise, needing all this space to do it. I’m tempted to throw out the entire leftmost quarter of this graph, rerun the 2d projection, and see what I get — would we be better able to see the structures in the high-quality data if they had room to breathe? And were I to rerun the entire neural net training process again, I’d want to include some sort of threshold score for OCR quality. It would be a shame to throw things away — especially since they will be a nonrandom sample, mostly older theses — but I have already had to throw away things I could not OCR at all in an earlier pass, and, again, I suspect the neural net would do a better job organizing the high-quality documents if it could use the whole vector space to spread them out, rather than needing some of it to encode the information “this is terrible OCR and must be kept away from its fellows”.
Clearly I need to share the technical details of how I did this, but this post is already too long, so maybe next week. tl;dr I reached out to Matt Miller after reading his cool post on vectorizing the DPLA and he tipped me off to UMAP and here we are — thanks, Matt!
And just as clearly you want to play with this too, right? Well, it’s super not ready to be integrated into HAMLET due to any number of usability issues but if you promise to forgive me those — have fun. You see how when you hover over a dot you get a label with the format 1721.1-X.txt? It corresponds to a URL of the format https://hamlet.andromedayelton.com/similar_to/X. Go play :).
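If you'd rather not do that substitution by hand, here is a small sketch of the label-to-URL mapping described above. The function name and the example ID are made up for illustration; only the label format and the URL format come from this post.

```python
import re

BASE_URL = "https://hamlet.andromedayelton.com/similar_to/"

# Hover labels look like 1721.1-X.txt, where X identifies the thesis.
LABEL_PATTERN = re.compile(r"^1721\.1-(?P<thesis_id>.+)\.txt$")


def label_to_url(label):
    """Turn a hover label into the matching similar_to URL."""
    match = LABEL_PATTERN.match(label)
    if match is None:
        raise ValueError(f"Unexpected label format: {label!r}")
    return BASE_URL + match.group("thesis_id")


# 12345 is a made-up ID, not a real thesis.
print(label_to_url("1721.1-12345.txt"))
```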
Skipped FridAI blogging last week because of Thanksgiving, but let’s get back on it! Top-of-mind today are the firing of AI queen Timnit Gebru (letter of support here) and a couple of grant applications that I’m actually eligible for (this is rare for me! I typically need things for which I can apply in my individual capacity, so it’s always heartening when they exist — wish me luck).
But for blogging today, I’m gonna talk about neural style transfer, because it’s cool as hell. I started my ML-learning journey on Coursera’s intro ML class and have been continuing with their deeplearning.ai sequence; I’m on course 4 of 5 there, so I’ve just gotten to neural style transfer. This is the thing where a neural net outputs the content of one picture in the style of another:
OK, so! Let me explain while it’s still fresh.
If you have a neural net trained on images, it turns out that each layer is responsible for recognizing different, and progressively more complicated, things. The specifics vary by neural net and data set, but you might find that the first layer gets excited about straight lines and colors; the second about curves and simple textures (like stripes) that can be readily composed from straight lines; the third about complex textures and simple objects (e.g. wheels, which are honestly just fancy circles); and so on, until the final layers recognize complex whole objects. You can interrogate this by feeding different images into the neural net and seeing which ones trigger the highest activation in different neurons. Below, each 3×3 grid represents the most exciting images for a particular neuron. You can see that in this network, there are Layer 1 neurons excited about colors (green, orange), and about lines of particular angles that form boundaries between dark and colored space. In Layer 2, these get built together like tiny image legos; now we have neurons excited about simple textures such as vertical stripes, concentric circles, and right angles.
So how do we get from here to neural style transfer? We need to extract information about the content of one image, and the style of another, in order to make a third image that approximates both of them. As you already expect if you have done a little machine learning, that means that we need to write cost functions that mean “how close is this image to the desired content?” and “how close is this image to the desired style?” And then there’s a wrinkle that I haven’t fully understood, which is that we don’t actually evaluate these cost functions (necessarily) against the outputs of the neural net; we actually compare the activations of the neurons, as they react to different images — and not necessarily from the final layer! In fact, choice of layer is a hyperparameter we can vary (I super look forward to playing with this on the Coursera assignment and thereby getting some intuition).
So how do we write those cost functions? The content one is straightforward: if two images have the same content, they should yield the same activations. The greater the differences, the greater the cost (specifically via a squared error function that, again, you may have guessed if you’ve done some machine learning).
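Here is a minimal sketch of that squared-error idea, using plain Python lists as stand-ins for one layer's activations. The numbers are invented, and as far as I know real implementations also divide by a constant that depends on the layer's size, which I've left out.

```python
def content_cost(content_activations, generated_activations):
    """Sum of squared differences between two equally sized activation lists."""
    if len(content_activations) != len(generated_activations):
        raise ValueError("activations must have the same length")
    return sum(
        (c - g) ** 2
        for c, g in zip(content_activations, generated_activations)
    )


# Made-up activations from a single layer, flattened into a list.
content = [0.0, 1.0, 2.0, 3.0]
close_guess = [0.0, 1.0, 2.0, 2.5]
far_guess = [3.0, 0.0, 0.0, 0.0]

print(content_cost(content, content))      # identical activations cost nothing
print(content_cost(content, close_guess))  # small difference, small cost
print(content_cost(content, far_guess))    # big difference, big cost
```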
The style one is beautifully sneaky; it’s a measure of the difference in correlation between activations across channels. What does that mean in English? Well, let’s look at the van Gogh painting, above. If an edge detector is firing (a boundary between colors), then a swirliness detector is probably also firing, because all the lines are curves — that’s characteristic of van Gogh’s style in this painting. On the other hand, if a yellowness detector is firing, a blueness detector may or may not be (sometimes we have tight parallel yellow and blue lines, but sometimes yellow is in the middle of a large yellow region). Style transfer posits that artistic style lies in the correlations between different features. See? Sneaky. And elegant.
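To see what "correlation between activations across channels" buys you, here is a toy sketch. Each channel is a flat list of made-up activations, one per image position. The table of channel-by-channel products is usually called a Gram matrix; the function names, the channel names, and the missing normalizing constant are my simplifications, not the course's code.

```python
def gram_matrix(channels):
    """For every pair of channels, sum the products of their activations."""
    return [
        [sum(a * b for a, b in zip(ch_i, ch_j)) for ch_j in channels]
        for ch_i in channels
    ]


def style_cost(style_channels, generated_channels):
    """Squared difference between the two images' Gram matrices."""
    gram_s = gram_matrix(style_channels)
    gram_g = gram_matrix(generated_channels)
    return sum(
        (s - g) ** 2
        for row_s, row_g in zip(gram_s, gram_g)
        for s, g in zip(row_s, row_g)
    )


# Two hypothetical channels: [edge detector, swirliness detector].
# In the style image they always fire together.
style = [[1, 0, 1, 0], [1, 0, 1, 0]]
# Same habit of firing together, but at different positions in the image.
swirly_elsewhere = [[0, 1, 0, 1], [0, 1, 0, 1]]
# Edges and swirls never coincide.
edges_without_swirls = [[1, 0, 1, 0], [0, 1, 0, 1]]

print(style_cost(style, swirly_elsewhere))
print(style_cost(style, edges_without_swirls))
```

The first comparison costs nothing, even though the activations sit in completely different places, because the detectors still fire together. That is what lets style travel independently of content.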
Finally, for the style-transferred output, you need to generate an image that does as well as possible on both cost functions simultaneously — getting as close to the content as it can without unduly sacrificing the style, and vice versa.
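A sketch of that balancing act, with the caveat that the weights and the candidate scores are invented for illustration. The real process adjusts the pixels of the output image step by step rather than choosing from a menu, but the trade-off has the same shape.

```python
def total_cost(content_cost, style_cost, content_weight=1.0, style_weight=1.0):
    """Weighted sum of the two costs; the weights are tunable assumptions."""
    return content_weight * content_cost + style_weight * style_cost


# Hypothetical candidate images with made-up costs.
candidates = {
    "copy of the content photo": {"content": 0.0, "style": 8.0},
    "copy of the style painting": {"content": 23.0, "style": 0.0},
    "compromise": {"content": 2.0, "style": 1.0},
}


def best_candidate(content_weight=1.0, style_weight=1.0):
    """Name of the candidate with the lowest combined cost."""
    return min(
        candidates,
        key=lambda name: total_cost(
            candidates[name]["content"],
            candidates[name]["style"],
            content_weight,
            style_weight,
        ),
    )


print(best_candidate())                  # both costs matter
print(best_candidate(style_weight=0.0))  # ignore style entirely
```

With equal weights the compromise wins; turn the style weight down to zero and the best answer is just the original photo.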
As a side note, I think I now understand why DeepDream is fixated on a really rather alarming number of eyes. Since the layer choice is a hyperparameter, I hypothesize that choosing too deep a layer — one that’s started to find complex features rather than mere textures and shapes — will communicate to the system, yes, what I truly want is for you to paint this image as if those complex features are matters of genuine stylistic significance. And, of course, eyes are simple enough shapes to be recognized relatively early (not very different from concentric circles), yet ubiquitous in image data sets. So…this is what you wanted, right? the eager robot helpfully offers.
I’m going to have fun figuring out what the right layer hyperparameter is for the Coursera assignment, but I’m going to have so much more fun figuring out the wrong ones.
In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).
That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!
There are only two challenges with this:
By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
(But aside from that, Mr. Lincoln, how was the play?)
I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️
And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:
It’s hard to copy-paste d3 examples on the internet. d3’s been around for long enough that there’s substantial content about different versions, so you have to double-check. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.)
Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.
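Roughly what that encapsulation looks like, as a sketch rather than my actual code. The two reducers here are deliberately dumb stand-ins (the real candidates are t-SNE, UMAP, and PCA), and the function names and the shape of the returned data are assumptions for illustration.

```python
import random


def first_two_coordinates(vectors):
    """Placeholder reducer: keep the first two dimensions of each vector."""
    return [(v[0], v[1]) for v in vectors]


def random_projection(vectors, seed=0):
    """Another stand-in: project onto two fixed random directions."""
    rng = random.Random(seed)  # fixed seed, so the same data gives the same picture
    dims = len(vectors[0])
    axes = [[rng.gauss(0, 1) for _ in range(dims)] for _ in range(2)]
    return [
        tuple(sum(a * x for a, x in zip(axis, v)) for axis in axes)
        for v in vectors
    ]


def points_for_plot(labels, vectors, reducer=first_two_coordinates):
    """Return the same structure for the front end, whatever reducer is used."""
    coords = reducer(vectors)
    return [
        {"label": label, "x": x, "y": y}
        for label, (x, y) in zip(labels, coords)
    ]


# Made-up labels and 4-d vectors standing in for document embeddings.
labels = ["doc_a", "doc_b", "doc_c"]
vectors = [[0.1, 0.2, 0.3, 0.4], [0.2, 0.1, 0.0, 0.5], [0.9, 0.8, 0.7, 0.6]]

print(points_for_plot(labels, vectors))
print(points_for_plot(labels, vectors, reducer=random_projection))
```

The plotting code only ever sees a list of label, x, y records, so swapping the reducer is a one-argument change.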
So that’s my next goal — try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.
The San José State University School of Information wanted to have a half-course on artificial intelligence in their portfolio, and asked me to develop and teach it. (Thanks!) So I got a blank canvas on which to paint eight weeks of…whatever you might want graduate students in library & information science to know about AI.
This is of course the problem of all teachers — too much material, too little time — and in an iSchool it’s further complicated because, while many students have technological interests and expertise, few have programming skills and even fewer have mathematical backgrounds, so this course can’t be “intro to programming neural nets”. I can gesture in the direction of linear algebra and high-dimensional spaces, but I have to translate it all into human English first.
But further, even if I were to do that, it wouldn’t be the right course! As future librarians, very few of my students will be programming neural nets. They are much more likely to be helping students find sources for papers, helping researchers find or manage data sets, supporting professors who are developing classes, helping patrons make sense of issues in the news, or evaluating vendor pitches about AI products. Which means I don’t need people who can write neural net code; I need people who understand the basics of how machine learning operates, who can do some critical analysis and situate it in its social context. People who know some things about what data is good for, how it’s hard, where to find it. People who know at least the general direction in which they might find news articles and papers and conferences that their patrons will care about. People who won’t be too dazzled by product hype and can ask pointed questions about how products really work, and whether they respect library values. And, while we’re at it, people who have some sense of what AI can do, not just theoretically, but concretely in real-world library settings.
Eight weeks: go!
What I ended up doing was four 2-week modules, with a rough alternation of theory and library case studies, and a pretty wild mix of readings: conference presentations, scholarly papers from a variety of disciplines, hilarious computational misadventures, news articles, data visualizations. I mostly kept a lid on the really technical stuff in the required readings, but tossed a lot of it into optional readings, so that students with that background or interest could pull on those threads. (And heavily annotated the optional readings, to give people a sense of what might interest them; I’d like to say this is why surprisingly many of my students did some optional reading, but actually they’re just awesome.) For case studies, we looked at the Northern Illinois University dime novels collection experiments; metadata enrichment in the Charles Teenie Harris archive; my own work with HAMLET; and the University of Rhode Island AI lab. This let us hit a gratifyingly wide variety of machine learning techniques, use cases (metadata, discovery, public services), and settings (libraries, archives).
Do I have a couple of pages of things to change up next time I teach the class (this fall)? Of course I do. But I think it went well for a first-time class (particularly for a first-time class in the middle of a global catastrophe…)
Big ups to the following:
Matthew Short of NIU and Bohyun Kim of URI, for guest speaking;
Everyone at SJSU who worked on their “how to teach online” materials, especially Debbie Faires — their onboarding did a good job of conveying SJSU-specific expectations and building a toolkit for teaching specifically online in a way that was useful to me as someone with a lot of offline teaching experience;
Zeynep Tufekci, Momin Malik, and Catherine D’Ignazio, who suggested readings that I ended up assigning;
and my students, who are about to get a paragraph.
My students. Look. You signed up to take a class online — it’s an all-online program — but none of you signed up to do it while being furloughed, while homeschooling, while being sick with a scary new virus. And you knocked it out of the park. Week after week, asking for the smallest of extensions to hold it all together, breaking my heart in private messages, while publicly writing thoughtful, well-researched, footnoted discussion posts. While not only doing even the optional readings, but finding astonishment and joy in them. While piecing together the big ideas about data and bias and fairness and the genuine alienness of machine intelligence. I know for certain, not as an article of faith but as a statement of fact, that I will keep seeing your names out there, that your careers will go places, and I hope I am lucky enough to meet you in person someday.
Let’s say you’re having problems parsing a csv file, represented as an InMemoryUploadedFile, that you’ve just uploaded through a Django form. There are a bunch of answers on stackoverflow! They all totally work with Python 2! …and lead to hours of frustration if, say, hypothetically, like me, you’re using Python 3.
If you are getting errors like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) — and then getting different errors about DictReader not getting an expected iterator after you use .decode('utf-8') to coerce your file to str — this is the post for you.
It turns out all you need to do (e.g. in your form_valid) is:
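A minimal sketch of those lines, pieced together from the explanation that follows. The function name and the variable csv_file are placeholders for whatever your form gives you, and an io.BytesIO stands in for the InMemoryUploadedFile so the snippet runs without Django.

```python
import csv
import io


def rows_from_upload(csv_file, encoding="utf-8"):
    """Parse a binary file-like upload into a list of row dicts."""
    csv_file.seek(0)  # rewind, in case something already read the file
    text = csv_file.read().decode(encoding)  # bytes in, str out
    return list(csv.DictReader(io.StringIO(text)))


# Stand-in for the uploaded file; the contents are made up.
fake_upload = io.BytesIO(b"title,year\nHamlet,1603\n")
fake_upload.read()  # simulate validation leaving the pointer at the end

rows = rows_from_upload(fake_upload)
print(rows)
```

The two lines that matter are the seek and the DictReader call; the rest is scaffolding so there is something to run.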
The seek statement ensures the pointer is at the beginning of the file. This may or may not be required in your case. In my case, I’d already read the file in my forms.py in order to validate it, so my file pointer was at the end. You’ll be able to tell that you need to seek() if your csv.DictReader() doesn’t throw any errors, but when you try to loop over the lines of the file you don’t even enter the for loop (e.g. print() statements you put in it never print) — there’s nothing left to loop over if you’re at the end of the file.
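Here is that silent failure in miniature, so you can recognize it. Again an io.BytesIO with made-up contents stands in for the uploaded file, and read_rows is a hypothetical helper, not anything from Django.

```python
import csv
import io


def read_rows(binary_file):
    """Decode and parse from wherever the file pointer currently is."""
    text = binary_file.read().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


upload = io.BytesIO(b"title,year\nHamlet,1603\n")
upload.read()  # e.g. validation in forms.py has already read everything

rows_without_seek = read_rows(upload)  # no error, but nothing to loop over
upload.seek(0)
rows_with_seek = read_rows(upload)

print(len(rows_without_seek), len(rows_with_seek))
```

No exception in either case; the only clue in the first one is that you get no rows.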
read() gives you the file contents as a bytes object, on which you can call decode().
decode('utf-8') turns your bytes into a string, with known encoding. (Make sure that you know how your CSV is encoded to start with, though! That’s why I was doing validation on it myself. Unicode, Dammit is going to be my friend here. Even if I didn’t want an excuse to use it because of its title alone. Which I do.)
io.StringIO() gives you the iterator that DictReader needs, while ensuring that your content remains stringy.
tl;dr I wrote two lines of code (but eight lines of comments) for a problem that took me hours to solve. Hopefully now you can copy these lines, and spend only a few minutes solving this problem!
(American Libraries has helpfully provided an unedited transcript of the ALA Council town hall meeting this past Midwinter, which lets me turn my remarks there into a blog post here. You can also watch the video; I start around 24:45. I encourage you to read or watch the whole thing, though; it’s interesting throughout with a variety of viewpoints represented. I am also extremely gratified by this press release, issued after the Town Hall, which speaks to these issues.)
As I was looking at the statements that came out at ALA after the election, I found that they had a lot to say about funding, and that’s important because that’s how we pay our people and collect materials and keep the lights on.
But my concern was that they seemed to talk only about funding, and I found myself wondering — if they come for copyright, will we say that’s okay as long as we’ve been bought off? If they come for net neutrality, will we say that’s okay, as long as we’ve been bought off? When they come for the NEH and the NEA, the artists who make the content that we collect and preserve, are we going to say that’s okay, as long as we get bought off? When they come for free speech — and five bills were introduced in five states just, I think, on Friday, to criminalize protest — will we say that’s okay, as long as we’ve been bought off?
I look at how people I know react and the past actions of the current administration. The fact that every trans person I know was in a panic to get their documents in order before last Friday because they don’t think they will be able to in the next four years. The fact that we have a President who will mock disabled people just because they are disabled and disagreeing with him. The fact that we have a literal white supremacist in the White House who co-wrote the inauguration speech. The fact that one of the architects of Gamergate, which has been harassing women in technology for years, is now a White House staffer. The fact that we have many high-level people in the administration who support conversion therapy, which drives gay and lesbian teenagers to suicide at unbelievable rates. Trans people and people of color and disabled people and women and gays and lesbians are us, they are our staff, they are our patrons.
Funding matters, but so do our values, and so do our people. Funding is important, but so is our soul. And when I look at our messaging, I wonder, do we have a soul? Can it be bought? Or are there lines we do not cross?
That’s what public libraries do, right? Provide service to everyone, respectfully and professionally — and without conditioning that respect on checking your papers. If you walk through those doors, you’re welcome here.
When you’re standing in the international arrivals area at Logan, you’re in a waiting area between a pair of large double doors, exiting from Customs, and then the doors to the outside world. We stood in a crowd of hundreds, chanting “Let Them In!” Sometimes, some mysterious number of minutes after a flight arrival, the doors would open, and tired people and their luggage pour through, from Zurich, Port-au-Prince, Heathrow, anywhere.
And the Code of Ethics ran through my head because that’s what we were chanting, wasn’t it? That anyone who walks through those doors is welcome here. Let them in.
Library values are American values. And if you have a stake in America, don’t let anyone build an America that’s less than what we as a profession stand for.
– EXT. HANGAR – REMOTE PLANET – DAY: Leia gestures to a document in her other hand. “There’s still 24 fighters that haven’t had their C checks, and we need them ready to scramble by 0500 Sunday. I can count on you to make that deadline, right?” The chief mechanic nods smartly.
– INT. STARSHIP – OPS DECK: Leia puts a hand on a young pilot’s shoulder; the pilot looks up nervously. “First shift on the big ship, Lieutenant Bey? It’s great to see you here. I knew you’d qualify.” Bey smiles and looks back confidently at her console.
– INT. BARRACKS – LEIA’S ROOM – MIDNIGHT: Leia taps a hand terminal. There are 94 new messages. Subject lines scroll past — “Quartermaster’s January report”; “Re: overdue Corellian inventory”; “Schedule for meeting with new EVA suit supplier”. She sighs, drinks some tea, and taps the first message.
It’s not light sabers, is it? It’s grueling and dull, decades of small things. It films poorly. And it’s why the rebellion exists at all.
Luke is the cinematic hero because he has magic powers that you either have or you don’t (and we don’t). Leia in another timeline might have had them too but hers instead is the heroism anyone can choose — responsibility, tenaciousness, care — anyone can, but often we don’t, and somehow without a flashy magical montage it seems less heroic.
How much better the world would be if we were all Leia, though.
Or maybe we can’t. As the whole internet has pointed out lately, Leia’s the woman who consoles Luke for losing his mentor after her whole world has burned. In the original series I think, in fact, she shows the most distress when Luke on Endor has revealed to her the truth about their parentage, when Han walks into that and wonders why she’s treating him that way; her feelings matter to cinematography when they illustrate someone else’s story. Luke and Obi-Wan can abandon the galaxy for hidden places when one student going wrong provokes feelings too strong to bear; whatever feelings Leia has about Alderaan and everything else are not enough to stop her from decades, decades, decades of unglamorous work.
Maybe we can’t all choose that; maybe Leia gets to be the powerhouse she is because her inner life, her reactions to the world around her, do not matter to the narrative, can be treated as if they don’t have effects. We see the profundity of Luke’s and Obi-Wan’s losses in their withdrawal from the world; Leia’s are both greater still, and not painful enough to keep her from processing 94 new emails, every day, for the rest of her life. She gets to be an astonishing hero, in so many ways too-little-celebrated by the narrative, because maybe she isn’t a person, doesn’t react the way people do, doesn’t get to claim the meaning of her own inner life as relevant in its own right.
I’d urge you all to choose to be Leias if I thought it fair. I am not sure it is plausible, in this galaxy right now, where we all have inner lives and centrality to our own stories. And yet here we are, with far more emails than lightsabers.
Perhaps I’ll ask instead — look for the Leias. The people all around who may not have montages, but who strengthen people, who make the supply lines work, who follow up. They are indeed magic.
The question I keep coming back to is: where are our lines?
ALA’s communications have focused on the importance of securing funding for libraries over the next four years. And this is important; for both practical and philosophical reasons, libraries have to pay their people and keep the lights on. I hope ALA’s Washington Office lobbies hard for library funding. And yet…
If law enforcement shows up and says, we want all your circulation records, to go on a fishing expedition for who’s reading the “wrong” books, do we say, sod off; come back with a warrant, or not at all?
If Homeland Security shows up and says, we’d like your organization-of-information expertise updating the Muslim registry for the present day, do we say, never again?
If the horse-traders show up and say, nice IMLS funding you’ve got there, shame if something happened to it, have you considered dropping your support for strong encryption, do we say, the ALA Code of Ethics binds us to protect patron privacy and that is a line we cannot cross?
Do we? It seems almost unthinkable that we would not, and yet, that is what’s missing in ALA’s recent communication: the notion that there are lines, that these lines matter for our patrons and our consciences.