Let’s visualize some HAMLET data! Or, d3 and t-SNE for the lols.

In 2017, I trained a neural net on ~44K graduate theses using the Doc2Vec algorithm, in hopes that doing so would provide a backend that could support novel and delightful discovery mechanisms for unique library content. The result, HAMLET, worked better than I hoped; it not only pulls together related works from different departments (thus enabling discovery that can’t be supported with existing metadata), but it does a spirited job on documents whose topics are poorly represented in my initial data set (e.g. when given a fiction sample it finds theses from programs like media studies, even though there are few humanities theses in the data set).

That said, there are a bunch of exploratory tools I’ve had in my head ever since 2017 that I’ve not gotten around to implementing. But here, in the spirit of tossing out things that don’t bring me joy (like 2020) and keeping those that do, I’m gonna make some data viz!

There are only two challenges with this:

  1. By default Doc2Vec embeds content in a 100-dimensional space, which is kind of hard to visualize. I need to project that down to 2 or 3 dimensions. I don’t actually know anything about dimensionality reduction techniques, other than that they exist.
  2. I also don’t know JavaScript much beyond a copy-paste level. I definitely don’t know d3, or indeed the pros and cons of various visualization libraries. Also art. Or, like, all that stuff in Tufte’s book, which I bounced off of.

(But aside from that, Mr. Lincoln, how was the play?)

I decided I should start with the pages that display the theses most similar to a given thesis (shout-out to Jeremy Brown, startup founder par excellence) rather than with my ideas for visualizing the whole collection, because I’ll only need to plot ten or so points instead of 44K. This will make it easier for me to tell visually if I’m on the right track and should let me skip dealing with performance issues for now. On the down side, it means I may need to throw out any code I write at this stage when I’m working on the next one. 🤷‍♀️

And I now have a visualization on localhost! Which you can’t see because I don’t trust it yet. But here are the problems I’ve solved thus far:

  1. It’s hard to copy-paste d3 examples on the internet. d3’s been around for long enough there’s substantial content about different versions, so you have to double-check. But also most of the examples are live code notebooks on Observable, which is a wicked cool service but not the same environment as a web page! If you just copy-paste from there you will have things that don’t work due to invisible environment differences and then you will be sad. 😢 I got tipped off to this by Mollie Marie Pettit’s great Your First d3 Scatterplot notebook, which both names the phenomenon and provides two versions of the code (the live-editable version and the one you can actually copy/paste into your editor).
  2. If you start googling for dimensionality reduction techniques you will mostly find people saying “use t-SNE”, but t-SNE is a lying liar who lies. Mind you, it’s what I’m using right now because it’s so well-documented it was the easiest thing to set up. (This is why I said above that I don’t trust my viz.) But it produces different results for the same data on different pageloads (obviously different, so no one looking at the page will trust it either), and it’s not doing a good job preserving the distances I care about. (I accept that anything projecting from 100d down to 2d will need to distort distances, but I want to adequately preserve meaning — I want the visualization to not just look pretty but to give people an intellectually honest insight into the data — and I’m not there yet.)

Conveniently this is not my first time at the software engineering rodeo, so I encapsulated my dimensionality reduction strategy inside a function, and I can swap it out for whatever I like without needing to rewrite the d3 as long as I return the same data structure.
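Here is a minimal sketch of that "swap the strategy, keep the data structure" idea. It is in Python rather than the JavaScript my page presumably needs, and every name in it (`points_for_plot`, `seeded_random_projection`, `first_two_dims`) is hypothetical. Neither strategy is t-SNE, UMAP, or PCA; they are deliberately trivial stand-ins that show the contract. The fixed seed is my assumption about how to get the same picture on every load.

```python
import random
from typing import Callable, Dict, List

Vector = List[float]
Point = Dict[str, float]  # always {"x": ..., "y": ...}
Reducer = Callable[[List[Vector]], List[Point]]


def first_two_dims(vectors: List[Vector]) -> List[Point]:
    """Trivial stand-in: keep only the first two coordinates."""
    return [{"x": v[0], "y": v[1]} for v in vectors]


def seeded_random_projection(vectors: List[Vector], seed: int = 0) -> List[Point]:
    """Project onto two random axes; the fixed seed makes repeat calls identical."""
    rng = random.Random(seed)
    dims = len(vectors[0])
    axis_x = [rng.gauss(0, 1) for _ in range(dims)]
    axis_y = [rng.gauss(0, 1) for _ in range(dims)]
    return [
        {
            "x": sum(a * b for a, b in zip(v, axis_x)),
            "y": sum(a * b for a, b in zip(v, axis_y)),
        }
        for v in vectors
    ]


def points_for_plot(
    vectors: List[Vector], reduce: Reducer = seeded_random_projection
) -> List[Point]:
    """The only function the plotting code calls; swap `reduce` freely."""
    return reduce(vectors)
```

As long as every strategy returns the same list of `{"x", "y"}` points, the plotting side never has to know which one ran.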

So that’s my next goal — try out UMAP (hat tip to Matt Miller for suggesting that to me), try out PCA, fiddle some parameters, try feeding it just the data I want to visualize vs larger neighborhoods, see if I’m happier with what I get. UMAP in particular alleges itself to be fast with large data sets, so if I can get it working here I should be able to leverage that knowledge for my ideas for visualizing the whole thing.

Onward, upward, et cetera. 🎉

AI in the Library, round one

The San José State University School of Information wanted to have a half-course on artificial intelligence in their portfolio, and asked me to develop and teach it. (Thanks!) So I got a blank canvas on which to paint eight weeks of…whatever you might want graduate students in library & information science to know about AI.

For those of you who just want the reading list, here you go. For those of you who thought about the second-to-last sentence: ahahaha.

[Image: the “This is fine” dog meme]

This is of course the problem of all teachers — too much material, too little time — and in an iSchool it’s further complicated because, while many students have technological interests and expertise, few have programming skills and even fewer have mathematical backgrounds, so this course can’t be “intro to programming neural nets”. I can gesture in the direction of linear algebra and high-dimensional spaces, but I have to translate it all into human English first.

But further, even if I were to do that, it wouldn’t be the right course! As future librarians, very few of my students will be programming neural nets. They are much more likely to be helping students find sources for papers, or helping researchers find or manage data sets, or supporting professors who are developing classes, helping patrons make sense of issues in the news, and evaluating vendor pitches about AI products. Which means I don’t need people who can write neural net code; I need people who understand the basics of how machine learning operates, who can do some critical analysis, situate it in its social context. People who know some things about what data is good for, how it’s hard, where to find it. People who know at least the general direction in which they might find news articles and papers and conferences that their patrons will care about. People who won’t be too dazzled by product hype and can ask pointed questions about how products really work, and whether they respect library values. And, while we’re at it, people who have some sense of what AI can do, not just theoretically, but concretely in real-world library settings.

Eight weeks: go!

What I ended up doing was four two-week modules, with a rough alternation of theory and library case studies, and a pretty wild mix of readings: conference presentations, scholarly papers from a variety of disciplines, hilarious computational misadventures, news articles, data visualizations. I mostly kept a lid on the really technical stuff in the required readings, but tossed a lot of it into optional readings, so that students with that background or interest could pull on those threads. (And heavily annotated the optional readings, to give people a sense of what might interest them; I’d like to say this is why surprisingly many of my students did some optional reading, but actually they’re just awesome.) For case studies, we looked at the Northern Illinois University dime novels collection experiments; metadata enrichment in the Charles Teenie Harris archive; my own work with HAMLET; and the University of Rhode Island AI lab. This let us hit a gratifyingly wide variety of machine learning techniques, use cases (metadata, discovery, public services), and settings (libraries, archives).

Do I have a couple of pages of things to change up next time I teach the class (this fall)? Of course I do. But I think it went well for a first-time class (particularly for a first-time class in the middle of a global catastrophe…)

Big ups to the following:

  • Matthew Short of NIU and Bohyun Kim of URI, for guest speaking;
  • Everyone at SJSU who worked on their “how to teach online” materials, especially Debbie Faires — their onboarding did a good job of conveying SJSU-specific expectations and building a toolkit for teaching specifically online in a way that was useful to me as someone with a lot of offline teaching experience;
  • Zeynep Tufekci, Momin Malik, Catherine D’Ignazio, who suggested readings that I ended up assigning;
  • and my students, who are about to get a paragraph.

My students. Look. You signed up to take a class online — it’s an all-online program — but none of you signed up to do it while being furloughed, while homeschooling, while being sick with a scary new virus. And you knocked it out of the park. Week after week, asking for the smallest of extensions to hold it all together, breaking my heart in private messages, while publicly writing thoughtful, well-researched, footnoted discussion posts. While not only doing even the optional readings, but finding astonishment and joy in them. While piecing together the big ideas about data and bias and fairness and the genuine alienness of machine intelligence. I know for certain, not as an article of faith but as a statement of fact, that I will keep seeing your names out there, that your careers will go places, and I hope I am lucky enough to meet you in person someday.

adventures with parsing Django uploaded csv files in python3

Let’s say you’re having problems parsing a csv file, represented as an InMemoryUploadedFile, that you’ve just uploaded through a Django form. There are a bunch of answers on stackoverflow! They all totally work with Python 2! …and lead to hours of frustration if, say, hypothetically, like me, you’re using Python 3.

If you are getting errors like _csv.Error: iterator should return strings, not bytes (did you open the file in text mode?) — and then getting different errors about DictReader not getting an expected iterator after you use .decode('utf-8') to coerce your file to str — this is the post for you.

It turns out all you need to do (e.g. in your form_valid) is:


import csv
import io
csv_file.seek(0)  # rewind in case the file was already read during validation
reader = csv.DictReader(io.StringIO(csv_file.read().decode('utf-8')))

What’s going on here?

The seek statement ensures the pointer is at the beginning of the file. This may or may not be required in your case. In my case, I’d already read the file in my forms.py in order to validate it, so my file pointer was at the end. You’ll be able to tell that you need to seek() if your csv.DictReader() doesn’t throw any errors, but when you try to loop over the lines of the file you don’t even enter the for loop (e.g. print() statements you put in it never print) — there’s nothing left to loop over if you’re at the end of the file.

read() gives you the file contents as a bytes object, on which you can call decode().

decode('utf-8') turns your bytes into a string, with known encoding. (Make sure that you know how your CSV is encoded to start with, though! That’s why I was doing validation on it myself. Unicode, Dammit is going to be my friend here. Even if I didn’t want an excuse to use it because of its title alone. Which I do.)
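If you want the "do I actually know the encoding?" check to fail loudly, a strict decode is one option. This is only a sketch: `decode_upload` is a hypothetical helper, and using `utf-8-sig` instead of plain `utf-8` is my assumption, chosen because it decodes ordinary UTF-8 and also drops a leading byte-order mark if one is present.

```python
def decode_upload(raw: bytes) -> str:
    """Return text if `raw` is valid UTF-8; otherwise raise a readable error."""
    try:
        # 'utf-8-sig' is UTF-8 that also strips a leading byte-order mark.
        return raw.decode("utf-8-sig")
    except UnicodeDecodeError as err:
        raise ValueError(
            f"Upload is not valid UTF-8 (problem near byte {err.start})"
        ) from err
```

It does not guess at other encodings; that is the job I would hand to something like Unicode, Dammit.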

io.StringIO() gives you the iterator that DictReader needs, while ensuring that your content remains stringy.
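Putting the pieces together, here is a small self-contained sketch you can run outside Django. `rows_from_upload` is a hypothetical helper name, and `io.BytesIO` is standing in for the InMemoryUploadedFile, on the assumption that all we need from the upload is `seek()` and `read()`.

```python
import csv
import io


def rows_from_upload(csv_file) -> list:
    """Parse an uploaded CSV (a binary file-like object) into a list of dicts."""
    csv_file.seek(0)                         # rewind in case something already read it
    text = csv_file.read().decode("utf-8")   # bytes -> str
    return list(csv.DictReader(io.StringIO(text)))


# io.BytesIO stands in for Django's InMemoryUploadedFile here.
upload = io.BytesIO("name,city\nZoë,Boston\n".encode("utf-8"))
upload.read()  # simulate earlier validation leaving the pointer at the end

rows = rows_from_upload(upload)
print(rows)
```

If you comment out the `seek(0)` line, `rows` comes back empty, which is the "never enters the for loop" symptom described above.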

tl;dr I wrote two lines of code (but eight lines of comments) for a problem that took me hours to solve. Hopefully now you can copy these lines, and spend only a few minutes solving this problem!

my statement at the ALA Midwinter Town Hall

(American Libraries has helpfully provided an unedited transcript of the ALA Council town hall meeting this past Midwinter, which lets me turn my remarks there into a blog post here. You can also watch the video; I start around 24:45. I encourage you to read or watch the whole thing, though; it’s interesting throughout with a variety of viewpoints represented. I am also extremely gratified by this press release, issued after the Town Hall, which speaks to these issues.)

As I was looking at the statements that came out at ALA after the election, I found that they had a lot to say about funding, and that’s important because that’s how we pay our people and collect materials and keep the lights on.

But my concern was that they seemed to talk only about funding, and I found myself wondering — if they come for copyright, will we say that’s okay as long as we’ve been bought off? If they come for net neutrality, will we say that’s okay, as long as we’ve been bought off? When they come for the NEH and the NEA, the artists who make the content that we collect and preserve, are we going to say that’s okay, as long as we get bought off? When they come for free speech — and five bills were introduced in five states just, I think, on Friday, to criminalize protest — will we say that’s okay, as long as we’ve been bought off?

I look at how people I know react and the past actions of the current administration. The fact that every trans person I know was in a panic to get their documents in order before last Friday because they don’t think they will be able to in the next four years. The fact that we have a President who will mock disabled people just because they are disabled and disagreeing with him. The fact that we have a literal white supremacist in the White House who co-wrote the inauguration speech. The fact that one of the architects of Gamergate, which has been harassing women in technology for years, is now a White House staffer. The fact that we have many high-level people in the administration who support conversion therapy, which drives gay and lesbian teenagers to suicide at unbelievable rates. Trans people and people of color and disabled people and women and gays and lesbians are us, they are our staff, they are our patrons.

Funding matters, but so do our values, and so do our people. Funding is important, but so is our soul. And when I look at our messaging, I wonder, do we have a soul? Can it be bought? Or are there lines we do not cross?

Thank you.

the highest level of service

I. We provide the highest level of service to all library users… ALA Code of Ethics

That’s what public libraries do, right? Provide service to everyone, respectfully and professionally — and without conditioning that respect on checking your papers. If you walk through those doors, you’re welcome here.

When you’re standing in the international arrivals area at Logan, you’re in a waiting area between a pair of large double doors, exiting from Customs, and then the doors to the outside world. We stood in a crowd of hundreds, chanting “Let Them In!” Sometimes, some mysterious number of minutes after a flight arrival, the doors would open, and tired people and their luggage pour through, from Zurich, Port-au-Prince, Heathrow, anywhere.

And the Code of Ethics ran through my head because that’s what we were chanting, wasn’t it? That anyone who walks through those doors is welcome here. Let them in.

Library values are American values. And if you have a stake in America, don’t let anyone build an America that’s less than what we as a profession stand for.

Leia: a montage about heroism

MONTAGE

– EXT. HANGAR – REMOTE PLANET – DAY: Leia gestures to a document in her other hand. “There’s still 24 fighters that haven’t had their C checks, and we need them ready to scramble by 0500 Sunday. I can count on you to make that deadline, right?” The chief mechanic nods smartly.

– INT. STARSHIP – OPS DECK: Leia puts a hand on a young pilot’s shoulder; the pilot looks up nervously. “First shift on the big ship, Lieutenant Bey? It’s great to see you here. I knew you’d qualify.” Bey smiles and looks back confidently at her console.

– INT. BARRACKS – LEIA’S ROOM – MIDNIGHT: Leia taps a hand terminal. There are 94 new messages. Subject lines scroll past — “Quartermaster’s January report”; “Re: overdue Corellian inventory”; “Schedule for meeting with new EVA suit supplier”. She sighs, drinks some tea, and taps the first message.


It’s not lightsabers, is it? It’s grueling and dull, decades of small things. It films poorly. And it’s why the rebellion exists at all.

Luke is the cinematic hero because he has magic powers that you either have or you don’t (and we don’t). Leia in another timeline might have had them too but hers instead is the heroism anyone can choose — responsibility, tenaciousness, care — anyone can, but often we don’t, and somehow without a flashy magical montage it seems less heroic.

How much better the world would be if we were all Leia, though.


Or maybe we can’t. As the whole internet has pointed out lately, Leia’s the woman who consoles Luke for losing his mentor after her whole world has burned. In the original series, I think, she in fact shows the most distress on Endor, when Luke has revealed to her the truth about their parentage and Han walks in on that and wonders why she’s treating him that way; her feelings matter to cinematography when they illustrate someone else’s story. Luke and Obi-Wan can abandon the galaxy for hidden places when one student going wrong provokes feelings too strong to bear; whatever feelings Leia has about Alderaan and everything else are not enough to stop her from decades, decades, decades of unglamorous work.

Maybe we can’t all choose that; maybe Leia gets to be the powerhouse she is because her inner life, her reactions to the world around her, do not matter to the narrative, can be treated as if they don’t have effects. We see the profundity of Luke’s and Obi-Wan’s losses in their withdrawal from the world; Leia’s are both greater still, and not painful enough to keep her from processing 94 new emails, every day, for the rest of her life. She gets to be an astonishing hero, in so many ways too-little-celebrated by the narrative, because maybe she isn’t a person, doesn’t react the way people do, doesn’t get to claim the meaning of her own inner life as relevant in its own right.


I’d urge you all to choose to be Leias if I thought it fair. I am not sure it is plausible, in this galaxy right now, where we all have inner lives and centrality to our own stories. And yet here we are, with far more emails than lightsabers.

Perhaps I’ll ask instead — look for the Leias. The people all around who may not have montages, but who strengthen people, who make the supply lines work, who follow up. They are indeed magic.

Locating my ALA in 2016

I’ve been reading discussion on ALA Council, Twitter, and blogs following recent ALA press releases and statements from the Committee on Legislation and the Washington Office, wondering where to locate my ALA, and where to locate myself within it as a member leader.

The question I keep coming back to is: where are our lines?

ALA’s communications have focused on the importance of securing funding for libraries over the next four years. And this is important; for both practical and philosophical reasons, libraries have to pay their people and keep the lights on. I hope ALA’s Washington Office lobbies hard for library funding. And yet…

If law enforcement shows up and says, we want all your circulation records, to go on a fishing expedition for who’s reading the “wrong” books, do we say, sod off; come back with a warrant, or not at all?

If Homeland Security shows up and says, we’d like your organization-of-information expertise updating the Muslim registry for the present day, do we say, never again?

If the horse-traders show up and say, nice IMLS funding you’ve got there, shame if something happened to it, have you considered dropping your support for strong encryption, do we say, the ALA Code of Ethics binds us to protect patron privacy and that is a line we cannot cross?

Do we? It seems almost unthinkable that we would not, and yet, that is what’s missing in ALA’s recent communication: the notion that there are lines, that these lines matter for our patrons and our consciences.

Librarians are among the most trusted professions, but we didn’t get there by being conciliatory. Our historical heroes include the Connecticut Four, Judith Krug, Zoia Horn, all the way back to Hypatia of Alexandria. We are, at our best, people who draw lines.

What are our lines?

What are yours?

Write them down.

Hold them.

An open letter to Heather Bresch

Dear Heather Bresch,

You lived in Morgantown. I did, too: born and raised. My parents are retired from the university you attended. My elementary school took field trips to Mylan labs. They were shining, optimistic.

You’re from West Virginia. I am, too. This means we both know something of the coal industry that has both sustained and destroyed our home. You know, as I do, how many miners have been killed in explosions: trapped underground when a pocket of methane ignites. We both know that miners long carried safety lamps: carefully shielded but raw flames that would go out when the oxygen went too low, a warning to get away — if they had not first exploded, as open flames around methane do. Perhaps you know, as I only recently learned, that miners were once required to buy their own safety lamps: so when safer ones came out, ones that would only warn without killing you first, miners did not carry them. They couldn’t afford to. They set probability against their lives, went without the right equipment, and sometimes lost, and died.

I’m a mother. You are, too. I don’t know if your children carry medication for life-threatening illnesses; I hope you have not had to face that. I have. In our case it’s asthma, not allergies, and an inhaler, not an Epi-Pen. It’s a $20 copay with our insurance and lasts for dozens of doses. It doesn’t stop asthma attacks once they start — my daughter’s asthma is too severe for that — but sometimes it prevents them. And when it does not, it still helps: we spend two days in the hospital instead of five; we don’t go to the ICU. (Have you ever been with your child in a pediatric ICU? It is the most miraculous, and the worst, place on earth.)

Most families can find their way to twenty dollars. Many cannot find six hundred. They’ll go without, and set probability against their children’s lives. Rich children will live; poor children will sometimes lose, and die.

I ask you to reconsider.

Sincerely,

Andromeda Yelton

Be bold, be humble: Wikipedia, libraries, and who spoke

Today I’m at a Wikipedia + libraries mini-conference, as a member of both worlds but also, strangely, neither. I write software for the Wikimedia Foundation (specifically the Wikipedia Library, which is among the conveners). I’m a librarian by training, and the President-Elect of LITA. But I also don’t identify as a Wikipedian (my edit count is, last I checked, 5), I don’t work in a library, and I have never worked in an academic library (whereas the other convener is the Association of Research Libraries).

This is a great excuse to be an observer, and try out a tool that was going around Twitter a month ago: http://arementalkingtoomuch.com/.

It’s a set of paired timers: “a dude” and “not a dude”. You click the button that represents the speaker. At the end, you have a count of how much time each category held the floor. In our first session today, 52% of the speaking time was men.

Sounds equal! Except…42% of the room appeared to be men. And as I looked around, I realized that all but perhaps one of the 10 men had spoken at least once, whereas about 5 of the 14 women had said nothing at all in our morning session. (Myself included; I was too busy processing the meta-meeting, tracking all of this.)
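For the arithmetic-minded, here is the same comparison worked per person. It is a rough sketch: the headcounts are my by-eye estimates from above (10 men, 14 women), the 52% comes from the timers, and the variable names are just illustrative.

```python
men, women = 10, 14      # estimated headcounts for the session
men_time_share = 0.52    # share of speaking time recorded for men

men_room_share = men / (men + women)
per_man = men_time_share / men
per_woman = (1 - men_time_share) / women
ratio = per_man / per_woman

print(f"{men_room_share:.0%} of the room, {ratio:.2f}x the floor time per person")
```

On those numbers the average man held the floor about one and a half times as long as the average woman, which is what the innocent-looking 52% hides.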

“Be bold”, said the coffee mug in front of me. Who is bold?

Interrupters are bold; I tracked interruptions, too. About ⅔ of the interruptions were by men (though, somewhat to my surprise, most of those were interrupting other men). Of the other interruptions, the ones by women — I did not track so I cannot say for sure, but I believe 100% of them were by two women, both of whom are highly involved Wikipedians.

(I suspect, in fact, though I did not track this either, that women’s propensity to speak correlated with the status they had in these spheres coming into the room: all the women here are librarians, but for the most part the women who spoke were either widely recognized Wikipedians or library directors.)

“Be bold,” says the coffee mug, but so many librarians have worked in places where boldness is not valued, where indeed they have been punished for it.

I tried tracking self-undercutting behavior — “Hopefully I’m not speaking for everyone too much…” and “I don’t know what other, more technical people than me might say” and a staggering number of instances of “just”, for instance — but I walked back on that, because there are so many grey areas (e.g. sometimes “just” is not deployed to undermine one’s competence or status) that I had no coherent way to code it. But insofar as I tried to count, when a speaker labeled her contribution as possibly lacking value or her competence as possibly being insufficient, in every instance but one it was a woman.

“Be bold,” says the coffee mug, but we know in the room that librarians new to editing Wikipedia will need some acculturation to thrive in that process, and vice versa that Wikipedians working with libraries have their own cultural knowledge divide to cross.

Because — it is vice versa, too, isn’t it. It’s so easy to look at that value of boldness, one of the most celebrated in Wikipedia, or at the ways that male-coded discourse patterns aid in gaining or establishing status, and think, dammit, women should stop saying “just” all the time. But — even though I am often quietly flipping out inside when people undercut themselves — I saw the value of not being bold in the room, too. These humbler discourse patterns serve to recognize others’ contributions, others’ competence. They serve to hold space in the room for others to contribute, to build on or critique what’s being said, to establish their own expertise. They can represent the shakiness of not recognizing one’s own skill, yes — but they can also represent the humility in recognizing others’ skill. They allow space for others to have feelings on the topic that may not accord with the speakers’, yet retain legitimacy.

This is…not really how Wikipedia works. The encyclopedia that anyone can edit is the encyclopedia where everyone has the right to the floor. And there’s a liberation in that — on the internet, no one knows you’re a dog and the things you say matter — but there’s also an oppression, in that it rewards everyone who’s never stopped to think that maybe they’re not the expert. It rewards an investment in being right, but not in noticing the emotional undercurrents of the room, in building relationships over time and ensuring stakeholders are identified and heard, which is very much how well-run libraries tend to operate. It rewards the quick and assertive, whereas I spent today watching participation be more equal and distributed when there was structured moderation, and slide into literally 90% male voices in the last ten minutes of the day, when people were feeling punchy and discussion was totally open.

The gender gap stalks this meeting, every moment. Wikipedia editors are about 90% male; librarians are about 80% female (unless, of course, you’re in a room that draws heavily from upper management, as we are today). Wikipedia has a notorious, unsolved, and frequently gendered harassment problem. Unless you are content to be exceptionally disingenuous, you cannot talk about bringing librarians into Wikipedia without talking about this.

And here I am today not talking, because I’m listening instead. Because I’m counting. (Because I am, if I’m to be entirely honest, uncertain if I have anything to say in this context, undercutting myself in this very parenthesis.) Because I’m seeing a problem so much more difficult and more slippery than training, or documentation, or policy…our genders as ghosts in the very language we speak. Boldness as liberatory only for the bold, creating a space where the strengths of female-coded discourse patterns are pushed off to the margins, where humility looks like weakness.

I may go through a lot of coffee tomorrow.

[Image: “Be Bold” Wikipedia coffee mug]

what I learned about leadership from the Emerging Leaders

About five and a half years ago, I was sitting in a big room in conventionland (San Diego, but who’s counting) with my class of Emerging Leaders, as we brainstormed about the qualities of an excellent leader.

Someone was writing those qualities up on a flip chart and, gosh, would I have liked to work for flip chart lady. She was so perceptive and thoughtful and strategic and empathetic and not bad at anything and just great. Way cooler than me. Everyone would like to work for flip chart lady.

And then one of my brainstorming colleagues said, you know, there’s one quality we haven’t put up there, because it’s not actually a core competency for leaders, and that’s intelligence. And the room nodded in agreement, because she was right. You probably can’t be an effective leader if you’re genuinely dumb, but all other things being equal, being smarter doesn’t actually make you a better leader. And we’ve all met really smart people who were disastrous leaders; intelligence alone simply does not confer the needed skills. Fundamentally, if “leader” were a D&D class, its prime requisite would not be INT.

The whole room nodded along with her while I thought, well crap, that’s the only thing I’ve always been good at.

So I was in a funk for a while, mulling that over. And eventually decided, well, people I respect put me in this room; I’m not going to tell them they’re wrong. I’m going to find a way to make it work. I’m going to look for the situations where the skills I have can make a difference, where my weaknesses don’t count against me too much. There’s not a shortage of situations in the world that need more leadership; I’ll just have to look for the ones where the leader that’s needed can be me. They won’t be the same situations where the people to my left and right will shine, and that’s okay. And if I’m not flip chart lady, if I’m missing half her strengths and I’m littered with weaknesses she doesn’t have (because she doesn’t have any)…well, as it turns out, no one is flip chart lady. We all have weaknesses. We are all somehow, if we’re leading interesting lives at all, inadequate to the tasks we set ourselves, and perhaps leadership consists largely in rising to those tasks nonetheless.

So here I am, five and a half years later, awed and humbled to be the LITA Vice-President elect. With a spreadsheet open where I’m sketching out at the Board’s request a two-year plan for the whole association, because if intelligence is the one thing you’ve always been good at, and the thing that’s needed is assimilating years’ worth of data about people and budgets and goals and strengths and weaknesses and opportunities, and transmuting that into something coherent and actionable…

Well hey. Maybe that’ll do.

Thanks for giving me the chance, everybody. I couldn’t possibly be more excited to serve such a thoughtful, creative, smart, motivated, fun, kind bunch of people. To figure out how LITA can honor your efforts and magnify your work as, together, we take a national association with near fifty years of history into its next fifty years. I can’t be flip chart lady for you (no one can), but I am spreadsheet lady, and I’m here for you. Let’s rock.