I haven’t failed, I’ve tried an ML approach that *might* work!

When last we met I was turning a perfectly innocent neural net into a terribly ineffective one, in an attempt to get it to be better at face recognition in archival photos. I was also (what cultural heritage technology experience would be complete without this?) being foiled by metadata.

So, uh, I stopped using metadata. 🤦‍♀️ With twinges of guilt. And full knowledge that I was tossing out a practically difficult but conceptually straightforward supervised learning problem for…what?

Well. I realized that the work that initially inspired me to try my hand at face recognition in archival photos was not, in fact, a recognition problem but a similarity problem: could you find multiple photos of the same person in the Charles “Teenie” Harris collection? This doesn’t require me to identify people, per se; it just requires me to know whether two faces belong to the same person or to different people.

And you know what? I can do a pretty good job of getting different people by randomly selecting two photos from my data set — they’re not guaranteed to be different, but I’ll settle for pretty good. And I can do an actually awesome job of guaranteeing that I have two photos of the same person with the ✨magic✨ of data augmentation.

Keras (which, by the way, is about a trillionty times better than hand-coding stuff in Octave, for all I appreciate that Coursera made me understand the fundamentals by doing that) — Keras has an ImageDataGenerator class which makes it straightforward to alter images in a variety of ways, like horizontal flips, rotations, or brightness changes — all of which are completely plausible ways that archival photos of the same person might differ inter alia! So I can get two photos of the same person by taking one photo, and messing with it.
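Concretely, the augmentation step looks something like this (a sketch: photo is assumed to be a numpy pixel array, and the specific ranges are just values that seem plausible to me, not anything tuned):

from keras.preprocessing.image import ImageDataGenerator

# plausible ways two archival photos of the same person might differ
augmenter = ImageDataGenerator(
    rotation_range=20,            # slightly tilted scans
    horizontal_flip=True,         # flipped negatives
    brightness_range=(0.7, 1.3),  # faded or overexposed prints
)

anchor = photo                                # the original image, a (height, width, 3) array
positive = augmenter.random_transform(photo)  # the "same person", messed with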

And at this point I have a Siamese network with triplet loss, another concept that Coursera set me up with (via the deeplearning.ai sequence). And now we are getting somewhere!
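For the record, triplet loss is the piece that makes “same vs. different” trainable: it pushes the anchor photo’s embedding closer to its augmented twin (the positive) than to some random other photo (the negative), by at least a margin. Here’s a sketch; the embedding size, the margin, and the convention of packing all three embeddings side by side into y_pred are my assumptions, not anything dictated by Keras:

from keras import backend as K

def triplet_loss(y_true, y_pred, emb_size=128, alpha=0.2):
    # y_pred is assumed to be [anchor | positive | negative] embeddings, concatenated
    anchor   = y_pred[:, 0:emb_size]
    positive = y_pred[:, emb_size:2 * emb_size]
    negative = y_pred[:, 2 * emb_size:3 * emb_size]

    pos_dist = K.sum(K.square(anchor - positive), axis=-1)  # distance to the same person
    neg_dist = K.sum(K.square(anchor - negative), axis=-1)  # distance to a different person

    # same-person pairs should be closer than different-person pairs by at least alpha
    return K.mean(K.maximum(pos_dist - neg_dist + alpha, 0.0))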

Well. We’re getting somewhere once you realize that, when you build a Siamese network architecture, you no longer have layers with the names of your base network; you have one GIANT layer which is just named VGGFace or whatever, instead of all of its constituent layers. So when you try to set layer.trainable = True whenever the layer name is in a list of names of VGGFace layers…uh…well…it turns out you never encounter any layers by those names, and therefore never set any layers to be trainable, and it turns out that if you train a neural net with no trainable parameters, it doesn’t learn much. Who knew. But. Anyway. Once you, after an embarrassingly long time, get past that, and set layers in the base network to be trainable before you build the Siamese network from it…

This turns out to work much better! I now have a network which does, in fact, have decreased loss and increased accuracy as it trains. I’m in a space where I can actually play with hyperparameters to figure out how to do this best. Yay!
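In code, the fix was something like this (a sketch using keras_vggface; which layers to unfreeze, and their ‘conv5_’ name prefix, are illustrative rather than a recommendation):

from keras_vggface.vggface import VGGFace

base = VGGFace(model='resnet50', include_top=False,
               input_shape=(224, 224, 3), pooling='avg')

# set trainability on the *base* model's own layers...
for layer in base.layers:
    layer.trainable = layer.name.startswith('conv5_')

# ...and only *then* wrap `base` in the Siamese architecture. Inside the wrapper,
# the whole base network shows up as one giant layer, so it's too late to do it there.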

…ok, so, does it get me anywhere in practice? Well, to test that I think I’m actually going to need a corpus of labeled photos, so that I can tell whether, given (say) a photo of W.E.B. Du Bois, it thinks the most similar photos in the collection are also of W.E.B. Du Bois. Which is to say…

Alas, metadata.

archival face recognition for fun and nonprofit

In 2019, Dominique Luster gave a super good Code4Lib talk about applying AI to metadata for the Charles “Teenie” Harris collection at the Carnegie Museum of Art — more than 70,000 photographs of Black life in Pittsburgh. They experimented with solutions to various metadata problems, but the one that’s stuck in my head since 2019 is the face recognition one. It sure would be cool if you could throw AI at your digitized archival photos to find all the instances of the same person, right? Or automatically label them, given that some of them are already labeled correctly?

Sadly, because we cannot have nice things, the data sets used for pretrained face recognition embeddings are things like lots of modern photos of celebrities, a corpus which wildly underrepresents 1) archival photos and 2) Black people. So the results of the face recognition process are not all that great.

I have some extremely technical ideas for how to improve this — ideas which, weirdly, some computer science PhDs I’ve spoken with haven’t seen in the field. So I would like to experiment with them. But I must first ~~invent the universe~~ set up a data processing pipeline.

Three steps here:

  1. Fetch archival photographs;
  2. Do face detection (draw bounding boxes around faces and crop them out for use in the next step);
  3. Do face recognition.

For step 1, I’m using DPLA, which has a super straightforward and well-documented API and an easy-to-use Python wrapper, DPyLA (which, despite not having been updated in a while, works just fine with Python 3.6, the latest version compatible with some of my dependencies).
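The fetching step looks roughly like this (a sketch with DPyLA; the query and the fields I pull out are illustrative, and you need your own DPLA API key):

from dpla.api import DPLA

dpla = DPLA('your-api-key-here')
result = dpla.items('photograph')   # whatever query actually scopes your collection

for item in result.items:
    thumbnail_url = item.get('object')  # DPLA's 'object' field is the item's thumbnail URL
    if thumbnail_url:
        ...  # fetch and save it (see Gotcha 1 below first)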

For step 2, I’m using mtcnn, because I’ve been following this tutorial.
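The detection step is more or less what the tutorial does: MTCNN hands back bounding boxes, and I crop each face out for step 3. A sketch (the 224×224 size is what VGGFace expects):

import numpy as np
from PIL import Image
from mtcnn.mtcnn import MTCNN

detector = MTCNN()
pixels = np.asarray(Image.open('photo.jpg').convert('RGB'))

faces = []
for result in detector.detect_faces(pixels):
    x, y, width, height = result['box']           # bounding box around one detected face
    crop = pixels[y:y + height, x:x + width]
    faces.append(Image.fromarray(crop).resize((224, 224)))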

For step 3, face recognition, I’m using the steps in the same tutorial, but purely for proof-of-concept — the results are garbage because archival photos from mid-century don’t actually look anything like modern-day celebrities. (Neural net: “I have 6% confidence this is Stevie Wonder!” How nice for you.) Clearly I’m going to need to build my own corpus of people, which I have a plan for (i.e. I spent some quality time thinking about numpy) but haven’t yet implemented.
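The proof-of-concept recognition step is essentially the tutorial’s: embed the cropped face with a pretrained VGGFace model and ask it which celebrity it thinks it’s looking at. A sketch, assuming face_image is one of the 224×224 crops from step 2:

import numpy as np
from keras_vggface.vggface import VGGFace
from keras_vggface.utils import preprocess_input, decode_predictions

model = VGGFace(model='resnet50')

sample = np.asarray(face_image).astype('float32')
samples = preprocess_input(np.expand_dims(sample, axis=0), version=2)  # version=2 for resnet50

predictions = model.predict(samples)
for name, score in decode_predictions(predictions)[0]:
    print(name, score)   # top celebrity guesses, none of them remotely useful here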

So far the gotchas have been:

Gotcha 1: If you fetch a page from the API and assume you can treat its contents as an image, you will be sad. You have to treat them as a raw data stream and interpret that as an image, thusly:

import io

from PIL import Image
import requests

# fetch the raw bytes, then interpret those bytes as an image
response = requests.get(url, stream=True)
response.raw.decode_content = True
image = Image.open(io.BytesIO(response.raw.read()))

This code is, of course, hilariously lacking in error handling, despite fetching content from a cesspool of untrustworthiness, aka the internet. It’s a first draft.

Gotcha 2: You see code snippets to convert images to pixel arrays (suitable for AI ingestion) that look kinda like this: np.array(image).astype('uint8'). Except they say astype('float32') instead of astype('uint8'). I got a creepy photonegative effect when I used floats.
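In other words (assuming image is a PIL Image):

import numpy as np

pixels = np.array(image).astype('uint8')      # integers 0-255: colors look right
# pixels = np.array(image).astype('float32')  # floats: creepy photonegative effect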

Gotcha 3: Although PIL was happy to manipulate the .pngs fetched from the API, it was not happy to write them to disk; I needed to convert formats first (image.convert('RGB')).
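That is, roughly (the filename is illustrative):

rgb_image = image.convert('RGB')   # convert out of the mode PIL refuses to write
rgb_image.save('face.jpg')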

Gotcha 4: The suggested keras_vggface library doesn’t have a Pipfile or requirements.txt, so I had to manually install keras and tensorflow. Luckily the setup.py documented the correct versions. Sadly the tensorflow version is only compatible with Python up to 3.6 (hence the comment about DPyLA compatibility above). I don’t love this, but it got me up and running, and it seems like an easy enough part of the pipeline to rip out and replace if it’s bugging me too much.

The plan from here, not entirely in order, subject to change as I don’t entirely know what I’m doing until after I’ve done it:

  • Build my own corpus of identified people
    • This means the numpy thoughts, above
    • It also means spending more quality time with the API to see if I can automatically apply names from photo metadata rather than having to spend too much of my own time manually labeling the corpus
  • Decide how much metadata I need to pull down in my data pipeline and how to store it
  • Figure out some kind of benchmark and measure it
  • Try out my idea for improving recognition accuracy
  • Benchmark again
  • Hopefully celebrate awesomeness