Adapting Coursera’s neural style transfer code to localhost

Last time, when making cats from the void, I promised that I’d discuss how I adapted the neural style transfer code from Coursera’s Convolutional Neural Networks course to run on localhost. Here you go!

Step 1: First, of course, download (as python) the script. You’ll also need the nst_utils.py file, which you can access via File > Open.

Step 2: While the Coursera file is in .py format, it’s iPython in its heart of hearts. So I opened a new file and started copying over the bits I actually needed, reading them as I went to be sure I understood how they all fit together. Along the way I also organized them into functions, to clarify where each responsibility happened and give it a name. The goal here was ultimately to get something I could run at the command line via python dpla_cats.py, so that I could find out where it blew up in step 3.

Step 3: Time to install dependencies. I promptly made a pipenv and, in running the code and finding what ImportErrors showed up, discovered what I needed to have installed: scipy, pillow, imageio, tensorflow. Whatever available versions of the former three worked, but for tensorflow I pinned to the version used in Coursera — 1.2.1 — because there are major breaking API changes with the current (2.x) versions.

This turned out to be a bummer, because tensorflow promptly threw warnings that it could be much faster on my system if I compiled it with various flags my computer supports. OK, so I looked up the docs for doing that, which said I needed bazel/bazelisk — but of course I needed a paleolithic version of that for tensorflow 1.2.1 compat, so it was irritating to install — and then running that failed because it needed a version of Java old enough that I didn’t have it, and at that point I gave up because I have better things to do than installing quasi-EOLed Java versions. Updating the code to be compatible with the latest tensorflow version and compiling an optimized version of that would clearly be the right answer, but also it would have been work and I wanted messed-up cat pictures now.

(As for the rest of my dependencies, I ended up with scipy==1.5.4, pillow==8.0.1, and imageio==2.9.0, and then whatever sub-dependencies pipenv installed. Just in case the latest versions don’t work by the time you read this. 🙂

At this point I had achieved goal 1, aka “getting anything to run at all”.

Step 4: I realized that, honestly, almost everything in nst_utils wanted to be an ImageUtility, which was initialized with metadata about the content and style files (height, width, channels, paths), and carried the globals (shudder) originally in nst_utils as class data. This meant that my new dpla_cats script only had to import ImageUtility rather than * (from X import * is, of course, deeply unnerving), and that utility could pingpong around knowing how to do the things it knew how to do, whenever I needed to interact with image-y functions (like creating a generated image or saving outputs) rather than neural-net-ish stuff. Everything in nst_utils that properly belonged in an ImageUtility got moved, step by step, into that class; I think one or two functions remained, and they got moved into the main script.

Step 5: Ughhh, scope. The notebook plays fast and loose with scope; the raw python script is, rightly, not so forgiving. But that meant I had to think about what got defined at what level, what got passed around in an argument, what order things happened in, et cetera. I’m not happy with the result — there’s a lot of stuff that will fail with minor edits — but it works. Scope errors will announce themselves pretty loudly with exceptions; it’s just nice to know you’re going to run into them.

Step 5a: You have to initialize the Adam optimizer before you run sess.run(tf.global_variables_initializer()). (Thanks, StackOverflow!) The error message if you don’t is maddeningly unhelpful. (FailedPreconditionError, I mean, what.)

Step 6: argparse! I spent some quality time reading this neural style implementation early on and thought, gosh, that’s argparse-heavy. Then I found myself wanting to kick off a whole bunch of different script runs to do their thing overnight investigating multiple hypotheses and discovered how very much I wanted there to be command-line arguments, so I could configure all the different things I wanted to try right there and leave it alone. Aw yeah. I’ve ended up with the following:

parser.add_argument('--content', required=True)
parser.add_argument('--style', required=True)
parser.add_argument('--iterations', default=400)        # was 200
parser.add_argument('--learning_rate', default=3.0)     # was 2.0
parser.add_argument('--layer_weights', nargs=5, default=[0.2,0.2,0.2,0.2,0.2])
parser.add_argument('--run_until_steady', default=False)
parser.add_argument('--noisy_start', default=True)

content is the path to the content image; style is the path to the style image; iterations and learning_rate are the usual; layer_weights is the value of STYLE_LAYERS in the original code, i.e. how much to weight each layer; run_until_steady is a bad API because it means to ignore the value of the iterations parameter and instead run until there is no longer significant change in cost; and noisy_start is whether to use the content image plus static as the first input or just the plain content image.

I can definitely see adding more command line flags if I were going to be spending a lot of time with this code. (For instance, a layer_names parameter that adjusted what STYLE_LAYERS considered could be fun! Or making “significant change in cost” be a user-supplied rather than hardcoded parameter!)

Step 6a: Correspondingly, I configured the output filenames to record some of the metadata used to create the image (content, style, layer_weights), to make it easier to keep track of which images came from which script runs.

Stuff I haven’t done but it might be great:

Updating tensorflow, per above, and recompiling it. The slowness is acceptable — I can run quite a few trials on my 2015 MacBook overnight — but it would get frustrating if I were doing a lot of this.

Supporting both num_iterations and run_until_steady means my iterator inside the model_nn function is kind of a mess right now. I think they’re itching to be two very thin subclasses of a superclass that knows all the things about neural net training, with the subclass just handling the iterator, but I didn’t spend a lot of time thinking about this.

Reshaping input files. Right now it needs both input files to be the same dimensions. Maybe it would be cool if it didn’t need that.

Trying different pretrained models! It would be easy to pass a different arg to load_vgg_model. It would subsequently be annoying to make sure that STYLE_LAYERS worked — the available layer names would be different, and load_vgg_model makes a lot of assumptions about how that model is shaped.

As your reward for reading this post, you get another cat image! A friend commented that a thing he dislikes about neural style transfer is that it’s allergic to whitespace; it wants to paint everything with a texture. This makes sense — it sees subtle variations within that whitespace and it tries to make them conform to patterns of variation it knows. This is why I ended up with the noisy_start flag; I wondered what would happen if I didn’t add the static to the initial image, so that the original negative space stayed more negative-spacey.

This, as you can probably tell, uses the Harlem renaissance style image.

It’s still allergic to negative space — even without the generated static there are variations in pixel color in the original — but they are much subtler, so instead of saying “maybe what I see is coiled hair?” it says “big open blue patches; we like those”. But the semantics of the original image are more in place — the kittens more kitteny, the card more readable — even though the whole image has been pushed more to colorblocks and bold lines.

I find I like the results better without the static — even though the cost function is larger, and thus in a sense the algorithm is less successful. Look, one more.

Superhero!

Dear Internet, merry Christmas; my robot made you cats from the void

Recently I learned how neural style transfer works. I wanted to be able to play with it more and gain some insights, so I adapted the Coursera notebook code to something that works on localhost (more on that in a later post), found myself a nice historical cat image via DPLA, and started mashing it up with all manner of images of varying styles culled from DPLA’s list of primary source sets. (It really helped me that these display images were already curated for looking cool, and cropped to uniform size!)

These sweet babies do not know what is about to happen to them.

Let’s get started, shall we?

Style image from the Fake News in the 1890s: Yellow Journalism primary source set.

I really love how this one turned out. It’s pulled the blue and yellow colors, and the concerned face of the lower kitten was a perfect match for the expression on the right-hand muckraker. The lines of the card have taken on the precise quality of those in the cartoon — strong outlines and textured interiors. “Merry Christmas” the bird waves, like an eager newsboy.

Style image from the Food and Social Justice exhibit.

This is one of the first ones I made, and I was delighted by how it learned the square-iness of its style image. Everything is more snapped to a grid. The colors are bolder, too, cueing off of that dominant yellow. The Christmas banner remains almost readable and somehow heraldic.

Style image from the Truth, Justice, and the American Way primary source set.

How about Christmas of Steel? These kittens have broadly retained their shape (perhaps as the figures in the comic book foreground have organic detail?), but the background holly is more polygon-esque. The colors have been nudged toward primary, and the static of the background has taken on a swirl of dynamic motion lines.

Style image from the Visual Art During the Harlem Renaissance primary source set.

How about starting with something boldly colored and almost abstract? Why look: the kittens have learned a world of black and white and blue, with the background transformed into that stippled texture it picked up from the hair. The holly has gone more colorblocky and the lines bolder.

Style image from the Treaty of Versailles and the End of World War I primary source set.

This one learned its style so aptly that I couldn’t actually tell where the boundary between the second and third images was when I was placing that equals sign. The soft pencil lines, the vertical textures of shadows and jail bars, the fact that all the colors in the world are black and white and orange (the latter mostly in the middle) — these kittens are positively melting before the force of Wilsonian propaganda. Imagine them in the Hall of Mirrors, drowning in gold and reflecting back at you dozens of times, for full nightmare effect.

Style image from the Victorian Era primary source set.

Shall we step back a few decades to something slightly more calming? These kittens have learned to take on soft lines and swathes of pale pink. The holly is perfectly happy to conform itself to the texture of these New England trees. The dark space behind the kittens wonders if, perhaps, it is meant to be lapels.

I totally can’t remember how I found this cropped version of US food propaganda.

And now for kittens from the void.

Brown, it has learned. The world is brown. The space behind the kittens is brown. Those dark stripes were helpfully already brown. The eyes were brown. Perhaps they can be the same brown, a hole dropped through kitten-space.

I thought this was honestly pretty creepy, and I wondered if rerunning the process with different layer weights might help. Each layer of the neural net notices different sorts of things about its image; it starts with simpler things (colors, straight lines), moves through compositions of those (textures, basic shapes), and builds its way up to entire features (faces). The style transfer algorithm looks at each of those layers and applies some of its knowledge to the generated image. So I thought, what if I change the weights? The initial algorithm weights each of five layers equally; I reran it weighted toward the middle layers and entirely ignoring the first layer, in hopes that it would learn a little less about gaping voids of brown.

Same thing, less void.

This worked! There’s still a lot of brown, but the kitten’s eye is at least separate from its facial markings. My daughter was also delighted by how both of these images want to be letters; there are lots of letter-ish shapes strewn throughout, particularly on the horizontal line that used to be the edge of a planter, between the lower cat and the demon holly.

So there you go, internet; some Christmas cards from the nightmare realm. May 2021 bring fewer nightmares to us all.

Of such stuff are (deep)dreams made: convolutional networks and neural style transfer

Skipped FridAI blogging last week because of Thanksgiving, but let’s get back on it! Top-of-mind today are the firing of AI queen Timnit Gebru (letter of support here) and a couple of grant applications that I’m actually eligible for (this is rare for me! I typically need things for which I can apply in my individual capacity, so it’s always heartening when they exist — wish me luck).

But for blogging today, I’m gonna talk about neural style transfer, because it’s cool as hell. I started my ML-learning journey on Coursera’s intro ML class and have been continuing with their deeplearning.ai sequence; I’m on course 4 of 5 there, so I’ve just gotten to neural style transfer. This is the thing where a neural net outputs the content of one picture in the style of another:

Via https://medium.com/@build_it_for_fun/neural-style-transfer-with-swift-for-tensorflow-b8544105b854.

OK, so! Let me explain while it’s still fresh.

If you have a neural net trained on images, it turns out that each layer is responsible for recognizing different, and progressively more complicated, things. The specifics vary by neural net and data set, but you might find that the first layer gets excited about straight lines and colors; the second about curves and simple textures (like stripes) that can be readily composed from straight lines; the third about complex textures and simple objects (e.g. wheels, which are honestly just fancy circles); and so on, until the final layers recognize complex whole objects. You can interrogate this by feeding different images into the neural net and seeing which ones trigger the highest activation in different neurons. Below, each 3×3 grid represents the most exciting images for a particular neuron. You can see that in this network, there are Layer 1 neurons excited about colors (green, orange), and about lines of particular angles that form boundaries between dark and colored space. In Layer 2, these get built together like tiny image legos; now we have neurons excited about simple textures such as vertical stripes, concentric circles, and right angles.

Via https://adeshpande3.github.io/The-9-Deep-Learning-Papers-You-Need-To-Know-About.html, originally from Zeller & Fergus, Visualizing and Understanding Convolutional Networks

So how do we get from here to neural style transfer? We need to extract information about the content of one image, and the style of another, in order to make a third image that approximates both of them. As you already expect if you have done a little machine learning, that means that we need to write cost functions that mean “how close is this image to the desired content?” and “how close is this image to the desired style?” And then there’s a wrinkle that I haven’t fully understood, which is that we don’t actually evaluate these cost functions (necessarily) against the outputs of the neural net; we actually compare the activations of the neurons, as they react to different images — and not necessarily from the final layer! In fact, choice of layer is a hyperparameter we can vary (I super look forward to playing with this on the Coursera assignment and thereby getting some intuition).

So how do we write those cost functions? The content one is straightforward: if two images have the same content, they should yield the same activations. The greater the differences, the greater the cost (specifically via a squared error function that, again, you may have guessed if you’ve done some machine learning).

The style one is beautifully sneaky; it’s a measure of the difference in correlation between activations across channels. What does that mean in English? Well, let’s look at the van Gogh painting, above. If an edge detector is firing (a boundary between colors), then a swirliness detector is probably also firing, because all the lines are curves — that’s characteristic of van Gogh’s style in this painting. On the other hand, if a yellowness detector is firing, a blueness detector may or may not be (sometimes we have tight parallel yellow and blue lines, but sometimes yellow is in the middle of a large yellow region). Style transfer posits that artistic style lies in the correlations between different features. See? Sneaky. And elegant.

Finally, for the style-transferred output, you need to generate an image that does as well as possible on both cost functions simultaneously — getting as close to the content as it can without unduly sacrificing the style, and vice versa.

As a side note, I think I now understand why DeepDream is fixated on a really rather alarming number of eyes. Since the layer choice is a hyperparameter, I hypothesize that choosing too deep a layer — one that’s started to find complex features rather than mere textures and shapes — will communicate to the system, yes, what I truly want is for you to paint this image as if those complex features are matters of genuine stylistic significance. And, of course, eyes are simple enough shapes to be recognized relatively early (not very different from concentric circles), yet ubiquitous in image data sets. So…this is what you wanted, right? the eager robot helpfully offers.

https://www.ucreative.com/inspiration/google-deep-dream-is-the-trippiest-thing-in-the-internet/

I’m going to have fun figuring out what the right layer hyperparameter is for the Coursera assignment, but I’m going to have so much more fun figuring out the wrong ones.