shaving yaks with gensim 3.8.3

I’m updating some old code which includes a model trained under gensim 3.8.3. (Or so I hope, based on the poetry.lock file.) Current stable is 4.3.0, so…I have some updating to do. In theory I can load a 3.8.x model in 4.0.x, save it, then open that in 4.1.x, et cetera. I’d rather do that than retrain the model (which would be a festival of limited documentation, missing institutional knowledge, and additional yaks), so here I am installing gensim 3.8.3 on my 2021 M1 Mac. What could go wrong, right?
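The load-then-save step is the same at every rung of that ladder, so it can live in one small script that gets run once per installed gensim version. This is a minimal sketch, not the code I actually ran: the helper names and the output naming scheme are made up for illustration, and it assumes the model is a Doc2Vec saved with gensim's own save().

```python
from pathlib import Path


def resaved_path(model_path, gensim_version):
    """Hypothetical naming scheme: old.d2v -> old.gensim-4.0.1.d2v."""
    p = Path(model_path)
    return p.with_name(f"{p.stem}.gensim-{gensim_version}{p.suffix}")


def resave(model_path):
    """Load a saved Doc2Vec model and save it again under the installed gensim."""
    # Imported here so the naming helper above works without gensim installed.
    import gensim
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load(str(model_path))
    out = resaved_path(model_path, gensim.__version__)
    model.save(str(out))
    return out
```

Tagging each output file with the version that wrote it makes it easier to tell which rung a given file came from when something breaks halfway up.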

First yak: Cython

gensim depends on numpy, numpy depends on Cython, and Cython 0.29.14 (the version in my poetry.lock) is all,

AttributeError: module 'collections' has no attribute 'Iterable'

As it turned out, my version of numpy wanted a newer version of Cython anyway:

RuntimeError: Building NumPy requires Cython >= 0.29.30, found 0.29.14 at [local directory structure]/lib/python3.8/site-packages/Cython/__init__.py

So I just tried 0.29.30 (being as conservative as possible about upgrading dependencies, since my ultimate target is an old version of gensim), and the error went away.
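For context on that first error: it is the AttributeError that Python 3.10 and later raise when code still reaches for the old collections.Iterable alias, which now lives only in collections.abc. Whether that is exactly what tripped up Cython 0.29.14 here is my assumption; the sketch below only shows the general pattern, and is_iterable is a made-up helper.

```python
import collections
import sys

# The abstract base classes have lived in collections.abc since Python 3.3.
from collections.abc import Iterable


def is_iterable(obj):
    """Return True if obj supports iteration, using the modern import path."""
    return isinstance(obj, Iterable)


# The old alias (collections.Iterable) was removed in Python 3.10,
# so this is False on 3.10 and later.
has_old_alias = hasattr(collections, "Iterable")

if __name__ == "__main__":
    print(sys.version_info[:2], "still has collections.Iterable:", has_old_alias)
```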

Second yak: numpy

Every time I try to install numpy on an M1 Mac I run into errors. I can’t even find the errors in my scrollback any more (other than the one where it couldn’t install the C bindings and pleaded for a better Cython), but if you’re reading this post, you probably know the ones. I tried Rosetta; it didn’t help. [1]

I ended up with the same solution I always end up with [2]; to wit (modulo the numpy version):

poetry run python -m pip install --no-use-pep517 --no-binary :all: numpy==1.24.2

It bothers me that I have to special-case the numpy install, because I’m always thinking about how this would look in CI/CD. But by running pip under poetry, I at least ensure that numpy ends up in the right virtualenv, and poetry is able to find it when it does dependency installation and resolution.
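A quick way to check that claim is to ask the interpreter where it finds the package. This is a minimal sketch using only the standard library; where_is is a made-up helper name, and the idea is to run the script with poetry run python so that it reports on poetry's virtualenv rather than the system Python.

```python
import importlib.util
import sys


def where_is(package):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(package)
    return None if spec is None else spec.origin


if __name__ == "__main__":
    # sys.prefix should point inside the virtualenv poetry manages.
    print("interpreter prefix:", sys.prefix)
    print("numpy found at:", where_is("numpy"))
```

If the numpy path does not sit under the printed prefix, the install went somewhere poetry is not looking.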

Boss yak: gensim

Now I have all the dependencies I need to install gensim; hooray! I verify that my model loads and saves in 3.8.3. It does. Now I install 4.0.x so that I can —

AttributeError: 'dict' object has no attribute '__NUMPY_SETUP__'

Oh. Oh dear.

This is actually a known bug that has been solved in newer versions of gensim, but that fix didn’t make it back to 4.0.x. OK. Well. I clone gensim locally, apply that patch to my version, and discover that you can poetry install a local project (this is actually an extremely sweet feature).

Now I can load my model in 4.0.x, relying on the backward compatibility guarantee, and —

  File "/path/to/local/version/gensim/models/doc2vec.py", line 328, in docvecs
    return self.dv
AttributeError: 'Doc2Vec' object has no attribute 'dv'. Did you mean: 'dm'

  File "/path/to/local/version/gensim/models/keyedvectors.py", line 272, in _upconvert_old_vocab
    if 'sample_int' in self.expandos:
AttributeError: 'KeyedVectors' object has no attribute 'expandos'

  File "/path/to/local/version/gensim/models/keyedvectors.py", line 1700, in _upconvert_old_d2vkv
    self.vocab = self.doctags
  File "/path/to/local/version/gensim/models/keyedvectors.py", line 654, in vocab
    self.vocab()  # trigger above NotImplementedError
  File "/path/to/local/version/gensim/models/keyedvectors.py", line 645, in vocab
    raise AttributeError(
AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.

Oh. Oh dear.

This was actually all pretty easy to solve, given that I now had an installable version on localhost. I added a property to gensim/models/doc2vec.py:

@property
def dv(self):
    return self.__dict__['docvecs']
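Why reach into __dict__ instead of just returning self.docvecs? Going by the traceback, docvecs in 4.0.x is itself a property that returns self.dv, so the old value stored under that name is shadowed. The toy classes below are stand-ins I made up to show the mechanism; they are not gensim's code, and the assumption that the old saved state carries a plain docvecs entry is mine.

```python
class Unpatched:
    """Toy stand-in for the 4.0.x class; not gensim's real code."""

    @property
    def docvecs(self):
        # The old name now forwards to the new one.
        return self.dv


class Patched(Unpatched):
    @property
    def dv(self):
        # Read the old value straight from the instance dict,
        # bypassing the docvecs property that shadows it.
        return self.__dict__['docvecs']


def restore(cls, state):
    """Rebuild an object roughly as default unpickling does: no __init__, just state."""
    obj = cls.__new__(cls)
    obj.__dict__.update(state)
    return obj


# Made-up stand-in for the state of a model saved under the old attribute name.
old_state = {'docvecs': ['vector-0', 'vector-1']}
```

With these, restore(Unpatched, old_state).docvecs raises AttributeError, while the Patched version returns the stored list through either name.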

And a check, before trying to access self.expandos:

if not hasattr(self, 'expandos'):
    self.expandos = {}

And I replaced a line that threw a KeyError:

# del self.expandos['offset']
self.expandos.pop('offset', None)
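Both of those changes are the same defensive move: don't assume the old object has what the new code expects. A minimal sketch on a toy object (LegacyVectors and upconvert are made-up names, not gensim's):

```python
class LegacyVectors:
    """Toy stand-in for an object restored without an expandos attribute."""


def upconvert(obj):
    # Guard: create the dict if the restored object never had one.
    if not hasattr(obj, 'expandos'):
        obj.expandos = {}
    # pop() with a default does nothing when the key is absent,
    # whereas `del obj.expandos['offset']` would raise KeyError.
    obj.expandos.pop('offset', None)
    return obj
```

The function is safe to call on objects with or without expandos, and with or without an 'offset' entry.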

I also commented out the line self.vocab = self.doctags in _upconvert_old_d2vkv. The very next line calls a function which destroys self.vocab anyway, and in 4.0.x the assignment does nothing except raise the AttributeError shown above (the vocab attribute no longer exists), so dropping it should be harmless.

Next steps

I’m going to have to verify that my model actually works as expected with these changes!
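One possible shape for that check, which I have not run yet: dump a handful of document vectors under 3.8.3 (for example via model.docvecs[tag]), dump the same tags under the patched 4.0.x (via model.dv[tag]), and compare the two. The comparison itself needs nothing but the standard library. vectors_match and the tolerances are made up for illustration, and it assumes each dump has been read back as a dict mapping tag to a list of floats.

```python
import math


def vectors_match(before, after, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two {tag: [floats]} dumps; True only if tags and values agree."""
    if before.keys() != after.keys():
        return False
    for tag, old_vec in before.items():
        new_vec = after[tag]
        if len(old_vec) != len(new_vec):
            return False
        # Element-wise comparison with a small tolerance for float noise.
        if not all(
            math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol)
            for a, b in zip(old_vec, new_vec)
        ):
            return False
    return True
```

Matching vectors would not prove the model behaves identically, but a mismatch would be a clear sign that one of the patches changed more than it should have.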

I’m also going to pull together a PR patching 4.0.x, assuming the maintainers are open to it, and work is cool with my using a version of this model as a test case (I need something which triggers the bugs in order to test the patch, you know?).

Also, of course, I will glory in these clouds of yak hair I am now surrounded by.

[Image: a majestic yak stares directly into the camera. Photo by Quaritsch Photography on Unsplash.]

Footnotes

1. But it let me have some fun messing around with Terminal settings. Now I have a duplicated version of Terminal.app, named Terminal_Rosetta, which is set to always open in Rosetta mode. I also gave it a different default color scheme so that I can always tell which architecture my terminal is in, because you know it would be the easiest thing in the world to end up with things not working because of weird architecture clash bugs that would take forever to track down.

I will probably never use this again.

2. Which maybe I will remember after having written a whole blog post about it? Honestly, probably not. But at least I might remember that I wrote a blog post about it and will therefore be able to find the answer faster.
