Not much AI blogging this week because I have been buried in adulting all week, which hasn’t left much time for machine learning. Sadface.
However, I’m in the last week of the last deeplearning.ai course! (Well. Of the deeplearning.ai sequence that existed when I started, anyway. They’ve since added an NLP course and a GANs course, so I’ll have to think about whether I want to take those too, but at the moment I’m leaning toward a break from the formal structure in order to give myself more time for project-based learning.) This one is on sequence models (i.e. “the data comes in as a stream, like music or language”) and machine translation (“what if we also want our output to be a stream, because we are going from a sentence to a sentence, and not from a sentence to a single output as in, say, sentiment analysis”).
And I have to say, as a former language teacher, I’m slightly irked.
Because the way the models work is — OK, consume your input sentence one token at a time, with some sort of memory that allows you to keep track of prior tokens in processing current ones (so far, so okay). And then for your output — spit out a few most-likely candidate tokens for the first output term, and then consider your options for the second term and pick your most-likely two-token pairs, and then consider all the ways your third term could combine with those pairs and pick your most likely three-token sequences, et cetera, continue until done.
And that is…not how language works?
Look at Cicero, presuming upon your patience as he cascades through clause after clause which hang together in parallel but are not resolved until finally, at the end, a verb. The sentence’s full range of meanings doesn’t collapse until that verb at the end, which means you cannot be certain if you move one token at a time; you need to reconsider the end in light of the beginning. But, at the same time, that ending token is not equally presaged by all former tokens. It is a verb, it has a subject, and when we reached that subject, likely near the beginning of the sentence, helpfully (in Latin) identified by the nominative case, we already knew something about the verb — a fact we retained all the way until the end. And on our way there, perhaps we tied off clause after clause, chunking them into neat little packages, but none of them nearly so relevant to the verb — perhaps in fact none of them really tied to the verb at all, because they’re illuminating some noun we met along the way. Pronouns, pointing at nouns. Adjectives, pointing at nouns. Nouns, suspended with verbs like a mobile, hanging above and below, subject and object. Adverbs, keeping company only with verbs and each other.
There’s so much data in the sentence about which word informs which that the beam model casually discards. Wasteful. And forcing the model to reinvent all these things we already knew — to allocate some of its neural space to re-engineering things we could have told it from the beginning.
Clearly I need to get my hands on more modern language models (a bizarre sentence since this class is all of 3 years old, but the field moves that fast).