A new approach to natural language processing (NLP) that teaches neural networks linguistic fundamentals by training them using unsegmented textual input on the interaction between individual letters rather than whole words.
Most recurrent neural networks (RNNs) that form the basis of NLP systems are trained on vocabularies of known words. To train RNNs in a way that more closely resembles how humans learn the fundamentals of language, we removed the word boundaries from training datasets and trained the networks at the character (instead of word) level. A multilingual study of this unsupervised character-level language modeling task used datasets of millions of words in English, German, and Italian. It showed that these “near tabula rasa” RNNs develop an impressive spectrum of linguistic knowledge, including segmenting groups of characters into words, distinguishing nouns from verbs, and even inducing simple forms of word meaning.
This work is part of Facebook AI's ongoing efforts to reduce the need for supervision in language systems, including for machine translation tasks. Our results for this approach suggest that, given enough input, AI systems can learn many linguistic rules effectively from scratch, opening the door to future research into unsupervised language learning approaches that require less prior knowledge than present-day NLP systems.