An open-source, free, lightweight library created by Facebook's AI Research (FAIR) lab that learns text representations and builds text classifiers.
- Written in C++ and supports multiprocessing during training.
- Allows us to train supervised and unsupervised representations of words and sentences.
Setting it up
$ pip install fasttext
----------------------------Installing-------------------------
$ python
Python 2.7.15 (default, May 1 2018, 18:37:05)
Type "help", "copyright", "credits" or "license" for more information.
>>> import fasttext
>>>
Word Embeddings
- Processing natural language and extracting useful information from text with ML requires that the text be understandable by the machine. For this purpose, the text is converted into a set of real numbers, technically a vector.
- A word embedding is a learned representation of text where words that have the same meaning have a similar representation in the vector space.
- The process of converting words into real numbers/vectors is called vectorization.
- Word embeddings help in the following use cases.
- Compute similar words
- Calculate semantics behind words
- Document clustering/grouping
- Feature extraction for text classifications
- Natural language processing.
- Word embeddings can be calculated using pre-trained methods from libraries such as,
- Word2Vec — From Google
- fastText — From Facebook
- GloVe — From Stanford
- These are distributed representations of text in an n-dimensional space. Essential for solving most NLP problems.
- These vectors capture hidden information about a language, like word analogies or semantics.
Word Embedding Methods
These methods learn a real-valued vector representation for a predefined, fixed-size vocabulary from a corpus of text.
The most popular architectures, such as Word2Vec, fastText, and GloVe, convert text to word vectors and use the cosine similarity between these embeddings in the n-dimensional vector space to compute word-similarity features.
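As a minimal sketch of that similarity computation (the two vectors below are tiny made-up examples; real embeddings have 100-300 dimensions):
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

king = np.array([0.50, 0.10, 0.80])
queen = np.array([0.45, 0.20, 0.75])
print(cosine_similarity(king, queen))  # close to 1.0 -> treated as similar words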
- Embedding layer
- The most naive approach.
- Here word embeddings are learned jointly with a neural network model on a specific natural language processing task.
- The embedding layer is used on the front end of a neural network and is fit in a supervised way using the backpropagation algorithm.
- This approach of learning an embedding layer requires a lot of training data and is slow.
- 2 methods:
- FeedForward Neural Net Language Model (NNLM)
- Recurrent Neural Net Language Model (RNNLM)
- NNLM and RNNLM perform well even on huge datasets of words, but computational complexity is a big overhead.
- Word2Vec
- Developed at Google.
- Word Representations in Vector Space, or the word2vec algorithm.
- Makes the neural-network-based training of the embedding more efficient and is now the de-facto standard for developing pre-trained word embeddings.
- Takes a large text corpus as input and produces a vector space of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in this space.
- Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another.
- 2 learning models were introduced for learning the word embedding (see the sketch after this list):
- Continuous Bag-of-Words (CBOW) model: learns the embedding by predicting the current word based on its context.
- Continuous Skip-gram model: learns the embedding by predicting the surrounding words given the current word.
- Key benefit is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings (more dimensions) to be learned from much larger corpora of text (billions of words).
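For comparison, below is a minimal sketch of training both variants with gensim (a separate library, shown purely for illustration; the toy corpus and parameter values are made up, and the API is gensim 4.x):
from gensim.models import Word2Vec

# Toy corpus; real training needs millions of sentences.
sentences = [
    ["the", "king", "rules", "the", "kingdom"],
    ["the", "queen", "rules", "the", "kingdom"],
    ["a", "dog", "chases", "a", "cat"],
]

# sg=0 -> CBOW (predict the current word from its context)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
# sg=1 -> skip-gram (predict the surrounding words from the current word)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(cbow.wv.most_similar("king", topn=3))
print(skipgram.wv["queen"][:5])  # first 5 dimensions of the learned vector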
- GloVe
- Developed at Stanford.
- The Global Vectors for Word Representation, or GloVe, algorithm is an extension of the Word2Vec method for efficiently learning word vectors.
- An approach that combines the global statistics of matrix-factorization techniques like LSA (Latent Semantic Analysis) with the local context-based learning of Word2Vec.
- Rather than using a window to define local context, GloVe constructs an explicit word-context or word co-occurrence matrix using statistics across the whole text corpus. The result is a learning model that may produce generally better word embeddings.
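A toy sketch (not GloVe itself) of what such a word co-occurrence matrix looks like, built with a symmetric context window; the corpus and window size are made up for illustration:
from collections import defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
window = 2  # symmetric context window size

# cooccurrence[(w, c)] counts how often context word c appears within
# `window` positions of word w across the whole corpus.
cooccurrence = defaultdict(float)
for sentence in corpus:
    tokens = sentence.split()
    for i, word in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if i != j:
                cooccurrence[(word, tokens[j])] += 1.0

print(cooccurrence[("the", "sat")])  # co-occurrence count for the pair ('the', 'sat')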
FastText
We can either generate word vectors for any raw data or use the pre-trained word vectors which ship with FastText.
Example
Downloading & cleaning raw dataset
jalaz@jalaz-personal:~$ wget -c http://mattmahoney.net/dc/enwik9.zip -P data/
jalaz@jalaz-personal:~$ unzip data/enwik9.zip -d data/
jalaz@jalaz-personal:~$ perl /home/jalaz/fastText/wikifil.pl data/enwik9 > data/fil9
Generation of word vectors
import fasttext
# Generation by default settings
model = fasttext.train_unsupervised('data/fil9')
# Saving the model for later use
model.save_model("result/fil9.bin")
model = fasttext.load_model("result/fil9.bin")
2 models for computing word representations:
- skipgram model learns to predict a target word thanks to a nearby word.
- cbow model predicts the target word according to its context. The context is represented as a bag of the words contained in a fixed size window around the target word.
Practically, skipgram models work better with subword information than cbow models.
Tweaking parameters
model1 = fasttext.train_unsupervised('data/fil9_small', model="cbow")
model2 = fasttext.train_unsupervised('data/fil9_small', model="skipgram")
model3 = fasttext.train_unsupervised('data/fil9', minn=2, maxn=5, dim=300)
model4 = fasttext.train_unsupervised('data/fil9', epoch=1, lr=0.5)
model5 = fasttext.train_unsupervised('data/fil9', thread=4)
- dim (dimension)
- Controls the size of the vectors
- larger dim -> more information capture but requires more data to be learned.
- If too large -> harder and slower to train.
- By default, dim = 100, but any value in the 100-300 range is good.
- subwords
- All the substrings contained in a word, with lengths between minn and maxn (see the sketch after this list).
- By default, minn = 3 and maxn = 6. For different languages, suitable ranges may vary.
- epoch
- Controls how many times the model will loop over the dataset for training.
- By default, epoch = 5. For massive datasets, fewer epochs may be enough.
- lr
- The higher the lr, the faster the model converges to a solution, but at the risk of overfitting to the dataset.
- By default, lr = 0.05
- thread
- fastText is multi-threaded and uses 12 threads by default.
- This can easily be tweaked using this parameter on CPUs with fewer cores.
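A small sketch for inspecting the character n-grams fastText stores for a word, assuming the fil9 model trained above and that the installed Python bindings expose get_subwords (available in recent versions):
import fasttext

model = fasttext.load_model("result/fil9.bin")

# Returns the subwords (the word itself plus its character n-grams, padded with
# '<' and '>' boundary markers) together with their indices in the input matrix.
subwords, indices = model.get_subwords("kingdom")
print(subwords)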
Usage of word embeddings
- The word embedding generated for a word can be checked:
Script
print(model.words)
print("\n-------------------------------------\n")
print(model.get_word_vector("female"))
Output
['the', 'of', ... 'germany', ... 'actress', ... 'governor', 'players', ... 'models', ...]
-------------------------------------
[ 0.01122757  0.18961109 -0.16199729  0.11208588 ...  0.19992262 -0.06550902 -0.40920728 -0.16724268]
- The semantic information of the vectors is captured with the nn functionality.
Script
model.get_nearest_neighbors('london')
Output
[(0.7785311341285706, 'princeton'), (0.7696226239204407, 'cambridge'), (0.7583264112472534, 'glasgow'), (0.7519310116767883, 'oxfordshire'), ..., (0.7124481797218323, 'routledge')]
- The nn functionality can also be used for spell-correction purposes.
Script
model.get_nearest_neighbors('actres')
Output
[(0.9361368417739868, 'actress'), (0.9093650579452515, 'actresses'), (0.852777361869812, 'actor'), (0.8409433364868164, 'songwriter'), ..., (0.771904468536377, 'snooker')]
- The analogies functionality can be used for finding hidden analogies between data points.
Script
model.get_analogies("berlin", "germany", "france")
Output
[(0.896462, u'paris'), (0.768954, u'bourges'), ..., (0.740635, u'bordeaux'), (0.736122, u'pigneaux')]
- Character n-grams are really important. Using subword-level information helps build vectors in the vector space for totally unknown (out-of-vocabulary) words.
Script
model_without_subwords = fasttext.train_unsupervised('data/fil9_small', maxn=0)
model_normal = fasttext.train_unsupervised('data/fil9_small')
model_without_subwords.get_nearest_neighbors('accomodation')
print("\n------------------------------------\n")
model_normal.get_nearest_neighbors('accomodation')
Output
[(0.775057, u'sunnhordland'), (0.769206, u'accomodations'), (0.753011, u'administrational'), ..., (0.732465, u'asserbo')]
------------------------------------
[(0.96342, u'accomodations'), (0.942124, u'accommodation'), (0.915427, u'accommodations'), ..., (0.701426, u'hospitality')]
Text Classification
Text classification is the ML problem of classifying any text into one or more labels, after training a model using supervised learning methods.
- Spam detection, language identification, and sentiment analysis come under this domain.
- Classifiers can be single-label (like a spam identifier: spam vs. not spam) or multi-class/multi-label (like a language detector: hindi-english-telugu-tamil, etc.).
- For building such classifiers, labeled data is required, consisting of documents and their corresponding labels.
Preparing dataset
- Download this awesome dataset on news item classification from Kaggle.
- Analyse the dataset, present in a JSON-Lines file, using:
jalaz@jalaz-personal:~$ head -2 data/news-articles.jsonl
{"category": "CRIME", "headline": "There Were 2 Mass Shootings In Texas Last Week, But Only 1 On TV", "authors": "Melissa Jeltsen", "link": "https://www.huffingtonpost.com/entry/texas-amanda-painter-mass-shooting_us_5b081ab4e4b0802d69caad89", "short_description": "She left her husband. He killed their children. Just another day in America.", "date": "2018-05-26"}
{"category": "ENTERTAINMENT", "headline": "Will Smith Joins Diplo And Nicky Jam For The 2018 World Cup's Official Song", "authors": "Andy McDonald", "link": "https://www.huffingtonpost.com/entry/will-smith-joins-diplo-and-nicky-jam-for-the-official-2018-world-cup-song_us_5b09726fe4b0fdb2aa541201", "short_description": "Of course it has a song.", "date": "2018-05-26"}
- Comparing this raw dataset with the standard dataset provided by Facebook Research,
jalaz@jalaz-personal:~$ head -5 data/cooking.stackexchange.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing at acidic environments
__label__cast-iron __label__stove How can I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant, but the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
- Data cleaning & standardization is achieved using:
import json

fileReader = open("data/news-articles.jsonl", "r")
fileWriter = open("data/news-articles.txt", "w")
for line in fileReader:
    news = dict(json.loads(line))
    fileWriter.write("__label__" + news["category"].lower() + " " + news["headline"].lower() + "\n")
- Final dataset ready for fastText classifier:
jalaz@jalaz-personal:~$ head -5 data/news-articles.txt
__label__crime there were 2 mass shootings - texas last week, but only 1 on tv
__label__entertainment will smith joins diplo and nicky jam the 2018 world cups official song
__label__entertainment hugh grant marries the first at age 57
__label__entertainment jim carrey blasts castrato adam schiff and democrats new artwork
__label__entertainment julianna margulies uses donald trump poop bags to pick up after her dog
- Training-Validation splitting (80-20):
jalaz@jalaz-personal:~$ wc data/news-articles.txt
200832 2189821 15670354 data/news-articles.txt
jalaz@jalaz-personal:~$ head -n 160000 data/news-articles.txt > data/news-articles.train
jalaz@jalaz-personal:~$ tail -n 40832 data/news-articles.txt > data/news-articles.valid
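The head/tail split above assumes the file is not ordered by date or category; a minimal Python sketch of an alternative shuffled 80-20 split (same file names as above, seed chosen arbitrarily):
import random

random.seed(42)
with open("data/news-articles.txt") as f:
    lines = f.readlines()

random.shuffle(lines)
split = int(0.8 * len(lines))

with open("data/news-articles.train", "w") as f:
    f.writelines(lines[:split])
with open("data/news-articles.valid", "w") as f:
    f.writelines(lines[split:])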
Creating, saving & using model
Script
import fasttext
model = fasttext.train_supervised(input="data/news-articles.train")
model.save_model("model/news-classifier-v1.bin")
modelLoaded = fasttext.load_model("model/news-classifier-v1.bin")
model.predict("Roger Federer wins US Grand Slam Men's final")
model.predict("North Korea threatens Japan with back to back 4 nuclear tests")
print("\n-----------------------\n")
model.predict("Britain exit from the European Union confirmed", k=5)
print("\n-----------------------\n")
model.predict("narendra modi aquitted for gujarat riots by the court", k=-1, threshold=0.1)
Output
(('__label__sports',), array([0.91453463]))
(('__label__politics',), array([0.88016534]))
-----------------------
(('__label__politics', '__label__worldpost', '__label__impact', '__label__business', '__label__religion'), array([0.41946396, 0.15596035, 0.13890333, 0.09830396, 0.02962857]))
-----------------------
(('__label__worldpost', '__label__crime', '__label__politics'), array([0.30460522, 0.25598988, 0.14045343]))
Testing model accuracy
- Precision: the fraction of the predicted labels that are correct.
- Recall: the fraction of the true labels that were successfully predicted (a worked example follows the output below).
Script
model.test("data/news-articles.valid")
model.test("data/news-articles.valid", k=5)
Output
(40832, 0.5363685344827587, 0.5363685344827587)
(40832, 0.1458170062695925, 0.7290850313479624)
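Each tuple returned by model.test is (number of samples, precision@k, recall@k). Since every headline in this dataset carries exactly one label, precision and recall coincide at k=1, while at k=5 precision drops and recall rises. A hedged worked example for a single document (the true label is assumed; the predicted labels are taken from the earlier prediction output):
true_labels = {"__label__politics"}
predicted_top5 = ["__label__politics", "__label__worldpost", "__label__impact",
                  "__label__business", "__label__religion"]

correct = len(true_labels.intersection(predicted_top5))
precision_at_5 = correct / len(predicted_top5)   # 1/5 = 0.2
recall_at_5 = correct / len(true_labels)         # 1/1 = 1.0
print(precision_at_5, recall_at_5)
Averaging these per-document values over the 40832 validation lines gives the numbers reported by model.test.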
Tweaking parameters & improvements
- epochs & lr
- epoch denotes the number of training passes over the dataset.
- lr denotes the learning rate.
Script
modelv2 = fasttext.train_supervised(input="data/news-articles.train", epoch=25)
modelv2.test("data/news-articles.valid")
modelv3 = fasttext.train_supervised(input="data/news-articles.train", lr=1.0)
modelv3.test("data/news-articles.valid")
modelv4 = fasttext.train_supervised(input="data/news-articles.train", epoch=25, lr=1.0)
modelv4.test("data/news-articles.valid")
Output
(40832, 0.617701802507837, 0.617701802507837)
(40832, 0.698226880877743, 0.698226880877743)
(40832, 0.5116085423197492, 0.5116085423197492)
- word n-grams
- Performance of the model can be improved by using word bigrams instead of just unigrams. This is important for classification problems where word order matters, like sentiment analysis.
- A unigram refers to a single undivided unit, or token. It can be a word or a letter depending on the model; in fastText we work at the word level, so unigrams are words.
- A bigram is the concatenation of 2 consecutive tokens or words (see the sketch after the example below).
- “Last donut of the night”
- unigrams: ‘last’, ‘donut’, ‘of’, ‘the’, ‘night’.
- bigrams: ‘Last donut’, ‘donut of’, ‘of the’, ‘the night’.
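A minimal sketch of producing those unigrams and bigrams with plain Python (illustration only; fastText builds word n-grams internally when wordNgrams is set):
sentence = "Last donut of the night"
tokens = sentence.split()

unigrams = tokens
# Pair each token with its successor to form the bigrams.
bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]

print(unigrams)  # ['Last', 'donut', 'of', 'the', 'night']
print(bigrams)   # ['Last donut', 'donut of', 'of the', 'the night']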
Script
modelv5 = fasttext.train_supervised(input="data/news-articles.train", epoch=25, lr=1.0, wordNgrams=2)
modelv5.test("data/news-articles.valid")
modelv5.predict("narendra modi aquitted for gujarat riots by the court", k=-1, threshold=0.1)
Output
(40832, 0.6719974529780565, 0.6719974529780565)
(('__label__politics', '__label__crime', '__label__worldpost'), array([0.55176157, 0.24307342, 0.18037586]))
- loss
- hs (scaling up for production)
- training can be made faster on large datasets by using the hierarchical softmax instead of the regular softmax
- hierarchical softmax is a loss function that approximates the softmax with a much faster computation
- ova (multi-label classification)
- for handling multiple labels, a convenient way is to use independent binary classifiers for each label
- the one-vs-all (ova) loss helps achieve this.
Script
modelv8 = fasttext.train_supervised(input="data/news-articles.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='hs')
modelv8.test("data/news-articles.valid")
modelv9 = fasttext.train_supervised(input="data/news-articles.train", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50, loss='ova')
modelv9.test("data/news-articles.valid")
news = "justin beiber and selena gomez splits after 4 years of relationship"
print(modelv8.predict(news, k=-1, threshold=0.1))
print(modelv9.predict(news, k=-1, threshold=0.1))
Output
(40832, 0.6174568965517241, 0.6174568965517241)
(40832, 0.655907131661442, 0.655907131661442)
(('__label__entertainment',), array([0.99964833]))
(('__label__entertainment',), array([0.95397609]))
Autotuning the hyperparameters
- Finding the best hyperparameter values is crucial for building efficient ML models.
- Tuning these values manually is cumbersome since the parameters are interdependent and their effect on the final model varies from dataset to dataset.
- fastText provides an autotune feature for this task.
modelv10 = fasttext.train_supervised(input='data/news-articles.train', autotuneValidationFile='data/news-articles.valid')  # autotune with the default time budget
modelv11 = fasttext.train_supervised(input='data/news-articles.train', autotuneValidationFile='data/news-articles.valid', autotuneDuration=600)  # search for 600 seconds
modelv12 = fasttext.train_supervised(input='data/news-articles.train', autotuneValidationFile='data/news-articles.valid', autotuneModelSize="2M")  # constrain the final model size to 2 MB
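A small follow-up sketch, assuming the models above: evaluate the autotuned models on the validation file and save the size-constrained one. With autotuneModelSize the model is quantized, so the .ftz extension is conventional (the output file name here is an assumption):
# Compare the autotuned models on the held-out set.
print(modelv10.test("data/news-articles.valid"))
print(modelv12.test("data/news-articles.valid"))

# Save the size-constrained (quantized) model; file name is illustrative.
modelv12.save_model("model/news-classifier-autotuned.ftz")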