Loading pre-trained embeddings into Torch LookupTables before LSTM training
A promising way to improve performance and reduce training time for an RNN / LSTM is to (a) use embeddings / LookupTables and (b) pre-train those embeddings before commencing full-blown training. I'm interested in how much better a well-trained embedding might be than, say, a one-hot encoding.
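To make the contrast concrete, here is a minimal sketch (assuming Torch's `nn` package is installed; the vocabulary and embedding sizes are illustrative, not from the original post). A one-hot encoding represents each token as a sparse vector as wide as the vocabulary, with a single 1 in it; a LookupTable instead maps each token index to a dense, trainable row of a weight matrix:

```lua
require 'nn'

-- Illustrative sizes: a 10,000-token vocabulary, 50-dimensional embeddings
local vocabSize, embDim = 10000, 50

-- One-hot: each token would be a 10,000-element vector that is all zeros
-- except for a single 1. A LookupTable replaces that with a dense 50-element
-- row of a learned weight matrix, selected by token index.
local emb = nn.LookupTable(vocabSize, embDim)

local tokenIds = torch.LongTensor{42, 7, 999} -- a 3-token input sequence
local vectors = emb:forward(tokenIds)         -- a 3x50 dense matrix
print(vectors:size())
```

The dense rows carry learned similarity structure (nearby words get nearby vectors), which is exactly what the one-hot representation lacks.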
I've been using GloVe (Global Vectors for Word Representation) from Stanford on my vocabulary. The final output of interest from GloVe is a file called vectors.txt, containing one newline-terminated row per unique token (after a frequency threshold is applied) in the ingested dataset. Here's a sample row for a 50-element vector:
MacBook -0.294293 0.545009 -0.660867 0.270468 0.715261 -0.130113 -0.779469 -0.959317 0.617241 -0.515404 -0.128529 -0.359542 -0.654849 0.007402 -0.733078 -0.053148 -1.372696 -0.534287 -0.944885 -0.496881 -0.937919 0.396898 -0.301901 -0.001513 0.354155 0.037759 0.950121 -0.117516 -0.149603 0.420002 0.512260 -1.876207 -0.729102 -2.751064 0.478130 0.670697 -1.197910 -0.360405 0.730464 0.387873 1.980520 0.249774 -0.924052 0.616303 1.003639 1.410831 0.023768 1.084876 0.929960 -1.247777
Now, how do we get this data into a Torch LookupTable? Here's some simple Lua code to do exactly that:
```lua
function Data:loadEmbeddings()
  if self.emb ~= nil then return self.emb end
  -- Assumes self.vocabSize holds the number of rows in vectors.txt
  self.emb = nn.LookupTable(self.vocabSize, 50)
  local i = 1
  local csvFile = assert(io.open("/home/hsheil/code/glove/vectors.txt", 'r'))
  while true do
    local line = csvFile:read('*l')
    if line == nil then break end
    local vals = line:split(' ')
    local word = torch.Tensor(50)
    for k, v in ipairs(vals) do
      if k ~= 1 then -- skip the first element: it's the token itself, not a number
        word[k - 1] = tonumber(v)
      end
    end
    self.emb.weight[i] = word -- set the pre-trained values in the matrix
    i = i + 1
    if i % 5000 == 0 then print(i .. ' embeddings loaded') end
  end
  csvFile:close()
  return self.emb
end
```
The remaining TODO on this code is to store each token in a Lua table that acts as an index into the LookupTable we create, but that is pretty trivial to do: Lua tables are very nice in this respect.
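That token index could be sketched as follows (a hypothetical addition of mine, not the original code; it assumes the same vectors.txt path and that the token is the first whitespace-delimited field on each row):

```lua
-- Hypothetical sketch: build a word -> row-index table for the LookupTable
local vocab = {}
local i = 1
for line in io.lines("/home/hsheil/code/glove/vectors.txt") do
  local word = line:match("^(%S+)") -- first field on each row is the token
  vocab[word] = i
  i = i + 1
end
-- vocab["MacBook"] would now give that token's row in the LookupTable
```

With this table in hand, mapping an input sentence to the LongTensor of indices that the LookupTable expects is a one-line lookup per token.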
Intuitively, an embedding should make the learning task of an RNN / LSTM easier: we are now providing contextual information for each concept of interest, not just a simple on / off switch.