
Pre-trained embeddings into Torch LookupTables before LSTM training

13 Jul 2016

A promising way to improve performance and reduce training time for an RNN / LSTM is to (a) use embeddings / LookupTables and (b) pre-train those embeddings before starting full-blown training. I'm interested in how much better a well-trained embedding might be over, say, a one-hot encoding.
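
To make the comparison concrete, here is a minimal Torch sketch of the two representations (the vocabulary size and token index are just illustrative placeholders):

require 'torch'
require 'nn'

local vocabSize, embDim = 10000, 50       -- illustrative sizes

-- One-hot: each token is a sparse vocabSize-dimensional on/off vector
local oneHot = torch.zeros(vocabSize)
oneHot[42] = 1                            -- token number 42 switched "on"

-- Embedding: the same token index selects a dense embDim-dimensional row
local lut = nn.LookupTable(vocabSize, embDim)
local dense = lut:forward(torch.LongTensor{42})   -- a 1 x 50 tensor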

I've been using GloVe (Global Vectors for Word Representation) from Stanford on my vocabulary. The final output of interest from GloVe is simply a file called vectors.txt, with one newline-terminated row per unique token in the ingested dataset (after a frequency threshold is applied). Here's a sample row for a vector with 50 elements:

MacBook -0.294293 0.545009 -0.660867 0.270468 0.715261 -0.130113 -0.779469 -0.959317 0.617241 -0.515404 -0.128529 -0.359542 -0.654849 0.007402 -0.733078 -0.053148 -1.372696 -0.534287 -0.944885 -0.496881 -0.937919 0.396898 -0.301901 -0.001513 0.354155 0.037759 0.950121 -0.117516 -0.149603 0.420002 0.512260 -1.876207 -0.729102 -2.751064 0.478130 0.670697 -1.197910 -0.360405 0.730464 0.387873 1.980520 0.249774 -0.924052 0.616303 1.003639 1.410831 0.023768 1.084876 0.929960 -1.247777

Now how to get this data into a Torch LookupTable? Well, here's some simple Lua code to do exactly that:

require 'torch'
require 'nn'

function Data:loadEmbeddings()
   if self.emb ~= nil then
      return self.emb
   end

   local path = "/home/hsheil/code/glove/vectors.txt"
   local dim = 50

   -- first pass: count the rows so the LookupTable can be sized correctly
   local vocabSize = 0
   for _ in io.lines(path) do
      vocabSize = vocabSize + 1
   end
   self.emb = nn.LookupTable(vocabSize, dim)

   local i = 1
   local csvFile = assert(io.open(path, 'r'))
   while true do
      local line = csvFile:read('*l')
      if line == nil then
         break
      end
      local vals = line:split(' ')
      local word = torch.Tensor(dim)
      for k, v in ipairs(vals) do
         if k ~= 1 then
            -- the first element is the token itself; the rest are the vector values
            word[k - 1] = tonumber(v)
         end
      end
      self.emb.weight[i] = word -- set the pre-trained values in the matrix
      if i % 5000 == 0 then
         print(i .. ' embeddings loaded')
      end
      i = i + 1
   end
   csvFile:close()

   return self.emb
end
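
With the LookupTable populated, it can sit in front of the LSTM as the first layer of the network. Here is a rough sketch of that wiring, assuming the Element-Research rnn package and illustrative layer sizes:

require 'nn'
require 'rnn'

-- 'data' is assumed to be an instance of the Data class above
local emb = data:loadEmbeddings()           -- the pre-trained LookupTable

local model = nn.Sequential()
model:add(emb)                              -- token indices -> 50-dim embeddings
model:add(nn.SplitTable(1))                 -- seqLen x batchSize x 50 -> table of timesteps
model:add(nn.Sequencer(nn.LSTM(50, 128)))   -- 50-dim inputs, 128-dim hidden state
model:add(nn.SelectTable(-1))               -- keep only the last timestep's output
model:add(nn.Linear(128, 2))                -- e.g. a two-class classifier head
model:add(nn.LogSoftMax())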

One TODO on this code is to also store each token in a Lua table so it can act as an index into the LookupTable we create, but that is pretty trivial to do - Lua tables are very nice in this respect; a rough sketch follows below.
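
A minimal sketch of that TODO (the name word2idx is just illustrative): the token in the first column of each row is captured while reading vectors.txt, so input words can later be mapped to LookupTable row indices:

-- build a token -> row-index map while reading vectors.txt
local word2idx = {}
local i = 1
for line in io.lines("/home/hsheil/code/glove/vectors.txt") do
   local token = line:split(' ')[1]   -- first field on each row is the token itself
   word2idx[token] = i                -- i matches the row set in emb.weight[i]
   i = i + 1
end

-- later, at lookup time:
-- local idx = word2idx['MacBook']    -- index into the LookupTable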

Intuitively, an embedding should make the learning task of an RNN / LSTM easier - we are now providing contextual information for each concept of interest, rather than just a simple on / off switch.



