Is the Paragram_300_SL999 Word Embeddings file corrupt?


I need to use the Paragram_SL999_300 embeddings for my project, which builds on the open-source code from a published article (https://github.com/cecilialeiqi/adversarial_text). When I try to run Step 4 (generate adversarial examples) from that repository, I get a ValueError saying int() was expected but ',' was received. From the readme.txt for Paragram-SL999 300, I know the file is supposed to contain one token per line followed by its embedding values. When I try to open Paragram_SL999_300.txt to check whether it matches this format, the text editor loads about half of the file and then closes without letting me edit it, and LibreOffice crashes if I open the file there instead. This was in an Ubuntu 18.04 virtual machine.

I wasn't sure whether the problem was in the author's code (in discrete_attack.py at https://github.com/cecilialeiqi/adversarial_text/blob/master/src/discrete_attack.py) or in a corrupt file, so I tried downloading and extracting the Paragram-SL999 300 archive from Wieting's website (http://www.cs.cmu.edu/~jwieting/) on my Windows computer. There I get a message saying the archive is corrupted, which prevents me from extracting Paragram_SL999_300.txt and using it. On another Windows computer, I get Error Code 0x80004005 (Unspecified error) when trying to extract the archive.
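Since the file is too large for GUI editors, one way to check whether it really is one token per line followed by its embedding values is to read just the first few lines programmatically. A minimal Python sketch (the file name and location are assumptions, adjust to your extracted path):

    # Peek at the first few lines of the embeddings file without loading it all.
    # NOTE: the file name below is an assumption -- change it to your extracted path.
    path = "Paragram_SL999_300.txt"

    with open(path, encoding="utf-8", errors="replace") as f:
        for i, line in enumerate(f):
            parts = line.rstrip("\n").split(" ")
            # Expect one token followed by 300 floating-point values per line.
            print(f"line {i}: first field={parts[0]!r}, remaining fields={len(parts) - 1}")
            if i >= 4:  # only inspect the first five lines
                break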

Is there any way to work around this issue, or can anyone provide insight into it? Would it instead be recommended to produce the embeddings from Wieting's GitHub (https://github.com/jwieting/paragram-word)? I would very much appreciate any input, as these embeddings are essential to my project.

python
nlp
lstm
embedding
cnn
asked on Stack Overflow Feb 28, 2020 by BeginnerByron

1 Answer


I managed to download it from the Google Drive link at https://drive.google.com/file/d/0B9w48e1rj-MOck1fRGxaZW1LU2M/view?usp=sharing. In the end it worked, but I'm not sure why I was unable to get it to work the other times. I also hadn't realised that, for the code I had, I needed to add the vocabulary size and the embedding size as the first line of the file (1703756 300).
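Rather than editing the huge file by hand, the "1703756 300" header line mentioned above can be prepended by streaming the file to a new copy. A minimal Python sketch (both file names are assumptions, substitute your own paths):

    # Write a copy of the embeddings file with the "<vocab_size> <dim>" header
    # (1703756 300) prepended, instead of opening the huge file in an editor.
    # NOTE: both file names are assumptions -- substitute your own paths.
    src = "Paragram_SL999_300.txt"
    dst = "Paragram_SL999_300_with_header.txt"

    with open(src, encoding="utf-8", errors="replace") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        fout.write("1703756 300\n")  # vocabulary size and embedding dimension
        for line in fin:
            fout.write(line)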

answered on Stack Overflow Mar 8, 2020 by BeginnerByron

User contributions licensed under CC BY-SA 3.0