Is there a way to save a Gensim doc2vec model as plain text (.txt)?


So far I have only managed to produce models that are not human-readable. I need to save the model as plain text so I can use it with a certain piece of software, which requires the model in that format.

I tried the following:

model = models.doc2vec.Doc2Vec(size=300, min_count=0, alpha=0.025, min_alpha=0.025)
model.build_vocab(sentences)
model.train(sentences, total_examples=model.corpus_count, epochs=model.iter)
model.save('mymodel.txt')

But I get:

Process finished with exit code -1073741571 (0xC00000FD)

I do not know if I should pass a specific parameter.

asked on Stack Overflow Feb 26, 2018 by Mikel Laburu

1 Answer


The native gensim save() has no plain-text option: it uses core Python functionality such as object pickling, and writes large raw floating-point arrays to secondary files with extra extensions like .npy. Such files contain raw binary data, and merely giving the filename a .txt extension has no effect on what is written.

You can save just the word-vectors in the one-vector-per-line plain-text format used by the original Google word2vec.c, via the alternate method save_word2vec_format(). Also, recent versions of gensim Doc2Vec add an optional doctag_vec option to this method. If you supply doctag_vec=True, the doctag vectors will also be saved to the file, with their tag-names distinguished from word-vectors by an extra prefix. See the method's doc-comment and source code for more info.
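To make the target format concrete, here is a minimal stdlib-only sketch of what that plain-text layout looks like: a header line with the vector count and dimensionality, then one "name v1 v2 ..." line per vector. The `*dt_` prefix used for the doctag below is an assumption based on gensim's default `prefix` parameter; verify it against your gensim version.

```python
import os
import tempfile

def write_word2vec_text(path, vectors):
    """Write name->vector mappings in the word2vec.c plain-text format:
    a 'count dimensions' header, then one 'name v1 v2 ...' line per vector."""
    dims = len(next(iter(vectors.values())))
    with open(path, "w") as f:
        f.write("%d %d\n" % (len(vectors), dims))
        for name, vec in vectors.items():
            f.write(name + " " + " ".join("%f" % v for v in vec) + "\n")

# Toy example: two word-vectors plus one doctag vector, the doctag
# marked with the '*dt_' prefix (gensim's default, assumed here).
vectors = {
    "hello": [0.1, 0.2, 0.3],
    "world": [0.4, 0.5, 0.6],
    "*dt_DOC_0": [0.7, 0.8, 0.9],
}
path = os.path.join(tempfile.mkdtemp(), "vectors.txt")
write_word2vec_text(path, vectors)
```

A file in this shape is what downstream tools expecting the word2vec text format can usually load directly.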

However, no variant of save_word2vec_format() saves the entire model, with the internal model-weights and the vocabulary/doctag information (like relative frequencies) necessary for continued training. For that, you must use the native save(). If you need the full Doc2Vec model in a text format, you'll have to write that save code yourself, perhaps using the above method as a partial guide. (Additionally, I'm not aware of any preexisting convention for representing a whole model as text, so you'd have to find or devise one yourself to match the needs of wherever the full model will later be loaded.)

Separately regarding your Doc2Vec initialization parameters:

  • a min_count=0 is usually a bad idea: rare words make models worse, so the default of min_count=5 usually improves models, and as your corpus grows, even larger min_count values, discarding more low-frequency words, tend to help model quality (as well as speeding training and shrinking the model's RAM/save sizes)

  • a min_alpha equal to alpha is usually a bad idea: it means train() no longer performs the linear decay of the alpha learning-rate that is the usual, effective way of doing stochastic-gradient-descent optimization of the model
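The second point can be sketched without gensim at all: with a linear schedule, the effective learning rate falls from alpha toward min_alpha over the epochs, but setting min_alpha equal to alpha flattens the schedule entirely. A minimal illustration (0.025 and 0.0001 are gensim's default alpha and min_alpha; the helper function itself is hypothetical, and gensim's actual decay is finer-grained, per training example rather than per epoch):

```python
def alpha_schedule(alpha, min_alpha, epochs):
    """Learning rate at the start of each epoch, decaying linearly
    from alpha down toward min_alpha across all epochs."""
    return [alpha - (alpha - min_alpha) * epoch / epochs for epoch in range(epochs)]

# With alpha == min_alpha (the questioner's setting), the rate never decays:
flat = alpha_schedule(0.025, 0.025, 5)

# With the defaults, the rate shrinks every epoch, as SGD expects:
decaying = alpha_schedule(0.025, 0.0001, 5)
```

This is why leaving alpha and min_alpha at their defaults is usually the safer choice.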

answered on Stack Overflow Feb 26, 2018 by gojomo

User contributions licensed under CC BY-SA 3.0