I have a document-term matrix with 9000 rows (documents) and 1810 columns (terms). I have applied PCA for dimensionality reduction, which outputs, say, a 9000x200 matrix.
My goal is to cluster this data, and the next step is to apply a similarity metric such as cosine_similarity from sklearn.
If I run cosine_similarity directly on my DT matrix (which is, obviously, sparse), everything works fine. But if I run cosine_similarity on the matrix resulting from PCA, let's call it reducted_dtm, then I get the following error in PyCharm:
RecursionError: maximum recursion depth exceeded while calling a Python object
Process finished with exit code -1073741571 (0xC00000FD)
Here is my code (dtm is my document-term matrix; my code actually takes the transposed term-document matrix tdm as input):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

dtm = tdm.T
# Scale dtm into the range [0, 1] so that variance maximization
# in PCA is not dominated by large-valued terms
scl = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scl.fit_transform(dtm)
# Fit PCA on the rescaled data and project onto n components
pca = PCA(n_components=n).fit(data_rescaled)
data_reducted = pca.transform(data_rescaled)
# For continuity with my pipeline, wrap the reduced
# matrix in a DataFrame
dtm_reducted = pd.DataFrame(data_reducted)
# Here apply cosine similarity on the reduced matrix
cs = cosine_similarity(dtm_reducted)
cs_pd = pd.DataFrame(cs)
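For reference, the same pipeline runs end-to-end on a small random matrix (a sketch only: the shapes, the toy tdm, and n_components=10 are illustrative, not my real data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
tdm = pd.DataFrame(rng.random((50, 30)))  # toy term-document matrix
dtm = tdm.T                               # 30 documents x 50 terms

# Scale to [0, 1], reduce with PCA, then compute cosine similarity
data_rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(dtm)
pca = PCA(n_components=10).fit(data_rescaled)
dtm_reducted = pd.DataFrame(pca.transform(data_rescaled))

cs_pd = pd.DataFrame(cosine_similarity(dtm_reducted))
print(cs_pd.shape)  # one similarity row per document: (30, 30)
```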
OMG, I have realized that in my code I was calling sklearn's cosine_similarity from a function that I had also named cosine_similarity. My function's name shadowed the imported one, so the call inside it invoked the function itself, causing an infinite recursive loop.