I have a document-term matrix with 9000 rows (documents) and 1810 columns (terms). I have applied PCA for dimensionality reduction, which outputs, say, a 9000x200 matrix.
My goal is to cluster this data, and the next step is to apply a similarity metric such as cosine_similarity from sklearn.
If I run cosine_similarity directly on my DT matrix (which is, obviously, sparse), everything works fine. But if I run cosine_similarity on the matrix resulting from PCA, let's call it reducted_dtm, then I get the following error in PyCharm:
RecursionError: maximum recursion depth exceeded while calling a Python object
Process finished with exit code -1073741571 (0xC00000FD)
Here is my code (dtm is my document-term matrix; my code actually takes the transposed term-document matrix tdm as input):
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

dtm = tdm.T
# Scale dtm into the range [0, 1] so that variance maximization
# in PCA is not dominated by large-valued terms
scl = MinMaxScaler(feature_range=(0, 1))
data_rescaled = scl.fit_transform(dtm)
# Fit PCA on the rescaled data and project onto n components
pca = PCA(n_components=n).fit(data_rescaled)
data_reducted = pca.transform(data_rescaled)
# For continuity with my pipeline, wrap the reduced
# matrix in a DataFrame
dtm_reducted = pd.DataFrame(data_reducted)
# Here apply cosine similarity on the reduced matrix
cs = cosine_similarity(dtm_reducted)
cs_pd = pd.DataFrame(cs)
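For reference, the same pipeline runs end-to-end on a small random matrix (a sketch only: the shapes, the toy tdm, and n_components=10 are illustrative, not my real data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.metrics.pairwise import cosine_similarity

rng = np.random.default_rng(0)
tdm = pd.DataFrame(rng.random((50, 30)))  # toy term-document matrix
dtm = tdm.T                               # 30 documents x 50 terms

# Scale to [0, 1], reduce with PCA, then compute cosine similarity
data_rescaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(dtm)
pca = PCA(n_components=10).fit(data_rescaled)
dtm_reducted = pd.DataFrame(pca.transform(data_rescaled))

cs_pd = pd.DataFrame(cosine_similarity(dtm_reducted))
print(cs_pd.shape)  # one similarity row per document: (30, 30)
```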
OMG, I have realized that in my code I was calling sklearn's cosine_similarity from a function that I had also named cosine_similarity. My function's name shadowed the imported one, so the call inside it invoked the function itself, causing an infinite recursive loop.