I have a text file containing words in a non-English alphabet. I want to open it, do some preprocessing, save the result as a CSV file, and use it somewhere else.
The code to read the file and store its lines:
with open('file.txt', encoding="utf-8") as f:
    train = f.read().splitlines()
Then I create a dataframe, and this is the code to store it:
df.to_csv('file.csv', index=True, encoding="utf-8")
Up to this point everything seems fine, but when I try to open file.csv with this code:
train = pd.read_csv('file.csv', encoding="utf-8")
I get this:
Process finished with exit code -1073740940 (0xC0000374)
and execution never reaches the following lines.
Also, when I open it with ISO-8859-1 encoding it loads, but when I print the head of that CSV it just prints question marks ('?').
Does anyone know what is going wrong?
Any kind of help will be appreciated.
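On the ISO-8859-1 observation: that codec maps every possible byte to some character, so it never raises an error, but UTF-8 bytes read that way come out as the wrong characters. The sketch below illustrates this with a sample Persian word chosen here for illustration (it is not taken from the original file); it does not explain the crash itself.

```python
# Sample word used only for illustration; not from the asker's data.
original = "سلام"

# Each of these four letters takes two bytes in UTF-8.
raw = original.encode("utf-8")

# ISO-8859-1 decodes any byte sequence without error, one character per byte,
# so the result has twice as many characters and none of them are Persian.
misread = raw.decode("iso-8859-1")

# Because the mapping is one-to-one, the damage is reversible as long as
# the misread string has not been altered in between.
recovered = misread.encode("iso-8859-1").decode("utf-8")

print(len(original), len(raw), len(misread))
print(recovered == original)
```

A console that cannot display the misread characters may then show them as '?', which would be consistent with what the question describes.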
I tried reproducing it with this code:
import pandas as pd
with open('persian.txt', encoding="utf-8") as f:
    train = f.read().splitlines()
df = pd.DataFrame({'text': train})
df.to_csv('file.csv', index=True, encoding="utf-8")
train = pd.read_csv('file.csv', encoding="utf-8")
using a text file containing two lines of sample Persian text. It ran without any problems in Python 3, producing this CSV:
text
0 همهٔ افراد بشر آزاد به دنیا میآیند و حیثیت و حقوق شان با هم برابر است
1 همه اندیشه و وجدان دارند و باید در برابر یکدیگر با روح برادری رفتار کنند.
Can you provide more detail on the text properties and the operations you did in the dataframe processing, or identify the line where the reading breaks? You might be producing some invalid characters on the way.
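One way to look for such invalid characters is to read the CSV as raw bytes and try to decode each line as UTF-8. This is only a rough sketch: the helper name `find_invalid_utf8_lines` and the demo file are made up for illustration, and it will not catch problems that are valid UTF-8 but still upset the parser.

```python
import os
import tempfile


def find_invalid_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8."""
    bad = []
    with open(path, "rb") as f:
        # Iterating a binary file yields one bytes object per b"\n"-terminated line.
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


# Demo on a throwaway file: line 2 contains the byte 0xFF,
# which can never appear in valid UTF-8.
demo_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(demo_path, "wb") as f:
    f.write("text\n".encode("utf-8"))
    f.write(b"bad \xff byte\n")
    f.write("سلام\n".encode("utf-8"))

bad_lines = find_invalid_utf8_lines(demo_path)
print(bad_lines)
```

Running the helper on the real file.csv instead of the demo file would show which lines, if any, are not valid UTF-8.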
I was going crazy trying to write Persian text to a CSV file. Finally, this worked for me:
data.to_csv(r'hi.csv', encoding='utf-8-sig')
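For context, the only difference between 'utf-8' and 'utf-8-sig' when writing is a three-byte marker (the byte order mark, BOM) at the start of the output. Some programs use it to recognise the file as UTF-8. The sketch below shows this with Python's built-in codecs; the sample word is an illustration, not data from this thread.

```python
# Sample word used only for illustration.
text = "سلام"

plain = text.encode("utf-8")
with_bom = text.encode("utf-8-sig")

# 'utf-8-sig' writes the UTF-8 byte order mark first, then the same bytes as 'utf-8'.
print(with_bom == b"\xef\xbb\xbf" + plain)

# Decoding with 'utf-8-sig' strips the marker again.
print(with_bom.decode("utf-8-sig") == text)

# Decoding with plain 'utf-8' keeps it as a leading U+FEFF character.
print(with_bom.decode("utf-8") == "\ufeff" + text)
```

So a file written with 'utf-8-sig' is best read back with 'utf-8-sig' as well, so the marker does not end up in the first value.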