Convert column values to float using a conversion function

Given the following CSV:

+-------------------------------+-------------+--------------------+--------------+
|           Timestamp           | DoublePoint |      HexPoint      | BooleanPoint |
+-------------------------------+-------------+--------------------+--------------+
| 07/23/2019 16:53:12.523-07:00 |         0.0 | 0x0000000000000001 | True         |
| 07/23/2019 16:53:14.519-07:00 |         0.0 | 0x0000000000000002 | False        |
| 07/23/2019 16:53:16.516-07:00 |        0.25 | 0x0000000000000003 | true         |
| 07/23/2019 16:53:18.513-07:00 |        0.25 | 0x00000004         | false        |
| 07/23/2019 16:53:20.526-07:00 |         0.0 | 0x00000005         | True         |
| 07/23/2019 16:53:22.522-07:00 |        0.50 | 0x00000006         | False        |
| 07/23/2019 16:53:24.519-07:00 |         0.5 | 0x00000007         | True         |
| 07/23/2019 16:53:26.516-07:00 |      0.9999 | 0x00000008         | False        |
+-------------------------------+-------------+--------------------+--------------+

I need to read it with the pandas library and get a DataFrame where all the columns, except the first one, are float. For numbers, this should be automatic, but for other types of input, such as HexPoint and BooleanPoint, I need to provide a conversion function that turns them into numbers.

In this example, the HexPoint values should be converted to decimal and the BooleanPoints should convert True/true to 1 and False/false to 0.

So the resulting DataFrame should look like this:

+-------------------------------+-------------+----------+--------------+
|           Timestamp           | DoublePoint | HexPoint | BooleanPoint |
+-------------------------------+-------------+----------+--------------+
| 07/23/2019 16:53:12.523-07:00 |         0.0 |      1.0 |          1.0 |
| 07/23/2019 16:53:14.519-07:00 |         0.0 |      2.0 |          0.0 |
| 07/23/2019 16:53:16.516-07:00 |        0.25 |      3.0 |          1.0 |
| 07/23/2019 16:53:18.513-07:00 |        0.25 |      4.0 |          0.0 |
| 07/23/2019 16:53:20.526-07:00 |         0.0 |      5.0 |          1.0 |
| 07/23/2019 16:53:22.522-07:00 |        0.50 |      6.0 |          0.0 |
| 07/23/2019 16:53:24.519-07:00 |         0.5 |      7.0 |          1.0 |
| 07/23/2019 16:53:26.516-07:00 |      0.9999 |      8.0 |          0.0 |
+-------------------------------+-------------+----------+--------------+

Important considerations:

  • I don't know in advance how many columns the CSV has.
  • I don't know what kind of data the columns in the CSV are. They could be a mix of double, hex and boolean values.
  • The only thing that can be assumed is that the first column is named "Timestamp" and contains timestamps.

Is there a way to tell pandas to read this CSV and try to convert all columns (except the first one) to float and, when it can't do that natively, run a custom function that takes the value and returns its numeric representation as described above?

python
python-3.x
pandas
csv
asked on Stack Overflow Feb 11, 2020 by empz

2 Answers

This should do the trick.

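import pandas as pd

def convert_to_float(value):
    text = value.strip().lower()
    # Booleans first: True/true -> 1.0, False/false -> 0.0.
    if text in ("true", "false"):
        return float(text == "true")
    try:
        return float(text)
    except ValueError:
        # Not a plain number, so treat it as hex (e.g. 0x00000004).
        return float(int(text, 16))

# Read only the header to get the column names, then skip "Timestamp".
columns = pd.read_csv(filename, nrows=0).columns[1:]
converters = {name: convert_to_float for name in columns}

df = pd.read_csv(filename, converters=converters)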
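
To try the converters approach without a file on disk, here is a minimal self-contained sketch that runs against an in-memory copy of a few rows from the question. It assumes the real file is comma-separated; the names `CSV_TEXT` and `to_float` are hypothetical and only serve the illustration.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file; the comma-separated layout is an assumption.
CSV_TEXT = """Timestamp,DoublePoint,HexPoint,BooleanPoint
07/23/2019 16:53:12.523-07:00,0.0,0x0000000000000001,True
07/23/2019 16:53:16.516-07:00,0.25,0x0000000000000003,true
07/23/2019 16:53:18.513-07:00,0.25,0x00000004,false
07/23/2019 16:53:26.516-07:00,0.9999,0x00000008,False
"""


def to_float(value):
    """Hypothetical converter: try booleans, then plain numbers, then hex."""
    text = value.strip().lower()
    if text in ("true", "false"):
        return float(text == "true")
    try:
        return float(text)
    except ValueError:
        return float(int(text, 16))


# Read only the header row to learn the column names.
header = pd.read_csv(io.StringIO(CSV_TEXT), nrows=0).columns

# Attach the converter to every column except the first ("Timestamp").
converters = {name: to_float for name in header[1:]}

df = pd.read_csv(io.StringIO(CSV_TEXT), converters=converters)
print(df.dtypes)
```

Because the converter is attached by column name, the number and kind of columns do not need to be known in advance; only the position of the timestamp column is assumed.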
answered on Stack Overflow Feb 11, 2020 by Ian

Double and boolean columns like the ones in your table can be converted with astype(float) once pandas has parsed them. Hex strings cannot be passed to float() directly; they need to go through int(x, 16) first.

Try this :

import pandas as pd

df = pd.read_csv("data.csv")

column_names = df.columns.tolist()
column_names.remove("Timestamp")

print(df)
print(df.dtypes)

for name in column_names:
    try:
        # Numeric and boolean columns convert directly.
        df[name] = df[name].astype(float)
    except ValueError:
        # Remaining string columns are assumed to hold hex values such as 0x00000004.
        df[name] = df[name].apply(lambda x: float(int(x, 16)))

print(df)
print(df.dtypes)

Also, your input contains true/false in lowercase in two rows, which I don't think is correct. If it is intentional, you may need to change them to True/False to match the rest of the values.
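
As far as I know, pandas' default parser accepts case-insensitive variants of True and False, so the mixed casing may not need changing at all. The sketch below is a quick way to check this on your own pandas version; the two-column sample and its placeholder timestamps are made up for the test.

```python
import io

import pandas as pd

# Hypothetical sample mixing "True"/"False" with "true"/"false".
sample = io.StringIO(
    "Timestamp,BooleanPoint\n"
    "t0,True\n"
    "t1,False\n"
    "t2,true\n"
    "t3,false\n"
)

df = pd.read_csv(sample)

# Inspect what the parser inferred for the mixed-case column.
print(df["BooleanPoint"].dtype)

# If the column was parsed as booleans, this cast succeeds;
# if it was left as strings, it raises ValueError instead.
df["BooleanPoint"] = df["BooleanPoint"].astype(float)
print(df["BooleanPoint"].tolist())
```

If the cast raises ValueError on your version, normalising the casing first (or using a converter function) would be the fallback.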

answered on Stack Overflow Feb 11, 2020 by Prashant Kumar • edited Feb 12, 2020 by empz

User contributions licensed under CC BY-SA 3.0