Convert column values to float using a conversion function

Question

Given the following CSV:

+-------------------------------+-------------+--------------------+--------------+
|           Timestamp           | DoublePoint |      HexPoint      | BooleanPoint |
+-------------------------------+-------------+--------------------+--------------+
| 07/23/2019 16:53:12.523-07:00 |         0.0 | 0x0000000000000001 | True         |
| 07/23/2019 16:53:14.519-07:00 |         0.0 | 0x0000000000000002 | False        |
| 07/23/2019 16:53:16.516-07:00 |        0.25 | 0x0000000000000003 | true         |
| 07/23/2019 16:53:18.513-07:00 |        0.25 | 0x00000004         | false        |
| 07/23/2019 16:53:20.526-07:00 |         0.0 | 0x00000005         | True         |
| 07/23/2019 16:53:22.522-07:00 |        0.50 | 0x00000006         | False        |
| 07/23/2019 16:53:24.519-07:00 |         0.5 | 0x00000007         | True         |
| 07/23/2019 16:53:26.516-07:00 |      0.9999 | 0x00000008         | False        |
+-------------------------------+-------------+--------------------+--------------+

I need to read it with the pandas library and get a DataFrame in which every column except the first is float. For numeric columns this should happen automatically, but for other kinds of input, such as HexPoint and BooleanPoint, I need to provide a conversion function that turns the values into numbers.

In this example, the HexPoint values should be converted to their decimal value, and the BooleanPoint values should map True/true to 1 and False/false to 0.
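To make the intended mapping concrete, here is a minimal per-value sketch in plain Python. The helper name `to_number` is hypothetical, and it assumes hex values always carry a `0x` prefix, as they do in the sample data.

```python
def to_number(text):
    """Convert one CSV cell (a string) to float.

    Hypothetical helper: handles true/false in any case,
    0x-prefixed hex, and ordinary decimal numbers.
    """
    cleaned = text.strip().lower()
    if cleaned in ("true", "false"):
        return 1.0 if cleaned == "true" else 0.0
    if cleaned.startswith("0x"):
        # int() accepts the "0x" prefix when the base is 16.
        return float(int(cleaned, 16))
    return float(cleaned)


print(to_number("0x00000004"))  # 4.0
print(to_number("true"))        # 1.0
print(to_number("0.9999"))      # 0.9999
```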

So the resulting DataFrame should look like this:

+-------------------------------+-------------+----------+--------------+
|           Timestamp           | DoublePoint | HexPoint | BooleanPoint |
+-------------------------------+-------------+----------+--------------+
| 07/23/2019 16:53:12.523-07:00 |         0.0 |      1.0 |          1.0 |
| 07/23/2019 16:53:14.519-07:00 |         0.0 |      2.0 |          0.0 |
| 07/23/2019 16:53:16.516-07:00 |        0.25 |      3.0 |          1.0 |
| 07/23/2019 16:53:18.513-07:00 |        0.25 |      4.0 |          0.0 |
| 07/23/2019 16:53:20.526-07:00 |         0.0 |      5.0 |          1.0 |
| 07/23/2019 16:53:22.522-07:00 |        0.50 |      6.0 |          0.0 |
| 07/23/2019 16:53:24.519-07:00 |         0.5 |      7.0 |          1.0 |
| 07/23/2019 16:53:26.516-07:00 |      0.9999 |      8.0 |          0.0 |
+-------------------------------+-------------+----------+--------------+

Important considerations:

I don't know in advance how many columns the CSV has.
I don't know what kind of data the columns in the CSV contain. They could be any mix of double, hex and boolean values.
The only thing that can be assumed is that the first column is named "Timestamp" and contains timestamps.

Is there a way to tell pandas to read this CSV and try to convert all columns (except the first one) to float, and, when it can't do that natively, to run a custom function that takes the value and returns its numeric representation as described above?

python

python-3.x

pandas

csv

asked on Stack Overflow Feb 11, 2020 by

empz

2 Answers

This should do the trick.

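import pandas as pd

def convert_to_float(value):
    # Converters receive each cell as a string.
    text = value.strip().lower()
    if text in ("true", "false"):
        return float(text == "true")
    try:
        return float(text)
    except ValueError:
        # Not a decimal number, so treat it as hex, e.g. "0x00000004" -> 4.0
        return float(int(text, 16))

# Read only the header to get the column names, skipping "Timestamp".
columns = pd.read_csv(filename, nrows=0).columns[1:]
converters = {name: convert_to_float for name in columns}

df = pd.read_csv(filename, converters=converters)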
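For anyone who wants to try the converters approach without a file on disk, the following sketch runs it against an in-memory CSV. It assumes the real file is comma-separated (the question only shows a rendered table), and the names `SAMPLE_CSV` and `cell_to_float` are made up for this illustration.

```python
import io

import pandas as pd

# Two rows from the question, written as comma-separated text (assumed layout).
SAMPLE_CSV = """Timestamp,DoublePoint,HexPoint,BooleanPoint
07/23/2019 16:53:16.516-07:00,0.25,0x0000000000000003,true
07/23/2019 16:53:18.513-07:00,0.25,0x00000004,False
"""


def cell_to_float(text):
    # Hypothetical converter: booleans first, then decimal, then hex.
    cleaned = text.strip().lower()
    if cleaned in ("true", "false"):
        return float(cleaned == "true")
    try:
        return float(cleaned)
    except ValueError:
        return float(int(cleaned, 16))


# First pass reads only the header so the column names need not be known in advance.
header = pd.read_csv(io.StringIO(SAMPLE_CSV), nrows=0).columns
converters = {name: cell_to_float for name in header[1:]}

# Second pass applies the converter to every column except the first.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), converters=converters)
print(df)
```

Reading the header separately costs one extra pass over the start of the file, but it keeps the converter mapping independent of how many columns the CSV happens to have.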

answered on Stack Overflow Feb 11, 2020 by

Ian

Double and boolean columns like the ones in your table can be converted to float directly with astype(float) once pandas has read them. Hex strings are not accepted by float(), so they need to go through int(x, 16) first.

Try this:

import pandas as pd

df = pd.read_csv("data.csv")

column_names = df.columns.tolist()
column_names.remove("Timestamp")

print(df)
print(df.dtypes)

print(type(df["DoublePoint"]))

for name in column_names:
  try:
    df[name] = df[name].astype(float)
  except ValueError:
    df[name] = df[name].apply(lambda x: float(int(x, 16)))

print(df)
print(df.dtypes)

Also, two of the boolean values in your input are lowercase (true/false), which I think is not correct. If they are intentional, you may need to normalize them to True/False like the rest of the values.
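If the mixed True/true spelling is a concern, one option is to read every column as text and normalize the case yourself before converting, so the result does not depend on how pandas infers types. This is a sketch of that variant, not the code from the answer; `column_to_float`, `BOOLEAN_WORDS` and `SAMPLE_CSV` are hypothetical names, and the sample is assumed to be comma-separated.

```python
import io

import pandas as pd

SAMPLE_CSV = """Timestamp,DoublePoint,HexPoint,BooleanPoint
07/23/2019 16:53:12.523-07:00,0.0,0x0000000000000001,True
07/23/2019 16:53:16.516-07:00,0.25,0x00000004,true
07/23/2019 16:53:22.522-07:00,0.5,0x00000006,False
"""

BOOLEAN_WORDS = {"true": 1.0, "false": 0.0}


def column_to_float(column):
    # Hypothetical helper: converts a whole text column at once.
    text = column.astype(str).str.strip().str.lower()
    if text.isin(list(BOOLEAN_WORDS)).all():
        # Every cell is true/false in some capitalization.
        return text.map(BOOLEAN_WORDS).astype(float)
    try:
        return text.astype(float)
    except ValueError:
        # Assumed to be a hex column; int() accepts the "0x" prefix with base 16.
        return text.map(lambda cell: float(int(cell, 16)))


# dtype=str keeps every cell as text, so no automatic type inference takes place.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), dtype=str)
for name in df.columns[1:]:
    df[name] = column_to_float(df[name])

print(df)
```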

answered on Stack Overflow Feb 11, 2020 by

Prashant Kumar • edited Feb 12, 2020 by

empz

User contributions licensed under CC BY-SA 3.0