apache spark - How to load CSVs with timestamps in custom format?

I have a timestamp field in a tiny CSV file that I am reading into a DataFrame using the Spark CSV library. The same piece of code works on my local machine with Databricks Spark 2.0, but throws an error on Azure Hortonworks HDP 3.5 and 3.6. I have checked, and Azure HDInsight 3.5 uses the same Spark version, so I don't think this is a problem with the Spark version.

    import org.apache.spark.sql.types._

    val sourceFile = "c:\\2017\\datetest"
    val sourceSchemaStruct = new StructType()
      .add("eventDate", DataTypes.TimestampType)
      .add("name", DataTypes.StringType)

    val df = spark.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("delimiter", "|")
      .option("mode", "FAILFAST")
      .option("inferSchema", "false")
      .option("dateFormat", "yyyy/MM/dd HH:mm:ss.SSS")
      .schema(sourceSchemaStruct)
      .load(sourceFile)

    java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]

The CSV file has only one row, as follows:

    "eventDate"|"name"
    "2016/12/19 00:43:27.583"|"adam"

I've managed to reproduce the issue in the latest Spark version, 2.2.0-SNAPSHOT (built from master).

    // OS shell
    $ cat so-43259485.csv
    "eventDate"|"name"
    "2016/12/19 00:43:27.583"|"adam"

    // spark-shell
    case class Event(eventDate: java.sql.Timestamp, name: String)
    import org.apache.spark.sql.Encoders
    val schema = Encoders.product[Event].schema

    scala> spark
      .read
      .format("csv")
      .option("header", true)
      .option("mode", "FAILFAST")
      .option("delimiter", "|")
      .schema(schema)
      .load("so-43259485.csv")
      .show(false)
    17/04/08 11:03:42 ERROR Executor: Exception in task 0.0 in stage 7.0 (TID 7)
    java.lang.IllegalArgumentException: Timestamp format must be yyyy-mm-dd hh:mm:ss[.fffffffff]
        at java.sql.Timestamp.valueOf(Timestamp.java:237)
        at org.apache.spark.sql.catalyst.util.DateTimeUtils$.stringToTime(DateTimeUtils.scala:167)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply$mcJ$sp(UnivocityParser.scala:146)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
        at org.apache.spark.sql.execution.datasources.csv.UnivocityParser$$anonfun$makeConverter$9$$anonfun$apply$17$$anonfun$apply$6.apply(UnivocityParser.scala:146)
        at scala.util.Try.getOrElse(Try.scala:79)

The corresponding line in the Spark sources is the following:

    Timestamp.valueOf(s)

Having read the javadoc of Timestamp.valueOf, you can learn that the argument should be:

"timestamp in format yyyy-[m]m-[d]d hh:mm:ss[.f...]. The fractional seconds may be omitted. The leading zero for mm and dd may also be omitted."

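Note "The fractional seconds may be omitted", so let's cut them off by first loading eventDate as a String and converting it to a Timestamp only after removing the unneeded fractional seconds. The date part also has to use dashes rather than slashes.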

    val eventsAsString = spark.read.format("csv")
      .option("header", true)
      .option("mode", "FAILFAST")
      .option("delimiter", "|")
      .load("so-43259485.csv")
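To make the javadoc rule concrete, here is a minimal sketch that calls java.sql.Timestamp.valueOf directly, outside Spark. It only illustrates which input shapes are accepted; the value names (withMillis, withoutMillis, withSlashes) are made up for this example.

```scala
import java.sql.Timestamp
import scala.util.Try

// Dashes with fractional seconds: accepted.
val withMillis = Try(Timestamp.valueOf("2016-12-19 00:43:27.583"))

// Dashes without fractional seconds: also accepted.
val withoutMillis = Try(Timestamp.valueOf("2016-12-19 00:43:27"))

// Slashes, as in the CSV file: rejected with IllegalArgumentException.
val withSlashes = Try(Timestamp.valueOf("2016/12/19 00:43:27.583"))

println(withMillis)
println(withoutMillis)
println(withSlashes.isFailure)
```

The third call fails in the same way as the stack trace above, which suggests that the slashes, and not only the fractional seconds, are what Timestamp.valueOf objects to.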

Spark 2.1.0

Use schema inference in CSV via the inferSchema option together with your custom timestampFormat.

It's important to trigger schema inference using inferSchema for timestampFormat to take effect.

    val events = spark.read
      .format("csv")
      .option("header", true)
      .option("mode", "FAILFAST")
      .option("delimiter", "|")
      .option("inferSchema", true)
      .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")
      .load("so-43259485.csv")

    scala> events.show(false)
    +-------------------+----+
    |eventDate          |name|
    +-------------------+----+
    |2016-12-19 00:43:27|adam|
    +-------------------+----+

    scala> events.printSchema
    root
     |-- eventDate: timestamp (nullable = true)
     |-- name: string (nullable = true)
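The pattern letters passed to timestampFormat are case-sensitive, which is easy to get wrong. The sketch below uses java.text.SimpleDateFormat rather than Spark itself, on the assumption that the Spark 2.x CSV reader interprets the option with SimpleDateFormat-compatible pattern letters. The value names are illustrative only.

```scala
import java.sql.Timestamp
import java.text.SimpleDateFormat
import java.util.Locale
import scala.util.Try

val raw = "2016/12/19 00:43:27.583"

// MM = month, HH = hour of day (0-23), SSS = milliseconds.
val correct = new SimpleDateFormat("yyyy/MM/dd HH:mm:ss.SSS", Locale.US)
val parsed = new Timestamp(correct.parse(raw).getTime)

// Lowercase mm means minutes, hh the 12-hour clock and s seconds, so this
// pattern describes different fields, even if it happens to parse.
val wrongCase = new SimpleDateFormat("yyyy/mm/dd hh:mm:ss.sss", Locale.US)
val misparsed = Try(new Timestamp(wrongCase.parse(raw).getTime))

println(parsed)
println(misparsed)
```

Whatever the all-lowercase pattern yields, it is not the timestamp in the file, so it is worth double-checking the case of MM, HH and SSS before blaming the reader.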

"Incorrect" initial version, left for learning purposes

    val events = eventsAsString
      .withColumn("date", split($"eventDate", " ")(0))
      .withColumn("date", translate($"date", "/", "-"))
      .withColumn("time", split($"eventDate", " ")(1))
      .withColumn("time", split($"time", "[.]")(0))    // <-- remove millis part
      .withColumn("eventDate", concat($"date", lit(" "), $"time")) // <-- make eventDate right
      .select($"eventDate" cast "timestamp", $"name")

    scala> events.printSchema
    root
     |-- eventDate: timestamp (nullable = true)
     |-- name: string (nullable = true)

    scala> events.show
    +-------------------+----+
    |          eventDate|name|
    +-------------------+----+
    |2016-12-19 00:43:27|adam|
    +-------------------+----+
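The same normalization can be sketched on a plain string, outside Spark, which makes each step easy to check in isolation. Both helpers below (toJdbcTimestamp and toJdbcTimestampNoMillis) are hypothetical names introduced for this example, not Spark or JDK functions. The first one shows that replacing the slashes is enough for Timestamp.valueOf and keeps the milliseconds; the second mirrors the column expressions above and drops them.

```scala
import java.sql.Timestamp

// Hypothetical helper: rewrite "yyyy/MM/dd HH:mm:ss.SSS" into the dashed
// form that Timestamp.valueOf understands, keeping the fractional seconds.
def toJdbcTimestamp(raw: String): Timestamp =
  Timestamp.valueOf(raw.replace('/', '-'))

// Hypothetical helper: same steps as the column expressions above,
// i.e. split date and time, fix the separators, cut off the millis part.
def toJdbcTimestampNoMillis(raw: String): Timestamp = {
  val parts = raw.split(" ")
  val date = parts(0).replace('/', '-')
  val time = parts(1).split("[.]")(0)
  Timestamp.valueOf(date + " " + time)
}

println(toJdbcTimestamp("2016/12/19 00:43:27.583"))
println(toJdbcTimestampNoMillis("2016/12/19 00:43:27.583"))
```

Neither helper validates its input; a string without a space or with another separator fails with an exception, which may or may not be what you want in a FAILFAST-style load.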

Spark 2.2.0 (not released yet)

As of Spark 2.2 (which is not available yet) you can use the to_timestamp function for string to timestamp conversion.

    scala> eventsAsString.select($"eventDate", to_timestamp($"eventDate", "yyyy/MM/dd HH:mm:ss.SSS")).show(false)
    +-----------------------+----------------------------------------------------+
    |eventDate              |to_timestamp(`eventDate`, 'yyyy/MM/dd HH:mm:ss.SSS')|
    +-----------------------+----------------------------------------------------+
    |2016/12/19 00:43:27.583|2016-12-19 00:43:27                                 |
    +-----------------------+----------------------------------------------------+
