Scala / Spark (version 1.5.2) DataFrames split error
I have an input file foo.txt with the following content:

    c1|c2|c3|c4|c5|c6|c7|c8|
    00| |1.0|1.0|9|27.0|0||
    01|2|3.0|4.0|1|10.0|1|1|
I want to transform it into a DataFrame to perform SQL queries:

    var text = sc.textFile("foo.txt")
    var header = text.first()
    var rdd = text.filter(row => row != header)
    case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
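A side note, not from the original post: in the Spark 1.5.2 shell, sc and sqlContext (with its implicits) are already provided, so the toDF() call below works directly. In a standalone application you would need the equivalent setup yourself; a minimal sketch, assuming local mode and a hypothetical app name:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("foo-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings rdd.toDF() into scope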
Up to this point everything is OK; the problem comes with the next statement:

    var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error message:

    scala> df.show()
    java.lang.ArrayIndexOutOfBoundsException: 7
I know the error might be due to the split statement. I tried to split foo.txt using the following syntax instead:

    var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
and I get this:

    scala> df.show()
    +---+---+---+---+---+---+---+---+
    | c1| c2| c3| c4| c5| c6| c7| c8|
    +---+---+---+---+---+---+---+---+
    |  0|  0|  ||   |  ||  1|  .|  0|
    |  0|  1|  ||  2|  ||  3|  .|  0|
    +---+---+---+---+---+---+---+---+
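For reference (not stated in the original post): an unescaped | in split("|") is a regex alternation of two empty patterns, which matches at every position and therefore chops each row into single characters; that is what produces the one-character columns above. A quick REPL sketch:

    // Unescaped "|" matches the empty string at every position,
    // so the row is split into individual characters
    "00|1".split("|")     // Array(0, 0, |, 1)
    // Escaping it splits on the literal pipe character
    "00|1".split("\\|")   // Array(00, 1)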
Therefore, my question is: how can I correctly load this file into a DataFrame?
EDIT: The error in the first row is due to a || field without an intermediate space. Depending on the example, this kind of field definition either works fine or crashes.
This is because one of the lines is shorter than the others:

    scala> var df = rdd.map(_.split("\\|")).map(_.length).collect()
    df: Array[Int] = Array(7, 8)
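The underlying reason, spelled out (not in the original answer): Java's String.split with the default limit discards trailing empty strings, so the row ending in || comes back with only 7 fields and p(7) is out of bounds. A quick check on the two data rows:

    // The default limit (0) drops trailing empty strings
    "00| |1.0|1.0|9|27.0|0||".split("\\|").length    // 7 -> p(7) fails
    "01|2|3.0|4.0|1|10.0|1|1|".split("\\|").length   // 8 -> p(7) is fine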
You can fill in the rows manually (but you need to handle each case explicitly):

    val df = rdd.map(_.split("\\|")).map { row =>
      row match {
        case Array(a, b, c, d, e, f, g, h) => Data(a, b, c, d, e, f, g, h)
        case Array(a, b, c, d, e, f, g)    => Data(a, b, c, d, e, f, g, " ")
      }
    }.toDF()  // convert to a DataFrame so that df.show() works below

    scala> df.show()
    +---+---+---+---+---+----+---+---+
    | c1| c2| c3| c4| c5|  c6| c7| c8|
    +---+---+---+---+---+----+---+---+
    | 00|   |1.0|1.0|  9|27.0|  0|   |
    | 01|  2|3.0|4.0|  1|10.0|  1|  1|
    +---+---+---+---+---+----+---+---+
EDIT: A more generic solution would be something like this:

    val df = rdd.map(_.split("\\|", -1)).map(_.slice(0, 8)).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If you can assume you have the right number of delimiters, it is safe to use this syntax, which truncates the last value.
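Why this works, spelled out (not in the original answer): a negative limit makes split keep trailing empty strings, so every row yields the same number of fields, and slice(0, 8) then drops the extra empty field produced by the trailing delimiter:

    // With limit -1 trailing empty strings are kept: 8 pipes -> 9 fields
    "00| |1.0|1.0|9|27.0|0||".split("\\|", -1).length              // 9
    // slice(0, 8) keeps the first 8 fields and discards the extra ""
    "00| |1.0|1.0|9|27.0|0||".split("\\|", -1).slice(0, 8).length  // 8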