Scala / Spark (version 1.5.2) DataFrames split error
I have an input file foo.txt with the following content:

    c1|c2|c3|c4|c5|c6|c7|c8|
    00| |1.0|1.0|9|27.0|0||
    01|2|3.0|4.0|1|10.0|1|1|
I want to transform it into a DataFrame to perform SQL queries:

    var text = sc.textFile("foo.txt")
    var header = text.first()
    var rdd = text.filter(row => row != header)
    case class Data(c1: String, c2: String, c3: String, c4: String, c5: String, c6: String, c7: String, c8: String)
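A side note, not from the original post: in the Spark 1.5.2 shell, sc and sqlContext (with its implicits) are already provided, so the toDF() call below works directly. In a standalone application you would need the equivalent setup yourself; a minimal sketch, assuming local mode and a hypothetical app name:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val conf = new SparkConf().setAppName("foo-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._  // brings rdd.toDF() into scope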
Up to this point everything is OK; the problem comes with the next statement:

    var df = rdd.map(_.split("\\|")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If I try to print df with df.show, I get an error message:

    scala> df.show()
    java.lang.ArrayIndexOutOfBoundsException: 7
I know the error might be due to the split statement. I tried to split foo.txt using the following syntax instead:

    var df = rdd.map(_.split("""|""")).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
and I get this:

    scala> df.show()
    +---+---+---+---+---+---+---+---+
    | c1| c2| c3| c4| c5| c6| c7| c8|
    +---+---+---+---+---+---+---+---+
    |  0|  0|  ||   |  ||  1|  .|  0|
    |  0|  1|  ||  2|  ||  3|  .|  0|
    +---+---+---+---+---+---+---+---+
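For reference (not stated in the original post): an unescaped | in split("|") is a regex alternation of two empty patterns, which matches at every position and therefore chops each row into single characters; that is what produces the one-character columns above. A quick REPL sketch:

    // Unescaped "|" matches the empty string at every position,
    // so the row is split into individual characters
    "00|1".split("|")     // Array(0, 0, |, 1)
    // Escaping it splits on the literal pipe character
    "00|1".split("\\|")   // Array(00, 1)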
Therefore, my question is: how can I correctly load this file into a DataFrame?
EDIT: The error in the first row is due to a || field without an intermediate space. Depending on the example, this kind of field definition either works fine or crashes.
This is because one of the lines is shorter than the others:

    scala> var df = rdd.map(_.split("\\|")).map(_.length).collect()
    df: Array[Int] = Array(7, 8)
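The underlying reason, spelled out (not in the original answer): Java's String.split with the default limit discards trailing empty strings, so the row ending in || comes back with only 7 fields and p(7) is out of bounds. A quick check on the two data rows:

    // The default limit (0) drops trailing empty strings
    "00| |1.0|1.0|9|27.0|0||".split("\\|").length    // 7 -> p(7) fails
    "01|2|3.0|4.0|1|10.0|1|1|".split("\\|").length   // 8 -> p(7) is fine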
You can fill in the rows manually (but you need to handle each case explicitly):

    val df = rdd.map(_.split("\\|")).map { row =>
      row match {
        case Array(a, b, c, d, e, f, g, h) => Data(a, b, c, d, e, f, g, h)
        case Array(a, b, c, d, e, f, g)    => Data(a, b, c, d, e, f, g, " ")
      }
    }.toDF()  // convert to a DataFrame so that df.show() works below

    scala> df.show()
    +---+---+---+---+---+----+---+---+
    | c1| c2| c3| c4| c5|  c6| c7| c8|
    +---+---+---+---+---+----+---+---+
    | 00|   |1.0|1.0|  9|27.0|  0|   |
    | 01|  2|3.0|4.0|  1|10.0|  1|  1|
    +---+---+---+---+---+----+---+---+
EDIT: A more generic solution would be something like this:

    val df = rdd.map(_.split("\\|", -1)).map(_.slice(0, 8)).map(p => Data(p(0), p(1), p(2), p(3), p(4), p(5), p(6), p(7))).toDF()
If you can assume you have the right number of delimiters, it is safe to use this syntax, which truncates the last value.
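Why this works, spelled out (not in the original answer): a negative limit makes split keep trailing empty strings, so every row yields the same number of fields, and slice(0, 8) then drops the extra empty field produced by the trailing delimiter:

    // With limit -1 trailing empty strings are kept: 8 pipes -> 9 fields
    "00| |1.0|1.0|9|27.0|0||".split("\\|", -1).length              // 9
    // slice(0, 8) keeps the first 8 fields and discards the extra ""
    "00| |1.0|1.0|9|27.0|0||".split("\\|", -1).slice(0, 8).length  // 8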