python - How can I use a column as an index to find words in another column using SparkSQL?
My dataframe looks like this: each row has a words column and a top5 column, and I want to use the list of indexes in top5 to find the corresponding words in words.

For example, in the first row, words is

[i, am, a, student, how, about, you]

and top5 is

[5, 4, 0, 1, 2]

I want a new column with the words from words at the index positions given by top5, i.e. the result [about, how, i, am, a]. How can I do that?
Since the number of values in top5 is fixed, you can use either bracket notation or getItem. Using the example from the question:
from pyspark.sql.functions import col, array

df = sc.parallelize([
    (["i", "am", "a", "student", "how", "about", "you"], [5, 4, 0, 1, 2])
]).toDF(["words", "top5"])
you can either:
df.select([col("words")[col("top5")[i]] for i in range(5)])
or:
df.select([col("words").getItem(col("top5")[i]) for i in range(5)])
with both giving the same result:
+--------------+--------------+--------------+--------------+--------------+
|words[top5[0]]|words[top5[1]]|words[top5[2]]|words[top5[3]]|words[top5[4]]|
+--------------+--------------+--------------+--------------+--------------+
|         about|           how|             i|            am|             a|
+--------------+--------------+--------------+--------------+--------------+
If you want an array column, wrap one of the above in the array function:
df.select(array(*[
    col("words").getItem(col("top5")[i]) for i in range(5)
]).alias("top5mapped"))
+----------------------+
|top5mapped            |
+----------------------+
|[about, how, i, am, a]|
+----------------------+
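For intuition, the per-row selection that Spark performs here is just plain list indexing; a minimal pure-Python sketch of the same logic (no Spark required, using the example row from the question):

```python
# The example row: a list of words and a list of index positions.
words = ["i", "am", "a", "student", "how", "about", "you"]
top5 = [5, 4, 0, 1, 2]

# For each index j in top5, pick words[j] — the same operation that
# col("words")[col("top5")[i]] expresses per row in the Spark query.
top5_mapped = [words[j] for j in top5]
print(top5_mapped)  # ['about', 'how', 'i', 'am', 'a']
```

Spark evaluates this lookup column-wise across all rows at once, but the result for each row matches this plain-Python version.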