python - How can I use a column as an index to find words in another column using Spark SQL?
I want to use the list of indexes in the top5 column of my dataframe to find the corresponding words in the words column.

For example, in the first row words is [i, am, a, student, how, about, you] and top5 is [5, 4, 0, 1, 2]. I want a new column built from words using the index numbers in top5, so the result would be [about, how, i, am, a]. How can I do that?
Since the number of values in top5 is fixed, you can use either bracket notation or getItem. Using the example from the question:
from pyspark.sql.functions import col, array

df = sc.parallelize([
    (["i", "am", "a", "student", "how", "about", "you"], [5, 4, 0, 1, 2])
]).toDF(["words", "top5"])

You can either:
df.select([col("words")[col("top5")[i]] for i in range(5)])

or:
df.select([col("words").getItem(col("top5")[i]) for i in range(5)])

with both giving the same result:
+--------------+--------------+--------------+--------------+--------------+
|words[top5[0]]|words[top5[1]]|words[top5[2]]|words[top5[3]]|words[top5[4]]|
+--------------+--------------+--------------+--------------+--------------+
|         about|           how|             i|            am|             a|
+--------------+--------------+--------------+--------------+--------------+

If you want an array column, wrap one of the expressions above with the array function:
df.select(array(*[
    col("words").getItem(col("top5")[i]) for i in range(5)
]).alias("top5mapped"))

+----------------------+
|top5mapped            |
+----------------------+
|[about, how, i, am, a]|
+----------------------+
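Per row, the Spark expressions above simply index one array with another. A minimal plain-Python sketch of the same lookup, using the example data (the variable names here are illustrative, not part of any Spark API):

```python
# Per-row data matching the example above
words = ["i", "am", "a", "student", "how", "about", "you"]
top5 = [5, 4, 0, 1, 2]

# Map each index in top5 to the word at that position in words
top5_mapped = [words[i] for i in top5]
print(top5_mapped)  # ['about', 'how', 'i', 'am', 'a']
```

This is exactly what the getItem-based select computes for every row of the DataFrame, just without the distributed execution.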