
Answer by anky for pyspark: what is the best way to select n distinct IDs from a dataset

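You can also try StringIndexer, which maps each distinct string in the id column to a numeric index (0, 1, 2, ..., ordered by descending frequency by default). Filtering on that index then keeps only the rows belonging to the first n ids.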

import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer

n = 3  # change as per limit
idx = StringIndexer(inputCol="id", outputCol="id_num")
idx.fit(df).transform(df).filter(F.col("id_num") < n).drop("id_num").show()

+---+-------+-----+
| id|manager|score|
+---+-------+-----+
|  A|      x|    3|
|  A|      y|    1|
|  B|      a|    2|
|  B|      b|    5|
|  C|      f|    2|
+---+-------+-----+
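
To make the "factorize, then filter" idea concrete without a Spark session, here is a minimal plain-Python sketch of the same logic. It is not Spark's implementation: the helper names `index_ids` and `take_n_ids` are hypothetical, the sample rows (including the extra id "D") are invented for illustration, and the alphabetical tie-break is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical rows (id, manager, score); the id "D" is invented so that
# the filter has something to drop.
rows = [
    ("A", "x", 3),
    ("A", "y", 1),
    ("B", "a", 2),
    ("B", "b", 5),
    ("C", "f", 2),
    ("D", "g", 4),
]


def index_ids(rows):
    """Map each distinct id to 0, 1, 2, ... by descending frequency.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    counts = Counter(r[0] for r in rows)
    ordered = sorted(counts, key=lambda k: (-counts[k], k))
    return {id_: i for i, id_ in enumerate(ordered)}


def take_n_ids(rows, n):
    """Keep only rows whose id index is below n, i.e. rows for n distinct ids."""
    idx = index_ids(rows)
    return [r for r in rows if idx[r[0]] < n]


selected = take_n_ids(rows, 3)
for row in selected:
    print(row)
```

Because the index follows frequency, this approach picks the n most frequent ids rather than an arbitrary n, which may or may not be what you want.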
