You can also try StringIndexer, which maps each distinct string in the id
column to a numeric index, and then filter on that index
according to the limit.

    import pyspark.sql.functions as F
    from pyspark.ml.feature import StringIndexer

    n = 3  # change as per limit
    idx = StringIndexer(inputCol="id", outputCol="id_num")
    idx.fit(df).transform(df).filter(F.col("id_num") < n).drop("id_num").show()

    +---+-------+-----+
    | id|manager|score|
    +---+-------+-----+
    |  A|      x|    3|
    |  A|      y|    1|
    |  B|      a|    2|
    |  B|      b|    5|
    |  C|      f|    2|
    +---+-------+-----+
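
One thing to watch: by default StringIndexer orders labels by descending frequency, so `id_num < n` keeps the n most frequent ids rather than the first n alphabetically. Below is a rough plain-Python sketch of that index-then-filter logic, not Spark itself. The rows, `index_labels`, and `keep_first_n_ids` are hypothetical names made up for illustration, and the alphabetical tie-break for equal frequencies is an assumption based on Spark 3.x behaviour.

```python
from collections import Counter

# Hypothetical rows (id, manager, score); not the original poster's data.
rows = [
    ("A", "x", 3), ("A", "y", 1),
    ("B", "a", 2), ("B", "b", 5),
    ("C", "f", 2),
    ("D", "g", 4), ("D", "h", 1), ("D", "i", 6),
]


def index_labels(labels, order="frequencyDesc"):
    """Map each distinct label to an integer, mimicking two StringIndexer orderings."""
    counts = Counter(labels)
    if order == "frequencyDesc":
        # Most frequent label gets 0; ties broken alphabetically (assumed, Spark 3.x).
        ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    elif order == "alphabetAsc":
        ordered = sorted(counts)
    else:
        raise ValueError(f"unsupported order: {order}")
    return {lab: i for i, lab in enumerate(ordered)}


def keep_first_n_ids(rows, n, order="frequencyDesc"):
    """Keep rows whose id index is below n, like filter(F.col("id_num") < n)."""
    mapping = index_labels([r[0] for r in rows], order)
    return [r for r in rows if mapping[r[0]] < n]


by_freq = keep_first_n_ids(rows, 3)                        # keeps ids D, A, B
by_alpha = keep_first_n_ids(rows, 3, order="alphabetAsc")  # keeps ids A, B, C
print(sorted({r[0] for r in by_freq}))
print(sorted({r[0] for r in by_alpha}))
```

If you want the first n ids in alphabetical order in Spark, passing `stringOrderType="alphabetAsc"` to StringIndexer should give the second behaviour.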