Answer by Som for pyspark: what is the best way to select n distinct IDs from a dataset
You can simply avoid the join by using where id in (select distinct id ... limit 3), as below (Scala):

    val df = Seq(("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"), ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6"))...
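Som's snippet is Scala and cut off by the feed; a minimal PySpark rendering of the same IN-subquery idea (a sketch assuming an active SparkSession named spark, the column names from the question, and a temp view name emp of my own choosing) might look like:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = spark.createDataFrame(
        [("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"),
         ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6")],
        ["id", "manager", "score"])
    df.createOrReplaceTempView("emp")

    # The IN-subquery picks 3 distinct ids, so no explicit join is
    # needed; every row whose id made the cut is kept.
    spark.sql("""
        select id, manager, score
        from emp
        where id in (select distinct id from emp limit 3)
    """).show()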
Answer by Shantanu Kher for pyspark: what is the best way to select n distinct IDs from a dataset
I notice one of the answers above is based on Spark SQL. Here is another Spark SQL based approach, but with a WINDOW clause:

    sql("select id, manager, score from (select e1.id, e1.manager, e1.score,...
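The query is truncated above; one plausible reconstruction of a window-based version (a sketch reusing the emp view from the previous snippet, with dense_rank standing in for whatever ranking clause the original used) is:

    # dense_rank() gives every row of the same id an identical rank,
    # so keeping rank <= 3 keeps all rows of exactly 3 distinct ids.
    spark.sql("""
        select id, manager, score
        from (
            select id, manager, score,
                   dense_rank() over (order by id) as rnk
            from emp
        )
        where rnk <= 3
    """).show()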
Answer by Sourya_cool for pyspark: what is the best way to select n distinct IDs from a dataset
This is a simple approach using the collect_set function and some pythonic operations:

    idLimit=3  #define your limit
    id_lst=(sourceDF  #collect a list of distinct ids
            .select(collect_set('id'))...
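Filling out the truncated snippet, a self-contained sketch of this collect_set-plus-slicing approach (df as defined in the first sketch; variable names here are illustrative, not Sourya_cool's):

    from pyspark.sql.functions import col, collect_set

    id_limit = 3  # define your limit

    # collect_set gathers the distinct ids into a single array on the
    # driver; Python slicing then keeps id_limit of them. The order of
    # a collect_set array is not deterministic, so which ids survive is
    # arbitrary, matching the question's "any n distinct IDs".
    id_lst = (df.select(collect_set("id").alias("ids"))
                .first()["ids"][:id_limit])

    df.where(col("id").isin(id_lst)).show()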
Answer by anky for pyspark: what is the best way to select n distinct IDs from a dataset
You can also try StringIndexer, which factorizes each string in the id column; you can then filter according to the limit.

    import pyspark.sql.functions as F
    from pyspark.ml.feature import StringIndexer

    n = 3...
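The rest of the snippet is missing; a runnable sketch of the StringIndexer route (the output column name id_idx is my own choice, not necessarily anky's):

    import pyspark.sql.functions as F
    from pyspark.ml.feature import StringIndexer

    n = 3

    # StringIndexer maps each distinct id to a float label 0.0, 1.0, ...
    # (by default ordered by descending frequency), so labels below n
    # cover exactly n distinct ids.
    indexed = (StringIndexer(inputCol="id", outputCol="id_idx")
               .fit(df).transform(df))

    indexed.filter(F.col("id_idx") < n).drop("id_idx").show()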
Answer by suresiva for pyspark: what is the best way to select n distinct IDs from a dataset
You may use Spark SQL queries to do this. Just change the limit clause value in the subquery to choose the number of distinct ids.

    df = spark.createDataFrame([("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"),...
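The query itself is cut off. Given Som's remark above about avoiding a join, this answer presumably joins back to a limited subquery of distinct ids; a hedged reconstruction along those lines (again reusing the emp view) would be:

    # Join each row back to a subquery holding 3 distinct ids; edit the
    # LIMIT value in the subquery to change how many ids are kept.
    spark.sql("""
        select e.id, e.manager, e.score
        from emp e
        join (select distinct id from emp limit 3) d
          on e.id = d.id
    """).show()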
pyspark: what is the best way to select n distinct IDs from a dataset
There's a DataFrame in pyspark with data as below:

    id  manager  score
    A   x        3
    A   y        1
    B   a        2
    B   b        5
    C   f        2
    D   f        6

What I expect is exactly n IDs in the resulting dataset, e.g. if I say 3 IDs are needed, then the...