Channel: pyspark: what is the best way to select n distinct IDs from a dataset - Stack Overflow
Browsing all 6 articles

Answer by Som for pyspark: what is the best way to select n distinct IDs from a dataset

You can simply avoid the join by using where id in (select distinct id ... limit 3), as below:

val df = Seq(("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"), ("B", "b", "5"), ("C", "v", "2"), ("D", "f", "6"))...
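The snippet above is Scala and cut off, so here is a minimal plain-Python sketch of the same idea — keep every row whose id is among the first n distinct ids — using the thread's sample data (no Spark session assumed; `rows` and `n` are illustrative):

```python
# Plain-Python analogue of: WHERE id IN (SELECT DISTINCT id ... LIMIT n)
rows = [("A", "x", 3), ("A", "y", 1), ("B", "a", 2),
        ("B", "b", 5), ("C", "v", 2), ("D", "f", 6)]
n = 3

# dict.fromkeys preserves insertion order, so this keeps the first n distinct ids
distinct_ids = list(dict.fromkeys(r[0] for r in rows))[:n]
result = [r for r in rows if r[0] in set(distinct_ids)]
# result keeps every row whose id is A, B or C
```

In Spark itself this corresponds to filtering with an IN-subquery that applies LIMIT n after SELECT DISTINCT id, as the answer describes.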




Answer by Shantanu Kher for pyspark: what is the best way to select n distinct IDs from a dataset

I notice one of the answers above is based on Spark SQL. Here is another Spark SQL-based approach, but with a WINDOW clause:

sql("select id, manager, score from (select e1.id, e1.manager, e1.score,...
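The WINDOW-clause query is truncated above; the dense_rank-style idea — rank the distinct ids and keep rows whose id ranks at most n — can be sketched in plain Python (hypothetical data from the thread; ordering by id is an assumption, since the full query is not shown):

```python
# Plain-Python analogue of: DENSE_RANK() OVER (ORDER BY id) <= n
rows = [("A", "x", 3), ("A", "y", 1), ("B", "a", 2),
        ("B", "b", 5), ("C", "v", 2), ("D", "f", 6)]
n = 3

# dense_rank assigns consecutive ranks to ids in sorted order, ties sharing a rank
rank = {id_: i + 1 for i, id_ in enumerate(sorted({r[0] for r in rows}))}
result = [r for r in rows if rank[r[0]] <= n]
# rows with the n lowest-ranked ids (A, B, C) survive
```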


Answer by Sourya_cool for pyspark: what is the best way to select n distinct IDs from a dataset

This is a simple approach using the collect_set function and some Pythonic operations:

idLimit = 3  # define your limit
id_lst = (sourceDF  # collect a list of distinct ids
    .select(collect_set('id'))...
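Since the collect_set pipeline is cut off, here is a plain-Python sketch of its logic: gather the unique ids, truncate to the limit, then filter. Note that collect_set returns the distinct values with no guaranteed order, so the sketch sorts for determinism (the data and names are the thread's illustrative ones):

```python
# Plain-Python analogue of collect_set('id') followed by an isin(...) filter
rows = [("A", "x", 3), ("A", "y", 1), ("B", "a", 2),
        ("B", "b", 5), ("C", "v", 2), ("D", "f", 6)]
idLimit = 3

id_set = {r[0] for r in rows}       # like collect_set: unique ids, unordered
id_lst = sorted(id_set)[:idLimit]   # sort for determinism, then truncate
result = [r for r in rows if r[0] in id_lst]
```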


Answer by anky for pyspark: what is the best way to select n distinct IDs from a dataset

You can also try StringIndexer, which factorizes each string in the id column, and then filter according to the limit:

import pyspark.sql.functions as F
from pyspark.ml.feature import StringIndexer
n = 3...
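The StringIndexer code above is truncated. StringIndexer assigns each label an index by descending frequency, so keeping indices below n selects the n most frequent ids; a plain-Python sketch of that logic on the thread's sample data (tie-breaking alphabetically here is an assumption — Spark's tie order is not guaranteed):

```python
from collections import Counter

# Plain-Python analogue of StringIndexer: index ids by descending frequency,
# then keep rows whose id index is below n
rows = [("A", "x", 3), ("A", "y", 1), ("B", "a", 2),
        ("B", "b", 5), ("C", "v", 2), ("D", "f", 6)]
n = 3

freq = Counter(r[0] for r in rows)
order = sorted(freq, key=lambda k: (-freq[k], k))   # most frequent first
index = {id_: i for i, id_ in enumerate(order)}
result = [r for r in rows if index[r[0]] < n]
```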


Answer by suresiva for pyspark: what is the best way to select n distinct IDs from a dataset

You may use Spark SQL queries to do this. Just change the LIMIT clause value in the subquery to choose the number of distinct ids:

df = spark.createDataFrame([("A", "x", "3"), ("A", "y", "1"), ("B", "a", "2"),...



pyspark: what is the best way to select n distinct IDs from a dataset

There's a DataFrame in pyspark with data as below:

id  manager  score
A   x        3
A   y        1
B   a        2
B   b        5
C   f        2
D   f        6

What I expect is exactly n IDs in the resulting dataset, e.g. if I say 3 IDs are needed, then the...
