Channel: pyspark: what is the best way to select n distinct IDs from a dataset - Stack Overflow

Answer by Sourya_cool for pyspark: what is the best way to select n distinct IDs from a dataset


This is a simple approach using the 'collect_set' function and a few plain Python operations:

    from pyspark.sql.functions import collect_set

    idLimit = 3  # define your limit
    id_lst = (sourceDF  # collect a list of distinct ids
              .select(collect_set('id'))
              .collect()[0][0]
             )
    id_lst.sort()  # sort the ids alphabetically
    id_lst_limited = id_lst[:idLimit]  # limit the list as per your defined limit
    targetDF = (sourceDF  # filter the source df using your limited list
                .filter("id in ({0})".format(str(id_lst_limited)[1:-1]))
               )
