SQL: Random sampling X% of samples(unique IDs) from duplicated ID table

1

I am using SQL and obtained the following table:

userID|  time   | location   
A10   | 20130801| 1000  
A10   | 20130802| 1002  
A10   | 20130806| 1008  
B21   | 20130803| 1000  
B21   | 20130801| 1099  
C11   | 20130802| 1000  
D33   | 20130802| 1002  
D33   | 20130806| 1877  
E01   | 20130801| 1765  
E01   | 20130801| 1000  
E01   | 20130802| 1000 

where userID is String, Time is YYYYMM, and Location is location ID (numeric) for each userID in YYYYMM.

In this example, I have 5 unique userIDs (A10,B21,C11,D33,E01). I would like to write query that randomly sample X percent of unique userIDs (for example, if X=80, randomly sample 4 unique userIDs from 5).

I've written:

Select time,location,
       count(DISTINCT userID) as n_uu
from( ---
      --- here, I construct the example table
      ---
     ) as maintable
     where 0.8 >= CAST(CHECKSUM(NEWID(), userID) & 0x7fffffff AS float)/CAST 
     (0x7fffffff AS int)

group by time, location

where I finally intend to count the number of userIDs for randomly selected samples (i.e., 80% of unique userIDs). That is, I try to get following table in this example (now suppose that B21 is not sampled):

time    | location | n_uu  
20130801| 1000     | 2   
20130801| 1765     | 1  
20130802| 1002     | 2  
20130802| 1000     | 2  
20130806| 1008     | 1  
20130806| 1877     | 1  

Yet, this seems not to randomly sample unique userIDs, rather it randomly select rows.

How to fix it, or can I do this in faster by using other query? Any advice would be very appreciated.

===Added =========================

Select time,location,
   count(DISTINCT userID) as n_uu
from(
(Select 
  --- here, I construct the example table
  ---
 ) as maintable
 Select maintable.*
 from maintable join
 (Select top 80 percent userID
 from (Select Distinct userID from maintable) newtable
 order by NEWID()
 ) newtable
 on maintable.userID = newtable.userID
 )
 group by time, location
sql
sql-server
asked on Stack Overflow Mar 6, 2019 by Kehoe • edited Mar 6, 2019 by Kehoe

1 Answer

3

Hmmm. If I understand correctly, you want to sample within a hierarchy. So, get the user ids you want with a subquery and then join in the rest of the information:

select t.*
from t join
     (select top 80 percent userid
      from (select distinct userid from t) u
      order by newid()
     ) u
     on t.userid = u.userid;
answered on Stack Overflow Mar 6, 2019 by Gordon Linoff

User contributions licensed under CC BY-SA 3.0