Hello! I’d like to ask a noob question: I never used ADQL before, and I am trying to select a random subsample of galaxies from DP0.2. RAND() from standard SQL doesn’t work, and I am having a surprisingly hard time finding how to do this in Google. Gaia tables have a random_idx column, but I haven’t found something like that in DP0.2. Thanks!
Hi @bruno-moraes, great question, thank you for asking!
Not all ADQL functionality is implemented for queries yet. I hadn’t tried using RAND() before, so first I confirmed that including it in a query returns “ERROR Function [RAND] is not found in TapSchema”. I used the Portal for this test, but the Portal and Notebook Aspects use the same underlying TAP service.
I think the recommended workaround here is to retrieve a larger sample than you need, and then down-select to a random sample. The Object catalog does have an objectId column, but its values are not randomly assigned. Other ideas for generating random samples with DP0.2 are welcome as replies in this thread.
In the Notebook Aspect, the down-selection could be done with, e.g., the numpy.random module (e.g., numpy.random.choice); a minimal sketch is below. Because the catalogs are so large, it is recommended to always include some spatial constraints in queries, and to select a large container when starting your JupyterLab server if you are working with large data sets.
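For example, a minimal sketch of that down-selection in a notebook (assuming the RSP's lsst.rsp.get_tap_service; the query is the one from the Portal example below, and the 10,000-row target is just an assumption):

import numpy as np
from lsst.rsp import get_tap_service

# Query a region that returns more rows than needed, then down-select.
service = get_tap_service("tap")
query = """SELECT coord_dec, coord_ra, objectId
           FROM dp02_dc2_catalogs.Object
           WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),
                          CIRCLE('ICRS', 62, -37, 0.2)) = 1"""
results = service.search(query).to_table()

# Keep a random 10,000 rows (assumed target size; the query must
# return at least this many) without replacement.
rng = np.random.default_rng(42)
keep = rng.choice(len(results), size=10_000, replace=False)
subsample = results[keep]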
In the Portal Aspect, this could be done in a similar way: first run a query, then create a new column using the RAND() function, and then put a limit on the new column to achieve a random subset of the desired size. A simple step-by-step example is below.
@bruno-moraes, if this sufficiently answers your question, could you please mark this post as the solution? And if not, please follow up in this thread so we can get the issue resolved.
Step-by-step example of creating a random subset of DP0.2 Objects in the Portal Aspect.
Go to data.lsst.cloud and enter the Portal Aspect.
Select “Edit ADQL” from upper right, and copy-paste this query into the ADQL box, or use your own query to obtain only potential galaxies of interest.
SELECT coord_dec, coord_ra, objectId
FROM dp02_dc2_catalogs.Object
WHERE CONTAINS(POINT('ICRS', coord_ra, coord_dec),CIRCLE('ICRS', 62, -37, 0.2))=1
Increase the Row Limit to 200000.
Click Search.
The query returns 105,251 rows.
In the results page, click “Bi-view Tables” at upper right.
The results view screen now shows the query results in the table panel.
At upper right in the table panel, click on the “Add a column” icon, the vertical rectangle with the plus sign. Hover over icons to see pop-ups of the function.
Name the new column ‘random’ and use the expression ‘RAND()’.
Click “Add Column”.
A new column named random, populated with values between 0 and 1, now appears.
In the box at the top of the column, enter “<0.2” to get a random sample that is 20% the size of the full sample.
This is just a small supplement to Melissa’s answer, regarding her recommendation to “always include some spatial constraints.” If you’d like to use her approach to sample across a big area of the sky, you could stratify by location. Partition the sky into M polygons of equal solid angle. If you want N samples, draw M integers (which will sum to N) from a multinomial distribution over M categories with equal probabilities (or unequal probabilities if the solid angles aren’t equal); see scipy.stats.multinomial. Then do the larger-than-needed query in each polygon and down-select to the number from the multinomial. This aims to make sure the samples are appropriately random across the whole area.
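If it helps, here is a minimal sketch of the count-drawing step, using numpy’s equivalent of scipy.stats.multinomial (the M=16 partition and N=10,000 total are made-up values):

import numpy as np

M = 16          # number of equal-solid-angle sky polygons (assumed)
N = 10_000      # total number of samples wanted (assumed)
rng = np.random.default_rng(42)

# One multinomial draw over M equally probable categories; the M counts
# sum to N by construction. Use unequal probabilities if the solid
# angles differ.
counts = rng.multinomial(N, np.full(M, 1.0 / M))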
Hi @MelissaGraham, @TomLoredo, thanks for the answers! (And apologies for the late reply.) I am not 100% sure these solutions work for us. Our goal is to create a simulated fully-representative spectroscopic training set over the full DP0.2 area and then cut it in colour/magnitude/SNR to imitate certain well-known non-representative spectroscopic samples. We want to run several photo-z trainings and assess the impact of these sample choices. With this in mind:
- using a small area can induce cosmic variance effects;
- the original columns do not easily allow for an SNR cut (or do they?);
We could stratify by location, but I suspect that using a multinomial with fixed total N would oversample low-density regions and undersample high-density ones? Off the top of my head, maybe this would work (adapting Tom’s steps):
- Partition the sky into M polygons.
- Take N_i objects uniformly at random from each polygon i, where N_i is sampled from a Poisson distribution with rate fixed_frac * N^tot_i / area_i (a rough sketch of this is below).
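A rough sketch of that per-polygon draw (the per-polygon totals, areas, and fixed_frac below are made-up placeholders):

import numpy as np

rng = np.random.default_rng(42)
fixed_frac = 1e-3                        # target sampling fraction (assumed)

# Hypothetical per-polygon object totals N^tot_i and areas (deg^2).
n_tot = np.array([9.2e6, 8.7e6, 9.8e6])
area = np.array([100.0, 100.0, 100.0])

# Poisson draws with the proposed rate fixed_frac * N^tot_i / area_i.
n_i = rng.poisson(fixed_frac * n_tot / area)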
I’m not sure all this is statistically robust, and it would require many, many queries to span the footprint, or maybe some smarter cuts on the original sample to reduce its size.
Does this sound overly complicated? Any suggestions?
Thanks!
I wonder if doing 2 separate queries might do the job.
My idea is:
- Download all the objectIds from the Object table (maybe via an asynchronous query; see the sketch after these steps):
query = "SELECT objectId FROM dp02_dc2_catalogs.Object"
results = service.search(query)
- Save the result to a file for general use later:
df = results.to_table().to_pandas()
df.to_csv('objectIds_all.csv', index=False)
- When a subsample is needed, read in the objectIds_all.csv file, use numpy to choose a random subsample, and then use the SQL/ADQL WHERE...IN clause to read those random Objects into a table in your current notebook; e.g.:
my_list = "(1249537790362809267, 1252528461990360512, 1248772530269893180, "\
"1251728017525343554, 1251710425339299404, 1250030371572068167, "\
"1253443255664678173, 1251807182362538413, 1252607626827575504, "\
"1249784080967440401, 1253065023664713612, 1325835101237446771)"
query = "SELECT objectId, g_calibFlux, r_calibFlux, i_calibFlux, z_calibFlux "\
"FROM dp02_dc2_catalogs.Object "\
"WHERE objectId IN "+my_list
results = service.search(query)
results.to_table()
(see also ADQL Recipes — Vera C. Rubin Observatory Documentation for Data Preview 0.2).
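As an aside, here is a minimal sketch of the asynchronous version of the Step 1 query (assuming service is a pyvo TAPService, e.g., from lsst.rsp.get_tap_service):

# Submit Step 1 as an asynchronous TAP job, so the long-running query
# is not tied to the notebook's open connection.
job = service.submit_job("SELECT objectId FROM dp02_dc2_catalogs.Object")
job.run()
job.wait(phases=['COMPLETED', 'ERROR'])
job.raise_if_error()
results = job.fetch_result()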
I’ve found that the SQL/ADQL WHERE...IN clause can accept fairly lengthy lists, and, even if a single list is not lengthy enough, one could run the commands in Step 3 multiple times with sub-lists until you get the number of random Objects you want; a sketch of that is below.
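Putting it together, a minimal sketch of Steps 3 and beyond (assuming the objectIds_all.csv file from Step 2, the service connection from above, and a hypothetical 1,000-item limit on the IN list):

import numpy as np
import pandas as pd

# Draw a random subsample from the saved list of all objectIds.
ids = pd.read_csv('objectIds_all.csv')['objectId'].to_numpy()
rng = np.random.default_rng(42)
subsample = rng.choice(ids, size=5000, replace=False)

# Run the Step 3 query once per sub-batch, in case the full IN list
# is too long, then stack the results.
tables = []
for chunk in np.array_split(subsample, int(np.ceil(len(subsample) / 1000))):
    my_list = "(" + ", ".join(str(i) for i in chunk) + ")"
    query = ("SELECT objectId, g_calibFlux, r_calibFlux, i_calibFlux, z_calibFlux "
             "FROM dp02_dc2_catalogs.Object "
             "WHERE objectId IN " + my_list)
    tables.append(service.search(query).to_table().to_pandas())
df_sample = pd.concat(tables, ignore_index=True)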
Hi Douglas,
Thanks for the feedback and suggestion. My student and I will see what we can cook up, sometime in the next few days, and report back.
Cheers,
Bruno
Dear Bruno,
I have some updates that might be useful. These are still in development – and should be treated as such – but I hope they might be helpful!
First, based on suggestions by @MelissaGraham and @leanne, I downloaded all the objectId's from the dp02_dc2_catalogs.Object table. (There are basically a “gazillion” of these, so this took more effort than I was originally expecting!) I have placed this download in my scratch area on https://data.lsst.cloud, so anyone with an RSP account can access this list without having to re-create it themselves. There is a CSV version and a FITS version:
/scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.csv
/scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.fits
The data in these two files are the same; the only difference is the file format. There are 278,318,455 objectId's in each file, and each is about 5 GB in size. The FITS file is faster to load than the CSV, but there are some nifty pandas tricks one can use with the CSV file.
I googled around and found this post on “2 Ways to Randomly Sample Rows from a large CSV file”:
https://cmdlinetips.com/2022/07/randomly-sample-rows-from-a-big-csv-file/
I tried the two methods, and they work pretty fast – or at least a lot faster than reading in the whole 5 GB CSV file and then using numpy methods to select a random set of objectId's.
First, say you want to grab a random 0.001% of the objectId's from the full set listed in /scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.csv. You can use this code snippet to do that:
import random

import pandas as pd

filename = '/scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.csv'
random.seed(4321)

# Keep the header row (x == 0) and keep each data row with
# probability 0.00001 (i.e., roughly 0.001% of the rows).
df = pd.read_csv(filename,
                 skiprows=lambda x: x > 0 and random.random() >= 0.00001)
In my case, I got 2872 objectId’s returned.
Or, if you want to grab exactly N=100 random rows from /scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.csv, you can use a code snippet like this:
import random

import pandas as pd

filename = '/scratch/DouglasLTucker/dp02_dc2_catalogs_Object_all_objectids_sorted.csv'
df = sample_n_from_csv(filename, n=100)
where you have previously defined the function sample_n_from_csv as follows:
import random

import pandas as pd

def sample_n_from_csv(filename: str, n: int = 100, total_rows: int = None) -> pd.DataFrame:
    # Count the data rows if the caller did not supply the total
    # (subtract 1 for the header row).
    if total_rows is None:
        with open(filename, "r") as fh:
            total_rows = sum(1 for row in fh) - 1
    if n > total_rows:
        raise ValueError("n > total_rows")
    # Skip all data rows except a random n of them (row 0 is the header).
    skip_rows = random.sample(range(1, total_rows + 1), total_rows - n)
    return pd.read_csv(filename, skiprows=skip_rows)
One can then feed that subsample of random objectId's into a WHERE...IN clause of your ADQL query, as shown in Step 3 of my previous message, with the caveat that I am not sure what the limit on the number of items in the IN list is (so you might need to break things up into multiple ADQL queries, each with a different sub-batch from the original list of random objectId's).
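For instance, a minimal sketch of that hand-off (the n=100 draw and the flux columns are placeholders; service is the TAP connection from earlier):

# Draw exactly 100 random objectIds and feed them to WHERE...IN.
df = sample_n_from_csv(filename, n=100)
my_list = "(" + ", ".join(str(i) for i in df['objectId']) + ")"
query = ("SELECT objectId, g_calibFlux, r_calibFlux "
         "FROM dp02_dc2_catalogs.Object "
         "WHERE objectId IN " + my_list)
results = service.search(query)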
There may be more updates later, but I hope this helps in case you want to get started before those updates are posted.
Thanks!