Can I lookup a value in one dataset using a field in another?

I’ve got a custom dataset of countries with IDs. Those IDs are referenced in another dataset’s rows.

I’ve created a new schema that uses the second dataset’s data, which means I only have access to the country IDs and not the actual country names. Is there a way to do a lookup from one dataset to another using a random value selected in my schema?

1 Like

If you have both country name and id in the dataset that you uploaded, you can pull both columns into your new schema and the values will match up automatically. This happens because Mockaroo always picks values from the same row in the dataset when generating a row.

This is more like a foreign key lookup, though. I’ve got the same countryId in two different datasets, but only one of them has the country name. In my schema, I want to pull in the correct country name from dataset A using the countryId from dataset B.

Related (I think): Is there a way to directly reference a field in a dataset using the formula functionality, if that field isn’t already explicitly included in the current schema?

No, there isn’t - but that sounds like an excellent idea! Wouldn’t be trivial, though. I’ll put some work into it and see if I can get it done soon. In the meantime you’d have to join A and B offline and reupload it so you could pull the related values from a single dataset.

Basically, I’ve got a whole lot of data to work from (too much for my prototyping needs), and I want to be able to create all sorts of derivative datasets from it. But the data I’ve got is meaningfully related. So, while I’m happy to pull a thousand random rows from dataset A, when I pull from dataset B, for certain fields, I only want rows that match the data pulled from dataset A. Does that make sense?

It does. I actually got a formula function that would allow you to lookup values from a dataset working in development last night. I should have it pushed up tonight. The syntax will be:

from_dataset('Countries', 'name', id: country_id)

Does that look like it would work for your purposes?

1 Like

Oooh. I think so!

So, what that does is look in the (external) dataset called “Countries” for a row with an id equal to the country_id that is returned from a different field (in this schema), and it results the field name from that row. Right?

Out of curiosity, how will this handle cases in which the lookup results multiple matches? For example, what if instead of a unique id, I try to match on something less unique? Or would it just pick the first match? Or randomly select from multiple? (My QA past is rearing its head here, and I wanna make sure this does explode or overly tax your DB.)

Also, how tough would multivariate lookup be at this point? It’s been a few years since I’ve played with Rails, but, given your syntax, it feels like this wouldn’t be a big jump.

Yep, you’ve got it.

Right now it just picks the last record matching the criteria. The third argument to the function is a Hash, so looking up based on multiple variables is a simple as:

`from_dataset(‘Countries’, ‘name’, col1: field1, col2: field2)

1 Like

The new from_dataset formula function is now available on mockaroo.com.

2 Likes

You’re amazing! Thank you so much.

Taking this one step further, could it be possible to nest this type of functionality?

from_dataset("Countries", "name", id: from_dataset("Contacts", "country_id", id: id))

In my example, none of the IDs are auto-generated (they come from my datasets). So, I’m looking to grab the country_id of a contact that matches the id of randomly selected contact for this row. And then, I’ll use this country_id to lookup the name of the country in the Countries dataset.

This is hardly an essential feature, since I can do this simply by adding a country_id field to this schema and doing this:

from_dataset("Countries", "name", id: country_id)

But it’d be nice to not have to spit out the extra (unnecessary) fields. (Which I’m going to start another thread about.)

It should be pretty easy for me to add nesting. I’ll take a look at it soon. You can hide any field from the output simply by naming it starting with two underscores. So for example, “__myHiddenField”.

1 Like

Perhaps I’m screwing something up, but this doesn’t seem to be working when I download the data.

I’m getting this error (in the field values):

error: Could not access blank value: Use || to provide a default value for blank fields.

But I’m fairly certain I solved for that already, and it doesn’t occur when I preview. And, just to be clear, it was happening in my previews, until I fixed the formulas.

Can you post the share link at the bottom of your schema? I’ll have a look.

Should this feature work within a JSON Array’s values?

For example:

(Where id in that from_dataset is the id of the current row.)

It should now :slight_smile: Good catch! You should also be able to nest calls to from_dataset to join across multiple datasets.

1 Like

“Right now it just picks the last record matching the criteria.”

Is there anyway you could modify this behavior such that rather than picking the last record matching the criteria, it picks a random record matching the criteria?

Possibly. I’ll look into it tonight…

Cheers. Thanks a lot for your product.

I quickly realized a work around for my particular situation that inspired the question, so no rush. I could still see the value in it though if it wouldn’t require too much of your time to implement.

Any chance you’ve had any luck with this? I’ve come across another situation in which it would be very valuable. Thanks in advance.