Dataset Column repeats in custom distribution

ian.butterworth · May 25, 2016, 5:30pm

I have a dataset called iso3166CountryCodes with 242 records, and I wanted it to deliver a high proportion of one country compared to the others. I set the column to the numeric isocode and then went to edit the custom distribution. I found that the 826 record appeared many times. When I changed its weight rule to 100, all of the repeated 826s changed to 100 too.

I have since experimented by removing the frequency column, and all the repeated 826s disappeared - I assume that records are inserted into the dataset in proportion to the frequency column.

mockaroo · May 26, 2016, 12:45pm

The frequency column predates the ability to edit a dataset column field’s distribution. Specifying the frequency in the dataset was the original way to control the distribution. If you include a frequency column in your CSV file and also specify a custom distribution in your schema, you’ll get a multiplicative effect. In other words, your assumption is correct. I suggest just using a custom distribution in your schema.
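To make the multiplicative effect concrete, here is a rough sketch in Python. It is not Mockaroo's actual implementation. It assumes each row is repeated as many times as its frequency value, and that a custom distribution weight is then applied to every repeated copy. The function names, the sample frequencies, and the default weight of 1 are all made up for illustration.

```python
from collections import Counter


def expand_by_frequency(rows):
    """Repeat each (value, frequency) row `frequency` times (assumed behavior)."""
    expanded = []
    for value, frequency in rows:
        expanded.extend([value] * frequency)
    return expanded


def effective_weights(expanded, custom_weights, default_weight=1):
    """Sum the custom weight over every repeated copy of a value."""
    totals = Counter()
    for value in expanded:
        totals[value] += custom_weights.get(value, default_weight)
    return dict(totals)


# Hypothetical data: 826 has frequency 5, the other codes have frequency 1.
rows = [(826, 5), (840, 1), (250, 1)]
expanded = expand_by_frequency(rows)

# A custom weight of 100 on 826 applies to each of its 5 copies: 5 * 100.
weights = effective_weights(expanded, {826: 100})
total = sum(weights.values())
shares = {value: weight / total for value, weight in weights.items()}
```

Under these assumptions 826 ends up with an effective weight of 500 rather than 100, which is why dropping the frequency column and relying on the custom distribution alone is easier to reason about.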

ian.butterworth · May 26, 2016, 1:15pm

Yes, that’s what I thought - it gets too complicated to understand otherwise.