PathToPerformance

A quick note for my Julia peeps to grok the difference between NaN, missing and nothing in JuliaLang. A few friends on Twitter remind me that the distinction between these concepts is not trivial, but I think I have a good mental model of how to address it, so I might as well write it up. Hat tip to Jasmine Hughes for inspiring this post and also for sponsoring me on GitHub so I can continue my open source campaign.

A rainy setup

You are running a science experiment where you must measure the amount of rainwater that falls in a given day. The scientific-est thing your advisor has recommended is to set up a RainWater-O-Tron 9000, which collects data every day on how much water fell into a tube that sticks out of the top of it and reports the total back at the end of the day. Once the thing is plugged in, your machine graciously spits a table into your data pipeline tool that looks something like this:

| Day | Rain [cm] | Status |
|-----|-----------|--------|
| 1   | 15        | OK     |
| 2   | 20        | OK     |
| 3   | 10        | OK     |

So far, nothing out of the ordinary. Another humble data gathering expedition to appease the fickle gods of science and grants. The machine kindly records the centimeters of rain collected and its operating status - seems sensible.

You reset the machine and leave for Easter break with the robot running for a week, ready to come back and do some proper Science™ once you get the data back.

Ominously, you find the report to say this:

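| Day | Rain [cm] | Status |
|-----|-----------|--------|
| 1   | 12        | OK     |
| 2   | 22        | OK     |
| 3   | 13        | OK     |
| 4   |           | OK     |
| 5   |           | NO     |
| 6   | 💩        | OK     |
| 7   | 18        | OK     |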

Clearly something has gone wrong on days 4-6, but if you think about it carefully for a second, the Status of each data point gives you some insight into where your data collection could have gone wrong.

This is the big distinction: how much you know about your data and about the "failure modes" of how it was collected (or miscollected). What you recorded gives you an idea of how to approach its shortcomings.
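To make the distinction concrete, here is a minimal Julia sketch of how the three values behave. The variable names are hypothetical, and which value belongs to which day of the report is an assumption made purely for illustration, not something the table tells us.

```julia
# Hypothetical stand-ins for the three kinds of "bad" reading.
# Which one belongs to which day is an assumption for illustration.
no_reading  = missing   # a value should exist, but we don't know it
no_machine  = nothing   # there is no value at all
bad_reading = NaN       # a Float64 was produced, but it is not a meaningful number

# Each one has a different type.
@show typeof(no_reading)    # Missing
@show typeof(no_machine)    # Nothing
@show typeof(bad_reading)   # Float64

# NaN and missing both propagate through arithmetic...
@show bad_reading + 1.0     # NaN
@show no_reading + 1.0      # missing

# ...but nothing does not: arithmetic on it throws a MethodError.
nothing_errors = try
    no_machine + 1.0
    false
catch err
    err isa MethodError
end
@show nothing_errors        # true

# Comparisons differ too: NaN is never equal to itself, while comparing
# against missing yields missing rather than true or false.
@show bad_reading == bad_reading   # false
@show ismissing(no_reading == 1)   # true

# The dedicated predicates are the reliable way to test for each one.
@show isnan(bad_reading), ismissing(no_reading), isnothing(no_machine)
```

The loud failure on `nothing` versus the quiet propagation of `NaN` and `missing` is the practical upshot: the first stops your pipeline, the other two travel through it until you deal with them.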

Of course, these are just narratives for illustrative purposes, but hopefully they help solidify the distinctions and show how these values can help you think through your problem. Does that mean you must always use these sentinel values in your code or data collection? Not necessarily; it's for you to decide whether these are the right tools.
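If you do decide to carry these values in your data, a rough sketch of cleaning up a week of readings might look like the following. The `rain` vector reuses the good readings from the report, but the entries chosen for days 4-6 are assumptions, and `valid` is just an illustrative name.

```julia
using Statistics  # standard library, provides mean

# Hypothetical week of readings in cm; the entries for days 4-6 are assumed.
rain = Union{Float64, Missing}[12.0, 22.0, 13.0, missing, missing, NaN, 18.0]

# A plain mean is poisoned by the missing entries.
@show mean(rain)                 # missing

# skipmissing drops the missing entries, but the NaN still poisons the result.
@show mean(skipmissing(rain))    # NaN

# Filter out both kinds of bad entries before summarising.
valid = [r for r in skipmissing(rain) if !isnan(r)]
@show mean(valid)                # 16.25

# coalesce replaces missing with a default; something does the same for nothing.
@show coalesce(missing, 0.0)     # 0.0
@show something(nothing, 0.0)    # 0.0
```

Whether dropping the bad days or substituting a default is the right call depends on what the failure actually was, which is exactly what the Status column is there to tell you.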

'Til next time.