Early Data Flow for Skate Analytics

Below is an attempt at laying out the sequence of steps required to convert the data collected into probability distributions, including prior information.

For each trick attempt, five pieces of information are recorded (data types in brackets): date/time of attempt [date/time], trick name [text], success bet [integer, any non-zero number], failure bet [integer, any non-zero number], and outcome [binary, 0 or 1]. (Other fields may be recorded too, but these are all we need for the analysis at this point.)

Each trick attempt's success and failure bet information (collectively, the odds information) needs to be converted to probability: success bet / (success bet + failure bet).
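As a minimal sketch of that conversion (the function name is hypothetical, and I'm assuming both bets are positive integers, which is stricter than "non-zero"):

```python
def implied_probability(success_bet: int, failure_bet: int) -> float:
    """Convert a success/failure bet pair into a probability of success."""
    # Assumption: both bets are positive, so the result lies strictly
    # between 0 and 1 and the denominator can never be zero.
    if success_bet <= 0 or failure_bet <= 0:
        raise ValueError("both bets are assumed to be positive integers")
    return success_bet / (success_bet + failure_bet)
```

For example, a success bet of 3 against a failure bet of 1 gives 3 / (3 + 1) = 0.75.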

Aggregate for each trick within particular time periods. This includes taking the mean of the per-attempt probabilities derived from the odds information (it remains to be tested whether using the mean is a good approach, but it seems decent to start out with) and getting the number of successes (sum of outcome, data_a) and the number of attempts (count of rows, N).
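Here's a rough sketch of the aggregation step. The dictionary keys, the function name, and the choice of "calendar day" as the time period are all assumptions for illustration, and the sample rows are made up:

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def aggregate_attempts(attempts):
    """Group attempts by (trick, day) and summarise each group."""
    groups = defaultdict(list)
    for row in attempts:
        # Assumption: the time period is the calendar day of the attempt.
        groups[(row["trick"], row["when"].date())].append(row)

    summary = []
    for (trick, period), rows in sorted(groups.items()):
        probs = [
            r["success_bet"] / (r["success_bet"] + r["failure_bet"])
            for r in rows
        ]
        summary.append({
            "trick": trick,
            "period": period,
            "mean_prob": mean(probs),                  # aggregated odds information
            "data_a": sum(r["outcome"] for r in rows),  # number of successes
            "N": len(rows),                             # number of attempts
        })
    return summary


# Made-up rows, purely for illustration.
example_attempts = [
    {"when": datetime(2024, 5, 1, 17, 0), "trick": "kickflip",
     "success_bet": 3, "failure_bet": 1, "outcome": 1},
    {"when": datetime(2024, 5, 1, 17, 5), "trick": "kickflip",
     "success_bet": 1, "failure_bet": 3, "outcome": 0},
    {"when": datetime(2024, 5, 1, 17, 9), "trick": "ollie",
     "success_bet": 4, "failure_bet": 1, "outcome": 1},
]
summary = aggregate_attempts(example_attempts)
```

Each entry in `summary` then carries what the later steps need: the mean probability, data_a, and N.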

Then use the aggregated odds information as priors via the method that I described in this post: round it to the nearest multiple of 1 / N, convert it to a fraction, and extract the numerator and denominator from the fraction to get the a and b values (prior_a = numerator, prior_b = denominator - numerator). Add this information to each row.
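A sketch of how that could look with Python's standard `fractions` module (the function name is hypothetical). One thing to watch: `Fraction` always reduces to lowest terms, so 6/10 becomes 3/5; whether the prior should use the reduced or unreduced fraction is a choice I'm assuming here, not something settled above.

```python
from fractions import Fraction


def prior_from_mean(mean_prob: float, n: int):
    """Turn an aggregated probability into (prior_a, prior_b)."""
    # Round to the nearest multiple of 1/n. Note that Python's round()
    # sends exact halves to the nearest even integer.
    numerator = round(mean_prob * n)
    # Fraction reduces to lowest terms (an assumption about the method).
    frac = Fraction(numerator, n)
    prior_a = frac.numerator
    prior_b = frac.denominator - frac.numerator
    return prior_a, prior_b
```

For instance, a mean of 0.62 with N = 10 rounds to 6/10, which reduces to 3/5, giving prior_a = 3 and prior_b = 2. If the mean rounds to 0 or 1, one of the two values comes out as 0, which probably needs special handling.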

Each resulting row of data can then have beta distributions built for it, using a = prior_a + data_a and b = prior_b + (N - data_a).
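In code, the update is just two additions. The function names below are hypothetical, and the mean is included only as a quick sanity check on the resulting distribution:

```python
def posterior_params(prior_a: int, prior_b: int, data_a: int, n: int):
    """Combine prior and observed counts into beta parameters (a, b)."""
    a = prior_a + data_a          # prior successes + observed successes
    b = prior_b + (n - data_a)    # prior failures + observed failures
    return a, b


def beta_mean(a: float, b: float) -> float:
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)
```

With prior_a = 3, prior_b = 2 and 7 successes in 10 attempts, this gives a = 10 and b = 5, a distribution centred on 10 / 15.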


With the above, we can then compare the distributions for different time periods. I think the way that works is that we randomly sample a probability from each distribution some number of times (call it S, to keep it distinct from the attempt count N above), filling out a matrix of values where each distribution has a column and each sample from 1 to S is given a row, and then make comparisons between the values in pairs of columns. (I seem to recall that in BayesianStatisticsTheFunWay the author only made comparisons between pairs of columns (each distribution's samples), which is fine for what I'm doing here, but I imagine it might be interesting to compare across larger sets of columns.)
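A small sketch of building that matrix with the standard library's `random.betavariate` (the function name and the list-of-lists layout are my own assumptions; each (a, b) pair must be strictly positive for sampling to work):

```python
import random


def sample_matrix(params, n_samples, seed=None):
    """Return n_samples rows, with one column per (a, b) distribution."""
    rng = random.Random(seed)  # a seed makes the draws repeatable
    return [
        [rng.betavariate(a, b) for (a, b) in params]
        for _ in range(n_samples)
    ]
```

Calling `sample_matrix([(10, 5), (4, 8)], 1000)` would give 1000 rows of two values each, one drawn from each distribution.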

For instance, we can look at simpler things like how many of the samples in one column are greater than those in the other (as shown in BayesianStatisticsTheFunWay), or by what percentage one is higher than the other, and visualize the distribution of the resulting percentages (a riff on another example given in the book).
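Both comparisons can be sketched for a single pair of distributions as follows. The function name is hypothetical, and defining "percentage higher" as 100 * (x - y) / y is my assumption about what to measure; the book's version may differ.

```python
import random


def compare_pair(a1, b1, a2, b2, n_samples=10000, seed=None):
    """Compare Beta(a1, b1) against Beta(a2, b2) by paired sampling."""
    rng = random.Random(seed)
    first = [rng.betavariate(a1, b1) for _ in range(n_samples)]
    second = [rng.betavariate(a2, b2) for _ in range(n_samples)]

    # Share of paired samples where the first distribution is higher.
    frac_greater = sum(x > y for x, y in zip(first, second)) / n_samples

    # Percentage by which the first sample exceeds the second, per pair
    # (assumed definition: relative difference against the second).
    pct_higher = [100 * (x - y) / y for x, y in zip(first, second)]
    return frac_greater, pct_higher
```

The list of percentages is what would get plotted as a histogram to see how much better (or worse) one time period looks than the other.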


Data Collection (& Management)

I've enjoyed discovering in this project that there's an "environmental constraint" on how this data can be manually collected, and more generally that such constraints exist: constraints that the context of collection places on the act of collecting data, and that likely threaten the data's quality. In particular, when I'm out skating, it's a bit of a pain to stop and manually record data, and every additional second spent on data collection is an additional pain, as it comes at the expense of time skating. A consequence is that I've found myself creating and using certain shortcuts, which others might also be inclined to use and which I should in turn factor back into my analyses of the data. In particular, I should think about what motivated those shortcuts to emerge, so that I can address those motivating factors rather than only whichever effects they happened to cause (various manifestations of general causes: what may have been seen and what may not have been seen).

I've been getting a bit ahead of myself, stressing over how I'd effectively collect the piece of information that identifies which unique Trick is being recorded. With the pen-and-paper method I currently use, I code each Trick accurately enough that I can re-code it later, but ultimately it'd be nice to have a quick way to code tricks that doesn't require re-coding later. I've been thinking about possible UIs for a web app that would let us do this precisely, and I've been stressed about how complicated building such things might be. Alternatively, for now I could set up the digital equivalent of the current pen-and-paper system: a quick entry method plus a system for re-coding the data later on, after skating.