Kaggle Horses for Courses analysis of last five starts with Azure Notebooks (part 4)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 4.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL
  2. Horses For Courses barrier analysis with Azure Notebooks
  3. Kaggle Horses for Courses age analysis with Azure Notebooks
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks (this blog post)

Data set and recap

A quick recap of Kaggle and the data set we’re analyzing: Horses For Courses. Kaggle is a data science and machine learning community that hosts a number of data sets and machine learning competitions, some of which offer prize money. ‘Horses For Courses’ is a (relatively small) data set of anonymized horse racing data.

In the first post I discussed how you could use Azure Data Lake Analytics and U-SQL to analyze and process the data. I used this mainly to generate new data files that can then be used for further analysis. In the second and third post I studied the effects of barrier and age on a horse’s chances of winning a race.

In this post I’m going to study the relation between a horse’s last five starts, as known before it starts a race, and its chances of winning that race. For every horse in a race we know the results of its previous five races from the runners.csv file in the Kaggle dataset. At first sight this seems a promising heuristic for determining how a horse will perform in the current race, so let’s see if that’s actually the case.

The analysis itself will again be performed using Azure Notebooks with an F# kernel. Here’s the link to my notebook library.

What data are we working with?

A typical last five starts might look like this: 3x0f2. So what does this mean? A short explanation:

  • 1 to 9: horse finished in position 1 to 9
  • 0: horse finished outside the top 9
  • f: horse failed to finish
  • x: horse was scratched from the race

So in 3x0f2 a particular horse finished third, was scratched, finished outside the top 9, failed to finish and finished second in its previous five races.

You may already spot a problem here. When we get a 1 to 9, we know what happened in a previous race. When we get a 0, we have some information but we don’t know exactly what happened. For an f or an x we know nothing. In both cases, if the horse had run, it might have finished at any position.

To be able to compare the last five starts of two horses, we have to fix this. Especially if we want to use this data as input to a machine learning algorithm, we should fix this¹.

When we do some more digging in the dataset, it appears that we do not have a complete last five starts for every horse. For some horses, we only have the last four starts or the last two. And for some horses we have nothing at all. Let’s take a look at the distribution of the length of last five starts in our dataset:

(length 5: 72837 horses) (length 4: 3379) (length 3: 3461) (length 2: 3553) (length 0: 5054)

I’ve written it down a bit tersely but you can see that for 72837 (or 83% of) horses we know the full last five starts. But still, it’s hard to compare 32xf6 with 4f, so we should fix the missing data as well.

Fixing the data

The accompanying Azure Notebook describes all fixes in detail, so I’ll give a summary here:

  • x and f: In both cases, a horse could have finished the race but didn’t². What we do here is replace each x and f with the average finishing position of a horse over all races as a best guess (for a race with n horses the average finishing position is (n + 1) / 2, and we average that over all races).
  • 0: The horse finished outside the top 9, so we replace each 0 with the average finishing position of horses outside the top 9 (here only races with more than 9 horses count; for a race with n horses that average is (10 + n) / 2).
  • missing data: This is essentially the same as not starting or failing to finish, so we take the average finishing position over all races again.

One small example of what’s happening: suppose we have 4xf0. With our current algorithm, this will be represented as (4.00, 6.49, 6.49, 11.66, 6.49) as follows:

  • 4 → 4.00: a 4 will remain a 4.
  • x → 6.49: an x will be replaced by 6.49, the average finishing position over all races.
  • f → 6.49: an f will be replaced by 6.49, the average finishing position over all races.
  • 0 → 11.66: a 0 will be replaced by 11.66, the average finishing position for horses that finish outside the top 9.
  • missing data → 6.49: missing data will be replaced by 6.49, the average finishing position over all races.

Comparing last five starts

Now that we can be sure that every last five starts has the same length, how do we compare them? The easiest way in my opinion is to take the average. So with our previous example we get:

4xf0 → (4.0, 6.5, 6.5, 11.7, 6.5) → average 7.04

And we can do this for every horse. So now we have one number for every horse in a race that describes its last five starts, how convenient :)³
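For reference, here is a minimal F# sketch of the fixing and averaging steps described above (illustrative, not the exact notebook code; the values 6.49 and 11.66 are the averages derived earlier):

// Pad missing data with 'x' so every string has length 5, then replace every
// character by a finishing position and return the average.
let averageLastFiveStarts (avgPosition: float) (avgPositionAbove9: float) (lastFiveStarts: string) =
    lastFiveStarts.PadLeft(5, 'x')
    |> Seq.map (fun c ->
        match c with
        | 'x' | 'f' -> avgPosition                        // scratched or failed to finish
        | '0' -> avgPositionAbove9                        // finished outside the top 9
        | d when d >= '1' && d <= '9' -> float (int d - int '0')
        | _ -> failwithf "Invalid character in last five starts: %c" c)
    |> Seq.average

// The example from the text: "4xf0" gives an average of roughly 7.0.
averageLastFiveStarts 6.49 11.66 "4xf0"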

Preparing the data file

With fixing and averaging in place, we switch back to U-SQL to prepare our dataset. Remember from the first post that we want pairs for all horses in a race so that we can reduce our ranking problem (in what order do all horses finish) to a binary classification problem (does horse a finish before or after horse b).

I’ll digress a bit into Azure Data Lake and U-SQL so if you just want to know how last five starts relates to finishing position you can skip this part. I’m assuming you already know how to create tables with U-SQL so I’ll skip to the part where I create the data file we will use for analysis.

First of all, we need the average finishing position over all races so we can fix x, f and missing data:

@avgNrHorses =
    SELECT (((double) COUNT(r.HorseId)) + 1d) / 2d AS AvgNrHorses
    FROM master.dbo.Runners AS r
    GROUP BY r.MarketId;
@avgPosition =
    SELECT AVG(AvgNrHorses) AS AvgPosition
    FROM @avgNrHorses;

For every race we derive the average finishing position from the number of horses in that race ((n + 1) / 2) and then average that over all races. Second, we need the average finishing position of horses outside the top 9:

@avgNrHorsesAbove9 =
    SELECT
        (((double) COUNT(r.HorseId)) - 10d) / 2d AS AvgNrHorses,
        COUNT(r.HorseId) AS NrHorses
    FROM master.dbo.Runners AS r
    GROUP BY r.MarketId;
@avgPositionAbove9 =
    SELECT AVG(AvgNrHorses) + 10d AS AvgPosition
    FROM @avgNrHorsesAbove9
    WHERE NrHorses > 9;

A little more complex, but essentially the same as the previous query, only restricted to races with more than 9 horses.

The final part is where we generate the data we need and output it to a CSV file:

@last5Starts =
  SELECT
    HorsesForCourses.Udfs.AverageLastFiveStarts(
      r0.LastFiveStarts, avg.AvgPosition, avg9.AvgPosition) AS LastFiveStarts0,
    HorsesForCourses.Udfs.AverageLastFiveStarts(
      r1.LastFiveStarts, avg.AvgPosition, avg9.AvgPosition) AS LastFiveStarts1,
    p.Won
  FROM master.dbo.Pairings AS p
  JOIN master.dbo.Runners AS r0
    ON p.HorseId0 == r0.HorseId AND p.MarketId == r0.MarketId
  JOIN master.dbo.Runners AS r1
    ON p.HorseId1 == r1.HorseId AND p.MarketId == r1.MarketId
  CROSS JOIN @avgPosition AS avg
  CROSS JOIN @avgPositionAbove9 AS avg9;

OUTPUT @last5Starts
TO "wasb://output@rwwildenml.blob.core.windows.net/last5starts.csv"
USING Outputters.Csv();

There are two interesting parts in this query: the AverageLastFiveStarts function call and the CROSS JOIN. First the CROSS JOIN: both @avgPosition and @avgPositionAbove9 are tables with just one row. A cross join returns the cartesian product of the rowsets in a join so when we join with a rowset that has just one row, this row’s data is simply appended to each row in the first rowset in the join.

The AverageLastFiveStarts user-defined function takes a last five starts string, fixes it in the way we described earlier and returns the average value:

using System;
using System.Linq;

namespace HorsesForCourses
{
  public class Udfs
  {
    public static double AverageLastFiveStarts(string lastFiveStarts,
                                               double? avgPosition,
                                               double? avgPositionAbove9)
    {
      // Make sure the string has a length of 5.
      var paddedLastFiveStarts = lastFiveStarts.PadLeft(5, 'x');
      var vector = paddedLastFiveStarts
        .Select(c =>
        {
          switch (c)
          {
            case 'x':
            case 'f':
              return avgPosition.Value;
            case '0':
              return avgPositionAbove9.Value;
            case '1': case '2': case '3': case '4': case '5':
            case '6': case '7': case '8': case '9':
              // Subtracting 48 (the character code of '0') turns the digit into its value.
              return ((double) c) - 48;
            default:
              throw new ArgumentOutOfRangeException(
                "lastFiveStarts", lastFiveStarts, "Invalid character in last five starts");
          }
        });
      return vector.Average();
    }
  }
}

The code is also up on Github so you can check the details there.

Analysis

We now have a data file that has, on each row, the last five starts average for two horses and which of the two won in a particular race. Some example rows:

3.90, 6.49, True
4.30, 6.49, False
6.70, 3.50, False
6.70, 5.40, False
7.63, 4.40, False
6.69, 5.49, True

On the first row, a horse with an average last five starts of 3.90 beat a horse with an average last five starts of 6.5. On the second row, 4.3 got beaten by 6.5, on the third row, 6.7 got beaten by 3.5, etc.

So how do we get a feeling for the relation between last five starts and the chances of beating another horse? I decided to do the following:

  1. Get the largest absolute difference between last five starts for two horses over the entire data set.
  2. Get all differences between last five starts pairs.
  3. Distribute all differences into a specified number of buckets.
  4. Get the number of wins and losses in each bucket and calculate a win/loss ratio per bucket.

In the example rows above, the largest difference is in row 5: 3.23. Since differences can be both positive and negative, we have a range of length 3.23 + 3.23 = 6.46 to divide into buckets. Suppose we decide on two buckets: [-3.23, 0) and [0, 3.23]. Now get each difference into the right bucket:

                    diff        bucket
3.90, 6.49, True,  -2.59  --> bucket 1
4.30, 6.49, False, -2.19  --> bucket 1
6.70, 3.50, False,  3.19  --> bucket 2
6.70, 5.40, False,  1.29  --> bucket 2
7.63, 4.40, False,  3.23  --> bucket 2
6.69, 5.49, True,   1.20  --> bucket 2

So we have 2 pairs in bucket 1 and 4 pairs in bucket 2. The win/loss ratio in bucket 1 is 1 / 1 = 1, the win/loss ratio in bucket 2 is 1 / 3 ≈ 0.33. So if the difference in last five starts is between -3.23 and 0, the win/loss ratio is 1.0. If the difference is between 0 and 3.23, the win/loss ratio is 0.33.
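For what it’s worth, here is a rough F# sketch of this bucketing approach (not the exact notebook code): compute all differences, spread them over a number of buckets and compute a win/loss ratio per bucket.

// rows: (average last five starts of horse 0, same for horse 1, did horse 0 win)
let winLossPerBucket (nrBuckets: int) (rows: (float * float * bool) list) =
    // Differences between the two averages, together with the won flag.
    let diffs = rows |> List.map (fun (a0, a1, won) -> a0 - a1, won)
    let maxAbs = diffs |> List.map (fst >> abs) |> List.max
    let bucketWidth = 2.0 * maxAbs / float nrBuckets
    diffs
    |> List.groupBy (fun (d, _) ->
        // Bucket 0 holds the most negative differences, the last bucket the most positive ones.
        min (nrBuckets - 1) (int ((d + maxAbs) / bucketWidth)))
    |> List.map (fun (bucket, items) ->
        let wins = items |> List.filter snd |> List.length
        let losses = List.length items - wins
        // Note: a bucket without any losses would give infinity here.
        bucket, float wins / float losses)
    |> List.sortBy fst

Applied to the six example rows above with two buckets, this returns a win/loss ratio of 1.0 for the first bucket and 0.33 for the second, matching the contrived example.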

This is of course a contrived example. In reality we have almost 600000 rows so we will get much more reliable numbers. I experimented a little with the number of buckets and 41 turned out to be a good number. This resulted in the following plot. I skipped the outer three buckets on both sides because there aren’t enough data points in there.

Bucket win/loss ratio

The bars represent the buckets, the line represents the number of data points in each bucket. I highlighted bucket 24 as an example. This bucket holds the rows where the difference between the average last five starts of the two horses is between 1.59 and 2.05. It contains 34777 rows and the win/loss ratio is 1.54.

This means that if the difference between average last five starts of two horses is between 1.59 and 2.05, the horse with the higher average is 1.54 times more likely to beat the other horse! This is pretty significant. If we take two random horses in a race, look at what they did in their previous five races and they happen to fall into this bucket, we can predict that one horse is 1.54 times more likely to win.

We need to put these numbers a little bit into perspective, because it matters how many records of the total population fall into bucket 24. This is about 5.83%. However, the data set is symmetric in the sense that it includes two rows for each horse pair (so if we have a,b,True we also have b,a,False). So bucket 16 is the inverse of bucket 24 with the same number of records: 34777. This means we can actually tell for 11.66% of the total population that one horse is 1.54 times more likely to win than another horse.

Conclusion

So far, we have analyzed three features for their effect on horse race outcomes: barrier, age and last five starts. Barrier and age had a clear effect and now we found that average last five starts also has an effect. Each one of these separately cannot be used to predict horse races but maybe combined they present a better picture.

Age and barrier are independent of each other. The barrier you start from is the result of a random draw and it has no effect on the age of a horse. Vice versa, the age of a horse has no effect on the barrier draw. We already established that both age and barrier have an effect on race outcomes so you might be inclined to think that both also have an effect on the last five starts. This is not true for barrier but it may be true for age. We determined in the previous post that younger horses outperform older horses. It makes sense then that the last five starts of younger horses are better than those of older horses.

Ideally we would like to present a machine learning algorithm a set of independent variables. Using both age and last five starts may not be a good idea.

In the next post we’ll get our hands dirty with Azure Machine Learning to see if we can get ‘better than random results’ when we present the features we analyzed to a machine learning algorithm. Stay tuned!

Footnotes

  1. Actually there is no machine learning ‘law’ that requires us to fix the data. We could just leave the x, f and 0 as they are and have the algorithm figure out what they mean. However, think about what this would mean. Suppose we have two horses: 067xf and 9822x and the first won. The input for our machine learning algorithm would be: 0,6,7,x,f,9,8,2,2,x,True. That’s 10 feature dimensions, just to describe the last five starts! High-dimensional sample spaces are a problem for most machine learning algorithms and this is usually referred to as the curse of dimensionality, very nicely visualized in these two Georgia Tech videos. So the fewer dimensions, the better.
  2. You could argue that being scratched from a race (x) and failing to finish (f) are two different things. Especially an f could give us more information about future races. Suppose we see the following last five starts: 638ff. The horse failed to finish in its last two races. This doesn’t inspire much confidence for the current race. On the other hand, f8f63 tells a different story but has the same results, just in a different order. Maybe in a future blog post I’ll dig deeper into better methods for handling x and f.
  3. I have given some thought to other ways of comparing last five starts but averaging is at least the simplest and maybe the best solution. You could argue that trends should be taken into account so that 97531 is better than 13579. The first shows a clear positive, the second a clear negative trend. However, deriving a trend from a series of five events seems a bit ambitious so I decided against it.

Kaggle Horses for Courses age analysis with Azure Notebooks (part 3)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 3.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL
  2. Horses For Courses barrier analysis with Azure Notebooks
  3. Kaggle Horses for Courses age analysis with Azure Notebooks (this blog post)
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks

Data set and recap

A quick recap of Kaggle and the data set we’re analyzing: Horses For Courses. Kaggle is a data science and machine learning community that hosts a number of data sets and machine learning competitions, some of which offer prize money. ‘Horses For Courses’ is a (relatively small) data set of anonymized horse racing data.

In the first post I discussed how you could use Azure Data Lake Analytics and U-SQL to analyze and process the data. I used this mainly to generate new data files that can then be used for further analysis. In the second post I studied the effect of the barrier a horse starts from on its chances of winning a race.

In this post I’m going to do the same but now for age: how does the age of a horse affect its chances of winning a race? The analysis will again be based on a file that was generated from the raw data using a U-SQL script in Azure Data Lake. The file has a very simple format: column 1 has the age of the first horse, column 2 the age of the second horse and column 3 tells us whether the first horse won in a particular race. So for example:

3,7,True
10,4,False

The first row tells us that in a particular race, a 3-year-old horse beat a 7-year-old horse. The second row tells us a 10-year-old horse got beaten by a 4-year-old.

The analysis will again be performed using an Azure Notebook with an F# kernel. Here is the link to my notebook library.

Ages notebook

As in the previous post, the details can be found in the accompanying Azure Notebook. You can clone the notebook library using a Microsoft account. Remember that Shift+Enter is the most important key combination; it executes the current cell and moves to the next cell.

The first thing we’d like to know is how many horses there are for a particular age. This information can be found in the raw data from Kaggle: horses.csv. If we plot the results we get the following:

You can see that for ages 3, 4, 5, 6 and maybe 7 we have a reasonable amount of data.

The next step is analyzing the ages.csv file we generated that has one row for each age combination in each race. For this we apply a similar tactic to the one we used in the previous post: check for each age how many times a horse of that age beat horses of other ages. This results in the following matrix:

Some examples to clarify what we see here:

  • On the first row we see how many times 2-year-old horses beat other horses. So 2-year-old horses beat 3-year-old horses 793 times, they beat 4-year-old horses 129 times, etc.
  • On the second row we have the 3-year-old horses. They beat 2-year-old horses 1424 times, other 3-year-old horses 32247 times, 4-year-old horses 11588 times, etc.

The absolute numbers in this matrix do not tell us a lot, since they are skewed by the number of horses of a particular age that actually ran races. So what we do next is divide the number of wins by the number of losses per age pair: the win-loss ratio. These are the numbers for ages 2 to 7:

The second value in the first row is obtained by dividing 793 by 1424. The first value in the second row is its inverse: 1424 divided by 793. Now let’s visualize the data. I started out with a 3D surface plot (as in the previous post) but that got a bit convoluted so I used simple line charts instead:

Conclusions

In the plot I compared ages 2 to 8. I highlighted the results of 2-, 3- and 4-year-old horses against 4-year-olds. So, for example, you can see that a 2-year-old horse has a win/loss ratio of 0.701087 against 4-year-old horses. What is obvious is that younger horses outperform older horses (except for 2-year-old horses): 3-year-old horses have a positive win/loss ratio against any other age.

However, if we take the positive win/loss ratio of 1.078054 of 3-year-olds against 4-year-olds, it doesn’t really help us predict horse races. If we revisit the absolute numbers, we can see that 3-year-olds beat 4-year-olds 11588 times, but 4-year-olds beat 3-year-olds 10749 times.

But still, the effect of age is obvious so there must be some way to use it in predicting race outcomes. Maybe instead of age we could use the win/loss ratio directly. However, we may lose information if we reduce ages 2 and 4 in each race to the number 0.701087. Maybe age combined with another feature is a strong predictor for race outcomes. For example, maybe 2-year-old horses perform very well on muddy race tracks. By reducing age pairs to just a win/loss ratio this information may be lost.

So even if age is a factor to consider, I doubt whether it is actually useful as direct input for a machine learning algorithm.

Kaggle Horses For Courses barrier analysis with Azure Notebooks (part 2)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 2.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL
  2. Horses For Courses barrier analysis with Azure Notebooks (this blog post)
  3. Kaggle Horses for Courses age analysis with Azure Notebooks
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks

Data set and recap

A quick recap of Kaggle and the data set we’re analyzing: Horses For Courses. Kaggle is a data science and machine learning community that hosts a number of data sets and machine learning competitions, some of which offer prize money. ‘Horses For Courses’ is a (relatively small) data set of anonymized horse racing data. In the previous post I discussed how you could use Azure Data Lake Analytics and U-SQL to analyze and process the data. I used this mainly to generate new data files that can then be used for further analysis.

The first file I created, based on the source data, was a file called barriers.csv. It contains, for each pair of horses in each race, the starting barriers of both horses and which of the two finished first. I am now going to analyze this file using Azure Notebooks with an F# kernel.

Azure Notebooks

You may now wonder, what is he talking about?! So hang on and let me explain. An Azure Notebook is a way of sharing and running code on the internet. A number of programming languages are supported like Python, R and F#. A programming language in a notebook is supported via a kernel and in my case I use the F# kernel.

Azure Notebooks isn’t exactly new technology because it’s an implementation of Jupyter Notebooks. Jupyter evolved as a way to do rapid online interactive data analysis and visualization, which is exactly what I’m going to do with my barriers.csv file.

Barriers notebook

The largest part of this post is actually inside the notebook itself, so let me give you the link to my notebook library. At the time of writing there is one notebook there: Barriers.ipynb. You can run this notebook by clicking on it and logging in with a Microsoft account (formerly known as Live account).

When you do that, the library is cloned (don’t worry, it’s free) and you can run the notebook. The most important thing to remember if you want to run an Azure (or Jupyter) notebook is the key combination Shift+Enter. It executes the current notebook cell and moves to the next cell.

I invite you to run the notebook now to see what the data looks like and how it is analyzed. It takes about five minutes. But if you do not have time or do not feel like cloning and running an Azure notebook, I will provide the summary here.

We have a data set of about half a million rows, each row holding two barriers and a flag indicating whether the first horse won. These are grouped together to determine, for every pair of barriers, how often a horse starting from the first barrier beat a horse starting from the second. The final result of this step is shown below (for 13 barriers):

On the first row, we see that barrier 1 beats barrier 2 a total of 3212 times. It beats barrier 3 3144 times, etc. You can immediately spot a problem with this data. We would expect that starting from a lower barrier gives you an advantage. However, for example, barrier 1 beats barrier 12 only 981 times. The reason is that there are fewer races with 12 horses than there are with, for example, 6 horses.
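As a simplified F# sketch of that counting step (assuming the CSV rows have already been parsed into (barrier0, barrier1, won) tuples; not the exact notebook code):

// Count, for every (barrier0, barrier1) combination, how often the horse starting
// from barrier0 beat the horse starting from barrier1. Because the data set contains
// both orderings of every pair, filtering on the won flag is enough.
let countWins (rows: (int * int * bool) seq) =
    rows
    |> Seq.filter (fun (_, _, won) -> won)
    |> Seq.countBy (fun (b0, b1, _) -> b0, b1)
    |> Map.ofSeq

// countWins [ (1, 2, true); (2, 1, false); (1, 2, true) ] gives map [ ((1, 2), 2) ]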

We need the relative instead of the absolute numbers: the win-loss ratio per barrier combination. So we divide the number of times barrier x beats barrier y by the number of times barrier y beats barrier x. The result is below (for 6 barriers this time so everything fits nicely on a row).

You can see that barrier 1 gives positive win-loss ratios against all other barriers. To make this even more clear, let’s visualize the data (for the first 14 barriers).

I hope I found the right angle that makes the visualization the easiest to understand. In the notebook it is interactive so you can turn it around and zoom in. The diagonal line represents barriers racing against themselves so this is always 1. Behind this diagonal the graph rises up, indicating positive win-loss ratios. The graph comes down in front of the diagonal, indicating negative win-loss ratios.

The y-axis represents the first barrier, the x-axis the second barrier. So if you take a look at y = 0 (barrier 1), you can see it has a positive win-loss ratio against all other barriers (all x values). If you look at y=6 (barrier 7), it has a negative win-loss ratio against x = 0..5 (barriers 1 through 6) and a positive win-loss ratio against x = 7..13 (barriers 8 through 14).

The same is true for almost all barriers, indicating that it’s better to start from a lower barrier: it definitely increases your chances of winning the race.

However, it is definitely not the only feature we need to predict finishing positions in a race. Even though barrier 4 beats barrier 8 2739 times, barrier 8 also beats barrier 4 2421 times. We need to find additional features in the data set if we want to make accurate predictions. That’s a topic for a future post.

Kaggle Horses For Courses data set analysis with Azure Data Lake and U-SQL (part 1)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 1.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL (this blog post)
  2. Horses For Courses barrier analysis with Azure Notebooks
  3. Kaggle Horses for Courses age analysis with Azure Notebooks
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks

The data set

Kaggle is a data science community that hosts numerous data sets and data science competitions. One of these data sets is ‘Horses For Courses’. It’s a (relatively) small set of anonymized horse racing data. It tells you which horses participated in a race, how old they were, from what barrier they started, what the weather was like at the time of the race, betting odds and a lot more. The official goal is to predict the finishing position of a horse in a race. The real goal is of course to beat the odds and win a lot of money betting on horse races via a machine learning model :)

Azure Data Lake

The data set, unzipped, is a little over 100MB, nothing you can’t handle on a laptop. But since data sets aren’t usually this small, I was looking for a cloud solution that allows analysis of large data sets, just to get some experience with that.

As it happens, Azure Data Lake provides storage for (really) large data sets and Azure Data Lake Analytics provides analytics capabilities through a SQL-dialect called U-SQL. Data Lake Store is built on top of Apache Hadoop YARN and Data Lake Analytics uses MapReduce-style execution of workloads. It’s a pay-as-you-go model so if you don’t run large jobs or store a lot of data, it costs you next to nothing.

Preparing the data

The data set can be downloaded as a zip file from Kaggle. I unzipped the file and uploaded the contents to an Azure Blob Storage container:

blob storage

Why Azure Blob Storage and not Azure Data Lake itself? Well, I’d like to do some data analysis using Azure Notebooks with F# and blob storage provides easy access to blobs without authentication (haven’t found that for Data Lake yet).

Data Lake tables

Now that we have the csv files in place, we first create Azure Data Lake tables from these files. This isn’t strictly necessary because Data Lake uses so-called schema-on-read so we could just extract data from the csv files and analyze this directly. However, when you need to extract data multiple times from the same set of csv files, table definitions are really helpful.

Our first target for analysis is runners.csv. It has a row for every horse in every race and includes finishing position (the field we want to predict with our money-making ML model). BTW: all source code in this post and the ones that follow is also available on Github. Let’s create a table for the runners.csv file:

CREATE TYPE IF NOT EXISTS RunnersType
AS TABLE (Id int, Collected DateTime, MarketId int, Position int?, ...,
          HorseId int, ..., Barrier int, ...);

DROP TABLE IF EXISTS Runners;
DROP FUNCTION IF EXISTS RunnersTVF;

CREATE FUNCTION RunnersTVF()
RETURNS @runners RunnersType
AS
BEGIN
    @runners =
        EXTRACT Id int,
                Collected DateTime,
                MarketId int,
                Position int?,
                ...,
                HorseId int,
                ...,
                Barrier int,
                ...
        FROM "wasb://raw@rwwildenml.blob.core.windows.net/runners.csv"
        USING Extractors.Csv(skipFirstNRows : 1);
END;

CREATE TABLE Runners
(
    INDEX RunnersIdx CLUSTERED(Id ASC)
    DISTRIBUTED BY HASH(HorseId)
)
AS RunnersTVF();

A lot of things happen here. Let’s get into the details one by one:

  • First of all, this is not the only way to create a table. You can also create a table directly from an EXTRACT statement. It’s a little more work this way but I like that you explicitly define a table type.
  • The column data types you see are actual C# types. You can use any C# type in a U-SQL script (but only a subset can be used for column types).
  • The EXTRACT statement reads a CSV file from a blob storage account via the url: wasb://raw@rwwildenml.blob.core.windows.net/runners.csv where:
    • wasb stands for Windows Azure Storage Blob,
    • raw is the name of the blob storage container where the files live,
    • rwwildenml.blob.core.windows.net is the name of my storage account and
    • runners.csv is the name of the file.
  • Before you can read from blob storage in Data Lake Analytics, you have to add the blob storage account as a data source.
  • I skipped some columns for readability, indicated by the ...s; the actual file has 40 columns.
  • A table needs an index and a partitioning/distribution scheme. An obvious column to partition the data on isn’t really available so I specified only a distribution.

When we run the above script an actual table is created and stored inside Azure Data Lake. You can join this table to other tables, you can group, you can use aggregates, etc. And while I have a very small table, the same principles apply to giga- or tera-byte tables. In that case you would have to give some more careful consideration to partitioning.

Visual Studio U-SQL projects

A next logical question would be: how do I run this script? The two easiest ways are: directly from the Azure portal or from within Visual Studio. I used the latter and installed Visual Studio 2017. A simple File > New > Project gives you the option to create a new U-SQL Project¹.

Of course, you also need an Azure Data Lake Store and Azure Data Lake Analytics resource in Azure. Both can simply be created from the Azure Portal.

From within Visual Studio, you can actually connect to your Data Lake Analytics resource(s) in Azure using Server Explorer:

I highlighted the resources we created with the script above and the linked blob storage account (rwwildenml).

Analysis

With all the plumbing out of the way, we can start analyzing the data. Remember, our goal is to predict finishing positions of horses in a race. This is essentially a ranking problem. Suppose we have four horses in a race: A, B, C and D and let’s assume they finish in alphabetical order. We can then generate pairs for each horse combination and label them as follows:

 1.  A  B  won
 2.  A  C  won
 3.  A  D  won
 4.  B  A  lost
 5.  B  C  won
 6.  B  D  won
 7.  C  A  lost
 8.  C  B  lost
 9.  C  D  won
10.  D  A  lost
11.  D  B  lost
12.  D  C  lost

On row 1, horse A finished before horse B so we label this as won. On row 8, horse C finished after horse B so this is labeled lost. We can do this for every horse race in the runners.csv file.
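Just to illustrate the idea (the actual U-SQL follows below), generating these labeled pairs from a single race’s finishing order is a simple double loop:

// Given the horses of one race in finishing order, produce every ordered pair of
// different horses, labeled with whether the first horse finished before the second.
let pairs (finishingOrder: string list) =
    [ for i, horse0 in List.indexed finishingOrder do
        for j, horse1 in List.indexed finishingOrder do
          if i <> j then
            yield horse0, horse1, (if i < j then "won" else "lost") ]

// pairs ["A"; "B"; "C"; "D"] produces the 12 rows listed above.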

The following U-SQL script creates a table called Pairings that contains each pair of horses per race, whether the first horse in the pair won and what the distance was between finishing positions.

DROP TABLE IF EXISTS Pairings;

CREATE TABLE Pairings
(
    INDEX PairingsIdx CLUSTERED(HorseId0 ASC, HorseId1 ASC)
    PARTITIONED BY HASH (HorseId0, HorseId1)
)
AS
SELECT
    r0.MarketId,
    r0.HorseId AS HorseId0,
    r1.HorseId AS HorseId1,
    HorsesForCourses.Functions.Won(
        r0.Position.Value, r1.Position.Value) AS Won,
    HorsesForCourses.Functions.Distance(
        r0.Position.Value, r1.Position.Value) AS Distance
FROM master.dbo.Runners AS r0
JOIN master.dbo.Runners AS r1 ON r0.MarketId == r1.MarketId
WHERE r0.HorseId != r1.HorseId
  AND r0.Position.HasValue
  AND r1.Position.HasValue;

We simply join the Runners table against itself where both race ids (market ids) are equal. Horses are not compared against themselves and both positions must have a value.

You may wonder, what are HorsesForCourses.Functions.Won and HorsesForCourses.Functions.Distance? Well, you can actually use C# user defined functions (and lots more) from U-SQL. Won and Distance are two very simple functions that return a bool and an int, respectively.
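The actual implementations are C# methods in the U-SQL code-behind (see GitHub), but the logic is trivial; sketched in F# it would be something along these lines (the sign convention for Distance is my assumption):

// Did the first horse finish before the second one?
let won (position0: int) (position1: int) = position0 < position1

// How many positions apart did the two horses finish?
let distance (position0: int) (position1: int) = position1 - position0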

When we run this script, we get a table with the following columns:

  • MarketId: the id of the race the horses participated in
  • HorseId0: a horse in the race
  • HorseId1: another horse in the same race
  • Won: whether the first horse (HorseId0) won or lost
  • Distance: the difference between horse positions

Barriers

Using the Pairings table we just created, we can derive other data. The first data set we’re going to create has the following columns:

  • Barrier0: what was the starting barrier for the first horse
  • Barrier1: what was the starting barrier for the second horse
  • Won: which horse won

When we have this set, we can try and determine if barrier has an effect on a horse’s chances of winning a race. The analysis of this data set is the topic of the next post so we’ll finish here with a script that generates the CSV file in blob storage:

@barriers =
    SELECT r0.Barrier AS Barrier0,
           r1.Barrier AS Barrier1,
           p.Won
    FROM master.dbo.Pairings AS p
    JOIN master.dbo.Runners AS r0 ON p.HorseId0 == r0.HorseId AND
                                     p.MarketId == r0.MarketId
    JOIN master.dbo.Runners AS r1 ON p.HorseId1 == r1.HorseId AND
                                     p.MarketId == r1.MarketId;

OUTPUT @barriers
TO "wasb://output@rwwildenml.blob.core.windows.net/barriers.csv"
USING Outputters.Csv();

First we generate a new data set called @barriers by joining our Pairings table twice to Runners (for horse 0 and horse 1). Next we output this data set to a CSV file in blob storage. We now have a file that contains, for each horse pair in each race, both starting barriers and whether this resulted in a win or a loss for the first horse. If we take one row, for example:

3,7,True

This indicates that in a particular race, the horse starting from barrier 3 has beaten the horse starting from barrier 7.

One final screenshot from Visual Studio may be interesting and that is the job graph that describes what happened when we ran the last script:

As you can see, there were two inputs: master.dbo.Runners and master.dbo.Pairings. Each block represents an operation on data. The final block generates the barriers.csv file with 596546 rows.

Footnotes

  1. Only if Visual Studio 2017 was installed with the Data storage and processing workload.

Client certificate authentication for Azure Service Fabric cluster API endpoint

In two previous posts I explained how to set up SSL for a local Azure Service Fabric cluster and how to configure this for a cluster running on Azure. In this post I describe how to set up client certificate authentication for the same API endpoint. Client certificate authentication means that a client can only access the API by presenting a client authentication certificate (certificate purpose 1.3.6.1.5.5.7.3.2).

At the moment of writing this post there is no built-in support for client certificate authentication in Service Fabric that I could find. So although everything described below actually works, it won’t win any beauty contests :)

Let’s begin: first of all, client certificate authentication won’t work without server authentication. So before you continue, make sure the Service Fabric API endpoint is protected by a server authentication certificate (check this post for details).

Before we start

On Windows, a server authentication certificate is bound to one or more specific TCP ports. When a client (a browser for example) sends an http request to this port, the server responds with the configured certificate (among other things). This is of course a gross oversimplification but it serves our purpose.

You can check which certificates are bound to which ports using the netsh command: netsh http show sslcert.

In the screenshot you can see that on my local machine, the certificate with thumbprint 6ffb99586b7580f67e8e6bb65a19067c62fb872b is bound to ports 44389 and 44399. If you look more closely at the output, you see that there is a property Negotiate Client Certificate for each port binding. If we can set this property to Enabled for the right port binding, we’re done.

If we take the binding for port 44399 as example, the following two statements accomplish that (line breaks are just for readability, each statement should be on a single line):

netsh http delete sslcert ipport=0.0.0.0:44399
netsh http add sslcert ipport=0.0.0.0:44399 `
                       certhash=6ffb99586b7580f67e8e6bb65a19067c62fb872b `
                       appid="{214124cd-d05b-4309-9af9-9caa44b2b74a}" `
                       clientcertnegotiation=enable

If we take a look at the output now it looks like this (showing just port 44399):

That was easy! The real problem is: how to do the same on the virtual machines in an Azure Service Fabric cluster?

Setup entry points

To make this work, we can use a feature of Service Fabric called a setup entry point. Besides the long-running process that each microservice actually is, you can have special setup tasks that run each time a service is started on a cluster node. We will use a setup entry point to enable client certificate negotiation. In my configuration (ServiceManifest.xml) this looks as follows:

<CodePackage Name="Code" Version="1.0.0">
  <SetupEntryPoint>
    <ExeHost>
      <Program>EnableClientCertAuth.bat</Program>
      <Arguments>0.0.0.0:8677</Arguments>
      <WorkingFolder>CodePackage</WorkingFolder>
    </ExeHost>
  </SetupEntryPoint>
  <EntryPoint>
    <ExeHost>
      <Program>MyServices.SF.Api.exe</Program>
    </ExeHost>
  </EntryPoint>
</CodePackage>

Besides the regular EntryPoint we now also have a SetupEntryPoint. It has a batch file as the program and we pass the ipport as argument. In my case this is 0.0.0.0:8677.

Batch file and PowerShell script

The batch file EnableClientCertAuth.bat should be located at the project root. It’s very simple as it just calls a PowerShell script to do the real work:

powershell.exe -ExecutionPolicy Bypass `
               -Command ".\EnableClientCertAuth.ps1 -IpPort %1"

The PowerShell script should also be located at the project root and both files must be copied to the build directory. In Visual Studio solution explorer:

First I’ll show the PowerShell script itself, then an explanation of what happens.

param([String]$IpPort)

$match = (netsh http show sslcert |
          Select-String -Pattern $IpPort -Context:0,1 -SimpleMatch)
if ($match -eq $null) {
  Write-Warning "IpPort $ipPort not found in output of 'netsh http show sslcert'"
  exit
}
else {
  $certHash = $match.Context.PostContext.Split(@(": "), 1)[1]
}

Write-Output "Deleting SSL cert $certHash on port $IpPort"
netsh http delete sslcert ipport=$IpPort

Write-Output @"
Adding SSL cert $certHash on port $IpPort with clientcertnegotiation=enable
"@
netsh http add sslcert ipport=$IpPort `
                       certhash=$certHash `
                       appid="{11223344-5566-7788-9900-aabbccddeeff}" `
                       clientcertnegotiation=enable

The script has four steps:

  1. Use Select-String to find the output lines that match the specified ipport. We are looking for the certificate hash which always appears one line below the ipport.
  2. Get the certificate hash from the result by splitting on : if we actually found a result.
  3. Delete the binding for the specified ipport.
  4. Add the binding back but now with client certificate negotiation enabled. We need the certificate hash here so that is why we did all the parsing.

Privileged entry point

We now have a batch file, a PowerShell script and a setup entry point that runs the batch file. The only thing we haven’t covered yet is that for binding a certificate to a port you need administrator privileges. So the batch file should run under a privileged account.

This is all very well described in the Service Fabric documentation so I’ll just repeat it here for the sake of completeness. First you add a principal to your application manifest:

<Principals>
  <Users>
    <User Name="ApiSetupAdminUser" AccountType="LocalSystem" />
  </Users>
</Principals>

And next you specify an additional policy in your ServiceManifestImport:

<Policies>
  <RunAsPolicy CodePackageRef="Code" UserRef="ApiSetupAdminUser"
               EntryPointType="Setup" />
</Policies>

Conclusion

If we deploy the updated Service Fabric application to our cluster (or locally), it will run the batch file on every node before our actual service starts. The certificate port binding will be removed and re-added with client certificate negotiation enabled.

To be honest, I’m not perfectly happy with the approach above for two reasons:

  1. I’m depending on string parsing to retrieve the information I need. I don’t think the output format of netsh http show sslcert will change much, but it doesn’t feel like a very stable solution.
  2. I’m actually hard-coding the ipport in my ServiceManifest.xml and I can’t easily change this between environments.

Unfortunately, it’s the only way I can think of to make this work. I’d rather have declarative support in the Service Fabric endpoint configuration instead. Something like this:

<Endpoint Protocol="https" Name="WebEndpointHttps" Type="Input"
          Port="8677" EnableClientCertificateNegotiation="True" />

There is actually a UserVoice request for supporting client certificate authentication so if you think this is important, please vote for it.

Important note

You may think that we have now protected our API because a client must first present a valid client certificate. This is literally true: any client with a valid client certificate can access our API. We have only implemented the authentication part: a client must prove who it is before it gets in.

You still need to authorize clients somehow: determine what an authenticated client is actually allowed to do. You can, for example, maintain a list of valid client certificates and deny access to any other certificate. Or you can map client certificates to users in a database.
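For example, authorization against a simple thumbprint whitelist could look something like the sketch below (illustrative only; how you obtain the client certificate and where you hook this check in depends on your web framework):

open System.Security.Cryptography.X509Certificates

// Thumbprints of the client certificates that are allowed to call the API
// (the example thumbprint is the one from the netsh output earlier).
let allowedThumbprints = set [ "6FFB99586B7580F67E8E6BB65A19067C62FB872B" ]

// A client is authorized when it presented a certificate whose thumbprint is whitelisted.
let isAuthorized (clientCertificate: X509Certificate2 option) =
    match clientCertificate with
    | Some cert -> allowedThumbprints.Contains(cert.Thumbprint.ToUpperInvariant())
    | None -> false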