Deploy Tridion SDL Web 8.5 Discovery Service on Pivotal CloudFoundry (part 2)

This is part 2 of a series of (still) unknown length where I try to describe how to deploy the SDL Tridion Web 8.5 Discovery Service on CloudFoundry. All parts:

  1. Deploy Tridion SDL Web 8.5 Discovery Service on Pivotal CloudFoundry (part 1)
  2. Deploy Tridion SDL Web 8.5 Discovery Service on Pivotal CloudFoundry (part 2) (this post)

I finished the previous post thinking I was done, apart from a few small changes. Unfortunately, that wasn't true. Remember that we had to provide an explicit command line because of classpath requirements. That classpath wasn't complete yet. Let's analyze the start.sh file again:

#!/usr/bin/env bash

# Java options and system properties to pass to the JVM when starting the service. For example:
# JVM_OPTIONS="-Xrs -Xms128m -Xmx128m -Dmy.system.property=/var/share"
JVM_OPTIONS="-Xrs -Xms128m -Xmx128m"
SERVER_PORT=--server.port=8082

# set max size of request header to 64Kb
MAX_HTTP_HEADER_SIZE=--server.tomcat.max-http-header-size=65536

BASEDIR=$(dirname $0)
CLASS_PATH=.:config:bin:lib/*
CLASS_NAME="com.sdl.delivery.service.ServiceContainer"
PID_FILE="sdl-service-container.pid"

cd $BASEDIR/..
if [ -f $PID_FILE ]
  then
    if ps -p $(cat $PID_FILE) > /dev/null
        then
          echo "The service already started."
          echo "To start service again, run stop.sh first."
          exit 0
    fi
fi

ARGUMENTS=()
for ARG in $@
do
    if [[ $ARG == --server\.port=* ]]
    then
        SERVER_PORT=$ARG
    elif [[ $ARG =~ -D.+ ]]; then
    	JVM_OPTIONS=$JVM_OPTIONS" "$ARG
    else
        ARGUMENTS+=($ARG)
    fi
done
ARGUMENTS+=($SERVER_PORT)
ARGUMENTS+=($MAX_HTTP_HEADER_SIZE)

for SERVICE_DIR in `find services -type d`
do
    CLASS_PATH=$SERVICE_DIR:$SERVICE_DIR/*:$CLASS_PATH
done

echo "Starting service."

java -cp $CLASS_PATH $JVM_OPTIONS $CLASS_NAME ${ARGUMENTS[@]} & echo $! > $PID_FILE

Near the top of the script, the classpath is set to .:config:bin:lib/*. We ended the previous post with a classpath of $PWD/*:.:$PWD/lib/*:$PWD/config/*, which is not quite the same. Furthermore, the for loop over the services folder adds every directory below it to the classpath as well. Taking all this into account, we get the following classpath: $PWD/*:.:$PWD/lib/*:$PWD/config:$PWD/services/discovery-service/*:$PWD/services/odata-v4-framework/* and the following manifest.yml:

---
applications:
- name: discovery_service
  path: ./
  buildpack: java_buildpack_offline
  command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:.:$PWD/lib/*:$PWD/config:$PWD/services/discovery-service/*:$PWD/services/odata-v4-framework/* com.sdl.delivery.service.ServiceContainer -Xrs -Xms128m -Xmx128m
  env:
    JBP_CONFIG_JAVA_MAIN: '{ java_main_class: "com.sdl.delivery.service.ServiceContainer", arguments: "-Xrs -Xms128m -Xmx128m" }'
    JBP_LOG_LEVEL: DEBUG

Now that we have fixed the classpath, let’s see if the discovery service still runs when we push it.
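Since all settings now live in manifest.yml, the push needs no extra flags; run it from the root of the microservice folder, where the manifest lives:

$ cf push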

0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 starting
0 of 1 instances running, 1 crashed
FAILED
Error restarting application: Start unsuccessful

TIP: use 'cf logs discovery_service --recent' for more information

Ok, that is unfortunate: we broke it again. Let's check the log files.
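The tip in the push output tells us how:

$ cf logs discovery_service --recent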

[APP/PROC/WEB/0] OUT                                             '#b
[APP/PROC/WEB/0] OUT                                              @# ,###
[APP/PROC/WEB/0] OUT     ##########  @##########Mw     ####   ########^
[APP/PROC/WEB/0] OUT    #####%554WC  @#############p  j####       ##"@#m
[APP/PROC/WEB/0] OUT   j####,        @####     1####  j####      ##    "
[APP/PROC/WEB/0] OUT    %######M,    @####     j####  j####
[APP/PROC/WEB/0] OUT      "%######m  @####     j####  j####
[APP/PROC/WEB/0] OUT          "####  @####     {####  j####
[APP/PROC/WEB/0] OUT   ]##MmmM#####  @#############C  j###########
[APP/PROC/WEB/0] OUT   %#########"   @#########MM^     ###########
[APP/PROC/WEB/0] OUT :: Service Container :: Spring Boot  (v1.4.1.RELEASE) ::
[APP/PROC/WEB/0] OUT Exit status 0
[CELL/0] OUT Exit status 0
[CELL/0] OUT Stopping instance ef44cf20-b9da-48c6-5edc-a6d7
[CELL/0] OUT Destroying container
[API/0] OUT Process has crashed with type: "web"
[API/0] OUT App instance exited with guid e9a00d0c-86b4-4dad-ae5d-e4208f09590f payload: {"instance"=>"ef44cf20-b9da-48c6-5edc-a6d7", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"Codependent step exited", "crash_count"=>4, "crash_timestamp"=>1513173007899100032, "version"=>"692f3c6a-acf3-4adc-b870-3827355948d6"}
[CELL/0] OUT Successfully destroyed container

Not very informative… This tells us that something went wrong, but not what. It should be possible to get more logging than this and luckily, it is.

In my config/logback.xml file, a number of RollingFileAppenders were configured (this may be different for your configuration). These were set up to log to a local folder. That isn't going to fly on CloudFoundry, of course: we should log to stdout and let the platform manage the rest. So I modified logback.xml:

<?xml version="1.0" encoding="UTF-8"?>
<configuration scan="true">
    <!-- Properties -->
    <property name="log.pattern" value="%date %-5level %logger{0} - %message%n"/>
    <property name="log.level" value="DEBUG"/>
    <property name="log.encoding" value="UTF-8"/>

    <!-- Appenders -->
    <appender name="stdout" class="ch.qos.logback.core.ConsoleAppender">
        <encoder>
            <charset>${log.encoding}</charset>
            <pattern>${log.pattern}</pattern>
        </encoder>
    </appender>

    <!-- Loggers -->
    <logger name="com" level="${log.level}">
        <appender-ref ref="stdout"/>
    </logger>

    <root level="ERROR">
        <appender-ref ref="stdout"/>
    </root>
</configuration>

This should take care of logging everything to stdout. If we push the app now, we get a lot of logging and in my case, the discovery service still crashes. But at least now I can see why:

[APP/PROC/WEB/0] OUT DEBUG SQLServerConnection - ConnectionID:1 Connecting with server: DBSERVER port: 1433 Timeout slice: 4800 Timeout Full: 15
[APP/PROC/WEB/0] OUT DEBUG SQLServerConnection - ConnectionID:1 This attempt No: 3
[APP/PROC/WEB/0] OUT DEBUG SQLServerException - *** SQLException:ConnectionID:1 com.microsoft.sqlserver.jdbc.SQLServerException: The TCP/IP connection to the host DBSERVER, port 1433 has failed. Error: "DBSERVER. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.". The TCP/IP connection to the host DBSERVER, port 1433 has failed. Error: "DBSERVER. Verify the connection properties. Make sure that an instance of SQL Server is running on the host and accepting TCP/IP connections at the port. Make sure that TCP connections to the port are not blocked by a firewall.".

The service attempts to connect to a database server named DBSERVER, the placeholder host name from the configuration. I have not yet configured the discovery database, so this makes sense.
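I have not traced every setting, but in my distribution the database connection lives in config/cd_storage_conf.xml; a quick grep shows where the DBSERVER placeholder comes from:

$ grep -rn DBSERVER config/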

All in all, we’re again one step further in deploying SDL Tridion Web 8.5 Discovery Service on CloudFoundry.

Deploy SDL Tridion Web 8.5 Discovery Service on Pivotal CloudFoundry (part 1)

A customer of ITQ is running SDL Tridion content management software and has asked us to deliver a proof-of-concept of running a Tridion website and the Tridion 8.5 microservices on Pivotal CloudFoundry. This post is a journal of my attempts at deploying the SDL Web 8.5 Discovery Service on CloudFoundry.

This is just part 1 of a series of unknown length (at the moment of writing). Here are all parts:

  1. Deploy Tridion SDL Web 8.5 Discovery Service on Pivotal CloudFoundry (part 1) (this post)
  2. Deploy Tridion SDL Web 8.5 Discovery Service on Pivotal CloudFoundry (part 2)

The discovery service is distributed as a binary Spring Boot application with the following directory structure:

│README.md
├bin
│    start.sh
│    stop.sh
├config
│    application.properties
│    cd_ambient_conf.xml
│    cd_ambient_conf.xml.org
│    cd_storage_conf.xml
│    logback.xml
│    serviceName.txt
├lib
│   ....
│   service-container-core-8.5.0-1014.jar
│   ....
└services
    ├discovery-service
    └odata-v4-framework

So there’s a bin folder with a start and stop script, some configuration and a lib folder that has a lot of jar files, including the one with our main class.
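As a side note: if you want to check for yourself which jar contains the main class (com.sdl.delivery.service.ServiceContainer, as we'll see in start.sh below), a quick scan over the lib folder does the trick, assuming unzip is installed:

$ for jar in lib/*.jar; do unzip -l "$jar" | grep -q ServiceContainer.class && echo "$jar"; done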

Binary buildpack

Since this is a binary distribution of a microservice, I first tried the CloudFoundry binary buildpack. A buildpack is a small piece of software that takes your source code, compiles it and runs it on CloudFoundry (a very simplistic explanation). Let's see how far the binary buildpack gets us.

$ cf push discovery_service -b binary_buildpack -c './bin/start.sh' -i 1 -m 128m
Creating app discovery_service in org PCF / space Test as admin...
OK

Creating route discovery-service.cf-prod.intranet...
OK

Binding discovery-service.cf-prod.intranet to discovery_service...
OK

Uploading discovery_service...
Uploading app files from: /home/wildenbergr/microservices/discovery
Uploading 7.2M, 72 files
Done uploading
OK

Starting app discovery_service in org PCF / space Test as admin...
Downloading binary_buildpack...
Downloaded binary_buildpack
Creating container
Successfully created container
Downloading app package...
Downloaded app package (59.3M)
Staging...
-------> Buildpack version 1.0.13
Exit status 0
Staging complete
Uploading droplet, build artifacts cache...
Uploading build artifacts cache...
Uploading droplet...
Uploaded build artifacts cache (200B)
Uploaded droplet (59.3M)
Uploading complete
Destroying container
Successfully destroyed container

0 of 1 instances running, 1 crashed
FAILED
Error restarting application: Start unsuccessful

TIP: use 'cf logs discovery_service --recent' for more information
$ 

Obviously, the deploy did not go as planned so let’s check the logs:

$ cf logs discovery_service --recent
Retrieving logs for app discovery_service in org PCF / space Test as admin...

[API/0] OUT Created app with guid fd8dd243-bc3f-4a26-83f7-44b8a06d95dd
[API/1] OUT Updated app with guid fd8dd243-bc3f-4a26-83f7-44b8a06d95dd ({"route"=>"5c279e23-17a0-48d6-b6dd-0c7fe8cbf17b", :verb=>"add", :relation=>"routes", :related_guid=>"5c279e23-17a0-48d6-b6dd-0c7fe8cbf17b"})
[API/0] OUT Updated app with guid fd8dd243-bc3f-4a26-83f7-44b8a06d95dd ({"state"=>"STARTED"})
[STG/0] OUT Downloading binary_buildpack...
[STG/0] OUT Downloaded binary_buildpack
[STG/0] OUT Creating container
[STG/0] OUT Successfully created container
[STG/0] OUT Downloading app package...
[STG/0] OUT Downloaded app package (59.3M)
[STG/0] OUT Staging...
[STG/0] OUT -------> Buildpack version 1.0.13
[STG/0] OUT Exit status 0
[STG/0] OUT Staging complete
[STG/0] OUT Uploading droplet, build artifacts cache...
[STG/0] OUT Uploading build artifacts cache...
[STG/0] OUT Uploading droplet...
[STG/0] OUT Uploaded build artifacts cache (200B)
[STG/0] OUT Uploaded droplet (59.3M)
[STG/0] OUT Uploading complete
[STG/0] OUT Destroying container
[CELL/0] OUT Creating container
[CELL/0] OUT Successfully created container
[STG/0] OUT Successfully destroyed container
[CELL/0] OUT Starting health monitoring of container
[APP/PROC/WEB/0] OUT Starting service.
[APP/PROC/WEB/0] ERR ./bin/start.sh: line 49: java: command not found
[APP/PROC/WEB/0] OUT Exit status 0
[CELL/0] OUT Exit status 143
[CELL/0] OUT Destroying container
[API/2] OUT Process has crashed with type: "web"
[API/2] OUT App instance exited with guid fd8dd243-bc3f-4a26-83f7-44b8a06d95dd payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* 2 error(s) occurred:\n\n* Codependent step exited\n* cancelled\n* cancelled", "crash_count"=>1, "crash_timestamp"=>1512986370928003691, "version"=>"26a55501-fbae-4e1e-87d0-4704f9ad0c78"}

And there we have it: java: command not found. Makes sense, of course, because we used the binary buildpack, which doesn't know anything about Java.

Java buildpack

Ok, so the binary buildpack is a no-go. That suggests we go with the Java buildpack. On the other hand, this buildpack expects you to push a recognizable JVM artifact, not a directory of scripts and jar files. Let's see what happens.

$ cf push discovery_service -b java_buildpack_offline -c './bin/start.sh' -i 1 -m 128m
Updating app discovery_service in org PCF / space Test as admin...
OK

Uploading discovery_service...
Uploading app files from: /home/wildenbergr/microservices/discovery
Uploading 7.2M, 72 files
Done uploading
OK

Stopping app discovery_service in org PCF / space Test as admin...
OK

Starting app discovery_service in org PCF / space Test as admin...
Downloading java_buildpack_offline...
Downloaded java_buildpack_offline
Creating container
Successfully created container
Downloading app package...
Downloaded app package (59.3M)
Downloading build artifacts cache...
Downloaded build artifacts cache (200B)
Staging...
-----> Java Buildpack Version: v3.17 (offline) | https://github.com/cloudfoundry/java-buildpack.git#87fb619
[Buildpack]                      ERROR Compile failed with exception #<RuntimeError: No container can run this application. Please ensure that you've pushed a valid JVM artifact or artifacts using the -p command line argument or path manifest entry. Information about valid JVM artifacts can be found at https://github.com/cloudfoundry/java-buildpack#additional-documentation. >
No container can run this application. Please ensure that you've pushed a valid JVM artifact or artifacts using the -p command line argument or path manifest entry. Information about valid JVM artifacts can be found at https://github.com/cloudfoundry/java-buildpack#additional-documentation.
Failed to compile droplet
Exit status 223
Staging failed: Exited with status 223
Destroying container
Successfully destroyed container

FAILED
Error restarting application: BuildpackCompileFailed

TIP: use 'cf logs discovery_service --recent' for more information

And this fails as well: the Java buildpack doesn't understand what we are pushing. So with the binary buildpack we can run a shell script but we have no Java runtime; with the Java buildpack we have Java but it doesn't understand the artifact we're pushing. What to do?

Java buildpack with main() method

Digging around in the Java buildpack documentation, it looks like there is an option to run a self-executable jar file. The jar file we’d like to execute is lib/service-container-core-8.5.0-1014.jar. Let’s take a look at the start.sh script that is normally used to run the discovery micro service:

#!/usr/bin/env bash

# Java options and system properties to pass to the JVM when starting the service. For example:
# JVM_OPTIONS="-Xrs -Xms128m -Xmx128m -Dmy.system.property=/var/share"
JVM_OPTIONS="-Xrs -Xms128m -Xmx128m"
SERVER_PORT=--server.port=8082

# set max size of request header to 64Kb
MAX_HTTP_HEADER_SIZE=--server.tomcat.max-http-header-size=65536

BASEDIR=$(dirname $0)
CLASS_PATH=.:config:bin:lib/*
CLASS_NAME="com.sdl.delivery.service.ServiceContainer"

cd $BASEDIR/..
ARGUMENTS=()
for ARG in $@
do
    if [[ $ARG == --server\.port=* ]]
    then
        SERVER_PORT=$ARG
    elif [[ $ARG =~ -D.+ ]]; then
    	JVM_OPTIONS=$JVM_OPTIONS" "$ARG
    else
        ARGUMENTS+=($ARG)
    fi
done
ARGUMENTS+=($SERVER_PORT)
ARGUMENTS+=($MAX_HTTP_HEADER_SIZE)

for SERVICE_DIR in `find services -type d`
do
    CLASS_PATH=$SERVICE_DIR:$SERVICE_DIR/*:$CLASS_PATH
done

echo "Starting service."

java -cp $CLASS_PATH $JVM_OPTIONS $CLASS_NAME ${ARGUMENTS[@]}

A lot is going on in here but in the end the script runs the java command with a classpath, a main class and some options. Maybe we can accomplish the same with the Java buildpack. So, first let’s create a manifest.yml file in the root of the micro service folder structure:

---
applications:
- name: discovery_service
  path: lib/service-container-core-8.5.0-1014.jar
  buildpack: java_buildpack_offline

The path points to the jar file that has the class com.sdl.delivery.service.ServiceContainer with a main() method. However, if we deploy with this manifest, we get the same error: No container can run this application. So what is going on?

When running a Java application directly from a jar file, java has to know which class has the main() method. You can specify this on the command line or via a Main-Class entry in the META-INF/MANIFEST.MF file inside the jar. The service-container-core-8.5.0-1014.jar manifest does not have a Main-Class entry, so we have to specify it on the command line. How to do that?
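A quick way to verify this yourself is to print the jar's manifest to stdout (assuming unzip is available):

$ unzip -p lib/service-container-core-8.5.0-1014.jar META-INF/MANIFEST.MF

If no Main-Class line shows up, java -jar cannot figure out what to run, and neither can the buildpack.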

Digging some more through the Java buildpack documentation I found that you can override buildpack settings by setting application environment variables. In our case, we want to override settings from the config/java_main.yml file so we update our manifest.yml file again:

---
applications:
- name: discovery-service
  path: lib/service-container-core-8.5.0-1014.jar
  buildpack: java_buildpack_offline
  env:
    JBP_CONFIG_JAVA_MAIN: '{ java_main_class: "com.sdl.delivery.service.ServiceContainer", arguments: "-Xrs -Xms128m -Xmx128m" }'
    JBP_LOG_LEVEL: DEBUG

Let’s see what happens this time:

[CELL/0] OUT Creating container
[CELL/0] OUT Successfully created container
[STG/0] OUT Successfully destroyed container
[CELL/0] OUT Starting health monitoring of container
[APP/PROC/WEB/0] ERR Exception in thread "main" java.lang.NoClassDefFoundError: org/slf4j/LoggerFactory
[APP/PROC/WEB/0] ERR     at com.sdl.delivery.service.ServiceContainer.<clinit>(ServiceContainer.java:57)
[APP/PROC/WEB/0] ERR Caused by: java.lang.ClassNotFoundException: org.slf4j.LoggerFactory
[APP/PROC/WEB/0] ERR     at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
[APP/PROC/WEB/0] ERR     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
[APP/PROC/WEB/0] ERR     at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
[APP/PROC/WEB/0] ERR     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
[APP/PROC/WEB/0] ERR     ... 1 more
[APP/PROC/WEB/0] OUT Exit status 1
[CELL/0] OUT Exit status 0
[CELL/0] OUT Destroying container
[API/0] OUT Process has crashed with type: "web"
[API/0] OUT App instance exited with guid da7e3f48-151b-4d9a-9df6-cc8479efa839 payload: {"instance"=>"", "index"=>0, "reason"=>"CRASHED", "exit_description"=>"2 error(s) occurred:\n\n* 2 error(s) occurred:\n\n* Exited with status 1\n* cancelled\n* cancelled", "crash_count"=>1, "crash_timestamp"=>1513012904337368910, "version"=>"f75e2238-95dd-45ed-9d7f-66c6c3ef4d7f"}

Now it seems we're getting somewhere: a NoClassDefFoundError for org/slf4j/LoggerFactory. This means we at least managed to start a Java process, whoopdeedoo! So now we have to get the missing classes onto the classpath somehow. This is where it all started to get complicated: I could find no way to add additional jar files to the classpath in the chosen setup. In fact, this setup has a more fundamental flaw. The documentation for cf push on ‘how it finds the application’ states: if the path is to a file, cf push pushes only that file. So this is never going to work, because we need a whole bunch of other files besides that one jar.

Java buildpack with main() method and explicit command

So, what's next? Luckily, a colleague of mine who knows his way around CloudFoundry found this blog post. The idea is to specify a number of settings that trick the buildpack into doing what we want (repeating some of that post in my own words):

  1. In the buildpack detect phase, we want to make sure the correct container is chosen: java-main. We force this by setting the JBP_CONFIG_JAVA_MAIN environment variable as before.
  2. For the buildpack compile phase, we need all the artifacts from the Tridion Discovery Microservice folder. So we specify a path of ./. Since we use the java-main container we do not really have a compile phase but we still need all microservice files.
  3. In the buildpack release phase we want to run our own Java command that has everything we want on the classpath. We can do this by explicitly specifying a command in our manifest.yml file.

Given these requirements, we come up with the following manifest file:

---
applications:
- name: discovery_service
  path: ./
  buildpack: java_buildpack_offline
  command: $PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:.:$PWD/lib/*:$PWD/config/* com.sdl.delivery.service.ServiceContainer -Xrs -Xms128m -Xmx128m
  env:
    JBP_CONFIG_JAVA_MAIN: '{ java_main_class: "com.sdl.delivery.service.ServiceContainer", arguments: "-Xrs -Xms128m -Xmx128m" }'
    JBP_LOG_LEVEL: DEBUG

And if we push the app this time, it works!

App discovery_service was started using this command `$PWD/.java-buildpack/open_jdk_jre/bin/java -cp $PWD/*:.:$PWD/lib/*:$PWD/config/* com.sdl.delivery.service.ServiceContainer -Xrs -Xms128m -Xmx128m`

Showing health and status for app discovery_service in org PCF / space Test as admin...
OK

requested state: started
instances: 1/1
usage: 1G x 1 instances
urls: discovery-service.test-cf-prod.intranet
last uploaded: Tue Dec 12 15:05:32 UTC 2017
stack: cflinuxfs2
buildpack: java_buildpack_offline

     state     since                    cpu    memory    disk      details
#0   running   2017-12-12 04:06:04 PM   0.0%   0 of 1G   0 of 1G

You can see that our new command is used, making everything we want available on the classpath. You may wonder how we knew that the java executable lives at $PWD/.java-buildpack/open_jdk_jre/bin/java (apart from the blog post I referred to earlier). This is where the JBP_LOG_LEVEL environment variable comes in: it is specific to the Java buildpack and tells it to generate debug output. Part of that output is the exact command the buildpack would execute if you did not specify your own command.
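As an aside: instead of hard-coding JBP_LOG_LEVEL in manifest.yml, you can also toggle it on an existing app with the cf CLI and restage for it to take effect:

$ cf set-env discovery_service JBP_LOG_LEVEL DEBUG
$ cf restage discovery_service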

Run local Pivotal UAA inside a debugger

I've been involved in a project that uses Pivotal CloudFoundry as the PaaS platform of choice. To provide some minimal background info: CloudFoundry is an open-source PaaS platform that can run on top of a number of cloud infrastructures: Azure, AWS, GCP, OpenStack, VMware vSphere and more. Pivotal is a company that offers a commercial CloudFoundry package that includes support, certification and additional services.

I was asked to develop a smoke test to ensure a certain level of confidence in the Single-Sign-On (SSO) capabilities of the platform. SSO in CloudFoundry is taken care of by CloudFoundry User Account and Authentication (UAA) Server, an open-source, multi-tenant, OAuth2 identity management service. Not knowing a lot about UAA and knowing that it is open-source, I decided that my first step should be to try and install UAA on my laptop and get it up-and-running, ideally inside a debugger so that I could step through authorization and token requests. This blog post explains how to do that, how to configure a local UAA database and how to interact with UAA once installed.

Some additional details before getting started:

  • I’m running a Windows 10 laptop…
  • …with the Windows Subsystem for Linux running Ubuntu
  • UAA will be installed on this Ubuntu distribution
  • Debugging via IntelliJ on Windows. JetBrains has a free community edition of IntelliJ that is ideal for this sort of work.

Installing and configuring UAA

Cloning the UAA repo and performing an initial run

Following the UAA documentation you can see that installing UAA locally is really easy. Just perform the following steps:

$ git clone git://github.com/cloudfoundry/uaa.git
$ cd uaa
$ ./gradlew run

However, that is not exactly what I did… I'd like to use IntelliJ to set breakpoints and step through code, and IntelliJ is installed on my Windows box. So what I actually did was clone the UAA repo on my Windows box to the folder %HOMEPATH%\IdeaProjects\uaa (in my case: C:\Users\rwwil\IdeaProjects\uaa). You can now open the project inside IntelliJ and browse through all the code.

Next, inside Ubuntu, you need to locate the folder you cloned UAA into. In my case this is /mnt/c/Users/rwwil/IdeaProjects/uaa. From that folder you can execute ./gradlew run and all should be well: you should now have a local UAA running on the default Tomcat port 8080.
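To check that UAA is really up, you can hit its info endpoint; with a JSON Accept header it should return version and prompt information:

$ curl -H 'Accept: application/json' http://localhost:8080/uaa/info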

Adding debugger support

Of course it's very nice to have it all up-and-running, but in my opinion it helps tremendously to be able to step through code to see what is going on. So we want to attach IntelliJ as a debugger to the running UAA instance. First, this requires some configuration inside IntelliJ: you need to create a remote debugging configuration. This option is available from the Run → Edit Configurations… menu. In my case it looks like this:

Note the command-line arguments that must be added to the remote JVM:

-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

Unfortunately, we started UAA via Gradle and to be honest I have no idea how to add additional command-line options to the Java process that is started by Gradle. So what we need is the complete command line of the running Java process. This is quite easy on Linux:

$ ps -ef | less

We get all running processes (-e) with their full command line (-f). The output should look as follows:

UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 Nov20 ?        00:00:00 /init
rwwilden     2     1  0 Nov20 tty1     00:00:00 -bash
rwwilden    82     1  0 Nov20 tty2     00:00:00 -bash
rwwilden   179     1  0 Nov21 tty3     00:00:04 -bash
rwwilden   299     2  0 Nov24 tty1     00:06:19 /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005 -javaagent:/tmp/cargo/jacocoagent.jar=output=file,dumponexit=true,append=false,destfile=/mnt/c/Users/rwwil/IdeaProjects/uaa/build/integrationTestCoverageReport.exec -DLOGIN_CONFIG_URL=file:///mnt/c/Users/rwwil/IdeaProjects/uaa/./uaa/src/main/resources/required_configuration.yml -Xms128m -Xmx512m -Dsmtp.host=localhost -Dsmtp.port=2525 -Dspring.profiles.active=default,sqlserver -Dcatalina.home=/mnt/c/Users/rwwil/IdeaProjects/uaa/build/extract/tomcat-8.5.16/apache-tomcat-8.5.16 -Dcatalina.base=/tmp/cargo/conf -Djava.io.tmpdir=/tmp/cargo/conf/temp -Djava.util.logging.manager=org.apache.juli.ClassLoaderLogManager -Djava.util.logging.config.file=/tmp/cargo/conf/conf/logging.properties -classpath /mnt/c/Users/rwwil/IdeaProjects/uaa/build/extract/tomcat-8.5.16/apache-tomcat-8.5.16/bin/tomcat-juli.jar:/mnt/c/Users/rwwil/IdeaProjects/uaa/build/extract/tomcat-8.5.16/apache-tomcat-8.5.16/bin/bootstrap.jar:/usr/lib/jvm/java-8-openjdk-amd64/lib/tools.jar org.apache.catalina.startup.Bootstrap start
rwwilden   772     1  0 Nov28 tty4     00:00:04 -bash

You get a very long Java command line that you can copy and modify as needed. In our case, we’d like to add debugging options (which I already added in the example output above).

Now paste and run the modified command line, and we have a Java process that IntelliJ can attach to.

Configuring for Microsoft SQL Server

By default, UAA runs with an in-memory database, losing all data between restarts. My laptop runs Microsoft SQL Server, which UAA actually supports, so let's check out how to configure this.

The way UAA selects between data stores is via Spring Profiles. We can add a profile to the command-line we just copied. Just add sqlserver to the spring.profiles.active command-line parameter: -Dspring.profiles.active=default,sqlserver.

Next step is the connection string for SQL Server. This can be configured in uaa/server/src/main/resources/spring/env.xml. For my local setup I use the following:

<beans profile="sqlserver">
  <description>Profile for SQL Server scripts on an existing database</description>
  <util:properties id="platformProperties">
    <prop key="database.driverClassName">com.microsoft.sqlserver.jdbc.SQLServerDriver</prop>
    <prop key="database.url">jdbc:sqlserver://localhost:1433;database=uaa;</prop>
    <prop key="database.username">root</prop>
    <prop key="database.password">changemeCHANGEME1234!</prop>
  </util:properties>
  <bean id="platform" class="java.lang.String">
    <constructor-arg value="sqlserver" />
  </bean>
  <bean id="validationQuery" class="java.lang.String">
    <constructor-arg value="select 1" />
  </bean>
  <bean id="limitSqlAdapter" class="org.cloudfoundry.identity.uaa.resources.jdbc.SQLServerLimitSqlAdapter"/>
</beans>

So I have a local database named uaa and a user named root. Now we have a setup where we can actually see what UAA is writing to the database when certain actions are performed.
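For completeness: this is roughly how such a database and login can be created up front with sqlcmd (a sketch; adjust the sa credentials to your own setup, UAA's migrations then create the schema on first startup):

$ sqlcmd -S localhost -U sa -Q "CREATE DATABASE uaa"
$ sqlcmd -S localhost -U sa -Q "CREATE LOGIN root WITH PASSWORD='changemeCHANGEME1234!'"
$ sqlcmd -S localhost -U sa -d uaa -Q "CREATE USER root FOR LOGIN root; ALTER ROLE db_owner ADD MEMBER root"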

Interacting with UAA

Ok, final step: what can we do with UAA once we have it up-and-running? It is an OAuth2 server so let’s see if we can get a token somehow. The easiest way to communicate with UAA is through the UAA CLI (UAAC). This is a Ruby application so you need to install Ruby to get it working (there is some work underway on a Golang version of the CLI).

First we have to point UAAC to the correct UAA instance:

uaac target http://localhost:8080/uaa

Next, we’d like to perform some operations on UAA so for that we need an access token that allows these operations. UAA comes pre-installed with an admin client application that you can get a token for:

uaac token client get admin -s "adminsecret"

If we dissect this line:

  • uaac token: perform some token operation on UAA
  • client: use the OAuth2 client credentials grant
  • get: get a token
  • admin -s "adminsecret": get a token for the application with client_id=admin and client_secret=adminsecret

The output should be:

Successfully fetched token via client credentials grant.
Target: http://localhost:8080/uaa
Context: admin, from client admin

The obtained token is stored (in my case) in /home/rwwilden/.uaac.yml.
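Under the hood this is a plain OAuth2 client_credentials request, so you can fetch the same token with curl if you want to see the raw exchange (admin/adminsecret are the client credentials we just used):

$ curl -s -u admin:adminsecret -d grant_type=client_credentials http://localhost:8080/uaa/oauth/token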

Using this token we can now perform some administration tasks on our local UAA. Some examples:

  • Add a local user:

    uaac user add smokeuser --given_name smokeuser --family_name smokeuser --emails smokeuser2@mail.com --password smokepassword
    
  • Add a local group (or scope in OAuth2 terminology):

    uaac group add "smoke.extinguish"
    
  • Add user to scope:

    uaac member add smoke.extinguish smokeuser
    
  • Add a client application that requires the smoke.extinguish scope and allows logging in via the OAuth2 resource owner password credentials grant:

    uaac client add smoke --name smoke --scope "smoke.extinguish" --authorized_grant_types "password" -s "smokesecret"
    
  • Obtain a token for user smokeuser on client application smoke using the password credentials grant (we'll inspect the result below):

    uaac token owner get smoke smokeuser -s smokesecret -p smokepassword
    
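To double-check the result of that last command, you can decode the freshly obtained access token; uaac prints its claims and you should see smoke.extinguish in the scope list:

    uaac token decode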

Of course, there is a lot more to know about CloudFoundry UAA. As I mentioned earlier, it is a full-fledged OAuth2 implementation that has proven itself in numerous (Pivotal) CloudFoundry production installations. Here are some additional references:

  • API overview: https://docs.cloudfoundry.org/api/uaa/version/4.7.1/index.html
  • SSO in Pivotal CloudFoundry: https://docs.pivotal.io/p-identity/1-2/index.html
  • Additional docs on Github: https://github.com/cloudfoundry/uaa/tree/master/docs

Kaggle Horses for Courses analysis of last five starts with Azure Notebooks (part 4)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 4.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL
  2. Horses For Courses barrier analysis with Azure Notebooks
  3. Kaggle Horses for Courses age analysis with Azure Notebooks
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks (this blog post)

Data set and recap

A quick recap of Kaggle and the data set we’re analyzing: Horses For Courses. Kaggle is a data science and machine learning community that hosts a number of data sets and machine learning competitions, some of which with prize money. ‘Horses For Courses’ is a (relatively small) data set of anonymized horse racing data.

In the first post I discussed how you could use Azure Data Lake Analytics and U-SQL to analyze and process the data. I used this mainly to generate new data files that can then be used for further analysis. In the second and third post I studied the effects of barrier and age on a horse’s chances of winning a race.

In this post I'm going to study the relation between a horse's last five starts, which are known before it starts a race, and its chances of winning that race. For every horse in a race we know the results of its previous five races from the runners.csv file in the Kaggle dataset. At first sight this seems a promising heuristic for how a horse will perform in the current race, so let's see if that's actually the case.

The analysis itself will again be performed using Azure Notebooks with an F# kernel. Here’s the link to my notebook library.

What data are we working with?

A typical last five starts might look like this: 3x0f2. So what does this mean? A short explanation:

  • 1 to 9: horse finished in position 1 to 9
  • 0: horse finished outside the top 9
  • f: horse failed to finish
  • x: horse was scratched from the race

So in 3x0f2 a particular horse finished third, was scratched, finished outside the top 9, failed to finish and finished second in its previous five races.

You may already spot a problem here. When we get a 1 to 9, we know what happened in a previous race. When we get a 0, we have some information but we don’t know exactly what happened. For an f or an x we know nothing. In both cases, if the horse had run, it might have finished at any position.

To be able to compare the last five starts of two horses, we have to fix this. Especially if we want to use this data as input to a machine learning algorithm, we should fix it [1].

When we do some more digging in the dataset, it appears that we do not have a complete last five starts for every horse. For some horses, we only have the last four starts or the last two. And for some horses we have nothing at all. Let’s take a look at the distribution of the length of last five starts in our dataset:

(5: 72837) (4: 3379) (3: 3461) (2: 3553) (0: 5054)

I've written it down a bit tersely, but you can see that for 72837 horses (or 83%) we know all of the last five starts. Still, it's hard to compare 32xf6 with 4f, so we should fix the missing data as well.

Fixing the data

The accompanying Azure Notebook describes all fixes in detail, so I’ll give a summary here:

  • x and f: In both cases, a horse could have finished the race but didn't [2]. What we do here is replace each x and f with the average finishing position of a horse over all races as a best guess (the average finishing position in a race with n horses is (n + 1) / 2, so we average that value over all races).
  • 0: The horse finished outside the top 9 so we replace each 0 with the average finishing position for horses outside the top 9 (and here we take the average over all races with more than 9 horses).
  • missing data: This is essentially the same as not starting or failing to finish so we take the average finishing position again.

One small example of what’s happening: suppose we have 4xf0. With our current algorithm, this will be represented as (4.00, 6.49, 6.49, 11.66, 6.49) as follows:

  4            → 4.00   A 4 will remain a 4.
  x            → 6.49   An x will be replaced by the average finishing position over all races.
  f            → 6.49   An f will be replaced by the average finishing position over all races.
  0            → 11.66  A 0 will be replaced by the average finishing position for horses that finish outside the top 9.
  missing data → 6.49   Missing data will be replaced by the average finishing position over all races.

Comparing last five starts

Now that we can be sure that every last five starts has the same length, how do we compare them? The easiest way in my opinion is to take the average. So with our previous example we get:

  4xf0 → (4.0, 6.5, 6.5, 11.7, 6.5) → average 7.04

And we can do this for every horse. So now we have one number for every horse in a race that describes its last five starts, how convenient :) [3]

Preparing the data file

With fixing and averaging in place, we switch back to U-SQL to prepare our dataset. Remember from the first post that we want pairs of horses for each race, so that we can reduce our ranking problem (in what order do all horses finish) to a binary classification problem (does horse a finish before or after horse b).

I’ll digress a bit into Azure Data Lake and U-SQL so if you just want to know how last five starts relates to finishing position you can skip this part. I’m assuming you already know how to create tables with U-SQL so I’ll skip to the part where I create the data file we will use for analysis.

First of all, we need the average finishing position over all races so we can fix x, f and missing data:

@avgNrHorses =
    SELECT (((double) COUNT(r.HorseId)) + 1d) / 2d AS AvgNrHorses
    FROM master.dbo.Runners AS r
    GROUP BY r.MarketId;
@avgPosition =
    SELECT AVG(AvgNrHorses) AS AvgPosition
    FROM @avgNrHorses;

For each race we compute the average finishing position, which for a race with n horses is (n + 1) / 2 (for example, 4.5 in a race with 8 horses), and then average that over all races. Second, we need the average finishing position of horses outside the top 9:

@avgNrHorsesAbove9 =
    SELECT
        (((double) COUNT(r.HorseId)) - 10d) / 2d AS AvgNrHorses,
        COUNT(r.HorseId) AS NrHorses
    FROM master.dbo.Runners AS r
    GROUP BY r.MarketId;
@avgPositionAbove9 =
    SELECT AVG(AvgNrHorses) + 10d AS AvgPosition
    FROM @avgNrHorsesAbove9
    WHERE NrHorses > 9;

A little more complex, but essentially the same as the previous query, restricted to races with more than 9 horses: a horse outside the top 9 in a race with n horses finishes somewhere in positions 10 to n, which averages out to (n + 10) / 2, exactly what the query computes.

The final part is where we generate the data we need and output it to a CSV file:

@last5Starts =
  SELECT
    HorsesForCourses.Udfs.AverageLastFiveStarts(
      r0.LastFiveStarts, avg.AvgPosition, avg9.AvgPosition) AS LastFiveStarts0,
    HorsesForCourses.Udfs.AverageLastFiveStarts(
      r1.LastFiveStarts, avg.AvgPosition, avg9.AvgPosition) AS LastFiveStarts1,
    p.Won
  FROM master.dbo.Pairings AS p
  JOIN master.dbo.Runners AS r0
    ON p.HorseId0 == r0.HorseId AND p.MarketId == r0.MarketId
  JOIN master.dbo.Runners AS r1
    ON p.HorseId1 == r1.HorseId AND p.MarketId == r1.MarketId
  CROSS JOIN @avgPosition AS avg
  CROSS JOIN @avgPositionAbove9 AS avg9;

OUTPUT @last5Starts
TO "wasb://output@rwwildenml.blob.core.windows.net/last5starts.csv"
USING Outputters.Csv();

There are two interesting parts in this query: the AverageLastFiveStarts function call and the CROSS JOIN. First the CROSS JOIN: both @avgPosition and @avgPositionAbove9 are rowsets with just one row. A cross join returns the Cartesian product of the rowsets in a join, so when we join with a rowset that has just one row, that row's data is simply appended to each row of the first rowset in the join.

The AverageLastFiveStarts user-defined function takes a last five starts string, fixes it in the way we described earlier and returns the average value:

namespace HorsesForCourses
{
  public class Udfs
  {
    public static double AverageLastFiveStarts(string lastFiveStarts,
                                               double? avgPosition,
                                               double? avgPositionAbove9)
    {
      // Pad missing data with 'x' so every horse has exactly five entries;
      // 'x' is mapped to the overall average finishing position below.
      var paddedLastFiveStarts = lastFiveStarts.PadLeft(5, 'x');
      var vector = paddedLastFiveStarts
        .Select(c =>
        {
          switch (c)
          {
            // Scratched ('x') or failed to finish ('f'): best guess is the
            // average finishing position over all races.
            case 'x':
            case 'f':
              return avgPosition.Value;
            // Finished outside the top 9: use the average finishing position
            // of horses outside the top 9.
            case '0':
              return avgPositionAbove9.Value;
            // '1'..'9': subtract the ASCII code of '0' (48) to get the digit.
            case '1': case '2': case '3': case '4': case '5':
            case '6': case '7': case '8': case '9':
              return ((double) c) - 48;
            default:
              throw new ArgumentOutOfRangeException(
                "lastFiveStarts", lastFiveStarts, "Invalid character in last five starts");
          }
        });
      return vector.Average();
    }
  }
}

The code is also up on Github so you can check the details there.

Analysis

We now have a data file that has, on each row, the last five starts average for two horses and which of the two won in a particular race. Some example rows:

3.90, 6.49, True
4.30, 6.49, False
6.70, 3.50, False
6.70, 5.40, False
7.63, 4.40, False
6.69, 5.49, True

On the first row, a horse with an average last five starts of 3.90 beat a horse with an average last five starts of 6.5. On the second row, 4.3 got beaten by 6.5, on the third row, 6.7 got beaten by 3.5, etc.

So how do we get a feeling for the relation between last five starts and the chances of beating another horse? I decided to do the following:

  1. Get the largest absolute difference between last five starts for two horses over the entire data set.
  2. Get all differences between last five starts pairs.
  3. Distribute all differences into a specified number of buckets.
  4. Get the numbers of wins and losses in each bucket and calculate a win/loss ratio per bucket.

In the example rows above, the largest difference is in row 5: 3.23. Since differences can be both positive and negative, we have a range of length 3.23 + 3.23 = 6.46 to divide into buckets. Suppose we decide on two buckets: [-3.23, 0) and [0, 3.23]. Now get each difference into the right bucket:

                    diff        bucket
3.90, 6.49, True,  -2.59  --> bucket 1
4.30, 6.49, False, -2.19  --> bucket 1
6.70, 3.50, False,  3.19  --> bucket 2
6.70, 5.40, False,  1.29  --> bucket 2
7.63, 4.40, False,  3.23  --> bucket 2
6.69, 5.49, True,   1.20  --> bucket 2

So we have 2 rows in bucket 1 and 4 rows in bucket 2. The win/loss ratio in bucket 1 is 1 / 1 = 1; the win/loss ratio in bucket 2 is 1 / 3 ≈ 0.33 (one win against three losses). So if the difference in last five starts is between -3.23 and 0, the win/loss ratio is 1.0. If the difference is between 0 and 3.23, the win/loss ratio is 0.33.

This is of course a contrived example. In reality we have almost 600000 rows, so we get more reliable data. I experimented a little with the number of buckets and 41 turned out to be a good choice. This resulted in the following plot. I skipped the outer three buckets on both sides because there aren't enough data points in them.

Bucket win/loss ratio

The bars represent the buckets, the line represents the number of data points in each bucket. I highlighted bucket 24 as an example. This bucket represents the differences between average last five starts of two horses between 1.59 and 2.05. This bucket has 34777 rows and the win/loss ratio is 1.54.

This means that if the difference between average last five starts of two horses is between 1.59 and 2.05, the horse with the higher average is 1.54 times more likely to beat the other horse! This is pretty significant. If we take two random horses in a race, look at what they did in their previous five races and they happen to fall into this bucket, we can predict that one horse is 1.54 times more likely to win.

We need to put these numbers a little bit into perspective, because it matters how many records of the total population fall into bucket 24. This is about 5.83%. However, the data set is symmetric in the sense that it includes two rows for each horse pair (so if we have a,b,True we also have b,a,False). So bucket 16 is the inverse of bucket 24 with the same number of records: 34777. This means we can actually tell for 11.66% of the total population that one horse is 1.54 times more likely to win than another horse.

Conclusion

So far, we have analyzed three features for their effect on horse race outcomes: barrier, age and last five starts. Barrier and age had a clear effect and now we found that average last five starts also has an effect. Each one of these separately cannot be used to predict horse races but maybe combined they present a better picture.

Age and barrier are independent of each other. The barrier you start from is the result of a random draw, so it has no effect on the age of a horse; vice versa, the age of a horse has no effect on the barrier draw. We already established that both age and barrier have an effect on race outcomes, so you might be inclined to think that both also have an effect on the last five starts. This is not true for barrier, but it may be true for age. We determined in the previous post that younger horses outperform older horses, so it makes sense that the last five starts of younger horses look better than those of older horses.

Ideally we would like to present a machine learning algorithm a set of independent variables. Using both age and last five starts may not be a good idea.

In the next post we’ll get our hands dirty with Azure Machine Learning to see if we can get ‘better than random results’ when we present the features we analyzed to a machine learning algorithm. Stay tuned!

Footnotes

  1. Actually there is no machine learning ‘law’ that requires us to fix the data. We could just leave the x, f and 0 as they are and have the algorithm figure out what they mean. However, think about what this would mean. Suppose we have two horses: 067xf and 9822x and the first won. The input for our machine learning algorithm would be: 0,6,7,x,f,9,8,2,2,x,True. That's 10 feature dimensions, just to describe the last five starts! High-dimensional sample spaces are a problem for most machine learning algorithms and this is usually referred to as the curse of dimensionality, very nicely visualized in these two Georgia Tech videos. So the fewer dimensions, the better.
  2. You could argue that being scratched from a race (x) and failing to finish (f) are two different things. Especially an f could give us more information about future races. Suppose we see the following last five starts: 638ff. The horse failed to finished in its last two races. This doesn’t give much confidence about the current race. On the other hand, f8f63 tells a different story but has the same results, just in a different order. Maybe in a future blog post I’ll dig deeper into better methods for handling x and f.
  3. I have given some thought to other ways of comparing last five starts but averaging is at least the simplest and maybe the best solution. You could argue that trends should be taken into account, so that 97531 is better than 13579. The first shows a clear positive trend, the second a clear negative one. However, deriving a trend from a series of five events seems a bit ambitious so I decided against it.

Kaggle Horses for Courses age analysis with Azure Notebooks (part 3)

This blog post is part of a series describing my ongoing analysis of the Kaggle Horses For Courses data set using Azure Data Lake Analytics with U-SQL and Azure Notebooks with F#. This is part 3.

  1. Horses For Courses data set analysis with Azure Data Lake and U-SQL
  2. Horses For Courses barrier analysis with Azure Notebooks
  3. Kaggle Horses for Courses age analysis with Azure Notebooks (this blog post)
  4. Kaggle Horses for Courses analysis of last five starts with Azure Notebooks

Data set and recap

A quick recap of Kaggle and the data set we’re analyzing: Horses For Courses. Kaggle is a data science and machine learning community that hosts a number of data sets and machine learning competitions, some of which with prize money. ‘Horses For Courses’ is a (relatively small) data set of anonymized horse racing data.

In the first post I discussed how you could use Azure Data Lake Analytics and U-SQL to analyze and process the data. I used this mainly to generate new data files that can then be used for further analysis. In the second post I studied the effect of the barrier a horse starts from on its chances of winning a race.

In this post I'm going to do the same, but now for age: how does the age of a horse affect its chances of winning a race? The analysis will again be based on a file that was generated from the raw data using a U-SQL script in Azure Data Lake. The file has a very simple format: column 1 has the age of the first horse, column 2 the age of the second horse and column 3 tells us who won in a particular race. For example:

3,7,True
10,4,False

The first row tells us that in a particular race, a 3-year-old horse beat a 7-year-old horse. The second row tells us a 10-year-old horse got beaten by a 4-year-old.

The analysis will again be performed using an Azure Notebook with an F# kernel. Here is the link to my notebook library.

Ages notebook

As in the previous post, the details can be found in the accompanying Azure Notebook. You can clone the notebook library using a Microsoft account. Remember that Shift+Enter is the most important key combination; it executes the current cell and moves to the next cell.

The first thing we’d like to know is how many horses there are for a particular age. This information can be found in the raw data from Kaggle: horses.csv. If we plot the results we get the following:

You can see that for ages 3, 4, 5, 6 and maybe 7 we have a reasonable amount of data.

The next step is analyzing the ages.csv file we generated that has one row for each age combination in each race. For this we apply a similar tactic as we used in the previous post: check for each age how many times a horse from that age beat horses from other ages. This results in the following matrix:

Some examples to clarify what we see here:

  • On the first row we see how many times 2-year-old horses beat other horses. So 2-year-old horses beat 3-year-old horses 793 times, they beat 4-year-old horses 129 times, etc.
  • On the second row we have the 3-year-old horses. They beat 2-year-old horses 1424 times, other 3-year-old horses 32247 times, 4-year-old horses 11588 times, etc.

The absolute numbers in this matrix do not tell us a lot, since they are skewed by the number of horses of a particular age that actually ran races. So what we do next is divide the number of wins by the number of losses per age pair: the win-loss ratio. These are the numbers for ages 2 to 7:

The second value in the first row is obtained by dividing 793 by 1424. The first value in the second row is its inverse: 1424 divided by 793. Now let’s visualize the data. I started out with a 3D surface plot (as in the previous post) but that got a bit convoluted so I used simple line charts instead:

Conclusions

In the plot I compared ages 2 to 8. I highlighted the results of 2, 3 and 4 year old horses against other 4-year-olds. So, for example, you can see that a 2-year-old horse has a win/loss ratio of 0.701087 against 4-year-old horses. What is obvious is that younger horses outperform older horses (except for 2-year-old horses): 3-year-old horses have a positive win/loss ratio against any other age.

However, if we take the positive win/loss ratio of 1.078054 of 3-year-olds against 4-year-olds, it doesn’t really help us predict horse races. If we revisit the absolute numbers, we can see that 3-year-olds beat 4-year-olds 11588 times, but 4-year-olds beat 3-year-olds 10749 times.

But still, the effect of age is obvious so there must be some way to use it in predicting race outcomes. Maybe instead of age we could use the win/loss ratio directly. However, we may lose information if we reduce ages 2 and 4 in each race to the single number 0.701087. Maybe age combined with another feature is a strong predictor for race outcomes. For example, maybe 2-year-old horses perform very well on muddy race tracks. By reducing age pairs to just a win/loss ratio this information may be lost.

So even if age is a factor to consider, I doubt whether it is actually useful as direct input for a machine learning algorithm.