Our objective

If you studied math at school until at least age 16, chances are you studied the fundamentals of machine learning, the technology we commonly describe as artificial intelligence (AI) today. We'll teach you this via linear regression.

But first a question: do you remember calculating the line of best fit during algebra class?

[Image: line of best fit, © Maths Is Fun]

If you answered “yes” (even if you don’t remember how you did it) then this tutorial is for you. If you answered “no”, check out this higher level explanation of machine learning.

Assuming you answered “yes”, great news: you can learn the fundamentals of most machine learning applications. These are broadly the same for most machine learning and deep learning systems.

Of course, there’s a lot more to machine learning than what we can cover in a single post, and such systems are a lot more complex in real life, but an understanding of the basics will go a long way, and provide confidence to explore these topics and more complex variations in more detail.

We hope this provides greater intuition regarding what is actually happening under the hood, without needing to know how to build your own engine. In particular, after reading this article you will understand:

  1. What a classical machine learning system is designed to do and how it does so
  2. How inputs are analysed by a machine learning system in order to generate an output
  3. How a machine learning system scores its abilities in order to improve itself
  4. Which parts of the machine learning system are fixed by the human designers, and those generated by the machine

We will do this via the example of linear regression, a commonly used machine learning technique.

Disclaimer: if you’re a developer or machine learning expert, this post probably isn’t for you as we necessarily keep things as simple as possible and in a few cases omit some details to keep things accessible.

How does machine learning fit into AI?

Machine learning is the study of computer algorithms that improve automatically through experience. It is a subset of AI.

What is Machine Learning?

The classical definition is Tom M. Mitchell's. Mitchell, a computer scientist, provided a widely quoted definition of the algorithms studied in the machine learning field:

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E."

Past experience, E, is data. Machine learning is garbage in, garbage out (or GIGO). If the data is wrong, incomplete, inconsistent or too scarce, the system will perform poorly at its task, struggling to learn the correct behaviour(s) that improve its performance at that task (and vice versa).

How is machine learning different to previous forms of AI?

Recent advances in AI have been achieved by applying machine learning to very large data sets. These algorithms detect patterns, learning how to make predictions and recommendations by processing data and experiences, rather than by receiving explicit programming instruction (by humans). The algorithms also adapt in response to new data and experiences to improve efficacy over time.

Previous attempts at AI relied on hand coded logic, e.g. IF this THEN that. They required adaptation via addition, deletion or amendment of that logic by human engineers in response to new circumstances or understandings about a particular AI problem.

What are major types of machine learning?

There are three major types:

Supervised Learning

The computer is presented with example inputs (x) and their desired outputs (y), given by a “teacher” (an engineer or subject matter expert), and the goal is to learn a general rule mapping inputs (x) to outputs (y). This is the approach used in the worked example later in this article, which teaches machine learning via the technique of linear regression.

Unsupervised Learning

Unlike supervised learning, no labels are given to the algorithm. Instead, it must discern its own way in which to find structure in its input. Unsupervised learning can be a goal in itself (discovering hidden patterns in data) or a means towards an end (feature learning).

Reinforcement Learning

A computer program interacts with a dynamic environment in which it must perform a certain goal (such as playing a game against an opponent). As it navigates its problem space, the program is provided feedback that’s analogous to rewards, which it tries to maximize. A bit like teaching a child (or pet) good behaviours from bad and rewarding them with treats!

Supervised vs Unsupervised Learning – which is better?

For further discussion of supervised vs. unsupervised learning, and which is better, see here.

What can you do with these techniques?

ML systems use data to learn the mathematical relationship that produces an output y given an input x, where y can be:

  • a pattern, e.g. clusters found in customer demographic data
  • a classification, e.g. cat vs not cat
  • a prediction, e.g. student A will score 92% on their next exam
  • a goal, e.g. beat someone at chess.

If it’s all about data, isn’t this analytics?

Sort of. Machine learning and analytics share a lot of DNA, but are distinct. That said, they are often used in conjunction with one another, sometimes on the same data.

Machine learning is a combination of mathematical techniques applied to data, stitched together in specific workflows (algorithms). Many of these borrow heavily from statistics, probability and linear algebra, all of which are also used in analytics. So what's the difference between machine learning and analytics?

Analytics is typically defined as descriptive. It’s used to interrogate data in order to describe what happened to help humans understand causation, review the performance of a product or process, draw insights and make decisions.

Machine learning is typically predictive or prescriptive. The former is about using data to anticipate what will happen based on other data (regression), or what something is (classification, e.g. is photo X “cat” or “not cat”?). The latter is about providing recommendations on what to do to achieve one or more goals, e.g. Amazon's “Customers who bought X also bought Y” feature learns recommendations based on your buying habits correlated with those of other buyers with similar habits to your own.

Linear Regression: the machine learning you didn’t know you knew

By now you hopefully have a roadmap for the world of machine learning, its major types and its relationship with AI and analytics. Now let's dive into a specific example and use it to unpick how most machine learning systems work.

We’re going to explore linear regression. Linear regression is a predictive and supervised machine learning technique.

Linear regression is the “hello world” of machine learning.

By this we mean it’s the first technique developers and machine learning engineers learn when deciding to study machine learning, often used to illustrate the basic concepts.

It’s used because it encompasses the principal components of most machine learning systems and is also something familiar from school level math.

As it is a supervised learning technique, it’s slightly more intuitive than unsupervised or reinforcement learning. The essential idea is that we give the system enough examples of some relationship between x and y and eventually the system learns the magic connection that links x and y, enabling it to perform a related task with that type of data.

And the best part? You studied this at school!

What is linear regression?

Linear regression is an algorithm that allows you to predict a given output y for a given input x. That’s it.

Going slightly deeper, it is a highly interpretable (i.e. easy for us to understand the relationship between an input and an output), standard method for modelling the past relationship between independent input variables (x) and dependent output variables (y, which can have an infinite number of values) to help predict future values of the output variables.

What can you do with linear regression?

Predict stuff! A concrete example:

Given the number of contracts to be drafted for a new legal matter (x) how many billables (y) will this cost the client based on historic examples of matters with the same or similar number of documents?

How does linear regression work?

Supervised machine learning algorithms such as linear regression have 8 components common to most machine learning systems:

Ingredients for machine learning

  1. An objective
  2. A dataset
  3. A model function
  4. A cost function
  5. An optimization function
  6. An iterative feedback loop
  7. Hyper parameters
  8. Parameters

We will explore each, layering them together.

1. The Objective

To keep things simple our objective is:

Given a number x we want to predict a related value y

This is what our system needs to learn, and what we will train the system to do. In real life, x could be the number of documents anticipated for a new legal matter and y the predicted cost in billable hours. More complex systems would include other factors, e.g. number and type of lawyers needed, jurisdictions, some score regarding the complexity of the intended matter etc.

The objective, the understanding of the x and y relationship, is known as the model (more on that later).

2. The Dataset

A dataset is simply the data you want your machine learning system to process in order to learn the desired behaviour. In supervised machine learning, this will be:

  • a collection of examples
  • where each example is a pair of x and y
  • where x is a feature about a thing (a variable independent of y)
  • where y is a label about a thing (a variable dependent on x)

For this reason, it’s best to think about machine learning, especially supervised learning, as thing labelling.

Our dataset looks like this:

Each row in the above Dataset is an example relationship between known x and y values.  Each x value is a feature and each y value a label (and the thing we want the system to predict).

We split the dataset into:

  1. A training dataset (green)
  2. A validation dataset (orange)
  3. A test dataset (red)

As we will explain, we train the system on training dataset then validate its performance on the validation dataset and finally test the system on the test dataset.
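This split can be sketched in a few lines of Python. The example pairs and the 60/20/20 proportions below are illustrative, not taken from the article's table:

```python
# A toy dataset of (x, y) example pairs, where each x is a feature and each
# y is a label. The values are made up to match the worked example later on.
dataset = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),
           (6, 7), (7, 8), (8, 9), (9, 10), (10, 11)]

n = len(dataset)
train_data = dataset[:int(n * 0.6)]                   # used to learn m and b
validation_data = dataset[int(n * 0.6):int(n * 0.8)]  # used to check progress
test_data = dataset[int(n * 0.8):]                    # held back until the end

print(len(train_data), len(validation_data), len(test_data))  # 6 2 2
```

The right proportions depend on how much data you have; 60/20/20 is just a common starting point.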

Think of it like how you prep to pass exams.

When planning exam prep you gather a dataset. That dataset will comprise your revision, facts from which you study and learn the relationships about the subject matter.

You then use mock exams or past papers plus model answers to validate how well you understand and have revised the subject matter. If you do well on the mocks, great. If not, you review the model answers and go over your revision materials again to try and understand what you don’t.

Finally, you test your knowledge at the final exam. This proves your ability to generalize your understanding of a subject matter to new, sight unseen questions about it.

The hope is that provided you revised (trained), checked your knowledge and made any changes to your understanding via mock exams (validation) you should pass the exam (test). If you still fail the final exam, you probably need to go back to the drawing board and reconsider the structure and content of your revision.

We can plot our dataset like this:

Our system’s goal is to learn what mathematical manipulation applied to x results in the correct value y.  This is known as the model. As we shall see, this can also be represented graphically.

3. The Model Function

Recall that the objective of our system is:

Given a number x we want to predict a related value y

Machine learning engineers express this relationship with a mathematical equation, known as the model function. Mathematically this is:

y = mx + b

Let’s break this down.

Of the entire universe of possible x & y relationships and their constituent values, we already know a small subset which we’ve labelled and organised, i.e. our dataset. So we can plug into the model function at least some x and y values, i.e. the ones we already know and have to hand.

But what about m and b?

m and b are the magic numbers that produce the correct y value when applied to a given x value.

m is known as a weight or parameter, and b is known as the bias unit. The goal of our system is to identify the correct values of m and b such that our system always (or as close to always as possible) correctly predicts y for a new x value.
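In Python, the model function is a one-liner. Here it is evaluated with m = 0.5 and b = 3, the initial configuration used in the worked example below:

```python
def predict(x, m, b):
    """The model function y = mx + b: given an input x, a weight m and a
    bias b, return the system's guess at y."""
    return m * x + b

# With m = 0.5 and b = 3, an input of x = 1 produces a predicted y of 3.5.
print(predict(1, 0.5, 3))  # 3.5
```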

In this way, we hope to pick any x value in the universe and predict the y value, whether or not we know that y value.

Why is that useful? Well, it helps us make predictions about the world around us or an outcome, e.g. number of billable hours likely, given an input, e.g. number of contracts to be negotiated.

So you can begin to build an intuition of what our system is learning toward, let's plot the target model function our system needs to learn:

This new green line – the line of best fit – is something you calculated at school, or at least used your ruler to draw. Either way, you may recall the idea was to find the line through, or as close to, as many of the coordinates on your graph as possible.

You then used your ruler to draw a vertical line up from a new x value and horizontally across from the line of best fit to find the corresponding y value (e.g. the pink lines).

Later you learned how to do this mathematically using linear algebra.

This line of best fit represents the model function. It is what we need the system to learn. Spoiler alert, this line can be written algebraically as y = 1x + 1 or simply y = x + 1.

Try it for yourself.

For example, for the row of the dataset where x = 2, we can plug this x value into our model function y = x + 1 and return y = 3. When we compare this against the correct y value for that example, we see that it is also y = 3. Our model correctly predicts a y value for a given x value.

In real life neither the system designers, nor the AI, will know this at the start of the process. I’ve stated the answer here so it’s easy to follow what the system needs to work towards in order to generate the correct y values for given x values.

So how does the system “learn” its way toward the correct values of m and b?

By guessing values of m and b over and over. Initially, the guesses are random but mathematically tuned via a feedback loop.

So you’re saying a lot of machine learning is analysing data to make probabilistic guesses at things? Yup, pretty much. Of course, there’s more complexity to this, and nuance, but that’s the gist. It’s important to understand that machine learning is maths not minds. Even a neural network – inspired by a theoretical model of how brain neurons process an input and produce an output – is just maths. There is no semantic understanding of anything.

Does that matter? It depends. But it’s a philosophical question beyond the scope of this article.

Ok, so philosophy aside. How does the system make these guesses, know if they are accurate or not and, depending on that analysis improve itself? Let’s find out!

4. The Cost Function

The cost function calculates the gap between the system’s guess at y and the actual value of y. It’s also known as the error function.

Let's return to the first example in our dataset. In that example x = 1 and y = 2. If the system predicted y = 10 for that x, the system can calculate the error by doing 10 - 2 = 8. In other words, the system was 8 off.

Why is this useful? The system has a measure of how well it is performing. Returning to the exam analogy. It’s a bit like having the questions (x values) and the answers (y values) but first covering up the answers with your hand and guessing at how to answer each question then removing your hand to see how you did. You now have a measure of how well you performed that you can learn from.
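A minimal sketch of such a cost function in Python (real linear regression typically uses the mean squared error over all examples, but the core idea of measuring the gap is the same):

```python
def cost(predicted_y, actual_y):
    """A simplified cost (error) function: the size of the gap between the
    system's guess at y and the actual y value."""
    return abs(predicted_y - actual_y)

# A prediction of y = 3.5 against an actual value of y = 2 costs 1.5.
print(cost(3.5, 2))  # 1.5
```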

We’ll see how this comes together in more detail below.

5. The Optimisation Function

The optimization function takes the learning from the cost function – i.e. how wrong (or right) the system is at guessing – and uses that new information to inform how best to update itself in order to maximise its chances at improving its guess at the next turn.

When we refer to “turns” this is because machine learning is iterative, which leads nicely into…

6. An iterative feedback loop

You may have noticed we’ve built up a number of components and hinted at the overall process being turn based. Well, you’d be correct. Machine learning is iterative. It’s about mathematically expedited trial and error.

So how does the dataset, model function, cost function and optimization function work together iteratively in order to learn the relationship between x and y? Let’s find out!

First Pass, First Configuration

At the start, neither the system engineers nor the system have any idea what m and b might be. Instead, the engineers initialize m and b to random values, e.g. m = 0.5 and b = 3. 

These are plugged into our model function, e.g. y = mx + b becomes y = 0.5x + 3.

Next, the x value (but not the y value) for the first example in our training dataset (below) is added into the model function.

First Pass, First Guess

The system computes the Model Function for its first guess at y. With this configuration:

y = (0.5 x 1) + 3
y = 0.5 + 3
y = 3.5

So the Predicted Y Value for the system's first guess is y = 3.5.

First Pass, First Check

So how does the system know if y = 3.5 is correct for x in this example, and whether or not the current model function configuration is correct or needs tuning?

Simple: the system calculates the difference between:

The Predicted Y Value (y = 3.5) 

AND 

The Actual Y Value (y = 2)

EQUALS

1.5 (i.e. 3.5 - 2)

That calculation is made by the cost function and the resulting value, i.e. 1.5 in this example, is the cost value for the first guess.  The actual math function is more complex, but this is the core idea: comparing the prediction with the actual result and calculating the difference.

We can plot this version of the model function below (orange line), and highlight the cost (i.e. error) of its predictions against the actual data for our dataset.

The costs are the gaps (red arrows) between the blue points on the graph (our dataset of x + y pairs) and the current model function’s (y = 0.5x + 3) predictions for y, illustrated by the orange line.

As you can see, for the first example in our dataset where x = 1 the current model function’s prediction of y is off by +1.5 because it predicts y = 3.5 when it is in fact y = 2.

First Pass, First Update

So far, the system has made its first guess, compared its guess against the actual answer to calculate the difference between its guess and the truth which turns out to be wrong by +1.5 (the cost). 

Its goal at this step is to identify what adjustment to its model will minimize the cost for its next guess, i.e. so the next guess is closer to the actual y value given an input x.  

The optimization function takes the current value of m, currently m = 0.5, and adjusts it by:

  a. PLUS some small amount, e.g. +0.1; and
  b. MINUS some small amount, e.g. -0.1,

and in each case calculates whether the slightly adjusted m value makes the cost worse (i.e. increase) or better (i.e. decrease).

This is essentially the mathematical equivalent of the Hot or Cold game where you are blindfolded and take small steps in different directions toward a hidden object and ask a friend to say “hotter” (i.e. closer to the object) or “colder” (i.e. further from the object). 

In that game you decide to head in the direction that returned the “hottest” signal from your friend. 

Without exploring the maths – which go a bit beyond school level algebra (unless you studied algebra beyond age 16) – this is the intuition you need to understand. The optimization function is like the Hot or Cold game.

This same logic from Hot and Cold applies to the Optimization Function, e.g.

(A) PLUS Adjustment

Original m = 0.5. Adjusting m + 0.1 means new m = 0.6. Plugging this new m into the model function we get y = (0.6 x 1) + 3 or y = 3.6. Plugging that into the cost function, i.e. the difference between y = 3.6 (prediction) and y = 2 (actual value), means the cost has increased from 1.5 to 1.6, i.e. got worse.

(B) MINUS Adjustment

Original m = 0.5. Adjusting m - 0.1 means new m = 0.4. Plugging this new m into the model function we get y = (0.4 x 1) + 3 or y = 3.4. Plugging that into the cost function, i.e. the difference between y = 3.4 (prediction) and y = 2 (actual value), means the cost has decreased from 1.5 to 1.4, i.e. got better.

Because adjustment (B) reduced the cost, the system learns that it should reduce m by some amount in that direction (i.e. downwards) as this seems to be the correct direction that leads to a lower cost and a better prediction. 
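Here is a Python sketch of that hot-or-cold check, using the numbers from the example above. (In practice, the optimization function uses calculus to compute the best direction directly rather than trying both, but the intuition is identical.)

```python
def predict(x, m, b):
    return m * x + b

def cost(predicted_y, actual_y):
    return abs(predicted_y - actual_y)

x, actual_y = 1, 2   # the first training example
m, b = 0.5, 3        # the current configuration
step = 0.1           # size of the trial micro-adjustment

current_cost = cost(predict(x, m, b), actual_y)       # 1.5
cost_plus = cost(predict(x, m + step, b), actual_y)   # 1.6: worse ("colder")
cost_minus = cost(predict(x, m - step, b), actual_y)  # 1.4: better ("hotter")

# Head in whichever direction lowered the cost; here, downwards.
direction = -1 if cost_minus < cost_plus else +1
print(direction)  # -1
```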

At this point, the system is basing this decision on having seen only the first example and how it relates to the model function.

With more examples, this understanding of micro-adjustments to m as they relate to the cost becomes more complex, nuanced and more accurate over time. Again, this is like in the Hot and Cold game where lots of little steps add up and with increasing speed take you to the hidden object.

To speed things up, the system designers use a Learning Rate, an arbitrary amount by which m is updated in the direction of the best micro adjustment from above.

Let’s assume the engineers set the Learning Rate as 0.25.

This means m is reset from the originally random value of m = 0.5 to m = 0.25 (i.e. 0.5 - 0.25 = 0.25). The system has moved from a random configuration of m to a data-informed adjustment of that variable.

The same process applied to m is applied to b at the same time (omitted in the above for simplicity), and for simplicity let’s assume b is reset to 1.5 after this step.
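Continuing the sketch, the learning-rate update is a single subtraction. (The new value of b is simply asserted here, mirroring the simplification in the text.)

```python
learning_rate = 0.25
m = 0.5              # the original random value

# The trial adjustments showed that decreasing m lowers the cost,
# so m moves down by the learning rate.
m = m - learning_rate
print(m)  # 0.25

# The same process runs for b; as in the article, assume it lands on 1.5.
b = 1.5
```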

Second Pass, Second Configuration

The system then restarts with its new configurations for the model function.

This means that the model function is reconfigured with:

m = 0.25 and b = 1.5 (whereas before, for the first pass they were m = 0.5 and b = 3)

Together this updates the model function as follows:

From y = 0.5x + 3 (the configuration from the first pass) to y = 0.25x + 1.5

As before, we choose the next example in our dataset and input the x value into our model function, i.e.

This means the model function configuration at the second pass is this:

y = (0.25 x 2) + 1.5

Crunching that equation, the second guess for the second example is therefore y = 2.

Second Pass, Second Check

Same as for the first pass: the system checks this new Predicted Y Value (y = 2) against the Actual Y Value (y = 3) using the Cost Function to generate the new cost, i.e. 3 - 2 = 1.

Much better!

Now the system is only 1.0 off the correct value (y = 3) for the second example, vs. being off by 1.5 after the first guess for the first example.

We can plot this as follows:

The costs are the gaps (red arrows) between the blue points on the graph (our dataset of x + y pairs) and the current model function’s (y = 0.25x + 1.5) predictions for y, illustrated by the orange line.

The system is learning the relationship between x and y, and has improved this understanding between the first guess for the first example and the second guess for the second example.  Notice also, the new model function has gotten even closer to the first example.

The more examples the system processes, the closer the line will adjust to all other points.

So the system has used the data (the first example in our training dataset) to learn how to adjust the weights (aka parameters) in our model function that tune the model function’s performance toward accurate predictions of y given x. 

Our system isn’t quite there yet, but it’s on its way to learning the correct function that maps a given x input to its correct y output.

Rinse and repeat through the training dataset

We repeat the above process again and again for each example. As the system works through each example it will improve, minimising its cost (i.e. its inaccuracy) to maximise the accuracy of its predictions.

For a great visual of this in process on an entirely different and more varied dataset over 50 iterations (aka epochs) see the below:

Source: here

Validate

Once our system has processed the training dataset per the above process, we freeze the system so there are no further adjustments to the values of m and b. We then run the above process on the validation dataset.

However, two key points to understand:

  1. We run all steps in the above process again on the validation dataset except the optimization function. For validation we only want to see how the system performs on new, sight unseen, data not use that information to update the system’s performance.
  2. The validation dataset has not – as yet – been seen by the system. This is because we want to see how well the system performs on new, sight unseen data. This data was not part of the training dataset so it cannot have factored this into its learning prior to this validation process.
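A sketch of the validation step in Python (the frozen m and b values here are hypothetical, standing in for whatever training produced):

```python
def predict(x, m, b):
    return m * x + b

def cost(predicted_y, actual_y):
    return abs(predicted_y - actual_y)

# Parameters are frozen after training. These values are illustrative.
m, b = 1.02, 0.95

# Examples the system has never seen during training.
validation_data = [(6, 7), (7, 8)]

# Predict and measure the cost, but make NO updates to m or b.
total_cost = sum(cost(predict(x, m, b), y) for x, y in validation_data)
average_cost = total_cost / len(validation_data)
print(round(average_cost, 3))  # 0.08
```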

Recall the exam prep analogy. This is the step whereby, after revising on our revision data, we attempt a mock exam paper to test our knowledge before the real exam. We do this to provide an intermediate measure of how well we understand new data (i.e. new facts and exam questions) based on what we learnt via revision. If we score well on the mock exam, indications are good that we might do well on the final exam. If we do poorly, we have room for improvement and a chance to go back and adjust our revision materials and methods to hopefully improve our performance.

This is the intuition behind validation. It’s the past exam paper step in a student’s exam prep.

Test

This is where rubber meets the road. In our exam analogy, this is the final exam. Like the final exam, the idea is to test how well the student’s (in our case the system) learning generalizes to new sight unseen data.

As with validation, we run the above process of configurations and cost calculations, but not the optimization function step.

If the system performs poorly on the test dataset, what can the AI engineers do? Lots of things, collectively known as tuning the hyper parameters. Let's explore these below, including which parts of the overall process were fixed by the engineers and which were generated or adjusted by the AI.

7. Hyper Parameters

AI engineer decided parts (known as hyper parameters):

1. The choice of Model Function, e.g. y = mx + b rather than y = m1x² + m2x + b

2. The initial choice of m and b

3. The choice of Learning Rate

4. The choice of the Cost Function

5. The choice of the Optimisation Function

6. The amount of Training Data

7. The number and type of Training Data features

Regarding 7, in the worked example above we only use one feature about a thing, x, to predict a related value y. However, returning to the idea that x could be the number of documents on a matter and y the total hours billed to that matter, we could add more features to our model, e.g. the types of documents, the number and seniority of lawyers on the matter, the size of each document, the number of versions of each document etc. This may improve the model because we know the relationship of billable hours to a matter goes beyond the number of docs, but we aren't certain which factors impact billables nor how to weigh them up to produce accurate billable hours estimates.
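As a hypothetical sketch, a multi-feature version of the model function just adds a weight per feature. The feature names and weight values below are invented for illustration:

```python
# Hypothetical features: number of documents, number of lawyers and number
# of jurisdictions on a matter. Each feature gets its own learned weight.
def predict_billables(num_docs, num_lawyers, num_jurisdictions,
                      m1, m2, m3, b):
    return m1 * num_docs + m2 * num_lawyers + m3 * num_jurisdictions + b

# e.g. 40 documents, 3 lawyers, 2 jurisdictions, with made-up weights.
estimate = predict_billables(40, 3, 2, m1=2.5, m2=30.0, m3=15.0, b=10.0)
print(estimate)  # 230.0
```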

It is these components of the AI system that engineers systematically experiment with, also iteratively, to determine how to boost the system's performance, e.g. if it performs well on the training dataset but less well on the validation dataset.

8. Parameters

So what does the AI system generate or adjust within itself? The AI system generated parts are:

1. The intermediate and final choices of m (the weights) and b (the bias unit)

2. The predicted y values for a given input x value

As you can see, the AI system updates its own code, i.e. the parameters in its model function, but a lot of the mechanics are fixed by the AI engineers. A common misconception is that AI systems have few fixed components and more or less write and rewrite the majority of their code. This isn’t strictly true, or not in the sense most non-technical individuals understand it.

A legal snack

As you can see, an AI system's linear regression setup has many different components. Diagrammatically we can summarize them below.

IMAGE

Note that many of the components originate with different sources or processes.

For instance, in our AI system there is an interesting legal analysis to be made regarding who owns what IP in the system, its inputs and its outputs.

Without exploring that in legal detail, consider these questions:

  • Who owns the input data? If the engineers created it, perhaps them. But if they scraped it (i.e. downloaded it) from the internet, what were they allowed to do with that data?
  • Did the additional step of rearranging that data (e.g. into labelled x and y pairs) create any new rights in that data or database?
  • If the engineers sell their system to a third party who adds their own data and re-trains the model to generate better outputs than the original engineers, does that third party own the output, i.e. the newly adjusted model function that provides higher performance?
  • Can we say that the AI generated the m, b and output values and therefore has some agency and ownership of these values, which have potential commercial value? What about the person providing the data, without which the system can't learn anything?

Good questions to ponder!

What next? (Further reading)

If you made it this far, well done. This was a hard article to write and simplify, and we hope not a hard one to follow as a result! If you enjoyed it and want to learn more, we recommend the below resources as a next step. In time we will write an article with a fully worked maths example of machine learning (most likely linear regression) and how it can be used in law.

  • Machine Learning for Everyone. A well organized roadmap to the world of machine learning, including lots of intuitive real world examples (no code, no math). It's excellent as a next step in your learning, going into more detail on the other subtypes of machine learning and their variations.
  • Machine Learning is Fun. An 8 part article series of excellent practical worked examples. Knowledge of some basic Python and a deeper familiarity with maths will help, but isn't strictly necessary to further your education.

The post Machine learning with school math. Yes, you learnt the basics at school! appeared first on lawtomated.