This chapter provides a review of probability theory focused on those aspects of this extensive body of knowledge which are useful to develop and understand the risk models described earlier in the book.
We shall begin by describing the concepts of a random process and random variables. From this we can define probability and probability distributions. The main characteristics of probability distributions are defined and a summary of the best known distributions is provided in terms of these characteristics. This will provide a set of building blocks for risk models.
We then go on to some of the most important approximate properties of random variables. These enable us both to predict – to some degree – the results of risk modelling and to understand the results when we have them. This mitigates the black box effect.
Finally we consider the relevance of all this to risk analysis in organisations.
Random processes and random variables
Some things that happen in the world are pretty much predictable, especially if they are related to some basic physics. The sun will rise at the time printed in the paper. Your alarm will go off at the time you programmed it to. When you switch on the kettle for coffee, it will draw perhaps ten amps, depending on its rating.
But most things are less predictable. You do not know exactly what time you will arrive at work and you certainly cannot be sure what awaits you there. You do know it is much more likely to be something that ratchets up pressure and stress levels than the opposite. More mundanely, the ten amps will vary, depending on the voltage delivered by the power station that morning. It changes a bit, but hardly ever enough to make a noticeable, let alone significant, difference.
‘Much more likely?’ ‘Hardly ever?’ We certainly have a rich vocabulary to use to describe these unpredictable situations. The challenge is to see if we can do a bit better than to tell the chief executive, ‘we are pretty unlikely to go bankrupt’. It is ‘odds on’ that this will make increased stress a self-fulfilling prediction, with a ‘remote chance’ of an early reunion with the coffee kettle at home.
One solution is to develop the idea of randomness. This means that, for at least some part of the model, you abandon the idea of a deterministic relationship between cause and effect, between input and output, and replace it with something which is explicitly unpredictable. You can then ask what characterises this unpredictable element.
To make further progress we introduce the idea of a repeatable experiment. We can measure the supply voltage at 7 o’clock each morning. We can measure the number of radioactive decays in an hour. We can measure the lifetime of a batch of bulbs. We can measure the number of defective items on a car assembly line. We can measure the number of days per year it rains in Manchester. We can measure the number of Virgin trains from Manchester which turn up on time in London each day. We can measure the number of Labour victories in general elections since 1900. By examining these measurements, and by imagining that we could obtain an unlimited amount of data by repeating the experiment indefinitely, we can start to understand the nature of randomness and replace deterministic predictions with something else.
It is clear this is an idealisation. For a start there is no such thing as an infinitely repeatable experiment. There is only one set of election results since 1900. There are only a finite number of days since Virgin trains have been running from Manchester. Secondly, we know intuitively that this exercise is only going to be useful if none of the underlying factors changes. Virgin bought some new trains and ran them according to a new timetable on an upgraded infrastructure. All of this is going to change the characteristics of randomness as it affects punctuality. Thirdly, the world, and particularly the world as modified by human decisions and actions, is profoundly not random; just uncertain.
We shall return to this at the end of the chapter, because the nature of this uncertainty is obviously very important for risk analysis in organisations, where it is exactly human decisions and actions which are responsible for outcomes. But for now we take a more positive approach to the idea of repeatable experiments and ask what it can do for us.
It certainly does a lot for bookmakers who deploy the randomness concept to good effect. The idea that the outcome of a horse race can be considered the result of a random process is a credible one. But you can bet on many other propositions for which this is less the case. For example, at the time of writing the English football authorities had just appointed a new manager and before the appointment you could place bets on who the successful candidate was going to be. Given that the appointment would be decided by a small number of people in suits, it is not at all clear that this could be considered as random (although in fairness these are suits celebrated for their capriciousness and absence of logic). Predictably there were allegations of bet fixing.
The key idea for turning repeatable experiments into something usable is statistical regularity. This means that the fraction of experiments with a particular outcome, the relative frequency, tends to stabilise to a particular number. For example, the number of experiments in which there is exactly one radioactive decay in an hour might be:
4 in the first 10 experiments
33 in the first 100 experiments (4, 5, 0, 1, 2, 2, 5, 4, 5, 5 in each batch of 10)
355 in the first 1000 experiments (33, 39, 31, 31, 33, 36, 42, 36, 40, 34 in each batch of 100)
3677 in the first 10 000 experiments (355, 365, 366, 356, 365, 379, 355, 392, 370, 374 in each batch of 1000).
You can see that the fraction of experiments with one decay is stabilising around 36.8 per cent, which was the ‘correct’ value for the way I designed the ‘experiment’, which was actually a numerical simulation of the random process.
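This stabilisation is easy to reproduce. The following is a minimal sketch in Python of the same kind of simulation: decays are generated by exponential waiting times with an average rate of one per hour, and the relative frequency of exactly one decay settles towards e^(−1), about 36.8 per cent. The seed is an arbitrary choice for reproducibility.

```python
import random

# Simulate the repeatable experiment: count radioactive decays in one
# hour when the expected number per hour is 1 (a Poisson process).
# expovariate(1.0) gives the waiting time to the next decay, so we
# count how many decays fit into the hour.
def decays_in_one_hour(rng):
    count, t = 0, rng.expovariate(1.0)
    while t < 1.0:
        count += 1
        t += rng.expovariate(1.0)
    return count

rng = random.Random(42)  # arbitrary seed so the run is reproducible
for n in (10, 100, 1000, 10000):
    results = [decays_in_one_hour(rng) for _ in range(n)]
    ones = sum(1 for r in results if r == 1)
    print(n, ones / n)  # relative frequency drifts towards about 0.368
```

The relative frequency wanders for small n and stabilises as n grows, which is exactly the statistical regularity described above.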
You can also see that the relative frequency of an outcome, or event, has some specific properties. It must lie between zero and one, inclusive. The relative frequency of the combination of two mutually exclusive events is the sum of their individual relative frequencies. And the relative frequency of the event which comprises all possible outcomes is one.
Obviously, it is only a short step conceptually from a stabilised relative frequency to a probability. That is, the probability of an outcome, or event, is the stable value of the relative frequency with which it occurs in a large number of repetitions of the experiment. Equally obvious, the three properties of relative frequencies just mentioned also apply to the probability of events. It turns out that this is enough to develop the whole theory of probability, although we do not need to explore this further here.
To summarise, if you have a repeatable experiment you can define the probability of one of the outcomes of the experiment as the stable value of the relative frequency of the outcome you would find as you repeated the experiment an ever greater number of times. The important point is that the theory you can build on this basis can be used even where there is no underlying repeatable experiment. Whether you should use it is discussed later.
It is worth giving a warning whilst the data from the simulated experiment is fresh in our minds. Perhaps you were surprised that there were four occurrences of five decays in the first ten batches of ten. Could that really have happened by chance? In fact the chance of this is pretty low: there is a probability of 17 per cent of having exactly five incidences of a single decay in ten experiments. This translates to a 6 per cent probability of having exactly four incidents of this in ten such super experiments. But it is not that low. You can naturally find apparently non-random patterns in any simulation of a random process. Be careful not to be fooled: many have been.
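Both figures quoted can be checked in a few lines, using the fact that the 36.8 per cent base rate is e^(−1) and applying the binomial formula twice:

```python
from math import comb, exp

# Chance of exactly one decay in an hour (rate one per hour): e**-1.
p1 = exp(-1)  # about 0.368

# Probability of exactly five 'single decay' hours in a batch of ten:
p_five = comb(10, 5) * p1**5 * (1 - p1)**5
# Probability of exactly four such batches among ten batches of ten:
p_four = comb(10, 4) * p_five**4 * (1 - p_five)**6

print(round(p_five, 2), round(p_four, 2))  # 0.17 and 0.06
```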
The final concept here is that of the random variable. Instead of the number of experiments with a certain outcome, consider some function of the outcome which could be scored, that is, could be represented as a number. For example, the following lists the total number of radioactive decays in the simulation described above:
8 in the first 10 experiments (2, 0, 1, 0, 1, 1, 2, 1, 0, 0 in each experiment)
111 in the first 100 experiments (8, 11, 17, 12, 8, 8, 12, 16, 10, 9 in each batch of 10)
1038 in the first 1000 experiments (111, 98, 95, 113, 119, 98, 95, 102, 107, 100 in each batch of 100)
10 086 in the first 10 000 experiments (1038, 1031, 963, 967, 946, 1023, 1076, 1024, 995, 1023 in each batch of 1000).
The number of decays is a random variable. Its behaviour is determined by the probabilities associated with each outcome (0, 1, 2, … decays in the hour of the experiment). The data show that again the random variable is stabilising in some sense to around one per hour, which is, of course, exactly how it was designed.
The random variable can take several values, each one having a probability. This is known as a probability distribution. You need to keep clear in your mind the distinction between the probability of an event and the probability distribution of a random variable. To repeat, the distinction is that the random variable is a numerical function defined for each event.
Other random variables more relevant to organisational risk analysis include the total cost of a project (or its duration), the internal rate of return of an uncertain cash flow, the number of performance points for collecting waste in a timely way and so on.
Without being too mathematical, we have defined the probability of events and how this leads into the probability distribution of random variables. These probability distributions are the basic building blocks for risk models which are concerned with calculating the probability distribution of output random variables based on the probability distribution of input random variables.
We now describe some simple random processes and the probability distributions they give rise to. They can be grouped into either discrete distributions, where there are separately identifiable possible outcomes, or continuous distributions, where the random variable may take any value in a range.
Discrete Probability Distributions
Coins, dice and uniform distribution
The best known probability distribution is that which arises from the toss of a fair coin. There are only two events, heads or tails, and the fact that the coin is fair means each is equally likely. Since the probabilities add to one, there is no choice but to make the probability of each event 0.5. Another example is the throw of a fair dice, where the same reasoning says the chance of any of the six possibilities is 1/6.
Neither of these is a probability distribution as such, as there is no random variable, but if we identify the score on the dice as the random variable, then it takes the values 1, 2, 3, 4, 5 and 6 each with a probability of 1/6.
Repeated coin tossing and binomial distribution
If you toss a coin N times an obvious random variable is the number of heads. If you toss it three times, then there are eight outcomes, each with an equal probability of 1/8. Of these, one has three heads, three have two heads, another three have one head and one has no heads. So the corresponding probability distribution of the number of heads is: pr(0 heads) = 1/8, pr(1 head) = 3/8, pr(2 heads) = 3/8, pr(3 heads) = 1/8, where we are using the convention that pr(E) represents the probability of event, E.
For a more general case, where there are two possibilities, the event either happens at each experiment (with probability p) or it does not (with probability 1 – p) the corresponding result for the number of times the event happens in N trials is:
p_n = (N! / (n!(N − n)!)) p^n (1 − p)^(N − n)

where n! means n factorial, that is, 1 × 2 × 3 … × (n − 1) × n. Broadly speaking the p^n(1 − p)^(N − n) term is the probability of any of the sequences in which the event happens n times (and therefore does not happen N − n times), and the N! / (n!(N − n)!) term is the number of such sequences. This discrete probability distribution, described by p_n, is known as a binomial distribution.
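A sketch of the binomial probability in Python, using the standard library's comb for the binomial coefficient, reproduces the three-coin pattern above:

```python
from math import comb

def binomial_pmf(n, N, p):
    # comb(N, n) counts the sequences with n occurrences;
    # p**n * (1 - p)**(N - n) is the probability of any one such sequence.
    return comb(N, n) * p**n * (1 - p)**(N - n)

# Three tosses of a fair coin give the 1/8, 3/8, 3/8, 1/8 pattern:
print([binomial_pmf(n, 3, 0.5) for n in range(4)])
# [0.125, 0.375, 0.375, 0.125]
```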
The Poisson process
The previous section provided data on a radioactive decay process. The underlying process was assumed to be one where the probability that a given atom decays during a short interval is proportional to the size of the interval. Suppose there are N atoms (a very large number) and that the probability of any one of these decaying during the interval is FT where F is the decay rate and T is the size of the interval. In these circumstances the number of decays during the interval is given by the binomial distribution:
p_n = (N! / (n!(N − n)!)) (FT)^n (1 − FT)^(N − n)

Note that N! / (N − n)! is approximately N^n and (1 − FT)^N is approximately e^(−NFT), which you can see by taking logs of both sides and noting that ln(1 − FT) is approximately −FT. In this limit the distribution becomes the Poisson distribution:

p_n = q^n e^(−q) / n!, where q = NFT
In the formula f = NF, the product of the large number of atoms and the small probability of any individual one decaying per unit time, is assumed to be roughly comparable with the inverse of the observation time T, over which a reasonable number of decays might be seen. That means the parameter q = fT is of order one.
In risk analysis, the Poisson process is important because it represents events that may happen several times rather than just once. This is typically the case for safety risk analysis, where the chance of an accident occurring might be represented as once every thousand years, for example. This means that f = 0.001 per year in the formula. So the probability of one occurrence in 10 years is approximately q = 1 per cent and the probability of two is approximately q²/2 = 0.005 per cent, which is very small.
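As a check on these figures, here is a short sketch using the standard Poisson probability formula, p_n = q^n e^(−q)/n!:

```python
from math import exp, factorial

def poisson_pmf(n, q):
    # Probability of exactly n occurrences when the expected number is q.
    return q**n * exp(-q) / factorial(n)

# Accident rate f = 0.001 per year, observed over T = 10 years: q = 0.01.
q = 0.001 * 10
print(poisson_pmf(1, q))  # about 0.0099, i.e. roughly 1 per cent
print(poisson_pmf(2, q))  # about 0.00005, i.e. roughly 0.005 per cent
```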
Continuous Probability Distributions
Some random variables can take a continuous range of values and their probability distribution is correspondingly defined on a range. This necessitates an additional concept, the probability density function, which locally is the probability that the variable falls within an interval divided by the size of the interval. Probability density functions can be quite difficult for non-specialists to understand. An alternative is the cumulative probability distribution, which is defined as the probability that a random variable is less than some value. We denote probability density functions by p(x) such that:

pr(x < X < x + dx) = p(x) dx

and cumulative probability functions by P(x) so that:

P(x) = pr(X < x) = ∫ p(y) dy, the integral being taken from the lower end of the range up to x
The uniform distribution
The idea of a probability density function is most easily understood in the context of a uniform distribution. Say that all that is known is that the random variable lies between a and b. Then the width of the interval is (b − a) and the probability density is 1 / (b − a), that is:

p(x) = 1 / (b − a) for a ≤ x ≤ b, and zero otherwise

Obviously p(x) satisfies the normalisation condition that the probability of all possibilities is one:

∫ p(x) dx = 1

The cumulative distribution is given by:

P(x) = (x − a) / (b − a) for a ≤ x ≤ b
The exponential distribution
Going back to the Poisson process, another random variable is the length of time until a particular atom decays. If we write P(t) as the probability that it has decayed by time t, we know that this must satisfy the differential equation:

dP/dt = F(1 − P), with P(0) = 0, so that P(t) = 1 − e^(−Ft)

P(t) is the cumulative probability distribution. The probability density function of the decay-time random variable is:

p(t) = dP/dt = F e^(−Ft)
which is called an exponential distribution. The fact that this function is decaying represents the point that as time goes on the atom is less likely to decay because the chances are that it already has. In this case we can see again that, arguably, it is easier to deal with the cumulative function, or S-curve as we have already called it, than the probability density function.
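A short simulation (seed chosen arbitrarily) confirms that the empirical S-curve of decay times matches the exponential form P(t) = 1 − e^(−Ft):

```python
import random
from math import exp

# Simulated decay times for rate F = 1 per hour, compared with the
# exponential cumulative distribution P(t) = 1 - exp(-F*t).
F = 1.0
rng = random.Random(1)  # arbitrary seed
times = [rng.expovariate(F) for _ in range(100000)]

for t in (0.5, 1.0, 2.0):
    empirical = sum(1 for x in times if x < t) / len(times)
    print(t, round(empirical, 3), round(1 - exp(-F * t), 3))
```

The empirical and theoretical columns agree to within sampling error, and the S-curve itself is easier to compare than the density would be.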
The normal distribution
This is the most important distribution of all. Unlike the others, the initial formula for the distribution is not rooted in a specific random process:

p(x) = (1/√(2πσ²)) exp(−(x − μ)²/2σ²)
This represents a distribution which is symmetrical about x = μ and has a distance scale of the order of σ; that is, by the time x has reached μ ± σ the density function has dropped off significantly. It is the archetypal bell curve from Chapter 3. The initial, inverse square root factor is there to ensure normalisation.
To understand its importance, consider its logarithmic derivative:

d(ln p)/dx = −(x − μ)/σ²
This is the distribution whose logarithmic differential is given by a straight line passing through zero at the maximum.
Take the Poisson distribution, for example, in the case where q is very large. The most likely value of n is of the order of q and we can in fact write:

ln p_n = n ln q − q − ln n!

Hence, effectively turning n into a continuous variable, and using the fact that the derivative of ln n! is approximately ln n:

d(ln p_n)/dn = ln q − ln n, which is approximately −(n − q)/q for n close to q
This shows that the number of events in a Poisson process which is sufficiently long for many occurrences to be expected is distributed as a normal distribution with μ = σ2 = q.
This is intuitively reasonable. In the example we had a decay rate of one per hour and each experiment lasted for one hour. So the number of decays ranged from zero to seven in the sample of 10 000 we looked at. If each experiment had lasted 100 hours, we would expect to see around 100 decays with much smaller relative variation. Figure 12.1 shows the results from the sample of 10 000 considered as 100 experiments of 100 replications. It shows how the frequencies of the number of decays in each experiment cluster roughly around the expected normal and that the cumulative is pretty close to its normal approximation.
You can do the same thing with the binomial distribution. If the number of experiments is large, you can show that the number of successes is normally distributed with μ = pN and σ2 = Np(1 − p). If you followed the argument for the Poisson process, you might like to try that here as well.
So the point about the normal distribution is not that it describes a specific individual process, but that it is the limiting distribution in some very important circumstances. The cases we have just seen are not the only ones.
The cumulative probability for the normal distribution cannot be written in closed form, but is available from tables (or more likely via the appropriate spreadsheet function, NORMDIST in Excel).
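The same comparison can be sketched in Python, where statistics.NormalDist plays the role of the spreadsheet function. This assumes the 100-hour version of the experiment, so q = 100:

```python
from math import exp, factorial, sqrt
from statistics import NormalDist

q = 100  # expected number of decays in a 100-hour experiment
normal = NormalDist(mu=q, sigma=sqrt(q))  # mu = sigma**2 = q

def poisson_cdf(n, q):
    # Cumulative Poisson probability of n or fewer occurrences.
    return sum(q**k * exp(-q) / factorial(k) for k in range(n + 1))

for n in (80, 90, 100, 110, 120):
    # The 0.5 is a continuity correction for the discrete distribution.
    print(n, round(poisson_cdf(n, q), 3), round(normal.cdf(n + 0.5), 3))
```

The two columns track each other closely, illustrating the normal limit of the Poisson distribution.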
Mean, variance and percentiles
The next step is to look in more detail at how these distributions can be characterised. We have already noted that it is fundamental that the sum of probabilities is one, which is the normalisation condition:

Σ p_i = 1 for a discrete distribution, or ∫ p(x) dx = 1 for a continuous one

It is natural to try to determine the central location of the distribution, which we could expect to be the probability weighted mean:

μ = Σ x_i p_i or, in the continuous case, μ = ∫ x p(x) dx
This is also known as the expected value (in the mathematical sense).
Finally, we would like a measure of the spread of the distribution and it makes sense to look at the mean square departure from the mean. ‘Square’ ensures we get something that is positive; ‘from the mean’ ensures we eliminate the central tendency. Thus the variance, σ², is defined to be:

σ² = E((X − μ)²)
and the square root of this, σ, is called the standard deviation.
These three sets of formulae are tied in with what are known as moments. The r-th moment (about the origin) is defined by:

μ′_r = E(X^r)

whilst the r-th moment about the mean is:

μ_r = E((X − μ)^r)

and for second order moments you can show that:

σ² = E(X²) − μ²
The equation is sometimes useful to calculate the variance and standard deviation of distributions. Third and fourth order moments provide further information about the distribution in terms of skewness and kurtosis. Skewness is the third order moment about the mean divided by σ³, and kurtosis is the fourth order moment about the mean divided by σ⁴. Kurtosis is said to represent the peakiness of the distribution.
What is often more useful for risk analysis is percentiles. These apply particularly to continuous random variables where the cumulative is a continuous function from which specific values can be read off. The concept is simply one of selecting specific probability levels and determining the corresponding value of the random variable. For example, for a normal distribution there is a probability of 20 per cent that the value is higher than μ + 0.84σ (or less than μ − 0.84σ) whilst there is a probability of 5 per cent that the value is higher than μ + 1.64σ (or less than μ − 1.64σ). The values are known as the P80, P20, P95 and P5 respectively.
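These percentile points can be read off with Python's statistics.NormalDist, an alternative to tables or the spreadsheet function:

```python
from statistics import NormalDist

# Percentile points of the standard normal, matching the figures above.
z = NormalDist()  # mean 0, standard deviation 1
for p in (0.05, 0.20, 0.80, 0.95):
    print(f"P{int(p * 100)} = {z.inv_cdf(p):+.2f} sigma from the mean")
    # P5 = -1.64, P20 = -0.84, P80 = +0.84, P95 = +1.64
```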
Table 12.1 provides the mean and variance values for the distributions discussed so far, as well as reiterating the probability distributions.
Some more useful distributions
There are plenty more distributions around. Many are mainly of application in statistics; we shall meet some of them in the next chapter. This section provides a brief list and the distribution gallery (see Table 12.1) provides more details. They are introduced because they have been shown to be of use as the inputs of risk models.
The Triangle Distribution
As its name suggests, this distribution has a triangular shape and is described by its minimum, maximum and most likely values. The most likely value is the peak of the triangle. It can be symmetrical, where the most likely value is midway between maximum and minimum, or skewed.
The useful feature of this distribution is that it is easy to elicit from subject experts. Generally they feel most comfortable in providing the three parameters. And the fully skewed triangle, where the minimum is also the most likely, is a useful concept for when things can only get worse; for example where a project has been costed at the best case, or nearly so.
However, it is worth giving a word of warning: the triangle distribution is narrower than it looks. A symmetrical triangle distribution has a standard deviation which is only 41 per cent of the distance from the centre to the extremes, and the P20 and P80 are only 37 per cent of the way out. The P5 and P95 are 68 per cent of the way out.
For a fully skewed triangle, the standard deviation is 24 per cent of the width and P5, P20, P50, P80 and P95 are at 2.5 per cent, 11 per cent, 29 per cent, 55 per cent and 78 per cent of the way out respectively.
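These percentages can be verified from the closed-form cumulative of the triangle distribution. A sketch on unit intervals (the general case just rescales):

```python
from math import sqrt

# Symmetric triangle on [-1, 1] with mode 0: for x >= 0 the cumulative
# is F(x) = 1 - (1 - x)**2 / 2, so the upper percentiles sit at:
p80 = 1 - sqrt(2 * (1 - 0.80))  # about 0.37 of the way out
p95 = 1 - sqrt(2 * (1 - 0.95))  # about 0.68 of the way out
print(round(p80, 2), round(p95, 2))

# Fully skewed triangle on [0, 1] (mode at the minimum):
# F(x) = 1 - (1 - x)**2, so the percentiles are:
percentiles = {p: 1 - sqrt(1 - p) for p in (0.05, 0.20, 0.50, 0.80, 0.95)}
print({p: round(x, 2) for p, x in percentiles.items()})

# Standard deviations: half-width/sqrt(6) for the symmetric case and
# width/sqrt(18) for the fully skewed case.
print(round(1 / sqrt(6), 2), round(1 / sqrt(18), 2))  # 0.41 and 0.24
```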
Both of these distributions and the percentiles are illustrated in Figure 12.2.
One response to this is to fiddle around with the distribution, setting the minimum and maximum at P5 and P95, for example. However, a better approach is to force the experts to expand the range of possibilities, using the percentiles as evidence of the need for them to do so (as well as people’s war stories from when they have underestimated the range of possibilities).
The Gamma or Chi-Square Distribution
The Gamma distribution is a generalisation of the exponential distribution described previously. The exponential decays from a fixed value at x = 0; the Gamma has a more general power law behaviour at x = 0:

p(x) ∝ x^s e^(−kx) for x > 0
For s between −1 and 0, the probability density function tends to infinity at zero; above zero, the probability density function grows at first as x increases and then decays as the exponential term takes over. The exponential distribution corresponds to s = 0.
The mean and variance of the Gamma distribution are given by:

μ = (s + 1)/k and σ² = (s + 1)/k²
Thus for large values of s, the standard deviation is small compared to the mean and, in fact, the Gamma tends to a normal distribution. (You can prove this with the logarithmic derivative technique discussed above.) As s becomes negative and approaches –1, however, the standard deviation becomes much larger than the mean. This is a useful property on occasion.
The chi-square distribution is a special case of the Gamma (with k = 1/2 and s restricted to half integers). It is important in statistical inference because it is the distribution of the sum of squares of normally distributed random variables.
The Beta Distribution
The beta distribution can be a useful distribution where it is desired to fit elicited data with a maximum and minimum, and power law behaviour near the extremes. Thus:

p(x) ∝ (x − a)^s (b − x)^t for a ≤ x ≤ b

In this case, the mean and variance are given by:

μ = a + (b − a)(s + 1)/(s + t + 2) and σ² = (b − a)²(s + 1)(t + 1)/((s + t + 2)²(s + t + 3))
For large values of s and t the distribution is symmetrically distributed about the mean and is approximately normal. As s becomes negative and approaches –1, the distribution shrinks to the minimum, but has a standard deviation which is much bigger than the difference between the mean and the minimum. A similar result applies when t approaches –1, in which case the distribution shrinks onto the maximum. Again, this can be a useful property.
The Weibull Distribution
We have already noted that the Poisson process describes events which happen at a constant rate. A generalisation of this is events which happen at a changing rate, either increasing or decreasing. Where this rate is power law with time, the resulting distribution of time until failure is called a Weibull distribution.
With a failure rate equal to a t^(b−1), the probability density function of the time to failure is:

p(t) = a t^(b−1) e^(−a t^b / b)
The mean and variance of this is provided in the distribution gallery, Table 12.1.
The Lognormal Distribution
If X is a random variable with a normal distribution then Y = eX is a random variable with a lognormal distribution. This is important if Y is the product of a number of random variables, all of which are themselves lognormal. The properties of the sums of normal random variables can be applied to this. We shall describe this later.
The mean and variance are as follows:

E(Y) = e^(μ + σ²/2) and var(Y) = (e^(σ²) − 1) e^(2μ + σ²)

where μ and σ² are the mean and variance of X. Note that you need to pay attention to dimensions, as X, and hence μ and σ, should be dimensionless.
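The standard results for the lognormal mean and variance, E(Y) = e^(μ + σ²/2) and var(Y) = (e^(σ²) − 1)e^(2μ + σ²), can be checked by simulation. The parameter values and seed here are arbitrary choices for illustration:

```python
import random
from math import exp

# If X ~ Normal(mu, sigma), then Y = e**X is lognormal.
mu, sigma = 0.0, 0.5  # illustrative values, not from the text
rng = random.Random(7)
ys = [exp(rng.gauss(mu, sigma)) for _ in range(100000)]

sim_mean = sum(ys) / len(ys)
sim_var = sum((y - sim_mean) ** 2 for y in ys) / len(ys)

# Compare simulation with the closed-form results:
print(round(sim_mean, 3), round(exp(mu + sigma**2 / 2), 3))
print(round(sim_var, 3), round((exp(sigma**2) - 1) * exp(2 * mu + sigma**2), 3))
```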
Mixed Random Variables
Finally, risk modelling sometimes throws up circumstances where a risk may occur or not and, if it does, the consequence is uncertain. This can be decomposed into two random variables:

X = I × Y

where I is an indicator variable which takes the value one with probability P (the risk occurs) and zero otherwise, and Y has its own probability distribution. Thus a project may be considered to overrun with probability P and, if it does, the additional time might be considered to have a triangular distribution of some kind. The mean and variance of this distribution are shown in Table 12.1.
It is worth thinking about the reality of such modelling given that it produces unusual looking curves. The probability density function has a delta function peak at zero and then a standard triangle separated from it. The S-curve shows a jump followed by a traditional S. Ask yourself if this is really what you want and whether at least a fully skewed triangle might be more appropriate.
Figure 12.3 illustrates a jump in the S-curve. The chance of a risk materialising is 50 per cent. The dark line shows the S-curve for the case where the distribution of the impact is a triangular distribution between one and two. The lighter curve is an alternative: a fully skewed triangular distribution between zero and two. For many types of risk this is inherently more plausible than the alternative. This is essentially the case where the risk represents some uncertainty: is the discontinuity really realistic? Note that the density function for the mixed case cannot be drawn, as the delta function spike at zero is infinite.
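A minimal simulation of the mixed model (seed and sample size chosen arbitrarily) shows the jump directly: half of the simulated outcomes sit exactly at zero.

```python
import random

# Mixed model: with probability 0.5 the risk does not materialise
# (impact zero); otherwise the impact is triangular between one and two.
rng = random.Random(3)
impacts = [rng.triangular(1, 2) if rng.random() < 0.5 else 0.0
           for _ in range(100000)]

# The S-curve jumps: about half the outcomes are exactly zero.
print(sum(1 for x in impacts if x == 0) / len(impacts))  # close to 0.5
# Mean is P * mean(triangle) = 0.5 * 1.5, i.e. close to 0.75.
print(sum(impacts) / len(impacts))
```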
Multiple events, conditional probabilities and independence
In defining probabilities we used the concept of the ‘stabilised relative frequency’ of an outcome or event. We did not discuss the nature of the outcome or event in any detail. However, for most ‘repeated experiments’ the possible events have a much richer structure than we have implied up to now.
For example, if the repeated experiment is waking up in the morning and measuring where I am and whether it is raining, then the events could be ‘Manchester’, ‘raining’ or ‘dry in Bali’, to name a few. The probability of these events will be structured according to the way these events interact with one another. For example, the probability of rain could depend on whether I am in Manchester or London. To deal with this it is useful to define conditional probabilities. For example, pr(rain|Manchester) denotes the probability that it is raining, given that I am in Manchester.
It is pretty obvious (and in fact axiomatic) that for any experiment and two events A and B:

pr(both A and B) = pr(A|B) × pr(B)
so the probability of waking up to rainy Manchester is the probability that I am in Manchester, multiplied by pr(rain|Manchester) as above.
Two things come out of this. Firstly, if the chance of A does not depend on B then:

pr(A|B) = pr(A), so that pr(both A and B) = pr(A) × pr(B)
In this case A and B are said to be independent, and this is a necessary and sufficient condition for independence. Independence (or not) of risk events is a key concept in risk modelling.
Secondly, the symmetry in the expression pr(both A and B) means that:

pr(A|B) × pr(B) = pr(B|A) × pr(A), or equivalently pr(B|A) = pr(A|B) × pr(B)/pr(A)
This apparently innocuous result is the celebrated Bayes theorem; it is both intriguing and useful. The key point is the reversal of A and B in the conditional probabilities.
I can imagine waking up not knowing where I am. What does the fact that it is raining tell me? Bayes theorem says:

pr(Manchester|rain) = pr(rain|Manchester) × pr(Manchester)/pr(rain)
I already know pr(rain|Manchester) and I know the fraction of time I am in Manchester: pr(Manchester). I also know the probability of rain. It is:

pr(rain) = pr(rain|Manchester) × pr(Manchester) + pr(rain|London) × pr(London)

But actually I do not need to know this independently, as it does not depend on which location is being considered: it is just a normalising factor. Thus Bayes theorem can use the fact that it is raining to update my underlying probability of waking up in Manchester. To make this concrete, consider Table 12.2 which assumes I wake up only in Manchester or London (I was just getting my hopes up with Bali).
Using the numbers in Table 12.2, the chance that I am in Manchester has increased from 80 per cent with no information about the weather to 87.5 per cent (that is, 0.8 × 0.7 / 0.64) with the knowledge that it is raining.
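The calculation can be sketched directly. The value pr(rain|London) = 0.4 is an assumption consistent with the quoted pr(rain) = 0.64, since Table 12.2 itself is not reproduced here:

```python
# Bayes theorem with the Manchester/London example:
# pr(Manchester) = 0.8, pr(rain | Manchester) = 0.7, and (assumed,
# consistent with pr(rain) = 0.64) pr(rain | London) = 0.4.
p_mcr = 0.8
p_rain_given_mcr = 0.7
p_rain_given_ldn = 0.4

# Total probability of rain, whichever city I wake up in:
p_rain = p_rain_given_mcr * p_mcr + p_rain_given_ldn * (1 - p_mcr)

# Updated probability that I am in Manchester, given that it is raining:
p_mcr_given_rain = p_rain_given_mcr * p_mcr / p_rain
print(round(p_rain, 2), round(p_mcr_given_rain, 3))  # 0.64 0.875
```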
You may not find this very impressive, but using this approach enables doctors to refine diagnoses in the light of new evidence, as well as having many other applications. In Chapter 15 we use it in a decision problem.
It is also used where the A events refer to emerging data, as here, but the B events are essentially the truth of hypotheses about a statistical model: the probabilities become the degree of belief in the hypothesis. This technique gives rise to Bayesian methods in statistical inference.
Joint probability distributions and correlation
Following from this discussion of the structure of events, there are corresponding concepts for random variables.
We could convert our early morning location and weather measurements into random variables, but going back to our repeated experiments gives us better examples of multiple outputs. For instance, the measurement of component failure might include both the time of failure and the temperature at that time. Throwing a pair of dice would result in the score shown by each of them. The relative frequency concepts we used before can be applied just as well to define the probability distribution. This time we will refer to it as a joint probability distribution, to emphasise that it is several random variables being jointly described.
The joint distribution might be discrete (as for the dice), continuous (as for the time and temperature at failure) or might even represent a mixture of both types of random variable (as for the project overrun previously discussed under Mixed Random Variables). For two continuous random variables X and Y we write the joint probability density function as p(x, y).

We can use a joint probability distribution to determine the marginal distributions of the constituent random variables:

p_X(x) = ∫ p(x, y) dy and p_Y(y) = ∫ p(x, y) dx
Note that the joint distribution can be used to determine the marginal distributions, but not vice versa.
Our core concern in risk modelling is to understand the distribution of functions of multiple random variables. That is, if we have a model which defines output random variables in terms of input random variables, we want to know how the output distribution depends on (a) the input distributions and (b) the functions in the model.
A key factor in determining this is whether the input random variables are independent or not. By definition X and Y are independent random variables if and only if the joint distribution is the product of their marginal distributions:

p(x, y) = p_X(x) × p_Y(y)
This implies that whatever the value of Y, the distribution of X is the same and equal to the marginal distribution. The same applies the other way around. It is entirely reasonable that if random variables are independent, then their distributions are not affected by each other. Equally, if their distributions do not affect each other, then there is no measurable dependency.
As we have seen in earlier sections, risk models invariably contain dependencies. Whilst the outcome of each dice throw is effectively independent (although you could think of things which affect each), we can certainly expect that component failure will be dependent on temperature. So, the next question is how to measure dependence.
One answer is to define the correlation coefficient between two random variables:
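A standard form of this definition, consistent with the discussion that follows (μ and σ denote the means and standard deviations of X and Y), is:

```latex
\rho_{X,Y} = \frac{E(XY) - \mu_X \mu_Y}{\sigma_X \sigma_Y}
```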
E(XY) stands for the expected value of the product XY. We shall show below that if X and Y are independent, then E(XY) = μXμY and the correlation coefficient is zero.
It can be shown that the correlation coefficient lies between –1 and 1 and that if it takes either of the values –1 or 1 the relationship between X and Y is linear. The focus of the correlation coefficient is the linearity of any relationship between X and Y and effectively ignores other aspects of any relationship or dependency. Thus, a correlation coefficient of zero does not mean the two random variables are independent.
As an example, consider Y = X² where X is normally distributed with a mean of 0 and a standard deviation of 1.
The marginal distribution of Y is a gamma distribution (for the reason mentioned previously in connection with the chi-square distribution), so the joint distribution function does not satisfy the conditions for independence. But the correlation coefficient is clearly zero, since both terms in its numerator vanish: E(XY) = E(X³) = 0 and μXμY = 0.
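A quick numerical check of this example (a sketch using NumPy; the sample size and seed are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.standard_normal(1_000_000)  # X ~ N(0, 1)
y = x ** 2                          # Y = X^2: completely determined by X

# Sample correlation coefficient between X and Y
rho = np.corrcoef(x, y)[0, 1]
print(rho)   # very close to zero despite total dependence
```

Despite Y being a deterministic function of X, the estimated correlation is indistinguishable from zero, illustrating that zero correlation does not imply independence.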
Independence implies the joint distribution function is separable into the marginals, and a separable joint distribution implies independence.
Independence implies zero correlation, but zero correlation does not imply independence.
Putting dependence and correlation into risk models is discussed in other parts of this book, but as a final comment here, we know the output of risk models depends on the inputs, so we expect correlation between inputs and outputs. We can guess that the stronger the correlation, the more important the input is for the output. We return to this in the next chapter.
Functions of several random variables
We keep stressing that a risk model is just a set of input random variables and some functions providing the output(s). Some of this is quite simple and we now discuss some basic aspects of functions of random variables.
Sums of Random Variables
The most fundamental of these aspects is the case where the function is a sum of random variables. In this case the expected value of the sum is the sum of the expected values. To see this, we have by definition that if X = X1 + X2, then:
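The derivation runs along these lines (continuous notation assumed; the inner integrals produce the marginal densities):

```latex
E(X_1 + X_2) = \iint (x_1 + x_2)\, f_{X_1,X_2}(x_1, x_2)\, dx_1\, dx_2
             = \int x_1 f_{X_1}(x_1)\, dx_1 + \int x_2 f_{X_2}(x_2)\, dx_2
             = E(X_1) + E(X_2)
```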
This applies whether X1 and X2 are independent or not, and can be extended to any number of random variables.
Furthermore, the variance of a sum of random variables can be written in terms of the individual variances and the correlation coefficient:
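The formula in question, in the notation of the earlier definition of ρ, is:

```latex
\mathrm{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 + 2\rho\,\sigma_1\sigma_2
```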
As a special case, if the variables are independent, the variance of the sum is the sum of the variances since the correlation coefficient is zero.
As another special case, if ρ = 1 (full correlation, implying a positive linear relationship between X1 and X2) then:
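With ρ = 1 the variance formula collapses to a perfect square, so the standard deviations simply add:

```latex
\mathrm{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 + 2\sigma_1\sigma_2 = (\sigma_1 + \sigma_2)^2
```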
There is a corresponding formula for the case that ρ = −1 and the relationship is negative:
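For ρ = −1 the perfect square has the opposite sign, so the standard deviations subtract:

```latex
\mathrm{Var}(X_1 + X_2) = \sigma_1^2 + \sigma_2^2 - 2\sigma_1\sigma_2 = (\sigma_1 - \sigma_2)^2
```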
These formulae are useful because summing random variables happens a lot in business risk modelling, for the simple reason that money is additive.
Products of Random Variables
If X1 and X2 are independent and a third random variable X is their product, then the expected value of X is the product of the expected values of X1 and X2. To see this, if X = X1X2, then:
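The factorisation relies on the joint density separating into the marginals under independence:

```latex
E(X_1 X_2) = \iint x_1 x_2\, f_{X_1}(x_1)\, f_{X_2}(x_2)\, dx_1\, dx_2
           = \int x_1 f_{X_1}(x_1)\, dx_1 \int x_2 f_{X_2}(x_2)\, dx_2
           = E(X_1)\, E(X_2)
```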
Finally, there is an analogous formula for the variance of the product of independent random variables:
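The standard result for independent X1 and X2, which follows from applying E(X²) = μ² + σ² to each factor, is:

```latex
\mathrm{Var}(X_1 X_2) = (\mu_1^2 + \sigma_1^2)(\mu_2^2 + \sigma_2^2) - \mu_1^2\mu_2^2
                      = \mu_1^2\sigma_2^2 + \mu_2^2\sigma_1^2 + \sigma_1^2\sigma_2^2
```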
These formulae can be combined to estimate the mean and variance of the output of a risk model formed from a sum of products in which the individual random variables are independent.
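A Monte Carlo check of the sum and product formulae (a sketch in NumPy; the means, standard deviations and sample size are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
# Two independent inputs with known means and standard deviations
x1 = rng.normal(10.0, 2.0, n)   # mu1 = 10, sigma1 = 2
x2 = rng.normal(5.0, 1.0, n)    # mu2 = 5,  sigma2 = 1

s = x1 + x2   # sum of the two inputs
p = x1 * x2   # product of the two inputs

# Sum: means add; with independence, variances add too
print(s.mean(), s.var())   # about 15 and 5

# Product of independent variables:
#   mean     = mu1 * mu2
#   variance = mu1^2 * sigma2^2 + mu2^2 * sigma1^2 + sigma1^2 * sigma2^2
print(p.mean(), p.var())   # about 50 and 204
```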
The final area of probability theory discussed in this chapter covers a number of approximations which are useful both to produce ballpark results and to provide an understanding of what is going on when more exact results are calculated.
Whilst the mean and standard deviation are only two parameters of a probability distribution, they convey a lot of information. This is perhaps best summed up by Chebyshev’s inequality: a simple result which says that a random variable cannot stray too many standard deviations from its mean. Or more precisely, that the probability of its doing so is low. This is obvious in a way: if there were a decent probability of the variable being a long way from its mean, the standard deviation would be higher. The beauty of this result is that it is distribution independent.
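In symbols, for any random variable with mean μ and standard deviation σ, and for any k > 0:

```latex
P\left(|X - \mu| \geq k\sigma\right) \leq \frac{1}{k^2}
```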
although it tells us nothing unless k > 1, since probabilities are always at most one in any case.
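The proof can be sketched in one line: start from the definition of the variance, drop the central part of the integral, then replace (x − μ)² by its smallest value, k²σ², in the remaining wings:

```latex
\sigma^2 = \int_{-\infty}^{\infty} (x-\mu)^2 f(x)\, dx
        \geq \int_{|x-\mu| \geq k\sigma} (x-\mu)^2 f(x)\, dx
        \geq k^2\sigma^2 \int_{|x-\mu| \geq k\sigma} f(x)\, dx
         = k^2\sigma^2\, P(|X-\mu| \geq k\sigma)
```

Dividing through by k²σ² gives the inequality.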
The inequalities we have applied during the proof are pretty drastic. First we dropped the central term in the integral and secondly we reduced the integrand in the outer wings. Figure 12.4 shows the overall effect for the case of a normal distribution and k = 1.6.
The inequality amounts to approximating the integrand for the standard deviation integral with the heavy black line. The correct integrand is the bimodal lighter line. This is clearly not a good approximation, at least for the normal. Its impact for different distributions is shown in Figure 12.5.
Taking the normal as an example, we might try to estimate the P20 and P80 from Chebyshev. The required tail probability is 40 per cent, which suggests k = 1.6 (the inverse square root of 0.4), as marked. But the actual probability of a normal variable lying more than 1.6 standard deviations from its mean is 11 per cent, and we saw earlier that the exact k for these percentiles is 0.84. This suggests that the inequality is not of great practical use for estimating percentiles. However, Chebyshev remains an interesting result, which we build on in the next section.
The Law of Large Numbers
At the start of this chapter we made progress by asserting the statistical regularity of the results of repeated experiments. This allowed us to define probability as the stable value of a relative frequency. We can now use Chebyshev’s inequality to demonstrate statistical regularity. Consider an event which has probability p. The experiment is repeated N times and the relative frequency f is a random variable defined to be n / N where n is the number of times the event occurs. n has a binomial distribution with mean Np and variance Np(1 − p), so f has mean p and variance p(1 − p)/N. Applying Chebyshev gives us:
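Applying the inequality to f, with mean p and standard deviation √(p(1 − p)/N), gives:

```latex
P\!\left(|f - p| \geq k\sqrt{\frac{p(1-p)}{N}}\right) \leq \frac{1}{k^2}
```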
or with q = kσ
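With this substitution the bound becomes a statement about the probability that f misses p by more than a fixed amount q:

```latex
P\left(|f - p| \geq q\right) \leq \frac{p(1-p)}{N q^2}
```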
As we can see, the probability that the relative frequency lies outside a defined range about its stable value, in this case of width 2q, can be made small by making N sufficiently large.
Perhaps of more direct relevance is to note that the standard deviation of the relative frequency decreases in proportion to 1/√N as N becomes larger. This is a key result for calculating risk models, as the next chapter will show.
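The 1/√N behaviour is easy to see by simulation (a sketch in NumPy; the event probability and number of trials are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.3          # probability of the event
trials = 10_000  # number of repeated experiments per value of N

def rel_freq_sd(N):
    # Run `trials` experiments of N repetitions each and measure
    # the spread of the relative frequency f = n / N
    n = rng.binomial(N, p, size=trials)
    return (n / N).std()

sd_100 = rel_freq_sd(100)
sd_400 = rel_freq_sd(400)
print(sd_100 / sd_400)   # close to 2: quadrupling N halves the spread
```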
The Central Limit Theorem
The final piece of mathematics in this chapter is perhaps the most important, and perhaps even the most impressive, of all. It relates to the situation where a random variable is the sum of other random variables and these are independent. Recall that in these circumstances the mean is the sum of the means and the variance is the sum of the variances, so we know the mean and variance of the result. The Central Limit Theorem goes one step further and tells us that the sum has a distribution that is approximately normal. How come?
The brief answer is that as more and more random variables are added to the sum, the individual details are lost. The amount of information goes down, and the normal distribution is the minimally informative among all distributions with a specified mean and standard deviation. The information referred to here is analogous to the definitions used in statistical mechanical treatments of thermodynamics, and it is for the same reason that the distribution of molecular velocity components in a perfect gas is normal.
To provide a concrete example, consider a sum of random variables, each of which is a fully skewed triangle. Figure 12.6 shows the distribution of one, then two, then three of these. Even with two distributions the result is quite rounded and the most likely value has shifted to a central point not far off the mean. The normal distributions with the same characteristics are also shown. The greatest deviation is clearly at the margins as you would expect. But it is not long before the Central Limit Theorem takes over even for quite peculiar distributions. You will see the Central Limit Theorem demonstrated at several other points in this book.
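The triangle example can be reproduced numerically by tracking skewness, which is zero for a normal distribution (a sketch in NumPy; the sample size and choice of 1, 2 and 12 terms are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
trials = 200_000

def skewness(v):
    # Sample skewness: zero for a perfectly normal sample
    v = v - v.mean()
    return (v ** 3).mean() / v.std() ** 3

# A fully skewed triangle on [0, 1] has its mode at the left end.
# Build sums of 1, 2 and 12 independent copies of it.
sums = {k: rng.triangular(0, 0, 1, size=(trials, k)).sum(axis=1)
        for k in (1, 2, 12)}

for k, s in sums.items():
    print(k, round(skewness(s), 3))
```

The printed skewness falls steadily towards zero as terms are added, which is the Central Limit Theorem at work.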
A proper proof of the Central Limit Theorem is very complex. In developing the proof, the conditions under which it is true and the sense in which the distribution is approximately normal emerge. In broad terms, these are that no single distribution dominates or, very crudely, that all the individual standard deviations are small compared with the total standard deviation. This is not too surprising.
Is Probability Relevant?
This chapter has provided a superficial overview of some elements of probability theory which are useful to carry out and understand risk modelling. The starting point for defining probability is the concept of repeatable experiments which demonstrate statistical regularity. Out of this we can develop a rich theory which explains the outcome of various specific types of experiments.
So what? How can this have any relevance to risk modelling in organisations, where there can be no concept of a repeatable experiment? Business life only happens once and we cannot try again, even if we wanted to. More fundamentally, the idea that the outcome of what we do is random is clearly wrong. We have underlined a number of times the point that life is human, filled with an exciting mixture of grand plans, incompetence, flashes of brilliance, computer crashes, trend-following consumers, personal grudges and so on.
The answer is, firstly, that the existence of probabilities and their behaviour is not dependent on repeatable experiments. It can just as well be developed as a mathematical construct from a series of axioms (which mirror the common sense aspects of statistical regularity). Secondly, you can choose to adopt probabilities as a way of expressing your beliefs about the likelihood of different future outcomes. Having done this, you can use the theory just as if the future were a repeatable experiment.
In fact, the idea of probabilities as subjective degrees of belief as opposed to stable relative frequencies is one that has been hotly debated over many decades by mathematicians and statisticians. It can be argued that there is no such thing as an objectively definable probability, and this position certainly leads to a clear and coherent philosophy. Some people find this hard to swallow in the context of tossing fair coins and so on. This does not matter: you do not need to believe that all probabilities are subjective to accept that using subjective probabilities is a good idea.
There are a number of points which arise from this.
Firstly, this is not very useful to the average manager. They will not want to think that the validity of their risk model, on which they are making important decisions, is dependent on the outcome of a titanic struggle between two warring camps of academically minded people, subjectivists and frequentists. Do not allow them to worry about such things.
Secondly, though, you have to recognise and embrace subjectivity. There is no right answer, just many reasonable ones, from which you need (preferably) to select just one. Again, this is difficult for managers who put their careers on the line when they make decisions and have unrealistic hopes that a bit of analysis will tell them what to do. The idea of an organisation’s subjective probability is a difficult one. But consensus is needed on:
the input distributions;
how the model will derive the outputs from the inputs;
the principle that it is a good idea to build a probabilistic model.
The decision team and their expert support need to work on achieving this consensus. In particular, they need to recognise that achieving consensus on subjective numbers is very difficult. But they need to do it anyway.
Thirdly, these three elements of consensus have failed to highlight perhaps the biggest potential pitfall in risk modelling: independence. The approximations in the previous section and other aspects of probability theory underline the point that probabilistic systems tend to exhibit behaviour which is centred on the average rather than spread over all possible values. It is dependence – correlation – which guards against unrealistic narrowing of ranges.
A familiar practical aspect of this is the idea of diversifying risk in a portfolio of equities. If you put all your money into one equity, then you are exposed to the full risk (standard deviation) of that equity. If you split it between several and their values vary independently of each other, your risk is reduced as the inverse square root of the number of equities. And if you could find equities which are negatively correlated (you won’t) you might do even better. Investors use independence to reduce their risk as far as possible (for more on this see Chapter 10); businesses need to recognise their risk is increased by correlation.
This is the most common failing of risk models: their outputs are too narrow because dependencies between the inputs are not recognised. So we extend the first bullet point above to say that consensus is also needed on:
the input distributions and the dependence between them.
In general, this is much more important than the precise details of the marginal distributions themselves. Just remember how the typical elements of a risk model – for example, the duration of each task in a schedule – are affected by ‘grand plans, incompetence, flashes of brilliance, computer crashes, trend-following consumers, personal grudges and so on’.