Data Analysis Topics


Data Analytics

1. Statistical Distributions

Intro
Statistics is the field of mathematics describing characteristics of data together with underlying probability (or "random variable") distributions. Many of the most common functions are available as formulas in programs like Excel, while others require more specialist software to complete the calculations. Compared to real-world data, most such functions are only approximate models, though some describe particular processes very closely: the binomial distribution models independent identical trials (often found in games of chance); the normal distribution models population traits well (such as heights); and the Poisson distribution models the number of occurrences of some random event in a given time (e.g. a radioactive particle being emitted).
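As a minimal sketch in plain Java (the average rate of 3 events per minute is an invented example), the Poisson formula P(k) = lambda^k * e^(-lambda) / k! can be evaluated directly:

    // Sketch: Poisson probabilities for a counting process (illustrative numbers only).
    public class PoissonExample {
        // P(k events) = lambda^k * e^(-lambda) / k!
        static double poisson(double lambda, int k) {
            double logP = k * Math.log(lambda) - lambda;
            for (int i = 2; i <= k; i++) logP -= Math.log(i);   // subtract log(k!)
            return Math.exp(logP);
        }

        public static void main(String[] args) {
            double lambda = 3.0;  // hypothetical average of 3 emissions per minute
            for (int k = 0; k <= 6; k++) {
                System.out.printf("P(%d events) = %.4f%n", k, poisson(lambda, k));
            }
        }
    }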
Decision making
The estimation of statistical properties of datasets through probability distribution models lies at the heart of many kinds of business calculation (e.g. analysis of risk, insurance, demand forecasting, resource usage fluctuations, traffic flows). These may then be used to guide the allocation of business resources. 

Statistics is not an exact science, but using statistics to provide measures of the spread of possible outcomes ahead of time is the basis for well-informed, realistic business decision making; and in an increasingly data-driven world, its value to businesses is clearly rising.
Regression
Regression means building relationships between random variables by using the scatter of observed data points to estimate trend lines capturing the average "response" of one or more variables to others. Regression lines are usually those relationships which minimise the cumulative error of the data points around the identified trend, or regression, line - technically, the 'mean squared error'.

Regression trend lines can be used to capture bulk characteristics of relationships between variable sets, but should not be used to extrapolate far beyond the range of observed data.
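As a minimal sketch in plain Java (the data points below are invented for illustration), a least-squares trend line y = a + b*x can be fitted directly from sums over the observed points:

    // Sketch: ordinary least-squares fit of y = a + b*x (illustrative data only).
    public class RegressionExample {
        public static void main(String[] args) {
            double[] x = {1, 2, 3, 4, 5, 6};              // hypothetical predictor values
            double[] y = {2.1, 3.9, 6.2, 8.1, 9.8, 12.2}; // hypothetical responses
            int n = x.length;
            double sx = 0, sy = 0, sxx = 0, sxy = 0;
            for (int i = 0; i < n; i++) {
                sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
            }
            double b = (n * sxy - sx * sy) / (n * sxx - sx * sx); // slope minimising squared error
            double a = (sy - b * sx) / n;                         // intercept
            System.out.printf("trend line: y = %.3f + %.3f * x%n", a, b);
        }
    }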
Further distributions
Other commonly used distributions include the exponential, chi-squared, gamma and beta distributions, together with the t- and F-distributions used in significance testing. I can link any of these into coding pipelines via Java or R and advise on the correct circumstances in which to use each.

Some are only likely to be appropriate given a very regularised study design, however (such as in a science experiment). Real-world data is often generated by too many interacting processes to satisfy the assumptions of such models, but specific complex cases may still be tractable (e.g. in modelling properties of common natural phenomena such as queues or epidemics).
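One way of linking such distributions into a Java pipeline is through a statistics library; the snippet below is a sketch assuming the Apache Commons Math 3 library (commons-math3) is available on the classpath, rather than a prescription of any particular toolkit:

    // Sketch: evaluating standard distributions in a Java pipeline.
    // Assumes the Apache Commons Math 3 library (commons-math3) is on the classpath.
    import org.apache.commons.math3.distribution.ChiSquaredDistribution;
    import org.apache.commons.math3.distribution.NormalDistribution;

    public class DistributionExample {
        public static void main(String[] args) {
            NormalDistribution normal = new NormalDistribution(0, 1);     // standard normal
            ChiSquaredDistribution chiSq = new ChiSquaredDistribution(4); // 4 degrees of freedom

            // Probability of a standard normal value falling below 1.96 (about 0.975)
            System.out.println("P(Z < 1.96) = " + normal.cumulativeProbability(1.96));

            // Upper-tail probability for a chi-squared statistic of 9.49 on 4 d.f. (about 0.05)
            System.out.println("P(X > 9.49) = " + (1 - chiSq.cumulativeProbability(9.49)));
        }
    }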

2. Hypothesis Testing

Hypothesis Testing
Oftentimes a data analyst must choose between a collection of models, any of which might be true. One way to discriminate between them is to calculate, for each candidate model, the probability that it would generate data like the observed real data. If a model's probability falls below a threshold level (usually 5% or 1%), the hypothesised model (termed the 'null' model) is rejected, and one of the alternative hypotheses can then be accepted.

Often there is a wide variety of different distribution hypotheses H0, H1, H2, ... to choose from, each with different assumptions and characteristics.
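As a hedged illustration in plain Java (the counts are invented), an exact binomial test of a 'fair coin' null hypothesis can be carried out by summing tail probabilities and comparing the result with a 5% threshold:

    // Sketch: exact binomial test of a 'fair coin' null hypothesis (illustrative numbers).
    public class BinomialTest {
        // P(exactly k successes in n trials with success probability p)
        static double binomialPmf(int n, int k, double p) {
            double logC = 0;
            for (int i = 1; i <= k; i++) logC += Math.log(n - k + i) - Math.log(i);
            return Math.exp(logC + k * Math.log(p) + (n - k) * Math.log(1 - p));
        }

        public static void main(String[] args) {
            int n = 100, observed = 60;       // hypothetical: 60 heads in 100 tosses
            double pValue = 0;                // P(result at least as extreme as observed | fair coin)
            for (int k = observed; k <= n; k++) pValue += binomialPmf(n, k, 0.5);
            System.out.printf("one-sided p-value = %.4f%n", pValue);
            System.out.println(pValue < 0.05 ? "reject H0 (coin looks biased)" : "no evidence against H0");
        }
    }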
Monte-Carlo Methods
If the exact probability functions describing a collection of data are not known, then random data can instead be simulated, and the probabilities of specific properties of the real data set are then estimated from a large population of simulated data sets. This is known as Monte-Carlo simulation.

If the real data property is consistent with the model used for simulating data, then the property should appear often in the simulated datasets (in, say, more than 1% of those datasets). If however the real data property occurs less frequently than that, the model is rejected in favour of an alternative hypothesis.
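A minimal Monte-Carlo sketch in plain Java follows; the model (100 standard normal values per simulated dataset) and the observed statistic are assumptions chosen purely for illustration:

    // Sketch: Monte-Carlo estimate of how often simulated data matches an observed property
    // (illustrative model: 100 standard normal values; statistic: the sample maximum).
    import java.util.Random;

    public class MonteCarloExample {
        public static void main(String[] args) {
            Random rng = new Random(42);
            double observedMax = 3.4;       // hypothetical statistic computed from the real data
            int simulations = 100_000, atLeastAsExtreme = 0;

            for (int s = 0; s < simulations; s++) {
                double max = Double.NEGATIVE_INFINITY;
                for (int i = 0; i < 100; i++) max = Math.max(max, rng.nextGaussian());
                if (max >= observedMax) atLeastAsExtreme++;   // simulated data as extreme as observed
            }
            double pEstimate = (double) atLeastAsExtreme / simulations;
            System.out.printf("estimated p-value = %.4f%n", pEstimate);
            System.out.println(pEstimate < 0.01 ? "model rejected" : "model not rejected");
        }
    }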

3. Enrichment Statistics

Significant subgroups
Suppose a collection of data can be divided into subgroups by some categorical property, and we wish to know whether a certain subgroup is overrepresented in the data compared to other groups. This type of question arises frequently in sociological contexts.

You might wish to find out if a certain demographic is buying not just more of a product - but significantly more - enough to design products or marketing campaigns with that group specifically in mind. Similar questions can be easily imagined and framed across e.g. business, medical sciences, and politics.
Subgroups become significant when "greater than expected" numbers of individuals from the group intersect with a condition of interest (or possess a property of interest).

This is effectively a classic "Venn diagram" in which each subgroup in turn is tested for "overlap" with a condition of interest. Probabilities can be estimated using distributions such as the hypergeometric distribution, or others, depending on the additional assumptions underlying the test.

The result is a collection of scores reflecting the strength of enrichment of a property in members of each subgroup category in turn.
If a subgroup is "significant", then the property of interest (such as voting a certain way, or buying a certain product) is said to be "enriched" for the subgroup in question. 
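The sketch below shows, in plain Java and with invented survey counts, how a hypergeometric tail probability can score the enrichment of one subgroup against a condition of interest:

    // Sketch: hypergeometric enrichment score for one subgroup (illustrative numbers only).
    public class EnrichmentExample {
        static double logFactorial(int n) {
            double s = 0;
            for (int i = 2; i <= n; i++) s += Math.log(i);
            return s;
        }
        static double logChoose(int n, int k) {
            return logFactorial(n) - logFactorial(k) - logFactorial(n - k);
        }
        // P(exactly k subgroup members among 'drawn' individuals with the property,
        // when the subgroup has K members in a population of N)
        static double hypergeomPmf(int N, int K, int drawn, int k) {
            return Math.exp(logChoose(K, k) + logChoose(N - K, drawn - k) - logChoose(N, drawn));
        }

        public static void main(String[] args) {
            int N = 1000, K = 100, drawn = 50, observed = 12;  // hypothetical survey counts
            double pValue = 0;                                 // P(overlap >= observed by chance)
            for (int k = observed; k <= Math.min(K, drawn); k++) pValue += hypergeomPmf(N, K, drawn, k);
            System.out.printf("enrichment p-value = %.5f%n", pValue);
        }
    }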

This provides an objective way of testing, for example, whether medical drug trials are working, whether they produce too many of a certain side effect, whether a product roll-out strategy is working, or whether a political campaign message is effective for a certain target demographic of voters; and so on.

4. Network Analysis

Physical networks
Physically connected systems include systems of railways or roads, or an electricity grid, or a collection of connected machines (such as the internet). Performing calculations on the carrying capacity of any of these networks can be essential to planning traffic flow, power storage capacity, and bandwidth requirements, with massive impacts on the real economy day to day, for millions of individuals and businesses.
Social networks
Social and business contact networks reflect the human behavioural equivalent of systems of concrete physical connections. Network mathematics and programming allow us to analyse collections of social network members who might follow the same feeds, share the same interests, click on the same adverts, tend to like one another's posts, etc., with significant implications for economics (especially for "current trends" and other market aspects linked to popularity) and for politics. Personalised marketing content provides much of the revenue for two of the present-day titans of the digital world - Facebook and Google.
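As a small illustrative sketch in plain Java (the member names and links are hypothetical), counting connections per member gives a first, simple measure of who is most central in such a network:

    // Sketch: finding the most-connected members of a small social network (hypothetical names).
    import java.util.*;

    public class SocialNetworkExample {
        public static void main(String[] args) {
            // Each pair represents a "follows / interacts with" link between two members.
            String[][] links = {
                {"alice", "bob"}, {"alice", "carol"}, {"bob", "carol"},
                {"dave", "alice"}, {"erin", "alice"}, {"erin", "bob"}
            };
            Map<String, Integer> degree = new HashMap<>();
            for (String[] link : links) {
                for (String member : link) degree.merge(member, 1, Integer::sum);
            }
            // Rank members by number of connections (a simple measure of influence).
            degree.entrySet().stream()
                  .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                  .forEach(e -> System.out.println(e.getKey() + " has " + e.getValue() + " connections"));
        }
    }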
Supply Chains
It is well known that companies such as Amazon can now achieve near-zero marginal costs through optimisation of logistics and in particular supply chain operations, across borders and continents. The more interconnected, the better our products and services might potentially become.

Not without its political risks, as recent years make abundantly clear, this has nonetheless been the prevailing economic doctrine for decades. But "the globalising imperative" may hit snags in the form of growing public protest against it, and especially against the constraints it in effect imposes on expressing "democratic will" via "sovereign institutions": governments are themselves hemmed in by a web of logistical constraints that is intimately tied to the quality of life we demand - in terms of the choice and quality of goods and services - yet increasingly cannot be adjusted, even if we democratically wished it, without a negative economic impact.

This is a clear area in which unpredictable political shocks may act to change the business paradigms - perhaps accepting lower quality and choice in order to wrest more "control" back at national democratic levels. Only time will tell. 
Language networks
The intensive mining of sentence structures lies at the heart of the "informatic" part of the current A.I. "revolution"; this may for the first time put professionals (legal advisors, doctors, etc.) directly into the firing line as we embark on a new era in which computers "learn for the first time... to communicate like humans". HAL 9000 from the classic film 2001: A Space Odyssey is really not so far off now, though the ability of AI systems to cross multiple "domains of human-like enquiry", rather than being adapted as a range of AI products for specific professional tasks, will for sure be very limited. (We will have compartmentalised HALs.)

5. Time Series

Causality
Events in time are linked through causal relationships, and, unlike those in generalized spaces, also satisfy a strict series of precedences defining which events are 'earlier', 'simultaneous' or 'later than' other events. 

Unlike for a "pure probability distribution", time introduces a physical variable. Distributions of random events are therefore constrained by the laws of physics.

There are numerous ways to model series of events through time, including time-dependent distributions, state change matrix models (Markov chains), and differential equations.
Distributions in time
Time-dependent functions include simple equations such as: Newton's laws of motion (to determine the position of masses at time points in the future, subject to the forces acting upon their current motions); accounting formulae used to calculate the compound interest on a savings account or due on a loan, including mortgages; and the laws of electromagnetics governing the spread of electromagnetic waves (including radio and TV) and, in a more time-critical sense, navigational systems including GPS.

Given these equations, future values within a physical system can be predicted, subject to error bars reflecting measurement and model uncertainty.
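As a minimal sketch in plain Java (the deposit, rate and term are illustrative figures), the compound interest formula FV = P * (1 + r/n)^(n*t) projects a future account value:

    // Sketch: projecting future values with the compound interest formula (illustrative figures).
    public class CompoundInterestExample {
        public static void main(String[] args) {
            double principal = 10_000.0;   // hypothetical initial deposit
            double annualRate = 0.04;      // 4% per year
            int years = 10;
            int compoundsPerYear = 12;     // monthly compounding

            // Future value = P * (1 + r/n)^(n*t)
            double futureValue = principal *
                    Math.pow(1 + annualRate / compoundsPerYear, compoundsPerYear * years);
            System.out.printf("value after %d years: %.2f%n", years, futureValue);
        }
    }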
State change matrices
Suppose there is a system which can exist in one of a number of states S1, S2, S3, ... and any state can turn, in a single time interval, into any other with a given probability (not all the same). The possible transitions can be represented by a grid termed the "transition matrix". By iterating matrix multiplication, a series of possible state transitions long into the future can be generated. Some transition matrices (i.e. some systems) have long-term stable behaviour, while others may oscillate indefinitely or remain effectively chaotic until the end of time. The long-term behaviour of many such systems can only be explored by simulation techniques, from which long-term properties of the system are then estimated.
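The sketch below, in plain Java with an invented 3-state transition matrix, iterates the matrix multiplication step to see whether the state distribution settles down:

    // Sketch: iterating a 3-state transition matrix to see whether the system settles down
    // (the transition probabilities here are purely illustrative).
    public class MarkovChainExample {
        public static void main(String[] args) {
            double[][] transition = {      // row i = current state, column j = P(move to state j)
                {0.8, 0.1, 0.1},
                {0.2, 0.6, 0.2},
                {0.3, 0.3, 0.4}
            };
            double[] state = {1.0, 0.0, 0.0};   // start with certainty in state S1

            for (int step = 1; step <= 50; step++) {
                double[] next = new double[3];
                for (int j = 0; j < 3; j++)
                    for (int i = 0; i < 3; i++)
                        next[j] += state[i] * transition[i][j];
                state = next;
            }
            System.out.printf("long-run distribution: %.3f %.3f %.3f%n", state[0], state[1], state[2]);
        }
    }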
Differential equations
A further class of models capturing system dynamics is reflected in a branch of calculus termed "differential equations". These equations express variables of interest as continuous changes through time, in terms of algebraic relationships that extend to arbitrary combinations of "derivatives" (instantaneous rates of change of one variable with respect to another), "derivatives of derivatives" (rates of change of rates of change), and so on, together with algebraic sums, multiples, and power functions built out of these. All but the simplest such systems must be solved by iterative approximation, since exact solutions to the equation systems tend not to exist.
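As a minimal sketch of such iterative approximation in plain Java, Euler's method steps a simple equation dy/dt = -0.5*y forward in time and compares the result with its known exact solution:

    // Sketch: iterative approximation (Euler's method) for a simple differential equation
    // dy/dt = -0.5 * y, which has the exact solution y = y0 * exp(-0.5 * t) for comparison.
    public class EulerExample {
        public static void main(String[] args) {
            double y = 1.0;          // initial value y(0)
            double t = 0.0, dt = 0.01;
            while (t < 2.0) {
                y += dt * (-0.5 * y);   // step forward using the instantaneous rate of change
                t += dt;
            }
            System.out.printf("Euler estimate y(2) = %.4f, exact = %.4f%n", y, Math.exp(-1.0));
        }
    }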
Time series economics
Time is one of the most critical of all economic variables. Interest rates, and therefore the "time value of money", are linked intrinsically to it. Forecasting, purchasing, wages, and tax regulations are tied to the calendar. A single bank holiday can affect the economy to the tune of billions of pounds. In financial capitals, speed in placing trades is a critical factor, even down to billionths of a second. Time series analyses are therefore among the most common frameworks in commercial environments.
Labour Calendars
All business activities are of course embedded in business and tax year cycles (as well as monthly, weekly and daily cycles), built out of working weeks, days and hours; this results in often quite complex calculations to match processes to the exact intervals of time due to elapse before a deadline, or a dividend, etc. (due to leap years, bank holidays and the like).

These kinds of clock and calendar issues can be handled quite elegantly using, for example, ready-made Java classes, so that statistical models built on "continuous time lines" can be mapped in usable ways onto business activity cycles.

It's a tedious but necessary aspect of coding projects in a business environment involving actual human working cycles and other patterns (and exceptions such as leap years, etc.).
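As a sketch of this kind of calendar handling, the standard java.time classes can count the working days remaining before a deadline (the dates and the single bank holiday below are hypothetical examples):

    // Sketch: counting working days to a deadline with the standard java.time classes
    // (the deadline and holiday below are hypothetical examples).
    import java.time.DayOfWeek;
    import java.time.LocalDate;
    import java.util.Set;

    public class DeadlineExample {
        public static void main(String[] args) {
            LocalDate today = LocalDate.of(2024, 3, 25);
            LocalDate deadline = LocalDate.of(2024, 4, 5);
            Set<LocalDate> bankHolidays = Set.of(LocalDate.of(2024, 3, 29)); // e.g. Good Friday

            int workingDays = 0;
            for (LocalDate d = today; d.isBefore(deadline); d = d.plusDays(1)) {
                boolean weekend = d.getDayOfWeek() == DayOfWeek.SATURDAY
                               || d.getDayOfWeek() == DayOfWeek.SUNDAY;
                if (!weekend && !bankHolidays.contains(d)) workingDays++;
            }
            System.out.println("working days before the deadline: " + workingDays);
        }
    }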
Nearly instant events
At the extremely short timescale end of calculations, the distribution in time of requests or updates arriving at a central database can introduce subtle bugs into business systems - sometimes with serious consequences in terms of data loss.

Complex threading and specialist database architectures are used to mitigate against such issues. 
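As a toy illustration in plain Java (not a database, just two threads updating shared counters), the sketch below shows how near-simultaneous updates can be lost without synchronisation, and how an atomic counter mitigates this:

    // Sketch: near-simultaneous updates can be lost without proper synchronisation;
    // an atomic counter is one simple mitigation (toy example, not a database).
    import java.util.concurrent.atomic.AtomicInteger;

    public class ConcurrentUpdateExample {
        private static int plainCounter = 0;                                  // unsafe under concurrency
        private static final AtomicInteger safeCounter = new AtomicInteger(); // safe

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 100_000; i++) {
                    plainCounter++;                // increments can interleave and be lost
                    safeCounter.incrementAndGet(); // atomic read-modify-write
                }
            };
            Thread t1 = new Thread(work), t2 = new Thread(work);
            t1.start(); t2.start();
            t1.join(); t2.join();
            System.out.println("plain counter: " + plainCounter + " (often < 200000)");
            System.out.println("atomic counter: " + safeCounter.get() + " (always 200000)");
        }
    }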

Complex "live" systems sometimes fail security checks of various kinds because of the nature of electronic event distributions, at remarkably short time scales - leading to loss, for example, of account authentication.

6. Cluster Analysis

Common cause factors
When a group of things behaves in a similar way, there is likely to be an underlying reason - a shared property or cause. 

It is sometimes possible to infer such causes or properties and so learn totally new information by systematically scanning all "groups of things behaving similarly" in a large dataset. 

Care must be taken to apply correct controls, since a certain number of clustered groups will be detected "purely by chance".
Cluster detection
In mathematical terms a "cluster" is a group of correlated variables or data points which are "close enough" to one another in some space to be considered a "similar group". The actual cut-off line between "groups" is often arbitrary, but the strongest signals in the data will often be stable across a very wide range of the parameters to be tested.

After controlling for "chance" patterns, detected clusters of things can be investigated for explanatory cause or pursued as promising leads in an analysis pipeline. 
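As a minimal sketch in plain Java (the 2-D points and the cut-off value are invented), a greedy distance-threshold pass groups nearby points into clusters:

    // Sketch: grouping 2-D points into clusters using a simple distance cut-off
    // (a greedy approximation; the points and the cut-off value are illustrative).
    import java.util.*;

    public class ClusterExample {
        public static void main(String[] args) {
            double[][] points = {{1, 1}, {1.2, 0.9}, {0.8, 1.1}, {5, 5}, {5.1, 4.8}, {9, 1}};
            double cutoff = 1.0;                 // points closer than this join the same cluster
            int[] cluster = new int[points.length];
            Arrays.fill(cluster, -1);
            int nextId = 0;

            for (int i = 0; i < points.length; i++) {
                if (cluster[i] == -1) cluster[i] = nextId++;       // start a new cluster
                for (int j = i + 1; j < points.length; j++) {
                    double dx = points[i][0] - points[j][0];
                    double dy = points[i][1] - points[j][1];
                    if (Math.sqrt(dx * dx + dy * dy) < cutoff) {
                        cluster[j] = cluster[i];                   // close enough: same cluster
                    }
                }
            }
            for (int i = 0; i < points.length; i++)
                System.out.println(Arrays.toString(points[i]) + " -> cluster " + cluster[i]);
        }
    }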

7. Data Visualisation

The power of graphics
Figures enable an analyst to present in a flash trends and relationships which would otherwise be unreadable to humans. They are particularly effective in talks or in situations where number crunching and first principles are not appropriate (such as communicating to a non-specialist audience). 

A much wider variety of relationships can also be communicated succinctly. That being said, we should be wary of graphics which look very fancy but make no clear point that can be translated into a meaningful conclusion.

I am a bit sceptical of figures which create a "visual wow factor" but do not actually offer any real actionable insights to those who see them and think about them. Even cutting-edge science research is definitely not immune from this issue!
Visualization pipeline
To build robustness into business analysis pipelines, it is most important to focus on methods (such as scripting) which allow figures to be regenerated in their entirety at the click of a few buttons, every time the underlying data changes.

Otherwise analyst time becomes a bottleneck.

Therefore the best way to employ analysts would, in theory, be to have them work on the tools for generating figures from raw data, rather than on preparing a few specific figures at a time.
Sophisticated charts
Publications today (whether academic or business focused) increasingly demand highly information rich figure designs - condensing patterns from massive data sets into essential visual relationships. 

Impressive presentations will frequently employ sophisticated figures such as complex, highly coloured multivariate charts (three or more variables), multiple chart types plotted against shared axes, and dynamic techniques such as animated charts.
Heatmaps
As with all statistical tests on a collection of categories (all the possible subgroups), it is always necessary to "correct" for the fact that a certain number are expected to reach a significant result purely by chance. (If you roll 3 dice repeatedly, from time to time, albeit quite rarely, you will get 3 sixes at once.)

Multiple testing corrections approximately correct for the multiple possible significant observations; the more subgroups tested, the more stringent the correction needs to be.

Technically it also depends on whether subgroups are generated independently of one another, but this is often unquantifiable.
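As a hedged sketch in plain Java (the p-values are invented), a simple Bonferroni-style correction tightens the per-test threshold as the number of subgroups tested grows:

    // Sketch: a Bonferroni-style multiple testing correction applied to a set of
    // subgroup p-values (the p-values below are purely illustrative).
    public class MultipleTestingExample {
        public static void main(String[] args) {
            double[] pValues = {0.001, 0.02, 0.04, 0.3, 0.7};  // one p-value per subgroup tested
            double alpha = 0.05;
            int tests = pValues.length;

            // The more subgroups tested, the more stringent the per-test threshold becomes.
            double correctedThreshold = alpha / tests;
            for (int i = 0; i < tests; i++) {
                boolean significant = pValues[i] < correctedThreshold;
                System.out.printf("subgroup %d: p = %.3f -> %s%n",
                        i + 1, pValues[i], significant ? "significant after correction" : "not significant");
            }
        }
    }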

8. Data Mining & Big Data

Data warehouses
This term is increasingly not a metaphor. 

Some of the major big data consortia now routinely gather thousands of new disks of data a day and keep them backed up and in storage facilities. 

Often these have military-grade security and are designed to be resistant to natural disasters and human attacks, in addition to all the usual wear and tear.

The question is how we will build computers large enough and fast enough to discover what all this data might tell us.
Big data crunching
To parse so much data, it is not enough simply to scale up the number of computing-cluster facilities if businesses are to "learn enough" from what is in the public domain, or from the data they themselves have harvested.

Most "blind search" processes will be far too slow to detect meaningful and significant results in the context of huge data sets. Heuristics are required. Which might cut certain corners but by tuning the procedure it can result in a high chance of encountering valuable information within a reasonable run time. The design of the best such procedures is what the field of "big data" is most concerned with. 
Neural networks
This is a type of "machine learning" in which the model state is updated from one step to the next through a structure of connections like those between neuronal cells in the brain (which have many input and output paths and can connect to countless other neurons - around 10,000 each on average in the human brain).

As training cycles continue, the "learned state" of the neural net can begin to approximate or even exceed human natural language and strategic thinking abilities across a range of problem-solving domains. This field is one of the most active and promising in the entire field of data analysis right now.
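As a toy sketch in plain Java, a single artificial neuron (perceptron) trained on the AND function shows the basic idea of adjusting connection weights over repeated training cycles; real neural networks involve vastly more neurons and layers:

    // Sketch: a single artificial neuron (perceptron) trained on the AND function,
    // to illustrate how connection weights are adjusted over repeated training cycles.
    public class PerceptronExample {
        public static void main(String[] args) {
            double[][] inputs = {{0, 0}, {0, 1}, {1, 0}, {1, 1}};
            int[] targets = {0, 0, 0, 1};                 // AND of the two inputs
            double w1 = 0, w2 = 0, bias = 0, rate = 0.1;  // weights start untrained

            for (int epoch = 0; epoch < 20; epoch++) {    // repeated training cycles
                for (int i = 0; i < inputs.length; i++) {
                    int output = (w1 * inputs[i][0] + w2 * inputs[i][1] + bias) > 0 ? 1 : 0;
                    int error = targets[i] - output;      // compare prediction with the true label
                    w1 += rate * error * inputs[i][0];    // nudge each weight to reduce the error
                    w2 += rate * error * inputs[i][1];
                    bias += rate * error;
                }
            }
            for (double[] in : inputs) {
                int out = (w1 * in[0] + w2 * in[1] + bias) > 0 ? 1 : 0;
                System.out.printf("%d AND %d -> %d%n", (int) in[0], (int) in[1], out);
            }
        }
    }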