We all know the outcome. Donald Trump will be the next US president, despite almost all polls predicting he had practically no chance of attaining that post. Does this mean that statistics don’t work? Hardly.
Almost all the polls indicated that Hillary Clinton had between a 70% and 98% chance of winning the election. Only one, the Los Angeles Times Daybreak poll, gave Donald Trump the better odds. In the so-called swing states, these same polls were especially far off.
The polls were based on sampling a small percentage of the population, and then using statistical probability analysis to predict how the whole population would vote. This is essentially the same method used in TAR (Technology Assisted Review), the fancy acronym that encompasses predictive coding and other cost-saving e-discovery techniques. So, what do the election results say about the validity of using TAR?
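The statistical machinery behind this is worth seeing concretely. For a simple random sample, the uncertainty in an estimated proportion shrinks with the square root of the sample size, which is why a poll of only a thousand people can say anything useful about millions of voters. Here is a minimal sketch of the standard margin-of-error formula; the sample sizes are illustrative, not taken from any particular poll:

```python
import math

def margin_of_error(sample_size: int, proportion: float = 0.5, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an estimated proportion,
    assuming a simple random sample. proportion=0.5 is the worst case."""
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# A simple random sample of 1,000 gives roughly a +/-3 point margin of error.
print(round(margin_of_error(1000) * 100, 1))  # → 3.1
```

Note what the formula does not account for: it assumes the sample is truly random. If the sample is skewed, the reported margin of error understates the real uncertainty, which is exactly the failure mode discussed below.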
Based on what we have gathered so far, it appears the polls failed because the pollsters did not use representative samples. Sampling theory assumes that the sample is an even cross-section of the whole population. When dealing with voting patterns, this means the sample needs to include representatives from all the different social groups: male, female, young, old, ethnic minorities, gay, straight, and so on. More importantly, the pollsters needed to consider whether the people polled would actually vote, since those who would not vote should not have been included.
The unrepresentative nature of the samples is where the pollsters failed, and for several reasons. One was the way the samples were taken. Many of the polls were conducted online, and older people generally use computers less, or at least are less likely to answer polls online. The polls were also conducted primarily in populated areas. The results show that turnout in urban areas was lower than turnout in rural areas, which means that, for the samples to be representative, more rural people and fewer urban people should have been included.
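One standard remedy for this kind of skew is stratified sampling: instead of taking whoever happens to respond, you draw from each subgroup in proportion to its known share of the population (or, here, of likely voters). The sketch below uses made-up urban/rural shares purely for illustration:

```python
import random

random.seed(42)

# Hypothetical population: 60% urban, 40% rural respondents available.
population = ["urban"] * 600 + ["rural"] * 400

def stratified_sample(pop, strata_fractions, n):
    """Draw a sample whose strata match target fractions,
    rather than the mix of whoever happens to respond."""
    sample = []
    for stratum, frac in strata_fractions.items():
        members = [p for p in pop if p == stratum]
        sample.extend(random.sample(members, round(n * frac)))
    return sample

# Target the (hypothetical) turnout shares, not the respondent pool.
s = stratified_sample(population, {"urban": 0.55, "rural": 0.45}, 100)
print(s.count("urban"), s.count("rural"))  # → 55 45
```

The hard part in practice is not the code but knowing the right target fractions; if the pollsters' turnout assumptions are wrong, the stratification inherits that error.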
When sampling discovery data, the same representative-sample requirement holds true. Statistics tell us that, if you have a large enough collection and the different records are somewhat evenly distributed throughout, a random selection of a few thousand will likely give you a representative sample. Unlike the election, in e-discovery we have a way to make sure our sample is representative: validation. After we run our sample and separate our records into relevant and not relevant, we can go back and sample the not-relevant set to see if we missed anything. If our initial sample was not representative, this second validation sample will very likely turn up relevant records.
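This validation step is often called an elusion test: draw a random sample from the pile tagged not relevant and count how many relevant documents "eluded" the review. Here is a minimal simulated sketch; the corpus, relevance rate, and the TAR pass itself are all invented for illustration:

```python
import random

random.seed(0)

# Hypothetical corpus of 10,000 docs: (id, truly_relevant), ~2% relevant.
corpus = [(i, i % 50 == 0) for i in range(10_000)]

# Simulated TAR pass: catches most relevant docs, misses a handful.
predicted_relevant = {i for i, rel in corpus if rel and i % 500 != 0}
not_relevant_pile = [(i, rel) for i, rel in corpus if i not in predicted_relevant]

# Elusion test: sample the "not relevant" pile and count relevant docs found.
sample = random.sample(not_relevant_pile, 1_000)
missed = sum(1 for _, rel in sample if rel)
elusion_rate = missed / len(sample)
print(f"relevant docs in validation sample: {missed} (elusion ~ {elusion_rate:.2%})")
```

If the elusion rate comes back higher than the case team can defend, the usual response is to retrain or re-review and validate again, rather than certify the production.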
The theory behind statistics and probability has been proven valid. When used correctly, these methods return defensible results. Even though the election results may have surprised you, there is no reason to worry about the value of TAR.