Original article was published on Artificial Intelligence on Medium
Some Thoughts on Predictive Policing
Overfitting, classification, OOB and typical critiques
In this article I do not intend to argue on behalf of predictive policing, but simply to explore a few aspects: overfitting, classification, OOB and typical critiques.
What is overfitting and what does it mean in practice?
Overfitting arises largely from the analyst's inclination to keep adapting the model until it reaches the best possible prediction on the data at hand. This seems advantageous from the outside, yet the implications are easy to overlook. If a model fits a set of datapoints too neatly, it has learned the noise in those datapoints along with the underlying pattern, and it will predict poorly when new data arrives. In that case you have not built a model that is applicable to any large extent. How much this matters depends on the purpose, and the whole discourse of overfitting is challenging in itself. Accounting for noise is therefore an incredibly important part of building a more realistic model: overfitting can be dangerous and lead to unfortunate decisions.
One good solution, when possible, is to split the dataset into training data, evaluation data and test data. In the simplest case, with a sample of one million we may split it into training (750,000) and test (250,000); a 75% / 25% split is one common rule of thumb. The held-out test data is in some cases more realistic than the training data, because in practice one rarely has as much information about new cases, nor the opportunity to tune the model on them to such an extent. Depending on context, a great many iterations may be needed to learn more about the data one has received and to consider how predictions can be made responsibly. Working solely with the test data, and reporting performance solely on it, is however not advisable in practice, although again this depends on contextual questions of use.
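The 75% / 25% split can be sketched in a few lines. This is only an illustration using Python's standard library; the function name `train_test_split`, the seed and the placeholder observations are assumptions made for the example, not part of any particular policing tool.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder observations: one million row identifiers.
observations = list(range(1_000_000))
train, test = train_test_split(observations)
print(len(train), len(test))  # 750000 250000
```

Shuffling before splitting matters when the rows arrive in some order (for example by date or district), since otherwise the test data would not resemble the training data.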
As such, overfitting or underfitting is a constant and important discussion, not only once in the life of a dataset but at different intervals throughout the process. Accuracy for different groups may change depending on the data available, and it is vitally important to be aware of this. As an example, machine learning based on images or video has long been less able to identify dark-skinned women than white men. One could easily enter a discussion of fairness with regard to how well a model fits different parts of a dataset: overfitting, underfitting and the unequal distribution of how well models predict across groups have overarching consequences that the analyst must constantly assess.
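One way to make unequal accuracy visible is to report it per group rather than only overall. The sketch below uses hypothetical toy records and invented group labels "A" and "B"; it illustrates the bookkeeping, not real results.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in records:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {g: correct[g] / total[g] for g in total}

# Hypothetical toy data, invented for illustration only.
records = [
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("B", 1, 0), ("B", 0, 0), ("B", 1, 0), ("B", 0, 1),
]
overall = sum(t == p for _, t, p in records) / len(records)
per_group = accuracy_by_group(records)
print(overall)    # 0.625
print(per_group)  # {'A': 1.0, 'B': 0.25}
```

The overall figure hides the fact that, in this made-up case, the model is right every time for one group and wrong three times out of four for the other.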
What is a classification tree?
On a very basic level, the classification tree is a visualisation device, meant to show a sequence of splits within the dataset. This is often too poor for a real analysis, yet it may help the analyst to better understand the data provided as input to the overall project. Reading too much into the results of decision trees or classification trees may adversely affect the overall predictive result or the fit of the model; used with a critical eye, however, a tree can help to identify certain trends in the data. A classification tree has its root at the top and its leaves at the bottom: the decisions based on values in the dataset are shown from the root downwards, and the number of cases and the predicted class are given at the leaves.
One important aspect, as described above, is recursive partitioning: it splits the population into sub-populations, and the decision made at each split is visualised. A typical example used for learning this concept is modelling survivors of the Titanic. The procedure is recursive because it is defined in terms of itself: each sub-population is split again in the same way. So, given the survivors and the values of certain variables in the dataset, what can we predict will happen? In this manner the analyst may gain a visual overview of the splits occurring. There is, however, the persistent danger that upon seeing these trends the analyst may be tempted, or unconsciously biased, to focus more on some variables than others because of the pattern displayed. So this intuitive model comes at a possible cost.
An example could be that public safety is treated as a factor for the overall population of a dataset. This may not be generalisable to all criminal cases. Domestic violence, for example, is vitally important to predict or understand, yet the factor of public safety is not taken into consideration there to the same extent. Still, is the home not part of the public, or does crime have to happen in public to be considered in this given model? The way we categorise has large consequences for the set of decisions being shaped or illustrated in any given case. As such it is very important to ensure data quality, yet even with perfect data quality (if such a thing existed) there would still be the issue of the analyst's perception of the predictors behind any given set of decisions. In the worst case this bias could be strengthened by observing the results in a classification tree; at best the tree could provide valuable insights into some trends within the given data.
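Recursive partitioning can be sketched in a small amount of code. The example below is a simplified illustration, assuming yes/no features and Gini impurity as the splitting criterion; the eight passengers are invented for the example and are not the real Titanic data, and the function names are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure group)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, features):
    """Return the yes/no feature that lowers weighted impurity the most, or None."""
    best, best_score = None, gini(labels)
    for f in features:
        yes = [y for r, y in zip(rows, labels) if r[f]]
        no = [y for r, y in zip(rows, labels) if not r[f]]
        if not yes or not no:
            continue  # this feature does not separate the group at all
        score = (len(yes) * gini(yes) + len(no) * gini(no)) / len(labels)
        if score < best_score:
            best, best_score = f, score
    return best

def grow(rows, labels, features, depth=0, max_depth=2):
    """Split the population, then apply the same procedure to each sub-population."""
    f = best_split(rows, labels, features) if depth < max_depth else None
    if f is None:  # leaf: report the majority class and the number of cases
        return {"predict": Counter(labels).most_common(1)[0][0], "n": len(labels)}
    yes = [(r, y) for r, y in zip(rows, labels) if r[f]]
    no = [(r, y) for r, y in zip(rows, labels) if not r[f]]
    rest = [g for g in features if g != f]
    return {
        "split": f,
        "yes": grow([r for r, _ in yes], [y for _, y in yes], rest, depth + 1, max_depth),
        "no": grow([r for r, _ in no], [y for _, y in no], rest, depth + 1, max_depth),
    }

# Invented passengers (not real data); label 1 = survived, 0 = did not.
rows = [
    {"female": True, "child": False}, {"female": True, "child": False},
    {"female": True, "child": False}, {"female": True, "child": True},
    {"female": False, "child": True}, {"female": False, "child": False},
    {"female": False, "child": False}, {"female": False, "child": False},
]
labels = [1, 1, 1, 1, 1, 0, 0, 0]
tree = grow(rows, labels, ["female", "child"])
print(tree)
```

On this toy data the root splits on `female`, and only the male sub-population is split again on `child`. The nested dictionary mirrors what a drawn tree shows: decisions from the root downwards, with predicted class and case counts at the leaves.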
What is OOB (out of bag)?
In the context of machine learning models, and random forests in particular, each tree is grown on a random sample of the training dataset, and the observations not selected for that tree are saved as out-of-bag (OOB) data. Say 10e7 is the total number of observations: if a given tree's sample covers 70% of them, then the remaining 30% are the OOB data for that tree. Since the OOB observations are dropped down the tree, and a different set is left out in each of the many iterations that can be set for a random forest model, this helps to cancel out the noise.
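A minimal sketch of how out-of-bag observations arise, assuming the common bootstrap variant in which each tree draws as many rows as the dataset contains, with replacement. Under that assumption roughly 37% of rows are left out of any one tree (the limit of (1 - 1/n)^n is about 0.368), rather than exactly the 30% of the example above; the dataset size and seed here are arbitrary choices for illustration.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n row indices with replacement: one bootstrap sample."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(42)   # arbitrary seed so the run is repeatable
n = 10_000                # assumed dataset size for the illustration
in_bag = bootstrap_indices(n, rng)

# Rows that were never drawn are out-of-bag for this one tree.
oob = sorted(set(range(n)) - set(in_bag))
oob_fraction = len(oob) / n
print(oob_fraction)  # close to 0.368
```

Because every tree draws a fresh sample, each row ends up out-of-bag for a different subset of trees.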
A confusion table from a random forest is often constructed with OOB data to help question the accuracy of the model. How is the model doing on observations that were not used to grow a given tree? One could also draw up a histogram of the OOB predictions to examine the skew of the model. This is important in practice because one could easily believe that the predictions being made are great without any form of comparison, and it is an important part of building a better understanding of the predictions, given the method involved in creating the model in the first place.
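As a rough illustration of why a confusion table says more than a single accuracy figure, the sketch below tabulates hypothetical actual and predicted labels (1 for the outcome of interest, 0 otherwise). The labels are invented; in practice the predictions would come from the out-of-bag observations.

```python
from collections import Counter

def confusion_table(actual, predicted):
    """Count (actual, predicted) pairs for a two-class problem."""
    counts = Counter(zip(actual, predicted))
    return {
        "true_positive": counts[(1, 1)],
        "false_negative": counts[(1, 0)],
        "false_positive": counts[(0, 1)],
        "true_negative": counts[(0, 0)],
    }

# Hypothetical out-of-bag results, invented for illustration.
actual    = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
predicted = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

table = confusion_table(actual, predicted)
accuracy = (table["true_positive"] + table["true_negative"]) / len(actual)
print(table)
print(accuracy)  # 0.7
```

Here the accuracy is 0.7, yet two of the three positive cases are missed, and predicting 0 for everyone would also reach 0.7. That comparison is exactly what the table makes visible.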
In a random forest, as an example, there is a process of averaging over a number of random samples, and this way of proceeding is called bootstrap aggregation, or bagging for short. The data is sub-sampled and each sub-sample is used to train a model. Bootstrap aggregation is a staple of sorts in machine learning, used to improve accuracy as well as to reduce overfitting. OOB data, as part of bootstrap aggregation in these models, is therefore important to consider. With randomly generated training sets, OOB data is part of a method for making sure our predictions are more realistic: it helps us to better understand what is predicted and whether it is fair to say we made a decent prediction or not.
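Bagging with an out-of-bag error estimate can be sketched as follows. This is a deliberately simplified stand-in for a random forest: the base learner is a one-threshold "stump" on a single variable rather than a full tree, and the data is synthetic (the true rule is x > 0.5 with 10% of labels flipped). All names, sizes and seeds are assumptions for the illustration.

```python
import random
from collections import Counter

def fit_stump(xs, ys):
    """Pick the threshold t with the fewest errors for the rule 'predict 1 if x > t'."""
    best_t, best_err = None, len(ys) + 1
    for t in sorted(set(xs)):
        err = sum((x > t) != bool(y) for x, y in zip(xs, ys))
        if err < best_err:
            best_t, best_err = t, err
    return best_t

def bagged_oob_error(xs, ys, n_trees=100, seed=0):
    """Fit one stump per bootstrap sample; score each row only with stumps that never saw it."""
    rng = random.Random(seed)
    n = len(xs)
    votes = [Counter() for _ in range(n)]
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]      # sample rows with replacement
        t = fit_stump([xs[i] for i in bag], [ys[i] for i in bag])
        for i in set(range(n)) - set(bag):              # out-of-bag rows for this stump
            votes[i][int(xs[i] > t)] += 1
    scored = [i for i in range(n) if votes[i]]          # rows that were OOB at least once
    wrong = sum(votes[i].most_common(1)[0][0] != ys[i] for i in scored)
    return wrong / len(scored)

# Synthetic data: true rule x > 0.5, with roughly 10% of labels flipped as noise.
data_rng = random.Random(1)
xs = [data_rng.random() for _ in range(150)]
ys = [int(x > 0.5) ^ int(data_rng.random() < 0.1) for x in xs]

oob_error = bagged_oob_error(xs, ys)
print(round(oob_error, 3))
```

Since about a tenth of the labels are noise by construction, an OOB error far below that level would itself be a warning sign that something is leaking between training and evaluation.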
What is the most usual critique of predictive policing?
One article enthusiastic about the potential is Randomized Controlled Field Trials of Predictive Policing, published in the Journal of the American Statistical Association in 2015. Jeffrey Brantingham, one of the co-founders of a predictive policing tool (PredPol), is a co-author of this article, so there is a clear vested interest in proving the tool effective in this case. Although the article can be questioned on these grounds, the authors did undertake randomized controlled trials of what they call: “…near real-time epidemic-type aftershock sequence (ETAS) crime forecasting.” They did this in divisions of Kent Police and the Los Angeles Police Department. What they found was that the ETAS models of short-term crime risk outperformed the existing best practice of maps drawn up by crime analysts. They argued that, since the police have limited resources, patrols could be directed dynamically and crime could be reduced with these measures.
Andrew Guthrie Ferguson (2017) tells us a different story: technologies viewed as neutral can have more overarching consequences, and collecting as well as selling this data can be a big business within private security. Brayne (2017) describes how people who have not traditionally been included in these analyses are now included, since geolocations are part of the overall picture of how data can be monitored or used in police interventions. These two trends in combination can be cause for a moment of reflection: although we want to prevent crime, we have to consider to a larger extent the actors involved.
Braga and Barao discuss different policing strategies related to place, and how the police work with crime reduction, in Targeted Policing for Crime Reduction (2019). The police already know that crime is concentrated in a set of ‘risky places’ and among ‘risky people’. Given this knowledge, one could make the argument that policing can be made more efficient with machine learning. If police can target their resources, they may save money. There is a series of strategies with different consequences on the ground, among them hotspot policing, focused deterrence and problem-oriented policing. It is in hotspot policing in particular that prediction has been applied, yet this requires broad collaboration between the police and local actors in the community. If a gang regularly meets at a restaurant toilet that has no passcode, this is spotted through prediction, and the restaurant decides to put a passcode on the toilet, this may simply move crime elsewhere. The focus on gangs or individuals is questionable in terms of its effectiveness for overall crime reduction, although it has been shown to be effective. Is the problem solved or simply moved?
This is an important question to answer, and currently an understudied area that urgently needs to be addressed by researchers within social science. It is also important to understand the technical side of the tools being used to a greater extent, so as to work with them rather than against them, and so that safety can be pursued in a way that is consistent with the goals of transparency, fairness and accuracy. Social science researchers working with a computational slant must communicate the tradeoffs happening within these areas, as well as map the efforts by police when they attempt to understand, through prediction, where crime could possibly happen.
Would it be possible for the police to use existing data to predict individuals' future criminality?
One of the most standard critiques of predictive policing is that it is biased. Berk responds to this, in his book Machine Learning Risk Assessments in Criminal Justice Settings, that it may be important to consider the bias already present in the given systems. Indeed, if we consider an algorithm a black box, some would say a human could be equally hard to question, and the risk assessments already used extensively around the world could be improved. However, if we narrow down on the specific question of ‘where’, the questions of ‘who’ and ‘when’ immediately follow.
Predictions in the criminal justice system have, according to Berk, been present since the 1920s, and are not novel in this regard. If we read Krohn’s Handbook on Crime and Deviance (2019), we see that there has been a large development in crime research, and that methodological concerns have persisted for rather a long time. A cutting-edge method, although it may bring incremental improvements in practice (despite large data), should in the same regard be thought of in a critical manner. According to Jenness (2004) there have been three streams of inquiry: (1) the relationship between demographic changes and the emergence of criminal law; (2) the relationship between the state and structure, and the mediation between them; (3) more recent work on criminalisation as a social process that is global in its scope. These new predictive policing technologies certainly follow the last stream in many regards, although they touch on each in turn, as they are spreading rapidly around the planet.
Berk's book Machine Learning Risk Assessments in Criminal Justice Settings (2018) has a whole chapter dedicated to transparency, fairness and accuracy. Part of the problem is that there is often a tradeoff between accuracy and fairness: what can we collect in a fair way, and what should we use? We may get better predictions with more data; however, as Starr (2016) describes, it rapidly becomes an issue of William or Robert. If Robert is from a wealthy background and William is from a less wealthy background, then William is in many cases considered more of a risk to public safety than Robert. As such there is an overall bias in the system that is hard to be rid of, and it relates to a large extent to bigger overarching factors. Esposti (2014) describes an emerging trend of ‘dataveillance’, the attempt to monitor different groups in order to regulate and govern their actions, and notes its interdependence with business objectives. As such, simply using these tools from private companies may not be so simple in practice, if one considers the already persistent problems in the United States, for example, with private companies operating prisons at the highest capacity possible to enable gains per inmate. We could be critical of similar approaches in which private companies aggregate and sell data gathered in their ventures together with police stations or states.
Knowing this can reveal existing underlying issues that we know are important, and strengthen the argument for persisting in concerted efforts to ensure these groups do not experience suffering, or to alleviate it. The police are likely familiar with these types of factors already; however, it is important to attempt to use data for analysis in order to see different trends in the dataset. That some outcomes are more likely does not mean that less likely ones are not equally important to keep an eye on. The consequences of this analysis are not adverse unless it is used with little understanding and little regard for critical thinking. Using data to assist in helping a population, or to spot issues that can be addressed by other agencies, together with communication between those agencies, may be important for a comprehensive effort to predict and prevent crime in an ethical manner. The police are predicting crime currently, and most humans make predictions, so this can be an added effort to an existing practice if carefully applied.