How we built an app to track police use of force and what I learned from it

Over the past month I worked closely with a group of data science and web development students on an application devised by Human Rights First. Human Rights First is a nonprofit, nonpartisan organization with a mission to uphold and advocate for human rights. This means defending the powerless, i.e., refugees, victims of human trafficking or torture, and others, and demanding accountability from offenders. My team specifically was tasked with building an application called Blue Witness.

The purpose of Blue Witness is to track the use of police force against civilians. The idea for this project came up last year in response to the increased exposure of police brutality in the wake of the Black Lives Matter movement’s protests and the surrounding political turmoil. The goal of the project is to allow journalists or the average layperson to see where incidents of police using force occur and what those incidents entail. The application needs to be able to scrape data from the web, decipher which data is relevant, and then create meaningful and comprehensible interpretations or visualizations. The project itself was larger and more daunting than other projects I have worked on in the past, which made me feel both anxious and excited, but I’m so glad to have worked on it.

The first week on the project was a week of madness: entire days filled with meetings. Meetings with the client and team project leads, stand-up meetings, meetings for planning our strategy, and further meetings with other teams that had contributed to the project. After everything was settled, we decided as a team on concrete goals we wanted to complete for the month. As a data science student, I will explain what we planned to complete and what we actually did complete.

Our goals were outlined in our product roadmap: the data science team needed to improve incident identification, implement a feature on our Twitter bot to prompt users for more information about incidents, and provide a confidence estimate for each incident so that administrators could approve them.

The feature we spent the most time on was incident identification. Initially we tested an implementation of a BERT (Bidirectional Encoder Representations from Transformers) natural language processing model stacked with a k-nearest neighbors model. We discovered that the two models were not very compatible with each other, and on top of that, it would be much more expensive to deploy two stacked machine learning models to AWS rather than just one. We settled on relying on the BERT model alone to do a linear classification of whether a reported incident is true or false, plus a simple text matcher to determine the severity of each incident of police force use.
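To give a rough idea of what this two-part setup looks like, here is a minimal sketch using the Hugging Face transformers library. The model name, severity keywords, and rank names are illustrative assumptions rather than our actual configuration, and the classification head would of course be fine-tuned on our labeled data before being used this way.

```python
# Sketch: BERT handles the true/false classification, and a simple keyword
# matcher assigns a severity rank. Keyword lists and rank names are
# hypothetical placeholders, not the project's actual configuration.
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()  # assumes the head has already been fine-tuned on labeled reports

# Hypothetical severity keywords, ordered from most to least severe.
SEVERITY_KEYWORDS = [
    ("lethal", ["shot", "shooting", "killed"]),
    ("severe", ["tear gas", "rubber bullet", "baton"]),
    ("moderate", ["pepper spray", "tackled", "shoved"]),
]

def classify_report(text: str) -> dict:
    # BERT decides whether the text is a genuine police force report.
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.softmax(logits, dim=-1).squeeze()
    is_incident = bool(probs[1] > probs[0])

    # The text matcher assigns severity from the first keyword group that matches.
    severity = "unranked"
    lowered = text.lower()
    for rank, keywords in SEVERITY_KEYWORDS:
        if any(kw in lowered for kw in keywords):
            severity = rank
            break

    return {"is_incident": is_incident, "confidence": float(probs.max()), "severity": severity}
```

Keeping the severity logic as plain keyword matching is what let us deploy a single model instance instead of two stacked ones.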

Training the BERT model was the most time-consuming part of this whole journey; not the literal training of the model, but everything that goes into training it. Initially, we did not have good training data. Our Twitter bot had scraped a few thousand tweets, of which about 1,300 were manually verified. Training on this data made our model really good at outputting “True” when true police force reports were input, but really bad at distinguishing between tweets that merely discuss police force and actual reports of police force use. What we needed was a more substantial dataset that had both verified incidents and entries that were similar to the verified incidents but were not true incidents. Essentially we had to do some data ETL (extract, transform, load) to gather and load a larger dataset of true positives and false positives.

Our initial approach of manually marking incidents as true or false was incredibly time consuming, and we did not have much time. Another approach we tried was using scripts to generate fake entries, but that method was not very effective at producing entries that would confuse our model. So rather than continuing down that path, we decided to compile multiple datasets: datasets we knew for sure were all incidents of police force use, such as a dataset on fatal police encounters, and datasets that would likely trigger a true output from our model but were not reports of police force use. For the former, we found multiple sources that documented police encounters; for the latter, we scraped sources such as Twitter, Reddit, and YouTube for discussions of topics like police brutality, racism, and protests, and filtered those to only include entries where police or cops are mentioned. From there, we took only the descriptions from each entry and put them in a dataset with one other column marked either 1 for “True” or 0 for “False.”

How we removed symbols, numbers, and URLs from our data entries

After cleaning the data with some regex functions and removing nulls, we trimmed our dataset so that the number of true positives matched the number of false positives in a 50:50 split. We then had a CSV file of 15,000 entries that could be loaded into a dataframe, lemmatized, and tokenized before being fed into our BERT model.
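Here is a rough sketch of what that cleaning and balancing step can look like in pandas, assuming the combined data lives in a dataframe with "description" and "label" columns; the column names and regex patterns are illustrative, not necessarily the exact ones we used.

```python
# Sketch of the cleaning and 50:50 balancing step, assuming a DataFrame with
# "description" (text) and "label" (1 = true report, 0 = not a report) columns.
import re
import pandas as pd

def clean_text(text: str) -> str:
    # Strip URLs, then anything that is not a letter or whitespace,
    # and collapse repeated whitespace.
    text = re.sub(r"http\S+|www\.\S+", " ", text)
    text = re.sub(r"[^a-zA-Z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()

def build_training_csv(df: pd.DataFrame, out_path: str = "training_data.csv") -> pd.DataFrame:
    df = df.dropna(subset=["description", "label"]).copy()
    df["description"] = df["description"].astype(str).apply(clean_text)
    df = df[df["description"].str.len() > 0]

    # Downsample the larger class so positives and negatives are 50:50.
    positives = df[df["label"] == 1]
    negatives = df[df["label"] == 0]
    n = min(len(positives), len(negatives))
    balanced = pd.concat([positives.sample(n, random_state=42),
                          negatives.sample(n, random_state=42)])
    balanced = balanced.sample(frac=1, random_state=42)  # shuffle the rows

    balanced.to_csv(out_path, index=False)
    return balanced
```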

Testing out our deployed BERT model

Given the amount of effort we spent building our model, creating a usable dataset to train it, and deploying our app on one AWS instance, our bot and scraper on another, and our database on an Amazon RDS instance, we prioritized most of our time on these features and unfortunately did not implement the feature where the Twitter bot replies to users to obtain more information. We did make sure that our BERT model outputs a confidence estimate for each ranking, but admittedly our model could still improve with even larger training datasets. Unfortunately, I do not believe that Blue Witness is a finished product yet; there is simply more that can be improved. Regardless, the data science team made significant improvements, and the UI/UX team did an excellent job implementing some of our features in the web application. On the main page there are incidents pinned on a map where users can click to view details about each incident, and there is also a reports page where the reports and their associated tags are listed.
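For a sense of how that confidence estimate reaches administrators, here is a simplified sketch of a prediction endpoint, assuming a FastAPI-style service; the route, payload shape, and the hypothetical model module are illustrative, not our exact API.

```python
# Simplified sketch of a prediction endpoint surfacing the confidence score.
from fastapi import FastAPI
from pydantic import BaseModel

# classify_report is the BERT + keyword-matcher function sketched earlier,
# imported here from a hypothetical "model" module for illustration.
from model import classify_report

app = FastAPI()

class Report(BaseModel):
    text: str

@app.post("/predict")
def predict(report: Report):
    result = classify_report(report.text)
    return {
        "is_incident": result["is_incident"],
        # Softmax probability of the predicted class, surfaced so that
        # low-confidence incidents can be reviewed before admin approval.
        "confidence": result["confidence"],
        "severity": result["severity"],
    }
```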

Reflecting on the process, I learned a lot. For the first time, I experienced what it is like to work in a corporate-like environment and develop a product for a client. I’ve seen parallels between certain industries and sectors of data science and the work I spent all month doing, namely data engineering. I’ve gained invaluable experience explaining my technical decisions, coming up with user stories, and documenting my code. I received feedback telling me to be direct with my answers but to elaborate on the specifics of the how and what of my technical decisions, and ultimately I learned what it means to be a technical communicator.

Regarding the future of this project, on the data science end I would like the Twitter bot reply functionality to be implemented, more training data to be added to improve the accuracy of our BERT model, and potentially our BERT model converted to a multi-class classifier rather than a linear one paired with a text matcher. These are certainly doable tasks, but the challenges I see include, but are not limited to: the Twitter bot must obey rate limits when replying to users; it may be difficult to come across reliable training data, namely the false positives; and the complexity of the BERT model will increase if it is converted to a multi-class classifier.
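For illustration, converting to a multi-class classifier could look roughly like the following with the Hugging Face transformers library; the class names here are placeholders, not an actual severity scale.

```python
# Rough sketch of the proposed multi-class variant: instead of a binary
# true/false head plus a separate text matcher, BERT would predict one of
# several severity classes directly. Class labels are placeholders.
from transformers import BertForSequenceClassification

SEVERITY_CLASSES = ["no incident", "non-violent", "blunt force", "chemical", "lethal"]

multiclass_model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=len(SEVERITY_CLASSES),
    id2label={i: label for i, label in enumerate(SEVERITY_CLASSES)},
    label2id={label: i for i, label in enumerate(SEVERITY_CLASSES)},
)
```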

Regarding my own future and career goals, I believe this project experience has prepared me for the workplace. I got to dip my feet in the water to see what’s coming, and I got another project to add to my portfolio. Furthermore, as I really dive into my job search, I will be certain to explore more sectors of data science; I am particularly interested in data engineering, as my experience on this project has shown me just how challenging, yet rewarding, the process is.