Application and Python script to identify, remove, and/or recode personally identifiable information (PII) from field experiment datasets.
This application identifies likely PII (personally identifiable information) in a dataset. To use, download the .exe installer from the latest release and follow the in-app directions.
This tool is currently listed as an alpha release because it is still being tested on IPA field datasets that contain PII.
A series of rules is applied to each column of a dataset to decide whether that column contains PII. The rules are:
find_piis_based_on_column_name() in PII_data_processory.py
find_piis_based_on_column_format() in PII_data_processory.py
find_piis_based_on_sparse_entries() in PII_data_processory.py
find_piis_based_on_locations_population() in PII_data_processory.py
Importantly, this list of conditions is arbitrarily defined and can certainly be improved. Feedback is very welcome!
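As an illustration of how column-level rules like these can work, here is a minimal sketch of two of them. It is not the code in the repo: the function names, the keyword list, and the uniqueness threshold are all hypothetical, and "sparse entries" is read here as "values that almost never repeat", which is only one possible interpretation.

```python
# Hypothetical keyword list; the tool's real list is not reproduced here.
SUSPECT_KEYWORDS = ("name", "phone", "address", "email", "gps", "birth")


def flag_by_column_name(columns, keywords=SUSPECT_KEYWORDS):
    """Return the columns whose name contains any suspect keyword."""
    return [c for c in columns if any(k in c.lower() for k in keywords)]


def flag_by_sparse_entries(rows, min_unique_share=0.9):
    """Return columns whose non-empty values are almost all distinct.

    Assumption: a column where nearly every value is unique (names, IDs,
    phone numbers) is more likely to identify a person.
    """
    flagged = []
    columns = list(rows[0].keys()) if rows else []
    for col in columns:
        values = [r[col] for r in rows if r.get(col) not in ("", None)]
        if values and len(set(values)) / len(values) >= min_unique_share:
            flagged.append(col)
    return flagged


rows = [
    {"respondent_name": "Ana", "village": "A", "age": "30"},
    {"respondent_name": "Ben", "village": "A", "age": "30"},
    {"respondent_name": "Cal", "village": "B", "age": "41"},
]
by_name = flag_by_column_name(rows[0].keys())
by_sparsity = flag_by_sparse_entries(rows)
```

In this toy dataset both rules flag only respondent_name. Rules of this kind produce false positives and false negatives, so the flagged columns are best treated as suggestions for the user to review.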
Once the PII columns are identified, users choose what to do with each one: drop the column, encode it, or keep it. A new de-identified dataset is created according to those choices. The system also outputs a .txt log file and a .csv file that maps the original values to their encoded values.
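The drop/encode/keep step can be sketched as follows. This is a simplified illustration, not the repo's implementation: the function names, the sequential integer codes, and the layout of the mapping .csv are assumptions made for the example.

```python
import csv
import io


def deidentify(rows, actions):
    """Apply 'drop', 'encode', or 'keep' to each column.

    actions maps column name -> action; unlisted columns are kept.
    Returns the cleaned rows and, per encoded column, a dict that maps
    each original value to its code.
    """
    mappings = {}
    cleaned = []
    for row in rows:
        new_row = {}
        for col, value in row.items():
            action = actions.get(col, "keep")
            if action == "drop":
                continue
            if action == "encode":
                col_map = mappings.setdefault(col, {})
                if value not in col_map:
                    # Sequential codes keep the example deterministic;
                    # random codes would leak less about row order.
                    col_map[value] = len(col_map) + 1
                value = col_map[value]
            new_row[col] = value
        cleaned.append(new_row)
    return cleaned, mappings


def write_mapping_csv(mappings, handle):
    """Write one row per (column, original value, encoded value)."""
    writer = csv.writer(handle)
    writer.writerow(["column", "original", "encoded"])
    for col, col_map in mappings.items():
        for original, encoded in col_map.items():
            writer.writerow([col, original, encoded])


rows = [
    {"name": "Ana", "village": "A", "age": 30},
    {"name": "Ben", "village": "B", "age": 41},
    {"name": "Cy", "village": "A", "age": 25},
]
cleaned, mappings = deidentify(rows, {"name": "drop", "village": "encode"})
buffer = io.StringIO()
write_mapping_csv(mappings, buffer)
```

Because the mapping file allows encoded values to be reversed, it should be stored separately from the de-identified dataset.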
The repo also has code that identifies PII within text and replaces it with an 'xxxxxx' string. So, rather than flagging a whole column and dropping or encoding it, the user might prefer to replace only the PII with this string and keep everything else. The code searches for PII based on lists of common names of people and cities. This functionality is finished but currently very slow, so it is not enabled.
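A minimal sketch of this kind of in-text replacement is shown below. It is not the repo's code: redact_text and the sample name list are hypothetical, and the single compiled regular expression is just one simple way to scan a text for many names in one pass.

```python
import re


def redact_text(text, known_names, mask="xxxxxx"):
    """Replace whole-word, case-insensitive matches of known names."""
    if not known_names:
        return text
    # Longest names first so "Maria Jose" is preferred over "Maria".
    ordered = sorted(known_names, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(n) for n in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)


# Hypothetical sample list; a real list of names and cities would be far longer.
names_and_cities = ["Maria", "Nairobi"]
redacted = redact_text("Maria lives near Nairobi market.", names_and_cities)
```

The word-boundary anchors prevent partial matches, so "Marian" is left untouched when only "Maria" is in the list. A list-based approach cannot catch names that are missing from the list.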
python app_frontend.py
Remember to install the dependencies listed in requirements.txt.
pyinstaller --windowed --icon=app_icon.ico --add-data="app_icon.ico;." --add-data="ipa_logo.jpg;." --add-data="anonymize_script_template_v2.do;." --additional-hooks-dir=. --hiddenimport srsly.msgpack.util --noconfirm app_frontend.py
Compile create_installer.iss using Inno Setup Compiler.
Reference: https://www.youtube.com/watch?v=RrpvNvklmFA https://www.youtube.com/watch?v=DTQ-atboQiI&t=135s
IPA's RT-DEG teams.
J-PAL: stata_PII_scan. 2020. https://github.com/J-PAL/stata_PII_scan
J-PAL: PII-Scan. 2017. https://github.com/J-PAL/PII-Scan
The PII script is MIT Licensed.