Lesson guide and textbook for the "Data as a Science" course.
Data has become the most important language of our era, informing everything from intelligence in automated machines to predictive analytics in medical diagnostics. The plunging cost and easy accessibility of the raw requirements for such systems - data, software, distributed computing, and sensors - are driving the adoption and growth of data-driven decision-making.
A data scientist is a researcher who answers a research question using data, and can lead the development of the research process. They may design the methods to acquire primary or secondary sources of data that inform the research process, monitor and ensure ethical responsibilities, curate the research data and results, or communicate the process and results to stakeholders. Coding is incidental to that process, and it is possible to be a data scientist without programming at all.
Higher education course modules continue to be an atomised collection of dissociated curricula, since the heart of the university process is the assumption that graduates serve apprenticeships in labs or organisations. But data-driven careers don’t offer an artisanship of learning where an inter-generational accumulation of experience is passed on. Instead, online-first education has become equivalent to a best-of collection with no context or process.
As it becomes ever-easier to collect data about individuals and systems, a diverse range of professionals - who have never been trained for such requirements - grapple with inadequate analytic and data management skills, as well as the ethical risks arising from the possession and consequences of such data and tools.
Ordinarily, when teaching data science, everyone - from teachers to students - prefers to focus on analysis and presentation, since these are more fun and involve less frustration with messy data or ethical dilemmas. Working data scientists, however, will point out that the bulk of their time is taken up with social and ethical negotiations, and with complex and tedious data integration.
There are two objectives for this syllabus:
The course is based on the Sloyd model of technical training. Each lesson is discrete, building on the previous lesson, and provides a functional and holistic understanding of the scientific method as it applies to data. It is not about learning an algorithm and applying it to abstract, arbitrary data. The course has the objective of training complete data scientists: you will learn how research works and apply tools to a specific case-study.
Each lesson starts with a research question, and progresses by teaching a complete and practical set of skills, allowing students to learn at their own pace and in an order which suits their current understanding. Case-studies and tutorials are drawn from public health, economics and social issues, and the course is accessible to anyone with an interest in data. Course materials, case studies and guided tutorials are presented in Jupyter Notebooks, permitting learners to test running code and gain hands-on understanding of the techniques discussed.
Each lesson is guided by the following four topics:
Science is a set of defined methods that stands up to scrutiny, supports replication, and is supported by ethical measurement data acquired during the study process. The way to gain confidence in these methods is to review the work of others.
Each lesson will guide you through review of published scholarly work in the following ways:
Synthetic data will include lessons in dependent randomisation, as well as agent-based modelling.
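As a rough illustration of what dependent randomisation can look like, the sketch below generates synthetic records in which each field is drawn conditionally on a field generated before it. This is not taken from the course materials: the function name `make_synthetic_people`, the field names, and every probability and income figure are assumptions invented for the example.

```python
import random


def make_synthetic_people(n, seed=0):
    """Return n synthetic records in which later fields depend on earlier ones."""
    rng = random.Random(seed)  # seeded so the same data can be regenerated
    people = []
    for _ in range(n):
        age = rng.randint(18, 90)
        # Dependent step 1: retirement probability is conditional on age.
        # Both probabilities are invented for illustration.
        p_retired = 0.9 if age >= 65 else 0.05
        retired = rng.random() < p_retired
        # Dependent step 2: income is drawn from a distribution whose mean
        # depends on retirement status (again, invented figures).
        mean_income = 18_000 if retired else 32_000
        income = max(0.0, rng.gauss(mean_income, 6_000))
        people.append({"age": age, "retired": retired, "income": round(income, 2)})
    return people


if __name__ == "__main__":
    for person in make_synthetic_people(5, seed=42):
        print(person)
```

Because each draw is conditioned on an earlier one, the resulting records carry plausible relationships between fields rather than independent noise, and the fixed seed keeps the data set reproducible for anyone replicating the work.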
On completion of each lesson, students gain useful and meaningful skills, and are not left stranded. This means that even partial completion of the material permits students to be productive members of a research team. The first lesson will ensure students can become professional data wranglers, and – on completion of the first ten lessons – graduates will be capable of taking on a responsible data research role.
This is a brief video demonstrating the first module: https://www.youtube.com/watch?v=nZRL3OabbsY
I have prepared an overview of 20 lessons, each requiring two to three weeks to learn, which would comprise the complete course.
The first two lessons are complete, and I estimate about 6 weeks to research and create each of the remaining 18 lessons.
This course is not complete. My objective is that Data as a Science becomes a standard data science core syllabus, much as Core Econ has become for Economics. Progress is slow and dependent on the support and goodwill of others.
Each lesson costs about $5,000 to research and create, and is released here on completion. Please contact me at gchait @ whythawk . com should you wish to sponsor a lesson (or part thereof).
My name is Gavin Chait, and I am an independent data scientist specialising in economic development and data curation. I spent more than a decade in economic and development initiatives in South Africa. I was the commercial lead of open data projects at the Open Knowledge Foundation, leading the open source CKAN development team, and led the implementation of numerous open data technical and research projects around the world. Recently, I have developed Sqwyre.com, an initiative to develop a comprehensive business intelligence search engine for entrepreneurs. Data are based on open data and Freedom of Information requests.
I have extensive experience in leading research projects, implementing open source software initiatives, and developing and leading seminars and workshops. I have taught for 25 years, including undergraduate courses, adult education, and technical and analytical training at all levels.
This pedagogy and syllabus structure was developed with support from the Gates Foundation and WHO. Initial research into the need for education capacity building arose as a result of research supported by the Hewlett Foundation, Wellcome Trust and Public Health Research Data Forum.
Chait, Gavin; Sujith, Eramangalath; Grzywinska, Dominika; Wainwright, Mark (2018): Supporting capacity and skills development for public health data research management in low- and medium income countries. Wellcome Trust. Journal contribution. https://doi.org/10.6084/m9.figshare.6087161.v1
Chait, Gavin (2020): Data as a Science. Whythawk. https://doi.org/10.5281/zenodo.4194973
And as a BibTeX entry:
@book{chait_data_2020,
title = {Data as a {Science}},
copyright = {Creative Commons Attribution-ShareAlike 4.0 International and the GNU Affero General Public License},
publisher = {Whythawk},
author = {Chait, Gavin},
year = {2020},
doi = {10.5281/zenodo.4194973},
url = {https://doi.org/10.5281/zenodo.4194973}
}
Course content, materials and approach are copyright Gavin Chait, and released under both the Creative Commons Attribution-ShareAlike 4.0 International licence and the GNU Affero General Public License.
The objective is to ensure reuse, and to require that any modifications or adaptations of the source material be released under an equivalent licence.