Buy Tickets

Virtual Event

Workshops

Wednesday, December 2, 2020

Conference

Thursday, December 3 - Friday, December 4, 2020

Additional speakers and further programming, such as moderated panel discussions and community happy hours, will be announced shortly.

The Diversity & Inclusion Scholarship is now accepting applications, offering half-price General Admission tickets to people of color and members of underrepresented groups.

Apply

Speakers

Lucy D'Agostino McGowan

Assistant Professor in Statistics,
Mathematics and Statistics Department, Wake Forest University
@lucystats

Andrew Gelman

Professor,
Department of Statistics and Department of Political Science, Columbia University
@StatModeling

Graciela Chichilnisky

CEO & Co-founder,
Global Thermostat
@chichilnisky

David Meza

Senior Data Scientist,
NASA
@davidmeza1

Maxine Drake

Data Analyst,
U.S. Army
@maxinedrake

Alex Gold

Solutions Engineer,
RStudio
@alexkgold

Kimberly F. Sellers

Professor; Principal Researcher,
Georgetown University; U.S. Census Bureau
@KimFlaggSellers

Tyler Morgan-Wall

Research Staff Member,
Institute for Defense Analyses
@tylermorganwall

Anna Mantsoki

Biobanking Data Scientist,
Foundation for Innovative New Diagnostics
@amantsok

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Wendy Martinez

Director, Mathematical Statistics Research Center,
Bureau of Labor Statistics
@wendyisthebest

Col Alfredo Corbett

Deputy Director of Communications,
United States Air Force
@usairforce

Rose Martinez

Senior Data Scientist,
New York City Council Data Team
@NYCCouncilData

Yvan Gauthier

Senior Defence Scientist and Director of Data Science,
Department of National Defence
@ygauthie

Imane El Idrissi

Junior Data Scientist and Biobank Coordinator,
Foundation for Innovative New Diagnostics
@FINDdx

Michael Jadoo

Economist,
Bureau of Labor Statistics
@MikeJadoo

Brook Frye

Senior Data Scientist,
New York City Council Data Team
@NYCCouncilData

Kazuki Sakamoto

Senior Data Scientist,
Lander Analytics
@UrbanDigitized

Selina Carter

Data Scientist,
Inter-American Development Bank
@selina_carter_

Refael Lav

Specialist Master – Cognitive,
Deloitte
@refaellav

Abhijit Dasgupta

Chief Data Scientist,
Zansors
@webbedfeet

Simina Boca

Associate Professor,
Innovation Center for Biomedical Informatics (ICBI) at Georgetown University Medical Center
@siminaboca

Wil Doane

Research Staff Member,
Institute for Defense Analyses Science & Technology Policy Institute
@IDA_org

Mo Johnson-León

Policy,
Insight Lane
@moridesamoped

Dan Chen

PhD Student,
Virginia Tech
@chendaniely

Gwynn Sturdevant

Post-Doctoral Fellow,
Harvard Business School
@nzgwynn

Marck Vaisman

Sr. Cloud Solutions Architect,
Microsoft
@wahalulu

Jonathan Hersh

Assistant Professor, Economics and Management Science, Argyros School of Business,
Chapman University
@DogmaticPrior

Malcolm Barrett

Clinical Research Data Scientist,
Livongo
@malco_barrett

Workshops

Geospatial expert and Columbia professor Kaz Sakamoto will lead this class on all things GIS for those in public service. You’ll learn about map projections, spatial regression, plotting interactive heatmaps with leaflet, and working with shapefiles. This course is designed for those who are familiar with R and want to bring spatial data into their work. The morning session will introduce Geographic Information Systems (GIS), simple features (the sf package), Coordinate Reference Systems (CRS), and map-making basics. The afternoon session will introduce spatial operations, geometric operations, and spatial point pattern analysis. By the end of the day, participants will know how to read and work with spatial data, understand projections, apply geoprocessing techniques, and use basic spatial statistics.
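As a small taste of the morning material, here is a minimal sketch (not course code) of reading a shapefile and reprojecting it with sf, using the North Carolina dataset that ships with the package:

```r
library(sf)

# Read a shapefile bundled with sf into a simple-features data frame
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Inspect the current Coordinate Reference System (CRS)
st_crs(nc)$epsg
#> [1] 4267

# Reproject to NC State Plane (EPSG:32119, metres) before measuring areas
nc_proj <- st_transform(nc, 32119)
areas_km2 <- as.numeric(st_area(nc_proj)) / 1e6

# Plot the county geometries
plot(st_geometry(nc_proj))
```

Reprojecting to a planar CRS with metric units before computing areas or distances is the kind of projection-awareness the course builds.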

In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, you’ll learn the essential elements of answering causal questions in R through causal diagrams and causal modeling techniques such as propensity scores, inverse probability weighting, and matching. The instructors will demonstrate that by distinguishing predictive models from causal models, we can better take advantage of both tools; prediction modeling plays a role in establishing many causal models, such as propensity scores. You’ll be able to use the tools you already possess, such as the tidyverse and regression models, to answer the questions that are important to your work.
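A minimal base-R sketch of the propensity-score and inverse-probability-weighting workflow described above, on simulated data (variable names are hypothetical; this is not the workshop’s material):

```r
set.seed(42)

# Simulated data: confounder x affects both treatment and outcome
n <- 2000
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(0.8 * x))
y <- 2 * treat + 1.5 * x + rnorm(n)   # true treatment effect = 2

# Step 1: model the propensity score (a prediction model in service
# of a causal model)
ps <- predict(glm(treat ~ x, family = binomial), type = "response")

# Step 2: inverse probability weights
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Step 3: weighted outcome model recovers the causal effect
fit <- lm(y ~ treat, weights = w)
coef(fit)["treat"]   # close to the true effect of 2
```

The naive unweighted comparison of group means would be confounded by x; the weights re-balance the two groups.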

Jonathan Hersh, who has also taught at MIT and Wellesley College, will lead this class on machine learning in public policy. The course provides a comprehensive overview of machine learning and why it should be incorporated into creating public policy. The session covers basic concepts like supervised vs. unsupervised learning, testing and training sets, and the bias-variance tradeoff. Jonathan will also review linear regression, ridge (regularized) regression, cross-validation, and lasso regression, and will cover R syntax, data manipulation in R, exploratory data analysis, and basic plotting. You will discover how machine learning can help solve prediction problems in public policy formation and in which situations it can be used for data-driven predictive modeling for the social good.
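As an illustration of the ridge, lasso, and cross-validation topics above, here is a hedged sketch using the glmnet package on simulated data (not the course’s actual materials):

```r
library(glmnet)
set.seed(1)

# Simulated design matrix with a few true signals among many predictors
n <- 200; p <- 50
X <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))
y <- as.vector(X %*% beta + rnorm(n))

# Cross-validated ridge regression (alpha = 0) and lasso (alpha = 1)
ridge <- cv.glmnet(X, y, alpha = 0)
lasso <- cv.glmnet(X, y, alpha = 1)

# Lasso shrinks most coefficients exactly to zero at the CV-chosen lambda
coefs <- coef(lasso, s = "lambda.min")
sum(coefs != 0)   # far fewer than p + 1
```

Ridge keeps all predictors with shrunken coefficients, while lasso performs variable selection; cross-validation picks the penalty in both cases.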

Agenda

Registration, Virtual Breakfast & Opening Remarks: 8:00 AM - 9:00 AM EDT

Geospatial expert and Columbia professor Kaz Sakamoto will lead this class on all things GIS for those in public service. You’ll learn about map projections, spatial regression, plotting interactive heatmaps with leaflet, and working with shapefiles. This course is designed for those who are familiar with R and want to bring spatial data into their work. The morning session will introduce Geographic Information Systems (GIS), simple features (the sf package), Coordinate Reference Systems (CRS), and map-making basics. The afternoon session will introduce spatial operations, geometric operations, and spatial point pattern analysis. By the end of the day, participants will know how to read and work with spatial data, understand projections, apply geoprocessing techniques, and use basic spatial statistics.

In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams and causal modeling techniques such as propensity scores, inverse probability weighting, and matching. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools; prediction modeling plays a role in establishing many causal models, such as propensity scores. You’ll be able to use the tools you already know, such as the tidyverse and regression models, to answer the questions that are important to your work.

Jonathan Hersh, Assistant Professor of Economics and Management Science at Chapman University’s Argyros School of Business (previously at MIT and Wellesley College), is leading this class on all things machine learning in public policy. The course starts with what machine learning is and isn’t, and why we should use it for public policy. We cover basic concepts like supervised vs. unsupervised learning, testing and training sets, and the bias-variance tradeoff. We then go over linear regression, ridge (regularized) regression, cross-validation, and lasso regression, followed by R syntax, data manipulation in R, exploratory data analysis, and basic plotting. Throughout, we will revisit the question: why machine learning for public policy? We will learn that machine learning can help solve prediction problems in public policy making, and we will cover the situations in which it can be used for data-driven predictive modeling for the social good.

Virtual Breakfast & Registration: 8:00 AM - 8:50 AM EDT
Opening Remarks: 8:50 AM - 9:00 AM EDT

Current innovations in coding have focused on ease of learning and reading. Unfortunately, a byproduct of these features can be an increase in computation time. This talk will focus on vectorizing R code: writing code that operates on whole vectors at once rather than looping over elements, which can reduce computation time dramatically.
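The contrast the talk draws can be sketched with a simple loop versus its vectorized equivalent (a generic illustration, not the talk’s code):

```r
x <- runif(1e6)

# Element-by-element loop: interpreter overhead on every iteration
slow_sq_sum <- function(v) {
  total <- 0
  for (i in seq_along(v)) total <- total + v[i]^2
  total
}

# Vectorized: one call into compiled code
fast_sq_sum <- function(v) sum(v^2)

all.equal(slow_sq_sum(x), fast_sq_sum(x))  # TRUE
# system.time() typically shows the vectorized form running orders of
# magnitude faster on large inputs
```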

Michael introduces a set of functions for the R programming language that helps users construct economic indexes for tracking trends in prices and quantities. For productivity statistics, the Törnqvist index is a standard algorithm for aggregating over products or industries. It uses a changing-weight formula that combines variables at two points in time, weighting price or quantity relatives by cost/expenditure shares. He also provides methods for aggregating measures by industry and by groups of assets within an industry sector, along with a set of examples illustrating their use for multifactor productivity statistics.
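For orientation, the standard Törnqvist formula can be sketched as a small R function: a geometric mean of price relatives weighted by average expenditure shares. This is a generic illustration of the textbook formula, not the functions from the talk:

```r
# Törnqvist price index between two periods, given prices (p) and
# quantities (q) for each product in periods 0 and 1
tornqvist <- function(p0, p1, q0, q1) {
  s0 <- p0 * q0 / sum(p0 * q0)   # expenditure shares, period 0
  s1 <- p1 * q1 / sum(p1 * q1)   # expenditure shares, period 1
  exp(sum(0.5 * (s0 + s1) * log(p1 / p0)))
}

# Two products whose prices rise 10% and 20%
p0 <- c(10, 20); p1 <- c(11, 24)
q0 <- c(5, 2);   q1 <- c(5, 2)
tornqvist(p0, p1, q0, q1)   # between 1.10 and 1.20
```

The changing-weight character comes from averaging each product’s share across the two periods rather than fixing a base-period basket.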

Understanding occupation elements and employee skillsets is essential to properly aligning your workforce and identifying skill gaps, emerging skills, and career/training paths. In this presentation we will explore using tidymodels to augment a knowledge graph with inferred employee attributes.

Break & Networking: 10:10 AM - 10:40 AM EDT

Maxine was on a team that developed the U.S. Army’s COVID-19 projection model. She will share lessons she learned developing this model on the DoD network. First, she will discuss the packages on which her team relied, specifically furrr, sharing a comparison of furrr with other iteration methods. Second, she will discuss how the team leveraged functions to make their code robust and flexible. Lastly, she will share what priorities and management techniques the team followed that they believe made their model influential among Army senior leaders.

When facing a problem involving a few million rows of data, Jared wrote code that took hours to run, if it finished at all. To speed things up, he first split the data into smaller pieces, then did so in a smarter way. Still needing faster results, he wrote a custom function with a smarter algorithm, then sped it up further using Rcpp. All of this took the runtime from hours to seconds, making the solution feasible.
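The final step of that progression, moving a hot loop into C++ with Rcpp, looks roughly like this (a generic illustration of the pattern, not Jared’s actual function):

```r
library(Rcpp)

# A tight loop rewritten in C++ via Rcpp: sum of the positive entries
# of a vector. cppFunction() compiles the C++ source and exposes it
# as an ordinary R function.
cppFunction('
double sum_positive(NumericVector x) {
  double total = 0;
  for (int i = 0; i < x.size(); ++i)
    if (x[i] > 0) total += x[i];
  return total;
}')

x <- rnorm(1e6)
all.equal(sum_positive(x), sum(x[x > 0]))  # TRUE
```

Loops that cannot be vectorized in R often run far faster once compiled, since the per-iteration interpreter overhead disappears.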

The tidyverse has grown into a widely used set of tools, with dplyr as one of its earliest members. One can leverage people’s familiarity with dplyr as the motivating example for working through the more complicated topics around tidy evaluation. By re-implementing the behaviours of some dplyr functions (e.g., select and filter), one can see how rlang’s tools for quoting (e.g., quo and enquo) and unquoting (e.g., !! and !!!) play a role in writing tidyverse functions. The audience may have already heard of “passing the dots”, but this talk will take off the training wheels to show how users can create their own functions by replicating the behaviours of ones that many folks already know.
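The idea can be sketched by wrapping dplyr::filter(): capture the user’s expression with enquo(), then unquote it with !! (and enquos()/!!! for the dots). A minimal illustration, assuming dplyr and rlang are installed:

```r
library(dplyr)
library(rlang)

# Re-implementing a dplyr-like verb: quote the user's expression with
# enquo(), then unquote it into filter() with !!
my_filter <- function(.data, cond) {
  cond <- enquo(cond)
  filter(.data, !!cond)
}

# "Passing the dots": handle any number of conditions with enquos()/!!!
my_filter2 <- function(.data, ...) {
  conds <- enquos(...)
  filter(.data, !!!conds)
}

result <- my_filter(mtcars, cyl == 4)
nrow(result)   # 11 four-cylinder cars
```

Without the quote/unquote step, `cyl == 4` would be evaluated eagerly in the calling environment instead of inside the data frame.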

Lunch & Networking: 11:50 AM - 1:00 PM EDT
Speaker TBA: 1:00 PM - 1:20 PM EDT

Several months before the election, Andrew and his team worked with The Economist magazine to build a presidential election forecasting model combining national polls, state polls, and political and economic fundamentals. This talk will go over how the forecast worked, the team’s struggles in evaluating and improving it, and the more general challenges of communicating data-based forecasts.

Break & Networking: 2:05 PM - 2:35 PM EDT

This talk will showcase how the USDA Forest Service is using LIDAR data to support large-scale forest management operations, conservation, and landscape-level ecosystem restoration. Marck provides a quick introduction to LIDAR and its benefits, shows how the lidR package can be used to process the data, and explains how cloud technologies accelerate the process.

In this presentation, Wendy Martinez will describe some of her experiences (successes and failures) using the open-source statistical computing software R at several U.S. government agencies. By doing this, she hopes to inspire others and to pass along some of the lessons she has learned along the way. R is just one of the statistical computing tools available to us, and she believes data scientists and statisticians should have many computing tools ready to use. However, getting permission to use R in the U.S. federal government has been challenging.

Building a data analytics team in any context can be challenging, especially given the rapid pace of new tools and methods, the compute resources required, and the varied backgrounds of team members. Building a team within a federally funded research and development center poses additional constraints and opportunities. This talk will highlight some of the technical issues that arise which then translate into challenges for analytics teams as they collaborate to bring value to research sponsors.

Break & Networking: 3:45 PM - 4:15 PM EDT

How can we use R to predict project delays in international development? We’ll walk through an applied example from the Inter-American Development Bank, whose pipeline of R scripts sources and cleans data from internal and external sources, then generates predictions using a random forest algorithm with confidence intervals (computed via the infinitesimal jackknife approach). Selina will display the results in real time to end users in an interactive online visualization.

Learn how the New York City Council’s new Data Team uses a data-driven approach to improve the Council’s policy-making process. The Data Team answers policy questions and informs laws about everything from heat in public housing to school bus delays and marijuana arrests. It also uses data to conduct oversight of city agencies. The team sources datasets and creates analyses, models, maps, and dashboards that help Council staff and Council Members use data to make decisions. Its unique strength is in marrying data with public policy making.

Closing Remarks: 5:00 PM - 5:10 PM EDT
Virtual Breakfast & Registration: 9:00 AM - 9:50 AM EDT
Opening Remarks: 9:50 AM - 10:00 AM EDT

A rare disease is often defined as a disorder that affects fewer than 200,000 individuals in the United States. Over 70% of rare diseases are genetic, and of those, 70% start in childhood and often lead to a substantially reduced life expectancy. Together, rare diseases affect 25-30 million individuals in the United States. Due to the small number of individuals affected by any one rare disease, their often progressive nature, and the fact that they often affect children, it can be challenging to perform clinical trials in this space. Because of this, the US Food and Drug Administration (FDA) allows for specific flexibilities when evaluating new drugs for the treatment of rare diseases, for example by allowing the use of biomarkers as surrogate endpoints in some instances.

Simina considers the specific example of Duchenne Muscular Dystrophy (DMD), a devastating X-linked disease affecting around 1 in 5,000 newborn males that leads to muscle wasting, loss of ambulation, and eventual death between the late teens and early twenties or thirties. She presents two vignettes related to the use of R in understanding DMD and setting research priorities in this clinical space.

The first vignette concerns the analysis of the first comprehensive metabolomics study for DMD, which adds to the list of possible non-invasive blood-circulating biomarkers and represents one of the first steps towards finding metabolic surrogate endpoints of disease progression (repository for the analysis at https://github.com/SiminaB/DMD-metabolomics). The second vignette looks at curating DMD-related clinical trial data from the government-maintained database www.clinicaltrials.gov, with the eventual goal of developing a product that allows researchers, clinicians, and patients to stay up to date with ongoing drug development in the DMD disease area, and to prioritize research focusing on individuals who do not currently have many available clinical trial options.

Imagine that your analyses and models improve with each additional user of your system! Recommendation engines are powerful analytical techniques that serve each user while learning from every interaction to benefit the next. This presentation covers the creation of a recommendation engine in R using individuals’ data and site behavior, and then showcases the two deployment methodologies used: on-premises, and Docker with Plumber on AWS to scale the infrastructure while allowing new information to enhance the model. All of this serves a production-level website with smarter recommendations.

Break & Networking: 10:45 AM - 11:15 AM EDT

For #rstats enthusiasts working in or with the public sector, it can be hard to promote the spread of R across your organization. Based on his experience working at think tanks, in federal consulting, and with a wide variety of organizations at RStudio, Alex will share patterns for treating an R package as a tool to promote better data science and more use of R. Daft Punk references will be plentiful.

When the COVID-19 pandemic began, the Foundation for Innovative New Diagnostics (FIND) developed an interactive data platform to build a global picture of testing coverage. Imane and Anna will showcase how their team in Geneva, Switzerland built a comprehensive dataset on testing coverage across 179 countries using automated data mining tools (Selenium, R, regular expressions, and GitHub Actions), minimizing the need for manual intervention and maintenance. They will also preview their user-friendly Shiny application (the SARS-CoV-2 Test Tracker), which allows users to visualize and compare the number of tests, cases, and deaths, as well as the positivity rate, across countries and to inspect changes over time. The FIND SARS-CoV-2 test tracker is, to their knowledge, the only source that updates worldwide COVID-19 testing data daily.

How can financial instruments help resolve climate change? Dr. Chichilnisky will show how this can be accomplished quickly and effectively by using existing capital markets, benefiting high-income and, especially, low-income groups. The process she proposes is simple and can lead to a transformation of our capitalist economy in the direction of human survival. Furthermore, it is realistic and profitable along the way, supporting the transition.

Lunch & Networking: 12:25 PM - 1:35 PM EDT

Data visualizations are no longer afterthoughts destined for the supplementary material section: Learning how to create beautiful data visualizations is a key skill to influence decision makers and engage the public with your research and results. In particular, 3D visualizations are a powerful tool to attract attention to your projects and draw people into your research, and R has become one of the best language ecosystems for reproducibly generating high quality 3D visualizations. In this talk, Tyler will show how you can use the rayshader package along with several other tools to generate stunning 3D figures, entirely in R. He will also demonstrate how you can combine your data with free and open spatial datasets to create these figures in only a few lines of code, directly from the source data.

It is natural to consider a Poisson model for analyzing count data; however, such approaches carry a constraining underlying equi-dispersion assumption (i.e., that the mean and variance, conditional where applicable, are equal), which can lead to spurious results and inferences. Instead, much work has been conducted on developing flexible alternative methods stemming from the Conway-Maxwell-Poisson (CMP) distribution, a two-parameter distribution for count data that contains the Poisson model (among others) as a special case. To illustrate the impact of these contributions, this talk focuses on CMP regression models and the related R packages available to perform such analyses.
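Before reaching for CMP models, the equi-dispersion assumption can be checked informally with a base-R Poisson GLM: a ratio of residual deviance to residual degrees of freedom well above 1 suggests overdispersion. A generic sketch on simulated data, not taken from the talk:

```r
set.seed(7)

# Overdispersed counts: negative-binomial data violates equi-dispersion
x <- rnorm(500)
mu <- exp(0.5 + 0.8 * x)
y <- rnbinom(500, size = 2, mu = mu)   # variance exceeds the mean

fit <- glm(y ~ x, family = poisson)

# Under equi-dispersion, residual deviance is roughly equal to the
# residual degrees of freedom; a much larger ratio flags overdispersion
dispersion <- deviance(fit) / df.residual(fit)
dispersion   # noticeably greater than 1 here
```

When this diagnostic fails, flexible families such as the CMP distribution (which can capture both over- and under-dispersion) become attractive alternatives.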

Data is a warfighting asset, fundamental to how Air Combat Command (ACC) operates in and supports all five domains of warfare. With a rapidly growing data landscape, ACC is implementing major improvements to the way it manages, acquires, ingests, stores, processes, exploits, analyzes, and delivers data to its almost 100,000 operators. In coordination with the Department of Defense and the Department of the Air Force, ACC is pursuing six lines of effort to improve its data governance, data architecture, data standards, and data talent & culture.

Break & Networking: 2:45 PM - 3:15 PM EDT

Yvan Gauthier is a senior defence scientist with Defence R&D Canada – Centre for Operational Research and Analysis (DRDC CORA). Since 2017, he has led a data science team directly supporting the Chief Data Officer of the Department of National Defence (DND). He also chairs a NATO Specialist Team on Advanced Analytics and AI for Defence Resource Planning. Earlier in his career, he led several operational research projects while working with various branches of DND, including the Strategic Joint Staff, the Canadian Joint Operations Command, Maritime Forces Pacific Headquarters, and the Air Staff. He also worked for two years in the UK as an exchange scientist with Dstl.

Data on the health and well-being of populations is increasingly available through open data initiatives at various government and inter-governmental agencies, including the WHO, the World Bank, and national agencies. This real-world data is accessible to anyone who wants to understand trends in disease prevalence and the effects of policy change. This year, the power of open data has been evident in tracking patterns of incidence and death in the COVID-19 pandemic. In this talk, Abhijit will describe different ways the world’s data repositories can be accessed using generic and specialized R packages to enable visualization and analysis within the R ecosystem.

Coming Soon

Closing Remarks: 4:25 PM - 4:35 PM EDT

Sponsors

Platinum

RStudio
Deloitte

Gold

Georgetown University

Supporting

PolicyViz
Pearson
O'Reilly
Manning
CRC Press
Springer
Nausicaa Distribution

Media

Practical AI Podcast

Vibe

Matcha Bar/Hustle
Mount Gay

Tickets