Virtual Event

Workshops

Wednesday December 8, 2021

Conference

Thursday December 9 - Friday December 10, 2021

Speakers

Alex Gold

Solutions Engineer,
RStudio
@alexkgold

Wendy Martinez

Director, Mathematical Statistics Research Center,
Bureau of Labor Statistics
@BLS_gov

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Asmae Toumi

Director of Analytics and Research,
PursueCare
@asmae_toumi

Marck Vaisman

Sr. Cloud Solutions Architect,
Microsoft
@wahalulu

Brook Frye

Senior Data Scientist,
New York City Council
@brook_frye

Tommy Jones

Member of the Technical Staff,
In-Q-Tel
@thos_jones

Jordan Jasuta Fischer

Managing AI Developer,
IBM
@JordanJasuta

Aaron Mannes

Senior Policy Advisor,
Culmen LLC supporting DHS S&T
@awmannes

Mayari Montes de Oca

Former Research Scientist,
NYU Global TIES for Children
@Mayari_MOca

Cezary Podkul

Data & Investigative Journalist,
ProPublica
@Cezary

Sydney Coston

Cadet,
United States Military Academy

Abhijit Dasgupta

Adjunct Professor,
Georgetown University's Data Science and Analytics Program
@webbedfeet

Vivian Peng

Senior Data Scientist,
City of Los Angeles
@create_self

Madhava Jay

Core Team Lead,
OpenMined
@madhavajay

Lauren Lombardo

Graduate Student,
Harvard University John F. Kennedy School of Government
@_laurenlombardo

Jasmine Han

Data Reporter,
Bloomberg Industry Group
@JasmineHanYe

David Shor

Head of Data Science,
Blue Rose Research
@davidshor

Coline Zeballos

R Strategy Lead,
Roche Pharma
@colinezeballos

Jorge Luna

Lead Data Scientist,
Aetna, a CVS Health Company, Analytics & Behavior Change

Surabhi Hodigere

Research Assistant,
Ash Center for Democratic Governance and Innovation at Harvard Kennedy School
@surabhihodigere

Benjamin Braun

Principal, Data and Systems Architecture,
202 Group
@Ben_G_Braun

Boriana P. Pratt

Statistical Programmer,
Office of Population Research, Princeton University

Alexandra Boghosian

Postdoctoral Research Scientist,
Lamont-Doherty Earth Observatory

Jonathan Hersh

Assistant Professor of Economics & Management Science,
Chapman University Argyros School of Business
@DogmaticPrior

Workshops

The focus on climate change is growing as government and industry explore ways to respond to a warming world. The data that scientists use to understand the planet are often publicly available, but can be obscure to the non-expert. Data vary in file format as well as spatial and temporal resolution, with consequences for interpretation. Here we will explore how to find and analyze direct observations that show our climate is changing. By the end of the day, attendees should have a working knowledge of how the data support the science, and of where to gather data and information about the specific climate change issues they may face in their work.
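
Much of this observational data is distributed in gridded formats such as NetCDF. As a rough illustration of what reading such a file in R can look like (a sketch, not the workshop's own materials), here is an example with the {ncdf4} package; the file name and variable names are hypothetical placeholders.

    # Rough sketch: reading a gridded NetCDF temperature product in R.
    # The file name and the variable names are hypothetical placeholders.
    library(ncdf4)

    nc <- nc_open("air_temperature_monthly.nc")   # hypothetical file
    print(nc)                                      # lists variables, dimensions, units

    lon  <- ncvar_get(nc, "lon")
    lat  <- ncvar_get(nc, "lat")
    time <- ncvar_get(nc, "time")
    temp <- ncvar_get(nc, "air")                   # hypothetical variable name
    nc_close(nc)

    # Average over the two spatial dimensions to get one value per time step
    global_mean <- apply(temp, 3, mean, na.rm = TRUE)
    plot(time, global_mean, type = "l", xlab = "Time", ylab = "Mean temperature")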

Hersh, who has also taught at MIT and Wellesley College, will lead this class on machine learning in public policy. This course will provide a comprehensive overview of machine learning and why it should be incorporated into creating public policy. The session will cover basic concepts such as supervised vs. unsupervised learning, testing and training sets, and the bias-variance tradeoff. Jonathan will also review linear regression, ridge (regularized) regression, cross-validation, and lasso regression, along with R language syntax, data manipulation in R, exploratory data analysis, and basic plotting in R. You will discover how machine learning can help solve prediction problems in public policy formation and in which situations it can be used for data-driven predictive modeling for the social good.
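
To give a flavor of the regression methods named above, here is a minimal sketch using the {glmnet} package; the built-in mtcars data are only a stand-in, not the course material.

    # Minimal sketch of ridge and lasso regression with cross-validation in R.
    # mtcars is a stand-in data set, not the course material.
    library(glmnet)

    x <- as.matrix(mtcars[, -1])   # predictors
    y <- mtcars$mpg                # outcome

    cv_ridge <- cv.glmnet(x, y, alpha = 0)   # ridge: alpha = 0
    cv_lasso <- cv.glmnet(x, y, alpha = 1)   # lasso: alpha = 1

    # Coefficients at the penalty chosen by cross-validation
    coef(cv_ridge, s = "lambda.min")
    coef(cv_lasso, s = "lambda.min")

    # Ordinary linear regression for comparison
    summary(lm(mpg ~ ., data = mtcars))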

Agenda

Registration, Virtual Breakfast & Opening Remarks: 9:00 AM - 10:00 AM EST

Virtual Breakfast & Registration: 8:00 AM - 8:50 AM EST
Opening Remarks: 8:50 AM - 9:00 AM EST

If big data is characterized by volume, velocity, and variety, the Department of Homeland Security (DHS) is the ultimate big data organization. In its mission to protect the American people, DHS undertakes an array of diverse functions and often has to make decisions in real time. Using data analytics to enable these missions requires a blend of creativity and pragmatism.

Medical terms are linguistically very specific: a letter or two can completely change the word, and prefixes and suffixes can link two words that otherwise look wildly different. As such, many typical methods of natural language processing (NLP) are ill-adapted to work with medical records and their specific vocabulary and syntax. When a government client needed to classify medical conditions for record processing, IBM built a hybrid ensemble model that incorporates both rules-based and machine learning classification, to accommodate the client’s system structure while flexibly handling the nuances of medical terminology.
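
To make the idea concrete, here is an illustrative sketch of the general "rules first, model fallback" pattern; it is not IBM's system, and the rules, labels, and the model_predict() helper are hypothetical.

    # Illustrative sketch of a rules-plus-model classifier (not IBM's system).
    # The rules, labels, and the model_predict() helper are hypothetical.
    library(stringr)

    # Hand-written rules: regexes that map known patterns to a condition label
    rules <- list(
      diabetes     = "\\bdiabet(es|ic)\\b",
      hypertension = "\\bhypertens(ion|ive)\\b|\\bhigh blood pressure\\b"
    )

    classify_by_rule <- function(text) {
      for (label in names(rules)) {
        if (str_detect(tolower(text), rules[[label]])) return(label)
      }
      NA_character_   # no rule matched
    }

    classify_record <- function(text, model_predict) {
      rule_label <- classify_by_rule(text)
      if (!is.na(rule_label)) rule_label else model_predict(text)   # fall back to the ML model
    }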

“Data or it didn’t happen” is a credo we all live by. It’s especially important for data journalism, but sourcing data for an investigation is rarely easy. In this talk I will walk you through how a data-driven story comes together and share some ideas for how the public sector and journalists can work together more effectively.

Break & Networking: 10:10 AM - 10:40 AM EST

Ensuring the reliability and quality of R packages used in regulatory interactions for drug approvals: a view on how Roche participates in enabling submissions in R.

Most companies don’t have big data, but rather medium data: that awkward in-between where the data are too big to fit in memory but not big enough for Google-scale systems. Fortunately, R has many options for working with data of this size. We will look at using the command line, {data.table} and {dplyr} to clean the data and load it into a Postgres database inside a Docker container. Then we will use {targets} to orchestrate the whole process.
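
A rough sketch of that pipeline, under stated assumptions: a CSV called flights.csv, a Postgres instance listening on localhost (for example, inside a Docker container), and placeholder credentials. The {targets} step is shown only as a commented outline.

    # Sketch of the medium-data workflow: clean with {data.table}, load into
    # Postgres via {DBI}/{RPostgres}. File, table, and credentials are placeholders.
    library(data.table)
    library(DBI)

    flights <- fread("flights.csv")          # fast read of a large-ish CSV
    flights <- flights[!is.na(dep_delay)]    # drop incomplete rows

    con <- dbConnect(
      RPostgres::Postgres(),
      host = "localhost", port = 5432, dbname = "analysis",
      user = "postgres", password = Sys.getenv("PGPASSWORD")
    )
    dbWriteTable(con, "flights", flights, overwrite = TRUE)
    dbDisconnect(con)

    # _targets.R outline (sketch): let {targets} orchestrate and cache the steps
    # library(targets)
    # list(
    #   tar_target(raw,   fread("flights.csv")),
    #   tar_target(clean, raw[!is.na(dep_delay)])
    # )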

The U.S. opioid epidemic, or opioid crisis, refers to the substantial medical, social, psychological and economic consequences due to the misuse and overdose deaths of a class of drugs called opioids. The number of drug overdose deaths increased by nearly 5% from 2018 to 2019 and has quadrupled since 1999, and over 70% of the 70,630 deaths in 2019 involved an opioid (CDC, 2021). PursueCare is a telehealth startup offering comprehensive care for opioid use disorder and other substance use disorders by combining telehealth technology, medication treatment and counseling. Asmae Toumi, the director of analytics and research at PursueCare, will talk about how data and R/RStudio’s public and professional tools are being used to uncover trends, deliver care and improve outcomes.

Lunch & Networking: 11:50 AM - 1:00 PM EST

In New York City, the City Council has many functions, including oversight of the Mayor’s operations and generating legislation that complements the oversight function. We will provide a general overview of how data are used to underscore the rationale behind legislation and how the City Council works to ensure that these data feed into an overall ethos of transparency and evidence-based decision making.

A walk through how statistics and data science are commonly applied in politics and government.

Break & Networking: 1:45 PM - 2:15 PM EST

Some people get to write YORO (You Only Run Once) code, not really worrying about whether it’ll run again. You probably aren’t one of them.

More likely, you have to be ready to re-run analyses months or years later. That’s a tall order given the constant changes to the R language and package ecosystem.

In this talk, you’ll learn a taxonomy of reproducibility for your code and be introduced to two of the foremost tools for making your work environments more reproducible: Docker and {renv}.
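
As a small illustration of the {renv} half of that pairing (a sketch, not the talk's own code):

    # Minimal {renv} workflow for pinning package versions per project.
    renv::init()       # create a project-local library and renv.lock
    # ...install and use packages as usual...
    renv::snapshot()   # record the exact versions currently in use

    # Later, on another machine or inside a Docker image built from the same
    # R version (e.g. FROM rocker/r-ver:4.1.2), recreate the environment:
    renv::restore()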

This presentation will continue the story that I started at last year’s R Conference | Government & Public Sector. At the previous conference, I described some of my experiences – both successes and failures – using the open-source statistical computing software R at several U.S. government agencies. I described the goal of my journey, which was to get agreement from my agency to use R in the production of our official statistics. I am happy to announce that I have reached an important waypoint in this journey. R has been approved for production at the Bureau of Labor Statistics! Notice that I did not say I reached the end of my journey. This is because there is still a lot of important work ahead of us. In this talk, I will briefly recap the start of my journey, how I got to this point, and our way forward.

Did you know you can use R to create and maintain teaching materials, including slides, assignments, exams and even a website? This talk will illustrate how several R packages, including but not limited to {xaringan}, {rmarkdown}, {distill}, {xaringanthemer}, {ghclass}, and {exams}, are used in preparing and maintaining materials for courses I teach at Georgetown University and the George Washington University.
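
For a flavor of what such materials look like, here is a bare-bones {xaringan} slide source file; the title and content are placeholders, not the actual course files.

    ---
    title: "Week 1: Getting Started"   # placeholder course content
    output:
      xaringan::moon_reader:
        nature:
          highlightLines: true
    ---

    # A first slide

    Slides are written in R Markdown, so code and output render in place:

    ```{r}
    summary(cars)
    ```

    ---

    # A second slide

    - Slides are separated by `---`
    - Everything else is ordinary Markdown plus R chunks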

Break & Networking: 3:25 PM - 3:55 PM EST

Public sector organizations are increasingly turning to platforms as ways to improve service delivery while reducing costs. However, the term “platform” has been used to describe several different architectural designs and operational approaches. Decision-makers need to understand how their selected architectural design and operational approach, which are constrained by government structures, will impact their implementation of government platforms. Without a clear definition and an understanding of the technical decisions that must be made, it is impossible to responsibly build and implement public sector platforms.

Ever wished you could get access to more data for your data science problems without the painful and slow process of existing data access agreements?

Data scientists are limited to the data their organization has painstakingly acquired a copy of: data that often require phone calls, contract negotiations, lawyers, and special onsite security policies just to access. Getting to analyze personal data can take anywhere from weeks to months, even if you understand the whole process.

Privacy Enhancing Technologies (PETs) are bringing that time down to seconds while giving data subjects even stronger privacy guarantees.

In this talk we will:

  • Examine the privacy problem and the field of Privacy Enhancing Technologies (PETs)
  • See how Syft’s Automatic Differential Privacy and Secure Multi-Party Compute feel like magic
  • Hear about OpenMined’s free online Privacy Focused Data Science Courses
  • Learn how to participate in Federated Networks and fuel tomorrow’s life changing discoveries
  • Discover why being a nonprofit foundation is key to OpenMined’s Mission

Closing Remarks: 4:40 PM - 4:50 PM EST
Virtual Breakfast & Registration: 9:00 AM - 9:50 AM EST
Opening Remarks: 9:50 AM - 10:00 AM EST

tidylda implements the Latent Dirichlet Allocation (LDA) topic model in a way that is fast, flexible, and most importantly tidy. Wait. Who needs another LDA implementation though? Tommy will talk us through what makes tidylda so unique and provide examples to stir your imagination on new ways you can use topic modeling in your own work.
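
A quick, hedged sketch of what using it might look like, assuming the {tidylda} interface of tidylda(data, k, iterations) plus tidy() methods; the corpus (Jane Austen novels via {janeaustenr}) and the topic count are arbitrary stand-ins.

    # Sketch: build a document-term matrix with {tidytext}, fit LDA with {tidylda}.
    # The corpus and the number of topics are arbitrary stand-ins.
    library(tidytext)
    library(tidylda)
    library(dplyr)
    library(broom)          # provides the tidy() generic
    library(janeaustenr)

    word_counts <- austen_books() %>%
      unnest_tokens(word, text) %>%
      anti_join(stop_words, by = "word") %>%
      count(book, word)

    dtm <- cast_sparse(word_counts, book, word, n)   # sparse document-term matrix

    lda <- tidylda(data = dtm, k = 6, iterations = 200, burnin = 175)

    tidy(lda, matrix = "beta")   # word-topic probabilities, in tidy form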

Inspired by the open-source movement, Digital Public Goods (DPGs) are not only non-rivalrous, but sharing them across jurisdictions could lower costs, speed adoption, and create standards to facilitate cooperation and trade. However, the joint management of any resource between sovereign entities—particularly of key infrastructure for the maintenance of public goods and services offered by the state—carries with it significant questions of governance. A team of researchers based at the Ash Center within the Harvard Kennedy School are publishing a report that proposes five governance best practices for DPGs—Codifying a Mission, Vision and Value Statement, Drafting a Code of Conduct, Designing Governance Bodies, Ensuring Stakeholder Voice and Representation, and Engaging External Contributors. These five recommendations seek to nurture institutions that will create public value, possess legitimacy, and maintain the necessary support and operational capacity.

Break & Networking: 10:45 AM - 11:15 AM EST

Some reporters chose journalism because they hate numbers. Data journalists are a group of storytellers who like numbers and can code. And R is one of the most popular languages among them. This talk will introduce how a data journalist uses R, from web scraping and analysis to creating graphics and automating the boring stuff.
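
As a tiny illustration of that kind of workflow (with a placeholder URL and made-up column names, not a real story):

    # Sketch: scrape a table from a web page with {rvest}, then plot it.
    # The URL and the column names (year, value) are placeholders.
    library(rvest)
    library(ggplot2)

    page   <- read_html("https://example.com/some-agency-report")
    tables <- html_table(html_elements(page, "table"))
    dat    <- tables[[1]]          # assume the first table is the one we want

    ggplot(dat, aes(x = year, y = value)) +
      geom_line() +
      labs(title = "Trend pulled straight from the source page")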

Abstract Coming Soon

When we think about design, it’s common to jump immediately to thinking about what colors to choose or what graphs to make. The design process starts further back, by getting to know your audience at a foundational level – what motivates, challenges, and inspires them. At a time when we are overloaded by information, and desensitized to numbers, how do we develop data tools and visualizations that create an impact?

Lunch & Networking: 12:25 PM - 1:35 PM EST

Simulations are often run to benchmark a method using data where the results are known, or to compare a few methods on nicely structured (simulated) data. Simulating data in R is not hard. But if you have to simulate many different datasets while tweaking some parameters, how do you automate the process so it runs multiple times and makes the most of your computer or server resources? In this talk I will show how I ran multiple simulations at the same time, using the doParallel package to run several R threads simultaneously (from within R) and simulate multiple datasets of genetic data under different scenarios.
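
A minimal sketch of that pattern with {doParallel} and {foreach}; the parameter grid and the toy regression are placeholders rather than the genetics simulations from the talk.

    # Sketch: run simulation scenarios in parallel with {doParallel} + {foreach}.
    # The parameter grid and the toy regression below are placeholders.
    library(doParallel)

    cl <- makeCluster(4)           # four worker processes
    registerDoParallel(cl)

    params <- expand.grid(n = c(100, 500, 1000), effect = c(0, 0.2, 0.5))

    results <- foreach(i = seq_len(nrow(params)), .combine = rbind) %dopar% {
      n      <- params$n[i]
      effect <- params$effect[i]
      x <- rnorm(n)                              # simulate one dataset
      y <- effect * x + rnorm(n)
      data.frame(n = n, effect = effect, est = coef(lm(y ~ x))[2])
    }

    stopCluster(cl)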

Rigorous evidence of what works to support refugee children is scarce and challenging to attain. During this talk, Mayarí will share with us the strategy she used to study the impact of attending a remedial support program, brought to Syrian refugees by the IRC and NYU, on children’s reading skills. She will share her experience working with machine learning and statistical frameworks that can help to 1) leverage the information available in understudied contexts and 2) better account for the problem of self-selection into different dosage levels, under a causal framework.

In this talk you will also learn about the data challenges of conducting research with vulnerable populations and the R tools that were helpful in the process.

The end user’s understanding of how something works is just as important as the result. Concepts and techniques that come naturally to us as Data Scientists can be totally perplexing to non-expert consumers... and that’s when critical analysis gets lost in translation.

“We can’t help if we can’t communicate” explores how we bridge the gap between Data Science practitioners and the government executives who use our findings to develop policy. The talk will explore two use cases, one failure and one success, from the speaker’s decade-plus of US Federal Government experience to establish a set of best practices for conveying data science findings to non-experts.

Why it’s important:

Government needs our help, but we can’t help if we can’t communicate. If we can’t convey what our findings mean and why they are important, essential decisions will be made without the data or analysis needed to back them up.

Participants who attend this session will leave with:

  • A set of best practices for ensuring data science findings and products remain accessible, relevant, and actionable
  • A deeper understanding of how government executives make decisions
  • A sense of where data science fits into the decision-making process

Break & Networking: 2:45 PM - 3:15 PM EST

We live in a multilingual computational environment, where each language provides certain advantages in terms of developed packages and capabilities. Often, we are faced with utilizing multiple languages to create efficient data analytic workflows. In this talk, I’ll describe some experiences integrating languages, using R as the backbone and glue.
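
One common flavor of that glue is calling Python from R with {reticulate}; a small sketch (assuming a Python installation with numpy available, and not necessarily the tools used in the talk):

    # Sketch: use R as the glue and pull in a Python library via {reticulate}.
    # Assumes a Python installation with numpy available.
    library(reticulate)

    np  <- import("numpy", convert = FALSE)    # keep results as Python objects
    m   <- np$arange(12L)$reshape(c(3L, 4L))   # build an array on the Python side
    m_r <- py_to_r(m)                          # convert it to an R matrix

    colMeans(m_r)                              # continue the analysis in R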

This presentation will introduce an R Shiny app that my partner and I have created to examine the trend of negative outcomes after stem cell treatments over time. The dataset used is from Sloan Kettering Hospital and includes 5 years (20 quarters) of de-identified data from adults and children. The app allows the user to see the proportion of patients who experienced each negative outcome (toxicity) per quarter for the data set of their choice. This app reveals concerning trends in certain toxicities over time.
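
A stripped-down sketch of that kind of app is below; the toxicity categories and proportions are simulated placeholders, not the Sloan Kettering data.

    # Sketch of a Shiny app: pick a cohort, see toxicity proportions by quarter.
    # The data are simulated placeholders, not the real de-identified data set.
    library(shiny)
    library(ggplot2)

    set.seed(1)
    toxicity_data <- expand.grid(
      quarter  = 1:20,
      toxicity = c("toxicity A", "toxicity B", "toxicity C"),
      cohort   = c("adults", "children")
    )
    toxicity_data$proportion <- runif(nrow(toxicity_data), 0, 0.3)

    ui <- fluidPage(
      selectInput("cohort", "Data set", choices = c("adults", "children")),
      plotOutput("trend")
    )

    server <- function(input, output, session) {
      output$trend <- renderPlot({
        dat <- subset(toxicity_data, cohort == input$cohort)
        ggplot(dat, aes(quarter, proportion, colour = toxicity)) +
          geom_line() +
          labs(x = "Quarter", y = "Proportion of patients with toxicity")
      })
    }

    shinyApp(ui, server)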

Speaker TBA: 4:05 PM - 4:25 PM EST
Closing Remarks: 4:25 PM - 4:35 PM EST

Sponsors

Gold

RStudio
Georgetown University

Silver

R Consortium

Bronze

PolicyViz

Supporting

Pearson
Chapman & Hall/CRC, Taylor & Francis Group
Springer