Gov R Conference
Stay tuned for 2024 Conference details!
Speakers


Abigail Haddad
Lead Data Scientist
Capital Technology Group
Talk: What Job Is This, Anyway?: Using LLMs to Classify USAJobs Data Scientist Listings


Melissa Albino Hegeman
Marine Fisheries Data Manager
NYSDEC
Talk: It Works on My Machine (Reproducibility in R for Small Teams)

Selen Stromgren
Associate Director
U.S. Food and Drug Administration
Talk: Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans (Joint Talk with Danielle Larese)

David Meza
Head of Analytics – Human Capital, Branch Chief People Analytics
NASA
Talk: From Data Confusion to Data Intelligence

Irena Papst
Senior Scientist
Public Health Agency of Canada
Talk: From Scripts to Pipelines with Targets

Gary Harki
Investigations Editor
Bloomberg Industry Group
Talk: Using Open Records Laws to get Data from the Government (and When to Sue)


George Perrett
Director of Research and Data Analysis
New York University
Talk: stan4bart: Harnessing the Power of Stan and the Flexibility of Machine Learning

Danielle Larese
Scientific Coordinator (Chemist)
U.S. Food and Drug Administration
Talk: Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans (Joint Talk with Selen Stromgren)

Alex Gurvich
Senior Graphics Designer & Data Visualization Specialist
NASA's Science Visualization Studio
Talk: Storytelling with Data at NASA's Earth Information Center

Zach Terner
Senior Data Scientist
The MITRE Corporation
Talk: Preparing for the Future: How Climate Change May Affect Food Growth

Gwynn Gebeyehu
Co-Founder
Perception Analytics
Talk: The R Project Sprint

Dusty Turner
Major
United States Army & Baylor
Talk: World Leaders, Military Service, and Their Propensity for War

Soubhik Barari
Quantitative Social Scientist
NORC
Talk: LocalView: Scaling up the Analytics of Local Politics with R

Aayushi Verma
Data Science Fellow
Institute for Defense Analyses
Talk: From Data to Collaboration: Connecting Our Researchers with R and Shiny

Jon Schwabish
Founder and CEO
PolicyViz
Talk: What Not to do in Data Visualization: A Walk through the Bad DataViz Hall of Shame

Rhys O'Neill
Innovations and Technology Lead - AIRA
World Health Organization
Talk: Democratizing Misinformation Management

Vivian Peng
Lead Data Scientist, Innovation
The Rockefeller Foundation
Talk: Using Large Language Models in Production: Hype vs Reality (Joint Talk with David Cyprian)


Benjy Braun
Vice President
Data Solutions and Innovation
Talk: You Don't See with Your Eyes, You Perceive with Your Mind: Sight, Psychology, and Data Visualization


David Cyprian
Partner
Rootwise
Talk: Using Large Language Models in Production: Hype vs Reality (Joint Talk with Vivian Peng)

Workshops

Introduction to Natural Language Processing
Hosted by William E J Doane
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person & Virtual Ticket Options) Unstructured and loosely structured textual data is commonly used in public policy analyses to wrangle the vast amount of information available from open (and not so open) sources. This workshop will use R to acquire data from various sources, clean and standardize the data, and explore it for insights that can inform public policy discussions. Basic visualizations will be considered to help communicate stories about the collected data. Dr. Doane is a science and technology policy researcher in Washington, DC, with a background teaching computer science and information science at the university level. https://DrDoane.com/about/cv
Generative AI for Better Code
Hosted by Abigail Haddad & Benjy Braun
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person & Virtual Ticket Options) This one-day workshop focuses on how GPT-4 can help reduce technical debt in your R projects, whether you're doing analysis, automation, or data science. Technical debt is the 'debt' you accumulate when you write code and build tools quickly, but which later slows you down when you try to add functionality. We'll go from writing "code that runs" to "code you can build on": code that's modular, documented, and on GitHub. The workshop is structured into four parts:
- Code Refactoring: We take code and show you how to make it modular by putting it in functions and structuring it to be easier to run, debug, and build on.
- Documentation: The next step is making your code easy to understand and use. We will show you how to create clear, thorough documentation at both the project and function level.
- Unit Testing: We'll guide you through creating unit tests. These formal checks ensure your functions operate as expected, so when you modify your code you're less reliant on ad hoc testing.
- Version Control: The final step involves using git for local version control and GitHub for collaboration. Even if you're already a git user, ChatGPT can help you write commands for less-used tasks and debug your error messages. If GitHub Copilot is available in RStudio, we will also discuss how you can use this tool to generate code.
By the end of the workshop, you will have transformed code that runs into modular, well-documented code stored in a Git repository. You will understand how large language models like GPT-4 can help you create code that not only does what it's supposed to do but is also easy to work with and build on. This knowledge will be useful for your ongoing analytical work and future development projects.
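As a taste of the workshop's arc, here is a hypothetical before-and-after in R: a small, documented function plus the kind of {testthat}-style unit test the workshop walks through (the function itself is made up for illustration, not workshop material).

```r
#' Convert percentage strings such as "12.5%" to numeric fractions.
#' Unparseable values become NA rather than raising a warning.
parse_percent <- function(x) {
  suppressWarnings(as.numeric(sub("%$", "", trimws(x))) / 100)
}

# A matching unit test, in the style of {testthat}:
# test_that("parse_percent handles % signs and whitespace", {
#   expect_equal(parse_percent(c("50%", " 7.5% ")), c(0.5, 0.075))
#   expect_true(is.na(parse_percent("n/a")))
# })
```

Once a function is this small and documented, refactoring, testing, and committing it to GitHub each become straightforward steps rather than one tangled task.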
Causal Inference + BART
Hosted by George Perrett
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person Ticket Option Only) This workshop will introduce Bayesian Additive Regression Trees (BART) as a tool for causal inference. BART is a machine learning algorithm with applications in both randomized and observational studies. No prior experience with causal inference or machine learning is expected. By the end of this workshop, you will have hands-on experience fitting BART models for causal inference. You will be able to articulate the main ideas of BART and communicate both the advantages of BART models for causal inference and the underlying assumptions of these models. This workshop will begin with an introduction to causal inference and BART. You'll learn the basics of what causal inference is and why it matters. We'll then cover the intuition of BART and why it is a desirable tool for causal inference. After this introduction, I'll cover the applications of BART in randomized studies. Randomized studies are the "gold standard" of causal inference, but the choice of which model to use remains consequential. I will compare BART to other causal inference strategies in randomized studies, and you will learn about using BART to uncover treatment effect moderators and heterogeneous treatment effects. Randomized studies are not always practical or even possible, so we'll extend the use of BART for causal inference to observational studies. Participants will learn about the advantages of BART for observational studies and gain hands-on experience working with observational data where individuals have self-selected into or out of the treatment in question. In public policy and governmental settings, data can be clustered and non-independent: individuals in a dataset may share a Congressional district or state, or attend a common hospital or school. The workshop will end with extensions of the BART method for working with non-independent data to address these settings. This course is for you if you are interested in learning more about causal inference and implementing a cutting-edge machine learning method used by experts in causal inference. This course assumes familiarity with and a basic understanding of R.
Agenda
Wednesday, Oct 18
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 05:00 PM
Workshop: William E J Doane Research Staff Member @ IDA Science & Technology Policy Institute
Introduction to Natural Language Processing ...
-
08:50 AM - 05:00 PM
Workshop: Abigail Haddad & Benjy Braun
Generative AI for Better Code ...
-
08:50 AM - 05:00 PM
Workshop: George Perrett Director of Research and Data Analysis @ New York University
Causal Inference + BART ...
Thursday, Oct 19
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 09:00 AM
Opening Remarks
-
09:00 AM - 09:20 AM
Irena Papst Senior Scientist @ Public Health Agency of Canada
From Scripts to Pipelines with Targets ...
Do you ever find yourself starting with a simple analysis script, only to end up wrangling a thousand-line behemoth? Are you sick of wasting time re-running long scripts from start to finish just to make sure everything is up to date? Are you haphazardly saving objects to file because they take a long time to generate? There’s got to be a better way! Enter targets, an R package used to build reproducible, efficient, and scalable pipelines. In this talk, I’ll introduce the targets package and share how I’ve used it to streamline my work modelling infectious disease spread at the Public Health Agency of Canada. -
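For flavor, this is the shape of a minimal {targets} pipeline; a generic sketch in which the data file and helper functions (clean_data, fit_model, render_report) are hypothetical, not from the talk:

```r
# _targets.R: declares the pipeline instead of running a script top to bottom
library(targets)
tar_source()  # load helper functions from the R/ directory

list(
  tar_target(case_file, "data/cases.csv", format = "file"),  # tracked input file
  tar_target(raw_cases, read.csv(case_file)),
  tar_target(clean_cases, clean_data(raw_cases)),
  tar_target(fit, fit_model(clean_cases)),
  tar_target(report, render_report(fit))
)
```

Running targets::tar_make() builds the pipeline and skips any target whose inputs and code have not changed, which is exactly what saves the re-running and ad hoc object-saving described above.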
09:25 AM - 09:45 AM
Gary Harki Investigations Editor @ Bloomberg Industry Group
Using Open Records Laws to get Data from the Government (and When to Sue) ...
Gary will break down how to effectively use open records laws to get data from local, state and federal agencies. He'll talk about the hurdles you encounter and how to overcome them. -
09:50 AM - 10:10 AM
Jon Schwabish Founder and CEO @ PolicyViz
What Not to do in Data Visualization: A Walk through the Bad DataViz Hall of Shame ...
Prepare to be amused and enlightened as we embark on a comical journey through the quirky world of bad data visualizations. In this light-hearted talk, I’ll showcase some of the most outrageous and baffling data visualization blunders that have left audiences scratching their heads. From pie charts that try to show you everything to bar charts that distort and mislead, you’ll see it all. I mix the comical with the serious to unveil visual missteps in the data world. Amidst the 3D exploding charts, you'll also glean valuable lessons on what not to do when crafting data visualizations. Join me for a rollicking exploration of data gone wrong, and leave with a smile and a newfound appreciation for the importance of clarity and accuracy in our data-driven endeavors. -
10:10 AM - 10:40 AM
Break & Networking
-
10:40 AM - 11:00 AM
Abigail Haddad Lead Data Scientist @ Capital Technology Group
What Job Is This, Anyway?: Using LLMs to Classify USAJobs Data Scientist Listings ...
Navigating the federal job market begins with finding appropriate job listings. But for data professionals, discrepancies often arise between the content of the listing - that is, the duties of the job - and either the job title or the occupational code, making this step more difficult. In this presentation, I discuss using a Large Language Model (LLM) to generate new job titles for listings in occupational code 1560, Data Science. I'll show examples of listings with mismatches between the official job title and the one generated by GPT-3.5 and discuss the potential uses of this for applicants and agencies. I'll also highlight the advantages of using Marvin, a library that lets you use LLMs to solve Natural Language Processing problems by just writing documentation rather than code. -
11:05 AM - 11:25 AM
Jared P. Lander Chief Data Scientist @ Lander Analytics
Mapping Big Data ...
Maps are one of the best forms of data visualization: readily understood while conveying a considerable amount of information. With the modern web, interactive, pannable, zoomable maps---known as slippy maps---have become the norm. Thanks to packages like {leaflet}, it has never been easier to generate these maps. However, they don't scale well out of the box. We'll look at different methods for dealing with large data to make high-performance maps. -
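One common out-of-the-box tactic for large point datasets in {leaflet} is client-side marker clustering. This is a generic sketch with synthetic data, not the speaker's method; the talk covers approaches beyond this:

```r
library(leaflet)

# 100,000 synthetic points around Washington, DC
pts <- data.frame(lng = runif(1e5, -77.2, -76.9),
                  lat = runif(1e5,  38.8,  39.0))

leaflet(pts) |>
  addTiles() |>
  addCircleMarkers(radius = 3, stroke = FALSE,
                   clusterOptions = markerClusterOptions())
```

Clustering keeps the browser responsive by drawing one aggregate marker per region at low zoom, but it is still shipping every point to the client, which is where the scaling methods in the talk come in.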
11:30 AM - 11:50 AM
Soubhik Barari Quantitative Social Scientist @ NORC
LocalView: Scaling up the Analytics of Local Politics with R
-
11:50 AM - 01:00 PM
Lunch & Networking
-
01:00 PM - 01:20 PM
David Meza Head of Analytics – Human Capital, Branch Chief People Analytics @ NASA
From Data Confusion to Data Intelligence ...
Data science teams operate in a unique environment, much different from the IT or software development life cycle. Executives' hopes for the impact of data science are extremely high; understanding of how to make data science efforts successful is very low. This creates an interesting set of organizational challenges for data and analytics teams. These challenges are particularly clear when data science is being introduced at new companies, but they play out at organizations of all sizes. So, how do we navigate this dynamic? We’ll share some strategies for success. -
01:25 PM - 01:45 PM
David Shor Head of Data Science @ Blue Rose Research
Data, Surveys, and US Politics ...
A walk through of the state of the art of Data Science and US politics -
01:45 PM - 02:15 PM
Break & Networking
-
02:15 PM - 02:35 PM
Alex Gold Solutions Engineer @ Posit
Learn to Love Logging ...
Good logging practice makes the software development parts of Data Science easier and more fun. Learn about how to add logging to your apps, projects, and reports. -
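The talk is tool-agnostic, but as a minimal flavor of what "adding logging" looks like, here is a tiny base-R helper (a sketch; real projects might reach for packages such as {logger} or {log4r}):

```r
# Minimal structured log line: timestamp, level, message.
log_msg <- function(level, msg, ts = Sys.time()) {
  line <- sprintf("%s [%s] %s", format(ts, "%Y-%m-%d %H:%M:%S"), level, msg)
  message(line)      # message() writes to stderr, keeping stdout clean
  invisible(line)
}

log_msg("INFO", "loaded 2,311 rows from cases.csv")
```

Sprinkling calls like this at the boundaries of an app or report (data loaded, model fit, file written) is usually enough to turn "it broke overnight" into a five-minute diagnosis.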
02:40 PM - 03:00 PM
Melissa Albino Hegeman Marine Fisheries Data Manager @ NYSDEC
It Works on My Machine (Reproducibility in R for Small Teams) ...
Working collaboratively in R can be a lot of fun, but it can also be tricky to get started. A combination of GitHub, renv, and custom packages can help improve reproducibility, reduce stress, and lighten everyone's workload. I've made mistakes and hit roadblocks when implementing these tools within a team. But I've also learned a lot along the way. I'll share my experiences and tips so you can avoid the same mistakes and start on the right foot. -
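The typical {renv} workflow behind this kind of setup looks roughly like the following (these are renv's documented commands; the package installed is just an example, not the speaker's exact setup):

```r
renv::init()                # create a project-local library and renv.lock
install.packages("dplyr")   # installs into the project library, not the system one
renv::snapshot()            # record exact package versions in renv.lock (commit this file)

# A teammate clones the repository, then runs:
renv::restore()             # rebuild the same library from renv.lock
```

Pairing renv.lock with GitHub means "it works on my machine" becomes "it works on any machine that runs renv::restore()".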
03:05 PM - 03:25 PM
Dusty Turner Major @ United States Army & Baylor
World Leaders, Military Service, and Their Propensity for War ...
This presentation delves into the question of whether leaders with military experience are more likely to lead their countries into war. Utilizing the Leader Experience and Attribute Descriptions (LEAD) dataset, which comprises the personal lives and experiences of over 2,000 state leaders from 1875 to 2004, the study examines various factors related to leaders' propensity for conflict initiation. Drawing upon a multitude of sources, including academic literature, obituaries, military archives, and more, the data covers aspects of leaders' childhoods, educations, personal lives, and pre-leadership occupations. The research employs a purposeful-selection model-building technique, starting with univariable analyses and progressing through several multivariable models to assess the influence of different attributes on war initiation. Interspersed with anecdotal illustrations and personal backgrounds of leaders, the study aims to provide a comprehensive understanding of the intricate relationship between a leader's military experience and their inclination towards war. The presented findings do not represent the official views of the Army, but they contribute to the broader discourse on civil-military relations and war dynamics. -
03:25 PM - 03:55 PM
Break & Networking
-
03:55 PM - 04:15 PM
Gwynn Gebeyehu Co-Founder @ Perception Analytics
The R Project Sprint ...
R is maintained and developed by the R Core Team, a group of 20 volunteers; without them, R could cease to exist. The purpose of the R Project Sprint was to encourage collaboration between novice and experienced R developers. This talk will give an overview of the sprint, including the roles of the R Core Team and the patches submitted during the sprint, and outline future work for this project. -
04:20 PM - 04:40 PM
Zach Terner Senior Data Scientist @ The MITRE Corporation
Preparing for the Future: How Climate Change May Affect Food Growth ...
In this study, we examined how climate change may affect food growth in the coming years. To do so, we conducted a historical analysis to understand which weather patterns over the years 2000-2019 led to the largest errors in crop simulations. We focused on simulated and harvested amounts of winter wheat in France, and used historical weather data from the NASA POWER API, as well as information on soil depth, to explain errors in crop simulations. We took a functional data analysis (FDA) approach using the refund package in R to build a longitudinal mixed-effects regression model with functional covariates. The results showed that specific changes in weather patterns at different times of the year can explain a large proportion (65% or more) of the errors in crop simulation. This analysis was part of a larger project meant to understand and anticipate how climate change may affect food security, possibly leading to violent conflict. -
04:40 PM - 04:50 PM
Closing Remarks
-
05:00 PM - 07:00 PM
Happy Hour at Clubhouse
Data Happy Hour at Clubhouse - Hosted by Data Science DC ...
Take a break from your keyboard and join us at Clubhouse in Georgetown for this Data Science DC Happy Hour. Come socialize and network with fellow data scientists, analysts, software engineers, and other data enthusiasts. A range of non-alcoholic drinks will be supplied, with alcoholic beverages available for purchase. RSVP here: https://www.meetup.com/data-science-dc/events/295959864/
Friday, Oct 20
-
09:00 AM - 09:50 AM
Registration & Breakfast
-
09:50 AM - 10:00 AM
Opening Remarks
-
10:00 AM - 10:20 AM
Rhys O'Neill Innovations and Technology Lead - AIRA @ World Health Organization
Democratizing Misinformation Management ...
Covid brought along its own wave of misinformation, and like Covid, the misinformation is also here to stay. We know the detrimental impact of misinformation and are building a system to future-proof our health responses. The WHO’s Africa Infodemic Response Alliance, Rockefeller Foundation, and technology startup RootWise have partnered to pool expertise in ML, cloud computing, and infodemic management to create a tool that empowers local officials to identify harmful narratives in their own communities and digital spaces and immediately deploy relevant, appealing messages promoting proper public health practices back into these same information channels. -
10:25 AM - 10:45 AM
George Perrett Director of Research and Data Analysis @ New York University
stan4bart: Harnessing the Power of Stan and the Flexibility of Machine Learning ...
Data is often organized within social systems: people in cities that are in counties that are in states. Nested data often violates the independence assumptions inherent to most statistical and machine learning methods. Multilevel models are a popular solution for accounting for these dependencies, but they make rigid parametric assumptions about the linearity of data. stan4bart is a new type of multilevel model that combines the flexibility of machine learning with the robust inference of traditional multilevel models. stan4bart has applications to both prediction and inference problems, and my talk will introduce the method and its utility in educational and public policy domains. -
10:45 AM - 11:15 AM
Break & Networking
-
11:15 AM - 11:35 AM
Benjy Braun Vice President @ Data Solutions and Innovation
You Don't See with Your Eyes, You Perceive with Your Mind: Sight, Psychology, and Data Visualization ...
Inspired by Stephen Few's "Show Me the Numbers," this talk delves into the psychology of data visualization. We'll start by briefly exploring how the eye-brain interaction affects what we 'see' in a graph. The focus then shifts to the key Gestalt principles of design—proximity, similarity, enclosure, closure, continuity, and connection—that serve as the backbone of effective data visualization. We'll wrap up by critiquing poorly executed visualizations and discuss how to improve them using these principles. Attendees will leave with practical insights into making their data not just viewable, but truly 'seen'. -
11:40 AM - 12:00 PM
Selen Stromgren & Danielle Larese U.S. Food and Drug Administration
Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans ...
Large language model AI systems are taking off at a dizzying speed, and end users are trying to ascertain which output can be trusted, and to what degree. More recently, machine learning experts have pivoted to “refining” large language models with focused sets of data, training the AI tool on a topic-specific corpus to increase the accuracy and reliability of the output. Examples of such “subject matter expert” AI systems are pharmaGPT, bioGPT, etc. However, the AI system itself still remains a black box to the end user. In this talk, we will present a pilot idea in which we explore a very deterministic approach to extracting information from a well-defined corpus using R. Our approach is completely transparent to the end user, does not include any extrapolation or probability-based guessing, and produces an output only if the specific answer to the question posed is present in the reference corpus. If successful, such an approach can allow users to create their own R code using different corpus inputs, with the ultimate goal of automating and expediting information retrieval on the go with full accuracy. -
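The broad strokes of such a deterministic retrieval step can be sketched in base R. This is a generic illustration of the idea, not the presenters' pilot code; the corpus and query are invented:

```r
# Return corpus passages only when the query string literally occurs in
# them: no generation, no extrapolation, no guessing.
find_passages <- function(corpus, query) {
  corpus[grepl(query, corpus, fixed = TRUE)]
}

docs <- c("Store below 25 degrees C.",
          "Recommended dose: 81 mg daily.")
find_passages(docs, "81 mg")   # returns only the passage containing the answer
find_passages(docs, "dosage")  # character(0): no literal match, so no output
```

Unlike a probabilistic model, this either returns text that is verifiably in the corpus or returns nothing, which is the transparency property the talk emphasizes.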
12:05 PM - 12:25 PM
Vivian Peng & David Cyprian The Rockefeller Foundation & Rootwise
Using Large Language Models in Production: Hype vs Reality ...
From Hugging Face to ChatGPT, we have so many large language models (LLMs) these days that they are quickly making the days of hand-labeling data obsolete. How do we evaluate which models to use and understand the tradeoffs of each? In this session, we’ll walk through the different LLMs we deployed for misinformation management with the WHO and how to evaluate and scale the solution. -
12:25 PM - 01:35 PM
Lunch & Networking
-
01:35 PM - 01:55 PM
Marck Vaisman Sr. Cloud Solutions Architect @ Microsoft
Rockin' R with VSCode ...
Learn how to set up Visual Studio Code to use with R, both on your local workstation and on Azure Machine Learning. We’ll show which R packages you need to install in your R environment, which VSCode extensions you need, and additional configuration options, and we’ll walk through an end-to-end example using R, VSCode, and Azure Machine Learning. -
02:00 PM - 02:20 PM
Tommy Jones CEO @ Foundation
R-Squared for Multidimensional Outcomes ...
The coefficient of determination---R-squared---is the most popular goodness-of-fit metric for linear models. Its appeal is so strong that nearly all statistical software reports it by default when fitting linear models. While several other pseudo-R-squared measures have been developed for other use cases, to our knowledge this research is the first to propose a variation of R-squared for models predicting an outcome in multiple dimensions. Multidimensional outcomes occur in settings such as modeling simultaneous equations, modeling multivariate distributions, or topic modeling of text. Our R-squared relies on a geometric interpretation of the standard definition of R-squared and is thus an extension of the goodness-of-fit metric we all know and love. -
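As a baseline, the familiar unidimensional R-squared that the talk generalizes can be written in a few lines of R (this is the standard textbook definition, not the authors' multidimensional variant):

```r
# Standard coefficient of determination: 1 - SS_residual / SS_total
r_squared <- function(y, y_hat) {
  1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
}

r_squared(c(1, 2, 3, 4), c(1.1, 1.9, 3.2, 3.8))  # ~0.98, a good fit
```

The geometric reading (residual length relative to total length around the mean) is what makes an extension to outcomes living in multiple dimensions natural.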
02:25 PM - 02:45 PM
Gus Lipkin Data Scientist @ Lander Analytics
Your Maps Might Be Lying to You ...
Mapping international borders can be tricky business. We'll dive into the messy world of inaccurate data sources, choosing base maps, and other layer problems. We'll also talk about why obsessing over perfect data might not be as important as it seems. -
02:45 PM - 03:15 PM
Break & Networking
-
03:15 PM - 03:35 PM
Alex Gurvich Senior Graphics Designer & Data Visualization Specialist @ NASA's Science Visualization Studio
Storytelling with Data at NASA's Earth Information Center ...
Exploratory data visualization is a crucial element for building intuition about complex datasets. Numerous tools and approaches exist for efficiently summarizing data in order to extract key insights. However, these visualizations are not always optimized for communicating the final results. In this talk, I will share my experience as a data visualization specialist at NASA's Science Visualization Studio developing content for the new Earth Information Center, and discuss the key differences between exploratory and explanatory data visualization. I will also provide helpful tips for making effective explanatory visualizations and share resources for continuing to learn best practices in data storytelling, an emerging approach to data visualization and communication. -
03:40 PM - 04:00 PM
Aayushi Verma Data Science Fellow @ Institute for Defense Analyses
From Data to Collaboration: Connecting Our Researchers with R and Shiny ...
We present a case study of how R and Shiny are enabling our company to gain new insights into our research, and connect researchers with each other. -
04:00 PM - 04:10 PM
Closing Remarks
Sponsors