Gov R Conference
Workshops: October 18 | Location: Georgetown University
Conference: October 19-20 | Location: Georgetown University
Speakers


Abigail Haddad
Lead Data Scientist
Capital Technology Group
Talk: What Job Is This, Anyway?: Using LLMs to Classify USAJobs Data Scientist Listings


Melissa Albino Hegeman
Marine Fisheries Data Manager
NYSDEC
Talk: It Works on My Machine (Reproducibility in R for Small Teams)

Selen Stromgren
Associate Director
U.S. Food and Drug Administration
Talk: Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans (Joint Talk with Danielle Larese)


Irena Papst
Senior Scientist
Public Health Agency of Canada
Talk: From Scripts to Pipelines with Targets

Gary Harki
Investigations Editor
Bloomberg Industry Group
Talk: Using Open Records Laws to get Data from the Government (and When to Sue)

Jared P. Lander
Chief Data Scientist
Lander Analytics
Talk: I Wrote this Talk with an LLM

Vivian Peng
Lead Data Scientist, Innovation
The Rockefeller Foundation
Talk: Using Large Language Models in Production: Hype vs Reality (Joint Talk with David Cyprian)

George Perrett
Director of Research and Data Analysis
New York University
Talk: stan4bart: Harnessing the Power of Stan and the Flexibility of Machine Learning

Danielle Larese
Scientific Coordinator (Chemist)
U.S. Food and Drug Administration
Talk: Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans (Joint Talk with Selen Stromgren)

Alex Gurvich
Senior Graphics Designer & Data Visualization Specialist
NASA's Science Visualization Studio
Talk: Storytelling with Data at NASA's Earth Information Center

Soubhik Barari
Quantitative Social Scientist
NORC
Talk: LocalView: Scaling up the Analytics of Local Politics with R

Dusty Turner
Major
United States Army & Baylor
Talk: World Leaders, Military Service, and Their Propensity for War

Rhys O'Neill
Innovations and Technology Lead - AIRA
World Health Organization
Talk: Democratizing Misinformation Management



Benjy Braun
Vice President
Data Solutions and Innovation
Talk: You Don't See with Your Eyes, You Perceive with Your Mind: Sight, Psychology, and Data Visualization


David Cyprian
Partner
Rootwise
Talk: Using Large Language Models in Production: Hype vs Reality (Joint Talk with Vivian Peng)
More speakers coming soon
Workshops

Introduction to Natural Language Processing
Hosted by William E J Doane
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person & Virtual Ticket Options) Unstructured and loosely structured textual data is commonly used in public policy analyses to wrangle the vast amount of information available from open (and not so open) sources. This workshop will use R to acquire data from various sources, clean and standardize it, and explore it for insights that can inform public policy discussions. Basic visualizations will be considered to help communicate stories about the collected data. Dr. Doane is a science and technology policy researcher in Washington DC with a background teaching computer science and information science at the university level. https://DrDoane.com/about/cv
Generative AI for Better Code
Hosted by Abigail Haddad & Benjy Braun
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person & Virtual Ticket Options) This one-day workshop focuses on how GPT-4 can help reduce technical debt in your R projects, whether you’re doing analysis, automation, or data science. Technical debt is the 'debt' you accumulate when you write code and build tools quickly, but which later slows you down when you try to add functionality. We'll go from writing "code that runs" to "code you can build on": code that’s modular, documented, and on GitHub. The workshop is structured into four parts:
- Code Refactoring: We take code and show you how to make it modular by putting it in functions and structuring it to be easier to run, debug, and build on.
- Documentation: The next step is about making your code easy to understand and use. We will show you how to create clear and thorough documentation at both the project and function level.
- Unit Testing: We’ll guide you through creating unit tests. These formal checks ensure your functions operate as expected, so when you modify your code you’re less reliant on ad hoc testing.
- Version Control: The final step involves using git for local version control and GitHub for collaboration. Even if you’re already a git user, ChatGPT can help you write commands for less-used tasks and debug your error messages.
If GitHub's Copilot is available in RStudio, we will also discuss how you can use this tool to generate code. At the end of the workshop, you will have transformed code that runs into modular, well-documented code stored in a Git repository. You will understand how large language models like GPT-4 can help you create code that not only does what it's supposed to do but is also easy to work with and build on. This knowledge will be useful for your ongoing analytical work and future development projects.
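The refactor-document-test cycle the workshop describes can be sketched in a few lines of base R; the function, column names, and toy data below are illustrative, not workshop materials:

```r
# Before: a one-off script line such as
#   totals <- aggregate(amount ~ region, df, sum)
# After: a small, documented, reusable function.

#' Sum a numeric column within groups
#' @param df a data frame
#' @param value,group column names, as strings
sum_by <- function(df, value, group) {
  out <- aggregate(df[[value]], by = list(df[[group]]), FUN = sum)
  setNames(out, c(group, value))
}

# A quick unit test (plain stopifnot here; the workshop covers formal unit tests)
toy <- data.frame(region = c("a", "a", "b"), amount = c(1, 2, 5))
res <- sum_by(toy, "amount", "region")
stopifnot(identical(res$amount, c(3, 5)))
```

The function is now easy to document, test, and commit, which is exactly the "code you can build on" end state the workshop aims for.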
Causal Inference + BART
Hosted by George Perrett
Wednesday, Oct 18 | 9:00am - 5:00pm
(In-person Ticket Option Only) This workshop will introduce Bayesian Additive Regression Trees (BART) as a tool for causal inference. BART is a machine learning algorithm with applications in both randomized and observational studies. No prior experience with causal inference or machine learning is expected. By the end of this workshop you will have hands-on experience fitting BART models for causal inference. You will be able to articulate the main ideas of BART and communicate the advantages of BART models for causal inference and the underlying assumptions of these models. This workshop will begin with an introduction to causal inference and BART. You'll learn the basics of what causal inference is and why it matters. We'll then cover the intuition behind BART and why it is a desirable tool for causal inference. After this introduction, I'll cover the applications of BART in randomized studies. Randomized studies are the "gold standard" of causal inference, but the choice of model remains consequential. I will compare BART to other causal inference strategies in randomized studies, and you will learn about using BART to uncover treatment effect moderators and heterogeneous treatment effects. Randomized studies are not always practical or even possible, so we'll extend the use of BART for causal inference to observational studies. Participants will learn about the advantages of BART for observational studies and gain hands-on experience working with observational data where individuals have self-selected into or out of the treatment in question. In public policy and governmental settings, data can be clustered and non-independent: individuals in a dataset may share a Congressional district or state, or attend a common hospital or school. The workshop will end with extensions of the BART method for working with non-independent data in these settings. This course is for you if you are interested in learning more about causal inference and implementing a cutting-edge machine learning method used by experts in causal inference. This course assumes a basic understanding of R.
Agenda
Wednesday, Oct 18
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: William E J Doane Research Staff Member @ IDA Science & Technology Policy Institute
Introduction to Natural Language Processing
-
09:00 AM - 05:00 PM
Workshop: George Perrett Director of Research and Data Analysis @ New York University
Causal Inference + BART
-
09:00 AM - 05:00 PM
Workshop: Abigail Haddad & Benjy Braun
Generative AI for Better Code
Thursday, Oct 19
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 09:00 AM
Opening Remarks
-
09:00 AM - 09:20 AM
Irena Papst Senior Scientist @ Public Health Agency of Canada
From Scripts to Pipelines with Targets
Do you ever find yourself starting with a simple analysis script, only to end up wrangling a thousand-line behemoth? Are you sick of wasting time re-running long scripts from start to finish just to make sure everything is up to date? Are you haphazardly saving objects to file because they take a long time to generate? There’s got to be a better way! Enter targets, an R package used to build reproducible, efficient, and scalable pipelines. In this talk, I’ll introduce the targets package and share how I’ve used it to streamline my work modelling infectious disease spread at the Public Health Agency of Canada.
-
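A minimal pipeline of the kind this talk describes might look like the sketch below; the file names and target names are hypothetical, not taken from the speaker's work:

```r
# _targets.R -- a minimal {targets} pipeline definition (illustrative names)
library(targets)
tar_option_set(packages = c("readr", "dplyr"))

list(
  # format = "file" tells targets to watch the file's contents for changes
  tar_target(raw_file, "cases.csv", format = "file"),
  tar_target(cases, readr::read_csv(raw_file)),
  tar_target(case_counts, dplyr::count(cases, region))
)
```

Running `tar_make()` builds the pipeline; on later runs, only targets whose upstream inputs changed are rebuilt, and `tar_read(case_counts)` pulls a cached result.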
09:25 AM - 09:45 AM
Gary Harki Investigations Editor @ Bloomberg Industry Group
Using Open Records Laws to get Data from the Government (and When to Sue)
Gary will break down how to effectively use open records laws to get data from local, state, and federal agencies. He'll talk about the hurdles you encounter and how to overcome them.
-
09:50 AM - 10:10 AM
Jon Schwabish Founder and CEO @ PolicyViz
-
10:10 AM - 10:40 AM
Break & Networking
-
10:40 AM - 11:00 AM
Abigail Haddad Lead Data Scientist @ Capital Technology Group
What Job Is This, Anyway?: Using LLMs to Classify USAJobs Data Scientist Listings
Navigating the federal job market begins with finding appropriate job listings. But for data professionals, discrepancies often arise between the content of the listing - that is, the duties of the job - and either the job title or the occupational code, making this step more difficult. In this presentation, I discuss using a Large Language Model (LLM) to generate new job titles for listings in occupational code 1560, Data Science. I'll show examples of listings with mismatches between the official job title and the one generated by GPT-3.5 and discuss the potential uses of this for applicants and agencies. I'll also highlight the advantages of using Marvin, a library that lets you use LLMs to solve Natural Language Processing problems by just writing documentation rather than code.
-
11:05 AM - 11:25 AM
Jared P. Lander Chief Data Scientist @ Lander Analytics
I Wrote this Talk with an LLM
We have all seen LLMs do data analysis; I even gave a talk about using an LLM to write an R package. But now I have used an LLM to write these slides: everything from creating the outline, to fleshing out ideas, to writing the actual markdown. Let's see how it goes.
-
11:30 AM - 11:50 AM
Soubhik Barari Quantitative Social Scientist @ NORC
LocalView: Scaling up the Analytics of Local Politics with R
Never before have there been more tools and resources for political data science, yet in 2023, there are shockingly few resources for analysts of local politics - one of the central pillars of American democracy. In this talk, I introduce LocalView, a database of over 100,000 local government public meetings with a dashboard that enables real-time text analytics on issues such as climate change and LGBTQ rights. I show how this database (and accompanying dashboard) was built drawing on tools such as the tidyverse, R Shiny, quanteda, and duckdb. Finally, I show how LocalView can be useful for social science and journalistic applications such as measuring political polarization in local politics and tracking shifts in public health attention across geography.
-
11:50 AM - 01:00 PM
Lunch & Networking
-
01:00 PM - 01:20 PM
David Meza Head of Analytics – Human Capital, Branch Chief People Analytics @ NASA
-
01:25 PM - 01:45 PM
David Shor Head of Data Science @ Blue Rose Research
Data, Surveys, and US Politics
A walkthrough of the state of the art in data science and US politics.
-
01:45 PM - 02:15 PM
Break & Networking
-
02:15 PM - 02:35 PM
Marck Vaisman Sr. Cloud Solutions Architect @ Microsoft
Rockin' R with VSCode
Learn how to set up Visual Studio Code to use with R, both on your local workstation and on Azure Machine Learning. We’ll cover which R packages you need to install in your R environment, which VSCode extensions you need, and additional configuration options, and we’ll show an end-to-end example using R, VSCode, and Azure Machine Learning.
-
02:40 PM - 03:00 PM
Melissa Albino Hegeman Marine Fisheries Data Manager @ NYSDEC
It Works on My Machine (Reproducibility in R for Small Teams)
Working collaboratively in R can be a lot of fun, but it can also be tricky to get started. A combination of GitHub, renv, and custom packages can help improve reproducibility, reduce stress, and lighten everyone's workload. I've made mistakes and hit roadblocks when implementing these tools within a team. But I've also learned a lot along the way. I'll share my experiences and tips so you can avoid the same mistakes and start on the right foot.
-
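The renv part of the workflow this talk mentions comes down to three calls; this is a sketch of the standard {renv} workflow, not the speaker's exact setup:

```r
# In the project the team shares on GitHub:
renv::init()      # create a project-local package library and an renv.lock file
# ...install or update packages as usual...
renv::snapshot()  # record the exact package versions in renv.lock; commit it with the code

# A teammate who clones the repository then runs:
renv::restore()   # reinstall exactly the versions recorded in renv.lock
```

The lockfile travels with the repository, so "it works on my machine" becomes "it works on every machine with the same renv.lock".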
03:05 PM - 03:25 PM
Dusty Turner Major @ United States Army & Baylor
World Leaders, Military Service, and Their Propensity for War
-
03:25 PM - 03:55 PM
Break & Networking
-
03:55 PM - 04:15 PM
TBD
-
04:20 PM - 04:40 PM
TBD
-
04:40 PM - 04:50 PM
Closing Remarks
-
05:00 PM - 07:00 PM
Happy Hour at Clubhouse
Data Happy Hour at Clubhouse - Hosted by Data Science DC
Take a break from your keyboard and join us at Clubhouse in Georgetown for this Data Science DC Happy Hour. Come socialize and network with fellow data scientists, analysts, software engineers, and other data enthusiasts. A range of non-alcoholic drinks will be supplied, with alcoholic beverages available for purchase. RSVP HERE!
Friday, Oct 20
-
09:00 AM - 09:50 AM
Registration & Breakfast
-
09:50 AM - 10:00 AM
Opening Remarks
-
10:00 AM - 10:20 AM
TBD
-
10:25 AM - 10:45 AM
George Perrett Director of Research and Data Analysis @ New York University
stan4bart: Harnessing the Power of Stan and the Flexibility of Machine Learning
Data is often organized within social systems: people live in cities that are in counties that are in states. Nested data often violates the independence assumptions inherent to most statistical and machine learning methods. Multilevel models are a popular solution for accounting for these dependencies, but they make rigid parametric assumptions about the linearity of the data. stan4bart is a new type of multilevel model that combines the flexibility of machine learning with the robust inference of traditional multilevel models. stan4bart has applications for both prediction and inference problems, and my talk will introduce the method and its utility in educational and public policy domains.
-
10:45 AM - 11:15 AM
Break & Networking
-
11:15 AM - 11:35 AM
Benjy Braun Vice President @ Data Solutions and Innovation
You Don't See with Your Eyes, You Perceive with Your Mind: Sight, Psychology, and Data Visualization
Inspired by Stephen Few's "Show Me the Numbers," this talk delves into the psychology of data visualization. We'll start by briefly exploring how the eye-brain interaction affects what we 'see' in a graph. The focus then shifts to the key Gestalt principles of design - proximity, similarity, enclosure, closure, continuity, and connection - that serve as the backbone of effective data visualization. We'll wrap up by critiquing poorly executed visualizations and discuss how to improve them using these principles. Attendees will leave with practical insights into making their data not just viewable, but truly 'seen'.
-
11:40 AM - 12:00 PM
Selen Stromgren & Danielle Larese U.S. Food and Drug Administration
Deterministic Extraction vs. Probabilistic Extrapolation: A Pilot for R-Enabled Augmentation of Information Retrieval by Humans
Large language model AI systems are taking off at a dizzying speed, and end users are trying to ascertain which output can be trusted, and to what degree. More recently, machine learning experts have pivoted to “refining” large language models with focused sets of data, training the AI tool with a topic-specific corpus to increase the accuracy and reliability of the output. Examples of such “subject matter expert” AI systems are pharmaGPT, bioGPT, etc. However, the AI system itself still remains a black box to the end user. In this talk, we will present a pilot idea where we explore a very deterministic approach to extracting information from a well-defined corpus using R. Our approach is completely transparent to the end user, includes neither extrapolation nor probability-based guessing, and produces an output only if the specific answer to the question posed is present in the reference corpus. If successful, such an approach can allow users to create their own R code using different corpus inputs, with the ultimate goal of automating and expediting information retrieval on the go with full accuracy.
-
12:05 PM - 12:25 PM
TBD
-
12:25 PM - 01:35 PM
Lunch & Networking
-
01:35 PM - 01:55 PM
Alex Gold Solutions Engineer @ Posit
Learn to Love Logging
Good logging practice makes the software development parts of data science easier and more fun. Learn how to add logging to your apps, projects, and reports.
-
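For a flavor of the topic, a logging call can be as small as a wrapper around cat(); this is a base-R sketch, not the talk's material, and dedicated packages such as {log4r} add levels, thresholds, and file appenders:

```r
# A minimal base-R logging helper; real projects would typically use a logging package
log_msg <- function(level, msg) {
  cat(sprintf("[%s] %s %s\n", level, format(Sys.time(), "%Y-%m-%d %H:%M:%S"), msg))
}

log_msg("INFO", "Rendering report")
log_msg("WARN", "Cache is stale; recomputing")
```

Even this much turns a silent script into one whose progress and failures can be read back from a console transcript or a log file.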
02:00 PM - 02:20 PM
Tommy Jones CEO @ Foundation
R-Squared for Multidimensional Outcomes
The coefficient of determination - R-squared - is the most popular goodness-of-fit metric for linear models. Its appeal is so strong that nearly all statistical software reports it by default when fitting linear models. While several other pseudo R-squared measures have been developed for other use cases, to our knowledge our research is the first to propose a variation of R-squared for models predicting an outcome in multiple dimensions. Multidimensional outcomes occur in settings such as modeling simultaneous equations, modeling multivariate distributions, or topic modeling of text. Our R-squared relies on a geometric interpretation of the standard definition of R-squared and is thus an extension of the goodness-of-fit metric we all know and love.
-
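For orientation, the standard one-dimensional definition that the talk generalizes can be computed by hand in base R; this shows the textbook quantity only, not the authors' multidimensional extension:

```r
# Standard R-squared, computed manually and checked against lm()'s own summary
fit <- lm(mpg ~ wt, data = mtcars)
ss_res <- sum(resid(fit)^2)                        # residual sum of squares
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares
r2 <- 1 - ss_res / ss_tot
stopifnot(isTRUE(all.equal(r2, summary(fit)$r.squared)))
```

For least squares with an intercept, this ratio equals the squared length of the centered fitted values relative to the centered outcome, which is the kind of geometric reading the abstract refers to.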
02:25 PM - 02:45 PM
Jake Dyal President @ Certus Group
Organizational Effects Driven by Ontologies
How organizations can bake their goals into data structures to enable decision-making for more aligned organizational impact.
-
02:45 PM - 03:15 PM
Break & Networking
-
03:15 PM - 03:35 PM
Alex Gurvich Senior Graphics Designer & Data Visualization Specialist @ NASA's Science Visualization Studio
Storytelling with Data at NASA's Earth Information Center
Exploratory data visualization is a crucial element for building intuition about complex datasets. Numerous tools and approaches exist for efficiently summarizing data in order to extract key insights. However, these visualizations are not always optimized for communicating the final results. In this talk, I will share my experience as a data visualization specialist at NASA's Science Visualization Studio developing content for the new Earth Information Center, and discuss the key differences between exploratory and explanatory data visualization. I will also provide helpful tips for making effective explanatory visualizations and share resources for continuing to learn about best practices in data storytelling, a new approach to data visualization and communication.
-
03:40 PM - 04:00 PM
TBD
-
04:00 PM - 04:10 PM
Closing Remarks
Sponsors