NY R Conference
Thanks for attending the 2023 New York R Conference! Check out the Recap Blog & Videos.
Stay tuned for 2024 Conference details!
Speakers



Wes McKinney
CTO & Co-founder
Voltron Data
Talk: Leveling Up the Data Stack: Thoughts on the Last 15 Years

Ayanthi Gunawardana
Senior Data Analyst
1-800-FLOWERS.COM, Inc.
Talk: CaRtography: Creating Accurate and Beautiful Maps in R


Molly Huie
Team Lead, Data Analysis & Surveys
Bloomberg Industry Group
Talk: How to Interrogate Data Like a Journalist (Joint talk with Andrew Wallender)

Jared P. Lander
Chief Data Scientist
Lander Analytics
Talk: Building an R Package with LLMs

Emily Riederer
Senior Manager of Data Science & Analytics
Capital One
Talk: Column Names as Contracts - Inspired by dplyr, Available in dbt

Mitchell O'Hara-Wild
Data Scientist
Nectric
Talk: From Forecast to Fable, Design Decisions for Statistical Software


Jessica Duncan
Marketing Data Scientist
Greenlight
Talk: Give Credit Where Credit Is Due: Data-Driven Approach to Marketing Channel Attribution

Emil Hvitfeldt
Software Engineer
Posit
Talk: Slidecraft: The Art of Creating Pretty Presentations

Saar Golde
Chief Data Scientist
Via
Talk: Evaluating Microtransit's Impact on Congestion

Caterina Constantinescu
Principal Consultant
GlobalLogic
Talk: Deconstructing LLM Use: Key Considerations to Deliver Custom Solutions

Matt Dupree
Founder
EXORVA
Talk: OpenAI's Embeddings are Cooler than ChatGPT: An Intro to using OpenAI's Embeddings API

Mike Band
Sr. Manager, Research & Analytics
NFL Next Gen Stats
Talk: The Many Models in Production at NFL Next Gen Stats

Rick Saporta
SVP of Data
Entera
Talk: Data Product Management for Data Science: How to Answer Every Team's Most Important Question

Ryan Klein
Principal IT Data Scientist
Continental Resources
Talk: Using Plumber to Expose Models In Excel

Chrys Wu
Consultant & Community Builder
Matchstrike
Talk: How to Win Friends and Influence Product Managers

George Perrett
Director of Research and Data Analysis
New York University
Talk: Bayesian Boosting

Daniel Chen
Post-Doc Research and Teaching Fellow & Data Science Educator
University of British Columbia & Lander Analytics
Talk: Moving to Quarto from RMarkdown and Python Jupyter Notebooks

Andrew Wallender
Investigative Data Reporter
Bloomberg Industry Group
Talk: How to Interrogate Data Like a Journalist (Joint talk with Molly Huie)
Workshops

Tidy Time Series and Forecasting in R
Hosted by Mitchell O'Hara-Wild
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series. On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course). Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related. (In-Person Only)
Machine Learning in R
Hosted by Max Kuhn
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. This is not a deep learning course and will focus on tabular data. Pre-requisites: some experience with modeling in R and the tidyverse (don't need to be experts); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository. (In-Person & Virtual)
Bayesian Data Analysis and Stan
Hosted by Jonah Gabry
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm

Causal Inference in R
Hosted by Malcolm Barrett & Lucy D'Agostino McGowan
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores and inverse probability weighting. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. This course is for you if you:
- Know how to fit a linear regression model in R
- Have a basic understanding of data manipulation and visualization using tidyverse tools
- Are interested in understanding the fundamentals behind how to move from estimating correlations to causal relationships
(In-Person & Virtual)
Agenda
Tuesday, Jul 11
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: Mitchell O'Hara-Wild Research Assistant @ Monash University
Tidy Time Series and Forecasting in R
It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series. On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course). Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related. (In-Person Only) -
09:00 AM - 05:00 PM
Workshop: Max Kuhn Scientist @ Posit
Machine Learning in R
Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. This is not a deep learning course and will focus on tabular data. Pre-requisites: some experience with modeling in R and the tidyverse (don't need to be experts); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository. (In-Person & Virtual) -
09:00 AM - 05:00 PM
Workshop: Jonah Gabry Researcher @ Columbia University
Bayesian Data Analysis and Stan
This workshop will introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Participants will learn to write their own models in the Stan language, run them in R, and use a variety of R packages to work with the results. (In-Person & Virtual) -
09:00 AM - 05:00 PM
Workshop: Malcolm Barrett & Lucy D'Agostino McGowan Data Science Educator @ Posit
Causal Inference in R
In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores and inverse probability weighting. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. This course is for you if you:
- Know how to fit a linear regression model in R
- Have a basic understanding of data manipulation and visualization using tidyverse tools
- Are interested in understanding the fundamentals behind how to move from estimating correlations to causal relationships
(In-Person & Virtual)
Wednesday, Jul 12
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: Mitchell O'Hara-Wild Research Assistant @ Monash University
Tidy Time Series and Forecasting in R
-
09:00 AM - 05:00 PM
Workshop: Max Kuhn Scientist @ Posit
Machine Learning in R
-
09:00 AM - 05:00 PM
Workshop: Jonah Gabry Researcher @ Columbia University
Bayesian Data Analysis and Stan
-
09:00 AM - 05:00 PM
Workshop: Malcolm Barrett & Lucy D'Agostino McGowan Data Science Educator @ Posit
Causal Inference in R
Thursday, Jul 13
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 09:00 AM
Opening Remarks
-
09:00 AM - 09:20 AM
Emil Hvitfeldt Software Engineer @ Posit
Slidecraft: The Art of Creating Pretty Presentations
Do you want to make slides that catch the eye of the room? Are you tired of using defaults when making slides? Are you ready to spend every last hour of your life fiddling with CSS and JS? Then this talk is for you! Making slides with Quarto and revealjs is a breeze and comes with many tools and features. This talk gives an overview of how you can improve the visuals of your slides with the highest effect-to-effort ratio. -
09:25 AM - 09:45 AM
Matt Dupree Founder @ EXORVA
OpenAI's Embeddings are Cooler than ChatGPT: An Intro to using OpenAI's Embeddings API
There's been a lot of talk about ChatGPT, but not enough talk about OpenAI's embedding models. Embeddings are a language model's representation of the meaning of text, and in this talk, we cover how we can use OpenAI's embeddings API to solve classification and recommendation problems. We'll also cover how it can be used to intelligently search through documents and any other kind of text. I'll end with a quick demo showing how I'm using embeddings to search across application copy to help users find out how to do things within the software they use. -
09:50 AM - 10:10 AM
Jessica Duncan Marketing Data Scientist @ Greenlight
Give Credit Where Credit Is Due: Data-Driven Approach to Marketing Channel Attribution
Knowing which step in the customer journey is most influential in driving a conversion is crucial to optimizing marketing efficiency. Traditional approaches assign all credit to the first touch or last touch; newer rule-based approaches, such as linear, U-shaped, and time-decay, apply formulaic assignment of credit to touchpoints depending on the order they occur in. Using Markov chain modeling techniques, we can arrive at a more robust, algorithmic understanding of the steps in our customer journey. -
10:10 AM - 10:40 AM
Break
-
10:40 AM - 11:00 AM
Mitchell O'Hara-Wild Data Scientist @ Nectric
From Forecast to Fable, Design Decisions for Statistical Software
A well-designed interface is instrumental in making software easy to learn and use. The design of statistical software is inherently subjective, and there are many difficult decisions involved in creating interfaces that work cohesively within the intended domain. In this talk, I will examine the design decisions made when creating fable, the tidy time series forecasting successor of the widely renowned forecast package. -
11:05 AM - 11:25 AM
Jared P. Lander Chief Data Scientist @ Lander Analytics
Building an R Package with LLMs
Can an LLM build an entire R package? We are going to prompt engineer our way to a working package. We are going to use a series of prompts to first build functions and write the roxygen documentation. After that we'll ask it to provide the steps for creating the package scaffolding such as the DESCRIPTION file and folder structure. Then we will have the LLM write unit tests, something that often falls by the wayside. We'll see how quickly the LLM can do all this for us as opposed to using the standard package building tools. -
11:30 AM - 11:50 AM
Daniel Chen Post-Doc Research and Teaching Fellow & Data Science Educator @ University of British Columbia & Lander Analytics
Moving to Quarto from RMarkdown and Python Jupyter Notebooks
-
11:50 AM - 01:00 PM
Lunch
-
01:00 PM - 01:20 PM
Mike Band Sr. Manager, Research & Analytics @ NFL Next Gen Stats
The Many Models in Production at NFL Next Gen Stats
Since its inception in 2016, the NFL's Next Gen Stats group has revolutionized football statistics. Through the utilization of player tracking data, NGS has developed a series of innovative metrics, many of which are powered by distinct machine learning models. Each model delves into a unique facet of the game, contributing to comprehensive metrics that can evaluate the performance of not only individual players but entire teams and beyond. From Completion Probability to Expected Rushing Yards and the intuitive Fourth Down Decision Guide, I'll guide you on a fascinating journey through the many machine learning models in production at Next Gen Stats. -
01:25 PM - 01:45 PM
Saar Golde Chief Data Scientist @ Via
Evaluating Microtransit's Impact on Congestion
Traffic congestion is a perennial issue in transportation, with annual traffic delays totalling over 100 hours per person in many major cities around the world before COVID. Improving the impact of local transportation on emissions and congestion is a key motivation for many transportation providers that turn to Via’s technology as the backbone of their on-demand offerings. In this work we demonstrate a novel methodology for evaluating the impact of microtransit services on congestion and emissions at a very granular level. -
01:45 PM - 02:15 PM
Break
-
02:15 PM - 02:35 PM
Bob Rudis V.P. Research & Data Science @ GreyNoise Intelligence
Into the WebR-Verse
In early 2022, intrepid scientist Dr. George Stagg created the first WebAssembly (WASM) version of R — dubbed "WebR" — and captured the imagination of scores of RStats enthusiasts. One year later, WebR 0.1.x has been unleashed, and has expanded the R universe to every browser on every device across the galaxy. In this session, we'll take a WebR-slinging journey into and through the WebR-Verse, explaining what it is, the heroic efforts taken to bring it to life, why it is a game-changer for R, and show you practical examples of how to tap into the potential of this amazing new technology, and sling WebR apps of your own. -
02:40 PM - 03:00 PM
Molly Huie & Andrew Wallender Bloomberg Industry Group
How to Interrogate Data Like a Journalist
This talk will explore how R can be used to produce data-driven news stories and graphics. We’ll explore best practices, important questions to ask of data, and how to better communicate complicated topics for a mass audience. -
03:05 PM - 03:25 PM
Chrys Wu Consultant & Community Builder @ Matchstrike
How to Win Friends and Influence Product Managers
Among the many people at a company who need data, you may find yourself working with product managers. Creating an effective partnership will be the key to your mutual success. In this talk, I’ll share some practices for forming the initial bond, getting on the same page, and ensuring your work brings value. -
03:25 PM - 03:55 PM
Break
-
03:55 PM - 04:15 PM
Ayanthi Gunawardana Senior Data Analyst @ 1-800-FLOWERS.COM, Inc.
CaRtography: Creating Accurate and Beautiful Maps in R
One of the more niche areas of data science is geographic data science, or the art of using geographic information to derive and present location-specific insights. This talk will cover basic geospatial concepts and data formats, the essential elements of a map, how to import geospatial data into R, the types of geospatial packages used to manipulate this data, and how to accurately present this data on a static map for exploratory and presentation purposes. Participants will learn what makes a map misleading and how to ensure their analysis shows accurate insights and is easy for users to understand. -
04:20 PM - 04:40 PM
Rick Saporta SVP of Data @ Entera
Data Product Management for Data Science: How to Answer Every Team's Most Important Question
Each data team is different in its own way, but all successful data teams have one core thing in common: they know how to handle the many non-data parts needed for the success of data-work. In this talk, we'll discuss Data Product Management approaches that increase the likelihood of data initiatives achie -
04:40 PM - 04:50 PM
Closing Remarks
-
04:50 PM - 06:30 PM
Happy Hour
Friday, Jul 14
-
09:00 AM - 09:50 AM
Registration & Breakfast
-
09:50 AM - 10:00 AM
Opening Remarks
-
10:00 AM - 10:20 AM
George Perrett Director of Research and Data Analysis @ New York University
Bayesian Boosting
Bayesian Additive Regression Trees (BART) is a powerful machine learning algorithm that combines the power of Boosted Regression Trees and Bayesian Inference. BART is well-suited for both causal inference and prediction problems. In this talk, I will provide an overview of BART, explain how it works, and discuss its benefits for various applications. BART requires almost no tuning, includes built-in prediction and credible intervals, and can be extended to account for non-independent data structures. BART is implemented in the dbarts family of R packages and included in the tidymodels framework, and I'll discuss how this powerful class of models can be easily utilized with R! -
10:25 AM - 10:45 AM
Hamdan Azhar Founder @ PRISMOJI
An Ode to Permissionless Data Science
In the 11 years since Harvard Business Review called data scientist the “sexiest job of the 21st century”, the field has exploded in both popularity and societal impact. In this talk, I argue that despite seemingly fierce competition in the job market, the barriers to entry to becoming a data scientist have never been lower. Data is all around us and the tools to understand this data are ubiquitous and more powerful than ever. Through sharing stories from my journey working with data in political campaigns, emojis in social networks, and news deserts, I’ll demonstrate that the best way to become a data scientist is by doing data science. -
10:45 AM - 11:15 AM
Break
-
11:15 AM - 11:35 AM
Ryan Klein Principal IT Data Scientist @ Continental Resources
Using Plumber to Expose Models In Excel
Getting useful models into users' hands can be a real game-changer for many organizations. While Shiny is a great option to allow user interactivity, sometimes users want to work in a spreadsheet. This talk will guide the audience through linking a Microsoft Excel workbook to a plumber R script via API to allow for the analysis of different combinations of variables and outputs. -
11:40 AM - 12:00 PM
Caterina Constantinescu Principal Consultant @ GlobalLogic
Deconstructing LLM Use: Key Considerations to Deliver Custom Solutions
The rate of advancement in AI research (and LLMs specifically) won't have escaped anybody's attention by now. This degree of progress naturally lends itself to hot takes, memes and dramatisation, when reality is much more nuanced. This talk will explore why AI won't 'take away our jobs' just yet, by discussing some of the real-world constraints and customisations still required. For instance, licensing, privacy and data ownership are legitimate talking points before large-scale adoption in industry. In addition, data collection/scraping for model fine-tuning might also be non-trivial to implement depending on the specific scenario at hand, and experience/UI design may similarly require considerable conscious thought before LLM deployment / integration. In this talk, I will discuss these issues (and more!) to highlight that LLM use involves a high degree of subtlety and customisation, acting as counterweights to now-common hyperbole on the topic. -
12:05 PM - 12:25 PM
Emily Riederer Senior Manager of Data Science & Analytics @ Capital One
Column Names as Contracts - Inspired by dplyr, Available in dbt
dplyr’s select helpers exemplify how the tidyverse uses opinionated design to push users into the pit of success. The ability to efficiently operate on names incentivizes good naming patterns and creates efficiency in data wrangling and validation. However, in a polyglot world, users may find they must leave the pit when comparable syntactic sugar is not accessible in other languages like Python and SQL. In this talk, I will explain how dplyr’s select helpers inspired my approach to ‘column name contracts’, how good naming systems can help supercharge data management with packages like {dplyr} and {pointblank}, and my experience building the {dbtplyr} package to port this functionality to dbt for building complex SQL-based data pipelines. -
12:25 PM - 01:35 PM
Lunch
-
01:35 PM - 01:55 PM
Caitlin Hudon Data Scientist @ Figma
How to Make Decisions with Data
We'll walk through a five-step framework that has helped me to make hundreds of data-informed decisions during my career as a data scientist, and talk about how to make decisions more efficiently and effectively. -
02:00 PM - 02:20 PM
Wes McKinney CTO & Co-founder @ Voltron Data
Leveling Up the Data Stack: Thoughts on the Last 15 Years
In this talk, I will discuss some of my observations about data science tools and related computing infrastructure, both where we have come from and where we may be going in the coming years. I will connect these trends to different projects I’ve been involved with, such as pandas, Apache Arrow, Apache Parquet, Ibis, Substrait, and others. A particular focus will be on the themes of modularity and composability of system components. I will also touch on the rapid evolution of storage and computing hardware and how that may direct future development efforts in open source data software. -
02:25 PM - 02:45 PM
Max Kuhn Scientist @ Posit
The Post-Modeling Model to Fix the Model
It's possible to get a model that has good numerical performance but has predictions that are not really consistent with the data. Model calibration is a tool that can fix this. We'll show some examples of poor predictions and how different calibration tools can re-align them to the data. -
02:45 PM - 03:15 PM
Break
-
03:15 PM - 04:15 PM
Jon Krohn & Chris Wiggins SuperDataScience Podcast
SuperDataScience Podcast Live
-
04:15 PM - 04:25 PM
Closing Remarks
Sponsors