NY R Conference
Workshops: July 11-12 | Location: Columbia University
Conference: July 13-14 | Location: FIAF Manhattan
Speakers



Wes McKinney
CTO & Co-founder
Voltron Data
Talk: Leveling Up the Data Stack: Thoughts on the Last 15 Years


Ayanthi Gunawardana
Senior Data Analyst
1-800-FLOWERS.COM, Inc.
Talk: CaRtography: Creating Accurate and Beautiful Maps in R


Molly Huie
Team Lead, Data Analysis & Surveys
Bloomberg Industry Group
Talk: How to Interrogate Data Like a Journalist (Joint talk with Andrew Wallender)

Jared P. Lander
Chief Data Scientist
Lander Analytics
Talk: Building an R Package with LLMs


Emily Riederer
Senior Manager of Data Science & Analytics
Capital One
Talk: Column Names as Contracts - Inspired by dplyr, Available in dbt


Jessica Duncan
Marketing Data Scientist
Greenlight
Talk: Give Credit Where Credit Is Due: Data-Driven Approach to Marketing Channel Attribution

Matt Dupree
Founder
EXORVA
Talk: OpenAI's Embeddings are Cooler than ChatGPT: An Intro to using OpenAI's Embeddings API

Caterina Constantinescu
Principal Consultant
GlobalLogic
Talk: Deconstructing LLM Use: Key Considerations to Deliver Custom Solutions

Mike Band
Sr. Manager, Research & Analytics
NFL Next Gen Stats
Talk: The Many Models in Production at NFL Next Gen Stats

Ryan Klein
Principal IT Data Scientist
Continental Resources
Talk: Using Plumber to Expose Models In Excel

Daniel Chen
Post-Doc Research and Teaching Fellow & Data Science Educator
University of British Columbia & Lander Analytics

George Perrett
Director of Research and Data Analysis
New York University
Talk: Bayesian Boosting

Andrew Wallender
Investigative Data Reporter
Bloomberg Industry Group
Talk: How to Interrogate Data Like a Journalist (Joint talk with Molly Huie)
More speakers coming soon!
Workshops

Tidy Time Series and Forecasting in R
Hosted by Rob Hyndman
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series. On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course). Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related. (In-Person Only)
Machine Learning in R
Hosted by Max Kuhn
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. This is not a deep learning course and will focus on tabular data. Pre-requisites: some experience with modeling in R and the tidyverse (don't need to be experts); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository. (In-Person & Virtual)
Bayesian Data Analysis and Stan
Hosted by Jonah Gabry
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
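This workshop will introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Participants will learn to write their own models in the Stan language, run them in R, and use a variety of R packages to work with the results. (In-Person & Virtual)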
Causal Inference in R
Hosted by Malcolm Barrett & Lucy D'Agostino McGowan
Tue, Jul 11 - Wed, Jul 12 | 9:00am - 5:00pm
In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores and inverse probability weighting. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. This course is for you if you:
- Know how to fit a linear regression model in R
- Have a basic understanding of data manipulation and visualization using tidyverse tools
- Are interested in understanding the fundamentals behind how to move from estimating correlations to causal relationships
(In-Person & Virtual)
Agenda
Tuesday, Jul 11
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: Rob Hyndman
Tidy Time Series and Forecasting in R
It is common for organizations to collect huge amounts of data over time, and existing time series analysis tools are not always suitable to handle the scale, frequency and structure of the data collected. In this workshop, we will look at some packages and methods that have been developed to handle the analysis of large collections of time series. On day 1, we will look at the tsibble data structure for flexibly managing collections of related time series. We will look at how to do data wrangling, data visualizations and exploratory data analysis. We will explore feature-based methods to explore time series data in high dimensions. A similar feature-based approach can be used to identify anomalous time series within a collection of time series, or to cluster or classify time series. Primary packages for day 1 will be tsibble, lubridate and feasts (along with the tidyverse of course). Day 2 will be about forecasting. We will look at some classical time series models and how they are automated in the fable package, and we will explore the creation of ensemble forecasts and hybrid forecasts. Best practices for evaluating forecast accuracy will also be covered. Finally, we will look at forecast reconciliation, allowing millions of time series to be forecast in a relatively short time while accounting for constraints on how the series are related. (In-Person Only)
-
09:00 AM - 05:00 PM
Workshop: Max Kuhn
Machine Learning in R
Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. This is not a deep learning course and will focus on tabular data. Pre-requisites: some experience with modeling in R and the tidyverse (don't need to be experts); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository. (In-Person & Virtual)
-
09:00 AM - 05:00 PM
Workshop: Jonah Gabry
Bayesian Data Analysis and Stan
This workshop will introduce the basics of applied Bayesian data analysis, the Stan modeling language, and how to interface with Stan from R. Participants will learn to write their own models in the Stan language, run them in R, and use a variety of R packages to work with the results. (In-Person & Virtual)
-
09:00 AM - 05:00 PM
Workshop: Malcolm Barrett & Lucy D'Agostino McGowan
Causal Inference in R
In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores and inverse probability weighting. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. This course is for you if you:
- Know how to fit a linear regression model in R
- Have a basic understanding of data manipulation and visualization using tidyverse tools
- Are interested in understanding the fundamentals behind how to move from estimating correlations to causal relationships
(In-Person & Virtual)
Wednesday, Jul 12
-
08:00 AM - 09:00 AM
Registration & Breakfast
-
09:00 AM - 05:00 PM
Workshop: Rob Hyndman
Tidy Time Series and Forecasting in R
-
09:00 AM - 05:00 PM
Workshop: Max Kuhn
Machine Learning in R
-
09:00 AM - 05:00 PM
Workshop: Jonah Gabry
Bayesian Data Analysis and Stan
-
09:00 AM - 05:00 PM
Workshop: Malcolm Barrett & Lucy D'Agostino McGowan
Causal Inference in R
Thursday, Jul 13
-
08:00 AM - 08:50 AM
Registration & Breakfast
-
08:50 AM - 09:00 AM
Opening Remarks
-
09:00 AM - 09:20 AM
TBD
-
09:25 AM - 09:45 AM
Matt Dupree
OpenAI's Embeddings are Cooler than ChatGPT: An Intro to using OpenAI's Embeddings API
There's been a lot of talk about ChatGPT, but not enough talk about OpenAI's embedding models. Embeddings are a language model's representation of the meaning of text, and in this talk, we cover how we can use OpenAI's embeddings API to solve classification and recommendation problems. We'll also cover how it can be used to intelligently search through documents and any other kind of text. I'll end with a quick demo showing how I'm using embeddings to search across application copy to help users find out how to do things within the software they use.
-
09:50 AM - 10:10 AM
Jessica Duncan
Give Credit Where Credit Is Due: Data-Driven Approach to Marketing Channel Attribution
Knowing which step in the customer journey is most influential in driving a conversion is crucial to optimizing marketing efficiency. Traditional approaches assign all credit to the first touch or last touch; newer rule-based approaches, such as linear, U-shaped, and time-decay, apply formulaic assignment of credit to touchpoints depending on the order they occur in. Using Markov chain modeling techniques, we can arrive at a more robust, algorithmic understanding of the steps in our customer journey.
-
10:10 AM - 10:40 AM
Break
-
10:40 AM - 11:00 AM
Asmae Toumi
-
11:05 AM - 11:25 AM
Jared P. Lander
Building an R Package with LLMs
Can an LLM build an entire R package? We are going to prompt-engineer our way to a working package. We are going to use a series of prompts to first build functions and write the roxygen documentation. After that, we'll ask it to provide the steps for creating the package scaffolding, such as the DESCRIPTION file and folder structure. Then we will have the LLM write unit tests, something that often falls by the wayside. We'll see how quickly the LLM can do all this for us as opposed to using the standard package-building tools.
-
11:30 AM - 11:50 AM
Daniel Chen
-
11:50 AM - 01:00 PM
Lunch
-
01:00 PM - 01:20 PM
Mike Band
The Many Models in Production at NFL Next Gen Stats
-
01:25 PM - 01:45 PM
Rob Hyndman
Being Open to Being Open
I will reflect on 30+ years of experience in producing open-source software and open-access resources. We'll explore the many benefits of working openly and publicly, including academic, commercial, and social good advantages. Discover how adopting an open mindset can lead to increased collaboration and innovation, as developers and users work together to enhance software and other resources to meet their needs. Open-source software is also more secure and reliable, thanks to the collective review of code by many eyes. We'll also explore the benefits of open-access resources, such as educational materials, research papers, and datasets. By making these resources openly available, we can promote access to knowledge and encourage collaboration among researchers and educators. Move beyond using open-source materials to be a developer of open resources, and help make the world more collaborative, innovative, and equitable.
-
01:45 PM - 02:15 PM
Break
-
02:15 PM - 02:35 PM
Bob Rudis
Into the WebR-Verse
In early 2022, intrepid scientist Dr. George Stagg created the first WebAssembly (WASM) version of R, dubbed "WebR", and captured the imagination of scores of RStats enthusiasts. One year later, WebR 0.1.x has been unleashed, and has expanded the R universe to every browser on every device across the galaxy. In this session, we'll take a WebR-slinging journey into and through the WebR-Verse, explaining what it is, the heroic efforts taken to bring it to life, and why it is a game-changer for R. We'll also show you practical examples of how to tap into the potential of this amazing new technology and sling WebR apps of your own.
-
02:40 PM - 03:00 PM
Molly Huie & Andrew Wallender
How to Interrogate Data Like a Journalist
This talk will explore how R can be used to produce data-driven news stories and graphics. We’ll explore best practices, important questions to ask of data, and how to better communicate complicated topics for a mass audience.
-
03:05 PM - 03:25 PM
TBD
-
03:25 PM - 03:55 PM
Break
-
03:55 PM - 04:15 PM
Ayanthi Gunawardana
CaRtography: Creating Accurate and Beautiful Maps in R
One of the more niche areas of data science is geographic data science, or the art of using geographic information to derive and present location-specific insights. This talk will cover basic geospatial concepts and data formats, the essential elements of a map, how to import geospatial data into R, the types of geospatial packages used to manipulate this data, and how to accurately present this data on a static map for exploratory and presentation purposes. Participants will learn what makes a map misleading and how to ensure their analysis shows accurate insights and is easy for users to understand.
-
04:20 PM - 04:40 PM
TBD
-
04:40 PM - 04:50 PM
Closing Remarks
-
04:50 PM - 06:30 PM
Happy Hour
Friday, Jul 14
-
09:00 AM - 09:50 AM
Registration & Breakfast
-
09:50 AM - 10:00 AM
Opening Remarks
-
10:00 AM - 10:20 AM
George Perrett
Bayesian Boosting
Bayesian Additive Regression Trees (BART) is a powerful machine learning algorithm that combines the power of Boosted Regression Trees and Bayesian Inference. BART is well-suited for both causal inference and prediction problems. In this talk, I will provide an overview of BART, explain how it works, and discuss its benefits for various applications. BART requires almost no tuning, includes built-in prediction and credible intervals, and can be extended to account for non-independent data structures. BART is implemented in the dbarts family of R packages and included in the tidymodels framework, and I'll discuss how this powerful class of models can be easily utilized with R!
-
10:25 AM - 10:45 AM
Hamdan Azhar
-
10:45 AM - 11:15 AM
Break
-
11:15 AM - 11:35 AM
Ryan Klein
Using Plumber to Expose Models In Excel
Getting useful models into users' hands can be a real game-changer for many organizations. While Shiny is a great option to allow user interactivity, sometimes users want to work in a spreadsheet. This talk will guide the audience through linking a Microsoft Excel workbook to a plumber R script via an API to allow for the analysis of different combinations of variables and outputs.
-
11:40 AM - 12:00 PM
Caterina Constantinescu
Deconstructing LLM Use: Key Considerations to Deliver Custom Solutions
The rate of advancement in AI research (and LLMs specifically) won't have escaped anybody's attention by now. This degree of progress naturally lends itself to hot takes, memes and dramatisation, when reality is much more nuanced. This talk will explore why AI won't 'take away our jobs' just yet, by discussing some of the real-world constraints and customisations still required. For instance, licensing, privacy and data ownership are legitimate talking points before large-scale adoption in industry. In addition, data collection/scraping for model fine-tuning might also be non-trivial to implement depending on the specific scenario at hand, and experience/UI design may similarly require considerable conscious thought before LLM deployment/integration. In this talk, I will discuss these issues (and more!) to highlight that LLM use involves a high degree of subtlety and customisation, acting as counterweights to the now-common hyperbole on the topic.
-
12:05 PM - 12:25 PM
Emily Riederer
Column Names as Contracts - Inspired by dplyr, Available in dbt
dplyr’s select helpers exemplify how the tidyverse uses opinionated design to push users into the pit of success. The ability to efficiently operate on names incentivizes good naming patterns and creates efficiency in data wrangling and validation. However, in a polyglot world, users may find they must leave the pit when comparable syntactic sugar is not accessible in other languages like Python and SQL. In this talk, I will explain how dplyr’s select helpers inspired my approach to ‘column name contracts’, how good naming systems can help supercharge data management with packages like {dplyr} and {pointblank}, and my experience building the {dbtplyr} package to port this functionality to dbt for building complex SQL-based data pipelines.
-
12:25 PM - 01:35 PM
Lunch
-
01:35 PM - 01:55 PM
Caitlin Hudon
-
02:00 PM - 02:20 PM
Wes McKinney
Leveling Up the Data Stack: Thoughts on the Last 15 Years
In this talk, I will discuss some of my observations about data science tools and related computing infrastructure, both where we have come from and where we may be going in the coming years. I will connect these trends to different projects I’ve been involved with, such as pandas, Apache Arrow, Apache Parquet, Ibis, Substrait, and others. A particular focus will be on the themes of modularity and composability of system components. I will also touch on the rapid evolution of storage and computing hardware and how that may direct future development efforts in open source data software.
-
02:25 PM - 02:45 PM
Max Kuhn
The Post-Modeling Model to Fix the Model
It's possible to get a model that has good numerical performance but has predictions that are not really consistent with the data. Model calibration is a tool that can fix this. We'll show some examples of poor predictions and how different calibration tools can re-align them to the data.
-
02:45 PM - 03:15 PM
Break
-
03:15 PM - 04:15 PM
Jon Krohn & Chris Wiggins
SuperDataScience Podcast Live
-
04:15 PM - 04:25 PM
Closing Remarks
Sponsors