Click here to check out the 2022 NYR Recap Blog!

Thank you for attending the 2022 New York R Conference. To see what the most recent NYR Conference was like, you can keep scrolling. Also, make sure you check out all the talks in the video tab.

We’ll be back for the 2023 New York R Conference in the summer!




Download the program

Speakers


Andrew Gelman

Professor,
Department of Statistics and Department of Political Science, Columbia University
@StatModeling

Asmae Toumi

Director of Analytics and Research,
PursueCare
@asmae_toumi

Max Kuhn

Scientist,
RStudio
@topepos

Jennifer Hill

Professor of Applied Statistics,
New York University

Wes McKinney

CTO & Co-founder,
Voltron Data
@wesmckinn

Stacy Lansey

Associate Manager, Data Analysis,
Warby Parker
@StacyLansey

Tom Bliss

Manager, Football Operations Data Scientist,
NFL
@DataWithBliss

Sarah Catanzaro

Partner,
Amplify Partners
@sarahcat21

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Malorie Hughes

Senior Data Scientist,
Amazon
@data_all_day

Matt Heaphy

VP & Actuary,
Nassau Financial Group
@entreaphy

Megan Robertson

Senior Data Scientist,
Nike
@leggomymeggo4

Jon Keane

Engineering Manager,
Voltron Data
@jonkeane

Molly Huie

Team Lead, Data Analysis & Surveys,
Bloomberg Industry Group
@mollyhuie

Bernardo Lares

Marketing Science Partner,
Meta
@LaresDJ

Igor Skokan

Marketing Science Partner,
Meta

Ipek Ensari

Associate Research Scientist,
Data Science Institute at Columbia University
@datatransformr

Mike Band

Sr. Manager, Research & Analytics,
NFL Next Gen Stats
@MBandNFL

Geetu Ambwani

VP of Data Science,
Spring Health
@geetuji

Emil Hvitfeldt

Software Engineer,
RStudio
@Emil_Hvitfeldt

Cat Zhou

Data Science Manager,
Twitter
@catherinezh

Daniel Chen

Post-Doc Research and Teaching Fellow & Data Science Educator,
University of British Columbia & Lander Analytics
@chendaniely

Jeroen Janssens

Educator,
Data Science Workshops
@jeroenhjanssens

Dean Attali

Founder,
AttaliTech
@daattali

Lucy D'Agostino McGowan

Assistant Professor,
Wake Forest University
@LucyStats

Malcolm Barrett

Clinical Research Data Scientist,
Teladoc Health
@malco_barrett



Workshops

Workshops will be held on Wednesday, June 8th @ Columbia University, Hamilton Hall. Click on each workshop for more details.

Join Max Kuhn on a tour through Machine Learning in R, with emphasis on using the software as opposed to general explanations of model building. This workshop is an abbreviated introduction to the tidymodels framework for modeling. You'll learn about data preparation, model fitting, model assessment, and predictions. The focus will be on data splitting and resampling, data pre-processing and feature engineering, model creation, evaluation, and tuning. Prerequisites: some experience with modeling in R and the tidyverse (you don't need to be an expert); prior experience with lm is enough to get started and learn advanced modeling techniques. In case participants can’t install the packages on their machines, RStudio Server Pro instances will be available that are pre-loaded with the appropriate packages and GitHub repository. (In-person & Virtual)
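
For a flavor of the workflow the workshop covers, here is a minimal tidymodels sketch (illustrative only, not taken from the workshop materials; the built-in mtcars data and a linear model stand in for whatever you would use in practice):

```r
library(tidymodels)

# Split the data into training and testing sets
set.seed(123)
car_split <- initial_split(mtcars, prop = 0.8)
car_train <- training(car_split)
car_test  <- testing(car_split)

# A recipe handles pre-processing and feature engineering
car_recipe <- recipe(mpg ~ ., data = car_train) |>
  step_normalize(all_numeric_predictors())

# A workflow bundles the recipe with a model specification
car_wflow <- workflow() |>
  add_recipe(car_recipe) |>
  add_model(linear_reg())

# Fit on the training set, then assess on the held-out test set
car_fit <- fit(car_wflow, data = car_train)
predict(car_fit, car_test) |>
  bind_cols(car_test) |>
  metrics(truth = mpg, estimate = .pred)
```

The same pattern scales up: swap `linear_reg()` for another parsnip model, add resampling with rsample, and tune hyperparameters with tune.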

Even though you can accomplish pretty much anything in R, some situations call for that other language: Python. Using a specific Python package? Collaborating with a Pythonista? Starting out at a new place where Python is the language of choice? Or just curious? All valid reasons for you to dive into Python! Luckily, thanks to your existing R knowledge, instructor Jeroen Janssens can help you get started quickly. Jeroen has been using and teaching both languages for many years. Through plenty of exercises, you’ll learn how to read and write Python, how to translate between R and Python, and how to integrate the two using, for example, the reticulate package. We’ll cover not just the syntax, but also the bigger picture, such as navigating the Python ecosystem and documentation, and how to avoid common pitfalls. Lastly, we’ll look at some specific use cases that involve translating dplyr, tidyr, and ggplot2 code to the equivalent Python, pandas, and plotnine code. In short, by the end of this workshop you’ll have a solid foundation for getting comfortable with Python. (In-person & Virtual)
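
As a taste of the kind of bridging the workshop teaches, here is a small sketch using the reticulate package (illustrative; it assumes a working Python installation with numpy and pandas available):

```r
library(reticulate)

# Import a Python module as an R object; its functions are called with $
np <- import("numpy")
x <- np$array(c(1, 2, 3))
mean_x <- np$mean(x)   # numpy's mean, called from R

# R data frames convert to pandas DataFrames (and back) automatically
pd <- import("pandas")
df <- r_to_py(head(mtcars))
df$shape               # a pandas attribute, accessed from R
```

reticulate handles the type conversions (vectors to arrays, data frames to DataFrames) so you can mix the two languages within one analysis.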

Shiny is an R package that can be used to build interactive web pages with R. This might sound strange or scary, but you don't need to have any web knowledge - it's just R! If you've ever written an analysis in R and you want to make it interactive, you can use Shiny. If you've ever written a function or model that you want to share with others who don't know how to use R, you can use Shiny. Shiny has many use cases, and this workshop will help you see how you can leverage it in your own work. You'll learn how to take a Shiny app from start to finish - we'll start by building a simple Shiny app to interactively visualize a dataset, and deploy it online to make it accessible to the world. In the process, you'll learn about reactive programming, Shiny best and worst practices, and you may even pick up some tips that'll make you a better R programmer! (In-person Only)
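
To illustrate how little is needed to get started, here is a minimal Shiny app of the kind the workshop builds on (a generic sketch, not the workshop's actual example):

```r
library(shiny)

# UI: an input widget and an output placeholder
ui <- fluidPage(
  sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 30),
  plotOutput("hist")
)

# Server: reactive code that re-runs whenever the input changes
server <- function(input, output, session) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Old Faithful eruption durations")
  })
}

app <- shinyApp(ui, server)
# run the app locally with runApp(app)
```

Every interaction with the slider re-renders the plot automatically; that is the reactive programming model the workshop digs into.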

In both data science and academic research, prediction modeling is often not enough; to answer many questions, we need to approach them causally. In this workshop, we’ll teach the essential elements of answering causal questions in R through causal diagrams, and causal modeling techniques such as propensity scores, inverse probability weighting, and matching. We’ll also show that by distinguishing predictive models from causal models, we can better take advantage of both tools; prediction modeling plays a role in establishing many causal models, such as propensity scores. You’ll be able to use the tools you already know--the tidyverse, regression models, and more--to answer the questions that are important to your work. (Virtual Only)
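
The interplay between prediction and causal modeling mentioned above can be sketched in a few lines of base R: a logistic regression (a prediction model) estimates propensity scores, which then feed an inverse probability weighted estimate of a treatment effect. The data here are simulated for illustration:

```r
# Simulated example: a binary treatment whose assignment depends on a
# confounder x, so the naive difference in means is biased
set.seed(42)
n <- 2000
x <- rnorm(n)
treat <- rbinom(n, 1, plogis(x))          # treatment more likely when x is high
y <- 1 * treat + 2 * x + rnorm(n)         # true treatment effect is 1

# Step 1: a prediction model (logistic regression) estimates propensity scores
ps <- fitted(glm(treat ~ x, family = binomial))

# Step 2: inverse probability weights re-balance the confounder across groups
w <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))

# Naive vs. weighted estimates of the treatment effect
coef(lm(y ~ treat))["treat"]              # biased upward by the confounder
coef(lm(y ~ treat, weights = w))["treat"] # close to the true effect of 1
```

The naive regression mixes the treatment effect with the confounder's influence; the weighted version recovers something near the true effect of 1.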

Agenda

Workshops will be held at Columbia University, Hamilton Hall; the conference will be held at 55 E 59th St, New York, NY 10022.

Registration & Opening Remarks: 8:00 AM - 9:00 AM EST


Open Registration: 8:00 AM - 8:50 AM EST
Opening Remarks: 8:50 AM - 9:00 AM EST

Video games in R? You bet! Raylib is a C/C++ library for working with 2D & 3D OpenGL graphics, sounds & music, keyboard & mouse interactivity, and even gamepads & VR headsets. I have been working on a new R package called raylibr, which, as the name suggests, allows you to use all of Raylib’s functionality directly from R. I’m not expecting any Triple-A games, but I do believe that having such functionality is useful, especially for research, education, and simply having fun. In this talk I’ll demonstrate raylibr, discuss my design decisions, recall my struggles with Rcpp, and think out loud about the potential of raylibr.

No matter what you work on, visualizations are key to communicating what you’ve uncovered. Before the reign of the tidyverse, the layered syntax of ggplot2 was an unfamiliar beast (I can just keep… “adding” to it?). By now, many of us have grown comfortable with thinking in layers and pipes. But to become a true master of viz, you have to know your way around the plot and produce visualizations that feel effortless to your audience. This talk will cover the shared anatomy and analogous syntax of static ggplot and interactive highcharter visualizations, and make the case for highcharter as a means to keep R alive in an otherwise Python work environment.
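
The analogy between the two syntaxes can be sketched like this (an illustrative pairing, not the talk's own example; it assumes the highcharter package is installed):

```r
library(ggplot2)
library(highcharter)

# Static: ggplot2 builds a plot by adding layers with +
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +
  geom_point() +
  labs(title = "Weight vs. MPG")

# Interactive: highcharter uses an analogous aesthetic mapping (hcaes)
# and chains modifications with the pipe instead of +
hchart(mtcars, "scatter", hcaes(x = wt, y = mpg, group = cyl)) |>
  hc_title(text = "Weight vs. MPG")
```

Both map data columns to aesthetics and then compose the plot from modular pieces; only the composition operator and the rendering target differ.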

Abstract Coming Soon

Break & Networking: 10:10 AM - 10:40 AM EST

When you’re the data person (or team) in an organization full of non-technical folks, how do you make data accessible? You make it Shiny! In this talk I’ll describe how we’re organized at Bloomberg Law, how we’re striving to infuse our content with data and create a data-driven culture, and how my team has created Shiny dashboards to help further this mission. Spoiler alert – it works!

Drawing maps is one of the world’s oldest forms of data visualization. While maps are straightforward to interpret, they can be harder to make. Yet it can all be done inside R, in many different ways. We will make maps using {ggplot2}, {tmap} and {leaflet}, including very large maps using WebGL. We will also see some GIS operations, including determining which points fall inside which polygons and how to triangulate a point on the Earth based on known distances to other areas.
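
As one small example of the point-in-polygon operation mentioned above, a spatial join with the sf package answers "which points fall in which polygons?" (a sketch using the North Carolina demo data shipped with sf; the talk's own examples may differ):

```r
library(sf)

# Polygons: the North Carolina counties dataset shipped with sf
nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

# Points: random locations sampled within the county polygons
set.seed(1)
pts <- st_sample(nc, size = 10)

# Which points fall inside which polygons? A spatial join answers this
joined <- st_join(st_sf(geometry = pts), nc["NAME"], join = st_within)
joined  # each point now carries the name of the county containing it
```

The same pattern works with {leaflet} or {tmap} for display: do the GIS work with sf, then hand the result to the mapping package of your choice.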

Starting off with a good workflow baseline helps scale a project’s complexity. Even the tasks of creating an RStudio Project and having a folder structure can go a long way in managing large and complex projects. I’ll give an example of the things I’ve done in managing my dissertation (git, git submodules, github actions, and r project workflows), some of the corners I cut, and how upcoming tools (i.e., Quarto) can help round those corners.

Lunch & Networking: 11:50 AM - 1:00 PM EST

Robyn [https://facebookexperimental.github.io/Robyn/] is a semi-automated Marketing Mix Modelling (MMM) R package initially built by Meta’s Marketing Science team. It aims to reduce human bias by means of ridge regression and evolutionary algorithms, and allows ground-truth calibration to account for causation. In this talk we will focus on four main techniques for mitigating bias in model training and selection. Project Robyn, now at v3.6+, is open source; with the help of the international data science community, it keeps evolving a traditionally expensive and obfuscated modelling process, democratizing access to actionable MMM for a broader set of advertisers.

When you do applied statistics, you form hypotheses, gather data, run experiments, modify your theories, etc. Here, I’m not talking about hypotheses of the form “theta = 0” or whatever; I’m talking about hypotheses such as, “N=200 will be enough for this study” or “Instrumental variables should work on this problem” or “We can safely use the normal approximation here” or “We really need to include a measurement-error model here” or “The research question of interest is unanswerable from the data we have here; what we really need to do is . . .”, etc. Existing treatments of statistical practice and workflow (including in my own textbooks) do not really capture the way that the steps of statistical design, data collection, analysis, and decision making feel like science. We discuss the implications of this perspective and how it can make us better statisticians and data scientists.

Break & Networking: 2:05 PM - 2:35 PM EST

Actuarial modeling of insurance liabilities requires long-term assumptions for mortality, policyholder behavior, and a variety of other variables. Traditional actuarial assumption setting relies on performing experience studies and developing tabular rate tables considering a limited number of features. Modern data science tools like R can allow actuaries and data scientists to develop more robust and granular assumptions considering a wider variety of features. These techniques can have profound impacts on valuation and product management.

Five years ago, Monica Rogati (AI/ML advisor and former VP Data at Jawbone) published a blogpost outlining the AI Hierarchy of Needs. Although she warned readers that her guidance is “not an excuse to build disconnected, overengineered infrastructure,” many data teams have met this fate. In this talk, Sarah Catanzaro will revisit the AI Hierarchy of Needs and critically examine at which levels tools and the adoption of best practices have enabled teams to make progress and at which levels most teams are still failing to unlock rapid development and iteration on AI products. She will investigate why existing tools and ML stacks may inhibit the productivity of data teams and provide actionable recommendations to accelerate inner and outer ML development loops.

Player performance in the National Football League has traditionally been measured using box score stats and game outcomes. With better data - the NFL’s Next Gen Stats, which contains player tracking data for every player on every play - we can now measure a new element of player performance: player movement. We use player movement to better understand aspects of the rules and equity of the game, including the NFL schedule and pace of play.

Break & Networking: 3:45 PM - 4:15 PM EST

Many of us like puzzles. Wrestling with simple-to-understand questions with tricky-to-find solutions gives you a special type of bliss when you finally crack it. “Advent of Code” has been going for 7 years, challenging over 200,000 people with 25 days of action-packed programming puzzles. Participating in these events has taught me a lot about myself, my motivations, and my tool of choice, R.

This talk will pull back the curtain on a collective journey of learning. A series of elf-inspired tasks provides a challenge for all ages, skill levels, and languages. Tweeting, Slacking, and Discording with the extended R community, bringing people together to share, learn and send memes. A unique opportunity to learn something new; a different programming language, a new set of packages, or simply going as fast as possible to the horror of all style guides. I hope to introduce you to a unique learning experience that will bring you as much enjoyment as it has done for me.

To be successful as a data scientist in industry you need to collaborate with stakeholders with various levels of technical understanding. How many times has your work been referred to as data science magic or a vague combination of buzzwords? While stakeholders do not need to know how to optimize the loss function of the model you fit, it is important that they know the fundamentals behind its predictions or resulting analysis. This builds trust and leads to stronger working relationships between teams. This talk will share how to break down data science work into digestible content for non-technical stakeholders. Attendees will learn a strategy that they can apply across different types of models and algorithms to help non-technical audiences understand data science work.

Closing Remarks: 5:00 PM - 5:10 PM EST
Happy Hour: 5:25 PM - 6:45 PM EST
Open Registration: 9:00 AM - 9:50 AM EST
Opening Remarks: 9:50 AM - 10:00 AM EST

Abstract Coming Soon

Deaths from the U.S. opioid epidemic have reached a new record, totaling 108,000 in 2021 according to the CDC. The number of drug overdose deaths has quadrupled since 1999. Curbing this unrelenting crisis is at the heart of many interventions by the government, public health experts, providers, and community activists. PursueCare is a company offering comprehensive care for substance use disorders and other mental health conditions through telehealth technology and in-person treatments. Asmae Toumi, the director of analytics and research at PursueCare, will talk about how data and R/RStudio’s public and professional tools are being used to deliver evidence-based care and monitor outcomes.

Break & Networking: 10:45 AM - 11:15 AM EST

Mobile and wearable health (mHealth) technologies are increasingly being used in research and clinical settings to monitor a wide range of outcomes, enabling an unprecedented high-resolution perspective into patient health status. However, there remain gaps in how to leverage, and make sense of, the large waves of incoming patient-generated health data (PGHD) from these technologies. Aggregating these data into scalar summary scores, though currently a common practice, fails to capture potentially meaningful within- and between-individual variation. This talk will explore functional data analysis as an alternative approach to address some of these challenges related to PGHD. We will focus on a functional mixture model (FMM)-based clustering technique, where entire data curves are used as the unit of analysis, using examples from research with disease populations that are currently not well understood.

The perfect data science stakeholder is the one you already have - different types of business stakeholders just require different strategies. In this talk I’ll focus on skeptical stakeholders, or stakeholders who are just “not math people”. Before tackling their data questions head on, you first need to spend some time building their trust and confidence. To do this I’ve successfully borrowed strategies from television makeover shows to transform stakeholders and their workflows. Rather than fixing people’s hair or teaching them to cook, I help people fix their data and teach them how to cook up data-driven solutions.

Large mental health care organizations are increasingly looking to use data to improve patient outcomes. By collecting high-quality data from a large number of patients, providers, and their interactions over time, we can start to solve problems that were previously unapproachable. Critical problems span the mental health care experience, from getting the right information at the right time to identify at-risk patients, to providing targeted treatments optimized for the individual patient. We briefly discuss v1.0 solutions to some of these problems, and focus on the path ahead to building solutions that leverage data to improve mental health at scale.

Lunch & Networking: 12:25 PM - 1:30 PM EST

Apache Arrow is a multi-language toolbox for accelerated data interchange and processing. Community-driven development in the Arrow project has continued at a fast pace, with numerous new capabilities, features and refinements added over the past year—all with the goal of making data interchange and processing easier, faster, and more interoperable. The Arrow format has also been adopted as a high-performance (and zero copy!) method of interchange from one toolkit or framework to another, making transitioning from one to another easy and quick. We’ll take a tour of some of the new features in Apache Arrow as well as examples of using the zero-copy data interface between Arrow R and other toolkits like DuckDB. The Arrow R package brings the Apache Arrow toolkit to anyone using R, providing access to the Arrow C++ library with a familiar dplyr and R interface.
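
The Arrow-to-DuckDB handoff described above can be sketched in a few lines (an illustrative example, assuming the arrow, dplyr, and duckdb packages are installed; the talk's own demos may differ):

```r
library(arrow)
library(dplyr)

# Write a small Parquet file to query
tf <- tempfile(fileext = ".parquet")
write_parquet(mtcars, tf)

# Open it as an Arrow Dataset (the data stays out of R's memory)
ds <- open_dataset(tf)

# Hand the Arrow data to DuckDB without copying, then use dplyr as usual;
# the query runs in DuckDB and only the result is pulled into R
res <- ds |>
  to_duckdb() |>
  group_by(cyl) |>
  summarise(avg_mpg = mean(mpg)) |>
  collect()
res
```

Because both Arrow and DuckDB speak the Arrow columnar format, `to_duckdb()` passes the data across without serialization, which is what makes the interchange effectively free.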

Causal inference is a necessary tool in education research for answering pressing and ever-evolving questions around policy and practice. Increasingly, researchers are using more complicated machine learning algorithms to estimate causal effects. These methods take some of the guesswork out of analyses, decrease the opportunity for “p-hacking,” and are often better suited for more fine-tuned causal inference tasks such as identifying varying treatment effects and generalizing results from one population to another. However, these more sophisticated methods are more difficult to understand and are often only accessible in more technical, less user-friendly software packages. The thinkCausal project is working to address these challenges (and more) by developing a highly-scaffolded, multi-purpose causal inference software package with the BART predictive algorithm as a foundation. The software will scaffold the researcher through the data analytic process and provide options to access technology-based teaching tools to understand foundational concepts in causal inference and machine learning. This talk will briefly review BART for causal inference and then discuss the challenges and opportunities in building this type of tool. This is work in progress, and the goal is to create a conversation about the tool and the role of education in data analysis software more broadly.

With model operations coming into its own, there is a lot to say about the hows and whys of model monitoring. However, until you update it, the model doesn’t change; it’s the data that moves. We’ll look at some tools to measure data drift with an eye toward model operations.
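
The core idea can be illustrated with a minimal drift check in base R (a sketch with simulated data; production monitoring tools are more elaborate than this):

```r
# A simple drift check: compare a feature's distribution at training time
# to what the model is seeing in production, via the Kolmogorov-Smirnov test
set.seed(7)
train_feature <- rnorm(1000, mean = 0, sd = 1)   # distribution at training time
prod_feature  <- rnorm(1000, mean = 0.5, sd = 1) # the data has since shifted

drift_test <- ks.test(train_feature, prod_feature)
drift_test$p.value  # a very small p-value suggests the feature has drifted
```

Run per feature on a schedule, checks like this flag when the incoming data has moved away from what the model was trained on, even though the model itself is unchanged.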

Break & Networking: 2:40 PM - 3:10 PM EST

Jon Krohn, Chief Data Scientist at Nebula, interviews Hilary Mason, Co-Founder & CEO of Hidden Door, live in this special conference event! The SuperDataScience podcast brings you the latest and most important machine learning, artificial intelligence, and broader data-world topics from across both academia and industry. As the quantity of data on our planet doubles every couple of years, and this trend is set to continue for decades to come, there’s an unprecedented opportunity for you to make an enormous impact in your lifetime. Whether you’re curious about getting started in a data career or you’re a deep technical expert, whether you’d like to understand what A.I. is or you’d like to integrate more data-driven processes into your business, we have inspiring guests and lighthearted conversation for you to enjoy. We cover tools, techniques, and implementation tricks across data collection, databases, analytics, predictive modeling, visualization, software engineering, real-world applications, and commercialization: everything you need to crush it with data science.

Closing Remarks: 4:10 PM - 4:20 PM EST

Sponsors

Gold

Spring Health

Silver

Columbia University | Statistics
R Consortium

Bronze

RStudio
Saturn Cloud
Springer

Supporting

Pearson
Manning
Chapman & Hall/CRC, Taylor & Francis Group
No Starch Press
Sweet Francesca



If you are interested in being a sponsor for the 2022 New York R Conference, please contact us at info@landeranalytics.com