Workshops

Thurs May 7, 2020

Conference

Fri May 8 - Sat May 9, 2020

Florence Gould Hall

55 E 59th St, New York, NY, 10022

Speakers

Andrew Gelman

Professor,
Department of Statistics and Department of Political Science, Columbia University
@StatModeling

Emily Robinson

Senior Data Scientist,
Warby Parker
@robinson_es

Max Kuhn

Scientist,
RStudio
@topepos

Gabriela de Queiroz

Sr. Engineering & Data Science Manager,
IBM
@gdequeiroz

Jon Krohn

Chief Data Scientist,
untapt
@JonKrohnLearns

Jared P. Lander

Chief Data Scientist,
Lander Analytics
@jaredlander

Ludmila Janda

Data Scientist,
Amplify
@ludmila_janda

David Robinson

Ph.D. in Quantitative and Computational Biology,
Princeton University
@drob

Dan Chen

Doctoral Candidate,
Virginia Tech
@chendaniely

Erin LeDell

Chief Machine Learning Scientist,
H2O.ai
@ledell

David Smith

Cloud Advocate,
Microsoft
@revodavid

Jacqueline Nolis

Principal Data Scientist,
Nolis, LLC
@skyetetra

Wes McKinney

Director,
Ursa Labs
@wesmckinn

Brooke Watson Madubuonwu

Senior Data Scientist,
ACLU
@NextTopModeler

Sebastian Teran Hidalgo

Data Scientist,
Vroom
@steranhidalgo

Emily Dodwell

Principal Inventive Scientist,
AT&T Labs Research
@emdodwell

Heather Nolis

Principal ML Engineer,
T-Mobile
@heatherklus

Shane Conway

Researcher,
Kepos Capital
@statalgo

Monica Thieu

PhD Student,
Department of Psychology, Columbia University
@monica_too_

Workshops

Join Max Kuhn on a tour through Machine Learning in R. You'll learn about data preparation, model fitting, model assessment and predictions. Prior experience with lm is enough to get started and learn advanced modeling techniques.

Geospatial expert and Columbia Professor Kaz Sakamoto is leading this class on all things GIS. You'll learn about map projections, spatial regression, plotting interactive heatmaps with leaflet, and working with shapefiles. This course is designed for those who are familiar with R and want to incorporate spatial data into their work. The AM session will be an introduction to Geographic Information Systems (GIS), spatial features (sf package), Coordinate Reference Systems (CRS), and map making basics. The PM session will introduce spatial operations, geometric operations, statistical geography, spatial point pattern analysis, and geostatistics. By the end of the day participants should be able to read and work with spatial data, understand projections, use geoprocessing techniques, and understand basic spatial statistics.

Daniel Chen, author of Pandas for Everyone, has given multiple talks at the New York R Conference about the data science workflow. In this workshop he'll teach how to use Git and project management for better organization and faster iteration. This workshop will have four parts: 1) Git on Your Own, 2) Working with Remotes, 3) Git with Branches, and 4) Collaborating with Git. Part I will cover creating a git repository, adding and committing files, looking at differences between files, looking at your history, moving around your history, reverting changes, and undeleting files. Part II will cover moving from your computer to a remote (e.g., GitHub, BitBucket, GitLab), syncing your files by pushing and pulling, and handling conflicts. Part III will cover creating branches, moving between branches, making commits in branches, merging branches, using branches with remotes, pull requests (aka merge requests), merging pull requests, and syncing up with your remote. In Part IV, we will discuss how the skills you learned apply directly to collaboration with other people.

The tidyverse is a powerful collection of packages following a standard set of principles for usability. During this workshop David will demonstrate an exploratory data analysis in R using tidy tools. He will demonstrate the use of tools such as dplyr and ggplot2 for data transformation and visualization, as well as other packages from the tidyverse as they're needed. He'll narrate his thought process as attendees follow along and offer their own solutions. The workshop expects some familiarity with dplyr and ggplot2—enough to work with data using functions like mutate, group_by, and summarize and to create graphs like scatterplots or bar plots in ggplot2. These concepts will be re-introduced to ensure a smooth workshop, but it isn't designed for brand new R programmers. The workshop is designed to be interactive and participants are expected to type along on their own keyboards.

Andreas Mueller is a Research Scientist at Columbia University and has been a core developer of scikit-learn for over 7 years. He is also the author of the book "Introduction to Machine Learning with Python", co-authored with Sarah Guido. The workshop will go through the basics of machine learning with Python, data representation and preprocessing, and then work through details of the scikit-learn API and how to build and evaluate machine learning models in Python. In particular, we will look at model selection and tuning with cross-validation and grid search, building complex machine learning workflows with pipelines, and evaluating classification models with a variety of metrics. The workshop requires working knowledge of numpy, matplotlib and pandas, and familiarity with working in Jupyter Notebooks.

Agenda

Registration, Breakfast & Opening Remarks: 8:00 AM - 9:00 AM

Breakfast & Open Registration: 8:00 AM - 8:50 AM
Opening Remarks: 8:50 AM - 9:00 AM

This talk will quantify various elements of RLadies NYC since the group’s start in 2017. We will look at things like attendance, talk topics, and book club books. Along the way, we will make visualizations, conduct some analyses, and consider some useful takeaways from the data on this group of women who use data.

Abstract Coming Soon

Speaker TBA: 9:50 AM - 10:10 AM
Break & Networking: 10:10 AM - 10:40 AM

Have you ever had a “first this then that” question? For example, maybe you want all the times people clicked on an item and then added it to their cart, or the last page they visited before registering. This talk will introduce funneljoin, an R package that makes it easy to analyze sequences of events. I'll illustrate how the powerful `type` argument lets you switch quickly between different kinds of funnels and then do a live demo of using funneljoin to analyze Stack Overflow R questions. After this talk, you'll be able to specify and code any type of funnel in R.
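As a rough illustration of the "first this then that" idea, here is a minimal Python sketch of one kind of funnel: each user's earliest click paired with their earliest cart addition at or after that click. It is not funneljoin's API (funneljoin is an R package); the function name `first_then_first_after` and the sample data are hypothetical and exist only for this example.

```python
from collections import defaultdict


def first_then_first_after(first_events, second_events):
    """Pair each user's earliest first-step event with the earliest
    second-step event at or after it (None if no such event exists)."""
    # Earliest first-step timestamp per user.
    earliest_first = {}
    for user, ts in first_events:
        if user not in earliest_first or ts < earliest_first[user]:
            earliest_first[user] = ts

    # All second-step timestamps grouped by user.
    seconds_by_user = defaultdict(list)
    for user, ts in second_events:
        seconds_by_user[user].append(ts)

    # Keep only users who performed the first step.
    funnel = {}
    for user, start in earliest_first.items():
        later = [ts for ts in seconds_by_user[user] if ts >= start]
        funnel[user] = (start, min(later) if later else None)
    return funnel


# Hypothetical events as (user_id, timestamp) pairs.
clicks = [("a", 1), ("a", 5), ("b", 2), ("c", 4)]
cart_adds = [("a", 3), ("a", 7), ("b", 1), ("d", 6)]

funnel = first_then_first_after(clicks, cart_adds)
for user, (clicked_at, added_at) in sorted(funnel.items()):
    print(user, clicked_at, added_at)
```

Other kinds of funnel (for example, the last event before a conversion) would change only how the two timestamps are selected, which is the sort of switch the talk attributes to the `type` argument.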

There are many ways to fit tree-based models in R, including the rpart, randomForest and xgboost packages. We compare their user interfaces and results to judge them on usability and accuracy.

Abstract Coming Soon

Lunch & Networking: 11:50 AM - 1:00 PM
Speaker TBA: 1:00 PM - 1:20 PM
Break & Networking: 2:10 PM - 2:40 PM

As predictive models and machine learning become key components of production applications in every industry, an end-to-end Machine Learning Operations (MLOPS) process becomes critical for reliable and efficient deployment of applications that depend on R-based models. In this talk, I’ll outline the basics of the DevOps process and focus on the areas where MLOPS diverges. The talk will show the complete process of building and deploying an application driven by a machine learning model implemented with R. We will show the process of developing models, triggering model training on code changes, and triggering the CI/CD process for an application when a new version of a model is registered. We will use the Azure Machine Learning service and the “azuremlsdk” package to orchestrate the model training and management process, but the principles will apply to MLOPS processes generally, especially for applications that involve large amounts of data or require significant computing resources.

Speaker TBA: 3:05 PM - 3:25 PM
Brooke Watson: 3:30 PM - 3:50 PM
Break & Networking: 3:50 PM - 4:20 PM

Abstract Coming Soon

Heather Nolis: 4:45 PM - 5:05 PM
Closing Remarks: 5:05 PM - 5:15 PM

Breakfast & Open Registration: 9:00 AM - 9:50 AM
Opening Remarks: 9:50 AM - 10:00 AM
Jacqueline Nolis: 10:00 AM - 10:20 AM
Speaker TBA: 10:25 AM - 10:45 AM
Break & Networking: 10:45 AM - 11:15 AM

Abstract Coming Soon

This talk begins with a survey of the primary families of Deep Learning approaches: Convolutional Neural Networks, Recurrent Neural Networks, Generative Adversarial Networks, and Deep Reinforcement Learning. Via interactive demos, the meat of the talk will appraise the two leading Deep Learning libraries: TensorFlow and PyTorch. With respect to both model development and production deployment, the strengths and weaknesses of the two libraries will be covered -- with a particular focus on the TensorFlow 2 release, which formally integrates the easy-to-use, high-level Keras API into the library.

Abstract Coming Soon

Lunch & Networking: 12:25 PM - 1:35 PM
Gabriela de Queiroz: 1:35 PM - 1:55 PM
David Robinson: 2:00 PM - 2:20 PM

Abstract Coming Soon

Break & Networking: 2:45 PM - 3:15 PM

The focus of this presentation is scalable and automatic machine learning in R using the H2O machine learning platform. H2O is an open source, distributed machine learning platform designed to scale to very large datasets that may not fit into RAM on a single machine. We will provide a brief overview of the field of Automatic Machine Learning, followed by a detailed look inside H2O's AutoML algorithm. H2O AutoML provides an easy-to-use interface that automates data pre-processing and the training and tuning of a large selection of candidate models (including multiple stacked ensemble models for superior model performance). Due to the distributed nature of the H2O platform, H2O AutoML can scale to very large datasets. The result of the AutoML run is a "leaderboard" of H2O models which can be easily exported for use in production.

Monica Thieu: 3:40 PM - 4:00 PM
Speaker TBA: 4:05 PM - 4:25 PM
Closing Remarks: 4:25 PM - 4:35 PM

Sponsors

Platinum

Microsoft

Silver

Flatiron School
Codecademy

Supporting

Pearson
CloudFactory
NausicaaDistribution
Springer
