November 08, 2018

8:00am - 8:50am

Breakfast & Open Registration

8:50am - 9:00am

Opening Remarks

9:00am - 9:20am

Practical R - R as LEGO to Solve Real Problems
Refael Lav, Deloitte

We all know some part of the R ecosystem well, mostly depending on how we got started on this journey. In this talk, I would like to discuss one aspect that fascinates me: how different capabilities link together to solve real problems. I will demonstrate how we connect different components into a pipeline, one that let us move from images to NLP, to time series, to ML, to API calls. I will present a few examples of how, with R, we connected the dots. I hope this will open a path for anyone to use what they have learned at this conference to tackle a concrete data problem for their clients.

9:25am - 9:45am

My Open Source Journey in R/Finance
Soumya Kalra, R-Ladies NYC

Open source contributions in the R community are an incredible way to learn and become part of a community. Making these kinds of contributions can be quite scary, and one often doesn't know where to begin. In this presentation, I will share the approaches I used to start contributing, focused specifically on the finance space. I will also demonstrate some of the tools and analyses I have built along the way, with lessons learned (both good and bad). In addition, I will lay out a path for all R community members to foster more collaboration and contribution in the R/Finance community.

9:50am - 10:10am

Democratizing Data Science: Using R for End-to-End Intelligence Production
Michael Powell, Intelligence and Security Command

Organizations seeking to build data science capacity face many challenges; debates over hardware, software, data science team member roles, and even which problems to tackle first can make it hard to get started. In this talk, MAJ Mike Powell of the US Army Intelligence and Security Command will explore how the wide range of offerings in the R/RStudio community have allowed small data science efforts to gain traction without having to wait for enterprise decisions and resources. Even analysts without extensive formal training have access to the necessary tools to create websites, web apps, reproducible reports, and script-driven analysis and visualizations - all within the R/RStudio ecosystem. These tools have effectively democratized data science by giving analysts a single platform suitable for every phase of intelligence production.

10:10am - 10:40am

Break & Networking

10:40am - 11:00am

Building an A/B Testing Analytics System with R and Shiny
Emily Robinson, DataCamp

Online experimentation, or A/B testing, is the gold standard for measuring the effectiveness of changes to a website. While A/B testing is used at thousands of companies, results can be difficult to parse without resorting to expensive end-to-end commercial options. Using DataCamp's system as an example, I'll illustrate how R is a great language for building powerful tools to analyze and visualize experiments. We'll first see how Shiny dashboards can help people monitor and quickly analyze multiple A/B tests each week. We'll then dive into the open-source funneljoin package we've created, which uses variations on dplyr's join functions to let you analyze sequential actions as behavioral funnels.
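
As a taste of the package, here is a minimal sketch of a behavioral funnel built with funneljoin; the event tables and column names are made up for illustration:

    library(dplyr)
    library(funneljoin)

    # Hypothetical event logs: when users saw an experiment and
    # when they later subscribed.
    exposures <- tibble::tribble(
      ~user_id, ~timestamp,
      1L, as.Date("2018-07-01"),
      2L, as.Date("2018-07-01"),
      3L, as.Date("2018-07-02")
    )
    subscriptions <- tibble::tribble(
      ~user_id, ~timestamp,
      1L, as.Date("2018-07-02"),
      3L, as.Date("2018-07-01")
    )

    # For each user, join their first exposure to their first
    # subscription occurring after it.
    exposures %>%
      after_join(subscriptions,
                 by_time = "timestamp",
                 by_user = "user_id",
                 mode = "inner",
                 type = "first-firstafter")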

11:05am - 11:25am

Many Ways to Lasso
Jared P. Lander, Lander Analytics

The elastic net is one of my favorite algorithms, implementing the lasso, ridge, and combinations of the two. The main way to fit the elastic net is with glmnet, written by Hastie, Tibshirani, and Friedman. But there are many other ways, including xgboost, Stan, and TensorFlow. We will fit the elastic net a few different ways and see how the implementations differ.
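
For reference, the canonical glmnet fit, where alpha = 1 is the lasso, alpha = 0 is ridge, and anything in between is a mix (the data here are simulated):

    library(glmnet)

    # Simulated data: 100 observations, 20 predictors, sparse signal.
    set.seed(42)
    x <- matrix(rnorm(100 * 20), nrow = 100)
    y <- x[, 1] - 2 * x[, 2] + rnorm(100)

    fit_lasso <- glmnet(x, y, alpha = 1)    # pure lasso
    fit_ridge <- glmnet(x, y, alpha = 0)    # pure ridge
    fit_enet  <- glmnet(x, y, alpha = 0.5)  # elastic net mix

    # Cross-validate to choose the penalty for the elastic net.
    cv_fit <- cv.glmnet(x, y, alpha = 0.5)
    coef(cv_fit, s = "lambda.1se")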

11:30am - 11:50am

Structuring Your Data Science Projects
Dan Chen, Virginia Tech

When we start programming, we are happy when our code just runs. As we get more experienced, we begin to add more structure to our projects. We start off by dumping all our files into a folder; then we might create subfolders to help organize our work. We might even structure our projects so we can share our work easily with potential collaborators. As our projects grow, we might need to reuse bits and pieces in other projects, and finally consolidate our work into a written report. R gives us the tools to make our projects more structured and organized. As we become more familiar with these tools, we realize that many people converge on very similar project templates, all aiming to make projects clearer, shareable, and simply "work". It doesn't matter where you are in your learning path; you can always benefit from adding a little more structure to your data science projects.
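
One hedged sketch of the tooling involved, using the usethis package (the project name and the exact set of helpers are illustrative):

    library(usethis)

    # Scaffold a new, structured project and add the pieces that
    # make it shareable from day one.
    create_project("~/projects/sales-analysis")
    use_git()        # version control from the start
    use_readme_md()  # a README to orient collaborators
    use_data_raw()   # a data-raw/ folder for raw data and cleaning scripts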

11:50am - 1:00pm

Lunch & Networking

1:00pm - 1:20pm

gg-what?: A Look at the ggplot Ecosystem
Marck Vaisman, Microsoft

Most R users are probably aware of the ggplot2 package for creating elegant visualizations. ggplot2 is imported by, or a dependency of, approximately 1,400 other packages. In this talk, we explore the connections to and influence that ggplot2 has on so many other packages. We look at some of these packages and ggplot2's extension mechanisms by visualizing LEGO data obtained from Rebrickable.com.
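
The dependency claim is easy to check from R itself; a quick sketch (the exact count will drift as CRAN grows):

    # Count CRAN packages that depend on or import ggplot2.
    db <- available.packages(repos = "https://cloud.r-project.org")
    revdeps <- tools::package_dependencies(
      "ggplot2",
      db = db,
      which = c("Depends", "Imports"),
      reverse = TRUE
    )
    length(revdeps$ggplot2)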

1:25pm - 1:45pm

Anomaly Detection With Time Series Data: How to Know if Something is Terribly Wrong
Catherine Zhou, Codecademy

With the rise of streaming data and cloud computing, data scientists are often asked to analyze terabytes of data. The sheer amount of data available leads to a lag time in identifying irregularities, resulting in lost time and revenue. We can pinpoint these outliers through anomaly detection algorithms, which can be repurposed to monitor key metrics, website breakage, and fraudulent activity. I will demonstrate how we can build a system for anomaly detection to uncover blind spots in large datasets and reduce fire drills at work.
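
A minimal sketch of the underlying idea - flag points that stray too far from a rolling baseline (the data and the three-sigma cutoff are illustrative):

    library(dplyr)
    library(zoo)

    # Simulated daily metric with one injected anomaly.
    set.seed(1)
    metric <- tibble(
      day   = seq.Date(as.Date("2018-01-01"), by = "day", length.out = 90),
      value = rnorm(90, mean = 100, sd = 5)
    )
    metric$value[60] <- 150

    metric %>%
      mutate(
        roll_mean = rollapply(value, 14, mean, fill = NA, align = "right"),
        roll_sd   = rollapply(value, 14, sd,   fill = NA, align = "right"),
        anomaly   = abs(value - roll_mean) > 3 * roll_sd
      ) %>%
      filter(anomaly)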

1:50pm - 2:10pm

Proximity Matching with Random Forests
Anna Sofia Kircher, Lendable

Estimating treatment effects requires data on a control group. If there is no randomization in the assignment of data points to the treatment or control group, the estimation strategy - no matter how sophisticated or good it is - will always suffer from selection bias. In the realm of financial investment, this bias can be massive. But there are many ways to create a synthetic control group. This talk will focus on monitoring investments by finding a control group using proximity scores from the proximity matrix of a random forest. We will discuss how a random forest can find similarities between observations, in either an unsupervised or a supervised setting, and how to use that to create a control group for comparing the performance of a purchased portfolio of loan receivables against one that has not been purchased.
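
A rough sketch of the core trick with the randomForest package (iris stands in for loan-level data, and the treated/control split is hypothetical):

    library(randomForest)

    # Unsupervised forest; proximity = TRUE returns an n x n matrix
    # counting how often two observations share a terminal node.
    set.seed(7)
    rf <- randomForest(iris[, 1:4], proximity = TRUE, ntree = 500)
    prox <- rf$proximity

    # For a "treated" observation, find its closest "control" match.
    treated  <- 1
    controls <- 51:150
    best_match <- controls[which.max(prox[treated, controls])]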

2:10pm - 2:40pm

Break & Networking

2:40pm - 3:00pm

Saving Lives, Thousands at a Time, Using R
Roger Peng, Johns Hopkins University

Air pollution levels in the United States have fallen substantially over the past 50 years due to increasingly stringent regulation, resulting in dramatic improvements in air quality and health. In parallel, there has been a revolution in air pollution epidemiology through the use of big data coupled with sophisticated statistical methods. At the center of all this is R, which has played a key role in allowing methodological development to happen in an open source manner, and has significantly accelerated the sharing of data and tools across the world. In this talk I will present some key examples of how R has enabled the creation of high-quality scientific evidence in support of air pollution policy and has contributed to protecting public health around the globe.

3:05pm - 3:25pm

Hacking Gmail to See if Anyone Actually Cares About Data Privacy
Jim Klucar, Nyla Technology Solutions

Two years ago I spoke about data privacy techniques at the NYR Conference and I had a secret: I hadn't written a line of R since that one course in grad school made me do it. Attending that conference opened my eyes to what R had become since the tidy revolution and I began learning and using it more for my daily work. This talk is about what I learned in my journey over the last two years of adopting R, using a recent project where I analyzed my Google Alerts emails for data privacy article trends.

3:30pm - 3:50pm

Analyzing Genomics Data in R with Bioconductor
Stephanie Hicks, Johns Hopkins University

Advances in biotechnology are leading to the generation of new types of biological data at decreasing cost, in concert with increases in the volume, resolution, and diversity of the data. However, effectively deriving knowledge from these data to understand biological systems and disease requires continuous improvements in computational methods, analysis tools, and the associated software engineering. Bioconductor is an open-source, open-development software project for the analysis and comprehension of genomics data using the R programming language, with over 1,560 software packages and an active user and contributor community. In this talk, I will 1. give an overview of the R/Bioconductor community, 2. discuss the relationship between Bioconductor and CRAN, and 3. give examples of how Bioconductor can enable the rapid analysis of genomics data at all stages of a project, from data generation to publication.
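
Getting started takes two lines, since Bioconductor packages are installed with BiocManager rather than install.packages() alone; a small sketch using one of the core data containers (the count matrix is simulated):

    install.packages("BiocManager")
    BiocManager::install("SummarizedExperiment")

    # SummarizedExperiment stores assay matrices alongside metadata
    # about the samples and features.
    library(SummarizedExperiment)
    counts <- matrix(rpois(200, lambda = 10), nrow = 20)
    se <- SummarizedExperiment(assays = list(counts = counts))
    dim(se)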

3:50pm - 4:20pm

Break & Networking

4:20pm - 4:40pm

Data Science? Make It Spatial
Angela Li, Center for Spatial Data Science

Many data scientists are familiar with techniques to handle traditional tabular data. But what happens when the data is location-based, or spatial, in nature? In this talk, we'll cover techniques for spatial data that may be outside of your traditional toolbox. We'll look at how spatial analysis and methods are being used in current social science research and discuss how you can bring some of these methods to your own work. Topics may include exploratory spatial analysis, spatial autocorrelation, clustering, and mapping.
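
As a flavor of the toolbox, a test for spatial autocorrelation (Moran's I) using the North Carolina demo data shipped with the sf package:

    library(sf)
    library(spdep)

    # County polygons with SIDS counts, bundled with sf.
    nc <- st_read(system.file("shape/nc.shp", package = "sf"), quiet = TRUE)

    nb <- poly2nb(nc)                # contiguity-based neighbors
    lw <- nb2listw(nb, style = "W")  # row-standardized spatial weights
    moran.test(nc$SID74, lw)         # Moran's I for 1974 SIDS counts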

4:45pm - 5:05pm

Let's Git'R Marked Down: Streamline Model Development Using R Markdown
Lizzy Huang, Freddie Mac

Model development consists of data extraction, estimation, implementation, analysis, visualization, and the documentation of results. Usually coding, visualization, and documentation are done with different tools on separate platforms. It is tough to keep track of changes and improvements, and to ensure they are applied uniformly, especially when the whole process involves the collaboration of many developers. In this talk, I will discuss an approach to streamlining the development process by using R Markdown under version control. This approach not only allows better control of code changes during development, but also synchronizes results into the documentation to ensure reproducibility and ease of auditing.
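
The mechanics are simple; a hedged sketch with an illustrative file name, assuming the document declares a data_version parameter in its YAML header:

    # model_report.Rmd carries code, results, and narrative together;
    # rendering it pins the documentation to the committed code.
    rmarkdown::render(
      "model_report.Rmd",
      params = list(data_version = "2018-10"),
      output_file = "model_report_2018-10.html"
    )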

5:05pm - 5:15pm

Closing Remarks

November 09, 2018

9:00am - 9:50am

Breakfast & Open Registration

9:50am - 10:00am

Opening Remarks

10:00am - 10:20am

Mining Text with textmineR
Tommy Jones, In-Q-Tel

textmineR introduces a framework for natural language processing (NLP) that improves upon the current NLP frameworks available in R. Specifically, textmineR has a syntax that is more intuitive to experienced R users. It uses objects, methods, and functions that behave like regular dense R matrices. This lowers the barrier for statisticians and other data analytics professionals to begin statistical analyses of language. textmineR also implements diagnostic and analysis methods for topic models.
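
A minimal sketch of the workflow (the toy corpus is made up):

    library(textmineR)

    docs <- c(doc1 = "the cat sat on the mat",
              doc2 = "dogs and cats make wonderful pets",
              doc3 = "the stock market fell sharply today")

    # Document-term matrix that behaves like a regular R matrix.
    dtm <- CreateDtm(doc_vec = docs, doc_names = names(docs))
    dim(dtm)

    # Fit a two-topic LDA model.
    lda <- FitLdaModel(dtm = dtm, k = 2, iterations = 200)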

10:25am - 10:45am

SQL for Everyone
Max Richman, Arcadia Power

In the spirit of Lander (the R for Everyone book), Chen (the Pandas for Everyone book), and Richman (the R for Every Survey Analysis talk), this talk will focus on how you - yes, you - can and probably should use SQL as part of your R data analysis workflow. This is a beginner-to-intermediate introduction to fundamental data pipeline skills every analyst should have in their toolkit, explained in a way that makes it easy to get started on your own afterwards. Topics include RODBC, SQL GROUP BY, and SQL window functions, among other handy tools and tips.
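
A hedged sketch of the pattern (the DSN, table, and column names are placeholders):

    library(RODBC)

    con <- odbcConnect("warehouse")  # an ODBC data source name

    # Push aggregation into the database with GROUP BY...
    totals <- sqlQuery(con, "
      SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total
      FROM orders
      GROUP BY customer_id
    ")

    # ...and compute running totals with a window function.
    running <- sqlQuery(con, "
      SELECT customer_id, created_at, amount,
             SUM(amount) OVER (PARTITION BY customer_id
                               ORDER BY created_at) AS running_total
      FROM orders
    ")

    odbcClose(con)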

10:45am - 11:15am

Break & Networking

11:15am - 11:35am

Not Hotdog: Image Recognition with R and the Custom Vision API
David Smith, Microsoft

Building an application that can recognize a specific type of object from scratch is possible with tools like convolutional neural networks, but it's not easy: you may need many thousands of labelled representative and unrepresentative images, and training such a model may consume many expensive GPU core-hours. A simpler yet effective alternative is transfer learning: take a standard neural network already trained to recognize general objects, and reuse the features it has learned to recognize a new set of objects. With this method, you need far fewer novel images, and the training process is much faster. In this talk, I'll use R in conjunction with the Microsoft Custom Vision API to train and use a custom vision recognizer. I'll use an example motivated by the TV series "Silicon Valley" and, with just a couple of hundred images of food, create a Shiny application that can detect whether or not a given image contains a hot dog.
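
The prediction side reduces to a single authenticated HTTP call from R; a hedged sketch with httr, where the endpoint URL, project ID, and key are placeholders to be copied from the Custom Vision portal:

    library(httr)

    # Placeholder endpoint; copy the real one from your project's
    # prediction settings in the Custom Vision portal.
    endpoint <- "https://<region>.api.cognitive.microsoft.com/customvision/<version>/Prediction/<project-id>/image"

    resp <- POST(
      endpoint,
      add_headers("Prediction-Key" = "<your-prediction-key>",
                  "Content-Type"   = "application/octet-stream"),
      body = upload_file("maybe_hotdog.jpg")
    )
    str(content(resp))  # tag probabilities for the image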

11:40am - 12:00pm

Equivocals in Predictive Modeling
Max Kuhn, RStudio

When important quantities are being predicted by a model, it makes sense to avoid making predictions when there is significant uncertainty. In laboratory diagnostic tests, it is common (and often mandated) to use an equivocal zone for this purpose. We'll show examples from drug discovery and two different methods for labeling predictions as equivocal.
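
A minimal sketch of the idea, with a hand-rolled buffer around the usual 0.5 probability threshold (the labels and buffer width are illustrative):

    # Predictions whose class probability falls inside the buffer
    # are reported as equivocal rather than forced into a class.
    equivocal_pred <- function(prob, buffer = 0.05) {
      ifelse(abs(prob - 0.5) <= buffer, "equivocal",
             ifelse(prob >= 0.5, "active", "inactive"))
    }

    equivocal_pred(c(0.97, 0.52, 0.48, 0.10))
    # "active" "equivocal" "equivocal" "inactive"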

12:05pm - 12:25pm

Tinkering with Serverless Computing and R in the Cloud
Kelly O'Briant, RStudio

Serverless compute is lauded as a solution for freeing developers from the arduous task of managing (cloud) infrastructure. I would argue that it can also be a good entry point into learning, or getting more comfortable with, cloud infrastructure services and offerings. In this talk I'll discuss deploying R projects with App Engine - the Google Cloud Platform offering for building fully managed, serverless, and scalable web apps and microservices. In the process, I hope to provide resources and a getting-started path for people interested in learning new skills related to analytic administration in R.

Key topics:
- What is an analytic admin, and what do they do for R and data science teams?
- What is serverless compute, and why is it fun to tinker with?
- Steps to creating and managing a microservice on Google App Engine with R code
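
A rough sketch of the R side of such a microservice, using the plumber package (the route and logic are illustrative; App Engine's flexible environment expects the service to listen on port 8080):

    # api.R: a minimal plumber API to wrap in a Docker-based
    # custom runtime for App Engine.
    library(plumber)

    #* Return a greeting
    #* @get /hello
    function(name = "world") {
      list(message = paste("Hello,", name))
    }

    # To run locally: plumber::plumb("api.R")$run(port = 8080)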

12:25pm - 1:35pm

Lunch & Networking

1:35pm - 1:55pm

How to Start a Data Science Insurrection at an Organization that Would Prefer You Not
Jonathan Hersh, Chapman University

Disruptive transformations are all the rage in Silicon Valley, and even traditional organizations are starting to discuss machine learning and artificial intelligence. However, new methods from data science can create internal frictions within organizations by altering power dynamics. The extent to which any project scales is often determined by institutional support, even when the project is successful. Moreover, 'move fast and break things' isn't a reliable model in organizations where the costs of failure are high or catastrophic. Drawing on his experience within institutions such as the World Bank and the Inter-American Development Bank, this talk will discuss strategies to help data scientists respectfully advocate for data science solutions at more traditional institutions. Along the way, we will learn about research on using satellites to estimate bombing damage in Syria, and on how 3G cell phones have caused thousands of accidents per year.

2:00pm - 2:20pm

How I Found Your Answer
Mara Averick, RStudio

Equal parts history and mystery: noted “data sciolist” Mara Averick will take you on a whirlwind tour through the lighter side of learning and communicating (data) science. From the 19th-century pages of Science Gossip to modern-day social media, you’ll find out how the nature of scientific communities has and hasn’t changed over the past 200 years, and discover how “learning out loud” can help you navigate an ever-increasing amount of information, or (at the very least) keep you entertained while so doing.

2:25pm - 2:45pm

The Story of the MNIST Dataset
Michael Garris, NIST

It has been said that the MNIST handprinted character image dataset is the “Hello World” of machine learning, and the dataset is used as a worldwide machine learning benchmark. But where did it come from, how was it created, and for what purpose? A clue is found within its name. This talk will demystify the renowned dataset by telling its story, from inception through its impactful journey.

2:45pm - 3:15pm

Break & Networking

3:15pm - 3:35pm

Activity Monitoring Using Sensors and R
Abhijit Dasgupta, Zansors

We increasingly use sensors to monitor our sleep and activity in real time, using ubiquitous wearable devices like the FitBit, furniture-linked sensors like Beddit, as well as the time-tested heart rate monitor and GPS. Zansors adds to this environment by developing a wearable breathing sensor to monitor activity and sleep. In this presentation, I’ll describe some of the characteristics of the data we see and how it can be munged using the tidyverse, analyzed using different R packages, and displayed using htmlwidgets to enable the interpretation of different activities as well as sleep patterns. To this end, I will describe available implementations of moving averages, as well as linked dynamic graphs that are invaluable for interpreting the data.
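
As one small example, smoothing a raw sensor stream with a moving average (the readings below are simulated stand-ins):

    library(dplyr)
    library(zoo)

    # Simulated heart-rate stream: one reading per second.
    set.seed(2)
    hr <- tibble(
      t   = 1:600,
      bpm = 70 + 10 * sin((1:600) / 60) + rnorm(600, sd = 4)
    )

    # A centered 31-second moving average smooths out sensor noise.
    hr <- hr %>%
      mutate(bpm_smooth = rollmean(bpm, k = 31, fill = NA))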

3:40pm - 4:00pm

Association Rule Mining With Tweets: Thinking Outside the Basket
Ami Gates, Georgetown University

With the increasing and continued interest in text mining, and the potential for relationships between words or items, association rule mining has become a more popular technique. The classic example of association rule mining is investigating “baskets” of items originating from transactions. The most notable such example is the “market basket”, where foods appear within transactions with higher or lower joint probabilities. However, collections of items, or baskets, are not the only application for association rule mining. Applying association rule mining to Twitter data (tweet text) using R offers interesting insight into words that are highly associated or correlated in a given set of tweets. By thinking of each tweet as a transaction, one can collect tweets, reformat them into basket-style .csv data, and use R to apply association rule mining to discover relationships.
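
A minimal sketch with the arules package, treating each tweet as a transaction of words (the tweets and thresholds are made up):

    library(arules)

    tweets <- list(
      c("rstats", "data", "science"),
      c("rstats", "ggplot2", "dataviz"),
      c("data", "science", "machine", "learning"),
      c("rstats", "data", "science", "machine", "learning")
    )
    trans <- as(tweets, "transactions")

    # Mine rules with minimum support and confidence thresholds.
    rules <- apriori(trans,
                     parameter = list(supp = 0.4, conf = 0.7, minlen = 2))
    inspect(sort(rules, by = "lift"))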

4:05pm - 4:25pm

Ethics of Data Storytelling
Vivian Peng

Data informs decision making, from the individual to the system at play. What are the considerations to be mindful of when telling stories with data, particularly in the media? This talk explores how to choose which story to tell, find the right medium, and prepare your data for the media, keeping in mind that with great power comes great responsibility.

4:25pm - 4:35pm

Closing Remarks