R's popularity as a statistical programming language keeps growing, and it is becoming the preferred tool for data analysts and data scientists who want to make sense of large amounts of data as quickly as possible. R has a rich set of libraries that can be used for basic as well as advanced data analysis.

This comprehensive 3-in-1 course teaches you to conduct data analysis in practical contexts with R, using core language packages and tools. The goal is to give analysts and data scientists a complete course on how to manipulate and analyse small and large sets of data with R. You will apply what you learn to real-world examples of data analysis and work on three different projects.

This training program includes 3 complete courses, carefully chosen to give you the most comprehensive training possible.

The first course, **Learning Data Analysis with R**, starts with the most basic importing techniques for downloading compressed data from the web and then moves on to more advanced ways of handling the most difficult datasets to import. You will then learn how to create static plots and how to plot spatial data on interactive web platforms such as Google Maps and OpenStreetMap, working throughout with real-world examples of data analysis.

The second course, **Mastering Data Analysis with R**, contains carefully selected advanced data analysis concepts such as cluster analysis, time-series analysis, association mining, PCA (Principal Component Analysis), handling missing data, sentiment analysis, spatial data analysis with R and QGIS, and advanced data visualization with R and ggplot2.

The third course, **R Data Analytics Projects**, takes you on a data-driven journey that starts with the very basics of R data analysis and machine learning. You will then work on three different projects to apply the concepts of machine learning and data analysis. Each project will help you to understand, explore, visualize, and derive domain- and algorithm-based insights.

**By the end of this Learning Path, you’ll gain in-depth knowledge of the basic and advanced data analysis concepts in R and will be able to put your learnings into practice.**

**Meet Your Expert(s):**

We have the best work of the following esteemed author(s) to ensure that your learning journey is smooth:

● **Fabio Veronesi** obtained a Ph.D. in digital soil mapping from Cranfield University and then moved to ETH Zurich, where he has been working for the past three years as a postdoc. In his career, Dr. Veronesi has worked on several topics related to environmental research, such as digital soil mapping, cartography and shaded relief, renewable energy, and transmission line siting. During this time, Dr. Veronesi specialized in the application of spatial statistical techniques to environmental data.

● **Dr. Bharatendra Rai** is Professor of Business Statistics and Operations Management in the Charlton College of Business at UMass Dartmouth. He received his Ph.D. in Industrial Engineering from Wayne State University, Detroit. He holds two master’s degrees, one with specializations in quality, reliability, and OR from the Indian Statistical Institute and another in statistics from Meerut University, India. He teaches courses on topics such as Analyzing Big Data, Business Analytics and Data Mining, Twitter and Text Analytics, Applied Decision Techniques, Operations Management, and Data Science for Business. He has over twenty years of consulting and training experience in industries such as automotive, cutting tool, electronics, food, software, chemical, and defense, in the areas of SPC, design of experiments, quality engineering, problem-solving tools, Six Sigma, and QMS. His work experience includes over five years of extensive research at Ford in the areas of quality, reliability, and Six Sigma. His research has been published in journals such as IEEE Transactions on Reliability, Reliability Engineering & System Safety, Quality Engineering, International Journal of Product Development, International Journal of Business Excellence, and JSSSE.

● **Raghav Bali** is a Data Scientist at Optum, a United Health Group Company. He is part of the Data Science group, where his work helps United Health Group develop data-driven solutions to transform the healthcare sector. He primarily works on data science, analytics, and the development of scalable machine learning-based solutions. In his previous role at Intel as a Data Scientist, his work involved research and development of enterprise solutions in the infrastructure domain, leveraging cutting-edge techniques from machine learning, deep learning, and transfer learning. He has also worked in domains such as ERP and finance with some of the leading organizations of the world. Raghav has a master’s degree (gold medalist) in Information Technology from International Institute of Information Technology, Bangalore. Raghav has authored several books on Machine Learning and Analytics using R and Python. He is a technology enthusiast who loves reading and playing around with new gadgets and technologies.

● **Dipanjan Sarkar** is a Data Scientist at Intel, on a mission to make the world more connected and productive. He primarily works on data science, analytics, business intelligence, application development, and building large-scale intelligent systems. He holds a master of technology degree in Information Technology with specializations in Data Science and Software Engineering. He is also an avid supporter of self-learning. He has been an analytics practitioner for several years now, specializing in machine learning, natural language processing, statistical methods and deep learning.

### Learning Data Analysis with R

This video provides an overview of the entire course.

Accessing and importing open access environmental data is a crucial skill for data scientists. This section teaches you how to download data from the Web, import it in R and check it for consistency.

- Download open-access data from the USGS website
- Import it in R using read.table
- Check its structure to start exploring the data
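
As a rough illustration of the import-and-check step, the sketch below reads a small comma-separated sample with `read.table` and inspects its structure. The sample text and the name `quakes_sample` are made up for this example and are not the course's USGS file; for a real download you would pass the file path or URL instead of `text`.

```r
# Hypothetical sample standing in for a downloaded file;
# column names and values are invented for illustration.
raw <- "time,latitude,longitude,mag
2016-01-01T00:10:00Z,35.2,-118.1,2.4
2016-01-01T03:45:00Z,61.3,-150.0,3.1
2016-01-02T11:20:00Z,19.4,-155.3,1.9"

# read.table accepts inline text through the `text` argument
quakes_sample <- read.table(text = raw, sep = ",", header = TRUE,
                            stringsAsFactors = FALSE)

# Check column types and get a quick summary of one variable
str(quakes_sample)
summary(quakes_sample$mag)
```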

Datasets are often provided for free but hosted on FTP sites, and practitioners need to be able to access them. R is perfectly capable of downloading and importing data from FTP sites.

- Understand the basics of downloading data in R
- Download the data with the download.file function
- Learn how to handle compressed formats
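
The compressed-format step can be sketched offline as follows. This is only an assumption-laden example: the file is created locally in a temporary folder rather than fetched, and the station values are invented. In practice the compressed file would typically arrive through `download.file(url, destfile, mode = "wb")`.

```r
# Create a small gzip-compressed CSV locally so the example needs no network
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "wt")
writeLines(c("station,temp", "A,12.5", "B,14.1"), con)
close(con)

# A gzfile() connection lets read.csv decompress while reading
stations <- read.csv(gzfile(gz_path), stringsAsFactors = FALSE)
print(stations)
```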

Not all text files can be opened easily with read.table. The fixed-width format is still popular but requires a bit more work in R.

- Understand the fixed-width format
- Identify the main structure of the dataset
- Import the file in R
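
One possible way to import a fixed-width file is `read.fwf`, where each column is described by its width in characters. The two-line file and its layout below are hypothetical, chosen only to show how widths (and negative widths for skipped characters) map to columns.

```r
# Hypothetical fixed-width file: 5-char id, space, 4-char latitude,
# space, 2-char count (right-aligned)
fwf_path <- tempfile(fileext = ".txt")
writeLines(c("ST001 45.2 12",
             "ST002 46.8  7"), fwf_path)

# Negative widths skip that many characters (here, the separating spaces)
fwf_data <- read.fwf(fwf_path, widths = c(5, -1, 4, -1, 2),
                     col.names = c("id", "lat", "count"))
str(fwf_data)
```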

Some data files are simply too difficult to be imported with simple functions. Luckily R provides the readLines function that allows importing of even the most difficult tables.

- Understand where we need to use readLines
- Read the data in strings
- Work with strings to import the file

Most open data is generated automatically and therefore may contain NA or other values that need to be removed. R has various functions to deal with this problem.

- Reiterate the use of the readLines function
- Collect data in data frames
- Clean the dataset
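
The `readLines` workflow described in the last two videos might look like the sketch below: read the file as strings, drop the lines that are not data, split the rest into a data frame, and clean it. The file content, the `;` separator, and the `-9999` missing-value code are assumptions made for this example.

```r
# Hypothetical messy file with comment lines and a sentinel for missing values
messy_path <- tempfile(fileext = ".txt")
writeLines(c("# Station report, generated automatically",
             "# Units: degrees Celsius",
             "id;temp",
             "A;12.5",
             "B;-9999",
             "C;14.1"), messy_path)

lines <- readLines(messy_path)

# Keep only non-comment lines, then drop the header row before splitting
body <- lines[!grepl("^#", lines)]
parts <- do.call(rbind, strsplit(body[-1], ";", fixed = TRUE))

# Collect the pieces in a data frame
messy <- data.frame(id = parts[, 1], temp = as.numeric(parts[, 2]),
                    stringsAsFactors = FALSE)

# Recode the assumed sentinel value as NA and remove incomplete rows
messy$temp[messy$temp == -9999] <- NA
clean <- messy[complete.cases(messy), ]
print(clean)
```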

To follow the exercises in the book, viewers need to install several important packages. This video explains how to do that and where to find information about them.

- Check the CRAN website for info about packages
- Install/load packages in R
- Find additional information

Vector data are very popular and widespread, and they require some thought before importing. R has dedicated tools to import these data and work with them.

- Work with shapefiles
- Differences between rgdal and raster
- GeoJSON, the format for web developers

Often times, spatial data is provided in tables and needs to be transformed before it can be used for analysis. This can be done simply with the sp package.

- Check the table structure to identify coordinates
- Transform a table into a spatial object
- Plot the data to check if the process was successful

Geographical projections are very important and need to be handled carefully. R provides robust functions to do so successfully.

- Understand projections
- Identify the data projection if unknown
- Set the projection of the file

Many datasets have a temporal component and practitioners need to know how to deal with it. R provides functions to do that in a very easy way.

- Identify the time variable
- The basic Date format
- The more advanced POSIXct format
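
To make the two formats concrete, here is a small sketch with made-up timestamps: `Date` stores whole days, while `POSIXct` stores date and time together with a time zone.

```r
# Date: day-level resolution, arithmetic is in days
d <- as.Date("2016-03-15", format = "%Y-%m-%d")
later <- d + 30
format(later, "%d/%m/%Y")

# POSIXct: date plus time of day, with an explicit time zone
t1 <- as.POSIXct("2016-03-15 08:30:00", tz = "UTC")
t2 <- as.POSIXct("2016-03-15 17:45:00", tz = "UTC")
gap <- difftime(t2, t1, units = "hours")
print(gap)
```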

Raster data is fundamentally different from vector data, since its values refer to specific areas (cells) and not to single locations. This video will clearly explain this difference and teach users how to import this data in R.

- Explain what raster data is
- Importing with rgdal
- Introducing the raster package

The NetCDF format is becoming very popular, since it can store 4D datasets. Accessing it requires some technical skills, and this video will teach viewers to open and import NetCDF files.

- Gather open NetCDF data from the Web
- Understand the format
- Open it with R

Many raster datasets we download from the Web are distributed in tiles, meaning a single raster for each subset of the area. To obtain a full raster for the study area we are interested in covering, we can create a mosaic.

- Download raster DTMs
- Understand the process of mosaicking
- Create a full DTM

Mosaicking involves merging rasters based on location. Spatio-temporal datasets also include multiple rasters for the same location at different times. To merge these, we need to use the stacking function.

- Download NDVI data
- Handle the temporal component
- Create a stack dataset

Once we complete our analysis we often need to export our results and share them with colleagues. Popular formats are CSV and TXT files, which we learn how to export in this video.

- Subset a dataset
- Export in CSV
- Export in TXT

If we work with vector data and we want to share the same format with our co-workers, we need to learn how to export in vector formats. This will be covered here.

- Export ESRI shapefiles
- Understand the process
- Open our results in a GIS

Nowadays WebGIS applications are extremely popular. However, to use our data for WebGIS, we first need to export them in the correct format. This video will show how to do that.

- Export data in GeoJSON
- Export in KML
- Open our data on Google Maps

In the previous volume we explored the basic R functions and syntax to import various types of data. In this video we will put these functions together, and overcome some unexpected challenges, to import a full year of NOAA data.

- Download the raw data and import them
- Find the coordinates and merge two data.frames
- Save the cleaned dataset for later use

Before we can start analyzing our data, we first need to properly understand what we are dealing with. The first step in this direction is to describe our data with simple statistical indices.

- Measure central tendency
- Measure spread
- Summarize our data

Numerical summaries are very useful but not ideal for giving us a direct feel for the dataset at hand. Plots are much more informative, so being able to produce them is a crucial skill for data analysts.

- Download the EPA data
- Produce histograms
- Produce density plots

For multivariate data we are often interested in assessing correlation between variables. This can be done in R very easily, and ggplot2 can also be used to produce more informative plots.

- Assess multiple correlations at once
- Plot scatterplots by state
- Customize scatterplots to include 3 variables

Detecting outliers is another basic skill that every data analyst should have and master. R provides a lot of technical tools to help us in finding outliers.

- Understanding outliers
- Finding outliers with standard deviation and mean absolute deviation
- Box-plot provides another handy way to detect outliers
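
The three approaches listed above can be compared on a toy vector, as in the sketch below. The values and the cut-offs (2 standard deviations, 3 MADs) are illustrative assumptions, not rules taken from the course. Note also that base R's `mad()` computes the scaled median absolute deviation, which is used here as a stand-in.

```r
# Made-up measurements with one obviously unusual value
values <- c(10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 25.0)

# Rule 1: more than 2 standard deviations away from the mean
z <- abs(values - mean(values)) / sd(values)
out_sd <- values[z > 2]

# Rule 2: more than 3 scaled median absolute deviations from the median
m <- abs(values - median(values)) / mad(values)
out_mad <- values[m > 3]

# Rule 3: points beyond the box-plot whiskers (1.5 times the hinge spread)
out_box <- boxplot.stats(values)$out

print(list(sd = out_sd, mad = out_mad, box = out_box))
```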

This section is dedicated entirely to manipulating vector data. However, viewers first need to familiarize themselves with some basic concepts; otherwise they may not be able to understand the rest of the section.

- Understand the concept of bounding box
- Understand the concept of centroid
- Subset spatial objects by attribute

In volume 1 we learned how to set the projection of our spatial data. However, in many cases we have to change this projection to successfully complete our analysis, and this requires some specific knowledge.

- Understand that bounding boxes and centroids can be calculated for polygons too
- Re-project spatial objects
- Calculate area and perimeter of a polygon

In many cases we may be interested in understanding the relation between spatial objects. One such relation is the intersection, where we first want to know how two objects intersect, and then extract only the part of one object that lies inside or outside the other.

- Test intersections in R
- Extract only the part of the object included in the first
- Extract only the part of the object outside the first

Other important GIS operations that users have to master involve creating buffers and calculating distances between objects.

- Create a buffer around polygons
- Calculate distance between points
- Calculate distance between polygons

The last two GIS functions that anybody should master are merging different geometries and spatial objects, and overlaying them.

- Merge geometries of the same type
- Merge different geometries
- Overlay and select by location

Raster objects are imported in R as rectangular matrices. Users need to be aware of this to work properly with these data; otherwise it may create issues during the data analysis.

- Understand raster data
- Perform descriptive statistics on raster data
- Re-projecting raster data

In many cases open data are not distributed directly in raster formats and they need to be converted. This can be easily done with the right functions.

- Convert data.frames into rasters
- Convert spatial data into rasters
- Convert rasters into matrix or spatial data

Working with raster data often means extracting data for particular locations for further analysis, or cropping the data to reduce their size. These are essential skills for any data analyst.

- Extract values from rasters
- Use these data for analysis
- Clipping

Sometimes we may need to filter out some values of our raster. This may seem tricky, but it only takes a few specific skills.

- Filter temperature data
- Aggregate rasters
- Disaggregation

Creating new rasters by calculating their values is extremely important for spatial data analysis. Doing so is simple but can be difficult to understand at first.

- A simple raster calculation
- Calculate slope and aspect
- Advanced calculation with shaded relief

Syntactically, plotting spatial data in R is no different from plotting other types of data. Therefore, users need to know the basics of plotting before they can start making maps.

- Plotting symbols
- Plotting colors
- Save plots

Creating multilayer plots can be difficult because we need to take care of several different aspects at once. However, the process is easy to learn.

- Create multilayer plots
- Understanding the layer system
- Zooming and saving

When plotting spatial data we are often interested in using colors to show the values of some variables. This can be done manually but producing the right color scale may be difficult. This issue can be solved employing automatic methods.

- Creating a manual color scale
- Understanding the plotting window
- Automatic color scale

Creating multivariate plots not only means adding layers, but also using legends so that the viewer understands what the plot is showing. Creating legends in R is tricky because it requires a lot of tweaking, which will be explained here.

- Change size and add title
- Create a simple legend
- Add another legend column

Temporal data need to be treated with specific procedures to highlight this additional component. This may be done in different ways depending on the scope of the analysis and R provides the right platform for this.

- Extracting the temporal information
- Plot multiple images according to specific times
- Time distances

Being able to plot spatial data on web maps is a crucial skill to have, but it can be difficult since it requires knowledge of different technologies. R makes this process very easy, with dedicated functions that make plotting on web GIS services a breeze.

- Understand web mapping
- Mapping platforms
- Required packages

Plotting data with the function plotGoogleMaps is not as easy as using the function plot. With a simple step-by-step guide we can achieve good command of the function, so that users can plot whatever data they choose.

- Install plotGoogleMaps
- Create your first map
- Customize the plotting window

An interactive map with just one layer is hardly useful for our purposes. Many times we are faced with the challenge of plotting several data at once. This requires some additional work and understanding, but it is definitely not hard in R.

- Understand the layer system
- Add layers with the right options
- Check the result

Plotting raster data on Google Maps can be tricky. The function plotGoogleMaps does not handle rasters very well, and if not done correctly the visualization will fail. This video will show users how to plot rasters successfully.

- Download the seismic risk map
- Understand the limitations of plotting rasters on Google Maps
- Plotting rasters successfully

Plotting on Google Maps is easy, but Google Maps is a commercial product, so if we want to use it on our commercial website we would need to pay. OpenStreetMap is free to use, so knowing how to use it is certainly an advantage.

- Install leafletR
- leafletR works with GeoJSON
- Plot and customize your map

Using open data for our analysis requires a deep knowledge of the data provider and the actual data we are using. Without this knowledge we may end up with erroneous results.

- Getting to know the World Bank data
- What data are available
- Presenting the R package to download them

Downloading data from the World Bank can be difficult since it requires users to know the acronym used to refer to these data. However, with some help this process becomes very easy.

- Understanding the import process
- Search the correct indicator
- Download the data

To create a spatial map of the World Bank data we have just downloaded, we need to transform them into spatial data. However, the dataset contains no coordinates or other information that may help us do that. The solution is to use the geocoding information from another dataset for this purpose.

- Use Natural Earth data to transform our data into a spatial object
- Understand the transformation process
- Plot the results in a map

Using the World Bank data just to plot a static spatial map is very limiting. There are many other ways researchers can use these data, and this video provides some guidance on these additional avenues of research.

- Downloading more than one dataset
- Correlation analysis
- Interactive map

Executing a point pattern analysis is technically easy in R. However, it is extremely important that practitioners understand the theory behind a point pattern analysis to ensure the correctness of the results. This video illustrates this theory.

- Understand a point pattern
- Assess its spatial distribution
- Try to model the local intensity

In many cases practitioners start their analysis by applying complex statistics without even looking at their data. This is a problem that may affect the correctness of their results. This video will teach the correct order to start a point pattern analysis.

- Descriptive statistics
- Define the study area
- Transform your data into a point pattern object

Calculating the intensity and density of a point pattern can be done in many ways. Finding the best one for the dataset at hand can be challenging. The package spatstat and the literature provide some tips to do it correctly.

- Computing intensity
- Quadrat counting for local intensity
- Continuous intensity with kernel density

By looking at the plot we created in the previous videos, we started understanding the spatial distribution of our data. However, we now need to prove quantitatively that our ideas are correct.

- Test the spatial distribution
- Ripley's K function
- The G function

In many cases we may want to model a point pattern to try and explain its location intensity in a way that would allow us to predict it outside our study area. This requires a general understanding of the modelling process, which will be explained here.

- Download explanatory variables
- Formulate a hypothesis
- Test the hypothesis and validate the model

Cluster analysis is commonly used in many fields. The problem is that in order to use it correctly we need to understand the clustering process, which is what this video is about.

- Unsupervised learning
- K-Means Clustering
- Hierarchical Clustering

As in every data analysis, data preparation plays a crucial role in guaranteeing success. This video will prepare the data to be used for clustering.

- Download USGS data
- Cleaning the data
- Calculating the distance matrix

Clustering algorithms are extremely simple to apply. The challenge is to interpret their results and understand what the algorithm is telling us about our data.

- Understanding the Euclidean Distance
- Apply the k-means algorithm
- Interpret its result
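
A minimal sketch of these steps with base R's `kmeans` is shown below. The data are two synthetic, well-separated groups generated for this example (not the USGS data used in the course), so the "right" answer of two clusters is known in advance.

```r
set.seed(42)

# Two synthetic groups of 20 points each, centred near (0, 0) and (5, 5)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2)
)

# Euclidean distance between the first point of each group
dist(pts[c(1, 21), ], method = "euclidean")

# k-means with 2 clusters; nstart repeats the random initialisation
km <- kmeans(pts, centers = 2, nstart = 10)

# Interpret the result: cluster sizes and cluster centres
table(km$cluster)
km$centers
```

With real data the groups are rarely this clean, which is why interpreting the cluster centres and sizes matters more than running the function itself.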

When applying the k-means algorithm, we need to specify the number of clusters into which we want our dataset to be divided. However, since it is often used as an exploratory technique, we may not know the optimal number of clusters.

- Understanding similarities
- Define the optimal number of clusters
- Scaling

Hierarchical clustering allows us to see how all of our points are related to each other with a bottom-up approach. However, determining the optimal number of clusters is not so trivial with this method.

- Basic code for hierarchical clustering
- Aggregation methods
- Interpretation
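
The basic code for hierarchical clustering can be sketched with `hclust` and `cutree` from base R. The points are again synthetic, and the choice of the "average" aggregation method is only one of several options, picked here for illustration.

```r
set.seed(42)

# Two synthetic groups of 10 points each
hpts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2)
)

# Bottom-up clustering on the Euclidean distance matrix;
# `method` selects the aggregation (linkage) rule
hc <- hclust(dist(hpts), method = "average")

# Cut the tree into 2 groups and inspect their sizes
groups <- cutree(hc, k = 2)
table(groups)
```

Calling `plot(hc)` draws the dendrogram, which is the usual starting point for deciding where to cut the tree.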

Determining the best clustering algorithm for our data is probably the most challenging part of such an analysis. This video will show the sort of reasoning users will need to make that decision.

- Review the clustering code in R
- Plot clusters on maps
- Investigate differences between algorithms

Time series analysis is another important technique to master. However, it requires some specific knowledge to understand the process and what this technique can actually do.

- Purpose of time-series analysis
- Correlation
- Forecasting

Time-series can be imported and analyzed using two formats: ts and xts. Both have their pros and cons and users need to be able to master both if they want to perform the best time-series analysis.

- The ts class
- Descriptive statistics and plotting
- The package xts, pros and cons

Dealing with time-series sometimes means extracting data according to their location along the timeline. This can be done in R but requires some explanation to do it correctly.

- Subsetting ts and xts objects
- Quantify temporal changes
- Temporal functions

Another important aspect of time-series analysis is decomposition and correlation. This allows us to draw important conclusions about our data. Technically this is not difficult to do, but it requires careful consideration if we want to do it right.

- Linear trend
- Decomposition
- Autocorrelation and cross-correlation
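
As an illustration of these three ideas, the sketch below builds a synthetic monthly series (a linear trend plus a yearly cycle plus noise, all invented for this example) and applies `lm`, `decompose`, and `acf` from base R.

```r
set.seed(1)

# Synthetic monthly series: trend + 12-month seasonal cycle + noise
n <- 48
month_index <- seq_len(n)
y <- 0.5 * month_index + 10 * sin(2 * pi * month_index / 12) + rnorm(n)
series <- ts(y, start = c(2012, 1), frequency = 12)

# Linear trend: regress the values on the time index
trend_fit <- lm(y ~ month_index)
coef(trend_fit)

# Classical additive decomposition into trend, seasonal and random parts
parts <- decompose(series, type = "additive")
round(parts$figure, 2)   # estimated seasonal effect for each month

# Autocorrelation up to 24 months, computed without plotting
acf_vals <- acf(series, lag.max = 24, plot = FALSE)
```

Cross-correlation between two series follows the same pattern with `ccf(x, y)`.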

The final step of time-series analysis is forecasting, where we try to simulate future events. This is extremely useful but requires adequate knowledge of the methods available, their pros and cons.

- Simple forecasting methods
- ARIMA model
- Validation
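
A hedged sketch of the fit, forecast, and validate cycle is given below, using a simulated AR(1) series rather than real data. The model order, the 100/20 split, and the use of RMSE as the validation measure are assumptions made for this example.

```r
set.seed(7)

# Simulate 120 observations from an AR(1) process with coefficient 0.7
sim <- arima.sim(model = list(ar = 0.7), n = 120)

# Hold out the last 20 observations for validation
train <- window(sim, end = 100)
holdout <- window(sim, start = 101)

# Fit an ARIMA(1, 0, 0) model and forecast 20 steps ahead
fit <- arima(train, order = c(1, 0, 0))
fc <- predict(fit, n.ahead = 20)

# Validation: root mean squared error against the held-out values
rmse <- sqrt(mean((holdout - fc$pred)^2))
print(rmse)
```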

There are numerous geostatistical interpolation techniques that can be used to map environmental data. Kriging is probably the most famous, but it is not the only one available. It is important to know every technique to understand where to use what.

- Deterministic estimators
- Variogram
- Kriging estimator

The first challenge of any geostatistical analysis is the data preparation. We cannot just download data, but we need to clean them and prepare them for analysis.

- Download data from EPA
- Extract data only from mainland US
- Clean the dataset

Simple interpolation is easy to use and easy to interpret, therefore it is still commonly used. The package gstat allows us to use inverse distance, but to do so we need to follow some simple but precise rules.

- Understanding gstat
- Cross-Validation
- Mapping

Before we can interpolate our data using kriging, we need to take care of some important steps. For example, we need to check if our data has a trend and then test for normality, because kriging can only be applied to normally distributed data.

- Check trend with the linear model
- Check normality
- Test common transformations

The variogram is the keystone of kriging interpolation, and users need to know how to compute it and fit a model to it. These steps require careful consideration, which we are going to explore here.

- Variogram cloud
- Variogram model and anisotropy
- Fitting a model automatically

In this video, all concepts learned previously will be merged to perform a kriging interpolation. The problem in this case is making sure that everything works correctly and the process is smooth.

- Trend, transformation, variogram, and anisotropy
- Cross-validation and back-transformation
- Mapping

There are numerous statistical learning algorithms that can be used to map environmental data. It is important to know every technique to understand where to use what.

- Linear models
- Regression trees
- Support vector machines

As with any data analysis, getting to know our data is the first and most important step. This can be done by looking at the data provider and using some exploratory techniques.

- The housing dataset
- Exploratory analysis
- Correlation analysis

Many users start a data analysis by testing complex methods. This is a problem though, because many times a simpler method can help us better understand our data. This video shows how to fit these simple models.

- Linear regressions
- Ridge Regression
- LASSO

The support vector machine is another important algorithm that is sometimes difficult to train. In this video we will look at the methods in the package caret to do that using an additional cross-validation.

- Support vector machine
- Kernel options
- Training with caret

### Mastering Data Analysis with R

This video will give an overview of the entire course.

The aim of this video is to introduce R/RStudio to those using it for the first time.

- Install R and RStudio
- Explore the R/RStudio interface and understand the basics of working with data
- Illustrate steps of working with R using car failure data

The aim of this video is to introduce commonly used visualization tools in R.

- Get introduced to commonly used visualizations that are done for qualitative data
- Get introduced to commonly used visualizations that are done for quantitative data
- Illustrate the steps with vehicle failure data

The aim of this video is to introduce the interactive visualization package “plotly” in R.

- Get introduced to the interactive visualization package, plotly
- Know how to make interactive scatter plots, box plots, histograms, and pie charts
- View interpretation and how to use the charts

The aim of this video is to introduce the “googleVis” package in R.

- Get introduced to the googleVis package
- Learn how to make a geographic map for the world using the googleVis package
- Learn how to make a geographic map for USA using the googleVis package

The aim of this video is to introduce visualization with ggplot2, d3heatmap, and googleVis packages.

- View color-coded scatter plots and histograms including facets using the ggplot2 package
- View heatmap using the d3heatmap package
- View a motion chart using the googleVis package

The aim of this video is to introduce the idea of regression, logistic regression, and data partitioning.

- Describe the situation when output variable is numeric or factor
- Provide examples
- Describe data partitioning and its purpose

The aim of this video is to introduce data partitioning.

- View the steps for reading data and preparing it for analysis
- Learn how to prepare data for analysis by addressing missing data
- Illustrate the steps for data partitioning

The aim of this video is to present steps for multiple linear regression.

- Get introduced to the steps for multiple linear regression
- Learn how to do plots for model diagnostics
- Learn how to make predictions using the model

The aim of this video is to introduce multicollinearity issues with regression models.

- Explain what multicollinearity is
- Learn how VIF helps to assess multicollinearity
- Learn how to assess multicollinearity with the help of the faraway package in R
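
To show what VIF measures, the sketch below computes it by hand from its definition, 1 / (1 - R²), where R² comes from regressing each predictor on the others. This uses base R only and synthetic predictors; it is not the faraway package's function, and the name `vif_manual` is made up for this example.

```r
set.seed(3)

# Synthetic predictors: x2 is nearly a copy of x1, x3 is unrelated
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
predictors <- data.frame(x1, x2, x3)

# VIF for each predictor: regress it on the remaining predictors
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  r2 <- summary(lm(reformulate(others, response = v),
                   data = predictors))$r.squared
  1 / (1 - r2)
})
round(vif_manual, 1)
```

Large values for `x1` and `x2` flag the collinear pair, while `x3` stays close to 1.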

The aim of this video is to introduce logistic regression using R.

- Describe data and prepare it for model building
- Show steps for developing the model
- Show how to identify significant variables and make adjustments to the model

The aim of this video is to provide a logistic model interpretation.

- Learn how to write the logistic regression model equation
- Use admit data to illustrate the model
- Provide interpretation of coefficients

The aim of this video is to show calculation for confusion matrix and misclassification error.

- View the steps to create confusion matrix for training and testing data
- View a discussion on the meaning of confusion matrix
- Learn how to calculate the misclassification error
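
The calculation can be sketched as follows with a logistic model fitted by `glm`. The data are simulated stand-ins (a single `score` predictor and a binary `admit` outcome), not the admit dataset used in the video, and the 0.5 cut-off is an assumption.

```r
set.seed(11)

# Simulated data: the chance of admit = 1 rises with score
n <- 200
score <- rnorm(n)
admit <- rbinom(n, size = 1, prob = plogis(1.5 * score))
dat <- data.frame(score, admit)

# Partition: 160 rows for training, 40 for testing
idx <- sample(seq_len(n), size = 160)
train_set <- dat[idx, ]
test_set <- dat[-idx, ]

# Fit the logistic regression and predict probabilities on the test data
model <- glm(admit ~ score, data = train_set, family = binomial)
p <- predict(model, newdata = test_set, type = "response")
pred <- ifelse(p > 0.5, 1, 0)

# Confusion matrix (fixed levels keep it 2 x 2) and misclassification error
cm <- table(Predicted = factor(pred, levels = c(0, 1)),
            Actual = factor(test_set$admit, levels = c(0, 1)))
misclass <- 1 - sum(diag(cm)) / sum(cm)
print(cm)
print(misclass)
```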

The aim of this video is to show how to create ROC curves in R.

- View steps to predict probabilities needed for ROC curves
- Learn how to extract sensitivity and specificity information needed for ROC curves
- View how to create ROC curves
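
To see where the points of an ROC curve come from, the sketch below computes sensitivity and specificity by hand over a range of thresholds. The probabilities and labels are made up for this example, and dedicated packages would normally do this work. The area under the curve is obtained with the trapezoid rule.

```r
# Hypothetical predicted probabilities and true labels (1 = positive)
prob  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
label <- c(1,   1,   0,   1,   0,    1,   0,   0)

# Thresholds from strict to lenient, including both extremes
thresholds <- sort(unique(c(0, prob, 1)), decreasing = TRUE)

# Sensitivity: share of positives predicted positive at each threshold
sens <- sapply(thresholds, function(t) mean(prob[label == 1] >= t))
# Specificity: share of negatives predicted negative at each threshold
spec <- sapply(thresholds, function(t) mean(prob[label == 0] < t))

# ROC curve: sensitivity against the false positive rate
fpr <- 1 - spec
plot(fpr, sens, type = "s", xlab = "1 - Specificity", ylab = "Sensitivity")

# Area under the curve by the trapezoid rule
auc <- sum(diff(fpr) * (head(sens, -1) + tail(sens, -1)) / 2)
print(auc)
```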

The aim of this video is to provide an overall view of prediction and model assessment.

- Show why 80% accuracy for two models is not the same
- Show why baseline also matters
- Illustrate the idea using admit data

The aim of this video is to introduce multinomial logistic regression using R.

- Describe data and prepare it for model building
- View the steps for developing the model
- View how to identify significant variables and make adjustments to the model

The aim of this video is to provide the interpretation to the multinomial logistic model.

- Learn how to write the multinomial logistic regression model equation
- Use CTG data to illustrate the model
- Provide the interpretation of coefficients

The aim of this video is to show calculation for confusion matrix and misclassification error.

- View the steps to create confusion matrix for training and testing data
- View a discussion on the meaning of confusion matrix
- Learn how to calculate the misclassification error

The aim of this video is to provide an overall view of prediction and model assessment.

- Explain why we need to look at prediction accuracy within each category
- Explain when we can say that model prediction accuracy is good
- Illustrate the idea using CTG data

The aim of this video is to introduce ordinal logistic regression using R.

- Describe data and prepare it for model building
- Show steps for developing the model
- Show how to identify significant variables and make adjustments to the model

The aim of this video is to provide ordinal logistic model interpretation.

- View how to write the ordinal logistic regression model equation
- Use CTG data to illustrate the model
- Provide interpretation of coefficients

The aim of this video is to show calculation for the confusion matrix and misclassification error.

- View the steps to create confusion matrix for training and testing data
- View a discussion on the meaning of confusion matrix
- Learn how to calculate the misclassification error