Learning Path: R: Powerful Data Analysis with R
An increasing amount of data is being produced every day. This has led to demand for skilled professionals who can analyze these data and make decisions. R is one of the most popular tools, widely used by data analysts for performing data analysis on real-world data.
This Learning Path is a complete learning process for working with data. You will start with the most basic importing techniques for downloading compressed data from the Web. You will then be introduced to how CRAN works and why you should use it.
Next, you will learn to create static plots. Then, you will understand how to plot spatial data on interactive web platforms such as Google Maps and OpenStreetMap.
You will learn advanced data analysis concepts such as cluster analysis, time-series analysis, association mining, PCA, handling missing data, sentiment analysis, spatial data analysis with R and QGIS, and advanced data visualization with R's ggplot2 library.
Finally, you will apply the topics learned so far to analyze real-world datasets from various industry sectors.
By the end of this Learning Path, you will have learned how to perform data analysis on real-world data.
For this course, we have combined the best works of these esteemed authors:
Fabio Veronesi
Fabio Veronesi obtained a Ph.D. in digital soil mapping from Cranfield University and then moved to ETH Zurich, where he has been working for the past three years as a postdoc. In his career, Dr. Veronesi has worked on several topics related to environmental research: digital soil mapping, cartography and shaded relief, renewable energy, and transmission line siting. During this time, Dr. Veronesi specialized in the application of spatial statistical techniques to environmental data.
Dr. Bharatendra Rai
Dr. Bharatendra Rai is Professor of Business Statistics and Operations Management in the Charlton College of Business at UMass Dartmouth. He teaches courses on topics such as Analyzing Big Data, Business Analytics and Data Mining, Twitter and Text Analytics, Applied Decision Techniques, Operations Management, and Data Science for Business.

1. The Course Overview
This video provides an overview of the entire course.

2. Importing Data from Tables (read.table)
Accessing and importing open access environmental data is a crucial skill for data scientists. This section teaches you how to download data from the Web, import it into R, and check it for consistency.
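
As a minimal sketch of this kind of import and consistency check (the inline sample data and column names are hypothetical, used here so the example runs without a download):

```r
# Hypothetical sample standing in for a downloaded table.
raw <- "station,date,temp
A1,2015-01-01,3.2
A2,2015-01-01,NA
A1,2015-01-02,4.1"

# read.table() accepts a 'text' argument, so no file or network access is needed.
obs <- read.table(text = raw, header = TRUE, sep = ",",
                  stringsAsFactors = FALSE)

# Basic consistency checks: column types, dimensions, missing values per column.
str(obs)
dim(obs)
colSums(is.na(obs))
```

With a real file, the `text` argument would be replaced by a file path or URL.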

3. Downloading Open Data from FTP Sites
Datasets are often provided for free, but on FTP sites, and practitioners need to be able to access them. R is perfectly capable of downloading and importing data from FTP sites.

4. Fixed-Width Format
Not all text files can be opened easily with read.table. The fixed-width format is still popular but requires a bit more work in R.
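
A small sketch of reading a fixed-width table with base R's read.fwf follows. The two sample records and the column layout are assumptions made up for illustration:

```r
# Hypothetical fixed-width records: 2-char station, space, 4-char year, space, 3-char value.
raw <- c("A1 2015 3.2",
         "A2 2015 4.1")

# Positive widths are fields to read; negative widths are characters to skip.
fw <- read.fwf(textConnection(raw),
               widths = c(2, -1, 4, -1, 3),
               col.names = c("station", "year", "temp"),
               stringsAsFactors = FALSE)
print(fw)
```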

5. Importing with readLines (The Last Resort)
Some data files are simply too difficult to import with simple functions. Luckily, R provides the readLines function, which allows importing even the most difficult tables.
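
The following sketch shows one way readLines can be combined with pattern matching to rescue an irregular file. The file layout (comment lines, semicolon-separated records, a trailer line) is a made-up assumption:

```r
# Hypothetical messy file: header comments, data records, and a trailer line.
raw <- c("# Station report (hypothetical format)",
         "# generated automatically",
         "A1;3.2;ok",
         "A2;4.1;ok",
         "END OF FILE")

# Read everything as plain text, one element per line.
lines <- readLines(textConnection(raw))

# Keep only lines that look like data records, then split them into fields.
data_lines <- grep("^[A-Z][0-9];", lines, value = TRUE)
parts <- strsplit(data_lines, ";", fixed = TRUE)

tab <- data.frame(station = sapply(parts, `[`, 1),
                  temp    = as.numeric(sapply(parts, `[`, 2)),
                  flag    = sapply(parts, `[`, 3),
                  stringsAsFactors = FALSE)
print(tab)
```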

6. Cleaning Your Data
Most open data is generated automatically and therefore may contain NA or other values that need to be removed. R has various functions to deal with this problem.
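
A brief sketch of such cleaning with base R is shown below. The sample values are hypothetical, and treating -9999 as a missing-data code is an assumption for this example, not something stated by any particular data provider:

```r
# Hypothetical observations with an NA and a sentinel value.
obs <- data.frame(station = c("A1", "A2", "A3", "A4"),
                  temp    = c(3.2, -9999, NA, 4.1))

# Assumption: -9999 marks missing data. which() avoids indexing with NA.
obs$temp[which(obs$temp == -9999)] <- NA

# Count missing values, then drop incomplete rows.
n_missing <- sum(is.na(obs$temp))
clean <- na.omit(obs)

# Alternatively, keep the rows and ignore NA values when summarizing.
mean_temp <- mean(obs$temp, na.rm = TRUE)
```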

7. Loading the Required Packages
To follow the exercises in the book, viewers need to install several important packages. This video explains how to do so and where to find information about them.

8. Importing Vector Data (ESRI shp and GeoJSON)
Vector data are very popular and widespread, and require some thought before importing. R has dedicated tools to import these data and work with them.

9. Transforming from data.frame to SpatialPointsDataFrame
Spatial data is often provided in tables and needs to be transformed before it can be used for analysis. This can be done simply with the sp package.

10. Understanding Projections
Geographical projections are very important and need to be handled carefully. R provides robust functions to do so successfully.

11. Basic Time/Date Formats
Many datasets have a temporal component, and practitioners need to know how to deal with it. R provides functions to do that in a very easy way.
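
As an illustrative sketch of base R's date and time handling (the specific dates and the input format string are arbitrary examples):

```r
# Parse a date written as day/month/year into a Date object.
d <- as.Date("31/12/2015", format = "%d/%m/%Y")
iso <- format(d, "%Y-%m-%d")

# Date-times use POSIXct; arithmetic is in seconds.
dt <- as.POSIXct("2015-12-31 23:30:00", tz = "UTC")
one_hour_later <- dt + 3600

# Differences between dates are expressed in days.
diff_days <- as.numeric(as.Date("2016-01-10") - d)
```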

12. Introducing the Raster Format
Raster data is fundamentally different from vector data, since its values refer to specific areas (cells) and not single locations. This video will clearly explain this difference and teach users how to import this data into R.

13. Reading Raster Data in NetCDF
The NetCDF format is becoming very popular, since it allows storing 4D datasets. Accessing it requires some technical skill, and this video will teach viewers to open and import NetCDF files.

14. Mosaicking
Many raster datasets we download from the Web are distributed in tiles, meaning a single raster for each subset of the area. To obtain a full raster for the study area we are interested in covering, we can create a mosaic.

15. Stacking to Include the Temporal Component
Mosaicking involves merging rasters based on location. Spatiotemporal datasets also include multiple rasters for the same location but different times. To merge these we need to use the stacking function.

16. Exporting Data in Tables
Once we complete our analysis, we often need to export our results and share them with colleagues. Popular formats are CSV and TXT files, which we learn how to export in this video.

17. Exporting Vector Data (ESRI shp File)
If we work with vector data and want to share the same format with our coworkers, we need to learn how to export in vector formats. This will be covered here.

18. Exporting Rasters in Various Formats (GeoTIFF, ASCII Grids)

19. Exporting Data for WebGIS Systems (GeoJSON, KML)
Nowadays WebGIS applications are extremely popular. However, to use our data for WebGIS, we first need to export them in the correct format. This video will show how to do that.

20. Preparing the Dataset
In the previous volume we explored the basic R functions and syntax for importing various types of data. In this video we will put these functions together, and overcome some unexpected challenges, to import a full year of NOAA data.

21. Measuring Spread (Standard Deviation and Standard Distance)
Before we can start analyzing our data, we first need to properly understand what we are dealing with. The first step in this direction is to describe our data with simple statistical indexes.
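
A small sketch of both measures on hypothetical coordinates follows. The standard distance is computed here under the assumption of the population form, i.e. the square root of the sum of the mean squared deviations in x and y; other definitions divide by n - 1 instead:

```r
# Hypothetical point coordinates (corners of a 2 x 2 square).
x <- c(0, 2, 0, 2)
y <- c(0, 0, 2, 2)

# Standard deviation of a single variable (sample form, divides by n - 1).
sd_x <- sd(x)

# Standard distance: spread of the points around their mean center
# (assumed population form, divides by n).
std_dist <- sqrt(mean((x - mean(x))^2) + mean((y - mean(y))^2))
```

For these four points the mean center is (1, 1) and every point lies at distance sqrt(2) from it, which is also the standard distance.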

22. Understanding Your Data with Plots
Numerical summaries are very useful but certainly not ideal for giving us a direct feeling for the dataset at hand. Plots are much more informative, and thus being able to produce them is a crucial skill for data analysts.

23. Plotting for Multivariate Data
For multivariate data we are often interested in assessing correlation between variables. This can be done in R very easily, and ggplot2 can also be used to produce more informative plots.

24. Finding Outliers
Detecting outliers is another basic skill that every data analyst should have and master. R provides a lot of technical tools to help us in finding outliers.
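
One of the simplest such tools is the box-plot rule, sketched below on made-up values. This is only one possible approach and not necessarily the one used in the video:

```r
# Hypothetical measurements with one suspicious value.
v <- c(10, 11, 12, 11, 10, 12, 11, 50)

# boxplot.stats() flags values lying more than 1.5 times the
# interquartile spread beyond the hinges.
out <- boxplot.stats(v)$out
print(out)
```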

25. Introduction
This section will be dedicated entirely to manipulating vector data. However, viewers first need to familiarize themselves with some basic concepts, otherwise they may not be able to understand the rest of the section.

26. Re-Projecting Your Data
In volume 1 we learned how to set the projection of our spatial data. However, in many cases we have to change this projection to successfully complete our analysis, and this requires some specific knowledge.

27. Intersection
In many cases we may be interested in understanding the relation between spatial objects. One such relation is the intersection, where we first want to know how two objects intersect, and then extract only the part of one of these objects that lies inside or outside the other.

28. Buffer and Distance
Other important GIS operations that users have to master involve creating buffers and calculating distances between objects.

29. Union and Overlay
The last two GIS functions that anybody should master are used to merge and overlay different geometries and spatial objects.

30. Introduction
Raster objects are imported in R as rectangular matrices. Users need to be aware of this to work properly on these data, otherwise it may create issues during the data analysis.

31. Converting Vector/Table Data into Raster
In many cases open data are not distributed directly in raster formats and need to be converted. This can be easily done with the right functions.

32. Subsetting and Selection
Working with raster data often means extracting data for particular locations for further analysis, or cropping the data to reduce their size. These are essential skills for any data analyst to master.

33. Filtering
Sometimes we may need to filter out some values of our raster. This may seem tricky at first, but it only requires a few specific skills.

34. Raster Calculator
Creating new rasters by calculating their values is extremely important for spatial data analysis. Doing so is simple but can be difficult to understand at first.

35. Plotting Basics
Syntactically, plotting spatial data in R is no different from plotting other types of data. Therefore, users need to know the basics of plotting before they can start making maps.

36. Adding Layers
Creating a multilayer plot can be difficult because we need to take care of several different aspects at once. However, it is easy to learn.

37. Color Scale
When plotting spatial data we are often interested in using colors to show the values of some variables. This can be done manually, but producing the right color scale may be difficult. This issue can be solved by employing automatic methods.

38. Creating Multivariate Plots
Creating multivariate plots not only means adding layers, but also using legends so that the viewer understands what the plot is showing. Creating legends in R is tricky because it requires a lot of tweaking, which will be explained here.

39. Handling the Temporal Component
Temporal data need to be treated with specific procedures to highlight this additional component. This may be done in different ways depending on the scope of the analysis, and R provides the right platform for this.

40. Introduction
Being able to plot spatial data on web maps is a crucial skill to have, but it can be difficult since it requires knowledge of different technologies. R makes this process very easy, with dedicated functions that make plotting on web GIS services a breeze.

41. Plotting Vector Data on Google Maps
Plotting data with the function plotGoogleMaps is not as easy as using the function plot. With a simple step-by-step guide we can achieve good command of the function, so that users can plot whatever data they choose.

42. Adding Layers
An interactive map with just one layer is hardly useful for our purposes. Many times we are faced with the challenge of plotting several datasets at once. This requires some additional work and understanding, but it is definitely not hard in R.

43. Plotting Raster Data on Google Maps
Plotting raster data on Google Maps can be tricky. The function plotGoogleMaps does not handle rasters very well, and if this is not done correctly the visualization will fail. This video will show users how to plot rasters successfully.

44. Using Leaflet to Plot on OpenStreetMap
Plotting on Google Maps is easy, but Google Maps is a commercial product, so if we want to use it on our commercial website we would need to pay. OpenStreetMap is free to use, so knowing how to use it is certainly an advantage.

45. Introduction
Using open data for our analysis requires a deep knowledge of the data provider and the actual data we are using. Without this knowledge we may end up with erroneous results.

46. Importing Data from the World Bank
Downloading data from the World Bank can be difficult since it requires users to know the acronyms used to refer to these data. However, with some help this process becomes very easy.

47. Adding Geocoding Information
To create a spatial map of the World Bank data we have just downloaded, we need to transform them into spatial data. However, the dataset contains no coordinates or other information that may help us do that. The solution is to use the geocoding information from another dataset for this purpose.

48. Concluding Remarks
Using the World Bank data just to plot a static spatial map is very limiting. There are many other uses that researchers can make of these data, and this video provides some guidance on these additional avenues of research.

49. Theoretical Background
Executing a point pattern analysis is technically easy in R. However, it is extremely important that practitioners understand the theory behind a point pattern analysis to ensure the correctness of the results. This video illustrates this theory.

50. Introduction
In many cases practitioners start their analysis by applying complex statistics without even looking at their data. This is a problem that may affect the correctness of their results. This video will teach the correct order in which to start a point pattern analysis.

51. Intensity and Density
Calculating the intensity and density of a point pattern can be done in many ways. Finding the best one for the dataset at hand can be challenging. The package spatstat and the literature provide some tips for doing it correctly.

52. Spatial Distribution
By looking at the plot we created in the previous videos, we started to understand the spatial distribution of our data. However, we now need to prove quantitatively that our ideas are correct.

53. Modelling
In many cases we may want to model a point pattern to try to explain its location intensity in a way that would allow us to predict it outside our study area. This requires a general understanding of the modelling process, which will be explained here.

54. Theoretical Background
Cluster analysis is commonly used in many fields. The problem is that in order to use it correctly we need to understand the clustering process, which is what this video is about.

55. Data Preparation
As in every data analysis, data preparation plays a crucial role in guaranteeing success. This video will prepare the data to be used for clustering.

56. K-Means Clustering
Clustering algorithms are extremely simple to apply. The challenge is to interpret their results and understand what the algorithm is telling us in terms of insights into our data.

57. Optimal Number of Clusters
When applying the k-means algorithm we need to specify the number of clusters into which we want our dataset to be divided. However, since it is often used as an exploratory technique, we may not know the optimal number of clusters.
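
One common heuristic for this choice is the "elbow" method, sketched below on simulated data. The three synthetic groups are an assumption for illustration, and the video may use a different criterion:

```r
set.seed(42)

# Simulated 2D points forming three well-separated groups (hypothetical data).
pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 3, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 6, sd = 0.3), ncol = 2))

# Total within-cluster sum of squares for k = 1..6.
# nstart = 20 reruns k-means from several random starts to avoid poor local optima.
wss <- sapply(1:6, function(k) {
  kmeans(pts, centers = k, nstart = 20)$tot.withinss
})
print(round(wss, 1))
```

Plotting `wss` against the number of clusters typically shows a sharp bend (the elbow) where adding further clusters stops reducing the within-cluster variation by much.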

58. Hierarchical Clustering
Hierarchical clustering allows us to see how all of our points are related to each other with a bottom-up approach. However, determining the optimal number of clusters is not so trivial with this method.
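
A minimal sketch of the bottom-up approach with base R follows. The six points and the choice of average linkage are assumptions made for illustration:

```r
# Hypothetical points forming two obvious groups.
pts <- rbind(c(0, 0), c(0, 1), c(1, 0),
             c(10, 10), c(10, 11), c(11, 10))

# Build the tree from pairwise distances, merging the closest groups first.
hc <- hclust(dist(pts), method = "average")

# Cut the tree into a chosen number of clusters.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` draws the dendrogram, which is usually inspected to decide where to cut the tree.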

59. Concluding
Determining the best clustering algorithm for our data is probably the most challenging part of such an analysis. This video will show the sort of reasoning users will need to make that decision.

60. Theoretical Background
Time-series analysis is another important technique to master. However, it requires some specific knowledge to understand the process and what this technique can actually do.

61. Reading Time-Series in R
Time series can be imported and analyzed using two formats: ts and xts. Both have their pros and cons, and users need to master both if they want to perform the best time-series analysis.

62. Subsetting and Temporal Functions
Dealing with time series sometimes means extracting data according to their location along the timeline. This can be done in R but requires some explanation to do correctly.

63. Decomposition and Correlation
Another important aspect of time-series analysis is decomposition and correlation. This allows us to draw important conclusions about our data. Technically this is not difficult to do, but it requires careful consideration if we want to do it right.
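
As a hedged sketch of both steps with base R, using a simulated monthly series (the trend, seasonal amplitude, and noise level are arbitrary assumptions):

```r
set.seed(1)

# Simulated monthly series: linear trend + yearly cycle + noise (hypothetical data).
m <- 1:48
y <- ts(10 + 0.1 * m + 2 * sin(2 * pi * m / 12) + rnorm(48, sd = 0.2),
        start = c(2012, 1), frequency = 12)

# Classical additive decomposition into trend, seasonal, and random components.
dec <- decompose(y)

# Autocorrelation function, computed without drawing the plot.
ac <- acf(y, plot = FALSE)
```

`plot(dec)` shows the components stacked in one figure; a strong peak in `ac` around lag 12 months would reflect the yearly cycle.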

64. Forecasting
The final step of time-series analysis is forecasting, where we try to simulate future events. This is extremely useful but requires adequate knowledge of the methods available and their pros and cons.

65. Theoretical Background
There are numerous geostatistical interpolation techniques that can be used to map environmental data. Kriging is probably the most famous, but it is not the only one available. It is important to know every technique to understand where to use each.

66. Data Preparation
The first challenge of any geostatistical analysis is data preparation. We cannot just download data; we need to clean them and prepare them for analysis.

67. Mapping with Deterministic Estimators
Simple interpolation is easy to use and easy to interpret, so it is still commonly used. The package gstat allows us to use inverse distance, but to do so we need to follow some simple but precise rules.

68. Analyzing Trend and Checking Normality
Before we can interpolate our data using kriging, we need to take care of some important steps. For example, we need to check if our data has a trend and then test for normality, because kriging can only be applied to normally distributed data.

69. Variogram Analysis
The variogram is the keystone of kriging interpolation, and users need to know how to compute it and fit a model to it. These steps require careful consideration, which we are going to explore here.

70. Mapping with Kriging
In this video, all concepts learned previously will be merged to perform a kriging interpolation. The challenge in this case is making sure that everything works correctly and the process is smooth.

71. Theoretical Background
There are numerous statistical learning algorithms that can be used to map environmental data. It is important to know every technique to understand where to use each.

72. Dataset
Once again, getting to know our data is the most important thing we need to do when we start a data analysis. This can be done by looking at the data provider and using some exploratory techniques.

73. Linear Regression
Many users start a data analysis by testing complex methods. This is a problem, though, because many times a simpler method can help us better understand our data. This video shows how to fit these simple models.
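
A minimal sketch of fitting such a simple model with base R's lm is given below. The simulated data and the true coefficients (intercept 2, slope 0.5) are assumptions chosen for illustration:

```r
set.seed(7)

# Simulated predictor and response with a known linear relationship (hypothetical data).
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = 0.1)
d <- data.frame(x = x, y = y)

# Fit the linear model and inspect coefficients and fit quality.
fit <- lm(y ~ x, data = d)
print(coef(fit))
r2 <- summary(fit)$r.squared

# Predict the response for new values of the predictor.
pred <- predict(fit, newdata = data.frame(x = c(21, 22)))
```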

74. Regression Trees
Regression trees are extremely powerful algorithms, but they are sometimes considered black boxes. This is a problem because only expert users can understand their output. This may change simply by understanding how these algorithms work.

75. Support Vector Machines
The support vector machine is another important algorithm that is sometimes difficult to train. In this video we will look at the methods in the package caret for doing so using an additional cross-validation.

76. Test Your Knowledge