- Use the tool of your choice (RStudio, Excel, Python) to generate a Word document with a basic analysis of the data set posted in the Week 2 content folder.
- Create a Word document that includes the screen shots described below.
Questions/Requests:
- Create a summary of statistics for the dataset. (provide a screen shot)
- Create a correlation matrix for the dataset. (provide a screen shot)
- What is the Min, Max, Median, and Mean of the Price? (provide a screen shot)
- What are the correlation values among Price, Ram, and Ads? (provide a screen shot)
- Create a subset of the dataset with only Price, CD, and Premium. (provide a screen shot)
- Create a subset of the dataset with only Price, HD, and Ram where Price is greater than or equal to $1750. (provide a screen shot)
- What percentage of Premium computers were sold? (provide a screen shot) (Hint: Categorical analysis)
- How many Premium computers with CDs were sold? (provide a screen shot) (Hint: Contingency table analysis)
- How many Premium computers with CDs priced over $2000 were sold? (provide a screen shot) (Hint: Conditional table analysis)
Your document should be an easy-to-read font in MS Word. Your cover page should contain the following: Title, Student’s name, University’s name, Course name, Course number, Professor’s name, and Date.
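The requests above map onto a handful of base R calls. The following is a minimal sketch, assuming the Week 2 file loads into a data frame with columns named Price, Ram, HD, Ads, CD, and Premium. The file name and the six rows are invented placeholders (they are not the course data), so the printed numbers mean nothing; only the calls carry over.

```r
# Hypothetical stand-in for the Week 2 data set; replace with something like
# computers <- read.csv("your_file.csv")   # file name is an assumption
computers <- data.frame(
  Price   = c(1499, 1795, 2250, 1995, 2575, 1650),
  Ram     = c(4, 8, 16, 8, 16, 4),
  HD      = c(80, 210, 340, 250, 528, 120),
  Ads     = c(94, 94, 120, 158, 176, 94),
  CD      = c("no", "yes", "yes", "no", "yes", "no"),
  Premium = c("yes", "yes", "no", "yes", "yes", "no"),
  stringsAsFactors = TRUE
)

# Summary statistics for every column (Min, Max, Median, Mean for numerics)
summary(computers)
price_stats <- summary(computers$Price)

# Correlation matrix of the numeric columns, then of three chosen columns
num_cols <- sapply(computers, is.numeric)
cor_all  <- cor(computers[, num_cols])
cor_pra  <- cor(computers[, c("Price", "Ram", "Ads")])

# Subsets: selected columns, then selected columns with a row condition
sub_cols  <- computers[, c("Price", "CD", "Premium")]
sub_price <- subset(computers, Price >= 1750, select = c(Price, HD, Ram))

# Categorical analysis: share of each Premium level, in percent
premium_pct <- prop.table(table(computers$Premium)) * 100

# Contingency table: Premium by CD
premium_cd <- table(Premium = computers$Premium, CD = computers$CD)

# Conditional table: the same cross-tabulation restricted to Price > 2000
over_2000 <- computers[computers$Price > 2000, ]
premium_cd_2000 <- table(Premium = over_2000$Premium, CD = over_2000$CD)
```

Printing any of these objects in the RStudio console gives output suitable for a screen shot.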
Analyzing and Visualizing Data
Chapter 4
Working With Data
Data Assets and Tabulation Types
• Two main categories
o Data that exist in tables; Datasets
o Data that exist as isolated values
• Data Types
o Levels of data or scales of measurement
o Type of exploratory data analysis you can undertake
o Editorial thinking you establish
o Specific chart types you might use
o Color choices and layout decisions around composition
Data Assets and Tabulation Types cont.
• Textual (Qualitative)
o Unstructured streams of words
o Descriptive details of a weather forecast for a given city
o The full title of an academic research project
o The description of a product on Amazon
Data Assets and Tabulation Types cont.
• Ordinal (Qualitative)
o Ordinal data is still categorical and qualitative in nature
o Characteristics of order
o The response to a survey question: based on a scale of 1 (unhappy) to 5 (very happy)
o The general weather forecast: expressed as Very Hot, Hot, Mild, Cold, Freezing
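In R, ordered categories like the weather forecast example can be represented as an ordered factor. This is a small sketch; the variable name and the five observations are made up for illustration.

```r
# Ordinal data: an ordered factor keeps the category order
forecast <- factor(
  c("Mild", "Hot", "Freezing", "Very Hot", "Cold"),
  levels  = c("Freezing", "Cold", "Mild", "Hot", "Very Hot"),
  ordered = TRUE
)

forecast < "Hot"     # order comparisons are defined for ordered factors
max(forecast)        # the highest category observed
table(forecast)      # counts listed in level order, not alphabetically
```

Arithmetic such as mean(forecast) is still not meaningful, which is what separates ordinal from quantitative data.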
Data Assets and Tabulation Types cont.
• Interval (Quantitative)
o Interval data is the less common form of quantitative data
o Quantitative and numeric measurement
o Measure for temperature
Data Assets and Tabulation Types cont.
• Ratio (Quantitative)
o Most common quantitative variable
o Age of a survey participant in years
o Forecasted amount of rainfall in millimetres
o Unlike interval data, for ratio data variables zero means something
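The difference between interval and ratio scales can be seen with a little arithmetic. The sketch below uses made-up values and assumes temperature is recorded in degrees Celsius.

```r
# Interval scale: Celsius has an arbitrary zero, so ratios mislead
celsius <- c(10, 20)
celsius[2] / celsius[1]      # 2, yet 20 degrees C is not "twice as hot"

# The same temperatures in Kelvin (true zero) give the physical ratio
kelvin <- celsius + 273.15
kelvin[2] / kelvin[1]        # about 1.035

# Ratio scale: zero rainfall means no rain, so ratios are meaningful
rain_mm <- c(5, 10)
rain_mm[2] / rain_mm[1]      # 2: twice as much rain
```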
Data Assets and Tabulation Types cont.
• Temporal Data
o Time-based data
o Textual: ‘Four o’clock in the afternoon on Monday, 12 March 2016’
o Ordinal: ‘PM’, ‘Afternoon’, ‘March’, ‘Q1’
o Interval: ‘12’, ‘12/03/2016’, ‘2016’
o Ratio: ‘16:00’
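A single timestamp can be stored once and then expressed at several of these levels. A minimal sketch using base R date-time functions follows; the UTC time zone is an assumption made so the output does not depend on the local machine.

```r
# One moment in time, stored once and expressed in several forms
when <- as.POSIXct("2016-03-12 16:00:00", tz = "UTC")

format(when, "%H:%M")                  # clock time: "16:00"
format(when, "%d/%m/%Y")               # calendar date: "12/03/2016"
format(when, "%Y")                     # year: "2016"
quarters(when)                         # quarter: "Q1"
as.integer(format(when, "%H")) >= 12   # TRUE: the time falls in the PM half
```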
Data Assets and Tabulation Types cont.
• Discrete
o No ‘in-between’ state
o Days of the week
o Heads or tails for a coin toss
o 1,2,3,4,5,6,etc.
• Continuous
o Has in-between state
o Height and weight
o Temperature
o Time
o 1.1,1.2,1.3,1.4,1.5,etc.
Data Acquisition
• What data do you need and why?
• From where, how, and by whom will the data be acquired?
• When can you obtain it?
Data Acquisition cont.
• Curated by You
o Primary data collection
o Manual collection and data foraging
o Extracted from pdf files
o Web scraping (also known as web harvesting)
Data Acquisition cont.
• Curated by Others
o Issued to you
o Download from the Web
o System report or export
o Third-party services
o API
Data Examination
• Data Properties
o Data types
o Size
o Condition
▪ Missing values
▪ Erroneous values
▪ Inconsistencies
▪ Duplicate records
▪ Out of date
▪ Uncommon system characters or line breaks
▪ Leading or trailing spaces
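Several of the condition checks above can be run with base R. The data frame below is hypothetical (made-up values chosen to contain each problem), and the -90 threshold for an implausible temperature is an assumption.

```r
# Hypothetical records with typical condition problems
records <- data.frame(
  city = c("Leeds", " York", "Leeds", "Hull ", NA),
  temp = c(12.5, 11.0, 12.5, -999, 9.8),
  stringsAsFactors = FALSE
)

colSums(is.na(records))        # missing values per column
sum(duplicated(records))       # number of fully duplicated rows
which(records$temp < -90)      # rows with implausible (erroneous) values

# Entries that change when trimmed have leading or trailing spaces
padded <- !is.na(records$city) & records$city != trimws(records$city)
which(padded)
```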
Data Examination cont.
• How to Approach This?
o Inspect and scan
o Data operations
o Statistical methods
o Frequency counts
o Frequency distribution
o Measurements of central tendency
o Measurements of spread
o Maximum, minimum and range
o Percentiles
o Standard deviation
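Each of the statistical methods listed above has a direct base R counterpart. A short sketch on a made-up numeric vector; the bin boundaries passed to cut() are arbitrary choices.

```r
# Made-up sample values for illustration
x <- c(2, 4, 4, 5, 7, 9, 11)

table(x)                                  # frequency counts
table(cut(x, breaks = c(0, 4, 8, 12)))    # frequency distribution in bins
mean(x)                                   # central tendency: mean
median(x)                                 # central tendency: median
range(x)                                  # minimum and maximum
diff(range(x))                            # range as a single number
quantile(x, c(0.25, 0.5, 0.75))           # percentiles
sd(x)                                     # standard deviation
```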
Influence on Process
• Moving forward
o Purpose map ‘tone’
o Editorial angles
o Physical properties influence scale
Data Transformation
• Potential Activities
o Transform to clean
o Transform to convert
o Transform to create
o Transform to consolidate
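The four kinds of transformation can be illustrated on a tiny example. Both data frames below are hypothetical, and amount_k is an invented column name.

```r
# Hypothetical raw data (made-up values)
sales <- data.frame(
  region = c(" north", "South ", "north"),
  amount = c("1,200", "950", "2,300"),
  stringsAsFactors = FALSE
)
targets <- data.frame(region = c("north", "south"), target = c(3000, 1000))

# Clean: trim stray spaces and standardise case
sales$region <- tolower(trimws(sales$region))

# Convert: text with thousands separators to numeric
sales$amount <- as.numeric(gsub(",", "", sales$amount))

# Create: derive a new variable from an existing one
sales$amount_k <- sales$amount / 1000

# Consolidate: bring in a second table by a shared key
merged <- merge(sales, targets, by = "region")
```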
Data Exploration
• Exploratory Data Analysis
o Instinct of the analyst
o Reasoning
▪ Deductive
▪ Inductive
o Chart types
o Research
o Statistical methods
o Nothings
o Not always needed
How to Use the R Programming Language for Statistical Analyses
Part I: An Introduction to R
What Is R?
◼ a programming “environment”
◼ object-oriented
◼ similar to S-Plus
◼ freeware
◼ provides calculations on matrices
◼ excellent graphics capabilities
◼ supported by a large user network
What is R Not?
◼ a statistics software package
◼ menu-driven
◼ quick to learn
◼ a program with a complex graphical interface
Installing R
◼ www.r-project.org/
◼ download from CRAN
◼ select a download site
◼ download the base package at a minimum
◼ download contributed packages as needed
Tutorials
◼ From R website under “Documentation”
– “Manual” is the listing of official R documentation
• An Introduction to R
• R Language Definition
• Writing R Extensions
• R Data Import/Export
• R Installation and Administration
• The R Reference Index
Tutorials cont.
– “Contributed” documentation consists of tutorials and manuals created by R users
• Simple R
• R for Beginners
• Practical Regression and ANOVA Using R
– R FAQ
– Mailing Lists (listserv)
• r-help
Tutorials cont.
◼ Textbooks
– Venables & Ripley (2002). Modern Applied Statistics with S. New York: Springer-Verlag.
– Chambers (1998). Programming with Data: A Guide to the S Language. New York: Springer-Verlag.
R Basics
◼ objects
◼ naming convention
◼ assignment
◼ functions
◼ workspace
◼ history
Objects
◼ names
◼ types of objects: vector, factor, array, matrix, data.frame, ts, list
◼ attributes
– mode: numeric, character, complex, logical
– length: number of elements in object
◼ creation
– assign a value
– create a blank object
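The attributes and creation methods listed above can be inspected directly at the console. A brief sketch with made-up values:

```r
v <- c(1.5, 2, 3.25)          # create an object by assigning a value
mode(v)                       # "numeric"
length(v)                     # 3

s <- c("a", "b")
mode(s)                       # "character"

f <- factor(c("low", "high", "low"))
class(f)                      # "factor"
mode(f)                       # "numeric": factors are stored as integer codes

empty <- numeric(0)           # create a blank numeric object
length(empty)                 # 0
blank <- vector("list", 2)    # a list with two empty (NULL) slots
```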
Naming Convention
◼ must start with a letter (A-Z or a-z)
◼ can contain letters, digits (0-9), and/or periods “.”
◼ case-sensitive
– mydata different from MyData
◼ avoid underscore “_” (it was not allowed in names before R 1.9.0)
Assignment
◼ “<-” used to indicate assignment
– x<-c(1,2,3,4,5,6,7)
– x<-c(1:7)
– x<-1:4
◼ note: as of version 1.4 “=” is also a valid assignment operator
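The assignment forms and the case-sensitivity rule can be checked in a few lines. A minimal sketch; the object names are the slide's own plus y and z for illustration.

```r
x <- c(1, 2, 3, 4, 5, 6, 7)   # combine values explicitly
y <- 1:7                      # the colon operator builds the same sequence
z = 1:4                       # "=" also assigns at the top level

all(x == y)                   # TRUE: same values
identical(x, y)               # FALSE: x is stored as double, y as integer

mydata <- 10
MyData <- 20                  # a different object: names are case-sensitive
mydata == MyData              # FALSE
```

Many style guides prefer `<-` for assignment and reserve `=` for function arguments, which keeps the two uses visually distinct.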
Functions
◼ actions can be performed on objects using functions (note: a function is itself an object)
◼ have arguments and options, often there are defaults
◼ provide a result
◼ parentheses () are used to specify that a function is being called
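These points can be seen by defining a small function. The helper below, std_error, is a hypothetical example written for this illustration and is not part of base R.

```r
# Hypothetical helper: standard error of the mean, with a default argument
std_error <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sd(x) / sqrt(length(x))
}

vals <- c(2, 4, 4, 4, 5, 5, 7, 9)
std_error(vals)                        # parentheses call the function
std_error(c(vals, NA))                 # NA: the default na.rm = FALSE applies
std_error(c(vals, NA), na.rm = TRUE)   # overriding the default
is.function(std_error)                 # TRUE: a function is itself an object
```

Typing a function name without parentheses, for example `std_error`, prints its definition instead of calling it.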
Let’s look at R
R Workspace & History
Workspace
◼ during an R session, all objects are stored in a temporary, working memory
◼ list objects
– ls()
◼ remove objects
– rm()
◼ objects that you want to access later must be saved in a “workspace”
– from the menu bar: File->save workspace
– from the command line: save(x,file="MyData.Rdata")
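A short round trip shows how listing, removing, saving, and restoring objects fit together. This sketch writes to a temporary directory, which is an assumption made so that it does not touch your working folder.

```r
x <- 1:7
y <- "scratch"

ls()                                  # list objects in the workspace
rm(y)                                 # remove one object
exists("y", inherits = FALSE)         # FALSE: y is gone from this environment

path <- file.path(tempdir(), "MyData.RData")
save(x, file = path)                  # save selected objects to a file
rm(x)
load(path)                            # restores x under its original name
x
```

Note that code must use straight quotes; the curly quotes that slide software inserts will cause a syntax error if pasted into R.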
History
◼ command line history
◼ can be saved, loaded, or displayed
– savehistory(file="MyData.Rhistory")
– loadhistory(file="MyData.Rhistory")
– history(max.show=Inf)
◼ during a session you can use the arrow keys
to review the command history
Two most common object types for statistics:
◼ matrix
◼ data frame
Matrix
◼ a matrix is a vector with an additional attribute (dim) that defines the number of columns and rows
◼ only one mode (numeric, character, complex, or logical) allowed
◼ can be created using matrix()
x<-matrix(data=0,nrow=2,ncol=2)
or
x<-matrix(0,2,2)
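The role of the dim attribute and the one-mode rule can both be demonstrated briefly. A minimal sketch with made-up values:

```r
m <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
dim(m)                                 # 2 3
m[2, 3]                                # row 2, column 3: 6

v <- 1:6
dim(v) <- c(2, 3)                      # adding dim turns the vector into a matrix
is.matrix(v)                           # TRUE
all(m == v)                            # TRUE: same layout and values

mixed <- matrix(c(1, "a", 2, "b"), nrow = 2)
mode(mixed)                            # "character": the numbers were coerced
```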
Data Frame
◼ several modes allowed within a single data frame
◼ can be created using data.frame()
L<-LETTERS[1:4] #A B C D
x<-1:4 #1 2 3 4
data.frame(x,L) #create data frame
◼ attach() and detach()
– the database is attached to the R search path so that the database is searched by R when it is evaluating a variable
– objects in the database can be accessed by simply giving their names
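The following sketch shows a data frame holding two modes, and the effect of attach() and detach() on name lookup. The column names num and let are illustrative choices.

```r
L <- LETTERS[1:4]
x <- 1:4
df <- data.frame(num = x, let = L, stringsAsFactors = FALSE)

sapply(df, mode)         # num is "numeric", let is "character"
df$num                   # access a column with $
with(df, num * 2)        # evaluate an expression inside the data frame

attach(df)               # put df on the search path
num                      # columns are now visible by name
detach(df)               # remove it again when done
```

Because attach() can silently mask objects that share a column's name, `with()` or `df$column` is often the safer choice in scripts.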
Data Elements
◼ select only one element
– x[2]
◼ select range of elements
– x[1:3]
◼ select all but one element
– x[-3]
◼ slicing: including only part of the object
– x[c(1,2,5)]
◼ select elements based on logical operator
– x[x>3]
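Each selection style is shown here on a made-up vector so the results are easy to follow. Note that selection always uses square brackets; parentheses would be read as a function call.

```r
x <- c(10, 20, 30, 40, 50)

x[2]                  # one element: 20
x[1:3]                # a range: 10 20 30
x[-3]                 # all but the third: 10 20 40 50
x[c(1, 2, 5)]         # a slice by position: 10 20 50
x[x > 30]             # by logical condition: 40 50
x[x > 30 & x < 50]    # combined conditions: 40
```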
Data Import & Entry
Importing Data
◼ read.table()
– reads in data from an external file
◼ data.entry()
– create object first, then enter data
◼ c()
– concatenate
◼ scan()
– prompted data entry
◼ R has ODBC for connecting to other programs
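The non-interactive import functions can be tried without any external data. This sketch first writes a small made-up file to a temporary directory (file name and contents are invented), then reads it back.

```r
# Write a small made-up file, then read it back
path <- file.path(tempdir(), "example.txt")
writeLines(c("id score group", "1 4.5 a", "2 3.9 b", "3 5.1 a"), path)

dat <- read.table(path, header = TRUE, stringsAsFactors = FALSE)
str(dat)

# read.csv() is read.table() with defaults suited to comma-separated files
csv <- read.csv(text = "id,score\n1,4.5\n2,3.9")

# scan() can read from text instead of prompting for input
nums <- scan(text = "3 1 4 1 5", quiet = TRUE)

# c() concatenates values typed directly into the session
typed <- c(nums, 9)
```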
Data entry & editing
◼ start editor and save changes
– data.entry(x)
◼ start editor, changes not saved
– de(x)
◼ start text editor
– edit(x)