What we do with data - how we store it, process it, transform it, summarise it, display it and analyse it - depends crucially on its type.
Classification of Data
Data Type | Subtype | Possible Values | R data type
---|---|---|---
Categorical | Boolean/Logical | FALSE or TRUE | logical
Categorical | Unordered/Nominal | Character strings | character
Categorical | Unordered/Nominal | Unordered factors | factor
Categorical | Ordered/Ordinal | Character strings | character
Categorical | Ordered/Ordinal | Ordered factors | factor
Numerical | Binary | 0 or 1 | numeric, integer
Numerical | Integer | …, -2, -1, 0, 1, 2, … | numeric, integer
Numerical | Continuous | Floating point | numeric, double
A variable is a characteristic of individuals, areas, populations etc. A variable can take different values for different individuals, or at different times.
For example:
Classify the following variables according to the scheme in the table and diagram above:
Note that (at least as adults) we always round our age down to the nearest whole number. The fact that we record age as an integer value doesn’t make it an integer variable. In fact we always record continuous variables in a discrete way - e.g. to one decimal place - since we don’t have an infinite number of decimal places at our disposal when storing numbers.
Computers are finite machines. Now we all know that \(\sqrt{7}^2=7\) right? R certainly tells us so:
(sqrt(7))^2
## [1] 7
But try this:
(sqrt(7))^2-7
## [1] 8.881784e-16
This isn’t zero: it’s a very tiny number: 0.0000000000000008881784, caused by the fact that while R can store 7 exactly, it can’t do the same for \(\sqrt{7}\) because its binary (and decimal) representation is infinite. So when it computes it and squares it, it doesn’t quite get 7 back again.
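Because of this tiny rounding error, testing a floating point result for exact equality can mislead. A minimal sketch of one common workaround, comparing within a small tolerance using base R's `all.equal()` (the object names here are illustrative only):

```r
# sqrt(7)^2 is not stored as exactly 7, so an exact test (==) fails
root.squared <- sqrt(7)^2
exactly.equal <- (root.squared == 7)

# all.equal() compares within a small tolerance; wrap it in isTRUE()
# because it returns a description (not FALSE) when the values differ
nearly.equal <- isTRUE(all.equal(root.squared, 7))

exactly.equal   # FALSE
nearly.equal    # TRUE
```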
By the way, you should be aware of floating point or scientific notation, which R uses for large and small numbers: e.g. 1.234E+06 is \(1.234\times10^6 = 1234000\), and 7.89E-04 is \(7.89\times10^{-4}=0.000789\). You can type 1.234e6 or 7.89e-4 directly into R and it will interpret these correctly in floating point notation. R calls floating point numbers data type double, which means ‘double precision’, a hangover from the days when floating point numbers were recorded with only about 7 significant digits of precision. Double precision means that 15-17 significant digits can be stored.
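As a quick check of the notation and the storage type, something along these lines can be typed at the console (the name `big` is just for illustration):

```r
# Scientific notation and ordinary notation give the same value
big <- 1.234e6
big == 1234000          # TRUE
typeof(big)             # "double"

# By default R prints about 7 significant digits; ask for more to see
# roughly how much a double really stores
print(pi, digits = 17)

# The gap between 1 and the next number a double can represent
.Machine$double.eps
```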
Computers can’t store arbitrarily large numbers of decimal places, nor arbitrarily large numbers. The largest integer you can store in R is
.Machine$integer.max
## [1] 2147483647
And -2147483647 can be stored too, but nothing more negative.
Any whole number larger than 2147483647 will be stored as a floating point number. Whole numbers up to \(2^{53}\) (about \(9\times10^{15}\)) are still stored exactly; beyond that they are stored only approximately (i.e. to about 16 significant digits).
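A short sketch of what happens around the integer limit. It assumes the `L` suffix, which is R's way of writing an integer literal; the object names are illustrative:

```r
biggest <- .Machine$integer.max
typeof(biggest)                 # "integer"

# Adding a double (1) gives a double, so the result is simply larger
bigger <- biggest + 1
typeof(bigger)                  # "double"
bigger == 2147483648            # TRUE

# Adding an integer (1L) stays in integer arithmetic; the result does not
# fit, so R returns NA (with a warning, suppressed here)
overflowed <- suppressWarnings(biggest + 1L)
is.na(overflowed)               # TRUE

# Doubles hold whole numbers exactly only up to 2^53
2^53 == 2^53 + 1                # TRUE: the extra 1 is lost
```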
The smallest positive and largest floating point numbers R can store at full precision are
.Machine$double.xmin
## [1] 2.225074e-308
.Machine$double.xmax
## [1] 1.797693e+308
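Positive numbers smaller than \(2.2250739\times10^{-308}\) are stored with reduced precision, and numbers much smaller still (below roughly \(10^{-324}\)) will be treated as zero.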
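The behaviour at these limits can be explored directly. In this sketch (object names are illustrative) note that, on standard hardware, R can still hold some positive values below `.Machine$double.xmin`, at reduced precision, before they finally collapse to zero:

```r
tiny <- .Machine$double.xmin
huge <- .Machine$double.xmax

# Halving the smallest full-precision number still gives something positive
tiny / 2 > 0            # TRUE

# Keep shrinking and the value eventually underflows to exactly zero
tiny / 2^60 == 0        # TRUE

# Going past the largest number overflows to infinity
huge * 2                # Inf
is.infinite(huge * 2)   # TRUE
```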
Compare the results of the following three versions of division
# Simple division: 25 divided by 7 - a decimal value
25/7
## [1] 3.571429
# The number of complete times 7 fits into 25 - an integer value
# this is the integer part of 25/7
25%/%7
## [1] 3
# 25 modulo 7 - the remainder after taking 25%/%7 7s away from 25
25%%7
## [1] 4
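The integer part and the remainder fit together: multiplying the integer part by the divisor and adding the remainder gives back the original number. A minimal sketch, with illustrative object names, which also shows how R treats a negative number:

```r
n <- 25
d <- 7

quotient  <- n %/% d    # 3
remainder <- n %% d     # 4

# The two pieces always rebuild the original number
quotient * d + remainder == n      # TRUE

# With a negative number, %/% rounds down (towards minus infinity),
# so the remainder is still between 0 and 6
(-25) %/% 7             # -4
(-25) %% 7              # 3
```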
Compare the results of the following rounding of 163.141593
# Round to the nearest whole number
round(163.141593)
## [1] 163
# Round to the nearest whole number, giving the number of decimal places (0) explicitly
round(163.141593,0)
## [1] 163
# Round to one decimal place
round(163.141593,1)
## [1] 163.1
# Round to two decimal places
round(163.141593,2)
## [1] 163.14
# Round to the nearest multiple of 10
round(163.141593,-1)
## [1] 160
# Round to the nearest multiple of 100
round(163.141593,-2)
## [1] 200
# Round UP
ceiling(163.141593)
## [1] 164
# Round DOWN (this is what we do with our ages)
floor(163.141593)
## [1] 163
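A few related base R functions are worth comparing with the ones above. This is a small sketch rather than a complete list: `signif()` rounds to a number of significant digits, `trunc()` simply drops the decimal part, and `round()` sends values that are exactly halfway to the nearest even number.

```r
# Round to a number of significant digits rather than decimal places
signif(163.141593, 2)     # 160
signif(163.141593, 4)     # 163.1

# trunc() chops off the decimals; it differs from floor() for negatives
trunc(-163.141593)        # -163
floor(-163.141593)        # -164

# Values exactly halfway are rounded to the nearest even number
round(0.5)                # 0
round(1.5)                # 2
round(2.5)                # 2
```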
We can create, manipulate and store objects of all the various allowed data types.
We’ll start with the simple numeric types integer and double.
Some simple examples first. Let’s evaluate the following expressions for the case where \(x=2\) and \(y=6.4\), storing all the results:
x <- 2
y <- 6.4
x2 <- x^2
fred <- 2*x+5
wilma <- x+y
a.simple.fraction <- (x+1)/(y+2)
The object names that we create must start with a letter (a-z, A-Z), and can contain any number of letters and digits, as well as the special characters . and _.
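One way to check whether a name follows these rules is to see whether base R's `make.names()` leaves it unchanged. The helper `is.valid.name()` below is hypothetical, written only for this illustration:

```r
# Hypothetical helper: a name is valid if make.names() does not alter it
is.valid.name <- function(name) make.names(name) == name

is.valid.name("a.simple.fraction")   # TRUE
is.valid.name("x_2")                 # TRUE
is.valid.name("2x")                  # FALSE: starts with a digit
is.valid.name("my name")             # FALSE: contains a space
```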
If we change our minds about the values of x and y we can easily alter their values at the top of the code, and then rerun everything. (Select all the code, then press Ctrl-Enter.)
Inspect the values of any of these by typing their names:
a.simple.fraction
## [1] 0.3571429
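All of the objects created so far are doubles, because R stores a plain number as a double unless told otherwise. A minimal sketch of how an integer can be created instead, assuming the `L` suffix for integer literals (the name `n` is illustrative):

```r
x <- 2        # a plain number is stored as a double
n <- 2L       # the L suffix asks for an integer
typeof(x)     # "double"
typeof(n)     # "integer"
class(x)      # "numeric"

# Arithmetic that mixes the two gives a double
typeof(n + x)     # "double"
# Division always gives a double, even for two integers
typeof(6L / 3L)   # "double"
# Converting a double to an integer drops the decimal part
as.integer(6.4)   # 6
```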
We can create logical objects by setting them directly:
yy <- FALSE
happy <- TRUE
Note how R colours TRUE and FALSE in a special way. Don’t ever try to create a variable called TRUE or R will complain. TRUE is a reserved word: it has a special meaning in R, and R won’t ever let you change it.
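To see the complaint without stopping a script, the assignment can be wrapped in `try()`. The sketch below also shows a related caution: the abbreviations `T` and `F` are not reserved words, so they can be overwritten by accident, which is one reason to prefer writing TRUE and FALSE in full.

```r
# Attempting the assignment is an error; try() lets the script carry on
attempt <- try(eval(parse(text = "TRUE <- 5")), silent = TRUE)
inherits(attempt, "try-error")     # TRUE

# T is an ordinary variable that merely starts out equal to TRUE
T <- 0
isTRUE(T)                          # FALSE
rm(T)                              # remove our copy; T is TRUE again
isTRUE(T)                          # TRUE
```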
We can also do calculations that have a logical output:
2 < 3
## [1] TRUE
6 == (2+3)
## [1] FALSE
6 == (3+3)
## [1] TRUE
6 != (2+3)
## [1] TRUE
2 <= 2
## [1] TRUE
10 > 1.5
## [1] TRUE
These examples use the comparison operators <, <=, ==, !=, > and >= which mean \(<\), \(\leq\), \(=\), \(\neq\), \(>\) and \(\geq\) respectively. Note the double == sign when checking if two values are equal.
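The result of a comparison can be stored like any other value. A small sketch using `>=`, with made-up object names and values:

```r
temperature <- 21.5                 # illustrative value
is.warm <- temperature >= 20        # store the result of the comparison
is.warm                             # TRUE
typeof(is.warm)                     # "logical"

# A single = would assign rather than compare, so use == for tests
temperature == 21.5                 # TRUE
```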
The ! operator inverts a logical value, turning TRUE into FALSE and FALSE into TRUE.
x <- 5
x<5
## [1] FALSE
!(x<5)
## [1] TRUE
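Inverting a comparison always gives the same answer as the opposite comparison operator. A minimal sketch (object names are illustrative):

```r
x <- 5

# "not less than 5" is the same as "greater than or equal to 5"
not.less <- !(x < 5)
at.least <- x >= 5
not.less == at.least      # TRUE

# "not equal to 5" can be written either way
not.five.a <- !(x == 5)
not.five.b <- x != 5
not.five.a == not.five.b  # TRUE
```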
R provides some handy logical functions - they evaluate to TRUE or FALSE - which test whether an object or expression has a value that is numeric or logical or whatever:
is.logical(2)
## [1] FALSE
is.numeric(2)
## [1] TRUE
is.numeric(2+4)
## [1] TRUE
is.logical(2>4)
## [1] TRUE
is.logical(a.simple.fraction)
## [1] FALSE
is.logical(happy)
## [1] TRUE
is.logical(fred)
## [1] FALSE
is.numeric(fred)
## [1] TRUE
is.double(fred)
## [1] TRUE
is.character(wilma)
## [1] FALSE
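A few further checks, sketched here with literal values, show some results that can surprise at first: a plain 2 is not an integer as far as R is concerned, and putting a number in quotes turns it into a character string.

```r
is.integer(2)        # FALSE: 2 is stored as a double
is.integer(2L)       # TRUE: the L suffix makes an integer
is.numeric(2L)       # TRUE: integers count as numeric
is.numeric(TRUE)     # FALSE: logicals are not numeric
is.numeric("2")      # FALSE: quotes make it a character string
is.character("2")    # TRUE
```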
Characters are strings of symbols enclosed in matching quotes. Almost anything goes inside the quotes.
myname <- "Richard"
your.full.name <- 'Julius Caesar'
angry.statement <- "@@!!!*:)##!!"
Notice how R ignores the # symbol here when it’s hidden inside quotes: anywhere else a # would have caused R to treat the rest of the line as a comment.
There are two kinds of quote symbol: single ' and double ", and as you can see above, we can use either to delimit (i.e. enclose) a character string. If a string contains the same kind of quote that delimits it, we have to signal this to R using the backslash \ symbol. So to store Seamus O’Leary’s name inside single quotes:
seamus <- 'Seamus O\'Leary'
This use of \ is called an escape character. It says that what comes next needs to be treated differently to everything else. This of course means that if we want a backslash itself we have to type \\. Two other escape sequences we’ll see are \n, which is the newline character, and \t, which is a tab (like the Tab key towards the top left of your keyboard).
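The escape sequences are easiest to understand by comparing `print()`, which shows them as typed, with `cat()`, which shows the text they stand for. A small sketch with made-up strings:

```r
seamus    <- 'Seamus O\'Leary'
greeting  <- "She said \"hello\""
two.lines <- "first\nsecond"
tabbed    <- "name\tvalue"
path      <- "C:\\temp"

# print() shows the escapes; cat() shows what they produce
print(two.lines)
cat(two.lines, "\n")
cat(tabbed, "\n")
cat(path, "\n")

# Each escape sequence counts as a single character
nchar("\n")      # 1
nchar(path)      # 7
nchar(seamus)    # 14
```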
We’ll learn how to manipulate strings later.