Types of Data

What we do with data - how we store it, process it, transform it, summarise it, display it and analyse it - depends crucially on its type.

Classification of Data

| Data | Type | Possible Values | R data type |
|---|---|---|---|
| Categorical | Boolean/Logical | FALSE or TRUE | logical |
| | Unordered/Nominal | Character strings | character |
| | | Unordered factors | factor |
| | Ordered/Ordinal | Character strings | character |
| | | Ordered factors | factor |
| Numerical | Binary | 0 or 1 | numeric, integer |
| | Integer | …, -2, -1, 0, 1, 2… | numeric, integer |
| | Continuous | Floating point | numeric, double |

A variable is a characteristic of individuals, areas, populations etc. A variable can take different values for different individuals, or at different times.

For example:

  1. Human height is a numerical variable, which is continuous (it can take any value in a physical range: 0-290cm). When recorded it always has units (inches, cm, m, cubits) - and we have to know what those units are.
  2. Number of bedrooms in a house is a numerical variable which is integer (it can only take non-negative, whole number values 0, 1, 2, …)
  3. Rating of a film on a 1-5 star scale is a categorical variable which is ordered. It is not a numerical variable, since the difference between 2 and 3 stars can’t be sensibly called the same as the difference between 4 and 5 stars. All we know is that 3 stars is better than 2, and 5 is better than 4.
  4. Country of birth is a categorical variable which is unordered. There is no natural ordering of countries.
  5. Present/Absent is a boolean or logical variable: taking only the values FALSE or TRUE. Categorical Boolean variables and Numerical binary variables are effectively equivalent, though R stores them in two different ways.
  6. Age is a numerical, continuous variable - it measures how old we are. It is continuous by nature, even though it is usually recorded as an integer (a whole number of years) - except by young children.
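
As a rough sketch, the example variables above might be stored in R as follows. The object names and values are invented purely for illustration, and the `L` suffix (which asks R for an integer) and `factor()` are introduced here ahead of any fuller treatment.

```r
# Hypothetical values, chosen only to illustrate each type
height.cm  <- 172.5                 # numerical, continuous (double)
bedrooms   <- 3L                    # numerical, integer (L suffix = integer)
stars      <- factor("3 stars",
                     levels = c("1 star", "2 stars", "3 stars",
                                "4 stars", "5 stars"),
                     ordered = TRUE)   # categorical, ordered
birthplace <- factor("Ireland")     # categorical, unordered
present    <- TRUE                  # boolean / logical

typeof(height.cm)        # "double"
typeof(bedrooms)         # "integer"
is.ordered(stars)        # TRUE
is.ordered(birthplace)   # FALSE

# Logical and binary values convert into each other
as.integer(present)      # 1
as.logical(0)            # FALSE
```

The last two lines illustrate why Boolean and binary variables are effectively equivalent: TRUE converts to 1 and 0 converts to FALSE.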

Exercises

Questions

Classify the following variables according to the scheme in the table and diagram above:

  1. Weight of a person;
  2. Number of people living in a household;
  3. Employment status;
  4. Magnitude of an earthquake;
  5. Disease rate (cases per 100,000) in a population;
  6. Level of agreement with the statement: ‘For me, next year will be better than this year’, on a scale from 1-10;

Answers

  1. Weight of a person; Numerical, Continuous
  2. Number of people living in a household; Numerical, Integer
  3. Employment status; Logical (if categories are: not employed/employed = FALSE/TRUE), or Categorical, Ordered (if categories are: not employed/underemployed/employed), or Categorical, Unordered (if categories are: not in Labour Force/unemployed/underemployed/employed)
  4. Magnitude of an earthquake; Numerical, Continuous
  5. Disease rate (cases per 100,000) in a population; Numerical, Continuous
  6. Level of agreement with the statement: ‘For me, next year will be better than this year’, on a scale from 1-10; Categorical, Ordered

Note about precision - and the finite nature of computers

Note that (at least as adults) we always round our age down to the nearest whole number. The fact that we record age as an integer value doesn’t make it an integer variable. In fact we always record continuous variables in a discrete way - e.g. to one decimal place - since we don’t have an infinite number of decimal places at our disposal when storing numbers.

Computers are finite machines. Now we all know that \(\sqrt{7}^2=7\) right? R certainly tells us so:

(sqrt(7))^2
## [1] 7

But try this:

(sqrt(7))^2-7
## [1] 8.881784e-16

This isn’t zero: it’s a very tiny number: 0.0000000000000008881784, caused by the fact that while R can store 7 exactly, it can’t do the same for \(\sqrt{7}\) because its binary (and decimal) representation is infinite. So when it computes it and squares it, it doesn’t quite get 7 back again.

By the way, you should be aware of floating point or scientific notation, which R uses for large and small numbers: e.g. 1.234E+06 is \(1.234\times10^6 = 1234000\), and 7.89E-04 is \(7.89\times10^{-4}=0.000789\). You can type 1.234e6 or 7.89e-4 directly into R and it will interpret these correctly in floating point notation. R calls floating point numbers data type double, which means ‘double precision’, a hangover from the days when floating point numbers were recorded with only about 8 digits of precision. Double precision means that about 15-17 significant digits can be stored.
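
Because of this kind of rounding error, testing floating point results with == can mislead. The following is a minimal sketch of one common workaround, comparing within a small tolerance using base R's all.equal(), together with a quick check of the scientific notation just described.

```r
# Exact comparison fails because of the tiny rounding error
sqrt(7)^2 == 7                     # FALSE

# all.equal() compares within a small tolerance; wrap it in isTRUE()
isTRUE(all.equal(sqrt(7)^2, 7))    # TRUE

# Scientific notation is just another way of typing the same number
1.234e6 == 1234000                 # TRUE
typeof(1.234e6)                    # "double"
```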

Computers can’t store arbitrarily large numbers of decimal places, nor arbitrarily large numbers. The largest integer you can store in R is

.Machine$integer.max
## [1] 2147483647

And -2147483647 can be stored too, but nothing more negative.

Any whole number larger than 2147483647 will be stored as a floating point number. Whole numbers are still stored exactly up to \(2^{53}\) (about \(9\times10^{15}\)); beyond that they are stored only approximately (i.e. to about 16 significant digits).
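
A short sketch of what happens at the integer limit. The object names `big` and `overflowed` are made up for this example, and suppressWarnings() is used only to hide the overflow warning R would otherwise print.

```r
big <- .Machine$integer.max       # 2147483647, stored as an integer
typeof(big)                       # "integer"

# Integer arithmetic that overflows gives NA (with a warning)
overflowed <- suppressWarnings(big + 1L)
is.na(overflowed)                 # TRUE

# Adding a double instead converts the result to floating point
typeof(big + 1)                   # "double"
big + 1 == 2147483648             # TRUE

# Doubles hold whole numbers exactly only up to 2^53
2^53 == 2^53 + 1                  # TRUE: the + 1 is lost
```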

The smallest positive and the largest floating point numbers R can handle at full precision are

.Machine$double.xmin
## [1] 2.225074e-308
.Machine$double.xmax
## [1] 1.797693e+308

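Any positive number much smaller than \(2.2250739\times10^{-308}\) will be treated as zero. (Numbers a little below this limit, down to about \(4.9\times10^{-324}\), can still be stored, but with reduced precision.)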
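
To see these limits in action, here is a small sketch; `too.big` and `tiny` are illustrative names only.

```r
# Going beyond the largest double overflows to Inf
too.big <- .Machine$double.xmax * 2
is.infinite(too.big)              # TRUE

# A product far below the smallest double underflows to zero
tiny <- 1e-200 * 1e-200           # mathematically 1e-400
tiny == 0                         # TRUE
```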

Division with integers

Compare the results of the following three versions of division

# Simple division: 25 divided by 7 - a decimal value
25/7
## [1] 3.571429
# The number of complete times 7 fits into 25 - an integer value
# this is the integer part of 25/7
25%/%7
## [1] 3
# 25 modulo 7 - the remainder after taking 25%/%7 (i.e. 3) 7s away from 25
25%%7
## [1] 4
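
The two integer operators fit together: the quotient times the divisor, plus the remainder, gives back the original number. A minimal sketch follows (the names `a`, `b`, `quotient` and `remainder` are just for illustration), including what happens with a negative number.

```r
a <- 25
b <- 7
quotient  <- a %/% b              # 3
remainder <- a %% b               # 4
quotient * b + remainder == a     # TRUE

# With a negative number, %/% rounds down (towards minus infinity)
(-25) %/% 7                       # -4
(-25) %% 7                        # 3
```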

Rounding numbers

Compare the results of the following rounding of 163.141593

# Round to the nearest whole number
round(163.141593)
## [1] 163
# Round to the nearest whole number (zero decimal places, stated explicitly)
round(163.141593,0)
## [1] 163
# Round to one decimal place
round(163.141593,1)
## [1] 163.1
# Round to two decimal places
round(163.141593,2)
## [1] 163.14
# Round to the nearest multiple of 10
round(163.141593,-1)
## [1] 160
# Round to the nearest multiple of 100
round(163.141593,-2)
## [1] 200
# Round UP
ceiling(163.141593)
## [1] 164
# Round DOWN (this is what we do with our ages)
floor(163.141593)
## [1] 163
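
A few related base R functions are worth comparing with the ones above. This is a sketch rather than a complete tour: trunc() and signif() are not used elsewhere in these notes.

```r
# trunc() drops the decimal part; floor() always rounds down
trunc(-163.141593)                # -163
floor(-163.141593)                # -164

# signif() keeps significant digits rather than decimal places
signif(163.141593, 2)             # 160
signif(163.141593, 4)             # 163.1

# A value exactly halfway is rounded to the even neighbour
round(0.5)                        # 0
round(1.5)                        # 2
round(2.5)                        # 2
```

For positive numbers trunc() and floor() agree; they differ only for negative values.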

R Objects

We can create, manipulate and store objects of all the various allowed data types.

Numeric objects

We’ll start with the simple numeric types integer and double.

Some simple examples first. Let’s evaluate the following expressions

  1. \(x^2\)
  2. \(2x+5\)
  3. \(x+y\)
  4. \(\frac{x+1}{y+2}\)

for the case where \(x=2\) and \(y=6.4\), storing all the results

x <- 2
y <- 6.4
x2 <- x^2
fred <- 2*x+5
wilma <- x+y
a.simple.fraction <- (x+1)/(y+2)

The object names that we create must start with a letter (a-z, A-Z), and can contain any number of letters and digits, as well as the special characters . and _.
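
One way to check a candidate name, sketched here with invented examples, is base R's make.names(), which converts any string into a syntactically valid name. A string that comes back unchanged was already valid.

```r
# Hypothetical candidate names: the last two break the rules
candidates <- c("fred", "a.simple.fraction", "x2", "my_name", "2x", "my name")

# TRUE where make.names() had nothing to fix
valid <- make.names(candidates) == candidates
valid    # TRUE TRUE TRUE TRUE FALSE FALSE
```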

If we change our minds about the values of x and y we can easily alter their values at the top of the code, and then rerun everything. (Select all the code, then Ctrl-Enter)

Inspect the values of any of these by typing their names:

a.simple.fraction
## [1] 0.3571429

Logical objects

We can create logical objects by setting them directly

yy <- FALSE
happy <- TRUE

Note how R colours TRUE and FALSE in a special way. Don’t ever try to create a variable called TRUE or R will complain. TRUE is a reserved word, it has a special meaning in R, and R won’t ever let you change it.
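
If you are curious what the complaint looks like without stopping your script, the following sketch catches the error. It uses tryCatch(), eval() and parse(), which go beyond what these notes cover, and the name `result` is illustrative.

```r
# Try to assign to TRUE, catching the error R raises
result <- tryCatch({
  eval(parse(text = "TRUE <- 0"))
  "assignment worked"
}, error = function(e) "R refused")

result                            # "R refused"
```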

We can also do calculations that have a logical output:

2 < 3
## [1] TRUE
6 == (2+3)
## [1] FALSE
6 == (3+3)
## [1] TRUE
6 != (2+3)
## [1] TRUE
2 <= 2
## [1] TRUE
10 > 1.5
## [1] TRUE

These examples use the comparison operators <, <=, ==, !=, > and >= which mean \(<\), \(\leq\), \(=\), \(\neq\), \(>\) and \(\geq\) respectively. Note the double == sign when checking if two values are equal.

The ! operator inverts a logical value: turning TRUE into FALSE and FALSE into TRUE.

x <- 5
x<5
## [1] FALSE
!(x<5)
## [1] TRUE
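
Negating a comparison always gives the same answer as the opposite comparison, as this small sketch shows:

```r
x <- 5

# Negating "less than" gives "greater than or equal to"
!(x < 5)                          # TRUE
x >= 5                            # TRUE

# Negating "equal to" gives "not equal to"
!(x == 5)                         # FALSE
x != 5                            # FALSE
```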

R provides some handy logical functions - they evaluate to TRUE or FALSE - which test whether an object or expression has a value that is numeric or logical or whatever:

is.logical(2)
## [1] FALSE
is.numeric(2)
## [1] TRUE
is.numeric(2+4)
## [1] TRUE
is.logical(2>4)
## [1] TRUE
is.logical(a.simple.fraction)
## [1] FALSE
is.logical(happy)
## [1] TRUE
is.logical(fred)
## [1] FALSE
is.numeric(fred)
## [1] TRUE
is.double(fred)
## [1] TRUE
is.character(wilma)
## [1] FALSE
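
Rather than testing one type at a time, we can ask R directly how a value is stored using typeof(). The sketch below also uses the `L` suffix, which asks R to store a whole number as an integer instead of a double.

```r
typeof(2)          # "double": plain numbers are stored as doubles
typeof(2L)         # "integer": the L suffix requests an integer
typeof(2 > 4)      # "logical"
typeof("2")        # "character"

is.integer(2)      # FALSE, even though 2 is a whole number
is.integer(2L)     # TRUE
is.numeric(2L)     # TRUE: integers and doubles both count as numeric
```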

Character objects

Characters are strings of symbols enclosed in matching quotes. Almost anything goes inside the quotes.

myname <- "Richard"
your.full.name <- 'Julius Caesar'
angry.statement <- "@@!!!*:)##!!"

Notice how R ignores the # symbol here when it’s hidden inside quotes: anywhere else a # would have caused R to treat the rest of the line as a comment.

There are two kinds of quote symbol: single ' and double ", and as you can see above, we can use either to delimit (i.e. enclose) a character string. If a string contains the same kind of quote that encloses it, we have to signal this to R using the backslash \ symbol; escaping the other kind of quote is optional but harmless. So to store Seamus O’Leary’s name:

seamus <- "Seamus O\'Leary"

A backslash used in this way is called an escape character. It says that what comes next needs to be treated differently to everything else. This of course means that if we want a backslash itself we have to type \\. Two other escape sequences we’ll see are \n which is the newline character, and \t which is a tab (like the Tab key towards the top left of your keyboard).
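
A brief sketch of these escape sequences at work. The strings and the names `path` and `two.lines` are made up for illustration; nchar() counts the characters in a string, and cat() writes a string out with its escapes interpreted.

```r
path <- "C:\\temp"                # one backslash, typed as \\
nchar(path)                       # 7: the escaped backslash is one character

two.lines <- "first\nsecond"      # \n is a single newline character
nchar(two.lines)                  # 12

print(two.lines)                  # print() shows the string with its escapes
cat(two.lines)                    # cat() writes the characters: two lines
cat("\n")
cat("name\tage\n")                # a tab between the words, then a newline
```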

We’ll learn how to manipulate strings later.