Chapter 4 Look at Your Data! Two Variables at a Time…

4.1 Indexing Vectors using Logical Operations

4.1.1 Indices

In the previous chapter, you were introduced to logical evaluations between two scalar values and between a vector and a scalar. Here, we turn to the case where both objects are vectors. In this case, it’s important that both vectors are the same size (i.e., they have the same number of elements). As before, logical evaluation proceeds element-by-element. Now, however, each value is compared only with the value at the same position in the other vector. Let’s take a look at two examples.

> x <- 1:5
> y <- c(4,2,1,3,5)
> x
> y
> x == y

[1] 1 2 3 4 5
[1] 4 2 1 3 5
[1] FALSE  TRUE FALSE FALSE  TRUE

Here, R is evaluating the == expression for each pair of values in \(x\) and \(y\) that have the same position: 1==4, 2==2, 3==1, 4==3, 5==5.

In the next example, we’ll keep the same values of \(x\) and \(y\):

> x
> y
> x >= y

[1] 1 2 3 4 5
[1] 4 2 1 3 5
[1] FALSE  TRUE  TRUE  TRUE  TRUE

Again, R is evaluating the logical expression element-by-element: 1>=4, 2>=2, 3>=1, 4>=3, 5>=5.
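The two examples above can be extended with a quick sketch that checks the lengths first and then summarizes the comparison. The variable name equal_xy is introduced here only for illustration; which() and sum() are base R functions.

```r
x <- 1:5
y <- c(4, 2, 1, 3, 5)

# Position-by-position comparison assumes the vectors have the same length,
# so confirm that before comparing.
stopifnot(length(x) == length(y))

equal_xy <- x == y   # FALSE TRUE FALSE FALSE TRUE
which(equal_xy)      # positions where x equals y: 2 5
sum(x >= y)          # TRUE counts as 1, so this counts matches: 4
```

If the lengths differ, R recycles the shorter vector rather than stopping, which is why an explicit length check can be worth adding.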

Interactive Example: Logical Evaluation with Two Vectors

In this example, \(x\) and \(y\) are each assigned a vector of numbers. A logical evaluation is performed on the two and the results are shown as a vector of TRUE/FALSE values. Keep refreshing the app until you understand how logical evaluations work when both objects are vectors of the same length.

4.1.2 “And” and “Or” Operators

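We can build more complex logical conditions using two important operators: “and” (&) and “or” (|). (Note: The “or” operator is the vertical bar, or pipe, symbol. On most US keyboard layouts it shares a key with the backslash, just above Enter/Return.)

One big difference between the & operator and comparison operators like ==, >, or < is that & is meant to combine logical values: its left-hand side (LHS) and right-hand side (RHS) should each be TRUE or FALSE, not arbitrary numbers or strings. (R will coerce numbers to logical values, treating 0 as FALSE and nonzero numbers as TRUE, and it raises an error for character strings, so it is best to supply logical values directly.) Suppose we want to evaluate the expression \(A \, \& \, B\). This expression is TRUE if and only if both \(A\) and \(B\) are TRUE. If either is FALSE, the expression evaluates to FALSE. The following shows the & logic table for all values of \(A\) and \(B\):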

A	B	A & B
FALSE	FALSE	FALSE
FALSE	TRUE	FALSE
TRUE	FALSE	FALSE
TRUE	TRUE	TRUE

So, for example,

> A <- TRUE
> B <- TRUE
> A & B

[1] TRUE

However,

> A <- FALSE
> B <- TRUE
> A & B

[1] FALSE

The “or” operator (|), on the other hand, evaluates to TRUE if either \(A\) or \(B\) (or both) is TRUE. \(A | B\) evaluates to FALSE only when both \(A\)=FALSE and \(B\)=FALSE. The following shows the | logic table for all values of \(A\) and \(B\):

A	B	A | B
FALSE	FALSE	FALSE
FALSE	TRUE	TRUE
TRUE	FALSE	TRUE
TRUE	TRUE	TRUE

> A <- TRUE
> B <- FALSE
> A | B

[1] TRUE

> A <- FALSE
> B <- FALSE
> A | B

[1] FALSE
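The two logic tables above can also be generated in R. The sketch below is one way to do it, using expand.grid() to list every combination of \(A\) and \(B\); the object name truth and the column names A_and_B and A_or_B are chosen here for illustration.

```r
# Every combination of A and B (expand.grid varies A fastest,
# so the row order differs from the tables in the text).
truth <- expand.grid(A = c(FALSE, TRUE), B = c(FALSE, TRUE))

# & and | work element-by-element on the two columns.
truth$A_and_B <- truth$A & truth$B
truth$A_or_B  <- truth$A | truth$B

truth
```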

In practice, \(A\) and \(B\) are usually logical expressions themselves, which R evaluates first before applying the & or | operator. For example, suppose variables \(a\) and \(b\) are assigned values and we want to evaluate the expression (a==2) | (b>5). Notice that the LHS of the | operator will evaluate to either TRUE or FALSE, as will the RHS:

> a <- 2
> b <- -4
> (a==2) 
> (b>5)
> (a==2) | (b>5)

[1] TRUE
[1] FALSE
[1] TRUE

Similarly, suppose we want to evaluate the expression (a>3) & (b<0):

> (a>3) 
> (b<0)
> (a>3) & (b<0)

[1] FALSE
[1] TRUE
[1] FALSE
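The parentheses in these expressions are there for readability. As a small sketch (the names with_parens and without_parens are ours), R evaluates comparisons such as == and > before & and |, so the result is the same without them. When & and | are mixed, however, & is applied first, and parentheses are the safest way to make the intended grouping explicit.

```r
a <- 2
b <- -4

# Comparisons are evaluated before | so both forms agree.
with_parens    <- (a == 2) | (b > 5)
without_parens <- a == 2 | b > 5
identical(with_parens, without_parens)   # TRUE

# & is applied before |, so grouping changes the answer here.
TRUE | FALSE & FALSE      # TRUE:  read as TRUE | (FALSE & FALSE)
(TRUE | FALSE) & FALSE    # FALSE
```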

Interactive Example: “and” & and “or” | Operators

In this example, variables \(a\) and \(b\) are each assigned a number. Each variable is then compared to another number via logical evaluation. There are two logical evaluations, each resulting in a TRUE/FALSE value. The example shows what happens when we apply the & and | operators to those two logical evaluations. Refresh the app to see different examples. Continue doing so until you understand the & and | operators.

4.1.3 And & and Or | Applied to Vectors

A natural extension of everything we have learned so far is to apply the “and” & and “or” | operators to vectors. Although it may not be obvious now why we would want to do so, we will later see that it’s a useful way to subset our data. For example, we can use combinations of these expressions to select out subsets of the data that meet certain (logical) criteria. For now, we’ll demonstrate the concepts using simple examples.

Suppose we have a vector \(x\):

> x <- -5:5
> x

 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5

Further, suppose we want to select out only the values of \(x\) that are between -3 and 3 (inclusive). Normally, we would write that range of values as \(-3 \le x \le 3\). Another way to think of that range is as the set of values such that (x >= -3) & (x <= 3). Moreover, we can use the TRUE/FALSE values as indices to the vector. Positions where the index is TRUE will be returned; those where the index is FALSE will not be returned. The following example shows the intermediate steps, with the full expression shown at the end:

> x
> (x >= -3)
> (x <= 3)
> (x >= -3) & (x <= 3)

 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
 [1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

If we use the TRUE/FALSE values in the last row as indices, we will select the values of \(x\) for which the logical condition is TRUE:

> x[ (x >= -3) & (x <= 3) ]

[1] -3 -2 -1  0  1  2  3
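It can help to store the logical vector under its own name before using it as an index. The sketch below does that (the name in_range is ours) and also shows two base R helpers: sum() to count how many elements meet the condition and which() to report their positions.

```r
x <- -5:5

# Store the TRUE/FALSE vector so it can be reused.
in_range <- (x >= -3) & (x <= 3)

sum(in_range)     # how many values satisfy the condition: 7
which(in_range)   # their positions in x: 3 4 5 6 7 8 9
x[in_range]       # the values themselves: -3 -2 -1 0 1 2 3
```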

As another example, suppose we want to select the values of \(x\) that are either strictly less than -4 or strictly greater than 4.

> x
> (x < -4)
> (x > 4)
> (x < -4) | (x > 4)

 [1] -5 -4 -3 -2 -1  0  1  2  3  4  5
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
 [1]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Again, we can use the last row of TRUE/FALSE values as indices, producing:

> x[ (x < -4) | (x > 4) ]

[1] -5  5
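A logical index built from one vector can also be used to select from a second vector of the same length, which is how conditions on one variable end up subsetting another. The following is a minimal sketch; the vector id is hypothetical data made up for this illustration.

```r
x  <- -5:5
id <- letters[1:11]   # hypothetical second variable, one value per element of x

# Condition is defined on x ...
extreme <- (x < -4) | (x > 4)

# ... but can be used to index any vector of the same length.
x[extreme]    # -5  5
id[extreme]   # "a" "k"
```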

4.2 When Both Variables are Categorical

4.3 When One Variable is Categorical

4.4 When Both Variables are Interval Variables

4.4.1 Covariance

Interactive Example: Sample Covariance

This example demonstrates the calculation of (and intuition behind) sample covariance. By clicking New Sample, you will generate a new set of \(N\) observations for variables \(X\) and \(Y\). A scatterplot is shown in the main panel. The equation for the covariance calculation is shown below that.

Click the “Show quadrants” box. A vertical line will appear at \(\bar{X}\) and a horizontal line will appear at \(\bar{Y}\), dividing the plot into quadrants. The blue points are observations for which \((X_i-\bar{X})(Y_i-\bar{Y})\) is positive. Red points are those for which \((X_i-\bar{X})(Y_i-\bar{Y})\) is negative.

If you’d like to calculate the covariance for the displayed data, click the “Show X & Y data” box. Copy and paste the \(X\) and \(Y\) vectors into R. First calculate the sample covariance using the displayed covariance equation. Then check your answer using the cov() command.
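As a sketch of that check, assume the displayed equation is the usual sample covariance, \(\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})(Y_i-\bar{Y})\), which is what cov() computes. The \(X\) and \(Y\) values below are made up for illustration and are not taken from the app.

```r
# Hypothetical data standing in for the vectors copied from the app.
X <- c(2, 4, 6, 8, 10)
Y <- c(1, 3, 2, 5, 4)
N <- length(X)

# Sample covariance "by hand": sum of cross-products of deviations
# from the means, divided by N - 1.
cov_manual <- sum((X - mean(X)) * (Y - mean(Y))) / (N - 1)

cov_manual                         # 4
all.equal(cov_manual, cov(X, Y))   # TRUE
```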

4.4.2 Correlation

Interactive Example: Sample Correlation

This example demonstrates the calculation of (and intuition behind) sample correlation. The variables \(X\) and \(Y\) are drawn from a Bivariate Normal distribution with correlation \(\rho\) and standard deviations \(\sigma_x\) and \(\sigma_y\). Change the parameters in the left panel to view different relationships between \(X\) and \(Y\).

By clicking New Sample, you will generate a new set of \(N\) observations for \(X\) and \(Y\). A scatterplot is shown in the main panel. The equation for the correlation calculation is shown below that.

If you’d like to calculate the correlation for the displayed data, click the “Show X & Y data” box. Copy and paste the \(X\) and \(Y\) vectors into R. First calculate the sample correlation using only the cov() and sd() commands. Check your answer using the cor() command.
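One way to carry out that check is sketched below, assuming the displayed equation is the usual sample correlation, i.e., the sample covariance divided by the product of the two sample standard deviations. The data are made up for illustration and are not taken from the app.

```r
# Hypothetical data standing in for the vectors copied from the app.
X <- c(2, 4, 6, 8, 10)
Y <- c(1, 3, 2, 5, 4)

# Sample correlation using only cov() and sd().
cor_manual <- cov(X, Y) / (sd(X) * sd(Y))

cor_manual                         # 0.8
all.equal(cor_manual, cor(X, Y))   # TRUE
```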

Data Analysis I (DRAFT)