# Data Analysis

BANA 5300

Data Analysis Assignment 1

The general approach you should take to a data analysis assignment is to write as if you were providing a report to a manager who knows basic math and statistics but is not a specialist. Your work product should exhibit a high degree of professionalism. Specifically, on every data analysis assignment, I expect your papers to conform to the following guidelines:

1). Word-process the assignment using 11- or 12-point font, double spacing, and 1-inch margins all around.

2). If applicable to the problem, use subscripts and superscripts. For example, write 25², not 25^2, if you need to indicate an exponent; write x₁, not x_1.

3). Write any equations or mathematical symbols using the equation editor (Word, OpenOffice, and other programs have these). For example, write the sample mean, pronounced “x-bar,” as x̄, not “xbar” or something similar.

4). Write all of your answers in complete, grammatically correct sentences. Never answer a question with just a number. For example, if the question asks for the sample mean salary, write “The sample mean of the salaries is <whatever>.”

5). Spell check your document.

6). Write for the reader, not “the teacher.” Explain exactly what you are doing on every question as if writing for a colleague or supervisor who does not know the details of the problem you are working on. Imagine you had to pick up your assignment a year from now and understand it; write your answers so that you could do that.

7). Include an introductory sentence before every graph, symbol, table, or element of software output. For example, if the directions ask for a histogram, include a sentence before the histogram describing exactly what the graph shows. If the directions ask for a confidence interval, write out the conclusion as a statement in the context of the problem. For example, do not write “[4.2, 5]”; write “We are 95% confident that the true mean length of time spent at a job is between 4.2 and 5 years.”

8). Make all figures large enough to be easily read, and do not rely on color to distinguish the various components of graphs. Instead, use different patterns or shades of gray.

Professionalism, as judged by your paper adhering to the guidelines above, will make up roughly 40% of each assignment grade.

1. (Note: For this question, and only this question, all group members must answer individually. Label each answer with the name of the student to whom it belongs. Each person’s response should be no less than half a page, following the formatting guidelines above. This question refers to material in the “Big Picture” document and video.)

Think of a topic of interest to you, either professionally or personally, that addresses some natural phenomenon that can be studied by observation, experimentation, or by analyzing existing data. Your options here are quite varied; the only restriction is that the topic can be studied by gathering observable data. Example topics include: crime, poverty, e-commerce, privacy rights, sick leave policies in the U.S. versus Europe, money in politics, education, mental health issues, gun control, tax policy, race relations, corporate governance, minimum wage laws, free trade agreements, mobile phone usage, population growth, birth control in developing countries, body language, the work-life balance, and countless others. If you need inspiration, a favorite site of mine is TED.com,¹ which houses thousands of 15–20-minute talks about almost any conceivable topic. Just pick one that interests you.

Personal beliefs that have no ability to be tested systematically, such as the belief that the entire known universe rests on the back of a giant turtle, or that you are really just a “brain floating in a jar” experiencing everything through an elaborate simulation,² are not topics to discuss here. Once you have your topic, do the following.

(a) Think of a question that you have about your topic and use Google Scholar to find one (1) academic research paper that generally addresses that question.³ Focus on papers submitted to scholarly journals, not on news articles or “white papers” written by companies or organizations. Note that you might not understand most of the paper; that’s okay. Just read the abstract (i.e., the summary at the beginning), the introduction section, and the conclusion section. If, after a few tries, you don’t have a general idea of what the researchers did, find a better article (the point of these three sections is to explain clearly what is being done, but some excellent researchers are poor communicators). Give a full citation for the article using APA⁴ format. Then, briefly, explain:

1. The aspect of Nature⁵ that was studied.

2. The Design and Measurement approach. That is, how did the researchers decide to address the question? With an experiment? By observing a group? By working with an existing data set? By some other means?

3. The Data the researchers obtained. That is, what did the numbers they recorded represent? Crime rates? Weights? Reaction times? Website hits? Rankings on a 1–5 (or similar) scale?⁶

(b) How convincing are the conclusions of the article? Do you believe the researchers approached the problem the right way? If you had an opinion on the issue before, has it changed or has it been reinforced? If you think the study was flawed or not convincing, state, briefly but specifically, why you think so.

2. (Note: From here onward, you can work in a group as normal.) Again referring to the “Big Picture” document and/or video, briefly state whether you think each of the statements below is reasonable and why you think it is or is not. You don’t need to write much more than a sentence or two for each, and you do not need to mention any statistical methods or cite any outside sources. Just use your best judgment. You may make additional (but reasonable) assumptions if needed to support your point.

(a) I’ve heard smoking causes cancer, but my dear Aunt Hilda lived to be 90 and smoked most of her life. So smoking does not cause cancer.

¹ Alphabetical list of TED talks.

² Yes, this is a real philosophical point of view. See http://en.wikipedia.org/wiki/Brain_in_a_vat

³ You may not find something that addresses the issue exactly, as that’s how new research starts, but you should find something broadly related.

⁴ Here is a link to the format of an APA journal citation: http://www.easybib.com/reference/guide/apa/journal

⁵ Remember that “Nature,” in this context, is the entire observable world, not just trees and birds.

⁶ These rating scales are formally called Likert scales.

(b) Two groups of U.S. high school freshmen, selected by randomly choosing student ID numbers, were enrolled in a two-week drug abuse education program. One group (A) received instruction from a police officer in uniform while the other group (B) received no instruction. Five years later, the two groups were surveyed. In Group A, 15% had tried an illegal drug at least once in the last five years, while in Group B, 30% had done so. The program therefore does not work because it should be 0% for Group A.

(c) A researcher asked 100 shoppers in a mall to try a new bracelet that supposedly improves balance. Each shopper was told that the bracelet used magnets to direct the body’s energy flow. The researcher asked each shopper to stand on one foot as long as possible while wearing the bracelet, and then to repeat the action without wearing the bracelet. At the end of the study, 70% of the shoppers stood for a longer time while wearing the bracelet. Therefore, the bracelet is effective at improving balance.

(d) A graph⁷ shows that as the sales of organic food increased from 1997 to 2007, diagnoses of autism also increased. Organic foods are therefore a cause of autism.

3. This problem will use the “MBA survey data” data set. This is an additional data file located under the “Course Materials” area in Blackboard. The data were collected from a survey given to one MBA statistics class at Sam Houston State University. Students were not required to respond to the survey, and received no course credit whatsoever for choosing to participate. In total, 38 out of 50 students responded. The variable names and descriptions are as follows:

work_stat: The student’s employment status. Possible values were: working full time; working part time; unemployed, laid off, or looking for work; other. If “other” was chosen, the student was given the chance to enter additional information.

job_hrs: The number of hours the student typically spends per week at his/her job

sch_hrs: The number of hours the student typically spends on school work per week

num110: The first number between 1 and 10 the student thought of after reading the question

last_name: Letter group containing the first letter of the student’s last name

num_kids: The number of children the student has

age: The student’s age

gender: The student’s gender

yrs_stat: The number of years since the student last studied statistics

accidents: The number of automobile accidents (reported and unreported) the student had in the past year

⁷ Here is the link.

(a) Classify each variable as nominal/categorical, ordinal, or interval/ratio, and explain your reasoning.

(b) For the variable “job_hrs,” find the mean, standard deviation, minimum, maximum, and quartiles. Explain what each measure tells you in the context of the survey.

(c) Make a completely labeled histogram and boxplot for “job_hrs.” Describe the shape of the histogram and what it tells you about the population or process the data came from.

(d) Repeat (b) but separate the results by gender (you don’t have to repeat the descriptions of what the measures tell you; just report the statistics). This can be done using either a pivot table in Excel or in R Commander by going to Statistics -> Summaries -> Numerical Summaries. Then click “Summarize by groups…” and select “gender” as the group. Comment on the differences you see between the groups.

(e) Repeat (c) but, again, separate the results by gender. You can make separate histograms in Excel, but R will be easier. Similar to (d), go to Graphs -> Histogram, click “Plot by groups…,” and again select “gender” as the group. Describe the shapes of the plots and what additional information they provide about the two groups.

(f) Investigate the data set to find the cause of the outlier visible in the plots you made. Putting yourself in the position of a professional data analyst, discuss whether or not the outlier should be removed and why you think as you do.

(g) The dean of the College of Business eventually wants to make a report to the Board of Regents about the demographics of the MBA program, and wants to use the data you have collected. Is this data appropriate for that purpose? In your answer, mention the population or process you think the sample of 38 students is taken from. (Hint: All data come from some population or process; the issue is whether it’s the population we are interested in studying.)

(h) The dean grants you access to the full registrar database containing the demographic information for all MBA students currently enrolled. Explain why you still might not have “the population.” In what sense is this data better than the sample of 38 students you used earlier?
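For readers who prefer to script the grouped summaries in parts (b) and (d) rather than use a pivot table or the R Commander menus, the following is a minimal sketch in Python using only the standard library. It is not part of the assigned Excel/R workflow. The rows are invented for illustration and are not the MBA survey responses, and the helper names `summarize` and `summarize_by_group` are hypothetical.

```python
import statistics
from collections import defaultdict

# Invented rows for illustration only; these are NOT the MBA survey responses.
rows = [
    {"gender": "F", "job_hrs": 40},
    {"gender": "F", "job_hrs": 20},
    {"gender": "F", "job_hrs": 45},
    {"gender": "F", "job_hrs": 30},
    {"gender": "M", "job_hrs": 50},
    {"gender": "M", "job_hrs": 40},
    {"gender": "M", "job_hrs": 0},
    {"gender": "M", "job_hrs": 35},
]


def summarize(values):
    """Return the measures requested in part (b) for one list of numbers."""
    # quantiles(n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation (n - 1)
        "min": min(values),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(values),
    }


def summarize_by_group(rows, group_key, value_key):
    """Split the rows by group_key and summarize value_key within each group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row[value_key])
    return {group: summarize(values) for group, values in groups.items()}


summary = summarize_by_group(rows, "gender", "job_hrs")
for group, stats in sorted(summary.items()):
    print(group, stats)
```

Software packages use slightly different conventions for quartiles, so the quartile values from this sketch may not match Excel or R Commander exactly on the same data.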

4. The following problems will use the “AMESHousing.csv” data set, which includes 82 variables on nearly 3,000 residential properties in Ames, Iowa, from 2006 to 2010, obtained from the Ames Assessor’s Office. The names of the variables and what they represent are found in the “AmesHousingDescription” file in the “Data Sets” folder on Blackboard. Using R or Excel, do the following:

(a) Make a histogram of the Sale_Price variable and describe the pattern that you see. Is this pattern surprising to you, given what you know about the housing market? If you see strange-looking numbers with “e’s” in the plot, those indicate powers of 10. For example, 1.5e+05 is $1.5 \times 10^5 = 150{,}000$. Determine a way to get rid of the “e’s.” Hint: The hard way is to start messing with R’s graphing code to get the big numbers to print. There is a much easier way to solve this issue.

(b) Make a new variable called “lnSale” to record the natural logarithm of the sale price, i.e., ln(Sale_Price). You can do this either in Excel by adding a column and re-importing the data to R, or directly in R Commander by going to Data -> Manage variables in active data set -> Compute new variable and inputting the appropriate expression to calculate the natural logarithm of Sale_Price.⁸ Make a histogram of “lnSale” and compare its appearance to the one in (a). Draw a conclusion about one of the properties of the natural logarithm.
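As an illustration of the “compute new variable” step in part (b), here is a minimal Python sketch of building a log-transformed column. It is an alternative to the Excel and R Commander routes described above, not a replacement for them. The prices are invented and are not taken from AMESHousing.csv; the names `sale_price` and `ln_sale` are placeholders.

```python
import math

# Invented sale prices for illustration only; NOT taken from AMESHousing.csv.
sale_price = [105000, 172000, 244000, 189900, 755000]

# math.log with a single argument is the natural logarithm (base e).
ln_sale = [math.log(price) for price in sale_price]

# Print the original and transformed values side by side.
for price, ln_price in zip(sale_price, ln_sale):
    print(f"{price:>8,d}  {ln_price:.4f}")
```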

5. Using R Commander or Excel, calculate a new variable called “Age” to hold the age of a house at the time of sale. In R Commander, select Data -> Manage variables in active data set -> Compute new variable and input the appropriate expression to calculate the age of the house using the variables “Year_Built” and “Year_Sold.” Then do the following:

(a) Make a histogram of “Age” and describe the information you can get from it.

(b) Make a box plot of “Age” and describe what information you can get from it that you cannot get easily from a histogram.

(c) Calculate the mean, standard deviation, and 90th percentile of the age data. Describe what each measure tells you.
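The calculations in this problem can also be sketched in a few lines of Python, shown here only as an illustration alongside the Excel and R Commander routes. The (Year_Built, Year_Sold) pairs are invented and are not rows from AMESHousing.csv. The sketch assumes that age at sale means the sale year minus the construction year, and it uses the standard library's "inclusive" percentile convention, which may differ slightly from other software.

```python
import statistics

# Invented (Year_Built, Year_Sold) pairs for illustration only;
# these are NOT rows from AMESHousing.csv.
houses = [
    (1960, 2010), (2003, 2008), (1995, 2006), (1978, 2009),
    (2006, 2007), (1920, 2010), (1988, 2008), (2001, 2006),
    (1950, 2007), (1999, 2009), (1972, 2006),
]

# Assumption: age at the time of sale = year sold - year built.
age = [sold - built for built, sold in houses]

mean_age = statistics.mean(age)
sd_age = statistics.stdev(age)  # sample standard deviation (n - 1)

# quantiles(n=10) returns the nine decile cut points;
# the last one is the 90th percentile.
p90 = statistics.quantiles(age, n=10, method="inclusive")[-1]

print(f"mean = {mean_age:.2f}, sd = {sd_age:.2f}, 90th percentile = {p90}")
```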

6. This problem will use the General Social Survey (GSS) data from 2012, which is in the “gss_2012.csv” file. We will use side-by-side bar charts to investigate the question of whether there is a relationship between marital status and general happiness. Using an Excel Pivot Chart (or R Commander if you’re brave), do the following:

(a) Display a contingency table with the marital status (“marital”) variable down the rows and the general happiness variable (“happy”) along the columns.

(b) Summarize the distribution of the “happy” variable by marital status using row percentages.

(c) Display a side-by-side bar chart of the distribution you found in (b).

(d) Using (b) and (c), does there seem to be a significant relationship between marital status and general happiness? If so, what is the apparent relationship? Describe specifically what features you are looking for in the graph and in the table in order to make your assessment.
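To make the idea of row percentages in parts (a) and (b) concrete, here is a minimal Python sketch that builds a contingency table and converts each row to percentages. It is offered as an illustration only, not as the Excel Pivot Chart or R Commander procedure. The responses are invented and are not GSS data, the category labels are assumed and may not match the coding in gss_2012.csv, and the helper name `row_percentages` is hypothetical.

```python
from collections import Counter

# Invented (marital, happy) responses for illustration only; NOT GSS 2012 data.
# The category labels are assumptions and may differ from the real file.
responses = [
    ("married", "very happy"), ("married", "pretty happy"),
    ("married", "very happy"), ("married", "not too happy"),
    ("never married", "pretty happy"), ("never married", "pretty happy"),
    ("never married", "very happy"), ("never married", "not too happy"),
    ("divorced", "pretty happy"), ("divorced", "not too happy"),
]

# Contingency table of counts, keyed by (row label, column label).
counts = Counter(responses)
row_labels = sorted({marital for marital, _ in responses})
col_labels = ["very happy", "pretty happy", "not too happy"]


def row_percentages(counts, row_labels, col_labels):
    """Convert counts to percentages so that each row sums to 100."""
    table = {}
    for r in row_labels:
        row_total = sum(counts[(r, c)] for c in col_labels)
        table[r] = {c: 100 * counts[(r, c)] / row_total for c in col_labels}
    return table


pct = row_percentages(counts, row_labels, col_labels)
for r in row_labels:
    print(r, {c: round(p, 1) for c, p in pct[r].items()})
```

Each row of the resulting table describes the distribution of “happy” within one marital status, which is the quantity a side-by-side bar chart displays.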

7. Download two years of weekly stock closing prices from a publicly traded company of your choice.⁹ Then using Excel (probably easier) or R Commander, do the following:

⁸ In R, the natural log function is log().

⁹ Yahoo! Finance makes this easy. Go to http://finance.yahoo.com and enter the company name (or stock symbol) in the “Quote Lookup” area. Then click “Historical Prices” on the left-hand side and enter the two-year time frame. The starting year should be 2012.

(a) Letting $Y_t$ be the closing price of the stock in week $t$, define a new variable called “return” as

$$R_t = \frac{Y_t - Y_{t-1}}{Y_{t-1}},$$

where $Y_{t-1}$ is the price of the stock in the previous trading week (set the return for the first period to 0, and note that you will only have “real” data for the second week onward). Thus, $R_t$ is just the percentage change from week to week without the multiplication by 100.

(b) Report the first 10 rows of returns.

(c) Define a new variable $L_t = \ln(Y_t)$ to hold the natural logarithm of the closing price in each week. Report the first 10 rows of $L_t$.

(d) Define one more variable “lnDiff” as the difference between the log of the price in week $t$ and the log of the price in week $t-1$. That is, lnDiff $= L_t - L_{t-1}$. Compare the values of “lnDiff” to the values of $R_t$ from part (a) and draw a conclusion about a property of the natural logarithm.
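The three variables defined in parts (a), (c), and (d) can be sketched in Python as follows, purely as an illustration of the column calculations that Excel or R Commander would perform. The prices are invented and are not real market data. Setting the first value of “lnDiff” to 0 is an assumption made here to mirror the convention given for the first return; the assignment does not specify it.

```python
import math

# Invented weekly closing prices for illustration only; NOT real market data.
close = [50.00, 51.00, 49.98, 52.00, 52.52]

# Part (a): R_t = (Y_t - Y_{t-1}) / Y_{t-1}, with the first period set to 0.
returns = [0.0] + [
    (close[t] - close[t - 1]) / close[t - 1] for t in range(1, len(close))
]

# Part (c): L_t = ln(Y_t).
log_price = [math.log(y) for y in close]

# Part (d): lnDiff_t = L_t - L_{t-1}.
# Assumption: the first entry is set to 0, mirroring the rule for returns.
ln_diff = [0.0] + [
    log_price[t] - log_price[t - 1] for t in range(1, len(log_price))
]

# Print the columns side by side for comparison.
print("week   close   return    lnDiff")
for week, (y, r, d) in enumerate(zip(close, returns, ln_diff), start=1):
    print(f"{week:>4}  {y:>6.2f}  {r:>7.4f}  {d:>8.4f}")
```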

Extra Credit (i.e., I won’t test you on this, but I’ll give you some extra points if you do it correctly): Show mathematically why the property in (d) occurs. Your answer should not rely on any specific data values but show, in general, why the property “works.”
