
[Getting and Cleaning Data] Quiz 2

Question 1 | Question 2 | Question 3 | Question 4 | Question 5

For more detail, see the HTML file here.
Question 1
Register an application with the GitHub API (github application). Access the API to get information on your instructor's repositories (target url). Use this data to find the time that the datasharing repo was created. What time was it created? This tutorial may be useful (help tutorial). You may also need to run the code in the base R package and not RStudio.
2013-11-07T13:25:07Z
2014-03-05T16:11:46Z
2014-02-06T16:13:11Z
2012-06-20T18:39:06Z

library(httr)
library(httpuv)
# 1.OAuth settings for github:
Client_ID <- '66fba4580b9b23531d6e'
Client_Secret <- '7fd8a4f7d72ab12b6c01b5c4880bc6da7723eec2'
myapp <- oauth_app("First APP", key = Client_ID, secret = Client_Secret)
# 2. Get OAuth credentials
github_token <- oauth2.0_token(oauth_endpoints("github"), myapp)
# 3. Use API
gtoken <- config(token = github_token)
req <- GET("https://api.github.com/users/jtleek/repos", gtoken)
stop_for_status(req)
# 4. Extract the content from the request
json1 <- content(req)
# 5. Convert the parsed list back to JSON, then to a data frame
json2 <- jsonlite::fromJSON(jsonlite::toJSON(json1))
# 6. Result 
json2[json2$full_name == "jtleek/datasharing", ]$created_at

Question 2
The sqldf package allows for execution of SQL commands on R data frames. We will use the sqldf package to practice the queries we might send with the dbSendQuery command in RMySQL. Download the American Community Survey data and load it into an R object called acs (data website). Which of the following commands will select only the data for the probability weights pwgtp1 with ages less than 50?
sqldf("select * from acs where AGEP < 50")
sqldf("select * from acs")
sqldf("select pwgtp1 from acs")
sqldf("select pwgtp1 from acs where AGEP < 50")

# load package: sqldf is short for SQL select for data frame.
library(sqldf)
# 1. download data (create the data directory first if it does not exist)
if (!dir.exists("./data")) dir.create("./data")
download.file(url = "https://d396qusza40orc.cloudfront.net/getdata%2Fdata%2Fss06pid.csv", destfile = "./data/acs.csv")
# 2. read data
acs <- read.csv("./data/acs.csv")
# 3. select using sqldf (the quiz answer)
sqldf("select pwgtp1 from acs where AGEP < 50", drv = "SQLite")

Question 3
Using the same data frame you created in the previous problem, what is the equivalent function to unique(acs$AGEP)?
sqldf("select unique AGEP from acs")
sqldf("select distinct pwgtp1 from acs")
sqldf("select AGEP where unique from acs")
sqldf("select distinct AGEP from acs")

# the sqldf result should match base R's unique() for the AGEP column
result <- sqldf("select distinct AGEP from acs", drv = "SQLite")
nrow(result)
length(unique(acs$AGEP))

Question 4
How many characters are in the 10th, 20th, 30th and 100th lines of HTML from this page (target page)? (Hint: the nchar() function in R may be helpful.)
45 31 2 25
43 99 8 6
43 99 7 25
45 0 2 2
45 31 7 25
45 92 7 2
45 31 7 31

# 1. open a connection to the page
con <- url("http://biostat.jhsph.edu/~jleek/contact.html")
# 2. read the lines of HTML from the connection
content <- readLines(con)
close(con)
# 3. result
nchar(content[c(10, 20, 30, 100)])

Question 5
Read this data set into R and report the sum of the numbers in the fourth column (data web). Original source of the data: original data web.
(Hint: this is a fixed-width file format.)
35824.9
101.83
36.5
222243.1
28893.3
32426.7

# 1. read data
data <- read.fwf(file = "https://d396qusza40orc.cloudfront.net/getdata%2Fwksst8110.for",
                 skip = 4,
                 widths = c(12, 7, 4, 9, 4, 9, 4, 9, 4))
# 2. result
sum(as.numeric(data[,4]))

Coursera: Using Python to Access Web Data, Quiz 4

 
 

Question 1

Which of the following Python data structures is most similar to the value returned in this line of Python:

 

x = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')

socket

regular expression

dictionary

file handle

list
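
As a quick side check (my own sketch, not part of the quiz): the object urlopen() returns can be looped over line by line, much like the file handle returned by open().

import urllib.request

fhand = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
for line in fhand:
    # each line arrives as bytes; decode() turns it into a str
    print(line.decode().strip())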

Question 2

In this Python code, which line actually reads the data?

 

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())

mysock.close()

mysock.recv()

socket.socket()

mysock.close()

mysock.connect()

mysock.send()

Question 3

Which of the following regular expressions would extract the URL from this line of HTML:

 

<p>Please click <a href="http://www.dr-chuck.com/">here</a></p>

href="(.+)"
href=".+"
http://.*
<.*>
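
A small sketch to sanity-check the candidate patterns against the sample line (line and patterns as reconstructed above):

import re

line = '<p>Please click <a href="http://www.dr-chuck.com/">here</a></p>'
print(re.findall('href="(.+)"', line))   # ['http://www.dr-chuck.com/']
print(re.findall('http://.*', line))     # also matches, but drags in the rest of the line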

Question 4

In this Python code, which line is most like the open() call to read a file?

 

import socket

mysock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
mysock.connect(('data.pr4e.org', 80))
cmd = 'GET http://data.pr4e.org/romeo.txt HTTP/1.0\n\n'.encode()
mysock.send(cmd)

while True:
    data = mysock.recv(512)
    if (len(data) < 1):
        break
    print(data.decode())

mysock.close()

mysock.connect()

import socket

mysock.recv()

mysock.send()

socket.socket()
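
A rough analogy, sketched under the assumption that romeo.txt has been saved locally: connect() plays the role of open(), while recv() plays the role of read().

# open()/read() on a file roughly parallels connect()/recv() on a socket
handle = open('romeo.txt')     # like mysock.connect(...): gain access to the data source
chunk = handle.read(512)       # like mysock.recv(512): this is where data is actually read
print(chunk)
handle.close()                 # like mysock.close()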

Question 5

Which HTTP header tells the browser the kind of document that is being returned?

HTML-Document:

Content-Type:

Document-Type:

ETag:

Metadata:
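
A minimal sketch for inspecting the headers of a response; the Content-Type header is the one that describes the kind of document being returned (the exact value printed depends on the server).

import urllib.request

resp = urllib.request.urlopen('http://data.pr4e.org/romeo.txt')
print(resp.getheader('Content-Type'))   # the server reports the document type here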

Question 6

What should you check before scraping a web site?

That the web site returns HTML for all pages

That the web site supports the HTTP GET command

That the web site allows scraping

That the web site only has links within the same site

Question 7

What is the purpose of the BeautifulSoup Python library?

It builds word clouds from web pages

It allows a web site to choose an attractive skin

It optimizes files that are retrieved many times

It animates web operations to make them more attractive

It repairs and parses HTML to make it easier for a program to understand

Question 8

What ends up in the “x” variable in the following code:

 

html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
x = soup('a')
 

A list of all the anchor tags (<a..) in the HTML from the URL

True if there were any anchor tags in the HTML from the URL

All of the externally linked CSS files in the HTML from the URL

All of the paragraphs of the HTML from the URL
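
A hedged sketch, assuming BeautifulSoup (beautifulsoup4) is installed and url points at any HTML page; soup('a') is shorthand for soup.find_all('a') and yields the anchor tags.

import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.dr-chuck.com/'    # any HTML page will do; this URL is only an example
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, 'html.parser')
x = soup('a')                       # a list of the <a> (anchor) tags
for tag in x:
    print(tag.get('href', None))    # the link target of each anchor, if present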

Question 9

What is the most common Unicode encoding when moving data between systems?

UTF-32

UTF-64

UTF-16

UTF-128

UTF-8
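
A quick illustration (my own sketch, not quiz material) of what moving text between systems looks like: a str is encoded to UTF-8 bytes before sending and decoded back on arrival.

s = 'café'                  # a str with a non-ASCII character
b = s.encode('utf-8')       # bytes suitable for sending between systems
print(b)                    # b'caf\xc3\xa9'
print(b.decode('utf-8'))    # café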

Question 10

What is the decimal (Base-10) numeric value for the upper case letter “G” in the ASCII character set?

71

7

103

25073

14
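
This one can be checked directly with the built-in ord() function:

print(ord('G'))   # 71
print(chr(71))    # G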

Question 11

What word does the following sequence of numbers represent in ASCII:
108, 105, 110, 101
 

lost

tree

ping

line

func
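
A one-liner check, mapping each ASCII code back to its character with chr():

print(''.join(chr(n) for n in [108, 105, 110, 101]))   # line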

Question 12

How are strings stored internally in Python 3?

Byte Code

UTF-8

ASCII

EBCDIC

Unicode

Question 13

When reading data across the network (i.e. from a URL) in Python 3, what method must be used to convert it to the internal format used by strings?

decode()

upper()

find()

trim()

encode()
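
A final sketch tying questions 12 and 13 together: bytes arriving from the network are turned into Python 3's internal (Unicode) strings with decode(), and encode() goes the other way.

data = b'But soft what light through yonder window breaks'   # bytes, as read from a socket or URL
text = data.decode()           # now a str, Python 3's internal Unicode string type
print(type(data), type(text))
print(text.encode() == data)   # encode() goes the other way; prints True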
