question.. should curlconverter::straighten() fail if curlconverter isn't attached? thanks #15

Open
ajdamico opened this issue Jan 15, 2017 · 5 comments
@ajdamico

browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"

# fails
curlconverter::straighten( browserGET )

# works
library(curlconverter)
straighten( browserGET )
@hrbrmstr
Owner

Hrm. .onAttach() does not get called when you do that, and that's where V8 gets initialized. However, I agree that this should work, and it should be as simple as testing whether the pkg global has been initialized when that function is called.
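A minimal sketch of that kind of check, assuming the shared state lives in a private environment and is created lazily on first use. The names `.pkgenv`, `init_ctx()` and `get_ctx()` are hypothetical and are not taken from curlconverter, and `init_ctx()` is only a stand-in for the real V8 setup:

```r
# Private environment for package-level state. In a real package this would
# sit at the top level of an R/ source file. `.pkgenv`, `init_ctx()` and
# `get_ctx()` are hypothetical names, not curlconverter internals.
.pkgenv <- new.env(parent = emptyenv())

# Stand-in for the setup currently done in .onAttach() (for curlconverter,
# creating the V8 context). It only records a timestamp here so that the
# sketch runs without extra packages.
init_ctx <- function() {
  list(created = Sys.time())
}

# Return the shared context, creating it on first use. Exported functions
# that need the context would call this instead of reading the global directly.
get_ctx <- function() {
  if (is.null(.pkgenv$ctx)) {
    .pkgenv$ctx <- init_ctx()
  }
  .pkgenv$ctx
}

first  <- get_ctx()
second <- get_ctx()
identical(first, second)  # the context is created once and then reused
```

Another option is to do the setup in .onLoad() rather than .onAttach(): R runs .onLoad() whenever the namespace is loaded, which includes `pkg::fun()` calls, while .onAttach() only runs when the package is attached with library().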

Huge thanks for finding this edge case. I'll try to get a patch on github tonight.

@ajdamico
Author

hi, thanks. i guess i'll go with this workaround to eliminate the cran build note until you push the next version to cran :)

ajdamico/lodown@512ed29

thank you for making this possible

# automatically load the world values survey
devtools::install_github("ajdamico/lodown")
library(lodown)
lodown( "wvs" , output_dir = "C:/My Directory/WVS" )

@hrbrmstr
Owner

OH wait. I get the use-case you're doing now. You really don't need to use curlconverter in a pkg that way. If you do just straighten():

library(curlconverter)

browserGET <- "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0' -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8' -H 'Accept-Language: en-US,en;q=0.5' --compressed -H 'Connection: keep-alive' -H 'Upgrade-Insecure-Requests: 1'"

you get back a list:

str(straighten(browserGET))
## List of 1
##  $ :List of 5
##   ..$ url      : chr "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp"
##   ..$ method   : chr "get"
##   ..$ headers  :List of 6
##   .. ..$ Host                     : chr "www.worldvaluessurvey.org"
##   .. ..$ User-Agent               : chr "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0"
##   .. ..$ Accept                   : chr "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
##   .. ..$ Accept-Language          : chr "en-US,en;q=0.5"
##   .. ..$ Connection               : chr "keep-alive"
##   .. ..$ Upgrade-Insecure-Requests: chr "1"
##   ..$ url_parts:List of 9
##   .. ..$ scheme  : chr "http"
##   .. ..$ hostname: chr "www.worldvaluessurvey.org"
##   .. ..$ port    : NULL
##   .. ..$ path    : chr "WVSDocumentationWV4.jsp"
##   .. ..$ query   : NULL
##   .. ..$ params  : NULL
##   .. ..$ fragment: NULL
##   .. ..$ username: NULL
##   .. ..$ password: NULL
##   .. ..- attr(*, "class")= chr [1:2] "url" "list"
##   ..$ orig_curl: chr "curl 'http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp' -H 'Host: www.worldvaluessurvey.org' -H 'User-Agent: Mozilla/5."| __truncated__
##   ..- attr(*, "class")= chr [1:2] "cc_obj" "list"
##  - attr(*, "class")= chr [1:2] "cc_container" "list"

Which means you can either use dput() to capture that structure or saveRDS() to turn it into an R data file which you can have auto-loaded in your pkg.
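Roughly, the two capture options look like this. This is a sketch that uses a small hand-built list as a stand-in for the real straighten() result (only two of its fields are reproduced), and where the file would live inside a package is left out:

```r
# Stand-in for the object returned by straighten(browserGET); the real
# object has more fields and carries class attributes.
parsed <- list(
  list(
    url    = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp",
    method = "get"
  )
)

# Option 1: serialize to an R data file and read it back later.
rds_path <- tempfile(fileext = ".rds")
saveRDS(parsed, rds_path)
restored <- readRDS(rds_path)
stopifnot(identical(parsed, restored))

# Option 2: print R source code that recreates the object, ready to paste
# into a script or package file.
dput(parsed)
```

Either way, the package only ships the captured structure and no longer needs curlconverter at run time.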

You're probably going the next step and doing a make_req():

straighten(browserGET) %>%
  make_req() -> req

One thing that I've been struggling to make clearer is that immediately after make_req() is called, the contents (source code) of the function it creates are placed on the clipboard, i.e. if you cmd-v (mac) or ctrl-v (win) in the editor you'll get the source code for the function placed right where the cursor is. In this case:

httr::VERB(verb = "GET", url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp", 
    httr::add_headers(Host = "www.worldvaluessurvey.org", 
        `User-Agent` = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
        `Accept-Language` = "en-US,en;q=0.5", 
        Connection = "keep-alive", 
        `Upgrade-Insecure-Requests` = "1"))

You could also get that by just typing req[[1]] (no parens) at the R console:

function () 
httr::VERB(verb = "GET", url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp", 
    httr::add_headers(Host = "www.worldvaluessurvey.org", `User-Agent` = "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:49.0) Gecko/20100101 Firefox/49.0", 
        Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
        `Accept-Language` = "en-US,en;q=0.5", Connection = "keep-alive", 
        `Upgrade-Insecure-Requests` = "1"))
<environment: 0x10675c0d8>

That adds some cruft, which is why I made it "auto copy to clipboard".

That particular curl translation can be simplified to the following (when I do this for my own projects, I iteratively remove individual cookies and headers until I get down to the minimum viable httr verb call):

httr::GET(url = "http://www.worldvaluessurvey.org/WVSDocumentationWV4.jsp")

I'm still going to make straighten() work via :: calling but I wanted to make sure you knew ^^ since it's unlikely you really do need to use curlconverter within a pkg.

@hrbrmstr
Owner

I think you're going to need to use a different target. A great deal of the content on that page is dynamically loaded at run-time, and the center column (which has the citation and data files) that you want to target is also an iframe:

image

(apologies for the faint highlighting due to the dark theme but it should be visible).

The next problem is that more of the content is loaded via another call to a javascript file:

image

And, your final problem is that the js file in ^^ loads the actual content but:

image

All of the hrefs are wrapped in a call to DocDownloadLicense() which dynamically builds the form you're probably familiar with:

image

Without something like RSelenium or seleniumPipes you're not going to be able to automate this, and you can't embed either in an R package since you need a back-end selenium grid, a standalone selenium server or phantomjs running live to do the work.

@ajdamico
Author

ajdamico commented Jan 16, 2017 via email
