How to log in to any website using Curl from the command line or a shell script

There are times you need to scrape or crawl some field on a page, but the page requires authentication (logging in). Unless the site uses Basic Auth, where you can put the username and password in the url like http://username:password@example.com/, you'll need to use curl with more sophistication. Besides curl, there are other web tools you can use on the command line, such as links/elinks (elinks is an enhanced version of links which also supports JavaScript to a very limited extent). Neither links nor curl will execute JavaScript, though, so if that's necessary to get the fields you want, you should try Selenium or CasperJS/PhantomJS instead.
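If the site really is using Basic Auth, curl handles that by itself with the -u option; for example (example.com and the path here are just placeholders):

  curl -u myname:sEcReTwOrD123 http://example.com/protected/page.html

curl will send those credentials in the Authorization header of the request.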

But for submitting forms and sending saved cookies, curl will suffice.

First, load the page with the form you want to submit in Chrome. Copy the url and make sure you can load it in an incognito window, to confirm you can get there without having already logged in. Now use Chrome's developer tools to inspect the form element: note the form's submit url and its fields.

The form may have a hidden field with a value that looks random. This is typically a CSRF (Cross-Site Request Forgery) token, meant to ensure the form can only be submitted when it was actually generated and shown to the user. If your login form has a CSRF token field, you will need to first load the form page with curl and save the CSRF value, then use it when submitting the login form (along with your username and password). The CSRF token may be in the form element, but it could also arrive in a cookie. Either way, you'll have to save some output from that initial request and look at its format.
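A minimal first request might look like this (the /login url and the file names are just examples; -c tells curl to save any cookies the server sets, and -o saves the page body, so you can inspect both):

  # Fetch the login form, keeping the html (for a form-based token)
  # and any cookies the server sets (for a cookie-based token)
  curl -c cookies.txt -o login_page.html http://example.com/login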

This means you not only need to retrieve the page using curl, you also need to parse the resulting html to find the CSRF token and get its unique value. Since you are only looking for one field and it should be on one line, you can probably do this with common Unix tools like grep, sed, awk, or cut; the details will depend on the format of the html or the name of the cookie. For cookies, you can simply send back all the cookies you received instead of parsing them.
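As a rough sketch, if the token sits in a hidden input named 'csrf' with the name attribute appearing before the value attribute (both assumptions; check your page's actual markup), you could extract it from the saved page like this:

  # Pull the value out of: <input type="hidden" name="csrf" value="...">
  CSRF=$(grep -o 'name="csrf" value="[^"]*"' login_page.html | sed 's/.*value="//; s/"$//')
  echo "$CSRF"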

Use curl's -d 'arg1=val1&arg2=val2&argN=valN' argument to pass each field's value and POST the result to the form's target url. For example, if the CSRF field really were called 'csrf', then you might POST to the login form like so: curl -d 'username=myname&password=sEcReTwOrD123&csrf=123456' http://example.com/login
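If your password or any other value contains characters like & or =, one way to avoid breaking the -d string is to let curl encode each field separately with --data-urlencode (the field names here are just the ones from the example above):

  curl --data-urlencode 'username=myname' \
       --data-urlencode 'password=p&ssw=rd123' \
       --data-urlencode "csrf=$CSRF" \
       http://example.com/login

Each --data-urlencode option URL-encodes its value, and curl joins them all into one POST body.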

How to authenticate to a site with cookies using Curl

When you log in to a website with your browser, the site keeps considering your browser logged in via session cookies. Each time your browser requests a new page, it sends the saved cookies for that site, including one with a "secret" session ID that shows you've previously authenticated. Sometimes the session ID is in the URL of each request, but that is quite uncommon now. So when using curl to log in to a site, it's important to save your cookies.

Use the -c cookies.txt argument to curl with each request in order to save the cookies to the file cookies.txt. E.g. curl -c cookies.txt http://example.com/login

If the site's CSRF token is in a cookie then you'll need to use the saved cookie jar (send the cookies you saved).

To send the cookies stored in cookies.txt, use -b cookies.txt. You can save and send in the same command with the same file, i.e. -b cookies.txt -c cookies.txt.
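Putting the pieces together, a complete login run might look something like this (the urls, field names, and grep/sed pattern are all assumptions carried over from the examples above; adapt them to your site):

  # 1. Load the login form, saving the cookies and the html
  curl -c cookies.txt -o login_page.html http://example.com/login

  # 2. Extract the CSRF token from the hidden form field (markup assumed)
  CSRF=$(grep -o 'name="csrf" value="[^"]*"' login_page.html | sed 's/.*value="//; s/"$//')

  # 3. Submit the login form, sending the saved cookies (-b) and
  #    updating the cookie jar (-c) with the new session cookie
  curl -b cookies.txt -c cookies.txt \
       -d "username=myname&password=sEcReTwOrD123&csrf=$CSRF" \
       http://example.com/login

  # 4. Request a protected page with the authenticated session
  curl -b cookies.txt http://example.com/members/secret-page.html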

More curl crawling tricks

Now that you are logged in, you can request the pages you want using your saved session cookie. Some web servers, however, will try to block you if they can tell you're using curl. Often all you need to get around this is to send HTTP request headers that imitate a real browser like Chrome.

When you load a page in Chrome with the Developer Tools open to the Network tab, you can inspect the headers sent and received with each request (one page load can involve hundreds of requests). Look at the original request for the page itself (the main url), not all the images or css files. If you click on that resource's name (it should be the first one), you'll see the various headers being sent and received. If you right-click on the name, there's a menu option "Copy as cURL". Use it, and you can retrieve the page on the command line the same way Chrome did.

Within the copied command line will be many '-H ...' arguments. These are the request headers that were sent. For example, you'll see things like -H 'accept-encoding: gzip, deflate, sdch' -H 'accept-language: en-US,en;q=0.8' -H 'upgrade-insecure-requests: 1' -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'. The last one, 'user-agent', says the browser is (a specific version of) Chrome running on (a specific version of) OS X. You can copy that and use it in all your requests to trick the web server into thinking you're running Chrome, not curl.
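For example, to send that user-agent along with your saved session cookies (the user-agent string below is just the sample above; substitute whatever your own browser reports), you could run:

  curl -b cookies.txt \
       -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36' \
       http://example.com/members/secret-page.html

curl also has a dedicated -A option for setting the user-agent, which does the same thing as the -H form.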

You can do all of this curl work in PHP as well, using PHP's cURL extension.