Java: File type of `url.openStream()`

I wrote this method to download a webpage given a URL. It is designed to download HTML only. If I want to add error checking and accept HTML only, how should I do this?

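import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static String download(URL url) throws IOException {
    StringBuilder page = new StringBuilder();
    // try-with-resources closes the reader (and the underlying stream) when done
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n'); // readLine() strips the line terminator
        }
    }
    return page.toString();
}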

Originally I was planning on doing this:

String file = url.getFile();
// Compare the text after the last '.' with "html"
if (file.substring(file.lastIndexOf('.') + 1).equalsIgnoreCase("html")) {
    // do method
}

However the URL: http://www.smu.com returns "" for url.getFile(). Anyone have any suggestions?

,

To test whether you’re getting HTML, use url.openConnection() to get a URLConnection and call getContentType() on it, which should return “text/html” for an HTML page (possibly followed by a charset parameter such as “; charset=UTF-8”). You can then use the connection’s getInputStream() method as a drop-in replacement for url.openStream().
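As a rough sketch of that approach, the method below checks the Content-Type header before reading the body. The class name HtmlDownloader and the helper isHtml are made up for illustration, and the sketch assumes the page is UTF-8 encoded, which a real server may not honor.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class HtmlDownloader {

    // Hypothetical helper: true when the header value names the text/html media type.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.equalsIgnoreCase("text/html");
    }

    static String download(URL url) throws IOException {
        URLConnection connection = url.openConnection();
        // getContentType() connects implicitly and reads the response headers.
        String contentType = connection.getContentType();
        if (!isHtml(contentType)) {
            throw new IOException("Not an HTML page: " + contentType);
        }
        StringBuilder page = new StringBuilder();
        // Assumption: the body is UTF-8; the charset parameter is not inspected here.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(connection.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }
}
```

Throwing an IOException for non-HTML responses is only one option; returning null or an empty Optional would work just as well, depending on how the caller wants to handle it.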

If you actually want to validate that the content the server is sending you is HTML you’d need to find an HTML validation library. I don’t know of one off-hand, sorry.

Something to consider, which may be why www.smu.com returns no data, is that a number of websites serve different data depending on the User-Agent string sent with the HTTP request. You may need to set it on your URLConnection with URLConnection.addRequestProperty(“User-Agent”, …). For more information, see: Setting user agent of a java URLConnection
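A minimal sketch of setting the header might look like the following. The class name and the User-Agent value are placeholders chosen for illustration; which string a given site expects is something you would have to find out by experiment.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class UserAgentExample {

    // Placeholder value, not a string any particular site is known to require.
    static final String USER_AGENT = "Mozilla/5.0 (compatible; ExampleDownloader/1.0)";

    static URLConnection openWithUserAgent(URL url) throws IOException {
        // openConnection() only creates the object; no request is sent yet.
        URLConnection connection = url.openConnection();
        // Request properties must be set before the connection is used.
        connection.addRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }
}
```

The returned connection can then be read with getInputStream() in the same way as before.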

,

If you want to check the content beyond checking the Content-Type header, then you can use an HTML parser such as (the misleadingly named!) JTidy.

,

“http://www.smu.com” sends you the data in “http://www.smu.com/index.html”. This is the (common) behavior of web-servers when “/” is requested (a web-server could also theoretically redirect one with a 302 or whatnot). Checking to see if the URL ends in “.html” is thus entirely silly (not to mention that it could be a “.php”, “.asp” or whatever).

However, a well-behaved web-server serving up HTML should return a Content-Type header of “text/html”. (This of course assumes it is returning HTML rather than XHTML, XML or the like, and that the web-server is not broken.)

You will likely want to use URLConnection. Here is an example of URLConnection with headers.

How did I determine the top bit?

I ran curl -I http://www.smu.com (and with ../index.html) and compared the results. They look like:

HTTP/1.1 200 OK
Date: Tue, 19 Oct 2010 18:01:39 GMT
Server: Apache
Last-Modified: Wed, 27 Jan 2010 20:27:52 GMT
Accept-Ranges: bytes
Content-Length: 2993
Content-Type: text/html
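For comparison, roughly the same check can be done from Java by sending a HEAD request and printing the response headers. This is only a sketch: the class and method names are invented for the example, and it assumes the URL uses http or https so that the connection can be cast to HttpURLConnection.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

class HeadRequestExample {

    // Builds a HEAD request; nothing is sent until a header is read.
    static HttpURLConnection prepareHead(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("HEAD"); // ask for headers only, like curl -I
        return connection;
    }

    // Sends the request and prints the status line and headers.
    static void printHeaders(URL url) throws IOException {
        HttpURLConnection connection = prepareHead(url);
        try {
            for (Map.Entry<String, List<String>> header : connection.getHeaderFields().entrySet()) {
                // The status line is stored under a null key.
                String name = header.getKey();
                String value = String.join(", ", header.getValue());
                System.out.println(name == null ? value : name + ": " + value);
            }
        } finally {
            connection.disconnect();
        }
    }
}
```

Calling HeadRequestExample.printHeaders(new URL("http://www.smu.com")) should list headers similar to the curl output, although the exact set and order depend on the server.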
