I wrote this method to download a webpage given a URL. It is designed to download HTML only. If I want to add error checking and accept only HTML, how should I do this?
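import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public static String download(URL url) throws IOException {
    StringBuilder page = new StringBuilder();
    // try-with-resources closes the stream even if reading fails
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream()))) {
        String line;
        while ((line = reader.readLine()) != null) {
            page.append(line).append('\n'); // readLine() strips the line terminator
        }
    }
    return page.toString();
}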
Originally I was planning on doing this:
String file = url.getFile();
if (file.substring(file.lastIndexOf('.') + 1).equalsIgnoreCase("html")) {
    // do method
}
However, for the URL http://www.smu.com, url.getFile() returns "". Does anyone have any suggestions?
,
To test whether you’re getting HTML, you can use URL.openConnection() to get a URLConnection and then call getContentType(), which should return “text/html” for an HTML page (possibly followed by a parameter such as “; charset=UTF-8”). You can then use the getInputStream() method on the URLConnection as a drop-in replacement for url.openStream().
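A minimal sketch of that check, not a definitive implementation. The class name HtmlDownloader and the helper isHtml are made up for illustration, and decoding the body as UTF-8 is an assumption (a real page may declare a different charset in the header).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;

class HtmlDownloader {
    // Hypothetical helper: true if the Content-Type value names text/html,
    // ignoring case and any parameters such as "; charset=UTF-8".
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type header
        }
        int semi = contentType.indexOf(';');
        String mediaType = (semi >= 0 ? contentType.substring(0, semi) : contentType).trim();
        return mediaType.equalsIgnoreCase("text/html");
    }

    static String download(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        String type = conn.getContentType(); // implicitly connects to read the header
        if (!isHtml(type)) {
            throw new IOException("Expected text/html but got: " + type);
        }
        StringBuilder page = new StringBuilder();
        // Assumption: the body is UTF-8; adjust if the header says otherwise.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }
}
```

Comparing only the part before the first semicolon matters because an exact equals("text/html") would reject a header like "text/html; charset=UTF-8".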
If you actually want to validate that the content the server is sending you is HTML you’d need to find an HTML validation library. I don’t know of one off-hand, sorry.
Something to consider, which may be why www.smu.com returns no data, is that a number of websites serve different data depending on the User-Agent string sent on the HTTP connection. You may need to set that on your URLConnection with URLConnection.addRequestProperty("User-Agent", …); for more information, see “Setting user agent of a java URLConnection”.
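A small sketch of setting the header before connecting. The class and method names are hypothetical, and the agent string in the usage note is just a placeholder. It uses setRequestProperty, which replaces any existing value, rather than addRequestProperty; both are methods of URLConnection and both must be called before the connection is opened.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;

class UserAgentExample {
    // Hypothetical helper: opens a connection (without connecting yet)
    // and sets the User-Agent header that will be sent with the request.
    static URLConnection openWithUserAgent(URL url, String userAgent) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("User-Agent", userAgent);
        return conn;
    }
}
```

A call might look like openWithUserAgent(url, "MyCrawler/1.0"), followed by getContentType() or getInputStream() on the returned connection.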
,
If you want to check the content beyond checking the Content-Type header, then you can use an HTML parser such as (the misleadingly named!) JTidy.
,
“http://www.smu.com” sends you the data in “http://www.smu.com/index.html”. This is the (common) behavior of web-servers when “/” is requested (a web-server could also theoretically redirect one with a 302 or whatnot). Checking to see if the URL ends in “.html” is thus entirely silly (not to mention that it could be a “.php”, “.asp” or whatever).
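To illustrate why the file name is no guide, here is a short sketch of what URL.getFile() yields for a few URLs. The class UrlFileDemo, the helper fileOf and the page.php URL are made up for the example.

```java
import java.net.MalformedURLException;
import java.net.URI;

class UrlFileDemo {
    // Hypothetical helper: returns URL.getFile() for the given URL string,
    // which is the path plus the query string, if any.
    static String fileOf(String spec) throws MalformedURLException {
        return URI.create(spec).toURL().getFile();
    }
}
```

fileOf("http://www.smu.com") gives "", fileOf("http://www.smu.com/index.html") gives "/index.html", and fileOf("http://www.smu.com/page.php?id=1") gives "/page.php?id=1", so an extension test sees nothing useful in the first case and a query string in the last.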
However, a nice web-server serving up HTML should return a Content-Type header of “text/html”. (This is of course assuming it’s returning HTML and not XHTML or XML or whatnot, and that the web-server is not broken.)
You will likely want to use URLConnection. Here is an example of URLConnection with headers.
How did I determine the top bit?
I ran curl -I http://www.smu.com (and with ../index.html) and compared the results. They look like:
HTTP/1.1 200 OK
Date: Tue, 19 Oct 2010 18:01:39 GMT
Server: Apache
Last-Modified: Wed, 27 Jan 2010 20:27:52 GMT
Accept-Ranges: bytes
Content-Length: 2993
Content-Type: text/html
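The same headers-only check can be approximated from Java with a HEAD request, which is what curl -I sends. This is a sketch under assumptions: the class and method names are invented, and the cast assumes an http or https URL.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class HeadRequest {
    // Sends a HEAD request and returns the Content-Type header,
    // or null if the server did not send one.
    // Assumption: url uses http or https, otherwise the cast fails.
    static String contentTypeOf(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD"); // headers only, no body
        try {
            return conn.getContentType();
        } finally {
            conn.disconnect();
        }
    }
}
```

Because no body is transferred, this is a cheap way to decide whether a URL is worth downloading at all.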