Parsing a website's HTML with Java
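I want to parse a simple website and scrape information from it.
I used to parse XML files with DocumentBuilderFactory, and I tried the same approach on an HTML file, but it always gets stuck in an infinite loop.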
    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();
    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;

    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);
    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }
    in.close();
    out.close();

    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);
    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());
What is the problem? Or is there an easier way to scrape data from a website for given HTML tags?
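One possible direction, offered as a minimal sketch rather than a definitive answer: DocumentBuilder is a strict XML parser, so it expects well-formed markup, and real-world HTML (unclosed tags such as <br>, a DOCTYPE that may make the parser try to fetch a remote DTD) often does not qualify. The sketch below instead uses the lenient HTML parser that ships with the JDK (javax.swing.text.html) to collect the text inside a chosen tag. The class name LenientHtmlSketch, the helper textOf, and the sample markup are hypothetical and only for illustration; a dedicated HTML library may well be the more convenient choice in practice.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the original question.
class LenientHtmlSketch {

    /** Collects the text found inside each occurrence of the wanted tag. */
    static List<String> textOf(Reader html, final HTML.Tag wanted) throws IOException {
        final List<String> results = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private int depth = 0;                          // > 0 while inside the wanted tag
            private final StringBuilder current = new StringBuilder();

            @Override
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == wanted) {
                    depth++;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (depth > 0) {
                    current.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag t, int pos) {
                if (t == wanted && depth > 0) {
                    depth--;
                    if (depth == 0) {
                        // Outermost wanted tag closed: store its text and reset the buffer.
                        results.add(current.toString());
                        current.setLength(0);
                    }
                }
            }
        };

        // The third argument tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(html, callback, true);
        return results;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup with an unclosed <br>, which a strict XML parser would reject.
        String sample = "<html><body><h1>Title</h1><br><p>one</p><p>two</p></body></html>";
        System.out.println(textOf(new StringReader(sample), HTML.Tag.P));
    }
}
```

To run this against a live page, the StringReader could be swapped for an InputStreamReader over the URLConnection stream already opened in the code above, which would also avoid the temporary file.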