Parsing a website's HTML with Java

I want to parse a simple website and scrape information from it.

I used to parse XML files with DocumentBuilderFactory, so I tried to parse an HTML file the same way, but the program always ends up stuck in an infinite loop.

    // Download the page and save it line by line to a local file.
    URL url = new URL("http://www.deneme.com");
    URLConnection uc = url.openConnection();
    InputStreamReader input = new InputStreamReader(uc.getInputStream());
    BufferedReader in = new BufferedReader(input);
    String inputLine;
    FileWriter outFile = new FileWriter("orhancan");
    PrintWriter out = new PrintWriter(outFile);
    while ((inputLine = in.readLine()) != null) {
        out.println(inputLine);
    }
    in.close();
    out.close();

    // Parse the saved file with the XML DOM parser; the program never gets past parse().
    File fXmlFile = new File("orhancan");
    DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
    DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
    Document doc = dBuilder.parse(fXmlFile);

    // Count the <body> elements found in the document.
    NodeList prelist = doc.getElementsByTagName("body");
    System.out.println(prelist.getLength());

Where is the problem? Or is there a simpler way to scrape data from a website for a given HTML tag?
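
One possible cause, offered as a guess and not a confirmed diagnosis: when a page declares an XHTML DOCTYPE, the XML parser may try to download the referenced DTD from the network, which can take so long that it looks like an infinite loop. The sketch below assumes that is what happens and answers every external-entity request with an empty source so nothing is fetched. The class name `XhtmlBodyCount`, the helper `countElements`, and the sample page string are hypothetical names made up for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class XhtmlBodyCount {

    // Hypothetical helper: parses well-formed XHTML and counts elements with the given tag name.
    static int countElements(String xhtml, String tagName) {
        try {
            DocumentBuilderFactory dbFactory = DocumentBuilderFactory.newInstance();
            DocumentBuilder dBuilder = dbFactory.newDocumentBuilder();
            // Assumption: the hang comes from fetching the external DTD.
            // Resolve every external entity to an empty document so no network request is made.
            dBuilder.setEntityResolver(
                    (publicId, systemId) -> new InputSource(new StringReader("")));
            Document doc = dBuilder.parse(new InputSource(new StringReader(xhtml)));
            return doc.getElementsByTagName(tagName).getLength();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            // Markup that is not well-formed XML ends up here as a SAXException.
            throw new IllegalStateException("Could not parse input as XML", e);
        }
    }

    public static void main(String[] args) {
        // Sample page invented for this sketch; it is well-formed and declares an external DTD.
        String page = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
                + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
                + "<html><head><title>t</title></head>"
                + "<body><p>hello</p></body></html>";
        System.out.println(countElements(page, "body"));
    }
}
```

This only helps when the page is well-formed XML. Most real-world HTML is not, and the parser will then fail with a `SAXException`, so a dedicated HTML parser such as jsoup is commonly used for scraping instead.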
