[heritrix] Fixing garbled Chinese paths in Heritrix mirrors

Updated: 2017-08-18    Source: 中文酷站


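When Heritrix is used as a web crawler and the crawled documents are stored in mirror mode, any URL or requested file name that contains Chinese characters produces garbled names in the mirror directory path of the downloaded files (see the screenshot below).

[Screenshot: garbled names in the mirror directory path]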

Before fixing the problem, it helps to look at why the garbling occurs.
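The post does not spell out the cause. A common one, assumed here rather than confirmed for Heritrix, is a charset mismatch: bytes written in one encoding are read back in another. The sketch below illustrates this with UTF-8 and ISO-8859-1, chosen only because every Java platform is guaranteed to support them; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Heritrix.
final class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with the wrong
    // charset (ISO-8859-1), which maps every byte to one character.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String name = "\u4E2D\u6587"; // two Chinese characters
        String garbled = misdecode(name);

        System.out.println(name.length());        // prints 2
        System.out.println(garbled.length());     // prints 6 (one char per UTF-8 byte)
        System.out.println(garbled.equals(name)); // prints false
    }
}
```

Each Chinese character occupies three bytes in UTF-8, so the two-character name turns into six unrelated Latin-1 characters, which is the kind of garbling that then ends up in a directory name.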

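The fix is to encode the path name at the point where the path is created. The relevant code is in the org.archive.crawler.writer.MirrorWriterProcessor class, in the LumpyString constructor (the listing below has no return type and assigns fields, so it is a constructor rather than an ordinary method).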
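"Encoding the path name" can be done in more than one way. The listing in this post re-decodes characters as GB2312. A different technique, shown here only as a hedged alternative and not as the author's method, is to percent-escape every non-ASCII character as its UTF-8 bytes so that the directory name is pure ASCII. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Heritrix.
final class PathSegmentEscaper {

    // Replaces each non-ASCII code point with %XX escapes of its UTF-8 bytes.
    // ASCII characters, including existing '%' signs, are copied unchanged.
    static String escapeNonAscii(String segment) {
        StringBuilder out = new StringBuilder(segment.length() * 3);
        int i = 0;
        while (i < segment.length()) {
            int cp = segment.codePointAt(i);
            int len = Character.charCount(cp); // 2 for surrogate pairs
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] utf8 = segment.substring(i, i + len)
                                     .getBytes(StandardCharsets.UTF_8);
                for (byte b : utf8) {
                    out.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints %E4%B8%AD%E6%96%87.html
        System.out.println(escapeNonAscii("\u4E2D\u6587.html"));
    }
}
```

The trade-off is readability: the mirror directory no longer shows Chinese names, but the result does not depend on the file system or platform charset.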

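To leave the original source untouched, I did not modify the existing code. Instead I extended Heritrix with a new class, org.archive.crawler.writer.MirrorWriterForWenwuchinaProcessor, copied all of the code from org.archive.crawler.writer.MirrorWriterProcessor into it, and made the necessary changes to LumpyString. The original post highlighted the modified part in red; in the listing it is the try/catch block that re-decodes each character as GB2312.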

The code is as follows:

LumpyString(String str, int beginIndex, int endIndex, int padding,
                     int maxLen, Map characterMap, String dotBegin)  {
             if (beginIndex < 0) {
                 throw new IllegalArgumentException("beginIndex < 0: "
                                                    + beginIndex);
             }
             if (endIndex < beginIndex) {
                 throw new IllegalArgumentException("endIndex < beginIndex "
                     + "beginIndex: " + beginIndex + " endIndex: " + endIndex);
             }
             if (padding < 0) {
                 throw new IllegalArgumentException("padding < 0: " + padding);
             }
             if (maxLen < 1) {
                 throw new IllegalArgumentException("maxLen < 1: " + maxLen);
             }
             if (null == characterMap) {
                 throw new IllegalArgumentException("characterMap null");
             }
             if ((null != dotBegin) && (0 == dotBegin.length())) {
                 throw new IllegalArgumentException("dotBegin empty");
             }
 
             // Initial capacity.  Leave some room for %XX lumps.
             // Guaranteed positive.
             int cap = Math.min(2 * (endIndex - beginIndex) + padding + 1,
                                maxLen);
             string = new StringBuffer(cap);
             aux = new byte[cap];
             for (int i = beginIndex; i != endIndex; ++i) {
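                 String s = str.substring(i, i + 1);
                 // Modified part: re-decode the character as GB2312.
                 // getBytes() without an argument uses the platform default
                 // charset, so the result depends on the JVM's settings.
                 // Needs: import java.io.UnsupportedEncodingException;
                 try {
                     s = new String(s.getBytes(), "GB2312");
                 } catch (UnsupportedEncodingException e) {
                     e.printStackTrace();
                 }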
                 String lump; // Next lump.
                 if (".".equals(s) && (i == beginIndex) && (null != dotBegin)) {
                     lump = dotBegin;
                 } else {
                     lump = (String) characterMap.get(s);
                 }
                 if (null == lump) {
                     if ("%".equals(s) && ((endIndex - i) > 2)
                             && (-1 != Character.digit(str.charAt(i + 1), 16))
                             && (-1 != Character.digit(str.charAt(i + 2), 16))) {
 
                         // %XX escape; treat as one lump.
                         lump = str.substring(i, i + 3);
                         i += 2;
                     } else {
                         lump = s;
                     }
                 }
                 if ((string.length() + lump.length()) > maxLen) {
                     assert checkInvariants();
                     return;
                 }
                 append(lump);
             }
             assert checkInvariants();
         }
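The modified lines in the listing call getBytes() without a charset, so their effect depends on the platform default. A minimal sketch of the same re-decoding idea with both charsets named explicitly follows. It uses UTF-8 and ISO-8859-1 in place of GB2312 only because they are guaranteed to exist on every Java platform; the class and method names are hypothetical, and which charset pair applies to a given crawl is an assumption to verify.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Heritrix.
final class Redecoder {

    // Turns the garbled text back into the bytes it was wrongly decoded
    // from, then decodes those bytes with the charset they were written in.
    static String redecode(String garbled, Charset wrong, Charset actual) {
        return new String(garbled.getBytes(wrong), actual);
    }

    public static void main(String[] args) {
        String original = "\u4E2D\u6587"; // two Chinese characters
        String garbled = new String(
                original.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        // Repairing the whole string at once recovers the original.
        String repaired = redecode(garbled,
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(repaired.equals(original)); // prints true

        // Repairing a single garbled char does not: it holds only one of
        // the three bytes that make up the first Chinese character.
        String oneChar = redecode(garbled.substring(0, 1),
                StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println(oneChar.equals(original.substring(0, 1))); // prints false
    }
}
```

The second check matters for a loop that converts one character at a time, as the listing does: if the input was already split into one character per byte, per-character conversion cannot reassemble it, so it is worth testing the change against real crawl data.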

Source: http://www.bbyears.com/kuzhan/34921.html
