"""Document parsing utilities: load PDF/DOCX/PPTX/HTML/plain-text files via
LangChain loaders and clean the extracted text before counting tokens."""
import re

from agent.utils.tokenization_qwen import count_tokens


def rm_newlines(text):
    # Join lines that were hard-wrapped mid-sentence: a newline is replaced
    # with a space unless the preceding character ends a sentence or clause
    # (period or colon, ASCII or CJK).
    text = re.sub(r'(?<=[^\.。::])\n', ' ', text)
    return text


def rm_cid(text):
    # Strip "(cid:NNN)" placeholders that PDFMiner emits for unmapped glyphs.
    text = re.sub(r'\(cid:\d+\)', '', text)
    return text


def rm_hexadecimal(text):
    # Drop long hexadecimal runs (21+ characters), which are likely embedded
    # binary/object noise from extraction rather than real prose.
    text = re.sub(r'[0-9A-Fa-f]{21,}', '', text)
    return text


def deal(text):
    # Apply all cleaning steps to one page of extracted text.
    text = rm_newlines(text)
    text = rm_cid(text)
    text = rm_hexadecimal(text)
    return text


def parse_doc(path):
    # Pick a loader based on the file extension; loaders are imported lazily
    # so optional dependencies are only required when actually used.
    if path.lower().endswith('.pdf'):
        from langchain.document_loaders import PDFMinerLoader
        loader = PDFMinerLoader(path)
    elif path.lower().endswith('.docx'):
        from langchain.document_loaders import Docx2txtLoader
        loader = Docx2txtLoader(path)
    elif path.lower().endswith('.pptx'):
        from langchain.document_loaders import UnstructuredPowerPointLoader
        loader = UnstructuredPowerPointLoader(path)
    else:
        from langchain.document_loaders import UnstructuredFileLoader
        loader = UnstructuredFileLoader(path)
    pages = loader.load_and_split()

    res = []
    for page in pages:
        cleaned_page_content = deal(page.page_content)
        res.append({
            'page_content': cleaned_page_content,
            'token': count_tokens(cleaned_page_content),
            'metadata': page.metadata
        })

    return res


def pre_process_html(s):
    # Collapse runs of blank lines into a single newline.
    s = re.sub('\n+', '\n', s)
    # Drop site-specific boilerplate injected into saved pages.
    s = s.replace("Add to Qwen's Reading List", '')
    return s


def parse_html_bs(path):
    from langchain.document_loaders import BSHTMLLoader

    loader = BSHTMLLoader(path, open_encoding='utf-8')
    pages = loader.load_and_split()
    res = []
    for page in pages:
        cleaned_page_content = pre_process_html(page.page_content)
        res.append({
            'page_content': cleaned_page_content,
            'token': count_tokens(cleaned_page_content),
            'metadata': page.metadata
        })

    return res
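

if __name__ == '__main__':
    # Minimal usage sketch. The file paths below are hypothetical; substitute
    # real documents. Each parser returns a list of dicts with 'page_content'
    # (cleaned text), 'token' (token count), and 'metadata' keys.
    for record in parse_doc('example.pdf'):
        print(record['token'], record['metadata'])

    for record in parse_html_bs('example.html'):
        print(record['token'], record['metadata'])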