Arbitration system 3-17 (仲裁系统-3-17)

andrews, 2 months ago
commit
b0049c761a
100 files changed, with 1818 additions and 0 deletions
  1. 8 0
      .idea/.gitignore
  2. 8 0
      .idea/arbitration_system.iml
  3. 12 0
      .idea/dataSources.xml
  4. 6 0
      .idea/inspectionProfiles/Project_Default.xml
  5. 6 0
      .idea/inspectionProfiles/profiles_settings.xml
  6. 4 0
      .idea/misc.xml
  7. 8 0
      .idea/modules.xml
  8. BIN=BIN
      __pycache__/main.cpython-310.pyc
  9. BIN=BIN
      __pycache__/main.cpython-38.pyc
  10. BIN=BIN
      application_extractor/__pycache__/ocr_PP_StructureV3.cpython-310.pyc
  11. BIN=BIN
      application_extractor/__pycache__/rectify_OCR_result.cpython-310.pyc
  12. 72 0
      application_extractor/demo/PP-OCRv5/PP-OCRv5.py
  13. BIN=BIN
      application_extractor/demo/PP-OCRv5/output/F86-ZC1-2023-0001-010_00_0.jpg
  14. BIN=BIN
      application_extractor/demo/PP-OCRv5/output/刘正新-申请书_0.jpg
  15. BIN=BIN
      application_extractor/demo/PP-OCRv5/output/刘正新-申请书续_0.jpg
  16. BIN=BIN
      application_extractor/demo/PP-OCRv5/output/李述花-申请书_0.jpg
  17. 66 0
      application_extractor/demo/PP-StructureV3/PP-StructureV3.py
  18. 52 0
      application_extractor/demo/PP-StructureV3/PP-StructureV3_1.0.py
  19. 4 0
      application_extractor/demo/PP-StructureV3/output/doc_0.md
  20. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/imgs/img_in_table_box_57_183_1074_1525.jpg
  21. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/imgs/img_in_table_box_57_205_3145_2149.jpg
  22. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/layout_det_res_0.jpg
  23. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/layout_order_res_0.jpg
  24. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/overall_ocr_res_0.jpg
  25. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/region_det_res_0.jpg
  26. BIN=BIN
      application_extractor/demo/PP-StructureV3/output/table_cell_img_0.jpg
  27. 66 0
      application_extractor/demo/PaddleOCR-VL-1.5/PaddleOCR-VL-1.5.py
  28. 3 0
      application_extractor/demo/PaddleOCR-VL-1.5/output/doc_0.md
  29. BIN=BIN
      application_extractor/demo/PaddleOCR-VL-1.5/output/layout_det_res_0.jpg
  30. 97 0
      application_extractor/ocr_PP_StructureV3.py
  31. 101 0
      application_extractor/rectify_OCR_result.py
  32. 23 0
      application_extractor/run.py
  33. BIN=BIN
      application_extractor/test/刘正新/刘正新-庭审笔录.docx
  34. BIN=BIN
      application_extractor/test/刘正新/刘正新-申请书.png
  35. BIN=BIN
      application_extractor/test/刘正新/刘正新-申请书续.png
  36. BIN=BIN
      application_extractor/test/刘正新/刘正新-证据清单.png
  37. BIN=BIN
      application_extractor/test/李述花/李述花-劳动合同书1.png
  38. BIN=BIN
      application_extractor/test/李述花/李述花-劳动合同书2.png
  39. BIN=BIN
      application_extractor/test/李述花/李述花-劳动合同书3.png
  40. BIN=BIN
      application_extractor/test/李述花/李述花-劳动合同书4.png
  41. BIN=BIN
      application_extractor/test/李述花/李述花-庭审笔录.docx
  42. BIN=BIN
      application_extractor/test/李述花/李述花-申请书.png
  43. BIN=BIN
      application_extractor/test/李述花/李述花-证据清单.png
  44. BIN=BIN
      application_extractor/test/许泽用/许泽用-工伤认定书.png
  45. BIN=BIN
      application_extractor/test/许泽用/许泽用-庭审笔录.docx
  46. BIN=BIN
      application_extractor/test/许泽用/许泽用-申请书.png
  47. BIN=BIN
      application_extractor/test/许泽用/许泽用-申请书续.png
  48. BIN=BIN
      application_extractor/test/许泽用/许泽用-证据清单.png
  49. 1 0
      backend/__init__.py
  50. BIN=BIN
      backend/__pycache__/__init__.cpython-310.pyc
  51. BIN=BIN
      backend/__pycache__/api.cpython-310.pyc
  52. BIN=BIN
      backend/__pycache__/api.cpython-38.pyc
  53. BIN=BIN
      backend/__pycache__/db.cpython-310.pyc
  54. BIN=BIN
      backend/__pycache__/db.cpython-38.pyc
  55. BIN=BIN
      backend/__pycache__/embedding.cpython-310.pyc
  56. BIN=BIN
      backend/__pycache__/embedding.cpython-38.pyc
  57. BIN=BIN
      backend/__pycache__/services.cpython-310.pyc
  58. BIN=BIN
      backend/__pycache__/services.cpython-38.pyc
  59. BIN=BIN
      backend/__pycache__/text_utils.cpython-310.pyc
  60. BIN=BIN
      backend/__pycache__/text_utils.cpython-38.pyc
  61. 294 0
      backend/api.py
  62. 363 0
      backend/db.py
  63. 33 0
      backend/embedding.py
  64. 347 0
      backend/services.py
  65. 80 0
      backend/text_utils.py
  66. BIN=BIN
      config/__pycache__/config.cpython-310.pyc
  67. 7 0
      config/config.py
  68. BIN=BIN
      evidence_extractor/__pycache__/ocr_paddle_ocr_vl.cpython-310.pyc
  69. 42 0
      evidence_extractor/demo/kaoqinbiao_ocr.py
  70. 100 0
      evidence_extractor/ocr_paddle_ocr_vl.py
  71. 15 0
      evidence_extractor/run.py
  72. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/其他文字材料/F86-ZC1-2023-0001-009_18.png
  73. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/其他文字材料/F86-ZC1-2023-0001-009_19.png
  74. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_01.png
  75. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_02.png
  76. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_23.png
  77. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_24.png
  78. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_25.png
  79. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_26.png
  80. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_27.png
  81. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-009_20.png
  82. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-009_28.png
  83. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-010_06.png
  84. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_00.png
  85. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_01.png
  86. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_02.png
  87. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_03.png
  88. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_04.png
  89. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_05.png
  90. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/解除劳动关系相关材料/F86-ZC1-2023-0001-009_03.png
  91. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_04.png
  92. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_05.png
  93. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_21.png
  94. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/证据清单/F86-ZC1-2023-0001-009_00.png
  95. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/证据清单/F86-ZC1-2023-0001-009_22.png
  96. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_06.png
  97. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_07.png
  98. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_08.png
  99. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_09.png
  100. BIN=BIN
      evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_10.png

+ 8 - 0
.idea/.gitignore

@@ -0,0 +1,8 @@
+# Default ignored files
+/shelf/
+/workspace.xml
+# Editor-based HTTP Client requests
+/httpRequests/
+# Datasource local storage ignored files
+/dataSources/
+/dataSources.local.xml

+ 8 - 0
.idea/arbitration_system.iml

@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<module type="PYTHON_MODULE" version="4">
+  <component name="NewModuleRootManager">
+    <content url="file://$MODULE_DIR$" />
+    <orderEntry type="jdk" jdkName="Poetry (arbitration_system)" jdkType="Python SDK" />
+    <orderEntry type="sourceFolder" forTests="false" />
+  </component>
+</module>

+ 12 - 0
.idea/dataSources.xml

@@ -0,0 +1,12 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="DataSourceManagerImpl" format="xml" multifile-model="true">
+    <data-source source="LOCAL" name="@localhost" uuid="ad02b210-e53c-4e3e-9a78-254fee01ba6c">
+      <driver-ref>mysql.8</driver-ref>
+      <synchronize>true</synchronize>
+      <jdbc-driver>com.mysql.cj.jdbc.Driver</jdbc-driver>
+      <jdbc-url>jdbc:mysql://localhost:3306</jdbc-url>
+      <working-dir>$ProjectFileDir$</working-dir>
+    </data-source>
+  </component>
+</project>

+ 6 - 0
.idea/inspectionProfiles/Project_Default.xml

@@ -0,0 +1,6 @@
+<component name="InspectionProjectProfileManager">
+  <profile version="1.0">
+    <option name="myName" value="Project Default" />
+    <inspection_tool class="JupyterPackageInspection" enabled="false" level="WARNING" enabled_by_default="false" />
+  </profile>
+</component>

+ 6 - 0
.idea/inspectionProfiles/profiles_settings.xml

@@ -0,0 +1,6 @@
+<component name="InspectionProjectProfileManager">
+  <settings>
+    <option name="USE_PROJECT_PROFILE" value="false" />
+    <version value="1.0" />
+  </settings>
+</component>

+ 4 - 0
.idea/misc.xml

@@ -0,0 +1,4 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectRootManager" version="2" project-jdk-name="Poetry (arbitration_system)" project-jdk-type="Python SDK" />
+</project>

+ 8 - 0
.idea/modules.xml

@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<project version="4">
+  <component name="ProjectModuleManager">
+    <modules>
+      <module fileurl="file://$PROJECT_DIR$/.idea/arbitration_system.iml" filepath="$PROJECT_DIR$/.idea/arbitration_system.iml" />
+    </modules>
+  </component>
+</project>

BIN=BIN
__pycache__/main.cpython-310.pyc


BIN=BIN
__pycache__/main.cpython-38.pyc


BIN=BIN
application_extractor/__pycache__/ocr_PP_StructureV3.cpython-310.pyc


BIN=BIN
application_extractor/__pycache__/rectify_OCR_result.cpython-310.pyc


+ 72 - 0
application_extractor/demo/PP-OCRv5/PP-OCRv5.py

@@ -0,0 +1,72 @@
+import os
+import base64
+import requests
+
+API_URL = "https://c8pdr6l6q4eal165.aistudio-app.com/ocr"
+TOKEN = "16455708d55afac2f074f4ae5a88fc6c7bae7920"
+
+file_path =     "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\考勤表\\F86-ZC1-2023-0001-010_00.png"
+input_filename = os.path.splitext(os.path.basename(file_path))[0]
+
+with open(file_path, "rb") as file:
+    file_bytes = file.read()
+    file_data = base64.b64encode(file_bytes).decode("ascii")
+
+headers = {
+    "Authorization": f"token {TOKEN}",
+    "Content-Type": "application/json"
+}
+
+required_payload = {
+    "file": file_data,
+    "fileType": 1,
+}
+
+optional_payload = {
+    "useDocOrientationClassify": False,
+    "useDocUnwarping": False,
+    "useTextlineOrientation": False,
+}
+
+payload = {**required_payload, **optional_payload}
+
+response = requests.post(API_URL, json=payload, headers=headers)
+
+assert response.status_code == 200
+result = response.json()["result"]
+
+os.makedirs("output", exist_ok=True)
+
+# Read and print the recognized text
+if "rec_texts" in result:
+    print("=== 识别文本内容 ===")
+    for i, text in enumerate(result["rec_texts"]):
+        print(f"{i + 1:2d}: {text}")
+
+    # Save the recognized text to a file
+    text_filename = f"output/{input_filename}_text.txt"
+    with open(text_filename, "w", encoding="utf-8") as f:
+        for text in result["rec_texts"]:
+            f.write(text + "\n")
+    print(f"\n文本已保存到: {text_filename}")
+
+# Per-page results, for inputs with multiple pages
+for i, res in enumerate(result.get("ocrResults", [])):
+    if "prunedResult" in res:
+        pruned_result = res["prunedResult"]
+        if "rec_texts" in pruned_result:
+            print(f"\n=== 页面 {i + 1} 的识别文本 ===")
+            for j, text in enumerate(pruned_result["rec_texts"]):
+                print(f"  行 {j + 1}: {text}")
+
+    # Download the OCR visualization image, if one is provided
+    if "ocrImage" in res:
+        image_url = res["ocrImage"]
+        img_response = requests.get(image_url)
+        if img_response.status_code == 200:
+            filename = f"output/{input_filename}_{i}.jpg"
+            with open(filename, "wb") as f:
+                f.write(img_response.content)
+            print(f"图片已保存到: {filename}")
+        else:
+            print(f"图片下载失败,状态码: {img_response.status_code}")
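
Note on the request body used by this demo script: a minimal sketch, assuming only what the script itself shows (a base64-encoded `file`, a `fileType`, and optional boolean flags), of how the payload and headers could be assembled by small helpers. `build_ocr_payload`, `build_headers`, and the `PADDLE_TOKEN` environment variable name are hypothetical names introduced for illustration; reading the token from the environment instead of the source file is a suggestion, not something this commit does.

```python
import base64
import os


def build_ocr_payload(file_bytes: bytes, file_type: int = 1, **options: bool) -> dict:
    """Build the JSON body: base64-encoded file, file type, optional flags."""
    payload = {
        "file": base64.b64encode(file_bytes).decode("ascii"),
        # Per the comments in the demo scripts: 0 for PDF documents, 1 for images.
        "fileType": file_type,
    }
    payload.update(options)
    return payload


def build_headers(token_env: str = "PADDLE_TOKEN") -> dict:
    """Build request headers; the environment variable name is an assumption."""
    token = os.environ.get(token_env, "")
    return {"Authorization": f"token {token}", "Content-Type": "application/json"}
```

With these helpers the script body would reduce to `requests.post(API_URL, json=build_ocr_payload(file_bytes, useDocUnwarping=False), headers=build_headers())`, and the token would no longer need to appear in the repository.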

BIN=BIN
application_extractor/demo/PP-OCRv5/output/F86-ZC1-2023-0001-010_00_0.jpg


BIN=BIN
application_extractor/demo/PP-OCRv5/output/刘正新-申请书_0.jpg


BIN=BIN
application_extractor/demo/PP-OCRv5/output/刘正新-申请书续_0.jpg


BIN=BIN
application_extractor/demo/PP-OCRv5/output/李述花-申请书_0.jpg


+ 66 - 0
application_extractor/demo/PP-StructureV3/PP-StructureV3.py

@@ -0,0 +1,66 @@
+# Please make sure the requests library is installed
+# pip install requests
+import base64
+import os
+import requests
+
+API_URL = "https://q2z8becfm967o4y7.aistudio-app.com/layout-parsing"
+TOKEN = "16455708d55afac2f074f4ae5a88fc6c7bae7920"
+
+file_path =     "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\考勤表\\F86-ZC1-2023-0001-010_00.png"
+
+with open(file_path, "rb") as file:
+    file_bytes = file.read()
+    file_data = base64.b64encode(file_bytes).decode("ascii")
+
+headers = {
+    "Authorization": f"token {TOKEN}",
+    "Content-Type": "application/json"
+}
+
+required_payload = {
+    "file": file_data,
+    "fileType": 1,  # For PDF documents, set `fileType` to 0; for images, set `fileType` to 1
+}
+
+optional_payload = {
+    "useDocOrientationClassify": False,
+    "useDocUnwarping": False,
+    "useTextlineOrientation": False,
+    "useChartRecognition": False,
+}
+
+payload = {**required_payload, **optional_payload}
+
+response = requests.post(API_URL, json=payload, headers=headers)
+print(response.status_code)
+assert response.status_code == 200
+result = response.json()["result"]
+print(result["layoutParsingResults"])
+
+
+output_dir = "output"
+os.makedirs(output_dir, exist_ok=True)
+
+for i, res in enumerate(result["layoutParsingResults"]):
+    md_filename = os.path.join(output_dir, f"doc_{i}.md")
+    with open(md_filename, "w", encoding="utf-8") as md_file:
+        md_file.write(res["markdown"]["text"])
+    print(f"Markdown document saved at {md_filename}")
+    for img_path, img in res["markdown"]["images"].items():
+        full_img_path = os.path.join(output_dir, img_path)
+        os.makedirs(os.path.dirname(full_img_path), exist_ok=True)
+        img_bytes = requests.get(img).content
+        with open(full_img_path, "wb") as img_file:
+            img_file.write(img_bytes)
+        print(f"Image saved to: {full_img_path}")
+    for img_name, img in res["outputImages"].items():
+        img_response = requests.get(img)
+        if img_response.status_code == 200:
+            # Save image to local
+            filename = os.path.join(output_dir, f"{img_name}_{i}.jpg")
+            with open(filename, "wb") as f:
+                f.write(img_response.content)
+            print(f"Image saved to: {filename}")
+        else:
+            print(f"Failed to download image, status code: {img_response.status_code}")

+ 52 - 0
application_extractor/demo/PP-StructureV3/PP-StructureV3_1.0.py

@@ -0,0 +1,52 @@
+import os
+import re
+import base64
+import requests
+import config
+
+
+API_URL = "https://q2z8becfm967o4y7.aistudio-app.com/layout-parsing"
+TOKEN = "16455708d55afac2f074f4ae5a88fc6c7bae7920"
+
+file_path = "E:\\project\\arbitration_system\\application_extractor\\test\\刘正新\\刘正新-申请书.png"
+input_filename = os.path.splitext(os.path.basename(file_path))[0]
+
+with open(file_path, "rb") as file:
+    file_bytes = file.read()
+    file_data = base64.b64encode(file_bytes).decode("ascii")
+
+headers = {
+    "Authorization": f"token {TOKEN}",
+    "Content-Type": "application/json"
+}
+
+required_payload = {
+    "file": file_data,
+    "fileType": 1,
+}
+
+optional_payload = {
+    "useDocOrientationClassify": False,
+    "useDocUnwarping": False,
+    "useTextlineOrientation": False,
+}
+
+payload = {**required_payload, **optional_payload}
+
+response = requests.post(API_URL, json=payload, headers=headers)
+
+assert response.status_code == 200
+result = response.json()["result"]
+
+os.makedirs("../PP-OCRv5/output", exist_ok=True)
+
+# Per-page results, for inputs with multiple pages
+for i, res in enumerate(result.get("ocrResults", [])):
+    if "prunedResult" in res:
+        pruned_result = res["prunedResult"]
+        if "rec_texts" in pruned_result:
+            print(f"\n=== 页面 {i + 1} 的识别文本 ===")
+            result_text = ""
+            for j, text in enumerate(pruned_result["rec_texts"]):
+                result_text = result_text+"\n"+text
+            print(result_text)

File diff suppressed because it is too large
+ 4 - 0
application_extractor/demo/PP-StructureV3/output/doc_0.md


BIN=BIN
application_extractor/demo/PP-StructureV3/output/imgs/img_in_table_box_57_183_1074_1525.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/imgs/img_in_table_box_57_205_3145_2149.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/layout_det_res_0.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/layout_order_res_0.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/overall_ocr_res_0.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/region_det_res_0.jpg


BIN=BIN
application_extractor/demo/PP-StructureV3/output/table_cell_img_0.jpg


+ 66 - 0
application_extractor/demo/PaddleOCR-VL-1.5/PaddleOCR-VL-1.5.py

@@ -0,0 +1,66 @@
+# Please make sure the requests library is installed
+# pip install requests
+import base64
+import os
+import requests
+
+API_URL = "https://q8d4u1u6c45dn7pd.aistudio-app.com/layout-parsing"
+TOKEN = "16455708d55afac2f074f4ae5a88fc6c7bae7920"
+
+file_path = "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\考勤表\\F86-ZC1-2023-0001-010_00.png"
+
+with open(file_path, "rb") as file:
+    file_bytes = file.read()
+    file_data = base64.b64encode(file_bytes).decode("ascii")
+
+headers = {
+    "Authorization": f"token {TOKEN}",
+    "Content-Type": "application/json"
+}
+
+required_payload = {
+    "file": file_data,
+    "fileType": 1,  # For PDF documents, set `fileType` to 0; for images, set `fileType` to 1
+}
+
+optional_payload = {
+    "useDocOrientationClassify": False,
+    "useDocUnwarping": False,
+    "useChartRecognition": False,
+}
+
+payload = {**required_payload, **optional_payload}
+
+response = requests.post(API_URL, json=payload, headers=headers)
+print(response.status_code)
+assert response.status_code == 200
+result = response.json()["result"]
+
+print("=== raw result ===")
+print(result)
+
+output_dir = "output"
+os.makedirs(output_dir, exist_ok=True)
+
+for i, res in enumerate(result["layoutParsingResults"]):
+    md_filename = os.path.join(output_dir, f"doc_{i}.md")
+    with open(md_filename, "w", encoding="utf-8") as md_file:
+        md_file.write(res["markdown"]["text"])
+    print(f"Markdown document saved at {md_filename}")
+    for img_path, img in res["markdown"]["images"].items():
+        full_img_path = os.path.join(output_dir, img_path)
+        os.makedirs(os.path.dirname(full_img_path), exist_ok=True)
+        img_bytes = requests.get(img).content
+        with open(full_img_path, "wb") as img_file:
+            img_file.write(img_bytes)
+        print(f"Image saved to: {full_img_path}")
+    for img_name, img in res["outputImages"].items():
+        img_response = requests.get(img)
+        if img_response.status_code == 200:
+            # Save image to local
+            filename = os.path.join(output_dir, f"{img_name}_{i}.jpg")
+            with open(filename, "wb") as f:
+                f.write(img_response.content)
+            print(f"Image saved to: {filename}")
+        else:
+            print(f"Failed to download image, status code: {img_response.status_code}")

File diff suppressed because it is too large
+ 3 - 0
application_extractor/demo/PaddleOCR-VL-1.5/output/doc_0.md


BIN=BIN
application_extractor/demo/PaddleOCR-VL-1.5/output/layout_det_res_0.jpg


+ 97 - 0
application_extractor/ocr_PP_StructureV3.py

@@ -0,0 +1,97 @@
+import os
+import base64
+import requests
+from typing import List, Union
+
+import config.config
+
+
+class LayoutParserClient_application:
+    def __init__(self, api_url: str = None, token: str = None):
+        self.api_url = api_url or "https://q2z8becfm967o4y7.aistudio-app.com/layout-parsing"
+        self.token = token or config.config.PADDLE_TOKEN
+        self.headers = {
+            "Authorization": f"token {self.token}",
+            "Content-Type": "application/json"
+        }
+
+    def _encode_image(self, file_path: str) -> str:
+        """Read an image file and return its contents base64-encoded."""
+        with open(file_path, "rb") as file:
+            return base64.b64encode(file.read()).decode("ascii")
+
+    def _process_single_file(self, file_path: str) -> str:
+        """Process a single image and return the recognized text."""
+        file_data = self._encode_image(file_path)
+
+        payload = {
+            "file": file_data,
+            "fileType": 1,
+            "useDocOrientationClassify": False,
+            "useDocUnwarping": False,
+            "useTextlineOrientation": False,
+        }
+
+        try:
+            response = requests.post(self.api_url, json=payload, headers=self.headers)
+            response.raise_for_status()  # raise on an HTTP error status
+
+            result = response.json().get("result", {})
+            full_text = []
+
+            # Collect the recognized lines from each OCR result
+            for res in result.get("ocrResults", []):
+                pruned = res.get("prunedResult", {})
+                rec_texts = pruned.get("rec_texts", [])
+                if rec_texts:
+                    full_text.extend(rec_texts)
+
+            return "\n".join(full_text)
+
+        except Exception as e:
+            print(f"处理文件 {file_path} 时出错: {e}")
+            return ""
+
+    def parse(self, inputs: Union[str, List[str]]) -> str:
+        """
+        Main entry point.
+        :param inputs: a single image path, or a list of image paths
+        :return: the recognized text of all images, concatenated in order
+        """
+        if isinstance(inputs, str):
+            # Wrap a single path in a list so both cases share one code path
+            file_list = [inputs]
+        else:
+            file_list = inputs
+
+
+
+        combined_results = []
+        for file_path in file_list:
+            print(f"正在处理: {os.path.basename(file_path)}...")
+            text = self._process_single_file(file_path)
+            if text:
+                combined_results.append(text)
+
+        # Join the per-image results in order, separated by a "--- Next Page ---" marker
+        return "\n\n--- Next Page ---\n\n".join(combined_results)
+
+
+if __name__ == '__main__':
+    # Instantiate the client
+    client = LayoutParserClient_application()
+
+    # Example 1: process a single image
+    # single_img = "E:\\project\\arbitration_system\\application_extractor\\test\\李述花\\李述花-申请书.png"
+    # result_1 = client.parse(single_img)
+    # print(result_1)
+
+    # Example 2: process several images (results are joined in order)
+    multi_imgs = [
+        "E:\\project\\arbitration_system\\application_extractor\\test\\刘正新\\刘正新-申请书.png",
+        "E:\\project\\arbitration_system\\application_extractor\\test\\刘正新\\刘正新-申请书续.png"
+        # "E:\\project\\arbitration_system\\application_extractor\\test\\李述花\\李述花-申请书.png"
+    ]
+    result_2 = client.parse(multi_imgs)
+    print(result_2)
+
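
Note on the response handling in `_process_single_file` and `parse`: a minimal offline sketch, assuming the response shape that the code above reads (`ocrResults` → `prunedResult` → `rec_texts`), of the same collection and joining logic applied to a hand-written dictionary. The helper names `collect_rec_texts` and `join_pages` and the sample data are hypothetical; the real service response may carry more fields than shown here.

```python
from typing import Any, Dict, List

# Same separator string that parse() uses between images.
PAGE_SEPARATOR = "\n\n--- Next Page ---\n\n"


def collect_rec_texts(result: Dict[str, Any]) -> str:
    """Flatten every rec_texts list found under ocrResults into one text."""
    lines: List[str] = []
    for res in result.get("ocrResults", []):
        lines.extend(res.get("prunedResult", {}).get("rec_texts", []))
    return "\n".join(lines)


def join_pages(page_texts: List[str]) -> str:
    """Join non-empty page texts with the page separator."""
    return PAGE_SEPARATOR.join(text for text in page_texts if text)


# Hand-written stand-in for a service response (illustrative only).
fake_result = {
    "ocrResults": [
        {"prunedResult": {"rec_texts": ["line 1", "line 2"]}},
        {"prunedResult": {}},  # a result without recognized text is skipped
    ]
}

if __name__ == "__main__":
    print(join_pages([collect_rec_texts(fake_result), "", "second page"]))
```

Keeping these two steps as pure functions makes them testable without network access or a token.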

+ 101 - 0
application_extractor/rectify_OCR_result.py

@@ -0,0 +1,101 @@
+from openai import OpenAI
+import config.config
+
+
+class RectifyClient_application:
+    def __init__(self, base_url: str = None, api_key: str = None):
+        self.base_url = base_url or "https://api.deepseek.com"
+        self.api_key = api_key or getattr(config.config, "DEEPSEEK_API", None)
+
+        if not self.api_key:
+            raise ValueError("API Key 缺失,请检查 config.config.DEEPSEEK_API")
+
+        self.client = OpenAI(
+            api_key=self.api_key,
+            base_url=self.base_url
+        )
+
+    def extract_legal_document(self, input_text: str):
+        system_prompt = """你是一个法律文书深度解析专家。请根据文书的语义逻辑,提取以下三个核心功能板块的内容。
+
+任务目标:
+根据语义逻辑提取信息,忽略页码、杂质字符、页眉页脚及“无正文”等标注。
+
+提取逻辑说明:
+1. 当事人信息板块:提取所有参与方的信息(无论其被称为申请人、原告、被申请人、被告还是第三人)。包含姓名/名称、证件号、地址、联系方式等所有原文描述。
+2. 诉求事项板块:提取其要求解决的具体事项(无论标题是“申请事项”、“请求事项”还是“诉讼请求”)。请按原文序号分行罗列。
+3. 事实与理由板块:提取文书中描述背景经过、证据理由的内容。
+   - 截断规则:该部分内容提取完毕后即停止,忽略后续的“此致”、“致某某委员会”、“落款签名”、“日期”以及“附:证据清单”等信息。
+
+核心要求:
+- 数据零偏差:身份证号、电话、统一社会信用代码、金额、日期必须与原文完全一致。
+- 剔除杂质:自动识别并删除OCR产生的乱码(如单独的“考”、“黄”)、页码(第x页)、或无关的地点水印(如文中散落的“苏州”字样)。
+- 保持结构:保留原文的段落感,不要压缩文本。
+- 注意生成的内容不要以markdown形式输出,我要纯文本"""
+
+        messages = [
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": f"请解析以下法律文书:\n{input_text}"}
+        ]
+
+        try:
+            response = self.client.chat.completions.create(
+                model="deepseek-chat",
+                messages=messages,
+                temperature=0.0,  # keep at 0 to favor exact reproduction of numbers
+                stream=False
+            )
+            return response.choices[0].message.content
+        except Exception as e:
+            # Re-raise so the caller (for example the FastAPI layer) can handle the failure
+            raise RuntimeError(f"模型调用失败: {e}") from e
+
+
+if __name__ == '__main__':
+    test_input_3 = """
+    001
+    劳动人事争议仲裁申请书
+    申请人:刘正新,性别:男,民族:汉族,国籍:中国,公民身份证证件号:320911199504094914,
+    出生日期:1995年4月9日,手机号:16761734657,户籍地:江苏省盐城市盐都区郭猛镇育才巷33
+    -2号
+    被申请人:上海沐璨信息科技有限公司,住所地:上海市浦东新区杨南路455号A510室,法定代表
+    人:王需勇,单位电话:17316383135,法定代表人电话:17316383135
+    被申请人:
+    第三人:
+    苏州
+    请求事项:
+    1、请求被申请人向申请人支付2021年11月1日至2021年12月31日期间劳动报酬25000元;
+    苏州
+    事实与理由:
+    申请人于2021年4月21日在南京入职,担任销售主管的岗位,约定底薪6000
+    元,提成另算,2021年7月开始担任苏州城市经理。于12月份监管苏州、南京、
+    杭州,底薪调整为13000元。截止目前,11月薪资尚有11000元未发放,12月薪资
+    14000元未发放,老板2021年12月底以公司经营不善为由关门停业,薪资不予发
+    放。特申请劳动仲裁,要求单位支付剩余工资。(本页无正文)
+    苏州市
+
+    --- Next Page ---
+
+    002
+    (以下无正文)
+    此致
+    苏州市劳动人事争议仲裁委员会
+    刘新
+    申请人:
+    (签名或盖章)
+    2022年1月19日
+    附:本申请书副本1份
+    注:1、申请书应用钢笔、毛笔书写或打印。
+    2、申请书副本份数,应按被申请人、第三人人数提交。
+    3、请求事项应简明扼要地写明具体、明确的要求。
+    4、事实与理由部分页面不够使用时,可用同样大小纸张续加中页
+    5、当事人为自然人的,应写明姓名、性别、民族、户口性质、出生
+    年月日、住址、确认有效的通讯住址和邮编、电话等;当事人为
+    用人单位的,应写明单位名称、性质、住所地、确认有效的通讯
+    地址和邮编、电话、法定代表人(或主要负责人)姓名、职务等。
+    """
+
+    client = RectifyClient_application()
+    result = client.extract_legal_document(test_input_3)
+    print(result)
+
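
Note on the "zero deviation" requirement in the system prompt: the prompt asks the model to reproduce ID numbers, phone numbers, and amounts exactly, but nothing in this file checks that it did. A minimal sketch, assuming a simple rule (every long digit run in the model output must also occur verbatim in the OCR text), of how such a check could look. `digit_runs`, `unmatched_numbers`, and the `min_len` threshold are hypothetical and not part of this commit; numbers that OCR split across lines would be reported as mismatches by this simple version.

```python
import re
from typing import List


def digit_runs(text: str, min_len: int = 6) -> List[str]:
    """Return every run of at least min_len consecutive digits."""
    return re.findall(rf"\d{{{min_len},}}", text)


def unmatched_numbers(source_text: str, extracted_text: str, min_len: int = 6) -> List[str]:
    """List digit runs in extracted_text that do not occur in source_text."""
    return [
        run
        for run in digit_runs(extracted_text, min_len)
        if run not in source_text
    ]


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    ocr_text = "id 123456789012345678 tel 13800000000"
    model_output = "id 123456789012345678 tel 13800000001"
    print(unmatched_numbers(ocr_text, model_output))
```

A non-empty result could be used to flag the extraction for manual review rather than to reject it outright.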

+ 23 - 0
application_extractor/run.py

@@ -0,0 +1,23 @@
+import config.config
+import application_extractor.ocr_PP_StructureV3
+import application_extractor.rectify_OCR_result
+
+multi_imgs = [
+    "E:\\project\\arbitration_system\\application_extractor\\test\\刘正新\\刘正新-申请书.png",
+    "E:\\project\\arbitration_system\\application_extractor\\test\\刘正新\\刘正新-申请书续.png"
+    # "E:\\project\\arbitration_system\\application_extractor\\test\\李述花\\李述花-申请书.png"
+]
+
+def application_extractor_run(multi_imgs):
+
+    application_client_ocr = application_extractor.ocr_PP_StructureV3.LayoutParserClient_application()
+    result_ocr = application_client_ocr.parse(multi_imgs)
+
+    application_client_rectify = application_extractor.rectify_OCR_result.RectifyClient_application()
+    result_rectify = application_client_rectify.extract_legal_document(result_ocr)
+
+
+    return result_rectify
+
+if __name__ == '__main__':
+    print(application_extractor_run(multi_imgs))
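
Note on the hard-coded `E:\\project\\...` paths: a minimal sketch, assuming the `test/<case name>/<case name>-申请书*.png` layout visible in this commit's file list, of how the application-form images could be located relative to a base directory instead of an absolute Windows path. `application_images` is a hypothetical helper, not part of this commit.

```python
from pathlib import Path
from typing import List


def application_images(base_dir: Path, case_name: str) -> List[str]:
    """Return the application-form PNGs of one case folder, sorted by name."""
    case_dir = base_dir / "test" / case_name
    # Matches e.g. "<case>-申请书.png" and "<case>-申请书续.png".
    return sorted(str(p) for p in case_dir.glob(f"{case_name}-申请书*.png"))


if __name__ == "__main__":
    # Resolve relative to this file's folder rather than a fixed drive letter.
    here = Path(__file__).resolve().parent
    print(application_images(here, "刘正新"))
```

Sorting by name places the first page before its continuation page, which matches the order used in `multi_imgs`.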

BIN=BIN
application_extractor/test/刘正新/刘正新-庭审笔录.docx


BIN=BIN
application_extractor/test/刘正新/刘正新-申请书.png


BIN=BIN
application_extractor/test/刘正新/刘正新-申请书续.png


BIN=BIN
application_extractor/test/刘正新/刘正新-证据清单.png


BIN=BIN
application_extractor/test/李述花/李述花-劳动合同书1.png


BIN=BIN
application_extractor/test/李述花/李述花-劳动合同书2.png


BIN=BIN
application_extractor/test/李述花/李述花-劳动合同书3.png


BIN=BIN
application_extractor/test/李述花/李述花-劳动合同书4.png


BIN=BIN
application_extractor/test/李述花/李述花-庭审笔录.docx


BIN=BIN
application_extractor/test/李述花/李述花-申请书.png


BIN=BIN
application_extractor/test/李述花/李述花-证据清单.png


BIN=BIN
application_extractor/test/许泽用/许泽用-工伤认定书.png


BIN=BIN
application_extractor/test/许泽用/许泽用-庭审笔录.docx


BIN=BIN
application_extractor/test/许泽用/许泽用-申请书.png


BIN=BIN
application_extractor/test/许泽用/许泽用-申请书续.png


BIN=BIN
application_extractor/test/许泽用/许泽用-证据清单.png


+ 1 - 0
backend/__init__.py

@@ -0,0 +1 @@
+

BIN=BIN
backend/__pycache__/__init__.cpython-310.pyc


BIN=BIN
backend/__pycache__/api.cpython-310.pyc


BIN=BIN
backend/__pycache__/api.cpython-38.pyc


BIN=BIN
backend/__pycache__/db.cpython-310.pyc


BIN=BIN
backend/__pycache__/db.cpython-38.pyc


BIN=BIN
backend/__pycache__/embedding.cpython-310.pyc


BIN=BIN
backend/__pycache__/embedding.cpython-38.pyc


BIN=BIN
backend/__pycache__/services.cpython-310.pyc


BIN=BIN
backend/__pycache__/services.cpython-38.pyc


BIN=BIN
backend/__pycache__/text_utils.cpython-310.pyc


BIN=BIN
backend/__pycache__/text_utils.cpython-38.pyc


+ 294 - 0
backend/api.py

@@ -0,0 +1,294 @@
+import os
+import shutil
+from uuid import uuid4
+from typing import Dict, Any, List
+
+from fastapi import FastAPI, UploadFile, File, Form
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi.responses import HTMLResponse
+from fastapi.staticfiles import StaticFiles
+
+from backend.db import (
+    init_case_db,
+    upsert_case_management,
+    list_case_management,
+    delete_case_management,
+    store_case_record,
+    update_case_materials,
+    fetch_case_management,
+    fetch_case_record,
+    parse_case_description
+)
+from backend.text_utils import save_uploads, list_files_by_ext
+from backend.services import (
+    extract_application_text,
+    extract_transcript_text,
+    process_case_text_with_evidence,
+    build_case_summary_text
+)
+from tools.documents_extractor import DocumentReader
+from application_extractor.rectify_OCR_result import RectifyClient_application
+from transcript_extractor.rectify_transcript import RectifyClient_transcript
+from law_rag.run import law_rag_run
+
+CASE_CACHE: Dict[str, Dict[str, Any]] = {}
+
+
+def build_file_items(file_paths: List[str], uploads_dir: str) -> List[Dict[str, str]]:
+    items = []
+    for path in file_paths:
+        rel_path = os.path.relpath(path, uploads_dir).replace("\\", "/")
+        items.append({"name": os.path.basename(path), "url": f"/uploads/{rel_path}"})
+    return items
+
+
+def create_app() -> FastAPI:
+    app = FastAPI()
+    app.add_middleware(
+        CORSMiddleware,
+        allow_origins=["*"],
+        allow_credentials=True,
+        allow_methods=["*"],
+        allow_headers=["*"]
+    )
+
+    root_dir = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
+    frontend_dir = os.path.join(root_dir, "frontend")
+    uploads_dir = os.path.join(root_dir, "uploads")
+    os.makedirs(uploads_dir, exist_ok=True)
+
+    if os.path.exists(frontend_dir):
+        app.mount("/assets", StaticFiles(directory=frontend_dir), name="assets")
+
+        @app.get("/", response_class=HTMLResponse)
+        def serve_index():
+            with open(os.path.join(frontend_dir, "index.html"), "r", encoding="utf-8") as f:
+                return f.read()
+
+    app.mount("/uploads", StaticFiles(directory=uploads_dir), name="uploads")
+
+    @app.post("/api/arbitration/submit")
+    async def arbitration_submit(
+        case_id: str = Form(...),
+        case_title: str = Form(""),
+        applicationFile: List[UploadFile] = File(default_factory=list),
+        transcriptFile: List[UploadFile] = File(default_factory=list),
+        evidenceWage: List[UploadFile] = File(default_factory=list),
+        evidenceTerminate: List[UploadFile] = File(default_factory=list),
+        evidenceAttendance: List[UploadFile] = File(default_factory=list),
+        evidenceLabor: List[UploadFile] = File(default_factory=list),
+        evidenceBank: List[UploadFile] = File(default_factory=list),
+        evidenceList: List[UploadFile] = File(default_factory=list),
+        evidenceWitness: List[UploadFile] = File(default_factory=list),
+        evidenceOther: List[UploadFile] = File(default_factory=list)
+    ):
+        init_case_db()
+        case_dir = os.path.join(uploads_dir, case_id)
+        application_dir = os.path.join(case_dir, "申请书")
+        transcript_dir = os.path.join(case_dir, "庭审笔录")
+        evidence_dir = os.path.join(case_dir, "证据")
+
+        application_paths = await save_uploads(applicationFile, application_dir)
+        transcript_paths = await save_uploads(transcriptFile, transcript_dir)
+
+        evidence_map = {
+            "工资单": evidenceWage,
+            "解除劳动关系相关材料": evidenceTerminate,
+            "考勤表": evidenceAttendance,
+            "劳动关系证明材料": evidenceLabor,
+            "银行流水": evidenceBank,
+            "证据清单": evidenceList,
+            "证人证言": evidenceWitness,
+            "其他文字材料": evidenceOther
+        }
+        evidence_files = {}
+        for category, files in evidence_map.items():
+            file_paths = await save_uploads(files, os.path.join(evidence_dir, category))
+            evidence_files[category] = build_file_items(file_paths, uploads_dir)
+
+        application_text = extract_application_text(application_dir)
+        transcript_text = extract_transcript_text(transcript_dir)
+
+        CASE_CACHE[case_id] = {
+            "case_title": case_title,
+            "application_text": application_text,
+            "transcript_text": transcript_text,
+            "application_result": application_text,
+            "transcript_result": transcript_text,
+            "application_files": build_file_items(application_paths, uploads_dir),
+            "transcript_files": build_file_items(transcript_paths, uploads_dir),
+            "evidence_files": evidence_files,
+            "evidence_dir": evidence_dir,
+            "uploads_dir": uploads_dir
+        }
+        upsert_case_management(case_id, case_title or case_id, "", "草稿", "材料提交")
+        update_case_materials(
+            case_id,
+            {
+                "application_result": application_text,
+                "transcript_result": transcript_text,
+                "application_files": build_file_items(application_paths, uploads_dir),
+                "transcript_files": build_file_items(transcript_paths, uploads_dir),
+                "evidence_files": evidence_files
+            }
+        )
+
+        return {
+            "case_id": case_id,
+            "application_result": application_text,
+            "transcript_result": transcript_text,
+            "application_files": build_file_items(application_paths, uploads_dir),
+            "transcript_files": build_file_items(transcript_paths, uploads_dir),
+            "evidence_files": evidence_files
+        }
+
+    @app.post("/api/arbitration/judgement")
+    def arbitration_judgement(payload: Dict[str, Any]):
+        case_id = payload.get("case_id", "")
+        application_result = payload.get("application_result", "")
+        transcript_result = payload.get("transcript_result", "")
+        cache = CASE_CACHE.get(case_id, {})
+        evidence_dir = cache.get("evidence_dir", "")
+        result = process_case_text_with_evidence(case_id, application_result, transcript_result, evidence_dir)
+        cache.update(result)
+        CASE_CACHE[case_id] = cache
+        upsert_case_management(case_id, cache.get("case_title", case_id), "", "草稿", "裁决中")
+        return {
+            "case_id": case_id,
+            "final_decision": result["final_judgement"].get("final_decision", ""),
+            "final_judgement": result["final_judgement"],
+            "similar_cases": result["similar_cases"],
+            "law_results": result["law_results"]
+        }
+
+    @app.get("/api/arbitration/case")
+    def arbitration_case(case_id: str = ""):
+        init_case_db()
+        management = fetch_case_management(case_id)
+        materials = parse_case_description(management.get("description", "")).get("materials", {})
+        record = fetch_case_record(case_id)
+
+        case_dir = os.path.join(uploads_dir, case_id)
+        application_dir = os.path.join(case_dir, "申请书")
+        transcript_dir = os.path.join(case_dir, "庭审笔录")
+        evidence_dir = os.path.join(case_dir, "证据")
+        application_files = list_files_by_ext(application_dir, [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"])
+        transcript_files = list_files_by_ext(transcript_dir, [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"])
+        evidence_files = {}
+        if os.path.exists(evidence_dir):
+            for name in os.listdir(evidence_dir):
+                full_path = os.path.join(evidence_dir, name)
+                if os.path.isdir(full_path):
+                    files = list_files_by_ext(full_path, [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"])
+                    evidence_files[name] = build_file_items(files, uploads_dir)
+
+        data = {
+            "case_id": case_id,
+            "title": management.get("title", ""),
+            "description": management.get("description", ""),
+            "status": management.get("status", ""),
+            "stage": management.get("stage", ""),
+            "application_result": materials.get("application_result", ""),
+            "transcript_result": materials.get("transcript_result", ""),
+            "application_files": materials.get("application_files") or build_file_items(application_files, uploads_dir),
+            "transcript_files": materials.get("transcript_files") or build_file_items(transcript_files, uploads_dir),
+            "evidence_files": materials.get("evidence_files") or evidence_files,
+            "law_results": record.get("law_results", {}),
+            "final_judgement": record.get("final_judgement", {}),
+            "final_decision": record.get("final_judgement", {}).get("final_decision", "")
+        }
+        return {"case_id": case_id, "case": data}
+
+    @app.post("/api/arbitration/confirm")
+    def arbitration_confirm(payload: Dict[str, Any]):
+        case_id = payload.get("case_id", "")
+        final_decision = payload.get("final_decision", "")
+        cache = CASE_CACHE.get(case_id, {})
+        final_judgement = cache.get("final_judgement", {})
+        if isinstance(final_judgement, dict):
+            final_judgement["final_decision"] = final_decision
+        init_case_db()
+        store_case_record(
+            case_id,
+            build_case_summary_text(cache.get("case_profile", {}), cache.get("dispute_points", [])),
+            cache.get("case_profile", {}),
+            cache.get("dispute_points", []),
+            cache.get("law_results", {}),
+            cache.get("evidence_results", {}),
+            final_judgement,
+            cache.get("embedding", [])
+        )
+        upsert_case_management(case_id, cache.get("case_title", case_id), "", "已完成", "裁决完成")
+        return {"case_id": case_id, "status": "stored"}
+
+    @app.get("/api/cases")
+    def list_cases():
+        init_case_db()
+        return {"cases": list_case_management()}
+
+    @app.post("/api/cases")
+    def create_case(payload: Dict[str, Any]):
+        init_case_db()
+        case_id = payload.get("case_id", "")
+        title = payload.get("title", "")
+        description = payload.get("description", "")
+        status = payload.get("status", "草稿")
+        stage = payload.get("stage", "案件管理")
+        return {"case": upsert_case_management(case_id, title, description, status, stage)}
+
+    @app.put("/api/cases")
+    def update_case(payload: Dict[str, Any]):
+        init_case_db()
+        case_id = payload.get("case_id", "")
+        title = payload.get("title", "")
+        description = payload.get("description", "")
+        status = payload.get("status", "草稿")
+        stage = payload.get("stage", "案件管理")
+        return {"case": upsert_case_management(case_id, title, description, status, stage)}
+
+    @app.delete("/api/cases")
+    def delete_case(payload: Dict[str, Any]):
+        init_case_db()
+        case_id = payload.get("case_id", "")
+        delete_case_management(case_id)
+        if case_id in CASE_CACHE:
+            CASE_CACHE.pop(case_id, None)
+        case_dir = os.path.join(uploads_dir, case_id)
+        if case_id and os.path.isdir(case_dir):
+            shutil.rmtree(case_dir, ignore_errors=True)
+        return {"status": "deleted", "case_id": case_id}
+
+    @app.post("/api/tools/application")
+    async def tool_application(files: List[UploadFile] = File(default_factory=list)):
+        tool_id = uuid4().hex
+        tool_dir = os.path.join(uploads_dir, "tools", "application", tool_id)
+        file_paths = await save_uploads(files, tool_dir)
+        reader = DocumentReader()
+        contents = [reader.process_input(path) for path in file_paths]
+        content = "\n\n".join(contents)
+        client = RectifyClient_application()
+        result = client.extract_legal_document(content)
+        return {"result": result, "files": build_file_items(file_paths, uploads_dir)}
+
+    @app.post("/api/tools/transcript")
+    async def tool_transcript(files: List[UploadFile] = File(default_factory=list)):
+        tool_id = uuid4().hex
+        tool_dir = os.path.join(uploads_dir, "tools", "transcript", tool_id)
+        file_paths = await save_uploads(files, tool_dir)
+        reader = DocumentReader()
+        contents = [reader.process_input(path) for path in file_paths]
+        content = "\n\n".join(contents)
+        client = RectifyClient_transcript()
+        result = client.clean_text(content)
+        return {"result": result or "", "files": build_file_items(file_paths, uploads_dir)}
+
+    @app.post("/api/tools/law")
+    def tool_law(payload: Dict[str, Any]):
+        query = (payload.get("query") or "").strip()
+        result = law_rag_run(query, with_score=True) if query else []
+        return {"result": result}
+
+    return app
+
+
+app = create_app()
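
Review note: the endpoints above join `case_id` straight onto `uploads_dir`. An empty id resolves to the uploads root itself, and an id such as `../x` resolves outside it. A minimal sketch of a guard follows. The helper name `resolve_case_dir` is hypothetical and not part of this commit, and rejecting such ids with `ValueError` is an assumption about the desired behaviour.

```python
import os


def resolve_case_dir(uploads_dir: str, case_id: str) -> str:
    """Return the directory for case_id, refusing ids that leave uploads_dir."""
    if not case_id or not case_id.strip():
        raise ValueError("case_id must be non-empty")
    root = os.path.realpath(uploads_dir)
    candidate = os.path.realpath(os.path.join(root, case_id))
    # commonpath equals root only when candidate lies inside root;
    # candidate == root covers ids such as "." that resolve to the root itself.
    if candidate == root or os.path.commonpath([root, candidate]) != root:
        raise ValueError(f"case_id escapes the uploads directory: {case_id!r}")
    return candidate
```

A handler could call this once and translate `ValueError` into an HTTP 400 before touching the filesystem.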

+ 363 - 0
backend/db.py

@@ -0,0 +1,363 @@
+import os
+import json
+from datetime import datetime
+from typing import List, Dict, Any
+
+try:
+    import pymysql
+except Exception as exc:
+    pymysql = None
+    _MYSQL_IMPORT_ERROR = str(exc)
+
+
+def get_mysql_config() -> Dict[str, Any]:
+    return {
+        "host": os.getenv("MYSQL_HOST", "127.0.0.1"),
+        "port": int(os.getenv("MYSQL_PORT", "3306")),
+        "user": os.getenv("MYSQL_USER", "root"),
+        "password": os.getenv("MYSQL_PASSWORD", "123456"),
+        "database": os.getenv("MYSQL_DATABASE", "arbitration_system"),
+        "charset": "utf8mb4"
+    }
+
+
+def get_mysql_connection(use_database: bool = True):
+    if pymysql is None:
+        raise RuntimeError(f"MySQL driver unavailable: {_MYSQL_IMPORT_ERROR}")
+    cfg = get_mysql_config()
+    if not use_database:
+        cfg = {k: v for k, v in cfg.items() if k != "database"}
+    return pymysql.connect(**cfg)
+
+
+def index_exists(cursor, database: str, table: str, index_name: str) -> bool:
+    cursor.execute(
+        """
+        SELECT 1
+        FROM information_schema.statistics
+        WHERE table_schema=%s AND table_name=%s AND index_name=%s
+        LIMIT 1
+        """,
+        (database, table, index_name)
+    )
+    return cursor.fetchone() is not None
+
+
+def init_case_db() -> None:
+    cfg = get_mysql_config()
+    conn = get_mysql_connection(use_database=False)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(f"CREATE DATABASE IF NOT EXISTS `{cfg['database']}` CHARACTER SET utf8mb4")
+        conn.commit()
+    finally:
+        conn.close()
+
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            CREATE TABLE IF NOT EXISTS cases (
+                id BIGINT PRIMARY KEY AUTO_INCREMENT,
+                case_id VARCHAR(255),
+                summary_text TEXT,
+                case_profile_json LONGTEXT,
+                dispute_points_json LONGTEXT,
+                law_results_json LONGTEXT,
+                evidence_results_json LONGTEXT,
+                final_judgement_json LONGTEXT,
+                embedding_json LONGTEXT,
+                created_at DATETIME
+            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
+            """
+        )
+        if not index_exists(cursor, cfg["database"], "cases", "idx_cases_case_id"):
+            cursor.execute("CREATE INDEX idx_cases_case_id ON cases(case_id)")
+        cursor.execute(
+            """
+            CREATE TABLE IF NOT EXISTS case_management (
+                id BIGINT PRIMARY KEY AUTO_INCREMENT,
+                case_id VARCHAR(255) UNIQUE,
+                title VARCHAR(255),
+                description TEXT,
+                status VARCHAR(50),
+                stage VARCHAR(50),
+                updated_at DATETIME
+            ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
+            """
+        )
+        if not index_exists(cursor, cfg["database"], "case_management", "idx_case_management_case_id"):
+            cursor.execute("CREATE INDEX idx_case_management_case_id ON case_management(case_id)")
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def store_case_record(
+    case_id: str,
+    summary_text: str,
+    case_profile: Dict[str, Any],
+    dispute_points: List[str],
+    law_results: Dict[str, Any],
+    evidence_results: Dict[str, Any],
+    final_judgement: Dict[str, Any],
+    embedding: List[float]
+) -> None:
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            INSERT INTO cases (
+                case_id, summary_text, case_profile_json, dispute_points_json,
+                law_results_json, evidence_results_json, final_judgement_json,
+                embedding_json, created_at
+            ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s)
+            """,
+            (
+                case_id,
+                summary_text,
+                json.dumps(case_profile, ensure_ascii=False),
+                json.dumps(dispute_points, ensure_ascii=False),
+                json.dumps(law_results, ensure_ascii=False),
+                json.dumps(evidence_results, ensure_ascii=False),
+                json.dumps(final_judgement, ensure_ascii=False),
+                json.dumps(embedding, ensure_ascii=False),
+                datetime.utcnow()
+            )
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def fetch_similar_cases(embedding: List[float], top_k: int = 3) -> List[Dict[str, Any]]:
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            SELECT case_id, summary_text, final_judgement_json, embedding_json
+            FROM cases
+            WHERE embedding_json IS NOT NULL AND embedding_json != ''
+            """
+        )
+        rows = cursor.fetchall()
+    finally:
+        conn.close()
+
+    candidates = []  # unscored rows; ranking and top-k truncation happen in the caller
+    for case_id, summary_text, final_judgement_json, embedding_json in rows:
+        try:
+            vec = json.loads(embedding_json)
+        except Exception:
+            vec = []
+        candidates.append(
+            {
+                "case_id": case_id,
+                "summary_text": summary_text,
+                "final_judgement_json": final_judgement_json,
+                "embedding": vec
+            }
+        )
+    return candidates
+
+
+def parse_case_description(description: str) -> Dict[str, Any]:
+    if not description or not isinstance(description, str):
+        return {}
+    try:
+        data = json.loads(description)
+        return data if isinstance(data, dict) else {}
+    except Exception:
+        return {}
+
+
+def normalize_case_description(case_id: str, description: str) -> str:
+    if not case_id:
+        return description
+    existing = fetch_case_management(case_id)
+    existing_desc = existing.get("description", "") if existing else ""
+    existing_data = parse_case_description(existing_desc)
+    new_data = parse_case_description(description)
+    if existing_data.get("materials") and not new_data.get("materials"):
+        desc_text = description or existing_data.get("description", "")
+        merged = {"description": desc_text, "materials": existing_data.get("materials", {})}
+        return json.dumps(merged, ensure_ascii=False)
+    return description
+
+
+def upsert_case_management(case_id: str, title: str, description: str, status: str, stage: str) -> Dict[str, Any]:
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        description = normalize_case_description(case_id, description)
+        cursor.execute(
+            """
+            INSERT INTO case_management (case_id, title, description, status, stage, updated_at)
+            VALUES (%s, %s, %s, %s, %s, %s)
+            ON DUPLICATE KEY UPDATE
+                title=VALUES(title),
+                description=VALUES(description),
+                status=VALUES(status),
+                stage=VALUES(stage),
+                updated_at=VALUES(updated_at)
+            """,
+            (case_id, title, description, status, stage, datetime.utcnow())
+        )
+        conn.commit()
+    finally:
+        conn.close()
+    return {"case_id": case_id, "title": title, "description": description, "status": status, "stage": stage}
+
+
+def update_case_materials(case_id: str, materials: Dict[str, Any]) -> None:
+    if not case_id:
+        return
+    existing = fetch_case_management(case_id)
+    existing_desc = existing.get("description", "") if existing else ""
+    existing_data = parse_case_description(existing_desc)
+    description_text = existing_desc if not existing_data else existing_data.get("description", "")
+    payload = {
+        "description": description_text,
+        "materials": materials or {}
+    }
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            UPDATE case_management
+            SET description=%s, updated_at=%s
+            WHERE case_id=%s
+            """,
+            (json.dumps(payload, ensure_ascii=False), datetime.utcnow(), case_id)
+        )
+        conn.commit()
+    finally:
+        conn.close()
+
+
+def list_case_management() -> List[Dict[str, Any]]:
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            SELECT case_id, title, description, status, stage, updated_at
+            FROM case_management
+            ORDER BY updated_at DESC
+            """
+        )
+        rows = cursor.fetchall()
+    finally:
+        conn.close()
+    results = []
+    for case_id, title, description, status, stage, updated_at in rows:
+        results.append(
+            {
+                "case_id": case_id,
+                "title": title,
+                "description": description,
+                "status": status,
+                "stage": stage,
+                "updated_at": updated_at.isoformat() if updated_at else ""
+            }
+        )
+    return results
+
+
+def fetch_case_management(case_id: str) -> Dict[str, Any]:
+    if not case_id:
+        return {}
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            SELECT case_id, title, description, status, stage, updated_at
+            FROM case_management
+            WHERE case_id=%s
+            LIMIT 1
+            """,
+            (case_id,)
+        )
+        row = cursor.fetchone()
+    finally:
+        conn.close()
+    if not row:
+        return {}
+    case_id, title, description, status, stage, updated_at = row
+    return {
+        "case_id": case_id,
+        "title": title,
+        "description": description or "",
+        "status": status,
+        "stage": stage,
+        "updated_at": updated_at.isoformat() if updated_at else ""
+    }
+
+
+def fetch_case_record(case_id: str) -> Dict[str, Any]:
+    if not case_id:
+        return {}
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute(
+            """
+            SELECT summary_text, case_profile_json, dispute_points_json, law_results_json,
+                   evidence_results_json, final_judgement_json, embedding_json, created_at
+            FROM cases
+            WHERE case_id=%s
+            ORDER BY created_at DESC
+            LIMIT 1
+            """,
+            (case_id,)
+        )
+        row = cursor.fetchone()
+    finally:
+        conn.close()
+    if not row:
+        return {}
+    (
+        summary_text,
+        case_profile_json,
+        dispute_points_json,
+        law_results_json,
+        evidence_results_json,
+        final_judgement_json,
+        embedding_json,
+        created_at
+    ) = row
+
+    def parse_json(value: str, fallback):
+        if not value:
+            return fallback
+        try:
+            parsed = json.loads(value)
+            return parsed if parsed is not None else fallback
+        except Exception:
+            return fallback
+
+    return {
+        "summary_text": summary_text or "",
+        "case_profile": parse_json(case_profile_json, {}),
+        "dispute_points": parse_json(dispute_points_json, []),
+        "law_results": parse_json(law_results_json, {}),
+        "evidence_results": parse_json(evidence_results_json, {}),
+        "final_judgement": parse_json(final_judgement_json, {}),
+        "embedding": parse_json(embedding_json, []),
+        "created_at": created_at.isoformat() if created_at else ""
+    }
+
+
+def delete_case_management(case_id: str) -> None:
+    conn = get_mysql_connection(use_database=True)
+    try:
+        cursor = conn.cursor()
+        cursor.execute("DELETE FROM case_management WHERE case_id=%s", (case_id,))
+        conn.commit()
+    finally:
+        conn.close()
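
Review note: `case_management.description` doubles as a JSON envelope of the form `{"description": ..., "materials": ...}`, and `normalize_case_description` keeps previously stored materials when an update arrives without any. The sketch below restates that merge rule as a pure function with no database access, which may make it easier to test. The names `parse_envelope` and `merge_description` are hypothetical and only illustrate the logic.

```python
import json
from typing import Any, Dict


def parse_envelope(description: str) -> Dict[str, Any]:
    """Parse the JSON envelope; anything that is not a JSON object yields {}."""
    try:
        data = json.loads(description) if description else {}
    except (TypeError, ValueError):
        return {}
    return data if isinstance(data, dict) else {}


def merge_description(existing_desc: str, new_desc: str) -> str:
    """Keep stored materials when the incoming description carries none."""
    existing = parse_envelope(existing_desc)
    incoming = parse_envelope(new_desc)
    if existing.get("materials") and not incoming.get("materials"):
        text = new_desc or existing.get("description", "")
        merged = {"description": text, "materials": existing["materials"]}
        return json.dumps(merged, ensure_ascii=False)
    return new_desc
```

For example, merging a stored envelope with the plain text `"new note"` yields an envelope whose `description` is `"new note"` and whose `materials` are unchanged.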

+ 33 - 0
backend/embedding.py

@@ -0,0 +1,33 @@
+from typing import List
+import numpy as np
+from text2vec import SentenceModel
+import config.config
+
+MODEL_PATH = config.config.MODEL_PATH
+_EMBEDDING_MODEL = None
+
+
+def get_embedding_model():
+    global _EMBEDDING_MODEL
+    if _EMBEDDING_MODEL is None:
+        _EMBEDDING_MODEL = SentenceModel(MODEL_PATH, device="cpu")
+    return _EMBEDDING_MODEL
+
+
+def compute_embedding(text: str) -> List[float]:
+    model = get_embedding_model()
+    vector = model.encode(text or "")
+    if hasattr(vector, "tolist"):
+        return vector.tolist()
+    return list(vector)
+
+
+def cosine_similarity(vec_a: List[float], vec_b: List[float]) -> float:
+    if not vec_a or not vec_b or len(vec_a) != len(vec_b):
+        return 0.0
+    a = np.array(vec_a, dtype=float)
+    b = np.array(vec_b, dtype=float)
+    denom = np.linalg.norm(a) * np.linalg.norm(b)
+    if denom == 0:
+        return 0.0
+    return float(np.dot(a, b) / denom)
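
Review note: similar-case lookup in this commit loads every stored embedding and ranks the rows in Python by cosine similarity. The sketch below shows that ranking step in isolation, using only the standard library. The function names are hypothetical, and treating vectors of different lengths as "not similar" is an assumption of this sketch.

```python
import math
from typing import Any, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 for empty, mismatched, or zero-norm vectors."""
    if not a or not b or len(a) != len(b):
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(query: Sequence[float], candidates: List[Dict[str, Any]], top_k: int = 3) -> List[Dict[str, Any]]:
    """Score each candidate against query and keep the top_k positive matches."""
    scored = []
    for item in candidates:
        score = cosine(query, item.get("embedding") or [])
        if score > 0:
            scored.append({"case_id": item.get("case_id"), "similarity": score})
    scored.sort(key=lambda entry: entry["similarity"], reverse=True)
    return scored[:top_k]
```

A full scan like this is linear in the number of stored cases, which is probably fine for a small table. A vector index would be the usual next step if the table grows.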

+ 347 - 0
backend/services.py

@@ -0,0 +1,347 @@
+import os
+import json
+import re
+from typing import List, Dict, Any
+
+from openai import OpenAI
+
+import config.config
+from tools.documents_extractor import DocumentReader
+from application_extractor.ocr_PP_StructureV3 import LayoutParserClient_application
+from application_extractor.rectify_OCR_result import RectifyClient_application
+from transcript_extractor.rectify_transcript import RectifyClient_transcript
+from evidence_extractor.ocr_paddle_ocr_vl import LayoutParserClient_evidence
+from law_rag.run import law_rag_run
+
+from backend.embedding import compute_embedding, cosine_similarity
+from backend.db import fetch_similar_cases, store_case_record, init_case_db
+from backend.text_utils import list_files_by_ext, parse_json_from_text, extract_evidence_text_from_ocr
+
+
+def call_deepseek_json(system_prompt: str, user_content: str, temperature: float = 0.0) -> Any:
+    client = OpenAI(api_key=config.config.DEEPSEEK_API, base_url="https://api.deepseek.com")
+    response = client.chat.completions.create(
+        model="deepseek-chat",
+        messages=[
+            {"role": "system", "content": system_prompt},
+            {"role": "user", "content": user_content}
+        ],
+        temperature=temperature,
+        stream=False
+    )
+    content = (response.choices[0].message.content or "").strip()
+    parsed = parse_json_from_text(content)
+    return parsed if parsed is not None else content
+
+
+def extract_application_text(application_dir: str) -> str:
+    image_exts = [".png", ".jpg", ".jpeg", ".bmp", ".gif"]
+    doc_exts = [".pdf", ".doc", ".docx"]
+    image_files = list_files_by_ext(application_dir, image_exts)
+    doc_files = list_files_by_ext(application_dir, doc_exts)
+    if image_files:
+        ocr_client = LayoutParserClient_application()
+        raw_text = ocr_client.parse(image_files)
+        rectify_client = RectifyClient_application()
+        return rectify_client.extract_legal_document(raw_text)
+    if doc_files:
+        reader = DocumentReader()
+        contents = [reader.process_input(path) for path in doc_files]
+        raw_text = "\n\n".join(contents)
+        rectify_client = RectifyClient_application()
+        return rectify_client.extract_legal_document(raw_text)
+    return ""
+
+
+def extract_transcript_text(transcript_dir: str) -> str:
+    doc_exts = [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"]
+    files = list_files_by_ext(transcript_dir, doc_exts)
+    if not files:
+        return ""
+    reader = DocumentReader()
+    contents = [reader.process_input(path) for path in files]
+    raw_text = "\n\n".join(contents)
+    rectify_client = RectifyClient_transcript()
+    cleaned = rectify_client.clean_text(raw_text)
+    return cleaned or raw_text
+
+
+def build_case_profile(application_text: str, transcript_text: str) -> Dict[str, Any]:
+    system_prompt = """
+你是劳动仲裁案件分析专家。请根据申请书与庭审笔录,构建案件画像。
+仅输出JSON,不要包含解释或Markdown。
+输出格式:
+{
+  "case_profile": {
+    "parties": "当事人信息与关系",
+    "claims": ["仲裁请求1", "仲裁请求2"],
+    "background": "事实与理由摘要",
+    "timeline": ["关键时间点1", "关键时间点2"],
+    "key_facts": ["关键事实1", "关键事实2"],
+    "disputed_facts": ["争议事实1", "争议事实2"]
+  }
+}
+"""
+    user_content = f"申请书:\n{application_text}\n\n庭审笔录:\n{transcript_text}"
+    result = call_deepseek_json(system_prompt, user_content)
+    if isinstance(result, dict) and "case_profile" in result:
+        return result
+    return {"case_profile": {"parties": "", "claims": [], "background": "", "timeline": [], "key_facts": [], "disputed_facts": []}}
+
+
+def extract_dispute_points(application_text: str, transcript_text: str) -> List[str]:
+    system_prompt = """
+你是劳动争议分析专家。请从申请书和庭审笔录中提取本案争议焦点。
+如果庭审笔录中有“争议焦点”或“争议焦点为”,优先提取其内容。
+仅输出JSON,不要包含解释或Markdown。
+输出格式:
+{
+  "dispute_points": ["争议焦点1", "争议焦点2"]
+}
+"""
+    user_content = f"申请书:\n{application_text}\n\n庭审笔录:\n{transcript_text}"
+    result = call_deepseek_json(system_prompt, user_content)
+    if isinstance(result, dict):
+        points = result.get("dispute_points", [])
+        if isinstance(points, list):
+            return [p for p in points if isinstance(p, str) and p.strip()]
+    text = transcript_text or application_text
+    matches = re.findall(r"争议焦点为[::]?(.*)", text)
+    if matches:
+        raw = matches[0]
+        parts = re.split(r"[;;。]\s*|\d+[、.]", raw)
+        return [p.strip() for p in parts if p.strip()]
+    return []
+
+
+def list_evidence_categories(evidence_dir: str) -> List[str]:
+    if not evidence_dir or not os.path.exists(evidence_dir):
+        return []
+    entries = []
+    for name in os.listdir(evidence_dir):
+        full_path = os.path.join(evidence_dir, name)
+        if os.path.isdir(full_path):
+            entries.append(name)
+    if entries:
+        return sorted(entries)
+    files = list_files_by_ext(evidence_dir, [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"])
+    return ["未分类"] if files else []
+
+
+def select_relevant_categories(dispute_points: List[str], categories: List[str]) -> Dict[str, List[str]]:
+    if not dispute_points or not categories:
+        return {}
+    system_prompt = """
+你是证据分析专家。请根据争议焦点选择可能相关的证据类别。
+仅输出JSON,不要包含解释或Markdown。
+输出格式:
+{
+  "mapping": [
+    {"dispute_point": "争议焦点1", "categories": ["证据类别A", "证据类别B"]},
+    {"dispute_point": "争议焦点2", "categories": ["证据类别C"]}
+  ]
+}
+"""
+    user_content = json.dumps({"dispute_points": dispute_points, "evidence_categories": categories}, ensure_ascii=False)
+    result = call_deepseek_json(system_prompt, user_content)
+    mapping: Dict[str, List[str]] = {}
+    if isinstance(result, dict):
+        items = result.get("mapping", [])
+        if isinstance(items, list):
+            for item in (i for i in items if isinstance(i, dict)):
+                point = item.get("dispute_point")
+                cats = item.get("categories", [])
+                if isinstance(point, str) and isinstance(cats, list):
+                    valid = [c for c in cats if c in categories]
+                    if valid:
+                        mapping[point] = valid
+    return mapping
+
+
+def limit_files(files: List[str], limit: int = 10) -> List[str]:
+    if len(files) <= limit:
+        return files
+    head = limit // 2
+    tail = limit - head
+    return files[:head] + files[-tail:]
+
+
+def ocr_evidence_files(files: List[str]) -> List[Dict[str, Any]]:
+    image_exts = [".png", ".jpg", ".jpeg", ".bmp", ".gif"]
+    doc_exts = [".pdf", ".doc", ".docx"]
+    images = [f for f in files if os.path.splitext(f)[1].lower() in image_exts]
+    docs = [f for f in files if os.path.splitext(f)[1].lower() in doc_exts]
+    results = []
+    if images:
+        ocr_client = LayoutParserClient_evidence()
+        text = ocr_client.parse(images)
+        extracted = extract_evidence_text_from_ocr(text)
+        results.append(
+            {
+                "files": images,
+                "text": extracted["text"][:2000],
+                "lines": extracted["lines"][:200]
+            }
+        )
+    if docs:
+        reader = DocumentReader()
+        for doc in docs:
+            text = reader.process_input(doc)
+            results.append({"files": [doc], "text": text[:2000]})
+    return results
+
+
+def retrieve_laws(dispute_points: List[str]) -> Dict[str, List[Dict[str, str]]]:
+    laws = {}
+    for point in dispute_points:
+        laws[point] = law_rag_run(point)
+    return laws
+
+
+def final_judgement(
+    case_profile: Dict[str, Any],
+    dispute_points: List[str],
+    law_results: Dict[str, Any],
+    evidence_results: Dict[str, Any],
+    similar_cases: List[Dict[str, Any]]
+) -> Dict[str, Any]:
+    system_prompt = """
+你是劳动争议案件裁决分析专家。基于案件画像、争议焦点、证据摘要、相关法律条文与相似案例,给出最终判断。
+仅输出JSON,不要包含解释或Markdown。
+输出格式:
+{
+  "final_decision": "最终判断结论",
+  "reasoning": "综合理由",
+  "dispute_point_findings": [
+    {
+      "dispute_point": "争议焦点1",
+      "finding": "对此争议的判断",
+      "evidence_used": ["证据类别A", "证据类别B"],
+      "law_applied": ["法律条文ID1", "法律条文ID2"]
+    }
+  ]
+}
+"""
+    user_content = json.dumps(
+        {
+            "case_profile": case_profile,
+            "dispute_points": dispute_points,
+            "law_results": law_results,
+            "evidence_results": evidence_results,
+            "similar_cases": similar_cases
+        },
+        ensure_ascii=False
+    )
+    result = call_deepseek_json(system_prompt, user_content)
+    if isinstance(result, dict):
+        return result
+    return {"final_decision": "", "reasoning": "", "dispute_point_findings": []}
+
+
+def build_case_summary_text(case_profile: Dict[str, Any], dispute_points: List[str]) -> str:
+    profile = case_profile.get("case_profile", {}) if isinstance(case_profile, dict) else {}
+    parts = []
+    parties = profile.get("parties")
+    if parties:
+        parts.append(str(parties))
+    claims = profile.get("claims", [])
+    if isinstance(claims, list) and claims:
+        parts.append(" ".join([str(c) for c in claims]))
+    background = profile.get("background")
+    if background:
+        parts.append(str(background))
+    if dispute_points:
+        parts.append(" ".join(dispute_points))
+    return "\n".join([p for p in parts if p])
+
+
+def compute_similar_cases(embedding: List[float]) -> List[Dict[str, Any]]:
+    raw_cases = fetch_similar_cases(embedding, top_k=50)
+    scored = []
+    for item in raw_cases:
+        score = cosine_similarity(embedding, item.get("embedding", []))
+        if score <= 0:
+            continue
+        try:
+            stored_judgement = json.loads(item.get("final_judgement_json") or "{}")
+        except (TypeError, ValueError):
+            stored_judgement = {}
+        scored.append(
+            {
+                "case_id": item.get("case_id"),
+                "summary_text": item.get("summary_text"),
+                "final_judgement": stored_judgement,
+                "similarity": score
+            }
+        )
+    scored.sort(key=lambda x: x["similarity"], reverse=True)
+    return scored[:3]
+
+
+def process_case_text_with_evidence(case_id: str, application_text: str, transcript_text: str, evidence_dir: str) -> Dict[str, Any]:
+    case_profile = build_case_profile(application_text, transcript_text)
+    dispute_points = extract_dispute_points(application_text, transcript_text)
+    categories = list_evidence_categories(evidence_dir)
+    mapping = select_relevant_categories(dispute_points, categories)
+    selected = set()
+    for cats in mapping.values():
+        for cat in cats:
+            selected.add(cat)
+    if not selected and categories:
+        if "证据清单" in categories:
+            selected.add("证据清单")
+        else:
+            selected.update(categories)
+    evidence_results = {}
+    for category in sorted(selected):
+        category_path = evidence_dir if category == "未分类" else os.path.join(evidence_dir, category)
+        files = list_files_by_ext(category_path, [".pdf", ".doc", ".docx", ".png", ".jpg", ".jpeg", ".bmp", ".gif"])
+        files = limit_files(files, 10)
+        evidence_results[category] = ocr_evidence_files(files) if files else []
+    law_results = retrieve_laws(dispute_points)
+    summary_text = build_case_summary_text(case_profile, dispute_points)
+    embedding = compute_embedding(summary_text)
+    similar_cases = compute_similar_cases(embedding)
+    judgement = final_judgement(case_profile, dispute_points, law_results, evidence_results, similar_cases)
+    return {
+        "case_profile": case_profile,
+        "dispute_points": dispute_points,
+        "evidence_results": evidence_results,
+        "law_results": law_results,
+        "summary_text": summary_text,
+        "embedding": embedding,
+        "similar_cases": similar_cases,
+        "final_judgement": judgement
+    }
+
+
+def process_case_dir(case_dir: str) -> Dict[str, Any]:
+    init_case_db()
+    case_id = os.path.basename(case_dir.rstrip(os.sep))
+    application_dir = os.path.join(case_dir, "申请书")
+    transcript_dir = os.path.join(case_dir, "庭审笔录")
+    evidence_dir = os.path.join(case_dir, "证据")
+    application_text = extract_application_text(application_dir)
+    transcript_text = extract_transcript_text(transcript_dir)
+    result = process_case_text_with_evidence(case_id, application_text, transcript_text, evidence_dir)
+    store_case_record(
+        case_id,
+        result["summary_text"],
+        result["case_profile"],
+        result["dispute_points"],
+        result["law_results"],
+        result["evidence_results"],
+        result["final_judgement"],
+        result["embedding"]
+    )
+    return {
+        "case_dir": case_dir,
+        "application_text": application_text,
+        "transcript_text": transcript_text,
+        "case_profile": result["case_profile"],
+        "dispute_points": result["dispute_points"],
+        "law_results": result["law_results"],
+        "evidence_results": result["evidence_results"],
+        "final_judgement": result["final_judgement"],
+        "similar_cases": result["similar_cases"]
+    }

+ 80 - 0
backend/text_utils.py

@@ -0,0 +1,80 @@
+import os
+import json
+import re
+from typing import List, Dict, Any, Optional
+
+from fastapi import UploadFile
+
+
+def list_files_by_ext(root_dir: str, exts: List[str]) -> List[str]:
+    if not root_dir or not os.path.exists(root_dir):
+        return []
+    found = []
+    for base, _, files in os.walk(root_dir):
+        for file_name in files:
+            ext = os.path.splitext(file_name)[1].lower()
+            if ext in exts:
+                found.append(os.path.join(base, file_name))
+    return sorted(found)
+
+
+def parse_json_from_text(text: str) -> Any:
+    if not text:
+        return None
+    try:
+        return json.loads(text)
+    except Exception:
+        pass
+    match = re.search(r"\{[\s\S]*\}", text)
+    if match:
+        try:
+            return json.loads(match.group(0))
+        except Exception:
+            pass  # not a valid JSON object; fall through and try the array pattern
+    match = re.search(r"\[[\s\S]*\]", text)
+    if match:
+        try:
+            return json.loads(match.group(0))
+        except Exception:
+            return None
+    return None
+
+
+def extract_evidence_text_from_ocr(ocr_text: str) -> Dict[str, Any]:
+    parsed = parse_json_from_text(ocr_text)
+    lines = []
+    if isinstance(parsed, list):
+        for item in parsed:
+            if not isinstance(item, dict):
+                continue
+            label = item.get("block_label")
+            content = item.get("block_content")
+            if not content or not isinstance(content, str):
+                continue
+            if label in {"text", "paragraph_title", "table", "header"}:
+                cleaned = content.strip()
+                if cleaned:
+                    lines.append(cleaned)
+    text = "\n".join(lines)
+    return {"text": text, "lines": lines}
+
+
+def normalize_filename(name: str) -> str:
+    base = os.path.basename(name)
+    return base.replace("..", "_")
+
+
+async def save_uploads(files: Optional[List[UploadFile]], target_dir: str) -> List[str]:
+    if not files:
+        return []
+    os.makedirs(target_dir, exist_ok=True)
+    saved = []
+    for file in files:
+        if not file or not file.filename:
+            continue
+        file_path = os.path.join(target_dir, normalize_filename(file.filename))
+        content = await file.read()
+        with open(file_path, "wb") as f:
+            f.write(content)
+        saved.append(file_path)
+    return saved

BIN=BIN
config/__pycache__/config.cpython-310.pyc


+ 10 - 0
config/config.py

@@ -0,0 +1,10 @@
+import os
+
+# Credentials are read from the environment rather than committed; the variable names are assumptions.
+PADDLE_TOKEN = os.environ.get("PADDLE_TOKEN", "")
+DEEPSEEK_API = os.environ.get("DEEPSEEK_API", "")
+MOONSHOT_API_KEY = os.environ.get("MOONSHOT_API_KEY", "")
+
+VECTOR_STORE_BASE = "E:\\project\\arbitration_system\\law_rag\\vectrization\\vector_store"
+MODEL_PATH = "E:\\environment\\models\\text2vec-base-chinese"
+FILE_STORAGE_BASE = "E:\\project\\arbitration_system\\law_rag\\vectrization\\file_storage"

BIN=BIN
evidence_extractor/__pycache__/ocr_paddle_ocr_vl.cpython-310.pyc


+ 42 - 0
evidence_extractor/demo/kaoqinbiao_ocr.py

@@ -0,0 +1,42 @@
+import base64
+import os
+import requests
+
+API_URL = "https://q8d4u1u6c45dn7pd.aistudio-app.com/layout-parsing"
+TOKEN = os.environ.get("PADDLE_TOKEN", "")  # read from the environment; the variable name is an assumption
+
+file_path = "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\证人证言\\F86-ZC1-2023-0001-009_04.png"
+
+with open(file_path, "rb") as file:
+    file_bytes = file.read()
+    file_data = base64.b64encode(file_bytes).decode("ascii")
+
+headers = {
+    "Authorization": f"token {TOKEN}",
+    "Content-Type": "application/json"
+}
+
+payload = {
+    "file": file_data,
+    "fileType": 1,
+    "useDocOrientationClassify": False,
+    "useDocUnwarping": False,
+    "useChartRecognition": False,
+}
+
+response = requests.post(API_URL, json=payload, headers=headers)
+print(f"Response status code: {response.status_code}")
+
+if response.status_code == 200:
+    result = response.json()["result"]
+
+    # Extract parsing_res_list
+    for layout_result in result.get("layoutParsingResults", []):
+        pruned_result = layout_result.get("prunedResult", {})
+        parsing_res_list = pruned_result.get("parsing_res_list", [])
+
+        if parsing_res_list:
+            print("\n=== parsing_res_list ===")
+            print(parsing_res_list)
+else:
+    print(f"Request failed, status code: {response.status_code}")

+ 100 - 0
evidence_extractor/ocr_paddle_ocr_vl.py

@@ -0,0 +1,100 @@
+import json
+import os
+import base64
+import requests
+from typing import List, Union
+
+import config.config
+
+
+class LayoutParserClient_evidence:
+    def __init__(self, api_url: str = None, token: str = None):
+        self.api_url = api_url or "https://q8d4u1u6c45dn7pd.aistudio-app.com/layout-parsing"
+        self.token = token or config.config.PADDLE_TOKEN
+        self.headers = {
+            "Authorization": f"token {self.token}",
+            "Content-Type": "application/json"
+        }
+
+    def _encode_image(self, file_path: str) -> str:
+        """Read the image file and return its contents base64-encoded."""
+        with open(file_path, "rb") as file:
+            return base64.b64encode(file.read()).decode("ascii")
+
+    def _process_single_file(self, file_path: str) -> str:
+        """Process a single image and return its parsing_res_list as a JSON string ("" on error)."""
+        file_data = self._encode_image(file_path)
+
+        payload = {
+            "file": file_data,
+            "fileType": 1,
+            "useDocOrientationClassify": False,
+            "useDocUnwarping": False,
+            "useTextlineOrientation": False,
+        }
+
+        try:
+            response = requests.post(self.api_url, json=payload, headers=self.headers)
+            response.raise_for_status()  # raise on a non-2xx HTTP status
+
+            result = response.json()["result"]
+            # print(result)
+            full_text = []
+
+            # Extract parsing_res_list
+            for res in result.get("layoutParsingResults", []):
+                pruned_result = res.get("prunedResult", {})
+                parsing_res_list = pruned_result.get("parsing_res_list", [])
+
+                if parsing_res_list:
+                    print("\n=== parsing_res_list ===")
+                    full_text.extend(parsing_res_list)
+
+            return json.dumps(full_text, ensure_ascii=False, indent=2)
+
+        except Exception as e:
+            print(f"Error while processing file {file_path}: {e}")
+            return ""
+
+    def parse(self, inputs: Union[str, List[str]]) -> str:
+        """
+        Main entry point.
+        :param inputs: a single image path or a list of image paths
+        :return: the per-image results joined in order
+        """
+        if isinstance(inputs, str):
+            # A single string is wrapped in a list so both cases share one code path
+            file_list = [inputs]
+        else:
+            file_list = inputs
+
+
+
+        combined_results = []
+        for file_path in file_list:
+            print(f"Processing: {os.path.basename(file_path)}...")
+            text = self._process_single_file(file_path)
+            if text:
+                combined_results.append(text)
+
+        # Join the per-image results in order, separated by a "--- Next Page ---" marker
+        return "\n\n--- Next Page ---\n\n".join(combined_results)
+
+
+if __name__ == '__main__':
+    # Instantiate the client
+    client = LayoutParserClient_evidence()
+
+    # Example 1: process a single image
+    # single_img = "E:\\project\\arbitration_system\\application_extractor\\test\\李述花\\李述花-申请书.png"
+    # result_1 = client.parse(single_img)
+    # print(result_1)
+
+    # Example 2: process several images (results joined in order)
+    multi_imgs = [
+        "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\证人证言\\F86-ZC1-2023-0001-009_04.png",
+        "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\证人证言\\F86-ZC1-2023-0001-009_05.png"
+    ]
+    result_2 = client.parse(multi_imgs)
+    print(result_2)
+

+ 15 - 0
evidence_extractor/run.py

@@ -0,0 +1,15 @@
+from evidence_extractor import ocr_paddle_ocr_vl
+
+multi_imgs = [
+    "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\证人证言\\F86-ZC1-2023-0001-009_04.png",
+    "E:\\project\\arbitration_system\\evidence_extractor\\test\\F86-ZC1-2023-0001\\证人证言\\F86-ZC1-2023-0001-009_05.png"
+]
+
+def evidece_extractor_run(multi_imgs):
+    client = ocr_paddle_ocr_vl.LayoutParserClient_evidence()
+    evidence_extractor_result = client.parse(multi_imgs)
+
+    return evidence_extractor_result
+
+if __name__ == '__main__':
+    print(evidece_extractor_run(multi_imgs))

BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/其他文字材料/F86-ZC1-2023-0001-009_18.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/其他文字材料/F86-ZC1-2023-0001-009_19.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_01.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_02.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_23.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_24.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_25.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_26.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/劳动关系证明材料/F86-ZC1-2023-0001-009_27.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-009_20.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-009_28.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/工资单/F86-ZC1-2023-0001-010_06.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_00.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_01.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_02.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_03.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_04.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/考勤表/F86-ZC1-2023-0001-010_05.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/解除劳动关系相关材料/F86-ZC1-2023-0001-009_03.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_04.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_05.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/证人证言/F86-ZC1-2023-0001-009_21.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/证据清单/F86-ZC1-2023-0001-009_00.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/证据清单/F86-ZC1-2023-0001-009_22.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_06.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_07.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_08.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_09.png


BIN=BIN
evidence_extractor/test/F86-ZC1-2023-0001/银行流水/F86-ZC1-2023-0001-009_10.png


Some files were not shown because too many files changed in this diff