arxiv-tracker: LLM-Filtered arXiv Paper Tracking Pipeline

1. Overview

arxiv-tracker is a Python pipeline for keeping up with arXiv papers in specific research areas. It harvests paper metadata daily over OAI-PMH, applies a two-stage LLM filter to surface the most relevant papers, and provides tools to download, translate, and summarise them. The result is a browsable local collection that grows automatically, with no manual trawling of arXiv required.

Active collections include: LLM_MEMORY, LLM_PSYCHOLOGY, MMaDB, SEMANTIC_OPS, VECTOR_DB.

2. Pipeline

The pipeline has four stages, orchestrated daily by daily_run.sh:

Stage 1: Fetch

fetch.py pulls arXiv paper metadata (title, abstract, authors, categories, date) via the OAI-PMH protocol and stores it in a local SQLite database (papers.db). Runs are incremental: only records added since the last fetch are downloaded.
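
The incremental behaviour can be sketched as follows. This is a minimal illustration, not the contents of fetch.py: the table layout, the helper names (open_db, last_datestamp, list_records_url, upsert), the endpoint URL, and the default set are all assumptions. Only the request parameters (verb, metadataPrefix, from, set) come from the OAI-PMH protocol. The HTTP call and XML parsing are left out.

```python
import sqlite3
from urllib.parse import urlencode

# Assumed endpoint; check arXiv's OAI-PMH documentation for the current one.
OAI_BASE_URL = "https://oaipmh.arxiv.org/oai"

# Hypothetical schema; the real papers.db layout is not described in the text.
SCHEMA = """
CREATE TABLE IF NOT EXISTS papers (
    arxiv_id   TEXT PRIMARY KEY,
    title      TEXT,
    abstract   TEXT,
    authors    TEXT,
    categories TEXT,
    datestamp  TEXT
)
"""


def open_db(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def last_datestamp(conn):
    """Return the newest stored datestamp, or None for an empty database."""
    return conn.execute("SELECT MAX(datestamp) FROM papers").fetchone()[0]


def list_records_url(from_date=None, set_spec="cs", base_url=OAI_BASE_URL):
    """Build a ListRecords request; 'from' limits it to records on or after that date."""
    params = {"verb": "ListRecords", "metadataPrefix": "arXiv", "set": set_spec}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def upsert(conn, records):
    """Insert or overwrite records, so harvesting the same day twice is harmless."""
    conn.executemany(
        "INSERT OR REPLACE INTO papers "
        "(arxiv_id, title, abstract, authors, categories, datestamp) "
        "VALUES (:arxiv_id, :title, :abstract, :authors, :categories, :datestamp)",
        records,
    )
    conn.commit()
```

Because the OAI-PMH from argument is an inclusive lower bound, a request that starts at the last stored datestamp will return that day's records again. Keying the table on the paper ID and overwriting on conflict is one simple way to make that overlap safe.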

Stage 2: Two-Stage LLM Filter

paper_collect.py reads from papers.db and applies a three-step filter to each paper, a keyword pre-filter followed by the two LLM passes:

Keyword pre-filter — fast in-process check against abstract_keywords to skip clearly irrelevant papers without an LLM call.
Coarse LLM pass — a cheaper, faster model scores the abstract for relevance. Papers below coarse_min_confidence are dropped.
Refine LLM pass — a stronger model re-evaluates the survivors with the full topic description, producing the final relevance score written to collections/NAME.json.
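
The three steps above amount to a short-circuiting cascade, which can be sketched like this. The function names and signatures are hypothetical, and the two LLM passes are passed in as plain callables so the control flow can be shown without assuming any particular model API. Simple case-insensitive substring matching stands in for whatever matching paper_collect.py actually does.

```python
from typing import Callable, List, Optional


def keyword_prefilter(abstract: str, abstract_keywords: List[str]) -> bool:
    """Cheap in-process check; True means the paper is worth an LLM call."""
    text = abstract.lower()
    return any(keyword.lower() in text for keyword in abstract_keywords)


def filter_paper(
    abstract: str,
    abstract_keywords: List[str],
    coarse_score: Callable[[str], float],
    refine_score: Callable[[str], float],
    coarse_min_confidence: float,
) -> Optional[float]:
    """Return the final relevance score, or None if the paper was dropped."""
    # Step 1: keyword pre-filter, no LLM call at all.
    if not keyword_prefilter(abstract, abstract_keywords):
        return None
    # Step 2: the cheaper model; drop anything below the threshold.
    if coarse_score(abstract) < coarse_min_confidence:
        return None
    # Step 3: the stronger model produces the score that gets stored.
    return refine_score(abstract)
```

The ordering is the point of the design: each step is more expensive than the one before it, so the stronger model only sees papers that the cheaper checks could not rule out.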

Runs are resumable: progress fields (checked_base_ids, missing_base_ids) are appended directly to the filter config JSON, so an interrupted run picks up exactly where it left off.
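
A minimal sketch of this checkpointing idea follows. It assumes that checked_base_ids is a list of paper IDs that have already been evaluated; the text does not say what missing_base_ids holds, so the sketch only makes sure the field exists and never writes to it. The helper names and the write-then-rename step are illustrative choices, not a description of paper_collect.py.

```python
import json
import os
import tempfile


def load_config(path):
    """Read the filter config and make sure the progress fields exist."""
    with open(path, encoding="utf-8") as fh:
        config = json.load(fh)
    config.setdefault("checked_base_ids", [])
    config.setdefault("missing_base_ids", [])
    return config


def save_config(path, config):
    """Write to a temp file, then rename, so a crash never leaves half a JSON file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(config, fh, ensure_ascii=False, indent=2)
    os.replace(tmp_path, path)


def run(path, base_ids, process):
    """Call process(base_id) for every ID not yet recorded as checked."""
    config = load_config(path)
    done = set(config["checked_base_ids"])
    for base_id in base_ids:
        if base_id in done:
            continue  # finished in an earlier run
        process(base_id)
        config["checked_base_ids"].append(base_id)
        done.add(base_id)
        save_config(path, config)  # checkpoint after every paper
```

Saving after every paper costs one small file write per item. In exchange, an interruption repeats at most the one paper that was in flight.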

Stage 3: Summaries and Translations

introduce_papers.py generates a Markdown summary for each matched paper, using web search to supplement the abstract with further context. batch_download_translate.py downloads PDFs from arXiv and produces Chinese translations in parallel.
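
The parallel download-and-translate step might look roughly like the sketch below. It uses a thread pool from the standard library and takes the download and translate functions as parameters, since the text does not say which HTTP client or translation backend batch_download_translate.py uses. The names process_all and pdf_url and the default of four workers are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def pdf_url(arxiv_id):
    """PDF location for an arXiv ID."""
    return f"https://arxiv.org/pdf/{arxiv_id}"


def process_all(arxiv_ids, download, translate, max_workers=4):
    """Download and translate each paper in a thread pool.

    Returns (results, errors), both keyed by arXiv ID, so that one failed
    paper does not abort the whole batch.
    """
    def one(arxiv_id):
        pdf_bytes = download(pdf_url(arxiv_id))
        return translate(pdf_bytes)

    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(one, arxiv_id): arxiv_id for arxiv_id in arxiv_ids}
        for future in as_completed(futures):
            arxiv_id = futures[future]
            try:
                results[arxiv_id] = future.result()
            except Exception as exc:  # keep going; report the failure instead
                errors[arxiv_id] = exc
    return results, errors
```

Threads suit this workload because each task spends most of its time waiting on the network or on a remote model.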

Stage 4: Browse

render_html.py builds an interactive HTML viewer from the collected JSON, and serve.py runs a local HTTP server with PUT support so that read state (which papers you have already seen) persists across sessions.
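
To illustrate what "a local HTTP server with PUT support" can mean, here is a small sketch built on the standard library's http.server. It is not serve.py: the state file name read_state.json, the class and function names, and the decision to allow writes to that one file only are assumptions made for the example.

```python
import os
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


class PutHandler(SimpleHTTPRequestHandler):
    """Serves static files and accepts PUT for a single state file."""

    STATE_FILE = "read_state.json"  # hypothetical file name

    def do_PUT(self):
        # Read the body first so the client never sees a half-consumed request.
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        if self.path.lstrip("/") != self.STATE_FILE:
            self.send_error(403, "only the read-state file is writable")
            return
        with open(os.path.join(self.directory, self.STATE_FILE), "wb") as fh:
            fh.write(body)
        self.send_response(204)  # success, no response body
        self.end_headers()


def make_server(directory, port=0):
    """Create a server bound to localhost; port 0 lets the OS pick a free port."""
    handler = partial(PutHandler, directory=directory)
    return ThreadingHTTPServer(("127.0.0.1", port), handler)
```

Calling make_server(".", 8000).serve_forever() would serve the current directory. The viewer page can then send its read state with a PUT request and load it again with an ordinary GET on the next visit.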

3. Filter Configuration

Each research topic is defined by a NAME.filter.json file in collections/. The key fields are:

The focus field is a free-text description of the research area passed verbatim to the LLM, allowing precise scoping without rewriting code. Progress fields are appended automatically as the run proceeds, so the same file serves as both configuration and checkpoint.
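
As a rough picture of such a file, the sketch below builds a hypothetical filter config in Python and prints the hand-written part as JSON. Only the field names come from the text. Every value is invented, the 0 to 1 scale for coarse_min_confidence is an assumption, and a real config may contain fields not shown here.

```python
import json

# Hypothetical example: all values are made up for illustration.
example_filter = {
    "focus": "Memory mechanisms for LLM agents: storage, retrieval, forgetting.",
    "abstract_keywords": ["memory", "retrieval"],
    "coarse_min_confidence": 0.5,  # assumed to be on a 0 to 1 scale
    # Progress fields, appended by the pipeline rather than written by hand.
    "checked_base_ids": [],
    "missing_base_ids": [],
}

PROGRESS_FIELDS = ("checked_base_ids", "missing_base_ids")


def user_settings(config):
    """Return the configuration part of a filter file, without checkpoint data."""
    return {key: value for key, value in config.items() if key not in PROGRESS_FIELDS}


print(json.dumps(user_settings(example_filter), indent=2))
```

Separating the two groups this way makes it easy to reset a collection: dropping the progress fields leaves a config that starts the run from scratch.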

1. Overview
2. Pipeline
Stage 1: Fetch
Stage 2: Two-Stage LLM Filter
Stage 3: Summaries and Translations
Stage 4: Browse
3. Filter Configuration