
Literature Review: Natural Language Processing Research Notes


Overview

These research notes compile key theories, technical developments, applications, and frontier explorations in natural language processing, spanning semantic modeling, architecture optimization, knowledge enhancement, and more, to serve as a comprehensive reference for researchers in the field.

Reading time: about 10 minutes
Topics: natural language processing, machine learning, artificial intelligence

Foundational Theory

1. Semantic Modeling

  • PLSI: probabilistic latent semantic analysis trained with the EM algorithm; addresses synonymy and polysemy (Hofmann 1999)
  • Cloze task: the precursor of masked language modeling (MLM); predicts a missing word from its context (Taylor 1953; Devlin et al. 2018)
  • Attention mechanism: can be reformulated as an efficient many-to-one RNN (Feng et al. 2024)
  • Radford et al. (2021), i.e., openai/CLIP: contrastive learning over positive and negative pairs.
  • Q. Chen et al. (2025) explain LLM reasoning with an electronic-circuit analogy: in-context learning (ICL) acts as an electromagnetic field and chain-of-thought (CoT) as a series resistor. ICL's "semantic field" follows Faraday's law to induce extra voltage, CoT's reasoning difficulty constrains current per Ohm's law, and model performance corresponds to output power.
  • S. Kim et al. (2025) find that existing LLMs do not adequately capture sequential information, and propose LLM-SRec, which distills sequential knowledge to improve recommendation performance.
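The "attention as a many-to-one RNN" view (Feng et al. 2024) can be illustrated with a small sketch: single-query softmax attention is computed as a recurrence over tokens, maintaining only a running numerator, denominator, and max score rather than materializing the full score vector. This is a minimal numpy illustration of the idea, not the authors' implementation.

```python
import numpy as np

def attention_as_rnn(q, keys, values):
    """Single-query softmax attention computed as a recurrence over tokens:
    maintain a running numerator/denominator (with a running max for
    numerical stability) instead of the full score vector."""
    num = np.zeros_like(values[0], dtype=float)  # running sum of w_i * v_i
    den = 0.0                                    # running sum of w_i
    m = -np.inf                                  # running max score
    for k, v in zip(keys, values):
        s = float(q @ k)
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        num = num * scale + np.exp(s - m_new) * v
        den = den * scale + np.exp(s - m_new)
        m = m_new
    return num / den

rng = np.random.default_rng(0)
q = rng.normal(size=8)
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))

# Reference: standard softmax attention computed in one shot
w = np.exp(K @ q - np.max(K @ q))
ref = (w / w.sum()) @ V
assert np.allclose(attention_as_rnn(q, K, V), ref)
```

The recurrence gives the same output as one-shot attention while processing tokens sequentially, which is what makes the RNN framing useful for efficient inference.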

2. Architecture Optimization

  • ALBERT: replaces the NSP task with sentence-order prediction (SOP) (Lan et al. 2019)
  • 唐东格 (2023): applications of NLP in Du Xiaoman's risk management, including character-word alignment
  • SimSiam (X. Chen and He 2020): the stop-gradient mechanism prevents representation collapse.
  • Hu et al. (2021): LoRA freezes the pretrained weights and injects low-rank matrices A and B to represent the weight update (ΔW = BA), enabling efficient adaptation. On GPT-3 it reduces trainable parameters by 10,000× and hardware requirements by 3×, with no inference latency. Experiments show it matches or outperforms full fine-tuning and prefix-tuning on WikiSQL, MNLI, and other tasks, and works even at very low rank (r = 1 or 2).
  • Ouyang et al. (2022): InstructGPT fine-tunes GPT-3 with reinforcement learning from human feedback (RLHF) to better align the model with user intent. Training proceeds in three steps: supervised fine-tuning (SFT) on human demonstrations, training a reward model (RM) on rankings of model outputs, and optimizing the policy against the reward model with PPO.
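The LoRA update ΔW = BA can be sketched for a single linear layer (a minimal numpy sketch under simplified assumptions; the dimensions and data here are illustrative):

```python
import numpy as np

# Minimal LoRA sketch: the pretrained weight W0 stays frozen; only the
# low-rank factors A and B are trained, and the effective weight is
# W0 + B @ A (i.e. delta_W = BA).
d_out, d_in, r = 6, 4, 2
rng = np.random.default_rng(0)

W0 = rng.normal(size=(d_out, d_in))      # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01    # trainable rank-r "down" projection
B = np.zeros((d_out, r))                 # trainable, zero-init so delta_W = 0 at start

def lora_forward(x):
    # y = (W0 + B A) x, computed without ever forming the full delta_W
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=d_in)
# At initialization the adapted layer matches the frozen layer exactly
assert np.allclose(lora_forward(x), W0 @ x)

# Trainable parameters: r*(d_in + d_out) instead of d_in*d_out
print(A.size + B.size, "trainable vs", W0.size, "frozen")
```

Zero-initializing B makes the adapted model start out identical to the pretrained one, and the trainable parameter count scales with r rather than with the full weight matrix, which is the source of LoRA's efficiency.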

3. Knowledge Enhancement

  • RAG framework (Guu et al. 2020; Lewis et al. 2020; Yan et al. 2024): pretrained language models absorb a great deal of knowledge from data, but they are limited in how precisely they can access and manipulate it, and underperform on knowledge-intensive tasks.
  • Retrieval-Augmented Generation (RAG) models improve generation by retrieving from an external knowledge base.
  • Formula: P(Y|X) = Σ_D P(D|X) P(Y|X,D) — the probability of output Y marginalizes over retrieved documents D, weighting the generator's P(Y|X,D) by the retriever's P(D|X).
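The RAG objective marginalizes the generator's distribution over the retrieved documents, p(y|x) = Σ_d p(d|x) p(y|x,d). A toy sketch with hypothetical numbers (3 retrieved documents, 2 candidate outputs):

```python
import numpy as np

# Toy RAG marginalization: the output distribution is a mixture over
# the top-k retrieved documents, weighted by the retriever's scores.
p_doc_given_x = np.array([0.6, 0.3, 0.1])   # retriever: p(d|x) over top-3 docs
p_y_given_x_d = np.array([                  # generator: p(y|x,d) for 2 outputs
    [0.9, 0.1],   # conditioned on doc 0
    [0.2, 0.8],   # conditioned on doc 1
    [0.5, 0.5],   # conditioned on doc 2
])
p_y_given_x = p_doc_given_x @ p_y_given_x_d  # marginalize out the document
print(p_y_given_x)                           # → [0.65 0.35]
assert np.isclose(p_y_given_x.sum(), 1.0)
```

Because each conditional p(y|x,d) is a valid distribution and the retriever weights sum to one, the mixture is again a valid distribution over outputs.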

4. Robust Feature Design

  • Doubly robust features (Yang Zhou, Fan, and Xue 2024):
    • identify key topics (e.g., economic growth)
    • capture fine-grained information not covered by the topics

5. Behavioral Analysis

  • Spread of false news (Vosoughi, Roy, and Aral 2018): false news travels farther and faster, which correlates with novelty and emotional arousal; KL divergence is used to quantify informational distance.
  • Bycroft (accessed June 8, 2024): visualizations of LLMs (large language models).
  • Mu et al. (2025) compare humans and several versions of ChatGPT on a Bayesian decision task: humans are efficient but biased, while GPT-4o is close to optimal.
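The KL-divergence-based novelty measure can be sketched as follows (a minimal illustration with hypothetical topic distributions, not the paper's pipeline):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats for discrete distributions (e.g. topic mixtures).
    A small eps avoids log(0) when a category has zero mass."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical topic distributions: a new tweet vs. the user's recent history.
tweet = [0.7, 0.2, 0.1]
history = [0.3, 0.4, 0.3]
novelty = kl_divergence(tweet, history)  # larger => more novel content
assert novelty > 0
assert kl_divergence(history, history) < 1e-9
```

KL divergence is asymmetric: D_KL(tweet‖history) measures how surprising the tweet is given what the reader has already seen, which is the direction relevant for novelty.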

Speech Technology

1. Speech Synthesis

  • VALL-E & ChatTTS (C. Wang et al. 2023): zero-shot speech generation with emotion and acoustic-environment transfer.
  • AudioLDM (Liu et al. 2023): learns continuous audio representations for speech generation.

2. Speech Representation Learning

  • wav2vec 2.0 (Baevski et al. 2020):
    • Contrastive loss (temperature κ; candidates Q_t contain the true quantized target q_t plus distractors q̃):
      L_m = -log [ exp(sim(c_t, q_t)/κ) / Σ_{q̃ ∈ Q_t} exp(sim(c_t, q̃)/κ) ]
    • Diversity loss (encourages uniform usage of the V entries in each of the G codebooks):
      L_d = (1/GV) Σ_{g=1}^{G} -H(p̄_g) = (1/GV) Σ_{g=1}^{G} Σ_{v=1}^{V} p̄_{g,v} log p̄_{g,v}
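The contrastive term can be computed numerically as a softmax over cosine similarities (a minimal numpy sketch, not the authors' implementation; the vectors and distractor counts here are illustrative):

```python
import numpy as np

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def contrastive_loss(c_t, q_pos, distractors, kappa=0.1):
    """wav2vec 2.0-style contrastive loss for one masked step:
    -log softmax of sim(c_t, q)/kappa over the true quantized target
    plus K distractors sampled from other masked steps."""
    candidates = [q_pos] + list(distractors)
    logits = np.array([cosine_sim(c_t, q) / kappa for q in candidates])
    logits -= logits.max()                       # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]                       # positive is at index 0

rng = np.random.default_rng(0)
c = rng.normal(size=16)
loss_easy = contrastive_loss(c, q_pos=c, distractors=rng.normal(size=(5, 16)))
loss_hard = contrastive_loss(c, q_pos=-c, distractors=[c] * 5)
assert loss_easy < loss_hard   # aligned positive => lower loss
```

When the context vector aligns with the true quantized target, the positive dominates the softmax and the loss is low; a misaligned positive among similar distractors drives the loss up, which is exactly what pushes the encoder toward discriminative representations.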

Applied Practice

1. Text Annotation and Experimental Design

  • DSL framework: expert plus automated annotation (Egami et al. 2023, 2024)
  • GPT annotation: confidence feedback reduces hallucination (Lightman et al. 2023; Jesson et al. 2024; A. G. Kim, Muhn, and Nikolaev 2024)
  • Financial forecasting:
    • GPT-4 earnings predictions beat human analysts (A. G. Kim, Muhn, and Nikolaev 2024)
    • BERT embeddings and ANNs are complementary and improve accuracy (A. G. Kim, Muhn, and Nikolaev 2024; Li et al. 2024)
  • M. Yang et al. (2023): UniSim learns an interactive real-world simulator and shows how to simulate long-horizon interactions to support decision optimization via search, planning, optimal control, or reinforcement learning.

2. Prompt Engineering

  • Key techniques
    • Chain of thought (CoT): explain step by step before emitting the answer (Kok 2024)
    • Distribution statements: state the frequency of rare outcomes in the prompt (e.g., "65% are empty lists") (Kok 2024)
    • Role play: `Act as a top journal editor` (Lin 2024)
    • Use `{}` placeholders to refer to variables (Lin 2024)
  • Automating grounded theory
    • GPT-assisted three-level coding (open → axial → selective) (Nelson 2020; Dunivin 2025; Yaxian Zhou et al. 2024)
    • Optimizing per-code strategies and generating interpretive memos reaches κ agreement of 0.68 (the social-science standard is ≥ 0.6) (Dunivin 2025)
    • Alqazlan et al. (2025): the three-stage HITL-CGT framework
      1. Exploration: manual coding cross-validated with LDA
      2. Modeling: QDTM hierarchical topics with human evaluation (27% of topics discarded)
      3. Interpretation: manual coding with theoretical sampling (sentiment analysis in their Figure 2)
      Core idea: human-machine collaboration brings rigor to qualitative analysis of big data
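The κ agreement statistic used to validate GPT-assisted coding is Cohen's kappa, which corrects raw agreement for chance. A minimal sketch (the codes below are hypothetical, not Dunivin's data):

```python
import numpy as np

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders' labels: (p_o - p_e) / (1 - p_e),
    where p_o is observed agreement and p_e is chance agreement."""
    labels_a, labels_b = np.asarray(labels_a), np.asarray(labels_b)
    cats = np.union1d(labels_a, labels_b)
    p_o = np.mean(labels_a == labels_b)                       # observed agreement
    p_e = sum(np.mean(labels_a == c) * np.mean(labels_b == c) # chance agreement
              for c in cats)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical codes from a human coder vs. a GPT-assisted pass
human = ["open", "axial", "open", "selective", "axial", "open"]
gpt   = ["open", "axial", "open", "axial",     "axial", "open"]
kappa = cohens_kappa(human, gpt)
assert 0 < kappa <= 1
```

A κ of 0.68 therefore means agreement well above what the coders' marginal label frequencies would produce by chance, clearing the ≥ 0.6 convention cited above.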

3. Training

  • Zhao et al. (2025): LLM-as-a-Judge is vulnerable; "master key" attacks (symbols, reasoning openers) fool judges with false-positive rates up to 80%, threatening RLVR. By augmenting training data with 20k negative samples, they train Master-RM, whose FPR is near zero and whose agreement with GPT-4o reaches 0.96, improving the reliability of LLM-based evaluation.

4. Other Applications

  • Urban computing (Lai et al. 2025): LDA analyzes the built environment; XGBoost quantifies its nonlinear relationship with ride-hailing waiting times.
  • Strategic conversation analytics (Y. Chen, Rui, and Whinston 2024): ALBERT quantifies evasive behavior by the informationally advantaged party to predict firm earnings.
  • Human-AI collaboration: experienced workers benefit less from AI, and trust shapes how responsibility is allocated (W. Wang, Gao, and Agarwal 2023).
  • Notes on building a personal AI knowledge base, issues with Kimi, and the strengths of 豆包 (Doubao).

Frontier Explorations

1. Controllable Generation

(Vafa et al. 2025)

  • Decompose the producibility/steerability gap and optimize human-AI interaction mechanisms.
  • Optimal reasoning length at inference time (C. Lee, Rush, and Vafa 2025).

2. Long-Context Techniques

  • Kimi models: support contexts up to 2 million characters (as of 2024), but with limited inductive ability.

3. Cross-Modal Learning

(Song et al. 2024)

  • SSPL: eliminates false-negative (FN) image-audio samples to strengthen semantic alignment, discovering semantic correspondence between audio and visual features using only positive image-audio pairs.

Key Conclusions

Technical Trends

Generative models are replacing traditional methods, augmented by RAG (Guu et al. 2020; Lewis et al. 2020)

Key Challenges

  • Hallucination control (confidence feedback) (Lightman et al. 2023; Jesson et al. 2024; A. G. Kim, Muhn, and Nikolaev 2024)
  • Guarding against training leakage (Ludwig, Mullainathan, and Rambachan 2024): GPT-4o can reproduce congressional bill descriptions and financial news headlines verbatim when asked to complete them, suggesting it saw these texts during training.
  • Long-context optimization

Integration with the Social Sciences

Computational international relations (Ünver 2019) and automated cognitive debriefing (Hinck et al. 2024) exemplify cross-disciplinary innovation.

References

Alqazlan, Lama, Zheng Fang, Michael Castelle, and Rob Procter. 2025. “A Novel, Human-in-the-Loop Computational Grounded Theory Framework for Big Social Data.” Big Data & Society I: 16. https://doi.org/10.1177/20539517251347598.

Babaei, Golnoosh, and Paolo Giudici. 2024. “GPT Classifications, with Application to Credit Lending.” Machine Learning with Applications 16 (100534): 1–6.

Baevski, Alexei, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. “Wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations.” arXiv Preprint arXiv:2006.11477.

Bycroft, Ben. Accessed on June 8, 2024. “LLM Visualization.” Personal website. https://bbycroft.net/.

Chen, Qiguang, Libo Qin, Jinhao Liu, Dengyun Peng, Jiaqi Wang, Mengkang Hu, Zhi Chen, Wanxiang Che, and Ting Liu. 2025. “ECM: A Unified Electronic Circuit Model for Explaining the Emergence of in-Context Learning and Chain-of-Thought in Large Language Models.” https://arxiv.org/abs/2502.03325.

Chen, Xinlei, and Kaiming He. 2020. “Exploring Simple Siamese Representation Learning.” arXiv Preprint arXiv:2011.10566.

Chen, Yanzhen, Huaxia Rui, and Andrew B. Whinston. 2024. “Conversation Analytics: Can Machines Read Between the Lines in Real-Time Strategic Conversations?” Information Systems Research. https://doi.org/10.1287/isre.2022.0415.

Dahwa, Charles. 2023. “Adapting and Blending Grounded Theory with Case Study: A Practical Guide.” Quality & Quantity n.d. (n.d.): 1–23. https://doi.org/10.1007/s11135-023-01793-7.

Deldjoo, Yashar. 2023. “Fairness of ChatGPT and the Role of Explainable-Guided Prompts.” In Challenges and Opportunities of Large Language Models in Real-World Machine Learning Applications, COLLM@ECML-PKDD’23.

Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. “Bert: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv Preprint arXiv:1810.04805.

Dowling, Michael, and Brian Lucey. 2023. “ChatGPT for (Finance) Research: The Bananarama Conjecture.” Finance Research Letters 53: 103662.

Dunivin, Zackary Okun. 2025. “Scaling Hermeneutics: A Guide to Qualitative Coding with LLMs for Reflexive Content Analysis.” EPJ Data Science 14 (28). https://doi.org/10.1140/epjds/s13688-025-00548-8.

Egami, Naoki, Musashi Hinck, Brandon M Stewart, and Hanying Wei. 2024. “Using Large Language Model Annotations for the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analyses.” SocArXiv.

Egami, Naoki, Brandon M. Stewart, Musashi Jacobs-Harukawa, and Hanying Wei. 2023. “Using Large Language Model Annotations for Valid Downstream Statistical Inference in Social Science: Design-Based Semi-Supervised Learning.” arXiv Preprint arXiv:2306.04746.

Feng, Leo, Frederick Tung, Hossein Hajimirsadeghi, Mohamed Osama Ahmed, Yoshua Bengio, and Greg Mori. 2024. “Attention as an RNN.” arXiv Preprint arXiv:2405.13956.

Guu, Kelvin, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. 2020. “REALM: Retrieval-Augmented Language Model Pre-Training.” International Conference on Machine Learning.

Heil, Nils, Zhijing Jin, Shehzaad Dhuliawala, and Jiarui Liu. 2024. “Testing Auto-Personalization of Large Language Models.” arXiv Preprint arXiv:xxxx.xxxxx.

Hinck, Musashi, Uma Ilavarasan, Gary King, Kentaro Nakamura, and Brandon M. Stewart. 2024. “Automated Cognitive Debriefing.” In POLMETH 2024, UC Riverside.

Hofmann, Thomas. 1999. “Probabilistic Latent Semantic Indexing.” In Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 50–57. SIGIR ’99. New York, NY, USA: ACM.

Hu, Edward J., Yelong Shen, Phil Wallis, Yuanzhi Allen-Zhu, Zhili Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. “LoRA: Low-Rank Adaptation of Large Language Models.” arXiv Preprint arXiv:2106.09685. https://arxiv.org/pdf/2106.09685v1.pdf.

Jahani, Eaman, Benjamin S Manning, Joe Zhang, Hong-Yi TuYe, Mohammed Alsobay, Christos Nicolaides, Siddharth Suri, and David Holtz. 2024. “As Generative Models Improve, We Must Adapt Our Prompts.” arXiv Preprint arXiv:2407.14333. https://arxiv.org/abs/2407.14333.

Jesson, Andrew, Nicolas Beltran-Velez, Quentin Chu, Sweta Karlekar, Jannik Kossen, Yarin Gal, John P. Cunningham, and David Blei. 2024. “Estimating the Hallucination Rate of Generative AI.” arXiv Preprint arXiv:2406.07457v1.

Jian, Jie, Siqi Chen, Xin Luo, Tien Lee, and Xiaoming Yu. 2022. “Organized Cyber-Racketeering: Exploring the Role of Internet Technology in Organized Cybercrime Syndicates Using a Grounded Theory Approach.” IEEE Transactions on Engineering Management 69 (6): 1–15. https://doi.org/10.1109/TEM.2020.3002784.

Kim, Alex G., Maximilian Muhn, and Valeri V. Nikolaev. 2024. “Financial Statement Analysis with Large Language Models.” Draft, May.

Kim, Sein, Hongseok Kang, Kibum Kim, Jiwan Kim, Donghyun Kim, Minchul Yang, Kwangjin Oh, Julian McAuley, and Chanyoung Park. 2025. “Lost in Sequence: Do Large Language Models Understand Sequential Recommendation?” In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 1–12. Toronto, ON, Canada: ACM. https://doi.org/10.1145/3711896.3737035.

Kok, Ties de. 2024. “ChatGPT for Textual Analysis? How to Use Generative LLMs in Accounting Research.” Working Paper.

Korinek, Anton. 2023a. “Generative AI for Economic Research: Use Cases and Implications.” Journal of Economic Literature.

———. 2023b. “Language Models and Cognitive Automation for Economic Research.” NBER.

Lai, Jianhui, Yanyan Wang, Yang Yang, Xiaojie Wu, and Yue Zhang. 2025. “Exploring the Built Environment Impacts on Online Car-Hailing Waiting Time: An Empirical Study in Beijing.” Computers, Environment and Urban Systems 115: 102205.

Lan, Zhenzhong, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. “Albert: A Lite Bert for Self-Supervised Learning of Language Representations.” arXiv Preprint arXiv:1909.11942.

Lazer, David M. J., Matthew A. Baum, Yochai Benkler, Adam J. Berinsky, Kelly M. Greenhill, Filippo Menczer, Miriam J. Metzger, et al. 2018. “The Science of Fake News.” Science 359 (6380): 1094–96.

Lee, Celine, Alexander M. Rush, and Keyon Vafa. 2025. “Critical Thinking: Which Kinds of Complexity Govern Optimal Reasoning Length?” arXiv Preprint arXiv:2504.01935v1.

Lee, Mina, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, et al. 2023. “Evaluating Human-Language Model Interaction.” Transactions on Machine Learning Research 14. https://arxiv.org/abs/2212.09746.

Lewis, Patrick, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, et al. 2020. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.” In 34th Conference on Neural Information Processing Systems (NeurIPS 2020).

Li, Peiyao, Noah Castelo, Zsolt Katona, and Miklos Sarvary. 2024. “Frontiers: Determining the Validity of Large Language Models for Automated Perceptual Analysis.” Marketing Science.

Lightman, Hunter, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. 2023. “Let’s Verify Step by Step.” arXiv Preprint arXiv:2305.20050, May.

Lin, Zhicheng. 2024. “Techniques for Supercharging Academic Writing with Generative AI.” Nature Biomedical Engineering xx (xx): xx.

Liu, Haohe, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo Mandic, Wenwu Wang, and Mark D Plumbley. 2023. “AudioLDM: Text-to-Audio Generation with Latent Diffusion Models.” arXiv Preprint arXiv:2301.12503.

Lomo, Marvin Adjei Kojo, and A. F. Salam. 2025. “GTM Approaches with Humans and LLM.” In Thirty-First Americas Conference on Information Systems (AMCIS 2025). Vol. 3. Montréal: AIS Electronic Library (AISeL). https://aisel.aisnet.org/amcis2025/data_science/sig_dsa/3.

Ludwig, Jens, Sendhil Mullainathan, and Ashesh Rambachan. 2024. “Large Language Models: An Applied Econometric Framework.” arXiv Preprint arXiv:2412.07031.

Madhusudhan, Nishanth, Sathwik Tejaswi Madhusudhan, Vikas Yadav, and Masoud Hashemi. 2024. “Do LLMs Know When to NOT Answer? Investigating Abstention Abilities of Large Language Models.” https://arxiv.org/abs/2407.16221.

Mu, Tianshi, Pranjal Rawat, John Rust, Chengjun Zhang, and Qixuan Zhong. 2025. “Who Is More Bayesian: Humans or ChatGPT?” arXiv Preprint arXiv:2504.10636v1, April.

Nelson, Laura K. 2020. “Computational Grounded Theory: A Methodological Framework.” Sociological Methods & Research 49 (1): 3–42. https://doi.org/10.1177/0049124117729703.

Ouyang, Long, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, et al. 2022. “Training Language Models to Follow Instructions with Human Feedback.” arXiv Preprint arXiv:2203.02155.

Peskoff, Denis, Adam Visokay, Sander Schulhoff, Benjamin Wachspress, Alan Blinder, and Brandon M Stewart. 2023. “GPT Deciphering Fedspeak: Quantifying Dissent Among Hawks and Doves.” In Findings of the Association for Computational Linguistics: EMNLP 2023, 6529–39.

Radford, Alec, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, et al. 2021. “Learning Transferable Visual Models from Natural Language Supervision.” arXiv Preprint arXiv:2103.00020.

Reagan, A. J. 2017. “Towards a Science of Human Stories: Using Sentiment Analysis and Emotional Arcs to Understand the Building Blocks of Complex Social Systems.” arXiv Preprint arXiv:1712.06393.

Song, Zengjie, Jiangshe Zhang, Yuxi Wang, Junsong Fan, and Zhaoxiang Zhang. 2024. “Enhancing Sound Source Localization via False Negative Elimination.” IEEE Transactions on Pattern Analysis and Machine Intelligence.

Taylor, Wilson L. 1953. “‘Cloze Procedure’: A New Tool for Measuring Readability.” Journalism Quarterly 30 (4): 415–19.

Ünver, H Akın. 2019. “Computational International Relations What Can Programming, Coding and Internet Research Do for the Discipline?” All Azimuth: A Journal of Foreign Policy and Peace 8 (2): 157–82.

Vafa, Keyon, Sarah Bentley, Jon Kleinberg, and Sendhil Mullainathan. 2025. “What’s Producible May Not Be Reachable: Measuring the Steerability of Generative Models.” arXiv Preprint arXiv:2503.17482. https://arxiv.org/abs/2503.17482.

Vosoughi, Soroush, Deb Roy, and Sinan Aral. 2018. “The Spread of True and False News Online.” Science 359 (6380): 1146–51.

Wang, Chengyi, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, et al. 2023. “Neural Codec Language Models Are Zero-Shot Text to Speech Synthesizers.” arXiv:2301.02111v1. https://arxiv.org/abs/2301.02111.

Wang, Weiquang, Guodong (Gordon) Gao, and Ritu Agarwal. 2023. “Friend or Foe? Teaming Between Artificial Intelligence and Workers with Variation in Experience.” Management Science, 1–2.

White, R. E., and K. Cooper. 2022. “Chapter 9 Grounded Theory.” In Qualitative Research in the Post - Modern Era. Springer Nature Switzerland AG. https://doi.org/10.1007/978-3-030-85124-8_9.

Xu, Tianyang, Shujin Wu, Shizhe Diao, Xiaoze Liu, Xingyao Wang, Yangyi Chen, and Jing Gao. 2024. “SaySelf: Teaching LLMs to Express Confidence with Self-Reflective Rationales.” In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), 5985–98. Singapore: Association for Computational Linguistics. https://arxiv.org/abs/2405.20974.

Xu, Yuzhuang, Shuo Wang, Peng Li, Fuwen Luo, Xiaolong Wang, Weidong Liu, and Yang Liu. 2023. “Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf.” arXiv Preprint arXiv:2309.04658.

Yan, Shi-Qi, Jia-Chen Gu, Yun Zhu, and Zhen-Hua Ling. 2024. “Corrective Retrieval Augmented Generation.” arXiv Preprint arXiv:2401.15884v3.

Yang, Mengjiao, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. 2023. “Learning Interactive Real-World Simulators.” arXiv Preprint arXiv:2310.06114.

Yang, Yiyuan, Zichuan Liu, Lei Song, Kai Ying, Zhiguang Wang, Tom Bamford, Svitlana Vyetrenko, Jiang Bian, and Qingsong Wen. 2026. “Time-RA: Towards Time Series Reasoning for Anomaly with LLM Feedback.” In Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’26), 1–19. KDD ’26. New York, NY, USA: ACM. https://doi.org/XXXXXXX.XXXXXXX.

Zhao, Yulai, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, and Dong Yu. 2025. “One Token to Fool LLM-as-a-Judge.” https://arxiv.org/abs/2507.08794.

Zhong, Zhiqiang, Kuangyu Zhou, and Davide Mottin. 2024. “Harnessing Large Language Models as Post-Hoc Correctors.” arXiv Preprint. https://arxiv.org/abs/[ARXIV_ID].

Zhou, Linjiang, Xiaochuan Shi, Zepeng Wang, Chao Ma, and Lihua Gao. 2025. “Exploration of Applications with ChatGPT for Green Supply Chain Management.” Annals of Operations Research. https://doi.org/10.1007/s10479-025-06713-6.

Zhou, Yang, Jianqing Fan, and Lirong Xue. 2024. “How Much Can Machines Learn Finance from Chinese Text Data?” Management Science.

Zhou, Yaxian, Yufei Yuan, Kai Huang, and Xiangpei Hu. 2024. “Can ChatGPT Perform a Grounded Theory Approach to Do Risk Analysis? An Empirical Study.” Journal of Management Information Systems 41 (4): 982–1015. https://doi.org/10.1080/07421222.2024.2415772.

Ziems, Caleb, Omar Shaikh, Zhehao Zhang, William Held, Jiaao Chen, and Diyi Yang. 2024. “Can Large Language Models Transform Computational Social Science?” Computational Linguistics 50 (1).

唐东格. 2023. “Applications of NLP in Du Xiaoman Risk Management.” DataFunSummit, October. https://mp.weixin.qq.com/s/5i3goEHrD7RBVC6JjwMOag.

Compiled from paper reading and research practice

© 2024 Natural Language Processing Research Notes. All rights reserved.