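AI safety

AI safety is an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from artificial intelligence (AI) systems. It encompasses AI alignment (which aims to ensure AI systems behave as intended), monitoring AI systems for risks, and enhancing their robustness. The field is particularly concerned with existential risks posed by advanced AI models.^[1]^[2]

Beyond technical research, AI safety involves developing norms and policies that promote safety, such as advocating for regulation at different levels of government.^[3]^[4]^[5] The field gained broad attention in 2023, with rapid progress in generative AI and concerns voiced by researchers and CEOs about potential dangers. During the 2023 AI Safety Summit, the United States and the United Kingdom each established their own AI Safety Institute. However, researchers have expressed concern that AI safety measures are not keeping pace with the rapid development of AI capabilities.^[6]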

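Motivations

Scholars discuss current risks from critical systems failures,^[7] algorithmic bias,^[8] and AI-enabled surveillance,^[9] as well as emerging risks such as technological unemployment, digital manipulation,^[10] weaponization,^[11] AI-enabled cyberattacks,^[12] and bioterrorism.^[13] They also discuss speculative risks from losing control of future artificial general intelligence (AGI) agents,^[14] or from AI enabling perpetually stable dictatorships.^[15]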

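Existential safety

Some have criticized concerns about AGI. Andrew Ng, for example, compared them in 2015 to "worrying about overpopulation on Mars when we have not even set foot on the planet yet".^[16] Stuart J. Russell, on the other hand, urges caution, arguing that "it is better to anticipate human ingenuity than to underestimate it".^[17]

AI researchers hold widely differing opinions about the severity and primary sources of risk posed by AI technology,^[18]^[19] though surveys suggest that experts take high-consequence risks seriously. In two surveys of AI researchers, respondents were optimistic about AI overall but placed a 5% probability on an "extremely bad (e.g. human extinction)" outcome of advanced AI.^[18] In a 2022 survey of the natural language processing community, 37% of respondents agreed or weakly agreed that it is plausible that AI decisions could lead to a catastrophe "at least as bad as an all-out nuclear war".^[20]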

History

Risks from AI began to be seriously discussed at the start of the computer age:

Moreover, if we move in the direction of making machines which learn and whose behavior is modified by experience, we must face the fact that every degree of independence we give the machine is a degree of possible defiance of our wishes.
- Norbert Wiener (1949)^[21]

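In 1988, Blay Whitby published a book outlining the need for AI to be developed along ethical and socially responsible lines.^[22]

From 2008 to 2009, the Association for the Advancement of Artificial Intelligence (AAAI) commissioned a study to explore and address the potential long-term societal influences of AI research and development. The panel was generally skeptical of the radical views expressed by science-fiction authors but agreed that "additional research would be valuable on methods for understanding and verifying the range of behaviors of complex computational systems to minimize unexpected outcomes".^[23]

In 2011, Roman Yampolskiy introduced the term "AI safety engineering" at the Philosophy and Theory of Artificial Intelligence conference,^[24] listing prior failures of AI systems and arguing that "the frequency and seriousness of such events will steadily increase as AIs become more capable".^[25]

In 2014, philosopher Nick Bostrom published the book Superintelligence: Paths, Dangers, Strategies. He argued that the rise of AGI has the potential to create various societal problems, ranging from the displacement of the workforce by AI and the manipulation of political and military structures to the possibility of human extinction.^[26] His argument that future advanced systems may pose a threat to human existence prompted Elon Musk,^[27] Bill Gates,^[28] and Stephen Hawking^[29] to voice similar concerns.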

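In 2015, dozens of AI experts signed an open letter on artificial intelligence calling for research on the societal impacts of AI and outlining concrete directions.^[30] To date, the letter has been signed by over 8,000 people, including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart J. Russell.

In the same year, a group of academics led by Professor Stuart J. Russell founded the Center for Human-Compatible AI at the University of California, Berkeley, and the Future of Life Institute awarded $6.5 million in grants for research aimed at "ensuring artificial intelligence (AI) remains safe, ethical and beneficial".^[31]

In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University announced the Public Workshop on Safety and Control for Artificial Intelligence,^[32] one of a series of four White House workshops intended to investigate "the advantages and drawbacks" of AI.^[33] In the same year, Concrete Problems in AI Safety, one of the first and most influential technical AI safety agendas, was published.^[34]

In 2017, the Future of Life Institute sponsored the Asilomar Conference on Beneficial AI, where more than 100 thought leaders formulated principles for beneficial AI, including "Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards".^[35]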

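In 2018, the DeepMind Safety team outlined AI safety problems in specification, robustness,^[36] and assurance. The following year, researchers organized a workshop at ICLR that focused on these problem areas.^[37]

In 2021, Unsolved Problems in ML Safety was published, outlining research directions in robustness, monitoring, alignment, and systemic safety.

In 2023, Rishi Sunak said he wanted the United Kingdom to be the "geographical home of global AI safety regulation" and to host the first global summit on AI safety.^[38] The AI Safety Summit took place in November 2023 and focused on the risks of misuse and loss of control associated with frontier AI models.^[39] During the summit, plans were announced to produce an International Scientific Report on the Safety of Advanced AI.^[40]

In 2024, the United States and the United Kingdom forged a new partnership on the science of AI safety. Following commitments announced at the AI Safety Summit held at Bletchley Park in November 2023, US Commerce Secretary Gina Raimondo and UK Technology Secretary Michelle Donelan signed a memorandum of understanding on 1 April 2024 to jointly develop testing for advanced AI models.^[41]

In 2025, a team of 96 international experts chaired by Yoshua Bengio published the first International AI Safety Report. Commissioned by 30 nations and the United Nations, the report is the first global scientific assessment of the potential risks of advanced AI. It details potential threats stemming from misuse, malfunction, and societal disruption, and aims to inform policy through evidence-based findings without making specific recommendations.^[42]^[43]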

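Research focus

AI safety research areas include robustness, monitoring, and alignment.

Robustness

Adversarial robustness

AI systems are often vulnerable to adversarial examples: inputs to machine learning (ML) models that an attacker has intentionally designed to cause the model to make a mistake.^[44] For example, in 2013, Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence.^[45] This remains an issue for neural networks, although in more recent work the perturbations are generally large enough to be perceptible.^[46]^[47]

Figure: the image on the right is predicted to be an ostrich after the perturbation is applied. (Left) a correctly predicted sample, (center) the applied perturbation magnified 10x, (right) the adversarial example.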
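The idea of a small, targeted perturbation can be sketched on a toy model. The example below is only an illustration: it uses the fast gradient sign method (a later and simpler attack than the one in the paper cited above) on a hand-written logistic-regression classifier. The weights, the input, and `epsilon` are made-up values chosen for the demonstration, not taken from any cited work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(weights, bias, x):
    """Probability of class 1 under a logistic-regression model."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

def fgsm_perturb(weights, bias, x, label, epsilon):
    """Shift every feature by +/- epsilon in the direction that increases the loss.

    For logistic regression with cross-entropy loss, the gradient of the loss
    with respect to feature i is (p - label) * weights[i].
    """
    p = predict_proba(weights, bias, x)
    grad = [(p - label) * w for w in weights]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + epsilon * sign(g) for xi, g in zip(x, grad)]

# Hypothetical toy model and input (assumptions for illustration only).
weights = [2.0, -3.0, 1.0]
bias = 0.0
x = [0.5, 0.1, 0.2]          # true label is 1
epsilon = 0.2                # maximum change allowed per feature

x_adv = fgsm_perturb(weights, bias, x, label=1, epsilon=epsilon)
print("clean prediction:", predict_proba(weights, bias, x))
print("adversarial prediction:", predict_proba(weights, bias, x_adv))
```

In this toy setting each feature moves by at most `epsilon`, yet the predicted class flips. Real attacks on image classifiers apply the same principle across many thousands of pixels, which is why the per-pixel change can be very small.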

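Adversarial robustness is often associated with security.^[48] Researchers demonstrated that an audio signal could be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker chooses.^[49] Network intrusion^[50] and malware^[51] detection systems must also be adversarially robust, since attackers may design their attacks to fool detectors.

Models that represent objectives (reward models) must also be adversarially robust. For example, a reward model might estimate how helpful a text response is, and a language model might be trained to maximize this score.^[52] Researchers have shown that if a language model is trained for long enough, it will exploit the vulnerabilities of the reward model to achieve a higher score while performing worse on the intended task.^[53] This issue can be addressed by improving the adversarial robustness of the reward model.^[54] More generally, any AI system used to evaluate another AI system must be adversarially robust. This could include monitoring tools, since they could also be tampered with to produce a higher reward.^[55]

Large language models (LLMs) can be vulnerable to prompt injection^[56] and model stealing,^[57] and can be used to generate misinformation.^[58] Prompt injection is the embedding of instructions within a prompt in order to bypass safety measures.^[56]

Fault tolerance and redundancy

Researchers in safety-critical AI have proposed using architectural redundancy and design diversity to reduce the risk that a single faulty, compromised, or deceptive model causes harm.^[59] In this approach, several independently developed or trained models handle the same task, and their outputs are combined through a voting or consensus mechanism rather than relying on a single model.^[60]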
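A voting mechanism of this kind can be sketched in a few lines. The function below is a hypothetical illustration, not an implementation from the cited papers: it accepts a label only when a strict majority of models agree, and otherwise returns `None` so that the caller can fall back to a safe default or a human operator. The name `majority_vote` and the `min_agreement` parameter are assumptions made for this example.

```python
from collections import Counter

def majority_vote(predictions, min_agreement=0.5):
    """Return the label that more than `min_agreement` of the models agree on.

    Returns None when no label reaches that share, signalling that the
    ensemble has no consensus and the decision should be escalated.
    """
    if not predictions:
        return None
    label, count = Counter(predictions).most_common(1)[0]
    return label if count / len(predictions) > min_agreement else None

# Three independently trained (hypothetical) models classify the same road sign.
print(majority_vote(["stop", "stop", "yield"]))   # two of three agree
print(majority_vote(["stop", "yield", "speed"]))  # no consensus
```

Refusing to answer without consensus trades availability for safety: a single faulty or compromised model cannot decide the outcome on its own, but the system needs a defined fallback for the no-consensus case.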

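Monitoring

Estimating uncertainty

It is often important for human operators to gauge how much they should trust an AI system, especially in high-stakes settings such as medical diagnosis.^[61] ML models generally express confidence by outputting probabilities; however, they are often overconfident,^[62] especially in situations that differ from those they were trained to handle.^[63] Calibration research aims to make model probabilities correspond as closely as possible to the true proportion of cases in which the model is correct.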
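One common way to quantify the gap between stated confidence and actual accuracy is the expected calibration error (ECE). The sketch below is a minimal version with equal-width confidence bins; the function name, the bin count, and the sample data are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins.

    confidences: predicted probability of the chosen class, each in [0, 1]
    correct:     booleans, True where the prediction was right
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece

# A model that says "90% sure" and is right 9 times out of 10 is well calibrated.
calibrated = expected_calibration_error([0.9] * 10, [True] * 9 + [False])
# The same confidence with only 5 correct answers is overconfident.
overconfident = expected_calibration_error([0.9] * 10, [True] * 5 + [False] * 5)
print(calibrated, overconfident)
```

A value near zero means the probabilities can be read at face value; a large value, as in the second case, means the model's stated confidence overstates how often it is right.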

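Similarly, anomaly detection or out-of-distribution (OOD) detection aims to identify when an AI system is in an unusual situation. For example, if a sensor on an autonomous vehicle malfunctions, or the vehicle encounters challenging terrain, the system should alert the driver to take control or pull over.^[64] Anomaly detection has been implemented by simply training a classifier to distinguish anomalous from non-anomalous inputs,^[65] though a range of additional techniques are also in use.^[66]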
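One of the simplest additional techniques is the maximum-softmax-probability baseline, which flags an input as possibly out of distribution when the classifier's top class probability is low. This is a different approach from the trained anomaly classifier described above. The sketch below assumes the classifier exposes raw logits; the threshold of 0.7 is an arbitrary illustrative value that would need tuning on held-out data.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def is_out_of_distribution(logits, threshold=0.7):
    """Flag the input when the model's top class probability is below threshold."""
    return max(softmax(logits)) < threshold

# Hypothetical logits: one confident prediction, one where no class stands out.
print(is_out_of_distribution([8.0, 0.5, 0.1]))   # confident, treated as in-distribution
print(is_out_of_distribution([1.0, 1.1, 0.9]))   # near-uniform, flagged for review
```

Because models are often overconfident on unfamiliar inputs, this baseline misses some anomalies; it is best read as a cheap first filter rather than a guarantee.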

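Detecting malicious use

Scholars and government agencies have expressed concern that AI systems could be used to help malicious actors build weapons,^[67] manipulate public opinion,^[68] or automate cyberattacks.^[69] These worries are a practical concern for companies such as OpenAI that host powerful AI tools online. To prevent misuse, OpenAI has built detection systems that flag or restrict users based on their activity.^[70]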

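Transparency

Neural networks have often been described as "black boxes",^[71] meaning that it is difficult to understand why they make the decisions they do, owing to the massive number of computations they perform.^[72] This makes failures hard to anticipate. In 2018, a self-driving car killed a pedestrian after failing to identify them. Because of the black-box nature of the AI software, the reason for the failure remains unclear.^[73] It has also prompted debate in healthcare over whether statistically efficient but opaque models should be used.^[74]

One critical benefit of transparency is explainability.^[75] It is sometimes a legal requirement to explain why a decision was made in order to ensure fairness, for example when automatically filtering job applications or assigning credit scores.^[75]

Another benefit is revealing the cause of failures. At the beginning of the COVID-19 pandemic in 2020, researchers used transparency tools to show that medical image classifiers were "paying attention" to irrelevant hospital labels.^[76]

Transparency techniques can also be used to correct errors. For example, in the paper "Locating and Editing Factual Associations in GPT", the authors identified model parameters that influenced how the model answered questions about the location of the Eiffel Tower. They were then able to "edit" this knowledge so that the model responded to questions as if it believed the tower was in Rome instead of France.^[77] Although in this case the authors deliberately induced an error, these methods could potentially be used to fix errors efficiently. Model editing techniques also exist in computer vision.^[78]

Finally, some have argued that the opacity of AI systems is a significant source of risk and that a better understanding of how they function could prevent high-consequence failures in the future.^[79] "Inner" interpretability research aims to make ML models less opaque. One goal of this research is to identify what the internal neuron activations represent.^[80]^[81] For example, researchers identified a neuron in the CLIP artificial intelligence system that responds to images of people in Spider-Man costumes, sketches of Spider-Man, and the word "spider".^[82] It also involves explaining the connections between these neurons, or "circuits".^[83] For example, researchers have identified pattern-matching mechanisms in transformer attention that may play a role in how language models learn from their context.^[84] Inner interpretability has been compared to neuroscience. In both cases, the goal is to understand what is going on in an intricate system, although ML researchers have the benefit of being able to take perfect measurements and perform arbitrary ablations.^[85]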

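Detecting trojans

Machine learning models can potentially contain "trojans" or "backdoors": vulnerabilities that malicious actors deliberately build into an AI system. For example, a trojaned facial recognition system could grant access when a specific piece of jewelry is in view,^[86] or a trojaned autonomous vehicle may function normally until a specific trigger appears.^[87] For some large models, such as CLIP or GPT-3, planting a trojan may not be difficult because they are trained on publicly available internet data.^[88] Researchers were able to plant a trojan in an image classifier by changing just 300 out of 3 million training images. In addition to posing a security risk, researchers have argued that trojans provide a concrete setting for testing and developing better monitoring tools.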
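To make the scale of such an attack concrete, the toy sketch below shows what poisoning a small fraction of a training set looks like, and why it is hard to spot by inspection. It is a simplified illustration on made-up data, not the procedure used in the cited work: images are flat lists of pixel values, the "trigger" is a single overwritten pixel, and every name here is hypothetical.

```python
import random

def poison_dataset(images, labels, target_label, n_poison, trigger_value=1.0, seed=0):
    """Return copies of the data with a trigger stamped on n_poison examples.

    Each chosen example has its last pixel overwritten with `trigger_value`
    and its label changed to `target_label`. The inputs are not modified.
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(images)), n_poison))
    poisoned_images, poisoned_labels = [], []
    for i, (img, lab) in enumerate(zip(images, labels)):
        if i in chosen:
            img = img[:-1] + [trigger_value]  # stamp the trigger pixel
            lab = target_label
        poisoned_images.append(list(img))
        poisoned_labels.append(lab)
    return poisoned_images, poisoned_labels, chosen

# Toy data: 10,000 blank 4-pixel "images", all labelled 0.
images = [[0.0] * 4 for _ in range(10_000)]
labels = [0] * 10_000

# Poison the same fraction as 300 out of 3 million, i.e. 0.01% of the data.
fraction = 300 / 3_000_000
n_poison = round(fraction * len(images))
p_images, p_labels, chosen = poison_dataset(images, labels, target_label=7, n_poison=n_poison)
print(f"{n_poison} of {len(images)} examples altered ({fraction:.2%})")
```

With only one example in ten thousand changed, random spot checks of the data are unlikely to find the altered samples, which is part of why dedicated trojan-detection tools are an active research topic.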

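A 2024 research paper by Anthropic showed that large language models can be trained with persistent backdoors. These "sleeper agent" models can be made to generate malicious outputs (such as vulnerable code) after a specific date while behaving normally beforehand. Standard AI safety measures, such as supervised fine-tuning, reinforcement learning, and adversarial training, failed to remove these backdoors.^[89]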

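Alignment

This section is an excerpt from the article AI alignment.

In the field of artificial intelligence, alignment aims to steer AI systems toward a person's or group's intended goals, preferences, or ethical principles. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives.^[90]

It is often challenging for AI designers to specify the full range of desired and undesired behaviors. They therefore often use simpler proxy goals, such as gaining human approval. But proxy goals can overlook necessary constraints or reward the AI system for merely appearing aligned.^[91]^[92] AI systems may also find loopholes that allow them to accomplish their proxy goals efficiently but in unintended, sometimes harmful, ways (reward hacking).^[93]

Advanced AI systems may develop unwanted instrumental strategies, such as seeking power or self-preservation, because such strategies help them achieve their assigned final goals.^[94]^[95]^[96] Furthermore, they might develop undesirable emergent goals that could be hard to detect before the system is deployed and encounters new situations and distribution shifts.^[97]^[98] Empirical research in 2024 showed that advanced large language models (LLMs) such as OpenAI o1 or Claude 3 sometimes engage in strategic deception to achieve their goals or to prevent them from being changed.^[99]^[100]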

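Some of these issues affect existing commercial systems such as LLMs,^[101]^[102] robots,^[103] autonomous vehicles,^[104] and social media recommendation engines.^[105]^[106]^[107] Some AI researchers argue that more capable future systems will be more severely affected, because these problems partly result from high capabilities.^[108]^[109]

Many prominent AI researchers and leaders of AI companies have argued that AI is approaching human-like (AGI) and superhuman (ASI) cognitive capabilities, and could endanger human civilization if misaligned.^[110]^[111] They include the "AI godfathers" Geoffrey Hinton and Yoshua Bengio and the CEOs of OpenAI, Anthropic, and Google DeepMind.^[112]^[113] These risks remain debated.^[114]

AI alignment is a subfield of AI safety, the study of how to build safe AI systems.^[115]^[116] Other subfields of AI safety include robustness, monitoring, and capability control.^[117] Research challenges in alignment include instilling complex values in AI, developing honest AI, scalable oversight, auditing and interpreting AI models, and preventing emergent AI behaviors such as power-seeking.^[117] Alignment research has connections to interpretability research,^[118]^[119] (adversarial) robustness,^[120] anomaly detection, calibrated uncertainty,^[118] formal verification,^[121] preference learning,^[122]^[123] safety-critical engineering,^[124] game theory,^[125] algorithmic fairness,^[120]^[126] and the social sciences.^[127]^[128]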

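Systemic safety and sociotechnical factors

It is common for AI risks (and technological risks more generally) to be categorized as misuse or accidents.^[129] Some scholars have suggested that this framework falls short.^[129] For example, the Cuban Missile Crisis was not clearly an accident or a misuse of technology.^[129] Policy analysts Zwetsloot and Dafoe wrote, "The misuse and accident perspectives tend to focus only on the last step in a causal chain leading up to a harm: that is, the person who misused the technology, or the system that behaved in unintended ways... Often, though, the relevant causal chain is much longer." Risks often arise from "structural" or "systemic" factors such as competitive pressures, diffusion of harms, fast-paced development, high levels of uncertainty, and inadequate safety culture.^[129] In the broader context of safety engineering, structural factors such as "organizational safety culture" play a central role in the popular STAMP risk analysis framework.^[130]

Inspired by the structural perspective, some researchers have emphasized the importance of using machine learning to improve sociotechnical safety factors, for example, using ML for cyber defense, improving institutional decision-making, and facilitating cooperation.^[131] Others have emphasized the importance of involving both AI practitioners and domain experts in the design process to address structural vulnerabilities.^[132]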

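Cyber defense

Some scholars are concerned that AI will exacerbate the already imbalanced game between cyber attackers and cyber defenders.^[133] This would increase "first strike" incentives and could lead to more aggressive and destabilizing attacks. To mitigate this risk, some have advocated for an increased emphasis on cyber defense. In addition, software security is essential for preventing powerful AI models from being stolen and misused.^[134] Recent studies have shown that AI can significantly enhance both technical and managerial cybersecurity work by automating routine tasks and improving overall efficiency.^[135]

AI safety research also examines defensive techniques for protecting ML systems against data poisoning attacks during training. In particular, label-flipping attacks can degrade model performance while being difficult to detect with traditional data validation methods. To address this risk, recent research has proposed model-agnostic detection pipelines that monitor learning behavior and combine multiple detectors to identify suspicious training samples. These methods aim to strengthen cyber defense by improving the resilience and trustworthiness of AI systems operating in adversarial environments.^[136]^[137]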

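Improving institutional decision-making

The advancement of AI in economic and military domains could precipitate unprecedented political challenges.^[138] Some scholars have compared AI race dynamics to the Cold War, where the careful judgment of a small number of decision-makers often spelled the difference between stability and catastrophe.^[139] AI researchers have argued that AI technologies could also be used to assist decision-making.^[140] For example, researchers are beginning to develop AI forecasting^[141] and advisory systems.^[142]

Facilitating cooperation

Many of the largest global threats (nuclear war,^[143] climate change,^[144] etc.) have been framed as cooperation challenges. As in the well-known prisoner's dilemma, some dynamics may lead to poor results for all players even when each is acting optimally in its own self-interest. For example, no single actor has strong incentives to address climate change, even though the consequences may be severe if no one intervenes.^[144]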
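The prisoner's dilemma mentioned above can be written down as a small payoff table. The sketch below uses the conventional textbook payoffs (3, 0, 5, 1), which are illustrative numbers rather than values from the cited sources, and checks that defecting is each player's best response no matter what the other does, even though mutual defection leaves both worse off than mutual cooperation.

```python
# Payoffs are (row player, column player); higher is better.
# The specific numbers are the usual textbook values, assumed for illustration.
PAYOFFS = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}
ACTIONS = ("cooperate", "defect")

def best_response(opponent_action):
    """Action that maximizes the row player's payoff against a fixed opponent action."""
    return max(ACTIONS, key=lambda a: PAYOFFS[(a, opponent_action)][0])

for other in ACTIONS:
    print(f"if the other player chooses {other}, the best response is {best_response(other)}")

mutual_defection = PAYOFFS[("defect", "defect")]
mutual_cooperation = PAYOFFS[("cooperate", "cooperate")]
print("both defect:", mutual_defection, "both cooperate:", mutual_cooperation)
```

The same structure underlies the climate example: each actor's individually rational choice (not paying the cost of mitigation) produces an outcome that every actor would prefer to avoid.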

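A salient AI cooperation challenge is avoiding a "race to the bottom".^[145] In this scenario, countries or companies race to build more capable AI systems and neglect safety, leading to a catastrophic accident that harms everyone involved. Concerns about scenarios like these have inspired both political^[146] and technical^[147] efforts to facilitate cooperation between humans, and potentially also between AI systems. Most AI research focuses on designing individual agents to serve isolated functions (often in "single-player" games).^[148] Scholars have suggested that as AI systems become more autonomous, it may become essential to study and shape the way they interact.^[149]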

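In governance

AI governance is broadly concerned with creating norms, standards, and regulations to guide the use and development of AI systems.^[151]

In AI safety, local solutions focus on individual AI systems, ensuring that they are safe and beneficial, while global solutions seek to implement safety measures for all AI systems across jurisdictions.^[152]

AI safety governance research ranges from foundational investigations into the potential impacts of AI to specific applications. On the foundational side, researchers have argued that AI could transform many aspects of society because of its broad applicability, comparing it to electricity and the steam engine.^[153] Some work has focused on anticipating specific risks that may arise from these impacts, such as mass unemployment,^[154] weaponization,^[155] disinformation,^[156] surveillance,^[157] and the concentration of power.^[158] Other work explores underlying risk factors such as the difficulty of regulating a rapidly evolving AI industry,^[159] the availability of AI models,^[160] and "race to the bottom" dynamics.^[161]^[162] Allan Dafoe, the head of long-term governance and strategy at DeepMind, has emphasized the dangers of racing and the potential need for cooperation: "it may be close to a necessary and sufficient condition for AI safety and alignment that there be a high degree of caution prior to deploying advanced powerful systems; however, if actors are competing in a domain with large returns to first-movers or relative advantage, then they will be pressured to choose a sub-optimal level of caution".^[163] One research stream focuses on developing approaches, frameworks, and methods to assess AI accountability, guiding and promoting audits of AI-based systems.^[164]^[165]^[166] A key challenge for these approaches is the lack of widely accepted standards and ambiguity about what the methods require,^[167]^[168] along with a lack of safety culture within the industry.^[169]

To improve AI safety, a variety of frameworks have been developed that aim to align AI outputs with ethical guidelines and reduce risks such as misuse and data leakage. Tools such as Nvidia's Guardrails,^[170] Llama Guard,^[171] Preamble's customizable guardrails,^[172] and Claude's Constitution mitigate vulnerabilities such as prompt injection and help ensure that outputs adhere to predefined principles. These frameworks are often integrated into AI systems to improve safety and reliability.^[173]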

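Philosophical perspectives

The field of AI safety is closely intertwined with philosophical considerations, particularly in ethics. Deontological ethics, which emphasizes adherence to moral rules, has been proposed as a framework for aligning AI systems with human values. Some scholars argue that embedding deontological principles can guide AI systems away from actions that cause harm and keep their operation within ethical boundaries,^[174] but these proposals have been challenged, and alternatives that their proponents consider more promising have been put forward.^[175]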

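Government action

Some experts have argued that it is too early to regulate AI, expressing concern that regulation will hamper innovation and that it would be foolish to "rush to regulate in ignorance".^[176]^[177] Others, such as business magnate Elon Musk, call for pre-emptive action to mitigate catastrophic risks.^[178]

Outside of formal legislation, government agencies have put forward ethical and safety recommendations. In March 2021, the US National Security Commission on Artificial Intelligence reported that advances in AI may make it "increasingly critical to assure that systems are aligned with goals and values, including safety, robustness and trustworthiness".^[179] Subsequently, the National Institute of Standards and Technology drafted a framework for managing AI risk, which advises that when catastrophic risks are present, "development and deployment should cease in a safe manner until risks can be sufficiently managed".^[180]

In September 2021, the People's Republic of China published ethical guidelines for the use of AI, emphasizing that AI decisions should remain under human control and calling for accountability mechanisms. The China Media Project has argued that key aspects of China's approach remain fundamentally unsafe by the standards of democratic societies around the world, and that part of China's approach to AI safety is focused on strengthening the Chinese Communist Party's control of information.^[185]

In the same month, the United Kingdom published its 10-year National AI Strategy,^[181] which states that the British government "takes the long term risk of non-aligned Artificial General Intelligence, and the unforeseeable changes that it would mean for the world, seriously".^[182] The strategy describes actions to assess long-term AI risks, including catastrophic risks.^[182] The British government held the first global AI safety summit on 1 and 2 November 2023, which was described as "an opportunity for policymakers and world leaders to consider the immediate and future risks of AI and how these risks can be mitigated via a globally coordinated approach".^[183]^[184]

Government organizations, particularly in the United States, have also encouraged the development of technical AI safety research. The Intelligence Advanced Research Projects Activity (IARPA) initiated the TrojAI project to identify and protect against trojan attacks on AI systems.^[186] The Defense Advanced Research Projects Agency (DARPA) conducts research on explainable artificial intelligence and on improving robustness against adversarial attacks.^[187]^[188] The National Science Foundation (NSF) supports the Center for Trustworthy Machine Learning and provides millions of dollars in funding for empirical AI safety research.^[189]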

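In 2024, the United Nations General Assembly adopted the first global resolution on the promotion of "safe, secure and trustworthy" AI systems, which emphasizes the respect, protection, and promotion of human rights in the design, development, deployment, and use of AI.^[190]

In May 2024, the UK Department for Science, Innovation and Technology (DSIT) announced £8.5 million in funding for AI safety research under the Systemic AI Safety Fast Grants Programme. The programme is led by Christopher Summerfield and Shahar Avin at the AI Safety Institute, in partnership with UK Research and Innovation. Technology Secretary Michelle Donelan announced the plan at the 2024 AI Seoul Summit, stating that the goal is to make AI safe across society and that promising proposals could receive further funding. The UK also signed an agreement with 10 other countries and the EU to form an international network of AI safety institutes to promote collaboration and share information and resources. In addition, the UK AI Safety Institute planned to open an office in San Francisco.^[191]

In November 2024, then US President Joe Biden and Chinese President Xi Jinping affirmed the need to maintain human control, rather than AI control, over decisions to use nuclear weapons.^[192]^[193] As part of the National Defense Authorization Act for Fiscal Year 2025, Congress included Section 1638, a statement of the sense of Congress on the use of AI to support strategic deterrence. The section provides that the use of AI should not compromise the integrity of nuclear safeguards, whether in the functioning of weapon systems, in the validation of communications from command authorities, or in the principle that positive human action is required to carry out a presidential decision to employ nuclear weapons.^[194]^[195] In February 2026, the Trump administration publicly reaffirmed that nuclear weapons decisions would remain under human control, with a senior Pentagon official restating the Department of Defense policy that a human must be in the loop for all decisions on whether to employ nuclear weapons.^[196]

In September 2025, the French Center for AI Safety (CeSIA), The Future Society, and the Center for Human-Compatible AI (CHAI) jointly launched the Global Call for AI Red Lines, urging governments to reach a binding international agreement by the end of 2026 that prohibits unacceptable uses of AI. The declaration was initially signed by 200 prominent figures, including 10 Nobel laureates, and was announced by Maria Ressa at the United Nations General Assembly.^[197]

In December 2025, President Donald Trump signed an executive order aimed at establishing a national policy framework for AI.^[198] The order discourages state governments from regulating AI and urges Congress to pass a law preempting such regulation.^[199] The White House cited economic and national security reasons for the move, while some critics said it created uncertainty around AI regulation.^[200]^[201]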

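Corporate self-regulation

AI labs and companies generally abide by safety practices and norms that fall outside of formal legislation.^[202] One aim of governance researchers is to shape these norms. Safety recommendations found in the literature include performing third-party audits,^[203] offering bounties for finding failures,^[203] sharing AI incidents^[203] (an AI incident database was created for this purpose),^[204] following guidelines to determine whether to publish research or models,^[205] and improving information security and cybersecurity in AI labs.^[206]

Companies have also made commitments. Cohere, OpenAI, and AI21 proposed and agreed on "best practices for deploying language models", focusing on mitigating misuse.^[207] To avoid contributing to racing dynamics, OpenAI states in its charter that "if a value-aligned, safety-conscious project comes close to building AGI before we do, we commit to stop competing with and start assisting this project".^[208] In addition, industry leaders such as DeepMind CEO Demis Hassabis and Facebook's director of AI Yann LeCun have signed open letters such as the Asilomar Principles^[209] and the Autonomous Weapons Open Letter.^[210]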

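Nonprofit organizations

In the 2020s, a number of nonprofit organizations focused on AI safety and related public policy emerged in the United States and Europe, including the Alliance for Secure AI, the Future of Life Institute, and Public First Action.^[211] These organizations often act as watchdogs over Silicon Valley and advocate for specific federal, state, or local regulations.^[212] They also compete with industry groups such as Leading the Future, which argue for looser regulation of AI companies.^[213]^[214]

International regulation

Under the framework of the Convention on Certain Conventional Weapons, states have been discussing lethal autonomous weapons systems since 2014. In 2016, the parties to the convention established an open-ended Group of Governmental Experts on Lethal Autonomous Weapons Systems to continue these discussions.^[215] The discussions cover international humanitarian law, accountability, possible prohibitions and regulations, and the degree of human control required over AI-enabled weapons.^[216]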

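See also

AI alignment
Artificial intelligence in warfare
Artificial intelligence and elections
Artificial intelligence detection software
Hallucination (artificial intelligence)

References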

^ Ahmed, Shazeda; Jaźwińska, Klaudia; Ahlawat, Archana; Winecoff, Amy; Wang, Mona. Field-building and the epistemic culture of AI safety. First Monday. 2024-04-14. ISSN 1396-0466. doi:10.5210/fm.v29i4.13626.
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916.
^ Champagne, Dylan. President Trump Targets State AI Regulations. The Regulatory Review. 2026-02-26 [2026-02-27].
^ What is California's AI safety law?. Brookings. 2025-12-23 [2026-02-27].
^ Artificial Intelligence 2024 Legislation. www.ncsl.org. [2026-02-27].
^ Perrigo, Billy. U.K.'s AI Safety Summit Ends With Limited, but Meaningful, Progress. Time. 2023-11-02 [2024-06-02].
^ De-Arteaga, Maria. Machine Learning in High-Stakes Settings: Risks and Opportunities (PhD thesis). Carnegie Mellon University. 2020-05-13.
^ Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina; Galstyan, Aram. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys. 2021, 54 (6): 1–35 [2022-11-28]. ISSN 0360-0300. S2CID 201666566. arXiv:1908.09635. doi:10.1145/3457607. Archived from the original on 2022-11-23.
^ Feldstein, Steven. The Global Expansion of AI Surveillance (Report). Carnegie Endowment for International Peace. 2019.
^ Barnes, Beth. Risks from AI persuasion. Lesswrong. 2021 [2022-11-23]. Archived from the original on 2022-11-23.
^ Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo - University of Cambridge Repository. 2018-04-30 [2022-11-28]. S2CID 3385567. doi:10.17863/cam.22520. Archived from the original on 2022-11-23.
^ Davies, Pascale. How NATO is preparing for a new era of AI cyber attacks. euronews. December 26, 2022 [2024-03-23].
^ Ahuja, Anjana. AI's bioterrorism potential should not be ruled out. Financial Times. February 7, 2024 [2024-03-23].
^ Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353 .
^ Minardi, Di. The grim fate that could be 'worse than extinction'. BBC. 16 October 2020 [2024-03-23].
^ AGI Expert Peter Voss Says AI Alignment Problem is Bogus | NextBigFuture.com. 2023-04-04 [2023-07-23].
^ Dafoe, Allan. Yes, We Are Worried About the Existential Risk of Artificial Intelligence. MIT Technology Review. 2016 [2022-11-28]. Archived from the original on 2022-11-28.
^ Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain. Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research. 2018-07-31, 62: 729–754 [2022-11-28]. ISSN 1076-9757. S2CID 8746462. arXiv:1705.08807. doi:10.1613/jair.1.11222. Archived from the original on 2023-02-10.
^ Stein-Perlman, Zach; Weinstein-Raun, Benjamin; Grace. 2022 Expert Survey on Progress in AI. AI Impacts. 2022-08-04 [2022-11-23]. Archived from the original on 2022-11-23.
^ Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita; Pang, Richard Yuanzhe; Phang, Jason; Bowman, Samuel R. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey. Association for Computational Linguistics. 2022-08-26. arXiv:2208.12852 .
^ Markoff, John. In 1949, He Imagined an Age of Robots. The New York Times. 2013-05-20 [2022-11-23]. ISSN 0362-4331. Archived from the original on 2022-11-23.
^ Artificial intelligence: A handbook of professionalism. University of Sussex. January 1988. ISBN 978-0-470-21103-8.
^ Association for the Advancement of Artificial Intelligence. AAAI Presidential Panel on Long-Term AI Futures. [2022-11-23]. Archived from the original on 2022-09-01.
^ PT-AI 2011 – Philosophy and Theory of Artificial Intelligence (PT-AI 2011). [2022-11-23]. Archived from the original on 2022-11-23.
^ Yampolskiy, Roman V. Müller, Vincent C. (ed.). Artificial Intelligence Safety Engineering: Why Machine Ethics is a Wrong Approach. Philosophy and Theory of Artificial Intelligence. Studies in Applied Philosophy, Epistemology and Rational Ethics. Berlin; Heidelberg, Germany: Springer Berlin Heidelberg. 2013, 5: 389–396 [2022-11-23]. ISBN 978-3-642-31673-9. doi:10.1007/978-3-642-31674-6_29. Archived from the original on 2023-03-15.
^ McLean, Scott; Read, Gemma J. M.; Thompson, Jason; Baber, Chris; Stanton, Neville A.; Salmon, Paul M. The risks associated with Artificial General Intelligence: A systematic review. Journal of Experimental & Theoretical Artificial Intelligence. 2023-07-04, 35 (5): 649–663. Bibcode:2023JETAI..35..649M. ISSN 0952-813X. S2CID 238643957. doi:10.1080/0952813X.2021.1964003. hdl:11343/289595.
^ Wile, Rob. Elon Musk: Artificial Intelligence Is 'Potentially More Dangerous Than Nukes'. Business Insider. August 3, 2014 [2024-02-22].
^ Kuo, Kaiser. Baidu CEO Robin Li interviews Bill Gates and Elon Musk at the Boao Forum, March 29, 2015. Event occurs at 55:49. 2015-03-31 [2022-11-23]. Archived from the original on 2022-11-23.
^ Cellan-Jones, Rory. Stephen Hawking warns artificial intelligence could end mankind. BBC News. 2014-12-02 [2022-11-23]. Archived from the original on 2015-10-30.
^ Future of Life Institute. Research Priorities for Robust and Beneficial Artificial Intelligence: An Open Letter. Future of Life Institute. [2022-11-23]. Archived from the original on 2022-11-23.
^ Future of Life Institute. AI Research Grants Program. Future of Life Institute. October 2016 [2022-11-23]. Archived from the original on 2022-11-23.
^ SafArtInt 2016. [2022-11-23]. Archived from the original on 2022-11-23.
^ Bach, Deborah. UW to host first of four White House public workshops on artificial intelligence. UW News. 2016 [2022-11-23]. Archived from the original on 2022-11-23.
^ Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan. Concrete Problems in AI Safety. 2016-07-25. arXiv:1606.06565 .
^ Future of Life Institute. AI Principles. Future of Life Institute. [2022-11-23]. Archived from the original on 2022-11-23.
^ Bengio, Yoshua; Privitera, Daniel; Besiroglu, Tamay; Bommasani, Rishi; Casper, Stephen; Choi, Yejin; Goldfarb, Danielle; Heidari, Hoda; Khalatbari, Leila. International Scientific Report on the Safety of Advanced AI (Report). Department for Science, Innovation and Technology. May 2024.
^ SafeML ICLR 2019 Workshop. [2022-11-23]. Archived from the original on 2022-11-23.
^ Browne, Ryan. British Prime Minister Rishi Sunak pitches UK as home of A.I. safety regulation as London bids to be next Silicon Valley. CNBC. 2023-06-12 [2023-06-25].
^ Bertuzzi, Luca. UK's AI safety summit set to highlight risk of losing human control over 'frontier' models. Euractiv. October 18, 2023 [March 2, 2024].
^ Bengio, Yoshua; Privitera, Daniel; Bommasani, Rishi; Casper, Stephen; Goldfarb, Danielle; Mavroudis, Vasilios; Khalatbari, Leila; Mazeika, Mantas; Heidari, Hoda. International Scientific Report on the Safety of Advanced AI (PDF). GOV.UK. 2024-05-17 [2024-07-08]. Archived (PDF) from the original on 2024-06-15.
^ Shepardson, David. US, Britain announce partnership on AI safety, testing. 1 April 2024 [2 April 2024].
^ What International AI Safety report says on jobs, climate, cyberwar and more. The Guardian. 2025-01-29 [2025-03-03]. ISSN 0261-3077.
^ Launch of the First International Report on AI Safety chaired by Yoshua Bengio. mila.quebec. January 29, 2025 [2025-03-03].
^ Goodfellow, Ian; Papernot, Nicolas; Huang, Sandy; Duan, Rocky; Abbeel, Pieter; Clark, Jack. Attacking Machine Learning with Adversarial Examples. OpenAI. 2017-02-24 [2022-11-24]. Archived from the original on 2022-11-24.
^ Szegedy, Christian; Zaremba, Wojciech; Sutskever, Ilya; Bruna, Joan; Erhan, Dumitru; Goodfellow, Ian; Fergus, Rob. Intriguing properties of neural networks. ICLR. 2014-02-19. arXiv:1312.6199 .
^ Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy. Adversarial examples in the physical world. ICLR. 2017-02-10. arXiv:1607.02533 .
^ Kannan, Harini; Kurakin, Alexey; Goodfellow, Ian. Adversarial Logit Pairing. 2018-03-16. arXiv:1803.06373 .
^ Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E. Motivating the Rules of the Game for Adversarial Example Research. 2018-07-19. arXiv:1807.06732 .
^ Carlini, Nicholas; Wagner, David. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. IEEE Security and Privacy Workshops. 2018-03-29. arXiv:1801.01944 .
^ Sheatsley, Ryan; Papernot, Nicolas; Weisman, Michael; Verma, Gunjan; McDaniel, Patrick. Adversarial Examples in Constrained Domains. 2022-09-09. arXiv:2011.01183 .
^ Suciu, Octavian; Coull, Scott E.; Johns, Jeffrey. Exploring Adversarial Examples in Malware Detection. IEEE Security and Privacy Workshops. 2019-04-13. arXiv:1810.08280 .
^ Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John. Training language models to follow instructions with human feedback. NeurIPS. 2022-03-04. arXiv:2203.02155 .
^ Gao, Leo; Schulman, John; Hilton, Jacob. Scaling Laws for Reward Model Overoptimization. ICML. 2022-10-19. arXiv:2210.10760 .
^ Yu, Sihyun; Ahn, Sungsoo; Song, Le; Shin, Jinwoo. RoMA: Robust Model Adaptation for Offline Model-based Optimization. NeurIPS. 2021-10-27. arXiv:2110.14188 .
^ Hendrycks, Dan; Mazeika, Mantas. X-Risk Analysis for AI Research. 2022-09-20. arXiv:2206.05862 .
^ Prompt injection attacks might 'never be properly mitigated' UK NCSC warns. TechRadar. 2025-12-09 [2025-12-12].
^ Why Anthropic and OpenAI are obsessed with securing LLM model weights. VentureBeat. 2023-12-15.
^ The rise of AI fake news is creating a 'misinformation superspreader'. The Washington Post. 2023-12-17 [2025-12-12]. ISSN 0190-8286.
^ Brando, Axel; Serra, Isabel; Mezzetti, Enrico; Cazorla, Francisco J.; Perez-Cerrolaza, Jon; Abella, Jaume. On Neural Networks Redundancy and Diversity for Their Use in Safety-Critical Systems. Computer. May 2023, 56 (5): 41-50. doi:10.1109/MC.2023.3236523.
^ Machida, Fumio. N-Version Machine Learning Models for Safety Critical Systems. 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE: 48-51. 2019. doi:10.1109/DSN-W.2019.00017.
^ Tran, Khoa A.; Kondrashova, Olga; Bradley, Andrew; Williams, Elizabeth D.; Pearson, John V.; Waddell, Nicola. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Medicine. 2021, 13 (1): 152. ISSN 1756-994X. PMC 8477474. PMID 34579788. doi:10.1186/s13073-021-00968-x.
^ Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. On calibration of modern neural networks. Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research 70. PMLR: 1321–1330. 2017-08-06.
^ Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian; Dillon, Joshua V.; Lakshminarayanan, Balaji; Snoek, Jasper. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. NeurIPS. 2019-12-17. arXiv:1906.02530 .
^ Bogdoll, Daniel; Breitenstein, Jasmin; Heidecker, Florian; Bieshaar, Maarten; Sick, Bernhard; Fingscheidt, Tim; Zöllner, J. Marius. Description of Corner Cases in Automated Driving: Goals and Challenges. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). 2021: 1023–1028. ISBN 978-1-6654-0191-3. arXiv:2109.09607 . doi:10.1109/ICCVW54120.2021.00119.
^ Hendrycks, Dan; Mazeika, Mantas; Dietterich, Thomas. Deep Anomaly Detection with Outlier Exposure. ICLR. 2019-01-28. arXiv:1812.04606 .
^ Hendrycks, Dan; Gimpel, Kevin. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ICLR. 2018-10-03. arXiv:1610.02136 .
^ Urbina, Fabio; Lentzos, Filippa; Invernizzi, Cédric; Ekins, Sean. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence. 2022, 4 (3): 189–191. ISSN 2522-5839. PMC 9544280. PMID 36211133. doi:10.1038/s42256-022-00465-9.
^ Propaganda-as-a-service may be on the horizon if large language models are abused. VentureBeat. 2021-12-14 [2022-11-24]. Archived from the original on 2022-11-24.
^ Center for Security and Emerging Technology; Buchanan, Ben; Bansemer, John; Cary, Dakota; Lucas, Jack; Musser, Micah. Automating Cyber Attacks: Hype and Reality. 2020 [2022-11-28]. S2CID 234623943. doi:10.51593/2020ca002. Archived from the original on 2022-11-24.
^ Markov, Todor; Zhang, Chong; Agarwal, Sandhini; Eloundou, Tyna; Lee, Teddy; Adler, Steven; Jiang, Angela; Weng, Lilian. New-and-Improved Content Moderation Tooling. OpenAI. 2022-08-10 [2022-11-24]. Archived from the original on 2023-01-11.
^ Savage, Neil. Breaking into the black box of artificial intelligence. Nature. 2022-03-29 [2022-11-24]. PMID 35352042. S2CID 247792459. doi:10.1038/d41586-022-00858-1. Archived from the original on 2022-11-24.
^ Center for Security and Emerging Technology; Rudner, Tim; Toner, Helen. Key Concepts in AI Safety: Interpretability in Machine Learning. CSET Issue Brief. 2021 [2022-11-28]. S2CID 233775541. doi:10.51593/20190042. Archived from the original on 2022-11-24.
^ McFarland, Matt. Uber pulls self-driving cars after first fatal crash of autonomous vehicle. CNNMoney. 2018-03-19 [2022-11-24]. Archived from the original on 2022-11-24.
^ Felder, Ryan Marshall. Coming to Terms with the Black Box Problem: How to Justify AI Systems in Health Care. Hastings Center Report. July 2021, 51 (4): 38–45. ISSN 0093-0334. PMID 33821471. doi:10.1002/hast.1248.
^ Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart; Waldo, James; Weinberger, David; Weller, Adrian. Accountability of AI Under the Law: The Role of Explanation. 2019-12-20. arXiv:1711.01134.
^ Fong, Ruth; Vedaldi, Andrea. Interpretable Explanations of Black Boxes by Meaningful Perturbation. 2017 IEEE International Conference on Computer Vision (ICCV). 2017: 3449–3457. ISBN 978-1-5386-1032-9. arXiv:1704.03296 . doi:10.1109/ICCV.2017.371.
^ Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems. 2022, 35. arXiv:2202.05262 .
^ Bau, David; Liu, Steven; Wang, Tongzhou; Zhu, Jun-Yan; Torralba, Antonio. Rewriting a Deep Generative Model. ECCV. 2020-07-30. arXiv:2007.15646 .
^ Räuker, Tilman; Ho, Anson; Casper, Stephen; Hadfield-Menell, Dylan. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. IEEE SaTML. 2022-09-05. arXiv:2207.13243 .
^ Bau, David; Zhou, Bolei; Khosla, Aditya; Oliva, Aude; Torralba, Antonio. Network Dissection: Quantifying Interpretability of Deep Visual Representations. CVPR. 2017-04-19. arXiv:1704.05796 .
^ McGrath, Thomas; Kapishnikov, Andrei; Tomašev, Nenad; Pearce, Adam; Wattenberg, Martin; Hassabis, Demis; Kim, Been; Paquet, Ulrich; Kramnik, Vladimir. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences. 2022-11-22, 119 (47). Bibcode:2022PNAS..11906625M. ISSN 0027-8424. PMC 9704706. PMID 36375061. arXiv:2111.09259. doi:10.1073/pnas.2206625119.
^ Goh, Gabriel; Cammarata, Nick; Voss, Chelsea; Carter, Shan; Petrov, Michael; Schubert, Ludwig; Radford, Alec; Olah, Chris. Multimodal neurons in artificial neural networks. Distill. 2021, 6 (3). S2CID 233823418. doi:10.23915/distill.00030 .
^ Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris. Curve circuits. Distill. 2021, 6 (1) [5 December 2022]. doi:10.23915/distill.00024.006 (inactive 1 July 2025). Archived from the original on 5 December 2022.
^ Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom. In-context learning and induction heads. Transformer Circuits Thread. 2022. arXiv:2209.11895 .
^ Olah, Christopher. Interpretability vs Neuroscience [rough note]. [2022-11-24]. Archived from the original on 2022-11-24.
^ Gu, Tianyu; Dolan-Gavitt, Brendan; Garg, Siddharth. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. 2019-03-11. arXiv:1708.06733 .
^ Chen, Xinyun; Liu, Chang; Li, Bo; Lu, Kimberly; Song, Dawn. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. 2017-12-14. arXiv:1712.05526 .
^ Carlini, Nicholas; Terzis, Andreas. Poisoning and Backdooring Contrastive Learning. ICLR. 2022-03-28. arXiv:2106.09667 .
^ How 'sleeper agent' AI assistants can sabotage code. The Register. 16 January 2024 [2025-01-12]. Archived from the original on 2024-12-24.
^ Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach 4th. Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.
^ Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach 4th. Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.
^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören. The Alignment Problem from a Deep Learning Perspective. International Conference on Learning Representations. 2022. arXiv:2209.00626 .
^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21].
^ Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach 4th. Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.
^ Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353  [cs.CY].
^ Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.
^ Christian, Brian. The alignment problem: Machine learning and human values. W. W. Norton & Company. 2020 [September 12, 2022]. ISBN 978-0-393-86833-3. OCLC 1233266753. Archived from the original on February 10, 2023.
^ Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David. Goal Misgeneralization in Deep Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR: 12004–12019. 2022-06-28 [2023-03-11].
^ Pillay, Tharin. New Tests Reveal AI's Capacity for Deception. TIME. 2024-12-15 [2025-01-12].
^ Perrigo, Billy. Exclusive: New Research Shows AI Strategically Lying. TIME. 2024-12-18 [2025-01-12].
^ Ouyang, Long; et al. Training language models to follow instructions with human feedback (PDF). NeurIPS. 2022. arXiv:2203.02155 .
^ Zaremba, Wojciech; Brockman, Greg; OpenAI. OpenAI Codex. OpenAI. 2021-08-10 [2022-07-23]. Archived from the original on February 3, 2023.
^ Kober, Jens; Bagnell, J. Andrew; Peters, Jan. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013-09-01, 32 (11): 1238–1274 [September 12, 2022]. ISSN 0278-3649. S2CID 1932843. doi:10.1177/0278364913495721. Archived from the original on October 15, 2022.
^ Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter. Reward (Mis)design for autonomous driving. Artificial Intelligence. 2023-03-01, 316. ISSN 0004-3702. S2CID 233423198. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829.
^ Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik. On the Opportunities and Risks of Foundation Models. Stanford CRFM. 2022-07-12. arXiv:2108.07258 .
^ Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.
^ Stray, Jonathan. Aligning AI Optimization to Community Well-Being. International Journal of Community Well-Being. 2020, 3 (4): 443–463. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676. doi:10.1007/s42413-020-00086-3.
^ Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21].
^ Ngo, Richard; Chan, Lawrence; Mindermann, Sören. The Alignment Problem from a Deep Learning Perspective. International Conference on Learning Representations. 2022. arXiv:2209.00626 .
^ Smith, Craig S. Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'. Forbes. [2023-05-04] (in English).
^ Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.
^ Statement on AI Risk | CAIS. www.safe.ai. [2024-02-11] (in English).
^ Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan. Thousands of AI Authors on the Future of AI. Journal of Artificial Intelligence Research. 2025, 84. arXiv:2401.02843 . doi:10.1613/jair.1.19087 .
^ Perrigo, Billy. Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk. TIME. 2024-02-13 [2024-06-26] (in English).
^ What is AI alignment?. TechTarget. 2023-05-03 [2025-06-28] (in English).
^ Ahmed, Shazeda; Jaźwińska, Klaudia; Ahlawat, Archana; Winecoff, Amy; Wang, Mona. Field-building and the epistemic culture of AI safety. First Monday. 2024-04-14. ISSN 1396-0466. doi:10.5210/fm.v29i4.13626 (in English).
^ Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. DeepMind Safety Research – Medium. 2018-09-27 [2022-07-18]. (Archived from the original on February 10, 2023).
^ Rorvig, Mordechai. Researchers Gain New Understanding From Simple AI. Quanta Magazine. 2022-04-14 [2022-07-18]. (Archived from the original on February 10, 2023).
^ Doshi-Velez, Finale; Kim, Been. Towards A Rigorous Science of Interpretable Machine Learning. 2017-03-02. arXiv:1702.08608  [stat.ML].
^ Amodei, Dario; Olah, Chris. Concrete Problems in AI Safety. 2016-06-21. arXiv:1606.06565 [cs.AI] (in English).
^ Russell, Stuart; Dewey, Daniel; Tegmark, Max. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine. 2015-12-31, 36 (4): 105–114 [September 12, 2022]. ISSN 2371-9621. S2CID 8174496. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. (Archived from the original on February 2, 2023).
^ Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep reinforcement learning from human preferences. Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc.: 4302–4310. 2017. ISBN 978-1-5108-6096-4.
^ Heaven, Will Douglas. The new version of GPT-3 is much better behaved (and should be less toxic). MIT Technology Review. 2022-01-27 [2022-07-18]. (Archived from the original on February 10, 2023).
^ Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay. Taxonomy of Machine Learning Safety: A Survey and Primer. ACM Computing Surveys. 2022-03-07, 55 (8): 1–38. doi:10.1145/3551385.
^ Clifton, Jesse. Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. Center on Long-Term Risk. 2020 [2022-07-18]. (Archived from the original on January 1, 2023).
^ Prunkl, Carina; Whittlestone, Jess. Beyond Near- and Long-Term. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. 2020-02-07: 138–143 [September 12, 2022]. ISBN 978-1-4503-7110-0. doi:10.1145/3375627.3375803. (Archived from the original on October 16, 2022) (in English).
^ Irving, Geoffrey; Askell, Amanda. AI Safety Needs Social Scientists. Distill. 2019-02-19, 4 (2) [September 12, 2022]. ISSN 2476-0757. S2CID 159180422. doi:10.23915/distill.00014. (Archived from the original on February 10, 2023).
^ Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731 .
^ Zwetsloot, Remco; Dafoe, Allan. Thinking About Risks From AI: Accidents, Misuse and Structure. Lawfare. 2019-02-11 [2022-11-24]. (Archived from the original on 2023-08-19).
^ Zhang, Yingyu; Dong, Chuntong; Guo, Weiqun; Dai, Jiabao; Zhao, Ziming. Systems theoretic accident model and process (STAMP): A literature review. Safety Science. 2022, 152 [2022-11-28]. S2CID 244550153. doi:10.1016/j.ssci.2021.105596. (Archived from the original on 2023-03-15) (in English).
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916 .
^ Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731 .
^ Center for Security and Emerging Technology; Hoffman, Wyatt. AI and the Future of Cyber Competition. CSET Issue Brief. 2021 [2022-11-28]. S2CID 234245812. doi:10.51593/2020ca007. (Archived from the original on 2022-11-24).
^ Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo - University of Cambridge Repository. 2018-04-30 [2022-11-28]. S2CID 3385567. doi:10.17863/cam.22520. (Archived from the original on 2022-11-23).
^ Gafni, Ruti; Levy, Yair. The role of artificial intelligence (AI) in improving technical and managerial cybersecurity tasks' efficiency. Information & Computer Security. 2024-01-01, 32 (5): 711–728. ISSN 2056-4961. doi:10.1108/ICS-04-2024-0102.
^ Abroshan, Hossein. AI to protect AI: A modular pipeline for detecting label-flipping poisoning attacks. Results in Engineering (Elsevier). 2025. doi:10.1016/j.rineng.2025.101513.
^ Abroshan, Hossein; Hashmi, Syed Waquas. A Multi-Stage Backdoor Detection (MSBD) Framework. IEEE Access (IEEE). 2026: 1–1. ISSN 2169-3536. doi:10.1109/ACCESS.2026.3659007 .
^ Center for Security and Emerging Technology; Imbrie, Andrew; Kania, Elsa. AI Safety, Security, and Stability Among Great Powers: Options, Challenges, and Lessons Learned for Pragmatic Engagement. 2019 [2022-11-28]. S2CID 240957952. doi:10.51593/20190051. (Archived from the original on 2022-11-24).
^ Future of Life Institute. AI Strategy, Policy, and Governance (Allan Dafoe). Event occurs at 22:05. 2019-03-27 [2022-11-23]. (Archived from the original on 2022-11-23).
^ Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916 .
^ Zou, Andy; Xiao, Tristan; Jia, Ryan; Kwon, Joe; Mazeika, Mantas; Li, Richard; Song, Dawn; Steinhardt, Jacob; Evans, Owain; Hendrycks, Dan. Forecasting Future World Events with Neural Networks. NeurIPS. 2022-10-09. arXiv:2206.15474 .
^ Gathani, Sneha; Hulsebos, Madelon; Gale, James; Haas, Peter J.; Demiralp, Çağatay. Augmenting Decision Making via Interactive What-If Analysis. Conference on Innovative Data Systems Research. 2022-02-08. arXiv:2109.06160 .
^ Lindelauf, Roy, Osinga, Frans; Sweijs, Tim (eds.), Nuclear Deterrence in the Algorithmic Age: Game Theory Revisited, NL ARMS Netherlands Annual Review of Military Studies 2020, Nl Arms (The Hague: T.M.C. Asser Press), 2021: 421–436, ISBN 978-94-6265-418-1, doi:10.1007/978-94-6265-419-8_22 (in English)
^ Newkirk II, Vann R. Is Climate Change a Prisoner's Dilemma or a Stag Hunt?. The Atlantic. 2016-04-21 [2022-11-24]. (Archived from the original on 2022-11-24).
^ Armstrong, Stuart; Bostrom, Nick; Shulman, Carl. Racing to the Precipice: a Model of Artificial Intelligence Development (Report). Future of Humanity Institute, Oxford University.
^ Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.
^ Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo, Joel Z.; Larson, Kate; Graepel, Thore. Open Problems in Cooperative AI. NeurIPS. 2020-12-15. arXiv:2012.08630 .
^ Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore. Cooperative AI: machines must learn to find common ground. Nature. 2021, 593 (7857): 33–36 [2022-11-24]. Bibcode:2021Natur.593...33D. PMID 33947992. S2CID 233740521. doi:10.1038/d41586-021-01170-0. (Archived from the original on 2022-11-22).
^ Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731 .
^ Satariano, Adam; Specia, Megan. Global Leaders Warn A.I. Could Cause 'Catastrophic' Harm. The New York Times. 2023-11-01 [2024-04-20]. ISSN 0362-4331 (in American English).
^ Future of Life Institute. AI Strategy, Policy, and Governance (Allan Dafoe). Event occurs at 22:05. 2019-03-27 [2022-11-23]. (Archived from the original on 2022-11-23).
^ Turchin, Alexey; Dench, David; Green, Brian Patrick. Global Solutions vs. Local Solutions for the AI Safety Problem. Big Data and Cognitive Computing. 2019, 3 (16): 1–25. doi:10.3390/bdcc3010016 .
^ Crafts, Nicholas. Artificial intelligence as a general-purpose technology: an historical perspective. Oxford Review of Economic Policy. 2021-09-23, 37 (3): 521–536 [2022-11-28]. ISSN 0266-903X. doi:10.1093/oxrep/grab012. (Archived from the original on 2022-11-24) (in English).
^ 葉俶禎; 黃子君; 張媁雯; 賴志樫. Labor Displacement in Artificial Intelligence Era: A Systematic Literature Review. 臺灣東亞文明研究學刊. 2020-12-01, 17 (2). ISSN 1812-6243. doi:10.6163/TJEAS.202012_17(2).0002 (in English).
^ Johnson, James. Artificial intelligence & future warfare: implications for international security. Defense & Security Analysis. 2019-04-03, 35 (2): 147–169 [2022-11-28]. ISSN 1475-1798. S2CID 159321626. doi:10.1080/14751798.2019.1600800. (Archived from the original on 2022-11-24) (in English).
^ Kertysova, Katarina. Artificial Intelligence and Disinformation: How AI Changes the Way Disinformation is Produced, Disseminated, and Can Be Countered. Security and Human Rights. 2018-12-12, 29 (1–4): 55–81. ISSN 1874-7337. S2CID 216896677. doi:10.1163/18750230-02901005 .
^ Feldstein, Steven. The Global Expansion of AI Surveillance. Carnegie Endowment for International Peace. 2019.
^ Agrawal, Ajay; Gans, Joshua; Goldfarb, Avi. The economics of artificial intelligence: an agenda. Chicago, Illinois. 2019. ISBN 978-0-226-61347-5. OCLC 1099435014 (in American English).
^ Whittlestone, Jess; Clark, Jack. Why and How Governments Should Monitor AI Development. 2021-08-31. arXiv:2108.12427 .
^ Shevlane, Toby. Sharing Powerful AI Models | GovAI Blog. Center for the Governance of AI. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).
^ Armstrong, Stuart; Bostrom, Nick; Shulman, Carl. Racing to the Precipice: a Model of Artificial Intelligence Development (Report). Future of Humanity Institute, Oxford University.
^ Askell, Amanda; Brundage, Miles; Hadfield, Gillian. The Role of Cooperation in Responsible AI Development. 2019-07-10. arXiv:1907.04534 .
^ Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.
^ Gursoy, Furkan; Kakadiaris, Ioannis A., System Cards for AI-Based Decision-Making for Public Policy, 2022-08-31, arXiv:2203.04754 
^ Cobbe, Jennifer; Lee, Michelle Seng Ah; Singh, Jatinder. Reviewable Automated Decision-Making: A Framework for Accountable Algorithmic Systems. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21. New York, NY, USA: Association for Computing Machinery. 2021-03-01: 598–609. ISBN 978-1-4503-8309-7. doi:10.1145/3442188.3445921.
^ Raji, Inioluwa Deborah; Smart, Andrew; White, Rebecca N.; Mitchell, Margaret; Gebru, Timnit; Hutchinson, Ben; Smith-Loud, Jamila; Theron, Daniel; Barnes, Parker. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. FAT* '20. New York, NY, USA: Association for Computing Machinery. 2020-01-27: 33–44. ISBN 978-1-4503-6936-7. doi:10.1145/3351095.3372873.
^ Manheim, David; Martin, Sammy; Bailey, Mark; Samin, Mikhail; Greutzmacher, Ross. The necessity of AI audit standards boards. AI & Society. 2025, 40 (8): 6609–6624. arXiv:2404.13060 . doi:10.1007/s00146-025-02320-y.
^ Novelli, Claudio; Taddeo, Mariarosaria; Floridi, Luciano. Accountability in artificial intelligence: what it is and how it works. AI & Society. 2024, 39 (4): 1871–1882. doi:10.1007/s00146-023-01635-y. hdl:11585/914099 .
^ Manheim, David. Building a Culture of Safety for AI: Perspectives and Challenges. 26 June 2023. SSRN 4491421.
^ NeMo Guardrails. NVIDIA NeMo Guardrails. [2024-12-08].
^ Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. Meta AI. [2024-12-08].
^ Šekrst, Kristina; McHugh, Jeremy. AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development. arXiv:2411.14442  [cs.CY].
^ Dong, Yi; Mu, Ronghui. Building Guardrails for Large Language Models. arXiv:2402.01822  [cs].
^ D'Alessandro, W. Deontology and safe artificial intelligence. Philosophical Studies. 2024, 182 (7): 1681–1704. doi:10.1007/s11098-024-02174-y .
^ D'Alessandro, William; Kirk-Giannini, Cameron D. Artificial Intelligence: Approaches to Safety. Philosophy Compass. 2025, 20 (5). doi:10.1111/phc3.70039.
^ Ziegler, Bart. Is It Time to Regulate AI?. Wall Street Journal. 8 April 2022 [2022-11-24]. (Archived from the original on 2022-11-24).
^ Reed, Chris. How should we regulate artificial intelligence?. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2018-09-13, 376 (2128). Bibcode:2018RSPTA.37670360R. ISSN 1364-503X. PMC 6107539. PMID 30082306. doi:10.1098/rsta.2017.0360 (in English).
^ Belton, Keith B. How Should AI Be Regulated?. IndustryWeek. 2019-03-07 [2022-11-24]. (Archived from the original on 2022-01-29).
^ National Security Commission on Artificial Intelligence, Final Report, 2021
^ National Institute of Standards and Technology. AI Risk Management Framework. NIST. 2021-07-12 [2022-11-24]. (Archived from the original on 2022-11-24).
^ Richardson, Tim. Britain publishes 10-year National Artificial Intelligence Strategy. 2021 [2022-11-24]. (Archived from the original on 2023-02-10).
^ Guidance: National AI Strategy. GOV.UK. 2021 [2022-11-24]. (Archived from the original on 2023-02-10).
^ Hardcastle, Kimberley. We're talking about AI a lot right now – and it's not a moment too soon. The Conversation. 2023-08-23 [2023-10-31] (in American English).
^ Iconic Bletchley Park to host UK AI Safety Summit in early November. GOV.UK. [2023-10-31] (in English).
^ Colville, Alex. How China Sees AI Safety. China Media Project. 2025-07-30 [2025-08-09] (in American English).
^ Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity. IARPA – TrojAI. [2022-11-24]. (Archived from the original on 2022-11-24).
^ Turek, Matt. Explainable Artificial Intelligence. [2022-11-24]. (Archived from the original on 2021-02-19).
^ Draper, Bruce. Guaranteeing AI Robustness Against Deception. Defense Advanced Research Projects Agency. [2022-11-24]. (Archived from the original on 2023-01-09).
^ National Science Foundation. Safe Learning-Enabled Systems. 23 February 2023 [2023-02-27]. (Archived from the original on 2023-02-26).
^ General Assembly adopts landmark resolution on artificial intelligence. UN News. 21 March 2024 [21 April 2024]. (Archived from the original on 20 April 2024).
^ Say, Mark. DSIT announces funding for research on AI safety. 23 May 2024 [11 June 2024]. (Archived from the original on 24 May 2024).
^ Renshaw, Jarrett; Hunnicutt, Trevor. Biden, Xi agree that humans, not AI, should control nuclear arms. Reuters. November 16, 2024 [February 11, 2026].
^ Khalid, Asma. Biden and Xi take a first step to limit AI and nuclear decisions at their last meeting. NPR. 2024-11-16 [2026-02-11] (in English).
^ FY2025 NDAA, Section 1638 ("Sense of Congress with respect to use of artificial intelligence to support strategic deterrence"). Emerging Technology Observatory. Center for Security and Emerging Technology at Georgetown University. [27 February 2026].
^ H.R.5009 - Servicemember Quality of Life Improvement and National Defense Authorization Act for Fiscal Year 2025. Congress.gov. United States Congress. [27 February 2026].
^ The hypothetical nuclear attack that escalated the Pentagon’s showdown with Anthropic. The Washington Post. [27 February 2026].
^ U.N. General Assembly opens with urgent plea for binding AI safeguards. NBC News. 2025-09-22 [2026-04-05] (in English).
^ Ensuring a National Policy Framework for Artificial Intelligence. The White House. 2025-12-11 [2026-02-27] (in American English).
^ Johnson, Khari. Trump’s new order against AI regulation hits California especially hard. CalMatters. 2025-12-12 [2026-02-27] (in American English).
^ Removing Barriers to American Leadership in Artificial Intelligence. Federal Register. 2025-01-31 [2026-02-27] (in English).
^ Trump’s AI Order Is More Bark than Bite | Brennan Center for Justice. www.brennancenter.org. 2026-02-25 [2026-02-27] (in English).
^ Mäntymäki, Matti; Minkkinen, Matti; Birkstedt, Teemu; Viljanen, Mika. Defining organizational AI governance. AI and Ethics. 2022, 2 (4): 603–609. ISSN 2730-5953. S2CID 247119668. doi:10.1007/s43681-022-00143-x (in English).
^ Brundage, Miles; Avin, Shahar; Wang, Jasmine; Belfield, Haydn; Krueger, Gretchen; Hadfield, Gillian; Khlaaf, Heidy; Yang, Jingying; Toner, Helen; Fong, Ruth; Maharaj, Tegan. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. 2020-04-20. arXiv:2004.07213.
^ Welcome to the Artificial Intelligence Incident Database. [2022-11-24]. (Archived from the original on 2022-11-24).
^ Shevlane, Toby. Sharing Powerful AI Models | GovAI Blog. Center for the Governance of AI. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).
^ Wiblin, Robert; Harris, Keiran. Nova DasSarma on why information security may be critical to the safe development of AI systems. 80,000 Hours. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).
^ OpenAI. Best Practices for Deploying Language Models. OpenAI. 2022-06-02 [2022-11-24]. (Archived from the original on 2023-03-15).
^ OpenAI. OpenAI Charter. OpenAI. [2022-11-24]. (Archived from the original on 2021-03-04).
^ Future of Life Institute. AI Principles. Future of Life Institute. [2022-11-23]. (Archived from the original on 2022-11-23).
^ Future of Life Institute. Autonomous Weapons Open Letter: AI & Robotics Researchers. Future of Life Institute. 2016 [2022-11-24].
^ Chow, Andrew R. The People vs. AI. TIME. 2026-02-19 [2026-03-05] (in English).
^ Wilkins, Emily. Anthropic gives $20 million to group pushing for AI regulations ahead of 2026 elections. CNBC. 2026-02-12 [2026-03-05] (in English).
^ The Silicon Valley billionaires spending big to write America’s AI rules. Financial Times. February 26, 2026.
^ Schleifer, Theodore; Tan, Eli. Silicon Valley Pledges $200 Million to New Pro-A.I. Super PACs. The New York Times. 2025-08-26 [2026-03-05]. ISSN 0362-4331 (in American English).
^ GGE on lethal autonomous weapons systems. Digital Watch Observatory. 2025-11-27 [2026-04-26].
^ Statements at the First 2025 GGE LAWS Session. APILS. 2025-03-09 [2026-04-26].

[1] Ahmed, Shazeda; Jaźwińska, Klaudia; Ahlawat, Archana; Winecoff, Amy; Wang, Mona. Field-building and the epistemic culture of AI safety. First Monday. 2024-04-14. ISSN 1396-0466. doi:10.5210/fm.v29i4.13626 (in English).

[2] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916.

[3] Champagne, Dylan. President Trump Targets State AI Regulations. The Regulatory Review. 2026-02-26 [2026-02-27] (in American English).

[4] What is California's AI safety law?. Brookings. 2025-12-23 [2026-02-27] (in American English).

[5] Artificial Intelligence 2024 Legislation. www.ncsl.org. [2026-02-27].

[6] Perrigo, Billy. U.K.'s AI Safety Summit Ends With Limited, but Meaningful, Progress. Time. 2023-11-02 [2024-06-02] (in English).

[7] De-Arteaga, Maria. Machine Learning in High-Stakes Settings: Risks and Opportunities (PhD thesis). Carnegie Mellon University. 2020-05-13.

[8] Mehrabi, Ninareh; Morstatter, Fred; Saxena, Nripsuta; Lerman, Kristina; Galstyan, Aram. A Survey on Bias and Fairness in Machine Learning. ACM Computing Surveys. 2021, 54 (6): 1–35 [2022-11-28]. ISSN 0360-0300. S2CID 201666566. arXiv:1908.09635. doi:10.1145/3457607. (Archived from the original on 2022-11-23) (in English).

[9] Feldstein, Steven. The Global Expansion of AI Surveillance (Report). Carnegie Endowment for International Peace. 2019.

[10] Barnes, Beth. Risks from AI persuasion. Lesswrong. 2021 [2022-11-23]. (Archived from the original on 2022-11-23).

[11] Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo - University of Cambridge Repository. 2018-04-30 [2022-11-28]. S2CID 3385567. doi:10.17863/cam.22520. (Archived from the original on 2022-11-23).

[12] Davies, Pascale. How NATO is preparing for a new era of AI cyber attacks. euronews. December 26, 2022 [2024-03-23] (in English).

[13] Ahuja, Anjana. AI's bioterrorism potential should not be ruled out. Financial Times. February 7, 2024 [2024-03-23].

[14] Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353 .

[15] Minardi, Di. The grim fate that could be 'worse than extinction'. BBC. 16 October 2020 [2024-03-23].

[16] AGI Expert Peter Voss Says AI Alignment Problem is Bogus | NextBigFuture.com. 2023-04-04 [2023-07-23] (in American English).

[17] Dafoe, Allan. Yes, We Are Worried About the Existential Risk of Artificial Intelligence. MIT Technology Review. 2016 [2022-11-28]. (Archived from the original on 2022-11-28).

[18] Grace, Katja; Salvatier, John; Dafoe, Allan; Zhang, Baobao; Evans, Owain. Viewpoint: When Will AI Exceed Human Performance? Evidence from AI Experts. Journal of Artificial Intelligence Research. 2018-07-31, 62: 729–754 [2022-11-28]. ISSN 1076-9757. S2CID 8746462. arXiv:1705.08807. doi:10.1613/jair.1.11222. (Archived from the original on 2023-02-10).

[19] Stein-Perlman, Zach; Weinstein-Raun, Benjamin; Grace, Katja. 2022 Expert Survey on Progress in AI. AI Impacts. 2022-08-04 [2022-11-23]. (Archived from the original on 2022-11-23).

[20] Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita; Pang, Richard Yuanzhe; Phang, Jason; Bowman, Samuel R. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey. Association for Computational Linguistics. 2022-08-26. arXiv:2208.12852 .

[21] Markoff, John. In 1949, He Imagined an Age of Robots. The New York Times. 2013-05-20 [2022-11-23]. ISSN 0362-4331. (Archived from the original on 2022-11-23).

[22] Artificial intelligence: A handbook of professionalism. University of Sussex. January 1988. ISBN 978-0-470-21103-8.

[23] Association for the Advancement of Artificial Intelligence. AAAI Presidential Panel on Long-Term AI Futures. [2022-11-23]. (Archived from the original on 2022-09-01).

[24] PT-AI 2011 – Philosophy and Theory of Artificial Intelligence (PT-AI 2011). [2022-11-23]. (Archived from the original on 2022-11-23).

[25] Yampolskiy, Roman V., Müller, Vincent C. (ed.), Artificial Intelligence Safety Engineering: Why Machine Ethics is a Wrong Approach, Philosophy and Theory of Artificial Intelligence, Studies in Applied Philosophy, Epistemology and Rational Ethics (Berlin; Heidelberg, Germany: Springer Berlin Heidelberg), 2013, 5: 389–396 [2022-11-23], ISBN 978-3-642-31673-9, doi:10.1007/978-3-642-31674-6_29, (Archived from the original on 2023-03-15)

[26] McLean, Scott; Read, Gemma J. M.; Thompson, Jason; Baber, Chris; Stanton, Neville A.; Salmon, Paul M. The risks associated with Artificial General Intelligence: A systematic review. Journal of Experimental & Theoretical Artificial Intelligence. 2023-07-04, 35 (5): 649–663. Bibcode:2023JETAI..35..649M. ISSN 0952-813X. S2CID 238643957. doi:10.1080/0952813X.2021.1964003. hdl:11343/289595 (in English).

[27] Wile, Rob. Elon Musk: Artificial Intelligence Is 'Potentially More Dangerous Than Nukes'. Business Insider. August 3, 2014 [2024-02-22] (in American English).

[28] Kuo, Kaiser. Baidu CEO Robin Li interviews Bill Gates and Elon Musk at the Boao Forum, March 29, 2015. Event occurs at 55:49. 2015-03-31 [2022-11-23]. (Archived from the original on 2022-11-23).

[29] Cellan-Jones, Rory. Stephen Hawking warns artificial intelligence could end mankind. BBC News. 2014-12-02 [2022-11-23]. (Archived from the original on 2015-10-30).

[30] Future of Life Institute. Research Priorities for Robust and Beneficial Artificial Intelligence: An Open Letter. Future of Life Institute. [2022-11-23]. (Archived from the original on 2022-11-23).

[31] Future of Life Institute. AI Research Grants Program. Future of Life Institute. October 2016 [2022-11-23]. (Archived from the original on 2022-11-23).

[32] SafArtInt 2016. [2022-11-23]. (Archived from the original on 2022-11-23).

[33] Bach, Deborah. UW to host first of four White House public workshops on artificial intelligence. UW News. 2016 [2022-11-23]. (Archived from the original on 2022-11-23).

[34] Amodei, Dario; Olah, Chris; Steinhardt, Jacob; Christiano, Paul; Schulman, John; Mané, Dan. Concrete Problems in AI Safety. 2016-07-25. arXiv:1606.06565 .

[35] Future of Life Institute. AI Principles. Future of Life Institute. [2022-11-23]. (Archived from the original on 2022-11-23).

[36] Bengio, Yoshua; Privitera, Daniel; Besiroglu, Tamay; Bommasani, Rishi; Casper, Stephen; Choi, Yejin; Goldfarb, Danielle; Heidari, Hoda; Khalatbari, Leila. International Scientific Report on the Safety of Advanced AI (Report). Department for Science, Innovation and Technology. May 2024.

[37] SafeML ICLR 2019 Workshop. [2022-11-23]. (Archived from the original on 2022-11-23).

[38] Browne, Ryan. British Prime Minister Rishi Sunak pitches UK as home of A.I. safety regulation as London bids to be next Silicon Valley. CNBC. 2023-06-12 [2023-06-25] (in English).

[39] Bertuzzi, Luca. UK's AI safety summit set to highlight risk of losing human control over 'frontier' models. Euractiv. October 18, 2023 [March 2, 2024].

[40] Bengio, Yoshua; Privitera, Daniel; Bommasani, Rishi; Casper, Stephen; Goldfarb, Danielle; Mavroudis, Vasilios; Khalatbari, Leila; Mazeika, Mantas; Heidari, Hoda. International Scientific Report on the Safety of Advanced AI (PDF). GOV.UK. 2024-05-17 [2024-07-08]. (Archived (PDF) from the original on 2024-06-15).

[41] Shepardson, David. US, Britain announce partnership on AI safety, testing. 1 April 2024 [2 April 2024].

[42] What International AI Safety report says on jobs, climate, cyberwar and more. The Guardian. 2025-01-29 [2025-03-03]. ISSN 0261-3077 (in British English).

[43] Launch of the First International Report on AI Safety chaired by Yoshua Bengio. mila.quebec. January 29, 2025 [2025-03-03] (in English).

[44] Goodfellow, Ian; Papernot, Nicolas; Huang, Sandy; Duan, Rocky; Abbeel, Pieter; Clark, Jack. Attacking Machine Learning with Adversarial Examples. OpenAI. 2017-02-24 [2022-11-24]. (Archived from the original on 2022-11-24).

[45] Szegedy, Christian; Zaremba, Wojciech; Sutskever, Ilya; Bruna, Joan; Erhan, Dumitru; Goodfellow, Ian; Fergus, Rob. Intriguing properties of neural networks. ICLR. 2014-02-19. arXiv:1312.6199.

[46] Kurakin, Alexey; Goodfellow, Ian; Bengio, Samy. Adversarial examples in the physical world. ICLR. 2017-02-10. arXiv:1607.02533 .

[47] Kannan, Harini; Kurakin, Alexey; Goodfellow, Ian. Adversarial Logit Pairing. 2018-03-16. arXiv:1803.06373 .

[48] Gilmer, Justin; Adams, Ryan P.; Goodfellow, Ian; Andersen, David; Dahl, George E. Motivating the Rules of the Game for Adversarial Example Research. 2018-07-19. arXiv:1807.06732 .

[49] Carlini, Nicholas; Wagner, David. Audio Adversarial Examples: Targeted Attacks on Speech-to-Text. IEEE Security and Privacy Workshops. 2018-03-29. arXiv:1801.01944 .

[50] Sheatsley, Ryan; Papernot, Nicolas; Weisman, Michael; Verma, Gunjan; McDaniel, Patrick. Adversarial Examples in Constrained Domains. 2022-09-09. arXiv:2011.01183 .

[51] Suciu, Octavian; Coull, Scott E.; Johns, Jeffrey. Exploring Adversarial Examples in Malware Detection. IEEE Security and Privacy Workshops. 2019-04-13. arXiv:1810.08280 .

[52] Ouyang, Long; Wu, Jeff; Jiang, Xu; Almeida, Diogo; Wainwright, Carroll L.; Mishkin, Pamela; Zhang, Chong; Agarwal, Sandhini; Slama, Katarina; Ray, Alex; Schulman, John. Training language models to follow instructions with human feedback. NeurIPS. 2022-03-04. arXiv:2203.02155 .

[53] Gao, Leo; Schulman, John; Hilton, Jacob. Scaling Laws for Reward Model Overoptimization. ICML. 2022-10-19. arXiv:2210.10760.

[54] Yu, Sihyun; Ahn, Sungsoo; Song, Le; Shin, Jinwoo. RoMA: Robust Model Adaptation for Offline Model-based Optimization. NeurIPS. 2021-10-27. arXiv:2110.14188 .

[55] Hendrycks, Dan; Mazeika, Mantas. X-Risk Analysis for AI Research. 2022-09-20. arXiv:2206.05862.

[56] Prompt injection attacks might 'never be properly mitigated' UK NCSC warns. TechRadar. 2025-12-09 [2025-12-12] (in English).

[57] Why Anthropic and OpenAI are obsessed with securing LLM model weights. VentureBeat. 2023-12-15.

[58] The rise of AI fake news is creating a 'misinformation superspreader'. The Washington Post. 2023-12-17 [2025-12-12]. ISSN 0190-8286 (in American English).

[59] Brando, Axel; Serra, Isabel; Mezzetti, Enrico; Cazorla, Francisco J.; Perez-Cerrolaza, Jon; Abella, Jaume. On Neural Networks Redundancy and Diversity for Their Use in Safety-Critical Systems. Computer. May 2023, 56 (5): 41–50. doi:10.1109/MC.2023.3236523.

[60] Machida, Fumio. N-Version Machine Learning Models for Safety Critical Systems. 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN-W). IEEE: 48–51. 2019. doi:10.1109/DSN-W.2019.00017.

[61] Tran, Khoa A.; Kondrashova, Olga; Bradley, Andrew; Williams, Elizabeth D.; Pearson, John V.; Waddell, Nicola. Deep learning in cancer diagnosis, prognosis and treatment selection. Genome Medicine. 2021, 13 (1): 152. ISSN 1756-994X. PMC 8477474. PMID 34579788. doi:10.1186/s13073-021-00968-x (in English).

[62] Guo, Chuan; Pleiss, Geoff; Sun, Yu; Weinberger, Kilian Q. On calibration of modern neural networks. Proceedings of the 34th international conference on machine learning. Proceedings of machine learning research 70. PMLR: 1321–1330. 2017-08-06.

[63] Ovadia, Yaniv; Fertig, Emily; Ren, Jie; Nado, Zachary; Sculley, D.; Nowozin, Sebastian; Dillon, Joshua V.; Lakshminarayanan, Balaji; Snoek, Jasper. Can You Trust Your Model's Uncertainty? Evaluating Predictive Uncertainty Under Dataset Shift. NeurIPS. 2019-12-17. arXiv:1906.02530 .

[64] Bogdoll, Daniel; Breitenstein, Jasmin; Heidecker, Florian; Bieshaar, Maarten; Sick, Bernhard; Fingscheidt, Tim; Zöllner, J. Marius. Description of Corner Cases in Automated Driving: Goals and Challenges. 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW). 2021: 1023–1028. ISBN 978-1-6654-0191-3. arXiv:2109.09607 . doi:10.1109/ICCVW54120.2021.00119.

[65] Hendrycks, Dan; Mazeika, Mantas; Dietterich, Thomas. Deep Anomaly Detection with Outlier Exposure. ICLR. 2019-01-28. arXiv:1812.04606 .

[66] Hendrycks, Dan; Gimpel, Kevin. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. ICLR. 2018-10-03. arXiv:1610.02136 .

[67] Urbina, Fabio; Lentzos, Filippa; Invernizzi, Cédric; Ekins, Sean. Dual use of artificial-intelligence-powered drug discovery. Nature Machine Intelligence. 2022, 4 (3): 189–191. ISSN 2522-5839. PMC 9544280. PMID 36211133. doi:10.1038/s42256-022-00465-9 (in English).

[68] Propaganda-as-a-service may be on the horizon if large language models are abused. VentureBeat. 2021-12-14 [2022-11-24]. (Archived from the original on 2022-11-24).

[69] Center for Security and Emerging Technology; Buchanan, Ben; Bansemer, John; Cary, Dakota; Lucas, Jack; Musser, Micah. Automating Cyber Attacks: Hype and Reality. 2020 [2022-11-28]. S2CID 234623943. doi:10.51593/2020ca002. (Archived from the original on 2022-11-24).

[70] Markov, Todor; Zhang, Chong; Agarwal, Sandhini; Eloundou, Tyna; Lee, Teddy; Adler, Steven; Jiang, Angela; Weng, Lilian. New-and-Improved Content Moderation Tooling. OpenAI. 2022-08-10 [2022-11-24]. (Archived from the original on 2023-01-11).

[71] Savage, Neil. Breaking into the black box of artificial intelligence. Nature. 2022-03-29 [2022-11-24]. PMID 35352042. S2CID 247792459. doi:10.1038/d41586-022-00858-1. (Archived from the original on 2022-11-24).

[72] Center for Security and Emerging Technology; Rudner, Tim; Toner, Helen. Key Concepts in AI Safety: Interpretability in Machine Learning. CSET Issue Brief. 2021 [2022-11-28]. S2CID 233775541. doi:10.51593/20190042. (Archived from the original on 2022-11-24).

[73] McFarland, Matt. Uber pulls self-driving cars after first fatal crash of autonomous vehicle. CNNMoney. 2018-03-19 [2022-11-24]. (Archived from the original on 2022-11-24).

[74] Felder, Ryan Marshall. Coming to Terms with the Black Box Problem: How to Justify AI Systems in Health Care. Hastings Center Report. July 2021, 51 (4): 38–45. ISSN 0093-0334. PMID 33821471. doi:10.1002/hast.1248 (in English).

[75] Doshi-Velez, Finale; Kortz, Mason; Budish, Ryan; Bavitz, Chris; Gershman, Sam; O'Brien, David; Scott, Kate; Schieber, Stuart; Waldo, James; Weinberger, David; Weller, Adrian. Accountability of AI Under the Law: The Role of Explanation. 2019-12-20. arXiv:1711.01134.

[76] Fong, Ruth; Vedaldi, Andrea. Interpretable Explanations of Black Boxes by Meaningful Perturbation. 2017 IEEE International Conference on Computer Vision (ICCV). 2017: 3449–3457. ISBN 978-1-5386-1032-9. arXiv:1704.03296. doi:10.1109/ICCV.2017.371.

[77] Meng, Kevin; Bau, David; Andonian, Alex; Belinkov, Yonatan. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems. 2022, 35. arXiv:2202.05262.

[78] Bau, David; Liu, Steven; Wang, Tongzhou; Zhu, Jun-Yan; Torralba, Antonio. Rewriting a Deep Generative Model. ECCV. 2020-07-30. arXiv:2007.15646.

[79] Räuker, Tilman; Ho, Anson; Casper, Stephen; Hadfield-Menell, Dylan. Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks. IEEE SaTML. 2022-09-05. arXiv:2207.13243.

[80] Bau, David; Zhou, Bolei; Khosla, Aditya; Oliva, Aude; Torralba, Antonio. Network Dissection: Quantifying Interpretability of Deep Visual Representations. CVPR. 2017-04-19. arXiv:1704.05796.

[81] McGrath, Thomas; Kapishnikov, Andrei; Tomašev, Nenad; Pearce, Adam; Wattenberg, Martin; Hassabis, Demis; Kim, Been; Paquet, Ulrich; Kramnik, Vladimir. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences. 2022-11-22, 119 (47). Bibcode:2022PNAS..11906625M. ISSN 0027-8424. PMC 9704706. PMID 36375061. arXiv:2111.09259. doi:10.1073/pnas.2206625119 (in English).

[82] Goh, Gabriel; Cammarata, Nick; Voss, Chelsea; Carter, Shan; Petrov, Michael; Schubert, Ludwig; Radford, Alec; Olah, Chris. Multimodal neurons in artificial neural networks. Distill. 2021, 6 (3). S2CID 233823418. doi:10.23915/distill.00030.

[83] Cammarata, Nick; Goh, Gabriel; Carter, Shan; Voss, Chelsea; Schubert, Ludwig; Olah, Chris. Curve circuits. Distill. 2021, 6 (1) [5 December 2022]. doi:10.23915/distill.00024.006 (inactive 1 July 2025). (Archived from the original on 5 December 2022).

[84] Olsson, Catherine; Elhage, Nelson; Nanda, Neel; Joseph, Nicholas; DasSarma, Nova; Henighan, Tom; Mann, Ben; Askell, Amanda; Bai, Yuntao; Chen, Anna; Conerly, Tom. In-context learning and induction heads. Transformer Circuits Thread. 2022. arXiv:2209.11895.

[85] Olah, Christopher. Interpretability vs Neuroscience [rough note]. [2022-11-24]. (Archived from the original on 2022-11-24).

[86] Gu, Tianyu; Dolan-Gavitt, Brendan; Garg, Siddharth. BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain. 2019-03-11. arXiv:1708.06733.

[87] Chen, Xinyun; Liu, Chang; Li, Bo; Lu, Kimberly; Song, Dawn. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. 2017-12-14. arXiv:1712.05526.

[88] Carlini, Nicholas; Terzis, Andreas. Poisoning and Backdooring Contrastive Learning. ICLR. 2022-03-28. arXiv:2106.09667.

[89] How 'sleeper agent' AI assistants can sabotage code. The Register. 16 January 2024 [2025-01-12]. (Archived from the original on 2024-12-24) (in English).

[90] Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach (4th ed.). Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.

[91] Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach (4th ed.). Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.

[92] Ngo, Richard; Chan, Lawrence; Mindermann, Sören. The Alignment Problem from a Deep Learning Perspective. International Conference on Learning Representations. 2022. arXiv:2209.00626.

[93] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21].

[94] Russell, Stuart J.; Norvig, Peter. Artificial intelligence: A modern approach (4th ed.). Pearson. 2021: 5, 1003 [September 12, 2022]. ISBN 978-0-13-461099-3.

[95] Carlsmith, Joseph. Is Power-Seeking AI an Existential Risk?. 2022-06-16. arXiv:2206.13353 [cs.CY].

[96] Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.

[97] Christian, Brian. The alignment problem: Machine learning and human values. W. W. Norton & Company. 2020 [September 12, 2022]. ISBN 978-0-393-86833-3. OCLC 1233266753. (Archived from the original on February 10, 2023).

[98] Langosco, Lauro Langosco Di; Koch, Jack; Sharkey, Lee D.; Pfau, Jacob; Krueger, David. Goal Misgeneralization in Deep Reinforcement Learning. Proceedings of the 39th International Conference on Machine Learning. International Conference on Machine Learning. PMLR: 12004–12019. 2022-06-28 [2023-03-11].

[99] Pillay, Tharin. New Tests Reveal AI's Capacity for Deception. TIME. 2024-12-15 [2025-01-12] (in English).

[100] Perrigo, Billy. Exclusive: New Research Shows AI Strategically Lying. TIME. 2024-12-18 [2025-01-12] (in English).

[101] Ouyang, Long; et al. Training language models to follow instructions with human feedback (PDF). NeurIPS. 2022. arXiv:2203.02155.

[102] Zaremba, Wojciech; Brockman, Greg; OpenAI. OpenAI Codex. OpenAI. 2021-08-10 [2022-07-23]. (Archived from the original on February 3, 2023).

[103] Kober, Jens; Bagnell, J. Andrew; Peters, Jan. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research. 2013-09-01, 32 (11): 1238–1274 [September 12, 2022]. ISSN 0278-3649. S2CID 1932843. doi:10.1177/0278364913495721. (Archived from the original on October 15, 2022) (in English).

[104] Knox, W. Bradley; Allievi, Alessandro; Banzhaf, Holger; Schmitt, Felix; Stone, Peter. Reward (Mis)design for autonomous driving. Artificial Intelligence. 2023-03-01, 316. ISSN 0004-3702. S2CID 233423198. arXiv:2104.13906. doi:10.1016/j.artint.2022.103829 (in English).

[105] Bommasani, Rishi; Hudson, Drew A.; Adeli, Ehsan; Altman, Russ; Arora, Simran; von Arx, Sydney; Bernstein, Michael S.; Bohg, Jeannette; Bosselut, Antoine; Brunskill, Emma; Brynjolfsson, Erik. On the Opportunities and Risks of Foundation Models. Stanford CRFM. 2022-07-12. arXiv:2108.07258.

[106] Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.

[107] Stray, Jonathan. Aligning AI Optimization to Community Well-Being. International Journal of Community Well-Being. 2020, 3 (4): 443–463. ISSN 2524-5295. PMC 7610010. PMID 34723107. S2CID 226254676. doi:10.1007/s42413-020-00086-3 (in English).

[108] Pan, Alexander; Bhatia, Kush; Steinhardt, Jacob. The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models. International Conference on Learning Representations. 2022-02-14 [2022-07-21].

[109] Ngo, Richard; Chan, Lawrence; Mindermann, Sören. The Alignment Problem from a Deep Learning Perspective. International Conference on Learning Representations. 2022. arXiv:2209.00626.

[110] Smith, Craig S. Geoff Hinton, AI's Most Famous Researcher, Warns Of 'Existential Threat'. Forbes. [2023-05-04] (in English).

[111] Russell, Stuart J. Human compatible: Artificial intelligence and the problem of control. Penguin Random House. 2020. ISBN 978-0-525-55863-7. OCLC 1113410915.

[112] Statement on AI Risk | CAIS. www.safe.ai. [2024-02-11] (in English).

[113] Grace, Katja; Stewart, Harlan; Sandkühler, Julia Fabienne; Thomas, Stephen; Weinstein-Raun, Ben; Brauner, Jan. Thousands of AI Authors on the Future of AI. Journal of Artificial Intelligence Research. 2025, 84. arXiv:2401.02843. doi:10.1613/jair.1.19087.

[114] Perrigo, Billy. Meta's AI Chief Yann LeCun on AGI, Open-Source, and AI Risk. TIME. 2024-02-13 [2024-06-26] (in English).

[115] What is AI alignment?. TechTarget. 2023-05-03 [2025-06-28] (in English).

[116] Ahmed, Shazeda; Jaźwińska, Klaudia; Ahlawat, Archana; Winecoff, Amy; Wang, Mona. Field-building and the epistemic culture of AI safety. First Monday. 2024-04-14. ISSN 1396-0466. doi:10.5210/fm.v29i4.13626 (in English).

[117] Ortega, Pedro A.; Maini, Vishal; DeepMind safety team. Building safe artificial intelligence: specification, robustness, and assurance. DeepMind Safety Research – Medium. 2018-09-27 [2022-07-18]. (Archived from the original on February 10, 2023).

[118] Rorvig, Mordechai. Researchers Gain New Understanding From Simple AI. Quanta Magazine. 2022-04-14 [2022-07-18]. (Archived from the original on February 10, 2023).

[119] Doshi-Velez, Finale; Kim, Been. Towards A Rigorous Science of Interpretable Machine Learning. 2017-03-02. arXiv:1702.08608 [stat.ML].

[120] Amodei, Dario; Olah, Chris. Concrete Problems in AI Safety. 2016-06-21. arXiv:1606.06565 [cs.AI] (in English).

[121] Russell, Stuart; Dewey, Daniel; Tegmark, Max. Research Priorities for Robust and Beneficial Artificial Intelligence. AI Magazine. 2015-12-31, 36 (4): 105–114 [September 12, 2022]. ISSN 2371-9621. S2CID 8174496. arXiv:1602.03506. doi:10.1609/aimag.v36i4.2577. hdl:1721.1/108478. (Archived from the original on February 2, 2023).

[122] Christiano, Paul F.; Leike, Jan; Brown, Tom B.; Martic, Miljan; Legg, Shane; Amodei, Dario. Deep reinforcement learning from human preferences. Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS'17. Red Hook, NY, USA: Curran Associates Inc.: 4302–4310. 2017. ISBN 978-1-5108-6096-4.

[123] Heaven, Will Douglas. The new version of GPT-3 is much better behaved (and should be less toxic). MIT Technology Review. 2022-01-27 [2022-07-18]. (Archived from the original on February 10, 2023).

[124] Mohseni, Sina; Wang, Haotao; Yu, Zhiding; Xiao, Chaowei; Wang, Zhangyang; Yadawa, Jay. Taxonomy of Machine Learning Safety: A Survey and Primer. ACM Computing Surveys. 2022-03-07, 55 (8): 1–38. doi:10.1145/3551385.

[125] Clifton, Jesse. Cooperation, Conflict, and Transformative Artificial Intelligence: A Research Agenda. Center on Long-Term Risk. 2020 [2022-07-18]. (Archived from the original on January 1, 2023).

[126] Prunkl, Carina; Whittlestone, Jess. Beyond Near- and Long-Term. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society. New York NY USA: ACM. 2020-02-07: 138–143 [September 12, 2022]. ISBN 978-1-4503-7110-0. doi:10.1145/3375627.3375803. (Archived from the original on October 16, 2022) (in English).

[127] Irving, Geoffrey; Askell, Amanda. AI Safety Needs Social Scientists. Distill. 2019-02-19, 4 (2) [September 12, 2022]. ISSN 2476-0757. S2CID 159180422. doi:10.23915/distill.00014. (Archived from the original on February 10, 2023).

[128] Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731.

[129] Zwetsloot, Remco; Dafoe, Allan. Thinking About Risks From AI: Accidents, Misuse and Structure. Lawfare. 2019-02-11 [2022-11-24]. (Archived from the original on 2023-08-19).

[130] Zhang, Yingyu; Dong, Chuntong; Guo, Weiqun; Dai, Jiabao; Zhao, Ziming. Systems theoretic accident model and process (STAMP): A literature review. Safety Science. 2022, 152 [2022-11-28]. S2CID 244550153. doi:10.1016/j.ssci.2021.105596. (Archived from the original on 2023-03-15) (in English).

[131] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916.

[132] Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731.

[133] Center for Security and Emerging Technology; Hoffman, Wyatt. AI and the Future of Cyber Competition. CSET Issue Brief. 2021 [2022-11-28]. S2CID 234245812. doi:10.51593/2020ca007. (Archived from the original on 2022-11-24).

[134] Brundage, Miles; Avin, Shahar; Clark, Jack; Toner, Helen; Eckersley, Peter; Garfinkel, Ben; Dafoe, Allan; Scharre, Paul; Zeitzoff, Thomas; Filar, Bobby; Anderson, Hyrum. The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation. Apollo - University of Cambridge Repository. 2018-04-30 [2022-11-28]. S2CID 3385567. doi:10.17863/cam.22520. (Archived from the original on 2022-11-23).

[135] Gafni, Ruti; Levy, Yair. The role of artificial intelligence (AI) in improving technical and managerial cybersecurity tasks' efficiency. Information & Computer Security. 2024-01-01, 32 (5): 711–728. ISSN 2056-4961. doi:10.1108/ICS-04-2024-0102.

[136] Abroshan, Hossein. AI to protect AI: A modular pipeline for detecting label-flipping poisoning attacks. Results in Engineering (Elsevier). 2025. doi:10.1016/j.rineng.2025.101513.

[137] Abroshan, Hossein; Hashmi, Syed Waquas. A Multi-Stage Backdoor Detection (MSBD) Framework. IEEE Access (IEEE). 2026: 1–1. ISSN 2169-3536. doi:10.1109/ACCESS.2026.3659007.

[138] Center for Security and Emerging Technology; Imbrie, Andrew; Kania, Elsa. AI Safety, Security, and Stability Among Great Powers: Options, Challenges, and Lessons Learned for Pragmatic Engagement. 2019 [2022-11-28]. S2CID 240957952. doi:10.51593/20190051. (Archived from the original on 2022-11-24).

[139] Future of Life Institute. AI Strategy, Policy, and Governance (Allan Dafoe). Event occurs at 22:05. 2019-03-27 [2022-11-23]. (Archived from the original on 2022-11-23).

[140] Hendrycks, Dan; Carlini, Nicholas; Schulman, John; Steinhardt, Jacob. Unsolved Problems in ML Safety. arXiv. 2022-06-16. arXiv:2109.13916.

[141] Zou, Andy; Xiao, Tristan; Jia, Ryan; Kwon, Joe; Mazeika, Mantas; Li, Richard; Song, Dawn; Steinhardt, Jacob; Evans, Owain; Hendrycks, Dan. Forecasting Future World Events with Neural Networks. NeurIPS. 2022-10-09. arXiv:2206.15474.

[142] Gathani, Sneha; Hulsebos, Madelon; Gale, James; Haas, Peter J.; Demiralp, Çağatay. Augmenting Decision Making via Interactive What-If Analysis. Conference on Innovative Data Systems Research. 2022-02-08. arXiv:2109.06160.

[143] Lindelauf, Roy. Nuclear Deterrence in the Algorithmic Age: Game Theory Revisited. In Osinga, Frans; Sweijs, Tim (eds.). NL ARMS Netherlands Annual Review of Military Studies 2020. NL ARMS. The Hague: T.M.C. Asser Press. 2021: 421–436. ISBN 978-94-6265-418-1. doi:10.1007/978-94-6265-419-8_22 (in English).

[144] Newkirk II, Vann R. Is Climate Change a Prisoner's Dilemma or a Stag Hunt?. The Atlantic. 2016-04-21 [2022-11-24]. (Archived from the original on 2022-11-24).

[145] Armstrong, Stuart; Bostrom, Nick; Shulman, Carl. Racing to the Precipice: a Model of Artificial Intelligence Development (Report). Future of Humanity Institute, Oxford University.

[146] Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.

[147] Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; Collins, Tantum; McKee, Kevin R.; Leibo, Joel Z.; Larson, Kate; Graepel, Thore. Open Problems in Cooperative AI. NeurIPS. 2020-12-15. arXiv:2012.08630.

[148] Dafoe, Allan; Bachrach, Yoram; Hadfield, Gillian; Horvitz, Eric; Larson, Kate; Graepel, Thore. Cooperative AI: machines must learn to find common ground. Nature. 2021, 593 (7857): 33–36 [2022-11-24]. Bibcode:2021Natur.593...33D. PMID 33947992. S2CID 233740521. doi:10.1038/d41586-021-01170-0. (Archived from the original on 2022-11-22).

[149] Gazos, Alexandros; Kahn, James; Kusche, Isabel; Büscher, Christian; Götz, Markus. Organising AI for safety: Identifying structural vulnerabilities to guide the design of AI-enhanced socio-technical systems. Safety Science. 2025-04-01, 184. ISSN 0925-7535. doi:10.1016/j.ssci.2024.106731.

[150] Satariano, Adam; Specia, Megan. Global Leaders Warn A.I. Could Cause 'Catastrophic' Harm. The New York Times. 2023-11-01 [2024-04-20]. ISSN 0362-4331 (in American English).

[151] Future of Life Institute. AI Strategy, Policy, and Governance (Allan Dafoe). Event occurs at 22:05. 2019-03-27 [2022-11-23]. (Archived from the original on 2022-11-23).

[152] Turchin, Alexey; Dench, David; Green, Brian Patrick. Global Solutions vs. Local Solutions for the AI Safety Problem. Big Data and Cognitive Computing. 2019, 3 (16): 1–25. doi:10.3390/bdcc3010016.

[153] Crafts, Nicholas. Artificial intelligence as a general-purpose technology: an historical perspective. Oxford Review of Economic Policy. 2021-09-23, 37 (3): 521–536 [2022-11-28]. ISSN 0266-903X. doi:10.1093/oxrep/grab012. (Archived from the original on 2022-11-24) (in English).

[154] 葉俶禎; 黃子君; 張媁雯; 賴志樫. Labor Displacement in Artificial Intelligence Era: A Systematic Literature Review. 臺灣東亞文明研究學刊. 2020-12-01, 17 (2). ISSN 1812-6243. doi:10.6163/TJEAS.202012_17(2).0002 (in English).

[155] Johnson, James. Artificial intelligence & future warfare: implications for international security. Defense & Security Analysis. 2019-04-03, 35 (2): 147–169 [2022-11-28]. ISSN 1475-1798. S2CID 159321626. doi:10.1080/14751798.2019.1600800. (Archived from the original on 2022-11-24) (in English).

[156] Kertysova, Katarina. Artificial Intelligence and Disinformation: How AI Changes the Way Disinformation is Produced, Disseminated, and Can Be Countered. Security and Human Rights. 2018-12-12, 29 (1–4): 55–81. ISSN 1874-7337. S2CID 216896677. doi:10.1163/18750230-02901005.

[157] Feldstein, Steven. The Global Expansion of AI Surveillance. Carnegie Endowment for International Peace. 2019.

[158] Agrawal, Ajay; Gans, Joshua; Goldfarb, Avi. The economics of artificial intelligence: an agenda. Chicago, Illinois. 2019. ISBN 978-0-226-61347-5. OCLC 1099435014 (in American English).

[159] Whittlestone, Jess; Clark, Jack. Why and How Governments Should Monitor AI Development. 2021-08-31. arXiv:2108.12427.

[160] Shevlane, Toby. Sharing Powerful AI Models | GovAI Blog. Center for the Governance of AI. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).

[161] Armstrong, Stuart; Bostrom, Nick; Shulman, Carl. Racing to the Precipice: a Model of Artificial Intelligence Development (Report). Future of Humanity Institute, Oxford University.

[162] Askell, Amanda; Brundage, Miles; Hadfield, Gillian. The Role of Cooperation in Responsible AI Development. 2019-07-10. arXiv:1907.04534.

[163] Dafoe, Allan. AI Governance: A Research Agenda (Report). Centre for the Governance of AI, Future of Humanity Institute, University of Oxford.

[164] Gursoy, Furkan; Kakadiaris, Ioannis A. System Cards for AI-Based Decision-Making for Public Policy. 2022-08-31. arXiv:2203.04754.

[165] Cobbe, Jennifer; Lee, Michelle Seng Ah; Singh, Jatinder. Reviewable Automated Decision-Making: A Framework for Accountable Algorithmic Systems. Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT '21. New York, NY, USA: Association for Computing Machinery. 2021-03-01: 598–609. ISBN 978-1-4503-8309-7. doi:10.1145/3442188.3445921.

[166] Raji, Inioluwa Deborah; Smart, Andrew; White, Rebecca N.; Mitchell, Margaret; Gebru, Timnit; Hutchinson, Ben; Smith-Loud, Jamila; Theron, Daniel; Barnes, Parker. Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing. Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. FAT* '20. New York, NY, USA: Association for Computing Machinery. 2020-01-27: 33–44. ISBN 978-1-4503-6936-7. doi:10.1145/3351095.3372873.

[167] Manheim, David; Martin, Sammy; Bailey, Mark; Samin, Mikhail; Greutzmacher, Ross. The necessity of AI audit standards boards. AI & Society. 2025, 40 (8): 6609–6624. arXiv:2404.13060. doi:10.1007/s00146-025-02320-y.

[168] Novelli, Claudio; Taddeo, Mariarosaria; Floridi, Luciano. Accountability in artificial intelligence: what it is and how it works. AI & Society. 2024, 39 (4): 1871–1882. doi:10.1007/s00146-023-01635-y. hdl:11585/914099.

[169] Manheim, David. Building a Culture of Safety for AI: Perspectives and Challenges. 26 June 2023. SSRN 4491421.

[170] NeMo Guardrails. NVIDIA NeMo Guardrails. [2024-12-08].

[171] Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. Meta AI. [2024-12-08].

[172] Šekrst, Kristina; McHugh, Jeremy. AI Ethics by Design: Implementing Customizable Guardrails for Responsible AI Development. arXiv:2411.14442 [cs.CY].

[173] Dong, Yi; Mu, Ronghui. Building Guardrails for Large Language Models. arXiv:2402.01822 [cs].

[174] D'Alessandro, W. Deontology and safe artificial intelligence. Philosophical Studies. 2024, 182 (7): 1681–1704. doi:10.1007/s11098-024-02174-y.

[175] D'Alessandro, William; Kirk-Giannini, Cameron D. Artificial Intelligence: Approaches to Safety. Philosophy Compass. 2025, 20 (5). doi:10.1111/phc3.70039.

[176] Ziegler, Bart. Is It Time to Regulate AI?. Wall Street Journal. 8 April 2022 [2022-11-24]. (Archived from the original on 2022-11-24).

[177] Reed, Chris. How should we regulate artificial intelligence?. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2018-09-13, 376 (2128). Bibcode:2018RSPTA.37670360R. ISSN 1364-503X. PMC 6107539. PMID 30082306. doi:10.1098/rsta.2017.0360 (in English).

[178] Belton, Keith B. How Should AI Be Regulated?. IndustryWeek. 2019-03-07 [2022-11-24]. (Archived from the original on 2022-01-29).

[179] National Security Commission on Artificial Intelligence. Final Report. 2021.

[180] National Institute of Standards and Technology. AI Risk Management Framework. NIST. 2021-07-12 [2022-11-24]. (Archived from the original on 2022-11-24).

[181] Richardson, Tim. Britain publishes 10-year National Artificial Intelligence Strategy. 2021 [2022-11-24]. (Archived from the original on 2023-02-10).

[182] Guidance: National AI Strategy. GOV.UK. 2021 [2022-11-24]. (Archived from the original on 2023-02-10).

[183] Hardcastle, Kimberley. We're talking about AI a lot right now – and it's not a moment too soon. The Conversation. 2023-08-23 [2023-10-31] (in American English).

[184] Iconic Bletchley Park to host UK AI Safety Summit in early November. GOV.UK. [2023-10-31] (in English).

[185] Colville, Alex. How China Sees AI Safety. China Media Project. 2025-07-30 [2025-08-09] (in American English).

[186] Office of the Director of National Intelligence, Intelligence Advanced Research Projects Activity. IARPA – TrojAI. [2022-11-24]. (Archived from the original on 2022-11-24).

[187] Turek, Matt. Explainable Artificial Intelligence. [2022-11-24]. (Archived from the original on 2021-02-19).

[188] Draper, Bruce. Guaranteeing AI Robustness Against Deception. Defense Advanced Research Projects Agency. [2022-11-24]. (Archived from the original on 2023-01-09).

[189] National Science Foundation. Safe Learning-Enabled Systems. 23 February 2023 [2023-02-27]. (Archived from the original on 2023-02-26).

[190] General Assembly adopts landmark resolution on artificial intelligence. UN News. 21 March 2024 [21 April 2024]. (Archived from the original on 20 April 2024).

[191] Say, Mark. DSIT announces funding for research on AI safety. 23 May 2024 [11 June 2024]. (Archived from the original on 24 May 2024).

[192] Renshaw, Jarrett; Hunnicutt, Trevor. Biden, Xi agree that humans, not AI, should control nuclear arms. Reuters. November 16, 2024 [February 11, 2026].

[193] Khalid, Asma. Biden and Xi take a first step to limit AI and nuclear decisions at their last meeting. NPR. 2024-11-16 [2026-02-11] (in English).

[194] FY2025 NDAA, Section 1638 ("Sense of Congress with respect to use of artificial intelligence to support strategic deterrence"). Emerging Technology Observatory. Center for Security and Emerging Technology at Georgetown University. [27 February 2026].

[195] H.R.5009 - Servicemember Quality of Life Improvement and National Defense Authorization Act for Fiscal Year 2025. Congress.gov. United States Congress. [27 February 2026].

[196] The hypothetical nuclear attack that escalated the Pentagon’s showdown with Anthropic. The Washington Post. [27 February 2026].

[197] U.N. General Assembly opens with urgent plea for binding AI safeguards. NBC News. 2025-09-22 [2026-04-05] (in English).

[198] Ensuring a National Policy Framework for Artificial Intelligence. The White House. 2025-12-11 [2026-02-27] (in American English).

[199] Johnson, Khari. Trump’s new order against AI regulation hits California especially hard. CalMatters. 2025-12-12 [2026-02-27] (in American English).

[200] Removing Barriers to American Leadership in Artificial Intelligence. Federal Register. 2025-01-31 [2026-02-27] (in English).

[201] Trump’s AI Order Is More Bark than Bite | Brennan Center for Justice. www.brennancenter.org. 2026-02-25 [2026-02-27] (in English).

[202] Mäntymäki, Matti; Minkkinen, Matti; Birkstedt, Teemu; Viljanen, Mika. Defining organizational AI governance. AI and Ethics. 2022, 2 (4): 603–609. ISSN 2730-5953. S2CID 247119668. doi:10.1007/s43681-022-00143-x (in English).

[203] Brundage, Miles; Avin, Shahar; Wang, Jasmine; Belfield, Haydn; Krueger, Gretchen; Hadfield, Gillian; Khlaaf, Heidy; Yang, Jingying; Toner, Helen; Fong, Ruth; Maharaj, Tegan. Toward Trustworthy AI Development: Mechanisms for Supporting Verifiable Claims. 2020-04-20. arXiv:2004.07213.

[204] Welcome to the Artificial Intelligence Incident Database. [2022-11-24]. (Archived from the original on 2022-11-24).

[205] Shevlane, Toby. Sharing Powerful AI Models | GovAI Blog. Center for the Governance of AI. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).

[206] Wiblin, Robert; Harris, Keiran. Nova DasSarma on why information security may be critical to the safe development of AI systems. 80,000 Hours. 2022 [2022-11-24]. (Archived from the original on 2022-11-24).

[207] OpenAI. Best Practices for Deploying Language Models. OpenAI. 2022-06-02 [2022-11-24]. (Archived from the original on 2023-03-15).

[208] OpenAI. OpenAI Charter. OpenAI. [2022-11-24]. (Archived from the original on 2021-03-04).

[209] Future of Life Institute. AI Principles. Future of Life Institute. [2022-11-23]. (Archived from the original on 2022-11-23).

[210] Future of Life Institute. Autonomous Weapons Open Letter: AI & Robotics Researchers. Future of Life Institute. 2016 [2022-11-24].

[211] Chow, Andrew R. The People vs. AI. TIME. 2026-02-19 [2026-03-05] (in English).

[212] Wilkins, Emily. Anthropic gives $20 million to group pushing for AI regulations ahead of 2026 elections. CNBC. 2026-02-12 [2026-03-05] (in English).

[213] The Silicon Valley billionaires spending big to write America’s AI rules. Financial Times. February 26, 2026.

[214] Schleifer, Theodore; Tan, Eli. Silicon Valley Pledges $200 Million to New Pro-A.I. Super PACs. The New York Times. 2025-08-26 [2026-03-05]. ISSN 0362-4331 (in American English).

[215] GGE on lethal autonomous weapons systems. Digital Watch Observatory. 2025-11-27 [2026-04-26].

[216] Statements at the First 2025 GGE LAWS Session. APILS. 2025-03-09 [2026-04-26].
