Master XML Path Queries with Ease: Unlock Data Extraction in One Go

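Introduction

XML (Extensible Markup Language) is a markup language for storing and transporting data, widely used for web content, configuration files, and similar purposes. When working with XML data, path queries are an essential skill. This article walks through the main XML path query methods so that you can extract the data you need efficiently.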

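XML Basics

Before learning XML path queries, we need to cover a few XML fundamentals.

XML Structure

An XML document is made up of elements, attributes, and text. Here is a simple XML example: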

<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Learn XML</title>
    <author>John Doe</author>
    <price>39.99</price>
  </book>
</bookstore>

In this example, <bookstore> is the root element and contains two <book> elements. Each <book> element in turn contains <title>, <author>, and <price> elements.
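To make the structure concrete, here is a minimal sketch that parses the sample document with Python's standard xml.etree.ElementTree module and walks the tree. The variable names (xml_data, root, structure) are chosen for illustration only.

```python
import xml.etree.ElementTree as ET

xml_data = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Learn XML</title>
    <author>John Doe</author>
    <price>39.99</price>
  </book>
</bookstore>
"""

# Parse the string; the returned element is the root, <bookstore>
root = ET.fromstring(xml_data)
print(root.tag)   # tag name of the root element
print(len(root))  # number of direct child elements

# Iterating an element yields its direct children in document order
structure = [[(child.tag, child.text) for child in book] for book in root]
for book_fields in structure:
    print(book_fields)
```

Each inner list holds the (tag, text) pairs of one <book> element, which mirrors the nesting described above.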

XML Namespaces

XML namespaces distinguish elements that share the same name. Here is an XML example that uses a namespace:

<ns:bookstore xmlns:ns="http://www.example.com">
  <ns:book>
    <ns:title>Harry Potter</ns:title>
    <ns:author>J.K. Rowling</ns:author>
    <ns:price>29.99</ns:price>
  </ns:book>
  <ns:book>
    <ns:title>Learn XML</ns:title>
    <ns:author>John Doe</ns:author>
    <ns:price>39.99</ns:price>
  </ns:book>
</ns:bookstore>

In this example, the <ns:bookstore> element, along with every other element carrying the ns prefix, belongs to the namespace http://www.example.com.
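One way to see what "belongs to a namespace" means in practice is to parse a namespaced document and inspect the tag names. The sketch below uses Python's xml.etree.ElementTree, which reports each tag in the form {namespace-uri}local-name; the shortened document and the variable names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

ns_xml = """
<ns:bookstore xmlns:ns="http://www.example.com">
  <ns:book>
    <ns:title>Harry Potter</ns:title>
  </ns:book>
</ns:bookstore>
"""

ns_root = ET.fromstring(ns_xml)

# The prefix "ns" is replaced by the full namespace URI in braces
print(ns_root.tag)

# Child elements carry the same expanded form
child_tags = [child.tag for child in ns_root]
print(child_tags)
```

The prefix itself is only shorthand inside the document; what identifies the element is the URI plus the local name.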

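XML Path Queries

An XML path query, usually written in XPath syntax, locates specific elements in an XML document. The most commonly used forms are listed below: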

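1. Basic Paths

A basic path uses a single slash / to describe the route from the root element to the target element, one level per step. Some examples:

/bookstore/book/title: selects the <title> child elements of every <book> element directly under the root <bookstore> element.
/bookstore/book[author='J.K. Rowling']: selects every <book> element under the root that has an <author> child element whose text is "J.K. Rowling". The bracketed part is a predicate that tests a child element; author is an element here, not an attribute.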
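The two basic paths above can be tried out with a short sketch in Python's xml.etree.ElementTree. Note one adaptation: ElementTree supports only a subset of XPath and evaluates paths relative to the element they are called on, so the leading /bookstore step is dropped because root already is the <bookstore> element. Variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_data = """
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Learn XML</title>
    <author>John Doe</author>
    <price>39.99</price>
  </book>
</bookstore>
"""

root = ET.fromstring(xml_data)  # root is the <bookstore> element

# Counterpart of /bookstore/book/title, relative to the root
titles = [title.text for title in root.findall('book/title')]
print(titles)

# Counterpart of /bookstore/book[author='J.K. Rowling']:
# the predicate tests the text of the <author> child element
rowling_books = root.findall("book[author='J.K. Rowling']")
print([book.find('title').text for book in rowling_books])
```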

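2. Recursive Paths

A recursive path uses a double slash // to match elements at any depth. At the start of an expression, // searches the whole document; between two steps, it matches all descendants of the preceding step. Some examples:

//book/title: selects the <title> child elements of every <book> element, wherever that <book> appears in the document.
//book[author='J.K. Rowling']: selects every <book> element, at any depth, that has an <author> child element whose text is "J.K. Rowling".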
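The difference between a single-level step and a recursive step shows up only when elements are nested at different depths. The sketch below uses a hypothetical variant of the bookstore document with an extra <section> wrapper (not part of the original sample) and Python's xml.etree.ElementTree, where the recursive form is written .// relative to the current element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: one <book> sits directly under the root,
# another is nested one level deeper inside a <section> element.
nested_xml = """
<bookstore>
  <book><title>Harry Potter</title></book>
  <section>
    <book><title>Learn XML</title></book>
  </section>
</bookstore>
"""

root = ET.fromstring(nested_xml)

# Single-level path: only <book> elements that are direct children
direct = [t.text for t in root.findall('book/title')]
print(direct)

# Recursive path: <book> elements at any depth below the root
anywhere = [t.text for t in root.findall('.//book/title')]
print(anywhere)
```

The recursive query is convenient when the exact nesting is unknown, at the cost of scanning the whole subtree.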

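3. Attribute Paths

An attribute path uses the @ symbol to refer to an attribute of an element. Some examples:

/bookstore/book/@author: selects the author attribute of every <book> element under the root.
//book/@price: selects the price attribute of every <book> element in the document.

These expressions apply to documents that store the data as attributes, for example <book author="J.K. Rowling" price="29.99">. In the sample document above, author and price are child elements rather than attributes, so both expressions would match nothing there.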
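Here is a minimal sketch of attribute handling in Python's xml.etree.ElementTree. Two assumptions to flag: the document below is hypothetical (a category attribute is added purely for illustration), and ElementTree's XPath subset accepts attribute tests inside predicates, such as [@category], but does not return attribute values from a trailing /@name step, so the values are read with Element.get().

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "category" is an illustrative attribute
attr_xml = """
<bookstore>
  <book category="fiction"><title>Harry Potter</title></book>
  <book category="tech"><title>Learn XML</title></book>
  <book><title>Untitled Draft</title></book>
</bookstore>
"""

root = ET.fromstring(attr_xml)

# [@category] keeps only <book> elements that have the attribute;
# Element.get() then reads the attribute value
categories = [book.get('category') for book in root.findall('book[@category]')]
print(categories)

# [@category='tech'] filters on the attribute value
tech_titles = [book.find('title').text
               for book in root.findall("book[@category='tech']")]
print(tech_titles)
```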

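4. Querying Across Namespaces

In XPath, namespace-uri() is a function that returns the namespace URI of a node. It is used inside a predicate, typically together with local-name(), to match elements by namespace without relying on a prefix. Some examples:

//*[local-name()='bookstore' and namespace-uri()='http://www.example.com']: selects the <bookstore> elements whose namespace is http://www.example.com.
//*[local-name()='title' and namespace-uri()='http://www.example.com']: selects all <title> elements whose namespace is http://www.example.com.

Most XPath processors also let you bind a prefix to the namespace URI and then write the path with that prefix, for example /ns:bookstore/ns:book/ns:title.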
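In Python's xml.etree.ElementTree, the XPath functions namespace-uri() and local-name() are not available, so namespaced elements are matched in other ways. The sketch below shows two that the module does support: passing a prefix-to-URI mapping to findall(), and writing the tag in {uri}name form. The shortened document and the names namespaces, titles, and titles_clark are illustrative.

```python
import xml.etree.ElementTree as ET

ns_xml = """
<ns:bookstore xmlns:ns="http://www.example.com">
  <ns:book>
    <ns:title>Harry Potter</ns:title>
  </ns:book>
  <ns:book>
    <ns:title>Learn XML</ns:title>
  </ns:book>
</ns:bookstore>
"""

ns_root = ET.fromstring(ns_xml)

# Option 1: map a prefix to the namespace URI and use it in the path.
# The prefix chosen here does not have to match the one in the document.
namespaces = {'ns': 'http://www.example.com'}
titles = [t.text for t in ns_root.findall('ns:book/ns:title', namespaces)]
print(titles)

# Option 2: spell out the namespace URI in braces before the local name
titles_clark = [t.text
                for t in ns_root.findall('.//{http://www.example.com}title')]
print(titles_clark)
```

Both queries select the same elements; the prefix mapping tends to be easier to read when a path has several steps.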

Practical Example

The following example runs XML path queries with Python's xml.etree.ElementTree module, which supports a subset of XPath:

import xml.etree.ElementTree as ET

xml_data = '''
<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J.K. Rowling</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Learn XML</title>
    <author>John Doe</author>
    <price>39.99</price>
  </book>
</bookstore>
'''

root = ET.fromstring(xml_data)

# Find all book titles
titles = [book.find('title').text for book in root.findall('book')]
print('All titles:', titles)

# Find the prices of books whose author is "J.K. Rowling"
prices = [book.find('price').text for book in root.findall('.//book[author="J.K. Rowling"]')]
print('Prices of books by "J.K. Rowling":', prices)