XPath is a technology for accurately and efficiently specifying and extracting the required information from tree-structured data, such as HTML and XML.
XPath Overview
XPath is a language used to specify and extract specific elements and attributes in an XML or HTML document tree. Since web pages are often composed of HTML, XPath is widely used in various scenarios, such as web scraping, data retrieval, and XML data search.
Typical usage is as follows:
- Data Extraction in Web Crawlers: XPath is used to automatically fetch specific data, such as product information on e-commerce websites or article titles on news websites.
- XML Data Search and Processing: In business systems and data integration, data is often exchanged in XML format. With XPath, you can efficiently extract the necessary information from large amounts of XML data.
- Parsing API Responses (XML Format): When using XML-based APIs, such as SOAP, XPath is used to parse response data. This allows you to quickly obtain specific status codes and result data, improving the accuracy of system integration and automated processing.
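As a minimal sketch of the API-response case, the snippet below pulls a status code and a result value out of a SOAP-style response with Python's standard-library `xml.etree.ElementTree`, which implements a subset of XPath. The response body, the `ex` namespace URI, and the element names `GetOrderResponse`, `Status`, and `Result` are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical SOAP 1.1 response; the "ex" namespace and its elements are made up.
SOAP_RESPONSE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:ex="http://example.com/orders">
  <soap:Body>
    <ex:GetOrderResponse>
      <ex:Status>200</ex:Status>
      <ex:Result>shipped</ex:Result>
    </ex:GetOrderResponse>
  </soap:Body>
</soap:Envelope>"""

# Prefix-to-URI mapping used to resolve the prefixes inside the path expressions.
NS = {
    "soap": "http://schemas.xmlsoap.org/soap/envelope/",
    "ex": "http://example.com/orders",
}

root = ET.fromstring(SOAP_RESPONSE)  # root is the <soap:Envelope> element

# Step-by-step path from the envelope down to the status element.
status = root.findtext("soap:Body/ex:GetOrderResponse/ex:Status", namespaces=NS)

# Descendant search: find ex:Result at any depth below the envelope.
result = root.findtext(".//ex:Result", namespaces=NS)

print(status, result)
```

Namespaced documents need the prefix mapping; without it, a path such as `.//Result` would not match the namespaced element.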
Recommended tools
Common tools for learning and using XPath include:
- Chrome Developer Tools: Built into the Chrome browser and opened by pressing the F12 key. In the "Elements" panel, you can locate HTML elements on a web page by hovering and clicking, and right-clicking an element lets you copy its XPath expression for verification and testing. For example, when analyzing an e-commerce product page, you can use Chrome Developer Tools to quickly obtain the XPath for product names, prices, and other elements.
- Firefox Developer Tools: Also built into the browser and similar to Chrome Developer Tools. They let you locate elements in a page and view and test XPath expressions. This is a good option for developers who are used to Firefox.
- Online XPath testing tools: Online tools such as "XPath Tester" need no installation and run in a web page. Paste the content of an XML or HTML document, enter an XPath expression, and see the matches in real time. These tools are especially suitable for beginners who want to get started quickly and do simple XPath exercises.
- XPath Helper plugin: In Chrome, for example, the XPath Helper plugin adds a floating window to the browser that displays the XPath of the element under the mouse pointer. You can edit and test XPath expressions directly in that window, which speeds up development.
XPath Abstract Syntax
Select a node
XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path made up of steps. The most useful path expressions are listed below:

| Expression | Description |
| --- | --- |
| nodename | Selects the child elements of the current node that are named nodename. |
| / | Selects from the root node; between steps, it selects direct child nodes. |
| // | Selects matching nodes below the current node, regardless of their depth (descendant nodes). |
| . | Selects the current node. |
| .. | Selects the parent of the current node. |
| @ | Selects an attribute. |
In the table below, we have listed some path expressions along with the results of the expressions:

| Path expression | Result |
| --- | --- |
| bookstore | Selects all child elements of the current node that are named bookstore. |
| /bookstore | Selects the root element bookstore. Note: a path that starts with a forward slash (/) is always an absolute path to an element. |
| bookstore/book | Selects all book elements that are children of bookstore. |
| //book | Selects all book elements, regardless of their position in the document. |
| bookstore//book | Selects all book elements that are descendants of the bookstore element, at any depth below it. |
| //@lang | Selects all attributes named lang. |
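
The path expressions above can be tried out with a short Python sketch. It uses the standard-library `xml.etree.ElementTree`, which supports only a subset of XPath: paths are evaluated relative to an element (so `//book` is written `.//book`), and attribute nodes such as `//@lang` cannot be returned directly. The sample document is an assumption, modelled on the bookstore used in the tables.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document modelled on the bookstore in the tables.
BOOKSTORE_XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)  # root is the <bookstore> element

# "book": child elements of the context node (the root) named book.
child_books = root.findall("book")

# ".//book": book elements at any depth below the context node.
all_books = root.findall(".//book")

# Stand-in for "//@lang": select the title elements, then read the
# attribute in Python, because ElementTree cannot return attribute nodes.
langs = [t.get("lang") for t in root.findall(".//title")]

# "..": the parent of each matched title, which is a book element here.
parents = [p.tag for p in root.findall(".//title/..")]

titles = [b.findtext("title") for b in child_books]
print(titles, len(all_books), langs, parents)
```

A full XPath 1.0 engine would accept the expressions exactly as written in the table; the leading `.` is specific to this library.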
Predicates
A predicate is used to find a specific node or a node containing a specified value. The predicate is embedded in square brackets. In the table below, we list some path expressions with predicates, and the results of the expressions:

| Path expression | Result |
| --- | --- |
| /bookstore/book[1] | Selects the first book element that is a child of the bookstore element. |
| /bookstore/book[last()] | Selects the last book element that is a child of the bookstore element. |
| /bookstore/book[last()-1] | Selects the second-to-last book element that is a child of the bookstore element. |
| /bookstore/book[position()<3] | Selects the first two book elements that are children of the bookstore element. |
| //title[@lang] | Selects all title elements that have an attribute named lang. |
| //title[@lang='eng'] | Selects all title elements whose lang attribute has the value eng. |
| /bookstore/book[price>35.00] | Selects all book children of the bookstore element that have a price element with a value greater than 35.00. |
| /bookstore/book[price>35.00]//title | Selects all title elements that are descendants of the book children of the bookstore element whose price element has a value greater than 35.00. |
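
A minimal sketch of these predicates in Python follows, again with the standard-library `xml.etree.ElementTree`. That module supports positional and attribute predicates but not comparisons such as `price>35.00` or `position()<3`, so those two are emulated in plain Python here. The sample document, including the third book without a `lang` attribute, is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the third title deliberately has no lang attribute.
BOOKSTORE_XML = """<bookstore>
  <book><title lang="eng">Harry Potter</title><price>29.99</price></book>
  <book><title lang="eng">Learning XML</title><price>39.95</price></book>
  <book><title>Untitled Draft</title><price>50.00</price></book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)  # context node is <bookstore>

# Positional predicates; XPath positions start at 1, not 0.
first = root.find("book[1]").findtext("title")
last = root.find("book[last()]").findtext("title")
second_to_last = root.find("book[last()-1]").findtext("title")

# Attribute predicates: attribute present, and attribute equal to a value.
with_lang = [t.text for t in root.findall(".//title[@lang]")]
eng_only = [t.text for t in root.findall(".//title[@lang='eng']")]

# Emulation of book[position()<3]: slice the list of book children.
first_two = [b.findtext("title") for b in root.findall("book")[:2]]

# Emulation of book[price>35.00]: compare the numeric value in Python.
expensive = [
    b.findtext("title")
    for b in root.findall("book")
    if float(b.findtext("price")) > 35.00
]

print(first, last, second_to_last, with_lang, eng_only, first_two, expensive)
```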
Select unknown nodes
XPath wildcards can be used to select XML nodes whose names are not known in advance.

| Wildcard | Description |
| --- | --- |
| * | Matches any element node. |
| @* | Matches any attribute node. |
| node() | Matches any node of any type. |
In the table below, we list some path expressions, and the results of these expressions:

| Path expression | Result |
| --- | --- |
| /bookstore/* | Selects all the child elements of the bookstore element. |
| //* | Selects all elements in the document. |
| //title[@*] | Selects all title elements that have at least one attribute. |
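
The wildcard expressions can be sketched in Python as follows. This assumes the standard-library `xml.etree.ElementTree`, which supports `*` but not the `[@*]` predicate, so the last case is emulated by checking each element's attribute dictionary. Note that `.//*` evaluated from the root returns its descendants only, whereas `//*` in full XPath also includes the root element. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the first title carries an attribute.
BOOKSTORE_XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title>Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)  # context node is <bookstore>

# "*": every child element of the context node, whatever its name.
children = [el.tag for el in root.findall("*")]

# ".//*": every element below the context node, in document order.
descendants = [el.tag for el in root.findall(".//*")]

# Emulation of //title[@*]: keep titles whose attribute dict is non-empty.
titles_with_attrs = [t.text for t in root.iter("title") if t.attrib]

print(children, descendants, titles_with_attrs)
```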
Select several paths
By using the "|" operator in a path expression, you can select several paths at once.
In the table below, we list some path expressions, and the results of these expressions:

| Path expression | Result |
| --- | --- |
| //book/title \| //book/price | Selects all title and price elements that are children of book elements. |
| //title \| //price | Selects all title and price elements in the document. |
| /bookstore/book/title \| //price | Selects all title elements of book elements that are children of the bookstore element, plus all price elements in the document. |
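
The standard-library `xml.etree.ElementTree` does not implement the "|" operator, so the sketch below emulates a union rather than evaluating one. It walks the tree once and keeps the elements whose tag is in the wanted set, which preserves document order the way an XPath union does. The sample document is hypothetical; a full XPath engine would accept the expressions from the table directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document modelled on the bookstore in the tables.
BOOKSTORE_XML = """<bookstore>
  <book>
    <title lang="eng">Harry Potter</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="eng">Learning XML</title>
    <price>39.95</price>
  </book>
</bookstore>"""

root = ET.fromstring(BOOKSTORE_XML)
wanted = {"title", "price"}

# Emulation of "//book/title | //book/price": children of any book element
# whose tag is title or price, collected in document order.
book_fields = [
    child
    for book in root.iter("book")
    for child in book
    if child.tag in wanted
]

# Emulation of "//title | //price": any element in the tree with a wanted tag.
anywhere = [el for el in root.iter() if el.tag in wanted]

pairs = [(el.tag, el.text) for el in book_fields]
print(pairs, len(anywhere))
```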
XPath functions
The following list covers the core XPath functions and the XSLT-specific additions to XPath. Entries marked "XSLT specific" are defined by XSLT rather than by XPath itself.
- boolean()
- ceiling()
- choose()
- concat()
- contains()
- count()
- current() XSLT specific
- document() XSLT specific
- element-available() XSLT specific
- false()
- floor()
- format-number() XSLT specific
- function-available() XSLT specific
- generate-id() XSLT specific
- id()
- key() XSLT specific
- lang()
- last()
- local-name()
- name()
- namespace-uri()
- normalize-space()
- not()
- number()
- position()
- round()
- starts-with()
- string()
- string-length()
- substring()
- substring-after()
- substring-before()
- sum()
- system-property() XSLT specific
- translate()
- true()
- unparsed-entity-uri() XSLT specific
Practical tests
Open Chrome and get the text content of all <a> tag links with the following command:
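
As an offline counterpart to the browser check (not the Chrome command itself), the sketch below extracts the text of every link from a small page in Python. It assumes the markup is well-formed XML, which real-world HTML often is not, and the page content is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """<html><body>
  <p>See <a href="/docs">the <b>docs</b></a> or <a href="/blog">blog</a>.</p>
</body></html>"""

root = ET.fromstring(PAGE)

# ".//a" finds every <a> element below the root; itertext() joins the text
# of the link and of any nested elements such as <b>.
link_texts = ["".join(a.itertext()).strip() for a in root.findall(".//a")]

print(link_texts)
```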