Using XPath to Retrieve Content from XML and HTML Documents

Posted 3 days ago
XPath is a technique for precisely and efficiently locating and extracting information from tree-structured data such as HTML and XML.

XPath Overview

XPath is a language used to specify and extract specific elements and attributes in an XML or HTML document tree. Since web pages are often composed of HTML, XPath is widely used in various scenarios, such as web scraping, data retrieval, and XML data search.

Typical uses include:

  • Data Extraction in Web Crawlers: XPath is used to automatically fetch specific data, such as product information on e-commerce websites or article titles on news websites.
  • XML Data Search and Processing: Business systems and data integration pipelines often exchange data in XML format. With XPath, you can efficiently extract the information you need from large amounts of XML data.
  • Parsing API Responses (XML Format): When using XML-based APIs, such as SOAP, XPath is used to parse response data. This allows you to quickly obtain specific status codes and result data, improving the accuracy of system integration and automated processing.

Recommended tools

Common tools for learning and using XPath include:

  • Chrome Developer Tools: Built into the Chrome browser and opened with the F12 key. In the "Elements" panel you can locate an HTML element by hovering over or clicking it, then right-click the element to copy its XPath expression for verification and testing. For example, when analyzing an e-commerce product page, you can quickly obtain the XPath for the product name, price, and other elements.
  • Firefox Developer Tools: Also built into the browser and similar to Chrome Developer Tools. They let you locate elements in the page and view and test XPath expressions, which makes them a good option for developers who prefer Firefox.
  • Online XPath testing tools: Online tools such as "XPath Tester" need no installation and run in a web page. Paste the content of an XML or HTML document, enter an XPath expression, and see the matches in real time. These tools are especially suitable for beginners and for simple XPath exercises.
  • XPath Helper Plugin: In Chrome, for example, the XPath Helper plugin adds a floating window to the page that displays the XPath of the element under the mouse pointer. You can edit and test XPath expressions directly in that window.

XPath Syntax Summary

Select nodes

XPath uses path expressions to select nodes in an XML document. A node is selected by following a path, which is made up of steps. The most useful path expressions are listed below:

Expression    Description
nodename      Selects the child elements named nodename of the current node.
/             Selects from the root node when it starts a path; between steps it selects child nodes.
//            Selects matching descendant nodes regardless of their depth; at the start of a path it searches the whole document.
.             Selects the current node.
..            Selects the parent of the current node.
@             Selects an attribute.

The table below lists some path expressions along with their results:

Path expression    Result
bookstore          Selects all child elements named bookstore of the current node.
/bookstore         Selects the root element bookstore. Note: a path that starts with a forward slash ( / ) is always an absolute path, evaluated from the document root.
bookstore/book     Selects all book elements that are children of bookstore.
//book             Selects all book elements, regardless of their position in the document.
bookstore//book    Selects all book elements that are descendants of the bookstore element, at any depth below it.
//@lang            Selects all attributes named lang.
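
As a rough illustration of the expressions above, the following sketch uses Python's standard-library `xml.etree.ElementTree`, which implements only a subset of XPath. The sample document and its titles are hypothetical (the article does not show its XML). ElementTree evaluates paths relative to an element, so `//book` is written `.//book`, and it cannot return attribute nodes, so `//@lang` is approximated by reading attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the article.
XML = """
<bookstore>
  <book>
    <title lang="eng">Book A</title>
    <price>29.99</price>
  </book>
  <book>
    <title lang="fra">Book B</title>
    <price>39.95</price>
  </book>
</bookstore>
"""

root = ET.fromstring(XML)  # root is the <bookstore> element

# "book": child elements named book of the context node (bookstore).
children = root.findall("book")

# "//book": ElementTree spells the descendant search as ".//book".
anywhere = root.findall(".//book")

# "bookstore//title", seen from the bookstore element itself.
titles = [t.text for t in root.findall(".//title")]

# "//@lang": ElementTree cannot select attribute nodes, so collect the
# attribute values from every element that carries a lang attribute.
langs = [e.get("lang") for e in root.iter() if "lang" in e.attrib]

# "..": the parent of each title element (duplicates are dropped).
parents = [p.tag for p in root.findall(".//title/..")]

print(len(children), len(anywhere), titles, langs, parents)
```

A full XPath 1.0 engine would accept the expressions exactly as written in the table; the adjustments here exist only because of ElementTree's limited dialect.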

Predicates

A predicate is used to find a specific node, or nodes that contain a specific value.
A predicate is written in square brackets after the step it filters.
The table below lists some path expressions with predicates, along with their results:

Path expression                        Result
/bookstore/book[1]                     Selects the first book element that is a child of bookstore.
/bookstore/book[last()]                Selects the last book element that is a child of bookstore.
/bookstore/book[last()-1]              Selects the second-to-last book element that is a child of bookstore.
/bookstore/book[position()<3]          Selects the first two book elements that are children of bookstore.
//title[@lang]                         Selects all title elements that have an attribute named lang.
//title[@lang='eng']                   Selects all title elements whose lang attribute has the value eng.
/bookstore/book[price>35.00]           Selects all book children of bookstore that have a price element with a value greater than 35.00.
/bookstore/book[price>35.00]//title    Selects all title elements that are descendants of the book children of bookstore whose price element has a value greater than 35.00.
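
The predicates above can be tried out with a small sketch in Python's standard-library `xml.etree.ElementTree`. The sample data is hypothetical. ElementTree supports positional predicates (`[1]`, `[last()]`, `[last()-1]`) and attribute predicates, but not `position()<3` or numeric comparisons such as `price>35.00`, so those two are emulated in plain Python here and are not real XPath evaluation.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the article.
XML = """
<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title lang="fra">Book B</title><price>39.95</price></book>
  <book><title>Book C</title><price>49.00</price></book>
</bookstore>
"""
root = ET.fromstring(XML)


def titles(books):
    """Return the title text of each book element."""
    return [b.findtext("title") for b in books]


# Positional predicates supported by ElementTree (positions start at 1).
first = titles(root.findall("book[1]"))
last = titles(root.findall("book[last()]"))
penultimate = titles(root.findall("book[last()-1]"))

# Attribute predicates.
with_lang = [t.text for t in root.findall(".//title[@lang]")]
eng_only = [t.text for t in root.findall(".//title[@lang='eng']")]

# Emulation of book[position()<3]: slice the list of children.
first_two = titles(root.findall("book")[:2])

# Emulation of book[price>35.00]: compare the price text as a number.
expensive = titles(
    b for b in root.findall("book") if float(b.findtext("price")) > 35.00
)

print(first, last, penultimate, with_lang, eng_only, first_two, expensive)
```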

Select unknown nodes

XPath wildcards can be used to select XML nodes whose names are not known in advance.

Wildcard    Description
*           Matches any element node.
@*          Matches any attribute node.
node()      Matches any node of any type.

The table below lists some path expressions along with their results:

Path expression    Result
/bookstore/*       Selects all child elements of the bookstore element.
//*                Selects all elements in the document.
//title[@*]        Selects all title elements that have at least one attribute.
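
A minimal sketch of the wildcard expressions, again using Python's standard-library `xml.etree.ElementTree` with hypothetical sample data. ElementTree understands `*` but not `@*`, so `//title[@*]` is approximated by checking whether an element's attribute dictionary is non-empty; `//*` is approximated with `iter()`, which also visits the root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the article.
XML = """
<bookstore>
  <book><title lang="eng">Book A</title><price>29.99</price></book>
  <book><title>Book B</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(XML)

# "/bookstore/*": every child element of bookstore, whatever its name.
child_tags = [e.tag for e in root.findall("*")]

# "//*": every element in the document, in document order.
# root.findall(".//*") would leave out the root itself, so iter() is used.
all_tags = [e.tag for e in root.iter()]

# "//title[@*]": title elements with at least one attribute (emulated).
titles_with_attrs = [t.text for t in root.iter("title") if t.attrib]

print(child_tags, all_tags, titles_with_attrs)
```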

Select several paths

The "|" operator combines path expressions, so a single expression can select several paths at once.

The table below lists some path expressions along with their results:

Path expression                    Result
//book/title | //book/price        Selects all title and price elements that are children of a book element.
//title | //price                  Selects all title and price elements in the document.
/bookstore/book/title | //price    Selects all title elements of the book children of bookstore, and all price elements in the document.
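
Python's standard-library `xml.etree.ElementTree` does not implement the `|` operator, so the sketch below only emulates `//title | //price` on hypothetical sample data. It filters a document-order traversal by tag name, which mirrors what most XPath engines return for a union; simply concatenating two separate queries gives the same nodes but grouped by query.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, not taken from the article.
XML = """
<bookstore>
  <book><title>Book A</title><price>29.99</price></book>
  <book><title>Book B</title><price>39.95</price></book>
</bookstore>
"""
root = ET.fromstring(XML)

# Emulated "//title | //price": one pass in document order.
wanted = {"title", "price"}
union = [(e.tag, e.text) for e in root.iter() if e.tag in wanted]

# Concatenating two queries yields the same nodes, grouped per query.
concatenated = [
    e.tag for e in root.findall(".//title") + root.findall(".//price")
]

print(union)
print(concatenated)
```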

XPath functions

The following list covers the core XPath functions together with the XSLT-specific additions to XPath. Functions marked "XSLT specific" are defined by XSLT rather than by XPath itself.

  • boolean()
  • ceiling()
  • choose()
  • concat()
  • contains()
  • count()
  • current() XSLT specific
  • document() XSLT specific
  • element-available() XSLT specific
  • false()
  • floor()
  • format-number() XSLT specific
  • function-available() XSLT specific
  • generate-id() XSLT specific
  • id()
  • key() XSLT specific
  • lang()
  • last()
  • local-name()
  • name()
  • namespace-uri()
  • normalize-space()
  • not()
  • number()
  • position()
  • round()
  • starts-with()
  • string()
  • string-length()
  • substring()
  • substring-after()
  • substring-before()
  • sum()
  • system-property() XSLT specific
  • translate()
  • true()
  • unparsed-entity-uri() XSLT specific

Practical tests

Open Chrome and use an XPath expression in the developer tools to get the text content of all <a> (link) elements on the page.
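
The command used in the browser is not preserved in this copy of the article. As a stand-in outside the browser, the following sketch applies the same idea (select every `a` element, then read its text) with Python's standard-library `xml.etree.ElementTree`. The page markup is hypothetical and deliberately well-formed; real-world HTML is often not valid XML and would need an HTML parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <p><a href="/first">First link</a></p>
    <div><a href="/second">Second <b>bold</b> link</a></div>
  </body>
</html>
"""
root = ET.fromstring(PAGE)

# ".//a" finds every <a> element below the root; itertext() gathers the
# text of the link including text inside nested elements such as <b>.
link_texts = ["".join(a.itertext()).strip() for a in root.findall(".//a")]

print(link_texts)
```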






