The IMPORTXML function is a powerful function that allows you to import data from structured data types like XML, HTML, CSV, TSV, RSS, and ATOM XML.
If you want to scrape a website or mine data, you can use this function to do this directly in Google Sheets.
In this tutorial, I will show you how to use the IMPORTXML function in Google Sheets.
Table of Contents
The syntax of the IMPORTXML function is:
- url – this is the URL of the page that you want to import data for. It must include the protocol (http:// or https://). It also must be in quotation marks or a reference to a cell containing the URL
- xpath_query – this is the XPath query to run on the structured data. XPath is a language that is used to query data stored on a website. In order to return the data you want, you will need to use the correct XPath query to return the information you are looking for
XPath is a language that is used to query data from XML documents.
Learning the syntax of XPath can seem very overwhelming if you are new to using it.
But it is really not that complicated.
XPath uses a path or step-like expressions to query data similar to the way computer files are structured.
For example, take a look at the way this folder is structured.
The documents folder is inside of the “Desktop” folder, which is inside of the “This PC” folder.
So the structure of these files is something like this:
XPath uses this same kind of progression to query data.
With XPath, you select the HTML elements from a webpage that you want to return.
To illustrate what I mean by this, here is the HTML source code of a webpage. To access the source page of a website, right-click and select View page source.
With XPath, you would use the various HTML tags (<h1>, <h2>, <p>, <a>, <body>, etc.) for the data that you want to import.
I’m not going to go into too much detail on XPath in this guide.
But here are some basic rules for selecting certain elements with XPath:
- // – this will select elements in the document from the current element that matches the selection
- / – select from the root element
- . – select the current element
- .. – selects the parent of the current element
- @ – selects attributes
Now that we have covered some basics of selecting elements, let’s take a look at some more specific examples
- //h1 – this would select all of the h1 elements
- //p – this would select all p elements
- [@class]= – this would select all the elements that match the specified class
I highly recommend you check out this post to learn the syntax of XPath.
Using the IMPORTXML Function
Now that we have covered the basics of the syntax and an overview of how the Xpath language works, let’s look at how to actually use this function.
In this example, I will be using the function to import data from a Wikipedia page.
Here are the steps to do this:
1. In this example, I have copied and pasted the URL that I am importing data from in my file already. Select the cell where you want to start the function and type “=IMPORTXML”. Press Tab when you are done to begin entering into the function
2. This first parameter of the function is the URL. I have selected the cell that contains my URL. Alternatively, you can put the URL directly into the function as long as it is wrapped in quotation marks
3. The next parameter of the function is the XPath query of what data you want to be returned. In this example, I want to return all the h2 elements from the target URL. So my query is “//h2”
4. After adding the closing parenthesis and pressing Enter, the function will calculate and return all the data from my query. In this example, you can see the function returned all the h2 elements from the target URL
In the previous example, I returned all the h2’s from the target URL using the IMPORTXML function, however, there are a ton of other things that you can import with the function as well.
Here are a few examples of some other formulas and the data that they import:
This will import the title of the page.
This will import all the h3 elements on the page.
This will import all of the internal links on the page
In HTML “tr” is table row, “td” is table data. The “1” here means to return the first element. So this formula will return the first table data element from each table row.
IMPORTXML is a powerful function, but to get the most use out of it, you need to be pretty comfortable with XPath.
The function itself is very easy to use, however learning the syntax of XPath can take a little bit of time.
There are many advanced predicates that can be used with XPath to pull very specific data from your source page, but that is beyond the scope of this article.
Make sure to watch the video if you are at all confused about using this function.
More Google Sheets Tutorials: