Search for and retrieve US Patent and Trademark Office Patent Data
pypatent is a tiny Python package to easily search for and scrape US Patent and Trademark Office Patent Data.
This version implements Selenium support for scraping.
Previous versions used the requests library for all requests; however, this has had problems with the USPTO site lately.
I have noticed that some users are able to use requests without issue, while others get 4xx errors.
PyPatent version 1.2 implements an optional new WebConnection object that lets the user choose Selenium WebDrivers in place of the requests library.
If used, it should be passed as an argument when initializing Search or Patent objects.
Use it in the following cases:
- When you want to use Selenium instead of requests
- When you want to use requests, but with a custom user-agent or headers
See the bottom of the README for examples.
Requirements: Python 3, BeautifulSoup, requests, pandas, re, selenium
Install with pip:
pip install pypatent
If using Selenium for scraping (introduced in version 1.2), be sure to install a Selenium WebDriver.
For Chrome, use chromedriver. For Firefox, use geckodriver.
See the Selenium download page for more details and options.
The Search object works similarly to the Advanced Search at the USPTO, with additional options.
There are two methods to specify your search criteria, and you can use one or both.
You may search for a certain string in all fields of the patent:
pypatent.Search('microsoft') # Will return results matching 'microsoft' in any field
You may also specify complex search criteria as demonstrated on the USPTO site:
pypatent.Search('TTL/(tennis AND (racquet OR racket))')
Alternatively, you can specify one or more Field Code arguments to search within the specified fields. Multiple Field Code arguments will create a search with AND logic. OR logic can be used within a single argument. For more complex logic, use a custom string.
pypatent.Search(pn='adobe', ttl='software') # Equivalent to search('PN/adobe AND TTL/software')
pypatent.Search(pn=('adobe or macromedia'), ttl='software') # Equivalent to search('PN/(adobe or macromedia) AND TTL/software')
String criteria can be used in conjunction with Field Code arguments:
pypatent.Search('acrobat', pn='adobe', ttl='software') # Equivalent to search('acrobat AND PN/adobe AND TTL/software')
The Field Code arguments have the same meaning as on the USPTO site.
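The way a string and Field Code arguments combine can be sketched in plain Python. This is only an illustration of the AND logic described above, not pypatent's actual implementation; build_query is a hypothetical helper, and the rule of wrapping multi-word values in parentheses is an assumption taken from the "Equivalent to" comments in the examples.

```python
def build_query(string=None, **field_codes):
    """Hypothetical helper: join a free-text string and field codes with AND."""
    parts = [string] if string else []
    for code, value in field_codes.items():
        if value is None:
            continue
        # Assumption: multi-word values (e.g. 'adobe or macromedia') are grouped
        # in parentheses so the inner OR stays inside the field.
        term = f"({value})" if " " in value else value
        parts.append(f"{code.upper()}/{term}")
    return " AND ".join(parts)


query_simple = build_query(pn="adobe", ttl="software")
query_or = build_query(pn="adobe or macromedia", ttl="software")
query_mixed = build_query("acrobat", pn="adobe", ttl="software")
print(query_simple)
print(query_or)
print(query_mixed)
```

Keyword arguments keep their order in Python 3.7+, so the terms appear in the order they were passed.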
The results_limit argument lets you change how many patent results are retrieved. The default is 50, equivalent to one page of results.
pypatent.Search('microsoft', results_limit=10) # Fetch 10 results only
By default, pypatent retrieves the details of every patent by visiting each patent's URL from the search results.
This can take a long time since each page has to be scraped.
If you just need the patent titles and URLs from the search results, set get_patent_details to False:
pypatent.Search('microsoft', get_patent_details=False) # Fetch patent numbers and titles only
pypatent has convenience methods to format the Search object into either a pandas DataFrame or a list of dicts.
pypatent.Search('microsoft').as_dataframe()
pypatent.Search('microsoft', get_patent_details=False).as_list()
Sample result (without patent details):
[{
'title': 'Electronic device',
'url': 'http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=1&p=1&f=G&l=50&d=PTXT&S1=microsoft&OS=microsoft&RS=microsoft'
},
{'title': 'Portable electric device', ... }
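Because the list form is just a list of dicts with 'title' and 'url' keys, it can be passed to standard-library tools. The sketch below writes such a list to CSV text. The results list is a hand-written stand-in for the output of as_list() with placeholder URLs, and results_to_csv is a hypothetical helper, not part of pypatent.

```python
import csv
import io

# Stand-in for pypatent.Search(..., get_patent_details=False).as_list();
# the URLs are placeholders, not real search results.
results = [
    {"title": "Electronic device", "url": "http://example.com/patent/1"},
    {"title": "Portable electric device", "url": "http://example.com/patent/2"},
]


def results_to_csv(rows):
    """Hypothetical helper: serialize title/url dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["title", "url"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = results_to_csv(results)
print(csv_text)
```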
The Search class uses the Patent class to retrieve and store patent details for a given patent URL.
You can use it directly if you already know the patent URL (e.g. you ran a Search with get_patent_details=False):
# Create a Patent object
this_patent = pypatent.Patent(title='Base station device, first location management device, terminal device, communication control method, and communication system',
url='http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&u=%2Fnetahtml%2FPTO%2Fsearch-adv.htm&r=4&p=1&f=G&l=50&d=PTXT&S1=aaa&OS=aaa&RS=aaa')
# Fetch the details
this_patent.fetch_details()
Note, not all fields from the patent page are scraped. I hope to add more, and pull requests are appreciated :)
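One way to combine the two classes is to run a fast search without details, then fetch details only for the results you care about. The sketch below shows that pattern under stated assumptions: fetch_selected is a hypothetical helper, and DemoPatent is a stand-in that mimics only the Patent(title=..., url=...) constructor and fetch_details() call shown above, so the example runs without touching the USPTO site.

```python
class DemoPatent:
    """Hypothetical stand-in with the same call shape as pypatent.Patent."""

    def __init__(self, title, url):
        self.title = title
        self.url = url
        self.details_fetched = False

    def fetch_details(self):
        # The real fetch_details() scrapes the patent page; this stub only sets a flag.
        self.details_fetched = True


def fetch_selected(results, keyword, patent_factory):
    """Create patent objects only for results whose title contains keyword."""
    selected = []
    for row in results:
        if keyword.lower() in row["title"].lower():
            patent = patent_factory(title=row["title"], url=row["url"])
            patent.fetch_details()
            selected.append(patent)
    return selected


# Placeholder data in the same shape as Search(...).as_list() without details.
results = [
    {"title": "Electronic device", "url": "http://example.com/patent/1"},
    {"title": "Portable electric device", "url": "http://example.com/patent/2"},
]
selected = fetch_selected(results, "portable", DemoPatent)
print([p.title for p in selected])
```

With the real package, the same helper would presumably be called with pypatent.Patent as the factory and the output of a Search run with get_patent_details=False as the results.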
As described at the top of this README, the WebConnection object is optional; when used, it is passed as an argument when initializing Search or Patent objects.
An example using the Firefox WebDriver:
import pypatent
from selenium import webdriver
driver = webdriver.Firefox() # Requires geckodriver in your PATH
conn = pypatent.WebConnection(use_selenium=True, selenium_driver=driver)
res = pypatent.Search('microsoft', get_patent_details=True, web_connection=conn)
print(res)
An example using the requests library with a custom user agent:
import pypatent
conn = pypatent.WebConnection(use_selenium=False, user_agent='Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36')
res = pypatent.Search('microsoft', get_patent_details=True, web_connection=conn)
print(res)
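For readers unfamiliar with what a custom user agent changes, the sketch below builds (but does not send) an HTTP request carrying a User-Agent header. It uses the standard library's urllib rather than requests or pypatent, so it shows only the general mechanism and not how WebConnection works internally. The shortened user-agent string is a placeholder.

```python
import urllib.request

# Placeholder user agent; any browser-like string could be used here.
USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
url = "http://patft.uspto.gov/netacgi/nph-Parser"

# Building the Request object does not contact the server.
req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```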
An example using the requests library with the default user agent (WebConnection is not necessary here, as we are using the defaults):
import pypatent
res = pypatent.Search('microsoft', get_patent_details=True)
print(res)
This version makes searching and storing patent data easier:
- The package is organized around two objects: Search and Patent.
- The Search object searches the USPTO site and can output the results as a DataFrame or list. It can scrape the details of each patent, or just get the patent title and URL. Most users will only need to use this object.
- The Patent object fetches and holds a single patent's info. Fetching the patent's details is now optional. This object should only be used when you already have the patent URL and aren't conducting a search.