Create HTML profiling reports from pandas DataFrame objects
pandas-profiling
Do you like this project? Show us your love and give feedback!
The primary goal of pandas-profiling is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. Like the handy pandas df.describe() function, pandas-profiling delivers an extended analysis of a DataFrame while allowing the analysis to be exported in different formats such as HTML and JSON.
The package outputs a simple, digestible analysis of a dataset, including time-series and text data.
The report contains three additional sections:
🎁 Latest features
- Looking to do EDA on time series 🕛? Check this blog post.
- Want to compare two datasets and get a report? Check this blog post.
pandas-profiling can be used for a variety of use cases. The documentation includes guides, tips and tricks for tackling them:
Use case | Description |
---|---|
Comparing datasets | Comparing multiple versions of the same dataset |
Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
Profiling large datasets | Tips on how to prepare data and configure pandas-profiling for working with large datasets |
Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
⚡ Looking for a Spark backend to profile large datasets? It's work in progress.
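For the "Profiling large datasets" use case in the table above, one common approach is to profile a random sample and turn on minimal mode. The following is a minimal sketch, not the documentation's recipe: the helper name `profile_large_dataframe` and its `max_rows` and `seed` parameters are hypothetical, while `minimal=True` is the `ProfileReport` option that disables the most expensive computations.

```python
def profile_large_dataframe(df, max_rows=10_000, seed=0):
    """Profile at most `max_rows` randomly sampled rows in minimal mode.

    The function name and its `max_rows` / `seed` parameters are
    illustrative assumptions, not part of pandas-profiling.
    """
    # Imported lazily so the helper can be defined without the package installed.
    from pandas_profiling import ProfileReport

    # Down-sample only when the DataFrame exceeds the row budget.
    if len(df) > max_rows:
        df = df.sample(n=max_rows, random_state=seed)

    # minimal=True switches off the most expensive computations.
    return ProfileReport(df, title="Pandas Profiling Report (sample)", minimal=True)
```

Calling `profile_large_dataframe(df).to_file("sample_report.html")` would then write the reduced report. Keep in mind that statistics computed on a sample are estimates for the full dataset.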
Start by loading your pandas DataFrame as you normally would, e.g. by using:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
To generate the standard profiling report, merely run:
profile = ProfileReport(df, title="Pandas Profiling Report")
There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.
The widgets interface displays the report as a set of widgets. In a Jupyter Notebook, run:
profile.to_widgets()
The HTML report can be directly embedded in a cell in a similar fashion:
profile.to_notebook_iframe()
To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:
profile.to_file("your_report.html")
Alternatively, the report's data can be obtained as JSON, either as a string or as a file:
# As a JSON string
json_data = profile.to_json()
# As a file
profile.to_file("your_report.json")
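Once the report is available as a JSON string, it can be inspected with the standard library. The sketch below assumes the string holds a JSON object at the top level and makes no assumption about which keys it contains; the helper name `report_sections` is hypothetical.

```python
import json
from typing import List


def report_sections(json_data: str) -> List[str]:
    """Return the sorted top-level keys of a report serialized as JSON."""
    parsed = json.loads(json_data)
    # Assumption: the serialized report is a JSON object, not a list or scalar.
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(parsed)
```

With the `profile` object from above, `report_sections(profile.to_json())` lists the sections present in the serialized report, which is a convenient starting point before extracting specific values.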
For standard-formatted CSV files (which can be read directly by pandas without additional settings), the pandas_profiling executable can be used from the command line. The example below generates a report named Example Profiling Report, using a configuration file called default.yaml, in the file report.html by processing a data.csv dataset.
pandas_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
Additional details on the CLI are available in the documentation.
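When the CLI needs to be driven from a script, for instance to profile several CSV files in a loop, the same command can be assembled as an argument list. This is a sketch under stated assumptions: the helper names `build_profiling_command` and `run_profiling` are hypothetical, and only the `--title` and `--config_file` options shown above are used.

```python
import shlex
import shutil
import subprocess
from typing import List, Optional


def build_profiling_command(
    input_csv: str,
    output_html: str,
    title: Optional[str] = None,
    config_file: Optional[str] = None,
) -> List[str]:
    """Assemble the pandas_profiling CLI call as an argument list."""
    command = ["pandas_profiling"]
    if title is not None:
        command += ["--title", title]
    if config_file is not None:
        command += ["--config_file", config_file]
    # Positional arguments: input dataset first, output report second.
    command += [input_csv, output_html]
    return command


def run_profiling(command: List[str]) -> Optional[int]:
    """Run the command; return its exit code, or None if the executable is missing."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=False).returncode


command = build_profiling_command(
    "data.csv",
    "report.html",
    title="Example Profiling Report",
    config_file="default.yaml",
)
# Print a shell-quoted version of the command for logging or copy-pasting.
print(shlex.join(command))
```

Passing a list to `subprocess.run` avoids shell quoting problems with titles that contain spaces. `run_profiling(command)` executes the call only if the executable is found on the PATH.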
The following example reports showcase the capabilities of the package across a wide range of datasets and data types:
Additional details, including information about widget support, are available in the documentation.
You can install using the pip package manager by running:
pip install -U pandas-profiling
The package declares "extras", sets of additional dependencies.
- [notebook]: support for rendering the report in Jupyter notebook widgets.
- [unicode]: support for more detailed Unicode analysis, at the expense of additional disk space.
Install these with e.g.
pip install -U pandas-profiling[notebook,unicode]
You can install using the conda package manager by running:
conda install -c conda-forge pandas-profiling
Download the source code by cloning the repository, or click Download ZIP to download the latest stable version.
Install it by navigating to the proper directory and running:
pip install -e .
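Whichever installation route is used, a quick way to confirm that the package is visible to the current interpreter is to query its installed version through the standard library. This is a minimal sketch; the helper name `installed_version` is hypothetical, and `importlib.metadata` requires Python 3.8 or later.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str = "pandas-profiling") -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("pandas-profiling is not installed in this environment")
else:
    print(f"pandas-profiling {version} is installed")
```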
The profiling report is written in HTML and CSS, which means a modern browser is required.
You need Python 3 to run the package. Other dependencies can be found in the requirements files:
Filename | Requirements |
---|---|
requirements.txt | Package requirements |
requirements-dev.txt | Requirements for development |
requirements-test.txt | Requirements for testing |
setup.py | Requirements for widgets etc. |
To maximize its usefulness in real-world contexts, pandas-profiling has a set of implicit and explicit integrations with a variety of other actors in the Data Science ecosystem:
Integration type | Description |
---|---|
Other DataFrame libraries | How to compute the profiling of data stored in libraries other than pandas |
Great Expectations | Generating Great Expectations expectation suites directly from a profiling report |
Interactive applications | Embedding profiling reports in Streamlit, Dash or Panel applications |
Pipelines | Integration with DAG workflow execution tools like Airflow or Kedro |
Cloud services | Using pandas-profiling in hosted computation services like Lambda, Google Cloud or Kaggle |
IDEs | Using pandas-profiling directly from integrated development environments such as PyCharm |
Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:
❗ Before reporting an issue on GitHub, check out Common Issues.
Learn how to get involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.