A tool for checking, sanitizing and minimizing GTFS feeds.
Tidy (and validate) GTFS feeds.
Fixes small inconsistencies, minimizes the overall feed size, prepares the feed for secure, standard-compliant further use.
Output feeds are semantically equivalent to the input feed. In this context, semantical equivalency means that the output feed provides exactly the same trips with exactly the same attributes (routes, stop-times, shapes, agency, fares etc.). In other words, the output feed is equivalent to the input feed from a passenger's perspective.
calender.txt
and calender_dates.txt
by searching optimal progression and/or exception covers.stop_times.txt
and frequencies.txt
and searching for optimal frequency covers.shape_dist_traveled
) have gaps, try to fill them by interpolating surrounding measurementsFor more features, see the help page (--help
).
$ go get github.com/patrickbr/gtfstidy
Each GTFS processor has to be enabled explicitly. See
$ gtfstidy --help
for possible options.
You can use
$ gtfstidy -v <feed>
to do a simple feed validation.
Process the SFMTA-Feed with all processors enabled:
$ gtfstidy -SCRmTcdsOeD sanfrancisco.zip
Validate the SFMTA-Feed:
$ gtfstidy -v sanfrancisco.zip
All zipped sizes using zip -9
.
Processed with -SCRmTcdsOeD
.
File | # lines before | size before (B) | # lines after | size after (B) | |
---|---|---|---|---|---|
agency.txt |
2 | 159 | 2 | 153 | |
calendar_dates.txt |
11 | 172 | 11 | 161 | |
calendar.txt |
4 | 194 | 4 | 190 | |
fare_attributes.txt |
3 | 109 | 3 | 107 | |
fare_rules.txt |
83 | 1k | 83 | 392 | |
frequencies.txt |
--- | --- | 3,904 | 111k | |
routes.txt |
83 | 3,2K | 83 | 1,9k | |
shapes.txt |
201,831 | 7.4M | 19,863 | 620K | |
stops.txt |
3,556 | 212K | 3,540 | 172K | |
stop_times.txt |
1,115,598 | 46M | 893,344 | 25M | |
trips.txt |
28,989 | 1,4M | 21,741 | 739K |
Compression gain:
File | processors | size before (B) | size after (B) | |
---|---|---|---|---|
feed.zip |
-SCRmTcdsOeD |
8.8M | 5.9M | |
feed.zip |
none | 8.8M | 8.5M |
Processed with -SCRmTcdsOeD
.
File | # lines before | size before (B) | # lines after | size after (B) | |
---|---|---|---|---|---|
agency.txt |
424 | 40K | 424 | 33K | |
calendar_dates.txt |
4,387,195 | 113M | 455,843 | 6.6M | |
calendar.txt |
32,822 | 2M | 28,873 | 1014K | |
fare_attributes.txt |
--- | --- | --- | --- | |
fare_rules.txt |
--- | --- | --- | --- | |
frequencies.txt |
--- | --- | 2,023 | 57K | |
routes.txt |
5,564 | 254K | 4,613 | 105K | |
shapes.txt |
--- | --- | --- | --- | |
stops.txt |
31,753 | 2.4M | 31,753 | 1.4M | |
stop_times.txt |
11,569,991 | 790M | 10,771,471 | 323M | |
transfers.txt |
21,691 | 846K | 21,691 | 295K | |
trips.txt |
1,055,231 | 86M | 662,771 | 26M |
Compression gain:
File | processors | size before (B) | size after (B) | |
---|---|---|---|---|
feed.zip |
-SCRmTcdsOeD |
80M | 63M | |
feed.zip |
none | 80M | 77M |
Processed with -SCRmTcdsOeD
.
File | # lines before | size before (B) | # lines after | size after (B) | |
---|---|---|---|---|---|
agency.txt |
21 | 1.7K | 21 | 1.6K | |
calendar_dates.txt |
1 | 32 | 8 | 122 | |
calendar.txt |
49 | 1.8K | 27 | 972 | |
fare_attributes.txt |
--- | --- | --- | --- | |
fare_rules.txt |
--- | --- | --- | --- | |
frequencies.txt |
--- | --- | 21,425 | 604K | |
routes.txt |
434 | 25K | 434 | 22K | |
shapes.txt |
784,062 | 22M | 578,211 | 15M | |
stops.txt |
7,401 | 385K | 7,401 | 323K | |
stop_times.txt |
1,879,757 | 69M | 751,415 | 23M | |
trips.txt |
87,348 | 3.3M | 34,812 | 1.1M |
Compression gain:
File | processors | size before (B) | size after (B) | |
---|---|---|---|---|
feed.zip |
-SCRmTcdsOeD |
16M | 9.4M | |
feed.zip |
none | 16M | 16M |
There are two classes of processors. Processors with a lowercase flag modify existing entries. Processors with an uppercase flag delete existing entries, either because they are duplicates or because they can be combined with other entries.
IDs are packed into dense integer arrays, either as base 10 or base 36 integers. You should not use this processor if you are referencing entities from outside the static feed (for example, if the IDs are references from a GTFS-realtime feed).
-i
: pack IDs into dense base 10 integers-d
: pack IDs into dense base 36 integersEvery file.
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
AB,DTA,10,Airport - Bullfrog,,3,,,
BFC,DTA,20,Bullfrog - Furnace Creek Resort,,3,,,
STBA,DTA,30,Stagecoach - Airport Shuttle,,3,,,
CITY,DTA,40,City,,3,,,
AAMV,DTA,50,Airport - Amargosa Valley,,3,,,
AAMV2,DTA,50,Airport - Amargosa Valley,,3,,,
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_type
2,1,20,Bullfrog - Furnace Creek Resort,3
3,1,30,Stagecoach - Airport Shuttle,3
4,1,40,City,3
5,1,50,Airport - Amargosa Valley,3
6,1,50,Airport - Amargosa Valley,3
1,1,10,Airport - Bullfrog,3
Feed is checked for entries that are not referenced anywhere. These entries are removed from the output.
-O
: remove entities that are not referenced anywheretrips.txt
, stops.txt
, routes.txt
, calendar_dates.txt
, calendar.txt
stops.txt
stop_id,stop_name,stop_desc,stop_lat,stop_lon,zone_id,stop_url,parent_station
META1,Furnace Creek,,36.425288,-117.1333162,,,
META2,Furnace Creek,,36.425288,-117.1333162,,,
FUR_CREEK_RES,Furnace Creek Resort (Demo),,36.425288,-117.133162,,,META1
stops.txt
stop_id,stop_name,stop_lat,stop_lon,parent_station
META1,Furnace Creek,36.42529,-117.133316,
FUR_CREEK_RES,Furnace Creek Resort (Demo),36.42529,-117.13316,META1
Minimizes shape geometries using Douglas-Peucker. This processor implicitely calls the shape remeasurer! The shape coordinates are projected to web mercator (EPSG:3857) prior to minimization. The ε value for Douglas-Peucker is set to 1.0
-s
: minimize shapes (using Douglas-Peucker)shapes.txt
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
A_shp,0,0,1,
A_shp,0.6,0.5,2,6.8310
A_shp,1,1,3,15.8765
A_shp,2,1,4
A_shp,3,1,5,36.76
A_shp,3.5,1,6,
B_shp,0,0,1,
B_shp,0.6,0.5,2,6.8310
B_shp,1,1,3,15.8765
B_shp,2,1,4
B_shp,3,1,5,36.76
B_shp,3.5,1,6,
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
A_shp,0,0,1,0
A_shp,0.6,0.5,2,6.831
A_shp,1,1,3,15.8765
A_shp,3.5,1,6,42.910156
B_shp,0,0,1,0
B_shp,0.6,0.5,2,6.831
B_shp,1,1,3,15.8765
B_shp,3.5,1,6,42.910156
Minimizes service ranges in calendar.txt
and calendar_dates.txt
by searching for optimal coverages of range entries in calendar.txt
and exception entries in calendar_dates.txt
.
-c
: minimize services by searching for the optimal exception/range coveragecalendar.txt
, calendar_dates.txt
calendar.txt
(empty)
calendar_dates.txt
service_id,date,exception_type
FULLW,20160814,1
FULLW,20160815,2
FULLW,20160816,1
FULLW,20160817,1
FULLW,20160818,1
FULLW,20160819,1
FULLW,20160820,1
FULLW,20160821,1
calendar.txt
service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
FULLW,0,1,1,1,1,1,1,20160814,20160821
calendar_dates.txt
(empty)
Minimizes stop times in stop_times.txt
and trips in trips.txt
by searching for progression (frequency) covers on the stop times. If multiple trips with equivalent attributes (route, shapes etc) and the same relative stop times are found, they are checked for frequency patterns. If a pattern could be found, the trips are combined into a single frequency-based trip (via frequency.txt
). Existing frequencies in frequencies.txt
are also optimized and/or combined with entries in stop_times.txt
.
The algorithm is based on a CAP (Cover by Arithmetic Progression) algorithm proposed by Hannah Bast and Sabine Storandt.
-T
: search for frequency patterns in explicit trips and combine themtrips.txt
, stop_times.txt
, frequencies.txt
stop_times.txt
trip_id,arrival_time,departure_time,stop_id,stop_sequence
AB1a,8:00:00,8:00:00,BEATTY_AIRPORT,1
AB1a,8:10:00,8:15:00,BULLFROG,2
AB1b,8:10:00,8:10:00,BEATTY_AIRPORT,1
AB1b,8:20:00,8:25:00,BULLFROG,2
AB1c,8:20:00,8:20:00,BEATTY_AIRPORT,1
AB1c,8:30:00,8:35:00,BULLFROG,2
AB1d,8:30:00,8:30:00,BEATTY_AIRPORT,1
AB1d,8:40:00,8:45:00,BULLFROG,2
trips.txt
route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
AB,FULLW,AB1a,to Bullfrog,0,1,A_shp
AB,FULLW,AB1b,to Bullfrog,0,1,A_shp
AB,FULLW,AB1c,to Bullfrog,0,1,A_shp
AB,FULLW,AB1d,to Bullfrog,0,1,A_shp
frequencies.txt
(empty)
stop_times.txt
trip_id,arrival_time,departure_time,stop_id,stop_sequence
AB1a,8:00:00,8:00:00,BEATTY_AIRPORT,1
AB1a,8:10:00,8:15:00,BULLFROG,2
trips.txt
route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
AB,FULLW,AB1a,to Bullfrog,0,1,A_shp
frequencies.txt
trip_id,start_time,end_time,headway_secs,exact_times
AB1a,8:00:00,8:40:00,600,1
Removes duplicate routes (routes that have the same attributes and the same fare rules), updates references in trips.txt
and deletes redundant rules in fare_rules.txt
as well.
-R
: remove route duplicatesroutes.txt
, trips.txt
, fare_rules.txt
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
AB,DTA,10,Airport - Bullfrog,,3,,,
BFC,DTA,20,Bullfrog - Furnace Creek Resort,,3,,,
CFC,DTA,20,Bullfrog - Furnace Creek Resort,,3,,,
fare_rules.txt
fare_id,route_id,origin_id,destination_id,contains_id
p,AB,,,
p,BFC,,,
p,CFC,,,
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_type
AB,DTA,10,Airport - Bullfrog,3
CFC,DTA,20,Bullfrog - Furnace Creek Resort,3
fare_rules.txt
fare_id,route_id
p,AB
p,CFC
Removes duplicate services (services that cover the same set of dates) and updates references.
-C
: remove duplicate services in calendar.txt and calendar_dates.txtcalendar_dates.txt
, calendar.txt
, trips.txt
calendar_dates.txt
service_id,date,exception_type
B,20160815,1
B,20160816,1
B,20160817,1
B,20160818,1
B,20160819,1
B,20160820,1
calendar.txt
service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
A,1,1,1,1,1,1,0,20160814,20160821
calendar_dates.txt
(empty)
calendar.txt
service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
A,1,1,1,1,1,1,0,20160814,20160821
Removes duplicate shapes and updates references in trips.txt
. Shape equality testing is done with a simple heuristic which resembles the Fréchet-Distance but is faster. The check never underestimates the distance between two shapes, but overestimates it for shapes with total distances that are >>
the max distance. This processor implicitely calls the shape remeasurer.
-S
: remove shape duplicatesshapes.txt
, trips.txt
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
A_shp,0,0,1,0
A_shp,0.6,0.5,2,6.831
A_shp,1,1,3,15.8765
A_shp,2,1,4,26.315065
A_shp,3,1,5,36.76
A_shp,3.5,1,6,42.910156
B_shp,0,0,1,0
B_shp,0.6,0.5,2,6.831
B_shp,1,1,3,15.8765
B_shp,2,1,4,26.315065
B_shp,3,1,5,36.75
B_shp,3.500005,1,6,42.91
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
B_shp,0,0,1,0
B_shp,0.6,0.5,2,6.831
B_shp,1,1,3,15.8765
B_shp,2,1,4,26.315065
B_shp,3,1,5,36.75
B_shp,3.500005,1,6,42.91
Remeasures shapes and fills measurement gaps.
-m
: remeasure shapes (filling measurement-holes)shapes.txt
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
A_shp,0,0,1,
A_shp,0.6,0.5,2,6.8310
A_shp,1,1,3,15.8765
A_shp,2,1,4
A_shp,3,1,5,36.76
A_shp,3.5,1,6,
shapes.txt
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
A_shp,0,0,1,0
A_shp,0.6,0.5,2,6.831
A_shp,1,1,3,15.8765
A_shp,2,1,4,26.315065
A_shp,3,1,5,36.76
A_shp,3.5,1,6,42.910156
If optional field values of feed entries have errors, this processors sets them to the default values specified in the GTFS standard.
-e
: if non-required fields have errors, fall back to the default valuesEvery file, if errors are present.
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
AB,DTAoopserror,10,Airport - Bullfrog,,3,,,
BFC,DTA,20,Bullfrog - Furnace Creek Resort,,3,,,
STBA,DTA,30,Stagecoach - Airport Shuttle,,3,,,
CITY,DTA,40,City,,3,,,
AAMV,DTA,50,Airport - Amargosa Valley,,3,,,
AAMV2,DTA,50,Airport - Amargosa Valley,,3,,,
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_type
CITY,DTA,40,City,3
AAMV,DTA,50,Airport - Amargosa Valley,3
AAMV2,DTA,50,Airport - Amargosa Valley,3
AB,,10,Airport - Bullfrog,3
BFC,DTA,20,Bullfrog - Furnace Creek Resort,3
STBA,DTA,30,Stagecoach - Airport Shuttle,3
If feed entries have errors that can't be fixed in any other way (e.g. by -e
), this processor completely removes them. This cascades through the entire feed: if an erroneous route is deleted, all trips that refer to this route are also deleted, all stop times for trips referencing this route are deleted as well and so on. You effectively get an error-free subset of the input feed.
-D
: drop erroneous entries from feedEvery file, if errors are present.
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color,route_text_color
AB,DTAoopserror,10,Airport - Bullfrog,,3,,,
BFC,DTA,20,Bullfrog - Furnace Creek Resort,,3,,,
STBA,DTA,30,Stagecoach - Airport Shuttle,,3,,,
CITY,DTA,40,City,,3,,,
AAMV,DTA,50,Airport - Amargosa Valley,,3,,,
AAMV2,DTA,50,Airport - Amargosa Valley,,3,,,
routes.txt
route_id,agency_id,route_short_name,route_long_name,route_type
CITY,DTA,40,City,3
AAMV,DTA,50,Airport - Amargosa Valley,3
AAMV2,DTA,50,Airport - Amargosa Valley,3
BFC,DTA,20,Bullfrog - Furnace Creek Resort,3
STBA,DTA,30,Stagecoach - Airport Shuttle,3
The order in which feed processors are called is always the same, regardless of the flag ordering. The processors are run in this order:
-e
)-D
)-O
)-m
)-s
)-S
)-R
)-C
)-c
)-T
)-i
or -d
)GPL v2, see LICENSE