FSCrawler Versions

Elasticsearch File System Crawler (FS Crawler)

fscrawler-2.9

2 years ago

Full Changelog: https://github.com/dadoonet/fscrawler/compare/fscrawler-2.8...fscrawler-2.9

fscrawler-2.8

2 years ago

What's Changed

  • #1356: ci(Mergify): configuration update (thanks to @dadoonet)
  • #1322: Update Log4J 2.15.0 and Elasticsearch 7.16.1 (thanks to @dadoonet)
  • #1276: Revert "Remove a52f2ab6-086b-4285-a7a1-78ecdc6404ba vulnerability id" (thanks to @dadoonet)
  • #1275: Remove a52f2ab6-086b-4285-a7a1-78ecdc6404ba vulnerability id (thanks to @dadoonet)
  • #1228: latest docker tag should be only the latest stable version (thanks to @dadoonet)

🚀 New features

  • #1368: Add support for Delete Document (thanks to @dadoonet)
  • #1298: Add more default displayed fields (thanks to @dadoonet)

🚨 Bug Fixes

  • #1358: fs.ocr.enabled is always false (thanks to @ywjung)
  • #1286: Fix starting fscrawler with Docker (thanks to @dadoonet)
  • #1271: fix: not working optional libraries (e.g. jpeg2000) (thanks to @NickUfer)

💉 Updated features

  • #1393: Bump guava from 31.0.1-jre to 31.1-jre (thanks to @dependabot)
  • #1392: Bump docker-maven-plugin from 0.39.0 to 0.39.1 (thanks to @dependabot)
  • #1376: Use our own Http Client and remove specific distributions (thanks to @dadoonet)
  • #1390: Bump nexus-staging-maven-plugin from 1.6.10 to 1.6.12 (thanks to @dependabot)
  • #1387: Bump maven-compiler-plugin from 3.9.0 to 3.10.0 (thanks to @dependabot)
  • #1386: Bump nexus-staging-maven-plugin from 1.6.8 to 1.6.10 (thanks to @dependabot)
  • #1385: Bump maven-javadoc-plugin from 3.3.1 to 3.3.2 (thanks to @dependabot)
  • #1384: Bump jakarta.activation-api from 2.0.1 to 2.1.0 (thanks to @dependabot)
  • #1383: Bump jersey.version from 3.0.3 to 3.0.4 (thanks to @dependabot)
  • #1381: Bump slf4j-api from 1.7.35 to 1.7.36 (thanks to @dependabot)
  • #1382: Bump jcl-over-slf4j from 1.7.33 to 1.7.36 (thanks to @dependabot)
  • #1377: Bump websocket-client from 9.4.44.v20210927 to 9.4.45.v20220203 (thanks to @dependabot)
  • #1374: Bump docker-maven-plugin from 0.38.1 to 0.39.0 (thanks to @dependabot)
  • #1373: Bump ossindex-maven-plugin from 3.1.0 to 3.2.0 (thanks to @dependabot)
  • #1371: Update to Elasticsearch 7.17.0 (thanks to @dadoonet)
  • #1369: Bump json-path from 2.6.0 to 2.7.0 (thanks to @dependabot)
  • #1365: Bump slf4j-api from 1.7.33 to 1.7.35 (thanks to @dependabot)
  • #1364: Bump versions-maven-plugin from 2.8.1 to 2.9.0 (thanks to @dependabot)
  • #1355: Bump jcl-over-slf4j from 1.7.32 to 1.7.33 (thanks to @dependabot)
  • #1354: Bump elasticsearch-rest-high-level-client from 7.16.2 to 7.16.3 (thanks to @dependabot)
  • #1353: Bump slf4j-api from 1.7.32 to 1.7.33 (thanks to @dependabot)
  • #1352: Bump woodstox-core from 6.2.7 to 6.2.8 (thanks to @dependabot)
  • #1350: Bump maven-jar-plugin from 3.2.1 to 3.2.2 (thanks to @dependabot)
  • #1351: Bump maven-compiler-plugin from 3.8.1 to 3.9.0 (thanks to @dependabot)
  • #1349: Bump jcommander from 1.81 to 1.82 (thanks to @dependabot)
  • #1330: Bump tika.version from 2.1.0 to 2.2.0 (thanks to @dependabot)
  • #1348: Switch to the new sonatype service (thanks to @dadoonet)
  • #1339: Bump log4j-api from 2.17.0 to 2.17.1 (thanks to @dependabot)
  • #1346: Bump build-helper-maven-plugin from 3.2.0 to 3.3.0 (thanks to @dependabot)
  • #1347: Bump maven-jar-plugin from 3.2.0 to 3.2.1 (thanks to @dependabot)
  • #1333: Bump elasticsearch-rest-high-level-client from 7.16.1 to 7.16.2 (thanks to @dependabot)
  • #1331: Bump jackson.version from 2.13.0 to 2.13.1 (thanks to @dependabot)
  • #1332: Bump docker-maven-plugin from 0.38.0 to 0.38.1 (thanks to @dependabot)
  • #1329: Bump log4j-core from 2.16.0 to 2.17.0 (thanks to @dependabot)
  • #1326: Bump snakeyaml from 1.29 to 1.30 (thanks to @dependabot)
  • #1325: Bump log4j.version from 2.15.0 to 2.16.0 (thanks to @dependabot)
  • #1309: Bump woodstox-core from 6.2.6 to 6.2.7 (thanks to @dependabot)
  • #1314: Bump bcprov-jdk15on from 1.69 to 1.70 (thanks to @dependabot)
  • #1316: Bump httpcore.version from 4.4.14 to 4.4.15 (thanks to @dependabot)
  • #1321: Bump httpasyncclient from 4.1.4 to 4.1.5 (thanks to @dependabot)
  • #1301: Bump docker-maven-plugin from 0.37.0 to 0.38.0 (thanks to @dependabot)
  • #1317: Bump jdom2 from 2.0.6 to 2.0.6.1 (thanks to @dependabot)
  • #1320: Bump log4j-core from 2.14.1 to 2.15.0 (thanks to @dependabot)
  • #1290: Bump joda-time from 2.10.12 to 2.10.13 (thanks to @dependabot)
  • #1288: Bump junit4-maven-plugin from 2.7.8 to 2.7.9 (thanks to @dependabot)
  • #1289: Bump randomizedtesting-runner from 2.7.8 to 2.7.9 (thanks to @dependabot)
  • #1285: Bump jansi from 2.3.4 to 2.4.0 (thanks to @dependabot)
  • #1277: Bump jsoup from 1.14.2 to 1.14.3 (thanks to @dependabot)
  • #1278: Bump joda-time from 2.10.10 to 2.10.12 (thanks to @dependabot)
  • #1280: Bump guava from 30.1.1-jre to 31.0.1-jre (thanks to @dependabot)
  • #1279: Bump jcl-over-slf4j from 1.7.31 to 1.7.32 (thanks to @dependabot)
  • #1198: Update to Tika 2.1 (thanks to @dadoonet)
  • #1268: Bump jackson.version from 2.12.5 to 2.13.0 (thanks to @dependabot)
  • #1265: Bump guava from 30.1.1-jre to 31.0.1-jre (thanks to @dependabot)
  • #1262: Bump MockFtpServer from 2.8.0 to 3.0.0 (thanks to @dependabot)
  • #1269: Bump jsoup from 1.14.2 to 1.14.3 (thanks to @dependabot)
  • #1270: Bump websocket-client from 9.4.43.v20210629 to 9.4.44.v20210927 (thanks to @dependabot)
  • #1261: Bump elasticsearch-rest-high-level-client from 7.14.1 to 7.15.0 (thanks to @dependabot)
  • #1260: Bump jersey.version from 3.0.2 to 3.0.3 (thanks to @dependabot)
  • #1248: Bump maven-javadoc-plugin from 3.3.0 to 3.3.1 (thanks to @dependabot)
  • #1243: Bump sqlite-jdbc from 3.36.0.2 to 3.36.0.3 (thanks to @dependabot)
  • #1242: Bump jackson.version from 2.12.4 to 2.12.5 (thanks to @dependabot)
  • #1245: Bump elasticsearch-rest-high-level-client from 7.14.0 to 7.14.1 (thanks to @dependabot)
  • #1241: Bump sqlite-jdbc from 3.36.0.1 to 3.36.0.2 (thanks to @dependabot)
  • #1233: Bump docker-maven-plugin from 0.36.1 to 0.37.0 (thanks to @dependabot)

📝 Documentation updates

  • #1345: Improve documentation for settings (thanks to @cbb-colab)
  • #1310: Update ocr.rst, the path was wrong and not working (thanks to @sahin52)
  • #1256: Add section Workaround for huge temporary files (thanks to @dfbm)

🚦 Tests

  • #1327: Split Build, IT and Unit Tests (thanks to @dadoonet)
  • #1323: Add more traces when converting dates (thanks to @dadoonet)

Thanks to

@NickUfer, @cbb-colab, @cwperry, @dadoonet, @dependabot, @dependabot[bot], @dfbm, @mergify[bot], @sahin52 and @ywjung

fscrawler-2.7

2 years ago

The FSCrawler team is pleased to announce the FSCrawler 2.7 release!

FSCrawler

FSCrawler offers a simple way to index binary files into Elasticsearch.

Usage

Download FSCrawler 2.7:

wget https://repo1.maven.org/maven2/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7/fscrawler-es7-2.7.zip

Start FSCrawler with:

bin/fscrawler job_name

FSCrawler reads its settings from a local file (by default ~/.fscrawler/{job_name}/_settings.json). If the file does not exist, FSCrawler offers to create your first job:

$ bin/fscrawler job_name
18:28:58,174 WARN  [f.p.e.c.f.FsCrawler] job [job_name] does not exist
18:28:58,177 INFO  [f.p.e.c.f.FsCrawler] Do you want to create it (Y/N)?
y
18:29:05,711 INFO  [f.p.e.c.f.FsCrawler] Settings have been created in [~/.fscrawler/job_name/_settings.json]. Please review and edit before relaunch
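
The settings lookup described above can be sketched in shell. This is only a minimal sketch, assuming the default location quoted in this release note; the variable names (job_name, settings_file, status) are illustrative and are not part of FSCrawler itself.

```shell
# Hypothetical helper: report whether a job already has a settings file.
# Assumes the default location ~/.fscrawler/{job_name}/_settings.json
# mentioned in the release note; adjust if your setup differs.
job_name="job_name"
settings_file="$HOME/.fscrawler/$job_name/_settings.json"

if [ -f "$settings_file" ]; then
  status="exists"
else
  status="missing"
fi

echo "Settings for [$job_name]: $settings_file ($status)"
```

If the status is "missing", running bin/fscrawler with that job name should lead to the creation prompt shown in the transcript above.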

Create a directory named /tmp/es (or c:\tmp\es on Windows), add the files you want to index to it, and start FSCrawler again:

$ bin/fscrawler job_name
18:30:34,330 INFO  [f.p.e.c.f.FsCrawlerImpl] Starting FS crawler
18:30:34,332 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started in watch mode. It will run unless you stop it with CTRL+C.
18:30:34,682 INFO  [f.p.e.c.f.FsCrawlerImpl] FS crawler started for [job_name] for [/tmp/es] every [15m]
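
The directory preparation that precedes this run can be sketched as follows on Linux or macOS. It is a minimal sketch, assuming the /tmp/es path from the log line above; the CRAWL_DIR override and the hello.txt sample file are hypothetical and only serve the illustration.

```shell
# Directory the job in the transcript watches (/tmp/es).
# CRAWL_DIR is a hypothetical override for this sketch, not an FSCrawler setting.
crawl_dir="${CRAWL_DIR:-/tmp/es}"

# Create the directory if it does not exist yet.
mkdir -p "$crawl_dir"

# Drop a small sample document into it (hypothetical file name and content).
printf 'Hello FSCrawler\n' > "$crawl_dir/hello.txt"

# List what the crawler would find on its next run.
ls -l "$crawl_dir"
```

Once the directory holds at least one file, restarting bin/fscrawler job_name should produce output similar to the transcript above.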

More details in the documentation.

New features

  • #991: Add Workplace Search connector.
  • #1203: Add FTP crawler. By helsonxiao.
  • #1211: Add file.content_type field on folders.
  • #1210: Add file.filename field on folders.
  • #1179: Automatically create Custom Sources.
  • #1037: Split console logs and actual logs and add a banner :).
  • #1036: Make SSL verification configurable. By TommyLike.
  • #1035: Log index errors in documents.log.
  • #1031: Add an external Log4J2 configuration file.
  • #907: Add path_prefix option.
  • #820: Generate FSCrawler docker images. By toto1310.
  • #776: Report HEAP size at startup.
  • #752: Add option to ignore symlinks. By budachst.
  • #715: Allow custom index name in the REST API. By kikkauz.
  • #698: Add Cross-Origin Resource Sharing (CORS) headers to RestServer. By isaac-ipl.
  • #692: Allow running OCR but not on PDF files.
  • #673: Add support for YAML configuration.
  • #663: Add Patterns table to includes and excludes. By wrathagom.

Fixed Bugs

  • #1224: Fix NPE in Console when running with Docker.
  • #1217: Check if date is null when formatting it to RFC3339.
  • #1204: Split build and deploy phases for Docker images.
  • #1201: 2.7 - Docker image broken. By agrantdeakin.
  • #1194: Elasticsearch node settings should not be null by default.
  • #1193: Corrupt PDF can lead to a StackOverflow.
  • #1137: Ignore errors when parsing a 0 byte file.
  • #1085: fscrawler.bat added a CD to move to the appropriate directory. By CircuitGuy.
  • #1084: InputStream must have > 0 bytes. By yuanzhian.
  • #1066: Start fscrawler instead of internal services.
  • #1041: Fixed an issue that caused an error when running in a windows environment. By muraken720.
  • #1006: Running fscrawler with no argument now lists existing jobs. By janhoy.
  • #1005: Fix ENTRYPOINT in Dockerfile to allow variable substitution. By Maijin.
  • #994: Using cloud id gives "invalid IPv6 Address". By tdaroly.
  • #973: Fix SSH crawling from Windows machine.
  • #899: FSCrawler can't index .doc or .docx elements. By LaaKii.
  • #895: java.lang.NoSuchMethodError: parsing some Word files. By mwaltersbmc.
  • #860: Bug Syntax error in fscrawler file, to init fscrawler. By CarlosRCDev.
  • #847: Add sun.jnu.encoding=UTF-8 to both the .bat and .sh scripts. By shahariaazam.
  • #834: FS Crawler freezes when crawling a 0 byte TXT file. By dansfelix.
  • #819: Fix Percentage computation.
  • #760: Allow passing test parameters to Maven CLI.
  • #714: fix release-drafter. By jetersen.
  • #701: Change log level and display logs only if filters on content.
  • #691: OCR without pdf_ocr. By Newmski.
  • #686: Wait for healthy index when creating the index.
  • #681: SSH dirs should be seen as dirs and not files.
  • #680: trying to index remote files with ssh - files seen as folder. By sblanc0054.
  • #660: Fix authentication when sending announcement email.

Main changes

  • #1218: Isolate WorkplaceSearchClient and ElasticsearchClient.
  • #1213: Switch back to Java 11.
  • #1049: Update Dockerfile to use JDK14. By mario-89.
  • #1212: Let's use JsonPath.
  • #1207: Generate only 2 docker images.
  • #1206: Detect when fscrawler runs in foreground and adapt logs.
  • #1205: Add logs to the console when running a Docker instance.
  • #1172: Move CI from Travis to GitHub actions.
  • #872: Add more information to the _simulate API.
  • #700: Add dependency convergence checks.
  • #695: Exclude the PDFParser from the DefaultParser.
  • #694: Display full names when catching parsing errors.
  • #693: Move fs.pdf_ocr setting to fs.ocr.pdf_strategy.
  • #675: Warn in case of Tika error.
  • #1219: Update to Elasticsearch 7.14.0 and 6.8.18.
  • #1180: Bump tika.version from 1.26 to 1.27.

Removed

  • #978: files lost. By bluebell1990.

Have fun! -FSCrawler team

fscrawler-2.6

5 years ago

What's Changed

  • Update Jackson to 2.9.8 (#657) @dadoonet
  • Update to Tika 1.20 (#655) @dadoonet
  • Update to Elasticsearch 6.5.3 (#649) @dadoonet
  • Add a warning when using both silent and debug/trace (#647) @dadoonet
  • Add documentation on how to run as a Windows service (#648) @dadoonet
  • Check Elasticsearch 6 minor version (#642) @dadoonet
  • Force the default number of shards to be 1 (#644) @dadoonet
  • Update Guava transitive dependency to 27.0.1-jre (#645) @dadoonet
  • Revisit Elasticsearch.Node and Rest settings (#638) @dadoonet
  • Update to elasticsearch 6.5.1 (#637) @dadoonet
  • Ignore dirs when .fscrawlerignore file is detected (#633) @dadoonet
  • Update issue templates (#632) @dadoonet
  • Support multiple OCR languages (#631) @dadoonet
  • Update Tika to 1.19.1 (#624) @dadoonet
  • Create specific elasticsearch clients (#616) @dadoonet
  • Add Release Drafter to automatically generate the release notes (#611) @dadoonet
  • Add a Noop Parser (#610) @dadoonet
  • Dump stack when not able to close FSCrawler (#609) @dadoonet
  • Make default root dir Windows compatible (#595) @dadoonet
  • Update to Tika 1.19 (#603) @dadoonet
  • Update ossindex-maven-plugin to 3.0.1 (#604) @dadoonet
  • Update to Jackson 2.9.7 (#602) @dadoonet
  • Update to Elasticsearch 6.4.1 (#594) @dadoonet
  • Add LGTM code quality badges (#597) @xcorail
  • Support XML reoccurring structures (#593) @dadoonet
  • Add a filter by content option (#585) @dadoonet
  • Exclude dirs depending on dir full name (relative to root) (#561) @dadoonet
  • Ignore files bigger than X (#584) @dadoonet
  • Add hocr option for Tesseract-based OCR (#583) @dadoonet
  • Allow path partial matching (#582) @dadoonet
  • Add support for Last Accessed date and Created date (#580) @dadoonet
  • Use _doc doc type instead of doc (#581) @dadoonet
  • Fix wrong detection of removed settings (#579) @dadoonet
  • Add support for cloud id (#577) @dadoonet
  • Update maven-compiler-plugin to 3.8.0 (#576) @dadoonet
  • Add ossindex Maven plugin (#572) @dadoonet
  • Close bulk processors with awaitClose instead of close (#570) @dadoonet
  • Update to elasticsearch 6.3.2 (#569) @dadoonet
  • Add File Permissions to generated documents (#567) @dadoonet
  • Skip sonar build for external PRs (#568) @dadoonet
  • Add a developer guide (#565) @dadoonet
  • Add support for bulk size in bytes with unit (#563) @dadoonet
  • Update to Elasticsearch 6.3.1 (#557) @dadoonet
  • Revert "Use _doc doc type instead of doc" (#558) @dadoonet
  • Use _doc doc type instead of doc (#554) @dadoonet
  • Fix Sonar Critical issues (#551) @dadoonet
  • Fix SonarQube hook (#550) @dadoonet
  • Move documentation to https://readthedocs.org (#543) @dadoonet
  • Allow using store_source without indexing content (#544) @dadoonet
  • Update to Tika 1.18 (#542) @dadoonet
  • Update to Elasticsearch 6.3.0 (#541) @dadoonet
  • Add a version check in tests (#527) @dadoonet
  • Raw fields should be considered as text/keyword (#526) @dadoonet
  • Add tests on OSS image as well (#525) @dadoonet
  • Update elasticsearch to 6.2.2 (#524) @dadoonet
  • Check that pipeline actually exists when starting (#522) @dadoonet
  • Allow setting Tesseract path to executable and data (#520) @dadoonet
  • Reduce Time to run tests from the IDE (#518) @dadoonet
  • Update to elasticsearch 6.2.1 (#517) @dadoonet
  • Split IT into different classes (#514) @dadoonet
  • Start elasticsearch with docker-maven-plugin when running from the CLI (#513) @dadoonet
  • Autodetect if a local node is running before starting docker (#512) @dadoonet
  • Start removal of core module (#508) @dadoonet
  • Create fscrawler-rest module (#506) @dadoonet
  • Create fscrawler-crawler-fs and fscrawler-crawler-ssh modules (#505) @dadoonet
  • Clean package names (#504) @dadoonet
  • Create fscrawler-tika and fscrawler-beans modules (#503) @dadoonet
  • Create the fscrawler-cli module (#502) @dadoonet
  • Move to Docker based integration tests (#500) @dadoonet
  • Modify announcement email (#501) @dadoonet
  • readme: add note that fs settings also affect rest (#492) @shadiakiki1986
  • Fix ignore folders documentation (#488) @dadoonet
  • Add more tests about moving files (#487) @dadoonet
  • Includes and Excludes should not be case sensitive (#486) @dadoonet
  • Split project into modules (#435) @dadoonet
  • add setPipeline call when using REST (#475) @shadiakiki1986
  • Add more info in case of bulk failures (#457) @dadoonet
  • Don't rely on disk space for tests (#456) @dadoonet
  • Update to Lucene 7.0.1 (#452) @dadoonet
  • Update to maven-versions-plugin 2.5 (#453) @dadoonet
  • Update to Log4J 2.9.1 (#451) @dadoonet
  • Update to SQLite 3.20.1 (#450) @dadoonet
  • Update to Jackson 2.9.2 (#449) @dadoonet
  • Update to elasticsearch 6.0.0-beta2 (#434) @dadoonet
  • Update dependencies (Jackson, Log4J, Jansi, SQLite, JSch, JCommander, Randomized Testing) (#430) @dadoonet
  • use StringBuilder in a loop (#361) @ctamisier
  • Add continue_on_error option to continue on error while crawling (#330) @kneubi
  • Fix links typo (#326) @soruly
  • Patch Log4J 2.8 to display messages on Windows (#323) @dadoonet
  • Missing documentation for some local FS settings (#287) @shadiakiki1986
  • add link to repo with dockerfile usage of fscrawler (#278) @shadiakiki1986
  • documentation for loop moved to under --loop instead of under --rest (#277) @shadiakiki1986
  • Use path analyzer for directory fields (#272) @dadoonet
  • Prevent customised mappings from being overwritten (#231) @edjeavons
  • Elasticsearch Client must use search size if set (#240) @babadofar
  • Add OCR integration documentation (#224) @Jdecaudin
  • Default REST elasticsearch port should be 9200 and not 9300 (#142) @FredDut

Thanks to

@FredDut, @Jdecaudin, @Quix0r, @babadofar, @barts2108, @coder-sa, @ctamisier, @dadoonet, @edjeavons, @fgaujous, @gpcmol, @it20one, @kneubi, @shadiakiki1986, @soruly, @vakopian, @xcorail, Ajitpal Singh and Julien Decaudin