Alternative to FSCrawler in OpenSearch

I’ve recently moved from Elastic to Open Distro. However, if I understood correctly, OpenSearch is the way forward instead.
I’ve moved almost all of the functionality we currently use to OpenSearch, but I’m left with one gap:
To index SMB/NFS shares in our organisation I’ve been using FSCrawler (Welcome to FSCrawler’s documentation! - FSCrawler 2.10-SNAPSHOT documentation) and its respective Docker image (Docker Hub).
Is there an alternative for indexing files on an SMB/NFS share that is compatible with OpenSearch?
My google-fu doesn’t seem to find anything.

Thanks in advance!

Hey @Scarecrow - interesting. I wasn’t even aware of FS Crawler - looks useful. Have you tried it yet with OpenSearch? Glancing at the site I see a couple issues with their 2.8 snapshot:

Tika has explicit support for OpenSearch, but that version of the Java client has OpenSearch-blocking code. There is an OpenSearch Java client in the works, but in the meantime an older version of FS Crawler should work (one that uses Elasticsearch REST Client 7.13.4 or lower).

Once the OpenSearch Java client is GA, I think we could easily help FS Crawler support OpenSearch - it’s a fairly simple conversion.

@searchymcsearchface I have tried, actually, with the same Docker config I used for Elastic.
It just starts and exits with an error code [0], which says basically nothing :wink:
I’ll see about using an older version and what that gives. Thanks for the suggestion!

Wanted to give a final (?) update on this:
When I pull 2.7 from Docker Hub, its default Java REST client version is 14.0 if I understand it correctly, and it ends up refusing the connection.

So I guess I’ll have to wait for the work on the OpenSearch Java client :frowning:

@Scarecrow Actually, you’ll have to go back to a version that uses 7.13.4 or lower, as per the documentation (Compatibility - OpenSearch documentation).

Hi @Scarecrow,

I managed to get FSCrawler working with OpenSearch, but I had to build it myself with a few tweaks:

  1. As @searchymcsearchface said, it needs version 7.13.4. You can check out the last known commit that was using 7.13.4 from the git repo… c3d120ea33c3d53fb2182ae72d5634cd15f50593
  2. Now you have to build it, but before you do, there is a checkVersion() call that needs to be commented out, because it will halt FSCrawler when it detects that the “7” version does not match OpenSearch’s “1” version number.
  3. After building, you can try to run it. It will complain that no default settings were found for version “1”, so just copy the folder ~/.fscrawler/_default/7/ to ~/.fscrawler/_default/1/
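
For anyone who prefers to script step 3, here is a minimal Python sketch of the copy. It is not part of FSCrawler: the function name copy_default_settings and its parameters are made up for illustration, and it only assumes the _default/<version>/ layout described above.

```python
import shutil
from pathlib import Path


def copy_default_settings(config_root: Path, src_version: str = "7",
                          dst_version: str = "1") -> Path:
    """Copy <config_root>/_default/<src_version> to <config_root>/_default/<dst_version>.

    Hypothetical helper: it mirrors the manual copy from step 3 and nothing more.
    """
    src = config_root / "_default" / src_version
    dst = config_root / "_default" / dst_version
    if not src.is_dir():
        raise FileNotFoundError(f"no default settings found at {src}")
    # dirs_exist_ok=True (Python 3.8+) lets the copy be repeated without failing
    # when the destination already exists.
    shutil.copytree(src, dst, dirs_exist_ok=True)
    return dst


# Example call for the default location mentioned in step 3:
# copy_default_settings(Path.home() / ".fscrawler")
```

The example call is left commented out so that nothing in your home directory is touched unless you run it on purpose.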

Hope this helps.

hi @HelloWorld ,

you have just been promoted to be my life saver ;). Many thanks for investigating this.

I’m not well versed in git and/or building from a specific version, so I’ll have to investigate. I don’t suppose you have your own repo where the version you’ve built is available?
I suppose you run it from the local machine where you did your build, whereas I need a Docker image. If I remember correctly there are Docker build instructions for FSCrawler somewhere as well, so (again) I guess I’ll have to investigate.

After the current world-ending-work-crisis (what’s in a name :wink: ) has been averted I’ll report back here on my findings and experiences!

A note on #2: you can actually run OpenSearch in compatibility mode, so you don’t have to alter the version check code.

In opensearch.yml:

compatibility.override_main_response_version: true

OpenSearch will report as 7.10.2

If you can get FSCrawler to work, definitely go that route. David Pilato has done some great work there, and it is battle hardened.

Over on Apache Tika in 2.x, we’ve added fetchers and emitters that might be of interest. The notion is you configure a fetcher to get the bytes of files (we currently have local fileshare, s3 and gcs) and an emitter (we support local fileshare, Solr and OpenSearch)…tika takes care of most of the rest.

To scale it out, you can spin up a bunch of tika-servers in a pod and farm out requests to tika-servers. You send the file keys, tika fetches the bytes, runs the parse and emits to OpenSearch.

See: Tika 2.0 -- Robustness and Scale - Tim Allison - YouTube

Hit us up over on the tika lists or our jira (Tika - ASF JIRA) if you want to try this and we need to add a fetcher for smb/nfs.

Sorry about the late answer but here goes:

  • building from the git version before 7.13.4 does not work (for me at least), since the Docker image contains a bug that was only resolved later.

I’ve looked over the link you gave to the presentation on the Tika fetchers. SMB/NFS would definitely be needed for it to fit our needs :slight_smile:

To “fix” FSCrawler so it works with OpenSearch again, I would assume that adding the OpenSearch Java REST client as another option in FSCrawler would be “enough”, as right now the following modules are present:


elasticsearch-client-base
elasticsearch-client-v7
elasticsearch-client-v6

But then again, I’m not very familiar with FSCrawler or the code behind it, so I might be wrong.
I’ll update this topic again when I see some progress on this front!

If you do want to consider heading down the tika-pipes route in the future, please open an issue on our JIRA: https://issues.apache.org/jira/projects/TIKA

Thank you!

as requested (and for reference): [TIKA-3659] SMB/NFS support - ASF JIRA

Hi :wave:t3:

I’m the author of FSCrawler.

Disclaimer: I’m also an elastic employee so I might be biased :wink:

Thanks @tallison for pinging me in the Tika issue so I discovered this discussion.

I have some plans for the future, and one of my ideas is to make FSCrawler even more pluggable.
I’d like to support a plugin system so people would be able to write their own plugins easily.

https://github.com/dadoonet/fscrawler/issues/1114

Another thing I’m thinking of is to entirely remove the dependency on any client and build what I need myself.
That would mean that the same internal client would support any version of Elasticsearch, which would probably mean OpenSearch as well, although it won’t be tested.

Another idea would be to support a Beats protocol output. That would allow people to connect FSCrawler to Logstash, for example, and let them use whatever output they want.

Sadly, all this requires a lot of refactoring, and I have no idea when I’ll be able to do it.
One short-term thing people could do is fork FSCrawler and change the Elasticsearch client to the OpenSearch one.

Of course, I recommend using Elastic and its Workplace Search product, as it provides a full search solution for free, including a powerful UI and connectors to many other systems (Dropbox, Gmail, GitHub…). FSCrawler has supported it since 2.7.

Also adding here for reference, the discussion I had already with @Scarecrow :wink:

https://github.com/dadoonet/fscrawler/issues/1274

Welcome @dadoonet. We have different biases, but glad to have you here.

If you did remove client dependencies, would you be open to fixes that patch any incompatibilities that may crop up?

Sorry to be a necromancer here, but I am using docker-compose and want to try this setting to see whether it fixes the issue.

Is this the correct docker-compose then? I would only put the setting in environment of opensearch-node1:

version: '3'
services:
  opensearch-node1:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node1
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node1
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true # along with the memlock settings below, disables swapping
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m" # minimum and maximum Java heap size, recommend setting both to 50% of system RAM
      - compatibility.override_main_response_version=true # <-- here?
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536 # maximum number of open files for the OpenSearch user, set to at least 65536 on modern systems
        hard: 65536
    volumes:
      - opensearch-data1:/usr/share/opensearch/data
    ports:
      - 9200:9200
      - 9600:9600 # required for Performance Analyzer
    networks:
      - opensearch-net
  opensearch-node2:
    image: opensearchproject/opensearch:latest
    container_name: opensearch-node2
    environment:
      - cluster.name=opensearch-cluster
      - node.name=opensearch-node2
      - discovery.seed_hosts=opensearch-node1,opensearch-node2
      - cluster.initial_cluster_manager_nodes=opensearch-node1,opensearch-node2
      - bootstrap.memory_lock=true
      - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
    ulimits:
      memlock:
        soft: -1
        hard: -1
      nofile:
        soft: 65536
        hard: 65536
    volumes:
      - opensearch-data2:/usr/share/opensearch/data
    networks:
      - opensearch-net
  opensearch-dashboards:
    image: opensearchproject/opensearch-dashboards:latest
    container_name: opensearch-dashboards
    ports:
      - 5601:5601
    expose:
      - "5601"
    environment:
      OPENSEARCH_HOSTS: '["https://opensearch-node1:9200","https://opensearch-node2:9200"]'
    networks:
      - opensearch-net

volumes:
  opensearch-data1:
  opensearch-data2:

networks:
  opensearch-net:

I can answer this myself: yes, it is.

Going to localhost:9200/ gives me this without compatibility.override_main_response_version=true set:

{
  "name" : "opensearch-node1",
  "cluster_name" : "opensearch-cluster",
  "cluster_uuid" : "iPE1JxDBRp2FF22Bufhs8A",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.10.0",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:55:20.784011088Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}

and with compatibility.override_main_response_version=true:

{
  "name" : "opensearch-node1",
  "cluster_name" : "opensearch-cluster",
  "cluster_uuid" : "iPE1JxDBRp2FF22Bufhs8A",
  "version" : {
    "number" : "7.10.2",
    "build_type" : "tar",
    "build_hash" : "eee49cb340edc6c4d489bcd9324dda571fc8dc03",
    "build_date" : "2023-09-20T23:55:20.784011088Z",
    "build_snapshot" : false,
    "lucene_version" : "9.7.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
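
The difference between the two responses above can also be checked in code. The following is a rough Python sketch, based only on the two responses posted here: the function name looks_like_compat_override is made up, and the rule it applies (OpenSearch tagline present, "distribution" field absent) is a heuristic read off those two responses, not a documented contract.

```python
import json


def looks_like_compat_override(root_body: str) -> bool:
    """Guess whether an OpenSearch node answers with the overridden version.

    root_body is the JSON text returned by a GET on the root endpoint
    (localhost:9200/ in this thread). Heuristic, based only on the two
    responses shown above.
    """
    info = json.loads(root_body)
    version = info.get("version", {})
    is_opensearch = "OpenSearch" in info.get("tagline", "")
    # Without the override the node reports "distribution": "opensearch";
    # with it, that field is missing and "number" is an Elasticsearch-style value.
    return is_opensearch and "distribution" not in version


# Trimmed copies of the two responses from this thread.
WITHOUT_OVERRIDE = """{
  "version": {"distribution": "opensearch", "number": "2.10.0"},
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}"""
WITH_OVERRIDE = """{
  "version": {"number": "7.10.2"},
  "tagline": "The OpenSearch Project: https://opensearch.org/"
}"""

print(looks_like_compat_override(WITHOUT_OVERRIDE))  # False
print(looks_like_compat_override(WITH_OVERRIDE))     # True
```

A check like this only tells you what the node reports on its root endpoint. Whether a particular FSCrawler build then works against it still has to be tested.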