Mango query performance #3828

tk185141 · 2021-11-11T11:51:54Z

tk185141
Nov 11, 2021

My team has selected CouchDB for our persistence tier because we have strong requirements for highly scalable data replication in an IoT-style deployment. The IoT device has very limited hardware (disk/RAM), so we also want to use the DB for queries, not just as a data transport.

We have created a very simple application to test the speed of LIKE-style queries. Note that both test runs were done on a development machine using the standard installs of CouchDB and MongoDB on a Mac, so the results reflect vanilla, out-of-the-box configurations.

To validate the scenarios, we created a 200K-document DB. An example document looks like this:

{
  "_id": "00ab4t62mjek",
  "_rev": "1-e7a8ef87fb88594184f7e987e1c8c0e7",
  "randomString": "728547d59f417568badb9a6d5902f0f804821ad356c3e053c14611a3df4aeafb42efed59d27bdd1e8e5701cb96faf4e48dbeae431ea1fb6e90d3040f508e1f00f809c652b41de8f96c8fda62c9417aeb580c0dabff26d2bc1fcfac131d619153564cdb948e319fb213286924c53ea8b730a790b0df8045d84e18ed977bea12be9b09258daf40b7f51961ad67013b51db6eea25d885d23a0c84cebc2e5b040be0a175c303c9a22210efd5a8bf8b3c6d45c51cb17e9c1985a5c45ace8d4976c3eb5c7cffbeadb6b6fd"
}

We want to be able to do very basic pattern matches (basically the SQL equivalent of LIKE '%SEARCH_STRING%'), but the performance we are seeing with CouchDB (Mango) is extremely slow (9 seconds) compared with MongoDB queries (400 ms). I see that the community typically suggests Lucene, but its overhead does not fit the footprint requirements of this IoT scenario.

CouchDB Results

(updated scenario since the original screenshot had the wrong query)

If we compare this against a very similar query in MongoDB, the response time is in milliseconds. (Note that randomString is randomly generated, so the values are not the same in both DBs, but the scripts that populate the DBs are essentially the same.)

rnewson · 2021-11-11T15:38:24Z

rnewson
Nov 11, 2021
Collaborator

The warning that there's "No index available for this query" is the explanation. Without an index, Mango will read the entire database each time. That's great for prototyping and ad hoc queries but not for you. Build an index (the "_index" endpoint) for this field and you'll see a huge improvement.
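
For reference, a minimal sketch of the request body that the "_index" endpoint accepts for a JSON index on this field. The helper name buildIndexRequest and the "<field>-index" naming scheme are illustrative assumptions; the resulting object would be sent as the body of POST /{db}/_index. A JSON index of this kind serves equality and range lookups on the field, so it should not be expected to speed up an unanchored substring match.

```javascript
// Build the JSON body for POST /{db}/_index (Mango JSON index).
// The helper name and the "<field>-index" naming scheme are illustrative.
function buildIndexRequest(field) {
  return {
    index: { fields: [field] },
    name: field + "-index",
    type: "json",
  };
}

console.log(JSON.stringify(buildIndexRequest("randomString"), null, 2));
```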

4 replies

rnewson Nov 11, 2021
Collaborator

On second glance, I notice that you want to search for random substrings, and I don't think we can build any index that helps there. Are there any other search criteria you'd add that we could build an index on (e.g., a type or a timestamp)?

tk185141 Nov 11, 2021
Author

Agree. Unfortunately, from my reading and testing, adding the index actually hurts performance.

{
 "dbname": "todd-test",
 "index": {
  "ddoc": "_design/a14e98c0aad8c17ca7da4d9c934d0afc4584403a",
  "name": "randomString-index",
  "type": "json",
  "partitioned": false,
  "def": {
   "fields": [
    {
     "randomString": "asc"
    }
   ]
  }
 },
 "partitioned": false,
 "selector": {
  "randomString": {
   "$regex": "55339778df3fc6472d818967ef661e2b2d0aebfa42d6d6bd8d6ca1"
  }
 },
 "opts": {
  "use_index": [],
  "bookmark": "nil",
  "limit": 25,
  "skip": 0,
  "sort": {},
  "fields": "all_fields",
  "partition": "",
  "r": [
   49
  ],
  "conflicts": false,
  "stale": false,
  "update": true,
  "stable": false,
  "execution_stats": false
 },
 "limit": 25,
 "skip": 0,
 "fields": "all_fields",
 "mrargs": {
  "include_docs": true,
  "view_type": "map",
  "reduce": false,
  "partition": null,
  "start_key": [],
  "end_key": [
   "<MAX>"
  ],
  "direction": "fwd",
  "stable": false,
  "update": true,
  "conflicts": "undefined"
 }
}

tk185141 Nov 11, 2021
Author

The example I used was randomly generated strings. In actuality, each of these fields is a short description of an item. However, the performance is not really any different.

rnewson Nov 11, 2021
Collaborator

It would help to see a realistic example of the problem at hand.

Assuming a free-form text field (rather than a long, unbroken random string), my suggestion would be a search index, since that will tokenize the field and allow efficient searching on the terms generated. Alternatively, if you can't use Lucene (and the JVM that comes with it), you could do the same thing in a map function: break the description field up on some boundary (e.g., whitespace) and emit each of the resulting terms. You can then efficiently find all docs with a particular term (but can't efficiently do boolean ANDs/ORs, etc.). If your description fields are, in reality, multiple fields that happen to be in a string together, consider exposing them as fields in their own right.
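
A minimal sketch of the map-function approach described above. The field name "description" is a hypothetical stand-in for the real field, and the emit() defined here is only a local stub so the function can be run outside CouchDB; inside a design document, emit() is provided by the view server and only the body of map would be stored.

```javascript
// Local stand-in for CouchDB's emit(); the view server supplies the real one.
const emitted = [];
function emit(key, value) {
  emitted.push([key, value]);
}

// Map function: emit one row per distinct, lower-cased, whitespace-separated term.
// "description" is a hypothetical field name.
function map(doc) {
  if (typeof doc.description !== "string") return;
  var seen = Object.create(null); // avoids clashes with inherited property names
  doc.description.toLowerCase().split(/\s+/).forEach(function (term) {
    if (term && !seen[term]) {
      seen[term] = true;
      emit(term, null);
    }
  });
}

map({ _id: "a", description: "Iced coffee  iced tea" });
console.log(emitted);
```

Querying such a view with key="coffee" would then return the ids of all docs containing that term via an index lookup rather than a full scan. It matches whole terms only, so it does not cover arbitrary substrings.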

kocolosk · 2021-11-11T16:05:02Z

kocolosk
Nov 11, 2021
Collaborator

Hi Todd! Good to hear from you.

Putting all the work of the actual DBMS aside for a moment, I was curious about the latency floor that we could expect from the $regex operator working against 200k strings. I generated a list of random hexadecimal strings (using couch_uuids:random()) and found that it took about 3000 +/- 300 milliseconds to process them all using a simple regular expression like the one you have in your screenshot.

Now, CouchDB's regular expression operator hasn't seen a lot of optimization work. In particular, it does not compile the expression, which one would expect to be useful when executing the RE 200k times in a loop! I tried doing that and was a little surprised to find that the compiled regular expression still took 2500 +/- 200 milliseconds to execute. Not too much improvement.

On the other hand, your use case doesn't actually need the full might of PCRE. Erlang does have highly optimized code for matching parts of binary strings, so I tried using binary:match/2 with the same pattern. This got me down to ~550 ms ... still not as fast as MongoDB, but several times better than the PCRE implementation.
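
To illustrate the same idea outside of CouchDB: when the pattern is a literal string, a plain substring search gives the same answer as a regular expression without involving the regex engine. This is a JavaScript analogy with made-up helper names, not the Erlang code measured above, and it makes no claim about relative timings.

```javascript
// Escape regex metacharacters so the needle is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Regex route: compile the expression once, then reuse it for every value.
function countWithRegex(values, needle) {
  const re = new RegExp(escapeRegExp(needle));
  return values.filter(function (v) { return re.test(v); }).length;
}

// Substring route: no regex engine involved.
function countWithSubstring(values, needle) {
  return values.filter(function (v) { return v.includes(needle); }).length;
}

const sample = ["728547d59f41", "aa55339778bb", "55339778", "deadbeef"];
console.log(countWithRegex(sample, "55339778"), countWithSubstring(sample, "55339778"));
```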

Of course that still leaves 70% of the execution time outside of the regular expression execution, presumably taken up with the process of running a table scan on all the document data. IIRC CouchDB doesn't do any predicate pushdown which probably drives up the overhead a lot for a query like this that scans a lot of data to return a small result set.

11 replies

tk185141 Nov 12, 2021
Author

@kocolosk

For the environment I am testing on, I am just using the macOS CouchDB application, so I am using Fauxton to change the configuration. For my tests, I changed the shard count (maybe too far, but to illustrate a point I went with "cluster":{"n":"1","q":"16"}). I did this a few times and saw cases where the values were not being persisted (maybe after a restart).

However, after doing this, I did not really see any changes in performance. Is there a view similar to the indexing view in Fauxton to see any type of rebalancing happening to verify the shards are active?

tk185141 Nov 12, 2021
Author

@janl / @rnewson

Here is a document that has been generalized but has the structure we are looking to query on.
For the regex, we’re looking to support searching with wildcards on itemId.itemCode, shortDescription.values[].value, and longDescription.values[].value. (Note the nesting, which adds some complexity.)

{
  "_id": "0987654321",
  "_rev": "1-1ba950fd923ea18b799e187e8bfa3f7a",
  "version": 1632247337569,
  "packageIdentifiers": [
    {
      "type": "0",
      "value": "1234"
    }
  ],
  "longDescription": {
    "values": [
      {
        "locale": "en-US",
        "value": "ICED COFFEE"
      }
    ]
  },
  "shortDescription": {
    "values": [
      {
        "locale": "en-US",
        "value": "ICED COFFEE"
      }
    ]
  },
  "merchandiseCategory": {
    "nodeId": "1-999-999-999"
  },
  "alternateCategories": [],
  "status": "ACTIVE",
  "departmentId": "1300",
  "nonMerchandise": false,
  "familyCode": null,
  "referenceId": "55555",
  "manufacturerCode": null,
  "externalIdentifiers": [],
  "posNumber": null,
  "sourceSystem": null,
  "dynamicAttributes": [
    {
      "type": "retail-item",
      "attributes": [
        {
          "key": "ECOMM_PRODUCT_SEARCH_KEYS",
          "value": null,
          "localizedValue": null
        },
        {
          "key": "ADDITIONAL_DESCRIPTION",
          "value": null,
          "localizedValue": null
        },
        {
          "key": "ECOMM_DESCRIPTION",
          "value": null,
          "localizedValue": null
        },
        {
          "key": "ITEM_LINKED_AS_TAG",
          "value": null,
          "localizedValue": null
        },
        {
          "key": "ITEM_TYPE_CODE",
          "value": "0",
          "localizedValue": null
        }
      ]
    }
  ],
  "itemId": {
    "itemCode": "1234567890"
  },
  "auditTrail": {
    "lastUpdated": "2021-09-21T18:02:56Z",
    "lastUpdatedByUser": "removed"
  },
  "fetchTime": 1636474832214,
  "entity_version": "1636474832214"
}
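
A hedged sketch of how a Mango selector over those three paths could be assembled, using $or, $elemMatch for the array elements, and $regex for the substring match. The helper names are illustrative, and the escaping routine assumes that this set of metacharacters is sufficient for CouchDB's regex engine. Without a usable index, a selector like this still scans every document.

```javascript
// Escape regex metacharacters so the user's text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a selector matching `term` as a substring of any of the three fields.
function buildItemSearchSelector(term) {
  const contains = { $regex: escapeRegExp(term) };
  return {
    $or: [
      { "itemId.itemCode": contains },
      { "shortDescription.values": { $elemMatch: { value: contains } } },
      { "longDescription.values": { $elemMatch: { value: contains } } },
    ],
  };
}

// Body for POST /{db}/_find
console.log(JSON.stringify({ selector: buildItemSearchSelector("COFFEE"), limit: 25 }, null, 2));
```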

tk185141 Nov 12, 2021
Author

Some more related results, in an effort to simulate what we have. I used two different libraries to generate random words instead of the long random string.

The first package I used was txtgen (https://github.com/ndaidong/txtgen). The performance improvement I referenced above was attributable to the limited randomness of the words being generated. With records that looked like this, we were getting fast responses (~1 second or less):

{
  "_id": "00t9emcax8x4",
  "_rev": "1-45bee97493a7fe9fc081cf86f16466e3",
  "randomString": "Before tangerines, pineapples were only lions. A happy kiwi without horses is truly a grapefruit of warm flies! A pioneering grapefruit is an alligator of the mind. What we don't know for sure is whether or not few can name a proud grapefruit that isn't a self-confident eagle. Shouting with happiness, dogs are bright raspberries? The first diplomatic strawberry is, in its own way, a zebra? An elephant is a hard-working hippopotamus! The creative kitten comes from an efficient blackberry."
}

We then moved to a more random (gibberish) word generator, the jabber package (https://github.com/dejavu1987/jabber). With records that looked like this, the performance was back to the ~9 second response time:

{
  "_id": "00tbzjdge05rd",
  "_rev": "1-d14ce48e96dbb018b113519a2dd18a90",
  "randomString": "Tinob. Madug fadadedipo tar no fubenogim lomedota yoro yibofula. Haperupuna ragu ne bobaxara hagolo sudoyip zonamob qupucuri lapov mebonora roricecufo. Bicem rebequm. Curubeca cim gudob bocepi torohibire te yodalo recudugam.",
  "type": "jabber"
}

kocolosk Nov 12, 2021
Collaborator

Regarding the shard count, changing q in the server config will only change the default for newly-created databases. Increasing the shard count on an existing DB can be done using the _reshard API. I don't recall how much of that interface is exposed in Fauxton, but the guide for using the API is here.
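
A rough sketch of what driving _reshard could involve. It assumes that a "split" job (submitted via POST /_reshard/jobs) divides each targeted shard range in two, so the shard count doubles per round; that doubling and the helper names below are assumptions to verify against the documentation for your CouchDB version.

```javascript
// Body for POST /_reshard/jobs that requests a split of every shard of `db`.
function buildSplitJob(db) {
  return { type: "split", db: db };
}

// Number of doubling rounds needed to go from currentQ shards to at least targetQ.
// Assumes each round of split jobs doubles the shard count.
function splitRoundsNeeded(currentQ, targetQ) {
  if (currentQ < 1) throw new RangeError("currentQ must be at least 1");
  let rounds = 0;
  for (let q = currentQ; q < targetQ; q *= 2) {
    rounds += 1;
  }
  return rounds;
}

console.log(buildSplitJob("todd-test"));
console.log(splitRoundsNeeded(2, 16));
```

Under that assumption, going from q=2 to q=16 would take three rounds of split jobs (2, 4, 8, 16).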

tk185141 Nov 12, 2021
Author

@kocolosk I took a somewhat long path to solve this, but got it to work.

I created a new local DB with a q value of 16 (extreme, I think, but good for testing): nano.db.create('todd-test2-sharded', {q:16})
I then replicated the existing DB into that new DB.
Once the compaction and other background work was done, I reran the same queries as before.
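
The steps above could be scripted roughly as follows. The function takes a nano-style client as a parameter so that it stays self-contained; the function name is made up, and the server URL in the usage comment is an assumption. It relies only on nano's db.create(name, opts) and db.replicate(source, target).

```javascript
// Create `target` with the requested shard count, then replicate `source` into it.
// `nano` is a client object as returned by require("nano")(url).
async function copyIntoShardedDb(nano, source, target, q) {
  await nano.db.create(target, { q: q });
  return nano.db.replicate(source, target);
}

// Usage (the URL is an assumption about a local install):
// const nano = require("nano")("http://localhost:5984");
// copyIntoShardedDb(nano, "todd-test2", "todd-test2-sharded", 16)
//   .then(function (result) { console.log(result); });
```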

Non-Sharded

Sharded

tk185141 · 2021-11-12T17:03:45Z

tk185141
Nov 12, 2021
Author

@kocolosk wrote

I tried using binary:match/2 with the same pattern. This got me down to ~550 ms ... still not as fast as MongoDB, but several times better than the PCRE implementation.

Is this internal to CouchDB, or something we can declare in our queries for testing?

1 reply

kocolosk Nov 12, 2021
Collaborator

Right, sorry, I should have clarified, this is not currently exposed directly to users. It'd be very simple to do so if we agreed it's something we wanted.

tk185141 · 2021-11-12T20:25:39Z

tk185141
Nov 12, 2021
Author

Verification of shard impacts

Is the size denoted here accurate in terms of disk footprint? I am trying to understand the impact of sharding (single node) for an IoT-like deployment. Same question on memory.

2 replies

kocolosk Nov 12, 2021
Collaborator

Honestly, I can't recall which size is reported in the UI; GET /todd-test2-sharded would show you a sizes dictionary that reports a few different measures of DB size, including sizes.file (the size on disk).

The fixed storage overhead for each shard is very small, just a few KB. There's no additional replication going on, so the total size of a Q=16 DB ought to be quite close to the total size of a Q=1 DB with the same documents in it, and the Q=16 DB can run compaction with less overall free space (since it only needs to rewrite 1/16th of the data at a time). Active DB shards do increase memory utilization a bit, but I'd wager it's less than a megabyte per shard when idle. Others on the IBM team might have a more precise heuristic there.

Glad to see the sharding helped drive down response times. It's a bit brute-force but hey we'll take it 👍

tk185141 Nov 12, 2021
Author

Looks roughly the same

Non-Sharded
{"db_name":"todd-test2","purge_seq":"0-g1AAAABXeJzLYWBgYMpgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUnlsQBJhgYg9R8IshIZ8KhNZEiqhyjKAgBm5Rxs","update_seq":"301898-g1AAAABdeJzLYWBgYMpgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUnlsQBJhgYg9R8IspIYmPya8ShPZEiqh6rzOZ4FAAzzHlY","sizes":{"file":129470905,"external":61202839,"active":114969882},"props":{},"doc_del_count":0,"doc_count":301898,"disk_format_version":8,"compact_running":false,"cluster":{"q":2,"n":1,"w":1,"r":1},"instance_start_time":"0"}

Sharded
{"db_name":"todd-test2-sharded","purge_seq":"0-g1AAAAKjeJzLYWBgEMhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUox5bEASYYPQOo_EGQlMhBU-wCi9j0xai9A1N4nRu0BiNrzxKjdAFG7nxi1CyBq1xOjdgJE7Xxi1DZA1PbjV5tUACST6gmGbVICSF0-YXUBIHXxhNU5gNT5E1ZnAFJnT1idAkidPmF1AiB18gTVJTIk8UMUZQEATKDdhg","update_seq":"301898-g1AAAALTeJzLYWBgEMhgTmEQTM4vTc5ISXLIyU9OzMnILy7JAUox5bEASYYPQOo_EGQlMTB4LSOo_AFE-XuQcs_zBJVfgCi_D1ZuQ1D5AYjy82DlVwgq3wBRvh_sdmmCyhdAlK8Hm-5HUPkEiPL5YOWLCSpvgCjvByv_hEd5UgGQTKqHhrnnH3xKE0BK82FKdfEpDQApjYfFZCk-pQ4gpf4wUzPwKTUAKbWHmWqET6kCSKk-zFR8KS9JAKRUHqZ0FR6liQxJ_FB1HlezAP9V6nM","sizes":{"file":114687380,"external":61202839,"active":114054645},"props":{},"doc_del_count":0,"doc_count":301898,"disk_format_version":8,"compact_running":false,"cluster":{"q":16,"n":1,"w":1,"r":1},"instance_start_time":"0"}
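
To put numbers on "roughly the same", the sizes values from the two responses above can be compared directly. The helper name is illustrative; the figures are copied from the output above. Here sizes.file is the on-disk file size, sizes.active is the live data within it, and sizes.external is the uncompressed size of the contents.

```javascript
// sizes values copied from the two GET /{db} responses above.
const nonSharded = { q: 2, sizes: { file: 129470905, external: 61202839, active: 114969882 } };
const sharded = { q: 16, sizes: { file: 114687380, external: 61202839, active: 114054645 } };

// Relative change from `before` to `after`, in percent.
function percentChange(before, after) {
  return ((after - before) / before) * 100;
}

["file", "active", "external"].forEach(function (key) {
  const change = percentChange(nonSharded.sizes[key], sharded.sizes[key]);
  console.log(key + ": " + change.toFixed(1) + "%");
});
```

The on-disk file size differs by about 11%, while the active and external sizes differ by under 1%. The gap in file size may reflect how recently each database was compacted rather than the shard count itself.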

janl · 2021-11-19T17:26:27Z

janl
Nov 19, 2021
Collaborator

Not directly applicable, but I made a little demo of what it could look like to add string/array/date manipulation functions to the Mango indexer, which would help speed up this query: https://gist.github.com/janl/e5469f6f08c9be0405f31451889d5030

0 replies
