mongodb - How to insert quickly to a very large collection


I have a collection of about 70 million documents. Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that this is because the mongo engine is comparing the _id's of all the new documents against all 70 million existing ones to find any duplicate _id entries. Since the _id-based index is disk-resident, this makes the code a lot slower.

Is there any way to avoid this? I just want mongo to take the new documents and insert them as they are, without doing this check. Is that even possible?

Diagnosing "slow" performance

Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (i.e. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
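For example, the serverStatus counters can be inspected from any driver; here is a minimal PyMongo sketch (the connection string is an assumption about your deployment):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local deployment

# serverStatus reports runtime metrics: operation counters, memory
# usage, connection counts, and more.
status = client.admin.command("serverStatus")
print(status["opcounters"])  # cumulative insert/query/update/delete counts
print(status["mem"])         # resident and virtual memory usage
```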

There are also a number of interesting presentations that provide relevant background material for performance troubleshooting and debugging.

Improving the efficiency of inserts

Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve the efficiency of inserts by:

  • removing any unused or redundant secondary indexes on this collection

  • using the Bulk API to insert documents in batches (see the sketch after this list)
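A minimal PyMongo sketch of the batched-insert approach (the connection string, database, and collection names are my own assumptions; insert_many is the modern driver entry point to the bulk write protocol):

```python
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017").test.events  # hypothetical

# Build one batch of ~2K documents and send it in a single bulk call.
docs = [{"n": i, "payload": "..."} for i in range(2000)]

# ordered=False lets the server keep inserting past any individual
# failure instead of aborting the whole batch at the first error.
result = collection.insert_many(docs, ordered=False)
print(len(result.inserted_ids))  # 2000
```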

Assessing the assumptions

Whenever I add new documents in batches (let's say 2K), the insert operation is really slow. I suspect that this is because the mongo engine is comparing the _id's of all the new documents against all 70 million existing ones to find any duplicate _id entries. Since the _id-based index is disk-resident, this makes the code a lot slower.

If the collection has 70 million entries, that does not mean an index lookup involves 70 million comparisons. The indexed values are stored in B-trees, which allow lookups with a small number of efficient comparisons. The exact number will depend on the depth of the tree, how the index was built, and the value you're looking up, but it will be on the order of tens (not millions) of comparisons.
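As a rough back-of-the-envelope illustration (my own, not from the answer itself): even a plain binary tree over 70 million keys is only about 27 levels deep, and B-tree nodes hold many keys each, so the tree is far shallower still:

```python
import math

n = 70_000_000

# Comparisons for a balanced *binary* tree over n keys.
print(math.ceil(math.log2(n)))      # 27

# B-tree nodes hold many keys; assuming a hypothetical branching
# factor of ~100 keys per node, the tree is only a few levels deep.
print(math.ceil(math.log(n, 100)))  # 4
```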

If you're curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: storage visualizers and commands for MongoDB.

Since the _id-based index is disk-resident, this makes the code a lot slower.

MongoDB loads your working set (the portion of data & index entries recently accessed) into available memory.

If you are able to create your ids in approximately ascending order (for example, the generated ObjectIds), then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM?").
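You can see this ordering property directly, since ObjectIds embed a leading timestamp (a sketch using the bson package that ships with PyMongo):

```python
from bson import ObjectId

# ObjectIds start with a creation timestamp, so ids generated in
# sequence sort in approximately ascending order.
ids = [ObjectId() for _ in range(5)]
print(ids == sorted(ids))      # True (within one process)
print(ids[0].generation_time)  # the embedded creation time
```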

Yes, I can let mongo use the _id itself, but I don't want to waste an index on it. Moreover, if I let mongo generate the _id, won't it still need to compare against existing keys and potentially hit duplicate key errors?

A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of getting a duplicate key exception, so your application will not normally have to catch duplicate key exceptions and retry with a new _id).
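If you want to be defensive about that vanishingly rare case anyway, a retry loop is simple; a minimal PyMongo sketch (the collection name and attempt count are my own assumptions):

```python
from bson import ObjectId
from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

collection = MongoClient().test.events  # hypothetical collection

def insert_with_retry(doc, attempts=3):
    """Insert doc, regenerating _id on a (rare) duplicate key error."""
    for _ in range(attempts):
        try:
            doc["_id"] = ObjectId()
            return collection.insert_one(doc)
        except DuplicateKeyError:
            continue  # astronomically unlikely; try a fresh ObjectId
    raise RuntimeError("could not insert with a unique _id")
```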

If you have a better candidate for a unique _id in your documents, then feel free to use that field (or a combination of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.
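For example, with a natural key (the key format here is purely illustrative):

```python
from pymongo import MongoClient

collection = MongoClient().test.events  # hypothetical collection

# Using a natural unique key as _id means the mandatory _id index
# does double duty, instead of maintaining a second unique index.
collection.insert_one({
    "_id": "user-42:2024-01-15",  # hypothetical composite key
    "payload": {"clicks": 3},
})
```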

