Bulk deleting bad data
I had to clear up some old bad data that was left over from a bit of bad code. Unfortunately the bad data didn't rear its ugly head until recently and a lot of data had built up. It was also very hard to detect the bad data because of many places it could be referenced from…only if it had no references to it from any of 7 places would it need to be deleted.
This means doing a really horrible query either like this:
select id from atable where id not in (select id from anothertable) and id not in (select id from yetanothertable) and id not in (select id from moretables) ..... .....
This is very, very, very slow.
A more efficient way of doing this is this:
select id from atable a where not exists (select id from anothertable b where a.id = b.id) and not exists (select id from yetanothertable c where a.id = c.id) and not exists (select id from moretables d where a.id = d.id) ..... .....
However, when you are dealing with potentially 100,000's of rows it is still quite slow…but it does get there. The next problem is actually deleting the data once you've managed to select it. As a little test I thought I'd try and delete the whole lot, but that just didn't work…too slow. Even if I did have the patience to leave it running for hours, I couldn't let it lock up the database like that for that long.
So, the only solution was to do it in batches. I wrote a quick java program that would iterate through an do the deletes in small batches of a 100 or so at a time. My first mistake was trying to reuse some Spring/Hibernate code I already had instead of going straight to old school and using JDBC.
Although in theory you can get a Connection object from the hibernate Session, via session.connection(), it really is NOT the same as just getting a good old fashioned JDBC connection. The deletes were taking absolutely ages, so I profiled it and noticed that hibernate was still trying to do some of its funky stuff in the background, really slowing things down.
Plan B (or is it D by now?). Spring comes with a handy little JdbcTemplate which lets you do real JDBC but without a lot of the exception and connection/statement/resultset closing pains. Finally…it worked.
So, lesson of the day:
- not exists type queries are faster than not in queries
- Bulk deletes can be verrrrrrry slow
- Batching deletes is better, but with real JDBC not hibernate SQL calls