Genius!
How very meta.
WHEN IT'S CRITICAL, YOU CAN COUNT ON US
2ndQuadrant provides full 24/7 problem resolution technical support for production systems.
If PostgreSQL breaks, we'll get you back up quickly.
[1] https://2ndquadrant.com/en/support/support-postgresql/

I accept all of the humour on that point with a grin myself, though I must say it's a nice problem to have. Thanks to everybody for reading and commenting.
2ndQuadrant is a large enough company that we have CTOs who write blogs and design stuff, we have other people who run blog websites and a variety of infrastructure, but mainly we have many dev and support staff helping customers.
So regardless of whether their complaints are justified or not, it should not be taken as an endorsement of MySQL over Postgres, but rather as an endorsement of NoSQL over RDBMSs.
Which is really just what every company at this scale does (except for Google and F5, if whitepapers are to be believed).
If you use only one table, then it's by no means "relational", so I don't see why you'd need a database system designed from start to finish to support the relational model.
Postgres is lacking in these scenarios, whereas a particularly fine-tuned version (or fork) of InnoDB (or MyRocks, or whatever they end up choosing) handles them better.
See Facebook's "mysql-5.6" branch, which has hundreds of patches piled on specifically to support these taxing workloads.
So MVCC also has almost zero bearing on what they are doing.
MVCC, and how it has been implemented in the two storage engines under discussion here, absolutely has a lot to do with it.
FYI: the term "relational" in "relational database" does not refer to relationships between tables but to the mathematical concept of a relation, which is a set of tuples, i.e. a single table.
But I think that's a moot point; I doubt they kept any part of the relational model inside their schemaless database.
Also, what is the scale of Uber that you think is unsuitable for RDBMS? Uber does a million rides per day but these rides are very predictably local. I don't know enough to make any firm claims here but at first sight this doesn't look like the sort of scale that is unachievable on an RDBMS in principle.
Never said it's unsuitable, I have no idea. Just said the move from relational to NoSQL is what most billion dollar companies with mass-market apps or websites do.
Many of them also do sharding on normal RDBMS.
From the context, I don't think they ever evaluated Postgres NoSQL features. But I don't think they would get a different conclusion if they did. MySQL trades some consistency guarantees for speed, and it looks like Uber doesn't need those extra guarantees anyway.
We still have many applications that talk directly to MySQL, and we still have our original API monolith that talks directly to Postgres.
All new applications are being built using distributed databases like our in-house Schemaless system which happens to be backed by MySQL, and we also have Riak and Cassandra in production.
No, they aren't. You need to understand that databases are being used in a far wider range of ways than you think.
Batch/streaming analytics are typically done with HDFS, Cassandra, HBase, MongoDB etc. Event aggregation often with a time series database like Druid, InfluxDB etc. Web API serving layers are still the domain of lightweight SQL databases like MySQL, PostgreSQL. Customer data warehouses are still dominated by Teradata, Oracle, SAP etc.
What you misconstrue as a single platform is often a multi-headed beast with various architectural components, each comprising its own technology stack.
Unstructured information is, surprise, unstructured, which means it is harder to query and analyze, since the data needs to be fully scanned and parsed to perform any computation.
I have seen many cases where MongoDB is used for this and suffers from really bad performance.
Do a simple POC where you ram 1 million rows into MongoDB and then build a webapp to do some basic analytics. Look, it works and you get responses within a second. Cool!
Then the real data comes in at 12 billion rows and your analytics take 3 hours to run.
So you try the sharding thing, and realize that it works for a while, except now every analytics query needs to hit every shard...
I worked on a 1.3 trillion row database this year for predictive analytics (that count was for one table; the others varied below it), and the hoops we had to jump through to get that thing to run anything in a manageable time frame were mind-boggling.
Any POC-level demo of a database would be all but meaningless at that scale. You have to do a real test to see the challenges around it.
Keeping 1.3 trillion rows in memory is pretty expensive, so we were trying to cut some costs by using that to funnel data in and out.
Not a bad solution overall.
Ideally you'd actually use more than that; however, diminishing returns left us settling on what we went with.
(Note I said build, not use; there are different skillsets between driving a car and reconfiguring the engine to run on seed oil.)
Also, their replication strategy looked like a joke, with not enough automation. Hopefully they never use Galera, or their bad engineering practices could actually result in huge data losses.
I once ran a Galera cluster and it came to data losses pretty easily, especially after the short network splits that occurred randomly on the network.
Given that the change is an "apples to oranges comparison" change (PostgreSQL with a normalized SQL table structure to MySQL with a NoSQL-style single-table key-value store), there is some credence one can give to the rumor that some (or most) of the change may have come about from a PHB [1] saying "do this, this way".
[1] PHB (Pointy Haired Boss, Dilbert cartoon reference)
Honestly, why would they? If there is a product that does what they need, why spend resources improving another?
Just because it's open source and they could spend money improving it doesn't mean a company will.
I would hardly say MySQL does what they need without any improvements. They built an entire second platform on top of it.
First: if you think any particular DB platform is a clear winner in the "DB wars", you are naive. There are so many factors involved in configuring the DB, the backend, the frontend, etc. that you can always find a case where the supposedly winning DB is failing, or the supposedly worse DB is performing perfectly fine. And from my experience, you should always use the platform/framework/language that is best for the current project, not the one you madly love. Clearly Postgres wasn't working for Uber. That does not mean it will not work for your project. I have recent experience where a binary file of a programming object works much, much faster than MySQL and solves several other problems. Would I say "use binary files instead of an RDBMS"? Of course not. But in this one case it does wonders. The "tech-vs-tech" wars need to end; they are pointless.
Second: if you cannot set up your blog to withstand an HN spike, then maybe you don't have as much real-world experience with scalability (albeit simple) as you might think. (Hint: a static page cache behind a CDN will make you almost bulletproof; with Azure, for example, that's dead cheap.)
I wouldn't want my page behind a CDN. CDNs make users much more trackable across sites.
My point isn't that CDNs are bad for everyone. My point is, once more, that most questions are not as simple as they may appear.
How does a CDN do this in a way that a "regular" web deployment wouldn't?
This also comes up when the CDN handles SSL termination.
From what I have seen so far, the only thing Uber is talented at is violating local laws and then throwing sacks of money at the problem to pay fines or whatever. (And inflating their own (bubble) valuation, but probably not many people agree with that.)
Seeing their blogged MySQL → Postgres migration followed by a Postgres → MySQL migration doesn't give me the impression that they are very talented (they still might be, but so far no data has shown me this). Talented would be to foresee these issues and avoid encountering them at all. At least, that would be my definition of very talented.
I clearly indicated that their recent changing of DB software twice, to avoid issues that experts could have solved, is an indication that in my eyes they are not very talented.
And why do you assume that I'm a PostgreSQL expert who could answer that question?
Others did answer it, though, so if you had bothered to read the thousands of other comments on the matter, you would not have needed to ask:
PostgreSQL Heap-Only Tuples (HOT), from: http://use-the-index-luke.com/blog/2016-07-29/on-ubers-choice-of-databases
It's not terribly nice, but the comment is true, and relevant.
I run varnish even on silly joke websites I set up with no traffic and literal kilobits worth of only static content, it's just so easy to do and it's a good habit to have.
No it doesn't. Stop projecting.
So the tech-X versus tech-Y question is still very relevant.
There can be clear winners in such discussions, though this changes over the years. Many people are now concluding that Postgres is a winner at the present time and usage is expanding significantly. The people I meet aren't madly in love with Postgres, they make rational choices with the best information they have. Uber posted their information in the hope others would benefit. I think they have and I thank them for it.
(Whether you forgive me or not, I don't manage our blog site, but I guess there'll be some discussions. ;-) )
> 2ndQuadrant is working on highly efficient upgrades from earlier major releases, starting with 9.1 → 9.5/9.6.
I hadn't heard of that before. Does anybody know more about this? I'm currently babysitting a 9.1 deployment which we desperately want to get upgraded. The amount of downtime this can tolerate is very limited, and I am currently tasked with coming up with a plan. It's going to get hairy. If such a tool is really on its way, I could make a case for holding off on the upgrade for a few more months and save quite a bit of work.
I have upgraded a ~2TB database from 9.0 all the way to 9.5 over the years.
Also: you don't need to be offline to copy the data directory. Check `pg_start_backup` or `pg_basebackup` (which calls the former).
Regarding your second point, I meant copying the data directory as in a 'cp' command, or rsync if you will. The functions you mentioned are only useful when doing a dump, aren't they? And recovering from an upgrade problem using a dump is way slower than just starting the previous version on the backed-up data directory.
Yes. That's not possible. But if I announce the downtime, bring master and slave down, migrate the slave and run our test-suite, migrate the master, run the test suite again and bring the site back up, then I know whether the migration worked.
If the migration on the slave fails, well, then I can figure out where the problem lies and just bring master back.
If the migration on master fails, but works on slave, then I can bring slave up as the new master.
No matter what, there's always one working copy and the downtime is limited to two `pg_upgrade -k` runs (which is measured in minutes).
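For the curious, a minimal sketch of that plan, assuming a 9.1 → 9.5 upgrade with `pg_upgrade -k`. All paths and the test-suite command are placeholders, and the promote step reflects that `pg_upgrade` needs a cluster that is not in recovery; your setup may differ.

```bash
# Downtime begins: stop the master, then promote and cleanly stop the standby
# (pg_upgrade refuses to run against a cluster that is still in recovery).
/usr/lib/postgresql/9.1/bin/pg_ctl -D /srv/pg91/master stop
/usr/lib/postgresql/9.1/bin/pg_ctl -D /srv/pg91/standby promote
/usr/lib/postgresql/9.1/bin/pg_ctl -D /srv/pg91/standby stop

# Prepare an empty 9.5 cluster, then upgrade the standby copy in link mode
# (-k), which hard-links data files instead of copying them.
/usr/lib/postgresql/9.5/bin/initdb -D /srv/pg95/standby
pg_upgrade -k \
  -b /usr/lib/postgresql/9.1/bin -B /usr/lib/postgresql/9.5/bin \
  -d /srv/pg91/standby -D /srv/pg95/standby

# Start the upgraded copy and run the application test suite against it.
/usr/lib/postgresql/9.5/bin/pg_ctl -D /srv/pg95/standby start
./run-test-suite.sh   # placeholder for your own checks

# Only if that succeeds: repeat the upgrade on the master, re-run the tests,
# and bring the site back up. Whichever step fails, one working copy remains.
```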
> Regarding your second point, I meant copying the data directory as in a 'cp' command. Or rsync if you will.
Yes. You execute `select pg_start_backup()` to tell the server that you're now going to run cp or rsync and to thus keep the data files in a consistent state. Once you have finished cp/rsync, you execute `select pg_stop_backup()` to put the server back in the original mode.
This works while the server is running.
If you don't want the hassle of executing these commands, you can also invoke the command-line tool `pg_basebackup` which does all of this for you.
(See also https://www.postgresql.org/docs/9.5/static/continuous-archiving.html#BACKUP-STANDALONE)
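To make that concrete, a minimal sketch of both variants, with placeholder paths and a made-up backup label. Note that to restore from the rsync'ed copy you also need the WAL generated while the backup ran, as the linked docs explain.

```bash
# Variant 1: bracket a plain cp/rsync of the data directory with the
# backup-mode functions, while the server keeps running.
psql -c "SELECT pg_start_backup('base_backup', true);"
rsync -a /var/lib/postgresql/9.5/main/ /backups/base/
psql -c "SELECT pg_stop_backup();"

# Variant 2: let pg_basebackup do the bracketing and the copying for you,
# including the WAL needed to make the copy consistent (-X fetch).
pg_basebackup -D /backups/base -X fetch -P
```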
;)
Making "zero downtime" upgrades possible is part of the whole logical replication effort - both within the community and in 2ndQuadrant in particular.
Petr Jelinek actually described how to do that using UDR (uni-directional logical replication) in 2014:
https://wiki.postgresql.org/images/a/a8/Udr-pgconf.pdf
There might be a newer talk somewhere, I'm sure he spoke about it on several events.
"Companies" are taking a whole range of approaches to storing data. With a combination of NoSQL, SQL and Filesystem e.g. HDFS and everything else in between. Cassandra in particular is killing it right now under the stewardship of Datastax which is why they've grown from 1 person up to 400+ employees.
Also, there is something that stood out to me:
> This point is correct; PostgreSQL indexes currently use a direct pointer between the index entry and the heap tuple version. InnoDB secondary indexes are "indirect indexes" in that they do not refer to the heap tuple version directly; they contain the value of the Primary Key (PK) of the tuple.
That's true, but the article doesn't make explicit that the PK in InnoDB is a clustered index, and that there are other optimizations, like adaptive hashing, to make read queries faster.

It is interesting/ironic to see the article complain that "those limitations were actually true in the distant past of 5-10 years ago, so that leaves us with the impression of comparing MySQL as it is now with PostgreSQL as it was a decade ago." In the MySQL world, we very, very frequently see the opposite: Postgres fans bashing MySQL for things that haven't been true in 10-15 years, as well as things that simply have never been true. It certainly is frustrating, just like what the author is experiencing!
Having a favorite/preferred database is fine, but I don't understand all the extreme views -- why do so few of these articles take the view that Postgres is a better fit for some workloads, and MySQL/InnoDB is a better fit some other workloads?
Or even just an acknowledgement that the authors of these articles rarely, if ever, have a comparable amount of expertise in both databases -- which would be necessary to make a fair comparison. Yes, Uber's original article clearly shares this same problem, but at least they seem to acknowledge it more clearly than the author of this response article. Take the section on replication comparison, for example: the author is describing logical replication support in Postgres even though it's currently a third-party addon. Cool, but MySQL has all sorts of third-party replication systems too. Alibaba has implemented physical replication in MySQL. And meanwhile even in MySQL core, there are two different types of logical replication -- there's no restriction to only use statement-based logical replication as this article implies.
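Going back to the quoted point about indirect indexes, here is a hypothetical sketch of what it means in practice; the database, table, and column names are all made up.

```bash
# In InnoDB the table itself is a B-tree ordered by the primary key (a
# clustered index), and secondary-index leaves store PK values rather than
# row pointers, so a secondary-index lookup is two B-tree descents. In
# PostgreSQL the index entry points at the heap tuple directly.
mysql exampledb <<'SQL'
CREATE TABLE rides (
  id         BIGINT NOT NULL,
  driver_id  BIGINT NOT NULL,
  created_at DATETIME NOT NULL,
  PRIMARY KEY (id),             -- rows are physically clustered on this key
  KEY idx_driver (driver_id)    -- leaf entries hold (driver_id, id)
) ENGINE=InnoDB;

-- Lookup path: idx_driver -> id value -> clustered index -> row.
EXPLAIN SELECT * FROM rides WHERE driver_id = 42;
SQL
```

The flip side is that those secondary index entries stay valid when a row version moves or is rewritten, which is where the HOT discussion linked earlier in the thread comes in.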