it happens everywhere.
> [...] This design difference means that the MySQL replication binary log is significantly more compact than the PostgreSQL WAL stream.
Doesn't sound like what you described at all.
on-disk format/write amplification: > For tables with a large number of secondary indexes, these superfluous steps can cause enormous inefficiencies. For instance, if we have a table with a dozen indexes defined on it, an update to a field that is only covered by a single index must be propagated into all 12 indexes to reflect the ctid for the new row.
How wide is their data? Depending on the answer to this, it could be that they've over-indexed, have a poor indexing strategy, or are reacting to the poor queries generated by an ORM (not sure if they use one, or if they hand-code their own SQL).
data corruption: everyone has bugs. https://bugs.mysql.com/search.php?search_for=&status=Active&severity=1&limit=10&order_by=&cmd=display&direction=ASC&os=0&phpver=&bug_age=0 Add in MySQL's tendency to loosely adhere to the SQL standard and there are many ways you can actually corrupt your own data.
i'm not here to debate whether postgres is better than mysql. i'm just saying that it seems like a lot of research went into justifying a switch. who knows, maybe that research could have been spent optimizing their current environment.
Better than telling them that they are wrong while not only not pulling even one line from their article, but misattributing it to be something very different from what it is.
In what world can the response to the concerns and analysis in the article ever be: "data corruption: everyone has bugs"...
MySQL handles it differently than Postgres, and gives them better performance for their purpose (based on their experience/tests). They were explaining it in the terms that MySQL and Postgres themselves use. If those are buzzwords, then MySQL and Postgres are both built out of buzzwords?
Their explanation is not perfect (for me: why does their data model need massive updates?). But I wouldn't write it off as buzzwords, or as dismissing Postgres just because of data corruption. There are a lot of other things they were trying to explain there.
Combining that fact with the way Postgres and InnoDB handle secondary indexes means WAL-shipping not only ships the entire new row, but also all the disk blocks of all the secondary index updates, unlike MySQL's row-based replication.
This is actually something I greatly like and want to see in Postgres. Perhaps the decision to give secondary indexes direct links to rows was taken because at that time (before replication) it was less read-heavy and the write amplification wasn't such a concern. But now, when replication is a common requirement (and network IO is not as fast as disk IO), it makes a lot of sense to switch to a way of storing secondary indexes that allows a single point of update.
(And even microservices are established practices -- heck Amazon was built on them more than a decade ago).
The amount of research into the on-disk internals of both PostgreSQL and MySQL is a lot more effort than I would probably have spent (granted, I don't have a team of highly paid devs at my disposal, but still, I've seen technical decisions made on the basis of Google Trends...).
The "laundry list of buzzwords" is a detailed, up to engineering standards, analysis of the issues they faced, with example cases and explanatory follow-up.
As far away from a "laundry list of buzzwords" as you could possibly get. Not to mention they did the very opposite of switching willy nilly to some new "shiny toy".
In other words, the comment is not even wrong.
Disclaimer, I'm a hardcore Postgres user myself, but I also keep tabs on the other tools.
[0] https://github.com/facebook/mysql-5.6
And yes, MySQL has a lot of problems, but it seems to be better for "seat of the pants" usage (that is, when you don't need too many SQL capabilities and want to use it mostly as a data store)
It really put me off ever wanting to even consider working at Facebook.
Digging around I can't find either the blog or video :(
The problem with making absolute statements about what FB does is that it seems to be structured as a large number of mostly independent teams that are free to choose and develop whatever technology they need to solve their problems. There's actually a talk about scaling by a FB dev (don't have the link right now) that uses this as an example for what it means when they talk about "Facebook scale" (another example is that their monorepo outgrew git so they created their own extension to mercurial instead).
It's entirely possible that the main use of MySQL at FB is as a key-value store while at the same time there are small parts of applications using it with plain old queries and joins.
Jokes aside, one thing I've been trying to figure out for a while is the limit at which certain components/systems break down. Basically, something along the lines of "given X records, this operation would take Y time, or would cause Z A B C problems". I've actually had developer friends ask me how fast a simple "SELECT * FROM X WHERE index=?" would be on a million-row table, since they were surprised that some NoSQL DB could do a query on a hundred million rows in a few seconds.
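For what it's worth, the honest answer to "how fast" is to measure rather than guess; on Postgres something like this (table and column names made up here) shows both the chosen plan and the actual execution time:

    -- runs the query and reports the real timing, not just the estimated plan
    EXPLAIN ANALYZE
    SELECT * FROM events WHERE user_id = 12345;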
I guess that's part of why you only learned how to scale after having done it once.
The tone seems to suggest this is fast; a simple index query in an RDBMS over even a hundred million rows will take milliseconds on even a weak computer.
Even better, if you only need one field from the record, and it's part of a compound index, you can frequently return the data from just the indexes; no disk seeks required. Small tip with InnoDB on MySQL - any non-primary key index is automatically a compound index with the primary key.
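A rough sketch of that tip, with made-up table and column names: in InnoDB every secondary index implicitly carries the primary key, so a query that only needs indexed columns plus the PK never touches the clustered index at all:

    CREATE TABLE users (
      id         BIGINT PRIMARY KEY,
      email      VARCHAR(255),
      created_at DATETIME,
      KEY idx_created (created_at)   -- physically stored as (created_at, id)
    ) ENGINE=InnoDB;

    -- EXPLAIN shows "Using index": answered from idx_created alone
    SELECT id FROM users WHERE created_at >= '2016-01-01';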
Like Illniyar, that's also what I interpreted GP to mean when they said:
> "surprised that some NoSQL DB could do a query on hundred million rows in a few seconds"
which is not fast at all. However, if it actually did something on a hundred million rows, that would be impressive. Not exactly sure which they meant, though.
This was running on an old Core i5 / 16gb of ram with a lot of junk running on Windows.
It would probably have been faster on a production environment.
We too often rely on a buzz-word heuristic and that's how you end up with dozens of random technologies that are harder to maintain and don't necessarily solve any of your problems. This method is good, because it shows that when you understand the problem the right way, you can find the right solution, even if by popularity it looks like a "step backwards"
Massive Kudos.
Sadly, they also mixed in issues which are easily solved, or in a particularly egregious case, where they just complain about a bug. As though MySQL never had a bug. That was silly.
My read of it was: Postgres annoyed us a few times, and we got fed up with it, so now something different will annoy us. Please look forward to our blog post in 4 years about how we're using X instead of MySQL/Schemaless because those were also imperfect.
Doing so deliberately is what's commendable.
It seems to me that Cassandra (C*) may have been a better match to their needs... yes it means more administrative and application tuning, but would definitely scale to meet their needs. RethinkDB would also likely be a better match to what they are wanting to do.
That said, I've been holding out for some time for PostgreSQL's in-the-box replication story to take root and mature. There are definitely more mature features and solutions for sharding and replication around MySQL; I just tend to find MySQL to be brittle, and every time I've worked with it, I have at least a half dozen WTF moments... from binary data handling/indexing, foreign key syntax, ANSI spec deviations, and others. PostgreSQL has some great features, and once the replication issues settle in, it will become the default choice for a lot of projects in a lot of organizations. Though MySQL/Maria and even MS-SQL are currently better options for many SQL use cases.
A multi-pronged approach that might involve multiple stakeholders just doesn't seem like their way of doing things.
Then they wouldn't get to build cool new stuff and write blog posts about how they had to build cool new stuff because OMG UBER SCALE.
It would randomly die, but that was always either my fault or the application's fault, never pgsql itself.
The lack of master-master seems to be the big thing everyone mentions, but PostgresXL is currently in a usable-in-production state.
Compare to MongoDB, RethinkDB, MS-SQL and others, where the tooling for replication comes in the box. Yes, two of the examples are "no-sql", but even the MySQL replication is in the box and supported as such.
https://www.postgresql.org/docs/current/static/high-availability.html
I'm not sure what more you can ask from documentation.
Does MySQL document Vitess, Galera, MaxScale, etc...?
id first last birth_year
1 Blaise Pascal 1623
2 Gottfried Leibniz 1646
3 Emmy Noether 1882
4 Muhammad al-Khwārizmī 780
5 Alan Turing 1912
6 Srinivasa Ramanujan 1887
7 Ada Lovelace 1815
8 Henri Poincaré 1854
Make no mistake, they are all great mathematicians and deserve a place in math pantheon.
id first last birth_year
1 Blaise Pascal 1623
2 Gottfried Leibniz 1646
3 Emmy Noether 1882
4 Muhammad al-Khw�rizmī 780
5 Alan Turing 1912
6 Srinivasa Ramanujan 1887
7 Ada Lovelace 1815
8 Henri Poincaré 1854
(I know, I know. It is possible to configure MySQL encodings correctly. And given that they've put a lot of engineering thought into choosing MySQL, they certainly have.)

In MySQL there are no templates, and you can change the encoding of a table at any time using an ALTER TABLE statement.
And in cloud IDEs like c9.io you cannot use unicode collations in Postgres because they need to be installed separately and they are not installed.
Ugh.
Changing the encoding of a table (or a database after creation) is usually not a wise thing to do, since the db engine is not going to go over all the data and convert it anyway; if your data is corrupted now, it will remain corrupted. If one simply needs to tell the DB to treat the data differently, without worrying about data conversion, the encoding of a database is stored in `pg_database` and can be changed with an UPDATE query.
And this is all from the official documentation. I'm not a Postgres expert.
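For the record, the relabel described above looks roughly like this (superuser only, and genuinely risky: it only changes how existing bytes are interpreted, it converts nothing):

    -- tell Postgres to treat the database's existing bytes as UTF8
    UPDATE pg_database
    SET encoding = pg_char_to_encoding('UTF8')
    WHERE datname = 'mydb';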
That was No Fun™ to solve.
max_connections sets exactly that: the maximum number of client
connections allowed. This is very important to some of the
below parameters (particularly work_mem) because there are some
memory resources that are or can be allocated on a per-client
basis, so the maximum number of clients suggests the maximum
possible memory use. Generally, PostgreSQL on good hardware can
support a few hundred connections. If you want to have
thousands instead, you should consider using connection pooling
software to reduce the connection overhead.
Replication, Clustering, and Connection Pooling [1]
[0] https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Server
[1] https://wiki.postgresql.org/wiki/Replication,_Clustering,_and_Connection_Pooling
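A quick way to see how close a cluster actually is to that ceiling, using standard catalog views:

    SHOW max_connections;

    -- how the open connections break down by state
    SELECT state, count(*)
    FROM pg_stat_activity
    GROUP BY state
    ORDER BY count(*) DESC;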
This means the PG has explicit IPC overhead, vs the quick and seductive path (to the dark side?) of simply sharing memory between threads. Safety vs speed.
Can you back up your statements with some facts? I have found the Postgres community to be much better, particularly now that Oracle has taken ownership of MySQL.
It won't specifically address the problems posted here, but will solve other ones like trying to scale a system beyond a single node.
FWIW, PostgreSQL's mitigation for the write amplification problem mentioned here is "heap-only tuples" (HOT). It is highly likely that Uber could have prevented a lot of this pain by learning more about this feature--which is notably not mentioned even once in this entire article, which to me completely undermines the feel they are trying to achieve of "we really really really know what we are doing"--and tuning their table density parameters to take maximal advantage.
Instead of "why we moved off of X" it would be much better to see "we are considering moving off of X: anyone know what we are doing wrong?". Sure, maybe Uber knows about HOT, and did extensive analysis to determine it wasn't a solution to their problem; but it frankly does not seem at all to be the case. Anyone who knows a lot about PostgreSQL would have explained HOT to them, so they probably didn't even consult with PostgreSQL experts behind the scenes.
Sadly, "we are looking for help with a complex technical challenge" is something the market punishes under the premise that everyone has to be entirely self-sufficient gods of their technology stack :(. The only time I remember ever having seen a company reach out to the community for help was reddit (with respect to something involving PostgreSQL or Cassandra... I don't remembee the specific issue).
0: https://wiki.postgresql.org/wiki/Index-only_scans#Interaction_with_HOT
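A minimal sketch of the tuning being referred to, with hypothetical table/column names. Lowering fillfactor leaves free space on each heap page so an update that doesn't touch any indexed column can stay on the same page as a HOT update and skip all index writes:

    -- keep 20% of every heap page free for same-page (HOT) updates
    ALTER TABLE trips SET (fillfactor = 80);

    -- assuming 'status' is not indexed, this update can then be HOT:
    -- no new index entries are written for the new row version
    UPDATE trips SET status = 'completed' WHERE id = 42;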
Accordingly, using pgbouncer to do connection pooling with
Postgres has been generally successful for us. However, we have
had occasional application bugs in our backend services that
caused them to open more active connections (usually “idle in
transaction” connections) than the services ought to be using,
and these bugs have caused extended downtimes for us.
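Those runaway connections are at least easy to spot (and kill) from the database side, with something along these lines:

    SELECT pid, now() - xact_start AS xact_age, query
    FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY xact_age DESC;

    -- if one is clearly stuck: SELECT pg_terminate_backend(<pid>);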
We used a 2 column InnoDB-backed table for all of our data storage, massively sharded, and run in a 3-host master-slave-slave configuration.
At that time EC2 would routinely kill hosts without the courtesy of a poke via ACPI and as such we became very good at quickly recovering shards. In a nutshell this mechanism was to have the new host contact a backup slave, perform an lvm snap, pipe the compressed snap over a TCP connection, unroll it and carry on, letting replication take up the delta.
That enabled us to not only manage the 10 million or so daily active users of that title, but was also the platform under the 12 or so additional titles that studio had.
We had lots and lots of very simple things and failures were contained.
I think at the time we were the 3rd-largest consumer of EC2 after Netflix and "another" outfit I never learned the name of. EA being what it was, however, we were never permitted to open source a lot of the cool stuff Netflix and ourselves seemed to develop in parallel.
Things I do remember:
- We used ephemeral volumes for all data stores. This was before provisioned IOPS, and EBS was flaky as heck back then. I can't remember the disk layout, although we did experiment a great deal.
- We took great pains to ensure there was enough space to make the snapshot (IIRC it was a telemetry/monitoring item)
- The pipe scripts were essentially "netcat".
The best I can offer is this talk: http://vimeo.com/57861199
Opening and closing connections is very slow and expensive, so it's almost always better to keep these 300 connections open than to try to be fancy.
You COULD try to, say, give each server only 3 connections and make them share, which cuts you to 90 connections... but then you have to try to share state between different python processes (not easy), and will often end up with deadlocks and sync overhead.
It's a little outdated but I've found it largely holds. That is, you'd want a mighty box for 1000 concurrent connections.
That said, connection queuing works really well with postgres such that throughput is frequently better at lower connection counts than at higher ones.
I have no experience with mysql that is less than 15 years out of date.
Numbers vary, but that's the principle behind using more connections than cores.
Unfortunately, MySQL has no good equivalent of PgBouncer on the Postgres side, which greatly mitigates the issue of cross-process connection pooling. (I'm actually working on one, but for Uber to use it they'd have to switch back to SQLAlchemy >:) )
Postgres's MVCC is superior (can rollback DDL, can add indexes online, can have open read transactions for a VERY long time without impacting other parts of the system)
Postgres supports many types of indexes, not just b-tree. One thing it doesn't have is clustered b-tree indexes... which is really what MySQL does that makes it somewhat "better." I wonder how Uber adds an index to a table that already has 1B+ rows in it with mysql?
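For the "add indexes online" point above, the Postgres form (hypothetical table/column names) is:

    -- no long write lock on the table; takes longer than a plain CREATE INDEX,
    -- and leaves an INVALID index to drop and retry if it fails partway
    CREATE INDEX CONCURRENTLY idx_trips_rider ON trips (rider_id);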
Postgres's WAL-level replication is a better guarantee of actually replicating the data correctly. I cannot tell you how many times I've had to tell my boss that the "mysql replicas might be slightly out of sync with the master" because of various replication issues. The way it handles triggers and scheduled events alone is garbage and can very easily break replication and/or silently cause inconsistency.
As for data corruption: if there is a bug that causes corruption, then there is a bug. I don't think that is a fundamental design flaw as implied in this article. You shouldn't rely on a half-assed replication design to accidentally save you from a data corruption bug. There are many downsides to the design MySQL has that are simply not listed here.
I have been both a professional MySQL administrator as well as Postgresql (as well as SQL Server and many NoSQL engines). Many of these Postgres issues are only issues at crazy huge scale, and I would say at that point you probably want to move away from relational anyway. MySQL has its own very large set of problems at scale as well.
It sounds like Uber is using MySQL as just a data bucket with primary keys ("Schemaless") which is good -- because you can't alter tables to save your life with MySQL.
At the end of the day, each developer/business needs to use what works for them, but I would really shy away from pointing to this article as a linchpin in the "MySQL vs. Postgres" war (if there even is such a thing.)
Percona also recommends using this tool to safely perform alters for any Galera-based replication product (Percona XtraDB Cluster, MariaDB Galera Cluster, etc.)
This space is far from done, until we can change data structures on the fly...we're going to find there to be a constant struggle between schema and schema-less and neither side will be right or wrong.
Solution: Quantum states; We suspect every possible scenario is happening already, so we should just migrate our database to the appropriate quantum state, where the data is in the structure we desire. Sounds insane, it probably is.
I know first hand shops like FB use a similar method in production.
Also, big note, the whole point of "schemaless" (And other things, like fb's friend feed architecture) is that you don't make schema changes.
We allocate our own IDs, and disable foreign keys for bulk loading.
It only failed catastrophically, causing a critical production incident, twice.
Each time took dozens of engineer hours to vet in advance, thus costing thousands of dollars.
Online DDL changes and indexing with pg cost us... basically nothing, and never caused downtime.
MySQL and Pg each have their place... but if you want to modify large tables, MySQL is almost certainly the wrong tool for the job.
Other than the "schemaless" layer, the issues mentioned are fairly common.
I find their whole writeup to be terribly myopic. When they started their service, Postgres was almost certainly the right choice for what they were building and their MySQL setup was not. Now Postgres is less effective for them.
These kind of tech switches are _inevitable_ if you're making the right choices for your organization.
This strikes me as very similar to the article where Twitter ditched Rails. The focus should be inward... how they chose a tool that didn't support their use case and how they solved the problem, but instead they're about the flaws (that aren't really flaws) of the tool.
It's always the craftsman.
Been there, really no fun.
Also, MySQL* is getting LSM-Tree support lately, which makes high performance data ingestion combined with OLTP workload quite feasible.
* https://github.com/facebook/mysql-5.6/tree/webscalesql-5.6.27.75/storage/rocksdb
"As long as you can shard "
Not sure how sharding helps with 1B+ tables when adding indices, care to share?
And then do your joins, in your application code.
[1] https://www.percona.com/doc/percona-toolkit/2.1/pt-online-schema-change.html
Saved my life a few time :)
Plus, in 5.7, there are a fair number of online-DDL changes, and adding indexes is (usually) one of them.
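A sketch of what that looks like in 5.6/5.7, with made-up names; explicitly asking for INPLACE/LOCK=NONE makes MySQL error out immediately if the operation can't actually be done online:

    ALTER TABLE trips
      ADD INDEX idx_rider (rider_id),
      ALGORITHM=INPLACE, LOCK=NONE;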
They have a couple posts about "Schemaless", but I still don't understand why they used MySQL as the data store instead of something like Cassandra. ( https://eng.uber.com/schemaless-part-one/ ) From that post it looks like they basically built a no-sql database on top of a relational database.
The only reason given was operational trust ("If we get paged at 3 am when the datastore is not answering queries and takes down the business, would we have the operational knowledge to quickly fix it?"). The project took nearly a year to roll out, and in that time the operational knowledge could surely be trained, hired, or contracted.
They don't explicitly answer the question "Why didn't you use InnoDB/WiredTiger/etc. for your dataplane?", but you get the idea that they were very happy with the specific characteristics of MySQL for their use case and so they built on top of it. It also sounds like they had some deadlines (specifically, the death of their datastore) that they had to meet :).
There are really not a large number of options here anymore with the departure of FoundationDB from the market. CockroachDB might be an option in a few years, though I'm still confused why they are moving towards a SQL-ish vs key-value interface...
Pissed me off so much. Only thing close to Google's F0 RDBMS on the market, at a reasonable rate, and the beginning of a good offer to enterprises. Then, "poof!" It's a good example of why I tell companies to not put anything critical into something from a startup. If they do, better have a synchronized, backup option tested and ready to go.
"why they are moving towards a SQL-ish vs key-value interface..."
That's easy: most databases and buyers use SQL. Key-value is preferred by startups and the non-critical side projects at big companies that you see here a lot, but those aren't representative of most of the market. You need first-rate SQL support. I think EnterpriseDB shows that it's also a good idea to clone a market leader's features onto an alternative database.
Did you mean F1 (instead of F0)?
A great way to get familiar with something is to be the folks who write it. It's also much more fun to design and implement something new than to just learn some other fella's software. I'm guilty of this myself.
But I've started to remind myself that "somebody else has had this problem" and there's probably a good enough solution out there already.
Put another way, is what you are trying to do really so novel? In the case of Uber's infrastructure, you would have to talk for a while to convince me that they really, really need something not off the shelf.
You can't use Cassandra if you need atomic increments (yes, they're included, but painfully slow due to the several round trips required to satisfy Paxos).
Also there are no transaction rollbacks (atomic batches always go one way - forward).
You may hit GC pauses if the JVM is not tuned properly.
If the use case involves lots of deletes then tombstone-related issues need to be considered.
Sure, Postgres has JSONB (comparable to MongoDB) and a key-value store in hstore, but they do well only as long as they fit on one machine. The moment you hit that scale you have to realize that there are tools specifically built for this.
There is the phoenix project https://phoenix.apache.org/ that salesforce is using for scaling. Definitely worth a try.
But again, the title of the article sounded nothing more than a rant to me.
It seems that the replication arguments in the post are pretty weak (migrating to 9.4+ to allow heterogeneous versions would still have less downtime than migrating to MySQL), but clustered indexes seem like a huge win for their use case of updating only a subset of each row. Whether that is reason enough to totally ditch postgres, I don't know.
Note that the master is still available for read and write during this. Replicas will lag though.
> Updates and writes to the table that begin after the ALTER TABLE operation begins are stalled until the new table is ready, then are automatically redirected to the new table without any failed updates.
While the master may be technically up for writes during this period, it's not much of a goer if your table is large and has any write traffic at all, as anything writing will stall for what may be an extended period.
Sure this is sometimes boring additional work - eg you don't delete 1M records with one statement, you break it into 1,000 statements each deleting 1,000 records.
Sucks, but keeps your db and your users happy.
BTW this is true for PostgreSQL and MySQL and Oracle and every db that allows concurrent DML.
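For anyone who hasn't done it, the batching looks roughly like this (made-up table/column names; loop until zero rows are affected):

    -- MySQL: DELETE supports LIMIT directly
    DELETE FROM audit_log WHERE created_at < '2016-01-01' LIMIT 1000;

    -- PostgreSQL: no LIMIT on DELETE, so bound it with a subquery
    DELETE FROM audit_log
    WHERE ctid IN (
      SELECT ctid FROM audit_log
      WHERE created_at < '2016-01-01'
      LIMIT 1000
    );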
For example, if you have a table with 1M user records, and you run a migration to add a column in MySQL, then any updates to the table will be stalled while the table is rewritten to add the extra column (which may take a while). This is independent of how many records it touches - even if the transaction only touched 1 record and would take 10ms to execute, if the migration takes 10 minutes it may be stalled for up to 10 minutes.
In Postgres you can add a nullable column, and the table will only be locked for a very short amount of time, independent of the size of the table.
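That is, something like this (hypothetical names) returns almost instantly on any table size, because no rows are rewritten; it just needs a brief lock to update the catalog:

    -- nullable column, no default: a catalog change only, no table rewrite
    ALTER TABLE users ADD COLUMN referral_code text;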
Unless you're on a replica.
My summary of the arguments against Postgres and some basic thoughts on each:
1. Writes are more expensive because all secondary indexes must be updated with a new physical location.
This may be true, but the MySQL model of using primary keys from secondary indexes will mean that reads are inherently expensive. They even mention this:
> This design means that InnoDB is at a slight disadvantage to Postgres when doing a secondary key lookup, since two indexes must be searched with InnoDB compared to just one for Postgres.
So it seems like a classic read vs. write trade-off.
I'm also a little skeptical of any performance claims that don't include any numbers. It's possible that efficient coding in Postgres makes this much more of a wash in terms of performance than claimed here.
2. Replication is less efficient because it's sending a lot of physical information out along the stream.
This is quite true, but IMO unlikely to be a major issue for most users unless they're dealing with a huge amount of data and streaming it over a slow connection (i.e. across the continent, like Uber's disaster recovery center).
3. Data corruption from a bug found in 9.2.
Certainly a bad situation, but IMO not really a valid criticism of the current situation. 9.2 is way behind at this point, and there's not much to say that they wouldn't have encountered a similar bug or something worse in MySQL in all that time, especially operating at scale.
To give a counter-anecdote, I operated Postgres at scale for a long time across many versions starting at 9.1 and was lucky enough to have never once encountered a bug with data corruption.
4. Postgres' MVCC model makes it easy for replicas to accidentally fall behind their master.
This one is valid (and annoying), but there are very good reasons for it, and you have some switches to control the behavior based on whether you value transactions finishing on followers or prompt replication more highly.
5. Upgrades are difficult because the WAL stream works at a physical level and is not compatible between database versions.
Again, this is valid, but statement-based replication is a scary idea. Row-level replication is more interesting and probably something that Postgres should have, though.
Some good news is that Postgres is getting closer to logical WAL streaming, which should make in-place upgrades possible.
With MySQL the indexes are usually kept in memory so there should not be noticeable overhead.
* Postgres Professional roadmap ( Pluggable storages, Multimaster cluster with sharding, Effective partitioning, Adaptive query planning, Page-level data compression, Connection pooling, Native querying for jsonb with indexing support, ....) https://wiki.postgresql.org/wiki/Postgres_Professional_roadmap
* EnterpriseDB database server roadmap ( Parallelism, Replication, Vertical Scalability, Performance ) https://wiki.postgresql.org/wiki/EnterpriseDB_database_server_roadmap
====
And "Scalable PostgreSQL for real-time workloads https://www.citusdata.com " --> https://github.com/citusdata/citus
https://github.com/citusdata/citus/blob/release-5.1/CHANGELOG.md
"citus v5.0.0 (March 24, 2016)
- Public release under AGPLv3
- PostgreSQL extension compatible with PostgreSQL 9.5 and 9.4"
I'm not sure this post is illustrative of any generally applicable considerations (re: the title) in the choice of Postgresql vs MySQL, since Uber seems to no longer be using a relational model for most of their data and is using MySQL effectively as a key-value store.
> say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.
I have to disagree very seriously here. ORMs make a lot of things easy, and you can get away with building stuff for a while without understanding the underlying database or SQL, but only up to a certain scale (I'd say more like medium scale, definitely not large or Uber level). If you have engineers writing code that interacts with a database without understanding transactional semantics, the engineer in question, not the database, is the problem.
> We started out with Postgres 9.1 and successfully completed the upgrade process to move to Postgres 9.2. However, the process took so many hours that we couldn’t afford to do the process again.
There seem to be ways [0][1] to do online upgrades with Postgres (before logical decoding in 9.4), although I haven't personally used them. Not sure if they explored these options at Uber or not?
[0] https://github.com/markokr/skytools [1] http://slony.info/
In spirit I agree with you; it's the engineer's fault for not reading the documentation of their ORM or equivalent.
But in these big ships with hundreds of programmers, leaving transactions open in hibernate is a daily occurrence somewhere.
Usually caught before production, but happens oh-so-frequently, that anything that exacerbates the pain from this would be seen in a negative light for sure.
We have MySQL replication across the country where I work and I certainly wouldn't characterize it as robust; it fails every 3-6 months. MySQL replication is certainly a lot older and easier to use than Postgres's, but SQL databases are fundamentally CP systems. When you say "This design means that replicas can routinely lag seconds behind master, and therefore it is easy to write code that results in killed transactions" it sounds like you're blaming the way replication was implemented for a physical problem. There is no way to design a replication system such that two highly-consistent databases can achieve perfect availability in the face of real-world networks. A worse protocol can exacerbate the problem, but a better one can't make it go away.
I have never seen corruption with Postgres (unlike MySQL), but I have never tried cross-datacenter replication with it. Apart from that, Postgres generally seems to do much better with consistency than MySQL does, where DDL statements are not transactional, etc. So I am not surprised to hear that their system trips harder on Postgres's more aggressive consistency.
In short, I suspect a more robust solution to their problem is a NoSQL database. On the other hand, it sounds like they want a combination of availability and consistency that will be difficult to get off-the-shelf. I'm glad they found a way to make it work. I wouldn't generally choose Postgres for a scalable system with an aggressive availability constraint--but then again, I wouldn't choose MySQL either, and I generally avoid problems that demand highly scalable, highly available solutions.
Just read about their new schemaless db on their blog, and the first paragraph contains this:
"The basic entity of data is called a cell. It is immutable, and once written, it cannot be overwritten. (In special cases, we can delete old records.) A cell is referenced by a row key, column name, and ref key. A cell’s contents are updated by writing a new version with a higher ref key but same row key and column name."
So, mmm..., not saying that Postgres didn't pose a problem for them, but I think Postgres's db model fits their new db better than MySQL's. They probably had to work really hard to get MySQL to work like Postgres.
Without this issue, it looks like two things needed to be done with Postgres that would have solved their problems: have indexes that point to the primary id, and do logical replication (which they say a plugin solved in 9.4).
Is this a case of "I got burned by something so I won't use it again"?
"Each Schemaless shard is a separate MySQL database, and each MySQL database server contains a set of MySQL databases. Each database contains a MySQL table for the cells (called the entity table) and a MySQL table for each secondary index, along with a set of auxiliary tables."
So... 1 table, with 1 index and manually created and updated secondary index tables.
With this scheme I can only assume postgres will work just as well or better.
- No transactions for DDL changes.
- Oldschool commandline client. auto commit disabled by default. no history.
- Weird sql syntax + semantics. f.e. null == empty string.
- Oracle InstantClient SDK needs an Oracle account to download and is closed source, and is not available as a package for anything except rpm-based distros
- getting it to run with PHP is a major PITA, once again due to the above-mentioned issues
- Holy f..ing cow, why does it translate everything down to ASCII by default instead of returning raw bytes?
- It's expensive as f..k
- Try to run a query using InstantClient with a ; at the end, it will barf
- DID I MENTION IT CANNOT DO A SIMPLE LIMIT AFTER AN ORDER BY?! (at least not until 12.1, which was released in 2013; due to various issues, one of them being the pricing, I have seen multiple orgs running way older versions. But come on, over 30 years with only ROWNUM?!)
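For reference, the pre-12c workaround versus the 12c syntax (table name made up):

    -- pre-12c: wrap the ORDER BY in a subquery, then filter on ROWNUM
    SELECT * FROM (
      SELECT * FROM mathematicians ORDER BY birth_year
    ) WHERE ROWNUM <= 10;

    -- 12.1+ finally has it built in
    SELECT * FROM mathematicians ORDER BY birth_year FETCH FIRST 10 ROWS ONLY;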
> Typically, write amplification refers to a problem with writing data to SSD disks: a small logical update (say, writing a few bytes) becomes a much larger, costlier update when translated to the physical layer.
This is exactly the type of problem solved by the file system layer.
If power is lost between the fs lying to the OS and its subsequently actually writing to disk, the data that the fs lied about is lost.
You don't want that with a DB on top of it.
> Statement-based replication replicates logical SQL statements (e.g., it would literally replicate literal statements such as: UPDATE users SET birth_year=770 WHERE id = 4)
Postgres has that too (using a 3rd party tool, but it's an officially supported tool). We were using it on reddit 10 years ago. It caused a lot of problems. I wouldn't call that an advantage for Mysql.
Honestly, reading this it seems like the summary is: "We don't follow great engineering practices so we need a database more forgiving". Which is fine if that's how you want to run your business, but isn't really the death knell for Postgres.
A specific example:
> This problem might not be apparent to application developers writing code that obscures where transactions start and end. For instance, say a developer has some code that has to email a receipt to a user. Depending on how it’s written, the code may implicitly have a database transaction that’s held open until after the email finishes sending. While it’s always bad form to let your code hold open database transactions while performing unrelated blocking I/O, the reality is that most engineers are not database experts and may not always understand this problem, especially when using an ORM that obscures low-level details like open transactions.
Your developer should understand database transactions. But you should make it easier for them by abstracting it so that they don't have to. And in this particular case, I'd say they shouldn't be using the database to do locking around sending a receipt. It should be put into a queue and that queue should be processed separately, which avoids the transaction problem altogether.
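One way to sketch that queue entirely inside the database (table and column names are invented, and SKIP LOCKED needs Postgres 9.5+): the web request just INSERTs a row inside its own short transaction, and a worker claims batches outside any user-facing transaction:

    CREATE TABLE email_queue (
      id       bigserial PRIMARY KEY,
      user_id  bigint NOT NULL,
      payload  jsonb  NOT NULL,
      sent_at  timestamptz
    );

    -- worker: claim up to 100 unsent receipts and mark them in one statement
    WITH claimed AS (
      SELECT id FROM email_queue
      WHERE sent_at IS NULL
      ORDER BY id
      LIMIT 100
      FOR UPDATE SKIP LOCKED
    )
    UPDATE email_queue q
    SET sent_at = now()
    FROM claimed
    WHERE q.id = claimed.id
    RETURNING q.id, q.user_id, q.payload;

A dedicated queue/broker does the same job, of course; the point is just that the email sending never happens inside the request's transaction.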
Super, duper common issue; you will find this at every large shop at some point in its lifetime, usually around the time of hiring people and expanding extremely fast, and taking on some tech debt.
All functions of extreme scale, hyper growth, and yeah, not following the absolute best practices all the time, but tech debt is like any debt, you get something now, and pay later. If they continue going up and to the right they will be able to afford it.
The one huge advantage of an ORM is the ability to support multiple databases, but that only really works if you can do everything using the ORM. The moment you have a function complex enough that you need to write some SQL, you now need case statements and multiple integration tests for every database your product needs to support.
They remove some boilerplate while adding other boilerplate. In one of my own projects, I just created several files (pgsql.commands, mysql.commands, etc.), a basic set of classes around them, and a base set of commands that would work for all the DBs I wanted to support (so the command files had an inheritance model, albeit only one layer deep).
With all that being said, most ORMs I've used do have explicit transaction support. I know Squirl had a `transaction {}` block you could wrap commands in. Transactions shouldn't be an excuse. They should be off by default and explicitly added around blocks of things that need to be atomic.
> Try enforcing this on teams that use ORMs like hibernate with 500 developers.
I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.
After you have that, then it doesn't even matter on the backend. The models you present to the layer above can have complex and changing relationships to the actual storage -- maybe they contain data which is derived from what's in the database, but transformed after being fetched so that none of their properties actually correspond to a column or a field in a store. In my experience -- having seen the tragedy that is a Rails project gone full ActiveRecord -- this pattern enforces an excellent separation of concerns and constrains a problem which can otherwise grow unboundedly in complexity.
The latter only really works if you've got a primary implementation language and aren't integrating lots of applications / libraries written in different languages communicating with the same database. You need to go down the SOA / distributed RPC / FFI route to integrate different languages, and that has its own complexities.
Personally I prefer treating the DB as a canonical store of facts. Models with a lot of code are pretty suspect. Retrievals can be tuned to just the facts required, ma'am - you don't accidentally drag in the banana + gorilla + whole forest. Doesn't stop you building a higher-level service layer if that's what you need, either. You'll need that when you scale up anyway; chatty models won't work at that level either.
The code doesn't go in the models, it goes in the service/arbitration layer. DB as a store of facts is obvious -- DB as a 1:1 representation of what yet-unforseen features, UIs and platforms will need is a naive and limiting assumption. You have to build your application in a way that future product needs won't be constrained by storage and modeling decisions, which is a tension that Rails apps frequently encounter.
It has several advantages: support for multiple databases (which is useful, sometimes), the ability to serialize/deserialize an object graph in one go, and sometimes a decent query builder which lets you compose queries as opposed to concatenating strings.
Unfortunately, it's also terribly easy to destroy performance by using lazy collections configured the wrong way for your use case and not notice it, to the point where I strongly advocate using query builders instead.
> I realize this is a hyperbole (I hope) because you really shouldn't have 500 developers all on the same monolithic project (unless you're developing like...the Linux kernel). Getting your team to at least try to implement best practices does take some effort, but with things like weekly demos and code reviewed commits, it's do-able.
The problem is that when you come in later, after the codebase grew way too fast and the deadlines are tight, retrofitting best practices onto an existing ball of mud can be daunting.
[1] was the point of the comment you replied to and it provided a very important constraint as well
[2] rarely needed & easy to implement with recursive queries in native SQL
[3] building queries is pretty straight forward; what ORMs usually tend to bring to the table is knowledge about the schema and therefore compile-time error reporting - but this can be done in any language where one has the level of reflection, or, in worst case, by a two-stage compilation process where stage 1 generates code from the schema that can then be used by the compiler for verification in stage 2
As for building queries, SQL is straightforward (mostly). The problem is that it composes very badly. Any time you need to implement something like an advanced search (ie, lookup the same information, but with a number of different search criteria only known at runtime), the best you can do is concatenating partial queries and hoping you got the parentheses count right. Not to mention that a query builder will help with stuff the SQL syntax is miserable for, like IN clauses.
You construct a graph of objects with very few queries, transform it and write it back; with knowledge of the db-schema this will outperform any ORM-based solution and give much greater flexibility.
> As for building queries [...] The problem is that it composes very badly. [...] the best you can do is concatenating partial queries and hoping you got the parentheses count right.
There is an area between full-fledged ORMs and string concatenation. In a purely functional approach queries are composed by composing functions that compile to queries. Postmodern[1][2] is a good, open source example, though in most commercial projects we just built our own wrappers for the tasks and databases at hand. This also allows for much better performing code since for a lot of tasks hooking up the db-reader to the json emitter without going through object instantiation reduces memory and cpu consumption by an order of magnitude (or two), while in the same project, some code benefits from a OOP approach (for which you just use a reader that constructs the objects on the fly).
[1] http://marijnhaverbeke.nl/postmodern/
[2] http://marijnhaverbeke.nl/postmodern/s-sql.html (yes, it does string concatenation at run time, but it does it for you, you don't worry about getting the parenthesis count right)
- SQL "bad" - ORM "good" - NoSql apis "good"
So, the thing is that when faced with SQL, "everyone" tries to "abstract" it more.
Or instead uses NoSql, because it is "easier".
Fine.
Then, if a market demand exists for a better API for the databases, why don't the database guys DO IT?
Yep, I know SQL is supposedly the way, but bear with me: I lived through the FoxPro era, so I know what it is to code "to the metal" against a database without SQL (and it was fine and easy).
If the SQL layer could be optional and a more bare layer were provided (you could copy the dbase ideas!), then the problem of ORM mappers could be solved far more easily (I imagine!).
How this could be?
With an AST API layer, for example. So I could send:
TABLE "Customer" SELECT "*"
So, imagine a kind of LLVM but for databases...
And I'm not talking about NoSql borrowing some relational concepts, but the opposite, and more directly, the API.
I know that most folks (including the ones that downvote!) have no clue what I'm talking about, because almost nobody (recently) has experience talking to a database without SQL. It's as if the only way to talk to a NoSql store were JSON + REST. That badly constrains your mind.
The inner relational model is far richer, and simpler; SQL was not designed to be used by developers and it shows (especially when you have contrived syntax as with CTEs).
And before you ask what the problem with CTEs is: it's exactly that the way to get them is to create a contrived syntax that obscures what is going on. SQL is too restricted, yet too broad, for interfacing.
CTEs aren't required very often - you generally only need them for recursive CTEs, and that's iterative retrieval analogous to pointer-chasing. It's typically a sign of a data model that's poorly suited to relational storage, e.g. trees and graphs.
I have issues with the irregularity of SQL syntax - it deeply annoys me that we have both 'where' and 'having' for doing exactly the same thing, one pre-fold and one post-fold - and I don't like my lack of access to the potential the index has for related fetches (window functions are not pleasant to work with) - but mostly my problems come not from lack of access to the relational nature, but lack of power over poor implementation / query planner / etc. details.
So they make an ORM.
Then it creates worse issues. It's like the database side and the app side are at war (without intention!) with each other, and the database side only concerns itself with the kinds of issues that only happen for specialized tasks.
And the app side still tries to interface with the database, but poorly.
----
Articulating this better, I hope: it's like the movement now to improve the syntax and semantics of JS. Or like asm.js.
So, why not have SQL 2.0 with the fixes everyone has known about for 20 or more years? And libraries/APIs made with the acknowledgement that databases are not primarily used by "end users" but by app developers?
But that is probably like asking for clean C++... ;)
I know this is just asking for the moon (obviously there isn't one "SQL", only several more or less alike implementations), but the trouble with ORMs and databases is more than a decade old and the answer is mostly "yep, we have that problem. Move along".
I agree with the ideas of irregularities. The WHERE vs HAVING doesn't bother me much, and I can't think of a better syntax off hand. My SQL pet peeves are that the SELECT list is at the beginning - I don't know what columns I want exactly until after I type out the JOINs, so I usually type SELECT * and then fill in later. I'd rather put it between WHERE and ORDER BY. I'd also like UPDATE to put the WHERE before the SET, so you don't risk blowing up a ton of data if you forget or miss the WHERE.
Probably a lost cause to get that in, but I would think it wouldn't be too hard to at least support those syntax changes in addition to the current standard.
If you are using them to ease your understanding, chances are that you are doing the job in a very memory intensive way. It also tends to be slower because the generated intermediate results do not have indexes nor good statistics that could help the query planner to be efficient. Essentially, you are taking upon yourself the query planner job and assuming you'll do it better yourself.
Sure it was. Who else would it have been designed to be used by?
> Sure it was. Who else would it have been designed to be used by?
More accurately than the grandparent: SQL wasn't designed exclusively to be used by developers. It was designed to be accessible to analysts (domain-focused analysts, not systems analysts.)
Most of the time, the problem is to "embed" a foreign language (SQL) into an existing one (Python, Ruby, etc.), but you can do it in a proper way such as LINQ, then you have type-checking etc.
For example, I hate it so much when I have to read the MySQL documentation just to find out what went wrong when I used ORDER BY and the optimization didn't kick in. I mean, if I have to keep in my head the whole optimization mechanism of the DB engine and all the steps of the filesort algorithm going on behind the scenes just to write a single wretched SQL statement, then what's the point of SQL in the first place?
And this was just one example. The truth is that the SQL abstraction is leaking. Most of the commands don't encapsulate anything at all.
Landing somewhere in between, getting things done not perfect but shippable, and doing it as fast as possible, is par for the course IME. The best code is always written the 2nd or 3rd time, never the first.
Especially in the case of hyper growth, worrying that your competitors like Lyft or Postmates or Amazon might get something pivotal out first.
People have to learn somewhere, usually things like this they learn at scale, on the job.
I would expect
    def my_view(request):
        try:
            receipt = generate_receipt(...)
            receipt.send_email(...)
        except SomeKindOfEmailError as e:
            # okay do something else
            pass

and this should be synchronous and thus blocking. So to not block, they either wrote co-routines (asynchronous) or execute things in parallel. Have I interpreted their problem incorrectly?

    get data for receipt
    generate receipt
    send email
    write success to database
    close connection
To a junior programmer this would probably look reasonable, and to be fair, it takes some experience and getting burned, or good training, to know it is not.
When there's noticeable contention due to the number of parallel transactions, one should consider ways to loosen the data guarantees (the status can stay unset for some time) and go for an explicitly asynchronous approach.
"get data for receipt" would be a SELECT, and only "write success to database" would do an INSERT or UPDATE. I'd expect a junior programmer to complete the above in at least two SQL calls. I have a feeling they were trying something smart.
EDIT: hmm reading the other commenter above, probably they are trying to lock on the data for full data integrity. Okay. That makes sense then. I was looking at the problem from the wrong angle.
Open db connection 1
get data for receipt
Close db connection 1
generate receipt
send email
open db connection 2
write success to database
close db connection 2
I guess you could speed this up a lot by doing it in bulk instead of opening and closing db connections twice for every email. Anyway, the version you wrote sounds reasonable for everything but really big operations to me, but then again I'm fairly junior.
To be fair, this problem isn't limited to databases. Filesystems, even HDDs/SSDs, have been known to readily ignore flush() calls in order to achieve better benchmark results.
The advantage you gain is that the queue is a nice buffer if something gets held up with email sending, and also, the queue processor can work in bulk, say sending 100 emails, and then opening a connection to the DB and writing them all in one statement.
transaction 1:
mark an unsent receipt as being processed
get the marked receipt
send email
transaction 2:
mark the receipt as sent
I think a better solution would be to use a centralized queue that actually does the mail sending, and retries in case of failure.
A bit like SMTP?
The MS recommendation is the opposite: get the connection early, finish with it late, let us worry about the "real" connection. I've been working on multi-TB databases like that, with hundreds of concurrent requests, and never had problems.
That is, if you cannot and will not do async stuff in your program, there are ready-made tools that will do that particular thing asynchronously for you, and have been doing so for years (or even decades).
If you just put an item in to a queue, and then have a specific cluster of machines that does nothing but grab items from that queue and push them through an email infrastructure, then it'll be a lot easier to maintain.
[0]: https://www.percona.com/doc/percona-server/5.7/index.html
Knowing what to be careful with on MySQL, I would certainly consider using the Percona flavor again in a new application.
The table had a field that was varchar(128) and wasn't used (it held the original filename, but the system generated a hashed filename and served the file under that name (for sharding, lots of files)), so for 5 years MySQL had been silently killing everything > 128 characters and it didn't matter.
Then along comes our hero (me in this story) who does all the upgrades, tests everything thoroughly (he thought) and it's all fine, deploy, test everything fine.
Next morning, 9 bug reports "Can't upload files at all FIXITNOW!!!".
Turns out the company was auto-generating PDFs from some report software that dumped all the data into the filename, and none of them were less than 128 chars.
MySQL 5.7 stopped silently dumping data into the bit bucket in the sky and instead threw an error (when, quite reasonably, you tried to shove 300 characters into a varchar(128)).
TLDR: MySQL doing the right thing broke the broken.
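The switch in question is sql_mode; 5.7 ships with strict mode on by default, and the old behaviour is still one (ill-advised) setting away:

    SELECT @@sql_mode;   -- the 5.7 default includes STRICT_TRANS_TABLES

    -- reproduce the old silent-truncation behaviour for one session (don't)
    SET SESSION sql_mode = '';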
Don't get me wrong, I would generally choose Postgres over MySQL for an RDBMS with replication requirements these days, but I'm not sure I would have made that same decision a few years ago.
There are valid reasons that long established companies such as Google, Twitter, Facebook and countless others chose MySQL as their primary data store.
And with that in mind, I would take the slow-to-deliver replication options PostgreSQL supplies in base over the previous third-party options, which were comparable to the built-in options in MySQL... and I'd take it over MySQL's built-in options.
and I've used both postgresql and mysql in highly transactional/reliable production environments for over 5 years.
I think you're misrepresenting things here. While all do use MySQL for some specific tasks, it's clear that all also have many other central datastores which are not MySQL. In Google's case this is a gross overstatement.
[I know there's different opinions but I'd rather have a fully open database if I have a choice. For the record I default to PostgreSQL]
the only sources of failover are random scripts off the internet that you have to download, hammer into your version/dialect, and hope you don't trigger one of the many uncovered failover modes.
sure, log shipping works, but that's far, FAR from a complete solution. the gap requires tons of development hours, testing, etc.
can't really blame people for using things with mature tooling.
(but I didn't switch to MySQL that'd be madness! there are plenty of good, replicated redundant stores out there, both nosql and sql)
Yes, there were some other points that were just extra annoyances for them but clearly that point was the most important to them. It's what the header image and the first 60% of the article was talking about and yet nobody seems to be engaging with that point in this thread.
Is the design choice bad? They never said it was. It's just an engineering trade-off. It's very possible that most workloads benefit from this design. But if your workload involves updating lots of existing rows at large scale, then MySQL is going to be a better choice for you.
Also they didn't mention anything about the auto vacuumer, which mostly solved the issue they are talking about.
Their lack of mention of the vacuumer and not seeming to know that Postgres supports statement level replication makes me wonder if they took a deep dive into the wrong part of the technology.
On Postgres an update requires a rewrite of every index of the row.
On MySQL it only requires an update of the indexes that were touched by the update.
If you have a table with 10 indexes then this means doing 10 extra writes physically to the disk.
Please refer to page 64 or so of https://momjian.us/main/writings/pgsql/mvcc.pdf.
There's been quite a few improvements to VACUUM in 9.6 [1], including avoiding full-table scans.
[1] https://www.postgresql.org/docs/9.6/static/release-9-6.html#AEN129664
Or use Cassandra which is a perfect fit (or ScyllaDB which is a better version of it).
This all sounds like an aversion to just paying for or using better products when the problem is easily solved.
Honestly, if a commercial database provides what you require and you have the budget? Sure, why not. But I'd always try to avoid that myself, because the serious players (MS, Oracle, is anyone still using DB2?) are really, really expensive.
SQL Server works just fine and has lots of concurrency control to do whatever they need. And the way they're using the database doesn't seem to really make this an issue outside of their replication.
> are really, really expensive
This whole janky setup they have sounds even worse. None of the commercial relational databases are really that expensive considering what they offer, and we're talking about Uber here. We're a small startup that pays for both.
Paying for a (better) working product with support is the right call, not building it yourself. This is just a poor tech decision (apart from using an RDBMS in the first place).
But I'm sure you know all this as it seems you work for an enterprise database company that actually makes the exact product Uber should use.
Our solution has been to build a distribution layer that makes our product performant at scale, rather than sacrificing data quality. We use CitusDB for the reads and an in-house system for the writes and distributed systems operations. We have never had a problem with data corruption in PostgreSQL, aside from one or two cases early on in which we made operational mistakes.
With proper tuning and some amount of durability-via-replication, we've been able to get great results, and that's supporting ad hoc analytical reads. (For example, you can blunt a lot of the WAL headaches listed here with asynchronous commit.)
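As a rough illustration of that asynchronous-commit knob (the `events` table and the DSN below are placeholders): with synchronous_commit off, a commit returns before the WAL record is flushed, so a crash can lose the last few commits but never corrupts the database.

```python
# Rough sketch of per-session asynchronous commit; the `events` table and the
# DSN are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=analytics")
cur = conn.cursor()

# With synchronous_commit off, COMMIT returns before the WAL record is flushed
# to disk: a crash can lose the last few commits, but it cannot corrupt data.
# The setting can also be applied per transaction or per role.
cur.execute("SET synchronous_commit TO off")

cur.execute("INSERT INTO events (payload) VALUES (%s)", ("example",))
conn.commit()
```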
1. The encoding and translation schemes of Postgres and mySQL/InnoDB are well described in the blog post, and I would also agree that InnoDB’s design is, all things considered, better for all the reasons outlined in the post.
2. I don't understand why anyone still uses lseek() followed by read()/write() and not the pread()/pwrite() syscalls. It's trivial to replace the pair of calls with one (see the small sketch after this list). Aerospike is another datastore that resorts to pairs of seek/read-write calls instead of pread/pwrite.
3. The process/connection model makes no real sense nowadays, although to be fair there is, today, practically no difference in footprint between OS threads and OS processes (other than memory and FD sharing semantics, they are practically the same). It's still more appropriate to use threads (although I'd argue maintaining a pool of threads for processing requests and one or a few threads for multiplexing network I/O is the better choice).
4. ALTER TABLE is obviously a pain point with mySQL, although I am not really sure many users with large datasets care; they probably figured out long ago it’s going to be an issue and they designed and expanded accordingly. It’s also a relatively rare operation. That said, other than using mySQL (or any other RDBMS) to build the data plane for an elaborate, distributed KV store, one should consider Salesforce’s approach too. Their tables have some 50 or so columns, and the column names are generic (e.g column_0, column_1, … ). They have a registry where they assign column indices (e.g column_0) to a specific high-level entity type (e.g customer title, or price), and whenever they need to query, they just translate from the high level entity to the actual column names and it works. They also, IIRC, use other tables to index those columns (e.g such an index table can have just 3 columns, table id, column index, value) and they consult that index when needed (FriendFeed did something similar).
5. Cassandra should have no problem supporting the operations and semantics of Schemaless as described in their blog posts. However, given they already operate it in production, they probably considered it and decided against it.
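Regarding point 2, here's the tiny sketch referred to above, using Python's os module on a Unix-like system (the file path is arbitrary): one positioned read replaces the seek+read pair.

```python
# Tiny illustration of point 2: a positioned read vs. a seek+read pair.
# Requires a Unix-like OS (os.pread is POSIX-only); the file is arbitrary.
import os

fd = os.open("/etc/hostname", os.O_RDONLY)

# Two system calls: move the file offset, then read from it.
os.lseek(fd, 0, os.SEEK_SET)
a = os.read(fd, 64)

# One system call: read 64 bytes at offset 0 without touching the shared file
# offset, which also avoids races between threads sharing the descriptor.
b = os.pread(fd, 64, 0)

assert a == b
os.close(fd)
```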
https://www.postgresql.org/message-id/6248.1046130083%40sss.pgh.pa.us
Manfred Spraul <manfred(at)colorfullife(dot)com> writes:
> Tom Lane wrote:
>> It seems unlikely to me that eliminating lseek on some platforms would
>> be worth the hassle of maintaining two code paths. lseek is mighty
>> cheap as system calls go.
>>
> It was considered expensive enough to write a syscall avoidance layer
> that caches the file pointer and skips lseek if fpos==offset.
You're missing the point: that layer is mostly there to ensure that we
don't foul up the kernel's readahead recognition for sequential fetches.
It's nice that Linux doesn't care, but Linux is not the only platform
we worry about.
regards, tom lane
- we used X in a fashion that suited us best
- it caused us problems Y because of some technicalities of X
- so we switched to Z, and we could avoid Y thanks to how Z handles those technicalities differently than X
and the top rated HN comments are:
- you used the X wrong
- all the technicalities of X that caused you problems Y are actually superior features of X
https://www.yumpu.com/en/document/view/53683323/migrating-uber-from-mysql-to-postgresql
This is especially relevant when doing range queries over the primary key. Imagine a table containing billions of chat messages, from which you want to retrieve a single conversation history. With a clustered primary key on (conversation id, message id), MySQL would need to process just a couple of database pages. Postgres, on the other hand, would need to reference a semi-random page for each of the messages.
Now imagine a 10k message chat conversation, a table too large to fit into RAM, and storage by means of spinning rust (yeah, yeah, I know what year it is :-)). The difference would be somewhere between 2 and 3 orders of magnitude.
So if you have an index on (conversation_id, message_id), and you try to retrieve message ids of a specific conversation, only the index will be touched.
Still, it's quite inefficient maintaining an extra copy of the data that is never actually used. Though no longer multiple orders of magnitude less efficient.
However, I'd guess that programmers don't often think to add these seemingly useless fields to an index, as it feels inefficient and just wrong. But at least this offers an out in pathological cases.
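For anyone who wants to see the covering-index behaviour, a minimal sketch (Python + psycopg2; all names are invented): with an index on (conversation_id, message_id), a query that only needs message_id can be answered from the index alone, once the table is big enough for the planner to bother and the visibility map is reasonably current.

```python
# Minimal sketch of the covering-index point; all names are invented.
import psycopg2

conn = psycopg2.connect("dbname=chat")
cur = conn.cursor()

cur.execute("""
    CREATE TABLE messages (
        conversation_id bigint,
        message_id      bigint,
        body            text
    )
""")
cur.execute("CREATE INDEX ON messages (conversation_id, message_id)")

# The query below only needs columns that live in the index, so once the table
# is large enough (and VACUUM has kept the visibility map current) the plan
# should show an Index Only Scan rather than touching a heap page per message.
cur.execute("""
    EXPLAIN
    SELECT message_id
    FROM messages
    WHERE conversation_id = 42
    ORDER BY message_id
""")
for (line,) in cur.fetchall():
    print(line)
```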
I hope this follows in the next blog post.
pg_upgrade has a --link option which uses hard links in the new cluster to reference files from the old cluster. This can be a very fast way to do upgrades even for large databases (most of the data between major versions will look the same; perhaps only some mucking with system catalogs is required in the new cluster). Furthermore, you can use rsync with --hard-links to very quickly upgrade your standby instances (creating hard links on the remote server rather than transferring the full data).
That is all covered in the current documentation: https://www.postgresql.org/docs/current/static/pgupgrade.html
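A loose sketch of that upgrade path, wrapped in Python purely for illustration (every path and host below is a placeholder; the flags themselves are real pg_upgrade/rsync options, and in practice you'd run these by hand or from configuration management):

```python
# Loose sketch of the --link upgrade path; every path and host below is a
# placeholder, and in practice you'd run these commands by hand or from your
# configuration management, not from Python.
import subprocess

subprocess.run([
    "pg_upgrade",
    "--old-datadir", "/var/lib/postgresql/9.2/main",
    "--new-datadir", "/var/lib/postgresql/9.6/main",
    "--old-bindir", "/usr/lib/postgresql/9.2/bin",
    "--new-bindir", "/usr/lib/postgresql/9.6/bin",
    "--link",   # hard-link unchanged files into the new cluster instead of copying
], check=True)

# Standbys can then be refreshed cheaply: --hard-links lets rsync recreate the
# links on the remote side instead of re-transferring the underlying files.
subprocess.run([
    "rsync", "--archive", "--hard-links", "--size-only",
    "/var/lib/postgresql/9.2/main",
    "/var/lib/postgresql/9.6/main",
    "standby.example.com:/var/lib/postgresql/",
], check=True)
```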
You commonly see this kind of misunderstanding lumped in with other "a computer has physically moving parts" misunderstandings, like the ORM or connection pooling concerns outlined here.
After the tyranny of abstractions, nobody knows how the moving parts really work.
// As alternatives, given you'd like to keep the immutable-data approach (which brings a lot of goodness): consider a log-structured file system for the disk concerns, and geo-sensible replication for the latency concerns. At this scale, for a near-real-time app, a bi-coastal DB is a bad model. You shouldn't have all users in SF querying a database in DC. Given the nature of the business model, they can shard geographically at one time scale, and roll up and replicate geo-diverse data at higher-latency leisure.
The Uber engineers should, perhaps, take a look at the changelogs of the last 5 or so releases of MySQL to see how many bugs have been found in InnoDB in each release, and read the stories about data loss due to the inability to repair storage.
According to old-school DBA tradition, it is much better to have a straightforward storage engine based on the right principles written in C than a fancy storage engine written in C++. At least if one values one's data.
Well, in the age of in-memory "databases", "durability through replication" and "eventual consistency", the old-school focus on disk commits might sound a bit outdated, until that moment when the shards get messed up and there is no way to know how many writes are missing and where.
A database is durable storage which guarantees data consistency and the ability to roll back to a clean state through direct-access (bypassing all caches) writes. At least this is what we had in the glorious times of IDS 7.3.
Competition breeds excellence.
System calls are not context switches. I wish people would distinguish between them. A system call is just a change of privilege level and is efficient --- there's no cache invalidation required on almost any system.
A context switch, on the other hand, involves a call to the scheduler, saving and restoring of much more CPU register state, and various kinds of cache invalidation. (It's even more expensive if you're switching between different processes instead of different threads in the same process.)
The kernel may perform a context switch while executing a system call --- this context switch is what makes blocking calls blocking. But even IO system calls do not necessarily cause context switches, especially in the case where an operation can be satisfied by accessing only the page cache.
tl;dr A system call is not necessarily a context switch
Poor replica MVCC support
They are actually pointing to a blog article written in 2010 -> http://blog.2ndquadrant.com/tradeoffs_in_hot_standby_deplo/
Do they realise that it is 2016?
Guess they didn't bother to understand the hot standby feedback system.
> Postgres’s design resulted in inefficiencies and difficulties for our data at Uber.
What kind of inefficiency? They explain the purpose of WAL and replication, which every database person knows about, but didn't care to explain the actual problem at hand.
Data corruption
> During a routine master database promotion to increase database capacity, we ran into a Postgres 9.2 bug
Why the heck didn't they upgrade to a newer version? Did they report this bug to the pg devs? Did it really take that long to fix, or were they just assuming the bug would fix itself?
> The bug we ran into only affected certain releases of Postgres 9.2 and has been fixed for a long time now. However, we still find it worrisome that this class of bug can happen at all.
Postgres 9.2 is pretty old and there have been 3 major releases since then. WTF?
I can cite countless instances where MySQL data corruption was a constant nuisance with version 5.5, and they fixed it in newer releases.
Replication
> During peak traffic early on, our bandwidth to the storage web service simply wasn’t fast enough to keep up with the rate at which WALs were being written to it
So they ran into a hardware limitation and then blame Postgres. What was the limit they hit? I don't understand this point at all.
Concept of context switching
I am surprised that this is actually an issue; in a database the slowest part is always the disk, not the CPU. I'm confused about how they hit this limitation first without running into the others.
Time taken by a context switch : http://stackoverflow.com/questions/21887797/what-is-the-overhead-of-a-context-switch
Which is in microseconds.
InnoDB buffer pool
> By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool
Postgres has something similar called shared_buffers. They are speaking as if Postgres relies entirely on the operating system's page cache, which is false.
> It makes it possible to implement a custom LRU design. For instance, it’s possible to detect pathological access patterns that would blow out the LRU and prevent them from doing too much damage
Not sure what kind of damage they are speaking of. In a Postgres sequential scan (full table scan), a ring buffer is used instead, so the shared buffers don't get blown away.
If you need a custom LRU design, there is definitely something wrong with the way you are using an OLTP database.
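If you want to see what Postgres itself is caching (as opposed to what sits in the OS page cache), the pg_buffercache contrib extension makes it visible. A small sketch, assuming the extension is installed and with a placeholder DSN:

```python
# Small sketch: what is Postgres itself caching in shared_buffers?
# Assumes the pg_buffercache contrib extension is installed; DSN is a placeholder.
import psycopg2

conn = psycopg2.connect("dbname=app")
cur = conn.cursor()

cur.execute("SHOW shared_buffers")
print("shared_buffers =", cur.fetchone()[0])

cur.execute("""
    SELECT c.relname, count(*) AS buffers
    FROM pg_buffercache b
    JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid)
    WHERE b.reldatabase = (SELECT oid FROM pg_database
                           WHERE datname = current_database())
    GROUP BY c.relname
    ORDER BY buffers DESC
    LIMIT 10
""")
for relname, buffers in cur.fetchall():
    print(relname, buffers)
```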
Connection Handling
This is complete BS. Nobody uses databases without connection pools. Agreed that a thread is more lightweight than a process, but you would never actually notice this limit in practice; the overhead is again on the order of microseconds. In a production system you open connections up front and hold them in a connection pool, so the overhead is almost invisible. If you are constantly opening and closing connections, then there is something seriously wrong with your design.
> However, we have had occasional application bugs in our backend services that caused them to open more active connections (usually “idle in transaction” connections) than the services ought to be using, and these bugs have caused extended downtimes for us
So they are blaming the database for a bug in their design/system. Computers are no match for human stupidity.
> Accordingly, using pgbouncer to do connection pooling with Postgres has been generally successful for us.
Again, what is the problem? The whole article smells more and more like a useless rant, just because they don't know how to use the tools.
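For completeness, application-side pooling is a few lines with psycopg2's built-in pool (pgbouncer does the equivalent job in front of the server); the DSN here is a placeholder:

```python
# Minimal application-side pooling sketch using psycopg2's built-in pool;
# the DSN is a placeholder. pgbouncer does the equivalent job server-side.
from psycopg2.pool import SimpleConnectionPool

pool = SimpleConnectionPool(minconn=1, maxconn=10, dsn="dbname=app")

conn = pool.getconn()
try:
    with conn.cursor() as cur:
        cur.execute("SELECT 1")
        cur.fetchone()
    conn.commit()       # commit (or roll back) so the connection is never
                        # parked in the "idle in transaction" state
finally:
    pool.putconn(conn)  # return the connection instead of closing it

pool.closeall()
```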
Conclusion
Another thing is that they have not given any kind of query/access pattern for how they use Postgres/MySQL. They point at a couple of low-level things and then say that Postgres is badly designed.
I can think of only two logical explanations
1) The article writer was already familiar with MySQL and they didn't even bother to dig deeper into Postgres
2) They have been paid by oracle :P
Why is Postgres designed around "physical replication" rather than MySQL's design of only "logical replication"?
Because Postgres emphasizes data integrity, enforced with strict constraints.
The conclusion reinforces this: "Postgres served us well in the early days of Uber, but we ran into significant problems scaling Postgres with our growth."
I read this as both an endorsement of PostgreSQL and a highlight of some of the problems that any large-scale use of it would run into.
Anyone know why they are using MySQL over MariaDB[1]?
1. https://mariadb.org/