There are a number of really good questions and answers you guys surfaced here. Thanks to @SandyWalsh, @slashfoo, and @aihuaa for bringing this up, and to @wgbartley, @Dave, and @bko for answering and distilling things further (over the Memorial Day weekend, no less!). All of this is really helpful, and I think it gives the Spark team what we need to address these concerns. Here’s my take:
Regarding Scalability and Reliability
We have tons of monitoring and alerting systems in place to detect when the cloud is not performing up to par. However, given what you guys are saying, there is clearly room for improvement. Upon closer inspection, our internal metrics do reveal evidence of API timeouts. I intend to provision more computing resources for this ASAP, and also to wire up alerts so that we learn about problems before you do and can fix them before you ever experience them.
As @Dave suggested, some of this improvement to our cloud reliability will come from provisioned infrastructure in data centers distributed around the world. Additionally, I’m working on improving our external monitoring services so they more accurately reflect what you guys are seeing. I could use help with this: the next time a curl request to api.spark.io fails, please copy and paste the output, along with any relevant details about the failure, into a private message to me or a reply here. That would be useful for correlating with the graphs and logs we see on the backend.
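If you want a starting point for capturing that output, here’s a sketch of a curl invocation that records the diagnostics we’d find most useful. The `/v1/devices` path and `YOUR_TOKEN` value are illustrative placeholders; substitute whatever request actually failed for you.

```shell
# Capture verbose transfer details plus timing for a failing API request.
# Replace the URL and token with the request that failed for you.
curl -v --max-time 30 \
  -w '\nhttp_code=%{http_code} time_connect=%{time_connect}s time_total=%{time_total}s\n' \
  "https://api.spark.io/v1/devices?access_token=YOUR_TOKEN" \
  2>&1 | tee "spark-api-failure-$(date +%s).log"
```

The `-v` output shows DNS resolution, the TLS handshake, and request/response headers; `-w` appends timing figures; and `tee` keeps a timestamped log file you can attach to your message.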
Our monitoring systems are currently focused on the core services that power the Spark Cloud: the REST API and the device service that Cores connect to. Though there are occasional hiccups with those services, oftentimes it’s the peripheral aspects like the IDE and the mobile apps where things break. As we grow, we’ll be putting more energy into improving robustness and error messaging within those contexts.
Regarding the Build IDE
I’ve catalogued the “Save, then Flash flashes a previous version of the firmware I’m writing on the spark IDE.” bug, and either @shime or I will be looking at it more closely this week.
To address “Spark IDE fails to save and times out.”, we’ll be upgrading our database services to a more elastic, fault-tolerant setup. We’re working on that this week as well; once it’s in place, you should not see this failure anymore.
Flash failure: there are two issues here. One is the messaging in the IDE; the second is the over-the-air (OTA) firmware mechanism itself. We’ve experienced this problem ourselves and seen others impacted by it, and we’ve begun designing solutions for both. We’re very eager to implement and roll these out.
Hope this post addresses most of the concerns raised above.