Scalability and Reliability of Spark Cloud

I’m curious about an SLA for the Spark Cloud. Even just playing with Tinker, I frequently get slow response times or outright timeouts when trying to set an output or check an input.

I’d be hesitant to roll out any sort of application that depends on the Spark Cloud without getting some guarantees around availability, redundancy, response times, etc.

I’m going to ping @Dave and @jgoggins to comment on this one. We users obviously haven’t signed any SLA, and I’m not sure one is implied anywhere. I work as a web developer / jack-of-all-trades at a data center, so I’m quite familiar with SLAs!

For me, there may be a hiccup once or twice an hour (when polling every minute) where individual Cores are unreachable. However, it’s difficult to pin that on any given point of failure (Core, Wi-Fi, internet connection, Spark Cloud, etc.) without monitoring each one individually.
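
For what it’s worth, a crude monitoring loop along these lines would at least tell you which hop failed. This is just a sketch: it assumes the Core is still running Tinker, and the router IP, `ACCESS_TOKEN`, and `DEVICE_ID` are placeholders you’d fill in yourself.

```bash
#!/bin/bash
# Rough per-hop monitor: poll once a minute and log which hop failed.
# 192.168.1.1, ACCESS_TOKEN, and DEVICE_ID are placeholders.
ACCESS_TOKEN="your_token_here"
DEVICE_ID="your_core_id"

while true; do
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    # Hop 1: local network
    ping -c 1 -W 2 192.168.1.1 > /dev/null 2>&1 \
        || echo "$ts local network unreachable"
    # Hop 2: Spark Cloud API itself
    curl -s --max-time 5 "https://api.spark.io/v1/devices?access_token=$ACCESS_TOKEN" \
        > /dev/null || echo "$ts cloud API timed out"
    # Hop 3: the Core, reached through the cloud via a Tinker function call
    curl -s --max-time 10 "https://api.spark.io/v1/devices/$DEVICE_ID/digitalread" \
        -d access_token="$ACCESS_TOKEN" -d params=D0 \
        | grep -q return_value || echo "$ts Core unreachable through the cloud"
    sleep 60
done
```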

Personally, I’m waiting on the local cloud to be available before I pursue any “enterprise” projects. That way, I can guarantee my own uptimes and response times.

1 Like

Instead of opening a new thread, I’ll semi-hijack this one and list the problems I’ve encountered with the Spark Cloud in the past day or so.

  • Clicking Save and then Flash sometimes flashes a previous version of the firmware I’m writing in the Spark IDE.
  • The Spark IDE fails to save and times out.
  • Trying to access the API with curl times out, even with a good connection on both the Core and my laptop; the same happens with Tinker.

Adding to that, I’d like to (again) raise the question of a local “Spark Cloud” that I could run on my own servers, or even on my local network.

I actually encountered that a few times this evening.

Saving should happen immediately. Did you mean that it fails to flash and times out?

This one I usually attribute to something between my Core, the internet, and the Cloud. It usually works the second time around; I'm not sure where it breaks down, though. Since it's such a quick workaround to just fire off the API call a second time, I don't know how high it is on the priority list to be fixed.
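
Something like this is all I mean by firing it off a second time. It's only a rough sketch using Tinker's `digitalwrite`; `DEVICE_ID` and `ACCESS_TOKEN` are placeholders.

```bash
# Retry a Tinker call once if the first attempt fails or times out.
# DEVICE_ID and ACCESS_TOKEN are placeholders.
for attempt in 1 2; do
    curl -sf --max-time 10 \
        "https://api.spark.io/v1/devices/$DEVICE_ID/digitalwrite" \
        -d access_token="$ACCESS_TOKEN" \
        -d params=D7,HIGH \
        && break
    echo "attempt $attempt failed, retrying..." >&2
done
```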

The Holy Grail! I know the guys are working feverishly on this, but it's a complex beast. Refactoring the existing cloud into an easy-to-install package isn't a quick task. The ETA for it is this summer (in the Northern Hemisphere), but there's no hard date yet. The good news is that Spark is headquartered in Minneapolis, so their summer is shorter than most of the US. Summer doesn't seem to end in Tennessee until October! :wink:

I'm going to ping @Dave and @jgoggins again, but it is a long weekend for the Spark staff, so I'd give them until close-of-business on Tuesday before really hammering them for feedback.

1 Like

On the flashing timing out: the UI sometimes says it’s taking longer than usual, and then the flash never completes. I can assure you that my computer and Core are very well connected, and seeing that I’m not the only one having issues with the IDE or the cloud, I have no reason to dig further on my end.

I see it regularly, too. I kinda wish there was a way to cancel the operation once you know it's not going to go through, but there's probably some dark magic that has to happen server-side that they haven't been able to implement yet.

If I had a real answer, I wouldn't hesitate to let you know. It gets frustrating for me a bit sometimes too. Let's wait until Tuesday (or sooner) to see what the Spark staff have to say.

So I think @wgbartley is right to get the Spark team involved; they manage it day-to-day and have a bigger-picture view as well. The cloud is hosted on Amazon AWS, so the actual host uptime is as good as any other cloud service's. Problems in the past have involved things like all of us filling up the disk space on the cloud(!) and other similar issues. The Spark guys do have a duty rotation for cloud emergencies, and someone is on call 24x7.

But from my perspective, this does happen to me too. It seems to be related to the time of day (probably the number of users): certain times are fast and good, while others are not so good. I think some of the web build IDE problems are browser-related. I use Chrome, Firefox, Safari, and IE, and they all have their little issues. Sometimes I just reload the page and all is well.

When the over-the-air update does not work, the Core reboots and runs the last good firmware, so I think that might be what is happening in point one. I can always tell this is going to happen if I watch the sequence of LED flashes, because the Core goes “dark” for several seconds after flashing magenta for a while, instead of showing the normal flashing sequence. Lots of times this has been fixed for me by a factory reset, or even several factory resets. Working on code that uses the Core’s external flash was particularly troublesome: if I had a bug, it was easy to clobber something important, and it sometimes took 2-3 factory resets to get back to normal.

I don’t have any problems with curl timeouts; I don’t think I have ever seen that. Because it is on AWS, the API server does not answer pings, but measuring the delay in other ways, I know my house is about 130-140 ms round trip from the cloud. Sometimes people forget that this latency exists and try to hit the cloud API too fast. Some of the cloud APIs on the Core, like Spark.publish(), have rate limits, but others do not, and that can lead to problems.
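
For example, curl's built-in timing variables give you that measurement without ICMP. A one-liner like this works (`ACCESS_TOKEN` is a placeholder; even a rejected request still measures the network round trip):

```bash
# ACCESS_TOKEN is a placeholder for your own token.
curl -s -o /dev/null \
     -w "DNS: %{time_namelookup}s  connect: %{time_connect}s  total: %{time_total}s\n" \
     "https://api.spark.io/v1/devices?access_token=$ACCESS_TOKEN"
```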

The Core has been very resilient for me if I go back to basics and start from known-good firmware with a factory reset after a big problem.

2 Likes

Dear Spark users,

I’d like to ask how you guys handle development.
As I understand it, the compile process happens in the cloud,
which means that even flashing via DFU would not help, because the downloaded binary may be an old version. Is that right?

I received my two Cores on Friday and was happy: some basic tests worked out extremely quickly.
Yesterday I managed to flash my second Core once with my own long-prepared software; after that, not anymore.

Today, Sunday morning (GMT+2), it worked perfectly a couple of times.
Then, after I wrote my first more complicated code, the compiler still seemed to work (it caught syntax errors), but flashing did not.

Yesterday it ignored all changes, even when I renamed (and saved) the app.
Today it does nothing at all: no LED flashing sequence, only the breathing.

The most confusing part is the message: “Flash successful! Please wait a moment while your Core is updated…” followed by “Ready”.

I would really suggest adding some kind of handshake there, so the success message appears only after a genuinely successful flash; I have no idea if this is possible.

My question: is this a new issue, or do you all use some other way to put software on the Core? Otherwise, how have so many successful projects come out already?

Hi @aihuaa,

I’m sorry the build IDE has been giving you some issues this weekend; I’ll ping @jgoggins about that.

You can always compile in the cloud using the spark-cli (https://github.com/spark/spark-cli), or you can set up the local build environment and avoid the build IDE and cloud entirely; instructions are here: https://github.com/spark/core-firmware
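
Roughly, that cloud-compile workflow looks like this. This is a sketch based on the spark-cli README at the time; the project path and binary name are placeholders, and the Core must be in DFU mode (with dfu-util installed) for the USB flash step.

```bash
# Install the CLI, compile in the cloud, then flash over USB (bypassing OTA).
npm install -g spark-cli
spark login
spark cloud compile ./my-project   # compiles in the cloud and downloads a .bin
spark flash --usb firmware.bin     # use the .bin filename the compile step reported
```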

The build IDE is under constant development, and we try to fix issues like this as soon as they come up, so please keep those error reports coming!

I hope that helps!

Thanks!
David

1 Like

Hi @SandyWalsh,

Good question! We’re always improving the performance of our cloud services. We’re planning to add more availability zones around the world as we grow, but in the meantime, please keep in mind that any message through the cloud has to travel from your controller -> cloud -> core, and then back again. So, for example, with 100ms of latency on each leg, you can expect about a 400ms response time across those four legs, depending on where your core is located. Additionally, some things are throttled slightly; sending events from your core, for example, is currently limited to an average of 1 event per second per core.
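
For example, you can see that full controller -> cloud -> core -> cloud -> controller round trip yourself by timing a call to one of Tinker’s functions with curl. This is just a sketch; `DEVICE_ID` and `ACCESS_TOKEN` are placeholders.

```bash
# Time a full round trip through the cloud to the core and back.
# DEVICE_ID and ACCESS_TOKEN are placeholders.
curl -s -o /dev/null -w "round trip: %{time_total}s\n" \
     "https://api.spark.io/v1/devices/$DEVICE_ID/digitalread" \
     -d access_token="$ACCESS_TOKEN" \
     -d params=D0
```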

For enterprise clients and businesses that need more than the standard cloud service included with your Core, we have enterprise support levels available as well. Please feel free to email us at hello@spark.io, and Steph or Teke can help!

Thanks,
David

1 Like

Hi Guys,

There are a number of really good questions and answers you guys have surfaced here. Thanks to @SandyWalsh, @slashfoo, and @aihuaa for bringing this up, and to @wgbartley, @Dave, and @bko for answering and distilling things further (over the Memorial Day weekend, no less! :smile:). All of this is really helpful, and I think it gives the Spark Team what we need to address these concerns. Here’s my take:

Regarding Scalability and Reliability

We have tons of monitoring and alerting systems in place to detect when the cloud is not behaving up to par. However, given what you guys are saying, there is clearly room for improvement. Upon closer inspection, our internal metrics do reveal evidence of API timeouts. I intend to provision more computing resources for this ASAP and also wire up alerts so we know about problems before you do and can fix them quietly, without you having to experience them.

As @Dave suggested, some of this improvement to our cloud reliability will come via infrastructure provisioned in data centers distributed around the world. Additionally, I’m working on improving our external monitoring services to more accurately see what you guys are seeing. I could use help with this: the next time a curl request to api.spark.io fails, if you’d like to copy and paste the output in a private message to me, or here, along with any relevant details about the failure, that could be useful for correlating with the graphs and logs we see on the backend.
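
For example, something along these lines would capture both the verbose transcript and the timings in one log file you could paste into a report (a sketch; `ACCESS_TOKEN` is a placeholder):

```bash
# Capture curl's verbose output plus timing data when a request fails,
# so it can be correlated with server-side logs. ACCESS_TOKEN is a placeholder.
curl -v --max-time 10 \
     -w "\nDNS: %{time_namelookup}s  connect: %{time_connect}s  total: %{time_total}s\n" \
     "https://api.spark.io/v1/devices?access_token=$ACCESS_TOKEN" \
     2>&1 | tee "spark-fail-$(date +%s).log"
```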

The focus of our monitoring systems is currently oriented toward the core services that power the Spark Cloud: the REST API and the device service that Cores connect to. Though there are occasional hiccups with that aspect of the Spark Cloud, oftentimes it’s the peripheral pieces, like the IDE and the mobile apps, where things break. As we grow, we’ll be putting more energy into improving robustness and messaging within those contexts.

Regarding the Build IDE

  • I’ve catalogued the “Save, then Flash flashes a previous version of the firmware” bug, and either @shime or I will be looking closer at it this week.

  • To address “Spark IDE fails to save and times out,” we’ll be upgrading our database services to a more elastic, fault-tolerant setup. Again, we’re working on that this week; after that, you should not see this failure anymore.

  • Flash failure: there are two issues here. One is the messaging in the IDE; the second is the over-the-air (OTA) firmware mechanism itself. We’ve experienced this problem ourselves and seen others impacted by it, and we have begun designing solutions for both; we’re very eager to implement and roll them out.

Hope this post addresses most of the concerns raised above.
Thanks!

-joe

4 Likes