I’m migrating away from the v5 /streams endpoint due to the announced removal of offsets above 900, and I’ve run into an issue with the Helix endpoint.
At the time of writing, v5 /streams reports a total of ~75k active streams.
I’m currently paging through Helix /streams using a combination of after={{previousCursor}}&first=100 and have seen no sign of it stopping: every pull returns >95 stream objects, and I’m at >100k results so far.
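For reference, this is roughly the loop I’m running (a simplified sketch; `CLIENT_ID` and `TOKEN` are placeholders for my actual credentials):

```python
import requests

CLIENT_ID = "..."  # app client ID (placeholder)
TOKEN = "..."      # app access token (placeholder)

def fetch_page(cursor=None):
    """Fetch one page of up to 100 streams from Helix /streams."""
    params = {"first": 100}
    if cursor:
        params["after"] = cursor
    resp = requests.get(
        "https://api.twitch.tv/helix/streams",
        headers={"Client-ID": CLIENT_ID, "Authorization": f"Bearer {TOKEN}"},
        params=params,
    )
    resp.raise_for_status()
    body = resp.json()
    return body["data"], body.get("pagination", {}).get("cursor")

cursor = None
total = 0
while True:
    streams, cursor = fetch_page(cursor)
    total += len(streams)
    print(f"pulled {len(streams)} streams, {total} total")
    if not cursor:  # I expected to hit this eventually; it never happens
        break
```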
Has anybody run into this, or does anyone know why the cursor seemingly never ends?
I’d have assumed the cursor would eventually reach “the end” of the result set, and that I could then start the request again after a short period.
EDIT:
I’m currently running tests to see how unique these results are, as my hunch is that these are just “updated” stream objects being returned again after a certain period of time.
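The test is essentially the following (a sketch reusing the `fetch_page` helper from above; the 120k cap is just so it terminates):

```python
import time

seen = set()   # unique stream ids collected so far
cursor = None
pulled = 0
start = time.time()

while pulled < 120_000:
    streams, cursor = fetch_page(cursor)
    pulled += len(streams)
    before = len(seen)
    seen.update(s["id"] for s in streams)
    print(f"{pulled} pulled | {len(seen)} unique | "
          f"+{len(seen) - before} new this page | "
          f"{time.time() - start:.0f}s elapsed")
```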
EDIT2:
Test results: 118k stream objects pulled, 68k unique stream ids, over ~5 minutes.
At around 65k unique ids, the count was only rising by 5-10 per 100 pulled (which seems in line with new streams going live).
It looks like the API simply keeps returning updated stream objects once the cursor reaches the end of the results, meaning the cursor never really “ends”. However, I’d love some clarification on this.
EDIT3:
For anybody reading this in the future who is curious: I’ve just run a test to check when the cursor “loops”, and it seems to happen at around the 3-4 minute mark on average.
This is one of the frustrating things about pagination in Helix, as there are two issues at play that can cause problems when paging through all results.
First, because you have to wait for a request to complete before you have the cursor for the next page, you can’t make parallel requests, so going through all >60k streams takes a while. The longer it takes, the more likely streams are to go online/offline or change in viewership, which shifts their position in the results; you end up missing some streamers while counting others twice. The more pages you go through and the more time passes, the greater the degree of error.
Secondly, some Helix cursors simply stop at the end of the results, while others, such as Streams, loop back around to the start. There are several ways to detect this: checking the cursor itself (which may not be reliable), checking for a large proportion of duplicates per page, or watching the viewer count (if you were on pages of 0-viewer streams and then see a page where every stream has thousands of viewers, you’ve looped back to the start).
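As a rough sketch of the duplicate and viewer-count checks (the thresholds here are arbitrary and would need tuning):

```python
def looks_like_loop(page, seen_ids, prev_min_viewers):
    """Heuristics for detecting that the Streams cursor has wrapped around.

    page             -- list of stream objects from the current pull
    seen_ids         -- set of stream ids collected so far
    prev_min_viewers -- lowest viewer_count seen on the previous page
    """
    # Heuristic 1: the page is overwhelmingly streams we've already seen.
    dupes = sum(1 for s in page if s["id"] in seen_ids)
    if page and dupes / len(page) > 0.9:
        return True

    # Heuristic 2: results are sorted by viewers descending, so a large
    # upward jump in viewer_count means we're back at the top.
    if page and prev_min_viewers is not None:
        if min(s["viewer_count"] for s in page) > prev_min_viewers * 10 + 100:
            return True

    return False
```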
That’s great to know.
Gotta love that even in the new APIs there’s inconsistency lmao
So where previously I could “shotgun” offset requests every x minutes and horizontally scale those parallel requests, my option here is essentially to recurse indefinitely on a single node.
It definitely affects data accuracy when running analytics, which is a shame. I’ll have to assess how long it’s really taking to “finish” the loop.