From e613c1440838b58518ca793a21c1b92baefeda54 Mon Sep 17 00:00:00 2001 From: Duckling <57977673+anotherduckling@users.noreply.github.com> Date: Mon, 23 Jan 2023 22:30:56 +0600 Subject: [PATCH] Add files via upload --- devs/scraping/devtools_detectors.md | 7 +++---- devs/scraping/disguising_your_scraper.md | 7 +++---- devs/scraping/finding_video_links.md | 3 ++- devs/scraping/gettingstarted.md | 25 ++++++++++++++++++++++++ devs/scraping/index.yml | 3 +++ devs/scraping/starting.md | 7 +++---- devs/scraping/using_apis.md | 5 ++--- 7 files changed, 41 insertions(+), 16 deletions(-) create mode 100644 devs/scraping/gettingstarted.md create mode 100644 devs/scraping/index.yml diff --git a/devs/scraping/devtools_detectors.md b/devs/scraping/devtools_detectors.md index 50c1ffd..1d46f46 100644 --- a/devs/scraping/devtools_detectors.md +++ b/devs/scraping/devtools_detectors.md @@ -1,6 +1,7 @@ --- label: Devtools detector -order: 3 +order: 997 +icon: /static/tool.png --- **TL;DR**: You are going to get fucked by sites detecting your devtools, the easiest bypass for this is using [a web sniffer extension](https://chrome.google.com/webstore/detail/web-sniffer/ndfgffclcpdbgghfgkmooklaendohaef?hl=en) @@ -98,6 +99,4 @@ At line 23 // Bypasses pref("devtools.console.bypass", true); pref("devtools.debugger.bypass", true); -``` - -### Next up: [Why your requests fail](disguising_your_scraper) \ No newline at end of file +``` \ No newline at end of file diff --git a/devs/scraping/disguising_your_scraper.md b/devs/scraping/disguising_your_scraper.md index a1059f4..a9ef877 100644 --- a/devs/scraping/disguising_your_scraper.md +++ b/devs/scraping/disguising_your_scraper.md @@ -1,6 +1,7 @@ --- label: Disguishing your scrapers -order: 4 +order: 996 +icon: /static/incognito.png --- # Disguishing your scrapers @@ -189,6 +190,4 @@ print(bypassed_response.hcaptcha_token) Keep in mind that if there is no ribbon/token, there is no way of reasonably accessing it. 
-In any case, this is how you, as a decent developer, handle the response properly. - -### Next up: [Finding video links](finding_video_links) \ No newline at end of file +In any case, this is how you, as a decent developer, handle the response properly. \ No newline at end of file diff --git a/devs/scraping/finding_video_links.md b/devs/scraping/finding_video_links.md index 0adc98b..37bc613 100644 --- a/devs/scraping/finding_video_links.md +++ b/devs/scraping/finding_video_links.md @@ -1,6 +1,7 @@ --- label: Finding video links -order: 5 +order: 995 +icon: codescan --- # Finding video links diff --git a/devs/scraping/gettingstarted.md b/devs/scraping/gettingstarted.md new file mode 100644 index 0000000..447204c --- /dev/null +++ b/devs/scraping/gettingstarted.md @@ -0,0 +1,25 @@ +--- +label: Getting Started +order: 1000 +icon: milestone +--- + +# Requests-based scraping tutorial + +You want to start scraping? Well, this guide will teach you, and not some baby Selenium scraping. This guide only uses raw requests and has examples in both Python and Kotlin. Only basic programming knowledge in one of those languages is required to follow along in the guide. + +If you find any aspect of this guide confusing, please open an issue about it and I will try to improve things. + +If you do not know programming at all then this guide will __not__ help you; learn programming first! Real scraping cannot be done by copy-pasting with a vague understanding. 
+ +[!badge variant="light" text="Step 0"] [Starting scraping from zero](starting) + +[!badge variant="light" text="Step 1"] [Properly scraping JSON APIs often found on sites](using_apis) + +[!badge variant="light" text="Step 2"] [Evading developer tools detection when scraping](devtools_detectors) + +[!badge variant="light" text="Step 3"] [Why your requests fail and how to fix them](disguising_your_scraper) + +[!badge variant="light" text="Step 4"] [Finding links and scraping videos](finding_video_links) + +Once you've read and understood the concepts behind scraping, take a look at [a provider for CloudStream](https://github.com/recloudstream/cloudstream-extensions/blob/master/VidstreamBundle/src/main/kotlin/com/lagradost/VidEmbedProvider.kt#L4). I added tons of comments to make every aspect of writing CloudStream providers clear. Even if you're not planning on contributing to CloudStream, looking at the code may help. \ No newline at end of file diff --git a/devs/scraping/index.yml b/devs/scraping/index.yml new file mode 100644 index 0000000..766e9df --- /dev/null +++ b/devs/scraping/index.yml @@ -0,0 +1,3 @@ +icon: /static/scraper.png +label: Scraping +expanded: false diff --git a/devs/scraping/starting.md b/devs/scraping/starting.md index 6e7782d..6dee852 100644 --- a/devs/scraping/starting.md +++ b/devs/scraping/starting.md @@ -1,8 +1,8 @@ --- label: Starting -order: 1 +order: 999 +icon: rocket --- - Scraping is just downloading a webpage and getting the wanted information from it. 
As a start you can scrape the README.md @@ -219,5 +219,4 @@ fun main() { val description = descriptionRegex.find(response.text)?.groups?.get(1)?.value println(description) } -``` -### Next up: [Properly scraping JSON apis](../using_apis.md) \ No newline at end of file +``` \ No newline at end of file diff --git a/devs/scraping/using_apis.md b/devs/scraping/using_apis.md index 346a2ab..4c16145 100644 --- a/devs/scraping/using_apis.md +++ b/devs/scraping/using_apis.md @@ -1,6 +1,7 @@ --- label: Using APIs -order: 2 +order: 998 +icon: /static/api.png --- ### About @@ -168,5 +169,3 @@ One thing to note is that you don't need to add all of the json key/value pairs ### Note Even though we set `DeserializationFeature.FAIL_ON_UNKNOWN_PROPERTIES` as `false` it will still error on missing properties.
If a json may or may not include some info, make those properties as nullable in the structure you build. - -### Next up: [Evading developer tools detection](devtools_detectors) \ No newline at end of file