Shriyakupp committed on
Commit 980dc8d · verified · 1 Parent(s): 4819f9d

Upload 107 files

This view is limited to 50 files because it contains too many changes.

Files changed (50):
  1. markdown_files/1._Development_Tools.md +22 -0
  2. markdown_files/2._Deployment_Tools.md +18 -0
  3. markdown_files/3._Large_Language_Models.md +68 -0
  4. markdown_files/4._Data_Sourcing.md +39 -0
  5. markdown_files/5._Data_Preparation.md +34 -0
  6. markdown_files/6._Data_Analysis.md +31 -0
  7. markdown_files/7._Data_Visualization.md +18 -0
  8. markdown_files/AI_Code_Editors__GitHub_Copilot.md +31 -0
  9. markdown_files/AI_Terminal_Tools__llm.md +76 -0
  10. markdown_files/Actor_Network_Visualization.md +26 -0
  11. markdown_files/Authentication__Google_Auth.md +93 -0
  12. markdown_files/BBC_Weather_API_with_Python.md +74 -0
  13. markdown_files/Base_64_Encoding.md +77 -0
  14. markdown_files/Browser__DevTools.md +69 -0
  15. markdown_files/CI_CD__GitHub_Actions.md +79 -0
  16. markdown_files/CORS.md +88 -0
  17. markdown_files/CSS_Selectors.md +39 -0
  18. markdown_files/Cleaning_Data_with_OpenRefine.md +31 -0
  19. markdown_files/Containers__Docker,_Podman.md +94 -0
  20. markdown_files/Convert_HTML_to_Markdown.md +183 -0
  21. markdown_files/Convert_PDFs_to_Markdown.md +139 -0
  22. markdown_files/Correlation_with_Excel.md +33 -0
  23. markdown_files/Crawling_with_the_CLI.md +137 -0
  24. markdown_files/Data_Aggregation_in_Excel.md +32 -0
  25. markdown_files/Data_Analysis_with_DuckDB.md +37 -0
  26. markdown_files/Data_Analysis_with_Python.md +37 -0
  27. markdown_files/Data_Analysis_with_SQL.md +39 -0
  28. markdown_files/Data_Cleansing_in_Excel.md +30 -0
  29. markdown_files/Data_Preparation_in_the_Editor.md +30 -0
  30. markdown_files/Data_Preparation_in_the_Shell.md +36 -0
  31. markdown_files/Data_Storytelling.md +18 -0
  32. markdown_files/Data_Transformation_in_Excel.md +30 -0
  33. markdown_files/Data_Transformation_with_dbt.md +64 -0
  34. markdown_files/Data_Visualization_with_Seaborn.md +20 -0
  35. markdown_files/Database__SQLite.md +148 -0
  36. markdown_files/DevContainers__GitHub_Codespaces.md +57 -0
  37. markdown_files/Editor__VS_Code.md +31 -0
  38. markdown_files/Embeddings.md +106 -0
  39. markdown_files/Extracting_Audio_and_Transcripts.md +298 -0
  40. markdown_files/Forecasting_with_Excel.md +25 -0
  41. markdown_files/Function_Calling.md +184 -0
  42. markdown_files/Geospatial_Analysis_with_Excel.md +33 -0
  43. markdown_files/Geospatial_Analysis_with_Python.md +34 -0
  44. markdown_files/Geospatial_Analysis_with_QGIS.md +32 -0
  45. markdown_files/Hybrid_RAG_with_TypeSense.md +154 -0
  46. markdown_files/Images__Compression.md +83 -0
  47. markdown_files/Interactive_Notebooks__Marimo.md +58 -0
  48. markdown_files/JSON.md +8 -0
  49. markdown_files/JavaScript_tools__npx.md +47 -0
  50. markdown_files/LLM_Agents.md +123 -0
markdown_files/1._Development_Tools.md ADDED
---
title: "1. Development Tools"
original_url: "https://tds.s-anand.net/#/development-tools?id=development-tools"
downloaded_at: "2025-06-08T23:21:33.929318"
---

[Development Tools](#/development-tools?id=development-tools)
=============================================================

**NOTE**: The tools in this module are **PRE-REQUISITES** for the course. You would have used most of these before. If most of this is new to you, please take this course later.

Some tools are fundamental to data science because they are industry standards and widely used by data science professionals. Mastering these tools will align you with current best practices and make you more adaptable in a fast-evolving industry.

The tools we cover here are not just popular; they’re the core technology behind most of today’s data science and software development.

[Previous

Tools in Data Science](#/README)

[Next

Editor: VS Code](#/vscode)
markdown_files/2._Deployment_Tools.md ADDED
---
title: "2. Deployment Tools"
original_url: "https://tds.s-anand.net/#/deployment-tools?id=deployment-tools"
downloaded_at: "2025-06-08T23:26:43.558808"
---

[Deployment Tools](#/deployment-tools?id=deployment-tools)
==========================================================

Any application you build is likely to be deployed somewhere. This section covers the most popular tools involved in deploying an application.

[Previous

Version Control: Git, GitHub](#/git)

[Next

Markdown](#/markdown)
markdown_files/3._Large_Language_Models.md ADDED
---
title: "3. Large Language Models"
original_url: "https://tds.s-anand.net/#/large-language-models?id=large-language-models"
downloaded_at: "2025-06-08T23:23:17.306109"
---

[Large Language Models](#/large-language-models?id=large-language-models)
=========================================================================

This module covers the practical usage of large language models (LLMs).

**LLMs incur a cost.** For the May 2025 batch, use [aipipe.org](https://aipipe.org/) as a proxy.
Emails with `@ds.study.iitm.ac.in` get a **$1 per calendar month** allowance. (Don’t exceed that.)

Read the [AI Pipe documentation](https://github.com/sanand0/aipipe) to learn how to use it. In short:

1. Replace `OPENAI_BASE_URL`, i.e. `https://api.openai.com/v1`, with `https://aipipe.org/openrouter/v1...` or `https://aipipe.org/openai/v1...`
2. Replace `OPENAI_API_KEY` with the [`AIPIPE_TOKEN`](https://aipipe.org/login)
3. Replace model names, e.g. `gpt-4.1-nano`, with `openai/gpt-4.1-nano`

For example, let’s use [Gemini 2.0 Flash Lite](https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash-lite) via [OpenRouter](https://openrouter.ai/google/gemini-2.0-flash-lite-001) for chat completions and [Text Embedding 3 Small](https://platform.openai.com/docs/models/text-embedding-3-small) via [OpenAI](https://platform.openai.com/docs/) for embeddings:

```
curl https://aipipe.org/openrouter/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AIPIPE_TOKEN" \
  -d '{
    "model": "google/gemini-2.0-flash-lite-001",
    "messages": [{ "role": "user", "content": "What is 2 + 2?" }]
  }'

curl https://aipipe.org/openai/v1/embeddings \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $AIPIPE_TOKEN" \
  -d '{ "model": "text-embedding-3-small", "input": "What is 2 + 2?" }'
```

Or using [`llm`](https://llm.datasette.io/):

```
llm keys set openai --value $AIPIPE_TOKEN

export OPENAI_BASE_URL=https://aipipe.org/openrouter/v1
llm 'What is 2 + 2?' -m openrouter/google/gemini-2.0-flash-lite-001

export OPENAI_BASE_URL=https://aipipe.org/openai/v1
llm embed -c 'What is 2 + 2' -m 3-small
```

**For a 50% discount** (but slower speed), use [Flex processing](https://platform.openai.com/docs/guides/flex-processing) by adding `service_tier: "flex"` to your JSON request.

[AI Proxy - Jan 2025](#/large-language-models?id=ai-proxy-jan-2025)
-------------------------------------------------------------------

For the Jan 2025 batch, we had created API keys for everyone with an `iitm.ac.in` email to use `gpt-4o-mini` and `text-embedding-3-small`. Your usage is limited to **$1 per calendar month** for this course. Don’t exceed that.

**Use [AI Proxy](https://github.com/sanand0/aiproxy)** instead of OpenAI. Specifically:

1. Replace API calls to `https://api.openai.com/...` with `https://aiproxy.sanand.workers.dev/openai/...`
2. Replace the `OPENAI_API_KEY` with the `AIPROXY_TOKEN` that someone will give you.

[Previous

Local LLMs: Ollama](#/ollama)

[Next

Prompt engineering](#/prompt-engineering)
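The three substitutions above amount to changing only the base URL, the key, and the model name. A minimal Python sketch of the resulting chat-completion request (built but not sent here; the model name and the `AIPIPE_TOKEN` fallback are illustrative):

```python
import json
import os

# Base URL replaces https://api.openai.com/v1 (substitution 1).
AIPIPE_BASE = "https://aipipe.org/openrouter/v1"
# AIPIPE_TOKEN replaces OPENAI_API_KEY (substitution 2).
token = os.environ.get("AIPIPE_TOKEN", "test-token")

url = f"{AIPIPE_BASE}/chat/completions"
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {token}",
}
# Provider-prefixed model name (substitution 3).
payload = {
    "model": "openai/gpt-4.1-nano",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
}
body = json.dumps(payload)

# Sending it is then one call, e.g.:
#   httpx.post(url, headers=headers, content=body)
print(url)
```

Everything else about the request stays identical to stock OpenAI usage, which is why existing SDKs work unchanged once the base URL and key are swapped.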
markdown_files/4._Data_Sourcing.md ADDED
---
title: "4. Data Sourcing"
original_url: "https://tds.s-anand.net/#/data-sourcing?id=data-sourcing"
downloaded_at: "2025-06-08T23:24:22.670487"
---

[Data Sourcing](#/data-sourcing?id=data-sourcing)
=================================================

Before you do any kind of data science, you obviously have to get the data to be able to analyze it, visualize it, narrate it, and deploy it.
What we are going to cover in this module is how you get the data.

There are three ways you can get the data.

1. The first is you can **download** the data. Either somebody gives you the data and says download it from here, or you are asked to download it from the internet because it’s a public data source. But that’s the first way—you download the data.
2. The second way is you can **query it** from somewhere. It may be on a database. It may be available through an API. It may be available through a library. But these are ways in which you can selectively query parts of the data and stitch it together.
3. The third way is you have to **scrape it**. It’s not directly available in a convenient form that you can query or download. But it is, in fact, on a web page, in a PDF file, in a Word document, or in an Excel file. It’s kind of structured, but you will have to figure out that structure and extract it from there.

In this module, we will be looking at the tools that help you download from a data source, query from an API, a database, or a library, and finally scrape from different sources.

[![Data Sourcing - Introduction](https://i.ytimg.com/vi_webp/1LyblMkJzOo/sddefault.webp)](https://youtu.be/1LyblMkJzOo)

Here are links used in the video:

* [The Movies Dataset](https://www.kaggle.com/rounakbanik/the-movies-dataset)
* [IMDb Datasets](https://imdb.com/interfaces/)
* [Download the IMDb Datasets](https://datasets.imdbws.com/)
* [Explore the Internet Movie Database](https://gramener.com/imdb/)
* [What does the world search for?](https://gramener.com/search/)
* [HowStat - Cricket statistics](https://howstat.com/cricket/home.asp)
* [Cricket Strike Rates](https://gramener.com/cricket/)

[Previous

Project 1](#/project-tds-virtual-ta)

[Next

Scraping with Excel](#/scraping-with-excel)
markdown_files/5._Data_Preparation.md ADDED
---
title: "5. Data Preparation"
original_url: "https://tds.s-anand.net/#/data-preparation?id=data-preparation"
downloaded_at: "2025-06-08T23:22:16.649843"
---

[Data Preparation](#/data-preparation?id=data-preparation)
==========================================================

Data preparation is crucial because raw data is rarely perfect.

It often contains errors, inconsistencies, or missing values. For example, marks data may have ‘NA’ or ‘absent’ for non-attendees, which you need to handle.

This section teaches you how to clean up data, convert it to different formats, aggregate it if required, and get a feel for the data before you analyze it.

Here are links used in the video:

* [Presentation used in the video](https://docs.google.com/presentation/d/1Gb0QnPUN1YOwM_O5EqDdXUdL-5Azp1Tf/view)
* [Scraping assembly elections - Notebook](https://colab.research.google.com/drive/1SP8yVxzmofQO48-yXF3rujqWk2iM0KSl)
* [Assembly election results (CSV)](https://github.com/datameet/india-election-data/blob/master/assembly-elections/assembly.csv)
* [`pdftotext` software](https://www.xpdfreader.com/pdftotext-man.html)
* [OpenRefine software](https://openrefine.org)
* [The most persistent party](https://gramener.com/election/parliament#story.ddp)
* [TN assembly election cartogram](https://gramener.com/election/cartogram?ST_NAME=Tamil%20Nadu)

[![Data Preparation - Introduction](https://i.ytimg.com/vi_webp/dF3zchJJKqk/sddefault.webp)](https://youtu.be/dF3zchJJKqk)

[Previous

Scraping: Live Sessions](#/scraping-live-sessions)

[Next

Data Cleansing in Excel](#/data-cleansing-in-excel)
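The ‘NA’/‘absent’ handling mentioned above can be sketched in plain Python (the marks values here are made up for illustration):

```python
# Treat 'NA' and 'absent' as missing, convert the rest to numbers,
# then compute statistics only over the students who attended.
raw_marks = ["72", "NA", "85", "absent", "90"]

def clean(mark: str):
    """Return a float mark, or None for a missing/non-attendee entry."""
    return None if mark.strip().lower() in {"na", "absent"} else float(mark)

marks = [clean(m) for m in raw_marks]
present = [m for m in marks if m is not None]
average = sum(present) / len(present)
print(marks, average)
```

The key design decision is the same one you face in any tool (Excel, OpenRefine, pandas): decide explicitly which sentinel values mean "missing" before aggregating, rather than letting them silently coerce to zero.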
markdown_files/6._Data_Analysis.md ADDED
---
title: "6. Data Analysis"
original_url: "https://tds.s-anand.net/#/data-analysis?id=data-analysis"
downloaded_at: "2025-06-08T23:26:37.046522"
---

[Data analysis](#/data-analysis?id=data-analysis)
=================================================

[Data Analysis: Introduction Podcast](https://drive.google.com/file/d/1isjtxFa43CLIFlLpo8mwwQfBog9VlXYl/view) by [NotebookLM](https://notebooklm.google.com/)

Once you’ve prepared the data, your next task is to analyze it to get insights that are not immediately obvious.

In this module, you’ll learn:

* **Statistical analysis**: Calculate correlations, regressions, forecasts, and outliers using **spreadsheets**.
* **Data summarization**: Aggregate and pivot data using **Python** and **databases**.
* **Geo-data Collection & Processing**: Gather and process geospatial data using tools like Python (GeoPandas) and QGIS.
* **Geo-visualization**: Create and visualize geospatial data on maps using Excel, QGIS, and Python libraries such as Folium.
* **Network & Proximity Analysis**: Analyze geospatial relationships and perform network analysis to understand data distribution and clustering.
* **Storytelling & Decision Making**: Develop narratives and make informed decisions based on geospatial data insights.

[![Data Analysis - Introduction](https://i.ytimg.com/vi_webp/CRSljunxjnk/sddefault.webp)](https://youtu.be/CRSljunxjnk)

[Previous

Extracting Audio and Transcripts](#/extracting-audio-and-transcripts)

[Next

Correlation with Excel](#/correlation-with-excel)
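The "aggregate and pivot" idea in the list above can be sketched in plain Python (the city/sales rows are made up; in practice you would use a pandas `groupby` or a SQL `GROUP BY`):

```python
from collections import defaultdict

# Group-by aggregation: sum a measure (sales) per key (city).
rows = [("Delhi", 10), ("Mumbai", 20), ("Delhi", 5), ("Mumbai", 15)]

totals = defaultdict(int)
for city, sales in rows:
    totals[city] += sales

print(dict(totals))
```

The same pattern, keyed on two columns instead of one, is exactly what a pivot table computes.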
markdown_files/7._Data_Visualization.md ADDED
---
title: "7. Data Visualization"
original_url: "https://tds.s-anand.net/#/data-visualization?id=data-visualization"
downloaded_at: "2025-06-08T23:27:12.693601"
---

[Data visualization](#/data-visualization?id=data-visualization)
================================================================

[![Data visualization](https://i.ytimg.com/vi_webp/XkxRDql00UU/sddefault.webp)](https://youtu.be/XkxRDql00UU)

[Previous

Network Analysis in Python](#/network-analysis-in-python)

[Next

Visualizing Forecasts with Excel](#/visualizing-forecasts-with-excel)
markdown_files/AI_Code_Editors__GitHub_Copilot.md ADDED
---
title: "AI Code Editors: GitHub Copilot"
original_url: "https://tds.s-anand.net/#/github-copilot?id=ai-editor-github-copilot"
downloaded_at: "2025-06-08T23:26:20.399680"
---

[AI Editor: GitHub Copilot](#/github-copilot?id=ai-editor-github-copilot)
-------------------------------------------------------------------------

AI Code Editors like [GitHub Copilot](https://github.com/features/copilot), [Cursor](https://www.cursor.com/), [Windsurf](http://windsurf.com/), [Roo Code](https://roocode.com/), [Cline](https://cline.bot/), [Continue.dev](https://www.continue.dev/), etc. use LLMs to help you write code faster.

Most are built on top of [VS Code](#/vscode). These are now a standard tool in every developer’s toolkit.

[GitHub Copilot](https://github.com/features/copilot) is [free](https://github.com/features/copilot/plans) (as of May 2025) for 2,000 completions and 50 chats.

[![Getting started with GitHub Copilot | Tutorial (11 min)](https://i.ytimg.com/vi_webp/n0NlxUyA7FI/sddefault.webp)](https://youtu.be/n0NlxUyA7FI)

You should learn about:

* [Code Suggestions](https://docs.github.com/en/enterprise-cloud@latest/copilot/using-github-copilot/using-github-copilot-code-suggestions-in-your-editor), which is a basic feature.
* [Using Chat](https://docs.github.com/en/copilot/github-copilot-chat/using-github-copilot-chat-in-your-ide), which lets you code in natural language.
* [Changing the chat model](https://docs.github.com/en/copilot/using-github-copilot/ai-models/changing-the-ai-model-for-copilot-chat). The free version includes Claude 3.5 Sonnet, a good coding model.
* [Prompts](https://docs.github.com/en/copilot/copilot-chat-cookbook) to understand how people use AI code editors.

[Previous

Editor: VS Code](#/vscode)

[Next

Python tools: uv](#/uv)
markdown_files/AI_Terminal_Tools__llm.md ADDED
---
title: "AI Terminal Tools: llm"
original_url: "https://tds.s-anand.net/#/llm?id=llm-cli-llm"
downloaded_at: "2025-06-08T23:25:09.715323"
---

[LLM CLI: llm](#/llm?id=llm-cli-llm)
------------------------------------

[`llm`](https://pypi.org/project/llm) is a command-line utility for interacting with large language models—simplifying prompts, managing models and plugins, logging every conversation, and extracting structured data for pipelines.

[![Language models on the command-line w/ Simon Willison](https://i.ytimg.com/vi_webp/QUXQNi6jQ30/sddefault.webp)](https://youtu.be/QUXQNi6jQ30?t=100)

### [Basic Usage](#/llm?id=basic-usage)

[Install llm](https://github.com/simonw/llm#installation). Then set up your [`OPENAI_API_KEY`](https://platform.openai.com/api-keys) environment variable. See [Getting started](https://github.com/simonw/llm?tab=readme-ov-file#getting-started).

**TDS Students**: See [Large Language Models](#/large-language-models) for instructions on how to get and use `OPENAI_API_KEY`.

```
# Run a simple prompt
llm 'five great names for a pet pelican'

# Continue a conversation
llm -c 'now do walruses'

# Start a memory-aware chat session
llm chat

# Specify a model
llm -m gpt-4.1-nano 'Summarize tomorrow’s meeting agenda'

# Extract JSON output
llm 'List the top 5 Python viz libraries with descriptions' \
  --schema-multi 'name,description'
```

Or use llm without installation using [`uvx`](#/uv):

```
# Run llm via uvx without any prior installation
uvx llm 'Translate "Hello, world" into Japanese'

# Specify a model
uvx llm -m gpt-4.1-nano 'Draft a 200-word blog post on data ethics'

# Use structured JSON output
uvx llm 'List the top 5 programming languages in 2025 with their release years' \
  --schema-multi 'rank,language,release_year'
```

### [Key Features](#/llm?id=key-features)

* **Interactive prompts**: `llm '…'` — Fast shell access to any LLM.
* **Conversational flow**: `-c '…'` — Continue context across prompts.
* **Model switching**: `-m MODEL` — Use OpenAI, Anthropic, local models, and more.
* **Structured output**: `llm json` — Produce JSON for automation.
* **Logging & history**: `llm logs path` — Persist every prompt/response in SQLite.
* **Web UI**: `datasette "$(llm logs path)"` — Browse your entire history with Datasette.
* **Persistent chat**: `llm chat` — Keep the model in memory across multiple interactions.
* **Plugin ecosystem**: `llm install PLUGIN` — Add support for new models, data sources, or workflows. ([Language models on the command-line - Simon Willison’s Weblog](https://simonwillison.net/2024/Jun/17/cli-language-models/))

### [Practical Uses](#/llm?id=practical-uses)

* **Automated coding**. Generate code scaffolding, review helpers, or utilities on demand. For example, after running `llm install llm-cmd`, run `llm cmd 'Undo the last git commit'`. Inspired by [Simon’s post on using LLMs for rapid tool building](https://simonwillison.net/2025/Mar/11/using-llms-for-code/).
* **Transcript processing**. Summarize YouTube or podcast transcripts using Gemini. See [Putting Gemini 2.5 Pro through its paces](https://www.macstories.net/mac/llm-youtube-transcripts-with-claude-and-gemini-in-shortcuts/).
* **Commit messages**. Turn diffs into descriptive commit messages, e.g. `git diff | llm 'Write a concise git commit message explaining these changes'`.
* **Data extraction**. Convert free-text into structured JSON for automation. See [Structured data extraction from unstructured content using LLM schemas](https://simonwillison.net/2025/Feb/28/llm-schemas/).

[Previous

Terminal: Bash](#/bash)

[Next

Spreadsheet: Excel, Google Sheets](#/spreadsheets)
markdown_files/Actor_Network_Visualization.md ADDED
---
title: "Actor Network Visualization"
original_url: "https://tds.s-anand.net/#/actor-network-visualization?id=actor-network-visualization"
downloaded_at: "2025-06-08T23:23:12.679629"
---

[Actor Network Visualization](#/actor-network-visualization?id=actor-network-visualization)
-------------------------------------------------------------------------------------------

Find the shortest path between Govinda & Angelina Jolie from IMDb data using Python: [networkx](https://pypi.org/project/networkx/) or [scikit-network](https://pypi.org/project/scikit-network).

[![Jolie No. 1](https://i.ytimg.com/vi_webp/lcwMsPxPIjc/sddefault.webp)](https://youtu.be/lcwMsPxPIjc)

* [Notebook: How this video was created](https://github.com/sanand0/jolie-no-1/blob/master/jolie-no-1.ipynb)
* [The data used to visualize the network](https://github.com/sanand0/jolie-no-1/blob/master/imdb-actor-pairing.ipynb)
* [The shortest path between actors](https://github.com/sanand0/jolie-no-1/blob/master/shortest-path.ipynb)
* [IMDB data](https://developer.imdb.com/non-commercial-datasets/)
* [Codebase](https://github.com/sanand0/jolie-no-1)

[Previous

Data Visualization with ChatGPT](#/data-visualization-with-chatgpt)

[Next

RAWgraphs](#/rawgraphs)
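Under the hood, the shortest-path question is breadth-first search over a co-starring graph. A minimal sketch in plain Python (the intermediate actors and edges here are made up; the real notebooks build the graph from IMDb pairings and use networkx):

```python
from collections import deque

# Toy co-starring graph: an edge means the two actors appeared together.
graph = {
    "Govinda": ["Actor A"],
    "Actor A": ["Govinda", "Actor B"],
    "Actor B": ["Actor A", "Angelina Jolie"],
    "Angelina Jolie": ["Actor B"],
}

def shortest_path(graph, start, goal):
    """BFS: the first time we reach `goal`, the path is shortest."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # no connection

print(shortest_path(graph, "Govinda", "Angelina Jolie"))
```

With networkx the same query is `networkx.shortest_path(G, "Govinda", "Angelina Jolie")`.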
markdown_files/Authentication__Google_Auth.md ADDED
---
title: "Authentication: Google Auth"
original_url: "https://tds.s-anand.net/#/google-auth?id=google-authentication-with-fastapi"
downloaded_at: "2025-06-08T23:25:42.202598"
---

[Google Authentication with FastAPI](#/google-auth?id=google-authentication-with-fastapi)
-----------------------------------------------------------------------------------------

Secure your API endpoints using Google ID tokens to restrict access to specific email addresses.

[![🔥 Python FastAPI Google Login Tutorial | OAuth2 Authentication (19 min)](https://i.ytimg.com/vi_webp/4ExQYRCwbzw/sddefault.webp)](https://youtu.be/4ExQYRCwbzw)

Google Auth is the most commonly implemented single sign-on mechanism because:

* It’s popular and user-friendly. Users can log in with their existing Google accounts.
* It’s secure: Google supports OAuth2 and OpenID Connect to handle authentication.

Here’s how you build a FastAPI app that identifies the user.

1. Go to the [Google Cloud Console – Credentials](https://console.developers.google.com/apis/credentials) and click **Create Credentials > OAuth client ID**.
2. Choose **Web application**, set your authorized redirect URIs (e.g., `http://localhost:8000/`).
3. Copy the **Client ID** and **Client Secret** into a `.env` file:

```
GOOGLE_CLIENT_ID=your-client-id.apps.googleusercontent.com
GOOGLE_CLIENT_SECRET=your-client-secret
```

4. Create your FastAPI `app.py`:

```
# /// script
# dependencies = ["python-dotenv", "fastapi", "uvicorn", "itsdangerous", "httpx", "authlib"]
# ///

import os
from dotenv import load_dotenv
from fastapi import FastAPI, Request
from fastapi.responses import RedirectResponse
from starlette.middleware.sessions import SessionMiddleware
from authlib.integrations.starlette_client import OAuth

load_dotenv()
app = FastAPI()
app.add_middleware(SessionMiddleware, secret_key="create-a-random-secret-key")

oauth = OAuth()
oauth.register(
    name="google",
    client_id=os.getenv("GOOGLE_CLIENT_ID"),
    client_secret=os.getenv("GOOGLE_CLIENT_SECRET"),
    server_metadata_url="https://accounts.google.com/.well-known/openid-configuration",
    client_kwargs={"scope": "openid email profile"},
)

@app.get("/")
async def application(request: Request):
    user = request.session.get("user")
    # 3. For authenticated users: say hello
    if user:
        return f"Hello {user['email']}"
    # 2. For users who have just logged in, save their details in the session
    if "code" in request.query_params:
        token = await oauth.google.authorize_access_token(request)
        request.session["user"] = token["userinfo"]
        return RedirectResponse("/")
    # 1. For users who are logging in for the first time, redirect to Google login
    return await oauth.google.authorize_redirect(request, request.url)

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, port=8000)
```

Now, run `uv run app.py`.

1. When you visit <http://localhost:8000/> you’ll be redirected to a Google login page.
2. When you log in, you’ll be redirected back to <http://localhost:8000/>
3. Now you’ll see the email ID you logged in with.

Instead of displaying the email, you can show different content based on the user. For example:

* Allow access to specific users and not others
* Fetch the user’s personalized information
* Display different content based on the user

[Previous

Web Framework: FastAPI](#/fastapi)

[Next

Local LLMs: Ollama](#/ollama)
markdown_files/BBC_Weather_API_with_Python.md ADDED
---
title: "BBC Weather API with Python"
original_url: "https://tds.s-anand.net/#/bbc-weather-api-with-python?id=bbc-weather-location-id-with-python"
downloaded_at: "2025-06-08T23:24:13.538036"
---

[BBC Weather location ID with Python](#/bbc-weather-api-with-python?id=bbc-weather-location-id-with-python)
-----------------------------------------------------------------------------------------------------------

[![BBC Weather location API with Python](https://i.ytimg.com/vi_webp/IafLrvnamAw/sddefault.webp)](https://youtu.be/IafLrvnamAw)

You’ll learn how to get the location ID of any city from the BBC Weather API – as a precursor to scraping weather data – covering:

* **Understanding API Calls**: Learn how backend API calls work when searching for a city on the BBC weather website.
* **Inspecting Web Interactions**: Use the browser’s inspect element feature to track API calls and understand the network activity.
* **Extracting Location IDs**: Identify and extract the location ID from the API response using Python.
* **Using Python Libraries**: Import and use the `requests`, `json`, and `urlencode` libraries to make API calls and process responses.
* **Constructing API URLs**: Create structured API URLs dynamically with constant prefixes and query parameters using `urlencode`.
* **Building Functions**: Develop a Python function that accepts a city name, constructs the API call, and returns the location ID.

To open the browser Developer Tools on Chrome, Edge, or Firefox, you can:

* Right-click on the page and select “Inspect” to open the developer tools
* OR: Press `F12`
* OR: Press `Ctrl+Shift+I` on Windows
* OR: Press `Cmd+Opt+I` on Mac

Here are links and references:

* [BBC Location ID scraping - Notebook](https://colab.research.google.com/drive/1-iV-tbtRicKR_HXWeu4Hi5aXJCV3QdQp)
* [BBC Weather - Palo Alto (location ID: 5380748)](https://www.bbc.com/weather/5380748)
* [BBC Locator Service - Los Angeles](https://locator-service.api.bbci.co.uk/locations?api_key=AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv&stack=aws&locale=en&filter=international&place-types=settlement%2Cairport%2Cdistrict&order=importance&s=los%20angeles&a=true&format=json)
* Learn about the [`requests` package](https://docs.python-requests.org/en/latest/user/quickstart/). Watch [Python Requests Tutorial: Request Web Pages, Download Images, POST Data, Read JSON, and More](https://youtu.be/tb8gHvYlCFs)

[BBC Weather data with Python](#/bbc-weather-api-with-python?id=bbc-weather-data-with-python)
---------------------------------------------------------------------------------------------

[![Scrape BBC weather with Python](https://i.ytimg.com/vi_webp/Uc4DgQJDRoI/sddefault.webp)](https://youtu.be/Uc4DgQJDRoI)

You’ll learn how to scrape the live weather data of a city from the BBC Weather API, covering:

* **Introduction to Web Scraping**: Understand the basics of web scraping and its legality.
* **Libraries Overview**: Learn the importance of [`requests`](https://docs.python-requests.org/en/latest/user/quickstart/) and [`BeautifulSoup`](https://beautiful-soup-4.readthedocs.io/).
* **Fetching HTML**: Use [`requests`](https://docs.python-requests.org/en/latest/user/quickstart/) to fetch HTML content from a web page.
* **Parsing HTML**: Utilize [`BeautifulSoup`](https://beautiful-soup-4.readthedocs.io/) to parse and navigate the HTML content.
* **Identifying Data**: Inspect HTML elements to locate specific data (e.g., high and low temperatures).
* **Extracting Data**: Extract relevant data using [`BeautifulSoup`](https://beautiful-soup-4.readthedocs.io/)’s `find_all()` function.
* **Data Cleanup**: Clean extracted data to remove unwanted elements.
* **Post-Processing**: Use regular expressions to split large strings into meaningful parts.
* **Data Structuring**: Combine extracted data into a structured pandas DataFrame.
* **Handling Special Characters**: Replace unwanted characters for better data manipulation.
* **Saving Data**: Save the cleaned data into CSV and Excel formats.

Here are links and references:

* [BBC Weather scraping - Notebook](https://colab.research.google.com/drive/1-gkMzE-TKe3U_yh1v0NPn4TM687H2Hcf)
* [BBC Locator Service - Mumbai](https://locator-service.api.bbci.co.uk/locations?api_key=AGbFAKx58hyjQScCXIYrxuEwJh2W2cmv&stack=aws&locale=en&filter=international&place-types=settlement%2Cairport%2Cdistrict&order=importance&s=mumbai&a=true&format=json)
* [BBC Weather - Mumbai (location ID: 1275339)](https://www.bbc.com/weather/1275339)
* [BBC Weather API - Mumbai (location ID: 1275339)](https://weather-broker-cdn.api.bbci.co.uk/en/forecast/aggregated/1275339)
* Learn about the [`json` package](https://docs.python.org/3/library/json.html). Watch [Python Tutorial: Working with JSON Data using the json Module](https://youtu.be/9N6a-VLBa2I)
* Learn about the [`BeautifulSoup` package](https://beautiful-soup-4.readthedocs.io/). Watch [Python Tutorial: Web Scraping with BeautifulSoup and Requests](https://youtu.be/ng2o98k983k)
* Learn about the [`pandas` package](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html). Watch
  + [Python Pandas Tutorial (Part 1): Getting Started with Data Analysis - Installation and Loading Data](https://youtu.be/ZyhVh-qRZPA)
  + [Python Pandas Tutorial (Part 2): DataFrame and Series Basics - Selecting Rows and Columns](https://youtu.be/zmdjNSmRXF4)
* Learn about the [`re` package](https://docs.python.org/3/library/re.html). Watch [Python Tutorial: re Module - How to Write and Match Regular Expressions (Regex)](https://youtu.be/K8L6KVGG-7o)
* Learn about the [`datetime` package](https://docs.python.org/3/library/datetime.html). Watch [Python Tutorial: Datetime Module - How to work with Dates, Times, Timedeltas, and Timezones](https://youtu.be/eirjjyP2qcQ)

[Previous

Crawling with the CLI](#/crawling-cli)

[Next

Scraping IMDb with JavaScript](#/scraping-imdb-with-javascript)
markdown_files/Base_64_Encoding.md ADDED
@@ -0,0 +1,77 @@
1
+ ---
2
+ title: "Base 64 Encoding"
3
+ original_url: "https://tds.s-anand.net/#/base64-encoding?id=base-64-encoding"
4
+ downloaded_at: "2025-06-08T23:27:25.543180"
5
+ ---
6
+
7
+ [Base 64 Encoding](#/base64-encoding?id=base-64-encoding)
8
+ =========================================================
9
+
10
+ Base64 is a method to convert binary data into ASCII text. It’s essential when you need to transmit binary data through text-only channels or embed binary content in text formats.
11
+
12
+ Watch this quick explanation of how Base64 works (3 min):
13
+
14
+ [![What is Base64? (3 min)](https://i.ytimg.com/vi_webp/8qkxeZmKmOY/sddefault.webp)](https://youtu.be/8qkxeZmKmOY)
15
+
16
+ Here’s how it works:
17
+
18
+ * It takes 3 bytes (24 bits) and converts them into 4 ASCII characters
19
+ * … using 64 characters: A-Z, a-z, 0-9, + and / (padding with `=` to make the length a multiple of 4)
20
+ * There’s a URL-safe variant of Base64 that replaces `+` and `/` with `-` and `_` to avoid issues in URLs
21
+ * Base64 adds ~33% overhead (since every 3 bytes becomes 4 characters)
22
+
23
+ Common Python operations with Base64:
24
+
25
+ ```
26
+ import base64
27
+
28
+ # Basic encoding/decoding
29
+ text = "Hello, World!"
30
+ # Convert text to base64
31
+ encoded = base64.b64encode(text.encode()).decode() # SGVsbG8sIFdvcmxkIQ==
32
+ # Convert base64 back to text
33
+ decoded = base64.b64decode(encoded).decode() # Hello, World!
34
+ # Convert to URL-safe base64
35
+ url_safe = base64.urlsafe_b64encode(text.encode()).decode() # SGVsbG8sIFdvcmxkIQ==
36
+
37
+ # Working with binary files (e.g., images)
38
+ with open('image.png', 'rb') as f:
39
+ binary_data = f.read()
40
+ image_b64 = base64.b64encode(binary_data).decode()
41
+
42
+ # Data URI example (embed images in HTML/CSS)
43
+ data_uri = f"data:image/png;base64,{image_b64}"
44
+ ```
45
+
46
+ Data URIs allow embedding binary data directly in HTML/CSS. This reduces the number of HTTP requests and also works offline. But it increases the file size.
47
+
48
+ For example, here’s an SVG image embedded as a data URI:
49
+
50
+ ```
51
+ <img
52
+ src="data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHZpZXdCb3g9IjAgMCAzMiAzMiI+PGNpcmNsZSBjeD0iMTYiIGN5PSIxNiIgcj0iMTUiIGZpbGw9IiMyNTYzZWIiLz48cGF0aCBmaWxsPSIjZmZmIiBkPSJtMTYgNyAyIDcgNyAyLTcgMi0yIDctMi03LTctMiA3LTJaIi8+PC9zdmc+"
53
+ />
54
+ ```
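A data URI like the one above takes only a couple of lines to generate. A minimal sketch (the SVG markup here is a made-up placeholder icon, not the one in the example above):

```python
import base64

# A hypothetical, tiny SVG icon
svg = '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 10 10"><circle cx="5" cy="5" r="4"/></svg>'

# Base64-encode the bytes and prepend the data: URI scheme and MIME type
b64 = base64.b64encode(svg.encode()).decode()
data_uri = f"data:image/svg+xml;base64,{b64}"

print(data_uri[:26])  # data:image/svg+xml;base64,
```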
55
+
56
+ Base64 is used in many places:
57
+
58
+ * JSON: Encoding binary data in JSON payloads
59
+ * Email: MIME attachments encoding
60
+ * Auth: HTTP Basic Authentication headers
61
+ * JWT: Encoding tokens in web authentication
62
+ * SSL/TLS: PEM certificate format
63
+ * SAML: Encoding assertions in SSO
64
+ * Git: Encoding binary files in patches
65
+
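For instance, an HTTP Basic Authentication header is just `user:password` Base64-encoded. A minimal sketch with made-up credentials:

```python
import base64

user, password = "alice", "secret"  # hypothetical credentials
# Basic auth encodes "user:password" and prefixes it with "Basic "
token = base64.b64encode(f"{user}:{password}".encode()).decode()
headers = {"Authorization": f"Basic {token}"}

print(headers["Authorization"])  # Basic YWxpY2U6c2VjcmV0
```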
66
+ Tools for working with Base64:
67
+
68
+ * [Base64 Decoder/Encoder](https://www.base64decode.org/) for online encoding/decoding
69
+ * [data: URI Generator](https://dopiaza.org/tools/datauri/index.php) converts files to Data URIs
70
+
71
+ [Previous
72
+
73
+ LLM Text Extraction](#/llm-text-extraction)
74
+
75
+ [Next
76
+
77
+ Vision Models](#/vision-models)
markdown_files/Browser__DevTools.md ADDED
@@ -0,0 +1,69 @@
1
+ ---
2
+ title: "Browser: DevTools"
3
+ original_url: "https://tds.s-anand.net/#/devtools?id=browser-devtools"
4
+ downloaded_at: "2025-06-08T23:21:14.785028"
5
+ ---
6
+
7
+ [Browser: DevTools](#/devtools?id=browser-devtools)
8
+ ---------------------------------------------------
9
+
10
+ [Chrome DevTools](https://developer.chrome.com/docs/devtools/overview/) is the de facto standard for web development and data analysis in the browser.
11
+ You’ll use this a lot when debugging and inspecting web pages.
12
+
13
+ Here are the key features you’ll use most:
14
+
15
+ 1. **Elements Panel**
16
+
17
+ * Inspect and modify HTML/CSS in real-time
18
+ * Copy CSS selectors for web scraping
19
+ * Debug layout issues with the Box Model
20
+
21
+ ```
22
+ // Copy the selected element in the Console
23
+ copy($0); // Copies the currently selected element
24
+ ```
25
+ 2. **Console Panel**
26
+
27
+ * JavaScript REPL environment
28
+ * Log and debug data
29
+ * Common console methods:
30
+
31
+ ```
32
+ console.table(data); // Display data in table format
33
+ console.group("Name"); // Group related logs
34
+ console.time("Label"); // Measure execution time
35
+ ```
36
+ 3. **Network Panel**
37
+
38
+ * Monitor API requests and responses
39
+ * Simulate slow connections
40
+ * Right-click on a request and select “Copy as fetch” to get the request.
41
+ 4. **Essential Keyboard Shortcuts**
42
+
43
+ * `Ctrl+Shift+I` (Windows) / `Cmd+Opt+I` (Mac): Open DevTools
44
+ * `Ctrl+Shift+C`: Select element to inspect
45
+ * `Ctrl+L`: Clear console
46
+ * `$0`: Reference currently selected element
47
+ * `$$('selector')`: Query selector all (returns array)
48
+
49
+ Videos from Chrome Developers (37 min total):
50
+
51
+ * [Fun & powerful: Intro to Chrome DevTools](https://youtu.be/t1c5tNPpXjs) (5 min)
52
+ * [Different ways to open Chrome DevTools](https://youtu.be/X65TAP8a530) (5 min)
53
+ * [Faster DevTools navigation with shortcuts and settings](https://youtu.be/xHusjrb_34A) (3 min)
54
+ * [How to log messages in the Console](https://youtu.be/76U0gtuV9AY) (6 min)
55
+ * [How to speed up your workflow with Console shortcuts](https://youtu.be/hdRDTj6ObiE) (6 min)
56
+ * [HTML vs DOM? Let’s debug them](https://youtu.be/J-02VNxE7lE) (5 min)
57
+ * [Caching demystified: Inspect, clear, and disable caches](https://youtu.be/mSMb-aH6sUw) (7 min)
62
+
63
+ [Previous
64
+
65
+ Unicode](#/unicode)
66
+
67
+ [Next
68
+
69
+ CSS Selectors](#/css-selectors)
markdown_files/CI_CD__GitHub_Actions.md ADDED
@@ -0,0 +1,79 @@
1
+ ---
2
+ title: "CI/CD: GitHub Actions"
3
+ original_url: "https://tds.s-anand.net/#/github-actions?id=cicd-github-actions"
4
+ downloaded_at: "2025-06-08T23:24:27.252899"
5
+ ---
6
+
7
+ [CI/CD: GitHub Actions](#/github-actions?id=cicd-github-actions)
8
+ ----------------------------------------------------------------
9
+
10
+ [GitHub Actions](https://github.com/features/actions) is a powerful automation platform built into GitHub. It helps automate your development workflow - running tests, deploying applications, updating datasets, retraining models, etc.
11
+
12
+ * Understand the basics of [YAML configuration files](https://docs.github.com/en/actions/writing-workflows/quickstart)
13
+ * Explore the [pre-built actions from the marketplace](https://github.com/marketplace?type=actions)
14
+ * How to [handle secrets securely](https://docs.github.com/en/actions/security-for-github-actions/security-guides/using-secrets-in-github-actions)
15
+ * [Triggering a workflow](https://docs.github.com/en/actions/writing-workflows/choosing-when-your-workflow-runs/triggering-a-workflow)
16
+ * Staying within the [free tier limits](https://docs.github.com/en/billing/managing-billing-for-your-products/managing-billing-for-github-actions/about-billing-for-github-actions)
17
+ * [Caching dependencies to speed up workflows](https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/caching-dependencies-to-speed-up-workflows)
18
+
19
+ Here is a sample `.github/workflows/iss-location.yml` that runs daily, appends the International Space Station location data into `iss-location.jsonl`, and commits it to the repository.
20
+
21
+ ```
22
+ name: Log ISS Location Data Daily
23
+
24
+ on:
25
+ schedule:
26
+ # Runs at 12:00 UTC (noon) every day
27
+ - cron: "0 12 * * *"
28
+ workflow_dispatch: # Allows manual triggering
29
+
30
+ jobs:
31
+ collect-iss-data:
32
+ runs-on: ubuntu-latest
33
+ permissions:
34
+ contents: write
35
+
36
+ steps:
37
+ - name: Checkout repository
38
+ uses: actions/checkout@v4
39
+
40
+ - name: Install uv
41
+ uses: astral-sh/setup-uv@v5
42
+
43
+ - name: Fetch ISS location data
44
+ run: | # python
45
+ uv run --with requests python << 'EOF'
46
+ import requests
47
+
48
+ data = requests.get('http://api.open-notify.org/iss-now.json').text
49
+ with open('iss-location.jsonl', 'a') as f:
50
+ f.write(data + '\n')
51
+ EOF
52
+
53
+ - name: Commit and push changes
54
+ run: | # shell
55
+ git config --local user.email "github-actions[bot]@users.noreply.github.com"
56
+ git config --local user.name "github-actions[bot]"
57
+ git add iss-location.jsonl
58
+ git commit -m "Update ISS position data [skip ci]" || exit 0
59
+ git push
60
+ ```
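You can dry-run the data-collection step locally before wiring it into the workflow. A sketch that fakes the API response so it runs offline (the real step fetches `http://api.open-notify.org/iss-now.json` instead):

```python
import json

# Faked API response (the workflow fetches this from api.open-notify.org)
data = json.dumps({"iss_position": {"latitude": "12.3456", "longitude": "65.4321"}})

# Append one JSON object per line -- the JSON Lines format the workflow commits
with open("iss-location.jsonl", "a") as f:
    f.write(data + "\n")

# Reading the log back is one json.loads per line
records = [json.loads(line) for line in open("iss-location.jsonl")]
print(records[-1]["iss_position"]["latitude"])  # 12.3456
```

Appending (rather than overwriting) is what turns a daily cron run into a growing time series in the repository.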
61
+
62
+ Tools:
63
+
64
+ * [GitHub CLI](https://cli.github.com/): Manage workflows from terminal
65
+ * [Super-Linter](https://github.com/github/super-linter): Validate code style
66
+ * [Release Drafter](https://github.com/release-drafter/release-drafter): Automate releases
67
+ * [act](https://github.com/nektos/act): Run actions locally
68
+
69
+ [![Github Actions CI/CD - Everything you need to know to get started](https://i.ytimg.com/vi_webp/mFFXuXjVgkU/sddefault.webp)](https://youtu.be/mFFXuXjVgkU)
70
+
71
+ * [How to handle secrets in GitHub Actions](https://youtu.be/1tD7km5jK70)
72
+
73
+ [Previous
74
+
75
+ Serverless hosting: Vercel](#/vercel)
76
+
77
+ [Next
78
+
79
+ Containers: Docker, Podman](#/docker)
markdown_files/CORS.md ADDED
@@ -0,0 +1,88 @@
1
+ ---
2
+ title: "CORS"
3
+ original_url: "https://tds.s-anand.net/#/cors?id=cors-cross-origin-resource-sharing"
4
+ downloaded_at: "2025-06-08T23:26:45.742238"
5
+ ---
6
+
7
+ [CORS: Cross-Origin Resource Sharing](#/cors?id=cors-cross-origin-resource-sharing)
8
+ -----------------------------------------------------------------------------------
9
+
10
+ CORS (Cross-Origin Resource Sharing) is a security mechanism that controls how web browsers handle requests between different origins (domains, protocols, or ports). Data scientists need CORS for APIs serving data or analysis to a browser on a different domain.
11
+
12
+ Watch this practical explanation of CORS (3 min):
13
+
14
+ [![CORS in 100 Seconds](https://i.ytimg.com/vi_webp/4KHiSt0oLJ0/sddefault.webp)](https://youtu.be/4KHiSt0oLJ0)
15
+
16
+ Key CORS concepts:
17
+
18
+ * **Same-Origin Policy**: Browsers block requests between different origins by default
19
+ * **CORS Headers**: Server responses must include specific headers to allow cross-origin requests
20
+ * **Preflight Requests**: Browsers send OPTIONS requests to check if the actual request is allowed
21
+ * **Credentials**: Special handling required for requests with cookies or authentication
22
+
23
+ If you’re exposing your API with a GET request publicly, the only thing you need to do is set the HTTP header `Access-Control-Allow-Origin: *`.
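As a sketch of what that looks like server-side — using only Python's standard library rather than FastAPI — here's a toy server that adds the header to every GET response, plus a request that verifies it:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class CORSHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps({"ok": True}).encode()
        self.send_response(200)
        self.send_header("Access-Control-Allow-Origin", "*")  # allow any origin
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), CORSHandler)  # port 0 = pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

resp = urllib.request.urlopen(f"http://127.0.0.1:{server.server_port}/")
print(resp.headers["Access-Control-Allow-Origin"])  # *
server.shutdown()
```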
24
+
25
+ Here are other common CORS headers:
26
+
27
+ ```
28
+ Access-Control-Allow-Origin: https://example.com
29
+ Access-Control-Allow-Methods: GET, POST, PUT, DELETE
30
+ Access-Control-Allow-Headers: Content-Type, Authorization
31
+ Access-Control-Allow-Credentials: true
32
+ ```
33
+
34
+ To implement CORS in FastAPI, use the [`CORSMiddleware` middleware](https://fastapi.tiangolo.com/tutorial/cors/):
35
+
36
+ ```
37
+ from fastapi import FastAPI
38
+ from fastapi.middleware.cors import CORSMiddleware
39
+
40
+ app = FastAPI()
41
+
42
+ app.add_middleware(CORSMiddleware, allow_origins=["*"]) # Allow GET requests from all origins
43
+ # Or, provide more granular control:
44
+ app.add_middleware(
45
+ CORSMiddleware,
46
+ allow_origins=["https://example.com"], # Allow a specific domain
47
+ allow_credentials=True, # Allow cookies
48
+ allow_methods=["GET", "POST", "PUT", "DELETE"], # Allow specific methods
49
+ allow_headers=["*"], # Allow all headers
50
+ )
51
+ ```
52
+
53
+ Testing CORS with JavaScript:
54
+
55
+ ```
56
+ // Simple request
57
+ const response = await fetch("https://api.example.com/data", {
58
+ method: "GET",
59
+ headers: { "Content-Type": "application/json" },
60
+ });
61
+
62
+ // Request with credentials
63
+ const response = await fetch("https://api.example.com/data", {
64
+ credentials: "include",
65
+ headers: { "Content-Type": "application/json" },
66
+ });
67
+ ```
68
+
69
+ Useful CORS debugging tools:
70
+
71
+ * [CORS Checker](https://cors-test.codehappy.dev/): Test CORS configurations
72
+ * Browser DevTools Network tab: Inspect CORS headers and preflight requests
73
+ * [cors-anywhere](https://github.com/Rob--W/cors-anywhere): CORS proxy for development
74
+
75
+ Common CORS errors and solutions:
76
+
77
+ * `No 'Access-Control-Allow-Origin' header`: Configure server to send proper CORS headers
78
+ * `Request header field not allowed`: Add required headers to `Access-Control-Allow-Headers`
79
+ * `Credentials flag`: Set both `credentials: 'include'` and `Access-Control-Allow-Credentials: true`
80
+ * `Wild card error`: Cannot use `*` with credentials; specify exact origins
81
+
82
+ [Previous
83
+
84
+ Tunneling: ngrok](#/ngrok)
85
+
86
+ [Next
87
+
88
+ REST APIs](#/rest-apis)
markdown_files/CSS_Selectors.md ADDED
@@ -0,0 +1,39 @@
1
+ ---
2
+ title: "CSS Selectors"
3
+ original_url: "https://tds.s-anand.net/#/css-selectors?id=css-selectors"
4
+ downloaded_at: "2025-06-08T23:24:42.527184"
5
+ ---
6
+
7
+ [CSS Selectors](#/css-selectors?id=css-selectors)
8
+ -------------------------------------------------
9
+
10
+ CSS selectors are patterns used to select and style HTML elements on a web page. They are fundamental to web development and data scraping, allowing you to precisely target elements for styling or extraction.
11
+
12
+ For data scientists, understanding CSS selectors is crucial when:
13
+
14
+ * Web scraping with tools like Beautiful Soup or Scrapy
15
+ * Selecting elements for browser automation with Selenium
16
+ * Styling data visualizations and web applications
17
+ * Debugging website issues using browser DevTools
18
+
19
+ Watch this comprehensive introduction to CSS selectors (20 min):
20
+
21
+ [![Learn Every CSS Selector In 20 Minutes (20 min)](https://i.ytimg.com/vi_webp/l1mER1bV0N0/sddefault.webp)](https://youtu.be/l1mER1bV0N0)
22
+
23
+ The Mozilla Developer Network (MDN) provides detailed documentation on the three main types of selectors:
24
+
25
+ * [Basic CSS selectors](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Basic_selectors): Learn about element (`div`), class (`.container`), ID (`#header`), and universal (`*`) selectors
26
+ * [Attribute selectors](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Attribute_selectors): Target elements based on their attributes or attribute values (`[type="text"]`)
27
+ * [Combinators](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Combinators): Use relationships between elements (`div > p`, `div + p`, `div ~ p`)
28
+
29
+ Practice your CSS selector skills with this interactive tool:
30
+
31
+ * [CSS Diner](https://flukeout.github.io/): A fun game that teaches CSS selectors through increasingly challenging levels
32
+
33
+ [Previous
34
+
35
+ Browser: DevTools](#/devtools)
36
+
37
+ [Next
38
+
39
+ JSON](#/json)
markdown_files/Cleaning_Data_with_OpenRefine.md ADDED
@@ -0,0 +1,31 @@
1
+ ---
2
+ title: "Cleaning Data with OpenRefine"
3
+ original_url: "https://tds.s-anand.net/#/cleaning-data-with-openrefine?id=cleaning-data-with-openrefine"
4
+ downloaded_at: "2025-06-08T23:26:48.911609"
5
+ ---
6
+
7
+ [Cleaning Data with OpenRefine](#/cleaning-data-with-openrefine?id=cleaning-data-with-openrefine)
8
+ -------------------------------------------------------------------------------------------------
9
+
10
+ [![Cleaning data with OpenRefine](https://i.ytimg.com/vi_webp/zxEtfHseE84/sddefault.webp)](https://youtu.be/zxEtfHseE84)
11
+
12
+ This session covers the use of OpenRefine for data cleaning, focusing on resolving entity discrepancies:
13
+
14
+ * **Data Upload and Project Creation**: Import data into OpenRefine and create a new project for analysis.
15
+ * **Faceting Data**: Use text facets to group similar entries and identify frequency of address crumbs.
16
+ * **Clustering Methodology**: Apply clustering algorithms to merge similar entries with minor differences, such as punctuation.
17
+ * **Manual and Automated Clustering**: Learn to merge clusters manually or in one go, trusting the system’s clustering accuracy.
18
+ * **Entity Resolution**: Clean and save the data by resolving multiple versions of the same entity using OpenRefine.
19
+
20
+ Here are links used in the video:
21
+
22
+ * [OpenRefine software](https://openrefine.org)
23
+ * [Dataset for OpenRefine](https://drive.google.com/file/d/1ccu0Xxk8UJUa2Dz4lihmvzhLjvPy42Ai/view)
24
+
25
+ [Previous
26
+
27
+ Data Preparation in the Editor](#/data-preparation-in-the-editor)
28
+
29
+ [Next
30
+
31
+ Profiling Data with Python](#/profiling-data-with-python)
markdown_files/Containers__Docker,_Podman.md ADDED
@@ -0,0 +1,94 @@
1
+ ---
2
+ title: "Containers: Docker, Podman"
3
+ original_url: "https://tds.s-anand.net/#/docker?id=containers-docker-podman"
4
+ downloaded_at: "2025-06-08T23:26:01.579602"
5
+ ---
6
+
7
+ [Containers: Docker, Podman](#/docker?id=containers-docker-podman)
8
+ ------------------------------------------------------------------
9
+
10
+ [Docker](https://www.docker.com/) and [Podman](https://podman.io/) are containerization tools that package your application and its dependencies into a standardized unit for software development and deployment.
11
+
12
+ Docker is the industry standard. Podman is compatible with Docker and has better security (and a slightly more open license). In this course, we recommend Podman but Docker works in the same way.
13
+
14
+ Initialize the container engine:
15
+
16
+ ```
17
+ podman machine init
18
+ podman machine start
19
+ ```
20
+
21
+ Common operations (you can use `docker` in place of `podman` in all of these):
22
+
23
+ ```
24
+ # Pull an image
25
+ podman pull python:3.11-slim
26
+
27
+ # Run a container
28
+ podman run -it python:3.11-slim
29
+
30
+ # List containers
31
+ podman ps -a
32
+
33
+ # Stop container
34
+ podman stop container_id
35
+
36
+ # Scan image for vulnerabilities (podman has no built-in scan subcommand; use a scanner like Trivy)
37
+ trivy image myapp:latest
38
+
39
+ # Remove container
40
+ podman rm container_id
41
+
42
+ # Remove all stopped containers
43
+ podman container prune
44
+ ```
45
+
46
+ You can create a `Dockerfile` to build a container image. Here’s a sample `Dockerfile` that converts a Python script into a container image.
47
+
48
+ ```
49
+ FROM python:3.11-slim
50
+ # Set working directory
51
+ WORKDIR /app
52
+ # Typically, you would use `COPY . .` to copy files from the host machine,
53
+ # but here we're just using a simple script.
54
+ RUN echo 'print("Hello, world!")' > app.py
55
+ # Run the script
56
+ CMD ["python", "app.py"]
57
+ ```
58
+
59
+ To build, run, and deploy the container, run these commands:
60
+
61
+ ```
62
+ # Create an account on https://hub.docker.com/ and then login
63
+ podman login docker.io
64
+
65
+ # Build and run the container
66
+ podman build -t py-hello .
67
+ podman run -it py-hello
68
+
69
+ # Push the container to Docker Hub. Replace $DOCKER_HUB_USERNAME with your Docker Hub username.
70
+ podman push py-hello:latest docker.io/$DOCKER_HUB_USERNAME/py-hello
71
+
72
+ # Push adding a specific tag, e.g. dev
73
+ TAG=dev
+ podman push py-hello docker.io/$DOCKER_HUB_USERNAME/py-hello:$TAG
74
+ ```
75
+
76
+ Tools:
77
+
78
+ * [Dive](https://github.com/wagoodman/dive): Explore image layers
79
+ * [Skopeo](https://github.com/containers/skopeo): Work with container images
80
+ * [Trivy](https://github.com/aquasecurity/trivy): Security scanner
81
+
82
+ [![Podman Tutorial Zero to Hero | Full 1 Hour Course](https://i.ytimg.com/vi_webp/YXfA5O5Mr18/sddefault.webp)](https://youtu.be/YXfA5O5Mr18)
83
+
84
+ [![Learn Docker in 7 Easy Steps - Full Beginner's Tutorial](https://i.ytimg.com/vi_webp/gAkwW2tuIqE/sddefault.webp)](https://youtu.be/gAkwW2tuIqE)
85
+
86
+ * Optional: For Windows, see [WSL 2 with Docker getting started](https://youtu.be/5RQbdMn04Oc)
87
+
88
+ [Previous
89
+
90
+ CI/CD: GitHub Actions](#/github-actions)
91
+
92
+ [Next
93
+
94
+ DevContainers: GitHub Codespaces](#/github-codespaces)
markdown_files/Convert_HTML_to_Markdown.md ADDED
@@ -0,0 +1,183 @@
1
+ ---
2
+ title: "Convert HTML to Markdown"
3
+ original_url: "https://tds.s-anand.net/#/convert-html-to-markdown?id=markdown-crawler"
4
+ downloaded_at: "2025-06-08T23:25:38.805247"
5
+ ---
6
+
7
+ [Converting HTML to Markdown](#/convert-html-to-markdown?id=converting-html-to-markdown)
8
+ ----------------------------------------------------------------------------------------
9
+
10
+ When working with web content, converting HTML files to plain text or Markdown is a common requirement for content extraction, analysis, and preservation. For example:
11
+
12
+ * **Content analysis**: Extract clean text from HTML for natural language processing
13
+ * **Data mining**: Strip formatting to focus on the actual content
14
+ * **Offline reading**: Convert web pages to readable formats for e-readers or offline consumption
15
+ * **Content migration**: Move content between different CMS platforms
16
+ * **SEO analysis**: Extract headings, content structure, and text for optimization
17
+ * **Archive creation**: Store web content in more compact, preservation-friendly formats
18
+ * **Accessibility**: Convert content to formats that work better with screen readers
19
+
20
+ This tutorial covers both converting existing HTML files and combining web crawling with HTML-to-text conversion in a single workflow – all using the command line.
21
+
22
+ ### [defuddle-cli](#/convert-html-to-markdown?id=defuddle-cli)
23
+
24
+ [defuddle-cli](https://github.com/kepano/defuddle) specializes in HTML to Markdown conversion. It’s a bit slow and not very customizable but produces clean Markdown that preserves structure, links, and basic formatting. Best for content where preserving the document structure is important.
25
+
26
+ ```
27
+ find . -name '*.html' -exec npx --package defuddle-cli -y defuddle parse {} --md -o {}.md \;
28
+ ```
29
+
30
+ * `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories
31
+ * `-exec ... \;`: Executes the following command for each file found
32
+ * `npx --package defuddle-cli -y`: Installs and runs defuddle-cli without prompting
33
+ * `defuddle parse {} --md`: Parses the HTML file (represented by `{}`) and converts to markdown
34
+ * `-o {}.md`: Outputs to a file with the original name plus .md extension
35
+
36
+ ### [Pandoc](#/convert-html-to-markdown?id=pandoc)
37
+
38
+ [Pandoc](https://pandoc.org/) is a bit slow and highly customizable, preserving almost all formatting elements, leading to verbose markdown. Best for academic or documentation conversion where precision matters.
39
+
40
+ Pandoc can also convert between Markdown and many other formats (such as Word, LaTeX, and HTML — and it can write PDF), making it one of the most popular and versatile document converters.
41
+
42
+ [![How to Convert a Word Document to Markdown for Free using Pandoc (12 min)](https://i.ytimg.com/vi/HPSK7q13-40/sddefault.jpg)](https://youtu.be/HPSK7q13-40)
43
+
44
+ ```
45
+ find . -name '*.html' -exec pandoc -f html -t markdown_strict -o {}.md {} \;
46
+ ```
47
+
48
+ * `find . -name '*.html'`: Finds all HTML files in the current directory and subdirectories
49
+ * `-exec ... \;`: Executes the following command for each file found
50
+ * `pandoc`: The Swiss Army knife of document conversion
51
+ * `-f html -t markdown_strict`: Convert from HTML format to strict markdown
52
+ * `-o {}.md {}`: Output to a markdown file, with the input file as the last argument
53
+
54
+ ### [Lynx](#/convert-html-to-markdown?id=lynx)
55
+
56
+ [Lynx](https://lynx.invisible-island.net/) is fast and generates text (not Markdown) with minimal formatting. Lynx renders the HTML as it would appear in a text browser, preserving basic structure but losing complex formatting. Best for quick content extraction or when processing large numbers of files.
57
+
58
+ ```
59
+ find . -type f -name '*.html' -exec sh -c 'for f; do lynx -dump -nolist "$f" > "${f%.html}.txt"; done' _ {} +
60
+ ```
61
+
62
+ * `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories
63
+ * `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency
64
+ * `for f; do ... done`: Loops through each file in the batch
65
+ * `lynx -dump -nolist "$f"`: Uses the lynx text browser to render HTML as plain text
66
+ + `-dump`: Output the rendered page to stdout
67
+ + `-nolist`: Don’t include the list of links at the end
68
+ * `> "${f%.html}.txt"`: Save output to a .txt file with the same base name
69
+
70
+ ### [w3m](#/convert-html-to-markdown?id=w3m)
71
+
72
+ [w3m](https://w3m.sourceforge.net/) is very slow and produces plain text with minimal formatting. w3m tends to be more thorough in its rendering than lynx but takes considerably longer. It supports basic JavaScript processing, making it better at handling modern websites with dynamic content. Best for cases where you need slightly better rendering than lynx, particularly for complex layouts and tables, and when some JavaScript processing is beneficial.
73
+
74
+ ```
75
+ find . -type f -name '*.html' \
76
+ -exec sh -c 'for f; do \
77
+ w3m -dump -T text/html -cols 80 -no-graph "$f" > "${f%.html}.md"; \
78
+ done' _ {} +
79
+ ```
80
+
81
+ * `find . -type f -name '*.html'`: Finds all HTML files in the current directory and subdirectories
82
+ * `-exec sh -c '...' _ {} +`: Executes a shell command with batched files for efficiency
83
+ * `for f; do ... done`: Loops through each file in the batch
84
+ * `w3m -dump -T text/html -cols 80 -no-graph "$f"`: Uses the w3m text browser to render HTML
85
+ + `-dump`: Output the rendered page to stdout
86
+ + `-T text/html`: Specify input format as HTML
87
+ + `-cols 80`: Set output width to 80 columns
88
+ + `-no-graph`: Don’t show graphic characters for tables and frames
89
+ * `> "${f%.html}.md"`: Save output to a .md file with the same base name
90
+
91
+ ### [Comparison](#/convert-html-to-markdown?id=comparison)
92
+
93
+ | Approach | Speed | Format Quality | Preservation | Best For |
94
+ | --- | --- | --- | --- | --- |
95
+ | defuddle-cli | Slow | High | Good structure and links | Content migration, publishing |
96
+ | pandoc | Slow | Very High | Almost everything | Academic papers, documentation |
97
+ | lynx | Fast | Low | Basic structure only | Quick extraction, large batches |
98
+ | w3m | Very Slow | Medium-Low | Basic structure with better tables | Improved readability over lynx |
99
+
100
+ ### [Optimize Batch Processing](#/convert-html-to-markdown?id=optimize-batch-processing)
101
+
102
+ 1. **Process in parallel**: Use GNU Parallel for multi-core processing:
103
+
104
+ ```
105
+ find . -name "*.html" | parallel "pandoc -f html -t markdown_strict -o {}.md {}"
106
+ ```
107
+ 2. **Filter files before processing**:
108
+
109
+ ```
110
+ find . -name "*.html" -type f -size -1M -exec pandoc -f html -t markdown {} -o {}.md \;
111
+ ```
112
+ 3. **Customize output format** with additional parameters:
113
+
114
+ ```
115
+ # For pandoc, preserve line breaks but simplify other formatting
116
+ find . -name "*.html" -exec pandoc -f html -t markdown --wrap=preserve --atx-headers {} -o {}.md \;
117
+ ```
118
+ 4. **Handle errors gracefully**:
119
+
120
+ ```
121
+ find . -name "*.html" -exec sh -c 'for f; do pandoc -f html -t markdown "$f" -o "${f%.html}.md" 2>/dev/null || echo "Failed: $f" >> conversion_errors.log; done' _ {} +
122
+ ```
123
+
124
+ ### [Choosing the Right Tool](#/convert-html-to-markdown?id=choosing-the-right-tool)
125
+
126
+ * **Need speed with minimal formatting?** Use the lynx approach
127
+ * **Need precise, complete conversion?** Use pandoc
128
+ * **Need a balance of structure and cleanliness?** Try defuddle-cli
129
+ * **Working with complex tables?** w3m might render them better
130
+
131
+ Remember that the best approach depends on your specific use case, volume of files, and how you intend to use the converted text.
132
+
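If none of these tools are installed, Python's standard library can do a crude version of the lynx-style text dump. A minimal sketch (not a replacement for the tools above — it ignores layout entirely and just collects text nodes):

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # nesting depth inside script/style tags

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

parser = TextExtractor()
parser.feed("<h1>Title</h1><script>var x = 1;</script><p>Hello <b>world</b></p>")
print(" ".join(parser.parts))  # Title Hello world
```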
133
+ ### [Combined Crawling and Conversion](#/convert-html-to-markdown?id=combined-crawling-and-conversion)
134
+
135
+ Sometimes you need to both crawl a website and convert its content to markdown or text in a single workflow, using tools like [Crawl4AI](#/convert-html-to-markdown?id=crawl4ai) or [markdown-crawler](#/convert-html-to-markdown?id=markdown-crawler).
136
+
137
+ 1. **For research/data collection**: Use a specialized crawler (like Crawl4AI) with post-processing conversion
138
+ 2. **For simple website archiving**: Markdown-crawler provides a convenient all-in-one solution
139
+ 3. **For high-quality conversion**: Use wget/wget2 for crawling followed by pandoc for conversion
140
+ 4. **For maximum speed**: Combine wget with lynx in a pipeline
141
+
142
+ ### [Crawl4AI](#/convert-html-to-markdown?id=crawl4ai)
143
+
144
+ [Crawl4AI](https://github.com/crawl4ai/crawl4ai) is designed for single-page extraction with high-quality content processing. Crawl4AI is optimized for AI training data extraction, focusing on clean, structured content rather than complete site preservation. It excels at removing boilerplate content and preserving the main article text.
145
+
146
+ ```
147
+ uv venv
148
+ source .venv/bin/activate.fish
149
+ uv pip install crawl4ai
150
+ crawl4ai-setup
151
+ ```
152
+
153
+ * `uv venv`: Creates a Python virtual environment using uv (a faster alternative to virtualenv)
154
+ * `source .venv/bin/activate.fish`: Activates the virtual environment (fish shell syntax)
155
+ * `uv pip install crawl4ai`: Installs the crawl4ai package
156
+ * `crawl4ai-setup`: Initializes crawl4ai’s required dependencies
157
+
158
+ ### [markdown-crawler](#/convert-html-to-markdown?id=markdown-crawler)
159
+
160
+ [markdown-crawler](https://pypi.org/project/markdown-crawler/) combines web crawling with markdown conversion in one tool. It’s efficient for bulk processing but tends to produce lower-quality markdown conversion compared to specialized converters like pandoc or defuddle. Best for projects where quantity and integration are more important than perfect formatting.
161
+
162
+ ```
163
+ uv venv
164
+ source .venv/bin/activate.fish
165
+ uv pip install markdown-crawler
166
+ markdown-crawler -t 5 -d 3 -b ./markdown https://study.iitm.ac.in/ds/
167
+ ```
168
+
169
+ * `uv venv` and activation: Same as above
170
+ * `uv pip install markdown-crawler`: Installs the markdown-crawler package
171
+ * `markdown-crawler`: Runs the crawler with these options:
172
+ + `-t 5`: Sets 5 threads for parallel crawling
173
+ + `-d 3`: Limits crawl depth to 3 levels
174
+ + `-b ./markdown`: Sets the base output directory
175
+ + Final argument is the starting URL
176
+
177
+ [Previous
178
+
179
+ Convert PDFs to Markdown](#/convert-pdfs-to-markdown)
180
+
181
+ [Next
182
+
183
+ LLM Website Scraping](#/llm-website-scraping)
markdown_files/Convert_PDFs_to_Markdown.md ADDED
@@ -0,0 +1,139 @@
1
+ ---
2
+ title: "Convert PDFs to Markdown"
3
+ original_url: "https://tds.s-anand.net/#/convert-pdfs-to-markdown?id=tips-for-optimal-pdf-conversion"
4
+ downloaded_at: "2025-06-08T23:25:59.398450"
5
+ ---
6
+
7
+ [Converting PDFs to Markdown](#/convert-pdfs-to-markdown?id=converting-pdfs-to-markdown)
8
+ ----------------------------------------------------------------------------------------
9
+
10
+ PDF documents are ubiquitous in academic, business, and technical contexts, but extracting and repurposing their content can be challenging. This tutorial explores various command-line tools for converting PDFs to Markdown format, with a focus on preserving structure and formatting suitable for different use cases, including preparation for Large Language Models (LLMs).
11
+
12
+ Use Cases:
13
+
14
+ * **LLM training and fine-tuning**: Create clean text data from PDFs for AI model training
15
+ * **Knowledge base creation**: Transform PDFs into searchable, editable markdown documents
16
+ * **Content repurposing**: Convert academic papers and reports for web publication
17
+ * **Data extraction**: Pull structured content from PDF documents for analysis
18
+ * **Accessibility**: Convert PDFs to more accessible formats for screen readers
19
+ * **Citation and reference management**: Extract bibliographic information from academic papers
20
+ * **Documentation conversion**: Transform technical PDFs into maintainable documentation
21
+
22
+ ### [PyMuPDF4LLM](#/convert-pdfs-to-markdown?id=pymupdf4llm)
23
+
24
+ [PyMuPDF4LLM](https://pymupdf.readthedocs.io/en/latest/pymupdf4llm/) is a specialized component of the PyMuPDF library that generates Markdown specifically formatted for Large Language Models. It produces high-quality markdown with good preservation of document structure. It’s specifically optimized for producing text that works well with LLMs, removing irrelevant formatting while preserving semantic structure. Requires PyTorch, which adds dependencies but enables more advanced processing capabilities.
25
+
26
+ PyMuPDF4LLM uses [MuPDF](https://mupdf.com/) as its PDF parsing engine. [PyMuPDF](https://pymupdf.readthedocs.io/) is emerging as a strong default for PDF text extraction due to its accuracy and performance in handling complex PDF structures.
27
+
28
+ ```
29
+ PYTHONUTF8=1 uv run --with pymupdf4llm python -c "import pymupdf4llm; h = open('pymupdf4llm.md', 'w'); h.write(pymupdf4llm.to_markdown('$FILE.pdf'))"
30
+ ```
31
+
32
+ * `PYTHONUTF8=1`: Forces Python to use UTF-8 encoding regardless of system locale
33
+ * `uv run --with pymupdf4llm`: Uses uv package manager to run Python with the pymupdf4llm package
34
+ * `python -c '...'`: Executes Python code directly from the command line
35
+ * `import pymupdf4llm`: Imports the PDF-to-Markdown module
36
+ * `h = open("pymupdf4llm.md", "w")`: Creates a file to write the markdown output
37
+ * `h.write(pymupdf4llm.to_markdown("$FILE.pdf"))`: Converts the PDF to markdown and writes to file
38
+
39
+ [Markitdown](#/convert-pdfs-to-markdown?id=markitdown)
40
+ ------------------------------------------------------
41
+
42
+ [![Microsoft MarkItDown - Convert Files and Office Documents to Markdown - Install Locally (9 min)](https://i.ytimg.com/vi/v65Oyddfxeg/sddefault.jpg)](https://youtu.be/v65Oyddfxeg)
43
+
44
+ [Markitdown](https://github.com/microsoft/markitdown) is Microsoft’s tool for converting various document formats to Markdown, including PDFs, DOCX, XLSX, PPTX, and ZIP files. It’s a versatile multi-format converter that handles PDFs via PDFMiner, DOCX via Mammoth, XLSX via Pandas, and PPTX via Python-PPTX. Good for batch processing of mixed document types. The quality of PDF conversion is generally good but may struggle with complex layouts or heavily formatted documents.
45
+
46
+ ```
47
+ PYTHONUTF8=1 uvx markitdown $FILE.pdf > markitdown.md
48
+ ```
49
+
50
+ * `PYTHONUTF8=1`: Forces Python to use UTF-8 encoding
51
+ * `uvx markitdown`: Runs the markitdown tool via the uv package manager
52
+ * `$FILE.pdf`: The input PDF file
53
+ * `> markitdown.md`: Redirects output to a markdown file
54
+
55
+ ### [Unstructured](#/convert-pdfs-to-markdown?id=unstructured)
56
+
57
+ [Unstructured](https://unstructured.io/) is rapidly becoming the de facto library for parsing over 40 different file types. It is excellent for extracting text and tables from diverse document formats. Particularly useful for generating clean content to pass to LLMs. Strong community support and actively maintained.
58
+
59
+ [GROBID](#/convert-pdfs-to-markdown?id=grobid)
60
+ ----------------------------------------------
61
+
62
+ If you specifically need to parse references from text-native PDFs or reliably OCR’ed ones, [GROBID](https://github.com/kermitt2/grobid) remains the de facto choice. It excels at extracting structured bibliographic information with high accuracy.
63
+
64
+ ```
65
+ # Start GROBID service
66
+ docker run -t --rm -p 8070:8070 lfoppiano/grobid:0.7.2
67
+
68
+ # Process PDF with curl
69
+ curl -X POST -F "input=@paper.pdf" localhost:8070/api/processFulltextDocument > references.tei.xml
70
+ ```
71
+
72
+ ### [Mistral OCR API](#/convert-pdfs-to-markdown?id=mistral-ocr-api)
73
+
74
+ [Mistral OCR](https://mistral.ai/products/ocr/) offers an end-to-end cloud API that preserves both text and layout, making it easier to isolate specific sections like References. It shows the most promise currently, though it requires post-processing.
75
+
76
+ [Azure Document Intelligence API](#/convert-pdfs-to-markdown?id=azure-document-intelligence-api)
77
+ ------------------------------------------------------------------------------------------------
78
+
79
+ For enterprise users already in the Microsoft ecosystem, [Azure Document Intelligence](https://azure.microsoft.com/en-us/products/ai-services/document-intelligence) provides excellent raw OCR with enterprise SLAs. May require custom model training or post-processing to match GROBID’s reference extraction capabilities.
80
+
81
+ ### [Other libraries](#/convert-pdfs-to-markdown?id=other-libraries)
82
+
83
+ [Docling](https://github.com/DS4SD/docling) is IBM’s document understanding library that supports PDF conversion. It can be challenging to install, particularly on Windows and some Linux distributions. Offers advanced document understanding capabilities beyond simple text extraction.
84
+
85
+ [MegaParse](https://github.com/QuivrHQ/MegaParse) takes a comprehensive approach using LibreOffice, Pandoc, Tesseract OCR, and other tools. It offers robust handling of different document types but requires an OpenAI API key for some features. Good for complex documents but has significant dependencies.
86
+
87
+ [Comparison of PDF-to-Markdown Tools](#/convert-pdfs-to-markdown?id=comparison-of-pdf-to-markdown-tools)
88
+ --------------------------------------------------------------------------------------------------------
89
+
90
+ | Tool | Strengths | Weaknesses | Best For |
91
+ | --- | --- | --- | --- |
92
+ | PyMuPDF4LLM | Structure preservation, LLM optimization | Requires PyTorch | AI training data, semantic structure |
93
+ | Markitdown | Multi-format support, simple usage | Less precise layout handling | Batch processing, mixed documents |
94
+ | Unstructured | Wide format support, active development | Can be resource-intensive | Production pipelines, integration |
95
+ | GROBID | Reference extraction excellence | Narrower use case | Academic papers, citations |
96
+ | Docling | Advanced document understanding | Installation difficulties | Research applications |
97
+ | MegaParse | Comprehensive approach | Requires OpenAI API | Complex documents, OCR needs |
98
+
99
+ How to pick:
100
+
101
+ * **Need LLM-ready content?** PyMuPDF4LLM is specifically designed for this
102
+ * **Working with multiple document formats?** Markitdown handles diverse inputs
103
+ * **Extracting academic references?** GROBID remains the standard
104
+ * **Building a production pipeline?** Unstructured offers the best integration options
105
+ * **Handling complex layouts?** Consider commercial OCR like Mistral or Azure Document Intelligence
106
+
107
+ The optimal approach depends on your specific requirements regarding accuracy, structure preservation, and the intended use of the extracted content.
108
+
109
+ [Tips for Optimal PDF Conversion](#/convert-pdfs-to-markdown?id=tips-for-optimal-pdf-conversion)
110
+ ------------------------------------------------------------------------------------------------
111
+
112
+ 1. **Pre-process PDFs** when possible:
113
+
114
+ ```
115
+ # Optimize a PDF for text extraction first
116
+ ocrmypdf --optimize 3 --skip-text input.pdf optimized.pdf
117
+ ```
118
+ 2. **Try multiple tools** on the same document and compare the results.
119
+ 3. **Handle scanned PDFs** appropriately:
120
+
121
+ ```
122
+ # For scanned documents, run OCR first
123
+ ocrmypdf --force-ocr input.pdf ocr_ready.pdf
124
+ PYTHONUTF8=1 uvx markitdown ocr_ready.pdf > markitdown.md
125
+ ```
126
+ 4. **Consider post-processing** for better results:
127
+
128
+ ```
129
+ # Simple post-processing example
130
+ sed -i 's/\([A-Z]\)\./\1\.\n/g' output.md # Insert a line break after an uppercase letter followed by a period
131
+ ```
132
+
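The sed one-liner above is fragile — it inserts a break after any uppercase letter followed by a period, so initials like "U.S." get split too. A hedged Python sketch of a slightly safer rule (our own function name), breaking after sentence-ending punctuation that precedes a capitalized word:

```python
import re

def add_sentence_breaks(text: str) -> str:
    # Insert a newline after ., !, or ? when followed by whitespace and a capital letter
    return re.sub(r"([.!?])\s+(?=[A-Z])", r"\1\n", text)

print(add_sentence_breaks("First sentence. Second one! And a third? Done."))
```

This still misbreaks after abbreviations such as "e.g. Example", so inspect a sample of the output before applying it in bulk.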
133
+ [Previous
134
+
135
+ Scraping PDFs with Tabula](#/scraping-pdfs-with-tabula)
136
+
137
+ [Next
138
+
139
+ Convert HTML to Markdown](#/convert-html-to-markdown)
markdown_files/Correlation_with_Excel.md ADDED
@@ -0,0 +1,33 @@
1
+ ---
2
+ title: "Correlation with Excel"
3
+ original_url: "https://tds.s-anand.net/#/correlation-with-excel?id=correlation-with-excel"
4
+ downloaded_at: "2025-06-08T23:24:31.921246"
5
+ ---
6
+
7
+ [Correlation with Excel](#/correlation-with-excel?id=correlation-with-excel)
8
+ ----------------------------------------------------------------------------
9
+
10
+ [![Correlation with Excel](https://i.ytimg.com/vi_webp/lXHCyhO7DmY/sddefault.webp)](https://youtu.be/lXHCyhO7DmY)
11
+
12
+ You’ll learn to calculate and interpret correlations using Excel, covering:
13
+
14
+ * **Enabling the Data Analysis Tool Pack**: Steps to enable the Excel data analysis tool pack.
15
+ * **Correlation Analysis**: Understanding statistical association between variables.
16
+ * **Creating a Correlation Matrix**: Steps to generate and interpret a correlation matrix.
17
+ * **Scatterplots and Trendlines**: Plotting data and adding trend lines to visualize correlations.
18
+ * **Analyzing Results**: Comparing correlation coefficients and understanding their implications.
19
+ * **Insights and Further Analysis**: Interpreting scatterplots and planning further analysis for deeper insights.
20
+
21
+ Here are the links used in the video:
22
+
23
+ * [Understand correlation](https://www.khanacademy.org/math/ap-statistics/bivariate-data-ap/correlation-coefficient-r/v/correlation-coefficient-intuition-examples)
24
+ * [COVID-19 vaccinations data explorer - Website](https://ourworldindata.org/covid-vaccinations?country=OWID_WRL)
25
+ * [COVID-19 vaccinations - Correlations Excel file](https://docs.google.com/spreadsheets/d/1_vQF2i5ubKmHQMBqoTwsu6AlevWsQtTD/view#gid=790744269)
26
+
27
+ [Previous
28
+
29
+ 6. Data Analysis](#/data-analysis)
30
+
31
+ [Next
32
+
33
+ Regression with Excel](#/regression-with-excel)
markdown_files/Crawling_with_the_CLI.md ADDED
@@ -0,0 +1,137 @@
1
+ ---
2
+ title: "Crawling with the CLI"
3
+ original_url: "https://tds.s-anand.net/#/crawling-cli?id=wpull"
4
+ downloaded_at: "2025-06-08T23:26:52.185904"
5
+ ---
6
+
7
+ [Crawling with the CLI](#/crawling-cli?id=crawling-with-the-cli)
8
+ ----------------------------------------------------------------
9
+
10
+ Since websites are a common source of data, we often download entire websites (crawling) and then process them offline.
11
+
12
+ Web crawling is essential in many data-driven scenarios:
13
+
14
+ * **Data mining and analysis**: Gathering structured data from multiple pages for market research, competitive analysis, or academic research
15
+ * **Content archiving**: Creating offline copies of websites for preservation or backup purposes
16
+ * **SEO analysis**: Analyzing site structure, metadata, and content to improve search rankings
17
+ * **Legal compliance**: Capturing website content for regulatory or compliance documentation
18
+ * **Website migration**: Creating a complete copy before moving to a new platform or design
19
+ * **Offline access**: Downloading educational resources, documentation, or reference materials for use without internet connection
20
+
21
+ The most commonly used tool for fetching websites is [`wget`](https://www.gnu.org/software/wget/). It is pre-installed in many UNIX distributions and easy to install.
22
+
23
+ [![Scraping Websites using Wget (8 min)](https://i.ytimg.com/vi/pLfH5TZBGXo/sddefault.jpg)](https://youtu.be/pLfH5TZBGXo)
24
+
25
+ To crawl the [IIT Madras Data Science Program website](https://study.iitm.ac.in/ds/) for example, you could run:
26
+
27
+ ```
28
+ wget \
29
+ --recursive \
30
+ --level=3 \
31
+ --no-parent \
32
+ --convert-links \
33
+ --adjust-extension \
34
+ --compression=auto \
35
+ --accept html,htm \
36
+ --directory-prefix=./ds \
37
+ https://study.iitm.ac.in/ds/
38
+ ```
39
+
40
+ Here’s what each option does:
41
+
42
+ * `--recursive`: Enables recursive downloading (following links)
43
+ * `--level=3`: Limits recursion depth to 3 levels from the initial URL
44
+ * `--no-parent`: Restricts crawling to only URLs below the initial directory
45
+ * `--convert-links`: Converts all links in downloaded documents to work locally
46
+ * `--adjust-extension`: Adds proper extensions to files (.html, .jpg, etc.) based on MIME types
47
+ * `--compression=auto`: Automatically handles compressed content (gzip, deflate)
48
+ * `--accept html,htm`: Only downloads files with these extensions
49
+ * `--directory-prefix=./ds`: Saves all downloaded files to the specified directory
50
+
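Internally, `--recursive` boils down to "fetch a page, extract its links, keep those in scope, repeat up to `--level`". A hedged stdlib sketch of the extract-and-filter step (the class and function names are our own; wget also handles fetching, depth tracking, and politeness):

```python
# Minimal sketch of the link-extraction step behind recursive crawling.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL
                self.links.append(urljoin(self.base_url, href))

def same_host_links(html, base_url):
    # Keep only links on the same host, a scoping rule in the spirit of --no-parent
    parser = LinkExtractor(base_url)
    parser.feed(html)
    host = urlparse(base_url).netloc
    return [u for u in parser.links if urlparse(u).netloc == host]
```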
51
+ [wget2](https://gitlab.com/gnuwget/wget2) is a better version of `wget` and supports HTTP2, parallel connections, and only updates modified sites. The syntax is (mostly) the same.
52
+
53
+ ```
54
+ wget2 \
55
+ --recursive \
56
+ --level=3 \
57
+ --no-parent \
58
+ --convert-links \
59
+ --adjust-extension \
60
+ --compression=auto \
61
+ --accept html,htm \
62
+ --directory-prefix=./ds \
63
+ https://study.iitm.ac.in/ds/
64
+ ```
65
+
66
+ There are popular free and open-source alternatives to Wget:
67
+
68
+ ### [Wpull](#/crawling-cli?id=wpull)
69
+
70
+ [Wpull](https://github.com/ArchiveTeam/wpull) is a wget-compatible Python crawler that supports on-disk resumption, WARC output, and PhantomJS integration.
71
+
72
+ ```
73
+ uvx wpull \
74
+ --recursive \
75
+ --level=3 \
76
+ --no-parent \
77
+ --convert-links \
78
+ --adjust-extension \
79
+ --compression=auto \
80
+ --accept html,htm \
81
+ --directory-prefix=./ds \
82
+ https://study.iitm.ac.in/ds/
83
+ ```
84
+
85
+ ### [HTTrack](#/crawling-cli?id=httrack)
86
+
87
+ [HTTrack](https://www.httrack.com/html/fcguide.html) is a dedicated website-mirroring tool with rich filtering and link-conversion options.
88
+
89
+ ```
90
+ httrack "https://study.iitm.ac.in/ds/" \
91
+ -O "./ds" \
92
+ "+*.study.iitm.ac.in/ds/*" \
93
+ -r3
94
+ ```
95
+
96
+ ### [Robots.txt](#/crawling-cli?id=robotstxt)
97
+
98
+ `robots.txt` is a standard file found in a website’s root directory that specifies which parts of the site should not be accessed by web crawlers. It’s part of the Robots Exclusion Protocol, an ethical standard for web crawling.
99
+
100
+ **Why it’s important**:
101
+
102
+ * **Server load protection**: Prevents excessive traffic that could overload servers
103
+ * **Privacy protection**: Keeps sensitive or private content from being indexed
104
+ * **Legal compliance**: Respects website owners’ rights to control access to their content
105
+ * **Ethical web citizenship**: Shows respect for website administrators’ wishes
106
+
107
+ **How to override robots.txt restrictions**:
108
+
109
+ * **wget, wget2**: Use `-e robots=off`
110
+ * **httrack**: Use `-s0`
111
+ * **wpull**: Use `--no-robots`
112
+
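Before deciding whether to bypass anything, you can check a site's rules programmatically; Python's standard library ships a robots.txt parser. A small sketch — the rules below are an inline example, not any real site's file (normally you would fetch `https://site/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# Parse an example robots.txt supplied as lines
rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/ds/page"))       # True
```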
113
+ **When to override robots.txt (use with discretion)**:
114
+
115
+ Only bypass `robots.txt` when:
116
+
117
+ * You have explicit permission from the website owner
118
+ * You’re crawling your own website
119
+ * The content is publicly accessible and your crawling won’t cause server issues
120
+ * You’re conducting authorized security testing
121
+
122
+ Remember that bypassing `robots.txt` without legitimate reason may:
123
+
124
+ * Violate terms of service
125
+ * Lead to IP banning
126
+ * Result in legal consequences in some jurisdictions
127
+ * Cause reputation damage to your organization
128
+
129
+ Always use the minimum necessary crawling speed and scope, and consider contacting website administrators for permission when in doubt.
130
+
131
+ [Previous
132
+
133
+ Scraping with Google Sheets](#/scraping-with-google-sheets)
134
+
135
+ [Next
136
+
137
+ BBC Weather API with Python](#/bbc-weather-api-with-python)
markdown_files/Data_Aggregation_in_Excel.md ADDED
@@ -0,0 +1,32 @@
1
+ ---
2
+ title: "Data Aggregation in Excel"
3
+ original_url: "https://tds.s-anand.net/#/data-aggregation-in-excel?id=data-aggregation-in-excel"
4
+ downloaded_at: "2025-06-08T23:26:38.121838"
5
+ ---
6
+
7
+ [Data Aggregation in Excel](#/data-aggregation-in-excel?id=data-aggregation-in-excel)
8
+ -------------------------------------------------------------------------------------
9
+
10
+ [![Data aggregation in Excel](https://i.ytimg.com/vi_webp/NkpT0dDU8Y4/sddefault.webp)](https://youtu.be/NkpT0dDU8Y4)
11
+
12
+ You’ll learn data aggregation and visualization techniques in Excel, covering:
13
+
14
+ * **Data Cleanup**: Remove empty columns and rows with missing values.
15
+ * **Creating Excel Tables**: Convert raw data into tables for easier manipulation and formula application.
16
+ * **Date Manipulation**: Extract week, month, and year from date columns using Excel functions (WEEKNUM, TEXT).
17
+ * **Color Scales**: Apply color scales to visualize clusters and trends in data over time.
18
+ * **Pivot Tables**: Create pivot tables to aggregate data by location and date, summarizing values weekly and monthly.
19
+ * **Sparklines**: Use sparklines to visualize trends within pivot tables, making data patterns more apparent.
20
+ * **Data Bars**: Implement data bars for graphical illustrations of numerical columns, showing trends and waves.
21
+
22
+ Here are links used in the video:
23
+
24
+ * [COVID-19 data Excel file - raw data](https://docs.google.com/spreadsheets/d/14HLgSmME95q--6lcBv9pUstqHL183wTd/view)
25
+
26
+ [Previous
27
+
28
+ Splitting Text in Excel](#/splitting-text-in-excel)
29
+
30
+ [Next
31
+
32
+ Data Preparation in the Shell](#/data-preparation-in-the-shell)
markdown_files/Data_Analysis_with_DuckDB.md ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ title: "Data Analysis with DuckDB"
3
+ original_url: "https://tds.s-anand.net/#/data-analysis-with-duckdb?id=data-analysis-with-duckdb"
4
+ downloaded_at: "2025-06-08T23:26:27.065997"
5
+ ---
6
+
7
+ [Data Analysis with DuckDB](#/data-analysis-with-duckdb?id=data-analysis-with-duckdb)
8
+ -------------------------------------------------------------------------------------
9
+
10
+ [![Data Analysis with DuckDB](https://i.ytimg.com/vi_webp/4U0GqYrET5s/sddefault.webp)](https://youtu.be/4U0GqYrET5s)
11
+
12
+ You’ll learn how to perform data analysis using DuckDB and Pandas, covering:
13
+
14
+ * **Parquet for Data Storage**: Understand why Parquet is a faster, more compact, and better-typed storage format compared to CSV, JSON, and SQLite.
15
+ * **DuckDB Setup**: Learn how to install and set up DuckDB, along with integrating it into a Jupyter notebook environment.
16
+ * **File Format Comparisons**: Compare file formats by speed and size, observing the performance difference between saving and loading data in CSV, JSON, SQLite, and Parquet.
17
+ * **Faster Queries with DuckDB**: Learn how DuckDB uses parallel processing, columnar storage, and on-disk operations to outperform Pandas in speed and memory efficiency.
18
+ * **SQL Query Execution in DuckDB**: Run SQL queries directly on Parquet files and Pandas DataFrames to compute metrics such as the number of unique flight routes delayed by certain time intervals.
19
+ * **Memory Efficiency**: Understand how DuckDB performs analytics without loading entire datasets into memory, making it highly efficient for large-scale data analysis.
20
+ * **Mixing DuckDB and Pandas**: Learn to interleave DuckDB and Pandas operations, leveraging the strengths of both tools to perform complex queries like correlations and aggregations.
21
+ * **Ranking and Filtering Data**: Use SQL and Pandas to rank arrival delays by distance and extract key insights, such as the earliest flight arrival for each route.
22
+ * **Joining Data**: Create a cost analysis by joining datasets and calculating total costs of flight delays, demonstrating DuckDB’s speed in joining and aggregating large datasets.
23
+
24
+ Here are the links used in the video:
25
+
26
+ * [Data analysis with DuckDB - Notebook](https://drive.google.com/file/d/1Y9XSs-LeSz-ZmnQj4OGP-Q4yDkPJrmsZ/view)
27
+ * [Parquet file format](https://parquet.apache.org/) - a fast columnar storage format that’s becoming a de facto standard for big data
28
+ * [DuckDB](https://duckdb.org/) - a fast in-memory database that’s very good with large-scale analysis
29
+ * [Plotly Datasets](https://github.com/plotly/datasets/) - a collection of sample datasets for analysis. This includes the [Kaggle Flights Dataset](https://www.kaggle.com/datasets/usdot/flight-delays) that the notebook downloads as [2015\_flights.parquet](https://github.com/plotly/datasets/raw/master/2015_flights.parquet)
30
+
31
+ [Previous
32
+
33
+ Data Analysis with Datasette](#/data-analysis-with-datasette)
34
+
35
+ [Next
36
+
37
+ Data Analysis with ChatGPT](#/data-analysis-with-chatgpt)
markdown_files/Data_Analysis_with_Python.md ADDED
@@ -0,0 +1,37 @@
1
+ ---
2
+ title: "Data Analysis with Python"
3
+ original_url: "https://tds.s-anand.net/#/data-analysis-with-python?id=data-analysis-with-python"
4
+ downloaded_at: "2025-06-08T23:24:24.926726"
5
+ ---
6
+
7
+ [Data Analysis with Python](#/data-analysis-with-python?id=data-analysis-with-python)
8
+ -------------------------------------------------------------------------------------
9
+
10
+ [![Data Analysis with Python](https://i.ytimg.com/vi_webp/ZPfZH14FK90/sddefault.webp)](https://youtu.be/ZPfZH14FK90)
11
+
12
+ You’ll learn practical data analysis techniques in Python using Pandas, covering:
13
+
14
+ * **Reading Parquet Files**: Utilize Pandas to read Parquet file formats for efficient data handling.
15
+ * **Dataframe Inspection**: Methods to preview and understand the structure of a dataset.
16
+ * **Pivot Tables**: Creating and interpreting pivot tables to summarize data.
17
+ * **Percentage Calculations**: Normalize pivot table values to percentages for better insights.
18
+ * **Correlation Analysis**: Calculate and interpret correlation between variables, including significance testing.
19
+ * **Statistical Significance**: Use statistical tests to determine the significance of observed correlations.
20
+ * **Datetime Handling**: Extract and manipulate date and time information from datetime columns.
21
+ * **Data Visualization**: Generate and customize heat maps to visualize data patterns effectively.
22
+ * **Leveraging AI**: Use ChatGPT to generate and refine analytical code, enhancing productivity and accuracy.
23
+
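The pivot-table and percentage steps from the video can be sketched in a few lines of Pandas. This uses toy data with invented column names, not the actual card-transactions dataset:

```python
import pandas as pd

# Toy data standing in for the transactions dataset
df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "type": ["online", "offline", "online", "offline"],
    "amount": [10, 30, 20, 40],
})

# Summarize amounts by city and type, then normalize each row to percentages
pivot = df.pivot_table(index="city", columns="type", values="amount", aggfunc="sum")
pct = pivot.div(pivot.sum(axis=1), axis=0) * 100
print(pct)
```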
24
+ Here are the links used in the video:
25
+
26
+ * [Data analysis with Python - Notebook](https://colab.research.google.com/drive/1wEUEeF_e2SSmS9uf2-3fZJQ2kEFRnxah)
27
+ * [Card transactions dataset (Parquet)](https://drive.google.com/file/u/3/d/1XGvuFjoTwlybkw0cc9u34horMF9vMhrB/view)
28
+ * [10 minutes to Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html)
29
+ * [Python Pandas tutorials](https://www.youtube.com/playlist?list=PL-osiE80TeTsWmV9i9c58mdDCSskIFdDS)
30
+
31
+ [Previous
32
+
33
+ Outlier Detection with Excel](#/outlier-detection-with-excel)
34
+
35
+ [Next
36
+
37
+ Data Analysis with SQL](#/data-analysis-with-sql)
markdown_files/Data_Analysis_with_SQL.md ADDED
@@ -0,0 +1,39 @@
1
+ ---
2
+ title: "Data Analysis with SQL"
3
+ original_url: "https://tds.s-anand.net/#/data-analysis-with-sql?id=data-analysis-with-sql"
4
+ downloaded_at: "2025-06-08T23:22:33.461136"
5
+ ---
6
+
7
+ [Data Analysis with SQL](#/data-analysis-with-sql?id=data-analysis-with-sql)
8
+ ----------------------------------------------------------------------------
9
+
10
+ [![Data Analysis with Databases](https://i.ytimg.com/vi_webp/Xn3QkYrThbI/sddefault.webp)](https://youtu.be/Xn3QkYrThbI)
11
+
12
+ You’ll learn how to perform data analysis using SQL (via Python), covering:
13
+
14
+ * **Database Connection**: How to connect to a MySQL database using SQLAlchemy and Pandas.
15
+ * **SQL Queries**: Execute SQL queries directly from a Python environment to retrieve and analyze data.
16
+ * **Counting Rows**: Use SQL to count the number of rows in a table.
17
+ * **User Activity Analysis**: Query and identify top users by post count.
18
+ * **Post Concentration**: Determine if a small percentage of users contribute the majority of posts using SQL aggregation.
19
+ * **Correlation Calculation**: Calculate the Pearson correlation coefficient between user attributes such as age and reputation.
20
+ * **Regression Analysis**: Compute the regression slope to understand the relationship between views and reputation.
21
+ * **Handling Large Data**: Perform calculations on large datasets by fetching aggregated values from the database rather than entire datasets.
22
+ * **Statistical Analysis in SQL**: Use SQL as a tool for statistical analysis, demonstrating its power beyond simple data retrieval.
23
+ * **Leveraging AI**: Use ChatGPT to generate SQL queries and Python code, enhancing productivity and accuracy.
24
+
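The video uses MySQL, but the "fetch aggregates, not rows" idea works with any SQL engine. A hedged sketch with Python's built-in sqlite3 and toy data, computing Pearson's r for age vs. reputation from five SUMs and a COUNT instead of pulling the whole table into Python:

```python
import sqlite3, math

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (age REAL, reputation REAL)")
con.executemany("INSERT INTO users VALUES (?, ?)",
                [(20, 100), (30, 150), (40, 220), (50, 260)])

# Fetch only aggregated values; the database never ships raw rows to Python
n, sx, sy, sxx, syy, sxy = con.execute("""
    SELECT COUNT(*), SUM(age), SUM(reputation),
           SUM(age * age), SUM(reputation * reputation), SUM(age * reputation)
    FROM users
""").fetchone()

# Pearson correlation from the aggregates
r = (n * sxy - sx * sy) / math.sqrt((n * sxx - sx**2) * (n * syy - sy**2))
print(round(r, 3))  # → 0.995
```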
25
+ Here are the links used in the video:
26
+
27
+ * [Data analysis with databases - Notebook](https://colab.research.google.com/drive/1j_5AsWdf0SwVHVgfbEAcg7vYguKUN41o)
28
+ * [SQLZoo](https://www.sqlzoo.net/wiki/SQL_Tutorial) has simple interactive tutorials to learn SQL
29
+ * [Stats database](https://relational-data.org/dataset/Stats) that has an anonymized dump of [stats.stackexchange.com](https://stats.stackexchange.com/)
30
+ * [Pandas `read_sql`](https://pandas.pydata.org/docs/reference/api/pandas.read_sql.html)
31
+ * [SQLAlchemy docs](https://docs.sqlalchemy.org/)
32
+
33
+ [Previous
34
+
35
+ Data Analysis with Python](#/data-analysis-with-python)
36
+
37
+ [Next
38
+
39
+ Data Analysis with Datasette](#/data-analysis-with-datasette)
markdown_files/Data_Cleansing_in_Excel.md ADDED
@@ -0,0 +1,30 @@
1
+ ---
2
+ title: "Data Cleansing in Excel"
3
+ original_url: "https://tds.s-anand.net/#/data-cleansing-in-excel?id=data-cleansing-in-excel"
4
+ downloaded_at: "2025-06-08T23:27:03.007571"
5
+ ---
6
+
7
+ [Data Cleansing in Excel](#/data-cleansing-in-excel?id=data-cleansing-in-excel)
8
+ -------------------------------------------------------------------------------
9
+
10
+ [![Clean up data in Excel](https://i.ytimg.com/vi_webp/7du7xkqeu4s/sddefault.webp)](https://youtu.be/7du7xkqeu4s)
11
+
12
+ You’ll learn basic but essential data cleaning techniques in Excel, covering:
13
+
14
+ * **Find and Replace**: Use Ctrl+H to replace or remove specific terms (e.g., removing “[more]” from country names).
15
+ * **Changing Data Formats**: Convert columns from general to numerical format.
16
+ * **Removing Extra Spaces**: Use the TRIM function to clean up unnecessary spaces in text.
17
+ * **Identifying and Removing Blank Cells**: Highlight and delete entire rows with blank cells using the “Go To Special” function.
18
+ * **Removing Duplicates**: Use the “Remove Duplicates” feature to eliminate duplicate entries, demonstrated with country names.
19
+
20
+ Here are links used in the video:
21
+
22
+ * [List of Largest Cities Excel file](https://docs.google.com/spreadsheets/d/1jl8tHGoxmIba4J78aJVfT9jtZv7lfCbV/view)
23
+
24
+ [Previous
25
+
26
+ 5. Data Preparation](#/data-preparation)
27
+
28
+ [Next
29
+
30
+ Data Transformation in Excel](#/data-transformation-in-excel)
markdown_files/Data_Preparation_in_the_Editor.md ADDED
@@ -0,0 +1,30 @@
+ ---
+ title: "Data Preparation in the Editor"
+ original_url: "https://tds.s-anand.net/#/data-preparation-in-the-editor?id=data-preparation-in-the-editor"
+ downloaded_at: "2025-06-08T23:22:43.469063"
+ ---
+
+ [Data Preparation in the Editor](#/data-preparation-in-the-editor?id=data-preparation-in-the-editor)
+ ----------------------------------------------------------------------------------------------------
+
+ [![Data preparation in the editor](https://i.ytimg.com/vi_webp/99lYu43L9uM/sddefault.webp)](https://youtu.be/99lYu43L9uM)
+
+ You’ll learn how to use the text editor [Visual Studio Code](https://code.visualstudio.com/) to process and clean data, covering:
+
+ * **Format** JSON files
+ * **Find all** and multiple cursors to extract specific fields
+ * **Sort** lines
+ * **Delete duplicate** lines
+ * **Replace** text with multiple cursors
+
+ Here are the links used in the video:
+
+ * [City-wise product sales JSON](https://drive.google.com/file/d/1VEnKChf4i04iKsQfw0MwoJlfkOBGQ65B/view?usp=drive_link)
+
+ [Previous
+
+ Data Preparation in the Shell](#/data-preparation-in-the-shell)
+
+ [Next
+
+ Cleaning Data with OpenRefine](#/cleaning-data-with-openrefine)
markdown_files/Data_Preparation_in_the_Shell.md ADDED
@@ -0,0 +1,36 @@
+ ---
+ title: "Data Preparation in the Shell"
+ original_url: "https://tds.s-anand.net/#/data-preparation-in-the-shell?id=data-preparation-in-the-shell"
+ downloaded_at: "2025-06-08T23:26:41.381829"
+ ---
+
+ [Data Preparation in the Shell](#/data-preparation-in-the-shell?id=data-preparation-in-the-shell)
+ -------------------------------------------------------------------------------------------------
+
+ [![Data preparation in the shell](https://i.ytimg.com/vi_webp/XEdy4WK70vU/sddefault.webp)](https://youtu.be/XEdy4WK70vU)
+
+ You’ll learn how to use UNIX tools to process and clean data, covering:
+
+ * `curl` (or `wget`) to fetch data from websites.
+ * `gzip` (or `xz`) to compress and decompress files.
+ * `wc` to count lines, words, and characters in text.
+ * `head` and `tail` to get the start and end of files.
+ * `cut` to extract specific columns from text.
+ * `uniq` to de-duplicate lines.
+ * `sort` to sort lines.
+ * `grep` to filter lines containing specific text.
+ * `sed` to search and replace text.
+ * `awk` for more complex text processing.
+
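Most of these tools shine when chained into a pipeline. A minimal sketch (the inline `printf` data stands in for a file you might `curl`): extract a column, then count and rank its values:

```
# Top categories by frequency: extract column 1, group, count, rank.
printf 'a,1\nb,2\na,3\nc,4\na,5\nb,6\n' |
  cut -d, -f1 |   # keep the first comma-separated column
  sort |          # sort so identical lines are adjacent
  uniq -c |       # collapse duplicates, prefixing each with its count
  sort -rn |      # rank by count, descending
  head -n 2       # keep the top 2
```

This sort-before-`uniq` pattern is the standard idiom, since `uniq` only merges adjacent duplicate lines.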
+ Here are the links used in the video:
+
+ * [Data preparation in the shell - Notebook](https://colab.research.google.com/drive/1KSFkQDK0v__XWaAaHKeQuIAwYV0dkTe8)
+ * [Data Science at the Command Line](https://jeroenjanssens.com/dsatcl/)
+
+ [Previous
+
+ Data Aggregation in Excel](#/data-aggregation-in-excel)
+
+ [Next
+
+ Data Preparation in the Editor](#/data-preparation-in-the-editor)
markdown_files/Data_Storytelling.md ADDED
@@ -0,0 +1,18 @@
+ ---
+ title: "Data Storytelling"
+ original_url: "https://tds.s-anand.net/#/data-storytelling?id=data-storytelling"
+ downloaded_at: "2025-06-08T23:21:12.671499"
+ ---
+
+ [Data Storytelling](#/data-storytelling?id=data-storytelling)
+ =============================================================
+
+ [![Narrate a story](https://i.ytimg.com/vi_webp/aF93i6zVVQg/sddefault.webp)](https://youtu.be/aF93i6zVVQg)
+
+ [Previous
+
+ RAWgraphs](#/rawgraphs)
+
+ [Next
+
+ Narratives with LLMs](#/narratives-with-llms)
markdown_files/Data_Transformation_in_Excel.md ADDED
@@ -0,0 +1,30 @@
+ ---
+ title: "Data Transformation in Excel"
+ original_url: "https://tds.s-anand.net/#/data-transformation-in-excel?id=data-transformation-in-excel"
+ downloaded_at: "2025-06-08T23:26:40.285938"
+ ---
+
+ [Data Transformation in Excel](#/data-transformation-in-excel?id=data-transformation-in-excel)
+ ----------------------------------------------------------------------------------------------
+
+ [![Data transformation in Excel](https://i.ytimg.com/vi_webp/gR2IY5Naja0/sddefault.webp)](https://youtu.be/gR2IY5Naja0)
+
+ You’ll learn data transformation techniques in Excel, covering:
+
+ * **Calculating Ratios**: Compute metro area to city area and metro population to city population ratios.
+ * **Using Pivot Tables**: Create pivot tables to aggregate data and identify outliers.
+ * **Filtering Data**: Apply filters in pivot tables to analyze specific subsets of data.
+ * **Counting Data Occurrences**: Use pivot tables to count the frequency of specific entries.
+ * **Creating Charts**: Generate charts from pivot table data to visualize distributions and outliers.
+
+ Here are the links used in the video:
+
+ * [List of Largest Cities Excel file](https://docs.google.com/spreadsheets/d/1jl8tHGoxmIba4J78aJVfT9jtZv7lfCbV/view)
+
+ [Previous
+
+ Data Cleansing in Excel](#/data-cleansing-in-excel)
+
+ [Next
+
+ Splitting Text in Excel](#/splitting-text-in-excel)
markdown_files/Data_Transformation_with_dbt.md ADDED
@@ -0,0 +1,64 @@
+ ---
+ title: "Data Transformation with dbt"
+ original_url: "https://tds.s-anand.net/#/dbt?id=data-transformation-with-dbt"
+ downloaded_at: "2025-06-08T23:26:28.166999"
+ ---
+
+ [Data Transformation with dbt](#/dbt?id=data-transformation-with-dbt)
+ ---------------------------------------------------------------------
+
+ [![Data Transformation with dbt](https://i.ytimg.com/vi_webp/5rNquRnNb4E/sddefault.webp)](https://youtu.be/5rNquRnNb4E)
+
+ You’ll learn how to transform data using dbt (data build tool), covering:
+
+ * **dbt Fundamentals**: Understand what dbt is and how it brings software engineering practices to data transformation
+ * **Project Setup**: Learn how to initialize a dbt project, configure your warehouse connection, and structure your models
+ * **Models and Materialization**: Create your first dbt models and understand different materialization strategies (view, table, incremental)
+ * **Testing and Documentation**: Implement data quality tests and auto-generate documentation for your data models
+ * **Jinja Templating**: Use Jinja for dynamic SQL generation, making your transformations more maintainable and reusable
+ * **References and Dependencies**: Learn how to reference other models and manage model dependencies
+ * **Sources and Seeds**: Configure source data connections and manage static reference data
+ * **Macros and Packages**: Create reusable macros and leverage community packages to extend functionality
+ * **Incremental Models**: Optimize performance by only processing new or changed data
+ * **Deployment and Orchestration**: Set up dbt Cloud or integrate with Airflow for production deployment
+
+ Here’s a minimal dbt model example, `models/staging/stg_customers.sql`:
+
+ ```
+ with source as (
+     select * from {{ source('raw', 'customers') }}
+ ),
+
+ renamed as (
+     select
+         id as customer_id,
+         first_name,
+         last_name,
+         email,
+         created_at
+     from source
+ )
+
+ select * from renamed
+ ```
+
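To illustrate how **References and Dependencies** work, here's a hypothetical downstream model (say, `models/marts/customer_count.sql`) that builds on a staging model named `stg_customers`. The `ref()` call both resolves to the right relation in your warehouse and registers the dependency edge:

```
select count(*) as customer_count
from {{ ref('stg_customers') }}
```

Because of that `ref()`, `dbt run` would build `stg_customers` before this model, without you wiring up the order by hand.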
+ Tools and Resources:
+
+ * [dbt Core](https://github.com/dbt-labs/dbt-core) - The open-source transformation tool
+ * [dbt Cloud](https://www.getdbt.com/product/dbt-cloud) - Hosted platform for running dbt
+ * [dbt Packages](https://hub.getdbt.com/) - Reusable modules from the community
+ * [dbt Documentation](https://docs.getdbt.com/) - Comprehensive guides and references
+ * [Jaffle Shop](https://github.com/dbt-labs/jaffle_shop) - Example dbt project for learning
+ * [dbt Slack Community](https://www.getdbt.com/community/) - Active community for support and discussions
+
+ [Previous
+
+ Parsing JSON](#/parsing-json)
+
+ [Next
+
+ Transforming Images](#/transforming-images)
markdown_files/Data_Visualization_with_Seaborn.md ADDED
@@ -0,0 +1,20 @@
+ ---
+ title: "Data Visualization with Seaborn"
+ original_url: "https://tds.s-anand.net/#/data-visualization-with-seaborn?id=data-visualization-with-seaborn"
+ downloaded_at: "2025-06-08T23:24:55.808928"
+ ---
+
+ [Data Visualization with Seaborn](#/data-visualization-with-seaborn?id=data-visualization-with-seaborn)
+ -------------------------------------------------------------------------------------------------------
+
+ [Seaborn](https://seaborn.pydata.org/) is a data visualization library for Python. It’s based on Matplotlib but a bit easier to use, and a bit prettier.
+
+ [![Seaborn Tutorial : Seaborn Full Course](https://i.ytimg.com/vi_webp/6GUZXDef2U0/sddefault.webp)](https://youtu.be/6GUZXDef2U0)
+
+ [Previous
+
+ Visualizing Charts with Excel](#/visualizing-charts-with-excel)
+
+ [Next
+
+ Data Visualization with ChatGPT](#/data-visualization-with-chatgpt)
markdown_files/Database__SQLite.md ADDED
@@ -0,0 +1,148 @@
+ ---
+ title: "Database: SQLite"
+ original_url: "https://tds.s-anand.net/#/sqlite?id=database-sqlite"
+ downloaded_at: "2025-06-08T23:26:00.500923"
+ ---
+
+ [Database: SQLite](#/sqlite?id=database-sqlite)
+ -----------------------------------------------
+
+ Relational databases are used to store data in a structured way. You’ll often access databases created by others for analysis.
+
+ PostgreSQL, MySQL, MS SQL, Oracle, etc. are popular databases. But the most installed database is [SQLite](https://www.sqlite.org/index.html). It’s embedded into many devices and apps (e.g. your phone, browser, etc.). It’s lightweight but very scalable and powerful.
+
+ Watch these introductory videos to understand SQLite and how it’s used in Python (34 min):
+
+ [![SQLite Introduction - Beginners Guide to SQL and Databases (22 min)](https://i.ytimg.com/vi_webp/8Xyn8R9eKB8/sddefault.webp)](https://youtu.be/8Xyn8R9eKB8)
+
+ [![SQLite Backend for Beginners - Create Quick Databases with Python and SQL (13 min)](https://i.ytimg.com/vi_webp/Ohj-CqALrwk/sddefault.webp)](https://youtu.be/Ohj-CqALrwk)
+
+ There are many non-relational databases (NoSQL) like [ElasticSearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html), [MongoDB](https://www.mongodb.com/docs/manual/), [Redis](https://redis.io/docs/latest/), etc. that you should know about and we may cover later.
+
+ Core Concepts:
+
+ ```
+ -- Create a table
+ CREATE TABLE users (
+     id INTEGER PRIMARY KEY,
+     name TEXT NOT NULL,
+     email TEXT UNIQUE,
+     created_at DATETIME DEFAULT CURRENT_TIMESTAMP
+ );
+
+ -- Insert data
+ INSERT INTO users (name, email) VALUES
+     ('Alice', 'alice@example.com'),
+     ('Bob', 'bob@example.com');
+
+ -- Query data
+ SELECT name, COUNT(*) as count
+ FROM users
+ GROUP BY name
+ HAVING count > 1;
+
+ -- Join tables
+ SELECT u.name, o.product
+ FROM users u
+ LEFT JOIN orders o ON u.id = o.user_id
+ WHERE o.status = 'pending';
+ ```
+
+ Python Integration:
+
+ ```
+ import sqlite3
+ from pathlib import Path
+
+ import pandas as pd
+
+ def query_database(db_path: Path, query: str) -> pd.DataFrame:
+     """Execute a SQL query and return the results as a DataFrame.
+
+     Args:
+         db_path: Path to SQLite database
+         query: SQL query to execute
+
+     Returns:
+         DataFrame with query results
+     """
+     # Open the connection before the try block so `conn` always exists in finally
+     conn = sqlite3.connect(db_path)
+     try:
+         return pd.read_sql_query(query, conn)
+     finally:
+         conn.close()
+
+ # Example usage
+ db = Path('data.db')
+ df = query_database(db, '''
+     SELECT date, COUNT(*) as count
+     FROM events
+     GROUP BY date
+ ''')
+ ```
+
+ Common Operations:
+
+ 1. **Database Management**
+
+    ```
+    -- Backup database
+    .backup 'backup.db'
+
+    -- Import CSV
+    .mode csv
+    .import data.csv table_name
+
+    -- Export results
+    .headers on
+    .mode csv
+    .output results.csv
+    SELECT * FROM table;
+    ```
+ 2. **Performance Optimization**
+
+    ```
+    -- Create index
+    CREATE INDEX idx_user_email ON users(email);
+
+    -- Analyze query
+    EXPLAIN QUERY PLAN
+    SELECT * FROM users WHERE email LIKE '%@example.com';
+
+    -- Show indexes
+    SELECT * FROM sqlite_master WHERE type='index';
+    ```
+ 3. **Data Analysis**
+
+    ```
+    -- Time series aggregation
+    SELECT
+        date(timestamp),
+        COUNT(*) as events,
+        AVG(duration) as avg_duration
+    FROM events
+    GROUP BY date(timestamp);
+
+    -- Window functions
+    SELECT *,
+        AVG(amount) OVER (
+            PARTITION BY user_id
+            ORDER BY date
+            ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
+        ) as moving_avg
+    FROM transactions;
+    ```
+
+ Tools to work with SQLite:
+
+ * [SQLiteStudio](https://sqlitestudio.pl/): Lightweight GUI
+ * [DBeaver](https://dbeaver.io/): Full-featured GUI
+ * [sqlite-utils](https://sqlite-utils.datasette.io/): CLI tool
+ * [Datasette](https://datasette.io/): Web interface
+
+ [Previous
+
+ Spreadsheet: Excel, Google Sheets](#/spreadsheets)
+
+ [Next
+
+ Version Control: Git, GitHub](#/git)
markdown_files/DevContainers__GitHub_Codespaces.md ADDED
@@ -0,0 +1,57 @@
+ ---
+ title: "DevContainers: GitHub Codespaces"
+ original_url: "https://tds.s-anand.net/#/github-codespaces?id=features-to-explore"
+ downloaded_at: "2025-06-08T23:27:28.688679"
+ ---
+
+ [IDE: GitHub Codespaces](#/github-codespaces?id=ide-github-codespaces)
+ ----------------------------------------------------------------------
+
+ [GitHub Codespaces](https://github.com/features/codespaces) is a cloud-hosted development environment built right into GitHub that gets you coding faster with pre-configured containers, adjustable compute power, and seamless integration with workflows like Actions and Copilot.
+
+ **Why Codespaces helps**
+
+ * **Reproducible onboarding**: Say goodbye to “works on my machine” woes—everyone uses the same setup for assignments or demos.
+ * **Anywhere access**: Jump back into your project from a laptop, tablet, or phone without having to reinstall anything.
+ * **Rapid experimentation & debugging**: Spin up short-lived environments on any branch, commit, or PR to isolate bugs or test features, or keep longer-lived codespaces for big projects.
+
+ [![Introduction to GitHub Codespaces (5 min)](https://i.ytimg.com/vi_webp/-tQ2nxjqP6o/sddefault.webp)](https://www.youtube.com/watch?v=-tQ2nxjqP6o)
+
+ ### [Quick Setup](#/github-codespaces?id=quick-setup)
+
+ 1. [**From the GitHub UI**](https://github.com/codespaces)
+
+    * Go to your repo and click **Code → Codespaces → New codespace**.
+    * Pick the branch and machine specs (2–32 cores, 8–64 GB RAM), then click **Create codespace**.
+ 2. [**In Visual Studio Code**](https://code.visualstudio.com/docs/remote/codespaces)
+
+    * Press `Ctrl+Shift+P` (or `Cmd+Shift+P` on Mac), choose **Codespaces: Create New Codespace**, and follow the prompts.
+ 3. [**Via GitHub CLI**](https://docs.github.com/en/codespaces/developing-in-a-codespace/using-github-codespaces-with-github-cli)
+
+    ```
+    gh auth login
+    gh codespace create --repo OWNER/REPO
+    gh codespace list    # List all codespaces
+    gh codespace code    # opens in your local VS Code
+    gh codespace ssh     # SSH into the codespace
+    ```
+
+ ### [Features To Explore](#/github-codespaces?id=features-to-explore)
+
+ * **Dev Containers**: Set up your environment the same way every time using a `devcontainer.json` or your own Dockerfile. [Introduction to dev containers](https://docs.github.com/en/codespaces/setting-up-your-project-for-codespaces/adding-a-dev-container-configuration/introduction-to-dev-containers)
+ * **Prebuilds**: Build bigger or more complex repos in advance so codespaces start up in a flash. [About prebuilds](https://docs.github.com/en/codespaces/prebuilding-your-codespaces/about-github-codespaces-prebuilds)
+ * **Port Forwarding**: Let Codespaces spot and forward the ports your web apps use automatically. [Forward ports in Codespaces](https://docs.github.com/en/codespaces/developing-in-a-codespace/forwarding-ports-in-your-codespace)
+ * **Secrets & Variables**: Keep your environment variables safe in the Codespaces settings for your repo. [Manage Codespaces secrets](https://docs.github.com/en/enterprise-cloud@latest/codespaces/managing-codespaces-for-your-organization/managing-development-environment-secrets-for-your-repository-or-organization)
+ * **Dotfiles Integration**: Bring in your dotfiles repo to customize shell settings, aliases, and tools in every codespace. [Personalizing your codespaces](https://docs.github.com/en/codespaces/setting-your-user-preferences/personalizing-github-codespaces-for-your-account)
+ * **Machine Types & Cost Control**: Pick from VMs with 2 to 32 cores and track your usage in the billing dashboard. [Managing Codespaces costs](https://docs.github.com/en/billing/managing-billing-for-github-codespaces/about-billing-for-github-codespaces)
+ * **VS Code & CLI Integration**: Flip between browser VS Code and your desktop editor, and script everything with the CLI. [VS Code Remote: Codespaces](https://code.visualstudio.com/docs/remote/codespaces)
+ * **GitHub Actions**: Power up prebuilds and your CI/CD right inside codespaces using Actions workflows. [Prebuilding your codespaces](https://docs.github.com/en/codespaces/prebuilding-your-codespaces)
+ * **Copilot in Codespaces**: Let Copilot help you write code with in-editor AI suggestions. [Copilot in Codespaces](https://docs.github.com/en/codespaces/reference/using-github-copilot-in-github-codespaces)
+
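As a concrete starting point for the **Dev Containers** bullet, a minimal `.devcontainer/devcontainer.json` might look like this (the image tag and post-create command here are illustrative choices, not a prescribed setup):

```
{
  "name": "my-project",
  "image": "mcr.microsoft.com/devcontainers/python:3.12",
  "postCreateCommand": "pip install -r requirements.txt"
}
```

With this file committed, every new codespace starts from the same container image and runs the same setup command.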
+ [Previous
+
+ Containers: Docker, Podman](#/docker)
+
+ [Next
+
+ Tunneling: ngrok](#/ngrok)
markdown_files/Editor__VS_Code.md ADDED
@@ -0,0 +1,31 @@
+ ---
+ title: "Editor: VS Code"
+ original_url: "https://tds.s-anand.net/#/vscode?id=editor-vs-code"
+ downloaded_at: "2025-06-08T23:26:42.473263"
+ ---
+
+ [Editor: VS Code](#/vscode?id=editor-vs-code)
+ ---------------------------------------------
+
+ Your editor is the most important tool in your arsenal. That’s where you’ll spend most of your time. Make sure you’re comfortable with it.
+
+ [**Visual Studio Code**](https://code.visualstudio.com/) is, *by far*, the most popular code editor today. According to the [2024 StackOverflow Survey](https://survey.stackoverflow.co/2024/technology/#1-integrated-development-environment) almost 75% of developers use it. We recommend you learn it well. Even if you use another editor, you’ll be working with others who use it, and it’s a good idea to have some exposure.
+
+ Watch these introductory videos (35 min) from the [Visual Studio Docs](https://code.visualstudio.com/docs) to get started:
+
+ * [Getting Started](https://code.visualstudio.com/docs/introvideos/basics): Set up and learn the basics of Visual Studio Code. (7 min)
+ * [Code Editing](https://code.visualstudio.com/docs/introvideos/codeediting): Learn how to edit and run code in VS Code. (3 min)
+ * [Productivity Tips](https://code.visualstudio.com/docs/introvideos/productivity): Become a VS Code power user with these productivity tips. (4 min)
+ * [Personalize](https://code.visualstudio.com/docs/introvideos/configure): Personalize VS Code to make it yours with themes. (2 min)
+ * [Extensions](https://code.visualstudio.com/docs/introvideos/extend): Add features, themes, and more to VS Code with extensions! (4 min)
+ * [Debugging](https://code.visualstudio.com/docs/introvideos/debugging): Get started with debugging in VS Code. (6 min)
+ * [Version Control](https://code.visualstudio.com/docs/introvideos/versioncontrol): Learn how to use Git version control in VS Code. (3 min)
+ * [Customize](https://code.visualstudio.com/docs/introvideos/customize): Learn how to customize your settings and keyboard shortcuts in VS Code. (6 min)
+
+ [Previous
+
+ 1. Development Tools](#/development-tools)
+
+ [Next
+
+ AI Code Editors: GitHub Copilot](#/github-copilot)
markdown_files/Embeddings.md ADDED
@@ -0,0 +1,106 @@
+ ---
+ title: "Embeddings"
+ original_url: "https://tds.s-anand.net/#/embeddings?id=openai-embeddings"
+ downloaded_at: "2025-06-08T23:27:24.487774"
+ ---
+
+ [Embeddings: OpenAI and Local Models](#/embeddings?id=embeddings-openai-and-local-models)
+ -----------------------------------------------------------------------------------------
+
+ Embedding models convert text into a list of numbers. These are like a map of text in numerical form. Each number represents a feature, and similar texts will have numbers close to each other. So, if the numbers are similar, the texts they represent mean something similar.
+
+ This is useful because text similarity is important in many common problems:
+
+ 1. **Search**. Find documents similar to a query.
+ 2. **Classification**. Classify text into categories.
+ 3. **Clustering**. Group similar items into clusters.
+ 4. **Anomaly Detection**. Find an unusual piece of text.
+
+ You can run embedding models locally or using an API. Local models are better for privacy and cost. APIs are better for scale and quality.
+
+ | Feature | Local Models | API |
+ | --- | --- | --- |
+ | **Privacy** | High | Dependent on provider |
+ | **Cost** | High setup, low after that | Pay-as-you-go |
+ | **Scale** | Limited by local resources | Easily scales with demand |
+ | **Quality** | Varies by model | Typically high |
+
+ The [Massive Text Embedding Benchmark (MTEB)](https://huggingface.co/spaces/mteb/leaderboard) provides comprehensive comparisons of embedding models. These models are compared on several parameters, but here are some key ones to look at:
+
+ 1. **Rank**. Higher ranked models have higher quality.
+ 2. **Memory Usage**. Lower is better (for similar ranks). It costs less and is faster to run.
+ 3. **Embedding Dimensions**. Lower is better. This is the length of the embedding array. Smaller dimensions are cheaper to store.
+ 4. **Max Tokens**. Higher is better. This is the number of input tokens (words) the model can take in a *single* input.
+ 5. Look for higher scores in the columns for Classification, Clustering, Summarization, etc. based on your needs.
+
+ ### [Local Embeddings](#/embeddings?id=local-embeddings)
+
+ [![Guide to Local Embeddings with Sentence Transformers](https://i.ytimg.com/vi/OATCgQtNX2o/sddefault.jpg)](https://youtu.be/OATCgQtNX2o)
+
+ Here’s a minimal example using a local embedding model:
+
+ ```
+ # /// script
+ # requires-python = "==3.12"
+ # dependencies = [
+ #     "sentence-transformers",
+ #     "httpx",
+ #     "numpy",
+ # ]
+ # ///
+
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+
+ model = SentenceTransformer('BAAI/bge-base-en-v1.5')  # A small, high quality model
+
+ async def embed(text: str) -> list[float]:
+     """Get embedding vector for text using local model."""
+     return model.encode(text).tolist()
+
+ async def get_similarity(text1: str, text2: str) -> float:
+     """Calculate cosine similarity between two texts."""
+     emb1 = np.array(await embed(text1))
+     emb2 = np.array(await embed(text2))
+     return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))
+
+ async def main():
+     print(await get_similarity("Apple", "Orange"))
+     print(await get_similarity("Apple", "Lightning"))
+
+ if __name__ == "__main__":
+     import asyncio
+     asyncio.run(main())
+ ```
+
+ Note the `get_similarity` function. It uses [Cosine Similarity](https://en.wikipedia.org/wiki/Cosine_similarity) to calculate the similarity between two embeddings.
+
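Cosine similarity is just the dot product of the two vectors divided by the product of their lengths. A dependency-free sketch on toy 2-D vectors (real embeddings have hundreds of dimensions, but the arithmetic is identical):

```
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Dot product of a and b, normalized by their Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

print(cosine([1.0, 0.0], [1.0, 0.0]))  # 1.0: same direction, maximally similar
print(cosine([1.0, 0.0], [0.0, 1.0]))  # 0.0: orthogonal, unrelated
```

The result ranges from -1 (opposite) through 0 (unrelated) to 1 (identical direction), which is why it is the default similarity measure for embeddings.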
+ ### [OpenAI Embeddings](#/embeddings?id=openai-embeddings)
+
+ For comparison, here’s how to use OpenAI’s API with direct HTTP calls. Replace the `embed` function in the earlier script:
+
+ ```
+ import os
+ import httpx
+
+ async def embed(text: str) -> list[float]:
+     """Get embedding vector for text using OpenAI's API."""
+     async with httpx.AsyncClient() as client:
+         response = await client.post(
+             "https://api.openai.com/v1/embeddings",
+             headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
+             json={"model": "text-embedding-3-small", "input": text},
+         )
+         return response.json()["data"][0]["embedding"]
+ ```
+
+ **NOTE**: You need to set the [`OPENAI_API_KEY`](https://platform.openai.com/api-keys) environment variable for this to work.
+
+ [Previous
+
+ Vision Models](#/vision-models)
+
+ [Next
+
+ Multimodal Embeddings](#/multimodal-embeddings)
markdown_files/Extracting_Audio_and_Transcripts.md ADDED
@@ -0,0 +1,298 @@
+ ---
+ title: "Extracting Audio and Transcripts"
+ original_url: "https://tds.s-anand.net/#/extracting-audio-and-transcripts?id=media-tools-yt-dlp"
+ downloaded_at: "2025-06-08T23:25:44.497461"
+ ---
+
+ [Extracting Audio and Transcripts](#/extracting-audio-and-transcripts?id=extracting-audio-and-transcripts)
+ ----------------------------------------------------------------------------------------------------------
+
+ [Media Processing: FFmpeg](#/extracting-audio-and-transcripts?id=media-processing-ffmpeg)
+ -----------------------------------------------------------------------------------------
+
+ [FFmpeg](https://ffmpeg.org/) is the standard command-line tool for processing video and audio files. It’s essential for data scientists working with media files for:
+
+ * Extracting audio/video for machine learning
+ * Converting formats for web deployment
+ * Creating visualizations and presentations
+ * Processing large media datasets
+
+ Basic Operations:
+
+ ```
+ # Basic conversion
+ ffmpeg -i input.mp4 output.avi
+
+ # Extract audio
+ ffmpeg -i input.mp4 -vn output.mp3
+
+ # Convert format without re-encoding
+ ffmpeg -i input.mkv -c copy output.mp4
+
+ # High quality encoding (crf: 0-51, lower is better)
+ ffmpeg -i input.mp4 -preset slower -crf 18 output.mp4
+ ```
+
+ Common Data Science Tasks:
+
+ ```
+ # Extract frames for computer vision
+ ffmpeg -i input.mp4 -vf "fps=1" frames_%04d.png  # 1 frame per second
+ ffmpeg -i input.mp4 -vf "select='eq(n,0)'" -vframes 1 first_frame.jpg
+
+ # Create video from image sequence
+ ffmpeg -r 1/5 -i img%03d.png -c:v libx264 -vf fps=25 output.mp4
+
+ # Extract audio for speech recognition
+ ffmpeg -i input.mp4 -ar 16000 -ac 1 audio.wav  # 16kHz mono
+
+ # Trim video/audio for training data
+ ffmpeg -ss 00:01:00 -i input.mp4 -t 00:00:30 -c copy clip.mp4
+ ```
+
+ Processing Multiple Files:
+
+ ```
+ # Concatenate videos (first create files.txt with list of files)
+ echo "file 'input1.mp4'
+ file 'input2.mp4'" > files.txt
+ ffmpeg -f concat -i files.txt -c copy output.mp4
+
+ # Batch process with shell loop
+ for f in *.mp4; do
+     ffmpeg -i "$f" -vn "audio/${f%.mp4}.wav"
+ done
+ ```
+
+ Data Analysis Features:
+
+ ```
+ # Get media file information
+ ffprobe -v quiet -print_format json -show_format -show_streams input.mp4
+
+ # Display frame metadata
+ ffprobe -v quiet -print_format json -show_frames input.mp4
+
+ # Generate video thumbnails
+ ffmpeg -i input.mp4 -vf "thumbnail" -frames:v 1 thumb.jpg
+ ```
+
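The `ffprobe ... -print_format json` output is easy to consume from Python for dataset-level analysis. A small sketch (the `sample` string is a trimmed, hypothetical ffprobe result, not real tool output):

```
import json

def stream_summary(ffprobe_json: str) -> list[dict]:
    """Pull codec type/name pairs out of `ffprobe -show_streams` JSON."""
    data = json.loads(ffprobe_json)
    return [
        {"type": s.get("codec_type"), "codec": s.get("codec_name")}
        for s in data.get("streams", [])
    ]

# Trimmed, hypothetical ffprobe output:
sample = '{"streams": [{"codec_type": "video", "codec_name": "h264"}, {"codec_type": "audio", "codec_name": "aac"}]}'
print(stream_summary(sample))
# [{'type': 'video', 'codec': 'h264'}, {'type': 'audio', 'codec': 'aac'}]
```

In practice you would feed `stream_summary` the stdout of a `subprocess.run(["ffprobe", ...])` call over each file in your dataset.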
80
+ Watch this introduction to FFmpeg (12 min):
81
+
82
+ [![FFmpeg in 12 Minutes](https://i.ytimg.com/vi_webp/MPV7JXTWPWI/sddefault.webp)](https://youtu.be/MPV7JXTWPWI)
83
+
84
+ Tools:
85
+
86
+ * [ffmpeg.lav.io](https://ffmpeg.lav.io/): Interactive command builder
87
+ * [FFmpeg Explorer](https://ffmpeg.guide/): Visual FFmpeg command generator
88
+ * [FFmpeg Buddy](https://evanhahn.github.io/ffmpeg-buddy/): Simple command generator
89
+
90
+ Tips:
91
+
92
+ 1. Use `-c copy` when possible to avoid re-encoding
93
+ 2. Monitor progress with `-progress pipe:1`
94
+ 3. Use `-hide_banner` to reduce output verbosity
95
+ 4. Test commands with small clips first
96
+ 5. Use hardware acceleration when available (-hwaccel auto)
97
+
98
+ Error Handling:
99
+
100
+ ```
101
+ # Validate file before processing
102
+ ffprobe input.mp4 2>&1 | grep "Invalid"
103
+
104
+ # Continue on errors in batch processing
105
+ ffmpeg -i input.mp4 output.mp4 -xerror
106
+
107
+ # Get detailed error information
108
+ ffmpeg -v error -i input.mp4 2>&1 | grep -A2 "Error"Copy to clipboardErrorCopied
109
+ ```
110
+
111
+
112
+
113
+ [Media tools: yt-dlp](#/extracting-audio-and-transcripts?id=media-tools-yt-dlp)
114
+ -------------------------------------------------------------------------------
115
+
116
+ [yt-dlp](https://github.com/yt-dlp/yt-dlp) is a feature-rich command-line tool for downloading audio/video from thousands of sites. It’s particularly useful for extracting audio and transcripts from videos.
117
+
118
+ Install using your package manager:
119
+
120
+ ```
121
+ # macOS
122
+ brew install yt-dlp
123
+
124
+ # Linux
125
+ curl -L https://github.com/yt-dlp/yt-dlp/releases/latest/download/yt-dlp -o ~/.local/bin/yt-dlp
126
+ chmod a+rx ~/.local/bin/yt-dlp
127
+
128
+ # Windows
129
+ winget install yt-dlp
130
+ ```
131
+
132
+ Common operations for extracting audio and transcripts:
133
+
134
+ ```
135
+ # Download audio only at lowest quality suitable for speech
136
+ yt-dlp -f "ba[abr<50]/worstaudio" \
137
+ --extract-audio \
138
+ --audio-format mp3 \
139
+ --audio-quality 32k \
140
+ "https://www.youtube.com/watch?v=VIDEO_ID"
141
+
142
+ # Download auto-generated subtitles
143
+ yt-dlp --write-auto-sub \
144
+ --skip-download \
145
+ --sub-format "srt" \
146
+ "https://www.youtube.com/watch?v=VIDEO_ID"
147
+
148
+ # Download both audio and subtitles with custom output template
149
+ yt-dlp -f "ba[abr<50]/worstaudio" \
150
+ --extract-audio \
151
+ --audio-format mp3 \
152
+ --audio-quality 32k \
153
+ --write-auto-sub \
154
+ --sub-format "srt" \
155
+ -o "%(title)s.%(ext)s" \
156
+ "https://www.youtube.com/watch?v=VIDEO_ID"
157
+
158
+ # Download entire playlist's audio
159
+ yt-dlp -f "ba[abr<50]/worstaudio" \
160
+ --extract-audio \
161
+ --audio-format mp3 \
162
+ --audio-quality 32k \
163
+ -o "%(playlist_index)s-%(title)s.%(ext)s" \
164
+ "https://www.youtube.com/playlist?list=PLAYLIST_ID"Copy to clipboardErrorCopied
165
+ ```
166
+
167
+ For Python integration:
168
+
169
+ ```
170
+ # /// script
171
+ # requires-python = ">=3.9"
172
+ # dependencies = ["yt-dlp"]
173
+ # ///
174
+
175
+ import yt_dlp
176
+
177
+ def download_audio(url: str) -> None:
178
+ """Download audio at speech-optimized quality."""
179
+ ydl_opts = {
180
+ 'format': 'ba[abr<50]/worstaudio',
181
+ 'postprocessors': [{
182
+ 'key': 'FFmpegExtractAudio',
183
+ 'preferredcodec': 'mp3',
184
+ 'preferredquality': '32'
185
+ }]
186
+ }
187
+
188
+ with yt_dlp.YoutubeDL(ydl_opts) as ydl:
189
+ ydl.download([url])
190
+
191
+ # Example usage
192
+ download_audio('https://www.youtube.com/watch?v=VIDEO_ID')
193
+ ```
194
+
195
+ Tools:
196
+
197
+ * [ffmpeg](https://ffmpeg.org/): Required for audio extraction and conversion
198
+ * [whisper](https://github.com/openai/whisper): Can be used with yt-dlp for speech-to-text
199
+ * [gallery-dl](https://github.com/mikf/gallery-dl): Alternative for image-focused sites
200
+
201
+ Note: Always respect copyright and terms of service when downloading content.
202
+
203
+ [Whisper transcription](#/extracting-audio-and-transcripts?id=whisper-transcription)
204
+ ------------------------------------------------------------------------------------
205
+
206
+ [Faster Whisper](https://github.com/SYSTRAN/faster-whisper) is a highly optimized implementation of OpenAI’s [Whisper model](https://github.com/openai/whisper), offering up to 4x faster transcription while using less memory.
207
+
208
+ You can install it via:
209
+
210
+ * `pip install faster-whisper`
211
+ * [Download Windows Standalone](https://github.com/Purfview/whisper-standalone-win/releases)
212
+
213
+ Here’s a basic usage example:
214
+
215
+ ```
216
+ faster-whisper-xxl "video.mp4" --model medium --language enCopy to clipboardErrorCopied
217
+ ```
218
+
219
+ Here’s my recommendation for transcribing videos. This saves the output in JSON as well as SRT format in the source directory.
220
+
221
+ ```
222
+ faster-whisper-xxl --print_progress --output_dir source --batch_recursive \
223
+ --check_files --standard --output_format json srt \
224
+ --model medium --language en $FILE
225
+ ```
226
+
227
+ * `--model`: The OpenAI Whisper model to use. You can choose from:
228
+ + `tiny`: Fastest but least accurate
229
+ + `base`: Good for simple audio
230
+ + `small`: Balanced speed/accuracy
231
+ + `medium`: Recommended default
232
+ + `large-v3`: Most accurate but slowest
233
+ * `--output_format`: The output format to use. You can pick multiple formats from:
234
+ + `json`: Has the most detailed information including timing, text, quality, etc.
235
+ + `srt`: A popular subtitle format. You can use this in YouTube, for example.
236
+ + `vtt`: A modern subtitle format.
237
+ + `txt`: Just the text transcript
238
+ * `--output_dir`: The directory to save the output files. `source` indicates the source directory, i.e. where the input `$FILE` is
239
+ * `--language`: The language of the input file. If you don’t specify it, it analyzes the first 30 seconds to auto-detect. You can speed it up by specifying it.
240
+
241
+ Run `faster-whisper-xxl --help` for more options.
242
+
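The SRT files Whisper produces are plain text, so they're easy to post-process. Here's a minimal sketch of a parser that turns an SRT string into `{start, end, text}` entries (it assumes well-formed blocks separated by blank lines, and skips anything malformed):

```python
def parse_srt(srt: str) -> list[dict]:
    """Parse SRT subtitle text into a list of {start, end, text} entries."""
    entries = []
    for block in srt.strip().split("\n\n"):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, _, end = lines[1].partition(" --> ")
        entries.append({"start": start.strip(), "end": end.strip(),
                        "text": " ".join(lines[2:])})
    return entries

sample = "1\n00:00:00,000 --> 00:00:02,500\nHello there\n\n2\n00:00:02,500 --> 00:00:05,000\nGeneral Kenobi"
print(parse_srt(sample))
```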
243
+ [Gemini transcription](#/extracting-audio-and-transcripts?id=gemini-transcription)
244
+ ----------------------------------------------------------------------------------
245
+
246
+ The [Gemini](https://gemini.google.com/) models from Google are notable in two ways:
247
+
248
+ 1. They have a *huge* input context window. Gemini 2.0 Flash can accept 1M tokens, for example.
249
+ 2. They can handle audio input.
250
+
251
+ This allows us to use Gemini to transcribe audio files.
252
+
253
+ LLMs are not good at transcribing audio *faithfully*. They tend to correct errors and meander from what was said. But they are intelligent. That enables a few powerful workflows. Here are some examples:
254
+
255
+ 1. **Transcribe into other languages**. Gemini will handle the transcription and translation in a single step.
256
+ 2. **Summarize audio transcripts**. For example, convert a podcast into a tutorial, or a meeting recording into actions.
257
+ 3. **Legal Proceeding Analysis**. Extract case citations, dates, and other details from a legal debate.
258
+ 4. **Medical Consultation Summary**. Extract treatments, medications, details of next visit, etc. from a medical consultation.
259
+
260
+ Here’s how to use Gemini to transcribe audio files.
261
+
262
+ 1. Get a [Gemini API key](https://aistudio.google.com/app/apikey) from Google AI Studio.
263
+ 2. Set the `GEMINI_API_KEY` environment variable to the API key.
264
+ 3. Set the `MP3_FILE` environment variable to the path of the MP3 file you want to transcribe.
265
+ 4. Run this code:
266
+
267
+ ```
268
+ curl -X POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash-002:streamGenerateContent?alt=sse \
269
+ -H "X-Goog-API-Key: $GEMINI_API_KEY" \
270
+ -H "Content-Type: application/json" \
271
+ -d "$(cat << EOF
272
+ {
273
+ "contents": [
274
+ {
275
+ "role": "user",
276
+ "parts": [
277
+ {
278
+ "inline_data": {
279
+ "mime_type": "audio/mp3",
280
+ "data": "$(base64 --wrap=0 $MP3_FILE)"
281
+ }
282
+ },
283
+ {"text": "Transcribe this"}
284
+ ]
285
+ }
286
+ ]
287
+ }
288
+ EOF
289
+ )"Copy to clipboardErrorCopied
290
+ ```
291
+
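The same request body can be built in Python, which avoids shell-quoting issues with very large base64 strings. This sketch only constructs the payload; the model name and endpoint follow the curl example above, and actually sending it requires a real `GEMINI_API_KEY`:

```python
import base64
import json

def build_transcription_payload(audio_bytes: bytes, prompt: str = "Transcribe this") -> dict:
    """Build the Gemini generateContent request body for inline MP3 audio."""
    return {
        "contents": [{
            "role": "user",
            "parts": [
                {"inline_data": {
                    "mime_type": "audio/mp3",
                    "data": base64.b64encode(audio_bytes).decode("ascii"),
                }},
                {"text": prompt},
            ],
        }]
    }

# Stand-in bytes; in practice read them with Path(MP3_FILE).read_bytes()
payload = build_transcription_payload(b"fake-mp3-bytes")
print(json.dumps(payload)[:80])
```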
292
+ [Previous
293
+
294
+ Transforming Images](#/transforming-images)
295
+
296
+ [Next
297
+
298
+ 6. Data Analysis](#/data-analysis)
markdown_files/Forecasting_with_Excel.md ADDED
@@ -0,0 +1,25 @@
1
+ ---
2
+ title: "Forecasting with Excel"
3
+ original_url: "https://tds.s-anand.net/#/forecasting-with-excel?id=forecasting-with-excel"
4
+ downloaded_at: "2025-06-08T23:26:21.478299"
5
+ ---
6
+
7
+ [Forecasting with Excel](#/forecasting-with-excel?id=forecasting-with-excel)
8
+ ----------------------------------------------------------------------------
9
+
10
+ [![Forecasting with Excel](https://i.ytimg.com/vi_webp/QrTimmxwZw4/sddefault.webp)](https://youtu.be/QrTimmxwZw4)
11
+
12
+ Here are links used in the video:
13
+
14
+ * [FORECAST reference](https://support.microsoft.com/en-us/office/forecast-and-forecast-linear-functions-50ca49c9-7b40-4892-94e4-7ad38bbeda99)
15
+ * [FORECAST.ETS reference](https://support.microsoft.com/en-us/office/forecast-ets-function-15389b8b-677e-4fbd-bd95-21d464333f41)
16
+ * [Height-weight dataset](https://docs.google.com/spreadsheets/d/1iMFVPh8q9KgnfLwBeBMmX1GaFabP02FK/view) from [Kaggle](https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset)
17
+ * [Traffic dataset](https://docs.google.com/spreadsheets/d/1w2R0fHdLG5ZGW-papaK7wzWq_-WDArKC/view) from [Kaggle](https://www.kaggle.com/datasets/fedesoriano/traffic-prediction-dataset)
18
+
19
+ [Previous
20
+
21
+ Regression with Excel](#/regression-with-excel)
22
+
23
+ [Next
24
+
25
+ Outlier Detection with Excel](#/outlier-detection-with-excel)
markdown_files/Function_Calling.md ADDED
@@ -0,0 +1,184 @@
1
+ ---
2
+ title: "Function Calling"
3
+ original_url: "https://tds.s-anand.net/#/function-calling?id=function-calling-with-openai"
4
+ downloaded_at: "2025-06-08T23:26:04.973121"
5
+ ---
6
+
7
+ [Function Calling with OpenAI](#/function-calling?id=function-calling-with-openai)
8
+ ----------------------------------------------------------------------------------
9
+
10
+ [Function Calling](https://platform.openai.com/docs/guides/function-calling) allows Large Language Models to convert natural language into structured function calls. This is perfect for building chatbots and AI assistants that need to interact with your backend systems.
11
+
12
+ OpenAI supports [Function Calling](https://platform.openai.com/docs/guides/function-calling) – a way for LLMs to suggest what functions to call and how.
13
+
14
+ [![OpenAI Function Calling - Full Beginner Tutorial](https://i.ytimg.com/vi_webp/aqdWSYWC_LI/sddefault.webp)](https://youtu.be/aqdWSYWC_LI)
15
+
16
+ Here’s a minimal example using Python and OpenAI’s function calling that identifies the weather in a given location.
17
+
18
+ ```
19
+ # /// script
20
+ # requires-python = ">=3.11"
21
+ # dependencies = [
22
+ # "httpx",
23
+ # ]
24
+ # ///
25
+
26
+ import httpx
27
+ import os
28
+ from typing import Dict, Any
29
+
30
+
31
+ def query_gpt(user_input: str, tools: list[Dict[str, Any]]) -> Dict[str, Any]:
32
+ response = httpx.post(
33
+ "https://api.openai.com/v1/chat/completions",
34
+ headers={
35
+ "Authorization": f"Bearer {os.getenv('OPENAI_API_KEY')}",
36
+ "Content-Type": "application/json",
37
+ },
38
+ json={
39
+ "model": "gpt-4o-mini",
40
+ "messages": [{"role": "user", "content": user_input}],
41
+ "tools": tools,
42
+ "tool_choice": "auto",
43
+ },
44
+ )
45
+ return response.json()["choices"][0]["message"]
46
+
47
+
48
+ WEATHER_TOOL = {
49
+ "type": "function",
50
+ "function": {
51
+ "name": "get_weather",
52
+ "description": "Get the current weather for a location",
53
+ "parameters": {
54
+ "type": "object",
55
+ "properties": {
56
+ "location": {"type": "string", "description": "City name or coordinates"}
57
+ },
58
+ "required": ["location"],
59
+ "additionalProperties": False,
60
+ },
61
+ "strict": True,
62
+ },
63
+ }
64
+
65
+ if __name__ == "__main__":
66
+ response = query_gpt("What is the weather in San Francisco?", [WEATHER_TOOL])
67
+ print([tool_call["function"] for tool_call in response["tool_calls"]])
68
+ ```
69
+
70
+ ### [How to define functions](#/function-calling?id=how-to-define-functions)
71
+
72
+ The function definition is a [JSON schema](https://json-schema.org/) with a few OpenAI specific properties.
73
+ See the [Supported schemas](https://platform.openai.com/docs/guides/structured-outputs#supported-schemas).
74
+
75
+ Here’s an example of a function definition for scheduling a meeting:
76
+
77
+ ```
78
+ MEETING_TOOL = {
79
+ "type": "function",
80
+ "function": {
81
+ "name": "schedule_meeting",
82
+ "description": "Schedule a meeting room for a specific date and time",
83
+ "parameters": {
84
+ "type": "object",
85
+ "properties": {
86
+ "date": {
87
+ "type": "string",
88
+ "description": "Meeting date in YYYY-MM-DD format"
89
+ },
90
+ "time": {
91
+ "type": "string",
92
+ "description": "Meeting time in HH:MM format"
93
+ },
94
+ "meeting_room": {
95
+ "type": "string",
96
+ "description": "Name of the meeting room"
97
+ }
98
+ },
99
+ "required": ["date", "time", "meeting_room"],
100
+ "additionalProperties": False
101
+ },
102
+ "strict": True
103
+ }
104
+ }
105
+ ```
106
+
107
+ ### [How to define multiple functions](#/function-calling?id=how-to-define-multiple-functions)
108
+
109
+ You can define multiple functions by passing a list of function definitions to the `tools` parameter.
110
+
111
+ Here’s an example of a list of function definitions for handling employee expenses and calculating performance bonuses:
112
+
113
+ ```
114
+ tools = [
115
+ {
116
+ "type": "function",
117
+ "function": {
118
+ "name": "get_expense_balance",
119
+ "description": "Get expense balance for an employee",
120
+ "parameters": {
121
+ "type": "object",
122
+ "properties": {
123
+ "employee_id": {
124
+ "type": "integer",
125
+ "description": "Employee ID number"
126
+ }
127
+ },
128
+ "required": ["employee_id"],
129
+ "additionalProperties": False
130
+ },
131
+ "strict": True
132
+ }
133
+ },
134
+ {
135
+ "type": "function",
136
+ "function": {
137
+ "name": "calculate_performance_bonus",
138
+ "description": "Calculate yearly performance bonus for an employee",
139
+ "parameters": {
140
+ "type": "object",
141
+ "properties": {
142
+ "employee_id": {
143
+ "type": "integer",
144
+ "description": "Employee ID number"
145
+ },
146
+ "current_year": {
147
+ "type": "integer",
148
+ "description": "Year to calculate bonus for"
149
+ }
150
+ },
151
+ "required": ["employee_id", "current_year"],
152
+ "additionalProperties": False
153
+ },
154
+ "strict": True
155
+ }
156
+ }
157
+ ]
158
+ ```
159
+
160
+ Best Practices:
161
+
162
+ 1. **Use Strict Mode**
163
+ * Always set `strict: True` to ensure valid function calls
164
+ * Define all required parameters
165
+ * Set `additionalProperties: False`
166
+ 2. **Use tool choice**
167
+ * Set `tool_choice: "required"` to ensure that the model will always call one or more tools
168
+ * The default is `tool_choice: "auto"` which means the model will choose a tool only if appropriate
169
+ 3. **Clear Descriptions**
170
+ * Write detailed function and parameter descriptions
171
+ * Include expected formats and units
172
+ * Mention any constraints or limitations
173
+ 4. **Error Handling**
174
+ * Validate function inputs before execution
175
+ * Return clear error messages
176
+ * Handle missing or invalid parameters
177
+
178
+ [Previous
179
+
180
+ Hybrid RAG with TypeSense](#/hybrid-rag-typesense)
181
+
182
+ [Next
183
+
184
+ LLM Agents](#/llm-agents)
markdown_files/Geospatial_Analysis_with_Excel.md ADDED
@@ -0,0 +1,33 @@
1
+ ---
2
+ title: "Geospatial Analysis with Excel"
3
+ original_url: "https://tds.s-anand.net/#/geospatial-analysis-with-excel?id=geospatial-analysis-with-excel"
4
+ downloaded_at: "2025-06-08T23:26:02.659173"
5
+ ---
6
+
7
+ [Geospatial Analysis with Excel](#/geospatial-analysis-with-excel?id=geospatial-analysis-with-excel)
8
+ ----------------------------------------------------------------------------------------------------
9
+
10
+ [![Geospatial analysis with Excel](https://i.ytimg.com/vi_webp/49LjxNvxyVs/sddefault.webp)](https://youtu.be/49LjxNvxyVs)
11
+
12
+ You’ll learn how to create a data-driven story about coffee shop coverage in Manhattan, covering:
13
+
14
+ * **Data Collection**: Collect and scrape data for coffee shop locations and census population from various sources.
15
+ * **Data Processing**: Use Python libraries like geopandas for merging population data with geographic maps.
16
+ * **Map Creation**: Generate coverage maps using tools like QGIS and Excel to visualize coffee shop distribution and population impact.
17
+ * **Visualization**: Create physical, Power BI, and video visualizations to present the data effectively.
18
+ * **Storytelling**: Craft a narrative around coffee shop competition, including strategic insights and potential market changes.
19
+
20
+ Here are links that explain how the video was made:
21
+
22
+ * [The Making of the Manhattan Coffee Kings](https://blog.gramener.com/the-making-of-manhattans-coffee-kings/)
23
+ * [Shaping and merging maps](https://blog.gramener.com/shaping-and-merging-maps/)
24
+ * [Visualizing data on 3D maps](https://blog.gramener.com/visualizing-data-on-3d-maps/)
25
+ * [Physical and digital 3D maps](https://blog.gramener.com/physical-and-digital-3d-maps/)
26
+
27
+ [Previous
28
+
29
+ Data Analysis with ChatGPT](#/data-analysis-with-chatgpt)
30
+
31
+ [Next
32
+
33
+ Geospatial Analysis with Python](#/geospatial-analysis-with-python)
markdown_files/Geospatial_Analysis_with_Python.md ADDED
@@ -0,0 +1,34 @@
1
+ ---
2
+ title: "Geospatial Analysis with Python"
3
+ original_url: "https://tds.s-anand.net/#/geospatial-analysis-with-python?id=geospatial-analysis-with-python"
4
+ downloaded_at: "2025-06-08T23:22:52.295346"
5
+ ---
6
+
7
+ [Geospatial Analysis with Python](#/geospatial-analysis-with-python?id=geospatial-analysis-with-python)
8
+ -------------------------------------------------------------------------------------------------------
9
+
10
+ [![Geospatial analysis with Python](https://i.ytimg.com/vi_webp/m_qayAJt-yE/sddefault.webp)](https://youtu.be/m_qayAJt-yE)
11
+
12
+ You’ll learn how to perform geospatial analysis for location-based decision making, covering:
13
+
14
+ * **Distance Calculation**: Compute distances between various store locations and a reference point, such as the Empire State Building.
15
+ * **Data Visualization**: Visualize store locations on a map using Python libraries like Folium.
16
+ * **Store Density Analysis**: Determine the number of stores within a specified radius.
17
+ * **Proximity Analysis**: Identify the closest and farthest stores from a specific location.
18
+ * **Decision Making**: Use geospatial data to assess whether opening a new store is feasible based on existing store distribution.
19
+
20
+ Here are links used in the video:
21
+
22
+ * [Jupyter Notebook](https://colab.research.google.com/drive/1TwKw2pQ9XKSdTUUsTq_ulw7rb-xVhays?usp=sharing)
23
+ * Learn about the [`pandas` package](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html) and [video](https://youtu.be/vmEHCJofslg)
24
+ * Learn about the [`numpy` package](https://numpy.org/doc/stable/user/whatisnumpy.html) and [video](https://youtu.be/8JfDAm9y_7s)
25
+ * Learn about the [`folium` package](https://python-visualization.github.io/folium/latest/) and [video](https://youtu.be/t9Ed5QyO7qY)
26
+ * Learn about the [`geopy` package](https://pypi.org/project/geopy/) and [video](https://youtu.be/3jj_5kVmPLs)
27
+
28
+ [Previous
29
+
30
+ Geospatial Analysis with Excel](#/geospatial-analysis-with-excel)
31
+
32
+ [Next
33
+
34
+ Geospatial Analysis with QGIS](#/geospatial-analysis-with-qgis)
markdown_files/Geospatial_Analysis_with_QGIS.md ADDED
@@ -0,0 +1,32 @@
1
+ ---
2
+ title: "Geospatial Analysis with QGIS"
3
+ original_url: "https://tds.s-anand.net/#/geospatial-analysis-with-qgis?id=geospatial-analysis-with-qgis"
4
+ downloaded_at: "2025-06-08T23:23:28.541219"
5
+ ---
6
+
7
+ [Geospatial Analysis with QGIS](#/geospatial-analysis-with-qgis?id=geospatial-analysis-with-qgis)
8
+ -------------------------------------------------------------------------------------------------
9
+
10
+ [![Geospatial analysis with QGIS](https://i.ytimg.com/vi_webp/tJhehs0o-ik/sddefault.webp)](https://youtu.be/tJhehs0o-ik)
11
+
12
+ You’ll learn how to use QGIS for geographic data processing, covering:
13
+
14
+ * **Shapefiles and KML Files**: Create and manage shapefiles and KML files for storing and analyzing geographic information.
15
+ * **Downloading QGIS**: Install QGIS on different operating systems and familiarize yourself with its interface.
16
+ * **Geospatial Data**: Access and utilize shapefiles from sources like Diva-GIS and integrate them into QGIS projects.
17
+ * **Creating Custom Shapefiles**: Learn how to create custom shapefiles when existing ones are unavailable, including creating a shapefile for South Sudan.
18
+ * **Editing and Visualization**: Use QGIS tools to edit shapefiles, add attributes, and visualize geographic data with various styling and labeling options.
19
+ * **Exporting Data**: Export shapefiles or KML files for use in other applications, such as Google Earth.
20
+
21
+ Here are links used in the video:
22
+
23
+ * [QGIS Project](https://www.qgis.org/en/site/)
24
+ * [Shapefile Data](https://www.diva-gis.org/gdata)
25
+
26
+ [Previous
27
+
28
+ Geospatial Analysis with Python](#/geospatial-analysis-with-python)
29
+
30
+ [Next
31
+
32
+ Network Analysis in Python](#/network-analysis-in-python)
markdown_files/Hybrid_RAG_with_TypeSense.md ADDED
@@ -0,0 +1,154 @@
1
+ ---
2
+ title: "Hybrid RAG with TypeSense"
3
+ original_url: "https://tds.s-anand.net/#/hybrid-rag-typesense?id=install-and-run-typesense"
4
+ downloaded_at: "2025-06-08T23:25:43.332058"
5
+ ---
6
+
7
+ [Hybrid Retrieval Augmented Generation (Hybrid RAG) with TypeSense](#/hybrid-rag-typesense?id=hybrid-retrieval-augmented-generation-hybrid-rag-with-typesense)
8
+ --------------------------------------------------------------------------------------------------------------------------------------------------------------
9
+
10
+ Hybrid RAG combines semantic (vector) search with traditional keyword search to improve retrieval accuracy and relevance. By mixing exact text matches with embedding-based similarity, you get the best of both worlds: precision when keywords are present, and semantic recall when phrasing varies. [TypeSense](https://typesense.org/) makes this easy with built-in hybrid search and automatic embedding generation.
11
+
12
+ Below is a fully self-contained Hybrid RAG tutorial using TypeSense, Python, and the command line.
13
+
14
+ ### [Install and run TypeSense](#/hybrid-rag-typesense?id=install-and-run-typesense)
15
+
16
+ [Install TypeSense](https://typesense.org/docs/guide/install-typesense.html).
17
+
18
+ ```
19
+ mkdir typesense-data
20
+
21
+ docker run -p 8108:8108 \
22
+ -v typesense-data:/data typesense/typesense:28.0 \
23
+ --data-dir /data \
24
+ --api-key=secret-key \
25
+ --enable-cors
26
+ ```
27
+
28
+ * **`docker run`**: spins up a containerized TypeSense server on port 8108
29
+ + `-p 8108:8108` maps host port to container port.
30
+ + `-v typesense-data:/data` mounts a Docker volume for persistence.
31
+ + `--data-dir /data` points TypeSense at that volume.
32
+ + `--api-key=secret-key` secures your API.
33
+ + `--enable-cors` allows browser-based requests.
34
+
35
+ **Expected output:**
36
+
37
+ * Docker logs showing TypeSense startup messages, such as `Started Typesense API server`.
38
+ * Listening on `http://0.0.0.0:8108`.
39
+
40
+ ### [Embed and import documents into TypeSense](#/hybrid-rag-typesense?id=embed-and-import-documents-into-typesense)
41
+
42
+ Follow the steps in the [RAG with the CLI](#/rag-cli) tutorial to create a `chunks.json` that has one `{id, content}` JSON object per line.
43
+
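If you don't have a `chunks.json` handy, the format is just JSON Lines: one `{id, content}` object per line. A sketch of writing and reading it (note this creates/overwrites `chunks.json` in the current directory; the sample contents are made up):

```python
import json

chunks = [
    {"id": "1", "content": "The => syntax defines an arrow function."},
    {"id": "2", "content": "TypeSense supports hybrid search."},
]

# Write one JSON object per line
with open("chunks.json", "w") as f:
    for chunk in chunks:
        f.write(json.dumps(chunk) + "\n")

# Read it back the same way addnotes.py does
with open("chunks.json") as f:
    loaded = [json.loads(line) for line in f]
print(len(loaded))
# 2
```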
44
+ [TypeSense supports automatic embedding of documents](https://typesense.org/docs/28.0/api/vector-search.html#option-b-auto-embedding-generation-within-typesense). We’ll use that capability.
45
+
46
+ Save the following as `addnotes.py` and run it with `uv run addnotes.py`.
47
+
48
+ ```
49
+ # /// script
50
+ # requires-python = ">=3.13"
51
+ # dependencies = ["httpx"]
52
+ # ///
53
+ import json
54
+ import httpx
55
+ import os
56
+
57
+ headers = {"X-TYPESENSE-API-KEY": "secret-key"}
58
+
59
+ schema = {
60
+ "name": "notes",
61
+ "fields": [
62
+ {"name": "id", "type": "string", "facet": False},
63
+ {"name": "content", "type": "string", "facet": False},
64
+ {
65
+ "name": "embedding",
66
+ "type": "float[]",
67
+ "embed": {
68
+ "from": ["content"],
69
+ "model_config": {
70
+ "model_name": "openai/text-embedding-3-small",
71
+ "api_key": os.getenv("OPENAI_API_KEY"),
72
+ },
73
+ },
74
+ },
75
+ ],
76
+ }
77
+
78
+ with open("chunks.json", "r") as f:
79
+ chunks = [json.loads(line) for line in f.readlines()]
80
+
81
+ with httpx.Client() as client:
82
+ # Create the collection
83
+ if client.get("http://localhost:8108/collections/notes", headers=headers).status_code == 404:
84
+ r = client.post("http://localhost:8108/collections", json=schema, headers=headers)
85
+
86
+ # Embed the chunks
87
+ result = client.post(
88
+ "http://localhost:8108/collections/notes/documents/import?action=emplace",
89
+ headers={**headers, "Content-Type": "text/plain"},
90
+ data="\n".join(json.dumps(chunk) for chunk in chunks),
91
+ )
92
+ print(result.text)
93
+ ```
94
+
95
+ * **`httpx.Client`**: an HTTP client for Python.
96
+ * **Collection schema**: `id` and `content` fields plus an `embedding` field with auto-generated embeddings from OpenAI.
97
+ * **Auto-embedding**: the `embed` block instructs TypeSense to call the specified model for each document.
98
+ * **`GET /collections/notes`**: checks existence.
99
+ * **`POST /collections`**: creates the collection.
100
+ * **`POST /collections/notes/documents/import?action=emplace`**: bulk upsert documents, embedding them on the fly.
101
+
102
+ **Expected output:**
103
+
104
+ * A JSON summary string like `{"success": X, "failed": 0}` indicating how many docs were imported.
105
+ * (On timeouts, re-run until all chunks are processed.)
106
+
107
+ ### [Run a hybrid search and answer a question](#/hybrid-rag-typesense?id=_4-run-a-hybrid-search-and-answer-a-question)
108
+
109
+ Now, we can use a single `curl` against the Multi-Search endpoint to combine keyword and vector search as a [hybrid search](https://typesense.org/docs/28.0/api/vector-search.html#hybrid-search):
110
+
111
+ ```
112
+ Q="What does the author affectionately call the => syntax?"
113
+
114
+ payload=$(jq -n --arg coll "notes" --arg q "$Q" \
115
+ '{
116
+ searches: [
117
+ {
118
+ collection: $coll,
119
+ q: $q,
120
+ query_by: "content,embedding",
121
+ sort_by: "_text_match:desc",
122
+ prefix: false,
123
+ exclude_fields: "embedding"
124
+ }
125
+ ]
126
+ }'
127
+ )
128
+ curl -s 'http://localhost:8108/multi_search' \
129
+ -H "X-TYPESENSE-API-KEY: secret-key" \
130
+ -d "$payload" \
131
+ | jq -r '.results[].hits[].document.content' \
132
+ | llm -s "${Q} - \$Answer ONLY from these notes. Cite verbatim from the notes." \
133
+ | uvx streamdown
134
+ ```
135
+
136
+ * **`query_by: "content,embedding"`**: tells TypeSense to score by both keyword and vector similarity.
137
+ * **`sort_by: "_text_match:desc"`**: boosts exact text hits.
138
+ * **`exclude_fields: "embedding"`**: keeps responses lightweight.
139
+ * **`curl -d`**: posts the search request.
140
+ * **`jq -r`**: extracts each hit’s `content`. See [jq manual](https://stedolan.github.io/jq/manual/)
141
+ * **`llm -s`** and **`uvx streamdown`**: generate and stream a grounded answer.
142
+
143
+ **Expected output:**
144
+
145
+ * The raw matched snippets printed first.
146
+ * Then a concise, streamed LLM answer citing the note verbatim.
147
+
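If you'd rather keep the whole pipeline in one Python script, the jq payload above can be assembled directly. This sketch only builds the request body (the field names mirror the curl example; posting it still needs a running TypeSense server):

```python
import json

def multi_search_payload(collection: str, query: str) -> dict:
    """Build a TypeSense multi_search body for hybrid keyword + vector search."""
    return {
        "searches": [{
            "collection": collection,
            "q": query,
            "query_by": "content,embedding",      # score by keyword AND vector similarity
            "sort_by": "_text_match:desc",        # boost exact text hits
            "prefix": False,
            "exclude_fields": "embedding",        # keep responses lightweight
        }]
    }

body = multi_search_payload("notes", "What does the author affectionately call the => syntax?")
print(json.dumps(body, indent=2))
```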
148
+ [Previous
149
+
150
+ RAG with the CLI](#/rag-cli)
151
+
152
+ [Next
153
+
154
+ Function Calling](#/function-calling)
markdown_files/Images__Compression.md ADDED
@@ -0,0 +1,83 @@
1
+ ---
2
+ title: "Images: Compression"
3
+ original_url: "https://tds.s-anand.net/#/image-compression?id=images-compression"
4
+ downloaded_at: "2025-06-08T23:26:15.003538"
5
+ ---
6
+
7
+ [Images: Compression](#/image-compression?id=images-compression)
8
+ ----------------------------------------------------------------
9
+
10
+ Image compression is essential when deploying apps. Often, pages have dozens of images. Image analysis runs over thousands of images. The cost of storage and bandwidth can grow over time.
11
+
12
+ Here are things you should know when you’re compressing images:
13
+
14
+ * **Image dimensions** are the width and height of the image in pixels. This impacts image size a lot
15
+ * **Lossless** compression (PNG, WebP) preserves exact data
16
+ * **Lossy** compression (JPEG, WebP) removes some data for smaller files
17
+ * **Vector** formats (SVG) scale without quality loss
18
+ * **WebP** is the modern standard, supporting both lossy and lossless
19
+
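To see why dimensions dominate file size, compare the uncompressed footprint: a raw RGB image needs width × height × 3 bytes before any codec runs, so halving both dimensions cuts the data to a quarter. A quick sketch:

```python
def raw_rgb_bytes(width: int, height: int) -> int:
    """Uncompressed size of an RGB image: 3 bytes per pixel."""
    return width * height * 3

# Halving both dimensions cuts raw size to a quarter
print(raw_rgb_bytes(1920, 1080))  # 6220800 bytes, about 5.9 MB
print(raw_rgb_bytes(960, 540))    # 1555200 bytes, about 1.5 MB
```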
20
+ Here’s a rule of thumb you can use as of 2025.
21
+
22
+ * Use SVG if you can (i.e. if it’s vector graphics or you can convert it to one)
23
+ * Else, reduce the image to as small as you can, and save as (lossy or lossless) WebP
24
+
25
+ Common operations with Python:
26
+
27
+ ```
28
+ from pathlib import Path
29
+ from PIL import Image
30
+ import io
31
+
32
+ def compress_image(input_path: Path, output_path: Path, quality: int = 85) -> None:
33
+     """Compress an image while maintaining reasonable quality."""
34
+     with Image.open(input_path) as img:
35
+         # Convert RGBA to RGB if needed
36
+         if img.mode == 'RGBA':
37
+             img = img.convert('RGB')
38
+         # Optimize for web
39
+         img.save(output_path, 'WEBP', quality=quality, optimize=True)
40
+
41
+ # Batch process images
42
+ paths = Path('images').glob('*.jpg')
43
+ for p in paths:
44
+     compress_image(p, p.with_suffix('.webp'))
45
+ ```
46
+
47
+ Command line tools include [cwebp](https://developers.google.com/speed/webp/docs/cwebp), [pngquant](https://pngquant.org/), [jpegoptim](https://github.com/tjko/jpegoptim), and [ImageMagick](https://imagemagick.org/).
48
+
49
+ ```
50
+ # Convert to WebP
51
+ cwebp -q 85 input.png -o output.webp
52
+
53
+ # Optimize PNG
54
+ pngquant --quality=65-80 image.png
55
+
56
+ # Optimize JPEG
57
+ jpegoptim --strip-all --all-progressive --max=85 image.jpg
58
+
59
+ # Convert and resize
60
+ convert input.jpg -resize 800x600 output.jpg
61
+
62
+ # Batch convert
63
+ mogrify -format webp -quality 85 *.jpg
64
+ ```
65
+
66
+ Watch this video on modern image formats and optimization (15 min):
67
+
68
+ [![Modern Image Optimization (15 min)](https://i.ytimg.com/vi_webp/F1kYBnY6mwg/sddefault.webp)](https://youtu.be/F1kYBnY6mwg)
69
+
70
+ Tools for image optimization:
71
+
72
+ * [squoosh.app](https://squoosh.app/): Browser-based compression
73
+ * [ImageOptim](https://imageoptim.com/): GUI tool for Mac
74
+ * [sharp](https://sharp.pixelplumbing.com/): Node.js image processing
75
+ * [Pillow](https://python-pillow.org/): Python imaging library
76
+
77
+ [Previous
78
+
79
+ Markdown](#/markdown)
80
+
81
+ [Next
82
+
83
+ Static hosting: GitHub Pages](#/github-pages)
markdown_files/Interactive_Notebooks__Marimo.md ADDED
@@ -0,0 +1,58 @@
---
title: "Interactive Notebooks: Marimo"
original_url: "https://tds.s-anand.net/#/marimo?id=interactive-notebooks-marimo"
downloaded_at: "2025-06-08T23:25:35.286078"
---

[Interactive Notebooks: Marimo](#/marimo?id=interactive-notebooks-marimo)
-------------------------------------------------------------------------

[Marimo](https://marimo.app/) is a new take on notebooks that solves some headaches of Jupyter. It runs cells reactively - when you change one cell, all dependent cells update automatically, just like a spreadsheet.

Marimo’s cells can’t be run out of order. This makes Marimo more reproducible and easier to debug, but requires a mental shift from the Jupyter/Colab way of working.

It also runs Python directly in the browser and is quite interactive. [Browse the gallery of examples](https://marimo.io/gallery). With a wide variety of interactive widgets, it’s growing popular as an alternative to Streamlit for building data science web apps.

Common Operations:

```
# Create new notebook
uvx marimo new

# Run notebook server
uvx marimo edit notebook.py

# Export to HTML
uvx marimo export html notebook.py -o notebook.html
```

Best Practices:

1. **Cell Dependencies**

   * Keep cells focused and atomic
   * Use clear variable names
   * Document data flow between cells
2. **Interactive Elements**

   ```
   # Add interactive widgets (assumes `import marimo as mo` in an earlier cell)
   slider = mo.ui.slider(1, 100)
   # Create dynamic Markdown
   mo.md(f"{slider} {'🟢' * slider.value}")
   ```
3. **Version Control**

   * Keep notebooks as Python files
   * Use Git to track changes
   * Publish on [marimo.app](https://marimo.app/) for collaboration

[!["marimo: an open-source reactive notebook for Python" - Akshay Agrawal (Nbpy2024)](https://i.ytimg.com/vi_webp/9R2cQygaoxQ/sddefault.webp)](https://youtu.be/9R2cQygaoxQ)

[Previous: Narratives with LLMs](#/narratives-with-llms)

[Next: HTML Slides: RevealJS](#/revealjs)
markdown_files/JSON.md ADDED
@@ -0,0 +1,8 @@
---
title: "JSON"
original_url: "https://tds.s-anand.net/#/revealjs"
downloaded_at: "2025-06-08T23:25:13.176607"
---

404 - Not found
===============
markdown_files/JavaScript_tools__npx.md ADDED
@@ -0,0 +1,47 @@
---
title: "JavaScript tools: npx"
original_url: "https://tds.s-anand.net/#/npx?id=javascript-tools-npx"
downloaded_at: "2025-06-08T23:21:38.208039"
---

[JavaScript tools: npx](#/npx?id=javascript-tools-npx)
------------------------------------------------------

[npx](https://docs.npmjs.com/cli/v8/commands/npx) is a command-line tool that comes with npm (Node Package Manager) and allows you to execute npm package binaries and run one-off commands without installing them globally. It’s essential for modern JavaScript development and data science workflows.

For data scientists, npx is useful when:

* Running JavaScript-based data visualization tools
* Converting notebooks and documents
* Testing and formatting code
* Running development servers

Here are common npx commands:

```
# Run a package without installing
npx http-server .              # Start a local web server
npx prettier --write .         # Format code or docs
npx eslint .                   # Lint JavaScript
npx ts-node script.ts          # Run TypeScript directly
npx esbuild app.js             # Bundle JavaScript
npx jsdoc .                    # Generate JavaScript docs

# Run specific versions
npx prettier@3.2 --write .     # Use prettier 3.2

# Execute remote scripts (use with caution!)
npx github:user/repo           # Run from GitHub
```
36
+
37
+ Watch this introduction to npx (6 min):
38
+
39
+ [![What you can do with npx (6 min)](https://i.ytimg.com/vi_webp/55WaAoZV_tQ/sddefault.webp)](https://youtu.be/55WaAoZV_tQ)
40
+
41
+ [Previous
42
+
43
+ Python tools: uv](#/uv)
44
+
45
+ [Next
46
+
47
+ Unicode](#/unicode)
markdown_files/LLM_Agents.md ADDED
@@ -0,0 +1,123 @@
---
title: "LLM Agents"
original_url: "https://tds.s-anand.net/#/llm-agents?id=command-line-agent-example"
downloaded_at: "2025-06-08T23:25:53.665479"
---

[LLM Agents: Building AI Systems That Can Think and Act](#/llm-agents?id=llm-agents-building-ai-systems-that-can-think-and-act)
-------------------------------------------------------------------------------------------------------------------------------

LLM Agents are AI systems that can define and execute their own workflows to accomplish tasks. Unlike simple prompt-response patterns, agents make multiple LLM calls, use tools, and adapt their approach based on intermediate results. They represent a significant step toward more autonomous AI systems.

[![Building LLM Agents with LangChain (13 min)](https://i.ytimg.com/vi_webp/DWUdGhRrv2c/sddefault.webp)](https://youtu.be/DWUdGhRrv2c)

### [What Makes an Agent?](#/llm-agents?id=what-makes-an-agent)

An LLM agent consists of three core components:

1. **LLM Brain**: Makes decisions about what to do next
2. **Tools**: External capabilities the agent can use (e.g., web search, code execution)
3. **Memory**: Retains context across multiple steps

Agents operate through a loop:

* Observe the environment
* Think about what to do
* Take action using tools
* Observe results
* Repeat until task completion
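
That loop can be sketched in a dozen lines of Python. Everything here is a stand-in: `call_llm` stubs what would be a real LLM API call, and `TOOLS` holds a toy tool registry:

```python
# A minimal sketch of the agent loop. call_llm and TOOLS are
# hypothetical stand-ins for a real LLM API and real tools.
def call_llm(history: str) -> str:
    # Stub: "decide" to call the echo tool once, then finish.
    if "Observation:" in history:
        return "FINISH: task complete"
    return "echo hello"

TOOLS = {"echo": lambda arg: arg}

def run_agent(task: str, max_steps: int = 5) -> str:
    history = [f"Task: {task}"]                   # memory
    for _ in range(max_steps):
        decision = call_llm("\n".join(history))   # think
        if decision.startswith("FINISH:"):        # task complete?
            return decision.removeprefix("FINISH:").strip()
        tool, _, arg = decision.partition(" ")    # act via a tool
        result = TOOLS.get(tool, lambda a: f"unknown tool: {tool}")(arg)
        history.append(f"Observation: {result}")  # observe the result
    return "gave up after max_steps"

print(run_agent("say hello"))  # → task complete
```

A production agent swaps the stub for a real chat-completion call and adds retries and permission checks around tool execution.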

### [Command-Line Agent Example](#/llm-agents?id=command-line-agent-example)

We’ve created a minimal command-line agent called [`llm-cmd-agent.py`](llm-cmd-agent.py) that:

1. Takes a task description from the command line
2. Generates code to accomplish the task
3. Automatically extracts and executes the code
4. Passes the results back to the LLM
5. Provides a final answer or tries again if the execution fails

Here’s how it works:

```
uv run llm-cmd-agent.py "list all Python files under the current directory, recursively, by size"
uv run llm-cmd-agent.py "convert the largest Markdown file to HTML"
```

The agent will:

1. Generate a shell script to list files with their sizes
2. Execute the script in a subprocess
3. Capture the output (stdout and stderr)
4. Pass the output back to the LLM for interpretation
5. Present a final answer to the user

Under the hood, the agent follows this workflow:

1. Initial prompt to generate a shell script
2. Code extraction from the LLM response
3. Code execution in a subprocess
4. Result interpretation by the LLM
5. Error handling and retry logic if needed

This demonstrates the core agent loop of:

* Planning (generating code)
* Execution (running the code)
* Reflection (interpreting results)
* Adaptation (fixing errors if needed)
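
Steps 2 and 3 of that workflow (pulling the fenced code block out of the reply, then running it in a subprocess) might look like this sketch; it is not the actual `llm-cmd-agent.py` source, and a POSIX `sh` is assumed:

```python
import re
import subprocess

def extract_code(reply: str) -> str:
    # Pull the first fenced code block out of the LLM reply, if any.
    match = re.search(r"`{3}(?:\w+)?\n(.*?)`{3}", reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

def run_script(script: str) -> tuple[int, str, str]:
    # Run the generated script via sh, capturing stdout and stderr
    # so both can be fed back to the LLM for interpretation.
    proc = subprocess.run(["sh", "-c", script], capture_output=True, text=True)
    return proc.returncode, proc.stdout, proc.stderr
```

Returning stderr alongside stdout is what lets the retry logic show the LLM exactly what failed.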

### [Agent Architectures](#/llm-agents?id=agent-architectures)

Different agent architectures exist for different use cases:

1. **ReAct** (Reasoning + Acting): Interleaves reasoning steps with actions
2. **Reflexion**: Adds self-reflection to improve reasoning
3. **MRKL** (Modular Reasoning, Knowledge and Language): Combines neural and symbolic modules
4. **Plan-and-Execute**: Creates a plan first, then executes steps

### [Real-World Applications](#/llm-agents?id=real-world-applications)

LLM agents can be applied to various domains:

1. **Research assistants** that search, summarize, and synthesize information
2. **Coding assistants** that write, debug, and explain code
3. **Data analysis agents** that clean, visualize, and interpret data
4. **Customer service agents** that handle queries and perform actions
5. **Personal assistants** that manage schedules, emails, and tasks

### [Project Ideas](#/llm-agents?id=project-ideas)

Here are some practical agent projects you could build:

1. **Study buddy agent**: Helps create flashcards, generates practice questions, and explains concepts
2. **Job application assistant**: Searches job listings, tailors resumes, and prepares interview responses
3. **Personal finance agent**: Categorizes expenses, suggests budgets, and identifies savings opportunities
4. **Health and fitness coach**: Creates workout plans, tracks nutrition, and provides motivation
5. **Course project helper**: Breaks down assignments, suggests resources, and reviews work

### [Best Practices](#/llm-agents?id=best-practices)

1. **Clear instructions**: Define the agent’s capabilities and limitations
2. **Effective tool design**: Create tools that are specific and reliable
3. **Robust error handling**: Agents should recover gracefully from failures
4. **Memory management**: Balance context retention with token efficiency
5. **User feedback**: Allow users to correct or guide the agent

### [Limitations and Challenges](#/llm-agents?id=limitations-and-challenges)

Current LLM agents face several challenges:

1. **Hallucination**: Agents may generate false information or tool calls
2. **Planning limitations**: Complex tasks require better planning capabilities
3. **Tool integration complexity**: Each new tool adds implementation overhead
4. **Context window constraints**: Limited memory for long-running tasks
5. **Security concerns**: Tool access requires careful permission management

[Previous: Function Calling](#/function-calling)

[Next: LLM Image Generation](#/llm-image-generation)